A running program is a process: an address space, open file descriptors, environment variables, and a place in the scheduler's queue. Shells, containers, and systemd all orchestrate the same primitives: fork, exec, wait, and signals.
The kernel assigns each process a unique PID, tracks parent/child relationships (PPID), and switches the CPU between processes in time slices.
| Concept | Meaning |
|---|---|
| PID | Process ID—integer handle for signals, kill, /proc/<pid> |
| PPID | Parent PID; shell is parent of commands it launches |
| UID / GID | Effective user/group for permission checks |
| Thread | Lightweight unit within a process; shares address space |
| Session / process group | Job control grouping; terminal sends signals to the foreground group |
# Current shell's PID; parent of commands you run
echo $$
# Process tree — who spawned whom
ps -ef --forest | head -20
pstree -p $$
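To see these IDs in practice, the sketch below (assuming a POSIX-style shell such as bash; the variable names are illustrative) compares the shell's own PID with the PID of a command it launches.

```shell
# $$ is the PID of the current shell.
shell_pid=$$

# A launched command is a separate process with its own PID.
child_pid=$(sh -c 'echo $$')

# Inside a subshell, $$ still reports the original shell's PID.
subshell_view=$(echo $$)

echo "shell=$shell_pid child=$child_pid subshell sees=$subshell_view"
```

The child PID differs from the shell's, while the subshell reports the same value for `$$` as its parent.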
Unix does not spawn a fresh process from scratch for each command. It clones the parent, then optionally replaces the child's memory with a new program.
PARENT (bash) CHILD
│ │
│ fork() │
├───────────────────────────────────►│ copy of bash, new PID
│ │
│ (parent continues) │ execve("/bin/ls", ...)
│ ├────► now running ls
│ wait() blocks │ runs, writes stdout
│ │ exit(0)
│◄───────────────────────────────────┤ status available
│ wait returns, shell prompt │ (zombie until wait)
▼ ▼
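The same fork, exec, wait sequence can be observed from a script. This is a minimal sketch, assuming a POSIX-style shell; `sh -c 'exit 3'` is just a stand-in child that exits with a known code.

```shell
# "&" makes the shell fork a child, which then execs sh.
sh -c 'exit 3' &
child_pid=$!            # PID of the child just started

# wait blocks until that child exits, reaps it, and returns its exit code.
child_status=0
wait "$child_pid" || child_status=$?
echo "child $child_pid exited with status $child_status"
```

Because the shell calls wait, the child does not linger as a zombie once its status has been collected.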
Exit codes matter in scripts and CI: 0 = success, non-zero = failure. Shell variable $? holds the last foreground command's exit code.
grep -q "error" /var/log/app.log
echo $? # 0 if found, 1 if not, 2 on error (e.g., missing file)
false; echo $? # 1
true; echo $? # 0
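# Make a pipeline fail if any stage fails, not just the last one (bash)
set -o pipefail
# Abort the script when the health check fails (-f: HTTP errors give a non-zero exit)
curl -f https://api.example.com/health || exit 1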
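The effect of pipefail is easiest to see side by side. The sketch below assumes bash and uses `false | true` as a stand-in for a pipeline whose first stage fails.

```shell
# Default: a pipeline's status is the status of its last command.
set +o pipefail
status_without=0
false | true || status_without=$?

# With pipefail: the pipeline fails if any stage fails.
set -o pipefail
status_with=0
false | true || status_with=$?

echo "without pipefail: $status_without, with pipefail: $status_with"
```

Without the option, the failure of the first stage is silently masked by the successful last stage.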
See Architecture → fork/exec examples for C-level syscalls and strace.
In ps output, the STAT column abbreviates state. Knowing them explains "why is my process stuck?" and "why are there defunct entries?"
Running / runnable (R)
On CPU or ready to run. High CPU here means hot loops or heavy compute, which is not always "healthy."
Sleeping (S, D)
Waiting for an event (disk, network, lock). Normal for idle servers and blocked I/O. D is uninterruptible sleep (often I/O) and is harder to kill.
Stopped (T)
Paused by SIGSTOP, SIGTSTP (Ctrl+Z), or a debugger. Resume with fg or kill -CONT.
Zombie (Z)
Exited but the parent has not waited. Consumes a PID and a process-table entry, but almost no memory. Fix the parent or restart the service.
# STAT column: state + optional flags (+ foreground group, < high priority, etc.)
ps aux | awk '$8 ~ /Z/ {print}' # zombies only
# Full state for one PID
cat /proc/1234/status | grep -E '^(Name|State|PPid):'
Pro Tip: A process in D state (uninterruptible sleep) often means stuck NFS or failing disk. Reboot may be the only fix—killing won't work until the kernel completes the I/O.
Each terminal is the controlling terminal of one session. Within that session, one process group runs in the foreground (it receives keyboard input and signals such as Ctrl+C); the others run in the background.
| Mechanism | Behavior |
|---|---|
| command & | Start in background; shell does not wait; job number printed |
| Ctrl+Z | Suspend foreground job (SIGTSTP) |
| fg / bg | Resume job in foreground or background |
| jobs | List shell's active/stopped jobs |
| nohup cmd & | Ignore hangup; keep running after SSH disconnect (stdout/stderr to nohup.out) |
| disown | Remove job from shell's table so hangup does not kill it |
# Long build in background while you keep using the shell
npm run build > build.log 2>&1 &
jobs -l
fg %1 # bring job 1 to foreground
# Accidentally started in foreground — suspend and background
# Ctrl+Z
bg
disown -h %1 # survive terminal close
# Better than nohup for servers: systemd, tmux, or screen
tmux new -s deploy
# ... run deploy; detach with Ctrl+b d
Pro Tip: For remote work, prefer tmux or screen over raw nohup: you get reattachable sessions and scrollback.
Signals are asynchronous notifications to a process. Default actions: terminate, stop, continue, or ignore.
Well-behaved daemons trap SIGTERM for graceful shutdown; SIGKILL cannot be caught.
| Signal | Default | Typical use |
|---|---|---|
| SIGHUP (1) | Terminate | Terminal disconnected; daemons reload config on HUP (nginx -s reload) |
| SIGINT (2) | Terminate | Ctrl+C in terminal |
| SIGQUIT (3) | Terminate + core dump | Ctrl+\ (debug crashes) |
| SIGKILL (9) | Terminate (forced) | Last resort; process cannot handle or block it |
| SIGTERM (15) | Terminate | Polite shutdown; default for systemctl stop and docker stop |
| SIGTSTP (20) | Stop | Ctrl+Z job control |
| SIGSTOP (19) | Stop | Cannot be ignored; debugger pause |
| SIGCONT (18) | Continue | Resume after stop |
| SIGCHLD (17) | Ignore | Parent notified when child exits (handled by libc/shell) |
# Graceful then force (same as many orchestrators)
kill -TERM 1234
sleep 5
kill -0 1234 2>/dev/null && kill -KILL 1234
# By name — sends SIGTERM by default
pkill -f "gunicorn master"
killall nginx
# List signal names
kill -l
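A daemon that traps SIGTERM for graceful shutdown can be imitated in a few lines of shell. This is only a sketch: `worker` is a hypothetical function standing in for a real service, and the cleanup step is reduced to a message.

```shell
# Hypothetical worker that shuts down cleanly on SIGTERM.
worker() {
  trap 'echo "worker: got SIGTERM, cleaning up"; exit 0' TERM
  while true; do
    sleep 0.1          # short sleeps so the trap runs promptly
  done
}

worker &
worker_pid=$!
sleep 0.5              # give the worker time to install its trap

kill -TERM "$worker_pid"
worker_status=0
wait "$worker_pid" || worker_status=$?
echo "worker exit status: $worker_status"
```

The shell runs the trap only after the current `sleep` returns, which is why the loop sleeps in short intervals rather than one long one.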
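Warning: kill -9 / SIGKILL skips cleanup and can leave corrupt DB pages, half-written files, and orphaned locks. Use SIGTERM first and wait. Never aim signals at PID 1: Linux ignores SIGKILL sent to init from inside its own PID namespace, but killing a container's PID 1 from the host stops the whole container. Do not experiment on production.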
systemctl stop unit sends SIGTERM, waits TimeoutStopSec, then SIGKILL.
Reload often maps to SIGHUP (systemctl reload nginx).
systemctl show nginx -p KillSignal -p RestartKillSignal
journalctl -u nginx -b --no-pager | tail -20
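The stop sequence described above (SIGTERM, a timeout, then SIGKILL) can be approximated in plain shell. `stop_with_timeout` is a hypothetical helper written for this sketch, not a standard command, and the demo child deliberately ignores SIGTERM so the escalation is visible.

```shell
# stop_with_timeout PID TRIES: hypothetical helper, not a standard command.
# Sends SIGTERM, polls up to TRIES times (0.1 s apart), then sends SIGKILL.
stop_with_timeout() {
  target_pid=$1
  tries=$2
  kill -TERM "$target_pid" 2>/dev/null || return 0   # already gone
  while [ "$tries" -gt 0 ]; do
    kill -0 "$target_pid" 2>/dev/null || return 0    # exited in time
    sleep 0.1
    tries=$((tries - 1))
  done
  kill -KILL "$target_pid" 2>/dev/null || true
}

# Demo: a child that ignores SIGTERM, so only SIGKILL can stop it.
( trap '' TERM; while true; do sleep 0.1; done ) &
stubborn_pid=$!
sleep 0.3                      # let the child install its trap
stop_with_timeout "$stubborn_pid" 5

stubborn_status=0
wait "$stubborn_pid" || stubborn_status=$?
echo "exit status: $stubborn_status"   # 128 + signal number for a killed process
```

A status of 137 (128 + 9) in logs or CI output is therefore a strong hint that a process was ended by SIGKILL rather than exiting on its own.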
Day-to-day tools map directly to kernel process structures and cgroup limits.
ps
Snapshot of processes. aux for BSD style, -ef for POSIX. Filter with grep or -C comm.
ps aux --sort=-%mem | head
ps -p 1234 -o pid,ppid,cmd,%cpu,%mem
top / htop
Live view sorted by CPU or memory. htop adds tree view and mouse support, which makes it better for interactive triage.
top -o %CPU
htop -u deploy
/proc/<pid>
Per-PID virtual files: cmdline, environ, fd, limits, cgroup.
cat /proc/1234/cmdline | tr '\0' ' '
ls -l /proc/1234/fd
nice / renice
Lower priority (higher nice value) so batch jobs do not starve interactive work.
nice -n 10 tar czf backup.tgz /data
renice +5 -p 1234
ulimit
Per-shell resource limits (open files, core size). systemd sets limits per unit too.
ulimit -n
ulimit -a
# What's listening on port 8080?
ss -tlnp | grep 8080
# or
lsof -i :8080
# Thread count for a Java service
ps -o nlwp= -p $(pgrep -f 'java.*myapp')
# OOM score — who gets killed first when RAM is gone
cat /proc/1234/oom_score_adj # -1000 to 1000
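Because ulimit settings are per-shell and inherited by children, a limit can be tightened for one command without touching the parent. A minimal sketch, assuming the current hard limit for open files is at least 64 (the value 64 is arbitrary):

```shell
# Soft limit: what the process is held to now. Hard limit: the ceiling
# an unprivileged process may raise its soft limit to.
echo "soft open-files limit: $(ulimit -Sn)"
echo "hard open-files limit: $(ulimit -Hn)"

# Command substitution runs in a subshell, so the lowered limit
# applies only there; the parent shell keeps its own limit.
limited=$(ulimit -n 64; ulimit -n)
echo "limit inside subshell: $limited"
```

The same pattern, a subshell that lowers a limit and then runs the workload, is a cheap way to test how a program behaves when it runs out of file descriptors.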
Misunderstanding parent/child lifecycle causes mystery PIDs and runaway process tables.
Zombie
Child exited; the parent never called wait. The zombie entry stays until the parent reaps it or exits.
Fix: restart the parent service, or patch the buggy daemon to reap its children (handle SIGCHLD and call wait).
Orphan
Parent died first; the child is adopted by PID 1 (systemd/init), which normally reaps it when it exits. Long-running orphans under PID 1 are fine; zombies under a broken parent are not.
Warning: A script that forks in a loop without exiting can exhaust PIDs and freeze the host. cgroup pids.max and ulimit -u limit damage—configure them on shared CI runners.
# Find zombies and their parents
ps -eo stat,pid,ppid,cmd | awk '$1 ~ /Z/ {print}'
# Parent of zombie PID 9999
ps -o ppid= -p 9999 | xargs ps -p
# Identify PID 1 (systemd, init, or a container entrypoint); never target it with kill -9
ps -p 1 -o comm=
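A zombie can be produced on purpose to practice spotting one. The sketch below is Linux-only (it reads /proc) and relies on a trick: the subshell starts a short-lived child and then replaces itself with `sleep`, which never calls wait. All names are illustrative.

```shell
# The subshell forks "sleep 0.2", then execs "sleep 10" in its place.
# sleep 10 is now the parent and never reaps, so the child becomes a zombie.
( sleep 0.2 & exec sleep 10 ) &
parent_pid=$!
sleep 1                        # give the short-lived child time to exit

zombie_count=0
for status_file in /proc/[0-9]*/status; do
  # Processes can vanish mid-scan; skip files that can no longer be read.
  info=$(awk '/^State:/ {s=$2} /^PPid:/ {p=$2} END {print s, p}' \
         "$status_file" 2>/dev/null) || continue
  if [ "$info" = "Z $parent_pid" ]; then
    zombie_count=$((zombie_count + 1))
  fi
done
echo "zombies under PID $parent_pid: $zombie_count"

# Ending the parent lets the zombie be re-parented and reaped.
kill "$parent_pid"
wait "$parent_pid" 2>/dev/null || true
```

Killing the zombie itself would do nothing; only the parent reaping it, or the parent going away, clears the entry.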
Pro Tip: In Kubernetes, pods stuck in Terminating often run a process that ignores SIGTERM; the kubelet sends SIGKILL once terminationGracePeriodSeconds expires. Same semantics as bare metal, different tooling.