§ 18.1-18.28 Processes, pthreads, Sync, and Advanced Concurrency
User-space process control from clone flags and COW page tables through pthread lifecycle, synchronization primitives, bounded queues, thread pools, prefork servers, epoll workers, deadlock prevention, PI futexes, shared-memory synchronization, and coroutine models.
1. Overview
Linux exposes processes, threads, and coroutines as different points on the same sharing spectrum: a process is isolated by page tables, a thread is a kernel-scheduled task sharing the process resources, and a coroutine is a user-space control-flow object inside one thread. The kernel center of gravity is copy_process(), which creates a task_struct and decides what to share based on clone flags.
Process vs Thread vs Coroutine
A process buys fault isolation and independent VM; a pthread buys cheap sharing and kernel preemption; a coroutine buys extremely cheap cooperative switching but depends on the current thread to yield. The model below shows what is actually separate.
| Model | Scheduler | Isolation | Typical Switch Cost | Memory Cost |
|---|---|---|---|---|
process | Kernel scheduler | Separate VM, FD table copied unless shared explicitly | Microseconds; may include TLB effects | Page tables, VMAs, copied metadata, private stacks |
thread | Kernel scheduler | Shares VM, heap, most process resources | Usually lower than process switch | One user stack, TLS, and kernel task per thread |
coroutine | User-space event loop or runtime | No memory isolation inside the process | Tens to hundreds of nanoseconds in simple runtimes | Saved registers plus stack or frame state |
2. Key Data Structures
The main structures are not just C structs: they are ownership boundaries. task_struct is the schedulable entity, mm_struct owns the address space, files_struct owns descriptor slots, and ucontext_t is a user-space snapshot for coroutine switching.
| Structure | Important Fields | Purpose | Shared By |
|---|---|---|---|
task_struct | pid, tgid, mm, files, signal | One kernel scheduling object per process thread. | Never shared; each thread has its own. |
mm_struct | mmap, pgd, mm_users | Address space, VMA list/tree, and top page-table pointer. | Threads share it with CLONE_VM. |
files_struct | fdt, close_on_exec, open_fds | Maps integer FDs to open file descriptions. | Pthreads share it with CLONE_FILES; fork copies it. |
vm_area_struct | vm_start, vm_end, vm_flags, vm_file | One mapped region: text, heap, stack, mmap file, shared memory. | Referenced through mm_struct. |
ucontext_t | saved registers, stack pointer, signal mask, successor context | User-space coroutine context; no kernel task is created. | Owned by the coroutine runtime. |
3. Core Mechanism
fork and Copy-On-Write
Background: A shell often forks only to immediately exec a new program. Copying every page of a large parent would be wasteful, so Linux copies metadata and shares physical pages until one side writes.
The COW state is visible in the page tables: both parent and child map the same physical page with write permission removed, even though the VMA still says the mapping is logically writable.
Plan: create a child task, duplicate resource tables, mark writable PTEs read-only, and let the page-fault handler allocate a private page only if a write occurs.
| Step | Parent State | Child State | Kernel Reason |
|---|---|---|---|
| Before fork | Writable PTE points at page A | Does not exist | Normal private anonymous memory. |
| After fork | Read-only PTE points at page A | Read-only PTE points at page A | Sharing makes fork proportional to metadata, not resident memory. |
| Child writes | Still maps page A | Faults, gets copied page B | Private semantics are restored lazily. |
| After fault | Sees original bytes | Sees modified bytes | Each process now has the version it should observe. |
The fork-plus-threads trap
In a multi-threaded process, only the calling thread exists in the child after fork(). Any lock held by a vanished sibling may remain permanently locked in the child process copied memory.
fork, vfork, clone, and exec
fork(), vfork(), and clone() differ mostly in sharing policy. execve() is different: it keeps the task identity but replaces the user-space image.
| Call | Address Space | FD Table | Parent Behavior | Use Case |
|---|---|---|---|---|
fork | Copied metadata, COW physical pages | Copied slots pointing at same open file descriptions | Runs immediately | General process creation. |
vfork | Temporarily shared with parent | Shared until exec or exit | Suspended | Immediate exec in constrained environments. |
clone thread | Shared via CLONE_VM | Shared via CLONE_FILES | Runs immediately | POSIX thread implementation. |
clone namespace | Usually separate like fork | Policy-dependent | Runs immediately | Container bootstrap. |
A typical shell path is fork, adjust descriptors in the child, then exec. The close-on-exec bit is what prevents helper pipes, secrets, and listening sockets from leaking into the new image.
The descriptor table itself survives fork, but execve() consults the close_on_exec bitmap and closes marked descriptors before the new program starts.
| Wrapper | Arguments | PATH Search | Environment |
|---|---|---|---|
execl | List | No | Inherit |
execlp | List | Yes | Inherit |
execle | List | No | Explicit |
execv | Vector | No | Inherit |
execvp | Vector | Yes | Inherit |
execve | Vector | No | Explicit; actual syscall |
4. Minimal C Demo
5. Kernel Source Pointers
kernel/fork.c:kernel_clone,copy_process,copy_mm,copy_files.mm/memory.c:do_cow_fault,wp_page_copy.fs/exec.c:do_execveat_common,begin_new_exec.fs/file.c: descriptor table growth,close_on_execbitmap handling.arch/x86/entry/syscalls/syscall_64.tbl: syscall numbers forclone,fork,vfork,execve.
6. Interview Prep
| Question | Concise Answer |
|---|---|
Why is fork() fast even for a 20 GB process? | Linux copies page-table metadata and marks writable pages COW. Physical pages are copied only on first write. |
What is shared after fork()? | FD entries point to the same open file descriptions, so file offsets and status flags are shared; address-space pages are COW, not independently copied. |
Why is fork() dangerous in a multi-threaded program? | The child keeps only the calling thread but inherits memory containing locks held by vanished threads. Before exec, call only async-signal-safe functions or use atfork carefully. |
| What makes a pthread a thread rather than a process? | NPTL uses clone flags such as CLONE_VM, CLONE_FILES, CLONE_SIGHAND, and CLONE_THREAD. |
Why use O_CLOEXEC instead of open then fcntl? | It is atomic. In a multi-threaded process, another thread could fork and exec between open and fcntl, leaking the FD. |
7. §18.5-18.9 Process Lifecycle, Daemons, Job Control, Limits, and Privilege
After creation and exec, the operating-system story becomes lifecycle management: tasks move between runnable, sleeping, stopped, and zombie states; parents reap children; sessions and process groups control terminal signals; limits constrain damage; and UID/capability state decides which privileged operations are allowed.
8. Key Data Structures
Process state is a compact kernel-visible state machine. A zombie is not a running program; most resources are already gone, but the kernel keeps enough status for the parent to collect a reliable result.
| Structure / API | Important Fields | Purpose | Interview Detail |
|---|---|---|---|
task_struct | state, exit_state, exit_code, real_parent | Tracks scheduling state, parentage, and exit status until reaped. | Zombie state retains PID and status, not the full address space. |
signal_struct | pids, rlim, tty, flags | Thread-group-wide signal, limit, and session-related state. | Limits such as RLIMIT_NOFILE live with signal-wide process state. |
struct rlimit | rlim_cur, rlim_max | Soft and hard resource ceilings inherited across fork and exec. | A non-privileged process can lower hard limits but cannot raise them back. |
cred | uid, euid, suid, fsuid, capability sets | Immutable credential snapshot referenced by permission checks. | Effective UID and effective capabilities are what most checks consult. |
Terminal job control is a hierarchy: a session owns the controlling terminal, the terminal designates one foreground process group, and terminal-generated signals are delivered to that foreground group.
Resource limits are inherited defaults with two ceilings. The soft limit is enforced now; the hard limit bounds how far an unprivileged process can raise the soft limit later.
9. Core Mechanism
Background
A parent starts a worker, the worker exits with status 42, and the parent is busy doing other work. The kernel cannot throw the status away because the parent may later call waitpid(), so the child becomes a zombie until reaped.
Plan
- On exit, release expensive resources such as VM mappings, open files, and userspace memory.
- Retain a small process record containing PID, parent link, signal state, and exit status.
- Notify the parent with
SIGCHLDunless the disposition says otherwise. - When the parent calls
waitpidorwaitid, copy the status out and free the final task record.
Walkthrough
Initial state: parent PID 900 forks child PID 901. PID 901 exits with _exit(42); ps shows state Z because PID 900 has not reaped it. When PID 900 runs waitpid(901, &status, 0), the kernel returns 901, WEXITSTATUS(status) is 42, and PID 901 disappears from the process table.
Daemonization
A classic daemon detaches from the launching terminal by forking, creating a new session, forking again, resetting process environment details, and redirecting standard descriptors. The second fork matters because a session leader can acquire a controlling terminal; the grandchild is not a session leader.
Privilege Transition
Privilege is not just UID 0. Linux combines real/effective/saved UIDs, file ownership transitions, and capability sets. Modern services should acquire the one privileged resource they need, drop IDs and capabilities, and often set PR_SET_NO_NEW_PRIVS before processing untrusted input.
10. Minimal C Demo
These demos keep the mechanisms observable: a zombie appears before waitpid, a reusable daemon function performs the double fork, process group/session IDs change after setsid, descriptor limits produce EMFILE, and a privileged process permanently drops to nobody.
11. Kernel Source Pointers
| Area | Files / symbols | What to inspect |
|---|---|---|
| Exit and reaping | kernel/exit.c, do_exit, do_wait, release_task | How resources are released first and the final zombie record is released only after wait. |
| Signals | kernel/signal.c, do_notify_parent, SIGCHLD | How child exit notification is queued and delivered to the parent. |
| Sessions and groups | kernel/sys.c, setsid, setpgid, drivers/tty/tty_jobctrl.c | How sessions, process groups, and controlling tty foreground groups are maintained. |
| Resource limits | kernel/sys.c, do_prlimit, include/uapi/linux/resource.h | Where soft/hard rlimits are checked and updated. |
| Credentials and capabilities | kernel/cred.c, kernel/capability.c, security/commoncap.c | How immutable credentials and capability checks implement least privilege. |
12. Interview Prep
| Question | Concise Answer |
|---|---|
| What is a zombie process? | A child that exited but has not been reaped. The address space and files are gone; PID and exit status remain for wait. |
| What is an orphan process? | A process whose parent exited first. Linux reparents it to PID 1 or the nearest subreaper, which later reaps it. |
| Why does daemonization fork twice? | The first fork lets the child call setsid. The second fork ensures the final daemon is not a session leader and cannot reacquire a controlling terminal. |
| How do job-control signals reach a pipeline? | The terminal tracks one foreground process group. Ctrl-C and Ctrl-Z are delivered to every member of that foreground group. |
| What is the difference between effective UID and capabilities? | Effective UID is the traditional identity used by many permission checks. Capabilities split root privileges into smaller bits such as CAP_NET_ADMIN or CAP_SYS_PTRACE. |
13. §18.10-18.11 pthread Lifecycle and Mutexes
POSIX threads are user-space API objects backed by Linux kernel tasks in the NPTL 1:1 model. Creating a thread is mostly libc preparing a stack, TLS, and a start trampoline, then asking the kernel to create a schedulable task that shares the process address space and file table.
14. Key Data Structures
A pthread is a small set of coordinated objects: a join handle, a user stack with a guard page, a TCB/TLS area, kernel task state, and optional synchronization words that stay in user memory until contention forces a futex syscall.
| Structure / API | Important Fields | Purpose | Interview Detail |
|---|---|---|---|
pthread_t | Opaque handle, often points at the TCB in NPTL | Identifies a joinable or detached pthread to libc APIs. | Do not compare with ==; use pthread_equal. |
pthread_attr_t | stack size, stack address, guard size, scheduling policy, detach state | Controls how pthread_create builds the thread. | Large default stacks can cap practical thread count before CPU does. |
| Thread stack | guard page, mapped stack region, current frames | Private call frames for one thread in the shared address space. | Overflow corrupts adjacent memory unless the guard page catches it first. |
| TCB / TLS | self pointer, errno slot, thread-local storage vector | Stores per-thread libc and language runtime state. | errno is thread-local, not one global integer. |
pthread_mutex_t | lock word, owner, waiter bit, kind/protocol fields | Mutual exclusion object implemented with atomic fast path plus futex slow path. | Uncontended lock/unlock usually stays entirely in user space. |
The mutex word has a deliberately small state machine. The common case is a compare-exchange from unlocked to locked; the kernel only sees the operation when a thread has to sleep or wake a waiter.
| Mutex Type / Attribute | Behavior | Use Case | Pitfall |
|---|---|---|---|
NORMAL | Default fast mutex; recursive lock by same thread deadlocks. | Most production code with clear ownership. | Undefined behavior on wrong-thread unlock. |
RECURSIVE | Same owner may lock multiple times; unlock decrements count. | Legacy callback-heavy code where re-entry is unavoidable. | Often hides poor lock design and lengthens critical sections. |
ERRORCHECK | Recursive lock returns EDEADLK; bad unlock returns an error. | Debug builds and teaching. | Slower; not a general deadlock detector. |
ROBUST | Next locker gets EOWNERDEAD if the owner died. | Shared memory state that can be recovered after process death. | Caller must repair state and call pthread_mutex_consistent. |
PRIO_INHERIT | Kernel boosts owner priority when a higher-priority waiter blocks. | Realtime threads using SCHED_FIFO or SCHED_RR. | Requires PI futex slow path; not needed for ordinary CFS code. |
PROCESS_SHARED | Mutex may synchronize threads in different processes through shared memory. | mmap MAP_SHARED or POSIX shm regions. | The object must actually live in shared memory, not private heap. |
15. Core Mechanism
Background
A server creates eight worker threads to share an in-memory cache. Thread creation needs independent stacks and TLS, while the cache update path needs mutual exclusion without paying a syscall on every uncontended lock.
Plan
- Initialize attributes, optionally reducing stack size or choosing detached state.
- Call
pthread_create; libc allocates stack/TCB/TLS and uses clone sharing flags. - Keep joinable threads until
pthread_joincollects the return value and releases resources, or detach them when no result is needed. - Protect shared state with a mutex; the uncontended path is atomic user-space code, while contended waiters sleep through futex.
Walkthrough
Initial state: counter=0, four threads each run 100,000 increments. Without a mutex, two threads can read 17, both compute 18, and both store 18, losing one increment. With pthread_mutex_lock, one thread changes the mutex word from 0 to 1, updates the counter, unlocks, and only then can another thread enter; the final protected counter is exactly 400,000.
Robust Owner Death
Robust mutexes are for recoverable shared state, especially across processes. If an owner dies in the critical section, the next locker is told that protected data may be inconsistent rather than silently continuing.
16. Minimal C Demo
The first demo shows pthread creation attributes, kernel TIDs, return values, and joins. The second demo contrasts a racing increment with the same increment protected by a normal mutex.
17. Kernel Source Pointers
| Area | Files / symbols | What to inspect |
|---|---|---|
| Thread creation | kernel/fork.c, kernel_clone, copy_process | How clone flags decide shared mm_struct, files, sighand, and thread group membership. |
| TLS setup | arch/x86/kernel/process_64.c, copy_thread_tls | Where the new thread stack, register state, and TLS pointer are prepared. |
| Futex wait/wake | kernel/futex/core.c, futex_wait, futex_wake | How contended pthread mutexes sleep and wake through a kernel hash table keyed by user address. |
| Priority inheritance | kernel/futex/pi.c, FUTEX_LOCK_PI, rt_mutex | How PI futexes connect pthread priority inheritance to kernel realtime mutex logic. |
| Robust lists | kernel/futex/core.c, exit_robust_list, set_robust_list | How the kernel marks owner-dead mutexes when a task exits while holding them. |
18. Interview Prep
| Question | Concise Answer |
|---|---|
What does pthread_create do on Linux? | NPTL prepares stack and TLS in libc, then uses clone with flags such as CLONE_VM, CLONE_FILES, CLONE_SIGHAND, CLONE_THREAD, and CLONE_SETTLS. |
| Joinable vs detached thread? | A joinable thread retains termination resources and a return value until joined. A detached thread releases resources automatically and cannot be joined. |
| Why can default pthread stacks limit scalability? | Each thread reserves a stack mapping, often megabytes plus a guard page. Thousands of threads can exhaust virtual memory or address-space policy before CPU is saturated. |
| How does a pthread mutex use futex? | The fast path is an atomic user-space transition. On contention, libc marks waiters and calls futex wait; unlock calls futex wake when needed. |
| When do you use robust or process-shared mutexes? | Use robust when owner death must be detected and repaired. Use process-shared only when the mutex object lives in shared memory and coordinates multiple processes. |
19. §18.12-18.17 Sync Primitives Beyond Mutexes
Mutexes protect exclusive critical sections, but real pthread programs also need to wait for predicates, admit many readers, synchronize phases, survive cancellation, initialize shared state once, store per-thread data, and handle process-directed signals without running unsafe code in an async handler.
20. Key Data Structures
A condition variable is not the condition itself. The condition is the protected predicate such as count > 0; the condition variable is only the wait queue used when the predicate is false.
| Primitive | Important State | Purpose | Rule Interviewers Care About |
|---|---|---|---|
pthread_cond_t | wait sequence, wake sequence, futex word, associated mutex protocol | Sleep until a mutex-protected predicate may have changed. | Always wait in while (!predicate), never if. |
pthread_rwlock_t | reader count, writer ownership, reader/writer wait queues | Allow many readers or one writer. | Reader preference can starve writers; recursive read locking can deadlock with pending writers. |
pthread_spinlock_t | one atomic lock word | Busy-wait for extremely short critical sections. | Bad around syscalls, blocking I/O, or any critical section longer than a context switch. |
pthread_barrier_t | required count, arrived count, generation number | Release a group only after N participants reach the same phase. | Reusable barriers need a generation counter so old waiters do not mix with the next round. |
| cleanup stack | LIFO cleanup handlers pushed by pthread_cleanup_push | Run unlock/free handlers when a thread is canceled or exits. | Deferred cancellation is manageable; async cancellation is almost never safe in C. |
pthread_key_t | global key index plus per-thread slot values and destructor | Dynamic thread-specific data with cleanup on thread exit. | Different threads see different values for the same key. |
A reader-writer lock is useful when protected data is read often and mutated rarely. The state must distinguish active readers, an active writer, and waiters whose preference policy decides who enters next.
Spinlocks and barriers solve opposite timing problems: a spinlock avoids sleeping because the hold time should be tiny, while a barrier deliberately waits until every participant has reached the same phase boundary.
Thread-specific data is a two-level lookup. The key is process-wide, but each thread has its own slot value, and a destructor receives that thread's non-null value when the thread exits.
21. Core Mechanism
Background
A bounded queue starts empty. A consumer must sleep until a producer inserts an item, but the sleep must not miss the producer's wakeup and must not consume from an empty buffer after a spurious wake.
Plan
- Protect queue state with one mutex:
head,tail, andcount. - Consumers use
while (count == 0) pthread_cond_wait(¬_empty, &mutex). - Producers use
while (count == CAP) pthread_cond_wait(¬_full, &mutex). - After changing the predicate, signal the opposite side before unlocking or while still holding enough ordering through the mutex.
Walkthrough
Initial state: count=0, consumer C holds the mutex and checks the predicate. C calls pthread_cond_wait, which atomically places C on the wait queue and unlocks the mutex. Producer P locks, writes item 7, changes count to 1, signals not_empty, and unlocks. C wakes, reacquires the mutex before returning, checks count > 0 again, removes item 7, signals not_full, and unlocks. The mutex creates the happens-before edge; the loop creates correctness under spurious or competing wakes.
Cancellation and Cleanup
Cancellation is only sane when resources are paired with cleanup handlers. The common pattern is lock, push an unlock handler, call cancellation points, then pop the handler when the protected region is finished.
Multi-Threaded Signals
Signal handlers are process-wide, but each thread has its own signal mask. Production pthread programs usually block signals in all workers and dedicate one thread to sigwait(), so shutdown code can use normal mutexes and condition variables instead of async-signal-safe handler rules.
22. Minimal C Demo
These demos focus on the failure-resistant patterns: predicate loops with two condition variables, phase synchronization with a barrier, cleanup handlers around cancellation points, exactly-once initialization with TSD destructors, and synchronous signal handling through sigwait.
23. Kernel Source Pointers
| Area | Files / symbols | What to inspect |
|---|---|---|
| Futex-backed waits | kernel/futex/core.c, futex_wait, futex_wake, futex_requeue | How pthread condition variables and locks sleep on user-space words only after libc decides the fast path cannot proceed. |
| Signal delivery | kernel/signal.c, complete_signal, get_signal, do_sigtimedwait | How Linux chooses an eligible thread and how synchronous waiting consumes pending signals. |
| Thread masks | kernel/signal.c, set_current_blocked, rt_sigprocmask | Where per-thread blocked signal masks are updated. |
| Scheduling impact | kernel/sched/core.c, schedule, try_to_wake_up | What happens after a futex wait parks a thread and a wake makes it runnable again. |
24. Interview Prep
| Question | Concise Answer |
|---|---|
Why must pthread_cond_wait be in a while loop? | Because wakeups can be spurious and another waiter may consume the state first. The predicate, not the wakeup, decides whether progress is legal. |
| What is a lost wakeup? | A signal happens before the waiter is actually waiting and the predicate is not protected consistently, so the waiter sleeps even though work is available. |
| When is an rwlock better than a mutex? | When reads dominate, critical sections are large enough to justify rwlock overhead, and writer starvation policy is acceptable or configured. |
| When should you use a pthread spinlock? | Only for very short non-blocking critical sections where sleeping would cost more than waiting. Never hold it across I/O or calls that may block. |
| How should a multi-threaded program handle SIGINT or SIGTERM? | Block those signals in all threads, create one signal thread that uses sigwait or signalfd, then notify workers through normal synchronization. |
25. Overview
Concurrency patterns are reusable shapes for moving work through shared state without letting memory, latency, or failure domains grow without control. The practical Linux building blocks are bounded queues, pthread synchronization, process workers, event loops, and a clear choice of where isolation should live.
26. Key Data Structures
A bounded producer-consumer queue is a ring plus two predicates. not_full is true when producers may insert, and not_empty is true when consumers may remove; both predicates are guarded by the same mutex.
The single-producer single-consumer ring removes the mutex by giving each side exclusive ownership of one index. Release/acquire ordering publishes the slot contents before the consumer observes the advanced tail.
| Structure | Fields | Purpose | Failure Mode |
|---|---|---|---|
| bounded ring queue | buf, head, tail, count, mutex, two condvars | Move finite work between producers and consumers with backpressure. | Using if instead of while consumes from empty or writes into full under competing wakeups. |
| SPSC ring | atomic head, atomic tail, power-of-two slots | Fast one-producer one-consumer handoff. | Extending it to MPMC without CAS claim/publish corrupts slots. |
| thread pool | worker array, bounded queue, shutdown flag, join path | Bound CPU parallelism and thread stack memory. | Unbounded queues hide overload until memory or tail latency fails. |
| prefork worker pool | master process, inherited listen FD, worker PIDs | Isolate faults and avoid thread-shared memory bugs. | High memory footprint and heavier context switches than threads. |
| epoll worker loop | nonblocking FDs, epoll interest list, event batch | Scale many mostly-idle connections with few threads. | One blocking call can stall the whole event loop thread. |
Reader-writer patterns choose different costs for the same read-mostly problem: rwlock blocks writers until readers drain, seqlock makes readers retry, and RCU lets old readers finish before freeing replaced data.
Userspace RCU separates pointer publication from reclamation. Readers mark a read-side region, writers publish a replacement, and the old object is freed only after every pre-existing reader has passed through a quiescent state.
A fixed thread pool is a bounded queue plus sleeping workers. Shutdown is just another state transition: set the quit flag, broadcast the waiters, and join every worker.
27. Core Mechanism
Background
A service receives bursts of requests, but only has enough CPU and memory for a fixed number of workers and queued jobs. The concurrency design must absorb small bursts, apply backpressure during overload, and shut down without losing ownership of queued work.
Plan
- Use a bounded queue so submitters block or fail fast when the system is overloaded.
- Protect queue state with one mutex and use
not_emptyandnot_fullpredicates. - Workers sleep while the queue is empty and the shutdown flag is false.
- Shutdown sets the flag, broadcasts workers, drains or rejects new submissions, then joins every thread.
Walkthrough
Initial state: queue capacity is 8, count=0, and three workers are sleeping on not_empty. A producer submits jobs 1 through 8, advancing tail and signaling workers. Job 9 blocks because count == CAP. Worker 0 pops job 1, decrements count, signals not_full, and executes outside the mutex, so job 9 can be queued while CPU work continues. During shutdown, the main thread sets quit=1 and broadcasts; a worker exits only when the queue is empty and quit is set, so already accepted work is not abandoned.
Work Stealing
Work stealing reduces contention by giving each worker a private deque. The owner pushes and pops recent tasks from the head for cache locality, while idle workers steal older tasks from the tail to rebalance load.
Server Models
Apache prefork spends memory to isolate each request handler in a process. Nginx spends engineering complexity on event-driven workers that epoll shared listening sockets and keep many idle connections without one stack per connection.
In the Nginx-style model, the master supervises workers while the kernel owns readiness notification and accept-queue behavior. SO_REUSEPORT can shard incoming connections across worker sockets.
A master can also accept centrally and pass connected sockets to workers over a Unix socket with SCM_RIGHTS. The receiving worker gets a valid duplicate FD in its own descriptor table.
Multi-threaded servers choose between one thread per connection, a bounded worker pool around epoll, M:N runtimes, and io_uring variants. The common rule is to keep blocking work away from the event loop.
An epoll worker repeatedly sleeps for readiness, accepts or reads nonblocking FDs, writes responses, and re-arms interest when edge-triggered or one-shot modes are used.
Model Comparison
Concurrency models are trade-offs between fault isolation, switch cost, memory footprint, and reasoning cost. Processes isolate failures; threads share memory cheaply but make data races possible; coroutines and event loops are fast but require cooperative nonblocking discipline.
28. Minimal C Demo
These runnable demos show the core interview patterns: a many-producer many-consumer bounded queue, a minimal thread pool with graceful shutdown, and the shape of a prefork worker pool using a pipe as the stand-in for accepted connections.
29. Kernel Source Pointers
| Area | Files / symbols | What to inspect |
|---|---|---|
| Blocking queue waits | kernel/futex/core.c, futex_wait, futex_wake | How pthread condvars park and wake threads after user-space predicates decide the fast path cannot continue. |
| Worker scheduling | kernel/sched/core.c, schedule, try_to_wake_up | What happens when a sleeping worker becomes runnable after a queue signal. |
| Process worker creation | kernel/fork.c, copy_process, dup_fd | How prefork workers inherit the listening socket and why each worker has a separate address space. |
| FD passing | net/unix/af_unix.c, unix_stream_sendmsg, scm_detach_fds | How SCM_RIGHTS transfers references to open files across Unix sockets. |
| Event readiness | fs/eventpoll.c, ep_poll, ep_send_events | How epoll batches readiness for event-loop and thread-pool server designs. |
30. Interview Prep
| Question | Concise Answer |
|---|---|
| How do you design a correct producer-consumer queue? | Use one mutex for queue state, two predicates for not-full and not-empty, wait in loops, signal after changing the predicate, and bound capacity for backpressure. |
| What makes a thread pool production-ready? | Bounded queue, clear rejection or blocking policy, worker lifecycle management, graceful shutdown, no executing jobs while holding the queue mutex, and observable backlog. |
| When is RCU better than an rwlock? | When reads dominate and can tolerate seeing old versions while writers copy-swap and defer reclamation until all old readers finish. |
| Apache prefork vs Nginx worker model? | Prefork favors isolation and simple blocking code at high memory cost. Nginx favors event-driven scalability with fewer workers but requires nonblocking discipline. |
| Thread per connection vs epoll thread pool? | Thread per connection is simple but stack-heavy and context-switch-heavy. An epoll pool handles many idle connections with fewer threads and offloads blocking CPU work to workers. |
31. §18.24-18.28 Deadlocks, PI Futexes, IPC Sync, and Coroutines
Advanced user-space concurrency is about the failure modes that appear after the basic mutex and queue code works: lock cycles, realtime priority inversion, cross-process ownership, file-lock semantics, and runtimes that multiplex logical tasks without creating one kernel thread per task.
32. Key Data Structures
A deadlock is a cycle in a wait-for graph. In the classic two-lock bug, each thread owns one lock and waits for the other, so neither edge can be removed by normal execution.
Cross-process synchronization works only when the synchronization word and protected data live in shared memory. A process-shared pthread mutex on a private heap is just two unrelated copies after fork.
| Structure / API | Important State | Purpose | Interview Detail |
|---|---|---|---|
| wait-for graph | threads, resources, waits-for edges | Detect cycles that imply deadlock. | Pthreads do not maintain a general deadlock detector for you. |
PTHREAD_PRIO_INHERIT | owner TID, waiter priority, PI futex state | Boost a low-priority mutex owner when a higher-priority thread blocks. | Useful for realtime scheduling, not ordinary throughput tuning. |
PROCESS_SHARED mutex | futex word stored in shared mapping | Synchronize independent processes without a server process. | Combine with robust mutexes when owner death must be recovered. |
| file locks | inode, byte range, owner scope | Coordinate access to persistent files across processes. | fcntl process-scoped locks surprise many people; OFD locks are often cleaner. |
ucontext_t | saved registers, stack, signal mask, link context | Stackful cooperative coroutine state in user space. | No preemption: a fiber that never yields blocks all fibers on that thread. |
Linux exposes several cross-process coordination tools. The crucial difference is ownership scope: whole file, byte range, process, open file description, named kernel semaphore, or futex word in a shared page.
A stackful coroutine is just another saved CPU context and stack inside one OS thread. Switching saves the current instruction and stack pointers, then restores another coroutine's saved state.
33. Core Mechanism
Background
A service has two shared indexes guarded by two locks. One path updates cache then index; another updates index then cache. Under load, thread A owns the cache lock and waits for the index lock while thread B owns the index lock and waits for the cache lock.
Plan
- Assign every lock a canonical rank: static enum, object address, or lock class.
- Acquire multiple locks in increasing rank and release them in reverse rank.
- For optional locks, use
pthread_mutex_trylockwith release-and-backoff instead of waiting while holding lower-priority resources. - In complex systems, log wait-for edges and fail fast when adding an edge would create a cycle.
Walkthrough
Initial state: lock A has rank 10 and lock B has rank 20. Thread 1 asks for A then B; thread 2 asks for B then A. The ordered helper sorts both requests into A then B. Thread 1 may run first and hold both; thread 2 waits at A without owning B, so there is no cycle. When thread 1 releases B then A, thread 2 proceeds and the circular-wait Coffman condition is removed.
Priority Inversion
Priority inversion appears when a low-priority lock owner is preempted by medium-priority work while a high-priority thread waits behind it. The high-priority thread is indirectly blocked by a thread that does not even need the lock.
A priority-inheritance mutex lets the kernel see the dependency. When the high-priority waiter blocks, the kernel temporarily boosts the low-priority owner so it can finish and release the mutex.
Coroutine Runtimes
Modern coroutine systems all split logical waiting from OS-thread blocking. Go schedules goroutines onto worker threads, while C++20 and Rust represent suspended work as state machines that an executor resumes when I/O or timers complete.
A stackless coroutine moves through explicit compiler-generated states at each suspension point. Local variables that must survive a suspension live in the coroutine frame rather than on an ordinary caller stack.
34. Minimal C Demo
These demos show the practical core without requiring realtime privileges: ordered two-lock acquisition prevents a deadlock, a process-shared mutex coordinates two forked children through anonymous shared memory, stackful coroutines ping-pong with swapcontext, and the model comparison prints equivalent coroutine shapes.
35. Kernel Source Pointers
| Area | Files / symbols | What to inspect |
|---|---|---|
| PI futexes | kernel/futex/pi.c, futex_lock_pi, futex_unlock_pi | How userspace PRIO_INHERIT mutexes route contention through kernel realtime mutex logic. |
| Realtime mutex core | kernel/locking/rtmutex.c, rt_mutex_adjust_prio_chain | How priority donation walks lock-owner chains and prevents unbounded inversion. |
| Shared-memory futex keys | kernel/futex/core.c, get_futex_key | How the kernel distinguishes private futexes from futex words in shared mappings. |
| File locks | fs/locks.c, fcntl_setlk, flock_lock_inode | Where BSD locks, POSIX byte-range locks, and OFD locks are represented. |
| Scheduling | kernel/sched/core.c, setscheduler, schedule | How realtime priority changes and wakeups affect runnable thread selection. |
36. Interview Prep
| Question | Concise Answer |
|---|---|
| What are the Coffman deadlock conditions? | Mutual exclusion, hold-and-wait, no preemption, and circular wait. Lock ordering removes circular wait. |
| What is priority inversion? | A high-priority thread waits on a lock held by a low-priority thread while medium-priority work prevents the owner from running. PI boosts the owner until it releases the lock. |
| How do process-shared pthread mutexes work? | The mutex object and protected data live in a shared mapping, and pthreads uses futex operations keyed to that shared memory address. |
| Why are OFD locks often safer than classic fcntl locks? | OFD locks are scoped to the open file description, so duplicate FDs and threads behave more predictably than process-scoped POSIX record locks. |
| Stackful fiber vs C++20 coroutine? | A stackful fiber switches an entire stack and can suspend deep in calls. A C++20 coroutine is a compiler-generated state machine that suspends only at explicit coroutine points. |