Part XVIII — Threads & Processes

§ 18.1-18.28 Processes, pthreads, Sync, and Advanced Concurrency

User-space process control from clone flags and COW page tables through pthread lifecycle, synchronization primitives, bounded queues, thread pools, prefork servers, epoll workers, deadlock prevention, PI futexes, shared-memory synchronization, and coroutine models.

1. Overview

Linux exposes processes, threads, and coroutines as different points on the same sharing spectrum: a process is isolated by page tables, a thread is a kernel-scheduled task sharing the process resources, and a coroutine is a user-space control-flow object inside one thread. The kernel center of gravity is copy_process(), which creates a task_struct and decides what to share based on clone flags.

Process vs Thread vs Coroutine

A process buys fault isolation and independent VM; a pthread buys cheap sharing and kernel preemption; a coroutine buys extremely cheap cooperative switching but depends on the current thread to yield. The model below shows what is actually separate.

ModelSchedulerIsolationTypical Switch CostMemory Cost
processKernel schedulerSeparate VM, FD table copied unless shared explicitlyMicroseconds; may include TLB effectsPage tables, VMAs, copied metadata, private stacks
threadKernel schedulerShares VM, heap, most process resourcesUsually lower than process switchOne user stack, TLS, and kernel task per thread
coroutineUser-space event loop or runtimeNo memory isolation inside the processTens to hundreds of nanoseconds in simple runtimesSaved registers plus stack or frame state

2. Key Data Structures

The main structures are not just C structs: they are ownership boundaries. task_struct is the schedulable entity, mm_struct owns the address space, files_struct owns descriptor slots, and ucontext_t is a user-space snapshot for coroutine switching.

StructureImportant FieldsPurposeShared By
task_structpid, tgid, mm, files, signalOne kernel scheduling object per process thread.Never shared; each thread has its own.
mm_structmmap, pgd, mm_usersAddress space, VMA list/tree, and top page-table pointer.Threads share it with CLONE_VM.
files_structfdt, close_on_exec, open_fdsMaps integer FDs to open file descriptions.Pthreads share it with CLONE_FILES; fork copies it.
vm_area_structvm_start, vm_end, vm_flags, vm_fileOne mapped region: text, heap, stack, mmap file, shared memory.Referenced through mm_struct.
ucontext_tsaved registers, stack pointer, signal mask, successor contextUser-space coroutine context; no kernel task is created.Owned by the coroutine runtime.

3. Core Mechanism

fork and Copy-On-Write

Background: A shell often forks only to immediately exec a new program. Copying every page of a large parent would be wasteful, so Linux copies metadata and shares physical pages until one side writes.

The COW state is visible in the page tables: both parent and child map the same physical page with write permission removed, even though the VMA still says the mapping is logically writable.

Plan: create a child task, duplicate resource tables, mark writable PTEs read-only, and let the page-fault handler allocate a private page only if a write occurs.

StepParent StateChild StateKernel Reason
Before forkWritable PTE points at page ADoes not existNormal private anonymous memory.
After forkRead-only PTE points at page ARead-only PTE points at page ASharing makes fork proportional to metadata, not resident memory.
Child writesStill maps page AFaults, gets copied page BPrivate semantics are restored lazily.
After faultSees original bytesSees modified bytesEach process now has the version it should observe.

The fork-plus-threads trap

In a multi-threaded process, only the calling thread exists in the child after fork(). Any lock held by a vanished sibling may remain permanently locked in the child process copied memory.

fork, vfork, clone, and exec

fork(), vfork(), and clone() differ mostly in sharing policy. execve() is different: it keeps the task identity but replaces the user-space image.

CallAddress SpaceFD TableParent BehaviorUse Case
forkCopied metadata, COW physical pagesCopied slots pointing at same open file descriptionsRuns immediatelyGeneral process creation.
vforkTemporarily shared with parentShared until exec or exitSuspendedImmediate exec in constrained environments.
clone threadShared via CLONE_VMShared via CLONE_FILESRuns immediatelyPOSIX thread implementation.
clone namespaceUsually separate like forkPolicy-dependentRuns immediatelyContainer bootstrap.

A typical shell path is fork, adjust descriptors in the child, then exec. The close-on-exec bit is what prevents helper pipes, secrets, and listening sockets from leaking into the new image.

The descriptor table itself survives fork, but execve() consults the close_on_exec bitmap and closes marked descriptors before the new program starts.

WrapperArgumentsPATH SearchEnvironment
execlListNoInherit
execlpListYesInherit
execleListNoExplicit
execvVectorNoInherit
execvpVectorYesInherit
execveVectorNoExplicit; actual syscall

4. Minimal C Demo

Process vs Thread vs Coroutine Switch Cost — C Demo
stdin (optional)
fork + pthread_atfork Lock Repair — C Demo
stdin (optional)
clone Namespace Skeleton — C Demo
stdin (optional)
Mini Shell: pipe + fork + execve — C Demo
stdin (optional)

5. Kernel Source Pointers

  • kernel/fork.c: kernel_clone, copy_process, copy_mm, copy_files.
  • mm/memory.c: do_cow_fault, wp_page_copy.
  • fs/exec.c: do_execveat_common, begin_new_exec.
  • fs/file.c: descriptor table growth, close_on_exec bitmap handling.
  • arch/x86/entry/syscalls/syscall_64.tbl: syscall numbers for clone, fork, vfork, execve.

6. Interview Prep

QuestionConcise Answer
Why is fork() fast even for a 20 GB process?Linux copies page-table metadata and marks writable pages COW. Physical pages are copied only on first write.
What is shared after fork()?FD entries point to the same open file descriptions, so file offsets and status flags are shared; address-space pages are COW, not independently copied.
Why is fork() dangerous in a multi-threaded program?The child keeps only the calling thread but inherits memory containing locks held by vanished threads. Before exec, call only async-signal-safe functions or use atfork carefully.
What makes a pthread a thread rather than a process?NPTL uses clone flags such as CLONE_VM, CLONE_FILES, CLONE_SIGHAND, and CLONE_THREAD.
Why use O_CLOEXEC instead of open then fcntl?It is atomic. In a multi-threaded process, another thread could fork and exec between open and fcntl, leaking the FD.

7. §18.5-18.9 Process Lifecycle, Daemons, Job Control, Limits, and Privilege

After creation and exec, the operating-system story becomes lifecycle management: tasks move between runnable, sleeping, stopped, and zombie states; parents reap children; sessions and process groups control terminal signals; limits constrain damage; and UID/capability state decides which privileged operations are allowed.

8. Key Data Structures

Process state is a compact kernel-visible state machine. A zombie is not a running program; most resources are already gone, but the kernel keeps enough status for the parent to collect a reliable result.

Structure / APIImportant FieldsPurposeInterview Detail
task_structstate, exit_state, exit_code, real_parentTracks scheduling state, parentage, and exit status until reaped.Zombie state retains PID and status, not the full address space.
signal_structpids, rlim, tty, flagsThread-group-wide signal, limit, and session-related state.Limits such as RLIMIT_NOFILE live with signal-wide process state.
struct rlimitrlim_cur, rlim_maxSoft and hard resource ceilings inherited across fork and exec.A non-privileged process can lower hard limits but cannot raise them back.
creduid, euid, suid, fsuid, capability setsImmutable credential snapshot referenced by permission checks.Effective UID and effective capabilities are what most checks consult.

Terminal job control is a hierarchy: a session owns the controlling terminal, the terminal designates one foreground process group, and terminal-generated signals are delivered to that foreground group.

Resource limits are inherited defaults with two ceilings. The soft limit is enforced now; the hard limit bounds how far an unprivileged process can raise the soft limit later.

9. Core Mechanism

Background

A parent starts a worker, the worker exits with status 42, and the parent is busy doing other work. The kernel cannot throw the status away because the parent may later call waitpid(), so the child becomes a zombie until reaped.

Plan

  1. On exit, release expensive resources such as VM mappings, open files, and userspace memory.
  2. Retain a small process record containing PID, parent link, signal state, and exit status.
  3. Notify the parent with SIGCHLD unless the disposition says otherwise.
  4. When the parent calls waitpid or waitid, copy the status out and free the final task record.

Walkthrough

Initial state: parent PID 900 forks child PID 901. PID 901 exits with _exit(42); ps shows state Z because PID 900 has not reaped it. When PID 900 runs waitpid(901, &status, 0), the kernel returns 901, WEXITSTATUS(status) is 42, and PID 901 disappears from the process table.

Daemonization

A classic daemon detaches from the launching terminal by forking, creating a new session, forking again, resetting process environment details, and redirecting standard descriptors. The second fork matters because a session leader can acquire a controlling terminal; the grandchild is not a session leader.

Privilege Transition

Privilege is not just UID 0. Linux combines real/effective/saved UIDs, file ownership transitions, and capability sets. Modern services should acquire the one privileged resource they need, drop IDs and capabilities, and often set PR_SET_NO_NEW_PRIVS before processing untrusted input.

10. Minimal C Demo

These demos keep the mechanisms observable: a zombie appears before waitpid, a reusable daemon function performs the double fork, process group/session IDs change after setsid, descriptor limits produce EMFILE, and a privileged process permanently drops to nobody.

Zombie Until waitpid Reaps — C Demo
stdin (optional)
Reusable daemonize Function — C Demo
stdin (optional)
Process Group and Session IDs — C Demo
stdin (optional)
RLIMIT_NOFILE Produces EMFILE — C Demo
stdin (optional)
Drop Privilege to nobody — C Demo
stdin (optional)

11. Kernel Source Pointers

AreaFiles / symbolsWhat to inspect
Exit and reapingkernel/exit.c, do_exit, do_wait, release_taskHow resources are released first and the final zombie record is released only after wait.
Signalskernel/signal.c, do_notify_parent, SIGCHLDHow child exit notification is queued and delivered to the parent.
Sessions and groupskernel/sys.c, setsid, setpgid, drivers/tty/tty_jobctrl.cHow sessions, process groups, and controlling tty foreground groups are maintained.
Resource limitskernel/sys.c, do_prlimit, include/uapi/linux/resource.hWhere soft/hard rlimits are checked and updated.
Credentials and capabilitieskernel/cred.c, kernel/capability.c, security/commoncap.cHow immutable credentials and capability checks implement least privilege.

12. Interview Prep

QuestionConcise Answer
What is a zombie process?A child that exited but has not been reaped. The address space and files are gone; PID and exit status remain for wait.
What is an orphan process?A process whose parent exited first. Linux reparents it to PID 1 or the nearest subreaper, which later reaps it.
Why does daemonization fork twice?The first fork lets the child call setsid. The second fork ensures the final daemon is not a session leader and cannot reacquire a controlling terminal.
How do job-control signals reach a pipeline?The terminal tracks one foreground process group. Ctrl-C and Ctrl-Z are delivered to every member of that foreground group.
What is the difference between effective UID and capabilities?Effective UID is the traditional identity used by many permission checks. Capabilities split root privileges into smaller bits such as CAP_NET_ADMIN or CAP_SYS_PTRACE.

13. §18.10-18.11 pthread Lifecycle and Mutexes

POSIX threads are user-space API objects backed by Linux kernel tasks in the NPTL 1:1 model. Creating a thread is mostly libc preparing a stack, TLS, and a start trampoline, then asking the kernel to create a schedulable task that shares the process address space and file table.

14. Key Data Structures

A pthread is a small set of coordinated objects: a join handle, a user stack with a guard page, a TCB/TLS area, kernel task state, and optional synchronization words that stay in user memory until contention forces a futex syscall.

Structure / APIImportant FieldsPurposeInterview Detail
pthread_tOpaque handle, often points at the TCB in NPTLIdentifies a joinable or detached pthread to libc APIs.Do not compare with ==; use pthread_equal.
pthread_attr_tstack size, stack address, guard size, scheduling policy, detach stateControls how pthread_create builds the thread.Large default stacks can cap practical thread count before CPU does.
Thread stackguard page, mapped stack region, current framesPrivate call frames for one thread in the shared address space.Overflow corrupts adjacent memory unless the guard page catches it first.
TCB / TLSself pointer, errno slot, thread-local storage vectorStores per-thread libc and language runtime state.errno is thread-local, not one global integer.
pthread_mutex_tlock word, owner, waiter bit, kind/protocol fieldsMutual exclusion object implemented with atomic fast path plus futex slow path.Uncontended lock/unlock usually stays entirely in user space.

The mutex word has a deliberately small state machine. The common case is a compare-exchange from unlocked to locked; the kernel only sees the operation when a thread has to sleep or wake a waiter.

Mutex Type / AttributeBehaviorUse CasePitfall
NORMALDefault fast mutex; recursive lock by same thread deadlocks.Most production code with clear ownership.Undefined behavior on wrong-thread unlock.
RECURSIVESame owner may lock multiple times; unlock decrements count.Legacy callback-heavy code where re-entry is unavoidable.Often hides poor lock design and lengthens critical sections.
ERRORCHECKRecursive lock returns EDEADLK; bad unlock returns an error.Debug builds and teaching.Slower; not a general deadlock detector.
ROBUSTNext locker gets EOWNERDEAD if the owner died.Shared memory state that can be recovered after process death.Caller must repair state and call pthread_mutex_consistent.
PRIO_INHERITKernel boosts owner priority when a higher-priority waiter blocks.Realtime threads using SCHED_FIFO or SCHED_RR.Requires PI futex slow path; not needed for ordinary CFS code.
PROCESS_SHAREDMutex may synchronize threads in different processes through shared memory.mmap MAP_SHARED or POSIX shm regions.The object must actually live in shared memory, not private heap.

15. Core Mechanism

Background

A server creates eight worker threads to share an in-memory cache. Thread creation needs independent stacks and TLS, while the cache update path needs mutual exclusion without paying a syscall on every uncontended lock.

Plan

  1. Initialize attributes, optionally reducing stack size or choosing detached state.
  2. Call pthread_create; libc allocates stack/TCB/TLS and uses clone sharing flags.
  3. Keep joinable threads until pthread_join collects the return value and releases resources, or detach them when no result is needed.
  4. Protect shared state with a mutex; the uncontended path is atomic user-space code, while contended waiters sleep through futex.

Walkthrough

Initial state: counter=0, four threads each run 100,000 increments. Without a mutex, two threads can read 17, both compute 18, and both store 18, losing one increment. With pthread_mutex_lock, one thread changes the mutex word from 0 to 1, updates the counter, unlocks, and only then can another thread enter; the final protected counter is exactly 400,000.

Robust Owner Death

Robust mutexes are for recoverable shared state, especially across processes. If an owner dies in the critical section, the next locker is told that protected data may be inconsistent rather than silently continuing.

16. Minimal C Demo

The first demo shows pthread creation attributes, kernel TIDs, return values, and joins. The second demo contrasts a racing increment with the same increment protected by a normal mutex.

pthread_create, pthread_attr, and join — C Demo
stdin (optional)
Race Counter vs Mutex Counter — C Demo
stdin (optional)

17. Kernel Source Pointers

AreaFiles / symbolsWhat to inspect
Thread creationkernel/fork.c, kernel_clone, copy_processHow clone flags decide shared mm_struct, files, sighand, and thread group membership.
TLS setuparch/x86/kernel/process_64.c, copy_thread_tlsWhere the new thread stack, register state, and TLS pointer are prepared.
Futex wait/wakekernel/futex/core.c, futex_wait, futex_wakeHow contended pthread mutexes sleep and wake through a kernel hash table keyed by user address.
Priority inheritancekernel/futex/pi.c, FUTEX_LOCK_PI, rt_mutexHow PI futexes connect pthread priority inheritance to kernel realtime mutex logic.
Robust listskernel/futex/core.c, exit_robust_list, set_robust_listHow the kernel marks owner-dead mutexes when a task exits while holding them.

18. Interview Prep

QuestionConcise Answer
What does pthread_create do on Linux?NPTL prepares stack and TLS in libc, then uses clone with flags such as CLONE_VM, CLONE_FILES, CLONE_SIGHAND, CLONE_THREAD, and CLONE_SETTLS.
Joinable vs detached thread?A joinable thread retains termination resources and a return value until joined. A detached thread releases resources automatically and cannot be joined.
Why can default pthread stacks limit scalability?Each thread reserves a stack mapping, often megabytes plus a guard page. Thousands of threads can exhaust virtual memory or address-space policy before CPU is saturated.
How does a pthread mutex use futex?The fast path is an atomic user-space transition. On contention, libc marks waiters and calls futex wait; unlock calls futex wake when needed.
When do you use robust or process-shared mutexes?Use robust when owner death must be detected and repaired. Use process-shared only when the mutex object lives in shared memory and coordinates multiple processes.

19. §18.12-18.17 Sync Primitives Beyond Mutexes

Mutexes protect exclusive critical sections, but real pthread programs also need to wait for predicates, admit many readers, synchronize phases, survive cancellation, initialize shared state once, store per-thread data, and handle process-directed signals without running unsafe code in an async handler.

20. Key Data Structures

A condition variable is not the condition itself. The condition is the protected predicate such as count > 0; the condition variable is only the wait queue used when the predicate is false.

PrimitiveImportant StatePurposeRule Interviewers Care About
pthread_cond_twait sequence, wake sequence, futex word, associated mutex protocolSleep until a mutex-protected predicate may have changed.Always wait in while (!predicate), never if.
pthread_rwlock_treader count, writer ownership, reader/writer wait queuesAllow many readers or one writer.Reader preference can starve writers; recursive read locking can deadlock with pending writers.
pthread_spinlock_tone atomic lock wordBusy-wait for extremely short critical sections.Bad around syscalls, blocking I/O, or any critical section longer than a context switch.
pthread_barrier_trequired count, arrived count, generation numberRelease a group only after N participants reach the same phase.Reusable barriers need a generation counter so old waiters do not mix with the next round.
cleanup stackLIFO cleanup handlers pushed by pthread_cleanup_pushRun unlock/free handlers when a thread is canceled or exits.Deferred cancellation is manageable; async cancellation is almost never safe in C.
pthread_key_tglobal key index plus per-thread slot values and destructorDynamic thread-specific data with cleanup on thread exit.Different threads see different values for the same key.

A reader-writer lock is useful when protected data is read often and mutated rarely. The state must distinguish active readers, an active writer, and waiters whose preference policy decides who enters next.

Spinlocks and barriers solve opposite timing problems: a spinlock avoids sleeping because the hold time should be tiny, while a barrier deliberately waits until every participant has reached the same phase boundary.

Thread-specific data is a two-level lookup. The key is process-wide, but each thread has its own slot value, and a destructor receives that thread's non-null value when the thread exits.

21. Core Mechanism

Background

A bounded queue starts empty. A consumer must sleep until a producer inserts an item, but the sleep must not miss the producer's wakeup and must not consume from an empty buffer after a spurious wake.

Plan

  1. Protect queue state with one mutex: head, tail, and count.
  2. Consumers use while (count == 0) pthread_cond_wait(&not_empty, &mutex).
  3. Producers use while (count == CAP) pthread_cond_wait(&not_full, &mutex).
  4. After changing the predicate, signal the opposite side before unlocking or while still holding enough ordering through the mutex.

Walkthrough

Initial state: count=0, consumer C holds the mutex and checks the predicate. C calls pthread_cond_wait, which atomically places C on the wait queue and unlocks the mutex. Producer P locks, writes item 7, changes count to 1, signals not_empty, and unlocks. C wakes, reacquires the mutex before returning, checks count > 0 again, removes item 7, signals not_full, and unlocks. The mutex creates the happens-before edge; the loop creates correctness under spurious or competing wakes.

Cancellation and Cleanup

Cancellation is only sane when resources are paired with cleanup handlers. The common pattern is lock, push an unlock handler, call cancellation points, then pop the handler when the protected region is finished.

Multi-Threaded Signals

Signal handlers are process-wide, but each thread has its own signal mask. Production pthread programs usually block signals in all workers and dedicate one thread to sigwait(), so shutdown code can use normal mutexes and condition variables instead of async-signal-safe handler rules.

22. Minimal C Demo

These demos focus on the failure-resistant patterns: predicate loops with two condition variables, phase synchronization with a barrier, cleanup handlers around cancellation points, exactly-once initialization with TSD destructors, and synchronous signal handling through sigwait.

Bounded Buffer with Two Condition Variables — C Demo
stdin (optional)
RWLock Readers with Barrier Phases — C Demo
stdin (optional)
Cancellation Cleanup Unlocks Mutex — C Demo
stdin (optional)
pthread_once and Thread-Specific Data — C Demo
stdin (optional)
Dedicated sigwait Signal Thread — C Demo
stdin (optional)

23. Kernel Source Pointers

AreaFiles / symbolsWhat to inspect
Futex-backed waitskernel/futex/core.c, futex_wait, futex_wake, futex_requeueHow pthread condition variables and locks sleep on user-space words only after libc decides the fast path cannot proceed.
Signal deliverykernel/signal.c, complete_signal, get_signal, do_sigtimedwaitHow Linux chooses an eligible thread and how synchronous waiting consumes pending signals.
Thread maskskernel/signal.c, set_current_blocked, rt_sigprocmaskWhere per-thread blocked signal masks are updated.
Scheduling impactkernel/sched/core.c, schedule, try_to_wake_upWhat happens after a futex wait parks a thread and a wake makes it runnable again.

24. Interview Prep

QuestionConcise Answer
Why must pthread_cond_wait be in a while loop?Because wakeups can be spurious and another waiter may consume the state first. The predicate, not the wakeup, decides whether progress is legal.
What is a lost wakeup?A signal happens before the waiter is actually waiting and the predicate is not protected consistently, so the waiter sleeps even though work is available.
When is an rwlock better than a mutex?When reads dominate, critical sections are large enough to justify rwlock overhead, and writer starvation policy is acceptable or configured.
When should you use a pthread spinlock?Only for very short non-blocking critical sections where sleeping would cost more than waiting. Never hold it across I/O or calls that may block.
How should a multi-threaded program handle SIGINT or SIGTERM?Block those signals in all threads, create one signal thread that uses sigwait or signalfd, then notify workers through normal synchronization.

25. Overview

Concurrency patterns are reusable shapes for moving work through shared state without letting memory, latency, or failure domains grow without control. The practical Linux building blocks are bounded queues, pthread synchronization, process workers, event loops, and a clear choice of where isolation should live.

26. Key Data Structures

A bounded producer-consumer queue is a ring plus two predicates. not_full is true when producers may insert, and not_empty is true when consumers may remove; both predicates are guarded by the same mutex.

The single-producer single-consumer ring removes the mutex by giving each side exclusive ownership of one index. Release/acquire ordering publishes the slot contents before the consumer observes the advanced tail.

StructureFieldsPurposeFailure Mode
bounded ring queuebuf, head, tail, count, mutex, two condvarsMove finite work between producers and consumers with backpressure.Using if instead of while consumes from empty or writes into full under competing wakeups.
SPSC ringatomic head, atomic tail, power-of-two slotsFast one-producer one-consumer handoff.Extending it to MPMC without CAS claim/publish corrupts slots.
thread poolworker array, bounded queue, shutdown flag, join pathBound CPU parallelism and thread stack memory.Unbounded queues hide overload until memory or tail latency fails.
prefork worker poolmaster process, inherited listen FD, worker PIDsIsolate faults and avoid thread-shared memory bugs.High memory footprint and heavier context switches than threads.
epoll worker loopnonblocking FDs, epoll interest list, event batchScale many mostly-idle connections with few threads.One blocking call can stall the whole event loop thread.

Reader-writer patterns choose different costs for the same read-mostly problem: rwlock blocks writers until readers drain, seqlock makes readers retry, and RCU lets old readers finish before freeing replaced data.

Userspace RCU separates pointer publication from reclamation. Readers mark a read-side region, writers publish a replacement, and the old object is freed only after every pre-existing reader has passed through a quiescent state.

A fixed thread pool is a bounded queue plus sleeping workers. Shutdown is just another state transition: set the quit flag, broadcast the waiters, and join every worker.

27. Core Mechanism

Background

A service receives bursts of requests, but only has enough CPU and memory for a fixed number of workers and queued jobs. The concurrency design must absorb small bursts, apply backpressure during overload, and shut down without losing ownership of queued work.

Plan

  1. Use a bounded queue so submitters block or fail fast when the system is overloaded.
  2. Protect queue state with one mutex and use not_empty and not_full predicates.
  3. Workers sleep while the queue is empty and the shutdown flag is false.
  4. Shutdown sets the flag, broadcasts workers, drains or rejects new submissions, then joins every thread.

Walkthrough

Initial state: queue capacity is 8, count=0, and three workers are sleeping on not_empty. A producer submits jobs 1 through 8, advancing tail and signaling workers. Job 9 blocks because count == CAP. Worker 0 pops job 1, decrements count, signals not_full, and executes outside the mutex, so job 9 can be queued while CPU work continues. During shutdown, the main thread sets quit=1 and broadcasts; a worker exits only when the queue is empty and quit is set, so already accepted work is not abandoned.

Work Stealing

Work stealing reduces contention by giving each worker a private deque. The owner pushes and pops recent tasks from the head for cache locality, while idle workers steal older tasks from the tail to rebalance load.

Server Models

Apache prefork spends memory to isolate each request handler in a process. Nginx spends engineering complexity on event-driven workers that epoll shared listening sockets and keep many idle connections without one stack per connection.

In the Nginx-style model, the master supervises workers while the kernel owns readiness notification and accept-queue behavior. SO_REUSEPORT can shard incoming connections across worker sockets.

A master can also accept centrally and pass connected sockets to workers over a Unix socket with SCM_RIGHTS. The receiving worker gets a valid duplicate FD in its own descriptor table.

Multi-threaded servers choose between one thread per connection, a bounded worker pool around epoll, M:N runtimes, and io_uring variants. The common rule is to keep blocking work away from the event loop.

An epoll worker repeatedly sleeps for readiness, accepts or reads nonblocking FDs, writes responses, and re-arms interest when edge-triggered or one-shot modes are used.

Model Comparison

Concurrency models are trade-offs between fault isolation, switch cost, memory footprint, and reasoning cost. Processes isolate failures; threads share memory cheaply but make data races possible; coroutines and event loops are fast but require cooperative nonblocking discipline.

28. Minimal C Demo

These runnable demos show the core interview patterns: a many-producer many-consumer bounded queue, a minimal thread pool with graceful shutdown, and the shape of a prefork worker pool using a pipe as the stand-in for accepted connections.

MPMC Bounded Queue with Backpressure — C Demo
stdin (optional)
Tiny Thread Pool with Graceful Shutdown — C Demo
stdin (optional)
Prefork Worker Pool Skeleton — C Demo
stdin (optional)

29. Kernel Source Pointers

AreaFiles / symbolsWhat to inspect
Blocking queue waitskernel/futex/core.c, futex_wait, futex_wakeHow pthread condvars park and wake threads after user-space predicates decide the fast path cannot continue.
Worker schedulingkernel/sched/core.c, schedule, try_to_wake_upWhat happens when a sleeping worker becomes runnable after a queue signal.
Process worker creationkernel/fork.c, copy_process, dup_fdHow prefork workers inherit the listening socket and why each worker has a separate address space.
FD passingnet/unix/af_unix.c, unix_stream_sendmsg, scm_detach_fdsHow SCM_RIGHTS transfers references to open files across Unix sockets.
Event readinessfs/eventpoll.c, ep_poll, ep_send_eventsHow epoll batches readiness for event-loop and thread-pool server designs.

30. Interview Prep

QuestionConcise Answer
How do you design a correct producer-consumer queue?Use one mutex for queue state, two predicates for not-full and not-empty, wait in loops, signal after changing the predicate, and bound capacity for backpressure.
What makes a thread pool production-ready?Bounded queue, clear rejection or blocking policy, worker lifecycle management, graceful shutdown, no executing jobs while holding the queue mutex, and observable backlog.
When is RCU better than an rwlock?When reads dominate and can tolerate seeing old versions while writers copy-swap and defer reclamation until all old readers finish.
Apache prefork vs Nginx worker model?Prefork favors isolation and simple blocking code at high memory cost. Nginx favors event-driven scalability with fewer workers but requires nonblocking discipline.
Thread per connection vs epoll thread pool?Thread per connection is simple but stack-heavy and context-switch-heavy. An epoll pool handles many idle connections with fewer threads and offloads blocking CPU work to workers.

31. §18.24-18.28 Deadlocks, PI Futexes, IPC Sync, and Coroutines

Advanced user-space concurrency is about the failure modes that appear after the basic mutex and queue code works: lock cycles, realtime priority inversion, cross-process ownership, file-lock semantics, and runtimes that multiplex logical tasks without creating one kernel thread per task.

32. Key Data Structures

A deadlock is a cycle in a wait-for graph. In the classic two-lock bug, each thread owns one lock and waits for the other, so neither edge can be removed by normal execution.

Cross-process synchronization works only when the synchronization word and protected data live in shared memory. A process-shared pthread mutex on a private heap is just two unrelated copies after fork.

Structure / APIImportant StatePurposeInterview Detail
wait-for graphthreads, resources, waits-for edgesDetect cycles that imply deadlock.Pthreads do not maintain a general deadlock detector for you.
PTHREAD_PRIO_INHERITowner TID, waiter priority, PI futex stateBoost a low-priority mutex owner when a higher-priority thread blocks.Useful for realtime scheduling, not ordinary throughput tuning.
PROCESS_SHARED mutexfutex word stored in shared mappingSynchronize independent processes without a server process.Combine with robust mutexes when owner death must be recovered.
file locksinode, byte range, owner scopeCoordinate access to persistent files across processes.fcntl process-scoped locks surprise many people; OFD locks are often cleaner.
ucontext_tsaved registers, stack, signal mask, link contextStackful cooperative coroutine state in user space.No preemption: a fiber that never yields blocks all fibers on that thread.

Linux exposes several cross-process coordination tools. The crucial difference is ownership scope: whole file, byte range, process, open file description, named kernel semaphore, or futex word in a shared page.

A stackful coroutine is just another saved CPU context and stack inside one OS thread. Switching saves the current instruction and stack pointers, then restores another coroutine's saved state.

33. Core Mechanism

Background

A service has two shared indexes guarded by two locks. One path updates cache then index; another updates index then cache. Under load, thread A owns the cache lock and waits for the index lock while thread B owns the index lock and waits for the cache lock.

Plan

  1. Assign every lock a canonical rank: static enum, object address, or lock class.
  2. Acquire multiple locks in increasing rank and release them in reverse rank.
  3. For optional locks, use pthread_mutex_trylock with release-and-backoff instead of waiting while holding lower-priority resources.
  4. In complex systems, log wait-for edges and fail fast when adding an edge would create a cycle.

Walkthrough

Initial state: lock A has rank 10 and lock B has rank 20. Thread 1 asks for A then B; thread 2 asks for B then A. The ordered helper sorts both requests into A then B. Thread 1 may run first and hold both; thread 2 waits at A without owning B, so there is no cycle. When thread 1 releases B then A, thread 2 proceeds and the circular-wait Coffman condition is removed.

Priority Inversion

Priority inversion appears when a low-priority lock owner is preempted by medium-priority work while a high-priority thread waits behind it. The high-priority thread is indirectly blocked by a thread that does not even need the lock.

A priority-inheritance mutex lets the kernel see the dependency. When the high-priority waiter blocks, the kernel temporarily boosts the low-priority owner so it can finish and release the mutex.

Coroutine Runtimes

Modern coroutine systems all split logical waiting from OS-thread blocking. Go schedules goroutines onto worker threads, while C++20 and Rust represent suspended work as state machines that an executor resumes when I/O or timers complete.

A stackless coroutine moves through explicit compiler-generated states at each suspension point. Local variables that must survive a suspension live in the coroutine frame rather than on an ordinary caller stack.

34. Minimal C Demo

These demos show the practical core without requiring realtime privileges: ordered two-lock acquisition prevents a deadlock, a process-shared mutex coordinates two forked children through anonymous shared memory, stackful coroutines ping-pong with swapcontext, and the model comparison prints equivalent coroutine shapes.

Deadlock Prevention by Lock Ordering — C Demo
stdin (optional)
PROCESS_SHARED Mutex in mmap Memory — C Demo
stdin (optional)
ucontext Coroutine Ping-Pong — C Demo
stdin (optional)
Coroutine Model Comparison — C Demo
stdin (optional)

35. Kernel Source Pointers

AreaFiles / symbolsWhat to inspect
PI futexeskernel/futex/pi.c, futex_lock_pi, futex_unlock_piHow userspace PRIO_INHERIT mutexes route contention through kernel realtime mutex logic.
Realtime mutex corekernel/locking/rtmutex.c, rt_mutex_adjust_prio_chainHow priority donation walks lock-owner chains and prevents unbounded inversion.
Shared-memory futex keyskernel/futex/core.c, get_futex_keyHow the kernel distinguishes private futexes from futex words in shared mappings.
File locksfs/locks.c, fcntl_setlk, flock_lock_inodeWhere BSD locks, POSIX byte-range locks, and OFD locks are represented.
Schedulingkernel/sched/core.c, setscheduler, scheduleHow realtime priority changes and wakeups affect runnable thread selection.

36. Interview Prep

QuestionConcise Answer
What are the Coffman deadlock conditions?Mutual exclusion, hold-and-wait, no preemption, and circular wait. Lock ordering removes circular wait.
What is priority inversion?A high-priority thread waits on a lock held by a low-priority thread while medium-priority work prevents the owner from running. PI boosts the owner until it releases the lock.
How do process-shared pthread mutexes work?The mutex object and protected data live in a shared mapping, and pthreads uses futex operations keyed to that shared memory address.
Why are OFD locks often safer than classic fcntl locks?OFD locks are scoped to the open file description, so duplicate FDs and threads behave more predictably than process-scoped POSIX record locks.
Stackful fiber vs C++20 coroutine?A stackful fiber switches an entire stack and can suspend deep in calls. A C++20 coroutine is a compiler-generated state machine that suspends only at explicit coroutine points.