§ 38 — Process Wakeup: INTERRUPTIBLE → RUNNING
Linux process state machine & sleep states (§38.1) · ep_poll() slow path: set_current_state + schedule_timeout (§38.2) · try_to_wake_up: state transition + SMP IPI (§38.3) · CFS vruntime RB-tree & context_switch (§38.4) · Signal interruption, EINTR, SA_RESTART, epoll_pwait (§38.5)
§ 38.1 — Linux Process State Machine
Every Linux process is always in exactly one scheduler state, stored in task_struct.__state. The state controls two things: whether the scheduler will put the process on a CPU, and whether a signal can interrupt its wait. epoll_wait sleeps in TASK_INTERRUPTIBLE, the only sleeping state that lets any delivered signal pull a process back out of sleep (TASK_KILLABLE responds to fatal signals only).
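The state is visible from user space: the third field of /proc/<pid>/stat is a one-letter code (R for running or runnable, S for interruptible sleep, D for uninterruptible sleep). The following is a minimal sketch, assuming a Linux system with /proc mounted; the helper names proc_state and state_of_child_in_epoll_wait are made up for this illustration.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/epoll.h>
#include <sys/wait.h>
#include <unistd.h>

/* Third field of /proc/<pid>/stat is the state letter (R, S, D, T, Z, ...). */
static char proc_state(pid_t pid)
{
    char path[64], buf[512];
    snprintf(path, sizeof path, "/proc/%d/stat", (int)pid);
    FILE *f = fopen(path, "r");
    if (!f)
        return '?';
    size_t len = fread(buf, 1, sizeof buf - 1, f);
    fclose(f);
    buf[len] = '\0';
    char *p = strrchr(buf, ')');   /* comm may contain spaces: skip past it */
    return (p && p[1] == ' ' && p[2]) ? p[2] : '?';
}

/* Forks a child that blocks in epoll_wait(); returns the state the parent sees. */
static char state_of_child_in_epoll_wait(void)
{
    pid_t pid = fork();
    if (pid < 0)
        return '?';
    if (pid == 0) {
        struct epoll_event out;
        int epfd = epoll_create1(0);
        epoll_wait(epfd, &out, 1, -1);  /* nothing registered: sleeps until killed */
        _exit(0);
    }
    char st = '?';
    for (int i = 0; i < 200 && st != 'S'; i++) {  /* poll for up to about 2 s */
        usleep(10 * 1000);
        st = proc_state(pid);
    }
    kill(pid, SIGKILL);
    waitpid(pid, NULL, 0);
    return st;
}
```

Calling state_of_child_in_epoll_wait() from main should report S once the child has reached epoll_wait, which is how TASK_INTERRUPTIBLE appears in ps and top.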
State Transition Diagram
| State | Value | Used by epoll_wait? | Signal can wake? |
|---|---|---|---|
| TASK_RUNNING | 0 | n/a (on CPU or in run queue) | n/a |
| TASK_INTERRUPTIBLE | 1 | YES | YES → returns EINTR |
| TASK_UNINTERRUPTIBLE | 2 | No | No |
| TASK_KILLABLE | TASK_WAKEKILL OR 2 (0x102 in recent kernels) | No | Fatal signals only (SIGKILL) |
| __TASK_STOPPED | 4 | No | SIGCONT |
| EXIT_ZOMBIE | 32 (kept in exit_state) | No | No |
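The values above are bit flags, so compound states are built by OR-ing them and tested with a mask. The sketch below restates that in user-space C. The constants are copied to mirror include/linux/sched.h in recent kernels (an assumption: they are not a stable ABI and have changed between versions), and signal_can_wake is a hypothetical helper, not a kernel function.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed values, mirroring include/linux/sched.h in recent kernels. */
#define TASK_RUNNING         0x0000u
#define TASK_INTERRUPTIBLE   0x0001u
#define TASK_UNINTERRUPTIBLE 0x0002u
#define TASK_WAKEKILL        0x0100u
#define TASK_KILLABLE        (TASK_WAKEKILL | TASK_UNINTERRUPTIBLE)
#define TASK_NORMAL          (TASK_INTERRUPTIBLE | TASK_UNINTERRUPTIBLE)

/* Hypothetical helper: would a signal wake a task sleeping in `state`? */
static bool signal_can_wake(unsigned int state, bool fatal)
{
    if (state & TASK_INTERRUPTIBLE)
        return true;               /* any deliverable signal wakes it */
    if (fatal && (state & TASK_WAKEKILL))
        return true;               /* only fatal signals, e.g. SIGKILL */
    return false;                  /* plain TASK_UNINTERRUPTIBLE: never */
}
```

With these definitions, signal_can_wake(TASK_KILLABLE, false) is false and signal_can_wake(TASK_KILLABLE, true) is true, matching the "Fatal signals only" row.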
Why epoll_wait Uses TASK_INTERRUPTIBLE
/* Two sleeping states — very different signal behavior */
/* TASK_INTERRUPTIBLE: network I/O, pipe reads, epoll_wait */
set_current_state(TASK_INTERRUPTIBLE);
schedule(); /* voluntarily yields CPU */
/* wakeup path 1: event arrives → ep_poll_callback → back here */
/* wakeup path 2: signal delivered → signal_pending() → EINTR */
/* TASK_UNINTERRUPTIBLE: disk I/O (e.g. waiting for journal commit) */
set_current_state(TASK_UNINTERRUPTIBLE);
schedule();
/* ONLY wakeup path: I/O completes → back here
Signal arrives? It stays pending until the I/O finishes.
Reason: cannot abort a disk write mid-flight — filesystem corruption */
/* Rule of thumb:
Network → TASK_INTERRUPTIBLE (reversible, no data hazard)
Block dev → TASK_UNINTERRUPTIBLE (irreversible mid-op state) */
§ 38.2 — Going to Sleep Inside epoll_wait
When epoll_wait() finds no ready events it takes the slow path inside ep_poll(): register a wait entry on ep->wq, set state to TASK_INTERRUPTIBLE, then call schedule_timeout() — the point where the process physically leaves the CPU. The surrounding loop guards against spurious wakeups and signal delivery.
ep_poll() Slow Path
CPU Timeline — Sleep, Wakeup, Reschedule
The process yields the CPU the moment schedule_timeout() is called. Another process runs until the NIC interrupt fires, which chains through the TCP stack into ep_poll_callback(), which re-enqueues the sleeping process. The scheduler then picks it up at its next opportunity.
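The same sleep and wakeup sequence can be observed from user space. In this minimal sketch a forked child stands in for the packet arrival: it writes one byte to a pipe after the parent has gone to sleep in epoll_wait. The function name sleep_until_child_writes and the 50 ms delay are choices made for this illustration.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sys/epoll.h>
#include <sys/wait.h>
#include <unistd.h>

/* Returns the epoll_wait() result: 1 if the child's write woke the parent. */
static int sleep_until_child_writes(void)
{
    int fds[2];
    if (pipe(fds) < 0)
        return -1;

    int epfd = epoll_create1(0);
    if (epfd < 0)
        return -1;
    struct epoll_event ev = { .events = EPOLLIN, .data.fd = fds[0] };
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, fds[0], &ev) < 0)
        return -1;

    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {                    /* child: stands in for the event source */
        usleep(50 * 1000);             /* give the parent time to fall asleep */
        if (write(fds[1], "x", 1) != 1)
            _exit(1);
        _exit(0);
    }

    struct epoll_event out;
    int n = epoll_wait(epfd, &out, 1, 2000);  /* parent leaves the CPU here */

    waitpid(pid, NULL, 0);
    close(epfd);
    close(fds[0]);
    close(fds[1]);
    return n;
}
```

If the parent reaches epoll_wait before the write, it sleeps and is woken by the write; if the write wins the race, the event is already queued and epoll_wait returns without sleeping. Either way the result is 1.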
ep_poll() Slow Path — Source
/* fs/eventpoll.c: ep_poll() slow path (simplified, not the verbatim kernel source) */
static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
                   int maxevents, long timeout)
{
    wait_queue_entry_t wait;
    long jtimeout = timeout_to_jiffies(timeout);
    int res = 0;
    spin_lock_irq(&ep->lock);
    if (!list_empty(&ep->rdllist)) {
        spin_unlock_irq(&ep->lock);  /* never jump out with the lock held */
        goto send_events;            /* fast path: already have events */
    }
    /* Add ourselves to epoll's wait queue */
    init_waitqueue_entry(&wait, current);  /* func = default_wake_function */
    __add_wait_queue_exclusive(&ep->wq, &wait);
    spin_unlock_irq(&ep->lock);
    for (;;) {
        /*
         * Set state BEFORE checking conditions.
         * set_current_state() includes a full memory barrier: the new state
         * is visible to other CPUs before we test rdllist.
         */
        set_current_state(TASK_INTERRUPTIBLE);
        if (!list_empty(&ep->rdllist) || !jtimeout)
            break;  /* events arrived or timeout expired */
        if (signal_pending(current)) {
            res = -EINTR;  /* signal interrupted the wait */
            break;
        }
        jtimeout = schedule_timeout(jtimeout);  /* yield CPU: sleep here */
        /* returns when woken or timeout; loop re-checks all conditions */
    }
    spin_lock_irq(&ep->lock);
    __remove_wait_queue(&ep->wq, &wait);
    spin_unlock_irq(&ep->lock);
    set_current_state(TASK_RUNNING);
send_events:
    /* ep_send_events() returns the number of events copied to user space */
    if (!res)
        res = ep_send_events(ep, events, maxevents);
    return res;
}
| Call | Purpose | Detail |
|---|---|---|
| set_current_state(TASK_INTERRUPTIBLE) | State + barrier | Store followed by a full memory barrier: the state is visible before the rdllist check, which prevents the lost-wakeup race |
| __add_wait_queue_exclusive() | Register on ep->wq | Uses WQ_FLAG_EXCLUSIVE: only one epoll_wait caller woken per event |
| schedule_timeout(jtimeout) | Yield CPU | Process leaves CPU here; scheduler runs next task; timer set for timeout |
| signal_pending(current) | Signal check | Tests the TIF_SIGPENDING flag; true means a signal is pending, whether it arrived before or during the sleep |
| __remove_wait_queue() | Cleanup | Always runs once the entry was added, even on timeout or EINTR, so no dangling entry is left on ep->wq |
| set_current_state(TASK_RUNNING) | Restore state | Required before calling ep_send_events, which copies to user space and may sleep; marks the process as runnable again |
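Three of the loop's exit conditions can be exercised from user space without any signals. The sketch below is an illustration under stated assumptions: probe_exit_conditions and struct poll_results are hypothetical names, and the mapping in the comments is to the simplified ep_poll() above, not to a specific kernel version.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sys/epoll.h>
#include <unistd.h>

struct poll_results {
    int empty_nonblocking;  /* timeout 0, nothing ready */
    int empty_timed;        /* timeout 20 ms, nothing ready */
    int ready;              /* data already queued before the call */
};

static struct poll_results probe_exit_conditions(void)
{
    struct poll_results r = { -1, -1, -1 };
    struct epoll_event ev = { .events = EPOLLIN }, out;
    int fds[2];
    int epfd = epoll_create1(0);

    if (epfd < 0 || pipe(fds) < 0)
        return r;
    ev.data.fd = fds[0];
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, fds[0], &ev) < 0)
        return r;

    /* !jtimeout on the first pass: the caller never sleeps. */
    r.empty_nonblocking = epoll_wait(epfd, &out, 1, 0);

    /* Sleeps in schedule_timeout() until the timer expires. */
    r.empty_timed = epoll_wait(epfd, &out, 1, 20);

    /* rdllist is non-empty on entry: fast path, returns at once. */
    if (write(fds[1], "x", 1) == 1)
        r.ready = epoll_wait(epfd, &out, 1, -1);

    close(epfd);
    close(fds[0]);
    close(fds[1]);
    return r;
}
```

The first two calls return 0 (no events) and the third returns 1; only the second one actually leaves the CPU.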
§ 38.3 — try_to_wake_up: Adding to the Run Queue
When a packet arrives, ep_poll_callback() calls wake_up(&ep->wq). This chains through the wait queue machinery to try_to_wake_up() — the single kernel function responsible for transitioning a sleeping task back to TASK_RUNNING and placing it on a CPU's CFS run queue. On SMP systems, if the sleeping process last ran on a different CPU, an IPI (inter-processor interrupt) is sent to enqueue it there.
try_to_wake_up Flowchart
SMP Cross-CPU Wakeup
Cache affinity matters: try_to_wake_up() prefers to re-enqueue a process on the CPU it last ran on, keeping its L1/L2 cache warm. When the wakeup fires on a different CPU, an IPI is the mechanism to hand off the enqueue operation without touching another CPU's run queue directly.
try_to_wake_up — Source
/* kernel/sched/core.c — try_to_wake_up() (simplified) */
int try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
{
int cpu, success = 0;
unsigned long flags;
raw_spin_lock_irqsave(&p->pi_lock, flags);
/*
* Check the task is actually in the expected sleep state.
* Guards against double-wakeup races on SMP.
*/
if (!(p->__state & state))
goto out;
p->__state = TASK_RUNNING; /* mark awake under pi_lock (the real code passes through TASK_WAKING first) */
/*
* select_task_rq: find best CPU for cache affinity.
* Usually returns p->cpu (last-run CPU) to keep cache warm.
*/
cpu = select_task_rq(p, p->wake_cpu, SD_BALANCE_WAKE, wake_flags);
ttwu_queue(p, cpu, wake_flags);
/* Same CPU → ttwu_do_activate(rq, p) directly.
Diff CPU → send IPI; target CPU handles ttwu_do_activate. */
success = 1;
out:
raw_spin_unlock_irqrestore(&p->pi_lock, flags);
return success;
}
/* ttwu_do_activate: final enqueue into CFS run queue */
static void ttwu_do_activate(struct rq *rq, struct task_struct *p, int wake_flags)
{
activate_task(rq, p, ENQUEUE_WAKEUP); /* enqueue_task_fair() */
wakeup_preempt(rq, p, wake_flags); /* may trigger preemption */
}
| Step | Function | Action |
|---|---|---|
| 1 | wake_up(&ep->wq) | Triggers wait queue traversal from ep_poll_callback |
| 2 | __wake_up_common | Iterates ep->wq, calls each entry's func callback |
| 3 | default_wake_function | Entry func for epoll_wait; calls try_to_wake_up |
| 4 | try_to_wake_up | Acquires pi_lock, verifies sleep state, sets TASK_RUNNING |
| 5 | ttwu_queue | Same CPU: direct enqueue. Different CPU: send IPI |
| 6 | ttwu_do_activate | Calls enqueue_task_fair() — adds task to CFS RB-tree |
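Steps 2 to 4 can be modeled in a few lines of user-space C. This is a toy model, not kernel code: every toy_ name is invented for the illustration, the wait queue is a plain singly linked list, and the on_rq flag stands in for the whole ttwu_queue and ttwu_do_activate path. It shows two properties described above: an exclusive wakeup stops after one waiter, and waking an already running task is a no-op.

```c
#include <assert.h>
#include <stddef.h>

#define TASK_RUNNING       0u
#define TASK_INTERRUPTIBLE 1u
#define WQ_FLAG_EXCLUSIVE  1u

struct toy_task {
    unsigned int state;
    int on_rq;                       /* 1 once "enqueued on a run queue" */
};

struct toy_wait_entry {
    unsigned int flags;
    struct toy_task *task;
    struct toy_wait_entry *next;
};

/* Model of try_to_wake_up(): succeeds only if the task sleeps in `mask`. */
static int toy_try_to_wake_up(struct toy_task *p, unsigned int mask)
{
    if (!(p->state & mask))
        return 0;                    /* already awake: second wakeup does nothing */
    p->state = TASK_RUNNING;
    p->on_rq = 1;
    return 1;
}

/* Model of __wake_up_common(): walk the queue and stop once nr_exclusive
 * exclusive waiters have actually been woken. Returns the number woken. */
static int toy_wake_up(struct toy_wait_entry *head, unsigned int mask,
                       int nr_exclusive)
{
    int woken = 0;
    for (struct toy_wait_entry *e = head; e; e = e->next) {
        if (!toy_try_to_wake_up(e->task, mask))
            continue;
        woken++;
        if ((e->flags & WQ_FLAG_EXCLUSIVE) && --nr_exclusive == 0)
            break;
    }
    return woken;
}
```

With two exclusive waiters queued, one call to toy_wake_up(head, TASK_INTERRUPTIBLE, 1) wakes only the first; a second call skips it (its state no longer matches) and wakes the next one.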
§ 38.4 — CFS Scheduler: Run Queue → Running
Once ttwu_do_activate() enqueues the woken process, the CFS (Completely Fair Scheduler) decides when it actually gets a CPU. CFS sorts all runnable tasks in a red-black tree by vruntime (virtual runtime: CPU time adjusted for priority). A task that slept for a long time has its stale vruntime raised to about min_vruntime, which places it at or near the leftmost node, so it is among the next to be picked without being able to monopolize the CPU.
CFS Run Queue — vruntime RB-Tree
Context Switch Timeline
pick_next_task_fair() takes the leftmost node, then context_switch() saves the outgoing process's register state and restores the incoming one. The woken process resumes exactly at the instruction after schedule_timeout().
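The "resume where it left off" behavior has a close user-space analogy in the POSIX ucontext API, where swapcontext saves the caller's registers and restores another context's. The sketch below is only that analogy, assuming glibc's ucontext support; it says nothing about how the kernel's switch_to() is implemented, and all names in it are made up for the example.

```c
#include <assert.h>
#include <ucontext.h>

static ucontext_t main_ctx, task_ctx;
static char task_stack[64 * 1024];
static int trace[4], trace_len;

static void task_body(void)
{
    trace[trace_len++] = 1;              /* first time on the "CPU" */
    swapcontext(&task_ctx, &main_ctx);   /* "sleep": save registers, switch away */
    trace[trace_len++] = 3;              /* resumes right after the switch call */
}

/* Runs the demo and returns the number of recorded steps. */
static int run_switch_demo(void)
{
    trace_len = 0;
    getcontext(&task_ctx);
    task_ctx.uc_stack.ss_sp = task_stack;
    task_ctx.uc_stack.ss_size = sizeof task_stack;
    task_ctx.uc_link = &main_ctx;        /* where to continue when task_body returns */
    makecontext(&task_ctx, task_body, 0);

    swapcontext(&main_ctx, &task_ctx);   /* run the task until it switches away */
    trace[trace_len++] = 2;              /* "another process runs" */
    swapcontext(&main_ctx, &task_ctx);   /* "wakeup": resume the task */
    trace[trace_len++] = 4;
    return trace_len;
}
```

After run_switch_demo() the trace array holds 1, 2, 3, 4: the task continues at the statement following its own swapcontext call, just as the woken process continues from its call into the scheduler.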
Full Round-Trip: NIC → epoll_wait Returns
| CFS concept | Detail |
|---|---|
| vruntime | CPU time weighted by task priority (nice). Lower = received less weighted CPU time = scheduled sooner. |
| min_vruntime | Monotonically increasing floor that tracks the smallest vruntime on the run queue. Tasks waking from a long sleep are placed at about this value, so they neither starve nor monopolize the CPU. |
| leftmost node | The RB-tree node with the smallest vruntime. pick_next_task_fair() finds it in O(1) because the leftmost node is cached in the cfs_rq. |
| enqueue_task_fair() | Called by ttwu_do_activate. Sets p->vruntime = max(p->vruntime, min_vruntime) (simplified), then inserts into the RB-tree. |
| switch_mm() | Swaps the CR3 register (page table base on x86) when switching between different processes. Skipped for kernel threads (no mm). |
| switch_to() | Saves prev's rsp/rbp/callee-saved registers to its stack. Restores next's. CPU now runs next's code. |
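The vruntime bookkeeping in this table reduces to a little arithmetic. The following toy model is a sketch under simplifying assumptions: a linear scan replaces the RB-tree, wakeup placement is the plain max() rule from the table, and every toy_ name is invented. The weight 1024 for a nice-0 task matches the kernel's load-weight table; the weight 2048 used in the usage note is an arbitrary value picked to keep the numbers round.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NICE_0_WEIGHT 1024u   /* load weight of a nice-0 task */

struct toy_entity {
    uint64_t vruntime;        /* weighted runtime in ns */
    uint32_t weight;          /* larger weight = higher priority */
};

/* vruntime advances more slowly for heavier (higher-priority) tasks. */
static void toy_account(struct toy_entity *se, uint64_t delta_exec_ns)
{
    se->vruntime += delta_exec_ns * NICE_0_WEIGHT / se->weight;
}

/* Wakeup placement as in the table: never behind min_vruntime. */
static void toy_place_on_wakeup(struct toy_entity *se, uint64_t min_vruntime)
{
    if (se->vruntime < min_vruntime)
        se->vruntime = min_vruntime;
}

/* Linear scan standing in for "read the cached leftmost node". */
static size_t toy_pick_next(const struct toy_entity *rq, size_t n)
{
    size_t best = 0;
    for (size_t i = 1; i < n; i++)
        if (rq[i].vruntime < rq[best].vruntime)
            best = i;
    return best;
}
```

A task with weight 2048 that runs for 2000 ns is charged only 1000 ns of vruntime. A sleeper whose vruntime is 100 while min_vruntime is 6000 is placed at 6000, so it competes with the current leftmost task instead of jumping far ahead of it.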
§ 38.5 — Signal Interruption of epoll_wait
Because epoll_wait uses TASK_INTERRUPTIBLE, any signal delivered to the sleeping process triggers the exact same try_to_wake_up path as an incoming I/O event. The difference is discovered only after the process wakes: the ep_poll loop checks signal_pending(current) and breaks with -EINTR instead of dispatching events.
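On the application side, the usual response to -EINTR is to call epoll_wait again with whatever time is left. The wrapper below is a minimal sketch of that pattern; epoll_wait_retry and now_ms are hypothetical helper names, and the millisecond bookkeeping with CLOCK_MONOTONIC is one reasonable choice rather than a prescribed one.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <sys/epoll.h>
#include <time.h>
#include <unistd.h>

static long long now_ms(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (long long)ts.tv_sec * 1000 + ts.tv_nsec / 1000000;
}

/* Hypothetical wrapper: like epoll_wait(), but retries on EINTR with the
 * remaining time. timeout_ms < 0 means wait forever, as in epoll_wait(). */
static int epoll_wait_retry(int epfd, struct epoll_event *evs, int max,
                            int timeout_ms)
{
    long long deadline = (timeout_ms < 0) ? -1 : now_ms() + timeout_ms;

    for (;;) {
        int n = epoll_wait(epfd, evs, max, timeout_ms);
        if (n >= 0 || errno != EINTR)
            return n;                 /* events, timeout, or a real error */
        if (deadline >= 0) {
            long long left = deadline - now_ms();
            timeout_ms = left > 0 ? (int)left : 0;  /* 0: one last non-blocking poll */
        }
    }
}
```

Recomputing the timeout matters: retrying with the original value would let a steady stream of signals extend the wait indefinitely.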
EINTR Wakeup Path
A program that wants a signal delivered only while it waits typically unblocks it with sigprocmask, then calls epoll_wait. A signal that arrives in the window between the two calls runs its handler before the wait starts, so epoll_wait then sleeps with nothing left to interrupt it and the wakeup is effectively lost. epoll_pwait atomically swaps the signal mask and enters the kernel wait in a single syscall, closing that TOCTOU race.
/* epoll_pwait: atomic signal mask + wait */
sigset_t mask;
sigemptyset(&mask);
sigaddset(&mask, SIGINT);
/* atomic: apply mask → epoll_wait → restore mask.
   mask is the set of signals BLOCKED during the wait, so SIGINT
   cannot interrupt this call; all other signals still can. */
int n = epoll_pwait(epfd, events, MAX, timeout, &mask);
if (n < 0 && errno == EINTR) {
    /* interrupted by a signal that was not masked: retry manually
       (epoll_wait is never auto-restarted, even with SA_RESTART) */
}
| Scenario | epoll_wait behavior | How to handle |
|---|---|---|
| Signal, no SA_RESTART | Returns -1, errno=EINTR | App checks errno, re-calls manually |
| Signal, SA_RESTART | Still returns -1, errno=EINTR: epoll_wait is never restarted automatically (see signal(7)) | Same manual retry as above |
| epoll_pwait() + masked signal | Signal blocked, process stays asleep; the signal remains pending until the original mask is restored | Prevents TOCTOU between check and block |
| epoll_pwait2() (Linux 5.11+) | Same as epoll_pwait + nanosecond timeout (struct timespec) | Use for sub-millisecond timeout precision |
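The race that epoll_pwait closes can be demonstrated deterministically. In this sketch SIGUSR1 stays blocked everywhere except inside epoll_pwait, so the call returns EINTR whether the child's signal arrives before or during the wait. The handler is installed with SA_RESTART on purpose, to show that the flag does not change the outcome for epoll. The names on_usr1 and pwait_interrupted_errno are made up for the example.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <string.h>
#include <sys/epoll.h>
#include <sys/wait.h>
#include <unistd.h>

static volatile sig_atomic_t got_usr1;

static void on_usr1(int sig)
{
    (void)sig;
    got_usr1 = 1;
}

/* Returns errno from epoll_pwait(), 0 if it did not fail, -1 on setup error. */
static int pwait_interrupted_errno(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_usr1;
    sa.sa_flags = SA_RESTART;            /* set on purpose: no effect on epoll */
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGUSR1, &sa, NULL) < 0)
        return -1;

    /* Keep SIGUSR1 blocked everywhere except inside epoll_pwait(). */
    sigset_t block, during_wait;
    sigemptyset(&block);
    sigaddset(&block, SIGUSR1);
    if (sigprocmask(SIG_BLOCK, &block, &during_wait) < 0)
        return -1;
    sigdelset(&during_wait, SIGUSR1);    /* mask in effect while waiting */

    int epfd = epoll_create1(0);
    if (epfd < 0)
        return -1;

    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        kill(getppid(), SIGUSR1);        /* may arrive before the parent waits */
        _exit(0);
    }

    struct epoll_event out;
    int n = epoll_pwait(epfd, &out, 1, 2000, &during_wait);
    int err = (n < 0) ? errno : 0;

    waitpid(pid, NULL, 0);
    close(epfd);
    return err;
}
```

With plain epoll_wait preceded by a separate sigprocmask call, a signal landing between the two calls would run the handler early and the wait would then sit out its full timeout. Here the pending signal is delivered only once the wait has installed its mask, so the function returns EINTR and the handler has run.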
Interview Prep — Must-Know Questions
| Question | Key answer |
|---|---|
| TASK_INTERRUPTIBLE vs TASK_UNINTERRUPTIBLE — which does epoll_wait use? | TASK_INTERRUPTIBLE. It can be woken by both an I/O event and a signal. epoll_wait must be interruptible so Ctrl-C (SIGINT) can kill the process. Disk I/O uses TASK_UNINTERRUPTIBLE because the operation cannot be safely aborted mid-flight. |
| Full path from ep_poll_callback → process running? | ep_poll_callback → rdllist.add(epitem) → wake_up(&ep->wq) → __wake_up_common → default_wake_function → try_to_wake_up → p->__state = TASK_RUNNING → ttwu_queue → ttwu_do_activate → CFS enqueue → pick_next_task_fair → context_switch → resumes at schedule_timeout() return. |
| What does try_to_wake_up do? Which CPU's run queue? | Locks p->pi_lock, verifies p->__state & TASK_NORMAL, sets state = TASK_RUNNING, calls select_task_rq() for cache affinity (prefers p->cpu), then ttwu_queue. Same CPU → ttwu_do_activate directly. Different CPU → IPI to the target CPU so it enqueues the task. |
| How does CFS prioritize a newly woken process? | enqueue_task_fair() sets p->vruntime = max(p->vruntime, min_vruntime) of its cfs_rq (simplified), placing the task at or near the leftmost node of the RB-tree. pick_next_task_fair() returns the leftmost node in O(1) because it is cached, so a task that lands leftmost is selected at the next schedule() call. |
| Why does ep_poll() re-check rdllist after schedule_timeout returns? | Spurious wakeups — the kernel guarantees nothing about why schedule_timeout returned. The loop must verify list_empty(&ep->rdllist) is false before exiting. It also checks signal_pending and the remaining timeout. |
| How does a signal interrupt epoll_wait? What errno? | signal_wake_up(p, 0) calls try_to_wake_up — identical to an I/O wakeup. After the process resumes in ep_poll, signal_pending(current) is true → the loop breaks with res = -EINTR → epoll_wait returns -1, errno = EINTR. |
| epoll_wait vs epoll_pwait — what race does pwait fix? | epoll_pwait atomically installs a temporary signal mask before entering the kernel wait. Without it, a signal arriving between sigprocmask() and epoll_wait() is lost (TOCTOU). epoll_pwait closes that window in a single syscall. epoll_pwait2 (Linux 5.11) adds nanosecond timeout precision. |