Part XXV — I/O Multiplexing

§ 38 — Process Wakeup: INTERRUPTIBLE → RUNNING

Linux process state machine & sleep states (§38.1) · ep_poll() slow path: set_current_state + schedule_timeout (§38.2) · try_to_wake_up: state transition + SMP IPI (§38.3) · CFS vruntime RB-tree & context_switch (§38.4) · Signal interruption, EINTR, SA_RESTART, epoll_pwait (§38.5)

§ 38.1 — Linux Process State Machine

Every Linux task is in exactly one scheduler state at any moment, stored in task_struct.__state. The state controls two things: whether the scheduler may put the task on a CPU, and whether a signal can interrupt its wait. epoll_wait sleeps in TASK_INTERRUPTIBLE, the only sleep state from which any signal (not just a fatal one) can wake the task.

State Transition Diagram

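| State | Value | Used by epoll_wait? | Signal can wake? |
| --- | --- | --- | --- |
| TASK_RUNNING | 0 | n/a (on CPU or in run queue) | n/a |
| TASK_INTERRUPTIBLE | 1 | YES | YES → returns EINTR |
| TASK_UNINTERRUPTIBLE | 2 | No | No |
| TASK_KILLABLE | TASK_WAKEKILL (0x0100 in recent kernels) OR 2 | No | Fatal signals only (SIGKILL) |
| __TASK_STOPPED | 4 | No | SIGCONT |
| EXIT_ZOMBIE | 32 (kept in exit_state, not __state) | No | No |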
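
The "Signal can wake?" column reduces to two bit tests. The following is a minimal sketch, not kernel code: the ST_* constants are illustrative copies of the state bits, the value used for ST_WAKEKILL is an assumption that matches recent kernels (it differs in older ones), and signal_can_wake() is a hypothetical helper.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative copies of the scheduler state bits; not the kernel headers. */
enum {
    ST_RUNNING         = 0x0000,
    ST_INTERRUPTIBLE   = 0x0001,
    ST_UNINTERRUPTIBLE = 0x0002,
    ST_STOPPED         = 0x0004,
    ST_WAKEKILL        = 0x0100,   /* assumed value; kernel-version dependent */
    ST_KILLABLE        = ST_WAKEKILL | ST_UNINTERRUPTIBLE
};

/* Hypothetical helper: may a signal end a sleep in this state? */
static bool signal_can_wake(unsigned state, bool fatal)
{
    if (state & ST_INTERRUPTIBLE)
        return true;                /* any signal; the wait returns EINTR */
    if (fatal && (state & ST_WAKEKILL))
        return true;                /* SIGKILL only */
    return false;
}

/* Re-states the sleeping rows of the table as assertions. */
static void check_state_table(void)
{
    assert(signal_can_wake(ST_INTERRUPTIBLE, false));
    assert(!signal_can_wake(ST_UNINTERRUPTIBLE, true));
    assert(!signal_can_wake(ST_KILLABLE, false));
    assert(signal_can_wake(ST_KILLABLE, true));
}
```

Calling check_state_table() from a test program passes silently when the model agrees with the table.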

Why epoll_wait Uses TASK_INTERRUPTIBLE

/* Two sleeping states — very different signal behavior */

/* TASK_INTERRUPTIBLE: network I/O, pipe reads, epoll_wait */
set_current_state(TASK_INTERRUPTIBLE);
schedule();        /* voluntarily yields CPU */
/* wakeup path 1: event arrives → ep_poll_callback → back here */
/* wakeup path 2: signal delivered → signal_pending() → EINTR  */

/* TASK_UNINTERRUPTIBLE: disk I/O (e.g. waiting for journal commit) */
set_current_state(TASK_UNINTERRUPTIBLE);
schedule();
/* ONLY wakeup path: I/O completes → back here
   Signal arrives? It stays pending until the I/O finishes.
   Reason: a disk write cannot be aborted mid-flight without risking
   filesystem corruption */

/* Rule of thumb:
   Network   → TASK_INTERRUPTIBLE  (reversible, no data hazard)
   Block dev → TASK_UNINTERRUPTIBLE (irreversible mid-op state) */
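
From user space, the two sleep states show up as one-letter codes in /proc/<pid>/stat: 'S' for an interruptible sleep and 'D' for an uninterruptible one. The sketch below forks a child that blocks in epoll_wait() and reads back the letter. It assumes a Linux system with /proc mounted, and the helper names proc_state() and state_of_child_in_epoll_wait() are made up for this example.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/epoll.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/* Read the one-letter state from /proc/<pid>/stat ('R', 'S', 'D', ...). */
static char proc_state(pid_t pid)
{
    char path[64], buf[512];
    snprintf(path, sizeof path, "/proc/%d/stat", (int)pid);
    FILE *f = fopen(path, "r");
    if (!f)
        return '?';
    size_t n = fread(buf, 1, sizeof buf - 1, f);
    fclose(f);
    buf[n] = '\0';
    /* Format is "pid (comm) S ..."; comm may contain ')', so search from the right. */
    char *p = strrchr(buf, ')');
    return (p && p[1] == ' ' && p[2]) ? p[2] : '?';
}

/* Fork a child that blocks in epoll_wait() and report the state we observe. */
static char state_of_child_in_epoll_wait(void)
{
    pid_t pid = fork();
    assert(pid >= 0);
    if (pid == 0) {
        struct epoll_event ev;
        int epfd = epoll_create1(0);
        epoll_wait(epfd, &ev, 1, -1);   /* nothing registered: sleeps until a signal */
        _exit(0);
    }
    char st = '?';
    struct timespec ten_ms = { 0, 10 * 1000 * 1000 };
    for (int i = 0; i < 200 && st != 'S'; i++) {   /* poll for up to about 2 s */
        nanosleep(&ten_ms, NULL);
        st = proc_state(pid);
    }
    kill(pid, SIGKILL);                 /* TASK_INTERRUPTIBLE: the signal wakes it */
    waitpid(pid, NULL, 0);
    return st;
}
```

On a typical system state_of_child_in_epoll_wait() returns 'S', and the final SIGKILL ends the child's wait immediately.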

§ 38.2 — Going to Sleep Inside epoll_wait

When epoll_wait() finds no ready events it takes the slow path inside ep_poll(): register a wait entry on ep->wq, set state to TASK_INTERRUPTIBLE, then call schedule_timeout() — the point where the process physically leaves the CPU. The surrounding loop guards against spurious wakeups and signal delivery.
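
The difference between the two paths is visible from user space. The sketch below is only an illustration: it watches one pipe and calls epoll_wait() three times, once when nothing is readable, once blocking until a forked child writes a byte, and once while that byte is still unread. The struct and function names are invented for the example.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sys/epoll.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/* Return values of three epoll_wait() calls on one pipe. */
struct wait_results {
    int empty_poll;    /* timeout 0, nothing readable       */
    int blocked_wait;  /* timeout -1, child writes later    */
    int ready_poll;    /* timeout 0, byte is still unread   */
};

static struct wait_results demo_paths(void)
{
    struct wait_results r;
    struct epoll_event ev = { .events = EPOLLIN }, out;
    int fds[2];

    int rc = pipe(fds);
    assert(rc == 0);
    int epfd = epoll_create1(0);
    assert(epfd >= 0);
    ev.data.fd = fds[0];
    rc = epoll_ctl(epfd, EPOLL_CTL_ADD, fds[0], &ev);
    assert(rc == 0);

    /* Nothing is readable and the timeout is 0: returns 0 without sleeping. */
    r.empty_poll = epoll_wait(epfd, &out, 1, 0);

    pid_t pid = fork();
    assert(pid >= 0);
    if (pid == 0) {                       /* child: make the pipe readable after 50 ms */
        struct timespec fifty_ms = { 0, 50 * 1000 * 1000 };
        nanosleep(&fifty_ms, NULL);
        if (write(fds[1], "x", 1) != 1)
            _exit(1);
        _exit(0);
    }

    /* The ready list is empty, so this call sleeps until the write wakes it. */
    r.blocked_wait = epoll_wait(epfd, &out, 1, -1);
    waitpid(pid, NULL, 0);

    /* The byte is still unread (level-triggered), so the call returns at once. */
    r.ready_poll = epoll_wait(epfd, &out, 1, 0);

    close(fds[0]);
    close(fds[1]);
    close(epfd);
    return r;
}
```

Only the second call goes through the sleep described above; the first and third return without leaving the CPU.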

[Figure: ep_poll() slow path]

CPU Timeline — Sleep, Wakeup, Reschedule

The process yields the CPU the moment schedule_timeout() is called. Another process runs until the NIC interrupt fires, which chains through the TCP stack into ep_poll_callback(), which re-enqueues the sleeping process. The scheduler then picks it up at its next opportunity.

ep_poll() Slow Path — Source

/* fs/eventpoll.c: ep_poll() slow path (simplified sketch, not verbatim kernel source) */
static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
                   int maxevents, long timeout)
{
    wait_queue_entry_t wait;
    /* negative timeout = wait forever; otherwise convert ms to jiffies */
    long jtimeout = timeout < 0 ? MAX_SCHEDULE_TIMEOUT : msecs_to_jiffies(timeout);
    int res = 0;

    spin_lock_irq(&ep->lock);
    if (!list_empty(&ep->rdllist)) {
        spin_unlock_irq(&ep->lock);   /* do not leave with the lock held */
        goto send_events;             /* fast path: events already queued */
    }

    /* Add ourselves to epoll's wait queue */
    init_waitqueue_entry(&wait, current); /* func = default_wake_function */
    __add_wait_queue_exclusive(&ep->wq, &wait);
    spin_unlock_irq(&ep->lock);

    for (;;) {
        /*
         * Set state BEFORE checking conditions.
         * set_current_state() implies a full memory barrier: the state
         * store is visible to other CPUs before we test rdllist.
         */
        set_current_state(TASK_INTERRUPTIBLE);

        if (!list_empty(&ep->rdllist) || !jtimeout)
            break;                 /* events arrived or timeout expired */
        if (signal_pending(current)) {
            res = -EINTR;          /* signal interrupted the wait */
            break;
        }
        jtimeout = schedule_timeout(jtimeout); /* yield CPU: sleep here */
        /* returns when woken or on timeout; loop re-checks all conditions */
    }

    spin_lock_irq(&ep->lock);      /* wait-queue changes need ep->lock */
    __remove_wait_queue(&ep->wq, &wait);
    spin_unlock_irq(&ep->lock);
    set_current_state(TASK_RUNNING);

send_events:
    if (res)
        return res;                /* -EINTR: deliver nothing */
    return ep_send_events(ep, events, maxevents); /* number of events copied out */
}

| Call | Purpose | Detail |
| --- | --- | --- |
| set_current_state(TASK_INTERRUPTIBLE) | State + barrier | Full memory barrier: the state is visible before the rdllist check, which prevents the lost-wakeup race |
| __add_wait_queue_exclusive() | Register on ep->wq | Uses WQ_FLAG_EXCLUSIVE, so only one epoll_wait caller is woken per event |
| schedule_timeout(jtimeout) | Yield CPU | Process leaves the CPU here; the scheduler runs the next task; a timer is armed for the timeout |
| signal_pending(current) | Signal check | Tests the TIF_SIGPENDING flag; true means a signal is pending for this task |
| __remove_wait_queue() | Cleanup | Always runs, even on timeout or EINTR, so no dangling entry is left behind |
| set_current_state(TASK_RUNNING) | Restore state | Required before calling ep_send_events; marks the process as runnable again |
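
The re-check logic of the loop can be exercised without a kernel. The following user-space model is a sketch under stated assumptions: every name in it (struct model, wake_reason, model_sleep, model_ep_poll) is hypothetical, and the "sleep" simply applies the next entry of a scripted list of wakeup reasons instead of yielding the CPU.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* What happened while the modelled task was "asleep". */
enum wake_reason { WAKE_SPURIOUS, WAKE_EVENT, WAKE_SIGNAL, WAKE_TIMEOUT };

struct model {
    int  ready;           /* stands in for !list_empty(&ep->rdllist)   */
    bool signal_pending;  /* stands in for signal_pending(current)     */
    long jtimeout;        /* remaining timeout in ticks                */
    int  sleeps;          /* number of modelled schedule_timeout calls */
};

/* Stand-in for schedule_timeout(): applies the next scripted wakeup
   and returns the remaining time (0 means the timer expired). */
static long model_sleep(struct model *m, enum wake_reason why)
{
    m->sleeps++;
    switch (why) {
    case WAKE_EVENT:    m->ready++;               break;
    case WAKE_SIGNAL:   m->signal_pending = true; break;
    case WAKE_TIMEOUT:  return 0;
    case WAKE_SPURIOUS: break;
    }
    return m->jtimeout > 1 ? m->jtimeout - 1 : 1;
}

/* Same shape as the ep_poll() loop: re-check every condition after each
   wakeup. Returns the ready count, 0 on timeout, or -EINTR. */
static int model_ep_poll(struct model *m, const enum wake_reason *script, int n)
{
    int i = 0;
    for (;;) {
        if (m->ready || !m->jtimeout)
            return m->ready;
        if (m->signal_pending)
            return -EINTR;
        assert(i < n);    /* the script must eventually end the wait */
        m->jtimeout = model_sleep(m, script[i++]);
    }
}
```

A script of two spurious wakeups followed by an event makes the model sleep three times before returning 1, which is the behaviour the loop exists to guarantee: a wakeup alone is never treated as proof that an event arrived.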

§ 38.3 — try_to_wake_up: Adding to the Run Queue

When a packet arrives, ep_poll_callback() calls wake_up(&ep->wq). This chains through the wait queue machinery to try_to_wake_up(), the single kernel function responsible for moving a sleeping task back to TASK_RUNNING and placing it on a CPU's CFS run queue. On SMP systems, if the chosen run queue belongs to a different CPU, the enqueue may be handed to that CPU with an IPI (inter-processor interrupt).

[Figure: try_to_wake_up flowchart]

SMP Cross-CPU Wakeup

Cache affinity matters: try_to_wake_up() prefers to re-enqueue a task on the CPU it last ran on, keeping its L1/L2 cache warm. When the wakeup fires on a different CPU, the waking CPU can either take the remote run queue's lock and enqueue the task itself, or queue the task on the target CPU's wake list and send an IPI so the target performs the enqueue without its run queue being touched remotely.

try_to_wake_up — Source

/* kernel/sched/core.c — try_to_wake_up() (simplified) */
int try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
{
    int cpu, success = 0;
    unsigned long flags;

    raw_spin_lock_irqsave(&p->pi_lock, flags);

    /*
     * Check the task is actually in the expected sleep state.
     * Guards against double-wakeup races on SMP.
     */
    if (!(p->__state & state))
        goto out;

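    p->__state = TASK_RUNNING;   /* mark awake under pi_lock (simplified: mainline
                                    stores TASK_WAKING here and sets TASK_RUNNING
                                    once the task has been enqueued) */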

    /*
     * select_task_rq: find best CPU for cache affinity.
     * Usually returns p->cpu (last-run CPU) to keep cache warm.
     */
    cpu = select_task_rq(p, p->wake_cpu, SD_BALANCE_WAKE, wake_flags);

    ttwu_queue(p, cpu, wake_flags);
    /* Same CPU → ttwu_do_activate(rq, p) directly.
       Diff CPU → either lock the remote rq and activate there, or put the
                  task on the target's wake list and send an IPI so the
                  target CPU runs ttwu_do_activate. */

    success = 1;
out:
    raw_spin_unlock_irqrestore(&p->pi_lock, flags);
    return success;
}

/* ttwu_do_activate: final enqueue into CFS run queue */
static void ttwu_do_activate(struct rq *rq, struct task_struct *p, int wake_flags)
{
    activate_task(rq, p, ENQUEUE_WAKEUP);   /* enqueue_task_fair() */
    wakeup_preempt(rq, p, wake_flags);       /* may trigger preemption */
}
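
The state check at the top of try_to_wake_up() is what makes a second, racing wakeup harmless. A minimal model of that guard follows; the T_* values mirror the state table, and struct task and model_try_to_wake_up() are stand-ins invented for the example rather than kernel definitions.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative state bits and wake masks. */
#define T_RUNNING          0x0u
#define T_INTERRUPTIBLE    0x1u
#define T_UNINTERRUPTIBLE  0x2u
#define T_NORMAL           (T_INTERRUPTIBLE | T_UNINTERRUPTIBLE)

struct task {
    unsigned state;
    int on_rq;      /* 1 once the task has been enqueued */
};

/* Model of the guard: wake only if the task sleeps in a state that the
   caller's mask targets. Returns 1 if this call performed the wakeup. */
static int model_try_to_wake_up(struct task *p, unsigned mask)
{
    assert(p != NULL);
    if (!(p->state & mask))
        return 0;          /* already running, or asleep in another state */
    p->state = T_RUNNING;
    p->on_rq = 1;          /* stands in for ttwu_queue() */
    return 1;
}
```

With a task in T_INTERRUPTIBLE, the first call with T_NORMAL returns 1 and a repeated call returns 0. A task in T_UNINTERRUPTIBLE is left alone by a T_INTERRUPTIBLE-only mask, which is why a signal wakeup does not disturb a task waiting on disk I/O.
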

| Step | Function | Action |
| --- | --- | --- |
| 1 | wake_up(&ep->wq) | Triggers wait queue traversal from ep_poll_callback |
| 2 | __wake_up_common | Iterates ep->wq, calls each entry's func callback |
| 3 | default_wake_function | Entry func for epoll_wait; calls try_to_wake_up |
| 4 | try_to_wake_up | Acquires pi_lock, verifies sleep state, marks the task awake (TASK_RUNNING in the simplified listing) |
| 5 | ttwu_queue | Same CPU: direct enqueue. Different CPU: enqueue under the remote rq lock, or send an IPI |
| 6 | ttwu_do_activate | Calls enqueue_task_fair(), which adds the task to the CFS RB-tree |
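
Step 2 is where WQ_FLAG_EXCLUSIVE takes effect: the traversal stops as soon as the requested number of exclusive waiters has actually been woken. The sketch below models that rule with an array instead of the kernel's linked list. WQ_EXCLUSIVE, struct waiter, wake_func() and model_wake_up_common() are simplified stand-ins, not the kernel definitions.

```c
#include <assert.h>
#include <stddef.h>

#define WQ_EXCLUSIVE 0x01u   /* stands in for WQ_FLAG_EXCLUSIVE */

struct waiter {
    unsigned flags;
    int asleep;   /* 1 while the waiter is sleeping */
    int woken;    /* set by the callback            */
};

/* Stand-in for default_wake_function(): returns 1 only if it woke a sleeper. */
static int wake_func(struct waiter *w)
{
    if (!w->asleep)
        return 0;
    w->asleep = 0;
    w->woken = 1;
    return 1;
}

/* Walk the queue in order and stop once nr_exclusive exclusive waiters
   have been woken. Returns the total number of waiters woken. */
static int model_wake_up_common(struct waiter *q, size_t n, int nr_exclusive)
{
    int total = 0;
    assert(nr_exclusive > 0);
    for (size_t i = 0; i < n; i++) {
        int ret = wake_func(&q[i]);
        total += ret;
        if (ret && (q[i].flags & WQ_EXCLUSIVE) && !--nr_exclusive)
            break;
    }
    return total;
}
```

With three exclusive sleepers and nr_exclusive set to 1, each call wakes exactly one of them, so one incoming event does not stampede every thread blocked in epoll_wait on the same epoll instance. Non-exclusive waiters ahead of the first exclusive one are all woken.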

§ 38.4 — CFS Scheduler: Run Queue → Running

Once ttwu_do_activate() enqueues the woken task, CFS (the Completely Fair Scheduler) decides when it actually gets a CPU. CFS keeps all runnable tasks in a red-black tree ordered by vruntime (virtual runtime: CPU time scaled by priority). A task that slept for a while has its vruntime raised to about min_vruntime on wakeup, which places it at or near the leftmost node, so it is usually among the next tasks to be picked.

[Figure: CFS run queue, vruntime RB-tree]

Context Switch Timeline

pick_next_task_fair() takes the leftmost node, then context_switch() saves the outgoing task's register state and restores the incoming one. The woken task resumes inside schedule(), which returns through schedule_timeout() into the ep_poll() loop, exactly where it went to sleep.
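
The "resume where it went to sleep" behaviour has a user-space analogue in the POSIX ucontext API, where swapcontext() saves one register context and restores another. This is an analogy only, assuming a glibc system that provides <ucontext.h>; the kernel's switch_to() is architecture-specific assembly, and all function names below are invented for the example.

```c
#include <assert.h>
#include <ucontext.h>

static ucontext_t sched_ctx, task_ctx;
static char task_stack[64 * 1024];
static int trace[8];
static int ntrace;

static void record(int step)
{
    assert(ntrace < 8);
    trace[ntrace++] = step;
}

/* The "task": runs, gives up the CPU, and later continues where it left off. */
static void task_body(void)
{
    record(1);
    swapcontext(&task_ctx, &sched_ctx);   /* plays the role of schedule_timeout() */
    record(3);                            /* execution resumes here */
}

/* The "scheduler": switches to the task twice and logs the interleaving. */
static const int *run_context_switch_demo(void)
{
    ntrace = 0;
    getcontext(&task_ctx);
    task_ctx.uc_stack.ss_sp = task_stack;
    task_ctx.uc_stack.ss_size = sizeof task_stack;
    task_ctx.uc_link = &sched_ctx;        /* continue here when task_body returns */
    makecontext(&task_ctx, task_body, 0);

    swapcontext(&sched_ctx, &task_ctx);   /* first switch: task runs up to its yield */
    record(2);
    swapcontext(&sched_ctx, &task_ctx);   /* second switch: task resumes mid-function */
    record(4);
    return trace;
}
```

The recorded order is 1, 2, 3, 4: the task's second half runs only after the scheduler switches back to it, and it starts at the statement following its own swapcontext() call rather than at the top of the function.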

[Figure: Full round-trip, NIC → epoll_wait returns]

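| CFS concept | Detail |
| --- | --- |
| vruntime | CPU time weighted by task priority (nice). Lower = less weighted CPU time received so far = scheduled sooner. |
| min_vruntime | Monotonically increasing floor that tracks the smallest vruntime on the run queue. Waking tasks are clamped to about this value, so a long sleeper cannot come back with a huge credit and starve the others. |
| leftmost node | The RB-tree node with the smallest vruntime. pick_next_task_fair() reads it in O(1) from the cached leftmost pointer (cfs_rq->tasks_timeline.rb_leftmost; cfs_rq->rb_leftmost in older kernels). |
| enqueue_task_fair() | Reached from ttwu_do_activate via activate_task(). Sets p->vruntime to roughly max(p->vruntime, min_vruntime) (mainline CFS also grants a small sleeper credit below min_vruntime), then inserts into the RB-tree. |
| switch_mm() | Swaps the CR3 register (page table base) when switching between different processes. Skipped for kernel threads (no mm). |
| switch_to() | Saves prev's rsp/rbp/callee-saved registers to its stack and restores next's. The CPU now runs next's code. |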
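
The interaction of weight, vruntime and "pick the smallest" can be reproduced with a few lines of arithmetic. The sketch below makes several simplifying assumptions: a linear scan replaces the RB-tree, every slice is a fixed 1 ms, the weight of 512 is a hypothetical half-weight task (1024 is the nice-0 weight), and place_on_wakeup() omits the sleeper credit that mainline CFS applies.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NICE_0_WEIGHT 1024u   /* load weight of a nice-0 task */

struct entity {
    uint64_t vruntime;   /* weighted runtime in ns               */
    unsigned weight;     /* larger weight = larger CPU share     */
    unsigned ticks_run;  /* how often this entity was picked     */
};

/* Charge delta_ns of real CPU time, scaled inversely by weight. */
static void charge(struct entity *e, uint64_t delta_ns)
{
    e->vruntime += delta_ns * NICE_0_WEIGHT / e->weight;
}

/* Linear scan for the smallest vruntime; the kernel reads the cached
   leftmost RB-tree node instead. Ties go to the lower index. */
static struct entity *pick_leftmost(struct entity *rq, size_t n)
{
    struct entity *best = &rq[0];
    for (size_t i = 1; i < n; i++)
        if (rq[i].vruntime < best->vruntime)
            best = &rq[i];
    return best;
}

/* Run the queue for `ticks` slices of 1 ms each. */
static void simulate(struct entity *rq, size_t n, unsigned ticks)
{
    assert(n > 0);
    for (unsigned t = 0; t < ticks; t++) {
        struct entity *e = pick_leftmost(rq, n);
        charge(e, 1000000);
        e->ticks_run++;
    }
}

/* Simplified wakeup placement: never start below min_vruntime. */
static uint64_t place_on_wakeup(uint64_t task_vruntime, uint64_t min_vruntime)
{
    return task_vruntime > min_vruntime ? task_vruntime : min_vruntime;
}
```

Running two entities with weights 1024 and 512 for 300 ticks gives them 200 and 100 slices respectively, a 2:1 split that matches the weight ratio. place_on_wakeup(5, 100) yields 100, so a task that slept through most of the queue's history rejoins at the front without an unbounded head start.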

§ 38.5 — Signal Interruption of epoll_wait

Because epoll_wait sleeps in TASK_INTERRUPTIBLE, any unblocked signal delivered to the sleeping process triggers the same try_to_wake_up path as an incoming I/O event. The difference is discovered only after the process wakes: the ep_poll loop checks signal_pending(current) and breaks with -EINTR instead of dispatching events.
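
The EINTR result is easy to observe from user space. The sketch below sleeps in epoll_wait() on an empty epoll set and lets a one-shot timer deliver SIGALRM after 50 ms. It takes the sigaction flags as a parameter because, per signal(7), epoll_wait is among the calls that are never restarted after a handler runs, whether or not SA_RESTART is set. The function names are made up for the example.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <string.h>
#include <sys/epoll.h>
#include <sys/time.h>
#include <unistd.h>

static volatile sig_atomic_t handler_ran;

static void on_alarm(int sig)
{
    (void)sig;
    handler_ran = 1;
}

/* Sleep in epoll_wait(), let SIGALRM arrive after 50 ms, and return the
   errno the call failed with (0 if it did not fail, -1 if the handler
   never ran). */
static int errno_after_signal(int sa_flags)
{
    struct sigaction sa, old;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_alarm;
    sa.sa_flags = sa_flags;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGALRM, &sa, &old);

    int epfd = epoll_create1(0);
    assert(epfd >= 0);

    struct itimerval once = { .it_value = { .tv_sec = 0, .tv_usec = 50000 } };
    handler_ran = 0;
    setitimer(ITIMER_REAL, &once, NULL);

    struct epoll_event ev;
    int n = epoll_wait(epfd, &ev, 1, 5000);   /* 5 s cap so the demo cannot hang */
    int err = (n < 0) ? errno : 0;

    close(epfd);
    sigaction(SIGALRM, &old, NULL);
    return handler_ran ? err : -1;
}
```

Both errno_after_signal(0) and errno_after_signal(SA_RESTART) are expected to return EINTR, so an application needs its own retry loop around epoll_wait in either case.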

[Figure: EINTR wakeup path]

epoll_pwait atomicity: a naïve approach keeps a signal blocked, calls sigprocmask() to unblock it, then calls epoll_wait(). A signal that arrives in the window between the two calls is handled before the wait begins, so epoll_wait() then sleeps without ever reporting it. epoll_pwait swaps the signal mask and enters the kernel wait atomically in a single syscall, closing that TOCTOU race.
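/* epoll_pwait: atomic signal mask + wait
   (fragment: epfd, events, MAX and timeout come from the surrounding program) */
#include <errno.h>
#include <signal.h>
#include <sys/epoll.h>

sigset_t mask;
sigemptyset(&mask);
sigaddset(&mask, SIGINT);   /* SIGINT stays blocked for the duration of the wait */

/* atomic: install mask → wait → restore the previous mask;
   closes the window between a separate sigprocmask() and epoll_wait() */
int n = epoll_pwait(epfd, events, MAX, timeout, &mask);
if (n < 0 && errno == EINTR) {
    /* interrupted by a signal that is not in the mask; epoll_wait and
       epoll_pwait are never auto-restarted, even with SA_RESTART,
       so retry manually */
}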

| Scenario | epoll_wait behavior | How to handle |
| --- | --- | --- |
| Signal, no SA_RESTART | Returns -1, errno = EINTR | App checks errno and re-calls manually |
| Signal, SA_RESTART | Still returns -1, errno = EINTR; epoll_wait is never auto-restarted (see signal(7)) | Same manual retry loop |
| epoll_pwait() + masked signal | Signal stays pending; process stays asleep | Prevents the race between changing the mask and blocking |
| epoll_pwait2() (Linux 5.11+) | Same as epoll_pwait, with a struct timespec (nanosecond) timeout | Use for sub-millisecond timeout precision |
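
The usual way to apply epoll_pwait is to keep the signal blocked during normal execution and let it through only for the duration of the wait. The sketch below shows both outcomes on an empty epoll set, under the assumption that SIGUSR1 is free for the demo: with SIGUSR1 in the wait mask a pending signal is not delivered and the call times out, and with SIGUSR1 removed from the mask the pending signal is delivered and the call fails with EINTR. struct pwait_result and demo_epoll_pwait() are names invented for this example.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <string.h>
#include <sys/epoll.h>
#include <unistd.h>

static volatile sig_atomic_t got_usr1;

static void on_usr1(int sig)
{
    (void)sig;
    got_usr1 = 1;
}

struct pwait_result {
    int masked_ret;      /* wait with SIGUSR1 blocked during the call */
    int handler_before;  /* did the handler run during that wait?     */
    int unmasked_ret;    /* wait with SIGUSR1 allowed during the call */
    int unmasked_errno;
    int handler_after;
};

static struct pwait_result demo_epoll_pwait(void)
{
    struct pwait_result r;
    struct sigaction sa, old_sa;
    sigset_t block, old_mask, wait_mask;
    struct epoll_event ev;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_usr1;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGUSR1, &sa, &old_sa);

    /* Keep SIGUSR1 blocked during normal execution. */
    sigemptyset(&block);
    sigaddset(&block, SIGUSR1);
    sigprocmask(SIG_BLOCK, &block, &old_mask);

    int epfd = epoll_create1(0);
    assert(epfd >= 0);
    got_usr1 = 0;
    raise(SIGUSR1);                        /* now pending, not delivered */

    /* Wait mask still contains SIGUSR1: the wait ignores it and times out. */
    wait_mask = old_mask;
    sigaddset(&wait_mask, SIGUSR1);
    r.masked_ret = epoll_pwait(epfd, &ev, 1, 50, &wait_mask);
    r.handler_before = got_usr1;

    /* Wait mask without SIGUSR1: unblocking and waiting are one syscall,
       so the pending signal is delivered and the call fails with EINTR. */
    sigdelset(&wait_mask, SIGUSR1);
    r.unmasked_ret = epoll_pwait(epfd, &ev, 1, 5000, &wait_mask);
    r.unmasked_errno = errno;
    r.handler_after = got_usr1;

    close(epfd);
    sigprocmask(SIG_SETMASK, &old_mask, NULL);
    sigaction(SIGUSR1, &old_sa, NULL);
    return r;
}
```

Because the mask change and the wait cannot be separated, there is no point at which the signal could be handled without the wait noticing it, which is the window that a separate sigprocmask() followed by epoll_wait() leaves open.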

Interview Prep — Must-Know Questions

Q: TASK_INTERRUPTIBLE vs TASK_UNINTERRUPTIBLE: which does epoll_wait use?
A: TASK_INTERRUPTIBLE. It can be woken by both an I/O event and a signal. epoll_wait must be interruptible so that signals such as Ctrl-C (SIGINT) can be handled while the process waits. Disk I/O uses TASK_UNINTERRUPTIBLE because the operation cannot be safely aborted mid-flight.

Q: Full path from ep_poll_callback → process running?
A: ep_poll_callback → epitem added to ep->rdllist → wake_up(&ep->wq) → __wake_up_common → default_wake_function → try_to_wake_up → p->__state = TASK_RUNNING → ttwu_queue → ttwu_do_activate → CFS enqueue → pick_next_task_fair → context_switch → resumes at schedule_timeout() return.

Q: What does try_to_wake_up do? Which CPU's run queue?
A: Locks p->pi_lock, verifies p->__state & TASK_NORMAL, marks the task awake, calls select_task_rq() for cache affinity (prefers the CPU the task last ran on), then ttwu_queue. Same CPU → ttwu_do_activate directly. Different CPU → enqueue under the remote run queue's lock, or an IPI to the target CPU so it enqueues the task.

Q: How does CFS prioritize a newly woken process?
A: enqueue_task_fair() sets p->vruntime to roughly max(p->vruntime, min_vruntime), placing the task at or near the leftmost node of the RB-tree. pick_next_task_fair() returns the leftmost node in O(1) from the cached leftmost pointer on the cfs_rq, so the woken process is usually selected at one of the next schedule() calls.

Q: Why does ep_poll() re-check rdllist after schedule_timeout returns?
A: Spurious wakeups: the kernel guarantees nothing about why schedule_timeout returned. The loop must verify that list_empty(&ep->rdllist) is false before reporting events. It also checks signal_pending and the remaining timeout.

Q: How does a signal interrupt epoll_wait? What errno?
A: signal_wake_up(p, 0) ends in try_to_wake_up, the same path as an I/O wakeup. After the process resumes in ep_poll, signal_pending(current) is true → the loop breaks with res = -EINTR → epoll_wait returns -1, errno = EINTR. This happens even if the handler was installed with SA_RESTART.

Q: epoll_wait vs epoll_pwait: what race does pwait fix?
A: epoll_pwait atomically installs a temporary signal mask before entering the kernel wait. Without it, a signal arriving between sigprocmask() and epoll_wait() is handled before the wait starts, and the wait then sleeps without noticing it (TOCTOU). epoll_pwait closes that window in a single syscall. epoll_pwait2 (Linux 5.11) adds nanosecond timeout precision.