Tech Notes

§ 38.1 — Linux Process State Machine

Every Linux process is always in exactly one scheduler state stored in task_struct.__state. The state controls two things: whether the scheduler will put the process on a CPU, and whether a signal can interrupt its wait. epoll_wait uses TASK_INTERRUPTIBLE, the sleeping state in which any unblocked signal can pull a process back out of sleep (TASK_KILLABLE allows this only for fatal signals).
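The same state is visible from user space as the third field of /proc/<pid>/stat (R running, S interruptible sleep, D uninterruptible sleep, T stopped, Z zombie). A minimal sketch of extracting that letter follows. The function name parse_proc_stat_state is made up for this illustration, and the parser assumes the documented "pid (comm) state ..." layout.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: return the state letter from one line of
 * /proc/<pid>/stat, or '?' if the line does not have the expected shape.
 * The comm field may itself contain spaces or parentheses, so we look
 * for the LAST ')' instead of splitting on whitespace.
 */
char parse_proc_stat_state(const char *line)
{
    const char *rparen = strrchr(line, ')');

    if (rparen == NULL || rparen[1] != ' ' || rparen[2] == '\0')
        return '?';
    return rparen[2];
}
```

A server blocked in epoll_wait would normally show S here, while a task stuck on disk I/O shows D.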

State Transition Diagram

State	Value	Used by epoll_wait?	Signal can wake?
`TASK_RUNNING`	0	— (on CPU or in run queue)	—
`TASK_INTERRUPTIBLE`	1	YES	YES → returns EINTR
`TASK_UNINTERRUPTIBLE`	2	No	No
`TASK_KILLABLE`	0x0100 | 2 (TASK_WAKEKILL | TASK_UNINTERRUPTIBLE)	No	Fatal signals only (SIGKILL)
`__TASK_STOPPED`	4	No	SIGCONT
`EXIT_ZOMBIE`	32	No	No
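The "Signal can wake?" column for the three sleeping states can be modeled as a small bit test. This is a userspace sketch, not kernel code: the ST_* names and signal_can_wake() are invented here, and the numeric values are assumed to match include/linux/sched.h in recent kernels.

```c
#include <stdbool.h>

/* Userspace copies of the kernel's state bits, for illustration only.
   Values are assumed from include/linux/sched.h in recent kernels. */
enum {
    ST_INTERRUPTIBLE   = 0x0001,
    ST_UNINTERRUPTIBLE = 0x0002,
    ST_WAKEKILL        = 0x0100,
    ST_KILLABLE        = ST_WAKEKILL | ST_UNINTERRUPTIBLE,
};

/* Hypothetical model: can a signal wake a task sleeping in this state? */
bool signal_can_wake(unsigned int state, bool signal_is_fatal)
{
    if (state & ST_INTERRUPTIBLE)
        return true;                /* any deliverable signal wakes it */
    if (state & ST_WAKEKILL)
        return signal_is_fatal;     /* only fatal signals such as SIGKILL */
    return false;                   /* plain uninterruptible sleep */
}
```

For example, signal_can_wake(ST_KILLABLE, false) is false while signal_can_wake(ST_KILLABLE, true) is true, which matches the TASK_KILLABLE row.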

Why epoll_wait Uses TASK_INTERRUPTIBLE

/* Two sleeping states — very different signal behavior */

/* TASK_INTERRUPTIBLE: network I/O, pipe reads, epoll_wait */
set_current_state(TASK_INTERRUPTIBLE);
schedule();        /* voluntarily yields CPU */
/* wakeup path 1: event arrives → ep_poll_callback → back here */
/* wakeup path 2: signal delivered → signal_pending() → EINTR  */

/* TASK_UNINTERRUPTIBLE: disk I/O (e.g. waiting for journal commit) */
set_current_state(TASK_UNINTERRUPTIBLE);
schedule();
/* ONLY wakeup path: I/O completes → back here
   Signal arrives? It stays pending and is handled only after the I/O finishes.
   Reason: aborting a disk write mid-flight risks filesystem corruption */

/* Rule of thumb:
   Network   → TASK_INTERRUPTIBLE  (reversible, no data hazard)
   Block dev → TASK_UNINTERRUPTIBLE (irreversible mid-op state) */

§ 38.2 — Going to Sleep Inside epoll_wait

When epoll_wait() finds no ready events it takes the slow path inside ep_poll(): register a wait entry on ep->wq, set state to TASK_INTERRUPTIBLE, then call schedule_timeout() — the point where the process physically leaves the CPU. The surrounding loop guards against spurious wakeups and signal delivery.
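Both outcomes can be observed from user space with an eventfd. The sketch below is an illustration only: wait_once() is a made-up helper, and the eventfd stands in for a socket. With make_ready set, the descriptor is already readable and epoll_wait returns at once; otherwise the caller sleeps until the timeout expires.

```c
#include <stdint.h>
#include <sys/epoll.h>
#include <sys/eventfd.h>
#include <unistd.h>

/* Hypothetical helper: returns epoll_wait's result, i.e. 1 if the eventfd
   was ready, 0 if the wait timed out, -1 on any error. */
int wait_once(int make_ready, int timeout_ms)
{
    struct epoll_event ev = { .events = EPOLLIN };
    struct epoll_event got;
    uint64_t one = 1;
    int result = -1;
    int epfd = epoll_create1(0);
    int efd = eventfd(0, EFD_NONBLOCK);

    if (epfd < 0 || efd < 0)
        goto out;

    ev.data.fd = efd;
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, efd, &ev) < 0)
        goto out;

    /* Writing to the eventfd makes it readable, so it is reported as ready. */
    if (make_ready && write(efd, &one, sizeof one) != (ssize_t)sizeof one)
        goto out;

    /* Ready event: returns immediately. No event: sleeps up to timeout_ms. */
    result = epoll_wait(epfd, &got, 1, timeout_ms);
out:
    if (efd >= 0)
        close(efd);
    if (epfd >= 0)
        close(epfd);
    return result;
}
```

Calling wait_once(1, 0) exercises the fast path, and wait_once(0, 10) takes the slow path and gives up after roughly 10 ms.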

ep_poll() Slow Path

CPU Timeline — Sleep, Wakeup, Reschedule

The process yields the CPU the moment schedule_timeout() is called. Another process runs until the NIC interrupt fires, which chains through the TCP stack into ep_poll_callback(), which re-enqueues the sleeping process. The scheduler then picks it up at its next opportunity.

ep_poll() Slow Path — Source

/* fs/eventpoll.c — ep_poll() slow path (simplified) */
static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
                   int maxevents, long timeout)
{
    wait_queue_entry_t wait;
    long jtimeout = timeout_to_jiffies(timeout);
    int res = 0;

    spin_lock_irq(&ep->lock);
    if (!list_empty(&ep->rdllist)) {
        spin_unlock_irq(&ep->lock);   /* do not leave with the lock held */
        goto send_events;             /* fast path: events already queued */
    }

    /* Add ourselves to epoll's wait queue */
    init_waitqueue_entry(&wait, current); /* func = default_wake_function */
    __add_wait_queue_exclusive(&ep->wq, &wait);
    spin_unlock_irq(&ep->lock);

    for (;;) {
        /*
         * Set state BEFORE checking conditions.
         * set_current_state() implies a full memory barrier: the new state
         * is visible to other CPUs before we test rdllist, so a concurrent
         * wakeup cannot be lost.
         */
        set_current_state(TASK_INTERRUPTIBLE);

        if (!list_empty(&ep->rdllist) || !jtimeout)
            break;                 /* events arrived or timeout expired */
        if (signal_pending(current)) {
            res = -EINTR;          /* signal interrupted the wait */
            break;
        }
        jtimeout = schedule_timeout(jtimeout); /* yield CPU — sleep here */
        /* returns when woken or timeout; loop re-checks all conditions */
    }

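    spin_lock_irq(&ep->lock);
    __remove_wait_queue(&ep->wq, &wait);   /* caller must hold ep->lock */
    spin_unlock_irq(&ep->lock);
    set_current_state(TASK_RUNNING);

    if (res)
        return res;                /* -EINTR: nothing to copy out */
send_events:
    /* returns the number of events copied to user space (0 on timeout) */
    return ep_send_events(ep, events, maxevents);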
}

Call	Purpose	Detail
`set_current_state(TASK_INTERRUPTIBLE)`	State + barrier	Stores the state with a full memory barrier, so the state is visible before the rdllist check; prevents the lost-wakeup race
`__add_wait_queue_exclusive()`	Register on ep->wq	Uses WQ_FLAG_EXCLUSIVE — only one epoll_wait caller woken per event
`schedule_timeout(jtimeout)`	Yield CPU	Process leaves CPU here; scheduler runs next task; timer set for timeout
`signal_pending(current)`	Signal check	Tests the TIF_SIGPENDING flag; true means a signal is pending, whether it arrived before or during the sleep
`__remove_wait_queue()`	Cleanup	Always runs — even on timeout or EINTR — prevents dangling entry
`set_current_state(TASK_RUNNING)`	Restore state	Required before using ep_send_events; marks process as runnable again
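The effect of WQ_FLAG_EXCLUSIVE can be sketched as a toy model of the wait-queue walk. This is not the kernel's code: struct waiter and model_wake_up() are invented for illustration, a plain array stands in for the linked list, and every wake callback is assumed to succeed.

```c
#define WQ_EXCLUSIVE 0x01   /* stands in for WQ_FLAG_EXCLUSIVE */

struct waiter {
    unsigned int flags;     /* WQ_EXCLUSIVE or 0 */
    int woken;              /* set to 1 once the entry has been woken */
};

/*
 * Hypothetical model: wake entries in queue order and stop as soon as
 * nr_exclusive exclusive waiters have been woken. Non-exclusive waiters
 * met before that point are all woken. Returns the number woken.
 */
int model_wake_up(struct waiter *queue, int count, int nr_exclusive)
{
    int woken = 0;

    for (int i = 0; i < count; i++) {
        queue[i].woken = 1;
        woken++;
        if ((queue[i].flags & WQ_EXCLUSIVE) && --nr_exclusive == 0)
            break;
    }
    return woken;
}
```

With three exclusive waiters and nr_exclusive set to 1, only the first entry is woken, which is how several threads blocked in epoll_wait on the same epoll instance avoid a thundering herd.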

§ 38.3 — try_to_wake_up: Adding to the Run Queue

When a packet arrives, ep_poll_callback() calls wake_up(&ep->wq). This chains through the wait queue machinery to try_to_wake_up(), the core kernel function responsible for transitioning a sleeping task back to TASK_RUNNING and placing it on a CPU's CFS run queue. On SMP systems, if the task is to be enqueued on a CPU other than the one running the wakeup, the kernel may send an IPI (inter-processor interrupt) so that the target CPU performs the enqueue itself; when the two CPUs share a cache, the waking CPU typically locks the remote run queue and enqueues the task directly.

try_to_wake_up Flowchart

SMP Cross-CPU Wakeup

Cache affinity matters: try_to_wake_up() prefers to re-enqueue a process on the CPU it last ran on, keeping its L1/L2 cache warm. When the wakeup fires on a different CPU, an IPI is the mechanism to hand off the enqueue operation without touching another CPU's run queue directly.

try_to_wake_up — Source

/* kernel/sched/core.c — try_to_wake_up() (simplified) */
int try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
{
    int cpu, success = 0;
    unsigned long flags;

    raw_spin_lock_irqsave(&p->pi_lock, flags);

    /*
     * Check the task is actually in the expected sleep state.
     * Guards against double-wakeup races on SMP.
     */
    if (!(p->__state & state))
        goto out;

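    /* Simplified: the real kernel sets TASK_WAKING here and switches to
       TASK_RUNNING only once the task has been enqueued. */
    p->__state = TASK_RUNNING;   /* mark awake under pi_lock */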

    /*
     * select_task_rq: find best CPU for cache affinity.
     * Usually returns p->cpu (last-run CPU) to keep cache warm.
     */
    cpu = select_task_rq(p, p->wake_cpu, SD_BALANCE_WAKE, wake_flags);

    ttwu_queue(p, cpu, wake_flags);
    /* Same CPU → ttwu_do_activate(rq, p) directly.
       Diff CPU → either lock the remote rq and activate there, or queue the
       task and send an IPI so the target CPU runs ttwu_do_activate. */

    success = 1;
out:
    raw_spin_unlock_irqrestore(&p->pi_lock, flags);
    return success;
}

/* ttwu_do_activate: final enqueue into CFS run queue */
static void ttwu_do_activate(struct rq *rq, struct task_struct *p, int wake_flags)
{
    activate_task(rq, p, ENQUEUE_WAKEUP);   /* enqueue_task_fair() */
    wakeup_preempt(rq, p, wake_flags);       /* may trigger preemption */
}

Step	Function	Action
1	`wake_up(&ep->wq)`	Triggers wait queue traversal from ep_poll_callback
2	`__wake_up_common`	Iterates ep->wq, calls each entry's func callback
3	`default_wake_function`	Entry func for epoll_wait; calls try_to_wake_up
4	`try_to_wake_up`	Acquires pi_lock, verifies sleep state, sets TASK_RUNNING
5	`ttwu_queue`	Same CPU: direct enqueue. Different CPU: remote enqueue under the rq lock, or IPI to the target CPU
6	`ttwu_do_activate`	Calls enqueue_task_fair() — adds task to CFS RB-tree
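The state-mask check in step 4 is what makes a second wakeup harmless. A userspace toy model of that check is shown below; the M_* constants, struct model_task, and model_try_to_wake_up() are invented for illustration, and locking is left out entirely.

```c
/* Userspace stand-ins for the kernel state bits (illustration only). */
enum {
    M_RUNNING         = 0x0,
    M_INTERRUPTIBLE   = 0x1,
    M_UNINTERRUPTIBLE = 0x2,
    M_NORMAL          = M_INTERRUPTIBLE | M_UNINTERRUPTIBLE,
};

struct model_task {
    unsigned int state;   /* one of the M_* values above */
    int on_rq;            /* 1 once the task sits on a run queue */
};

/*
 * Hypothetical model of the wakeup decision: wake the task only if it is
 * sleeping in one of the states named by state_mask. Returns 1 if this
 * call performed the wakeup, 0 if there was nothing to do.
 */
int model_try_to_wake_up(struct model_task *p, unsigned int state_mask)
{
    if (!(p->state & state_mask))
        return 0;         /* already running, or sleeping in another state */

    p->state = M_RUNNING;
    p->on_rq = 1;
    return 1;
}
```

An I/O wakeup passes a mask covering both sleep states, as M_NORMAL does here, whereas a signal wakeup passes only the interruptible bit, so a task in uninterruptible sleep is left alone.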

§ 38.4 — CFS Scheduler: Run Queue → Running

Once ttwu_do_activate() enqueues the woken process, the CFS (Completely Fair Scheduler) decides when it actually gets a CPU. CFS keeps all runnable tasks in a red-black tree ordered by vruntime (virtual runtime: CPU time scaled by priority). A newly woken process has its vruntime raised to at least about min_vruntime, which places it at or near the leftmost node, so it is usually among the next tasks to be picked.

CFS Run Queue — vruntime RB-Tree

Context Switch Timeline

pick_next_task_fair() takes the leftmost node, then context_switch() saves the outgoing process's register state and restores the incoming one. The woken process resumes inside schedule(), at the point where it was switched out, and then returns from schedule_timeout() into the ep_poll() loop.

Full Round-Trip: NIC → epoll_wait Returns

CFS concept	Detail
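`vruntime`	CPU time weighted by task priority (nice). Lower = has received less weighted CPU time = scheduled sooner.
`min_vruntime`	Monotonically increasing floor that tracks the smallest vruntime among runnable tasks on the run queue. Newly woken tasks are clamped to roughly this value so a long sleeper cannot return with a tiny vruntime and starve the other tasks.
`leftmost node`	The RB-tree node with smallest vruntime. pick_next_task_fair() finds it in O(1) because the tree caches it (cfs_rq->tasks_timeline.rb_leftmost).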
`enqueue_task_fair()`	Called by ttwu_do_activate. Sets p->vruntime = max(p->vruntime, min_vruntime), inserts into RB-tree.
`switch_mm()`	Swaps CR3 register (page table base) when switching between different processes. Skipped for kernel threads (no mm).
`switch_to()`	Saves prev's rsp/rbp/callee-saved registers to its stack. Restores next's. CPU now runs next's code.
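The vruntime arithmetic in the table can be sketched in a few lines. This is a simplified userspace model under stated assumptions: the function names are invented, 1024 is taken as the load weight of a nice-0 task, a linear scan stands in for the RB-tree, and counter wraparound is ignored.

```c
#include <stddef.h>
#include <stdint.h>

#define NICE_0_WEIGHT 1024u   /* assumed load weight of a nice-0 task */

/* Virtual time charged for delta_exec_ns of real CPU time: heavier
   (higher-priority) tasks accumulate vruntime more slowly. */
uint64_t vruntime_delta(uint64_t delta_exec_ns, uint32_t weight)
{
    return delta_exec_ns * NICE_0_WEIGHT / weight;
}

/* Wakeup placement as described in the table: never below min_vruntime. */
uint64_t place_on_wakeup(uint64_t task_vruntime, uint64_t min_vruntime)
{
    return task_vruntime > min_vruntime ? task_vruntime : min_vruntime;
}

/* Stand-in for "take the leftmost node": index of the smallest vruntime. */
size_t pick_leftmost(const uint64_t *vruntimes, size_t count)
{
    size_t best = 0;

    for (size_t i = 1; i < count; i++)
        if (vruntimes[i] < vruntimes[best])
            best = i;
    return best;
}
```

A task that slept while min_vruntime advanced from 100 to 500 comes back at 500, not 100, so it competes fairly with the tasks that kept running.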

§ 38.5 — Signal Interruption of epoll_wait

Because epoll_wait uses TASK_INTERRUPTIBLE, a signal delivered to the sleeping process (and not blocked by its signal mask) triggers the same try_to_wake_up path as an incoming I/O event. The difference is discovered only after the process wakes: the ep_poll loop checks signal_pending(current) and breaks with -EINTR instead of dispatching events.

EINTR Wakeup Path

epoll_pwait atomicity: a naïve approach keeps a signal blocked, calls sigprocmask() to unblock it, then calls epoll_wait(). If the signal arrives in the window between the two calls, its handler runs before epoll_wait() starts, and epoll_wait() then blocks as if nothing had happened, so the wakeup is missed. epoll_pwait swaps the signal mask and enters the kernel wait atomically in a single syscall, closing that TOCTOU race.

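/* epoll_pwait: atomic signal mask + wait
   (epfd, events, MAX and timeout are set up elsewhere) */
sigset_t mask;
sigemptyset(&mask);
sigaddset(&mask, SIGINT);

/* atomic: install mask (SIGINT blocked) → wait → restore the old mask
   prevents the race between a separate sigprocmask() call and epoll_wait() */
int n = epoll_pwait(epfd, events, MAX, timeout, &mask);
if (n < 0 && errno == EINTR) {
    /* interrupted by some other, unblocked signal: retry manually
       (epoll_wait/epoll_pwait are never auto-restarted, even with SA_RESTART) */
}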
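The opposite use of the mask, keeping a signal blocked during normal execution and unblocking it only for the duration of the wait, is where the atomicity pays off. The sketch below is an illustration under assumptions: pwait_sees_pending_signal() and on_sigusr1() are made-up names, SIGUSR1 is an arbitrary choice, and the code expects a single-threaded process.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <signal.h>
#include <string.h>
#include <sys/epoll.h>
#include <unistd.h>

static volatile sig_atomic_t got_sigusr1;

static void on_sigusr1(int signo)
{
    (void)signo;
    got_sigusr1 = 1;
}

/* Hypothetical demo: returns 1 if epoll_pwait noticed a signal that was
   already pending when the wait began, 0 otherwise. */
int pwait_sees_pending_signal(void)
{
    struct sigaction sa, old_sa;
    sigset_t block, old_mask, wait_mask;
    struct epoll_event ev;
    int epfd, n, wait_errno;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_sigusr1;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGUSR1, &sa, &old_sa) < 0)
        return 0;

    /* Keep SIGUSR1 blocked during normal execution. */
    sigemptyset(&block);
    sigaddset(&block, SIGUSR1);
    sigprocmask(SIG_BLOCK, &block, &old_mask);

    got_sigusr1 = 0;
    raise(SIGUSR1);                  /* blocked, so it stays pending */

    /* Unblock SIGUSR1 only while waiting; the kernel swaps masks atomically. */
    wait_mask = old_mask;
    sigdelset(&wait_mask, SIGUSR1);

    epfd = epoll_create1(0);
    n = epoll_pwait(epfd, &ev, 1, 1000, &wait_mask);
    wait_errno = errno;

    if (epfd >= 0)
        close(epfd);
    sigprocmask(SIG_SETMASK, &old_mask, NULL);
    sigaction(SIGUSR1, &old_sa, NULL);

    return n == -1 && wait_errno == EINTR && got_sigusr1 == 1;
}
```

Because the pending signal becomes deliverable at the moment the wait starts, the call fails with EINTR instead of sleeping for the full second. With a separate sigprocmask() call followed by epoll_wait(), the handler would have run first and the wait would then have slept through the timeout.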

Scenario	epoll_wait behavior	How to handle
Signal, no SA_RESTART	Returns -1, errno=EINTR	App checks errno, re-calls manually
Signal, SA_RESTART	Still returns -1, errno=EINTR (epoll_wait is never auto-restarted; see signal(7))	App checks errno, re-calls manually
epoll_pwait() + masked signal	Signal blocked — process stays asleep	Prevents TOCTOU between check and block
epoll_pwait2() (Linux 5.11+)	Same as epoll_pwait + nanosecond timeout	Use for sub-millisecond timeout precision

Interview Prep — Must-Know Questions

Question	Key answer
TASK_INTERRUPTIBLE vs TASK_UNINTERRUPTIBLE — which does epoll_wait use?	TASK_INTERRUPTIBLE. It can be woken by both an I/O event and a signal. epoll_wait may block indefinitely, so it must be interruptible for signals such as SIGINT (Ctrl-C) to be acted on promptly. Disk I/O uses TASK_UNINTERRUPTIBLE because the operation cannot be safely aborted mid-flight.
Full path from ep_poll_callback → process running?	ep_poll_callback → rdllist.add(epitem) → wake_up(&ep->wq) → __wake_up_common → default_wake_function → try_to_wake_up → p->__state = TASK_RUNNING → ttwu_queue → ttwu_do_activate → CFS enqueue → pick_next_task_fair → context_switch → resumes at schedule_timeout() return.
What does try_to_wake_up do? Which CPU's run queue?	Locks p->pi_lock, verifies p->__state & TASK_NORMAL, sets state = TASK_RUNNING, calls select_task_rq() for cache affinity (prefers p->cpu), then ttwu_queue. Same CPU → ttwu_do_activate directly. Different CPU → the task lands on the target CPU's run queue, either enqueued remotely under that rq's lock or handed over with an IPI so the target CPU enqueues it.
How does CFS prioritize a newly woken process?	enqueue_task_fair() sets p->vruntime = max(p->vruntime, rq->min_vruntime), placing the task at or near the leftmost node of the RB-tree. pick_next_task_fair() returns the leftmost node in O(1) (the tree caches it as rb_leftmost), so the woken process is a strong candidate at the next schedule() call.
Why does ep_poll() re-check rdllist after schedule_timeout returns?	Spurious wakeups — the kernel guarantees nothing about why schedule_timeout returned. The loop must verify list_empty(&ep->rdllist) is false before exiting. It also checks signal_pending and the remaining timeout.
How does a signal interrupt epoll_wait? What errno?	signal_wake_up(p, 0) calls try_to_wake_up — identical to an I/O wakeup. After the process resumes in ep_poll, signal_pending(current) is true → the loop breaks with res = -EINTR → epoll_wait returns -1, errno = EINTR.
epoll_wait vs epoll_pwait — what race does pwait fix?	epoll_pwait atomically installs a temporary signal mask before entering the kernel wait. Without it, a signal arriving between sigprocmask() and epoll_wait() is handled before the wait begins, and epoll_wait() then sleeps without noticing it (TOCTOU). epoll_pwait closes that window in a single syscall. epoll_pwait2 (Linux 5.11) adds nanosecond timeout precision.

§ 38 — Process Wakeup: INTERRUPTIBLE → RUNNING