Tokio's multi-thread scheduler has an unfixed liveness bug: all worker
threads can end up parked on condvars with no worker driving the I/O
reactor. Condvar-parked workers wait with no timeout and sleep
indefinitely, so once the runtime enters this state it never recovers
on its own.
This was observed on a box migrating from 0.3.5.1: after heavy task
churn (package reinstalls, container operations, logging), all 16
workers were blocked in futex_wait and no thread sat in epoll_wait. The
web server was listening on both HTTP and HTTPS but never replied; the
box sat at 0% CPU for 7+ hours.
Two mitigations:
1. Watchdog OS thread (startd.rs): a plain std::thread that injects a
no-op task via Handle::spawn every 30s. This forces a condvar-parked
worker to wake, run through its park loop, and contend for the driver
TryLock, breaking the stall regardless of what triggered it.
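The watchdog shape can be sketched with std alone (names are illustrative, not the real startd.rs; in the real mitigation the callback is `handle.spawn(async {})` on the runtime's Handle):

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::thread;
use std::time::Duration;

/// Spawn a plain OS thread that fires `wake` every `period`. In the real
/// watchdog, `wake` injects a no-op task into the runtime, which wakes a
/// condvar-parked worker and sends it back through its park loop, where it
/// can re-acquire the I/O driver lock.
fn spawn_watchdog(period: Duration, wake: impl Fn() + Send + 'static) -> Arc<AtomicBool> {
    let stop = Arc::new(AtomicBool::new(false));
    let stop_flag = Arc::clone(&stop);
    thread::spawn(move || {
        while !stop_flag.load(Ordering::Relaxed) {
            thread::sleep(period);
            wake(); // no-op task injection point
        }
    });
    stop // caller flips this to true to shut the watchdog down
}
```

Keeping the watchdog off the runtime matters: an async timer task could itself be starved by the very stall it is meant to break, while a dedicated std::thread is scheduled by the OS and always fires.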
2. block_in_place in the logger (logger.rs): the TeeWriter holds a
std::sync::Mutex across blocking file and stderr writes on worker
threads. Wrapping those writes in block_in_place tells Tokio to hand
off driver duties before the worker blocks, shrinking the window for
starvation. The call is guarded by runtime_flavor() to avoid panicking
on the current-thread runtimes used by the CLI.