Commit Graph

3190 Commits

Author SHA1 Message Date
Aiden McClelland
e2e88f774e chore: add i18n entries for new CLI args and commands 2026-03-25 13:22:47 -06:00
Aiden McClelland
4bebcafdde fix: tolerate setsid EPERM in subcontainer pre_exec
In TTY mode, pty_process already calls setsid() on the child before
our pre_exec runs. The second setsid() fails with EPERM since the
process is already a session leader. This is harmless — ignore it.
2026-03-25 10:31:29 -06:00
Aiden McClelland
2bb1463f4f fix: mitigate tokio I/O driver starvation (tokio-rs/tokio#4730)
Tokio's multi-thread scheduler has an unfixed vulnerability where all
worker threads can end up parked on condvars with no worker driving the
I/O reactor.  Condvar-parked workers have no timeout and sleep
indefinitely, so once in this state the runtime never recovers.

This was observed on a box migrating from 0.3.5.1: after heavy task
churn (package reinstalls, container operations, logging) all 16 workers
ended up on futex_wait with no thread on epoll_wait.  The web server
listened on both HTTP and HTTPS but never replied.  The box was stuck
for 7+ hours with 0% CPU.

Two mitigations:

1. Watchdog OS thread (startd.rs): a plain std::thread that every 30s
   injects a no-op task via Handle::spawn.  This forces a condvar-parked
   worker to wake, cycle through park, and grab the driver TryLock —
   breaking the stall regardless of what triggered it.

2. block_in_place in the logger (logger.rs): the TeeWriter holds a
   std::sync::Mutex across blocking file + stderr writes on worker
   threads.  Wrapping in block_in_place tells tokio to hand off driver
   duties before the worker blocks, reducing the window for starvation.
   Guarded by runtime_flavor() to avoid panicking on current-thread
   runtimes used by the CLI.
2026-03-25 10:14:03 -06:00
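The watchdog in mitigation 1 above can be sketched with the standard library alone. This is a minimal illustration and not the code in startd.rs: `Watchdog`, `spawn`, `stop`, and the `poke` closure are hypothetical names. In the real setting the closure would presumably submit a no-op task via tokio's `Handle::spawn`; it is left generic here so the sketch compiles without external crates.

```rust
use std::sync::mpsc::{self, RecvTimeoutError, Sender};
use std::thread::{self, JoinHandle};
use std::time::Duration;

/// Handle to a plain OS thread that calls `poke` once per interval.
/// (Hypothetical type, for illustration only.)
pub struct Watchdog {
    stop_tx: Sender<()>,
    thread: JoinHandle<()>,
}

impl Watchdog {
    /// Start the watchdog. The thread lives outside any async runtime, so it
    /// keeps ticking even if every runtime worker is parked.
    pub fn spawn<F>(interval: Duration, mut poke: F) -> Watchdog
    where
        F: FnMut() + Send + 'static,
    {
        let (stop_tx, stop_rx) = mpsc::channel::<()>();
        let thread = thread::spawn(move || loop {
            // recv_timeout doubles as an interruptible sleep.
            match stop_rx.recv_timeout(interval) {
                // Interval elapsed with no stop request: wake the runtime.
                Err(RecvTimeoutError::Timeout) => poke(),
                // Stop requested, or the handle was dropped: exit the loop.
                _ => break,
            }
        });
        Watchdog { stop_tx, thread }
    }

    /// Ask the thread to exit and wait for it to finish.
    pub fn stop(self) {
        let _ = self.stop_tx.send(());
        let _ = self.thread.join();
    }
}
```

With a 30s interval and a closure that spawns an empty task on the runtime handle, this would approximate the behavior the commit describes. The stop channel is an addition for testability and is not mentioned in the commit.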
Aiden McClelland
f20ece44a1 chore: bump sdk version in container-runtime lockfile 2026-03-24 19:26:56 -06:00
Aiden McClelland
9fddcb957f chore: bump direct_io buffer from 256KiB to 1MiB 2026-03-24 19:26:56 -06:00
Aiden McClelland
fd502cfb99 fix: probe active block device before vg import cycle
When the target VG is already active (e.g. the running system's own
VG), probe the block device directly instead of going through the
full import/activate/open/cleanup sequence.
2026-03-24 19:26:56 -06:00
Aiden McClelland
ee95eef395 fix: mark backup progress complete unconditionally
Remove the backup_succeeded gate so the progress indicator updates
regardless of the backup outcome — the status field already captures
success/failure separately.
2026-03-24 19:26:56 -06:00
Aiden McClelland
aaa43ce6af fix: network error resilience and wifi state tracking
- Demote transient route-replace errors (vanishing interfaces) to trace
- Tolerate errors during policy routing cleanup on drop
- Use join_all instead of try_join_all for gateway watcher jobs
- Simplify wifi interface detection to always use find_wifi_iface()
- Write wifi enabled state to db instead of interface name
2026-03-24 19:26:55 -06:00
Aiden McClelland
e0f27281d1 feat: load bundled migration images and log progress during os migration
Load pre-saved container images from /usr/lib/startos/migration-images
before migrating packages, removing the need for internet access during
the v1→v2 s9pk conversion.  Add a periodic progress logger so the user
can see which package is being migrated.
2026-03-24 19:26:55 -06:00
Aiden McClelland
ecc4703ae7 build: add migration image bundling to build pipeline
Bundle start9/compat, start9/utils, and tonistiigi/binfmt container
images into the OS image so the v1→v2 s9pk migration can run without
internet access.
2026-03-24 19:26:55 -06:00
Aiden McClelland
d478911311 fix: restore chown on /proc/self/fd/* for subcontainer exec
The pipe-wrap binary guarantees FDs are always pipes (not sockets),
making the chown safe. The chown is still needed because anonymous
pipes have mode 0600 — without it, non-root users cannot re-open
/dev/stderr via /proc/self/fd/2.
2026-03-24 19:26:55 -06:00
Matt Hill
23fe6fb663 align checkbox 2026-03-24 18:57:19 -06:00
Matt Hill
186925065d sdk db backups, wifi ux, release notes, minor copy 2026-03-24 16:39:31 -06:00
Aiden McClelland
53dff95365 revert: remove websocket shutdown signal from RpcContinuations 2026-03-24 11:13:59 -06:00
Aiden McClelland
7f6abf2a80 Merge pull request #3140 from Start9Labs/fix/wifi
bugfixes for alpha.21
v0.4.0-alpha.22
2026-03-23 10:26:04 -06:00
Aiden McClelland
19fa1cb4e3 fix build 2026-03-23 10:12:15 -06:00
Matt Hill
521f61c647 bump sdk for republish 2026-03-23 09:45:16 -06:00
Matt Hill
3d45234aae fix password input for backups and add adjective noun randomizer 2026-03-23 08:58:37 -06:00
Aiden McClelland
f60a1a9ed0 fix: set backup progress complete atomically with status revert
Move BackupProgress { complete: true } into the same db.mutate() as the
DesiredStatus revert in the backup transition. Previously these were
separate mutations—the status would revert to Running before progress
showed complete, causing a visible gap in the UI.
2026-03-23 01:15:54 -06:00
Aiden McClelland
2aa910a3e8 fix: replace stdio chown with prctl(PR_SET_DUMPABLE) and pipe-wrap
After setuid, the kernel clears the dumpable flag, making /proc/self/
entries owned by root. This broke open("/dev/stderr") for non-root
users inside subcontainers. The previous fix (chowning /proc/self/fd/*)
was dangerous because it chowned whatever file the FD pointed to (could
be the journal socket).

The proper fix is prctl(PR_SET_DUMPABLE, 1) after setuid, which restores
/proc/self/ ownership to the current uid.

Additionally, adds a `pipe-wrap` subcommand that wraps a child process
with piped stdout/stderr, relaying to the original FDs. This ensures all
descendants inherit pipes (which support re-opening via /proc/self/fd/N)
even when the outermost FDs are journal sockets. container-runtime.service
now uses this wrapper.

With pipe-wrap guaranteeing pipe-based FDs, the exec and launch non-TTY
paths no longer need their own pipe+relay threads, eliminating the bug
where exec would hang when a child daemonized (e.g. pg_ctl start).
2026-03-23 01:14:49 -06:00
Aiden McClelland
8d1e11e158 fix: pg_dump/pg_restore permission errors in backup subcontainer
- Pre-create and chown dump file for postgres user before pg_dump
- Chown volume mountpoint to postgres before initdb on restore
- Add --no-privileges to pg_restore to skip GRANT/REVOKE for missing roles
2026-03-23 01:13:20 -06:00
Aiden McClelland
b7e4df44bf wip: subcontainer exec log drain via SCM_RIGHTS (reference only)
Implemented pipe FD handoff from exec to launch via Unix socket +
SCM_RIGHTS for grandchild log capture. Superseded by the simpler
PR_SET_DUMPABLE approach which eliminates the need for pipes entirely.
2026-03-22 23:58:14 -06:00
Aiden McClelland
25aa140174 fix: backup status reporting 2026-03-22 23:55:26 -06:00
Matt Hill
7ffb462355 better smtp and backups for postgres and mysql 2026-03-22 19:49:58 -06:00
Aiden McClelland
6ed0afc75f chore: bump sdk to 0.4.0-beta.63 2026-03-22 14:13:28 -06:00
Aiden McClelland
cb7618cb34 fix: e2fsck exit codes 1-3 are non-fatal during btrfs conversion
e2fsck returns 1 when errors are corrected and 2 when corrections
require a reboot; the exit status is a bit mask, so 3 means both.
These are expected during ext4→btrfs conversion.
Only exit codes >= 4 indicate actual failure. Previously, .invoke()
treated any non-zero exit as an error, causing the conversion to
fail after successful filesystem repairs.
2026-03-21 18:20:55 -06:00
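The exit-code rule in the commit above can be sketched as a small classifier. This is an illustration under stated assumptions, not the code in the repository: `FsckOutcome` and `classify_e2fsck_exit` are hypothetical names, and the real fix may handle the status differently.

```rust
/// Outcome of an e2fsck run, derived from its exit status.
/// (Hypothetical type, for illustration only.)
#[derive(Debug, PartialEq, Eq)]
pub enum FsckOutcome {
    /// Exit 0: no errors found.
    Clean,
    /// Exit 1-3: errors were found and corrected. Bit 2 means a reboot is advised.
    Corrected { reboot_advised: bool },
    /// Exit >= 4 (or anything unexpected): treat as a real failure.
    Failed(i32),
}

/// Classify an e2fsck exit status following the rule in the commit message:
/// 0 is clean, 1 through 3 are non-fatal, everything else is a failure.
pub fn classify_e2fsck_exit(code: i32) -> FsckOutcome {
    match code {
        0 => FsckOutcome::Clean,
        1..=3 => FsckOutcome::Corrected {
            reboot_advised: code & 2 != 0,
        },
        other => FsckOutcome::Failed(other),
    }
}
```

A caller would continue the conversion on `Clean` or `Corrected` and abort only on `Failed`.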
Aiden McClelland
456c5d6725 fix: graceful shutdown for subcontainer daemons
Two issues fixed:

1. Process group cascade: exec-command processes inherited the
   container runtime's process group. When an entrypoint script
   did kill(0, SIGTERM) during shutdown, it signaled ALL processes
   in the group — including other subcontainers' launch wrappers,
   causing their PID namespaces to collapse. Fixed by calling
   setsid() in exec-command's pre_exec to isolate each service
   in its own process group.

2. Unordered daemon termination: removeChild("main") fired
   onLeaveContext callbacks for all Daemon.of() instances
   simultaneously, bypassing Daemons.term()'s reverse-dependency
   ordering. Fixed by having Daemons.build() mark individual
   daemons as managed (suppressing their onLeaveContext) and
   registering a single onLeaveContext that calls the ordered
   Daemons.term(). The term() method is deduplicated so
   system.stop() and onLeaveContext share the same shutdown.
2026-03-21 18:20:50 -06:00
Matt Hill
bdfa918a33 a bunch of UI cleanup around backups as well as other bug fixes and UI improvements 2026-03-21 16:32:46 -06:00
Aiden McClelland
8b65490d0e feat: add progress step for btrfs conversion during setup/init 2026-03-20 19:32:41 -06:00
Aiden McClelland
c9a93f0a33 fix: rsync progress regex never matched, spamming logs during backup
The regex ended with `$` (end-of-string anchor) where no anchor was
needed, so it never matched the percentage in rsync output. Every
line, including empty ones, was logged instead of parsed.
2026-03-20 17:13:35 -06:00
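To show why an end-of-string anchor defeats the match, here is a minimal sketch that extracts the percentage from a progress line. It deliberately swaps the regex for plain token splitting so that it needs only the standard library. `parse_rsync_percent` is a hypothetical name, and the sample line format (percentage followed by rate and time fields) is an assumption about typical rsync progress output, not something quoted in the commit.

```rust
/// Return the first whitespace-separated token of the form `NN%` as a number.
/// The percentage may sit anywhere in the line, which is why a pattern
/// anchored to the end of the string would miss it when other fields follow.
pub fn parse_rsync_percent(line: &str) -> Option<u8> {
    line.split_whitespace()
        // Keep only tokens that end in '%', with the '%' removed.
        .filter_map(|tok| tok.strip_suffix('%'))
        // Keep only those whose remainder is a small unsigned integer.
        .filter_map(|digits| digits.parse::<u8>().ok())
        // Ignore out-of-range values.
        .find(|pct| *pct <= 100)
}
```

Lines without a percentage, including empty ones, yield `None` and can be skipped instead of logged.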
Aiden McClelland
f5bfbe0465 Revert "fix: RunAction task re-evaluation compared against partial input, not full config"
also apply alternative fix: only re-activate a task that explicitly conflicts with a run action's input

This reverts commit 2999d22d2a.
2026-03-20 16:35:09 -06:00
Aiden McClelland
8bccffcb5c feat: add --arch flag to start-cli registry package download
Use the new flag in the image build recipe to download the tor s9pk
for the target architecture, replacing the standalone download script.
2026-03-20 15:28:45 -06:00
Aiden McClelland
9ff65497a8 fix: replace fire-and-forget restart loop in Daemon with tracked AbortController
- Track the restart loop as an awaitable { abort, done } handle
- Remove shouldBeRunning flag — signal.aborted serves the same purpose
- Remove exiting field — term() awaits command termination inline
- Guard start() on loop existence to prevent concurrent restart loops
- Make backoff sleep abortable so term() returns immediately
- Suppress error logging during intentional termination
- Loop clears its own handle in finally block for natural exit (oneshot)
2026-03-20 14:31:46 -06:00
Aiden McClelland
7335e52ab3 fix: daemon lifecycle cleanup and error logging improvements
- Refactor HealthDaemon to use a tracked session (AbortController + awaitable
  promise) instead of fire-and-forget health check loops, preventing health
  checks from running after a service is stopped
- Stop health checks before terminating daemon to avoid false crash reports
  during intentional shutdown
- Guard onExit callbacks with AbortSignal to prevent stale session callbacks
- Add logErrorOnce utility to deduplicate repeated error logging
- Fix SystemForEmbassy.stop() to capture clean promise before deleting ref
- Treat SIGTERM (signal 15) as successful exit in subcontainer sync
- Fix asError to return original Error instead of wrapping in new Error
- Remove unused ExtendedVersion import from Backups.ts
2026-03-20 13:50:57 -06:00
Aiden McClelland
b54f10af55 fix: rsync backup bugs and optimize flags for encrypted CIFS targets
- Fix restoreBackup using backupOptions instead of restoreOptions
- Add missing await on preRestore/postRestore hooks
- Remove -c (checksum) flag that forced full reads on every run
- Add --partial to keep partially transferred files on interruption
- Add --inplace to avoid temp-file+rename metadata churn
- Add --timeout=300 to prevent hangs on stalled mounts
2026-03-20 11:56:53 -06:00
Matt Hill
0549c7c0ef fix build 2026-03-20 08:50:54 -06:00
Matt Hill
2a8d8c7154 alpha.22 2026-03-20 08:37:36 -06:00
Andreas Schjønhaug
b9f2446cee Fix Safari hard refresh instructions (#3141) 2026-03-20 08:35:11 -06:00
Matt Hill
03d7d5f123 Merge branch 'fix/wifi' of github.com:Start9Labs/start-os into fix/wifi 2026-03-20 08:23:14 -06:00
Matt Hill
2fd674eca8 bump tor 2026-03-20 08:23:12 -06:00
waterplea
0e9c90f2c0 chore: fix icons in marketplace 2026-03-20 11:16:45 +04:00
Matt Hill
bca2e4d630 feat: add restart button to start-tunnel settings page
Adds a VPS restart button to the settings page, above logout. Shows a
spinner while the RPC completes, then a dialog telling the user to wait
1-2 minutes and refresh.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-19 23:27:21 -06:00
Alex Inkin
f41fc75024 chore: make service icons not round and add wifi lock badge (#3139)
* chore: make service icons not round and add wifi lock badge

* chore: comments
2026-03-19 18:18:25 -06:00
Matt Hill
56cb3861bc fix build 2026-03-19 18:00:22 -06:00
Matt Hill
2999d22d2a fix: RunAction task re-evaluation compared against partial input, not full config
Bug: After running an action (e.g. bitcoin's autoconfig), update_tasks was
called with the submitted form input — which for task-triggered actions is
filtered to only the task's fields (e.g. {zmqEnabled: true}). Other services'
tasks targeting the same action were then compared against this partial via
is_partial_of, so any task wanting a field NOT in the submission (e.g.
{blocknotify: "curl..."}) would incorrectly become active, even though the
full config still satisfied it.

This caused a cycling bug: running LND's autoconfig (zmqEnabled) would
activate Datum's task (blocknotify), and vice versa, despite the merge
correctly preserving both values in the config.

Fix: After running an action, fetch the full current config via
get_action_input (same as create_task and recheck_tasks already do) and
compare tasks against that.

The one-liner fix would have been to add a get_action_input call in the
RunAction handler. Instead, we extracted eval_action_tasks on
ServiceActorSeed — a single method that both RunAction and recheck_tasks
now call — because the duplication between these two sites is exactly how
this bug happened: recheck_tasks fetched the full config, RunAction didn't,
and they silently diverged.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-19 14:43:08 -06:00
Matt Hill
bb745c43cc fix: createTask with undefined input values fails to create task
Bug: Setting a task input property to undefined (e.g. { prune: undefined })
to express "this key should be deleted" resulted in no task being created.
JSON.stringify strips undefined values, so { prune: undefined } serialized
as {}, and is_partial_of({}, any_config) always returns true — meaning
input-not-matches saw a "match" and never activated the task.

Fix (two parts):
- SDK: coerce undefined to null in task input values before serialization,
  so they survive JSON.stringify and reach the Rust backend
- Rust: treat null in a partial as matching a missing key in the full
  config, so tasks correctly deactivate when the key is already absent

Assumption: null and undefined/absent are semantically equivalent for
StartOS config values. Input specs produce concrete values (strings,
numbers, booleans, objects, arrays) — null never appears as a meaningful
distinct-from-absent value in real-world configs.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-19 14:28:04 -06:00
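The Rust half of the fix above (null in a partial matches a missing key) can be sketched as follows. This is a self-contained illustration and not the repository's `is_partial_of`: the `Value` enum is a hypothetical stand-in for a JSON value type, and `obj` is a helper added only to keep the example short.

```rust
use std::collections::BTreeMap;

/// Minimal stand-in for a JSON value (hypothetical, for illustration only).
#[derive(Debug, Clone, PartialEq)]
pub enum Value {
    Null,
    Bool(bool),
    Str(String),
    Object(BTreeMap<String, Value>),
}

/// Build an object from key/value pairs.
pub fn obj(pairs: &[(&str, Value)]) -> Value {
    Value::Object(
        pairs
            .iter()
            .map(|(k, v)| (k.to_string(), v.clone()))
            .collect(),
    )
}

/// True when every key in `partial` is satisfied by `full`.
/// `full` is `None` when the key is absent from the enclosing object.
pub fn is_partial_of(partial: &Value, full: Option<&Value>) -> bool {
    match (partial, full) {
        // Null in the partial matches both an absent key and an explicit null.
        (Value::Null, None) | (Value::Null, Some(Value::Null)) => true,
        // Any other partial value requires the key to exist.
        (_, None) => false,
        // Objects match recursively, key by key; extra keys in `full` are fine.
        (Value::Object(p), Some(Value::Object(f))) => {
            p.iter().all(|(k, v)| is_partial_of(v, f.get(k)))
        }
        // Everything else must be equal.
        (p, Some(f)) => p == f,
    }
}
```

Under this rule a task whose input is `{ prune: null }` counts as satisfied when `prune` is already absent, and as unsatisfied when `prune` still holds a value. An empty partial still matches any object, which is the behavior the commit describes as the source of the original bug.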
Matt Hill
de9a7e4189 fix types 2026-03-19 13:38:40 -06:00
Matt Hill
8fbcf44dec fix 2026-03-19 11:54:28 -06:00
Matt Hill
97b3b548c0 fix type 2026-03-19 11:42:43 -06:00
Matt Hill
6c72a22178 SDK beta.62: fix dynamicSelect crash on empty values, add smtpShape
- Guard z.union() against empty arrays in dynamicSelect/dynamicMultiselect
  by falling back to z.string() (fixes zod v4 _zod TypeError)
- Add smtpShape: typed zod schema for store file models, replacing
  smtpInputSpec.validator which caused cross-zod-instance errors
- Bump version to 0.4.0-beta.62

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-19 11:30:37 -06:00