Files
alchemist/audit.md
by brooklyn e50ca64e80 Resolve audit findings + split db.rs into db/ module
- P1: Fix cancel race in pipeline, fix VideoToolbox quality mapping
- P2: SSRF protection, batch cancel N+1, archived filter fixes,
  metadata persistence, reverse proxy hardening, reprobe logging
- TD: Remove AlchemistEvent legacy bridge, fix silent .ok() on DB
  writes, optimize sort-by-size query, split db.rs (3400 LOC) into
  8 focused submodules under src/db/
- UX: Add queue position display for queued jobs
- Docs: Update API docs, engine modes, library doctor, config ref
- Plans: Add plans.md for remaining open items (UX-2/3, FG-4, RG-2)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-13 16:02:11 -04:00

# Audit Findings
Last updated: 2026-04-12 (second pass)
---
## P1 Issues
---
### [P1-1] Cancel during analysis can be overwritten by the pipeline
**Status: RESOLVED**
**Files:**
- `src/server/jobs.rs:41-63`
- `src/media/pipeline.rs:1178-1221`
- `src/orchestrator.rs:84-90`
**Severity:** P1
**Problem:**
`request_job_cancel()` in `jobs.rs` immediately writes `Cancelled` to the DB for jobs in `Analyzing` or `Resuming` state. The pipeline used to have race windows where it could overwrite this state with `Skipped`, `Encoding`, or `Remuxing` if it reached a checkpoint after the cancel was issued but before it could be processed.
**Fix:**
Implemented `cancel_requested: Arc<tokio::sync::RwLock<HashSet<i64>>>` in `Transcoder` (orchestrator). The `update_job_state` wrapper in `pipeline.rs` now checks this set before any DB write for `Encoding`, `Remuxing`, `Skipped`, and `Completed` states. Terminal states (Completed, Failed, Cancelled, Skipped) also trigger removal from the set.
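The checked-write pattern described above can be sketched as follows. This is a simplified, synchronous illustration using `std::sync::RwLock`; the real `Transcoder` uses `tokio::sync::RwLock` with async DB writes, and the `CancelSet`/`resolve_write` names are hypothetical.

```rust
use std::collections::HashSet;
use std::sync::{Arc, RwLock};

#[derive(Clone, Copy, PartialEq, Debug)]
enum JobState { Encoding, Remuxing, Skipped, Completed, Failed, Cancelled }

// Illustrative stand-in for the cancel_requested set held by the orchestrator.
struct CancelSet(Arc<RwLock<HashSet<i64>>>);

impl CancelSet {
    fn new() -> Self { CancelSet(Arc::new(RwLock::new(HashSet::new()))) }

    fn request(&self, job_id: i64) { self.0.write().unwrap().insert(job_id); }

    /// Returns the state that should actually be written: a pending cancel
    /// wins over any in-progress state, and terminal states clear the entry
    /// so the set does not grow unboundedly.
    fn resolve_write(&self, job_id: i64, next: JobState) -> JobState {
        let cancelled = self.0.read().unwrap().contains(&job_id);
        let resolved = if cancelled { JobState::Cancelled } else { next };
        if matches!(resolved, JobState::Completed | JobState::Failed
                            | JobState::Cancelled | JobState::Skipped) {
            self.0.write().unwrap().remove(&job_id);
        }
        resolved
    }
}
```

A pipeline checkpoint that reaches `resolve_write` after a cancel was issued writes `Cancelled` instead of clobbering it with `Encoding` or `Skipped`.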
---
### [P1-2] VideoToolbox quality controls are effectively ignored
**Status: RESOLVED**
**Files:**
- `src/media/planner.rs:630-650`
- `src/media/ffmpeg/videotoolbox.rs:25-54`
- `src/config.rs:85-92`
**Severity:** P1
**Problem:**
The planner used to emit `RateControl::Cq` values that were incorrectly mapped for VideoToolbox, resulting in uncalibrated or inverted quality.
**Fix:**
Fixed the mapping in `videotoolbox.rs` to emit `-q:v` on VideoToolbox's 1-100 quality scale, and clamped the CRF-style input to 1-51 (lower is better) to match user expectations from x264/x265. Updated `QualityProfile` in `config.rs` to provide sane default values (24, 28, 32) for VideoToolbox quality.
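The corrected mapping might look roughly like the following. The formula is a hypothetical reconstruction (the finding does not quote the actual code): it assumes a CRF-style input where lower is better and an output on VideoToolbox's 1-100 `-q:v` scale where higher is better.

```rust
// Hypothetical sketch of a CRF-style quality (1-51, lower = better) to
// VideoToolbox `-q:v` (1-100, higher = better) conversion; the exact
// formula in videotoolbox.rs may differ.
fn crf_to_vt_quality(crf: u8) -> u8 {
    let crf = crf.clamp(1, 51) as f32;          // clamp to the x264/x265 range
    let q = (51.0 - crf) / 50.0 * 99.0 + 1.0;   // invert and rescale to 1-100
    q.round() as u8
}
```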
---
## P2 Issues
---
### [P2-1] Convert does not reuse subtitle/container compatibility checks
**Status: RESOLVED**
**Files:**
- `src/conversion.rs:372-380`
- `src/media/planner.rs`
**Severity:** P2
**Problem:**
The conversion path was not validating subtitle/container compatibility, leading to FFmpeg runtime failures instead of early validation errors.
**Fix:**
Integrated `crate::media::planner::subtitle_copy_supported` into `src/conversion.rs:build_subtitle_plan`. The "copy" mode now returns an `AlchemistError::Config` if the combination is unsupported.
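As an illustration of the shared check, a minimal compatibility matrix could look like this. The codec/container pairs shown are common FFmpeg identifiers chosen for the example, not the actual matrix in `planner.rs`.

```rust
// Illustrative sketch only: the real matrix lives in
// crate::media::planner::subtitle_copy_supported and may cover more pairs.
fn subtitle_copy_supported(codec: &str, container: &str) -> bool {
    match (codec, container) {
        // MP4 only carries mov_text timed text.
        ("mov_text", "mp4") => true,
        // Matroska accepts most text and bitmap subtitle codecs.
        ("subrip" | "ass" | "ssa" | "hdmv_pgs_subtitle" | "dvd_subtitle", "mkv") => true,
        _ => false,
    }
}
```

Failing this check up front turns an opaque FFmpeg runtime failure into an early, actionable config error.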
---
### [P2-2] Completed job metadata omitted at the API layer
**Status: RESOLVED**
**Files:**
- `src/db.rs:254-263`
- `src/media/pipeline.rs:599`
- `src/server/jobs.rs:343`
**Severity:** P2
**Problem:**
Job details required a live re-probe of the input file to show metadata, which failed if the file was moved or deleted after completion.
**Fix:**
Added `input_metadata_json` column to the `jobs` table (migration `20260412000000_store_job_metadata.sql`). The pipeline now stores the metadata string immediately after analysis. `get_job_detail_handler` reads this stored value, ensuring metadata is always available even if the source file is missing.
---
### [P2-3] LAN-only setup exposed to reverse proxy misconfig
**Status: RESOLVED**
**Files:**
- `src/config.rs` — `SystemConfig.trusted_proxies`
- `src/server/mod.rs` — `AppState.trusted_proxies`, `AppState.setup_token`
- `src/server/middleware.rs` — `is_trusted_peer`, `request_ip`, `auth_middleware`
**Severity:** P2
**Problem:**
The setup wizard gate trusted all private/loopback IPs for header forwarding. Behind a misconfigured proxy that did not set forwarding headers, the check fell back to the proxy's own IP (e.g. 127.0.0.1), making the setup endpoint reachable by external traffic.
**Fix:**
Added two independent security layers:
1. `trusted_proxies: Vec<String>` to `SystemConfig`. When non-empty, only those exact IPs (plus loopback) are trusted for proxy header forwarding instead of all RFC-1918 ranges. Empty = previous behavior preserved.
2. `ALCHEMIST_SETUP_TOKEN` env var. When set, setup endpoints require `?token=<value>` query param regardless of client IP. Token mode takes precedence over IP-based LAN check.
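A sketch of layer 1's trust decision, assuming a signature of this shape (the real `is_trusted_peer` lives in `src/server/middleware.rs` and also has to consult forwarded headers):

```rust
use std::net::IpAddr;

// Illustrative version of the two-mode trust check described above.
fn is_trusted_peer(peer: IpAddr, trusted_proxies: &[IpAddr]) -> bool {
    if peer.is_loopback() {
        return true; // loopback is trusted in both modes
    }
    if !trusted_proxies.is_empty() {
        // Explicit allow-list mode: only the configured proxy IPs.
        return trusted_proxies.contains(&peer);
    }
    // Legacy mode (empty list): trust all RFC-1918 private ranges.
    match peer {
        IpAddr::V4(v4) => v4.is_private(),
        IpAddr::V6(_) => false,
    }
}
```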
---
### [P2-4] N+1 DB update in batch cancel
**Status: RESOLVED**
**Files:**
- `src/server/jobs.rs` — `batch_jobs_handler`
**Severity:** P2
**Problem:**
The "cancel" action in `batch_jobs_handler` iterated over jobs and called `request_job_cancel`, which performed an individual `update_job_status` query per job, so cancelling a large number of jobs triggered N queries.
**Fix:**
Restructured the "cancel" branch in `batch_jobs_handler`. Orchestrator in-memory operations (`add_cancel_request`, `cancel_job`) still run per-job, but all DB status updates are batched into a single `db.batch_cancel_jobs(&ids)` call (which already existed in `db.rs`). Immediate-resolution jobs (Queued, plus successfully signalled Analyzing/Resuming) are collected and written in one `UPDATE ... WHERE id IN (...)` query.
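The batched write has this general shape; `batch_cancel_sql` is a hypothetical helper shown only to illustrate the single-query form, and the real `db.batch_cancel_jobs` may bind parameters differently.

```rust
// Hypothetical helper: build one parameterized UPDATE covering all ids,
// instead of issuing one UPDATE per job.
fn batch_cancel_sql(ids: &[i64]) -> String {
    let placeholders = ids.iter().map(|_| "?").collect::<Vec<_>>().join(", ");
    format!("UPDATE jobs SET status = 'Cancelled' WHERE id IN ({placeholders})")
}
```

The ids are then bound positionally when the statement is executed, keeping the round-trip count at one regardless of batch size.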
---
### [P2-5] Missing archived filter in health and stats queries
**Status: RESOLVED**
**Files:**
- `src/db.rs` — `get_aggregated_stats`, `get_job_stats`, `get_health_summary`
**Severity:** P2
**Problem:**
`get_health_summary` and `get_aggregated_stats` (`total_jobs`) did not include `AND archived = 0`, so archived (deleted) jobs were incorrectly counted in library health metrics and total job counts.
**Fix:**
Added `AND archived = 0` to all three affected queries: `total_jobs` and `completed_jobs` subqueries in `get_aggregated_stats`, the `GROUP BY status` query in `get_job_stats`, and both subqueries in `get_health_summary`. Updated tests that were asserting the old (incorrect) behavior.
---
### [P2-6] Daily summary notifications bypass SSRF protections
**Status: RESOLVED**
**Files:**
- `src/notifications.rs` — `build_safe_client()`, `send()`, `send_daily_summary_target()`
**Severity:** P2
**Problem:**
`send_daily_summary_target()` used `Client::new()` without any SSRF defences, while `send()` applied DNS timeout, private-IP blocking, no-redirect policy, and request timeout.
**Fix:**
Extracted all client-building logic into `build_safe_client(&self, target)` which applies the full SSRF defence stack. Both `send()` and `send_daily_summary_target()` now use this shared helper.
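The private-IP blocking piece of that defence stack amounts to a check like the following (illustrative; the real helper also configures DNS/request timeouts and a no-redirect policy on the HTTP client builder, which are not shown here):

```rust
use std::net::IpAddr;

// Sketch of a resolved-address check used to reject SSRF targets before
// a webhook request is sent; the name is illustrative.
fn is_forbidden_target(ip: IpAddr) -> bool {
    match ip {
        IpAddr::V4(v4) => {
            v4.is_private()           // 10/8, 172.16/12, 192.168/16
                || v4.is_loopback()   // 127/8
                || v4.is_link_local() // 169.254/16
                || v4.is_unspecified()
        }
        IpAddr::V6(v6) => v6.is_loopback() || v6.is_unspecified(),
    }
}
```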
---
### [P2-7] Silent reprobe failure corrupts saved encode stats
**Status: RESOLVED**
**Files:**
- `src/media/pipeline.rs` — `finalize_job()` duration reprobe
**Severity:** P2
**Problem:**
When a completed encode's metadata has `duration_secs <= 0.0`, the pipeline reprobes the output file to get the actual duration. If reprobe fails, the error was silently swallowed via `.ok()` and duration defaulted to 0.0, poisoning downstream stats.
**Fix:**
Replaced `.ok().and_then().unwrap_or(0.0)` chain with explicit `match` that logs the error via `tracing::warn!` and falls through to 0.0. Existing guards at the stats computation lines already handle `duration <= 0.0` correctly — operators now see *why* stats are zeroed.
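The before/after difference amounts to replacing a swallowed error with an explicit match. `duration_or_zero` is a stand-in for the reprobe call site in `finalize_job`, and `eprintln!` substitutes for `tracing::warn!` to keep the sketch dependency-free.

```rust
// The fallback value is unchanged (0.0, handled by downstream guards);
// the difference is that the failure is now logged instead of discarded.
fn duration_or_zero(probe_result: Result<f64, String>) -> f64 {
    match probe_result {
        Ok(secs) => secs,
        Err(e) => {
            // Previously: probe_result.ok() ... unwrap_or(0.0) — the error vanished.
            eprintln!("warn: output reprobe failed, stats will be zeroed: {e}");
            0.0
        }
    }
}
```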
---
## Technical Debt
---
### [TD-1] `db.rs` is a 3481-line monolith
**Status: RESOLVED**
**File:** `src/db/` (was `src/db.rs`)
**Severity:** TD
**Problem:**
The database layer had grown to nearly 3500 lines. Every query, migration flag, and state enum was in one file, making navigation and maintenance difficult.
**Fix:**
Split into `src/db/` module with 8 submodules: `mod.rs` (Db struct, init, migrations, hash fns), `types.rs` (all type defs), `events.rs` (event enums + channels), `jobs.rs` (job CRUD/filtering/decisions), `stats.rs` (encode/aggregated/daily stats), `config.rs` (watch dirs/profiles/notifications/schedules/file settings/preferences), `conversion.rs` (ConversionJob CRUD), `system.rs` (auth/sessions/API tokens/logs/health). All tests moved alongside their methods. Public API unchanged — all types re-exported from `db/mod.rs`.
---
### [TD-2] `AlchemistEvent` legacy bridge is dead weight
**Status: RESOLVED**
**Files:**
- `src/db.rs` — enum and From impls removed
- `src/media/pipeline.rs`, `src/media/executor.rs`, `src/media/processor.rs` — legacy `tx` channel removed
- `src/notifications.rs` — migrated to typed `EventChannels` (jobs + system)
- `src/server/mod.rs`, `src/main.rs` — legacy channel removed from AppState/RunServerArgs
**Severity:** TD
**Problem:**
`AlchemistEvent` was a legacy event type duplicated by `JobEvent`, `ConfigEvent`, and `SystemEvent`. All senders were emitting events on both channels.
**Fix:**
Migrated the notification system (the sole consumer) to subscribe to `EventChannels.jobs` and `EventChannels.system` directly. Added `SystemEvent::EngineIdle` variant. Removed `AlchemistEvent` enum, its `From` impls, the legacy `tx` broadcast channel from all structs, and the `pub use` from `lib.rs`.
---
### [TD-3] `pipeline.rs` legacy `AlchemistEvent::Progress` stub
**Status: RESOLVED**
**Files:**
- `src/media/pipeline.rs:1228`
**Severity:** TD
**Problem:**
The pipeline used to emit zeroed progress events that could overwrite real stats from the executor.
**Fix:**
The emission has been removed; a comment at lines 1228-1229 confirms that `AlchemistEvent::Progress` is no longer emitted from the pipeline wrapper.
---
### [TD-4] Silent `.ok()` on pipeline decision and attempt DB writes
**Status: RESOLVED**
**Files:**
- `src/media/pipeline.rs` — all `add_decision`, `insert_encode_attempt`, `upsert_job_failure_explanation`, and `add_log` call sites
**Severity:** TD
**Problem:**
Decision records, encode attempt records, failure explanations, and error logs were written with `.ok()` or `let _ =`, silently discarding DB write failures. These records are the only audit trail of *why* a job was skipped/transcoded/failed.
**Fix:**
Replaced all `.ok()` / `let _ =` patterns on `add_decision`, `insert_encode_attempt`, `upsert_job_failure_explanation`, and `add_log` calls with `if let Err(e) = ... { tracing::warn!(...) }`. Pipeline still continues on failure, but operators now see the error.
---
### [TD-5] Correlated subquery for sort-by-size in job listing
**Status: RESOLVED**
**Files:**
- `src/db.rs` — `get_jobs_filtered()` query
**Severity:** TD
**Problem:**
Sorting jobs by file size used a correlated subquery in ORDER BY, executing one subquery per row and producing NULL for jobs without encode_stats.
**Fix:**
Added `LEFT JOIN encode_stats es ON es.job_id = j.id` to the base query. Sort column changed to `COALESCE(es.input_size_bytes, 0)`, ensuring jobs without stats sort as 0 (smallest) instead of NULL.
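The behavioural point of `COALESCE(es.input_size_bytes, 0)` (jobs with no `encode_stats` row sort as size 0 rather than NULL) can be shown with a small Rust analogue:

```rust
// Rust analogue of the COALESCE sort: a missing size (None, i.e. no
// joined encode_stats row) is treated as 0 and sorts first.
fn sort_by_size(mut sizes: Vec<Option<i64>>) -> Vec<Option<i64>> {
    sizes.sort_by_key(|s| s.unwrap_or(0));
    sizes
}
```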
---
## Reliability Gaps
---
### [RG-1] No encode resume after crash or restart
**Status: PARTIALLY RESOLVED**
**Files:**
- `src/main.rs:320`
- `src/media/processor.rs:255`
**Severity:** RG
**Problem:**
Interrupted encodes were left in `Encoding` state and orphaned temp files remained on disk.
**Fix:**
Implemented `db.reset_interrupted_jobs()` in `main.rs` which resets `Encoding`, `Remuxing`, `Resuming`, and `Analyzing` jobs to `Queued` on startup. Orphaned temp files are also detected and removed. Full bitstream-level resume (resuming from the middle of a file) is still missing.
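The recovery pass reduces to a state filter; this in-memory sketch is illustrative only (the real `reset_interrupted_jobs` operates on the `jobs` table in the DB, and the enum here omits most states).

```rust
#[derive(Clone, Copy, PartialEq, Debug)]
enum Status { Queued, Analyzing, Encoding, Remuxing, Resuming, Completed }

// Reset every job interrupted mid-flight back to Queued; returns how many
// were reset. Terminal and already-queued jobs are left untouched.
fn reset_interrupted(statuses: &mut [Status]) -> usize {
    let mut reset = 0;
    for s in statuses.iter_mut() {
        if matches!(s, Status::Analyzing | Status::Encoding
                     | Status::Remuxing | Status::Resuming) {
            *s = Status::Queued;
            reset += 1;
        }
    }
    reset
}
```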
---
### [RG-2] AMD VAAPI/AMF hardware paths unvalidated
**Status: OPEN**
**Files:**
- `src/media/ffmpeg/vaapi.rs`
- `src/media/ffmpeg/amf.rs`
**Severity:** RG
**Problem:**
Hardware paths for AMD (VAAPI on Linux, AMF on Windows) were implemented without real hardware validation.
**Planned fix:**
Verify exact flag compatibility on AMD hardware and add integration tests gated on GPU presence.
---
## UX Gaps
---
### [UX-1] Queued jobs show no position or estimated wait time
**Status: RESOLVED**
**Files:**
- `src/db.rs``get_queue_position`
- `src/server/jobs.rs``JobDetailResponse.queue_position`
- `web/src/components/jobs/JobDetailModal.tsx`
- `web/src/components/jobs/types.ts` — `JobDetail.queue_position`
**Severity:** UX
**Problem:**
Queued jobs showed only "Waiting", with no indication of their position in the priority queue.
**Fix:**
Implemented `db.get_queue_position(job_id)` which counts jobs with higher priority or earlier `created_at` (matching the `priority DESC, created_at ASC` dequeue order). Added `queue_position: Option<u32>` to `JobDetailResponse` — populated only when `status == Queued`. Frontend shows `Queue position: #N` in the empty state card in `JobDetailModal`.
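The position computation described above can be sketched in memory. `QueuedJob` and the 1-based numbering are illustrative; the finding itself only specifies counting jobs ahead under the `priority DESC, created_at ASC` dequeue order.

```rust
// Minimal in-memory model: created_at is a unix timestamp for simplicity.
struct QueuedJob { id: i64, priority: i32, created_at: i64 }

/// 1-based position of `job_id` in the dequeue order, or None if absent.
/// A job is "ahead" if it has higher priority, or equal priority and an
/// earlier created_at (matching priority DESC, created_at ASC).
fn queue_position(queue: &[QueuedJob], job_id: i64) -> Option<u32> {
    let me = queue.iter().find(|j| j.id == job_id)?;
    let ahead = queue.iter().filter(|j| {
        j.priority > me.priority
            || (j.priority == me.priority && j.created_at < me.created_at)
    }).count();
    Some(ahead as u32 + 1)
}
```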
---
### [UX-2] No way to add a single file to the queue via the UI
**Status: OPEN**
**Severity:** UX
**Problem:**
Jobs only enter the queue via full library scans. No manual "Enqueue path" exists in the UI.
**Planned fix:**
Add `POST /api/jobs/enqueue` and an "Add file" action in the `JobsToolbar`.
---
### [UX-3] Workers-blocked reason not surfaced for queued jobs
**Status: OPEN**
**Severity:** UX
**Problem:**
Users cannot see why a job is stuck in Queued (paused, scheduled, or slots full).
**Planned fix:**
Add `GET /api/processor/status` and show the reason in the job detail.
---
## Feature Gaps
---
### [FG-4] Intelligence page content not actionable
**Status: OPEN**
**Files:**
- `web/src/components/LibraryIntelligence.tsx`
**Severity:** FG
**Problem:**
Intelligence page is informational only; recommendations cannot be acted upon directly from the page.
**Planned fix:**
Add a "Queue all" action for remux opportunities and "Review" actions for duplicates.
---
## What To Fix Next
1. **[UX-2]** Single file enqueue — New feature.
2. **[UX-3]** Workers-blocked reason — New feature.
3. **[FG-4]** Intelligence page actions — New feature.
4. **[RG-2]** AMD VAAPI/AMF validation — Needs real hardware.