
Audit Findings

Last updated: 2026-04-12 (second pass)


P1 Issues


[P1-1] Cancel during analysis can be overwritten by the pipeline

Status: RESOLVED

Files:

  • src/server/jobs.rs:4163
  • src/media/pipeline.rs:1178-1221
  • src/orchestrator.rs:84-90

Severity: P1

Problem:

request_job_cancel() in jobs.rs immediately writes Cancelled to the DB for jobs in Analyzing or Resuming state. The pipeline used to have race windows where it could overwrite this state with Skipped, Encoding, or Remuxing if it reached a checkpoint after the cancel was issued but before it could be processed.

Fix:

Implemented cancel_requested: Arc<tokio::sync::RwLock<HashSet<i64>>> in Transcoder (orchestrator). The update_job_state wrapper in pipeline.rs now checks this set before any DB write for Encoding, Remuxing, Skipped, and Completed states. Terminal states (Completed, Failed, Cancelled, Skipped) also trigger removal from the set.
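A minimal sketch of that guard, using std::sync::RwLock in place of the codebase's tokio::sync::RwLock and an illustrative state enum (the real wrapper also performs the actual DB write):

```rust
use std::collections::HashSet;
use std::sync::{Arc, RwLock};

#[derive(Clone, Copy, PartialEq, Eq, Debug)]
#[allow(dead_code)]
enum JobState { Queued, Analyzing, Resuming, Encoding, Remuxing, Skipped, Completed, Failed, Cancelled }

struct Transcoder {
    // Job ids with a pending cancel request.
    cancel_requested: Arc<RwLock<HashSet<i64>>>,
}

impl Transcoder {
    fn new() -> Self {
        Self { cancel_requested: Arc::new(RwLock::new(HashSet::new())) }
    }

    fn request_cancel(&self, job_id: i64) {
        self.cancel_requested.write().unwrap().insert(job_id);
    }

    /// Returns the state that would actually be persisted: a pending cancel
    /// wins over any transition the pipeline attempts afterwards.
    fn update_job_state(&self, job_id: i64, new_state: JobState) -> JobState {
        let cancel_pending = self.cancel_requested.read().unwrap().contains(&job_id);
        let effective = if cancel_pending { JobState::Cancelled } else { new_state };
        // Terminal states clear the entry so the set does not grow unbounded.
        if matches!(
            effective,
            JobState::Completed | JobState::Failed | JobState::Cancelled | JobState::Skipped
        ) {
            self.cancel_requested.write().unwrap().remove(&job_id);
        }
        effective
    }
}
```

After request_cancel(id), any checkpoint write for that job resolves to Cancelled, closing the window where Skipped or Encoding could land after the cancel was issued.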


[P1-2] VideoToolbox quality controls are effectively ignored

Status: RESOLVED

Files:

  • src/media/planner.rs:630-650
  • src/media/ffmpeg/videotoolbox.rs:25-54
  • src/config.rs:85-92

Severity: P1

Problem:

The planner used to emit RateControl::Cq values that were incorrectly mapped for VideoToolbox, resulting in uncalibrated or inverted quality.

Fix:

Fixed the mapping in videotoolbox.rs so the CRF-style input (clamped to 1-51, lower is better, matching user expectations from x264/x265) is inverted onto VideoToolbox's -q:v scale (1-100, higher is better). Updated QualityProfile in config.rs to provide sane default values (24, 28, 32) for VideoToolbox quality.
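For reference, FFmpeg's VideoToolbox encoders treat a higher -q:v as better quality, the opposite direction of CRF. A hypothetical version of the inversion (the exact formula in videotoolbox.rs may differ):

```rust
/// Map a CRF-style quality value onto VideoToolbox's -q:v scale.
/// Input: 1-51, lower is better (x264/x265 convention, clamped).
/// Output: 1-100, higher is better.
fn crf_to_vt_quality(crf: u32) -> u32 {
    let crf = crf.clamp(1, 51);
    // Linear inversion: CRF 1 -> 100, CRF 51 -> 1.
    ((51 - crf) * 99) / 50 + 1
}
```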


P2 Issues


[P2-1] Convert does not reuse subtitle/container compatibility checks

Status: RESOLVED

Files:

  • src/conversion.rs:372-380
  • src/media/planner.rs

Severity: P2

Problem:

The conversion path was not validating subtitle/container compatibility, leading to FFmpeg runtime failures instead of early validation errors.

Fix:

Integrated crate::media::planner::subtitle_copy_supported into src/conversion.rs:build_subtitle_plan. The "copy" mode now returns an AlchemistError::Config if the combination is unsupported.
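The shape of the early check, with an illustrative (not exhaustive) support matrix standing in for planner::subtitle_copy_supported:

```rust
/// Illustrative support matrix; the real one lives in src/media/planner.rs.
fn subtitle_copy_supported(codec: &str, container: &str) -> bool {
    match (codec, container) {
        // mov_text is the subtitle codec FFmpeg muxes natively into MP4.
        ("mov_text", "mp4") => true,
        // Matroska accepts most text and bitmap subtitle formats.
        (_, "mkv") => true,
        // Everything else (e.g. subrip/ass into mp4) requires transcoding.
        _ => false,
    }
}

/// Sketch of the fail-fast behavior: "copy" mode errors at plan time
/// instead of surfacing an FFmpeg runtime failure later.
fn build_subtitle_plan(codec: &str, container: &str) -> Result<&'static str, String> {
    if subtitle_copy_supported(codec, container) {
        Ok("copy")
    } else {
        // Stands in for AlchemistError::Config in the real code.
        Err(format!("subtitle codec {codec} cannot be copied into {container}"))
    }
}
```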


[P2-2] Completed job metadata omitted at the API layer

Status: RESOLVED

Files:

  • src/db.rs:254-263
  • src/media/pipeline.rs:599
  • src/server/jobs.rs:343

Severity: P2

Problem:

Job details required a live re-probe of the input file to show metadata, which failed if the file was moved or deleted after completion.

Fix:

Added input_metadata_json column to the jobs table (migration 20260412000000_store_job_metadata.sql). The pipeline now stores the metadata string immediately after analysis. get_job_detail_handler reads this stored value, ensuring metadata is always available even if the source file is missing.


[P2-3] LAN-only setup exposed to reverse proxy misconfig

Status: RESOLVED

Files:

  • src/config.rs — SystemConfig.trusted_proxies
  • src/server/mod.rs — AppState.trusted_proxies, AppState.setup_token
  • src/server/middleware.rs — is_trusted_peer, request_ip, auth_middleware

Severity: P2

Problem:

The setup wizard gate trusted all private/loopback IPs for proxy header forwarding. Behind a misconfigured proxy that did not set forwarding headers, the client address fell back to the proxy's own IP (e.g. 127.0.0.1), which passed the LAN check and left the setup endpoint reachable from external traffic.

Fix:

Added two independent security layers:

  1. trusted_proxies: Vec<String> to SystemConfig. When non-empty, only those exact IPs (plus loopback) are trusted for proxy header forwarding instead of all RFC-1918 ranges. Empty = previous behavior preserved.
  2. ALCHEMIST_SETUP_TOKEN env var. When set, setup endpoints require ?token=<value> query param regardless of client IP. Token mode takes precedence over IP-based LAN check.
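Layer 1's peer check might look like this (a sketch; the real is_trusted_peer lives in middleware.rs and parses the configured strings up front):

```rust
use std::net::IpAddr;

fn is_trusted_peer(peer: IpAddr, trusted_proxies: &[IpAddr]) -> bool {
    // Loopback is always trusted.
    if peer.is_loopback() {
        return true;
    }
    // Non-empty list: only exact matches are trusted for header forwarding.
    if !trusted_proxies.is_empty() {
        return trusted_proxies.contains(&peer);
    }
    // Empty list: previous behavior (all RFC-1918 ranges trusted).
    match peer {
        IpAddr::V4(v4) => v4.is_private(),
        IpAddr::V6(_) => false, // simplified; real code may handle ULAs etc.
    }
}
```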

[P2-4] N+1 DB update in batch cancel

Status: RESOLVED

Files:

  • src/server/jobs.rs — batch_jobs_handler

Severity: P2

Problem:

batch_jobs_handler's "cancel" action iterated over jobs and called request_job_cancel, which performed an individual update_job_status query per job, so cancelling a large number of jobs triggered N queries.

Fix:

Restructured the "cancel" branch in batch_jobs_handler. Orchestrator in-memory operations (add_cancel_request, cancel_job) still run per-job, but all DB status updates are batched into a single db.batch_cancel_jobs(&ids) call (which already existed in db.rs). Jobs that resolve immediately (Queued, plus Analyzing/Resuming jobs that were successfully signalled) are collected and written in one UPDATE ... WHERE id IN (...) query.
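A sketch of the placeholder construction behind the single batched UPDATE (the SQL text is illustrative; batch_cancel_jobs binds the ids when executing):

```rust
/// Build one parameterized UPDATE covering all cancelled job ids, instead of
/// issuing one UPDATE per job. Returns None for an empty id list, since
/// `IN ()` is not valid SQL.
fn batch_cancel_sql(ids: &[i64]) -> Option<String> {
    if ids.is_empty() {
        return None;
    }
    // One positional placeholder per id, joined into the IN (...) list.
    let placeholders = vec!["?"; ids.len()].join(", ");
    Some(format!(
        "UPDATE jobs SET status = 'Cancelled' WHERE id IN ({placeholders})"
    ))
}
```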


[P2-5] Missing archived filter in health and stats queries

Status: RESOLVED

Files:

  • src/db.rs — get_aggregated_stats, get_job_stats, get_health_summary

Severity: P2

Problem:

get_health_summary and get_aggregated_stats (total_jobs) did not include AND archived = 0, so archived (deleted) jobs were incorrectly counted in library health metrics and total job counts.

Fix:

Added AND archived = 0 to all three affected queries: total_jobs and completed_jobs subqueries in get_aggregated_stats, the GROUP BY status query in get_job_stats, and both subqueries in get_health_summary. Updated tests that were asserting the old (incorrect) behavior.


[P2-6] Daily summary notifications bypass SSRF protections

Status: RESOLVED

Files:

  • src/notifications.rs — build_safe_client(), send(), send_daily_summary_target()

Severity: P2

Problem:

send_daily_summary_target() used Client::new() without any SSRF defences, while send() applied DNS timeout, private-IP blocking, no-redirect policy, and request timeout.

Fix:

Extracted all client-building logic into build_safe_client(&self, target) which applies the full SSRF defence stack. Both send() and send_daily_summary_target() now use this shared helper.
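One layer of that stack, the resolved-address filter, sketched as a standalone predicate (illustrative; the real build_safe_client additionally applies DNS and request timeouts and a no-redirect policy):

```rust
use std::net::IpAddr;

/// Reject resolved addresses in ranges an outbound webhook should never
/// reach: loopback, RFC-1918 private, link-local, and unspecified.
/// (IPv6 handling here is simplified for illustration.)
fn is_forbidden_target(ip: IpAddr) -> bool {
    match ip {
        IpAddr::V4(v4) => {
            v4.is_loopback() || v4.is_private() || v4.is_link_local() || v4.is_unspecified()
        }
        IpAddr::V6(v6) => v6.is_loopback() || v6.is_unspecified(),
    }
}
```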


[P2-7] Silent reprobe failure corrupts saved encode stats

Status: RESOLVED

Files:

  • src/media/pipeline.rs — finalize_job() duration reprobe

Severity: P2

Problem:

When a completed encode's metadata has duration_secs <= 0.0, the pipeline reprobes the output file to get the actual duration. If the reprobe failed, the error was silently swallowed via .ok() and the duration defaulted to 0.0, poisoning downstream stats.

Fix:

Replaced .ok().and_then().unwrap_or(0.0) chain with explicit match that logs the error via tracing::warn! and falls through to 0.0. Existing guards at the stats computation lines already handle duration <= 0.0 correctly — operators now see why stats are zeroed.


Technical Debt


[TD-1] db.rs is a 3481-line monolith

Status: RESOLVED

File: src/db/ (was src/db.rs)

Severity: TD

Problem:

The database layer had grown to nearly 3500 lines. Every query, migration flag, and state enum was in one file, making navigation and maintenance difficult.

Fix:

Split into a src/db/ module with 8 focused submodules:

  • mod.rs — Db struct, init, migrations, hash fns
  • types.rs — all type definitions
  • events.rs — event enums + channels
  • jobs.rs — job CRUD/filtering/decisions
  • stats.rs — encode/aggregated/daily stats
  • config.rs — watch dirs/profiles/notifications/schedules/file settings/preferences
  • conversion.rs — ConversionJob CRUD
  • system.rs — auth/sessions/API tokens/logs/health

All tests moved alongside their methods. Public API unchanged — all types re-exported from db/mod.rs.


[TD-2] AlchemistEvent legacy bridge is dead weight

Status: RESOLVED

Files:

  • src/db.rs — enum and From impls removed
  • src/media/pipeline.rs, src/media/executor.rs, src/media/processor.rs — legacy tx channel removed
  • src/notifications.rs — migrated to typed EventChannels (jobs + system)
  • src/server/mod.rs, src/main.rs — legacy channel removed from AppState/RunServerArgs

Severity: TD

Problem:

AlchemistEvent was a legacy event type duplicated by JobEvent, ConfigEvent, and SystemEvent. All senders were emitting events on both channels.

Fix:

Migrated the notification system (the sole consumer) to subscribe to EventChannels.jobs and EventChannels.system directly. Added SystemEvent::EngineIdle variant. Removed AlchemistEvent enum, its From impls, the legacy tx broadcast channel from all structs, and the pub use from lib.rs.


[TD-3] pipeline.rs legacy AlchemistEvent::Progress stub

Status: RESOLVED

Files:

  • src/media/pipeline.rs:1228

Severity: TD

Problem:

The pipeline used to emit zeroed progress events that could overwrite real stats from the executor.

Fix:

Emission removed. A comment at line 1228-1229 confirms that AlchemistEvent::Progress is no longer emitted from the pipeline wrapper.


[TD-4] Silent .ok() on pipeline decision and attempt DB writes

Status: RESOLVED

Files:

  • src/media/pipeline.rs — all add_decision, insert_encode_attempt, upsert_job_failure_explanation, and add_log call sites

Severity: TD

Problem:

Decision records, encode attempt records, failure explanations, and error logs were written with .ok() or let _ =, silently discarding DB write failures. These records are the only audit trail of why a job was skipped/transcoded/failed.

Fix:

Replaced all .ok() / let _ = patterns on add_decision, insert_encode_attempt, upsert_job_failure_explanation, and add_log calls with if let Err(e) = ... { tracing::warn!(...) }. Pipeline still continues on failure, but operators now see the error.


[TD-5] Correlated subquery for sort-by-size in job listing

Status: RESOLVED

Files:

  • src/db.rs — get_jobs_filtered() query

Severity: TD

Problem:

Sorting jobs by file size used a correlated subquery in ORDER BY, executing one subquery per row and producing NULL for jobs without encode_stats.

Fix:

Added LEFT JOIN encode_stats es ON es.job_id = j.id to the base query. Sort column changed to COALESCE(es.input_size_bytes, 0), ensuring jobs without stats sort as 0 (smallest) instead of NULL.
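The Option-to-zero behavior that COALESCE provides can be mirrored in Rust terms (types are illustrative, not the project's actual row structs):

```rust
/// A job row after the LEFT JOIN: input_size_bytes is None when the job
/// has no encode_stats row (the SQL NULL case).
struct JobRow {
    id: i64,
    input_size_bytes: Option<u64>,
}

/// Mirror of ORDER BY COALESCE(es.input_size_bytes, 0): jobs without stats
/// sort as 0 (smallest) instead of an engine-defined NULL placement.
fn sort_by_size_asc(rows: &mut Vec<JobRow>) {
    rows.sort_by_key(|r| r.input_size_bytes.unwrap_or(0));
}
```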


Reliability Gaps


[RG-1] No encode resume after crash or restart

Status: PARTIALLY RESOLVED

Files:

  • src/main.rs:320
  • src/media/processor.rs:255

Severity: RG

Problem:

Interrupted encodes were left in Encoding state and orphaned temp files remained on disk.

Fix:

Implemented db.reset_interrupted_jobs() in main.rs which resets Encoding, Remuxing, Resuming, and Analyzing jobs to Queued on startup. Orphaned temp files are also detected and removed. Full bitstream-level resume (resuming from the middle of a file) is still missing.


[RG-2] AMD VAAPI/AMF hardware paths unvalidated

Files:

  • src/media/ffmpeg/vaapi.rs
  • src/media/ffmpeg/amf.rs

Severity: RG

Problem:

Hardware paths for AMD (VAAPI on Linux, AMF on Windows) were implemented without real hardware validation.

Fix:

Verify exact flag compatibility on AMD hardware and add integration tests gated on GPU presence.


UX Gaps


[UX-1] Queued jobs show no position or estimated wait time

Status: RESOLVED

Files:

  • src/db.rs — get_queue_position
  • src/server/jobs.rs — JobDetailResponse.queue_position
  • web/src/components/jobs/JobDetailModal.tsx
  • web/src/components/jobs/types.ts — JobDetail.queue_position

Severity: UX

Problem:

Queued jobs only showed "Waiting" without indicating their position in the priority queue.

Fix:

Implemented db.get_queue_position(job_id) which counts jobs with higher priority or earlier created_at (matching the priority DESC, created_at ASC dequeue order). Added queue_position: Option<u32> to JobDetailResponse — populated only when status == Queued. Frontend shows Queue position: #N in the empty state card in JobDetailModal.
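The counting rule, sketched over an in-memory queue (fields are illustrative; the real version runs as a SQL COUNT):

```rust
/// Minimal stand-in for a queued job row.
struct QueuedJob {
    id: i64,
    priority: i32,
    created_at: i64, // unix seconds
}

/// A job's position is 1 + the number of queued jobs that dequeue before it
/// under ORDER BY priority DESC, created_at ASC. None if the id is not queued.
fn queue_position(queue: &[QueuedJob], job_id: i64) -> Option<u32> {
    let me = queue.iter().find(|j| j.id == job_id)?;
    let ahead = queue
        .iter()
        .filter(|j| {
            j.priority > me.priority
                || (j.priority == me.priority && j.created_at < me.created_at)
        })
        .count();
    Some(ahead as u32 + 1)
}
```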


[UX-2] No way to add a single file to the queue via the UI

Severity: UX

Problem:

Jobs only enter the queue via full library scans. No manual "Enqueue path" exists in the UI.

Fix:

Add POST /api/jobs/enqueue and an "Add file" action in the JobsToolbar.


[UX-3] Workers-blocked reason not surfaced for queued jobs

Severity: UX

Problem:

Users cannot see why a job is stuck in Queued (paused, scheduled, or slots full).

Fix:

Add GET /api/processor/status and show the reason in the job detail.


Feature Gaps


[FG-4] Intelligence page content not actionable

Files:

  • web/src/components/LibraryIntelligence.tsx

Severity: FG

Problem:

Intelligence page is informational only; recommendations cannot be acted upon directly from the page.

Fix:

Add "Queue all" for remux opportunities and "Review" actions for duplicates.


What To Fix Next

  1. [UX-2] Single file enqueue — New feature.
  2. [UX-3] Workers-blocked reason — New feature.
  3. [FG-4] Intelligence page actions — New feature.
  4. [RG-2] AMD VAAPI/AMF validation — Needs real hardware.