# Software Hashes Plan

Plan for adding a derived `software_hashes` table, its update pipeline, and a JSON snapshot lifecycle that survives DB wipes.

---

## 1) Goals and Scope (Plan Step 1)

- Create and maintain `software_hashes` for (at this stage) tape-image downloads.
- Preserve existing `_CONTENTS` folders; only create missing ones.
- Export `software_hashes` to JSON after each bulk update.
- Reimport the `software_hashes` JSON during a DB wipe in `bin/import_mysql.sh` (or a helper script it invokes).
- Ensure all scripts are idempotent and resume-safe.

---

## 2) Confirm Pipeline Touchpoints (Plan Step 2)

- Verify that `bin/import_mysql.sh` is the authoritative DB wipe/import entry point.
- Confirm that `bin/sync-downloads.mjs` remains responsible only for CDN cache sync.
- Confirm that `src/server/schema/zxdb.ts` uses `downloads.id` as the natural FK target.

---

## 3) Define Data Model: `software_hashes` (Plan Step 3)

### Table naming and FK alignment

- Table: `software_hashes`.
- FK: `download_id` → `downloads.id`.
- Column names follow the existing DB `snake_case` conventions.

### Planned columns

- `download_id` (PK or unique index; FK to `downloads.id`)
- `md5`
- `crc32`
- `size_bytes`
- `updated_at`

### Planned indexes / constraints

- Unique index on `download_id`.
- Index on `md5` for reverse lookups.
- Index on `crc32` for reverse lookups.

---

## 4) Define JSON Snapshot Format (Plan Step 4)

### Location

- Default: `data/zxdb/software_hashes.json` (or another agreed path).

### Structure

```json
{
  "exportedAt": "2026-02-17T15:18:00.000Z",
  "rows": [
    {
      "download_id": 123,
      "md5": "...",
      "crc32": "...",
      "size_bytes": 12345,
      "updated_at": "2026-02-17T15:18:00.000Z"
    }
  ]
}
```

### Planned import policy

- If the snapshot exists: truncate `software_hashes` and bulk insert.
- If the snapshot is missing: log and continue without error.

---

## 5) Implement Tape Image Update Workflow (Plan Step 5)

### Planned script

- `bin/update-software-hashes.mjs` (name can be adjusted).
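As a sketch of the hashing the planned script would perform per file: MD5 and size come from Node's built-in `crypto` and `fs/promises`; CRC32 is implemented inline since it is not available in all Node stdlib versions. The function names (`crc32`, `hashTapeFile`) are illustrative, not a committed API.

```javascript
// Sketch: compute md5, crc32, and size_bytes for one tape file.
import { createHash } from 'node:crypto';
import { readFile } from 'node:fs/promises';

// Standard CRC-32 (IEEE 802.3, polynomial 0xEDB88320) lookup table, built once.
const CRC_TABLE = new Uint32Array(256).map((_, n) => {
  let c = n;
  for (let k = 0; k < 8; k++) c = c & 1 ? 0xedb88320 ^ (c >>> 1) : c >>> 1;
  return c;
});

function crc32(buf) {
  let c = 0xffffffff;
  for (const byte of buf) c = CRC_TABLE[(c ^ byte) & 0xff] ^ (c >>> 8);
  // Final XOR, forced unsigned, rendered as 8 hex digits.
  return ((c ^ 0xffffffff) >>> 0).toString(16).padStart(8, '0');
}

async function hashTapeFile(path) {
  const buf = await readFile(path);
  return {
    md5: createHash('md5').update(buf).digest('hex'),
    crc32: crc32(buf),
    size_bytes: buf.length,
  };
}
```

Reading the whole file into memory is acceptable here because tape images are small; for larger filetypes the MD5 side could switch to a stream without changing the returned shape.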
### Planned input dataset

- Query `downloads` for tape-image rows (filter by `filetype_id` or via a join to the `filetypes` table).

### Planned per-item process

1. Resolve the local zip path using the same CDN mapping used by `sync-downloads`.
2. Compute the `_CONTENTS` folder name: `_CONTENTS` (exact match).
3. If `_CONTENTS` exists, keep it untouched.
4. If it is missing, extract the zip into `_CONTENTS` using a library that avoids shell expansion issues with brackets.
5. Locate the tape file inside (`.tap`, `.tzx`, `.pzx`, `.csw`):
   - Apply a deterministic priority order.
   - If multiple candidates remain, log and skip (or record the ambiguity).
6. Compute `md5`, `crc32`, and `size_bytes` for the selected file.
7. Upsert into `software_hashes` keyed by `download_id`.

### Planned error handling

- Log missing zips or missing tape files.
- Continue after recoverable errors; fail only on critical DB errors.

---

## 6) Implement JSON Export Lifecycle (Plan Step 6)

- After each bulk update, export `software_hashes` to JSON.
- Write atomically (temp file + rename).
- Include an `exportedAt` timestamp in the snapshot.

---

## 7) Reimport During Wipe (`bin/import_mysql.sh`) (Plan Step 7)

### Planned placement

- Immediately after database creation and the ZXDB SQL import complete.

### Planned behavior

- Attempt to read the JSON snapshot.
- If it is present, truncate and reinsert `software_hashes`.
- Log the imported row count.

---

## 8) Add Idempotency and Resume Support (Plan Step 8)

- State file similar to `.sync-downloads.state.json` to track the last `download_id` processed.
- CLI flags:
  - `--resume` (default)
  - `--start-from-id`
  - `--rebuild-all`
- Reprocess when the zip file's size or mtime changes.

---

## 9) Validation Checklist (Plan Step 9)

- `_CONTENTS` folders are never deleted.
- Hashes match the expected MD5/CRC32 for known samples.
- The JSON snapshot is created and reimported correctly.
- Reverse lookup by `md5`/`crc32`/`size_bytes` identifies misnamed files.
- The script can resume safely after an interruption.
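The atomic write from Step 6 can be sketched as follows, using the snapshot shape from Step 4. The default path and the `exportSnapshot` name are assumptions for illustration; only the temp-file-then-rename pattern is the point.

```javascript
// Sketch: atomically export software_hashes rows to the JSON snapshot.
// The temp file lives in the same directory as the target, so the final
// rename() stays on one filesystem and readers never see a partial file.
import { mkdir, rename, writeFile } from 'node:fs/promises';
import { dirname, join } from 'node:path';

async function exportSnapshot(rows, path = 'data/zxdb/software_hashes.json') {
  const snapshot = { exportedAt: new Date().toISOString(), rows };
  const tmp = join(dirname(path), `.software_hashes.${process.pid}.tmp`);
  await mkdir(dirname(path), { recursive: true });
  await writeFile(tmp, JSON.stringify(snapshot, null, 2) + '\n');
  await rename(tmp, path); // atomic replace on POSIX within one filesystem
}
```

Writing to, say, `/tmp` and renaming into `data/zxdb/` would silently lose the atomicity guarantee, which is why the temp file is created next to the target.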
---

## 10) Open Questions / Confirmations (Plan Step 10)

- Final `software_hashes` column list and types.
- Exact JSON snapshot path.
- Filetype IDs that map to “Tape Image” in `downloads`.