Software Hashes Plan

Plan for adding a derived software_hashes table, its update pipeline, and a JSON snapshot lifecycle so that the hashes survive DB wipes.


1) Goals and Scope (Plan Step 1)

  • Create and maintain software_hashes, initially for tape-image downloads only.
  • Preserve existing _CONTENTS folders; only create missing ones.
  • Export software_hashes to JSON after each bulk update.
  • Reimport software_hashes JSON during DB wipe in bin/import_mysql.sh (or a helper script it invokes).
  • Ensure all scripts are idempotent and resume-safe.

2) Confirm Pipeline Touchpoints (Plan Step 2)

  • Verify bin/import_mysql.sh is the authoritative DB wipe/import entry point.
  • Confirm bin/sync-downloads.mjs remains responsible only for CDN cache sync.
  • Confirm src/server/schema/zxdb.ts uses downloads.id as the natural FK target.

3) Define Data Model: software_hashes (Plan Step 3)

Table naming and FK alignment

  • Table: software_hashes.
  • FK: download_id → downloads.id.
  • Column names follow existing DB snake_case conventions.

Planned columns

  • download_id (PK or unique index; FK to downloads.id)
  • md5
  • crc32
  • size_bytes
  • updated_at

Planned indexes / constraints

  • Unique index on download_id.
  • Index on md5 for reverse lookup.
  • Index on crc32 for reverse lookup.

4) Define JSON Snapshot Format (Plan Step 4)

Location

  • Default: data/zxdb/software_hashes.json (or another agreed path).

Structure

{
  "exportedAt": "2026-02-17T15:18:00.000Z",
  "rows": [
    {
      "download_id": 123,
      "md5": "...",
      "crc32": "...",
      "size_bytes": 12345,
      "updated_at": "2026-02-17T15:18:00.000Z"
    }
  ]
}

Planned import policy

  • If snapshot exists: truncate software_hashes and bulk insert.
  • If snapshot missing: log and continue without error.
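The import policy above could be sketched roughly as follows. This is only an illustration: loadSnapshotRows is a hypothetical helper, the caller is assumed to pass null when the snapshot file is absent, and the row checks are a minimal assumption because the final column types are still an open question.

```javascript
// Hypothetical helper: turns snapshot text (or null when the file is absent)
// into the rows to bulk insert after truncating software_hashes.
function loadSnapshotRows(text, log = console.log) {
  if (text === null) {
    // Snapshot missing: log and continue without error.
    log("software_hashes snapshot not found; skipping reimport");
    return null;
  }
  const snapshot = JSON.parse(text);
  if (!Array.isArray(snapshot.rows)) {
    throw new Error("snapshot has no rows array");
  }
  for (const row of snapshot.rows) {
    // Assumed minimal shape check; real column types are still to be confirmed.
    const ok =
      Number.isInteger(row.download_id) &&
      typeof row.md5 === "string" &&
      typeof row.crc32 === "string";
    if (!ok) {
      throw new Error(`malformed snapshot row: ${JSON.stringify(row)}`);
    }
  }
  log(`software_hashes snapshot: ${snapshot.rows.length} rows (exported ${snapshot.exportedAt})`);
  return snapshot.rows;
}
```

Validating every row before the table is truncated means a damaged snapshot is rejected while the existing data is still intact.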

5) Implement Tape Image Update Workflow (Plan Step 5)

Planned script

  • bin/update-software-hashes.mjs (name can be adjusted).

Planned input dataset

  • Query downloads for tape-image rows (filter by filetype_id or joined filetypes table).

Planned per-item process

  1. Resolve local zip path using the same CDN mapping used by sync-downloads.
  2. Compute _CONTENTS folder name: <zip filename>_CONTENTS (exact match).
  3. If _CONTENTS exists, keep it untouched.
  4. If missing, extract zip into _CONTENTS using a library that avoids shell expansion issues with brackets.
  5. Locate tape file inside (.tap, .tzx, .pzx, .csw):
    • Apply a deterministic priority order.
    • If multiple candidates remain, log and skip (or record ambiguity).
  6. Compute md5, crc32, and size_bytes for the selected file.
  7. Upsert into software_hashes keyed by download_id.
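Steps 2 and 5 above can be sketched as two small pure functions. The names contentsDirFor and pickTapeFile are hypothetical, the priority order shown is an assumption (the plan only requires that it be deterministic), and the _CONTENTS folder is assumed to sit alongside the zip.

```javascript
// Assumed priority order; the plan only says it must be deterministic.
const TAPE_PRIORITY = [".tap", ".tzx", ".pzx", ".csw"];

// Step 2: <zip filename>_CONTENTS, built by plain string concatenation so
// brackets and spaces in the zip name are preserved exactly.
function contentsDirFor(zipPath) {
  return `${zipPath}_CONTENTS`;
}

// Step 5: pick the tape file from the names found inside _CONTENTS.
// Returns a status object so the caller can log and skip ambiguous cases.
function pickTapeFile(fileNames) {
  for (const ext of TAPE_PRIORITY) {
    const matches = fileNames
      .filter((name) => name.toLowerCase().endsWith(ext))
      .sort();
    if (matches.length === 1) return { status: "ok", file: matches[0] };
    if (matches.length > 1) return { status: "ambiguous", candidates: matches };
  }
  return { status: "missing" };
}
```

For example, pickTapeFile(["README.txt", "Game.TZX"]) selects Game.TZX, while two .tap files in the same archive come back as ambiguous rather than being guessed at.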

Planned error handling

  • Log missing zips or missing tape files.
  • Continue after recoverable errors; fail only on critical DB errors.
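One possible shape for this policy is a loop that skips only errors explicitly marked as recoverable. RecoverableError and processAll are hypothetical names, and the loop is synchronous purely to keep the sketch short; the real script would most likely be async.

```javascript
// Hypothetical marker for errors that should be logged and skipped,
// such as a missing zip or an archive with no tape file.
class RecoverableError extends Error {}

// Runs processOne for each download row and returns a summary.
function processAll(downloads, processOne, log = console.warn) {
  const summary = { ok: 0, skipped: 0 };
  for (const download of downloads) {
    try {
      processOne(download);
      summary.ok += 1;
    } catch (err) {
      // Anything not marked recoverable (for example a failed DB write)
      // is rethrown and aborts the run.
      if (!(err instanceof RecoverableError)) throw err;
      log(`download ${download.id}: ${err.message}`);
      summary.skipped += 1;
    }
  }
  return summary;
}
```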

6) Implement JSON Export Lifecycle (Plan Step 6)

  • After each bulk update, export software_hashes to JSON.
  • Write atomically (temp file + rename).
  • Include exportedAt timestamp in snapshot.
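The temp-file-plus-rename step could look like the sketch below. writeSnapshotAtomic is a hypothetical name and the temp file suffix is an arbitrary choice; the approach relies on the temp file being in the same directory as the target, so the rename stays on one filesystem.

```javascript
import { mkdtempSync, readdirSync, readFileSync, renameSync, writeFileSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";

// Writes the snapshot next to the target first, then renames it into place,
// so readers never observe a half-written software_hashes.json.
function writeSnapshotAtomic(targetPath, rows, now = new Date()) {
  const snapshot = { exportedAt: now.toISOString(), rows };
  const tempPath = `${targetPath}.tmp-${process.pid}`;
  writeFileSync(tempPath, JSON.stringify(snapshot, null, 2) + "\n");
  renameSync(tempPath, targetPath);
  return snapshot;
}

// Usage against a throwaway directory.
const demoDir = mkdtempSync(join(tmpdir(), "software-hashes-"));
const demoPath = join(demoDir, "software_hashes.json");
writeSnapshotAtomic(
  demoPath,
  [{ download_id: 123, md5: "...", crc32: "...", size_bytes: 12345 }],
  new Date("2026-02-17T15:18:00.000Z")
);
const written = JSON.parse(readFileSync(demoPath, "utf8"));
// After a successful rename, no temp file should be left behind.
const leftovers = readdirSync(demoDir).filter((name) => name !== "software_hashes.json");
```

If the process dies before the rename, the previous snapshot is left untouched and only a stray temp file remains.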

7) Reimport During Wipe (bin/import_mysql.sh) (Plan Step 7)

Planned placement

  • Immediately after database creation and ZXDB SQL import completes.

Planned behavior

  • Attempt to read JSON snapshot.
  • If present, truncate and reinsert software_hashes.
  • Log imported row count.

8) Add Idempotency and Resume Support (Plan Step 8)

  • State file similar to .sync-downloads.state.json to track last download_id processed.
  • CLI flags:
    • --resume (default)
    • --start-from-id
    • --rebuild-all
  • Reprocess when zip file size or mtime changes.
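The flag and change-detection rules above might combine as in this sketch. The state shape (a lastDownloadId plus per-download zip size and mtime) is an assumption, since the plan only mentions tracking the last download_id, and resolveStartId and needsProcessing are hypothetical names.

```javascript
// Assumed state shape:
// { lastDownloadId: 41, zips: { "41": { size: 1024, mtimeMs: 1700000000000 } } }

// Returns the first download_id (inclusive) that the run should consider.
function resolveStartId(flags, state) {
  if (flags.rebuildAll) return 0;
  if (flags.startFromId !== undefined) return flags.startFromId;
  // --resume is the default: continue after the last processed id.
  return state.lastDownloadId === undefined ? 0 : state.lastDownloadId + 1;
}

// Decides whether one zip has to be (re)hashed.
function needsProcessing(downloadId, zipStat, flags, state) {
  if (flags.rebuildAll) return true;
  const seen = state.zips ? state.zips[String(downloadId)] : undefined;
  if (!seen) return true;
  // Reprocess when the zip's size or mtime differs from the recorded values.
  return seen.size !== zipStat.size || seen.mtimeMs !== zipStat.mtimeMs;
}
```

How a changed zip below the resume point gets noticed is left open here; one option would be a periodic pass that calls needsProcessing for every row regardless of lastDownloadId.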

9) Validation Checklist (Plan Step 9)

  • _CONTENTS folders are never deleted.
  • Hashes match expected MD5/CRC32 for known samples.
  • JSON snapshot is created and reimported correctly.
  • Reverse lookup by md5/crc32/size_bytes identifies misnamed files.
  • Script can resume safely after interruption.
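Before comparing against real samples, the hashing code itself can be checked against standard test vectors. The sketch below assumes hashes are stored as lowercase hex strings (the column types are still open) and uses a table-driven CRC-32 with the same polynomial that zip uses; hashBuffer and crc32Hex are hypothetical names.

```javascript
import { createHash } from "node:crypto";

// Lookup table for CRC-32 (reflected polynomial 0xEDB88320).
const CRC_TABLE = new Uint32Array(256);
for (let n = 0; n < 256; n++) {
  let c = n;
  for (let k = 0; k < 8; k++) {
    c = c & 1 ? 0xedb88320 ^ (c >>> 1) : c >>> 1;
  }
  CRC_TABLE[n] = c >>> 0;
}

// Returns the CRC-32 of a Buffer or Uint8Array as 8 lowercase hex digits.
function crc32Hex(buffer) {
  let crc = 0xffffffff;
  for (const byte of buffer) {
    crc = CRC_TABLE[(crc ^ byte) & 0xff] ^ (crc >>> 8);
  }
  return ((crc ^ 0xffffffff) >>> 0).toString(16).padStart(8, "0");
}

// Computes the three values planned for a software_hashes row.
function hashBuffer(buffer) {
  return {
    md5: createHash("md5").update(buffer).digest("hex"),
    crc32: crc32Hex(buffer),
    size_bytes: buffer.length,
  };
}

// "123456789" is the conventional CRC-32 check input; crc32 is "cbf43926".
const sample = hashBuffer(Buffer.from("123456789", "ascii"));
```

Holding the whole file in memory is acceptable for tape images, which are typically small; a streaming variant would only matter for much larger files.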

10) Open Questions / Confirmations (Plan Step 10)

  • Final software_hashes column list and types.
  • Exact JSON snapshot path.
  • Filetype IDs that map to “Tape Image” in downloads.