Skip to content

fix: validate downloaded model content + surface onboarding errors (#353, #355)#386

Merged
altic-dev merged 3 commits into
altic-dev:mainfrom
postoso:fix/353-355-download-validation
Jun 26, 2026
Merged

fix: validate downloaded model content + surface onboarding errors (#353, #355)#386
altic-dev merged 3 commits into
altic-dev:mainfrom
postoso:fix/353-355-download-validation

Conversation

@postoso

@postoso postoso commented Jun 19, 2026

Copy link
Copy Markdown
Contributor

What

Two related onboarding/download robustness fixes.

#353 — reject HTML/markup pages saved as model files

A network proxy / secure web gateway can return an HTML block page with HTTP 200 in place of the real file. The downloader persisted that page as the model (e.g. coremldata.bin), permanently caching a corrupt model that never self-heals.

  • validateDownloadedFile now runs before the temp file is moved into place. It rejects markup by Content-Type (text/html, text/xml, application/xml) and by byte-sniffing the leading 512 bytes: after stripping a UTF-8 BOM + ASCII whitespace, any payload beginning with < followed by a markup-ish byte (ASCII letter, !, ?, /) is rejected. No artifact this downloader fetches begins with < (CoreML binaries; JSON {/[; model.milprogram), so false-positive risk is negligible.
  • A rejected/partial temp file is now removed instead of cached.

#355 — surface onboarding model-load failures

Onboarding load/download failures previously only logged, leaving the user stuck at a disabled Continue button with no message. The onboarding catch now sets the existing asr error state and the onboarding view presents a Retry/OK alert.

Verification

swiftlint --strict clean; xcodebuild build succeeds; 4 new unit tests for the markup detection pass (rejects <!doctype/<html/<script/<head/<meta/<!--/<?xml, incl. BOM + leading-whitespace variants; accepts JSON {/[, .mil, and binary/CoreML-magic prefixes).

Note

The Parakeet path fetches models through the upstream FluidAudio package (AsrModels.downloadAndLoad), which doesn't go through this downloader; this hardening covers the HuggingFace / Whisper / Nemotron / external-CoreML path. A FluidAudio-side guard would be a separate upstream change.

Closes #353
Closes #355

@altic-dev

Copy link
Copy Markdown
Owner

Hey man. Thanks for the PRs! I am working on a huge push and it might conflict with your PRs. Would appreciate if you can hold on for a week or lesser until I push 1.6.0 and we can work on pulling your changes in :) wdyt?

@postoso

postoso commented Jun 20, 2026

Copy link
Copy Markdown
Contributor Author

Sounds great — totally happy to hold. 🙌 Go knock out 1.6.0, no rush on my end. Once it lands I'll rebase my open PRs (#384, #385, #387, #386, and #378/diarization) on top so they're clean for you to pull in. Just give me a ping when it's out. 🙏

@postoso postoso force-pushed the fix/353-355-download-validation branch from dce8a90 to 29be8ee Compare June 22, 2026 16:23
@postoso

postoso commented Jun 22, 2026

Copy link
Copy Markdown
Contributor Author

Rebased onto 1.6.0. The #353 downloader validation applied cleanly. The #355 onboarding-error piece needed rework: 1.6 rewrote onboarding and renamed prepareOnboardingModel to prepareOnboardingRoute, so the original standalone .alert (its Retry called the old name) no longer compiled. Turns out 1.6's ContentView alert is bound to asr.showError and already wraps the onboarding view, and 1.6 already uses that pattern for delete failures but left download/load failures log-only. So I kept the real #355 fix (set errorTitle/errorMessage/showError in the prepare catch, so failures actually surface) and dropped the now-redundant standalone alert. Builds green, swiftlint clean. Flagging the dropped alert here in case you would rather onboarding own a dedicated Retry alert instead.

@postoso

postoso commented Jun 23, 2026

Copy link
Copy Markdown
Contributor Author

@codex review

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 29be8ee777

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

// Reject HTML error/block pages (e.g. a corporate proxy returning its
// notification page with HTTP 200) before persisting them as a model
// file, otherwise a corrupt payload is cached permanently. See #353.
try Self.validateDownloadedFile(at: tempUrl, response: response, relativePath: relativePath)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Validate cached artifacts before skipping downloads

This validation only runs for freshly downloaded temp files, but ensureModelsPresent builds pendingFiles using a plain fileExists check before downloadFile is called. For users who already hit #353 and have an HTML block page cached at a model path, that file is still treated as present and this new validator is never reached, so the corrupt model remains permanently stuck until they manually clear the cache. Please validate existing artifacts or force a re-download when a present file is detected as markup.

Useful? React with 👍 / 👎.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 7244b3abb3

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +107 to 108
if self.needsDownload(relativePath: rel, at: dest) {
pendingFiles.append(rel)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Validate caches before provider-level skips

For Nemotron/external CoreML caches that already look complete, this new cached-markup check is never reached: NemotronProvider.prepare returns after a file-existence-only modelsExistOnDisk() check, and ExternalCoreMLTranscriptionProvider.ensureArtifactsPresent can return after validation that only proves required entries/manifest metadata exist. A cache with a valid manifest or .mlpackage directory but an HTML page saved inside a binary therefore still skips ensureModelsPresent and remains permanently corrupted; move this validation into those preflight checks or force the downloader to run before returning.

Useful? React with 👍 / 👎.

@postoso

postoso commented Jun 23, 2026

Copy link
Copy Markdown
Contributor Author

Pushed a follow-up addressing the Codex review's cached-file point: ensureModelsPresent now re-validates files already on disk (reusing the same looksLikeHTML byte-sniff), so a corrupt HTML payload cached before this fix gets deleted and re-downloaded instead of staying stuck.

Codex's second point is fair too: NemotronProvider and ExternalCoreML can skip ensureModelsPresent via their own modelsExistOnDisk() checks, so a corrupt-but-present cache could still slip through at the provider preflight level. I left that out of this PR on purpose to keep it scoped to #353 (the downloader itself). Pushing the markup check down into each provider's presence logic is a broader change and a design call on where that validation should live, so it's probably better as a separate follow-up if you want it. Happy to do it.

@postoso

postoso commented Jun 23, 2026

Copy link
Copy Markdown
Contributor Author

Update on the provider-level point from the last Codex review: I went ahead and closed it in this PR rather than splitting it out. NemotronProvider and ExternalCoreML now byte-sniff their cached binaries through the same shared check, so a corrupt-but-present cache gets re-downloaded at the preflight too, not just in the downloader. So #386 now fully closes #353 end to end: downloader content-validation + re-validating already-cached files + both provider preflights. Codex re-reviewed and it's clean.

postoso and others added 3 commits June 26, 2026 00:24
…ltic-dev#353, altic-dev#355)

Reject HTML/markup proxy/block pages before persisting them as model files so a
corrupt payload (e.g. a corporate proxy notification page returned with HTTP 200)
is never cached permanently as a model artifact, and surface model download/load
failures during onboarding instead of silently logging them.

Content validation (altic-dev#353): reject any payload whose Content-Type is HTML/XML, or
whose leading bytes (after stripping a UTF-8 BOM + ASCII whitespace) begin with `<`
followed by a markup-ish byte (an ASCII letter, `!`, `?`, or `/`). No artifact this
downloader fetches legitimately begins with `<` (CoreML `.mlmodelc`/`.mlpackage`
payloads are binary or JSON, `.mil` starts with `program`, vocab is JSON, the
tokenizer is a SentencePiece binary), so this catches `<!doctype`, `<html`,
`<head>`, `<body>`, `<script>`, `<meta>`, `<!-- -->`, and `<?xml` block pages
without over-rejecting real model data. Only the first 512 bytes are read (model
weight files can be gigabytes).

Added unit tests covering markup rejection (incl. leading-whitespace + BOM
variants) and acceptance of JSON/array/binary/MIL model payloads.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ownloaded (altic-dev#353)

ensureModelsPresent built its pendingFiles list using only fileExists, so a
user who already had a corrupt HTML proxy/block page cached at a model path
(the exact altic-dev#353 population) never re-downloaded it: fileExists was true, the
file was skipped, and the download-time validator never ran -- leaving the
corrupt payload stuck until the cache was manually cleared.

Now needsDownload() also byte-sniffs each present file via the new static
cachedFileIsMarkup helper, which reuses the existing looksLikeHTML check on a
512-byte prefix (model files can be gigabytes). A present file that looks like
markup is deleted and re-queued for download so a clean copy is fetched;
unreadable files are left in place (conservative on read error, never delete on
uncertainty). Mirrors the download-time validation for the cached-file case.
…ot just the downloader (altic-dev#353)

NemotronProvider.prepare() and ExternalCoreMLTranscriptionProvider.ensureArtifactsPresent()
could decide models were present via a file-existence / manifest check and return without
running the downloader, so an HTML/markup payload cached in place of a model binary (a proxy
block page persisted before download-time validation existed) would slip through and load
forever. Add a shared HuggingFaceModelDownloader.cachedPayloadContainsMarkup(root:relativePaths:)
that recursively byte-sniffs the cached payload files, reusing the existing
cachedFileIsMarkup/looksLikeHTML detector, and gate both providers' preflight short-circuits on
it. Markup forces the existing download path, which deletes and re-fetches the corrupt files via
needsDownload. The check is local (offline-safe) and conservative on read errors so a valid cache
is never rejected on uncertainty; ExternalCoreML's outdated-bundle-stamp refresh still takes
precedence.
@postoso postoso force-pushed the fix/353-355-download-validation branch from a5e4b0d to 1668ee9 Compare June 26, 2026 04:25
@altic-dev

Copy link
Copy Markdown
Owner

Thanks for a beautiful PR. merged :)

@altic-dev altic-dev merged commit 755f0f5 into altic-dev:main Jun 26, 2026
1 check passed
@postoso

postoso commented Jun 26, 2026

Copy link
Copy Markdown
Contributor Author

Hey @altic-dev, thanks for merging #386, and congrats on 1.6.0. You'd mentioned holding my PRs until it shipped so we could pull them in after, so here's where the rest stand. All rebased onto 1.6.0 and building green, ready whenever you have a window:

No pressure on timing, and happy to rebase again if your next push moves things. Thanks for building this.

@altic-dev

Copy link
Copy Markdown
Owner

@postoso sure! Thanks for the PRs - very helpful. But it will make it insanely helpful if you could follow the PR template that we got and make it easy for me to understand the issues and the fix. Currrent PRs are way too Claude slop and it's very hard for me to review it manually and that's why i am not able to merge it as soon as possible.

  • Follow template
  • attach some video/screenshot of before and after or new. feature etc if possible
  • Codex review fixes ( This you're doing okay )

all of these. help me understand the issues and fixes,

The main reason is I need to make sure it doesn't break anything and works across thousands of computers out there Which makes it harder for me to just merge it when I think the Code seems okay, but there is a lot more verification I do internal data make sure it works pretty decently. Hope it makes sense.

@altic-dev

Copy link
Copy Markdown
Owner

Another example - #378

an amazing feature add on which I'd really love and would merge it instantly if there's more than just 100 lines and a PR request :) Makes it very very hard for me to review it.

"Is the code change here doing what it's supposed to do?"

  • To answer this - I need to know what you're trying to do. exact features and what do I see after compiling it. If I dont have both of these, then I need to work on it more to figure both and I still won't know what your intention was fully.

I hope this makes it a little easier. to understand. :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

2 participants