fix: validate downloaded model content + surface onboarding errors (#353, #355)#386
Conversation
|
Hey man. Thanks for the PRs! I am working on a huge push and it might conflict with your PRs. Would appreciate if you can hold on for a week or lesser until I push 1.6.0 and we can work on pulling your changes in :) wdyt? |
dce8a90 to
29be8ee
Compare
|
Rebased onto 1.6.0. The #353 downloader validation applied cleanly. The #355 onboarding-error piece needed rework: 1.6 rewrote onboarding and renamed prepareOnboardingModel to prepareOnboardingRoute, so the original standalone .alert (its Retry called the old name) no longer compiled. Turns out 1.6's ContentView alert is bound to asr.showError and already wraps the onboarding view, and 1.6 already uses that pattern for delete failures but left download/load failures log-only. So I kept the real #355 fix (set errorTitle/errorMessage/showError in the prepare catch, so failures actually surface) and dropped the now-redundant standalone alert. Builds green, swiftlint clean. Flagging the dropped alert here in case you would rather onboarding own a dedicated Retry alert instead. |
|
@codex review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 29be8ee777
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| // Reject HTML error/block pages (e.g. a corporate proxy returning its | ||
| // notification page with HTTP 200) before persisting them as a model | ||
| // file, otherwise a corrupt payload is cached permanently. See #353. | ||
| try Self.validateDownloadedFile(at: tempUrl, response: response, relativePath: relativePath) |
There was a problem hiding this comment.
Validate cached artifacts before skipping downloads
This validation only runs for freshly downloaded temp files, but ensureModelsPresent builds pendingFiles using a plain fileExists check before downloadFile is called. For users who already hit #353 and have an HTML block page cached at a model path, that file is still treated as present and this new validator is never reached, so the corrupt model remains permanently stuck until they manually clear the cache. Please validate existing artifacts or force a re-download when a present file is detected as markup.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 7244b3abb3
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| if self.needsDownload(relativePath: rel, at: dest) { | ||
| pendingFiles.append(rel) |
There was a problem hiding this comment.
Validate caches before provider-level skips
For Nemotron/external CoreML caches that already look complete, this new cached-markup check is never reached: NemotronProvider.prepare returns after a file-existence-only modelsExistOnDisk() check, and ExternalCoreMLTranscriptionProvider.ensureArtifactsPresent can return after validation that only proves required entries/manifest metadata exist. A cache with a valid manifest or .mlpackage directory but an HTML page saved inside a binary therefore still skips ensureModelsPresent and remains permanently corrupted; move this validation into those preflight checks or force the downloader to run before returning.
Useful? React with 👍 / 👎.
|
Pushed a follow-up addressing the Codex review's cached-file point: Codex's second point is fair too: |
|
Update on the provider-level point from the last Codex review: I went ahead and closed it in this PR rather than splitting it out. |
…ltic-dev#353, altic-dev#355) Reject HTML/markup proxy/block pages before persisting them as model files so a corrupt payload (e.g. a corporate proxy notification page returned with HTTP 200) is never cached permanently as a model artifact, and surface model download/load failures during onboarding instead of silently logging them. Content validation (altic-dev#353): reject any payload whose Content-Type is HTML/XML, or whose leading bytes (after stripping a UTF-8 BOM + ASCII whitespace) begin with `<` followed by a markup-ish byte (an ASCII letter, `!`, `?`, or `/`). No artifact this downloader fetches legitimately begins with `<` (CoreML `.mlmodelc`/`.mlpackage` payloads are binary or JSON, `.mil` starts with `program`, vocab is JSON, the tokenizer is a SentencePiece binary), so this catches `<!doctype`, `<html`, `<head>`, `<body>`, `<script>`, `<meta>`, `<!-- -->`, and `<?xml` block pages without over-rejecting real model data. Only the first 512 bytes are read (model weight files can be gigabytes). Added unit tests covering markup rejection (incl. leading-whitespace + BOM variants) and acceptance of JSON/array/binary/MIL model payloads. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ownloaded (altic-dev#353) ensureModelsPresent built its pendingFiles list using only fileExists, so a user who already had a corrupt HTML proxy/block page cached at a model path (the exact altic-dev#353 population) never re-downloaded it: fileExists was true, the file was skipped, and the download-time validator never ran -- leaving the corrupt payload stuck until the cache was manually cleared. Now needsDownload() also byte-sniffs each present file via the new static cachedFileIsMarkup helper, which reuses the existing looksLikeHTML check on a 512-byte prefix (model files can be gigabytes). A present file that looks like markup is deleted and re-queued for download so a clean copy is fetched; unreadable files are left in place (conservative on read error, never delete on uncertainty). Mirrors the download-time validation for the cached-file case.
…ot just the downloader (altic-dev#353) NemotronProvider.prepare() and ExternalCoreMLTranscriptionProvider.ensureArtifactsPresent() could decide models were present via a file-existence / manifest check and return without running the downloader, so an HTML/markup payload cached in place of a model binary (a proxy block page persisted before download-time validation existed) would slip through and load forever. Add a shared HuggingFaceModelDownloader.cachedPayloadContainsMarkup(root:relativePaths:) that recursively byte-sniffs the cached payload files, reusing the existing cachedFileIsMarkup/looksLikeHTML detector, and gate both providers' preflight short-circuits on it. Markup forces the existing download path, which deletes and re-fetches the corrupt files via needsDownload. The check is local (offline-safe) and conservative on read errors so a valid cache is never rejected on uncertainty; ExternalCoreML's outdated-bundle-stamp refresh still takes precedence.
a5e4b0d to
1668ee9
Compare
|
Thanks for a beautiful PR. merged :) |
|
Hey @altic-dev, thanks for merging #386, and congrats on 1.6.0. You'd mentioned holding my PRs until it shipped so we could pull them in after, so here's where the rest stand. All rebased onto 1.6.0 and building green, ready whenever you have a window:
No pressure on timing, and happy to rebase again if your next push moves things. Thanks for building this. |
|
@postoso sure! Thanks for the PRs - very helpful. But it will make it insanely helpful if you could follow the PR template that we got and make it easy for me to understand the issues and the fix. Currrent PRs are way too Claude slop and it's very hard for me to review it manually and that's why i am not able to merge it as soon as possible.
all of these. help me understand the issues and fixes, The main reason is I need to make sure it doesn't break anything and works across thousands of computers out there Which makes it harder for me to just merge it when I think the Code seems okay, but there is a lot more verification I do internal data make sure it works pretty decently. Hope it makes sense. |
|
Another example - #378 an amazing feature add on which I'd really love and would merge it instantly if there's more than just 100 lines and a PR request :) Makes it very very hard for me to review it. "Is the code change here doing what it's supposed to do?"
I hope this makes it a little easier. to understand. :) |
What
Two related onboarding/download robustness fixes.
#353 — reject HTML/markup pages saved as model files
A network proxy / secure web gateway can return an HTML block page with HTTP 200 in place of the real file. The downloader persisted that page as the model (e.g.
coremldata.bin), permanently caching a corrupt model that never self-heals.validateDownloadedFilenow runs before the temp file is moved into place. It rejects markup byContent-Type(text/html,text/xml,application/xml) and by byte-sniffing the leading 512 bytes: after stripping a UTF-8 BOM + ASCII whitespace, any payload beginning with<followed by a markup-ish byte (ASCII letter,!,?,/) is rejected. No artifact this downloader fetches begins with<(CoreML binaries; JSON{/[;model.mil→program), so false-positive risk is negligible.#355 — surface onboarding model-load failures
Onboarding load/download failures previously only logged, leaving the user stuck at a disabled Continue button with no message. The onboarding
catchnow sets the existingasrerror state and the onboarding view presents a Retry/OK alert.Verification
swiftlint --strictclean;xcodebuild buildsucceeds; 4 new unit tests for the markup detection pass (rejects<!doctype/<html/<script/<head/<meta/<!--/<?xml, incl. BOM + leading-whitespace variants; accepts JSON{/[,.mil, and binary/CoreML-magic prefixes).Note
The Parakeet path fetches models through the upstream FluidAudio package (
AsrModels.downloadAndLoad), which doesn't go through this downloader; this hardening covers the HuggingFace / Whisper / Nemotron / external-CoreML path. A FluidAudio-side guard would be a separate upstream change.Closes #353
Closes #355