fix: validate downloaded model content + surface onboarding errors (#353, #355) by postoso · Pull Request #386 · altic-dev/FluidVoice

postoso · 2026-06-19T05:15:37Z

What

Two related onboarding/download robustness fixes.

#353 — reject HTML/markup pages saved as model files

A network proxy / secure web gateway can return an HTML block page with HTTP 200 in place of the real file. The downloader persisted that page as the model (e.g. coremldata.bin), permanently caching a corrupt model that never self-heals.

validateDownloadedFile now runs before the temp file is moved into place. It rejects markup by Content-Type (text/html, text/xml, application/xml) and by byte-sniffing the leading 512 bytes: after stripping a UTF-8 BOM + ASCII whitespace, any payload beginning with < followed by a markup-ish byte (ASCII letter, !, ?, /) is rejected. No artifact this downloader fetches begins with < (CoreML binaries; JSON {/[; model.mil → program), so false-positive risk is negligible.
A rejected/partial temp file is now removed instead of cached.
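
The two checks described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the PR's actual code: the function names `isMarkupContentType` and `looksLikeMarkup` and the exact whitespace set are hypothetical, chosen only to mirror the description.

```swift
import Foundation

// Hypothetical helper: true when a Content-Type header names a markup type.
func isMarkupContentType(_ header: String?) -> Bool {
    guard let header = header else { return false }
    // Drop parameters such as "; charset=utf-8" and normalize case.
    let mime = header.split(separator: ";").first
        .map { $0.trimmingCharacters(in: .whitespaces).lowercased() } ?? ""
    return ["text/html", "text/xml", "application/xml"].contains(mime)
}

// Hypothetical byte-sniff over the leading bytes of a downloaded payload.
func looksLikeMarkup(_ data: Data) -> Bool {
    // Only the first 512 bytes are inspected.
    var bytes = Array(data.prefix(512))
    // Strip a UTF-8 byte-order mark (EF BB BF) if present.
    if bytes.starts(with: [0xEF, 0xBB, 0xBF]) {
        bytes.removeFirst(3)
    }
    // Skip leading ASCII whitespace: space, tab, LF, VT, FF, CR (assumed set).
    let whitespace: Set<UInt8> = [0x20, 0x09, 0x0A, 0x0B, 0x0C, 0x0D]
    let body = bytes.drop(while: { whitespace.contains($0) })
    // "<" plus at least one following byte is needed to classify the payload.
    guard body.count >= 2, body.first == UInt8(ascii: "<") else { return false }
    let next = body[body.index(after: body.startIndex)]
    let isLetter = (UInt8(ascii: "a")...UInt8(ascii: "z")).contains(next)
        || (UInt8(ascii: "A")...UInt8(ascii: "Z")).contains(next)
    let markupPunctuation: Set<UInt8> = [UInt8(ascii: "!"), UInt8(ascii: "?"), UInt8(ascii: "/")]
    return isLetter || markupPunctuation.contains(next)
}
```

Requiring a markup-ish byte after `<` keeps a binary payload that merely happens to start with byte 0x3C from being rejected.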

#355 — surface onboarding model-load failures

Onboarding load/download failures previously only logged, leaving the user stuck at a disabled Continue button with no message. The onboarding catch now sets the existing asr error state and the onboarding view presents a Retry/OK alert.

Verification

swiftlint --strict clean; xcodebuild build succeeds; 4 new unit tests for the markup detection pass (rejects <!doctype/<html/<script/<head/<meta/<!--/<?xml, incl. BOM + leading-whitespace variants; accepts JSON {/[, .mil, and binary/CoreML-magic prefixes).

Note

The Parakeet path fetches models through the upstream FluidAudio package (AsrModels.downloadAndLoad), which doesn't go through this downloader; this hardening covers the HuggingFace / Whisper / Nemotron / external-CoreML path. A FluidAudio-side guard would be a separate upstream change.

Closes #353
Closes #355

altic-dev · 2026-06-20T00:19:35Z

Hey man. Thanks for the PRs! I am working on a huge push and it might conflict with your PRs. Would appreciate it if you could hold on for a week or less until I push 1.6.0, and then we can work on pulling your changes in :) wdyt?

postoso · 2026-06-20T21:46:31Z

Sounds great — totally happy to hold. 🙌 Go knock out 1.6.0, no rush on my end. Once it lands I'll rebase my open PRs (#384, #385, #387, #386, and #378/diarization) on top so they're clean for you to pull in. Just give me a ping when it's out. 🙏

postoso · 2026-06-22T16:24:52Z

Rebased onto 1.6.0. The #353 downloader validation applied cleanly. The #355 onboarding-error piece needed rework: 1.6 rewrote onboarding and renamed prepareOnboardingModel to prepareOnboardingRoute, so the original standalone .alert (its Retry called the old name) no longer compiled. Turns out 1.6's ContentView alert is bound to asr.showError and already wraps the onboarding view, and 1.6 already uses that pattern for delete failures but left download/load failures log-only. So I kept the real #355 fix (set errorTitle/errorMessage/showError in the prepare catch, so failures actually surface) and dropped the now-redundant standalone alert. Builds green, swiftlint clean. Flagging the dropped alert here in case you would rather onboarding own a dedicated Retry alert instead.
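
The pattern described above, where the prepare catch sets the shared error state rather than only logging, might look roughly like this. It is a sketch under assumptions: `ASRErrorState`, `ModelPrepareError`, the `loadModel` parameter, and the alert wording are hypothetical stand-ins; only the names `errorTitle`, `errorMessage`, `showError`, and `prepareOnboardingRoute` come from the comment.

```swift
import Foundation

// Stand-in for the app's ASR state object; only the three fields named in the
// comment above are modeled. The real type is presumably observable by SwiftUI.
final class ASRErrorState {
    var errorTitle = ""
    var errorMessage = ""
    var showError = false
}

// Hypothetical error used to simulate a failed model download or load.
struct ModelPrepareError: LocalizedError {
    let reason: String
    var errorDescription: String? { reason }
}

// Sketch of the prepare step: failures are published to the shared error
// state (which an alert can bind to) instead of only being logged.
func prepareOnboardingRoute(asr: ASRErrorState, loadModel: () throws -> Void) {
    do {
        try loadModel()
    } catch {
        asr.errorTitle = "Model setup failed"  // assumed wording
        asr.errorMessage = error.localizedDescription
        asr.showError = true
    }
}
```

Because an alert bound to `showError` already wraps the onboarding view, setting these three fields is enough for the failure to reach the user without a dedicated onboarding alert.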

postoso · 2026-06-23T01:04:03Z

@codex review

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 29be8ee777

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-06-23T01:08:13Z

+                    // Reject HTML error/block pages (e.g. a corporate proxy returning its
+                    // notification page with HTTP 200) before persisting them as a model
+                    // file, otherwise a corrupt payload is cached permanently. See #353.
+                    try Self.validateDownloadedFile(at: tempUrl, response: response, relativePath: relativePath)


Validate cached artifacts before skipping downloads

This validation only runs for freshly downloaded temp files, but ensureModelsPresent builds pendingFiles using a plain fileExists check before downloadFile is called. For users who already hit #353 and have an HTML block page cached at a model path, that file is still treated as present and this new validator is never reached, so the corrupt model remains permanently stuck until they manually clear the cache. Please validate existing artifacts or force a re-download when a present file is detected as markup.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 7244b3abb3

chatgpt-codex-connector · 2026-06-23T01:35:49Z

+                    if self.needsDownload(relativePath: rel, at: dest) {
                        pendingFiles.append(rel)


Validate caches before provider-level skips

For Nemotron/external CoreML caches that already look complete, this new cached-markup check is never reached: NemotronProvider.prepare returns after a file-existence-only modelsExistOnDisk() check, and ExternalCoreMLTranscriptionProvider.ensureArtifactsPresent can return after validation that only proves required entries/manifest metadata exist. A cache with a valid manifest or .mlpackage directory but an HTML page saved inside a binary therefore still skips ensureModelsPresent and remains permanently corrupted; move this validation into those preflight checks or force the downloader to run before returning.

postoso · 2026-06-23T07:26:29Z

Pushed a follow-up addressing the Codex review's cached-file point: ensureModelsPresent now re-validates files already on disk (reusing the same looksLikeHTML byte-sniff), so a corrupt HTML payload cached before this fix gets deleted and re-downloaded instead of staying stuck.

Codex's second point is fair too: NemotronProvider and ExternalCoreML can skip ensureModelsPresent via their own modelsExistOnDisk() checks, so a corrupt-but-present cache could still slip through at the provider preflight level. I left that out of this PR on purpose to keep it scoped to #353 (the downloader itself). Pushing the markup check down into each provider's presence logic is a broader change and a design call on where that validation should live, so it's probably better as a separate follow-up if you want it. Happy to do it.

postoso · 2026-06-23T08:20:45Z

Update on the provider-level point from the last Codex review: I went ahead and closed it in this PR rather than splitting it out. NemotronProvider and ExternalCoreML now byte-sniff their cached binaries through the same shared check, so a corrupt-but-present cache gets re-downloaded at the preflight too, not just in the downloader. So #386 now fully closes #353 end to end: downloader content-validation + re-validating already-cached files + both provider preflights. Codex re-reviewed and it's clean.

…ltic-dev#353, altic-dev#355)

Reject HTML/markup proxy/block pages before persisting them as model files so a corrupt payload (e.g. a corporate proxy notification page returned with HTTP 200) is never cached permanently as a model artifact, and surface model download/load failures during onboarding instead of silently logging them.

Content validation (altic-dev#353): reject any payload whose Content-Type is HTML/XML, or whose leading bytes (after stripping a UTF-8 BOM + ASCII whitespace) begin with `<` followed by a markup-ish byte (an ASCII letter, `!`, `?`, or `/`). No artifact this downloader fetches legitimately begins with `<` (CoreML `.mlmodelc`/`.mlpackage` payloads are binary or JSON, `.mil` starts with `program`, vocab is JSON, the tokenizer is a SentencePiece binary), so this catches `<!doctype`, `<html`, `<head>`, `<body>`, `<script>`, `<meta>`, `<!--`, and `<?xml` block pages without over-rejecting real model data. Only the first 512 bytes are read (model weight files can be gigabytes).

Added unit tests covering markup rejection (incl. leading-whitespace + BOM variants) and acceptance of JSON/array/binary/MIL model payloads.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…ownloaded (altic-dev#353)

ensureModelsPresent built its pendingFiles list using only fileExists, so a user who already had a corrupt HTML proxy/block page cached at a model path (the exact altic-dev#353 population) never re-downloaded it: fileExists was true, the file was skipped, and the download-time validator never ran, leaving the corrupt payload stuck until the cache was manually cleared.

Now needsDownload() also byte-sniffs each present file via the new static cachedFileIsMarkup helper, which reuses the existing looksLikeHTML check on a 512-byte prefix (model files can be gigabytes). A present file that looks like markup is deleted and re-queued for download so a clean copy is fetched; unreadable files are left in place (conservative on read error, never delete on uncertainty). Mirrors the download-time validation for the cached-file case.
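
The cached-file behavior in this commit message can be sketched as below. This is an illustration, not the merged code: the function names follow the commit message, but their bodies are assumptions, `looksLikeHTML` is a simplified stand-in (no BOM handling), and `relativePath` is accepted only to match the call shape in the diff.

```swift
import Foundation

// Simplified stand-in for the shared detector: skips ASCII whitespace only and
// checks for "<" followed by a letter, "!", "?" or "/".
func looksLikeHTML(_ data: Data) -> Bool {
    let whitespace: Set<UInt8> = [0x20, 0x09, 0x0A, 0x0D]
    let body = Array(data.drop(while: { whitespace.contains($0) }).prefix(2))
    guard body.count == 2, body[0] == UInt8(ascii: "<") else { return false }
    let next = body[1]
    let isLetter = (UInt8(ascii: "a")...UInt8(ascii: "z")).contains(next)
        || (UInt8(ascii: "A")...UInt8(ascii: "Z")).contains(next)
    return isLetter || [UInt8(ascii: "!"), UInt8(ascii: "?"), UInt8(ascii: "/")].contains(next)
}

// Reads at most 512 bytes; any read error counts as "not markup" so that an
// unreadable file is never deleted on uncertainty.
func cachedFileIsMarkup(at url: URL) -> Bool {
    guard let handle = try? FileHandle(forReadingFrom: url) else { return false }
    defer { try? handle.close() }
    guard let prefix = try? handle.read(upToCount: 512) else { return false }
    return looksLikeHTML(prefix)
}

// Decides whether a file must be (re)downloaded. `relativePath` is unused in
// this sketch.
func needsDownload(relativePath: String, at dest: URL) -> Bool {
    let fm = FileManager.default
    // Missing file: download it.
    guard fm.fileExists(atPath: dest.path) else { return true }
    // Present and not markup (or unreadable): keep it.
    guard cachedFileIsMarkup(at: dest) else { return false }
    // Present but markup: delete the corrupt copy and re-queue it.
    try? fm.removeItem(at: dest)
    return true
}
```

Reading through a `FileHandle` rather than loading the whole file keeps the check cheap even when the cached artifact is a multi-gigabyte weight file.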

…ot just the downloader (altic-dev#353)

NemotronProvider.prepare() and ExternalCoreMLTranscriptionProvider.ensureArtifactsPresent() could decide models were present via a file-existence / manifest check and return without running the downloader, so an HTML/markup payload cached in place of a model binary (a proxy block page persisted before download-time validation existed) would slip through and load forever.

Add a shared HuggingFaceModelDownloader.cachedPayloadContainsMarkup(root:relativePaths:) that recursively byte-sniffs the cached payload files, reusing the existing cachedFileIsMarkup/looksLikeHTML detector, and gate both providers' preflight short-circuits on it. Markup forces the existing download path, which deletes and re-fetches the corrupt files via needsDownload. The check is local (offline-safe) and conservative on read errors so a valid cache is never rejected on uncertainty; ExternalCoreML's outdated-bundle-stamp refresh still takes precedence.
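
A recursive preflight check of the kind this commit message describes could be sketched as follows. The signature `cachedPayloadContainsMarkup(root:relativePaths:)` comes from the commit message; the body, the free-function form, and the simplified `cachedFileIsMarkup` stand-in are assumptions for illustration only.

```swift
import Foundation

// Simplified stand-in for the shared per-file check: sniffs up to 512 bytes
// and treats read errors as "not markup".
func cachedFileIsMarkup(at url: URL) -> Bool {
    guard let handle = try? FileHandle(forReadingFrom: url) else { return false }
    defer { try? handle.close() }
    guard let data = try? handle.read(upToCount: 512) else { return false }
    let whitespace: Set<UInt8> = [0x20, 0x09, 0x0A, 0x0D]
    let body = Array(data.drop(while: { whitespace.contains($0) }).prefix(2))
    guard body.count == 2, body[0] == UInt8(ascii: "<") else { return false }
    let next = body[1]
    let isLetter = (UInt8(ascii: "a")...UInt8(ascii: "z")).contains(next)
        || (UInt8(ascii: "A")...UInt8(ascii: "Z")).contains(next)
    return isLetter || [UInt8(ascii: "!"), UInt8(ascii: "?"), UInt8(ascii: "/")].contains(next)
}

// Returns true if any cached payload file under the given paths looks like
// markup. Directories (e.g. an .mlpackage) are walked recursively.
func cachedPayloadContainsMarkup(root: URL, relativePaths: [String]) -> Bool {
    let fm = FileManager.default
    for rel in relativePaths {
        let url = root.appendingPathComponent(rel)
        var isDirectory: ObjCBool = false
        // Missing entries are not this check's concern; presence is verified elsewhere.
        guard fm.fileExists(atPath: url.path, isDirectory: &isDirectory) else { continue }
        if !isDirectory.boolValue {
            if cachedFileIsMarkup(at: url) { return true }
            continue
        }
        guard let walker = fm.enumerator(at: url, includingPropertiesForKeys: [.isRegularFileKey]) else {
            continue
        }
        for case let file as URL in walker {
            let isFile = (try? file.resourceValues(forKeys: [.isRegularFileKey]).isRegularFile) ?? false
            if isFile && cachedFileIsMarkup(at: file) { return true }
        }
    }
    return false
}
```

A provider preflight would then short-circuit only when its existing presence check passes and this function returns false; otherwise it falls through to the download path, which deletes and re-fetches the corrupt files.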

altic-dev · 2026-06-26T06:52:31Z

Thanks for a beautiful PR. merged :)

postoso · 2026-06-26T20:15:44Z

Hey @altic-dev, thanks for merging #386, and congrats on 1.6.0. You'd mentioned holding my PRs until it shipped so we could pull them in after, so here's where the rest stand. All rebased onto 1.6.0 and building green, ready whenever you have a window:

#398: honor hide-from-dock on login/cold launch (#396), the dock-icon-on-first-launch one
#403: stop persisting dictated content to the disk log (privacy)
#402: hide the dictation overlay from VoiceOver, fast-path system shortcuts (#401)
#387: clipboard fallback for Write Mode selection reading (#259)
#378: optional speaker diarization for file transcription (feature, no rush)

No pressure on timing, and happy to rebase again if your next push moves things. Thanks for building this.

altic-dev · 2026-06-26T22:53:16Z

@postoso sure! Thanks for the PRs - very helpful. But it would be insanely helpful if you could follow our PR template and make it easy for me to understand the issues and the fixes. The current PRs are way too Claude slop, which makes them very hard for me to review manually, and that's why I am not able to merge them as soon as possible.

Follow the template
Attach a video/screenshot of before and after (or of the new feature) if possible
Codex review fixes (this you're doing okay)

All of these help me understand the issues and fixes.

The main reason is that I need to make sure it doesn't break anything and works across thousands of computers out there. That makes it harder for me to just merge when the code seems okay; there is a lot more verification I do internally to make sure it works pretty decently. Hope it makes sense.

altic-dev · 2026-06-26T22:58:05Z

Another example - #378

An amazing feature add-on which I'd really love, and I would merge it instantly if there were more than just 100 lines and a PR request :) That makes it very, very hard for me to review.

"Is the code change here doing what it's supposed to do?"

To answer this, I need to know what you're trying to do: the exact features, and what I should see after compiling it. If I don't have both of these, then I need to do more work to figure out both, and I still won't fully know what your intention was.

I hope this makes it a little easier to understand. :)

postoso force-pushed the fix/353-355-download-validation branch from dce8a90 to 29be8ee Compare June 22, 2026 16:23

chatgpt-codex-connector Bot reviewed Jun 23, 2026

postoso and others added 3 commits June 26, 2026 00:24

postoso force-pushed the fix/353-355-download-validation branch from a5e4b0d to 1668ee9 Compare June 26, 2026 04:25

altic-dev merged commit 755f0f5 into altic-dev:main Jun 26, 2026
1 check passed
