Auto-detect audio format in OpenAISpeechToTextClient#7575

Open
jozkee wants to merge 1 commit into main from issue-7543

@jozkee jozkee commented Jun 16, 2026

When the audio stream is not a FileStream, the client now peeks at the leading bytes to detect the format (wav, webm, m4a, mp3) and sets the multipart filename accordingly. This fixes HTTP 400 errors when sending non-MP3 audio (e.g. WAV) in a MemoryStream, since the OpenAI API uses the file extension to determine the audio format.

  • Add DetectAudioExtension using Span.SequenceEqual for readability
  • Add integration tests for all OpenAI-supported formats (mp3, wav, m4a, webm)
  • Add unit tests covering each magic-byte detection branch
  • Add ExpectedAudioFilename assertion to VerbatimMultiPartHttpHandler
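
The peek step described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `StreamPeek` and `PeekHeader` are hypothetical names, and the sketch assumes the stream is seekable so the position can be restored before the upload reads it.

```csharp
using System;
using System.IO;

// Usage: peek at a MemoryStream and confirm the position is left untouched.
using var audio = new MemoryStream(new byte[] { 0x52, 0x49, 0x46, 0x46, 0, 0, 0, 0 });
byte[] header = StreamPeek.PeekHeader(audio, 12);
Console.WriteLine($"{header.Length} bytes peeked, position is {audio.Position}");
// prints "8 bytes peeked, position is 0"

// Hypothetical helper (assumption, not the PR's implementation).
static class StreamPeek
{
    // Reads up to `count` leading bytes and rewinds to the original position.
    // Returns an empty array for non-seekable streams so nothing is consumed.
    public static byte[] PeekHeader(Stream stream, int count)
    {
        if (!stream.CanSeek)
        {
            return Array.Empty<byte>();
        }

        long start = stream.Position;
        byte[] buffer = new byte[count];
        int total = 0;
        try
        {
            // Stream.Read may return fewer bytes than requested,
            // so keep reading until the buffer is full or the stream ends.
            while (total < count)
            {
                int read = stream.Read(buffer, total, count - total);
                if (read == 0)
                {
                    break;
                }

                total += read;
            }
        }
        finally
        {
            // Always rewind so the later upload sees the whole stream.
            stream.Position = start;
        }

        return buffer.AsSpan(0, total).ToArray();
    }
}
```

Returning only the bytes actually read keeps the later format check from matching against zero padding when the stream is shorter than the requested header size.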

Fixes #7543

…#7543)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@jozkee jozkee requested a review from rogerbarreto June 16, 2026 21:46
@jozkee jozkee self-assigned this Jun 16, 2026
@jozkee jozkee requested a review from a team as a code owner June 16, 2026 21:46
Copilot AI review requested due to automatic review settings June 16, 2026 21:46
@jozkee jozkee added the area-ai Microsoft.Extensions.AI libraries label Jun 16, 2026

Copilot AI left a comment

Pull request overview

This PR updates OpenAISpeechToTextClient to auto-detect audio format (wav/webm/m4a/mp3) from leading “magic bytes” when the provided audio stream is not a FileStream, and uses the detected extension in the multipart filename so OpenAI can correctly infer the format (fixing 400s for non-MP3 MemoryStream inputs).

Changes:

  • Add stream-header “magic byte” detection and filename resolution logic in OpenAISpeechToTextClient.
  • Add unit tests validating filename selection for each supported format and branch.
  • Add integration coverage for multiple embedded audio formats and enhance multipart handler assertions to validate the uploaded filename.

Reviewed changes

Copilot reviewed 5 out of 9 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| src/Libraries/Microsoft.Extensions.AI.OpenAI/OpenAISpeechToTextClient.cs | Adds filename resolution with magic-byte detection for non-FileStream inputs. |
| test/Libraries/Microsoft.Extensions.AI.OpenAI.Tests/OpenAISpeechToTextClientTests.cs | Adds theory-based unit tests asserting detected multipart filenames for different headers. |
| test/Libraries/Microsoft.Extensions.AI.Integration.Tests/VerbatimMultiPartHttpHandler.cs | Adds optional filename assertion for multipart “file” fields. |
| test/Libraries/Microsoft.Extensions.AI.Integration.Tests/SpeechToTextClientIntegrationTests.cs | Adds integration test that exercises auto-detection across multiple audio formats. |
| test/Libraries/Microsoft.Extensions.AI.Integration.Tests/Microsoft.Extensions.AI.Integration.Tests.csproj | Embeds additional audio resource files used by the new integration test. |

Comments suppressed due to low confidence (1)

src/Libraries/Microsoft.Extensions.AI.OpenAI/OpenAISpeechToTextClient.cs:121

  • In GetStreamingTextAsync, ResolveFilename(audioSpeechStream) is executed unconditionally even for translation requests, but the translation branch immediately delegates to GetTextAsync(...) (which resolves the filename again). With the new magic-byte peek, this results in redundant header reads/rewinds for translation streaming.
        _ = Throw.IfNull(audioSpeechStream);

        string filename = ResolveFilename(audioSpeechStream);

        if (IsTranslationRequest(options))
        {
            foreach (var update in (await GetTextAsync(audioSpeechStream, options, cancellationToken).ConfigureAwait(false)).ToSpeechToTextResponseUpdates())

}

/// <summary>Detects the audio format extension from the leading bytes of the audio data.</summary>
private static string DetectAudioExtension(ReadOnlySpan<byte> header)

@jozkee jozkee Jun 16, 2026

For reference, the formats OpenAI supports are mp3, mp4, mpeg, mpga, m4a, wav, and webm. Below are quotes from the specs that back each match performed in this method:

  1. WAV — RIFF at offset 0, WAVE at offset 8
    Source: Microsoft Multimedia Programming Interface and Data Specifications 1.0 (August 1991), referenced from:
    https://www.mmsp.ece.mcgill.ca/Documents/AudioFormats/WAVE/WAVE.html

| Field | Length | Contents |
| --- | --- | --- |
| ckID | 4 | Chunk ID: "RIFF" |
| cksize | 4 | Chunk size: 4+n |
| WAVEID | 4 | WAVE ID: "WAVE" |

And later under Examples, the full structure shows bytes 0–3 = RIFF, bytes 4–7 = size, and the WAVEID field at bytes 8–11 is WAVE.

  2. MP3 / MPEG / MPGA — ID3 at offset 0, or frame sync 0xFF 0xE_
    Source: http://www.mp3-tech.org/programmer/frame_header.html (authoritative MP3 technical reference, derived from ISO/IEC 11172-3)

Verified citation (exact text):

The first twelve bits (or first eleven bits in the case of the MPEG 2.5 extension) of a frame header are always set to 1 and are called "frame sync".

And the header table shows:

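| Sign | Length (bits) | Position (bits) | Description |
| --- | --- | --- | --- |
| A | 11 | (31-21) | Frame sync (all bits must be set) |

Eleven set bits means the first byte is 0xFF and the top 3 bits of the next byte are set, i.e. header[0] == 0xFF and (header[1] & 0xE0) == 0xE0.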

For ID3v2 tags preceding MP3 data:
Source: https://id3.org/id3v2.3.0 — Section 3.1 "ID3v2 header"

"The first three bytes of the tag are always "ID3" to indicate that this is an ID3v2 tag"

  3. MP4 / M4A — ftyp at offset 4
    Source: W3C Note "ISO BMFF Byte Stream Format" (referencing ISO/IEC 14496-12 "ISO Base Media File Format"):
    https://www.w3.org/TR/mse-byte-stream-format-isobmff/

Verified citation (exact text):

An ISO BMFF initialization segment is defined in this specification as a single File Type Box (ftyp) followed by a single Movie Box (moov).

Per ISO 14496-12 box format: bytes 0–3 = box size (uint32 big-endian), bytes 4–7 = box type (FourCC). The first box MUST be ftyp.

  4. WebM — 0x1A 0x45 0xDF 0xA3 at offset 0
    Source: RFC 8794 — "Extensible Binary Meta Language" (IETF Standards Track), Section 8.1 "EBML Header":
    https://www.rfc-editor.org/rfc/rfc8794.txt

Verified citation (exact text from Section 8.1):

The EBML Header MUST contain a single Master Element with an Element Name of "EBML" and Element ID of "0x1A45DFA3" (see Section 11.2.1)

WebM is a profile of Matroska (RFC 9559), which is an EBML Document Type. Every WebM file begins with the EBML Header whose first element has ID 0x1A45DFA3.
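
Taken together, the four signatures above could be matched along these lines. This is a hedged sketch rather than the method in this PR: the `AudioSniffer` class name, the check order, the dot-less return values, and the `mp3` fallback for unrecognized data are all assumptions made for illustration. It uses UTF-8 string literals, which need C# 11 or later.

```csharp
using System;

// Usage: the four EBML magic bytes alone are enough to select WebM.
Console.WriteLine(AudioSniffer.DetectAudioExtension(new byte[] { 0x1A, 0x45, 0xDF, 0xA3 }));
// prints "webm"

// Hypothetical container class (assumption, not the PR's implementation).
static class AudioSniffer
{
    public static string DetectAudioExtension(ReadOnlySpan<byte> header)
    {
        // WAV: "RIFF" at offset 0 and "WAVE" at offset 8.
        if (header.Length >= 12 &&
            header.Slice(0, 4).SequenceEqual("RIFF"u8) &&
            header.Slice(8, 4).SequenceEqual("WAVE"u8))
        {
            return "wav";
        }

        // WebM: EBML header element ID 0x1A45DFA3 at offset 0.
        ReadOnlySpan<byte> ebml = stackalloc byte[] { 0x1A, 0x45, 0xDF, 0xA3 };
        if (header.Length >= 4 && header.Slice(0, 4).SequenceEqual(ebml))
        {
            return "webm";
        }

        // MP4 / M4A: box type "ftyp" at offset 4, after the 4-byte box size.
        if (header.Length >= 8 && header.Slice(4, 4).SequenceEqual("ftyp"u8))
        {
            return "m4a";
        }

        // MP3 with an ID3v2 tag: "ID3" at offset 0.
        if (header.Length >= 3 && header.Slice(0, 3).SequenceEqual("ID3"u8))
        {
            return "mp3";
        }

        // Bare MPEG audio frame: 11 sync bits set (0xFF, then top 3 bits of byte 1).
        if (header.Length >= 2 && header[0] == 0xFF && (header[1] & 0xE0) == 0xE0)
        {
            return "mp3";
        }

        // Assumed fallback for unrecognized data; the PR may choose differently.
        return "mp3";
    }
}
```

The frame-sync test is placed last because a two-byte pattern is the weakest signal of the five; the longer signatures are checked first so they cannot be shadowed by it.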

@dotnet-comment-bot

‼️ Found issues ‼️

| Project | Coverage Type | Expected | Actual |
| --- | --- | --- | --- |
| Microsoft.Extensions.Diagnostics.Testing | Line | 99 | 98.65 🔻 |
| Microsoft.Extensions.Telemetry | Line | 93 | 91.95 🔻 |
| Microsoft.Extensions.AI | Line | 89 | 88.53 🔻 |
| Microsoft.Extensions.AI | Branch | 89 | 88.57 🔻 |
| Microsoft.Extensions.AI.OpenAI | Line | 75 | 62.86 🔻 |
| Microsoft.Extensions.AI.OpenAI | Branch | 75 | 50.31 🔻 |
| Microsoft.Extensions.DataIngestion.MarkItDown | Line | 75 | 4.46 🔻 |
| Microsoft.Extensions.DataIngestion.MarkItDown | Branch | 75 | 0 🔻 |
| Microsoft.Extensions.Diagnostics.ResourceMonitoring | Line | 99 | 96.03 🔻 |
| Microsoft.Extensions.Diagnostics.ResourceMonitoring | Branch | 99 | 94.39 🔻 |
| Microsoft.Extensions.Diagnostics.ResourceMonitoring.Kubernetes | Line | 99 | 97.73 🔻 |
| Microsoft.Extensions.ServiceDiscovery.Dns | Line | 75 | 69.93 🔻 |
| Microsoft.Extensions.ServiceDiscovery.Abstractions | Line | 75 | 42.11 🔻 |
| Microsoft.Extensions.ServiceDiscovery.Abstractions | Branch | 75 | 42.86 🔻 |
| Microsoft.Extensions.ServiceDiscovery | Line | 75 | 67.81 🔻 |
| Microsoft.Extensions.ServiceDiscovery | Branch | 75 | 71.43 🔻 |
| Microsoft.Extensions.ServiceDiscovery.Yarp | Line | 75 | 73.85 🔻 |
| Microsoft.Extensions.ServiceDiscovery.Yarp | Branch | 75 | 70 🔻 |
| Microsoft.Extensions.VectorData.Abstractions | Line | 75 | 37.39 🔻 |
| Microsoft.Extensions.VectorData.Abstractions | Branch | 75 | 22.73 🔻 |

🎉 Good job! The coverage increased 🎉
Update MinCodeCoverage in the project files.

| Project | Expected | Actual |
| --- | --- | --- |
| Microsoft.Gen.BuildMetadata | 97 | 100 |
| Microsoft.Gen.MetadataExtractor | 57 | 73 |
| Microsoft.Gen.MetricsReports | 67 | 69 |
| Microsoft.Extensions.AI.Abstractions | 82 | 85 |
| Microsoft.Extensions.AI.Evaluation.NLP | 0 | 78 |
| Microsoft.Extensions.Caching.Hybrid | 82 | 89 |
| Microsoft.Extensions.DataIngestion | 75 | 89 |
| Microsoft.Extensions.DataIngestion.Markdig | 75 | 90 |
| Microsoft.Extensions.Http.Resilience | 97 | 100 |

Full code coverage report: https://dev.azure.com/dnceng-public/public/_build/results?buildId=1467244&view=codecoverage-tab

Labels

area-ai Microsoft.Extensions.AI libraries

Development

Successfully merging this pull request may close these issues.

ISpeechToTextClient does not allow to specify the audio FileName

3 participants