Auto-detect audio format in OpenAISpeechToTextClient #7575
…#7543)
When the audio stream is not a FileStream, the client now peeks at the leading bytes to detect the format (wav, webm, m4a, mp3) and sets the multipart filename accordingly. This fixes HTTP 400 errors when sending non-MP3 audio (e.g. WAV) in a MemoryStream, since the OpenAI API uses the file extension to determine the audio format.
- Add DetectAudioExtension using Span.SequenceEqual for readability
- Add integration tests for all OpenAI-supported formats (mp3, wav, m4a, webm)
- Add unit tests covering each magic-byte detection branch
- Add ExpectedAudioFilename assertion to VerbatimMultiPartHttpHandler
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Pull request overview
This PR updates OpenAISpeechToTextClient to auto-detect audio format (wav/webm/m4a/mp3) from leading “magic bytes” when the provided audio stream is not a FileStream, and uses the detected extension in the multipart filename so OpenAI can correctly infer the format (fixing 400s for non-MP3 MemoryStream inputs).
Changes:
- Add stream-header “magic byte” detection and filename resolution logic in OpenAISpeechToTextClient.
- Add unit tests validating filename selection for each supported format and branch.
- Add integration coverage for multiple embedded audio formats and enhance multipart handler assertions to validate the uploaded filename.
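The "peek at the leading bytes" step can be sketched as follows. This is a minimal illustration, not the PR's code: the `StreamPeekSketch` class, the `PeekHeader` name, the 12-byte default, and the choice to return an empty array for non-seekable streams are all assumptions made for the example.

```csharp
using System;
using System.IO;

public static class StreamPeekSketch
{
    // Hypothetical helper: reads up to maxBytes from the current position,
    // then restores the position so the later upload sees the full stream.
    // Assumption: non-seekable streams are not inspected and yield an empty header.
    public static byte[] PeekHeader(Stream stream, int maxBytes = 12)
    {
        if (!stream.CanSeek)
        {
            return Array.Empty<byte>();
        }

        long start = stream.Position;
        byte[] buffer = new byte[maxBytes];
        int total = 0;
        try
        {
            // Stream.Read may return fewer bytes than requested,
            // so loop until the buffer is full or the stream ends.
            while (total < maxBytes)
            {
                int read = stream.Read(buffer, total, maxBytes - total);
                if (read == 0)
                {
                    break;
                }

                total += read;
            }
        }
        finally
        {
            // Rewind even if reading throws.
            stream.Position = start;
        }

        Array.Resize(ref buffer, total);
        return buffer;
    }
}
```

Restoring the position in a `finally` block keeps the stream usable for the multipart upload even when the read fails partway through.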
Reviewed changes
Copilot reviewed 5 out of 9 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| src/Libraries/Microsoft.Extensions.AI.OpenAI/OpenAISpeechToTextClient.cs | Adds filename resolution with magic-byte detection for non-FileStream inputs. |
| test/Libraries/Microsoft.Extensions.AI.OpenAI.Tests/OpenAISpeechToTextClientTests.cs | Adds theory-based unit tests asserting detected multipart filenames for different headers. |
| test/Libraries/Microsoft.Extensions.AI.Integration.Tests/VerbatimMultiPartHttpHandler.cs | Adds optional filename assertion for multipart “file” fields. |
| test/Libraries/Microsoft.Extensions.AI.Integration.Tests/SpeechToTextClientIntegrationTests.cs | Adds integration test that exercises auto-detection across multiple audio formats. |
| test/Libraries/Microsoft.Extensions.AI.Integration.Tests/Microsoft.Extensions.AI.Integration.Tests.csproj | Embeds additional audio resource files used by the new integration test. |
Comments suppressed due to low confidence (1)
src/Libraries/Microsoft.Extensions.AI.OpenAI/OpenAISpeechToTextClient.cs:121
- In GetStreamingTextAsync, ResolveFilename(audioSpeechStream) is executed unconditionally even for translation requests, but the translation branch immediately delegates to GetTextAsync(...) (which resolves the filename again). With the new magic-byte peek, this results in redundant header reads/rewinds for translation streaming.
_ = Throw.IfNull(audioSpeechStream);
string filename = ResolveFilename(audioSpeechStream);
if (IsTranslationRequest(options))
{
foreach (var update in (await GetTextAsync(audioSpeechStream, options, cancellationToken).ConfigureAwait(false)).ToSpeechToTextResponseUpdates())
/// <summary>Detects the audio format extension from the leading bytes of the audio data.</summary>
private static string DetectAudioExtension(ReadOnlySpan<byte> header)
For reference, OpenAI supported formats are: mp3, mp4, mpeg, mpga, m4a, wav, and webm. And quotes from the specs related to the matching occurring in this method:
- WAV — RIFF at offset 0, WAVE at offset 8
Source: Microsoft Multimedia Programming Interface and Data Specifications 1.0 (August 1991), referenced from:
https://www.mmsp.ece.mcgill.ca/Documents/AudioFormats/WAVE/WAVE.html
| Field | Length | Contents |
|---|---|---|
| ckID | 4 | Chunk ID: "RIFF" |
| cksize | 4 | Chunk size: 4+n |
| WAVEID | 4 | WAVE ID: "WAVE" |
And later under Examples, the full structure shows bytes 0–3 = RIFF, bytes 4–7 = size, and the WAVEID field at bytes 8–11 is WAVE.
- MP3 / MPEG / MPGA — ID3 at offset 0, or frame sync 0xFF 0xE_
Source: http://www.mp3-tech.org/programmer/frame_header.html (authoritative MP3 technical reference, derived from ISO/IEC 11172-3)
Verified citation (exact text):
The first twelve bits (or first eleven bits in the case of the MPEG 2.5 extension) of a frame header are always set to 1 and are called "frame sync".
And the header table shows:
| Sign | Length (bits) | Position (bits) | Description |
|---|---|---|---|
| A | 11 | (31-21) | Frame sync (all bits must be set) |
With 11 sync bits set, the first byte is 0xFF and the top 3 bits of the second byte are set, so the check is header[0] == 0xFF && (header[1] & 0xE0) == 0xE0.
For ID3v2 tags preceding MP3 data:
Source: https://id3.org/id3v2.3.0 — Section 3.1 "ID3v2 header"
"The first three bytes of the tag are always "ID3" to indicate that this is an ID3v2 tag"
- MP4 / M4A — ftyp at offset 4
Source: W3C Note "ISO BMFF Byte Stream Format" (referencing ISO/IEC 14496-12 "ISO Base Media File Format"):
https://www.w3.org/TR/mse-byte-stream-format-isobmff/
Verified citation (exact text):
An ISO BMFF initialization segment is defined in this specification as a single File Type Box (ftyp) followed by a single Movie Box (moov).
Per ISO 14496-12 box format: bytes 0–3 = box size (uint32 big-endian), bytes 4–7 = box type (FourCC). The first box MUST be ftyp.
- WebM — 0x1A 0x45 0xDF 0xA3 at offset 0
Source: RFC 8794 — "Extensible Binary Meta Language" (IETF Standards Track), Section 8.1 "EBML Header":
https://www.rfc-editor.org/rfc/rfc8794.txt
Verified citation (exact text from Section 8.1):
The EBML Header MUST contain a single Master Element with an Element Name of "EBML" and Element ID of "0x1A45DFA3" (see Section 11.2.1)
WebM is a profile of Matroska (RFC 9559), which is an EBML Document Type. Every WebM file begins with the EBML Header whose first element has ID 0x1A45DFA3.
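Taken together, the signatures above can be checked roughly as in the sketch below. It is an illustration written from the offsets quoted in this comment, not the method from the PR: the `AudioMagicSketch` class, the order of the checks, the extension strings without a leading dot, and the `null` return for unrecognized headers are all assumptions.

```csharp
#nullable enable
using System;

public static class AudioMagicSketch
{
    private static ReadOnlySpan<byte> Riff => new byte[] { 0x52, 0x49, 0x46, 0x46 }; // "RIFF"
    private static ReadOnlySpan<byte> Wave => new byte[] { 0x57, 0x41, 0x56, 0x45 }; // "WAVE"
    private static ReadOnlySpan<byte> Ebml => new byte[] { 0x1A, 0x45, 0xDF, 0xA3 }; // EBML header ID
    private static ReadOnlySpan<byte> Ftyp => new byte[] { 0x66, 0x74, 0x79, 0x70 }; // "ftyp"
    private static ReadOnlySpan<byte> Id3 => new byte[] { 0x49, 0x44, 0x33 };        // "ID3"

    // Hypothetical stand-in for DetectAudioExtension; returns null when nothing matches
    // (assumption: the caller then picks its own fallback).
    public static string? DetectAudioExtension(ReadOnlySpan<byte> header)
    {
        // WAV: "RIFF" at offset 0 and "WAVE" at offset 8.
        if (header.Length >= 12 &&
            header.Slice(0, 4).SequenceEqual(Riff) &&
            header.Slice(8, 4).SequenceEqual(Wave))
        {
            return "wav";
        }

        // WebM: EBML header ID 0x1A45DFA3 at offset 0.
        if (header.Length >= 4 && header.Slice(0, 4).SequenceEqual(Ebml))
        {
            return "webm";
        }

        // MP4 / M4A: box type "ftyp" at offset 4 (after the 4-byte box size).
        if (header.Length >= 8 && header.Slice(4, 4).SequenceEqual(Ftyp))
        {
            return "m4a";
        }

        // MP3 with an ID3v2 tag: "ID3" at offset 0.
        if (header.Length >= 3 && header.Slice(0, 3).SequenceEqual(Id3))
        {
            return "mp3";
        }

        // Bare MPEG audio frame: 11 sync bits set (0xFF, then top 3 bits of the next byte).
        if (header.Length >= 2 && header[0] == 0xFF && (header[1] & 0xE0) == 0xE0)
        {
            return "mp3";
        }

        return null;
    }
}
```

The frame-sync test is the loosest of the five, which is why this sketch runs it last. Note that the ftyp check cannot tell an .mp4 from an .m4a on its own; mapping it to "m4a" here simply follows the format list in the PR description.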
🎉 Good job! The coverage increased 🎉
Full code coverage report: https://dev.azure.com/dnceng-public/public/_build/results?buildId=1467244&view=codecoverage-tab
Fixes #7543