Add unintrusive h264 DTS extractor #61 #124

Open
Curid wants to merge 1 commit into scottlamb:main from Curid:dts3
@Curid

Curid commented Dec 15, 2025

Contributor

Rust implementation of MediaMTX's magic DTS extractor.

RTSP streams don't keep track of when video frames need to be decoded, and there doesn't seem to be any official specification for how to do it in real-time.

I didn't document how the algorithm works because I don't really know and it'd probably be better to document it upstream anyway.

I'd try to document the API, but it'd probably take longer to review my bad writing than to do it yourself, and I don't want to waste your time.

Comment thread src/dts_extractor/h264.rs
    let sps = match self.spsp.as_mut() {
        Some(sps) => sps,
        None => {
            let sps_rbsp = h264_reader::rbsp::decode_nal(sps).map_err(DecodeSps)?;

Contributor Author

It's a bit inconvenient to parse the SPS again when we already parsed it internally in Depacketizer. I'm guessing you don't want to expose SeqParameterSet in the public API surface. Maybe we could find a way to access it from a private method on Stream or VideoFrame?

@scottlamb

scottlamb commented Feb 21, 2026

Owner

I'm struggling with this one. Maybe it will help to talk it through.

I'm starting to understand mediamtx's algorithm for getting the difference in numbers of frames. Arguably it's good enough although a couple things seem sketchy to me:

  • I think it really wants the encoder to increment the POC by 1, 2, or 4 per frame. If it, say, increments by 3 or 5 or 8, it will not be happy. There have already been a couple of fixes to this heuristic, and there could be more to come.
  • It tracks only the bottom (log2_max_pic_order_cnt_lsb_minus4 + 4) bits of the POC, and I'm not sure it always does the right thing on wraparound, especially if there are, let's say, more than half that many reordered frames. (It says there can be at most 10 reordered frames; actually I think MaxDpbFrames can be up to 16 based on H.264 section A.3.1.)
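
To make the wraparound concern concrete, here is a minimal sketch of taking a signed difference between two `pic_order_cnt_lsb` values modulo MaxPicOrderCntLsb. It is not MediaMTX's code or this PR's code; the helper name `poc_lsb_diff` is hypothetical, and it assumes the true difference is smaller than half the modulus, which is exactly the assumption questioned above.

```rust
/// Signed difference `cur - prev` between two `pic_order_cnt_lsb` values,
/// taken modulo MaxPicOrderCntLsb = 2^(log2_max_pic_order_cnt_lsb_minus4 + 4).
///
/// Hypothetical helper for illustration only. It assumes the true difference
/// lies in [-MaxPicOrderCntLsb/2, MaxPicOrderCntLsb/2).
fn poc_lsb_diff(prev: u32, cur: u32, log2_max_pic_order_cnt_lsb_minus4: u8) -> i32 {
    // The H.264 syntax limits this field to 0..=12, so `bits` is 4..=16.
    let bits = u32::from(log2_max_pic_order_cnt_lsb_minus4.min(12)) + 4;
    let max = 1u32 << bits;
    // Difference reduced into 0..max.
    let d = cur.wrapping_sub(prev) & (max - 1);
    // Map the upper half of the range onto negative differences.
    if d >= max / 2 {
        d as i32 - max as i32
    } else {
        d as i32
    }
}
```

With 4 bits of LSB, going from 14 to 2 is read as +4 and from 2 to 14 as -4. A true jump of +8 is indistinguishable from -8, so this sketch reports -8; that ambiguity is the "more than half" case.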

fwiw, it doesn't support a few cases at all:

  • interlaced video. (To be fair, interlaced video is probably mostly a historical relic that might not ever show up with B frame support. And I've never tested Retina with interlaced video at all; at least the type name retina::codec::VideoFrame would be more properly retina::codec::VideoPicture if I'd put thought into it.)
  • (I think) SVC and MVC, although again to be fair Retina probably doesn't do well here either today.
  • pic_order_cnt_type = 1 (which I guess is just complicated).

Then it comes up with a dts which matches the original order and mostly tries to space frames out but in one case ends up piling them 1 ms apart. I think it could then exceed the bit rate limits in H.264 Table A-1 where a more evenly-spaced dts wouldn't.

Stepping back a bit, I think the intent is basically to come up with decode timestamps without ever having to buffer frames. vs say gstreamer's https://github.com/GStreamer/gstreamer/tree/main/subprojects/gst-plugins-bad/gst/codectimestamper which does delay frames. But...I'm wondering if that difference matters:

  • If we just want to feed directly into a decoder, I don't think we need decode timestamps at all, because it's already in the right order. We just need the presentation timestamps to know when to show the frames.
  • If we're muxing it into a container that wants decode timestamps, I think we'll always be putting it together in the same segment with the next frame to be presented? So we'll be buffering it up anyway; I think there's minimal if any delay incurred by the gstreamer algorithm.

I was wondering if Retina did the right thing with backwards timestamps anyway. I guess retina::Timestamp does. But the mp4 example I think will just error out if they're not monotonically increasing.
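
As an illustration of the monotonicity check mentioned above (a sketch only, not the code of Retina's mp4 example), the hypothetical helper below finds the first timestamp that fails to increase. With B-frames, presentation timestamps listed in decode order trip a check like this.

```rust
/// Returns the index of the first timestamp that is not strictly greater than
/// its predecessor, or `None` if the sequence is strictly increasing.
/// Hypothetical helper for illustration.
fn first_non_increasing(ts: &[i64]) -> Option<usize> {
    ts.windows(2).position(|w| w[1] <= w[0]).map(|i| i + 1)
}
```

For frames in decode order I, P, B, B with presentation timestamps 0, 9000, 3000, 6000 (made-up values in 90 kHz units), the check flags index 2, the first B-frame.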

In some cases Retina callers may want to come up with timestamps in a completely different way (see scottlamb/moonfire-nvr#322). In that case I don't know if this extractor helps as much as it should. Maybe having it at a higher level after you plug in your own timestamp would be better then. (But on the other hand, I don't know how easy it will be to come up with those timestamps without understanding the decode order, so again I'm not quite sure if the interface is what it'd need to be for that use case. Speaking of: MediaMTX has some of these things figured out. Do they handle these cameras with totally broken timestamps, and if so, how?)

@Curid

Curid commented Feb 21, 2026

Contributor Author

> If we're muxing it into a container that wants decode timestamps, I think we'll always be putting it together in the same segment with the next frame to be presented? So we'll be buffering it up anyway; I think there's minimal if any delay incurred by the gstreamer algorithm.

My streamer wraps every frame in its own mp4 fragment.

How many frames does the gstreamer algorithm actually buffer? If the stream only has a frame rate of 3 fps then that could be several seconds of delay.
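
The arithmetic behind that worry can be sketched as follows. The helper name is hypothetical, and the number of frames gstreamer actually buffers is not established in this thread; the frame counts used here are just the 10 and 16 frame limits mentioned earlier.

```rust
use std::time::Duration;

/// Latency added if `buffered_frames` frames must arrive before the first one
/// can be released, assuming a constant frame rate `fps`.
/// Returns `None` for a non-positive or non-finite frame rate.
fn buffering_delay(buffered_frames: u32, fps: f64) -> Option<Duration> {
    if !fps.is_finite() || fps <= 0.0 {
        return None;
    }
    Some(Duration::from_secs_f64(f64::from(buffered_frames) / fps))
}
```

At 3 fps, buffering 10 frames would add about 3.3 seconds and buffering 16 frames about 5.3 seconds, whereas at 30 fps the same counts cost roughly a third to half a second.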

> MediaMTX has some of these things figured out. Do they handle these cameras with totally broken timestamps, and if so, how?)

I couldn't find any code that deals with identical RTP timestamps, but I did find some interesting stuff:

https://github.com/bluenviron/gortsplib/blob/7dbc38520457792ce32f9a3c13a4388d36d471ea/client.go#L2423
https://github.com/bluenviron/gortsplib/blob/7dbc38520457792ce32f9a3c13a4388d36d471ea/pkg/rtpreceiver/receiver.go#L14
https://github.com/bluenviron/gortsplib/blob/7dbc38520457792ce32f9a3c13a4388d36d471ea/pkg/rtptime/global_decoder.go#L54

https://mediamtx.org/docs/usage/route-absolute-timestamps
bluenviron/mediamtx#1300 (comment)
bluenviron/mediamtx#5078 (comment)

bluenviron/mediamtx#1002 (comment)

> This algorithm works with about 90% of streams, the problem is that there are some streams that are generated in such a bad way that they declare the wrong number of B-frames, consequently the expected POC is wrong and the DTS is wrong. The DTS is subjected to some coherence checks, and this leads to the error mentioned in this issue.

@scottlamb

scottlamb commented Feb 22, 2026

Owner

> My streamer wraps every frame in its own mp4 fragment.

Fair enough. I guess this is probably what Moonfire would do too—even though the B-frame won't be displayed until after some sent-later frame, it's still better to start transferring it to the client before receiving that later frame from the camera and likewise to feed it into the decoder as soon as possible.

(That is, when you're doing low-latency serving with .mp4 at all. I'm really tempted to switch Moonfire's browser-based interface over to entirely WebCodecs, keeping .mp4 only for "download video" functionality. I was just playing with a new Retina WebCodecs example today. It's so much nicer! No surprises! You can do exactly your own buffering logic, including no buffering at all if you just want the thing to be as live as possible. I haven't tried it with B-frames yet, but in theory it should just work without any of this dts extractor code at all. The only problem is Firefox. They enable WebCodecs VideoDecoder support for H.265 only behind a config flag on macOS and entirely disable VideoDecoder on Android for now. I'm not really sure what they're waiting for before rolling it out fully, so I don't know when this will change.)

@Curid

Curid commented Feb 22, 2026

Contributor Author

Browser streaming without dts extraction would be a huge deal. We probably wouldn't even need to buffer one frame to calculate the frame duration.

Do you think decode timestamps need to be stored on disk or can they always be generated on demand? A timeline feature would need to fetch and decode GOPs in any arbitrary order. Does gstreamer's timestamper work on GOPs independently or does the output differ if it receives multiple GOPs?

I'm curious if you'll run into any fun issues with WebCodecs if you switch. Do you plan to keep the old streamer around as a fallback for Firefox?

Do you still want this PR? A zero latency dts extractor would still be useful for anyone that wants to convert RTSP streams to SRT or RTMP.

@scottlamb

Owner

> Browser streaming without dts extraction would be a huge deal. We probably wouldn't even need to buffer one frame to calculate the frame duration.

Yes. The new WebCodecs example still receives and then sends an entire frame at a time (rather than streaming packets through), mostly because I didn't want to rewrite all that code today, but it doesn't wait for a second frame for duration calculation. Total glass-to-glass latency (including the camera's H.265 encoder and my wifi) is about 160 ms, as you can see here (the overlay is a timestamp as HH:MM:SS.SSS, with giant seconds and milliseconds):

[Image: Screenshot 2026-02-22 at 10 26 12]

> Do you think decode timestamps need to be stored on disk or can they always be generated on demand?

That's an interesting question. I'm not sure how sentryshot has this structured, but in Moonfire's case I currently can generate the entire moov or moof from the SQLite database, without having to hit spinning disk to examine the video samples. I'd like to preserve that so it doesn't have to do a ton of seeks right at the start of generating a long .mp4 file. So I think we'd want to preserve within each GOP either the actual cts = pts-dts (in timestamp units) or the pts_dts_diff (in POC/frame units).
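
A minimal sketch of the first option, computing the per-frame `cts = pts - dts` values one might persist for a GOP. The function name and the choice of `u32` for storage are assumptions for illustration, not Moonfire's schema or Retina's API.

```rust
use std::convert::TryFrom;

/// Per-sample composition offset (`cts = pts - dts`) in timestamp units, for
/// frames listed in decode order. Returns `None` if the slices differ in
/// length or any offset is negative or does not fit in `u32`.
/// Hypothetical helper for illustration.
fn composition_offsets(pts: &[i64], dts: &[i64]) -> Option<Vec<u32>> {
    if pts.len() != dts.len() {
        return None;
    }
    pts.iter()
        .zip(dts)
        .map(|(&p, &d)| p.checked_sub(d).and_then(|c| u32::try_from(c).ok()))
        .collect()
}
```

For decode order I, P, B, B with made-up 90 kHz timestamps pts = 0, 9000, 3000, 6000 and dts = -3000, 0, 3000, 6000, the offsets are 3000, 9000, 0, 0. Storing these small per-frame values alongside the index would let the muxer emit composition offsets without reading the video samples.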

> A timeline feature would need to fetch and decode GOPs in any arbitrary order. Does gstreamer's timestamper work on GOPs independently or does the output differ if it receives multiple GOPs?

I haven't looked that closely, but I assume it works on GOPs independently. It's certainly easy to detect a new IDR frame and reset everything.
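
Detecting an IDR and resetting could look roughly like the sketch below. This is an illustration, not gstreamer's or this PR's logic: `GopState` and its field are hypothetical stand-ins for whatever per-GOP state a real extractor keeps, and the input is assumed to be a NAL unit beginning with its header byte (no Annex B start code).

```rust
/// True if `nal` is an IDR slice: the low five bits of the H.264 NAL header
/// byte are `nal_unit_type`, and type 5 is a coded slice of an IDR picture.
fn is_idr(nal: &[u8]) -> bool {
    nal.first().map_or(false, |&b| (b & 0x1f) == 5)
}

/// Hypothetical per-GOP state; a real extractor would track POC state here.
#[derive(Default, Debug, PartialEq)]
struct GopState {
    frames_since_idr: u32,
}

impl GopState {
    /// Feeds the first slice NAL of each picture, resetting on an IDR.
    fn on_picture(&mut self, first_slice_nal: &[u8]) {
        if is_idr(first_slice_nal) {
            *self = GopState::default();
        } else {
            self.frames_since_idr += 1;
        }
    }
}
```

Whether resetting at each IDR produces identical output to processing several GOPs in sequence is the open question in this thread; the sketch only shows that the reset point itself is cheap to find.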

> I'm curious if you'll run into any fun issues with WebCodecs if you switch. Do you plan to keep the old streamer around as a fallback for Firefox?

Probably. I wish I didn't have to make it second-class and/or keep extra code for it, but I do think the WebCodecs approach is really that much better. MSE has been frustrating, and I don't think the 160 ms latency shown above is even possible with it; the browser just insists on buffering more.

> Do you still want this PR? A zero latency dts extractor would still be useful for anyone that wants to convert RTSP streams to SRT or RTMP.

Yes, I think you're right, it's still worth doing, even though I'm personally not planning on restreaming with either protocol. So it's just down to the details. ...but maybe we can consider incremental changes after landing the mediamtx approach as-is. I see you've ported over the test cases, which will help with that.

Can we declare the interface unstable for now via cargo feature? I do wonder if we will want to change it up, for example to give applications more flexibility to feed their own ptses into the algorithm. I'm sure you'd like it to be stable eventually, but this step will still be an improvement over having to maintain your own branch indefinitely.

@Curid

Curid commented Feb 22, 2026

Contributor Author

> That's an interesting question. I'm not sure how sentryshot has this structured, but in Moonfire's case I currently can generate the entire moov or moof from the SQLite database,

Same but it's all dumb flat files: https://codeberg.org/SentryShot/sentryshot/src/branch/master/src/recording

> I'd like to preserve that so it doesn't have to do a ton of seeks right at the start of generating a long .mp4 file. So I think we'd want to preserve within each GOP either the actual cts = pts-dts (in timestamp units) or the pts_dts_diff (in POC/frame units).

Would it be much harder to seek to the nearest start and end of the GOPs and then feed all the timestamps into gstreamer's dts extractor?

> I haven't looked that closely, but I assume it works on GOPs independently. It's certainly easy to detect a new IDR frame and reset everything.

I'm thinking it might create a stutter between the GOPs.

> Can we declare the interface unstable for now via cargo feature?

Sure

> for example to give applications more flexibility to feed their own ptses into the algorithm.

Can't they do that already?
