[codex] Add CLI outputs, language detection, punctuation removal, and streaming demos by MXuer · Pull Request #109 · DataoceanAI/Dolphin

MXuer · 2026-06-11T03:10:58Z

Summary

This PR integrates fixes/features for four reported issues:

No direct text output #80: add direct CLI output support with --output and --output_format {txt,json,srt}.
Request for an option to manually turn off punctuation #92: add --remove_punctuation / remove_punctuation=True for punctuation-free transcription output.
Language Detection Model #93: expose language detection through dolphin.detect_language(...) and CLI --task detect_language.
Could a calling demo or related support be provided for the small.cn.streaming streaming model? #106: replace the old file chunk demo with cache-level streaming via forward_encoder_chunk, add CTC greedy partial output, CTC endpointing, optional final attention rescoring, and a microphone streaming terminal demo.

The streaming demo no longer includes any punctuation-model handling. It emits raw ASR partial/final text and relies on CTC endpoint rules for segmentation.
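
For readers unfamiliar with CTC greedy partial output, here is a minimal sketch of the collapse step that turns per-frame argmax ids into a token sequence. It is not taken from this PR: the function name `ctc_greedy_collapse` and the blank index 0 are assumptions for illustration, and the real demo may differ.

```python
from typing import List, Optional, Sequence

BLANK_ID = 0  # assumption: index 0 is the CTC blank token


def ctc_greedy_collapse(frame_ids: Sequence[int], blank_id: int = BLANK_ID) -> List[int]:
    """Collapse per-frame argmax ids: merge consecutive repeats, then drop blanks."""
    tokens: List[int] = []
    prev: Optional[int] = None
    for idx in frame_ids:
        # Emit a token only when it differs from the previous frame and is not blank.
        if idx != prev and idx != blank_id:
            tokens.append(idx)
        prev = idx
    return tokens


if __name__ == "__main__":
    # A blank between two identical ids keeps them as two separate tokens.
    print(ctc_greedy_collapse([0, 3, 3, 0, 3, 5, 5, 0]))
```

In a streaming setting the same collapse can be re-run on the frames accumulated so far after each encoder chunk, which is one way to produce partial text.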

Validation

env TRANSFORMERS_NO_TF=1 USE_TF=0 python -m pytest
- 23 passed
python -m py_compile examples/streaming_demo.py examples/microphone_streaming_demo.py
- passed
python examples/streaming_demo.py --help
- passed
python examples/microphone_streaming_demo.py --help
- passed
git diff --check
- passed
Real audio validation with Dolphin base on CPU:
- short zh-CN demo audio
- long zh-CN audio
- long hi-IN audio
Real CLI language detection:
- short zh-CN: zh CN
- long zh-CN: zh CN
- long hi-IN: hi IN
Real SRT + punctuation removal validation:
- short zh-CN: 1 cue, valid SRT, 0 subtitle-body punctuation chars
- long zh-CN: 60 cues, valid SRT, 0 subtitle-body punctuation chars
- long hi-IN: 174 cues, valid SRT, 0 subtitle-body punctuation chars
Real streaming smoke tests using small.cn.streaming on CPU:
- short zh-CN demo audio with --chunk_size 16 --emit line --final_rescore attention
- forced endpoint smoke test with --endpoint_rule3_min_utterance_length_ms 3000
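
The flag `--endpoint_rule3_min_utterance_length_ms` suggests rule-based endpointing in which a rule can fire once the utterance reaches a minimum length. The sketch below shows one common way to structure such rules. The class `EndpointRule`, its fields, and the combination logic are assumptions for illustration, not the PR's implementation.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class EndpointRule:
    # Hypothetical fields; the demo's actual rule parameters are not shown in this PR text.
    must_decode_something: bool
    min_trailing_silence_ms: int
    min_utterance_length_ms: int


def rule_fires(rule: EndpointRule, decoded_something: bool,
               trailing_silence_ms: int, utterance_length_ms: int) -> bool:
    """A rule fires only when all of its conditions hold."""
    decoded_ok = decoded_something or not rule.must_decode_something
    return (decoded_ok
            and trailing_silence_ms >= rule.min_trailing_silence_ms
            and utterance_length_ms >= rule.min_utterance_length_ms)


def is_endpoint(rules: Iterable[EndpointRule], decoded_something: bool,
                trailing_silence_ms: int, utterance_length_ms: int) -> bool:
    """An endpoint is declared as soon as any single rule fires."""
    return any(rule_fires(r, decoded_something, trailing_silence_ms, utterance_length_ms)
               for r in rules)


if __name__ == "__main__":
    # Assumed meaning of the smoke test: force an endpoint after 3000 ms of audio.
    rule3 = EndpointRule(must_decode_something=False,
                         min_trailing_silence_ms=0,
                         min_utterance_length_ms=3000)
    print(is_endpoint([rule3], True, 0, 3200))
```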

Test Reports

reports/issue-80-cli-output-test-report.md
reports/issue-92-disable-punctuation-test-report.md
reports/issue-93-language-detection-test-report.md
reports/issue-106-streaming-demo-test-report.md
reports/integration-issues-80-92-93-test-report.md

Notes

--remove_punctuation is output post-processing and does not change model decoding or weights.
Language detection still loads a Dolphin ASR model; this does not add a separate lightweight LID-only model.
The streaming demos are experimental terminal demos, not a production streaming server.
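
To illustrate what "output post-processing" can mean for --remove_punctuation, here is a minimal sketch that strips characters whose Unicode general category starts with "P". The helper name `strip_punctuation` is hypothetical, and the PR's actual handling of special tokens and word timestamps is not reproduced here.

```python
import unicodedata


def strip_punctuation(text: str) -> str:
    """Drop every character in a Unicode punctuation category (Pc, Pd, Po, ...)."""
    kept = "".join(ch for ch in text
                   if not unicodedata.category(ch).startswith("P"))
    # Collapse whitespace runs that removed punctuation may leave behind.
    return " ".join(kept.split())


if __name__ == "__main__":
    print(strip_punctuation("Hello, world!"))
    print(strip_punctuation("你好，世界！"))
```

Because this operates on the decoded string only, it leaves model decoding and weights untouched, which matches the note above. Currency and math symbols belong to the "S" categories and are kept by this sketch.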

Copilot

Pull request overview

This PR enhances Dolphin’s CLI/Python API usability by adding (1) direct CLI output emission to stdout/files with multiple formats, (2) a language-detection-only task exposed via both CLI and dolphin.detect_language(...), and (3) optional punctuation removal as an output post-processing step.

Changes:

Add CLI --output + --output_format {txt,json,srt} and implement formatting/emission helpers.
Add --task detect_language + --lid_duration and export detect_language at the package top level.
Add --remove_punctuation / remove_punctuation=True to strip Unicode punctuation from returned text and word timestamps, with unit tests and updated README examples.
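
As a rough illustration of the srt output format, the sketch below renders (start, end, text) segments as SRT cues. The helper names `srt_timestamp` and `to_srt` and the segment tuple shape are assumptions, not the formatting helpers added in dolphin/transcribe.py.

```python
from typing import Iterable, Tuple


def srt_timestamp(seconds: float) -> str:
    """Format seconds as the SRT timestamp HH:MM:SS,mmm."""
    ms_total = int(round(seconds * 1000))
    hours, rem = divmod(ms_total, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(segments: Iterable[Tuple[float, float, str]]) -> str:
    """Render numbered cues separated by a blank line."""
    cues = []
    for index, (start, end, text) in enumerate(segments, start=1):
        cues.append(f"{index}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text}\n")
    return "\n".join(cues)


if __name__ == "__main__":
    print(to_srt([(0.0, 1.25, "hello"), (1.25, 3.0, "world")]))
```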

Reviewed changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 3 comments.

Show a summary per file

File	Description
tests/test_punctuation.py	Adds unit tests for punctuation removal (text, special tokens, word timestamps) and CLI parsing.
tests/test_language_detection.py	Adds unit tests for detect-language API behavior, package export, CLI output, and audio duration limiting.
tests/test_cli_output.py	Adds unit tests for txt/json/srt formatting and stdout/file emission (including nested output dirs).
reports/issue-93-language-detection-test-report.md	Documents validation steps/results for language detection feature.
reports/issue-92-disable-punctuation-test-report.md	Documents validation steps/results for punctuation removal feature.
reports/issue-80-cli-output-test-report.md	Documents validation steps/results for CLI output formats and file writing.
reports/integration-issues-80-92-93-test-report.md	Integration validation report across all three features and their CLI interactions.
README.md	Updates install URL/model links and adds CLI/Python usage examples for new flags/tasks.
dolphin/transcribe.py	Implements punctuation removal, language detection duration limiting, CLI output formatting/emission, and new CLI arguments.
dolphin/model_registry.py	Fixes `small.cn` `model_id` typo.
dolphin/__init__.py	Exports `detect_language` at the package top level.


    use_two_stage_filter: bool = False,
    use_prompt_hotword: bool = False,
    prompt_filter_threshold: float = -2.0,
+    remove_punctuation: bool = False,


    use_two_stage_filter: bool = False,
    use_prompt_hotword: bool = False,
    prompt_filter_threshold: float = -4.0,
+    remove_punctuation: bool = False,


    parser.add_argument("--use_prompt_hotword", type=str2bool, default=False, help="use prompt-based hotword (default: false)")
    parser.add_argument("--prompt_filter_threshold", type=float, default=-2.0, help="filter threshold for prompt hotwords (default: -2.0)")
+    parser.add_argument("--remove_punctuation", type=str2bool, default=False, help="remove punctuation from transcription text output (default: false)")
+    parser.add_argument("--lid_duration", type=float, default=SPEECH_LENGTH, help="seconds of audio to use for language detection; set 0 to use full audio (default: 30)")
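
The --lid_duration help text describes limiting the audio used for language detection, with 0 meaning the full audio. A minimal sketch of that behavior on a raw sample sequence follows. The function name `limit_samples` and the 16 kHz rate in the example are assumptions, and the PR's own implementation may operate on a different audio representation.

```python
from typing import Sequence


def limit_samples(samples: Sequence[float], sample_rate: int, lid_duration: float) -> Sequence[float]:
    """Keep only the first lid_duration seconds; 0 (or less) keeps everything."""
    if lid_duration <= 0:
        return samples
    max_samples = int(lid_duration * sample_rate)
    return samples[:max_samples]


if __name__ == "__main__":
    one_minute = [0.0] * (16_000 * 60)  # 60 s of silence at an assumed 16 kHz
    print(len(limit_samples(one_minute, 16_000, 30)) / 16_000)
```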


wgb14 · 2026-06-11T03:34:43Z

perfect👍

MXuer added 4 commits June 11, 2026 10:54

Add CLI output formats

9a36c51

Expose language detection task

5e429b2

Add punctuation removal option

8ed0086

Add integration test report

5ad9a0f

This was referenced Jun 11, 2026

No direct text output #80

Open

Request for an option to manually turn off punctuation #92

Open

Language Detection Model #93

Open

MXuer added 2 commits June 11, 2026 11:14

Fix README repository and model links

0ddfbbb

Fix small CN model spelling

89ed386

MXuer requested a review from wgb14 June 11, 2026 03:22

wgb14 requested a review from Copilot June 11, 2026 03:23

Copilot started reviewing on behalf of wgb14 June 11, 2026 03:23

Copilot AI reviewed Jun 11, 2026


Add experimental streaming demo

c69ccc1

MXuer mentioned this pull request Jun 11, 2026

Could a calling demo or related support be provided for the small.cn.streaming streaming model? #106

Open

Add cache-level streaming demos

65e4ca0

MXuer marked this pull request as ready for review June 11, 2026 07:36

MXuer changed the title ~~[codex] Add CLI outputs, language detection task, and punctuation removal~~ [codex] Add CLI outputs, language detection, punctuation removal, and streaming demos Jun 11, 2026

MXuer wants to merge 8 commits into main from codex/integration-issues-80-92-93

3 participants