ReduceFix is an automated program repair (APR) system that leverages LLM-generated reducers to minimize failure-inducing test inputs before using them to guide repair generation. This repository contains the implementation and experimental scripts for reproducing the results presented in our paper.
Prompt Formats: We have organized all prompt formats used in the paper into the
prompt_formats/directory for easy reference. See Prompt Formats for details.
ReduceFix consists of three main phases:
- Reducer Generation: Prompts a code LLM once to generate a customized reducer script that can automatically reduce failure-inducing inputs for the specific task
- Input Reduction: Executes the generated reducer script to shrink the failure-inducing input iβ into a reduced test input i* while preserving the failure
-
Patch Generation: Embeds β¨P,
$s_w$ , i*β© in a repair prompt, samples candidate patches, and validates each one against the entire test suite until a correct program is found
The pipeline receives five inputs: the task description P, a correct reference solution A, a faulty submission
- Python 3.8+
- Access to LLM APIs (Qwen-Plus, DeepSeek-V3, etc.) with API keys configured in
config.py
-
Setup Configuration:
# Configure your API keys in config.py # Edit config.py with your actual API credentials
-
run the following scripts:
./rq1.sh
./rq2.sh
./rq3.sh
./rq4.sh
./rq5.sh- The LFTBench (including C++ and Python versions) is located in the
lftbenchsub-directory.
For the current journal manuscript, the compact result snapshots used to check
the reported tables are in artifact_snapshot/:
python3 artifact_snapshot/check_snapshot_numbers.pyReduceFix/
βββ lftbench/ # LFTBench dataset
β βββ data/ # Ground truth and submissions
β β βββ ground_truth/ # AC (Accepted) solutions
β β β βββ cpp/ # C++ AC codes
β β β βββ python/ # Python AC codes
β β βββ submissions/ # WA (Wrong Answer) submissions
β β βββ cpp/ # C++ submissions
β β βββ python/ # Python submissions
β βββ metadata/ # Problem metadata
β β βββ problems.json # Problem descriptions, samples
β β βββ cpp_submissions.jsonl
β β βββ python_submissions.jsonl
β βββ tests/ # Full test suites
β β βββ abc361/C/in/ # Test inputs for each problem
β β βββ ...
β βββ README.md # Dataset documentation
β
βββ results/ # Experiment artifacts
β βββ abc361c/ # Per-problem directory
β β βββ reducer.py # Generated reducer
β β βββ ac.cpp # AC code
β β βββ 68123456_cpp/ # Per-submission artifacts
β β βββ ...
β βββ ...
β
βββ oss_fuzz_results/ # OSS-Fuzz evaluation data
β βββ cases_data/ # Case-by-case data
β βββ experiment_results/ # Experiment statistics
β βββ compute_*.py # OSS-Fuzz analysis scripts
βββ dbms_results/ # SQLancer SQLite compact-reproducer experiment
βββ artifact_snapshot/ # Compact snapshots and checker for manuscript numbers
βββ prompt_formats/
β βββ repair.prompt # Prompt format of repair on LFTBench (C++)
β βββ repair-py.prompt # Prompt format of repair on LFTBench-Py (Python)
β βββ repair-ossfuzz.prompt # Prompt format of repair on OSS-Fuzz
β βββ repair-diffline.prompt # Prompt format of repair with Diff Lines strategy
β βββ reducer.prompt # Prompt format of reducer generation for LFTBench
β βββ reducer-ossfuzz.prompt # Prompt format of reducer generation for OSS-Fuzz
βββ result_reducer_*.json # Minimized reduction results (RQ1)
βββ result_repair_*.json # Minimized repair results (RQ2, RQ3)
βββ result_chatrepair.json # ChatRepair results (RQ4)
βββ result_cref.json # CREF results (RQ4)
β
βββ rq1.sh # RQ1: Reducer effectiveness
βββ rq2.sh # RQ2: Repair with reduced tests
βββ rq3.sh # RQ3: Prompt composition
βββ rq4.sh # RQ4: ChatRepair & CREF integration
βββ rq5.sh # RQ5: OSS-Fuzz evaluation
β
βββ reducer_builder.py # Generate problem-specific reducers
βββ reducer_test.py # Test reducers on submissions
βββ retest_single.py # Single submission testing (RQ1 demo)
β
βββ evaluate_repair.py # Repair evaluation (main)
βββ evaluate_repair_main.py # Repair evaluation (RQ3)
βββ evaluate_repair_with_chatrepair.py # ChatRepair evaluation
β
βββ analyze_*.py # Statistical analysis for each RQ
βββ compare_rq1_methods.py # RQ1 comparison table
βββ summarize_*.py # Result summarization scripts
β
βββ llm.py # LLM API interface
βββ tutor.py # API client configuration
βββ config.py # API keys and configuration
βββ lftbench_utils.py # Dataset access utilities
βββ tools.py # Utility functions
βββ README.md # This filelftbench/: Self-contained benchmark dataset with problems, solutions, submissions, and test suitesresults/: Generated reducers and per-problem/per-submission artifactsrq*.sh: Main entry points for reproducing each research questionresult_*.json: Consolidated experimental results (minimized for portability)artifact_snapshot/: Compact machine-readable snapshots for the current manuscript tablesdbms_results/: SQLancer SQLite reduction harness, data, and current result summariestemp/: Full-version result files with detailed metadata (not for distribution)- Core scripts:
reducer_builder.py,reducer_test.py,evaluate_repair*.py - Analysis scripts:
analyze_*.py,summarize_*.pyfor statistical analysis prompt_formats/: Prompt templates used in the paper (see Prompt Formats)
The prompt_formats/ directory contains all prompt format templates used in the paper for easy reference and reproduction:
-
repair.prompt- Standard repair prompt for LFTBench (C++)- Contains: problem description, faulty code, reduced test case (input/output)
- Used in: RQ2 as ReduceFix's main repair strategy
-
repair-py.prompt- Repair prompt for LFTBench-Py (Python)- Same structure as
repair.prompt, but for Python code - Used in: RQ2 for cross-language validation
- Same structure as
-
repair-ossfuzz.prompt- Repair prompt for OSS-Fuzz- Contains: crash info, stack trace, annotated source code, reduced test case (hex format)
- Output format: SEARCH/REPLACE style patches
- Used in: RQ5 for real-world project repairs
-
repair-diffline.prompt- Prompt for Diff Lines strategy- Shows only the first 10 lines of output differences, without full input
- Used in: RQ3 to test the impact of information selection
-
reducer.prompt- LFTBench reducer generation prompt- Input: problem description, example reducer code
- Output: problem-specific reducer.py script
- Used in: RQ1, RQ2, RQ3, RQ4 to generate reducers for LFTBench problems
-
reducer-ossfuzz.prompt- OSS-Fuzz reducer generation prompt- Input: project info, file format analysis, example reducer code
- Output: format-aware
generated_reduce()function - Used in: RQ5 to generate reducers for various file formats (PDF, fonts, images, etc.)
- All
{variable_name}placeholders in prompt files will be replaced with actual values during execution - For complete prompt construction logic, refer to:
- Repair:
generate_llm_repair()function inevaluate_repair.py - Reducer (LFTBench):
build_generation_prompt()function inreducer_builder.py - Reducer (OSS-Fuzz):
build_generation_prompt()function inoss_fuzz_results/reducer_builder.py
- Repair:
This research question evaluates the reliability and effectiveness of LLM-generated reducers at shrinking failure-inducing inputs. We compare three reduction approaches:
- ReduceFix (LLM-generated reducer + ddmin): Our approach that generates task-specific reducers
- DDmin-only: Pure ddmin baseline without custom reduction logic
- Pure LLM: Direct LLM-based reduction without ddmin refinement
Key metrics:
- Success Rate: Percentage of cases where reduction succeeds (reduced size < original size)
- Compression Ratio: Average reduction in input size (higher = better compression)
Run the command of RQ-1:
./rq1.shFor details and options, see the script content.
You can run a demo for RQ-1:
./rq1.sh --retest abc376e 66915962The violin plot in the following figure confirms this pattern: most points cluster near the median (100%), while a long lower tail for hard tasks (i.e., difficulty group E and F) highlights a few cases with modest shrinkage that lower the mean.
The figure below is a summary chart of all 20 automatically synthesized task-specific reducers, highlighting for each one the structural decomposition it applies and the semantic invariants it enforces beyond plain ddmin.
This research question evaluates whether reduced test cases can improve automated program repair effectiveness. We test ReduceFix's repair approach across four different LLM models on C++ submissions, and validate cross-language portability on Python submissions.
Models evaluated:
- Qwen2.5-Coder-7B-instruct: Small open-source model (7B parameters)
- GLM-4-9B-chat: Another small open-source model (9B parameters)
- Qwen-Plus: Large commercial model (cloud API)
- DeepSeek-V3: Large commercial model (cloud API)
Prompting strategies compared:
- Baseline (no test case): Only problem description and faulty code
- Origin Test: Full failure-inducing input/output pair
- Reduced Test (ReduceFix): Reduced input/output pair from our reducer
Datasets:
- LFTBench (C++): 200 faulty C++ submissions across 20 problems, evaluated with all 4 LLMs
- LFTBench (Python): 20 faulty Python submissions, evaluated with Qwen-Plus for cross-language validation
Evaluation metric: pass@k (k β {1, 5, 10}) - probability that at least one of k generated patches is correct
Baseline comparison: We also evaluate an end-to-end ddmin-only baseline that uses the reduced input when ddmin succeeds, otherwise falls back to the original failure-inducing test
The following prompt format shows the exact prompt template. If truncation occurred, the ellipsis token appears inside the fenced block to signal omitted lines. No other explanatory text is added, keeping the prompt well below typical context limits even on compact LLMs.
### Your Incorrect Code
```cpp
{wa_code}
```
### Failing Case
Input:
```
{reduced_failing_input}
```
Your Output:
```
{wa_output}
```
Expected Output:
```
{expected_output}
```
### Your Task
Fix the C++ code to pass ALL test cases (including hidden ones).
### Critical Guidelines
1. Focus on algorithmic correctness - NO hard-coded values
2. Keep complexity reasonable (target $O(N\log N)$ where possible)
3. Handle edge cases (empty input, single element, max constraints)
4. Use standard C++20 and avoid non-portable extensions
### Output Format
Provide ONLY the complete fixed C++ program inside a single cpp block.
Run the command of RQ-2:
./rq2.shFor details and options, see the script content.
This research question investigates the distinct influence of two factors within ReduceFix: (i) length reduction (fewer tokens to keep fault-relevant text within the model's attention span) and (ii) information selection (retaining minimal concrete evidence that still exposes the fault). We compare five prompt strategies on Qwen2.5-Coder-7B-instruct:
- Baseline (~3.2KB, ~130 lines): Problem + faulty code only
- Origin Test (~30.5KB, ~2381 lines): + full failure-inducing input/output pair
- Diff Lines (~3.2KB, ~133 lines): + up to 10 mismatched output lines (sparse evidence, same length as Baseline)
- Reduced Test (~6.6KB, ~514 lines): + reduced input/output pair (ReduceFix's default, joint action of length control and full information)
- Reduced + Origin (~36.4KB, ~2638 lines): + both reduced and full tests (redundant information, maximum length)
Key insight: The conjunction of compact length and complete counterexample information is important in this setting. Reduced Test has the highest observed overall pass@10 among these variants (25.5%), compared with Diff Lines (20.0%) and Origin Test (19.0%).
The Diff Lines strategy stays the same length as Baseline but appends up to 10 mismatched output lines, providing sparse evidence without increasing prompt size. This comparison evaluates whether minimal failure information alone (without full input/output) can guide repair.
### Problem Description
{full problem text}
### Your Incorrect Code
```cpp
{faulty code here}
```
### Failure Evidence (diff only)
Line 1: Got '42', Expected '43'
Line 2: Got '...', Expected '...'
...
### Your Task
Fix the code so that the diff disappears on all tests.
Return only the complete corrected C++ program in a ```cpp block.
Run the command of RQ-3:
./rq3.shFor details and options, see the script content.
This research question checks whether ReduceFix can be inserted into existing APR pipelines as a pre-repair evidence step. We replace only the failure-inducing test input while keeping patch generation and validation unchanged, using the same 10-sample budget across settings.
Systems evaluated:
-
ChatRepair: Conversational repair framework that alternates between user proxy and LLM with feedback
- Settings: MAX_RETRY=1 (one feedback round), length=2 (conversation window)
- First turn: task description + faulty code + failure-inducing test
- Second turn: test verdict (pass/fail) from harness
-
CREF: Context-aware reference-based repair with retrieval augmentation
- Settings: Two-turn setup using AtCoder editorial
- First turn: official editorial as high-level solution description
- Second turn: failure-inducing test case after validating first-turn patch
Comparison:
- Original: Using full failure-inducing input (Origin Test)
- + ReduceFix: Using reduced input from our reducer (Reduced Test)
Key results:
- ChatRepair: Overall pass@1/5/10 changes from 9.2/22.6/30.5 to 12.1/29.0/37.0.
- CREF: Overall pass@1/5/10 changes from 13.9/31.6/39.0 to 14.7/32.7/40.0.
Run the command of RQ-4:
./rq4.shFor details and options, see the script content.
This research question evaluates ReduceFix on repository-level crash-inducing inputs from OSS-Fuzz. The journal experiment uses a 167-case complete-run subset from the frozen >=4KB manifest where Baseline, Origin Test, and Reduced Test all have the same 10-candidate repair budget.
Evaluation dimensions:
- Test case reduction: Success rate and compression ratio across three approaches (DDmin-only, ReduceFix, Pure LLM)
- Repair effectiveness: pass@k (k β {1, 5, 10}) for three prompting strategies (Baseline, Origin Test, Reduced Test)
Key findings: ReduceFix reduces 83.2% of crash inputs, with 46.5%/51.3% average/median end-to-end compression rate when unsuccessful reductions count as 0. DDmin-only reaches 43.7% success and 37.5%/0.0% average/median end-to-end compression rate; Pure LLM reduction reaches 37.1% success and 33.2%/0.0%. Under Docker-grounded Qwen2.5-Plus validation, Reduced Test reaches 14.4% observed pass@10, compared with 11.4% for Origin Test and 12.0% for Baseline.
Legacy pilot command:
./rq5.shThis command belongs to the earlier small OSS-Fuzz pilot. The journal-scale RQ-5 reported in the manuscript is the 167-case complete-run cohort summarized above. Use artifact_snapshot/oss_fuzz_ge4k_complete10_deep_stats.json to check the reported tables and oss_fuzz_results/journal_manifests/ossfuzz_journal_ge4k_cases_filtered_reviewer_v1.json for the fixed 167-case cohort.

