Code for the paper "Emergence of Context Characteristics Sensitivity in Large Language Models".
This repository investigates how LLMs' sensitivity to context characteristics evolves across instruction fine-tuning (IFT) stages: supervised fine-tuning (SFT), direct preference optimization (DPO), and reinforcement learning with verifiable rewards (RLVR).
Key findings:
- SFT consistently instills sensitivity toward easy-to-understand contexts (high lexical similarity, low perplexity, shorter length) across model families and datasets.
- DPO's effect on sensitivity is driven by characteristic differences between chosen and rejected responses in the training data. Careful dataset curation is needed to neutralize SFT-induced biases.
- RLVR broadly preserves DPO's direction of change.
.
├── data/
│   ├── evaluation_dataset/            # ConflictQA, Context-Reliance, DRUID eval sets
│   └── property_detection/
│       └── evaluation_dataset/        # Property-annotated eval sets (.tsv)
├── src/
│   ├── get_model_predictions/
│   │   ├── get_predictions.py         # Model inference entrypoint
│   │   └── prompts.py                 # Prompt templates
│   ├── property_detection/
│   │   ├── get_properties.py          # Heuristic property extraction (Jaccard, Flesch, length, etc.)
│   │   └── get_perplexity.py          # Perplexity calculation via HuggingFace
│   ├── prepare_data.ipynb             # Dataset preprocessing
│   └── get_plots.ipynb                # Figure generation
└── requirements.txt
pip install -r requirements.txt

Additional dependencies not in requirements.txt:
- spacy with en_core_web_trf, for named entity recognition (python -m spacy download en_core_web_trf)
Set your HuggingFace token in a .env file:
HF_TOKEN=your_token_here
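
As a rough illustration of how a script could pick up this token, the sketch below parses a .env file with the standard library only. The helper name `load_env_file` is hypothetical, and the repository's own scripts may well read the file differently (for example through a dotenv package), so treat this as an assumption rather than the project's actual loading code.

```python
import os
from pathlib import Path


def load_env_file(path=".env"):
    """Parse simple KEY=VALUE lines from a .env file (hypothetical helper)."""
    values = {}
    env_path = Path(path)
    if not env_path.exists():
        return values
    for raw_line in env_path.read_text().splitlines():
        line = raw_line.strip()
        # Skip blank lines, comments, and lines without an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


# Prefer the .env value, fall back to an already-exported environment variable.
hf_token = load_env_file().get("HF_TOKEN") or os.environ.get("HF_TOKEN")
```

Keep the .env file out of version control so the token is not committed by accident.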
python src/property_detection/get_properties.py \
--data_path data/evaluation_dataset/context-reliance.csv \
--save_folder data/output/ \
--properties "claim_evidence_jaccard_sim flesch_reading_ease_score evidence_length claim_length uncertain_rate_lexicon"

Available properties: claim_evidence_jaccard_sim, claim_entity_overlap, flesch_reading_ease_score, evidence_length, claim_length
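
To give a feel for what the lexical properties measure, here is a minimal sketch of a Jaccard similarity between a claim and its evidence, plus a word-count length. The tokenizer (lowercasing and splitting on non-alphanumeric characters) and the function names are assumptions made for illustration; get_properties.py may tokenize and normalize differently.

```python
import re


def tokenize(text: str) -> set:
    """Lowercase and split on non-alphanumeric characters (assumed tokenizer)."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard_similarity(claim: str, evidence: str) -> float:
    """Size of the token intersection divided by the size of the token union."""
    claim_tokens, evidence_tokens = tokenize(claim), tokenize(evidence)
    if not claim_tokens and not evidence_tokens:
        return 0.0  # Convention chosen here for two empty inputs.
    return len(claim_tokens & evidence_tokens) / len(claim_tokens | evidence_tokens)


def word_length(text: str) -> int:
    """Length as a whitespace-delimited word count (assumed definition)."""
    return len(text.split())


claim = "Paris is the capital of France"
evidence = "The capital of France is Paris"
print(jaccard_similarity(claim, evidence), word_length(evidence))
```

A higher Jaccard score means the context shares more vocabulary with the claim, which is the "high lexical similarity" direction discussed in the key findings.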
To compute perplexity:
python src/property_detection/get_perplexity.py <model_name> <data_path> [revision]

To get model predictions:
python -m src.get_model_predictions.get_predictions \
--data_file data/evaluation_dataset/context-reliance.csv \
--save_folder data/output/ \
--use_evidence yes \
--model_name meta-llama/Llama-3.2-1B \
--prompt_name qa

Use --prompt_name claim_verification for DRUID; use qa for ConflictQA and Context-Reliance.
Use --instruct for instruction-tuned models.
Use --revision to evaluate a specific checkpoint.
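
When evaluating several checkpoints, it can help to generate the commands programmatically. The sketch below only assembles argument lists from the flags documented above; it does not run anything. It assumes that --instruct is a bare switch and that --revision takes a single value, and the revision names are hypothetical placeholders, not real checkpoint names.

```python
import shlex


def build_prediction_command(model_name, data_file, save_folder, prompt_name,
                             use_evidence="yes", instruct=False, revision=None):
    """Assemble the inference command as an argument list (illustrative helper)."""
    cmd = [
        "python", "-m", "src.get_model_predictions.get_predictions",
        "--data_file", data_file,
        "--save_folder", save_folder,
        "--use_evidence", use_evidence,
        "--model_name", model_name,
        "--prompt_name", prompt_name,
    ]
    if instruct:
        cmd.append("--instruct")  # Assumed to be a bare flag.
    if revision is not None:
        cmd += ["--revision", revision]  # Assumed to take one value.
    return cmd


# Hypothetical revision names; substitute the checkpoints you actually have.
revisions = ["step_1000", "step_2000"]
commands = [
    build_prediction_command(
        model_name="meta-llama/Llama-3.2-1B",
        data_file="data/evaluation_dataset/context-reliance.csv",
        save_folder="data/output/",
        prompt_name="qa",
        revision=rev,
    )
    for rev in revisions
]
for cmd in commands:
    print(shlex.join(cmd))
```

Each printed line can be pasted into a shell, or the lists can be passed to subprocess.run once the flags have been checked against get_predictions.py.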
Run src/prepare_data.ipynb first to preprocess the data, then src/get_plots.ipynb to reproduce the paper's figures.
Llama-3.2-1B was fine-tuned using the Open-Instruct library on 4×A100 GPUs:
| Stage | Learning Rate | Batch Size | Grad. Accum. | Epochs | Max Seq. Len |
|---|---|---|---|---|---|
| SFT | 3e-5 | 2 | 16 | 2 | 4096 |
| DPO | 2.5e-6 | 2 | 16 | 1 | 2048 |
Checkpoints are saved every 1,000 steps (SFT) and every 250 steps (DPO).
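
For orientation, the effective batch size implied by the table can be worked out as below. This assumes the "Batch Size" column is per device, that all four GPUs are used for data-parallel training, and that a "step" is one optimizer update; none of these is stated explicitly above, so the resulting numbers are indicative only.

```python
def effective_batch_size(per_device_batch, grad_accum_steps, num_gpus):
    """Examples consumed per optimizer step under the assumptions stated above."""
    return per_device_batch * grad_accum_steps * num_gpus


def examples_per_checkpoint(save_every_steps, batch):
    """Examples seen between two consecutive checkpoints."""
    return save_every_steps * batch


# Both stages use batch size 2 and gradient accumulation 16 on 4 GPUs.
batch = effective_batch_size(per_device_batch=2, grad_accum_steps=16, num_gpus=4)
sft_examples = examples_per_checkpoint(1000, batch)  # SFT saves every 1,000 steps
dpo_examples = examples_per_checkpoint(250, batch)   # DPO saves every 250 steps
print(batch, sft_examples, dpo_examples)
```

Under these assumptions, each optimizer step covers 128 examples, so SFT checkpoints are about 128,000 examples apart and DPO checkpoints about 32,000.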