RooseBERT: A New Deal For Political Language Modelling

Our models are available on HuggingFace, in the RooseBERT collection. If you use them, please cite:

@misc{dore2025roosebertnewdealpolitical,
    title = {RooseBERT: A New Deal For Political Language Modelling},
    author = {Deborah Dore and Elena Cabrio and Serena Villata},
    year = {2025},
    eprint = {2508.03250},
    archivePrefix = {arXiv},
    primaryClass = {cs.CL},
    url = {https://arxiv.org/abs/2508.03250},
}

1️⃣ Description

The goal of this project is to pre-train a domain-specific language model on a curated corpus of English political debates. By training on domain-specific content, we aim to generate embeddings that capture the nuanced language, rhetoric, and argumentation style unique to political discourse. The project investigates whether these enhanced embeddings improve performance on downstream tasks related to political debates such as sentiment analysis, stance detection, argument classification and relation classification.

RooseBERT was trained using two strategies:

  1. Continued Pre-Training (CONT): We initialise from BERT's original weights and vocabulary and continue training on the political debate corpus.
  2. Training from Scratch (SCR): We train BERT from random initialisation using a custom WordPiece tokenizer built from the domain corpus. This produces a domain-specific vocabulary that encodes political terminology as single tokens.

Each strategy was applied in both cased and uncased variants, yielding four RooseBERT models in total.

Objectives:

  1. Pre-Training:
    We pre-train BERT (CONT and SCR) on political debate transcripts to generate embeddings that reflect the intricate structure and linguistic patterns in political dialogue.
  2. Evaluation on Downstream Tasks:
    The effectiveness of these embeddings is assessed across a variety of downstream tasks, with a focus on tasks relevant to the political domain.
  3. Analysis:
    By comparing the performance of RooseBERT against BERT, ModernBERT, ConfliBERT, and PoliBERTweet, we demonstrate the effectiveness of domain-specific pre-training for political NLP.

2️⃣ Datasets

The following datasets were used for pre-training:

3️⃣ Models

This project produces RooseBERT, a domain-specific language model for English political debates, in four variants:

| Model | Strategy | Vocab |
| --- | --- | --- |
| RooseBERT-cont-cased | Continued pre-training from bert-base-cased | Original BERT cased vocab |
| RooseBERT-cont-uncased | Continued pre-training from bert-base-uncased | Original BERT uncased vocab |
| RooseBERT-scr-cased | Trained from scratch | Custom cased WordPiece vocab |
| RooseBERT-scr-uncased | Trained from scratch | Custom uncased WordPiece vocab |
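
The four variants are simply every combination of pre-training strategy and casing. A minimal sketch that enumerates them is shown here; the names follow the model names listed above, and the exact HuggingFace hub identifiers may differ (that part is an assumption).

```python
from itertools import product

# The two pre-training strategies and the two casing options described above.
STRATEGIES = ("cont", "scr")
CASINGS = ("cased", "uncased")


def roosebert_variants():
    """Return the four variant names as strategy x casing combinations."""
    return [f"RooseBERT-{s}-{c}" for s, c in product(STRATEGIES, CASINGS)]


for name in roosebert_variants():
    print(name)
```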

Comparison baselines used in the paper: bert-base-cased, bert-base-uncased, ModernBERT-base, ConfliBERT-cont-cased, ConfliBERT-cont-uncased, ConfliBERT-scr-cased, ConfliBERT-scr-uncased, and PoliBERTweet.

4️⃣ Installation

Conda Setup

# clone project
git clone https://github.com/MARIANNE-INRIA/RooseBERT
cd RooseBERT

# create conda environment and install dependencies
conda env create -f environment.yaml -n rooseBERT

# activate conda environment
conda activate rooseBERT

5️⃣ How to Run

🚀 Download the Corpora

Use the download_pretraining_data.sh script to download and prepare the datasets required for continued BERT pre-training. This script will use the prepare_training_dataset.py script to create the train/dev split from the raw dataset.

💡 Hint: For optimal BERT pre-training, we train on sequences of length 128 for the first 80% of the steps and on sequences of length 512 for the remaining 20%.

python script/prepare_training_dataset.py
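
For readers who want to see what a train/dev split amounts to, here is a minimal, seeded sketch. It is not the logic of prepare_training_dataset.py: the function name, the `dev_fraction` value and the seed are all hypothetical choices made for illustration.

```python
import random


def train_dev_split(rows, dev_fraction=0.1, seed=42):
    """Shuffle rows reproducibly and hold out a fraction as the dev set.

    dev_fraction and seed are illustrative defaults, not values taken
    from the repository.
    """
    rows = list(rows)
    rng = random.Random(seed)  # local RNG, so global random state is untouched
    rng.shuffle(rows)
    n_dev = max(1, int(len(rows) * dev_fraction))
    return rows[n_dev:], rows[:n_dev]


train, dev = train_dev_split(range(100))
print(len(train), len(dev))
```

Using a fixed seed keeps the split identical across runs, which matters when several models are compared on the same dev set.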

🚀 Pre-Training: Continued Pre-Training (CONT)

To continue pre-training BERT using Masked Language Modeling (MLM), use the run_mlm.py script and the run_mlm.sh shell script. The pre-training process consists of two phases:

  1. Phase 1: Train for 120k steps with a maximum sequence length of 128.
  2. Phase 2: Resume from the Phase 1 checkpoint and continue to a cumulative total of 150k steps (i.e., 30k additional steps) with a maximum sequence length of 512.
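
Because `--max_steps` is cumulative when resuming from a checkpoint, the Phase 2 value is the overall total rather than the number of extra steps. The arithmetic can be sketched as follows; the helper name `phase_schedule` is hypothetical and only restates the numbers given above.

```python
def phase_schedule(phase1_steps, total_steps):
    """Split a cumulative step budget into the two sequence-length phases."""
    phase2_steps = total_steps - phase1_steps
    return {
        "phase1_steps": phase1_steps,
        "phase2_steps": phase2_steps,
        "phase1_share": phase1_steps / total_steps,
        "phase2_share": phase2_steps / total_steps,
    }


# CONT schedule from the list above: 120k steps at length 128, 150k in total.
cont = phase_schedule(phase1_steps=120_000, total_steps=150_000)
print(cont["phase2_steps"])  # 30000 additional steps at length 512
print(f"{cont['phase1_share']:.0%} / {cont['phase2_share']:.0%}")
```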

Below is the recommended configuration, though you can modify parameters as needed. A ready-to-run version is provided in the run_mlm.sh script.

Phase 1: Training with Sequence Length 128

python -m torch.distributed.launch --nproc_per_node=8 \
        --master_addr=123 \
        src/run_mlm.py \
        --model_name_or_path "bert-base-cased" \
        --cache_dir "cache/bert-base-cased-batch2048-lr5e-4/" \
        --train_file "data/training/max_128/train.csv" \
        --validation_file "data/training/max_128/dev.csv" \
        --max_seq_length 128 \
        --preprocessing_num_workers 4 \
        --output_dir "logs/bert-base-cased-batch2048-lr5e-4/" \
        --do_train \
        --do_eval \
        --eval_strategy "steps" \
        --per_device_train_batch_size 64 \
        --per_device_eval_batch_size 64 \
        --gradient_accumulation_steps 4 \
        --learning_rate 5e-4 \
        --weight_decay 0.01 \
        --adam_beta1 0.9 --adam_beta2 0.98 --adam_epsilon 1e-6 \
        --max_steps 120000 \
        --warmup_steps=10000 \
        --logging_dir "logs/bert-base-cased-batch2048-lr5e-4/" \
        --logging_strategy "steps" \
        --logging_steps 500 \
        --save_strategy "steps" \
        --save_steps 20000 \
        --save_total_limit 3 \
        --seed 42 \
        --data_seed 42 \
        --fp16 \
        --local_rank 0 \
        --eval_steps 1000 \
        --dataloader_num_workers 8 \
        --run_name "bert-base-cased-batch2048-lr5e-4" \
        --deepspeed "configs/deepspeed_config.json" \
        --report_to "wandb" \
        --eval_on_start \
        --log_level "detail"

Phase 2: Training with Sequence Length 512

python -m torch.distributed.launch --nproc_per_node=8 \
        --master_addr=123 \
        src/run_mlm.py \
        --model_name_or_path "logs/bert-base-cased-batch2048-lr5e-4/checkpoint-120000" \
        --overwrite_output_dir \
        --resume_from_checkpoint "logs/bert-base-cased-batch2048-lr5e-4/checkpoint-120000" \
        --cache_dir "cache/bert-base-cased-batch2048-lr5e-4/" \
        --train_file "data/training/max_512/train.csv" \
        --validation_file "data/training/max_512/dev.csv" \
        --max_seq_length 512 \
        --preprocessing_num_workers 4 \
        --output_dir "logs/bert-base-cased-batch2048-lr5e-4/" \
        --do_train \
        --do_eval \
        --eval_strategy "steps" \
        --per_device_train_batch_size 64 \
        --per_device_eval_batch_size 64 \
        --gradient_accumulation_steps 4 \
        --learning_rate 5e-4 \
        --weight_decay 0.01 \
        --adam_beta1 0.9 --adam_beta2 0.98 --adam_epsilon 1e-6 \
        --max_steps 150000 \
        --logging_dir "logs/bert-base-cased-batch2048-lr5e-4/" \
        --logging_strategy "steps" \
        --logging_steps 500 \
        --save_strategy "steps" \
        --save_steps 20000 \
        --save_total_limit 3 \
        --seed 42 \
        --data_seed 42 \
        --fp16 \
        --local_rank 0 \
        --eval_steps 1000 \
        --dataloader_num_workers 8 \
        --run_name "bert-base-cased-batch2048-lr5e-4" \
        --deepspeed "configs/deepspeed_config.json" \
        --report_to "wandb" \
        --eval_on_start \
        --log_level "detail"

Notes

  • DeepSpeed (configured in deepspeed_config.json) is combined with FP16 mixed precision and gradient accumulation to speed up training and to reach a large effective batch size.
  • The above example uses bert-base-cased; replace with bert-base-uncased for the uncased CONT variant.
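
The "batch2048" part of the run name corresponds to the effective batch size implied by the flags above. A small sketch of that calculation, assuming the DeepSpeed configuration does not override the per-device batch size or the accumulation steps:

```python
def effective_batch_size(per_device_batch, num_devices, grad_accum_steps):
    """Number of sequences contributing to each optimizer update."""
    return per_device_batch * num_devices * grad_accum_steps


# --per_device_train_batch_size 64, --nproc_per_node=8, --gradient_accumulation_steps 4
print(effective_batch_size(64, 8, 4))  # 2048
```

If you train on fewer GPUs, raising `--gradient_accumulation_steps` proportionally keeps the same effective batch size.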

🚀 Pre-Training: Training from Scratch (SCR)

To train RooseBERT from scratch with a custom domain vocabulary, use the run_mlm_scratch.sh script. A custom WordPiece tokenizer must be trained first on the political debate corpus and saved to ./tokenizer_cased/ or ./tokenizer_uncased/.

The SCR pre-training also uses two phases:

  1. Phase 1: Train for 200k steps with a maximum sequence length of 128.
  2. Phase 2: Resume from the Phase 1 checkpoint and continue to a cumulative total of 250k steps (i.e., 50k additional steps) with a maximum sequence length of 512.

# Edit run_mlm_scratch.sh to set TYPE="cased" or TYPE="uncased", then:
sbatch sh/run_mlm_scratch.sh

🚀 Downstream Tasks

We evaluated RooseBERT and all comparison models (BERT, ModernBERT, ConfliBERT, PoliBERTweet) on the following downstream tasks. Below is a summary of the tasks and their datasets:

  • ParlVote (sentence-pair, binary classification)
    • Sentiment analysis of UK Parliamentary Debates using both motion and speech text
  • HanDeSeT (binary classification)
    • Sentiment analysis of UK Parliamentary Debates
  • ConVote (binary classification)
    • Stance detection of US Congressional floor debates
  • AusHansard (binary classification, cross-domain)
    • Stance detection on Australian Parliamentary Debates; used for cross-domain evaluation
  • ElecDeb60to20 (two tasks):
    • Argument Component Detection and Classification (sequence labelling) in US Presidential Debates
    • Argument Relation Prediction and Classification (sentence-pair, multi-class) in US Presidential Debates
  • ArgUNSC (two tasks):
    • Argument Component Detection and Classification (sequence labelling) in UN Security Council debates
    • Argument Relation Prediction and Classification (sentence-pair, multi-class) in UN Security Council debates
  • ParlVote+ (multi-class classification)
    • Policy preference classification of UK Parliamentary speeches (34 policy categories)
  • NEREx (NER / token classification)
    • Named entity recognition in US Presidential Debate transcripts (37 entity types)

To sum up:

| Task Type | Count |
| --- | --- |
| binary classification | 4 |
| multi-class classification | 4 |
| sequence labelling | 3 |

| Task Type | Count |
| --- | --- |
| single sentence | 5 |
| sentence-pair | 3 |
| ner | 3 |

| Task Type | Count |
| --- | --- |
| sentiment analysis | 2 |
| stance detection | 2 |
| policy preference classification | 2 |
| argument component detection and classification | 2 |
| argument component relation prediction and classification | 2 |
| NER | 1 |

To download all the necessary datasets use the download_downstream_data.sh script. Then use the prepare_downstream_data.py script to process all datasets.

./download_downstream_data.sh

python script/prepare_downstream_data.py

🚀 Extract Results

At the end of each run, the results are written to the RooseBERT/logs/task_name/model_name/ folder. The extract_results.py script processes these results automatically and saves them in a CSV file.

python extract_results.py
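
To check which runs are present before extracting results, the folder layout described above can be walked directly. This is a minimal sketch, not the logic of extract_results.py: `list_runs` is a hypothetical helper, and it assumes every first-level folder under the log root is a task and every second-level folder is a model.

```python
from pathlib import Path


def list_runs(log_root):
    """Yield (task_name, model_name, run_dir) for each <task>/<model>/ folder."""
    root = Path(log_root)
    if not root.is_dir():
        return  # nothing to list if the log folder does not exist yet
    for task_dir in sorted((p for p in root.iterdir() if p.is_dir()), key=lambda p: p.name):
        for model_dir in sorted((p for p in task_dir.iterdir() if p.is_dir()), key=lambda p: p.name):
            yield task_dir.name, model_dir.name, model_dir


for task, model, run_dir in list_runs("logs"):
    print(f"{task}\t{model}\t{run_dir}")
```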

If you have run the model multiple times with different seeds, use the compute_stats.py script to extract mean and standard deviation.

python compute_stats.py
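
The aggregation over seeds boils down to a mean and a standard deviation per task and model. A minimal sketch with the standard library follows; it is not the code of compute_stats.py, the scores are placeholders rather than real results, and whether the script uses the sample or the population standard deviation is not stated here (the sketch uses the sample version).

```python
from statistics import mean, stdev


def summarize(scores):
    """Return (mean, sample standard deviation) over per-seed scores."""
    if len(scores) < 2:
        raise ValueError("need at least two seeds to estimate a standard deviation")
    return mean(scores), stdev(scores)


# Placeholder values for illustration only; these are not results from the paper.
per_seed_scores = [0.70, 0.72, 0.74]
avg, sd = summarize(per_seed_scores)
print(f"{avg:.3f} ± {sd:.3f}")
```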

Acknowledgement

This work has been supported by the French government, through the 3IA Cote d’Azur Investments in the project managed by the National Research Agency (ANR) with the reference number ANR-23-IACL-0001. This project was provided with computing AI and storage resources by GENCI at IDRIS thanks to the grant 2026-AD011016047R1 on the supercomputer Jean Zay’s A100 partition.
