Information Retrieval Engine

Stack Exchange Document Ranking: Kaggle Competition

Python PyTorch scikit-learn Sentence Transformers BM25+ Jupyter Kaggle


Best Score 0.49541 · Classifier Macro-F1 0.9852 · 216,041 documents · 141 queries


🌐 Language: 🇬🇧 English | 🇫🇷 Read in French


Overview

This project is a two-phase information retrieval system built for a university Kaggle competition on the Stack Exchange ecosystem. Given 141 natural-language queries, the system ranks 216,041 technical documents from five communities (Android, Gaming, Programmers, TeX, and Unix) and returns the 100 most relevant documents per query.

The system combines:

  • Sparse retrieval: BM25+ with a tech-aware tokenizer (handles c++, c#, .net, …)
  • Dense retrieval: semantic embeddings from all-MiniLM-L12-v2, cached under SHA-256 keys
  • Fusion: Reciprocal Rank Fusion (RRF) to merge both rankings
  • Classification: LinearSVC to predict the source Stack Exchange community
  • Reranking: a hard filter that promotes category-matched documents to the top
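
As a rough illustration of what "tech-aware" tokenization could mean here, the sketch below keeps tokens such as c++, c#, and .net intact instead of stripping their punctuation. The function name tech_tokenize and the regular expression are assumptions made for this example; the tokenizer in src/retrieval/bm25.py may work differently.

```python
import re

# Hypothetical pattern (not taken from the repository):
#   - a leading dot is kept only when it starts a token (".net"),
#   - trailing "+" or "#" characters stay attached ("c++", "c#").
_TOKEN_RE = re.compile(r"(?<![a-z0-9])\.[a-z0-9]+|[a-z0-9]+[+#]*")


def tech_tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into tokens, preserving tech terms."""
    return _TOKEN_RE.findall(text.lower())


tokens = tech_tokenize("How do I call C++ from C# on .NET?")
```

A plain word-character tokenizer would reduce both c++ and c# to the single token c, which is why BM25 benefits from this kind of handling on technical corpora.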

Pipeline

Phase 1: Hybrid Retrieval

                          ┌─────────────────────────────┐
                          │        Raw Query            │
                          └──────────────┬──────────────┘
                                         │
               ┌─────────────────────────┴─────────────────────────┐
               │                                                   │
               ▼                                                   ▼
   ┌───────────────────────┐                        ┌───────────────────────────┐
   │      BM25+ Branch     │                        │    Embedding Branch       │
   │                       │                        │                           │
   │  Tech-aware tokenizer │                        │  all-MiniLM-L12-v2        │
   │  (c++, c#, .net, …)   │                        │  384-dim vectors          │
   │  BM25Plus over 216k   │                        │  SHA-256 cache (.npy)     │
   │  docs                 │                        │  Cosine similarity        │
   │                       │                        │  GPU / MPS / CPU auto     │
   └───────────┬───────────┘                        └──────────────┬────────────┘
               │                                                   │
               │  rank_bm25(d)                         rank_emb(d) │
               │                                                   │
               └─────────────────────┬─────────────────────────────┘
                                     │
                                     ▼
                      ┌──────────────────────────────┐
                      │     RRF Fusion               │
                      │                              │
                      │  score(d) =                  │
                      │    w_emb  / (k + rank_emb)   │
                      │  + w_bm25 / (k + rank_bm25)  │
                      │                              │
                      │  w_emb=3.5  w_bm25=0.5  k=20 │
                      └──────────────┬───────────────┘
                                     │
                                     ▼
                              Top-100 Documents

Key parameters:

Parameter                 Default            Description
embedding_model_name      all-MiniLM-L12-v2  SentenceTransformer model
hybrid_rrf_k              20                 RRF constant (rank discount)
hybrid_weight_embeddings  3.5                Embedding branch weight
hybrid_weight_bm25        0.5                BM25+ branch weight
top_k                     100                Number of results returned
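
The fusion step can be sketched in a few lines of Python. This is a minimal illustration of the weighted RRF formula from the diagram, not the code in src/retrieval/hybrid.py. The function name weighted_rrf, the 1-based ranks, and the choice to give a document no contribution from a branch that did not return it are all assumptions.

```python
from collections import defaultdict


def weighted_rrf(rank_emb, rank_bm25, w_emb=3.5, w_bm25=0.5, k=20, top_k=100):
    """Fuse two rankings (lists of doc ids, best first) with weighted RRF.

    score(d) = w_emb / (k + rank_emb(d)) + w_bm25 / (k + rank_bm25(d))
    Ranks are 1-based (assumption). A document absent from one ranking
    receives no contribution from that branch (assumption).
    """
    scores = defaultdict(float)
    for weight, ranking in ((w_emb, rank_emb), (w_bm25, rank_bm25)):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += weight / (k + rank)
    # Highest fused score first; ties are broken by doc id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))[:top_k]


fused = weighted_rrf(["a", "b", "c"], ["c", "a", "d"])
```

With the default weights the embedding branch dominates: a document ranked first by the embeddings alone scores 3.5 / 21, while a document ranked first by BM25+ alone scores only 0.5 / 21.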

Phase 2: Classification & Reranking

                      ┌─────────────────────────────┐
                      │        Test Query           │
                      └──────────────┬──────────────┘
                                     │
                                     ▼
                      ┌──────────────────────────────────┐
                      │  1. Feature Extraction           │
                      │     TF-IDF (~5 000 terms)        │
                      │     fitted on 216k documents     │
                      └──────────────┬───────────────────┘
                                     │
                                     ▼
                      ┌──────────────────────────────────┐
                      │  2. LinearSVC Classifier         │
                      │     Macro-F1 = 0.9852            │
                      │     → predicted_category         │
                      │     android | gaming | tex |     │
                      │     unix | programmers           │
                      └──────────────┬───────────────────┘
                                     │
                    ┌────────────────┴────────────────┐
                    │                                 │
                    ▼                                 ▼
     ┌──────────────────────────┐     ┌──────────────────────────────┐
     │  3. Query Expansion      │     │  4. Pseudo-Relevance         │
     │                          │     │     Feedback (PRF)           │
     │  query += category vocab │     │                              │
     │  e.g. "android" adds:    │     │  BM25 top-3 docs             │
     │  "mobile app java kotlin │     │  → extract tags              │
     │   apk development"       │     │  → append (max 10 tags)      │
     └──────────────────────────┘     └──────────────────────────────┘
                    │                                 │
                    └────────────────┬────────────────┘
                                     │
                                     ▼
                      ┌──────────────────────────────────┐
                      │  5. Hybrid Retrieval             │
                      │     BM25+ + Embeddings + RRF     │
                      │     Candidate pool = 1 000 docs  │
                      │     (top_k × retrieval_k_mult.)  │
                      └──────────────┬───────────────────┘
                                     │
                                     ▼
                      ┌──────────────────────────────────┐
                      │  6. Hard-Filter Reranking        │
                      │     category-match docs → front  │
                      │     relative order preserved     │
                      └──────────────┬───────────────────┘
                                     │
                                     ▼
                              Top-100 final ranking
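
Step 6 amounts to a stable partition of the candidate pool. The sketch below is one way to express it, assuming a mapping from document id to community is available. The names hard_filter_rerank and doc_category are hypothetical and are not taken from src/classification/rerank.py.

```python
def hard_filter_rerank(candidates, doc_category, predicted_category, top_k=100):
    """Move category-matched documents to the front, keeping relative order.

    candidates: doc ids ordered by fused retrieval score, best first.
    doc_category: dict mapping doc id -> community name (assumed available).
    """
    matched = [d for d in candidates if doc_category.get(d) == predicted_category]
    others = [d for d in candidates if doc_category.get(d) != predicted_category]
    # Both halves keep their retrieval order, so this is a stable partition.
    return (matched + others)[:top_k]


pool = ["d1", "d2", "d3", "d4"]
categories = {"d1": "unix", "d2": "android", "d3": "unix", "d4": "android"}
reranked = hard_filter_rerank(pool, categories, "android")
```

Because non-matching documents are only demoted rather than dropped, a misclassified query still returns a full list. This works well here because the candidate pool (1,000 documents) is ten times larger than the final top-100.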

Key parameters:

Parameter               Default      Description
classifier_method       svc          svc | logreg | nb | mlp
feature_method          tfidf        tfidf | count | embeddings
rerank_strategy         hard_filter  hard_filter | soft_boost
retrieval_k_multiplier  10           Candidate pool = top_k × 10
prf_top_k               3            Number of PRF feedback documents
prf_max_tags            10           Max tags injected via PRF
hybrid_rrf_k            30           RRF constant for Phase 2
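
To make prf_top_k and prf_max_tags concrete, here is a hedged sketch of tag-based pseudo-relevance feedback. It assumes each document carries a list of tags and that tags are deduplicated in the order they are first seen. The function name expand_with_prf_tags and the doc_tags mapping are illustrative only.

```python
def expand_with_prf_tags(query, bm25_ranked_ids, doc_tags, prf_top_k=3, prf_max_tags=10):
    """Append tags from the top BM25 hits to the query string.

    bm25_ranked_ids: doc ids from a first BM25 pass, best first.
    doc_tags: dict mapping doc id -> list of tags (assumed data layout).
    """
    seen = set()
    tags = []
    for doc_id in bm25_ranked_ids[:prf_top_k]:
        for tag in doc_tags.get(doc_id, []):
            if tag not in seen:
                seen.add(tag)
                tags.append(tag)
    tags = tags[:prf_max_tags]
    return " ".join([query, *tags]) if tags else query


doc_tags = {"d1": ["adb", "usb"], "d2": ["usb", "drivers"], "d3": [], "d4": ["ignored"]}
expanded = expand_with_prf_tags("install adb", ["d1", "d2", "d3", "d4"], doc_tags)
```

Only the first prf_top_k documents are consulted, so the tags of d4 never reach the query in this example.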

Project Structure

.
├── src/                              # Core library
│   ├── config.py                     # Dataclass configs & path auto-resolution
│   ├── data/
│   │   ├── load.py                   # JSON data loader
│   │   └── preprocess.py             # Content field builder & text cleaner
│   ├── retrieval/
│   │   ├── tfidf.py                  # TF-IDF baseline
│   │   ├── bm25.py                   # BM25 / BM25+ with tech tokenizer
│   │   ├── embeddings.py             # Dense retrieval + SHA-256 cache
│   │   └── hybrid.py                 # RRF fusion
│   ├── classification/
│   │   ├── interfaces.py             # Category definitions & protocols
│   │   ├── features.py               # TF-IDF / Count / Embedding feature builders
│   │   ├── model.py                  # Unified classifier (SVC / LogReg / MLP / NB)
│   │   └── rerank.py                 # hard_filter & soft_boost strategies
│   ├── evaluation/
│   │   ├── metrics.py                # Precision@k, Recall@k, MRR@k, F1@k
│   │   └── evaluate.py               # Evaluation pipelines
│   └── kaggle/
│       ├── format.py                 # Kaggle CSV formatter
│       ├── submit_prep.py            # Phase 1 CLI entry point
│       └── submit_phase2.py          # Phase 2 CLI entry point
│
├── notebooks/                        # Step-by-step exploration & ablations
│   ├── 01_data_exploration.ipynb     # EDA, dataset stats
│   ├── 02_tfidf_retrieval.ipynb      # TF-IDF baseline
│   ├── 03_bm25_retrieval.ipynb       # BM25 / BM25+ tuning
│   ├── 04_embeddings_retrieval.ipynb # Dense retrieval + UMAP
│   ├── 05_evaluation_suite.ipynb     # Unified method comparison
│   ├── 06_phase1_submission.ipynb    # ★ Standalone Phase 1 pipeline
│   ├── 07_phase2_classification_ablation.ipynb  # 11-config ablation
│   ├── 08_phase2_pipeline_eval.ipynb # End-to-end Phase 2 evaluation
│   └── 09_phase2_submission.ipynb    # ★ Standalone Phase 2 pipeline
│
├── data/
│   ├── raw/         # Original JSON files (docs, queries, qrels)
│   ├── processed/   # Pre-built content field
│   └── cache/       # Embedding .npy arrays (SHA-256 keyed)
│
├── outputs/
│   ├── runs/        # Evaluation logs (JSON)
│   └── submissions/ # Kaggle submission CSVs
│
└── requirements.txt
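
The data/cache/ directory stores embedding arrays under SHA-256 keys. One plausible way to derive such a key is sketched below. The function name embedding_cache_path and the inputs fed into the hash (model name, precision, and the texts themselves) are assumptions; src/retrieval/embeddings.py may hash something else.

```python
import hashlib
from pathlib import Path


def embedding_cache_path(cache_dir, model_name, texts, precision="float32"):
    """Return a deterministic .npy path for a (model, precision, texts) triple.

    Hypothetical key layout: every field is separated by a NUL byte so that
    ["ab"] and ["a", "b"] do not collide.
    """
    digest = hashlib.sha256()
    digest.update(model_name.encode("utf-8"))
    digest.update(b"\x00")
    digest.update(precision.encode("utf-8"))
    for text in texts:
        digest.update(b"\x00")
        digest.update(text.encode("utf-8"))
    return Path(cache_dir) / f"{digest.hexdigest()}.npy"


path = embedding_cache_path("data/cache", "all-MiniLM-L12-v2", ["first doc", "second doc"])
```

Keying on content rather than on a file name means that changing the model, the precision, or the corpus yields a new cache file instead of silently reusing stale embeddings.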

Getting Started

Requirements

  • Python 3.10+
  • A Kaggle account + API token (to download the dataset)
  • GPU with CUDA or Apple Silicon (optional; CPU also works)

Installation

# 1. Clone the repository
git clone https://github.com/thmsgo18/Information-Retrieval-Engine.git
cd Information-Retrieval-Engine

# 2. Create and activate a virtual environment
python3 -m venv venv
source venv/bin/activate         # macOS / Linux
# venv\Scripts\activate.bat      # Windows

# 3. Install dependencies
pip install -r requirements.txt

Data Setup

# Configure Kaggle credentials (one-time)
mkdir -p ~/.kaggle
mv ~/Downloads/kaggle.json ~/.kaggle/
chmod 600 ~/.kaggle/kaggle.json

# Download competition data
cd data/raw
kaggle competitions download -c retrieval-engine-competition
unzip retrieval-engine-competition.zip && rm retrieval-engine-competition.zip
cd ../..

Expected files in data/raw/:

docs.json           ← 216,041 Stack Exchange documents
queries_train.json  ← Training queries with ground truth
queries_test.json   ← 141 test queries (Kaggle evaluation)
qgts_train.json     ← Relevance judgments

Usage

Phase 1: Hybrid Retrieval

CLI (recommended):

# Default production config
python3 -m src.kaggle.submit_prep

# Quick local test (smaller subset + lighter model)
python3 -m src.kaggle.submit_prep \
  --max-docs 20000 \
  --embedding-model-name all-MiniLM-L6-v2 \
  --embedding-precision int8

Notebook:

jupyter lab
# Open notebooks/06_phase1_submission.ipynb → Run All

Output: outputs/submissions/submission.csv


Phase 2: Classification & Reranking

CLI (recommended):

# Default production config
python3 -m src.kaggle.submit_phase2

# Custom options
python3 -m src.kaggle.submit_phase2 \
  --classifier svc \
  --feature-method tfidf \
  --rerank hard_filter \
  --embedding-model all-MiniLM-L12-v2

Notebook:

jupyter lab
# Open notebooks/09_phase2_submission.ipynb → Run All
# Auto-detects local / Kaggle / Colab environment

Outputs:

  • outputs/submissions/submission.csv
  • outputs/runs/phase2/run_YYYYMMDD_HHMMSS.json

Notebooks Overview

Notebook                           Description
01_data_exploration                Dataset statistics, community distribution, document lengths
02_tfidf_retrieval                 TF-IDF baseline: indexing, retrieval, evaluation
03_bm25_retrieval                  BM25 / BM25+ tuning and ablation
04_embeddings_retrieval            Dense retrieval with SentenceTransformers + UMAP/t-SNE
05_evaluation_suite                Precision@k, Recall@k, MRR@k across all methods
06_phase1_submission               ★ End-to-end Phase 1 Kaggle submission
07_phase2_classification_ablation  11-config ablation across 3 feature types and 4 classifiers
08_phase2_pipeline_eval            Full Phase 2 evaluation & error analysis
09_phase2_submission               ★ Final Phase 2 submission (Kaggle/Colab/local)

Recommended order: 01 → 02 → 03 → 04 → 05 → 06 (Phase 1), then 07 → 08 → 09 (Phase 2).
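
For readers unfamiliar with the metrics compared in notebook 05, the following sketch shows common per-query definitions. It is not the code in src/evaluation/metrics.py. The function names are illustrative, and dividing precision by k (rather than by the number of results returned) is an assumption about the convention used.

```python
def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k results that are relevant (denominator is k)."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for doc_id in ranked[:k] if doc_id in relevant)
    return hits / k


def recall_at_k(ranked, relevant, k):
    """Fraction of all relevant documents that appear in the top-k results."""
    if not relevant:
        return 0.0
    hits = sum(1 for doc_id in ranked[:k] if doc_id in relevant)
    return hits / len(relevant)


def reciprocal_rank_at_k(ranked, relevant, k):
    """1 / rank of the first relevant result in the top k, or 0.0 if none."""
    for rank, doc_id in enumerate(ranked[:k], start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


ranked = ["d3", "d1", "d7", "d2"]
relevant = {"d1", "d2", "d9"}
p3 = precision_at_k(ranked, relevant, 3)
r3 = recall_at_k(ranked, relevant, 3)
rr3 = reciprocal_rank_at_k(ranked, relevant, 3)
```

MRR@k is then the mean of reciprocal_rank_at_k over all queries.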


Configuration

All hyperparameters are centralized in src/config.py as Python dataclasses.

Phase 1 (Phase1HybridSubmissionConfig):

embedding_model_name: str = "all-MiniLM-L12-v2"
embedding_batch_size: int = 64
embedding_device: str | None = "auto"       # auto → cuda / mps / cpu
embedding_precision: str = "float32"        # float32 | int8 | uint8 | binary
hybrid_bm25_method: str = "plus"            # plus | okapi
hybrid_rrf_k: int = 20
hybrid_weight_embeddings: float = 3.5
hybrid_weight_bm25: float = 0.5
top_k: int = 100
max_docs: int | None = None                 # None = full corpus

Phase 2 (Phase2Config):

classifier_method: str = "svc"              # svc | logreg | nb | mlp
feature_method: str = "tfidf"              # tfidf | count | embeddings
rerank_strategy: str = "hard_filter"       # hard_filter | soft_boost
retrieval_k_multiplier: int = 10           # candidate pool = top_k × 10
prf_top_k: int = 3
prf_max_tags: int = 10
hybrid_rrf_k: int = 30
hybrid_weight_embeddings: float = 3.5
hybrid_weight_bm25: float = 0.5
random_seed: int = 42
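
Since the configuration is built on dataclasses, variants for an ablation can be derived from the defaults with dataclasses.replace. The sketch below uses a stand-in class, Phase2ConfigSketch, that mirrors only a few of the fields listed above. It is not the real Phase2Config, and whether the real class is frozen is not known. The helper candidate_pool_size is also hypothetical.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Phase2ConfigSketch:
    """Stand-in mirroring a subset of the Phase 2 fields (illustration only)."""
    classifier_method: str = "svc"
    rerank_strategy: str = "hard_filter"
    retrieval_k_multiplier: int = 10
    hybrid_rrf_k: int = 30


def candidate_pool_size(cfg, top_k=100):
    """Candidate pool = top_k x retrieval_k_multiplier, as described above."""
    return top_k * cfg.retrieval_k_multiplier


default_cfg = Phase2ConfigSketch()
# A variant with a 500-document pool and rrf_k=60; the defaults stay untouched.
variant_cfg = replace(default_cfg, retrieval_k_multiplier=5, hybrid_rrf_k=60)
```

Deriving variants this way keeps every run traceable to an explicit set of overrides, which is convenient when logging runs to JSON.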

Results

Kaggle Leaderboard

Phase    Configuration                                        Score
Phase 2  SVC + hard_filter, RETRIEVAL_K=200, rrf_k=20         0.450
Phase 2  SVC + hard_filter, RETRIEVAL_K=500, rrf_k=60         0.469
Phase 2  SVC + hard_filter + cross-encoder, RETRIEVAL_K=1000  0.400
Phase 2  SVC + hard_filter, RETRIEVAL_K=1000, rrf_k=30        0.49541 ✅

Classifier Ablation (notebook 07)

Features    Classifier          Macro-F1
TF-IDF      LinearSVC           0.9852 ✅
TF-IDF      LogisticRegression  0.9794
Count       LinearSVC           0.9793
Embeddings  MLP                 0.9695

Per-Category Performance (best model)

Category     Precision  Recall  F1
android      ~0.99      ~0.99   ~0.99
gaming       ~0.97      ~0.98   ~0.97
programmers  ~0.99      ~0.98   ~0.98
tex          ~1.00      ~1.00   ~1.00
unix         ~0.99      ~0.98   ~0.99
Macro avg                       0.9852
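
Macro-F1 is the unweighted mean of the per-category F1 scores. The short check below averages the rounded F1 values from the table. Because those values are approximate, the result only comes close to the reported 0.9852 and does not reproduce it exactly.

```python
# Rounded per-category F1 values copied from the table above.
per_class_f1 = {
    "android": 0.99,
    "gaming": 0.97,
    "programmers": 0.98,
    "tex": 1.00,
    "unix": 0.99,
}

# Macro average: every category counts equally, regardless of its size.
macro_f1 = sum(per_class_f1.values()) / len(per_class_f1)
print(f"macro-F1 from rounded values: {macro_f1:.3f}")
```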

Authors

Thomas Gourmelen (@thmsgo18)

Occasional contributions: Clara Ait Mokhtar, Maria Aydin, Vincent Tan


Master IAD, Data Science Project 2025-2026 · Université Paris
