Best Score 0.49541 · Classifier Macro-F1 0.9852 · 216,041 documents · 141 queries
Information Retrieval Engine
This project is a two-phase information retrieval system built for a university Kaggle competition on the Stack Exchange ecosystem. Given 141 natural-language queries, the system ranks 216,041 technical documents from five communities (Android, Gaming, Programmers, TeX, Unix) and returns the top 100 most relevant documents per query.
The system combines:
- Sparse retrieval: BM25+ with a tech-aware tokenizer (handles c++, c#, .net, …)
- Dense retrieval: semantic embeddings via all-MiniLM-L12-v2 with SHA-256 caching
- Fusion: Reciprocal Rank Fusion (RRF) to merge both signals
- Classification: LinearSVC to predict the source Stack Exchange community
- Reranking: hard filter promoting category-matched documents to the top
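The tech-aware tokenizer is what lets the sparse branch treat `c++`, `c#`, and `.net` as single terms instead of dropping their punctuation. A minimal sketch of the idea, assuming a regex-based approach; the term list and the `tech_tokenize` name are illustrative and are not taken from `src/retrieval/bm25.py`:

```python
import re

# Hypothetical list of technical terms whose punctuation must be preserved.
_TECH_TERMS = r"c\+\+|c#|f#|\.net|node\.js"

# Try a technical term first (it must not run into a following word character),
# then fall back to plain alphanumeric/underscore tokens.
_TOKEN_RE = re.compile(rf"(?:{_TECH_TERMS})(?![a-z0-9_])|[a-z0-9_]+")


def tech_tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into tokens, keeping terms like 'c++' whole."""
    return _TOKEN_RE.findall(text.lower())


if __name__ == "__main__":
    print(tech_tokenize("How do I call C++ from C# in .NET?"))
```

A plain `\w+` tokenizer would reduce `c++` and `c#` to the same token `c`, which is why a special case of this kind matters for BM25 on programming questions.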
             ┌─────────────────────────────┐
             │          Raw Query          │
             └──────────────┬──────────────┘
                            │
            ┌───────────────┴────────────────┐
            │                                │
            ▼                                ▼
┌───────────────────────┐      ┌───────────────────────────┐
│     BM25+ Branch      │      │     Embedding Branch      │
│                       │      │                           │
│ Tech-aware tokenizer  │      │ all-MiniLM-L12-v2         │
│ (c++, c#, .net, …)    │      │ 384-dim vectors           │
│ BM25Plus over 216k    │      │ SHA-256 cache (.npy)      │
│ docs                  │      │ Cosine similarity         │
│                       │      │ GPU / MPS / CPU auto      │
└───────────┬───────────┘      └─────────────┬─────────────┘
            │                                │
            │ rank_bm25(d)       rank_emb(d) │
            │                                │
            └───────────────┬────────────────┘
                            │
                            ▼
             ┌─────────────────────────────┐
             │         RRF Fusion          │
             │                             │
             │ score(d) =                  │
             │   w_emb  / (k + rank_emb)   │
             │ + w_bm25 / (k + rank_bm25)  │
             │                             │
             │ w_emb=3.5  w_bm25=0.5  k=20 │
             └──────────────┬──────────────┘
                            │
                            ▼
                    Top-100 Documents
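For reference, BM25+ scores a document by summing, over the query terms it contains, an IDF weight times a saturated term frequency plus a constant lower bound delta. The sketch below is a plain-Python illustration of that formula, not the project's implementation; the function name and the defaults `k1=1.5`, `b=0.75`, `delta=1.0` are assumptions.

```python
import math
from collections import Counter


def bm25_plus_scores(query_tokens, corpus_tokens, k1=1.5, b=0.75, delta=1.0):
    """Return one BM25+ score per document in corpus_tokens (a list of token lists)."""
    n_docs = len(corpus_tokens)
    avgdl = sum(len(doc) for doc in corpus_tokens) / n_docs

    # Document frequency: in how many documents each term appears.
    df = Counter()
    for doc in corpus_tokens:
        df.update(set(doc))

    scores = []
    for doc in corpus_tokens:
        tf = Counter(doc)
        # Length normalisation: longer-than-average documents are penalised.
        norm = k1 * (1 - b + b * len(doc) / avgdl)
        score = 0.0
        for term in query_tokens:
            if tf[term] == 0:
                continue  # the delta bonus only applies to terms the document contains
            idf = math.log((n_docs + 1) / df[term])
            score += idf * ((k1 + 1) * tf[term] / (norm + tf[term]) + delta)
        scores.append(score)
    return scores


if __name__ == "__main__":
    corpus = [
        ["install", "apk", "android"],
        ["latex", "table"],
        ["android", "android", "phone", "rooting", "guide", "long", "document"],
    ]
    print(bm25_plus_scores(["android"], corpus))
```

The delta term is what distinguishes BM25+ from Okapi BM25: a very long document that does contain the query term still receives a minimum contribution instead of being normalised towards zero.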
Key parameters (Phase 1):
| Parameter | Default | Description |
|---|---|---|
| embedding_model_name | all-MiniLM-L12-v2 | SentenceTransformer model |
| hybrid_rrf_k | 20 | RRF constant (rank discount) |
| hybrid_weight_embeddings | 3.5 | Embedding branch weight |
| hybrid_weight_bm25 | 0.5 | BM25 branch weight |
| top_k | 100 | Final number of returned results |
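The fusion step can be sketched in a few lines. This is an illustration of weighted Reciprocal Rank Fusion using the weights and constant quoted in this README, not the code in `src/retrieval/hybrid.py`; the function name, the 1-based ranks, and the rule that a document missing from one branch simply gets no contribution from it are assumptions.

```python
from collections import defaultdict


def weighted_rrf(rankings, weights, k=20, top_k=100):
    """Fuse several ranked lists of document ids (best first) with weighted RRF.

    rankings: dict mapping a branch name to its ranked list of doc ids.
    weights:  dict mapping the same branch names to their fusion weight.
    """
    scores = defaultdict(float)
    for branch, ranked_ids in rankings.items():
        weight = weights[branch]
        for rank, doc_id in enumerate(ranked_ids, start=1):
            scores[doc_id] += weight / (k + rank)
    # Highest fused score first; ties broken by doc id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))[:top_k]


if __name__ == "__main__":
    fused = weighted_rrf(
        rankings={"emb": ["a", "b", "c"], "bm25": ["c", "b", "a"]},
        weights={"emb": 3.5, "bm25": 0.5},
        k=20,
    )
    print(fused)
```

With `w_emb=3.5` against `w_bm25=0.5`, the embedding ranking dominates and BM25 mostly acts as a tie-breaker. A smaller `k` sharpens the advantage of top-ranked documents.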
               ┌─────────────────────────────┐
               │         Test Query          │
               └──────────────┬──────────────┘
                              │
                              ▼
             ┌──────────────────────────────────┐
             │ 1. Feature Extraction            │
             │    TF-IDF (~5 000 terms)         │
             │    fitted on 216k documents      │
             └────────────────┬─────────────────┘
                              │
                              ▼
             ┌──────────────────────────────────┐
             │ 2. LinearSVC Classifier          │
             │    Macro-F1 = 0.9852             │
             │    → predicted_category          │
             │    android | gaming | tex |      │
             │    unix | programmers            │
             └────────────────┬─────────────────┘
                              │
             ┌────────────────┴─────────────────┐
             │                                  │
             ▼                                  ▼
┌──────────────────────────┐    ┌──────────────────────────────┐
│ 3. Query Expansion       │    │ 4. Pseudo-Relevance          │
│                          │    │    Feedback (PRF)            │
│ query += category vocab  │    │                              │
│ e.g. "android" adds:     │    │ BM25 top-3 docs              │
│ "mobile app java kotlin  │    │ → extract tags               │
│ apk development"         │    │ → append (max 10 tags)       │
└────────────┬─────────────┘    └───────────────┬──────────────┘
             │                                  │
             └────────────────┬─────────────────┘
                              │
                              ▼
             ┌──────────────────────────────────┐
             │ 5. Hybrid Retrieval              │
             │    BM25+ + Embeddings + RRF      │
             │    Candidate pool = 1 000 docs   │
             │    (top_k × retrieval_k_mult.)   │
             └────────────────┬─────────────────┘
                              │
                              ▼
             ┌──────────────────────────────────┐
             │ 6. Hard-Filter Reranking         │
             │    category-match docs → front   │
             │    relative order preserved      │
             └────────────────┬─────────────────┘
                              │
                              ▼
                    Top-100 final ranking
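Steps 3 and 6 of the diagram are simple enough to sketch directly. The code below is an illustration under stated assumptions, not the contents of `src/classification/rerank.py`: the function names are hypothetical, document categories are assumed to be available as a `doc_id -> category` mapping, and only the "android" vocabulary is quoted in this README (other categories would need their own entries).

```python
# Only the "android" entry comes from the README; other categories are omitted
# here rather than guessed.
CATEGORY_VOCAB = {"android": "mobile app java kotlin apk development"}


def expand_query(query, predicted_category, vocab=CATEGORY_VOCAB):
    """Step 3: append the vocabulary of the predicted category to the query."""
    extra = vocab.get(predicted_category, "")
    return f"{query} {extra}".strip()


def hard_filter_rerank(candidates, doc_category, predicted_category, top_k=100):
    """Step 6: move category-matched docs to the front, keeping relative order.

    candidates:   doc ids ordered by the hybrid retrieval score (best first).
    doc_category: mapping doc id -> community name.
    """
    matched = [d for d in candidates if doc_category.get(d) == predicted_category]
    others = [d for d in candidates if doc_category.get(d) != predicted_category]
    return (matched + others)[:top_k]


if __name__ == "__main__":
    cats = {"d1": "unix", "d2": "tex", "d3": "unix", "d4": "tex"}
    print(expand_query("root my phone", "android"))
    print(hard_filter_rerank(["d1", "d2", "d3", "d4"], cats, "tex"))
```

Because the rerank is a stable partition, a pool of 1 000 candidates matters: category-matched documents ranked well below position 100 by the retriever can still reach the final top 100.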
Key parameters (Phase 2):
| Parameter | Default | Description |
|---|---|---|
| classifier_method | svc | Options: svc, logreg, nb, mlp |
| feature_method | tfidf | Options: tfidf, count, embeddings |
| rerank_strategy | hard_filter | Options: hard_filter, soft_boost |
| retrieval_k_multiplier | 10 | Candidate pool = top_k × 10 |
| prf_top_k | 3 | Number of PRF feedback documents |
| prf_max_tags | 10 | Max tags injected via PRF |
| hybrid_rrf_k | 30 | RRF constant for Phase 2 |
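To make `prf_top_k` and `prf_max_tags` concrete, here is a hedged sketch of the pseudo-relevance feedback step. It assumes each document carries a list of tags and that tags are picked by frequency across the feedback documents; the README only states that tags are extracted from the BM25 top-3 and that at most 10 are appended, so the selection rule and the function name are assumptions.

```python
from collections import Counter


def expand_with_prf(query, ranked_doc_ids, doc_tags, prf_top_k=3, prf_max_tags=10):
    """Append the most frequent tags of the top-ranked documents to the query.

    ranked_doc_ids: doc ids from a first-pass BM25 run, best first.
    doc_tags:       mapping doc id -> list of tags (assumed data layout).
    """
    counts = Counter()
    for doc_id in ranked_doc_ids[:prf_top_k]:
        counts.update(doc_tags.get(doc_id, []))

    # Skip tags already present in the query, then cap the number of additions.
    query_terms = set(query.lower().split())
    new_tags = [tag for tag, _ in counts.most_common() if tag not in query_terms]
    return " ".join([query, *new_tags[:prf_max_tags]])


if __name__ == "__main__":
    tags = {
        "d1": ["bash", "grep"],
        "d2": ["bash", "sed"],
        "d3": ["awk"],
        "d4": ["zsh"],
    }
    print(expand_with_prf("replace text in file", ["d1", "d2", "d3", "d4"], tags))
```

Keeping `prf_top_k` small limits query drift: if the first-pass top results are off-topic, every extra feedback document adds more misleading tags.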
.
├── src/                          # Core library
│   ├── config.py                 # Dataclass configs & path auto-resolution
│   ├── data/
│   │   ├── load.py               # JSON data loader
│   │   └── preprocess.py         # Content field builder & text cleaner
│   ├── retrieval/
│   │   ├── tfidf.py              # TF-IDF baseline
│   │   ├── bm25.py               # BM25 / BM25+ with tech tokenizer
│   │   ├── embeddings.py         # Dense retrieval + SHA-256 cache
│   │   └── hybrid.py             # RRF fusion
│   ├── classification/
│   │   ├── interfaces.py         # Category definitions & protocols
│   │   ├── features.py           # TF-IDF / Count / Embedding feature builders
│   │   ├── model.py              # Unified classifier (SVC / LogReg / MLP / NB)
│   │   └── rerank.py             # hard_filter & soft_boost strategies
│   ├── evaluation/
│   │   ├── metrics.py            # Precision@k, Recall@k, MRR@k, F1@k
│   │   └── evaluate.py           # Evaluation pipelines
│   └── kaggle/
│       ├── format.py             # Kaggle CSV formatter
│       ├── submit_prep.py        # Phase 1 CLI entry point
│       └── submit_phase2.py      # Phase 2 CLI entry point
│
├── notebooks/                                     # Step-by-step exploration & ablations
│   ├── 01_data_exploration.ipynb                  # EDA, dataset stats
│   ├── 02_tfidf_retrieval.ipynb                   # TF-IDF baseline
│   ├── 03_bm25_retrieval.ipynb                    # BM25 / BM25+ tuning
│   ├── 04_embeddings_retrieval.ipynb              # Dense retrieval + UMAP
│   ├── 05_evaluation_suite.ipynb                  # Unified method comparison
│   ├── 06_phase1_submission.ipynb                 # Standalone Phase 1 pipeline
│   ├── 07_phase2_classification_ablation.ipynb    # 11-config ablation
│   ├── 08_phase2_pipeline_eval.ipynb              # End-to-end Phase 2 evaluation
│   └── 09_phase2_submission.ipynb                 # Standalone Phase 2 pipeline
│
├── data/
│   ├── raw/                      # Original JSON files (docs, queries, qrels)
│   ├── processed/                # Pre-built content field
│   └── cache/                    # Embedding .npy arrays (SHA-256 keyed)
│
├── outputs/
│   ├── runs/                     # Evaluation logs (JSON)
│   └── submissions/              # Kaggle submission CSVs
│
└── requirements.txt
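The `data/cache/` directory holds embedding arrays keyed by a SHA-256 digest, so an unchanged corpus and model never need re-encoding. One plausible way to derive such a key is sketched below; what exactly goes into the hash (here: model name, precision, and every text) and the `embedding_cache_path` name are assumptions, not the behaviour of `src/retrieval/embeddings.py`.

```python
import hashlib
from pathlib import Path


def embedding_cache_path(texts, model_name, precision="float32", cache_dir="data/cache"):
    """Return the .npy path whose name is a SHA-256 digest of the encoding inputs."""
    digest = hashlib.sha256()
    for part in (model_name, precision, *texts):
        data = part.encode("utf-8")
        # Length-prefix each part so ["ab"] and ["a", "b"] hash differently.
        digest.update(len(data).to_bytes(8, "big"))
        digest.update(data)
    return Path(cache_dir) / f"{digest.hexdigest()}.npy"


if __name__ == "__main__":
    path = embedding_cache_path(["first doc", "second doc"], "all-MiniLM-L12-v2")
    print(path)
    # A caller would load the array if path.exists(), otherwise encode and save it.
```

Including the model name and precision in the key means switching from `all-MiniLM-L12-v2` to `all-MiniLM-L6-v2`, or from `float32` to `int8`, cannot silently reuse stale vectors.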
- Python 3.10+
- A Kaggle account + API token (to download the dataset)
- GPU with CUDA or Apple Silicon (optional; CPU also works)
# 1. Clone the repository
git clone https://github.com/thmsgo18/Information-Retrieval-Engine.git
cd Information-Retrieval-Engine
# 2. Create and activate a virtual environment
python3 -m venv venv
source venv/bin/activate # macOS / Linux
# venv\Scripts\activate.bat # Windows
# 3. Install dependencies
pip install -r requirements.txt

# Configure Kaggle credentials (one-time)
mkdir -p ~/.kaggle
mv ~/Downloads/kaggle.json ~/.kaggle/
chmod 600 ~/.kaggle/kaggle.json
# Download competition data
cd data/raw
kaggle competitions download -c retrieval-engine-competition
unzip retrieval-engine-competition.zip && rm retrieval-engine-competition.zip
cd ../..

Expected files in data/raw/:
docs.json: 216,041 Stack Exchange documents
queries_train.json: Training queries with ground truth
queries_test.json: 141 test queries (Kaggle evaluation)
qgts_train.json: Relevance judgments
Phase 1, CLI (recommended):
# Default production config
python3 -m src.kaggle.submit_prep
# Quick local test (smaller subset + lighter model)
python3 -m src.kaggle.submit_prep \
  --max-docs 20000 \
  --embedding-model-name all-MiniLM-L6-v2 \
  --embedding-precision int8

Notebook:
jupyter lab
# Open notebooks/06_phase1_submission.ipynb → Run All

Output: outputs/submissions/submission.csv
Phase 2, CLI (recommended):
# Default production config
python3 -m src.kaggle.submit_phase2
# Custom options
python3 -m src.kaggle.submit_phase2 \
  --classifier svc \
  --feature-method tfidf \
  --rerank hard_filter \
  --embedding-model all-MiniLM-L12-v2

Notebook:
jupyter lab
# Open notebooks/09_phase2_submission.ipynb → Run All
# Auto-detects local / Kaggle / Colab environment

Outputs:
outputs/submissions/submission.csv
outputs/runs/phase2/run_YYYYMMDD_HHMMSS.json
| Notebook | Description |
|---|---|
| 01_data_exploration | Dataset statistics, community distribution, document lengths |
| 02_tfidf_retrieval | TF-IDF baseline: indexing, retrieval, evaluation |
| 03_bm25_retrieval | BM25 / BM25+ tuning and ablation |
| 04_embeddings_retrieval | Dense retrieval with SentenceTransformers + UMAP/t-SNE |
| 05_evaluation_suite | Precision@k, Recall@k, MRR@k across all methods |
| 06_phase1_submission | End-to-end Phase 1 Kaggle submission |
| 07_phase2_classification_ablation | 11-config ablation (3 feature types × 4 classifiers) |
| 08_phase2_pipeline_eval | Full Phase 2 evaluation & error analysis |
| 09_phase2_submission | Final Phase 2 submission (Kaggle/Colab/local) |
Recommended order: 01 → 02 → 03 → 04 → 05 → 06 (Phase 1), then 07 → 08 → 09 (Phase 2).
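Notebook 05 compares the retrieval methods with Precision@k, Recall@k, and MRR@k. For readers unfamiliar with these metrics, the sketch below uses the standard textbook definitions for a single query; the function names are illustrative and `src/evaluation/metrics.py` may differ in details such as how empty relevance sets are handled.

```python
def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k results that are relevant."""
    hits = sum(1 for doc_id in ranked[:k] if doc_id in relevant)
    return hits / k


def recall_at_k(ranked, relevant, k):
    """Fraction of all relevant documents found in the top-k results."""
    if not relevant:
        return 0.0
    hits = sum(1 for doc_id in ranked[:k] if doc_id in relevant)
    return hits / len(relevant)


def reciprocal_rank_at_k(ranked, relevant, k):
    """1 / rank of the first relevant result within the top k, or 0 if none."""
    for rank, doc_id in enumerate(ranked[:k], start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


if __name__ == "__main__":
    ranked = ["a", "b", "c", "d"]
    relevant = {"b", "d", "x"}
    print(precision_at_k(ranked, relevant, 2))
    print(recall_at_k(ranked, relevant, 4))
    print(reciprocal_rank_at_k(ranked, relevant, 4))
```

MRR@k is the mean of `reciprocal_rank_at_k` over all queries; Precision@k and Recall@k are averaged across queries in the same way.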
All hyperparameters are centralized in src/config.py as Python dataclasses.
Phase 1 (Phase1HybridSubmissionConfig):
embedding_model_name: str = "all-MiniLM-L12-v2"
embedding_batch_size: int = 64
embedding_device: str | None = "auto" # auto → cuda / mps / cpu
embedding_precision: str = "float32" # float32 | int8 | uint8 | binary
hybrid_bm25_method: str = "plus" # plus | okapi
hybrid_rrf_k: int = 20
hybrid_weight_embeddings: float = 3.5
hybrid_weight_bm25: float = 0.5
top_k: int = 100
max_docs: int | None = None # None = full corpus

Phase 2 (Phase2Config):
classifier_method: str = "svc" # svc | logreg | nb | mlp
feature_method: str = "tfidf" # tfidf | count | embeddings
rerank_strategy: str = "hard_filter" # hard_filter | soft_boost
retrieval_k_multiplier: int = 10 # candidate pool = top_k × 10
prf_top_k: int = 3
prf_max_tags: int = 10
hybrid_rrf_k: int = 30
hybrid_weight_embeddings: float = 3.5
hybrid_weight_bm25: float = 0.5
random_seed: int = 42

| Phase | Configuration | Score |
|---|---|---|
| Phase 2 | SVC + hard_filter, RETRIEVAL_K=200, rrf_k=20 | 0.450 |
| Phase 2 | SVC + hard_filter, RETRIEVAL_K=500, rrf_k=60 | 0.469 |
| Phase 2 | SVC + hard_filter + cross-encoder, RETRIEVAL_K=1000 | 0.400 |
| Phase 2 | SVC + hard_filter, RETRIEVAL_K=1000, rrf_k=30 | 0.49541 (best) |

| Features | Classifier | Macro-F1 |
|---|---|---|
| TF-IDF | LinearSVC | 0.9852 (best) |
| TF-IDF | LogisticRegression | 0.9794 |
| Count | LinearSVC | 0.9793 |
| Embeddings | MLP | 0.9695 |

| Category | Precision | Recall | F1 |
|---|---|---|---|
| android | ~0.99 | ~0.99 | ~0.99 |
| gaming | ~0.97 | ~0.98 | ~0.97 |
| programmers | ~0.99 | ~0.98 | ~0.98 |
| tex | ~1.00 | ~1.00 | ~1.00 |
| unix | ~0.99 | ~0.98 | ~0.99 |
| Macro avg | | | 0.9852 |
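The macro average in the last row is the unweighted mean of the per-category F1 scores, so each community counts equally regardless of how many documents it has. A small dependency-free sketch of that computation follows; the function name and the toy labels are illustrative and unrelated to the project's data.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        denom = 2 * tp + fp + fn
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


if __name__ == "__main__":
    y_true = ["tex", "tex", "unix", "unix"]
    y_pred = ["tex", "unix", "unix", "unix"]
    print(macro_f1(y_true, y_pred))
```

In this toy example the per-class scores are 2/3 for "tex" and 0.8 for "unix", giving a macro-F1 of about 0.733.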
Thomas Gourmelen (@thmsgo18)
Occasional contributions: Clara Ait Mokhtar, Maria Aydin, Vincent Tan
Master IAD, Data Science Project 2025-2026 · Université Paris