End-to-end BI solution built individually on 16M+ SNCF railway records: full ETL pipeline (Talend), star schema data warehouse (MySQL), and 6 interactive Power BI dashboards delivering insights across €2.19Bn in revenue and 30M passengers.
Academic Project · ENSA Fès · Data Science & AI Engineering · 2025
Built by: Doha Skouf
| Metric | Value |
|---|---|
| Total Records Loaded | 16,000,000+ |
| Reservations | 10M (2021–2025) |
| Routes (Trajets) | 6M |
| Trains | 2,500 |
| Stations (Gares) | 198 |
| Clients | 100,000 |
| Fact Table Rows | 12,150,000 |
| Dimension Tables | 9 |
| ETL Jobs Built | 5 |
| Power BI Dashboard Pages | 6 |
| Total Revenue Analyzed | €2.19 Billion |
| Total Passengers | 30 Million |
| Cancellation Rate | 7.01% |
CSV Source Files (5 entities × 2021–2025)
│
▼
┌──────────────────────┐
│ Talend ETL Jobs │ ← 5 jobs: RESERVATION, TRAJET, TRAIN, GARE, CLIENT
│ (Extract, Clean, │
│ Transform, Load) │
└──────────┬───────────┘
│
▼
┌──────────────────────┐
│ sncf_oltp (MySQL) │ ← Operational staging database
└──────────┬───────────┘
│
▼
┌──────────────────────────────────────────────────────┐
│ sncf_dw — Star Schema DW │
│ │
│ dim_client (SCD Type 2) dim_train dim_gare │
│ dim_date dim_tarif dim_canal dim_classe │
│ dim_statut dim_mode_pmt │
│ ↘ ↙ │
│ fact_ventes (12.15M rows) │
└──────────┬───────────────────────────────────────────┘
│
▼
┌──────────────────────┐
│ Power BI Semantic │ ← DAX measures, relationships
│ Model + Dashboard │ ← 6 analysis pages
└──────────────────────┘
Extracts and deduplicates categorical attributes into dimension tables in a single job: dim_tarif, dim_canal_vente, dim_mode_paiement, dim_statut_reservation, dim_classe_reservation.
Reads 198 stations from OLTP, applies text cleaning (trim), removes redundant columns, and loads into dim_gare with auto-generated surrogate keys.
The most complex dimension job: reads 100K clients, applies transformations (name capitalization, age bracket calculation, postal code formatting), and implements Slowly Changing Dimension Type 2 — separating new inserts, unchanged records, and modified records into three distinct output flows with insert_date, update_date, and pr_ligne_active columns.
The most technically demanding job: reads 10M reservations year by year (2020–2025 via tLoop), joins with the TRAJET table, then performs 9 simultaneous dimension lookups (dim_client, dim_date, dim_train, dim_gare_dep, dim_gare_arr, dim_tarif, dim_canal, dim_classe, dim_statut, dim_mode_paiement) to resolve all foreign keys and load the final 12.15M row fact table.
Overview: €2.19Bn total revenue · Revenue by train type · Monthly CA trends · Payment method breakdown · Cancellation KPIs · Sales channel distribution · Reservation class split
| Page | Key Insight |
|---|---|
| Revenus | €2.19Bn total · TER dominates by volume · Web channel = 55% of bookings |
| Demande | 30M passengers · Eurostar_like trains are underperforming |
| Gares | Gare Lyon leads revenue · Mixte stations = 54.58% of CA |
| Clients | 100K clients · 36–50 age bracket = highest CA segment |
| Tarifs/Canaux | Plein tariff = highest avg price · CB = 72% of transactions |
| KPI Ferroviaires | 36.53% fill rate · Spring generates highest CA · YoY growth peaked 2021 |
- SCD Type 2 on
dim_client: full historization with surrogate keys,pr_ligne_activeflag, insert/update timestamps - tLoop component: iterates over 6 years (2020–2025) for incremental reservation loading without memory overflow
- JVM optimization:
-Xms8192m -Xmx16384mconfigured for large-scale fact table processing - Multi-lookup joins: 9 simultaneous dimension lookups in a single tMap for the fact table job
- DAX measures: MoM growth, YoY growth, fill rate ratio, cancellation rate, geographic aggregations
SNCF-Data-Warehouse/
├── ETL_talend/
│ ├── data_migration/ ← OLTP loading jobs (RESERVATION, TRAJET, TRAIN, GARE, CLIENT)
│ └── jobs_dim/ ← DW dimension + fact jobs (dim_*, fact_ventes)
├
│
├── screenshots/ ← ETL job screenshots, Dashboard screenshots (full .pbix not uploaded — large file)
└── README.md
Note: Raw CSV data and Power BI
.pbixfile are not included due to file size. The ETL Talend jobs and this documentation fully represent the technical implementation.
| Layer | Technology |
|---|---|
| ETL | Talend Open Studio for Big Data |
| Databases | MySQL 8 (OLTP + DW) |
| Data Modeling | Star Schema, SCD Type 2 |
| BI & Visualization | Power BI Desktop, DAX |
| Version Control | Git / GitHub |
Built entirely solo as part of the Data Science & AI Engineering curriculum at ENSA Fès.
Every ETL job — including the hardest multi-lookup fact table — was designed and implemented individually.
Contact: linkedin.com/in/doha-skouf · skoufd@gmail.com





