Skip to content

DohaSK/SNCF-Data-Warehouse

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

49 Commits
 
 
 
 
 
 

Repository files navigation

🚂 SNCF Data Warehouse — Business Intelligence Pipeline

End-to-end BI solution built individually on 16M+ SNCF railway records: full ETL pipeline (Talend), star schema data warehouse (MySQL), and 6 interactive Power BI dashboards delivering insights across €2.19Bn in revenue and 30M passengers.

Academic Project · ENSA Fès · Data Science & AI Engineering · 2025
Built by: Doha Skouf


📊 Key Numbers

Metric Value
Total Records Loaded 16,000,000+
Reservations 10M (2021–2025)
Routes (Trajets) 6M
Trains 2,500
Stations (Gares) 198
Clients 100,000
Fact Table Rows 12,150,000
Dimension Tables 9
ETL Jobs Built 5
Power BI Dashboard Pages 6
Total Revenue Analyzed €2.19 Billion
Total Passengers 30 Million
Cancellation Rate 7.01%

STAR SCHEMA

🏗️ Architecture

CSV Source Files (5 entities × 2021–2025)
        │
        ▼
┌──────────────────────┐
│   Talend ETL Jobs    │  ← 5 jobs: RESERVATION, TRAJET, TRAIN, GARE, CLIENT
│   (Extract, Clean,   │
│    Transform, Load)  │
└──────────┬───────────┘
           │
           ▼
┌──────────────────────┐
│  sncf_oltp (MySQL)   │  ← Operational staging database
└──────────┬───────────┘
           │
           ▼
┌──────────────────────────────────────────────────────┐
│              sncf_dw — Star Schema DW                │
│                                                      │
│  dim_client (SCD Type 2)   dim_train   dim_gare      │
│  dim_date   dim_tarif      dim_canal   dim_classe    │
│  dim_statut dim_mode_pmt                             │
│                    ↘   ↙                             │
│              fact_ventes (12.15M rows)               │
└──────────┬───────────────────────────────────────────┘
           │
           ▼
┌──────────────────────┐
│  Power BI Semantic   │  ← DAX measures, relationships
│  Model + Dashboard   │  ← 6 analysis pages
└──────────────────────┘

🔧 ETL Pipeline — Talend Jobs

Dim Junk Job (Junk Dimensions)

Extracts and deduplicates categorical attributes into dimension tables in a single job: dim_tarif, dim_canal_vente, dim_mode_paiement, dim_statut_reservation, dim_classe_reservation.

Talend Dim Junk Job


Dim Gare Job (Station Dimension)

Reads 198 stations from OLTP, applies text cleaning (trim), removes redundant columns, and loads into dim_gare with auto-generated surrogate keys.

Talend Dim Gare Job


Dim Client Job — SCD Type 2

The most complex dimension job: reads 100K clients, applies transformations (name capitalization, age bracket calculation, postal code formatting), and implements Slowly Changing Dimension Type 2 — separating new inserts, unchanged records, and modified records into three distinct output flows with insert_date, update_date, and pr_ligne_active columns.

Talend Dim Client SCD2 Job


Fact Ventes Job (Central Fact Table)

The most technically demanding job: reads 10M reservations year by year (2020–2025 via tLoop), joins with the TRAJET table, then performs 9 simultaneous dimension lookups (dim_client, dim_date, dim_train, dim_gare_dep, dim_gare_arr, dim_tarif, dim_canal, dim_classe, dim_statut, dim_mode_paiement) to resolve all foreign keys and load the final 12.15M row fact table.

Talend Fact Ventes Job


📈 Power BI Dashboards

Dashboard 1 — Revenue Analysis (Full View)

Overview: €2.19Bn total revenue · Revenue by train type · Monthly CA trends · Payment method breakdown · Cancellation KPIs · Sales channel distribution · Reservation class split

Revenue Dashboard Full


Dashboard Summary

Page Key Insight
Revenus €2.19Bn total · TER dominates by volume · Web channel = 55% of bookings
Demande 30M passengers · Eurostar_like trains are underperforming
Gares Gare Lyon leads revenue · Mixte stations = 54.58% of CA
Clients 100K clients · 36–50 age bracket = highest CA segment
Tarifs/Canaux Plein tariff = highest avg price · CB = 72% of transactions
KPI Ferroviaires 36.53% fill rate · Spring generates highest CA · YoY growth peaked 2021

🧠 Advanced Technical Highlights

  • SCD Type 2 on dim_client: full historization with surrogate keys, pr_ligne_active flag, insert/update timestamps
  • tLoop component: iterates over 6 years (2020–2025) for incremental reservation loading without memory overflow
  • JVM optimization: -Xms8192m -Xmx16384m configured for large-scale fact table processing
  • Multi-lookup joins: 9 simultaneous dimension lookups in a single tMap for the fact table job
  • DAX measures: MoM growth, YoY growth, fill rate ratio, cancellation rate, geographic aggregations

📁 Repository Structure

SNCF-Data-Warehouse/
├── ETL_talend/
│   ├── data_migration/        ← OLTP loading jobs (RESERVATION, TRAJET, TRAIN, GARE, CLIENT)
│   └── jobs_dim/              ← DW dimension + fact jobs (dim_*, fact_ventes)
├
│            
├── screenshots/               ← ETL job screenshots,  Dashboard screenshots (full .pbix not uploaded — large file)
└── README.md

Note: Raw CSV data and Power BI .pbix file are not included due to file size. The ETL Talend jobs and this documentation fully represent the technical implementation.


🛠️ Tech Stack

Layer Technology
ETL Talend Open Studio for Big Data
Databases MySQL 8 (OLTP + DW)
Data Modeling Star Schema, SCD Type 2
BI & Visualization Power BI Desktop, DAX
Version Control Git / GitHub

👩‍💻 About

Built entirely solo as part of the Data Science & AI Engineering curriculum at ENSA Fès.
Every ETL job — including the hardest multi-lookup fact table — was designed and implemented individually.

Contact: linkedin.com/in/doha-skouf · skoufd@gmail.com

About

ETL pipeline and star schema data warehouse processing 16M+ SNCF railway records. Includes 6 Power BI dashboards analyzing €2.19B revenue.

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors