Predicting weekly department-level sales across 45 Walmart stores using machine learning — built as an end-to-end data science project with an interactive Streamlit app.
Dataset: 421,570 weekly records · 45 stores · 81 departments · Feb 2010 – Oct 2012 · $6.7B total revenue
- Data Exploration — interactive overview of the three source datasets with missing value analysis
- Data Processing — cleaning pipeline: imputation, date parsing, and dataset merging
- Analysis & Visualization — interactive Plotly charts: correlation matrix, sales distribution, store rankings, time trends, and holiday impact
- Modeling — Linear Regression vs Random Forest with R², RMSE, MAE metrics, Actual vs Predicted chart, and feature importance
- Live Predictions — input any store/department/context and get an instant sales forecast
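
The Data Processing step listed above (imputation, date parsing, dataset merging) could look roughly like the sketch below. It is not the code in `content/preparation.py`. The column names, the day-first date format, the imputation rules, and the tiny in-memory frames are all assumptions made for illustration.

```python
import pandas as pd

# Tiny made-up stand-ins for the three source CSVs (all values are illustrative).
sales = pd.DataFrame({
    "Store": [1, 1, 2],
    "Dept": [1, 2, 1],
    "Date": ["05/02/2010", "05/02/2010", "12/02/2010"],
    "Weekly_Sales": [1000.0, 2000.0, 1500.0],
})
features = pd.DataFrame({
    "Store": [1, 2],
    "Date": ["05/02/2010", "12/02/2010"],
    "Temperature": [42.0, None],
    "MarkDown1": [None, 1200.0],
})
stores = pd.DataFrame({"Store": [1, 2], "Type": ["A", "B"], "Size": [150000, 200000]})


def clean_and_merge(sales, features, stores):
    """Parse dates, impute missing values, and join the three tables."""
    sales = sales.copy()
    features = features.copy()

    # Date parsing: a day-first string format is assumed here.
    sales["Date"] = pd.to_datetime(sales["Date"], format="%d/%m/%Y")
    features["Date"] = pd.to_datetime(features["Date"], format="%d/%m/%Y")

    # Imputation (assumed rules): a missing markdown means no promotion,
    # and a missing temperature is replaced by the column median.
    features["MarkDown1"] = features["MarkDown1"].fillna(0.0)
    features["Temperature"] = features["Temperature"].fillna(
        features["Temperature"].median()
    )

    # Merging: weekly context joins on (Store, Date), store metadata on Store.
    merged = sales.merge(features, on=["Store", "Date"], how="left")
    return merged.merge(stores, on="Store", how="left")


merged = clean_and_merge(sales, features, stores)
print(merged.dtypes)
```

Left joins keep every sales row even when a store or week has no matching context, so row counts stay equal to the sales table.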
| Layer | Libraries |
|---|---|
| Data | Pandas, NumPy |
| ML | Scikit-Learn (LinearRegression, RandomForestRegressor) · Joblib (model persistence) |
| Visualization | Plotly |
| App | Streamlit |
| Deployment | Streamlit Cloud |
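
As a rough illustration of the Joblib persistence listed in the table, the sketch below trains a small forest on synthetic data, saves it, reloads it, and predicts for one input row. The file name, the two feature columns, and the mapping of the reported settings to `n_estimators=20` and `max_depth=10` are assumptions. The real app may store and load its models differently.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic training data; the two columns stand in for Store and Dept (assumed).
rng = np.random.default_rng(0)
X = rng.integers(1, 46, size=(200, 2)).astype(float)
y = X[:, 0] * 100.0 + X[:, 1] * 10.0 + rng.normal(0, 1, 200)

model = RandomForestRegressor(n_estimators=20, max_depth=10, random_state=0)
model.fit(X, y)

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical file name; compression trades load time for a smaller file.
    path = os.path.join(tmp, "random_forest.joblib")
    joblib.dump(model, path, compress=3)
    restored = joblib.load(path)

# A single "live prediction" row: store 12, department 7 (made-up input).
sample = np.array([[12.0, 7.0]])
print(restored.predict(sample))
```

Loading a pre-trained model at startup avoids retraining on every app launch, which matters on a memory-limited host.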
git clone https://github.com/cnoret/retail-data-analysis.git
cd retail-data-analysis
pip install -r requirements.txt
streamlit run app.py
App available at http://localhost:8501
retail-data-analysis/
├── app.py # Entry point
├── content/
│ ├── intro.py
│ ├── exploration.py
│ ├── preparation.py
│ ├── visualisation.py
│ ├── modelisation.py
│ └── resources.py
├── data/ # CSV datasets
├── models/ # Pre-trained models (joblib)
├── images/ # UI assets
└── requirements.txt
| Model | R² | RMSE |
|---|---|---|
| Linear Regression | ~0.06 | ~$22,000 |
| Random Forest (n=20, depth=10) | ~0.84 | ~$9,000 |
Random Forest far outperforms Linear Regression (R² of ~0.84 vs ~0.06) because Store and Dept are categorical identifiers: tree-based splits can separate individual stores and departments, while a linear model treats the raw IDs as continuous values and fits a single slope to each. The RF parameters (n=20, depth=10) are kept small to fit within Streamlit Cloud memory constraints.
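
The effect described above can be reproduced on synthetic data. In this sketch every (store, department) pair gets an arbitrary sales level unrelated to the numeric value of its ID, and both models are scored with R², RMSE, and MAE. The data, sizes, and noise level are invented for the demonstration and are not the project's dataset or results.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n_stores, n_depts, n_rows = 10, 8, 4000

# Each (store, dept) cell has its own base level, independent of the ID values.
base = rng.uniform(1_000, 50_000, size=(n_stores, n_depts))
store = rng.integers(0, n_stores, n_rows)
dept = rng.integers(0, n_depts, n_rows)
y = base[store, dept] + rng.normal(0, 500, n_rows)

# Raw integer IDs are used as features, with no one-hot encoding.
X = np.column_stack([store, dept]).astype(float)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)


def evaluate(model):
    """Fit on the training split and return R², RMSE, and MAE on the test split."""
    pred = model.fit(X_train, y_train).predict(X_test)
    return {
        "r2": r2_score(y_test, pred),
        "rmse": float(np.sqrt(mean_squared_error(y_test, pred))),
        "mae": float(mean_absolute_error(y_test, pred)),
    }


scores = {
    "linear": evaluate(LinearRegression()),
    "forest": evaluate(
        RandomForestRegressor(n_estimators=20, max_depth=10, random_state=0)
    ),
}
for name, metrics in scores.items():
    print(name, metrics)
```

One-hot encoding Store and Dept would give the linear model a fair chance at the same signal, at the cost of many more columns.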
MIT. See the LICENSE file.