This repository contains an end-to-end analysis of online chess games using methods from data science and machine learning. Each module in the project is self-contained and can be run independently. This README provides instructions for running each section and locating the generated outputs.
For a full overview of the project’s goals, methodology, and findings, see the full report: Online Chess Games: A Multi-Faceted Data Science Exploration (PDF)
Before running any scripts, install the required dependencies:
pip install -r requirements.txtScript:
statistics/code/chess_statistical_analysis.py
How to Run:
python statistics/code/chess_statistical_analysis.pyOutputs:
- Plots:
statistics/plots/ - Summary tables:
statistics/data/
Script:
prediction/main.py
How to Run:
python prediction/main.pyOutputs:
- Evaluation plots, metrics:
prediction/results/ - Trained models:
prediction/models/ - CatBoost logs:
prediction/catboost_info/
Scripts:
clustering/cluster_games.py(main)clustering/data_cleaner.py(preprocessing)
How to Run:
python clustering/cluster_games.pyOutputs:
- Cluster visualizations and evaluation metrics:
clustering/plots/ - Cluster profiles and summaries:
clustering/data/
Scripts:
community/network_analysis_basic.pycommunity/network_analysis_advanced.pycommunity/data_cleaner.py
How to Run:
python community/network_analysis_basic.py
python community/network_analysis_advanced.pyOutputs:
All results, plots, and centrality metrics are saved to:
community/communities_analysis/
Scripts:
time_sequence/stage5_time_control_analysis.pytime_sequence/stage5_extra_plots.pytime_sequence/stage5_move_sequence_similarity.py
How to Run:
python time_sequence/stage5_time_control_analysis.py
python time_sequence/stage5_extra_plots.py
python time_sequence/stage5_move_sequence_similarity.pyOutputs:
All visualizations and result files are saved in:
time_sequence/stage5_analysis/
Click to expand full file tree
/
├── clustering
│ ├── cluster_games.py
│ ├── data
│ │ ├── chess_games.csv
│ │ ├── cleaned_chess_games_clusters.csv
│ │ ├── hierarchical_cluster_annotated_summary.csv
│ │ ├── hierarchical_cluster_profiles.csv
│ │ ├── kmeans_cluster_annotated_summary.csv
│ │ └── kmeans_cluster_profiles.csv
│ ├── data_cleaner.py
│ └── plots
│ ├── combined_k_selection.png
│ ├── dendrogram.png
│ ├── dendrogram_truncated.png
│ ├── hierarchical_clusters_annotated.png
│ ├── hierarchical_tsne_refined.png
│ ├── kmeans_clusters_annotated.png
│ ├── kmeans_tsne_refined.png
│ ├── pca_preclustering.png
│ ├── similarity_matrix_ordered.png
│ ├── tsne_Hierarchical_boundaries.png
│ └── tsne_KMeans_boundaries.png
├── community
│ ├── communities_analysis
│ │ ├── centrality_vs_rating.csv
│ │ ├── degree_centrality_vs_rating.png
│ │ ├── longest_win_chain.txt
│ │ ├── opening_to_opening_transitions.png
│ │ ├── pagerank_vs_rating.png
│ │ ├── player_centrality_ranking.csv
│ │ ├── player_opening_bipartite.png
│ │ ├── temporal_network_2019.png
│ │ ├── temporal_network_2020.png
│ │ ├── temporal_network_2021.png
│ │ ├── temporal_network_2022.png
│ │ ├── top_player_network_communities.png
│ │ └── top_rivalries.png
│ ├── data
│ │ ├── chess_games.csv
│ │ └── cleaned_chess_games_prediction.csv
│ ├── data_cleaner.py
│ ├── network_analysis_advanced.py
│ ├── network_analysis_basic.py
│ └── Stage3_Explained.pdf
├── file_tree.py
├── prediction
│ ├── catboost_info/
│ ├── data/
│ ├── data_cleaner.py
│ ├── main.py
│ ├── models/
│ └── results/
├── requirements.txt
├── statistics
│ ├── code/
│ ├── data/
│ └── plots/
└── time_sequence
├── data/
├── stage5_analysis/
├── stage5_extra_plots.py
├── stage5_move_sequence_similarity.py
└── stage5_time_control_analysis.py
- Make sure the input CSVs are present in each module’s
data/folder before running scripts. - Figures are saved as
.pngfor inclusion in slides or reports. - Processed CSVs can be used for further analysis or visual inspection.