Runtime vs Structure Preservation in Graph Analysis
I built this project after learning more about Professor Hang Liu’s HPDA Lab at Rutgers and its work in graph analytics and high-performance data systems. I wanted to create something concrete that connects to the lab’s research interests while also building on my current background in Python, data analysis, and practical problem-solving.
"When we sample a graph to make analysis faster, how much useful graph structure do we lose?"
- generates three kinds of graphs
- samples them with three simple sampling methods
- runs PageRank and connected components on the full graph and the sampled graph
- compares runtime and basic structure-preservation metrics
- saves CSV tables and plots in
results/
- Erdős-Rényi random graph
- Barabási-Albert scale-free graph
- Watts-Strogatz small-world graph
- Random node sampling
- Random edge sampling
- Random walk sampling
- PageRank
- Connected components
- original and sampled node/edge counts
- PageRank runtime on the full graph and sampled graph
- connected components runtime on the full graph and sampled graph
- top-10 PageRank overlap
- edge retention percentage
- density change
- connected component count difference
The graph sizes in this project are 1,000, 3,000, and 5,000 nodes. I kept them small enough to run on a normal laptop. The project is CPU-based and uses NetworkX, not a high-performance graph system.
These are the main patterns I saw in my run:
- Lower sampling rates usually saved more time, but they also lost more structure.
- Random walk sampling was the best balance overall in this run. On average it saved about 59% of PageRank runtime and about 56% of connected components runtime, while keeping about 44% top-10 PageRank overlap.
- Random node sampling also saved a lot of time, but it usually had lower PageRank overlap than random walk.
- Random edge sampling kept more edges by design, but it did not always give the best runtime savings.
- Barabási-Albert graphs had higher PageRank top-10 overlap than the other graph types in this run, which makes sense because hub nodes matter a lot in scale-free graphs.
I would not treat these numbers as universal. They are just one small benchmark run on synthetic graphs.
- This is small-scale and CPU-based.
- It uses NetworkX, not an optimized graph-processing system.
- The graphs are synthetic and simpler than many real-world datasets.
- The goal is to practice graph analytics evaluation, not to reproduce high-performance systems research.
- Some sampled graphs are not always faster in every case, especially with NetworkX overhead, which is part of why the benchmark is useful.
pip install -r requirements.txt
python run_benchmark.pyIf python does not work on your machine, use python3 run_benchmark.py.
After running the script, the project saves:
results/runtime_results.csvresults/summary_results.csvresults/plots/runtime_comparison.pngresults/plots/pagerank_overlap.pngresults/plots/structure_tradeoff.png
- I learned how graph structure affects algorithm results.
- I learned that making a graph smaller is not automatically better if important structure is lost.
- I learned how to organize benchmark outputs into tables and plots.
- I learned more about the kind of evaluation work used in graph analytics research.