vanditb/graph-sampling-benchmark

Graph Sampling Benchmark

Runtime vs Structure Preservation in Graph Analysis

Why I Built This

I built this project after learning more about Professor Hang Liu’s HPDA Lab at Rutgers and its work in graph analytics and high-performance data systems. I wanted to create something concrete that connects to the lab’s research interests while also building on my current background in Python, data analysis, and practical problem-solving.

Project Question

"When we sample a graph to make analysis faster, how much useful graph structure do we lose?"

What This Project Does

  • generates three kinds of graphs
  • samples them with three simple sampling methods
  • runs PageRank and connected components on the full graph and the sampled graph
  • compares runtime and basic structure-preservation metrics
  • saves CSV tables and plots in results/

Methods

Graph types

  • Erdős-Rényi random graph
  • Barabási-Albert scale-free graph
  • Watts-Strogatz small-world graph
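The three graph types can be generated with NetworkX's built-in generators. This is a minimal sketch, not the project's actual code: the helper name `make_graphs`, the target average degree of 10, the rewiring probability of 0.1, and the seed are all assumptions chosen for illustration.

```python
import networkx as nx

def make_graphs(n, seed=42):
    """Return one graph of each type with n nodes (parameters are illustrative)."""
    avg_degree = 10  # assumed target average degree, not taken from the project
    return {
        # Edge probability chosen so the expected degree is about avg_degree.
        "erdos_renyi": nx.erdos_renyi_graph(n, avg_degree / (n - 1), seed=seed),
        # Each new node attaches to avg_degree // 2 existing nodes.
        "barabasi_albert": nx.barabasi_albert_graph(n, avg_degree // 2, seed=seed),
        # Ring lattice with avg_degree neighbors per node, 10% of edges rewired.
        "watts_strogatz": nx.watts_strogatz_graph(n, avg_degree, 0.1, seed=seed),
    }

graphs = make_graphs(1000)
for name, g in graphs.items():
    print(name, g.number_of_nodes(), g.number_of_edges())
```

Keeping the average degree similar across the three generators makes the graphs roughly comparable in size, so differences in the results come mostly from structure rather than edge count.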

Sampling methods

  • Random node sampling
  • Random edge sampling
  • Random walk sampling
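One possible way to implement the three sampling methods on a NetworkX graph is sketched below. The function names, the `rate` parameter, the restart rule at dead ends, and the step cap in the random walk are assumptions for illustration; the project's own implementation may differ.

```python
import random
import networkx as nx

def random_node_sample(G, rate, rng):
    """Keep a random fraction of nodes and the edges among them."""
    k = max(1, int(rate * G.number_of_nodes()))
    nodes = rng.sample(list(G.nodes()), k)
    return G.subgraph(nodes).copy()

def random_edge_sample(G, rate, rng):
    """Keep a random fraction of edges together with their endpoints."""
    k = max(1, int(rate * G.number_of_edges()))
    edges = rng.sample(list(G.edges()), k)
    return G.edge_subgraph(edges).copy()

def random_walk_sample(G, rate, rng, max_steps_factor=100):
    """Walk until a target number of distinct nodes is visited,
    then return the subgraph induced by the visited nodes."""
    target = max(1, int(rate * G.number_of_nodes()))
    nodes = list(G.nodes())
    current = rng.choice(nodes)
    visited = {current}
    # The step cap keeps the walk from looping forever in a small component.
    for _ in range(max_steps_factor * target):
        if len(visited) >= target:
            break
        neighbors = list(G.neighbors(current))
        # Jump to a random node if the current node has no neighbors.
        current = rng.choice(neighbors) if neighbors else rng.choice(nodes)
        visited.add(current)
    return G.subgraph(visited).copy()

G = nx.barabasi_albert_graph(1000, 5, seed=1)
rng = random.Random(0)
node_sample = random_node_sample(G, 0.3, rng)
edge_sample = random_edge_sample(G, 0.3, rng)
walk_sample = random_walk_sample(G, 0.3, rng)
```

Node and random walk sampling fix the number of nodes kept, while edge sampling fixes the number of edges kept, so the same sampling rate produces samples of different shapes.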

Algorithms

  • PageRank
  • Connected components
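Timing an algorithm on the full graph and on a sample can be done with a small wrapper like the one below. The helper `time_call` and the choice to keep the best of three repeats are assumptions, not the project's code.

```python
import time
import networkx as nx

def time_call(fn, *args, repeats=3, **kwargs):
    """Run fn several times; return (best wall-clock seconds, last result)."""
    best = float("inf")
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = fn(*args, **kwargs)
        best = min(best, time.perf_counter() - start)
    return best, result

G = nx.watts_strogatz_graph(1000, 10, 0.1, seed=0)
cc_seconds, cc_count = time_call(nx.number_connected_components, G)
print(f"connected components: {cc_count} found in {cc_seconds:.4f}s")

# PageRank can be timed the same way: time_call(nx.pagerank, G)
# In recent NetworkX versions, nx.pagerank relies on SciPy being installed.
```

Taking the best of a few repeats reduces noise from other processes, which matters at these graph sizes because a single run can take only milliseconds.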

Metrics

  • original and sampled node/edge counts
  • PageRank runtime on the full graph and sampled graph
  • connected components runtime on the full graph and sampled graph
  • top-10 PageRank overlap
  • edge retention percentage
  • density change
  • connected component count difference
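The structure-preservation metrics above can be computed with a few lines of plain Python. The definitions below are one reasonable reading of each metric, not necessarily the exact formulas the project uses; the function names and the small example inputs are made up for illustration.

```python
def top_k_overlap(full_scores, sample_scores, k=10):
    """Fraction of the full graph's top-k nodes that also rank
    in the sampled graph's top-k (scores are node -> PageRank dicts)."""
    def top(scores):
        return set(sorted(scores, key=scores.get, reverse=True)[:k])
    full_top = top(full_scores)
    if not full_top:
        return 0.0
    return len(full_top & top(sample_scores)) / len(full_top)

def density(num_nodes, num_edges):
    """Density of a simple undirected graph: 2m / (n(n-1))."""
    if num_nodes < 2:
        return 0.0
    return 2 * num_edges / (num_nodes * (num_nodes - 1))

def structure_metrics(full_n, full_m, full_cc, samp_n, samp_m, samp_cc):
    """Compare node count n, edge count m, and component count cc."""
    return {
        "edge_retention_pct": 100 * samp_m / full_m if full_m else 0.0,
        "density_change": density(samp_n, samp_m) - density(full_n, full_m),
        "component_count_diff": samp_cc - full_cc,
    }

# Toy inputs, not results from the benchmark.
full_scores = {"a": 0.4, "b": 0.3, "c": 0.2, "d": 0.1}
sample_scores = {"a": 0.5, "c": 0.3, "d": 0.2}
overlap = top_k_overlap(full_scores, sample_scores, k=2)
metrics = structure_metrics(1000, 5000, 1, 300, 450, 4)
print(overlap, metrics)
```

With NetworkX, the score dictionaries would come from `nx.pagerank`, and the counts from `G.number_of_nodes()`, `G.number_of_edges()`, and `nx.number_connected_components(G)`.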

The graph sizes in this project are 1,000, 3,000, and 5,000 nodes. I kept them small enough to run on a normal laptop. The project is CPU-based and uses NetworkX, not a high-performance graph system.

Results

These are the main patterns I saw in my run:

  • Lower sampling rates usually saved more time, but they also lost more structure.
  • Random walk sampling was the best balance overall in this run. On average it saved about 59% of PageRank runtime and about 56% of connected components runtime, while keeping about 44% top-10 PageRank overlap.
  • Random node sampling also saved a lot of time, but it usually had lower PageRank overlap than random walk.
  • Random edge sampling kept more edges by design, but it did not always give the best runtime savings.
  • Barabási-Albert graphs had higher PageRank top-10 overlap than the other graph types in this run, which makes sense because hub nodes matter a lot in scale-free graphs.

I would not treat these numbers as universal. They are just one small benchmark run on synthetic graphs.
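The runtime savings quoted above are percentages of the full-graph runtime. A minimal sketch of that arithmetic follows, assuming the saving is defined as the relative reduction in runtime; the function name and the timings in the example are hypothetical, not measurements from this benchmark.

```python
def runtime_saving_pct(full_seconds, sample_seconds):
    """Percent of the full-graph runtime saved by running on the sample.
    Negative values mean the sampled run was slower."""
    if full_seconds <= 0:
        raise ValueError("full_seconds must be positive")
    return 100 * (full_seconds - sample_seconds) / full_seconds

# Hypothetical timings in seconds, for illustration only.
faster = runtime_saving_pct(2.0, 0.5)    # sample ran in a quarter of the time
slower = runtime_saving_pct(1.0, 1.25)   # sample was slower than the full graph
print(faster, slower)
```

Allowing negative values keeps the cases where sampling did not help visible in the averages instead of hiding them.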

Limitations

  • This is small-scale and CPU-based.
  • It uses NetworkX, not an optimized graph-processing system.
  • The graphs are synthetic and simpler than many real-world datasets.
  • The goal is to practice graph analytics evaluation, not to reproduce high-performance systems research.
  • Sampled graphs are not always faster to process, especially given NetworkX overhead, which is part of why the benchmark is useful.

How to Run

pip install -r requirements.txt
python run_benchmark.py

If python does not work on your machine, use python3 run_benchmark.py.

Output Files

After running the script, the project saves:

  • results/runtime_results.csv
  • results/summary_results.csv
  • results/plots/runtime_comparison.png
  • results/plots/pagerank_overlap.png
  • results/plots/structure_tradeoff.png

What I Learned

  • I learned how graph structure affects algorithm results.
  • I learned that making a graph smaller is not automatically better if important structure is lost.
  • I learned how to organize benchmark outputs into tables and plots.
  • I learned more about the kind of evaluation work used in graph analytics research.

About

Project I made for Professor Hang Liu
