Graph Sampling Benchmark

Runtime vs Structure Preservation in Graph Analysis

Why I Built This

I built this project after learning more about Professor Hang Liu’s HPDA Lab at Rutgers and its work in graph analytics and high-performance data systems. I wanted to create something concrete that connects to the lab’s research interests while also building on my current background in Python, data analysis, and practical problem-solving.

Project Question

"When we sample a graph to make analysis faster, how much useful graph structure do we lose?"

What This Project Does

generates three kinds of graphs
samples them with three simple sampling methods
runs PageRank and connected components on the full graph and the sampled graph
compares runtime and basic structure-preservation metrics
saves CSV tables and plots in results/

Methods

Graph types

Erdős-Rényi random graph
Barabási-Albert scale-free graph
Watts-Strogatz small-world graph

Sampling methods

Random node sampling
Random edge sampling
Random walk sampling

Algorithms

PageRank
Connected components

Metrics

original and sampled node/edge counts
PageRank runtime on the full graph and sampled graph
connected components runtime on the full graph and sampled graph
top-10 PageRank overlap
edge retention percentage
density change
connected component count difference

The graph sizes in this project are 1,000, 3,000, and 5,000 nodes. I kept them small enough to run on a normal laptop. The project is CPU-based and uses NetworkX, not a high-performance graph system.

Results

These are the main patterns I saw in my run:

Lower sampling rates usually saved more time, but they also lost more structure.
Random walk sampling was the best balance overall in this run. On average it saved about 59% of PageRank runtime and about 56% of connected components runtime, while keeping about 44% top-10 PageRank overlap.
Random node sampling also saved a lot of time, but it usually had lower PageRank overlap than random walk.
Random edge sampling kept more edges by design, but it did not always give the best runtime savings.
Barabási-Albert graphs had higher PageRank top-10 overlap than the other graph types in this run, which makes sense because hub nodes matter a lot in scale-free graphs.

I would not treat these numbers as universal. They are just one small benchmark run on synthetic graphs.

Limitations

This is small-scale and CPU-based.
It uses NetworkX, not an optimized graph-processing system.
The graphs are synthetic and simpler than many real-world datasets.
The goal is to practice graph analytics evaluation, not to reproduce high-performance systems research.
Some sampled graphs are not always faster in every case, especially with NetworkX overhead, which is part of why the benchmark is useful.

How to Run

pip install -r requirements.txt
python run_benchmark.py

If python does not work on your machine, use python3 run_benchmark.py.

Output Files

After running the script, the project saves:

results/runtime_results.csv
results/summary_results.csv
results/plots/runtime_comparison.png
results/plots/pagerank_overlap.png
results/plots/structure_tradeoff.png

What I Learned

I learned how graph structure affects algorithm results.
I learned that making a graph smaller is not automatically better if important structure is lost.
I learned how to organize benchmark outputs into tables and plots.
I learned more about the kind of evaluation work used in graph analytics research.

Name		Name	Last commit message	Last commit date
Latest commit History 76 Commits
results		results
src		src
README.md		README.md
requirements.txt		requirements.txt
run_benchmark.py		run_benchmark.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Graph Sampling Benchmark

Why I Built This

Project Question

What This Project Does

Methods

Graph types

Sampling methods

Algorithms

Metrics

Results

Limitations

How to Run

Output Files

What I Learned

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Graph Sampling Benchmark

Why I Built This

Project Question

What This Project Does

Methods

Graph types

Sampling methods

Algorithms

Metrics

Results

Limitations

How to Run

Output Files

What I Learned

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages