At Deutsche Bank I kept running into the same problem: Python data pipelines that were slow, sometimes embarrassingly slow, and nobody could tell you why. There was no visibility into where time was actually going, no consistent way to compare before and after a change, and no guidance on what to fix first.
So I built this. It's three focused tools: a CPU profiler, a latency/throughput benchmarker, and a memory tracker. On top of them sits an optimization advisor that reads the output from all three and gives you severity-tagged recommendations. Nothing magic, just the structured analysis layer I wished existed.
FunctionProfiler wraps cProfile and surfaces the results in a structured format: cumulative time, call counts, per-call averages. The top_hotspots(n) method returns the N most expensive functions sorted by cumulative time, and to_report() formats everything into a markdown table.
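To make the idea concrete, here is a minimal sketch of what wrapping cProfile can look like, using only the standard library. It is not the FunctionProfiler source: the helper profile_hotspots and the demo functions load_rows and parse_row are hypothetical names used for illustration.

```python
import cProfile
import pstats


def profile_hotspots(func, *args, n=5, **kwargs):
    """Run func under cProfile; return the n costliest functions by cumulative time."""
    profiler = cProfile.Profile()
    profiler.enable()
    try:
        func(*args, **kwargs)
    finally:
        profiler.disable()

    stats = pstats.Stats(profiler)
    rows = []
    # stats.stats maps (file, line, name) ->
    # (primitive calls, total calls, own time, cumulative time, callers)
    for (_file, _line, name), (prim_calls, calls, _own, cumulative, _callers) in stats.stats.items():
        rows.append({
            "function": name,
            "calls": calls,
            "cumulative_s": cumulative,
            "per_call_s": cumulative / prim_calls if prim_calls else 0.0,
        })
    rows.sort(key=lambda row: row["cumulative_s"], reverse=True)
    return rows[:n]


# Hypothetical workload: one outer function that calls a helper many times.
def parse_row(row):
    return [int(field) for field in row.split(",")]


def load_rows(count):
    parsed = []
    for _ in range(count):
        parsed.append(parse_row("1,2,3,4"))
    return parsed


hotspots = profile_hotspots(load_rows, 2000, n=5)
for row in hotspots:
    print(f"{row['function']}: calls={row['calls']} cumulative={row['cumulative_s']:.4f}s")
```

Sorting by cumulative time puts the outer function first, since its time includes everything it calls; the call counts then show which helper is being hit repeatedly.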
Benchmarker measures real latency distributions (min, max, mean, p50, p95, p99, std) over many iterations with warm-up, and sustained throughput (ops/sec) over a fixed time window. The compare() method gives you speedup_x and percent improvement side-by-side.
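The measurement loop is simple enough to sketch with the standard library. The version below is an illustration under assumptions, not the Benchmarker source: the function names, the dict-based results, the nearest-rank percentile, and the speedup formulas (ratio of means, and one minus the inverse ratio) are my choices for the example.

```python
import math
import statistics
import time


def measure_latency(func, *args, iterations=200, warmup=20):
    """Time func over many calls and summarize the distribution in milliseconds."""
    for _ in range(warmup):  # warm-up calls run but are not recorded
        func(*args)

    samples = []
    for _ in range(iterations):
        start = time.perf_counter()
        func(*args)
        samples.append((time.perf_counter() - start) * 1000.0)
    samples.sort()

    def percentile(p):
        # Nearest-rank percentile over the sorted samples.
        rank = max(1, math.ceil(p / 100 * len(samples)))
        return samples[rank - 1]

    return {
        "min": samples[0], "max": samples[-1],
        "mean": statistics.fmean(samples),
        "p50": percentile(50), "p95": percentile(95), "p99": percentile(99),
        "std": statistics.stdev(samples),
    }


def measure_throughput(func, *args, duration_s=0.1):
    """Count how many calls complete inside a fixed time window (ops/sec)."""
    calls = 0
    start = time.perf_counter()
    while time.perf_counter() - start < duration_s:
        func(*args)
        calls += 1
    return calls / (time.perf_counter() - start)


def compare(baseline, optimized):
    """Assumed formulas: speedup is the ratio of means, percent is the time saved."""
    speedup_x = baseline["mean"] / optimized["mean"]
    percent = (1 - optimized["mean"] / baseline["mean"]) * 100
    return {"speedup_x": speedup_x, "percent_improvement": percent}


stats = measure_latency(sorted, list(range(1000, 0, -1)))
ops = measure_throughput(sorted, [3, 1, 2])
print(f"mean={stats['mean']:.3f}ms p95={stats['p95']:.3f}ms ops/sec={ops:.0f}")
```

Under these formulas, a baseline mean of 2.841 ms against an optimized mean of 0.312 ms works out to 9.11x and 89.0%.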
MemoryTracker uses tracemalloc to measure peak and net allocation per call. The find_leaks() method runs a function N times and flags iterations where allocation grows beyond a threshold, which is useful for catching module-level accumulators and unbounded caches.
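As a rough sketch of the technique (not the MemoryTracker source), the code below uses tracemalloc directly. The function signatures, the 4096-byte threshold, and the leaky/clean demo functions are assumptions for illustration.

```python
import tracemalloc


def measure_memory(func, *args):
    """Return (net_bytes, peak_bytes) traced while func runs.

    Net counts memory still held when the call returns, including its return value.
    """
    tracemalloc.start()
    try:
        before, _ = tracemalloc.get_traced_memory()
        result = func(*args)  # kept alive so it counts toward net
        after, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    del result
    return after - before, peak - before


def find_leaks(func, iterations=10, threshold_bytes=4096):
    """Call func repeatedly; flag (iteration, growth) when retained memory grows."""
    flagged = []
    tracemalloc.start()
    try:
        previous, _ = tracemalloc.get_traced_memory()
        for i in range(iterations):
            func()
            current, _ = tracemalloc.get_traced_memory()
            growth = current - previous
            if growth > threshold_bytes:
                flagged.append((i, growth))
            previous = current
    finally:
        tracemalloc.stop()
    return flagged


# Hypothetical demo: a module-level accumulator that never shrinks.
_CACHE = []


def leaky():
    _CACHE.append(bytearray(10_000))


def clean():
    data = bytearray(10_000)  # freed when the function returns
    return len(data)


net, peak = measure_memory(bytearray, 100_000)
leaky_report = find_leaks(leaky, iterations=5)
clean_report = find_leaks(clean, iterations=5)
print(f"net={net} peak={peak} leaky={len(leaky_report)} clean={len(clean_report)}")
```

A function that frees everything it allocates shows near-zero growth between iterations, so steady growth across calls is the signal worth flagging.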
OptimizationAdvisor takes a ProfileResult, a MemoryResult, and any LeakReports, and returns a list of OptimizationSuggestion objects with severity (HIGH/MEDIUM/LOW), category (CPU_HOTSPOT/MEMORY_LEAK/INEFFICIENT_LOOP/REDUNDANT_CALLS), description, and recommended_fix.
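A rule-based advisor of this shape could be sketched as follows. The four field names come from the description above; everything else is hypothetical, including the Suggestion stand-in class, the advise function, the plain-dict inputs, and the thresholds (50% of total time, 10,000 calls).

```python
from dataclasses import dataclass

SEVERITY_ORDER = {"HIGH": 0, "MEDIUM": 1, "LOW": 2}


@dataclass
class Suggestion:
    """Stand-in for OptimizationSuggestion, with the same four fields."""
    severity: str
    category: str
    description: str
    recommended_fix: str


def advise(hotspots, leaks, total_time_s, hot_share=0.5, call_limit=10_000):
    """Apply simple threshold rules (assumed, not the library's) to profiling data."""
    suggestions = []
    for spot in hotspots:
        share = spot["cumulative_s"] / total_time_s
        if share >= hot_share:
            suggestions.append(Suggestion(
                "HIGH", "CPU_HOTSPOT",
                f"{spot['function']} accounts for {share:.0%} of runtime",
                "Optimize or cache this function first",
            ))
        if spot["calls"] >= call_limit:
            suggestions.append(Suggestion(
                "MEDIUM", "REDUNDANT_CALLS",
                f"{spot['function']} was called {spot['calls']} times",
                "Hoist the call out of the loop or memoize it",
            ))
    for name, growth_bytes in leaks:
        suggestions.append(Suggestion(
            "HIGH", "MEMORY_LEAK",
            f"{name} retained {growth_bytes} bytes across iterations",
            "Bound the cache or clear the accumulator",
        ))
    # Stable sort: most severe first, original order kept within a severity.
    suggestions.sort(key=lambda s: SEVERITY_ORDER[s.severity])
    return suggestions


# Made-up input data, for illustration only.
demo = advise(
    hotspots=[
        {"function": "slow_concat", "calls": 1, "cumulative_s": 0.8},
        {"function": "str_helper", "calls": 50_000, "cumulative_s": 0.1},
    ],
    leaks=[("cache_rows", 2_000_000)],
    total_time_s=1.0,
)
for s in demo:
    print(f"[{s.severity}] {s.category}: {s.description} -> {s.recommended_fix}")
```

Sorting by severity means the first item in the list is the first thing to fix.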
pip install pytest numpy # only hard deps
git clone https://github.com/LaelaZorana/python-perf-optimizer
cd python-perf-optimizer

from perf_optimizer.benchmarker import Benchmarker
from perf_optimizer.memory_tracker import MemoryTracker
from perf_optimizer.optimizer import OptimizationAdvisor
bench = Benchmarker()
result = bench.measure_latency(my_function, arg1, iterations=500)
print(result.summary_line())
# my_function: mean=4.231ms p50=4.198ms p95=5.012ms p99=6.891ms std=0.312ms

# Profile a function by dotted path
python -m perf_optimizer profile examples.string_concat_demo.slow_concat
# Benchmark
python -m perf_optimizer benchmark examples.numpy_vectorization_demo.python_loop_sum_of_squares
# Full report (profile + memory + advisor suggestions)
python -m perf_optimizer report examples.string_concat_demo.fast_concat

Running the string concatenation demo:
Benchmarking string concatenation over 5000 items
=======================================================
[slow] naive += loop:
slow_concat: mean=2.841ms p50=2.803ms p95=3.112ms p99=3.984ms std=0.198ms
[fast] str.join():
fast_concat: mean=0.312ms p50=0.308ms p95=0.341ms p99=0.402ms std=0.021ms
Speedup: 9.11x (89.0% faster)
Baseline mean=2.841ms
Optimized mean=0.312ms
p95 speedup: 9.12x | p99 speedup: 9.91x
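For context, the two functions being compared plausibly look like the sketch below. The names slow_concat and fast_concat match the demo, but the bodies are my assumption of what "naive += loop" and "str.join()" mean, and timeit stands in for the Benchmarker. Timings depend on the machine and interpreter, so no expected numbers are shown.

```python
import timeit


def slow_concat(items):
    """Build the string with repeated +=, which may copy the buffer as it grows."""
    result = ""
    for item in items:
        result += str(item)
    return result


def fast_concat(items):
    """Build the string in one pass with str.join()."""
    return "".join(str(item) for item in items)


items = list(range(5000))

# Both versions must produce the same output before timing means anything.
same_output = slow_concat(items) == fast_concat(items)

slow_s = timeit.timeit(lambda: slow_concat(items), number=50)
fast_s = timeit.timeit(lambda: fast_concat(items), number=50)
print(f"same_output={same_output} slow={slow_s:.4f}s fast={fast_s:.4f}s")
```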
Running the NumPy vectorization demo:
| Implementation | mean_ms | p95_ms | p99_ms |
|---|---|---|---|
| Python loop | 12.8410 | 13.2100 | 14.0031 |
| NumPy vectorized | 0.1122 | 0.1308 | 0.1501 |
python-perf-optimizer/
├── perf_optimizer/
│ ├── __init__.py
│ ├── __main__.py # CLI
│ ├── profiler.py # FunctionProfiler + ProfileResult
│ ├── benchmarker.py # Benchmarker + LatencyResult + ThroughputResult
│ ├── memory_tracker.py # MemoryTracker + MemoryResult + LeakReport
│ └── optimizer.py # OptimizationAdvisor + OptimizationSuggestion
├── examples/
│ ├── string_concat_demo.py
│ ├── numpy_vectorization_demo.py
│ └── memory_leak_demo.py
├── tests/
│ ├── test_benchmarker.py
│ ├── test_memory_tracker.py
│ └── test_optimizer.py
├── requirements.txt
└── setup.py
pytest tests/ -v

MIT, Laela Zorana
Links: GitHub · HuggingFace · Kaggle