Organizations

@wso2-incubator

Venkat2811/README.md

Mechanical Sympathy is All You Need

Hi 👋, I'm Venkat!

Twitter LinkedIn GitHub

Built and scaled systems from 5k to 250k RPS without breaking a sweat.

Got into model serving and inference, and enjoyed solving cold starts, intelligent routing, and GPU cluster utilization. Did a bit of RAG and agents infra. Currently in ML infra: training, inference, comms collectives, storage, compiler backends, custom kernel optimizations, and research into novel techniques.

High-agency individual deep in agentic-engineering mode. AI tools have enabled me to work end to end, from user-facing APIs and infra to tensors to metal. Always looking to maximize my learning curve 📈

ʕ•ᴥ•ʔ venkat.systems


Highlights


Projects

  • 🐨 WombatKV - Object-storage-native KV cache for inference. KV blocks survive restarts and save prefill FLOPs.
  • myelon - HFT-grade LMAX-Disruptor multiprocess IPC over SHM & mmap. 240 ns P99 · 5.58 M ops/s · 92.6 GB/s.
  • 🐘 YALI - Ultra-low-latency GPU comms collective. Outperforms NVIDIA NCCL P2P by 1.2 - 2.4x.
  • 🪢 GPU Kernel Batcher - Batching identical GEMMs into one cuBLAS call - 90%+ fewer launches, 22% faster FP16 workloads
  • ⏲️ Metered Compute - 5 reference architectures for reliably metering sync and async compute.
  • 🔍 Inference Assayer - Compiler-driven, deterministic, fast simulator lab for analyzing inference performance across models and hardware.

Technologies

Home: Codex Claude CLI macOS pi.dev Tailscale tmux AutoResearch
Languages: Rust Go Python Java CUDA English Markdown does it matter anymore?
Inference: vLLM SGLang TensorRT-LLM Transformers
Infra: K8s Helm Argo Docker NVIDIA Dynamo vLLM AIBrix
Accelerators: PyTorch Triton CUTLASS cuBLAS Mojo ThunderKittens
Storage: MySQL PostgreSQL Redis S3 SlateDB
Middleware: Kafka Apache Iggy NATS Redpanda ZeroMQ RabbitMQ
Cloud: AWS GCP Terraform Ansible
Build: Earthly Makefile Bash Bazel

Writings

Hashnode Medium Blogger

Acknowledgements

Inspired by

  • GitHub
  • GitHub
  • GitHub


Pinned

  1. wombatkv (Public)

    Object-storage-native KV cache for LLM inference & RL. Cross-restart, cross-conversation, cross-engine via shared S3 bucket.

    Rust · 10 stars · 1 fork

  2. myelon (Public)

    Ultra-low-latency, high-throughput multiprocess transport over SHM and mmap. LMAX-Disruptor-style cross-process ring substrate.

    Rust · 9 stars · 1 fork

  3. yali (Public)

    Speed-of-Light SW efficiency by using ultra low-latency primitives for comms collectives

    Cuda · 13 stars

  4. ai-dynamo/dynamo (Public)

    A Datacenter Scale Distributed Inference Serving Framework

    Rust · 7.1k stars · 1.2k forks

  5. sgl-project/sglang (Public)

    SGLang is a high-performance serving framework for large language models and multimodal models.

    Python · 28.4k stars · 6.2k forks

  6. vllm-project/aibrix (Public)

    Cost-efficient and pluggable Infrastructure components for GenAI inference

    Go · 4.8k stars · 588 forks