shrvan30

👋 Hey, Shravan Upadhye here

ENTC Undergrad | ML & Systems | Building at the edge of AI and Hardware | Semiconductor AI Enthusiast

🌐 LinkedIn • 📧 Email

🧠 About Me

🎓 Curious, hardworking undergrad in Electronics & Telecommunication Engineering at Pune Institute of Computer Technology (PICT)
💼 Currently a Software Development Intern @ DeepTek.ai, working at the intersection of medical AI, Transformer workflows, and scalable backend systems
⚡ Passionate about GPU computing and AI systems — from writing low-level CUDA kernels to deploying end-to-end ML pipelines
🎯 Driven by a long-term vision of becoming an AI Engineer in the semiconductor space — where hardware meets intelligence
🏍️ Fitness enthusiast, bike rider, and occasional swimmer — I believe a strong body fuels a sharper mind

🔗 Socials

🖥️ Tech Stack

⚙️ Languages

⚡ GPU Computing

🤖 ML & AI

🗄️ Databases

🌐 Frontend

🛠️ Frameworks & Tools

☁️ Cloud & DevOps

🚀 Featured Projects

⚡ FlashAttention-CUDA: High-Performance GPU Attention Kernel

CUDA · C++ · Parallel Computing · GPU Architecture

Engineered a FlashAttention-style GPU kernel with shared-memory tiling, online softmax, and fused attention to minimize HBM traffic
Achieved a 254× speedup over a CPU baseline and a 70.69× speedup over a simple GPU baseline, reaching 303 GFLOP/s on an NVIDIA RTX 3090
Applied kernel fusion, warp-synchronous computation, and SRAM reuse to avoid materializing the N×N intermediate attention matrix in memory
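
The online-softmax idea behind the kernel can be illustrated on the CPU. The snippet below is a minimal pure-Python sketch, not the CUDA kernel itself: it handles a single query vector, and the names `naive_attention`, `tiled_attention`, and `tile_size` are hypothetical and chosen only for this illustration. It processes keys and values one tile at a time while keeping just a running max, a running denominator, and a running unnormalized output, so the full row of attention scores is never stored.

```python
import math


def naive_attention(q, keys, values):
    """Reference: compute every score, apply a full softmax, then mix values."""
    scores = [sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(dim)]


def tiled_attention(q, keys, values, tile_size=2):
    """Online softmax: visit keys/values tile by tile with O(dim) running state."""
    dim = len(values[0])
    running_max = float("-inf")   # largest score seen so far
    running_sum = 0.0             # softmax denominator relative to running_max
    acc = [0.0] * dim             # unnormalized weighted sum of values

    for start in range(0, len(keys), tile_size):
        k_tile = keys[start:start + tile_size]
        v_tile = values[start:start + tile_size]
        scores = [sum(a * b for a, b in zip(q, k)) for k in k_tile]

        new_max = max(running_max, max(scores))
        # Rescale earlier partial results so they are relative to the new max.
        scale = math.exp(running_max - new_max)
        running_sum *= scale
        acc = [a * scale for a in acc]

        for s, v in zip(scores, v_tile):
            w = math.exp(s - new_max)
            running_sum += w
            for d in range(dim):
                acc[d] += w * v[d]
        running_max = new_max

    # Normalize once at the end.
    return [a / running_sum for a in acc]
```

Both functions should agree up to floating-point rounding. In a GPU kernel the same rescaling step is what allows each tile to stay in shared memory without writing the N×N score matrix back to HBM.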

🎥 VidRAG: Video Retrieval-Augmented Generation System

Python · FastAPI · FAISS · BM25 · Whisper · Docker · PostgreSQL

Distributed RAG system converting video into a searchable knowledge base via Whisper transcription, semantic chunking, and hybrid FAISS+BM25 retrieval with CrossEncoder re-ranking
Supports LLM-based Q&A (llama.cpp / Phi-3) with timestamp-level retrieval, Redis caching, and a PostgreSQL metadata store
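
One common way to merge a dense (FAISS) ranking with a sparse (BM25) ranking before re-ranking is reciprocal rank fusion. The project description does not say which fusion method VidRAG uses, so the snippet below is only a sketch under that assumption; the function name, the chunk IDs, and the constant `k=60` are illustrative and not taken from the project.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of chunk IDs into one list.

    Each chunk earns 1 / (k + rank) from every list it appears in,
    so chunks ranked highly by both retrievers rise to the top.
    """
    scores = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    # Sort by descending fused score; break ties by ID for determinism.
    return sorted(scores, key=lambda c: (-scores[c], c))


# Hypothetical top-3 results from each retriever.
dense_hits = ["c2", "c1", "c3"]    # e.g. from a FAISS similarity search
sparse_hits = ["c1", "c4", "c2"]   # e.g. from BM25 keyword scoring
fused = reciprocal_rank_fusion([dense_hits, sparse_hits])
print(fused)  # ['c1', 'c2', 'c4', 'c3']
```

The fused candidates would then be passed to the CrossEncoder for final re-ranking. Rank-based fusion avoids having to calibrate FAISS distances against BM25 scores, which live on different scales.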

🏋️ AI-Powered Fitness Assessment Platform

Python · OpenCV · TensorFlow Lite · MediaPipe · Flutter · Firebase

Real-time workout evaluation system achieving ~95% accuracy in posture detection and rep counting
Optimized TFLite inference, reducing latency by 40% for edge deployment; full-stack app built with Flutter + Firebase
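
Rep counting from pose landmarks is often done by tracking a joint angle and counting full down-and-up cycles. The sketch below assumes that approach; the project description does not confirm it. The functions `joint_angle` and `count_reps`, the thresholds, and the sample angles are all hypothetical, and in practice the 2D points would come from a pose estimator such as MediaPipe.

```python
import math


def joint_angle(a, b, c):
    """Return the angle at point b (in degrees) formed by the segments b-a and b-c."""
    ang = math.degrees(
        math.atan2(c[1] - b[1], c[0] - b[0]) - math.atan2(a[1] - b[1], a[0] - b[0])
    )
    ang = abs(ang) % 360
    return 360 - ang if ang > 180 else ang


def count_reps(angles, down_thresh=90.0, up_thresh=160.0):
    """Count reps with a two-state machine and hysteresis.

    A rep is counted when the joint bends below down_thresh and then
    extends back above up_thresh; angles between the two thresholds
    never change state, which suppresses jitter.
    """
    reps = 0
    state = "up"
    for angle in angles:
        if state == "up" and angle < down_thresh:
            state = "down"
        elif state == "down" and angle > up_thresh:
            state = "up"
            reps += 1
    return reps


# Hypothetical per-frame knee angles for two squats.
frames = [170, 120, 80, 100, 165, 170, 85, 170, 150]
print(count_reps(frames))  # 2
```

Using two thresholds instead of one keeps a noisy angle hovering near a single cutoff from being counted as several reps.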

📜 Certifications

📊 GitHub Stats

"Transforming attention from a memory-bound workload into a compute-efficient kernel — one CUDA thread at a time."
