ENTC Undergrad | ML & Systems | Building at the edge of AI and Hardware | Semiconductor AI Enthusiast
- Undergrad in Electronics & Telecommunication Engineering at Pune Institute of Computer Technology (PICT), curious and hardworking
- Currently interning as a Software Development Intern @ DeepTek.ai, working at the intersection of medical AI, Transformer workflows, and scalable backend systems
- Passionate about GPU computing and AI systems, from writing low-level CUDA kernels to deploying end-to-end ML pipelines
- Driven by a long-term vision of becoming an AI Engineer in the semiconductor space, where hardware meets intelligence
- Fitness enthusiast, bike rider, and occasional swimmer; I believe a strong body fuels a sharper mind
CUDA · C++ · Parallel Computing · GPU Architecture
- Engineered a FlashAttention-style GPU kernel with shared-memory tiling, online softmax, and fused attention to minimize HBM traffic
- Achieved a 254× speedup over a CPU baseline and 70.69× over a simple GPU baseline, reaching 303 GFLOP/s on an NVIDIA RTX 3090
- Applied kernel fusion, warp-synchronous computation, and SRAM reuse, avoiding materialization of the N×N intermediate score matrix
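
The online-softmax idea behind the kernel can be illustrated outside CUDA. The following is a minimal pure-Python sketch for a single query vector, not the kernel itself: the function names, the `tile` parameter, and the toy data are all hypothetical. It shows how keys and values can be consumed tile by tile with a running maximum and running sum, so the full row of scores never has to be stored at once.

```python
import math

def attention_reference(q, keys, values):
    """Standard attention for one query: computes every score before normalizing."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / total for j in range(dim)]

def attention_online(q, keys, values, tile=2):
    """Same result, but keys/values are processed one tile at a time."""
    scale = 1.0 / math.sqrt(len(q))
    running_max = float("-inf")
    running_sum = 0.0
    acc = [0.0] * len(values[0])
    for start in range(0, len(keys), tile):
        k_tile = keys[start:start + tile]
        v_tile = values[start:start + tile]
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in k_tile]
        new_max = max(running_max, max(scores))
        # Rescale everything accumulated under the previous maximum.
        correction = math.exp(running_max - new_max)
        running_sum *= correction
        acc = [a * correction for a in acc]
        for s, v in zip(scores, v_tile):
            w = math.exp(s - new_max)
            running_sum += w
            acc = [a + w * x for a, x in zip(acc, v)]
        running_max = new_max
    return [a / running_sum for a in acc]

# Toy inputs (illustrative only).
q = [1.0, 0.5]
keys = [[0.2, 0.1], [1.5, -0.3], [0.0, 2.0], [-1.0, 0.4], [0.7, 0.7]]
values = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [-1.0, 3.0], [0.5, 0.5]]
out_ref = attention_reference(q, keys, values)
out_online = attention_online(q, keys, values, tile=2)
```

Both functions return the same vector up to floating-point rounding. In a GPU kernel the per-tile work would live in shared memory and registers, which is where the memory-traffic savings come from.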
Python · FastAPI · FAISS · BM25 · Whisper · Docker · PostgreSQL
- Distributed RAG system converting video into a searchable knowledge base via Whisper transcription, semantic chunking, and hybrid FAISS+BM25 retrieval with CrossEncoder re-ranking
- LLM-based Q&A (llama.cpp / Phi-3), timestamp-level retrieval, Redis caching, and PostgreSQL metadata store
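
One common way to combine a dense (FAISS) ranking with a sparse (BM25) ranking before re-ranking is reciprocal rank fusion. The project description does not say which fusion method is used, so treat the sketch below as an assumption: the function name, the constant `k=60`, and the chunk IDs are all hypothetical.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into a single ranking.

    Each document scores 1 / (k + rank) per list it appears in (rank starts at 1).
    Ties are broken alphabetically so the output is deterministic.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda d: (-scores[d], d))

# Hypothetical top-3 results from each retriever.
dense_hits = ["chunk_12", "chunk_07", "chunk_31"]   # e.g. from a vector index
sparse_hits = ["chunk_07", "chunk_44", "chunk_12"]  # e.g. from BM25
fused = reciprocal_rank_fusion([dense_hits, sparse_hits])
print(fused)  # ['chunk_07', 'chunk_12', 'chunk_44', 'chunk_31']
```

Chunks that appear in both lists rise to the top, and the fused candidates could then be passed to a cross-encoder for final re-ranking.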
Python · OpenCV · TensorFlow Lite · MediaPipe · Flutter · Firebase
- Real-time workout evaluation system achieving ~95% accuracy in posture detection and rep counting
- Optimized TFLite inference, reducing latency by 40% for edge deployment; full-stack app built with Flutter + Firebase
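
Rep counting from pose landmarks is often done with a joint angle and two thresholds. The sketch below assumes that approach for a curl-like movement; the function names, the threshold values, and the (x, y) landmark format are hypothetical and not taken from the project.

```python
import math

def joint_angle(a, b, c):
    """Angle in degrees at point b, formed by the segments b->a and b->c."""
    ang = math.degrees(
        math.atan2(c[1] - b[1], c[0] - b[0]) - math.atan2(a[1] - b[1], a[0] - b[0])
    )
    ang = abs(ang)
    return 360.0 - ang if ang > 180.0 else ang

def count_reps(angles, down_thresh=160.0, up_thresh=50.0):
    """Count reps from a per-frame angle sequence using two thresholds.

    A rep is counted when the angle drops below up_thresh after having been
    above down_thresh. The gap between thresholds prevents double counting
    when the angle jitters around a single cutoff.
    """
    reps = 0
    stage = None  # unknown until the joint is first seen fully extended
    for angle in angles:
        if angle > down_thresh:
            stage = "down"
        elif angle < up_thresh and stage == "down":
            stage = "up"
            reps += 1
    return reps

# Hypothetical per-frame elbow angles for two curls.
frames = [170, 120, 40, 45, 100, 165, 90, 30, 60, 170]
print(count_reps(frames))  # 2
```

In practice the angle sequence would come from shoulder, elbow, and wrist landmarks passed through `joint_angle` for each video frame.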
- Machine Learning Specialization (Andrew Ng)
- Complete Data Science, ML, DL, NLP Bootcamp (Krish Naik)
- Data Analysis Bootcamp (Alexander Freberg)
"Transforming attention from a memory-bound workload into a compute-efficient kernel, one CUDA thread at a time."