SHANKHARAJ DATTA 1SHAMAY1

⚡ SHANKHARAJ DATTA (1SHAMAY1)

Systems & Silicon Engineer • C++ & SystemVerilog Developer • Low-Level & AI Architect

🧩 About Me

I design and implement high-performance systems across the stack, from synthesizable RTL SoC interconnects and GPU vector architectures to bare-metal device drivers, data-oriented physics engines built on an Entity-Component-System (ECS), and distributed machine learning models.

Focus Areas

Hardware/Silicon Design – Synthesizable SystemVerilog, AXI4 Network-on-Chip (NoC) crossbars, and SIMT GPU architectures
Systems & Embedded Programming – Bare-metal C driver development, Memory-Mapped I/O (MMIO), ring buffers, and custom hardware/software co-design
Simulation & Engine Tech – Data-Oriented Design (DOD), Entity-Component-System (ECS) architectures, custom XPBD physics engines, and procedural locomotion in C++
Distributed AI & Machine Learning – Swarm intelligence frameworks, temporal/predictive ML pipelines, and agent consensus protocols

🛠 Tech Stack

Languages & HDLs

Hardware, VLSI & Embedded Systems

Frameworks & Engines

Concepts & Paradigms

🚀 Featured Projects

🔌 Low-Level Systems & Silicon (RTL & Drivers)

🌐 UMA SoC Interconnect

Apple M-Series style Unified Memory Architecture (UMA) SoC Interconnect designed in synthesizable SystemVerilog.

System Topology (uma_soc_top) – Integrates a 2x1 AXI4 Crossbar (axi4_crossbar) with a Unified Memory Controller (axi4_uma_controller) supporting 32-bit addresses and 256-bit wide data channels.
Fixed-Priority Arbitration – Implements a custom arbiter that grants the memory channel to the GPU master (M1) ahead of the CPU master (M0) whenever both request it, prioritizing GPU bandwidth.
Transaction ID Routing – Safely multiplexes Read/Write channels and routes responses (bid / rid) to the correct master using AXI transaction IDs (ID_WIDTH = 4), mapping CPU to 4'h1 and GPU to 4'h2.
AXI4 Memory Controller (axi4_uma_controller) – Implements a slave interface that maps 256-bit wide AXI read/write burst transactions (s_axi_awburst = 2'b01 for INCR bursts, s_axi_awsize = 3'b101 for 32 bytes per beat) directly onto a static memory array that models shared HBM/DRAM.
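The arbitration and ID-routing rules above can be modeled in a few lines of C++. This is a behavioral sketch only, not the SystemVerilog source: the names `Grant`, `arbitrate`, `id_for`, and `route_response` are hypothetical, and the only values taken from the description are the GPU-over-CPU priority, ID_WIDTH = 4, and the IDs 4'h1 (CPU) and 4'h2 (GPU).

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical software model of the crossbar's decisions (not the RTL).
enum class Grant : std::uint8_t { None, Cpu, Gpu };

constexpr std::uint8_t kCpuId = 0x1;  // 4'h1, CPU master (M0)
constexpr std::uint8_t kGpuId = 0x2;  // 4'h2, GPU master (M1)

// Fixed priority: the GPU wins whenever it requests;
// the CPU is granted only when the GPU is idle.
Grant arbitrate(bool cpu_req, bool gpu_req) {
    if (gpu_req) return Grant::Gpu;
    if (cpu_req) return Grant::Cpu;
    return Grant::None;
}

// AXI ID attached to a granted master's transaction.
std::uint8_t id_for(Grant g) {
    assert(g != Grant::None);
    return g == Grant::Gpu ? kGpuId : kCpuId;
}

// Route a response (bid / rid) back to the master that issued it.
Grant route_response(std::uint8_t axi_id) {
    assert(axi_id <= 0xF);  // ID_WIDTH = 4
    switch (axi_id) {
        case kCpuId: return Grant::Cpu;
        case kGpuId: return Grant::Gpu;
        default:     return Grant::None;  // unknown ID: no master selected
    }
}
```

One consequence of a fixed-priority scheme like this is that a GPU that requests continuously can hold off the CPU indefinitely, which is the usual trade-off against round-robin arbitration.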

🏎️ GPU Compute Core & Ray-Tracing Accelerator

SIMT GPU Streaming Multiprocessor (SM) integrated with a dedicated Ray-Tracing Compute Unit (RTCU).

Top-Level Wrapper (gpu_top) – Connects the SIMT processing core (gpu_sm_core) to the custom hardware Ray-Tracing pipeline (rtcu_core).
SIMT Core Pipeline (gpu_sm_core) – Execution core processing 32-lane warps (WARP_SIZE = 32) with a Vector Register File (vrf) managing 256 registers per thread. Opcode 7'h7B dispatches ray tasks.
Ray-Tracing Co-processor (rtcu_core) – Synthesizable hardware accelerator executing parallel Ray-Box and Ray-Triangle intersections. Implements FSM traversals (FETCH_BVH, INT_BOX, FETCH_TRI, INT_TRI) per warp lane.
Unified Memory Port – Features a dedicated 256-bit wide read bus (mem_read_data) allowing the rtcu_core to directly fetch BVH nodes and triangles from memory, returning intersection results (hit_valid, hit_distance, and barycentrics) back to the SM.
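For readers unfamiliar with the INT_BOX step, a common way to test a ray against a BVH node's bounding box is the slab method, sketched below in C++. This is an illustration of the general technique, not a description of the rtcu_core datapath: the README does not state which intersection algorithm the hardware uses, and the types `Vec3`, `Ray`, `Aabb` and the functions `make_ray` and `ray_box_hit` are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>

struct Vec3 { float x, y, z; };
struct Aabb { Vec3 lo, hi; };
struct Ray  { Vec3 origin; Vec3 inv_dir; };  // inv_dir = 1 / direction, per axis

// Precompute reciprocals once per ray. Axis-parallel rays (a zero direction
// component) rely on IEEE-754 infinities from the division.
Ray make_ray(Vec3 origin, Vec3 dir) {
    return Ray{origin, Vec3{1.0f / dir.x, 1.0f / dir.y, 1.0f / dir.z}};
}

// Slab test: intersect the ray's parameter interval [0, t_max] with the
// interval it spends inside each pair of axis-aligned planes.
bool ray_box_hit(const Ray& r, const Aabb& b, float t_max, float& t_entry) {
    assert(t_max >= 0.0f);
    const float o[3]   = {r.origin.x,  r.origin.y,  r.origin.z};
    const float inv[3] = {r.inv_dir.x, r.inv_dir.y, r.inv_dir.z};
    const float lo[3]  = {b.lo.x, b.lo.y, b.lo.z};
    const float hi[3]  = {b.hi.x, b.hi.y, b.hi.z};

    float t0 = 0.0f, t1 = t_max;
    for (int axis = 0; axis < 3; ++axis) {
        float t_near = (lo[axis] - o[axis]) * inv[axis];
        float t_far  = (hi[axis] - o[axis]) * inv[axis];
        if (t_near > t_far) std::swap(t_near, t_far);
        t0 = std::max(t0, t_near);
        t1 = std::min(t1, t_far);
        if (t0 > t1) return false;  // intervals no longer overlap: miss
    }
    t_entry = t0;  // parameter at which the ray enters the box
    return true;
}
```

A traversal FSM would typically run this test on each child node fetched in FETCH_BVH and only descend into, or fetch triangles for, the boxes that report a hit.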

💻 Graphics Driver API

Bare-metal C GPU device driver engineered to interface a CPU application with the GPU Compute Core over a Unified Memory Interconnect.

Command Ring Buffer (gpu_cmd_ring_t) – Manages asynchronous GPU commands through a 32-byte-aligned circular queue that holds up to 256 entries and lives in unified memory at 0x40000000.
MMIO Register Map – Maps physical registers starting at GPU_MMIO_BASE = 0x80000000 (Doorbell: +0x00, Status: +0x04, Ring Addr: +0x08, Head: +0x0C, Tail: +0x10) to control hardware FSMs.
API Routines – Implements driver initialization (gpu_init), command buffer dispatch (gpu_push_command), host doorbell signaling (gpu_ring_doorbell), and Ray-Tracing kernel dispatches (gpu_dispatch_raytracing) utilizing CMD_DISPATCH_RT (opcode 0x02).
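The ring buffer and register map above can be sketched as a host-side C++ model. The register offsets, the 256-entry depth, the 32-byte alignment, and CMD_DISPATCH_RT = 0x02 come from the description; everything else is an assumption made for illustration, including the layout of `gpu_cmd_t`, the fields of `gpu_cmd_ring_t`, the function signatures, the value written to the doorbell, and the `MmioWindow` type, which stands in for the real registers at GPU_MMIO_BASE so the sketch can run without hardware.

```cpp
#include <cassert>
#include <cstdint>

constexpr std::uint32_t GPU_MMIO_BASE   = 0x80000000u;  // physical base (not dereferenced here)
constexpr std::uint32_t REG_DOORBELL    = 0x00;
constexpr std::uint32_t REG_STATUS      = 0x04;
constexpr std::uint32_t REG_RING_ADDR   = 0x08;
constexpr std::uint32_t REG_HEAD        = 0x0C;
constexpr std::uint32_t REG_TAIL        = 0x10;
constexpr std::uint32_t RING_ENTRIES    = 256;
constexpr std::uint32_t CMD_DISPATCH_RT = 0x02;

// Hypothetical 32-byte command layout: one opcode word plus seven argument words.
struct alignas(32) gpu_cmd_t {
    std::uint32_t opcode;
    std::uint32_t args[7];
};
static_assert(sizeof(gpu_cmd_t) == 32, "one command per 32-byte slot");

// Hypothetical ring layout. head and tail are free-running counters, so
// (tail - head) is the fill level and all 256 slots are usable.
struct gpu_cmd_ring_t {
    gpu_cmd_t entries[RING_ENTRIES];
    std::uint32_t head;  // advanced by the consumer (GPU)
    std::uint32_t tail;  // advanced by the producer (CPU)
};

// Host-side stand-in for the MMIO window; on target this would be a
// volatile pointer to GPU_MMIO_BASE instead of an array.
struct MmioWindow {
    volatile std::uint32_t regs[5] = {};
    void write(std::uint32_t off, std::uint32_t v) {
        assert(off % 4 == 0 && off / 4 < 5);
        regs[off / 4] = v;
    }
    std::uint32_t read(std::uint32_t off) const {
        assert(off % 4 == 0 && off / 4 < 5);
        return regs[off / 4];
    }
};

// Enqueue one command; returns false when the ring is full.
bool gpu_push_command(gpu_cmd_ring_t& ring, MmioWindow& mmio, const gpu_cmd_t& cmd) {
    if (ring.tail - ring.head == RING_ENTRIES) return false;
    ring.entries[ring.tail % RING_ENTRIES] = cmd;
    ++ring.tail;
    mmio.write(REG_TAIL, ring.tail);  // publish the new tail to the device
    return true;
}

// Notify the GPU that new work is available.
void gpu_ring_doorbell(MmioWindow& mmio) { mmio.write(REG_DOORBELL, 1); }
```

On real hardware the producer would also need a memory barrier between writing the entry and publishing the tail, so the GPU never observes a tail that points past a command it cannot yet see.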

🦾 Physics & Simulation Engines

⚡ Velox

High-performance 2D physics engine written in modern C++ utilizing Data-Oriented Design (DOD) and an XPBD solver.

Data-Oriented ECS – Custom Entity-Component-System architecture optimized for L1/L2 cache line locality.
XPBD Solver – Extended Position-Based Dynamics solver for stable stacking, rigid body constraints, and stiff constraint resolution.
Broadphase Collision – Spatial Hash Grid reducing pair-comparison complexity from O(N^2) to roughly O(N) on average when bodies are similar in size and evenly distributed.
Visualizer – Built-in real-time simulation demo powered by Raylib.
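To show why a spatial hash grid cuts down the number of pair tests, here is a minimal C++ sketch of the idea. It is not Velox code: the `Circle` type, `cell_key`, and `broadphase_pairs` are hypothetical names, and the choice of circles and a 64-bit packed cell key is an assumption made to keep the example short.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>
#include <unordered_map>
#include <utility>
#include <vector>

struct Circle { float x, y, r; };

// Pack signed 2D cell coordinates into one 64-bit hash-map key.
inline std::uint64_t cell_key(std::int32_t cx, std::int32_t cy) {
    return (static_cast<std::uint64_t>(static_cast<std::uint32_t>(cx)) << 32) |
           static_cast<std::uint32_t>(cy);
}

// Returns candidate pairs (i < j) whose bounding boxes share at least one cell.
// Only these pairs need an exact narrow-phase test.
std::vector<std::pair<int, int>> broadphase_pairs(const std::vector<Circle>& bodies,
                                                  float cell_size) {
    assert(cell_size > 0.0f);
    std::unordered_map<std::uint64_t, std::vector<int>> grid;

    // Insert each body into every cell its bounding box overlaps.
    for (int i = 0; i < static_cast<int>(bodies.size()); ++i) {
        const Circle& c = bodies[i];
        const auto x0 = static_cast<std::int32_t>(std::floor((c.x - c.r) / cell_size));
        const auto x1 = static_cast<std::int32_t>(std::floor((c.x + c.r) / cell_size));
        const auto y0 = static_cast<std::int32_t>(std::floor((c.y - c.r) / cell_size));
        const auto y1 = static_cast<std::int32_t>(std::floor((c.y + c.r) / cell_size));
        for (std::int32_t cx = x0; cx <= x1; ++cx)
            for (std::int32_t cy = y0; cy <= y1; ++cy)
                grid[cell_key(cx, cy)].push_back(i);
    }

    // Bodies are inserted in index order, so ids within a cell are ascending.
    std::vector<std::pair<int, int>> pairs;
    for (const auto& cell : grid) {
        const std::vector<int>& ids = cell.second;
        for (std::size_t a = 0; a < ids.size(); ++a)
            for (std::size_t b = a + 1; b < ids.size(); ++b)
                pairs.emplace_back(ids[a], ids[b]);
    }

    // A pair that shares several cells is reported once.
    std::sort(pairs.begin(), pairs.end());
    pairs.erase(std::unique(pairs.begin(), pairs.end()), pairs.end());
    return pairs;
}
```

The near-linear behavior depends on the cell size being close to the typical body size. Cells that are too large put many bodies in the same bucket, and cells that are too small make each body occupy many buckets.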

🦾 Character Locomotion System

Advanced UE5 Locomotion Plugin implementing procedural and physics-based movement systems.

Modular Parkour Pipeline – Clean C++ runtime execution handler for climbing, vaulting, mantling, and wall-running.
Physics Integration – Blends keyframe animations with real-time physical constraints for realistic collisions.

🧠 Distributed AI & Machine Learning

🕸️ SYNAPSE

High-performance decentralized swarm intelligence framework built for multi-agent coordination.

Autonomous Agent Swarms – Implements decentralized communication layers allowing agents to dynamically distribute workloads.
Consensus Protocols – Integrates lightweight state synchronization and self-healing task routing between active nodes.
Event-Driven Pipeline – Optimized async runtime architecture handling high-volume message passing between concurrent agents.

🧬 CORTEX

GPU-accelerated, real-time multi-agent swarm intelligence research platform.

SEMAL Algorithm – Implements a hybrid Social-Evolutionary Multi-Agent Learning pipeline utilizing PyTorch and CUDA for real-time neural policy training on local hardware.
Cultural Policy Distillation – Integrates local elite peer imitation with genetic algorithms (crossover and Gaussian mutation) to accelerate collective convergence and generational evolution.
Batched GPU Inference – Optimizes simulation throughput with broad-phase raycast filtering and a dynamic load-balancing daemon to sustain high GPU utilization.
Cognitive Persistence – Backed by a SQLite memory vault that automatically serializes and resumes high-fitness neural checkpoints across generations.
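The genetic operators named above, crossover and Gaussian mutation, can be illustrated with a short C++ sketch. CORTEX itself is described as a PyTorch and CUDA project, so this is not its code: the `Genome` alias, the function names, the choice of uniform crossover (the README does not say which crossover variant is used), and the `rate` and `sigma` parameters are all assumptions.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

// Assumption: a policy is represented as a flat vector of network weights.
using Genome = std::vector<float>;

// Uniform crossover: each weight is copied from parent a or parent b
// with equal probability.
Genome uniform_crossover(const Genome& a, const Genome& b, std::mt19937& rng) {
    assert(a.size() == b.size());
    std::bernoulli_distribution pick_a(0.5);
    Genome child(a.size());
    for (std::size_t i = 0; i < a.size(); ++i)
        child[i] = pick_a(rng) ? a[i] : b[i];
    return child;
}

// Gaussian mutation: each weight is perturbed with probability `rate`
// by zero-mean noise of standard deviation `sigma`.
void gaussian_mutate(Genome& g, double rate, float sigma, std::mt19937& rng) {
    assert(rate >= 0.0 && rate <= 1.0);
    assert(sigma > 0.0f);
    std::bernoulli_distribution mutate(rate);
    std::normal_distribution<float> noise(0.0f, sigma);
    for (float& w : g)
        if (mutate(rng)) w += noise(rng);
}
```

A generation step under these assumptions would pick two high-fitness parents, call `uniform_crossover` to produce a child, and then apply `gaussian_mutate` with a small `sigma` before evaluating the child in simulation.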

🔋 AI Battery Health

Deep learning predictive diagnostic tool for evaluating State of Health (SoH) and Remaining Useful Life (RUL) of lithium-ion cells.

Temporal Networks – Employs recurrent networks (LSTM/GRU architectures) to model non-linear electrochemical degradation curves.
Thermodynamic Modeling – Integrates real-time cell thermal profiles with current/voltage curves to predict thermal runaway risks.

📊 GitHub Stats & Badges

Top languages: C++ • SystemVerilog • C • Python • GLSL/HLSL

🎯 What I'm Building

Hardware Accelerators – Extending SIMT GPU instruction pipelines to handle wider matrices for AI math operations.
Velox Engine Modules – Further optimizations on Broadphase algorithms and multi-threaded constraint solvers.
Swarm Robotics & AI – Applying SYNAPSE algorithms to real-world edge controllers and simulation environments.
