I design and implement high-performance systems across the entire stack, ranging from synthesizable RTL SoC interconnects and GPU vector architectures to low-level bare-metal device drivers, data-oriented physics engines (ECS), and distributed machine learning models.
Focus Areas
- Hardware/Silicon Design: Synthesizable SystemVerilog, AXI4 Network-on-Chip (NoC) crossbars, and SIMT GPU architectures
- Systems & Embedded Programming: Bare-metal C driver development, Memory-Mapped I/O (MMIO), ring buffers, and custom hardware/software co-design
- Simulation & Engine Tech: Data-Oriented Design (DOD), Entity-Component-Systems (ECS), custom XPBD physics engines, and procedural C++ locomotion
- Distributed AI & Machine Learning: Swarm intelligence frameworks, temporal/predictive ML pipelines, and agent consensus protocols
Apple M-Series style Unified Memory Architecture (UMA) SoC Interconnect designed in synthesizable SystemVerilog.
- System Topology (`uma_soc_top`): Integrates a 2x1 AXI4 crossbar (`axi4_crossbar`) with a Unified Memory Controller (`axi4_uma_controller`) supporting 32-bit addresses and 256-bit wide data channels.
- Fixed-Priority Arbitration: Implements a custom arbiter that grants the GPU master (`M1`) immediate memory channel access over the CPU master (`M0`), prioritizing high-bandwidth GPU traffic.
- Transaction ID Routing: Multiplexes the read/write channels and routes responses (`bid`/`rid`) to the correct master using AXI transaction IDs (`ID_WIDTH = 4`), mapping the CPU to `4'h1` and the GPU to `4'h2`.
- AXI4 Memory Controller (`axi4_uma_controller`): Implements a slave interface that maps 256-bit wide AXI read/write burst transactions (`s_axi_awburst = 2'b01`, i.e. INCR; `s_axi_awsize = 3'b101`, i.e. 32 bytes per beat) directly to a simulated shared HBM/DRAM static memory array.
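The arbitration and ID-routing rules listed above can be sketched as a small C behavioral model. This is only an illustration, not the project's RTL: the function names `arbiter_grant`, `route_response`, and `axsize_bytes` and the `MASTER_*` constants are hypothetical. Only the priority order, the ID values (`4'h1`, `4'h2`), `ID_WIDTH = 4`, and the `AxSIZE` encoding come from the description or the AXI4 protocol.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Master indices follow the M0/M1 naming above; MASTER_NONE is hypothetical. */
enum { MASTER_NONE = -1, MASTER_CPU = 0, MASTER_GPU = 1 };

#define AXI_ID_CPU 0x1u /* 4'h1 */
#define AXI_ID_GPU 0x2u /* 4'h2 */

/* Fixed-priority grant: the GPU (M1) wins whenever it requests. */
static int arbiter_grant(bool cpu_req, bool gpu_req)
{
    if (gpu_req) return MASTER_GPU;
    if (cpu_req) return MASTER_CPU;
    return MASTER_NONE;
}

/* Route a response (bid/rid) back to the master that issued the request. */
static int route_response(uint8_t axi_id)
{
    switch (axi_id & 0xFu) { /* ID_WIDTH = 4 */
    case AXI_ID_CPU: return MASTER_CPU;
    case AXI_ID_GPU: return MASTER_GPU;
    default:         return MASTER_NONE; /* unknown ID: no master selected */
    }
}

/* AxSIZE encodes bytes per beat as 2^AxSIZE; 3'b101 gives 32 bytes (256 bits). */
static unsigned axsize_bytes(unsigned axsize)
{
    assert(axsize <= 7u); /* AxSIZE is a 3-bit field */
    return 1u << axsize;
}
```

With both masters requesting, `arbiter_grant(true, true)` returns `MASTER_GPU`, so CPU requests can wait for as long as the GPU keeps its request asserted. That starvation risk is the usual trade-off of a fixed-priority scheme.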
SIMT GPU Streaming Multiprocessor (SM) integrated with a dedicated Ray-Tracing Compute Unit (RTCU).
- Top-Level Wrapper (`gpu_top`): Connects the SIMT processing core (`gpu_sm_core`) to the custom hardware ray-tracing pipeline (`rtcu_core`).
- SIMT Core Pipeline (`gpu_sm_core`): Execution core processing 32-lane warps (`WARP_SIZE = 32`) with a Vector Register File (`vrf`) managing 256 registers per thread. Opcode `7'h7B` dispatches ray tasks.
- Ray-Tracing Co-processor (`rtcu_core`): Synthesizable hardware accelerator executing parallel Ray-Box and Ray-Triangle intersections. Implements FSM traversals (`FETCH_BVH`, `INT_BOX`, `FETCH_TRI`, `INT_TRI`) per warp lane.
- Unified Memory Port: Features a dedicated 256-bit wide read bus (`mem_read_data`) that lets the `rtcu_core` fetch BVH nodes and triangles directly from memory and return intersection results (`hit_valid`, `hit_distance`, and barycentrics) to the SM.
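For reference, the Ray-Box test that an `INT_BOX`-style stage performs can be modeled in software with the standard slab method. The description does not say which algorithm the RTL uses, so the slab method is an assumption, and `ray_t`, `aabb_t`, and `ray_box_hit` are hypothetical names. The boolean result and `t_entry` play roles loosely analogous to `hit_valid` and `hit_distance`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Ray with a precomputed reciprocal direction (1 / direction per axis). */
typedef struct { float origin[3]; float inv_dir[3]; } ray_t;

/* Axis-aligned bounding box given by its minimum and maximum corners. */
typedef struct { float lo[3]; float hi[3]; } aabb_t;

/*
 * Slab test: intersect the ray with the three pairs of axis-aligned planes
 * and keep the overlap of the resulting parameter intervals.
 * Returns true and writes the entry distance if the ray hits within [0, t_max].
 */
static bool ray_box_hit(const ray_t *ray, const aabb_t *box,
                        float t_max, float *t_entry)
{
    assert(ray != NULL && box != NULL && t_entry != NULL);

    float t_near = 0.0f;
    float t_far = t_max;

    for (int axis = 0; axis < 3; ++axis) {
        float t0 = (box->lo[axis] - ray->origin[axis]) * ray->inv_dir[axis];
        float t1 = (box->hi[axis] - ray->origin[axis]) * ray->inv_dir[axis];
        if (t0 > t1) { float tmp = t0; t0 = t1; t1 = tmp; } /* order the slab */
        if (t0 > t_near) t_near = t0; /* latest entry so far */
        if (t1 < t_far) t_far = t1;   /* earliest exit so far */
        if (t_near > t_far) return false; /* intervals no longer overlap */
    }

    *t_entry = t_near;
    return true;
}
```

A ray from the origin along (1, 1, 1) enters the box spanning (1, 1, 1) to (2, 2, 2) at `t = 1`. Axis-parallel rays rely on IEEE infinities in `inv_dir`; a hardware implementation would need to handle that case explicitly.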
Bare-metal C GPU device driver engineered to interface a CPU application with the GPU Compute Core over a Unified Memory Interconnect.
- Command Ring Buffer (`gpu_cmd_ring_t`): Manages asynchronous GPU commands through a 32-byte aligned circular queue holding up to 256 entries in unified memory at `0x40000000`.
- MMIO Register Map: Maps physical registers starting at `GPU_MMIO_BASE = 0x80000000` (Doorbell: `+0x00`, Status: `+0x04`, Ring Addr: `+0x08`, Head: `+0x0C`, Tail: `+0x10`) to control hardware FSMs.
- API Routines: Implements driver initialization (`gpu_init`), command buffer dispatch (`gpu_push_command`), host doorbell signaling (`gpu_ring_doorbell`), and ray-tracing kernel dispatch (`gpu_dispatch_raytracing`) using `CMD_DISPATCH_RT` (opcode `0x02`).
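The ring buffer and doorbell flow described above can be sketched as follows. The register offsets, base addresses, entry count, and `CMD_DISPATCH_RT` value come from the list. Everything else is an assumption made for illustration: the struct fields, the function signatures, free-running head/tail counters, and the convention that hardware advances Head while the driver advances Tail. Registers are passed in as a pointer so the sketch can run on a host against a plain array.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define GPU_MMIO_BASE    0x80000000u
#define GPU_RING_BASE    0x40000000u
#define GPU_RING_ENTRIES 256u
#define CMD_DISPATCH_RT  0x02u

/* Register word indices derived from the byte offsets in the register map. */
enum {
    REG_DOORBELL  = 0x00 / 4,
    REG_STATUS    = 0x04 / 4,
    REG_RING_ADDR = 0x08 / 4,
    REG_HEAD      = 0x0C / 4,
    REG_TAIL      = 0x10 / 4
};

/* Hypothetical command layout: only the 32-byte alignment is from the text. */
typedef struct {
    _Alignas(32) uint32_t opcode;
    uint32_t args[7];
} gpu_cmd_t;

_Static_assert(sizeof(gpu_cmd_t) == 32, "one command per 32-byte slot");

/* Hypothetical ring layout: tail is the driver-owned producer counter. */
typedef struct {
    gpu_cmd_t entries[GPU_RING_ENTRIES];
    uint32_t  tail;
} gpu_cmd_ring_t;

static void gpu_init(volatile uint32_t *mmio, gpu_cmd_ring_t *ring,
                     uint32_t ring_phys_addr)
{
    assert(mmio != NULL && ring != NULL);
    ring->tail = 0;
    mmio[REG_RING_ADDR] = ring_phys_addr; /* tell hardware where the ring lives */
    mmio[REG_TAIL] = 0;
}

/* Copy one command into the next free slot; returns false if the ring is full. */
static bool gpu_push_command(volatile uint32_t *mmio, gpu_cmd_ring_t *ring,
                             const gpu_cmd_t *cmd)
{
    assert(mmio != NULL && ring != NULL && cmd != NULL);
    uint32_t head = mmio[REG_HEAD]; /* assumed: consumer counter owned by hardware */
    if (ring->tail - head >= GPU_RING_ENTRIES)
        return false;
    ring->entries[ring->tail % GPU_RING_ENTRIES] = *cmd;
    ring->tail++;
    return true;
}

/* Publish the new tail, then ring the doorbell. */
static void gpu_ring_doorbell(volatile uint32_t *mmio, const gpu_cmd_ring_t *ring)
{
    /* Make the command writes visible before the hardware is notified. */
    atomic_thread_fence(memory_order_release);
    mmio[REG_TAIL] = ring->tail;
    mmio[REG_DOORBELL] = 1u;
}
```

On the target, the register pointer would be `(volatile uint32_t *)GPU_MMIO_BASE` and the ring would live at `GPU_RING_BASE`. On a host, a five-word array stands in for the register block. Free-running counters let all 256 slots be used, whereas a wrapped-index design that keeps one slot empty holds 255.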
High-performance 2D physics engine written in modern C++ utilizing Data-Oriented Design (DOD) and an XPBD solver.
- Data-Oriented ECS: Custom Entity-Component-System architecture optimized for L1/L2 cache line locality.
- XPBD Solver: Extended Position-Based Dynamics solver for stable stacking, rigid body constraints, and stiff constraint resolution.
- Broadphase Collision: Spatial Hash Grid reducing comparison complexity from `O(N^2)` to an expected `O(N)` for evenly distributed bodies.
- Visualizer: Built-in real-time simulation demo powered by Raylib.
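A spatial hash broadphase of the kind listed above can be sketched in a few functions. This is a C illustration under stated assumptions, not the engine's code: the names `hash_grid_t`, `grid_build`, and `grid_count_pairs`, the bucket count, and the hash constants are all hypothetical, and bodies are treated as points no larger than one cell, so a 3x3 cell neighborhood is enough.

```c
#include <assert.h>
#include <stdint.h>

#define GRID_BUCKETS 1024u /* hypothetical table size, must be a power of two */

typedef struct { float x, y; } vec2_t;

typedef struct {
    uint32_t start[GRID_BUCKETS + 1]; /* bucket b owns items[start[b] .. start[b+1]) */
    uint32_t cursor[GRID_BUCKETS];    /* scratch write positions used while filling */
    uint32_t *items;                  /* caller-provided storage for n body indices */
} hash_grid_t;

/* Integer cell index of a coordinate, rounding toward negative infinity. */
static int32_t cell_coord(float v, float cell_size)
{
    float q = v / cell_size;
    int32_t i = (int32_t)q;            /* truncates toward zero */
    return (q < (float)i) ? i - 1 : i; /* correct negative values to a floor */
}

static uint32_t cell_hash(int32_t cx, int32_t cy)
{
    uint32_t h = ((uint32_t)cx * 73856093u) ^ ((uint32_t)cy * 19349663u);
    return h & (GRID_BUCKETS - 1u);
}

static uint32_t body_bucket(const vec2_t *p, float cell_size)
{
    return cell_hash(cell_coord(p->x, cell_size), cell_coord(p->y, cell_size));
}

/* Counting sort of body indices into buckets: O(n + GRID_BUCKETS). */
static void grid_build(hash_grid_t *g, const vec2_t *pos, uint32_t n, float cell_size)
{
    assert(cell_size > 0.0f);
    for (uint32_t b = 0; b <= GRID_BUCKETS; ++b) g->start[b] = 0;
    for (uint32_t i = 0; i < n; ++i) g->start[body_bucket(&pos[i], cell_size) + 1]++;
    for (uint32_t b = 0; b < GRID_BUCKETS; ++b) {
        g->cursor[b] = g->start[b];      /* start[b] is already a final prefix sum */
        g->start[b + 1] += g->start[b];
    }
    for (uint32_t i = 0; i < n; ++i)
        g->items[g->cursor[body_bucket(&pos[i], cell_size)]++] = i;
}

/* Count candidate pairs (i < j) whose cells are identical or adjacent. */
static uint32_t grid_count_pairs(const hash_grid_t *g, const vec2_t *pos,
                                 uint32_t n, float cell_size)
{
    uint32_t pairs = 0;
    for (uint32_t i = 0; i < n; ++i) {
        int32_t cx = cell_coord(pos[i].x, cell_size);
        int32_t cy = cell_coord(pos[i].y, cell_size);
        for (int32_t dy = -1; dy <= 1; ++dy) {
            for (int32_t dx = -1; dx <= 1; ++dx) {
                uint32_t h = cell_hash(cx + dx, cy + dy);
                for (uint32_t k = g->start[h]; k < g->start[h + 1]; ++k) {
                    uint32_t j = g->items[k];
                    /* Re-check the real cell: rejects hash collisions and duplicates. */
                    if (j > i &&
                        cell_coord(pos[j].x, cell_size) == cx + dx &&
                        cell_coord(pos[j].y, cell_size) == cy + dy)
                        ++pairs;
                }
            }
        }
    }
    return pairs;
}
```

Each body only inspects nine buckets, so the work is linear in the number of bodies as long as the number of bodies per cell stays bounded. If every body lands in the same cell, the cost degrades back to quadratic.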
Advanced UE5 Locomotion Plugin implementing procedural and physics-based movement systems.
- Modular Parkour Pipeline: Clean C++ runtime execution handler for climbing, vaulting, mantling, and wall-running.
- Physics Integration: Blends keyframe animations with real-time physical constraints for realistic collisions.
High-performance decentralized swarm intelligence framework built for multi-agent coordination.
- Autonomous Agent Swarms: Implements decentralized communication layers allowing agents to dynamically distribute workloads.
- Consensus Protocols: Integrates lightweight state synchronization and self-healing task routing between active nodes.
- Event-Driven Pipeline: Optimized async runtime architecture handling massive message passing between concurrent agents.
GPU-accelerated, real-time multi-agent swarm intelligence research platform.
- SEMAL Algorithm: Implements a hybrid Social-Evolutionary Multi-Agent Learning pipeline utilizing PyTorch and CUDA for real-time neural policy training on local hardware.
- Cultural Policy Distillation: Integrates local elite peer imitation with genetic algorithms (crossover and Gaussian mutation) to accelerate collective convergence and generational evolution.
- Batched GPU Inference: Optimizes simulation throughput with broad-phase raycast filtering and a dynamic load-balancing daemon to sustain high GPU utilization.
- Cognitive Persistence: Backed by a SQLite memory vault that automatically serializes and resumes high-fitness neural checkpoints across generations.
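The genetic operators named above (crossover and Gaussian mutation) can be illustrated on a flat weight vector. This sketch is written in C for consistency with the other examples, whereas the platform itself uses PyTorch. Treating a policy as a flat `float` array, using uniform crossover, and approximating Gaussian noise with a sum of twelve uniform samples are all assumptions, and every name below is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Small deterministic PRNG (xorshift32) so runs are reproducible. */
static uint32_t rng_next(uint32_t *state)
{
    assert(*state != 0u); /* xorshift gets stuck at zero */
    uint32_t x = *state;
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    *state = x;
    return x;
}

/* Uniform float in [0, 1) built from the top 24 bits. */
static float rng_uniform(uint32_t *state)
{
    return (float)(rng_next(state) >> 8) * (1.0f / 16777216.0f);
}

/* Approximate standard normal sample: sum of 12 uniforms minus 6. */
static float rng_gaussian(uint32_t *state)
{
    float sum = 0.0f;
    for (int i = 0; i < 12; ++i) sum += rng_uniform(state);
    return sum - 6.0f;
}

/* Uniform crossover: each weight is copied from either parent with equal odds. */
static void crossover(const float *parent_a, const float *parent_b,
                      float *child, size_t n, uint32_t *rng)
{
    for (size_t i = 0; i < n; ++i)
        child[i] = (rng_uniform(rng) < 0.5f) ? parent_a[i] : parent_b[i];
}

/* Gaussian mutation: perturb each weight with probability `rate`. */
static void mutate(float *weights, size_t n, float rate, float sigma, uint32_t *rng)
{
    for (size_t i = 0; i < n; ++i)
        if (rng_uniform(rng) < rate)
            weights[i] += sigma * rng_gaussian(rng);
}
```

A generation step would typically pick two high-fitness parents, call `crossover`, then `mutate` the child with a small `rate` and `sigma`. Both values are tuning parameters, not figures taken from the project.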
Deep learning predictive diagnostic tool for evaluating State of Health (SoH) and Remaining Useful Life (RUL) of lithium-ion cells.
- Temporal Networks: Employs recurrent networks (LSTM/GRU architectures) to model non-linear electrochemical degradation curves.
- Thermodynamic Modeling: Integrates real-time cell thermal profiles with current/voltage curves to predict thermal runaway risks.
Top languages: C++ • SystemVerilog • C • Python • GLSL/HLSL
- Hardware Accelerators: Extending SIMT GPU instruction pipelines to handle wider matrices for AI math operations.
- Velox Engine Modules: Further optimizations on Broadphase algorithms and multi-threaded constraint solvers.
- Swarm Robotics & AI: Applying Synapse algorithms to real-world edge controllers and simulation environments.


