1SHAMAY1/README.md

⚡ SHANKHARAJ DATTA (1SHAMAY1)

Systems & Silicon Engineer • C++ & SystemVerilog Developer • Low-Level & AI Architect


🧩 About Me

I design and implement high-performance systems across the entire stack, from synthesizable RTL SoC interconnects and GPU vector architectures to low-level bare-metal device drivers, data-oriented physics engines (ECS), and distributed machine learning models.

Focus Areas

  • Hardware/Silicon Design – Synthesizable SystemVerilog, AXI4 Network-on-Chip (NoC) crossbars, and SIMT GPU architectures
  • Systems & Embedded Programming – Bare-metal C driver development, Memory-Mapped I/O (MMIO), ring buffers, and custom hardware/software co-design
  • Simulation & Engine Tech – Data-Oriented Design (DOD), Entity-Component-Systems (ECS), custom XPBD physics engines, and procedural C++ locomotion
  • Distributed AI & Machine Learning – Swarm intelligence frameworks, temporal/predictive ML pipelines, and agent consensus protocols

🛠 Tech Stack

Languages & HDLs

C++ SystemVerilog C Python

Hardware, VLSI & Embedded Systems

AXI4 Protocol

Yosys Synthesis

Riviera Pro

Bare-Metal C

MMIO

Frameworks & Engines

Unreal Engine 5 OpenGL CUDA PyTorch Raylib

Concepts & Paradigms

OOP

Data-Oriented Design (DOD)

Parallel Programming

Deep Learning


🚀 Featured Projects

🔌 Low-Level Systems & Silicon (RTL & Drivers)

🌐 UMA SoC Interconnect

Apple M-Series style Unified Memory Architecture (UMA) SoC Interconnect designed in synthesizable SystemVerilog.
UMA SoC

  • System Topology (uma_soc_top) – Integrates a 2x1 AXI4 Crossbar (axi4_crossbar) with a Unified Memory Controller (axi4_uma_controller) supporting 32-bit addresses and 256-bit wide data channels.
  • Fixed-Priority Arbitration – Implements a custom arbiter granting immediate memory channel access to the GPU master (M1) over the CPU master (M0) to guarantee high-bandwidth execution.
  • Transaction ID Routing – Safely multiplexes Read/Write channels and routes responses (bid / rid) to the correct master using AXI transaction IDs (ID_WIDTH = 4), mapping CPU to 4'h1 and GPU to 4'h2.
  • AXI4 Memory Controller (axi4_uma_controller) – Implements a slave interface mapping 256-bit wide AXI read/write burst transactions (s_axi_awburst = 2'b01, i.e. INCR bursts; s_axi_awsize = 3'b101, i.e. 32 bytes per beat) directly to a simulated shared HBM/DRAM static memory array.
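
The arbitration and ID routing described above can be sketched as a small behavioral model. This is a C++ illustration only, not the SystemVerilog RTL: the function names (arbitrate, id_for, route_response) are hypothetical, and only the priority order, ID_WIDTH = 4, and the ID values 4'h1 / 4'h2 come from the description.

```cpp
#include <cstdint>
#include <optional>

// Behavioral model only; the actual design is SystemVerilog RTL.
enum class Master : std::uint8_t { Cpu = 0, Gpu = 1 };  // M0, M1

// Transaction IDs from the description: CPU -> 4'h1, GPU -> 4'h2.
constexpr std::uint8_t kCpuId = 0x1;
constexpr std::uint8_t kGpuId = 0x2;

// Fixed-priority grant: the GPU (M1) wins whenever it requests.
std::optional<Master> arbitrate(bool cpu_req, bool gpu_req) {
    if (gpu_req) return Master::Gpu;
    if (cpu_req) return Master::Cpu;
    return std::nullopt;  // no request this cycle
}

// Tag a granted request with its master's transaction ID.
constexpr std::uint8_t id_for(Master m) {
    return m == Master::Gpu ? kGpuId : kCpuId;
}

// Route a response (bid / rid) back to the master that owns the ID.
std::optional<Master> route_response(std::uint8_t id) {
    switch (id & 0xF) {  // ID_WIDTH = 4
        case kCpuId: return Master::Cpu;
        case kGpuId: return Master::Gpu;
        default:     return std::nullopt;  // unknown ID, no master selected
    }
}
```

One consequence of fixed priority is that a GPU master that requests every cycle can starve the CPU. Whether the RTL guards against that is not stated here.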

🏎️ GPU Compute Core & Ray-Tracing Accelerator

SIMT GPU Streaming Multiprocessor (SM) integrated with a dedicated Ray-Tracing Compute Unit (RTCU).
GPU Core

  • Top-Level Wrapper (gpu_top) – Connects the SIMT processing core (gpu_sm_core) to the custom hardware Ray-Tracing pipeline (rtcu_core).
  • SIMT Core Pipeline (gpu_sm_core) – Execution core processing 32-lane warps (WARP_SIZE = 32) with a Vector Register File (vrf) managing 256 registers per thread. Opcode 7'h7B dispatches ray tasks.
  • Ray-Tracing Co-processor (rtcu_core) – Synthesizable hardware accelerator executing parallel Ray-Box and Ray-Triangle intersections. Implements FSM traversals (FETCH_BVH, INT_BOX, FETCH_TRI, INT_TRI) per warp lane.
  • Unified Memory Port – Features a dedicated 256-bit wide read bus (mem_read_data) allowing the rtcu_core to directly fetch BVH nodes and triangles from memory, returning intersection results (hit_valid, hit_distance, and barycentrics) back to the SM.
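
As a software reference for the Ray-Box stage (INT_BOX), the sketch below uses the standard slab test in C++. It is an assumption that the hardware uses an equivalent formulation. The Ray and Aabb types, the precomputed inverse direction, and the function name are all hypothetical.

```cpp
#include <algorithm>
#include <array>
#include <limits>
#include <utility>

struct Ray {
    std::array<float, 3> origin;
    std::array<float, 3> inv_dir;  // 1 / direction, precomputed once per ray
};

struct Aabb {
    std::array<float, 3> lo;  // minimum corner
    std::array<float, 3> hi;  // maximum corner
};

// Slab test: clip the ray's parametric interval [t_min, t_max] against
// the pair of planes on each axis. The box is hit if the interval stays
// non-empty. On a hit, t_entry receives the entry distance (0 if the
// origin is inside the box).
// Limitation: an origin lying exactly on a slab plane combined with a zero
// direction component on that axis yields NaN and is not handled here.
bool ray_box_hit(const Ray& r, const Aabb& b, float& t_entry) {
    float t_min = 0.0f;
    float t_max = std::numeric_limits<float>::infinity();
    for (int axis = 0; axis < 3; ++axis) {
        float t0 = (b.lo[axis] - r.origin[axis]) * r.inv_dir[axis];
        float t1 = (b.hi[axis] - r.origin[axis]) * r.inv_dir[axis];
        if (t0 > t1) std::swap(t0, t1);  // ray travels in -axis direction
        t_min = std::max(t_min, t0);
        t_max = std::min(t_max, t1);
        if (t_min > t_max) return false;  // interval empty: miss
    }
    t_entry = t_min;
    return true;
}
```

A model like this can serve as a golden reference when checking hit_valid and hit_distance from an RTL simulation, provided the number formats are matched.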

💻 Graphics Driver API

Bare-metal C GPU device driver engineered to interface a CPU application with the GPU Compute Core over a Unified Memory Interconnect.
Graphics Driver

  • Command Ring Buffer (gpu_cmd_ring_t) – Manages asynchronous GPU commands through a 32-byte-aligned circular queue holding up to 256 entries in unified memory at 0x40000000.
  • MMIO Register Map – Maps physical registers starting at GPU_MMIO_BASE = 0x80000000 (Doorbell: +0x00, Status: +0x04, Ring Addr: +0x08, Head: +0x0C, Tail: +0x10) to control hardware FSMs.
  • API Routines – Implements driver initialization (gpu_init), command buffer dispatch (gpu_push_command), host doorbell signaling (gpu_ring_doorbell), and Ray-Tracing kernel dispatches (gpu_dispatch_raytracing) utilizing CMD_DISPATCH_RT (opcode 0x02).
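
The push-then-doorbell flow can be sketched as follows. The register offsets, base address, opcode, and routine names are taken from the list above. The command layout (gpu_cmd_t), the ring fields, the mmio_write helper, and the function signatures are assumptions made for illustration. The register block is passed in as a pointer so the sketch also runs on a host against a plain array.

```cpp
#include <cstdint>

// Values from the project description.
constexpr std::uint32_t GPU_MMIO_BASE   = 0x80000000u;
constexpr std::uint32_t REG_DOORBELL    = 0x00;
constexpr std::uint32_t REG_STATUS      = 0x04;
constexpr std::uint32_t REG_RING_ADDR   = 0x08;
constexpr std::uint32_t REG_HEAD        = 0x0C;
constexpr std::uint32_t REG_TAIL        = 0x10;
constexpr std::uint32_t CMD_DISPATCH_RT = 0x02;
constexpr std::uint32_t RING_ENTRIES    = 256;

// Hypothetical 32-byte command layout: one opcode word plus seven arguments.
struct alignas(32) gpu_cmd_t {
    std::uint32_t opcode;
    std::uint32_t args[7];
};
static_assert(sizeof(gpu_cmd_t) == 32, "one command per 32-byte slot");

// Hypothetical ring layout.
struct gpu_cmd_ring_t {
    gpu_cmd_t entries[RING_ENTRIES];
    std::uint32_t head;  // consumer index, advanced by the GPU
    std::uint32_t tail;  // producer index, advanced by the driver
};

// On hardware, regs would be reinterpret_cast<volatile std::uint32_t*>(GPU_MMIO_BASE).
inline void mmio_write(volatile std::uint32_t* regs, std::uint32_t offset,
                       std::uint32_t value) {
    regs[offset / 4] = value;
}

// Enqueue one command. Returns false when the ring is full.
// Simplification: one slot is kept empty to tell "full" from "empty",
// so this sketch holds at most RING_ENTRIES - 1 commands.
bool gpu_push_command(gpu_cmd_ring_t& ring, const gpu_cmd_t& cmd) {
    std::uint32_t next = (ring.tail + 1) % RING_ENTRIES;
    if (next == ring.head) return false;
    ring.entries[ring.tail] = cmd;
    ring.tail = next;
    return true;
}

// Publish the new tail, then ring the doorbell. Real hardware would also
// need a memory barrier before the doorbell write. It is omitted here.
void gpu_ring_doorbell(volatile std::uint32_t* regs, const gpu_cmd_ring_t& ring) {
    mmio_write(regs, REG_TAIL, ring.tail);
    mmio_write(regs, REG_DOORBELL, 1);
}
```

Under these assumptions, a ray-tracing dispatch amounts to pushing a command whose opcode is CMD_DISPATCH_RT and then calling gpu_ring_doorbell.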

🦾 Physics & Simulation Engines

⚡ Velox

High-performance 2D physics engine written in modern C++ utilizing Data-Oriented Design (DOD) and an XPBD solver.
Velox

  • Data-Oriented ECS – Custom Entity-Component-System architecture optimized for L1/L2 cache line locality.
  • XPBD Solver – Extended Position-Based Dynamics solver for stable stacking, rigid body constraints, and stiff constraint resolution.
  • Broadphase Collision – Spatial Hash Grid reducing pair-comparison cost from O(N^2) to roughly O(N) on average when bodies are spread evenly across cells.
  • Visualizer – Built-in real-time simulation demo powered by Raylib.
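
To illustrate the spatial hash broadphase, here is a minimal C++ sketch. It is not Velox's actual code: Vec2, cell_key, and broadphase_pairs are hypothetical names, and the sketch assumes every body fits inside one cell, so checking the 3x3 neighborhood of cells is enough.

```cpp
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <utility>
#include <vector>

struct Vec2 { float x, y; };

// Pack signed cell coordinates into one 64-bit hash key.
inline std::uint64_t cell_key(std::int32_t cx, std::int32_t cy) {
    return (static_cast<std::uint64_t>(static_cast<std::uint32_t>(cx)) << 32) |
           static_cast<std::uint32_t>(cy);
}

// Returns candidate pairs (i < j) whose centers lie in the same or adjacent
// cells. Assumes cell_size >= the largest body diameter. The narrowphase
// still has to test each candidate pair exactly.
std::vector<std::pair<std::size_t, std::size_t>>
broadphase_pairs(const std::vector<Vec2>& centers, float cell_size) {
    std::unordered_map<std::uint64_t, std::vector<std::size_t>> grid;
    auto cell_of = [cell_size](float v) {
        return static_cast<std::int32_t>(std::floor(v / cell_size));
    };

    // Pass 1: bucket every body by the cell containing its center.
    for (std::size_t i = 0; i < centers.size(); ++i)
        grid[cell_key(cell_of(centers[i].x), cell_of(centers[i].y))].push_back(i);

    // Pass 2: compare each body only against bodies in its 3x3 neighborhood.
    std::vector<std::pair<std::size_t, std::size_t>> pairs;
    for (std::size_t i = 0; i < centers.size(); ++i) {
        const std::int32_t cx = cell_of(centers[i].x);
        const std::int32_t cy = cell_of(centers[i].y);
        for (std::int32_t dx = -1; dx <= 1; ++dx) {
            for (std::int32_t dy = -1; dy <= 1; ++dy) {
                auto it = grid.find(cell_key(cx + dx, cy + dy));
                if (it == grid.end()) continue;
                for (std::size_t j : it->second)
                    if (i < j) pairs.emplace_back(i, j);  // emit each pair once
            }
        }
    }
    return pairs;
}
```

The near-linear cost holds only while each cell contains a bounded number of bodies. If all bodies pile into one cell, this degrades back to quadratic.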

🦾 Character Locomotion System

Advanced UE5 Locomotion Plugin implementing procedural and physics-based movement systems.
CLS

  • Modular Parkour Pipeline – Clean C++ runtime execution handler for climbing, vaulting, mantling, and wall-running.
  • Physics Integration – Blends keyframe animations with real-time physical constraints for realistic collisions.

🧠 Distributed AI & Machine Learning

🕸️ SYNAPSE

High-performance decentralized swarm intelligence framework built for multi-agent coordination.
SYNAPSE

  • Autonomous Agent Swarms – Implements decentralized communication layers allowing agents to dynamically distribute workloads.
  • Consensus Protocols – Integrates lightweight state synchronization and self-healing task routing between active nodes.
  • Event-Driven Pipeline – Optimized async runtime architecture handling massive message passing between concurrent agents.

🧬 CORTEX

GPU-accelerated, real-time multi-agent swarm intelligence research platform.
CORTEX

  • SEMAL Algorithm – Implements a hybrid Social-Evolutionary Multi-Agent Learning pipeline utilizing PyTorch and CUDA for real-time neural policy training on local hardware.
  • Cultural Policy Distillation – Integrates local elite peer imitation with genetic algorithms (crossover and Gaussian mutation) to accelerate collective convergence and generational evolution.
  • Batched GPU Inference – Optimizes simulation throughput with broad-phase raycast filtering and a dynamic load-balancing daemon to sustain high GPU utilization.
  • Cognitive Persistence – Backed by a SQLite memory vault that automatically serializes and resumes high-fitness neural checkpoints across generations.
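
The genetic operators named above (crossover and Gaussian mutation) can be sketched in plain C++. CORTEX itself is described as using PyTorch and CUDA, so this is only an illustration of the operators, not the project's code. The flattened-weight Genome type, the choice of uniform crossover, and the rate and sigma parameters are all assumptions.

```cpp
#include <cstddef>
#include <random>
#include <vector>

// Assumed representation: a policy network's weights flattened into one vector.
using Genome = std::vector<float>;

// Uniform crossover: each weight is copied from parent a or parent b with
// equal probability. Both parents must have the same length.
Genome crossover(const Genome& a, const Genome& b, std::mt19937& rng) {
    std::bernoulli_distribution pick_a(0.5);
    Genome child(a.size());
    for (std::size_t i = 0; i < a.size(); ++i)
        child[i] = pick_a(rng) ? a[i] : b[i];
    return child;
}

// Gaussian mutation: with probability `rate`, add zero-mean noise with
// standard deviation `sigma` (must be > 0) to a weight.
void mutate(Genome& g, float rate, float sigma, std::mt19937& rng) {
    std::bernoulli_distribution hit(rate);
    std::normal_distribution<float> noise(0.0f, sigma);
    for (float& w : g)
        if (hit(rng)) w += noise(rng);
}
```

In a generational loop, children would typically be produced by calling crossover on two high-fitness parents and then mutate on the result. How CORTEX selects parents and combines this with peer imitation is not specified here.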

🔋 AI Battery Health

Deep learning predictive diagnostic tool for evaluating State of Health (SoH) and Remaining Useful Life (RUL) of lithium-ion cells.
AI Battery

  • Temporal Networks – Employs recurrent networks (LSTM/GRU architectures) to model non-linear electrochemical degradation curves.
  • Thermodynamic Modeling – Integrates real-time cell thermal profiles with current/voltage curves to predict thermal runaway risks.

📊 GitHub Stats & Badges

Followers Stars Views

Top languages: C++ • SystemVerilog • C • Python • GLSL/HLSL


🎯 What I'm Building

  • Hardware Accelerators – Extending SIMT GPU instruction pipelines to handle wider matrices for AI math operations.
  • Velox Engine Modules – Further optimizations on Broadphase algorithms and multi-threaded constraint solvers.
  • Swarm Robotics & AI – Applying Synapse algorithms to real-world edge controllers and simulation environments.

🔗 Connect

LinkedIn


⭐ If you like my work, consider starring a repository!

Pinned

  1. Plugin-CharacterLocomotionSystem (Public)

    Character Locomotion System (CLS) is a high-performance, C++ based movement framework for Unreal Engine 5.

    C++ 2

  2. Plugin-EnhancedCameraSystem (Public)

    Enhanced Camera System (ECS) is a modular, runtime-switchable camera framework for Unreal Engine 5.

    C++

  3. Big-Integer-Operations (Public)

    C++

  4. Customizable_Logger (Public)

    This repository provides a customizable logger written in C++.

    C++

  5. DSAUtility (Public)

    A comprehensive, header-only C++ library for Data Structures and Algorithms with modern C++17 features, template support, and extensive testing.

    C++

  6. TimeEngine (Public)

    Time Engine is a high-performance C++ game engine designed for sophisticated 2D application development and deterministic time manipulation.

    C++