Milo832 is a tile-based rasterizing GPU implemented in VHDL, designed to run on resource-constrained FPGAs. It functions as a graphics coprocessor for the m65832 retro computing project.
The SIMT shader core in this project is a VHDL translation of the SystemVerilog SIMT-GPU-Core by Aritra Manna. Many thanks to Aritra for creating and open-sourcing this excellent educational GPU implementation.
Original Project: https://github.com/aritramanna/SIMT-GPU-Core
Milo832 is a Unified Shader Architecture GPU featuring:
- Programmable SIMT Shader Cores - Vertex and fragment shaders run on the same hardware
- Tile-Based Rendering - Memory-efficient rendering suitable for FPGA BRAM constraints
- Hardware Texture Sampling - Bilinear filtering with multiple texture formats
- Fixed-Function Rasterization - Triangle setup, edge-function evaluation, and fragment generation
| Platform | FPGA | Status |
|---|---|---|
| DE2-115 | Cyclone IV EP4CE115 | In Development |
| Kria KV260 | Zynq UltraScale+ XCK26 | Planned |
┌─────────────────────────────────────────────────────────────────┐
│                           m65832 CPU                            │
│                    (Command Buffer Producer)                    │
└──────────────────────────┬──────────────────────────────────────┘
                           │ System Bus
┌──────────────────────────▼──────────────────────────────────────┐
│                           Milo832 GPU                           │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────────────────┐  │
│  │   Command   │  │  Triangle   │  │     Tile Rasterizer     │  │
│  │  Processor  │──│   Binner    │──│  (Edge Function Eval)   │  │
│  └─────────────┘  └─────────────┘  └───────────┬─────────────┘  │
│                                                │                │
│  ┌─────────────────────────────────────────────▼─────────────┐  │
│  │               Streaming Multiprocessor (SM)               │  │
│  │  ┌─────────┐  ┌─────────┐  ┌─────────┐  ┌─────────────┐   │  │
│  │  │   ALU   │  │   FPU   │  │   SFU   │  │   Texture   │   │  │
│  │  │ 32-lane │  │ 32-lane │  │ 32-lane │  │    Unit     │   │  │
│  │  └─────────┘  └─────────┘  └─────────┘  └─────────────┘   │  │
│  └───────────────────────────────────────────────────────────┘  │
│                                                │                │
│  ┌─────────────────────────────────────────────▼─────────────┐  │
│  │                  ROP (Raster Operations)                  │  │
│  │             Blend, Depth Test, Tile Writeback             │  │
│  └───────────────────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────────────┘
The shader core implements a 5-stage pipeline executing 32 threads in lockstep (one warp):
| Stage | Function |
|---|---|
| IF | Instruction Fetch with round-robin warp scheduling |
| ID | Decode, scoreboard check, hazard detection |
| OC | Operand Collection from banked register file |
| EX | 32-lane parallel execution (ALU/FPU/SFU) |
| WB | Writeback with out-of-order memory completion |
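
The round-robin warp selection in the IF stage can be pictured with a small software model. This is only a sketch, not the RTL: the function name `pick_next_warp` and the `ready` flags (standing in for "no scoreboard stall") are assumptions made for illustration.

```python
from typing import Optional, Sequence

NUM_WARPS = 24  # warps resident per SM, from the feature list


def pick_next_warp(ready: Sequence[bool], last_issued: int) -> Optional[int]:
    """Return the first ready warp after `last_issued`, wrapping around.

    `ready[i]` stands in for "warp i has no scoreboard stall" (an assumption
    about how readiness is reported). Returns None when every warp is stalled.
    """
    n = len(ready)
    for offset in range(1, n + 1):
        candidate = (last_issued + offset) % n
        if ready[candidate]:
            return candidate
    return None


ready = [False] * NUM_WARPS
ready[3] = ready[10] = True
print(pick_next_warp(ready, last_issued=3))   # 10
print(pick_next_warp(ready, last_issued=10))  # 3
```

Starting the search one slot past the last issued warp is what keeps a single ready warp from monopolizing the fetch stage.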
Key Features:
- 24 warps per SM (768 threads resident)
- 16KB shared memory with 32-bank conflict detection
- Hardware divergence stack (SSY/JOIN)
- Barrier synchronization with epoch consistency
- Non-blocking memory with 64-entry MSHR per warp
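
To make the 32-bank conflict detection concrete, here is a rough software model. It assumes 4-byte words interleaved across the banks and that lanes reading the same word are served by one broadcast; neither detail is stated above, so treat both as illustrative assumptions.

```python
from collections import defaultdict
from typing import Iterable

NUM_BANKS = 32   # from the feature list
WORD_BYTES = 4   # assumed word size; not confirmed by the design docs


def bank_of(byte_addr: int) -> int:
    """Bank index under assumed word-interleaved addressing."""
    return (byte_addr // WORD_BYTES) % NUM_BANKS


def conflict_degree(addrs: Iterable[int]) -> int:
    """Largest number of distinct words that any single bank must serve.

    1 means conflict-free; N means the access would serialize into N passes
    (assuming same-word accesses are broadcast).
    """
    words_per_bank = defaultdict(set)
    for addr in addrs:
        words_per_bank[bank_of(addr)].add(addr // WORD_BYTES)
    return max((len(words) for words in words_per_bank.values()), default=0)


print(conflict_degree([4 * lane for lane in range(32)]))    # unit stride: 1
print(conflict_degree([8 * lane for lane in range(32)]))    # stride of 2 words: 2
print(conflict_degree([128 * lane for lane in range(32)]))  # stride of 32 words: 32
```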
The rasterizer processes geometry in screen-space tiles to minimize memory bandwidth:
- Triangle Setup - Compute edge equations from vertex positions
- Binning - Assign triangles to overlapping tiles
- Tile Rasterization - Per-pixel edge function evaluation
- Fragment Generation - Barycentric interpolation of attributes
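
The edge-function test and barycentric interpolation listed above can be sketched in a few lines. This is a floating-point reference model, not the fixed-point hardware; the function names are hypothetical, and fill rules (such as top-left) are deliberately left out.

```python
from typing import Optional, Tuple

Point = Tuple[float, float]


def edge(a: Point, b: Point, p: Point) -> float:
    """Signed area (times two) of triangle a-b-p; the sign tells which side p is on."""
    return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])


def barycentric(v0: Point, v1: Point, v2: Point,
                p: Point) -> Optional[Tuple[float, float, float]]:
    """Return barycentric weights of p, or None if p is outside or the triangle is degenerate."""
    area = edge(v0, v1, v2)
    if area == 0:
        return None
    # Dividing by the signed area makes the test independent of winding order.
    w0 = edge(v1, v2, p) / area
    w1 = edge(v2, v0, p) / area
    w2 = edge(v0, v1, p) / area
    if w0 < 0 or w1 < 0 or w2 < 0:
        return None
    return (w0, w1, w2)


def interpolate(weights: Tuple[float, float, float],
                a0: float, a1: float, a2: float) -> float:
    """Blend one per-vertex attribute using the barycentric weights."""
    return weights[0] * a0 + weights[1] * a1 + weights[2] * a2


w = barycentric((0, 0), (4, 0), (0, 4), (1, 1))
print(w)                               # (0.5, 0.25, 0.25)
print(interpolate(w, 0.0, 4.0, 8.0))   # 3.0
```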
Tile size is configurable (default 16x16 pixels) to balance BRAM usage vs. overdraw.
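
As a rough illustration of binning with the default 16x16 tile, the sketch below assigns a triangle to every tile touched by its screen-clamped bounding box. Bounding-box binning is an assumption chosen for simplicity (it is conservative and may include tiles the triangle does not actually cover); the binner in the RTL may use a tighter test.

```python
import math
from typing import List, Sequence, Tuple

TILE_SIZE = 16  # default tile edge in pixels, per the text


def tiles_overlapped(verts: Sequence[Tuple[float, float]],
                     width: int, height: int,
                     tile: int = TILE_SIZE) -> List[Tuple[int, int]]:
    """Return (tile_x, tile_y) for every tile the triangle's bounding box touches."""
    xs = [x for x, _ in verts]
    ys = [y for _, y in verts]
    # Clamp the bounding box to the screen before converting to tile indices.
    min_x = max(math.floor(min(xs)), 0)
    max_x = min(math.floor(max(xs)), width - 1)
    min_y = max(math.floor(min(ys)), 0)
    max_y = min(math.floor(max(ys)), height - 1)
    if min_x > max_x or min_y > max_y:
        return []  # entirely off-screen
    return [(tx, ty)
            for ty in range(min_y // tile, max_y // tile + 1)
            for tx in range(min_x // tile, max_x // tile + 1)]


bins = tiles_overlapped([(5, 5), (40, 10), (20, 30)], width=64, height=64)
print(len(bins))  # 6 tiles: x tiles 0..2, y tiles 0..1
```

For scale, a 16x16 tile holds 256 pixels, so a 32-bit color buffer for one tile needs 1 KiB of on-chip storage (the 32-bit color depth is an assumption here).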
The texture sampling pipeline supports:
| Format | Description |
|---|---|
| RGBA8888 | 32-bit true color |
| RGBA4444 | 16-bit with alpha |
| RGB565 | 16-bit no alpha |
| PAL8 | 8-bit palettized |
| PAL4 | 4-bit palettized |
| ETC1/ETC2 | Compressed (planned) |
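
For the 16-bit formats, expansion to 8 bits per channel might look like the following sketch. The bit layouts (red in the high bits for RGB565; R, G, B, A from high nibble to low for RGBA4444) and the bit-replication expansion are common conventions assumed here, not details confirmed by the sampler RTL.

```python
from typing import Tuple

RGBA = Tuple[int, int, int, int]


def rgb565_to_rgba8888(texel: int) -> RGBA:
    """Expand a 16-bit RGB565 texel; alpha is forced to opaque."""
    r5 = (texel >> 11) & 0x1F
    g6 = (texel >> 5) & 0x3F
    b5 = texel & 0x1F
    # Replicate the top bits into the low bits so full scale maps to 255.
    return ((r5 << 3) | (r5 >> 2),
            (g6 << 2) | (g6 >> 4),
            (b5 << 3) | (b5 >> 2),
            255)


def rgba4444_to_rgba8888(texel: int) -> RGBA:
    """Expand a 16-bit RGBA4444 texel; each nibble n becomes n * 17 (0x11)."""
    r4 = (texel >> 12) & 0xF
    g4 = (texel >> 8) & 0xF
    b4 = (texel >> 4) & 0xF
    a4 = texel & 0xF
    return (r4 * 17, g4 * 17, b4 * 17, a4 * 17)


print(rgb565_to_rgba8888(0xF800))    # (255, 0, 0, 255)
print(rgba4444_to_rgba8888(0xF00F))  # (255, 0, 0, 255)
```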
Filtering:
- Nearest neighbor
- Bilinear with mip-level selection (no trilinear)
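
Bilinear filtering on a single mip level can be modeled as below. The sketch works on one channel, assumes texel centers sit at half-integer coordinates, and clamps at the edges; these conventions and the function name are assumptions, and mip-level selection is not shown.

```python
import math
from typing import Sequence


def bilinear_sample(texture: Sequence[Sequence[float]], u: float, v: float) -> float:
    """Sample a single-channel texture at normalized (u, v) with bilinear weights."""
    height = len(texture)
    width = len(texture[0])
    # Shift by half a texel so that texel centers land on integer coordinates.
    x = u * width - 0.5
    y = v * height - 0.5
    x0 = math.floor(x)
    y0 = math.floor(y)
    fx = x - x0
    fy = y - y0

    def fetch(ix: int, iy: int) -> float:
        # Clamp-to-edge addressing keeps the four taps inside the texture.
        ix = min(max(ix, 0), width - 1)
        iy = min(max(iy, 0), height - 1)
        return texture[iy][ix]

    top = fetch(x0, y0) * (1 - fx) + fetch(x0 + 1, y0) * fx
    bottom = fetch(x0, y0 + 1) * (1 - fx) + fetch(x0 + 1, y0 + 1) * fx
    return top * (1 - fy) + bottom * fy


tex = [[0.0, 10.0],
       [20.0, 30.0]]
print(bilinear_sample(tex, 0.5, 0.5))    # 15.0, the average of all four texels
print(bilinear_sample(tex, 0.25, 0.25))  # 0.0, exactly on the first texel center
```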
Wrap Modes:
- Repeat, Clamp, Mirror
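
The three wrap modes differ only in how an out-of-range integer texel coordinate is folded back into `[0, size)`. A minimal sketch, with hypothetical function names and assuming mirror reflects about the texture edge without repeating the edge texel's neighbor:

```python
def wrap_repeat(i: int, size: int) -> int:
    """Tile the texture: coordinates wrap modulo the size."""
    return i % size  # Python's % is non-negative for a positive size


def wrap_clamp(i: int, size: int) -> int:
    """Clamp to the nearest edge texel."""
    return min(max(i, 0), size - 1)


def wrap_mirror(i: int, size: int) -> int:
    """Reflect every other repetition so edges meet without a seam."""
    m = i % (2 * size)
    return m if m < size else 2 * size - 1 - m


for i in (-1, 4, 5, 9):
    print(i, wrap_repeat(i, 4), wrap_clamp(i, 4), wrap_mirror(i, 4))
# -1 -> 3, 0, 0
#  4 -> 0, 3, 3
#  5 -> 1, 3, 2
#  9 -> 1, 3, 1
```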
- SIMT shader core (VHDL translation from SystemVerilog)
- 32-lane ALU, FPU, SFU execution units
- Operand collector with bank conflict handling
- Shared memory subsystem
- Tile-based rasterizer with barycentric interpolation
- Texture sampler (nearest/bilinear, wrap modes)
- System bus infrastructure
- Texture unit SIMT integration (32 parallel samplers)
- ETC1/ETC2 texture decompression
- ROP (blend, depth test)
- Command processor
- Vertex fetch / input assembly
- Primitive assembly (triangle setup from vertex shader output)
- Texture cache with request coalescing
- L1 cache for global memory
- Multi-SM scaling
- FPGA synthesis and timing closure
Milo832 is the graphics subsystem for the m65832 retro computing project. The two components share:
- System Bus - Multi-master bus developed in this repository; the canonical version lives in m65832
- Memory Map - Shared DDR4/SDRAM address space
- Command Model - CPU submits command buffers, GPU processes asynchronously
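
One common way to realize an asynchronous command model is a ring buffer with separate producer and consumer pointers. The sketch below is purely illustrative: the class `CommandRing`, its methods, and the command tuples are all hypothetical, and the actual Milo832 command format and submission mechanism are not described here.

```python
from typing import Any, List, Optional


class CommandRing:
    """Hypothetical single-producer, single-consumer command ring."""

    def __init__(self, capacity: int) -> None:
        self.slots: List[Any] = [None] * capacity
        self.head = 0  # next slot the CPU (producer) writes
        self.tail = 0  # next slot the GPU (consumer) reads

    def submit(self, command: Any) -> bool:
        """CPU side: enqueue a command, or return False if the ring is full."""
        next_head = (self.head + 1) % len(self.slots)
        if next_head == self.tail:
            return False  # one slot stays empty to tell full from empty
        self.slots[self.head] = command
        self.head = next_head
        return True

    def consume(self) -> Optional[Any]:
        """GPU side: dequeue the oldest command, or return None if idle."""
        if self.tail == self.head:
            return None
        command = self.slots[self.tail]
        self.tail = (self.tail + 1) % len(self.slots)
        return command


ring = CommandRing(capacity=4)
ring.submit(("DRAW", 0, 36))   # command tuples are made up for the example
ring.submit(("CLEAR",))
print(ring.consume())  # ('DRAW', 0, 36)
```

Because each side only writes its own pointer, the CPU can keep queuing work while the GPU drains earlier commands.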
The m65832 CPU handles:
- Application logic and game code
- Geometry transformation (optional - can also run on GPU)
- Command buffer construction
- Display timing and scanout
Milo832 handles:
- Vertex shading (programmable)
- Rasterization (fixed-function)
- Fragment shading (programmable)
- Texture sampling
- Framebuffer operations
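
Framebuffer operations in the ROP (depth test, blend, tile writeback) can be sketched as follows. The less-than depth comparison and source-over alpha blending are assumptions picked as typical defaults; the ROP is not yet documented in detail, and the function names are hypothetical.

```python
from typing import List, Tuple

RGB = Tuple[int, int, int]


def blend_channel(src: int, dst: int, alpha: int) -> int:
    """Source-over blend of one 8-bit channel, rounded to nearest."""
    return (src * alpha + dst * (255 - alpha) + 127) // 255


def rop_write(tile_color: List[List[RGB]], tile_depth: List[List[float]],
              x: int, y: int, rgba: Tuple[int, int, int, int], z: float) -> bool:
    """Depth-test a fragment and, if it passes, blend it into the tile buffers."""
    if not z < tile_depth[y][x]:  # assumed "less than" depth function
        return False
    r, g, b, a = rgba
    dr, dg, db = tile_color[y][x]
    tile_color[y][x] = (blend_channel(r, dr, a),
                        blend_channel(g, dg, a),
                        blend_channel(b, db, a))
    tile_depth[y][x] = z
    return True


color = [[(0, 0, 0)]]
depth = [[1.0]]
print(rop_write(color, depth, 0, 0, (255, 0, 0, 128), 0.5))  # True
print(color[0][0])                                           # (128, 0, 0)
print(rop_write(color, depth, 0, 0, (0, 255, 0, 255), 0.7))  # False, fails depth test
```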
milo832/
├── RTL/
│ ├── Core/ # SIMT shader core
│ │ ├── streaming_multiprocessor.vhd
│ │ ├── simt_pkg.vhd
│ │ └── shared_memory.vhd
│ ├── Compute/ # Execution units
│ │ ├── int_alu.vhd
│ │ ├── fpu.vhd
│ │ └── sfu.vhd
│ ├── Memory/ # Memory subsystem
│ │ ├── operand_collector.vhd
│ │ └── fifo.vhd
│ ├── graphics/ # Rasterization pipeline
│ │ ├── tile_rasterizer.vhd
│ │ ├── texture_sampler.vhd
│ │ ├── texture_unit.vhd
│ │ └── rop.vhd
│ └── bus/ # System interconnect
│ ├── system_bus.vhd
│ └── bus_arbiter.vhd
├── TB/ # VHDL testbenches
├── docs/ # Design documentation
└── README.md
- GHDL - VHDL simulator (tested with ghdl-llvm)
- Python 3 with Pillow - For image output visualization
# Compile and run a testbench
cd TB
ghdl -a --std=08 ../RTL/Core/simt_pkg.vhd
ghdl -a --std=08 ../RTL/Core/streaming_multiprocessor.vhd
ghdl -a --std=08 tb_sm_basic.vhd
ghdl -e --std=08 tb_sm_basic
ghdl -r --std=08 tb_sm_basic
Testbenches output PPM images that can be converted to PNG:
# Convert PPM to PNG (macOS)
sips -s format png output.ppm --out output.png
# Or with ImageMagick
convert output.ppm output.png
- Complete the graphics pipeline - Integrate all stages from vertex fetch to framebuffer write
- FPGA synthesis - Target DE2-115 first, then KV260
- Performance optimization - Balance resource usage vs. throughput
- Driver development - CPU-side library for m65832 integration
- Demo applications - 3D rendering demos showcasing capabilities
MIT License - See LICENSE file.
- SIMT-GPU-Core - Original SystemVerilog implementation by Aritra Manna
- Tile-Based Rendering - ARM Mali architecture guide
- GPU Gems - NVIDIA GPU programming resources