tinyGPT.c

tinyGPT.c is a tiny version of GPT.c, designed specifically for microcontrollers and tested on ESP32S3-N16R8.

Navigation

Features
Contents
Quickstart
Training tinyGPT.c-Compatible Models
UART Inference
Future Plans
Contact
Support

Features

Prompt/Output: Send a prompt and get a response quickly and easily.

Tuning: Adjust generation settings such as temperature, top-p and steps to get the most out of your GPT.c-compatible model.

Serial inference: tinyGPT.c turns your preferred microcontroller into a serial AI hub. You can connect the board running tinyGPT.c to almost any device and run inference with your AI models locally over UART.

Notes
- tinyGPT.c was originally developed for Pocketive P1, the world's first pocket-sized local AI device powered by microcontrollers.
- GPT.c, tinyGPT.c and Pocketive P1 are developed for educational purposes only, so there may be weak points in the design, the code, and elsewhere. I don't claim they are perfect; the project is still under development.
Why?
- Modern AI chatbots require sending your conversations to remote servers, raising privacy concerns about who has access to your data. Additionally, the massive computing infrastructure needed to run large language models has significant environmental costs. tinyGPT.c explores an alternative approach: running small language models directly on low-power microcontrollers. This enables private, offline AI inference without relying on cloud services or consuming substantial energy. While these models are currently limited compared to their larger counterparts, they demonstrate the potential for edge AI that keeps your data local and your conversations truly private.

Contents

Some of the key files in this repository:

gpt.c -> Main tinyGPT.c library
gpt.h -> Main tinyGPT.c library
main.c -> Main UART Inference firmware script, which uses the tinyGPT.c library to turn the uploaded microcontroller into a serial/UART AI hub.
partitions.csv -> Partitioning scheme made for UART Inference firmware.
espic2.bin -> Espic-2's pretrained weights.
tokenizer.bin -> Tokenizer for Espic-2.

Quickstart

This guide will help you integrate the tinyGPT.c library into your ESP32S3 project for running GPT.c-formatted models locally on your microcontroller.

Prerequisites

ESP-IDF toolchain installed (Installation Guide)
ESP32S3 with at least 8MB PSRAM (tested on, and recommended: the N16R8 variant)
A GPT.c-compatible model file (.bin format)
Corresponding tokenizer file (tokenizer.bin)

Library Overview

The tinyGPT.c library consists of two main files:

gpt.c - Core implementation with transformer architecture, tokenization, and sampling
gpt.h - Header file with data structures and function declarations

Basic Integration

1. Add Library Files to Your Project

Copy gpt.c and gpt.h to your ESP-IDF project's main/ directory or component folder.

2. Initialize the Storage System

Before using the library, initialize SPIFFS to access your model files:

#include "esp_spiffs.h"

void init_storage(void) {
    esp_vfs_spiffs_conf_t conf = {
        .base_path = "/data",
        .partition_label = NULL,
        .max_files = 5,
        .format_if_mount_failed = false
    };
    ESP_ERROR_CHECK(esp_vfs_spiffs_register(&conf));  // abort with a log message if mounting fails
}

3. Build the Transformer

Load your model checkpoint and build the transformer:

#include "gpt.h"

Transformer transformer;
build_transformer(&transformer, "/data/your-model.bin");

The build_transformer() function:

Reads the model checkpoint file
Allocates memory for the run state
Initializes FreeRTOS tasks for parallel computation
Sets up synchronization primitives
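build_transformer() starts by reading the checkpoint file, so a quick pre-flight check can turn a missing or truncated file into a clear error message. The sketch below uses only standard C stdio. `checkpoint_size()` and `checkpoint_looks_valid()` are hypothetical helpers and are not part of gpt.h. The 256-byte threshold is the header size described under Model File Format.

```c
#include <assert.h>
#include <stdio.h>

/* Header size from "Model File Format"; weights must follow it. */
#define CHECKPOINT_HEADER_BYTES 256L

/* Hypothetical helper: returns the file size in bytes, or -1 if the file
   cannot be opened or measured. */
long checkpoint_size(const char *path) {
    FILE *f = fopen(path, "rb");
    if (f == NULL) {
        return -1;
    }
    long size = -1;
    if (fseek(f, 0, SEEK_END) == 0) {
        size = ftell(f);              /* ftell returns -1L on error */
    }
    fclose(f);
    return size;
}

/* Hypothetical helper: a usable checkpoint is larger than its header. */
int checkpoint_looks_valid(const char *path) {
    return checkpoint_size(path) > CHECKPOINT_HEADER_BYTES;
}
```

For example, you could call `if (!checkpoint_looks_valid("/data/your-model.bin")) { printf("model file missing or truncated\n"); return; }` before build_transformer().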

4. Initialize the Tokenizer

Build the tokenizer with your vocabulary:

Tokenizer tokenizer;
build_tokenizer(&tokenizer, "/data/tokenizer.bin", transformer.config.vocab_size);

5. Create a Sampler

Configure the sampling parameters for text generation:

Sampler sampler;
float temperature = 0.8f;  // Controls randomness (0.0 = deterministic, 1.0 = more random)
float topp = 0.9f;         // Top-p (nucleus) sampling threshold
unsigned long long rng_seed = (unsigned int)time(NULL);

build_sampler(&sampler, transformer.config.vocab_size, temperature, topp, rng_seed);
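rng_seed makes sampling reproducible: the same seed, prompt and parameters should yield the same tokens. The sketch below illustrates this with an xorshift64* generator of the kind llama2.c-style samplers use. Whether gpt.c uses exactly this generator is an assumption, and `rng_u32`/`rng_f32` are hypothetical names.

With this family of generators a seed of 0 never changes state. That matters on an ESP32, because time(NULL) starts near 0 after boot unless the clock has been set (for example via SNTP). A hardware value from esp_random() is one way to get a varying seed.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical xorshift64* step: advances the state and returns 32 random
   bits. A state of 0 is a fixed point, so the seed must be non-zero. */
uint32_t rng_u32(uint64_t *state) {
    *state ^= *state >> 12;
    *state ^= *state << 25;
    *state ^= *state >> 27;
    return (uint32_t)((*state * 0x2545F4914F6CDD1DULL) >> 32);
}

/* Hypothetical helper: uniform float in [0, 1) from the top 24 random bits. */
float rng_f32(uint64_t *state) {
    return (float)(rng_u32(state) >> 8) / 16777216.0f;
}
```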

6. Generate Text

Generate text from a prompt:

int steps = 100;  // Maximum number of tokens to generate
char *prompt = "Once upon a time";

generate(&transformer, &tokenizer, &sampler, prompt, steps, NULL);

With a callback for completion stats:

void on_complete(float tokens_per_sec) {
    printf("\nGeneration speed: %.2f tokens/sec\n", tokens_per_sec);
}

generate(&transformer, &tokenizer, &sampler, prompt, steps, on_complete);
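The value passed to the completion callback is a throughput figure. gpt.c's exact timing points are not documented here, so the following is only a sketch of how such a number is typically computed: the generated token count divided by elapsed wall-clock time.

`tokens_per_second` and `report_completion` are hypothetical names. The typedef is an assumption chosen to match the `on_complete(float)` example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed to match the on_complete(float) example; gpt.h has the real typedef. */
typedef void (*generated_complete_cb)(float tokens_per_sec);

/* Hypothetical helper: tokens generated per second of elapsed time. */
float tokens_per_second(int n_tokens, int64_t elapsed_ms) {
    if (n_tokens <= 0 || elapsed_ms <= 0) {
        return 0.0f;                  /* avoid division by zero on empty runs */
    }
    return (float)n_tokens * 1000.0f / (float)elapsed_ms;
}

/* Hypothetical helper: invoke the optional callback only when one was given. */
void report_completion(generated_complete_cb cb, int n_tokens, int64_t elapsed_ms) {
    if (cb != NULL) {
        cb(tokens_per_second(n_tokens, elapsed_ms));
    }
}
```

On the ESP32 the elapsed time could come from esp_timer_get_time(), which returns microseconds since boot; divide the difference by 1000 to get milliseconds.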

7. Cleanup

Free allocated resources when done:

free_sampler(&sampler);
free_tokenizer(&tokenizer);
free_transformer(&transformer);

API Reference

Core Data Structures

`Config`

Holds model configuration parameters:

dim - Model dimension
hidden_dim - Hidden layer dimension
n_layers - Number of transformer layers
n_heads - Number of attention heads
vocab_size - Vocabulary size
seq_len - Maximum sequence length
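For orientation, a struct holding these fields might look like the sketch below; the field order and `int` types are assumptions, so check gpt.h. The derived head size shows why dim should be divisible by n_heads: each attention head works on a dim / n_heads slice. `head_size` and `config_is_sane` are hypothetical helpers.

```c
#include <assert.h>

/* Assumed layout for illustration; gpt.h is authoritative. */
typedef struct {
    int dim;        /* model (embedding) dimension */
    int hidden_dim; /* feed-forward hidden dimension */
    int n_layers;   /* number of transformer blocks */
    int n_heads;    /* number of attention heads */
    int vocab_size; /* number of tokens in the vocabulary */
    int seq_len;    /* maximum sequence length (context window) */
} Config;

/* Hypothetical helper: each head attends over dim / n_heads values. */
int head_size(const Config *c) {
    return c->dim / c->n_heads;
}

/* Hypothetical helper: returns 1 if the configuration is internally consistent. */
int config_is_sane(const Config *c) {
    return c->dim > 0 && c->n_heads > 0 && c->dim % c->n_heads == 0 &&
           c->hidden_dim > 0 && c->n_layers > 0 &&
           c->vocab_size > 0 && c->seq_len > 0;
}
```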

`Transformer`

Main model structure containing configuration, weights, and state.

`Tokenizer`

Handles text encoding/decoding with vocabulary management.

`Sampler`

Controls text generation sampling strategy.

Key Functions

`void build_transformer(Transformer *t, char* checkpoint_path)`

Loads model from file and initializes the transformer.

Parameters:

t - Pointer to Transformer structure
checkpoint_path - Path to model checkpoint file

`void build_tokenizer(Tokenizer* t, char* tokenizer_path, int vocab_size)`

Initializes tokenizer from vocabulary file.

Parameters:

t - Pointer to Tokenizer structure
tokenizer_path - Path to tokenizer file
vocab_size - Size of vocabulary

`void build_sampler(Sampler* sampler, int vocab_size, float temperature, float topp, unsigned long long rng_seed)`

Creates sampler with specified parameters.

Parameters:

sampler - Pointer to Sampler structure
vocab_size - Size of vocabulary
temperature - Sampling temperature (0.0-1.0+)
topp - Top-p sampling threshold (0.0-1.0)
rng_seed - Random seed for reproducibility

`void generate(Transformer *transformer, Tokenizer *tokenizer, Sampler *sampler, char *prompt, int steps, generated_complete_cb cb_done)`

Generates text from a prompt.

Parameters:

transformer - Pointer to initialized Transformer
tokenizer - Pointer to initialized Tokenizer
sampler - Pointer to initialized Sampler
prompt - Input text prompt (can be NULL for unconditional generation)
steps - Maximum tokens to generate
cb_done - Optional callback function called on completion

`float* forward(Transformer* transformer, int token, int pos)`

Performs a forward pass through the transformer.

Parameters:

transformer - Pointer to Transformer
token - Input token ID
pos - Position in sequence

Returns: Pointer to logits array
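forward() returns one logit per vocabulary entry for the next position. With temperature 0.0 the sampler just takes the highest logit (greedy decoding). A minimal sketch of that step, with `argmax_logits` as a hypothetical helper:

```c
#include <assert.h>

/* Hypothetical helper: index of the largest logit, i.e. the greedy next token.
   Ties resolve to the lowest index. */
int argmax_logits(const float *logits, int vocab_size) {
    int best = 0;
    for (int i = 1; i < vocab_size; i++) {
        if (logits[i] > logits[best]) {
            best = i;
        }
    }
    return best;
}
```

With the library, this would be called as `argmax_logits(forward(&transformer, token, pos), transformer.config.vocab_size)`. The chosen token is then fed back in at pos + 1.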

`void encode(Tokenizer* t, char *text, int8_t bos, int8_t eos, int *tokens, int *n_tokens)`

Encodes text into token IDs.

Parameters:

t - Pointer to Tokenizer
text - Input text to encode
bos - Add beginning-of-sequence token (0 or 1)
eos - Add end-of-sequence token (0 or 1)
tokens - Output array for token IDs
n_tokens - Pointer to store number of tokens
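Models trained with the scripts in train/ use a character-level vocabulary, so encoding amounts to two steps:

1. Split the UTF-8 text into characters.
2. Look each character up in the vocabulary.

The sketch below shows that idea with a linear search over a small in-memory vocabulary. The vocabulary layout (an array of UTF-8 strings), the `bos_id` parameter, skipping unknown characters, and the names `utf8_len`, `lookup_char` and `encode_chars` are all illustrative assumptions, not gpt.c's actual data structures.

```c
#include <assert.h>
#include <string.h>

/* Length in bytes of the UTF-8 character that starts with lead byte c. */
int utf8_len(unsigned char c) {
    if (c < 0x80) return 1;               /* ASCII */
    if ((c & 0xE0) == 0xC0) return 2;
    if ((c & 0xF0) == 0xE0) return 3;
    if ((c & 0xF8) == 0xF0) return 4;
    return 1;                             /* invalid lead byte: treat as one byte */
}

/* Hypothetical linear search for a character (a byte span) in the vocabulary. */
int lookup_char(const char **vocab, int vocab_size, const char *ch, int len) {
    for (int i = 0; i < vocab_size; i++) {
        if ((int)strlen(vocab[i]) == len && memcmp(vocab[i], ch, (size_t)len) == 0) {
            return i;
        }
    }
    return -1;
}

/* Hypothetical character-level encoder. tokens must hold strlen(text) + 1
   entries. A bos_id >= 0 is written first; unknown characters are skipped.
   Returns the number of token IDs written. */
int encode_chars(const char **vocab, int vocab_size, const char *text,
                 int bos_id, int *tokens) {
    int n = 0;
    if (bos_id >= 0) {
        tokens[n++] = bos_id;
    }
    size_t text_len = strlen(text);
    for (size_t i = 0; i < text_len;) {
        int len = utf8_len((unsigned char)text[i]);
        if (i + (size_t)len > text_len) {
            len = (int)(text_len - i);    /* truncated sequence at end of input */
        }
        int id = lookup_char(vocab, vocab_size, text + i, len);
        if (id >= 0) {
            tokens[n++] = id;
        }
        i += (size_t)len;
    }
    return n;
}
```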

`char* decode(Tokenizer* t, int prev_token, int token)`

Decodes a token ID to text.

Parameters:

t - Pointer to Tokenizer
prev_token - Previous token ID (for context)
token - Token ID to decode

Returns: Decoded text string
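For a character-level vocabulary, decoding is essentially an index into the vocabulary table. The following is a sketch with a bounds check. `decode_char` is hypothetical, and it ignores prev_token because gpt.c's exact use of that argument is not described here. In llama2.c-style decoders, prev_token is used to strip a leading space after the BOS token.

```c
#include <assert.h>

/* Hypothetical character-level decoder: returns the vocabulary string for
   token, or "" for an out-of-range ID so the caller can always print it.
   prev_token is accepted for API symmetry but unused in this sketch. */
const char *decode_char(const char **vocab, int vocab_size, int prev_token, int token) {
    (void)prev_token;
    if (token < 0 || token >= vocab_size) {
        return "";
    }
    return vocab[token];
}
```

A streaming loop would print each piece as it arrives, for example `printf("%s", decode_char(vocab, vocab_size, prev, next)); fflush(stdout);`.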

Model File Format

tinyGPT.c uses a binary checkpoint format compatible with GPT.c:

Header (256 bytes):

Magic number: 0x616b3432
Version: 3
Configuration parameters (dim, hidden_dim, n_layers, etc.)

Weights: Following the header, model weights are stored as float arrays in this order:

Token embeddings
Positional embeddings
Layer normalization weights/biases
Attention projection weights (Q, K, V, O)
Feed-forward network weights
Final layer normalization
Classification head (optional, can share token embeddings)
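A loader can reject a mismatched file early by checking the header before touching the weights. Note that 0x616b3432 is the ASCII bytes "ak42", the same magic that llama2.c's export script writes.

The sketch below parses a 256-byte header from memory under stated assumptions, which should be checked against export_tg.py and gpt.c:

- The magic number is a little-endian uint32 at offset 0.
- The version is an int32 at offset 4.
- Six int32 fields follow, in the order dim, hidden_dim, n_layers, n_heads, vocab_size, seq_len.

That field layout is a guess for illustration. `HeaderConfig`, `read_u32` and `parse_header` are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define TG_MAGIC 0x616b3432u   /* "ak42" */
#define TG_VERSION 3
#define TG_HEADER_BYTES 256

/* Assumed header fields, in assumed file order. */
typedef struct {
    int32_t dim, hidden_dim, n_layers, n_heads, vocab_size, seq_len;
} HeaderConfig;

/* Reads a 32-bit value at byte offset off in native byte order, which matches
   a little-endian file on little-endian hosts such as the ESP32-S3. */
static uint32_t read_u32(const unsigned char *buf, size_t off) {
    uint32_t v;
    memcpy(&v, buf + off, sizeof v);
    return v;
}

/* Hypothetical parser. Returns 0 on success, -1 on bad magic,
   -2 on an unsupported version. */
int parse_header(const unsigned char header[TG_HEADER_BYTES], HeaderConfig *out) {
    if (read_u32(header, 0) != TG_MAGIC) return -1;
    if ((int32_t)read_u32(header, 4) != TG_VERSION) return -2;
    int32_t *fields[] = {&out->dim, &out->hidden_dim, &out->n_layers,
                         &out->n_heads, &out->vocab_size, &out->seq_len};
    for (size_t i = 0; i < 6; i++) {
        *fields[i] = (int32_t)read_u32(header, 8 + 4 * i);
    }
    return 0;
}
```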

Performance Tuning

Temperature

0.0: Deterministic (always picks most likely token)
0.5-0.8: Balanced creativity and coherence (recommended)
1.0+: More random and creative

Top-p (Nucleus Sampling)

0.9: Good default, considers top 90% probability mass
1.0: No filtering (equivalent to temperature-only sampling)
<0.9: More conservative, fewer token choices
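Nucleus sampling keeps the smallest set of most-probable tokens whose cumulative probability exceeds top-p, then samples within that set. The sketch below is a straightforward version that sorts with qsort. gpt.c's own implementation may prune candidates differently for speed.

The names `ProbIndex`, `cmp_desc` and `sample_topp` are hypothetical. `coin` is a uniform random number in [0, 1), for example from the sampler's RNG.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct { float prob; int index; } ProbIndex;

/* qsort comparator: descending probability. */
static int cmp_desc(const void *a, const void *b) {
    float pa = ((const ProbIndex *)a)->prob, pb = ((const ProbIndex *)b)->prob;
    return (pa < pb) - (pa > pb);
}

/* Hypothetical top-p sampler. probs must sum to 1, work must hold n entries,
   coin is uniform in [0, 1). Returns the sampled token index. */
int sample_topp(const float *probs, int n, float topp, float coin, ProbIndex *work) {
    for (int i = 0; i < n; i++) {
        work[i].prob = probs[i];
        work[i].index = i;
    }
    qsort(work, (size_t)n, sizeof(ProbIndex), cmp_desc);

    /* Smallest prefix whose cumulative probability exceeds topp. */
    float cumulative = 0.0f;
    int last = n - 1;
    for (int i = 0; i < n; i++) {
        cumulative += work[i].prob;
        if (cumulative > topp) {
            last = i;
            break;
        }
    }

    /* Sample inside the truncated set, implicitly renormalised by cumulative. */
    float r = coin * cumulative;
    float cdf = 0.0f;
    for (int i = 0; i <= last; i++) {
        cdf += work[i].prob;
        if (r < cdf) return work[i].index;
    }
    return work[last].index;              /* guard against rounding */
}
```

With topp = 1.0 nothing is filtered out, which is the "equivalent to temperature-only sampling" case listed above.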

Steps

Limit generation length to conserve memory and processing time. The model's seq_len parameter defines the maximum context window.
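Positions cannot go past seq_len, so a caller can clamp the requested step count before calling generate(). This is a hypothetical guard modelled on what llama2.c-derived code commonly does: it treats 0 or an oversized value as "use the full context". Whether gpt.c already clamps internally is not stated here.

```c
#include <assert.h>

/* Hypothetical guard: clamp a requested generation length to the model's
   context window. Non-positive requests mean "use the whole window". */
int clamp_steps(int steps, int seq_len) {
    if (steps <= 0 || steps > seq_len) {
        return seq_len;
    }
    return steps;
}
```

For a model with seq_len = 32, `clamp_steps(100, transformer.config.seq_len)` yields 32.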

Memory Considerations

The library allocates significant memory for:

Model weights (read from file)
KV cache (stores attention keys/values for all positions)
Activation buffers
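The KV cache size follows directly from the config: one key and one value vector of length dim per layer per position. A back-of-the-envelope estimate follows, assuming float32 storage and one key/value head per attention head (no grouped-query sharing). `kv_cache_bytes` is a hypothetical helper.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical estimate: 2 (K and V) x layers x positions x dim floats. */
size_t kv_cache_bytes(size_t n_layers, size_t seq_len, size_t dim) {
    return 2u * n_layers * seq_len * dim * sizeof(float);
}
```

With Espic-2's configuration (dim=128, n_layers=4, max_seq_len=32) this comes to 128 KiB. A model with dim=512, 6 layers and a 512-token context would need 12 MiB for the cache alone, more than the 8MB of PSRAM.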

For ESP32S3-N16R8 (16MB Flash, 8MB PSRAM), suitable models are typically:

<50M parameters
Dimension ≤ 512
6-8 layers
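To check a candidate configuration against this budget, the parameter count can be estimated from the config. The sketch assumes the GPT-style layout listed under Model File Format:

- token and positional embeddings
- two layer norms (weight and bias) per layer
- Q/K/V/O projections
- a two-matrix feed-forward block
- a final layer norm
- an optional separate classification head

Linear-layer biases are ignored. Weights are stored as float arrays, so the checkpoint takes roughly 4 bytes per parameter and has to fit in the flash partition that holds it. Both helpers are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical estimate of the parameter count of a GPT-style model
   (linear-layer biases ignored). */
size_t estimate_params(size_t vocab_size, size_t seq_len, size_t dim,
                       size_t hidden_dim, size_t n_layers, int shared_head) {
    size_t embeddings = vocab_size * dim + seq_len * dim;  /* token + positional */
    size_t per_layer = 4 * dim * dim          /* Q, K, V, O projections */
                     + 2 * dim * hidden_dim   /* feed-forward up and down */
                     + 4 * dim;               /* two layer norms, weight + bias */
    size_t final_norm = 2 * dim;
    size_t head = shared_head ? 0 : vocab_size * dim;  /* may reuse embeddings */
    return embeddings + n_layers * per_layer + final_norm + head;
}

/* Hypothetical estimate of checkpoint size: float32 weights plus 256-byte header. */
size_t estimate_file_bytes(size_t params) {
    return 256 + params * sizeof(float);
}
```

An Espic-2-like configuration gives about 1.39M parameters, or roughly 5.6 MB as float32. That configuration is vocab 4661, dim 128, 4 layers, seq_len 32, a shared head, and an assumed hidden_dim of 512. The result is in line with the 4-8MB file sizes mentioned in the training section.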

Advanced Usage: Custom UART Protocol

The example main.c demonstrates implementing a UART-based AI hub. Key features:

Output Redirection

Override _write() to redirect printf output to UART during generation:

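int _write(int fd, const void *buf, size_t count) {
    // fd 1 is stdout, fd 2 is stderr; output_to_uart is a flag the firmware sets during generation
    if ((fd == 1 || fd == 2) && output_to_uart) {
        uart_write_bytes(UART_NUM, buf, count);
        return (int)count;
    }
    // Redirect disabled (or another fd): the bytes are reported as written but discarded
    return (int)count;
}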
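The override above reports bytes as written but drops them whenever output_to_uart is false. Below is a portable sketch of the same switch that forwards to a fallback sink instead of discarding. On the device, the two sinks would wrap uart_write_bytes() and the normal console. `sink_fn`, `OutputRouter`, `route_write`, `MemSink` and `mem_sink_write` are hypothetical names. If stdout is buffered, calling fflush(stdout) after each printed token helps tokens reach the UART as soon as they are generated.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical sink type: writes count bytes, returns how many it accepted. */
typedef int (*sink_fn)(void *ctx, const void *buf, size_t count);

typedef struct {
    bool to_uart;          /* set while a generation is streaming */
    sink_fn uart_sink;     /* e.g. a wrapper around uart_write_bytes() */
    sink_fn console_sink;  /* fallback, e.g. the default console */
    void *uart_ctx;
    void *console_ctx;
} OutputRouter;

/* Route stdout/stderr writes (fd 1 and 2) to the active sink. */
int route_write(OutputRouter *r, int fd, const void *buf, size_t count) {
    if (fd != 1 && fd != 2) {
        return -1;                        /* not a stream this router handles */
    }
    if (r->to_uart) {
        return r->uart_sink(r->uart_ctx, buf, count);
    }
    return r->console_sink(r->console_ctx, buf, count);
}

/* Example sink that appends into a fixed-size memory buffer (handy for tests). */
typedef struct { char data[64]; size_t len; } MemSink;

int mem_sink_write(void *ctx, const void *buf, size_t count) {
    MemSink *m = (MemSink *)ctx;
    size_t room = sizeof m->data - m->len;
    size_t n = count < room ? count : room;
    memcpy(m->data + m->len, buf, n);
    m->len += n;
    return (int)n;
}
```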

Protocol Format

The example uses a simple protocol:

Input: Text prompt ending with newline
Output: Generated tokens followed by <STATS>X.XX</STATS><END>\n
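Each reply therefore ends with a fixed trailer on the wire. The sketch below shows one way to build the trailer on the device and recover the number on the host. It assumes X.XX is the tokens-per-second value delivered to the completion callback. `format_trailer` and `parse_trailer` are hypothetical helpers.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Device side (hypothetical): writes "<STATS>X.XX</STATS><END>\n" into buf.
   Returns the number of characters written, or -1 if buf is too small. */
int format_trailer(char *buf, size_t size, float tokens_per_sec) {
    int n = snprintf(buf, size, "<STATS>%.2f</STATS><END>\n", tokens_per_sec);
    return (n < 0 || (size_t)n >= size) ? -1 : n;
}

/* Host side (hypothetical): finds the trailer in a received reply and extracts
   the value. Returns 0 on success, -1 if the reply is incomplete or malformed. */
int parse_trailer(const char *reply, float *tokens_per_sec) {
    const char *start = strstr(reply, "<STATS>");
    if (start == NULL || strstr(start, "</STATS><END>") == NULL) {
        return -1;
    }
    char *end = NULL;
    *tokens_per_sec = strtof(start + strlen("<STATS>"), &end);
    return (end != NULL && strncmp(end, "</STATS>", 8) == 0) ? 0 : -1;
}
```

A host would keep reading until it sees "<END>\n" and then call parse_trailer() on the accumulated reply.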

Reading Prompts

int uart_read_line(char *buffer, int max_len) {
    // Read bytes until newline
    // Returns length of line read
}
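The stub above only outlines the loop. A complete version can be written against a byte-source callback, so the same logic works with uart_read_bytes() on the device and with a string in tests. Its behaviour:

- It stops at '\n' and drops '\r' from CRLF terminals.
- It always NUL-terminates the buffer.
- It consumes, but does not store, bytes beyond max_len - 1, so the next call starts on a fresh line.

`read_byte_fn`, `read_line`, `StrSource` and `str_next` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical byte source: returns the next byte (0-255), or -1 when no
   more data is available. */
typedef int (*read_byte_fn)(void *ctx);

/* Reads one '\n'-terminated line into buffer (max_len >= 1). Returns the
   stored length, or -1 if the source ended before any byte was read. */
int read_line(read_byte_fn next, void *ctx, char *buffer, int max_len) {
    int len = 0;
    int got_any = 0;
    for (;;) {
        int c = next(ctx);
        if (c < 0) {
            if (!got_any) {
                buffer[0] = '\0';
                return -1;
            }
            break;                        /* end of data: return what we have */
        }
        got_any = 1;
        if (c == '\n') break;             /* end of prompt */
        if (c == '\r') continue;          /* ignore CR from CRLF terminals */
        if (len < max_len - 1) {
            buffer[len++] = (char)c;      /* overflow bytes are dropped */
        }
    }
    buffer[len] = '\0';
    return len;
}

/* Example source that reads from a C string (handy for tests). */
typedef struct { const char *s; size_t pos; } StrSource;

int str_next(void *ctx) {
    StrSource *src = (StrSource *)ctx;
    if (src->s[src->pos] == '\0') return -1;
    return (unsigned char)src->s[src->pos++];
}
```

On the device, the source could wrap `uart_read_bytes(UART_NUM, &byte, 1, portMAX_DELAY)`, which blocks until a byte arrives.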

Troubleshooting

Out of Memory

Reduce model size or use a smaller model
Decrease seq_len in model configuration
Check PSRAM allocation in menuconfig

Slow Generation

Model may be too large for your hardware
Check CPU frequency settings
Verify PSRAM speed configuration

Invalid Output

Verify model and tokenizer files match
Check temperature/topp parameters
Ensure sufficient steps for meaningful output

Complete Minimal Example

#include "gpt.h"
#include "esp_spiffs.h"
#include <time.h>

void app_main(void) {
    // Initialize storage
    esp_vfs_spiffs_conf_t conf = {
        .base_path = "/data",
        .partition_label = NULL,
        .max_files = 5,
        .format_if_mount_failed = false
    };
    ESP_ERROR_CHECK(esp_vfs_spiffs_register(&conf));  // abort with a log message if mounting fails
    
    // Build transformer
    Transformer transformer;
    build_transformer(&transformer, "/data/model.bin");
    
    // Build tokenizer
    Tokenizer tokenizer;
    build_tokenizer(&tokenizer, "/data/tokenizer.bin", 
                    transformer.config.vocab_size);
    
    // Build sampler
    Sampler sampler;
    build_sampler(&sampler, transformer.config.vocab_size, 
                  0.8f, 0.9f, (unsigned int)time(NULL));
    
    // Generate text
    char *prompt = "Hello, I am";
    generate(&transformer, &tokenizer, &sampler, prompt, 50, NULL);
    
    // Cleanup
    free_sampler(&sampler);
    free_tokenizer(&tokenizer);
    free_transformer(&transformer);
}

Training tinyGPT.c-Compatible Models

Want to train your own models for tinyGPT.c? This guide covers the complete process, from dataset preparation to exporting a model ready for your ESP32.

Note: tinyGPT.c uses a type-grouped export format optimized for ESP32's memory architecture. This format enables zero-copy memory mapping, making model loading faster and more memory-efficient on resource-constrained devices.

Quick Start with Google Colab

The easiest way to get started is with our training notebook: train/tgptc_train.ipynb

This notebook walks through the entire process step by step. Open it in Google Colab and you can start training immediately: all you need is your dataset, which you upload to Colab before running every cell.

Training Pipeline Overview

Dataset Preparation → Model Training → Export Model → Export Tokenizer → Flash to ESP32

Step 1: Prepare Your Dataset

Using DataSeek (optional but recommended)

DataSeek makes dataset preparation significantly easier. It is powered by DeepSeekr, so you can generate training datasets automatically and for free, without any API costs.

DeepSeekr is a Selenium automation tool that generates conversations from DeepSeek's web interface.

The workflow is straightforward:

Give DataSeek a prompt describing your desired dataset subject (e.g., "conversational AI about programming")
DataSeek uses DeepSeekr to automatically generate conversations on that topic
You get a clean, formatted dataset file ready for training

This eliminates manual data collection, web scraping, and API costs while generating high-quality training data.

Manual Dataset Preparation

If you're preparing data manually, save it as a plain text file named dataset.txt (UTF-8 encoding).

Dataset tips:

Larger datasets generally produce better results (aim for at least ≈70MB)
Clean your data to avoid encoding issues that can affect training
For conversational models, format text as dialogue
The training script builds a character-level tokenizer from your data
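A character-level tokenizer's vocabulary is simply the set of distinct characters in the dataset. That is why a large, varied text can produce thousands of entries; Espic-2 has 4661. The real tokenizer is built by the Python training script, which may treat characters differently. As a rough C sketch of the count, the hypothetical `count_distinct_chars` decodes UTF-8 code points without validating continuation bytes and counts the unique ones with a bitset.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define MAX_CODE_POINT 0x110000u

/* Decodes one UTF-8 sequence at s (no validation of continuation bytes).
   Stores its byte length in *len; stray bytes are treated as single bytes. */
static uint32_t utf8_decode(const unsigned char *s, size_t avail, int *len) {
    int n = 1;
    uint32_t cp = s[0];
    if ((s[0] & 0xE0) == 0xC0) { n = 2; cp = s[0] & 0x1F; }
    else if ((s[0] & 0xF0) == 0xE0) { n = 3; cp = s[0] & 0x0F; }
    else if ((s[0] & 0xF8) == 0xF0) { n = 4; cp = s[0] & 0x07; }
    if ((size_t)n > avail) { *len = 1; return s[0]; }   /* truncated sequence */
    for (int i = 1; i < n; i++) cp = (cp << 6) | (s[i] & 0x3F);
    *len = n;
    return cp;
}

/* Hypothetical helper: number of distinct code points in text,
   or -1 if the bitset cannot be allocated. */
long count_distinct_chars(const char *text) {
    unsigned char *seen = calloc(MAX_CODE_POINT / 8, 1);  /* one bit per code point */
    if (seen == NULL) return -1;
    const unsigned char *s = (const unsigned char *)text;
    size_t remaining = strlen(text);
    long distinct = 0;
    while (remaining > 0) {
        int len;
        uint32_t cp = utf8_decode(s, remaining, &len);
        if (cp < MAX_CODE_POINT && !(seen[cp / 8] & (1u << (cp % 8)))) {
            seen[cp / 8] |= (unsigned char)(1u << (cp % 8));
            distinct++;
        }
        s += len;
        remaining -= (size_t)len;
    }
    free(seen);
    return distinct;
}
```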

Step 2: Configure Training Parameters

Open train/tiny_train.py and adjust the TrainConfig class:

class TrainConfig:
    # model config
    dim = 128         # Embedding dimension
    n_layers = 4      # Number of transformer layers
    n_head = 4        # Number of attention heads
    max_seq_len = 32  # Context window (keep low for ESP32)
    
    # training config
    batch_size = 32
    learning_rate = 5e-3
    max_iters = 8000
    eval_interval = 500

Key parameters:

dim: Model dimension. Higher values increase capacity but require more memory. 128-192 works well for ESP32S3.
max_seq_len: Maximum context length. This is critical for memory usage. Keep it low (32-64) for ESP32.
n_layers: Number of transformer layers. More layers improve quality but slow down inference. 4-6 layers is recommended.
max_iters: Training iterations. Continue training while validation loss decreases.

Memory Budget Guide

For ESP32S3-N16R8 (8MB PSRAM):

~250K parameters: dim=128, n_layers=4, max_seq_len=32
~500K parameters: dim=192, n_layers=4, max_seq_len=48
~1M parameters: dim=256, n_layers=6, max_seq_len=32

Larger configurations will struggle with memory constraints, especially when the KV cache fills during generation.

Step 3: Train the Model

Place your dataset in the training directory and run:

python train/tiny_train.py

The training process:

Loads and splits data (90% train, 10% validation)
Builds a character-level tokenizer from your dataset
Initializes the GPT model architecture
Trains for the specified iterations
Saves checkpoints to out/ directory

During training:

Training loss should decrease consistently
Validation loss should track training loss (divergence indicates overfitting)
Best model is automatically saved to out/best_model.pt

Training time depends on dataset size, model configuration, and hardware, but it is usually short: about 10 minutes at most for a tinyGPT.c model on Colab with a typical GPU.

Step 4: Export the Model

After training completes, export the model to tinyGPT.c format:

python train/export_tg.py output.bin checkpoint.pt

This creates output.bin containing your model weights in a format optimized for ESP32.

Arguments:

First argument: output filepath (where to save the .bin file)
Second argument: checkpoint filepath (your trained .pt file)

Example:

python train/export_tg.py espic-2.bin out/best_model.pt

Type-Grouped Export Format

Standard model formats store weights layer by layer. Type-grouped format stores all weights of the same type together:

All token embeddings
All positional embeddings
All layer norm weights for each layer
All attention weights for each layer
etc.

This layout enables zero-copy memory mapping on ESP32, allowing direct access to weights from flash memory without copying to RAM. This saves substantial memory and makes loading nearly instant.
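Zero-copy loading works because every weight group sits at a fixed, computable offset, so the loader only has to point into the mapped file rather than copy it. The sketch below makes the following assumptions:

- Offsets are a running sum of float32 element counts, starting after the 256-byte header.
- An in-memory buffer stands in for the mapped file.
- The group list is illustrative (for example, token embeddings would be vocab_size * dim floats); the real order is whatever export_tg.py writes.

`group_offsets` and `group_ptr` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define HEADER_BYTES 256

/* Hypothetical helper: byte offset of each weight group, with groups stored
   back to back as float32 arrays right after the header. */
void group_offsets(const size_t *counts, size_t n_groups, size_t *offsets) {
    size_t off = HEADER_BYTES;
    for (size_t i = 0; i < n_groups; i++) {
        offsets[i] = off;
        off += counts[i] * sizeof(float);
    }
}

/* Hypothetical helper: zero-copy access, a pointer into the mapped image with
   no memcpy. Assumes base is float-aligned (256 and float sizes keep it so). */
const float *group_ptr(const unsigned char *base, size_t offset) {
    return (const float *)(base + offset);
}
```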

Step 5: Export the Tokenizer

Export the tokenizer vocabulary:

python train/export_tokenizer.py

This creates tokenizer.bin containing your character-level vocabulary. The tokenizer is built from your training data and includes exactly the characters your model was trained on.

Step 6: Flash to ESP32

You now have two files:

output.bin (your trained model, typically ≈4-8MB)
tokenizer.bin (your tokenizer vocabulary, typically ≈5-50KB)

To deploy:

Rename output.bin to match your firmware's expected path (e.g., espic-2.bin)
Pack both files into a SPIFFS image (for example with ESP-IDF's spiffsgen.py) and flash it to your ESP32's SPIFFS partition using esptool.py or your preferred method
Flash your tinyGPT.c firmware

See the UART Inference section for detailed flashing instructions.

Training Files Reference

All training scripts are located in the train/ folder:

train/tgptc_train.ipynb -> Complete training notebook
train/tiny_train.py -> Main training script with TrainConfig
train/model_gpt.py -> GPT model architecture implementation
train/export_tg.py -> Exports trained models to .bin format
train/export_tokenizer.py -> Exports character-level tokenizer

Troubleshooting

Loss not decreasing:

Increase learning rate (try 1e-2)
Verify data quality and formatting
Reduce model size (smaller models learn faster)

Out of memory during training:

Reduce batch_size (try 16 or 8)
Reduce max_seq_len
Train on CPU if GPU memory is insufficient (will be slower)

Validation loss increasing (overfitting):

Add dropout (try 0.1)
Reduce max_iters
Increase training data
Use a smaller model

Model produces nonsensical output:

Train longer (model may not have converged)
Verify tokenizer export matches training vocabulary
Check exported model file size is reasonable

Testing Before Deployment

You can test your model before flashing to ESP32. Load the checkpoint and run inference in Python:

import torch
from model_gpt import GPT, GPTModelArgs

# Load checkpoint
ckpt = torch.load("out/best_model.pt", map_location="cpu", weights_only=False)  # checkpoint holds a config object, not only tensors
config = ckpt['config']
stoi = ckpt['stoi']
itos = {i: ch for ch, i in stoi.items()}

# Build model
model_args = GPTModelArgs(
    dim=config.dim,
    n_layers=config.n_layers,
    n_heads=config.n_head,
    vocab_size=config.vocab_size,
    max_seq_len=config.max_seq_len
)
model = GPT(model_args)
model.load_state_dict(ckpt['model_state_dict'])
model.eval()

# Generate text
prompt = "Hello"
context = torch.tensor([stoi[c] for c in prompt], dtype=torch.long).unsqueeze(0)
generated = model.generate(context, max_new_tokens=100, temperature=0.8, top_k=40)

# Decode output
output = ''.join([itos[int(i)] for i in generated[0]])
print(output)

This allows faster iteration without repeatedly flashing the ESP32.

Example: Espic-2 Training Configuration

Espic-2 (the current included example model) was trained with:

Dataset: ~80MB of conversational text (70MB train, ~8MB validation after 90/10 split)
Configuration: dim=128, n_layers=4, n_heads=4, max_seq_len=32
Training: 8000 iterations, learning_rate=5e-3, batch_size=32
Result: inference speed ~30-40 tokens/sec on ESP32S3
Vocabulary: 4661 unique characters

Your results will vary based on dataset quality and configuration choices.

Important

Espic-2's performance is currently limited. The initial release prioritized getting the product out and demonstrating the concept. Future updates will focus on architectural improvements and training models with significantly better conversational capabilities.

UART Inference

If you are building a P1, or just want a microcontroller AI hub for any other application, follow this guide to flash the default UART Inference firmware, which turns your ESP32S3-N16R8 into a serial AI hub.

Note

You need to have ESP-IDF toolchain installed for this.

Navigate to `uartinference` folder

cd uartinference

Build...

idf.py build

...and flash!

idf.py -p /dev/yourport flash

Steps to reproduce

mindmap
  root((tinyGPT.c))
    Model
      Obtain dataset
      Train model
      Convert to tinyGPT.c
    Firmware
      Build firmware
      Flash it

Obtain a model or train your own, then build and flash firmware: either the one in this repository, which turns your microcontroller into a UART AI hub, or your own.

Future Plans

Train Espic-3, a more capable model than Espic-2.
Improve optimization.

Contact

You can contact me using yusuf@tachion.tech

Support

You can support me using:
