pocketive/tinygpt.c

tinyGPT.c

Keep your chats in your pockets.

tinyGPT.c is a tiny version of GPT.c, designed specifically for microcontrollers and tested on ESP32S3-N16R8.

Features

Prompt/Output: Send a prompt and get a response quickly and easily.

Tuning: Tune the GPT.c-compatible model you use to get the most out of it.

Serial inference: tinyGPT.c turns your preferred microcontroller into a serial AI hub. You can connect the board running tinyGPT.c to almost anything and run inference on your AI models locally over UART.

  • Notes
    • tinyGPT.c was originally developed for Pocketive P1, the world's first pocket-sized local AI device powered by microcontrollers.
    • GPT.c, tinyGPT.c and Pocketive P1 are developed for educational purposes only, so there may be weak points in the design, the code, and elsewhere. I don't claim that they are perfect: the project is still under development.
  • Why?
    • Modern AI chatbots require sending your conversations to remote servers, raising privacy concerns about who has access to your data. Additionally, the massive computing infrastructure needed to run large language models has significant environmental costs. tinyGPT.c explores an alternative approach: running small language models directly on low-power microcontrollers. This enables private, offline AI inference without relying on cloud services or consuming substantial energy. While these models are currently limited compared to their larger counterparts, they demonstrate the potential for edge AI that keeps your data local and your conversations truly private.

Contents

Some of the key files in this repository

  • gpt.c -> Main tinyGPT.c library (core implementation)
  • gpt.h -> Main tinyGPT.c library (header with data structures and function declarations)
  • main.c -> Main UART Inference firmware script, which uses the tinyGPT.c library to turn the uploaded microcontroller into a serial/UART AI hub.
  • partitions.csv -> Partitioning scheme made for UART Inference firmware.
  • espic2.bin -> Espic-2's pretrained weights.
  • tokenizer.bin -> Tokenizer for Espic-2.

Quickstart

This guide will help you integrate the tinyGPT.c library into your ESP32S3 project for running GPT.c-formatted models locally on your microcontroller.

Prerequisites

  • ESP-IDF toolchain installed (Installation Guide)
  • ESP32S3 with at least 8MB PSRAM (tested on the N16R8 variant, which I recommend)
  • A GPT.c-compatible model file (.bin format)
  • Corresponding tokenizer file (tokenizer.bin)

Library Overview

The tinyGPT.c library consists of two main files:

  • gpt.c - Core implementation with transformer architecture, tokenization, and sampling
  • gpt.h - Header file with data structures and function declarations

Basic Integration

1. Add Library Files to Your Project

Copy gpt.c and gpt.h to your ESP-IDF project's main/ directory or component folder.

2. Initialize the Storage System

Before using the library, initialize SPIFFS to access your model files:

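#include "esp_err.h"
#include "esp_spiffs.h"

void init_storage(void) {
    esp_vfs_spiffs_conf_t conf = {
        .base_path = "/data",             // Mount point used in file paths
        .partition_label = NULL,          // NULL = first SPIFFS partition found
        .max_files = 5,
        .format_if_mount_failed = false   // Never erase model files on a failed mount
    };
    // Abort early if the partition cannot be mounted
    ESP_ERROR_CHECK(esp_vfs_spiffs_register(&conf));
}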

3. Build the Transformer

Load your model checkpoint and build the transformer:

#include "gpt.h"

Transformer transformer;
build_transformer(&transformer, "/data/your-model.bin");

The build_transformer() function:

  • Reads the model checkpoint file
  • Allocates memory for the run state
  • Initializes FreeRTOS tasks for parallel computation
  • Sets up synchronization primitives

4. Initialize the Tokenizer

Build the tokenizer with your vocabulary:

Tokenizer tokenizer;
build_tokenizer(&tokenizer, "/data/tokenizer.bin", transformer.config.vocab_size);

5. Create a Sampler

Configure the sampling parameters for text generation:

Sampler sampler;
float temperature = 0.8f;  // Controls randomness (0.0 = deterministic, 1.0 = more random)
float topp = 0.9f;         // Top-p (nucleus) sampling threshold
unsigned long long rng_seed = (unsigned int)time(NULL);

build_sampler(&sampler, transformer.config.vocab_size, temperature, topp, rng_seed);

6. Generate Text

Generate text from a prompt:

int steps = 100;  // Maximum number of tokens to generate
char *prompt = "Once upon a time";

generate(&transformer, &tokenizer, &sampler, prompt, steps, NULL);

With a callback for completion stats:

void on_complete(float tokens_per_sec) {
    printf("\nGeneration speed: %.2f tokens/sec\n", tokens_per_sec);
}

generate(&transformer, &tokenizer, &sampler, prompt, steps, on_complete);

7. Cleanup

Free allocated resources when done:

free_sampler(&sampler);
free_tokenizer(&tokenizer);
free_transformer(&transformer);

API Reference

Core Data Structures

Config

Holds model configuration parameters:

  • dim - Model dimension
  • hidden_dim - Hidden layer dimension
  • n_layers - Number of transformer layers
  • n_heads - Number of attention heads
  • vocab_size - Vocabulary size
  • seq_len - Maximum sequence length

Transformer

Main model structure containing configuration, weights, and state.

Tokenizer

Handles text encoding/decoding with vocabulary management.

Sampler

Controls text generation sampling strategy.

Key Functions

void build_transformer(Transformer *t, char* checkpoint_path)

Loads model from file and initializes the transformer.

Parameters:

  • t - Pointer to Transformer structure
  • checkpoint_path - Path to model checkpoint file

void build_tokenizer(Tokenizer* t, char* tokenizer_path, int vocab_size)

Initializes tokenizer from vocabulary file.

Parameters:

  • t - Pointer to Tokenizer structure
  • tokenizer_path - Path to tokenizer file
  • vocab_size - Size of vocabulary

void build_sampler(Sampler* sampler, int vocab_size, float temperature, float topp, unsigned long long rng_seed)

Creates sampler with specified parameters.

Parameters:

  • sampler - Pointer to Sampler structure
  • vocab_size - Size of vocabulary
  • temperature - Sampling temperature (0.0-1.0+)
  • topp - Top-p sampling threshold (0.0-1.0)
  • rng_seed - Random seed for reproducibility

void generate(Transformer *transformer, Tokenizer *tokenizer, Sampler *sampler, char *prompt, int steps, generated_complete_cb cb_done)

Generates text from a prompt.

Parameters:

  • transformer - Pointer to initialized Transformer
  • tokenizer - Pointer to initialized Tokenizer
  • sampler - Pointer to initialized Sampler
  • prompt - Input text prompt (can be NULL for unconditional generation)
  • steps - Maximum tokens to generate
  • cb_done - Optional callback function called on completion

float* forward(Transformer* transformer, int token, int pos)

Performs a forward pass through the transformer.

Parameters:

  • transformer - Pointer to Transformer
  • token - Input token ID
  • pos - Position in sequence

Returns: Pointer to logits array

void encode(Tokenizer* t, char *text, int8_t bos, int8_t eos, int *tokens, int *n_tokens)

Encodes text into token IDs.

Parameters:

  • t - Pointer to Tokenizer
  • text - Input text to encode
  • bos - Add beginning-of-sequence token (0 or 1)
  • eos - Add end-of-sequence token (0 or 1)
  • tokens - Output array for token IDs
  • n_tokens - Pointer to store number of tokens

char* decode(Tokenizer* t, int prev_token, int token)

Decodes a token ID to text.

Parameters:

  • t - Pointer to Tokenizer
  • prev_token - Previous token ID (for context)
  • token - Token ID to decode

Returns: Decoded text string

Model File Format

tinyGPT.c uses a binary checkpoint format compatible with GPT.c:

Header (256 bytes):

  • Magic number: 0x616b3432
  • Version: 3
  • Configuration parameters (dim, hidden_dim, n_layers, etc.)

Weights: Following the header, model weights are stored as float arrays in this order:

  1. Token embeddings
  2. Positional embeddings
  3. Layer normalization weights/biases
  4. Attention projection weights (Q, K, V, O)
  5. Feed-forward network weights
  6. Final layer normalization
  7. Classification head (optional, can share token embeddings)
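
As a rough illustration of the header described above, the sketch below checks the magic number and version at the start of a checkpoint buffer. It assumes the magic and version are the first two 32-bit little-endian integers of the 256-byte header; that field order and the names TG_HEADER_SIZE, read_le32 and tg_check_header are hypothetical and not taken from gpt.c.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TG_HEADER_SIZE 256          /* header size stated in this README */
#define TG_MAGIC       0x616b3432u  /* magic number stated in this README */
#define TG_VERSION     3u           /* version stated in this README */

/* Read a 32-bit little-endian value from a byte buffer. */
static uint32_t read_le32(const uint8_t *p) {
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Returns 0 if the buffer starts with a plausible header, negative otherwise.
 * ASSUMPTION: magic is at byte offset 0 and version at byte offset 4. */
int tg_check_header(const uint8_t *buf, size_t len) {
    assert(buf != NULL);
    if (len < TG_HEADER_SIZE) return -1;                /* file too short */
    if (read_le32(buf) != TG_MAGIC) return -2;          /* wrong magic */
    if (read_le32(buf + 4) != TG_VERSION) return -3;    /* wrong version */
    return 0;
}
```

A check like this can run on a PC before flashing, so that a file in the wrong format is caught before it reaches the device.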

Performance Tuning

Temperature

  • 0.0: Deterministic (always picks most likely token)
  • 0.5-0.8: Balanced creativity and coherence (recommended)
  • 1.0+: More random and creative
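
The sketch below shows one common way temperature is applied. It illustrates the general technique and is not a copy of gpt.c: at 0.0 the most likely token is taken directly, otherwise each logit is divided by the temperature before softmax. The function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Greedy choice: index of the largest logit (what temperature 0.0 means). */
int argmax_logit(const float *logits, int n) {
    assert(logits != NULL && n > 0);
    int best = 0;
    for (int i = 1; i < n; i++) {
        if (logits[i] > logits[best]) best = i;
    }
    return best;
}

/* Scale logits in place by 1/temperature; softmax would be applied afterwards.
 * Temperatures below 1 widen the gaps between logits (sharper distribution),
 * temperatures above 1 shrink them (flatter, more random distribution). */
void scale_logits(float *logits, int n, float temperature) {
    assert(logits != NULL && n > 0 && temperature > 0.0f);
    for (int i = 0; i < n; i++) {
        logits[i] /= temperature;
    }
}
```

Scaling never changes which token is most likely; it only changes how strongly the sampler prefers it over the alternatives.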

Top-p (Nucleus Sampling)

  • 0.9: Good default; samples only from the most likely tokens that together make up 90% of the probability mass
  • 1.0: No filtering (equivalent to temperature-only sampling)
  • <0.9: More conservative, fewer token choices
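
To make the top-p idea concrete, here is a minimal sketch of nucleus sampling over an array of probabilities. It shows a common way to implement the technique and is not necessarily how gpt.c does it; ProbIndex, cmp_prob_desc and sample_topp are hypothetical names, and the caller is assumed to supply a uniform random number in [0, 1).

```c
#include <assert.h>
#include <stdlib.h>

typedef struct {
    float prob;
    int index;
} ProbIndex;

/* qsort comparator: highest probability first. */
static int cmp_prob_desc(const void *a, const void *b) {
    float pa = ((const ProbIndex *)a)->prob;
    float pb = ((const ProbIndex *)b)->prob;
    return (pa < pb) - (pa > pb);
}

/* Sample a token index from probs[0..n) using top-p (nucleus) sampling.
 * coin is a uniform random number in [0, 1); scratch must hold n entries. */
int sample_topp(const float *probs, int n, float topp, float coin,
                ProbIndex *scratch) {
    assert(probs != NULL && scratch != NULL && n > 0);
    for (int i = 0; i < n; i++) {
        scratch[i].prob = probs[i];
        scratch[i].index = i;
    }
    qsort(scratch, (size_t)n, sizeof(ProbIndex), cmp_prob_desc);

    /* Find the smallest prefix whose cumulative probability reaches topp. */
    float cumulative = 0.0f;
    int last = n - 1;
    for (int i = 0; i < n; i++) {
        cumulative += scratch[i].prob;
        if (cumulative >= topp) { last = i; break; }
    }

    /* Sample inside the truncated set, rescaling coin to its total mass. */
    float target = coin * cumulative;
    float cdf = 0.0f;
    for (int i = 0; i <= last; i++) {
        cdf += scratch[i].prob;
        if (target < cdf) return scratch[i].index;
    }
    return scratch[last].index;  /* guard against rounding error */
}
```

With topp at 1.0 the prefix covers every token, which matches the "no filtering" case listed above.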

Steps

Limit generation length to conserve memory and processing time. The model's seq_len parameter defines the maximum context window.

Memory Considerations

The library allocates significant memory for:

  • Model weights (read from file)
  • KV cache (stores attention keys/values for all positions)
  • Activation buffers
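
As a back-of-the-envelope aid, the sketch below estimates the KV cache size. It assumes one key vector and one value vector of width dim per layer and per position, stored as 32-bit floats; gpt.c may lay its cache out differently, and kv_cache_bytes is a hypothetical helper.

```c
#include <assert.h>
#include <stddef.h>

/* Bytes needed for a KV cache holding one key and one value vector of
 * width dim for every layer and every position.
 * ASSUMPTION: full-width (dim) float32 keys and values. */
size_t kv_cache_bytes(int dim, int n_layers, int seq_len) {
    assert(dim > 0 && n_layers > 0 && seq_len > 0);
    return 2u * (size_t)n_layers * (size_t)seq_len * (size_t)dim * sizeof(float);
}
```

Under these assumptions, dim=128, n_layers=4 and seq_len=32 need 131072 bytes (128 KB), and the figure grows linearly with each of the three parameters.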

For ESP32S3-N16R8 (16MB Flash, 8MB PSRAM), suitable models are typically:

  • At most a few million parameters (weights are stored as 4-byte floats, so 16MB of flash holds fewer than ~4M)
  • Dimension ≤ 512
  • 6-8 layers

Advanced Usage: Custom UART Protocol

The example main.c demonstrates implementing a UART-based AI hub. Key features:

Output Redirection

Override _write() to redirect printf output to UART during generation:

int _write(int fd, const void *buf, size_t count) {
    if ((fd == 1 || fd == 2) && output_to_uart) {
        uart_write_bytes(UART_NUM, buf, count);
        return count;
    }
    return count;
}

Protocol Format

The example uses a simple protocol:

  • Input: Text prompt ending with newline
  • Output: Generated tokens followed by <STATS>X.XX</STATS><END>\n
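
On the host side, the trailer can be separated from the generated text with a few string calls. The sketch below is one possible parser, assuming the number inside <STATS> is the tokens-per-second figure reported by the completion callback; parse_stats_trailer is a hypothetical name and is not part of main.c.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Look for "<STATS>value</STATS><END>" in a received line.
 * On success stores the value in *tokens_per_sec and returns the number of
 * characters of generated text that precede the trailer; returns -1 if no
 * well-formed trailer is found. */
int parse_stats_trailer(const char *line, float *tokens_per_sec) {
    assert(line != NULL && tokens_per_sec != NULL);
    const char *open = strstr(line, "<STATS>");
    if (open == NULL) return -1;

    const char *num = open + strlen("<STATS>");
    char *end = NULL;
    float value = strtof(num, &end);
    if (end == num) return -1;  /* no number after <STATS> */

    if (strncmp(end, "</STATS><END>", strlen("</STATS><END>")) != 0) return -1;

    *tokens_per_sec = value;
    return (int)(open - line);
}
```

The return value doubles as the length of the generated text, so the caller can print or store that prefix and discard the trailer.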

Reading Prompts

int uart_read_line(char *buffer, int max_len) {
    // Read bytes until newline
    // Returns length of line read
}
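
The stub above leaves the body out. A portable sketch of the same idea follows, written against a hypothetical byte-source callback so it can be tested on a PC; on the device the callback would wrap the UART driver. The names read_byte_fn, read_line and string_reader are illustrative and not taken from main.c.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical byte source: returns the next byte (0-255), or -1 when no
 * more data is available. */
typedef int (*read_byte_fn)(void *ctx);

/* Read bytes into buffer until '\n' or until max_len - 1 bytes are stored.
 * The newline is consumed but not stored, and carriage returns are dropped.
 * Returns the length of the line; buffer is always NUL-terminated. */
int read_line(read_byte_fn read_byte, void *ctx, char *buffer, int max_len) {
    assert(read_byte != NULL && buffer != NULL && max_len > 0);
    int len = 0;
    while (len < max_len - 1) {
        int c = read_byte(ctx);
        if (c < 0 || c == '\n') break;
        if (c == '\r') continue;
        buffer[len++] = (char)c;
    }
    buffer[len] = '\0';
    return len;
}

/* Example byte source backed by a C string, handy for host-side tests.
 * ctx points to a cursor (const char *) that is advanced on every read. */
int string_reader(void *ctx) {
    const char **cursor = (const char **)ctx;
    if (**cursor == '\0') return -1;
    return (unsigned char)*(*cursor)++;
}
```

Capping the line at max_len - 1 bytes keeps an overlong prompt from overflowing the buffer; the remainder is simply left for the next call.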

Troubleshooting

Out of Memory

  • Use a smaller model (lower dim or fewer layers)
  • Decrease seq_len in model configuration
  • Check PSRAM allocation in menuconfig

Slow Generation

  • Model may be too large for your hardware
  • Check CPU frequency settings
  • Verify PSRAM speed configuration

Invalid Output

  • Verify model and tokenizer files match
  • Check temperature/topp parameters
  • Ensure sufficient steps for meaningful output

Complete Minimal Example

#include "gpt.h"
#include "esp_spiffs.h"
#include <time.h>

void app_main(void) {
    // Initialize storage
    esp_vfs_spiffs_conf_t conf = {
        .base_path = "/data",
        .partition_label = NULL,
        .max_files = 5,
        .format_if_mount_failed = false
    };
    esp_vfs_spiffs_register(&conf);
    
    // Build transformer
    Transformer transformer;
    build_transformer(&transformer, "/data/model.bin");
    
    // Build tokenizer
    Tokenizer tokenizer;
    build_tokenizer(&tokenizer, "/data/tokenizer.bin", 
                    transformer.config.vocab_size);
    
    // Build sampler
    Sampler sampler;
    build_sampler(&sampler, transformer.config.vocab_size, 
                  0.8f, 0.9f, (unsigned int)time(NULL));
    
    // Generate text
    char *prompt = "Hello, I am";
    generate(&transformer, &tokenizer, &sampler, prompt, 50, NULL);
    
    // Cleanup
    free_sampler(&sampler);
    free_tokenizer(&tokenizer);
    free_transformer(&transformer);
}

Training tinyGPT.c-Compatible Models

Want to train your own models for tinyGPT.c? This guide covers the complete process, from dataset preparation to exporting a model ready for your ESP32.

Open In Colab

Note: tinyGPT.c uses a type-grouped export format optimized for ESP32's memory architecture. This format enables zero-copy memory mapping, making model loading faster and more memory-efficient on resource-constrained devices.

Quick Start with Google Colab

The easiest way to get started is with our training notebook: train/tgptc_train.ipynb

This notebook handles the entire process step by step. Click the "Open in Colab" button above and you can start training immediately. All you need is your dataset. You can upload it to Colab and run every cell.

Training Pipeline Overview

Dataset Preparation → Model Training → Export Model → Export Tokenizer → Flash to ESP32

Step 1: Prepare Your Dataset

Using DataSeek (optional but recommended)

DataSeek makes dataset preparation significantly easier. It's powered by DeepSeekr, so you can generate training datasets automatically and without any API costs.

DeepSeekr is a Selenium automation tool that generates conversations from DeepSeek's web interface.

The workflow is straightforward:

  1. Give DataSeek a prompt describing your desired dataset subject (e.g., "conversational AI about programming")
  2. DataSeek uses DeepSeekr to automatically generate conversations on that topic
  3. You get a clean, formatted dataset file ready for training

This eliminates manual data collection, web scraping, and API costs while generating high-quality training data.

Manual Dataset Preparation

If you're preparing data manually, save it as a plain text file named dataset.txt (UTF-8 encoding).

Dataset tips:

  • Larger datasets generally produce better results (aim for ≈70MB minimum, more is better)
  • Clean your data to avoid encoding issues that can affect training
  • For conversational models, format text as dialogue
  • The training script builds a character-level tokenizer from your data
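
Because the tokenizer is character-level, the vocabulary size equals the number of distinct characters in dataset.txt. The sketch below counts distinct Unicode code points in a UTF-8 string, which can help spot stray characters before training. It is an illustrative helper that treats malformed bytes as U+FFFD; it is not the training script's own logic, and the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Decode one UTF-8 code point starting at s and store its byte length in
 * *len. Malformed or truncated sequences are consumed one byte at a time
 * and reported as U+FFFD. */
static uint32_t utf8_next(const unsigned char *s, int *len) {
    int n = (s[0] < 0x80) ? 1 : ((s[0] & 0xE0) == 0xC0) ? 2 :
            ((s[0] & 0xF0) == 0xE0) ? 3 : ((s[0] & 0xF8) == 0xF0) ? 4 : 0;
    uint32_t cp = (n == 1) ? s[0] : (uint32_t)(s[0] & (0xFF >> (n + 1)));
    for (int i = 1; i < n; i++) {
        if ((s[i] & 0xC0) != 0x80) { n = 0; break; }  /* also stops at NUL */
        cp = (cp << 6) | (uint32_t)(s[i] & 0x3F);
    }
    if (n == 0) { *len = 1; return 0xFFFD; }
    *len = n;
    return cp;
}

/* Count distinct code points in a NUL-terminated UTF-8 string.
 * Returns -1 if the bookkeeping bitmap cannot be allocated. */
int count_unique_chars(const char *text) {
    assert(text != NULL);
    unsigned char *seen = calloc(0x110000 / 8, 1);  /* one bit per code point */
    if (seen == NULL) return -1;

    int count = 0;
    const unsigned char *p = (const unsigned char *)text;
    while (*p != '\0') {
        int len;
        uint32_t cp = utf8_next(p, &len);
        p += len;
        if (cp > 0x10FFFF) continue;  /* outside the Unicode range */
        if (!(seen[cp >> 3] & (1u << (cp & 7)))) {
            seen[cp >> 3] |= (unsigned char)(1u << (cp & 7));
            count++;
        }
    }
    free(seen);
    return count;
}
```

An unexpectedly large count usually points to encoding debris in the dataset, which inflates the vocabulary and the embedding table with it.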

Step 2: Configure Training Parameters

Open train/tiny_train.py and adjust the TrainConfig class:

class TrainConfig:
    # model config
    dim = 128         # Embedding dimension
    n_layers = 4      # Number of transformer layers
    n_head = 4        # Number of attention heads
    max_seq_len = 32  # Context window (keep low for ESP32)
    
    # training config
    batch_size = 32
    learning_rate = 5e-3
    max_iters = 8000
    eval_interval = 500

Key parameters:

  • dim: Model dimension. Higher values increase capacity but require more memory. 128-192 works well for ESP32S3.
  • max_seq_len: Maximum context length. This is critical for memory usage. Keep it low (32-64) for ESP32.
  • n_layers: Number of transformer layers. More layers improve quality but slow down inference. 4-6 layers is recommended.
  • max_iters: Training iterations. Continue training while validation loss decreases.

Memory Budget Guide

For ESP32S3-N16R8 (8MB PSRAM):

  • ~250K parameters: dim=128, n_layers=4, max_seq_len=32
  • ~500K parameters: dim=192, n_layers=4, max_seq_len=48
  • ~1M parameters: dim=256, n_layers=6, max_seq_len=32

Larger configurations will struggle with memory constraints, especially when the KV cache fills during generation.

Step 3: Train the Model

Place your dataset in the training directory and run:

python train/tiny_train.py

The training process:

  1. Loads and splits data (90% train, 10% validation)
  2. Builds a character-level tokenizer from your dataset
  3. Initializes the GPT model architecture
  4. Trains for the specified iterations
  5. Saves checkpoints to out/ directory

During training:

  • Training loss should decrease consistently
  • Validation loss should track training loss (divergence indicates overfitting)
  • Best model is automatically saved to out/best_model.pt

Training time varies with dataset size, model configuration, and hardware, but it should be short: at most about 10 minutes for a tinyGPT.c model on Colab with a typical GPU.

Step 4: Export the Model

After training completes, export the model to tinyGPT.c format:

python train/export_tg.py output.bin checkpoint.pt

This creates output.bin containing your model weights in a format optimized for ESP32.

Arguments:

  • First argument: output filepath (where to save the .bin file)
  • Second argument: checkpoint filepath (your trained .pt file)

Example:

python train/export_tg.py espic-2.bin out/best_model.pt

Type-Grouped Export Format

Standard model formats store weights layer by layer. Type-grouped format stores all weights of the same type together:

  • All token embeddings
  • All positional embeddings
  • All layer norm weights for each layer
  • All attention weights for each layer
  • etc.

This layout enables zero-copy memory mapping on ESP32, allowing direct access to weights from flash memory without copying to RAM. This saves substantial memory and makes loading nearly instant.
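
The difference between the two layouts comes down to how a tensor's offset is computed. The sketch below compares both for a simplified model in which every layer has the same list of tensor types, each with a fixed size in floats. It ignores the embeddings and does not reproduce the actual file layout; both function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Offset (in floats) of tensor type t for layer l when weights are stored
 * layer by layer: [L0: t0 t1 ...][L1: t0 t1 ...]... */
size_t offset_layer_major(const size_t *sizes, int n_types, int l, int t) {
    assert(sizes != NULL && t >= 0 && t < n_types && l >= 0);
    size_t per_layer = 0, before_t = 0;
    for (int i = 0; i < n_types; i++) {
        if (i < t) before_t += sizes[i];
        per_layer += sizes[i];
    }
    return (size_t)l * per_layer + before_t;
}

/* Same lookup for a type-grouped layout: [t0: L0 L1 ...][t1: L0 L1 ...]... */
size_t offset_type_grouped(const size_t *sizes, int n_layers, int l, int t) {
    assert(sizes != NULL && t >= 0 && l >= 0 && l < n_layers);
    size_t before_t = 0;
    for (int i = 0; i < t; i++) {
        before_t += (size_t)n_layers * sizes[i];
    }
    return before_t + (size_t)l * sizes[t];
}
```

In the type-grouped case all layers' tensors of one type form a single contiguous array, so one base pointer into the mapped file plus l * sizes[t] reaches any layer.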

Step 5: Export the Tokenizer

Export the tokenizer vocabulary:

python train/export_tokenizer.py

This creates tokenizer.bin containing your character-level vocabulary. The tokenizer is built from your training data and includes exactly the characters your model was trained on.

Step 6: Flash to ESP32

You now have two files:

  • output.bin (your trained model, typically ≈4-8MB)
  • tokenizer.bin (your tokenizer vocabulary, typically ≈5-50KB)

To deploy:

  1. Rename output.bin to match your firmware's expected path (e.g., espic-2.bin)
  2. Flash both files to your ESP32's SPIFFS partition using esptool.py or your preferred method
  3. Flash your tinyGPT.c firmware

See the UART Inference section for detailed flashing instructions.

Training Files Reference

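All training scripts are located in the train/ folder. The ones used in this guide are:

  • tgptc_train.ipynb - Colab training notebook
  • tiny_train.py - Training script (contains the TrainConfig class)
  • export_tg.py - Exports a trained checkpoint to the tinyGPT.c .bin format
  • export_tokenizer.py - Exports the tokenizer vocabulary to tokenizer.bin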

Troubleshooting

Loss not decreasing:

  • Increase learning rate (try 1e-2)
  • Verify data quality and formatting
  • Reduce model size (smaller models learn faster)

Out of memory during training:

  • Reduce batch_size (try 16 or 8)
  • Reduce max_seq_len
  • Train on CPU if GPU memory is insufficient (will be slower)

Validation loss increasing (overfitting):

  • Add dropout (try 0.1)
  • Reduce max_iters
  • Increase training data
  • Use a smaller model

Model produces nonsensical output:

  • Train longer (model may not have converged)
  • Verify tokenizer export matches training vocabulary
  • Check exported model file size is reasonable

Testing Before Deployment

You can test your model before flashing to ESP32. Load the checkpoint and run inference in Python:

import torch
from model_gpt import GPT, GPTModelArgs

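# Load checkpoint (weights_only=False because the checkpoint stores a config
# object and vocabulary alongside the tensors; only load files you trust)
ckpt = torch.load("out/best_model.pt", map_location="cpu", weights_only=False)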
config = ckpt['config']
stoi = ckpt['stoi']
itos = {i: ch for ch, i in stoi.items()}

# Build model
model_args = GPTModelArgs(
    dim=config.dim,
    n_layers=config.n_layers,
    n_heads=config.n_head,
    vocab_size=config.vocab_size,
    max_seq_len=config.max_seq_len
)
model = GPT(model_args)
model.load_state_dict(ckpt['model_state_dict'])
model.eval()

# Generate text
prompt = "Hello"
context = torch.tensor([stoi[c] for c in prompt], dtype=torch.long).unsqueeze(0)
generated = model.generate(context, max_new_tokens=100, temperature=0.8, top_k=40)

# Decode output
output = ''.join([itos[int(i)] for i in generated[0]])
print(output)

This allows faster iteration without repeatedly flashing the ESP32.

Example: Espic-2 Training Configuration

Espic-2 (the current included example model) was trained with:

  • Dataset: ~80MB of conversational text (70MB train, ~8MB validation after 90/10 split)
  • Configuration: dim=128, n_layers=4, n_heads=4, max_seq_len=32
  • Training: 8000 iterations, learning_rate=5e-3, batch_size=32
  • Result: inference speed ~30-40 tokens/sec on ESP32S3
  • Vocabulary: 4661 unique characters

Your results will vary based on dataset quality and configuration choices.

Important

Espic-2's performance is currently limited. The initial release prioritized getting the product out and demonstrating the concept. Future updates will focus on architectural improvements and training models with significantly better conversational capabilities.

UART Inference

Follow this guide to flash the default UART Inference firmware, which turns your ESP32S3-N16R8 into a serial AI hub, whether you are building a P1 or want a microcontroller AI hub for any other application.

Note

You need to have the ESP-IDF toolchain installed for this.

Navigate to the uartinference folder

cd uartinference

Build...

idf.py build

...and flash!

idf.py -p /dev/yourport flash

Steps to reproduce

mindmap
  root((tinyGPT.c))
    Model
      Obtain dataset
      Train model
      Convert to tinyGPT.c
    Firmware
      Build firmware
      Flash it

You need to obtain a model or train your own, then flash a firmware: either the one in this repository, which turns your microcontroller into a UART AI hub, or one you write yourself.

Future Plans

  • Train Espic-3, a better and more capable model than Espic-2.
  • Improve optimization.

Contact

You can contact me at yusuf@tachion.tech

Support

You can support me using: "Buy Me A Coffee"

About

Inference for GPT.c models in microcontrollers.
