tinyGPT.c is a tiny version of GPT.c, designed specifically for microcontrollers and tested on ESP32S3-N16R8.
- Features
- Contents
- Quickstart
- Training tinyGPT.c-Compatible Models
- UART Inference
- Future Plans
- Contact
- Support
Prompt/Output: Send a prompt and get a response quickly and easily.
Tuning: Tune the GPT.c-compatible model you use to get the most out of it.
Serial inference: tinyGPT.c turns your preferred microcontroller into a serial AI hub. You can connect a board running tinyGPT.c to almost anything and run inference with your AI models locally over UART.
- Notes
- tinyGPT.c was originally developed for Pocketive P1, the world's first pocket-sized local AI device powered by microcontrollers.
- GPT.c, tinyGPT.c and Pocketive P1 are developed for educational purposes only, so there may be weak points in the design, the code and elsewhere. I don't claim that it's perfect; the project is still under development.
- Why?
- Modern AI chatbots require sending your conversations to remote servers, raising privacy concerns about who has access to your data. Additionally, the massive computing infrastructure needed to run large language models has significant environmental costs. tinyGPT.c explores an alternative approach: running small language models directly on low-power microcontrollers. This enables private, offline AI inference without relying on cloud services or consuming substantial energy. While these models are currently limited compared to their larger counterparts, they demonstrate the potential for edge AI that keeps your data local and your conversations truly private.
- gpt.c -> Main tinyGPT.c library
- gpt.h -> Header file for the tinyGPT.c library
- main.c -> Main UART Inference firmware script, which uses the tinyGPT.c library to turn the uploaded microcontroller into a serial/UART AI hub.
- partitions.csv -> Partitioning scheme made for UART Inference firmware.
- espic2.bin -> Espic-2's pretrained weights.
- tokenizer.bin -> Tokenizer for Espic-2.
This guide will help you integrate the tinyGPT.c library into your ESP32S3 project for running GPT.c-formatted models locally on your microcontroller.
- ESP-IDF toolchain installed (Installation Guide)
- ESP32S3 with at least 8MB PSRAM (tested on the N16R8 variant, which I recommend)
- A GPT.c-compatible model file (.bin format)
- Corresponding tokenizer file (tokenizer.bin)
The tinyGPT.c library consists of two main files:
- gpt.c - Core implementation with transformer architecture, tokenization, and sampling
- gpt.h - Header file with data structures and function declarations
Copy gpt.c and gpt.h to your ESP-IDF project's main/ directory or component folder.
Before using the library, initialize SPIFFS to access your model files:
#include "esp_spiffs.h"
void init_storage(void) {
    esp_vfs_spiffs_conf_t conf = {
        .base_path = "/data",
        .partition_label = NULL,
        .max_files = 5,
        .format_if_mount_failed = false
    };
    esp_vfs_spiffs_register(&conf);
}
Load your model checkpoint and build the transformer:
#include "gpt.h"
Transformer transformer;
build_transformer(&transformer, "/data/your-model.bin");
The build_transformer() function:
- Reads the model checkpoint file
- Allocates memory for the run state
- Initializes FreeRTOS tasks for parallel computation
- Sets up synchronization primitives
Build the tokenizer with your vocabulary:
Tokenizer tokenizer;
build_tokenizer(&tokenizer, "/data/tokenizer.bin", transformer.config.vocab_size);
Configure the sampling parameters for text generation:
Sampler sampler;
float temperature = 0.8f; // Controls randomness (0.0 = deterministic, 1.0 = more random)
float topp = 0.9f; // Top-p (nucleus) sampling threshold
unsigned long long rng_seed = (unsigned int)time(NULL);
build_sampler(&sampler, transformer.config.vocab_size, temperature, topp, rng_seed);
Generate text from a prompt:
int steps = 100; // Maximum number of tokens to generate
char *prompt = "Once upon a time";
generate(&transformer, &tokenizer, &sampler, prompt, steps, NULL);
With a callback for completion stats:
void on_complete(float tokens_per_sec) {
    printf("\nGeneration speed: %.2f tokens/sec\n", tokens_per_sec);
}
generate(&transformer, &tokenizer, &sampler, prompt, steps, on_complete);
Free allocated resources when done:
free_sampler(&sampler);
free_tokenizer(&tokenizer);
free_transformer(&transformer);
Holds model configuration parameters:
- dim - Model dimension
- hidden_dim - Hidden layer dimension
- n_layers - Number of transformer layers
- n_heads - Number of attention heads
- vocab_size - Vocabulary size
- seq_len - Maximum sequence length
Main model structure containing configuration, weights, and state.
Handles text encoding/decoding with vocabulary management.
Controls text generation sampling strategy.
Loads model from file and initializes the transformer.
Parameters:
- t - Pointer to Transformer structure
- checkpoint_path - Path to model checkpoint file
Initializes tokenizer from vocabulary file.
Parameters:
- t - Pointer to Tokenizer structure
- tokenizer_path - Path to tokenizer file
- vocab_size - Size of vocabulary
void build_sampler(Sampler* sampler, int vocab_size, float temperature, float topp, unsigned long long rng_seed)
Creates sampler with specified parameters.
Parameters:
- sampler - Pointer to Sampler structure
- vocab_size - Size of vocabulary
- temperature - Sampling temperature (0.0-1.0+)
- topp - Top-p sampling threshold (0.0-1.0)
- rng_seed - Random seed for reproducibility
void generate(Transformer *transformer, Tokenizer *tokenizer, Sampler *sampler, char *prompt, int steps, generated_complete_cb cb_done)
Generates text from a prompt.
Parameters:
- transformer - Pointer to initialized Transformer
- tokenizer - Pointer to initialized Tokenizer
- sampler - Pointer to initialized Sampler
- prompt - Input text prompt (can be NULL for unconditional generation)
- steps - Maximum tokens to generate
- cb_done - Optional callback function called on completion
Performs a forward pass through the transformer.
Parameters:
- transformer - Pointer to Transformer
- token - Input token ID
- pos - Position in sequence
Returns: Pointer to logits array
Encodes text into token IDs.
Parameters:
- t - Pointer to Tokenizer
- text - Input text to encode
- bos - Add beginning-of-sequence token (0 or 1)
- eos - Add end-of-sequence token (0 or 1)
- tokens - Output array for token IDs
- n_tokens - Pointer to store number of tokens
Decodes a token ID to text.
Parameters:
- t - Pointer to Tokenizer
- prev_token - Previous token ID (for context)
- token - Token ID to decode
Returns: Decoded text string
tinyGPT.c uses a binary checkpoint format compatible with GPT.c:
Header (256 bytes):
- Magic number: 0x616b3432
- Version: 3
- Configuration parameters (dim, hidden_dim, n_layers, etc.)
Weights: Following the header, model weights are stored as float arrays in this order:
- Token embeddings
- Positional embeddings
- Layer normalization weights/biases
- Attention projection weights (Q, K, V, O)
- Feed-forward network weights
- Final layer normalization
- Classification head (optional, can share token embeddings)
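As a rough illustration of the header described above, the following Python sketch validates the magic number and version on the host before flashing. The 256-byte size, the magic number and the version come from the format description; the 32-bit little-endian encoding, the field order and the `hidden_dim=512` demo value are assumptions made for this example, so check train/export_tg.py and gpt.c for the real layout.

```python
import struct

HEADER_SIZE = 256     # from the format description
MAGIC = 0x616B3432    # from the format description
VERSION = 3           # from the format description

# ASSUMPTION: config values are 32-bit little-endian ints stored right after
# the magic number and version, in this (hypothetical) order.
CONFIG_FIELDS = ("dim", "hidden_dim", "n_layers", "n_heads", "vocab_size", "seq_len")


def parse_header(blob):
    """Validate the magic number and version, then return the config as a dict."""
    if len(blob) < HEADER_SIZE:
        raise ValueError("header is shorter than 256 bytes")
    magic, version = struct.unpack_from("<Ii", blob, 0)
    if magic != MAGIC:
        raise ValueError(f"bad magic number: {magic:#x}")
    if version != VERSION:
        raise ValueError(f"unsupported version: {version}")
    values = struct.unpack_from(f"<{len(CONFIG_FIELDS)}i", blob, 8)
    return dict(zip(CONFIG_FIELDS, values))


def make_header(**config):
    """Build a synthetic header for the demo below (not the real exporter)."""
    packed = struct.pack(f"<Ii{len(CONFIG_FIELDS)}i", MAGIC, VERSION,
                         *(config[name] for name in CONFIG_FIELDS))
    return packed.ljust(HEADER_SIZE, b"\x00")


# Demo values only; hidden_dim is made up for illustration.
demo = make_header(dim=128, hidden_dim=512, n_layers=4, n_heads=4,
                   vocab_size=4661, seq_len=32)
config = parse_header(demo)
print(config["dim"], config["n_layers"], config["seq_len"])
```

Running a check like this on an exported file before flashing can catch a wrong or truncated file early, without a round trip to the board.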
- 0.0: Deterministic (always picks most likely token)
- 0.5-0.8: Balanced creativity and coherence (recommended)
- 1.0+: More random and creative
- 0.9: Good default, considers top 90% probability mass
- 1.0: No filtering (equivalent to temperature-only sampling)
- <0.9: More conservative, fewer token choices
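To make the effect of these two parameters concrete, here is a small pure-Python sketch of temperature scaling followed by top-p (nucleus) filtering. It illustrates the general technique, not the exact code in gpt.c; the function names `softmax` and `sample_token` are hypothetical.

```python
import math
import random


def softmax(logits, temperature):
    """Turn logits into probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_token(logits, temperature, topp, rng):
    """Pick a token index using temperature and top-p sampling."""
    if temperature == 0.0:
        # Deterministic: always the most likely token.
        return max(range(len(logits)), key=lambda i: logits[i])
    probs = softmax(logits, temperature)
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    if 0.0 < topp < 1.0:
        # Keep the smallest set of top tokens whose probability mass reaches topp.
        kept, mass = [], 0.0
        for i in order:
            kept.append(i)
            mass += probs[i]
            if mass >= topp:
                break
    else:
        kept = order  # topp of 1.0 (or out of range): no filtering
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]


rng = random.Random(0)
print(sample_token([1.0, 3.0, 2.0], temperature=0.0, topp=0.9, rng=rng))  # 1 (argmax)
```

Lower `topp` values shrink the candidate set before the random draw, which is why they give more conservative output.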
Limit generation length to conserve memory and processing time. The model's seq_len parameter defines the maximum context window.
The library allocates significant memory for:
- Model weights (read from file)
- KV cache (stores attention keys/values for all positions)
- Activation buffers
For ESP32S3-N16R8 (16MB Flash, 8MB PSRAM), suitable models are typically:
- <50M parameters
- Dimension ≤ 512
- 6-8 layers
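A back-of-the-envelope estimate helps when judging whether a configuration fits. The sketch below assumes float32 values (4 bytes each) and a KV cache that stores one `dim`-sized key vector and one `dim`-sized value vector per layer and per position; both are assumptions about the implementation, and activation buffers are not counted. The three configurations are the ones listed in the training guide.

```python
BYTES_PER_FLOAT = 4  # ASSUMPTION: float32 weights and cache entries


def kv_cache_bytes(dim, n_layers, seq_len):
    """Keys plus values: two dim-sized vectors per layer per position."""
    return 2 * n_layers * seq_len * dim * BYTES_PER_FLOAT


def weight_bytes(n_params):
    """Storage for the weights alone, ignoring the file header."""
    return n_params * BYTES_PER_FLOAT


for dim, n_layers, seq_len in [(128, 4, 32), (192, 4, 48), (256, 6, 32)]:
    kib = kv_cache_bytes(dim, n_layers, seq_len) / 1024
    print(f"dim={dim} n_layers={n_layers} seq_len={seq_len}: KV cache = {kib:.0f} KiB")
```

Under these assumptions the KV cache grows linearly with `seq_len`, `dim` and `n_layers`, so halving the context window halves the cache.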
The example main.c demonstrates implementing a UART-based AI hub. Key features:
Override _write() to redirect printf output to UART during generation:
int _write(int fd, const void *buf, size_t count) {
    if ((fd == 1 || fd == 2) && output_to_uart) {
        uart_write_bytes(UART_NUM, buf, count);
        return count;
    }
    return count;
}
The example uses a simple protocol:
- Input: Text prompt ending with newline
- Output: Generated tokens followed by <STATS>X.XX</STATS><END>\n
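On the host side, the protocol above could be handled with a couple of small helpers. This is a sketch under assumptions: the helper names are hypothetical, UTF-8 is assumed for the prompt encoding, and the actual serial I/O (for example with a serial library of your choice) is left out so the example stays self-contained.

```python
import re

# Trailer format taken from the protocol description: <STATS>X.XX</STATS><END>\n
_RESPONSE = re.compile(
    r"(?P<text>.*)<STATS>(?P<tps>\d+(?:\.\d+)?)</STATS><END>\n?", re.DOTALL
)


def frame_prompt(prompt):
    """Encode a prompt as one newline-terminated line.

    Embedded newlines are replaced by spaces, because a newline ends the input.
    """
    return (" ".join(prompt.splitlines()) + "\n").encode("utf-8")


def parse_response(raw):
    """Split a full response into (generated_text, tokens_per_second)."""
    match = _RESPONSE.fullmatch(raw)
    if match is None:
        raise ValueError("response does not end with <STATS>...</STATS><END>")
    return match.group("text"), float(match.group("tps"))


text, tokens_per_sec = parse_response("there was a cat<STATS>35.20</STATS><END>\n")
print(repr(text), tokens_per_sec)
```

A host program would write the bytes from `frame_prompt()` to the serial port, read until it sees `<END>`, and then pass the collected text to `parse_response()`.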
int uart_read_line(char *buffer, int max_len) {
    // Read bytes until newline
    // Returns length of line read
}
- Reduce model size or use a smaller model
- Decrease seq_len in model configuration
- Check PSRAM allocation in menuconfig
- Model may be too large for your hardware
- Check CPU frequency settings
- Verify PSRAM speed configuration
- Verify model and tokenizer files match
- Check temperature/topp parameters
- Ensure sufficient steps for meaningful output
#include "gpt.h"
#include "esp_spiffs.h"
#include <time.h>
void app_main(void) {
    // Initialize storage
    esp_vfs_spiffs_conf_t conf = {
        .base_path = "/data",
        .partition_label = NULL,
        .max_files = 5,
        .format_if_mount_failed = false
    };
    esp_vfs_spiffs_register(&conf);
    // Build transformer
    Transformer transformer;
    build_transformer(&transformer, "/data/model.bin");
    // Build tokenizer
    Tokenizer tokenizer;
    build_tokenizer(&tokenizer, "/data/tokenizer.bin",
                    transformer.config.vocab_size);
    // Build sampler
    Sampler sampler;
    build_sampler(&sampler, transformer.config.vocab_size,
                  0.8f, 0.9f, (unsigned int)time(NULL));
    // Generate text
    char *prompt = "Hello, I am";
    generate(&transformer, &tokenizer, &sampler, prompt, 50, NULL);
    // Cleanup
    free_sampler(&sampler);
    free_tokenizer(&tokenizer);
    free_transformer(&transformer);
}
Want to train your own models for tinyGPT.c? This guide covers the complete process, from dataset preparation to exporting a model ready for your ESP32.
Note: tinyGPT.c uses a type-grouped export format optimized for ESP32's memory architecture. This format enables zero-copy memory mapping, making model loading faster and more memory-efficient on resource-constrained devices.
The easiest way to get started is with our training notebook: train/tgptc_train.ipynb
This notebook handles the entire process step by step. Click the "Open in Colab" button above and you can start training immediately. All you need is your dataset. You can upload it to Colab and run every cell.
Dataset Preparation → Model Training → Export Model → Export Tokenizer → Flash to ESP32
Using DataSeek (optional but recommended)
DataSeek makes dataset preparation significantly easier. It's powered by DeepSeekr, so you can generate training datasets automatically and without any API costs.
DeepSeekr is a Selenium automation tool that generates conversations from DeepSeek's web interface.
The workflow is straightforward:
- Give DataSeek a prompt describing your desired dataset subject (e.g., "conversational AI about programming")
- DataSeek uses DeepSeekr to automatically generate conversations on that topic
- You get a clean, formatted dataset file ready for training
This eliminates manual data collection, web scraping, and API costs while generating high-quality training data.
If you're preparing data manually, save it as a plain text file named dataset.txt (UTF-8 encoding).
Dataset tips:
- Larger datasets generally produce better results (aim for ≈70MB minimum, more is better)
- Clean your data to avoid encoding issues that can affect training
- For conversational models, format text as dialogue
- The training script builds a character-level tokenizer from your data
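For readers new to character-level tokenization, here is a minimal sketch of the idea: every distinct character in the dataset gets its own token ID. The `stoi` and `itos` names mirror the ones stored in the training checkpoint; assigning IDs in sorted character order is an assumption, and the actual training script may order them differently.

```python
def build_vocab(text):
    """Map each distinct character to an integer ID and back."""
    chars = sorted(set(text))  # ASSUMPTION: IDs follow sorted character order
    stoi = {ch: i for i, ch in enumerate(chars)}
    itos = {i: ch for ch, i in stoi.items()}
    return stoi, itos


def encode(text, stoi):
    """Convert a string to a list of token IDs (raises KeyError on unseen characters)."""
    return [stoi[ch] for ch in text]


def decode(ids, itos):
    """Convert a list of token IDs back to a string."""
    return "".join(itos[i] for i in ids)


stoi, itos = build_vocab("hello world")
ids = encode("hello", stoi)
print(len(stoi), ids, decode(ids, itos))
```

Because the vocabulary contains only characters seen during training, stray characters from encoding problems inflate it, which is one reason to clean the dataset first.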
Open train/tiny_train.py and adjust the TrainConfig class:
class TrainConfig:
    # model config
    dim = 128 # Embedding dimension
    n_layers = 4 # Number of transformer layers
    n_head = 4 # Number of attention heads
    max_seq_len = 32 # Context window (keep low for ESP32)
    # training config
    batch_size = 32
    learning_rate = 5e-3
    max_iters = 8000
    eval_interval = 500
Key parameters:
- dim: Model dimension. Higher values increase capacity but require more memory. 128-192 works well for ESP32S3.
- max_seq_len: Maximum context length. This is critical for memory usage. Keep it low (32-64) for ESP32.
- n_layers: Number of transformer layers. More layers improve quality but slow down inference. 4-6 layers is recommended.
- max_iters: Training iterations. Continue training while validation loss decreases.
For ESP32S3-N16R8 (8MB PSRAM):
- ~250K parameters: dim=128, n_layers=4, max_seq_len=32
- ~500K parameters: dim=192, n_layers=4, max_seq_len=48
- ~1M parameters: dim=256, n_layers=6, max_seq_len=32
Larger configurations will struggle with memory constraints, especially when the KV cache fills during generation.
Place your dataset in the training directory and run:
python train/tiny_train.py
The training process:
- Loads and splits data (90% train, 10% validation)
- Builds a character-level tokenizer from your dataset
- Initializes the GPT model architecture
- Trains for the specified iterations
- Saves checkpoints to the out/ directory
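The data handling in the first step can be pictured with the sketch below. It assumes a contiguous 90/10 split and nanoGPT-style random windows in which the target is the input shifted by one token; the function names are hypothetical and the real training script may differ.

```python
import random


def split_dataset(ids, train_percent=90):
    """Split token IDs into a contiguous train part and validation part."""
    cut = len(ids) * train_percent // 100
    return ids[:cut], ids[cut:]


def get_batch(ids, batch_size, seq_len, rng):
    """Sample random windows; each target row is its input row shifted by one."""
    starts = [rng.randrange(len(ids) - seq_len) for _ in range(batch_size)]
    inputs = [ids[s:s + seq_len] for s in starts]
    targets = [ids[s + 1:s + seq_len + 1] for s in starts]
    return inputs, targets


token_ids = list(range(1000))  # stand-in for an encoded dataset
train_ids, val_ids = split_dataset(token_ids)
x, y = get_batch(train_ids, batch_size=4, seq_len=32, rng=random.Random(0))
print(len(train_ids), len(val_ids), len(x), len(x[0]))
```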
During training:
- Training loss should decrease consistently
- Validation loss should track training loss (divergence indicates overfitting)
- Best model is automatically saved to out/best_model.pt
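The two checks above can be expressed in a few lines. This sketch uses made-up loss values purely for illustration (they are not real training results), assumes the first evaluation happens at iteration `eval_interval`, and uses hypothetical helper names.

```python
def best_checkpoint(val_losses, eval_interval):
    """Return (iteration, loss) for the lowest validation loss seen so far."""
    best_index = min(range(len(val_losses)), key=lambda i: val_losses[i])
    return (best_index + 1) * eval_interval, val_losses[best_index]


def is_overfitting(train_losses, val_losses, patience=2):
    """True if validation loss rose for `patience` evals while training loss fell."""
    if len(val_losses) <= patience:
        return False
    recent_val = val_losses[-(patience + 1):]
    recent_train = train_losses[-(patience + 1):]
    val_rising = all(b > a for a, b in zip(recent_val, recent_val[1:]))
    train_falling = all(b < a for a, b in zip(recent_train, recent_train[1:]))
    return val_rising and train_falling


# Illustrative numbers only.
train_curve = [2.4, 2.0, 1.7, 1.5, 1.3]
val_curve = [2.5, 2.1, 1.9, 2.0, 2.2]
print(best_checkpoint(val_curve, eval_interval=500))
print(is_overfitting(train_curve, val_curve))
```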
Training time depends on dataset size, model configuration, and hardware, but it is usually short: at most about 10 minutes for a tinyGPT.c model on Colab with a typical GPU.
After training completes, export the model to tinyGPT.c format:
python train/export_tg.py output.bin checkpoint.pt
This creates output.bin containing your model weights in a format optimized for ESP32.
Arguments:
- First argument: output filepath (where to save the .bin file)
- Second argument: checkpoint filepath (your trained .pt file)
Example:
python train/export_tg.py espic-2.bin out/best_model.pt
Standard model formats store weights layer by layer. Type-grouped format stores all weights of the same type together:
- All token embeddings
- All positional embeddings
- All layer norm weights for each layer
- All attention weights for each layer
- etc.
This layout enables zero-copy memory mapping on ESP32, allowing direct access to weights from flash memory without copying to RAM. This saves substantial memory and makes loading nearly instant.
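The difference between the two orderings can be sketched as follows. The tensor kind names (`ln1`, `wq`) and the sizes are hypothetical, and 4-byte floats are assumed; the point is only that in a type-grouped file all layers of one kind form a single contiguous block that can be addressed with one base offset.

```python
def layer_by_layer(n_layers, kinds):
    """Conventional order: every tensor of layer 0, then layer 1, and so on."""
    return [f"layer{layer}.{kind}" for layer in range(n_layers) for kind in kinds]


def type_grouped(n_layers, kinds):
    """Type-grouped order: one kind for all layers, then the next kind."""
    return [f"layer{layer}.{kind}" for kind in kinds for layer in range(n_layers)]


def group_offsets(n_layers, floats_per_layer, bytes_per_float=4):
    """Byte offset where each kind's contiguous [n_layers, size] block starts."""
    offsets, cursor = {}, 0
    for kind, count in floats_per_layer.items():
        offsets[kind] = cursor
        cursor += n_layers * count * bytes_per_float
    return offsets


kinds = ["ln1", "wq"]  # hypothetical tensor kinds
print(layer_by_layer(2, kinds))
print(type_grouped(2, kinds))
print(group_offsets(2, {"ln1": 128, "wq": 128 * 128}))
```

With one base pointer per kind, layer `l` is found at `base + l * size`, which is what makes it practical to point directly into a memory-mapped file.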
Export the tokenizer vocabulary:
python train/export_tokenizer.py
This creates tokenizer.bin containing your character-level vocabulary. The tokenizer is built from your training data and includes exactly the characters your model was trained on.
You now have two files:
- output.bin (your trained model, typically ≈4-8MB)
- tokenizer.bin (your tokenizer vocabulary, typically ≈5-50KB)
To deploy:
- Rename output.bin to match your firmware's expected path (e.g., espic-2.bin)
- Flash both files to your ESP32's SPIFFS partition using esptool.py or your preferred method
- Flash your tinyGPT.c firmware
See the UART Inference section for detailed flashing instructions.
All training scripts are located in the train/ folder:
- train/tgptc_train.ipynb -> Complete training notebook
- train/tiny_train.py -> Main training script with TrainConfig
- train/model_gpt.py -> GPT model architecture implementation
- train/export_tg.py -> Exports trained models to .bin format
- train/export_tokenizer.py -> Exports character-level tokenizer
Loss not decreasing:
- Increase learning rate (try 1e-2)
- Verify data quality and formatting
- Reduce model size (smaller models learn faster)
Out of memory during training:
- Reduce batch_size (try 16 or 8)
- Reduce max_seq_len
- Train on CPU if GPU memory is insufficient (will be slower)
Validation loss increasing (overfitting):
- Add dropout (try 0.1)
- Reduce max_iters
- Increase training data
- Use a smaller model
Model produces nonsensical output:
- Train longer (model may not have converged)
- Verify tokenizer export matches training vocabulary
- Check exported model file size is reasonable
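One way to sanity-check the exported file size is to compare it with the 256-byte header plus the number of exported float values. This sketch assumes 4-byte floats and nothing else in the file; the helper names are hypothetical, and the count of exported floats depends on details such as whether the classification head shares the token embeddings.

```python
import os
import tempfile

HEADER_BYTES = 256   # from the checkpoint format description
BYTES_PER_FLOAT = 4  # ASSUMPTION: weights are stored as float32


def expected_size(n_floats):
    """File size in bytes for a header followed by n_floats float values."""
    return HEADER_BYTES + n_floats * BYTES_PER_FLOAT


def size_matches(path, n_floats):
    """Compare the size of the file on disk with the expected size."""
    return os.path.getsize(path) == expected_size(n_floats)


# Demo with a synthetic file standing in for an exported model.
with tempfile.NamedTemporaryFile(suffix=".bin", delete=False) as handle:
    handle.write(b"\x00" * expected_size(10))
    demo_path = handle.name
demo_ok = size_matches(demo_path, 10)
os.remove(demo_path)
print(demo_ok)
```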
You can test your model before flashing to ESP32. Load the checkpoint and run inference in Python:
import torch
from model_gpt import GPT, GPTModelArgs
# Load checkpoint
ckpt = torch.load("out/best_model.pt", map_location="cpu")
config = ckpt['config']
stoi = ckpt['stoi']
itos = {i: ch for ch, i in stoi.items()}
# Build model
model_args = GPTModelArgs(
    dim=config.dim,
    n_layers=config.n_layers,
    n_heads=config.n_head,
    vocab_size=config.vocab_size,
    max_seq_len=config.max_seq_len
)
model = GPT(model_args)
model.load_state_dict(ckpt['model_state_dict'])
model.eval()
# Generate text
prompt = "Hello"
context = torch.tensor([stoi[c] for c in prompt], dtype=torch.long).unsqueeze(0)
generated = model.generate(context, max_new_tokens=100, temperature=0.8, top_k=40)
# Decode output
output = ''.join([itos[int(i)] for i in generated[0]])
print(output)
This allows faster iteration without repeatedly flashing the ESP32.
Espic-2 (the current included example model) was trained with:
- Dataset: ~80MB of conversational text (~72MB train, ~8MB validation after 90/10 split)
- Configuration: dim=128, n_layers=4, n_heads=4, max_seq_len=32
- Training: 8000 iterations, learning_rate=5e-3, batch_size=32
- Result: inference speed ~30-40 tokens/sec on ESP32S3
- Vocabulary: 4661 unique characters
Your results will vary based on dataset quality and configuration choices.
Important
Espic-2's performance is currently limited. The initial release prioritized getting the product out and demonstrating the concept. Future updates will focus on architectural improvements and training models with significantly better conversational capabilities.
This guide shows how to flash the default UART Inference firmware, which turns your ESP32S3-N16R8 into a serial AI hub. Follow it if you are building a P1 or want a microcontroller AI hub for any other application.
Note
You need to have ESP-IDF toolchain installed for this.
cd uartinference
idf.py build
idf.py -p /dev/yourport flash
mindmap
  root((tinyGPT.c))
    Model
      Obtain dataset
      Train model
      Convert to tinyGPT.c
    Firmware
      Build firmware
      Flash it
You need to obtain a model or train your own, then flash a firmware: either the one in this repository, which turns your microcontroller into a UART AI hub, or one you write yourself.
- Train Espic-3, a better and more capable model than Espic-2.
- Improve optimization.
You can contact me at yusuf@tachion.tech


