Skip to content

JacobHaig/teddy

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

95 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Teddy

A data manipulation and analysis library for Zig, inspired by pandas and polars. Teddy provides type-safe, null-aware operations on structured tabular data with full control over memory allocation.

Features

  • Typed Series — Generic Series(T) supporting all integer types, f32, f64, bool, and String
  • DataFrame — Multi-column tabular structure backed by typed series
  • Null support — Validity bitmap (Apache Arrow model); null-aware aggregations, fill, and drop
  • Type casting — Three-tier system: castSafe (comptime-verified), cast (strict runtime), castLossy (failures → null)
  • Aggregationssum, min, max, mean, stdDev — all skip nulls
  • GroupBy — Group by any column; aggregate with count, sum, mean, min, max, stdDev
  • Joins — Inner, left, right, and outer joins on a key column
  • Sorting & filteringsort, filter, select, head, tail, slice, unique, valueCounts
  • I/O — CSV, JSON (rows, columns, NDJSON/JSONL), and Parquet read/write
  • Memory safety — Explicit allocator everywhere; clear ownership contracts

Getting Started (Library User)

Requirements

Zig 0.16.0 or later is required.

Adding Teddy to Your Project

Fetch the package and save it to your build.zig.zon:

zig fetch --save git+https://github.com/Wisward/teddy

Then wire it up in your build.zig:

const teddy_dep = b.dependency("teddy", .{
    .target = target,
    .optimize = optimize,
});

exe.root_module.addImport("teddy", teddy_dep.module("dataframe"));

First Program

const std = @import("std");
const teddy = @import("teddy");
const Dataframe = teddy.Dataframe;
const Series = teddy.Series;

pub fn main(init: std.process.Init) !void {
    const allocator = init.gpa;
    const io = init.io;

    // Load a CSV file
    var reader = try teddy.Reader.init(allocator, io);
    defer reader.deinit();

    var df = try reader
        .withFileType(.csv)
        .withPath("data.csv")
        .load();
    defer df.deinit();

    // Print the first 5 rows
    const preview = try df.head(5);
    defer preview.deinit();
    try preview.print();
}

Run it:

zig build run

Core Concepts

Building a DataFrame

var df = try Dataframe.init(allocator);
defer df.deinit();

var age = try Series(i32).init(allocator);
try age.rename("age");
try age.append(25);
try age.append(30);
try age.appendNull(); // missing value
try age.append(22);
try df.addSeries(age.toBoxedSeries());

var name = try Series(teddy.String).init(allocator);
try name.rename("name");
try name.tryAppend("Alice");
try name.tryAppend("Bob");
try name.tryAppend("Carol");
try name.tryAppend("Dave");
try df.addSeries(name.toBoxedSeries());

Filtering and Selecting

// Keep rows where age > 24 (nulls are excluded automatically)
var adults = try df.filter("age", i32, .gt, 24);
defer adults.deinit();

// Keep only specific columns
var slim = try df.select(&.{ "name", "age" });
defer slim.deinit();

Supported filter operators: .eq, .neq, .lt, .lte, .gt, .gte

Sorting

// Sort ascending
var sorted_asc = try df.sort("age", true);
defer sorted_asc.deinit();

// Sort descending
var sorted_desc = try df.sort("age", false);
defer sorted_desc.deinit();

Aggregations

// Summary statistics for all numeric columns
var stats = try df.describe();
defer stats.deinit();
try stats.print();

GroupBy

var gb = try df.groupBy("department");
defer gb.deinit();

var counts = try gb.count();
defer counts.deinit();

var totals = try gb.sum("salary");
defer totals.deinit();

var averages = try gb.mean("salary");
defer averages.deinit();

Joins

// left / inner / right / outer
const joined = try left_df.join(right_df, "id", .left);
defer joined.deinit();

Null Handling

Nulls are stored as a validity bitmap alongside values — no memory overhead on null-free columns. Aggregations skip nulls automatically.

try s.appendNull();           // mark a slot as missing
s.isNull(i)                   // check a slot
s.nullCount()                 // count missing values
try s.fillNull(0)             // replace nulls with a scalar
try s.fillNullForward()       // carry last valid value forward (LOCF)
try s.fillNullBackward()      // carry next valid value backward
try s.dropNulls()             // new series with null rows removed

try df.dropNulls("col")       // drop rows where col is null
try df.dropNullsAny()         // drop rows where any column is null
try df.fillNull("col", i32, 0)

Type Casting

Three tiers matching Zig's explicit-cast philosophy:

// castSafe — comptime guarantee; compile error if the types aren't safely widening
var wide = try s.castSafe(i64);     // i32 → i64: always safe
// var bad = try s.castSafe(i8);    // compile error: may overflow

// cast — strict runtime; errors on overflow or fractional float→int
var exact = try s.cast(i32);        // 3.0 → 3 ok; 3.5 → error.LossyCast

// castLossy — permissive; failures become null
var best = try s.castLossy(i8);     // 300 → null, 1 → 1

All three variants are available on Series(T), BoxedSeries, and Dataframe.


I/O

The Reader and Writer use a fluent builder pattern. File type is set with .withFileType(); format-specific options are set with additional calls before .load() / .toString() / .save().

Reading

// CSV (auto-infers column types)
var reader = try teddy.Reader.init(allocator, io);
defer reader.deinit();

var df = try reader
    .withFileType(.csv)
    .withPath("data.csv")
    .withDelimiter(',')   // default; omit if not needed
    .withHeaders(true)    // default
    .load();

// JSON — auto-detects rows `[{...}]`, columns `{"col":[...]}`, or NDJSON
var df2 = try reader.withFileType(.json).withPath("data.json").load();

// Parquet
var df3 = try reader.withFileType(.parquet).withPath("data.parquet").load();

Writing to a File

var writer = try teddy.Writer.init(allocator, io);
defer writer.deinit();

// CSV
try writer.withFileType(.csv).withPath("out.csv").save(df);

// JSON — rows `[{...}]` format
try writer
    .withFileType(.json)
    .withJsonFormat(.rows)   // .rows | .columns | .ndjson
    .withPath("out.json")
    .save(df);

// Parquet with Snappy compression
try writer
    .withFileType(.parquet)
    .withCompression(.snappy)
    .withPath("out.parquet")
    .save(df);

Writing to a String (in-memory)

var writer = try teddy.Writer.init(allocator, io);
defer writer.deinit();

const csv_bytes = try writer.withFileType(.csv).toString(df);
defer allocator.free(csv_bytes);

const json_bytes = try writer.withFileType(.json).withJsonFormat(.ndjson).toString(df);
defer allocator.free(json_bytes);

JSON Format Options

Format Shape Example
.rows Array of objects [{"a":1},{"a":2}]
.columns Object of arrays {"a":[1,2]}
.ndjson One object per line {"a":1}\n{"a":2}

Format is auto-detected on read. NDJSON is detected when content starts with { and contains \n{. For single-object files you can force it: .{ .format = .ndjson }.


Building

zig build          # compile
zig build test     # run the full test suite
zig build run      # run the example (loads data/ files)

Project Structure

src/
  dataframe/
    dataframe.zig       — DataFrame: columns, filter, sort, join, groupby, I/O
    series.zig          — Series(T): typed column with null support and casting
    boxed_series.zig    — Type-erased union over all Series(T) variants
    raw.zig             — Raw fallback column (undecoded parquet payloads)
    boxed_groupby.zig   — Type-erased GroupBy dispatch
    group.zig           — GroupBy(T) aggregation engine
    join.zig            — Inner / left / right / outer join
    strings.zig         — Managed String type used for string columns
    reader.zig          — Unified Reader builder
    writer.zig          — Unified Writer builder
    csv_reader.zig      — CSV parser with type inference
    csv_writer.zig      — CSV serializer (RFC 4180)
    json_reader.zig     — JSON parser (rows, columns, NDJSON)
    json_writer.zig     — JSON serializer
    parquet.zig         — Parquet ↔ DataFrame adapter
    *_test.zig          — Per-module test files
    tests.zig           — Test aggregator (imported by build.zig)
  parquet/
    parquet.zig         — Module root (exports readParquet / writeParquet)
    parquet_reader.zig  — File reader (PAR1 magic, footer, column chunks)
    parquet_writer.zig  — File writer
    encoding_*.zig      — PLAIN and RLE/bit-packed codec
    thrift_*.zig        — Thrift metadata serialization
    snappy.zig          — Snappy compression
    *_test.zig          — Per-module test files
    tests.zig           — Test aggregator
  main.zig              — Example program (zig build run)
data/
  addresses.csv / .parquet / _snappy.parquet
  stock_apple.csv
docs/
  architecture.md / file-formats.md / type-system.md / operations.md / reader-writer.md

Contributing

Setup

git clone https://github.com/Wisward/teddy
cd teddy
zig build test   # all tests should pass before you start

Running Tests

zig build test --summary all   # shows pass count per test suite

Tests live next to the source they cover (csv_writer_test.zig beside csv_writer.zig, etc.). Each test file is imported by the suite aggregator at src/dataframe/tests.zig or src/parquet/tests.zig — add your file to the matching aggregator when you create a new one.

Adding a File Format

  1. Create src/dataframe/<format>_reader.zig and/or <format>_writer.zig
  2. Add a variant to the FileType union in reader.zig / writer.zig and wire it into the load / toString dispatch
  3. Add <format>_reader_test.zig / <format>_writer_test.zig and import them in src/dataframe/tests.zig

Style Notes

  • Every public function that allocates takes an Allocator and documents ownership in a doc comment
  • Prefer zero-copy parsing (slice into the input buffer) over allocating intermediate strings
  • Match the existing error type conventions (error.InvalidJson, error.LossyCast, etc.)

Roadmap

See docs/feature-roadmap.md for the prioritized feature roadmap comparing teddy to pandas and polars.

License

TBD

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors