Read and write fixed-width / flat-file / mainframe data in DuckDB with SQL —
the equivalent of Perl unpack() / Python struct, plus COBOL copybooks
(COMP-3, zoned decimal, EBCDIC, OCCURS, REDEFINES).
It runs as a VGI worker: a small standalone binary that
DuckDB launches and talks to over Apache Arrow. You ATTACH it and call its
functions like any other.
ATTACH 'fixed' (TYPE vgi, LOCATION './target/release/fixedformat-worker');
SET search_path = 'fixed.main';
SELECT unpack_fixed('JOHN 00042', 'name:A10 qty:9(5)');
-- {'name': JOHN, 'qty': 42}1. Build the worker (needs Rust 1.86+):
cargo build --release # produces target/release/fixedformat-worker2. Attach it in DuckDB (any DuckDB with the vgi community extension):
INSTALL vgi FROM community; -- one time
ATTACH 'fixed' (TYPE vgi, LOCATION '/absolute/path/to/fixedformat-worker');
SET search_path = 'fixed.main'; -- so you can call functions unqualifiedUse an absolute LOCATION (it's resolved relative to DuckDB's working
directory).
| Function | Shape | What it does |
|---|---|---|
unpack_fixed(rec, spec [, encoding]) |
scalar → STRUCT | Parse one VARCHAR/BLOB record into a struct of fields |
pack_fixed(struct, spec [, encoding]) |
scalar → BLOB | Format a struct back into a fixed-width record |
read_fixed(path, spec [, options…]) |
table function | Read a whole fixed-width file into rows |
write_fixed((FROM rel), path, spec [, options…]) |
table function | Write a relation out to a fixed-width file |
pack_fixed is the exact inverse of unpack_fixed:
pack_fixed(unpack_fixed(rec, s), s) = rec.
SELECT unpack_fixed('JOHN 00042', 'name:A10 qty:9(5)').qty; -- 42
-- Over a column:
SELECT (unpack_fixed(raw_line, 'name:A10 qty:9(5)')).*
FROM my_table;rec can be VARCHAR or BLOB (use a BLOB for binary / COMP-3 / EBCDIC data).
The third argument is the byte encoding ('ascii' default, or 'ebcdic').
SELECT pack_fixed({'name': 'JOHN', 'qty': 42}, 'name:A10 qty:9(5)');
-- returns BLOB 'JOHN 00042'SELECT * FROM read_fixed('data/accounts.dat', 'name:A10 qty:9(5)');
-- COBOL copybook + EBCDIC + fixed-length records:
SELECT * FROM read_fixed('data/master.dat',
'01 REC. 05 NM PIC X(20). 05 BAL PIC S9(7)V99 COMP-3.',
encoding => 'ebcdic', framing => 'fixed');path may be a glob (data/*.dat). Options are named: format, encoding,
framing, record_length.
write_fixed is a table function, so call it in a FROM clause and pass the
input relation as a subquery (FROM …):
SELECT * FROM write_fixed((FROM my_table), '/tmp/out.dat', 'name:A10 qty:9(5)');
-- returns one row: (rows_written, bytes_written)A spec describes the record layout. Three formats are accepted and
auto-detected (override with format => on the table functions):
Whitespace-separated tokens, each optionally name:-prefixed.
name:A10 qty:9(5) amt:l< flags:C
| Code | Meaning | The count is… |
|---|---|---|
A / a / Z |
string — space-pad / null-pad / null-terminated | the width in bytes |
9(n) / S9(n) / X(n) |
display numeric / signed / text (COBOL PIC) | the digit/char count |
c C |
int8 / uint8 | a repeat → LIST |
s S |
int16 / uint16 | a repeat |
l L (or i I) |
int32 / uint32 | a repeat |
q Q |
int64 / uint64 | a repeat |
n N |
uint16 / uint32, big-endian | a repeat |
v V |
uint16 / uint32, little-endian | a repeat |
e f d |
float16 / float32 / float64 | a repeat |
H h |
hex string, high / low nibble first | the width |
? |
boolean byte | a repeat |
x |
pad byte(s) — consumed, no output column | the width |
Byte order: default is big-endian. A trailing < / > on a code sets it
per-field (l< = little-endian int32); a standalone < > ! = @ token
sets the default for everything after it. (!/> = network/big, < = little,
=/@ = native.)
A count is (n) or trailing digits: A10 and A(10) are the same; s(3) is
three int16s returned as a LIST.
Self-documenting; good for generated specs.
[
{"name": "id", "type": "str", "width": 10},
{"name": "qty", "type": "int", "digits": 5},
{"name": "amt", "type": "comp3", "digits": 9, "scale": 2, "signed": true}
]Types: str, int, decimal, comp3/packed, zoned, binary/comp,
float/double/half, hex, bool, pad. Options: width, digits,
scale, signed, endian (big/little), occurs, justify (left/right),
pad, sign (leading/trailing/embedded).
Real copybook text — paste it straight in.
01 ACCOUNT-RECORD.
05 ACCT-ID PIC X(10).
05 BALANCE PIC S9(7)V99 COMP-3.
05 HISTORY OCCURS 12 PIC 9(6).
05 RAW-DATE PIC X(8).
05 DATE-PARTS REDEFINES RAW-DATE.
10 YYYY PIC 9(4).
10 MM PIC 9(2).
10 DD PIC 9(2).- Group items →
STRUCTcolumns. OCCURS n→LIST(UNNESTit to get rows).REDEFINES→ aSTRUCTholding every overlapping interpretation of the same bytes (named after the base field).USAGE COMP-3/PACKED-DECIMAL,COMP/COMP-4/COMP-5/BINARY, andSIGN LEADING/TRAILING [SEPARATE]are honored.
| Field | DuckDB type |
|---|---|
| text / hex | VARCHAR |
| display / binary integer | BIGINT |
| COMP-3, zoned, implied-point decimal | DECIMAL(p, s) (exact) |
| float32 / float16 | REAL |
| float64 | DOUBLE |
| boolean | BOOLEAN |
| OCCURS | LIST of the above |
| group / REDEFINES | STRUCT |
- encoding:
ascii(default) orebcdic(code page 037). - framing (how records are delimited in a file):
newline(default) — one record per line.fixed— no delimiters; records are exactlyrecord_lengthbytes (defaults to the layout length; override withrecord_length =>).rdw— IBM variable-length: each record prefixed with a 4-byte Record Descriptor Word.rdw_blocked— RDW records inside Block Descriptor Word blocks.
Split a fixed-width text column already in a table:
SELECT (unpack_fixed(line, 'id:A8 name:A20 amt:9(9)')).*
FROM staging_lines;Convert a COBOL EBCDIC file to a Parquet file:
COPY (
SELECT * FROM read_fixed('mainframe.dat', '<copybook>',
encoding => 'ebcdic', framing => 'fixed')
) TO 'out.parquet';Flatten an OCCURS array into rows:
SELECT r.acct_id, h.idx, h.val
FROM read_fixed('f.dat', '01 R. 05 ACCT-ID PIC X(10). 05 H OCCURS 3 PIC 9(4).') r,
UNNEST(r.H) WITH ORDINALITY AS h(val, idx);Round-trip / reformat a file (read with one spec, write with another):
SELECT * FROM write_fixed(
(FROM read_fixed('in.dat', 'a:A5 b:9(3)')),
'out.dat', 'a:A8 b:9(5)');- The function is
unpack_fixed, notunpack—unpackis a reserved keyword in DuckDB. write_fixedis a table function: useSELECT * FROM write_fixed((FROM t), …), notSELECT write_fixed(…).- If a query creates a table, don't
SET search_path = 'fixed.main'first — that points DDL at the read-only worker catalog. Either skipsearch_pathand fully-qualify calls (fixed.main.unpack_fixed(…)), or create your tables before setting it. pack_fixed/write_fixederror if a value doesn't fit its field width — that's intentional (silent truncation corrupts fixed-width files).read_fixeddecodes eagerly and fails fast: one malformed record (bad COMP-3 nibble, a too-short line, a ragged fixed-length stream) errors the whole query with a clear message rather than returning partial or corrupt rows.- After rebuilding the worker,
DETACH fixed; ATTACH …to pick up the new binary.
cargo test -p fixedformat-core # unit tests + property-based round-trip fuzzing
cargo clippy --all-targets
./run_tests.sh # end-to-end SQLLogic suite (see below)
python3 data/generate_fixtures.py # regenerate test fixturesCoverage: ~70 fixedformat-core unit tests, a proptest suite proving
decode(encode(v)) == v across every field kind (ASCII + EBCDIC), and 9 SQLLogic
files (120+ assertions) covering every function, spec format, framing mode, NULL
handling, and malformed-input behavior. Binary decoding is cross-checked against
Python struct.pack reference bytes.
The end-to-end suite (test/sql/*.test) runs against the built worker through
the haybarn DuckDB unittest runner. One-time setup:
uv tool install haybarn-unittest
uv tool install haybarn
echo "INSTALL vgi FROM community;" | uvx haybarn-cliThe codebase splits into a pure-logic crate and a thin adapter — see
CLAUDE.md for architecture and conventions.