insight-canon

insight-canon — tokenization: the foundational log-analysis pipeline.

insight_canon is a self-contained C++23 static library that provides the foundational pipeline for structured log analysis:

Layer	What it does
core	Shared types (`LogLevel`, `EventID`), logging façade (spdlog), ISO-8601 time utilities
tokenization	Format detection, Drain template clustering, arena allocator, `CanonicalEvent` output

A canonical event is the normalized, format-agnostic representation of a log line. Tokenization produces it; insight-metalog and insight-eidos consume it.

Both layers are built as one library and consumed via a single CMake target: insight::canon.

Pipeline

Raw logs
  insight-canon   ->  CanonicalEvent  ->  event stream
  insight-metalog ->  bounded behavioral fingerprint
  insight-eidos   ->  detection reports + explain packets

Determinism

insight-canon is the content-determinism layer of the pipeline: the same input bytes always produce the same canonical events and the same numeric results on any compiler, architecture, or machine. That reproducibility is what makes everything built downstream benchmarkable and verifiable, run after run.

Two guarantees combine:

Deterministic tokenization: a line's template is a pure function of its bytes (same bytes ⇒ same tokens ⇒ same template id), versioned by a canonicalization_version. EventIDs are monotonic (never wall-clock or hash-seeded), CanonicalEvent string views are arena-stable, Drain clustering is a pure function of (template prefix, parameter mask), and no unordered_map iteration order ever reaches output.
Bit-identical math (det_math): every logarithm, entropy, and divergence on the deterministic path is computed in integer fixed-point with no libm and accumulated in a 128-bit integer reducer, which is order-independent by construction because integer addition is associative and commutative. IEEE 754 + − × ÷ are already cross-machine deterministic; removing libm transcendentals and float-sum ordering removes the only two divergence sources. Consuming code builds with -ffp-contract=off, which stops the compiler from fusing a multiply and an add into a single FMA that rounds differently.

The result: identical logs yield an identical fingerprint input everywhere, which is the foundation on which the transport (coderoast-ipc) and the format (insight-metalog) build a fully reproducible pipeline.
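
The "same bytes ⇒ same tokens ⇒ same template id" chain from the tokenization guarantee can be sketched in a few lines. This is a toy stand-in, not the library's Drain miner: the masking rule (any token containing a digit is a parameter), the helper names, and the choice of unseeded FNV-1a as the id function are all assumptions made for illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <string_view>

// Hypothetical masking rule: a token that contains a digit is a parameter.
inline bool has_digit(std::string_view tok) {
    for (char ch : tok) {
        if (ch >= '0' && ch <= '9') return true;
    }
    return false;
}

// Build the template: constant tokens are kept, parameters become "<*>".
// The result depends only on the bytes of `line`.
inline std::string make_template(std::string_view line) {
    std::string out;
    std::size_t i = 0;
    while (i < line.size()) {
        while (i < line.size() && line[i] == ' ') ++i;   // skip separators
        std::size_t j = i;
        while (j < line.size() && line[j] != ' ') ++j;   // find the token end
        if (j == i) break;                               // only trailing spaces left
        if (!out.empty()) out += ' ';
        const std::string_view tok = line.substr(i, j - i);
        out += has_digit(tok) ? std::string_view{"<*>"} : tok;
        i = j;
    }
    return out;
}

// 64-bit FNV-1a with its standard fixed constants: no seed, no wall-clock
// input, so the id is the same on every run and every machine.
inline std::uint64_t template_id(std::string_view tmpl) {
    std::uint64_t h = 0xcbf29ce484222325ULL;             // FNV offset basis
    for (unsigned char b : tmpl) {
        h ^= b;
        h *= 0x100000001b3ULL;                           // FNV prime
    }
    return h;
}
```

With this sketch, "conn 10.0.0.7 port 22" and "conn 192.168.1.9 port 443" both reduce to the template "conn <*> port <*>" and therefore share one id, while the id itself never depends on hash seeding or container iteration order.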

Requirements

Tool	Minimum version
C++ compiler	GCC 15 or Clang 21 with C++23 support
CMake	3.28
Ninja	any recent
Conan	2.x

All library dependencies (spdlog, fmt, simdjson, GTest, nlohmann_json) are resolved by Conan from Conan Center Index — nothing needs to be installed manually.

Quick start

For local CodeRoast workspace iteration, use the parent malf helper from the repo root:

malf build .
malf test .

Conan workflow

1. Install dependencies and generate presets

# Install deps and generate CMake presets into build/
conan install . \
  --output-folder=build \
  --build=missing \
  --profile:host=linux-gcc15-release \
  --profile:build=linux-gcc15-release

A build/CMakePresets.json will be generated. The repo root CMakeUserPresets.json includes it automatically, so IDEs pick up the presets without extra configuration.

2. Configure and build

cmake --preset conan-release
cmake --build --preset conan-release

3. Run tests

ctest --preset conan-release --output-on-failure

Consuming as a Conan dependency

Option A — from a GitHub Release tarball

Each tagged release attaches an insight_canon-X.Y.Z.tgz produced by conan cache save. Restore it into your local cache:

# Download (requires gh CLI or manual download)
gh release download vX.Y.Z \
  --repo CodeRoasted/insight-canon \
  --pattern 'insight_canon-*.tgz' \
  --dir /tmp/

conan cache restore /tmp/insight_canon-X.Y.Z.tgz

Option B — build from source

git clone https://github.com/CodeRoasted/insight-canon.git
cd insight-canon
conan create . --profile:host=linux-gcc15-release --profile:build=linux-gcc15-release --build=missing

CMake usage in your project

find_package(insight_canon REQUIRED CONFIG)

target_link_libraries(my_target PRIVATE insight::canon)

In your conanfile.py:

def requirements(self):
    self.requires("insight_canon/<version>")  # pin to the current release

insight-canon is the upstream tokenization layer of the MetaLog pipeline.

Project layout

insight-canon/
├── api/                    PUBLIC (installed) module interface units
│   └── insight/
│       ├── canon.internal.cppm   insight.canon.internal — std manifest
│       ├── canon.api.cppm        insight.canon.api — the contract (types, det_math,
│       │                         CanonicalEvent, DrainConfig, arena, utils, logging accessors)
│       ├── canon.cppm            insight.canon — the facade (Tokenizer)
│       └── utils/log_macros.hpp  textual INSIGHT_LOG_* macro layer (installed header)
├── src/                    SEALED detail shards (build-only, never installed) + impl units
│   └── insight/
│       ├── scan/           insight.canon.detail.scan — fast_gates predicates + SSE2 sv_* scans
│       ├── strategy/       insight.canon.detail.strategy — IFormatStrategy + 20 format strategies
│       ├── drain/          insight.canon.detail.drain — the Drain template miner
│       ├── parse/          insight.canon.detail.parse — FormatDetector + LogParser
│       ├── tokenizer/      tokenizer_engine.cpp — facade impl unit (the Tokenizer seam)
│       ├── arena/          arena_allocator.cpp — api impl unit
│       └── utils/          logger / time_utils / failure_lexicon — api impl units
├── test_package/           Conan consumer smoke test (zero-init, import insight.canon only)
├── tests/                  Per-domain mirror of src/ + the insight.canon.test aggregate module
│   ├── canon.test.cppm     insight.canon.test — re-exports facade + all detail shards
│   ├── <domain>/           math/ arena/ utils/ drain/ strategy/ parse/ tokenizer/ — GTest suites
│   └── regression/         Loghub-dataset regression tests
├── benchmarks/             Benchmarks + the insight.canon.bench aggregate module
├── proof/                  Public determinism proof gate (Approach B)
├── scripts/
│   └── download_logs.sh    Download Loghub 2k + Zenodo datasets for regression
├── CMakeLists.txt          Single root CMake file
├── conanfile.py            Single Conan recipe
└── .github/workflows/      ci.yml  release-publish.yml  workflow-lint.yml

Module layering (the §11.9.11 pattern): internal ◀ api ◀ detail.{scan ◀ strategy ◀ parse, drain} ◀ facade. The facade interface never imports a detail shard; tokenizer_engine.cpp (a facade impl unit) imports detail.{strategy,drain,parse} to assemble the pipeline — consumers just import insight.canon;.

Building and running tests locally (without Conan)

If you already have the dependencies available as CMake packages, you can configure directly:

cmake -S . -B build \
  -DCMAKE_BUILD_TYPE=Release \
  -DINSIGHT_CANON_BUILD_TESTS=ON \
  -DCMAKE_CXX_STANDARD=23

cmake --build build -j$(nproc)
ctest --test-dir build --output-on-failure

CMake options

Option	Default	Description
`INSIGHT_CANON_BUILD_TESTS`	`ON` when top-level	Build unit and regression tests
`INSIGHT_CANON_ENABLE_NUMA`	`OFF`	Link libnuma for NUMA-aware arena allocation

Code style

The project uses clang-format (.clang-format) and clang-tidy (.clang-tidy) with settings checked in at the repo root.

# Format all sources in-place
clang-format -i $(find api src tests test_package -name '*.cpp' -o -name '*.hpp' -o -name '*.cppm')

# Lint (requires compile_commands.json in build/)
clang-tidy -p build $(find src -name '*.cpp')

Regression tests (Loghub datasets)

The regression suite in tests/regression/ tokenizes 16 real-world log files from the Loghub 2k benchmark and asserts a minimum parse-success rate for each dataset. The test binary auto-skips if the data directory is absent, so a normal conan create or ctest run stays clean even when the datasets have not been downloaded.

1. Download the datasets

bash scripts/download_logs.sh

This populates:

data/logs/loghub/   ← 16 × *_2k.log files from the logpai/loghub GitHub repo
data/logs/zenodo/   ← extended archive from Zenodo record 18522101

Requires curl and either unzip or bsdtar.

2. Run the regression tests

The test binary looks for data/logs/loghub/ relative to its working directory. When running via CTest the working directory is the build folder, so symlink or copy the data directory there, or set the working directory explicitly:

# Conan workflow — run from the build output directory
cd build/<preset-dir>
ln -s ../../data data       # or: cp -r ../../data data
ctest --output-on-failure -R regression

# CMake direct workflow
cmake --build build -j$(nproc)
cd build
ln -s ../data data
ctest --output-on-failure -R regression

To override the minimum success threshold for all datasets:

INSIGHT_TOKENIZER_REGRESSION_MIN_SUCCESS_RATE=0.90 ctest --output-on-failure -R regression
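
The two behaviours described in this section (skipping when data/logs/loghub/ is missing, and honouring the environment override) could be implemented along the following lines. How the real test binary does it is not shown in this README, so treat the helper names, the [0, 1] validation, and the fallback parameter as assumptions; only the environment variable name and the relative data path come from the text above.

```cpp
#include <cstdlib>
#include <filesystem>

// Hypothetical helper: returns the override from the environment when it
// parses cleanly to a rate in [0, 1], otherwise the per-dataset fallback.
inline double min_success_rate(double fallback) {
    const char* raw = std::getenv("INSIGHT_TOKENIZER_REGRESSION_MIN_SUCCESS_RATE");
    if (raw == nullptr || *raw == '\0') return fallback;
    char* end = nullptr;
    const double value = std::strtod(raw, &end);
    const bool parsed = (end != raw) && (*end == '\0');   // whole string consumed
    return (parsed && value >= 0.0 && value <= 1.0) ? value : fallback;
}

// Hypothetical helper: true when the Loghub directory exists under `cwd`.
// A test fixture could skip itself when this returns false.
inline bool loghub_data_present(
    const std::filesystem::path& cwd = std::filesystem::current_path()) {
    return std::filesystem::is_directory(cwd / "data" / "logs" / "loghub");
}
```

Resolving the path against the working directory is what makes the symlink step above necessary: under CTest the working directory is the build folder, not the repo root.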

CI

Workflow	Trigger	What it does
`ci.yml`	PR touching `api/`, `src/`, `tests/`, `CMakeLists.txt`, or `conanfile.py`	`conan create` — builds the library, runs unit + regression tests, runs the test_package smoke test
`release-publish.yml`	Push of a `vX.Y.Z` tag (or manual dispatch)	Verifies recipe version matches tag, builds, exports a `conan cache save` tarball, attaches it to the GitHub Release
`workflow-lint.yml`	PR touching `.github/workflows/**`	Runs actionlint on all workflow files

Technical Docs

Pipeline reference for tokenization lives in technical_docs/.

License

Apache-2.0 — see LICENSE.
