insight-canon — tokenization: the foundational log-analysis pipeline.
insight_canon is a self-contained C++23 static library that provides the foundational pipeline for structured log analysis:
| Layer | What it does |
|---|---|
| core | Shared types (LogLevel, EventID), logging façade (spdlog), ISO-8601 time utilities |
| tokenization | Format detection, Drain template clustering, arena allocator, CanonicalEvent output |
A canonical event is the normalized, format-agnostic representation of a log line. Tokenization produces it; insight-metalog and insight-eidos consume it.
Both layers are built as one library and consumed via a single CMake target: insight::canon.
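
To make the idea of a template concrete, here is a rough sketch of Drain-style parameter masking. It assumes a deliberately naive rule: any whitespace-separated token that contains a digit is treated as a parameter. `mask_template` and `demo` are hypothetical names used only for illustration. They are not part of the insight::canon API, and the library's real format detection and clustering are considerably more involved.

```cpp
#include <cassert>
#include <cctype>
#include <sstream>
#include <string>
#include <string_view>

// Hypothetical helper (not the insight-canon API): replace every
// whitespace-separated token that contains a digit with the wildcard "<*>".
std::string mask_template(std::string_view line) {
    std::istringstream in{std::string(line)};
    std::string token;
    std::string out;
    while (in >> token) {
        bool has_digit = false;
        for (unsigned char c : token) {
            if (std::isdigit(c) != 0) {
                has_digit = true;
                break;
            }
        }
        if (!out.empty()) {
            out += ' ';
        }
        if (has_digit) {
            out += "<*>";   // variable part: becomes a parameter slot
        } else {
            out += token;   // constant part: stays in the template
        }
    }
    return out;
}

// Two lines that differ only in their parameters share one template.
void demo() {
    assert(mask_template("Accepted password for alice from 10.0.0.7 port 52114") ==
           mask_template("Accepted password for alice from 192.168.1.20 port 40022"));
}
```

Under this rule both lines in `demo` reduce to `Accepted password for alice from <*> port <*>`. The result depends only on the input bytes, which is the property the rest of this README relies on.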
Raw logs
insight-canon -> CanonicalEvent -> event stream
insight-metalog -> bounded behavioral fingerprint
insight-eidos -> detection reports + explain packets
insight-canon is the content-determinism layer of the pipeline: the same input bytes always produce the same canonical events and the same numbers — on any compiler, architecture, or machine. That reproducibility is what makes everything built downstream benchmarkable and verifiable, run after run.
Two guarantees combine:
- Deterministic tokenization: a line's template is a pure function of its bytes (same bytes ⇒ same tokens ⇒ same template id), versioned by a canonicalization_version. EventIDs are monotonic (never wall-clock or hash-seeded), CanonicalEvent string views are arena-stable, Drain clustering is a pure function of (template prefix, parameter mask), and no unordered_map iteration order ever reaches output.
- Bit-identical math (det_math): every logarithm / entropy / divergence on the deterministic path is computed in integer fixed-point with no libm and accumulated in a 128-bit integer reducer (order-independent by construction). IEEE + − × ÷ are already cross-machine deterministic; removing libm transcendentals and float-sum ordering removes the only two divergence sources. Consuming code builds with -ffp-contract=off.
The result: identical logs yield an identical fingerprint input everywhere — the foundation the transport (coderoast-ipc) and the format (insight-metalog) build a fully reproducible pipeline on.
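
The second guarantee can be illustrated with a small sketch. This is not the library's det_math: the Q16.16 format, the function names `log2_q16` and `entropy_q16`, and the bit-by-bit squaring algorithm are all assumptions chosen for brevity. It only shows why integer fixed-point math combined with a wide integer accumulator gives the same answer regardless of summation order. `__int128` is a GCC/Clang extension.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Illustrative only: log2(x) for an integer x >= 1, as Q16.16 fixed point
// (the real value multiplied by 65536, truncated). No libm, no floats.
std::int64_t log2_q16(std::uint64_t x) {
    assert(x > 0 && "log2 is undefined for 0");
    int n = 0;                                   // integer part: floor(log2 x)
    for (std::uint64_t t = x; t > 1; t >>= 1) {
        ++n;
    }
    // Normalise x to a mantissa m in [1, 2), stored as Q1.31: [2^31, 2^32).
    std::uint64_t m = (n <= 31) ? (x << (31 - n)) : (x >> (n - 31));
    std::int64_t result = static_cast<std::int64_t>(n) << 16;
    for (int bit = 15; bit >= 0; --bit) {
        m = (m * m) >> 31;                       // square the mantissa: now in [1, 4)
        if (m >= (std::uint64_t{1} << 32)) {     // >= 2 means this fractional bit is 1
            result |= std::int64_t{1} << bit;
            m >>= 1;                             // renormalise back into [1, 2)
        }
    }
    return result;
}

// Shannon entropy (bits, Q16.16) of a histogram of counts:
//   H = log2(N) - (1/N) * sum(c_i * log2(c_i)),  N = sum(c_i).
// The sum is accumulated in a 128-bit integer, so any ordering of the
// counts produces the same bits.
std::int64_t entropy_q16(const std::vector<std::uint64_t>& counts) {
    __int128 weighted = 0;
    std::uint64_t total = 0;
    for (std::uint64_t c : counts) {
        if (c == 0) {
            continue;                            // empty bins contribute nothing
        }
        weighted += static_cast<__int128>(c) * log2_q16(c);
        total += c;
    }
    if (total == 0) {
        return 0;
    }
    const __int128 mean = weighted / static_cast<__int128>(total);
    return log2_q16(total) - static_cast<std::int64_t>(mean);
}
```

Because integer addition is associative, `entropy_q16({5, 3, 9, 1})` and `entropy_q16({1, 9, 3, 5})` are equal bit for bit. A floating-point sum offers no such guarantee once the order of the terms changes.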
| Tool | Minimum version |
|---|---|
| C++ compiler | GCC 15 or Clang 21 with C++23 support |
| CMake | 3.28 |
| Ninja | any recent |
| Conan | 2.x |
All library dependencies (spdlog, fmt, simdjson, GTest, nlohmann_json) are resolved by Conan from Conan Center Index — nothing needs to be installed manually.
For local CodeRoast workspace iteration, use the parent malf helper from the
repo root:

    malf build .
    malf test .

    # Install deps and generate CMake presets into build/
    conan install . \
      --output-folder=build \
      --build=missing \
      --profile:host=linux-gcc15-release \
      --profile:build=linux-gcc15-release

A build/CMakePresets.json will be generated. The repo root
CMakeUserPresets.json includes it automatically, so IDEs pick up the
presets without extra configuration.

    cmake --preset conan-release
    cmake --build --preset conan-release

    ctest --preset conan-release --output-on-failure

Each tagged release attaches an insight_canon-X.Y.Z.tgz produced by
conan cache save. Restore it into your local cache:

    # Download (requires gh CLI or manual download)
    gh release download vX.Y.Z \
      --repo CodeRoasted/insight-canon \
      --pattern 'insight_canon-*.tgz' \
      --dir /tmp/
    conan cache restore /tmp/insight_canon-X.Y.Z.tgz

To build the package from source instead:

    git clone https://github.com/CodeRoasted/insight-canon.git
    cd insight-canon
    conan create . --profile:host=linux-gcc15-release --profile:build=linux-gcc15-release --build=missing

In your CMakeLists.txt:

    find_package(insight_canon REQUIRED CONFIG)
    target_link_libraries(my_target PRIVATE insight::canon)

In your conanfile.py:

    def requirements(self):
        self.requires("insight_canon/<version>")  # pin to the current release

insight-canon is the upstream tokenization layer of the MetaLog pipeline.
insight-canon/
├── api/ PUBLIC (installed) module interface units
│ └── insight/
│ ├── canon.internal.cppm insight.canon.internal — std manifest
│ ├── canon.api.cppm insight.canon.api — the contract (types, det_math,
│ │ CanonicalEvent, DrainConfig, arena, utils, logging accessors)
│ ├── canon.cppm insight.canon — the facade (Tokenizer)
│ └── utils/log_macros.hpp textual INSIGHT_LOG_* macro layer (installed header)
├── src/ SEALED detail shards (build-only, never installed) + impl units
│ └── insight/
│ ├── scan/ insight.canon.detail.scan — fast_gates predicates + SSE2 sv_* scans
│ ├── strategy/ insight.canon.detail.strategy — IFormatStrategy + 20 format strategies
│ ├── drain/ insight.canon.detail.drain — the Drain template miner
│ ├── parse/ insight.canon.detail.parse — FormatDetector + LogParser
│ ├── tokenizer/ tokenizer_engine.cpp — facade impl unit (the Tokenizer seam)
│ ├── arena/ arena_allocator.cpp — api impl unit
│ └── utils/ logger / time_utils / failure_lexicon — api impl units
├── test_package/ Conan consumer smoke test (zero-init, import insight.canon only)
├── tests/ Per-domain mirror of src/ + the insight.canon.test aggregate module
│ ├── canon.test.cppm insight.canon.test — re-exports facade + all detail shards
│ ├── <domain>/ math/ arena/ utils/ drain/ strategy/ parse/ tokenizer/ — GTest suites
│ └── regression/ Loghub-dataset regression tests
├── benchmarks/ Benchmarks + the insight.canon.bench aggregate module
├── proof/ Public determinism proof gate (Approach B)
├── scripts/
│ └── download_logs.sh Download Loghub 2k + Zenodo datasets for regression
├── CMakeLists.txt Single root CMake file
├── conanfile.py Single Conan recipe
└── .github/workflows/ ci.yml release-publish.yml workflow-lint.yml
Module layering (the §11.9.11 pattern): internal ◀ api ◀ detail.{scan ◀ strategy ◀ parse, drain} ◀ facade. The facade interface never imports a detail shard; tokenizer_engine.cpp (a facade impl
unit) imports detail.{strategy,drain,parse} to assemble the pipeline — consumers just
import insight.canon;.
If you already have the dependencies available as CMake packages, you can configure directly:

    cmake -S . -B build \
      -DCMAKE_BUILD_TYPE=Release \
      -DINSIGHT_CANON_BUILD_TESTS=ON \
      -DCMAKE_CXX_STANDARD=23
    cmake --build build -j$(nproc)
    ctest --test-dir build --output-on-failure

| Option | Default | Description |
|---|---|---|
| INSIGHT_CANON_BUILD_TESTS | ON when top-level | Build unit and regression tests |
| INSIGHT_CANON_ENABLE_NUMA | OFF | Link libnuma for NUMA-aware arena allocation |
The project uses clang-format (.clang-format) and clang-tidy (.clang-tidy) with
settings checked in at the repo root.

    # Format all sources in-place
    clang-format -i $(find api src tests test_package -name '*.cpp' -o -name '*.hpp')
    # Lint (requires compile_commands.json in build/)
    clang-tidy -p build $(find src -name '*.cpp')

The regression suite in tests/regression/ tokenizes 16 real-world log files from the
Loghub 2k benchmark and asserts minimum per-dataset
parse-success rates. The test binary auto-skips if the data directory is absent, so a
normal conan create or ctest run is always clean without the datasets.
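
The skip-or-assert behaviour described above can be pictured roughly as follows. This is a sketch under stated assumptions, not the actual test code: `Verdict`, `min_success_rate`, and `check_dataset` are hypothetical names, and treating the threshold as a fraction in [0, 1] with a silent fallback on malformed values is an assumption. The environment variable name is the one this README documents for overriding the threshold.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <filesystem>
#include <system_error>

enum class Verdict { Skipped, Passed, Failed };

// Read the override from the environment; fall back to the per-dataset
// default when the variable is unset, malformed, or outside [0, 1] (assumed).
double min_success_rate(double fallback) {
    const char* raw = std::getenv("INSIGHT_TOKENIZER_REGRESSION_MIN_SUCCESS_RATE");
    if (raw == nullptr || *raw == '\0') {
        return fallback;
    }
    char* end = nullptr;
    const double value = std::strtod(raw, &end);
    const bool parsed_fully = (end != raw) && (*end == '\0');
    const bool in_range = (value >= 0.0) && (value <= 1.0);  // also rejects NaN
    return (parsed_fully && in_range) ? value : fallback;
}

// Skip when the data directory is absent; otherwise compare the observed
// parse-success rate against the effective threshold.
Verdict check_dataset(const std::filesystem::path& data_dir,
                      std::size_t parsed_lines,
                      std::size_t total_lines,
                      double fallback_threshold) {
    assert(parsed_lines <= total_lines);
    std::error_code ec;  // non-throwing overload: errors count as "absent"
    if (!std::filesystem::is_directory(data_dir, ec)) {
        return Verdict::Skipped;
    }
    if (total_lines == 0) {
        return Verdict::Failed;
    }
    const double rate =
        static_cast<double>(parsed_lines) / static_cast<double>(total_lines);
    return rate >= min_success_rate(fallback_threshold) ? Verdict::Passed
                                                        : Verdict::Failed;
}
```

With a default of 0.90, a dataset that parses 95 of 100 lines passes. Exporting the variable with a value of 0.99 turns the same result into a failure, and a missing data directory yields `Verdict::Skipped` in both cases.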

    bash scripts/download_logs.sh

This populates:
data/logs/loghub/ ← 16 × *_2k.log files from the logpai/loghub GitHub repo
data/logs/zenodo/ ← extended archive from Zenodo record 18522101
Requires curl and either unzip or bsdtar.
The test binary looks for data/logs/loghub/ relative to its working directory.
When running via CTest the working directory is the build folder, so symlink or
copy the data directory there, or set the working directory explicitly:

    # Conan workflow: run from the build output directory
    cd build/<preset-dir>
    ln -s ../../data data   # or: cp -r ../../data data
    ctest --output-on-failure -R regression

    # CMake direct workflow
    cmake --build build -j$(nproc)
    cd build
    ln -s ../data data
    ctest --output-on-failure -R regression

To override the minimum success threshold for all datasets:

    INSIGHT_TOKENIZER_REGRESSION_MIN_SUCCESS_RATE=0.90 ctest --output-on-failure -R regression

| Workflow | Trigger | What it does |
|---|---|---|
| ci.yml | PR touching api/, src/, tests/, CMakeLists.txt, or conanfile.py | conan create: builds the library, runs unit + regression tests, runs the test_package smoke test |
| release-publish.yml | Push of a vX.Y.Z tag (or manual dispatch) | Verifies recipe version matches tag, builds, exports a conan cache save tarball, attaches it to the GitHub Release |
| workflow-lint.yml | PR touching .github/workflows/** | Runs actionlint on all workflow files |
Pipeline reference for tokenization lives in technical_docs/.
Apache-2.0 — see LICENSE.