Skip to content

Level-2 enrichment: integrate Jelly and build a dataflow/taint analysis layer #2

@rahlk

Description

@rahlk

Level-2 enrichment: integrate Jelly and build a dataflow/taint analysis layer

PROBLEM

The level-2 analysis enrichment is currently a stub. The original plan targeted
CodeQL, which was never implemented and is unsuitable: it requires an external,
proprietary engine that cannot be embedded in the self-contained cants binary
produced by bun --compile.

We will replace it with Jelly (https://github.com/cs-au-dk/jelly, Aarhus
University). Jelly is a pure-TypeScript static analyzer that computes a call
graph, flow-insensitive (Andersen-style) points-to information, and library
access paths. It is BSD-3-Clause licensed with no native dependencies, so it can
run in-process inside the cants binary. This was verified by a spike: Jelly
builds and runs under Bun, exposes a usable library API (analyzeFiles, Solver,
fragmentState.subsetEdges), and was successfully embedded in a compiled
single-file binary that ran the analysis with no Node runtime and no node_modules.

Jelly provides a call graph and points-to information, but not dataflow or taint.
On top of Jelly we will build a flow-sensitive, relational, summary-based System
Dependence Graph (SDG) and expose taint analysis as a query over it. The design
uses hammock (single-entry, multi-exit) region decomposition with bottom-up
relational summaries, treating Jelly as a fixed call-graph and points-to oracle.
Sources, sinks, sanitizers, and library models are supplied as data (not
hardcoded), and the same relational-summary format represents both
user-specified models and computed function summaries.

This issue covers three pieces of work:

  1. Jelly integration (level-2 call-graph enrichment) plus the taint/dataflow layer.
  2. Removal of the dead CodeQL references from code and docs.
  3. Addition of the missing Apache-2.0 LICENSE file.

PART 1 - JELLY INTEGRATION AND DATAFLOW LAYER

Substrate decisions (locked):

  • Depend on @cs-au-dk/jelly as a package and deep-import its compiled lib/.
    Do not vendor its source. The published package ships lib/ with no
    exports restriction, so deep imports are supported.
  • Treat Jelly as a frozen oracle: read its solved state (call graph, points-to,
    access paths); never modify its solver.
  • Build a flow-sensitive, relational summary layer on top. This is
    flow-sensitive on value flow and locals, and flow-insensitive on the heap
    (heap precision is capped by Jelly's Andersen solve).

Steps:

  1. Add @cs-au-dk/jelly and @babel/preset-typescript (dev) as dependencies.
  2. Check in a bun patch for the one bun --compile blocker (see CAVEATS).
  3. Wire analyzeFiles/Solver behind a flag; run one whole-program solve.
  4. Build the identity-mapping layer: map Jelly's location-based and
    access-path-based node identities to our canonical signature keys (the keys
    produced by computeSignatureForDecl and used by symbol_table / call_graph).
  5. Merge Jelly's call graph into the existing call_graph with provenance
    "jelly", reconciled against the level-1 tsc-resolver edges.
  6. Record global and module-level state (module bindings, singletons, exported
    mutables) and their read/write sites; these become summary inputs/outputs.
  7. Condense the call graph into strongly connected components (Tarjan).
  8. Build an exceptional control-flow graph per function (edges for return,
    throw, break, continue, yield, await), then decompose each CFG into hammock
    regions: single-entry, multi-exit.
  9. Process regions bottom-up (innermost first). For each region build a local
    relational dataflow over access-path vertices and summarize it as a set of
    labeled edges (source vertex -> destination vertex) indexed by exit, plus a
    global read/write footprint.
  10. Collapse each region into a summary node in the enclosing CFG, splicing each
    exceptional exit to the nearest enclosing handler or propagating it outward.
  11. Compose function summaries bottom-up over the SCC-condensation DAG. At each
    call site, apply the callee summary by binding formals to actuals via
    access-path rewriting and splicing exits.
  12. Within each SCC (mutual recursion), iterate summary computation to a
    monotone fixpoint; summaries of recursive members are co-defined.
  13. Represent external/library behavior as model summaries in the same
    relational format. Unmodeled externals default to a conservative
    pass-through (all arguments and reachable heap flow to the return value and
    to external state).
  14. Merge function summaries, region summaries, global-state effects, and model
    summaries into one System Dependence Graph keyed by canonical signatures.
  15. Expose taint analysis as a query over the SDG: seed at sources, propagate
    labeled reachability along dependence edges, block at sanitizers that lie on
    the path (flow-sensitive), and report a flow when a source-kind label
    reaches a matching sink. Reconstruct witness paths lazily over the reverse
    value-flow edges.
  16. Make sources, sinks, sanitizers, and pass-through models configurable as
    data via CLI flags and a JSON spec file (with stdin support), validated
    against a JSON Schema. Precedence: built-in pack, then config file, then
    inline flags. Report the matching model id per flow for explainability.
  17. Emit a new taint_flows section in analysis.json (source, sink, rule,
    sanitized flag, and path). Co-evolve the matching python-sdk Pydantic models.
  18. After an initial call-graph-only MVP, upgrade propagation to be points-to
    backed (alias-aware) over Jelly's constraint graph.

PART 2 - REMOVE CODEQL REFERENCES

The CodeQL path was only ever a stub. Retarget all references to Jelly.

  • Rename src/semantic_analysis/codeql/ to jelly/ and buildCodeqlCallGraph to
    buildJellyCallGraph.
  • Update src/core.ts (level-2 enrich call and comments).
  • Update src/cli.ts (--analysis-level help text: "2 = + CodeQL enrichment").
  • Update src/options/options.ts (level-2 doc comment).
  • Update comments in src/semantic_analysis/callGraph.ts and index.ts.
  • Update README.md (--help block, "Deeper analysis" example, level-2 prose).

PART 3 - ADD LICENSE

The LICENSE file is missing, though README links to it and both package.json and
pyproject.toml declare Apache-2.0.

  • Add a verbatim Apache-2.0 LICENSE file (matching the codeanalyzer-python and
    codeanalyzer-java siblings).
  • Add a NOTICE / third-party-license entry crediting @cs-au-dk/jelly
    (BSD-3-Clause) for attribution in the compiled binary.

CAVEATS AND KNOWN RISKS

  • bun --compile and Babel: Jelly's parser passes Babel transform plugins by
    string name, which the bundler cannot statically resolve, so a naive compiled
    binary fails at parse time with "Cannot find module". Fix via a checked-in
    bun patch that passes imported plugin objects instead of string ids; also
    add @babel/preset-typescript (a dead-branch require inside @babel/core) as a
    dependency. A CJS/ESM interop quirk (the default export arriving as
    default.default under --compile) is handled in our own glue, not Jelly's.
    Upstreaming the plugin fix is worthwhile: Jelly's own standalone binary build
    likely has the same latent bug, and a merge would let us drop the patch.

  • Deep-import stability: we depend on Jelly's internal module paths (e.g.
    lib/analysis/solver.js). If upstream later adds an exports map, these break.
    Track this and request a stable, supported entrypoint upstream.

  • Two parsing foundations: level-1 uses the tsc (ts-morph) type-aware resolver;
    Jelly parses with Babel (type-stripping, not type-aware). The call graphs
    must be reconciled in the merge step.

  • Identity model mismatch: Jelly node identities are location-based and
    access-path-based; our call_graph and symbol_table are keyed by canonical
    signatures. The identity-mapping layer is the required bridge and is on the
    critical path for both call-graph merge and taint.

  • Termination: the relational summaries use access-path vertices, so the
    interprocedural fixpoint does not terminate without bounding access-path
    depth. k-limiting is mandatory (CLI knob, e.g. --taint-field-depth, default
    3); taint-kind sets are bounded bitsets.

  • Precision: the analysis is intentionally over-approximate (sound-leaning,
    false-positive prone). Precision is recovered downstream by ranking/pruning;
    the engine should not trade soundness for a lower false-positive rate.

  • Inherited unsoundness: dynamic eval, reflection, and monkey-patching are not
    modeled (Jelly limitation), and npm type declarations are not reliable enough
    to trust for soundness.

  • Incrementality: aspirational, not in the initial scope. Jelly runs
    whole-program from scratch each time, which is the main obstacle. Record
    summary-to-fact dependency edges and content-hash summaries from the start so
    incremental re-analysis can be switched on later without a rewrite.

  • Binary size: embedding Jelly (with Babel and TypeScript) produced a roughly
    78 MB compiled binary in the spike, up from the current cants size. Confirm
    this is acceptable.

SUGGESTED STAGING

PR A LICENSE plus CodeQL removal (independent, no Jelly dependency).
PR B Jelly dependency, bun patch, identity mapping, call-graph merge; CI
proves the compiled binary runs Jelly in-process.
PR C MVP taint: relational summaries, hammock regions, SCC fixpoint with
k-limiting; demonstrate one source-to-sink flow with a correctly applied
sanitizer.
PR D Configurable models: CLI, JSON Schema validation, default model pack.
PR E taint_flows output and lazy witness reconstruction.
PR F Points-to-backed (alias-aware) propagation.
PR G Incremental re-analysis (switch on the recorded dependency edges).

Context: this supersedes the original CodeQL-based level-2 plan. The
single-binary goal is the driver; Jelly is the only credible engine that runs in
pure JavaScript and can live inside the cants binary.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions