Skip to content

santhsecurity/flashsieve

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

23 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

flashsieve

Part of Santh - open source Rust security and infrastructure tooling. Follow @SanthProject on X.

crates.io docs.rs license

Storage-level pre-filtering for pattern matching.

flashsieve builds per-block byte histograms and 2-byte n-gram bloom filters, then uses them to answer which blocks might contain matches. If a block cannot contain a pattern, it is skipped entirely—saving CPU, I/O, and memory at scale.


Installation

[dependencies]
flashsieve = "0.1"

Minimum supported Rust version: 1.85.


Quick Start

use flashsieve::{BlockIndexBuilder, ByteFilter, NgramFilter};

let mut block_a = vec![b'x'; 256];
let mut block_b = vec![b'y'; 256];
block_a[..6].copy_from_slice(b"secret");
block_b[..5].copy_from_slice(b"token");
let patterns: [&[u8]; 2] = [b"secret", b"token"];

let index = BlockIndexBuilder::new()
    .block_size(256)
    .bloom_bits(512)
    .build_streaming([block_a, block_b].into_iter())?;

let byte_filter = ByteFilter::from_patterns(&patterns);
let ngram_filter = NgramFilter::from_patterns(&patterns);

let candidates = index.candidate_blocks(&byte_filter, &ngram_filter);
assert_eq!(candidates.len(), 1);
assert_eq!(candidates[0].length, 512);
# Ok::<(), flashsieve::Error>(())

What It Does

Component Purpose
BlockIndexBuilder Constructs indexes from raw bytes or streaming blocks
BlockIndex In-memory index with per-block histograms & bloom filters
FileBloomIndex File-level bloom union for fast n-gram rejection before per-block scans
MmapBlockIndex Zero-parse, zero-copy queries over serialized indexes
IncrementalBuilder Append new blocks to existing indexes without rebuilding
ByteFilter Reject blocks that don't contain all required bytes
NgramFilter Reject blocks that don't contain all required 2-byte n-grams

Bloom Filter Math

The theoretical false-positive rate for a Bloom filter with:

  • m = number of bits
  • n = number of distinct n-grams inserted
  • k = number of hash functions (k = 3 in this crate)

is approximately:

FPR ≈ (1 - e^(-kn/m))^k

For the optimal number of hash functions k = (m/n) × ln(2):

FPR ≈ 0.6185^(m/n)

For filters with ≥4096 bits, flashsieve allocates an exact 65,536-bit pair table, giving zero false positives for all possible 2-byte n-gram queries.


Safety

This crate keeps unsafe usage narrowly scoped to hot-path bloom-filter word loads where index validity is proven by construction. All serialized input is validated (magic, version, checksum, bounds) before use—corrupt files return typed errors rather than panicking.


License

Licensed under the MIT license (LICENSE or https://opensource.org/licenses/MIT).

About

Storage-level pre-filtering for pattern matching

Topics

Resources

License

Contributing

Security policy

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages