"You gotta do it, you can do it, but you just don't wanna do it."
PyNorma is a Python library that provides insights and automation for preprocessing messy, real-world tabular data. It's designed for data scientists, analysts, and anyone who's tired of the tedious task of cleaning up unstructured spreadsheets.
- Ensemble Table Detection: 6 competing strategies auto-detect the data region within messy files, scored and selected by an internal quality metric — no ground truth required.
- Multi-Table Detection: Automatically finds multiple tables within a single sheet by detecting empty-row gaps between data segments.
- Pipeline API: A fluent, chainable interface connects detection → cleaning → transformation in one call.
- 1NF Violation Detection: Automatically identifies columns containing multi-valued cells (e.g., "apple, banana") using atom overlap-ratio analysis.
- Preprocessing Toolkit:
  - Atomizer: Splits cells with multiple values into distinct rows or columns (1NF normalization).
  - Flattener: Converts wide, multi-level header tables into a tidy, long format.
  - Clarifier: Standardizes data values using a custom dictionary mapping.
  - Merger: Deduplicates rows by summing numeric columns.
  - Appender: Vertically concatenates DataFrames with smart header alignment.
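To make the Atomizer and Merger ideas concrete, here is a minimal sketch in plain Python over lists of dicts. The helpers `atomize_rows` and `merge_rows` are hypothetical names invented for this illustration and are not PyNorma's API; the library itself works on pandas DataFrames and may handle edge cases differently.

```python
def atomize_rows(rows, col, delimiter=","):
    """Split a multi-valued cell into one row per atom (hypothetical helper)."""
    out = []
    for row in rows:
        atoms = [a.strip() for a in str(row[col]).split(delimiter) if a.strip()]
        # Keep the row even if the cell was empty.
        for atom in atoms or [""]:
            out.append({**row, col: atom})
    return out


def merge_rows(rows, sum_columns):
    """Deduplicate rows on the non-summed columns, summing the numeric ones."""
    totals = {}
    for row in rows:
        key = tuple((k, v) for k, v in row.items() if k not in sum_columns)
        bucket = totals.setdefault(key, {c: 0 for c in sum_columns})
        for c in sum_columns:
            bucket[c] += row[c]
    return [{**dict(key), **sums} for key, sums in totals.items()]


rows = [
    {"Store": "A", "Tags": "apple, banana", "Sales": 10},
    {"Store": "A", "Tags": "apple", "Sales": 5},
]
atomized = atomize_rows(rows, "Tags")          # 3 rows, one tag per row
merged = merge_rows(atomized, sum_columns=["Sales"])
print(merged)
```

Note that this sketch copies the numeric value onto every atomized row, so summing afterwards counts it once per atom. Whether that is the right behavior depends on the data.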
Install:

    pip install pynorma

Quick start:

    from pynorma import Pipeline

    df = Pipeline("messy_report.xlsx").run()
    print(df.head())

Step-by-step pipeline:

    from pynorma import Pipeline

    # Column names in this example are Korean: "업종" = industry, "매출" = revenue
    df = (Pipeline("data.xlsx", strategy="D")
          .detect()                                           # Find table regions
          .clean()                                            # Extract & clean
          .atomize(cols=["Tags"], delimiter=",")              # Explode multi-valued cells
          .clarify("업종", "dict.csv", sum_columns=["매출"])   # Standardize values
          .merge(sum_columns=["매출"])                         # Deduplicate
          .result())

Multi-table sheets:

    p = Pipeline("multi_table_sheet.xlsx")
    p.detect().clean()
    for i, table in enumerate(p.all_tables()):
        print(f"Table {i}: {table.shape}")

1NF violation detection:

    from pynorma import detect_multivalue_columns

    result = detect_multivalue_columns(df)
    # [('Fruits', ',', 1.0)] → Fruits column has comma-separated lists

Legacy API:

    import pynorma

    # parse() now uses ensemble detection internally (with legacy fallback)
    df = pynorma.parse("data.csv", trim="auto")

Processing flow:

    Raw File ──→ Detection ──→ Cleaning ──→ DataFrame ──→ Preprocessor ──→ Clean Output
                 6 strategies  common       pandas        atomize
                 multi-table   pipeline                   clarify
                 quality_score                            merge / flatten
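Chained calls of this kind generally work because each stage returns the pipeline object itself. The sketch below shows that pattern with a toy class; `MiniPipeline` and its methods are hypothetical and say nothing about how PyNorma's `Pipeline` is actually implemented.

```python
class MiniPipeline:
    """Hypothetical fluent wrapper; not PyNorma's Pipeline class."""

    def __init__(self, rows):
        self.rows = list(rows)
        self.log = []  # names of the stages that have run, in order

    def drop_empty(self):
        # Remove rows in which every cell is None or an empty string.
        self.rows = [r for r in self.rows if any(c not in (None, "") for c in r)]
        self.log.append("drop_empty")
        return self  # returning self is what makes the calls chainable

    def strip_text(self):
        # Trim surrounding whitespace from string cells only.
        self.rows = [
            [c.strip() if isinstance(c, str) else c for c in r] for r in self.rows
        ]
        self.log.append("strip_text")
        return self

    def result(self):
        return self.rows


cleaned = (MiniPipeline([[" apple ", 3], [None, ""], ["banana", 5]])
           .drop_empty()
           .strip_text()
           .result())
print(cleaned)
```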
PyNorma reduces the table detection problem to finding N tables, each described by 5 integers:
(header, top, left, bottom, right)
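As an illustration of how five integers can pin down a table, the sketch below slices a region out of a 2-D grid. The `Region` class and `extract` function are hypothetical stand-ins, and the conventions used here (zero-based indices, inclusive bounds, `header` as the index of the header row) are assumptions, not PyNorma's documented behavior.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    """Illustrative stand-in for a table region; not PyNorma's TableRegion."""
    header: int  # row index of the header row
    top: int     # first data row (assumed inclusive)
    left: int    # first column (assumed inclusive)
    bottom: int  # last data row (assumed inclusive)
    right: int   # last column (assumed inclusive)


def extract(grid, region):
    """Return (column names, data rows) for a region of a 2-D list grid."""
    cols = slice(region.left, region.right + 1)
    names = grid[region.header][cols]
    data = [row[cols] for row in grid[region.top:region.bottom + 1]]
    return names, data


grid = [
    ["Monthly report", None, None],   # title row, not part of the table
    [None, None, None],
    [None, "Item", "Qty"],            # header row
    [None, "apple", 3],
    [None, "banana", 5],
    ["Total", None, 8],               # footer row, excluded by `bottom`
]
names, data = extract(grid, Region(header=2, top=3, left=1, bottom=4, right=2))
print(names, data)
```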
Six strategies compete on each file, and the best is selected via a ground-truth-free quality_score based on type consistency, fill uniformity, header confidence, boundary sharpness, coverage, and region size.
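The selection step can be pictured as "score every candidate region, keep the highest". The sketch below uses a deliberately simplified score built from only two of the listed signals (type consistency and fill ratio) with equal weights. The function names, the weighting, and the candidate format are all assumptions made for illustration; PyNorma's actual quality_score combines more signals.

```python
from collections import Counter


def kind(cell):
    """Coarse cell type used for the consistency check."""
    if cell is None or cell == "":
        return "empty"
    if isinstance(cell, (int, float)):
        return "number"
    return "text"


def toy_quality_score(rows):
    """Average of type consistency and fill ratio, both in [0, 1]."""
    if not rows or not rows[0]:
        return 0.0
    columns = list(zip(*rows))
    consistency = 0.0
    filled = 0
    for col in columns:
        kinds = [kind(c) for c in col if kind(c) != "empty"]
        filled += len(kinds)
        if kinds:
            # Share of non-empty cells that have the column's dominant type.
            consistency += Counter(kinds).most_common(1)[0][1] / len(kinds)
    consistency /= len(columns)
    fill = filled / (len(rows) * len(columns))
    return (consistency + fill) / 2


def pick_best(grid, candidates):
    """candidates maps a strategy name to (top, left, bottom, right), inclusive."""
    def score(box):
        top, left, bottom, right = box
        return toy_quality_score([r[left:right + 1] for r in grid[top:bottom + 1]])
    return max(candidates, key=lambda name: score(candidates[name]))


grid = [
    ["Report", None, None],
    [None, "apple", 3],
    [None, "banana", 5],
]
candidates = {"whole_sheet": (0, 0, 2, 2), "tight_box": (1, 1, 2, 2)}
best = pick_best(grid, candidates)
print(best)
```

No labeled answer is needed here: the tighter box wins only because its cells are fully filled and type-consistent.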
For multi-table sheets, gap-based splitting runs first — consecutive empty rows signal table boundaries.
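A minimal sketch of gap-based splitting follows. The function name `split_by_empty_rows` and the `min_gap` parameter are hypothetical; how many empty rows PyNorma requires, and how it treats whitespace-only cells, are not stated in this document.

```python
def is_empty(row):
    """A row is empty if every cell is None or whitespace-only."""
    return all(cell is None or str(cell).strip() == "" for cell in row)


def split_by_empty_rows(rows, min_gap=1):
    """Return inclusive (start, end) row ranges separated by >= min_gap empty rows."""
    segments = []
    start = None   # first row of the segment being built
    last = None    # last non-empty row seen in that segment
    gap = 0        # length of the current run of empty rows
    for i, row in enumerate(rows):
        if is_empty(row):
            gap += 1
            if start is not None and gap >= min_gap:
                segments.append((start, last))
                start = None
        else:
            if start is None:
                start = i
            last = i
            gap = 0
    if start is not None:
        segments.append((start, last))
    return segments


sheet = [
    ["Name", "Qty"],
    ["apple", 3],
    [None, None],
    [None, None],
    ["Region", "Sales"],
    ["East", 100],
]
segments = split_by_empty_rows(sheet, min_gap=2)
print(segments)
```

Each returned range can then be handed to the per-table detection step on its own.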
| Strategy | Approach | Avg Score |
|---|---|---|
| D_Pattern | Regex normalization + type ratio | 0.991 |
| C_Gradient | Density gradient boundary detection | 0.989 |
| B_Entropy | Entropy jump detection | 0.967 |
| F_Voting | Column-type voting | 0.964 |
| A_Rules | Fixed heuristic rules | 0.960 |
| E_Window | Sliding window density | 0.916 |
Benchmarked on 36 specimens (24 real-world + 12 adversarial edge cases) with 60 regression tests.
detect_multivalue_columns() uses atom overlap-ratio analysis:
- For each column, tries candidate delimiters (`,` `;` `/` `|`)
- Splits cells into atoms and checks cross-cell overlap
- High overlap ratio = multi-valued column (atoms appear in other cells)
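The steps above can be sketched as follows. This is one plausible reading of "overlap ratio" (the share of atoms from split cells that also occur in another cell), with a hypothetical `threshold` of 0.5. The names `overlap_ratio` and `sketch_detect` are invented for this example, and the real `detect_multivalue_columns()` may define the ratio differently.

```python
from collections import Counter

DELIMITERS = [",", ";", "/", "|"]


def overlap_ratio(cells, delimiter):
    """Share of atoms from split cells that also occur in some other cell."""
    atom_sets = [
        {a.strip() for a in str(c).split(delimiter) if a.strip()} for c in cells
    ]
    # For each atom, count how many cells contain it.
    cell_counts = Counter(atom for atoms in atom_sets for atom in atoms)
    multi = [atoms for atoms in atom_sets if len(atoms) > 1]
    total = sum(len(atoms) for atoms in multi)
    if total == 0:
        return 0.0
    shared = sum(1 for atoms in multi for atom in atoms if cell_counts[atom] > 1)
    return shared / total


def sketch_detect(table, threshold=0.5):
    """table maps column name -> list of cell values."""
    found = []
    for name, cells in table.items():
        # Keep the delimiter with the highest overlap ratio for this column.
        ratio, delim = max((overlap_ratio(cells, d), d) for d in DELIMITERS)
        if ratio >= threshold:
            found.append((name, delim, ratio))
    return found


table = {
    "Fruits": ["apple, banana", "banana, cherry", "apple, cherry"],
    "Address": ["12 Main St, Springfield", "9 Elm Rd, Shelbyville",
                "4 Oak Ave, Ogdenville"],
}
detected = sketch_detect(table)
print(detected)
```

The Address column also contains commas, but its atoms never repeat across cells, so it is not flagged. That is the point of checking overlap rather than merely looking for a delimiter.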
Project structure:

    pynorma/
    ├── pynorma/               # Main package
    │   ├── pipeline.py        # Pipeline API (detection → preprocessor)
    │   ├── io/                # File I/O (CSV, XLSX)
    │   │   └── trimmer.py     # Ensemble detection → legacy fallback
    │   ├── detect/            # Table region detection (legacy)
    │   └── preprocessor/      # Atomizer, Clarifier, Merger, Flattener, Appender
    ├── specimen/              # Test data (36 files)
    │   └── benchmark/         # Ensemble detection framework
    │       ├── core.py        # TableRegion, quality_score, split_tables_by_gap
    │       ├── preprocess.py  # detect() + preprocess() (multi-table aware)
    │       ├── strategies/    # 6 competing detection strategies
    │       └── tests/         # 60 regression tests
    └── tests/                 # Package-level tests
nash-dir (https://github.com/nash-dir)
This project is licensed under the MIT License.