PyNorma

"You gotta do it, you can do it, but you just don't wanna do it."

PyNorma is a Python library that provides insights and automation for preprocessing messy, real-world tabular data. It's designed for data scientists, analysts, and anyone who's tired of the tedious task of cleaning up unstructured spreadsheets.

Key Features

Ensemble Table Detection: Six competing strategies auto-detect the data region within messy files; an internal quality metric scores them and selects the best, with no ground truth required.
Multi-Table Detection: Automatically finds multiple tables within a single sheet by detecting empty-row gaps between data segments.
Pipeline API: A fluent, chainable interface connects detection → cleaning → transformation in one call.
1NF Violation Detection: Automatically identifies columns containing multi-valued cells (e.g., "apple, banana") using atom overlap-ratio analysis.
Preprocessing Toolkit:
- Atomizer: Splits cells with multiple values into distinct rows or columns (1NF normalization).
- Flattener: Converts wide, multi-level header tables into a tidy, long format.
- Clarifier: Standardizes data values using a custom dictionary mapping.
- Merger: Deduplicates rows by summing numeric columns.
- Appender: Vertically concatenates DataFrames with smart header alignment.
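To make the Atomizer and Merger semantics concrete, here is a minimal plain-Python sketch of the two operations on a list of row dictionaries. It is not PyNorma's implementation: the helper names atomize_rows and merge_rows, the sample columns, and the choice to treat every non-summed column as part of the duplicate key are all assumptions made for illustration.

```python
def atomize_rows(rows, col, delimiter=","):
    """Split a multi-valued cell into one row per value (1NF)."""
    out = []
    for row in rows:
        for atom in str(row[col]).split(delimiter):
            # Copy the row and replace the multi-valued cell with one atom.
            out.append({**row, col: atom.strip()})
    return out


def merge_rows(rows, sum_columns):
    """Collapse rows that match on every non-summed column, summing the rest."""
    totals = {}
    for row in rows:
        # Assumption: the duplicate key is every column not being summed.
        key = tuple((k, v) for k, v in row.items() if k not in sum_columns)
        if key not in totals:
            totals[key] = {**row}
        else:
            for c in sum_columns:
                totals[key][c] += row[c]
    return list(totals.values())


# Hypothetical sample data.
rows = [
    {"Store": "A", "Tags": "apple, banana", "Sales": 10},
    {"Store": "A", "Tags": "apple", "Sales": 5},
]
atomized = atomize_rows(rows, "Tags")
merged = merge_rows(atomized, sum_columns=["Sales"])
for row in merged:
    print(row)
```

Note that atomizing repeats the numeric value on every resulting row, so the order in which atomize and merge run changes the totals.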

Installation

pip install pynorma

Quickstart

One-liner (auto-detect everything)

from pynorma import Pipeline

df = Pipeline("messy_report.xlsx").run()
print(df.head())

Full control

from pynorma import Pipeline

df = (Pipeline("data.xlsx", strategy="D")
      .detect()                                    # Find table regions
      .clean()                                     # Extract & clean
      .atomize(cols=["Tags"], delimiter=",")        # Explode multi-valued cells
      .clarify("업종", "dict.csv", sum_columns=["매출"])  # Standardize values ("업종" = industry, "매출" = sales)
      .merge(sum_columns=["매출"])                  # Deduplicate by summing "매출"
      .result())

Multi-table sheets

from pynorma import Pipeline

p = Pipeline("multi_table_sheet.xlsx")
p.detect().clean()

for i, table in enumerate(p.all_tables()):
    print(f"Table {i}: {table.shape}")

Detect multi-valued columns (1NF violations)

from pynorma import detect_multivalue_columns

# df is any pandas DataFrame, e.g. the output of Pipeline(...).run()
result = detect_multivalue_columns(df)
# e.g. [('Fruits', ',', 1.0)]: the Fruits column holds comma-separated lists

Legacy API

import pynorma

# parse() now uses ensemble detection internally (with legacy fallback)
df = pynorma.parse("data.csv", trim="auto")

Architecture

Raw File ──→ Detection ──→ Cleaning ──→ DataFrame ──→ Preprocessor ──→ Clean Output
             6 strategies    common       pandas       atomize
             multi-table     pipeline                  clarify
             quality_score                             merge / flatten

Detection: Table as N × 5 Integers

PyNorma reduces the table detection problem to finding N tables, each described by 5 integers:

(header, top, left, bottom, right)
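As a rough illustration of how five integers can pin down a table, the sketch below slices a region out of a 2-D grid of cells. The Region class and extract helper are hypothetical names, not PyNorma's TableRegion, and the sketch assumes zero-based indices with inclusive bottom and right bounds, which the source does not specify.

```python
from typing import NamedTuple


class Region(NamedTuple):
    header: int  # row index of the header line
    top: int     # first data row
    left: int    # first column
    bottom: int  # last data row (inclusive, by assumption)
    right: int   # last column (inclusive, by assumption)


def extract(grid, region):
    """Return (column names, data rows) for one region of a 2-D grid."""
    cols = slice(region.left, region.right + 1)
    names = grid[region.header][cols]
    data = [row[cols] for row in grid[region.top:region.bottom + 1]]
    return names, data


# Hypothetical sheet: a title row, a blank first column, and a trailing empty row.
grid = [
    ["Quarterly report", None, None],
    [None, "Item", "Qty"],
    [None, "apple", 3],
    [None, "banana", 5],
    [None, None, None],
]
names, data = extract(grid, Region(header=1, top=2, left=1, bottom=3, right=2))
print(names)
print(data)
```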

Six strategies compete on each file, and the best is selected via a ground-truth-free quality_score based on type consistency, fill uniformity, header confidence, boundary sharpness, coverage, and region size.

For multi-table sheets, gap-based splitting runs first — consecutive empty rows signal table boundaries.
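The gap-based idea can be sketched in a few lines of plain Python. This is an illustration only, not the library's split_tables_by_gap: the function name split_by_gap, the min_gap threshold, and the rule that None or whitespace-only cells count as empty are assumptions.

```python
def is_empty(row):
    """A row is empty when every cell is None or whitespace-only."""
    return all(cell is None or str(cell).strip() == "" for cell in row)


def split_by_gap(grid, min_gap=1):
    """Return inclusive (start, end) row ranges separated by >= min_gap empty rows."""
    segments = []
    start = None        # first row of the current segment
    last_filled = None  # last non-empty row seen so far
    gap = 0             # empty rows seen since last_filled
    for i, row in enumerate(grid):
        if is_empty(row):
            gap += 1
            continue
        if start is None:
            start = i
        elif gap >= min_gap:
            # The gap is wide enough: close the current segment, open a new one.
            segments.append((start, last_filled))
            start = i
        gap = 0
        last_filled = i
    if start is not None:
        segments.append((start, last_filled))
    return segments


# Hypothetical sheet with two tables separated by two empty rows.
grid = [
    ["a", "b"],
    [1, 2],
    [None, ""],
    [None, None],
    ["c", "d"],
    [3, 4],
    [None, None],
]
print(split_by_gap(grid, min_gap=2))
```

A gap narrower than min_gap stays inside the segment, so an occasional blank row within a table does not split it.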

Strategy	Approach	Avg Score
D_Pattern	Regex normalization + type ratio	0.991
C_Gradient	Density gradient boundary detection	0.989
B_Entropy	Entropy jump detection	0.967
F_Voting	Column-type voting	0.964
A_Rules	Fixed heuristic rules	0.960
E_Window	Sliding window density	0.916

Benchmarked on 36 specimens (24 real-world + 12 adversarial edge cases) with 60 regression tests.

1NF Violation Detection

detect_multivalue_columns() uses atom overlap-ratio analysis:

- For each column, try each candidate delimiter (, ; / |).
- Split cells into atoms and check how often atoms recur across cells.
- Treat a high overlap ratio (atoms that also appear in other cells) as the sign of a multi-valued column.
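The steps above can be sketched as follows. The exact ratio PyNorma computes is not given in this README, so the definition used here (the fraction of distinct atoms that occur in more than one cell) is an assumption, as are the helper names overlap_ratio and best_delimiter.

```python
from collections import Counter


def overlap_ratio(cells, delimiter):
    """Fraction of distinct atoms that occur in more than one cell.

    Returns 0.0 when no cell contains the delimiter at all.
    """
    if not any(delimiter in str(cell) for cell in cells):
        return 0.0
    # One set of atoms per cell, so each cell counts an atom at most once.
    split_cells = [
        {atom.strip() for atom in str(cell).split(delimiter) if atom.strip()}
        for cell in cells
    ]
    cell_count = Counter(atom for atoms in split_cells for atom in atoms)
    if not cell_count:
        return 0.0
    shared = sum(1 for n in cell_count.values() if n > 1)
    return shared / len(cell_count)


def best_delimiter(cells, candidates=",;/|"):
    """Return (delimiter, ratio) for the candidate with the highest overlap."""
    scored = [(overlap_ratio(cells, d), d) for d in candidates]
    ratio, delim = max(scored, key=lambda pair: pair[0])
    return delim, ratio


# Hypothetical columns: a list-like column and a free-text column.
fruits = ["apple, banana", "banana, cherry", "apple, cherry"]
notes = ["Paid in full, thanks", "Call back, urgent", "No issues"]

print(best_delimiter(fruits))
print(overlap_ratio(notes, ","))
```

The free-text column contains commas too, but its fragments never recur in other cells, which is what separates it from a genuine multi-valued column under this definition.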

Project Structure

pynorma/
├── pynorma/                    # Main package
│   ├── pipeline.py             # Pipeline API (detection → preprocessor)
│   ├── io/                     # File I/O (CSV, XLSX)
│   │   └── trimmer.py          # Ensemble detection → legacy fallback
│   ├── detect/                 # Table region detection (legacy)
│   └── preprocessor/           # Atomizer, Clarifier, Merger, Flattener, Appender
├── specimen/                   # Test data (36 files)
│   └── benchmark/              # Ensemble detection framework
│       ├── core.py             # TableRegion, quality_score, split_tables_by_gap
│       ├── preprocess.py       # detect() + preprocess() (multi-table aware)
│       ├── strategies/         # 6 competing detection strategies
│       └── tests/              # 60 regression tests
└── tests/                      # Package-level tests

Author

nash-dir (https://github.com/nash-dir)

License

This project is licensed under the MIT License.
