PyNorma

"You gotta do it, you can do it, but you just don't wanna do it."

PyNorma is a Python library that provides insights and automation for preprocessing messy, real-world tabular data. It's designed for data scientists, analysts, and anyone who's tired of the tedious task of cleaning up unstructured spreadsheets.

Key Features

  • Ensemble Table Detection: Six competing strategies each propose the data region within a messy file. An internal quality metric scores the candidates and selects the best one, with no ground truth required.
  • Multi-Table Detection: Automatically finds multiple tables within a single sheet by detecting empty-row gaps between data segments.
  • Pipeline API: A fluent, chainable interface connects detection → cleaning → transformation in one call.
  • 1NF Violation Detection: Automatically identifies columns containing multi-valued cells (e.g., "apple, banana") using atom overlap-ratio analysis.
  • Preprocessing Toolkit:
    • Atomizer: Splits cells with multiple values into distinct rows or columns (1NF normalization).
    • Flattener: Converts wide, multi-level header tables into a tidy, long format.
    • Clarifier: Standardizes data values using a custom dictionary mapping.
    • Merger: Collapses duplicate rows into one, summing their numeric columns.
    • Appender: Vertically concatenates DataFrames with smart header alignment.
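
To make the Atomizer and Merger descriptions concrete, here is a minimal plain-Python sketch of the two ideas on a list of row dictionaries. It does not call PyNorma. The helper names `atomize_rows` and `merge_rows` and the sample data are hypothetical, and the library may differ in details such as whitespace handling or which columns form the duplicate key.

```python
def atomize_rows(rows, col, delimiter=","):
    """Split multi-valued cells in `col` into one row per value (hypothetical helper)."""
    out = []
    for row in rows:
        for atom in str(row[col]).split(delimiter):
            new_row = dict(row)          # copy the other columns unchanged
            new_row[col] = atom.strip()  # one atomic value per row
            out.append(new_row)
    return out


def merge_rows(rows, sum_columns):
    """Collapse rows that agree on every non-summed column, adding up `sum_columns`."""
    merged = {}
    for row in rows:
        # Assumption: the duplicate key is every column that is not being summed.
        key = tuple((k, v) for k, v in row.items() if k not in sum_columns)
        if key not in merged:
            merged[key] = dict(row)
        else:
            for c in sum_columns:
                merged[key][c] += row[c]
    return list(merged.values())


rows = [
    {"Store": "A", "Tags": "apple, banana", "Sales": 10},
    {"Store": "A", "Tags": "apple", "Sales": 5},
]
atomized = atomize_rows(rows, "Tags")
merged = merge_rows(atomized, ["Sales"])
for row in merged:
    print(row)
```

Note that this naive atomizer copies the numeric value onto every exploded row, so whether summing afterwards is meaningful depends on the data.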

Installation

pip install pynorma

Quickstart

One-liner (auto-detect everything)

from pynorma import Pipeline

df = Pipeline("messy_report.xlsx").run()
print(df.head())

Full control

from pynorma import Pipeline

df = (Pipeline("data.xlsx", strategy="D")
      .detect()                                    # Find table regions
      .clean()                                     # Extract & clean
      .atomize(cols=["Tags"], delimiter=",")        # Explode multi-valued cells
      .clarify("업종", "dict.csv", sum_columns=["매출"])  # Standardize values ("업종" = industry, "매출" = sales)
      .merge(sum_columns=["매출"])                  # Deduplicate
      .result())

Multi-table sheets

from pynorma import Pipeline

p = Pipeline("multi_table_sheet.xlsx")
p.detect().clean()

for i, table in enumerate(p.all_tables()):
    print(f"Table {i}: {table.shape}")

Detect multi-valued columns (1NF violations)

from pynorma import detect_multivalue_columns

# df is an existing pandas DataFrame, e.g. the result of Pipeline(...).run()
result = detect_multivalue_columns(df)
# [('Fruits', ',', 1.0)]  → Fruits column has comma-separated lists

Legacy API

import pynorma

# parse() now uses ensemble detection internally (with legacy fallback)
df = pynorma.parse("data.csv", trim="auto")

Architecture

Raw File ──→ Detection ──→ Cleaning ──→ DataFrame ──→ Preprocessor ──→ Clean Output
             6 strategies    common       pandas       atomize
             multi-table     pipeline                  clarify
             quality_score                             merge / flatten

Detection: Table as N × 5 Integers

PyNorma reduces the table detection problem to finding N tables, each described by 5 integers:

(header, top, left, bottom, right)
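
The five-integer description can be pictured as a small record plus a slice over a grid of cells. The sketch below is an illustration only. The README does not spell out the field semantics, so it assumes `header` is the row index of the header row, `top`/`bottom` are the first and last data rows, and `left`/`right` are the first and last columns, all zero-based and inclusive. `Region` and `extract` are hypothetical names, not PyNorma's `TableRegion`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    # Assumed meaning: zero-based, inclusive indices into a 2-D grid of cells.
    header: int
    top: int
    left: int
    bottom: int
    right: int


def extract(grid, region):
    """Return (column_names, data_rows) for `region` within a list-of-lists grid."""
    cols = slice(region.left, region.right + 1)
    names = grid[region.header][cols]
    data = [row[cols] for row in grid[region.top:region.bottom + 1]]
    return names, data


grid = [
    ["Quarterly report", None, None],   # title row, outside the table
    [None, None, None],
    [None, "Item", "Qty"],              # header row
    [None, "apple", 3],
    [None, "banana", 5],
    ["footer note", None, None],        # trailing junk, outside the table
]
names, data = extract(grid, Region(header=2, top=3, left=1, bottom=4, right=2))
print(names, data)
```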

Six strategies compete on each file, and the best is selected via a ground-truth-free quality_score based on type consistency, fill uniformity, header confidence, boundary sharpness, coverage, and region size.
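
The "compete, score, select" loop can be sketched without reference to any particular strategy. The toy below uses two made-up strategies and a score with a single component (type consistency), whereas the real quality_score combines six. All names here (`type_consistency`, `select_best`, the strategy labels) are hypothetical and do not reflect PyNorma's internals.

```python
def type_consistency(rows):
    """Fraction of columns whose non-empty cells all share one Python type."""
    if not rows:
        return 0.0
    width = len(rows[0])
    consistent = 0
    for c in range(width):
        types = {type(r[c]) for r in rows if r[c] is not None}
        if len(types) == 1:
            consistent += 1
    return consistent / width


def select_best(grid, strategies, score):
    """Run every strategy, score each candidate, and return (name, rows) of the winner."""
    candidates = [(name, fn(grid)) for name, fn in strategies.items()]
    return max(candidates, key=lambda item: score(item[1]))


# Two toy strategies: each returns the rows it believes form the table body.
strategies = {
    "whole_sheet": lambda g: g,
    "skip_first_row": lambda g: g[1:],
}

grid = [
    ["Report", "FY2024"],   # title row: makes column 1 mix str and int
    ["apple", 3],
    ["banana", 5],
]
best_name, best_rows = select_best(grid, strategies, type_consistency)
print(best_name, best_rows)
```

No labeled answer is consulted at any point: the candidates are ranked purely on properties of the region each one proposes.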

For multi-table sheets, gap-based splitting runs first — consecutive empty rows signal table boundaries.
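
A minimal sketch of gap-based splitting, assuming a sheet is a list of rows and that a run of at least `min_gap` fully empty rows separates two tables. The function name `split_by_empty_rows` and the `min_gap` threshold are assumptions made for illustration. The README does not state how many empty rows PyNorma requires.

```python
def split_by_empty_rows(grid, min_gap=2):
    """Return inclusive (start, end) row-index pairs for blocks separated by empty-row gaps."""
    def is_empty(row):
        return all(cell is None or str(cell).strip() == "" for cell in row)

    segments, start, gap = [], None, 0
    for i, row in enumerate(grid):
        if is_empty(row):
            gap += 1
            # Close the open block once the gap is long enough.
            if start is not None and gap >= min_gap:
                segments.append((start, i - gap))
                start = None
        else:
            if start is None:
                start = i
            gap = 0
    if start is not None:
        # Drop any short trailing run of empty rows from the last block.
        segments.append((start, len(grid) - 1 - gap))
    return segments


grid = [
    ["Name", "Qty"],
    ["apple", 3],
    [None, None],
    [None, None],
    ["City", "Pop"],
    ["Seoul", 9],
]
segments = split_by_empty_rows(grid, min_gap=2)
print(segments)
```

Each resulting segment can then be handed to the detection strategies on its own.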

Strategy     Approach                               Avg Score
D_Pattern    Regex normalization + type ratio       0.991
C_Gradient   Density gradient boundary detection    0.989
B_Entropy    Entropy jump detection                 0.967
F_Voting     Column-type voting                     0.964
A_Rules      Fixed heuristic rules                  0.960
E_Window     Sliding window density                 0.916

Benchmarked on 36 specimens (24 real-world + 12 adversarial edge cases) with 60 regression tests.

1NF Violation Detection

detect_multivalue_columns() uses atom overlap-ratio analysis:

  • For each column, tries candidate delimiters (,;/|)
  • Splits cells into atoms and checks cross-cell overlap
  • High overlap ratio = multi-valued column (atoms appear in other cells)
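
The steps above can be sketched in plain Python. The README does not give the exact formula, so this uses one plausible definition: among cells that the delimiter actually splits, the share of atoms that also occur in some other cell. `overlap_ratio` and `guess_delimiter` are hypothetical names, not part of PyNorma's API.

```python
from collections import Counter


def overlap_ratio(cells, delimiter):
    """Share of atoms from multi-atom cells that also appear in at least one other cell."""
    split_cells = [
        {a.strip() for a in cell.split(delimiter) if a.strip()}
        for cell in cells
    ]
    # Count in how many distinct cells each atom appears.
    cell_counts = Counter(atom for atoms in split_cells for atom in atoms)
    multi = [atoms for atoms in split_cells if len(atoms) > 1]
    total = sum(len(atoms) for atoms in multi)
    if total == 0:
        return 0.0  # the delimiter never splits a cell into several atoms
    shared = sum(1 for atoms in multi for a in atoms if cell_counts[a] > 1)
    return shared / total


def guess_delimiter(cells, candidates=",;/|"):
    """Return (delimiter, ratio) for the candidate with the highest overlap ratio."""
    return max(((d, overlap_ratio(cells, d)) for d in candidates), key=lambda p: p[1])


fruits = ["apple, banana", "banana", "apple, cherry", "cherry"]
names = ["Smith, John", "Doe, Jane", "Lee, Min"]

print(guess_delimiter(fruits))     # atoms recur across cells
print(overlap_ratio(names, ","))   # commas present, but no atom recurs
```

The second column shows why overlap matters: a comma inside free text such as "Smith, John" produces atoms that never recur, so the column is not flagged as multi-valued.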

Project Structure

pynorma/
├── pynorma/                    # Main package
│   ├── pipeline.py             # Pipeline API (detection → preprocessor)
│   ├── io/                     # File I/O (CSV, XLSX)
│   │   └── trimmer.py          # Ensemble detection → legacy fallback
│   ├── detect/                 # Table region detection (legacy)
│   └── preprocessor/           # Atomizer, Clarifier, Merger, Flattener, Appender
├── specimen/                   # Test data (36 files)
│   └── benchmark/              # Ensemble detection framework
│       ├── core.py             # TableRegion, quality_score, split_tables_by_gap
│       ├── preprocess.py       # detect() + preprocess() (multi-table aware)
│       ├── strategies/         # 6 competing detection strategies
│       └── tests/              # 60 regression tests
└── tests/                      # Package-level tests

Author

nash-dir (https://github.com/nash-dir)

License

This project is licensed under the MIT License.
