Home

Welcome to the PyNorma Wiki!

PyNorma is a Python library for preprocessing messy, real-world tabular data. Its mission is to automate the tedious tasks of data cleaning, allowing data scientists and analysts to focus on analysis, not janitorial work.

This project was born from a simple need: getting messy datasets into First Normal Form (1NF). It has since grown into a toolkit for handling a wide range of data "abnormalities."

The Challenge: The Messy Real World

In an ideal world, all data arrives in a perfectly structured, analysis-ready format. In reality, we face:

1. Unstructured Files: Spreadsheets and CSVs are often wrapped in junk. Files like townbusiness1.csv contain explanatory notes, headers, footers, and empty rows/columns that surround the actual data table.
2. Non-Atomic Data: A single cell might contain multiple distinct values, violating the core principle of 1NF (e.g., "CVS, Drugstore" in one cell).
3. Complex "Wide" Formats: Data is often presented in wide, human-readable formats with multi-level or merged headers (like Kor119.xlsx). This "pivoted" structure is unusable for most analytical tools, which expect a "long" (tidy) format.
4. Inconsistent Values: The same entity is often recorded under different names due to typos or synonyms (e.g., "Restaurannt," "Diner," and "Fast Food" all referring to "Restaurant").
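
To make items 2 and 4 concrete, here is a minimal standard-library sketch of what splitting non-atomic cells and unifying inconsistent labels can look like. It is not PyNorma's API: the helper names `split_cells` and `map_synonyms`, the sample rows, and the synonym map are all hypothetical and exist only for illustration.

```python
# Hypothetical rows; the category values echo the examples in the list above.
rows = [
    {"business": "Main St Pharmacy", "category": "CVS, Drugstore"},
    {"business": "Joe's", "category": "Restaurannt"},
    {"business": "Quick Bite", "category": "Fast Food"},
]

# Assumed synonym map: every known variant points at one canonical label.
CANONICAL = {
    "Restaurannt": "Restaurant",
    "Diner": "Restaurant",
    "Fast Food": "Restaurant",
}


def split_cells(rows, column, delimiter=","):
    """Emit one output row per delimited value found in `column`."""
    result = []
    for row in rows:
        for value in row[column].split(delimiter):
            # Copy the row and overwrite only the split column.
            result.append({**row, column: value.strip()})
    return result


def map_synonyms(rows, column, mapping):
    """Replace known variants in `column`; unknown values pass through unchanged."""
    return [{**row, column: mapping.get(row[column], row[column])} for row in rows]


clean = map_synonyms(split_cells(rows, "category"), "category", CANONICAL)
for row in clean:
    print(row)
```

The first input row becomes two rows (one per category), and both restaurant variants collapse to a single label. Values that are not in the map, such as "CVS", are left as they are.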

The PyNorma Philosophy: Normalization as a Concept

The name PyNorma comes from Normalization.

While the project started with the classic database concept of 1NF (atomic values), its philosophy embraces a broader definition:

Normalization is the entire process of turning abnormal, messy data into a normal, usable state.

A "normal" dataset is one that is Tidy, Atomic, and Consistent. The functions in PyNorma are the tools to achieve this standard, turning unstructured chaos into analysis-ready order.

Core Features (The Toolkit)

PyNorma tackles these challenges with a suite of smart functions:

Smart Parsing (parse & trimmer): Automatically detects and trims the "junk" surrounding a data table. It finds the real data so you don't have to.
Flattening (flattener): Converts complex, multi-level 'wide' tables into a simple 'long' (tidy) format, making them ready for analysis.
Atomizing (atomizer): Solves the 1NF problem by splitting cells with multiple delimited values into distinct rows.
Clarifying (clarifier): Standardizes inconsistent data by mapping typos and synonyms to a single, correct value using a simple dictionary.
Appending (appender): Intelligently stacks multiple data files, even with slight schema differences.
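
The trimming, flattening, and appending ideas above can be sketched in plain Python. This only illustrates the underlying techniques and is not how PyNorma implements them. The function names `trim`, `flatten`, and `stack`, their parameters, the row-density heuristic, and the sample sheet are all assumptions made for this example.

```python
# Invented sheet: a note, a blank row, a two-level header, data, and a footer.
# None stands for an empty cell (for example, the right half of a merged header).
raw = [
    ["Quarterly summary (draft)", None, None, None, None],
    [None, None, None, None, None],
    ["region", "2023", None, "2024", None],
    [None, "sales", "returns", "sales", "returns"],
    ["North", 10, 2, 12, 3],
    ["South", 7, 1, 9, 0],
    ["Source: internal", None, None, None, None],
]


def trim(rows, min_filled=2):
    """Keep rows with at least `min_filled` non-empty cells (a naive heuristic)."""
    return [r for r in rows if sum(c is not None for c in r) >= min_filled]


def flatten(table, id_cols=1, level_names=("year", "measure")):
    """Turn a table with a two-level header into long records."""
    top, sub, *body = table
    # Forward-fill the top header: a merged cell leaves blanks to its right.
    filled, last = [], None
    for cell in top:
        last = cell if cell is not None else last
        filled.append(last)
    records = []
    for row in body:
        ids = dict(zip(top[:id_cols], row[:id_cols]))
        for i in range(id_cols, len(row)):
            records.append({
                **ids,
                level_names[0]: filled[i],
                level_names[1]: sub[i],
                "value": row[i],
            })
    return records


def stack(*tables):
    """Concatenate record lists, padding columns a table lacks with None."""
    columns = []
    for table in tables:
        for record in table:
            for key in record:
                if key not in columns:
                    columns.append(key)
    return [{c: rec.get(c) for c in columns} for table in tables for rec in table]


long_rows = flatten(trim(raw))

# A second, slightly different table: it carries an extra "source" column.
extra = [{"region": "East", "year": "2024", "measure": "sales", "value": 4, "source": "manual"}]
combined = stack(long_rows, extra)
```

Here `trim` drops the note, the blank row, and the footer because each has fewer than two filled cells. A real detector would need to be more careful, since a sparse data row would be dropped by the same rule. `flatten` yields one record per region, year, and measure, and `stack` fills the missing `source` field with `None` for the rows that came from the first table.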

Project Status

This project is under active development. The core components are functional, but the API is still evolving. Current priorities are:

Formalizing the public API.
Expanding test coverage.
Building out comprehensive documentation right here in this Wiki.

See the [API Reference] (under construction) to get started with the core functions.
