Skip to content

python-excel-automation/spreadsheet-cleaner

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

1 Commit
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

spreadsheet-cleaner

Clean a messy CSV / Excel file in one command — and get a report of exactly what changed.

Real spreadsheets are full of stray spaces, inconsistent headers, blank rows, duplicates, and numbers stored as text. spreadsheet-cleaner fixes all of that in one pass and writes a Changes sheet so you can see what it touched.

python -m spreadsheet_cleaner messy_contacts.csv -o clean.xlsx
✓ Wrote clean.xlsx  (4 rows kept)
  • headers normalized: 5
  • whitespace cells trimmed: 7
  • empty rows dropped: 1
  • empty columns dropped: 0
  • duplicate rows dropped: 1
  • columns retyped: 1

📚 Built to go with the step-by-step guides at python-excel-automation.com — practical Python + Excel tutorials covering pandas, openpyxl, data cleaning and reporting.

What it cleans

  • 🏷️ Headers — trimmed, lower-cased, snake_cased, and de-duplicated.
  • ✂️ Whitespace — leading/trailing spaces removed and runs of spaces collapsed.
  • 🗑️ Empty rows & columns — fully-blank rows and columns dropped.
  • 👯 Duplicate rows — exact duplicates removed.
  • 🔢 Types — columns that are entirely numbers or dates stored as text are converted (mixed columns are left alone, so nothing is silently mangled).
  • 🩹 Missing values — optionally filled with zero, the column mean, or ffill/bfill.
  • 🧾 Change log — every Excel output gets a Changes sheet tallying the above.

Setup

Runs straight from a clone of the repository — there's nothing to install from a package index.

git clone https://github.com/python-excel-automation/spreadsheet-cleaner.git
cd spreadsheet-cleaner

It needs Python 3.9+ and two libraries — pandas and openpyxl (listed in requirements.txt). Make them available in your Python environment using whatever package manager you use, then run the tool from the repo folder.

Usage

# Clean a file (writes clean.xlsx with a Changes sheet)
python -m spreadsheet_cleaner messy.xlsx -o clean.xlsx

# Fill missing numbers with the column mean
python -m spreadsheet_cleaner data.csv --fill mean -o clean.xlsx

# Keep duplicates and original headers, write CSV out
python -m spreadsheet_cleaner data.csv --no-dedupe --no-normalize-headers -o clean.csv

Options

Flag Description
-o, --output Output path (default: cleaned.xlsx). CSV output skips the change log.
--fill Missing-value strategy: none, zero, mean, ffill, bfill (default: none).
--no-normalize-headers Leave column names as-is.
--no-strip-whitespace Don't trim/collapse whitespace.
--no-drop-empty Keep fully-empty rows and columns.
--no-dedupe Keep duplicate rows.
--no-coerce-types Don't convert text-numbers/dates.

Use it from Python

from spreadsheet_cleaner import clean_frame
import pandas as pd

result = clean_frame(pd.read_csv("messy.csv"))
print(result.changes)        # {'headers_normalized': 5, 'duplicate_rows_dropped': 1, ...}
result.frame.to_excel("clean.xlsx", index=False)

Try it

python -m spreadsheet_cleaner examples/messy_contacts.csv -o demo.xlsx

Develop

With the dependencies and pytest available, run the tests from the repo folder:

python -m pytest

License

MIT

About

Clean messy CSV/Excel files in one command — normalize headers, trim whitespace, drop empties & duplicates, fix types, fill gaps — with a change log. pandas + openpyxl.

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages