A shared workspace for code, experiments, and data pipelines.
- Project 1 — Reading CSV Files with Pandas — load a real aqueous-solubility dataset (AQSolDB) into a pandas DataFrame and explore it with shape, dtypes, summary stats, and filtering.
- Project 2 — Summary Statistics & Outlier Detection — compute quartiles and the IQR and implement Tukey's outlier rule from scratch on the Palmer Penguins dataset, discovering why outliers only surface once you group by species.
- Project 3 — Clustering & Dimensionality Reduction — cluster the penguins by their measurements in native 4D with k-means, project the result down to 2D with PCA, and plot it colored by cluster — a DataFrame pipeline where each step adds a column.
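The grouping effect mentioned for Project 2 can be illustrated with a minimal sketch of Tukey's rule. It uses only the standard library and made-up toy numbers (not the Palmer Penguins data); the function names `tukey_fences` and `tukey_outliers` and the group labels are hypothetical, and the project itself may implement the rule differently.

```python
from statistics import quantiles


def tukey_fences(values, k=1.5):
    """Return the (lower, upper) Tukey fences: Q1 - k*IQR and Q3 + k*IQR."""
    # quantiles(..., n=4) returns the three quartile cut points Q1, Q2, Q3.
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def tukey_outliers(values, k=1.5):
    """Return the values that fall outside the Tukey fences."""
    lower, upper = tukey_fences(values, k)
    return [v for v in values if v < lower or v > upper]


# Made-up toy measurements for two hypothetical groups (illustration only).
groups = {
    "small": [10, 11, 12, 13, 14, 20],
    "large": [30, 31, 32, 33, 34],
}

# Pooled: the spread between the two groups inflates the IQR,
# so 20 sits comfortably inside the fences.
pooled = [v for vals in groups.values() for v in vals]
pooled_outliers = tukey_outliers(pooled)

# Grouped: each group gets its own, much tighter fences.
grouped_outliers = {name: tukey_outliers(vals) for name, vals in groups.items()}

print("pooled: ", pooled_outliers)    # pooled:  []
print("grouped:", grouped_outliers)   # grouped: {'small': [20], 'large': []}
```

In this toy data the value 20 is flagged only within its own group, because the pooled fences are stretched by the gap between the groups.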
Homebrew is the macOS package manager, used to install everything below. If you don't have it:
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
macOS ships git with Apple's Command Line Tools:
xcode-select --install # installs git + compilers (skip if already present)
git --version # verify
Optional: `brew install git` for a newer version than Apple's.
uv manages the Python interpreter, the virtual environment, and packages — all in one fast tool.
brew install uv
git clone git@github.com:SuperCowPowers/data_engineering.git
cd data_engineering
uv sync # creates .venv, installs the right Python + all dependencies
`uv sync` reads `pyproject.toml` and `.python-version`, downloads Python 3.13 if you don't have it, and builds the environment. That's the whole setup.
uv run python path/to/script.py # run a script
Prefer the classic workflow? Activate the env and use python directly:
source .venv/bin/activate
python path/to/script.py
Point your editor at the project's `.venv` so it uses the right interpreter and finds the installed packages.
PyCharm
- Settings → Project → Python Interpreter → Add Interpreter → Add Local.
- Choose Existing and select `.venv/bin/python` in the project. (PyCharm 2024.2+ also has a native uv option that does this for you.)
VS Code
- Install the Python extension.
- Command Palette (⌘⇧P) → Python: Select Interpreter → pick the one under `.venv`. VS Code usually auto-detects it on open.
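To confirm that the editor (or `uv run`) really picked up the project environment, a small check like the one below may help. It is a sketch, not part of the repository: the helper names `in_virtualenv` and `uses_project_venv` are hypothetical, and it assumes the script is run from the project root so that `.venv` sits in the current directory.

```python
import sys
from pathlib import Path


def in_virtualenv() -> bool:
    """True when the running interpreter belongs to a virtual environment."""
    # Inside a venv, sys.prefix points at the venv while sys.base_prefix
    # still points at the base Python installation.
    return sys.prefix != sys.base_prefix


def uses_project_venv(project_root: Path) -> bool:
    """True when the active environment is the .venv under project_root."""
    return Path(sys.prefix).resolve() == (project_root / ".venv").resolve()


if __name__ == "__main__":
    print("interpreter:", sys.executable)
    print("virtual environment active:", in_virtualenv())
    # Assumption: the current working directory is the project root.
    print("project .venv in use:", uses_project_venv(Path.cwd()))
```

Saved under any name you like (for example a hypothetical `check_env.py`), it can be run from the editor's run button or with `uv run python check_env.py`; both should report the same interpreter path.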
uv run pytest # run tests
git checkout -b my-feature
# ... make changes, commit ...
git push -u origin my-feature
Then open a pull request on GitHub for review.
data_engineering/
├── pyproject.toml # project, dependencies, tool config
├── .python-version # pinned Python version
├── uv.lock # exact resolved versions (created by `uv sync`)
├── src/data_engineering/ # importable, shared code
├── tests/ # pytest tests
├── project_1/ # reading CSVs with pandas
├── project_2/ # summary statistics & outlier detection
└── project_3/ # clustering & dimensionality reduction