An automated ETL application that extracts private transaction data from localized bank account statement files, runs formatting pipelines, ingests the sanitized records into Google Sheets, and visualizes them in Looker reporting dashboards.
- Automated Statement Ingestion: Scans designated Google Drive input folders to locate and extract raw PDF bank account statements.
- Infrastructure as Code: Uses Terraform to declare the GCP infrastructure instead of provisioning it manually.
- Layout-Preserving Text Parsing: Uses layout-mode text extraction via pdfplumber to preserve column spacing for accounting balance audits.
- Index-Boundary Transaction Mapping: Evaluates the segment bounded by the ledger markers (alter Kontostand and neuer Kontostand) to isolate transaction rows.
- Preserved Schema Updates: Writes data-frame contents to the remote sheet starting directly beneath the header row, so existing styling stays intact.
- Automated Continuous Delivery: Ships with multi-stage test runners (pytest), coverage reporting, security auditing (Bandit), and automated deployment recipes for Cloud Run Jobs.
- Data Viz in Data Studio: Visualizes the data from Google Sheets in Data Studio.
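The index-boundary idea from the feature list can be sketched in a few lines. This is a minimal illustration, not the project's actual parser: the function name `slice_transaction_lines` and the sample statement text are hypothetical, and only the two marker strings come from this README.

```python
from typing import List

# Marker strings named in the feature list above.
START_MARKER = "alter Kontostand"
END_MARKER = "neuer Kontostand"


def slice_transaction_lines(page_text: str) -> List[str]:
    """Return the non-empty lines strictly between the two balance markers."""
    lines = page_text.splitlines()
    try:
        # First line that mentions the opening balance.
        start = next(i for i, line in enumerate(lines) if START_MARKER in line)
        # First later line that mentions the closing balance.
        end = next(
            i for i, line in enumerate(lines) if i > start and END_MARKER in line
        )
    except StopIteration:
        raise ValueError("balance markers not found in statement text") from None
    return [line for line in lines[start + 1 : end] if line.strip()]


# Hypothetical statement text, invented for illustration only.
sample = """Kontoauszug 01/2024
alter Kontostand                      1.000,00 EUR
02.01. Miete                           -700,00
15.01. Gehalt                         2.500,00

neuer Kontostand                      2.800,00 EUR
Seite 1"""

rows = slice_transaction_lines(sample)
```

Slicing by marker index keeps header and footer lines out of the result without needing to know the page layout in advance.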
The chart below maps structural boundaries across user upload layers, serverless job schedulers, and Google API authorization boundaries during active batch sync runs:
The application reads environment variables to configure its runtime behavior. Make sure SPREADSHEET_ID, TEMP_FOLDER_ID, and REGULAR_FOLDER_ID are set in the execution environment.
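One way to validate the configuration at startup is sketched below. The variable names SPREADSHEET_ID, TEMP_FOLDER_ID, and REGULAR_FOLDER_ID are the ones used elsewhere in this README; the `Settings` class and `load_settings` function are hypothetical names chosen for illustration, and the real project may structure this differently.

```python
import os
from dataclasses import dataclass
from typing import Mapping

# Variable names taken from this README.
REQUIRED_VARS = ("SPREADSHEET_ID", "TEMP_FOLDER_ID", "REGULAR_FOLDER_ID")


@dataclass(frozen=True)
class Settings:
    """Hypothetical container for the three runtime parameters."""

    spreadsheet_id: str
    temp_folder_id: str
    regular_folder_id: str


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read the required variables and fail fast if any is missing or empty."""
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return Settings(
        spreadsheet_id=env["SPREADSHEET_ID"],
        temp_folder_id=env["TEMP_FOLDER_ID"],
        regular_folder_id=env["REGULAR_FOLDER_ID"],
    )
```

Calling `load_settings()` once at the start of a run reports all missing variables together, before any Drive or Sheets request is made.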
- The app/ main script is deployed on GCP
- Execution schedule: 0 7 2 * * (the 2nd day of each month at 7:00 AM)
# Clone the repository structure locally
git clone https://github.com/XaverHeuser/finance-analysis.git
cd finance-analysis
# Initialize isolated runtime virtual dependencies
python3 -m venv .venv
source .venv/bin/activate
# Upgrade pip to a pinned version
python -m pip install --upgrade pip==26.1.1
# Deploy core data-processing layers
python -m pip install -r requirements.txt
# Deploy validation layers (linters, type-checkers, unit tests)
python -m pip install -r requirements-dev.txt
# Initialize pre-commit automation hooks locally
pre-commit install
To verify extraction behavior using a local credentials file before deployment:
- Save your Service Account credentials file to a non-tracked folder named credentials/.
- Export your environment variable configurations.
- Run the script execution command:
export SPREADSHEET_ID="your_sheet_id"
export TEMP_FOLDER_ID="your_temp_folder_id"
export REGULAR_FOLDER_ID="your_regular_folder_id"
python src/main.py
The configured GitHub Actions pipeline performs multi-stage quality control checks on every code push or pull request to the main branch:
- Quality Audits: Enforces code style guidelines using Ruff
- Static Type Invariants: Assesses type safety definitions with strict mypy evaluations
- Security Testing: Scans for vulnerabilities and exposed secrets using Bandit
- Unit Test Execution: Runs automated tests via pytest with code coverage verification
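As an illustration of the kind of unit test such a pipeline could run with pytest, here is a hypothetical balance check: the opening balance plus all transaction amounts should equal the closing balance. The function names and the German number format (`1.000,00`) are assumptions made for this sketch, not code taken from the repository.

```python
from decimal import Decimal
from typing import Iterable


def parse_german_amount(text: str) -> Decimal:
    """Convert an amount such as '1.000,00' or '-700,00' to a Decimal.

    Assumes '.' is the thousands separator and ',' the decimal separator.
    """
    return Decimal(text.strip().replace(".", "").replace(",", "."))


def balances_match(old: Decimal, amounts: Iterable[Decimal], new: Decimal) -> bool:
    """Check that the opening balance plus all amounts equals the closing balance."""
    return old + sum(amounts, Decimal("0")) == new


def test_balances_match() -> None:
    old = parse_german_amount("1.000,00")
    new = parse_german_amount("2.800,00")
    amounts = [parse_german_amount(a) for a in ("-700,00", "2.500,00")]
    assert balances_match(old, amounts, new)
    # Dropping a transaction must break the audit.
    assert not balances_match(old, amounts[:1], new)
```

Using `Decimal` rather than `float` avoids rounding drift when many cent-level amounts are summed.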
A multi-stage Google Cloud Build pipeline (cloudbuild.yaml), triggered on merges to the main branch, builds and packages the application for serverless deployment:
# Manual trigger option via gcloud SDK
gcloud builds submit --config cloudbuild.yaml .
The production system operates on a serverless scheduling model inside Google Cloud Run Jobs, configured on an automated loop via Cloud Scheduler:
- Cron Trigger: 0 7 2 * * (executes automatically at 7:00 AM on the 2nd day of each calendar month).
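To make the schedule concrete, the sketch below computes the next firing time of the expression 0 7 2 * * after a given moment. It handles only this one fixed pattern and is not a general cron parser; the function name `next_monthly_run` is hypothetical, and the datetimes are naive because this README does not state which time zone the scheduler uses.

```python
from datetime import datetime


def next_monthly_run(now: datetime, day: int = 2, hour: int = 7) -> datetime:
    """Return the next firing time of '0 7 2 * *' strictly after `now`."""
    # Candidate slot in the current month (day 2 exists in every month).
    candidate = now.replace(day=day, hour=hour, minute=0, second=0, microsecond=0)
    if candidate > now:
        return candidate
    # Otherwise roll over to the same slot in the following month.
    if now.month == 12:
        return candidate.replace(year=now.year + 1, month=1)
    return candidate.replace(month=now.month + 1)


upcoming = next_monthly_run(datetime(2024, 1, 1, 12, 0))
```

A run that starts exactly at 07:00 on the 2nd therefore reports the following month's slot as the next one.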
The project's Google Cloud Platform (GCP) infrastructure is fully modularized and provisioned using Terraform. The deployment utilizes variables targeting a custom repository setup.
- providers.tf: Configures the HashiCorp Google provider (pinned to v6.8.0) and programmatically enables the required remote APIs (Cloud Run, Cloud Scheduler, Cloud Build, Artifact Registry, Secret Manager, and Google Drive).
- variables.tf: Sets localized deployment variables, defaulting to the finance-analysis-id project ID inside the europe-west1 geographical region.
- cloudbuild.tf: Spins up a secure Docker Artifact Registry (finance-analysis-repo) to house application images. It also initializes a dedicated pipeline service account (cloudbuild-runner-sa) granted minimal-privilege roles (run.invoker, storage.admin, logging.logWriter, artifactregistry.writer, and cloudbuild.builds.builder).
- pipeline.tf: Defines the serverless finance-analysis-job on Cloud Run. It configures a task timeout window of 30 minutes (1800s) and maps the runtime parameters (SPREADSHEET_ID, TEMP_FOLDER_ID, REGULAR_FOLDER_ID) securely using dynamic environment blocks tied to Secret Manager references.
- secrets.tf: Instantiates locked Secret Manager parameter definitions, providing explicit read/access permissions exclusively to the Cloud Run Job runtime service account.
- orchestration.tf: Establishes a serverless google_cloud_scheduler_job running on a cron schedule, triggered automatically at 7:00 AM on the 2nd day of each month to start the analysis job.
- Create a GCP project.
- Authenticate to the project in the gcloud terminal.
- Initialize the workspace and providers:
  # Change into the terraform folder and initialize Terraform
  cd terraform
  terraform init
  # Review the execution plan
  terraform plan
  # Deploy the Terraform code
  terraform apply
- Set the secrets in Secret Manager.
- Run Cloud Build:
  gcloud builds submit --config cloudbuild.yaml .
- Update the Cloud Run Job with the Docker image:
  gcloud run jobs update finance-analysis-job --image=europe-west1-docker.pkg.dev/finance-analysis-id/finance-analysis-repo/analysis-image:latest --region=europe-west1
The continuous security scan flags an advisory affecting pip 25.2 and earlier:
- Advisory ID: GHSA-4xh5-x5gv-qwph
- Mitigation Strategy: The risk applies to untrusted source distributions (sdists). Because this workflow uses pinned Python wheels from verified PyPI mirrors within a controlled workspace, the rule is explicitly ignored using --ignore-vuln in the continuous integration environment.
├── .github/                 # Automated deployment workflow blueprints & validation logic
│   └── workflows/
│       ├── ci-pipeline.yml  # Multi-stage code verification and testing workflow
│       └── deploy.yml       # Production Cloud Run deployment configuration
├── notebooks/               # Analysis and prototyping environments
├── src/                     # Application source modules
│   ├── config/              # Scope allocations and environmental configurations
│   ├── domain/              # Core balance checking and transaction parsing logic
│   ├── infrastructure/      # Drive/Sheets integration clients and text parsers
│   ├── models/              # Dataclass modeling schemas
│   ├── processing/          # Core transaction orchestration logic
│   └── main.py              # Application entry point gateway
├── terraform/               # Infrastructure as Code infrastructure definitions
│   ├── providers.tf         # Pinned provider dependencies & API activation lists
│   ├── pipeline.tf          # Serverless Cloud Run Job definitions
│   ├── variables.tf         # Project resource definitions & regional controls
│   └── ...
├── tests/                   # Test suite matching application file layouts
├── Dockerfile               # Application container build script
├── cloudbuild.yaml          # Google Cloud Build compilation pipeline
└── pyproject.toml           # Package build properties and configuration specs
Distributed under the terms of the MIT License. See the license file in the repository for details.