MatAI is an open-source, privacy-conscious platform for turning machine-learning prediction tables into searchable candidate-review workflows for advanced materials R&D.
It is designed for scientists and engineers who need to inspect, rank, and operationalize ML prediction outputs locally without deploying a heavy data warehouse or sending sensitive experimental data to external services.
The initial use case focuses on advanced semiconductor materials such as EUV photoresists and metal oxide resist candidate screening, but the architecture is general enough for chemistry, materials informatics, and industrial R&D prediction workflows.
Modern materials R&D increasingly uses machine learning to screen large candidate spaces. However, many industrial teams face a practical gap:
- ML predictions are often produced as CSV, Excel, or other table outputs.
- Sensitive experimental data cannot be freely uploaded to external services.
- Researchers need a lightweight way to filter, rank, and review candidates.
- Full data warehouse or MLOps platforms are often too heavy for small internal tools.
- Domain experts need readable workflows, not only notebooks.
MatAI addresses this gap by providing a small, local-first application pattern:
S3 prediction table
↓
local raw snapshot
↓
canonical Parquet conversion
↓
DuckDB analytical query layer
↓
Flask API and candidate-review UI
- Download the latest prediction table from S3.
- Store the raw prediction table locally for traceability.
- Convert raw CSV, Excel, JSON, or Parquet inputs into a canonical Parquet snapshot.
- Query prediction outputs locally with DuckDB.
- Provide a lightweight Flask API and web UI for candidate screening.
- Filter candidates by target thresholds such as EOP, IPU, and process margin.
- Keep all query serving local and privacy-conscious.
- Use dummy sample data for open-source demonstration.
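The threshold filtering described above can be sketched as a small query builder. This is an illustrative sketch, not the code in db.py: the column names (eop, ipu, process_margin), the predictions view name, and the sort keys other than eop_asc are assumptions. Values are passed as bound parameters, and the sort key is checked against an allow-list so that user input never reaches the SQL text directly.

```python
from typing import List, Optional, Tuple

# Hypothetical names: the real schema and view name may differ.
TABLE = "predictions"
SORT_OPTIONS = {
    "eop_asc": "eop ASC",
    "ipu_asc": "ipu ASC",
    "margin_desc": "process_margin DESC",
}


def build_candidate_query(
    eop_max: Optional[float] = None,
    ipu_max: Optional[float] = None,
    margin_min: Optional[float] = None,
    sort: str = "eop_asc",
    limit: int = 100,
) -> Tuple[str, List[float]]:
    """Return (sql, params) for a threshold-filtered candidate query."""
    if sort not in SORT_OPTIONS:
        raise ValueError(f"unsupported sort: {sort!r}")

    clauses: List[str] = []
    params: List[float] = []
    filters = (
        ("eop", "<=", eop_max),
        ("ipu", "<=", ipu_max),
        ("process_margin", ">=", margin_min),
    )
    for column, op, value in filters:
        if value is not None:  # skip filters the caller did not set
            clauses.append(f"{column} {op} ?")
            params.append(float(value))

    where = " WHERE " + " AND ".join(clauses) if clauses else ""
    safe_limit = max(1, min(int(limit), 1000))  # clamp to a sane range
    sql = (
        f"SELECT * FROM {TABLE}{where} "
        f"ORDER BY {SORT_OPTIONS[sort]} LIMIT {safe_limit}"
    )
    return sql, params
```

The returned pair is meant to be handed to a DuckDB connection, for example con.execute(sql, params), where predictions would be a view defined over the Parquet snapshot.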
The current demo includes a dummy prediction table for a materials-screening workflow.
data/raw/latest_prediction.csv
↓
data/processed/mor_predictions_latest.parquet
↓
DuckDB read_parquet()
↓
Flask web page and API
The sample data is synthetic and does not contain confidential experimental or company data.
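As a rough illustration of how such synthetic data could be produced, the sketch below generates reproducible dummy rows with the standard library only. It is not the contents of make_sample_data.py; the column names, the MOR- identifier prefix, and the value ranges are assumptions chosen for demonstration.

```python
import csv
import io
import random
from typing import Dict, List

# Hypothetical columns; the real demo table may use different names.
FIELDS = ["candidate_id", "eop", "ipu", "process_margin"]


def make_sample_rows(n: int, seed: int = 0) -> List[Dict[str, object]]:
    """Generate n synthetic candidate rows from a seeded random generator."""
    rng = random.Random(seed)  # seeded so the demo table is reproducible
    rows = []
    for i in range(n):
        rows.append(
            {
                "candidate_id": f"MOR-{i:04d}",
                "eop": round(rng.uniform(0.2, 2.0), 3),
                "ipu": round(rng.uniform(0.2, 2.0), 3),
                "process_margin": round(rng.uniform(0.5, 1.5), 3),
            }
        )
    return rows


def rows_to_csv(rows: List[Dict[str, object]]) -> str:
    """Serialize rows to CSV text with a header line."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()
```

Writing the result of rows_to_csv(make_sample_rows(200)) to data/raw/latest_prediction.csv would give a raw table that contains no real measurements.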
MatAI/
    app.py
    db.py
    updater.py
    refresh_once.py
    rebuild_local_once.py
    make_sample_data.py
    requirements.txt
    .env.example
    README.md
    templates/
        index.html
    data/
        raw/
            latest_prediction.csv
        processed/
            mor_predictions_latest.parquet
            mor_predictions_meta.json
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
flask --app app run --debug --no-reload
For Windows PowerShell:
python -m venv .venv
.\.venv\Scripts\Activate.ps1
pip install -r requirements.txt
flask --app app run --debug --no-reload
Open the local web UI:
http://127.0.0.1:5000
python make_sample_data.py
python rebuild_local_once.py
You can also rebuild the local Parquet snapshot from the web UI.
Copy the example environment file:
cp .env.example .env
Edit .env:
AWS_REGION=ap-northeast-2
AWS_PROFILE=your-profile
S3_BUCKET=your-bucket-name
S3_KEY=ml-output/latest_prediction.csv
RAW_DIR=data/raw
PROCESSED_PARQUET_PATH=data/processed/mor_predictions_latest.parquet
LOCAL_META_PATH=data/processed/mor_predictions_meta.json
ENABLE_SCHEDULER=1
SCHEDULER_TIMEZONE=Asia/Seoul
Run a one-time S3 refresh:
python refresh_once.py
Health check:
GET /api/health
Metadata and data summary:
GET /api/meta
Candidate query:
GET /api/candidates?eop_max=1.0&ipu_max=1.0&margin_min=1.0&limit=100&sort=eop_asc
Manual S3 refresh:
POST /api/admin/refresh
Local raw-to-Parquet rebuild:
POST /api/admin/rebuild-local
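For scripted access to the endpoints above, a small standard-library client might look like the following sketch. The base URL matches the local development address shown earlier; the helper names are hypothetical, and the assumption that the endpoint returns JSON is not confirmed by this README.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def candidates_url(base_url: str = "http://127.0.0.1:5000", **filters) -> str:
    """Build a /api/candidates URL, skipping filters that are None."""
    query = {key: value for key, value in filters.items() if value is not None}
    return f"{base_url.rstrip('/')}/api/candidates?{urlencode(query)}"


def fetch_json(url: str, timeout: float = 10.0):
    """GET a URL and decode the body as JSON (assumes a JSON response)."""
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)
```

With the server running locally, fetch_json(candidates_url(eop_max=1.0, limit=100, sort="eop_asc")) would request the same kind of filtered list as the example query above.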
gunicorn -w 1 --threads 4 -b 0.0.0.0:8000 "app:create_app()"
When using the built-in scheduler, a single worker is recommended to avoid duplicate scheduled jobs.
For production systems, the recommended architecture is to separate the web server from the scheduled ingestion job:
Flask server
└── query and review UI
Cron, Kubernetes CronJob, or Airflow
└── scheduled S3 download and Parquet conversion
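When ingestion runs in a separate process from the web server, one concern is that a query might read a half-written snapshot. A common way to avoid this, sketched below, is to write the new file under a temporary name in the same directory and then swap it into place with os.replace. This is a design suggestion rather than a description of updater.py; the function name and the metadata fields are assumptions.

```python
import hashlib
import json
import os
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def publish_snapshot(data: bytes, target: Path, meta_path: Path) -> dict:
    """Atomically replace target with data, then record metadata."""
    target.parent.mkdir(parents=True, exist_ok=True)
    # The temp file lives in the target directory so the rename stays
    # on one filesystem, which is what makes os.replace atomic.
    fd, tmp_name = tempfile.mkstemp(dir=target.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
        os.replace(tmp_name, target)
    except BaseException:
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)  # do not leave partial files behind
        raise

    # Hypothetical metadata layout, kept for traceability.
    meta = {
        "path": str(target),
        "bytes": len(data),
        "sha256": hashlib.sha256(data).hexdigest(),
        "updated_at": datetime.now(timezone.utc).isoformat(),
    }
    meta_path.write_text(json.dumps(meta, indent=2), encoding="utf-8")
    return meta
```

A cron job or Kubernetes CronJob could call a function like this after converting the downloaded table, so readers see either the old snapshot or the new one, never a partial file.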
MatAI is designed around local-first and privacy-conscious operation.
- Prediction tables are downloaded into a local environment.
- Query serving is performed through local DuckDB execution.
- No sample data in this repository contains proprietary experimental data.
- Secrets should be stored in .env or an external secret manager.
- Real S3 credentials, raw experimental data, and production prediction outputs should not be committed to GitHub.
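One way to keep configuration and secrets out of source code is to read every setting from the process environment. The sketch below uses the variable names from the .env example; the Settings class, the fallback defaults, and the choice to treat the scheduler as disabled when ENABLE_SCHEDULER is unset are assumptions, not the project's actual behavior.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    aws_region: str
    aws_profile: Optional[str]
    s3_bucket: Optional[str]
    s3_key: str
    raw_dir: str
    processed_parquet_path: str
    local_meta_path: str
    enable_scheduler: bool
    scheduler_timezone: str


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Read settings from a mapping, defaulting to os.environ."""
    env = os.environ if env is None else env
    return Settings(
        aws_region=env.get("AWS_REGION", "ap-northeast-2"),
        aws_profile=env.get("AWS_PROFILE"),  # None means default credentials
        s3_bucket=env.get("S3_BUCKET"),      # None means S3 is not configured
        s3_key=env.get("S3_KEY", "ml-output/latest_prediction.csv"),
        raw_dir=env.get("RAW_DIR", "data/raw"),
        processed_parquet_path=env.get(
            "PROCESSED_PARQUET_PATH",
            "data/processed/mor_predictions_latest.parquet",
        ),
        local_meta_path=env.get(
            "LOCAL_META_PATH", "data/processed/mor_predictions_meta.json"
        ),
        # Assumed convention: only the literal "1" enables the scheduler.
        enable_scheduler=env.get("ENABLE_SCHEDULER", "0") == "1",
        scheduler_timezone=env.get("SCHEDULER_TIMEZONE", "Asia/Seoul"),
    )
```

Passing a plain dictionary instead of os.environ makes the loader easy to exercise in tests without touching real credentials.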
- Add schema validation for prediction tables.
- Add configurable column mapping through YAML.
- Add candidate tagging and review status storage.
- Add SQLite or PostgreSQL integration for review metadata.
- Add automated tests for ingestion, conversion, and query logic.
- Add GitHub Actions for linting and test execution.
- Add Docker and Docker Compose examples.
- Add deployment examples for internal R&D environments.
- Add documentation for materials informatics workflows beyond the initial EUV resist example.
- Add Codex-assisted maintainer workflows for issue triage, PR review, and documentation updates.
MatAI is intended for:
- Materials informatics researchers
- Semiconductor materials engineers
- Chemistry and formulation scientists
- R&D teams operationalizing ML prediction outputs
- Engineers building lightweight internal AI tools
- Open-source contributors interested in local-first scientific software
Contributions are welcome. Good first contributions include:
- Improving documentation
- Adding test cases
- Supporting additional input formats
- Improving schema validation
- Adding UI filters
- Adding deployment examples
- Hardening S3 and local file handling
This project is released under the MIT License.
MatAI is an early-stage open-source project. The current implementation demonstrates the core workflow and will be expanded toward a more robust, contributor-ready platform for local AI prediction review in materials R&D.