AutoEDA is a command-line application for exploratory data analysis that integrates large language models (LLM) into each step of the analytical workflow: automated schema inference, natural-language database queries, and semantic search across multi-terabyte text corpora.
LLM access is handled through OpenAI-compatible HTTP endpoints. The primary endpoint is typically a local Ollama server; if unavailable, the system transparently falls back to a configurable cloud endpoint. Model selection, base URL, and API key are freely configurable, so any OpenAI-compatible backend can be used.
- Schema inference and missing-value detection for CSV, JSON, and database sources.
- Outlier detection with four complementary methods: Z-Score, IQR, Isolation Forest, and Local Outlier Factor.
- Correlation and clustering (Pearson, k-Means, DBSCAN).
- Text analysis: text-column detection, sentiment lexicon, named-entity recognition (e-mail, URL, date).
- Survival analysis: Kaplan-Meier curves, log-rank test, Cox Proportional Hazards.
- Natural-language database queries: the LLM generates SQL, Elasticsearch
DSL, or MongoDB queries from a plain question; SQL execution is restricted
to read-only
SELECTstatements. - Vector search over large text corpora with disk-based LanceDB storage, incremental indexing, and metadata prefiltering.
- Primary/fallback LLM with exponential backoff and retry.
- Python 3.11 or higher
- A running Ollama server (for local embeddings and chat) and/or a cloud API key for an OpenAI-compatible service
- Optional: PostgreSQL, MySQL/MariaDB, MongoDB, or Elasticsearch, if database-driven analyses are to be performed
git clone https://github.com/FBR65/AutoEDA
cd AutoEDA
uv syncThe first run creates a configuration file at ~/.eda_tool_config.json
(see below) when the LLM is first invoked. The repository also contains a
local .eda_tool_config.json for development.
{
"llm": {
"base_url": "http://localhost:11434/v1",
"api_key": "ollama",
"model": "llama3.2",
"fallback_base_url": "https://api.openai.com/v1",
"fallback_api_key": "sk-...",
"fallback_model": "gpt-4o-mini",
"max_retries": 3
},
"db_connections": {
"postgresql": "postgresql://user:pass@host:5432/db",
"elasticsearch": "http://user:pass@host:9200",
"mongodb": "mongodb://user:pass@host:27017/db"
}
}| Field | Description |
|---|---|
llm.base_url |
Primary endpoint (for example, local Ollama server) |
llm.api_key |
API key for the primary endpoint (ollama for local Ollama) |
llm.model |
Model name for the primary endpoint |
llm.fallback_base_url |
Fallback endpoint (for example, OpenAI or Anthropic-compatible) |
llm.fallback_api_key |
API key for the fallback endpoint. Empty to disable fallback |
llm.fallback_model |
Model name for the fallback endpoint |
llm.max_retries |
Retry attempts on the primary endpoint before fallback is invoked |
db_connections |
Dictionary of database connections (populated via connect commands) |
The configuration is loaded with the following precedence:
EDA_CONFIGenvironment variable (path to a JSON file)./.eda_tool_config.json(project directory)~/.eda_tool_config.json(user home)
All commands are invoked as python -m src <command>.
# Register a connection
python -m src connect postgresql "postgresql://user:pass@host:5432/database"
python -m src connect mysql "mysql+pymysql://user:pass@host:3306/database"
python -m src connect mongodb "mongodb://user:pass@host:27017/database"
python -m src connect elasticsearch "http://user:pass@host:9200"
# List all registered connections
python -m src schemaspython -m src scan postgresqlOutput includes:
- List of all tables with row count and column count
- Data type and nullability for each column
- For Elasticsearch: indices, document count, storage size, cluster status
- For MongoDB: collections and document count
The search command takes a question in natural language, has the LLM
generate a database-specific query (SQL, Elasticsearch DSL, or MongoDB),
and executes it against the database.
python -m src search postgresql "Revenue by region in the last quarter, grouped by month"
python -m src search elasticsearch "Error messages with timeout in the last 24 hours"
python -m src search mongodb "Customers with more than five orders since January 2026"Pipeline:
- The question is sent to the LLM.
- The LLM generates a database-specific query.
- For SQL databases, the schema is read beforehand and embedded in the prompt.
- The generated query is displayed.
- For SQL, the query is sanitized (single
SELECTstatement, no write operations) and executed inside a read-only transaction. - Results are printed as a table to the terminal.
Supports CSV and JSON files.
# Full EDA with automatic AI hypothesis generation
python -m src analyze data.csv
# Targeted question about a dataset
python -m src ask data.csv "Which columns correlate most strongly with the target variable?"
# Outlier analysis only
python -m src outliers data.csv- Dataset overview (rows, columns)
- Missing values (absolute and percentage per column)
- Outlier detection using Z-Score, IQR, Isolation Forest, and LOF
- Top correlations between numeric columns (Pearson)
- Detected text columns (suitable for NLP)
- LLM-generated hypotheses about relationships in the dataset
| Method | Description | Parameters |
|---|---|---|
| Z-Score | Standard deviations from the mean | threshold = 3.0 |
| IQR | Interquartile range (Tukey method) | factor = 1.5 |
| Isolation Forest | Ensemble-based anomaly detection via decision trees | contamination = 0.05 |
| LOF | Local Outlier Factor — density-based outlier detection | n_neighbors = 20, contamination = 0.05 |
Combining these methods enables robust detection of both global and context-dependent outliers. In time-series data, the STL decomposition + LOF combination has been reported to achieve detection rates above 80% with a false-positive rate below one percent (iX 6/2025).
# Basic Kaplan-Meier survival curve
python -m src survival data.csv --duration days_since_signup --event churned
# Stratified by group with log-rank test
python -m src survival data.csv --duration days --event churned --group plan
# With Cox Proportional Hazards covariates
python -m src survival data.csv --duration days --event churned --cox "age,monthly_spend"
# Export as HTML
python -m src survival data.csv --duration days --event churned --output report.html- Kaplan-Meier survival curve with median survival time
- With
--group: stratified curves and log-rank test for significance - With
--cox: hazard ratios with p-values (Cox Proportional Hazards) - Interactive Plotly chart (browser) or HTML export via
--output
| Question | Tool |
|---|---|
| Which columns correlate? | analyze (Pearson) |
| How long until churn? | survival (Kaplan-Meier) |
| Does the plan affect churn speed? | survival --cox |
Survival analysis is appropriate when the timing of an event matters, and when the event may not yet have occurred for some observations (censored data).
Semantic search across large text collections uses a two-stage process:
- Indexing — texts are split into overlapping chunks, vectorized via an embedding model, and stored in a LanceDB table.
- Search — a query in natural language is embedded and searched against the vector database via Approximate Nearest Neighbor (ANN). Metadata filters (date, file type) are applied before the vector search (prefiltering).
| Criterion | FAISS | LanceDB |
|---|---|---|
| Storage model | Entirely in RAM | Disk-based (Lance columnar format) |
| Incremental updates | No — full reindexing on new data | Yes — only new or modified files are embedded |
| Metadata filtering | Not supported | SQL-like WHERE clauses before ANN search |
| Scaling | Limited by available RAM | Billions of vectors via columnar storage |
| Persistence | Manual via write_index |
Automatic on every write |
For text collections in the terabyte range, FAISS is unsuitable because the entire index must be loaded into main memory. LanceDB stores data in the Lance columnar format on disk and loads only the segments relevant to a given query, similar to Parquet but with integrated vector search and indices.
# Initial indexing of a directory
python -m src index /data/documents --name project-archive
# On a second invocation, only new files or files modified since
# the last indexing run (mtime-based) are processed
python -m src index /data/documents --name project-archiveThe index command:
- Recursively traverses the specified directory.
- Processes files of types
.txt,.md,.json,.csv,.log,.py,.xml,.html. Files larger than 50 MB are skipped. - Splits text into chunks of 1,000 characters with 200 characters overlap.
- Sends chunks in batches to the LLM's
/embeddingsendpoint. - Writes vectors and metadata to a LanceDB table.
- Creates an ANN index on the vector column.
- On repeated invocation, checks
mtimeper file and processes only changes.
# Simple full-text search
python -m src search-text "Security vulnerabilities in microservices"
# With metadata filters
python -m src search-text "Docker networking" --since 2025-06-01 --suffix md,log
# Top-20 instead of top-10 results
python -m src search-text "Kubernetes deployment" --k 20Date filters must be supplied in YYYY-MM-DD format; the file-type filter
accepts a comma-separated list of extensions drawn from the supported set
above.
# Question + top-5 relevant chunks -> LLM answers with source references
python -m src ask-text "When was the MCP protocol introduced?"
# Only consider documents from 2026 onward
python -m src ask-text "Changes to the security architecture" --since 2026-01-01The LLM response includes:
- An answer derived from the retrieved text excerpts
- Citations of the source files
- Modification dates of the cited documents
# List all indices with size and document count
python -m src indicessrc/
__init__.py
__main__.py -- Entry point (`python -m src`)
cli.py -- Argument parser and command dispatch
config.py -- AppConfig / LLMConfig (dataclasses, JSON persistence)
llm_client.py -- LLMClient with primary endpoint, fallback, and backoff
ingestion.py -- Load CSV, JSON, SQL tables, MongoDB collections
cleaning.py -- Missing-value detection, imputation, date detection
outliers.py -- Z-Score, IQR, Isolation Forest, Local Outlier Factor
analysis.py -- Pearson correlation, k-Means, DBSCAN clustering
nlp.py -- Text column detection, sentiment, NER
ai_eda.py -- LLM calls: queries, hypotheses, text-to-SQL, summary
survival.py -- Kaplan-Meier, Cox PH, group comparisons
vector_search.py -- Chunking, embedding, LanceDB indexing and ANN search
visualization.py -- Plotly charts for survival and analysis output
tests/
conftest.py
test_llm_client.py -- Primary, fallback, error cases, context manager
test_ingestion.py -- CSV/JSON loading, schema inference
test_cleaning.py -- Missing-value detection and imputation
test_outliers.py -- All four outlier detection methods
test_analysis.py -- Correlation, clustering, validation
test_nlp.py -- Text analysis, sentiment, NER, edge cases
test_ai_eda.py -- AI analysis functions (mocked LLM)
test_survival.py -- Kaplan-Meier, Cox PH, log-rank test
Input Processing Output
───── ────────── ──────
CSV/JSON file ───────> ingestion.py ──────> pandas DataFrame
SQL table ──────────> sqlalchemy ────────> pandas DataFrame
MongoDB ───────────> pymongo ───────────> pandas DataFrame
│
┌────────────────────┤
▼ ▼
cleaning.py outliers.py
(Missing, (Z-Score, IQR,
Imputation) LOF, IsoForest)
│ │
▼ ▼
analysis.py nlp.py
(Correlation, (Sentiment,
Clustering) NER)
│ │
└────────┬───────────┘
▼
ai_eda.py
(LLM: Hypotheses,
Text-to-SQL,
Narrative Summary)
│
┌────────┴────────┐
▼ ▼
Terminal vector_search.py
(Text) (LanceDB Index,
ANN Search,
RAG Responses)
User query
│
▼
LLM Client
│
├──> Primary endpoint (Ollama)
│ │
│ ├── Success ──> Response
│ │
│ └── Error ──> Retry with exponential backoff
│ │
│ └── All retries exhausted
│ │
└──> Fallback endpoint (Cloud) <──────┘
│
└──> Response
- SQL execution is read-only. Generated queries are sanitized to a
single
SELECTstatement; write keywords (INSERT,UPDATE,DELETE,DROP, etc.) are rejected. - File size limit. Files larger than 50 MB are skipped during indexing to avoid memory pressure.
- Filter validation. Vector search accepts only
YYYY-MM-DDdates and file extensions from a fixed allow-list. - Lazy HTTP clients. The LLM client does not open connections until the first request and closes them deterministically via an async context manager.
python3 -m pytest tests/ -vThe test suite (52 tests) covers:
- LLM client: primary success, fallback switching, error case without fallback configured, context-manager lifecycle, and exponential backoff.
- CSV and JSON ingestion (empty, populated, schema inference).
- Missing-value detection and categorical/numeric imputation.
- All four outlier detection methods on synthetic data with known outliers.
- Pearson correlation, top correlations, clustering, and validation errors.
- NLP module: text column detection, word frequencies, sentiment, NER, and edge cases (empty input).
- AI analysis functions with mocked LLM calls.
- Survival analysis: Kaplan-Meier, log-rank test, Cox PH, and validation.
| Package | Purpose |
|---|---|
| pandas | DataFrames, schema inference, CSV/JSON parsing |
| numpy | Numeric arrays, vector operations |
| scipy | Statistical tests (Z-Score) |
| scikit-learn | Isolation Forest, LOF, k-Means, DBSCAN, imputation |
| plotly | Chart generation (interactive and HTML export) |
| sqlalchemy | SQL database connections (PostgreSQL, MySQL) |
| pymongo | MongoDB connections |
| lancedb | Vector database for semantic text search |
| pyarrow | Columnar format for LanceDB |
| lifelines | Survival analysis (Kaplan-Meier, Cox PH) |
| httpx | Async HTTP client for LLM API calls |
MIT