AutoEDA — AI-Assisted Exploratory Data Analysis (CLI)

AutoEDA is a command-line application for exploratory data analysis that integrates large language models (LLM) into each step of the analytical workflow: automated schema inference, natural-language database queries, and semantic search across multi-terabyte text corpora.

LLM access is handled through OpenAI-compatible HTTP endpoints. The primary endpoint is typically a local Ollama server; if unavailable, the system transparently falls back to a configurable cloud endpoint. Model selection, base URL, and API key are freely configurable, so any OpenAI-compatible backend can be used.

Features

Schema inference and missing-value detection for CSV, JSON, and database sources.
Outlier detection with four complementary methods: Z-Score, IQR, Isolation Forest, and Local Outlier Factor.
Correlation and clustering (Pearson, k-Means, DBSCAN).
Text analysis: text-column detection, sentiment lexicon, named-entity recognition (e-mail, URL, date).
Survival analysis: Kaplan-Meier curves, log-rank test, Cox Proportional Hazards.
Natural-language database queries: the LLM generates SQL, Elasticsearch DSL, or MongoDB queries from a plain question; SQL execution is restricted to read-only SELECT statements.
Vector search over large text corpora with disk-based LanceDB storage, incremental indexing, and metadata prefiltering.
Primary/fallback LLM with exponential backoff and retry.

System Requirements

Python 3.11 or higher
A running Ollama server (for local embeddings and chat) and/or a cloud API key for an OpenAI-compatible service
Optional: PostgreSQL, MySQL/MariaDB, MongoDB, or Elasticsearch, if database-driven analyses are to be performed

Installation

git clone https://github.com/FBR65/AutoEDA
cd AutoEDA
uv sync

The first run creates a configuration file at ~/.eda_tool_config.json (see below) when the LLM is first invoked. The repository also contains a local .eda_tool_config.json for development.

Configuration

{
  "llm": {
    "base_url": "http://localhost:11434/v1",
    "api_key": "ollama",
    "model": "llama3.2",
    "fallback_base_url": "https://api.openai.com/v1",
    "fallback_api_key": "sk-...",
    "fallback_model": "gpt-4o-mini",
    "max_retries": 3
  },
  "db_connections": {
    "postgresql": "postgresql://user:pass@host:5432/db",
    "elasticsearch": "http://user:pass@host:9200",
    "mongodb": "mongodb://user:pass@host:27017/db"
  }
}

Field	Description
`llm.base_url`	Primary endpoint (for example, local Ollama server)
`llm.api_key`	API key for the primary endpoint (`ollama` for local Ollama)
`llm.model`	Model name for the primary endpoint
`llm.fallback_base_url`	Fallback endpoint (for example, OpenAI or Anthropic-compatible)
`llm.fallback_api_key`	API key for the fallback endpoint. Empty to disable fallback
`llm.fallback_model`	Model name for the fallback endpoint
`llm.max_retries`	Retry attempts on the primary endpoint before fallback is invoked
`db_connections`	Dictionary of database connections (populated via `connect` commands)

The configuration is loaded with the following precedence:

EDA_CONFIG environment variable (path to a JSON file)
./.eda_tool_config.json (project directory)
~/.eda_tool_config.json (user home)

Command Reference

All commands are invoked as python -m src <command>.

Managing Database Connections

# Register a connection
python -m src connect postgresql "postgresql://user:pass@host:5432/database"
python -m src connect mysql "mysql+pymysql://user:pass@host:3306/database"
python -m src connect mongodb "mongodb://user:pass@host:27017/database"
python -m src connect elasticsearch "http://user:pass@host:9200"

# List all registered connections
python -m src schemas

Analyzing Database Schemas

python -m src scan postgresql

Output includes:

List of all tables with row count and column count
Data type and nullability for each column
For Elasticsearch: indices, document count, storage size, cluster status
For MongoDB: collections and document count

Natural-Language Database Queries

The search command takes a question in natural language, has the LLM generate a database-specific query (SQL, Elasticsearch DSL, or MongoDB), and executes it against the database.

python -m src search postgresql "Revenue by region in the last quarter, grouped by month"
python -m src search elasticsearch "Error messages with timeout in the last 24 hours"
python -m src search mongodb "Customers with more than five orders since January 2026"

Pipeline:

The question is sent to the LLM.
The LLM generates a database-specific query.
For SQL databases, the schema is read beforehand and embedded in the prompt.
The generated query is displayed.
For SQL, the query is sanitized (single SELECT statement, no write operations) and executed inside a read-only transaction.
Results are printed as a table to the terminal.

File-Based Exploratory Data Analysis

Supports CSV and JSON files.

# Full EDA with automatic AI hypothesis generation
python -m src analyze data.csv

# Targeted question about a dataset
python -m src ask data.csv "Which columns correlate most strongly with the target variable?"

# Outlier analysis only
python -m src outliers data.csv

Output of `analyze`

Dataset overview (rows, columns)
Missing values (absolute and percentage per column)
Outlier detection using Z-Score, IQR, Isolation Forest, and LOF
Top correlations between numeric columns (Pearson)
Detected text columns (suitable for NLP)
LLM-generated hypotheses about relationships in the dataset

Outlier Detection Methods

Method	Description	Parameters
Z-Score	Standard deviations from the mean	threshold = 3.0
IQR	Interquartile range (Tukey method)	factor = 1.5
Isolation Forest	Ensemble-based anomaly detection via decision trees	contamination = 0.05
LOF	Local Outlier Factor — density-based outlier detection	n_neighbors = 20, contamination = 0.05

Combining these methods enables robust detection of both global and context-dependent outliers. In time-series data, the STL decomposition + LOF combination has been reported to achieve detection rates above 80% with a false-positive rate below one percent (iX 6/2025).

Survival Analysis (Time-to-Event)

# Basic Kaplan-Meier survival curve
python -m src survival data.csv --duration days_since_signup --event churned

# Stratified by group with log-rank test
python -m src survival data.csv --duration days --event churned --group plan

# With Cox Proportional Hazards covariates
python -m src survival data.csv --duration days --event churned --cox "age,monthly_spend"

# Export as HTML
python -m src survival data.csv --duration days --event churned --output report.html

Output of `survival`

Kaplan-Meier survival curve with median survival time
With --group: stratified curves and log-rank test for significance
With --cox: hazard ratios with p-values (Cox Proportional Hazards)
Interactive Plotly chart (browser) or HTML export via --output

Question	Tool
Which columns correlate?	`analyze` (Pearson)
How long until churn?	`survival` (Kaplan-Meier)
Does the plan affect churn speed?	`survival --cox`

Survival analysis is appropriate when the timing of an event matters, and when the event may not yet have occurred for some observations (censored data).

Vector Search for Large Text Corpora

Semantic search across large text collections uses a two-stage process:

Indexing — texts are split into overlapping chunks, vectorized via an embedding model, and stored in a LanceDB table.
Search — a query in natural language is embedded and searched against the vector database via Approximate Nearest Neighbor (ANN). Metadata filters (date, file type) are applied before the vector search (prefiltering).

Why LanceDB and not FAISS

Criterion	FAISS	LanceDB
Storage model	Entirely in RAM	Disk-based (Lance columnar format)
Incremental updates	No — full reindexing on new data	Yes — only new or modified files are embedded
Metadata filtering	Not supported	SQL-like `WHERE` clauses before ANN search
Scaling	Limited by available RAM	Billions of vectors via columnar storage
Persistence	Manual via `write_index`	Automatic on every write

For text collections in the terabyte range, FAISS is unsuitable because the entire index must be loaded into main memory. LanceDB stores data in the Lance columnar format on disk and loads only the segments relevant to a given query, similar to Parquet but with integrated vector search and indices.

Creating an Index

# Initial indexing of a directory
python -m src index /data/documents --name project-archive

# On a second invocation, only new files or files modified since
# the last indexing run (mtime-based) are processed
python -m src index /data/documents --name project-archive

The index command:

Recursively traverses the specified directory.
Processes files of types .txt, .md, .json, .csv, .log, .py, .xml, .html. Files larger than 50 MB are skipped.
Splits text into chunks of 1,000 characters with 200 characters overlap.
Sends chunks in batches to the LLM's /embeddings endpoint.
Writes vectors and metadata to a LanceDB table.
Creates an ANN index on the vector column.
On repeated invocation, checks mtime per file and processes only changes.

Semantic Search

# Simple full-text search
python -m src search-text "Security vulnerabilities in microservices"

# With metadata filters
python -m src search-text "Docker networking" --since 2025-06-01 --suffix md,log

# Top-20 instead of top-10 results
python -m src search-text "Kubernetes deployment" --k 20

Date filters must be supplied in YYYY-MM-DD format; the file-type filter accepts a comma-separated list of extensions drawn from the supported set above.

AI-Powered Question Answering with Source Attribution

# Question + top-5 relevant chunks -> LLM answers with source references
python -m src ask-text "When was the MCP protocol introduced?"

# Only consider documents from 2026 onward
python -m src ask-text "Changes to the security architecture" --since 2026-01-01

The LLM response includes:

An answer derived from the retrieved text excerpts
Citations of the source files
Modification dates of the cited documents

Index Management

# List all indices with size and document count
python -m src indices

Architecture

src/
  __init__.py
  __main__.py          -- Entry point (`python -m src`)
  cli.py               -- Argument parser and command dispatch
  config.py            -- AppConfig / LLMConfig (dataclasses, JSON persistence)
  llm_client.py        -- LLMClient with primary endpoint, fallback, and backoff
  ingestion.py         -- Load CSV, JSON, SQL tables, MongoDB collections
  cleaning.py          -- Missing-value detection, imputation, date detection
  outliers.py          -- Z-Score, IQR, Isolation Forest, Local Outlier Factor
  analysis.py          -- Pearson correlation, k-Means, DBSCAN clustering
  nlp.py               -- Text column detection, sentiment, NER
  ai_eda.py            -- LLM calls: queries, hypotheses, text-to-SQL, summary
  survival.py          -- Kaplan-Meier, Cox PH, group comparisons
  vector_search.py     -- Chunking, embedding, LanceDB indexing and ANN search
  visualization.py     -- Plotly charts for survival and analysis output

tests/
  conftest.py
  test_llm_client.py   -- Primary, fallback, error cases, context manager
  test_ingestion.py    -- CSV/JSON loading, schema inference
  test_cleaning.py     -- Missing-value detection and imputation
  test_outliers.py     -- All four outlier detection methods
  test_analysis.py     -- Correlation, clustering, validation
  test_nlp.py          -- Text analysis, sentiment, NER, edge cases
  test_ai_eda.py       -- AI analysis functions (mocked LLM)
  test_survival.py     -- Kaplan-Meier, Cox PH, log-rank test

Data Flow

Input                          Processing                  Output
─────                          ──────────                  ──────
CSV/JSON file ───────> ingestion.py ──────> pandas DataFrame
SQL table ──────────> sqlalchemy ────────> pandas DataFrame
MongoDB ───────────> pymongo ───────────> pandas DataFrame
                                                 │
                            ┌────────────────────┤
                            ▼                    ▼
                      cleaning.py          outliers.py
                      (Missing,            (Z-Score, IQR,
                       Imputation)          LOF, IsoForest)
                            │                    │
                            ▼                    ▼
                      analysis.py           nlp.py
                      (Correlation,         (Sentiment,
                       Clustering)           NER)
                            │                    │
                            └────────┬───────────┘
                                     ▼
                               ai_eda.py
                               (LLM: Hypotheses,
                                Text-to-SQL,
                                Narrative Summary)
                                     │
                            ┌────────┴────────┐
                            ▼                 ▼
                       Terminal         vector_search.py
                       (Text)           (LanceDB Index,
                                        ANN Search,
                                        RAG Responses)

Fallback Mechanism

User query
      │
      ▼
LLM Client
      │
      ├──> Primary endpoint (Ollama)
      │         │
      │         ├── Success ──> Response
      │         │
      │         └── Error ──> Retry with exponential backoff
      │                            │
      │                            └── All retries exhausted
      │                                      │
      └──> Fallback endpoint (Cloud) <──────┘
                    │
                    └──> Response

Safety

SQL execution is read-only. Generated queries are sanitized to a single SELECT statement; write keywords (INSERT, UPDATE, DELETE, DROP, etc.) are rejected.
File size limit. Files larger than 50 MB are skipped during indexing to avoid memory pressure.
Filter validation. Vector search accepts only YYYY-MM-DD dates and file extensions from a fixed allow-list.
Lazy HTTP clients. The LLM client does not open connections until the first request and closes them deterministically via an async context manager.

Tests

python3 -m pytest tests/ -v

The test suite (52 tests) covers:

LLM client: primary success, fallback switching, error case without fallback configured, context-manager lifecycle, and exponential backoff.
CSV and JSON ingestion (empty, populated, schema inference).
Missing-value detection and categorical/numeric imputation.
All four outlier detection methods on synthetic data with known outliers.
Pearson correlation, top correlations, clustering, and validation errors.
NLP module: text column detection, word frequencies, sentiment, NER, and edge cases (empty input).
AI analysis functions with mocked LLM calls.
Survival analysis: Kaplan-Meier, log-rank test, Cox PH, and validation.

Dependencies

Package	Purpose
pandas	DataFrames, schema inference, CSV/JSON parsing
numpy	Numeric arrays, vector operations
scipy	Statistical tests (Z-Score)
scikit-learn	Isolation Forest, LOF, k-Means, DBSCAN, imputation
plotly	Chart generation (interactive and HTML export)
sqlalchemy	SQL database connections (PostgreSQL, MySQL)
pymongo	MongoDB connections
lancedb	Vector database for semantic text search
pyarrow	Columnar format for LanceDB
lifelines	Survival analysis (Kaplan-Meier, Cox PH)
httpx	Async HTTP client for LLM API calls

License

MIT

Name		Name	Last commit message	Last commit date
Latest commit History 8 Commits
src		src
tests		tests
.eda_tool_config.json		.eda_tool_config.json
.gitignore		.gitignore
.python-version		.python-version
LICENSE		LICENSE
README.md		README.md
pyproject.toml		pyproject.toml
requirements.txt		requirements.txt
test_data_producer.py		test_data_producer.py
test_eda.csv		test_eda.csv

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

AutoEDA — AI-Assisted Exploratory Data Analysis (CLI)

Features

System Requirements

Installation

Configuration

Command Reference

Managing Database Connections

Analyzing Database Schemas

Natural-Language Database Queries

File-Based Exploratory Data Analysis

Output of `analyze`

Outlier Detection Methods

Survival Analysis (Time-to-Event)

Output of `survival`

Vector Search for Large Text Corpora

Why LanceDB and not FAISS

Creating an Index

Semantic Search

AI-Powered Question Answering with Source Attribution

Index Management

Architecture

Data Flow

Fallback Mechanism

Safety

Tests

Dependencies

License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

AutoEDA — AI-Assisted Exploratory Data Analysis (CLI)

Features

System Requirements

Installation

Configuration

Command Reference

Managing Database Connections

Analyzing Database Schemas

Natural-Language Database Queries

File-Based Exploratory Data Analysis

Output of analyze

Outlier Detection Methods

Survival Analysis (Time-to-Event)

Output of survival

Vector Search for Large Text Corpora

Why LanceDB and not FAISS

Creating an Index

Semantic Search

AI-Powered Question Answering with Source Attribution

Index Management

Architecture

Data Flow

Fallback Mechanism

Safety

Tests

Dependencies

License

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Output of `analyze`

Output of `survival`

Packages