Skip to content

FBR65/AutoEDA

Repository files navigation

AutoEDA — AI-Assisted Exploratory Data Analysis (CLI)

AutoEDA is a command-line application for exploratory data analysis that integrates large language models (LLM) into each step of the analytical workflow: automated schema inference, natural-language database queries, and semantic search across multi-terabyte text corpora.

LLM access is handled through OpenAI-compatible HTTP endpoints. The primary endpoint is typically a local Ollama server; if unavailable, the system transparently falls back to a configurable cloud endpoint. Model selection, base URL, and API key are freely configurable, so any OpenAI-compatible backend can be used.

Features

  • Schema inference and missing-value detection for CSV, JSON, and database sources.
  • Outlier detection with four complementary methods: Z-Score, IQR, Isolation Forest, and Local Outlier Factor.
  • Correlation and clustering (Pearson, k-Means, DBSCAN).
  • Text analysis: text-column detection, sentiment lexicon, named-entity recognition (e-mail, URL, date).
  • Survival analysis: Kaplan-Meier curves, log-rank test, Cox Proportional Hazards.
  • Natural-language database queries: the LLM generates SQL, Elasticsearch DSL, or MongoDB queries from a plain question; SQL execution is restricted to read-only SELECT statements.
  • Vector search over large text corpora with disk-based LanceDB storage, incremental indexing, and metadata prefiltering.
  • Primary/fallback LLM with exponential backoff and retry.

System Requirements

  • Python 3.11 or higher
  • A running Ollama server (for local embeddings and chat) and/or a cloud API key for an OpenAI-compatible service
  • Optional: PostgreSQL, MySQL/MariaDB, MongoDB, or Elasticsearch, if database-driven analyses are to be performed

Installation

git clone https://github.com/FBR65/AutoEDA
cd AutoEDA
uv sync

The first run creates a configuration file at ~/.eda_tool_config.json (see below) when the LLM is first invoked. The repository also contains a local .eda_tool_config.json for development.

Configuration

{
  "llm": {
    "base_url": "http://localhost:11434/v1",
    "api_key": "ollama",
    "model": "llama3.2",
    "fallback_base_url": "https://api.openai.com/v1",
    "fallback_api_key": "sk-...",
    "fallback_model": "gpt-4o-mini",
    "max_retries": 3
  },
  "db_connections": {
    "postgresql": "postgresql://user:pass@host:5432/db",
    "elasticsearch": "http://user:pass@host:9200",
    "mongodb": "mongodb://user:pass@host:27017/db"
  }
}
Field Description
llm.base_url Primary endpoint (for example, local Ollama server)
llm.api_key API key for the primary endpoint (ollama for local Ollama)
llm.model Model name for the primary endpoint
llm.fallback_base_url Fallback endpoint (for example, OpenAI or Anthropic-compatible)
llm.fallback_api_key API key for the fallback endpoint. Empty to disable fallback
llm.fallback_model Model name for the fallback endpoint
llm.max_retries Retry attempts on the primary endpoint before fallback is invoked
db_connections Dictionary of database connections (populated via connect commands)

The configuration is loaded with the following precedence:

  1. EDA_CONFIG environment variable (path to a JSON file)
  2. ./.eda_tool_config.json (project directory)
  3. ~/.eda_tool_config.json (user home)

Command Reference

All commands are invoked as python -m src <command>.

Managing Database Connections

# Register a connection
python -m src connect postgresql "postgresql://user:pass@host:5432/database"
python -m src connect mysql "mysql+pymysql://user:pass@host:3306/database"
python -m src connect mongodb "mongodb://user:pass@host:27017/database"
python -m src connect elasticsearch "http://user:pass@host:9200"

# List all registered connections
python -m src schemas

Analyzing Database Schemas

python -m src scan postgresql

Output includes:

  • List of all tables with row count and column count
  • Data type and nullability for each column
  • For Elasticsearch: indices, document count, storage size, cluster status
  • For MongoDB: collections and document count

Natural-Language Database Queries

The search command takes a question in natural language, has the LLM generate a database-specific query (SQL, Elasticsearch DSL, or MongoDB), and executes it against the database.

python -m src search postgresql "Revenue by region in the last quarter, grouped by month"
python -m src search elasticsearch "Error messages with timeout in the last 24 hours"
python -m src search mongodb "Customers with more than five orders since January 2026"

Pipeline:

  1. The question is sent to the LLM.
  2. The LLM generates a database-specific query.
  3. For SQL databases, the schema is read beforehand and embedded in the prompt.
  4. The generated query is displayed.
  5. For SQL, the query is sanitized (single SELECT statement, no write operations) and executed inside a read-only transaction.
  6. Results are printed as a table to the terminal.

File-Based Exploratory Data Analysis

Supports CSV and JSON files.

# Full EDA with automatic AI hypothesis generation
python -m src analyze data.csv

# Targeted question about a dataset
python -m src ask data.csv "Which columns correlate most strongly with the target variable?"

# Outlier analysis only
python -m src outliers data.csv

Output of analyze

  1. Dataset overview (rows, columns)
  2. Missing values (absolute and percentage per column)
  3. Outlier detection using Z-Score, IQR, Isolation Forest, and LOF
  4. Top correlations between numeric columns (Pearson)
  5. Detected text columns (suitable for NLP)
  6. LLM-generated hypotheses about relationships in the dataset

Outlier Detection Methods

Method Description Parameters
Z-Score Standard deviations from the mean threshold = 3.0
IQR Interquartile range (Tukey method) factor = 1.5
Isolation Forest Ensemble-based anomaly detection via decision trees contamination = 0.05
LOF Local Outlier Factor — density-based outlier detection n_neighbors = 20, contamination = 0.05

Combining these methods enables robust detection of both global and context-dependent outliers. In time-series data, the STL decomposition + LOF combination has been reported to achieve detection rates above 80% with a false-positive rate below one percent (iX 6/2025).

Survival Analysis (Time-to-Event)

# Basic Kaplan-Meier survival curve
python -m src survival data.csv --duration days_since_signup --event churned

# Stratified by group with log-rank test
python -m src survival data.csv --duration days --event churned --group plan

# With Cox Proportional Hazards covariates
python -m src survival data.csv --duration days --event churned --cox "age,monthly_spend"

# Export as HTML
python -m src survival data.csv --duration days --event churned --output report.html

Output of survival

  1. Kaplan-Meier survival curve with median survival time
  2. With --group: stratified curves and log-rank test for significance
  3. With --cox: hazard ratios with p-values (Cox Proportional Hazards)
  4. Interactive Plotly chart (browser) or HTML export via --output
Question Tool
Which columns correlate? analyze (Pearson)
How long until churn? survival (Kaplan-Meier)
Does the plan affect churn speed? survival --cox

Survival analysis is appropriate when the timing of an event matters, and when the event may not yet have occurred for some observations (censored data).

Vector Search for Large Text Corpora

Semantic search across large text collections uses a two-stage process:

  1. Indexing — texts are split into overlapping chunks, vectorized via an embedding model, and stored in a LanceDB table.
  2. Search — a query in natural language is embedded and searched against the vector database via Approximate Nearest Neighbor (ANN). Metadata filters (date, file type) are applied before the vector search (prefiltering).

Why LanceDB and not FAISS

Criterion FAISS LanceDB
Storage model Entirely in RAM Disk-based (Lance columnar format)
Incremental updates No — full reindexing on new data Yes — only new or modified files are embedded
Metadata filtering Not supported SQL-like WHERE clauses before ANN search
Scaling Limited by available RAM Billions of vectors via columnar storage
Persistence Manual via write_index Automatic on every write

For text collections in the terabyte range, FAISS is unsuitable because the entire index must be loaded into main memory. LanceDB stores data in the Lance columnar format on disk and loads only the segments relevant to a given query, similar to Parquet but with integrated vector search and indices.

Creating an Index

# Initial indexing of a directory
python -m src index /data/documents --name project-archive

# On a second invocation, only new files or files modified since
# the last indexing run (mtime-based) are processed
python -m src index /data/documents --name project-archive

The index command:

  1. Recursively traverses the specified directory.
  2. Processes files of types .txt, .md, .json, .csv, .log, .py, .xml, .html. Files larger than 50 MB are skipped.
  3. Splits text into chunks of 1,000 characters with 200 characters overlap.
  4. Sends chunks in batches to the LLM's /embeddings endpoint.
  5. Writes vectors and metadata to a LanceDB table.
  6. Creates an ANN index on the vector column.
  7. On repeated invocation, checks mtime per file and processes only changes.

Semantic Search

# Simple full-text search
python -m src search-text "Security vulnerabilities in microservices"

# With metadata filters
python -m src search-text "Docker networking" --since 2025-06-01 --suffix md,log

# Top-20 instead of top-10 results
python -m src search-text "Kubernetes deployment" --k 20

Date filters must be supplied in YYYY-MM-DD format; the file-type filter accepts a comma-separated list of extensions drawn from the supported set above.

AI-Powered Question Answering with Source Attribution

# Question + top-5 relevant chunks -> LLM answers with source references
python -m src ask-text "When was the MCP protocol introduced?"

# Only consider documents from 2026 onward
python -m src ask-text "Changes to the security architecture" --since 2026-01-01

The LLM response includes:

  • An answer derived from the retrieved text excerpts
  • Citations of the source files
  • Modification dates of the cited documents

Index Management

# List all indices with size and document count
python -m src indices

Architecture

src/
  __init__.py
  __main__.py          -- Entry point (`python -m src`)
  cli.py               -- Argument parser and command dispatch
  config.py            -- AppConfig / LLMConfig (dataclasses, JSON persistence)
  llm_client.py        -- LLMClient with primary endpoint, fallback, and backoff
  ingestion.py         -- Load CSV, JSON, SQL tables, MongoDB collections
  cleaning.py          -- Missing-value detection, imputation, date detection
  outliers.py          -- Z-Score, IQR, Isolation Forest, Local Outlier Factor
  analysis.py          -- Pearson correlation, k-Means, DBSCAN clustering
  nlp.py               -- Text column detection, sentiment, NER
  ai_eda.py            -- LLM calls: queries, hypotheses, text-to-SQL, summary
  survival.py          -- Kaplan-Meier, Cox PH, group comparisons
  vector_search.py     -- Chunking, embedding, LanceDB indexing and ANN search
  visualization.py     -- Plotly charts for survival and analysis output

tests/
  conftest.py
  test_llm_client.py   -- Primary, fallback, error cases, context manager
  test_ingestion.py    -- CSV/JSON loading, schema inference
  test_cleaning.py     -- Missing-value detection and imputation
  test_outliers.py     -- All four outlier detection methods
  test_analysis.py     -- Correlation, clustering, validation
  test_nlp.py          -- Text analysis, sentiment, NER, edge cases
  test_ai_eda.py       -- AI analysis functions (mocked LLM)
  test_survival.py     -- Kaplan-Meier, Cox PH, log-rank test

Data Flow

Input                          Processing                  Output
─────                          ──────────                  ──────
CSV/JSON file ───────> ingestion.py ──────> pandas DataFrame
SQL table ──────────> sqlalchemy ────────> pandas DataFrame
MongoDB ───────────> pymongo ───────────> pandas DataFrame
                                                 │
                            ┌────────────────────┤
                            ▼                    ▼
                      cleaning.py          outliers.py
                      (Missing,            (Z-Score, IQR,
                       Imputation)          LOF, IsoForest)
                            │                    │
                            ▼                    ▼
                      analysis.py           nlp.py
                      (Correlation,         (Sentiment,
                       Clustering)           NER)
                            │                    │
                            └────────┬───────────┘
                                     ▼
                               ai_eda.py
                               (LLM: Hypotheses,
                                Text-to-SQL,
                                Narrative Summary)
                                     │
                            ┌────────┴────────┐
                            ▼                 ▼
                       Terminal         vector_search.py
                       (Text)           (LanceDB Index,
                                        ANN Search,
                                        RAG Responses)

Fallback Mechanism

User query
      │
      ▼
LLM Client
      │
      ├──> Primary endpoint (Ollama)
      │         │
      │         ├── Success ──> Response
      │         │
      │         └── Error ──> Retry with exponential backoff
      │                            │
      │                            └── All retries exhausted
      │                                      │
      └──> Fallback endpoint (Cloud) <──────┘
                    │
                    └──> Response

Safety

  • SQL execution is read-only. Generated queries are sanitized to a single SELECT statement; write keywords (INSERT, UPDATE, DELETE, DROP, etc.) are rejected.
  • File size limit. Files larger than 50 MB are skipped during indexing to avoid memory pressure.
  • Filter validation. Vector search accepts only YYYY-MM-DD dates and file extensions from a fixed allow-list.
  • Lazy HTTP clients. The LLM client does not open connections until the first request and closes them deterministically via an async context manager.

Tests

python3 -m pytest tests/ -v

The test suite (52 tests) covers:

  • LLM client: primary success, fallback switching, error case without fallback configured, context-manager lifecycle, and exponential backoff.
  • CSV and JSON ingestion (empty, populated, schema inference).
  • Missing-value detection and categorical/numeric imputation.
  • All four outlier detection methods on synthetic data with known outliers.
  • Pearson correlation, top correlations, clustering, and validation errors.
  • NLP module: text column detection, word frequencies, sentiment, NER, and edge cases (empty input).
  • AI analysis functions with mocked LLM calls.
  • Survival analysis: Kaplan-Meier, log-rank test, Cox PH, and validation.

Dependencies

Package Purpose
pandas DataFrames, schema inference, CSV/JSON parsing
numpy Numeric arrays, vector operations
scipy Statistical tests (Z-Score)
scikit-learn Isolation Forest, LOF, k-Means, DBSCAN, imputation
plotly Chart generation (interactive and HTML export)
sqlalchemy SQL database connections (PostgreSQL, MySQL)
pymongo MongoDB connections
lancedb Vector database for semantic text search
pyarrow Columnar format for LanceDB
lifelines Survival analysis (Kaplan-Meier, Cox PH)
httpx Async HTTP client for LLM API calls

License

MIT

About

AutoEDA is a Python library for automated exploratory data analysis, providing ingestion, outlier detection, correlation analysis, and clustering with comprehensive test coverage.

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages