Skip to content

Padhma/capstone

Repository files navigation

Saving Clicks and Sanity: Inference-Time Structured Search with Language Model Agents

A research project investigating whether classical tree-based search algorithms (DFS, BFS, A*) can make LLM-driven web navigation agents more robust and efficient. The agent navigates a purpose-built e-commerce environment, guided by Gemini 2.0 Flash at each decision step, and is evaluated across 130 structured queries.

Paper: Saving Clicks and Sanity: Inference-Time Structured Search with Language Model Agents — Padhma Muniraj, University of Michigan, Ann Arbor

Master's Capstone Project · Department of Statistics · Advised by Dr. Chris Teplovs

📄 Read the report · 🔗 DOI: [https://doi.org/10.7302/eqcz-wm18] · 🏛️ Deep Blue Repository

Overview

Large language models can suggest plausible navigation actions, but they lack systematic exploration strategies for multi-step tasks. This project integrates DFS, BFS, and A* as structured backbones for an LLM agent navigating a simulated laptop e-commerce site. At each step, the agent:

  1. Loads the current page via Selenium and extracts DOM elements into a structured representation
  2. Formats state + goal constraints into a structured prompt for Gemini 2.0 Flash
  3. Parses the LLM's action recommendations (label → URL)
  4. Expands the search frontier using the active algorithm
  5. Checks whether the current page satisfies the goal constraints

The environment (E-Buy) is intentionally designed with dead ends, cyclic links, multiple paths to goals, and breadcrumb navigation to create realistic search challenges.

Results Summary

390 search trajectories (130 queries × 3 algorithms) were collected and analyzed with both frequentist ANOVA and Bayesian joint modeling.

Algorithm Success Rate Avg. Execution Time (ms) Avg. Nodes Expanded Grounding Success
A* Search 0.56 28,967 3.08 68.42%
Breadth-First Search 0.55 25,935 2.96 65.41%
Depth-First Search 0.54 30,376 3.05 63.91%

Key findings:

  • All three algorithms performed similarly in success rate and search efficiency — no statistically significant differences were found via ANOVA (p > 0.49 for continuous metrics).
  • Task complexity is the dominant factor: simple queries (complexity 1–2) achieved near 100% success; performance degraded sharply at complexity 3–4.
  • LLM reasoning quality is the primary bottleneck: agent grounding success (robustness to LLM misinterpretations) had a larger impact on outcomes than choice of search algorithm.
  • BFS exhibited slightly faster execution times; A* showed a marginal edge in success probability under Bayesian modeling.

Architecture

%%{init: {'theme': 'base'}}%%
flowchart TD
    A([Task + Goal Constraints]) --> B

    subgraph Agent["Agent Controller"]
        B[Main Loop]
    end

    subgraph Environment["Web Environment · E-Buy"]
        C[Page Loader\nSelenium DOM Extractor]
        D[Structured Element List\ne₁ e₂ … eₙ]
        C --> D
    end

    subgraph LLM["LLM Interaction Layer"]
        E[Prompt Formatter\nstate + goal → prompt]
        F[Gemini 2.0 Flash\nAPI call + cache]
        G[Response Parser\nAction: label → URL]
        E --> F --> G
    end

    subgraph Search["Search Engine"]
        H{Algorithm}
        I[DFS\nStack / LIFO]
        J[BFS\nQueue / FIFO]
        K[A★\nMin-Heap f=g+h]
        H --> I & J & K
        L[expand_frontier\nnew SearchNode]
        I & J & K --> L
    end

    subgraph Eval["Evaluation"]
        M[Goal Checker\nconstraint validation]
        N[Search Logger\nCSV metrics]
    end

    B -->|load page| C
    D -->|elements| E
    G -->|actions| H
    L -->|next node| M
    M -->|not satisfied| B
    M -->|goal found| O([End: path + metrics])
    L --> N
Loading

Project Structure

capstone-master/
├── index.html                  # E-Buy homepage
├── category/                   # Category pages (laptops, gaming, business, etc.)
├── pages/                      # Product listing pages
├── css/                        # Stylesheets
├── js/                         # Frontend scripts
├── data/                       # Product data (JSON)
├── llm_agent/
│   ├── main.py                 # Entry point — runs full simulation across all queries & algorithms
│   ├── agent/agent.py          # Core agent loop
│   ├── llm/                    # Gemini API client & prompt formatter
│   ├── scraper/page_loader.py  # Selenium-based DOM extractor
│   ├── search_algorithms/      # DFS, BFS, A* implementations + node/frontier logic
│   ├── utils/                  # Goal checker, response parsers
│   ├── agent_logging/          # CSV logger for simulation results
│   ├── rate_limit/             # API rate limiter (30s cooldown between calls)
│   ├── simulation_queries.json # 130 query definitions (task + goal constraints + complexity)
│   └── simulation_logs.csv     # Output logs from simulation runs
└── statistical_analysis/
    └── bayesian_analysis.ipynb # Frequentist ANOVA + Bayesian joint model (PyMC/ADVI)

Setup

Prerequisites

  • Python 3.10+
  • Google Chrome + ChromeDriver (for Selenium)
  • A Google Gemini API key

Install dependencies

cd llm_agent
pip install -r requirements.txt

Configure environment

Create a .env file in llm_agent/ with your Gemini API key:

GEMINI_API_KEY=your_key_here

Run the simulation

cd llm_agent
python main.py

This runs all 130 queries in simulation_queries.json against each of the three search algorithms (390 runs total) and writes results to simulation_logs.csv. A 30-second cooldown is enforced between queries to respect API rate limits.

How the Agent Works

Each search node encapsulates the state as (page_url, dom_elements, action_history, raw_html). At each step:

  • DFS pops from a stack, aggressively exploring deep paths before backtracking.
  • BFS dequeues from a FIFO queue, exploring level-by-level to find shortest paths.
  • A* uses a min-heap ordered by f(n) = g(n) + h(n), where g(n) is the number of clicks taken and h(n) is an LLM-generated "promise score" (1–10) for how likely the current page leads to the goal.

Duplicate nodes are pruned by comparing page paths and action histories. Goal satisfaction is checked at every expanded node against structured attribute constraints (e.g., brand=Apple, price < $2000).

Evaluation Metrics

Metric Description
Task Success Rate (TSR) Fraction of queries completed within 50 steps
Mean Execution Time (MET) Average wall-clock time per run (ms)
Nodes Expanded DOM pages visited before termination
Path Length Action steps taken on successful paths
Agent Grounding Success Whether the LLM correctly identified the goal page despite noise

Statistical Analysis

statistical_analysis/bayesian_analysis.ipynb fits a Bayesian joint model (PyMC, ADVI) over all five metrics simultaneously, with algorithm as a one-hot encoded predictor. Model specifications: Normal likelihood for log-transformed execution time, Bernoulli for success and grounding, Negative Binomial for nodes expanded, Poisson for path length, and Beta regression for TSR. Posterior inference used 30,000 ADVI iterations with 1,000 posterior samples.

Tech Stack

  • Frontend (E-Buy): HTML, CSS, JavaScript
  • Agent: Python, Google Gemini 2.0 Flash (google-genai), Selenium WebDriver
  • Analysis: PyMC, ArviZ, pandas, scipy, seaborn, scikit-learn

About

A research project investigating whether classical tree-based search algorithms (DFS, BFS, A*) can make LLM-driven web navigation agents more robust and efficient.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors