Saving Clicks and Sanity: Inference-Time Structured Search with Language Model Agents

A research project investigating whether classical tree-based search algorithms (DFS, BFS, A*) can make LLM-driven web navigation agents more robust and efficient. The agent navigates a purpose-built e-commerce environment, guided by Gemini 2.0 Flash at each decision step, and is evaluated across 130 structured queries.

Paper: Saving Clicks and Sanity: Inference-Time Structured Search with Language Model Agents — Padhma Muniraj, University of Michigan, Ann Arbor

Master's Capstone Project · Department of Statistics · Advised by Dr. Chris Teplovs

📄 Read the report · 🔗 DOI: [https://doi.org/10.7302/eqcz-wm18] · 🏛️ Deep Blue Repository

Overview

Large language models can suggest plausible navigation actions, but they lack systematic exploration strategies for multi-step tasks. This project integrates DFS, BFS, and A* as structured backbones for an LLM agent navigating a simulated laptop e-commerce site. At each step, the agent:

Loads the current page via Selenium and extracts DOM elements into a structured representation
Formats state + goal constraints into a structured prompt for Gemini 2.0 Flash
Parses the LLM's action recommendations (label → URL)
Expands the search frontier using the active algorithm
Checks whether the current page satisfies the goal constraints

The environment (E-Buy) is intentionally designed with dead ends, cyclic links, multiple paths to goals, and breadcrumb navigation to create realistic search challenges.

Results Summary

390 search trajectories (130 queries × 3 algorithms) were collected and analyzed with both frequentist ANOVA and Bayesian joint modeling.

Algorithm	Success Rate	Avg. Execution Time (ms)	Avg. Nodes Expanded	Grounding Success
A* Search	0.56	28,967	3.08	68.42%
Breadth-First Search	0.55	25,935	2.96	65.41%
Depth-First Search	0.54	30,376	3.05	63.91%

Key findings:

All three algorithms performed similarly in success rate and search efficiency — no statistically significant differences were found via ANOVA (p > 0.49 for continuous metrics).
Task complexity is the dominant factor: simple queries (complexity 1–2) achieved near 100% success; performance degraded sharply at complexity 3–4.
LLM reasoning quality is the primary bottleneck: agent grounding success (robustness to LLM misinterpretations) had a larger impact on outcomes than choice of search algorithm.
BFS exhibited slightly faster execution times; A* showed a marginal edge in success probability under Bayesian modeling.

Architecture

%%{init: {'theme': 'base'}}%%
flowchart TD
    A([Task + Goal Constraints]) --> B

    subgraph Agent["Agent Controller"]
        B[Main Loop]
    end

    subgraph Environment["Web Environment · E-Buy"]
        C[Page Loader\nSelenium DOM Extractor]
        D[Structured Element List\ne₁ e₂ … eₙ]
        C --> D
    end

    subgraph LLM["LLM Interaction Layer"]
        E[Prompt Formatter\nstate + goal → prompt]
        F[Gemini 2.0 Flash\nAPI call + cache]
        G[Response Parser\nAction: label → URL]
        E --> F --> G
    end

    subgraph Search["Search Engine"]
        H{Algorithm}
        I[DFS\nStack / LIFO]
        J[BFS\nQueue / FIFO]
        K[A★\nMin-Heap f=g+h]
        H --> I & J & K
        L[expand_frontier\nnew SearchNode]
        I & J & K --> L
    end

    subgraph Eval["Evaluation"]
        M[Goal Checker\nconstraint validation]
        N[Search Logger\nCSV metrics]
    end

    B -->|load page| C
    D -->|elements| E
    G -->|actions| H
    L -->|next node| M
    M -->|not satisfied| B
    M -->|goal found| O([End: path + metrics])
    L --> N

Project Structure

capstone-master/
├── index.html                  # E-Buy homepage
├── category/                   # Category pages (laptops, gaming, business, etc.)
├── pages/                      # Product listing pages
├── css/                        # Stylesheets
├── js/                         # Frontend scripts
├── data/                       # Product data (JSON)
├── llm_agent/
│   ├── main.py                 # Entry point — runs full simulation across all queries & algorithms
│   ├── agent/agent.py          # Core agent loop
│   ├── llm/                    # Gemini API client & prompt formatter
│   ├── scraper/page_loader.py  # Selenium-based DOM extractor
│   ├── search_algorithms/      # DFS, BFS, A* implementations + node/frontier logic
│   ├── utils/                  # Goal checker, response parsers
│   ├── agent_logging/          # CSV logger for simulation results
│   ├── rate_limit/             # API rate limiter (30s cooldown between calls)
│   ├── simulation_queries.json # 130 query definitions (task + goal constraints + complexity)
│   └── simulation_logs.csv     # Output logs from simulation runs
└── statistical_analysis/
    └── bayesian_analysis.ipynb # Frequentist ANOVA + Bayesian joint model (PyMC/ADVI)

Setup

Prerequisites

Python 3.10+
Google Chrome + ChromeDriver (for Selenium)
A Google Gemini API key

Install dependencies

cd llm_agent
pip install -r requirements.txt

Configure environment

Create a .env file in llm_agent/ with your Gemini API key:

GEMINI_API_KEY=your_key_here

Run the simulation

cd llm_agent
python main.py

This runs all 130 queries in simulation_queries.json against each of the three search algorithms (390 runs total) and writes results to simulation_logs.csv. A 30-second cooldown is enforced between queries to respect API rate limits.

How the Agent Works

Each search node encapsulates the state as (page_url, dom_elements, action_history, raw_html). At each step:

DFS pops from a stack, aggressively exploring deep paths before backtracking.
BFS dequeues from a FIFO queue, exploring level-by-level to find shortest paths.
A* uses a min-heap ordered by f(n) = g(n) + h(n), where g(n) is the number of clicks taken and h(n) is an LLM-generated "promise score" (1–10) for how likely the current page leads to the goal.

Duplicate nodes are pruned by comparing page paths and action histories. Goal satisfaction is checked at every expanded node against structured attribute constraints (e.g., brand=Apple, price < $2000).

Evaluation Metrics

Metric	Description
Task Success Rate (TSR)	Fraction of queries completed within 50 steps
Mean Execution Time (MET)	Average wall-clock time per run (ms)
Nodes Expanded	DOM pages visited before termination
Path Length	Action steps taken on successful paths
Agent Grounding Success	Whether the LLM correctly identified the goal page despite noise

Statistical Analysis

statistical_analysis/bayesian_analysis.ipynb fits a Bayesian joint model (PyMC, ADVI) over all five metrics simultaneously, with algorithm as a one-hot encoded predictor. Model specifications: Normal likelihood for log-transformed execution time, Bernoulli for success and grounding, Negative Binomial for nodes expanded, Poisson for path length, and Beta regression for TSR. Posterior inference used 30,000 ADVI iterations with 1,000 posterior samples.

Tech Stack

Frontend (E-Buy): HTML, CSS, JavaScript
Agent: Python, Google Gemini 2.0 Flash (google-genai), Selenium WebDriver
Analysis: PyMC, ArviZ, pandas, scipy, seaborn, scikit-learn

Name		Name	Last commit message	Last commit date
Latest commit History 37 Commits
.vscode		.vscode
assets/images		assets/images
category		category
css		css
data		data
js		js
llm_agent		llm_agent
pages		pages
server		server
statistical_analysis		statistical_analysis
.gitignore		.gitignore
Capstone.pdf		Capstone.pdf
README.md		README.md
index.html		index.html

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Saving Clicks and Sanity: Inference-Time Structured Search with Language Model Agents

Overview

Results Summary

Architecture

Project Structure

Setup

Prerequisites

Install dependencies

Configure environment

Run the simulation

How the Agent Works

Evaluation Metrics

Statistical Analysis

Tech Stack

About

Uh oh!

Releases

Packages

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Saving Clicks and Sanity: Inference-Time Structured Search with Language Model Agents

Overview

Results Summary

Architecture

Project Structure

Setup

Prerequisites

Install dependencies

Configure environment

Run the simulation

How the Agent Works

Evaluation Metrics

Statistical Analysis

Tech Stack

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Packages