A research project investigating whether classical tree-based search algorithms (DFS, BFS, A*) can make LLM-driven web navigation agents more robust and efficient. The agent navigates a purpose-built e-commerce environment, guided by Gemini 2.0 Flash at each decision step, and is evaluated across 130 structured queries.
Paper: Saving Clicks and Sanity: Inference-Time Structured Search with Language Model Agents — Padhma Muniraj, University of Michigan, Ann Arbor
Master's Capstone Project · Department of Statistics · Advised by Dr. Chris Teplovs
📄 Read the report · 🔗 DOI:
[https://doi.org/10.7302/eqcz-wm18]· 🏛️ Deep Blue Repository
Large language models can suggest plausible navigation actions, but they lack systematic exploration strategies for multi-step tasks. This project integrates DFS, BFS, and A* as structured backbones for an LLM agent navigating a simulated laptop e-commerce site. At each step, the agent:
- Loads the current page via Selenium and extracts DOM elements into a structured representation
- Formats state + goal constraints into a structured prompt for Gemini 2.0 Flash
- Parses the LLM's action recommendations (label → URL)
- Expands the search frontier using the active algorithm
- Checks whether the current page satisfies the goal constraints
The environment (E-Buy) is intentionally designed with dead ends, cyclic links, multiple paths to goals, and breadcrumb navigation to create realistic search challenges.
390 search trajectories (130 queries × 3 algorithms) were collected and analyzed with both frequentist ANOVA and Bayesian joint modeling.
| Algorithm | Success Rate | Avg. Execution Time (ms) | Avg. Nodes Expanded | Grounding Success |
|---|---|---|---|---|
| A* Search | 0.56 | 28,967 | 3.08 | 68.42% |
| Breadth-First Search | 0.55 | 25,935 | 2.96 | 65.41% |
| Depth-First Search | 0.54 | 30,376 | 3.05 | 63.91% |
Key findings:
- All three algorithms performed similarly in success rate and search efficiency — no statistically significant differences were found via ANOVA (p > 0.49 for continuous metrics).
- Task complexity is the dominant factor: simple queries (complexity 1–2) achieved near 100% success; performance degraded sharply at complexity 3–4.
- LLM reasoning quality is the primary bottleneck: agent grounding success (robustness to LLM misinterpretations) had a larger impact on outcomes than choice of search algorithm.
- BFS exhibited slightly faster execution times; A* showed a marginal edge in success probability under Bayesian modeling.
%%{init: {'theme': 'base'}}%%
flowchart TD
A([Task + Goal Constraints]) --> B
subgraph Agent["Agent Controller"]
B[Main Loop]
end
subgraph Environment["Web Environment · E-Buy"]
C[Page Loader\nSelenium DOM Extractor]
D[Structured Element List\ne₁ e₂ … eₙ]
C --> D
end
subgraph LLM["LLM Interaction Layer"]
E[Prompt Formatter\nstate + goal → prompt]
F[Gemini 2.0 Flash\nAPI call + cache]
G[Response Parser\nAction: label → URL]
E --> F --> G
end
subgraph Search["Search Engine"]
H{Algorithm}
I[DFS\nStack / LIFO]
J[BFS\nQueue / FIFO]
K[A★\nMin-Heap f=g+h]
H --> I & J & K
L[expand_frontier\nnew SearchNode]
I & J & K --> L
end
subgraph Eval["Evaluation"]
M[Goal Checker\nconstraint validation]
N[Search Logger\nCSV metrics]
end
B -->|load page| C
D -->|elements| E
G -->|actions| H
L -->|next node| M
M -->|not satisfied| B
M -->|goal found| O([End: path + metrics])
L --> N
capstone-master/
├── index.html # E-Buy homepage
├── category/ # Category pages (laptops, gaming, business, etc.)
├── pages/ # Product listing pages
├── css/ # Stylesheets
├── js/ # Frontend scripts
├── data/ # Product data (JSON)
├── llm_agent/
│ ├── main.py # Entry point — runs full simulation across all queries & algorithms
│ ├── agent/agent.py # Core agent loop
│ ├── llm/ # Gemini API client & prompt formatter
│ ├── scraper/page_loader.py # Selenium-based DOM extractor
│ ├── search_algorithms/ # DFS, BFS, A* implementations + node/frontier logic
│ ├── utils/ # Goal checker, response parsers
│ ├── agent_logging/ # CSV logger for simulation results
│ ├── rate_limit/ # API rate limiter (30s cooldown between calls)
│ ├── simulation_queries.json # 130 query definitions (task + goal constraints + complexity)
│ └── simulation_logs.csv # Output logs from simulation runs
└── statistical_analysis/
└── bayesian_analysis.ipynb # Frequentist ANOVA + Bayesian joint model (PyMC/ADVI)
- Python 3.10+
- Google Chrome + ChromeDriver (for Selenium)
- A Google Gemini API key
cd llm_agent
pip install -r requirements.txtCreate a .env file in llm_agent/ with your Gemini API key:
GEMINI_API_KEY=your_key_here
cd llm_agent
python main.pyThis runs all 130 queries in simulation_queries.json against each of the three search algorithms (390 runs total) and writes results to simulation_logs.csv. A 30-second cooldown is enforced between queries to respect API rate limits.
Each search node encapsulates the state as (page_url, dom_elements, action_history, raw_html). At each step:
- DFS pops from a stack, aggressively exploring deep paths before backtracking.
- BFS dequeues from a FIFO queue, exploring level-by-level to find shortest paths.
- A* uses a min-heap ordered by
f(n) = g(n) + h(n), whereg(n)is the number of clicks taken andh(n)is an LLM-generated "promise score" (1–10) for how likely the current page leads to the goal.
Duplicate nodes are pruned by comparing page paths and action histories. Goal satisfaction is checked at every expanded node against structured attribute constraints (e.g., brand=Apple, price < $2000).
| Metric | Description |
|---|---|
| Task Success Rate (TSR) | Fraction of queries completed within 50 steps |
| Mean Execution Time (MET) | Average wall-clock time per run (ms) |
| Nodes Expanded | DOM pages visited before termination |
| Path Length | Action steps taken on successful paths |
| Agent Grounding Success | Whether the LLM correctly identified the goal page despite noise |
statistical_analysis/bayesian_analysis.ipynb fits a Bayesian joint model (PyMC, ADVI) over all five metrics simultaneously, with algorithm as a one-hot encoded predictor. Model specifications: Normal likelihood for log-transformed execution time, Bernoulli for success and grounding, Negative Binomial for nodes expanded, Poisson for path length, and Beta regression for TSR. Posterior inference used 30,000 ADVI iterations with 1,000 posterior samples.
- Frontend (E-Buy): HTML, CSS, JavaScript
- Agent: Python, Google Gemini 2.0 Flash (
google-genai), Selenium WebDriver - Analysis: PyMC, ArviZ, pandas, scipy, seaborn, scikit-learn