CaseBundleGen: A Multimodal RAG Benchmark Generator

A research framework for generating synthetic case bundles and evaluating Retrieval-Augmented Generation (RAG) systems. This project generates realistic case documents and systematically benchmarks RAG model performance using BERTScore.

Overview

This framework consists of three main components:

Case Generation: Creates diverse, realistic synthetic case bundles with structured documents
RAG Evaluation: Sets up and evaluates RAG systems using llama-index with Ollama integration
Benchmarking: Measures RAG performance using BERTScore, a semantic similarity metric computed from contextual token embeddings

Prerequisites

Python 3.10 or newer
Ollama CLI installed and available on PATH (tested with Ollama 0.18.1)
HuggingFace models (automatically downloaded on first use)
GPU recommended for embedding and benchmarking tasks

Install Ollama (v0.18.1)

This project has been tested with Ollama 0.18.1.

Linux

curl -fsSL https://ollama.com/install.sh | OLLAMA_VERSION=0.18.1 sh

macOS

curl -fsSL https://ollama.com/install.sh | OLLAMA_VERSION=0.18.1 sh

Windows (PowerShell)

$env:OLLAMA_VERSION="0.18.1"
irm https://ollama.com/install.ps1 | iex

Verify installation

ollama --version

Expected output:

ollama version 0.18.1

Quick Start

Setup

python -m venv .venv
source .venv/bin/activate   # Windows PowerShell: .venv\Scripts\Activate.ps1
pip install -r requirements.txt

Generate Cases

By default, generate_case.py uses the gemma3:12b model and generates 100 synthetic cases:

python generate_case.py

Cases are stored in the cases/ directory with the structure:

cases/case_001/, cases/case_002/, etc.
Each case contains multiple document types (emails, bank statements, reports, etc.)
Each case also includes metadata.json and qa.json for evaluation

Evaluate RAG Systems

Open and run the Jupyter notebook in the Evaluation/ folder:

jupyter notebook Evaluation/evaluation.ipynb

The notebook:

Preprocesses case files and moves Q&A pairs to a dedicated folder
Builds vector indices using HuggingFace embeddings (BAAI/bge-base-en-v1.5)
Queries RAG systems using Ollama-powered LLMs
Records predicted answers and metrics for benchmarking
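
The indexing and querying steps listed above can be sketched with llama-index roughly as follows. This is an illustration under stated assumptions, not the notebook's actual code: the function name `build_query_engine`, the default directory, and the timeout are hypothetical, the notebook may build one index per case rather than a single index, and the sketch assumes the llama-index core, HuggingFace embedding, and Ollama integration packages are installed.

```python
import os


def build_query_engine(
    cases_dir: str = "Evaluation/cases", llm_name: str = "gemma3:12b"
):
    """Index the documents under cases_dir and return a query engine."""
    # Imported lazily so this module still loads where llama-index is absent.
    from llama_index.core import Settings, SimpleDirectoryReader, VectorStoreIndex
    from llama_index.embeddings.huggingface import HuggingFaceEmbedding
    from llama_index.llms.ollama import Ollama

    # Embedding model named in this README; device and batch size come from
    # the environment variables described in the Configuration section.
    Settings.embed_model = HuggingFaceEmbedding(
        model_name="BAAI/bge-base-en-v1.5",
        device=os.environ.get("EMBED_DEVICE", "cpu"),
        embed_batch_size=int(os.environ.get("EMBED_BATCH_SIZE", "8")),
    )
    # Local LLM served by Ollama; the timeout value is an arbitrary assumption.
    Settings.llm = Ollama(model=llm_name, request_timeout=300.0)

    # Load every readable file below cases_dir and embed it into a vector index.
    documents = SimpleDirectoryReader(input_dir=cases_dir, recursive=True).load_data()
    index = VectorStoreIndex.from_documents(documents)
    return index.as_query_engine()
```

With an Ollama server running and the model pulled, `build_query_engine().query("...")` returns a response object whose string form is the predicted answer.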

Benchmark with BERTScore

Once evaluation results are generated, use BERTScore for robust semantic evaluation:

cd bertscore
python run_bertscore_eval.py --input case_benchmark_results_bertscore_input.csv

BERTScore is the primary benchmarking tool used to evaluate RAG answer quality. It measures contextual similarity between predicted and reference answers by comparing their token embeddings. Unlike exact string matching, BERTScore captures semantic meaning, so paraphrased but correct answers still score well, and it tends to agree more closely with human judgment.
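
To make the idea concrete, the sketch below implements the greedy matching at the core of BERTScore on toy vectors. It is a simplified illustration, not the code in run_bertscore_eval.py: the 2-d "embeddings" are invented, whereas real BERTScore uses contextual embeddings from a pretrained transformer and can optionally apply idf weighting and baseline rescaling, both omitted here.

```python
import math

Vector = list[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def greedy_match_scores(
    cand: list[Vector], ref: list[Vector]
) -> tuple[float, float, float]:
    """Return (precision, recall, F1) from greedy token matching."""
    # Precision: match each candidate token to its most similar reference token.
    precision = sum(max(cosine(c, r) for r in ref) for c in cand) / len(cand)
    # Recall: match each reference token to its most similar candidate token.
    recall = sum(max(cosine(r, c) for c in cand) for r in ref) / len(ref)
    total = precision + recall
    f1 = 2 * precision * recall / total if total > 0 else 0.0
    return precision, recall, f1


# Toy token embeddings, invented for illustration only.
reference = [[1.0, 0.0], [0.0, 1.0]]
candidate = [[1.0, 0.0], [1.0, 0.0]]
p, r, f1 = greedy_match_scores(candidate, reference)
print(f"P={p:.2f} R={r:.2f} F1={f1:.2f}")  # prints: P=1.00 R=0.50 F1=0.67
```

Here the candidate covers only one of the two reference tokens, so precision is perfect while recall is halved.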

Project Structure

├── generate_case.py              # Main pipeline for synthetic case generation
├── ollama_utils.py               # Utilities for Ollama integration
├── cases/                        # Generated synthetic case bundles
├── Evaluation/
│   ├── evaluation.ipynb          # Jupyter notebook for RAG setup and evaluation
│   ├── cases/                    # Symbolic links to generated cases
│   └── quesNAs/                  # Q&A pairs extracted from cases
├── bertscore/
│   ├── run_bertscore_eval.py     # BERTScore evaluation script
│   ├── trim_bertscore_columns.py # Utility for result processing
│   └── case_benchmark_results_*.csv # Benchmark results at various stages
├── prompts/                      # Prompt templates for case generation
├── samples/                      # Sample documents for style reference
└── README.md                     # This file

Key Features

Ollama-Compatible: Leverages Ollama for local LLM inference without external API dependencies
Configurable Case Generation: Adjust the number of cases by modifying the batch size in generate_case.py
Semantic Evaluation: Uses BERTScore for evaluation beyond lexical metrics
Reproducibility: All components are designed for research reproducibility and transparency

Configuration

Environment variables for evaluation:

EMBED_DEVICE=cpu              # Embedding device (cpu, cuda, etc.)
EMBED_BATCH_SIZE=8            # Batch size for embeddings
INDEX_INSERT_BATCH_SIZE=16    # Batch size for index insertion

Notes

This repository is designed for research and experimental evaluation purposes
Case generation times depend on model performance and available compute resources
We recommend gemma3:12b as a good balance of quality and speed, but any model supported by Ollama can be used. The model's output must not contain thinking traces or internal reasoning steps, because these interfere with the evaluation. If you choose a different model, make sure it produces clean final answers without extraneous text.
For faster iteration, reduce the number of cases in generate_case.py
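
If a model does emit thinking traces, one possible workaround is to strip them before the answers are recorded. The sketch below is an assumption-laden example and not something this repository is known to do: the helper `strip_thinking` is hypothetical, and it assumes the reasoning is wrapped in `<think>...</think>` tags, which varies by model.

```python
import re

# Assumed delimiter; adjust to whatever tags the chosen model actually emits.
THINK_BLOCK = re.compile(r"<think>.*?</think>", re.DOTALL | re.IGNORECASE)


def strip_thinking(text: str) -> str:
    """Remove <think>...</think> spans and trim surrounding whitespace."""
    return THINK_BLOCK.sub("", text).strip()


raw = "<think>The statement shows two transfers.</think>\nThe total is 450 USD."
print(strip_thinking(raw))  # prints: The total is 450 USD.
```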

License

This project is licensed under the Apache License 2.0. See the LICENSE file for details.
