Gemma Token Analysis

Notebook	Link
Logprobs Generation
Token Tree Analysis

This repository explores the internal stochastic nature of Gemma models. By extracting transition scores and logits from the Hugging Face transformers generation loop, we can analyze the model's confidence levels and visualize "competing" tokens at each step of the sequence.

This repository contains no confidential data/IP and is intended for demonstration and research use.

Features

Log Probability Analysis: Extract and analyze the log probabilities of generated tokens to understand model confidence.
Top-K Candidates: View the top alternative tokens considered by the model at each step.
Guided Generation: Steer the model's output by providing a specific starting prefix (e.g., forcing a code block).
Token Tree Exploration: Construct and visualize decision trees of token generation paths based on probability thresholds.
Data Export: Save analysis results to JSONL or JSON for further processing.
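
The first, second, and last features can be illustrated without loading a model. The following is a minimal sketch in plain Python, not the notebook's code: the vocabulary, the logits, and the helper names (log_softmax, top_k_candidates) are made up for illustration, and the record layout is an assumption rather than the repository's actual JSONL schema.

```python
import json
import math

def log_softmax(logits):
    """Convert raw logits to log probabilities in a numerically stable way."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_norm for x in logits]

def top_k_candidates(logits, vocab, k=3):
    """Return the k most likely tokens with their log probability and probability."""
    logprobs = log_softmax(logits)
    ranked = sorted(zip(vocab, logprobs), key=lambda pair: pair[1], reverse=True)
    return [{"token": tok, "logprob": lp, "prob": math.exp(lp)} for tok, lp in ranked[:k]]

# Toy four-token vocabulary and invented logits for a single generation step.
vocab = ["def", "class", "import", "#"]
logits = [2.0, 1.0, 0.5, -1.0]

# Hypothetical record layout: one step with its competing candidates.
step_record = {"step": 0, "candidates": top_k_candidates(logits, vocab, k=3)}

# JSONL stores one JSON object per line.
jsonl_line = json.dumps(step_record)
print(jsonl_line)
```

In a real run the logits would come from the model's generation output instead of a hard-coded list, and one such line would be written per generated token.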

Dynamic Thresholding Logic

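The dynamic thresholding logic in Token_Tree_Analysis.ipynb adjusts how "picky" the tree search is about branching, based on how full the search queue currently is: the fuller the queue, the higher the probability an alternative token needs in order to open a new branch.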

$$ T_{current} = T_{min} + \min\left(1.0, \frac{|Q|}{Q_{limit}}\right) \times (T_{max} - T_{min}) $$

Where:

$T_{current}$ is the calculated probability threshold for the current step.
$T_{min}$ is the min_branch_threshold (e.g., 0.1).
$T_{max}$ is the max_branch_threshold (e.g., 0.5).
$|Q|$ is the current length of the queue (number of active paths).
$Q_{limit}$ is the soft_queue_limit (target number of active paths).

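Note: The saturation ratio $|Q| / Q_{limit}$ is capped at 1.0, so $T_{current}$ never exceeds $T_{max}$, even when the queue grows past the soft limit.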
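
The formula translates directly into a few lines of Python. This is a sketch, not the notebook's implementation: the function name dynamic_threshold is hypothetical, the default thresholds are the example values given above, and soft_queue_limit=10 is an arbitrary value chosen for the demonstration.

```python
def dynamic_threshold(queue_len, soft_queue_limit,
                      min_branch_threshold=0.1, max_branch_threshold=0.5):
    """Interpolate between the min and max threshold by queue saturation."""
    if soft_queue_limit <= 0:
        raise ValueError("soft_queue_limit must be positive")
    # Saturation ratio |Q| / Q_limit, capped at 1.0.
    saturation = min(1.0, queue_len / soft_queue_limit)
    return min_branch_threshold + saturation * (max_branch_threshold - min_branch_threshold)

# Thresholds for an empty, half-full, full, and over-full queue.
for queue_len in (0, 5, 10, 25):
    print(queue_len, round(dynamic_threshold(queue_len, soft_queue_limit=10), 3))
```

With these values, a half-full queue (5 of 10) gives 0.1 + 0.5 × 0.4 = 0.3, and any queue at or beyond the soft limit gives 0.5.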

How it works behaviorally:

Empty Queue: When the queue is nearly empty, the threshold stays close to $T_{min}$. The search branches readily and explores even low-probability alternatives.
Full Queue: As the queue fills up (approaching soft_queue_limit), the threshold rises toward $T_{max}$. The search becomes selective and branches only on high-probability tokens, which keeps the number of active paths from growing exponentially.
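
The effect of the two regimes can be seen in a toy breadth-first search. The sketch below rests on several assumptions that are not taken from the notebook: the stand-in function toy_next_token_probs returns a fixed, invented distribution instead of calling a model, every path always follows its most likely token, and an alternative token opens a new branch only if its probability meets the current threshold. All names are hypothetical.

```python
from collections import deque

def dynamic_threshold(queue_len, soft_queue_limit, t_min=0.1, t_max=0.5):
    """Threshold rises from t_min to t_max as the queue fills up."""
    saturation = min(1.0, queue_len / soft_queue_limit)
    return t_min + saturation * (t_max - t_min)

def toy_next_token_probs(path):
    """Hypothetical stand-in for a model call: same distribution at every step."""
    return {"a": 0.6, "b": 0.25, "c": 0.15}

def explore(max_depth=3, soft_queue_limit=4):
    """Breadth-first expansion of token paths with a queue-dependent threshold."""
    queue = deque([()])  # each entry is a tuple of tokens generated so far
    finished = []
    while queue:
        path = queue.popleft()
        if len(path) == max_depth:
            finished.append(path)
            continue
        threshold = dynamic_threshold(len(queue), soft_queue_limit)
        ranked = sorted(toy_next_token_probs(path).items(),
                        key=lambda item: item[1], reverse=True)
        for rank, (token, prob) in enumerate(ranked):
            # Assumption: always continue with the top token,
            # branch on alternatives only above the threshold.
            if rank == 0 or prob >= threshold:
                queue.append(path + (token,))
    return finished

print(len(explore(soft_queue_limit=4)))     # tight limit: few paths survive
print(len(explore(soft_queue_limit=1000)))  # loose limit: near-exhaustive search
```

With a soft limit of 4, the root branches three ways while the queue is empty, after which the threshold climbs to about 0.3 and each path only follows its top token. With a soft limit of 1000 the threshold barely moves, so all 27 depth-3 paths are explored.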

Repository Structure

Logprobs_in_Gemma.ipynb: The main Jupyter Notebook containing the log probability analysis code, helper functions, and experiments.
Token_Tree_Analysis.ipynb: Notebook for generating and analyzing token trees.
token_tree_analysis/: Contains the visualizer and sample outputs for the token tree analysis.

Installation

Clone the repository.
Install the required dependencies:

pip install -U torch transformers pandas accelerate numpy huggingface-hub

Usage

Open Logprobs_in_Gemma.ipynb in VS Code or Jupyter Lab.
Ensure you have a Hugging Face account and an access token.
Run the notebook cells to:
- Authenticate with Hugging Face.
- Load the Gemma model (default: google/gemma-2-2b-it).
- Run the log probability analysis experiment.
- Run the guided generation experiment.
Open Token_Tree_Analysis.ipynb to generate and analyze token decision trees.

Visualization

You can visualize the generated JSONL data using the Gemma Token Analysis Visualizer.
For analyzing token generation trees, use the Token Tree Visualizer.
- Response Visualizer: Click any token to see alternatives and regenerate sequences
- Tree Visualizer: Interactive D3.js visualization of token generation paths

Requirements

Python 3.8+
PyTorch
Transformers
Pandas
Accelerate
Bitsandbytes (not included in the pip command under Installation; install it separately if a notebook needs it)
A GPU is recommended for faster inference.

