Forest vs tree by deepakpandita57 · Pull Request #2 · google-research/vet

deepakpandita57 · 2026-02-11T21:46:31Z

Summary of the changes:

Support for multiprocessing to run trials in parallel and compression to save storage space
Support for categorical responses and associated metrics
Methods used to run experiments reported in the Forest vs Tree paper and the EACL paper
Minor changes to replace sklearn mean_squared_error() with root_mean_squared_error(), as the squared parameter of mean_squared_error() is deprecated (since scikit-learn 1.4)
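The PR code is not shown here, so the following is only a minimal, stdlib-only sketch of the general pattern behind the first bullet: independent trials fanned out over a process pool, with results gzip-compressed before storage. `run_trial`, `run_trials_in_parallel`, `compress_results`, and `decompress_results` are hypothetical names, and the toy trial (the mean of 100 Gaussian draws) merely stands in for whatever a real trial computes.

```python
import gzip
import json
import multiprocessing
import random
import statistics


def run_trial(seed: int) -> dict:
    """Hypothetical trial: summarize 100 simulated scores drawn with a fixed seed."""
    rng = random.Random(seed)  # Per-trial RNG, so results do not depend on the worker.
    scores = [rng.gauss(0.0, 1.0) for _ in range(100)]
    return {"seed": seed, "mean": statistics.fmean(scores)}


def run_trials_in_parallel(num_trials: int, num_workers: int) -> list:
    """Runs trials 0..num_trials-1 across a pool of worker processes."""
    # Use "fork" where the platform offers it so workers inherit run_trial;
    # otherwise fall back to the platform's default start method.
    method = "fork" if "fork" in multiprocessing.get_all_start_methods() else None
    ctx = multiprocessing.get_context(method)
    with ctx.Pool(processes=num_workers) as pool:
        # Pool.map returns results in input order: results[i] came from seed i.
        return pool.map(run_trial, range(num_trials))


def compress_results(results: list) -> bytes:
    """Serializes results to JSON and gzip-compresses the bytes."""
    return gzip.compress(json.dumps(results).encode("utf-8"))


def decompress_results(blob: bytes) -> list:
    """Inverse of compress_results."""
    return json.loads(gzip.decompress(blob).decode("utf-8"))


if __name__ == "__main__":
    results = run_trials_in_parallel(num_trials=8, num_workers=4)
    blob = compress_results(results)
    assert decompress_results(blob) == results
    print(f"{len(results)} trials, {len(blob)} compressed bytes")
```

Seeding each trial from its index keeps runs reproducible no matter which worker picks up a trial, and because `Pool.map` preserves input order, the parallel results can be compared directly against a serial run.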


tripperroc · 2026-02-26T00:03:40Z

Hi Deepak, I’m going to take Samay and Mamadou out for dinner tomorrow to celebrate the submission of our paper. It’s also Mamadou’s birthday, and Samay is leaving at the end of the week. Would you like to join us (on me)? We haven’t celebrated our conference attendances (or even acceptances, have we?).
C

From: Deepak Pandita <***@***.***>
Date: Wednesday, February 25, 2026 at 4:46 PM
To: google-research/vet <***@***.***>
Cc: Subscribed <***@***.***>
Subject: Re: [google-research/vet] Forest vs tree (PR #2)

@deepakpandita57 commented on this pull request.

In cat_machine_contest_metrics.py:

+    machine1: np.ndarray,
+    machine2: np.ndarray,
+) -> tuple[float, float]:
+  """Compute accuracy relative to human labels.
+
+  Args:
+    human: A list of human scores.
+    machine1: A list of machine scores.
+    machine2: Another list of machine scores.
+
+  Returns:
+    A pair of accuracy scores, for machines 1 and 2, relative to
+    human scores.
+  """
+
+  num_categories = np.max(np.concatenate((

Yes, I can use stack and max here.
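The quoted hunk is truncated, so how `num_categories` is used afterwards is not visible. As a hypothetical sketch only: if `human`, `machine1`, and `machine2` were integer label arrays of shape `(n_items, k_responses)` with labels `0, 1, ..., C - 1`, the category count could be inferred with `np.stack` and `np.max` and then used for a per-item majority vote before scoring accuracy. Note that `np.stack` requires all inputs to have the same shape (it adds a new leading axis), whereas `np.concatenate` only requires them to match outside the joined axis. The helper names and the majority-vote step are assumptions, not the PR's code.

```python
import numpy as np


def majority_label(responses: np.ndarray, num_categories: int) -> np.ndarray:
    """Most frequent category in each row; ties go to the lowest category id."""
    # counts[i, c] = number of responses in row i equal to category c.
    counts = np.stack(
        [np.bincount(row, minlength=num_categories) for row in responses]
    )
    return counts.argmax(axis=1)


def accuracy_vs_humans(
    human: np.ndarray,
    machine1: np.ndarray,
    machine2: np.ndarray,
) -> tuple[float, float]:
    """Hypothetical: accuracy of each machine's majority label vs. the human one."""
    # np.stack needs equal shapes; the largest label over all three arrays,
    # plus one, gives the number of categories (assuming 0-based labels).
    num_categories = int(np.max(np.stack((human, machine1, machine2)))) + 1
    gold = majority_label(human, num_categories)
    acc1 = np.mean(majority_label(machine1, num_categories) == gold)
    acc2 = np.mean(majority_label(machine2, num_categories) == gold)
    return float(acc1), float(acc2)


human = np.array([[0, 0, 1], [2, 2, 2], [1, 0, 1], [0, 1, 1]])
machine1 = np.array([[0, 0, 0], [2, 2, 1], [1, 1, 1], [0, 0, 1]])
machine2 = np.array([[1, 1, 1], [2, 2, 2], [0, 0, 1], [1, 1, 1]])
print(accuracy_vs_humans(human, machine1, machine2))  # (0.75, 0.5)
```

If the three inputs could differ in length, `np.concatenate` (as in the original line) would still work while `np.stack` would raise, so the switch assumes equal shapes.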

pk-at-g · 2026-03-02T20:27:29Z

+_COMPUTE_ACTUAL_P_VALUES = flags.DEFINE_boolean(
+    "compute_actual_p_values",
+    False,
+    "If true use categorical params directly to compute metrics and p-values.",


Give a pointer to the formula that would be used to compute an "actual" (analytic?) p-value.

This does not refer to analytic p-values. Here, "actual" means that the sampled categorical parameters are used directly to compute the p-values.
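To make the distinction in this reply concrete, here is a hypothetical sketch; the Dirichlet-then-multinomial generative model and every function name below are assumptions, not taken from the PR. With `use_params_directly=True`, metrics would be computed from the sampled per-item categorical parameters themselves; otherwise they would be computed from the empirical frequencies of `k_responses` simulated responses per item, which adds response-sampling noise.

```python
import numpy as np


def sample_item_params(rng: np.random.Generator, n_items: int, alpha) -> np.ndarray:
    """Assumed model: per-item categorical parameters p_i ~ Dirichlet(alpha)."""
    return rng.dirichlet(alpha, size=n_items)  # Shape (n_items, m_categories).


def distributions_for_metrics(
    params: np.ndarray,
    k_responses: int,
    rng: np.random.Generator,
    use_params_directly: bool,
) -> np.ndarray:
    """Per-item category distributions that downstream metrics would consume."""
    if use_params_directly:
        # Use the sampled parameters as-is: no response-sampling noise.
        return params
    # Otherwise draw k_responses labels per item and use their empirical frequencies.
    counts = np.stack([rng.multinomial(k_responses, p) for p in params])
    return counts / k_responses


rng = np.random.default_rng(0)
params = sample_item_params(rng, n_items=1000, alpha=[0.6, 0.1, 0.3])
direct = distributions_for_metrics(params, 5, rng, use_params_directly=True)
empirical = distributions_for_metrics(params, 5, rng, use_params_directly=False)
# Both have shape (1000, 3) with rows summing to 1; only `empirical` is quantized
# to multiples of 1/5 and deviates from the underlying parameters.
print(np.abs(empirical - direct).mean())
```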

pk-at-g · 2026-03-02T20:37:37Z

+    n_items: int = 1000,
+    k_responses: int = 5,
+    m_categories: int = 3,
+    alpha: List[float] = [0.6, 0.1, 0.3],


Why these default values? Note: they're different from the _ALPHA flag default values.

No particular reason, except that these are the category proportions in the DICES dataset. I have changed the flag values to match these, but I'm open to your suggestions.
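If `alpha` is used as the concentration vector of a Dirichlet prior over each item's categorical parameters (an assumption; the snippet only shows the default value), the standard Dirichlet moments show what these defaults imply. Writing $\alpha_0 = \sum_c \alpha_c$:

```latex
\mathbb{E}[p_c] = \frac{\alpha_c}{\alpha_0}, \qquad
\operatorname{Var}[p_c] = \frac{\alpha_c(\alpha_0 - \alpha_c)}{\alpha_0^2(\alpha_0 + 1)}

\boldsymbol{\alpha} = (0.6,\ 0.1,\ 0.3) \;\Rightarrow\; \alpha_0 = 1,\quad
\mathbb{E}[\mathbf{p}] = (0.6,\ 0.1,\ 0.3),\quad
\operatorname{Var}[p_c] = \frac{\alpha_c(1 - \alpha_c)}{2} = (0.12,\ 0.045,\ 0.105)
```

Under that reading, the defaults reproduce the DICES category proportions on average, but because $\alpha_0 = 1$ the per-item distributions are highly dispersed. Multiplying every $\alpha_c$ by a constant $s > 1$ keeps the same mean while shrinking each variance to $\alpha_c(\alpha_0 - \alpha_c) / (\alpha_0^2(s\alpha_0 + 1))$.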


deepakpandita57 added 5 commits February 11, 2026 14:38

Add support for multiprocessing to run trials in parallel and compression to save storage space

069334c

Support for categorical responses and experiments reported in the forest vs tree paper

0046a36

Experiments on additional datasets for the EACL paper

c998f84

Replaced sklearn mean_squared_error with root_mean_squared_error as the former method is deprecated

af221db

Some tests for categorical_machine_contest_metrics and categorical_sample_lib

0a30d71

pk-at-g reviewed Feb 19, 2026

Comment thread cat_machine_contest_metrics.py

pk-at-g reviewed Feb 19, 2026

Comment thread cat_machine_contest_metrics.py (outdated)

pk-at-g reviewed Feb 19, 2026

Comment thread cat_machine_contest_metrics.py (outdated)

Changes suggested in the PR.

663b255

Added missing docstrings, cleaned up irrelevant docstrings, and split code into multiple lines to avoid exceeding 80 chars.