Skip to content

Add Osprey (DIA) search-engine integration: Koina library, entrapment/FDRBench validation, finetune, and installer#11

Open
maccoss wants to merge 34 commits into
Noble-Lab:mainfrom
maccoss:main
Open

Add Osprey (DIA) search-engine integration: Koina library, entrapment/FDRBench validation, finetune, and installer#11
maccoss wants to merge 34 commits into
Noble-Lab:mainfrom
maccoss:main

Conversation

@maccoss

@maccoss maccoss commented Jun 28, 2026

Copy link
Copy Markdown

Consolidates the Osprey (DIA search-engine) integration into Carafe, validated
end-to-end on real data. main is a clean fast-forward superset of
Noble-Lab:main (34 commits, no conflicts).

Supersedes #8, #9, and #10 — those scoped branches diverged as the work
continued (the finetune and decoy/FDRBench fixes below landed only on main),
so they're folded into this single reviewable PR.

Highlights

Osprey search-engine integration

  • New "Osprey" DIA search engine wired through the GUI (Workflows 4 & 5):
    search → finetune → build a new library → re-search the project files.
  • OspreyBinaryResolver resolves the executable, preferring the
    installer-bundled build over a stale saved path.
  • Tracks the upstream OspreySharpOsprey rename (namespaces, Osprey.exe).

Initial library generation

  • Koina (remote) and Carafe-local AlphaPepDeep predictors produce the starting
    DIA-NN library; KoinaLibraryGenerator enumerates peptidoforms/charges and
    writes a DIA-NN TSV.

Entrapment + FDRBench (FDP validation)

  • Entrapment peptides go only in the library FASTA, never the training
    FASTA (OspreyFastaPlanner); a 5-column FDRBench pairing manifest +
    --fdrbench input bundle drive entrapment-based FDP measurement
    (OspreyFdrBenchPlanner).
  • Decoys are generated the Osprey way (C-term-preserving reverse, cycle on
    collision), marked correctly, and made unique against both targets and
    entrapments.

Finetune correctness fixes

  • Don't hard-exit when a peak apex falls between scans (unit-resolution data).
  • Read the synthesized ms2index (true MS2 scan ordinal) instead of the Osprey
    blib's placeholder MS2.Scan=0 — without this the MS2 model trained on zero
    valid PSMs and crashed.
  • Multi-file (folder) training: fixed a crash in the Osprey finetune step
    when training on a folder of mzMLs, and emit the Precursor.Id column
    (Modified.Sequence + Precursor.Charge) that the Osprey report was missing —
    the multi-run "best" merge does hIndex.get("Precursor.Id") and NPE'd without
    it. Precursor.Id is now built identically to Carafe's own Skyline→DIA-NN
    converter so both report sources merge the same way.

CI / installer

  • GitHub Actions workflow + script build a Windows MSI bundling Osprey
    (Osprey.exe); folder-publish (not single-file) so the SQLite .blib writer
    works; stage the .blib locally then move (NAS/SMB can't host the SQLite
    write); actions pinned off deprecated Node 20; vendored patched
    compomics-utilities jar for clean-machine builds.
  • A second GitHub Actions workflow runs the unit-test suite on every push; the
    MSI version and build stamp are derived from the Maven project version (single
    source of truth) so a pom bump can't drift from a hard-coded value.

Docs: an end-to-end Osprey workflow document describing how the GUI calls
each tool.

Tests

New unit tests: OspreyFastaPlannerTest, OspreyFdrBenchPlannerTest,
EntrapmentFastaGearTest (Osprey-style decoys), PSMConfigTest (ms2index
re-point regression), KoinaLibraryGeneratorTest (end-to-end via a stubbed
client), OspreyBinaryResolverTest, AIGearMsFileResolutionTest (multi-file
resolution), and OspreyBlibReaderTest (incl. the Precursor.Id regression).
Full suite green (105 tests, 0 failures).

Validation

End-to-end on Stellar HeLa entrapment data: at q = 0.01, combined FDP 0.98% /
paired FDP 0.88%
— on/below the unity line, i.e. the reported FDR is accurate
(slightly conservative).

Dependency

The Osprey side (--fdrbench output + a library-cache staleness fix) has
merged into ProteoWizard/pwiz masterProteoWizard/pwiz#4337 (--fdrbench),
ProteoWizard/pwiz#4338 (stale .libcache), plus the base_id decoy-pairing fix
ProteoWizard/pwiz#4347. The bundled MSI now builds Osprey from upstream master;
there is no longer a fork dependency for the Osprey side.

Test coverage — driven by real integration findings

Many of the tests here exist because we hit concrete failures bringing the Osprey integration up
end-to-end. They're regression guards for bugs we actually encountered, not speculative coverage;
the suite grew 69 → 105 tests over this work:

  • ms2index / "Spectrum not found" — the finetune trained on zero valid MS2 PSMs until the
    Osprey-blib → training-input column handoff was fixed. Guarded by PSMConfigTest, and the
    precursor → MS2-scan matching was extracted out of AIGear into a unit-tested ApexMatcher.
  • Multi-file (folder) training NPE — training on a folder of mzMLs crashed in the multi-run
    "best" merge because the Osprey report omitted Precursor.Id. OspreyBlibReaderTest now pins
    that the column is present and equals Modified.Sequence + Precursor.Charge, and
    AIGearMsFileResolutionTest covers the multi-file path.
  • The dia/ spectrum layer (previously 0% covered — where the "spectrum not found" lookups live):
    DIASpectrumTest parses a tiny synthetic mzML through the real MSFTBX path and checks DIAMeta
    parsing + DIAIndex.get_spectrum_by_scan; DIAMapTest pins isolation-window membership; and
    IsolationWindowTest pins the bucket-key rounding the decoy pairing depends on.
  • Stale library cache (SEA-AD: manifest decoy pairing collapsed to ~0.4%) — the window
    bucketing it hinged on is now covered by the dia/ tests above.
  • Locale / wrong-mass / unrecognized-mod / null-input edge cases from the Koina and
    OspreyBlibReader work — each now has a focused test (including the unknown-mod skip).
  • Foundational gaps the bugs exposed: DBGearTest (precursor m/z), the reuse-signature
    fingerprint, and Koina response-size validation.

These joined the fixes from the automated-review pass (Copilot + CodeRabbit). Full suite: 105 tests,
0 failures.

🤖 Generated with Claude Code

Summary by CodeRabbit

  • New Features
    • Added full Osprey integration for peptide detection and library generation, including new GUI workflow options and CLI support for entrapment FASTA and Koina-based library building.
    • Added Osprey end-to-end workflow documentation, plus Windows/Linux scripts for building the Osprey component.
    • Introduced a tool to compare spectral library predictions.
  • Bug Fixes
    • Improved scan/spectrum matching behavior and made reuse/skip decisions based on input/command changes for fewer unnecessary reruns.
  • Documentation
    • Expanded README and added a dedicated Osprey end-to-end workflow guide.

maccoss and others added 22 commits June 27, 2026 00:15
The repository is already stored with LF endings, but it had no
.gitattributes and core.autocrlf is unset, so Windows checkouts silently
rewrote files to CRLF and surfaced the entire tree as modified. This adds an
explicit normalization policy (LF in the repo; CRLF only for Windows .bat/.cmd;
binaries untouched) to stop that churn going forward.

Co-Authored-By: Claude <noreply@anthropic.com>
Integrated OspreySharp as a search-engine alternative to DIA-NN in the Carafe
GUI, alongside Koina-based initial library generation and reuse validation.

Key additions:
- OspreySharp search workflow: generates a peptide-level entrapment FASTA
  (EntrapmentFastaGear) and target-decoy pairs, runs the search, and finetunes
  on the resulting BiblioSpec .blib. OspreyBlibReader converts the .blib to a
  DIA-NN TSV so the existing finetune path consumes it, and PSMConfig routes
  OspreySharp results through the DIA-NN column handling.
- Koina initial-library generation (KoinaClient/KoinaLibraryGenerator): a
  KServe v2 client and library generator covering AlphaPepDeep, Prosit, and
  ms2pip models, with Auto/manual NCE read from the mzML (MzmlUtils).
- Reuse-signature validation (ReuseSignature): the "Reuse existing results"
  option now skips a step only when its command and input-file fingerprints are
  unchanged, cascading re-runs to downstream steps. Tracked via CmdTask.input_files.
- Parallel msconvert with throttled lanes and shutdown-hook process cleanup
  (ProcessUtils) so closing the GUI does not orphan converter processes.
- TestNG coverage for all of the above; renamed testGUI to GuiCommandLineTest.

Co-Authored-By: Claude <noreply@anthropic.com>
Documented the OspreySharp workflows (4 and 5) and the OspreySharp settings
tab with screenshots in the 'Using OspreySharp as the search engine' section.

Co-Authored-By: Claude <noreply@anthropic.com>
…ed MSI

Automated the Windows installer build so the unsigned MSI ships with
OspreySharp bundled, matching the local jpackage flow.

- scripts/generate_installer_win.bat: jpackage wrapper that builds the unsigned,
  per-user MSI from target/carafe-2.2.0 (icon, main class, and the
  --enable-native-access java-option), bundling whatever is staged at
  osprey/win-x64/.
- .github/workflows/build-installer.yml: on manual dispatch or a v* tag, sets up
  JDK 25 + .NET 8 + WiX 3, builds the Carafe jar, sparse-checks-out the public
  OspreySharp source from ProteoWizard/pwiz (the OspreySharp subtree plus the
  PortableUtil project and the vendored DotNetZip and patched Parquet.Net DLLs it
  binds to), publishes a self-contained OspreySharp into osprey/win-x64, builds
  the MSI, and uploads it (attaching to the release on tags).

The OspreySharp source location and ref are workflow inputs so a fork can be
tested without editing the workflow.

Co-Authored-By: Claude <noreply@anthropic.com>
windows-latest already ships WiX 3.14, so the pinned
'choco install wixtoolset --version=3.14.1' failed as an attempted downgrade.
Locate the existing WiX light.exe and put its bin on PATH, installing only as
a fallback if it is missing.

Co-Authored-By: Claude <noreply@anthropic.com>
com.compomics:utilities:5.0.39P2 is a patched build that is not in any public
Maven repo, so a clean checkout (CI or a new contributor) could not resolve it.
Vendored the jar under lib/ (with a matching .gitignore exception) and added a
workflow step to install it into the runner's local Maven repo before the build.

This is a stop-gap; the clean fix is to publish the jar to a lab-controlled
Maven repository and resolve it normally. See lib/README.md.

Co-Authored-By: Claude <noreply@anthropic.com>
PowerShell (the default run: shell on windows runners) split the unquoted
-Dfile=lib/utilities-5.0.39P2.jar on its '.'/'/' characters, truncating it to
lib/utilities-5. cmd passes the token literally.

Co-Authored-By: Claude <noreply@anthropic.com>
actions/checkout v4->v5, setup-java v4->v5, setup-dotnet v4->v5,
upload-artifact v4->v5, softprops/action-gh-release v2->v3. The v4/v2
majors run on Node 20, which GitHub deprecated (2025-09-19) and now
forces onto Node 24 with a warning.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Step-by-step description of the full OspreySharp library workflow: MSConvert,
FASTA digestion to a peptide FASTA + pairing manifest, initial in-silico
library, OspreySharp search, model fine-tuning, and final library generation
(plus the project search in the end-to-end variant). Documents the
decoys-in-library / pairing-manifest mechanism, the entrapment option, and how
the Carafe GUI invokes each external tool (MSConvert, OspreySharp, and the
spawned Carafe/Python child processes).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…writer

OspreySharp's BlibWriter uses System.Data.SQLite, whose native SQLite.Interop.dll
is preloaded by resolving the managed assembly's own directory (Assembly.Location).
Under -p:PublishSingleFile=true, Assembly.Location is "", so the SQLiteConnection
ctor hits Path.Combine(null, "x64") and throws "ArgumentNullException (Parameter
'path1')" the moment it opens the output .blib — the search runs to completion and
then dies at the final write.

Publish self-contained as a normal folder (PublishSingleFile=false) so
SQLite.Interop.dll ships as a real sibling file and the directory resolution works.
resolveOspreyBinary() still finds osprey/<rid>/OspreySharp.exe and jpackage's
--input still sweeps the whole folder. Applied to the CI workflow and both
build_ospreysharp scripts.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
actions/upload-artifact@v5 is still a Node 20 action; only v6+ moved to Node 24.
The earlier blanket bump to v5 left this one emitting the Node 20 deprecation
warning. Verified: upload-artifact v5 action.yml uses 'node20', v6 uses 'node24'.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
OspreySharp writes its results as a SQLite .blib. SQLite cannot reliably
create/lock a database over a network share (UNC or an SMB-backed mapped drive):
BlibWriter does File.Delete + recreate + PRAGMA journal_mode=WAL, which fails on
SMB with "unable to open database file" (CANTOPEN) at SQLiteConnection.Open().
Output to a NAS path (e.g. W:\...) therefore fails at the final blib write even
though the search itself completed.

Have OspreySharp write the blib to a local temp directory and move it to the
requested (possibly network) output directory after the step exits cleanly. This
sidesteps every SMB+SQLite failure mode (delete/recreate race, WAL, mapped-drive
session quirks, UNC path handling) at once and is faster, since the many small
WAL writes hit local disk. The blib is an intermediate (converted to a DIA-NN TSV
for fine-tuning), so local staging has no effect on the final deliverables.

Adds an optional CmdTask.postAction hook, run by runLane only on a clean exit and
before the reuse signature is written; reuse stays keyed on the final blib path.
Applies to both the training and project OspreySharp searches.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The postAction lambda's `for (File f : staged)` collided with the existing
`File f = new File(ospreyPath)` earlier in buildOspreyCommand; Java forbids a
lambda local from shadowing an enclosing local. Rename to stagedFile. Verified
with mvn compile.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…aining

Entrapment (p_target) peptides must not enter the training-DB FASTA: the training
search drives AI fine-tuning, and identifying random entrapment sequences would
pollute that training. They belong only in the library-DB FASTA that feeds the
finetuned library — the project-search library in workflow 5, and the deliverable
in workflow 4 — where FDRBench measures FDP.

buildEntrapmentFastaCommand now takes an explicit withEntrapment flag instead of
reading the checkbox. The training FASTA is always built with withEntrapment=false;
the library FASTA gets entrapment when the box is checked. Because that makes the
two FASTAs differ, they can no longer be shared when the train and library DBs are
the same file and entrapment is on, so a separate library FASTA is built (logged).
No behavior change when entrapment is off.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…taPlanner

The rule "entrapment goes only in the library FASTA, never the training FASTA,
and the two are shared only when the DBs are identical and no entrapment is
requested" lived inline in runCarafe() and was untestable behind Swing. Extract
it into a pure OspreyFastaPlanner.plan(sameDb, entrapmentRequested) and have
runCarafe use it, then unit-test every case — including the core invariant that
the training FASTA never carries entrapment (which would pollute AI fine-tuning).

Pure refactor: no behavior change. mvn test: 18 passed.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
When entrapment is enabled, pass OspreySharp's new --fdrbench option on the
project search so it writes osprey_project/FDRBench/FDRBench-Input.tsv (level
follows the Osprey tab's --fdr-level), and copy the pairing manifest into that
folder so FDRBench has its input TSV + pairing manifest bundled together. The
training search never gets it (it drives fine-tuning). The decision is extracted
into a tested OspreyFdrBenchPlanner; buildOspreyCommand gains an fdrbenchOut arg
and copies the manifest in its post-step.

mvn test: 22 passed.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
ProteoWizard renamed pwiz_tools/OspreySharp to Osprey (AssemblyName Osprey,
namespaces pwiz.Osprey.*, executable Osprey.exe). Update Carafe to match:

- GUI: the OspreySharp tab, Workflow 4/5 labels, the search-engine combo item,
  tooltips, dialog/console text, and info cards now say "Osprey".
- Binary resolution: resolveOspreyBinary looks for Osprey.exe / Osprey.
- Internal search-engine identifier: -se "Osprey", AIGear/PSMConfig checks, and
  the .blib reader text/messages all use "Osprey" (kept consistent so the
  finetune-on-blib path still triggers).
- CI workflow: sparse-checkout pwiz_tools/Osprey, publish Osprey/Osprey.csproj.
- build_ospreysharp scripts: default OSPREY_CSPROJ points at Osprey/Osprey.csproj.

mvn test: 66 passed.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…not shuffle

Two fixes for why the OspreySharp/FDRBench entrapment FDP was always 0%:

1. Decoy flag. EntrapmentFastaGear writes decoys with the "decoy_" prefix, but the
   library build flagged decoys via CParameter.decoy_prefix, which defaults to
   "rev_" - so the fine-tuned library's Decoy column came out 0 for every decoy
   and p_decoy. Add a -decoy_prefix option (sets CParameter.decoy_prefix) and have
   the GUI pass "-decoy_prefix decoy_" on the Osprey library builds.

2. Decoy sequences. Generate the decoy (and p_decoy) the way Osprey's DecoyGenerator
   does - reverse with the C-terminus preserved, cycling the internal residues by
   1..min(len,10) to dodge collisions with real targets, dropping the pair if no
   unique decoy is possible. Entrapment (p_target) stays a deterministic shuffle.

mvn test: 66 passed (adds reverseDecoy_matchesOspreyReverseAndCycle).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Match how Osprey would build decoys over a library containing both targets and
entrapments: add the deterministic-shuffle entrapments first, then generate the
reverse/cycle decoys (of targets and entrapments) made unique against the combined
target+entrapment set - entrapments are targets in Osprey's world, so decoys must
dodge them too. Like Osprey (and the Rust DecoyGenerator), decoys are NOT checked
against other decoys; the reverse/cycle collision set is target-side only.

mvn test: 67 passed.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
xic_query first tried to match the peptide's apex RT to a scan RT within 0.01 min
and called System.exit(1) when none matched - killing the entire finetune. But the
apex RT is the fitted peak apex, which by definition lies BETWEEN scans, so on
slower-cycle instruments (e.g. Stellar sDIA, where the per-window cycle exceeds
0.6 s) no scan falls within the window and the run died on the first peptide
(empty fragment_intensity_df.tsv, exit 1). The very next block already re-detects
the apex/boundaries from the CLOSEST scan; remove the guard so it falls through to
that fallback (the other two XIC paths already do this - their exit is commented out).

mvn test: 67 passed.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…an=0

The Osprey .blib has no usable DIA-NN MS2 scan index, so convertBlibToDiannTsv
writes MS2.Scan=0 for every precursor and add_ms2spectrum_index synthesizes the
true per-precursor MS2 ordinal into an "ms2index" column (and the matched apex
RT into "apex_rt"). The finetune path then called use_diann_report_column_names()
which reset the reader back to the all-zero MS2.Scan placeholder and never
re-pointed at ms2index. Result: global_index2scan_num.get(0) for every precursor
-> all map to MS2 scan 0 -> "Spectrum not found" for every PSM but the one whose
isolation window happens to contain scan 0 -> 0 valid MS2 PSMs -> the peptdeep
sampler requested a negative number of rows and the MS2 model training crashed.

Re-point the reader at the synthesized columns via a new
PSMConfig.use_added_ms2index_columns(), mirroring what the Skyline path already
does (use_skyline_report_column_names defaults ms2_index_column_name to ms2index).

Also add a Koina library-generation end-to-end test: refactor
KoinaLibraryGenerator.run() to accept an injected KoinaClient so the full
FASTA -> enumerate -> infer -> write-DIA-NN-TSV pipeline can be exercised offline
with stubbed predictions.

Tests: PSMConfigTest regression for the column re-point; Koina end-to-end test.
Full suite: 69 tests, 0 failures.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01RDsBX86cMVqnjnrNDdNRMo
resolveOspreyBinary() checked the saved "Osprey Executable" preference first,
so a stale dev path (e.g. an old OspreySharp build) silently shadowed the
Osprey that shipped with the MSI. Reorder so a build bundled next to the jar
(every MSI install has one) wins; the saved path is now a fallback for source
/ command-line runs that have no bundled build. Resolution order is factored
into OspreyBinaryResolver.choose() with a precedence unit test, and the GUI
tooltip is updated to match.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01RDsBX86cMVqnjnrNDdNRMo
Copilot AI review requested due to automatic review settings June 28, 2026 20:59
@coderabbitai

coderabbitai Bot commented Jun 28, 2026

Copy link
Copy Markdown

Review Change Stack

📝 Walkthrough

Walkthrough

This PR adds Osprey support across CLI, GUI, prediction, conversion, and packaging. It introduces peptide quartet FASTA generation, Koina-based library prediction, Osprey .blib conversion, reuse-signature caching, workflow 4/5 orchestration, and build scripts/CI for Windows installer output.

Changes

Osprey Search Engine Integration

Layer / File(s) Summary
Utility infrastructure
src/main/java/util/ProcessUtils.java, src/main/java/util/ReuseSignature.java, src/main/java/util/MzmlUtils.java, src/main/java/dia/ApexMatcher.java, src/test/java/util/*, src/test/java/dia/*, src/test/java/db/DBGearTest.java
Adds process termination, reuse-signature hashing and sidecar persistence, mzML collision-energy parsing, and isolation-window apex matching, with tests and shutdown-hook support.
Entrapment FASTA and manifest
src/main/java/db/EntrapmentFastaGear.java, src/test/java/db/EntrapmentFastaGearTest.java
Builds peptide quartets from FASTA input with deterministic target/entrapment/decoy generation, m/z filtering, and FASTA/manifest output.
Koina client and library generator
src/main/java/koina/KoinaClient.java, src/main/java/koina/KoinaLibraryGenerator.java, src/test/java/koina/*
Adds Koina inference for MS2 and iRT plus DIA-NN library generation from peptidoform enumeration, with offline and live tests.
Osprey blib conversion and column modes
src/main/java/ai/OspreyBlibReader.java, src/main/java/ai/PSMConfig.java, src/test/java/ai/OspreyBlibReaderTest.java, src/test/java/ai/PSMConfigTest.java, src/test/java/ai/SkylineIOTest.java
Converts Osprey .blib SQLite output to DIA-NN TSV and adds Osprey-specific column-name switching for finetuning.
CLI helpers and planner utilities
src/main/java/gui/CmdTask.java, src/main/java/gui/OspreyBinaryResolver.java, src/main/java/gui/OspreyFastaPlanner.java, src/main/java/gui/OspreyFdrBenchPlanner.java, src/test/java/gui/*
Extends command tasks with input tracking and post-actions, and adds Osprey executable, FASTA-sharing, and FDRBench path planners with tests.
AIGear CLI and Osprey DIA path
src/main/java/ai/AIGear.java, scripts/compare_library_predictions.py
Adds CLI branches for entrapment FASTA and Koina library generation, plus Osprey DIA conversion, ms2 indexing, and fragment-library comparison tooling.
CarafeGUI Osprey workflows and reuse logic
src/main/java/gui/CarafeGUI.java
Adds workflow 4/5 UI, Osprey execution wiring, phased parallel runs, reuse-signature skipping, shutdown handling, and settings persistence.
Build, CI, and docs
.gitattributes, .gitignore, lib/README.md, scripts/build_osprey.bat, scripts/build_osprey.sh, scripts/generate_installer_win.bat, .github/workflows/build-installer.yml, .github/workflows/test.yml, README.md, docs/osprey-end-to-end-workflow.md
Adds line-ending and ignore rules, Osprey build/install scripts, CI workflows, and documentation for building and running the new workflow.

Estimated code review effort: 5 (Critical) | ~120 minutes

Sequence Diagram(s)

sequenceDiagram
  participant User
  participant CarafeGUI
  participant EntrapmentFastaGear
  participant KoinaLibraryGenerator
  participant Osprey
  participant OspreyBlibReader
  participant AIGear

  User->>CarafeGUI: Start workflow 4/5
  CarafeGUI->>EntrapmentFastaGear: build peptide FASTA + manifest
  EntrapmentFastaGear-->>CarafeGUI: peptides.fasta, manifest.tsv
  CarafeGUI->>KoinaLibraryGenerator: generate initial DIA-NN library
  KoinaLibraryGenerator-->>CarafeGUI: initial_library.tsv
  CarafeGUI->>Osprey: search training mzML
  Osprey-->>CarafeGUI: training.blib
  CarafeGUI->>OspreyBlibReader: convert .blib to DIA-NN TSV
  OspreyBlibReader-->>CarafeGUI: osprey_report_diann.tsv
  CarafeGUI->>AIGear: finetune and export final library
  AIGear-->>CarafeGUI: final outputs
Loading
🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 36.82% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately summarizes the main Osprey integration work, including Koina library generation, entrapment/FDRBench validation, finetuning, and installer support.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 ast-grep (0.44.0)
.github/workflows/build-installer.yml

[{"text":"Files.createTempFile("synthetic", ".mzML")","range":{"byteOffset":{"start":8248,"end":8290},"start":{"line":131,"column":17},"end":{"line":131,"column":59}},"file":"src/test/java/dia/DIASpectrumTest.java","lines":" Path p = Files.createTempFile("synthetic", ".mzML");","charCount":{"leading":17,"trailing":1},"language":"Java","metaVariables":{"single":{"X":{"text":"Files","range":{"byteOffset":{"start":8248,"end":8253},"start":{"line":131,"column":17},"end":{"line":131,"column":22}}}},"multi":{"secondary":[{"text":"private static Path writeMzml() throws IOException {\n Path p = Files.createTempFile("synthetic", ".mzML");\n Files.write(p, buildMzml().getBytes(StandardCharsets.UTF_8));\n return p;\n }","range":{"byteOffset":{"start":8178,"end":8385},"start":{"line":130,"column":4},"end":{"line":134,"column":5}}}]},"transformed":{}},"ruleId":"tempfile-delete","severity":"warning","note":"[CWE-377] Insecure Temporary File. Security best p

... [truncated 26117 characters] ...

rMZ, precursorCharge, peptideModSeq, "\n + "retentionTime, fileID) VALUES (1, 'SAMPLEPEPTIDEK', 500.0, 2, 'SAMPLEPEPTIDEK', 5.0, 1);");\n }\n Path outDir = Files.createTempDirectory("osprey_out");\n String tsv = OspreyBlibReader.convertBlibToDiannTsv(blib.toString(), outDir.toString());\n Map<String, Map<String, String>> byPep = byPeptide(readTsv(tsv));\n Assert.assertEquals(byPep.size(), 1);\n Assert.assertEquals(byPep.get("SAMPLEPEPTIDEK").get("Modified.Sequence"), "SAMPLEPEPTIDEK");\n Assert.assertEquals(byPep.get("SAMPLEPEPTIDEK").get("File.Name"), "run.mzML");\n }","range":{"byteOffset":{"start":11803,"end":13477},"start":{"line":212,"column":4},"end":{"line":233,"column":5}},"style":"secondary"}]}

.github/workflows/test.yml

[{"text":"Files.createTempFile("synthetic", ".mzML")","range":{"byteOffset":{"start":8248,"end":8290},"start":{"line":131,"column":17},"end":{"line":131,"column":59}},"file":"src/test/java/dia/DIASpectrumTest.java","lines":" Path p = Files.createTempFile("synthetic", ".mzML");","charCount":{"leading":17,"trailing":1},"language":"Java","metaVariables":{"single":{"X":{"text":"Files","range":{"byteOffset":{"start":8248,"end":8253},"start":{"line":131,"column":17},"end":{"line":131,"column":22}}}},"multi":{"secondary":[{"text":"private static Path writeMzml() throws IOException {\n Path p = Files.createTempFile("synthetic", ".mzML");\n Files.write(p, buildMzml().getBytes(StandardCharsets.UTF_8));\n return p;\n }","range":{"byteOffset":{"start":8178,"end":8385},"start":{"line":130,"column":4},"end":{"line":134,"column":5}}}]},"transformed":{}},"ruleId":"tempfile-delete","severity":"warning","note":"[CWE-377] Insecure Temporary File. Security best p

... [truncated 26117 characters] ...

rMZ, precursorCharge, peptideModSeq, "\n + "retentionTime, fileID) VALUES (1, 'SAMPLEPEPTIDEK', 500.0, 2, 'SAMPLEPEPTIDEK', 5.0, 1);");\n }\n Path outDir = Files.createTempDirectory("osprey_out");\n String tsv = OspreyBlibReader.convertBlibToDiannTsv(blib.toString(), outDir.toString());\n Map<String, Map<String, String>> byPep = byPeptide(readTsv(tsv));\n Assert.assertEquals(byPep.size(), 1);\n Assert.assertEquals(byPep.get("SAMPLEPEPTIDEK").get("Modified.Sequence"), "SAMPLEPEPTIDEK");\n Assert.assertEquals(byPep.get("SAMPLEPEPTIDEK").get("File.Name"), "run.mzML");\n }","range":{"byteOffset":{"start":11803,"end":13477},"start":{"line":212,"column":4},"end":{"line":233,"column":5}},"style":"secondary"}]}

src/main/java/ai/AIGear.java

[{"text":"Files.createTempFile("synthetic", ".mzML")","range":{"byteOffset":{"start":8248,"end":8290},"start":{"line":131,"column":17},"end":{"line":131,"column":59}},"file":"src/test/java/dia/DIASpectrumTest.java","lines":" Path p = Files.createTempFile("synthetic", ".mzML");","charCount":{"leading":17,"trailing":1},"language":"Java","metaVariables":{"single":{"X":{"text":"Files","range":{"byteOffset":{"start":8248,"end":8253},"start":{"line":131,"column":17},"end":{"line":131,"column":22}}}},"multi":{"secondary":[{"text":"private static Path writeMzml() throws IOException {\n Path p = Files.createTempFile("synthetic", ".mzML");\n Files.write(p, buildMzml().getBytes(StandardCharsets.UTF_8));\n return p;\n }","range":{"byteOffset":{"start":8178,"end":8385},"start":{"line":130,"column":4},"end":{"line":134,"column":5}}}]},"transformed":{}},"ruleId":"tempfile-delete","severity":"warning","note":"[CWE-377] Insecure Temporary File. Security best p

... [truncated 26117 characters] ...

rMZ, precursorCharge, peptideModSeq, "\n + "retentionTime, fileID) VALUES (1, 'SAMPLEPEPTIDEK', 500.0, 2, 'SAMPLEPEPTIDEK', 5.0, 1);");\n }\n Path outDir = Files.createTempDirectory("osprey_out");\n String tsv = OspreyBlibReader.convertBlibToDiannTsv(blib.toString(), outDir.toString());\n Map<String, Map<String, String>> byPep = byPeptide(readTsv(tsv));\n Assert.assertEquals(byPep.size(), 1);\n Assert.assertEquals(byPep.get("SAMPLEPEPTIDEK").get("Modified.Sequence"), "SAMPLEPEPTIDEK");\n Assert.assertEquals(byPep.get("SAMPLEPEPTIDEK").get("File.Name"), "run.mzML");\n }","range":{"byteOffset":{"start":11803,"end":13477},"start":{"line":212,"column":4},"end":{"line":233,"column":5}},"style":"secondary"}]}

  • 7 others

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands.

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR consolidates the Carafe integration for the Osprey DIA search engine, including initial library generation (Koina), entrapment/FDRBench validation plumbing, finetune-path correctness fixes, and Windows installer/CI automation, with extensive new TestNG coverage and documentation to support the end-to-end workflow.

Changes:

  • Added Osprey integration helpers (binary resolution, FASTA planning, FDRBench planning) plus Osprey .blib → DIA-NN-style TSV conversion and Osprey-specific PSM column configuration.
  • Implemented entrapment peptide FASTA + pairing manifest generation and Koina-based initial DIA-NN library generation, wired through the CLI and validated via new unit tests.
  • Added Windows MSI build automation (GitHub Actions + scripts), vendored build dependency documentation, and new/updated end-to-end docs.

Reviewed changes

Copilot reviewed 39 out of 43 changed files in this pull request and generated 8 comments.

Show a summary per file
File Description
src/test/java/util/ShutdownHookHelper.java Helper program for validating shutdown-hook process termination behavior.
src/test/java/util/ReuseSignatureTest.java Tests for deterministic reuse-signature computation and sidecar round-trip.
src/test/java/util/ProcessUtilsTest.java Tests for killing processes + descendants, including shutdown-hook scenario.
src/test/java/util/MzmlUtilsTest.java Tests for extracting collision energy from synthetic mzML.
src/test/java/koina/KoinaLiveTest.java Networked (skippable) tests against live Koina service.
src/test/java/koina/KoinaLibraryGeneratorTest.java Offline tests for peptidoform enumeration and stubbed end-to-end TSV writing.
src/test/java/koina/KoinaClientTest.java Offline tests for Koina request/response shaping and parsing.
src/test/java/gui/testGUI.java Removed non-running/legacy GUI tokenization test.
src/test/java/gui/OspreyFdrBenchPlannerTest.java Tests for when/where to emit Osprey FDRBench inputs.
src/test/java/gui/OspreyFastaPlannerTest.java Tests enforcing “entrapment never in training FASTA” planning rules.
src/test/java/gui/OspreyBinaryResolverTest.java Tests for Osprey executable resolution precedence.
src/test/java/gui/GuiCommandLineTest.java Ensures GUI “additional options” tokenization test actually runs under Surefire/TestNG.
src/test/java/db/EntrapmentFastaGearTest.java Extensive tests for peptide FASTA/manifest generation, decoys, determinism, and m/z filtering.
src/test/java/ai/SkylineIOTest.java Migrated from JUnit to TestNG + fixed UniMod casing expectations so tests run.
src/test/java/ai/PSMConfigTest.java Regression tests for Osprey .blib column config and synthesized ms2index usage.
src/test/java/ai/OspreyBlibReaderTest.java Tests for minimal .blib → DIA-NN-style TSV conversion and UniMod reconstruction.
src/main/java/util/ReuseSignature.java Implements per-step reuse signature and sidecar storage for GUI reuse logic.
src/main/java/util/ProcessUtils.java Extracted process termination logic (including descendant killing) for reuse/testing.
src/main/java/util/MzmlUtils.java Streaming StAX mzML reader for collision energy (NCE) extraction.
src/main/java/koina/KoinaLibraryGenerator.java Koina-driven initial DIA-NN library generation from peptide FASTA with mod/charge enumeration.
src/main/java/koina/KoinaClient.java Koina KServe v2 client with unit-testable JSON build/parse helpers.
src/main/java/gui/OspreyFdrBenchPlanner.java Pure planner for deciding FDRBench output path emission.
src/main/java/gui/OspreyFastaPlanner.java Pure planner enforcing training-vs-library FASTA separation rules with entrapment.
src/main/java/gui/OspreyBinaryResolver.java Pure precedence resolver for Osprey executable selection.
src/main/java/gui/CmdTask.java Adds input file tracking and a post-action hook to support reuse signatures and staging/moves.
src/main/java/db/EntrapmentFastaGear.java Generates peptide FASTA + FDRBench pairing manifest with Osprey-style decoys/entrapment.
src/main/java/ai/PSMConfig.java Adds Osprey .blib column configuration and “use synthesized ms2index/apex_rt” switch.
src/main/java/ai/OspreyBlibReader.java Reads Osprey .blib SQLite and writes DIA-NN-style TSV with UniMod reconstruction.
src/main/java/ai/AIGear.java Adds entrapment FASTA + Koina library CLI modes; adds Osprey .blib finetune/load path fixes.
scripts/generate_installer_win.bat Builds unsigned Windows MSI via jpackage and bundles Osprey output.
scripts/compare_library_predictions.py Utility to compare two DIA-NN TSV libraries fragment-by-fragment via cosine similarity.
scripts/build_ospreysharp.sh Publishes self-contained Osprey builds per RID for bundling.
scripts/build_ospreysharp.bat Windows publish script for self-contained Osprey executable.
README.md Adds user-facing Osprey workflow guidance and installation notes.
lib/README.md Documents vendored patched compomics-utilities jar and intended long-term resolution.
docs/ospreysharp-end-to-end-workflow.md Detailed end-to-end Osprey workflow documentation (GUI orchestration + CLI mapping).
.gitignore Adds exception for vendored jar and ignores TestNG output folder.
.github/workflows/build-installer.yml GitHub Actions workflow to build MSI, publish Osprey, and upload/attach artifacts.
.gitattributes Normalizes LF line endings while preserving Windows scripts and binary assets.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +105 to +115
File f = new File(path);
if (f.isDirectory()) {
File[] kids = f.listFiles(File::isFile);
if (kids != null) {
for (File k : kids) {
lines.add(fingerprintFile(k));
}
}
} else {
lines.add(fingerprintFile(f));
}
Comment on lines +122 to +127
private static String fingerprintFile(File f) {
if (f.exists()) {
return "in:" + f.getName() + "|" + f.length() + "|" + f.lastModified();
}
return "in:" + f.getName() + "|MISSING";
}
Comment on lines +26 to +28
public static void terminateAll(Collection<Process> processes) {
List<Process> list = new ArrayList<>(processes);
// First pass: kill descendants (the actual converter spawned by the shell), then the
Comment on lines +265 to +269
w.write(p.form.diann + "\t" + p.form.stripped + "\t"
+ String.format("%.5f", p.mz) + "\t" + p.charge + "\t"
+ String.format("%.4f", irt) + "\t" + p.proteinId + "\t" + p.decoy + "\t"
+ String.format("%.5f", f[0]) + "\t" + String.format("%.6f", relInt) + "\t"
+ meta[0] + "\t" + (int) f[2] + "\t" + (int) f[3] + "\t" + meta[1] + "\n");
Comment on lines +148 to +158
static List<Ms2> parseMs2Response(String json, int n) {
JSONObject root = JSON.parseObject(json);
float[] intensities = floatOutput(root, "intensities");
float[] mz = floatOutput(root, "mz");
String[] annotation = stringOutput(root, "annotation");
if (intensities == null || mz == null || annotation == null) {
throw new IllegalStateException("Koina MS2 response missing intensities/mz/annotation: "
+ truncate(json));
}
int f = intensities.length / n;
List<Ms2> out = new ArrayList<>(n);
Comment on lines +904 to +910
EntrapmentFastaGear.Config efc = new EntrapmentFastaGear.Config();
efc.inputFasta = input_fasta;
efc.outputFasta = cmd.getOptionValue("build_entrapment_fasta");
efc.manifest = cmd.getOptionValue("manifest"); // null if not provided
efc.addEntrapment = cmd.hasOption("entrapment");
efc.addDecoys = !cmd.hasOption("no_decoys");
efc.applyMzFilter = cmd.hasOption("mz_filter");
Comment thread README.md Outdated
scripts\build_ospreysharp.bat win-x64
```

This stages `OspreySharp(.exe)` under `target/osprey/<rid>/`. Place the appropriate `<rid>/` directory beside the Carafe2 jar as `osprey/<rid>/` (or under `~/.carafe/osprey/<rid>/`); Carafe2 locates it automatically at runtime. You can also point Carafe2 at a custom OspreySharp build via its path preference. Override the OspreySharp source location with the `OSPREY_CSPROJ` environment variable if your `pwiz` checkout is elsewhere.
Comment thread docs/ospreysharp-end-to-end-workflow.md Outdated
Comment on lines +1 to +6
# The OspreySharp end-to-end workflow

This document describes, step by step, what Carafe2 does when it runs a spectral-library
workflow with **OspreySharp** as the search engine — from the input protein FASTA all the way
to the final fine-tuned spectral library (and, in the end-to-end variant, the search of your
project files with that library).

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 19

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
README.md (1)

202-293: 📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick win

Update CLI help text to include new OspreySharp-related options.

The command-line help text is stale — it omits the new flags added by this PR:

  • -build_entrapment_fasta (Stage 1 peptide FASTA generation)
  • -build_koina_library (Koina-based initial library generation)
  • -decoy_prefix (custom decoy prefix)
  • -koina_ms2_model, -koina_rt_model, -koina_url (Koina configuration)
  • -se OspreySharp (search engine selection, though -se exists with limited values)

These flags are documented in docs/ospreysharp-end-to-end-workflow.md but missing from the README's CLI reference, which users will consult for scripting.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@README.md` around lines 202 - 293, The README CLI reference is missing the
new OspreySharp-related options, so update the help-text block to include the
added flags and their short descriptions. Add entries for
-build_entrapment_fasta, -build_koina_library, -decoy_prefix, -koina_ms2_model,
-koina_rt_model, and -koina_url, and expand the -se option in the same help list
to show OspreySharp as a supported search engine value. Keep the formatting
consistent with the existing option table so the README stays aligned with the
CLI output.
🧹 Nitpick comments (2)
docs/ospreysharp-end-to-end-workflow.md (1)

51-63: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Add language specifiers to fenced code blocks.

Multiple fenced code blocks lack language identifiers, triggering markdownlint MD040 warnings and degrading syntax highlighting in rendered documentation:

Lines Content Suggested specifier
51-63 ASCII phase diagram text
79-81 MSConvert command shell
113-117 CLI command bash
150-163 OspreySharp command bash
280-282 MSConvert builder shell
363-365 Manifest TSV header tsv
380-386 Manifest example data tsv
429-447 Directory tree text

Example fix for line 51:

-```
+```text
             ┌─────────────────────────────────────────────────────────────────────┐

Also applies to: 79-81, 113-117, 150-163, 280-282, 363-365, 380-386, 429-447

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/ospreysharp-end-to-end-workflow.md` around lines 51 - 63, Add language
identifiers to every fenced code block in the workflow doc to satisfy
markdownlint MD040 and improve rendering. Update the existing fences around the
phase diagram, MSConvert and OspreySharp command examples, TSV manifest
snippets, and directory tree so each uses an appropriate specifier such as text,
shell, bash, or tsv. Use the surrounding blocks in the document to locate each
fence and keep the content unchanged aside from the opening fence tag.
.gitattributes (1)

8-20: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Add .msi and .pdb to binary declarations.

This PR introduces Windows MSI installer generation and .NET-based OspreySharp builds. Both produce file types that should not have line endings normalized:

  • .msi — Windows installer output from generate_installer_win.bat
  • .pdb — debug symbols from dotnet publish in build_ospreysharp.bat / .sh

Without binary declarations, Git may corrupt these files on checkout.

 *.exe    binary
 *.dll    binary
+*.msi    binary
+*.pdb    binary
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.gitattributes around lines 8 - 20, Add the missing binary declarations in
the .gitattributes asset list by updating the existing binary patterns to
include .msi and .pdb alongside the current entries. Keep the change in the same
binary-assets section so Git treats the Windows installer output and .NET debug
symbols as non-text files and does not normalize line endings for them.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In @.github/workflows/build-installer.yml:
- Around line 122-133: Remove the hard-coded 2.2.0 release version from the
build workflow and make the version come from a single source used by both the
Publish Osprey step and scripts/generate_installer_win.bat. Update the
build-installer workflow to derive the Maven/app version once, then use that
value in the dotnet publish output path and pass it through to the MSI
generation step so the Osprey publish directory and APP_VERSION stay in sync
across future version bumps.
- Around line 30-31: The installer release workflow is currently using read-only
repository permissions and a duplicated hard-coded version, which blocks asset
upload and risks version drift. Update the build-installer workflow to use
contents: write for the release step used by softprops/action-gh-release, and
centralize the installer version so .github/workflows/build-installer.yml and
scripts/generate_installer_win.bat both reference the same single source of
truth instead of embedding 2.2.0 in multiple places.

In `@scripts/build_ospreysharp.sh`:
- Around line 35-49: The default RID logic in default_rid() already returns
arm64 on Apple Silicon, but the publish command still hard-codes Platform as
x64. Update the build flow in build_ospreysharp.sh so the Platform value is
derived from the selected RID (especially for osx-arm64) instead of always using
x64, and make sure the dotnet publish invocation uses the same architecture as
the RID.

In `@scripts/compare_library_predictions.py`:
- Around line 40-52: The fragment loader in load_fragments currently filters
only by StrippedPeptide and PrecursorCharge, which can merge rows from multiple
precursor candidates. Update the matching logic to use a more specific precursor
identifier such as ModifiedPeptide and/or precursor m/z, or detect multiple
matches and fail explicitly instead of silently overwriting fragments. Keep the
fix localized to load_fragments and any caller that provides the precursor
selection criteria so the cosine is computed from one unambiguous spectrum.

In `@scripts/generate_installer_win.bat`:
- Around line 35-39: The installer generation step currently only warns when
Osprey is missing, allowing the MSI to succeed without the bundled binary.
Update the missing-file check in generate_installer_win.bat so the build fails
hard when osprey\win-x64\Osprey.exe is absent, and keep the existing warning
text as part of the fatal error path. Use the existing INPUT_DIR check and the
Osprey.exe existence branch to enforce the bundling contract before continuing.

In `@src/main/java/ai/AIGear.java`:
- Around line 986-987: The Koina mod flags in AIGear are using substring
matching on CParameter.fixMods and CParameter.varMods, which can incorrectly
match multi-digit indices. Update the logic around the fixedCarbamidomethylC and
variableOxidationM assignments to parse the comma-separated mod lists and check
exact membership for the intended index values instead of calling contains().
This keeps Carbamidomethyl C and Oxidation M toggles aligned with the actual
modification set.

In `@src/main/java/ai/OspreyBlibReader.java`:
- Around line 166-190: In OspreyBlibReader, the modification parsing loop
currently falls back to a best-effort sequence when tokenFor(...) cannot
represent a PTM, which can silently alter chemistry while preserving the
original precursor metadata. Update this path so unsupported modifications are
not emitted as Modified.Sequence entries: either skip the row entirely or fail
the conversion once unrecognized becomes true, and keep the handling centered
around the existing tokenFor(...), posToken, and nTermPrefix logic.

In `@src/main/java/gui/CarafeGUI.java`:
- Around line 5792-5800: The lane-failure handling in CarafeGUI only sets the
failed flag and waits for the rest of the futures, so already-running
MSConvert/Koina/Osprey subprocesses keep running. Update the failure paths
around the futures loop and the related lane handling block to use the
tracked-process set to immediately terminate any outstanding subprocesses as
soon as the first lane throws, instead of only breaking after all futures are
collected.
- Around line 419-421: The new Osprey settings tab is not being frozen during
runs because setInputsFrozen() still only disables the first four tabs. Update
CarafeGUI so the tab-disable logic explicitly includes the Osprey tab (and keeps
Console excluded), using the tab indices or names around addTab("Osprey") and
the setInputsFrozen() method to ensure all workflow tabs are locked
consistently.
- Around line 5149-5151: In CarafeGUI’s Osprey training input handling, add the
same multi-folder validation used by the older DIA-NN flows before computing
trainMsInput from trainMs.getFirst(). Reject the request when trainMs contains
files from different parent directories instead of collapsing to new
File(trainMs.getFirst()).getParent(), so the full training set is preserved and
no inputs are silently dropped.
- Around line 6991-7005: The Koina argument builder in CarafeGUI is missing
propagation of the max fragment m/z setting, so the Koina path ignores the GUI’s
configured upper bound. Update the argument list construction near the existing
-lf_top_n_frag and -lf_frag_mz_min entries to also forward the max fragment m/z
value, using the relevant spinner/field in this builder so the Koina and local
Carafe paths stay consistent. Keep the change in the same argument assembly
block that builds the AIGear.main invocation inputs.

In `@src/main/java/koina/KoinaClient.java`:
- Around line 147-171: In parseMs2Response, validate the Koina output array
sizes before computing f or entering the reshape loop. Check that intensities,
mz, and annotation are all the same length and that the length is divisible by
n, and throw an IllegalStateException with a clear contract-error message if
not. Use the existing parseMs2Response, floatOutput, stringOutput, and Ms2 flow
to place the guard right after the null checks and before building the
per-precursor records.

In `@src/main/java/koina/KoinaLibraryGenerator.java`:
- Around line 265-269: The TSV writer in KoinaLibraryGenerator is formatting
numeric fields with the JVM default locale, which can emit comma decimals on
some systems and break the DIA-NN/Osprey output contract. Update the
String.format calls in the write block that serializes p.mz, irt, f[0], and
relInt to use a fixed locale (for example, Locale.ROOT or Locale.US) so the
numeric TSV fields always use dot-decimal formatting.
- Around line 212-218: The RT inference loop in KoinaLibraryGenerator’s batch
processing assumes `client.inferRt(...)` always returns one value per peptide in
`batch`, which can crash mid-build if the response is shorter. Add an explicit
size/cardinality check right after the `inferRt` call and before populating
`map`, using the local `batch`, `irt`, and `cfg.batchSize` context; if the
lengths differ, fail fast with a clear error instead of iterating past the
available RT values.

In `@src/main/java/util/ReuseSignature.java`:
- Around line 59-75: `ReuseSignature.compute()` is dropping the first real
argument whenever `commandTokens` is provided because `normalizeCommand()`
treats `tokens[0]` as an executable. Update the token handling so the cache key
preserves the full argument list coming from `CarafeGUI.skipIfResultPresent()`
(which already separates `command.args` and `command.cmd`), and only skip the
executable when it is actually part of the tokenized command. Use `compute()`,
`tokenize()`, and `normalizeCommand()` to ensure the first option is included in
the fingerprint whenever `args` is non-empty.
- Around line 122-126: The fingerprint used by ReuseSignature.fingerprintFile is
too weak because it only combines the file name, size, and last-modified time,
which can let matches() reuse outputs for different inputs. Update the
fingerprint to include a normalized path at minimum, and preferably a content
hash if the cache must remain safe across copied or staged files. Keep the
change localized to fingerprintFile so the reuse check gets a stronger key
without altering the matching flow.

In `@src/test/java/ai/PSMConfigTest.java`:
- Around line 17-41: Add test cleanup so `PSMConfig` does not keep the Osprey
static column settings after these cases run. Introduce an `@AfterMethod` (or
equivalent teardown) in `PSMConfigTest` that restores the DIA-NN defaults by
calling the appropriate `PSMConfig` reset/use method, and make sure both
`ospreyColumnsMatchDiannNamesButTagOspreyEngine` and
`diannAndOspreyShareIdentificationColumnNames` leave no persistent static state
behind.

In `@src/test/java/db/EntrapmentFastaGearTest.java`:
- Around line 47-61: The `baseConfig()` helper is mutating static `CParameter`
digestion settings and leaving them changed for the rest of the JVM, which can
make `EntrapmentFastaGear.run()` and unrelated tests order-dependent. Add test
setup/teardown around `EntrapmentFastaGearTest` to snapshot the original
`CParameter` values before they are changed and restore them afterward, or wrap
the mutation in a `try/finally` so the globals are always reset. Use the
existing `baseConfig()` helper and the `CParameter` fields it touches as the
main locations to update.

In `@src/test/java/koina/KoinaLiveTest.java`:
- Around line 58-104: The live Koina tests in KoinaLiveTest still execute during
the normal test run and only skip after KoinaClient times out. Update the two
TestNG methods, elvislivesrPrositReturnsRealisticSpectrum and
alphaPepDeepVsPrositSimilarityForElvislivesr, to be gated by a TestNG group or
an explicit env/property check so they are excluded from the default mvn test
suite and only run when enabled.

---

Outside diff comments:
In `@README.md`:
- Around line 202-293: The README CLI reference is missing the new
OspreySharp-related options, so update the help-text block to include the added
flags and their short descriptions. Add entries for -build_entrapment_fasta,
-build_koina_library, -decoy_prefix, -koina_ms2_model, -koina_rt_model, and
-koina_url, and expand the -se option in the same help list to show OspreySharp
as a supported search engine value. Keep the formatting consistent with the
existing option table so the README stays aligned with the CLI output.

---

Nitpick comments:
In @.gitattributes:
- Around line 8-20: Add the missing binary declarations in the .gitattributes
asset list by updating the existing binary patterns to include .msi and .pdb
alongside the current entries. Keep the change in the same binary-assets section
so Git treats the Windows installer output and .NET debug symbols as non-text
files and does not normalize line endings for them.

In `@docs/ospreysharp-end-to-end-workflow.md`:
- Around line 51-63: Add language identifiers to every fenced code block in the
workflow doc to satisfy markdownlint MD040 and improve rendering. Update the
existing fences around the phase diagram, MSConvert and OspreySharp command
examples, TSV manifest snippets, and directory tree so each uses an appropriate
specifier such as text, shell, bash, or tsv. Use the surrounding blocks in the
document to locate each fence and keep the content unchanged aside from the
opening fence tag.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 5729011c-fd36-47a4-a11a-a576bd97c8e6

📥 Commits

Reviewing files that changed from the base of the PR and between 048710b and b1ff564.

⛔ Files ignored due to path filters (3)
  • docs/images/CarafeGUI-ospreysharp-tab.png is excluded by !**/*.png
  • docs/images/CarafeGUI-ospreysharp-workflows.png is excluded by !**/*.png
  • lib/utilities-5.0.39P2.jar is excluded by !**/*.jar
📒 Files selected for processing (40)
  • .gitattributes
  • .github/workflows/build-installer.yml
  • .gitignore
  • README.md
  • docs/ospreysharp-end-to-end-workflow.md
  • lib/README.md
  • scripts/build_ospreysharp.bat
  • scripts/build_ospreysharp.sh
  • scripts/compare_library_predictions.py
  • scripts/generate_installer_win.bat
  • src/main/java/ai/AIGear.java
  • src/main/java/ai/OspreyBlibReader.java
  • src/main/java/ai/PSMConfig.java
  • src/main/java/db/EntrapmentFastaGear.java
  • src/main/java/gui/CarafeGUI.java
  • src/main/java/gui/CmdTask.java
  • src/main/java/gui/OspreyBinaryResolver.java
  • src/main/java/gui/OspreyFastaPlanner.java
  • src/main/java/gui/OspreyFdrBenchPlanner.java
  • src/main/java/koina/KoinaClient.java
  • src/main/java/koina/KoinaLibraryGenerator.java
  • src/main/java/util/MzmlUtils.java
  • src/main/java/util/ProcessUtils.java
  • src/main/java/util/ReuseSignature.java
  • src/test/java/ai/OspreyBlibReaderTest.java
  • src/test/java/ai/PSMConfigTest.java
  • src/test/java/ai/SkylineIOTest.java
  • src/test/java/db/EntrapmentFastaGearTest.java
  • src/test/java/gui/GuiCommandLineTest.java
  • src/test/java/gui/OspreyBinaryResolverTest.java
  • src/test/java/gui/OspreyFastaPlannerTest.java
  • src/test/java/gui/OspreyFdrBenchPlannerTest.java
  • src/test/java/gui/testGUI.java
  • src/test/java/koina/KoinaClientTest.java
  • src/test/java/koina/KoinaLibraryGeneratorTest.java
  • src/test/java/koina/KoinaLiveTest.java
  • src/test/java/util/MzmlUtilsTest.java
  • src/test/java/util/ProcessUtilsTest.java
  • src/test/java/util/ReuseSignatureTest.java
  • src/test/java/util/ShutdownHookHelper.java
💤 Files with no reviewable changes (1)
  • src/test/java/gui/testGUI.java

Comment thread .github/workflows/build-installer.yml Outdated
Comment thread .github/workflows/build-installer.yml
Comment thread scripts/build_osprey.sh
Comment on lines +35 to +49
# Default RID: infer from the host when none is given on the command line.
default_rid() {
local os arch
os="$(uname -s)"
arch="$(uname -m)"
case "${os}" in
Linux) echo "linux-x64" ;;
Darwin)
case "${arch}" in
arm64) echo "osx-arm64" ;;
*) echo "osx-x64" ;;
esac ;;
MINGW*|MSYS*|CYGWIN*) echo "win-x64" ;;
*) echo "linux-x64" ;;
esac

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

printf '\n## File: scripts/build_ospreysharp.sh (outline)\n'
ast-grep outline scripts/build_ospreysharp.sh --view expanded || true

printf '\n## Relevant lines in scripts/build_ospreysharp.sh\n'
nl -ba scripts/build_ospreysharp.sh | sed -n '1,180p'

printf '\n## Search for Platform=x64 and RID usage\n'
rg -n --hidden --glob '!**/node_modules/**' --glob '!**/dist/**' --glob '!**/build/**' 'Platform=x64|default_rid|RIDS|osx-arm64|linux-arm64|dotnet publish' . || true

printf '\n## Nearby docs/comments mentioning build_ospreysharp\n'
rg -n --hidden --glob '!**/node_modules/**' --glob '!**/dist/**' --glob '!**/build/**' 'build_ospreysharp|osx-arm64|linux-x64|linux-arm64|x64' README* docs* scripts* .github* || true

Repository: Noble-Lab/Carafe

Length of output: 337


🏁 Script executed:

#!/bin/bash
set -euo pipefail

printf '## Candidate files\n'
git ls-files | rg '(^|/)(build_ospreysharp\.sh|README|docs|.*\.csproj|.*\.sh)$' || true

printf '\n## Search for the script name and RID/platform terms\n'
rg -n --hidden --glob '!**/node_modules/**' --glob '!**/dist/**' --glob '!**/build/**' 'build_ospreysharp|Platform=x64|default_rid|osx-arm64|linux-arm64|dotnet publish' . || true

printf '\n## Read scripts/build_ospreysharp.sh if it exists\n'
if [ -f scripts/build_ospreysharp.sh ]; then
  sed -n '1,180p' scripts/build_ospreysharp.sh
else
  printf 'scripts/build_ospreysharp.sh not found\n'
fi

Repository: Noble-Lab/Carafe

Length of output: 6188


Derive Platform from the RID instead of hard-coding x64.

default_rid() can return osx-arm64, but dotnet publish still passes -p:Platform=x64, so osx-arm64 publishes are mismatched.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@scripts/build_ospreysharp.sh` around lines 35 - 49, The default RID logic in
default_rid() already returns arm64 on Apple Silicon, but the publish command
still hard-codes Platform as x64. Update the build flow in build_ospreysharp.sh
so the Platform value is derived from the selected RID (especially for
osx-arm64) instead of always using x64, and make sure the dotnet publish
invocation uses the same architecture as the RID.

Comment on lines +40 to +52
def load_fragments(path: Path, peptide: str, charge: int) -> dict[tuple, float]:
"""Map (FragmentType, FragmentNumber, FragmentCharge) -> RelativeIntensity for one precursor."""
frags: dict[tuple, float] = {}
with path.open() as f:
reader = csv.DictReader(f, delimiter="\t")
for row in reader:
if row.get("StrippedPeptide") != peptide:
continue
if int(float(row.get("PrecursorCharge", 0))) != charge:
continue
key = (row["FragmentType"], int(row["FragmentNumber"]), int(row["FragmentCharge"]))
frags[key] = float(row["RelativeIntensity"])
return frags

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎯 Functional Correctness | 🟡 Minor | ⚡ Quick win

Don't match precursors by stripped peptide alone.

KoinaLibraryGenerator can emit multiple precursors with the same StrippedPeptide and PrecursorCharge once variable mods are enabled. This loader silently overwrites fragment rows across those candidates, so the reported cosine can come from a mixed spectrum. Filter by ModifiedPeptide/m/z too, or fail when more than one precursor matches.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@scripts/compare_library_predictions.py` around lines 40 - 52, The fragment
loader in load_fragments currently filters only by StrippedPeptide and
PrecursorCharge, which can merge rows from multiple precursor candidates. Update
the matching logic to use a more specific precursor identifier such as
ModifiedPeptide and/or precursor m/z, or detect multiple matches and fail
explicitly instead of silently overwriting fragments. Keep the fix localized to
load_fragments and any caller that provides the precursor selection criteria so
the cosine is computed from one unambiguous spectrum.

Comment thread scripts/generate_installer_win.bat
Comment thread src/main/java/util/ReuseSignature.java
Comment thread src/main/java/util/ReuseSignature.java Outdated
Comment thread src/test/java/ai/PSMConfigTest.java
Comment thread src/test/java/db/EntrapmentFastaGearTest.java
Comment thread src/test/java/koina/KoinaLiveTest.java
maccoss and others added 6 commits June 28, 2026 14:34
…ated review

- KoinaLibraryGenerator: format library-TSV numbers with Locale.ROOT so a JVM
  comma-decimal locale can't emit "12,3400" and break downstream parsers
  (+ regression test under Locale.GERMANY).
- OspreyBlibReader: when a modification can't be recognized, skip the
  identification instead of emitting a partially de-modified (wrong-mass)
  sequence; report the skip count.
- AIGear: propagate an explicit -decoy_prefix into the -build_entrapment_fasta
  path (EntrapmentFastaGear.Config.decoyPrefix) so the flag isn't ignored.
- ProcessUtils.terminateAll: guard against a null collection (Javadoc already
  promised best-effort) (+ test).
- generate_installer_win.bat: fail when the bundled Osprey.exe is missing
  instead of silently shipping an Osprey-less MSI (opt out with ALLOW_NO_OSPREY=1).
- KoinaLiveTest: gate the networked tests behind KOINA_LIVE_TESTS / -Dkoina.live
  so they stay out of the default offline suite.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01RDsBX86cMVqnjnrNDdNRMo
- ReuseSignature: fingerprint inputs by absolute path (not basename) so two
  inputs sharing a filename can't collide, and walk directories recursively so
  nested-file inputs (e.g. Bruker .d) invalidate the cache when contents change;
  bump SCHEME_VERSION. (The "drops the first argument" finding was a false
  positive -- CmdTask.args[0] is the executable, which normalizeCommand
  intentionally drops; left as-is, with a clarifying comment + tests.)
- Koina: validate response array sizes before reshaping -- MS2 arrays must be
  equal length and a multiple of the request count; RT/MS2 batch lengths must
  match the request -- so a malformed/truncated response fails fast with a clear
  message instead of truncating or throwing AIOOBE later.
- Koina library: thread the max-fragment-m/z setting end to end
  (Config.maxFragmentMz, writePrecursor filter, -lf_frag_mz_max CLI, and the GUI
  Koina command), matching the DIA-NN library path.
- CI installer: centralize the version (derive project.version once, use it for
  the Osprey publish path and pass it to the MSI script as APP_VERSION) so a pom
  bump can't drift; grant contents: write so the release step can upload the MSI.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01RDsBX86cMVqnjnrNDdNRMo
…ion, docs

- AIGear (Koina): match fix/var modification ids exactly (split on , ; or
  whitespace) instead of String.contains(), so ids like "10"/"21" can't
  mis-toggle Carbamidomethyl/Oxidation.
- Tests: restore the global state each test mutates -- PSMConfigTest resets the
  PSMConfig column config after each test; EntrapmentFastaGearTest saves and
  restores the CParameter digest settings -- so state can't leak across tests.
- Docs: rename the OspreySharp references to Osprey (the engine was renamed;
  build scripts and CI inputs keep the lowercase ospreysharp_ names), and rename
  the workflow doc to docs/osprey-end-to-end-workflow.md.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01RDsBX86cMVqnjnrNDdNRMo
…isolation

- AIGear (Koina): match fix/var modification ids exactly (split on , ; or
  whitespace) instead of String.contains(), so ids like "10"/"21" can't
  mis-toggle Carbamidomethyl/Oxidation.
- PSMConfigTest / EntrapmentFastaGearTest: restore the global static state each
  test mutates so it can't leak across tests.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01RDsBX86cMVqnjnrNDdNRMo
The engine was renamed Osprey upstream; finish the rename in Carafe so nothing
says OspreySharp anymore:
- scripts/build_ospreysharp.{sh,bat} -> scripts/build_osprey.{sh,bat}
- docs images CarafeGUI-ospreysharp-* -> CarafeGUI-osprey-*
- build-installer.yml workflow inputs ospreysharp_repo/ospreysharp_ref ->
  osprey_repo/osprey_ref (env OSPREY_REPO/OSPREY_REF)
- README, docs, the installer script, and a CarafeGUI error string updated to
  reference the new names.

NOTE: the installer workflow now takes -f osprey_repo / -f osprey_ref.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01RDsBX86cMVqnjnrNDdNRMo
… matcher

Coverage area A (pure functions that were untested but underpin everything):
- IsolationWindowTest: pins IsolationWindow.generate_id rounding/determinism --
  the bucket key whose meta-vs-index consistency the decoy pairing depends on.
- DBGearTest: pins DBGear.get_mz (the precursor m/z mass/charge relation).

Coverage area B (the precursor -> MS2-scan resolution behind "Spectrum not
found"): extract AIGear.add_ms2spectrum_index's range-based apex match into
dia/ApexMatcher.matchByIsolationRange and cover it (closest-RT in window,
boundary inclusivity, ties, no-window -> -1, no cross-window resolution). The
index map is now a TreeMap so iteration is RT-ascending, which makes the
early-exit sound (it was an unordered HashMap before); the dead duplicate second
pass is removed.

Full suite: 92 tests, 0 failures.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01RDsBX86cMVqnjnrNDdNRMo

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
src/main/java/ai/AIGear.java (1)

8237-8269: 🩺 Stability & Availability | 🟠 Major | ⚡ Quick win

NPE when a precursor matches no isolation window.

ApexMatcher.matchByIsolationRange returns -1 when no isolation window contains precursor_mz. In that case the matcher block only logs "No apex scan found" and never puts an entry into row2index for that row. The output loop below then unconditionally dereferences row2index.get(k).apex_rt (and .ms2index, .scan_number), throwing a NullPointerException and aborting the whole Osprey run. Since the hard-fail was intentionally removed, unmatched precursors (boundary/out-of-acquisition-range) are now reachable and must be handled.

🛡️ Guard unmatched rows in the output loop
         IntStream.range(0, matches.size()).forEach(k -> {
             try {
+                ApexMatch am = row2index.get(k);
+                if (am == null) {
+                    // No apex scan found for this precursor; skip rather than NPE.
+                    return;
+                }
                 if(!cIndex.containsKey(PSMConfig.qvalue_column_name)){
-                    writer.write(matches.get(k)+"\t0\t"+row2index.get(k).apex_rt+"\t"+row2index.get(k).delta_rt+"\t"+row2index.get(k).ms2index+"\t"+row2index.get(k).scan_number+"\t"+index.get(row2index.get(k).ms2index).get("isolation_mz_start")+"\t"+index.get(row2index.get(k).ms2index).get("isolation_mz_end")+"\n");
+                    writer.write(matches.get(k)+"\t0\t"+am.apex_rt+"\t"+am.delta_rt+"\t"+am.ms2index+"\t"+am.scan_number+"\t"+index.get(am.ms2index).get("isolation_mz_start")+"\t"+index.get(am.ms2index).get("isolation_mz_end")+"\n");
                 }else{
-                    writer.write(matches.get(k)+"\t"+row2index.get(k).apex_rt+"\t"+row2index.get(k).delta_rt+"\t"+row2index.get(k).ms2index+"\t"+row2index.get(k).scan_number+"\t"+index.get(row2index.get(k).ms2index).get("isolation_mz_start")+"\t"+index.get(row2index.get(k).ms2index).get("isolation_mz_end")+"\n");
+                    writer.write(matches.get(k)+"\t"+am.apex_rt+"\t"+am.delta_rt+"\t"+am.ms2index+"\t"+am.scan_number+"\t"+index.get(am.ms2index).get("isolation_mz_start")+"\t"+index.get(am.ms2index).get("isolation_mz_end")+"\n");
                 }
             } catch (IOException e) {
                 throw new RuntimeException(e);
             }
         });

Note: skipping unmatched rows changes the output row count vs. input; if downstream loaders assume row alignment, consider writing a sentinel/blank ms2index instead. Please confirm the intended handling for precursors with no containing window.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/main/java/ai/AIGear.java` around lines 8237 - 8269, The output loop is
dereferencing missing entries in row2index when
ApexMatcher.matchByIsolationRange returns -1 and no ApexMatch was stored. Update
the logic in the matching block and the writer loop in AIGear so unmatched rows
are handled explicitly before accessing row2index.get(k), either by skipping
them or by writing sentinel/blank values consistently. Keep the handling aligned
with the existing ApexMatch population and the matches iteration so a missing
isolation window no longer causes a NullPointerException.
docs/osprey-end-to-end-workflow.md (1)

150-163: 📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick win

Add a language tag to this fenced command block.

Markdownlint MD040 is still firing here. Mark the block as bash or text so the docs lint cleanly.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/osprey-end-to-end-workflow.md` around lines 150 - 163, Add a fenced code
block language tag to the command example in the Osprey workflow docs so
Markdownlint MD040 passes; update the existing command block in the
documentation to be explicitly marked as bash or text, keeping the content
unchanged and ensuring the fenced block is recognized by the docs linter.

Source: Linters/SAST tools

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In @.github/workflows/build-installer.yml:
- Around line 39-42: The build-installer workflow is falling back to the Osprey
branch tip for tag-triggered builds, which makes release MSI builds
non-reproducible. Update the env handling in the workflow so the bundled Osprey
ref used by the build job is pinned to a reviewed revision for `push` tags
instead of defaulting to `master`, and keep the manual-input path in place for
`github.event.inputs.osprey_ref` when present. Focus on the `OSPREY_REF`
assignment in the build-installer workflow and make sure tag builds always
resolve to a fixed release-safe ref.

In `@scripts/build_osprey.sh`:
- Around line 23-24: Update the usage examples in build_osprey.sh so the
documented explicit RID list includes osx-x64 alongside win-x64, linux-x64, and
osx-arm64. Keep the change limited to the script’s usage/help text near the
example invocation lines, and ensure the examples reflect all supported host
targets.

---

Outside diff comments:
In `@docs/osprey-end-to-end-workflow.md`:
- Around line 150-163: Add a fenced code block language tag to the command
example in the Osprey workflow docs so Markdownlint MD040 passes; update the
existing command block in the documentation to be explicitly marked as bash or
text, keeping the content unchanged and ensuring the fenced block is recognized
by the docs linter.

In `@src/main/java/ai/AIGear.java`:
- Around line 8237-8269: The output loop is dereferencing missing entries in
row2index when ApexMatcher.matchByIsolationRange returns -1 and no ApexMatch was
stored. Update the logic in the matching block and the writer loop in AIGear so
unmatched rows are handled explicitly before accessing row2index.get(k), either
by skipping them or by writing sentinel/blank values consistently. Keep the
handling aligned with the existing ApexMatch population and the matches
iteration so a missing isolation window no longer causes a NullPointerException.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: a3ad2aa7-35e7-4169-a603-a3b6a063fe71

📥 Commits

Reviewing files that changed from the base of the PR and between b1ff564 and a3bfbce.

⛔ Files ignored due to path filters (2)
  • docs/images/CarafeGUI-osprey-tab.png is excluded by !**/*.png
  • docs/images/CarafeGUI-osprey-workflows.png is excluded by !**/*.png
📒 Files selected for processing (23)
  • .github/workflows/build-installer.yml
  • README.md
  • docs/osprey-end-to-end-workflow.md
  • scripts/build_osprey.bat
  • scripts/build_osprey.sh
  • scripts/generate_installer_win.bat
  • src/main/java/ai/AIGear.java
  • src/main/java/ai/OspreyBlibReader.java
  • src/main/java/dia/ApexMatcher.java
  • src/main/java/gui/CarafeGUI.java
  • src/main/java/koina/KoinaClient.java
  • src/main/java/koina/KoinaLibraryGenerator.java
  • src/main/java/util/ProcessUtils.java
  • src/main/java/util/ReuseSignature.java
  • src/test/java/ai/PSMConfigTest.java
  • src/test/java/db/DBGearTest.java
  • src/test/java/db/EntrapmentFastaGearTest.java
  • src/test/java/dia/ApexMatcherTest.java
  • src/test/java/dia/IsolationWindowTest.java
  • src/test/java/koina/KoinaLibraryGeneratorTest.java
  • src/test/java/koina/KoinaLiveTest.java
  • src/test/java/util/ProcessUtilsTest.java
  • src/test/java/util/ReuseSignatureTest.java
✅ Files skipped from review due to trivial changes (2)
  • src/test/java/db/DBGearTest.java
  • README.md
🚧 Files skipped from review as they are similar to previous changes (9)
  • src/main/java/util/ProcessUtils.java
  • scripts/generate_installer_win.bat
  • src/test/java/util/ProcessUtilsTest.java
  • src/test/java/koina/KoinaLiveTest.java
  • src/test/java/ai/PSMConfigTest.java
  • src/main/java/koina/KoinaClient.java
  • src/main/java/koina/KoinaLibraryGenerator.java
  • src/test/java/db/EntrapmentFastaGearTest.java
  • src/main/java/gui/CarafeGUI.java

Comment on lines +39 to +42
env:
# Defaults used when the workflow is triggered by a tag (no manual inputs).
OSPREY_REPO: ${{ github.event.inputs.osprey_repo || 'ProteoWizard/pwiz' }}
OSPREY_REF: ${{ github.event.inputs.osprey_ref || 'master' }}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔒 Security & Privacy | 🟠 Major | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

sed -n '1,180p' .github/workflows/build-installer.yml

Repository: Noble-Lab/Carafe

Length of output: 7428


Pin the bundled Osprey ref for tag builds. On push to v*, github.event.inputs.osprey_ref is empty, so this falls back to master. That makes release MSI builds pull whatever Osprey is at branch tip instead of a reviewed revision, which breaks reproducibility and can pull in unreviewed upstream changes.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.github/workflows/build-installer.yml around lines 39 - 42, The
build-installer workflow is falling back to the Osprey branch tip for
tag-triggered builds, which makes release MSI builds non-reproducible. Update
the env handling in the workflow so the bundled Osprey ref used by the build job
is pinned to a reviewed revision for `push` tags instead of defaulting to
`master`, and keep the manual-input path in place for
`github.event.inputs.osprey_ref` when present. Focus on the `OSPREY_REF`
assignment in the build-installer workflow and make sure tag builds always
resolve to a fixed release-safe ref.

Comment thread scripts/build_osprey.sh
Comment on lines +23 to +24
# scripts/build_osprey.sh # build the default RID for this host
# scripts/build_osprey.sh win-x64 linux-x64 osx-arm64

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick win

Include osx-x64 in the usage examples.

The example list now implies the supported explicit RIDs are win-x64, linux-x64, and osx-arm64, but the runtime/docs contract also supports osx-x64. Omitting it makes Intel macOS builds look unsupported.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@scripts/build_osprey.sh` around lines 23 - 24, Update the usage examples in
build_osprey.sh so the documented explicit RID list includes osx-x64 alongside
win-x64, linux-x64, and osx-arm64. Keep the change limited to the script’s
usage/help text near the example invocation lines, and ensure the examples
reflect all supported host targets.

maccoss and others added 6 commits June 28, 2026 15:22
…own-mod test

Coverage area C (the dia/ spectrum-bucketing layer, previously 0% covered):
- DIAMapTest pins DIAMap.get_isolation_window / get_isolation_windows -- which
  window(s) a precursor m/z falls into, the membership behind per-window indexing
  and decoy pairing (the SEA-AD pairing diagnosis turned on exactly this):
  single-window, no-window -> empty, boundary inclusivity, and overlapping
  (staggered) windows reporting all matches. Driven from a hand-built
  isolationWindowMap, so no mzML fixture is needed.

Tier D:
- OspreyBlibReaderTest covers the unrecognized-modification skip: a peptide with
  an unknown mod mass is dropped, not emitted with a wrong (partially
  de-modified) mass.

Full suite: 97 tests, 0 failures. (The full DIAMeta.load_ms_data + DIAIndex.index
mzML round-trip still needs a binary mzML fixture and is left for a follow-up.)

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01RDsBX86cMVqnjnrNDdNRMo
Completes coverage area C: DIASpectrumTest builds a tiny 3-scan mzML (one MS1,
two MS2 in adjacent isolation windows, base64 peak arrays) at test time and runs
it through the real MSFTBX parse path, then asserts:
- DIAMeta.load_ms_data + get_ms_run_meta_data: scan count, MS levels, and the two
  discovered isolation windows (ids + bounds).
- DIAIndex.index + get_spectrum_by_scan: the in-window MS2 scan is indexed by its
  native scan number, its peaks round-trip, and an unknown scan number resolves
  to null (the upstream "spectrum not found" guard).

A real DIA mzML is far too large to commit (an Astral file is ~145 MB for under
a minute), so generating the fixture in-process gives the end-to-end spectrum-read
coverage with no binary in the repo.

Full suite: 99 tests, 0 failures.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01RDsBX86cMVqnjnrNDdNRMo
The -ms option accepts a folder of MS files (e.g. gas-phase fractionation), but
the Osprey finetune path built a single MS2 index straight from -ms, so a folder
was opened as one mzML: fetchIndex() failed and DIAMeta.load_ms_data NPE'd on a
null index. add_ms2spectrum_index now resolves each PSM's run via its File.Name
and builds one MS2 index per run file (resolveRunMsFile mirrors the DIA-NN
loader; ApexMatch carries the isolation window so per-run output needs no shared
index). DIAMeta.load_ms_data now fails with a clear message instead of an NPE.

Added AIGearMsFileResolutionTest covering single-file, folder-by-basename,
existing-full-path, missing-run, and no-File.Name cases.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The installer workflow builds with -DskipTests, so the tests never ran in CI.
This runs mvn test on Linux (JDK 21) on push / PR / manual dispatch.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Test builds pile up; stamp the MSI and artifact with a yyMMdd build date so they
are distinguishable, e.g. Carafe-2.2.0-260701.msi / Carafe-win-x64-msi-260701.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ing NPE)

OspreyBlibReader.writeDiannTsv wrote a DIA-NN-style report that omitted the
Precursor.Id column. Single-file training never read it, but folder (multi-mzML)
training routes through AIGear.get_ms_file2psm_diann_multiple_ms_runs, whose
"best" merge does d[hIndex.get("Precursor.Id")] -> null -> NullPointerException
at AIGear.java:12275 (and the same lookup at 12376).

Add Precursor.Id as Modified.Sequence + Precursor.Charge (e.g.
AAC(UniMod:4)LLDGVPVALKK3), the exact key Carafe's own Skyline->DIA-NN converter
builds (AIGear.java:8510), so both report sources merge identically and every
Precursor.Id consumer is covered at once. use_osprey_blib_column_names() already
keeps precursor_id_column_name = "Precursor.Id", so the added column resolves.

Add a regression test (emitsPrecursorIdAsModifiedSequencePlusCharge) and include
Precursor.Id in the required-columns assertion. Full suite: 105 tests, 0 failures.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01RDsBX86cMVqnjnrNDdNRMo

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/main/java/ai/AIGear.java (1)

8246-8281: 🩺 Stability & Availability | 🔴 Critical | ⚡ Quick win

NullPointerException when no apex scan is found for a PSM row.

row2index (line 8168) is only populated inside the if (index.containsKey(matched_index)) branch (lines 8250-8258); when ApexMatcher.matchByIsolationRange fails to find a match, only a System.out.println("No apex scan found: ...") is emitted and no entry is stored for that row index.

The subsequent write loop unconditionally does ApexMatch am = row2index.get(k) and immediately dereferences am.apex_rt, am.delta_rt, am.ms2index, am.scan_number, am.isolation_mz_start, am.isolation_mz_end (lines 8272-8277) with no null check. Any row whose apex can't be resolved (a case this method explicitly anticipates) throws a NullPointerException and aborts the entire finetune/index-build run instead of just that row.

🐛 Proposed fix
         IntStream.range(0, matches.size()).forEach(k -> {
             try {
                 ApexMatch am = row2index.get(k);
+                if (am == null) {
+                    Cloger.getInstance().logger.warn("Skipping row with no apex scan match: " + matches.get(k));
+                    return;
+                }
                 if(!cIndex.containsKey(PSMConfig.qvalue_column_name)){
                     writer.write(matches.get(k)+"\t0\t"+am.apex_rt+"\t"+am.delta_rt+"\t"+am.ms2index+"\t"+am.scan_number+"\t"+am.isolation_mz_start+"\t"+am.isolation_mz_end+"\n");
                 }else{
                     writer.write(matches.get(k)+"\t"+am.apex_rt+"\t"+am.delta_rt+"\t"+am.ms2index+"\t"+am.scan_number+"\t"+am.isolation_mz_start+"\t"+am.isolation_mz_end+"\n");
                 }
             } catch (IOException e) {
                 throw new RuntimeException(e);
             }
         });

Note: silently skipping a row changes the output row count relative to the input PSM table — confirm downstream consumers of the _added_ms2index.tsv file tolerate that, or write a sentinel row instead.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/main/java/ai/AIGear.java` around lines 8246 - 8281, The write path in the
apex-index export is dereferencing a missing `ApexMatch` when
`ApexMatcher.matchByIsolationRange` fails, causing a `NullPointerException`.
Update the `row2index` population and the later `IntStream.range(...).forEach`
loop so every `matches` row has a safe handling path in `AIGear` (for example,
guard `row2index.get(k)` before reading `apex_rt`, `delta_rt`, `ms2index`,
`scan_number`, `isolation_mz_start`, and `isolation_mz_end`). If no match
exists, either emit a sentinel/default output row consistently or skip writing
that row in a controlled way, but avoid dereferencing null.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In @.github/workflows/test.yml:
- Around line 17-18: The Checkout step uses actions/checkout@v5 with default
credential persistence, which leaves the GITHUB_TOKEN in local git config.
Update the Checkout job configuration in the workflow to disable credential
persistence for this read-only test run by setting the checkout action’s
persistence option off in the Checkout step.

---

Outside diff comments:
In `@src/main/java/ai/AIGear.java`:
- Around line 8246-8281: The write path in the apex-index export is
dereferencing a missing `ApexMatch` when `ApexMatcher.matchByIsolationRange`
fails, causing a `NullPointerException`. Update the `row2index` population and
the later `IntStream.range(...).forEach` loop so every `matches` row has a safe
handling path in `AIGear` (for example, guard `row2index.get(k)` before reading
`apex_rt`, `delta_rt`, `ms2index`, `scan_number`, `isolation_mz_start`, and
`isolation_mz_end`). If no match exists, either emit a sentinel/default output
row consistently or skip writing that row in a controlled way, but avoid
dereferencing null.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 154c257e-9cea-4c0e-9397-87686a96b942

📥 Commits

Reviewing files that changed from the base of the PR and between a3bfbce and d916373.

📒 Files selected for processing (10)
  • .github/workflows/build-installer.yml
  • .github/workflows/test.yml
  • src/main/java/ai/AIGear.java
  • src/main/java/ai/ApexMatch.java
  • src/main/java/ai/OspreyBlibReader.java
  • src/main/java/dia/DIAMeta.java
  • src/test/java/ai/AIGearMsFileResolutionTest.java
  • src/test/java/ai/OspreyBlibReaderTest.java
  • src/test/java/dia/DIAMapTest.java
  • src/test/java/dia/DIASpectrumTest.java
🚧 Files skipped from review as they are similar to previous changes (2)
  • .github/workflows/build-installer.yml
  • src/main/java/ai/OspreyBlibReader.java

Comment on lines +17 to +18
- name: Checkout
uses: actions/checkout@v5

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔒 Security & Privacy | 🟠 Major | ⚡ Quick win

Checkout doesn't disable credential persistence.

actions/checkout@v5 persists the GITHUB_TOKEN in the local git config by default. Since this workflow only needs to read the repo to run tests (no push required), disable persistence to reduce the blast radius if a build step (e.g. via a compromised Maven dependency) attempts to exfiltrate it.

🔒 Proposed fix
       - name: Checkout
         uses: actions/checkout@v5
+        with:
+          persist-credentials: false
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
- name: Checkout
uses: actions/checkout@v5
- name: Checkout
uses: actions/checkout@v5
with:
persist-credentials: false
🧰 Tools
🪛 zizmor (1.26.1)

[warning] 17-21: credential persistence through GitHub Actions artifacts (artipacked): does not set persist-credentials: false

(artipacked)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.github/workflows/test.yml around lines 17 - 18, The Checkout step uses
actions/checkout@v5 with default credential persistence, which leaves the
GITHUB_TOKEN in local git config. Update the Checkout job configuration in the
workflow to disable credential persistence for this read-only test run by
setting the checkout action’s persistence option off in the Checkout step.

Source: Linters/SAST tools

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants