Proposition to extend /search/calls & /search/allelematrix functionalities #676

@GuilhemSempere

Add studyDbIds, sampleDbIds, and germplasm filters to POST /search/calls; add studyDbIds to POST /search/allelematrix; formally document AND-intersection semantics

Summary

  • Add studyDbIds to both POST /search/allelematrix and POST /search/calls
  • Add sampleDbIds, germplasmDbIds, germplasmNames, and germplasmPUIs to POST /search/calls (these already exist on AlleleMatrixSearchRequest)
  • Formally document AND-intersection semantics for all high-level filters on both endpoints
  • Add dimensionColumnAggregation parameter to override default column aggregation granularity

Background

An allele matrix is a 2D structure where genotype calls are attached to CallSets:

  • Row dimension (variants): Study → VariantSet → Variant → Call
  • Column dimension (materials): Study → Germplasm → Sample → CallSet → Call

The column dimension can be entered at any level: callers may filter by germplasmDbIds, sampleDbIds, or callSetDbIds directly. Note that Germplasm objects do not carry a studyDbId, although they can still be filtered by that field.

Currently, /search/calls only supports column selection via callSetDbIds, forcing clients to resolve the full Germplasm → Sample → CallSet chain themselves before querying. /search/allelematrix supports germplasm and sample filters but lacks studyDbIds. Intersection semantics for simultaneous use of multiple filters are undefined on both endpoints.
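To illustrate the client-side work this currently implies, here is a minimal sketch of resolving the Germplasm → Sample → CallSet chain before building a /search/calls body. The in-memory SAMPLES and CALLSETS lists and both function names are hypothetical stand-ins for whatever a client would actually fetch from a server; only the field names follow BrAPI.

```python
# Hypothetical records standing in for data a client would first have to
# fetch from the server; the identifiers are invented for illustration.
SAMPLES = [
    {"sampleDbId": "samp1", "germplasmDbId": "germ1"},
    {"sampleDbId": "samp2", "germplasmDbId": "germ1"},
    {"sampleDbId": "samp3", "germplasmDbId": "germ2"},
]
CALLSETS = [
    {"callSetDbId": "cs1", "sampleDbId": "samp1"},
    {"callSetDbId": "cs2", "sampleDbId": "samp2"},
    {"callSetDbId": "cs3", "sampleDbId": "samp3"},
]


def resolve_callsets(germplasm_db_ids):
    """Walk Germplasm -> Sample -> CallSet and return sorted callSetDbIds."""
    wanted = set(germplasm_db_ids)
    sample_ids = {s["sampleDbId"] for s in SAMPLES if s["germplasmDbId"] in wanted}
    return sorted(c["callSetDbId"] for c in CALLSETS if c["sampleDbId"] in sample_ids)


def build_calls_request(germplasm_db_ids):
    """Build the only column filter /search/calls accepts today."""
    return {"callSetDbIds": resolve_callsets(germplasm_db_ids)}
```

With the proposed fields, the same selection would reduce to sending germplasmDbIds directly and letting the server perform the resolution.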

Proposed Changes

New fields on CallSearchRequest

studyDbIds:
  type: array
  items:
    type: string
  description: >
    Filter results to calls associated with the specified studies.
    The server resolves studyDbIds across both dimensions: to VariantSets on the row
    dimension, and to CallSets on the column dimension.
    Acts as an AND constraint alongside all other filters.

sampleDbIds:
  type: array
  items:
    type: string
  description: >
    Filter results to calls belonging to CallSets derived from the specified samples.
    Acts as an AND constraint alongside all other filters.

germplasmDbIds:
  type: array
  items:
    type: string
  description: >
    Filter results to calls belonging to CallSets derived from samples associated with
    the specified germplasm.
    Acts as an AND constraint alongside all other filters.

germplasmNames:
  type: array
  items:
    type: string
  description: As germplasmDbIds but matched against germplasm names.

germplasmPUIs:
  type: array
  items:
    type: string
  description: As germplasmDbIds but matched against germplasm PUIs.

dimensionColumnAggregation:
  type: string
  enum: [callSet, sample, germplasm]
  description: >
    Override the default column aggregation granularity (see Genotype aggregation level below).
    When provided, the server groups calls at the specified level regardless of which filter
    parameters were used to select the material. For example, filtering by sampleDbIds but
    setting dimensionColumnAggregation to "germplasm" will return one aggregated column per
    Germplasm rather than one per Sample.

New fields on AlleleMatrixSearchRequest

studyDbIds:
  type: array
  items:
    type: string
  description: >
    Filter the matrix to calls associated with the specified studies.
    The server resolves studyDbIds across both dimensions: to VariantSets on the row
    dimension, and to CallSets on the column dimension.
    Acts as an AND constraint alongside all other filters.

dimensionColumnAggregation:
  type: string
  enum: [callSet, sample, germplasm]
  description: >
    Override the default column aggregation granularity (see Genotype aggregation level below).
    When provided, the server groups calls at the specified level regardless of which filter
    parameters were used to select the material. For example, filtering by sampleDbIds but
    setting dimensionColumnAggregation to "germplasm" will return one aggregated column per
    Germplasm rather than one per Sample.

Genotype aggregation level

The biological entity type used to filter the column dimension determines the default granularity at which genotype calls are aggregated and returned:

  • callSetDbIds: one column per CallSet — finest granularity, no merging.
  • sampleDbIds: calls are grouped per Sample (aggregating all CallSets of that Sample).
  • germplasmDbIds / germplasmNames / germplasmPUIs: calls are grouped per Germplasm (aggregating all Samples and their CallSets for that Germplasm).
  • studyDbIds (column dimension): calls are grouped per Germplasm within the study.

When multiple filters from different tiers are provided simultaneously, the finest-grained tier governs the default aggregation. This default can be overridden using the dimensionColumnAggregation parameter.
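The default-granularity rule above can be sketched as a small function. This is only one possible reading of the rule, not a reference implementation: the function name is hypothetical, and returning None when no column filter is present is an assumption, since the proposal does not define that case.

```python
# Default aggregation tier implied by each column-dimension filter,
# taken from the list above.
TIER_BY_FILTER = {
    "callSetDbIds": "callSet",
    "sampleDbIds": "sample",
    "germplasmDbIds": "germplasm",
    "germplasmNames": "germplasm",
    "germplasmPUIs": "germplasm",
    "studyDbIds": "germplasm",
}
# Tiers ordered from finest to coarsest granularity.
FINEST_FIRST = ["callSet", "sample", "germplasm"]


def effective_aggregation(request):
    """Return the aggregation level a server would apply to this request."""
    override = request.get("dimensionColumnAggregation")
    if override is not None:
        if override not in FINEST_FIRST:
            raise ValueError(f"unsupported aggregation level: {override}")
        return override  # an explicit override always wins
    # Collect the tiers of all non-empty column filters in the request.
    tiers = {
        TIER_BY_FILTER[name]
        for name, values in request.items()
        if name in TIER_BY_FILTER and values
    }
    for tier in FINEST_FIRST:
        if tier in tiers:
            return tier  # the finest-grained tier present governs
    return None  # assumption: no column filter, so no default is defined
```

For instance, a request combining studyDbIds and sampleDbIds resolves to "sample", while the same request with dimensionColumnAggregation set to "germplasm" resolves to "germplasm".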

Note: this proposal does not define the merging strategy for conflicting allele calls within an aggregation group (e.g. two CallSets of the same Sample carrying different genotypes). This is left to a follow-up discussion.

Intersection semantics (AND logic)

All filters stack as AND constraints across and within dimensions:

  • Providing only studyDbIds is sufficient to retrieve all calls for those studies (both dimensions implicitly resolved).
  • Any additional filter further narrows the result within the study scope.
  • Filters from different tiers of the same dimension are intersected: e.g. studyDbIds: ["S1"] + sampleDbIds: ["Samp1"] returns calls from the CallSets of Samp1 only if Samp1 belongs to S1; otherwise the result is empty.
  • Non-overlapping filter combinations return HTTP 200 with an empty data array (standard BrAPI empty result).
  • All pre-existing fields (variantDbIds, variantSetDbIds, callSetDbIds, etc.) retain their current semantics.
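As a rough illustration of the column-dimension intersection, the sketch below filters a flattened list of CallSets. The CALLSETS records, the flattening of study, germplasm, and sample identifiers onto each CallSet, and the function name are all hypothetical. It also assumes that values inside a single filter array are alternatives (a CallSet matches if it matches any listed value) and that an absent or empty filter imposes no constraint; the proposal itself only specifies AND logic between filters.

```python
# Hypothetical flattened view: each CallSet annotated with the sample,
# germplasm, and study it resolves to.
CALLSETS = [
    {"callSetDbId": "cs1", "sampleDbId": "Samp1", "germplasmDbId": "germ1", "studyDbId": "S1"},
    {"callSetDbId": "cs2", "sampleDbId": "Samp2", "germplasmDbId": "germ1", "studyDbId": "S1"},
    {"callSetDbId": "cs3", "sampleDbId": "Samp3", "germplasmDbId": "germ2", "studyDbId": "S2"},
]
# Request filter name -> field it is matched against.
FILTER_TO_FIELD = {
    "studyDbIds": "studyDbId",
    "germplasmDbIds": "germplasmDbId",
    "sampleDbIds": "sampleDbId",
    "callSetDbIds": "callSetDbId",
}


def select_callsets(request):
    """Return callSetDbIds satisfying every filter present in the request."""
    selected = CALLSETS
    for filter_name, field in FILTER_TO_FIELD.items():
        values = request.get(filter_name)
        if values:  # assumption: absent or empty filters do not constrain
            allowed = set(values)
            # Each additional filter narrows the previous selection (AND).
            selected = [c for c in selected if c[field] in allowed]
    return [c["callSetDbId"] for c in selected]
```

A non-overlapping combination such as studyDbIds ["S1"] with sampleDbIds ["Samp3"] yields an empty list here, which corresponds to the 200 response with an empty data array described above.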

Examples

Minimal study query — all calls for one study, grouped by germplasm (default):

{ "studyDbIds": ["study1"] }

Study query with explicit aggregation override — same query, but return one column per Sample instead of per Germplasm:

{ "studyDbIds": ["study1"], "dimensionColumnAggregation": "sample" }

Study + germplasm — calls restricted to the specified germplasm within the study:

{ "studyDbIds": ["study1"], "germplasmDbIds": ["germ1", "germ2"] }

Sample filter with germplasm-level aggregation — select by sample but aggregate up to germplasm:

{ "sampleDbIds": ["samp1", "samp2"], "dimensionColumnAggregation": "germplasm" }

Study + variantSet — row dimension restricted to variants in both the study and the specified VariantSet:

{ "studyDbIds": ["study1"], "variantSetDbIds": ["vs1"] }

Non-overlapping filters — returns 200 + empty data:

{ "studyDbIds": ["study1"], "sampleDbIds": ["samp_from_study2"] }

Affected endpoints

  • POST /search/calls — add studyDbIds, sampleDbIds, germplasmDbIds, germplasmNames, germplasmPUIs, dimensionColumnAggregation
  • POST /search/allelematrix — add studyDbIds, dimensionColumnAggregation

Notes

  • Server implementations MAY return 202 Accepted with a search result identifier (searchResultsDbId) instead of the results themselves for large result sets, consistent with existing BrAPI async search behaviour.
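A client therefore has to handle both outcomes. The sketch below shows one way to branch on the status code without performing any network call. The function name is hypothetical, and the body layout (result.data for 200, result.searchResultsDbId for 202) is an assumption based on how existing BrAPI search endpoints respond; this proposal does not restate it.

```python
def interpret_search_response(status_code, body):
    """Classify a search response as immediate results or a pending search.

    Returns a (kind, payload) tuple:
      ("results", list_of_records) for a 200 response,
      ("pending", search_results_db_id) for a 202 response.
    """
    if status_code == 200:
        # Assumed layout: records are listed under result.data.
        return ("results", body["result"]["data"])
    if status_code == 202:
        # Assumed layout: the identifier to poll is under result.searchResultsDbId.
        return ("pending", body["result"]["searchResultsDbId"])
    raise ValueError(f"unexpected status code: {status_code}")
```

On a "pending" outcome the client would later retrieve the results with the returned identifier, following whatever polling policy the server documents.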
