Add `get_dataset_info()` function by jashapiro · Pull Request #53 · AlexsLemonade/ScPCAr

jashapiro · 2026-06-05T17:50:51Z

Closes #36 (which had been erroneously closed before)

Sorry, one more code change before docs!

Here I am adding a get_dataset_info() function that returns a subset of the information in get_dataset_detail() (a non-exported function) in what is hopefully a more digestible form.

The return value is still a list (I did not want to get into objects), but with a much smaller set of components:

- id: the dataset id
- format: SINGLE_CELL_EXPERIMENT or ANN_DATA
- status: current processing status
- n_samples: number of samples
- n_projects: number of projects those samples come from
- samples: a data frame of the individual samples, with the following columns:
  - scpca_sample_id
  - scpca_project_id
  - modality
  - includes_bulk
- merged_projects: Since a dataset created on the website can have merged projects (and we may support that in the future), we have a separate list of all the merged projects included.

I am quite open to changing the format and contents of this return value. Right now it is limited by my not wanting to make extra queries for each sample, so I only have the info that was in the dataset detail we get from the query. If we are willing to make multiple calls to the API, we could get more info for each sample. (This part of the code is in a helper function: make_dataset_df().)
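As a rough illustration of the shape described above, here is a minimal sketch of flattening the per-project data into one row per sample-modality pair without any extra API queries. The `detail_data` object is a hand-written stand-in modeled on the test fixture in this PR, and `flatten_dataset_data()` is a hypothetical name; it is not the package's `make_dataset_df()`.

```r
# Hypothetical stand-in for the `data` component of a dataset detail response;
# the shape mirrors the test fixture in this PR, not a real API call.
detail_data <- list(
  SCPCP000001 = list(
    SINGLE_CELL = list("SCPCS000001", "SCPCS000002"),
    SPATIAL = list(),
    includes_bulk = FALSE
  ),
  SCPCP000002 = list(
    SINGLE_CELL = list("SCPCS000003"),
    SPATIAL = list("SCPCS000004"),
    includes_bulk = TRUE
  )
)

# Illustrative helper (assumed name): one row per sample-modality pair.
flatten_dataset_data <- function(data) {
  rows <- list()
  for (project_id in names(data)) {
    project <- data[[project_id]]
    for (modality in c("SINGLE_CELL", "SPATIAL")) {
      sample_ids <- unlist(project[[modality]])
      if (length(sample_ids) == 0) next
      rows[[length(rows) + 1]] <- data.frame(
        scpca_sample_id = sample_ids,
        scpca_project_id = project_id,
        modality = modality,
        includes_bulk = isTRUE(project$includes_bulk),
        stringsAsFactors = FALSE
      )
    }
  }
  if (length(rows) == 0) {
    # Keep the column layout stable even when nothing is included.
    return(data.frame(
      scpca_sample_id = character(0),
      scpca_project_id = character(0),
      modality = character(0),
      includes_bulk = logical(0)
    ))
  }
  do.call(rbind, rows)
}

samples <- flatten_dataset_data(detail_data)
info <- list(
  n_samples = nrow(samples),
  n_projects = length(detail_data),
  samples = samples
)
```

Everything here comes from data already present in a single response, which is the constraint described above; richer per-sample columns would need additional queries.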

That was the core change for the issue, but I made another change that turned this from a small PR into a fairly big one: I changed the return values of all public functions from the dataset detail we get from the API to just the dataset id. I think that makes things a lot simpler, and it avoids people expecting the dataset they had stored to be up-to-date, which it often will not be if they did not capture outputs. This means that there are a lot of docs updates as well.

I also consolidated the status translation into its own function, since there are now two functions that do that. I did not translate the format info, but I might be willing to do that as well so things are a bit prettier.
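A consolidated status translation could look something like the sketch below. The five status strings are the ones documented for `status` in this PR; the flag names on the input (`is_expired`, `is_failed`, `is_succeeded`, `is_processing`) and the precedence order are assumptions made for illustration, not confirmed fields or behavior of the API or the package.

```r
# Hypothetical helper: translate a dataset detail list into a single status
# string. The flag names are illustrative assumptions, not confirmed API fields.
translate_status <- function(detail) {
  if (isTRUE(detail$is_expired)) return("expired")
  if (isTRUE(detail$is_failed)) return("failed")
  if (isTRUE(detail$is_succeeded)) return("succeeded")
  if (isTRUE(detail$is_processing)) return("processing")
  # Nothing has happened yet, so the dataset is still waiting.
  "pending"
}

status_new <- translate_status(list())
status_failed <- translate_status(list(is_processing = FALSE, is_failed = TRUE))
```

Keeping this in one function means every caller that reports a status uses the same precedence rules.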

sjspielman

The changes look fine to me overall, although there are a few spots where I think we can be clearer about what we are returning. Most of my comments are about changes to tests, which have been pared down a lot, and it's not always clear to me why.

sjspielman · 2026-06-08T13:23:13Z

+#'   * `format`: the dataset file format (e.g. "SINGLE_CELL_EXPERIMENT", "ANN_DATA")
+#'   * `status`: the processing status — one of "pending", "processing",
+#'     "succeeded", "failed", or "expired" (see [get_dataset_status()])
+#'   * `n_samples`: the number of rows in `samples` (one per sample-modality


Is there info about the number of libraries that we can pull out and return here as well? I kind of don't think there is, but if there is it would be good.

jashapiro · 2026-06-09T13:41:26Z

I don't think there is either. We could get it with more calls to the API, but as I said, I don't really want to do that.

sjspielman · 2026-06-08T13:27:48Z

+    n_samples = nrow(samples),
+    n_projects = length(detail$data),


The way this is set up, it seems that n_samples will not include counts from a merged object but n_projects will. So you could end up with something like n_samples = 0 and n_projects = 1, which seems really confusing. I understand why you wouldn't want to include merged samples in samples because of how the data is delivered, so what if we have another field here, n_samples_merged or so, to represent the number of samples that are delivered in merged objects? The point is, can we report both?

jashapiro · 2026-06-09T13:45:48Z

I don't know that there is a good solution here without more calls to the API, but I will look into it some more.

I was actually unsure whether to include anything at all about merged datasets; at the moment I don't think there is an easy way to get the token used for a website-created dataset, which means this is kind of a problem that won't come up unless or until we add support for adding merged projects.
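To make the mismatch concrete, here is a small sketch of how the two counts can diverge. It assumes, purely for illustration, that a merged project is marked by the string "MERGED" in place of a list of sample ids; that representation is an assumption and is not confirmed anywhere in this thread.

```r
# Hypothetical dataset data with a single merged project. The "MERGED" marker
# is an assumed representation used only for this sketch.
detail_data <- list(
  SCPCP000001 = list(
    SINGLE_CELL = "MERGED",
    SPATIAL = list(),
    includes_bulk = FALSE
  )
)

# Flag projects that are delivered as merged objects.
is_merged <- vapply(
  detail_data,
  function(project) identical(project$SINGLE_CELL, "MERGED"),
  logical(1)
)

# Only non-merged projects contribute individually listed samples.
individual_ids <- unlist(lapply(
  detail_data[!is_merged],
  function(project) c(unlist(project$SINGLE_CELL), unlist(project$SPATIAL))
))

n_samples <- length(individual_ids)
n_projects <- length(detail_data)
merged_projects <- names(detail_data)[is_merged]
```

With this input, `n_samples` is 0 while `n_projects` is 1, which is the confusing combination raised in the review. Reporting a merged sample count alongside would need information that is not in this structure.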

sjspielman · 2026-06-08T13:32:40Z

-  expect_null(result$email)
-})
-
-test_that("start_dataset_processing includes email in the same request when provided", {


Why was this removed? It looks like functionality is still there

jashapiro · 2026-06-08T14:17:57Z

Same as above: this is testing that we are constructing the request as expected, but that is also in the function that does the request construction.

sjspielman · 2026-06-08T13:36:10Z

+          SCPCP000001 = list(
+            SINGLE_CELL = list("SCPCS000001", "SCPCS000002"),
+            SPATIAL = list(),
+            includes_bulk = FALSE
+          ),
+          SCPCP000002 = list(
+            SINGLE_CELL = list("SCPCS000003"),
+            SPATIAL = list("SCPCS000003"),


I suspect this is still coming in a future PR but it would really be great for these to at least somewhat reflect reality or be fake. It's a test so it doesn't matter much, but it's enough to make me do a biggg double take.

jashapiro · 2026-06-08T14:14:14Z

I'm not sure what you are reacting to here? My guess from the second part is that you are upset about the ids not really being spatial? I don't really disagree, but also I don't think there should be any expectation that test values reflect reality except in ways that affect code (e.g., format here).

sjspielman

That's what I was reacting to; that id isn't spatial and also samples 1-3 are all in project 1. As you said it doesn't actually matter, but I had a "wait, what?" moment along the way. The code will be tested fine, but my brain cracks just a smidge. Nothing really needs to change though; this is something to just get over as, once again, it doesn't matter!

sjspielman · 2026-06-08T13:38:43Z

    "previously failed to process"
  )
  expect_equal(captured_req$method, "PUT")
-  expect_true(captured_req$body$data$start)


Why was this removed? Another instance below too.

sjspielman · 2026-06-08T13:46:35Z

  })
 })

-test_that("get_ccdl_datasets passes project_id as ccdl_project_id query parameter", {


I'm confused why these tests are gone because it doesn't look like there are any changes to this function.

jashapiro · 2026-06-08T14:15:01Z

My main thought is that most of these tests never should have been here, I just didn't catch them when I wasn't looking closely at tests. When you look at the mock functions and the actual tests, you can see that they are really mostly testing that the argument is passed to the _perform function, but that is already tested as part of the scpca_request() function more directly. It's not that it is a completely redundant test, but it is largely so.

After some consideration, I think I will restore these tests; they are a bit noisy, but it does make sense to make sure the internal call has the expected content.

sjspielman · 2026-06-08T13:51:39Z


-test_that("remove_dataset_samples removes a project and PUTs", {
-  captured_req <- NULL
+test_that("remove_dataset_samples PUTs", {


I'm confused about this change - if we are testing removes, why would we start with an empty dataset list?

jashapiro · 2026-06-08T14:19:45Z

We aren't testing whether the removal works, as this function no longer returns the data inside it (and the mocking that was done didn't really test that anyway!). So the only thing to test is that it made a PUT request.

jashapiro

Thanks for the review! After thinking about it for a few days, I think I will restore many of the tests. I had thought many of the ones I had removed were relatively redundant, as the internal functions are being tested more directly, but I can make a case that we should continue to test that the internal function is hooked up correctly, which allows us to make sure that changes to underlying functions don't affect the callers.

I'm less certain what I am going to do about merged projects. The fact that we can't create datasets with merged projects at the moment makes me not want to worry about the details, but I also don't want things to be confusing.

jashapiro · 2026-06-10T01:18:07Z

After your review I decided that it made sense to have a bit more detail in the dataset info response, so now it goes out and gets the project samples that will be included in the dataset, including both merged and individual samples. It still returns the data in the samples data frame (no library info), but now with info about which modalities are included for each sample.

I also reverted some of the test changes and brought them back to better coverage, including re-expanding the removal test to actually test something useful about the function (that it does actually do the expected removals).
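For context, a removal test that checks more than the HTTP method might be shaped like the sketch below. It uses plain base R with an injected sender instead of the package's actual mocking setup; `remove_project()`, `update_dataset()`, and `fake_put()` are hypothetical names introduced only for this example.

```r
# Hypothetical pure helper: drop one project from the dataset data.
remove_project <- function(data, project_id) {
  stopifnot(project_id %in% names(data))
  data[setdiff(names(data), project_id)]
}

# Fake sender that records the request instead of calling the API.
captured_req <- NULL
fake_put <- function(body) {
  captured_req <<- list(method = "PUT", body = body)
  invisible(NULL)
}

# Hypothetical caller: builds the updated body and hands it to the sender.
update_dataset <- function(data, project_id, send = fake_put) {
  send(list(data = remove_project(data, project_id)))
}

dataset_data <- list(
  SCPCP000001 = list(SINGLE_CELL = list("SCPCS000001"), SPATIAL = list()),
  SCPCP000002 = list(SINGLE_CELL = list("SCPCS000003"), SPATIAL = list())
)

update_dataset(dataset_data, "SCPCP000002")

# Check both that a PUT was made and that the removal actually happened.
stopifnot(captured_req$method == "PUT")
stopifnot(identical(names(captured_req$body$data), "SCPCP000001"))
```

Starting from a non-empty dataset is what lets the second assertion say something about the removal itself rather than only the request method.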

Should be ready for another look.

sjspielman

This looks good to me, small comments but don't need to see again. I did various local testing as well with a bunch of different combos of stuff, nothing to flag!

The one bigger, but still on the smaller side, comment I have is - maybe it would be good to drop merged_projects from the get_dataset_info() output for now since it's not something we can currently support. When I see it in that function's output it makes me want to look for how to add one into the dataset, which I of course can't do. I do think it's good to be prepared for the future in case and have the code here, but just maybe not return that field quite yet. Again, I don't feel I need to see this again though and I'm fine with your decision on it.

sjspielman · 2026-06-10T15:03:45Z

+#' `seq_unit` gives the single-cell sequencing unit ("cell" or "nucleus", or `NA` when the
+#' sample is not included as single-cell),
+#' `has_spatial` marks spatial inclusion
+#' `has_bulk` reflects the project's `includes_bulk` request
+#' intersected with whether the sample actually has bulk data.
+#' `has_cite_seq` and `has_multiplexed` come from the sample records.


Is it possible to make these actual bullets? The includes_bulk phrasing isn't super clear either, but not sure what to suggest
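One possible reading of the `has_bulk` description, sketched in R: a sample gets `has_bulk = TRUE` only when the project was requested with bulk and the sample itself has bulk data. The `sample_has_bulk` column name is a hypothetical stand-in for whatever the sample records provide.

```r
# Project-level request flag (from the dataset's `includes_bulk`).
includes_bulk <- TRUE

# Hypothetical per-sample records; `sample_has_bulk` is an assumed column name.
samples <- data.frame(
  scpca_sample_id = c("SCPCS000001", "SCPCS000002"),
  sample_has_bulk = c(TRUE, FALSE),
  stringsAsFactors = FALSE
)

# Bulk is reported only where it was requested AND actually exists.
samples$has_bulk <- includes_bulk & samples$sample_has_bulk
```

Under this reading, a project requested without bulk yields `has_bulk = FALSE` for every sample, regardless of what the sample records say.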

sjspielman · 2026-06-11T13:54:33Z

+#'   * `n_samples`: the total number of samples in the dataset, taken from the
+#'     API's `total_sample_count`
+#'   * `n_projects`: the number of projects in the dataset
+#'   * `samples`: a data frame with one row per included sample and the following columns:


Can I lobby to call this samples_df? At a quick glance I would assume it's a vector of ids, but that's not what it is at all!

jashapiro

How about sample_info?

jashapiro added 8 commits June 4, 2026 11:41

Add separate function for dataset status internal

3c584bb

use new dataset status

ab1e383

Use only the dataset id for external return values.

b0a1364

Add get_dataset_info function (and data frame helper)

1e8735f

Simplify and consolidate tests

b13835a

fix testing indentation error

afdb1df

document

abb682d

Merge remote-tracking branch 'origin/main' into jashapiro/36-dataset_info

f0417d6

jashapiro requested a review from sjspielman June 5, 2026 17:50

sjspielman reviewed Jun 8, 2026

jashapiro commented Jun 9, 2026

jashapiro added 8 commits June 9, 2026 09:55

Revert test removals

0c9d05a

ignore claude directory for rbuild

e1f96cb

standardize testing with more detail

fcc337f

fix modailty test: can't have one sample with both single cell and spatial!

d7014df

get full sample size

e609cd0

Give a more complete sample table

8e277ff

update dataset tests

5e91a29

update docs

8ce0798

jashapiro requested a review from sjspielman June 10, 2026 01:18

re-expand add and remove tests.

3c8a7b4

sjspielman approved these changes Jun 11, 2026

jashapiro and others added 3 commits June 11, 2026 10:11

Apply suggestions from code review

4ca9ef6

Co-authored-by: Stephanie J. Spielman <stephanie.spielman@gmail.com>

change $samples to $sample_info in get_dataset_info

477c187

update tests to match (and be more conservative on the slot name)

docs updates

4704842

jashapiro merged commit 3d12fbf into main Jun 11, 2026
4 checks passed
