Skip to content

Add get_dataset_info() function#53

Merged
jashapiro merged 20 commits into
mainfrom
jashapiro/36-dataset_info
Jun 11, 2026
Merged

Add get_dataset_info() function#53
jashapiro merged 20 commits into
mainfrom
jashapiro/36-dataset_info

Conversation

@jashapiro

Copy link
Copy Markdown
Member

Closes #36 (which had been erroneously closed before)

Sorry, one more code change before docs!

Here I am adding a get_dataset_info() function, that returns a subset of the information in get_dataset_detail() (a non-exported function) in what is hopefully a more digestible form.

The return value is still a list (I did not want to get into objects), but with a much smaller set of components:

  • id: the dataset id
  • format: SINGLE_CELL_EXPERIMENT or ANN_DATA
  • status: current processing status
  • n_samples: number of samples
  • n_projects: number of projects those samples come from
  • samples = a data frame of the individual samples, with the following columns:
    • scpca_sample_id
    • scpca_project_id
    • modality
    • includes_bulk
  • merged_projects: Since a dataset created on the website can have merged projects (and we may support that in the future), we have a separate list of all the merged projects included.

I am quite open to changing the format and contents of this return value. Right now it is limited by me not wanting to make extra queries for each sample, so I only have the info that was in the dataset detail we get from the query. If we are willing to make multiple calls to the API, we could get more info for each sample. (This part of the code is in a helper function: make_dataset_df())

That was the core change for the issue, but I made another change that made this from a small PR into a kind of big one: I changed all the return values from public functions from the dataset detail we get from the API to just returning the dataset id. I think that makes things a lot simpler, and avoids people expecting the dataset they had stored to be up-to-date, which it often will not be if they did not capture outputs. This means that there are a lot of docs updates as well.

I also consolidated the status translation into its own function, since there are now two functions that do that. I did not translate the format info, but I might be willing to do that as well so things are a bit prettier.

@jashapiro jashapiro requested a review from sjspielman June 5, 2026 17:50

@sjspielman sjspielman left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The changes look fine to me overall, although there's a few spots I think we can be more clear about what we are returning. Most of my comments are about changes to tests which have been pared down a lot and it's not always clear to me why.

Comment thread R/datasets.R Outdated
#' * `format`: the dataset file format (e.g. "SINGLE_CELL_EXPERIMENT", "ANN_DATA")
#' * `status`: the processing status — one of "pending", "processing",
#' "succeeded", "failed", or "expired" (see [get_dataset_status()])
#' * `n_samples`: the number of rows in `samples` (one per sample-modality

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there info about the number of libraries that we can pull out and return here as well? I kind of don't think there is, but if there is it would be good.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think there is either. We could get it with more calls to the API, but as I said, I don't really want to do that.

Comment thread R/datasets.R Outdated
Comment on lines +733 to +734
n_samples = nrow(samples),
n_projects = length(detail$data),

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The way this is set up, it seems that n_samples will not included counts from a merged object but n_project will. So you could end up with something like n_samples = 0 and n_projects = 1 which seems really confusing. I understand why you wouldn't want to include for merged in samples because of how the data is delivered, so what if we have another field here n_samples_merged or so to represent the number of samples that are delivered in merged objects? The point is, can we report both?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know that there is a good solution here without more calls to the API, but I will look into it some more.

I was actually unsure whether to include anything at all about merged datasets; at the moment I don't think there is an easy way to get the token used for a website-created dataset, which means this is kind of a problem that won't come up unless or until we add support for adding merged projects.

expect_null(result$email)
})

test_that("start_dataset_processing includes email in the same request when provided", {

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why was this removed? It looks like functionality is still there

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above: this is testing that we are constructing the request as expected, but that is also in the function that does the request construction.

Comment thread tests/testthat/test-datasets.R Outdated
Comment on lines +372 to +379
SCPCP000001 = list(
SINGLE_CELL = list("SCPCS000001", "SCPCS000002"),
SPATIAL = list(),
includes_bulk = FALSE
),
SCPCP000002 = list(
SINGLE_CELL = list("SCPCS000003"),
SPATIAL = list("SCPCS000003"),

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suspect this is still coming in a future PR but it would really be great for these to at least somewhat reflect reality or be fake. It's a test so it doesn't matter much, but it's enough to make me do a biggg double take.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure what you are reacting to here? My guess from the second part is that you are upset about the ids not really being spatial? I don't really disagree, but also I don't think there should be any expectation that test values reflect reality except in ways that affect code (e.g., format here).

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's what I was reacting to; that id isn't spatial and also samples 1-3 are all in project 1. As you said it doesn't actually matter, but I had a "wait, what?" moment along the way. The code will be tested fine, but my brain cracks just a smidge. Nothing really needs to change though, this is something to just get over as, once again, doesn't matter!

"previously failed to process"
)
expect_equal(captured_req$method, "PUT")
expect_true(captured_req$body$data$start)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why was this removed? Another instance below too.

})
})

test_that("get_ccdl_datasets passes project_id as ccdl_project_id query parameter", {

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm confused why these tests are gone because it doesn't look like there are any changes to this function.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My main thought is that most of these tests never should have been here, I just didn't catch them when I wasn't looking closely at tests. When you look at the mock functions and the actual tests, you can see that they are really mostly testing that the argument is passed to the _perform function, but that is already tested as part of the scpca_request() function more directly. It's not that it is a completely redundant test, but it is largely so.

After some consideration, I think I will restore these tests; they are a bit noisy, but it does make sense to make sure the internal call has the expected content.

Comment thread tests/testthat/test-datasets.R Outdated

test_that("remove_dataset_samples removes a project and PUTs", {
captured_req <- NULL
test_that("remove_dataset_samples PUTs", {

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm confused about this change - if we are testing removes, why would we start with an empty dataset list?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We aren't testing whether the removal works, as this function no longer returns the data inside it (and the mocking that was done didn't really test that anyway!) So the only thing to test is that it made a PUT request.

@jashapiro jashapiro left a comment

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the review! After thinking about it a few days, I think I will restore many of the tests; I had thought many of the ones I had removed were relatively redundant, as the internal functions are being tested more directly, but I can make a case that we should continue to test that the internal function is hooked up correctly, and allows us to make sure that changes to underlying functions don't affect the callers.

I'm less certain what I am going to do about merged projects. The fact that we can't create datasets with merged projects at the moment makes me not want to worry about the details, but I also don't want things to be confusing.

Comment thread tests/testthat/test-datasets.R Outdated
Comment on lines +372 to +379
SCPCP000001 = list(
SINGLE_CELL = list("SCPCS000001", "SCPCS000002"),
SPATIAL = list(),
includes_bulk = FALSE
),
SCPCP000002 = list(
SINGLE_CELL = list("SCPCS000003"),
SPATIAL = list("SCPCS000003"),

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure what you are reacting to here? My guess from the second part is that you are upset about the ids not really being spatial? I don't really disagree, but also I don't think there should be any expectation that test values reflect reality except in ways that affect code (e.g., format here).

})
})

test_that("get_ccdl_datasets passes project_id as ccdl_project_id query parameter", {

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My main thought is that most of these tests never should have been here, I just didn't catch them when I wasn't looking closely at tests. When you look at the mock functions and the actual tests, you can see that they are really mostly testing that the argument is passed to the _perform function, but that is already tested as part of the scpca_request() function more directly. It's not that it is a completely redundant test, but it is largely so.

After some consideration, I think I will restore these tests; they are a bit noisy, but it does make sense to make sure the internal call has the expected content.

expect_null(result$email)
})

test_that("start_dataset_processing includes email in the same request when provided", {

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above: this is testing that we are constructing the request as expected, but that is also in the function that does the request construction.

Comment thread tests/testthat/test-datasets.R Outdated

test_that("remove_dataset_samples removes a project and PUTs", {
captured_req <- NULL
test_that("remove_dataset_samples PUTs", {

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We aren't testing whether the removal works, as this function no longer returns the data inside it (and the mocking that was done didn't really test that anyway!) So the only thing to test is that it made a PUT request.

Comment thread R/datasets.R Outdated
#' * `format`: the dataset file format (e.g. "SINGLE_CELL_EXPERIMENT", "ANN_DATA")
#' * `status`: the processing status — one of "pending", "processing",
#' "succeeded", "failed", or "expired" (see [get_dataset_status()])
#' * `n_samples`: the number of rows in `samples` (one per sample-modality

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think there is either. We could get it with more calls to the API, but as I said, I don't really want to do that.

Comment thread R/datasets.R Outdated
Comment on lines +733 to +734
n_samples = nrow(samples),
n_projects = length(detail$data),

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know that there is a good solution here without more calls to the API, but I will look into it some more.

I was actually unsure whether to include anything at all about merged datasets; at the moment I don't think there is an easy way to get the token used for a website-created dataset, which means this is kind of a problem that won't come up unless or until we add support for adding merged projects.

@jashapiro

Copy link
Copy Markdown
Member Author

After your review I decided that it made sense to have a bit more detail in the dataset info response, so now it goes out and gets the project samples that will be included in the dataset, including both merged and individual samples. It still returns the data in samples dataframe (no library info) but now with info about what modalities are included for each sample.

I also reverted some of the test changes and brought them back to better coverage, including re-expanding the removal test to actually test something useful about the function (that it does actually do the expected removals).

Should be ready for another look.

@jashapiro jashapiro requested a review from sjspielman June 10, 2026 01:18

@sjspielman sjspielman left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me, small comments but don't need to see again. I did various local testing as well with a bunch of different combos of stuff, nothing to flag!

The one bigger, but still on the smaller side, comment I have is - maybe it would be good to drop merged_projects from the get_dataset_info() output for now since it's not something we can currently support. When I see it in that function's output it makes me want to look for how to add one into the dataset, which I of course can't do. I do think it's good to be prepared for the future in case and have the code here, but just maybe not return that field quite yet. Again, I don't feel I need to see this again though and I'm fine with your decision on it.

Comment thread R/datasets.R Outdated
Comment thread R/datasets.R Outdated
Comment on lines +629 to +634
#' `seq_unit` gives the single-cell sequencing unit ("cell" or "nucleus", or `NA` when the
#' sample is not included as single-cell),
#' `has_spatial` marks spatial inclusion
#' `has_bulk` reflects the project's `includes_bulk` request
#' intersected with whether the sample actually has bulk data.
#' `has_cite_seq` and `has_multiplexed` come from the sample records.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible to make these actual bullets? The includes_bulk phrasing isn't super clear either, but not sure what to suggest

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will do

Comment thread R/datasets.R Outdated
Comment thread R/datasets.R Outdated
#' * `n_samples`: the total number of samples in the dataset, taken from the
#' API's `total_sample_count`
#' * `n_projects`: the number of projects in the dataset
#' * `samples`: a data frame with one row per included sample and the following columns:

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can I lobby to call this samples_df? At a quick glance I would assume it's a vector of ids, but that's not what it is at all!

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about sample_info?

jashapiro and others added 3 commits June 11, 2026 10:11
Co-authored-by: Stephanie J. Spielman <stephanie.spielman@gmail.com>
update tests to match (and be more conservative on the slot name)
@jashapiro jashapiro merged commit 3d12fbf into main Jun 11, 2026
4 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add get_dataset_info() function

2 participants