Add get_dataset_info() function#53
Conversation
sjspielman
left a comment
There was a problem hiding this comment.
The changes look fine to me overall, although there's a few spots I think we can be more clear about what we are returning. Most of my comments are about changes to tests which have been pared down a lot and it's not always clear to me why.
| #' * `format`: the dataset file format (e.g. "SINGLE_CELL_EXPERIMENT", "ANN_DATA") | ||
| #' * `status`: the processing status — one of "pending", "processing", | ||
| #' "succeeded", "failed", or "expired" (see [get_dataset_status()]) | ||
| #' * `n_samples`: the number of rows in `samples` (one per sample-modality |
There was a problem hiding this comment.
Is there info about the number of libraries that we can pull out and return here as well? I kind of don't think there is, but if there is it would be good.
There was a problem hiding this comment.
I don't think there is either. We could get it with more calls to the API, but as I said, I don't really want to do that.
| n_samples = nrow(samples), | ||
| n_projects = length(detail$data), |
There was a problem hiding this comment.
The way this is set up, it seems that n_samples will not included counts from a merged object but n_project will. So you could end up with something like n_samples = 0 and n_projects = 1 which seems really confusing. I understand why you wouldn't want to include for merged in samples because of how the data is delivered, so what if we have another field here n_samples_merged or so to represent the number of samples that are delivered in merged objects? The point is, can we report both?
There was a problem hiding this comment.
I don't know that there is a good solution here without more calls to the API, but I will look into it some more.
I was actually unsure whether to include anything at all about merged datasets; at the moment I don't think there is an easy way to get the token used for a website-created dataset, which means this is kind of a problem that won't come up unless or until we add support for adding merged projects.
| expect_null(result$email) | ||
| }) | ||
|
|
||
| test_that("start_dataset_processing includes email in the same request when provided", { |
There was a problem hiding this comment.
Why was this removed? It looks like functionality is still there
There was a problem hiding this comment.
Same as above: this is testing that we are constructing the request as expected, but that is also in the function that does the request construction.
| SCPCP000001 = list( | ||
| SINGLE_CELL = list("SCPCS000001", "SCPCS000002"), | ||
| SPATIAL = list(), | ||
| includes_bulk = FALSE | ||
| ), | ||
| SCPCP000002 = list( | ||
| SINGLE_CELL = list("SCPCS000003"), | ||
| SPATIAL = list("SCPCS000003"), |
There was a problem hiding this comment.
I suspect this is still coming in a future PR but it would really be great for these to at least somewhat reflect reality or be fake. It's a test so it doesn't matter much, but it's enough to make me do a biggg double take.
There was a problem hiding this comment.
I'm not sure what you are reacting to here? My guess from the second part is that you are upset about the ids not really being spatial? I don't really disagree, but also I don't think there should be any expectation that test values reflect reality except in ways that affect code (e.g., format here).
There was a problem hiding this comment.
That's what I was reacting to; that id isn't spatial and also samples 1-3 are all in project 1. As you said it doesn't actually matter, but I had a "wait, what?" moment along the way. The code will be tested fine, but my brain cracks just a smidge. Nothing really needs to change though, this is something to just get over as, once again, doesn't matter!
| "previously failed to process" | ||
| ) | ||
| expect_equal(captured_req$method, "PUT") | ||
| expect_true(captured_req$body$data$start) |
There was a problem hiding this comment.
Why was this removed? Another instance below too.
| }) | ||
| }) | ||
|
|
||
| test_that("get_ccdl_datasets passes project_id as ccdl_project_id query parameter", { |
There was a problem hiding this comment.
I'm confused why these tests are gone because it doesn't look like there are any changes to this function.
There was a problem hiding this comment.
My main thought is that most of these tests never should have been here, I just didn't catch them when I wasn't looking closely at tests. When you look at the mock functions and the actual tests, you can see that they are really mostly testing that the argument is passed to the _perform function, but that is already tested as part of the scpca_request() function more directly. It's not that it is a completely redundant test, but it is largely so.
After some consideration, I think I will restore these tests; they are a bit noisy, but it does make sense to make sure the internal call has the expected content.
|
|
||
| test_that("remove_dataset_samples removes a project and PUTs", { | ||
| captured_req <- NULL | ||
| test_that("remove_dataset_samples PUTs", { |
There was a problem hiding this comment.
I'm confused about this change - if we are testing removes, why would we start with an empty dataset list?
There was a problem hiding this comment.
We aren't testing whether the removal works, as this function no longer returns the data inside it (and the mocking that was done didn't really test that anyway!) So the only thing to test is that it made a PUT request.
jashapiro
left a comment
There was a problem hiding this comment.
Thanks for the review! After thinking about it a few days, I think I will restore many of the tests; I had thought many of the ones I had removed were relatively redundant, as the internal functions are being tested more directly, but I can make a case that we should continue to test that the internal function is hooked up correctly, and allows us to make sure that changes to underlying functions don't affect the callers.
I'm less certain what I am going to do about merged projects. The fact that we can't create datasets with merged projects at the moment makes me not want to worry about the details, but I also don't want things to be confusing.
| SCPCP000001 = list( | ||
| SINGLE_CELL = list("SCPCS000001", "SCPCS000002"), | ||
| SPATIAL = list(), | ||
| includes_bulk = FALSE | ||
| ), | ||
| SCPCP000002 = list( | ||
| SINGLE_CELL = list("SCPCS000003"), | ||
| SPATIAL = list("SCPCS000003"), |
There was a problem hiding this comment.
I'm not sure what you are reacting to here? My guess from the second part is that you are upset about the ids not really being spatial? I don't really disagree, but also I don't think there should be any expectation that test values reflect reality except in ways that affect code (e.g., format here).
| }) | ||
| }) | ||
|
|
||
| test_that("get_ccdl_datasets passes project_id as ccdl_project_id query parameter", { |
There was a problem hiding this comment.
My main thought is that most of these tests never should have been here, I just didn't catch them when I wasn't looking closely at tests. When you look at the mock functions and the actual tests, you can see that they are really mostly testing that the argument is passed to the _perform function, but that is already tested as part of the scpca_request() function more directly. It's not that it is a completely redundant test, but it is largely so.
After some consideration, I think I will restore these tests; they are a bit noisy, but it does make sense to make sure the internal call has the expected content.
| expect_null(result$email) | ||
| }) | ||
|
|
||
| test_that("start_dataset_processing includes email in the same request when provided", { |
There was a problem hiding this comment.
Same as above: this is testing that we are constructing the request as expected, but that is also in the function that does the request construction.
|
|
||
| test_that("remove_dataset_samples removes a project and PUTs", { | ||
| captured_req <- NULL | ||
| test_that("remove_dataset_samples PUTs", { |
There was a problem hiding this comment.
We aren't testing whether the removal works, as this function no longer returns the data inside it (and the mocking that was done didn't really test that anyway!) So the only thing to test is that it made a PUT request.
| #' * `format`: the dataset file format (e.g. "SINGLE_CELL_EXPERIMENT", "ANN_DATA") | ||
| #' * `status`: the processing status — one of "pending", "processing", | ||
| #' "succeeded", "failed", or "expired" (see [get_dataset_status()]) | ||
| #' * `n_samples`: the number of rows in `samples` (one per sample-modality |
There was a problem hiding this comment.
I don't think there is either. We could get it with more calls to the API, but as I said, I don't really want to do that.
| n_samples = nrow(samples), | ||
| n_projects = length(detail$data), |
There was a problem hiding this comment.
I don't know that there is a good solution here without more calls to the API, but I will look into it some more.
I was actually unsure whether to include anything at all about merged datasets; at the moment I don't think there is an easy way to get the token used for a website-created dataset, which means this is kind of a problem that won't come up unless or until we add support for adding merged projects.
|
After your review I decided that it made sense to have a bit more detail in the dataset info response, so now it goes out and gets the project samples that will be included in the dataset, including both merged and individual samples. It still returns the data in I also reverted some of the test changes and brought them back to better coverage, including re-expanding the removal test to actually test something useful about the function (that it does actually do the expected removals). Should be ready for another look. |
sjspielman
left a comment
There was a problem hiding this comment.
This looks good to me, small comments but don't need to see again. I did various local testing as well with a bunch of different combos of stuff, nothing to flag!
The one bigger, but still on the smaller side, comment I have is - maybe it would be good to drop merged_projects from the get_dataset_info() output for now since it's not something we can currently support. When I see it in that function's output it makes me want to look for how to add one into the dataset, which I of course can't do. I do think it's good to be prepared for the future in case and have the code here, but just maybe not return that field quite yet. Again, I don't feel I need to see this again though and I'm fine with your decision on it.
| #' `seq_unit` gives the single-cell sequencing unit ("cell" or "nucleus", or `NA` when the | ||
| #' sample is not included as single-cell), | ||
| #' `has_spatial` marks spatial inclusion | ||
| #' `has_bulk` reflects the project's `includes_bulk` request | ||
| #' intersected with whether the sample actually has bulk data. | ||
| #' `has_cite_seq` and `has_multiplexed` come from the sample records. |
There was a problem hiding this comment.
Is it possible to make these actual bullets? The includes_bulk phrasing isn't super clear either, but not sure what to suggest
| #' * `n_samples`: the total number of samples in the dataset, taken from the | ||
| #' API's `total_sample_count` | ||
| #' * `n_projects`: the number of projects in the dataset | ||
| #' * `samples`: a data frame with one row per included sample and the following columns: |
There was a problem hiding this comment.
Can I lobby to call this samples_df? At a quick glance I would assume it's a vector of ids, but that's not what it is at all!
There was a problem hiding this comment.
How about sample_info?
Co-authored-by: Stephanie J. Spielman <stephanie.spielman@gmail.com>
update tests to match (and be more conservative on the slot name)
Closes #36 (which had been erroneously closed before)
Sorry, one more code change before docs!
Here I am adding a
get_dataset_info()function, that returns a subset of the information inget_dataset_detail()(a non-exported function) in what is hopefully a more digestible form.The return value is still a list (I did not want to get into objects), but with a much smaller set of components:
id: the dataset idSINGLE_CELL_EXPERIMENTorANN_DATAI am quite open to changing the format and contents of this return value. Right now it is limited by me not wanting to make extra queries for each sample, so I only have the info that was in the dataset detail we get from the query. If we are willing to make multiple calls to the API, we could get more info for each sample. (This part of the code is in a helper function:
make_dataset_df())That was the core change for the issue, but I made another change that made this from a small PR into a kind of big one: I changed all the return values from public functions from the dataset detail we get from the API to just returning the dataset id. I think that makes things a lot simpler, and avoids people expecting the dataset they had stored to be up-to-date, which it often will not be if they did not capture outputs. This means that there are a lot of docs updates as well.
I also consolidated the status translation into its own function, since there are now two functions that do that. I did not translate the format info, but I might be willing to do that as well so things are a bit prettier.