New compasUtils file#1467
Conversation
✅ COMPAS Build Successful!
Detailed Evolution PlotGenerated by COMPAS CI |
ilyamandel
left a comment
There was a problem hiding this comment.
I leave the review to folks who use python (@SimonStevenson ? @avigna ? ).
Consider including supported python tools in the documentation in https://compas.readthedocs.io/en/latest/pages/User%20guide/Post-processing/post-processing.html
|
Looks cool! I would suggest adding some more doctrings at the top, adding some pytests to ensure that this is maintained/doesn't break in the future. Also, how is this different from h5view? Can they be merged together? Finally, if you can add it to the setup.py entry points, its quite easy to get the auto-documentation tools to build docs like so I have some time, im happy to help with this if youd like :) |
|
As I suggested in @ilyamandel's post on Slack, I think it might be useful to add type hinting and docstring styling (similar to numpy) to bring the python code style closer to what we have for C++ files. Perhaps we can start here by just adding type hints? |
| ######################################################################## | ||
| # ## Function to print the data from a given COMPAS HDF5 group in a readable pandas template | ||
|
|
||
| def printCompasDetails(data, *seeds, mask=()): |
There was a problem hiding this comment.
Do I understand correctly that this function name might be a bit misleading? Outside the jupyter environment it would just return the data frame, not print it.
I also wonder if *seeds could be replaced for an input list, it might be more readable for large sets of seeds.
There was a problem hiding this comment.
Hey, finally getting around to this. The *seeds input is particularly handy because you can supply a single seed, multiple comma-separated seeds, or a list / array of seeds, so it's very flexible in this format. But I'm happy to make that clearer in the docstrings.
| mask = np.ones_like(allSeeds).astype(bool) | ||
| mask &= seedsMask | ||
|
|
||
| df = pd.DataFrame.from_dict({param: data[param][()][mask] for param in list(data.keys())}).set_index(seedVariableName).T |
There was a problem hiding this comment.
There are a couple of values being repeated, e.g., ('SEED' in list_of_keys) and list(data.keys())
| else: # No seed parameter, so do custom print for Run Details | ||
|
|
||
| # Get just the keys without the -Derivation suffix - those will be a second column | ||
| keys_not_derivations = [] |
There was a problem hiding this comment.
Can be swapped for a one line definition without affecting readability, e.g., keys_not_derivations = [x for x in key_list if "-Derivation" not in x] or similar
|
@reinhold-willcox -- please see the comments above. |
|
Ya, I'll take care of it (Nicolas has also written to me directly with some pointers). This is a fairly low priority for me, so I'll try to address this after the semester ends. |
|
@reinhold-willcox , any progress on this PR? |
|
@ilyamandel, sorry, haven't had a chance to dive into this yet, but I should have a window in the next few weeks. |
nrsegovia
left a comment
There was a problem hiding this comment.
@reinhold-willcox it all looks great, I just left a couple of comments/suggestions. I'm still not entirely convinced by *seeds but it seems to work just fine, so I won't dwell on that. I do wonder why some sections of the code seem to prefer lists over numpy arrays, though. And as a final style-related comment, the PEP8 has some recommendations regarding line length:
Limit all lines to a maximum of 79 characters.
For flowing long blocks of text with fewer structural restrictions (docstrings or comments), the line length should be limited to 72 characters.
(I have never cared much about this until I recently started adding a length limit to my own code, and it looks much tidier)
|
|
||
| Examples | ||
| -------- | ||
| >>> mt_data = h5.File('COMPAS_Output.h5')['BSE_RLOF']) |
There was a problem hiding this comment.
I think that ) at the end should be removed
| >>> print_compas_details_dataframe(mt_data, mt_seeds[:50], mask=cee_events) | ||
| [output of all Common Envelope events occuring in the first 50 seeds] | ||
| """ | ||
| list_of_keys = list(data.keys()) |
There was a problem hiding this comment.
This might be unnecessary. It seems that most of the time you do x in list_of_keys, but x in data would produced the same behavior if I remember correctly
| # If `seeds` or `mask` arguments supplied, create the relevant mask | ||
| all_seeds = data[seed_variable_name][()] | ||
| seeds_mask = np.isin(all_seeds, seeds) | ||
| if len(seeds) == 0: # If `seeds` argument is not supplied, set the default mask |
There was a problem hiding this comment.
seeds does not have a default value, so shouldn't it always be supplied? Otherwise you would get an error
There was a problem hiding this comment.
*seeds works the same as *args, so they are all optional. Not sure what the best way to indicate this is, but it is meant to run fine if it is not supplied.
| keys_not_derivations = [key for key in list_of_keys if '-Derivation' not in key] # Get just the keys without the -Derivation suffix | ||
|
|
||
| # Some parameter values are string types, formatted as np.bytes_, need to convert back | ||
| def convert_strings(param_array): |
There was a problem hiding this comment.
I wonder if this must be defined within the function. I think that at some point you (or the user) could need this utility, so exposing it would be better. Also, I'm being picky but perhaps you could change the check from isinstance to something like np.issubdtype(param_array.dtype, np.bytes_), though this assumes that param_array is a numpy array (I'm not entirely sure it is the case). This way you wouldn't check the first element, but the array itself.
| if mask == (): | ||
| mask = np.ones_like(all_seeds).astype(bool) | ||
| mask &= seeds_mask | ||
| df = pd.DataFrame.from_dict({param: data[param][()][mask] for param in list_of_keys}).set_index(seed_variable_name).T |
There was a problem hiding this comment.
If the default is to include all seeds, wouldn't this result in a "wide" table? Which also impacts pandas performance. I guess this is required to add units as a string column, but it might be worth to consider returning a separate dictionary with units in case the user wants to check them.
There was a problem hiding this comment.
In my experience, it's extremely valuable to have the units right there in the same dataframe. This is meant to be a debugging tool, so having that information right in sight as you look at the other columns has been really revealing. It does return a very wide table if you include all the seeds from some large run, but then the pandas (or possibly jupyter) width limit in practice replaces many of these with an ellipsis.
| ########################################### | ||
| # ## Produce strings of the event histories | ||
|
|
||
| stellar_type_dict = { |
There was a problem hiding this comment.
I see that this is a dictionary for simplified stellar type names, so perhaps you want to modify the name for something obvious like simplify_types_dict. Not critical, though.
|
|
||
| Notes | ||
| ----- | ||
| Exactly one of data or all_events must be included. If neither, nothign is returned. |
Thanks @nrsegovia . I've gone through and implemented your changes, except where I've raised discussions above. As for lists vs arrays, this often depends on the use case. It's easier to append to lists when you don't know how long they will be at the outset, while the arrays are nicer for speed. If there's a specific place you think I should swap one for the other, I'm happy to take a look. |
h5view is a command line tool, this one is for use in jupyter notebooks. They are similar conceptually in that they both provide a view into the data, but they are very different under the hood so I'm not sure it makes sense to combine them. I'm also not sure what to do for the setup.py functionality. I added the file to the setup.py entry points, but all the other files seem to allow for a command line argument parser, which is not relevant for this file. Can you advise on what I should be putting in main()? |

Added my own compasUtils file, which I use for lots of post-processing and testing. It was previously in
misc/unsupported_utils, so I am hoping to get it elevated to the level of the other utils.This contains two major functions that help a lot with inspecting COMPAS data.
I expect that there will be changes requested, so it is probably not yet ready to be merged in, but happy to hear feedback on how to improve / align these scripts with the other python utilities.