Skip to content

New compasUtils file#1467

Open
reinhold-willcox wants to merge 4 commits into
devfrom
compas_post_processing_utils
Open

New compasUtils file#1467
reinhold-willcox wants to merge 4 commits into
devfrom
compas_post_processing_utils

Conversation

@reinhold-willcox

Copy link
Copy Markdown
Collaborator

Added my own compasUtils file, which I use for lots of post-processing and testing. It was previously in misc/unsupported_utils, so I am hoping to get it elevated to the level of the other utils.

This contains two major functions that help a lot with inspecting COMPAS data.

  • printCompasDetails: takes the hdf5 output (optionally with specific seeds or masks) and prints a pandas df of the whole output (including units). Makes it visually much easier to analyze the output from a jupyter notebook.
  • getEventStrings: takes the hdf5 output and generates a string per seed that summarizes the events in the binary. Events include SMT, CEE, SNe, and mergers, and these are listed chronologically with a funny but simple syntax (described in the function). Combined with np.unique(), this is a great way to quickly see which binary channels dominate and which are rare.

I expect that there will be changes requested, so it is probably not yet ready to be merged in, but happy to hear feedback on how to improve / align these scripts with the other python utilities.

@github-actions

Copy link
Copy Markdown

✅ COMPAS Build Successful!

Item Value
Commit 98df732
Logs View workflow

Detailed Evolution Plot

Click to view evolution plot


Generated by COMPAS CI

@ilyamandel ilyamandel left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I leave the review to folks who use python (@SimonStevenson ? @avigna ? ).

Consider including supported python tools in the documentation in https://compas.readthedocs.io/en/latest/pages/User%20guide/Post-processing/post-processing.html

@avivajpeyi

avivajpeyi commented Apr 3, 2026

Copy link
Copy Markdown
Collaborator

Looks cool! I would suggest adding some more doctrings at the top, adding some pytests to ensure that this is maintained/doesn't break in the future.

Also, how is this different from h5view?

Can they be merged together?

Finally, if you can add it to the setup.py entry points, its quite easy to get the auto-documentation tools to build docs like so

https://compas.readthedocs.io/en/latest/pages/User%20guide/Post-processing/hdf5/post-processing-h5view.html

I have some time, im happy to help with this if youd like :)

@nrsegovia

Copy link
Copy Markdown
Collaborator

As I suggested in @ilyamandel's post on Slack, I think it might be useful to add type hinting and docstring styling (similar to numpy) to bring the python code style closer to what we have for C++ files.

Perhaps we can start here by just adding type hints?

Comment thread compas_python_utils/compasUtils.py Outdated
########################################################################
# ## Function to print the data from a given COMPAS HDF5 group in a readable pandas template

def printCompasDetails(data, *seeds, mask=()):

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do I understand correctly that this function name might be a bit misleading? Outside the jupyter environment it would just return the data frame, not print it.

I also wonder if *seeds could be replaced for an input list, it might be more readable for large sets of seeds.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey, finally getting around to this. The *seeds input is particularly handy because you can supply a single seed, multiple comma-separated seeds, or a list / array of seeds, so it's very flexible in this format. But I'm happy to make that clearer in the docstrings.

Comment thread compas_python_utils/compasUtils.py Outdated
mask = np.ones_like(allSeeds).astype(bool)
mask &= seedsMask

df = pd.DataFrame.from_dict({param: data[param][()][mask] for param in list(data.keys())}).set_index(seedVariableName).T

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are a couple of values being repeated, e.g., ('SEED' in list_of_keys) and list(data.keys())

Comment thread compas_python_utils/compasUtils.py Outdated
else: # No seed parameter, so do custom print for Run Details

# Get just the keys without the -Derivation suffix - those will be a second column
keys_not_derivations = []

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can be swapped for a one line definition without affecting readability, e.g., keys_not_derivations = [x for x in key_list if "-Derivation" not in x] or similar

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

agreed

@ilyamandel

Copy link
Copy Markdown
Collaborator

@reinhold-willcox -- please see the comments above.

@reinhold-willcox

Copy link
Copy Markdown
Collaborator Author

Ya, I'll take care of it (Nicolas has also written to me directly with some pointers).

This is a fairly low priority for me, so I'll try to address this after the semester ends.

@ilyamandel

Copy link
Copy Markdown
Collaborator

@reinhold-willcox , any progress on this PR?

@reinhold-willcox

Copy link
Copy Markdown
Collaborator Author

@ilyamandel, sorry, haven't had a chance to dive into this yet, but I should have a window in the next few weeks.

@nrsegovia nrsegovia left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@reinhold-willcox it all looks great, I just left a couple of comments/suggestions. I'm still not entirely convinced by *seeds but it seems to work just fine, so I won't dwell on that. I do wonder why some sections of the code seem to prefer lists over numpy arrays, though. And as a final style-related comment, the PEP8 has some recommendations regarding line length:

Limit all lines to a maximum of 79 characters.
For flowing long blocks of text with fewer structural restrictions (docstrings or comments), the line length should be limited to 72 characters.

(I have never cared much about this until I recently started adding a length limit to my own code, and it looks much tidier)

Comment thread compas_python_utils/debugging_utils.py Outdated

Examples
--------
>>> mt_data = h5.File('COMPAS_Output.h5')['BSE_RLOF'])

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that ) at the end should be removed

Comment thread compas_python_utils/debugging_utils.py Outdated
>>> print_compas_details_dataframe(mt_data, mt_seeds[:50], mask=cee_events)
[output of all Common Envelope events occuring in the first 50 seeds]
"""
list_of_keys = list(data.keys())

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This might be unnecessary. It seems that most of the time you do x in list_of_keys, but x in data would produced the same behavior if I remember correctly

Comment thread compas_python_utils/debugging_utils.py Outdated
# If `seeds` or `mask` arguments supplied, create the relevant mask
all_seeds = data[seed_variable_name][()]
seeds_mask = np.isin(all_seeds, seeds)
if len(seeds) == 0: # If `seeds` argument is not supplied, set the default mask

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

seeds does not have a default value, so shouldn't it always be supplied? Otherwise you would get an error

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

*seeds works the same as *args, so they are all optional. Not sure what the best way to indicate this is, but it is meant to run fine if it is not supplied.

Comment thread compas_python_utils/debugging_utils.py Outdated
keys_not_derivations = [key for key in list_of_keys if '-Derivation' not in key] # Get just the keys without the -Derivation suffix

# Some parameter values are string types, formatted as np.bytes_, need to convert back
def convert_strings(param_array):

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if this must be defined within the function. I think that at some point you (or the user) could need this utility, so exposing it would be better. Also, I'm being picky but perhaps you could change the check from isinstance to something like np.issubdtype(param_array.dtype, np.bytes_), though this assumes that param_array is a numpy array (I'm not entirely sure it is the case). This way you wouldn't check the first element, but the array itself.

Comment thread compas_python_utils/debugging_utils.py Outdated
if mask == ():
mask = np.ones_like(all_seeds).astype(bool)
mask &= seeds_mask
df = pd.DataFrame.from_dict({param: data[param][()][mask] for param in list_of_keys}).set_index(seed_variable_name).T

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the default is to include all seeds, wouldn't this result in a "wide" table? Which also impacts pandas performance. I guess this is required to add units as a string column, but it might be worth to consider returning a separate dictionary with units in case the user wants to check them.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In my experience, it's extremely valuable to have the units right there in the same dataframe. This is meant to be a debugging tool, so having that information right in sight as you look at the other columns has been really revealing. It does return a very wide table if you include all the seeds from some large run, but then the pandas (or possibly jupyter) width limit in practice replaces many of these with an ellipsis.

Comment thread compas_python_utils/debugging_utils.py Outdated
###########################################
# ## Produce strings of the event histories

stellar_type_dict = {

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see that this is a dictionary for simplified stellar type names, so perhaps you want to modify the name for something obvious like simplify_types_dict. Not critical, though.

Comment thread compas_python_utils/debugging_utils.py Outdated

Notes
-----
Exactly one of data or all_events must be included. If neither, nothign is returned.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo in nothing

@reinhold-willcox

Copy link
Copy Markdown
Collaborator Author

@reinhold-willcox it all looks great, I just left a couple of comments/suggestions. I'm still not entirely convinced by *seeds but it seems to work just fine, so I won't dwell on that. I do wonder why some sections of the code seem to prefer lists over numpy arrays, though. And as a final style-related comment, the PEP8 has some recommendations regarding line length:

Limit all lines to a maximum of 79 characters.
For flowing long blocks of text with fewer structural restrictions (docstrings or comments), the line length should be limited to 72 characters.

(I have never cared much about this until I recently started adding a length limit to my own code, and it looks much tidier)

Thanks @nrsegovia . I've gone through and implemented your changes, except where I've raised discussions above. As for lists vs arrays, this often depends on the use case. It's easier to append to lists when you don't know how long they will be at the outset, while the arrays are nicer for speed. If there's a specific place you think I should swap one for the other, I'm happy to take a look.

@reinhold-willcox

Copy link
Copy Markdown
Collaborator Author

Looks cool! I would suggest adding some more doctrings at the top, adding some pytests to ensure that this is maintained/doesn't break in the future.

Also, how is this different from h5view?

Can they be merged together?

Finally, if you can add it to the setup.py entry points, its quite easy to get the auto-documentation tools to build docs like so

https://compas.readthedocs.io/en/latest/pages/User%20guide/Post-processing/hdf5/post-processing-h5view.html

I have some time, im happy to help with this if youd like :)

h5view is a command line tool, this one is for use in jupyter notebooks. They are similar conceptually in that they both provide a view into the data, but they are very different under the hood so I'm not sure it makes sense to combine them.

I'm also not sure what to do for the setup.py functionality. I added the file to the setup.py entry points, but all the other files seem to allow for a command line argument parser, which is not relevant for this file. Can you advise on what I should be putting in main()?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants