Allow rinclude in collect_results to only select files which satisfy all conditions #413

Open
cnrrobertson opened this issue Apr 16, 2024 · 7 comments
Labels: data, enhancement, running-listing

@cnrrobertson

Is your feature request related to a problem? Please describe.
Currently, in the collect_results function, the rinclude argument allows for specifying filename conditions (OR behavior) and rexclude allows for ignoring filename conditions (NOT behavior). I'm looking for AND behavior. I want the loaded files to satisfy ALL conditions in rinclude rather than satisfy any of them.

As an example of where this may be useful, I'm running simulations with both different time and spatial steps and I only want to select those that have a specific combination of the two.

Describe the solution you'd like
I'm not sure exactly the best way to implement this, but it would probably look like adding a kwarg to collect_results either to modify how it handles the rinclude list or to specify an alternative list to rinclude and rexclude.

Describe alternatives you've considered
Currently, the undesirable results can be removed in a postprocessing step using tools from DataFrames.jl, but that becomes prohibitive with large data because you need to load all of the results and then discard the ones you don't need.

@Datseris added the enhancement, running-listing, and data labels on Apr 17, 2024
@Datseris
Member

It's a valid use case. I am not sure either. Are you sure that there is no way to achieve this already with the given two keywords, using some advanced regex syntax? For example, a regex that matches only when both conditions hold?

@cnrrobertson
Author

There definitely is a way to put together a regex that matches both conditions, but it gets more complicated if you have several parameters. Imagine I have 5 parameters put into the filename with savename; selecting specific options for only some of them then requires a regex that is ordered correctly, etc.

I guess I'm envisioning trying to get the same ease I enjoy with functions like savename and @tagsave only for retrieving the files I'm interested in. Maybe that could be a utility function to generate the complex regex for recovering the files that match the parameters in a @strdict?

@Datseris
Member

Check this out first: https://github.com/jkrumbiegel/ReadableRegex.jl and see if it already provides the function you need.

@cnrrobertson
Author

cnrrobertson commented Apr 18, 2024

Okay, you were right, there is a regex that can match all the conditions (without worrying about order). If I wanted to match all files where dt=0.001, dx=0.01, and ic=flat or ic=round for example, I could use the following regex string:

r"(?=.*dt=0\.001.*)(?=.*dx=0\.01.*)(?=.*ic=(flat|round).*)^.*$"

It uses "lookaheads" (which ReadableRegex unfortunately doesn't have). It's also a bit of a mess, so I wrote a function to generate these from dictionaries created by @strdict:

using DrWatson  # provides @strdict

dt = 0.001
dx = 0.01
ic = ["flat", "round"]
desc = @strdict dt dx ic

# Escape periods
format_v(v) = replace(string(v), "." => "\\.")

# Make "or" statements for vectors in dictionary
function format_v(v::Vector{T}) where T
    formatted_v = [format_v(v_i) for v_i in v]
    return "($(join(formatted_v, "|")))"
end

# Generate the regex string
function strdict_regex(d)
    query = ""
    for (k,v) in d
        query *= "(?=.*$k=$(format_v(v)))"
    end
    query *= "^.*\$"
    return Regex(query)
end
strdict_regex(desc)
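
As a quick check of the order-independence claim, the hand-written lookahead regex above can be tried against a few filenames. The filenames below are made up for illustration and only need Base Julia; this is a minimal sketch, not output from a real project.

```julia
# The lookahead regex from above: each (?=...) must succeed from the start of the string,
# so the parameters may appear in any order within the filename.
rx = r"(?=.*dt=0\.001.*)(?=.*dx=0\.01.*)(?=.*ic=(flat|round).*)^.*$"

# Hypothetical filenames, invented for this example.
files = [
    "sim_dt=0.001_dx=0.01_ic=flat.jld2",   # all three conditions hold
    "sim_ic=round_dx=0.01_dt=0.001.jld2",  # same conditions, different order
    "sim_dt=0.001_dx=0.02_ic=flat.jld2",   # dx differs
    "sim_dt=0.01_dx=0.01_ic=square.jld2",  # dt and ic differ
]

# Keep only the filenames for which every lookahead succeeds.
matched = filter(f -> occursin(rx, f), files)
```

One caveat with this style of pattern: `dx=0\.01` matches as a prefix, so a file containing `dx=0.015` would also be accepted unless the pattern is anchored on the separator that follows the value.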

For me, the spirit of DrWatson is that these kinds of convenience functions are built in to make processing easier. So, this regex generator could be a provided utility, or a dictionary from @strdict could be passed directly into collect_results to do this processing under the hood. Does that seem within the scope of the package? I could put in a draft PR if so.

@Datseris
Member

This is definitely within scope.

However, I believe I do not understand what you mean by "a dictionary from @strdict could be passed". I don't follow :D you need to illustrate it for me via an example. Like, what would the keyword to collect_results be? And what would it do?

Perhaps the simplest would be to add a new keyword, rallinclude, which cannot be combined with rinclude. So if rallinclude is given, rinclude must be empty, and vice versa. This rallinclude does exactly what you want (matches when ALL conditions are satisfied).
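
To make the proposed semantics concrete, here is a minimal sketch of the filename filtering such a keyword might perform. The helper name `select_files` is hypothetical and this is not DrWatson's implementation; only the keyword names `rinclude`, `rexclude`, and the proposed `rallinclude` come from the discussion.

```julia
# Hypothetical helper (not part of DrWatson): decide which filenames to keep.
function select_files(files; rinclude = Regex[], rallinclude = Regex[], rexclude = Regex[])
    # The two inclusion modes are mutually exclusive, as suggested above.
    if !isempty(rinclude) && !isempty(rallinclude)
        throw(ArgumentError("give either `rinclude` or `rallinclude`, not both"))
    end
    return filter(files) do f
        # OR: at least one `rinclude` pattern matches (an empty list accepts everything)
        ok_any = isempty(rinclude) || any(r -> occursin(r, f), rinclude)
        # AND: every `rallinclude` pattern matches (vacuously true when empty)
        ok_all = all(r -> occursin(r, f), rallinclude)
        # NOT: no `rexclude` pattern matches
        ok_not = !any(r -> occursin(r, f), rexclude)
        ok_any && ok_all && ok_not
    end
end

# Made-up filenames for illustration.
files = ["a_dt=0.001_dx=0.01.jld2", "a_dt=0.001_dx=0.02.jld2", "a_dt=0.01_dx=0.01.jld2"]

both   = select_files(files; rallinclude = [r"dt=0\.001", r"dx=0\.01"])  # AND behavior
either = select_files(files; rinclude = [r"dt=0\.001", r"dx=0\.01"])     # OR behavior
```

With a list of simple patterns combined by `all`, no lookaheads are needed and the order of parameters in the filename does not matter.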

@cnrrobertson
Author

Ah yes. Sorry for the confusion. What I meant is that you could have a version of collect_results that accepts a Dict along with rinclude and it would collect all results pertaining to the parameters in the dictionary and any other conditions in rinclude.

That seems easier to me because I can use the same procedure to save and retrieve the data.

For example to generate data, I have something like the following:

dt = 0.001
dx = 0.01
ic = ["flat", "round"]
params = @strdict dt dx ic

# Run simulation
results = simulation(params)

# Save
sim_name = savename("my_simulation", params)
@tagsave(datadir(sim_name*".jld2"), results)

Then to retrieve it, I could have:

dt = 0.001
dx = 0.01
ic = ["flat", "round"]
params = @strdict dt dx ic

# Collect data
sim_name = savename("my_simulation", params)
results = collect_results(datadir(); rinclude=[r"my_simulation"], params=params)

To me that seems so clean.

What I am currently doing to circumvent the issue is:

dt = 0.001
dx = 0.01
ic = ["flat", "round"]
params = @strdict dt dx ic

# Collect data
sim_name = savename("my_simulation", params)
params_regex = Regex("(?=.*my_simulation.*)" * strdict_regex(params).pattern)
results = collect_results(datadir(); rinclude=[params_regex])

@Datseris
Member

Right, but what would happen inside collect_results with params? What would it actually do?

Yes, I agree that increasing the communication between savename and collect_results would be great. If you can put together a PR, that would be nice, because there we can discuss a concrete code implementation instead of the current situation where only the input is specified.
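
As one possible starting point for that discussion, a `params` dictionary could be turned into one regex per parameter and a file kept only if all of them match. This is only a sketch under assumptions: the helper names `escape_value`, `param_regex`, and `matches_params` are hypothetical, a plain `Dict` stands in for the output of @strdict, and the `key=value` layout assumes filenames written in the style shown earlier in the thread.

```julia
# Hypothetical helpers, not part of DrWatson.

# Escape periods so that "0.001" is matched literally.
escape_value(v) = replace(string(v), "." => "\\.")
# A vector value means "any of these values".
escape_value(v::AbstractVector) = "(" * join(escape_value.(v), "|") * ")"

# One regex per parameter, e.g. "dt" => 0.001 becomes r"dt=0\.001".
param_regex(k, v) = Regex("$(k)=" * escape_value(v))

# Keep a filename only if every parameter's regex matches it (AND behavior).
matches_params(fname, params) =
    all(occursin(param_regex(k, v), fname) for (k, v) in params)

# Plain Dict standing in for `@strdict dt dx ic`.
params = Dict("dt" => 0.001, "dx" => 0.01, "ic" => ["flat", "round"])

# Made-up filenames for illustration.
files = [
    "my_simulation_dt=0.001_dx=0.01_ic=round.jld2",
    "my_simulation_dt=0.001_dx=0.1_ic=flat.jld2",
    "my_simulation_dt=0.002_dx=0.01_ic=flat.jld2",
]

selected = filter(f -> matches_params(f, params), files)
```

A real implementation would also need to format values the same way the filenames were written (rounding, scientific notation), which `string(v)` does not guarantee.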

2 participants