Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

collect_results handles .jld2 with groups incorrectly #396

Open
xukai92 opened this issue Sep 4, 2023 · 4 comments
Open

collect_results handles .jld2 with groups incorrectly #396

xukai92 opened this issue Sep 4, 2023 · 4 comments
Labels
discussion Let's talk about how things should be! saving-files Functionality for saving files

Comments

@xukai92
Copy link
Contributor

xukai92 commented Sep 4, 2023

Describe the bug
When config is saved as JLD2 format and has groups, collect_results cannot properly collect the results into a data frame

Minimal Working Example

config = Dict("a/x" => 1, "a/y" => 2, "a/z" => 3, "b/x" => "one", "b/y" => "two")
save("example.jld2", config)
collect_results(".")

gives
image

the root cause is the use of jldopen in function to_data_row(file::File{format"JLD2"}; kwargs...) that results in the grouped dict:

cjld2 = JLD2.jldopen("example.jld2")

image
and the keys on it is the groups

keys(cjld2)

gives

2-element Vector{String}:
 "a"
 "b"

comparing to loading with wload

wload("example.jld2")

which gives

Dict{String, Any} with 5 entries:
  "a/x" => 1
  "a/z" => 3
  "b/y" => "two"
  "a/y" => 2
  "b/x" => "one"

I think we should allow the user to control whether or not jldopen is used.
Any suggestion how to implement this?
I'm happy to raise a PR once we have an agreed solution.


  • Julia 1.9.3
  • DrWatson v2.12.7
@Datseris
Copy link
Member

Datseris commented Sep 5, 2023

perhaps @JonasIsensee has an idea for here?

@Datseris Datseris added discussion Let's talk about how things should be! saving-files Functionality for saving files labels Sep 5, 2023
@JonasIsensee
Copy link
Member

JonasIsensee commented Sep 6, 2023

I think we should allow the user to control whether or not jldopen is used.
Any suggestion how to implement this?
I'm happy to raise a PR once we have an agreed solution.

I believe the first question to answer is,
@xukai92 : What is the behaviour that you would like to see?

Should the dataframe columns be ["a/x", "a/z", ....] ?
Should the columns be nested dictionaries themselves?

edit: just to clarify, what you are getting is obviously not desirable but I just don't know what behaviour would be good.

@xukai92
Copy link
Contributor Author

xukai92 commented Sep 6, 2023

i have a temp fix on my side as below

import DrWatson: to_data_row

function to_data_row(file::File{format"JLD2"}; jldopen=false, kwargs...)
    fpath = filename(file)
    if jldopen
        @debug "Opening $fpath with jldopen."
        JLD2.jldopen(fpath, "r") do data
            return to_data_row(data, fpath; kwargs...)
        end
    else
        @debug "Opening $fpath with wload."
        return to_data_row(wload(fpath), fpath; kwargs...)
    end
end

when jldopen=false, it doesn't use jldopen and simply make a dataframe with flattened columns, i.e. ["a/x", "a/y", "a/z", "b/x", "b/y"], which is my expected behavior

@JonasIsensee
Copy link
Member

The problem with the standard implementation is default keyword argument white_list = keys(data) .
The whole point of the JLD2.JLDFile handle being passed around was to avoid loading all the data (which might be too large and hence blacklisted anyway..)
Here's a different idea, (that uses all kinds of JLD2 internals....):
This piece of code steps through the JLD2 file to accumulate a white_list with all proper datasets.
It's ugly, but I suppose the functionality could also be added to JLD2 directly.

function DrWatson.to_data_row(file::File{format"JLD2"}; white_list=nothing, kwargs...)
           fpath = filename(file)
           @debug "Opening $(filename(file)) with jldopen."
           JLD2.jldopen(filename(file), "r") do f
               if isnothing(white_list)
                    candidates = keys(f)
                    white_list = String[]
                    while !isempty(candidates)
                           c = popfirst!(candidates)
                           g, dset = JLD2.pathize(f.root_group, c, false)
                           haskey(g, dset) || continue
                           if JLD2.isgroup(f, JLD2.lookup_offset(g, dset))
                                  append!(candidates, Ref(c*"/") .*keys(g[dset]))
                           else
                                  push!(white_list, c)
                           end
                   end 
               end
               return DrWatson.to_data_row(f, fpath; white_list, kwargs...)
           end
       end

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
discussion Let's talk about how things should be! saving-files Functionality for saving files
Projects
None yet
Development

No branches or pull requests

3 participants