[Question/Feature Request] Getting contents or copying binary blob #460

Open
noamross opened this issue Nov 2, 2023 · 2 comments

Comments

noamross commented Nov 2, 2023

I am trying to use git2r to extract individual files from repository history so that I can compare R objects across that history (for instance, comparing model performance across different versions of models saved as .rds files in the repo). In some cases I am extracting subdirectories, and so working through the git_tree recursively. I do not want to overwrite the working copies of these files, but to copy them to a location of my choosing.

I am able to use git2r::content() to read individual text-file blobs, which can then be written to files. However, it returns NA if the blob is binary. I would like to be able to either (a) return a raw vector of binary data from git2r::content(), or (b) copy the blobs to files directly without reading them in, perhaps by having a version of the C function blob_content_to_file exposed to the R API. The latter would be more efficient as it avoids the read-write cycle into R, though I think the former would be easiest to implement.

I may be able to implement the latter as a PR but my C skills are limited. If I can and you are interested, would you prefer content() to return a raw vector for binary data, or for content() to be type-stable and a separate content_raw() function to be used for binary files?

Alternatively, there may be a way to do this with checkout() or another function that I've missed, but I've not figured it out.

Thanks for this excellent and long-maintained package!

Member

stewid commented Nov 3, 2023

Hi, thanks for your suggestion. I added an argument raw to the content function; when it is set, the blob content is returned as a raw vector. Could you please check out the branch raw-blob-content and see whether that solves your use-case? Or maybe your suggestion (b), to use blob_content_to_file, is better? Or should it also be added, to facilitate various use-cases?

Author

noamross commented Nov 3, 2023

Wow, what a turnaround! Yes, running content(..., raw = TRUE) on the raw-blob-content branch solves my use-case. Thanks!

While blob_content_to_file might be more elegant in some cases, I think the cases where its efficiency would be needed are quite niche. It would be faster in cases where (a) large binary files are stored in the git repository, and/or (b) one is extracting many files, such as a script where one is getting every version of a file or files out of git history.
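
For readers who land here with the same problem, the following is a minimal sketch of the content(..., raw = TRUE) approach discussed above. It assumes a git2r build that includes the raw argument from the raw-blob-content branch. The throwaway repository, the file name model.rds, and the example object are hypothetical and used only for illustration. It is not necessarily how the maintainers would recommend structuring this.

```r
library(git2r)

# Build a throwaway repository so the example is self-contained.
repo_dir <- tempfile("repo-")
dir.create(repo_dir)
repo <- init(repo_dir)
config(repo, user.name = "Example", user.email = "example@example.org")

# Commit a binary .rds file (a hypothetical stand-in for a saved model).
model <- list(coef = c(1.5, -0.25), version = 1L)
saveRDS(model, file.path(repo_dir, "model.rds"))
add(repo, "model.rds")
first_commit <- commit(repo, message = "Add model v1")

# Find the blob in that commit's tree and read its bytes as a raw vector.
# Assumes the `raw` argument described in this thread is available.
blob <- tree(first_commit)["model.rds"]
bytes <- content(blob, raw = TRUE)

# Write the bytes to a location outside the working tree, then load them.
out_file <- tempfile(fileext = ".rds")
writeBin(bytes, out_file)
restored <- readRDS(out_file)
```

Because the bytes pass through R before being written, this is the read-write cycle mentioned earlier in the thread. For small files such as a single .rds model, that overhead is unlikely to matter.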
