bin2ml

bin2ml is a command line tool for extracting machine learning ready data from software binaries. It is aimed at researchers and hackers who want data suitable for training machine learning models, such as natural language processing (NLP) models or Graph Neural Networks (GNNs), on data derived from software binaries.

  • Extract a range of data from binaries, such as Attributed Control Flow Graphs, basic block random walks and function instruction strings, powered by Radare2.
  • Multithreaded data processing throughout, powered by Rayon.
  • Save processed data in ready-to-go formats, such as graphs saved as NetworkX-compatible JSON objects.
  • Experimental support for creating machine learning embedded basic block CFGs using tch-rs and TorchScript traced models.
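
To give a feel for what "NetworkX-compatible JSON" can look like on the consuming side, here is a minimal sketch that reads a graph in NetworkX's node-link layout using only the Python standard library. The example data, the `load_successors` helper and the assumption that bin2ml's output follows the node-link layout are illustrative and not taken from bin2ml itself, so check the generated files for the actual keys and node attributes.

```python
import json

# Hypothetical example data in NetworkX node-link layout. The real
# bin2ml output may use different keys and carry extra node attributes.
EXAMPLE_JSON = """
{
  "directed": true,
  "multigraph": false,
  "graph": {},
  "nodes": [{"id": 0}, {"id": 1}, {"id": 2}],
  "links": [
    {"source": 0, "target": 1},
    {"source": 0, "target": 2},
    {"source": 1, "target": 2}
  ]
}
"""


def load_successors(text):
    """Build a successor map {node_id: [node_id, ...]} from node-link JSON."""
    data = json.loads(text)
    successors = {node["id"]: [] for node in data["nodes"]}
    # Depending on the NetworkX version and settings, the edge list
    # may be stored under "links" or under "edges".
    for edge in data.get("links", data.get("edges", [])):
        successors[edge["source"]].append(edge["target"])
    return successors


cfg = load_successors(EXAMPLE_JSON)
print(cfg)
```

With NetworkX installed, the same parsed dictionary could instead be passed to `networkx.readwrite.json_graph.node_link_graph` to obtain a graph object, again assuming the file follows the node-link layout.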

bin2ml is under active development and is in an alpha state. Things will change as the tool is developed and built upon further.

Pre-Requisites

  • Radare2 installed. Information on how to do this can be found here.

Quickstart

git clone https://github.com/br0kej/bin2ml
cd bin2ml
cargo build --release

Alternatively, two Dockerfiles are provided. Dockerfile.build builds the bin2ml binary without requiring cargo on your workstation. Dockerfile builds bin2ml and also installs radare2, so that processing can be done inside the container.

Docs

bin2ml comes with some documentation (albeit incomplete), written using mdbook. The documentation can be served locally by installing the version of mdbook for your platform from here and then executing the commands below:

cd bin2ml/docs
mdbook serve

Alternatively, the raw documentation files can be viewed in the docs folder here.

License

The bin2ml source and documentation are released under the MIT license.

Citation

@misc{collyer2023bin2ml,
  author = {Josh Collyer},
  title = {bin2ml},
  year = {2023},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/br0kej/bin2ml/}},
}