Add: Import and export tool for csv/parquet/json #181

Arman-Ghazaryan
Contributor

No description provided.

tools/dataset.cpp
void import_parquet(ukv_graph_import_t& c, ukv_size_t max_batch_size) {

arrow::Status status;
arrow::MemoryPool* pool = arrow::default_memory_pool();
Contributor

We can use our arenas, like in the client.

Contributor Author

It's not working with Parquet.

tools/dataset.cpp (outdated)

void import_json(ukv_graph_import_t& c, ukv_size_t max_batch_size) {

std::vector<edge_t> array;
Contributor

Can't we preallocate a max-size vector?

Contributor Author

For small data, this may be overkill.

Contributor

These tools are only intended for large imports.
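
A minimal sketch of the preallocation discussed in this thread, not the code from the PR. The `edge_t` layout, the `import_in_batches` name, and the `flush` callback are hypothetical stand-ins: the buffer is reserved once for `max_batch_size` elements and then cleared, not reallocated, between batches.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical edge layout, for illustration only.
struct edge_t {
    std::int64_t source;
    std::int64_t target;
    std::int64_t id;
};

// Collects edges into one buffer that is reserved up front and reused for
// every batch. `flush` is a stand-in for the real upsert call.
// Returns the number of batches flushed.
template <typename flush_at>
std::size_t import_in_batches(edge_t const* edges, std::size_t count, //
                              std::size_t max_batch_size, flush_at&& flush) {
    std::vector<edge_t> batch;
    batch.reserve(max_batch_size); // single allocation for the whole import

    std::size_t batches = 0;
    for (std::size_t i = 0; i != count; ++i) {
        batch.push_back(edges[i]);
        if (batch.size() == max_batch_size) {
            flush(batch);
            batch.clear(); // keeps the capacity, so nothing is reallocated
            ++batches;
        }
    }
    if (!batch.empty()) { // flush the incomplete tail, if any
        flush(batch);
        ++batches;
    }
    return batches;
}
```

For small inputs the only cost is one oversized allocation, which is the trade-off the two sides of this thread are weighing.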

@ashvardanian changed the title from "Implement import/export tool for csv/parquet/json" to "Add: Import and export tool for csv/parquet/json" on Nov 8, 2022
@@ -1281,6 +1287,30 @@ void ukv_docs_write(ukv_docs_write_t* c_ptr) {
linked_memory_lock_t arena = linked_memory(c.arena, c.options, c.error);
return_on_error(c.error);

std::vector<ukv_key_t> keys_vec;
Contributor

No std::vector-s allowed... We have to use our own memory.
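
To illustrate the general idea behind this comment, here is a rough sketch of a bump arena that hands out typed storage from one preallocated block. `bump_arena_t` and its members are hypothetical and are NOT the project's `linked_memory` arena, whose interface is not shown in this thread.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdlib>

// Hypothetical fixed-capacity bump arena, for illustration only.
class bump_arena_t {
    std::byte* begin_ = nullptr;
    std::size_t capacity_ = 0;
    std::size_t used_ = 0;

  public:
    explicit bump_arena_t(std::size_t capacity)
        : begin_(static_cast<std::byte*>(std::malloc(capacity))), capacity_(begin_ ? capacity : 0) {}
    ~bump_arena_t() { std::free(begin_); }
    bump_arena_t(bump_arena_t const&) = delete;
    bump_arena_t& operator=(bump_arena_t const&) = delete;

    // Returns uninitialized storage for `count` objects of a trivial type `T`,
    // or `nullptr` when the arena cannot fit them.
    template <typename T>
    T* alloc(std::size_t count) noexcept {
        std::size_t aligned = (used_ + alignof(T) - 1) / alignof(T) * alignof(T);
        if (aligned > capacity_ || count > (capacity_ - aligned) / sizeof(T))
            return nullptr;
        used_ = aligned + count * sizeof(T);
        return reinterpret_cast<T*>(begin_ + aligned);
    }

    // Forgets all allocations at once; the block itself is kept for reuse.
    void reset() noexcept { used_ = 0; }
    std::size_t used() const noexcept { return used_; }
};
```

A call such as `arena.alloc<std::int64_t>(tasks_count)` would then take the place of a temporary `std::vector`, and the memory is released in bulk when the arena is reset.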

strided_iterator_gt<ukv_length_t const> lens {c.lengths, c.lengths_stride};

for (size_t idx = 0; idx < c.tasks_count; ++idx, ++vals, ++lens) {
simdjson::ondemand::parser parser;
Contributor

Reuse the state...
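
A sketch of what reusing the state could look like. To keep it self-contained it uses a hypothetical `scratch_parser_t` as a stand-in for a stateful parser such as `simdjson::ondemand::parser`. The point is only that the parser is constructed once, outside the loop, so its internal buffer survives across tasks.

```cpp
#include <cstddef>
#include <string_view>
#include <vector>

// Hypothetical stand-in for a stateful parser. It owns a scratch buffer
// that grows to fit the largest input seen so far.
class scratch_parser_t {
    std::vector<char> scratch_;
    std::size_t growths_ = 0;

  public:
    // Copies `doc` into the scratch buffer and counts ',' separators,
    // which stands in for real parsing work.
    std::size_t count_commas(std::string_view doc) {
        if (doc.size() > scratch_.size()) {
            scratch_.resize(doc.size());
            ++growths_;
        }
        std::size_t commas = 0;
        for (std::size_t i = 0; i != doc.size(); ++i) {
            scratch_[i] = doc[i];
            commas += doc[i] == ',';
        }
        return commas;
    }

    // Number of times the scratch buffer had to grow.
    std::size_t growths() const noexcept { return growths_; }
};

// The caller constructs `parser` once and passes it in, so every iteration
// reuses the same scratch buffer.
inline std::size_t total_commas(std::string_view const* docs, std::size_t count, scratch_parser_t& parser) {
    std::size_t total = 0;
    for (std::size_t idx = 0; idx != count; ++idx)
        total += parser.count_commas(docs[idx]);
    return total;
}
```

Declaring the parser inside the loop body would instead rebuild the buffer on every iteration.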

return arrow::Status(arrow::StatusCode::TypeError, "Not supported type");
}
arrow::Status Visit(arrow::BooleanArray const& arr) {
json = fmt::format("{}{},", json, arr.Value(idx));
Contributor

This is horrible.
You are reallocating a new string every time you want to append a boolean.
At least use fmt::format_to.
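
A small sketch of appending in place, using only the standard library so it stays self-contained. The helper name `append_boolean` is hypothetical. With {fmt}, the same effect is available through `fmt::format_to(std::back_inserter(json), "{},", value)`.

```cpp
#include <string>

// Appends `value` and a trailing comma to the end of `json`.
// The existing contents are not copied into a fresh string, unlike
// `json = fmt::format("{}{},", json, value)`.
inline void append_boolean(std::string& json, bool value) {
    json.append(value ? "true" : "false");
    json.push_back(',');
}
```

Because `std::string` grows geometrically, repeated appends cost amortized constant time per element, while rebuilding the string on each call copies everything written so far.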


///////// Helpers /////////

class arrow_visitor {
Contributor

Check styling guidelines.

size_t idx = 0;
};

bool strcmp_(const char* lhs, const char* rhs) {
Contributor

Style guidelines: East const
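
For reference, a sketch of the same kind of helper written with east const, where the qualifier follows the type it applies to. The name `equal_cstr` is hypothetical, and it is an assumption that the original `strcmp_` is meant to return `true` on equality.

```cpp
#include <cstring>

// East const: `char const*` instead of `const char*`.
// Assumed semantics: returns true when both C strings are equal.
inline bool equal_cstr(char const* lhs, char const* rhs) noexcept {
    return std::strcmp(lhs, rhs) == 0;
}
```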

docs_vec.reserve(size);

if (c.fields) {
std::vector<std::string> fields(c.fields_count);
Contributor

Why do you need a std::vector or std::string?
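
One possible way to avoid the copies, sketched under the assumption that the field names arrive as an array of NUL-terminated C strings (the actual layout of `c.fields` is not shown in this thread). The helper `find_field` is hypothetical. It compares through `std::string_view`, so nothing is copied or heap-allocated.

```cpp
#include <cstddef>
#include <string_view>

// Returns the index of `name` among `fields`, or `count` if it is absent.
// Views are built on the fly over the caller's memory.
inline std::size_t find_field(char const* const* fields, std::size_t count, std::string_view name) noexcept {
    for (std::size_t i = 0; i != count; ++i)
        if (std::string_view(fields[i]) == name)
            return i;
    return count;
}
```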


char file_name[uuid_length];
make_uuid(file_name);
std::ofstream output(fmt::format("{}{}", file_name, c.paths_extension));
Contributor

Please, never use ofstream or the rest of the old-school IO libraries from STL. Especially on the hot path.
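
The comment does not name a replacement. One option on POSIX systems is to write through raw file descriptors, sketched below. The helper names `write_all` and `export_to_file` are hypothetical, and the code is POSIX-only.

```cpp
#include <cerrno>
#include <cstddef>
#include <fcntl.h>
#include <unistd.h>

// Writes the whole buffer to `fd`, retrying on partial writes and on EINTR.
inline bool write_all(int fd, char const* data, std::size_t length) noexcept {
    while (length != 0) {
        ssize_t written = ::write(fd, data, length);
        if (written < 0) {
            if (errno == EINTR)
                continue; // interrupted by a signal, try again
            return false;
        }
        data += written;
        length -= static_cast<std::size_t>(written);
    }
    return true;
}

// Creates (or truncates) `path` and writes `data` into it without any
// stream or locale machinery in between.
inline bool export_to_file(char const* path, char const* data, std::size_t length) noexcept {
    int fd = ::open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return false;
    bool ok = write_all(fd, data, length);
    return ::close(fd) == 0 && ok;
}
```

Batching rows into one large buffer before each `write_all` call keeps the number of system calls low on the hot path.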

@ashvardanian deleted the 154-implement-importexport-tool-for-csvparquetjson-arrow-datasets branch on December 9, 2022 at 12:04
Development

Successfully merging this pull request may close these issues.

Implement import/export tool for CSV/Parquet/JSON Arrow datasets
4 participants