Support PyArrow arrays as tokenizer input #1415

Open
mariosasko opened this issue Dec 14, 2023 · 11 comments · May be fixed by #1535
Assignees
Labels
enhancement New feature or request good first issue Good for newcomers

Comments

@mariosasko
Contributor

Most data processing libraries (Datasets, Polars, Pandas, DuckDB, etc.) are integrated with PyArrow, so it would be nice to have native (zero-copy if possible) support for PyArrow arrays as tokenizer input. This would avoid the unnecessary PyArrow to Python/NumPy conversion, which is pretty slow for string arrays.

PS: PyArrow has recently added support for the PyCapsule interface, which should help with the implementation.
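For context, the conversion this request wants to avoid looks roughly like the sketch below. It is only an illustration: `encode_arrow_strings` is a hypothetical helper name, `array` is assumed to be a PyArrow string array (anything exposing `to_pylist()`), and `tokenizer` is assumed to be a `tokenizers.Tokenizer` (anything exposing `encode_batch()`).

```python
from typing import Any, List


def encode_arrow_strings(tokenizer: Any, array: Any) -> List[Any]:
    """Tokenize a PyArrow string array by first converting it to Python strings.

    Hypothetical helper: it shows the current workaround, not a proposed API.
    """
    # `to_pylist()` allocates one Python `str` per element. This is the copy
    # that native PyArrow support in the tokenizer would avoid.
    texts = array.to_pylist()

    # Null slots come back as None, which is not a valid text input,
    # so fail early with a clear message.
    if any(text is None for text in texts):
        raise ValueError("null entries must be filled or dropped before tokenizing")

    return tokenizer.encode_batch(texts)
```

With the real libraries installed, this could be called as `encode_arrow_strings(tokenizer, pa.array(["hello world"]))`, where `pa` is `pyarrow` and `tokenizer` is an already loaded `Tokenizer`.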

@mariosasko mariosasko added the enhancement New feature or request label Dec 14, 2023
@ArthurZucker
Collaborator

Would you like to open a PR for this? 🤗


This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

@github-actions github-actions bot added the Stale label Jan 18, 2024
@github-actions github-actions bot closed this as not planned Jan 23, 2024
@mariosasko mariosasko reopened this Jan 31, 2024
@mariosasko mariosasko removed the Stale label Jan 31, 2024
@mariosasko
Contributor Author

Marking this issue as a good first issue. If it doesn't get addressed after a while, I'll take a stab at it.

@mariosasko mariosasko added the good first issue Good for newcomers label Jan 31, 2024

github-actions bot commented Mar 2, 2024

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

@github-actions github-actions bot added the Stale label Mar 2, 2024
@mariosasko mariosasko removed the Stale label Mar 2, 2024
@WenheLI

WenheLI commented Mar 26, 2024

Interested in this. Can someone get it assigned to me?

@mariosasko
Contributor Author

@WenheLI Assigned :)

@WenheLI

WenheLI commented Apr 2, 2024

@mariosasko - Sorry, just saw this! Could you guide me on how to get started? I am still new to this project. Thanks a lot for your help!

@mariosasko
Contributor Author

Sure! The idea is to use the arrow crate (e.g., with ArrayData::from_pyarrow) to decode PyArrow StringArray/LargeStringArray inputs when they are passed to the Tokenizer. You can find the relevant code here (this PR, which did the same thing for NumPy arrays, may also help).

To build the project, check this workflow file, in particular the part that installs the dependencies.
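To make the decoding step above more concrete, here is a rough pure-Python sketch of the Arrow string layout that such code would read: an offsets buffer (int32 for StringArray, int64 for LargeStringArray), one contiguous UTF-8 data buffer, and an optional validity bitmap. The names `iter_arrow_strings` and `build_buffers` are hypothetical, the sketch assumes little-endian buffers and an array offset of zero, and the actual change would live in Rust on top of the arrow crate rather than in Python.

```python
import struct
from typing import Iterator, List, Optional, Tuple


def build_buffers(strings: List[str], large: bool = False) -> Tuple[bytes, bytes]:
    """Pack strings into (offsets, data) buffers the way Arrow lays them out."""
    encoded = [s.encode("utf-8") for s in strings]
    offsets = [0]
    for chunk in encoded:
        offsets.append(offsets[-1] + len(chunk))
    code = "q" if large else "i"  # int64 for LargeString, int32 for String
    return struct.pack(f"<{len(offsets)}{code}", *offsets), b"".join(encoded)


def iter_arrow_strings(
    offsets: bytes,
    data: bytes,
    length: int,
    large: bool = False,
    validity: Optional[bytes] = None,
) -> Iterator[Optional[str]]:
    """Yield each string (or None for a null slot) from the raw buffers."""
    fmt = "<q" if large else "<i"
    width = struct.calcsize(fmt)
    view = memoryview(data)  # slicing a memoryview does not copy the bytes
    for i in range(length):
        # Validity bitmap: least-significant bit first, 1 means "valid".
        if validity is not None and not (validity[i // 8] >> (i % 8)) & 1:
            yield None
            continue
        (start,) = struct.unpack_from(fmt, offsets, i * width)
        (end,) = struct.unpack_from(fmt, offsets, (i + 1) * width)
        # The decode below copies; a native implementation could instead hand
        # the byte range to the tokenizer as a borrowed string slice.
        yield view[start:end].tobytes().decode("utf-8")
```

The only difference between the two array types in this sketch is the offset width, which is why a single code path with a `large` switch can cover both.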

@shreya-51

hello! @WenheLI are you still working on this?

@WenheLI

WenheLI commented Apr 14, 2024

@shreya-51 - Hi! Sorry for the late reply. Yes, I am still working on it.


This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

@github-actions github-actions bot added the Stale label May 15, 2024
@notjedi notjedi linked a pull request May 18, 2024 that will close this issue
@github-actions github-actions bot closed this as not planned May 21, 2024
@ArthurZucker ArthurZucker reopened this Jun 7, 2024
@github-actions github-actions bot removed the Stale label Jun 8, 2024
4 participants