Feat: Adding more pyarrow types to pandas engine #1676

Open
aaravind100 opened this issue Jun 8, 2024 · 3 comments
Labels
enhancement New feature or request

Comments

@aaravind100
Contributor

Is your feature request related to a problem? Please describe.

I'd like to continue to add some of the remaining pyarrow types to the pandas engine. I've come across these two apart from the existing types.

  • list which is analogous to python list.
  • struct which is analogous to python dict.

Describe the solution you'd like

Extend pandas_engine with ArrowList and ArrowStruct types.

I do have a working prototype here and can raise a pr.

Additional context

Would you like to add or prioritize some other types from here?

@aaravind100 aaravind100 added the enhancement New feature or request label Jun 8, 2024
@cosmicBboy
Collaborator

Hi @aaravind100 the prototype looks good, can you make a PR? Will just have to add some unit tests.

Would you like to add or prioritize some other types from

I'll leave that to you and others in the community to prioritize :) Which ones are left that are currently unsupported?

@aaravind100
Contributor Author

aaravind100 commented Jun 22, 2024

@cosmicBboy I created PR #1699.

I'll leave that to you and others in the community to prioritize :) Which ones are left that are currently unsupported?

These types are compatible with pandas but have not been added yet. I'll try adding some next week.

  • null
  • date32
  • date64
  • duration
  • float16 (recently added to pyarrow; casting to float16 was added in pyarrow 16.0.0)
  • time32
  • time64
  • map_
  • binary
  • large_binary
  • large_string

@MarcSkovMadsen

MarcSkovMadsen commented Jun 25, 2024

+1. Came looking for date64.

Workaround

The code below seems to work for me as a workaround for now.

import pandas as pd
import pandera as pa

# pyarrow, dtypes and immutable are re-exported by the pandas engine module
from pandera.engines.pandas_engine import DataType, Engine, dtypes, immutable, pyarrow

@Engine.register_dtype(
    equivalents=[
        "date64[pyarrow]",
        pyarrow.date64,
        pd.ArrowDtype(pyarrow.date64()),
    ]
)
@immutable
class ArrowDate64(DataType, dtypes.Date):
    """Semantic representation of a :class:`pyarrow.date64`."""

    type = pd.ArrowDtype(pyarrow.date64())
    bit_width: int = 64

class DFSchema(pa.DataFrameModel):
    """Schema for a dataframe of jobs from the endpoint

    https://algodon.de-prod.dk/api/hadrian/joblist/{environment}
    """

    model: str = pa.Field()
    notationtime: ArrowDate64 = pa.Field()
    value: int = pa.Field()

df = pd.DataFrame({
    "model": ["A", "B", "A", "B"],
    "notationtime": ["2024-01-01", "2024-01-01", "2024-01-02", "2024-01-02"],
    "value": [1,2,3,4]
})
# Parse the strings to datetimes, then cast the column to the pyarrow date64 dtype
df["notationtime"] = pd.to_datetime(df["notationtime"]).astype("date64[pyarrow]")

DFSchema(df)
