dataframe_parsers that rename columns conflict with type coercion #1666

Open
3 tasks done
bustosalex1 opened this issue May 31, 2024 · 0 comments
Labels
bug Something isn't working

Comments

bustosalex1 commented May 31, 2024

I'm not sure whether this qualifies as a bug or is the expected behavior, but I'm reporting it in case it is unintended. Using the recently introduced dataframe_parser decorator, I wrote a parser that maps some of the column names of the data I am ingesting to new names. On the same schema model, I also enabled schema-wide type coercion by setting coerce = True in the Config class.
It seems that coercion cannot work with this setup because of the order of operations: core parsers (which include coercion) run before custom parsers, so the columns defined in the schema are not yet present in the input DataFrame when coercion runs. They only appear once the custom parser has renamed them, and validation then fails with a SchemaError.
This could potentially be solved by giving users more control over the order in which custom parsers and core parsers run. Again, I'm not sure whether this is the expected behavior, and I appreciate that renaming columns this way may not be an intended use case for this functionality. If that's the case, or if there's some other workaround, please feel free to close this issue.

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of pandera.
  • (optional) I have confirmed this bug exists on the main branch of pandera.

Code Sample, a copy-pastable example

import pandera as pa
import pandas as pd


class PassingSchema(pa.DataFrameModel):
    class Config:
        coerce = True

    col: pa.typing.Series[int]


class FailingSchema(pa.DataFrameModel):
    class Config:
        coerce = True

    @pa.dataframe_parser
    def rename(cls, df: pd.DataFrame) -> pd.DataFrame:
        print(df.dtypes)
        df = df.rename(columns={"a": "col"})
        return df

    col: pa.typing.Series[int] = pa.Field(coerce=True)


if __name__ == "__main__":
    df = pd.DataFrame({"col": [1.0, 2.0, 3.0]})

    passing_schema = PassingSchema(df)

    df = df.rename(columns={"col": "a"})

    # raises a SchemaError: expected series 'col' to have type int64, got float64
    failing_schema = FailingSchema(df)

Expected behavior

I would expect FailingSchema to validate successfully, because I have asserted that col should be coerced into an int.

Desktop (please complete the following information):

  • OS: macOS 14.5
  • Browser: Firefox 126.0.1
bustosalex1 added the bug label on May 31, 2024