Unexpected behavior when validating date objects. pandera=0.19.1 #1653

Open
2 of 3 tasks
JulianoFin opened this issue May 21, 2024 · 0 comments
Labels
bug Something isn't working

  • I have checked that this issue has not already been reported. (I believe I have.)
  • I have confirmed this bug exists on the latest version of pandera.
  • (optional) I have confirmed this bug exists on the main branch of pandera.

Code Sample

import pandas as pd
import pandera as pa


class InputMomentumTimestamp(pa.DataFrameModel):
    date: pd.Timestamp = pa.Field(alias='dt_reference', nullable=False, unique=False, coerce=True)
    bin: pd.StringDtype = pa.Field(alias='bin', nullable=False, unique=False, coerce=True)

    class Config:
        drop_invalid_rows = True

class InputMomentumDateTime(pa.DataFrameModel):
    date: pa.dtypes.DateTime = pa.Field(alias='dt_reference', nullable=False, unique=False, coerce=True)
    bin: pd.StringDtype = pa.Field(alias='bin', nullable=False, unique=False, coerce=True)

    class Config:
        drop_invalid_rows = True

class InputMomentumDate(pa.DataFrameModel):
    date: pa.dtypes.Date = pa.Field(alias='dt_reference', nullable=False, unique=False, coerce=True)
    bin: pd.StringDtype = pa.Field(alias='bin', nullable=False, unique=False, coerce=True)

    class Config:
        drop_invalid_rows = True


input_data={'dt_reference': ['09/02/1996','09/02/1996','09/02/1996','16/02/1996','16/02/1996','16/02/1996'],
            'bin': ['(250, 500]','(500, 750]','(750, 1000]','(0, 250]','(1000, 1500]','(1500, 2000]']}

dataframe = pd.DataFrame(input_data)

print(InputMomentumTimestamp.validate(dataframe, lazy=True))
print(InputMomentumDateTime.validate(dataframe, lazy=True))
print(InputMomentumDate.validate(dataframe, lazy=True))

Expected behavior

The input dataframe contains the dt_reference column, which holds dates in the "%d/%m/%Y" format and is stored with the pandas object dtype.
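To make the format point concrete, here is a minimal standard-library sketch showing that a value such as '16/02/1996' is not a valid ISO date string but does parse with an explicit "%d/%m/%Y" format. This only illustrates the string format; it is not pandera's coercion path, and the variable names are mine.

```python
from datetime import date, datetime

# One of the values from the dt_reference column in the sample above.
value = "16/02/1996"

# ISO parsing expects YYYY-MM-DD, so a day-first, slash-separated string is rejected.
try:
    date.fromisoformat(value)
    parses_as_iso = True
except ValueError:
    parses_as_iso = False

# With the explicit day/month/year format, the same string parses cleanly.
parsed = datetime.strptime(value, "%d/%m/%Y").date()

print(parses_as_iso)
print(parsed.isoformat())
```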

When I set the "dt_reference" column to any date type (pandas.Timestamp, pandera.dtypes.DateTime, or pandera.dtypes.Date), I would expect pandera to try to coerce the column to that dtype. Because the values cannot be parsed with the default ISO format, I would expect the coercion to fail and raise SchemaErrors. Combined with the Config option "drop_invalid_rows" set to True, I would then expect all rows to be dropped, returning an empty dataframe.

The last class (InputMomentumDate) behaves this way, but the first two classes (InputMomentumTimestamp and InputMomentumDateTime) return the validated dataframe without any errors, and the dt_reference column still has the object dtype.
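For comparison, this is roughly the parse-then-drop outcome I have in mind, sketched with pandas alone (no pandera involved). It is only an illustration under my own assumptions: the explicit format string is passed by hand, and the 'not a date' row is made up to show a row being dropped.

```python
import pandas as pd

# Two real values from the sample plus one invented bad value (illustrative only).
raw = pd.DataFrame({
    "dt_reference": ["09/02/1996", "16/02/1996", "not a date"],
    "bin": ["(250, 500]", "(0, 250]", "(500, 750]"],
})

# Parse with the explicit day/month/year format; unparseable values become NaT.
parsed = raw.assign(
    dt_reference=pd.to_datetime(raw["dt_reference"], format="%d/%m/%Y", errors="coerce")
)

# Drop the rows whose date could not be parsed.
clean = parsed.dropna(subset=["dt_reference"]).reset_index(drop=True)

print(clean.dtypes)
print(clean)
```

Here the column ends up with a datetime dtype and only the unparseable row is removed, which is the kind of result I expected from coerce=True together with drop_invalid_rows.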

Is this a bug or expected behavior? If expected, could someone explain why this happens?

Thanks

JulianoFin added the bug label on May 21, 2024