Setting coerce on a column causes the column to be required when required=False #1660

Open
2 of 3 tasks
adam-porich-sm opened this issue May 29, 2024 · 1 comment
Labels
bug Something isn't working

Describe the bug

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of pandera.
  • (optional) I have confirmed this bug exists on the main branch of pandera.

Code Sample, a copy-pastable example

import polars as pl
import pandera.polars as pa

s = pa.DataFrameSchema(
    {
        "missing_column": pa.Column(int, required=False),
        "int_from_string_column": pa.Column(int, coerce=True),
    }
)

df = pl.DataFrame(
    [
        pl.Series("int_from_string_column", ["1", "2"], dtype=pl.String),
    ]
)

s.validate(df)

s2 = pa.DataFrameSchema(
    {
        "missing_column": pa.Column(int, required=False),
        "int_from_string_column": pa.Column(int),
    },
    coerce=True
)

s2.validate(df)

Expected behavior

I expect both validate calls to succeed.

Desktop (please complete the following information):

  • OS: Ubuntu 20.04
  • Version of Pandera: 0.19.3

Additional context

It appears that setting coerce=True at the DataFrameSchema level causes required=False on the Column to be ignored.

adam-porich-sm commented May 29, 2024

Actually, here is a more minimal repro:

import polars as pl
import pandera.polars as pa

s = pa.DataFrameSchema(
    {
        "missing_column": pa.Column(int, required=False, coerce=True),
    }
)

df = pl.DataFrame()

s.validate(df)

> python3.10 /tmp/repro_test.py
Traceback (most recent call last):
  File "/tmp/repro_test.py", line 12, in <module>
    s.validate(df)
  File "/home/adam.porich/.local/lib/python3.10/site-packages/pandera/api/polars/container.py", line 58, in validate
    output = self.get_backend(check_obj).validate(
  File "/home/adam.porich/.local/lib/python3.10/site-packages/pandera/backends/polars/container.py", line 62, in validate
    check_obj = parser(check_obj, *args)
  File "/home/adam.porich/.local/lib/python3.10/site-packages/pandera/backends/polars/container.py", line 395, in coerce_dtype
    check_obj = self._coerce_dtype_helper(check_obj, schema)
  File "/home/adam.porich/.local/lib/python3.10/site-packages/pandera/backends/polars/container.py", line 454, in _coerce_dtype_helper
    obj = getattr(col_schema.dtype, coerce_fn)(
  File "/home/adam.porich/.local/lib/python3.10/site-packages/pandera/engines/polars_engine.py", line 146, in try_coerce
    lf.collect()
  File "/home/adam.porich/.local/lib/python3.10/site-packages/polars/lazyframe/frame.py", line 1816, in collect
    return wrap_df(ldf.collect(callback))
polars.exceptions.ColumnNotFoundError: missing_column

I was surprised to see that coerce=True requires the column to exist. I would expect validation to check whether the column is present first, and to attempt coercion only if it is.
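To make the expected ordering concrete, here is a minimal, library-independent sketch. It does not use pandera or polars and is not how pandera is implemented; `ColumnSpec` and `validate` are hypothetical names, and a table is modelled as a plain dict of lists. The only point is the order of operations: the presence check runs before coercion, so an optional column that is absent is skipped rather than coerced.

```python
from typing import Any, Dict, List, NamedTuple


class ColumnSpec(NamedTuple):
    """Hypothetical stand-in for a schema column (not pandera's Column)."""
    dtype: type
    required: bool = True
    coerce: bool = False


def validate(
    table: Dict[str, List[Any]], schema: Dict[str, ColumnSpec]
) -> Dict[str, List[Any]]:
    """Validate a dict-of-lists table, checking presence before coercing."""
    out = dict(table)
    for name, spec in schema.items():
        # Step 1: presence check. A missing optional column is skipped
        # entirely, so coercion is never attempted on it.
        if name not in out:
            if spec.required:
                raise KeyError(f"required column {name!r} is missing")
            continue
        # Step 2: coerce only columns that actually exist.
        if spec.coerce:
            out[name] = [spec.dtype(value) for value in out[name]]
        # Step 3: type check on the (possibly coerced) values.
        if not all(isinstance(value, spec.dtype) for value in out[name]):
            raise TypeError(f"column {name!r} is not of type {spec.dtype.__name__}")
    return out


schema = {
    "missing_column": ColumnSpec(int, required=False, coerce=True),
    "int_from_string_column": ColumnSpec(int, coerce=True),
}
result = validate({"int_from_string_column": ["1", "2"]}, schema)
```

With this ordering, the optional `missing_column` is ignored and only `int_from_string_column` is coerced, which is the behaviour I expected from both schemas above.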

adam-porich-sm changed the title from "coerce at the DataFrameSchema level overrides required at the Column level" to "Setting coerce on a column causes the column to be required when required=False" on May 29, 2024