Polars checks not being evaluated correctly #1662

Open
2 tasks
mxblsdl opened this issue May 30, 2024 · 3 comments
Labels: bug (Something isn't working)

mxblsdl commented May 30, 2024

Describe the bug
Column checks on polars LazyFrames do not register errors when they should. Values outside a defined range pass validation with no warnings or errors. A polars DataFrame with the same data, by contrast, does register an error.

It looks like this was addressed in a recent PR but I am still seeing the bug in the 0.19.3 release.

  • I have checked that this issue has not already been reported.
    • The issue has been reported and a fix merged to main, but it still persists in the most recent release
  • [x] I have confirmed this bug exists on the latest version of pandera.
  • (optional) I have confirmed this bug exists on the main branch of pandera.

Code Sample

# This code is taken from the examples page [here](https://pandera--1373.org.readthedocs.build/en/1373/polars.html)
# With values changed to be outside the defined range.

import pandera.polars as pa
import polars as pl


schema = pa.DataFrameSchema(
    {
        "state": pa.Column(str),
        "city": pa.Column(str),
        "price": pa.Column(int, pa.Check.in_range(min_value=5, max_value=20)), # check is defined
    }
)


lf = pl.LazyFrame(
    {
        "state": ["FL", "FL", "FL", "CA", "CA", "CA"],
        "city": [
            "Orlando",
            "Miami",
            "Tampa",
            "San Francisco",
            "Los Angeles",
            "San Diego",
        ],
        "price": [2, 12, 10, 16, 20, 180], # values outside of defined range are passed
    }
)
print(schema.validate(lf).collect()) # no errors are raised

Expected behavior

I would expect a pandera.errors.SchemaError to be raised. Note that the polars.DataFrame version of this code does raise an error.

import pandera.polars as pa
import polars as pl


schema = pa.DataFrameSchema(
    {
        "state": pa.Column(str),
        "city": pa.Column(str),
        "price": pa.Column(int, pa.Check.in_range(min_value=5, max_value=20)),
    }
)


lf = pl.DataFrame(
    {
        "state": ["FL", "FL", "FL", "CA", "CA", "CA"],
        "city": [
            "Orlando",
            "Miami",
            "Tampa",
            "San Francisco",
            "Los Angeles",
            "San Diego",
        ],
        "price": [2, 12, 10, 16, 20, 180],
    }
)
print(schema.validate(lf))

Desktop (please complete the following information):

  • OS: Windows 10
  • Browser: Chrome
  • Version: pandera: 0.19.3, polars: 0.20.28

mxblsdl added the bug label May 30, 2024

@kacper-sellforte

[Image: Screenshot 2024-06-12 at 21 03 31]

https://pandera.readthedocs.io/en/stable/polars.html#how-it-works

I think this behaviour is expected. pa.Check.in_range(min_value=5, max_value=20) cannot be performed on a pl.LazyFrame object because it requires reading the data.

mxblsdl commented Jun 17, 2024

So are checks never assessed for LazyFrame objects?

I feel like the documentation should make this more explicit, or a warning should be issued. The top example comes directly from the Pandera documentation, and having a check that is never assessed creates a false sense of coverage.

@kacper-sellforte

Checks are assessed for LazyFrame objects, but only those that don't require the data to be present in memory are evaluated; most importantly, data types.
