Checks with element_wise = True while using PolarsCheckBackend are called twice. #1707

Open
3 tasks done
jcadam14 opened this issue Jun 24, 2024 · 2 comments · May be fixed by #1719
Labels
bug Something isn't working

Comments

jcadam14 commented Jun 24, 2024

Describe the bug
This may be an issue in either Python polars or Rust polars, but I figured I'd start here. I also posted this question on Stack Overflow.

I'm using pandera 0.19.3 with the polars 0.20.31 backend. While running my schema validation under cProfile, I noticed that all of my custom validation checks are called twice. With the pandas backend, the check functions are called only once. I do not have any schema/container-wide checks, just column/component checks.

As stated in the Stack Overflow post, I traced this down to line 4837 of polars/expr/expr.py, which calls self._pyexpr.map_batches. That wraps a call to col('uid').map_list() (I believe a Rust function), and this is where the check function gets called twice, eventually from line 5518 of polars/series/series.py.

I'm not sure whether this is an issue with pandera or with polars, so I'm starting here.
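One way to compare the two call paths described above is to record the Python call stack each time the check function runs. The following is a minimal, stdlib-only sketch: `traced_check`, `call_sites`, `caller_a`, and `caller_b` are hypothetical names used for illustration and are not part of pandera or polars. Note that `traceback` only sees Python frames, so any Rust-side frames between the polars call and the check function would not appear.

```python
import traceback
from typing import List

# One entry per invocation; each entry lists the frames that led to the call.
call_sites: List[List[str]] = []


def traced_check(value: str) -> bool:
    """Hypothetical element-wise check that records who called it."""
    # extract_stack() is ordered oldest-first; drop this function's own frame.
    stack = traceback.extract_stack()[:-1]
    call_sites.append([f"{f.filename}:{f.lineno}({f.name})" for f in stack])
    return value.isalnum()


# Two stand-in callers, mimicking a function that is reached via two paths.
def caller_a() -> bool:
    return traced_check("A12B")


def caller_b() -> bool:
    return traced_check("A12B")


caller_a()
caller_b()

# Print the innermost caller for each recorded invocation.
for frames in call_sites:
    print(frames[-1])
```

Swapping a wrapper like this in for the real check function should show whether both invocations arrive through the same Python frames or through different ones.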

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of pandera.
  • (optional) I have confirmed this bug exists on the main branch of pandera.

Code Sample, a copy-pastable example

import re
from copy import deepcopy
from typing import Any, Callable, Dict, Type

import polars as pl
import pandera.polars as pa
from pandera import Check
from pandera.backends.base import BaseCheckBackend
from pandera.backends.polars.checks import PolarsCheckBackend
from pandera.errors import SchemaErrors
from pandera.polars import Column

def get_template() -> Dict:
    return deepcopy(_schema_template)

def has_valid_format(value: str, regex: str) -> bool:
    return bool(re.match(regex, value))


class MyCheck(Check):
    def __init__(
        self,
        check_fn: Callable,
        id: str,
        name: str,
        description: str,
        severity: str,
        scope: str,
        **check_kwargs
    ):
        self.severity = severity
        self.scope = scope

        super().__init__(check_fn, title=id, name=name, description=description, **check_kwargs)

    @classmethod
    def get_backend(cls, check_obj: Any) -> Type[BaseCheckBackend]:
        return PolarsCheckBackend

_schema_template = {
    "uid": Column(
        str,
        title="Field 1: Unique identifier",
        checks=[MyCheck(
                    has_valid_format,
                    id="E0002",
                    name="uid.invalid_text_pattern",
                    description="Checks the format",
                    severity="Error",
                    scope="Syntax",
                    element_wise=True,
                    regex="^[A-Z0-9]+$",
                )],
    ),
    "action": Column(
        str,
        title="Action Taken",
        checks=[],
    )
}

def validate_schedule():
    data = {"uid": ["A12B"], "action": ["stop"]}
    df = pl.DataFrame(data)
    
    schema = pa.DataFrameSchema(get_template())
    try:
        import cProfile
        import pstats
        profiler = cProfile.Profile()
        profiler.enable()
        schema.validate(df, lazy=True)
        profiler.disable()
        pstats.Stats(profiler).strip_dirs().sort_stats('cumulative').print_stats("has_valid_format")
        print("Passed validation")
    except SchemaErrors as err:
        for schema_error in err.schema_errors:
            print(f"Schema Error: {schema_error}, Check: {schema_error.check}")
            
if __name__ == '__main__':
    validate_schedule()

Expected behavior

Running this with polars results in:

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        2    0.000    0.000    0.000    0.000 sample_issue.py:24(has_valid_format)

whereas switching it to use pandas and pandera.pandas results in:

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    0.000    0.000 sample_issue.py:24(has_valid_format)

I would expect only a single call to the check function in the above example code.
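The call count can also be checked without cProfile by wrapping the check function in a counter. This is a minimal sketch using only the standard library; `count_calls` is a hypothetical helper (not a pandera or polars API), and the direct calls at the end merely stand in for a validation run.

```python
import functools
import re
from typing import Callable


def count_calls(fn: Callable) -> Callable:
    """Hypothetical helper: count how many times fn is invoked."""

    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        wrapper.calls += 1
        return fn(*args, **kwargs)

    wrapper.calls = 0
    return wrapper


@count_calls
def has_valid_format(value: str, regex: str) -> bool:
    return bool(re.match(regex, value))


# Direct calls stand in for schema.validate(); each one bumps the counter.
results = [has_valid_format(v, regex="^[A-Z0-9]+$") for v in ["A12B", "a12b"]]
print(results, has_valid_format.calls)
```

Passing the wrapped function to a Check and reading `has_valid_format.calls` after validation would give the same information as the `ncalls` column above, assuming the backend calls the wrapper like any other callable.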

Desktop (please complete the following information):

  • OS:
    • ProductName: macOS
    • ProductVersion: 13.6.3
    • BuildVersion: 22G436
  • Browser: N/A (python command line)
jcadam14 added the bug label Jun 24, 2024

jcadam14 commented Jun 28, 2024

I've found the culprit. It seems that using element_wise=True in a Check causes the double call. If I change that to groupby="uid" and change the function to take grouped_data, the check function is called only once.

I extended this example by adding the following to the schema:

    "app_date": Column(
        str,
        title="Field 2: Application date",
        checks=[pa.Check(is_date, element_wise=True)],
    ),

And the is_date check function also gets called twice, so it's not just custom checks. If I switch that Check to use groupby="app_date" then the check function gets called once.

jcadam14 changed the title from "Custom Checks while using PolarsCheckBackend are called twice." to "Checks with element_wise = True while using PolarsCheckBackend are called twice." Jun 28, 2024
@cosmicBboy
Collaborator

Thanks for unearthing this, @jcadam14! Would you mind making a PR to fix this?

jcadam14 linked a pull request Jun 28, 2024 that will close this issue