Decimal validation not fully supported #1689

Open
2 of 3 tasks
benlee1284 opened this issue Jun 17, 2024 · 5 comments
Labels
bug Something isn't working

Comments

@benlee1284

Describe the bug
Validation of the Decimal type is not fully supported.

I have run into a number of issues when trying to use Decimals in pandera (e.g. failing to encode JSON when rendering SchemaErrors), but this is the one I couldn't work around.

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of pandera.
  • (optional) I have confirmed this bug exists on the main branch of pandera.


Code Sample, a copy-pastable example

from decimal import Decimal

import pandera.polars as pa
import polars as pl

schema = pa.DataFrameSchema(
    columns={'a': pa.Column(pl.Decimal())},
    checks=[pa.Check(lambda x: False, element_wise=True)],
)

df = pl.DataFrame(data=[{'a': Decimal(1)}])

schema.validate(df)

>> PanicException: dtype Decimal(None, Some(0)) not supported

Expected behavior

A SchemaError / SchemaErrors

Desktop (please complete the following information):

  • OS: Windows 11 Pro (10.0.22631 Build 22631)
  • Browser: Chrome
  • Version: 0.19.3
  • Python Version: 3.11

Additional context

Full traceback:

thread '<unnamed>' panicked at py-polars\src\series\mod.rs:546:46:
dtype Decimal(None, Some(0)) not supported
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
--- PyO3 is resuming a panic after fetching a PanicException from Python. ---
Python stack trace below:
---------------------------------------------------------------------------
PanicException                            Traceback (most recent call last)
File ~\.virtualenvs\zeus-Cm3St1yy\Lib\site-packages\polars\expr\expr.py:4516, in Expr._map_batches_wrapper.__call__(self, *args, **kwargs)
   4515 def __call__(self, *args: Any, **kwargs: Any) -> Any:
-> 4516     result = self.function(*args, **kwargs)
   4517     if _check_for_numpy(result) and isinstance(result, np.ndarray):
   4518         result = pl.Series(result, dtype=self.return_dtype)

File ~\.virtualenvs\zeus-Cm3St1yy\Lib\site-packages\polars\expr\expr.py:4862, in Expr.map_elements.<locals>.wrap_f(x)
   4860 with warnings.catch_warnings():
   4861     warnings.simplefilter("ignore", PolarsInefficientMapWarning)
-> 4862     return x.map_elements(
   4863         function, return_dtype=return_dtype, skip_nulls=skip_nulls
   4864     )

File ~\.virtualenvs\zeus-Cm3St1yy\Lib\site-packages\polars\series\series.py:5504, in Series.map_elements(self, function, return_dtype, skip_nulls)
   5500     pl_return_dtype = py_type_to_dtype(return_dtype)
   5502 warn_on_inefficient_map(function, columns=[self.name], map_target="series")
   5503 return self._from_pyseries(
-> 5504     self._s.apply_lambda(function, pl_return_dtype, skip_nulls)
   5505 )

PanicException: dtype Decimal(None, Some(0)) not supported
---------------------------------------------------------------------------
PanicException                            Traceback (most recent call last)
Cell In[18], line 1
----> 1 schema.validate(df)

File ~\.virtualenvs\zeus-Cm3St1yy\Lib\site-packages\pandera\api\polars\container.py:58, in DataFrameSchema.validate(self, check_obj, head, tail, sample, random_state, lazy, inplace)
     54     if is_dataframe:
     55         # if validating a polars DataFrame, use the global config setting
     56         check_obj = check_obj.lazy()
---> 58     output = self.get_backend(check_obj).validate(
     59         check_obj=check_obj,
     60         schema=self,
     61         head=head,
     62         tail=tail,
     63         sample=sample,
     64         random_state=random_state,
     65         lazy=lazy,
     66         inplace=inplace,
     67     )
     69 if is_dataframe:
     70     output = output.collect()

File ~\.virtualenvs\zeus-Cm3St1yy\Lib\site-packages\pandera\backends\polars\container.py:89, in DataFrameSchemaBackend.validate(self, check_obj, schema, head, tail, sample, random_state, lazy, inplace)
     81 core_checks = [
     82     (self.check_column_presence, (check_obj, schema, column_info)),
     83     (self.check_column_values_are_unique, (sample, schema)),
     84     (self.run_schema_component_checks, (sample, components, lazy)),
     85     (self.run_checks, (sample, schema)),
     86 ]
     88 for check, args in core_checks:
---> 89     results = check(*args)  # type: ignore[operator]
     90     if isinstance(results, CoreCheckResult):
     91         results = [results]

File ~\.virtualenvs\zeus-Cm3St1yy\Lib\site-packages\pandera\validation_depth.py:79, in validate_scope.<locals>._wrapper.<locals>.wrapper(self, check_obj, *args, **kwargs)
     73     logger.debug(
     74         f"Skipping execution of check {func.__name__} since "
     75         "validation depth is set to SCHEMA_ONLY",
     76         stacklevel=2,
     77     )
     78     return CoreCheckResult(passed=True)
---> 79 return func(self, check_obj, *args, **kwargs)

File ~\.virtualenvs\zeus-Cm3St1yy\Lib\site-packages\pandera\backends\polars\container.py:146, in DataFrameSchemaBackend.run_checks(self, check_obj, schema)
    143 for check_index, check in enumerate(schema.checks):
    144     try:
    145         check_results.append(
--> 146             self.run_check(check_obj, schema, check, check_index)
    147         )
    148     except SchemaDefinitionError:
    149         raise

File ~\.virtualenvs\zeus-Cm3St1yy\Lib\site-packages\pandera\backends\polars\base.py:75, in PolarsSchemaBackend.run_check(self, check_obj, schema, check, check_index, *args)
     63 """Handle check results, raising SchemaError on check failure.
     64
     65 :param check_obj: data object to be validated.
   (...)
     71     False.
     72 """
     73 check_result: CheckResult = check(check_obj, *args)
---> 75 passed = check_result.check_passed.collect().item()
     76 failure_cases = None
     77 message = None

File ~\.virtualenvs\zeus-Cm3St1yy\Lib\site-packages\polars\lazyframe\frame.py:1855, in LazyFrame.collect(self, type_coercion, predicate_pushdown, projection_pushdown, simplify_expression, slice_pushdown, comm_subplan_elim, comm_subexpr_elim, cluster_with_columns, no_optimization, streaming, background, _eager, **_kwargs)
   1852 # Only for testing purposes atm.
   1853 callback = _kwargs.get("post_opt_callback")
-> 1855 return wrap_df(ldf.collect(callback))

PanicException: dtype Decimal(None, Some(0)) not supported
@benlee1284 benlee1284 added the bug Something isn't working label Jun 17, 2024
@cosmicBboy
Collaborator

@benlee1284
Author

benlee1284 commented Jun 18, 2024

@cosmicBboy
I used that for the schema (see the original snippet), but I can't see a way to use the polars Decimal type when instantiating a DataFrame.
My understanding was that you're meant to use the Python decimal.Decimal type in polars Decimal columns.

In fact, if I cast a column to polars Decimal, reading a value back actually gives a Python decimal.Decimal:

from decimal import Decimal
import polars as pl

df = pl.DataFrame({'a': [Decimal('1')]})
cast = df.select(pl.col('a').cast(pl.Decimal()))
cast['a'][0]
>> Decimal('1')
type(cast['a'][0])
>> decimal.Decimal

Maybe I've missed a step here though!

@cosmicBboy
Collaborator

So in pandera, element-wise checks use polars' map_elements under the hood: https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.map_elements.html#polars-expr-map-elements

And it looks like map_elements currently does not support mapping a function over the Decimal dtype:

(
    pl.LazyFrame({"a": [Decimal(1)]})
    .with_columns(
        pl.col("a").map_elements(lambda x: x)
    ).collect()
)

error:

  File "/Users/nielsbantilan/miniconda3/envs/pandera-dev/lib/python3.11/site-packages/polars/lazyframe/frame.py", line 1817, in collect
    return wrap_df(ldf.collect(callback))
                   ^^^^^^^^^^^^^^^^^^^^^
pyo3_runtime.PanicException: dtype Decimal(None, Some(0)) not supported

You can raise this issue in the polars repo.

For now, I'd recommend using the vectorized checks that operate on the lazyframe itself: https://pandera.readthedocs.io/en/latest/polars.html#column-level-checks

@benlee1284
Author

Ok cool thank you

So basically you're saying it'll fail for any DataFrame-level checks?

@cosmicBboy
Collaborator

yeah, it'll fail for any element-wise check that operates on decimal types.

2 participants