BackendNotFoundError on databricks/pyspark cluster #1673

Open
TobiRoby opened this issue Jun 5, 2024 · 0 comments
Labels
bug Something isn't working

Comments

TobiRoby commented Jun 5, 2024

Hi,

I am trying to get pandera up and running on Databricks.
However, I receive the following BackendNotFoundError and do not know what the cause is:

```
BackendNotFoundError: Backend not found for backend, class: (<class 'pandera.api.pyspark.container.DataFrameSchema'>, <class 'pyspark.sql.connect.dataframe.DataFrame'>). Looked up the following base classes: (<class 'pyspark.sql.connect.dataframe.DataFrame'>, <class 'object'>)
```

Code example

```python
import pandera.pyspark as pa
import pyspark.sql.types as T

class TestSchema(pa.DataFrameModel):
    id: T.IntegerType() = pa.Field(gt=5)
    name: T.StringType() = pa.Field(str_startswith="B")

df = spark.createDataFrame([(5, "Bread"), (15, "Butter")], ["id", "name"])

TestSchema.validate(check_obj=df)
```
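
For context, `type(df)` on this cluster is the Spark Connect `DataFrame` class named in the error, not the classic `pyspark.sql.DataFrame`. A quick check (my assumption being that Databricks 14.3 routes this through Spark Connect, and that `spark` is the cluster-provided session):

```python
from pyspark.sql import DataFrame as ClassicDataFrame
from pyspark.sql.connect.dataframe import DataFrame as ConnectDataFrame

df = spark.createDataFrame([(5, "Bread"), (15, "Butter")], ["id", "name"])

# The DataFrame comes from the Spark Connect API, i.e. the class that
# pandera's backend lookup reports as not registered.
print(type(df))                          # <class 'pyspark.sql.connect.dataframe.DataFrame'>
print(isinstance(df, ClassicDataFrame))  # False on pyspark 3.5 (no shared base class)
print(isinstance(df, ConnectDataFrame))  # True
```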

Complete error

```
BackendNotFoundError: Backend not found for backend, class: (<class 'pandera.api.pyspark.container.DataFrameSchema'>, <class 'pyspark.sql.connect.dataframe.DataFrame'>). Looked up the following base classes: (<class 'pyspark.sql.connect.dataframe.DataFrame'>, <class 'object'>)
File <command-2461794647677534>, line 10
      6     name: T.StringType() = pa.Field(str_startswith="B")
      8 df = spark.createDataFrame([(5, "Bread"), (15, "Butter")], ["id", "name"])
---> 10 TestSchema.validate(check_obj=df)
File /local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/pandera/api/pyspark/model.py:289, in DataFrameModel.validate(cls, check_obj, head, tail, sample, random_state, lazy, inplace)
    274 @classmethod
    275 @docstring_substitution(validate_doc=DataFrameSchema.validate.__doc__)
    276 def validate(
   (...)
    284     inplace: bool = False,
    285 ) -> Optional[DataFrameBase[TDataFrameModel]]:
    286     """%(validate_doc)s"""
    287     return cast(
    288         DataFrameBase[TDataFrameModel],
--> 289         cls.to_schema().validate(
    290             check_obj, head, tail, sample, random_state, lazy, inplace
    291         ),
    292     )
File /local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/pandera/api/pyspark/container.py:333, in DataFrameSchema.validate(self, check_obj, head, tail, sample, random_state, lazy, inplace)
    330     return check_obj
    331 error_handler = ErrorHandler(lazy)
--> 333 return self._validate(
    334     check_obj=check_obj,
    335     head=head,
    336     tail=tail,
    337     sample=sample,
    338     random_state=random_state,
    339     lazy=lazy,
    340     inplace=inplace,
    341     error_handler=error_handler,
    342 )
File /local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/pandera/api/pyspark/container.py:364, in DataFrameSchema._validate(self, check_obj, head, tail, sample, random_state, lazy, inplace, error_handler)
    355 if self._is_inferred:
    356     warnings.warn(
    357         f"This {type(self)} is an inferred schema that hasn't been "
    358         "modified. It's recommended that you refine the schema "
   (...)
    361         UserWarning,
    362     )
--> 364 return self.get_backend(check_obj).validate(
    365     check_obj=check_obj,
    366     schema=self,
    367     head=head,
    368     tail=tail,
    369     sample=sample,
    370     random_state=random_state,
    371     lazy=lazy,
    372     inplace=inplace,
    373     error_handler=error_handler,
    374 )
File /local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/pandera/api/base/schema.py:96, in BaseSchema.get_backend(cls, check_obj, check_type)
     94     except KeyError:
     95         pass
---> 96 raise BackendNotFoundError(
     97     f"Backend not found for backend, class: {(cls, check_obj_cls)}. "
     98     f"Looked up the following base classes: {classes}"
     99 )
```

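From the traceback it looks like `BaseSchema.get_backend` only finds backends that were registered for the classic `pyspark.sql.DataFrame` class. A small sketch to inspect that registry (note: `BACKEND_REGISTRY` is an internal attribute, so its name and key layout are assumptions on my side, inferred from the error message):

```python
# Unverified diagnostic sketch: list the (schema class, data object class)
# pairs pandera has backends for. BACKEND_REGISTRY is internal; the attribute
# name and tuple keys are assumptions based on the BackendNotFoundError text.
from pandera.api.pyspark.container import DataFrameSchema

for schema_cls, obj_cls in DataFrameSchema.BACKEND_REGISTRY:
    print(schema_cls.__name__, "->", obj_cls)

# Expected (unverified): pyspark.sql.DataFrame appears, but
# pyspark.sql.connect.dataframe.DataFrame does not, matching the error above.
```
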
Runtime environment (short):

- databricks-runtime 14.3 LTS
- python 3.10.12
- pyspark 3.5.0
- pandera 0.19.3
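
In case it helps with triage: a possible workaround I sketched but have not verified would be to register pandera's existing pyspark backends for the Spark Connect `DataFrame` class as well. The `register_backend` method and the backend import paths are assumptions on my side, based on pandera 0.19.3's module layout:

```python
# Unverified workaround sketch: reuse pandera's pyspark backends for Spark
# Connect DataFrames. register_backend() and the backend import paths below
# are assumptions, not something I have confirmed against pandera 0.19.3.
from pyspark.sql.connect.dataframe import DataFrame as ConnectDataFrame

from pandera.api.pyspark.container import DataFrameSchema
from pandera.api.pyspark.components import Column
from pandera.backends.pyspark.container import DataFrameSchemaBackend
from pandera.backends.pyspark.components import ColumnBackend

DataFrameSchema.register_backend(ConnectDataFrame, DataFrameSchemaBackend)
Column.register_backend(ConnectDataFrame, ColumnBackend)

# With the registration in place, the example above should at least resolve a
# backend instead of raising BackendNotFoundError.
TestSchema.validate(check_obj=df)
```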