BackendNotFoundError on databricks/pyspark cluster #1673

Open
TobiRoby opened this issue Jun 5, 2024 · 0 comments
Labels
bug Something isn't working

Comments

TobiRoby commented Jun 5, 2024

Hi,

I am trying to get pandera up and running on Databricks.
However, I receive the following BackendNotFoundError and do not know what the cause is:

```
BackendNotFoundError: Backend not found for backend, class: (<class 'pandera.api.pyspark.container.DataFrameSchema'>, <class 'pyspark.sql.connect.dataframe.DataFrame'>). Looked up the following base classes: (<class 'pyspark.sql.connect.dataframe.DataFrame'>, <class 'object'>)
```

Code example

```python
import pandera.pyspark as pa
import pyspark.sql.types as T

class TestSchema(pa.DataFrameModel):
    id: T.IntegerType() = pa.Field(gt=5)
    name: T.StringType() = pa.Field(str_startswith="B")

df = spark.createDataFrame([(5, "Bread"), (15, "Butter")], ["id", "name"])

TestSchema.validate(check_obj=df)
```
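
For context, `type(df)` on this cluster is the Spark Connect `DataFrame` class named in the error, not the classic `pyspark.sql.DataFrame`. A quick check (my assumption being that Databricks 14.3 routes this through Spark Connect, and that `spark` is the cluster-provided session):

```python
from pyspark.sql import DataFrame as ClassicDataFrame
from pyspark.sql.connect.dataframe import DataFrame as ConnectDataFrame

df = spark.createDataFrame([(5, "Bread"), (15, "Butter")], ["id", "name"])

# The DataFrame comes from the Spark Connect API, i.e. the class that
# pandera's backend lookup reports as not registered.
print(type(df))                          # <class 'pyspark.sql.connect.dataframe.DataFrame'>
print(isinstance(df, ClassicDataFrame))  # False on pyspark 3.5 (no shared base class)
print(isinstance(df, ConnectDataFrame))  # True
```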

Complete error

```
BackendNotFoundError: Backend not found for backend, class: (<class 'pandera.api.pyspark.container.DataFrameSchema'>, <class 'pyspark.sql.connect.dataframe.DataFrame'>). Looked up the following base classes: (<class 'pyspark.sql.connect.dataframe.DataFrame'>, <class 'object'>)
File <command-2461794647677534>, line 10
      6     name: T.StringType() = pa.Field(str_startswith="B")
      8 df = spark.createDataFrame([(5, "Bread"), (15, "Butter")], ["id", "name"])
---> 10 TestSchema.validate(check_obj=df)
File /local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/pandera/api/pyspark/model.py:289, in DataFrameModel.validate(cls, check_obj, head, tail, sample, random_state, lazy, inplace)
    274 @classmethod
    275 @docstring_substitution(validate_doc=DataFrameSchema.validate.__doc__)
    276 def validate(
   (...)
    284     inplace: bool = False,
    285 ) -> Optional[DataFrameBase[TDataFrameModel]]:
    286     """%(validate_doc)s"""
    287     return cast(
    288         DataFrameBase[TDataFrameModel],
--> 289         cls.to_schema().validate(
    290             check_obj, head, tail, sample, random_state, lazy, inplace
    291         ),
    292     )
File /local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/pandera/api/pyspark/container.py:333, in DataFrameSchema.validate(self, check_obj, head, tail, sample, random_state, lazy, inplace)
    330     return check_obj
    331 error_handler = ErrorHandler(lazy)
--> 333 return self._validate(
    334     check_obj=check_obj,
    335     head=head,
    336     tail=tail,
    337     sample=sample,
    338     random_state=random_state,
    339     lazy=lazy,
    340     inplace=inplace,
    341     error_handler=error_handler,
    342 )
File /local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/pandera/api/pyspark/container.py:364, in DataFrameSchema._validate(self, check_obj, head, tail, sample, random_state, lazy, inplace, error_handler)
    355 if self._is_inferred:
    356     warnings.warn(
    357         f"This {type(self)} is an inferred schema that hasn't been "
    358         "modified. It's recommended that you refine the schema "
   (...)
    361         UserWarning,
    362     )
--> 364 return self.get_backend(check_obj).validate(
    365     check_obj=check_obj,
    366     schema=self,
    367     head=head,
    368     tail=tail,
    369     sample=sample,
    370     random_state=random_state,
    371     lazy=lazy,
    372     inplace=inplace,
    373     error_handler=error_handler,
    374 )
File /local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/pandera/api/base/schema.py:96, in BaseSchema.get_backend(cls, check_obj, check_type)
     94     except KeyError:
     95         pass
---> 96 raise BackendNotFoundError(
     97     f"Backend not found for backend, class: {(cls, check_obj_cls)}. "
     98     f"Looked up the following base classes: {classes}"
     99 )
```

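From the traceback it looks like `BaseSchema.get_backend` only finds backends that were registered for the classic `pyspark.sql.DataFrame` class. A small sketch to inspect that registry (note: `BACKEND_REGISTRY` is an internal attribute, so its name and key layout are assumptions on my side, inferred from the error message):

```python
# Unverified diagnostic sketch: list the (schema class, data object class)
# pairs pandera has backends for. BACKEND_REGISTRY is internal; the attribute
# name and tuple keys are assumptions based on the BackendNotFoundError text.
from pandera.api.pyspark.container import DataFrameSchema

for schema_cls, obj_cls in DataFrameSchema.BACKEND_REGISTRY:
    print(schema_cls.__name__, "->", obj_cls)

# Expected (unverified): pyspark.sql.DataFrame appears, but
# pyspark.sql.connect.dataframe.DataFrame does not, matching the error above.
```
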
Runtime environment (short):

- databricks-runtime 14.3 LTS
- python 3.10.12
- pyspark 3.5.0
- pandera 0.19.3
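
In case it helps with triage: a possible workaround I sketched but have not verified would be to register pandera's existing pyspark backends for the Spark Connect `DataFrame` class as well. The `register_backend` method and the backend import paths are assumptions on my side, based on pandera 0.19.3's module layout:

```python
# Unverified workaround sketch: reuse pandera's pyspark backends for Spark
# Connect DataFrames. register_backend() and the backend import paths below
# are assumptions, not something I have confirmed against pandera 0.19.3.
from pyspark.sql.connect.dataframe import DataFrame as ConnectDataFrame

from pandera.api.pyspark.container import DataFrameSchema
from pandera.api.pyspark.components import Column
from pandera.backends.pyspark.container import DataFrameSchemaBackend
from pandera.backends.pyspark.components import ColumnBackend

DataFrameSchema.register_backend(ConnectDataFrame, DataFrameSchemaBackend)
Column.register_backend(ConnectDataFrame, ColumnBackend)

# With the registration in place, the example above should at least resolve a
# backend instead of raising BackendNotFoundError.
TestSchema.validate(check_obj=df)
```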