pyspark_sql docs run time error #1678

Open
invalidarg opened this issue Jun 11, 2024 · 1 comment · May be fixed by #1701

invalidarg commented Jun 11, 2024

Location of the documentation

https://pandera.readthedocs.io/en/stable/pyspark_sql.html
https://pandera.readthedocs.io/en/latest/pyspark.html

Documentation problem

The rendered code output contains runtime errors.

https://pandera.readthedocs.io/en/latest/pyspark.html also mentions Dask.

@billyvinning (Contributor) commented

To give some more context on this issue, PySpark code snippets in the documentation result in the following traceback:

```
JAVA_HOME is not set
---------------------------------------------------------------------------
PySparkRuntimeError                       Traceback (most recent call last)
Cell In[1], line 15
     11     price: Series[int] = pa.Field(in_range={"min_value": 5, "max_value": 20})
     14 # create a pyspark.pandas dataframe that's validated on object initialization
---> 15 df = DataFrame[Schema](
     16     {
     17         'state': ['FL','FL','FL','CA','CA','CA'],
     18         'city': [
     19             'Orlando',
     20             'Miami',
     21             'Tampa',
     22             'San Francisco',
     23             'Los Angeles',
     24             'San Diego',
     25         ],
     26         'price': [8, 12, 10, 16, 20, 18],
     27     }
     28 )
     29 print(df)

File ~/checkouts/readthedocs.org/user_builds/pandera/conda/latest/lib/python3.11/site-packages/pandera/typing/common.py:137, in __patched_generic_alias_call(self, *args, **kwargs)
    132 if not self._inst:
    133     raise TypeError(
    134         f"Type {self._name} cannot be instantiated; "
    135         f"use {self.__origin__.__name__}() instead"
    136     )
--> 137 result = self.__origin__(*args, **kwargs)
    138 try:
    139     result.__orig_class__ = self

File ~/checkouts/readthedocs.org/user_builds/pandera/conda/latest/lib/python3.11/site-packages/pyspark/pandas/frame.py:573, in DataFrame.__init__(self, data, index, columns, dtype, copy)
    570         index = index._to_pandas()
    572     pdf = pd.DataFrame(data=data, index=index, columns=columns, dtype=dtype, copy=copy)
--> 573     internal = InternalFrame.from_pandas(pdf)
    574     index_assigned = True
    576 if index is not None and not index_assigned:
    577     # TODO(SPARK-40226): Support MultiIndex

File ~/checkouts/readthedocs.org/user_builds/pandera/conda/latest/lib/python3.11/site-packages/pyspark/pandas/internal.py:1532, in InternalFrame.from_pandas(pdf)
   1522 (
   1523     pdf,
   1524     index_columns,
   (...)
   1527     data_fields,
   1528 ) = InternalFrame.prepare_pandas_frame(pdf, prefer_timestamp_ntz=prefer_timestamp_ntz)
   1530 schema = StructType([field.struct_field for field in index_fields + data_fields])
-> 1532 sdf = default_session().createDataFrame(pdf, schema=schema)
   1533 return InternalFrame(
   1534     spark_frame=sdf,
   1535     index_spark_columns=[scol_for(sdf, col) for col in index_columns],
   (...)
   1541     column_label_names=column_label_names,
   1542 )

File ~/checkouts/readthedocs.org/user_builds/pandera/conda/latest/lib/python3.11/site-packages/pyspark/pandas/utils.py:483, in default_session()
    481 spark = SparkSession.getActiveSession()
    482 if spark is None:
--> 483     spark = SparkSession.builder.appName("pandas-on-Spark").getOrCreate()
    485 # Turn ANSI off when testing the pandas API on Spark since
    486 # the behavior of pandas API on Spark follows pandas, not SQL.
    487 if is_testing():

File ~/checkouts/readthedocs.org/user_builds/pandera/conda/latest/lib/python3.11/site-packages/pyspark/sql/session.py:497, in SparkSession.Builder.getOrCreate(self)
    495     sparkConf.set(key, value)
    496 # This SparkContext may be an existing one.
--> 497 sc = SparkContext.getOrCreate(sparkConf)
    498 # Do not update `SparkConf` for existing `SparkContext`, as it's shared
    499 # by all sessions.
    500 session = SparkSession(sc, options=self._options)

File ~/checkouts/readthedocs.org/user_builds/pandera/conda/latest/lib/python3.11/site-packages/pyspark/context.py:515, in SparkContext.getOrCreate(cls, conf)
    513 with SparkContext._lock:
    514     if SparkContext._active_spark_context is None:
--> 515         SparkContext(conf=conf or SparkConf())
    516     assert SparkContext._active_spark_context is not None
    517     return SparkContext._active_spark_context

File ~/checkouts/readthedocs.org/user_builds/pandera/conda/latest/lib/python3.11/site-packages/pyspark/context.py:201, in SparkContext.__init__(self, master, appName, sparkHome, pyFiles, environment, batchSize, serializer, conf, gateway, jsc, profiler_cls, udf_profiler_cls, memory_profiler_cls)
    195 if gateway is not None and gateway.gateway_parameters.auth_token is None:
    196     raise ValueError(
    197         "You are trying to pass an insecure Py4j gateway to Spark. This"
    198         " is not allowed as it is a security risk."
    199     )
--> 201 SparkContext._ensure_initialized(self, gateway=gateway, conf=conf)
    202 try:
    203     self._do_init(
    204         master,
    205         appName,
   (...)
    215         memory_profiler_cls,
    216     )

File ~/checkouts/readthedocs.org/user_builds/pandera/conda/latest/lib/python3.11/site-packages/pyspark/context.py:436, in SparkContext._ensure_initialized(cls, instance, gateway, conf)
    434 with SparkContext._lock:
    435     if not SparkContext._gateway:
--> 436         SparkContext._gateway = gateway or launch_gateway(conf)
    437         SparkContext._jvm = SparkContext._gateway.jvm
    439     if instance:

File ~/checkouts/readthedocs.org/user_builds/pandera/conda/latest/lib/python3.11/site-packages/pyspark/java_gateway.py:107, in launch_gateway(conf, popen_kwargs)
    104     time.sleep(0.1)
    106 if not os.path.isfile(conn_info_file):
--> 107     raise PySparkRuntimeError(
    108         error_class="JAVA_GATEWAY_EXITED",
    109         message_parameters={},
    110     )
    112 with open(conn_info_file, "rb") as info:
    113     gateway_port = read_int(info)

PySparkRuntimeError: [JAVA_GATEWAY_EXITED] Java gateway process exited before sending its port number.
```

Java 8/11/17 is a strict dependency of Spark. A Java runtime does not ship with pandera's Conda docs environment, so any attempt to start the Spark session inevitably fails in the CI environment.
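
For reference, here is a minimal, self-contained reconstruction of the documented snippet that triggers this traceback. The `price` field and the `DataFrame[Schema](...)` call are visible in the traceback above; the imports and the `state`/`city` field annotations are assumptions based on the pyspark.pandas example on the docs page, so treat this as a sketch rather than the exact published code.

```python
import pandera as pa

# Typed pandera wrappers around pyspark.pandas objects
# (assumed import path, per the pandera pyspark.pandas docs).
from pandera.typing.pyspark import DataFrame, Series


class Schema(pa.DataFrameModel):
    state: Series[str]
    city: Series[str]
    price: Series[int] = pa.Field(in_range={"min_value": 5, "max_value": 20})


# Instantiating the typed DataFrame constructs a pyspark.pandas DataFrame,
# which launches a Spark session under the hood. With no Java runtime on
# the docs-build image, this is the call that raises
# PySparkRuntimeError: [JAVA_GATEWAY_EXITED].
df = DataFrame[Schema](
    {
        "state": ["FL", "FL", "FL", "CA", "CA", "CA"],
        "city": [
            "Orlando",
            "Miami",
            "Tampa",
            "San Francisco",
            "Los Angeles",
            "San Diego",
        ],
        "price": [8, 12, 10, 16, 20, 18],
    }
)
print(df)
```

On a machine with a supported Java runtime installed (and `JAVA_HOME` set, or `java` on the `PATH`), the same snippet validates the dataframe and prints it.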

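One possible mitigation for the docs build (purely a sketch; I don't know whether the linked PR takes this approach) is to check for a Java runtime before executing the PySpark snippets and skip them when none is found:

```python
import os
import shutil


def java_runtime_available() -> bool:
    """Best-effort check for a Java runtime (hypothetical helper, not part of pandera)."""
    java_home = os.environ.get("JAVA_HOME")
    if java_home and os.path.exists(os.path.join(java_home, "bin", "java")):
        return True
    # Fall back to looking for `java` on the PATH.
    return shutil.which("java") is not None


if not java_runtime_available():
    print("No Java runtime found (Spark needs Java 8/11/17); skipping PySpark example.")
else:
    # Safe to run the pyspark.pandas snippet here.
    ...
```
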
billyvinning linked a pull request (#1701) on Jun 23, 2024 that will close this issue.