
run_ddl.py fails with PySparkRuntimeError: [JAVA_GATEWAY_EXITED] #5

@pmayd

Description

I successfully cloned the repository and ran docker compose up -d --build; the containers all show green in Docker Desktop. I could also connect to the Jupyter notebook locally on port 8888 and run the first script.

However, the second script (run_ddl.py) fails immediately with a Java error:

/opt/spark/bin/spark-class: line 71: /usr/lib/jvm/java-17-openjdk-amd64/bin/java: No such file or directory
/opt/spark/bin/spark-class: line 97: CMD: bad array subscript
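The two spark-class lines point at the likely root cause: JAVA_HOME inside the container names a JDK path that does not actually exist. A quick way to confirm, from a shell inside the Jupyter/Airflow container (the exact service name depends on your compose file), is a sketch like:

```shell
# Check the exact JVM path spark-class complains about, then see what
# (if any) java is actually available in the container.
JVM_PATH=/usr/lib/jvm/java-17-openjdk-amd64/bin/java
if [ -x "$JVM_PATH" ]; then
    echo "expected JVM found: $JVM_PATH"
else
    echo "expected JVM missing: $JVM_PATH"
fi
echo "JAVA_HOME=${JAVA_HOME:-<unset>}"
command -v java || echo "no java on PATH"
```

If the expected JVM is missing but some other JDK exists under /usr/lib/jvm/, pointing JAVA_HOME at the installed one (e.g. via the compose file's environment section or the Dockerfile) should let spark-class start the JVM.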

---------------------------------------------------------------------------
PySparkRuntimeError                       Traceback (most recent call last)
File /home/airflow/notebooks/run_ddl.py:14
     11 logger = logging.getLogger(__name__)
     13 # Create Spark session
---> 14 spark = SparkSession.builder.appName("Run DDLs for TPCH data").getOrCreate()
     16 spark.sql("CREATE SCHEMA IF NOT EXISTS prod_db")
     17 logger.info("Dropping any existing TPCH tables")

File /home/airflow/.venv/lib/python3.13/site-packages/pyspark/sql/session.py:556, in SparkSession.Builder.getOrCreate(self)
    554     sparkConf.set(key, value)
    555 # This SparkContext may be an existing one.
--> 556 sc = SparkContext.getOrCreate(sparkConf)
    557 # Do not update `SparkConf` for existing `SparkContext`, as it's shared
    558 # by all sessions.
    559 session = SparkSession(sc, options=self._options)

File /home/airflow/.venv/lib/python3.13/site-packages/pyspark/core/context.py:523, in SparkContext.getOrCreate(cls, conf)
    521 with SparkContext._lock:
    522     if SparkContext._active_spark_context is None:
--> 523         SparkContext(conf=conf or SparkConf())
    524     assert SparkContext._active_spark_context is not None
    525     return SparkContext._active_spark_context

File /home/airflow/.venv/lib/python3.13/site-packages/pyspark/core/context.py:205, in SparkContext.__init__(self, master, appName, sparkHome, pyFiles, environment, batchSize, serializer, conf, gateway, jsc, profiler_cls, udf_profiler_cls, memory_profiler_cls)
    199 if gateway is not None and gateway.gateway_parameters.auth_token is None:
    200     raise ValueError(
    201         "You are trying to pass an insecure Py4j gateway to Spark. This"
    202         " is not allowed as it is a security risk."
    203     )
--> 205 SparkContext._ensure_initialized(self, gateway=gateway, conf=conf)
    206 try:
    207     self._do_init(
    208         master,
    209         appName,
   (...)    219         memory_profiler_cls,
    220     )

File /home/airflow/.venv/lib/python3.13/site-packages/pyspark/core/context.py:444, in SparkContext._ensure_initialized(cls, instance, gateway, conf)
    442 with SparkContext._lock:
    443     if not SparkContext._gateway:
--> 444         SparkContext._gateway = gateway or launch_gateway(conf)
    445         SparkContext._jvm = SparkContext._gateway.jvm
    447     if instance:

File /home/airflow/.venv/lib/python3.13/site-packages/pyspark/java_gateway.py:111, in launch_gateway(conf, popen_kwargs)
    108     time.sleep(0.1)
    110 if not os.path.isfile(conn_info_file):
--> 111     raise PySparkRuntimeError(
    112         errorClass="JAVA_GATEWAY_EXITED",
    113         messageParameters={},
    114     )
    116 with open(conn_info_file, "rb") as info:
    117     gateway_port = read_int(info)

PySparkRuntimeError: [JAVA_GATEWAY_EXITED] Java gateway process exited before sending its port number.
