[DataLoader] Simplify integration tests#487

Open
robreeves wants to merge 32 commits into linkedin:main from robreeves:dl_itest

Conversation

@robreeves
Collaborator

@robreeves robreeves commented Mar 4, 2026

Summary

The data loader integration tests were getting brittle and complicated. They wrote data locally with PyIceberg and then manually manipulated table metadata to register the data files and create snapshots. As the use cases grow (branches, ORC support), this approach is not sustainable.

This PR refactors the data loader integration tests to use the oh-hadoop-spark Docker recipe, so the integration tests can use Spark to do writes and the data loader to do reads. Because oh-hadoop-spark is a more expensive setup, I moved the tests to their own workflow that runs only when data loader changes are made.

Details

Integration tests run inside a Docker container on the same network as the oh-hadoop-spark Docker Compose services. The test container needs Python dependencies with Linux x86_64 native extensions (pyarrow, datafusion, etc.) plus JRE 8 and Hadoop 2.8 client jars for HDFS access via PyArrow's libhdfs JNI bridge.

Cross-platform problem
Dependencies like pyarrow and datafusion include platform-specific compiled extensions (.so files). On CI (Linux x86_64), these match the Docker container's platform natively. On a macOS ARM dev machine, a normal pip install produces macOS ARM binaries that can't run inside the Linux container.

How it's handled
uv pip install --target supports a --python-platform flag that downloads pre-built wheels for a different platform. The Makefile uses this to install all dependencies with Linux x86_64 native extensions regardless of the host OS. The Dockerfile then copies the pre-built site-packages directory into the image and sets PYTHONPATH.
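A sketch of the Makefile recipe described above. The target directory, requirements file, and exact platform tag are illustrative, not necessarily what this PR uses:

```shell
# Install Linux x86_64 wheels into a local site-packages directory,
# regardless of the host OS (works from a macOS ARM dev machine too).
uv pip install \
  --target build/site-packages \
  --python-platform x86_64-manylinux_2_28 \
  --python-version 3.12 \
  -r requirements.txt

# The Dockerfile then only needs to copy the prebuilt packages and
# point Python at them, e.g.:
#   COPY build/site-packages /opt/site-packages
#   ENV PYTHONPATH=/opt/site-packages
```

Because the wheels are resolved for the container's platform on the host, the image build never compiles native extensions from source.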

Changes

  • Client-facing API Changes
  • Internal API Changes
  • Bug Fixes
  • New Features
  • Performance Improvements
  • Code Style
  • Refactoring
  • Documentation
  • Tests

New file: integrations/python/dataloader/tests/Dockerfile — Test runner container with
Python 3.12, JDK, Hadoop 2.8.5 client (for PyArrow's libhdfs JNI bridge), and uv.

integrations/python/dataloader/tests/integration_tests.py — Removed _LocalCommitCatalog,
_append_data, _copy_metadata_from_container, _get_metadata_path, _create_table,
_delete_table, _cleanup_table and their imports. Added LivySession class that creates a
Livy SQL session and executes Spark SQL statements. All test_* functions and assertions are
unchanged.
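A minimal sketch of what a Livy SQL session helper like the one described above could look like. The class name matches the description, but the method names, service URL, and port (Livy's default, 8998) are illustrative, not the exact code in this PR:

```python
import json
import time
import urllib.request


class LivySession:
    """Sketch of a Livy client that runs Spark SQL statements over REST."""

    def __init__(self, livy_url: str = "http://spark-livy:8998"):
        self.livy_url = livy_url
        self.session_id = None

    @staticmethod
    def _statement_payload(sql: str) -> dict:
        # Livy executes Spark SQL when the statement kind is "sql".
        return {"code": sql, "kind": "sql"}

    def _post(self, path: str, body: dict) -> dict:
        req = urllib.request.Request(
            f"{self.livy_url}{path}",
            data=json.dumps(body).encode(),
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(req) as resp:
            return json.load(resp)

    def _get(self, path: str) -> dict:
        with urllib.request.urlopen(f"{self.livy_url}{path}") as resp:
            return json.load(resp)

    def open(self) -> None:
        # POST /sessions starts a Spark session; wait until it is "idle".
        self.session_id = self._post("/sessions", {"kind": "sql"})["id"]
        while self._get(f"/sessions/{self.session_id}")["state"] != "idle":
            time.sleep(5)

    def sql(self, statement: str) -> dict:
        # Submit one statement, then poll until its result is available.
        stmt = self._post(
            f"/sessions/{self.session_id}/statements",
            self._statement_payload(statement),
        )
        while True:
            result = self._get(
                f"/sessions/{self.session_id}/statements/{stmt['id']}"
            )
            if result["state"] == "available":
                return result["output"]
            time.sleep(1)
```

With a helper like this, test setup becomes plain SQL, e.g. `session.sql("CREATE TABLE db.t (id INT) USING iceberg")` followed by `INSERT INTO` statements, instead of hand-built Iceberg metadata.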

integrations/python/dataloader/Makefile — The integration-tests target now builds a Docker
image and runs it on the oh-hadoop-spark_default network. The token is passed via the OH_TOKEN env var.

.github/workflows/build-run-tests.yml — Removed data loader steps.

integrations/python/dataloader/CLAUDE.md — Updated to reflect oh-hadoop-spark and
containerized test execution.

Testing Done

  • Manually tested on local Docker setup. Please include commands run, and their output.
  • Added new tests for the changes made.
  • Updated existing tests to reflect the changes made.
  • No tests added or updated. Please explain why. If unsure, please feel free to ask for help.
  • Some other form of testing like staging or soak time in production. Please explain.

All test functions and assertions are unchanged — only the test data setup mechanism changed
(Spark SQL via Livy instead of PyIceberg metadata manipulation). make verify passes (lint,
format, typecheck, unit tests).

Additional Information

  • Breaking Changes
  • Deprecations
  • Large PR broken into smaller PRs, and PR plan linked in the description.

robreeves and others added 4 commits March 4, 2026 09:43
Replace fragile PyIceberg metadata manipulation (LocalCommitCatalog,
docker cp, host-filesystem path juggling) with Spark SQL statements
submitted through Livy's REST API. Tests now run inside a Docker
container on the oh-hadoop-spark network so they can read HDFS data
directly.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add an explicit `image: oh-spark-base` tag to the spark-master service
in docker compose so the test Dockerfile can `FROM oh-spark-base`
instead of installing JDK and downloading Hadoop from scratch. The base
image already provides Java, Hadoop 2.8.0, and HADOOP_HOME.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The spark Dockerfile COPYs openhouse-spark-apps_2.12-uber.jar and
openhouse-spark-runtime_2.12-uber.jar, which are produced by the
shadowJar task. The previous oh-only recipe never built spark images
so this wasn't needed, but oh-hadoop-spark requires them.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace sleep 90 with a loop that polls openhouse-tables and spark-livy
every 5s, exiting as soon as both respond. Fails explicitly on timeout
instead of silently proceeding with broken services.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@robreeves robreeves changed the title Simplify dataloader integration tests: use Spark SQL via oh-hadoop-spark [DataLoader] Simplify integration tests Mar 4, 2026
@cbb330
Collaborator

cbb330 commented Mar 4, 2026

This PR switches CI from oh-only to oh-hadoop-spark, which adds HDFS, Spark, and Livy containers + shadowJar to the build. That's a significant increase in startup time and build cost that will slow down every PR, including Java-only changes that don't touch the dataloader at all.

Split the workflow into two independent jobs:

  1. build-and-run-tests — stays exactly as it is on main today (gradlew build, oh-only, scripts/python/integration_test.py). No new dependencies.
  2. dataloader-tests (new) — runs in parallel, owns the oh-hadoop-spark recipe, shadowJar, Livy wait, make verify, and make integration-tests.

No needs: between them, so Java CI is never blocked on the heavier Python/Spark setup. The existing build-tag-publish.yml gate (needs: build-and-run-tests) stays on the Java job only.

Example workflow sketch
jobs:
  # Unchanged from main — Java build + lightweight API integration test
  build-and-run-tests:
    name: Build and Run Tests
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v6
      - uses: actions/setup-java@v5
        with:
          distribution: 'microsoft'
          java-version: '17'
      - uses: gradle/actions/setup-gradle@v5
      - run: ./gradlew clean build

      - run: docker compose -f infra/recipes/docker-compose/oh-only/docker-compose.yml up -d --build
      - run: sleep 30

      - uses: actions/setup-python@v6
        with:
          python-version: '3.12'
      - run: pip install -r scripts/python/requirements.txt
      - run: python scripts/python/integration_test.py ./tables-test-fixtures/tables-test-fixtures-iceberg-1.2/src/main/resources/dummy.token

      - if: always()
        run: docker compose -f infra/recipes/docker-compose/oh-only/docker-compose.yml down

  # New parallel job — dataloader lint + unit + integration
  dataloader-tests:
    name: Dataloader Tests
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v6
      - uses: actions/setup-java@v5
        with:
          distribution: 'microsoft'
          java-version: '17'
      - uses: gradle/actions/setup-gradle@v5
      - run: ./gradlew shadowJar

      - run: docker compose -f infra/recipes/docker-compose/oh-hadoop-spark/docker-compose.yml up -d --build
      - name: Wait for services
        run: |
          for i in $(seq 1 30); do
            if curl -sf http://localhost:8000/v1/databases > /dev/null 2>&1 && \
               curl -sf http://localhost:9003/sessions > /dev/null 2>&1; then
              echo "Services ready after $((i * 5))s"
              exit 0
            fi
            sleep 5
          done
          echo "Timed out after 150s"
          exit 1

      - uses: actions/setup-python@v6
        with:
          python-version: '3.12'
      - uses: astral-sh/setup-uv@v7
        with:
          enable-cache: true

      - working-directory: integrations/python/dataloader
        run: make sync verify
      - working-directory: integrations/python/dataloader
        run: make integration-tests TOKEN_FILE=../../../tables-test-fixtures/tables-test-fixtures-iceberg-1.2/src/main/resources/dummy.token

      - if: always()
        run: docker compose -f infra/recipes/docker-compose/oh-hadoop-spark/docker-compose.yml down

@robreeves
Collaborator Author

> This PR switches CI from oh-only to oh-hadoop-spark, which adds HDFS, Spark, and Livy containers + shadowJar to the build. That's a significant increase in startup time and build cost that will slow down every PR, including Java-only changes that don't touch the dataloader at all.
>
> Split the workflow into two independent jobs:
>
>   1. build-and-run-tests — stays exactly as it is on main today (gradlew build, oh-only, scripts/python/integration_test.py). No new dependencies.
>   2. dataloader-tests (new) — runs in parallel, owns the oh-hadoop-spark recipe, shadowJar, Livy wait, make verify, and make integration-tests.
>
> No needs: between them, so Java CI is never blocked on the heavier Python/Spark setup. The existing build-tag-publish.yml gate (needs: build-and-run-tests) stays on the Java job only.
>
> Example workflow sketch

I like that idea. I feel a lot better about not adding the cost for every PR. I was also not satisfied with the change yet, but wanted to see how it ran in the PR check (still in draft state).

robreeves and others added 23 commits March 4, 2026 15:41
Livy needs Spark master and worker to be ready before it can accept
connections, which takes longer than 150s in CI.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…nges

Restore build-and-run-tests to match main (gradlew build, oh-only,
scripts/python integration tests). Add a parallel dataloader-tests job
that only runs when files under integrations/python/dataloader/ change.
This avoids the oh-hadoop-spark startup cost for Java-only PRs.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Move dataloader lint, unit tests, and integration tests from
build-run-tests.yml into a standalone dataloader-tests.yml workflow.
This workflow only triggers when files under
integrations/python/dataloader/ change, avoiding the ~5 min
oh-hadoop-spark startup cost on every PR.

build-run-tests.yml is restored to match main (Java-only, oh-only).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Run `./gradlew clean build shadowJar` in dataloader workflow since
  oh-hadoop-spark compose needs JARs from both build and shadowJar
- Remove accidental `if: always()` addition from build-run-tests.yml

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The test container only needs Hadoop client JARs for PyArrow's libhdfs
bridge — it doesn't need Spark or Livy. Switch from oh-spark-base to
bde2020/hadoop-namenode:1.2.0-hadoop2.8-java8 directly and revert the
image tag addition to spark-services.yml.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Move Python setup and make sync verify before the readiness poll so
they run in parallel with container startup, reducing overall wall time.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The Java tests already run in the main build-run-tests workflow. The
dataloader workflow only needs the compiled JARs for Docker compose.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Move table setup and teardown into dedicated functions
- Remove asserts from main, keep only setup/test/teardown calls
- Get snapshot IDs through DataLoader.snapshot_id instead of
  catalog.load_table().metadata
- Test nonexistent table through DataLoader instead of catalog directly
- Split monolithic tests into focused single-assertion tests
- Extract _read_all helper to reduce batch-concat-sort boilerplate

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Dump docker compose ps on timeout and container logs on teardown to
help diagnose why services fail to become ready.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Print HTTP status codes for each poll attempt to understand why
services appear running but curl never succeeds.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
openhouse-tables returns 401 on unauthenticated requests when OPA
authorization is enabled (oh-hadoop-spark config). Any HTTP response
proves the service is up — only 000 (connection refused) means it
is still starting.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
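The readiness rule this commit describes can be sketched in Python (the retry loop around it works as in the workflow; the function name and timeout are illustrative):

```python
import urllib.error
import urllib.request


def service_up(url: str, timeout: float = 2.0) -> bool:
    """Any HTTP response (even a 401 from an OPA-protected endpoint)
    proves the service is listening; only a failed connection means
    it is still starting."""
    try:
        urllib.request.urlopen(url, timeout=timeout)
        return True
    except urllib.error.HTTPError:
        return True   # server answered with an error status -> it is up
    except (urllib.error.URLError, OSError):
        return False  # connection refused / reset / timeout -> not up yet
```

Note that `HTTPError` must be caught before `URLError` (it is a subclass), which is exactly the 401-versus-000 distinction the commit makes for curl.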
The Docker image has no .git directory, so hatch-vcs/setuptools-scm
cannot determine the package version. Set a dummy version to unblock
the editable install during uv sync.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
hatchling validates that the readme file referenced in pyproject.toml
exists during editable install.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The old hadoop-namenode base image has glibc too old for manylinux
wheels, causing pyarrow to build from source. Switch to
python:3.12-slim-bookworm and copy only the Hadoop client JARs and
native libs from the hadoop image via multi-stage COPY. JRE is needed
because PyArrow's libhdfs reads HDFS data files via JNI.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Instead of running uv sync inside the Docker container (which tried
to build pyarrow from source on old glibc), build a shiv on the CI
runner where manylinux wheels are available. The Docker image only
needs Python, JRE, and Hadoop libs — no uv or pip.

The shiv bundles all Python deps into a single executable zipapp.
A preamble script runs the integration tests when the shiv executes.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Hadoop 2.8 is incompatible with Java 17 — PyArrow's libhdfs JNI bridge
throws NoClassDefFoundError when trying to connect to HDFS. Copy JRE 8
from the same hadoop image we already use for the Hadoop client JARs.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ult scheme

Three fixes for the integration test Docker container:

1. Multi-stage Docker build: build shiv inside Docker so native extensions
   (pyarrow) match the target platform. Works on both CI (amd64) and local
   (ARM Mac via emulation).

2. Expand CLASSPATH globs at build time. JNI does not expand '*' wildcards
   like the java launcher does, so libhdfs was getting NoClassDefFoundError.

3. Pass DEFAULT_SCHEME=hdfs and DEFAULT_NETLOC=namenode:9000 to the catalog
   so PyIceberg resolves schemeless paths (from Iceberg metadata) to HDFS.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
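Fix 2 above (expanding CLASSPATH globs at build time) can be done with Hadoop's own classpath tool, assuming `hadoop` is on PATH in the build stage — a sketch, not necessarily the exact command in the Dockerfile:

```shell
# Expand '*' classpath entries into concrete jar paths at build time;
# libhdfs loads the JVM via JNI and does not expand wildcards the way
# the java launcher does.
export CLASSPATH="$(hadoop classpath --glob)"
```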
Use one table through the entire test instead of separate tables for each
concern. The test follows a natural progression: nonexistent table error,
empty table, write data, read/filter/project, write second snapshot,
pin to old snapshot.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Row filter, column projection, second snapshot, and pinned snapshot
steps now assert all column values, not just row counts or IDs.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ibility

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ion tests

Use uv's --python-platform flag to install Linux x86_64 packages directly
into a site-packages directory, regardless of the host OS. This eliminates
shiv entirely — the Docker image just sets PYTHONPATH and runs the test
script directly.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
robreeves and others added 3 commits March 4, 2026 22:51
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@robreeves robreeves marked this pull request as ready for review March 5, 2026 15:03
@robreeves
Collaborator Author

@cbb330 this is ready for review

Copilot AI left a comment
Pull request overview

Refactors the Python DataLoader integration tests to provision Iceberg tables via Spark SQL through Livy (using the oh-hadoop-spark docker recipe), and moves these heavier integration tests into a dedicated GitHub Actions workflow that runs only when the dataloader subtree changes.

Changes:

  • Reworked integration_tests.py to use a LivySession helper for Spark SQL table setup/inserts instead of local PyIceberg metadata manipulation.
  • Added a dedicated Docker-based integration test runner image + Makefile targets to build platform-correct dependencies and run tests on the Compose network.
  • Introduced a new CI workflow for dataloader-only unit + integration tests; removed dataloader steps from the shared build workflow.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
integrations/python/dataloader/tests/integration_tests.py Uses Livy/Spark SQL to create/populate tables in HDFS and validates DataLoader reads (filters, projections, snapshots).
integrations/python/dataloader/tests/Dockerfile Adds a minimal test runner image with Java 8 + Hadoop 2.8 client and prebuilt Python deps for HDFS access.
integrations/python/dataloader/Makefile Builds platform-targeted site-packages for the test image and runs integration tests in Docker on the Compose network.
integrations/python/dataloader/CLAUDE.md Updates local validation instructions for oh-hadoop-spark and containerized integration tests.
.github/workflows/dataloader-tests.yml New workflow to run dataloader unit + integration tests when integrations/python/dataloader/** changes.
.github/workflows/build-run-tests.yml Removes dataloader test steps from the general Gradle build/test workflow.


robreeves and others added 2 commits March 5, 2026 08:01
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>