[DataLoader] Simplify integration tests#487

Open
robreeves wants to merge 32 commits into linkedin:main from robreeves:dl_itest

Conversation

@robreeves
Collaborator

@robreeves robreeves commented Mar 4, 2026

Summary

The data loader integration tests were getting brittle and complicated. They wrote data locally with PyIceberg and then manually manipulated table metadata to register the data files and create snapshots. As the use cases grow (branches, ORC support), this approach is not sustainable.

This PR refactors the data loader integration tests to use the oh-hadoop-spark Docker recipe, so the integration tests can use Spark to do writes and the data loader to do reads. Because oh-hadoop-spark is a more expensive setup, I moved the tests to their own workflow that runs only when data loader changes are made.

Details

Integration tests run inside a Docker container on the same network as the oh-hadoop-spark Docker Compose services. The test container needs Python dependencies with Linux x86_64 native extensions (pyarrow, datafusion, etc.) plus JRE 8 and Hadoop 2.8 client jars for HDFS access via PyArrow's libhdfs JNI bridge.

Cross-platform problem
Dependencies like pyarrow and datafusion include platform-specific compiled extensions (.so files). On CI (Linux x86_64), these match the Docker container's platform natively. On a macOS ARM dev machine, a normal pip install produces macOS ARM binaries that can't run inside the Linux container.

How it's handled
uv pip install --target supports a --python-platform flag that downloads pre-built wheels for a different platform. The Makefile uses this to install all dependencies with Linux x86_64 native extensions regardless of the host OS. The Dockerfile then copies the pre-built site-packages directory into the image and sets PYTHONPATH.
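A sketch of the Makefile recipe described above. The target directory, requirements file, and exact platform tag are illustrative, not necessarily what this PR uses:

```shell
# Install Linux x86_64 wheels into a local site-packages directory,
# regardless of the host OS (works from a macOS ARM dev machine too).
uv pip install \
  --target build/site-packages \
  --python-platform x86_64-manylinux_2_28 \
  --python-version 3.12 \
  -r requirements.txt

# The Dockerfile then only needs to copy the prebuilt packages and
# point Python at them, e.g.:
#   COPY build/site-packages /opt/site-packages
#   ENV PYTHONPATH=/opt/site-packages
```

Because the wheels are resolved for the container's platform on the host, the image build never compiles native extensions from source.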

Changes

  • Client-facing API Changes
  • Internal API Changes
  • Bug Fixes
  • New Features
  • Performance Improvements
  • Code Style
  • Refactoring
  • Documentation
  • Tests

New file: integrations/python/dataloader/tests/Dockerfile — Test runner container with
Python 3.12, JDK, Hadoop 2.8.5 client (for PyArrow's libhdfs JNI bridge), and uv.

integrations/python/dataloader/tests/integration_tests.py — Removed _LocalCommitCatalog,
_append_data, _copy_metadata_from_container, _get_metadata_path, _create_table,
_delete_table, _cleanup_table and their imports. Added LivySession class that creates a
Livy SQL session and executes Spark SQL statements. All test_* functions and assertions are
unchanged.
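A minimal sketch of what a Livy SQL session helper like the one described above could look like. The class name matches the description, but the method names, service URL, and port (Livy's default, 8998) are illustrative, not the exact code in this PR:

```python
import json
import time
import urllib.request


class LivySession:
    """Sketch of a Livy client that runs Spark SQL statements over REST."""

    def __init__(self, livy_url: str = "http://spark-livy:8998"):
        self.livy_url = livy_url
        self.session_id = None

    @staticmethod
    def _statement_payload(sql: str) -> dict:
        # Livy executes Spark SQL when the statement kind is "sql".
        return {"code": sql, "kind": "sql"}

    def _post(self, path: str, body: dict) -> dict:
        req = urllib.request.Request(
            f"{self.livy_url}{path}",
            data=json.dumps(body).encode(),
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(req) as resp:
            return json.load(resp)

    def _get(self, path: str) -> dict:
        with urllib.request.urlopen(f"{self.livy_url}{path}") as resp:
            return json.load(resp)

    def open(self) -> None:
        # POST /sessions starts a Spark session; wait until it is "idle".
        self.session_id = self._post("/sessions", {"kind": "sql"})["id"]
        while self._get(f"/sessions/{self.session_id}")["state"] != "idle":
            time.sleep(5)

    def sql(self, statement: str) -> dict:
        # Submit one statement, then poll until its result is available.
        stmt = self._post(
            f"/sessions/{self.session_id}/statements",
            self._statement_payload(statement),
        )
        while True:
            result = self._get(
                f"/sessions/{self.session_id}/statements/{stmt['id']}"
            )
            if result["state"] == "available":
                return result["output"]
            time.sleep(1)
```

With a helper like this, test setup becomes plain SQL, e.g. `session.sql("CREATE TABLE db.t (id INT) USING iceberg")` followed by `INSERT INTO` statements, instead of hand-built Iceberg metadata.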

integrations/python/dataloader/Makefile — The integration-tests target now builds a Docker
image and runs it on the oh-hadoop-spark_default network. The token is passed via the OH_TOKEN env var.

.github/workflows/build-run-tests.yml — Removed data loader steps.

integrations/python/dataloader/CLAUDE.md — Updated to reflect oh-hadoop-spark and
containerized test execution.

Testing Done

  • Manually tested on local Docker setup. Please include commands run, and their output.
  • Added new tests for the changes made.
  • Updated existing tests to reflect the changes made.
  • No tests added or updated. Please explain why. If unsure, please feel free to ask for help.
  • Some other form of testing like staging or soak time in production. Please explain.

All test functions and assertions are unchanged — only the test data setup mechanism changed
(Spark SQL via Livy instead of PyIceberg metadata manipulation). make verify passes (lint,
format, typecheck, unit tests).

Additional Information

  • Breaking Changes
  • Deprecations
  • Large PR broken into smaller PRs, and PR plan linked in the description.

robreeves and others added 4 commits March 4, 2026 09:43
Replace fragile PyIceberg metadata manipulation (LocalCommitCatalog,
docker cp, host-filesystem path juggling) with Spark SQL statements
submitted through Livy's REST API. Tests now run inside a Docker
container on the oh-hadoop-spark network so they can read HDFS data
directly.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add an explicit `image: oh-spark-base` tag to the spark-master service
in docker compose so the test Dockerfile can `FROM oh-spark-base`
instead of installing JDK and downloading Hadoop from scratch. The base
image already provides Java, Hadoop 2.8.0, and HADOOP_HOME.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The spark Dockerfile COPYs openhouse-spark-apps_2.12-uber.jar and
openhouse-spark-runtime_2.12-uber.jar, which are produced by the
shadowJar task. The previous oh-only recipe never built spark images
so this wasn't needed, but oh-hadoop-spark requires them.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace sleep 90 with a loop that polls openhouse-tables and spark-livy
every 5s, exiting as soon as both respond. Fails explicitly on timeout
instead of silently proceeding with broken services.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@robreeves robreeves changed the title Simplify dataloader integration tests: use Spark SQL via oh-hadoop-spark [DataLoader] Simplify integration tests Mar 4, 2026
@cbb330
Collaborator

cbb330 commented Mar 4, 2026

This PR switches CI from oh-only to oh-hadoop-spark, which adds HDFS, Spark, and Livy containers + shadowJar to the build. That's a significant increase in startup time and build cost that will slow down every PR, including Java-only changes that don't touch the dataloader at all.

Split the workflow into two independent jobs:

  1. build-and-run-tests — stays exactly as it is on main today (gradlew build, oh-only, scripts/python/integration_test.py). No new dependencies.
  2. dataloader-tests (new) — runs in parallel, owns the oh-hadoop-spark recipe, shadowJar, Livy wait, make verify, and make integration-tests.

No needs: between them, so Java CI is never blocked on the heavier Python/Spark setup. The existing build-tag-publish.yml gate (needs: build-and-run-tests) stays on the Java job only.

Example workflow sketch
jobs:
  # Unchanged from main — Java build + lightweight API integration test
  build-and-run-tests:
    name: Build and Run Tests
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v6
      - uses: actions/setup-java@v5
        with:
          distribution: 'microsoft'
          java-version: '17'
      - uses: gradle/actions/setup-gradle@v5
      - run: ./gradlew clean build

      - run: docker compose -f infra/recipes/docker-compose/oh-only/docker-compose.yml up -d --build
      - run: sleep 30

      - uses: actions/setup-python@v6
        with:
          python-version: '3.12'
      - run: pip install -r scripts/python/requirements.txt
      - run: python scripts/python/integration_test.py ./tables-test-fixtures/tables-test-fixtures-iceberg-1.2/src/main/resources/dummy.token

      - if: always()
        run: docker compose -f infra/recipes/docker-compose/oh-only/docker-compose.yml down

  # New parallel job — dataloader lint + unit + integration
  dataloader-tests:
    name: Dataloader Tests
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v6
      - uses: actions/setup-java@v5
        with:
          distribution: 'microsoft'
          java-version: '17'
      - uses: gradle/actions/setup-gradle@v5
      - run: ./gradlew shadowJar

      - run: docker compose -f infra/recipes/docker-compose/oh-hadoop-spark/docker-compose.yml up -d --build
      - name: Wait for services
        run: |
          for i in $(seq 1 30); do
            if curl -sf http://localhost:8000/v1/databases > /dev/null 2>&1 && \
               curl -sf http://localhost:9003/sessions > /dev/null 2>&1; then
              echo "Services ready after $((i * 5))s"
              exit 0
            fi
            sleep 5
          done
          echo "Timed out after 150s"
          exit 1

      - uses: actions/setup-python@v6
        with:
          python-version: '3.12'
      - uses: astral-sh/setup-uv@v7
        with:
          enable-cache: true

      - working-directory: integrations/python/dataloader
        run: make sync verify
      - working-directory: integrations/python/dataloader
        run: make integration-tests TOKEN_FILE=../../../tables-test-fixtures/tables-test-fixtures-iceberg-1.2/src/main/resources/dummy.token

      - if: always()
        run: docker compose -f infra/recipes/docker-compose/oh-hadoop-spark/docker-compose.yml down

@robreeves
Collaborator Author

> This PR switches CI from oh-only to oh-hadoop-spark, which adds HDFS, Spark, and Livy containers + shadowJar to the build. That's a significant increase in startup time and build cost that will slow down every PR, including Java-only changes that don't touch the dataloader at all.
>
> Split the workflow into two independent jobs:
>
>   1. build-and-run-tests — stays exactly as it is on main today (gradlew build, oh-only, scripts/python/integration_test.py). No new dependencies.
>   2. dataloader-tests (new) — runs in parallel, owns the oh-hadoop-spark recipe, shadowJar, Livy wait, make verify, and make integration-tests.
>
> No needs: between them, so Java CI is never blocked on the heavier Python/Spark setup. The existing build-tag-publish.yml gate (needs: build-and-run-tests) stays on the Java job only.
>
> Example workflow sketch

I like that idea. I feel a lot better about not adding the cost for every PR. I was also not satisfied with the change yet, but wanted to see how it ran in the PR check (still in draft state).

robreeves and others added 23 commits March 4, 2026 15:41
Livy needs Spark master and worker to be ready before it can accept
connections, which takes longer than 150s in CI.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…nges

Restore build-and-run-tests to match main (gradlew build, oh-only,
scripts/python integration tests). Add a parallel dataloader-tests job
that only runs when files under integrations/python/dataloader/ change.
This avoids the oh-hadoop-spark startup cost for Java-only PRs.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Move dataloader lint, unit tests, and integration tests from
build-run-tests.yml into a standalone dataloader-tests.yml workflow.
This workflow only triggers when files under
integrations/python/dataloader/ change, avoiding the ~5 min
oh-hadoop-spark startup cost on every PR.

build-run-tests.yml is restored to match main (Java-only, oh-only).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Run `./gradlew clean build shadowJar` in dataloader workflow since
  oh-hadoop-spark compose needs JARs from both build and shadowJar
- Remove accidental `if: always()` addition from build-run-tests.yml

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The test container only needs Hadoop client JARs for PyArrow's libhdfs
bridge — it doesn't need Spark or Livy. Switch from oh-spark-base to
bde2020/hadoop-namenode:1.2.0-hadoop2.8-java8 directly and revert the
image tag addition to spark-services.yml.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Move Python setup and make sync verify before the readiness poll so
they run in parallel with container startup, reducing overall wall time.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The Java tests already run in the main build-run-tests workflow. The
dataloader workflow only needs the compiled JARs for Docker compose.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Move table setup and teardown into dedicated functions
- Remove asserts from main, keep only setup/test/teardown calls
- Get snapshot IDs through DataLoader.snapshot_id instead of
  catalog.load_table().metadata
- Test nonexistent table through DataLoader instead of catalog directly
- Split monolithic tests into focused single-assertion tests
- Extract _read_all helper to reduce batch-concat-sort boilerplate

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Dump docker compose ps on timeout and container logs on teardown to
help diagnose why services fail to become ready.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Print HTTP status codes for each poll attempt to understand why
services appear running but curl never succeeds.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
openhouse-tables returns 401 on unauthenticated requests when OPA
authorization is enabled (oh-hadoop-spark config). Any HTTP response
proves the service is up — only 000 (connection refused) means it
is still starting.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
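The readiness rule this commit describes can be sketched in Python (the retry loop around it works as in the workflow; the function name and timeout are illustrative):

```python
import urllib.error
import urllib.request


def service_up(url: str, timeout: float = 2.0) -> bool:
    """Any HTTP response (even a 401 from an OPA-protected endpoint)
    proves the service is listening; only a failed connection means
    it is still starting."""
    try:
        urllib.request.urlopen(url, timeout=timeout)
        return True
    except urllib.error.HTTPError:
        return True   # server answered with an error status -> it is up
    except (urllib.error.URLError, OSError):
        return False  # connection refused / reset / timeout -> not up yet
```

Note that `HTTPError` must be caught before `URLError` (it is a subclass), which is exactly the 401-versus-000 distinction the commit makes for curl.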
The Docker image has no .git directory, so hatch-vcs/setuptools-scm
cannot determine the package version. Set a dummy version to unblock
the editable install during uv sync.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
hatchling validates that the readme file referenced in pyproject.toml
exists during editable install.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The old hadoop-namenode base image has glibc too old for manylinux
wheels, causing pyarrow to build from source. Switch to
python:3.12-slim-bookworm and copy only the Hadoop client JARs and
native libs from the hadoop image via multi-stage COPY. JRE is needed
because PyArrow's libhdfs reads HDFS data files via JNI.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Instead of running uv sync inside the Docker container (which tried
to build pyarrow from source on old glibc), build a shiv on the CI
runner where manylinux wheels are available. The Docker image only
needs Python, JRE, and Hadoop libs — no uv or pip.

The shiv bundles all Python deps into a single executable zipapp.
A preamble script runs the integration tests when the shiv executes.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Hadoop 2.8 is incompatible with Java 17 — PyArrow's libhdfs JNI bridge
throws NoClassDefFoundError when trying to connect to HDFS. Copy JRE 8
from the same hadoop image we already use for the Hadoop client JARs.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ult scheme

Three fixes for the integration test Docker container:

1. Multi-stage Docker build: build shiv inside Docker so native extensions
   (pyarrow) match the target platform. Works on both CI (amd64) and local
   (ARM Mac via emulation).

2. Expand CLASSPATH globs at build time. JNI does not expand '*' wildcards
   like the java launcher does, so libhdfs was getting NoClassDefFoundError.

3. Pass DEFAULT_SCHEME=hdfs and DEFAULT_NETLOC=namenode:9000 to the catalog
   so PyIceberg resolves schemeless paths (from Iceberg metadata) to HDFS.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
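Fix 2 above (expanding CLASSPATH globs at build time) can be done with Hadoop's own classpath tool, assuming `hadoop` is on PATH in the build stage — a sketch, not necessarily the exact command in the Dockerfile:

```shell
# Expand '*' classpath entries into concrete jar paths at build time;
# libhdfs loads the JVM via JNI and does not expand wildcards the way
# the java launcher does.
export CLASSPATH="$(hadoop classpath --glob)"
```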
Use one table through the entire test instead of separate tables for each
concern. The test follows a natural progression: nonexistent table error,
empty table, write data, read/filter/project, write second snapshot,
pin to old snapshot.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Row filter, column projection, second snapshot, and pinned snapshot
steps now assert all column values, not just row counts or IDs.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ibility

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ion tests

Use uv's --python-platform flag to install Linux x86_64 packages directly
into a site-packages directory, regardless of the host OS. This eliminates
shiv entirely — the Docker image just sets PYTHONPATH and runs the test
script directly.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
robreeves and others added 3 commits March 4, 2026 22:51
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@robreeves robreeves marked this pull request as ready for review March 5, 2026 15:03
@robreeves
Collaborator Author

@cbb330 this is ready for review

Copilot AI left a comment
Pull request overview

Refactors the Python DataLoader integration tests to provision Iceberg tables via Spark SQL through Livy (using the oh-hadoop-spark docker recipe), and moves these heavier integration tests into a dedicated GitHub Actions workflow that runs only when the dataloader subtree changes.

Changes:

  • Reworked integration_tests.py to use a LivySession helper for Spark SQL table setup/inserts instead of local PyIceberg metadata manipulation.
  • Added a dedicated Docker-based integration test runner image + Makefile targets to build platform-correct dependencies and run tests on the Compose network.
  • Introduced a new CI workflow for dataloader-only unit + integration tests; removed dataloader steps from the shared build workflow.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
integrations/python/dataloader/tests/integration_tests.py Uses Livy/Spark SQL to create/populate tables in HDFS and validates DataLoader reads (filters, projections, snapshots).
integrations/python/dataloader/tests/Dockerfile Adds a minimal test runner image with Java 8 + Hadoop 2.8 client and prebuilt Python deps for HDFS access.
integrations/python/dataloader/Makefile Builds platform-targeted site-packages for the test image and runs integration tests in Docker on the Compose network.
integrations/python/dataloader/CLAUDE.md Updates local validation instructions for oh-hadoop-spark and containerized integration tests.
.github/workflows/dataloader-tests.yml New workflow to run dataloader unit + integration tests when integrations/python/dataloader/** changes.
.github/workflows/build-run-tests.yml Removes dataloader test steps from the general Gradle build/test workflow.


robreeves and others added 2 commits March 5, 2026 08:01
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>