[DataLoader] Simplify integration tests#487
robreeves wants to merge 32 commits into linkedin:main
Conversation
Replace fragile PyIceberg metadata manipulation (LocalCommitCatalog, docker cp, host-filesystem path juggling) with Spark SQL statements submitted through Livy's REST API. Tests now run inside a Docker container on the oh-hadoop-spark network so they can read HDFS data directly. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
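The Spark-SQL-over-Livy flow this commit describes can be sketched as follows. This is a minimal illustration, not the PR's actual `LivySession` helper; the host name `spark-livy` and Livy's default REST port 8998 are assumptions about the compose network.

```python
import json
import urllib.request

# Assumed service name/port on the oh-hadoop-spark compose network.
LIVY = "http://spark-livy:8998"

def _post(path: str, payload: dict) -> dict:
    # Livy's REST API accepts and returns JSON.
    req = urllib.request.Request(
        LIVY + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

def sql_statement(statement: str) -> dict:
    # With kind == "sql", Livy executes the submitted code as Spark SQL.
    return {"code": statement, "kind": "sql"}

# Usage sketch (requires a running Livy server):
#   session = _post("/sessions", {"kind": "sql"})
#   _post(f"/sessions/{session['id']}/statements",
#         sql_statement("CREATE TABLE db.t (id INT) USING iceberg"))
```

In practice the statement endpoint is asynchronous, so a real helper would also poll the statement's state until it reaches `available`.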
Add an explicit `image: oh-spark-base` tag to the spark-master service in docker compose so the test Dockerfile can `FROM oh-spark-base` instead of installing JDK and downloading Hadoop from scratch. The base image already provides Java, Hadoop 2.8.0, and HADOOP_HOME. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The spark Dockerfile COPYs openhouse-spark-apps_2.12-uber.jar and openhouse-spark-runtime_2.12-uber.jar, which are produced by the shadowJar task. The previous oh-only recipe never built spark images so this wasn't needed, but oh-hadoop-spark requires them. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace sleep 90 with a loop that polls openhouse-tables and spark-livy every 5s, exiting as soon as both respond. Fails explicitly on timeout instead of silently proceeding with broken services. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Split the workflow into two independent jobs:
Example workflow sketch:

jobs:
  # Unchanged from main — Java build + lightweight API integration test
  build-and-run-tests:
    name: Build and Run Tests
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v6
      - uses: actions/setup-java@v5
        with:
          distribution: 'microsoft'
          java-version: '17'
      - uses: gradle/actions/setup-gradle@v5
      - run: ./gradlew clean build
      - run: docker compose -f infra/recipes/docker-compose/oh-only/docker-compose.yml up -d --build
      - run: sleep 30
      - uses: actions/setup-python@v6
        with:
          python-version: '3.12'
      - run: pip install -r scripts/python/requirements.txt
      - run: python scripts/python/integration_test.py ./tables-test-fixtures/tables-test-fixtures-iceberg-1.2/src/main/resources/dummy.token
      - if: always()
        run: docker compose -f infra/recipes/docker-compose/oh-only/docker-compose.yml down

  # New parallel job — dataloader lint + unit + integration
  dataloader-tests:
    name: Dataloader Tests
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v6
      - uses: actions/setup-java@v5
        with:
          distribution: 'microsoft'
          java-version: '17'
      - uses: gradle/actions/setup-gradle@v5
      - run: ./gradlew shadowJar
      - run: docker compose -f infra/recipes/docker-compose/oh-hadoop-spark/docker-compose.yml up -d --build
      - name: Wait for services
        run: |
          for i in $(seq 1 30); do
            if curl -sf http://localhost:8000/v1/databases > /dev/null 2>&1 && \
               curl -sf http://localhost:9003/sessions > /dev/null 2>&1; then
              echo "Services ready after $((i * 5))s"
              exit 0
            fi
            sleep 5
          done
          echo "Timed out after 150s"
          exit 1
      - uses: actions/setup-python@v6
        with:
          python-version: '3.12'
      - uses: astral-sh/setup-uv@v7
        with:
          enable-cache: true
      - working-directory: integrations/python/dataloader
        run: make sync verify
      - working-directory: integrations/python/dataloader
        run: make integration-tests TOKEN_FILE=../../../tables-test-fixtures/tables-test-fixtures-iceberg-1.2/src/main/resources/dummy.token
      - if: always()
        run: docker compose -f infra/recipes/docker-compose/oh-hadoop-spark/docker-compose.yml down
I like that idea. I feel a lot better about not adding the cost for every PR. I was also not satisfied with the change yet, but wanted to see how it ran in the PR check (still in draft state).
Livy needs Spark master and worker to be ready before it can accept connections, which takes longer than 150s in CI. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…nges Restore build-and-run-tests to match main (gradlew build, oh-only, scripts/python integration tests). Add a parallel dataloader-tests job that only runs when files under integrations/python/dataloader/ change. This avoids the oh-hadoop-spark startup cost for Java-only PRs. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Move dataloader lint, unit tests, and integration tests from build-run-tests.yml into a standalone dataloader-tests.yml workflow. This workflow only triggers when files under integrations/python/dataloader/ change, avoiding the ~5 min oh-hadoop-spark startup cost on every PR. build-run-tests.yml is restored to match main (Java-only, oh-only). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Run `./gradlew clean build shadowJar` in dataloader workflow since oh-hadoop-spark compose needs JARs from both build and shadowJar
- Remove accidental `if: always()` addition from build-run-tests.yml
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The test container only needs Hadoop client JARs for PyArrow's libhdfs bridge — it doesn't need Spark or Livy. Switch from oh-spark-base to bde2020/hadoop-namenode:1.2.0-hadoop2.8-java8 directly and revert the image tag addition to spark-services.yml. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Move Python setup and make sync verify before the readiness poll so they run in parallel with container startup, reducing overall wall time. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The Java tests already run in the main build-run-tests workflow. The dataloader workflow only needs the compiled JARs for Docker compose. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Move table setup and teardown into dedicated functions
- Remove asserts from main, keep only setup/test/teardown calls
- Get snapshot IDs through DataLoader.snapshot_id instead of catalog.load_table().metadata
- Test nonexistent table through DataLoader instead of catalog directly
- Split monolithic tests into focused single-assertion tests
- Extract _read_all helper to reduce batch-concat-sort boilerplate
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Dump docker compose ps on timeout and container logs on teardown to help diagnose why services fail to become ready. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Print HTTP status codes for each poll attempt to understand why services appear running but curl never succeeds. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
openhouse-tables returns 401 on unauthenticated requests when OPA authorization is enabled (oh-hadoop-spark config). Any HTTP response proves the service is up — only 000 (connection refused) means it is still starting. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
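In code, the readiness rule this commit describes amounts to something like the following sketch (names and the timeout are illustrative):

```python
import urllib.error
import urllib.request

def service_responding(url: str, timeout: float = 5.0) -> bool:
    # Any HTTP response -- even a 401 from the OPA-protected tables
    # service -- proves the server is up. Only a connection-level failure
    # (curl's status "000") means the service is still starting.
    try:
        urllib.request.urlopen(url, timeout=timeout)
        return True
    except urllib.error.HTTPError:
        return True  # server answered with an error status: it is up
    except (urllib.error.URLError, OSError):
        return False  # connection refused / DNS failure: still starting
```

Note this is the opposite of `curl -sf`, which treats a 401 as failure.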
The Docker image has no .git directory, so hatch-vcs/setuptools-scm cannot determine the package version. Set a dummy version to unblock the editable install during uv sync. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
hatchling validates that the readme file referenced in pyproject.toml exists during editable install. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The old hadoop-namenode base image has glibc too old for manylinux wheels, causing pyarrow to build from source. Switch to python:3.12-slim-bookworm and copy only the Hadoop client JARs and native libs from the hadoop image via multi-stage COPY. JRE is needed because PyArrow's libhdfs reads HDFS data files via JNI. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Instead of running uv sync inside the Docker container (which tried to build pyarrow from source on old glibc), build a shiv on the CI runner where manylinux wheels are available. The Docker image only needs Python, JRE, and Hadoop libs — no uv or pip. The shiv bundles all Python deps into a single executable zipapp. A preamble script runs the integration tests when the shiv executes. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Hadoop 2.8 is incompatible with Java 17 — PyArrow's libhdfs JNI bridge throws NoClassDefFoundError when trying to connect to HDFS. Copy JRE 8 from the same hadoop image we already use for the Hadoop client JARs. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ult scheme Three fixes for the integration test Docker container:
1. Multi-stage Docker build: build shiv inside Docker so native extensions (pyarrow) match the target platform. Works on both CI (amd64) and local (ARM Mac via emulation).
2. Expand CLASSPATH globs at build time. JNI does not expand '*' wildcards like the java launcher does, so libhdfs was getting NoClassDefFoundError.
3. Pass DEFAULT_SCHEME=hdfs and DEFAULT_NETLOC=namenode:9000 to the catalog so PyIceberg resolves schemeless paths (from Iceberg metadata) to HDFS.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
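Fix 2 (expanding CLASSPATH globs) can be sketched as a build-time step like this; the real change lives in the Docker build, so this helper is an illustrative assumption, not the PR's code:

```python
import glob
import os

def expand_classpath(classpath: str) -> str:
    # JNI's JVM loader does not expand '*' entries the way the plain
    # `java` launcher does, so rewrite each wildcard entry into an
    # explicit, sorted list of jar paths before starting the JVM.
    parts: list[str] = []
    for entry in classpath.split(os.pathsep):
        if entry.endswith("*"):
            parts.extend(sorted(glob.glob(entry + ".jar")))
        else:
            parts.append(entry)
    return os.pathsep.join(parts)
```

Running this once at image build time and baking the result into the environment gives libhdfs a classpath it can actually load.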
Use one table through the entire test instead of separate tables for each concern. The test follows a natural progression: nonexistent table error, empty table, write data, read/filter/project, write second snapshot, pin to old snapshot. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Row filter, column projection, second snapshot, and pinned snapshot steps now assert all column values, not just row counts or IDs. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ibility Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ion tests Use uv's --python-platform flag to install Linux x86_64 packages directly into a site-packages directory, regardless of the host OS. This eliminates shiv entirely — the Docker image just sets PYTHONPATH and runs the test script directly. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@cbb330 this is ready for review
Pull request overview
Refactors the Python DataLoader integration tests to provision Iceberg tables via Spark SQL through Livy (using the oh-hadoop-spark docker recipe), and moves these heavier integration tests into a dedicated GitHub Actions workflow that runs only when the dataloader subtree changes.
Changes:
- Reworked `integration_tests.py` to use a `LivySession` helper for Spark SQL table setup/inserts instead of local PyIceberg metadata manipulation.
- Added a dedicated Docker-based integration test runner image + Makefile targets to build platform-correct dependencies and run tests on the Compose network.
- Introduced a new CI workflow for dataloader-only unit + integration tests; removed dataloader steps from the shared build workflow.
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| integrations/python/dataloader/tests/integration_tests.py | Uses Livy/Spark SQL to create/populate tables in HDFS and validates DataLoader reads (filters, projections, snapshots). |
| integrations/python/dataloader/tests/Dockerfile | Adds a minimal test runner image with Java 8 + Hadoop 2.8 client and prebuilt Python deps for HDFS access. |
| integrations/python/dataloader/Makefile | Builds platform-targeted site-packages for the test image and runs integration tests in Docker on the Compose network. |
| integrations/python/dataloader/CLAUDE.md | Updates local validation instructions for oh-hadoop-spark and containerized integration tests. |
| .github/workflows/dataloader-tests.yml | New workflow to run dataloader unit + integration tests when integrations/python/dataloader/** changes. |
| .github/workflows/build-run-tests.yml | Removes dataloader test steps from the general Gradle build/test workflow. |
Summary
The data loader integration tests were getting brittle and complicated. They wrote data locally with PyIceberg and then manually manipulated table metadata to add the data and different snapshots. As the use cases grow (branches, ORC support), this is not sustainable.
This PR refactors the data loader integration tests to use the `oh-hadoop-spark` docker recipe. Then the integration tests can use Spark to do writes and the data loader to do reads. Using `oh-hadoop-spark` is a more expensive setup, so I moved it to its own workflow to only run when data loader changes are made.

Details
Integration tests run inside a Docker container on the same network as the `oh-hadoop-spark` Docker Compose services. The test container needs Python dependencies with Linux x86_64 native extensions (pyarrow, datafusion, etc.) plus JRE 8 and Hadoop 2.8 client jars for HDFS access via PyArrow's libhdfs JNI bridge.

Cross-platform problem

Dependencies like pyarrow and datafusion include platform-specific compiled extensions (`.so` files). On CI (Linux x86_64), these match the Docker container's platform natively. On a macOS ARM dev machine, a normal `pip install` produces macOS ARM binaries that can't run inside the Linux container.

How it's handled

`uv pip install --target` supports a `--python-platform` flag that downloads pre-built wheels for a different platform. The Makefile uses this to install all dependencies with Linux x86_64 native extensions regardless of the host OS. The Dockerfile then copies the pre-built site-packages directory into the image and sets PYTHONPATH.

Changes
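One way to sanity-check that `--python-platform` actually produced Linux binaries is a rough filename heuristic over the target directory. This is illustrative and imperfect (bundled shared libraries such as `libarrow.so` are not always platform-tagged), but CPython extension modules do encode their platform in the filename, e.g. `foo.cpython-312-x86_64-linux-gnu.so` vs `foo.cpython-312-darwin.so`.

```python
import pathlib

def non_linux_extensions(site_packages: str) -> list[str]:
    # Flag compiled extension modules whose filename does not carry a
    # "linux" platform tag, which would indicate host-platform (e.g.
    # macOS ARM) binaries leaked into the target directory.
    return [
        str(p)
        for p in pathlib.Path(site_packages).rglob("*.so")
        if "linux" not in p.name
    ]
```

A check like this could run right after the `uv pip install --target` step, failing fast instead of at container runtime.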
- New file: `integrations/python/dataloader/tests/Dockerfile` — Test runner container with Python 3.12, JDK, Hadoop 2.8.5 client (for PyArrow's libhdfs JNI bridge), and uv.
- `integrations/python/dataloader/tests/integration_tests.py` — Removed `_LocalCommitCatalog`, `_append_data`, `_copy_metadata_from_container`, `_get_metadata_path`, `_create_table`, `_delete_table`, `_cleanup_table` and their imports. Added a `LivySession` class that creates a Livy SQL session and executes Spark SQL statements. All `test_*` functions and assertions are unchanged.
- `integrations/python/dataloader/Makefile` — `integration-tests` target now builds a Docker image and runs it on the `oh-hadoop-spark_default` network. Token is passed via the `OH_TOKEN` env var.
- `.github/workflows/build-run-tests.yml` — Removed data loader steps.
- `integrations/python/dataloader/CLAUDE.md` — Updated to reflect oh-hadoop-spark and containerized test execution.
Testing Done
All test functions and assertions are unchanged — only the test data setup mechanism changed
(Spark SQL via Livy instead of PyIceberg metadata manipulation).
`make verify` passes (lint, format, typecheck, unit tests).
Additional Information