# Add HDFS diagnostic FileIO and tables diagnostics logging profile #7
## Summary
Unit tests in `OperationTests` were failing due to a time zone mismatch: the Spark session defaults to the system time zone (IST in my case), while Iceberg partition transforms and `System.currentTimeMillis` use UTC. This caused the snapshot-expiration-related unit test failures.

## Changes
- [x] Bug Fixes: Set the Spark session time zone to UTC to fix the unit tests.

## Testing Done
- [x] Updated existing tests to reflect the changes made.

Co-authored-by: Dushyant Kumar <dukumar@linkedin.biz>
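For reference, a minimal sketch of pinning the session time zone in test setup, using Spark's `spark.sql.session.timeZone` setting; the class, method name, and master URL are illustrative, not the actual `OperationTests` fixture:

```java
import org.apache.spark.sql.SparkSession;

class SparkTestSessions {
  // Builds a local test session whose SQL time zone is pinned to UTC, so partition
  // transforms and System.currentTimeMillis-based assertions agree regardless of the host TZ.
  static SparkSession utcSession() {
    return SparkSession.builder()
        .master("local[2]")
        .config("spark.sql.session.timeZone", "UTC")
        .getOrCreate();
  }
}
```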
## Summary
- Fix stale snapshot detection during concurrent modifications to return
HTTP 409 (Conflict) instead of HTTP 400 (Bad Request)
- Reclassify `ValidationException` with stale snapshot message to
`CommitFailedException` (409) to allow client retry
- Ensure other ValidationException instances are handled as HTTP 400 Bad
Request responses (e.g., attempting to delete a non-existent snapshot)
## Problem
When concurrent modifications occur during a transaction commit:
1. Client builds snapshots based on table version N (e.g.,
`lastSequenceNumber = 4`)
2. Client sends commit request with these snapshots in
`SNAPSHOTS_JSON_KEY`
3. Meanwhile, another process commits version N+1 (e.g.,
`lastSequenceNumber = 5`)
4. Server calls `doRefresh()` which updates `current()` to version N+1
5. **Bug:** The snapshots in `SNAPSHOTS_JSON_KEY` are now stale (their
sequence numbers are based on version N)
6. Iceberg's `TableMetadata.addSnapshot()` throws `ValidationException`
→ mapped to 400 Bad Request
7. Should return 409 Conflict so clients know to refresh and retry
## Solution
Let Iceberg's existing validation detect sequence number conflicts, then
catch the `ValidationException` and reclassify it as
`CommitFailedException` for the specific stale snapshot error pattern:
```java
} catch (ValidationException e) {
  // Stale snapshot errors are retryable - client should refresh and retry
  if (isStaleSnapshotError(e)) {
    throw new CommitFailedException(e);
  }
  throw new BadRequestException(e, e.getMessage());
}
```
This approach is simpler than pre-checking and leverages Iceberg's
existing validation.
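For reference, the helper referenced in the snippet above could be a simple message check; this is a sketch only, and the matched fragment is an assumption tied to the wording of Iceberg's `TableMetadata.addSnapshot()` validation, which may differ across Iceberg versions:

```java
import org.apache.iceberg.exceptions.ValidationException;

class StaleSnapshotErrors {
  // Assumed message fragment from TableMetadata.addSnapshot(); adjust to the Iceberg version in use.
  private static final String STALE_SNAPSHOT_MESSAGE = "older than last sequence number";

  // Returns true only for the stale-snapshot case, which is reclassified as retryable (409);
  // all other ValidationExceptions keep their 400 Bad Request mapping.
  static boolean isStaleSnapshotError(ValidationException e) {
    return e.getMessage() != null && e.getMessage().contains(STALE_SNAPSHOT_MESSAGE);
  }
}
```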
## Test Plan
- [x] Unit test `testStaleSnapshotErrorDetection()` verifies error
detection logic
- [x] All existing internalcatalog tests pass
- [ ] Integration testing in staging environment
---------
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
## Summary
The uploader could not discover partitions in a backup folder that has no `data_manifest.json`, so orphan files lingered in the backup folder for a long time. This PR makes the orphan file deletion (OFD) job purge such data when `data_manifest.json` does not exist.

## Changes
- [x] Bug Fixes

## Testing Done
- [x] Added new tests for the changes made.
- [x] Updated existing tests to reflect the changes made.
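A minimal sketch of the intended check, assuming the backup folders are reached through Hadoop's `FileSystem` API; the class and method names are hypothetical, not the actual OFD implementation:

```java
import java.io.IOException;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

class BackupFolderChecks {
  // Without a data manifest the uploader cannot discover partitions, so the folder's
  // contents are treated as orphaned and eligible for deletion.
  static boolean shouldPurge(FileSystem fs, Path backupFolder) throws IOException {
    return !fs.exists(new Path(backupFolder, "data_manifest.json"));
  }
}
```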
Refactors `TableStatsCollectorUtil` by extracting reusable helper methods from the `populateCommitEventTablePartitions` implementation. This improves code organization and testability and enables future code reuse without changing any functionality.

## Summary
This is a pure refactoring PR that extracts reusable helper methods from inline code in `populateCommitEventTablePartitions`. The goals are to:
- Improve code organization and readability
- Create reusable building blocks for future features
- Reduce code duplication
- Make no functional changes - behavior remains identical

## Changes
- [x] Refactoring

## Testing Done
- [x] Manually tested on local docker setup.

Co-authored-by: srawat <srawat@linkedin.com>
## Summary
Update Iceberg to the latest version.
## Summary
Add an argument to run OFD (orphan file deletion) deletes in parallel.

## Changes
- [x] Internal API Changes
- [x] Performance Improvements

## Testing Done
- [x] Updated existing tests to reflect the changes made.
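As a rough illustration of what parallel deletes could look like with a fixed-size thread pool; the class, method, and parameter names are assumptions and do not reflect the actual OFD code or the new flag's name:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

class ParallelOrphanFileDeletes {
  // Deletes each orphan file on a worker thread; parallelism would come from the new argument.
  static void deleteInParallel(FileSystem fs, List<Path> orphanFiles, int parallelism)
      throws InterruptedException, ExecutionException {
    ExecutorService pool = Executors.newFixedThreadPool(parallelism);
    try {
      List<Future<Boolean>> deletes = new ArrayList<>();
      for (Path path : orphanFiles) {
        // fs.delete returns a boolean and may throw IOException, so each submit is a Callable
        deletes.add(pool.submit(() -> fs.delete(path, false /* non-recursive */)));
      }
      for (Future<Boolean> delete : deletes) {
        delete.get(); // surface any deletion failure
      }
    } finally {
      pool.shutdown();
    }
  }
}
```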
… user. (linkedin#429)

## Summary
The Apache Iceberg table configuration `write.metadata.previous-versions-max` sets the maximum number of previous metadata files to keep before they are potentially deleted after a new commit. In the OpenHouse catalog it was always hardcoded to 168, regardless of the value the user defined for the configuration in the table properties. For streaming applications that commit frequently (e.g., every 5 minutes), it is essential to let users override this configuration so they can run time-travel queries over longer durations and roll back to earlier snapshots if needed.

To support this, the patch sets `write.metadata.previous-versions-max` to the default value of 168 only if it is not defined by the user in their table properties.

## Testing Done
Added unit tests for the changes introduced.
- [x] Added new tests for the changes made.
- [x] Updated existing tests to reflect the changes made.
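A minimal sketch of the default-only-if-unset behavior; the property key and default value come from the description above, while the helper name and property-map handling are illustrative rather than the actual catalog code:

```java
import java.util.HashMap;
import java.util.Map;

class MetadataRetentionDefaults {
  static final String PREVIOUS_VERSIONS_MAX = "write.metadata.previous-versions-max";
  static final String DEFAULT_PREVIOUS_VERSIONS_MAX = "168";

  // Applies the OpenHouse default only when the user has not set the property themselves.
  static Map<String, String> withDefault(Map<String, String> userTableProperties) {
    Map<String, String> props = new HashMap<>(userTableProperties);
    props.putIfAbsent(PREVIOUS_VERSIONS_MAX, DEFAULT_PREVIOUS_VERSIONS_MAX);
    return props;
  }
}
```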
## Summary
Update Iceberg to the latest version.

## Changes
- [x] Internal API Changes

## Testing Done
`./gradlew clean && ./gradlew build`
…rtition-level statistics collection and publishing for tables in TableStatsCollectionSparkApp (linkedin#413)

## Summary
I extended the existing TableStatsCollectionSparkApp to implement the logic for populating the OpenhouseCommitEventTablePartitionStats table. This new table will serve as the partition-level source of truth for statistics and commit metadata across all OpenHouse datasets. It contains exactly one row per partition, where the commit metadata reflects the latest commit that modified that partition. Each record includes:
1. **Commit Metadata** (from the latest commit that changed the respective partition): commit ID (snapshot_id), commit timestamp (committed_at), commit app ID (Spark app id), commit app name (Spark app name), and commit operation (APPEND, DELETE, OVERWRITE, REPLACE)
2. **Table Identifier** (database, table, cluster, location, partition spec)
3. **Partition Data** (typed column values for all partition columns)
4. **Table-Level Stats** (rowCount, columnCount)
5. **Field/Column-Level Stats** (nullCount, nanCount, minValue, maxValue, columnSizeInBytes)

This enables granular partition-level analytics and monitoring, providing:
1. **Partition-level statistics** - access detailed metrics (row counts, column stats) for each partition
2. **Latest state tracking** - know the current state of each partition and when it was last modified
3. **Fine-grained monitoring** - monitor data quality and distribution at partition granularity
4. **Optimized queries** - identify partitions to scan based on min/max values and data freshness
5. **Data profiling** - analyze data characteristics (nulls, NaNs, size) per partition for optimization
6. **Incremental processing** - efficiently identify which partitions contain relevant data for downstream pipelines

## Output
This PR ensures the TableStatsCollectionSparkApp executes all 4 collection tasks (table stats, commit events, partition events, and partition stats) synchronously while maintaining complete data collection and publishing functionality.

**End-to-End Verification (Docker)**

### 1.
Sequential Execution Timeline ``` 25/12/11 09:08:26 INFO spark.TableStatsCollectionSparkApp: Starting table stats collection for table: testdb.partition_stats_test 25/12/11 09:08:36 INFO spark.TableStatsCollectionSparkApp: Completed table stats collection for table: testdb.partition_stats_test in 9694 ms 25/12/11 09:08:36 INFO spark.TableStatsCollectionSparkApp: Starting commit events collection for table: testdb.partition_stats_test 25/12/11 09:08:38 INFO spark.TableStatsCollectionSparkApp: Completed commit events collection for table: testdb.partition_stats_test (3 events) in 2258 ms 25/12/11 09:08:38 INFO spark.TableStatsCollectionSparkApp: Starting partition events collection for table: testdb.partition_stats_test 25/12/11 09:08:41 INFO spark.TableStatsCollectionSparkApp: Completed partition events collection for table: testdb.partition_stats_test (3 partition events) in 3282 ms 25/12/11 09:08:41 INFO spark.TableStatsCollectionSparkApp: Starting partition stats collection for table: testdb.partition_stats_test 25/12/11 09:08:49 INFO spark.TableStatsCollectionSparkApp: Completed partition stats collection for table: testdb.partition_stats_test (3 partition stats) in 7895 ms 25/12/11 09:08:49 INFO spark.TableStatsCollectionSparkApp: Total collection time for table: testdb.partition_stats_test in 23137 ms ``` **Key Points:** - ✅ Tasks execute sequentially (no overlapping timestamps) - ✅ Each task starts immediately after previous completes - ✅ Total time = sum of individual tasks (9.7s + 2.3s + 3.3s + 7.9s = 23.1s) - ✅ No "parallel execution" in log message (synchronous pattern confirmed) ### 2. publishStats Log Output ``` 25/12/11 09:08:49 INFO spark.TableStatsCollectionSparkApp: Publishing stats for table: testdb.partition_stats_test 25/12/11 09:08:49 INFO spark.TableStatsCollectionSparkApp: {"databaseName":"testdb","tableName":"partition_stats_test","clusterName":"LocalHadoopCluster","tableMetadataLocation":"/data/openhouse/testdb/partition_stats_test-088a1368-1212-49b1-b3d9-b6cabdec290e","creationTimeMs":1765444016084,"numFiles":3,"sizeInBytes":3045,"numRows":5,"numColumns":4,"numPartitions":3,"earliestPartitionDate":"2024-01-01"} ``` **Key Points:** - ✅ Table-level stats published successfully - ✅ numRows: 5 (total across all partitions) - ✅ numPartitions: 3 (2024-01-01, 2024-01-02, 2024-01-03) - ✅ earliestPartitionDate: "2024-01-01" (correctly identified) - ✅ Table metadata and size metrics populated ### 3. 
publishCommitEvents Log Output ``` 25/12/11 09:08:49 INFO spark.TableStatsCollectionSparkApp: Publishing commit events for table: testdb.partition_stats_test 25/12/11 09:08:49 INFO spark.TableStatsCollectionSparkApp: [{"dataset":{"databaseName":"testdb","tableName":"partition_stats_test","clusterName":"LocalHadoopCluster","tableMetadataLocation":"/data/openhouse/testdb/partition_stats_test-088a1368-1212-49b1-b3d9-b6cabdec290e","partitionSpec":"[\n 1000: event_time_day: day(4)\n]"},"commitMetadata":{"commitId":5642811578603876150,"commitTimestampMs":1765444061000,"commitAppId":"local-1765443996768","commitAppName":"Spark shell","commitOperation":"APPEND"},"eventTimestampMs":1765444129777},{"dataset":{"databaseName":"testdb","tableName":"partition_stats_test","clusterName":"LocalHadoopCluster","tableMetadataLocation":"/data/openhouse/testdb/partition_stats_test-088a1368-1212-49b1-b3d9-b6cabdec290e","partitionSpec":"[\n 1000: event_time_day: day(4)\n]"},"commitMetadata":{"commitId":7929592344081159299,"commitTimestampMs":1765444064000,"commitAppId":"local-1765443996768","commitAppName":"Spark shell","commitOperation":"APPEND"},"eventTimestampMs":1765444129777},{"dataset":{"databaseName":"testdb","tableName":"partition_stats_test","clusterName":"LocalHadoopCluster","tableMetadataLocation":"/data/openhouse/testdb/partition_stats_test-088a1368-1212-49b1-b3d9-b6cabdec290e","partitionSpec":"[\n 1000: event_time_day: day(4)\n]"},"commitMetadata":{"commitId":8368973829645132323,"commitTimestampMs":1765444066000,"commitAppId":"local-1765443996768","commitAppName":"Spark shell","commitOperation":"APPEND"},"eventTimestampMs":1765444129777}] ``` **Key Points:** - ✅ All 3 commit events published successfully - ✅ commitAppId: "local-1765443996768" (populated) - ✅ commitAppName: "Spark shell" (populated) - ✅ commitOperation: "APPEND" (properly parsed) - ✅ Commit timestamps in chronological order ### 4. 
publishPartitionEvents Log Output ``` 25/12/11 09:08:49 INFO spark.TableStatsCollectionSparkApp: Publishing partition events for table: testdb.partition_stats_test 25/12/11 09:08:49 INFO spark.TableStatsCollectionSparkApp: [{"partitionData":[{"columnName":"event_time_day","value":"2024-01-01"}],"dataset":{"databaseName":"testdb","tableName":"partition_stats_test","clusterName":"LocalHadoopCluster","tableMetadataLocation":"/data/openhouse/testdb/partition_stats_test-088a1368-1212-49b1-b3d9-b6cabdec290e","partitionSpec":"[\n 1000: event_time_day: day(4)\n]"},"commitMetadata":{"commitId":5642811578603876150,"commitTimestampMs":1765444061000,"commitAppId":"local-1765443996768","commitAppName":"Spark shell","commitOperation":"APPEND"},"eventTimestampMs":1765444129777},{"partitionData":[{"columnName":"event_time_day","value":"2024-01-02"}],"dataset":{"databaseName":"testdb","tableName":"partition_stats_test","clusterName":"LocalHadoopCluster","tableMetadataLocation":"/data/openhouse/testdb/partition_stats_test-088a1368-1212-49b1-b3d9-b6cabdec290e","partitionSpec":"[\n 1000: event_time_day: day(4)\n]"},"commitMetadata":{"commitId":7929592344081159299,"commitTimestampMs":1765444064000,"commitAppId":"local-1765443996768","commitAppName":"Spark shell","commitOperation":"APPEND"},"eventTimestampMs":1765444129777},{"partitionData":[{"columnName":"event_time_day","value":"2024-01-03"}],"dataset":{"databaseName":"testdb","tableName":"partition_stats_test","clusterName":"LocalHadoopCluster","tableMetadataLocation":"/data/openhouse/testdb/partition_stats_test-088a1368-1212-49b1-b3d9-b6cabdec290e","partitionSpec":"[\n 1000: event_time_day: day(4)\n]"},"commitMetadata":{"commitId":8368973829645132323,"commitTimestampMs":1765444066000,"commitAppId":"local-1765443996768","commitAppName":"Spark shell","commitOperation":"APPEND"},"eventTimestampMs":1765444129777}] ``` **Key Points:** - ✅ All 3 partition events published successfully - ✅ partitionData: Contains partition column name and values (event_time_day: 2024-01-01, 2024-01-02, 2024-01-03) - ✅ commitAppId: "local-1765443996768" (populated) - ✅ commitAppName: "Spark shell" (populated) - ✅ commitOperation: "APPEND" (properly parsed) - ✅ Each event represents a different partition with correct commit metadata ### 5. 
publishPartitionStats Log Output ``` 25/12/11 09:08:49 INFO spark.TableStatsCollectionSparkApp: Publishing partition stats for table: testdb.partition_stats_test (3 stats) 25/12/11 09:08:49 INFO spark.TableStatsCollectionSparkApp: [{"partitionData":[{"columnName":"event_time_day","value":"2024-01-01"}],"rowCount":2,"columnCount":4,"nullCount":[{"columnName":"event_time","value":0},{"columnName":"id","value":0},{"columnName":"name","value":0},{"columnName":"region","value":0}],"nanCount":[{"columnName":"event_time","value":0},{"columnName":"id","value":0},{"columnName":"name","value":0},{"columnName":"region","value":0}],"minValue":[{"columnName":"event_time","value":"2024-01-01 10:00:00.0"},{"columnName":"id","value":1},{"columnName":"name","value":"Alice"},{"columnName":"region","value":"EU"}],"maxValue":[{"columnName":"event_time","value":"2024-01-01 11:00:00.0"},{"columnName":"id","value":2},{"columnName":"name","value":"Bob"},{"columnName":"region","value":"US"}],"columnSizeInBytes":[{"columnName":"event_time","value":30},{"columnName":"id","value":12},{"columnName":"name","value":26},{"columnName":"region","value":22}],"dataset":{"databaseName":"testdb","tableName":"partition_stats_test","clusterName":"LocalHadoopCluster","tableMetadataLocation":"/data/openhouse/testdb/partition_stats_test-088a1368-1212-49b1-b3d9-b6cabdec290e","partitionSpec":"[\n 1000: event_time_day: day(4)\n]"},"commitMetadata":{"commitId":5642811578603876150,"commitTimestampMs":1765444061,"commitAppId":"local-1765443996768","commitAppName":"Spark shell","commitOperation":"APPEND"},"eventTimestampMs":1765444129782},{"partitionData":[{"columnName":"event_time_day","value":"2024-01-02"}],"rowCount":2,"columnCount":4,"nullCount":[{"columnName":"event_time","value":0},{"columnName":"id","value":0},{"columnName":"name","value":0},{"columnName":"region","value":0}],"nanCount":[{"columnName":"event_time","value":0},{"columnName":"id","value":0},{"columnName":"name","value":0},{"columnName":"region","value":0}],"minValue":[{"columnName":"event_time","value":"2024-01-02 10:00:00.0"},{"columnName":"id","value":3},{"columnName":"name","value":"Charlie"},{"columnName":"region","value":"APAC"}],"maxValue":[{"columnName":"event_time","value":"2024-01-02 11:00:00.0"},{"columnName":"id","value":4},{"columnName":"name","value":"David"},{"columnName":"region","value":"US"}],"columnSizeInBytes":[{"columnName":"event_time","value":30},{"columnName":"id","value":12},{"columnName":"name","value":30},{"columnName":"region","value":24}],"dataset":{"databaseName":"testdb","tableName":"partition_stats_test","clusterName":"LocalHadoopCluster","tableMetadataLocation":"/data/openhouse/testdb/partition_stats_test-088a1368-1212-49b1-b3d9-b6cabdec290e","partitionSpec":"[\n 1000: event_time_day: day(4)\n]"},"commitMetadata":{"commitId":7929592344081159299,"commitTimestampMs":1765444064,"commitAppId":"local-1765443996768","commitAppName":"Spark shell","commitOperation":"APPEND"},"eventTimestampMs":1765444129782},{"partitionData":[{"columnName":"event_time_day","value":"2024-01-03"}],"rowCount":1,"columnCount":4,"nullCount":[{"columnName":"event_time","value":0},{"columnName":"id","value":0},{"columnName":"name","value":0},{"columnName":"region","value":0}],"nanCount":[{"columnName":"event_time","value":0},{"columnName":"id","value":0},{"columnName":"name","value":0},{"columnName":"region","value":0}],"minValue":[{"columnName":"event_time","value":"2024-01-03 
10:00:00.0"},{"columnName":"id","value":5},{"columnName":"name","value":"Eve"},{"columnName":"region","value":"EU"}],"maxValue":[{"columnName":"event_time","value":"2024-01-03 10:00:00.0"},{"columnName":"id","value":5},{"columnName":"name","value":"Eve"},{"columnName":"region","value":"EU"}],"columnSizeInBytes":[{"columnName":"event_time","value":15},{"columnName":"id","value":6},{"columnName":"name","value":12},{"columnName":"region","value":11}],"dataset":{"databaseName":"testdb","tableName":"partition_stats_test","clusterName":"LocalHadoopCluster","tableMetadataLocation":"/data/openhouse/testdb/partition_stats_test-088a1368-1212-49b1-b3d9-b6cabdec290e","partitionSpec":"[\n 1000: event_time_day: day(4)\n]"},"commitMetadata":{"commitId":8368973829645132323,"commitTimestampMs":1765444066,"commitAppId":"local-1765443996768","commitAppName":"Spark shell","commitOperation":"APPEND"},"eventTimestampMs":1765444129782}] ``` **Key Points:** - ✅ All 3 partition stats published successfully - ✅ Complete column-level metrics: nullCount, nanCount, minValue, maxValue, columnSizeInBytes - ✅ Partition data correctly captured (event_time_day: 2024-01-01, 2024-01-02, 2024-01-03) - ✅ Row counts accurate: 2, 2, 1 for respective partitions - ✅ Min/max values correctly computed per partition (Alice/Bob, Charlie/David, Eve) - ✅ Commit metadata properly associated with each partition stat - ✅ Latest commit info reflects the commit that created/modified each partition ### 6. Job Completion ``` 2025-12-11 09:08:59 INFO OperationTask:233 - Finished job for entity TableMetadata(super=Metadata(creator=openhouse), dbName=testdb, tableName=partition_stats_test, ...): JobId TABLE_STATS_COLLECTION_testdb_partition_stats_test_83a5ebff-d232-4217-97d9-6a1da8881ddd, executionId 0, runTime 37322, queuedTime 13259, state SUCCEEDED ``` **Key Points:** - ✅ Job completed successfully: state SUCCEEDED - ✅ Total runtime: 37.3 seconds (including scheduler overhead) - ✅ Collection time: 23.1 seconds (synchronous execution) - ✅ All 4 publishing methods executed without errors This Output section: ✅ Shows all 4 publish methods (stats, commit events, partition events, partition stats) ✅ Includes actual log output with JSON data ✅ Highlights the sequential execution pattern ✅ Provides key validation points for each publish method ✅ Demonstrates successful end-to-end execution ✅ Uses your actual Docker test logs ## Key Features: ### 1. Synchronous Sequential Execution - All 4 collection tasks execute one after another in a predictable order: 1. Table Stats Collection 2. Commit Events Collection 3. Partition Events Collection 4. Partition Stats Collection - Each task waits for the previous to complete before starting - No CompletableFuture or parallel processing complexity - Example execution: Task 1 (9.7s) → Task 2 (2.3s) → Task 3 (3.3s) → Task 4 (7.9s) = 23.1s total ### 2. Predictable Execution Order - Guaranteed sequential execution eliminates race conditions - Resources allocated and released in a controlled manner - Easier to debug with clear execution timeline - Simplified error handling - failures don't affect parallel tasks ### 3. Maintained Data Collection Functionality - ✅ Table stats collected and published (IcebergTableStats) - ✅ Commit events collected and published (CommitEventTable) - ✅ Partition events collected and published (CommitEventTablePartitions) - ✅ Partition stats collected and published (CommitEventTablePartitionStats) - All existing functionality preserved with synchronous execution pattern ### 4. 
Robust Error Handling
- ✅ Null/empty results handled gracefully for each task
- ✅ Publishing skipped if collection fails or returns no data
- ✅ Unpartitioned tables handled correctly (empty partition events/stats)
- ✅ Each task logs start/completion with timing information
- ✅ Failures in one task don't impact subsequent tasks

### 5. Performance Trade-off Accepted
- **Sequential execution:** ~23 seconds (4 tasks in series)
- **Previous parallel execution:** ~14 seconds (estimated)
- **Trade-off justification:**
  - Resolves downstream repository execution errors
  - Can be optimized later if needed without changing the API

### 6. Comprehensive Timing Metrics
- Individual task timing logged: "Completed [task] for table: [name] in [ms] ms"
- Total collection time logged: "Total collection time for table: [name] in [ms] ms"
- No misleading "parallel execution" message
- Clear visibility into where time is spent

## Changes
- [x] New Features
- [x] Tests

## Testing Done
- [x] Manually tested on local docker setup.
- [x] Added new tests for the changes made.
- [x] Updated existing tests to reflect the changes made.

Co-authored-by: srawat <srawat@linkedin.com>
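For context, a minimal sketch of the per-task timing pattern reflected in the log lines quoted above; the class, method, and messages are illustrative, not the actual TableStatsCollectionSparkApp code:

```java
final class TimedTask {
  // Runs one collection task and logs start/completion with elapsed time, mirroring the
  // "Starting ..." / "Completed ... in N ms" messages shown in the verification logs.
  static void run(String taskName, String tableName, Runnable task) {
    long start = System.currentTimeMillis();
    System.out.printf("Starting %s for table: %s%n", taskName, tableName);
    task.run();
    System.out.printf(
        "Completed %s for table: %s in %d ms%n",
        taskName, tableName, System.currentTimeMillis() - start);
  }
}
```

The four tasks are then invoked one after another with a wrapper like this, so each task waits for the previous one to finish before starting.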
…nkedin#433)

## Summary
Add an option for derived classes of `BaseClass` to specify a truststore location when sending job status to HTS. This is needed to support the Java job ecosystem, where jobs can report their job status back to the OpenHouse JobService.

## Changes
- [x] Internal API Changes

## Testing Done
- [x] Some other form of testing like staging or soak time in production: tested with a snapshot build on test jobs.
…metric (linkedin#432)

## Summary
The critical latency metric is pinned at 30s, when in reality the latency is as high as 5 minutes. The incorrect latency was misinterpreted during an investigation and led to wrong conclusions.

Above: the log-based metric shows the real latency. Below: the Prometheus-based metric shows p99 capped at 30s.
<img width="857" height="551" alt="image" src="https://github.com/user-attachments/assets/600a2214-3d05-40f9-90e6-1939c3919268" />

This change configures the histogram buckets to extend to 600 seconds for the `catalog_metadata_retrieval_latency` metric, enabling accurate capture of long-running metadata operations.

## Changes
- [x] Bug Fixes
- [x] Tests

Added `maximum-expected-value.catalog_metadata_retrieval_latency=600s` to application.properties and a new Spring Boot integration test to verify the configuration.

## Testing Done
- [x] Added new tests for the changes made.

Added `MetricsHistogramConfigurationTest`, which verifies:
- The 600s max expected value configuration is set
- Percentiles histogram is enabled
- MeterRegistry is a PrometheusMeterRegistry with histogram buckets
- Histogram buckets extend to 600s (verified via `Timer.takeSnapshot().histogramCounts()`)
- The configuration value is parseable as a Duration

```
./gradlew :services:tables:test --tests "MetricsHistogramConfigurationTest"
```
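For illustration, a programmatic Micrometer equivalent of the property-based change described above; the actual fix is the application.properties entry, not code, and the helper name here is an assumption:

```java
import java.time.Duration;
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;

class LatencyTimers {
  // Publishes a percentile histogram whose buckets extend to 600s, so p99 is no longer
  // capped at the default maximum expected value.
  static Timer metadataRetrievalLatency(MeterRegistry registry) {
    return Timer.builder("catalog_metadata_retrieval_latency")
        .publishPercentileHistogram()
        .maximumExpectedValue(Duration.ofSeconds(600))
        .register(registry);
  }
}
```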
…ds (linkedin#424)

## Summary
Enable the Gradle build cache for faster incremental and repeated builds. Adding `org.gradle.caching=true` allows Gradle to reuse task outputs from previous builds.

**Build time improvement** (`./gradlew clean build -x test`, run twice):

| | 1st Clean Build | 2nd Clean Build | Tasks from Cache |
|--|-----------------|-----------------|------------------|
| Before (no cache) | 295s (4m 55s) | 298s (4m 58s) | 0 |
| After (with cache) | 323s (5m 23s) | 281s (4m 41s) | 52 |
| **Improvement** | | **-17s (6% faster)** | |

## Changes
- [x] Performance Improvements

### Performance Improvements
Added `org.gradle.caching=true` to `gradle.properties` to enable local build caching.

**Before** (no caching, second build same as first):
```
BUILD SUCCESSFUL in 4m 58s
254 actionable tasks: 254 executed
```

**After** (with caching, second build reuses cached outputs):
```
BUILD SUCCESSFUL in 4m 41s
254 actionable tasks: 192 executed, 52 from cache, 10 up-to-date
```

## Testing Done
- [x] Manually tested on local docker setup.
- [x] No tests added or updated: this is a build infrastructure change that doesn't affect runtime behavior.

### Manual Testing
**Before (main branch, no cache):**
1. `./gradlew clean build -x test` → 4m 55s
2. `./gradlew clean build -x test` → 4m 58s (no improvement)

**After (with cache enabled):**
1. `rm -rf ~/.gradle/caches/build-cache-*` (clear cache)
2. `./gradlew clean build -x test` → 5m 23s (populate cache)
3. `./gradlew clean build -x test` → 4m 41s (52 tasks from cache)

# Additional Information
**Future enhancement:** Configure a remote build cache for CI to share cached outputs across builds.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
…nkedin#419)

## Summary
Share the OpenAPI generator JAR across client modules to reduce build time by 10%. Previously, each client module (`hts`, `tableclient`, `jobsclient`) downloaded the OpenAPI generator CLI JAR (~24MB) to its own `$buildDir/bin` directory. This change uses a shared location with file locking to ensure only one download occurs.

**Build time improvement** (`./gradlew clean build -x test`):

| | Time |
|--|------|
| Before | 311s (5m 10s) |
| After | 280s (4m 39s) |
| **Improvement** | **-31s (10% faster)** |

## Changes
- [x] Performance Improvements

### Performance Improvements
- Modified `client/common/codegen.build.gradle` to use a shared JAR location (`${rootProject.buildDir}/openapi-cli`)
- Updated `client/common/jar_download.sh` with:
  - Portable file locking using `mkdir` (works on Linux/macOS)
  - Download to a temp file, then atomic `mv` to the final location
  - Proper error handling and lock cleanup

**Before** (3 separate downloads):
```
> Task :client:hts:setUp
Downloading openapi generator JAR in bin folder .../build/hts/bin if needed...
> Task :client:jobsclient:setUp
Downloading openapi generator JAR in bin folder .../build/jobsclient/bin if needed...
> Task :client:tableclient:setUp
Downloading openapi generator JAR in bin folder .../build/tableclient/bin if needed...
```

**After** (1 shared download):
```
> Task :client:hts:setUp
Downloading openapi generator JAR in bin folder .../build/openapi-cli if needed...
Downloading openapi-generator-cli-5.3.0.jar...
> Task :client:jobsclient:setUp
openapi-generator-cli-5.3.0.jar exists.
> Task :client:tableclient:setUp
openapi-generator-cli-5.3.0.jar exists.
```

## Testing Done
- [x] Manually tested on local docker setup.
- [x] No tests added or updated: this is a build infrastructure change that doesn't affect runtime behavior. The optimization is validated by the build output showing "exists" messages for subsequent client modules.

### Manual Testing
1. Verified the JAR is downloaded once to the shared location and reused by the other client modules
2. Verified file locking works correctly with parallel builds (concurrent tasks wait for the download)
3. Measured before/after build times with `./gradlew clean build -x test`

🤖 Generated with [Claude Code](https://claude.com/claude-code)
…licit dependencies (linkedin#420)

## Summary
Fix port conflicts in OpenAPI spec generation to enable parallel builds, reducing build time by 50%. Previously, all three services (tables, housetables, jobs) would start Spring Boot on the same default port (8080) during OpenAPI spec generation. In parallel builds, this caused port conflicts and incorrect API specs, leading to compilation failures.

**Build time improvement** (`./gradlew clean build -x test --parallel`):

| | Time |
|--|------|
| Before | BUILD FAILED (port conflicts) |
| After | 156s (2m 35s) |
| **vs Sequential** | **314s → 156s (-50% faster)** |

## Changes
- [x] Bug Fixes
- [x] Performance Improvements

### Bug Fix & Performance Improvements
- Configured unique ports for each service's OpenAPI spec generation:
  - Tables service: port 8000
  - HouseTables service: port 8001
  - Jobs service: port 8002
- Added an explicit `dependsOn configurations.runtimeClasspath` to the `dummytokens:jar` task to fix an implicit dependency warning

**Before** (parallel build fails):
```
FAILURE: Build completed with 2 failures.
1: Task failed with an exception.
* What went wrong:
Execution failed for task ':integrations:java:iceberg-1.2:openhouse-java-runtime:compileJava'.
> Compilation failed; see the compiler error output for details.
2: Task failed with an exception.
* What went wrong:
Execution failed for task ':integrations:java:iceberg-1.5:openhouse-java-iceberg-1.5-runtime:compileJava'.
> Compilation failed; see the compiler error output for details.
BUILD FAILED in 1m 10s
```

**After** (parallel build succeeds):
```
> Task :services:tables:generateOpenApiDocs
> Task :services:housetables:generateOpenApiDocs
> Task :services:jobs:generateOpenApiDocs
BUILD SUCCESSFUL in 2m 35s
254 actionable tasks: 244 executed, 10 up-to-date
```

## Testing Done
- [x] Manually tested on local docker setup.
- [x] No tests added or updated: this is a build infrastructure change that doesn't affect runtime behavior. The fix is validated by the parallel build completing successfully.

### Manual Testing
1. Verified the parallel build fails on `main` due to port conflicts
2. Verified the parallel build succeeds with the fixes applied
3. Verified the generated OpenAPI specs are correct (tables.json contains Tables API endpoints)
4. Measured before/after build times with `./gradlew clean build -x test --parallel`

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
…nkedin#435)

## Summary
Table policies are currently always written to table properties when updated. In scenarios where we want this behavior to change for specific policies (e.g., Retention), this is not easy to extend in another class. This PR refactors the writing and comparison of policies into its own class so that it is possible to change how table policies are stored. To manage policies at a granular level, incoming policies can be read through `TableDto`, and the policy object can be modified to store only the policies that should be saved.
## Problem & Summary
In the current state, we do not have entity-level metrics for maintenance jobs. All the metrics are aggregates, which does not translate directly into action items. For example, if the number of failed maintenance jobs is 10, there is no indication of which entities the failed jobs correspond to; someone needs to parse the logs to identify which tables are impacted by the failures. This change adds granular, task-level maintenance job metrics to cover such cases.

Added metrics:
* `maintenance_job_triggered` - counter tracking the number of maintenance jobs triggered per entity
* `maintenance_job_skipped` - counter tracking the number of maintenance jobs skipped per entity
* `maintenance_job_completed` - counter tracking the number of maintenance jobs completed per entity, along with the status of the maintenance job

## Changes
- [x] Observability - add new metrics to track maintenance job updates at an entity level

## Testing Done
Adding new metrics; no additional tests added.
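A minimal sketch of how such per-entity counters could be emitted with Micrometer; the metric names come from the list above, while the tag names and the helper class are assumptions:

```java
import io.micrometer.core.instrument.MeterRegistry;

class MaintenanceJobMetrics {
  private final MeterRegistry registry;

  MaintenanceJobMetrics(MeterRegistry registry) {
    this.registry = registry;
  }

  // One counter increment per event, tagged with the entity so failures are attributable.
  void recordTriggered(String entity) {
    registry.counter("maintenance_job_triggered", "entity", entity).increment();
  }

  void recordSkipped(String entity) {
    registry.counter("maintenance_job_skipped", "entity", entity).increment();
  }

  void recordCompleted(String entity, String status) {
    registry.counter("maintenance_job_completed", "entity", entity, "status", status).increment();
  }
}
```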
…warnings (linkedin#421)

## Summary
Enable parallel builds by default so users don't need to pass the `--parallel` flag. This PR adds `org.gradle.parallel=true` to gradle.properties. It also fixes Gradle deprecation warnings related to `mainClassName` and `JavaExec.main`.

> **Note:** This PR builds on linkedin#420 (port conflict fixes) and should be merged after it.

**Build time improvement** (`./gradlew clean build -x test`): same improvement already delivered by linkedin#420.

## Changes
- [x] Performance Improvements
- [x] Refactoring

### Performance Improvements
- Added `org.gradle.parallel=true` to `gradle.properties` to enable parallel builds by default
- Users no longer need to remember to pass the `--parallel` flag

### Refactoring (Deprecation Fixes)
- Fixed deprecated `mainClassName` in `scripts/java/tools/dummytokens/build.gradle` → use `application { mainClass = ... }`
- Fixed deprecated `JavaExec.main` in `integrations/spark/spark-3.1/openhouse-spark-runtime/build.gradle` → use `mainClass`

**Before** (deprecation warnings):
```
The JavaExec.main property has been deprecated. This is scheduled to be removed in Gradle 8.0. Please use the mainClass property instead.
```

**After** (no warnings):
```
BUILD SUCCESSFUL in 2m 35s
254 actionable tasks: 244 executed, 10 up-to-date
```

## Testing Done
- [x] Manually tested on local docker setup.
- [x] No tests added or updated: this is a build infrastructure change that doesn't affect runtime behavior.

### Manual Testing
1. Verified the build runs in parallel by default (no `--parallel` flag needed)
2. Verified the deprecation warnings are fixed with `--warning-mode all`
3. Measured before/after build times

# Additional Information
**Dependencies:** Merge linkedin#420 first (port conflict fixes are required for parallel builds to work correctly).

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
## Summary
This PR adds the capability of certificate-based authentication for MySQL.

## Changes
- [x] New Features

## Testing Done
- [x] Some other form of testing: tested with an internal test cluster setup; the MySQL database connection was successful with SSL certificates.

Co-authored-by: Dushyant Kumar <dukumar@linkedin.biz>
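As a hedged illustration only, a JDBC URL of the kind typically used for certificate-based MySQL authentication with Connector/J; the property names follow MySQL Connector/J conventions, and the host, paths, and password placeholders are not the actual configuration added in this PR:

```java
class MySqlSslExample {
  // Client certificate authenticates the service; the truststore verifies the server certificate.
  static final String JDBC_URL =
      "jdbc:mysql://mysql-host:3306/openhouse"
          + "?sslMode=VERIFY_CA"
          + "&clientCertificateKeyStoreUrl=file:/path/to/client-keystore.p12"
          + "&clientCertificateKeyStorePassword=<keystore-password>"
          + "&trustCertificateKeyStoreUrl=file:/path/to/truststore.p12"
          + "&trustCertificateKeyStorePassword=<truststore-password>";
}
```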
## Summary
Remove shadowJar from the build task to speed up development builds by 56%. The `build.dependsOn shadowJar` was explicitly added but is unnecessary because the Shadow plugin's maven-publish integration already triggers shadowJar when running `publish`. CI workflows are unaffected since `./gradlew publish` runs before Docker builds.

**Build time improvement** (`./gradlew clean build -x test`):

| | Time |
|--|------|
| Before | 314s (5m 13s) |
| After | 137s (2m 16s) |
| **Improvement** | **-177s (56% faster)** |

## Changes
- [x] Performance Improvements
- [x] Refactoring

### Performance Improvements
Removed `tasks.build.dependsOn tasks.shadowJar` from:
- `buildSrc/src/main/groovy/openhouse.apps-spark-common.gradle`
- `tables-test-fixtures/tables-test-fixtures-iceberg-1.2/build.gradle`
- `tables-test-fixtures/tables-test-fixtures-iceberg-1.5/build.gradle`

**Why this is safe:**
- The `publish` task already triggers `shadowJar` via the Shadow plugin's maven-publish integration
- The CI workflow runs `./gradlew publish` before Docker builds
- Tests that depend on `configuration: 'shadow'` still trigger shadowJar for their dependencies

**Before** (shadowJar runs on every build):
```
> Task :apps:openhouse-spark-apps_2.12:shadowJar
> Task :apps:openhouse-spark-apps-1.5_2.12:shadowJar
> Task :tables-test-fixtures:tables-test-fixtures_2.12:shadowJar
...
BUILD SUCCESSFUL in 5m 13s
```

**After** (shadowJar only runs on publish):
```
BUILD SUCCESSFUL in 2m 16s
250 actionable tasks: 244 executed, 6 up-to-date
```

## Testing Done
- [x] Manually tested on local docker setup.
- [x] No tests added or updated: this is a build infrastructure change that doesn't affect runtime behavior.

### Manual Testing
1. Verified `./gradlew clean build -x test` no longer runs shadowJar tasks
2. Verified `./gradlew publish --dry-run` still triggers shadowJar
3. Measured before/after build times

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
## Summary
Adds new Gradle tasks to simplify the Docker-based local development workflow, replacing the manual multi-step process with a single command. The OpenHouse build currently depends on shadowJars, which significantly slows builds (in CI and ELR, so this time compounds). That dependency existed to simplify the local testing UX. As a first step toward removing shadowJars (which cuts the build time in half, from ~5 minutes to ~2 minutes), I've added a one-line command to start the OpenHouse local server, effectively replacing the old shadowJars-based workflow.

### Before (Manual Process)
```bash
# Step 1: Build all JARs; this task explicitly depends on the shadowJars publish step
./gradlew build

# Step 2: Navigate to the recipe directory
cd infra/recipes/docker-compose/oh-hadoop-spark

# Step 3: Build Docker images using the shadowJars from step 1
docker compose build

# Step 4: Start containers
docker compose up -d
```

### After (Single Command)
```bash
./gradlew dockerUp -Precipe=oh-hadoop-spark
```

## New Gradle Tasks

| Task | Description |
|------|-------------|
| `dockerPrereqs` | Builds all JAR files required by Docker images |
| `dockerBuild` | Builds Docker images after ensuring prerequisites |
| `dockerUp` | One-command build and start (JARs → images → containers) |
| `dockerDown` | Stop and remove containers |

### Usage
```bash
# Quick start with lightweight recipe
./gradlew dockerUp -Precipe=oh-only

# Full stack with Spark (default)
./gradlew dockerUp -Precipe=oh-hadoop-spark

# Stop containers
./gradlew dockerDown -Precipe=oh-only
```

## Implementation Details
**Explicit JAR Dependencies**: The `dockerPrereqs` task declares all JAR dependencies explicitly, enabling proper Gradle parallelism:
- Service bootJars: `tables`, `housetables`, `jobs`
- Spark runtime uber JARs: `spark-3.1`, `spark-3.5`
- Spark apps uber JAR
- Utility JAR: `dummytokens`

**Recipe Selection**: Use `-Precipe=<name>` to select a docker-compose recipe:
- `oh-only` - Lightweight, local filesystem (fastest startup)
- `oh-hadoop` - With HDFS
- `oh-hadoop-spark` - Full stack with Spark (default)

**Design Decisions**:
- Tasks are in a separate `docker` group, NOT integrated into `./gradlew build`
- Helpful output messages show service URLs and next steps
- Error handling for invalid recipe names

## Documentation Updates
- **README.md**: Added quick start commands in "Running OpenHouse with Docker Compose"
- **SETUP.md**:
  - New "Quick Start (Recommended)" section at the top
  - Task reference table
  - Restructured with a "Manual Docker Compose (Advanced)" section for users who need fine-grained control

## Test Plan
- [x] Verified the build works from a clean state (no `build/` directory)
- [x] Verified docker tasks are NOT part of `./gradlew build` (independent)
- [x] Verified proper Gradle dependency resolution and parallelism (85 tasks, 66 executed in parallel)
- [x] Verified services start and respond correctly:
  - Tables Service (8000): 200 OK - Create/Read/Delete table API tested
  - HouseTables Service (8001): 200 OK
  - Prometheus (9090): 200 OK
- [x] Verified `dockerDown` properly stops and removes containers

Co-authored-by: Vibe Kanban <noreply@vibekanban.com>
## Summary
This is the initial commit for a Python data loader library for distributed loading of OpenHouse tables. This PR establishes the project structure, core interfaces, and CI integration.

**Key Components**
- `OpenHouseDataLoader` - Main API that creates distributable splits for parallel table loading
- `TableIdentifier` - Identifies tables by database, name, and optional branch
- `DataLoaderSplits` / `DataLoaderSplit` - Iterable splits that can be distributed across workers
- `TableTransformer` / `UDFRegistry` - Extension points for table transformations and UDFs

**Project Setup**
- Python 3.12+ with `uv` for dependency management
- Ruff for linting and formatting
- Makefile with `sync`, `check`, `test`, `all` targets
- Integrated into the `build-run-tests.yml` CI workflow

**Not included**
- Publishing the new Python package to PyPI. That will happen in a later PR.

## Changes
- [x] Client-facing API Changes
- [x] New Features
- [x] Code Style
- [x] Documentation

## Testing Done
- [x] Some other form of testing: I tested by running `make -C integrations/python/dataloader all`. This PR is project setup and interfaces, so no new functionality needs to be tested in this PR.

```bash
uv run ruff check src/ tests/
All checks passed!
uv run ruff format --check src/ tests/
10 files already formatted
uv run pytest
==== test session starts ====
platform darwin -- Python 3.14.0, pytest-9.0.2, pluggy-1.6.0
rootdir: /Users/roreeves/li/openhouse_oss/integrations/python/dataloader
configfile: pyproject.toml
collected 1 item

tests/test_data_loader.py .    [100%]

==== 1 passed in 0.01s ====
```

Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
Co-authored-by: Sumedh Sakdeo <sumedhsakdeo@gmail.com>
## Summary This pull request makes a minor update to the `README.md` file, correcting the GitHub link to point to the project's documentation site instead of the repository page. ## Changes - [ ] Client-facing API Changes - [ ] Internal API Changes - [ ] Bug Fixes - [ ] New Features - [ ] Performance Improvements - [ ] Code Style - [ ] Refactoring - [ ] Documentation - [ ] Tests For all the boxes checked, please include additional details of the changes made in this pull request. ## Testing Done <!--- Check any relevant boxes with "x" --> - [ ] Manually Tested on local docker setup. Please include commands ran, and their output. - [ ] Added new tests for the changes made. - [ ] Updated existing tests to reflect the changes made. - [ ] No tests added or updated. Please explain why. If unsure, please feel free to ask for help. - [ ] Some other form of testing like staging or soak time in production. Please explain. For all the boxes checked, include a detailed description of the testing done for the changes made in this pull request. # Additional Information - [ ] Breaking Changes - [ ] Deprecations - [ ] Large PR broken into smaller PRs, and PR plan linked in the description. For all the boxes checked, include additional details of the changes made in this pull request.
…kedin#442) ## Summary <!--- HINT: Replace #nnn with corresponding Issue number, if you are fixing an existing issue --> Add support for Trino query IDs in commit metadata collection to ensure proper tracking of commits made via Trino queries, in addition to existing Spark application tracking. Previously, the `commitAppId` field only captured Spark application IDs from `spark.app.id` in the commit summary, and `commitAppName` only captured `spark.app.name`. Tables updated via Trino queries store their query IDs under `trino_query_id` instead, resulting in null values for both fields in Trino-based commits. This PR adds fallback logic to capture Trino query IDs in `commitAppId` and sets `commitAppName` to "trino" for Trino-based commits, enabling complete tracking regardless of execution engine. ## Changes - [ ] Client-facing API Changes - [ ] Internal API Changes - [ ] Bug Fixes - [x] New Features - [ ] Performance Improvements - [ ] Code Style - [ ] Refactoring - [ ] Documentation - [ ] Tests For all the boxes checked, please include additional details of the changes made in this pull request. ## Testing Done <!--- Check any relevant boxes with "x" --> - [x] Manually Tested on local docker setup. Please include commands ran, and their output. - [ ] Added new tests for the changes made. - [ ] Updated existing tests to reflect the changes made. - [ ] No tests added or updated. Please explain why. If unsure, please feel free to ask for help. - [ ] Some other form of testing like staging or soak time in production. Please explain. For all the boxes checked, include a detailed description of the testing done for the changes made in this pull request. # Additional Information - [ ] Breaking Changes - [ ] Deprecations - [ ] Large PR broken into smaller PRs, and PR plan linked in the description. For all the boxes checked, include additional details of the changes made in this pull request. --------- Co-authored-by: srawat <srawat@linkedin.com> Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
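The fallback described above can be sketched roughly as follows; this is a minimal illustration rather than the actual implementation. The summary keys `spark.app.id`, `spark.app.name`, and `trino_query_id` come from the description, while the class and method names here are hypothetical.
```java
import java.util.Map;

final class CommitAppInfo {
  private CommitAppInfo() {}

  /** Prefers the Spark application id, falling back to the Trino query id. */
  static String commitAppId(Map<String, String> snapshotSummary) {
    String sparkAppId = snapshotSummary.get("spark.app.id");
    if (sparkAppId != null) {
      return sparkAppId;
    }
    // Trino-written snapshots carry their query id under a different summary key.
    return snapshotSummary.get("trino_query_id");
  }

  /** Prefers the Spark application name; Trino commits are labeled "trino". */
  static String commitAppName(Map<String, String> snapshotSummary) {
    String sparkAppName = snapshotSummary.get("spark.app.name");
    if (sparkAppName != null) {
      return sparkAppName;
    }
    return snapshotSummary.containsKey("trino_query_id") ? "trino" : null;
  }
}
```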
## Summary <!--- HINT: Replace #nnn with corresponding Issue number, if you are fixing an existing issue --> [Issue](https://github.com/linkedin/openhouse/issues/#nnn)] TablesClient uses methods that convert a `TableResponseBody` to a policy. This PR makes these methods protected rather than private so that they can be extended if needed. ## Changes - [ ] Client-facing API Changes - [x] Internal API Changes - [ ] Bug Fixes - [ ] New Features - [ ] Performance Improvements - [ ] Code Style - [ ] Refactoring - [ ] Documentation - [ ] Tests For all the boxes checked, please include additional details of the changes made in this pull request. ## Testing Done <!--- Check any relevant boxes with "x" --> - [ ] Manually Tested on local docker setup. Please include commands ran, and their output. - [ ] Added new tests for the changes made. - [ ] Updated existing tests to reflect the changes made. - [ ] No tests added or updated. Please explain why. If unsure, please feel free to ask for help. - [ ] Some other form of testing like staging or soak time in production. Please explain. For all the boxes checked, include a detailed description of the testing done for the changes made in this pull request. # Additional Information - [ ] Breaking Changes - [ ] Deprecations - [ ] Large PR broken into smaller PRs, and PR plan linked in the description. For all the boxes checked, include additional details of the changes made in this pull request.
## Summary
Add support for publishing the OH dataloader to PyPI on every commit.
## Changes
- [ ] Client-facing API Changes
- [ ] Internal API Changes
- [ ] Bug Fixes
- [x] New Features
- [ ] Performance Improvements
- [ ] Code Style
- [ ] Refactoring
- [x] Documentation
- [ ] Tests
## Testing Done
Tested the Python release workflow locally using `act`:
```
### ✅ Test Python Packages
- Python 3.12.12 setup successful
- uv 0.9.30 installed
- Dependencies synced (51 packages installed)
- Linting: `All checks passed!`
- Formatting: `9 files already formatted`
- Tests: `1 passed in 0.00s`
### ✅ Tag Python Release
- Version extracted: `0.1.0`
- Output set correctly for downstream jobs
### ✅ Discover Python Packages
- Discovered: `["integrations/python/dataloader"]`
### ✅ Build Python Package
- Dependencies synced successfully
- Version updated: `0.1.0`
- Build artifacts created:
- `openhouse_dataloader-0.1.0.tar.gz`
- `openhouse_dataloader-0.1.0-py3-none-any.whl`
- **Twine validation: PASSED** ✅
- Wheel: `PASSED`
- Source dist: `PASSED`
### ⏭️ Publish to PyPI
- Skipped (requires actual GitHub Actions environment)
### Note
- Upload artifacts step fails in `act` (expected - requires `ACTIONS_RUNTIME_TOKEN`)
- All critical build and validation steps pass successfully
```
- [x] Manually Tested on local docker setup. Please include commands
ran, and their output.
- [ ] Added new tests for the changes made.
- [ ] Updated existing tests to reflect the changes made.
- [ ] No tests added or updated. Please explain why. If unsure, please
feel free to ask for help.
- [ ] Some other form of testing like staging or soak time in
production. Please explain.
For all the boxes checked, include a detailed description of the testing
done for the changes made in this pull request.
# Additional Information
- [ ] Breaking Changes
- [ ] Deprecations
- [ ] Large PR broken into smaller PRs, and PR plan linked in the
description.
For all the boxes checked, include additional details of the changes
made in this pull request.
## Summary
This PR adds opt-in HDFS diagnostics for Iceberg FileIO operations and a tables-only runtime logging profile for Hadoop client internals.
The goal is to make HDFS latency/availability issues diagnosable using production logs without invasive code changes.
## What Changed
### Added
- `DiagnosticHadoopFileIO` in `cluster/storage`: wraps `HadoopFileIO` input/output streams and logs:
  - `HDFS_READ` with `total_ms`, `nn_ms` (open stream), `dn_ms` (read path), `bytes`
  - `HDFS_WRITE` with `total_ms`, `create_ms`, `write_ms`, `close_ms`, `bytes`
  - `HDFS_SLOW_NN`, `HDFS_SLOW_DN`, `HDFS_SLOW_CLOSE` warnings for slow-path classification
  - `HDFS_STATS` summaries (hedged threshold, failover backoff, socket timeout, etc.)
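The read-path measurement can be pictured roughly as below. This is a minimal sketch, not the actual `DiagnosticHadoopFileIO`: the class name, threshold value, and exact log format are assumptions; only the `HDFS_READ`/`HDFS_SLOW_NN` field names come from the description above.
```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public final class TimedHdfsRead {
  private static final Logger LOG = LoggerFactory.getLogger(TimedHdfsRead.class);
  private static final long SLOW_NN_THRESHOLD_MS = 500; // illustrative threshold, not from the PR

  /** Reads a file fully and logs an HDFS_READ-style latency breakdown. */
  public static long readAndLog(Configuration conf, Path path) throws IOException {
    long start = System.nanoTime();
    FileSystem fs = path.getFileSystem(conf);
    long bytes = 0;
    long nnNanos;
    try (FSDataInputStream in = fs.open(path)) {
      // Opening the stream is dominated by NameNode work (resolve path, locate blocks).
      nnNanos = System.nanoTime() - start;
      byte[] buffer = new byte[8192];
      int read;
      // Draining the stream is dominated by DataNode work (block reads).
      while ((read = in.read(buffer)) != -1) {
        bytes += read;
      }
    }
    long totalMs = (System.nanoTime() - start) / 1_000_000;
    long nnMs = nnNanos / 1_000_000;
    long dnMs = totalMs - nnMs;
    LOG.info("HDFS_READ path={} total_ms={} nn_ms={} dn_ms={} bytes={}", path, totalMs, nnMs, dnMs, bytes);
    if (nnMs > SLOW_NN_THRESHOLD_MS) {
      LOG.warn("HDFS_SLOW_NN path={} nn_ms={}", path, nnMs);
    }
    return bytes;
  }
}
```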
### Updated
- `FileIOConfig` in internal catalog:
  - New flag `openhouse.hdfs.diagnostic.logging.enabled` (default `false`).
  - When the flag is enabled, the `HdfsFileIO` bean uses `DiagnosticHadoopFileIO`; otherwise behavior remains `HadoopFileIO`.
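Conceptually, the property gate can look like the sketch below. It is illustrative only, assuming a Spring `@Configuration` class and an existing Hadoop `Configuration` bean; the real `FileIOConfig` wiring and the `DiagnosticHadoopFileIO` constructor may differ.
```java
import org.apache.hadoop.conf.Configuration;
import org.apache.iceberg.hadoop.HadoopFileIO;
import org.apache.iceberg.io.FileIO;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.context.annotation.Bean;

@org.springframework.context.annotation.Configuration
class FileIoConfigSketch {

  // Opt-in flag from the PR; defaults to false so existing behavior is unchanged.
  @Value("${openhouse.hdfs.diagnostic.logging.enabled:false}")
  private boolean hdfsDiagnosticLoggingEnabled;

  @Bean
  FileIO hdfsFileIO(Configuration hadoopConf) { // assumes a Hadoop Configuration bean exists
    if (hdfsDiagnosticLoggingEnabled) {
      // Constructor shape is an assumption; the PR's DiagnosticHadoopFileIO lives in cluster/storage.
      return new DiagnosticHadoopFileIO(hadoopConf);
    }
    return new HadoopFileIO(hadoopConf);
  }
}
```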
### Added tables profile file
- `services/tables/src/main/resources/application-hdfs-diagnostics.properties`, enabling DEBUG logging for:
  - `org.apache.hadoop.hdfs.DFSClient`
  - `org.apache.hadoop.hdfs.DFSInputStream`
  - `org.apache.hadoop.hdfs.DFSOutputStream`
  - `org.apache.hadoop.hdfs.DataStreamer`
  - `org.apache.hadoop.io.retry.RetryInvocationHandler`
  - `org.apache.hadoop.ipc.Client`
  - `org.apache.hadoop.hdfs.server.namenode.ha.ObserverReadProxyProvider`
  - `org.apache.hadoop.hdfs.server.namenode.ha.RequestHedgingProxyProvider`
## Why This Helps (Data-Driven HDFS Client Improvements)
This instrumentation allows us to move from anecdotal tuning to measured tuning:
- **Peak latency on `refreshMetadata`/`updateMetadata`:**
  - `HDFS_READ` and `HDFS_WRITE` provide per-call latency decomposition.
  - Each call's time can be attributed to `nn_ms`, `dn_ms`, or `close_ms`.
- **NameNode vs DataNode bottleneck attribution:**
  - `nn_ms` concentration indicates metadata/open path pressure (failover/proxy/RPC path).
  - `dn_ms` concentration indicates block read path pressure (hedged read threshold/pool effectiveness, DN hotspots).
- **NN / ObserverNN / failover behavior during partial outages:**
  - `ObserverReadProxyProvider`, `RequestHedgingProxyProvider`, and `RetryInvocationHandler` expose retry/failover/observer fallback behavior.
  - `HDFS_SLOW_NN` and elevated `nn_ms` reveal observer unavailability or active/standby routing instability.
- **DataNode unavailability / pipeline instability:**
  - `DataStreamer` + `ipc.Client` + `HDFS_SLOW_CLOSE` surface pipeline recovery and ack delays.
In short, this gives direct evidence to tune:
- `dfs.client.failover.sleep.base.millis`
- `dfs.client.failover.sleep.max.millis`
- `dfs.client.hedged.read.threshold.millis`
- `dfs.client.hedged.read.threadpool.size`
- `dfs.client.socket-timeout`
## Runtime Usage
Enable both:
- `openhouse.hdfs.diagnostic.logging.enabled=true`
- `SPRING_PROFILES_ACTIVE=...,hdfs-diagnostics`
## Validation
Build/compile:
- `./gradlew :cluster:storage:compileJava :iceberg:openhouse:internalcatalog:compileJava :services:tables:processResources -x test`
Runtime logger verification (tables):
- Started the service with `--spring.profiles.active=hdfs-diagnostics`.
- Confirmed via `/actuator/loggers/<logger>` that configured and effective levels are `DEBUG` for all target Hadoop categories (see the verification sketch at the end of this description).
## Risk / Rollout
- Opt-in: the FileIO diagnostics flag defaults to `false`, and the Hadoop DEBUG logs are profile-gated.
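For the runtime logger verification above, a small standalone check might look like this. It is a hypothetical helper, not part of the PR: the host, port, and chosen logger name are assumptions, and it only relies on the standard Spring Boot actuator `loggers` endpoint being exposed.
```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public final class LoggerLevelCheck {
  public static void main(String[] args) throws IOException, InterruptedException {
    // Any of the Hadoop categories listed above works here; DFSClient is just an example.
    String logger = "org.apache.hadoop.hdfs.DFSClient";
    HttpRequest request = HttpRequest.newBuilder()
        .uri(URI.create("http://localhost:8000/actuator/loggers/" + logger)) // port is an assumption
        .GET()
        .build();
    HttpResponse<String> response =
        HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString());
    // With the hdfs-diagnostics profile active, the body should report an effective level of DEBUG.
    System.out.println(logger + " -> " + response.body());
  }
}
```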