
Add HDFS diagnostic FileIO and tables diagnostics logging profile#7

Open
cbb330 wants to merge 31 commits into main from chbush/hdfs-diagnostics-logging-profile

Conversation

cbb330 commented Feb 16, 2026

Summary

This PR adds opt-in HDFS diagnostics for Iceberg FileIO operations and a tables-only runtime logging profile for Hadoop client internals.

The goal is to make HDFS latency/availability issues diagnosable using production logs without invasive code changes.

What Changed

  • Added DiagnosticHadoopFileIO in cluster/storage:

    • Wraps Iceberg HadoopFileIO input/output streams.
    • Logs per-operation timing and payload signals:
      • HDFS_READ with total_ms, nn_ms (open stream), dn_ms (read path), bytes
      • HDFS_WRITE with total_ms, create_ms, write_ms, close_ms, bytes
      • HDFS_SLOW_NN, HDFS_SLOW_DN, HDFS_SLOW_CLOSE warnings for slow-path classification
      • periodic HDFS_STATS summaries
    • Logs effective client-side HDFS config (hedged threshold, failover backoff, socket timeout, etc.).
  • Updated FileIOConfig in internal catalog:

    • Added openhouse.hdfs.diagnostic.logging.enabled (default false).
    • When enabled, the HdfsFileIO bean uses DiagnosticHadoopFileIO; otherwise it remains a plain HadoopFileIO.
    • Kept compile-safe Spring configuration reference (no Java import alias).
  • Added tables profile file:

    • services/tables/src/main/resources/application-hdfs-diagnostics.properties
    • Enables DEBUG loggers for:
      • org.apache.hadoop.hdfs.DFSClient
      • org.apache.hadoop.hdfs.DFSInputStream
      • org.apache.hadoop.hdfs.DFSOutputStream
      • org.apache.hadoop.hdfs.DataStreamer
      • org.apache.hadoop.io.retry.RetryInvocationHandler
      • org.apache.hadoop.ipc.Client
      • org.apache.hadoop.hdfs.server.namenode.ha.ObserverReadProxyProvider
      • org.apache.hadoop.hdfs.server.namenode.ha.RequestHedgingProxyProvider
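
The per-operation timing decomposition can be sketched as follows. This is a minimal illustration of the wrapping approach, not the PR's actual DiagnosticHadoopFileIO: the class name, `Supplier`-based opener, and stdout output are invented for the sketch, and the real code would wrap Hadoop's FSDataInputStream and emit through a logger.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.function.Supplier;

// Minimal sketch of the timing-wrapper idea: measure stream-open time
// separately from cumulative read time, then emit one summary on close.
public class TimedReadStream extends InputStream {
  private final InputStream delegate;
  private final long openMs;   // analogous to nn_ms (open stream)
  private long readNanos;      // analogous to dn_ms (read path), pre-division
  private long bytes;

  public TimedReadStream(Supplier<InputStream> opener) {
    long t0 = System.nanoTime();
    this.delegate = opener.get();
    this.openMs = (System.nanoTime() - t0) / 1_000_000;
  }

  @Override
  public int read() {
    long t0 = System.nanoTime();
    try {
      int b = delegate.read();
      if (b >= 0) bytes++;
      return b;
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    } finally {
      readNanos += System.nanoTime() - t0;
    }
  }

  @Override
  public void close() {
    try {
      delegate.close();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    long readMs = readNanos / 1_000_000;
    // Summary line in the spirit of the HDFS_READ signal described above.
    System.out.println("HDFS_READ total_ms=" + (openMs + readMs)
        + " nn_ms=" + openMs + " dn_ms=" + readMs + " bytes=" + bytes);
  }

  public long bytesRead() { return bytes; }
}
```

The key design point is that open latency and read latency are accumulated in separate counters, so slow-path classification (HDFS_SLOW_NN vs HDFS_SLOW_DN) can be done per operation rather than from aggregate timings.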

Why This Helps (Data-Driven HDFS Client Improvements)

This instrumentation allows us to move from anecdotal tuning to measured tuning:

  • Peak latency on refreshMetadata / updateMetadata:

    • These paths are dominated by metadata file reads/writes.
    • HDFS_READ and HDFS_WRITE provide per-call latency decomposition.
    • We can quantify p50/p90/p99 and isolate whether peaks are mostly nn_ms, dn_ms, or close_ms.
  • NameNode vs DataNode bottleneck attribution:

    • High nn_ms concentration indicates metadata/open path pressure (failover/proxy/RPC path).
    • High dn_ms concentration indicates block read path pressure (hedged read threshold/pool effectiveness, DN hotspots).
  • NN / ObserverNN / failover behavior during partial outages:

    • DEBUG logs from ObserverReadProxyProvider, RequestHedgingProxyProvider, and RetryInvocationHandler expose retry/failover/observer fallback behavior.
    • Correlating these with HDFS_SLOW_NN and elevated nn_ms reveals observer unavailability or active/standby routing instability.
  • DataNode unavailability / pipeline instability:

    • DataStreamer + ipc.Client + HDFS_SLOW_CLOSE surface pipeline recovery and ack delays.
    • Supports targeted changes like pipeline recovery toggles and timeout/retry tuning.

In short, this gives direct evidence to tune:

  • dfs.client.failover.sleep.base.millis
  • dfs.client.failover.sleep.max.millis
  • dfs.client.hedged.read.threshold.millis
  • dfs.client.hedged.read.threadpool.size
  • dfs.client.socket-timeout
  • write pipeline recovery behavior
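
For reference, these knobs live in the client-side HDFS configuration (e.g. hdfs-site.xml or per-FileSystem conf overrides). The values below are illustrative starting points near stock Hadoop defaults, not recommendations; the point of the instrumentation is to replace them with measured values.

```xml
<!-- Illustrative values only; tune from measured nn_ms / dn_ms data. -->
<property>
  <name>dfs.client.failover.sleep.base.millis</name>
  <value>500</value>
</property>
<property>
  <name>dfs.client.failover.sleep.max.millis</name>
  <value>15000</value>
</property>
<property>
  <name>dfs.client.hedged.read.threshold.millis</name>
  <value>500</value>
</property>
<property>
  <name>dfs.client.hedged.read.threadpool.size</name>
  <value>5</value>
</property>
<property>
  <name>dfs.client.socket-timeout</name>
  <value>60000</value>
</property>
```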

Runtime Usage

Enable both:

  • FileIO diagnostics:
    • openhouse.hdfs.diagnostic.logging.enabled=true
  • Tables logging profile:
    • SPRING_PROFILES_ACTIVE=...,hdfs-diagnostics
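
Putting the two together, a launch might look like the following. The jar name and base profile list are placeholders for your deployment; only the property name and the `hdfs-diagnostics` profile name come from this PR.

```shell
# Enable the FileIO diagnostics property and the logging profile together.
export SPRING_PROFILES_ACTIVE="default,hdfs-diagnostics"
java -jar tables.jar --openhouse.hdfs.diagnostic.logging.enabled=true
```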

Validation

  • Build/compile:

    • ./gradlew :cluster:storage:compileJava :iceberg:openhouse:internalcatalog:compileJava :services:tables:processResources -x test
  • Runtime logger verification (tables):

    • Started tables with --spring.profiles.active=hdfs-diagnostics
    • Verified via /actuator/loggers/<logger> that configured and effective levels are DEBUG for all target Hadoop categories.
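
The logger check uses Spring Boot's standard actuator loggers endpoint; host and port here are placeholders for the running tables instance.

```shell
# Expect configuredLevel/effectiveLevel of DEBUG when the profile is active.
curl -s http://localhost:8080/actuator/loggers/org.apache.hadoop.hdfs.DFSClient
```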

Risk / Rollout

  • Default-off for FileIO diagnostics (false) and profile-gated for Hadoop DEBUG logs.
  • Main risk is log-volume increase when enabled; recommended only for controlled diagnostic windows or incident investigation.
  • Rollback is trivial: disable property/profile.

dushyantk1509 and others added 30 commits January 2, 2026 17:46
## Summary
Unit tests in `OperationTests` are failing due to a time zone difference. The Spark session uses the system time zone by default (IST in my case), while the Iceberg partition transform and `System.currentTimeMillis` use UTC. This was causing snapshot-expiration-related unit test failures.

## Changes

- [ ] Client-facing API Changes
- [ ] Internal API Changes
- [X] Bug Fixes: Set spark session timezone as UTC to fix unit tests.
- [ ] New Features
- [ ] Performance Improvements
- [ ] Code Style
- [ ] Refactoring
- [ ] Documentation
- [ ] Tests

For all the boxes checked, please include additional details of the
changes made in this pull request.

## Testing Done

- [ ] Manually Tested on local docker setup. Please include commands
ran, and their output.
- [ ] Added new tests for the changes made.
- [X] Updated existing tests to reflect the changes made.
- [ ] No tests added or updated. Please explain why. If unsure, please
feel free to ask for help.
- [ ] Some other form of testing like staging or soak time in
production. Please explain.

For all the boxes checked, include a detailed description of the testing
done for the changes made in this pull request.

# Additional Information

- [ ] Breaking Changes
- [ ] Deprecations
- [ ] Large PR broken into smaller PRs, and PR plan linked in the
description.

For all the boxes checked, include additional details of the changes
made in this pull request.

Co-authored-by: Dushyant Kumar <dukumar@linkedin.biz>
## Summary
- Fix stale snapshot detection during concurrent modifications to return
HTTP 409 (Conflict) instead of HTTP 400 (Bad Request)
- Reclassify `ValidationException` with stale snapshot message to
`CommitFailedException` (409) to allow client retry
- Ensure other ValidationException instances are handled as HTTP 400 Bad
Request responses (e.g., attempting to delete a non-existent snapshot)

## Problem
When concurrent modifications occur during a transaction commit:
1. Client builds snapshots based on table version N (e.g.,
`lastSequenceNumber = 4`)
2. Client sends commit request with these snapshots in
`SNAPSHOTS_JSON_KEY`
3. Meanwhile, another process commits version N+1 (e.g.,
`lastSequenceNumber = 5`)
4. Server calls `doRefresh()` which updates `current()` to version N+1
5. **Bug:** The snapshots in `SNAPSHOTS_JSON_KEY` are now stale (their
sequence numbers are based on version N)
6. Iceberg's `TableMetadata.addSnapshot()` throws `ValidationException`
→ mapped to 400 Bad Request
7. Should return 409 Conflict so clients know to refresh and retry

## Solution
Let Iceberg's existing validation detect sequence number conflicts, then
catch the `ValidationException` and reclassify it as
`CommitFailedException` for the specific stale snapshot error pattern:

```java
} catch (ValidationException e) {
  // Stale snapshot errors are retryable - client should refresh and retry
  if (isStaleSnapshotError(e)) {
    throw new CommitFailedException(e);
  }
  throw new BadRequestException(e, e.getMessage());
}
```

This approach is simpler than pre-checking and leverages Iceberg's
existing validation.
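
A minimal sketch of what the detection helper could look like; the exact message pattern matched by the PR's `isStaleSnapshotError` is an assumption here, inferred from Iceberg's "older than last sequence number" validation wording.

```java
// Hypothetical sketch of isStaleSnapshotError: classify a ValidationException
// as retryable only when its message matches the stale-sequence-number
// pattern. The matched substring is an assumption, not the PR's exact check.
public class StaleSnapshotClassifier {
  static boolean isStaleSnapshotError(Exception e) {
    String msg = e.getMessage();
    return msg != null && msg.contains("older than last sequence number");
  }
}
```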

## Test Plan
- [x] Unit test `testStaleSnapshotErrorDetection()` verifies error
detection logic
- [x] All existing internalcatalog tests pass
- [ ] Integration testing in staging environment

---------

Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
## Summary


The uploader could not discover partitions in the backup folder without data_manifest.json, so orphan files stayed in the backup folder for a long time. This PR makes OFD purge that data when data_manifest.json does not exist.

## Changes

- [ ] Client-facing API Changes
- [ ] Internal API Changes
- [x] Bug Fixes
- [ ] New Features
- [ ] Performance Improvements
- [ ] Code Style
- [ ] Refactoring
- [ ] Documentation
- [ ] Tests

## Testing Done

- [ ] Manually Tested on local docker setup. Please include commands
ran, and their output.
- [x] Added new tests for the changes made.
- [x] Updated existing tests to reflect the changes made.
- [ ] No tests added or updated. Please explain why. If unsure, please
feel free to ask for help.
- [ ] Some other form of testing like staging or soak time in
production. Please explain.


# Additional Information

- [ ] Breaking Changes
- [ ] Deprecations
- [ ] Large PR broken into smaller PRs, and PR plan linked in the
description.

Refactors TableStatsCollectorUtil by extracting reusable helper methods
from the populateCommitEventTablePartitions implementation. This
improves code organization, testability, and enables future code reuse
without changing any functionality.

## Summary


This is a pure refactoring PR that extracts well-designed, reusable
helper methods from inline code in
populateCommitEventTablePartitions. The goal is to:

- Improve code organization and readability
- Create reusable building blocks for future features
- Reduce code duplication
- No functional changes - behavior remains identical.

## Changes

- [ ] Client-facing API Changes
- [ ] Internal API Changes
- [ ] Bug Fixes
- [ ] New Features
- [ ] Performance Improvements
- [ ] Code Style
- [x] Refactoring
- [ ] Documentation
- [ ] Tests

## Testing Done

- [x] Manually Tested on local docker setup. Please include commands
ran, and their output.
- [ ] Added new tests for the changes made.
- [ ] Updated existing tests to reflect the changes made.
- [ ] No tests added or updated. Please explain why. If unsure, please
feel free to ask for help.
- [ ] Some other form of testing like staging or soak time in
production. Please explain.


# Additional Information

- [ ] Breaking Changes
- [ ] Deprecations
- [ ] Large PR broken into smaller PRs, and PR plan linked in the
description.


---------

Co-authored-by: srawat <srawat@linkedin.com>
## Summary


Update iceberg to the latest version.

## Changes

- [ ] Client-facing API Changes
- [ ] Internal API Changes
- [ ] Bug Fixes
- [ ] New Features
- [ ] Performance Improvements
- [ ] Code Style
- [ ] Refactoring
- [ ] Documentation
- [ ] Tests

## Testing Done

- [ ] Manually Tested on local docker setup. Please include commands
ran, and their output.
- [ ] Added new tests for the changes made.
- [ ] Updated existing tests to reflect the changes made.
- [ ] No tests added or updated. Please explain why. If unsure, please
feel free to ask for help.
- [ ] Some other form of testing like staging or soak time in
production. Please explain.


# Additional Information

- [ ] Breaking Changes
- [ ] Deprecations
- [ ] Large PR broken into smaller PRs, and PR plan linked in the
description.

## Summary

Add arg to run OFD delete in parallel.

## Changes

- [ ] Client-facing API Changes
- [x] Internal API Changes
- [ ] Bug Fixes
- [ ] New Features
- [x] Performance Improvements
- [ ] Code Style
- [ ] Refactoring
- [ ] Documentation
- [ ] Tests

## Testing Done

- [ ] Manually Tested on local docker setup. Please include commands
ran, and their output.
- [ ] Added new tests for the changes made.
- [x] Updated existing tests to reflect the changes made.
- [ ] No tests added or updated. Please explain why. If unsure, please
feel free to ask for help.
- [ ] Some other form of testing like staging or soak time in
production. Please explain.


# Additional Information

- [ ] Breaking Changes
- [ ] Deprecations
- [ ] Large PR broken into smaller PRs, and PR plan linked in the
description.

… user. (linkedin#429)

## Summary

The table configuration `write.metadata.previous-versions-max` in Apache Iceberg sets the maximum number of old metadata files to keep before they are potentially deleted after a new commit. In the OpenHouse catalog it was always hardcoded to 168, regardless of the value the user defined in the table properties. For streaming applications, which commit frequently (every 5 minutes), it is essential to let users override this configuration so they can run time-travel queries over a longer window and roll back to earlier snapshots if needed.

To support this, the patch sets `write.metadata.previous-versions-max` to the default value of 168 only if the user has not defined it in their table properties.

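The defaulting behavior the patch describes can be sketched as follows; the class and method names here are illustrative, not the actual OpenHouse code.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of "default only if unset": the catalog-side default of 168 is
// applied only when the user has not set the property themselves.
public class MetadataVersionsDefault {
  static final String KEY = "write.metadata.previous-versions-max";
  static final String DEFAULT_VALUE = "168";

  static Map<String, String> withDefaults(Map<String, String> userProps) {
    Map<String, String> out = new HashMap<>(userProps);
    out.putIfAbsent(KEY, DEFAULT_VALUE);  // keep the user's value if present
    return out;
  }
}
```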

## Changes

- [ ] Client-facing API Changes
- [ ] Internal API Changes
- [ ] Bug Fixes
- [ ] New Features
- [ ] Performance Improvements
- [ ] Code Style
- [ ] Refactoring
- [ ] Documentation
- [ ] Tests

## Testing Done

Added unit tests for changes introduced. 

- [ ] Manually Tested on local docker setup. Please include commands
ran, and their output.
- [x] Added new tests for the changes made.
- [x] Updated existing tests to reflect the changes made.
- [ ] No tests added or updated. Please explain why. If unsure, please
feel free to ask for help.
- [ ] Some other form of testing like staging or soak time in
production. Please explain.


# Additional Information

- [ ] Breaking Changes
- [ ] Deprecations
- [ ] Large PR broken into smaller PRs, and PR plan linked in the
description.

## Summary


Update iceberg to the latest version.

## Changes

- [ ] Client-facing API Changes
- [x] Internal API Changes
- [ ] Bug Fixes
- [ ] New Features
- [ ] Performance Improvements
- [ ] Code Style
- [ ] Refactoring
- [ ] Documentation
- [ ] Tests

## Testing Done

./gradlew clean && ./gradlew build

- [ ] Manually Tested on local docker setup. Please include commands
ran, and their output.
- [ ] Added new tests for the changes made.
- [ ] Updated existing tests to reflect the changes made.
- [ ] No tests added or updated. Please explain why. If unsure, please
feel free to ask for help.
- [ ] Some other form of testing like staging or soak time in
production. Please explain.


# Additional Information

- [ ] Breaking Changes
- [ ] Deprecations
- [ ] Large PR broken into smaller PRs, and PR plan linked in the
description.

…rtition-level statistics collection and publishing for tables in TableStatsCollectionSparkApp (linkedin#413)

## Summary


I extended the existing TableStatsCollectionSparkApp to implement the
logic for populating the OpenhouseCommitEventTablePartitionStats table.

This new table will serve as the partition-level source of truth for
statistics and commit metadata across all OpenHouse datasets. The table
contains exactly one row per partition, where the commit metadata
reflects the latest commit that modified that partition. Each record
includes:

1. **Commit Metadata** (from the latest commit that changed the
respective partition)
   - Commit ID (snapshot_id)
   - Commit timestamp (committed_at)
   - Commit App Id (spark app id)
   - Commit App Name (spark app name)
   - Commit operation (APPEND, DELETE, OVERWRITE, REPLACE)
2. **Table Identifier** (database, table, cluster, location, partition
spec)
3. **Partition Data** (typed column values for all partition columns)
4. **Table Level Stats** (rowCount, columnCount)
5. **Field/Column Level Stats** (nullCount, nanCount, minValue,
maxValue, columnSizeInBytes)

This enables granular partition-level analytics and monitoring,
providing:

1. **Partition-level statistics** - Access detailed metrics (row counts,
column stats) for each partition
2. **Latest state tracking** - Know the current state of each partition
and when it was last modified
3. **Fine-grained monitoring** - Monitor data quality and distribution
at partition granularity
4. **Optimized queries** - Identify partitions to scan based on min/max
values and data freshness
5. **Data profiling** - Analyze data characteristics (nulls, NaNs, size)
per partition for optimization
6. **Incremental processing** - Efficiently identify which partitions
contain relevant data for downstream pipelines

## Output

This PR ensures the TableStatsCollectionSparkApp executes all 4
collection tasks (table stats, commit events, partition events, and
partition stats) synchronously while maintaining complete data
collection and publishing functionality.

**End-to-End Verification (Docker)**
### 1. Sequential Execution Timeline

```
25/12/11 09:08:26 INFO spark.TableStatsCollectionSparkApp: Starting table stats collection for table: testdb.partition_stats_test
25/12/11 09:08:36 INFO spark.TableStatsCollectionSparkApp: Completed table stats collection for table: testdb.partition_stats_test in 9694 ms
25/12/11 09:08:36 INFO spark.TableStatsCollectionSparkApp: Starting commit events collection for table: testdb.partition_stats_test
25/12/11 09:08:38 INFO spark.TableStatsCollectionSparkApp: Completed commit events collection for table: testdb.partition_stats_test (3 events) in 2258 ms
25/12/11 09:08:38 INFO spark.TableStatsCollectionSparkApp: Starting partition events collection for table: testdb.partition_stats_test
25/12/11 09:08:41 INFO spark.TableStatsCollectionSparkApp: Completed partition events collection for table: testdb.partition_stats_test (3 partition events) in 3282 ms
25/12/11 09:08:41 INFO spark.TableStatsCollectionSparkApp: Starting partition stats collection for table: testdb.partition_stats_test
25/12/11 09:08:49 INFO spark.TableStatsCollectionSparkApp: Completed partition stats collection for table: testdb.partition_stats_test (3 partition stats) in 7895 ms
25/12/11 09:08:49 INFO spark.TableStatsCollectionSparkApp: Total collection time for table: testdb.partition_stats_test in 23137 ms
```

**Key Points:**
- ✅ Tasks execute sequentially (no overlapping timestamps)
- ✅ Each task starts immediately after previous completes
- ✅ Total time = sum of individual tasks (9.7s + 2.3s + 3.3s + 7.9s =
23.1s)
- ✅ No "parallel execution" in log message (synchronous pattern
confirmed)


### 2. publishStats Log Output

```
25/12/11 09:08:49 INFO spark.TableStatsCollectionSparkApp: Publishing stats for table: testdb.partition_stats_test
25/12/11 09:08:49 INFO spark.TableStatsCollectionSparkApp: {"databaseName":"testdb","tableName":"partition_stats_test","clusterName":"LocalHadoopCluster","tableMetadataLocation":"/data/openhouse/testdb/partition_stats_test-088a1368-1212-49b1-b3d9-b6cabdec290e","creationTimeMs":1765444016084,"numFiles":3,"sizeInBytes":3045,"numRows":5,"numColumns":4,"numPartitions":3,"earliestPartitionDate":"2024-01-01"}
```


**Key Points:**
- ✅ Table-level stats published successfully
- ✅ numRows: 5 (total across all partitions)
- ✅ numPartitions: 3 (2024-01-01, 2024-01-02, 2024-01-03)
- ✅ earliestPartitionDate: "2024-01-01" (correctly identified)
- ✅ Table metadata and size metrics populated

### 3. publishCommitEvents Log Output

```
25/12/11 09:08:49 INFO spark.TableStatsCollectionSparkApp: Publishing commit events for table: testdb.partition_stats_test 25/12/11 09:08:49 INFO spark.TableStatsCollectionSparkApp: [{"dataset":{"databaseName":"testdb","tableName":"partition_stats_test","clusterName":"LocalHadoopCluster","tableMetadataLocation":"/data/openhouse/testdb/partition_stats_test-088a1368-1212-49b1-b3d9-b6cabdec290e","partitionSpec":"[\n 1000: event_time_day: day(4)\n]"},"commitMetadata":{"commitId":5642811578603876150,"commitTimestampMs":1765444061000,"commitAppId":"local-1765443996768","commitAppName":"Spark shell","commitOperation":"APPEND"},"eventTimestampMs":1765444129777},{"dataset":{"databaseName":"testdb","tableName":"partition_stats_test","clusterName":"LocalHadoopCluster","tableMetadataLocation":"/data/openhouse/testdb/partition_stats_test-088a1368-1212-49b1-b3d9-b6cabdec290e","partitionSpec":"[\n 1000: event_time_day: day(4)\n]"},"commitMetadata":{"commitId":7929592344081159299,"commitTimestampMs":1765444064000,"commitAppId":"local-1765443996768","commitAppName":"Spark shell","commitOperation":"APPEND"},"eventTimestampMs":1765444129777},{"dataset":{"databaseName":"testdb","tableName":"partition_stats_test","clusterName":"LocalHadoopCluster","tableMetadataLocation":"/data/openhouse/testdb/partition_stats_test-088a1368-1212-49b1-b3d9-b6cabdec290e","partitionSpec":"[\n 1000: event_time_day: day(4)\n]"},"commitMetadata":{"commitId":8368973829645132323,"commitTimestampMs":1765444066000,"commitAppId":"local-1765443996768","commitAppName":"Spark shell","commitOperation":"APPEND"},"eventTimestampMs":1765444129777}]
```


**Key Points:**
- ✅ All 3 commit events published successfully
- ✅ commitAppId: "local-1765443996768" (populated)
- ✅ commitAppName: "Spark shell" (populated)
- ✅ commitOperation: "APPEND" (properly parsed)
- ✅ Commit timestamps in chronological order

### 4. publishPartitionEvents Log Output

```
25/12/11 09:08:49 INFO spark.TableStatsCollectionSparkApp: Publishing partition events for table: testdb.partition_stats_test 25/12/11 09:08:49 INFO spark.TableStatsCollectionSparkApp: [{"partitionData":[{"columnName":"event_time_day","value":"2024-01-01"}],"dataset":{"databaseName":"testdb","tableName":"partition_stats_test","clusterName":"LocalHadoopCluster","tableMetadataLocation":"/data/openhouse/testdb/partition_stats_test-088a1368-1212-49b1-b3d9-b6cabdec290e","partitionSpec":"[\n 1000: event_time_day: day(4)\n]"},"commitMetadata":{"commitId":5642811578603876150,"commitTimestampMs":1765444061000,"commitAppId":"local-1765443996768","commitAppName":"Spark shell","commitOperation":"APPEND"},"eventTimestampMs":1765444129777},{"partitionData":[{"columnName":"event_time_day","value":"2024-01-02"}],"dataset":{"databaseName":"testdb","tableName":"partition_stats_test","clusterName":"LocalHadoopCluster","tableMetadataLocation":"/data/openhouse/testdb/partition_stats_test-088a1368-1212-49b1-b3d9-b6cabdec290e","partitionSpec":"[\n 1000: event_time_day: day(4)\n]"},"commitMetadata":{"commitId":7929592344081159299,"commitTimestampMs":1765444064000,"commitAppId":"local-1765443996768","commitAppName":"Spark shell","commitOperation":"APPEND"},"eventTimestampMs":1765444129777},{"partitionData":[{"columnName":"event_time_day","value":"2024-01-03"}],"dataset":{"databaseName":"testdb","tableName":"partition_stats_test","clusterName":"LocalHadoopCluster","tableMetadataLocation":"/data/openhouse/testdb/partition_stats_test-088a1368-1212-49b1-b3d9-b6cabdec290e","partitionSpec":"[\n 1000: event_time_day: day(4)\n]"},"commitMetadata":{"commitId":8368973829645132323,"commitTimestampMs":1765444066000,"commitAppId":"local-1765443996768","commitAppName":"Spark shell","commitOperation":"APPEND"},"eventTimestampMs":1765444129777}]
```

**Key Points:**
- ✅ All 3 partition events published successfully
- ✅ partitionData: Contains partition column name and values
(event_time_day: 2024-01-01, 2024-01-02, 2024-01-03)
- ✅ commitAppId: "local-1765443996768" (populated)
- ✅ commitAppName: "Spark shell" (populated)
- ✅ commitOperation: "APPEND" (properly parsed)
- ✅ Each event represents a different partition with correct commit
metadata

### 5. publishPartitionStats Log Output

```
25/12/11 09:08:49 INFO spark.TableStatsCollectionSparkApp: Publishing partition stats for table: testdb.partition_stats_test (3 stats) 25/12/11 09:08:49 INFO spark.TableStatsCollectionSparkApp: [{"partitionData":[{"columnName":"event_time_day","value":"2024-01-01"}],"rowCount":2,"columnCount":4,"nullCount":[{"columnName":"event_time","value":0},{"columnName":"id","value":0},{"columnName":"name","value":0},{"columnName":"region","value":0}],"nanCount":[{"columnName":"event_time","value":0},{"columnName":"id","value":0},{"columnName":"name","value":0},{"columnName":"region","value":0}],"minValue":[{"columnName":"event_time","value":"2024-01-01 10:00:00.0"},{"columnName":"id","value":1},{"columnName":"name","value":"Alice"},{"columnName":"region","value":"EU"}],"maxValue":[{"columnName":"event_time","value":"2024-01-01 11:00:00.0"},{"columnName":"id","value":2},{"columnName":"name","value":"Bob"},{"columnName":"region","value":"US"}],"columnSizeInBytes":[{"columnName":"event_time","value":30},{"columnName":"id","value":12},{"columnName":"name","value":26},{"columnName":"region","value":22}],"dataset":{"databaseName":"testdb","tableName":"partition_stats_test","clusterName":"LocalHadoopCluster","tableMetadataLocation":"/data/openhouse/testdb/partition_stats_test-088a1368-1212-49b1-b3d9-b6cabdec290e","partitionSpec":"[\n 1000: event_time_day: day(4)\n]"},"commitMetadata":{"commitId":5642811578603876150,"commitTimestampMs":1765444061,"commitAppId":"local-1765443996768","commitAppName":"Spark 
shell","commitOperation":"APPEND"},"eventTimestampMs":1765444129782},{"partitionData":[{"columnName":"event_time_day","value":"2024-01-02"}],"rowCount":2,"columnCount":4,"nullCount":[{"columnName":"event_time","value":0},{"columnName":"id","value":0},{"columnName":"name","value":0},{"columnName":"region","value":0}],"nanCount":[{"columnName":"event_time","value":0},{"columnName":"id","value":0},{"columnName":"name","value":0},{"columnName":"region","value":0}],"minValue":[{"columnName":"event_time","value":"2024-01-02 10:00:00.0"},{"columnName":"id","value":3},{"columnName":"name","value":"Charlie"},{"columnName":"region","value":"APAC"}],"maxValue":[{"columnName":"event_time","value":"2024-01-02 11:00:00.0"},{"columnName":"id","value":4},{"columnName":"name","value":"David"},{"columnName":"region","value":"US"}],"columnSizeInBytes":[{"columnName":"event_time","value":30},{"columnName":"id","value":12},{"columnName":"name","value":30},{"columnName":"region","value":24}],"dataset":{"databaseName":"testdb","tableName":"partition_stats_test","clusterName":"LocalHadoopCluster","tableMetadataLocation":"/data/openhouse/testdb/partition_stats_test-088a1368-1212-49b1-b3d9-b6cabdec290e","partitionSpec":"[\n 1000: event_time_day: day(4)\n]"},"commitMetadata":{"commitId":7929592344081159299,"commitTimestampMs":1765444064,"commitAppId":"local-1765443996768","commitAppName":"Spark shell","commitOperation":"APPEND"},"eventTimestampMs":1765444129782},{"partitionData":[{"columnName":"event_time_day","value":"2024-01-03"}],"rowCount":1,"columnCount":4,"nullCount":[{"columnName":"event_time","value":0},{"columnName":"id","value":0},{"columnName":"name","value":0},{"columnName":"region","value":0}],"nanCount":[{"columnName":"event_time","value":0},{"columnName":"id","value":0},{"columnName":"name","value":0},{"columnName":"region","value":0}],"minValue":[{"columnName":"event_time","value":"2024-01-03 
10:00:00.0"},{"columnName":"id","value":5},{"columnName":"name","value":"Eve"},{"columnName":"region","value":"EU"}],"maxValue":[{"columnName":"event_time","value":"2024-01-03 10:00:00.0"},{"columnName":"id","value":5},{"columnName":"name","value":"Eve"},{"columnName":"region","value":"EU"}],"columnSizeInBytes":[{"columnName":"event_time","value":15},{"columnName":"id","value":6},{"columnName":"name","value":12},{"columnName":"region","value":11}],"dataset":{"databaseName":"testdb","tableName":"partition_stats_test","clusterName":"LocalHadoopCluster","tableMetadataLocation":"/data/openhouse/testdb/partition_stats_test-088a1368-1212-49b1-b3d9-b6cabdec290e","partitionSpec":"[\n 1000: event_time_day: day(4)\n]"},"commitMetadata":{"commitId":8368973829645132323,"commitTimestampMs":1765444066,"commitAppId":"local-1765443996768","commitAppName":"Spark shell","commitOperation":"APPEND"},"eventTimestampMs":1765444129782}]
```

**Key Points:**
- ✅ All 3 partition stats published successfully
- ✅ Complete column-level metrics: nullCount, nanCount, minValue,
maxValue, columnSizeInBytes
- ✅ Partition data correctly captured (event_time_day: 2024-01-01,
2024-01-02, 2024-01-03)
- ✅ Row counts accurate: 2, 2, 1 for respective partitions
- ✅ Min/max values correctly computed per partition (Alice/Bob,
Charlie/David, Eve)
- ✅ Commit metadata properly associated with each partition stat
- ✅ Latest commit info reflects the commit that created/modified each
partition

### 6. Job Completion
```
2025-12-11 09:08:59 INFO OperationTask:233 - Finished job for entity TableMetadata(super=Metadata(creator=openhouse), dbName=testdb, tableName=partition_stats_test, ...): JobId TABLE_STATS_COLLECTION_testdb_partition_stats_test_83a5ebff-d232-4217-97d9-6a1da8881ddd, executionId 0, runTime 37322, queuedTime 13259, state SUCCEEDED
```

**Key Points:**
- ✅ Job completed successfully: state SUCCEEDED
- ✅ Total runtime: 37.3 seconds (including scheduler overhead)
- ✅ Collection time: 23.1 seconds (synchronous execution)
- ✅ All 4 publishing methods executed without errors


## Key Features:

### 1. Synchronous Sequential Execution

- All 4 collection tasks execute one after another in a predictable
order:
  1. Table Stats Collection
  2. Commit Events Collection
  3. Partition Events Collection
  4. Partition Stats Collection
- Each task waits for the previous to complete before starting
- No CompletableFuture or parallel processing complexity
- Example execution: Task 1 (9.7s) → Task 2 (2.3s) → Task 3 (3.3s) →
Task 4 (7.9s) = 23.1s total
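
The loop above can be sketched in shell pseudo-form (task names are from this PR; `collect_and_publish` is a hypothetical stand-in for the real Java collection and publish work inside `OperationTask`):

```shell
# Illustrative sketch of the sequential pattern; collect_and_publish is a
# hypothetical placeholder for the real collection + publish work of each task.
collect_and_publish() { :; }

total_start=$(date +%s)
for task in "table stats" "commit events" "partition events" "partition stats"; do
  start=$(date +%s)
  echo "Starting ${task} collection"
  # a failure in one task is caught here and does not affect subsequent tasks
  collect_and_publish "$task" || echo "WARN: ${task} collection failed; continuing"
  echo "Completed ${task} collection in $(( $(date +%s) - start ))s"
done
echo "Total collection time: $(( $(date +%s) - total_start ))s"
```

Each iteration finishes (including its publish step) before the next starts, which is what gives the predictable timeline and the per-task "Completed ... in [ms] ms" log lines.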

### 2. Predictable Execution Order

- Guaranteed sequential execution eliminates race conditions
- Resources allocated and released in a controlled manner
- Easier to debug with clear execution timeline
- Simplified error handling - failures don't affect parallel tasks

### 3. Maintained Data Collection Functionality

- ✅ Table stats collected and published (IcebergTableStats)
- ✅ Commit events collected and published (CommitEventTable)
- ✅ Partition events collected and published
(CommitEventTablePartitions)
- ✅ Partition stats collected and published
(CommitEventTablePartitionStats)
- All existing functionality preserved with synchronous execution
pattern

### 4. Robust Error Handling

- ✅ Null/empty results handled gracefully for each task
- ✅ Publishing skipped if collection fails or returns no data
- ✅ Unpartitioned tables handled correctly (empty partition
events/stats)
- ✅ Each task logs start/completion with timing information
- ✅ Failures in one task don't impact subsequent tasks

### 5. Performance Trade-off Accepted

- **Sequential execution:** ~23 seconds (4 tasks in series)
- **Previous parallel execution:** ~14 seconds (estimated)
- **Trade-off justification:**
  - Resolves downstream repository execution errors
  - Can be optimized later if needed without changing API

### 6. Comprehensive Timing Metrics

- Individual task timing logged: "Completed [task] for table: [name] in
[ms] ms"
- Total collection time logged: "Total collection time for table: [name]
in [ms] ms"
- No misleading "parallel execution" message
- Clear visibility into where time is spent

## Changes

- [ ] Client-facing API Changes
- [ ] Internal API Changes
- [ ] Bug Fixes
- [x] New Features
- [ ] Performance Improvements
- [ ] Code Style
- [ ] Refactoring
- [ ] Documentation
- [x] Tests

For all the boxes checked, please include additional details of the
changes made in this pull request.

## Testing Done
<!--- Check any relevant boxes with "x" -->

- [x] Manually Tested on local docker setup. Please include commands
ran, and their output.
- [x] Added new tests for the changes made.
- [x] Updated existing tests to reflect the changes made.
- [ ] No tests added or updated. Please explain why. If unsure, please
feel free to ask for help.
- [ ] Some other form of testing like staging or soak time in
production. Please explain.

For all the boxes checked, include a detailed description of the testing
done for the changes made in this pull request.

# Additional Information

- [ ] Breaking Changes
- [ ] Deprecations
- [ ] Large PR broken into smaller PRs, and PR plan linked in the
description.

For all the boxes checked, include additional details of the changes
made in this pull request.

---------

Co-authored-by: srawat <srawat@linkedin.com>
…nkedin#433)

## Summary

<!--- HINT: Replace #nnn with corresponding Issue number, if you are
fixing an existing issue -->

[Issue](https://github.com/linkedin/openhouse/issues/#nnn) Briefly
discuss the summary of the changes made in this
pull request in 2-3 lines.

Add an option for derived classes of BaseClass to specify a truststore
location when sending job status to HTS.
This is needed to support the Java job ecosystem, where jobs can report
their job status back to the OH JobService.

## Changes

- [ ] Client-facing API Changes
- [x] Internal API Changes
- [ ] Bug Fixes
- [ ] New Features
- [ ] Performance Improvements
- [ ] Code Style
- [ ] Refactoring
- [ ] Documentation
- [ ] Tests

For all the boxes checked, please include additional details of the
changes made in this pull request.

## Testing Done
<!--- Check any relevant boxes with "x" -->

- [ ] Manually Tested on local docker setup. Please include commands
ran, and their output.
- [ ] Added new tests for the changes made.
- [ ] Updated existing tests to reflect the changes made.
- [ ] No tests added or updated. Please explain why. If unsure, please
feel free to ask for help.
- [x] Some other form of testing like staging or soak time in
production. Please explain.

For all the boxes checked, include a detailed description of the testing
done for the changes made in this pull request.

Tested with snapshot on test jobs


# Additional Information

- [ ] Breaking Changes
- [ ] Deprecations
- [ ] Large PR broken into smaller PRs, and PR plan linked in the
description.

For all the boxes checked, include additional details of the changes
made in this pull request.
…metric (linkedin#432)

## Summary
A critical latency metric is pinned at 30s when in reality the latency is as
high as 5 minutes. The incorrect latency was misinterpreted during an
investigation and led to wrong conclusions.

Above: the log-based metric shows the real latency.

Below: the Prometheus-based metric shows p99 capped at 30s.
<img width="857" height="551" alt="image"
src="https://github.com/user-attachments/assets/600a2214-3d05-40f9-90e6-1939c3919268"
/>

Configure histogram buckets to extend to 600 seconds for the
`catalog_metadata_retrieval_latency` metric, enabling accurate capture
of long-running metadata operations.

## Changes

- [ ] Client-facing API Changes
- [ ] Internal API Changes
- [x] Bug Fixes
- [ ] New Features
- [ ] Performance Improvements
- [ ] Code Style
- [ ] Refactoring
- [ ] Documentation
- [x] Tests

Added `maximum-expected-value.catalog_metadata_retrieval_latency=600s`
to application.properties and a new Spring Boot integration test to
verify the configuration.
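
For reference, a sketch of the relevant Spring Boot properties; the exact `management.metrics.distribution.*` prefix is an assumption about how the key is wired in this service:

```properties
# application.properties (sketch)
management.metrics.distribution.percentiles-histogram.catalog_metadata_retrieval_latency=true
# extend histogram buckets so p99 is no longer capped at the default ceiling
management.metrics.distribution.maximum-expected-value.catalog_metadata_retrieval_latency=600s
```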

## Testing Done

- [ ] Manually Tested on local docker setup. Please include commands
ran, and their output.
- [x] Added new tests for the changes made.
- [ ] Updated existing tests to reflect the changes made.
- [ ] No tests added or updated. Please explain why. If unsure, please
feel free to ask for help.
- [ ] Some other form of testing like staging or soak time in
production. Please explain.

Added `MetricsHistogramConfigurationTest` which verifies:
- The 600s max expected value configuration is set
- Percentiles histogram is enabled
- MeterRegistry is PrometheusMeterRegistry with histogram buckets
- Histogram buckets extend to 600s (verified via
`Timer.takeSnapshot().histogramCounts()`)
- The configuration value is parseable as a Duration

```
./gradlew :services:tables:test --tests "MetricsHistogramConfigurationTest"
```

# Additional Information

- [ ] Breaking Changes
- [ ] Deprecations
- [ ] Large PR broken into smaller PRs, and PR plan linked in the
description.
…ds (linkedin#424)

## Summary

Enable Gradle build cache for faster incremental and repeated builds.

Adding `org.gradle.caching=true` allows Gradle to reuse task outputs
from previous builds.

**Build time improvement** (`./gradlew clean build -x test`, run twice):

| | 1st Clean Build | 2nd Clean Build | Tasks from Cache |
|--|-----------------|-----------------|------------------|
| Before (no cache) | 295s (4m 55s) | 298s (4m 58s) | 0 |
| After (with cache) | 323s (5m 23s) | 281s (4m 41s) | 52 |
| **Improvement** | | **-17s (6% faster)** | |

## Changes

- [ ] Client-facing API Changes
- [ ] Internal API Changes
- [ ] Bug Fixes
- [ ] New Features
- [x] Performance Improvements
- [ ] Code Style
- [ ] Refactoring
- [ ] Documentation
- [ ] Tests

### Performance Improvements
Added `org.gradle.caching=true` to `gradle.properties` to enable local
build caching.

**Before** (no caching, second build same as first):
```
BUILD SUCCESSFUL in 4m 58s
254 actionable tasks: 254 executed
```

**After** (with caching, second build reuses cached outputs):
```
BUILD SUCCESSFUL in 4m 41s
254 actionable tasks: 192 executed, 52 from cache, 10 up-to-date
```

## Testing Done

- [x] Manually Tested on local docker setup. Please include commands
ran, and their output.
- [ ] Added new tests for the changes made.
- [ ] Updated existing tests to reflect the changes made.
- [x] No tests added or updated. Please explain why. If unsure, please
feel free to ask for help.
- [ ] Some other form of testing like staging or soak time in
production. Please explain.

### Manual Testing
**Before (main branch, no cache):**
1. `./gradlew clean build -x test` → 4m 55s
2. `./gradlew clean build -x test` → 4m 58s (no improvement)

**After (with cache enabled):**
1. `rm -rf ~/.gradle/caches/build-cache-*` (clear cache)
2. `./gradlew clean build -x test` → 5m 23s (populate cache)
3. `./gradlew clean build -x test` → 4m 41s (52 tasks from cache)

### No Tests Added
This is a build infrastructure change that doesn't affect runtime
behavior.

# Additional Information

- [ ] Breaking Changes
- [ ] Deprecations
- [ ] Large PR broken into smaller PRs, and PR plan linked in the
description.

**Future enhancement:** Configure remote build cache for CI to share
cached outputs across builds.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
…nkedin#419)

## Summary

Share OpenAPI generator JAR across client modules to reduce build time
by 10%.

Previously, each client module (`hts`, `tableclient`, `jobsclient`)
downloaded the OpenAPI generator CLI JAR (~24MB) to its own
`$buildDir/bin` directory. This change uses a shared location with file
locking to ensure only one download occurs.

**Build time improvement** (`./gradlew clean build -x test`):
| | Time |
|--|------|
| Before | 311s (5m 10s) |
| After | 280s (4m 39s) |
| **Improvement** | **-31s (10% faster)** |

## Changes

- [ ] Client-facing API Changes
- [ ] Internal API Changes
- [ ] Bug Fixes
- [ ] New Features
- [x] Performance Improvements
- [ ] Code Style
- [ ] Refactoring
- [ ] Documentation
- [ ] Tests

### Performance Improvements
- Modified `client/common/codegen.build.gradle` to use shared JAR
location (`${rootProject.buildDir}/openapi-cli`)
- Updated `client/common/jar_download.sh` with:
  - Portable file locking using `mkdir` (works on Linux/macOS)
  - Download to temp file then atomic `mv` to final location
  - Proper error handling and lock cleanup
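
The locking-and-atomic-rename pattern described above can be sketched as follows (the download is replaced by a stub; paths and the JAR URL are illustrative):

```shell
# Sketch of the mkdir-based locking pattern; the real script downloads with curl.
JAR_DIR="$(mktemp -d)"                      # stands in for ${rootProject.buildDir}/openapi-cli
LOCK_DIR="$JAR_DIR/.lock"
JAR="$JAR_DIR/openapi-generator-cli-5.3.0.jar"

# mkdir is atomic on POSIX filesystems: exactly one process wins the lock,
# so this works portably on Linux and macOS without flock.
until mkdir "$LOCK_DIR" 2>/dev/null; do sleep 1; done
trap 'rmdir "$LOCK_DIR"' EXIT               # always release the lock

if [ ! -f "$JAR" ]; then
  tmp="$JAR.tmp.$$"
  echo "jar-bytes" > "$tmp"                 # real script: curl -fL -o "$tmp" "$JAR_URL"
  mv "$tmp" "$JAR"                          # atomic rename: readers never see a partial file
fi
echo "$(basename "$JAR") exists."
```

Concurrent tasks that lose the `mkdir` race simply wait, then find the JAR already present and skip the download.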

**Before** (3 separate downloads):
```
> Task :client:hts:setUp
Downloading openapi generator JAR in bin folder .../build/hts/bin if needed...
> Task :client:jobsclient:setUp
Downloading openapi generator JAR in bin folder .../build/jobsclient/bin if needed...
> Task :client:tableclient:setUp  
Downloading openapi generator JAR in bin folder .../build/tableclient/bin if needed...
```

**After** (1 shared download):
```
> Task :client:hts:setUp
Downloading openapi generator JAR in bin folder .../build/openapi-cli if needed...
Downloading openapi-generator-cli-5.3.0.jar...
> Task :client:jobsclient:setUp
openapi-generator-cli-5.3.0.jar exists.
> Task :client:tableclient:setUp
openapi-generator-cli-5.3.0.jar exists.
```

## Testing Done

- [x] Manually Tested on local docker setup. Please include commands
ran, and their output.
- [ ] Added new tests for the changes made.
- [ ] Updated existing tests to reflect the changes made.
- [x] No tests added or updated. Please explain why. If unsure, please
feel free to ask for help.
- [ ] Some other form of testing like staging or soak time in
production. Please explain.

### Manual Testing
1. Verified JAR is downloaded once to shared location and reused by
other client modules
2. Verified file locking works correctly with parallel builds
(concurrent tasks wait for download)
3. Measured before/after build times with `./gradlew clean build -x
test`

### No Tests Added
This is a build infrastructure change that doesn't affect runtime
behavior. The optimization is validated by the build output showing
"exists" messages for subsequent client modules.

# Additional Information

- [ ] Breaking Changes
- [ ] Deprecations
- [ ] Large PR broken into smaller PRs, and PR plan linked in the
description.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
…licit dependencies (linkedin#420)

## Summary

Fix port conflicts in OpenAPI spec generation to enable parallel builds,
reducing build time by 50%.

Previously, all three services (tables, housetables, jobs) would start
Spring Boot on the same default port (8080) during OpenAPI spec
generation. In parallel builds, this caused port conflicts and incorrect
API specs, leading to compilation failures.

**Build time improvement** (`./gradlew clean build -x test --parallel`):
| | Time |
|--|------|
| Before | BUILD FAILED (port conflicts) |
| After | 156s (2m 35s) |
| **vs Sequential** | **314s → 156s (-50% faster)** |

## Changes

- [ ] Client-facing API Changes
- [ ] Internal API Changes
- [x] Bug Fixes
- [ ] New Features
- [x] Performance Improvements
- [ ] Code Style
- [ ] Refactoring
- [ ] Documentation
- [ ] Tests

### Bug Fix & Performance Improvements
- Configured unique ports for each service's OpenAPI spec generation:
  - Tables service: port 8000
  - HouseTables service: port 8001  
  - Jobs service: port 8002
- Added explicit `dependsOn configurations.runtimeClasspath` to
`dummytokens:jar` task to fix implicit dependency warning
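
A hedged sketch of how a per-service port might be wired with the springdoc-openapi Gradle plugin; the exact DSL depends on the plugin version, and the block names here are assumptions:

```groovy
// services/tables/build.gradle (hypothetical) — pin both the forked Spring Boot
// process and the spec URL to a unique port so parallel builds don't collide.
openApi {
    apiDocsUrl = "http://localhost:8000/v3/api-docs"
    outputFileName = "tables.json"
}
customBootRun {
    args = ["--server.port=8000"]
}
```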

**Before** (parallel build fails):
```
FAILURE: Build completed with 2 failures.

1: Task failed with an exception.
* What went wrong:
Execution failed for task ':integrations:java:iceberg-1.2:openhouse-java-runtime:compileJava'.
> Compilation failed; see the compiler error output for details.

2: Task failed with an exception.
* What went wrong:
Execution failed for task ':integrations:java:iceberg-1.5:openhouse-java-iceberg-1.5-runtime:compileJava'.
> Compilation failed; see the compiler error output for details.

BUILD FAILED in 1m 10s
```

**After** (parallel build succeeds):
```
> Task :services:tables:generateOpenApiDocs
> Task :services:housetables:generateOpenApiDocs
> Task :services:jobs:generateOpenApiDocs

BUILD SUCCESSFUL in 2m 35s
254 actionable tasks: 244 executed, 10 up-to-date
```

## Testing Done

- [x] Manually Tested on local docker setup. Please include commands
ran, and their output.
- [ ] Added new tests for the changes made.
- [ ] Updated existing tests to reflect the changes made.
- [x] No tests added or updated. Please explain why. If unsure, please
feel free to ask for help.
- [ ] Some other form of testing like staging or soak time in
production. Please explain.

### Manual Testing
1. Verified parallel build fails on `main` branch due to port conflicts
2. Verified parallel build succeeds with fixes applied
3. Verified generated OpenAPI specs are correct (tables.json contains
Tables API endpoints)
4. Measured before/after build times with `./gradlew clean build -x test
--parallel`

### No Tests Added
This is a build infrastructure change that doesn't affect runtime
behavior. The fix is validated by the parallel build completing
successfully.

# Additional Information

- [ ] Breaking Changes
- [ ] Deprecations
- [ ] Large PR broken into smaller PRs, and PR plan linked in the
description.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
…nkedin#435)

## Summary

<!--- HINT: Replace #nnn with corresponding Issue number, if you are
fixing an existing issue -->

[Issue](https://github.com/linkedin/openhouse/issues/#nnn) Briefly
discuss the summary of the changes made in this
pull request in 2-3 lines.

Table policies are currently always written to table properties when
updated.
In scenarios where we want this behavior to change for specific policies
(e.g. Retention), this is not easy to extend in another class.

This PR refactors the writing and comparison of policies into its own
class so that the behavior of storing table policies can be changed.

To manage policies at a granular level, incoming policies can be read
through `TableDto`, and the policy object can be modified to store only
the policies that should be saved.

## Changes
- [ ] Client-facing API Changes
- [ ] Internal API Changes
- [ ] Bug Fixes
- [ ] New Features
- [ ] Performance Improvements
- [ ] Code Style
- [ ] Refactoring
- [ ] Documentation
- [ ] Tests

For all the boxes checked, please include additional details of the
changes made in this pull request.

## Testing Done
<!--- Check any relevant boxes with "x" -->

- [ ] Manually Tested on local docker setup. Please include commands
ran, and their output.
- [ ] Added new tests for the changes made.
- [ ] Updated existing tests to reflect the changes made.
- [ ] No tests added or updated. Please explain why. If unsure, please
feel free to ask for help.
- [ ] Some other form of testing like staging or soak time in
production. Please explain.

For all the boxes checked, include a detailed description of the testing
done for the changes made in this pull request.

# Additional Information

- [ ] Breaking Changes
- [ ] Deprecations
- [ ] Large PR broken into smaller PRs, and PR plan linked in the
description.

For all the boxes checked, include additional details of the changes
made in this pull request.
## Problem & Summary

<!--- HINT: Replace #nnn with corresponding Issue number, if you are
fixing an existing issue -->

In the current state, we do not have entity-level metrics for
maintenance jobs. All the metrics are at an aggregate level, which does
not lead directly to action items. For example, if the number of
failed maintenance jobs is 10, there is no indication of which entities
the failed jobs correspond to; someone needs to parse the logs to
identify which tables are impacted by these failures.

This change adds granular, task-level maintenance job metrics to cover
such cases.

Added metrics:
* maintenance_job_triggered
   * Counter -- tracks number of maintenance jobs triggered per entity
* maintenance_job_skipped
   * Counter -- tracks number of maintenance jobs skipped per entity
* maintenance_job_completed
   * Counter -- tracks number of maintenance jobs completed per entity,
along with the status of the maintenance job
## Changes

- [ ] Client-facing API Changes
- [ ] Internal API Changes
- [ ] Bug Fixes
- [ ] New Features
- [ ] Performance Improvements
- [ ] Code Style
- [ ] Refactoring
- [ ] Documentation
- [x] Observability
   - Add new metrics to track maintenance job updates at an entity level
- [ ] Tests

## Testing Done
<!--- Check any relevant boxes with "x" -->

Adding new metrics, no additional tests added
…warnings (linkedin#421)

## Summary

Enable parallel build by default so users don't need to pass
`--parallel` flag.

This PR adds `org.gradle.parallel=true` to gradle.properties. It also
fixes Gradle deprecation warnings related to `mainClassName` and
`JavaExec.main`.

> **Note:** This PR builds on linkedin#420 (port conflict fixes) and should be
merged after it.

**Build time improvement** (`./gradlew clean build -x test`):
same as the existing improvement from linkedin#420

## Changes

- [ ] Client-facing API Changes
- [ ] Internal API Changes
- [ ] Bug Fixes
- [ ] New Features
- [x] Performance Improvements
- [ ] Code Style
- [x] Refactoring
- [ ] Documentation
- [ ] Tests

### Performance Improvements
- Added `org.gradle.parallel=true` to `gradle.properties` to enable
parallel builds by default
- Users no longer need to remember to pass `--parallel` flag

### Refactoring (Deprecation Fixes)
- Fixed deprecated `mainClassName` in
`scripts/java/tools/dummytokens/build.gradle` → use `application {
mainClass = ... }`
- Fixed deprecated `JavaExec.main` in
`integrations/spark/spark-3.1/openhouse-spark-runtime/build.gradle` →
use `mainClass`
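
The two fixes can be sketched as follows; the class names are hypothetical placeholders, not the actual values in those build files:

```groovy
// application plugin: replace the deprecated top-level property
// mainClassName = 'com.example.DummyTokens'     // deprecated in Gradle 7+
application {
    mainClass = 'com.example.DummyTokens'        // replacement
}

// JavaExec task: replace the deprecated `main` property
task runTool(type: JavaExec) {
    classpath = sourceSets.main.runtimeClasspath
    // main = 'com.example.Tool'                 // deprecated
    mainClass = 'com.example.Tool'               // replacement (Property<String>)
}
```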

**Before** (deprecation warnings):
```
The JavaExec.main property has been deprecated. This is scheduled to be removed in Gradle 8.0.
Please use the mainClass property instead.
```

**After** (no warnings):
```
BUILD SUCCESSFUL in 2m 35s
254 actionable tasks: 244 executed, 10 up-to-date
```

## Testing Done

- [x] Manually Tested on local docker setup. Please include commands
ran, and their output.
- [ ] Added new tests for the changes made.
- [ ] Updated existing tests to reflect the changes made.
- [x] No tests added or updated. Please explain why. If unsure, please
feel free to ask for help.
- [ ] Some other form of testing like staging or soak time in
production. Please explain.

### Manual Testing
1. Verified build runs in parallel by default (no `--parallel` flag
needed)
2. Verified deprecation warnings are fixed with `--warning-mode all`
3. Measured before/after build times

### No Tests Added
This is a build infrastructure change that doesn't affect runtime
behavior.

# Additional Information

- [ ] Breaking Changes
- [ ] Deprecations
- [ ] Large PR broken into smaller PRs, and PR plan linked in the
description.

**Dependencies:** Merge linkedin#420 first (port conflict fixes required for
parallel builds to work correctly)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
## Summary

This PR adds capability of certificate-based authentication for MySQL.

## Changes

- [ ] Client-facing API Changes
- [ ] Internal API Changes
- [ ] Bug Fixes
- [X] New Features
- [ ] Performance Improvements
- [ ] Code Style
- [ ] Refactoring
- [ ] Documentation
- [ ] Tests

For all the boxes checked, please include additional details of the
changes made in this pull request.

## Testing Done
<!--- Check any relevant boxes with "x" -->

- [ ] Manually Tested on local docker setup. Please include commands
ran, and their output.
- [ ] Added new tests for the changes made.
- [ ] Updated existing tests to reflect the changes made.
- [ ] No tests added or updated. Please explain why. If unsure, please
feel free to ask for help.
- [X] Some other form of testing like staging or soak time in
production. Please explain. - Tested with an internal test cluster
setup; the MySQL database connection was successful with SSL certificates.

For all the boxes checked, include a detailed description of the testing
done for the changes made in this pull request.

# Additional Information

- [ ] Breaking Changes
- [ ] Deprecations
- [ ] Large PR broken into smaller PRs, and PR plan linked in the
description.

For all the boxes checked, include additional details of the changes
made in this pull request.

---------

Co-authored-by: Dushyant Kumar <dukumar@linkedin.biz>
## Summary

Remove shadowJar from build task to speed up development builds by 56%.

The `build.dependsOn shadowJar` was explicitly added but is unnecessary
because the Shadow plugin's maven-publish integration already triggers
shadowJar when running `publish`. CI workflows are unaffected since
`./gradlew publish` runs before Docker builds.

**Build time improvement** (`./gradlew clean build -x test`):
| | Time |
|--|------|
| Before | 314s (5m 13s) |
| After | 137s (2m 16s) |
| **Improvement** | **-177s (56% faster)** |

## Changes

- [ ] Client-facing API Changes
- [ ] Internal API Changes
- [ ] Bug Fixes
- [ ] New Features
- [x] Performance Improvements
- [ ] Code Style
- [x] Refactoring
- [ ] Documentation
- [ ] Tests

### Performance Improvements
Removed `tasks.build.dependsOn tasks.shadowJar` from:
- `buildSrc/src/main/groovy/openhouse.apps-spark-common.gradle`
- `tables-test-fixtures/tables-test-fixtures-iceberg-1.2/build.gradle`
- `tables-test-fixtures/tables-test-fixtures-iceberg-1.5/build.gradle`

**Why this is safe:**
- `publish` task already triggers `shadowJar` via Shadow plugin's
maven-publish integration
- CI workflow runs `./gradlew publish` before Docker builds
- Tests that depend on `configuration: 'shadow'` still trigger shadowJar
for their dependencies

**Before** (shadowJar runs on every build):
```
> Task :apps:openhouse-spark-apps_2.12:shadowJar
> Task :apps:openhouse-spark-apps-1.5_2.12:shadowJar
> Task :tables-test-fixtures:tables-test-fixtures_2.12:shadowJar
...
BUILD SUCCESSFUL in 5m 13s
```

**After** (shadowJar only runs on publish):
```
BUILD SUCCESSFUL in 2m 16s
250 actionable tasks: 244 executed, 6 up-to-date
```

## Testing Done

- [x] Manually Tested on local docker setup. Please include commands
ran, and their output.
- [ ] Added new tests for the changes made.
- [ ] Updated existing tests to reflect the changes made.
- [x] No tests added or updated. Please explain why. If unsure, please
feel free to ask for help.
- [ ] Some other form of testing like staging or soak time in
production. Please explain.

### Manual Testing
1. Verified `./gradlew clean build -x test` no longer runs shadowJar
tasks
2. Verified `./gradlew publish --dry-run` still triggers shadowJar
3. Measured before/after build times

### No Tests Added
This is a build infrastructure change that doesn't affect runtime
behavior.

# Additional Information

- [ ] Breaking Changes
- [ ] Deprecations
- [ ] Large PR broken into smaller PRs, and PR plan linked in the
description.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
## Summary

Adds new Gradle tasks to simplify the Docker-based local development
workflow, replacing the manual multi-step process with a single command.

The OpenHouse build currently depends on shadowJars, which significantly
slows builds (in CI and ELR, so this time compounds). This dependency
existed to simplify the local testing UX. As a first step toward removing
shadowJars (which cuts the build time in half, from ~5 minutes to ~2
minutes), I've added a one-line command to start the OpenHouse local
server, effectively replacing the old shadowJars-based workflow.

### Before (Manual Process)

```bash
# Step 1: Build all JARs, this task explicitly depends on shadowJars publish step
./gradlew build

# Step 2: Navigate to recipe directory
cd infra/recipes/docker-compose/oh-hadoop-spark

# Step 3: Build Docker images using the shadowjars in step1
docker compose build

# Step 4: Start containers
docker compose up -d
```

### After (Single Command)

```bash
./gradlew dockerUp -Precipe=oh-hadoop-spark
```

## New Gradle Tasks

| Task | Description |
|------|-------------|
| `dockerPrereqs` | Builds all JAR files required by Docker images |
| `dockerBuild` | Builds Docker images after ensuring prerequisites |
| `dockerUp` | One-command build and start (JARs → images → containers) |
| `dockerDown` | Stop and remove containers |

### Usage

```bash
# Quick start with lightweight recipe
./gradlew dockerUp -Precipe=oh-only

# Full stack with Spark (default)
./gradlew dockerUp -Precipe=oh-hadoop-spark

# Stop containers
./gradlew dockerDown -Precipe=oh-only
```

## Implementation Details

**Explicit JAR Dependencies**: The `dockerPrereqs` task declares all JAR
dependencies explicitly, enabling proper Gradle parallelism:

- Service bootJars: `tables`, `housetables`, `jobs`
- Spark runtime uber JARs: `spark-3.1`, `spark-3.5`
- Spark apps uber JAR
- Utility JAR: `dummytokens`
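
A hypothetical sketch of the aggregate task wiring; the exact module paths and recipe handling are assumptions based on the description above:

```groovy
// root build.gradle (sketch) — explicit JAR dependencies let Gradle
// build the prerequisites in parallel.
tasks.register('dockerPrereqs') {
    group = 'docker'
    description = 'Builds all JARs required by the Docker images'
    dependsOn ':services:tables:bootJar',
              ':services:housetables:bootJar',
              ':services:jobs:bootJar'
}

tasks.register('dockerBuild', Exec) {
    group = 'docker'
    dependsOn 'dockerPrereqs'
    workingDir "infra/recipes/docker-compose/${findProperty('recipe') ?: 'oh-hadoop-spark'}"
    commandLine 'docker', 'compose', 'build'
}
```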

**Recipe Selection**: Use `-Precipe=<name>` to select docker-compose
recipe:
- `oh-only` - Lightweight, local filesystem (fastest startup)
- `oh-hadoop` - With HDFS
- `oh-hadoop-spark` - Full stack with Spark (default)

**Design Decisions**:
- Tasks are in separate `docker` group, NOT integrated into `./gradlew
build`
- Helpful output messages show service URLs and next steps
- Error handling for invalid recipe names

## Documentation Updates

- **README.md**: Added quick start commands in "Running OpenHouse with
Docker Compose"
- **SETUP.md**:
  - New "Quick Start (Recommended)" section at top
  - Task reference table
  - Restructured with "Manual Docker Compose (Advanced)" section for
users who need fine-grained control

## Test Plan

- [x] Verified build works from clean state (no `build/` directory)
- [x] Verified docker tasks are NOT part of `./gradlew build`
(independent)
- [x] Verified proper Gradle dependency resolution and parallelism (85
tasks, 66 executed in parallel)
- [x] Verified services start and respond correctly:
  - Tables Service (8000): 200 OK - Create/Read/Delete table API tested
  - HouseTables Service (8001): 200 OK
  - Prometheus (9090): 200 OK
- [x] Verified `dockerDown` properly stops and removes containers

Co-authored-by: Vibe Kanban <noreply@vibekanban.com>
## Summary
This is the initial commit for a Python data loader library for
distributed loading of OpenHouse tables. This PR establishes the project
structure, core interfaces, and CI integration.
**Key Components**
- `OpenHouseDataLoader` - Main API that creates distributable splits for
parallel table loading
- `TableIdentifier` - Identifies tables by database, name, and optional
branch
- `DataLoaderSplits` / `DataLoaderSplit` - Iterable splits that can be
distributed across workers
- `TableTransformer` / `UDFRegistry` - Extension points for table
transformations and UDFs
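
A minimal usage sketch of how these components might fit together; the constructor and method signatures below are assumptions based on the component names, not the published API:

```python
# Hypothetical sketch only: signatures are assumptions, not the real library API.
from dataclasses import dataclass
from typing import Iterator, List, Optional

@dataclass(frozen=True)
class TableIdentifier:
    database: str
    table: str
    branch: Optional[str] = None

@dataclass(frozen=True)
class DataLoaderSplit:
    files: List[str]  # data files assigned to one worker

class DataLoaderSplits:
    """Iterable collection of splits that can be shipped to workers."""
    def __init__(self, splits: List[DataLoaderSplit]):
        self._splits = splits
    def __iter__(self) -> Iterator[DataLoaderSplit]:
        return iter(self._splits)
    def __len__(self) -> int:
        return len(self._splits)

def plan_splits(table: TableIdentifier, num_workers: int,
                files: List[str]) -> DataLoaderSplits:
    """Round-robin files into one split per worker (illustrative only)."""
    buckets: List[List[str]] = [[] for _ in range(num_workers)]
    for i, f in enumerate(files):
        buckets[i % num_workers].append(f)
    return DataLoaderSplits([DataLoaderSplit(b) for b in buckets])

splits = plan_splits(TableIdentifier("testdb", "partition_stats_test"),
                     num_workers=2,
                     files=["f0.parquet", "f1.parquet", "f2.parquet"])
```

The key idea is that `DataLoaderSplits` is a plain iterable, so a distributed framework can serialize each `DataLoaderSplit` to a separate worker.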
**Project Setup**
- Python 3.12+ with `uv` for dependency management
- Ruff for linting and formatting
- Makefile with `sync`, `check`, `test`, `all` targets
- Integrated into `build-run-tests.yml` CI workflow

**Not included**
- Publishing the new python package to pypi. That will happen in a later
PR.

## Changes

- [x] Client-facing API Changes
- [ ] Internal API Changes
- [ ] Bug Fixes
- [x] New Features
- [ ] Performance Improvements
- [x] Code Style
- [ ] Refactoring
- [x] Documentation
- [ ] Tests

For all the boxes checked, please include additional details of the
changes made in this pull request.

## Testing Done
<!--- Check any relevant boxes with "x" -->

- [ ] Manually Tested on local docker setup. Please include commands
ran, and their output.
- [ ] Added new tests for the changes made.
- [ ] Updated existing tests to reflect the changes made.
- [ ] No tests added or updated. Please explain why. If unsure, please
feel free to ask for help.
- [x] Some other form of testing like staging or soak time in
production. Please explain.

For all the boxes checked, include a detailed description of the testing
done for the changes made in this pull request.

I tested by running `make -C integrations/python/dataloader all`. This
PR is project setup and interfaces so no new functionality needs to be
tested in this PR.
```bash
uv run ruff check src/ tests/
All checks passed!
uv run ruff format --check src/ tests/
10 files already formatted
uv run pytest
============================================================================ test session starts ============================================================================
platform darwin -- Python 3.14.0, pytest-9.0.2, pluggy-1.6.0
rootdir: /Users/roreeves/li/openhouse_oss/integrations/python/dataloader
configfile: pyproject.toml
collected 1 item                                                                                                                                                            

tests/test_data_loader.py .                                                                                                                                           [100%]

============================================================================= 1 passed in 0.01s =============================================================================
```

# Additional Information

- [ ] Breaking Changes
- [ ] Deprecations
- [ ] Large PR broken into smaller PRs, and PR plan linked in the
description.

For all the boxes checked, include additional details of the changes
made in this pull request.

---------

Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
Co-authored-by: Sumedh Sakdeo <sumedhsakdeo@gmail.com>
## Summary


This pull request makes a minor update to the `README.md` file,
correcting the GitHub link to point to the project's documentation site
instead of the repository page.
## Changes

- [ ] Client-facing API Changes
- [ ] Internal API Changes
- [ ] Bug Fixes
- [ ] New Features
- [ ] Performance Improvements
- [ ] Code Style
- [ ] Refactoring
- [ ] Documentation
- [ ] Tests

For all the boxes checked, please include additional details of the
changes made in this pull request.

## Testing Done
<!--- Check any relevant boxes with "x" -->

- [ ] Manually Tested on local docker setup. Please include commands
ran, and their output.
- [ ] Added new tests for the changes made.
- [ ] Updated existing tests to reflect the changes made.
- [ ] No tests added or updated. Please explain why. If unsure, please
feel free to ask for help.
- [ ] Some other form of testing like staging or soak time in
production. Please explain.

For all the boxes checked, include a detailed description of the testing
done for the changes made in this pull request.

# Additional Information

- [ ] Breaking Changes
- [ ] Deprecations
- [ ] Large PR broken into smaller PRs, and PR plan linked in the
description.

For all the boxes checked, include additional details of the changes
made in this pull request.
…kedin#442)

## Summary


Add support for Trino query IDs in commit metadata collection to ensure
proper tracking of commits made via Trino queries, in addition to
existing Spark application tracking.

Previously, the `commitAppId` field only captured Spark application IDs
from `spark.app.id` in the commit summary, and `commitAppName` only
captured `spark.app.name`. Tables updated via Trino queries store their
query IDs under `trino_query_id` instead, resulting in null values for
both fields in Trino-based commits. This PR adds fallback logic to
capture Trino query IDs in `commitAppId` and sets `commitAppName` to
"trino" for Trino-based commits, enabling complete tracking regardless
of execution engine.
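The fallback described above can be sketched as follows. This is a minimal illustration, not the actual OpenHouse implementation: the class and method names (`CommitAppIdFallback`, `resolveCommitAppId`, `resolveCommitAppName`) are hypothetical, and the only assumptions carried over from the description are the summary keys `spark.app.id`, `spark.app.name`, and `trino_query_id`.

```java
import java.util.HashMap;
import java.util.Map;

public class CommitAppIdFallback {

  // Prefer the Spark application id; if absent, fall back to the Trino
  // query id stored under "trino_query_id" in the commit summary.
  static String resolveCommitAppId(Map<String, String> summary) {
    String sparkAppId = summary.get("spark.app.id");
    if (sparkAppId != null) {
      return sparkAppId;
    }
    return summary.get("trino_query_id");
  }

  // Prefer the Spark application name; Trino commits carry no app-name
  // key, so label them "trino" when a Trino query id is present.
  static String resolveCommitAppName(Map<String, String> summary) {
    String sparkAppName = summary.get("spark.app.name");
    if (sparkAppName != null) {
      return sparkAppName;
    }
    return summary.containsKey("trino_query_id") ? "trino" : null;
  }

  public static void main(String[] args) {
    Map<String, String> trinoSummary = new HashMap<>();
    trinoSummary.put("trino_query_id", "20260216_000000_00001_abcde");
    System.out.println(resolveCommitAppId(trinoSummary));
    System.out.println(resolveCommitAppName(trinoSummary));
  }
}
```

With this shape, Spark-based commits behave exactly as before, and Trino-based commits no longer leave both fields null.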


## Changes

- [ ] Client-facing API Changes
- [ ] Internal API Changes
- [ ] Bug Fixes
- [x] New Features
- [ ] Performance Improvements
- [ ] Code Style
- [ ] Refactoring
- [ ] Documentation
- [ ] Tests

For all the boxes checked, please include additional details of the
changes made in this pull request.

## Testing Done

- [x] Manually Tested on local docker setup. Please include commands
ran, and their output.
- [ ] Added new tests for the changes made.
- [ ] Updated existing tests to reflect the changes made.
- [ ] No tests added or updated. Please explain why. If unsure, please
feel free to ask for help.
- [ ] Some other form of testing like staging or soak time in
production. Please explain.

For all the boxes checked, include a detailed description of the testing
done for the changes made in this pull request.

# Additional Information

- [ ] Breaking Changes
- [ ] Deprecations
- [ ] Large PR broken into smaller PRs, and PR plan linked in the
description.

For all the boxes checked, include additional details of the changes
made in this pull request.

---------

Co-authored-by: srawat <srawat@linkedin.com>
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
## Summary

TablesClient uses a method that converts a `TableResponseBody` to a
policy. This PR changes that method from private to protected so it can
be overridden by subclasses if needed.
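The point of the visibility change can be shown with a minimal sketch. All types here (`TableResponseBody`, `Policy`, the client classes) are simplified stand-ins for illustration, not the real OpenHouse API:

```java
public class ProtectedConverterSketch {

  // Simplified stand-ins for the real response and policy types.
  static class TableResponseBody {
    final String retention;
    TableResponseBody(String retention) { this.retention = retention; }
  }

  static class Policy {
    final String retention;
    Policy(String retention) { this.retention = retention; }
  }

  static class TablesClient {
    // protected (previously private): subclasses may now override it.
    protected Policy toPolicy(TableResponseBody body) {
      return new Policy(body.retention);
    }
  }

  // A subclass customizing the conversion, which a private method
  // would not allow.
  static class CustomTablesClient extends TablesClient {
    @Override
    protected Policy toPolicy(TableResponseBody body) {
      return new Policy(body.retention == null ? "default" : body.retention);
    }
  }

  public static void main(String[] args) {
    Policy p = new CustomTablesClient().toPolicy(new TableResponseBody(null));
    System.out.println(p.retention);
  }
}
```

Since the method stays non-public, external callers are unaffected; only extension points change.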

## Changes

- [ ] Client-facing API Changes
- [x] Internal API Changes
- [ ] Bug Fixes
- [ ] New Features
- [ ] Performance Improvements
- [ ] Code Style
- [ ] Refactoring
- [ ] Documentation
- [ ] Tests

For all the boxes checked, please include additional details of the
changes made in this pull request.

## Testing Done

- [ ] Manually Tested on local docker setup. Please include commands
ran, and their output.
- [ ] Added new tests for the changes made.
- [ ] Updated existing tests to reflect the changes made.
- [ ] No tests added or updated. Please explain why. If unsure, please
feel free to ask for help.
- [ ] Some other form of testing like staging or soak time in
production. Please explain.

For all the boxes checked, include a detailed description of the testing
done for the changes made in this pull request.

# Additional Information

- [ ] Breaking Changes
- [ ] Deprecations
- [ ] Large PR broken into smaller PRs, and PR plan linked in the
description.

For all the boxes checked, include additional details of the changes
made in this pull request.
## Summary

Add support for publishing the OH dataloader to PyPI on every commit. 

## Changes

- [ ] Client-facing API Changes
- [ ] Internal API Changes
- [ ] Bug Fixes
- [x] New Features
- [ ] Performance Improvements
- [ ] Code Style
- [ ] Refactoring
- [x] Documentation
- [ ] Tests


## Testing Done

Tested the Python release workflow locally using `act`:
```
  ### ✅ Test Python Packages
  - Python 3.12.12 setup successful
  - uv 0.9.30 installed
  - Dependencies synced (51 packages installed)
  - Linting: `All checks passed!`
  - Formatting: `9 files already formatted`
  - Tests: `1 passed in 0.00s`

  ### ✅ Tag Python Release
  - Version extracted: `0.1.0`
  - Output set correctly for downstream jobs

  ### ✅ Discover Python Packages
  - Discovered: `["integrations/python/dataloader"]`

  ### ✅ Build Python Package
  - Dependencies synced successfully
  - Version updated: `0.1.0`
  - Build artifacts created:
    - `openhouse_dataloader-0.1.0.tar.gz`
    - `openhouse_dataloader-0.1.0-py3-none-any.whl`
  - **Twine validation: PASSED** ✅
    - Wheel: `PASSED`
    - Source dist: `PASSED`

  ### ⏭️ Publish to PyPI
  - Skipped (requires actual GitHub Actions environment)

  ### Note
  - Upload artifacts step fails in `act` (expected - requires `ACTIONS_RUNTIME_TOKEN`)
  - All critical build and validation steps pass successfully
```
- [x] Manually Tested on local docker setup. Please include commands
ran, and their output.
- [ ] Added new tests for the changes made.
- [ ] Updated existing tests to reflect the changes made.
- [ ] No tests added or updated. Please explain why. If unsure, please
feel free to ask for help.
- [ ] Some other form of testing like staging or soak time in
production. Please explain.

For all the boxes checked, include a detailed description of the testing
done for the changes made in this pull request.

# Additional Information

- [ ] Breaking Changes
- [ ] Deprecations
- [ ] Large PR broken into smaller PRs, and PR plan linked in the
description.

For all the boxes checked, include additional details of the changes
made in this pull request.