
#733 add numbers to duplicated sanitized cols #734

Merged
yruslan merged 4 commits into main from
bugfix/733-add-numbers-to-duplicated-sanitized-cols
Apr 14, 2026

Conversation

@yruslan
Collaborator

@yruslan yruslan commented Apr 14, 2026

Closes #733

Summary by CodeRabbit

  • Bug Fixes

    • JDBC decimal loading now occurs in the correct order to preserve sanitized column names and metadata.
    • Column name collisions during sanitization are resolved with automatic numeric suffixes.
  • New Features

    • Original column names are recorded and retained in field metadata for traceability.
    • Length info can be derived from stored CHAR/VARCHAR type strings when needed.
  • Tests

    • Added tests covering collision handling, metadata propagation, case-insensitive collisions, and CHAR/VARCHAR length extraction.
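The collision handling summarized above can be sketched in language-agnostic form. The Python sketch below mirrors the intended behavior of SparkUtils.sanitizeDfColumns from the PR; the helper names and the default replacement-character set are illustrative, not the actual Scala implementation:

```python
def sanitize(name: str, special_chars: str = " ,;{}()=") -> str:
    # Replace each special character with an underscore. In the real code the
    # replacement set comes from the reader configuration (specialCharacters);
    # this default is illustrative.
    return "".join("_" if c in special_chars else c for c in name)

def sanitize_columns(names):
    """Sanitize names, resolving case-insensitive collisions with numeric suffixes."""
    taken = set()  # lowercase names already in use
    result = []
    for original in names:
        candidate = sanitize(original)
        if candidate.lower() in taken:
            i = 1
            while f"{candidate}_{i}".lower() in taken:
                i += 1
            candidate = f"{candidate}_{i}"
        taken.add(candidate.lower())
        result.append(candidate)
    return result

# Collisions that differ only by case still get suffixes:
print(sanitize_columns(["Test_Column", "Test Column", "TEST COLUMN"]))
# ['Test_Column', 'Test_Column_1', 'TEST_COLUMN_2']
```

Uniqueness is tracked case-insensitively because Spark's default column resolution is case-insensitive, so `Test_Column` and `TEST_COLUMN` would still be ambiguous at query time.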

@coderabbitai
Contributor

coderabbitai bot commented Apr 14, 2026

Warning

Rate limit exceeded

@yruslan has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 6 minutes and 43 seconds before requesting another review.

Your organization is not enrolled in usage-based pricing. Contact your admin to enable usage-based pricing to continue reviews beyond the rate limit, or try again in 6 minutes and 43 seconds.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: c37bfe0c-1077-4c2e-9044-70b34e2a6fae

📥 Commits

Reviewing files that changed from the base of the PR and between 7dadc92 and 8f33eda.

📒 Files selected for processing (2)
  • pramen/core/src/main/scala/za/co/absa/pramen/core/utils/SparkUtils.scala
  • pramen/core/src/test/scala/za/co/absa/pramen/core/tests/utils/SparkUtilsSuite.scala

Walkthrough

Reorders JDBC decimal-correction load to run before sanitization; enhances DataFrame column sanitization to avoid case-insensitive name collisions by appending numeric suffixes and preserve original column names and existing metadata; adds metadata parsing for CHAR/VARCHAR lengths and expands unit tests for these behaviors.

Changes

  • JDBC Decimal Correction (pramen/core/src/main/scala/za/co/absa/pramen/core/reader/TableReaderJdbc.scala):
    Moved the corrected-decimals JDBC DataFrame reload to occur before sanitization and switched the foreach to brace syntax; no public API changes.
  • Column Sanitization & Metadata (pramen/core/src/main/scala/za/co/absa/pramen/core/utils/SparkUtils.scala):
    Added CHAR_VARCHAR_METADATA_KEY and ORIGINAL_NAME_METADATA_KEY, a regex for char/varchar lengths, extraction from stored type strings, and collision resolution in sanitizeDfColumns (case-insensitive uniqueness with numeric suffixes and original_name metadata preservation).
  • Unit Tests (pramen/core/src/test/scala/za/co/absa/pramen/core/tests/utils/SparkUtilsSuite.scala):
    Added tests covering case-insensitive collision handling, suffix numbering, metadata propagation (original_name) for sanitized names, and char/varchar length extraction.
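The char/varchar length extraction mentioned for SparkUtils.scala can be illustrated with a regex over Spark's stored type strings. This is a Python sketch of the idea; the constant and function names below are illustrative, not the Scala ones from the PR:

```python
import re

# Spark records the declared CHAR/VARCHAR type of a column in field metadata
# under the "__CHAR_VARCHAR_TYPE_STRING" key, e.g. "varchar(50)" or "char(10)".
CHAR_VARCHAR_LENGTH_PATTERN = re.compile(r"(?i)(?:char|varchar)\s*\(\s*(\d+)\s*\)")

def length_from_type_string(type_string: str):
    """Return the declared length, or None when the string is not CHAR/VARCHAR."""
    m = CHAR_VARCHAR_LENGTH_PATTERN.fullmatch(type_string.strip())
    return int(m.group(1)) if m else None

print(length_from_type_string("varchar(50)"))  # 50
print(length_from_type_string("char(10)"))     # 10
print(length_from_type_string("int"))          # None
```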

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

Suggested reviewers

  • kevinwallimann

Poem

🐇 I hopped through names both large and small,

Replaced the spaces, fixed collisions all.
I tuck originals in metadata tight,
Suffixes numbering to set things right.
A carrot for tests—now columns sleep at night.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 inconclusive)

  • Out of Scope Changes check: ❓ Inconclusive.
    Explanation: The PR includes changes to TableReaderJdbc.scala reordering control flow for decimal correction, which appears tangential to the main sanitization collision fix but relates to the commit message mentioning '__CHAR_VARCHAR_TYPE_STRING' metadata handling.
    Resolution: Clarify whether the TableReaderJdbc.scala control-flow reordering is necessary for the #733 fix or if it should be a separate PR to keep the scope focused on column sanitization.
✅ Passed checks (4 passed)
  • Description Check: ✅ Passed. Check skipped because CodeRabbit’s high-level summary is enabled.
  • Title check: ✅ Passed. The title concisely references the issue number (#733) and accurately describes the main change: adding numbers to duplicated sanitized columns to prevent duplicates.
  • Linked Issues check: ✅ Passed. The code changes fully implement the three requirements from issue #733: preventing duplicate column names by appending numbering when collisions occur, adding original_name metadata to renamed columns, and handling case-insensitive collisions.
  • Docstring Coverage: ✅ Passed. No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.



Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🧹 Nitpick comments (1)
pramen/core/src/test/scala/za/co/absa/pramen/core/tests/utils/SparkUtilsSuite.scala (1)

155-185: Tests look good, but consider adding mixed-case coverage.

The collision handling and metadata tests are comprehensive for lowercase scenarios. However, the case-sensitivity bug in sanitizeDfColumns (where remove/add operations don't use lowercase) wouldn't be caught by these tests since all column names are lowercase.

Consider adding a test with mixed-case columns to validate case-insensitive collision detection:

🧪 Suggested additional test case
```scala
"handle case-insensitive name collisions" in {
  val df = List(("A", 1, 2)).toDF("Test_Column", "Test Column", "TEST COLUMN")

  val actualDf = sanitizeDfColumns(df, " ")

  // All three should produce unique names despite case differences
  val colNames = actualDf.schema.fields.map(_.name)
  assert(colNames.distinct.length == 3, "All columns should have unique names")
  assert(colNames.contains("Test_Column"))
  // The other two should get suffixes like Test_Column_1, Test_Column_2
}
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@pramen/core/src/test/scala/za/co/absa/pramen/core/tests/utils/SparkUtilsSuite.scala`
around lines 155 - 185, Add a new test in SparkUtilsSuite.scala (e.g., "handle
case-insensitive name collisions") that constructs a DataFrame with mixed-case
colliding column names (for example "Test_Column", "Test Column", "TEST
COLUMN"), calls sanitizeDfColumns(...) and then verifies the resulting schema
has distinct names (colNames.distinct.length == originalCount) and that the
original base name (e.g., "Test_Column") is preserved with suffixes like "_1",
"_2" for the others; reference sanitizeDfColumns and the test name to locate
where to add it and reuse existing helpers (toDF, convertDataFrameToPrettyJSON
or schema assertions) to assert uniqueness and expected suffixing.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In
`@pramen/core/src/main/scala/za/co/absa/pramen/core/reader/TableReaderJdbc.scala`:
- Around line 169-189: The column-selection fails because schemaWithMetaData is
taken from the already-sanitized df while dfJdbcWithCustomSchema contains
original unsanitized column names; update the code inside the
JdbcSparkUtils.getCorrectedDecimalsSchema block to either (a) apply your same
sanitization routine to dfJdbcWithCustomSchema before selecting (so names match
schemaWithMetaData), or (b) build cols by reading original names from
dfJdbcWithCustomSchema.schema and aliasing them to the sanitized names from
schemaWithMetaData (use .as(sanitizedName, metadata) on col(originalName));
adjust the selection on dfJdbcWithCustomSchema accordingly so column names
referenced exist and metadata is reattached.

In `@pramen/core/src/main/scala/za/co/absa/pramen/core/utils/SparkUtils.scala`:
- Around line 145-154: The HashSet namesLowercase is populated with lowercase
strings but the code removes/adds using original-case names causing removals to
fail and future contains checks to break; in the block around uniqueName
(variables namesLowercase, srcName, trgName, getUniqueName) change the HashSet
interactions to use lowercase consistently (e.g., call toLowerCase on srcName
when removing and on newName or trgName when adding) so namesLowercase only ever
stores lowercase entries while leaving returned uniqueName value semantics
unchanged.
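The inconsistency described in this prompt can be reproduced with a minimal sketch. Python set semantics mirror Scala's mutable.HashSet here; the variable names are taken from the review, everything else is illustrative:

```python
# Buggy pattern: the set holds lowercase names, but removal and membership
# checks use the original-case name, so the removal silently does nothing.
names_lowercase = {"test_column"}
src_name = "Test_Column"

names_lowercase.discard(src_name)        # no-op: "Test_Column" is not in the set
assert "test_column" in names_lowercase  # the stale entry survives

# Fixed pattern: lowercase on every interaction with the set, as the review suggests.
names_lowercase = {"test_column"}
names_lowercase.discard(src_name.lower())  # removes the entry as intended
assert "test_column" not in names_lowercase
```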

---

Nitpick comments:
In
`@pramen/core/src/test/scala/za/co/absa/pramen/core/tests/utils/SparkUtilsSuite.scala`:
- Around line 155-185: Add a new test in SparkUtilsSuite.scala (e.g., "handle
case-insensitive name collisions") that constructs a DataFrame with mixed-case
colliding column names (for example "Test_Column", "Test Column", "TEST
COLUMN"), calls sanitizeDfColumns(...) and then verifies the resulting schema
has distinct names (colNames.distinct.length == originalCount) and that the
original base name (e.g., "Test_Column") is preserved with suffixes like "_1",
"_2" for the others; reference sanitizeDfColumns and the test name to locate
where to add it and reuse existing helpers (toDF, convertDataFrameToPrettyJSON
or schema assertions) to assert uniqueness and expected suffixing.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 3b54f110-2142-4f3b-8175-535bd6e4fece

📥 Commits

Reviewing files that changed from the base of the PR and between 7fa21b3 and 5afcd75.

📒 Files selected for processing (3)
  • pramen/core/src/main/scala/za/co/absa/pramen/core/reader/TableReaderJdbc.scala
  • pramen/core/src/main/scala/za/co/absa/pramen/core/utils/SparkUtils.scala
  • pramen/core/src/test/scala/za/co/absa/pramen/core/tests/utils/SparkUtilsSuite.scala

Comment thread on pramen/core/src/main/scala/za/co/absa/pramen/core/reader/TableReaderJdbc.scala (outdated)
@github-actions

github-actions bot commented Apr 14, 2026

Unit Test Coverage

Overall Project: 77.81% (-0.02%)
Files changed: 93.66% 🍏

Module coverage:
  • pramen:core (Jacoco Report): 78.94% (-0.02%) 🍏

File coverage (pramen:core, Jacoco Report):
  • SparkUtils.scala: 87.98% (-1.03%) 🍏
  • TableReaderJdbc.scala: 82.82% (-0.15%) 🍏

@yruslan
Collaborator Author

yruslan commented Apr 14, 2026

@coderabbitai review

@coderabbitai
Contributor

coderabbitai bot commented Apr 14, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In
`@pramen/core/src/main/scala/za/co/absa/pramen/core/reader/TableReaderJdbc.scala`:
- Around line 164-176: The sanitization call (SparkUtils.sanitizeDfColumns) is
currently gated by the decimal-correction if-block so it only runs when
jdbcReaderConfig.correctDecimalsInSchema || correctDecimalsFixPrecision is true;
move the sanitization so it always runs for data queries. Specifically, keep the
existing decimal-correction logic using
JdbcSparkUtils.getCorrectedDecimalsSchema and the reload into df, but remove the
SparkUtils.sanitizeDfColumns call from inside that if and instead call
SparkUtils.sanitizeDfColumns(df, jdbcReaderConfig.specialCharacters) once after
the decimal-correction block when isDataQuery is true (so sanitize runs
regardless of the decimal flags).

In `@pramen/core/src/main/scala/za/co/absa/pramen/core/utils/SparkUtils.scala`:
- Around line 406-409: The code currently captures only the length from
CHAR/VARCHAR metadata, losing the type keyword so
transformSchemaForCatalog/transformPrimitive can only emit VarcharType; update
the regex charVarcharLengthPattern to capture the type keyword (CHAR or VARCHAR)
as a group, change getLengthFromMetadata to return both the keyword and length
(or a small tuple/object) when CHAR_VARCHAR_METADATA_KEY is present, and update
transformPrimitive/transformSchemaForCatalog to check that captured keyword and
create CharType(n) when the keyword is CHAR and VarcharType(n) when VARCHAR,
preserving the original semantic distinction.
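The suggested fix, capturing the type keyword alongside the length, can be sketched as follows. In the Scala code the caller would map the captured keyword to CharType(n) or VarcharType(n); the names below are illustrative:

```python
import re

# Capture the keyword (CHAR vs VARCHAR) as well as the length, so the caller
# can distinguish CharType(n) from VarcharType(n) instead of always emitting
# a VARCHAR type.
CHAR_VARCHAR_PATTERN = re.compile(r"(?i)(char|varchar)\s*\(\s*(\d+)\s*\)")

def parse_char_varchar(type_string: str):
    """Return (keyword, length), e.g. ("CHAR", 10), or None when not applicable."""
    m = CHAR_VARCHAR_PATTERN.fullmatch(type_string.strip())
    return (m.group(1).upper(), int(m.group(2))) if m else None

print(parse_char_varchar("char(10)"))       # ('CHAR', 10)
print(parse_char_varchar("VarChar(50)"))    # ('VARCHAR', 50)
print(parse_char_varchar("decimal(10,2)"))  # None
```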
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: c04c69dd-5fe4-4952-84d1-95ff0802cb27

📥 Commits

Reviewing files that changed from the base of the PR and between 5afcd75 and 7dadc92.

📒 Files selected for processing (3)
  • pramen/core/src/main/scala/za/co/absa/pramen/core/reader/TableReaderJdbc.scala
  • pramen/core/src/main/scala/za/co/absa/pramen/core/utils/SparkUtils.scala
  • pramen/core/src/test/scala/za/co/absa/pramen/core/tests/utils/SparkUtilsSuite.scala

Comment on lines 164 to +176
```diff
 if (jdbcReaderConfig.correctDecimalsInSchema || jdbcReaderConfig.correctDecimalsFixPrecision) {
-  if (isDataQuery) {
-    df = SparkUtils.sanitizeDfColumns(df, jdbcReaderConfig.specialCharacters)
-  }
-
-  JdbcSparkUtils.getCorrectedDecimalsSchema(df, jdbcReaderConfig.correctDecimalsFixPrecision).foreach(schema =>
+  JdbcSparkUtils.getCorrectedDecimalsSchema(df, jdbcReaderConfig.correctDecimalsFixPrecision).foreach { schema =>
     df = spark
       .read
       .format("jdbc")
       .options(connectionOptions)
       .option("customSchema", schema)
       .load()
-  )
+  }
+
+  if (isDataQuery) {
+    df = SparkUtils.sanitizeDfColumns(df, jdbcReaderConfig.specialCharacters)
+  }
 }
```


⚠️ Potential issue | 🔴 Critical

Move sanitization out of the decimal-correction branch.

sanitizeDfColumns() now runs only when one of the decimal-correction flags is enabled. With both flags off, data queries stop sanitizing entirely, so the duplicate-safe renames and original_name metadata from this PR never get applied.

💡 Proposed fix

```diff
     if (jdbcReaderConfig.correctDecimalsInSchema || jdbcReaderConfig.correctDecimalsFixPrecision) {
       JdbcSparkUtils.getCorrectedDecimalsSchema(df, jdbcReaderConfig.correctDecimalsFixPrecision).foreach { schema =>
         df = spark
           .read
           .format("jdbc")
           .options(connectionOptions)
           .option("customSchema", schema)
           .load()
       }
-
-      if (isDataQuery) {
-        df = SparkUtils.sanitizeDfColumns(df, jdbcReaderConfig.specialCharacters)
-      }
+    }
+
+    if (isDataQuery) {
+      df = SparkUtils.sanitizeDfColumns(df, jdbcReaderConfig.specialCharacters)
     }
```
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

```scala
if (jdbcReaderConfig.correctDecimalsInSchema || jdbcReaderConfig.correctDecimalsFixPrecision) {
  JdbcSparkUtils.getCorrectedDecimalsSchema(df, jdbcReaderConfig.correctDecimalsFixPrecision).foreach { schema =>
    df = spark
      .read
      .format("jdbc")
      .options(connectionOptions)
      .option("customSchema", schema)
      .load()
  }
}

if (isDataQuery) {
  df = SparkUtils.sanitizeDfColumns(df, jdbcReaderConfig.specialCharacters)
}
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@pramen/core/src/main/scala/za/co/absa/pramen/core/reader/TableReaderJdbc.scala`
around lines 164 - 176, The sanitization call (SparkUtils.sanitizeDfColumns) is
currently gated by the decimal-correction if-block so it only runs when
jdbcReaderConfig.correctDecimalsInSchema || correctDecimalsFixPrecision is true;
move the sanitization so it always runs for data queries. Specifically, keep the
existing decimal-correction logic using
JdbcSparkUtils.getCorrectedDecimalsSchema and the reload into df, but remove the
SparkUtils.sanitizeDfColumns call from inside that if and instead call
SparkUtils.sanitizeDfColumns(df, jdbcReaderConfig.specialCharacters) once after
the decimal-correction block when isDataQuery is true (so sanitize runs
regardless of the decimal flags).

@yruslan yruslan merged commit 920b815 into main Apr 14, 2026
6 of 7 checks passed
@yruslan yruslan deleted the bugfix/733-add-numbers-to-duplicated-sanitized-cols branch April 14, 2026 12:54


Development

Successfully merging this pull request may close these issues.

When sanitizing columns make sure the new column name does not exist

1 participant