
[Feature] PPL Highlight Command#5234

Draft
RyanL1997 wants to merge 10 commits into opensearch-project:main from RyanL1997:ppl-highlight-cmd

Conversation

@RyanL1997
Collaborator

Description

[Describe what this change achieves]

Related Issues

Resolves #[Issue number to be closed when this PR is merged]

Check List

  • New functionality includes testing.
  • New functionality has been documented.
  • New functionality has javadoc added.
  • New functionality has a user manual doc added.
  • New PPL command checklist all confirmed.
  • API changes companion pull request created.
  • Commits are signed per the DCO using --signoff or -s.
  • Public documentation issue/PR created.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Signed-off-by: Jialiang Liang <jiallian@amazon.com>
@github-actions
Contributor

PR Code Analyzer ❗

AI-powered 'Code-Diff-Analyzer' found issues on commit 5891541.

Path: opensearch/src/main/java/org/opensearch/sql/opensearch/storage/scan/AbstractCalciteIndexScan.java
Line: 516
Severity: low
Description: User-supplied highlight terms are interpolated directly into a Lucene query_string expression with only double-quote wrapping and no further sanitization. A crafted term containing backslash escapes or unbalanced quotes could malform or expand the generated query. Impact is limited to highlighting behavior (not document access control), so this is likely an oversight rather than malicious intent.

The table above displays the top 10 most important findings.

Total: 1 | Critical: 0 | High: 0 | Medium: 0 | Low: 1


Pull Request Author(s): Please update your Pull Request according to the report above.

Repository Maintainer(s): You can bypass the diff analyzer by adding the label skip-diff-analyzer after reviewing the changes carefully, then re-running the failed actions. To re-enable the analyzer, remove the label, then re-run all actions.


⚠️ Note: The Code-Diff-Analyzer helps protect against potentially harmful code patterns. Please ensure you have thoroughly reviewed the changes beforehand.

Thanks.

@github-actions
Contributor

github-actions bot commented Mar 14, 2026

PR Reviewer Guide 🔍

(Review updated until commit b2d6afe)

Here are some key observations to aid the review process:

🧪 PR contains tests
🔒 No security concerns identified
✅ No TODO sections
🔀 Multiple PR themes

Sub-PR theme: PPL Highlight AST, Parser, and Grammar

Relevant files:

  • core/src/main/java/org/opensearch/sql/ast/tree/Highlight.java
  • core/src/main/java/org/opensearch/sql/ast/AbstractNodeVisitor.java
  • core/src/main/java/org/opensearch/sql/ast/dsl/AstDSL.java
  • core/src/main/java/org/opensearch/sql/analysis/Analyzer.java
  • ppl/src/main/antlr/OpenSearchPPLLexer.g4
  • ppl/src/main/antlr/OpenSearchPPLParser.g4
  • ppl/src/main/java/org/opensearch/sql/ppl/parser/AstBuilder.java
  • ppl/src/main/java/org/opensearch/sql/ppl/utils/PPLQueryDataAnonymizer.java
  • ppl/src/test/java/org/opensearch/sql/ppl/parser/AstBuilderTest.java
  • ppl/src/test/java/org/opensearch/sql/ppl/utils/PPLQueryDataAnonymizerTest.java

Sub-PR theme: Calcite Highlight Logical Plan and Index Scan Pushdown

Relevant files:

  • core/src/main/java/org/opensearch/sql/calcite/CalciteRelNodeVisitor.java
  • core/src/main/java/org/opensearch/sql/calcite/plan/rel/LogicalHighlight.java
  • core/src/main/java/org/opensearch/sql/expression/HighlightExpression.java
  • opensearch/src/main/java/org/opensearch/sql/opensearch/planner/rules/HighlightIndexScanRule.java
  • opensearch/src/main/java/org/opensearch/sql/opensearch/planner/rules/OpenSearchIndexRules.java
  • opensearch/src/main/java/org/opensearch/sql/opensearch/storage/scan/AbstractCalciteIndexScan.java
  • opensearch/src/main/java/org/opensearch/sql/opensearch/storage/scan/CalciteLogicalIndexScan.java
  • opensearch/src/main/java/org/opensearch/sql/opensearch/storage/scan/OpenSearchIndexEnumerator.java
  • opensearch/src/main/java/org/opensearch/sql/opensearch/storage/scan/context/PushDownType.java
  • ppl/src/test/java/org/opensearch/sql/ppl/calcite/CalcitePPLHighlightTest.java
  • opensearch/src/test/java/org/opensearch/sql/opensearch/storage/scan/CalciteIndexScanCostTest.java

Sub-PR theme: Highlight Response Formatting, Integration Tests, and Documentation

Relevant files:

  • protocol/src/main/java/org/opensearch/sql/protocol/response/QueryResult.java
  • protocol/src/main/java/org/opensearch/sql/protocol/response/format/SimpleJsonResponseFormatter.java
  • protocol/src/test/java/org/opensearch/sql/protocol/response/QueryResultTest.java
  • protocol/src/test/java/org/opensearch/sql/protocol/response/format/SimpleJsonResponseFormatterTest.java
  • integ-test/src/test/java/org/opensearch/sql/calcite/remote/CalciteExplainIT.java
  • integ-test/src/test/java/org/opensearch/sql/security/CalciteCrossClusterSearchIT.java
  • docs/user/ppl/cmd/highlight.md
  • docs/category.json

⚡ Recommended focus areas for review

Term Injection Risk

In applyHighlightPushDown, highlight terms are embedded directly into a query_string query using string concatenation with only quote-wrapping. If a term contains special characters (e.g., backslash, double-quote), this could produce malformed or unintended queries. Consider escaping terms before embedding them.

protected static void applyHighlightPushDown(
    org.opensearch.sql.opensearch.request.OpenSearchRequestBuilder requestBuilder,
    java.util.List<String> highlightArgs) {
  if (highlightArgs == null || highlightArgs.isEmpty()) {
    return;
  }
  org.opensearch.search.fetch.subphase.highlight.HighlightBuilder hb =
      new org.opensearch.search.fetch.subphase.highlight.HighlightBuilder();
  if (highlightArgs.size() == 1 && "*".equals(highlightArgs.get(0))) {
    // Wildcard: highlight search query matches in all fields
    hb.field("*");
  } else {
    // Highlight specific terms across all fields
    String queryStr =
        highlightArgs.stream()
            .map(term -> "\"" + term + "\"")
            .collect(java.util.stream.Collectors.joining(" OR "));
    org.opensearch.search.fetch.subphase.highlight.HighlightBuilder.Field field =
        new org.opensearch.search.fetch.subphase.highlight.HighlightBuilder.Field("*")
            .highlightQuery(
                org.opensearch.index.query.QueryBuilders.queryStringQuery(queryStr)
                    .defaultField("*"));
    hb.field(field);
  }
  hb.fragmentSize(Integer.MAX_VALUE);
  requestBuilder.getSourceBuilder().highlighter(hb);
}
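
As a sketch of the escaping this finding asks for (the class and method names here are illustrative, not code from the PR), the two characters that can break out of a double-quoted phrase, backslash and double quote, can be escaped before the terms are quoted and joined:

```java
import java.util.List;
import java.util.stream.Collectors;

/** Hypothetical helper mirroring the reviewers' suggested fix; not part of the PR. */
public final class HighlightTermEscaper {

  /** Escape backslash and double quote so the term stays inside its quoted phrase. */
  static String escape(String term) {
    return term.replace("\\", "\\\\").replace("\"", "\\\"");
  }

  /** Build the OR-joined query_string expression from user-supplied terms. */
  static String toQueryString(List<String> terms) {
    return terms.stream()
        .map(t -> "\"" + escape(t) + "\"")
        .collect(Collectors.joining(" OR "));
  }
}
```

When Lucene is already on the classpath, org.apache.lucene.queryparser.classic.QueryParser.escape covers the full special-character set and is the more robust choice, as later suggestions note.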
Unchecked Cast

The highlights() method contains an unchecked cast (Map<String, Object>) hlMap which is unnecessary since hlMap is already declared as Map<String, Object>. This suppresses compiler warnings without benefit and should be removed.

      Map<String, Object> hlMap = new LinkedHashMap<>();
      for (Map.Entry<String, ExprValue> entry : hl.tupleValue().entrySet()) {
        hlMap.put(
            entry.getKey(),
            entry.getValue().collectionValue().stream()
                .map(ExprValue::stringValue)
                .collect(Collectors.toList()));
      }
      return (Map<String, Object>) hlMap;
    })
.collect(Collectors.toList());
Magic String

The string "_highlight" is hardcoded in resolveForCalcite instead of using the constant HighlightExpression.HIGHLIGHT_FIELD. This is inconsistent with the rest of the codebase and could cause subtle bugs if the constant value changes.

if ("_highlight".equals(rawPath)) {
  ExprValue hl = ExprValueUtils.getTupleValue(value).get("_highlight");
  return (hl != null && !hl.isMissing()) ? hl : null;
}
Iterator Behavior Change

The iterator() method was refactored from using Map::values (which preserves insertion order) to filtering entries by key. If the tuple map does not guarantee insertion order, the column ordering in data rows could differ from the schema ordering, causing misaligned results. Verify that ExprTupleValue always preserves insertion order.

public Iterator<Object[]> iterator() {
  return exprValues.stream()
      .map(ExprValueUtils::getTupleValue)
      .map(
          tuple ->
              tuple.entrySet().stream()
                  .filter(e -> !HIGHLIGHT_FIELD.equals(e.getKey()))
                  .map(e -> e.getValue().value())
                  .toArray(Object[]::new))
      .iterator();
}
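
Whether this refactor is safe hinges on the backing map preserving insertion order. A minimal plain-java.util sketch of the same filter-by-key pattern (OrderDemo is illustrative; the PR operates on ExprTupleValue) over a LinkedHashMap, which does guarantee insertion-order iteration:

```java
import java.util.Map;

/** Illustrative only: demonstrates filter-by-key row extraction over a plain map. */
public final class OrderDemo {

  /** Drop the excluded key and collect the remaining values in the map's iteration order. */
  static Object[] rowWithout(Map<String, Object> tuple, String excluded) {
    return tuple.entrySet().stream()
        .filter(e -> !excluded.equals(e.getKey()))
        .map(Map.Entry::getValue)
        .toArray();
  }
}
```

With a LinkedHashMap the resulting row follows insertion order; with a plain HashMap it would not, which is exactly the schema misalignment the observation warns about.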
Duplicate Highlight Column

In pushDownHighlight, the _highlight field is unconditionally added to the schema without checking if it already exists (unlike LogicalHighlight.create which checks). If pushDownHighlight is called multiple times or the field already exists, it could result in duplicate columns.

public RelNode pushDownHighlight(List<String> highlightArgs) {
  RelDataTypeFactory.Builder schemaBuilder = getCluster().getTypeFactory().builder();
  schemaBuilder.addAll(getRowType().getFieldList());
  schemaBuilder.add(
      HighlightExpression.HIGHLIGHT_FIELD,
      getCluster().getTypeFactory().createSqlType(SqlTypeName.ANY));
  CalciteLogicalIndexScan newScan = copyWithNewSchema(schemaBuilder.build());
  newScan
      .getPushDownContext()
      .add(
          PushDownType.HIGHLIGHT,
          highlightArgs,
          (OSRequestBuilderAction)
              requestBuilder -> applyHighlightPushDown(requestBuilder, highlightArgs));
  return newScan;
}

@github-actions
Contributor

github-actions bot commented Mar 14, 2026

PR Code Suggestions ✨

Latest suggestions up to b2d6afe

Explore these optional code suggestions:

Category | Suggestion | Impact
Possible issue
Guard against duplicate highlight column addition

The pushDownHighlight method unconditionally adds the _highlight field to the schema
even if it already exists (e.g., if highlight is pushed down twice). This can cause
duplicate column errors. Add a guard similar to the one in LogicalHighlight.create
to check before adding the field.

opensearch/src/main/java/org/opensearch/sql/opensearch/storage/scan/CalciteLogicalIndexScan.java [88-95]

 public RelNode pushDownHighlight(List<String> highlightArgs) {
   RelDataTypeFactory.Builder schemaBuilder = getCluster().getTypeFactory().builder();
   schemaBuilder.addAll(getRowType().getFieldList());
-  schemaBuilder.add(
-      HighlightExpression.HIGHLIGHT_FIELD,
-      getCluster().getTypeFactory().createSqlType(SqlTypeName.ANY));
+  if (!getRowType().getFieldNames().contains(HighlightExpression.HIGHLIGHT_FIELD)) {
+    schemaBuilder.add(
+        HighlightExpression.HIGHLIGHT_FIELD,
+        getCluster().getTypeFactory().createSqlType(SqlTypeName.ANY));
+  }
   CalciteLogicalIndexScan newScan = copyWithNewSchema(schemaBuilder.build());
Suggestion importance[1-10]: 6


Why: The pushDownHighlight method unconditionally adds _highlight to the schema, while LogicalHighlight.create already guards against duplicates. This inconsistency could cause issues if highlight is pushed down multiple times, though in practice the optimizer rule likely prevents this.

Low
Align iterator output order with schema columns

The new iterator() implementation filters _highlight from the tuple map directly,
but the original implementation iterated over Map::values which preserves insertion
order aligned with the schema columns. If the tuple map contains extra fields not in
the schema, or fields in a different order, the resulting arrays will be misaligned
with the schema. The iteration should be driven by the schema columns, not the tuple
map keys.

protocol/src/main/java/org/opensearch/sql/protocol/response/QueryResult.java [89-97]

+List<String> columnNames = schema.getColumns().stream()
+    .map(col -> col.getAlias() != null ? col.getAlias() : col.getName())
+    .filter(name -> !HIGHLIGHT_FIELD.equals(name))
+    .collect(Collectors.toList());
 return exprValues.stream()
     .map(ExprValueUtils::getTupleValue)
     .map(
         tuple ->
-            tuple.entrySet().stream()
-                .filter(e -> !HIGHLIGHT_FIELD.equals(e.getKey()))
-                .map(e -> e.getValue().value())
+            columnNames.stream()
+                .map(name -> {
+                  ExprValue v = tuple.get(name);
+                  return v != null ? v.value() : null;
+                })
                 .toArray(Object[]::new))
     .iterator();
Suggestion importance[1-10]: 6


Why: The new iterator filters _highlight from the tuple map, but if the tuple contains fields in a different order than the schema or extra fields, the output arrays could be misaligned. Driving iteration from schema columns would be more robust, though the existing code relies on LinkedHashMap ordering which is typically consistent.

Low
Escape special characters in highlight query terms

The terms are concatenated into a query string without escaping special characters.
If a term contains characters special to the query string syntax (e.g., +, -, &&,
||, !, (, ), {, }, [, ], ^, ~, *, ?, :, \, /), the query will fail or produce
unexpected results. The terms should be escaped before being embedded in the query
string.

opensearch/src/main/java/org/opensearch/sql/opensearch/storage/scan/AbstractCalciteIndexScan.java [512-515]

 String queryStr =
     highlightArgs.stream()
-        .map(term -> "\"" + term + "\"")
+        .map(term -> "\"" + term.replace("\\", "\\\\").replace("\"", "\\\"") + "\"")
         .collect(java.util.stream.Collectors.joining(" OR "));
Suggestion importance[1-10]: 5


Why: Terms embedded in the query string without escaping special characters could cause query failures or unexpected behavior. However, the terms are already wrapped in quotes which handles most cases, and the improved code only escapes backslashes and quotes rather than all special characters.

Low
General
Guard against null values in highlight map extraction

The code calls entry.getValue().collectionValue() without checking if the value is
actually a collection type. If a highlight field value is not a collection (e.g.,
it's a string or null), this will throw a runtime exception. Add a null/type check
before calling collectionValue().

protocol/src/main/java/org/opensearch/sql/protocol/response/QueryResult.java [115-123]

 Map<String, Object> hlMap = new LinkedHashMap<>();
 for (Map.Entry<String, ExprValue> entry : hl.tupleValue().entrySet()) {
+  ExprValue fieldVal = entry.getValue();
+  if (fieldVal == null || fieldVal.isMissing() || fieldVal.isNull()) {
+    continue;
+  }
   hlMap.put(
       entry.getKey(),
-      entry.getValue().collectionValue().stream()
+      fieldVal.collectionValue().stream()
           .map(ExprValue::stringValue)
           .collect(Collectors.toList()));
 }
-return (Map<String, Object>) hlMap;
+return hlMap;
Suggestion importance[1-10]: 4


Why: Calling collectionValue() without checking if the value is a collection type could cause a runtime exception for unexpected highlight field value types. This is a defensive coding improvement but may not occur in practice given the controlled data flow.

Low

Previous suggestions

Suggestions up to commit 51356cd
Category | Suggestion | Impact
Possible issue
Preserve schema column order in row iteration

The new iterator() implementation iterates over tuple.entrySet() which may not
preserve the column order defined in the schema. The original code used Map::values
which relied on insertion order, but the schema column order should be the
authoritative source. If the tuple map has a different key order than the schema,
the data rows will be misaligned with the schema columns.

protocol/src/main/java/org/opensearch/sql/protocol/response/QueryResult.java [89-97]

 return exprValues.stream()
     .map(ExprValueUtils::getTupleValue)
     .map(
         tuple ->
-            tuple.entrySet().stream()
-                .filter(e -> !HIGHLIGHT_FIELD.equals(e.getKey()))
-                .map(e -> e.getValue().value())
+            schema.getColumns().stream()
+                .filter(col -> !HIGHLIGHT_FIELD.equals(getColumnName(col)))
+                .map(col -> {
+                    ExprValue v = tuple.get(getColumnName(col));
+                    return v != null ? v.value() : null;
+                })
                 .toArray(Object[]::new))
     .iterator();
Suggestion importance[1-10]: 7


Why: The new iterator() iterates over tuple.entrySet() which may not match the schema column order, potentially causing data/schema misalignment in responses. Using the schema as the authoritative ordering source is more correct and consistent with how columnNameTypes() works.

Medium
Guard against duplicate highlight column addition

The pushDownHighlight method unconditionally adds the _highlight field to the schema
even if it already exists (e.g., if highlight is pushed down twice). This can cause
duplicate column errors. Add a guard similar to the one in LogicalHighlight.create
to check before adding the field.

opensearch/src/main/java/org/opensearch/sql/opensearch/storage/scan/CalciteLogicalIndexScan.java [88-94]

 public RelNode pushDownHighlight(List<String> highlightArgs) {
     RelDataTypeFactory.Builder schemaBuilder = getCluster().getTypeFactory().builder();
     schemaBuilder.addAll(getRowType().getFieldList());
-    schemaBuilder.add(
-        HighlightExpression.HIGHLIGHT_FIELD,
-        getCluster().getTypeFactory().createSqlType(SqlTypeName.ANY));
+    if (!getRowType().getFieldNames().contains(HighlightExpression.HIGHLIGHT_FIELD)) {
+        schemaBuilder.add(
+            HighlightExpression.HIGHLIGHT_FIELD,
+            getCluster().getTypeFactory().createSqlType(SqlTypeName.ANY));
+    }
     CalciteLogicalIndexScan newScan = copyWithNewSchema(schemaBuilder.build());
Suggestion importance[1-10]: 6


Why: The pushDownHighlight method unconditionally adds _highlight to the schema, unlike LogicalHighlight.create which checks first. While double-highlight pushdown may be unlikely in practice, the inconsistency could cause issues and the fix is straightforward.

Low
Escape special characters in highlight query terms

The terms are concatenated into a query string without escaping special characters.
If a term contains characters special to the query string syntax (e.g., ", \, +, -,
(, ), etc.), the resulting query string will be malformed or produce unexpected
results. The terms should be escaped before being wrapped in quotes.

opensearch/src/main/java/org/opensearch/sql/opensearch/storage/scan/AbstractCalciteIndexScan.java [512-515]

 String queryStr =
     highlightArgs.stream()
-        .map(term -> "\"" + term + "\"")
+        .map(term -> "\"" + term.replace("\\", "\\\\").replace("\"", "\\\"") + "\"")
         .collect(java.util.stream.Collectors.joining(" OR "));
Suggestion importance[1-10]: 5


Why: Terms containing special query string characters (e.g., ", \) could produce malformed queries. However, since terms come from user-provided string literals in PPL, the risk is moderate and the escaping logic in the improved code is minimal (only handles \ and ").

Low
General
Guard against null highlight field values

The code calls entry.getValue().collectionValue() without checking whether the value
is actually a collection type. If a highlight field value is not a collection (e.g.,
it's a string or null), this will throw a runtime exception. Add a null/type check
before calling collectionValue().

protocol/src/main/java/org/opensearch/sql/protocol/response/QueryResult.java [115-123]

 Map<String, Object> hlMap = new LinkedHashMap<>();
 for (Map.Entry<String, ExprValue> entry : hl.tupleValue().entrySet()) {
+    ExprValue fieldVal = entry.getValue();
+    if (fieldVal == null || fieldVal.isMissing() || fieldVal.isNull()) {
+        continue;
+    }
     hlMap.put(
         entry.getKey(),
-        entry.getValue().collectionValue().stream()
+        fieldVal.collectionValue().stream()
             .map(ExprValue::stringValue)
             .collect(Collectors.toList()));
 }
-return (Map<String, Object>) hlMap;
+return hlMap;
Suggestion importance[1-10]: 5


Why: Calling collectionValue() on a non-collection ExprValue could throw a runtime exception if highlight field values are not always collections. Adding null/type checks improves robustness, though in practice highlight fragments should always be collections.

Low
Suggestions up to commit a7c44d8
Category | Suggestion | Impact
General
Prevent duplicate highlight column in projections

The _highlight field is unconditionally appended to every projection whenever it
exists in the schema, even when the user has explicitly excluded it or when it was
already included in expandedFields via an explicit fields command. This can result
in duplicate _highlight columns in the output. Add a check to only append it if it
is not already present in expandedFields.

core/src/main/java/org/opensearch/sql/calcite/CalciteRelNodeVisitor.java [426-430]

 // Include _highlight in projections when the highlight column is present in the schema
 int hlIndex = currentFields.indexOf(HighlightExpression.HIGHLIGHT_FIELD);
-if (hlIndex >= 0) {
+boolean alreadyIncluded = expandedFields.stream()
+    .anyMatch(rex -> rex instanceof org.apache.calcite.rex.RexInputRef
+        && ((org.apache.calcite.rex.RexInputRef) rex).getIndex() == hlIndex);
+if (hlIndex >= 0 && !alreadyIncluded) {
   expandedFields.add(context.relBuilder.field(hlIndex));
 }
Suggestion importance[1-10]: 7


Why: This is a real concern: if _highlight is already included in expandedFields (e.g., via an explicit fields command referencing it), the current code would add it again, causing a duplicate column. The fix correctly checks for existing inclusion before appending, preventing potential downstream errors.

Medium
Handle non-collection highlight field values safely

entry.getValue().collectionValue() will throw an exception if the highlight value
for a field is not a collection type (e.g., if it is a plain string or some other
type). This can cause the entire response serialization to fail. Add a null/type
check or use a try-catch to handle unexpected value types gracefully.

protocol/src/main/java/org/opensearch/sql/protocol/response/QueryResult.java [115-123]

 Map<String, Object> hlMap = new LinkedHashMap<>();
 for (Map.Entry<String, ExprValue> entry : hl.tupleValue().entrySet()) {
-  hlMap.put(
-      entry.getKey(),
-      entry.getValue().collectionValue().stream()
-          .map(ExprValue::stringValue)
-          .collect(Collectors.toList()));
+  ExprValue fieldVal = entry.getValue();
+  if (fieldVal == null || fieldVal.isMissing() || fieldVal.isNull()) {
+    continue;
+  }
+  try {
+    hlMap.put(
+        entry.getKey(),
+        fieldVal.collectionValue().stream()
+            .map(ExprValue::stringValue)
+            .collect(Collectors.toList()));
+  } catch (Exception e) {
+    hlMap.put(entry.getKey(), List.of(fieldVal.stringValue()));
+  }
 }
 return (Map<String, Object>) hlMap;
Suggestion importance[1-10]: 4


Why: The suggestion adds defensive error handling for unexpected highlight value types. While the OpenSearch highlight response is expected to always return collections, a try-catch for robustness is reasonable. However, the improved_code uses a broad catch (Exception e) which is generally discouraged, and the scenario is unlikely given the controlled data flow.

Low
Possible issue
Guard against duplicate highlight column addition

If pushDownHighlight is called more than once (e.g., due to rule re-application),
the _highlight column will be added to the schema multiple times, causing duplicate
column errors. Add a guard to check whether _highlight is already present in the row
type before adding it, consistent with the same check already done in
LogicalHighlight.create.

opensearch/src/main/java/org/opensearch/sql/opensearch/storage/scan/CalciteLogicalIndexScan.java [88-95]

 public RelNode pushDownHighlight(List<String> highlightArgs) {
     RelDataTypeFactory.Builder schemaBuilder = getCluster().getTypeFactory().builder();
     schemaBuilder.addAll(getRowType().getFieldList());
-    schemaBuilder.add(
-        HighlightExpression.HIGHLIGHT_FIELD,
-        getCluster().getTypeFactory().createSqlType(SqlTypeName.ANY));
+    if (!getRowType().getFieldNames().contains(HighlightExpression.HIGHLIGHT_FIELD)) {
+        schemaBuilder.add(
+            HighlightExpression.HIGHLIGHT_FIELD,
+            getCluster().getTypeFactory().createSqlType(SqlTypeName.ANY));
+    }
     CalciteLogicalIndexScan newScan = copyWithNewSchema(schemaBuilder.build());
Suggestion importance[1-10]: 6


Why: This is a valid defensive check. LogicalHighlight.create already has this guard, and adding it to pushDownHighlight ensures consistency and prevents potential duplicate column errors if the rule fires multiple times. However, the optimizer typically prevents re-application, so the practical impact may be low.

Low
Escape special characters in highlight query terms

The terms in highlightArgs are concatenated directly into a query string without
escaping special characters. If a term contains characters special to the Lucene
query string syntax (e.g., +, -, &&, ||, !, (, ), {, }, [, ], ^, ~, *, ?, :, \, /),
the resulting query will be malformed or produce unexpected results. The terms
should be escaped before being embedded in the query string.

opensearch/src/main/java/org/opensearch/sql/opensearch/storage/scan/AbstractCalciteIndexScan.java [507-515]

-if (highlightArgs.size() == 1 && "*".equals(highlightArgs.get(0))) {
-  // Wildcard: highlight search query matches in all fields
-  hb.field("*");
 } else {
   // Highlight specific terms across all fields
   String queryStr =
       highlightArgs.stream()
-          .map(term -> "\"" + term + "\"")
+          .map(term -> "\"" + org.apache.lucene.queryparser.classic.QueryParser.escape(term) + "\"")
           .collect(java.util.stream.Collectors.joining(" OR "));
Suggestion importance[1-10]: 5


Why: The suggestion is valid in principle—unescaped Lucene special characters in highlightArgs could cause malformed queries. However, the terms are already wrapped in quotes ("term"), which in Lucene query string syntax treats the content as a phrase and mitigates most special character issues. The risk is real but limited.

Low
Suggestions up to commit ab37b5b
Category | Suggestion | Impact
Possible issue
Preserve schema column order in data rows

The previous iterator() implementation preserved the column ordering defined by the
schema, while the new implementation iterates over the tuple's entry set, whose
order may differ from the schema column order. This can cause values to be
misaligned with their column headers in the response. The row values should be
produced in schema column order, not tuple insertion order.

protocol/src/main/java/org/opensearch/sql/protocol/response/QueryResult.java [89-97]

 return exprValues.stream()
     .map(ExprValueUtils::getTupleValue)
     .map(
-        tuple ->
-            tuple.entrySet().stream()
-                .filter(e -> !HIGHLIGHT_FIELD.equals(e.getKey()))
-                .map(e -> e.getValue().value())
-                .toArray(Object[]::new))
+        tuple -> {
+          List<String> columnNames =
+              schema.getColumns().stream()
+                  .map(col -> col.getAlias() != null ? col.getAlias() : col.getName())
+                  .filter(name -> !HIGHLIGHT_FIELD.equals(name))
+                  .collect(Collectors.toList());
+          return columnNames.stream()
+              .map(name -> tuple.getOrDefault(name, org.opensearch.sql.data.model.ExprMissingValue.of()).value())
+              .toArray(Object[]::new);
+        })
     .iterator();
Suggestion importance[1-10]: 7


Why: The new iterator() implementation iterates over the tuple's entry set order rather than schema column order, which could cause value-to-column misalignment in responses. This is a real correctness concern, though LinkedHashMap insertion order may coincidentally match in many cases.

Medium
Prevent duplicate highlight column in schema

If _highlight is already present in the row type (e.g., when pushDownHighlight is
called more than once), the schema builder will add a duplicate _highlight column,
which can cause downstream field-resolution errors. Add a guard to only append the
column when it is not already present, consistent with the same check in
LogicalHighlight.create.

opensearch/src/main/java/org/opensearch/sql/opensearch/storage/scan/CalciteLogicalIndexScan.java [88-97]

 public RelNode pushDownHighlight(List<String> highlightArgs) {
     RelDataTypeFactory.Builder schemaBuilder = getCluster().getTypeFactory().builder();
     schemaBuilder.addAll(getRowType().getFieldList());
-    schemaBuilder.add(
-        HighlightExpression.HIGHLIGHT_FIELD,
-        getCluster().getTypeFactory().createSqlType(SqlTypeName.ANY));
+    if (!getRowType().getFieldNames().contains(HighlightExpression.HIGHLIGHT_FIELD)) {
+        schemaBuilder.add(
+            HighlightExpression.HIGHLIGHT_FIELD,
+            getCluster().getTypeFactory().createSqlType(SqlTypeName.ANY));
+    }
     CalciteLogicalIndexScan newScan = copyWithNewSchema(schemaBuilder.build());
     newScan.getPushDownContext().setHighlightArgs(highlightArgs);
     return newScan;
 }
Suggestion importance[1-10]: 6


Why: The pushDownHighlight method unconditionally adds _highlight to the schema, unlike LogicalHighlight.create which guards against duplicates. While double-invocation may be unlikely in practice, the inconsistency is a real maintenance risk and could cause field-resolution errors.

Low
Security
Escape special characters in highlight query terms

The terms are interpolated directly into a query string without escaping special
characters (e.g., +, -, &&, ||, !, (, ), {, }, [, ], ^, ~, *, ?, :, \, /). A term
containing any of these characters will produce a malformed or unintended query,
potentially causing an OpenSearch exception. Use QueryParser.escape or equivalent to
sanitize each term before embedding it in the query string.

opensearch/src/main/java/org/opensearch/sql/opensearch/storage/scan/AbstractCalciteIndexScan.java [511-522]

 } else {
   // Highlight specific terms across all fields
   String queryStr =
       highlightArgs.stream()
-          .map(term -> "\"" + term + "\"")
+          .map(term -> "\"" + org.apache.lucene.queryparser.classic.QueryParser.escape(term) + "\"")
           .collect(java.util.stream.Collectors.joining(" OR "));
   org.opensearch.search.fetch.subphase.highlight.HighlightBuilder.Field field =
       new org.opensearch.search.fetch.subphase.highlight.HighlightBuilder.Field("*")
           .highlightQuery(
               org.opensearch.index.query.QueryBuilders.queryStringQuery(queryStr)
                   .defaultField("*"));
   hb.field(field);
 }
Suggestion importance[1-10]: 6


Why: Terms are interpolated directly into a Lucene query string without escaping special characters, which could cause malformed queries or OpenSearch exceptions. However, since terms are already wrapped in quotes ("term"), many special characters are already handled by phrase quoting, reducing the severity somewhat.

Low
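For reference, Lucene's QueryParser.escape backslash-prefixes every reserved character. The stand-in below (a hypothetical escapeLucene helper, written out so the snippet runs without the Lucene dependency) sketches that behavior:

```java
// Hypothetical stand-in for org.apache.lucene.queryparser.classic.QueryParser.escape,
// shown here only so the example runs without a Lucene dependency.
public class EscapeDemo {
  // Characters the Lucene classic query parser treats as syntax.
  static final String RESERVED = "+-&|!(){}[]^\"~*?:\\/";

  static String escapeLucene(String term) {
    StringBuilder sb = new StringBuilder();
    for (char c : term.toCharArray()) {
      if (RESERVED.indexOf(c) >= 0) {
        sb.append('\\'); // backslash-prefix each reserved character
      }
      sb.append(c);
    }
    return sb.toString();
  }

  public static void main(String[] args) {
    // "error: disk (full)" becomes "error\: disk \(full\)"
    System.out.println(escapeLucene("error: disk (full)"));
  }
}
```

Note the real QueryParser.escape also leaves the AND/OR/NOT keywords alone; they are words, not reserved characters, which is why phrase-quoting each term remains useful even after escaping.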
General
Use constant instead of hardcoded string literal

The highlight field is resolved by hardcoding the string "_highlight" instead of
using the constant HighlightExpression.HIGHLIGHT_FIELD. This creates a maintenance
risk where renaming the constant would silently break highlight resolution. Use the
constant for consistency and safety.

opensearch/src/main/java/org/opensearch/sql/opensearch/storage/scan/OpenSearchIndexEnumerator.java [95-98]

 private Object resolveForCalcite(ExprValue value, String rawPath) {
-    if ("_highlight".equals(rawPath)) {
-      ExprValue hl = ExprValueUtils.getTupleValue(value).get("_highlight");
+    if (HighlightExpression.HIGHLIGHT_FIELD.equals(rawPath)) {
+      ExprValue hl = ExprValueUtils.getTupleValue(value).get(HighlightExpression.HIGHLIGHT_FIELD);
       return (hl != null && !hl.isMissing()) ? hl : null;
     }
     return ExprValueUtils.resolveRefPaths(value, List.of(rawPath.split("\\."))).valueForCalcite();
 }
Suggestion importance[1-10]: 5

__

Why: Using the hardcoded string "_highlight" instead of HighlightExpression.HIGHLIGHT_FIELD is a maintenance risk. This is a straightforward improvement for consistency and refactoring safety.

Low
Suggestions up to commit f0dacc0
Category | Suggestion | Impact
Possible issue
Prevent duplicate highlight column in schema

If _highlight is already present in the row type (e.g., when pushDownHighlight is
called more than once), the method unconditionally adds a duplicate _highlight
column, which will cause a schema conflict. Add a guard to only add the column when
it is not already present, mirroring the check done in LogicalHighlight.create.

opensearch/src/main/java/org/opensearch/sql/opensearch/storage/scan/CalciteLogicalIndexScan.java [88-97]

 public RelNode pushDownHighlight(List<String> highlightArgs) {
     RelDataTypeFactory.Builder schemaBuilder = getCluster().getTypeFactory().builder();
     schemaBuilder.addAll(getRowType().getFieldList());
-    schemaBuilder.add(
-        HighlightExpression.HIGHLIGHT_FIELD,
-        getCluster().getTypeFactory().createSqlType(SqlTypeName.ANY));
+    if (!getRowType().getFieldNames().contains(HighlightExpression.HIGHLIGHT_FIELD)) {
+        schemaBuilder.add(
+            HighlightExpression.HIGHLIGHT_FIELD,
+            getCluster().getTypeFactory().createSqlType(SqlTypeName.ANY));
+    }
     CalciteLogicalIndexScan newScan = copyWithNewSchema(schemaBuilder.build());
     newScan.getPushDownContext().setHighlightArgs(highlightArgs);
     return newScan;
 }
Suggestion importance[1-10]: 7

__

Why: The pushDownHighlight method unconditionally adds _highlight to the schema without checking if it already exists, which could cause a schema conflict if called multiple times. The fix mirrors the guard already present in LogicalHighlight.create.

Medium
Preserve schema column order in row iterator

The new iterator() implementation filters out _highlight but also changes the column
ordering guarantee: it now iterates over the map's entry set order rather than the
schema column order. This can cause column values to be misaligned with the schema
columns returned by columnNameTypes(). The original code used Map::values which
relied on insertion order; the new code should explicitly follow the schema column
order to be safe.

protocol/src/main/java/org/opensearch/sql/protocol/response/QueryResult.java [89-97]

 return exprValues.stream()
     .map(ExprValueUtils::getTupleValue)
     .map(
         tuple ->
-            tuple.entrySet().stream()
-                .filter(e -> !HIGHLIGHT_FIELD.equals(e.getKey()))
-                .map(e -> e.getValue().value())
+            schema.getColumns().stream()
+                .filter(col -> !HIGHLIGHT_FIELD.equals(getColumnName(col)))
+                .map(col -> tuple.getOrDefault(getColumnName(col),
+                    org.opensearch.sql.data.model.ExprMissingValue.of()).value())
                 .toArray(Object[]::new))
     .iterator();
Suggestion importance[1-10]: 7

__

Why: The new iterator() implementation iterates over the map's entry set order rather than the schema column order, which could cause column values to be misaligned with the schema. Iterating over schema columns explicitly ensures correct alignment between datarows and schema.

Medium
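The alignment concern can be illustrated standalone: iterating a tuple map's entry set follows insertion order, while iterating the schema's column list guarantees each datarow matches columnNameTypes(). A minimal sketch (hypothetical column names, a plain Map standing in for the ExprValue tuple):

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class RowOrderDemo {
  // Hypothetical schema order; in the real code this comes from schema.getColumns().
  static final List<String> SCHEMA = List.of("firstname", "lastname", "_highlight");

  static Object[] toRow(Map<String, Object> tuple) {
    // Walk the schema, not the map, so values stay aligned with the schema columns,
    // and drop the internal _highlight column from the datarow.
    return SCHEMA.stream()
        .filter(col -> !"_highlight".equals(col))
        .map(col -> tuple.getOrDefault(col, null))
        .toArray();
  }

  public static void main(String[] args) {
    Map<String, Object> tuple = new LinkedHashMap<>();
    tuple.put("lastname", "Doe"); // inserted before firstname on purpose
    tuple.put("firstname", "Jane");
    // Entry-set order would yield [Doe, Jane]; schema order yields [Jane, Doe].
    System.out.println(java.util.Arrays.toString(toRow(tuple)));
  }
}
```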
Security
Escape special characters in highlight query terms

The terms are interpolated directly into a query string using "\"" + term + "\""
without any escaping. A term containing a double-quote character (e.g., he said
"hello") will break the query string syntax and may cause an OpenSearch request
failure or unexpected behavior. The terms should be escaped before being embedded in
the query string.

opensearch/src/main/java/org/opensearch/sql/opensearch/storage/scan/AbstractCalciteIndexScan.java [511-522]

 } else {
   // Highlight specific terms across all fields
   String queryStr =
       highlightArgs.stream()
-          .map(term -> "\"" + term + "\"")
+          .map(term -> "\"" + term.replace("\\", "\\\\").replace("\"", "\\\"") + "\"")
           .collect(java.util.stream.Collectors.joining(" OR "));
   org.opensearch.search.fetch.subphase.highlight.HighlightBuilder.Field field =
       new org.opensearch.search.fetch.subphase.highlight.HighlightBuilder.Field("*")
           .highlightQuery(
               org.opensearch.index.query.QueryBuilders.queryStringQuery(queryStr)
                   .defaultField("*"));
   hb.field(field);
 }
Suggestion importance[1-10]: 6

__

Why: Terms containing double-quote characters are interpolated directly into the query string without escaping, which could break the query string syntax. However, since the grammar restricts highlightArg to string literals parsed by the PPL parser, the risk of unescaped quotes reaching this code is limited in practice.

Low
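The narrower fix suggested above neutralizes only the two characters that can break phrase quoting; as a standalone sketch of that replace chain:

```java
public class QuoteEscapeDemo {
  // Escape backslashes first, then double quotes, so the term is safe inside
  // "..." phrase quoting. Order matters: reversing the two replaces would
  // double-escape the backslashes introduced for the quotes.
  static String quoteTerm(String term) {
    return "\"" + term.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
  }

  public static void main(String[] args) {
    // he said "hello"  ->  "he said \"hello\""
    System.out.println(quoteTerm("he said \"hello\""));
  }
}
```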
Suggestions up to commit 5891541
Category | Suggestion | Impact
Possible issue
Prevent duplicate highlight column in schema

The _highlight column is unconditionally added to the schema even if it already
exists (e.g., if pushDownHighlight is called twice). This can cause duplicate column
errors. Add a guard to only add the column when it is not already present, similar
to how LogicalHighlight.create handles it.

opensearch/src/main/java/org/opensearch/sql/opensearch/storage/scan/CalciteLogicalIndexScan.java [90-99]

 public RelNode pushDownHighlight(List<String> highlightArgs) {
     RelDataTypeFactory.Builder schemaBuilder = getCluster().getTypeFactory().builder();
     schemaBuilder.addAll(getRowType().getFieldList());
-    schemaBuilder.add(
-        HighlightExpression.HIGHLIGHT_FIELD,
-        getCluster().getTypeFactory().createSqlType(SqlTypeName.ANY));
+    if (!getRowType().getFieldNames().contains(HighlightExpression.HIGHLIGHT_FIELD)) {
+        schemaBuilder.add(
+            HighlightExpression.HIGHLIGHT_FIELD,
+            getCluster().getTypeFactory().createSqlType(SqlTypeName.ANY));
+    }
     CalciteLogicalIndexScan newScan = copyWithNewSchema(schemaBuilder.build());
     newScan.getPushDownContext().setHighlightArgs(highlightArgs);
     return newScan;
 }
Suggestion importance[1-10]: 6

__

Why: The pushDownHighlight method unconditionally adds _highlight to the schema without checking if it already exists, which could cause duplicate column errors if called multiple times. The fix mirrors the guard already present in LogicalHighlight.create.

Low
Security
Escape highlight terms to prevent query injection

The term values are interpolated directly into a query string using "\"" + term + "\""
without any escaping. If a term contains special query string characters (e.g.,
", \, :, *), this could produce a malformed or injected query. Terms should be
escaped before being embedded in the query string.

opensearch/src/main/java/org/opensearch/sql/opensearch/storage/scan/AbstractCalciteIndexScan.java [511-522]

 } else {
     // Highlight specific terms across all fields
     String queryStr =
         highlightArgs.stream()
-            .map(term -> "\"" + term + "\"")
+            .map(term -> "\"" + org.apache.lucene.queryparser.classic.QueryParser.escape(term) + "\"")
             .collect(java.util.stream.Collectors.joining(" OR "));
     org.opensearch.search.fetch.subphase.highlight.HighlightBuilder.Field field =
         new org.opensearch.search.fetch.subphase.highlight.HighlightBuilder.Field("*")
             .highlightQuery(
                 org.opensearch.index.query.QueryBuilders.queryStringQuery(queryStr)
                     .defaultField("*"));
     hb.field(field);
 }
Suggestion importance[1-10]: 6

__

Why: User-supplied terms are interpolated directly into a Lucene query string without escaping, which could produce malformed queries or allow injection of special characters. Using QueryParser.escape would mitigate this risk.

Low
General
Guard against null highlight field values

Calling entry.getValue().collectionValue() will throw a NullPointerException or
ExpressionEvaluationException if the highlight field value for a particular key is
not a collection type. Add a null/type check before calling collectionValue() to
handle unexpected value shapes gracefully.

protocol/src/main/java/org/opensearch/sql/protocol/response/QueryResult.java [110-124]

 map(
     tuple -> {
       ExprValue hl = tuple.get(HIGHLIGHT_FIELD);
       if (hl == null || hl.isMissing() || hl.isNull()) {
         return null;
       }
       Map<String, Object> hlMap = new LinkedHashMap<>();
       for (Map.Entry<String, ExprValue> entry : hl.tupleValue().entrySet()) {
+        ExprValue val = entry.getValue();
+        if (val == null || val.isMissing() || val.isNull()) continue;
         hlMap.put(
             entry.getKey(),
-            entry.getValue().collectionValue().stream()
+            val.collectionValue().stream()
                 .map(ExprValue::stringValue)
                 .collect(Collectors.toList()));
       }
       return (Map<String, Object>) hlMap;
     })
Suggestion importance[1-10]: 4

__

Why: Calling collectionValue() on a highlight entry without checking if the value is null or missing could throw an exception for unexpected value shapes. Adding a null/missing check improves robustness.

Low
Verify child plan attachment for highlight command

The visitHighlightCommand returns a new Highlight(highlightArgs) without attaching
it to the child plan. All other command visitors (e.g., visitTrendlineCommand)
return the node and rely on the caller to call .attach(child), but AstDSL.highlight
calls .attach(input). Since visitHighlightCommand is called by the AST builder
framework which chains nodes, the returned Highlight node will have a null child,
causing a NullPointerException when the plan is traversed. The child should be
attached here or the framework must handle it — verify the attachment mechanism is
consistent with other commands.

ppl/src/main/java/org/opensearch/sql/ppl/parser/AstBuilder.java [1290-1305]

 public UnresolvedPlan visitHighlightCommand(OpenSearchPPLParser.HighlightCommandContext ctx) {
     List<String> highlightArgs =
         ctx.highlightArg().stream()
             .map(
                 arg -> {
                   if (arg.STAR() != null) {
                     return "*";
                   } else if (arg.stringLiteral() != null) {
                     return StringUtils.unquoteText(arg.stringLiteral().getText());
                   } else {
                     return arg.fieldExpression().getText();
                   }
                 })
             .collect(Collectors.toList());
+    // Return without attach; the framework (visitPipeCommands) attaches the child.
+    // Ensure this is consistent with how other commands are handled.
     return new Highlight(highlightArgs);
 }
Suggestion importance[1-10]: 2

__

Why: The suggestion asks to verify that the framework attaches the child, but the improved_code is essentially identical to the existing_code with only a comment added. Other command visitors in the same file follow the same pattern of returning without attaching, so this is consistent and not a real bug.

Low

@RyanL1997 RyanL1997 added PPL Piped processing language v3.6.0 Issues targeting release v3.6.0 feature labels Mar 14, 2026
Signed-off-by: Jialiang Liang <jiallian@amazon.com>
@github-actions
Copy link
Contributor

PR Code Analyzer ❗

AI-powered 'Code-Diff-Analyzer' found issues on commit f0dacc0.

PathLineSeverityDescription
opensearch/src/main/java/org/opensearch/sql/opensearch/storage/scan/AbstractCalciteIndexScan.java519lowUser-supplied highlight terms are wrapped in double quotes but not escaped before being passed to queryStringQuery(). A term containing a literal '"' character could partially break out of the phrase context and alter the Lucene query structure. Because PPL users are already authenticated and can issue arbitrary search queries, the practical exploitability is limited, but proper escaping of the term string (e.g., via QueryParser.escape) would be safer.

The table above displays the top 10 most important findings.

Total: 1 | Critical: 0 | High: 0 | Medium: 0 | Low: 1


Pull Requests Author(s): Please update your Pull Request according to the report above.

Repository Maintainer(s): You can bypass diff analyzer by adding label skip-diff-analyzer after reviewing the changes carefully, then re-run failed actions. To re-enable the analyzer, remove the label, then re-run all actions.


⚠️ Note: The Code-Diff-Analyzer helps protect against potentially harmful code patterns. Please ensure you have thoroughly reviewed the changes beforehand.

Thanks.

@github-actions
Copy link
Contributor

Persistent review updated to latest commit f0dacc0

Signed-off-by: Jialiang Liang <jiallian@amazon.com>
@github-actions
Copy link
Contributor

Persistent review updated to latest commit ab37b5b

Signed-off-by: Jialiang Liang <jiallian@amazon.com>
@github-actions
Copy link
Contributor

Persistent review updated to latest commit a7c44d8

Signed-off-by: Jialiang Liang <jiallian@amazon.com>
@github-actions
Copy link
Contributor

Persistent review updated to latest commit 51356cd

Signed-off-by: Jialiang Liang <jiallian@amazon.com>
@github-actions
Copy link
Contributor

Persistent review updated to latest commit b2d6afe


Labels

feature PPL Piped processing language v3.6.0 Issues targeting release v3.6.0
