[Feature](agg) Add aggregate function entropy #60859

Closed
Copilot wants to merge 2 commits into master from copilot/review-algorithm-implementation

Copilot AI commented Feb 26, 2026

Implements Shannon Entropy (base-2) as a new aggregate function, supporting single-column numeric/string types and multi-column generic inputs via serialized composite keys.
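
For reference, the standard definition of base-2 Shannon entropy over a frequency table, written with symbols that are not in the PR itself: $c_i$ is the number of rows holding the $i$-th distinct value, $k$ is the number of distinct values, and $N = \sum_i c_i$ is the number of non-NULL rows.

$$
H = -\sum_{i=1}^{k} p_i \log_2 p_i, \qquad p_i = \frac{c_i}{N}
$$

As a worked example, the values a, a, a, b give $p = (3/4, 1/4)$ and $H = -\left(\tfrac{3}{4}\log_2\tfrac{3}{4} + \tfrac{1}{4}\log_2\tfrac{1}{4}\right) \approx 0.811$ bits, while four distinct values that each occur equally often give exactly 2 bits.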

BE (aggregate_function_entropy.h/.cpp)

  • AggregateFunctionEntropyData<Value, Hash>: a frequency map plus a total_count field that is tracked incrementally, so get_result() needs only a single pass over the map (the original design iterated the map twice).
  • Three data specializations:
    • SingleNumericData<T>: direct typed value key
    • SingleStringData: XXH128 hash of raw bytes
    • GenericData: XXH128 hash of arena-serialized concatenation of all columns (multi-arg support)
  • AggregateFunctionEntropy<Data>: marked final, inherits VarargsExpression + NullableAggregateFunction; registered via register_function_both.
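
The frequency-map state described above can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: `EntropySketch`, its `add`/`merge` methods, and the use of `std::unordered_map` are all assumptions made for the example, and returning 0.0 for empty input is likewise an assumption (the PR description does not say what the real function returns in that case).

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <string>
#include <unordered_map>

// Hypothetical stand-in for the aggregate state; names are illustrative only.
template <typename Key>
struct EntropySketch {
    std::unordered_map<Key, std::uint64_t> freq; // distinct value -> occurrences
    std::uint64_t total_count = 0;               // maintained on every add/merge

    // Record one non-NULL input value.
    void add(const Key& key) {
        ++freq[key];
        ++total_count;
    }

    // Combine a partial state, e.g. after deserializing another instance.
    void merge(const EntropySketch& other) {
        for (const auto& entry : other.freq) {
            freq[entry.first] += entry.second;
        }
        total_count += other.total_count;
    }

    // Single pass over the map: total_count is already known, so no
    // separate counting pass is needed before computing probabilities.
    double get_result() const {
        if (total_count == 0) {
            return 0.0; // assumption for this sketch only
        }
        const double n = static_cast<double>(total_count);
        double entropy = 0.0;
        for (const auto& entry : freq) {
            const double p = static_cast<double>(entry.second) / n;
            entropy -= p * std::log2(p);
        }
        return entropy;
    }
};
```

Because `merge` only adds counts, merging partial states built from disjoint slices of the input yields the same result as feeding every row into one state, which is the property the serialize/deserialize/merge round-trip tests rely on.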

FE (Entropy.java)

NullableAggregateFunction with a varArgs(AnyDataType) signature returning DoubleType; wired into BuiltinAggregateFunctions and AggregateFunctionVisitor.visitEntropy.

Test infrastructure fixes

  • agg_function_test.h: Guards deserialize_and_merge_from_column_range with block.rows() != 0 to prevent size_t underflow on empty input (previously 0 - 1 wrapped to SIZE_MAX and was passed as the end index).
  • column_helper.h: Extends two-column create_block to create_block<T1, T2 = T1> so mixed-type blocks (e.g. int + string) can be constructed in unit tests.
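
The underflow in the first fix is ordinary unsigned wrap-around. A small sketch of the failure and of the guard pattern follows; both function names are hypothetical and do not appear in the test harness, which is only described here as checking `block.rows() != 0` before the call.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Unsigned arithmetic wraps instead of going negative:
// with rows == 0 this returns SIZE_MAX, not -1.
inline std::size_t unguarded_last_index(std::size_t rows) {
    return rows - 1;
}

// Hypothetical guarded variant: writes the last valid index to *out and
// returns true only when at least one row exists; *out is left untouched
// for empty input so the caller can skip the range call entirely.
inline bool last_row_index(std::size_t rows, std::size_t* out) {
    if (rows == 0) {
        return false;
    }
    *out = rows - 1;
    return true;
}
```

A caller would invoke the range-based merge only when `last_row_index` returns true, which mirrors guarding the call with a non-empty check.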

Tests

Unit tests (agg_entropy_test.cpp) cover numeric, string, multi-column generic, nullable (NULLs skipped), and empty-input cases via serialize/deserialize/merge round-trips. Regression suite (aggregate_function_entropy.groovy) exercises all SQL types including complex types, GROUP BY, and window functions.


Thearas commented Feb 26, 2026

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

Co-authored-by: zclllyybb <61408379+zclllyybb@users.noreply.github.com>
Copilot AI changed the title from "[WIP] Review algorithm implementation for correctness and performance" to "[Feature](agg) Add aggregate function entropy" on Feb 26, 2026
zclllyybb closed this on Feb 26, 2026