What is the problem the feature request solves?
CometPlainVector.getUTF8String(int) and CometPlainVector.getBinary(int) dereference the offset buffer address on every call:
long offsetBufferAddress = varWidthVector.getOffsetBuffer().memoryAddress();
int offset = Platform.getInt(null, offsetBufferAddress + rowId * 4L);
That is one Arrow method call, one buffer dereference, and one memoryAddress call per row, duplicating work valueBufferAddress already avoids (the value buffer address is cached as a final long field in the constructor).
For workloads that read VarChar or VarBinary columns heavily through CometPlainVector (native scan, shuffle, and the JVM UDF dispatch path introduced in #4267), the per-row overhead is pure bookkeeping.
Describe the potential solution
Hoist the offset buffer address into a final long offsetBufferAddress field set in the constructor alongside valueBufferAddress, and use it directly in getUTF8String / getBinary:
private final long offsetBufferAddress;
public CometPlainVector(ValueVector vector, boolean useDecimal128, boolean isUuid, boolean isReused) {
// ...
if (vector instanceof BaseVariableWidthVector) {
this.offsetBufferAddress = ((BaseVariableWidthVector) vector).getOffsetBuffer().memoryAddress();
} else {
this.offsetBufferAddress = -1;
}
}
public UTF8String getUTF8String(int rowId) {
if (isNullAt(rowId)) return null;
if (!isBaseFixedWidthVector) {
int offset = Platform.getInt(null, offsetBufferAddress + rowId * 4L);
int length = Platform.getInt(null, offsetBufferAddress + (rowId + 1L) * 4L) - offset;
return UTF8String.fromAddress(null, valueBufferAddress + offset, length);
}
// fixed-width branch unchanged
}
Same pattern for getBinary. JIT treats the address as a register-cached constant across the hot loop.
Additional context
valueBufferAddress already uses this pattern; this proposal extends it for symmetry and for the same reason it was used there.
- Arrow guarantees
getOffsetBuffer() returns the same buffer object for the lifetime of the vector, so caching the address at construction is safe.
- No API surface change. No behavior change. Strictly reduces per-row work for variable-width reads.
- Benefits every current caller of
CometPlainVector, not just one code path.
What is the problem the feature request solves?
CometPlainVector.getUTF8String(int)andCometPlainVector.getBinary(int)dereference the offset buffer address on every call:That is one Arrow method call, one buffer dereference, and one
memoryAddresscall per row, duplicating workvalueBufferAddressalready avoids (the value buffer address is cached as afinal longfield in the constructor).For workloads that read VarChar or VarBinary columns heavily through
CometPlainVector(native scan, shuffle, and the JVM UDF dispatch path introduced in #4267), the per-row overhead is pure bookkeeping.Describe the potential solution
Hoist the offset buffer address into a
final long offsetBufferAddressfield set in the constructor alongsidevalueBufferAddress, and use it directly ingetUTF8String/getBinary:Same pattern for
getBinary. JIT treats the address as a register-cached constant across the hot loop.Additional context
valueBufferAddressalready uses this pattern; this proposal extends it for symmetry and for the same reason it was used there.getOffsetBuffer()returns the same buffer object for the lifetime of the vector, so caching the address at construction is safe.CometPlainVector, not just one code path.