CometPlainVector: validity-bitmap byte cache for sequential reads #4279

@mbutrovich

What is the problem the feature request solves?

CometPlainVector.getBoolean(int) already caches the last-read data-bitmap byte to avoid re-reading it eight times per byte of bit-packed booleans:

private byte booleanByteCache;
private int booleanByteCacheIndex = -1;

public boolean getBoolean(int rowId) {
  int byteIndex = rowId >> 3;
  if (byteIndex != booleanByteCacheIndex) {
    booleanByteCache = getByte(byteIndex);
    booleanByteCacheIndex = byteIndex;
  }
  return ((booleanByteCache >> (rowId & 7)) & 1) == 1;
}

The validity bitmap (checked by isNullAt) is also bit-packed, one bit per row, but it has no equivalent cache. Sequential row reads on a nullable column therefore read the same validity-buffer byte eight times in a row. Every caller that reads a column row-by-row (including the native scan output path, the shuffle read path, and the JVM UDF dispatch kernel in #4267) pays for these duplicate reads.

Describe the potential solution

Add a sibling byte cache for the validity bitmap, using the same pattern already present for booleanByteCache:

private byte validityByteCache;
private int validityByteCacheIndex = -1;

@Override
public boolean isNullAt(int rowId) {
  if (this.valueBufferAddress == -1) return true;
  int byteIndex = rowId >> 3;
  if (byteIndex != validityByteCacheIndex) {
    validityByteCache = /* read validity-buffer byte at byteIndex */;
    validityByteCacheIndex = byteIndex;
  }
  return ((validityByteCache >> (rowId & 7)) & 1) == 0;
}

@Override
public void setNumNulls(int numNulls) {
  super.setNumNulls(numNulls);
  this.booleanByteCacheIndex = -1;
  this.validityByteCacheIndex = -1;  // invalidate on state transitions
}

This collapses the eight reads of each validity byte into one read per 8 rows on sequential access.
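
To make the read-count claim concrete, here is a minimal standalone sketch. It assumes a plain byte[] standing in for the validity buffer and a counter standing in for real memory loads; the class ValidityCacheSketch and its methods are hypothetical illustrations, not Comet's code. It compares an uncached lookup with the cached pattern proposed above over 32 sequential rows.

```java
// Hypothetical stand-in for a validity buffer reader; not Comet's actual classes.
final class ValidityCacheSketch {
  private final byte[] validity;  // one bit per row; a set bit means the value is present
  private int bufferReads = 0;    // counts simulated validity-buffer loads

  private byte validityByteCache;
  private int validityByteCacheIndex = -1;

  ValidityCacheSketch(byte[] validity) {
    this.validity = validity;
  }

  // Simulates the load from the validity buffer and records it.
  private byte readValidityByte(int byteIndex) {
    bufferReads++;
    return validity[byteIndex];
  }

  // Baseline: every call loads the byte again.
  boolean isNullAtUncached(int rowId) {
    return ((readValidityByte(rowId >> 3) >> (rowId & 7)) & 1) == 0;
  }

  // Proposed pattern: reload only when the row falls in a different byte.
  boolean isNullAtCached(int rowId) {
    int byteIndex = rowId >> 3;
    if (byteIndex != validityByteCacheIndex) {
      validityByteCache = readValidityByte(byteIndex);
      validityByteCacheIndex = byteIndex;
    }
    return ((validityByteCache >> (rowId & 7)) & 1) == 0;
  }

  int bufferReads() {
    return bufferReads;
  }

  public static void main(String[] args) {
    byte[] validity = {(byte) 0b10110101, (byte) 0xFF, 0x00, (byte) 0x0F};
    ValidityCacheSketch uncached = new ValidityCacheSketch(validity);
    ValidityCacheSketch cached = new ValidityCacheSketch(validity);
    for (int row = 0; row < validity.length * 8; row++) {
      // Both paths must agree on nullness for every row.
      if (uncached.isNullAtUncached(row) != cached.isNullAtCached(row)) {
        throw new IllegalStateException("mismatch at row " + row);
      }
    }
    System.out.println("uncached reads: " + uncached.bufferReads()); // 32
    System.out.println("cached reads:   " + cached.bufferReads());   // 4
  }
}
```

In this sketch the saving comes only from access order: the cache helps because consecutive row IDs share a byte index, which is the sequential case the issue targets.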

Additional context

  • Mirrors the established pattern in the same class (booleanByteCache).
  • The existing setNumNulls override resets booleanByteCacheIndex; the validity cache would extend that.
  • No API change. No behavior change. Strictly reduces per-row work for any nullable column read sequentially.
  • Random-access patterns see no benefit and no meaningful regression: each call misses the cache and performs the same validity-byte read as today, plus one index comparison and two field writes.
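
The invalidation and random-access points above can be illustrated with a small sketch. It assumes the validity buffer contents can change when the vector is reused for a new batch; the class ValidityCacheInvalidationSketch and its loadBatch method are hypothetical stand-ins for the setNumNulls reset, not Comet's API.

```java
// Hypothetical illustration of cache invalidation; not Comet's actual classes.
final class ValidityCacheInvalidationSketch {
  private byte[] validity;      // one bit per row; a set bit means the value is present
  private int bufferReads = 0;  // counts simulated validity-buffer loads

  private byte validityByteCache;
  private int validityByteCacheIndex = -1;

  ValidityCacheInvalidationSketch(byte[] validity) {
    this.validity = validity;
  }

  boolean isNullAt(int rowId) {
    int byteIndex = rowId >> 3;
    if (byteIndex != validityByteCacheIndex) {
      bufferReads++;
      validityByteCache = validity[byteIndex];
      validityByteCacheIndex = byteIndex;
    }
    return ((validityByteCache >> (rowId & 7)) & 1) == 0;
  }

  // Assumed analogue of a state transition such as setNumNulls:
  // swap in new buffer contents and optionally drop the cached byte.
  void loadBatch(byte[] newValidity, boolean invalidate) {
    this.validity = newValidity;
    if (invalidate) {
      validityByteCacheIndex = -1;
    }
  }

  int bufferReads() {
    return bufferReads;
  }

  public static void main(String[] args) {
    ValidityCacheInvalidationSketch v =
        new ValidityCacheInvalidationSketch(new byte[] {(byte) 0xFF, (byte) 0xFF});

    // Alternating between rows 0 and 8 switches bytes on every call,
    // so every call is a miss and costs one buffer read, as without a cache.
    for (int i = 0; i < 10; i++) {
      v.isNullAt((i & 1) == 0 ? 0 : 8);
    }
    System.out.println("reads after 10 alternating calls: " + v.bufferReads()); // 10

    // New batch where every row is null, but the cache is NOT reset:
    // the byte cached for row 8 is stale, so the answer is wrong.
    v.loadBatch(new byte[] {0, 0}, false);
    System.out.println("stale isNullAt(8): " + v.isNullAt(8)); // false

    // Same batch with the cache reset: the byte is reloaded and the answer is correct.
    v.loadBatch(new byte[] {0, 0}, true);
    System.out.println("fresh isNullAt(8): " + v.isNullAt(8)); // true
  }
}
```

The stale result in the middle step is why the proposal resets validityByteCacheIndex alongside booleanByteCacheIndex whenever vector state changes.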

Labels: area:scan (Parquet scan / data reading), priority:low (Minor issues, test failures, tooling, cosmetic)
