Magic-byte dispatch for read_one_image_array (perf) #20

@rfrenchseti

Background

src/picmaker/io.py::read_one_image_array runs the format cascade by trying each reader in turn (pickle → numpy → VICAR → FITS → PIL → PDS3). Each reader either opens the file itself or relies on its own magic-byte sniff. The FITS branch additionally re-reads the first 9 bytes (io.py:142-144) to sniff b'SIMPLE  =' (two spaces before the =) before calling astropy.io.fits.open.

For a file whose true format is at the end of the cascade (e.g. a PDS3 label), every preceding probe opens and partially reads the file before giving up. On a slow filesystem or for a directory tree of thousands of files this adds measurable wall-clock cost.

See CODEBASE_CRITIQUE.md §5 (Performance and resource use, "Finding (Low) — io.read_one_image_array re-opens the file up to five times").

Suggested approach

Read the first 32 bytes of the file once, then dispatch based on the magic bytes:

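| Format     | Magic bytes                                                                   |
|------------|-------------------------------------------------------------------------------|
| pickle     | \x80 (protocol 2+) or \x28\x6c\x70, i.e. (lp (protocol 0, list payloads only) |
| numpy .npy | \x93NUMPY                                                                     |
| VICAR      | LBLSIZE= (within the first ~16 bytes)                                         |
| FITS       | SIMPLE  = (two spaces before the =)                                           |
| PNG        | \x89PNG                                                                       |
| TIFF       | II\x2a\x00 (LE) or MM\x00\x2a (BE)                                            |
| JPEG       | \xff\xd8\xff                                                                  |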
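As a rough illustration, the table can be expressed as one pure function over the header bytes, which keeps the matching logic testable without touching the filesystem. The name sniff_format and the returned labels are hypothetical and not part of picmaker; PNG, TIFF and JPEG are grouped because they all go to the PIL reader.

```python
from typing import Optional

# Magic prefixes for the formats that are all handed to the PIL reader.
PIL_MAGICS = (b'\x89PNG', b'II\x2a\x00', b'MM\x00\x2a', b'\xff\xd8\xff')


def sniff_format(header: bytes) -> Optional[str]:
    """Return a format label for the header bytes, or None if nothing matches.

    Hypothetical helper: the labels are illustrative, not picmaker names.
    """
    if header.startswith((b'\x80', b'(lp')):
        return 'pickle'
    if header.startswith(b'\x93NUMPY'):
        return 'npy'
    if b'LBLSIZE=' in header[:16]:
        return 'vicar'
    if header.startswith(b'SIMPLE  ='):  # two spaces before '='
        return 'fits'
    if header.startswith(PIL_MAGICS):
        return 'pil'
    return None  # caller falls back to the PDS3 label path


if __name__ == '__main__':
    for sample in (b'\x93NUMPY\x01\x00', b'SIMPLE  =', b'PDS_VERSION_ID'):
        print(sample[:8], sniff_format(sample))
```

Returning None instead of raising leaves the caller free to try the PDS3 label path before reporting an unrecognised file.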

Sketch:

def read_one_image_array(filename, labelfile, obj=None, hst=False):
    filename_str = str(filename)
    cascade_errors = []  # failures collected for the cascade-end ExceptionGroup
    try:
        # Read the header once; every format check below reuses these bytes.
        with open(filename_str, 'rb') as f:
            header = f.read(32)
    except OSError as exc:
        cascade_errors.append(exc)
        header = b''  # nothing matches; fall through to the PDS3 label path

    # pickle: protocol 2+ starts with \x80; '(lp' is a protocol-0 list
    if header.startswith((b'\x80', b'(lp')):
        return _read_pickle(filename_str)
    if header.startswith(b'\x93NUMPY'):
        return _read_npy(filename_str)
    if b'LBLSIZE=' in header[:16]:
        return _read_vicar(filename_str)
    if header.startswith(b'SIMPLE  ='):
        return _read_fits(filename_str, obj, hst)
    # PNG, little-endian TIFF, big-endian TIFF, JPEG
    if header.startswith((b'\x89PNG', b'II\x2a\x00', b'MM\x00\x2a', b'\xff\xd8\xff')):
        return _read_pil(filename_str)
    if labelfile:
        result = read_pds_labeled_image_array(labelfile, obj)
        if result is not None:
            return result

    raise OSError(...) from ExceptionGroup(...)

Each per-format helper is the body of the corresponding branch in the current cascade. The factored helpers are easier to test in isolation (one of the things _process_one_image in #12 would consume directly).
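To illustrate what testing a helper in isolation could look like, here is a minimal sketch for the pickle case. The body of _read_pickle below is an assumption (a plain pickle.load); the real body would be lifted from the existing cascade branch in io.py, which is not reproduced in this issue. roundtrip is a hypothetical test utility.

```python
import os
import pickle
import tempfile


def _read_pickle(filename_str):
    """Assumed helper body: load a single pickled object from the file."""
    with open(filename_str, 'rb') as f:
        return pickle.load(f)


def roundtrip(obj, protocol=2):
    """Write obj to a temporary pickle file and read it back via the helper."""
    fd, path = tempfile.mkstemp(suffix='.pickle')
    try:
        with os.fdopen(fd, 'wb') as f:
            pickle.dump(obj, f, protocol=protocol)
        return _read_pickle(path)
    finally:
        os.remove(path)  # clean up even if the helper raises


if __name__ == '__main__':
    print(roundtrip([[1, 2], [3, 4]]))
```

Because the helper takes only a filename, a test like this needs no label file, no obj argument, and none of the other readers.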

Why now

  • The cascade-end OSError is already chained from ExceptionGroup (see codebase priority 4); the magic-byte rewrite preserves those semantics.
  • The reader cascade is the bottleneck for picmaker -r --pattern '*' <dir> on directory trees with many files.

Acceptance criteria

  • read_one_image_array reads each input file's first 32 bytes once for format dispatch.
  • Per-format readers are extracted as private helpers (_read_pickle, _read_npy, _read_vicar, _read_fits, _read_pil).
  • Cascade-end behaviour unchanged: an unrecognised file still raises OSError('Unrecognized image file format: ...') chained from ExceptionGroup.
  • Existing tests (test_io.py, test_io_cascade.py, test_pds3_reader.py, test_pds3_reader_branches.py, test_warning_elevation.py) pass unchanged.
  • Add one benchmark / microtest showing the dispatch picks the right reader for each format without opening the file twice.
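One possible shape for the microtest in the last criterion is sketched below: wrap builtins.open in a spy and count calls while the dispatcher runs. dispatch is a toy stand-in that only sniffs the header; count_opens is a hypothetical test utility. Note that in the sketch above each helper receives a filename, so a real reader would open the file again after dispatch; this count covers the dispatch step only.

```python
import builtins
import os
import tempfile
from unittest import mock


def dispatch(filename):
    """Toy stand-in for the dispatcher: one open, one 32-byte read."""
    with open(filename, 'rb') as f:
        header = f.read(32)
    if header.startswith(b'\x93NUMPY'):
        return 'npy'
    if header.startswith(b'SIMPLE  ='):
        return 'fits'
    return None


def count_opens(func, path):
    """Call func(path) and return (result, number of builtins.open calls)."""
    real_open = builtins.open
    # side_effect forwards every call to the real open, so I/O still works.
    with mock.patch('builtins.open', side_effect=real_open) as spy:
        result = func(path)
    return result, spy.call_count


if __name__ == '__main__':
    fd, path = tempfile.mkstemp(suffix='.fits')
    try:
        with os.fdopen(fd, 'wb') as f:
            f.write(b'SIMPLE  =' + b' ' * 71)  # one 80-byte FITS card
        print(count_opens(dispatch, path))
    finally:
        os.remove(path)
```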

Related

  • CODEBASE_CRITIQUE.md §5 — original finding.
  • #12 — function-split refactor; the per-format helpers from this issue feed the _process_one_image helper described there.
