Design spec for duckdb general index for dascore 0.2 #648

d-chambers · 2026-04-02T19:19:56Z

d-chambers
Apr 2, 2026
Maintainer

DuckDB Spool Indexer

Schema

`meta_data`

Index-level metadata and compatibility information.

what_is_this
- Constant identity string for quick sanity checks, e.g. dascore_duckdb_index.
index_version
- Schema/index version used to validate compatibility.
dascore_version
- DASCore version used to create or last update the index.
last_indexed_ns
- Epoch nanoseconds for the last successful index mutation.

`sources`

One row per indexed source. A source is one FiberIO scan unit. It will usually be a file, but may also be a directory or another non-overlapping FiberIO-backed entity.

source_id
- Surrogate primary key for joins and cleanup.
base_uri
- Nullable common root/prefix for this source when one exists. Useful for remote/common-URI sources. Informational only for local spools.
source_path
- Path identifying the source. Relative when a stable base exists, otherwise absolute.
source_format
- FiberIO format name used to read/scan the source.
format_version
- Format version understood by the selected FiberIO.
mtime_ns
- Source-level modified time in epoch nanoseconds. V1 incremental updates depend on this being meaningful.
last_indexed_ns
- Epoch nanoseconds when this source was last indexed.

Notes:

Natural source identity is source_path, interpreted relative to base_uri when base_uri is present.
Incremental update() in v1 is only defined for source types that can provide a meaningful source-level mtime_ns.

`patches`

One row per patch summary emitted by a source.

patch_id
- Surrogate primary key for joins.
source_id
- Foreign key to sources.
source_patch_id
- Patch identity within a source. Empty string is allowed for single-patch sources.
n_dims
- Number of dimensions in the patch.
sample_count_total
- Total number of samples/elements represented by the patch.
dims
- Comma-separated dimension names, no spaces.
shape
- Serialized patch shape.
station
- Promoted patch attr for fast equality filtering.
network
- Promoted patch attr for fast equality filtering.
channel
- Promoted patch attr for fast equality filtering once standardized in PatchAttrs.
tag
- Promoted patch attr for fast equality/glob filtering.
data_type
- Promoted patch attr for fast equality filtering.
data_category
- Promoted patch attr for fast equality filtering.
time_min
- Promoted lower bound for the time coord.
time_max
- Promoted upper bound for the time coord.
time_step
- Promoted sampling interval for the time coord when available.
distance_min
- Promoted lower bound for the distance coord.
distance_max
- Promoted upper bound for the distance coord.
distance_step
- Promoted sampling interval for the distance coord when available.

Constraint:

Unique (source_id, source_patch_id).

Notes:

Natural patch identity is (source_id, source_patch_id).
Promoted attrs live only on patches; they are excluded from attr_index.
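As a rough illustration of the sources/patches relationship and the unique constraint, here is a minimal sketch. It uses Python's stdlib sqlite3 as a stand-in so it runs without extra dependencies; the spec targets DuckDB, and the column types, the reduced column set, and the helper name make_index are assumptions, not part of the spec.

```python
import sqlite3

# Stand-in backend: sqlite3 (stdlib), NOT DuckDB. Column types are assumed,
# and only a subset of the columns listed in the spec is shown.
DDL = """
CREATE TABLE sources (
    source_id       INTEGER PRIMARY KEY,
    base_uri        TEXT,
    source_path     TEXT NOT NULL,
    source_format   TEXT,
    format_version  TEXT,
    mtime_ns        INTEGER,
    last_indexed_ns INTEGER
);
CREATE TABLE patches (
    patch_id        INTEGER PRIMARY KEY,
    source_id       INTEGER NOT NULL REFERENCES sources(source_id),
    source_patch_id TEXT NOT NULL DEFAULT '',
    dims            TEXT,
    station         TEXT,
    tag             TEXT,
    UNIQUE (source_id, source_patch_id)
);
"""


def make_index():
    """Create an in-memory index with the two tables above (hypothetical helper)."""
    con = sqlite3.connect(":memory:")
    con.executescript(DDL)
    return con


con = make_index()
con.execute("INSERT INTO sources (source_id, source_path) VALUES (1, 'a.h5')")
# Empty source_patch_id is allowed for a single-patch source.
con.execute("INSERT INTO patches (source_id, source_patch_id) VALUES (1, '')")
try:
    # A second patch with the same natural identity violates the constraint.
    con.execute("INSERT INTO patches (source_id, source_patch_id) VALUES (1, '')")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True
```

The point of the sketch is only that (source_id, source_patch_id) acts as the natural key while patch_id stays a surrogate for joins.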

`attr_index`

Typed key/value index for non-promoted patch attrs.

patch_id
- Foreign key to patches.
attr_name
- Attr key name.
value_kind
- Dispatcher for which typed value column is active.
units
- Nullable normalized unit string for unit-bearing attrs and scalar quantities.
value_str
- String value storage.
value_int
- Integer value storage.
value_num
- Floating-point value storage.
value_bool
- Boolean value storage.
value_time_ns
- Absolute time attr stored as epoch nanoseconds.
value_duration_ns
- Duration attr stored as nanoseconds.

Constraint:

Primary key (patch_id, attr_name).

Notes:

attr_index stores only non-promoted, non-private, non-history attrs.
Complex attrs are skipped rather than serialized ambiguously.
Serialization contract:
- str -> value_kind = str, value_str
- bool -> value_kind = bool, value_bool
- int -> value_kind = int, value_int
- float -> value_kind = float, value_num
- datetime-like -> value_kind = time, value_time_ns
- timedelta-like -> value_kind = duration, value_duration_ns
- pure unit -> value_kind = unit, value_str and units
- scalar quantity -> value_kind = quantity, value_num and units
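The serialization contract above could be sketched roughly as follows. This is a minimal illustration restricted to stdlib types: the function name serialize_attr and its return shape are hypothetical, naive datetimes are assumed to be UTC, and the unit and quantity cases are left out because they depend on the unit library in use.

```python
from datetime import datetime, timedelta, timezone

_EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)
_MICROSECOND = timedelta(microseconds=1)


def serialize_attr(value):
    """Return (value_kind, column_name, stored_value), or None to skip the attr.

    Hypothetical helper; unit and quantity kinds are intentionally omitted.
    """
    # bool must be tested before int because bool is a subclass of int.
    if isinstance(value, bool):
        return ("bool", "value_bool", value)
    if isinstance(value, str):
        return ("str", "value_str", value)
    if isinstance(value, int):
        return ("int", "value_int", value)
    if isinstance(value, float):
        return ("float", "value_num", value)
    if isinstance(value, datetime):
        # Assumption: naive datetimes are interpreted as UTC.
        if value.tzinfo is None:
            value = value.replace(tzinfo=timezone.utc)
        ns = ((value - _EPOCH) // _MICROSECOND) * 1000
        return ("time", "value_time_ns", ns)
    if isinstance(value, timedelta):
        return ("duration", "value_duration_ns", (value // _MICROSECOND) * 1000)
    # Complex attrs (lists, dicts, arrays, ...) are skipped, not serialized.
    return None
```

Note that stdlib datetime only carries microsecond resolution; a real implementation working with numpy datetime64 values would keep full nanosecond precision.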

`coord_index`

Dispatch table for coord summaries.

coord_entry_id
- Surrogate primary key for one coord representation row.
patch_id
- Foreign key to patches.
coord_name
- Logical coord name, e.g. time, distance, lag_time.
coord_dtype
- Coord dtype summary string.
coord_dims
- Comma-separated coord dims.
coord_len
- Coord length from CoordSummary.
coord_hash
- Optional coord-defined semantic hash after normalization.
units
- Nullable normalized unit string for the coord representation.
payload_table
- Dispatch target table name: coord_time, coord_numeric, or coord_str.
payload_id
- Row id in the dispatched payload table.

Constraint:

Unique (patch_id, coord_name, payload_table).

Notes:

coord_hash is optional and not required for v1 query behavior.
payload_table is a physical dispatch field, not an extra semantic type system.

`coord_time`

Summary-only payload table for time-like coords.

id
- Primary key referenced by coord_index.payload_id.
min
- Lower bound of the coord.
max
- Upper bound of the coord.
step
- Nullable time step; null means do not assume even sampling.
count
- Number of coord elements represented by the summary.
is_monotonic
- Whether the coord is monotonic.
is_relative
- false for absolute epoch-based time, true for duration-like relative time.

`coord_numeric`

Summary-only payload table for numeric coords.

id
- Primary key referenced by coord_index.payload_id.
min
- Lower bound of the coord.
max
- Upper bound of the coord.
step
- Nullable numeric step; null means do not assume even sampling.
count
- Number of coord elements represented by the summary.
is_monotonic
- Whether the coord is monotonic.

`coord_str`

Summary-only payload table for string coords.

id
- Primary key referenced by coord_index.payload_id.
min
- Lower lexical bound when meaningful.
max
- Upper lexical bound when meaningful.
count
- Number of coord elements represented by the summary.

Relations

meta_data

sources
  1
  |
  | source_id
  v
patches
  1 ----< attr_index
  |
  | 1 ----< coord_index ----> coord_time
  |                    \----> coord_numeric
  |                    \----> coord_str

Notes

DuckDB is the replacement backend, not an experimental side path.
The index is summary-only. Exact coord membership can be resolved by reading the patch.
dc.scan(..., full_coords=True) is the cleaner future extension point for exact coord extraction.
Common spool predicates should use promoted patches columns where possible.
Attr predicates use attr_index.
Coord predicates use coord_index plus the resolved payload table.
SQL pushdown follows DASCore selector semantics on a best-effort basis.
If pushdown cannot faithfully express a selector, filtering can fall back to pandas/post-query filtering.
String matching semantics:
- str selector -> Unix glob semantics
- re.Pattern selector -> regex semantics
- collection of strings -> exact membership semantics
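The three string-matching cases can be illustrated with a small pure-Python dispatcher, which is also roughly what a pandas/post-query fallback would need to do. The function name matches is hypothetical, and using re.Pattern.search (rather than match or fullmatch) is an assumption; the spec only says "regex semantics".

```python
import re
from fnmatch import fnmatchcase


def matches(selector, value):
    """Apply the selector to one string value (hypothetical helper)."""
    if isinstance(selector, str):
        # str selector -> Unix glob semantics (case-sensitive).
        return fnmatchcase(value, selector)
    if isinstance(selector, re.Pattern):
        # re.Pattern selector -> regex semantics (assumed: search).
        return selector.search(value) is not None
    # Any other collection of strings -> exact membership, no globbing.
    return value in set(selector)
```

For example, matches("DAS*", "DAS_01") is a glob match, while matches({"a*", "b"}, "abc") is False because collection members are compared exactly.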
Summary-based coord filtering may produce false positives but should avoid false negatives for summary-representable predicates.
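To make the false-positive/false-negative statement concrete, here is a minimal sketch of a range predicate evaluated against a (min, max) summary only. The function name may_match and the treatment of None as an open bound or unknown summary are assumptions.

```python
def may_match(coord_min, coord_max, query_min=None, query_max=None):
    """Return True if a coord summarized by [coord_min, coord_max] could
    contain a value inside [query_min, query_max] (None means open-ended)."""
    if coord_min is None or coord_max is None:
        return True  # unknown summary: keep the candidate, never drop it
    if query_min is not None and coord_max < query_min:
        return False  # the whole coord lies below the query range
    if query_max is not None and coord_min > query_max:
        return False  # the whole coord lies above the query range
    return True


# An unevenly sampled coord with only two values.
values = [0, 10]
summary_hit = may_match(min(values), max(values), 4, 6)
exact_hit = any(4 <= v <= 6 for v in values)
```

Here summary_hit is True while exact_hit is False: the summary cannot rule the patch out, so it is returned as a candidate (a false positive) and exact membership is resolved by reading the patch. A patch whose range really overlaps the query is never rejected.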
Final query results are projected back into one flat patch-row relation before converting to pandas for DirectorySpool.
The flat patch-row relation is reconstructed on demand from the normalized tables and is not stored as a separate materialized table.
update() should stay cheap and should not perform full archive reconciliation.
reconcile() is a distinct operation:
- compare sources on disk to the sources table
- add missing indexed sources
- remove stale indexed sources no longer present
- do not rescan existing sources just because mtime_ns changed
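The reconcile() steps above amount to two set differences. A minimal sketch, assuming sources are compared by their source_path strings (the function name plan_reconcile is hypothetical):

```python
def plan_reconcile(on_disk, indexed):
    """Return (to_add, to_remove) source paths.

    Sources present in both sets are left untouched, even if their
    mtime_ns changed; rescanning those is the job of update().
    """
    on_disk, indexed = set(on_disk), set(indexed)
    to_add = sorted(on_disk - indexed)      # on disk but not yet indexed
    to_remove = sorted(indexed - on_disk)   # indexed but no longer present
    return to_add, to_remove


to_add, to_remove = plan_reconcile(
    on_disk=["a.h5", "b.h5", "c.h5"],
    indexed=["b.h5", "c.h5", "old.h5"],
)
```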
Cleanup is explicit, source-scoped, and driven from source_id.
Deleting a source removes dependent patch, attr, coord, and payload rows in one transaction.
Concurrent writes to the same index are not supported.
update(), reconcile(), and rebuild() assume exclusive access to the index.
Schema compatibility rule:
- if the stored schema is older than the indexer’s minimum readable version, rebuild automatically with warning
- if the stored schema is newer than the running code supports, warn that compatibility is not guaranteed
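The compatibility rule can be read as a three-way decision. A minimal sketch, assuming schema versions are comparable integers and that the caller acts on the returned label (both assumptions; the spec does not fix the version format):

```python
def compatibility_action(stored, min_readable, current):
    """Decide what to do with an existing index (hypothetical helper).

    stored:       index_version found in meta_data
    min_readable: oldest schema version this code can still read
    current:      schema version this code writes
    """
    if stored < min_readable:
        return "rebuild"  # too old: rebuild automatically and warn
    if stored > current:
        return "warn"     # newer than supported: compatibility not guaranteed
    return "ok"
```

For instance, with min_readable=2 and current=3, a stored version of 1 triggers a rebuild, 4 triggers a warning, and 2 or 3 are used as-is.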
Path normalization rule:
- local directory spool: persist relative source_path, resolve against current spool root, do not rely on persisted local base_uri
- remote/common-URI spool: persist base_uri, resolve source_path relative to it
- no-common-root case: store absolute source_path, base_uri = null
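One way to read the three path cases is as a small resolver applied when a source is opened. This is a sketch under assumptions: the function name resolve_source and its arguments are hypothetical, POSIX-style paths are assumed, and a remote base_uri is joined by simple string concatenation.

```python
from pathlib import PurePosixPath


def resolve_source(source_path, base_uri=None, spool_root=None):
    """Turn a persisted source_path into a location that can be opened."""
    path = PurePosixPath(source_path)
    if path.is_absolute():
        # No-common-root case: the absolute path is stored, base_uri is null.
        return str(path)
    if spool_root is not None:
        # Local directory spool: resolve against the *current* root and
        # ignore any persisted base_uri, so a moved spool keeps working.
        return str(PurePosixPath(spool_root) / path)
    if base_uri is not None:
        # Remote/common-URI spool: resolve relative to the persisted base_uri.
        return base_uri.rstrip("/") + "/" + str(path)
    raise ValueError("relative source_path needs a spool_root or base_uri")
```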
Ingest behavior:
- successful source scans should still index patches when some metadata is missing
- missing scalar summary values become nulls where allowed
- unsupported attrs and unsupported coord representations are skipped
- promoted fields may be null when unavailable
- malformed patches within a multi-patch source may be skipped with warning while the rest of the source is indexed
- source-level scan failure fails the source
Multi-patch source updates must be source-scoped and transactional:
- rescan one source
- replace all patch rows for that source
- replace all attr/coord rows for those patches
- never update patch-by-patch
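The source-scoped, transactional replace can be sketched as a single transaction that deletes and reinserts everything belonging to one source. As before, stdlib sqlite3 stands in for DuckDB so the example runs anywhere; the function name replace_source_patches, the reduced tables, and the string-only attrs are assumptions.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE patches (
    patch_id INTEGER PRIMARY KEY,
    source_id INTEGER NOT NULL,
    source_patch_id TEXT NOT NULL,
    UNIQUE (source_id, source_patch_id)
);
CREATE TABLE attr_index (
    patch_id INTEGER NOT NULL,
    attr_name TEXT NOT NULL,
    value_str TEXT,
    PRIMARY KEY (patch_id, attr_name)
);
""")


def replace_source_patches(con, source_id, new_patches):
    """Replace all rows for one source; new_patches is a list of
    (source_patch_id, {attr_name: value_str}) pairs from a fresh scan."""
    with con:  # commits on success, rolls back if anything raises
        con.execute(
            "DELETE FROM attr_index WHERE patch_id IN "
            "(SELECT patch_id FROM patches WHERE source_id = ?)",
            (source_id,),
        )
        con.execute("DELETE FROM patches WHERE source_id = ?", (source_id,))
        for source_patch_id, attrs in new_patches:
            cur = con.execute(
                "INSERT INTO patches (source_id, source_patch_id) VALUES (?, ?)",
                (source_id, source_patch_id),
            )
            con.executemany(
                "INSERT INTO attr_index (patch_id, attr_name, value_str) "
                "VALUES (?, ?, ?)",
                [(cur.lastrowid, name, value) for name, value in attrs.items()],
            )


replace_source_patches(con, 1, [("p0", {"shot": "1"}), ("p1", {"shot": "2"})])
replace_source_patches(con, 1, [("p0", {"shot": "3"})])  # rescan: one patch now
```

If the rescan fails partway through (for example on a duplicate source_patch_id), the rollback leaves the previous rows for that source intact, which is the reason for never updating patch-by-patch.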

TODO

Add channel to the standard PatchAttrs defaults and treat it as a first-class attr, like station, network, and tag.
Ensure channel participates consistently in scan summaries, querying, and promoted patches columns once added to PatchAttrs.

andreas-wuestefeld · 2026-05-17T19:29:35Z

andreas-wuestefeld
May 17, 2026
Collaborator

What problem does this database solve? Is it just a spool on steroids?

d-chambers · 2026-05-18T05:00:58Z

d-chambers
May 18, 2026
Maintainer Author

Exactly.

The main problem with the (directory) spool right now is that it is essentially just a single table with fixed column names corresponding to a few attributes and a time coordinate. That is good for basic operations on time-indexed files but not so good if you want to store (and index) patches with arbitrary coordinate and attribute names and dtypes. This design is intended to handle the latter case.

andreas-wuestefeld · 2026-06-05T18:02:08Z

andreas-wuestefeld
Jun 5, 2026
Collaborator

In principle I like having all the metadata available, but the index quickly becomes very big for acquisitions running for half a year or so.
I see the most value of a spool in having an overview of your data inventory, i.e. from all past projects.

For me a spool should answer these questions:

- What are my acquisition parameters, like GL, sample_rate, spacing, number-of-channels?
- When do I have data (and gaps)?
- Which files do I need for a certain period?
- Where are the individual data files (relative to a parent directory)?
- What format (and version) are the data in?
- What is the (total) size on disk?

Everything else is nice to have, but probably only ever used by a very select group of developers.

Currently a lot of the information is redundantly saved for each file, notably file-format. This could be stored as an HDF5 attribute rather than a column. I would assume (hope??) that people have their data organised in a meaningful way such that only homogeneous data can be "spooled".

HDF5 is a fine and flexible format, and it can be opened outside DASCore easily, increasing the value of a "spool" beyond its intended purpose. But that is just my personal opinion, and I don't have the overview of what great things could be done with a database spool.

d-chambers · 2026-06-06T09:37:13Z

d-chambers
Jun 6, 2026
Maintainer Author

Yes, I share your sentiments, and agree the spool should be able to answer those questions. Let me clarify the motivation of this (work-in-progress) spec.

Motivation

I want the spool to become a generic patch manager, not just an archival tool for "normal" DAS data. In practice, this means the spool becomes a generic array query/access interface that is useful for many different applications, with a rock-solid directory-based implementation.

My assertion: Since generic patch indexing is useful, the complexity should live in the spool rather than being reimplemented in every downstream project, as is required now.

Example use cases

Here are a few examples:

Derived DAS products

Consider a process that creates correlograms from ambient noise recorded by a DAS system. These don't have the typical (distance, time) coordinates but rather (distance, lag_time), with the absolute time stored in the attributes. It gets more complicated if the patches have higher dimensions. For example, correlograms with moving time/distance windows have four coordinates (relative_distance, lag_time, absolute_distance, absolute_time). The current spool is not helpful here, but the expanded spool would enable efficient querying and management of these patches.

Of course there are many more such derived products: dispersion curves, PSD plots, state-of-health summaries, etc. that all represent data products above the archive layer.

Active source

Active-source data can be indexed by time from the shot and store an attribute like "shot number". If we don't have access to generic attributes, we don't have a clean way to know how many shots there are and which files belong to each shot without imposing some semantics on the directory structure or using side information that doesn't integrate with the spool. This leads to a sub-optimal developer experience.

Multi-resolution views

Visualizing large archives is a challenging problem, so creating decimated products (like FBE) at multiple resolutions is really the only feasible path forward. It would be nice if the spool could manage these multi-tiered datasets, helping a GUI select which level/data product to use with simple queries. For example, a viewer might want 1 Hz summary data for an entire day, 100 Hz data for a one-hour window, and full-resolution data for a ten-minute window around an event. That query should be expressible against one indexed archive rather than through a pile of product-specific lookup code.

To manage these scenarios today, downstream applications need a lot of custom code, much of which duplicates the functionality of the spool engine anyway. Why not take on the complexity in exactly one place, so that other codes can simply dump any kind of patch to a directory and get reasonable querying/management from DASCore? Facilitating "rapid application development" is DASCore's reason for existing anyway.

The tradeoff is real: a generic index is more complicated than an index that assumes every file has the same coordinates, attributes, and format. But that complexity has to exist somewhere. Putting it in the spool gives us one tested implementation, one query model, and one place to optimize. Pushing it into applications means every project that handles derived products, active-source data, multi-resolution data, or mixed archives has to invent its own mini-spool.

Who would use this?

> Everything else is nice to have, but probably only ever used by a very select group of developers.

I think this is only true if you consider the Spool solely as an archiver, and not a flexible data manager which can be used above the archive layer. I personally have needed these features in 3 different larger codes I have worked on.

Moreover, I think that as open-source DFOS processing evolves, the utility of a more capable spool will become increasingly apparent.

Mixed-format archives

> Currently a lot of the information is redundantly saved for each file, notably file-format. This could be stored as an HDF5 attribute rather than a column. I would assume (hope??) that people have their data organised in a meaningful way such that only homogeneous data can be "spooled".

I don't think we want to require homogeneous spools for the sake of saving a few bytes in the index. For example, if you had several different interrogator types outputting different file formats while monitoring different arms at NORFOX, should each arm be managed as a separate spool? Again, the problem here is that it pushes complexity away from a well-tested implementation and into every code that accesses that dataset, multiplying lines of code and potential for bugs.

For example, suppose you have a directory for each arm of NORFOX (purely hypothetical):

norfox/
|-- arm_01/
|   |-- interrogator_a_2024-01-01T00-00-00.tdms
|   `-- interrogator_a_2024-01-01T00-01-00.tdms
|-- arm_02/
|   |-- interrogator_b_2024-01-01T00-00-00.h5
|   `-- interrogator_b_2024-01-01T00-01-00.h5
|-- arm_03/
|   `-- interrogator_c_2024-01-01T00-00-00.dat
|-- arm_04/
|   `-- interrogator_d_2024-01-01T00-00-00.hdf5
`-- arm_05/
    `-- interrogator_e_2024-01-01T00-00-00.sg2

If you want to extract a teleseismic event, the homogeneous-spool model forces you to iterate through each arm, create a spool, and then select:

from pathlib import Path

import dascore as dc

event_time = ("2024-01-01T00:03:30", "2024-01-01T00:09:00")
root = Path("/data/norfox")

# One spool per arm: each directory is indexed and queried separately.
event_spools = []
for arm_path in sorted(root.glob("arm_*")):
    arm_spool = dc.spool(arm_path).update()
    event_spools.append(arm_spool.select(time=event_time))

# process_event_patch stands in for user-defined processing code.
for spool in event_spools:
    for patch in spool:
        process_event_patch(patch)

But if we just have one spool:

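import dascore as dc

event_time = ("2024-01-01T00:03:30", "2024-01-01T00:09:00")

# A single mixed-format spool covers every arm at once.
spool = dc.spool("/data/norfox").update()
event_spool = spool.select(time=event_time)

# process_event_patch stands in for user-defined processing code.
for patch in event_spool:
    process_event_patch(patch)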

Large archive APIs

This becomes even more important for large archives. Imagine a spool was behind an API for accessing DAS experiments, e.g. at EarthScope. In that case, the spool index is not just a convenience for a local analyst; it is the catalog that lets the service answer basic questions quickly and consistently: what data exist for this time range, which network/station/channel/arm produced them, what file format are they in, what sampling rate and units should a client expect, which derived products already exist, and where are the files physically stored?

The API should expose one query model. The fact that some results are TDMS, some are HDF5, and some are derived DASCore products should be an implementation detail handled at the final read step. If mixed formats or product types have to be split into separate spools, the API layer has to rediscover and reimplement cross-spool logic for every query.

Index size

The redundancy is real, but categorical types, which many databases (but not PyTables) support, can help a lot here.

For example, consider an extreme case with 10,000 channels sampled at 10 kHz for a year. At 4 bytes per sample uncompressed, this is about 13 PB. Now, putting aside the likely need for a more serious, non-file-based solution, assume each file is 1 GB. That comes out to about 13,000,000 files. If attributes are, on average, 8 bytes each for floats, ints, or small strings, each attribute takes only about 100 MB to store before dataframe/index overhead. Even if we stored 100 of these, the raw attribute values would total around 10 GB. Not bad at all. With compression and categorical encoding on top, the storage cost becomes minuscule for managing a colossal archive.
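The back-of-the-envelope numbers above can be checked with a few lines of arithmetic. This sketch assumes a 365-day year and decimal units (1 GB = 10^9 bytes); the variable names are only illustrative.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600  # 31,536,000 s

channels = 10_000
sample_rate_hz = 10_000
bytes_per_sample = 4

# Raw archive size for one year of recording.
total_bytes = channels * sample_rate_hz * bytes_per_sample * SECONDS_PER_YEAR
total_pb = total_bytes / 1e15

# One index row per 1 GB file.
n_files = total_bytes // 10**9

# Raw storage for one 8-byte attribute across all files, and for 100 of them.
bytes_per_attr_column = n_files * 8
bytes_for_100_attrs = bytes_per_attr_column * 100

print(f"{total_pb:.1f} PB, {n_files:,} files")
print(f"{bytes_per_attr_column / 1e6:.0f} MB per attr, "
      f"{bytes_for_100_attrs / 1e9:.1f} GB for 100 attrs")
```

This gives roughly 12.6 PB, about 12.6 million files, about 101 MB per attribute, and about 10.1 GB for 100 attributes, consistent with the rounded figures in the text.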

HDF5 versus an index

> HDF5 is a fine and flexible format. And can be opened outside DASCore easily, increasing the value of a "spool" beyond its intended purpose.

This is true for many database formats. Actually, right now, the directory-spool index is an HDF5 file written through pandas HDFStore/PyTables, with the main index stored as a PyTables table. That means it is not opaque: generic HDF5 tools can inspect the file and recover the underlying arrays. However, reconstructing the exact dataframe and query behavior requires understanding the pandas/PyTables conventions layered on top of HDF5, so it is much easier through pandas/PyTables than through a generic HDF5 reader. More importantly, openness/readability is not unique to HDF5, and it is not the only requirement here. If query/index behavior matters, a database-style index may be more appropriate than relying on file-level HDF5 metadata, while the underlying patch data can still live in ordinary readable files.

andreas-wuestefeld · 2026-06-07T13:03:25Z

andreas-wuestefeld
Jun 7, 2026
Collaborator

Thank you for the detailed insights. This all makes sense, especially if the ambition is that data centers may use this as their system. And I admittedly hadn't considered that derived products could be part of a spool (and I like the option).

Having a system that takes care of it all is great, and I like the "build it and they will come" approach.

Regarding file size, 10 GB is actually large, especially if things are on a network. One recent spool of mine has 35 MB of HDF5 from 4 months of recording, and that takes noticeable time to load from the network (and about 5 s when I have it locally on SSD). I suspect a database will allow more fine-grained requests that avoid loading everything.

Complexity raises the barrier to entry. We have to make sure that the documentation is top-notch and that helper functions are available for the most commonly executed tasks!
