Design spec for duckdb general index for dascore 0.2 #648
Replies: 5 comments
-
What problem does this database solve? Is it just a spool on steroids?
-
Exactly. The main problem with the (directory) spool right now is that it is essentially just a single table with fixed column names corresponding to a few attributes and a time coordinate. That is good for basic operations on time-indexed files but not so good if you want to store (and index) patches with arbitrary coordinate and attribute names and dtypes. This design is intended to handle the latter case.
-
In principle I like having all the metadata available. But this quickly becomes very big for acquisitions running for half a year or so. For me a spool should answer the questions
Everything else is nice to have, but probably only ever used by a very select group of developers. Currently a lot of the information is redundantly saved for each file, notably the file format. This could be stored as an HDF5 attribute rather than a column. I would assume (hope??) that people have their data organised in a meaningful way such that only homogeneous data can be "spooled". HDF5 is a fine and flexible format and can be opened outside DASCore easily, increasing the value of a "spool" beyond its intended purpose. But that is just my personal opinion, and I don't have an overview of what great things could be done with a database spool.
-
Yes, I share your sentiments, and agree the spool should be able to answer those questions. Let me clarify the motivation of this (work-in-progress) spec.

Motivation

I want the spool to become a generic patch manager, not just an archival tool for "normal" DAS data. In practice, this means the spool becomes a generic [...]

My assertion: Since generic patch indexing is useful, the complexity should live in the spool rather than being reimplemented in every downstream project, as is required now.

Example use cases

Here are a few examples:
Consider a process that creates correlograms from ambient noise recorded by a DAS system. These don't have the typical "distance"/"time" coordinates, but rather (distance, lag_time), with the absolute time stored in the attributes. It gets more complicated if the patches have higher dimensions. For example, if the correlograms have moving time/distance windows, they now have four coordinates (relative_distance, lag_time, absolute_distance, absolute_time). The current spool is not helpful here, but the expanded spool would enable efficient querying and management of these patches. Of course there are many more such derived products (dispersion curves, PSD plots, state-of-health summaries, etc.) that all represent data products above the archive layer.
Active-source records can be indexed by time from the shot and store an attribute like "shot number". If we don't have access to generic attributes, we don't have a clean way to know how many shots there are and which files belong to each shot without imposing some semantics on the directory structure or using side information that doesn't integrate with the spool. This leads to a sub-optimal developer experience.
Visualizing large archives is a challenging problem, so creating decimated products (like FBE) at multiple resolutions is really the only feasible path forward. It would be nice if the spool could manage these multi-tiered datasets, helping a GUI select which level/data product to use with simple queries. For example, a viewer might want 1 Hz summary data for an entire day, 100 Hz data for a one-hour window, and full-resolution data for a ten-minute window around an event. That query should be expressible against one indexed archive rather than through a pile of product-specific lookup code.

To manage these scenarios today, downstream applications need a lot of custom code, much of which duplicates the functionality of the spool engine anyway. Why not take on the complexity in exactly one place, so that other codes can simply dump any kind of patch to a directory and get reasonable querying/management from DASCore? Facilitating "rapid application development" is DASCore's reason for existing anyway.

The tradeoff is real: a generic index is more complicated than an index that assumes every file has the same coordinates, attributes, and format. But that complexity has to exist somewhere. Putting it in the spool gives us one tested implementation, one query model, and one place to optimize. Pushing it into applications means every project that handles derived products, active-source data, multi-resolution data, or mixed archives has to invent its own mini-spool.

Who would use this?
I think this is only true if you consider the [...] Moreover, I think that as open-source DFOS processing evolves, the utility of a more capable spool will become increasingly apparent.

Mixed-format archives
I don't think we want to require homogeneous spools for the sake of saving a few bytes in the index. For example, if you had several different interrogator types outputting different file formats while monitoring different arms at NORFOX, should each arm be managed as a separate spool? Again, the problem here is that it pushes complexity away from a well-tested implementation and into every code that accesses that dataset, multiplying lines of code and potential for bugs.

For example, suppose you have a directory for each arm of NORFOX (purely hypothetical). If you want to extract a teleseismic event, the homogeneous-spool model forces you to iterate through each arm, create a spool, and then select:

    from pathlib import Path

    import dascore as dc

    event_time = ("2024-01-01T00:03:30", "2024-01-01T00:09:00")
    root = Path("/data/norfox")

    # Build and query one spool per arm directory.
    event_spools = []
    for arm_path in sorted(root.glob("arm_*")):
        arm_spool = dc.spool(arm_path).update()
        event_spools.append(arm_spool.select(time=event_time))

    # process_event_patch stands for whatever the application does per patch.
    for spool in event_spools:
        for patch in spool:
            process_event_patch(patch)

But if we just have one spool:

    import dascore as dc

    event_time = ("2024-01-01T00:03:30", "2024-01-01T00:09:00")

    # One index covers every arm, so a single select is enough.
    spool = dc.spool("/data/norfox").update()
    event_spool = spool.select(time=event_time)
    for patch in event_spool:
        process_event_patch(patch)

Large archive APIs

This becomes even more important for large archives. Imagine a spool was behind an API for accessing DAS experiments, e.g. at EarthScope. In that case, the spool index is not just a convenience for a local analyst; it is the catalog that lets the service answer basic questions quickly and consistently: what data exist for this time range, which network/station/channel/arm produced them, what file format are they in, what sampling rate and units should a client expect, which derived products already exist, and where are the files physically stored?

The API should expose one query model. The fact that some results are TDMS, some are HDF5, and some are derived DASCore products should be an implementation detail handled at the final read step. If mixed formats or product types have to be split into separate spools, the API layer has to rediscover and reimplement cross-spool logic for every query.

Index size

The redundancy is real, but categorical types, which many databases (but not PyTables) support, can help a lot here. For example, consider an extreme case with 10,000 channels monitoring at 10 kHz for a year. At 4 bytes per sample uncompressed, this is about 13 PB. Now, putting aside the likely need for a more serious, non-file-based solution, assume each file is 1 GB. That comes out to about 13,000,000 files. If attributes are, on average, 8 bytes each for floats, ints, or small strings, each attribute takes only about 100 MB to store before dataframe/index overhead. Even if we stored 100 of these, the raw attribute values would be around 10 GB. Not bad at all. Once compression and categoricals are taken into account, the storage cost becomes minuscule for managing a colossal archive.

HDF5 versus an index

This is true for many database formats. Actually, right now, the directory-spool index is an HDF5 file written through pandas.
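As a quick check of the index-size estimate given under "Index size", the arithmetic can be reproduced in a few lines. This is only a back-of-the-envelope sketch: every input is one of the assumptions stated in that paragraph (10,000 channels, 10 kHz, 4 bytes per sample, 1 GB files, 8 bytes per attribute value, 100 attributes), and it ignores compression, categoricals, and any dataframe or index overhead.

```python
# Back-of-the-envelope check of the index-size estimate.
# All inputs are the assumptions stated in the comment, not measured values.
SECONDS_PER_YEAR = 365 * 24 * 3600

channels = 10_000
sample_rate_hz = 10_000
bytes_per_sample = 4
file_size_bytes = 1e9        # assumed 1 GB per file
bytes_per_attr_value = 8     # average size of one stored attribute value
n_attrs = 100                # number of indexed attributes per file

archive_bytes = channels * sample_rate_hz * bytes_per_sample * SECONDS_PER_YEAR
n_files = archive_bytes / file_size_bytes
one_attr_bytes = n_files * bytes_per_attr_value
all_attrs_bytes = one_attr_bytes * n_attrs

print(f"archive size:       {archive_bytes / 1e15:.1f} PB")    # 12.6 PB
print(f"number of files:    {n_files / 1e6:.1f} million")      # 12.6 million
print(f"one attribute:      {one_attr_bytes / 1e6:.0f} MB")    # 101 MB
print(f"{n_attrs} attributes:     {all_attrs_bytes / 1e9:.1f} GB")  # 10.1 GB
```

The rounded figures in the text (13 PB, 13,000,000 files, 100 MB, 10 GB) agree with these values to within rounding.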
-
Thank you for the detailed insights. This all makes sense, especially if the ambition is that datacenters may use this as their system. I admittedly didn't consider that derived products could be part of a spool (and I like the option). Having a system that takes care of it all is great, and I like the "build it and they will come" approach.

Regarding file size, 10 GB is actually large, especially if things are on a network. One of my recent spools has 35 MB of HDF5 from 4 months of recording, and that takes noticeable time to load from the network (and about 5 s when I have it locally on SSD). I suspect a database will allow more fine-grained requests that don't load everything.

Complexity increases the barrier to entry. We have to make sure that the documentation is top-notch and that helper functions are available for the most commonly executed tasks!
-
DuckDB Spool Indexer
Schema
meta_data
Index-level metadata and compatibility information.

- what_is_this: [...] dascore_duckdb_index.
- index_version
- dascore_version
- last_indexed_ns

sources
One row per indexed source. A source is one FiberIO scan unit. It will usually be a file, but may also be a directory or another non-overlapping FiberIO-backed entity.

- source_id
- base_uri
- source_path
- source_format
- format_version
- mtime_ns
- last_indexed_ns

Notes:
- [...] source_path, interpreted relative to base_uri when base_uri is present.
- update() in v1 is only defined for source types that can provide a meaningful source-level mtime_ns.

patches
One row per patch summary emitted by a source.

- patch_id
- source_id: [...] sources.
- source_patch_id
- n_dims
- sample_count_total
- dims
- shape
- station
- network
- channel: [...] PatchAttrs.
- tag
- data_type
- data_category
- time_min: [...] time coord.
- time_max: [...] time coord.
- time_step: [...] time coord when available.
- distance_min: [...] distance coord.
- distance_max: [...] distance coord.
- distance_step: [...] distance coord when available.

Constraint:
- [...] (source_id, source_patch_id).

Notes:
- [...] (source_id, source_patch_id).
- [...] patches; they are excluded from attr_index.

attr_index
Typed key/value index for non-promoted patch attrs.

- patch_id: [...] patches.
- attr_name
- value_kind
- units
- value_str
- value_int
- value_num
- value_bool
- value_time_ns
- value_duration_ns

Constraint:
- [...] (patch_id, attr_name).

Notes:
- attr_index stores only non-promoted, non-private, non-history attrs.
- str -> value_kind = str, value_str
- bool -> value_kind = bool, value_bool
- int -> value_kind = int, value_int
- float -> value_kind = float, value_num
- [...] value_kind = time, value_time_ns
- [...] value_kind = duration, value_duration_ns
- [...] value_kind = unit, value_str and units
- [...] value_kind = quantity, value_num and units

coord_index
Dispatch table for coord summaries.

- coord_entry_id
- patch_id: [...] patches.
- coord_name: [...] time, distance, lag_time.
- coord_dtype
- coord_dims
- coord_len: [...] CoordSummary.
- coord_hash
- units
- payload_table: [...] coord_time, coord_numeric, or coord_str.
- payload_id

Constraint:
- [...] (patch_id, coord_name, payload_table).

Notes:
- coord_hash is optional and not required for v1 query behavior.
- payload_table is a physical dispatch field, not an extra semantic type system.

coord_time
Summary-only payload table for time-like coords.

- id: [...] coord_index.payload_id.
- min
- max
- step
- count
- is_monotonic
- is_relative: false for absolute epoch-based time, true for duration-like relative time.

coord_numeric
Summary-only payload table for numeric coords.

- id: [...] coord_index.payload_id.
- min
- max
- step
- count
- is_monotonic

coord_str
Summary-only payload table for string coords.

- id: [...] coord_index.payload_id.
- min
- max
- count

Relations
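The relations between the tables described in the schema can be sketched in SQL. The block below is only an illustration under several assumptions: it uses Python's built-in sqlite3 as a stand-in so that it runs without extra dependencies (the spec targets DuckDB), it keeps only a subset of the columns, the column types are guesses, the listed constraints are interpreted as uniqueness constraints, and the paths, ids, and the shot_number attribute are hypothetical sample data.

```python
import sqlite3

# Stand-in engine: sqlite3 keeps the sketch self-contained; the spec targets DuckDB.
con = sqlite3.connect(":memory:")
con.executescript(
    """
    CREATE TABLE sources (
        source_id     INTEGER PRIMARY KEY,
        source_path   TEXT NOT NULL,
        source_format TEXT
    );
    CREATE TABLE patches (
        patch_id        INTEGER PRIMARY KEY,
        source_id       INTEGER NOT NULL REFERENCES sources (source_id),
        source_patch_id INTEGER NOT NULL,
        UNIQUE (source_id, source_patch_id)
    );
    CREATE TABLE attr_index (
        patch_id   INTEGER NOT NULL REFERENCES patches (patch_id),
        attr_name  TEXT NOT NULL,
        value_kind TEXT NOT NULL,
        value_str  TEXT,
        value_int  INTEGER,
        UNIQUE (patch_id, attr_name)
    );
    CREATE TABLE coord_index (
        coord_entry_id INTEGER PRIMARY KEY,
        patch_id       INTEGER NOT NULL REFERENCES patches (patch_id),
        coord_name     TEXT NOT NULL,
        payload_table  TEXT NOT NULL,
        payload_id     INTEGER NOT NULL,
        UNIQUE (patch_id, coord_name, payload_table)
    );
    CREATE TABLE coord_numeric (
        id      INTEGER PRIMARY KEY,
        "min"   REAL,
        "max"   REAL,
        step    REAL,
        "count" INTEGER
    );
    """
)

# Hypothetical sample rows: two shot files, one patch each.
con.executemany(
    "INSERT INTO sources VALUES (?, ?, ?)",
    [(1, "shots/shot_0003.tdms", "TDMS"), (2, "shots/shot_0004.tdms", "TDMS")],
)
con.executemany("INSERT INTO patches VALUES (?, ?, ?)", [(10, 1, 0), (11, 2, 0)])
con.executemany(
    "INSERT INTO attr_index VALUES (?, ?, ?, ?, ?)",
    [(10, "shot_number", "int", None, 3), (11, "shot_number", "int", None, 4)],
)
con.execute("INSERT INTO coord_index VALUES (1, 10, 'lag_time', 'coord_numeric', 100)")
con.execute("INSERT INTO coord_numeric VALUES (100, -5.0, 5.0, 0.01, 1001)")

# attr_index -> patches -> sources: which file holds shot number 3?
rows = con.execute(
    """
    SELECT s.source_path, p.patch_id
    FROM attr_index AS a
    JOIN patches AS p ON p.patch_id = a.patch_id
    JOIN sources AS s ON s.source_id = p.source_id
    WHERE a.attr_name = 'shot_number'
      AND a.value_kind = 'int'
      AND a.value_int = 3
    """
).fetchall()

# coord_index -> payload table: read the dispatch field, then the summary row.
payload_table, payload_id = con.execute(
    "SELECT payload_table, payload_id FROM coord_index "
    "WHERE patch_id = ? AND coord_name = ?",
    (10, "lag_time"),
).fetchone()
if payload_table not in {"coord_time", "coord_numeric", "coord_str"}:
    raise ValueError(f"unexpected payload table: {payload_table}")
summary = con.execute(
    f'SELECT "min", "max" FROM {payload_table} WHERE id = ?', (payload_id,)
).fetchone()

print(rows)     # [('shots/shot_0003.tdms', 10)]
print(summary)  # (-5.0, 5.0)
```

The second query shows why payload_table can stay a purely physical dispatch field: the caller reads it from coord_index, checks it against the three known payload tables, and only then queries the resolved table.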
Notes

- dc.scan(..., full_coords=True) is the cleaner future extension point for exact coord extraction.
- [...] patches columns where possible.
- [...] attr_index.
- [...] coord_index plus the resolved payload table.
- str selector -> Unix glob semantics
- re.Pattern selector -> regex semantics
- DirectorySpool.update() should stay cheap and should not perform full archive reconciliation.
- reconcile() is a distinct operation:
  - [...] sources table
  - [...] mtime_ns changed
  - [...] source_id.
- update(), reconcile(), and rebuild() assume exclusive access to the index.
- [...] source_path, resolve against current spool root, do not rely on persisted local base_uri
- [...] base_uri, resolve source_path relative to it
- [...] source_path, base_uri = null

TODO

- [...] channel to the standard PatchAttrs defaults and treat it as a first-class attr, like station, network, and tag.
- [...] channel participates consistently in scan summaries, querying, and promoted patches columns once added to PatchAttrs.
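The selector rule in the notes (a str selector uses Unix glob semantics, an re.Pattern selector uses regex semantics) can be sketched as a small dispatch on the selector's type. This is a minimal illustration in plain Python, not the spool's implementation: the function name matches is hypothetical, and two details are assumptions of this sketch, namely case-sensitive glob matching and the use of search rather than a full match for regex selectors.

```python
import fnmatch
import re


def matches(selector, value: str) -> bool:
    """Return True if value satisfies the selector (hypothetical helper)."""
    if isinstance(selector, re.Pattern):
        # Regex semantics; using search() (not fullmatch) is an assumption here.
        return selector.search(value) is not None
    if isinstance(selector, str):
        # Unix glob semantics; fnmatchcase keeps the comparison case-sensitive.
        return fnmatch.fnmatchcase(value, selector)
    raise TypeError(f"unsupported selector type: {type(selector).__name__}")


print(matches("arm_*", "arm_north"))                     # True
print(matches("arm_?", "arm_12"))                        # False
print(matches(re.compile(r"^arm_\d+$"), "arm_12"))       # True
print(matches(re.compile(r"^arm_\d+$"), "arm_north"))    # False
```

Dispatching on the Python type keeps the query API unambiguous: a plain string is never silently interpreted as a regular expression.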
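The attr_index notes map each attr value to a value_kind and one typed value column. A rough sketch of that dispatch for plain Python scalars is shown below. The function name classify_attr is hypothetical, and the source types used for the time and duration kinds (datetime and timedelta) are assumptions, since the spec text naming them did not survive; the unit and quantity kinds are left out for the same reason.

```python
from datetime import datetime, timedelta


def classify_attr(value):
    """Return (value_kind, value_column) for a scalar attr (hypothetical helper)."""
    # bool is tested before int because bool is a subclass of int in Python.
    if isinstance(value, bool):
        return "bool", "value_bool"
    if isinstance(value, int):
        return "int", "value_int"
    if isinstance(value, float):
        return "float", "value_num"
    if isinstance(value, str):
        return "str", "value_str"
    # Assumed source types for the time-like kinds.
    if isinstance(value, datetime):
        return "time", "value_time_ns"
    if isinstance(value, timedelta):
        return "duration", "value_duration_ns"
    raise TypeError(f"no value_kind for {type(value).__name__}")


print(classify_attr(True))                 # ('bool', 'value_bool')
print(classify_attr(3))                    # ('int', 'value_int')
print(classify_attr(0.5))                  # ('float', 'value_num')
print(classify_attr("shot"))               # ('str', 'value_str')
print(classify_attr(timedelta(seconds=1))) # ('duration', 'value_duration_ns')
```

Storing each kind in its own typed column lets range queries on numeric or time attrs use native comparisons instead of string parsing.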