Description
The NWB schema defines default_value: 1.0 for conversion. When a reader finds TimeSeries.data.conversion = 1.0 in an NWB file, two distinct scenarios are indistinguishable: (1) the user measured or verified that the data requires no conversion (it is already in SI units), or (2) the user never determined the conversion factor and the system applied the default. This is especially problematic for TimeSeries subtypes that fix the unit, such as ElectricalSeries, which fixes the unit to volts, and CurrentClampSeries and VoltageClampSeries, which fix their units to volts and amperes respectively. In these types, correct interpretation of the data requires the user to set conversion and offset so that readers can apply the formula real_value_in_units = data * conversion + offset to obtain values in the declared unit. The data is usually stored as raw ADC counts, and if users forget to set conversion and offset, docval silently fills in conversion=1.0, making the file claim that raw counts are volts. The same ambiguity applies to offset (default_value: 0.0).
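To make the failure mode concrete, here is a minimal sketch of the formula applied to raw counts. The helper name to_declared_unit, the sample values, and the gain of 1.95e-7 volts per count are all assumptions chosen for illustration; none of them come from the NWB API or from a real recording.

```python
# Hypothetical helper mirroring the formula quoted above; not part of any NWB API.
def to_declared_unit(raw_counts, conversion=1.0, offset=0.0):
    """Apply real_value_in_units = data * conversion + offset to each sample."""
    return [sample * conversion + offset for sample in raw_counts]


raw_adc_counts = [-512, 0, 1024]  # illustrative int16-style ADC samples

# Assumed gain, for illustration only: 1.95e-7 volts per count.
calibrated_volts = to_declared_unit(raw_adc_counts, conversion=1.95e-7)

# Conversion left at the default: the raw counts come back unchanged,
# so a reader would interpret 1024 counts as 1024 volts.
defaulted_volts = to_declared_unit(raw_adc_counts)
```

Nothing in the defaulted result tells a reader whether 1.0 was verified or simply never set, which is the ambiguity at issue.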
Heuristics in the inspector could help: data.dtype being float32 rather than int16 suggests the data may already be calibrated, or if unit is set to something other than "unknown", the user likely considered the conversion. However, these remain heuristics, and the fact is that we are failing to provide proper provenance for this value.
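As a rough sketch of what such a heuristic might look like, the function below combines the two signals mentioned above. The function name, its parameters, and the unit_fixed_by_type flag are hypothetical; this is not an existing inspector check, and the flag is my own assumption to cover types where the unit is fixed and therefore says nothing about user intent.

```python
def conversion_may_be_unset(dtype_name, unit, conversion, offset,
                            unit_fixed_by_type=False):
    """Guess whether conversion/offset were left at their defaults by accident.

    Hypothetical heuristic for illustration; it can only raise suspicion,
    it cannot recover the missing provenance.
    """
    if conversion != 1.0 or offset != 0.0:
        return False  # non-default values were necessarily supplied by someone

    # Floating-point data suggests the values may already be calibrated.
    looks_calibrated = dtype_name.startswith("float")

    # A user-chosen unit suggests the conversion was considered, but a unit
    # fixed by the type (e.g. volts on ElectricalSeries) carries no such signal.
    unit_was_considered = (not unit_fixed_by_type) and unit != "unknown"

    return not (looks_calibrated or unit_was_considered)
```

For example, int16 data with default scaling in a fixed-unit type would be flagged, while float32 data with the same defaults would not.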
I think the best solution is to only write information that the user explicitly provided. Omit the attribute entirely if the user did not specify it (conversion=None), leaving it unset in the HDF5 file. This way None means the value was not set and 1.0 means it was chosen explicitly. When units are set, None also carries unambiguous meaning: no conversion is required and the data is already in the units that unit declares. The cost is backwards compatibility, as existing readers that assume conversion is always present would need updating.
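A minimal sketch of this proposal, using a plain dict as a stand-in for the HDF5 attributes of the data dataset. The function names write_data_attrs and read_conversion are hypothetical and do not correspond to existing pynwb or HDMF functions.

```python
def write_data_attrs(unit, conversion=None, offset=None):
    """Build the attribute mapping, writing only what the user provided."""
    attrs = {"unit": unit}
    if conversion is not None:
        attrs["conversion"] = conversion
    if offset is not None:
        attrs["offset"] = offset
    return attrs


def read_conversion(attrs):
    """Return (value, explicitly_set) for the conversion attribute."""
    if "conversion" in attrs:
        return attrs["conversion"], True
    return None, False  # attribute absent: the user never specified it


unspecified = write_data_attrs(unit="volts")
verified = write_data_attrs(unit="volts", conversion=1.0)
```

Under this scheme read_conversion(unspecified) reports that nothing was set, while read_conversion(verified) reports an explicit 1.0, so the two scenarios no longer collapse into one. A reader that indexes attrs["conversion"] directly would fail on the first file, which is the backwards-compatibility cost.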
There are two other alternatives. Making conversion and offset required at construction time would resolve the ambiguity, but it would force users to supply values even when the data is already in natural units and, worse, to make up values when the information is genuinely unavailable. Alternatively, a companion schema attribute (e.g., is_conversion_user_set: bool) set automatically by the API would preserve backwards compatibility, but it seems logically wrong for the schema to track metadata about other elements of the schema rather than describe scientific data. I honestly like this option less.
On a wider scope, two additional considerations are worth noting regarding the use of default_value. First, not all uses of default_value in the schema carry the same risk. There are three categories:
- Provenance-destroying numeric defaults: the core problem described above. These are conversion (1.0) and offset (0.0) on TimeSeries, as well as ImagingPlane.conversion (1.0). These are valid numeric values indistinguishable from user-provided ones. resolution (-1.0) is the exception: -1.0 is not a natural value a user would set, so it effectively acts as a sentinel.
- Domain-specific units that should be fixed values: SpatialSeries.data.unit (meters), ImagingPlane.origin_coords.unit (meters), ImagingPlane.grid_spacing.unit (meters), and Subject.age_reference (birth). These define the coordinate system or domain convention for the type and are not meant to be set by the user. Could the schema use something like value: (fixed) rather than default_value: for these? They seem different in an important sense.
- Sentinel strings: TimeSeries description defaults to "no description", comments defaults to "no comments", AbstractFeatureSeries.data.unit defaults to "see 'feature_units'", and ImageSeries.format defaults to "raw". These are less problematic because the string values are clearly sentinel-like and carry no ambiguity about whether real information was provided.
Second, will the upcoming LinkML migration alleviate this problem? Currently, when a user creates a TimeSeries without specifying conversion, the docval decorator on TimeSeries fills in the default value 1.0 at construction time. This value is then indistinguishable from a user-provided value; there is no API to determine whether conversion was explicitly set or filled by docval. LinkML's ifabsent has clearer semantics, but the documentation does not specify whether there is an API to determine if a value was explicitly provided by the user or filled in by ifabsent.
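For reference, a common Python pattern for keeping "explicitly passed" apart from "filled by default" is a private sentinel object. The sketch below is a generic illustration with a hypothetical class; it is not how docval works today and it says nothing about what LinkML's ifabsent exposes.

```python
_UNSET = object()  # unique sentinel: distinguishable from every real value


class SeriesSketch:
    """Hypothetical stand-in for a TimeSeries-like constructor (not pynwb code)."""

    def __init__(self, conversion=_UNSET):
        # Record provenance before the default replaces the sentinel.
        self.conversion_was_set = conversion is not _UNSET
        self.conversion = 1.0 if conversion is _UNSET else conversion


defaulted = SeriesSketch()                # conversion filled in by the default
explicit = SeriesSketch(conversion=1.0)   # same value, chosen by the user
```

Both objects hold conversion == 1.0, but only explicit has conversion_was_set equal to True. Whether that flag could or should survive into the written file is a separate question, and it is the companion-attribute trade-off discussed earlier.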
Related issues: #639 ("Question: Functional difference between default_value for datasets and attributes?") and #641 ("Suggestion: Use default_value: null for unused fields in IndexSeries").