Is your feature request related to a problem?
Yes. Currently, when you load a datatree or dataset, all the coordinates are loaded eagerly and their indices are built. I believe the loading is also synchronous, so you end up paying a performance penalty for things you might not even want.
At first I thought I could use create_default_indices=False to fix this, because indices are created automatically when needed. But setting that flag changes the semantics of index loading for data trees. I originally wrote this up as a bug, but it appears to be deliberate behaviour.
Describe the solution you'd like
I would like an option that has the same semantics as create_default_indices=True, but defers creation until necessary. Perhaps create_default_indices="lazy".
A very simple example use case would be loading a datatree just so you can inspect the metadata (e.g. shapes). Another would be loading a large datatree but only actually using a subtree of it.
Describe alternatives you've considered
- Just live with the perf hit and wasted bandwidth.
- Use create_default_indices=False, and be very careful about the issues this can cause.
Additional context
I've prototyped a solution for this already.
My solution involves creating LazyDefaultIndex, which inherits from Index. This object loads the actual value, a PandasIndex, via create_default_index_implicit once the index is required for operations. On datatree creation, this object is filled in for any indices where a PandasIndex would normally go. Because this is an actual object, it has similar resolution
Similarly, I define LazyIndexingAdapter, which inherits from PandasIndexingAdapter.
Both these lazy objects hold the actual index in a mutable box, so that the actual index is only loaded once, and re-used, regardless of copy operations. Indices are immutable, so this is safe.
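To make the "mutable box" idea concrete, here is a minimal sketch in plain Python. It is not the prototype code and does not use xarray's Index API; the names Box, LazyIndexSketch and build_index are hypothetical, and a tuple stands in for the real PandasIndex.

```python
from typing import Callable, Generic, Optional, TypeVar

T = TypeVar("T")


class Box(Generic[T]):
    """Mutable cell that builds its value at most once (hypothetical helper)."""

    def __init__(self, factory: Callable[[], T]) -> None:
        self._factory: Optional[Callable[[], T]] = factory
        self._value: Optional[T] = None
        self.loaded = False

    def get(self) -> T:
        if not self.loaded:
            self._value = self._factory()
            self._factory = None  # drop the closure once it has run
            self.loaded = True
        return self._value


class LazyIndexSketch:
    """Stand-in for the lazy index; assumption: the real one subclasses Index."""

    def __init__(self, box: Box) -> None:
        self._box = box

    def copy(self) -> "LazyIndexSketch":
        # Copies share the box, so materializing one materializes all of them.
        return LazyIndexSketch(self._box)

    def materialize(self):
        return self._box.get()


calls = []


def build_index():
    calls.append(1)  # record each (expensive) build
    return ("a", "b", "c")  # placeholder for a real PandasIndex


original = LazyIndexSketch(Box(build_index))
duplicate = original.copy()
duplicate.materialize()
original.materialize()  # reuses the value built through the copy
```

Sharing the box is only safe because the materialized index is never mutated afterwards, which is the immutability point made above.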
Some methods don't always materialize the index:
- from_variables / create_variables / copy (returns another lazy object sharing same data)
- rename (can return a lazy object, but doesn't share the data)
- dim / repr (don't use the index directly)
- equals (if it can determine they share the same mutable box)
- isel / roll (depending on arguments)
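As a rough illustration of two entries in the list above, the sketch below shows how equals can short-circuit on a shared box and how isel can stay lazy for a full slice. It is plain Python under assumed names (LazyIndexSketch, is_materialized), with a list standing in for the PandasIndex; the prototype's actual conditions may differ.

```python
class LazyIndexSketch:
    """Hypothetical stand-in for the lazy index; not xarray's Index API."""

    def __init__(self, loader, box=None):
        self._loader = loader
        # The box is a list: empty means "not loaded yet".
        self._box = box if box is not None else []

    @property
    def is_materialized(self):
        return bool(self._box)

    def materialize(self):
        if not self._box:
            self._box.append(self._loader())
        return self._box[0]

    def copy(self):
        # Share the same box so the load happens at most once.
        return LazyIndexSketch(self._loader, self._box)

    def equals(self, other):
        # Same box implies same underlying index: no load needed.
        if isinstance(other, LazyIndexSketch) and other._box is self._box:
            return True
        return list(self.materialize()) == list(other.materialize())

    def isel(self, indexer):
        # A full slice selects everything, so the result can stay lazy.
        if isinstance(indexer, slice) and indexer == slice(None):
            return self.copy()
        # Anything else needs the real values.
        return self.materialize()[indexer]


a = LazyIndexSketch(lambda: [10, 20, 30])
b = a.copy()
same = a.equals(b)              # shared box: answered without loading
whole = a.isel(slice(None))     # full slice: still lazy
loaded_before = a.is_materialized
part = a.isel(slice(0, 2))      # real selection: forces the load
```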
This works well enough in my code base. It's about 400 lines of code, but it depends on xarray internals:
- It has to duck type PandasIndexingAdapter quite closely.
- It has to inherit from PandasIndexingAdapter to pass a few isinstance checks.
- A few Self typed methods have to accept Self | PandasIndex, and return PandasIndex, subtly changing some contracts.
- When pretty printing, it does get the same * marker that normal indices do, even after materialization.
If I were to incorporate this into xarray, I'd probably make a base class AbstractPandasIndex, from which LazyPandasIndex and PandasIndex inherit and through which they interoperate.
I'm happy to share my code / incorporate it into xarray if this seems like a good direction, but xarray's semantics are a bit tricky, so I thought it best to discuss first.