fix: allow reading nullable uint arrays in read_{,elem_}lazy#2287
Conversation
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #2287 +/- ##
==========================================
- Coverage 86.74% 84.58% -2.17%
==========================================
Files 46 46
Lines 7204 7218 +14
==========================================
- Hits 6249 6105 -144
- Misses 955 1113 +158
|
|
Is this a public API? https://pandas.pydata.org/docs/search.html?q=BaseMaskedDtype |
|
Hm, no: https://pandas.pydata.org/docs/reference/index.html
I’ll fall back to the previous approach then. |
ilan-gold
left a comment
There was a problem hiding this comment.
Why are we changing the implementation of dtype? I am not sure that issue you made in pandas will ever be fixed. Like I mention in my comment, what we had wasn't random "string manipulation" but the actual on-disk file format names. It was super clear, if not the cleanest thing on earth (could use a match or something maybe), what in-memory array was created from what on-disk array. I do like the construct_array_type usage though!
That code is not what I referred to! I’m talking about this fragile construct: anndata/src/anndata/experimental/backed/_lazy_arrays.py Lines 194 to 197 in 088fa16 What you mean is that the previous code dispatched on I believe that what I do is perfectly safe and robust since the anndata version with this code only supports My changes just make sure that we can confidently say we support |
|
Nope, it doesn’t: import pandas as pd, numpy as np
nullable_dtype = lambda dtype: pd.array([], dtype=str(pd.api.types.pandas_dtype(dtype)).capitalize()).dtype
nullable_dtype(np.dtype(np.uint64))TypeError Traceback (most recent call last)
Cell In[3], line 1
----> 1 nullable_dtype(np.dtype(np.uint64))
Cell In[2], line 1, in <lambda>(dtype)
----> 1 nullable_dtype = lambda dtype: pd.array([], dtype=str(pd.api.types.pandas_dtype(dtype)).capitalize()).dtype
File /usr/lib/python3.13/site-packages/pandas/core/construction.py:311, in array(data, dtype, copy)
309 # this returns None for not-found dtypes.
310 if dtype is not None:
--> 311 dtype = pandas_dtype(dtype)
313 if isinstance(data, ExtensionArray) and (dtype is None or data.dtype == dtype):
314 # e.g. TimedeltaArray[s], avoid casting to NumpyExtensionArray
315 if copy:
File /usr/lib/python3.13/site-packages/pandas/core/dtypes/common.py:1663, in pandas_dtype(dtype)
1656 with warnings.catch_warnings():
1657 # TODO: warnings.catch_warnings can be removed when numpy>2.3.0
1658 # is the minimum version
1659 # GH#51523 - Series.astype(np.integer) doesn't show
1660 # numpy deprecation warning of np.integer
1661 # Hence enabling DeprecationWarning
1662 warnings.simplefilter("always", DeprecationWarning)
-> 1663 npdtype = np.dtype(dtype)
1664 except SyntaxError as err:
1665 # np.dtype uses `eval` which can raise SyntaxError
1666 raise TypeError(f"data type '{dtype}' not understood") from err
TypeError: data type 'Uint64' not understood |
ilan-gold
left a comment
There was a problem hiding this comment.
Nice great change then!
|
OK, I’ll change this into a import anndata as ad, numpy as np, pandas as pd
ad.AnnData(np.zeros((4, 6)), dict(nint=pd.array([1, 0, None, 1], dtype=pd.UInt32Dtype()))).write_zarr("/tmp/test.ad.zarr")
ad.experimental.read_lazy("/tmp/test.ad.zarr")TypeError: data type 'Uint32' not understood
Error raised while reading key 'obs' of <class 'zarr.core.group.Group'> from / |
read_{,elem_}lazy
read_{,elem_}lazyread_{,elem_}lazy
|
We should add a test. Ideally it wouldn't even be a new test but some setting or something. I'm looking into how this ever even slipped through, because in theory it should be tested. |
|
Thanks! |
|
probably slipped through because the nullable-integer branch in our test utils was a bit inflexible, but I fixed that. |
Use
pandas.core.dtypes.dtypes.BaseMaskedDtype.from_numpy_dtype(with fallback to array constructors) instead of string manipulation to getMaskedArray’s dtype.Extracted and improved from #2279