Add support for hdf5 files#169
Conversation
erykoff
left a comment
There was a problem hiding this comment.
I've started to go through but not the full way. The map should be saved and not the valid pixels. This is a key part of the healsparse format.
| use_threads : `bool`, optional | ||
| Use multithreaded reading for parquet files. | ||
| group : `str`, optional | ||
| if file is hdf5, read from this group |
There was a problem hiding this comment.
-> If file is hdf5, read from this group. (Please format as a sentence).
The next line is not necessary; the default is in the signature.
There was a problem hiding this comment.
Actually, looking further, I think that this should be hdf_group instead of group everywhere.
| coverage information. | ||
| group: `str`, optional | ||
| If file format is ``hdf5``, save to this group | ||
| otheriwse unused |
There was a problem hiding this comment.
-> If file format is ``hdf5``, save to this group; otherwise unused.
(Can fit on one line; next line is not necessary)
There was a problem hiding this comment.
As above, make this hdf5_group.
| header=header, degrade_nside=degrade_nside, | ||
| weightfile=weightfile, reduction=reduction, use_threads=use_threads) | ||
| weightfile=weightfile, reduction=reduction, use_threads=use_threads, | ||
| hdf5_group=group) |
| """ | ||
| _write_map(self, filename, clobber=clobber, nocompress=nocompress, format=format, | ||
| nside_io=nside_io) | ||
| nside_io=nside_io, hdf5_group=group) |
| use_threads : `bool`, optional | ||
| Use multithreaded reading for parquet files. | ||
| hdf5_group: `str`, optional | ||
| hdf5 group to read from (only used if reading from an hdf5 file) |
There was a problem hiding this comment.
-> Read from this hdf5 group (only used reading from an hdf5 file).
| raise IOError("Filename %s could not be found." % (filename)) | ||
| else: | ||
| raise NotImplementedError("HealSparse only supports fits and parquet files (with pyarrow).") | ||
| raise NotImplementedError("HealSparse only supports fits, hdf5 and parquet files (with pyarrow).") |
There was a problem hiding this comment.
-> ... hdf5 (with h5py), and parquet ...
| Parameters | ||
| ---------- | ||
| hsp_map : HealSparseMap | ||
| Map to save |
| filepath : str | ||
| HDF5 file path | ||
| group : str, optional | ||
| Name of the HDF5 group to store the map (default 'map') |
There was a problem hiding this comment.
No need to put the default in the docstring. Here and below.
| Overwrite the file/group if it exists (default False) | ||
| """ | ||
| if os.path.isfile(filepath) and not clobber: | ||
| raise RuntimeError("Filename %s exists and clobber is False." % (filepath)) |
There was a problem hiding this comment.
I'm not sure how this can work as is. clobber is the same as overwrite which isn't happening here. I think we can demand that the HDF5 already exists (document this!) and therefore (a) clobber does nothing and (b) we can demand that the file exists.
Or ... I see now that you're using clobber to overwrite an existing group? In which case, that needs to be documented but the check here is not what you want.
There was a problem hiding this comment.
thanks. the clobber logic should now work. Yes, the intended behaviour is:
clobber=False only prevents overwriting the group. If the file already exists but the group is new, we can still save to a new group in that file (even if clobber=F)
|
|
||
| # Sparse map - only save valid pixels | ||
| valid_pixels = hsp_map.valid_pixels | ||
| grp.create_dataset("pixel", data=valid_pixels, compression="gzip") |
There was a problem hiding this comment.
This is not going to work, and in fact may explode well beyond your existing memory. The healsparse format is extremely efficient without recording pixel values (which can sometimes fail and is often slow). You want to save the map itself, and not valid pixels. Same thing with the coverage map.
There was a problem hiding this comment.
This should now save the map
erykoff
left a comment
There was a problem hiding this comment.
Thanks! This is looking a lot better but there are still a couple places that will not work in general (packed bool maps and recarrays). Also, please add to the documentation about this new format.
| the ``healpix`` EXPLICIT format does not maintain all metadata and | ||
| coverage information. | ||
| hdf5_group: `str`, optional | ||
| If file format is ``hdf5``, save to this group, otheriwse unused. |
| is_fits_file = True | ||
| fits.close() | ||
| except OSError: | ||
| except (OSError, UnicodeDecodeError): |
There was a problem hiding this comment.
Where is the UnicodeDecodeError coming from?
There was a problem hiding this comment.
It seems this is what happens when you try to load the hdf5 file as if its fits. there might be a cleaner way of checking
There was a problem hiding this comment.
Huh! Okay, could you add a comment here?
| ) | ||
| elif hsp_map.is_bit_packed_map: | ||
| # save as bool array rather than packed to keep the same pixel as other maps when reading | ||
| sparse_map_reshape = np.asarray(hsp_map._sparse_map).reshape(ncov_in_sparse, nfine_per_cov) |
There was a problem hiding this comment.
This is going to blow up the memory and crash in the cases that a packed map is being used. Please use the packed format. In this case you just need to reshape with a divide-by-8 on nfine_per_cov e.g. https://github.com/LSSTDESC/healsparse/blob/main/healsparse/io_map_fits.py#L614-L616
|
|
||
| sparse_map = np.zeros(sparse_size, dtype=dtype) | ||
| for name, _ in dtype: | ||
| sparse_map[name][:] = sentinel |
There was a problem hiding this comment.
This isn't going to work because the sentinel only applies to primary. You'll need to read the overflow coverage pixel in all cases (not just here)
There was a problem hiding this comment.
i read the overflow pixel from file anyway so this line was doing nothing. removed
| ) | ||
| elif is_bit_packed: | ||
| sparse_map = grp["sparse_map"][cov_index_in_sparse_ordered, :][inv].reshape(-1) | ||
| sparse_map = _PackedBoolArray.from_boolean_array(sparse_map) |
There was a problem hiding this comment.
You'll need to leave this stored as packed.
erykoff
left a comment
There was a problem hiding this comment.
A few small comments. The only substantive one is that it looks like the "pixels read" tests are turned off, but it seems like these should work. So these tests should be turned on (and working) or specifying pixels needs to raise a NotImplementedError.
| A :code:`HealSparseMap` may also be serialized to an HDF5 file. | ||
| Multiple :code:`HealSparseMap` objects can be stored in the same HDF5 file. | ||
| Each :code:`HealSparseMap` is stored in a different HDF5 group. | ||
| They are not required to have the same mask, :code:`nside_coverage` or :code:`nside_sparse`. |
| use_threads : `bool`, optional | ||
| Use multithreaded reading for parquet files. | ||
| hdf5_group : `str`, optional | ||
| If file format is ``hdf5``, read from this group, otheriwse unused. |
| is_fits_file = True | ||
| fits.close() | ||
| except OSError: | ||
| except (OSError, UnicodeDecodeError): |
There was a problem hiding this comment.
Huh! Okay, could you add a comment here?
| compression="gzip", | ||
| ) | ||
| elif hsp_map.is_bit_packed_map: | ||
| # save as bool array rather than packed to keep the same pixel as other maps when reading |
There was a problem hiding this comment.
This comment is no longer operative.
| import healsparse | ||
|
|
||
| # read specific coverage pixels not yet implemented | ||
| test_pixels_read = False |
There was a problem hiding this comment.
I removed this as an option and made the pixel tests always run
Added read/write to a group in an hdf5 file
Useful for saving maps in TXPipe/ceci where we want N maps with different masks saved to the same file (e.g. galaxy density maps in different tomo bins)
Currently it only allows the whole map to the read (cannot specify list of coverage pixels with
pixels) but this can be added