Skip to content

Fix getting filenames out of netCDF datasets#8

Merged
charles-turner-1 merged 8 commits into
mainfrom
fix-ds-handles
May 21, 2025
Merged

Fix getting filenames out of netCDF datasets#8
charles-turner-1 merged 8 commits into
mainfrom
fix-ds-handles

Conversation

@charles-turner-1

@charles-turner-1 charles-turner-1 commented May 13, 2025

Copy link
Copy Markdown
Collaborator

Mostly closes #7

As of right now I'm not sure that we can get all the filenames out of a multifile dataarray, only the first. With that said, probably unimportant in nearly all situations. I've added an xfailing test for it, but in practice I think it'll give the right answer if a multifile datarray is passed to this function.

@jemmajeffree think this might be up your alley - no rush with the review though.

@charles-turner-1

Copy link
Copy Markdown
Collaborator Author

if hasattr(bound_method, "__self__"):
file_handles.append(Path(bound_method.__self__._filename))

return file_handles

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

overall structure looks good and I can follow what the function does, though I'm not familiar with the specific functions/classes/attributes used (can infer what they do from use, but not confirm that they do what they seem to)

Comment thread versioneer.py
files.append(versioneer_file)
present = False
try:
with open(".gitattributes", "r") as fobj:

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am curious why the removal of "r" is beneficial. Wouldn't it be more futureproof to specify read only?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is just a versioneer artefact, not sure what's caused this to change tbh.

Comment thread tests/test_chunking.py Outdated
engine="netcdf4",
)
else:
ds = xr.open_dataset(fpath, decode_timedelta=False, engine="netcdf4")

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

open_mfdataset seems to cope with a single path (as best I can tell, it uses it as a glob string that returns a single file); is there a reason to use open_dataset explicitly for only one filepath?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good call, fixed

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

did you get both instances?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nope. Working on updating the related examples now anyway so will fix in there

Comment thread tests/test_chunking.py
fhandles = _get_file_handles(ds)

if isinstance(fpath, list):
assert fhandles == [Path(f) for f in fpath]

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would this work if fpath contains the same paths but in a different order to fhandles? Would you want this to pass or fail in that instance

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in this case it'd fail, you could replace it with a set intersection to make it pass. Whether you'd want it to pass/fail is more of a philosophical question I guess

Comment thread tests/test_chunking.py
else:
ds = xr.open_dataset(fpath, decode_timedelta=False, engine="netcdf4")
with warnings.catch_warnings():
chunk_dict = validate_chunkspec(

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sure if this is the right place to mention this, but does validate_chunkspec care if the chunks are not an integer divisor of file size in that dimension? Ie, if files have 3 timesteps, in disc chunks of 1, is it okay with chunks of two? in which case you'd end up with every second chunk being half as big as the other. Are we okay with that?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good question. I actually don't know whether this would cause performance issues. I suspect not but can't say for certain..

@charles-turner-1 charles-turner-1 merged commit f559460 into main May 21, 2025
14 checks passed
@charles-turner-1 charles-turner-1 deleted the fix-ds-handles branch May 21, 2025 05:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

validate_chunkspec on an xarray dataset/dataarray object

2 participants