Allow basename of dataset paths to match registered names #997
Conversation
> Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. This behavior is configurable in the CodeRabbit settings.
📝 Walkthrough

Adds local-filesystem dataset resolution: if a provided dataset identifier is a local path, derive a registered key from the path name, load the corresponding registered config, and override its `path` with the local path.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Client
    participant DatasetUtils
    participant Filesystem
    participant Registry
    participant DatasetLoader
    Client->>DatasetUtils: get_dataset_samples(name_or_path, ...)
    DatasetUtils->>Filesystem: os.path.exists(name_or_path)?
    alt path exists
        DatasetUtils->>DatasetUtils: dataset_name = last_segment(name_or_path)
        DatasetUtils->>Registry: find registered key by substring(dataset_name)
        alt registered key found
            Registry-->>DatasetUtils: config
            DatasetUtils->>DatasetUtils: config.path = name_or_path (override)
            DatasetUtils->>DatasetLoader: load_dataset(config, split)
            DatasetLoader-->>Client: samples
        else no registered key
            DatasetUtils->>DatasetLoader: load_from_local_path(name_or_path)
            DatasetLoader-->>Client: samples
        end
    else not a path
        DatasetUtils->>Registry: resolve registered key or auto-detect
        alt resolved
            Registry-->>DatasetUtils: config
            DatasetUtils->>DatasetLoader: load_dataset(config, split)
            DatasetLoader-->>Client: samples
        else auto-detect
            DatasetUtils->>DatasetLoader: auto_detect_and_load(name_or_path)
            DatasetLoader-->>Client: samples
        end
    end
```
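The resolution flow above can be sketched as a small standalone function. The registry contents and the function name below are illustrative stand-ins for the real `modelopt` code, not its actual API:

```python
import os

# Hypothetical stand-in for the real SUPPORTED_DATASET_CONFIG registry.
SUPPORTED_DATASET_CONFIG = {
    "cnn_dailymail": {"path": "cnn_dailymail", "name": "3.0.0"},
    "pile": {"path": "monology/pile-uncopyrighted"},
}

def resolve_dataset_config(name_or_path: str) -> dict:
    """Resolve a dataset identifier to kwargs for load_dataset()."""
    if os.path.exists(name_or_path):
        # Local path: derive a lookup key from the final path segment.
        key = os.path.basename(os.path.normpath(name_or_path))
        config = dict(SUPPORTED_DATASET_CONFIG.get(key, {}))
        # Always load from the local path, keeping any registered kwargs.
        config["path"] = name_or_path
        return config
    # Not a local path: use the registered config, else leave it for
    # downstream auto-detection.
    return dict(SUPPORTED_DATASET_CONFIG.get(name_or_path, {"path": name_or_path}))
```

The point of the override is that a local directory named like a registered dataset inherits that dataset's preprocessing kwargs while still loading from disk.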
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks: ✅ 3 | ❌ 1

❌ Failed checks (1 warning)
✅ Passed checks (3 passed)

✏️ Tip: You can configure your own custom pre-merge checks in the settings.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@modelopt/torch/utils/dataset_utils.py`:
- Around line 248-254: The code overwrites dataset_name with a manually parsed
basename which breaks unregistered local dataset paths and is not OS-portable;
instead keep dataset_path as the full local path, do not replace dataset_name,
and use os.path.basename() for any display/name extraction. Update the block
around dataset_name/dataset_path so dataset_path stays set to the full path when
os.path.exists(dataset_name) is true, and later when building config (used by
load_dataset) use dataset_path (not the truncated basename) for the "path" entry
unless the dataset is found in SUPPORTED_DATASET_CONFIG; replace any
rstrip("/").split("/")[-1] logic with os.path.basename(dataset_path) if you need
just the name for registration/display.
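The `os.path.basename()` swap the prompt asks for can be illustrated in isolation. Combining it with `os.path.normpath()` handles trailing separators portably, unlike manual splitting on "/" (helper name below is illustrative):

```python
import os

def dataset_key_from_path(dataset_path: str) -> str:
    """Derive a registry lookup key from a local dataset path.

    normpath() strips trailing separators (and collapses "a//b"),
    so basename() returns the final path component on any OS.
    """
    return os.path.basename(os.path.normpath(dataset_path))
```

On POSIX, a bare `basename("/data/my_custom_dataset/")` would return an empty string, which is why the `normpath()` step matters.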
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 0e549af3-8813-487e-8da3-ccdbeb77a7b0
📥 Commits
Reviewing files that changed from the base of the PR and between 37d3f10 and dd8f7b632d222553bffe352bf3adbe71096d4de2.
📒 Files selected for processing (1)
modelopt/torch/utils/dataset_utils.py
Force-pushed from 10b0e68 to 595f789.
Codecov Report

❌ Patch coverage is

Additional details and impacted files:

```
@@            Coverage Diff             @@
##             main     #997      +/-   ##
==========================================
- Coverage   72.14%   72.09%    -0.06%
==========================================
  Files         209      209
  Lines       23667    23725      +58
==========================================
+ Hits        17075    17104      +29
- Misses       6592     6621      +29
```

☔ View full report in Codecov by Sentry.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
modelopt/torch/utils/dataset_utils.py (1)
293-302: ⚠️ Potential issue | 🟠 Major — Unregistered local datasets will fail: path contains only basename.

When a local path is provided but no registered key matches, `config["path"]` is set to `dataset_name` (line 298), which has been truncated to just the basename at line 258. This causes `load_dataset()` to fail because the full path is lost. For example, passing `/data/my_custom_dataset` results in `config = {"path": "my_custom_dataset"}` instead of the full path.

Proposed fix

```diff
 else:
     warn(
         f"Dataset '{dataset_name}' is not in SUPPORTED_DATASET_CONFIG. "
         "Auto-detecting format from column names."
     )
-    config = {"path": dataset_name}
+    config = {"path": dataset_path if dataset_path is not None else dataset_name}
 splits = _normalize_splits(split) if split is not None else ["train"]

 def _preprocess(sample: dict) -> str:
     return _auto_preprocess_sample(sample, dataset_name, tokenizer)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelopt/torch/utils/dataset_utils.py` around lines 293 - 302, The code sets config["path"] to a truncated basename in the auto-detect branch, so local dataset paths (e.g., /data/my_custom_dataset) lose their full path and load_dataset() fails; fix by using the original full path variable (preserve the input path before any basename truncation) when creating config (i.e., set config["path"] = original_full_dataset_path or dataset_path instead of the truncated dataset_name), leaving _preprocess (which calls _auto_preprocess_sample) unchanged so auto-detection still works for unregistered datasets.
♻️ Duplicate comments (1)
modelopt/torch/utils/dataset_utils.py (1)
255-261: ⚠️ Potential issue | 🟡 Minor — Use `os.path.basename()` for cross-platform compatibility.

Line 258 uses `dataset_path.rstrip("/").split("/")[-1]`, which will fail on Windows where paths use backslash separators. The `os` module is already imported.

Proposed fix

```diff
 if os.path.exists(dataset_name):
     # Local path
     dataset_path = dataset_name
-    dataset_name = dataset_path.rstrip("/").split("/")[-1]
+    dataset_name = os.path.basename(os.path.normpath(dataset_path))
 else:
     dataset_path = None
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelopt/torch/utils/dataset_utils.py` around lines 255 - 261, The code that derives the dataset folder name uses manual string splitting (dataset_path.rstrip("/").split("/")[-1]) which is not cross-platform; replace that logic with os.path.basename(dataset_path.rstrip(os.sep)) (or simply os.path.basename(dataset_path) if you prefer) so dataset_name is computed portably; update the block that sets dataset_path/dataset_name in dataset_utils.py to call os.path.basename and remove the hardcoded "/" usage.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@modelopt/torch/utils/dataset_utils.py`:
- Line 262: The current selection logic uses a substring test when computing
registered_key (next((k for k in SUPPORTED_DATASET_CONFIG if k in dataset_name),
None)), which can produce incorrect matches (e.g., 'pile' matching
'minipile_100_samples'); change the match to a safe strategy such as exact
equality (k == dataset_name) or, if prefix/suffix matching is required, sort
keys by length descending and check for full-token or prefix matches to prefer
the most specific key; update the code that sets registered_key and any
downstream logic that assumes that key (referencing SUPPORTED_DATASET_CONFIG,
registered_key, and dataset_name) so the correct dataset configuration is always
chosen.
In `@tests/unit/torch/utils/test_dataset_utils.py`:
- Around line 86-108: The unit test test_get_dataset_samples_minipile_local_path
currently performs a network call via snapshot_download and belongs in
integration tests; change it to either move the test to tests/examples
(integration) or mock huggingface_hub.snapshot_download so the unit test is fast
and CPU-only, keeping the call to get_dataset_samples intact. Also add a new
unit test that creates a local directory whose basename does NOT match any key
in SUPPORTED_DATASET_CONFIG and calls get_dataset_samples with that local path
to exercise the auto-detection branch in dataset_utils.get_dataset_samples (the
code that handles unregistered local dataset paths), asserting it returns
expected text samples.
---
Outside diff comments:
In `@modelopt/torch/utils/dataset_utils.py`:
- Around line 293-302: The code sets config["path"] to a truncated basename in
the auto-detect branch, so local dataset paths (e.g., /data/my_custom_dataset)
lose their full path and load_dataset() fails; fix by using the original full
path variable (preserve the input path before any basename truncation) when
creating config (i.e., set config["path"] = original_full_dataset_path or
dataset_path instead of the truncated dataset_name), leaving _preprocess (which
calls _auto_preprocess_sample) unchanged so auto-detection still works for
unregistered datasets.
---
Duplicate comments:
In `@modelopt/torch/utils/dataset_utils.py`:
- Around line 255-261: The code that derives the dataset folder name uses manual
string splitting (dataset_path.rstrip("/").split("/")[-1]) which is not
cross-platform; replace that logic with
os.path.basename(dataset_path.rstrip(os.sep)) (or simply
os.path.basename(dataset_path) if you prefer) so dataset_name is computed
portably; update the block that sets dataset_path/dataset_name in
dataset_utils.py to call os.path.basename and remove the hardcoded "/" usage.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: a7bc85fe-223b-4370-833c-d061a83ee250
📥 Commits
Reviewing files that changed from the base of the PR and between 595f789d3683c9cb4502a3329fcef82873015c19 and ed58dde972d7c812ce7c9566ad144c36b561b5cd.
📒 Files selected for processing (2)
modelopt/torch/utils/dataset_utils.py
tests/unit/torch/utils/test_dataset_utils.py
Force-pushed from ed58dde to d8b2d4e.
Actionable comments posted: 1
♻️ Duplicate comments (2)
tests/unit/torch/utils/test_dataset_utils.py (1)
86-108: ⚠️ Potential issue | 🟠 Major — This unit test is still network-bound and won't catch the resolver bug.

`snapshot_download()` makes `tests/unit/...` depend on the Hub, and the assertions only check for five non-empty strings. Because both the `pile` and `minipile_100_samples` configs currently preprocess `sample["text"]`, this can still pass when the resolver selects the wrong key. Please stub `datasets.load_dataset`/Hub calls and assert the resolved kwargs instead (for example, that no inherited `name="v1.0"` is passed), then add a second case whose directory basename is not in `SUPPORTED_DATASET_CONFIG` so the unregistered local-path branch is exercised too.

Suggested direction

```diff
-def test_get_dataset_samples_minipile_local_path(tmp_path):
+def test_get_dataset_samples_minipile_local_path(monkeypatch, tmp_path):
-    from huggingface_hub import snapshot_download
-
     local_dir = tmp_path / "nanotron" / "minipile_100_samples"
-    snapshot_download(
-        repo_id="nanotron/minipile_100_samples",
-        repo_type="dataset",
-        local_dir=str(local_dir),
-    )
+    local_dir.mkdir(parents=True)
+
+    captured: dict[str, object] = {}
+
+    def fake_load_dataset(*, split, streaming, **kwargs):
+        captured.update({"split": split, "streaming": streaming, **kwargs})
+        return iter([{"text": f"sample-{i}"} for i in range(5)])
+
+    import datasets
+    monkeypatch.setattr(datasets, "load_dataset", fake_load_dataset)

     samples = get_dataset_samples(str(local_dir), num_samples=5)
-    assert isinstance(samples, list)
-    assert len(samples) == 5
-    assert all(isinstance(s, str) and len(s) > 0 for s in samples)
+    assert samples == [f"sample-{i}" for i in range(5)]
+    assert captured["path"] == str(local_dir)
+    assert "name" not in captured
```

As per coding guidelines, `tests/unit/**/*.py`: unit tests in tests/unit should be fast CPU-based tests taking no more than a few seconds to run.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/unit/torch/utils/test_dataset_utils.py` around lines 86 - 108, The test currently performs network I/O via snapshot_download and only asserts result strings; instead, mock/stub external Hub calls and datasets.load_dataset so the test runs offline and can assert how get_dataset_samples resolves kwargs: patch huggingface_hub.snapshot_download and datasets.load_dataset (or the internal call site used by get_dataset_samples) to return a fake local_dir and a fake dataset object, then call get_dataset_samples with that local path and assert the resolved kwargs passed into datasets.load_dataset (e.g., that no inherited name="v1.0" or incorrect config is forwarded and that the selected config equals the one matched from SUPPORTED_DATASET_CONFIG like "minipile_100_samples"); also add a second case where the directory basename is not present in SUPPORTED_DATASET_CONFIG to exercise the unregistered-local-path branch and assert datasets.load_dataset is called with the generic/local-path kwargs instead.

modelopt/torch/utils/dataset_utils.py (1)

255-262: ⚠️ Potential issue | 🟠 Major — Keep the full local path separate from the lookup key.

This block truncates a local path to `rstrip("/").split("/")[-1]` and then resolves configs with substring matching. That breaks unregistered local datasets later on lines 294-298 because `load_dataset()` receives only the basename, and it also makes `minipile_100_samples` resolve to the earlier `pile` entry in the current config order. Preserve `dataset_path` as-is, derive a normalized basename for lookup, and match that key exactly.

Suggested fix

```diff
-    if os.path.exists(dataset_name):
-        # Local path
-        dataset_path = dataset_name
-        dataset_name = dataset_path.rstrip("/").split("/")[-1]
-    else:
-        dataset_path = None
-
-    registered_key = next((k for k in SUPPORTED_DATASET_CONFIG if k in dataset_name), None)
+    dataset_path: str | None = None
+    dataset_key = dataset_name
+    if os.path.exists(dataset_name):
+        dataset_path = dataset_name
+        dataset_key = os.path.basename(os.path.normpath(dataset_path))
+
+    registered_key = next((k for k in SUPPORTED_DATASET_CONFIG if k == dataset_key), None)
@@
-            f"Dataset '{dataset_name}' is not in SUPPORTED_DATASET_CONFIG. "
+            f"Dataset '{dataset_key}' is not in SUPPORTED_DATASET_CONFIG. "
             "Auto-detecting format from column names."
         )
-        config = {"path": dataset_name}
+        config = {"path": dataset_path or dataset_name}
     splits = _normalize_splits(split) if split is not None else ["train"]

     def _preprocess(sample: dict) -> str:
-        return _auto_preprocess_sample(sample, dataset_name, tokenizer)
+        return _auto_preprocess_sample(sample, dataset_key, tokenizer)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelopt/torch/utils/dataset_utils.py` around lines 255 - 262, The current code mutates dataset_name to the basename which loses the full local path and uses substring matching against SUPPORTED_DATASET_CONFIG; revert this by keeping dataset_path set to the original full path (do not overwrite dataset_name), derive a separate normalized basename (e.g., use os.path.basename(dataset_name.rstrip("/"))) for lookup, and compute registered_key by exact equality against entries in SUPPORTED_DATASET_CONFIG using that basename (not substring containment); ensure later calls such as load_dataset() are passed dataset_path when it exists (the full local path) and only use registered_key/basename for config lookup.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@modelopt/torch/utils/dataset_utils.py`:
- Around line 231-234: The docstring for the dataset_name parameter overstates
supported inputs (mentions .jsonl.gz and uses "mmlu" as an example) while the
implementation only special-cases .jsonl and SUPPORTED_DATASET_CONFIG does not
contain "mmlu"; update the docstring to accurately reflect current behavior by
removing .jsonl.gz from the supported list and replacing the "mmlu" example with
an actual registered key from SUPPORTED_DATASET_CONFIG (or a generic phrase like
"one of the keys in SUPPORTED_DATASET_CONFIG"); ensure you update the docstring
near the dataset_name parameter and reference SUPPORTED_DATASET_CONFIG so docs
and code match.
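A docstring matching what the prompt asks for might look like the sketch below. The signature, defaults, and return type are illustrative, not the real `get_dataset_samples` API:

```python
def get_dataset_samples(dataset_name: str, num_samples: int = 512) -> list[str]:
    """Return raw text samples from a dataset.

    Args:
        dataset_name: One of the keys in ``SUPPORTED_DATASET_CONFIG``,
            a local directory path, or a path to a ``.jsonl`` file.
        num_samples: Number of text samples to return.
    """
    raise NotImplementedError  # docstring illustration only
```

The key point is that the docstring enumerates only inputs the code actually handles, so docs and behavior stay in sync.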
---
Duplicate comments:
In `@modelopt/torch/utils/dataset_utils.py`:
- Around line 255-262: The current code mutates dataset_name to the basename
which loses the full local path and uses substring matching against
SUPPORTED_DATASET_CONFIG; revert this by keeping dataset_path set to the
original full path (do not overwrite dataset_name), derive a separate normalized
basename (e.g., use os.path.basename(dataset_name.rstrip("/"))) for lookup, and
compute registered_key by exact equality against entries in
SUPPORTED_DATASET_CONFIG using that basename (not substring containment); ensure
later calls such as load_dataset() are passed dataset_path when it exists (the
full local path) and only use registered_key/basename for config lookup.
In `@tests/unit/torch/utils/test_dataset_utils.py`:
- Around line 86-108: The test currently performs network I/O via
snapshot_download and only asserts result strings; instead, mock/stub external
Hub calls and datasets.load_dataset so the test runs offline and can assert how
get_dataset_samples resolves kwargs: patch huggingface_hub.snapshot_download and
datasets.load_dataset (or the internal call site used by get_dataset_samples) to
return a fake local_dir and a fake dataset object, then call get_dataset_samples
with that local path and assert the resolved kwargs passed into
datasets.load_dataset (e.g., that no inherited name="v1.0" or incorrect config
is forwarded and that the selected config equals the one matched from
SUPPORTED_DATASET_CONFIG like "minipile_100_samples"); also add a second case
where the directory basename is not present in SUPPORTED_DATASET_CONFIG to
exercise the unregistered-local-path branch and assert datasets.load_dataset is
called with the generic/local-path kwargs instead.
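A self-contained sketch of the stubbing idea follows. The toy `get_dataset_samples` and `fake_load_dataset` stand in for the real `modelopt`/`datasets` code, so names and signatures here are assumptions:

```python
# Record what the resolver forwards to the loader, then return fake rows,
# mimicking the monkeypatch-based test the review suggests.
captured: dict[str, object] = {}

def fake_load_dataset(path: str, split: str = "train", **kwargs):
    # Stand-in for datasets.load_dataset: capture the forwarded kwargs.
    captured.update({"path": path, "split": split, **kwargs})
    return [{"text": f"sample-{i}"} for i in range(5)]

def get_dataset_samples(name_or_path: str, num_samples: int,
                        load_dataset=fake_load_dataset) -> list[str]:
    # Toy version of the code under test: forwards the full path unchanged.
    rows = load_dataset(name_or_path, split="train")
    return [row["text"] for row in rows][:num_samples]

samples = get_dataset_samples("/tmp/minipile_100_samples", num_samples=5)
```

The assertions then check behavior rather than shape: the full path reached the loader, and no registered `name` kwarg leaked in.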
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: b3c8d7f2-c997-4ae2-ae2e-3bfadaf8d90d
📥 Commits
Reviewing files that changed from the base of the PR and between ed58dde972d7c812ce7c9566ad144c36b561b5cd and d8b2d4e72cdae9df7a8a5f457df3d760e6c21824.
📒 Files selected for processing (2)
modelopt/torch/utils/dataset_utils.py
tests/unit/torch/utils/test_dataset_utils.py
Force-pushed from 01c2508 to 63f9c7d.
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
Force-pushed from c3f25fe to d882938.
kevalmorabia97
left a comment
Updated `get_dataset_samples` to be able to parse a local dataset path even when it is not explicitly supported in the `SUPPORTED_DATASET_CONFIG` dict.
Force-pushed from 8b2bb9c to ab5ced3.
What does this PR do?
Type of change: ?
Usage
```python
# Add a code snippet demonstrating how to use this
```

Testing
Before your PR is "Ready for review"
- Make sure you read and follow Contributor guidelines and your commits are signed (`git commit -s -S`).
- Make sure you read and follow the Security Best Practices (e.g. avoiding hardcoded `trust_remote_code=True`, using `torch.load(..., weights_only=True)`, avoiding `pickle`, etc.).

Additional Information
Summary by CodeRabbit
New Features
Chores
Tests
Documentation