Skip to content

fix: extract arxiv archives outside download dir#2055

Open
nightcityblade wants to merge 3 commits into
NVIDIA-NeMo:mainfrom
nightcityblade:fix/issue-2042
Open

fix: extract arxiv archives outside download dir#2055
nightcityblade wants to merge 3 commits into
NVIDIA-NeMo:mainfrom
nightcityblade:fix/issue-2042

Conversation

@nightcityblade

Copy link
Copy Markdown
Contributor

Description

Closes #2042.

ArxivIterator now extracts downloaded source archives into the system temporary directory instead of creating its extraction directory under the archive's download directory. This avoids failures when the download directory is readable but not writable.

Usage

N/A

Checklist

  • I am familiar with the Contributing Guide.
  • New or Existing tests cover these changes.
  • The documentation is up to date with these changes.

Checks run:

  • PYTHONPATH=. uv run --no-sync pytest --confcutdir=tests/stages/text/download/arxiv tests/stages/text/download/arxiv/test_iterator.py -q (blocked: missing cosmos_xenna dependency in the local environment)
  • python3 -m py_compile nemo_curator/stages/text/download/arxiv/iterator.py tests/stages/text/download/arxiv/test_iterator.py
  • uv run --no-sync --with ruff==0.14.10 ruff check nemo_curator/stages/text/download/arxiv/iterator.py tests/stages/text/download/arxiv/test_iterator.py

Signed-off-by: nightcityblade <nightcityblade@gmail.com>
@nightcityblade nightcityblade requested a review from a team as a code owner June 6, 2026 03:29
@nightcityblade nightcityblade requested review from sarahyurick and removed request for a team June 6, 2026 03:29
@copy-pr-bot

copy-pr-bot Bot commented Jun 6, 2026

Copy link
Copy Markdown

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@greptile-apps

greptile-apps Bot commented Jun 6, 2026

Copy link
Copy Markdown
Contributor

Greptile Summary

This PR fixes a bug where ArxivIterator would fail when the download directory is read-only, by switching the extraction TemporaryDirectory from being rooted under the download directory to the system temp directory by default.

  • Core fix (iterator.py): Removes download_dir = os.path.split(file_path)[0] and replaces TemporaryDirectory(dir=download_dir) with a conditional that uses TemporaryDirectory() (system default) when no extract_tmp_dir is set, or TemporaryDirectory(dir=self._extract_tmp_dir) when one is configured.
  • New parameter: Adds extract_tmp_dir: str | None = None to __init__, allowing callers to pin the extraction location explicitly.
  • Tests (test_iterator.py): Patches tempfile.TemporaryDirectory via mock.patch(..., wraps=...) so real extraction still happens, and asserts the call signature matches expectations for both the default and custom-dir paths.

Confidence Score: 5/5

Safe to merge — the change is narrowly scoped, replaces a single well-understood call site, and both the default and custom-directory paths are verified by the updated tests.

The fix correctly decouples extraction from the download directory by using the system temp dir when no explicit path is given. The new extract_tmp_dir parameter is opt-in and backward-compatible. Tests cover both the default call signature and the custom-dir path using wrapping mocks so real extraction is exercised as well. No regressions are visible in the surrounding logic.

No files require special attention.

Important Files Changed

Filename Overview
nemo_curator/stages/text/download/arxiv/iterator.py Core fix: removes dependency on download dir for extraction, adds optional extract_tmp_dir parameter — logic is correct and safe.
tests/stages/text/download/arxiv/test_iterator.py Adds mock-based assertions to verify TemporaryDirectory call signature for both default and custom-dir cases; addresses previous review feedback with assert_called_once_with().

Sequence Diagram

sequenceDiagram
    participant Caller
    participant ArxivIterator
    participant tempfile.TemporaryDirectory
    participant tarfile
    participant Filesystem

    Caller->>ArxivIterator: iterate(file_path)
    ArxivIterator->>ArxivIterator: "build tmpdir_kwargs (None → {}, path → {dir: path})"
    ArxivIterator->>tempfile.TemporaryDirectory: "TemporaryDirectory(**tmpdir_kwargs)"
    tempfile.TemporaryDirectory->>Filesystem: create temp dir (system tmp or extract_tmp_dir)
    ArxivIterator->>tarfile: open(file_path)
    ArxivIterator->>Filesystem: tar_safe_extract(tf, tmpdir)
    loop for each extracted file
        ArxivIterator->>ArxivIterator: _tex_proj_loader(item)
        ArxivIterator-->>Caller: "yield {id, source_id, content}"
    end
    ArxivIterator->>Filesystem: cleanup temp dir (context exit)
Loading

Reviews (4): Last reviewed commit: "fix: allow custom arxiv extraction temp ..." | Re-trigger Greptile

) as tmpdir_mock:
results = list(iterator.iterate(str(outer_tar_path)))

assert tmpdir_mock.call_args.kwargs == {}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 The assertion tmpdir_mock.call_args.kwargs == {} only checks that no keyword arguments were passed, but does not verify TemporaryDirectory was actually called. If iterate() were to exit early and never call TemporaryDirectory, call_args would be None and the test would raise an AttributeError instead of a clear assertion failure. assert_called_once_with() validates the call count, verifies no positional args, and no keyword args — all in one readable statement.

Suggested change
assert tmpdir_mock.call_args.kwargs == {}
tmpdir_mock.assert_called_once_with()

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, agreed — updated the test to use tmpdir_mock.assert_called_once_with() so it verifies TemporaryDirectory was called exactly once with no args/kwargs.

Signed-off-by: nightcityblade <nightcityblade@gmail.com>

@praateekmahajan praateekmahajan left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR, requested a small change

bname = os.path.split(file_path)[-1]

with tempfile.TemporaryDirectory(dir=download_dir) as tmpdir, tarfile.open(file_path) as tf:
with tempfile.TemporaryDirectory() as tmpdir, tarfile.open(file_path) as tf:

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey, let's provide an argument so that user can decide to override where to untar. This is because many concurrent replicas of ArxivIterator might be untarring at the same time, and in some clusters /tmp has storage limits..

@nightcityblade

Copy link
Copy Markdown
Contributor Author

Thanks for the review — I updated ArxivIterator to accept an optional extract_tmp_dir argument, while keeping the default behavior as the system temp directory so read-only download dirs still work. When provided, it is passed through to tempfile.TemporaryDirectory(dir=...) so users can choose a cluster-appropriate untar location.

Validation run:

  • python3 -m ruff check nemo_curator/stages/text/download/arxiv/iterator.py tests/stages/text/download/arxiv/test_iterator.py
  • python3 -m ruff format --check nemo_curator/stages/text/download/arxiv/iterator.py tests/stages/text/download/arxiv/test_iterator.py
  • python3 -m py_compile nemo_curator/stages/text/download/arxiv/iterator.py tests/stages/text/download/arxiv/test_iterator.py
  • Focused local validation script covering default TemporaryDirectory() and custom TemporaryDirectory(dir=...)

I also tried the focused pytest file, but this local environment is missing full project deps (ray/cosmos_xenna), so pytest collection could not start here.

Signed-off-by: nightcityblade <nightcityblade@gmail.com>
@nightcityblade

nightcityblade commented Jun 11, 2026

Copy link
Copy Markdown
Contributor Author

Thanks again for the review. The requested extract_tmp_dir override was added in the latest code update, and I just fixed the DCO failure by adding the missing Signed-off-by line to the latest commit.

Validation run:

  • python3 -m ruff check nemo_curator/stages/text/download/arxiv/iterator.py tests/stages/text/download/arxiv/test_iterator.py
  • python3 -m ruff format --check nemo_curator/stages/text/download/arxiv/iterator.py tests/stages/text/download/arxiv/test_iterator.py
  • python3 -m py_compile nemo_curator/stages/text/download/arxiv/iterator.py tests/stages/text/download/arxiv/test_iterator.py

I also retried python3 -m pytest tests/stages/text/download/arxiv/test_iterator.py -q, but local collection still requires the missing ray dependency. The DCO check is now passing on commit 4065fd5.

@sarahyurick sarahyurick self-requested a review June 11, 2026 20:58
@svcnvidia-nemo-ci svcnvidia-nemo-ci added the waiting-on-maintainers Waiting on maintainers to respond label Jun 13, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

community-request waiting-on-maintainers Waiting on maintainers to respond

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Arxiv Iterator untars at the same path as download_dir which might fail due to lack of permissions

4 participants