fix: extract arxiv archives outside download dir#2055
Conversation
Signed-off-by: nightcityblade <nightcityblade@gmail.com>
Greptile SummaryThis PR fixes a bug where
Confidence Score: 5/5Safe to merge — the change is narrowly scoped, replaces a single well-understood call site, and both the default and custom-directory paths are verified by the updated tests. The fix correctly decouples extraction from the download directory by using the system temp dir when no explicit path is given. The new extract_tmp_dir parameter is opt-in and backward-compatible. Tests cover both the default call signature and the custom-dir path using wrapping mocks so real extraction is exercised as well. No regressions are visible in the surrounding logic. No files require special attention. Important Files Changed
Sequence DiagramsequenceDiagram
participant Caller
participant ArxivIterator
participant tempfile.TemporaryDirectory
participant tarfile
participant Filesystem
Caller->>ArxivIterator: iterate(file_path)
ArxivIterator->>ArxivIterator: "build tmpdir_kwargs (None → {}, path → {dir: path})"
ArxivIterator->>tempfile.TemporaryDirectory: "TemporaryDirectory(**tmpdir_kwargs)"
tempfile.TemporaryDirectory->>Filesystem: create temp dir (system tmp or extract_tmp_dir)
ArxivIterator->>tarfile: open(file_path)
ArxivIterator->>Filesystem: tar_safe_extract(tf, tmpdir)
loop for each extracted file
ArxivIterator->>ArxivIterator: _tex_proj_loader(item)
ArxivIterator-->>Caller: "yield {id, source_id, content}"
end
ArxivIterator->>Filesystem: cleanup temp dir (context exit)
Reviews (4): Last reviewed commit: "fix: allow custom arxiv extraction temp ..." | Re-trigger Greptile |
| ) as tmpdir_mock: | ||
| results = list(iterator.iterate(str(outer_tar_path))) | ||
|
|
||
| assert tmpdir_mock.call_args.kwargs == {} |
There was a problem hiding this comment.
The assertion
tmpdir_mock.call_args.kwargs == {} only checks that no keyword arguments were passed, but does not verify TemporaryDirectory was actually called. If iterate() were to exit early and never call TemporaryDirectory, call_args would be None and the test would raise an AttributeError instead of a clear assertion failure. assert_called_once_with() validates the call count, verifies no positional args, and no keyword args — all in one readable statement.
| assert tmpdir_mock.call_args.kwargs == {} | |
| tmpdir_mock.assert_called_once_with() |
There was a problem hiding this comment.
Thanks, agreed — updated the test to use tmpdir_mock.assert_called_once_with() so it verifies TemporaryDirectory was called exactly once with no args/kwargs.
Signed-off-by: nightcityblade <nightcityblade@gmail.com>
f0ad547 to
0031d2e
Compare
praateekmahajan
left a comment
There was a problem hiding this comment.
Thanks for the PR, requested a small change
| bname = os.path.split(file_path)[-1] | ||
|
|
||
| with tempfile.TemporaryDirectory(dir=download_dir) as tmpdir, tarfile.open(file_path) as tf: | ||
| with tempfile.TemporaryDirectory() as tmpdir, tarfile.open(file_path) as tf: |
There was a problem hiding this comment.
Hey, let's provide an argument so that user can decide to override where to untar. This is because many concurrent replicas of ArxivIterator might be untarring at the same time, and in some clusters /tmp has storage limits..
|
Thanks for the review — I updated Validation run:
I also tried the focused pytest file, but this local environment is missing full project deps ( |
Signed-off-by: nightcityblade <nightcityblade@gmail.com>
406a83e to
4065fd5
Compare
|
Thanks again for the review. The requested Validation run:
I also retried |
Description
Closes #2042.
ArxivIterator now extracts downloaded source archives into the system temporary directory instead of creating its extraction directory under the archive's download directory. This avoids failures when the download directory is readable but not writable.
Usage
N/A
Checklist
Checks run:
PYTHONPATH=. uv run --no-sync pytest --confcutdir=tests/stages/text/download/arxiv tests/stages/text/download/arxiv/test_iterator.py -q(blocked: missingcosmos_xennadependency in the local environment)python3 -m py_compile nemo_curator/stages/text/download/arxiv/iterator.py tests/stages/text/download/arxiv/test_iterator.pyuv run --no-sync --with ruff==0.14.10 ruff check nemo_curator/stages/text/download/arxiv/iterator.py tests/stages/text/download/arxiv/test_iterator.py