Skip to content

fix: replace urlretrieve in getRemoteFile#444

Merged
chrismattmann merged 2 commits intochrismattmann:masterfrom
afuetterer:442-tests
Mar 4, 2026
Merged

fix: replace urlretrieve in getRemoteFile#444
chrismattmann merged 2 commits intochrismattmann:masterfrom
afuetterer:442-tests

Conversation

@afuetterer
Copy link
Contributor

@afuetterer afuetterer commented Feb 24, 2026

Failing unit tests

  • see test: remove failing unit test #447
  • FAILED tika/tests/test_tika.py::CreateTest::test_remote_jpg - urllib.error.HTTPError: HTTP Error 403: Forbidden
  • FAILED tika/tests/test_tika.py::CreateTest::test_remote_pdf - urllib.error.HTTPError: HTTP Error 502: Bad Gateway

Tasks

  • fix broken url to pdf file
  • decide how to deal with urlretrieve
  • rebase
  • remove unit test 1 commit
  • remove deps commit

@afuetterer

This comment was marked as outdated.

@afuetterer afuetterer mentioned this pull request Feb 24, 2026
13 tasks
@chrismattmann
Copy link
Owner

As per your comment in #442:

  1. Let's remove the test_ssl_link.py test. Not sure what it's testing either and it has become outdated regardless.
  2. Header issue. Please update.
  3. Any old PDF file should do....
  1. The header issue is not in the test, but in the library:
    urlretrieve(urlOrPath, destPath)
  2. Same here I think. The broken pdf url is just step 1, I think if there is a valid url, it will still return 403

Got it, so we have to fix urlRetrieve then? Perhaps we should add a unit test for it too.

@afuetterer
Copy link
Contributor Author

As I said above, replacing the url to a remote pdf, does help a bit.
Now we got from
FAILED tika/tests/test_tika.py::CreateTest::test_remote_pdf - urllib.error.HTTPError: HTTP Error 502: Bad Gateway
to
FAILED tika/tests/test_tika.py::CreateTest::test_remote_pdf - urllib.error.HTTPError: HTTP Error 403: Forbidden

The error is in getRemoteFile.

It is a common issue: https://www.google.com/search?q=python+urlretrieve+403+forbidden

Maybe use custom headers here or switch to using requests.

@afuetterer afuetterer marked this pull request as draft February 24, 2026 19:42
@afuetterer

This comment was marked as outdated.

@afuetterer

This comment was marked as resolved.

@afuetterer afuetterer changed the title Fix failing unit tests so CI can pass fix: replace urlretrieve in getRemoteFile Feb 26, 2026
@afuetterer afuetterer changed the title fix: replace urlretrieve in getRemoteFile fix: replace urlretrieve in getRemoteFile Feb 26, 2026
@afuetterer
Copy link
Contributor Author

I suggest splitting the issues discovered during the work on this PR into smaller problems, that can be reviewed individually and merged faster.

The "main issue" discovered in this PR is the broken getRemoteFile. The problem is in tika-python's library, not the unit tests. Agreed?

How to proceed from here?

Write and use a new helper function? Something like _urlretrieve() It should use requests internally and send a user-agent of "tika-python"? Streaming response, due to potentially large files?

What do you think?

@chrismattmann
Copy link
Owner

I suggest splitting the issues discovered during the work on this PR into smaller problems, that can be reviewed individually and merged faster.

The "main issue" discovered in this PR is the broken getRemoteFile. The problem is in tika-python's library, not the unit tests. Agreed?

How to proceed from here?

Write and use a new helper function? Something like _urlretrieve() It should use requests internally and send a user-agent of "tika-python"? Streaming response, due to potentially large files?

What do you think?

Yes, perfect let's write a new _urlRetrieve that:

  1. uses requests internally
  2. uses tika-python as the User-Agent
  3. allows for streaming, etc.

@afuetterer
Copy link
Contributor Author

Please have a look at #446 and #447 in the meantime. These changes have been extracted from this PR.

@chrismattmann
Copy link
Owner

Merged #446

@chrismattmann
Copy link
Owner

Merged #447

@afuetterer afuetterer marked this pull request as ready for review March 2, 2026 09:12
@afuetterer afuetterer requested a review from chrismattmann March 2, 2026 09:12
@afuetterer
Copy link
Contributor Author

Please check out the proposed helper function. Tests are passing.

for chunk in response.iter_content(chunk_size=chunk_size):
if chunk: # Filter out keep-alive chunks
f.write(chunk)
bytes_downloaded += len(chunk)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just realized, that this can be removed. I had a progress bar in mind. But without that, no need for keeping track of the downloaded bytes.

@chrismattmann
Copy link
Owner

LGTM!!

@chrismattmann chrismattmann merged commit ce4c3ca into chrismattmann:master Mar 4, 2026
2 checks passed
@afuetterer afuetterer deleted the 442-tests branch March 4, 2026 05:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants