update nltk to 3.9.1 by RainbowRivey · Pull Request #211 · ArneBinder/pie-modules

RainbowRivey · 2025-08-25T18:15:08Z

Update nltk to 3.9.1
Use new method to load Punkt
Add deprecation warning for sentencizer_url parameter
Test using deprecated parameter
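A minimal sketch of how a deprecated sentencizer_url argument could be mapped onto the new language argument while emitting a deprecation warning. The helper name resolve_language and the regular expression are hypothetical and only illustrate the idea; this is not the code merged in this PR. It assumes old-style resource paths such as tokenizers/punkt/english.pickle.

```python
import re
import warnings
from typing import Optional

# Hypothetical pattern: extracts the language name from resource paths such as
# "tokenizers/punkt/english.pickle" or "tokenizers/punkt_tab/english/".
_PUNKT_URL = re.compile(
    r"tokenizers/punkt(?:_tab)?/(?:PY3/)?(?P<lang>[^/.]+)(?:\.pickle|/)?$"
)


def resolve_language(
    language: str = "english", sentencizer_url: Optional[str] = None
) -> str:
    """Return the language to use, honoring a deprecated sentencizer_url if given."""
    if sentencizer_url is None:
        return language

    warnings.warn(
        "'sentencizer_url' is deprecated, use 'language' instead.",
        DeprecationWarning,
        stacklevel=2,
    )
    match = _PUNKT_URL.search(sentencizer_url)
    if match is None:
        raise ValueError(f"Cannot derive a language from {sentencizer_url!r}")
    return match.group("lang")
```

Deriving the language from the old path keeps existing configurations working, which is one way to keep such a change non-breaking.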

Reasons:

nltk introduced a breaking change to address pickle vulnerabilities, see "[BUG] punkt_tab breaking change" (nltk/nltk#3293).
Our current constraint (nltk = ^3.8.1) allows nltk==3.9.1, but NltkSentenceSplitter does not work with that version.
Found while locally running the slow tests on https://github.com/ArneBinder/pytorch-ie-hydra-template-1; the error log is below.
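To illustrate why the existing constraint already admits the broken version, here is a small sketch of caret matching, assuming Poetry-style semantics for three-part versions (^3.8.1 means >=3.8.1,<4.0.0). The function names are hypothetical and the parser only handles plain MAJOR.MINOR.PATCH strings.

```python
from typing import Tuple

Version = Tuple[int, int, int]


def parse(version: str) -> Version:
    """Parse a plain 'MAJOR.MINOR.PATCH' string into a tuple of ints."""
    major, minor, patch = (int(part) for part in version.split("."))
    return major, minor, patch


def caret_upper_bound(base: Version) -> Version:
    """Exclusive upper bound: bump the left-most non-zero component."""
    major, minor, patch = base
    if major > 0:
        return (major + 1, 0, 0)
    if minor > 0:
        return (0, minor + 1, 0)
    return (0, 0, patch + 1)


def satisfies_caret(version: str, base: str) -> bool:
    """Check whether `version` satisfies the caret constraint `^base`."""
    lower = parse(base)
    return lower <= parse(version) < caret_upper_bound(lower)


# 3.9.1 is inside ^3.8.1, so a fresh dependency resolution may pick it.
print(satisfies_caret("3.9.1", "3.8.1"))
```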

The first commit only bumps the version, so that the tests fail.

TODO:

Decide what to do with the 'sentencizer_url' parameter: it is no longer needed, and there is now a 'language' parameter instead.

Error log

  ___________________________________________________________________ test_experiment[drugprot] ____________________________________________________________________

_target_ = <class 'pie_modules.document.processing.sentence_splitter.NltkSentenceSplitter'>, _partial_ = False, args = (), kwargs = {}
full_key = 'dataset.add_sentences.function'

   def _call_target(
       _target_: Callable[..., Any],
       _partial_: bool,
       args: Tuple[Any, ...],
       kwargs: Dict[str, Any],
       full_key: str,
   ) -> Any:
       """Call target (type) with args and kwargs."""
       try:
           args, kwargs = _extract_pos_args(args, kwargs)
           # detaching configs from parent.
           # At this time, everything is resolved and the parent link can cause
           # issues when serializing objects in some scenarios.
           for arg in args:
               if OmegaConf.is_config(arg):
                   arg._set_parent(None)
           for v in kwargs.values():
               if OmegaConf.is_config(v):
                   v._set_parent(None)
       except Exception as e:
           msg = (
               f"Error in collecting args and kwargs for '{_convert_target_to_string(_target_)}':"
               + f"\n{repr(e)}"
           )
           if full_key:
               msg += f"\nfull_key: {full_key}"
   
           raise InstantiationException(msg) from e
   
       if _partial_:
           try:
               return functools.partial(_target_, *args, **kwargs)
           except Exception as e:
               msg = (
                   f"Error in creating partial({_convert_target_to_string(_target_)}, ...) object:"
                   + f"\n{repr(e)}"
               )
               if full_key:
                   msg += f"\nfull_key: {full_key}"
               raise InstantiationException(msg) from e
       else:
           try:
>               return _target_(*args, **kwargs)

.venv/lib/python3.9/site-packages/hydra/_internal/instantiate/_instantiate2.py:92: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
.venv/lib/python3.9/site-packages/pie_modules/document/processing/sentence_splitter.py:44: in __init__
   self.sentencizer = nltk.data.load(sentencizer_url)
.venv/lib/python3.9/site-packages/nltk/data.py:823: in load
   return switch_punkt(fil)
.venv/lib/python3.9/site-packages/nltk/data.py:678: in switch_punkt
   return tok(lang)
.venv/lib/python3.9/site-packages/nltk/tokenize/punkt.py:1744: in __init__
   self.load_lang(lang)
.venv/lib/python3.9/site-packages/nltk/tokenize/punkt.py:1749: in load_lang
   lang_dir = find(f"tokenizers/punkt_tab/{lang}/")
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

resource_name = 'tokenizers/punkt_tab/english/'
paths = ['/home/user/nltk_data', '/home/user/workspace/pie-template/.venv/nltk_data', '/home/user/workspace/pie-template/.v...', '/home/user/workspace/pie-template/.venv/lib/nltk_data', '/usr/share/nltk_data', '/usr/local/share/nltk_data', ...]

   def find(resource_name, paths=None):
       """
       Find the given resource by searching through the directories and
       zip files in paths, where a None or empty string specifies an absolute path.
       Returns a corresponding path name.  If the given resource is not
       found, raise a ``LookupError``, whose message gives a pointer to
       the installation instructions for the NLTK downloader.
   
       Zip File Handling:
   
         - If ``resource_name`` contains a component with a ``.zip``
           extension, then it is assumed to be a zipfile; and the
           remaining path components are used to look inside the zipfile.
   
         - If any element of ``nltk.data.path`` has a ``.zip`` extension,
           then it is assumed to be a zipfile.
   
         - If a given resource name that does not contain any zipfile
           component is not found initially, then ``find()`` will make a
           second attempt to find that resource, by replacing each
           component *p* in the path with *p.zip/p*.  For example, this
           allows ``find()`` to map the resource name
           ``corpora/chat80/cities.pl`` to a zip file path pointer to
           ``corpora/chat80.zip/chat80/cities.pl``.
   
         - When using ``find()`` to locate a directory contained in a
           zipfile, the resource name must end with the forward slash
           character.  Otherwise, ``find()`` will not locate the
           directory.
   
       :type resource_name: str or unicode
       :param resource_name: The name of the resource to search for.
           Resource names are posix-style relative path names, such as
           ``corpora/brown``.  Directory names will be
           automatically converted to a platform-appropriate path separator.
       :rtype: str
       """
       resource_name = normalize_resource_name(resource_name, True)
   
       # Resolve default paths at runtime in-case the user overrides
       # nltk.data.path
       if paths is None:
           paths = path
   
       # Check if the resource name includes a zipfile name
       m = re.match(r"(.*\.zip)/?(.*)$|", resource_name)
       zipfile, zipentry = m.groups()
   
       # Check each item in our path
       for path_ in paths:
           # Is the path item a zipfile?
           if path_ and (os.path.isfile(path_) and path_.endswith(".zip")):
               try:
                   return ZipFilePathPointer(path_, resource_name)
               except OSError:
                   # resource not in zipfile
                   continue
   
           # Is the path item a directory or is resource_name an absolute path?
           elif not path_ or os.path.isdir(path_):
               if zipfile is None:
                   p = os.path.join(path_, url2pathname(resource_name))
                   if os.path.exists(p):
                       if p.endswith(".gz"):
                           return GzipFileSystemPathPointer(p)
                       else:
                           return FileSystemPathPointer(p)
               else:
                   p = os.path.join(path_, url2pathname(zipfile))
                   if os.path.exists(p):
                       try:
                           return ZipFilePathPointer(p, zipentry)
                       except OSError:
                           # resource not in zipfile
                           continue
   
       # Fallback: if the path doesn't include a zip file, then try
       # again, assuming that one of the path components is inside a
       # zipfile of the same name.
       if zipfile is None:
           pieces = resource_name.split("/")
           for i in range(len(pieces)):
               modified_name = "/".join(pieces[:i] + [pieces[i] + ".zip"] + pieces[i:])
               try:
                   return find(modified_name, paths)
               except LookupError:
                   pass
   
       # Identify the package (i.e. the .zip file) to download.
       resource_zipname = resource_name.split("/")[1]
       if resource_zipname.endswith(".zip"):
           resource_zipname = resource_zipname.rpartition(".")[0]
       # Display a friendly error message if the resource wasn't found:
       msg = str(
           "Resource \33[93m{resource}\033[0m not found.\n"
           "Please use the NLTK Downloader to obtain the resource:\n\n"
           "\33[31m"  # To display red text in terminal.
           ">>> import nltk\n"
           ">>> nltk.download('{resource}')\n"
           "\033[0m"
       ).format(resource=resource_zipname)
       msg = textwrap_indent(msg)
   
       msg += "\n  For more information see: https://www.nltk.org/data.html\n"
   
       msg += "\n  Attempted to load \33[93m{resource_name}\033[0m\n".format(
           resource_name=resource_name
       )
   
       msg += "\n  Searched in:" + "".join("\n    - %r" % d for d in paths)
       sep = "*" * 70
       resource_not_found = f"\n{sep}\n{msg}\n{sep}\n"
>       raise LookupError(resource_not_found)
E       LookupError: 
E       **********************************************************************
E         Resource punkt_tab not found.
E         Please use the NLTK Downloader to obtain the resource:
E       
E         >>> import nltk
E         >>> nltk.download('punkt_tab')
E         
E         For more information see: https://www.nltk.org/data.html
E       
E         Attempted to load tokenizers/punkt_tab/english/
E       
E         Searched in:
E           - '/home/user/nltk_data'
E           - '/home/user/workspace/pie-template/.venv/nltk_data'
E           - '/home/user/workspace/pie-template/.venv/share/nltk_data'
E           - '/home/user/workspace/pie-template/.venv/lib/nltk_data'
E           - '/usr/share/nltk_data'
E           - '/usr/local/share/nltk_data'
E           - '/usr/lib/nltk_data'
E           - '/usr/local/lib/nltk_data'
E       **********************************************************************
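The LookupError above can be avoided by making sure the punkt_tab resource is present before the splitter is constructed. The following is a sketch only: ensure_resource is a hypothetical helper, and the lookup and download functions are passed in so the logic can be exercised without NLTK installed. With NLTK, the intended arguments would be nltk.data.find and nltk.download, both of which appear in the log above.

```python
from typing import Any, Callable


def ensure_resource(
    resource_path: str,
    package: str,
    find: Callable[[str], Any],
    download: Callable[[str], Any],
) -> bool:
    """Return True if a download was triggered, False if the resource was present."""
    try:
        find(resource_path)
        return False
    except LookupError:
        # Resource is missing: fetch the package, then look it up again so
        # that a failed download still surfaces as a LookupError.
        download(package)
        find(resource_path)
        return True


# Intended usage with NLTK (not executed here; needs nltk and network access):
#   import nltk
#   ensure_resource(
#       "tokenizers/punkt_tab/english/", "punkt_tab", nltk.data.find, nltk.download
#   )
```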

codecov-commenter · 2025-08-25T18:21:45Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 95.76%. Comparing base (d105e61) to head (05abcad).
⚠️ Report is 8 commits behind head on main.

Additional details and impacted files

@@           Coverage Diff           @@
##             main     #211   +/-   ##
=======================================
  Coverage   95.76%   95.76%           
=======================================
  Files          68       68           
  Lines        6156     6161    +5     
=======================================
+ Hits         5895     5900    +5     
  Misses        261      261

ArneBinder

The issue below needs to be addressed so that this PR can stay non-breaking.

ArneBinder

lgtm

RainbowRivey added 2 commits August 25, 2025 19:58

update nltk to 3.9.1

ba1ecd9

use new method to initialize sentence splitter

18f1756

RainbowRivey added the bug (Something isn't working) and dependencies (Pull requests that update a dependency file) labels Aug 25, 2025

RainbowRivey mentioned this pull request Aug 26, 2025

Use poetry (closes #210) ArneBinder/pytorch-ie-hydra-template-1#211

Merged

3 tasks

RainbowRivey requested a review from ArneBinder August 26, 2025 12:24

ArneBinder requested changes Aug 26, 2025

Comment thread src/pie_modules/document/processing/sentence_splitter.py Outdated

add deprecation warning and handle old parameter

05abcad

RainbowRivey requested a review from ArneBinder September 1, 2025 15:39

ArneBinder approved these changes Sep 3, 2025

ArneBinder merged commit c0d02ed into main Sep 3, 2025
2 checks passed

ArneBinder deleted the use_nltk_punkt_tab branch September 3, 2025 16:06
