
update nltk to 3.9.1 #211

Merged
ArneBinder merged 3 commits into main from use_nltk_punkt_tab on Sep 3, 2025

@RainbowRivey (Collaborator) commented Aug 25, 2025

  • Update nltk to 3.9.1
  • Use new method to load Punkt
  • Add deprecation warning for sentencizer_url parameter
  • Test using deprecated parameter
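
A minimal sketch of how the deprecated parameter could be mapped onto the new one. This is not the code merged in this PR: the helper name `resolve_punkt_language` and the idea of deriving the language from the file name of a legacy path such as `tokenizers/punkt/PY3/english.pickle` are assumptions made for illustration.

```python
import warnings
from pathlib import PurePosixPath
from typing import Optional


def resolve_punkt_language(
    language: str = "english", sentencizer_url: Optional[str] = None
) -> str:
    """Return the Punkt language name, honoring a deprecated sentencizer_url.

    Hypothetical helper: the real NltkSentenceSplitter may handle this differently.
    """
    if sentencizer_url is not None:
        # Warn callers that still pass the old parameter.
        warnings.warn(
            "'sentencizer_url' is deprecated; use 'language' instead.",
            DeprecationWarning,
            stacklevel=2,
        )
        # Assumption: legacy resource paths end in "<language>.pickle",
        # so the file stem is taken as the language name.
        language = PurePosixPath(sentencizer_url).stem
    return language
```

The returned name could then be handed to whatever loads the `punkt_tab` data for that language. A test for the deprecated path can record warnings with `warnings.catch_warnings(record=True)` and check that a `DeprecationWarning` was emitted.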

Reasons:

The first commit only bumps the version, so that the tests fail.

TODO:

  • Decide what to do with the 'sentencizer_url' parameter: it is no longer needed, because there is now a 'language' parameter instead.
Error log
  ___________________________________________________________________ test_experiment[drugprot] ____________________________________________________________________

_target_ = <class 'pie_modules.document.processing.sentence_splitter.NltkSentenceSplitter'>, _partial_ = False, args = (), kwargs = {}
full_key = 'dataset.add_sentences.function'

   def _call_target(
       _target_: Callable[..., Any],
       _partial_: bool,
       args: Tuple[Any, ...],
       kwargs: Dict[str, Any],
       full_key: str,
   ) -> Any:
       """Call target (type) with args and kwargs."""
       try:
           args, kwargs = _extract_pos_args(args, kwargs)
           # detaching configs from parent.
           # At this time, everything is resolved and the parent link can cause
           # issues when serializing objects in some scenarios.
           for arg in args:
               if OmegaConf.is_config(arg):
                   arg._set_parent(None)
           for v in kwargs.values():
               if OmegaConf.is_config(v):
                   v._set_parent(None)
       except Exception as e:
           msg = (
               f"Error in collecting args and kwargs for '{_convert_target_to_string(_target_)}':"
               + f"\n{repr(e)}"
           )
           if full_key:
               msg += f"\nfull_key: {full_key}"
   
           raise InstantiationException(msg) from e
   
       if _partial_:
           try:
               return functools.partial(_target_, *args, **kwargs)
           except Exception as e:
               msg = (
                   f"Error in creating partial({_convert_target_to_string(_target_)}, ...) object:"
                   + f"\n{repr(e)}"
               )
               if full_key:
                   msg += f"\nfull_key: {full_key}"
               raise InstantiationException(msg) from e
       else:
           try:
>               return _target_(*args, **kwargs)

.venv/lib/python3.9/site-packages/hydra/_internal/instantiate/_instantiate2.py:92: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
.venv/lib/python3.9/site-packages/pie_modules/document/processing/sentence_splitter.py:44: in __init__
   self.sentencizer = nltk.data.load(sentencizer_url)
.venv/lib/python3.9/site-packages/nltk/data.py:823: in load
   return switch_punkt(fil)
.venv/lib/python3.9/site-packages/nltk/data.py:678: in switch_punkt
   return tok(lang)
.venv/lib/python3.9/site-packages/nltk/tokenize/punkt.py:1744: in __init__
   self.load_lang(lang)
.venv/lib/python3.9/site-packages/nltk/tokenize/punkt.py:1749: in load_lang
   lang_dir = find(f"tokenizers/punkt_tab/{lang}/")
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

resource_name = 'tokenizers/punkt_tab/english/'
paths = ['/home/user/nltk_data', '/home/user/workspace/pie-template/.venv/nltk_data', '/home/user/workspace/pie-template/.v...', '/home/user/workspace/pie-template/.venv/lib/nltk_data', '/usr/share/nltk_data', '/usr/local/share/nltk_data', ...]

   def find(resource_name, paths=None):
       """
       Find the given resource by searching through the directories and
       zip files in paths, where a None or empty string specifies an absolute path.
       Returns a corresponding path name.  If the given resource is not
       found, raise a ``LookupError``, whose message gives a pointer to
       the installation instructions for the NLTK downloader.
   
       Zip File Handling:
   
         - If ``resource_name`` contains a component with a ``.zip``
           extension, then it is assumed to be a zipfile; and the
           remaining path components are used to look inside the zipfile.
   
         - If any element of ``nltk.data.path`` has a ``.zip`` extension,
           then it is assumed to be a zipfile.
   
         - If a given resource name that does not contain any zipfile
           component is not found initially, then ``find()`` will make a
           second attempt to find that resource, by replacing each
           component *p* in the path with *p.zip/p*.  For example, this
           allows ``find()`` to map the resource name
           ``corpora/chat80/cities.pl`` to a zip file path pointer to
           ``corpora/chat80.zip/chat80/cities.pl``.
   
         - When using ``find()`` to locate a directory contained in a
           zipfile, the resource name must end with the forward slash
           character.  Otherwise, ``find()`` will not locate the
           directory.
   
       :type resource_name: str or unicode
       :param resource_name: The name of the resource to search for.
           Resource names are posix-style relative path names, such as
           ``corpora/brown``.  Directory names will be
           automatically converted to a platform-appropriate path separator.
       :rtype: str
       """
       resource_name = normalize_resource_name(resource_name, True)
   
       # Resolve default paths at runtime in-case the user overrides
       # nltk.data.path
       if paths is None:
           paths = path
   
       # Check if the resource name includes a zipfile name
       m = re.match(r"(.*\.zip)/?(.*)$|", resource_name)
       zipfile, zipentry = m.groups()
   
       # Check each item in our path
       for path_ in paths:
           # Is the path item a zipfile?
           if path_ and (os.path.isfile(path_) and path_.endswith(".zip")):
               try:
                   return ZipFilePathPointer(path_, resource_name)
               except OSError:
                   # resource not in zipfile
                   continue
   
           # Is the path item a directory or is resource_name an absolute path?
           elif not path_ or os.path.isdir(path_):
               if zipfile is None:
                   p = os.path.join(path_, url2pathname(resource_name))
                   if os.path.exists(p):
                       if p.endswith(".gz"):
                           return GzipFileSystemPathPointer(p)
                       else:
                           return FileSystemPathPointer(p)
               else:
                   p = os.path.join(path_, url2pathname(zipfile))
                   if os.path.exists(p):
                       try:
                           return ZipFilePathPointer(p, zipentry)
                       except OSError:
                           # resource not in zipfile
                           continue
   
       # Fallback: if the path doesn't include a zip file, then try
       # again, assuming that one of the path components is inside a
       # zipfile of the same name.
       if zipfile is None:
           pieces = resource_name.split("/")
           for i in range(len(pieces)):
               modified_name = "/".join(pieces[:i] + [pieces[i] + ".zip"] + pieces[i:])
               try:
                   return find(modified_name, paths)
               except LookupError:
                   pass
   
       # Identify the package (i.e. the .zip file) to download.
       resource_zipname = resource_name.split("/")[1]
       if resource_zipname.endswith(".zip"):
           resource_zipname = resource_zipname.rpartition(".")[0]
       # Display a friendly error message if the resource wasn't found:
       msg = str(
           "Resource \33[93m{resource}\033[0m not found.\n"
           "Please use the NLTK Downloader to obtain the resource:\n\n"
           "\33[31m"  # To display red text in terminal.
           ">>> import nltk\n"
           ">>> nltk.download('{resource}')\n"
           "\033[0m"
       ).format(resource=resource_zipname)
       msg = textwrap_indent(msg)
   
       msg += "\n  For more information see: https://www.nltk.org/data.html\n"
   
       msg += "\n  Attempted to load \33[93m{resource_name}\033[0m\n".format(
           resource_name=resource_name
       )
   
       msg += "\n  Searched in:" + "".join("\n    - %r" % d for d in paths)
       sep = "*" * 70
       resource_not_found = f"\n{sep}\n{msg}\n{sep}\n"
>       raise LookupError(resource_not_found)
E       LookupError: 
E       **********************************************************************
E         Resource punkt_tab not found.
E         Please use the NLTK Downloader to obtain the resource:
E       
E         >>> import nltk
E         >>> nltk.download('punkt_tab')
E         
E         For more information see: https://www.nltk.org/data.html
E       
E         Attempted to load tokenizers/punkt_tab/english/
E       
E         Searched in:
E           - '/home/user/nltk_data'
E           - '/home/user/workspace/pie-template/.venv/nltk_data'
E           - '/home/user/workspace/pie-template/.venv/share/nltk_data'
E           - '/home/user/workspace/pie-template/.venv/lib/nltk_data'
E           - '/usr/share/nltk_data'
E           - '/usr/local/share/nltk_data'
E           - '/usr/lib/nltk_data'
E           - '/usr/local/lib/nltk_data'
E       **********************************************************************
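
The `LookupError` above means the `punkt_tab` data is missing from every searched directory. A hedged sketch of a guard that downloads it only when needed follows. The helper `ensure_resource` is hypothetical and takes the lookup and download functions as arguments so that it can be exercised without NLTK or network access.

```python
from typing import Callable


def ensure_resource(
    find: Callable[[str], object],
    download: Callable[[str], object],
    resource_path: str = "tokenizers/punkt_tab/english/",
    package: str = "punkt_tab",
) -> bool:
    """Download `package` only if `resource_path` cannot be found.

    Returns True when a download was triggered, False otherwise.
    """
    try:
        find(resource_path)  # raises LookupError when the resource is absent
        return False
    except LookupError:
        download(package)
        return True
```

With NLTK installed, this could be called as `ensure_resource(nltk.data.find, nltk.download)` before the sentence splitter is constructed, for example in a test fixture. Both functions appear in the traceback and error message above.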

codecov-commenter commented Aug 25, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 95.76%. Comparing base (d105e61) to head (05abcad).
⚠️ Report is 8 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main     #211   +/-   ##
=======================================
  Coverage   95.76%   95.76%           
=======================================
  Files          68       68           
  Lines        6156     6161    +5     
=======================================
+ Hits         5895     5900    +5     
  Misses        261      261           

@RainbowRivey added the bug and dependencies labels Aug 25, 2025

@ArneBinder (Owner) left a comment

The issue below needs to be addressed so that this PR can stay non-breaking.

Comment thread src/pie_modules/document/processing/sentence_splitter.py Outdated

@ArneBinder (Owner) left a comment

lgtm

@ArneBinder merged commit c0d02ed into main on Sep 3, 2025
2 checks passed
@ArneBinder deleted the use_nltk_punkt_tab branch on September 3, 2025 16:06