sentence span is wrong if there are sentences containing only space tokens #42

@jwijffels

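The sentence span is wrong if there are sentences containing only space tokens. In the example below, the second reported span (23 to 70) does not line up with the sentence in the raw text: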

>>> import spacy
>>> import spacy_udpipe
>>> spacy_udpipe.download("nl")
Already downloaded a model for the 'nl' language
>>> nlp = spacy_udpipe.load("nl")
>>>
>>> def line_splitter(x):
...     text = str(x)
...     text = text.split(sep = "\n")
...     text = [sent + "\n" for sent in text]
...     return text
...
>>> text_raw = "We gingen naar Brussel \n\n \nen kochten op 13/12/2021 veel eten. Jullie ook?"
>>> text = line_splitter(text_raw)
>>> text
['We gingen naar Brussel \n', '\n', ' \n', 'en kochten op 13/12/2021 veel eten. Jullie ook?\n']
>>> doc = nlp(text)
>>> for sent_i, sent in enumerate(doc.sents):
...     print(sent.start_char, sent.end_char)
...
0 22
23 70
>>> text_raw[0:(22+1)]
'We gingen naar Brussel '
>>> text_raw[23:(70+1)]
'\n\n \nen kochten op 13/12/2021 veel eten. Jullie o'
>>>
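
For comparison, here is a minimal standard-library sketch of the offsets one would expect if the spans indexed into text_raw. It does not call spacy_udpipe. The helper name line_spans is hypothetical, and it assumes that end offsets are exclusive (as in Python slicing) and that each pretokenized line is separated by exactly one "\n".

```python
def line_spans(text_raw):
    """Return (start, end) character offsets of each newline-separated line.

    Hypothetical helper for illustration: `end` is exclusive, so
    text_raw[start:end] gives the line without its trailing newline.
    """
    spans = []
    offset = 0
    for line in text_raw.split("\n"):
        spans.append((offset, offset + len(line)))
        # Skip past the line and the single "\n" that split() consumed.
        offset += len(line) + 1
    return spans


text_raw = "We gingen naar Brussel \n\n \nen kochten op 13/12/2021 veel eten. Jullie ook?"

for start, end in line_spans(text_raw):
    # Whitespace-only lines are flagged, since they are the suspected trigger.
    kind = "whitespace-only" if not text_raw[start:end].strip() else "text"
    print(start, end, kind, repr(text_raw[start:end]))
```

Under these assumptions the last line covers 27 to 74, while the reported span is 23 to 70. The length is the same (47), but the start is shifted by 4 characters, which is exactly the number of characters in the whitespace-only part ("\n\n \n"). This suggests, but does not prove, that whitespace-only sentences are not counted when the character offsets are computed.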
