Hello! There may be a problem with token.idx for Portuguese.
I think the cause is multiword tokens. In Portuguese, "da" (of the) is a contraction of "de" + "a".
I saw #17, which seems to be the same problem, but it doesn't seem to work for Portuguese.
This works (token.text and text slice are the same):
import spacy_udpipe

nlp = spacy_udpipe.load("en")
text = "The language of peace can be a culture."
doc = nlp(text)
for token in doc:
    # Compare the token text with the slice of the original text at token.idx
    print(token.text, text[token.idx:token.idx + len(token.text)])
The The
language language
of of
peace peace
can can
be be
a a
culture culture
. .
This doesn't work (token.text and the text slice differ after the multiword token):
nlp = spacy_udpipe.load("pt")
text = "A linguagem da paz pode ser uma cultura."
doc = nlp(text)
for token in doc:
    # After "da" is expanded to "de" + "a", the slices drift away from token.text
    print(token.text, text[token.idx:token.idx + len(token.text)])
A A
linguagem linguagem
de da
a p
paz z p
pode de s
ser r u
uma a c
cultura ltura.
.
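The drift looks consistent with token.idx being counted against the expanded words ("de a") rather than the original text ("da"). Here is a minimal sketch that reproduces the slices above without spacy_udpipe. The `words` and `spaces` lists and the `offsets_from_words` helper are my own assumptions for illustration, not something taken from the library:

```python
text = "A linguagem da paz pode ser uma cultura."

# Assumed token texts after "da" is expanded into "de" + "a".
words = ["A", "linguagem", "de", "a", "paz", "pode", "ser", "uma", "cultura", "."]
# Assumed: every word is followed by a space except "cultura" and ".".
spaces = [True] * 8 + [False, False]


def offsets_from_words(words, spaces):
    """Return start offsets as if the words were joined with their spaces."""
    offsets, pos = [], 0
    for word, space in zip(words, spaces):
        offsets.append(pos)
        pos += len(word) + (1 if space else 0)
    return offsets


idxs = offsets_from_words(words, spaces)
# Slice the ORIGINAL text with offsets that belong to the expanded text.
slices = [text[i:i + len(w)] for w, i in zip(words, idxs)]
for word, piece in zip(words, slices):
    print(word, piece)
```

If this assumption is right, every token after "da" is shifted by the extra characters that "de a" adds over "da".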
Any ideas on how to work around this?
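One possible workaround, sketched here as a heuristic rather than a fix in the library, is to ignore token.idx and realign the token texts to the original string. The helpers `align_to_surface`, `_matches_at` and `_chunk_end` are hypothetical names of mine. The sketch assumes that a contraction is a single whitespace-delimited chunk and that the word after it appears verbatim in the text; words that cannot be found share the span of that chunk.

```python
def _matches_at(text, word, pos):
    """True if word occurs at pos and does not stop in the middle of a word."""
    if not word or not text.startswith(word, pos):
        return False
    end = pos + len(word)
    return not (word[-1].isalnum() and end < len(text) and text[end].isalnum())


def _chunk_end(text, pos):
    """Return the index where the whitespace-delimited chunk at pos ends."""
    while pos < len(text) and not text[pos].isspace():
        pos += 1
    return pos


def align_to_surface(text, words):
    """Map each word to a (start, end) span in the original text."""
    spans = [None] * len(words)
    pos = 0        # cursor in the original text
    pending = []   # indices of words with no verbatim match so far
    for i, word in enumerate(words):
        while pos < len(text) and text[pos].isspace():
            pos += 1
        if not pending:
            if _matches_at(text, word, pos):
                spans[i] = (pos, pos + len(word))
                pos += len(word)
            else:
                pending.append(i)
            continue
        # Pending words share the current chunk; look for this word after it.
        end = _chunk_end(text, pos)
        nxt = end
        while nxt < len(text) and text[nxt].isspace():
            nxt += 1
        if _matches_at(text, word, nxt):
            for j in pending:
                spans[j] = (pos, end)
            pending = []
            spans[i] = (nxt, nxt + len(word))
            pos = nxt + len(word)
        else:
            pending.append(i)
    if pending:  # unmatched words at the end share the last chunk
        end = _chunk_end(text, pos)
        for j in pending:
            spans[j] = (pos, end)
    return spans


text = "A linguagem da paz pode ser uma cultura."
words = ["A", "linguagem", "de", "a", "paz", "pode", "ser", "uma", "cultura", "."]
spans = align_to_surface(text, words)
for word, (start, end) in zip(words, spans):
    print(word, text[start:end])
```

With a real pipeline, `words` would be `[t.text for t in doc]`. Both "de" and "a" map to the span of "da" here. The heuristic is not expected to cope with every case, for example a contraction directly followed by punctuation without a space.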