Text indexation on Portuguese #32

@albcunha

Hello! Maybe something is not working correctly with token.idx on Portuguese.

I think the cause is multiword tokens. In Portuguese, "da" ("of the") is a contraction of "de + a".

I saw #17, which seems to be the same problem, but it seems the fix won't work for Portuguese.
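In the CoNLL-U output that spacy_udpipe builds on, such a contraction is represented as a multiword token: one surface form ("da") spanning two syntactic words ("de" and "a"). An illustrative, abbreviated fragment (columns trimmed) for the sentence in the second example below:

```
3-4	da	_	_
3	de	de	ADP
4	a	o	DET
```

The syntactic words "de" and "a" never appear verbatim in the text, so no character offset can point at them directly.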

This works (token.text and text slice are the same):

nlp = spacy_udpipe.load("en")
text = "The language of peace can be a culture."
doc = nlp(text)
for token in doc:
    print(token.text, text[token.idx:token.idx + len(token.text)])

The The
language language
of of
peace peace
can can
be be
a a
culture culture
. .

This won't work (token.text and the text slice are not the same after the multiword token):

nlp = spacy_udpipe.load("pt")
text = "A linguagem da paz pode ser uma cultura."
doc = nlp(text)
for token in doc:
    print(token.text, text[token.idx:token.idx + len(token.text)])

A A
linguagem linguagem
de da
a p
paz z p

pode de s
ser r u
uma a c
cultura ltura.
.

Any ideas on how to circumvent this?
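One possible workaround, just a sketch rather than a fix in spacy_udpipe itself (`align_offsets` below is a hypothetical helper, not part of the library), is to recompute character offsets by scanning the original text, and to report None for syntactic words that have no verbatim surface form:

```python
import re

def align_offsets(text, tokens):
    """Best-effort mapping of token strings back to character offsets.

    Syntactic words produced by splitting a multiword token (e.g. the
    "de" / "a" parts of the Portuguese contraction "da") have no verbatim
    surface form, so they get offset None instead of a wrong index.
    """
    offsets = []
    cursor = 0
    for tok in tokens:
        if tok.isalnum():
            # Whole-word match, so "de" is not found inside "pode".
            match = re.search(r"\b%s\b" % re.escape(tok), text[cursor:])
        else:
            # Punctuation: a plain literal search is enough.
            match = re.search(re.escape(tok), text[cursor:])
        if match is None:
            offsets.append(None)  # no surface span for this word
        else:
            offsets.append(cursor + match.start())
            cursor += match.end()  # continue scanning after the match
    return offsets

text = "A linguagem da paz pode ser uma cultura."
tokens = ["A", "linguagem", "de", "a", "paz",
          "pode", "ser", "uma", "cultura", "."]
print(align_offsets(text, tokens))
# [0, 2, None, None, 15, 19, 24, 28, 32, 39]
```

The "de" and "a" from the contraction get None, while every real surface token gets an offset that slices correctly out of the original text.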

Labels

bug (Something isn't working)
