Text indexation on Portuguese #32

@albcunha

Hello! Maybe something is not working correctly with token.idx on Portuguese.

I think the cause is multiword tokens. In Portuguese, "da" ("of the") is a contraction of "de + a".

I saw #17, which seems to be the same problem, but it seems the fix won't work for Portuguese.
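In the CoNLL-U output that spacy_udpipe builds on, such a contraction is represented as a multiword token: one surface form ("da") spanning two syntactic words ("de" and "a"). An illustrative, abbreviated fragment (columns trimmed) for the sentence in the second example below:

```
3-4	da	_	_
3	de	de	ADP
4	a	o	DET
```

The syntactic words "de" and "a" never appear verbatim in the text, so no character offset can point at them directly.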

This works (token.text and text slice are the same):

nlp = spacy_udpipe.load("en")
text = "The language of peace can be a culture."
doc = nlp(text)
for token in doc:
    print(token.text, text[token.idx:token.idx + len(token.text)])

The The
language language
of of
peace peace
can can
be be
a a
culture culture
. .

This won't work (token.text and the text slice are not the same after the multiword token):

nlp = spacy_udpipe.load("pt")
text = "A linguagem da paz pode ser uma cultura."
doc = nlp(text)
for token in doc:
    print(token.text, text[token.idx:token.idx + len(token.text)])

A A
linguagem linguagem
de da
a p
paz z p

pode de s
ser r u
uma a c
cultura ltura.
.

Any ideas on how to circumvent this?
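One possible workaround, just a sketch rather than a fix in spacy_udpipe itself (`align_offsets` below is a hypothetical helper, not part of the library), is to recompute character offsets by scanning the original text, and to report None for syntactic words that have no verbatim surface form:

```python
import re

def align_offsets(text, tokens):
    """Best-effort mapping of token strings back to character offsets.

    Syntactic words produced by splitting a multiword token (e.g. the
    "de" / "a" parts of the Portuguese contraction "da") have no verbatim
    surface form, so they get offset None instead of a wrong index.
    """
    offsets = []
    cursor = 0
    for tok in tokens:
        if tok.isalnum():
            # Whole-word match, so "de" is not found inside "pode".
            match = re.search(r"\b%s\b" % re.escape(tok), text[cursor:])
        else:
            # Punctuation: a plain literal search is enough.
            match = re.search(re.escape(tok), text[cursor:])
        if match is None:
            offsets.append(None)  # no surface span for this word
        else:
            offsets.append(cursor + match.start())
            cursor += match.end()  # continue scanning after the match
    return offsets

text = "A linguagem da paz pode ser uma cultura."
tokens = ["A", "linguagem", "de", "a", "paz",
          "pode", "ser", "uma", "cultura", "."]
print(align_offsets(text, tokens))
# [0, 2, None, None, 15, 19, 24, 28, 32, 39]
```

The "de" and "a" from the contraction get None, while every real surface token gets an offset that slices correctly out of the original text.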

Labels

bug (Something isn't working)
