Custom model with pretokenized input including multiword #56

Open
ziqianPeng wants to merge 2 commits into nlp-uoregon:master from ziqianPeng:master

@ziqianPeng
Hello!
I'm trying to train a custom parser using trankit with pretokenized input extracted from CoNLL-U files.

I may not be using it the intended way, but with my approach some bugs occurred for French (multiword tokens) and Chinese (a "KeyError UD-Japanese-Like" if I parse my test file right after training finishes), so I modified the source code to fix them. I also modified the path of the xlm_roberta model in file_utils.py so that it is downloaded only once when training multiple models of the same type, such as 'customized'.
The file train_pred_trainkit.py is an example of how to apply these modifications, especially the function pred_trankit.
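
For readers unfamiliar with the multiword issue, here is a minimal sketch of how pretokenized sentences might be extracted from CoNLL-U text. It is not the code from this PR: the function name `read_pretokenized`, the `keep_multiword` flag, and the sample sentence are all hypothetical and only illustrate the format. In CoNLL-U, a multiword token sits on a line whose ID is a range such as `3-4`, followed by the syntactic words it covers, so a reader has to pick one of the two views.

```python
# Hypothetical helper, not part of trankit or of this PR.
# Sample French sentence: "au" is a multiword token covering "à" + "le".
SAMPLE = "\n".join([
    "# text = Je vais au parc.",
    "1\tJe\tje\tPRON\t_\t_\t2\tnsubj\t_\t_",
    "2\tvais\taller\tVERB\t_\t_\t0\troot\t_\t_",
    "3-4\tau\t_\t_\t_\t_\t_\t_\t_\t_",
    "3\tà\tà\tADP\t_\t_\t5\tcase\t_\t_",
    "4\tle\tle\tDET\t_\t_\t5\tdet\t_\t_",
    "5\tparc\tparc\tNOUN\t_\t_\t2\tobl\t_\t_",
    "6\t.\t.\tPUNCT\t_\t_\t2\tpunct\t_\t_",
    "",
])


def read_pretokenized(conllu_text, keep_multiword=True):
    """Return a list of sentences, each a list of token strings.

    keep_multiword=True  -> surface tokens ("au")
    keep_multiword=False -> syntactic words ("à", "le")
    """
    sentences, current = [], []
    skip_until = 0  # last word ID covered by the current multiword range
    for line in conllu_text.splitlines():
        line = line.strip()
        if not line:  # a blank line ends the sentence
            if current:
                sentences.append(current)
            current, skip_until = [], 0
            continue
        if line.startswith("#"):  # sentence-level comment
            continue
        cols = line.split("\t")
        tok_id, form = cols[0], cols[1]
        if "." in tok_id:  # empty node such as "8.1"
            continue
        if "-" in tok_id:  # multiword token range such as "3-4"
            if keep_multiword:
                current.append(form)
                skip_until = int(tok_id.split("-")[1])
            continue
        if int(tok_id) <= skip_until:  # word already covered by the range
            continue
        current.append(form)
    if current:  # file did not end with a blank line
        sentences.append(current)
    return sentences


surface = read_pretokenized(SAMPLE, keep_multiword=True)
words = read_pretokenized(SAMPLE, keep_multiword=False)
```

Which of the two views a parser expects as pretokenized input depends on how it was trained, so the flag is left to the caller.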

I hope this will be helpful for you, and thanks a lot for developing trankit!
