An emtsv module that connects a preverb to the verb or verb-derivative token to which it belongs. Designed for Hungarian.
See also: https://github.com/ril-lexknowrep/hungarian-preverb-corpus
This module is a rule-based tool that uses a hand-crafted decision tree to connect Hungarian preverbs to the verbs from which they are separated in certain syntactic contexts. It uses information from emtsv's tok, morph and pos modules to decide whether a separated preverb should be connected to a particular verb. Because it relies only on morphological and part-of-speech tags and surface word order cues, it needs neither a lexicon of legitimate preverb-verb combinations nor the output of a syntactic parser. Separated preverbs are connected not only to finite verb forms, but also to infinitives, adjectival and adverbial participles, and nomina actionis derived from verbs that have preverbs. A Hungarian preverb may be separated from the verb root in all of these word types in certain syntactic contexts.
This module uses the following tsv output fields. The philosophy that underlies these annotations is explained in our emPreverb paper.
- The `prev` field: Verbs (by which we mean finite as well as non-finite verb forms and potentially separable verb derivatives) and preverbs are annotated as follows:
  - `pfx` marks a verb token that contains a prefixed, i.e. non-separated preverb.
  - `sep` marks a verb token from which a preverb was separated, i.e. a verb token to which a preverb token in the sentence belongs.
  - `conn` marks a separated preverb token which has been identified as belonging to a verb token in the sentence.

  The `sep` and `conn` annotations thus mark connected verb-preverb pairs. Verbs for which no corresponding preverb was found and preverbs to which no verb could be assigned by emPreverb are not annotated in any way, i.e. this field remains empty for them.
- The `previd` field: This field contains an unambiguous numerical identifier that indicates which `sep` verb a specific `conn` preverb belongs to. The preverb has the same `previd` value as the corresponding verb. emPreverb is only designed to handle one-to-one correspondences between separated preverbs and verbs, and thus coordinative structures in which arguably more than one preverb should be connected to a single verb, or vice versa, are not annotated as such (e.g. meg kell és meg is lehet oldani; az öregje addig senkinek cipőt, csizmát nem szab a lábára, amíg meg nem nézette, szagoltatta, tapintatta velük a bőröket, hogy melyik lenne igazán a kedvükre való).
- The `prevpos` field: This indicates the direction and distance of the separated preverb relative to its verb. This information only appears on verbs with separated preverbs, not on the preverbs, nor on verbs with a non-separated preverb. The value of `prevpos` consists of a number, which specifies the distance in tokens, and a sign, which specifies the direction. Minus means 'to the left' and plus means 'to the right'. For example, a value of `+1` would mean that the separated preverb is located immediately to the right of its verb, and `-2` indicates that the preverb is two tokens to the left of the verb, i.e. with one other token in between.
- The `xpostag` field: Although this is one of the required source fields of emPreverb, its value is also modified by it. For verbs with either a separated (`prev = "sep"`) or a non-separated preverb (`prev = "pfx"`), the label `[/Prev]` is prepended to the original value of `xpostag`. For example: szétvetve becomes `[/Prev][/V][_AdvPtcp/Adv]` instead of the original `[/V][_AdvPtcp/Adv]` that is assigned to the `xpostag` field by emtsv's PurePos tagger module. The `xpostag` of the separated verb nyelje in a föld nyelje el becomes `[/Prev][/V][Sbjv.Def.3Sg]`.
- The `lemma` field: This is also a required source field of emPreverb which is modified by it. For verbs with a non-separated preverb the lemma is left unchanged. For `sep` verbs, the separated preverb is prepended to the verb's lemma. The lemma of `conn` preverbs is set to an empty string. In the previous example, PurePos originally assigns the values `nyel` and `el` to the `lemma` field of nyelje and el respectively. emPreverb changes these to `elnyel` and the empty string (i.e. no lemma at all) respectively.
- The `compound` field: This field is the target field of our emCompound module. It is not required by emPreverb, but if emPreverb's input does contain the `compound` field, then emPreverb modifies it, adding the compound structure preverb + # + verb lemma as the value of this field for `sep` verbs. This means that the value of the `compound` field of the token nyelje in the above example, which is originally empty (as this form is not itself a compound), becomes `el#nyel`. Verb tokens with non-separated preverbs, like szétvetve, are already analysed as compounds, i.e. `szét#vet`, by emCompound, so these are not changed.
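As an illustration of how the `prev`, `previd` and `prevpos` fields fit together, the following minimal sketch pairs each `sep` verb with its `conn` preverb in one sentence. It is not part of emPreverb. The sample rows are hand-written for "végig kell vinni", and the column header names (including `form` for the token) are assumptions that should be checked against your actual TSV output.

```python
import csv
import io

# Hypothetical emPreverb output for "végig kell vinni"; only the columns
# needed here are shown, and the header names are assumptions.
SAMPLE_TSV = (
    "form\tprev\tprevid\tprevpos\n"
    "végig\tconn\t1\t\n"
    "kell\t\t\t\n"
    "vinni\tsep\t1\t-2\n"
)


def pair_preverbs(rows):
    """Return (preverb form, verb form) pairs for the tokens of one sentence."""
    pairs = []
    for i, row in enumerate(rows):
        if row["prev"] != "sep":
            continue
        # prevpos is a signed token offset from the verb to its preverb.
        j = i + int(row["prevpos"])
        if not 0 <= j < len(rows):
            continue
        preverb = rows[j]
        # Cross-check the offset against the shared previd identifier.
        if preverb["prev"] == "conn" and preverb["previd"] == row["previd"]:
            pairs.append((preverb["form"], row["form"]))
    return pairs


rows = list(csv.DictReader(io.StringIO(SAMPLE_TSV), delimiter="\t"))
print(pair_preverbs(rows))  # [('végig', 'vinni')]
```

The offset and the identifier carry redundant information, so checking one against the other is a cheap consistency test when post-processing the output.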
Important note on the ordering of modules: Since emPreverb modifies the values of the `lemma` and `xpostag` fields that are assigned by emtsv pos, we do not recommend running emPreverb before any other emtsv modules that use these two fields as their source fields. Thus emPreverb should ideally be run toward the end of the pipeline. If emCompound is being used, it should be run before emPreverb. emFilter and emToReadable can be safely run after emPreverb. emToReadable can in fact convert emPreverb's output annotation into a human-readable format.
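The ordering constraints above can be checked mechanically before a pipeline is started. The helper below is a hypothetical sketch, not part of emtsv or emPreverb. It only covers the constraints stated here (tok, morph and pos before preverb, and compound before preverb if present), using the module names from the emtsv command lines in this document.

```python
def check_module_order(modules):
    """Return a list of problems with the position of 'preverb' in a pipeline."""
    problems = []
    if "preverb" not in modules:
        return problems
    preverb_at = modules.index("preverb")
    # emPreverb reads the output of tok, morph and pos.
    for required in ("tok", "morph", "pos"):
        if required not in modules[:preverb_at]:
            problems.append(f"{required} must run before preverb")
    # emCompound is optional, but if it is used it has to come first.
    if "compound" in modules[preverb_at:]:
        problems.append("compound must run before preverb")
    return problems


print(check_module_order("tok,morph,pos,compound,preverb".split(",")))  # []
print(check_module_order("tok,morph,pos,preverb,compound".split(",")))
# ['compound must run before preverb']
```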
| Example | V prev | V previd | V prevpos | V xpostag | V lemma | V compound | P prev | P previd | P lemma |
|---|---|---|---|---|---|---|---|---|---|
| átúsztam | pfx | | | [/Prev][/V][Pst.NDef.1Sg] | átúszik | át#úszik | | | |
| végig kell vinni | sep | 1 | -2 | [/Prev][/V][Inf] | végigvisz | végig#visz | conn | 1 | "" |
| nem gondolom -e meg | sep | 2 | +2 | [/Prev][/V][Prs.Def.1Sg] | meggondol | meg#gondol | conn | 2 | "" |
| 90 napot meg nem haladó | sep | 3 | -2 | [/Prev][/Adj][Nom] | meghaladó | meg#haladó | conn | 3 | "" |
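To make the field changes concrete, here is a minimal sketch of what connecting one verb-preverb pair does to the two token rows, following the description of the `xpostag`, `lemma` and `compound` fields. It is an illustration under assumptions, not emPreverb's actual implementation: the `connect` function and the dict-based row layout are hypothetical, and the preverb's original lemma is assumed to be the string that gets prepended.

```python
def connect(verb, preverb, offset, pair_id):
    """Return annotated copies of a verb row and its separated preverb row."""
    verb, preverb = dict(verb), dict(preverb)
    original_lemma = verb["lemma"]

    # Verb side: mark as separated, record the pair id and the signed offset.
    verb.update(prev="sep", previd=str(pair_id), prevpos=f"{offset:+d}")
    verb["xpostag"] = "[/Prev]" + verb["xpostag"]
    verb["lemma"] = preverb["lemma"] + original_lemma
    # The compound field is only touched if the input already contains it.
    if "compound" in verb:
        verb["compound"] = preverb["lemma"] + "#" + original_lemma

    # Preverb side: mark as connected and clear the lemma.
    preverb.update(prev="conn", previd=str(pair_id), lemma="")
    return verb, preverb


# "a föld nyelje el": the preverb is one token to the right of the verb.
verb_in = {"form": "nyelje", "lemma": "nyel",
           "xpostag": "[/V][Sbjv.Def.3Sg]", "compound": ""}
preverb_in = {"form": "el", "lemma": "el"}
verb_out, preverb_out = connect(verb_in, preverb_in, offset=1, pair_id=1)
print(verb_out["lemma"], verb_out["xpostag"], verb_out["compound"])
# elnyel [/Prev][/V][Sbjv.Def.3Sg] el#nyel
```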
Depending on the configuration of your system, you might have to add the path to the emPreverb module on your machine (i.e. the path to your clone of the emPreverb repository) to the PYTHONPATH environment variable as shown here before executing the commands below. Otherwise you might get a 'module not found' error from the Python interpreter:
export PYTHONPATH="${PYTHONPATH}:/path/to/emPreverb/"
(Replace "/path/to/emPreverb/" with the actual absolute path to emPreverb on your machine.) If you are also using emCompound, you might have to do the same for the emCompound directory.
EmPreverb can be executed as an individual Python module. The file 'input.txt' in this example is a raw text file:
cat input.txt | docker run -i --rm mtaril/emtsv tok,morph,pos > pos_output.tsv
cat pos_output.tsv | python3 -m emPreverb > prev_output.tsv
Optionally, if emCompound is executed before emPreverb in the processing pipeline, then emPreverb adjusts the content of the compound field as described above:
cat input.txt | docker run -i --rm mtaril/emtsv tok,morph,pos > pos_output.tsv
cat pos_output.tsv | python3 -m emCompound | python3 -m emPreverb > prev_output.tsv
Alternatively, emPreverb can be run within emtsv as part of a processing pipeline:
cat input.txt | docker run -i --rm mtaril/emtsv tok,morph,pos,preverb > prev_output.tsv
Or together with emCompound:
cat input.txt | docker run -i --rm mtaril/emtsv tok,morph,pos,compound,preverb > prev_output.tsv
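If you prefer to drive the standalone module from Python rather than from a shell pipe, something like the following sketch may help. The `pipe_through` helper is hypothetical and is not provided by emPreverb. It simply feeds text to a command's standard input, and `EMPREVERB` mirrors the `python3 -m emPreverb` invocation shown above, assuming the module is importable by the current interpreter.

```python
import subprocess
import sys


def pipe_through(command, text):
    """Feed text to a command's stdin and return what it writes to stdout."""
    result = subprocess.run(
        command,
        input=text,
        capture_output=True,
        encoding="utf-8",
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return result.stdout


# Same invocation as `python3 -m emPreverb`, using the current interpreter.
EMPREVERB = [sys.executable, "-m", "emPreverb"]

# Usage, assuming pos_output.tsv was produced by emtsv tok,morph,pos:
# with open("pos_output.tsv", encoding="utf-8") as f:
#     annotated = pipe_through(EMPREVERB, f.read())
```

Passing the command as a list avoids shell quoting issues, and `check=True` surfaces a 'module not found' failure as an exception instead of an empty result.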
pip install -r requirements.txt
- `make connect_preverbs`: if the compound field is present
- `make connect_preverbs_withcompound`: if the compound field is not present

Uses the code in the emPreverb directory directly.
Just type `make` to run all of the following:
- A virtual environment is created in `venv`.
- The `emPreverb` Python package is created in `dist/emPreverb-*-py3-none-any.whl`.
- The package is installed in `venv`.
- The package is unit tested on `tests/inputs/*.in` and outputs are compared with `tests/outputs/*.out`.
The above steps can be performed by make venv, make build, make install and make test respectively.
The Python package can be installed anywhere by direct path:
pip install ./dist/emPreverb-*-py3-none-any.whl

To release a new version:
- Check `emPreverb/version.py`.
- Run `make release-major`, `make release-minor` or `make release-patch`. This will update the version number appropriately and make a `git commit` with a new `git` TAG.
- Run `make` to recreate the package with the new tag in `dist/emPreverb-TAG-py3-none-any.whl`.
- Go to `https://github.com/THISUSER/emPreverb` and "Create release from tag".
- Add the wheel file from `dist/emPreverb-TAG-py3-none-any.whl` manually to the release.
To add emPreverb to emtsv:
- Install `emtsv`: 1st and 2nd point + `cython` only.
- Go to the `emtsv` directory (`cd emtsv`).
- Add `emPreverb` by adding this line to `requirements.txt`:
  https://github.com/THISUSER/emPreverb/releases/download/vTAG/emPreverb-TAG-py3-none-any.whl
- Complete `config.py` by adding `em_preverb` and `tools` from `emPreverb/__main__.py` appropriately.
- Complete `emtsv` installation by `make venv`.
- `echo "A kutya ment volna el sétálni." | venv/bin/python3 ./main.py tok,morph,pos > old`
- `echo "A kutya ment volna el sétálni." | venv/bin/python3 ./main.py tok,morph,pos,preverb > new`
- See results by `diff old new`.
- If everything is in order, create a PR for `emtsv`.
That's it! :)
Based on postprocess-emtsv/scripts/connect_prev.py and emDummy.
TODO command line argument -v.