Conversation


@Enkidu93 Enkidu93 commented Jan 2, 2026

Added support for capturing rendering patterns, references, and term domains. Moved to using a KeyTerm data structure rather than tuples.

(This also includes ports of recent changes from Machine: sillsdev/machine#362 and sillsdev/machine#368.)
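For illustration, a KeyTerm record along these lines might look like the following sketch (the field names are guesses based on the description above, not the PR's actual implementation):

```python
from dataclasses import dataclass
from typing import Sequence

# Hypothetical sketch of a KeyTerm record replacing the old tuples;
# the real class in the PR may have different fields and names.
@dataclass(frozen=True)
class KeyTerm:
    id: str                          # term identifier
    renderings: Sequence[str] = ()   # target-language renderings
    references: Sequence[str] = ()   # scripture references where the term occurs
    domains: Sequence[str] = ()      # semantic domains

term = KeyTerm(id="telephone", renderings=("telefono",), domains=("communication",))
```

Compared with a bare tuple, a named structure like this lets callers add fields (such as domains) without breaking positional unpacking at every call site.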



@Enkidu93 Enkidu93 requested a review from ddaspit January 2, 2026 21:51

Enkidu93 commented Jan 2, 2026

Also, this updates the machine.py library version.


codecov-commenter commented Jan 7, 2026

Codecov Report

❌ Patch coverage is 82.05128% with 49 lines in your changes missing coverage. Please review.
✅ Project coverage is 90.67%. Comparing base (acff116) to head (f25702e).

Files with missing lines Patch % Lines
.../corpora/test_usfm_versification_error_detector.py 4.76% 20 Missing ⚠️
...chine/corpora/usfm_versification_error_detector.py 30.00% 14 Missing ⚠️
machine/jobs/translation_file_service.py 33.33% 4 Missing ⚠️
...tion/huggingface/hugging_face_nmt_model_trainer.py 93.47% 3 Missing ⚠️
machine/corpora/memory_text.py 75.00% 2 Missing ⚠️
...hine/corpora/paratext_project_terms_parser_base.py 96.07% 2 Missing ⚠️
...hine/corpora/file_paratext_project_terms_parser.py 85.71% 1 Missing ⚠️
machine/corpora/n_parallel_text_row.py 85.71% 1 Missing ⚠️
...a/paratext_project_versification_error_detector.py 0.00% 1 Missing ⚠️
machine/corpora/text_file_alignment_corpus.py 0.00% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #257      +/-   ##
==========================================
- Coverage   90.74%   90.67%   -0.07%     
==========================================
  Files         352      355       +3     
  Lines       22337    22516     +179     
==========================================
+ Hits        20270    20417     +147     
- Misses       2067     2099      +32     



@ddaspit ddaspit left a comment


@ddaspit partially reviewed 11 files and all commit messages, and made 1 comment.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @Enkidu93).


machine/corpora/key_term_row.py line 0 at r1 (raw file):
This file should be named key_term.py.

@Enkidu93 Enkidu93 requested a review from ddaspit January 13, 2026 16:27

@Enkidu93 Enkidu93 left a comment


@Enkidu93 made 3 comments.
Reviewable status: 8 of 21 files reviewed, 1 unresolved discussion (waiting on @ddaspit).


machine/corpora/key_term_row.py line at r1 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

This file should be named key_term.py.

Done.


machine/translation/huggingface/hugging_face_nmt_model_trainer.py line 362 at r3 (raw file):

            ).tokens()
            src_term_partial_word_tokens.remove("▁")
            src_term_partial_word_tokens.remove("\ufffc")

This is mirroring code in silnlp more-or-less exactly. I made an issue for creating a shared utility function that could handle some of this. I also experimented with finding a safe way to do this with non-fast tokenizers. It's something we should look into as needed, but I decided it was taking too much time.
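For readers without the silnlp context, the cleanup being mirrored here is roughly this kind of sentinel-token filtering (a simplified sketch only; the real code operates on a fast tokenizer's token list and calls list.remove on specific markers):

```python
# "▁" (U+2581) is SentencePiece's word-boundary marker; "\ufffc" is the
# object replacement character, used here as a placeholder during tokenization.
SENTINELS = {"\u2581", "\ufffc"}

def strip_sentinels(tokens):
    """Drop sentinel tokens from a subword token list, keeping real subwords."""
    return [t for t in tokens if t not in SENTINELS]

tokens = ["\u2581", "tele", "phone", "\ufffc"]
print(strip_sentinels(tokens))  # ['tele', 'phone']
```

Note that this sketch drops every occurrence of the sentinels, whereas list.remove (as in the snippet above) removes only the first and raises ValueError if the token is absent.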


tests/translation/huggingface/test_hugging_face_nmt_model_trainer.py line 130 at r3 (raw file):

        corpus = source_corpus.align_rows(target_corpus)

        terms_corpus = DictionaryTextCorpus(MemoryText("terms", [TextRow("terms", 1, ["telephone"])])).align_rows(

I don't love that this test doesn't really cover whether the terms are affecting the result. I just stuck this in here for code coverage (no exceptions thrown, etc.), but I couldn't adapt our one true fine-tuning test because it uses a non-fast tokenizer. I looked for alternatives but couldn't find anything that works. I did confirm in the debugger that everything was being tokenized properly. Maybe we should consider outputting some kind of artifact in ClearML (?) with the tokenized data so we have something to compare apples-to-apples to the tokenized experiment txt files in silnlp.

@Enkidu93

Fixes #256, fixes #240


@ddaspit ddaspit left a comment


@ddaspit partially reviewed 9 files and all commit messages, made 1 comment, and resolved 1 discussion.
Reviewable status: 17 of 26 files reviewed, 1 unresolved discussion (waiting on @Enkidu93).


machine/translation/huggingface/hugging_face_nmt_model_trainer.py line 94 at r4 (raw file):

        training_args: Seq2SeqTrainingArguments,
        corpus: Union[ParallelTextCorpus, Dataset],
        terms_corpus: Optional[Union[ParallelTextCorpus, Dataset]] = None,

I think it would be better to add some type of metadata to a row to indicate what kind of data it is, rather than pass around two separate corpora. It feels more in spirit with the design goals of the corpus API in Machine. We could have a generic metadata dictionary that allows you to set whatever you want. Or, we could add a new "data type" property with a predefined set of values, such as gloss, sentence, section, document, etc. What do you think?
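A rough sketch of the generic-metadata-dictionary option floated here (the names are purely illustrative, not machine.py's actual corpus API):

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List

# Illustrative only: a row carrying an open-ended metadata dict, one of the
# two options proposed above (the other being a fixed "data type" property).
@dataclass
class TaggedRow:
    ref: object                   # row reference (e.g., a verse ref)
    segment: List[str]            # the row's tokens
    metadata: Dict[str, Any] = field(default_factory=dict)

row = TaggedRow(ref=1, segment=["telephone"], metadata={"data_type": "gloss"})
```

The dictionary keeps the row schema open-ended, at the cost of stringly-typed keys; the predefined "data type" property trades that flexibility for type safety.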


@Enkidu93 Enkidu93 left a comment


@Enkidu93 made 1 comment.
Reviewable status: 17 of 26 files reviewed, 1 unresolved discussion (waiting on @ddaspit).


machine/translation/huggingface/hugging_face_nmt_model_trainer.py line 94 at r4 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

I think it would be better to add some type of metadata to a row to indicate what kind of data it is, rather than pass around two separate corpora. It feels more in spirit with the design goals of the corpus API in Machine. We could have a generic metadata dictionary that allows you to set whatever you want. Or, we could add a new "data type" property with a predefined set of values, such as gloss, sentence, section, document, etc. What do you think?

Yes, I did think of this. If we wanted to add a field to each row, I think an enum like you mentioned might be the better option. We could add it to each text and then expose it through the row as well (?). There were two reasons I didn't already tackle this:

  1. It will add a little complexity to the to_hf_dataset() function as well as to the preprocess_function when we call map() on the dataset.
  2. I wondered how much it was worth doing when we only have two categories. Do we foresee additional text types? Or, even if we don't, do we see ourselves needing some kind of tagging for other attributes? This would affect how it should be implemented.

A competing (or perhaps supplemental) option would be for this function to take something like corpora: List[Tuple[CorpusOptions, Union[ParallelCorpus, Dataset]]] or corpora: List[CorpusConfig]. Then the processing options could be more dynamic (e.g., 'include partial words for corpus 1, tag corpus 2 with such-and-such tag'), independent of a text type field, and they could be specified by the calling code.
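The second shape might look roughly like this (CorpusConfig and its fields are hypothetical names from this comment, not an existing API):

```python
from dataclasses import dataclass
from typing import Any, List, Optional

# Hypothetical per-corpus configuration, as proposed above; in machine.py the
# corpus field would be Union[ParallelTextCorpus, Dataset] rather than Any.
@dataclass
class CorpusConfig:
    corpus: Any
    include_partial_words: bool = False  # e.g., for key-terms glosses
    tag: Optional[str] = None            # optional tag applied to this corpus

configs: List[CorpusConfig] = [
    CorpusConfig(corpus="main_corpus"),
    CorpusConfig(corpus="terms_corpus", include_partial_words=True, tag="terms"),
]
```

The trainer would then iterate over the list and apply each corpus's options during preprocessing, rather than special-casing a second terms_corpus parameter.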


@ddaspit ddaspit left a comment


@ddaspit made 1 comment.
Reviewable status: 17 of 26 files reviewed, 1 unresolved discussion (waiting on @Enkidu93).


machine/translation/huggingface/hugging_face_nmt_model_trainer.py line 94 at r4 (raw file):

Previously, Enkidu93 (Eli C. Lowry) wrote…

Yes, I did think of this. If we wanted to add a field to each row, I think an enum like you mentioned might be the better option. We could add it to each text and then expose it through the row as well (?). There were two reasons I didn't already tackle this:

  1. It will add a little complexity to the to_hf_dataset() function as well as to the preprocess_function when we call map() on the dataset.
  2. I wondered how much it was worth doing this when we only have two categories. Do we foresee additional text types? Or maybe we don't foresee other text types but do we see ourselves needing some kind of tagging for other attributes? This would affect how it should be implemented.

A competing (or perhaps supplemental) option would be for this function to take something like corpus: List[Tuple[CorpusOptions, Union[ParallelCorpus, Dataset]]] or corpus: List[CorpusConfig]. Then the processing options could be more dynamic - e.g., 'include partial words for corpus 1, tag corpus 2 with such-and-such tag, etc.' - regardless of a text type field and it could be specified by the calling code.

I do think it would be generally useful to have a way of tagging each row with a "data type" field. I'm okay with added complexity in a couple of places. I also like the second option that you proposed as a shorter-term solution. Would the second option be a lot easier and quicker to implement than the tagging?


@Enkidu93 Enkidu93 left a comment


@Enkidu93 made 1 comment.
Reviewable status: 17 of 26 files reviewed, 1 unresolved discussion (waiting on @ddaspit).


machine/translation/huggingface/hugging_face_nmt_model_trainer.py line 94 at r4 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

I do think it would be generally useful to have a way of tagging each row with a "data type" field. I'm okay with added complexity in a couple of places. I also like the second option that you proposed as a shorter-term solution. Would the second option be a lot easier and quicker to implement than the tagging?

OK! I think the second solution would require fewer edits, since adding the data type attribute is going to affect quite a few classes, all of which will need to be ported, I think - but if it's what you think we should do ultimately, we might as well just go ahead with it 👍.


@ddaspit ddaspit left a comment


@ddaspit made 1 comment.
Reviewable status: 17 of 26 files reviewed, 1 unresolved discussion (waiting on @Enkidu93).


machine/translation/huggingface/hugging_face_nmt_model_trainer.py line 94 at r4 (raw file):

Previously, Enkidu93 (Eli C. Lowry) wrote…

OK! I think the second solution would require fewer edits since adding the data type attribute is going to affect quite a few classes I think all of which will need to be ported - but if it's what you think we should do ultimately, might as well just go ahead with it 👍.

Ok, let's go ahead and implement the "data type" field; then we don't have to pass around multiple corpora.


@Enkidu93 Enkidu93 left a comment


@Enkidu93 made 1 comment.
Reviewable status: 9 of 41 files reviewed, 1 unresolved discussion (waiting on @ddaspit).


machine/translation/huggingface/hugging_face_nmt_model_trainer.py line 94 at r4 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

Ok, let's go ahead and implement the "data type" field, then we don't have to pass around multiple corpora.

Alright - I've gone ahead and added the data type field - hopefully satisfactorily - and updated the preprocessing code accordingly. Sorry this took a little while!

(Super minor, but I don't like how we have competing names for source and target throughout the codebases: source/src and target/trg/tgt. Do you have a preference for variable names that include source/target? If so, I can slowly normalize the naming as I edit code.)

@Enkidu93

machine/corpora/data_type.py line 9 at r5 (raw file):

    SENTENCE = auto()
    PASSAGE = auto()
    DOCUMENT = auto()

Consider also adding Word and Segment.
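With that suggestion applied, the enum might read as follows (the ordering and final value names are assumed, not taken from the PR):

```python
from enum import Enum, auto

# Sketch of the DataType enum from machine/corpora/data_type.py, extended with
# the WORD and SEGMENT values suggested above; the merged version may differ.
class DataType(Enum):
    WORD = auto()      # single-word data, e.g. key-term glosses
    SEGMENT = auto()   # sub-sentence fragments
    SENTENCE = auto()
    PASSAGE = auto()
    DOCUMENT = auto()
```
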

4 participants