Conversation
Contributor
There was a problem hiding this comment.
Pull request overview
This PR updates the document classification workflow to support “forced” classification for specific corpora (e.g., uved), adjusts multiclass SDG classification behavior accordingly, and refreshes dependencies/tests to match.
Changes:
- Add a forced-corpus allowlist (
FORCED_CORPUS_CLASSIFIED) and propagateis_forced_corpusthroughdocument_classifier.py→n_classify_slice. - Extend batch generation to allow filtering by corpus name (
PICK_CORPUS_NAME). - Add
get_model_classification_model_by_idhelper and bumpscikit-learn(plus lockfile refresh).
Reviewed changes
Copilot reviewed 7 out of 8 changed files in this pull request and generated 8 comments.
Show a summary per file
| File | Description |
|---|---|
| welearn_datastack/nodes_workflow/DocumentVectorizer/generate_to_vectorize_batch.py | Adds corpus filtering option when generating vectorization batches. |
| welearn_datastack/nodes_workflow/DocumentClassifier/document_classifier.py | Detects forced corpora and bypasses bi-classifier gating when needed; passes is_forced_corpus downstream. |
| welearn_datastack/modules/sdgs_classifiers.py | Updates n_classify_slice to support forced-corpus behavior and revised threshold logic. |
| welearn_datastack/modules/retrieve_data_from_database.py | Adds model lookup helper by model id with custom exception. |
| welearn_datastack/constants.py | Introduces FORCED_CORPUS_CLASSIFIED constant (currently ["uved"]). |
| tests/document_classifier/test_sdgs_classifiers.py | Updates tests for forced-corpus behavior and revised multiclass logic. |
| pyproject.toml | Bumps scikit-learn constraint to ~=1.7.0. |
| poetry.lock | Lockfile update reflecting dependency resolution changes. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Comment on lines
72
to
82
| result = n_classify_slice( | ||
| slice, | ||
| "model_name", | ||
| n_classifier_id=uuid.uuid4(), | ||
| bi_classifier_id=uuid.uuid4(), | ||
| is_forced_corpus=True, | ||
| ) | ||
| self.assertEqual(result.sdg_number, 3) | ||
| self.assertEqual(result.sdg_number, 4) | ||
| self.assertIsNotNone(result.bi_classifier_model_id) | ||
| self.assertIsNone(result.n_classifier_model_id) | ||
|
|
|
|
||
| QDRANT_MULTI_LINGUAL_CODE = "mul" | ||
|
|
||
| FORCED_CORPUS_CLASSIFIED = ["uved"] |
Comment on lines
122
to
125
| externaly_classified_flag = ( | ||
| key_external_sdg in s.document.details | ||
| and s.document.details[key_external_sdg] | ||
| ) |
Comment on lines
+129
to
+135
| if ( | ||
| externaly_classified_flag | ||
| or is_forced_corpus | ||
| or bi_classify_slice(slice_=s, classifier_model_name=bi_model_name) | ||
| ): | ||
| logger.info( | ||
| f"Document {key_doc_id} is classified as SDG by bi-classifier" |
| return model | ||
|
|
||
| raise NoModelFoundError( | ||
| f"Model not found in the database according this id : {model_id}" |
Comment on lines
+68
to
+82
| # By default every SDGs are equally possible | ||
| is_forced_sdg_classif = bool(forced_sdg) | ||
| if not forced_sdg: | ||
| forced_sdg = [sdg_n + 1 for sdg_n in range(0, 17)] | ||
|
|
||
| # If there is only one forced sdg return it | ||
| if len(forced_sdg) == 1: | ||
| [sdg_number] = forced_sdg | ||
| return Sdg( | ||
| slice_id=_slice.id, | ||
| sdg_number=sdg_number, | ||
| id=uuid.uuid4(), | ||
| bi_classifier_model_id=bi_classifier_id, | ||
| n_classifier_model_id=n_classifier_id if not forced_sdg else None, | ||
| ) |
| sdg_number=best_sdg, | ||
| id=uuid.uuid4(), | ||
| bi_classifier_model_id=bi_classifier_id, | ||
| n_classifier_model_id=n_classifier_id if not forced_sdg else None, |
| self.assertEqual(result, None) | ||
| self.assertEqual(result.sdg_number, 3) | ||
| self.assertIsNotNone(result.bi_classifier_model_id) | ||
| self.assertIsNone(result.n_classifier_model_id) |
sandragjacinto
approved these changes
Mar 19, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
This pull request introduces improvements to the document classification workflow, particularly around handling forced classification for specific corpora, and updates dependencies and test coverage. The most significant changes are the addition of logic for forced corpus classification, enhancements to the
n_classify_slicefunction, and improved test cases to reflect these updates.Forced Corpus Classification
FORCED_CORPUS_CLASSIFIEDconstant inwelearn_datastack/constants.pyto specify corpora that should always be classified, bypassing the bi-classifier.document_classifier.pyto check if a document's corpus is inFORCED_CORPUS_CLASSIFIED, and use this information to force classification in the main workflow. [1] [2] [3]n_classify_slicefunction to accept anis_forced_corpusparameter and adjust classification logic accordingly. [1] [2]Test Coverage Improvements
test_sdgs_classifiers.pyto reflect new forced classification logic, including renaming and modifying test cases for clarity and accuracy. [1] [2] [3]Dependency Updates
scikit-learndependency from version 1.6.1 to 1.7.0 inpyproject.toml.Utility and Exception Handling
get_model_classification_model_by_idutility function to retrieve models by ID and raise a custom exception if not found. [1] [2]Vectorization Workflow Enhancement
corpus_nameparameter to document vectorization batch generation for improved filtering and control. [1] [2]