Fix/uved by lpi-tn · Pull Request #117 · CyberCRI/welearn-datastack

lpi-tn · 2026-03-19T14:02:16Z

This pull request introduces improvements to the document classification workflow, particularly around handling forced classification for specific corpora, and updates dependencies and test coverage. The most significant changes are the addition of logic for forced corpus classification, enhancements to the n_classify_slice function, and improved test cases to reflect these updates.

Forced Corpus Classification

Added FORCED_CORPUS_CLASSIFIED constant in welearn_datastack/constants.py to specify corpora that should always be classified, bypassing the bi-classifier.
Updated document_classifier.py to check if a document's corpus is in FORCED_CORPUS_CLASSIFIED, and use this information to force classification in the main workflow.
Modified the n_classify_slice function to accept an is_forced_corpus parameter and adjust classification logic accordingly.
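
The gating described above can be sketched as follows. This is a minimal illustration, not the code from the PR: `should_classify` and the `bi_classifier` callable are hypothetical names, and only `FORCED_CORPUS_CLASSIFIED` and its value come from the diff.

```python
from typing import Callable

# Value taken from welearn_datastack/constants.py in this PR.
FORCED_CORPUS_CLASSIFIED = ["uved"]


def should_classify(
    corpus_name: str,
    externally_classified: bool,
    bi_classifier: Callable[[], bool],
) -> bool:
    """Hypothetical helper: decide whether a slice goes on to SDG classification."""
    is_forced_corpus = corpus_name in FORCED_CORPUS_CLASSIFIED
    # `or` short-circuits, so the bi-classifier is only called when neither
    # the external flag nor the forced-corpus flag is set.
    return externally_classified or is_forced_corpus or bi_classifier()
```

Because the bi-classifier call comes last in the `or` chain, a document from a forced corpus never pays for bi-classifier inference in this sketch.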

Test Coverage Improvements

Updated and added tests in test_sdgs_classifiers.py to reflect new forced classification logic, including renaming and modifying test cases for clarity and accuracy.

Dependency Updates

Upgraded scikit-learn dependency from version 1.6.1 to 1.7.0 in pyproject.toml.

Utility and Exception Handling

Added get_model_classification_model_by_id utility function to retrieve models by ID and raise a custom exception if not found.
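
A lookup-or-raise helper of this kind might look like the sketch below. The `Model` dataclass, the in-memory list, and the function name `find_model_by_id` are assumptions made for illustration; the real helper queries the database. Only the exception name `NoModelFoundError` appears in the diff.

```python
import uuid
from dataclasses import dataclass
from typing import Iterable


class NoModelFoundError(Exception):
    """Raised when no model matches the requested id (name taken from the diff)."""


@dataclass
class Model:
    # Hypothetical stand-in for the project's database model.
    id: uuid.UUID
    title: str


def find_model_by_id(models: Iterable[Model], model_id: uuid.UUID) -> Model:
    """Return the model with the given id, or raise NoModelFoundError."""
    for model in models:
        if model.id == model_id:
            return model
    raise NoModelFoundError(f"No model found in the database for id: {model_id}")
```

Raising a dedicated exception rather than returning `None` lets callers tell "missing model" apart from other failures without extra checks.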

Vectorization Workflow Enhancement

Added corpus_name parameter to document vectorization batch generation for improved filtering and control.
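
As a rough sketch of optional corpus filtering: the functions and the `(document_id, corpus_name)` tuples below are hypothetical, and reading `PICK_CORPUS_NAME` from the environment is an assumption (the review overview names `PICK_CORPUS_NAME` but does not show how it is set).

```python
import os
from typing import Iterable, List, Optional, Tuple

# Hypothetical (document_id, corpus_name) pairs.
Doc = Tuple[str, str]


def select_documents(
    docs: Iterable[Doc], corpus_name: Optional[str] = None
) -> List[str]:
    """Return document ids, restricted to one corpus when corpus_name is given."""
    return [
        doc_id
        for doc_id, corpus in docs
        if corpus_name is None or corpus == corpus_name
    ]


def corpus_name_from_env() -> Optional[str]:
    # Assumption: PICK_CORPUS_NAME is an environment variable; an unset or
    # empty value means "no filtering".
    return os.environ.get("PICK_CORPUS_NAME") or None
```

Treating an unset value as "no filter" keeps the existing behaviour unchanged for callers that do not pass a corpus name.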

Copilot

Pull request overview

This PR updates the document classification workflow to support “forced” classification for specific corpora (e.g., uved), adjusts multiclass SDG classification behavior accordingly, and refreshes dependencies/tests to match.

Changes:

Add a forced-corpus allowlist (FORCED_CORPUS_CLASSIFIED) and propagate is_forced_corpus through document_classifier.py → n_classify_slice.
Extend batch generation to allow filtering by corpus name (PICK_CORPUS_NAME).
Add get_model_classification_model_by_id helper and bump scikit-learn (plus lockfile refresh).

Reviewed changes

Copilot reviewed 7 out of 8 changed files in this pull request and generated 8 comments.

File	Description
welearn_datastack/nodes_workflow/DocumentVectorizer/generate_to_vectorize_batch.py	Adds corpus filtering option when generating vectorization batches.
welearn_datastack/nodes_workflow/DocumentClassifier/document_classifier.py	Detects forced corpora and bypasses bi-classifier gating when needed; passes `is_forced_corpus` downstream.
welearn_datastack/modules/sdgs_classifiers.py	Updates `n_classify_slice` to support forced-corpus behavior and revised threshold logic.
welearn_datastack/modules/retrieve_data_from_database.py	Adds model lookup helper by model id with custom exception.
welearn_datastack/constants.py	Introduces `FORCED_CORPUS_CLASSIFIED` constant (currently `["uved"]`).
tests/document_classifier/test_sdgs_classifiers.py	Updates tests for forced-corpus behavior and revised multiclass logic.
pyproject.toml	Bumps `scikit-learn` constraint to `~=1.7.0`.
poetry.lock	Lockfile update reflecting dependency resolution changes.

tests/document_classifier/test_sdgs_classifiers.py

        result = n_classify_slice(
            slice,
            "model_name",
            n_classifier_id=uuid.uuid4(),
            bi_classifier_id=uuid.uuid4(),
+            is_forced_corpus=True,
        )
-        self.assertEqual(result.sdg_number, 3)
+        self.assertEqual(result.sdg_number, 4)
        self.assertIsNotNone(result.bi_classifier_model_id)
        self.assertIsNone(result.n_classifier_model_id)



welearn_datastack/constants.py


 QDRANT_MULTI_LINGUAL_CODE = "mul"
+
+FORCED_CORPUS_CLASSIFIED = ["uved"]


welearn_datastack/nodes_workflow/DocumentClassifier/document_classifier.py

            externaly_classified_flag = (
                key_external_sdg in s.document.details
                and s.document.details[key_external_sdg]
            )


welearn_datastack/nodes_workflow/DocumentClassifier/document_classifier.py

+            if (
+                externaly_classified_flag
+                or is_forced_corpus
+                or bi_classify_slice(slice_=s, classifier_model_name=bi_model_name)
+            ):
+                logger.info(
+                    f"Document {key_doc_id} is classified as SDG by bi-classifier"


welearn_datastack/modules/retrieve_data_from_database.py

+        return model
+
+    raise NoModelFoundError(
+        f"Model not found in the database according this id : {model_id}"


welearn_datastack/modules/sdgs_classifiers.py

+    # By default every SDGs are equally possible
+    is_forced_sdg_classif = bool(forced_sdg)
    if not forced_sdg:
        forced_sdg = [sdg_n + 1 for sdg_n in range(0, 17)]

+    # If there is only one forced sdg return it
+    if len(forced_sdg) == 1:
+        [sdg_number] = forced_sdg
+        return Sdg(
+            slice_id=_slice.id,
+            sdg_number=sdg_number,
+            id=uuid.uuid4(),
+            bi_classifier_model_id=bi_classifier_id,
+            n_classifier_model_id=n_classifier_id if not forced_sdg else None,
+        )
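
The candidate handling in this hunk can be isolated in a small sketch. `resolve_candidates` is a hypothetical helper, not a function from the PR; it assumes SDG numbers run from 1 to 17, as in the hunk. Note that inside the single-SDG branch `forced_sdg` is always non-empty, so the expression `n_classifier_id if not forced_sdg else None` evaluates to `None` there.

```python
from typing import List, Optional, Tuple


def resolve_candidates(
    forced_sdg: Optional[List[int]],
) -> Tuple[List[int], Optional[int]]:
    """Return (candidate SDG numbers, early result).

    Hypothetical helper mirroring the hunk: with no forced list, all 17 SDGs
    are candidates; with a single forced SDG, it is returned without scoring.
    """
    candidates = list(forced_sdg) if forced_sdg else list(range(1, 18))
    early = candidates[0] if len(candidates) == 1 else None
    return candidates, early
```

When `early` is not `None`, a caller can skip the multiclass model entirely, which matches the early `return Sdg(...)` in the hunk.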


welearn_datastack/modules/sdgs_classifiers.py

+        sdg_number=best_sdg,
+        id=uuid.uuid4(),
+        bi_classifier_model_id=bi_classifier_id,
+        n_classifier_model_id=n_classifier_id if not forced_sdg else None,


tests/document_classifier/test_sdgs_classifiers.py

-        self.assertEqual(result, None)
+        self.assertEqual(result.sdg_number, 3)
+        self.assertIsNotNone(result.bi_classifier_model_id)
+        self.assertIsNone(result.n_classifier_model_id)


lpi-tn added 3 commits March 19, 2026 11:31

fix(classifiers): enhance SDG classification logic and logging

8f2591c

fix(classifiers): add forced corpus classification logic and update m…

8930a60

…odel retrieval

fix(test_sdgs_classifiers): update tests for SDG classification logic…

7a7fb20

… and rename test functions

lpi-tn requested review from Copilot and sandragjacinto March 19, 2026 14:02

Copilot AI reviewed Mar 19, 2026

sandragjacinto approved these changes Mar 19, 2026

lpi-tn merged commit 8da226f into main Mar 19, 2026
11 checks passed

lpi-tn deleted the Fix/uved branch March 19, 2026 14:52

lpi-tn merged 3 commits into main from Fix/uved

3 participants

