Fix/uved#117

Merged
lpi-tn merged 3 commits into main from Fix/uved
Mar 19, 2026

lpi-tn (Collaborator) commented Mar 19, 2026

This pull request introduces improvements to the document classification workflow, particularly around handling forced classification for specific corpora, and updates dependencies and test coverage. The most significant changes are the addition of logic for forced corpus classification, enhancements to the n_classify_slice function, and improved test cases to reflect these updates.

Forced Corpus Classification

  • Added FORCED_CORPUS_CLASSIFIED constant in welearn_datastack/constants.py to specify corpora that should always be classified, bypassing the bi-classifier.
  • Updated document_classifier.py to check whether a document's corpus is in FORCED_CORPUS_CLASSIFIED and, if so, to force classification in the main workflow.
  • Modified the n_classify_slice function to accept an is_forced_corpus parameter and adjust classification logic accordingly.
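
The gating described in the bullets above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code in document_classifier.py: the helpers `corpus_is_forced` and `should_classify` and the `bi_classifier` callable are hypothetical names introduced here. Only the FORCED_CORPUS_CLASSIFIED constant and its value come from the PR.

```python
from typing import Callable

# Value taken from the PR (welearn_datastack/constants.py).
FORCED_CORPUS_CLASSIFIED = ["uved"]


def corpus_is_forced(corpus_name: str) -> bool:
    """Hypothetical helper: True when the corpus must always be classified."""
    return corpus_name in FORCED_CORPUS_CLASSIFIED


def should_classify(
    corpus_name: str,
    externally_classified: bool,
    bi_classifier: Callable[[], bool],
) -> bool:
    """Hypothetical helper mirroring the order of checks shown in the PR."""
    # `or` short-circuits, so the bi-classifier is never called
    # for externally classified documents or forced corpora.
    return (
        externally_classified
        or corpus_is_forced(corpus_name)
        or bi_classifier()
    )
```

Passing the bi-classifier as a callable keeps the expensive model call lazy, which is the practical effect of "bypassing" it for forced corpora.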

Test Coverage Improvements

  • Updated and added tests in test_sdgs_classifiers.py to reflect the new forced classification logic, including renaming and modifying test cases for clarity and accuracy.

Dependency Updates

  • Upgraded scikit-learn dependency from version 1.6.1 to 1.7.0 in pyproject.toml.

Utility and Exception Handling

  • Added get_model_classification_model_by_id utility function to retrieve models by ID and raise a custom exception if not found.
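
A lookup that raises instead of returning None might look like the sketch below. It is an assumption-laden illustration: the real helper queries the database, whereas this version uses an in-memory dict, and `ModelRecord` and `get_model_by_id` are hypothetical names. Only the exception name NoModelFoundError appears in the PR.

```python
import uuid
from dataclasses import dataclass
from typing import Dict


class NoModelFoundError(Exception):
    """Raised when no model matches the requested id (name taken from the PR)."""


@dataclass(frozen=True)
class ModelRecord:
    """Hypothetical stand-in for the database row describing a model."""

    id: uuid.UUID
    title: str


def get_model_by_id(
    models: Dict[uuid.UUID, ModelRecord], model_id: uuid.UUID
) -> ModelRecord:
    """Return the matching model, or raise so callers never handle None."""
    model = models.get(model_id)
    if model is not None:
        return model
    raise NoModelFoundError(f"Model not found for id: {model_id}")
```

Raising a dedicated exception lets the workflow distinguish "unknown model id" from other database failures.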

Vectorization Workflow Enhancement

  • Added corpus_name parameter to document vectorization batch generation for improved filtering and control.
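
As a rough illustration of corpus filtering during batch generation, the sketch below batches document ids and optionally restricts them to one corpus. The function `generate_batches` and the `(doc_id, corpus)` tuple shape are assumptions made for this example. The actual code in generate_to_vectorize_batch.py reads from the database and is not reproduced here.

```python
from typing import Iterable, Iterator, List, Optional, Tuple


def generate_batches(
    docs: Iterable[Tuple[str, str]],
    batch_size: int,
    corpus_name: Optional[str] = None,
) -> Iterator[List[str]]:
    """Yield lists of document ids, optionally restricted to one corpus."""
    batch: List[str] = []
    for doc_id, doc_corpus in docs:
        # With no corpus_name, every document is eligible.
        if corpus_name is not None and doc_corpus != corpus_name:
            continue
        batch.append(doc_id)
        if len(batch) == batch_size:
            yield batch
            batch = []
    # Emit the final, possibly shorter, batch.
    if batch:
        yield batch
```

Making the filter optional keeps the default behavior unchanged for callers that do not pass a corpus name.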

Copilot AI (Contributor) left a comment

Pull request overview

This PR updates the document classification workflow to support “forced” classification for specific corpora (e.g., uved), adjusts multiclass SDG classification behavior accordingly, and refreshes dependencies/tests to match.

Changes:

  • Add a forced-corpus allowlist (FORCED_CORPUS_CLASSIFIED) and propagate is_forced_corpus from document_classifier.py through to n_classify_slice.
  • Extend batch generation to allow filtering by corpus name (PICK_CORPUS_NAME).
  • Add get_model_classification_model_by_id helper and bump scikit-learn (plus lockfile refresh).

Reviewed changes

Copilot reviewed 7 out of 8 changed files in this pull request and generated 8 comments.

File | Description
welearn_datastack/nodes_workflow/DocumentVectorizer/generate_to_vectorize_batch.py | Adds corpus filtering option when generating vectorization batches.
welearn_datastack/nodes_workflow/DocumentClassifier/document_classifier.py | Detects forced corpora and bypasses bi-classifier gating when needed; passes is_forced_corpus downstream.
welearn_datastack/modules/sdgs_classifiers.py | Updates n_classify_slice to support forced-corpus behavior and revised threshold logic.
welearn_datastack/modules/retrieve_data_from_database.py | Adds model lookup helper by model id with custom exception.
welearn_datastack/constants.py | Introduces FORCED_CORPUS_CLASSIFIED constant (currently ["uved"]).
tests/document_classifier/test_sdgs_classifiers.py | Updates tests for forced-corpus behavior and revised multiclass logic.
pyproject.toml | Bumps scikit-learn constraint to ~=1.7.0.
poetry.lock | Lockfile update reflecting dependency resolution changes.

Comment on lines 72 to 82
result = n_classify_slice(
    slice,
    "model_name",
    n_classifier_id=uuid.uuid4(),
    bi_classifier_id=uuid.uuid4(),
    is_forced_corpus=True,
)
self.assertEqual(result.sdg_number, 3)
self.assertEqual(result.sdg_number, 4)
self.assertIsNotNone(result.bi_classifier_model_id)
self.assertIsNone(result.n_classifier_model_id)


QDRANT_MULTI_LINGUAL_CODE = "mul"

FORCED_CORPUS_CLASSIFIED = ["uved"]

Comment on lines 122 to 125
externaly_classified_flag = (
    key_external_sdg in s.document.details
    and s.document.details[key_external_sdg]
)

Comment on lines +129 to +135
if (
    externaly_classified_flag
    or is_forced_corpus
    or bi_classify_slice(slice_=s, classifier_model_name=bi_model_name)
):
    logger.info(
        f"Document {key_doc_id} is classified as SDG by bi-classifier"

return model

raise NoModelFoundError(
    f"Model not found in the database according this id : {model_id}"

Comment on lines +68 to +82
# By default every SDGs are equally possible
is_forced_sdg_classif = bool(forced_sdg)
if not forced_sdg:
    forced_sdg = [sdg_n + 1 for sdg_n in range(0, 17)]

# If there is only one forced sdg return it
if len(forced_sdg) == 1:
    [sdg_number] = forced_sdg
    return Sdg(
        slice_id=_slice.id,
        sdg_number=sdg_number,
        id=uuid.uuid4(),
        bi_classifier_model_id=bi_classifier_id,
        n_classifier_model_id=n_classifier_id if not forced_sdg else None,
    )

    sdg_number=best_sdg,
    id=uuid.uuid4(),
    bi_classifier_model_id=bi_classifier_id,
    n_classifier_model_id=n_classifier_id if not forced_sdg else None,
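
One detail in the fragments above is worth isolating: when forced_sdg is empty it is replaced by the full list of 17 SDGs, so, assuming it is not reassigned in the elided code, `not forced_sdg` is always false by the time the Sdg is built and n_classifier_model_id is always None. The sketch below reproduces that behavior in isolation and contrasts it with reading the flag captured before the default is applied (is_forced_sdg_classif in the PR). The two function names are hypothetical, and the PR does not state which outcome is intended.

```python
import uuid
from typing import List, Optional

ALL_SDGS = [n + 1 for n in range(17)]


def n_id_as_written(
    forced_sdg: Optional[List[int]], n_classifier_id: uuid.UUID
) -> Optional[uuid.UUID]:
    """Mirror the expression from the fragment: test forced_sdg after defaulting."""
    if not forced_sdg:
        forced_sdg = list(ALL_SDGS)
    # forced_sdg is non-empty here, so this always evaluates to None.
    return n_classifier_id if not forced_sdg else None


def n_id_with_flag(
    forced_sdg: Optional[List[int]], n_classifier_id: uuid.UUID
) -> Optional[uuid.UUID]:
    """Alternative: remember whether SDGs were forced before defaulting."""
    is_forced = bool(forced_sdg)
    if not forced_sdg:
        forced_sdg = list(ALL_SDGS)
    return n_classifier_id if not is_forced else None
```

The first variant is consistent with the assertIsNone(result.n_classifier_model_id) checks in the tests. The second would record the n-classifier id whenever no SDG was forced.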

self.assertEqual(result, None)
self.assertEqual(result.sdg_number, 3)
self.assertIsNotNone(result.bi_classifier_model_id)
self.assertIsNone(result.n_classifier_model_id)

lpi-tn merged commit 8da226f into main Mar 19, 2026
11 checks passed
lpi-tn deleted the Fix/uved branch March 19, 2026 14:52