Pensar - auto fix for ML Pipeline Data Poisoning via Unvalidated Tweet Ingestion by pensarappdev[bot] · Pull Request #13 · Yuvanesh-ux/Nexus

pensarappdev · 2025-05-07T16:39:26Z

Security issue fixed:
create_social_profile_sns (and related data flows) ingested untrusted tweet data without validation, exposing downstream ML operations to the data poisoning attack described under More Details.

How it was addressed:

Provenance Tracking:
Every tweet ingested (from disk or scraped) receives a _provenance field indicating its source: "disk" or "scraped".
Strict Input Validation:
A helper function _is_valid_tweet was added to verify (before use):
- Minimum and maximum length of tweet text
- Valid UTF-8 encoding and only printable/control chars allowed
- Rejects text with excessive URLs, binary/control chars, or highly anomalous/repetitive content
Deduplication:
A _deduplicate_tweets function deduplicates tweets (by text and timestamp) before they are processed downstream or written to disk.
Safe Disk Loading:
Tweets loaded from disk get revalidated and recleaned before being used. Invalid/anomalous tweets are skipped and a warning is logged.
Corpus Filtering:
Only validated/cleaned, deduplicated tweets are added to the analysis corpus (all_tweets) and written back to disk.
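
The steps listed above can be sketched roughly as follows. This is an illustrative sketch only, not the code from the patch: the function names (is_valid_tweet, deduplicate_tweets, ingest), the "text" and "timestamp" field names, and every threshold are assumptions made for the example. Only the _provenance field and its "disk"/"scraped" values come from the description.

```python
import logging
import re
import unicodedata

logger = logging.getLogger(__name__)

# Hypothetical thresholds; the PR description does not state the real limits.
MIN_LEN = 1
MAX_LEN = 1000
MAX_URLS = 3
MAX_REPEAT_RATIO = 0.8  # share of the text taken up by its most common character

URL_RE = re.compile(r"https?://\S+")
ALLOWED_CONTROL = {"\n", "\t", "\r"}  # assumption: ordinary whitespace controls are fine


def is_valid_tweet(text):
    """Return True if text passes the length, encoding, URL and repetition checks."""
    if not isinstance(text, str):
        return False
    stripped = text.strip()
    if not (MIN_LEN <= len(stripped) <= MAX_LEN):
        return False
    try:
        stripped.encode("utf-8")  # raises on lone surrogates
    except UnicodeEncodeError:
        return False
    # Reject control characters (Unicode category Cc) other than common whitespace.
    for ch in stripped:
        if ch not in ALLOWED_CONTROL and unicodedata.category(ch) == "Cc":
            return False
    if len(URL_RE.findall(stripped)) > MAX_URLS:
        return False
    # Crude repetition check: one character dominating a non-trivial text.
    most_common = max(stripped.count(c) for c in set(stripped))
    if len(stripped) >= 10 and most_common / len(stripped) > MAX_REPEAT_RATIO:
        return False
    return True


def deduplicate_tweets(tweets):
    """Drop tweets whose (text, timestamp) pair was already seen, keeping order."""
    seen = set()
    unique = []
    for tweet in tweets:
        key = (tweet.get("text"), tweet.get("timestamp"))
        if key in seen:
            continue
        seen.add(key)
        unique.append(tweet)
    return unique


def ingest(raw_tweets, source):
    """Validate, tag and deduplicate tweets from one source ("disk" or "scraped")."""
    kept = []
    for tweet in raw_tweets:
        if not isinstance(tweet, dict) or not is_valid_tweet(tweet.get("text")):
            logger.warning("Skipping invalid tweet from %s", source)
            continue
        kept.append({**tweet, "_provenance": source})
    return deduplicate_tweets(kept)
```

In this sketch the same ingest path would be called for both sources, for example ingest(loaded, "disk") for tweets read back from disk and ingest(fetched, "scraped") for fresh ones, so that stored data is revalidated rather than trusted.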

Side notes:

The interface and file format remain essentially unchanged except for the addition of the _provenance and user fields.
The downstream ML code uses these cleaned tweets as before, but now with stronger guarantees about the reliability and provenance of the data.
The patch adds no new external dependencies.

More Details

Type	Identifier	Message	Severity	Link
Application	CWE-20, ML02	The pipeline ingests raw, publicly sourced tweets (user-generated content) into the `all_tweets` training corpus without any provenance tracking, validation, or anomaly detection. An attacker can deliberately craft malicious tweets, either by compromising one of the listed accounts or by creating bulk spam that the target account retweets, to poison the downstream embedding and clustering stages. This can: • Skew topic extraction or cluster assignment (integrity attack); • Implant back-door trigger strings that later manipulate analytical outcomes; • Degrade model quality or cause misinterpretation of the generated Atlas map. This aligns with OWASP ML Top-10 ‘ML02 – Data Poisoning’ and maps to Improper Input Validation (CWE-20) because untrusted data is consumed directly in the training workflow.	medium	Link

Fix security issue: ML Pipeline Data Poisoning via Unvalidated Tweet …

8d18574

…Ingestion (CWE-20, ML02)

pensarappdev[bot] wants to merge 1 commit into main from pensar-auto-fix-xbT5
