CYC07/Afanon

Advanced Text Anonymizer (afanon) 🕵️‍♂️📄

afanon.py is a configurable Python script that identifies and anonymizes sensitive information in text documents. It takes a multi-layered approach, combining Named Entity Recognition (NER), regular expressions, custom wordlists, and a fallback heuristic to redact personally identifiable information (PII).

Key Features

  • Multi-Layered Detection:
    • Named Entity Recognition (NER): Leverages NLTK to identify entities like PERSON, ORGANIZATION, LOCATION, DATE, MONEY, etc. (configurable).
    • Regular Expression (Regex) Matching: Pre-defined and customizable regex patterns to detect common PII like emails, phone numbers, SSNs, credit card numbers, IP addresses, physical addresses, and postal codes.
    • Custom Wordlist Anonymization: Allows users to supply their own lists of sensitive words/phrases (e.g., passwords, specific object names, dog breeds, project codenames) for targeted redaction.
    • Fallback Capitalized Word Detection: A heuristic that flags capitalized words as potential names when the other methods miss them.
  • Flexible Anonymization Strategies:
    • placeholder: Replaces sensitive data with generic placeholders (e.g., [PERSON], [EMAIL]).
    • counter: Replaces with unique, numbered placeholders (e.g., [PERSON_1], [PERSON_2]).
    • fake: Substitutes PII with realistic-looking but synthetic data generated by the Faker library (e.g., "John Doe" for a name, a fake address for a real one).
  • Intelligent Overlap Resolution: When multiple detection methods identify overlapping text, the script prioritizes matches based on specificity (regex/custom lists > NER > fallback) and length to ensure accurate anonymization.
  • Highly Configurable:
    • Specify which NER entities to target.
    • Provide custom files for passwords, objects, dog breeds, or a generic wordlist.
    • Choose the anonymization strategy.
    • (Advanced) Modify built-in regex patterns directly in the code.
  • Command-Line Interface (CLI): Easy to integrate into workflows for processing text files.
  • Detailed Logging: Offers insights into the anonymization process, including found entities and applied replacements (with verbose/debug mode).
  • Automatic NLTK Data Check: Ensures necessary NLTK resources are available and attempts to download them if missing.
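
The detection layers, overlap resolution, and the placeholder and counter strategies described above can be illustrated with a minimal sketch. This is not the code in afanon.py: the names `Match`, `PATTERNS`, `PRIORITY`, `find_regex_matches`, `find_capitalized`, `resolve_overlaps`, and `anonymize` are hypothetical, the two regex patterns are simplified stand-ins for the built-in ones, and reusing one counter number for repeated occurrences of the same text is an assumption.

```python
import re
from collections import defaultdict
from typing import NamedTuple


class Match(NamedTuple):
    start: int
    end: int
    label: str
    source: str  # "regex", "custom", "ner", or "fallback"


# Simplified, illustrative patterns; the ones built into afanon.py may differ.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

# Lower number wins: regex/custom lists > NER > fallback.
PRIORITY = {"regex": 0, "custom": 0, "ner": 1, "fallback": 2}


def find_regex_matches(text):
    """Collect every span matched by one of the illustrative patterns."""
    return [Match(m.start(), m.end(), label, "regex")
            for label, pattern in PATTERNS.items()
            for m in pattern.finditer(text)]


def find_capitalized(text):
    """Fallback heuristic: treat any capitalized word as a possible name."""
    return [Match(m.start(), m.end(), "PERSON", "fallback")
            for m in re.finditer(r"\b[A-Z][a-z]+\b", text)]


def resolve_overlaps(matches):
    """Keep the most specific, then longest, match among overlapping spans."""
    ranked = sorted(matches,
                    key=lambda m: (PRIORITY[m.source], -(m.end - m.start), m.start))
    kept = []
    for m in ranked:
        # Accept a match only if it does not intersect one already kept.
        if all(m.end <= k.start or m.start >= k.end for k in kept):
            kept.append(m)
    return sorted(kept, key=lambda m: m.start)


def anonymize(text, matches, strategy="placeholder"):
    """Replace resolved matches with [LABEL] or numbered [LABEL_n] tags."""
    counters = defaultdict(int)
    seen = {}  # (label, original text) -> numbered tag, for "counter"
    pieces, pos = [], 0
    for m in resolve_overlaps(matches):
        original = text[m.start:m.end]
        if strategy == "counter":
            key = (m.label, original)
            if key not in seen:
                counters[m.label] += 1
                seen[key] = f"[{m.label}_{counters[m.label]}]"
            replacement = seen[key]
        else:
            replacement = f"[{m.label}]"
        pieces.append(text[pos:m.start])
        pieces.append(replacement)
        pos = m.end
    pieces.append(text[pos:])
    return "".join(pieces)


text = "write to Alice.Smith@example.com or call Bob"
found = find_regex_matches(text) + find_capitalized(text)
print(anonymize(text, found))             # write to [EMAIL] or call [PERSON]
print(anonymize(text, found, "counter"))  # write to [EMAIL_1] or call [PERSON_1]
```

In this example the fallback heuristic also flags "Alice" and "Smith" inside the email address, but the regex match has higher priority and covers both, so only one [EMAIL] replacement is made.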

Getting Started

Prerequisites

  • Python 3.7+
  • Required Python packages (see requirements.txt):
    • nltk
    • Faker

Installation

  1. Clone the repository:

    git clone https://github.com/your-username/advanced-text-anonymizer.git
    cd advanced-text-anonymizer
  2. Install dependencies:

    pip install -r requirements.txt
  3. NLTK Data: The script (afanon.py) will automatically check and attempt to download required NLTK data packages (punkt, averaged_perceptron_tagger, maxent_ne_chunker, words) on its first run. Ensure you have an internet connection.
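
A check of this kind could look roughly like the sketch below. It is not taken from afanon.py: `REQUIRED` and `ensure_nltk_data` are hypothetical names, and the lookup paths are the locations NLTK conventionally uses for these four packages. The `find` and `download` parameters default to `nltk.data.find` and `nltk.download` and exist only so the logic can be exercised without network access.

```python
# Package names come from the step above; the paths are where NLTK
# conventionally stores them (an assumption about the installed version).
REQUIRED = {
    "punkt": "tokenizers/punkt",
    "averaged_perceptron_tagger": "taggers/averaged_perceptron_tagger",
    "maxent_ne_chunker": "chunkers/maxent_ne_chunker",
    "words": "corpora/words",
}


def ensure_nltk_data(find=None, download=None):
    """Download any missing package; return the names that still failed."""
    if find is None or download is None:
        import nltk  # imported lazily so the sketch loads without NLTK
        find = find or nltk.data.find
        download = download or nltk.download
    failed = []
    for name, path in REQUIRED.items():
        try:
            find(path)  # raises LookupError when the resource is absent
        except LookupError:
            if not download(name):  # a falsy result is treated as failure
                failed.append(name)
    return failed
```

Calling `ensure_nltk_data()` with no arguments on a machine with NLTK installed would attempt the downloads for real. On a machine without internet access, the packages can be fetched elsewhere and copied into a directory on NLTK's data search path.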

Usage

The script is run from the command line:

python afanon.py <input_file> <output_file> [options]
