Advanced Text Anonymizer (afanon) 🕵️‍♂️📄

A configurable Python script (afanon.py) for identifying and anonymizing sensitive information in text documents. It takes a multi-layered approach, combining Named Entity Recognition (NER), regular expressions, custom wordlists, and a fallback heuristic to redact personally identifiable information (PII).

Key Features

- Multi-Layered Detection:
  - Named Entity Recognition (NER): Leverages NLTK to identify entities like PERSON, ORGANIZATION, LOCATION, DATE, MONEY, etc. (configurable).
  - Regular Expression (Regex) Matching: Pre-defined and customizable regex patterns to detect common PII like emails, phone numbers, SSNs, credit card numbers, IP addresses, physical addresses, and postal codes.
  - Custom Wordlist Anonymization: Allows users to supply their own lists of sensitive words/phrases (e.g., passwords, specific object names, dog breeds, project codenames) for targeted redaction.
  - Fallback Capitalized Word Detection: A heuristic to catch potential names that might be missed by other methods.
- Flexible Anonymization Strategies:
  - placeholder: Replaces sensitive data with generic placeholders (e.g., [PERSON], [EMAIL]).
  - counter: Replaces with unique, numbered placeholders (e.g., [PERSON_1], [PERSON_2]).
  - fake: Substitutes PII with realistic-looking but synthetic data generated by the Faker library (e.g., "John Doe" for a name, a fake address for a real one).
- Intelligent Overlap Resolution: When multiple detection methods identify overlapping text, the script prioritizes matches based on specificity (regex/custom lists > NER > fallback) and length to ensure accurate anonymization.
- Highly Configurable:
  - Specify which NER entities to target.
  - Provide custom files for passwords, objects, dog breeds, or a generic wordlist.
  - Choose the anonymization strategy.
  - (Advanced) Modify built-in regex patterns directly in the code.
- Command-Line Interface (CLI): Easy to integrate into workflows for processing text files.
- Detailed Logging: Offers insights into the anonymization process, including found entities and applied replacements (with verbose/debug mode).
- Automatic NLTK Data Check: Ensures necessary NLTK resources are available and attempts to download them if missing.
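
The detection, overlap-resolution, and counter steps listed above can be sketched in a few lines. This is a minimal illustration, not the code in afanon.py: the two regex patterns, the `PRIORITY` table, the `(start, end, label, source)` tuple layout, and all function names are assumptions made for the example, and the NER and fallback matches are hard-coded instead of being produced by NLTK. Reusing the same number for a repeated value is also an assumption.

```python
import re
from collections import defaultdict

# Illustrative patterns only; the script's built-in patterns may differ.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "IP_ADDRESS": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
}

# Lower number = higher priority, mirroring regex/custom lists > NER > fallback.
PRIORITY = {"regex": 0, "custom": 0, "ner": 1, "fallback": 2}


def find_regex_matches(text):
    """Return (start, end, label, source) tuples for every pattern hit."""
    return [
        (m.start(), m.end(), label, "regex")
        for label, pattern in PATTERNS.items()
        for m in pattern.finditer(text)
    ]


def resolve_overlaps(matches):
    """Keep non-overlapping matches, preferring higher priority, then longer spans."""
    ranked = sorted(matches, key=lambda m: (PRIORITY[m[3]], -(m[1] - m[0]), m[0]))
    kept = []
    for match in ranked:
        # A match survives only if it does not intersect any span already kept.
        if all(match[1] <= k[0] or match[0] >= k[1] for k in kept):
            kept.append(match)
    return sorted(kept)  # back to document order


def apply_counter(text, matches):
    """Replace each span with [LABEL_n]; a repeated value reuses its number."""
    seen = defaultdict(dict)  # label -> {original value: number}
    out, cursor = [], 0
    for start, end, label, _source in matches:
        value = text[start:end]
        number = seen[label].setdefault(value, len(seen[label]) + 1)
        out.append(text[cursor:start])
        out.append(f"[{label}_{number}]")
        cursor = end
    out.append(text[cursor:])
    return "".join(out)


if __name__ == "__main__":
    text = "Alice Smith wrote to alice@example.com from 10.0.0.1."
    candidates = find_regex_matches(text) + [
        (0, 11, "PERSON", "ner"),      # stand-in for an NER hit on "Alice Smith"
        (0, 5, "NAME", "fallback"),    # capitalized-word hit on "Alice"
        (21, 26, "PERSON", "ner"),     # spurious NER hit inside the email
    ]
    print(apply_counter(text, resolve_overlaps(candidates)))
    # prints: [PERSON_1] wrote to [EMAIL_1] from [IP_ADDRESS_1].
```

Sorting by priority first and span length second means the email regex wins over the NER hit nested inside it, and the full-name NER match wins over the shorter fallback hit.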

Getting Started

Prerequisites

- Python 3.7+
- Required Python packages (see requirements.txt):
  - nltk
  - Faker

Installation

Clone the repository:

```
git clone https://github.com/your-username/advanced-text-anonymizer.git
cd advanced-text-anonymizer
```

Install dependencies:
```
pip install -r requirements.txt
```
NLTK Data: The script (afanon.py) will automatically check and attempt to download required NLTK data packages (punkt, averaged_perceptron_tagger, maxent_ne_chunker, words) on its first run. Ensure you have an internet connection.
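
A check-then-download step like the one described above might look roughly as follows. This is a sketch, not the script's actual code: the function names and the resource-path mapping are assumptions, and resource names can differ between NLTK versions. It relies on `nltk.data.find` raising `LookupError` for a missing resource and on `nltk.download` returning a boolean.

```python
# Resource name -> lookup path used by nltk.data.find (assumed mapping).
REQUIRED_NLTK_DATA = {
    "punkt": "tokenizers/punkt",
    "averaged_perceptron_tagger": "taggers/averaged_perceptron_tagger",
    "maxent_ne_chunker": "chunkers/maxent_ne_chunker",
    "words": "corpora/words",
}


def ensure_resources(resources, find, download):
    """Try to download anything `find` cannot locate; return names still missing."""
    missing = []
    for name, path in resources.items():
        try:
            find(path)
        except LookupError:
            # Resource not installed locally: attempt a download.
            if not download(name):
                missing.append(name)
    return missing


def ensure_nltk_data():
    """Run the check against the real NLTK installation."""
    import nltk  # imported lazily so ensure_resources stays usable without NLTK

    return ensure_resources(
        REQUIRED_NLTK_DATA,
        nltk.data.find,
        lambda name: nltk.download(name, quiet=True),
    )
```

Calling `ensure_nltk_data()` once at startup and aborting when it returns a non-empty list gives a clear error on machines without network access.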

Usage

The script is run from the command line:

```
python afanon.py <input_file> <output_file> [options]
```
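
For batch jobs, the command above can be driven from Python. The sketch below passes only the two positional arguments shown in the usage line. The helper names and the one-output-file-per-input layout are hypothetical, and it assumes afanon.py is in the current working directory.

```python
import subprocess
import sys
from pathlib import Path


def build_command(input_file, output_file, script="afanon.py"):
    """Return the argv list for one afanon run (positional arguments only)."""
    return [sys.executable, script, str(input_file), str(output_file)]


def anonymize_folder(src_dir, dst_dir):
    """Run afanon.py once per .txt file in src_dir, writing results to dst_dir."""
    dst = Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    for src in sorted(Path(src_dir).glob("*.txt")):
        # check=True raises CalledProcessError if the script exits non-zero.
        subprocess.run(build_command(src, dst / src.name), check=True)
```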
