Config-driven PDF line redaction helper for repetitive document cleanup workflows.
pdf-redaction-helper is a practical batch utility for removing whole text lines from PDF files based on reusable keyword rules.
It is designed for situations where the same kinds of lines need to be removed from many documents repeatedly. Instead of opening files one by one and editing them manually, this tool applies rule-based cleanup driven by config.ini and writes sanitized copies to a separate output folder.
The tool scans extracted text lines in PDF pages and removes a whole line only when:
- the line matches at least one include rule
- and the line does not match any exclude rule
This makes it suitable for stable, repeatable document patterns where literal keywords or regular expressions can describe the cleanup target.
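The removal rule above can be written as a small predicate. This is a minimal sketch for illustration, not the tool's actual code: the function name `should_remove` is hypothetical, and treating literal rules as plain substring checks is an assumption.

```python
import re


def should_remove(line, include_literals, include_regexes,
                  exclude_literals=(), exclude_regexes=()):
    """Return True when the line matches an include rule and no exclude rule."""

    def matches(literals, regexes):
        # Literal rules: assumed here to be plain substring checks.
        if any(lit in line for lit in literals):
            return True
        # Regex rules: searched anywhere in the line, ignoring case.
        return any(re.search(p, line, re.IGNORECASE) for p in regexes)

    return (matches(include_literals, include_regexes)
            and not matches(exclude_literals, exclude_regexes))
```

With this shape, an exclude rule always wins: a line containing both "Brand A" and "Keep This Line" is kept.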
Use this tool when you need to:
- process many PDF files with the same cleanup rules
- keep the original files untouched
- reuse the same matching logic later
- turn a repetitive manual task into a configurable batch workflow
It should be understood as a workflow utility, not a general-purpose PDF editor.
- reads config.ini from the same directory as the script or packaged exe
- scans each text line in each PDF page
- removes the whole line when include rules match and exclude rules do not match
- writes sanitized PDFs to output_dir
- supports both literal keyword rules and regex keyword rules
- supports minimal or verbose terminal logging
- continues processing even if individual files fail
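The "continue even if individual files fail" behaviour can be sketched as a loop that catches per-file errors and records them. The names `run_batch` and `process_one` are hypothetical, and the real script may track more counters (for example, skipped files).

```python
from pathlib import Path


def run_batch(pdf_paths, process_one):
    """Process every file; record failures instead of aborting the batch."""
    counts = {"ok": 0, "err": 0}
    errors = []
    for path in pdf_paths:
        try:
            process_one(Path(path))
            counts["ok"] += 1
        except Exception as exc:  # keep going after any per-file failure
            counts["err"] += 1
            errors.append(f"{path}: {exc}")
    return counts, errors
```

The collected `errors` list could then be written to the file named by error_log, so one damaged PDF costs a log entry rather than the whole run.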
A line is removed only when all of the following are true:
- it matches at least one include rule
- it does not match any exclude rule
Supported rule groups:
- literal_keywords
- regex_keywords
- exclude_literal_keywords
- exclude_regex_keywords
Notes:
- regex matching is case-insensitive by default
- \s+ means one or more whitespace characters
- \s* means optional whitespace (zero or more characters)
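A short check of how the two whitespace patterns differ, using Python's standard `re` module. Using `re.IGNORECASE` to get case-insensitive matching is an assumption about how the tool implements that default.

```python
import re

pattern_plus = re.compile(r"Company\s+Name", re.IGNORECASE)
pattern_star = re.compile(r"Company\s*Name", re.IGNORECASE)

samples = ["Company Name", "company    name", "CompanyName"]

# \s+ needs at least one whitespace character between the two words.
plus_hits = [bool(pattern_plus.search(s)) for s in samples]
# \s* also accepts the words written with no gap at all.
star_hits = [bool(pattern_star.search(s)) for s in samples]

print(plus_hits)  # [True, True, False]
print(star_hits)  # [True, True, True]
```

Note that `\s` also matches tabs and other whitespace, which can matter when PDF text extraction produces irregular spacing.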
- Put source PDF files into input_dir.
- Edit config.ini to define input/output paths and matching rules.
- Run the script or packaged exe.
- Review the generated sanitized PDFs in output_dir.
- Refine the rules if needed and run again.
This makes the tool useful for repeated internal cleanup tasks where rules may gradually improve over time.
pip install -r requirements.txt
python pdf_redaction_helper.py
Edit config.ini:
- input_dir: input PDF folder
- output_dir: output PDF folder; must be different from input
- prefix: output filename prefix; can be empty
- log_mode: minimal or verbose
- pause_on_exit: never, error, or always
- error_log: error log path
- literal_keywords: plain-text include rules
- regex_keywords: regex include rules
- exclude_literal_keywords: plain-text exclude rules
- exclude_regex_keywords: regex exclude rules
[settings]
input_dir=origin
output_dir=sanitized_output
prefix=sanitized_
log_mode=minimal
pause_on_exit=error
error_log=error.log
literal_keywords=
Brand A
Brand B
regex_keywords=
Company\s+Name
Document\s+Code
exclude_literal_keywords=
Keep This Line
exclude_regex_keywords=
Reference\s+Only
Notes:
- relative paths are resolved relative to config.ini
- output is written to a separate folder
- filename prefixes can help distinguish sanitized copies from originals
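One way a config of this shape could be read is with Python's standard `configparser`. This is a sketch under assumptions, not the tool's confirmed implementation: the helpers `split_rules` and `resolve_dir` are hypothetical, and `configparser` requires the continuation lines of a multi-line value to be indented, as in the sample string below.

```python
import configparser
from pathlib import Path

# Trimmed sample; continuation lines are indented so configparser
# treats them as part of the preceding value.
SAMPLE = r"""
[settings]
input_dir=origin
output_dir=sanitized_output
literal_keywords=
    Brand A
    Brand B
regex_keywords=
    Company\s+Name
"""


def split_rules(raw):
    """Turn a multi-line config value into a list of non-empty rules."""
    return [ln.strip() for ln in raw.splitlines() if ln.strip()]


def resolve_dir(config_path, value):
    """Resolve a relative path against the folder that holds config.ini."""
    p = Path(value)
    return p if p.is_absolute() else Path(config_path).parent / p


# interpolation=None keeps '%' in regex rules from being interpreted.
parser = configparser.ConfigParser(interpolation=None)
parser.read_string(SAMPLE)
settings = parser["settings"]

literal_rules = split_rules(settings.get("literal_keywords", fallback=""))
regex_rules = split_rules(settings.get("regex_keywords", fallback=""))
input_dir = resolve_dir("project/config.ini", settings["input_dir"])
```

Here `project/config.ini` is a made-up location used only to show that `origin` resolves to `project/origin` rather than to the current working directory.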
[START] config=config.ini
[START] input=origin | output=sanitized_origin | prefix=sanitized_
[START] files=564 | include(literal=4, regex=3) | exclude(literal=0, regex=0) | log_mode=minimal
[PROGRESS] 100.0% 564/564 | ok 564 skip 0 err 0 | 29.3 files/s
[DONE] files=564 | ok=564 skip=0 err=0 | removed_lines=5694 | elapsed=19.27s
python -m PyInstaller --noconfirm --clean --onedir --name pdf_redaction pdf_redaction_helper.py
Copy-Item -Force config.ini dist\pdf_redaction\config.ini
Run:
.\dist\pdf_redaction\pdf_redaction.exe
- input_dir == output_dir is blocked
- per-file failures do not stop the full batch
- in minimal mode, the terminal shows startup summary, one-line progress, and final summary
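The input_dir == output_dir guard could look like the following sketch. The function name `check_dirs` is hypothetical; comparing resolved paths is one reasonable choice, so that `origin` and `./origin` are recognised as the same folder.

```python
from pathlib import Path


def check_dirs(input_dir, output_dir):
    """Raise ValueError when both settings point at the same folder."""
    src = Path(input_dir).resolve()
    dst = Path(output_dir).resolve()
    if src == dst:
        raise ValueError(f"output_dir must differ from input_dir: {src}")
    return src, dst
```

Running this check before any file is opened keeps the originals from being overwritten by sanitized copies.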
- this tool works on extracted PDF text lines; it is not a full visual redaction editor
- results depend on whether the target PDF text is extractable and consistently structured
- poorly chosen keywords or regex patterns may remove too much or too little
For that reason, it is best used on known document patterns with reviewable output.
This repository is intentionally lightweight. Its value is not a complex algorithm, but the ability to turn repetitive PDF cleanup into a reusable and configurable batch process.
- do not commit business PDFs such as origin/ or generated output folders
- do not commit build artifacts such as build/, dist/, or __pycache__/
- keep the repository focused on source code, config templates, and docs