joe-salamy/llm-batch-processor
LLM Batch Processor

A Python tool that processes multiple files through the Google Gemini API in parallel, applying a custom system prompt to each file and saving the outputs as Markdown files.

Features

  • Batch Processing: Process multiple files at once from an input folder
  • Parallel Processing: Send multiple API requests concurrently to save time
  • Multiple File Types: Supports text files (txt, md), PDFs, DOCX, and other document formats
  • Custom System Prompts: Define your own system prompt to customize LLM behavior
  • Configurable: All settings in one configuration file
  • Cross-Platform: Works on Windows, macOS, and Linux

Prerequisites

  • Python 3 installed and available on your PATH
  • A Google Gemini API key (you can create one in Google AI Studio)

Installation & Setup

1. Clone or Download the Repository

Download this repository to your local machine.

2. Create a Virtual Environment

Windows:

cd path\to\llm-batch-processor
python -m venv venv
venv\Scripts\activate

macOS/Linux:

cd path/to/llm-batch-processor
python3 -m venv venv
source venv/bin/activate

3. Install Dependencies

pip install -r requirements.txt

4. Set Up Environment Variables

  1. Copy .env.example to .env:

    # Windows
    copy .env.example .env
    
    # macOS/Linux
    cp .env.example .env
  2. Edit .env and add your Google Gemini API key:

    GOOGLE_API_KEY=your_actual_api_key_here
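The key is read from this file at startup, typically via a library such as python-dotenv. As a rough illustration of the KEY=value format only, a stdlib parser might look like this (read_env is a hypothetical helper, not part of the tool):

```python
# Stdlib-only sketch of parsing KEY=value pairs from a .env file.
# The tool itself may use python-dotenv; read_env is illustrative only.
from pathlib import Path

def read_env(path: str = ".env") -> dict[str, str]:
    env = {}
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        # Skip blanks and comments; keep only KEY=value lines.
        if line and not line.startswith("#") and "=" in line:
            key, _, value = line.partition("=")
            env[key.strip()] = value.strip()
    return env
```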
    

5. Configure Settings

Edit config.py to customize your settings:

  1. Set Input/Output Folders: Update the paths to your input and output folders

    # Windows example
    INPUT_FOLDER = Path(r"C:\Users\YourName\Documents\llm_inputs")
    OUTPUT_FOLDER = Path(r"C:\Users\YourName\Documents\llm_outputs")
    
    # macOS/Linux example
    INPUT_FOLDER = Path("/Users/YourName/Documents/llm_inputs")
    OUTPUT_FOLDER = Path("/Users/YourName/Documents/llm_outputs")
  2. Choose Model: Set your preferred Gemini model

    MODEL_TYPE = "gemini-1.5-pro"  # or "gemini-1.5-flash" for faster processing
  3. Set Max Workers: Adjust parallel processing threads

    MAX_WORKERS = 5  # Increase for faster processing, decrease if hitting rate limits
  4. Set System Prompt: Specify which prompt file to use

    SYSTEM_PROMPT_FILE = "default_prompt.md"  # or create your own in prompts/

6. Create Your Input and Output Folders

Create the folders you specified in config.py:

Windows:

mkdir C:\Users\YourName\Documents\llm_inputs
mkdir C:\Users\YourName\Documents\llm_outputs

macOS/Linux:

mkdir -p /Users/YourName/Documents/llm_inputs
mkdir -p /Users/YourName/Documents/llm_outputs
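If you prefer, the same folder creation can be done from Python with pathlib; this mirrors the behavior of mkdir -p (the paths in the usage comment are illustrative):

```python
# Create folders, tolerating pre-existing ones (like mkdir -p).
from pathlib import Path

def ensure_folders(*folders: Path) -> None:
    for folder in folders:
        folder.mkdir(parents=True, exist_ok=True)

# Example (illustrative paths):
# ensure_folders(Path("llm_inputs"), Path("llm_outputs"))
```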

7. Customize Your System Prompt (Optional)

Edit prompts/default_prompt.md or create a new prompt file in the prompts/ folder. Then update SYSTEM_PROMPT_FILE in config.py to use your custom prompt.

Usage

1. Add Files to Process

Place the files you want to process in your input folder (the one you configured in config.py).

Supported file types:

  • Text files: .txt, .md
  • Documents: .pdf, .docx, .doc
  • Other file types supported by Gemini

2. Run the Processor

Make sure your virtual environment is activated, then run:

python batch_processor.py

3. Check the Results

Processed files will be saved as markdown (.md) files in your output folder, with the same name as the input file.

For example:

  • Input: document.pdf → Output: document.md
  • Input: notes.txt → Output: notes.md
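In pathlib terms, the output name is simply the input name with its suffix swapped for .md:

```python
from pathlib import Path

def output_name(input_file: str) -> str:
    # Replace the original extension with .md, keeping the stem.
    return Path(input_file).with_suffix(".md").name
```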

Configuration Options

All configuration is done in config.py:

| Setting | Description | Default |
| --- | --- | --- |
| MODEL_TYPE | Gemini model to use | gemini-1.5-pro |
| SYSTEM_PROMPT_FILE | System prompt filename (in prompts/) | default_prompt.md |
| INPUT_FOLDER | Folder containing files to process | User-configured |
| OUTPUT_FOLDER | Folder to save processed outputs | User-configured |
| MAX_WORKERS | Number of parallel API requests | 5 |
| GENERATION_CONFIG | Model generation parameters | See config.py |

How It Works

  1. Loads Configuration: Reads settings from config.py and API key from .env
  2. Loads System Prompt: Reads the system prompt from the specified file in prompts/
  3. Scans Input Folder: Finds all files in the input folder
  4. Parallel Processing:
    • For text files (.txt, .md): Reads content and sends directly to Gemini
    • For binary files (.pdf, .docx, etc.): Uploads to Gemini and references in prompt
  5. Saves Outputs: Saves each LLM response as a .md file in the output folder
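The steps above can be sketched roughly as follows. Here process_file is a stand-in for the actual Gemini request (the real tool reads text files directly and uploads binary ones); all names and structure in this sketch are assumptions, not the tool's actual code:

```python
# Rough sketch of the scan -> parallel dispatch -> save loop described above.
# process_file stands in for the real Gemini call; MAX_WORKERS mirrors the
# config setting of the same name.
from concurrent.futures import ThreadPoolExecutor, as_completed
from pathlib import Path

MAX_WORKERS = 5
TEXT_EXTENSIONS = {".txt", ".md"}

def process_file(path: Path) -> str:
    if path.suffix.lower() in TEXT_EXTENSIONS:
        content = path.read_text(encoding="utf-8")  # text is sent directly
    else:
        content = f"<uploaded {path.name}>"  # binary files would be uploaded
    return f"LLM response for: {content[:50]}"  # placeholder for the API reply

def run_batch(input_folder: Path, output_folder: Path) -> list[Path]:
    output_folder.mkdir(parents=True, exist_ok=True)
    files = [p for p in input_folder.iterdir() if p.is_file()]
    written = []
    with ThreadPoolExecutor(max_workers=MAX_WORKERS) as pool:
        futures = {pool.submit(process_file, p): p for p in files}
        for future in as_completed(futures):
            source = futures[future]
            target = output_folder / (source.stem + ".md")  # same name, .md
            target.write_text(future.result(), encoding="utf-8")
            written.append(target)
    return written
```

A thread pool (rather than multiprocessing) fits here because the work is I/O-bound: each worker spends most of its time waiting on the API.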

Troubleshooting

"GOOGLE_API_KEY not found"

  • Make sure you created a .env file (not .env.example)
  • Verify your API key is set correctly in .env

"Input folder not found"

  • Check that the folder path in config.py is correct
  • Make sure the folder exists on your system
  • On Windows, use raw strings: Path(r"C:\path\to\folder")

"System prompt file not found"

  • Verify the SYSTEM_PROMPT_FILE in config.py matches a file in the prompts/ folder
  • Make sure the file has a .md extension
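A quick way to verify this from a Python shell (the helper below is illustrative; pass whatever filename you set in config.py):

```python
# Check that the configured prompt file actually exists under prompts/.
from pathlib import Path

def prompt_exists(filename: str, prompts_dir: str = "prompts") -> bool:
    return (Path(prompts_dir) / filename).is_file()
```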

Rate Limit Errors

  • Decrease MAX_WORKERS in config.py to send fewer parallel requests
  • Check your Gemini API quota and limits

File Processing Errors

  • Check that your files are not corrupted
  • For PDFs and DOCX files, ensure they are valid documents
  • Check the console output for specific error messages

Advanced Usage

Multiple System Prompts

You can create multiple system prompt files for different use cases:

  1. Create new prompt files in prompts/:

    • prompts/summarize.md - For summarization tasks
    • prompts/analyze.md - For analysis tasks
    • prompts/translate.md - For translation tasks
  2. Switch between them by changing SYSTEM_PROMPT_FILE in config.py

Custom Generation Settings

Adjust the GENERATION_CONFIG in config.py to control output:

GENERATION_CONFIG = {
    "temperature": 0.7,        # Lower = more focused, Higher = more creative
    "top_p": 0.95,             # Nucleus sampling threshold
    "top_k": 40,               # Top-k sampling value
    "max_output_tokens": 8192, # Maximum length of response
}

License

This project is provided as-is for educational and personal use.

Support

For issues or questions:

  1. Check the Troubleshooting section above
  2. Review the Google Gemini API documentation
  3. Verify your API key and quota at Google AI Studio
