
FileUtils

A Python utility package for consistent file operations across local and Azure storage, with enhanced support for various data formats and flexible configuration options.

Features

  • Unified Storage Interface

    • Seamless operations across local and Azure Blob Storage
    • Consistent API regardless of storage backend
    • Automatic directory structure management
  • Comprehensive File Format Support

    • Tabular Data: CSV (with delimiter auto-detection), Excel (.xlsx, .xls) with multi-sheet support, Parquet (with compression options)
    • Document Formats: Microsoft PowerPoint (.pptx), Microsoft Word (.docx) with template support, Markdown (.md) with YAML frontmatter, PDF (read-only text extraction)
    • Multi-Purpose Formats: JSON and YAML support both DataFrame storage and structured document handling with automatic pandas type conversion
    • Excel ↔ CSV Round-Trip: Convert Excel workbooks to CSV files with structure preservation, and reconstruct Excel workbooks from modified CSV files
  • Advanced Data Handling

    • Single and multi-DataFrame operations
    • Automatic format detection
    • Flexible data orientation options
    • Customizable save/load options per format (see the sketch after this list)
    • Dynamic Subdirectory Creation: Specify nested subdirectories for saving files on the fly using the sub_path parameter.
    • File Format Handling: Loads data based on file extension; includes CSV delimiter auto-detection during load.
  • Robust Infrastructure

    • YAML-based configuration system
    • Comprehensive error handling
    • Detailed logging with configurable levels
    • Type hints throughout the codebase
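
As an example of the per-format options and auto-detection mentioned above, here is a minimal sketch; it assumes extra keyword arguments such as compression are forwarded to the underlying pandas writer, which the feature list implies but does not spell out:

from FileUtils import FileUtils, OutputFileType
import pandas as pd

file_utils = FileUtils()
df = pd.DataFrame({"id": [1, 2, 3], "value": [0.5, 1.5, 2.5]})

# Parquet with a compression option (assumed pass-through to DataFrame.to_parquet)
file_utils.save_data_to_storage(
    data=df,
    file_name="events",
    output_type="processed",
    output_filetype=OutputFileType.PARQUET,
    compression="snappy",
)

# CSV delimiter auto-detection: no separator argument needed on load,
# even for a semicolon-delimited file (hypothetical filename)
df_back = file_utils.load_single_file("events_semicolon.csv", input_type="raw")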

New in v0.8

  • Python 3.11+ is now required
  • Convenience: FileUtils.save_bytes(...) for saving raw bytes (e.g., images)
  • Typed enums for directories: InputType, OutputArea
  • Optional structured results via structured_result=True returning SaveResult
  • Helper: FileUtils.open_run(sub_path_prefix, customer) -> (sub_path, run_id)
  • Deprecations: save_dataframes(file_format=...), utils.common.get_logger, FileUtils._get_default_config()

New in v0.8.3

  • File System Operations:
    • file_exists(): Check if files exist without raising exceptions
    • list_directory(): List files and directories with pattern filtering
    • Enhanced create_directory(): More flexible directory creation with a new, backward-compatible signature

Installation

You can install directly from GitHub and choose the installation option that best suits your needs:

# Basic installation
pip install git+https://github.com/topij/FileUtils.git

# With specific features
pip install 'git+https://github.com/topij/FileUtils.git#egg=FileUtils[azure]'
pip install 'git+https://github.com/topij/FileUtils.git#egg=FileUtils[all]'

Quick Start

from FileUtils import FileUtils, OutputFileType

# Initialize with local storage
file_utils = FileUtils()

# Load data (format is auto-detected)
df = file_utils.load_single_file("data.csv", input_type="raw")

# Save as JSON with custom options
file_utils.save_data_to_storage(
    data=df,
    file_name="output",
    output_type="processed",
    output_filetype=OutputFileType.JSON,
    orient="records",  # Save as list of records
)

# Save multiple DataFrames to Excel
data_dict = {
    "Sheet1": df1,
    "Sheet2": df2
}
file_utils.save_data_to_storage(
    data=data_dict,
    file_name="multi_sheet",
    output_type="processed",
    output_filetype=OutputFileType.XLSX
)

# Save to a dynamic subdirectory
file_utils.save_data_to_storage(
    data=df,
    file_name="report_data",
    output_type="processed",
    output_filetype=OutputFileType.CSV,
    sub_path="analysis_run_1/summaries" # New subdirectory
)
# File saved to: data/processed/analysis_run_1/summaries/report_data_<timestamp>.csv

# Load data from the dynamic subdirectory
loaded_report = file_utils.load_single_file(
    file_path="report_data.csv", # Just the filename
    input_type="processed",
    sub_path="analysis_run_1/summaries" # Specify the sub_path
)

# Save raw bytes (e.g., PNG) and get the path/URL if cloud-backed
from FileUtils.core.enums import OutputArea
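png_bytes = b"\x89PNG..."  # placeholder: raw image bytes produced elsewhere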
chart_path = file_utils.save_bytes(
    content=png_bytes,
    file_stem="chart_q1",
    sub_path="runs/acme/images",
    output_type=OutputArea.PROCESSED,  # or "processed"
    file_ext="png",
)

# Structured results (SaveResult) for easier downstream handling
from FileUtils import SaveResult
res_map, _ = file_utils.save_data_to_storage(
    data={"Sheet1": df1, "Sheet2": df2},
    output_filetype=OutputFileType.XLSX,
    file_name="multi_sheet",
    structured_result=True,
)
first: SaveResult = next(iter(res_map.values()))
print(first.path, first.url)

# Typed enums for directories
from FileUtils.core.enums import InputType
doc = file_utils.load_document_from_storage("readme.md", input_type=InputType.RAW)

# Standardize run folders
sub_path, run_id = file_utils.open_run(sub_path_prefix="presentations", customer="ACME")
print(sub_path, run_id)

# File system operations (v0.8.3+)
# Check if file exists (never raises exceptions)
if file_utils.file_exists("config.yml", input_type="config", sub_path="ACME"):
    config = file_utils.load_yaml("config.yml", input_type="config", sub_path="ACME")

# List files in directory with pattern filtering
config_files = file_utils.list_directory(
    input_type="config", 
    sub_path="ACME", 
    pattern="*.yml"
)

# Enhanced directory creation
dir_path = file_utils.create_directory(
    "charts", 
    input_type="processed", 
    sub_path="presentations/ACME/run123"
)

Logging Control

FileUtils logs operations at INFO level by default. For automation scripts and structured output scenarios (JSON, XML, etc.), you can control logging verbosity:

Quiet Mode

# Suppress all logging except CRITICAL errors
fu = FileUtils(quiet=True)

# Perfect for JSON output scenarios
import json
result = fu.save_data_to_storage(data, output_filetype=OutputFileType.JSON, ...)
print(json.dumps({"status": "success", "files": result}))

Custom Log Level

import logging

# Show only warnings and errors
fu = FileUtils(log_level=logging.WARNING)

# Debug mode for development
fu = FileUtils(log_level=logging.DEBUG)

# Also accepts string levels (backward compatible)
fu = FileUtils(log_level="WARNING")

Document Handling

FileUtils now supports rich document formats perfect for AI/agentic workflows:

from FileUtils import FileUtils, OutputFileType

# Initialize FileUtils
file_utils = FileUtils()

# Save Markdown with YAML frontmatter
document_content = {
    "frontmatter": {
        "title": "AI Analysis Report",
        "author": "AI Agent",
        "confidence": 0.95
    },
    "body": """# Analysis Results

## Key Findings
- Pattern detected with 94.2% confidence
- 3 anomalies identified
- Recommended actions: Update model, retrain

## Next Steps
1. Review findings
2. Implement recommendations
3. Schedule follow-up
"""
}

saved_path, _ = file_utils.save_document_to_storage(
    content=document_content,
    output_filetype=OutputFileType.MARKDOWN,
    output_type="processed",
    file_name="ai_analysis",
    sub_path="reports/2024"
)

# Enhanced DOCX with Template Support
markdown_content = """# Project Report

## Executive Summary
This is a comprehensive analysis of our project progress.

## Key Findings
- **Important**: We've achieved 95% completion
- [ ] Complete final testing
- [x] Update documentation

| Metric | Value | Status |
|--------|-------|--------|
| Progress | 95% | ✅ On Track |
| Budget | $45,000 | ✅ Under Budget |
"""

# Convert markdown to DOCX with template
saved_path, _ = file_utils.save_document_to_storage(
    content=markdown_content,
    output_filetype=OutputFileType.DOCX,
    output_type="processed",
    file_name="project_report",
    template="review",  # Use specific template
    add_provenance=True,
    add_reviewer_instructions=True
)

# Save structured DOCX document
docx_content = {
    "title": "Project Report",
    "sections": [
        {
            "heading": "Executive Summary",
            "level": 1,
            "text": "Project completed successfully."
        },
        {
            "heading": "Results",
            "level": 2,
            "table": [
                ["Metric", "Value"],
                ["Accuracy", "95.2%"],
                ["Speed", "2.3s"]
            ]
        }
    ]
}

saved_path, _ = file_utils.save_document_to_storage(
    content=docx_content,
    output_filetype=OutputFileType.DOCX,
    output_type="processed",
    file_name="project_report"
)

# Load document content
loaded_content = file_utils.load_document_from_storage(
    file_path="ai_analysis.md",
    input_type="processed",
    sub_path="reports/2024"
)

# Save structured JSON configuration
config_data = {
    "database": {
        "host": "localhost",
        "port": 5432,
        "name": "analytics"
    },
    "api": {
        "timeout": 30,
        "retries": 3,
        "base_url": "https://api.example.com"
    },
    "features": {
        "enable_caching": True,
        "cache_ttl": 3600
    }
}

saved_path, _ = file_utils.save_document_to_storage(
    content=config_data,
    output_filetype=OutputFileType.JSON,
    output_type="processed",
    file_name="app_config"
)

# Load configuration
loaded_config = file_utils.load_json(
    file_path="app_config.json",
    input_type="processed"
)

# Automatic type conversion for pandas data
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'date': pd.date_range('2024-01-01', periods=5),
    'value': np.random.randn(5),
    'category': ['A', 'B', 'C', 'D', 'E']
})

# This works without manual conversion!
json_data = {
    'metadata': {
        'created': pd.Timestamp.now(),
        'total_records': len(df)
    },
    'data': df.to_dict('records')  # Pandas Timestamps automatically converted
}

saved_path, _ = file_utils.save_document_to_storage(
    content=json_data,
    output_filetype=OutputFileType.JSON,
    output_type="processed",
    file_name="data_with_types"
)

Enhanced DOCX Template System

FileUtils includes a comprehensive DOCX template system with:

  • Template Support: Use existing DOCX files with custom styles (not .dotx template files)
  • Markdown Conversion: Convert markdown content to professionally formatted DOCX
  • Style Mapping: Customize how elements are styled in the output
  • Reviewer Workflow: Built-in support for document review processes
  • Provenance Tracking: Automatic metadata and source tracking

Important: FileUtils uses regular .docx files as templates, not Microsoft Word .dotx template files. The system loads the DOCX file, clears its content, and preserves the styles for use in the generated documents.
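
A rough sketch of that mechanism using python-docx directly (for illustration only; this is not FileUtils' internal implementation, and the element removal relies on python-docx private attributes):

from docx import Document

# Load a regular .docx whose style definitions we want to reuse
doc = Document("templates/style-template-doc.docx")

# Clear the existing body paragraphs while keeping the style definitions
for paragraph in list(doc.paragraphs):
    paragraph._element.getparent().remove(paragraph._element)

# The template's styles remain available for generated content
doc.add_paragraph("Generated content", style="Normal")
doc.save("generated.docx")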

# Initialize with template configuration
file_utils = FileUtils(
    config_override={
        "docx_templates": {
            "template_dir": "templates",
            "templates": {
                "default": "style-template-doc.docx",  # Generic template
                "personal": "IP-template-doc.docx"      # Personal template
            }
        },
        "style_mapping": {
            "table": "IP-table_light",
            "heading_1": "Heading 1"
        }
    }
)

# Convert markdown to DOCX with template
file_utils.save_document_to_storage(
    content=markdown_content,
    output_filetype=OutputFileType.DOCX,
    template="review",
    add_provenance=True,
    add_reviewer_instructions=True
)

For detailed DOCX template documentation, see Enhanced DOCX Guide.

Excel ↔ CSV Round-Trip Conversion

Complete round-trip workflow for Excel workbooks: Excel → CSV → Excel. Perfect for data processing pipelines where you need to work with individual CSV files but distribute results as consolidated Excel workbooks.

Key Features:

  • Excel → CSV: Convert each Excel sheet to separate CSV files with structure preservation
  • CSV → Excel: Reconstruct Excel workbooks from modified CSV files using structure JSON
  • Structure Preservation: JSON metadata maintains workbook relationships and sheet information
  • Data Modification: Work with individual CSV files, then reconstruct for distribution
  • Error Handling: Graceful handling of missing or modified files

# Step 1: Convert Excel to CSV with structure preservation
csv_files, structure_file = file_utils.convert_excel_to_csv_with_structure(
    excel_file_path="workbook.xlsx",
    file_name="converted_workbook",
    preserve_structure=True
)

# Step 2: Work with individual CSV files
sales_df = file_utils.load_single_file("converted_workbook_Sales.csv", input_type="processed")
# ... modify data ...
file_utils.save_data_to_storage(sales_df, output_filetype=OutputFileType.CSV, 
                                file_name="converted_workbook_Sales", include_timestamp=False)

# Step 3: Reconstruct Excel workbook from modified CSV files
excel_path = file_utils.convert_csv_to_excel_workbook(
    structure_json_path=structure_file,
    file_name="reconstructed_workbook"
)

Structure JSON includes:

  • Workbook metadata (source file, conversion timestamp, sheet count)
  • Sheet details (dimensions, columns, data types, null counts)
  • Data quality metrics (memory usage, index information)
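
To inspect what was captured, you can read the file directly (a minimal sketch; the exact key names are defined by the generated file and are not guaranteed here):

import json
from pathlib import Path

# structure_file comes from Step 1 of the round-trip above
structure = json.loads(Path(structure_file).read_text(encoding="utf-8"))
print(sorted(structure.keys()))  # workbook metadata and per-sheet details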

Configurable Directory Names

Customize directory names to match your project domain. Perfect for document processing, content creation, or research projects.

Configuration:

# config.yaml
directories:
  data_directory: "documents"  # Instead of "data"
  subdirectories:
    raw: "product_docs"        # Instead of "raw"
    processed: "cs_documents"  # Instead of "processed"
    templates: "templates"     # DOCX templates

Usage:

# Initialize with custom configuration
file_utils = FileUtils(config_file="config.yaml")

# Files will be saved to:
# documents/product_docs/     (instead of data/raw/)
# documents/cs_documents/     (instead of data/processed/)
# documents/templates/        (DOCX templates)

# All FileUtils operations work seamlessly
file_utils.save_data_to_storage(data, output_filetype=OutputFileType.CSV, 
                                output_type="raw")  # → documents/product_docs/

Perfect for:

  • Document Projects: documents/ instead of data/
  • Content Creation: assets/ instead of data/
  • Research: experiments/ instead of data/
  • Customer Success: cs_workflow/ instead of data/

Root-Level Directories

FileUtils can work with directories outside the data folder at the project root level. This is useful for configuration files, logs, documentation, or any other project-level directories.

Usage:

from FileUtils import FileUtils, OutputFileType

file_utils = FileUtils()

# Save configuration to config directory at project root
config_data = {
    "database": {"host": "localhost", "port": 5432},
    "api": {"timeout": 30}
}

file_utils.save_document_to_storage(
    content=config_data,
    output_filetype=OutputFileType.JSON,
    output_type="config",
    file_name="app_config",
    root_level=True  # Creates config/ at project root, not data/config/
)

# Load configuration from root-level config directory
loaded_config = file_utils.load_json(
    file_path="app_config.json",
    input_type="config",
    root_level=True
)

# Save logs to logs directory at project root
file_utils.save_data_to_storage(
    data=log_df,
    output_filetype=OutputFileType.CSV,
    output_type="logs",
    file_name="application_logs",
    root_level=True  # Creates logs/ at project root
)

Key Points:

  • root_level=True: Directory is created at project root (e.g., project_root/config/)
  • root_level=False (default): Directory is under data directory (e.g., project_root/data/processed/)
  • Works with all file operations: save, load, documents, Excel conversion, etc.
  • Can use sub_path with root-level directories for nested structures
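
For example, combining the two (a sketch reusing config_data from above; the file name and sub_path here are illustrative):

file_utils.save_document_to_storage(
    content=config_data,
    output_filetype=OutputFileType.JSON,
    output_type="config",
    file_name="db_settings",
    sub_path="environments/prod",
    root_level=True,  # → project_root/config/environments/prod/
)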

Examples

FileUtils includes comprehensive example scripts demonstrating various use cases:

Quick Examples

# Run example scripts
python examples/data_pipeline.py              # Complete data pipeline
python examples/ai_workflow.py                # AI/agentic workflows
python examples/multi_format_reports.py       # Multi-format reporting
python examples/enhanced_docx.py              # Enhanced DOCX template system
python examples/excel_to_csv_conversion.py    # Excel workbook to CSV conversion
python examples/excel_csv_roundtrip.py        # Complete Excel ↔ CSV round-trip workflow
python examples/error_handling.py             # Robust error handling
python examples/performance_optimization.py   # Large dataset optimization

Example Scripts Overview

  • basic_usage.py - Basic operations (CSV, Excel, metadata)
  • data_pipeline.py - Complete data science pipeline (7,000+ records)
  • ai_workflow.py - AI integration (sentiment analysis, recommendations)
  • enhanced_docx.py - Enhanced DOCX template system (markdown conversion, templates)
  • excel_to_csv_conversion.py - Excel workbook to CSV conversion with structure preservation
  • excel_csv_roundtrip.py - Complete Excel ↔ CSV round-trip workflow (Excel → CSV → Excel)
  • multi_format_reports.py - Same data → Excel, PDF, Markdown, JSON
  • error_handling.py - Production-ready error handling and recovery
  • performance_optimization.py - Large dataset optimization (50MB+)
  • document_types.py - Document functionality (DOCX, Markdown, PDF)
  • configuration.py - Configuration options and customization
  • azure_storage.py - Azure Blob Storage integration
  • FileUtils_tutorial.ipynb - Comprehensive interactive tutorial

📚 Complete Examples Documentation - Detailed guide to all example scripts

Testing

FileUtils includes comprehensive test coverage with 50+ tests ensuring reliability and data integrity:

  • Unit Tests: Individual method validation with controlled inputs
  • Integration Tests: Complete workflow testing with realistic business data
  • Excel ↔ CSV Tests: Dedicated tests for round-trip conversion functionality
  • Error Handling: Graceful failure scenarios and edge cases
  • Data Integrity: Validation of data preservation through all operations

# Run all tests
pytest

# Run Excel ↔ CSV conversion tests
pytest -k "excel_csv"

# Run with coverage
pytest --cov=FileUtils

📚 Testing Documentation - Complete test coverage guide and methodology

Key Benefits

  • Consistency: Same interface for local and cloud storage operations
  • Flexibility: Extensive options for each file format
  • Reliability: Robust error handling and logging
  • Simplicity: Intuitive API with sensible defaults
  • Configurable Directories: Customize directory names for domain-specific projects
  • Smart Type Handling: Automatic conversion of pandas types for JSON/YAML documents
  • Intelligent File Discovery: Automatic handling of timestamped files when loading
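
That last point in action (a minimal sketch: timestamped saving is the default, per the Quick Start, and loading by the plain name resolves the timestamped file):

import pandas as pd
from FileUtils import FileUtils, OutputFileType

file_utils = FileUtils()
df = pd.DataFrame({"metric": ["accuracy"], "value": [0.95]})

file_utils.save_data_to_storage(
    data=df,
    file_name="metrics",
    output_type="processed",
    output_filetype=OutputFileType.CSV,
)  # saved as e.g. data/processed/metrics_<timestamp>.csv

# Load by plain name; the timestamped file is discovered automatically
df_back = file_utils.load_single_file("metrics.csv", input_type="processed")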

Background

This package was developed to streamline data operations across various projects, particularly in data science and analysis workflows. It eliminates the need to rewrite common file handling code and provides a consistent interface regardless of the storage backend.

For a practical example, check out my semantic text analyzer project, which uses FileUtils for seamless data handling across local and Azure environments.

Requirements

Core Dependencies (automatically installed)

  • Python 3.11+
  • pandas
  • pyyaml
  • python-dotenv
  • jsonschema

Optional Dependencies

Choose the dependencies you need based on your use case:

  • Azure Storage ([azure]):
    • azure-storage-blob
    • azure-identity
  • Parquet Support ([parquet]):
    • pyarrow
  • Excel Support ([excel]):
    • openpyxl
  • Document Formats ([documents]):
    • python-docx (Microsoft Word documents)
    • markdown (Markdown processing)
    • PyMuPDF (PDF text extraction; read-only, as noted above)

Install optional dependencies using the corresponding extra (e.g., pip install 'FileUtils[documents]').

Notes from the Author

This package prioritizes functionality and ease of use over performance optimization. While I use it in all of my projects, it's maintained as a side project.

License

MIT License
