
FileUtils

A Python utility package for consistent file operations across local and Azure storage, with enhanced support for various data formats and flexible configuration options.

Features

  • Unified Storage Interface

    • Seamless operations across local and Azure Blob Storage
    • Consistent API regardless of storage backend
    • Automatic directory structure management
  • Comprehensive File Format Support

    • Tabular Data: CSV (with delimiter auto-detection), Excel (.xlsx, .xls) with multi-sheet support, Parquet (with compression options)
    • Document Formats: Microsoft PowerPoint (.pptx), Microsoft Word (.docx) with template support, Markdown (.md) with YAML frontmatter, PDF (read-only text extraction)
    • Multi-Purpose Formats: JSON and YAML support both DataFrame storage and structured document handling with automatic pandas type conversion
    • Excel ↔ CSV Round-Trip: Convert Excel workbooks to CSV files with structure preservation, and reconstruct Excel workbooks from modified CSV files
  • Advanced Data Handling

    • Single and multi-DataFrame operations
    • Automatic format detection
    • Flexible data orientation options
    • Customizable save/load options per format (see the sketch after this list)
    • Dynamic Subdirectory Creation: Specify nested subdirectories for saving files on the fly using the sub_path parameter.
    • File Format Handling: Loads data based on file extension; includes CSV delimiter auto-detection during load.
  • Robust Infrastructure

    • YAML-based configuration system
    • Comprehensive error handling
    • Detailed logging with configurable levels
    • Type hints throughout the codebase
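
As an example of the per-format options and auto-detection mentioned above, here is a minimal sketch; it assumes extra keyword arguments such as compression are forwarded to the underlying pandas writer, which the feature list implies but does not spell out:

from FileUtils import FileUtils, OutputFileType
import pandas as pd

file_utils = FileUtils()
df = pd.DataFrame({"id": [1, 2, 3], "value": [0.5, 1.5, 2.5]})

# Parquet with a compression option (assumed pass-through to DataFrame.to_parquet)
file_utils.save_data_to_storage(
    data=df,
    file_name="events",
    output_type="processed",
    output_filetype=OutputFileType.PARQUET,
    compression="snappy",
)

# CSV delimiter auto-detection: no separator argument needed on load,
# even for a semicolon-delimited file (hypothetical filename)
df_back = file_utils.load_single_file("events_semicolon.csv", input_type="raw")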

New in v0.8

  • Python 3.11+ is now required
  • Convenience: FileUtils.save_bytes(...) for saving raw bytes (e.g., images)
  • Typed enums for directories: InputType, OutputArea
  • Optional structured results via structured_result=True returning SaveResult
  • Helper: FileUtils.open_run(sub_path_prefix, customer) -> (sub_path, run_id)
  • Deprecations: save_dataframes(file_format=...), utils.common.get_logger, FileUtils._get_default_config()

New in v0.8.3

  • File System Operations:
    • file_exists(): Check if files exist without raising exceptions
    • list_directory(): List files and directories with pattern filtering
    • Enhanced create_directory(): More flexible directory creation with a new, backward-compatible signature

Installation

You can install directly from GitHub and choose the installation option that best suits your needs:

# Basic installation
pip install git+https://github.com/topij/FileUtils.git

# With specific features
pip install 'git+https://github.com/topij/FileUtils.git#egg=FileUtils[azure]'
pip install 'git+https://github.com/topij/FileUtils.git#egg=FileUtils[all]'

Quick Start

from FileUtils import FileUtils, OutputFileType

# Initialize with local storage
file_utils = FileUtils()

# Load data (format is auto-detected)
df = file_utils.load_single_file("data.csv", input_type="raw")

# Save as JSON with custom options
file_utils.save_data_to_storage(
    data=df,
    file_name="output",
    output_type="processed",
    output_filetype=OutputFileType.JSON,
    orient="records",  # Save as list of records
)

# Save multiple DataFrames to Excel
data_dict = {
    "Sheet1": df1,
    "Sheet2": df2
}
file_utils.save_data_to_storage(
    data=data_dict,
    file_name="multi_sheet",
    output_type="processed",
    output_filetype=OutputFileType.XLSX
)

# Save to a dynamic subdirectory
file_utils.save_data_to_storage(
    data=df,
    file_name="report_data",
    output_type="processed",
    output_filetype=OutputFileType.CSV,
    sub_path="analysis_run_1/summaries" # New subdirectory
)
# File saved to: data/processed/analysis_run_1/summaries/report_data_<timestamp>.csv

# Load data from the dynamic subdirectory
loaded_report = file_utils.load_single_file(
    file_path="report_data.csv", # Just the filename
    input_type="processed",
    sub_path="analysis_run_1/summaries" # Specify the sub_path
)

# Save raw bytes (e.g., PNG) and get the path/URL if cloud-backed
from FileUtils.core.enums import OutputArea
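png_bytes = b"\x89PNG..."  # placeholder: raw image bytes produced elsewhere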
chart_path = file_utils.save_bytes(
    content=png_bytes,
    file_stem="chart_q1",
    sub_path="runs/acme/images",
    output_type=OutputArea.PROCESSED,  # or "processed"
    file_ext="png",
)

# Structured results (SaveResult) for easier downstream handling
from FileUtils import SaveResult
res_map, _ = file_utils.save_data_to_storage(
    data={"Sheet1": df1, "Sheet2": df2},
    output_filetype=OutputFileType.XLSX,
    file_name="multi_sheet",
    structured_result=True,
)
first: SaveResult = next(iter(res_map.values()))
print(first.path, first.url)

# Typed enums for directories
from FileUtils.core.enums import InputType
doc = file_utils.load_document_from_storage("readme.md", input_type=InputType.RAW)

# Standardize run folders
sub_path, run_id = file_utils.open_run(sub_path_prefix="presentations", customer="ACME")
print(sub_path, run_id)

# File system operations (v0.8.3+)
# Check if file exists (never raises exceptions)
if file_utils.file_exists("config.yml", input_type="config", sub_path="ACME"):
    config = file_utils.load_yaml("config.yml", input_type="config", sub_path="ACME")

# List files in directory with pattern filtering
config_files = file_utils.list_directory(
    input_type="config", 
    sub_path="ACME", 
    pattern="*.yml"
)

# Enhanced directory creation
dir_path = file_utils.create_directory(
    "charts", 
    input_type="processed", 
    sub_path="presentations/ACME/run123"
)

Logging Control

FileUtils logs operations at INFO level by default. For automation scripts and structured output scenarios (JSON, XML, etc.), you can control logging verbosity:

Quiet Mode

# Suppress all logging except CRITICAL errors
fu = FileUtils(quiet=True)

# Perfect for JSON output scenarios
import json
result = fu.save_data_to_storage(data, output_filetype=OutputFileType.JSON, ...)
print(json.dumps({"status": "success", "files": result}))

Custom Log Level

import logging

# Show only warnings and errors
fu = FileUtils(log_level=logging.WARNING)

# Debug mode for development
fu = FileUtils(log_level=logging.DEBUG)

# Also accepts string levels (backward compatible)
fu = FileUtils(log_level="WARNING")

Document Handling

FileUtils now supports rich document formats perfect for AI/agentic workflows:

from FileUtils import FileUtils, OutputFileType

# Initialize FileUtils
file_utils = FileUtils()

# Save Markdown with YAML frontmatter
document_content = {
    "frontmatter": {
        "title": "AI Analysis Report",
        "author": "AI Agent",
        "confidence": 0.95
    },
    "body": """# Analysis Results

## Key Findings
- Pattern detected with 94.2% confidence
- 3 anomalies identified
- Recommended actions: Update model, retrain

## Next Steps
1. Review findings
2. Implement recommendations
3. Schedule follow-up
"""
}

saved_path, _ = file_utils.save_document_to_storage(
    content=document_content,
    output_filetype=OutputFileType.MARKDOWN,
    output_type="processed",
    file_name="ai_analysis",
    sub_path="reports/2024"
)

# Enhanced DOCX with Template Support
markdown_content = """# Project Report

## Executive Summary
This is a comprehensive analysis of our project progress.

## Key Findings
- **Important**: We've achieved 95% completion
- [ ] Complete final testing
- [x] Update documentation

| Metric | Value | Status |
|--------|-------|--------|
| Progress | 95% | ✅ On Track |
| Budget | $45,000 | ✅ Under Budget |
"""

# Convert markdown to DOCX with template
saved_path, _ = file_utils.save_document_to_storage(
    content=markdown_content,
    output_filetype=OutputFileType.DOCX,
    output_type="processed",
    file_name="project_report",
    template="review",  # Use specific template
    add_provenance=True,
    add_reviewer_instructions=True
)

# Save structured DOCX document
docx_content = {
    "title": "Project Report",
    "sections": [
        {
            "heading": "Executive Summary",
            "level": 1,
            "text": "Project completed successfully."
        },
        {
            "heading": "Results",
            "level": 2,
            "table": [
                ["Metric", "Value"],
                ["Accuracy", "95.2%"],
                ["Speed", "2.3s"]
            ]
        }
    ]
}

saved_path, _ = file_utils.save_document_to_storage(
    content=docx_content,
    output_filetype=OutputFileType.DOCX,
    output_type="processed",
    file_name="project_report"
)

# Load document content
loaded_content = file_utils.load_document_from_storage(
    file_path="ai_analysis.md",
    input_type="processed",
    sub_path="reports/2024"
)

# Save structured JSON configuration
config_data = {
    "database": {
        "host": "localhost",
        "port": 5432,
        "name": "analytics"
    },
    "api": {
        "timeout": 30,
        "retries": 3,
        "base_url": "https://api.example.com"
    },
    "features": {
        "enable_caching": True,
        "cache_ttl": 3600
    }
}

saved_path, _ = file_utils.save_document_to_storage(
    content=config_data,
    output_filetype=OutputFileType.JSON,
    output_type="processed",
    file_name="app_config"
)

# Load configuration
loaded_config = file_utils.load_json(
    file_path="app_config.json",
    input_type="processed"
)

# Automatic type conversion for pandas data
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'date': pd.date_range('2024-01-01', periods=5),
    'value': np.random.randn(5),
    'category': ['A', 'B', 'C', 'D', 'E']
})

# This works without manual conversion!
json_data = {
    'metadata': {
        'created': pd.Timestamp.now(),
        'total_records': len(df)
    },
    'data': df.to_dict('records')  # Pandas Timestamps automatically converted
}

saved_path, _ = file_utils.save_document_to_storage(
    content=json_data,
    output_filetype=OutputFileType.JSON,
    output_type="processed",
    file_name="data_with_types"
)

Enhanced DOCX Template System

FileUtils includes a comprehensive DOCX template system with:

  • Template Support: Use existing DOCX files with custom styles (not .dotx template files)
  • Markdown Conversion: Convert markdown content to professionally formatted DOCX
  • Style Mapping: Customize how elements are styled in the output
  • Reviewer Workflow: Built-in support for document review processes
  • Provenance Tracking: Automatic metadata and source tracking

Important: FileUtils uses regular .docx files as templates, not Microsoft Word .dotx template files. The system loads the DOCX file, clears its content, and preserves the styles for use in the generated documents.
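
A rough sketch of that mechanism using python-docx directly (for illustration only; this is not FileUtils' internal implementation, and the element removal relies on python-docx private attributes):

from docx import Document

# Load a regular .docx whose style definitions we want to reuse
doc = Document("templates/style-template-doc.docx")

# Clear the existing body paragraphs while keeping the style definitions
for paragraph in list(doc.paragraphs):
    paragraph._element.getparent().remove(paragraph._element)

# The template's styles remain available for generated content
doc.add_paragraph("Generated content", style="Normal")
doc.save("generated.docx")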

# Initialize with template configuration
file_utils = FileUtils(
    config_override={
        "docx_templates": {
            "template_dir": "templates",
            "templates": {
                "default": "style-template-doc.docx",  # Generic template
                "personal": "IP-template-doc.docx"      # Personal template
            }
        },
        "style_mapping": {
            "table": "IP-table_light",
            "heading_1": "Heading 1"
        }
    }
)

# Convert markdown to DOCX with template
file_utils.save_document_to_storage(
    content=markdown_content,
    output_filetype=OutputFileType.DOCX,
    template="review",
    add_provenance=True,
    add_reviewer_instructions=True
)

For detailed DOCX template documentation, see Enhanced DOCX Guide.

Excel ↔ CSV Round-Trip Conversion

Complete round-trip workflow for Excel workbooks: Excel → CSV → Excel. Perfect for data processing pipelines where you need to work with individual CSV files but distribute results as consolidated Excel workbooks.

Key Features:

  • Excel → CSV: Convert each Excel sheet to separate CSV files with structure preservation
  • CSV → Excel: Reconstruct Excel workbooks from modified CSV files using structure JSON
  • Structure Preservation: JSON metadata maintains workbook relationships and sheet information
  • Data Modification: Work with individual CSV files, then reconstruct for distribution
  • Error Handling: Graceful handling of missing or modified files

# Step 1: Convert Excel to CSV with structure preservation
csv_files, structure_file = file_utils.convert_excel_to_csv_with_structure(
    excel_file_path="workbook.xlsx",
    file_name="converted_workbook",
    preserve_structure=True
)

# Step 2: Work with individual CSV files
sales_df = file_utils.load_single_file("converted_workbook_Sales.csv", input_type="processed")
# ... modify data ...
file_utils.save_data_to_storage(sales_df, output_filetype=OutputFileType.CSV, 
                                file_name="converted_workbook_Sales", include_timestamp=False)

# Step 3: Reconstruct Excel workbook from modified CSV files
excel_path = file_utils.convert_csv_to_excel_workbook(
    structure_json_path=structure_file,
    file_name="reconstructed_workbook"
)

Structure JSON includes:

  • Workbook metadata (source file, conversion timestamp, sheet count)
  • Sheet details (dimensions, columns, data types, null counts)
  • Data quality metrics (memory usage, index information)
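
To inspect what was captured, you can read the file directly (a minimal sketch; the exact key names are defined by the generated file and are not guaranteed here):

import json
from pathlib import Path

# structure_file comes from Step 1 of the round-trip above
structure = json.loads(Path(structure_file).read_text(encoding="utf-8"))
print(sorted(structure.keys()))  # workbook metadata and per-sheet details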

Configurable Directory Names

Customize directory names to match your project domain. Perfect for document processing, content creation, or research projects.

Configuration:

# config.yaml
directories:
  data_directory: "documents"  # Instead of "data"
  subdirectories:
    raw: "product_docs"        # Instead of "raw"
    processed: "cs_documents"  # Instead of "processed"
    templates: "templates"     # DOCX templates

Usage:

# Initialize with custom configuration
file_utils = FileUtils(config_file="config.yaml")

# Files will be saved to:
# documents/product_docs/     (instead of data/raw/)
# documents/cs_documents/     (instead of data/processed/)
# documents/templates/        (DOCX templates)

# All FileUtils operations work seamlessly
file_utils.save_data_to_storage(data, output_filetype=OutputFileType.CSV, 
                                output_type="raw")  # → documents/product_docs/

Perfect for:

  • Document Projects: documents/ instead of data/
  • Content Creation: assets/ instead of data/
  • Research: experiments/ instead of data/
  • Customer Success: cs_workflow/ instead of data/

Root-Level Directories

FileUtils can work with directories outside the data folder at the project root level. This is useful for configuration files, logs, documentation, or any other project-level directories.

Usage:

from FileUtils import FileUtils, OutputFileType

file_utils = FileUtils()

# Save configuration to config directory at project root
config_data = {
    "database": {"host": "localhost", "port": 5432},
    "api": {"timeout": 30}
}

file_utils.save_document_to_storage(
    content=config_data,
    output_filetype=OutputFileType.JSON,
    output_type="config",
    file_name="app_config",
    root_level=True  # Creates config/ at project root, not data/config/
)

# Load configuration from root-level config directory
loaded_config = file_utils.load_json(
    file_path="app_config.json",
    input_type="config",
    root_level=True
)

# Save logs to logs directory at project root
file_utils.save_data_to_storage(
    data=log_df,
    output_filetype=OutputFileType.CSV,
    output_type="logs",
    file_name="application_logs",
    root_level=True  # Creates logs/ at project root
)

Key Points:

  • root_level=True: Directory is created at project root (e.g., project_root/config/)
  • root_level=False (default): Directory is under data directory (e.g., project_root/data/processed/)
  • Works with all file operations: save, load, documents, Excel conversion, etc.
  • Can use sub_path with root-level directories for nested structures
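
For example, combining the two (a sketch reusing config_data from above; the file name and sub_path here are illustrative):

file_utils.save_document_to_storage(
    content=config_data,
    output_filetype=OutputFileType.JSON,
    output_type="config",
    file_name="db_settings",
    sub_path="environments/prod",
    root_level=True,  # → project_root/config/environments/prod/
)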

Examples

FileUtils includes comprehensive example scripts demonstrating various use cases:

Quick Examples

# Run example scripts
python examples/data_pipeline.py              # Complete data pipeline
python examples/ai_workflow.py                # AI/agentic workflows
python examples/multi_format_reports.py       # Multi-format reporting
python examples/enhanced_docx.py              # Enhanced DOCX template system
python examples/excel_to_csv_conversion.py    # Excel workbook to CSV conversion
python examples/excel_csv_roundtrip.py        # Complete Excel ↔ CSV round-trip workflow
python examples/error_handling.py             # Robust error handling
python examples/performance_optimization.py   # Large dataset optimization

Example Scripts Overview

  • basic_usage.py - Basic operations (CSV, Excel, metadata)
  • data_pipeline.py - Complete data science pipeline (7,000+ records)
  • ai_workflow.py - AI integration (sentiment analysis, recommendations)
  • enhanced_docx.py - Enhanced DOCX template system (markdown conversion, templates)
  • excel_to_csv_conversion.py - Excel workbook to CSV conversion with structure preservation
  • excel_csv_roundtrip.py - Complete Excel ↔ CSV round-trip workflow (Excel → CSV → Excel)
  • multi_format_reports.py - Same data → Excel, PDF, Markdown, JSON
  • error_handling.py - Production-ready error handling and recovery
  • performance_optimization.py - Large dataset optimization (50MB+)
  • document_types.py - Document functionality (DOCX, Markdown, PDF)
  • configuration.py - Configuration options and customization
  • azure_storage.py - Azure Blob Storage integration
  • FileUtils_tutorial.ipynb - Comprehensive interactive tutorial

📚 Complete Examples Documentation - Detailed guide to all example scripts

Testing

FileUtils includes comprehensive test coverage with 50+ tests ensuring reliability and data integrity:

  • Unit Tests: Individual method validation with controlled inputs
  • Integration Tests: Complete workflow testing with realistic business data
  • Excel ↔ CSV Tests: Dedicated tests for round-trip conversion functionality
  • Error Handling: Graceful failure scenarios and edge cases
  • Data Integrity: Validation of data preservation through all operations

# Run all tests
pytest

# Run Excel ↔ CSV conversion tests
pytest -k "excel_csv"

# Run with coverage
pytest --cov=FileUtils

📚 Testing Documentation - Complete test coverage guide and methodology

Key Benefits

  • Consistency: Same interface for local and cloud storage operations
  • Flexibility: Extensive options for each file format
  • Reliability: Robust error handling and logging
  • Simplicity: Intuitive API with sensible defaults
  • Configurable Directories: Customize directory names for domain-specific projects
  • Smart Type Handling: Automatic conversion of pandas types for JSON/YAML documents
  • Intelligent File Discovery: Automatic handling of timestamped files when loading
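
That last point in action (a minimal sketch: timestamped saving is the default, per the Quick Start, and loading by the plain name resolves the timestamped file):

import pandas as pd
from FileUtils import FileUtils, OutputFileType

file_utils = FileUtils()
df = pd.DataFrame({"metric": ["accuracy"], "value": [0.95]})

file_utils.save_data_to_storage(
    data=df,
    file_name="metrics",
    output_type="processed",
    output_filetype=OutputFileType.CSV,
)  # saved as e.g. data/processed/metrics_<timestamp>.csv

# Load by plain name; the timestamped file is discovered automatically
df_back = file_utils.load_single_file("metrics.csv", input_type="processed")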

Background

This package was developed to streamline data operations across various projects, particularly in data science and analysis workflows. It eliminates the need to rewrite common file handling code and provides a consistent interface regardless of the storage backend.

For a practical example, check out my semantic text analyzer project, which uses FileUtils for seamless data handling across local and Azure environments.

Requirements

Core Dependencies (automatically installed)

  • Python 3.11+
  • pandas
  • pyyaml
  • python-dotenv
  • jsonschema

Optional Dependencies

Choose the dependencies you need based on your use case:

  • Azure Storage ([azure]):
    • azure-storage-blob
    • azure-identity
  • Parquet Support ([parquet]):
    • pyarrow
  • Excel Support ([excel]):
    • openpyxl
  • Document Formats ([documents]):
    • python-docx (Microsoft Word documents)
    • markdown (Markdown processing)
    • PyMuPDF (PDF text extraction; read-only, as noted above)

Install optional dependencies using the corresponding extra (e.g., pip install 'FileUtils[documents]').

Notes from the Author

This package prioritizes functionality and ease of use over performance optimization. While I use it in all of my projects, it's maintained as a side project.

License

MIT License
