A Python utility package for consistent file operations across local and Azure storage, with enhanced support for various data formats and flexible configuration options.
Unified Storage Interface
- Seamless operations across local and Azure Blob Storage
- Consistent API regardless of storage backend
- Automatic directory structure management
Comprehensive File Format Support
- Tabular Data: CSV (with delimiter auto-detection), Excel (.xlsx, .xls) with multi-sheet support, Parquet (with compression options)
- Document Formats: Microsoft PowerPoint (.pptx), Microsoft Word (.docx) with template support, Markdown (.md) with YAML frontmatter, PDF (read-only text extraction)
- Multi-Purpose Formats: JSON and YAML support both DataFrame storage and structured document handling with automatic pandas type conversion
- Excel ↔ CSV Round-Trip: Convert Excel workbooks to CSV files with structure preservation, and reconstruct Excel workbooks from modified CSV files
Advanced Data Handling
- Single and multi-DataFrame operations
- Automatic format detection
- Flexible data orientation options
- Customizable save/load options per format
- Dynamic Subdirectory Creation: Specify nested subdirectories for saving files on the fly using the `sub_path` parameter
- File Format Handling: Loads data based on file extension; includes CSV delimiter auto-detection during load
Robust Infrastructure
- YAML-based configuration system
- Comprehensive error handling
- Detailed logging with configurable levels
- Type hints throughout the codebase
- Python 3.11+ required
- Convenience: `FileUtils.save_bytes(...)` for saving raw bytes (e.g., images)
- Typed enums for directories: `InputType`, `OutputArea`
- Optional structured results via `structured_result=True`, returning `SaveResult`
- Helper: `FileUtils.open_run(prefix, customer) -> (sub_path, run_id)`
- Deprecations: `save_dataframes(file_format=...)`, `utils.common.get_logger`, `FileUtils._get_default_config()`
- File System Operations:
  - `file_exists()`: Check if files exist without raising exceptions
  - `list_directory()`: List files and directories with pattern filtering
  - Enhanced `create_directory()`: More flexible directory creation with a new signature (backward compatible)
You can install directly from GitHub and choose the installation option that best suits your needs:
```bash
# Basic installation
pip install git+https://github.com/topij/FileUtils.git
# With specific features
pip install 'git+https://github.com/topij/FileUtils.git#egg=FileUtils[azure]'
pip install 'git+https://github.com/topij/FileUtils.git#egg=FileUtils[all]'
```

Basic usage:

```python
from FileUtils import FileUtils, OutputFileType
# Initialize with local storage
file_utils = FileUtils()
# Load data (format is auto-detected)
df = file_utils.load_single_file("data.csv", input_type="raw")
# Save as JSON with custom options
file_utils.save_data_to_storage(
data=df,
file_name="output",
output_type="processed",
output_filetype=OutputFileType.JSON,
orient="records", # Save as list of records
)
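
# Save as Parquet with compression options (a hedged sketch: assumes an
# OutputFileType.PARQUET member exists and that format-specific keyword
# arguments such as compression are forwarded to the underlying pandas writer)
file_utils.save_data_to_storage(
    data=df,
    file_name="output_parquet",
    output_type="processed",
    output_filetype=OutputFileType.PARQUET,
    compression="snappy",
)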
# Save multiple DataFrames to Excel
data_dict = {
"Sheet1": df1,
"Sheet2": df2
}
file_utils.save_data_to_storage(
data=data_dict,
file_name="multi_sheet",
output_type="processed",
output_filetype=OutputFileType.XLSX
)
# Save to a dynamic subdirectory
file_utils.save_data_to_storage(
data=df,
file_name="report_data",
output_type="processed",
output_filetype=OutputFileType.CSV,
sub_path="analysis_run_1/summaries" # New subdirectory
)
# File saved to: data/processed/analysis_run_1/summaries/report_data_<timestamp>.csv
# Load data from the dynamic subdirectory
loaded_report = file_utils.load_single_file(
file_path="report_data.csv", # Just the filename
input_type="processed",
sub_path="analysis_run_1/summaries" # Specify the sub_path
)
# Save raw bytes (e.g., PNG) and get the path/URL if cloud-backed
from FileUtils.core.enums import OutputArea
chart_path = file_utils.save_bytes(
content=png_bytes,
file_stem="chart_q1",
sub_path="runs/acme/images",
output_type=OutputArea.PROCESSED, # or "processed"
file_ext="png",
)
# Structured results (SaveResult) for easier downstream handling
from FileUtils import SaveResult
res_map, _ = file_utils.save_data_to_storage(
data={"Sheet1": df1, "Sheet2": df2},
output_filetype=OutputFileType.XLSX,
file_name="multi_sheet",
structured_result=True,
)
first: SaveResult = next(iter(res_map.values()))
print(first.path, first.url)
# Typed enums for directories
from FileUtils.core.enums import InputType
doc = file_utils.load_document_from_storage("readme.md", input_type=InputType.RAW)
# Standardize run folders
sub_path, run_id = file_utils.open_run(sub_path_prefix="presentations", customer="ACME")
print(sub_path, run_id)
# File system operations (v0.8.3+)
# Check if file exists (never raises exceptions)
if file_utils.file_exists("config.yml", input_type="config", sub_path="ACME"):
config = file_utils.load_yaml("config.yml", input_type="config", sub_path="ACME")
# List files in directory with pattern filtering
config_files = file_utils.list_directory(
input_type="config",
sub_path="ACME",
pattern="*.yml"
)
# Enhanced directory creation
dir_path = file_utils.create_directory(
"charts",
input_type="processed",
sub_path="presentations/ACME/run123"
)
```

FileUtils logs operations at INFO level by default. For automation scripts and structured output scenarios (JSON, XML, etc.), you can control logging verbosity:

```python
# Suppress all logging except CRITICAL errors
fu = FileUtils(quiet=True)
# Perfect for JSON output scenarios
import json
result = fu.save_data_to_storage(data, output_filetype=OutputFileType.JSON, ...)
print(json.dumps({"status": "success", "files": result}))
```

You can also set a standard logging level:

```python
import logging
# Show only warnings and errors
fu = FileUtils(log_level=logging.WARNING)
# Debug mode for development
fu = FileUtils(log_level=logging.DEBUG)
# Also accepts string levels (backward compatible)
fu = FileUtils(log_level="WARNING")
```

FileUtils now supports rich document formats, perfect for AI/agentic workflows:

```python
from FileUtils import FileUtils, OutputFileType
# Initialize FileUtils
file_utils = FileUtils()
# Save Markdown with YAML frontmatter
document_content = {
"frontmatter": {
"title": "AI Analysis Report",
"author": "AI Agent",
"confidence": 0.95
},
"body": """# Analysis Results
## Key Findings
- Pattern detected with 94.2% confidence
- 3 anomalies identified
- Recommended actions: Update model, retrain
## Next Steps
1. Review findings
2. Implement recommendations
3. Schedule follow-up
"""
}
saved_path, _ = file_utils.save_document_to_storage(
content=document_content,
output_filetype=OutputFileType.MARKDOWN,
output_type="processed",
file_name="ai_analysis",
sub_path="reports/2024"
)
# Enhanced DOCX with Template Support
markdown_content = """# Project Report
## Executive Summary
This is a comprehensive analysis of our project progress.
## Key Findings
- **Important**: We've achieved 95% completion
- [ ] Complete final testing
- [x] Update documentation
| Metric | Value | Status |
|--------|-------|--------|
| Progress | 95% | ✅ On Track |
| Budget | $45,000 | ✅ Under Budget |
"""
# Convert markdown to DOCX with template
saved_path, _ = file_utils.save_document_to_storage(
content=markdown_content,
output_filetype=OutputFileType.DOCX,
output_type="processed",
file_name="project_report",
template="review", # Use specific template
add_provenance=True,
add_reviewer_instructions=True
)
# Save structured DOCX document
docx_content = {
"title": "Project Report",
"sections": [
{
"heading": "Executive Summary",
"level": 1,
"text": "Project completed successfully."
},
{
"heading": "Results",
"level": 2,
"table": [
["Metric", "Value"],
["Accuracy", "95.2%"],
["Speed", "2.3s"]
]
}
]
}
saved_path, _ = file_utils.save_document_to_storage(
content=docx_content,
output_filetype=OutputFileType.DOCX,
output_type="processed",
file_name="project_report"
)
# Load document content
loaded_content = file_utils.load_document_from_storage(
file_path="ai_analysis.md",
input_type="processed",
sub_path="reports/2024"
)
# Save structured JSON configuration
config_data = {
"database": {
"host": "localhost",
"port": 5432,
"name": "analytics"
},
"api": {
"timeout": 30,
"retries": 3,
"base_url": "https://api.example.com"
},
"features": {
"enable_caching": True,
"cache_ttl": 3600
}
}
saved_path, _ = file_utils.save_document_to_storage(
content=config_data,
output_filetype=OutputFileType.JSON,
output_type="processed",
file_name="app_config"
)
# Load configuration
loaded_config = file_utils.load_json(
file_path="app_config.json",
input_type="processed"
)
# Automatic type conversion for pandas data
import pandas as pd
import numpy as np
df = pd.DataFrame({
'date': pd.date_range('2024-01-01', periods=5),
'value': np.random.randn(5),
'category': ['A', 'B', 'C', 'D', 'E']
})
# This works without manual conversion!
json_data = {
'metadata': {
'created': pd.Timestamp.now(),
'total_records': len(df)
},
'data': df.to_dict('records') # Pandas Timestamps automatically converted
}
saved_path, _ = file_utils.save_document_to_storage(
content=json_data,
output_filetype=OutputFileType.JSON,
output_type="processed",
file_name="data_with_types"
)
```

FileUtils includes a comprehensive DOCX template system with:
- Template Support: Use existing DOCX files with custom styles (not .dotx template files)
- Markdown Conversion: Convert markdown content to professionally formatted DOCX
- Style Mapping: Customize how elements are styled in the output
- Reviewer Workflow: Built-in support for document review processes
- Provenance Tracking: Automatic metadata and source tracking
Important: FileUtils uses regular .docx files as templates, not Microsoft Word .dotx template files. The system loads the DOCX file, clears its content, and preserves the styles for use in the generated documents.

```python
# Initialize with template configuration
file_utils = FileUtils(
config_override={
"docx_templates": {
"template_dir": "templates",
"templates": {
"default": "style-template-doc.docx", # Generic template
"personal": "IP-template-doc.docx" # Personal template
}
},
"style_mapping": {
"table": "IP-table_light",
"heading_1": "Heading 1"
}
}
)
# Convert markdown to DOCX with template
file_utils.save_document_to_storage(
content=markdown_content,
output_filetype=OutputFileType.DOCX,
template="review",
add_provenance=True,
add_reviewer_instructions=True
)
```

For detailed DOCX template documentation, see the Enhanced DOCX Guide.
Complete round-trip workflow for Excel workbooks: Excel → CSV → Excel. Perfect for data processing pipelines where you need to work with individual CSV files but distribute results as consolidated Excel workbooks.
Key Features:
- Excel → CSV: Convert each Excel sheet to separate CSV files with structure preservation
- CSV → Excel: Reconstruct Excel workbooks from modified CSV files using structure JSON
- Structure Preservation: JSON metadata maintains workbook relationships and sheet information
- Data Modification: Work with individual CSV files, then reconstruct for distribution
- Error Handling: Graceful handling of missing or modified files

```python
# Step 1: Convert Excel to CSV with structure preservation
csv_files, structure_file = file_utils.convert_excel_to_csv_with_structure(
excel_file_path="workbook.xlsx",
file_name="converted_workbook",
preserve_structure=True
)
# Step 2: Work with individual CSV files
sales_df = file_utils.load_single_file("converted_workbook_Sales.csv", input_type="processed")
# ... modify data ...
file_utils.save_data_to_storage(sales_df, output_filetype=OutputFileType.CSV,
file_name="converted_workbook_Sales", include_timestamp=False)
# Step 3: Reconstruct Excel workbook from modified CSV files
excel_path = file_utils.convert_csv_to_excel_workbook(
structure_json_path=structure_file,
file_name="reconstructed_workbook"
)
```

The structure JSON includes:
- Workbook metadata (source file, conversion timestamp, sheet count)
- Sheet details (dimensions, columns, data types, null counts)
- Data quality metrics (memory usage, index information)
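
As a rough illustration, the structure file can be inspected like any other JSON document. The key names in this sketch are hypothetical rather than a documented schema; inspect your own structure file (returned as `structure_file` in Step 1 above) to see the actual layout:

```python
import json
from pathlib import Path

# structure_file is the path returned by convert_excel_to_csv_with_structure (Step 1).
# The keys accessed below are illustrative only, not a documented schema.
structure = json.loads(Path(structure_file).read_text(encoding="utf-8"))
print(structure.get("source_file"), structure.get("sheet_count"))
for sheet_name, info in structure.get("sheets", {}).items():
    print(sheet_name, info.get("columns"), info.get("dtypes"))
```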
Customize directory names to match your project domain. Perfect for document processing, content creation, or research projects.
Configuration:
```yaml
# config.yaml
directories:
  data_directory: "documents"        # Instead of "data"
  subdirectories:
    raw: "product_docs"              # Instead of "raw"
    processed: "cs_documents"        # Instead of "processed"
    templates: "templates"           # DOCX templates
```

Usage:

```python
# Initialize with custom configuration
file_utils = FileUtils(config_file="config.yaml")
# Files will be saved to:
# documents/product_docs/ (instead of data/raw/)
# documents/cs_documents/ (instead of data/processed/)
# documents/templates/ (DOCX templates)
# All FileUtils operations work seamlessly
file_utils.save_data_to_storage(data, output_filetype=OutputFileType.CSV,
    output_type="raw")  # → documents/product_docs/
```

Perfect for:
- Document Projects: `documents/` instead of `data/`
- Content Creation: `assets/` instead of `data/`
- Research: `experiments/` instead of `data/`
- Customer Success: `cs_workflow/` instead of `data/`
FileUtils can work with directories outside the data folder at the project root level. This is useful for configuration files, logs, documentation, or any other project-level directories.
Usage:

```python
from FileUtils import FileUtils, OutputFileType
file_utils = FileUtils()
# Save configuration to config directory at project root
config_data = {
"database": {"host": "localhost", "port": 5432},
"api": {"timeout": 30}
}
file_utils.save_document_to_storage(
content=config_data,
output_filetype=OutputFileType.JSON,
output_type="config",
file_name="app_config",
root_level=True # Creates config/ at project root, not data/config/
)
# Load configuration from root-level config directory
loaded_config = file_utils.load_json(
file_path="app_config.json",
input_type="config",
root_level=True
)
# Save logs to logs directory at project root
file_utils.save_data_to_storage(
data=log_df,
output_filetype=OutputFileType.CSV,
output_type="logs",
file_name="application_logs",
root_level=True # Creates logs/ at project root
)
```

Key Points:
- `root_level=True`: Directory is created at the project root (e.g., `project_root/config/`)
- `root_level=False` (default): Directory is created under the data directory (e.g., `project_root/data/processed/`)
- Works with all file operations: save, load, documents, Excel conversion, etc.
- Can use `sub_path` with root-level directories for nested structures (see the sketch below)
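
A minimal sketch combining the two parameters described above; the `ACME` sub-path and file name are purely illustrative:

```python
from FileUtils import FileUtils, OutputFileType

file_utils = FileUtils()
config_data = {"api": {"timeout": 30}}

# root_level=True puts the config directory at the project root;
# sub_path nests it further (illustrative layout: project_root/config/ACME/)
file_utils.save_document_to_storage(
    content=config_data,
    output_filetype=OutputFileType.JSON,
    output_type="config",
    file_name="customer_config",
    root_level=True,
    sub_path="ACME",
)
```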
FileUtils includes comprehensive example scripts demonstrating various use cases:
```bash
# Run example scripts
python examples/data_pipeline.py # Complete data pipeline
python examples/ai_workflow.py # AI/agentic workflows
python examples/multi_format_reports.py # Multi-format reporting
python examples/enhanced_docx.py # Enhanced DOCX template system
python examples/excel_to_csv_conversion.py # Excel workbook to CSV conversion
python examples/excel_csv_roundtrip.py # Complete Excel ↔ CSV round-trip workflow
python examples/error_handling.py # Robust error handling
python examples/performance_optimization.py # Large dataset optimization
```

- `basic_usage.py` - Basic operations (CSV, Excel, metadata)
- `data_pipeline.py` - Complete data science pipeline (7,000+ records)
- `ai_workflow.py` - AI integration (sentiment analysis, recommendations)
- `enhanced_docx.py` - Enhanced DOCX template system (markdown conversion, templates)
- `excel_to_csv_conversion.py` - Excel workbook to CSV conversion with structure preservation
- `excel_csv_roundtrip.py` - Complete Excel ↔ CSV round-trip workflow (Excel → CSV → Excel)
- `multi_format_reports.py` - Same data → Excel, PDF, Markdown, JSON
- `error_handling.py` - Production-ready error handling and recovery
- `performance_optimization.py` - Large dataset optimization (50MB+)
- `document_types.py` - Document functionality (DOCX, Markdown, PDF)
- `configuration.py` - Configuration options and customization
- `azure_storage.py` - Azure Blob Storage integration
- `FileUtils_tutorial.ipynb` - Comprehensive interactive tutorial
📚 Complete Examples Documentation - Detailed guide to all example scripts
FileUtils includes comprehensive test coverage with 50+ tests ensuring reliability and data integrity:
- Unit Tests: Individual method validation with controlled inputs
- Integration Tests: Complete workflow testing with realistic business data
- Excel ↔ CSV Tests: Dedicated tests for round-trip conversion functionality
- Error Handling: Graceful failure scenarios and edge cases
- Data Integrity: Validation of data preservation through all operations
```bash
# Run all tests
pytest
# Run Excel ↔ CSV conversion tests
pytest -k "excel_csv"
# Run with coverage
pytest --cov=FileUtils
```

📚 Testing Documentation - Complete test coverage guide and methodology
- Consistency: Same interface for local and cloud storage operations
- Flexibility: Extensive options for each file format
- Reliability: Robust error handling and logging
- Simplicity: Intuitive API with sensible defaults
- Configurable Directories: Customize directory names for domain-specific projects
- Smart Type Handling: Automatic conversion of pandas types for JSON/YAML documents
- Intelligent File Discovery: Automatic handling of timestamped files when loading
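
For instance, intelligent file discovery means a file saved under the default timestamped name can be loaded back by its plain base name, as in the Quick Start example. A small self-contained sketch (the data and file name are illustrative):

```python
import pandas as pd
from FileUtils import FileUtils, OutputFileType

file_utils = FileUtils()
df = pd.DataFrame({"metric": ["accuracy"], "value": [0.95]})

# Saved as e.g. data/processed/metrics_<timestamp>.csv by default
file_utils.save_data_to_storage(
    data=df,
    file_name="metrics",
    output_type="processed",
    output_filetype=OutputFileType.CSV,
)

# Load back using just the base name; the timestamped file is resolved automatically
metrics = file_utils.load_single_file("metrics.csv", input_type="processed")
```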
This package was developed to streamline data operations across various projects, particularly in data science and analysis workflows. It eliminates the need to rewrite common file handling code and provides a consistent interface regardless of the storage backend.
For a practical example, check out my semantic text analyzer project, which uses FileUtils for seamless data handling across local and Azure environments.
- Getting Started Guide - Quick introduction to key use cases
- Installation Guide - Detailed installation instructions
- Usage Guide - Comprehensive examples and patterns
- Document Types Guide - Rich document formats (DOCX, Markdown, PDF)
- Enhanced DOCX Guide - DOCX template system and markdown conversion
- Examples Documentation - Complete guide to all example scripts
- API Reference - Complete API documentation
- Testing Documentation - Test coverage and methodology
- Azure Setup Guide - Azure Blob Storage configuration
- Development Guide - Setup, building, and contributing to the project
- Future Features - Roadmap and planned enhancements
- Python 3.11+
- pandas
- pyyaml
- python-dotenv
- jsonschema
Choose the dependencies you need based on your use case:
- Azure Storage (`[azure]`):
  - azure-storage-blob
  - azure-identity
- Parquet Support (`[parquet]`):
  - pyarrow
- Excel Support (`[excel]`):
  - openpyxl
- Document Formats (`[documents]`):
  - python-docx (Microsoft Word documents)
  - markdown (Markdown processing)
  - PyMuPDF (PDF read/write, supports multiple formats)
Install optional dependencies using the corresponding extras tag (e.g., `pip install 'FileUtils[documents]'`).
This package prioritizes functionality and ease of use over performance optimization. While I use it in all of my projects, it's maintained as a side project.
MIT License