A template and guide for building custom applications to run on the Pennsieve platform. This repository serves as both documentation and a working example that you can use as a starting point for your own applications.
This guide walks you through developing, testing, and deploying custom applications on Pennsieve. Applications allow you to create automated workflows that process data within the Pennsieve platform, supporting any programming language that can be containerized with Docker (Python, R, Go, JavaScript, etc.).
Building an application for Pennsieve follows a four-step process:
- Develop your application in your desired programming language
- Build and test your application locally using Docker
- Push to a public GitHub repository
- Deploy and run your application(s) on Pennsieve
Pennsieve applications follow a simple input-output model:
- Your application reads input files from a designated input directory
- Processes them according to your custom logic
- Writes its results to a designated output directory
The platform handles file management, while your application focuses solely on the processing logic.
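The model above can be sketched in a few lines of Python (the copy-through step is a placeholder for real logic, and `process` is an illustrative name, not a platform requirement):

```python
import os
import shutil


def process(input_dir: str, output_dir: str) -> None:
    """Placeholder logic: copy every top-level input file to the output directory."""
    os.makedirs(output_dir, exist_ok=True)
    for name in os.listdir(input_dir):
        src = os.path.join(input_dir, name)
        if os.path.isfile(src):
            shutil.copy(src, os.path.join(output_dir, name))


# On Pennsieve the platform injects these variables; the guard lets the
# sketch also import cleanly outside a container.
if "INPUT_DIR" in os.environ and "OUTPUT_DIR" in os.environ:
    process(os.environ["INPUT_DIR"], os.environ["OUTPUT_DIR"])
```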
Before you begin, ensure you have:
- Docker installed on your local machine (Install Docker)
- Git for version control
- A public GitHub repository for hosting your application
- Basic knowledge of your chosen programming language
Your application must use three environment variables to access directories:
- `INPUT_DIR`: Directory containing input files to process
- `OUTPUT_DIR`: Directory where your application writes results
- `RESOURCES_DIR`: Directory for static resources (optional)
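In Python, reading them defensively looks like this (a sketch; the `get_dirs` helper is not part of any Pennsieve SDK):

```python
import os


def get_dirs():
    """Read the platform-provided directories, failing fast if a required one is missing."""
    try:
        input_dir = os.environ["INPUT_DIR"]
        output_dir = os.environ["OUTPUT_DIR"]
    except KeyError as exc:
        raise RuntimeError(f"Required environment variable {exc} is not set") from exc
    # RESOURCES_DIR is optional, so tolerate its absence.
    resources_dir = os.environ.get("RESOURCES_DIR")
    return input_dir, output_dir, resources_dir
```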
Every Pennsieve application requires an application.json file that describes inputs, outputs, and parameters.
Example from this repository:
```json
{
  "name": "echo-application",
  "description": "echo-application",
  "version": "1.0.0",
  "inputs": [
    {
      "name": "Packages",
      "description": "Pipeline packages",
      "path": "/input",
      "files": [
        {
          "id": "package_file",
          "description": "A package file",
          "ext": "*",
          "required": true
        }
      ]
    }
  ],
  "outputs": [
    {
      "name": "Packages",
      "description": "Pipeline packages",
      "path": "/output",
      "files": [
        {
          "id": "package_file",
          "description": "A package file",
          "ext": "*",
          "required": true
        }
      ]
    }
  ],
  "params": []
}
```

Key fields:
- name: Unique identifier for your application
- description: What your application does
- version: Semantic version number
- inputs/outputs: Array of file specifications
- params: Configurable parameters (optional)
- depends_on: File types your application can process (`["*"]` for all)
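A malformed manifest is a common source of deployment failures, so a quick local sanity check can be worthwhile. This is a sketch based on the fields listed above, not an official validator:

```python
import json

# Fields described above; params and depends_on are optional.
REQUIRED_FIELDS = ("name", "description", "version", "inputs", "outputs")


def check_manifest(path: str) -> list[str]:
    """Return a list of problems found in an application.json file."""
    with open(path) as f:
        manifest = json.load(f)
    problems = [f"missing field: {field}" for field in REQUIRED_FIELDS
                if field not in manifest]
    for section in ("inputs", "outputs"):
        for entry in manifest.get(section, []):
            if "path" not in entry or "files" not in entry:
                problems.append(f"{section} entry needs 'path' and 'files'")
    return problems
```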
Your application should read from INPUT_DIR, process data, and write to OUTPUT_DIR.
Example from this repository (main.py):
```python
#!/usr/bin/env python3.9
import sys
import shutil
import os


def main():
    print("start of processing")

    src = os.environ['INPUT_DIR']
    dest = os.environ['OUTPUT_DIR']
    resources = os.environ['RESOURCES_DIR']

    print("Command line arguments ...")
    print(sys.argv)

    print("ENV variables ...")
    print(os.environ)

    list_files(resources)

    # Example: Read static resources
    if os.path.exists(f'{resources}/static-file.txt'):
        with open(f'{resources}/static-file.txt', "r") as file:
            content = file.read()
            print(content)

    # Process: Copy files from input to output
    shutil.copytree(src, dest, dirs_exist_ok=True)

    print("end of processing")


def list_files(startpath):
    for root, _, files in os.walk(startpath):
        level = root.replace(startpath, '').count(os.sep)
        indent = ' ' * 4 * level
        print('{}{}/'.format(indent, os.path.basename(root)))
        subindent = ' ' * 4 * (level + 1)
        for f in files:
            print('{}{}'.format(subindent, f))


if __name__ == '__main__':
    main()
```

List your dependencies in `requirements.txt` (Python), `install.packages()` (R), or the equivalent for your language.
Example (requirements.txt):
```
pandas
openpyxl
requests
```
This Python template can be adapted for any language:
R Example:
```r
#!/usr/bin/env Rscript
input_dir <- Sys.getenv("INPUT_DIR")
output_dir <- Sys.getenv("OUTPUT_DIR")
resources_dir <- Sys.getenv("RESOURCES_DIR")
# Your processing logic here
```

Go Example:
```go
package main

import "os"

func main() {
	inputDir := os.Getenv("INPUT_DIR")
	outputDir := os.Getenv("OUTPUT_DIR")
	resourcesDir := os.Getenv("RESOURCES_DIR")
	// Your processing logic here
	_, _, _ = inputDir, outputDir, resourcesDir // placeholder so the unused variables compile
}
```

The Dockerfile defines how to build your application container.
Example from this repository:
```dockerfile
FROM python:3.13.1

WORKDIR /service

RUN apt clean && apt-get update

COPY . .
RUN ls /service
RUN mkdir -p data

# Add additional dependencies below ...
RUN pip install -r /service/requirements.txt

ENTRYPOINT [ "python3.13", "/service/main.py" ]
```

For other languages, modify the base image:

- R: `FROM r-base:4.1.0`
- Go: `FROM golang:1.19`
- Node.js: `FROM node:18`
Example (docker-compose.yml):
```yaml
version: '3.9'
services:
  hackathon-python:
    env_file:
      - dev.env
    image: hackathon/python
    volumes:
      - ./data:/service/data
    container_name: hackathon-python
    build:
      context: .
      dockerfile: ./Dockerfile
```

Example (dev.env):

```
INPUT_DIR=/service/data/input
OUTPUT_DIR=/service/data/output
RESOURCES_DIR=/service/data/resources
```

Create the local data directories:

```shell
mkdir -p data/input data/output data/resources
```

Your structure should look like:
```
python-application-template/
├── data/
│   ├── input/       # Place test input files here
│   ├── output/      # Application writes results here
│   └── resources/   # Optional: static resources
├── main.py
├── application.json
├── Dockerfile
├── docker-compose.yml
├── dev.env
├── requirements.txt
└── README.md
```
Create test files in data/input/:
```shell
echo "test data" > data/input/test.txt
```

You can also add static resources:

```shell
echo "static content" > data/resources/static-file.txt
```

Build and run the application:

```shell
docker-compose up --build
```

This will:
- Build your Docker image
- Start a container with your application
- Mount local directories for input/output
- Run your application with the test data
Check the data/output/ directory for results:
```shell
ls -la data/output/
```

For this example application, test files should be copied from input to output.
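For a stricter check than eyeballing `ls`, you can compare the two directories programmatically. This assumes the example's copy-through behavior; `dirs_match` is an illustrative helper, not part of the template:

```python
import filecmp


def dirs_match(input_dir: str, output_dir: str) -> bool:
    """True when output contains the same top-level files, byte-for-byte, as input."""
    cmp = filecmp.dircmp(input_dir, output_dir)
    if cmp.left_only or cmp.right_only or cmp.diff_files:
        return False
    # Re-compare the common files with content (non-shallow) comparison.
    _, mismatch, errors = filecmp.cmpfiles(
        input_dir, output_dir, cmp.common_files, shallow=False)
    return not mismatch and not errors
```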
Check container logs for errors or debug information:
```shell
docker-compose logs
```

As you develop:

- Make code changes
- Run `docker-compose up --build`
- Verify output files in `data/output/`
- Review logs
- Clear output: `rm -rf data/output/*`
- Repeat
Container exits immediately:
- Check logs: `docker-compose logs`
- Verify environment variables are set correctly
- Check for syntax errors in your code

Files not appearing in output:

- Ensure your code writes to the directory named by the `OUTPUT_DIR` environment variable
- Verify the output directory has write permissions

Dependencies not installed:

- Check that `requirements.txt` is correct
- Rebuild: `docker-compose up --build`
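Many of these failures can be surfaced early by a small self-check at the start of your application; the function below is a suggestion, not a platform API:

```python
import os


def self_check() -> list[str]:
    """Report common misconfigurations before processing starts."""
    errors = []
    for var in ("INPUT_DIR", "OUTPUT_DIR"):
        path = os.environ.get(var)
        if not path:
            errors.append(f"{var} is not set")
        elif not os.path.isdir(path):
            errors.append(f"{var}={path} is not a directory")
    out = os.environ.get("OUTPUT_DIR")
    if out and os.path.isdir(out) and not os.access(out, os.W_OK):
        errors.append(f"OUTPUT_DIR={out} is not writable")
    return errors
```

Printing the returned list before processing makes `docker-compose logs` immediately show what is misconfigured.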
Once your application works locally, push it to a public GitHub repository.
Ensure your repository includes:
- Application code (`main.py`, `main.R`, etc.)
- `application.json`
- `Dockerfile`
- Dependency files (`requirements.txt`, etc.)
- `README.md` with clear documentation
- `.gitignore` to exclude test data
Example (`.gitignore`):

```
# Test data directories
data/

# Python
*.pyc
__pycache__/
venv/
env/

# Environment files
*.env

# IDE files
.vscode/
.idea/

# OS files
.DS_Store
```
Important: Do not commit the data/ directory or sensitive environment variables.
```shell
# Initialize git (if not already done)
git init

# Add files
git add .

# Commit
git commit -m "Initial commit: Pennsieve application"

# Add remote (replace with your repository URL)
git remote add origin https://github.com/YOUR_USERNAME/YOUR_REPO.git

# Push
git branch -M main
git push -u origin main
```

Pennsieve currently requires a public GitHub repository to build your application. Ensure visibility is set to public in your GitHub repository settings.
Note: Private repository support is currently in development and will be available in a future release.
Tag your releases using semantic versioning:
```shell
git tag -a v1.0.0 -m "Initial release"
git push origin v1.0.0
```

With your application in a public GitHub repository, you can now register it with Pennsieve.
Follow the official Pennsieve documentation to register your application:
Registering Analytic Workflows
The registration process:
- Navigate to the applications section in your Pennsieve workspace
- Click to register a new application
- Provide your GitHub repository URL
- Test your application on sample data
Once registered, you can:
- Select files or packages in your dataset
- Choose your registered application (legacy) or a named workflow (new!)
- Configure parameters (if defined); currently supported only for legacy applications
- Start the workflow
Pennsieve handles:
- Container orchestration
- File mounting
- Execution monitoring
- Output capture and storage
```shell
# Clone this repository
git clone https://github.com/Penn-I3H/python-application-template.git my-app

# Navigate to directory
cd my-app

# Remove git history to start fresh
rm -rf .git
git init

# Modify files for your use case
# - Update main.py with your processing logic
# - Update application.json with your inputs/outputs
# - Update requirements.txt with your dependencies
# - Update README.md with your documentation

# Test locally
docker-compose up --build

# Push to your own repository
git remote add origin https://github.com/YOUR_USERNAME/YOUR_REPO.git
git add .
git commit -m "Initial commit"
git push -u origin main
```

Reference this repository structure and adapt it to your preferred language while maintaining the same directory layout and environment variable usage.
- Write clear, well-commented code
- Implement robust error handling
- Log progress and errors to stdout
- Validate inputs before processing
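One way to combine the logging and error-handling advice above (the `process_file` body is a stand-in for your own logic):

```python
import os
import sys
import traceback


def process_file(path: str, output_dir: str) -> None:
    """Stand-in for your real per-file processing logic."""
    print(f"processing {path}", flush=True)


def run(input_dir: str, output_dir: str) -> int:
    """Process each input file, logging failures without aborting the whole batch."""
    failed = 0
    for name in sorted(os.listdir(input_dir)):
        try:
            process_file(os.path.join(input_dir, name), output_dir)
        except Exception:
            failed += 1
            print(f"ERROR while processing {name}", file=sys.stderr, flush=True)
            traceback.print_exc()
    return failed
```

Logging to stdout/stderr with `flush=True` ensures progress appears promptly in the container logs.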
- Minimize image size
- Use specific base image versions (not `latest`)
- Cache dependencies appropriately
- Use multi-stage builds when possible
- Test with various input types
- Test edge cases and error conditions
- Verify output format and content
- Test resource file access if used
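Because the processing logic only touches ordinary directories, it can be unit-tested without Docker. A sketch using the standard library, where `process` stands in for your own entry point:

```python
import os
import shutil
import tempfile
import unittest


def process(input_dir: str, output_dir: str) -> None:
    """Placeholder for the application's processing step (here: copy-through)."""
    shutil.copytree(input_dir, output_dir, dirs_exist_ok=True)


class ProcessTest(unittest.TestCase):
    def test_copies_input_to_output(self):
        src = tempfile.mkdtemp()
        dst = tempfile.mkdtemp()
        with open(os.path.join(src, "sample.txt"), "w") as f:
            f.write("test data")
        process(src, dst)
        with open(os.path.join(dst, "sample.txt")) as f:
            self.assertEqual(f.read(), "test data")
```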
- Keep README updated with changes
- Document expected input/output formats
- Provide usage examples
- Include troubleshooting tips
- This Template Repository: https://github.com/Penn-I3H/python-application-template
- Pennsieve Documentation: https://docs.pennsieve.io
- Registering Workflows: https://docs.pennsieve.io/docs/registering-analytic-workflows
- Docker Documentation: https://docs.docker.com
You've learned how to:
- ✅ Structure a Pennsieve application with required environment variables
- ✅ Create `application.json` to define inputs and outputs
- ✅ Build and test locally using Docker and Docker Compose
- ✅ Push to a public GitHub repository
- ✅ Register and deploy on Pennsieve
This template provides everything you need to start building custom data processing workflows on Pennsieve in any language that supports Docker containerization.

