Version: 1.8.2 Last Updated: 2026-03-20
If you run Informatica PowerCenter and are planning to migrate to open code — Python scripts, dbt models, PySpark jobs — this tool is the first step. And you already know the gaps that are driving that decision. The platform does what it was built to do, but it has not kept pace with what engineering teams are now being asked to deliver. The most acute gap is GenAI: business teams are demanding AI-enabled pipelines — LLMs (Large Language Models) for classification, enrichment, and anomaly detection embedded in the data flow. Informatica PowerCenter has no native GenAI capability. It cannot call an LLM, cannot integrate with vector databases, and has no roadmap for AI-augmented pipelines. Staying on PowerCenter means the data infrastructure cannot participate in what the business is trying to do next.
Open code closes these gaps. Python, PySpark, and dbt integrate with any model, any API, any tool in the ecosystem — and produce a codebase the team owns, version-controls, and deploys without platform lock-in. That changes the conversion problem: you need to produce well-structured, maintainable source code, not just functionally equivalent scripts inside another tool.
Informatica PowerCenter stores its data transformation logic in units called mappings. Each mapping defines a data flow: source tables → transformations (lookups, expressions, filters, aggregations) → target tables. A typical project has dozens to hundreds of mappings. All of this can be exported as XML files — and those XML exports are what this tool reads.
Why not just convert each mapping one at a time? You can, but you'll end up with a mess. If 14 mappings all follow the same pattern differing only by table name, converting them individually produces 14 separate scripts instead of one parameterized template plus a config file. Shared lookup tables get redefined in every script. There's no dependency graph, no project structure, and massive duplication. You left a proprietary tool and landed in a code mess.
InformaticaProjectAnalysis reads all the mapping XMLs from your Informatica project (the complete collection of exported mappings, workflows, and parameter files) and produces a conversion strategy before any conversion begins. The strategy answers three questions:
-
Which mappings share the same structural pattern? Mappings with the same transformation flow (e.g., source → lookup → expression → target) differing only by table names and column names are grouped together as candidates for a single parameterized template. The tool shows the evidence and assigns a confidence level.
-
What depends on what? If mapping A loads
DIM_CUSTOMERand mapping B does a lookup againstDIM_CUSTOMER, then B depends on A. The tool builds a dependency graph across all mappings and computes a safe execution order. -
What needs human attention? Not every mapping can be confidently classified. Custom SQL overrides, missing definitions, and unusual patterns reduce certainty. The tool flags these for tech lead review.
The output is a strategy document (PDF + Excel + JSON) that tech leads and leadership review and approve before conversion work starts. The tool does not prescribe which target language to use or how to orchestrate — those decisions belong to the humans reviewing the strategy.
- Python 3.11+
- An Anthropic API key (for AI-assisted analysis in Phases 1 and 2)
git clone https://github.com/ad25343/InformaticaProjectAnalysis.git
cd InformaticaProjectAnalysis
python -m venv .venv
source .venv/bin/activate # macOS / Linux
# .venv\Scripts\activate # Windows
pip install -r requirements.txtCopy the example environment file and fill in your values:
cp .env.example .envRequired settings in .env:
| Variable | Purpose |
|---|---|
ANTHROPIC_API_KEY |
Your Anthropic API key for AI-assisted analysis |
SECRET_KEY |
Random string for session cookie signing (change from default) |
Optional settings are documented in .env.example.
Create a *.project.yaml file that points at your Informatica project exports:
project:
name: "My Migration"
version: "1.0"
owner: "Data Engineering"
source:
type: folder
location: "/path/to/informatica/exports/"
scope:
mappings:
include: ["mappings/**/*.xml"]
exclude: ["mappings/archive/**"]
workflows:
include: ["workflows/**/*.xml"]
parameters:
include: ["parameter_files/*.xml"]
default_parameter_env: "dev"
analysis:
fingerprint_strictness: "moderate"
min_group_size: 2
confidence_threshold: 0.7
detect_shared_assets: true
build_dependency_dag: true
classify_expressions: true
review:
tech_lead:
name: "Your Name"
email: "you@company.com"python -m app.mainThe server starts on http://localhost:8090 (configurable via the PORT env var).
Open the URL in your browser to access the UI.
The *.project.yaml file is the single input that drives the entire analysis.
| Field | Required | Description |
|---|---|---|
name |
Yes | Human-readable project name (appears in strategy documents) |
version |
No | Your version label for this migration scope |
owner |
No | Team or individual responsible |
| Field | Required | Description |
|---|---|---|
type |
Yes | How to locate the files: folder, repo, zip, or s3 |
location |
Yes | Path, URL, or bucket depending on source type |
Source types:
- folder — local directory path; the tool scans recursively using scope globs
- repo — Git repository URL; shallow-cloned via GitPython. Optional
branchandpath(subfolder within repo) fields. Temp dir cleaned up after analysis. - zip — uploaded ZIP archive via
POST /api/projects/zip(multipart form). Path traversal protection; size capped atMAX_UPLOAD_SIZE_MB(default 500 MB). - s3 — S3 bucket path; objects pulled and scanned (bring your own credentials via AWS env vars)
Controls which files within the source are included in the analysis.
| Field | Required | Description |
|---|---|---|
mappings.include |
Yes | Glob patterns for mapping XML files |
mappings.exclude |
No | Glob patterns to skip |
workflows.include |
No | Glob patterns for workflow XML files |
parameters.include |
No | Glob patterns for parameter files |
default_parameter_env |
No | Which parameter environment to use (default: dev) |
| Field | Default | Description |
|---|---|---|
fingerprint_strictness |
moderate |
How strict spine matching is (spine = the ordered sequence of transformation types in a mapping; see "How the Analysis Works" below): strict, moderate, relaxed |
min_group_size |
2 |
Minimum mappings needed to form a pattern group |
confidence_threshold |
0.7 |
Below this, mappings are flagged for human review |
detect_shared_assets |
true |
Identify tables referenced by multiple mappings |
build_dependency_dag |
true |
Build cross-mapping dependency graph (DAG = Directed Acyclic Graph) |
classify_expressions |
true |
Use AI to classify Expression transformation port complexity |
interpret_sql |
true |
Use AI to extract structured intent from Source Qualifier SQL overrides |
ai_refine_confidence |
false |
Use AI to re-score MEDIUM/LOW group member confidence levels (slower; requires API key) |
Names and emails of the reviewers who will approve the strategy. These appear in the strategy document and audit trail.
This section is a config convenience; the actual webhook is configured via environment
variables (WEBHOOK_URL, WEBHOOK_SECRET). When set, the server POSTs signed events
to the webhook URL automatically — no config section required.
| Field | Description |
|---|---|
webhook_url |
Slack/Teams/custom webhook for status notifications |
events |
Which events trigger notifications (analysis complete, strategy ready, review approved) |
The analysis runs in five phases. You can watch progress in real time via the progress
bar in the UI (powered by the SSE endpoint GET /api/projects/{id}/progress) or by
consuming that endpoint directly.
The tool resolves your source location, scans for files matching your scope globs, and parses every mapping XML it finds. Each mapping is reduced to its structural components: transformation types, connector topology, expression bodies, lookup references, parameter variables, and SQL overrides.
From the individual parse results, the tool builds a project-level graph:
- Dependency edges — if mapping A writes to TABLE_X and mapping B has a Lookup against TABLE_X, then B depends on A.
- Shared assets — tables that appear as lookup sources in multiple mappings.
- Repeated expressions — expression logic that appears verbatim across many mappings.
Parse cache: results are keyed by SHA-256 of file content. On re-analysis, only
new or changed files are re-parsed. The cache lives in the SQLite database at DB_PATH.
AI-assisted interpretation (Phase 1.1): If interpret_sql is enabled, Source
Qualifier SQL overrides are sent to Claude to extract structured intent: source_tables,
join_conditions, filter_logic, business_intent. Results are stored on each
MappingParseResult and included in the strategy JSON.
AI expression classification (Phase 2.1): If classify_expressions is enabled,
non-trivial Expression transformation ports are classified into:
SIMPLE_CALC, LOOKUP_CHAIN, CUSTOM_FUNCTION, SQL_OVERRIDE, REGEX_TRANSFORM, UNKNOWN.
AI confidence refinement (Phase 2.2): If ai_refine_confidence is enabled, Claude
re-evaluates each MEDIUM or LOW group member: "does this mapping structurally fit this
group?" Updated confidence and reason are stored on the group member.
Each mapping's transformation topology is reduced to a canonical "spine" — the ordered sequence of transformation types from source to target.
Examples:
SQ → EXP → TARGET(simple load)SQ → LKP → EXP → RTR → UPD → TARGET(SCD2 / slowly changing dimension)SQ(×3) → JNR → LKP(×2) → EXP → RTR → TARGET(×2)(complex multi-source)
Mappings with matching spines are candidates for the same pattern group. Within each group, variation is classified:
-
Tier 1 — Parameter variation. Structurally identical. Only table names, column names, and filter values differ. One template + one config file.
-
Tier 2 — Minor structural variation. Core flow is the same, but minor differences exist (an extra Expression, a Filter present in some but not all). Template accommodates variation via config flags.
-
Tier 3 — Fundamental variation. Different transformation types or flow shapes. Does not group. Convert individually.
Each mapping-to-group assignment carries a confidence level: HIGH, MEDIUM, LOW, or UNCLASSIFIED.
Three outputs are produced from the same analysis data:
PDF report — two layers in one document. Page 1 is the leadership summary (mapping count, pattern groups, scope reduction, risk flags). Remaining pages are tech lead detail (per-group evidence, dependency DAG, shared assets, per-mapping assignments with confidence).
Excel workbook — five sheets: Pattern Groups, Dependency Graph, Shared Assets, Per-Mapping Assignments, Risk Flags.
Strategy JSON — machine-readable format containing pattern groups with members, unique mappings with reasons, shared assets, dependency DAG, and execution order.
The strategy is presented in the UI for review. This is the decision gate.
Available actions:
| Action | What it does |
|---|---|
| Confirm mapping assignment | Locks the mapping's group assignment |
| Move mapping to different group | Simple override — no re-analysis needed |
| Convert mapping individually | Removes mapping from its group |
| Split group | Triggers lightweight re-validation |
| Merge groups | Triggers lightweight re-validation |
| Add notes | Stored per mapping and per group |
| APPROVE | Finalizes the strategy; generates approved strategy JSON |
| REJECT | Returns to analysis with reviewer notes as constraints |
Every review action is recorded in the audit trail with reviewer name, role, timestamp, and decision.
The approved strategy JSON is the final output. It contains everything needed to drive a conversion: which mappings belong to which pattern groups, what parameters are externalized, which mappings convert individually and why, the dependency DAG, and the execution order.
The strategy can be delivered as a file or via API POST to a downstream conversion tool.
The dashboard shows the project at a glance: total mappings, pattern groups found, template candidates, unique mappings, and the scope reduction metric (e.g., "50 mappings → 8 templates + 12 unique files"). It includes complexity distribution, confidence breakdown, dependency depth, and risk flags.
The APPROVE / REJECT gate is on this page — leadership makes the call after tech leads have reviewed the detail.
The detail view for validating groupings. The left panel lists all pattern groups with member count and confidence indicator. Selecting a group shows:
- The spine (transformation flow)
- Member mappings with variation tier, confidence, and flags
- Parameter differences across members (what varies)
- Structural evidence (why these were grouped)
- Per-mapping override controls (confirm, move, convert individually)
- Notes field per mapping and per group
Interactive dependency graph (vis-network) visualization. Nodes are mappings, colored by execution stage. Shared assets appear as diamond-shaped nodes. Click a node to see its upstream dependencies, downstream dependents, and shared assets in the detail panel on the right.
Execution stages are highlighted so you can see which mappings run in parallel and where serialization is required.
The 💬 AI Chat tab provides a multi-turn conversation with an AI assistant that has complete knowledge of the current analysis job.
What it knows: All pattern groups (with members, confidence, assignees, risk flags), unique mappings, anomaly report, reviewer brief, and strategy narrative.
What to ask:
- "Which mappings are HIGH complexity and unassigned?"
- "What are the biggest risks in the ETL Staging group?"
- "Summarise the duplicate mappings found."
- "How many dev-days would the unassigned work take?"
Setup: Enter your name in the name field (persisted in browser localStorage). Messages are stored per-job; history is included in chat context (last 20 turns).
To clear history: Click "Clear history" — this resets the conversation context so the AI starts fresh on your next message.
Requirements: ANTHROPIC_API_KEY must be set in the server environment and
CHAT_ENABLED=true (the default). If chat is disabled the tab returns a 503.
The ? Help tab contains this tool's user guide as an in-app reference, including:
- A User Workflow flowchart showing the end-to-end process
- An Analysis Pipeline flowchart showing all 4 phases and 7 AI agents
- Getting Started steps, YAML config reference, tab guide, override controls, AI feature descriptions, export formats, and environment variables
The left sidebar now supports:
- Search box — type to filter groups by name in real time
- Confidence filter pills — ALL / HIGH / MEDIUM / LOW; shows match count
The group detail panel now includes a "Why this confidence?" section explaining the confidence verdict with data-driven bullet reasons:
- Member count (small groups → less certain)
- Mixed variation tiers (Tier 1 + Tier 3 in same group → inconsistent)
- Risk flag count on members
- LOW-confidence members within a nominally higher-confidence group
If a Tech Lead approves and subsequently spots an error, they can click "Re-open for amendment" on the Review Gate. This:
- Reverts the job status to AWAITING_REVIEW
- Re-activates all override controls (Confirm / Individualize / Move)
- Appends a
review:reopenedevent to the audit trail (who, when, why) - Preserves the original approval record — the audit trail shows both events
After making adjustments, the Tech Lead submits a new approval. Both approval events appear in the audit log.
Note: Re-open is only available on APPROVED jobs. REJECTED jobs require a fresh analysis run.
When previous analyses exist, the landing page shows five programme-level KPIs:
| Metric | Description |
|---|---|
| Projects Analyzed | Total jobs in the database (all statuses) |
| Total Mappings | Sum of mappings across all approved/awaiting-review jobs |
| Avg Scope Reduction | Mean scope reduction % — the headline ROI number |
| Awaiting Review | Jobs pending Tech Lead approval (pulses when non-zero) |
| Approved Strategies | Jobs with approved status — completed deliverables |
Seven AI agents run automatically in Phase 3 when ANTHROPIC_API_KEY is set.
All are non-fatal — the pipeline continues if any agent fails.
Risk flags appear as red mono tags in the Pattern Groups member table. Each
flag is a category prefix (e.g., HARDCODED_DATE, COMPLEX_SQL); hover for
the full description.
Duplicate detection adds a purple dup badge on the mapping name. Hover to
see which mapping it likely duplicates. Flagged mappings are candidates for
removal before conversion begins.
AI group names replace machine-generated IDs (GRP-004 → Daily Sales Extract). The original ID is shown as a subtitle in mono text. The AI
description appears in the group detail header.
T-shirt sizing (S / M / L / XL) replaces raw effort hour numbers in the Dashboard groups table. Hover the badge to see the dev-day range.
| Size | Effort | Dev-days |
|---|---|---|
| S | < 16h | < 2 days |
| M | 16–40h | 2–5 days |
| L | 40–80h | 1–2 weeks |
| XL | > 80h | needs scoping |
These are indicative, not commitments. They are useful for capacity planning and stakeholder conversations, not sprint estimation.
Reviewer Brief and Anomaly Report are accessible in the Dashboard as collapsible panels (click to load lazily). The reviewer brief is a numbered checklist of what the Tech Lead should scrutinise before approving. The anomaly report lists statistical outliers with severity badges and recommendations.
Upload or point to a project config through the UI. Watch the analysis run in real time. Review the strategy in the browser. Approve or override.
The tool polls a directory for *.project.yaml files. When one appears or
changes, analysis triggers automatically. Configure via:
WATCHER_ENABLED=true
WATCHER_DIR=./watch
WATCHER_POLL_INTERVAL_SECS=30
A pipeline step posts the project config via POST /api/projects/trigger with an inline
JSON body. The endpoint requires X-API-Key when API_KEY is set in the environment.
The strategy document and other artifacts can be retrieved via
GET /api/projects/{id}/artifacts.
curl -X POST http://localhost:8090/api/projects/trigger \
-H "Content-Type: application/json" \
-H "X-API-Key: $API_KEY" \
-d '{
"project_name": "MyProject",
"source_type": "folder",
"source_location": "/path/to/exports",
"previous_job_id": "abc123"
}'When you re-run analysis on a project that has changed:
- Only new or modified mapping XMLs are re-parsed (unchanged files hit the SHA-256 cache)
- Pattern grouping runs on the full project (fast — parsing is cached)
- Previous human decisions (overrides, confirmations, notes) are preserved unless the underlying XML changed
- The strategy document includes a diff section showing what changed
When WEBHOOK_URL is set, the server automatically POSTs signed JSON events to that URL
at key lifecycle points.
Events:
| Event | When |
|---|---|
analysis_started |
Job begins running |
analysis_complete |
Job reaches PENDING_REVIEW |
analysis_failed |
Job fails with an error |
review_approved |
Strategy approved |
review_rejected |
Strategy rejected |
Payload format:
{ "event": "analysis_complete", "job_id": "abc123", "project_name": "MyProject" }Signature: every request includes an X-Signature-SHA256 header containing the
HMAC-SHA256 hex digest of the JSON body, signed with WEBHOOK_SECRET. Verify on
your end:
import hashlib, hmac
digest = hmac.new(secret.encode(), body, hashlib.sha256).hexdigest()
assert digest == request.headers["X-Signature-SHA256"]Security is infrastructure, not a feature layer. Key protections:
- All XML parsing is hardened against XXE (XML External Entity) injection — DTD and entity resolution disabled
- Path traversal prevented — all paths resolved relative to configured root; symlinks rejected
- ZIP extraction validates every entry path, caps total bytes and entry count
- Project config parsed with
yaml.safe_load()and schema validated - Session-cookie authentication on all non-static routes
- All HTTP security headers applied (CSP, HSTS, X-Frame-Options, etc.)
- No secrets in logs — structured logging only
- Dependencies pinned and audited with
pip-audit
See SECURITY.md for the full threat/defence matrix.
| Method | Path | Description |
|---|---|---|
POST |
/api/projects |
Upload project config YAML and start analysis |
POST |
/api/projects/folder |
Start analysis from a local folder path |
POST |
/api/projects/zip |
Upload ZIP archive (multipart/form-data) and start analysis |
POST |
/api/projects/trigger |
CI/CD trigger — inline JSON config. Requires X-API-Key. |
GET |
/api/projects |
List all jobs (?limit=N&offset=M for pagination) |
GET |
/api/projects/{id} |
Get analysis job state |
GET |
/api/projects/{id}/progress |
SSE progress stream (text/event-stream) |
GET |
/api/projects/{id}/strategy.json |
Download strategy JSON |
GET |
/api/projects/{id}/strategy.pdf |
Download strategy PDF |
GET |
/api/projects/{id}/strategy.xlsx |
Download strategy Excel workbook |
GET |
/api/projects/{id}/diff |
Mapping diff vs previous job for same project |
GET |
/api/projects/{id}/artifacts |
List available download formats (JSON/XLSX/PDF) |
POST |
/api/projects/{id}/review |
Submit review decision (APPROVE / REJECT) |
POST |
/api/projects/{id}/override |
Submit mapping override (CONFIRM / MOVE / INDIVIDUALIZE) |
GET |
/api/projects/{id}/graph |
Get dependency graph data |
GET |
/api/projects/{id}/groups |
Get pattern groups with members |
GET |
/api/projects/{id}/groups/{gid} |
Get single group detail |
POST |
/api/projects/{id}/groups/{gid}/notes |
Add reviewer note to a group |
GET |
/api/audit |
Audit trail of all review decisions |
GET |
/api/health |
Liveness + readiness probe |
All configuration is via environment variables (loaded from .env).
See .env.example for the full list with descriptions.
| Variable | Required | Default | Description |
|---|---|---|---|
ANTHROPIC_API_KEY |
Yes* | — | API key for AI-assisted analysis (*not required if all AI features disabled) |
PORT |
No | 8090 |
Server port |
HOST |
No | 127.0.0.1 |
Server bind address |
DB_PATH |
No | app/data/analysis.db |
SQLite database path |
CLAUDE_MODEL |
No | claude-sonnet-4-5 |
Model for AI analysis |
LOG_LEVEL |
No | INFO |
Logging level |
CORS_ORIGINS |
No | * |
Space-separated allowed origins for CORS |
HTTPS_MODE |
No | false |
Enable HSTS header (true in production) |
API_KEY |
No | — | When set, all write endpoints require X-API-Key header |
WATCHER_ENABLED |
No | false |
Enable folder watcher for auto-triggered analysis |
WATCHER_DIR |
No | ./watch |
Directory polled by the watcher |
WATCHER_POLL_INTERVAL_SECS |
No | 30 |
How often the watcher scans (seconds) |
WEBHOOK_URL |
No | — | URL to POST signed event notifications to |
WEBHOOK_SECRET |
No | — | HMAC-SHA256 signing secret for webhook payloads |
WEBHOOK_TIMEOUT_SECS |
No | 10 |
HTTP timeout for webhook POST requests |
MAX_UPLOAD_SIZE_MB |
No | 500 |
Maximum ZIP upload size |
OUTPUT_DIR |
No | app/data/output |
Directory for extracted uploads and generated files |
Analysis fails at "Resolving Source"
Check that the source.location in your project config points to a valid,
accessible directory. The tool does not follow symlinks.
Most mappings are UNCLASSIFIED
Try lowering analysis.fingerprint_strictness from strict to moderate or
relaxed. This widens the spine matching tolerance.
Confidence scores are low across the board Check if your mappings use heavy custom SQL overrides or missing mapplet definitions. These reduce confidence because the tool cannot fully determine structural equivalence. The flagged mappings will need manual review.
Parse cache not working
The cache keys on SHA-256 of file content. If file paths change but content
is identical, the cache still hits. If the cache seems stale, check that the
DB_PATH points to the correct database.
Port conflict
Change the PORT environment variable in .env. Default is 8090.
| Term | Definition |
|---|---|
| Mapping | A single unit of data transformation logic in Informatica PowerCenter. Defines how data flows from sources through transformations to targets. Exported as an XML file. |
| Project | The complete collection of Informatica PowerCenter mappings, workflows, and parameter files being analyzed for migration. Defined by a single *.project.yaml config file. |
| Spine | The canonical ordered sequence of transformation types in a mapping (e.g., SQ → EXP → LKP → TARGET). Used to identify structural similarity between mappings. |
| Pattern group | A set of mappings that share the same structural spine and can be converted using one parameterized template instead of N separate files. |
| Variation tier | How much a mapping differs from its group's canonical pattern: Tier 1 (parameter only — table/column names differ), Tier 2 (minor structural — an extra filter or expression), Tier 3 (fundamental — does not group). |
| Confidence | How certain the tool is about a mapping-to-group assignment: HIGH, MEDIUM, LOW, UNCLASSIFIED. |
| Dependency edge | A relationship where one mapping's output table is used as another mapping's lookup source — meaning the second must run after the first. |
| Shared asset | A table referenced as a lookup source by multiple mappings across the project. |
| Strategy document | The PDF + Excel + JSON output describing pattern groups, dependencies, and conversion recommendations. Reviewed and approved before conversion begins. |
| Human gate | The structured review step where tech leads and leadership approve or reject the strategy before any conversion work starts. |
| Project config | The *.project.yaml file that defines source location, scope, analysis settings, and reviewers. The single input that drives the entire analysis. |