InformaticaProjectAnalysis — User Guide

Version: 1.8.2 Last Updated: 2026-03-20

What Is This Tool For?

If you run Informatica PowerCenter and are planning to migrate to open code — Python scripts, dbt models, PySpark jobs — this tool is the first step. And you already know the gaps that are driving that decision. The platform does what it was built to do, but it has not kept pace with what engineering teams are now being asked to deliver. The most acute gap is GenAI: business teams are demanding AI-enabled pipelines — LLMs (Large Language Models) for classification, enrichment, and anomaly detection embedded in the data flow. Informatica PowerCenter has no native GenAI capability. It cannot call an LLM, cannot integrate with vector databases, and has no roadmap for AI-augmented pipelines. Staying on PowerCenter means the data infrastructure cannot participate in what the business is trying to do next.

Open code closes these gaps. Python, PySpark, and dbt integrate with any model, any API, any tool in the ecosystem — and produce a codebase the team owns, version-controls, and deploys without platform lock-in. That changes the conversion problem: you need to produce well-structured, maintainable source code, not just functionally equivalent scripts inside another tool.

Informatica PowerCenter stores its data transformation logic in units called mappings. Each mapping defines a data flow: source tables → transformations (lookups, expressions, filters, aggregations) → target tables. A typical project has dozens to hundreds of mappings. All of this can be exported as XML files — and those XML exports are what this tool reads.

Why not just convert each mapping one at a time? You can, but you'll end up with a mess. If 14 mappings all follow the same pattern differing only by table name, converting them individually produces 14 separate scripts instead of one parameterized template plus a config file. Shared lookup tables get redefined in every script. There's no dependency graph, no project structure, and massive duplication. You left a proprietary tool and landed in a code mess.

InformaticaProjectAnalysis reads all the mapping XMLs from your Informatica project (the complete collection of exported mappings, workflows, and parameter files) and produces a conversion strategy before any conversion begins. The strategy answers three questions:

Which mappings share the same structural pattern? Mappings with the same transformation flow (e.g., source → lookup → expression → target) differing only by table names and column names are grouped together as candidates for a single parameterized template. The tool shows the evidence and assigns a confidence level.
What depends on what? If mapping A loads DIM_CUSTOMER and mapping B does a lookup against DIM_CUSTOMER, then B depends on A. The tool builds a dependency graph across all mappings and computes a safe execution order.
What needs human attention? Not every mapping can be confidently classified. Custom SQL overrides, missing definitions, and unusual patterns reduce certainty. The tool flags these for tech lead review.

The output is a strategy document (PDF + Excel + JSON) that tech leads and leadership review and approve before conversion work starts. The tool does not prescribe which target language to use or how to orchestrate — those decisions belong to the humans reviewing the strategy.

Quick Start

1. Prerequisites

Python 3.11+
An Anthropic API key (for AI-assisted analysis in Phases 1 and 2)

2. Installation

git clone https://github.com/ad25343/InformaticaProjectAnalysis.git
cd InformaticaProjectAnalysis

python -m venv .venv
source .venv/bin/activate    # macOS / Linux
# .venv\Scripts\activate     # Windows

pip install -r requirements.txt

3. Configuration

Copy the example environment file and fill in your values:

cp .env.example .env

Required settings in .env:

Variable	Purpose
`ANTHROPIC_API_KEY`	Your Anthropic API key for AI-assisted analysis
`SECRET_KEY`	Random string for session cookie signing (change from default)

Optional settings are documented in .env.example.

4. Create a Project Config

Create a *.project.yaml file that points at your Informatica project exports:

project:
  name: "My Migration"
  version: "1.0"
  owner: "Data Engineering"

source:
  type: folder
  location: "/path/to/informatica/exports/"

scope:
  mappings:
    include: ["mappings/**/*.xml"]
    exclude: ["mappings/archive/**"]
  workflows:
    include: ["workflows/**/*.xml"]
  parameters:
    include: ["parameter_files/*.xml"]
  default_parameter_env: "dev"

analysis:
  fingerprint_strictness: "moderate"
  min_group_size: 2
  confidence_threshold: 0.7
  detect_shared_assets: true
  build_dependency_dag: true
  classify_expressions: true

review:
  tech_lead:
    name: "Your Name"
    email: "you@company.com"

5. Run the Tool

python -m app.main

The server starts on http://localhost:8090 (configurable via the PORT env var). Open the URL in your browser to access the UI.

Project Config Reference

The *.project.yaml file is the single input that drives the entire analysis.

`project` Section

Field	Required	Description
`name`	Yes	Human-readable project name (appears in strategy documents)
`version`	No	Your version label for this migration scope
`owner`	No	Team or individual responsible

`source` Section

Field	Required	Description
`type`	Yes	How to locate the files: `folder`, `repo`, `zip`, or `s3`
`location`	Yes	Path, URL, or bucket depending on source type

Source types:

folder — local directory path; the tool scans recursively using scope globs
repo — Git repository URL; shallow-cloned via GitPython. Optional branch and path (subfolder within repo) fields. Temp dir cleaned up after analysis.
zip — uploaded ZIP archive via POST /api/projects/zip (multipart form). Path traversal protection; size capped at MAX_UPLOAD_SIZE_MB (default 500 MB).
s3 — S3 bucket path; objects pulled and scanned (bring your own credentials via AWS env vars)

`scope` Section

Controls which files within the source are included in the analysis.

Field	Required	Description
`mappings.include`	Yes	Glob patterns for mapping XML files
`mappings.exclude`	No	Glob patterns to skip
`workflows.include`	No	Glob patterns for workflow XML files
`parameters.include`	No	Glob patterns for parameter files
`default_parameter_env`	No	Which parameter environment to use (default: `dev`)

`analysis` Section

Field	Default	Description
`fingerprint_strictness`	`moderate`	How strict spine matching is (spine = the ordered sequence of transformation types in a mapping; see "How the Analysis Works" below): `strict`, `moderate`, `relaxed`
`min_group_size`	`2`	Minimum mappings needed to form a pattern group
`confidence_threshold`	`0.7`	Below this, mappings are flagged for human review
`detect_shared_assets`	`true`	Identify tables referenced by multiple mappings
`build_dependency_dag`	`true`	Build cross-mapping dependency graph (DAG = Directed Acyclic Graph)
`classify_expressions`	`true`	Use AI to classify Expression transformation port complexity
`interpret_sql`	`true`	Use AI to extract structured intent from Source Qualifier SQL overrides
`ai_refine_confidence`	`false`	Use AI to re-score MEDIUM/LOW group member confidence levels (slower; requires API key)

`review` Section

Names and emails of the reviewers who will approve the strategy. These appear in the strategy document and audit trail.

`notifications` Section (optional)

This section is a config convenience; the actual webhook is configured via environment variables (WEBHOOK_URL, WEBHOOK_SECRET). When set, the server POSTs signed events to the webhook URL automatically — no config section required.

Field	Description
`webhook_url`	Slack/Teams/custom webhook for status notifications
`events`	Which events trigger notifications (analysis complete, strategy ready, review approved)

How the Analysis Works

The analysis runs in five phases. You can watch progress in real time via the progress bar in the UI (powered by the SSE endpoint GET /api/projects/{id}/progress) or by consuming that endpoint directly.

Phase 1 — Discovery

The tool resolves your source location, scans for files matching your scope globs, and parses every mapping XML it finds. Each mapping is reduced to its structural components: transformation types, connector topology, expression bodies, lookup references, parameter variables, and SQL overrides.

From the individual parse results, the tool builds a project-level graph:

Dependency edges — if mapping A writes to TABLE_X and mapping B has a Lookup against TABLE_X, then B depends on A.
Shared assets — tables that appear as lookup sources in multiple mappings.
Repeated expressions — expression logic that appears verbatim across many mappings.

Parse cache: results are keyed by SHA-256 of file content. On re-analysis, only new or changed files are re-parsed. The cache lives in the SQLite database at DB_PATH.

AI-assisted interpretation (Phase 1.1): If interpret_sql is enabled, Source Qualifier SQL overrides are sent to Claude to extract structured intent: source_tables, join_conditions, filter_logic, business_intent. Results are stored on each MappingParseResult and included in the strategy JSON.

AI expression classification (Phase 2.1): If classify_expressions is enabled, non-trivial Expression transformation ports are classified into: SIMPLE_CALC, LOOKUP_CHAIN, CUSTOM_FUNCTION, SQL_OVERRIDE, REGEX_TRANSFORM, UNKNOWN.

AI confidence refinement (Phase 2.2): If ai_refine_confidence is enabled, Claude re-evaluates each MEDIUM or LOW group member: "does this mapping structurally fit this group?" Updated confidence and reason are stored on the group member.

Phase 2 — Pattern Grouping

Each mapping's transformation topology is reduced to a canonical "spine" — the ordered sequence of transformation types from source to target.

Examples:

SQ → EXP → TARGET (simple load)
SQ → LKP → EXP → RTR → UPD → TARGET (SCD2 / slowly changing dimension)
SQ(×3) → JNR → LKP(×2) → EXP → RTR → TARGET(×2) (complex multi-source)

Mappings with matching spines are candidates for the same pattern group. Within each group, variation is classified:

Tier 1 — Parameter variation. Structurally identical. Only table names, column names, and filter values differ. One template + one config file.
Tier 2 — Minor structural variation. Core flow is the same, but minor differences exist (an extra Expression, a Filter present in some but not all). Template accommodates variation via config flags.
Tier 3 — Fundamental variation. Different transformation types or flow shapes. Does not group. Convert individually.

Each mapping-to-group assignment carries a confidence level: HIGH, MEDIUM, LOW, or UNCLASSIFIED.

Phase 3 — Strategy Document Generation

Three outputs are produced from the same analysis data:

PDF report — two layers in one document. Page 1 is the leadership summary (mapping count, pattern groups, scope reduction, risk flags). Remaining pages are tech lead detail (per-group evidence, dependency DAG, shared assets, per-mapping assignments with confidence).

Excel workbook — five sheets: Pattern Groups, Dependency Graph, Shared Assets, Per-Mapping Assignments, Risk Flags.

Strategy JSON — machine-readable format containing pattern groups with members, unique mappings with reasons, shared assets, dependency DAG, and execution order.

Phase 4 — Human Gate (Strategy Review)

The strategy is presented in the UI for review. This is the decision gate.

Available actions:

Action	What it does
Confirm mapping assignment	Locks the mapping's group assignment
Move mapping to different group	Simple override — no re-analysis needed
Convert mapping individually	Removes mapping from its group
Split group	Triggers lightweight re-validation
Merge groups	Triggers lightweight re-validation
Add notes	Stored per mapping and per group
APPROVE	Finalizes the strategy; generates approved strategy JSON
REJECT	Returns to analysis with reviewer notes as constraints

Every review action is recorded in the audit trail with reviewer name, role, timestamp, and decision.

Phase 5 — Strategy Delivery

The approved strategy JSON is the final output. It contains everything needed to drive a conversion: which mappings belong to which pattern groups, what parameters are externalized, which mappings convert individually and why, the dependency DAG, and the execution order.

The strategy can be delivered as a file or via API POST to a downstream conversion tool.

The Three UI Views

Dashboard (Leadership)

The dashboard shows the project at a glance: total mappings, pattern groups found, template candidates, unique mappings, and the scope reduction metric (e.g., "50 mappings → 8 templates + 12 unique files"). It includes complexity distribution, confidence breakdown, dependency depth, and risk flags.

The APPROVE / REJECT gate is on this page — leadership makes the call after tech leads have reviewed the detail.

Pattern Groups (Tech Leads)

The detail view for validating groupings. The left panel lists all pattern groups with member count and confidence indicator. Selecting a group shows:

The spine (transformation flow)
Member mappings with variation tier, confidence, and flags
Parameter differences across members (what varies)
Structural evidence (why these were grouped)
Per-mapping override controls (confirm, move, convert individually)
Notes field per mapping and per group

Dependency Graph (Both Audiences)

Interactive dependency graph (vis-network) visualization. Nodes are mappings, colored by execution stage. Shared assets appear as diamond-shaped nodes. Click a node to see its upstream dependencies, downstream dependents, and shared assets in the detail panel on the right.

Execution stages are highlighted so you can see which mappings run in parallel and where serialization is required.

AI Chat Tab

The 💬 AI Chat tab provides a multi-turn conversation with an AI assistant that has complete knowledge of the current analysis job.

What it knows: All pattern groups (with members, confidence, assignees, risk flags), unique mappings, anomaly report, reviewer brief, and strategy narrative.

What to ask:

"Which mappings are HIGH complexity and unassigned?"
"What are the biggest risks in the ETL Staging group?"
"Summarise the duplicate mappings found."
"How many dev-days would the unassigned work take?"

Setup: Enter your name in the name field (persisted in browser localStorage). Messages are stored per-job; history is included in chat context (last 20 turns).

To clear history: Click "Clear history" — this resets the conversation context so the AI starts fresh on your next message.

Requirements: ANTHROPIC_API_KEY must be set in the server environment and CHAT_ENABLED=true (the default). If chat is disabled the tab returns a 503.

Help Tab

The ? Help tab contains this tool's user guide as an in-app reference, including:

A User Workflow flowchart showing the end-to-end process
An Analysis Pipeline flowchart showing all 4 phases and 7 AI agents
Getting Started steps, YAML config reference, tab guide, override controls, AI feature descriptions, export formats, and environment variables

Pattern Groups — Search and Filter (v1.8.2)

The left sidebar now supports:

Search box — type to filter groups by name in real time
Confidence filter pills — ALL / HIGH / MEDIUM / LOW; shows match count

The group detail panel now includes a "Why this confidence?" section explaining the confidence verdict with data-driven bullet reasons:

Member count (small groups → less certain)
Mixed variation tiers (Tier 1 + Tier 3 in same group → inconsistent)
Risk flag count on members
LOW-confidence members within a nominally higher-confidence group

Re-opening an Approved Strategy (v1.8.2)

If a Tech Lead approves and subsequently spots an error, they can click "Re-open for amendment" on the Review Gate. This:

Reverts the job status to AWAITING_REVIEW
Re-activates all override controls (Confirm / Individualize / Move)
Appends a review:reopened event to the audit trail (who, when, why)
Preserves the original approval record — the audit trail shows both events

After making adjustments, the Tech Lead submits a new approval. Both approval events appear in the audit log.

Note: Re-open is only available on APPROVED jobs. REJECTED jobs require a fresh analysis run.

Landing Page — Metrics Bar (v1.8.2)

When previous analyses exist, the landing page shows five programme-level KPIs:

Metric	Description
Projects Analyzed	Total jobs in the database (all statuses)
Total Mappings	Sum of mappings across all approved/awaiting-review jobs
Avg Scope Reduction	Mean scope reduction % — the headline ROI number
Awaiting Review	Jobs pending Tech Lead approval (pulses when non-zero)
Approved Strategies	Jobs with approved status — completed deliverables

AI Enhancement Features (v1.8.0)

Seven AI agents run automatically in Phase 3 when ANTHROPIC_API_KEY is set. All are non-fatal — the pipeline continues if any agent fails.

Risk flags appear as red mono tags in the Pattern Groups member table. Each flag is a category prefix (e.g., HARDCODED_DATE, COMPLEX_SQL); hover for the full description.

Duplicate detection adds a purple dup badge on the mapping name. Hover to see which mapping it likely duplicates. Flagged mappings are candidates for removal before conversion begins.

AI group names replace machine-generated IDs (GRP-004 → Daily Sales Extract). The original ID is shown as a subtitle in mono text. The AI description appears in the group detail header.

T-shirt sizing (S / M / L / XL) replaces raw effort hour numbers in the Dashboard groups table. Hover the badge to see the dev-day range.

Size	Effort	Dev-days
S	< 16h	< 2 days
M	16–40h	2–5 days
L	40–80h	1–2 weeks
XL	> 80h	needs scoping

These are indicative, not commitments. They are useful for capacity planning and stakeholder conversations, not sprint estimation.

Reviewer Brief and Anomaly Report are accessible in the Dashboard as collapsible panels (click to load lazily). The reviewer brief is a numbered checklist of what the Tech Lead should scrutinise before approving. The anomaly report lists statistical outliers with severity badges and recommendations.

Three Operating Modes

Interactive

Upload or point to a project config through the UI. Watch the analysis run in real time. Review the strategy in the browser. Approve or override.

Watcher

The tool polls a directory for *.project.yaml files. When one appears or changes, analysis triggers automatically. Configure via:

WATCHER_ENABLED=true
WATCHER_DIR=./watch
WATCHER_POLL_INTERVAL_SECS=30

CI/CD

A pipeline step posts the project config via POST /api/projects/trigger with an inline JSON body. The endpoint requires X-API-Key when API_KEY is set in the environment. The strategy document and other artifacts can be retrieved via GET /api/projects/{id}/artifacts.

curl -X POST http://localhost:8090/api/projects/trigger \
  -H "Content-Type: application/json" \
  -H "X-API-Key: $API_KEY" \
  -d '{
    "project_name": "MyProject",
    "source_type": "folder",
    "source_location": "/path/to/exports",
    "previous_job_id": "abc123"
  }'

Incremental Analysis

When you re-run analysis on a project that has changed:

Only new or modified mapping XMLs are re-parsed (unchanged files hit the SHA-256 cache)
Pattern grouping runs on the full project (fast — parsing is cached)
Previous human decisions (overrides, confirmations, notes) are preserved unless the underlying XML changed
The strategy document includes a diff section showing what changed

Webhooks

When WEBHOOK_URL is set, the server automatically POSTs signed JSON events to that URL at key lifecycle points.

Events:

Event	When
`analysis_started`	Job begins running
`analysis_complete`	Job reaches PENDING_REVIEW
`analysis_failed`	Job fails with an error
`review_approved`	Strategy approved
`review_rejected`	Strategy rejected

Payload format:

{ "event": "analysis_complete", "job_id": "abc123", "project_name": "MyProject" }

Signature: every request includes an X-Signature-SHA256 header containing the HMAC-SHA256 hex digest of the JSON body, signed with WEBHOOK_SECRET. Verify on your end:

import hashlib, hmac
digest = hmac.new(secret.encode(), body, hashlib.sha256).hexdigest()
assert digest == request.headers["X-Signature-SHA256"]

Security

Security is infrastructure, not a feature layer. Key protections:

All XML parsing is hardened against XXE (XML External Entity) injection — DTD and entity resolution disabled
Path traversal prevented — all paths resolved relative to configured root; symlinks rejected
ZIP extraction validates every entry path, caps total bytes and entry count
Project config parsed with yaml.safe_load() and schema validated
Session-cookie authentication on all non-static routes
All HTTP security headers applied (CSP, HSTS, X-Frame-Options, etc.)
No secrets in logs — structured logging only
Dependencies pinned and audited with pip-audit

See SECURITY.md for the full threat/defence matrix.

API Reference

Method	Path	Description
`POST`	`/api/projects`	Upload project config YAML and start analysis
`POST`	`/api/projects/folder`	Start analysis from a local folder path
`POST`	`/api/projects/zip`	Upload ZIP archive (multipart/form-data) and start analysis
`POST`	`/api/projects/trigger`	CI/CD trigger — inline JSON config. Requires `X-API-Key`.
`GET`	`/api/projects`	List all jobs (`?limit=N&offset=M` for pagination)
`GET`	`/api/projects/{id}`	Get analysis job state
`GET`	`/api/projects/{id}/progress`	SSE progress stream (`text/event-stream`)
`GET`	`/api/projects/{id}/strategy.json`	Download strategy JSON
`GET`	`/api/projects/{id}/strategy.pdf`	Download strategy PDF
`GET`	`/api/projects/{id}/strategy.xlsx`	Download strategy Excel workbook
`GET`	`/api/projects/{id}/diff`	Mapping diff vs previous job for same project
`GET`	`/api/projects/{id}/artifacts`	List available download formats (JSON/XLSX/PDF)
`POST`	`/api/projects/{id}/review`	Submit review decision (APPROVE / REJECT)
`POST`	`/api/projects/{id}/override`	Submit mapping override (CONFIRM / MOVE / INDIVIDUALIZE)
`GET`	`/api/projects/{id}/graph`	Get dependency graph data
`GET`	`/api/projects/{id}/groups`	Get pattern groups with members
`GET`	`/api/projects/{id}/groups/{gid}`	Get single group detail
`POST`	`/api/projects/{id}/groups/{gid}/notes`	Add reviewer note to a group
`GET`	`/api/audit`	Audit trail of all review decisions
`GET`	`/api/health`	Liveness + readiness probe

Environment Variables

All configuration is via environment variables (loaded from .env). See .env.example for the full list with descriptions.

Variable	Required	Default	Description
`ANTHROPIC_API_KEY`	Yes*	—	API key for AI-assisted analysis (*not required if all AI features disabled)
`PORT`	No	`8090`	Server port
`HOST`	No	`127.0.0.1`	Server bind address
`DB_PATH`	No	`app/data/analysis.db`	SQLite database path
`CLAUDE_MODEL`	No	`claude-sonnet-4-5`	Model for AI analysis
`LOG_LEVEL`	No	`INFO`	Logging level
`CORS_ORIGINS`	No	`*`	Space-separated allowed origins for CORS
`HTTPS_MODE`	No	`false`	Enable HSTS header (`true` in production)
`API_KEY`	No	—	When set, all write endpoints require `X-API-Key` header
`WATCHER_ENABLED`	No	`false`	Enable folder watcher for auto-triggered analysis
`WATCHER_DIR`	No	`./watch`	Directory polled by the watcher
`WATCHER_POLL_INTERVAL_SECS`	No	`30`	How often the watcher scans (seconds)
`WEBHOOK_URL`	No	—	URL to POST signed event notifications to
`WEBHOOK_SECRET`	No	—	HMAC-SHA256 signing secret for webhook payloads
`WEBHOOK_TIMEOUT_SECS`	No	`10`	HTTP timeout for webhook POST requests
`MAX_UPLOAD_SIZE_MB`	No	`500`	Maximum ZIP upload size
`OUTPUT_DIR`	No	`app/data/output`	Directory for extracted uploads and generated files

Troubleshooting

Analysis fails at "Resolving Source" Check that the source.location in your project config points to a valid, accessible directory. The tool does not follow symlinks.

Most mappings are UNCLASSIFIED Try lowering analysis.fingerprint_strictness from strict to moderate or relaxed. This widens the spine matching tolerance.

Confidence scores are low across the board Check if your mappings use heavy custom SQL overrides or missing mapplet definitions. These reduce confidence because the tool cannot fully determine structural equivalence. The flagged mappings will need manual review.

Parse cache not working The cache keys on SHA-256 of file content. If file paths change but content is identical, the cache still hits. If the cache seems stale, check that the DB_PATH points to the correct database.

Port conflict Change the PORT environment variable in .env. Default is 8090.

Glossary

Term	Definition
Mapping	A single unit of data transformation logic in Informatica PowerCenter. Defines how data flows from sources through transformations to targets. Exported as an XML file.
Project	The complete collection of Informatica PowerCenter mappings, workflows, and parameter files being analyzed for migration. Defined by a single `*.project.yaml` config file.
Spine	The canonical ordered sequence of transformation types in a mapping (e.g., `SQ → EXP → LKP → TARGET`). Used to identify structural similarity between mappings.
Pattern group	A set of mappings that share the same structural spine and can be converted using one parameterized template instead of N separate files.
Variation tier	How much a mapping differs from its group's canonical pattern: Tier 1 (parameter only — table/column names differ), Tier 2 (minor structural — an extra filter or expression), Tier 3 (fundamental — does not group).
Confidence	How certain the tool is about a mapping-to-group assignment: HIGH, MEDIUM, LOW, UNCLASSIFIED.
Dependency edge	A relationship where one mapping's output table is used as another mapping's lookup source — meaning the second must run after the first.
Shared asset	A table referenced as a lookup source by multiple mappings across the project.
Strategy document	The PDF + Excel + JSON output describing pattern groups, dependencies, and conversion recommendations. Reviewed and approved before conversion begins.
Human gate	The structured review step where tech leads and leadership approve or reject the strategy before any conversion work starts.
Project config	The `*.project.yaml` file that defines source location, scope, analysis settings, and reviewers. The single input that drives the entire analysis.

FilesExpand file tree

USER_GUIDE.md

Latest commit

History