Skip to content

Latest commit

 

History

History
685 lines (507 loc) · 27.8 KB

File metadata and controls

685 lines (507 loc) · 27.8 KB

InformaticaProjectAnalysis — User Guide

Version: 1.8.2 Last Updated: 2026-03-20


What Is This Tool For?

If you run Informatica PowerCenter and are planning to migrate to open code — Python scripts, dbt models, PySpark jobs — this tool is the first step. And you already know the gaps that are driving that decision. The platform does what it was built to do, but it has not kept pace with what engineering teams are now being asked to deliver. The most acute gap is GenAI: business teams are demanding AI-enabled pipelines — LLMs (Large Language Models) for classification, enrichment, and anomaly detection embedded in the data flow. Informatica PowerCenter has no native GenAI capability. It cannot call an LLM, cannot integrate with vector databases, and has no roadmap for AI-augmented pipelines. Staying on PowerCenter means the data infrastructure cannot participate in what the business is trying to do next.

Open code closes these gaps. Python, PySpark, and dbt integrate with any model, any API, any tool in the ecosystem — and produce a codebase the team owns, version-controls, and deploys without platform lock-in. That changes the conversion problem: you need to produce well-structured, maintainable source code, not just functionally equivalent scripts inside another tool.

Informatica PowerCenter stores its data transformation logic in units called mappings. Each mapping defines a data flow: source tables → transformations (lookups, expressions, filters, aggregations) → target tables. A typical project has dozens to hundreds of mappings. All of this can be exported as XML files — and those XML exports are what this tool reads.

Why not just convert each mapping one at a time? You can, but you'll end up with a mess. If 14 mappings all follow the same pattern differing only by table name, converting them individually produces 14 separate scripts instead of one parameterized template plus a config file. Shared lookup tables get redefined in every script. There's no dependency graph, no project structure, and massive duplication. You left a proprietary tool and landed in a code mess.

InformaticaProjectAnalysis reads all the mapping XMLs from your Informatica project (the complete collection of exported mappings, workflows, and parameter files) and produces a conversion strategy before any conversion begins. The strategy answers three questions:

  1. Which mappings share the same structural pattern? Mappings with the same transformation flow (e.g., source → lookup → expression → target) differing only by table names and column names are grouped together as candidates for a single parameterized template. The tool shows the evidence and assigns a confidence level.

  2. What depends on what? If mapping A loads DIM_CUSTOMER and mapping B does a lookup against DIM_CUSTOMER, then B depends on A. The tool builds a dependency graph across all mappings and computes a safe execution order.

  3. What needs human attention? Not every mapping can be confidently classified. Custom SQL overrides, missing definitions, and unusual patterns reduce certainty. The tool flags these for tech lead review.

The output is a strategy document (PDF + Excel + JSON) that tech leads and leadership review and approve before conversion work starts. The tool does not prescribe which target language to use or how to orchestrate — those decisions belong to the humans reviewing the strategy.


Quick Start

1. Prerequisites

  • Python 3.11+
  • An Anthropic API key (for AI-assisted analysis in Phases 1 and 2)

2. Installation

git clone https://github.com/ad25343/InformaticaProjectAnalysis.git
cd InformaticaProjectAnalysis

python -m venv .venv
source .venv/bin/activate    # macOS / Linux
# .venv\Scripts\activate     # Windows

pip install -r requirements.txt

3. Configuration

Copy the example environment file and fill in your values:

cp .env.example .env

Required settings in .env:

Variable Purpose
ANTHROPIC_API_KEY Your Anthropic API key for AI-assisted analysis
SECRET_KEY Random string for session cookie signing (change from default)

Optional settings are documented in .env.example.

4. Create a Project Config

Create a *.project.yaml file that points at your Informatica project exports:

project:
  name: "My Migration"
  version: "1.0"
  owner: "Data Engineering"

source:
  type: folder
  location: "/path/to/informatica/exports/"

scope:
  mappings:
    include: ["mappings/**/*.xml"]
    exclude: ["mappings/archive/**"]
  workflows:
    include: ["workflows/**/*.xml"]
  parameters:
    include: ["parameter_files/*.xml"]
  default_parameter_env: "dev"

analysis:
  fingerprint_strictness: "moderate"
  min_group_size: 2
  confidence_threshold: 0.7
  detect_shared_assets: true
  build_dependency_dag: true
  classify_expressions: true

review:
  tech_lead:
    name: "Your Name"
    email: "you@company.com"

5. Run the Tool

python -m app.main

The server starts on http://localhost:8090 (configurable via the PORT env var). Open the URL in your browser to access the UI.


Project Config Reference

The *.project.yaml file is the single input that drives the entire analysis.

project Section

Field Required Description
name Yes Human-readable project name (appears in strategy documents)
version No Your version label for this migration scope
owner No Team or individual responsible

source Section

Field Required Description
type Yes How to locate the files: folder, repo, zip, or s3
location Yes Path, URL, or bucket depending on source type

Source types:

  • folder — local directory path; the tool scans recursively using scope globs
  • repo — Git repository URL; shallow-cloned via GitPython. Optional branch and path (subfolder within repo) fields. Temp dir cleaned up after analysis.
  • zip — uploaded ZIP archive via POST /api/projects/zip (multipart form). Path traversal protection; size capped at MAX_UPLOAD_SIZE_MB (default 500 MB).
  • s3 — S3 bucket path; objects pulled and scanned (bring your own credentials via AWS env vars)

scope Section

Controls which files within the source are included in the analysis.

Field Required Description
mappings.include Yes Glob patterns for mapping XML files
mappings.exclude No Glob patterns to skip
workflows.include No Glob patterns for workflow XML files
parameters.include No Glob patterns for parameter files
default_parameter_env No Which parameter environment to use (default: dev)

analysis Section

Field Default Description
fingerprint_strictness moderate How strict spine matching is (spine = the ordered sequence of transformation types in a mapping; see "How the Analysis Works" below): strict, moderate, relaxed
min_group_size 2 Minimum mappings needed to form a pattern group
confidence_threshold 0.7 Below this, mappings are flagged for human review
detect_shared_assets true Identify tables referenced by multiple mappings
build_dependency_dag true Build cross-mapping dependency graph (DAG = Directed Acyclic Graph)
classify_expressions true Use AI to classify Expression transformation port complexity
interpret_sql true Use AI to extract structured intent from Source Qualifier SQL overrides
ai_refine_confidence false Use AI to re-score MEDIUM/LOW group member confidence levels (slower; requires API key)

review Section

Names and emails of the reviewers who will approve the strategy. These appear in the strategy document and audit trail.

notifications Section (optional)

This section is a config convenience; the actual webhook is configured via environment variables (WEBHOOK_URL, WEBHOOK_SECRET). When set, the server POSTs signed events to the webhook URL automatically — no config section required.

Field Description
webhook_url Slack/Teams/custom webhook for status notifications
events Which events trigger notifications (analysis complete, strategy ready, review approved)

How the Analysis Works

The analysis runs in five phases. You can watch progress in real time via the progress bar in the UI (powered by the SSE endpoint GET /api/projects/{id}/progress) or by consuming that endpoint directly.

Phase 1 — Discovery

The tool resolves your source location, scans for files matching your scope globs, and parses every mapping XML it finds. Each mapping is reduced to its structural components: transformation types, connector topology, expression bodies, lookup references, parameter variables, and SQL overrides.

From the individual parse results, the tool builds a project-level graph:

  • Dependency edges — if mapping A writes to TABLE_X and mapping B has a Lookup against TABLE_X, then B depends on A.
  • Shared assets — tables that appear as lookup sources in multiple mappings.
  • Repeated expressions — expression logic that appears verbatim across many mappings.

Parse cache: results are keyed by SHA-256 of file content. On re-analysis, only new or changed files are re-parsed. The cache lives in the SQLite database at DB_PATH.

AI-assisted interpretation (Phase 1.1): If interpret_sql is enabled, Source Qualifier SQL overrides are sent to Claude to extract structured intent: source_tables, join_conditions, filter_logic, business_intent. Results are stored on each MappingParseResult and included in the strategy JSON.

AI expression classification (Phase 2.1): If classify_expressions is enabled, non-trivial Expression transformation ports are classified into: SIMPLE_CALC, LOOKUP_CHAIN, CUSTOM_FUNCTION, SQL_OVERRIDE, REGEX_TRANSFORM, UNKNOWN.

AI confidence refinement (Phase 2.2): If ai_refine_confidence is enabled, Claude re-evaluates each MEDIUM or LOW group member: "does this mapping structurally fit this group?" Updated confidence and reason are stored on the group member.

Phase 2 — Pattern Grouping

Each mapping's transformation topology is reduced to a canonical "spine" — the ordered sequence of transformation types from source to target.

Examples:

  • SQ → EXP → TARGET (simple load)
  • SQ → LKP → EXP → RTR → UPD → TARGET (SCD2 / slowly changing dimension)
  • SQ(×3) → JNR → LKP(×2) → EXP → RTR → TARGET(×2) (complex multi-source)

Mappings with matching spines are candidates for the same pattern group. Within each group, variation is classified:

  • Tier 1 — Parameter variation. Structurally identical. Only table names, column names, and filter values differ. One template + one config file.

  • Tier 2 — Minor structural variation. Core flow is the same, but minor differences exist (an extra Expression, a Filter present in some but not all). Template accommodates variation via config flags.

  • Tier 3 — Fundamental variation. Different transformation types or flow shapes. Does not group. Convert individually.

Each mapping-to-group assignment carries a confidence level: HIGH, MEDIUM, LOW, or UNCLASSIFIED.

Phase 3 — Strategy Document Generation

Three outputs are produced from the same analysis data:

PDF report — two layers in one document. Page 1 is the leadership summary (mapping count, pattern groups, scope reduction, risk flags). Remaining pages are tech lead detail (per-group evidence, dependency DAG, shared assets, per-mapping assignments with confidence).

Excel workbook — five sheets: Pattern Groups, Dependency Graph, Shared Assets, Per-Mapping Assignments, Risk Flags.

Strategy JSON — machine-readable format containing pattern groups with members, unique mappings with reasons, shared assets, dependency DAG, and execution order.

Phase 4 — Human Gate (Strategy Review)

The strategy is presented in the UI for review. This is the decision gate.

Available actions:

Action What it does
Confirm mapping assignment Locks the mapping's group assignment
Move mapping to different group Simple override — no re-analysis needed
Convert mapping individually Removes mapping from its group
Split group Triggers lightweight re-validation
Merge groups Triggers lightweight re-validation
Add notes Stored per mapping and per group
APPROVE Finalizes the strategy; generates approved strategy JSON
REJECT Returns to analysis with reviewer notes as constraints

Every review action is recorded in the audit trail with reviewer name, role, timestamp, and decision.

Phase 5 — Strategy Delivery

The approved strategy JSON is the final output. It contains everything needed to drive a conversion: which mappings belong to which pattern groups, what parameters are externalized, which mappings convert individually and why, the dependency DAG, and the execution order.

The strategy can be delivered as a file or via API POST to a downstream conversion tool.


The Three UI Views

Dashboard (Leadership)

The dashboard shows the project at a glance: total mappings, pattern groups found, template candidates, unique mappings, and the scope reduction metric (e.g., "50 mappings → 8 templates + 12 unique files"). It includes complexity distribution, confidence breakdown, dependency depth, and risk flags.

The APPROVE / REJECT gate is on this page — leadership makes the call after tech leads have reviewed the detail.

Pattern Groups (Tech Leads)

The detail view for validating groupings. The left panel lists all pattern groups with member count and confidence indicator. Selecting a group shows:

  • The spine (transformation flow)
  • Member mappings with variation tier, confidence, and flags
  • Parameter differences across members (what varies)
  • Structural evidence (why these were grouped)
  • Per-mapping override controls (confirm, move, convert individually)
  • Notes field per mapping and per group

Dependency Graph (Both Audiences)

Interactive dependency graph (vis-network) visualization. Nodes are mappings, colored by execution stage. Shared assets appear as diamond-shaped nodes. Click a node to see its upstream dependencies, downstream dependents, and shared assets in the detail panel on the right.

Execution stages are highlighted so you can see which mappings run in parallel and where serialization is required.


AI Chat Tab

The 💬 AI Chat tab provides a multi-turn conversation with an AI assistant that has complete knowledge of the current analysis job.

What it knows: All pattern groups (with members, confidence, assignees, risk flags), unique mappings, anomaly report, reviewer brief, and strategy narrative.

What to ask:

  • "Which mappings are HIGH complexity and unassigned?"
  • "What are the biggest risks in the ETL Staging group?"
  • "Summarise the duplicate mappings found."
  • "How many dev-days would the unassigned work take?"

Setup: Enter your name in the name field (persisted in browser localStorage). Messages are stored per-job; history is included in chat context (last 20 turns).

To clear history: Click "Clear history" — this resets the conversation context so the AI starts fresh on your next message.

Requirements: ANTHROPIC_API_KEY must be set in the server environment and CHAT_ENABLED=true (the default). If chat is disabled the tab returns a 503.


Help Tab

The ? Help tab contains this tool's user guide as an in-app reference, including:

  • A User Workflow flowchart showing the end-to-end process
  • An Analysis Pipeline flowchart showing all 4 phases and 7 AI agents
  • Getting Started steps, YAML config reference, tab guide, override controls, AI feature descriptions, export formats, and environment variables

Pattern Groups — Search and Filter (v1.8.2)

The left sidebar now supports:

  • Search box — type to filter groups by name in real time
  • Confidence filter pills — ALL / HIGH / MEDIUM / LOW; shows match count

The group detail panel now includes a "Why this confidence?" section explaining the confidence verdict with data-driven bullet reasons:

  • Member count (small groups → less certain)
  • Mixed variation tiers (Tier 1 + Tier 3 in same group → inconsistent)
  • Risk flag count on members
  • LOW-confidence members within a nominally higher-confidence group

Re-opening an Approved Strategy (v1.8.2)

If a Tech Lead approves and subsequently spots an error, they can click "Re-open for amendment" on the Review Gate. This:

  1. Reverts the job status to AWAITING_REVIEW
  2. Re-activates all override controls (Confirm / Individualize / Move)
  3. Appends a review:reopened event to the audit trail (who, when, why)
  4. Preserves the original approval record — the audit trail shows both events

After making adjustments, the Tech Lead submits a new approval. Both approval events appear in the audit log.

Note: Re-open is only available on APPROVED jobs. REJECTED jobs require a fresh analysis run.


Landing Page — Metrics Bar (v1.8.2)

When previous analyses exist, the landing page shows five programme-level KPIs:

Metric Description
Projects Analyzed Total jobs in the database (all statuses)
Total Mappings Sum of mappings across all approved/awaiting-review jobs
Avg Scope Reduction Mean scope reduction % — the headline ROI number
Awaiting Review Jobs pending Tech Lead approval (pulses when non-zero)
Approved Strategies Jobs with approved status — completed deliverables

AI Enhancement Features (v1.8.0)

Seven AI agents run automatically in Phase 3 when ANTHROPIC_API_KEY is set. All are non-fatal — the pipeline continues if any agent fails.

Risk flags appear as red mono tags in the Pattern Groups member table. Each flag is a category prefix (e.g., HARDCODED_DATE, COMPLEX_SQL); hover for the full description.

Duplicate detection adds a purple dup badge on the mapping name. Hover to see which mapping it likely duplicates. Flagged mappings are candidates for removal before conversion begins.

AI group names replace machine-generated IDs (GRP-004Daily Sales Extract). The original ID is shown as a subtitle in mono text. The AI description appears in the group detail header.

T-shirt sizing (S / M / L / XL) replaces raw effort hour numbers in the Dashboard groups table. Hover the badge to see the dev-day range.

Size Effort Dev-days
S < 16h < 2 days
M 16–40h 2–5 days
L 40–80h 1–2 weeks
XL > 80h needs scoping

These are indicative, not commitments. They are useful for capacity planning and stakeholder conversations, not sprint estimation.

Reviewer Brief and Anomaly Report are accessible in the Dashboard as collapsible panels (click to load lazily). The reviewer brief is a numbered checklist of what the Tech Lead should scrutinise before approving. The anomaly report lists statistical outliers with severity badges and recommendations.


Three Operating Modes

Interactive

Upload or point to a project config through the UI. Watch the analysis run in real time. Review the strategy in the browser. Approve or override.

Watcher

The tool polls a directory for *.project.yaml files. When one appears or changes, analysis triggers automatically. Configure via:

WATCHER_ENABLED=true
WATCHER_DIR=./watch
WATCHER_POLL_INTERVAL_SECS=30

CI/CD

A pipeline step posts the project config via POST /api/projects/trigger with an inline JSON body. The endpoint requires X-API-Key when API_KEY is set in the environment. The strategy document and other artifacts can be retrieved via GET /api/projects/{id}/artifacts.

curl -X POST http://localhost:8090/api/projects/trigger \
  -H "Content-Type: application/json" \
  -H "X-API-Key: $API_KEY" \
  -d '{
    "project_name": "MyProject",
    "source_type": "folder",
    "source_location": "/path/to/exports",
    "previous_job_id": "abc123"
  }'

Incremental Analysis

When you re-run analysis on a project that has changed:

  • Only new or modified mapping XMLs are re-parsed (unchanged files hit the SHA-256 cache)
  • Pattern grouping runs on the full project (fast — parsing is cached)
  • Previous human decisions (overrides, confirmations, notes) are preserved unless the underlying XML changed
  • The strategy document includes a diff section showing what changed

Webhooks

When WEBHOOK_URL is set, the server automatically POSTs signed JSON events to that URL at key lifecycle points.

Events:

Event When
analysis_started Job begins running
analysis_complete Job reaches PENDING_REVIEW
analysis_failed Job fails with an error
review_approved Strategy approved
review_rejected Strategy rejected

Payload format:

{ "event": "analysis_complete", "job_id": "abc123", "project_name": "MyProject" }

Signature: every request includes an X-Signature-SHA256 header containing the HMAC-SHA256 hex digest of the JSON body, signed with WEBHOOK_SECRET. Verify on your end:

import hashlib, hmac
digest = hmac.new(secret.encode(), body, hashlib.sha256).hexdigest()
assert digest == request.headers["X-Signature-SHA256"]

Security

Security is infrastructure, not a feature layer. Key protections:

  • All XML parsing is hardened against XXE (XML External Entity) injection — DTD and entity resolution disabled
  • Path traversal prevented — all paths resolved relative to configured root; symlinks rejected
  • ZIP extraction validates every entry path, caps total bytes and entry count
  • Project config parsed with yaml.safe_load() and schema validated
  • Session-cookie authentication on all non-static routes
  • All HTTP security headers applied (CSP, HSTS, X-Frame-Options, etc.)
  • No secrets in logs — structured logging only
  • Dependencies pinned and audited with pip-audit

See SECURITY.md for the full threat/defence matrix.


API Reference

Method Path Description
POST /api/projects Upload project config YAML and start analysis
POST /api/projects/folder Start analysis from a local folder path
POST /api/projects/zip Upload ZIP archive (multipart/form-data) and start analysis
POST /api/projects/trigger CI/CD trigger — inline JSON config. Requires X-API-Key.
GET /api/projects List all jobs (?limit=N&offset=M for pagination)
GET /api/projects/{id} Get analysis job state
GET /api/projects/{id}/progress SSE progress stream (text/event-stream)
GET /api/projects/{id}/strategy.json Download strategy JSON
GET /api/projects/{id}/strategy.pdf Download strategy PDF
GET /api/projects/{id}/strategy.xlsx Download strategy Excel workbook
GET /api/projects/{id}/diff Mapping diff vs previous job for same project
GET /api/projects/{id}/artifacts List available download formats (JSON/XLSX/PDF)
POST /api/projects/{id}/review Submit review decision (APPROVE / REJECT)
POST /api/projects/{id}/override Submit mapping override (CONFIRM / MOVE / INDIVIDUALIZE)
GET /api/projects/{id}/graph Get dependency graph data
GET /api/projects/{id}/groups Get pattern groups with members
GET /api/projects/{id}/groups/{gid} Get single group detail
POST /api/projects/{id}/groups/{gid}/notes Add reviewer note to a group
GET /api/audit Audit trail of all review decisions
GET /api/health Liveness + readiness probe

Environment Variables

All configuration is via environment variables (loaded from .env). See .env.example for the full list with descriptions.

Variable Required Default Description
ANTHROPIC_API_KEY Yes* API key for AI-assisted analysis (*not required if all AI features disabled)
PORT No 8090 Server port
HOST No 127.0.0.1 Server bind address
DB_PATH No app/data/analysis.db SQLite database path
CLAUDE_MODEL No claude-sonnet-4-5 Model for AI analysis
LOG_LEVEL No INFO Logging level
CORS_ORIGINS No * Space-separated allowed origins for CORS
HTTPS_MODE No false Enable HSTS header (true in production)
API_KEY No When set, all write endpoints require X-API-Key header
WATCHER_ENABLED No false Enable folder watcher for auto-triggered analysis
WATCHER_DIR No ./watch Directory polled by the watcher
WATCHER_POLL_INTERVAL_SECS No 30 How often the watcher scans (seconds)
WEBHOOK_URL No URL to POST signed event notifications to
WEBHOOK_SECRET No HMAC-SHA256 signing secret for webhook payloads
WEBHOOK_TIMEOUT_SECS No 10 HTTP timeout for webhook POST requests
MAX_UPLOAD_SIZE_MB No 500 Maximum ZIP upload size
OUTPUT_DIR No app/data/output Directory for extracted uploads and generated files

Troubleshooting

Analysis fails at "Resolving Source" Check that the source.location in your project config points to a valid, accessible directory. The tool does not follow symlinks.

Most mappings are UNCLASSIFIED Try lowering analysis.fingerprint_strictness from strict to moderate or relaxed. This widens the spine matching tolerance.

Confidence scores are low across the board Check if your mappings use heavy custom SQL overrides or missing mapplet definitions. These reduce confidence because the tool cannot fully determine structural equivalence. The flagged mappings will need manual review.

Parse cache not working The cache keys on SHA-256 of file content. If file paths change but content is identical, the cache still hits. If the cache seems stale, check that the DB_PATH points to the correct database.

Port conflict Change the PORT environment variable in .env. Default is 8090.


Glossary

Term Definition
Mapping A single unit of data transformation logic in Informatica PowerCenter. Defines how data flows from sources through transformations to targets. Exported as an XML file.
Project The complete collection of Informatica PowerCenter mappings, workflows, and parameter files being analyzed for migration. Defined by a single *.project.yaml config file.
Spine The canonical ordered sequence of transformation types in a mapping (e.g., SQ → EXP → LKP → TARGET). Used to identify structural similarity between mappings.
Pattern group A set of mappings that share the same structural spine and can be converted using one parameterized template instead of N separate files.
Variation tier How much a mapping differs from its group's canonical pattern: Tier 1 (parameter only — table/column names differ), Tier 2 (minor structural — an extra filter or expression), Tier 3 (fundamental — does not group).
Confidence How certain the tool is about a mapping-to-group assignment: HIGH, MEDIUM, LOW, UNCLASSIFIED.
Dependency edge A relationship where one mapping's output table is used as another mapping's lookup source — meaning the second must run after the first.
Shared asset A table referenced as a lookup source by multiple mappings across the project.
Strategy document The PDF + Excel + JSON output describing pattern groups, dependencies, and conversion recommendations. Reviewed and approved before conversion begins.
Human gate The structured review step where tech leads and leadership approve or reject the strategy before any conversion work starts.
Project config The *.project.yaml file that defines source location, scope, analysis settings, and reviewers. The single input that drives the entire analysis.