Architecture

Overview

The Logic Network Generator transforms Reactome pathway data into directed logic networks suitable for perturbation analysis and pathway flow studies. The system decomposes complex biochemical structures (complexes and entity sets) into individual components and creates a network where edges represent biochemical transformations.

Data Flow

┌──────────────────────────────────────────────────────────────────────┐
│                          Reactome Neo4j Database                     │
│                       (Biological Pathway Data)                      │
└──────────────────────────────────────────────────────────────────────┘
                                    │
                                    │ Neo4j Queries
                                    ↓
┌──────────────────────────────────────────────────────────────────────┐
│                    reaction_connections_{pathway_id}.csv             │
│    (Connections between reactions: preceding → following)            │
└──────────────────────────────────────────────────────────────────────┘
                                    │
                                    │ Decomposition
                                    │ (Break complexes/sets into components)
                                    ↓
┌──────────────────────────────────────────────────────────────────────┐
│                 decomposed_uid_mapping_{pathway_id}.csv              │
│  (Maps hashes to individual physical entities - proteins, etc.)      │
└──────────────────────────────────────────────────────────────────────┘
                                    │
                                    │ Hungarian Algorithm
                                    │ (Optimal input/output pairing)
                                    ↓
┌──────────────────────────────────────────────────────────────────────┐
│                    best_matches_{pathway_id}.csv                     │
│        (Pairs of input/output combinations within reactions)         │
└──────────────────────────────────────────────────────────────────────┘
                                    │
                                    │ Logic Network Generation
                                    │ (Create transformation edges)
                                    │ (Position-aware UUID assignment)
                                    ↓
┌──────────────────────────────────────────────────────────────────────┐
│                    pathway_logic_network.csv                         │
│  (source_id → target_id edges with AND/OR logic annotations)         │
└──────────────────────────────────────────────────────────────────────┘
                                    │
                                    │ UUID Mapping Export
                                    ↓
┌──────────────────────────────────────────────────────────────────────┐
│                    uuid_to_reactome_{pathway_id}.csv                 │
│        (Maps UUIDs back to Reactome database IDs)                    │
└──────────────────────────────────────────────────────────────────────┘

Key Concepts

1. Physical Entities

In Reactome, a :PhysicalEntity represents any biological molecule or complex:

  • Simple molecules (ATP, water)
  • Proteins (individual gene products)
  • Complexes (protein complexes like Complex(A,B,C))
  • Entity sets (alternative molecules like EntitySet(IsoformA, IsoformB))

2. Decomposition

Complex structures are broken down into individual components:

Input: Complex(ProteinA, ProteinB, EntitySet(ATP, GTP))
                    ↓ decomposition
Output:
  - Combination 1: ProteinA, ProteinB, ATP
  - Combination 2: ProteinA, ProteinB, GTP

This creates all possible molecular combinations through cartesian product, preserving biological alternatives.
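
The decomposition step can be illustrated with a minimal sketch. The tuple-based entity representation and the `decompose` helper below are assumptions made for this example, not the data model used in src/reaction_generator.py; only the use of itertools.product comes from this document.

```python
from itertools import product

# Hypothetical in-memory representation (not Reactome's schema):
#   str                     -> a simple entity such as a protein or small molecule
#   ("complex", [members])  -> all members are present together
#   ("set", [members])      -> exactly one member is present


def decompose(entity):
    """Return every flat combination of simple entities for one entity."""
    if isinstance(entity, str):
        return [(entity,)]
    kind, members = entity
    member_options = [decompose(member) for member in members]
    if kind == "set":
        # Alternatives: each member's combinations stay separate options.
        return [combo for options in member_options for combo in options]
    if kind == "complex":
        # Co-occurrence: pick one option per member, then flatten the choice.
        return [
            tuple(name for part in choice for name in part)
            for choice in product(*member_options)
        ]
    raise ValueError(f"unknown entity kind: {kind!r}")


example = ("complex", ["ProteinA", "ProteinB", ("set", ["ATP", "GTP"])])
combinations = decompose(example)
for combo in combinations:
    print(combo)
```

Because the function recurses before taking the product, a set nested inside a complex (or a complex nested inside a set) is expanded the same way at any depth.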

3. Virtual Reactions

A single biological reaction in Reactome may represent multiple transformations after decomposition:

Biological Reaction (Reactome ID: 12345):
  Inputs: Complex(A,B), ATP
  Outputs: Complex(A,B,P), ADP

After decomposition and best matching:
  Virtual Reaction 1 (UID: uuid-1, Reactome ID: 12345):
    input_hash: "hash-of-[A,B,ATP]"
    output_hash: "hash-of-[A,B,P,ADP]"

  Virtual Reaction 2 (UID: uuid-2, Reactome ID: 12345):
    input_hash: "hash-of-[A,B,ATP]"
    output_hash: "hash-of-[A,P,B,ADP]"
  ...

Each virtual reaction gets a unique UID (UUID v4) while preserving the link to the original Reactome reaction ID.
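
As a rough sketch of that bookkeeping, each matched hash pair can be turned into a record carrying a fresh UUID v4 and the shared Reactome ID. The function name and dictionary keys are illustrative assumptions; the actual logic lives in create_reaction_id_map() and may be structured differently.

```python
import uuid


def make_virtual_reactions(reactome_id, matched_hash_pairs):
    """Create one virtual-reaction record per (input_hash, output_hash) pair.

    Field names here are hypothetical, chosen to mirror the example above.
    """
    return [
        {
            "uid": str(uuid.uuid4()),    # unique per virtual reaction
            "reactome_id": reactome_id,  # shared link to the biological reaction
            "input_hash": input_hash,
            "output_hash": output_hash,
        }
        for input_hash, output_hash in matched_hash_pairs
    ]


pairs = [
    ("hash-of-[A,B,ATP]", "hash-of-[A,B,P,ADP]"),
    ("hash-of-[A,B,ATP]", "hash-of-[A,P,B,ADP]"),
]
virtual_reactions = make_virtual_reactions(12345, pairs)
for record in virtual_reactions:
    print(record["uid"], record["reactome_id"])
```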

4. Edge Semantics

CRITICAL: Edges represent transformations WITHIN reactions, not connections BETWEEN reactions.

Reaction: ATP + Water → ADP + Phosphate

Creates 4 edges (cartesian product):
  ATP       → ADP
  ATP       → Phosphate
  Water     → ADP
  Water     → Phosphate
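
The four edges listed for this reaction are what itertools.product yields when inputs are paired with outputs. The helper name below is hypothetical and the sketch ignores UUIDs and logic annotations.

```python
from itertools import product


def transformation_edges(inputs, outputs):
    """Pair every input with every output of a single reaction."""
    return list(product(inputs, outputs))


edges = transformation_edges(["ATP", "Water"], ["ADP", "Phosphate"])
for source, target in edges:
    print(f"{source} -> {target}")
```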

Reactions connect implicitly through shared physical entities:

Reaction 1: A → B (creates edge where B is target)
Reaction 2: B → C (creates edge where B is source)

Result: Pathway flow A → B → C (B connects the reactions)

Self-loops are minimized using position-aware UUIDs. The union-find algorithm ensures that entities in the same connected component share a UUID, so any self-loops that remain are intentional and represent pathway flow. Entities at disconnected positions get different UUIDs.

5. Position-Aware UUIDs

The system uses position-aware UUIDs to uniquely identify entities at different pathway positions:

Example:
  Reaction1 β†’ gene1 β†’ Reaction2
  Reaction3 β†’ gene1 β†’ Reaction2

Result: gene1 gets UUID_A (connected component)

But elsewhere:
  Reaction100 β†’ gene1 β†’ Reaction101

Result: gene1 gets UUID_B (different position)

Key Properties:

  • Entities in same connected component share UUIDs (union-find algorithm)
  • Entities at disconnected positions get different UUIDs
  • Registry tracks: (entity_dbId, reaction_uuid, role) β†’ entity_uuid
  • Results in 0% self-loops in real pathways while maintaining connectivity

See UUID_DESIGN.md for detailed design.
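
The key properties above can be sketched with a small union-find structure keyed on (entity_dbId, reaction, role) positions. The class and method names are assumptions for illustration and do not reproduce _get_or_create_entity_uuid(); the dbId 101 and the reaction labels are made up to match the example.

```python
import uuid


class PositionRegistry:
    """Union-find over (entity_db_id, reaction_id, role) position keys."""

    def __init__(self):
        self._parent = {}
        self._uuid_by_root = {}

    def _find(self, key):
        self._parent.setdefault(key, key)
        root = key
        while self._parent[root] != root:
            root = self._parent[root]
        # Path compression: point every visited key straight at the root.
        while self._parent[key] != root:
            self._parent[key], key = root, self._parent[key]
        return root

    def connect(self, key_a, key_b):
        """Record that two positions belong to the same connected component."""
        root_a, root_b = self._find(key_a), self._find(key_b)
        if root_a != root_b:
            self._parent[root_b] = root_a

    def entity_uuid(self, key):
        """Return the UUID shared by every position in key's component.

        Call this only after all connect() calls, since merging changes roots.
        """
        root = self._find(key)
        if root not in self._uuid_by_root:
            self._uuid_by_root[root] = str(uuid.uuid4())
        return self._uuid_by_root[root]


registry = PositionRegistry()
# gene1 (hypothetical dbId 101) leaves Reaction1 and Reaction3 and enters Reaction2.
registry.connect((101, "Reaction1", "output"), (101, "Reaction2", "input"))
registry.connect((101, "Reaction3", "output"), (101, "Reaction2", "input"))
# The same gene elsewhere in the pathway forms a separate component.
registry.connect((101, "Reaction100", "output"), (101, "Reaction101", "input"))

uuid_a = registry.entity_uuid((101, "Reaction1", "output"))
uuid_a2 = registry.entity_uuid((101, "Reaction3", "output"))
uuid_b = registry.entity_uuid((101, "Reaction100", "output"))
print(uuid_a == uuid_a2, uuid_a == uuid_b)  # prints: True False
```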

6. AND/OR Logic

The logic network assigns AND/OR relationships based on how many reactions produce the same physical entity:

OR Relationship (Multiple sources):

R1: Glycolysis → ATP
R2: Oxidative Phosphorylation → ATP
R3: ATP → Energy

For R3: ATP can come from R1 OR R2
Edges: R1→ATP (OR), R2→ATP (OR)
Then:  ATP→R3 (AND - ATP is required)

AND Relationship (Single source):

R1: Glucose → Glucose-6-Phosphate
R2: Glucose-6-Phosphate → ...

Only one source produces Glucose-6-Phosphate
Edge: R1→G6P (AND - required)

Rule:

  • Multiple preceding reactions → OR (alternatives)
  • Single preceding reaction → AND (required)
  • All inputs to reactions are AND (required)
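
The rule can be sketched by counting how many reactions produce each entity. This sketch mirrors the simplified reaction-to-entity notation of the examples above, not the entity-to-entity edges of the final network, and the function name is an assumption rather than the signature of _determine_edge_properties().

```python
from collections import defaultdict


def annotate_and_or(reactions):
    """Label producer edges "or"/"and" and consumer edges "and".

    `reactions` maps a reaction name to an (inputs, outputs) pair.
    """
    producers = defaultdict(set)
    for name, (_inputs, outputs) in reactions.items():
        for entity in outputs:
            producers[entity].add(name)

    edges = []
    for name, (inputs, outputs) in reactions.items():
        for entity in outputs:
            # More than one producing reaction means the sources are alternatives.
            logic = "or" if len(producers[entity]) > 1 else "and"
            edges.append((name, entity, logic))
        for entity in inputs:
            # Every input to a reaction is required.
            edges.append((entity, name, "and"))
    return edges


logic_edges = annotate_and_or(
    {
        "R1": (["Glycolysis"], ["ATP"]),
        "R2": (["Oxidative Phosphorylation"], ["ATP"]),
        "R3": (["ATP"], ["Energy"]),
    }
)
for edge in logic_edges:
    print(edge)
```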

Component Architecture

Core Components

1. src/neo4j_connector.py

Purpose: Query Reactome Neo4j database

Key Functions:

  • get_reaction_connections(): Get preceding/following reaction pairs
  • get_catalysts_for_reaction(): Get catalyst relationships
  • get_positive/negative_regulators_for_reaction(): Get regulatory relationships

Output: Raw Reactome data as DataFrames

2. src/reaction_generator.py

Purpose: Decompose complexes and sets into components

Key Functions:

  • get_decomposed_uid_mapping(): Main decomposition orchestrator
  • Handles complexes (using itertools.product for combinations)
  • Handles entity sets (using itertools.product for alternatives)
  • Recursively decomposes nested structures

Output: decomposed_uid_mapping with all molecular combinations

3. src/best_reaction_match.py

Purpose: Pair input/output combinations optimally

Algorithm: Hungarian algorithm (optimal assignment)

Input: Input combinations and output combinations from same reaction

Output: best_matches DataFrame with optimal pairings
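
To make the assignment problem concrete, the sketch below finds the minimum-cost one-to-one pairing by brute force over permutations. This is a stand-in for the Hungarian algorithm, which reaches the same optimum in polynomial time (SciPy's scipy.optimize.linear_sum_assignment is one existing solver for this problem). The cost function, counting entities not shared between an input and an output combination, is an assumption; this document does not state which cost src/best_reaction_match.py uses.

```python
from itertools import permutations


def mismatch_cost(inputs, outputs):
    """Assumed cost: number of entities not shared by the two combinations."""
    return len(set(inputs) ^ set(outputs))


def best_pairing(input_combos, output_combos):
    """Exhaustively find the minimum-cost one-to-one pairing.

    Illustration only: factorial time, and it assumes equally many
    input and output combinations.
    """
    if len(input_combos) != len(output_combos):
        raise ValueError("this sketch needs equal numbers of combinations")
    best = min(
        permutations(range(len(output_combos))),
        key=lambda perm: sum(
            mismatch_cost(input_combos[i], output_combos[j])
            for i, j in enumerate(perm)
        ),
    )
    return [(input_combos[i], output_combos[j]) for i, j in enumerate(best)]


inputs = [("A", "B", "ATP"), ("A", "C", "ATP")]
outputs = [("A", "C", "P", "ADP"), ("A", "B", "P", "ADP")]
matches = best_pairing(inputs, outputs)
for matched_input, matched_output in matches:
    print(matched_input, "->", matched_output)
```

Here the pairing that keeps B with B and C with C costs 6 in total, against 10 for the crossed pairing, so it is the one selected.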

4. src/logic_network_generator.py

Purpose: Generate the final logic network with position-aware UUIDs

Key Functions:

  • create_pathway_logic_network(): Main orchestrator
  • _get_or_create_entity_uuid(): Union-find UUID assignment
  • _assign_uuids(): Position-aware UUID generation
  • create_reaction_id_map(): Create virtual reactions from best_matches
  • extract_inputs_and_outputs(): Create transformation edges
  • _determine_edge_properties(): Assign AND/OR logic
  • _add_pathway_connections(): Add edges with cartesian product
  • append_regulators(): Add catalyst/regulator edges
  • export_uuid_to_reactome_mapping(): Export UUID→dbId mapping

Output:

  • Logic network DataFrame with edges and logic annotations
  • UUID to Reactome ID mapping for entity tracking

Bin Scripts

bin/create-pathways.py

Purpose: Command-line interface for generating pathways

Usage:

# Single pathway
poetry run python bin/create-pathways.py --pathway-id 69620

# Multiple pathways
poetry run python bin/create-pathways.py --pathway-list pathways.tsv

bin/create-db-id-name-mapping-file.py

Purpose: Create human-readable mapping of database IDs to names

Network Properties

Node Types

  • Root Inputs: Physical entities that only appear as sources (pathway starting points)
  • Intermediate Entities: Appear as both sources and targets (connect reactions)
  • Terminal Outputs: Physical entities that only appear as targets (pathway endpoints)
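
Given a list of (source, target) edges, the three node types fall out of simple set operations. The function and key names below are illustrative assumptions.

```python
def classify_nodes(edges):
    """Split nodes into root inputs, intermediates, and terminal outputs."""
    sources = {source for source, _target in edges}
    targets = {target for _source, target in edges}
    return {
        "root_inputs": sources - targets,       # only ever a source
        "intermediate": sources & targets,      # both source and target
        "terminal_outputs": targets - sources,  # only ever a target
    }


roles = classify_nodes([("A", "B"), ("B", "C")])
print(roles)
```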

Edge Types

  • Main edges: Transformation edges within reactions

    • edge_type: "input" (single source, AND) or "output" (multiple sources, OR)
    • pos_neg: "pos" (positive transformation)
    • and_or: "and" (required) or "or" (alternative)
  • Regulatory edges: Catalysts and regulators

    • edge_type: "catalyst" or "regulator"
    • pos_neg: "pos" (positive regulation) or "neg" (negative regulation)
    • and_or: Empty (not applicable to regulation)

Network Structure

  • Directed: Edges have direction (source → target)
  • Acyclic: No cycles in main transformation edges (within individual reactions)
  • Bipartite-like: Entities and reactions connect through transformations
  • Minimal self-loops: Position-aware UUIDs minimize self-loops while preserving pathway connectivity

Testing Strategy

Test Categories

  1. Unit Tests (tests/test_logic_network_generator.py)

    • Individual helper functions
    • Position-aware UUID assignment with union-find
    • Edge property determination
  2. Integration Tests (tests/test_edge_direction_integration.py)

    • Multi-reaction pathways
    • End-to-end data flow
  3. Semantic Tests (tests/test_transformation_semantics.py)

    • Cartesian product correctness
    • Edge direction validation
    • Transformation logic
  4. Invariant Tests (tests/test_network_invariants.py)

    • No self-loops
    • Root inputs only as sources
    • Terminal outputs only as targets
    • AND/OR logic consistency
  5. Logic Tests (tests/test_and_or_logic.py)

    • Multiple sources → OR
    • Single source → AND
    • User requirement validation
  6. Validation Tests (tests/test_input_validation.py)

    • Empty DataFrame handling
    • Missing column detection
    • Error message clarity

Test Coverage

  • 73+ tests total (100% passing for core unit tests)
  • Covers position-aware UUIDs, core functionality, edge semantics, network properties, and comprehensive validation
  • Run tests with: poetry run pytest tests/ -v

Design Decisions

Why Virtual Reactions?

  • Problem: A biological reaction may have multiple input/output combinations after decomposition
  • Solution: Create multiple "virtual reactions" representing each combination
  • Benefit: Clean mapping from combinations to transformations

Why Cartesian Product for Edges?

  • Problem: How to represent transformation within a reaction with multiple inputs/outputs?
  • Solution: Every input connects to every output (cartesian product)
  • Rationale: Biochemically accurate - all reactants contribute to all products

Why Implicit Reaction Connections?

  • Problem: How do reactions connect in the network?
  • Solution: Through shared physical entities (molecule appears as target in R1, source in R2)
  • Benefit: Natural representation - pathways flow through molecules, not abstract connections

Why AND/OR Based on Preceding Count?

  • User Requirement: Multiple sources should be OR, inputs to reactions should be AND
  • Implementation: Count preceding reactions - if >1 then OR, otherwise AND
  • Rationale: Matches biological intuition (alternatives vs requirements)

Performance Considerations

Caching

  • Files are cached: reaction_connections_{id}.csv, decomposed_uid_mapping_{id}.csv, best_matches_{id}.csv
  • Subsequent runs reuse cached data
  • Position-aware UUIDs tracked in entity_uuid_registry (regenerated each run for consistency)
  • UUID→dbId mappings exported to uuid_to_reactome_{id}.csv
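
The reuse-if-present pattern behind these cache files can be sketched as follows. The helper name, the column names, and the use of the standard csv module (rather than pandas) are assumptions made to keep the example self-contained.

```python
import csv
import tempfile
from pathlib import Path


def load_or_compute(cache_path, compute_rows, fieldnames):
    """Return cached rows if the CSV exists, otherwise compute and cache them."""
    cache_path = Path(cache_path)
    if cache_path.exists():
        with cache_path.open(newline="") as handle:
            return list(csv.DictReader(handle))
    rows = compute_rows()
    with cache_path.open("w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
    return rows


calls = []


def fake_query():
    """Stand-in for an expensive Neo4j query (hypothetical columns)."""
    calls.append(1)
    return [{"preceding": "R1", "following": "R2"}]


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "reaction_connections_69620.csv"
    first = load_or_compute(path, fake_query, ["preceding", "following"])
    second = load_or_compute(path, fake_query, ["preceding", "following"])

print(len(calls))  # the query function ran only on the first call
```

Note that rows read back from CSV are strings, so a real implementation would need to restore column types after loading.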

Scalability

  • Decomposition uses itertools.product (efficient for combinatorics)
  • Hungarian algorithm is O(n³) but pathways are typically small (<1000 reactions)
  • Pandas operations are vectorized where possible

Typical Performance

  • Small pathway (10-20 reactions): <1 second
  • Medium pathway (100-200 reactions): 1-5 seconds
  • Large pathway (500+ reactions): 5-30 seconds

Additional Documentation

  • Main README: ../README.md - Quick start guide and features
  • Position-Aware UUIDs: UUID_DESIGN.md - Why and how UUIDs are assigned per pathway position
  • Design Decisions: DESIGN_DECISIONS.md - Intentional behaviors that look surprising
  • Examples: ../examples/README.md - Usage patterns and troubleshooting
  • Reactome Database: https://reactome.org/