DRCFNet Model Architecture Overview

This document provides a detailed breakdown of the Disentangled Representation and Controlled Fusion Network (DRCFNet) architecture as implemented in src/models/drcfnet.py.

Overall Flow

DRCFNet processes trimodal input (Vision, Audio, Text) to generate a final prediction. The architecture follows a multi-step process designed to extract modality-specific features, learn shared cross-modal representations, and fuse them effectively.

1. Multi-Scale Temporal Convolutional Projection (MS-TCN)

Input: Raw multimodal sequences (Vision, Audio, Text).
Function: Projects the sequences into a common embedding dimension d using 1D convolutions of varying kernel sizes (1, 3, 5) and a skip connection.
Necessity: Captures multi-scale local temporal patterns before global sequence modeling.

2. Temporal Transformer

Input: Projected sequences + Positional Encoding.
Function: Applies self-attention via standard Transformer Encoder layers independently for each modality.
Necessity: Captures long-range temporal dependencies within each individual modality.

3. Gated MSR/SSR Disentanglement Split

Input: Modality representations from the Temporal Transformer.
Function: Splits each representation into two parts using a learnable gating mechanism (sigmoid):
- MSR (Modality-Specific Representation): Captures unique information private to that specific modality.
- SSR (Shared-Specific Representation): Captures information that is common or correlated across modalities.
Necessity: Disentangling representations reduces redundancy and allows downstream fusion mechanisms to focus on either specific interactions or complementary information.

4. Cross-modal Representation Encoder (CRE)

Input: SSR representations from Text, Audio, and Vision.
Function: Employs Multi-Head Cross-Attention to model interactions between modalities. Specifically, it computes:
- Zta: Text-Audio shared representation (Query: Text SSR, Key/Value: Audio SSR).
- Ztv: Text-Vision shared representation (Query: Text SSR, Key/Value: Vision SSR).
Necessity: Explicitly models the pairwise synergies between modalities to extract rich, cross-modal semantic information.

5. Multi-Head Graph Neural Fusion (GNN)

Input: A list of 5 node representations: [MSR_t, MSR_a, MSR_v, Zta, Ztv].
Function: Treats the 5 representations as nodes in a graph and applies Multi-Head Graph Attention over them. This allows information to flow and interact across all specific and shared nodes globally.
Necessity: Enables complex, high-level structural interactions between the disentangled specific and shared representations.

6. Gated Controlled Fusion (GCF)

Input: The 5 refined node representations from the Graph Fusion step.
Function: Applies an advanced gating mechanism to each node independently. It computes a confidence gate (sigmoid) and feature-wise attention weights (softmax), fusing the gated input with the attention-weighted input.
Necessity: Filters out noisy or irrelevant features from each node before the final aggregation, ensuring only the most informative signals contribute to the prediction.

7. Final Prediction

Input: Concatenated outputs from the 5 GCF modules.
Function: Applies an attention-based pooling mechanism across the sequence length, followed by a Multi-Layer Perceptron (FC layers).
Output: Final scalar prediction (e.g., sentiment score).

Analysis Against Provided Flowcharts

Based on an analysis of the provided architecture flowcharts against the implementation in src/models/drcfnet.py, there are significant discrepancies to note:

CRE Implementation (Flowchart 1): The flowchart WhatsApp Image 2026-05-04 at 1.53.17 PM.jpeg correctly depicts the CRE cross-attention mechanism used for generating Zta. This perfectly matches the implementation in DRCFNet.
GCF Mechanism (Flowchart 2): The flowchart WhatsApp Image 2026-05-04 at 1.53.18 PM (1).jpeg depicts a complex, bimodal fusion mechanism involving a Retain Gate (gr), Compound Gate (gc), and dual confidences operating on two inputs (Z and m). However, the implemented GCFModule in the code is a unary operation applied independently to single nodes. It uses a single confidence gate and feature-wise attention. The implemented GCF does not match this flowchart.
Overall Architecture (Flowchart 3): The flowchart WhatsApp Image 2026-05-04 at 1.53.18 PM.jpeg completely omits the Multi-Head Graph Fusion (GNN) step. In the code, the 5 representations (MSR_t, MSR_a, MSR_v, Zta, Ztv) are fed into a Graph Neural Network (self.gnn) before passing to the GCF modules. The flowchart shows them skipping the GNN and going straight to GCF/Concatenation. The flowchart is missing a critical GNN component present in the code.

Model Weights Status

A thorough search of the repository reveals that there are no saved model weight files (.pt or .pth) present in the current branch. The best model weights have not been committed to this repository.

Name		Name	Last commit message	Last commit date
Latest commit History 15 Commits
Paper		Paper
docs/Report		docs/Report
outputs		outputs
src		src
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
final_walkthrough.md		final_walkthrough.md
generate_pair_weights.py		generate_pair_weights.py
requirements.txt		requirements.txt
test_drcfnet.py		test_drcfnet.py
test_loss.py		test_loss.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

DRCFNet Model Architecture Overview

Overall Flow

1. Multi-Scale Temporal Convolutional Projection (MS-TCN)

2. Temporal Transformer

3. Gated MSR/SSR Disentanglement Split

4. Cross-modal Representation Encoder (CRE)

5. Multi-Head Graph Neural Fusion (GNN)

6. Gated Controlled Fusion (GCF)

7. Final Prediction

Analysis Against Provided Flowcharts

Model Weights Status

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

DRCFNet Model Architecture Overview

Overall Flow

1. Multi-Scale Temporal Convolutional Projection (MS-TCN)

2. Temporal Transformer

3. Gated MSR/SSR Disentanglement Split

4. Cross-modal Representation Encoder (CRE)

5. Multi-Head Graph Neural Fusion (GNN)

6. Gated Controlled Fusion (GCF)

7. Final Prediction

Analysis Against Provided Flowcharts

Model Weights Status

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages