About Multimodal Feature Injection in 3D Attention

Thank you for your great work. Have you explored any alternative way of injecting multimodal features, other than injecting them directly into the 3D attention followed by an FFN, as shown in the image below?

<img width="383" height="224" alt="Image" src="https://github.com/user-attachments/assets/c5349b7b-250f-4074-b912-d834011a2b5a" />

Specifically, have you tried an approach similar to "[Diffusion as a Shader](https://arxiv.org/abs/2501.03847)", where the multimodal features are added to a DiT block through a zero-initialized linear layer? I would like to know which of the two approaches yields better results.

<img width="155" height="198" alt="Image" src="https://github.com/user-attachments/assets/a8686590-5c5a-475b-9d9c-fdcc006895b2" />
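
To make the question concrete, here is a minimal PyTorch sketch of what I mean by the zero-linear variant. It is only my reading of the idea, not the code from either repository: `ZeroLinearInjection` and `ToyDiTBlock` are hypothetical names, the block is a plain attention + FFN stand-in, and I assume the condition features have already been aligned to the same number of tokens as the hidden states.

```python
import torch
import torch.nn as nn


class ZeroLinearInjection(nn.Module):
    """Adds condition features to hidden states via a zero-initialized linear layer.

    Hypothetical module for illustration only.
    """

    def __init__(self, cond_dim: int, hidden_dim: int):
        super().__init__()
        self.proj = nn.Linear(cond_dim, hidden_dim)
        # Zero init: at the start of training the injection is a no-op,
        # so the pretrained block sees exactly its original input.
        nn.init.zeros_(self.proj.weight)
        nn.init.zeros_(self.proj.bias)

    def forward(self, hidden: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        # hidden: (batch, tokens, hidden_dim), cond: (batch, tokens, cond_dim)
        return hidden + self.proj(cond)


class ToyDiTBlock(nn.Module):
    """Simplified stand-in for a DiT block: self-attention followed by an FFN."""

    def __init__(self, hidden_dim: int, num_heads: int = 4):
        super().__init__()
        self.norm1 = nn.LayerNorm(hidden_dim)
        self.attn = nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(hidden_dim)
        self.ffn = nn.Sequential(
            nn.Linear(hidden_dim, 4 * hidden_dim),
            nn.GELU(),
            nn.Linear(4 * hidden_dim, hidden_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        return x + self.ffn(self.norm2(x))


torch.manual_seed(0)
block = ToyDiTBlock(hidden_dim=64)
inject = ZeroLinearInjection(cond_dim=32, hidden_dim=64)

x = torch.randn(2, 16, 64)     # video tokens (toy sizes)
cond = torch.randn(2, 16, 32)  # multimodal features, assumed token-aligned

out_plain = block(x)
out_injected = block(inject(x, cond))

# The zero-initialized projection still receives gradients, so it can learn.
out_injected.sum().backward()
weight_grad_norm = inject.proj.weight.grad.abs().sum().item()
```

With this setup the injected path matches the unconditioned block exactly at initialization, which is the property that makes this kind of injection attractive when fine-tuning a pretrained backbone. The direct-injection design shown in the first image would instead feed the condition tokens into the attention itself.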
