Thank you for your great work. Have you explored other ways of injecting the multimodal features, besides injecting them directly into the 3D attention followed by an FFN, as shown in the provided image?
Specifically, have you tried an approach similar to "Diffusion as a Shader", where the multimodal features are added to a DiT block through a zero-initialized linear layer? I'd be interested to know which of the two approaches yields better results.
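
To make the second option concrete, here is a minimal sketch of what I mean by injection through a zero-initialized linear layer. It is only my reading of the idea, written in plain NumPy so it runs without a deep learning framework. It is not the actual code of "Diffusion as a Shader" or of this repo, and the names `ZeroLinear` and `inject_condition` and all tensor shapes are hypothetical.

```python
import numpy as np


class ZeroLinear:
    """Linear projection whose weight and bias both start at zero (hypothetical helper)."""

    def __init__(self, in_dim, out_dim):
        self.weight = np.zeros((in_dim, out_dim))
        self.bias = np.zeros(out_dim)

    def __call__(self, x):
        # x: (batch, tokens, in_dim) -> (batch, tokens, out_dim)
        return x @ self.weight + self.bias


def inject_condition(hidden, cond, zero_proj):
    """Add the projected condition features to the block's hidden states as a residual."""
    return hidden + zero_proj(cond)


rng = np.random.default_rng(0)
hidden = rng.standard_normal((2, 16, 64))  # assumed shape: (batch, tokens, hidden_dim)
cond = rng.standard_normal((2, 16, 32))    # assumed shape: (batch, tokens, cond_dim)

zero_proj = ZeroLinear(in_dim=32, out_dim=64)
out = inject_condition(hidden, cond, zero_proj)
```

Because the projection starts at zero, `out` equals `hidden` before any training, so the pretrained block is undisturbed at first and the condition only takes effect as the projection weights are learned. That is the property I would like to see compared against direct injection into the attention.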
