Controllable text-guided gesture generation

Hi, thanks for your great work. I have read your paper and have a question about this part. For text-conditioned generation, is the condition c1 only d rather than (d, a)? I ask because the text-conditioned data has only a text condition and no audio condition.
![Screenshot 2024-07-15 at 15 22 42](https://github.com/user-attachments/assets/0fb74964-b03f-4d51-aba5-1f6bb89a9f0e)