Hi, thanks for your great work. I have read your paper and have a question about this part. For text-conditioned generation, is the condition c1 only d rather than (d,a)? I ask because for text-conditioned data we only have the text condition and no audio condition.
