Hi,
I have two questions regarding the official FlexiCodec checkpoint and the merging transformer implementation.
1. Gap between paper results and official checkpoint evaluation
I first evaluated the official DualCodec checkpoint, and the results were either almost identical to the paper or slightly worse, which seems reasonable.
However, when I evaluated the official FlexiCodec checkpoint using the same evaluation code, the gap from the paper results seemed noticeably larger. In particular, at threshold = 0.867, I obtained WER(RVQ1) = 4.43.
For evaluation, I used the LibriSpeech test-clean subset filtered to 4–10 seconds (1,237 utterances).
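For reference, this is roughly how I built that subset (a minimal sketch of my filtering step; the `collect_4_to_10s` helper and the dataset path are just illustrative, not part of the FlexiCodec repo):

```python
# Rough sketch of how I filtered LibriSpeech test-clean to 4-10 s utterances.
# Helper name and path are illustrative only.
from pathlib import Path
import torchaudio

def collect_4_to_10s(root: str = "LibriSpeech/test-clean"):
    keep = []
    for flac in sorted(Path(root).rglob("*.flac")):
        info = torchaudio.info(str(flac))
        duration = info.num_frames / info.sample_rate
        if 4.0 <= duration <= 10.0:
            keep.append(flac)
    return keep

utterances = collect_4_to_10s()
print(len(utterances))  # 1,237 in my run
```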
My measured results are as follows:
| Model | Threshold | WER (RVQ1) ↓ | WER (RVQ1:8) ↓ | PESQ ↑ | MCD ↓ | SIM ↑ | UTMOS ↑ | Avg Frame Rate (Hz) |
|---|---|---|---|---|---|---|---|---|
| FlexiCodec (Paper) | 1.000 | 2.76 | 2.23 | 3.35 | 2.76 | 0.85 | 4.22 | 12.5 |
| | 0.910 | 2.98 | 2.28 | 3.03 | 3.10 | 0.78 | 4.21 | 8.3 |
| | 0.867 | 4.15 | 2.53 | 2.76 | 3.42 | 0.71 | 4.18 | 6.25 |
| FlexiCodec (Official ckpt) | 1.000 | 2.89 | 2.27 | 3.42 | 2.97 | 0.86 | 4.15 | - |
| | 0.910 | 3.04 | 2.34 | 3.13 | 3.27 | 0.80 | 4.16 | 8.39 |
| | 0.867 | 4.43 | 2.63 | 2.88 | 3.52 | 0.73 | 4.13 | 6.36 |
Could you help clarify why this gap occurs?
Is there any additional inference configuration or evaluation setup that should be used for the official FlexiCodec checkpoint?
2. Question about local attention in the merging transformer
In the FlexiCodec paper, the merging module is described as using a Transformer with local attention, and it says:
“Each token in the local attention transformer can attend to ℓ_k = 8 tokens left and right.”
I initially thought this was the reason why the inference config uses:
```yaml
transformer_context_frames: 16
```
However, when I debugged the code, it seemed that the Transformer has `self.causal = False`, so the local attention did not appear to behave the way I expected.
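For context, this is the kind of symmetric windowed mask I expected from ℓ_k = 8, i.e. 8 tokens on each side of every position (a minimal sketch I wrote for myself, not code from the FlexiCodec repo):

```python
# Minimal sketch of the banded attention mask I expected:
# each token attends to itself plus 8 tokens on the left and 8 on the right.
# This is my own illustration, not the FlexiCodec implementation.
import torch

def local_attention_mask(seq_len: int, window: int = 8) -> torch.Tensor:
    """True = attention allowed, False = masked out."""
    idx = torch.arange(seq_len)
    dist = (idx[None, :] - idx[:, None]).abs()
    return dist <= window

mask = local_attention_mask(seq_len=32, window=8)
# In a standard attention layer this would be applied as
# scores.masked_fill_(~mask, float("-inf")) before the softmax.
```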
Also, in the training config, I saw:
```yaml
transformer_context_frames: 38 # 3-second context
```
So I wanted to ask whether I may have misunderstood how the local attention is implemented.
Could you clarify:
- how the local attention is actually applied in the merging transformer
- why the training config uses `transformer_context_frames: 38`
Thank you very much.