
Questions about FlexiCodec eval & local attention #4

@jee019

Description


Hi,

I have two questions regarding the official FlexiCodec checkpoint and the merging transformer implementation.

1. Gap between paper results and official checkpoint evaluation

I first evaluated the official DualCodec checkpoint, and the results were either almost identical to the paper or slightly worse, which seems reasonable.

However, when I evaluated the official FlexiCodec checkpoint with the same evaluation code, the gap relative to the paper's results was noticeably larger. In particular, at threshold = 0.867 I obtained WER (RVQ1) = 4.43, versus 4.15 reported in the paper.

For evaluation, I used the LibriSpeech test-clean subset filtered to 4–10 seconds (1,237 utterances).
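To make the filtering step concrete, here is a toy sketch of the 4–10 second duration filter I applied; the utterance IDs and durations below are placeholders, not the actual test-clean metadata:

```python
# Hypothetical sketch of the 4-10 s duration filter applied to
# LibriSpeech test-clean. The metadata entries are illustrative only.

def filter_by_duration(utterances, min_s=4.0, max_s=10.0):
    """Keep only (utt_id, duration) pairs within [min_s, max_s] seconds."""
    return [u for u in utterances if min_s <= u[1] <= max_s]

# Toy metadata: (utterance_id, duration_seconds)
meta = [
    ("spk-chap-0000", 10.4),  # too long, dropped
    ("spk-chap-0001", 5.9),   # kept
    ("spk-chap-0002", 3.2),   # too short, dropped
]
kept = filter_by_duration(meta)
print(len(kept))  # -> 1
```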

My measured results are as follows:

| Model | Threshold | WER (RVQ1) ↓ | WER (RVQ1:8) ↓ | PESQ ↑ | MCD ↓ | SIM ↑ | UTMOS ↑ | Avg Frame Rate (Hz) |
|---|---|---|---|---|---|---|---|---|
| FlexiCodec (Paper) | 1.000 | 2.76 | 2.23 | 3.35 | 2.76 | 0.85 | 4.22 | 12.5 |
| | 0.910 | 2.98 | 2.28 | 3.03 | 3.10 | 0.78 | 4.21 | 8.3 |
| | 0.867 | 4.15 | 2.53 | 2.76 | 3.42 | 0.71 | 4.18 | 6.25 |
| FlexiCodec (Official ckpt) | 1.000 | 2.89 | 2.27 | 3.42 | 2.97 | 0.86 | 4.15 | - |
| | 0.910 | 3.04 | 2.34 | 3.13 | 3.27 | 0.80 | 4.16 | 8.39 |
| | 0.867 | 4.43 | 2.63 | 2.88 | 3.52 | 0.73 | 4.13 | 6.36 |

Could you help clarify why this gap occurs?
Is there any additional inference configuration or evaluation setup that should be used for the official FlexiCodec checkpoint?

2. Question about local attention in the merging transformer

In the FlexiCodec paper, the merging module is described as using a Transformer with local attention, and it says:

“Each token in the local attention transformer can attend to ℓ_k = 8 tokens left and right.”

I initially thought this was the reason the inference config uses:

`transformer_context_frames: 16`

However, when I debugged the code, the Transformer has `self.causal = False`, so the local attention did not appear to behave the way I expected.

Also, in the training config, I saw:

`transformer_context_frames: 38  # 3-second context`

So I wanted to ask whether I may have misunderstood how the local attention is implemented.

Could you clarify:

  • how the local attention is actually applied in the merging transformer
  • why the training config uses transformer_context_frames: 38
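For reference, this is my current reading of the paper's description (each token attends to ℓ_k = 8 tokens left and right, i.e. a bidirectional sliding window), sketched as an attention mask. This reflects my understanding, not the repo's actual implementation, and the window sizes are the paper's values:

```python
import numpy as np

def local_attention_mask(seq_len, left=8, right=8):
    """Boolean mask where mask[i, j] == True means query i may attend to key j.
    Bidirectional (non-causal) local attention: each token sees up to `left`
    tokens before and `right` tokens after itself."""
    idx = np.arange(seq_len)
    rel = idx[None, :] - idx[:, None]   # rel[i, j] = j - i
    return (rel >= -left) & (rel <= right)

mask = local_attention_mask(seq_len=32, left=8, right=8)
# A token in the middle sees 8 + 1 + 8 = 17 positions in total.
print(int(mask[16].sum()))  # -> 17
```

Under this reading, the total visible span per token is 17 frames (8 left + self + 8 right), which is close to, but not the same as, a `transformer_context_frames: 16` chunk; clarifying which of the two the config controls would answer my question.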

Thank you very much.
