Hi,
I have two questions regarding the official FlexiCodec checkpoint and the merging transformer implementation.
1. Gap between paper results and official checkpoint evaluation
I first evaluated the official DualCodec checkpoint, and the results were either almost identical to the paper or slightly worse, which seems reasonable.
However, when I evaluated the official FlexiCodec checkpoint using the same evaluation code, the gap from the paper results seemed noticeably larger. In particular, at threshold = 0.867, I obtained WER(RVQ1) = 4.43.
For evaluation, I used the LibriSpeech test-clean subset filtered to 4–10 seconds (1,237 utterances).
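For reference, this is roughly how I built that subset (a minimal sketch of my filtering step; the `collect_4_to_10s` helper and the dataset path are just illustrative, not part of the FlexiCodec repo):

```python
# Rough sketch of how I filtered LibriSpeech test-clean to 4-10 s utterances.
# Helper name and path are illustrative only.
from pathlib import Path
import torchaudio

def collect_4_to_10s(root: str = "LibriSpeech/test-clean"):
    keep = []
    for flac in sorted(Path(root).rglob("*.flac")):
        info = torchaudio.info(str(flac))
        duration = info.num_frames / info.sample_rate
        if 4.0 <= duration <= 10.0:
            keep.append(flac)
    return keep

utterances = collect_4_to_10s()
print(len(utterances))  # 1,237 in my run
```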
My measured results are as follows:
| Model | Threshold | WER (RVQ1) ↓ | WER (RVQ1:8) ↓ | PESQ ↑ | MCD ↓ | SIM ↑ | UTMOS ↑ | Avg Frame Rate (Hz) |
|---|---|---|---|---|---|---|---|---|
| FlexiCodec (Paper) | 1.000 | 2.76 | 2.23 | 3.35 | 2.76 | 0.85 | 4.22 | 12.5 |
| | 0.910 | 2.98 | 2.28 | 3.03 | 3.10 | 0.78 | 4.21 | 8.3 |
| | 0.867 | 4.15 | 2.53 | 2.76 | 3.42 | 0.71 | 4.18 | 6.25 |
| FlexiCodec (Official ckpt) | 1.000 | 2.89 | 2.27 | 3.42 | 2.97 | 0.86 | 4.15 | - |
| | 0.910 | 3.04 | 2.34 | 3.13 | 3.27 | 0.80 | 4.16 | 8.39 |
| | 0.867 | 4.43 | 2.63 | 2.88 | 3.52 | 0.73 | 4.13 | 6.36 |
Could you help clarify why this gap occurs?
Is there any additional inference configuration or evaluation setup that should be used for the official FlexiCodec checkpoint?
2. Question about local attention in the merging transformer
In the FlexiCodec paper, the merging module is described as using a Transformer with local attention, and it says:
“Each token in the local attention transformer can attend to ℓ_k = 8 tokens left and right.”
I initially thought this was the reason why the inference config uses:
```yaml
transformer_context_frames: 16
```
However, when I debugged the code, it seemed that the Transformer has `self.causal = False`, so the local attention did not appear to behave the way I expected.
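For context, this is the kind of symmetric windowed mask I expected from ℓ_k = 8, i.e. 8 tokens on each side of every position (a minimal sketch I wrote for myself, not code from the FlexiCodec repo):

```python
# Minimal sketch of the banded attention mask I expected:
# each token attends to itself plus 8 tokens on the left and 8 on the right.
# This is my own illustration, not the FlexiCodec implementation.
import torch

def local_attention_mask(seq_len: int, window: int = 8) -> torch.Tensor:
    """True = attention allowed, False = masked out."""
    idx = torch.arange(seq_len)
    dist = (idx[None, :] - idx[:, None]).abs()
    return dist <= window

mask = local_attention_mask(seq_len=32, window=8)
# In a standard attention layer this would be applied as
# scores.masked_fill_(~mask, float("-inf")) before the softmax.
```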
Also, in the training config, I saw:
```yaml
transformer_context_frames: 38 # 3-second context
```
So I wanted to ask whether I may have misunderstood how the local attention is implemented.
Could you clarify:
- how the local attention is actually applied in the merging transformer
- why the training config uses `transformer_context_frames: 38`
Thank you very much.