From 1611e6d01519708f8b603f25dddd8974f1a5ff5c Mon Sep 17 00:00:00 2001
From: HaD0Yun
Date: Sun, 22 Mar 2026 19:25:46 +0900
Subject: [PATCH] Capture the paper clarifications from issue #13

Adds a small paper-clarification note based on the maintainer answers already given in issue #13.

Constraint: Keep this branch scoped to issue #13 only
Rejected: Folding this into a multi-issue combined PR | the requested delivery shape is one PR per issue
Confidence: high
Scope-risk: narrow
Directive: Keep this note fact-limited to what the public repository and issue thread currently support
Tested: git diff --check --cached
Not-tested: Runtime feature execution for the requested capability
---
 docs/issues/issue-13-paper-questions.md | 23 +++++++++++++++++++++++
 1 file changed, 23 insertions(+)
 create mode 100644 docs/issues/issue-13-paper-questions.md

diff --git a/docs/issues/issue-13-paper-questions.md b/docs/issues/issue-13-paper-questions.md
new file mode 100644
index 0000000..b070922
--- /dev/null
+++ b/docs/issues/issue-13-paper-questions.md
@@ -0,0 +1,23 @@
+# Issue #13 — Paper clarification note
+
+## Summary
+This note turns the maintainer answers from issue #13 into a short reference for readers of the paper.
+
+## Clarifications already provided in the issue thread
+
+### 1) Lookahead shift `K`
+The maintainer response states that acoustic features are shifted by **5 positions**. In practice, that means TADA uses **5 text-token lookahead** during TTS generation.
+
+### 2) Evaluation setup for Tables 5 and 6
+The maintainer response states that the evaluation is the **voice cloning setup from Table 2**, using **PPL as in Table 4**.
+
+### 3) Cross-entropy / KD losses for the TTS setting
+The maintainer response states that removing the CE and KD losses did **not significantly improve TTS performance**. The ablations in Table 6 were run at a smaller scale before the final main model, so the team does not currently report a separate "base" number for the fully removed-loss setting.
+
+## Follow-up questions that remain open in the thread
+The issue still contains follow-up questions that are not yet answered in the repository docs:
+- how the text/audio pair construction handles the final lookahead positions during training
+- whether text prediction remains active under the hood during inference in the TTS path
+
+## Why this note exists
+The GitHub issue already contains useful maintainer answers, but they are easy to miss if a reader only consults the repository files.