Commit d7d9505

committed
Note updates
1 parent 9230bae commit d7d9505

6 files changed

Lines changed: 76 additions & 253 deletions

File renamed without changes.

notes/courses/LING-UA-1/index.json

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -1,7 +1,7 @@
11
[
22
{
33
"slug": "01-intro",
4-
"title": "01 - What is Language",
4+
"title": "01 - What is Language",
55
"date": "2025-05-19"
66
}
77
,
Lines changed: 49 additions & 25 deletions
Original file line numberDiff line numberDiff line change
@@ -1,5 +1,5 @@
11
---
2-
title: Loss-free Balancing in MoEs
2+
title: Auxiliary-loss Load Balancing in MoEs (1)
33
date: 2025-07-06
44
tags: [cs, finance, ai]
55
author: R
@@ -12,18 +12,33 @@ I'm currently reading [this conference paper for ICLR 2025](https://arxiv.org/pd
1212
After opening the paper I encountered the concept of MoEs. To get more familiar with it, I read [this blog on HuggingFace](https://huggingface.co/blog/moe) (Sanseviero et al., 2023), which was really helpful and highly recommended. MoE stands for **M**ixture **o**f **E**xperts; a famous example of its type is DeepSeek. It has many advantages, as the authors wrote: it is easy to scale to a large number of parameters, the costs are manageable, etc.
1313

1414
*Let $u_t$ denote the input of the $t$-th token to an $N$-expert MoE layer, the output $h_t$ is computed as follows:*
15+
Let
16+
- $N$ be the number of experts,
17+
- $K$ the number of experts selected per token,
18+
- $T$ the total number of tokens in the batch,
19+
- $\mathbf{u}_t\in\R^d$ the input for token $t$,
20+
- $\mathrm{FFN}_i: \R^d\to\R^d$ the $i$-th expert network,
21+
- $e_i\in\R^d$ the centroid (parameter) of expert $i$, and
22+
- $G\colon\R\to\R_{>0}$ a positive gating function (e.g. $\exp$, $\mathrm{sigmoid}$, or $\mathrm{softmax}$).
23+
24+
Compute for each token $t$ and expert $i$:
1525
$$
1626
\begin{align*}
17-
\textbf{h}_t &= \textbf{u}_t + \sum^N_{i=1} g_{i,t} \text{FFN}_i (\textbf{u}_t) \\
1827
g_{i,t} &=
1928
\begin{cases}
20-
s_{i,t}, & s_{i,t} \in \text{Topk}(\{s_{j,t} | 1 \leq j \leq N\} , K) \\
29+
s_{i,t}, & s_{i,t} \in \text{Topk}(\{s_{j,t} | 1 \leq j \leq N\} , K)\\
30+
& (\text{if $s_{i,t}$ is among the top-$K$ scores})
31+
\\
2132
0, & \text{otherwise}
2233
\end{cases}\\
2334
s_{i,t} &= G(\textbf{u}_t^\top e_i)
2435
\end{align*}
2536
$$
26-
*where $G$ is a nonlinear gating function and $e_i$ is the centroid of the $i$-th expert.*
37+
38+
and form the layer output
39+
$$
40+
\textbf{h}_t = \textbf{u}_t + \sum^N_{i=1} g_{i,t} \text{FFN}_i (\textbf{u}_t)
41+
$$
2742

2843
So, here $G$ could be anything that maps $\R \to \R_{>0}$; some conventional choices are $\exp$, softmax, or sigmoid (TBH I had to look these up to see what they actually are). In this paper they use the latter 2.
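
To make the formulas above concrete, here is a minimal pure-Python sketch of the top-$K$ gating and the layer output. The helper names (`gate_scores`, `topk_gates`, `moe_layer`) and the toy "experts" that just scale their input are my own assumptions for illustration, not anything from the paper, and $G$ is taken to be a sigmoid.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gate_scores(u, centroids, G=sigmoid):
    """s_i = G(u . e_i) for every expert i."""
    return [G(sum(a * b for a, b in zip(u, e))) for e in centroids]

def topk_gates(scores, k):
    """g_i = s_i if s_i is among the k largest scores, else 0."""
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return [s if i in top else 0.0 for i, s in enumerate(scores)]

def moe_layer(u, centroids, experts, k):
    """h = u + sum_i g_i * FFN_i(u); only the selected experts are evaluated."""
    g = topk_gates(gate_scores(u, centroids), k)
    h = list(u)
    for g_i, ffn in zip(g, experts):
        if g_i > 0.0:
            out = ffn(u)
            h = [h_j + g_i * o_j for h_j, o_j in zip(h, out)]
    return h

# Toy setup (hypothetical): 3 experts on 2-d inputs, each "expert" scales its input.
centroids = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
experts = [lambda x, c=c: [c * v for v in x] for c in (1.0, 2.0, 3.0)]
u = [2.0, 1.0]
gates = topk_gates(gate_scores(u, centroids), k=2)
h = moe_layer(u, centroids, experts, k=2)
```

With $K=2$ the third expert has the lowest score for this token, so its gate is zero and its network is never called, which is where the sparsity (and the cost saving) comes from.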
2944

@@ -37,33 +52,42 @@ I was wondering how it could cause a computational bottleneck, but then I realiz
3752
Plus, the training loop would need a substantial redesign to use the idle computational power to catch up. Even if I create replicas of the "hot" experts on more hosting devices, they need to stay in sync, which creates a lot of cost by itself. Merging gradients across replicas requires collective operations every step, and at that point it just recreates the original problem I was trying to overcome if one of them slows down...
3853

3954
#### Solution: Auxiliary-loss
40-
To address this issue, there is auxiliary-loss encourage balanced load thus avoids inbalanced routing in training MoEs. To do this, it penalized the use of only a few number of agents. Defined as such:
55+
To address this issue, an auxiliary loss is added to encourage a balanced load and thus avoid imbalanced routing when training MoEs. To do this, it penalizes routing that relies on only a small number of experts. It mostly acts through the gating function. It is defined as follows:
56+
4157

42-
$$
43-
\begin{align*}
44-
\mathcal{L}_\text{Balance} &= \alpha \sum^N_{i=1}f_i P_i, \\
45-
f_i &= \frac{N}{KT} \sum^T_{t=1} \mathbb{1} \text{ (Token t selects Expert i)}, \\
46-
P_i &= \frac{1}{T} \sum^T_{t=1} s_{i,t}
47-
\end{align*}
48-
$$
4958

50-
Basically, divide by the variance.
59+
- **Normalized load**
60+
$$
61+
f_i = \frac{N}{KT} \sum_{t=1}^T \mathbb{1}\bigl(s_{i,t} \in \text{Topk}(\{s_{j,t} | 1 \leq j \leq N\}, K)\bigr)
62+
\quad (\text{fraction of tokens routed to expert }i\text{, scaled by }N/K\text{ so that uniform routing gives }f_i = 1).
63+
$$
64+
- **Average gating weight**
65+
$$
66+
P_i = \frac{1}{T}\sum_{t=1}^T s_{i,t}
67+
\quad (\text{mean score assigned by the gate to expert }i ).
68+
$$
69+
70+
##### Balance loss
5171

52-
#### Solution: EC (Expert Choice)
53-
The authors of the paper have indicated that this approach *break the causal constraint* (causing a leakage in future information which *destroys the generalization of a model and prevents reliable evaluation*). They have even proved their hypothesis (*that the loss drop originates from the model’s accessing and exploiting future token information*) through experiment:
54-
Theoretically
72+
Combine these into a single penalty term:
73+
$$
74+
\mathcal{L}_{\mathrm{balance}}
75+
=
76+
\alpha \sum_{i=1}^N f_i P_i.
77+
$$
5578

56-
#### Solution: Loss-Free Balancing
57-
This approach is what the authors of this paper have come up with. Essentially introducing a bias factor without adding some noisy gradients like the Auxiliary-loss one does.
79+
##### Why this encourages balanced routing
5880

59-
## MoE in Trading
60-
And there's [the 2023 paper](https://personal.ntu.edu.sg/boan/papers/KDD23_Stock.pdf) (Sun et al., 2023) which builds a MoE in order to account for multiple metrics, which differs from mainstream deep learning models being used in the industry.
81+
- $f_i$ captures how heavily expert $i$ is used, while $P_i$ captures its average gate score.
82+
- If an expert is over-selected ($f_i$ large), the product $f_iP_i$ grows, increasing the penalty.
83+
- Gradients then adjust the gating parameters to **decrease** routing to over-used experts and **increase** routing to under-used ones, driving the distribution toward uniformity.
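
As a sanity check on the bullets above, here is a small pure-Python sketch that computes $f_i$, $P_i$ and $\mathcal{L}_{\mathrm{balance}}$ for a toy batch. The function name `balance_loss` and the hand-picked score tables are assumptions of mine; the scores are made up, not real gate outputs.

```python
def balance_loss(scores, k, alpha):
    """scores[t][i] holds s_{i,t}; returns (f, P, loss)."""
    T, N = len(scores), len(scores[0])
    counts = [0] * N
    for row in scores:
        # Indices of the k largest scores for this token.
        top = sorted(range(N), key=lambda i: row[i], reverse=True)[:k]
        for i in top:
            counts[i] += 1
    f = [N / (k * T) * c for c in counts]                       # normalized load
    P = [sum(row[i] for row in scores) / T for i in range(N)]   # mean gate score
    loss = alpha * sum(f_i * P_i for f_i, P_i in zip(f, P))
    return f, P, loss

# 4 tokens, 2 experts, K = 1 (toy numbers).
balanced = [[0.9, 0.1], [0.1, 0.9], [0.9, 0.1], [0.1, 0.9]]
collapsed = [[0.9, 0.1]] * 4  # every token picks expert 0

f_bal, P_bal, loss_bal = balance_loss(balanced, k=1, alpha=1.0)
f_col, P_col, loss_col = balance_loss(collapsed, k=1, alpha=1.0)
```

In the balanced batch both experts get $f_i = 1$, while in the collapsed batch expert 0 gets $f_0 = 2$ and a high $P_0$, so the loss is larger. Note that the top-$K$ count in $f_i$ is a step function, so in an autodiff framework the gradient would reach the gate only through $P_i$.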
6184

62-
## For me
63-
I guess my main task is that to apply Loss-Free Balancing for an MoE model.
85+
##### Why this balances load
6486

65-
Because just looking at the time of the publication and practices around I'd say the majority of the industry is still using deep learning models that are dependent on a single indicator.
87+
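- Minimizing $\mathrm{CV}^2$ (the squared coefficient of variation, i.e. variance divided by squared mean) of the per-expert importance $\{\mathrm{Imp}_i\}$ or load $\{\mathrm{Load}_i\}$ drives their variance toward zero *relative* to their mean.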
88+
- Any expert $i$ with above-average usage raises its own $\mathrm{Imp}_i$ or $\mathrm{Load}_i$, increasing the penalty.
89+
- Backpropagation through the gating parameters encourages **reduced** routing to overloaded experts and **increased** routing to underutilized ones, leading to a more uniform expert selection distribution.
6690

67-
While attempts could be made (and I'm sure there have been such attempts all around) to take in more indicators, it seems like that building a MoE model which overcomes drawbacks in auxiliary-loss implementations could be advantageous.
91+
Basically, take the variance and divide it by the squared mean.
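
For reference, a tiny sketch of the $\mathrm{CV}^2$ quantity itself. The note does not define $\mathrm{Imp}_i$ or $\mathrm{Load}_i$, so the lists below are just hypothetical per-expert totals, and using the population variance is my assumption.

```python
def cv_squared(values):
    """Squared coefficient of variation: variance / mean^2 (population variance assumed)."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return var / (mean ** 2)

uniform = cv_squared([1.0, 1.0, 1.0, 1.0])   # perfectly even usage
skewed = cv_squared([4.0, 0.0, 0.0, 0.0])    # one expert takes everything
```

Even usage gives zero and any imbalance gives a positive value. Because of the division by the squared mean, scaling every entry by the same factor leaves the result unchanged, which is the "relative to their mean" part.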
6892

69-
On second thought, I should've just annotated the paper instead of writing this. Reading sth. as this bright on an airplane with the lights off felt like some type of torture. I couldn't fall asleep in the tiny space, and when I woke up the screen is still on, that made me felt that I have to finish it before doing anything else.
93+
(TBC)

posts/entries/012-ConvEWMA.md

Lines changed: 0 additions & 183 deletions
This file was deleted.

posts/entries/012-MoE-2.md

Lines changed: 26 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,26 @@
1+
---
2+
title: Loss-free Balancing in MoEs (2)
3+
date: 2025-07-06
4+
tags: [cs, finance, ai]
5+
author: R
6+
location: Above around Ust-Ilimsk, Russia while on a plane from New York to Hong Kong
7+
---
8+
9+
(On second thought) I should've just annotated the paper instead of writing this. Reading something this bright on an airplane with the lights off felt like some type of torture. I couldn't fall asleep in the tiny space, and when I woke up the screen was still on, which made me feel that I had to finish it before doing anything else.
10+
11+
#### Solution: EC (Expert Choice)
12+
The authors of the paper point out that this approach can *break the causal constraint* (causing a leakage of future information which *destroys the generalization of a model and prevents reliable evaluation*). They even back up their hypothesis (*that the loss drop originates from the model’s accessing and exploiting future token information*) with experiments:
13+
Theoretically
14+
15+
#### Solution: Loss-Free Balancing
16+
This approach is what the authors of this paper have come up with. Essentially, it introduces a bias term without adding noisy gradients the way the auxiliary-loss approach does.
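
A rough sketch of how I understand the idea, not the authors' code: each expert gets a bias $b_i$ that is added to its score only when picking the top-$K$, and after each batch the bias is nudged up for under-used experts and down for over-used ones. The names `route_with_bias`, `update_bias` and `rate`, the sign-based update, and the toy batch are all assumptions on my part.

```python
def sign(x):
    return (x > 0) - (x < 0)

def route_with_bias(scores, bias, k):
    """Select top-k experts by s_i + b_i; the gate weight itself stays s_i."""
    n = len(scores)
    top = sorted(range(n), key=lambda i: scores[i] + bias[i], reverse=True)[:k]
    return [scores[i] if i in top else 0.0 for i in range(n)]

def update_bias(bias, counts, rate):
    """Raise the bias of under-loaded experts, lower it for over-loaded ones."""
    mean = sum(counts) / len(counts)
    return [b + rate * sign(mean - c) for b, c in zip(bias, counts)]

# Toy batch (hypothetical): 6 identical tokens that all prefer expert 0.
batch = [[0.9, 0.6, 0.5]] * 6
bias = [0.0, 0.0, 0.0]
history = []
for step in range(3):
    counts = [0, 0, 0]
    for s in batch:
        for i, g_i in enumerate(route_with_bias(s, bias, k=1)):
            if g_i > 0.0:
                counts[i] += 1
    history.append(counts)
    bias = update_bias(bias, counts, rate=0.1)
```

No extra loss term appears anywhere, so nothing here adds gradients to the training objective. Because the toy tokens are identical, the routing flips all at once when the bias has shifted far enough; with varied scores I would expect the shift to be more gradual.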
17+
18+
## MoE in Trading
19+
And there's [the 2023 paper](https://personal.ntu.edu.sg/boan/papers/KDD23_Stock.pdf) (Sun et al., 2023), which builds an MoE in order to account for multiple metrics, unlike the mainstream deep learning models being used in the industry.
20+
21+
## For me
22+
I guess my main task is to apply Loss-Free Balancing to an MoE model.
23+
24+
Just looking at the time of the publication and the practices around it, I'd say the majority of the industry is still using deep learning models that depend on a single indicator.
25+
26+
While attempts could be made (and I'm sure there have been such attempts all around) to take in more indicators, it seems that building an MoE model which overcomes the drawbacks of auxiliary-loss implementations could be advantageous.
