@@ -12,18 +12,33 @@ I'm currently reading [this conference paper for ICLR 2025](https://arxiv.org/pd
12
12
After opening the paper I ran into the concept of MoEs. To get more familiar with it, I read [this blog on HuggingFace](https://huggingface.co/blog/moe) (Sanseviero et al., 2023), which was really helpful and is highly recommended. MoE stands for **M**ixture **o**f **E**xperts; a famous example of its type is DeepSeek. It has many advantages, as the authors wrote: it scales easily to a large number of parameters, keeps costs manageable, etc.
13
13
14
14
*Let $u_t$ denote the input of the $t$-th token to an $N$-expert MoE layer, the output $h_t$ is computed as follows:*
15
+
Let
16
+
- $N$ be the number of experts,
17
+
- $K$ the number of experts selected per token,
18
+
- $T$ the total number of tokens in the batch,
19
+
- $\mathbf{u}_t\in\R^d$ the input for token $t$,
20
+
- $\mathrm{FFN}_i: \R^d\to\R^d$ the $i$-th expert network,
21
+
- $e_i\in\R^d$ the centroid (parameter) of expert $i$, and
22
+
- $G\colon\R\to\R_{>0}$ a positive gating function (e.g. $\exp$, $\mathrm{sigmoid}$, or $\mathrm{softmax}$).
So here $G$ could be anything that maps $\R \to \R_{>0}$; some conventional choices are $\exp$, softmax, or sigmoid (TBH I had to look these up to see what they actually are). In this paper they use the latter two.
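
To make the notation concrete for myself, here is a minimal sketch of top-$K$ routing for a single token. It assumes the score of expert $i$ is $s_{i,t} = G(\mathbf{u}_t^\top e_i)$ with a sigmoid gate, since the output equation itself is not reproduced in these notes. The names `route_token`, `centroids`, and `gate` are mine, not the paper's.

```python
import math

def sigmoid(x):
    """Positive gating function G: R -> (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def route_token(u, centroids, k, gate=sigmoid):
    """Score one token against every expert and pick the top K.

    Assumption: score_i = G(u . e_i), where e_i is the centroid of expert i.
    Returns (scores, selected), with selected as a sorted list of expert indices.
    """
    scores = [gate(sum(a * b for a, b in zip(u, e))) for e in centroids]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return scores, sorted(ranked[:k])

# Toy example: d = 2, N = 3 experts, K = 2 experts per token.
u = [1.0, 0.0]
centroids = [[2.0, 0.0], [-1.0, 0.0], [0.5, 0.0]]
scores, selected = route_token(u, centroids, k=2)
```

Only the selected experts would then run their $\mathrm{FFN}_i$ on $\mathbf{u}_t$, which is where the compute saving comes from.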
29
44
@@ -37,33 +52,42 @@ I was wondering how it could cause a computational bottleneck, but then I realiz
37
52
Plus, the training loop would need a substantial redesign to use the idle computational power to catch up. Even if I create replicas of the "hot" experts on more devices, they need to stay in sync, which is costly by itself. Merging gradients across replicas requires collective operations every step, and at that point it just recreates the original problem I was trying to overcome if one of them slows down...
38
53
39
54
#### Solution: Auxiliary-loss
40
-
To address this issue, an auxiliary loss encourages a balanced load and thus avoids imbalanced routing when training MoEs. To do this, it penalizes relying on only a few experts. It is defined as follows:
55
+
To address this issue, an auxiliary loss encourages a balanced load and thus avoids imbalanced routing when training MoEs. To do this, it penalizes relying on only a few experts. It mostly operates within the gating function. It is defined as follows:
\quad (\text{fraction of tokens routed to expert }i ).
63
+
$$
64
+
-**Average gating weight**
65
+
$$
66
+
P_i = \frac{1}{T}\sum_{t=1}^T s_{i,t}
67
+
\quad (\text{mean score assigned by the gate to expert }i ).
68
+
$$
69
+
70
+
##### Balance loss
51
71
52
-
#### Solution: EC (Expert Choice)
53
-
The authors of the paper indicate that this approach can *break the causal constraint* (causing a leakage of future information which *destroys the generalization of a model and prevents reliable evaluation*). They even back up their hypothesis (*that the loss drop originates from the model’s accessing and exploiting future token information*) through experiments:
54
-
Theoretically
72
+
Combine these into a single penalty term:
73
+
$$
74
+
\mathcal{L}_{\mathrm{balance}}
75
+
=
76
+
\alpha \sum_{i=1}^N f_i P_i.
77
+
$$
55
78
56
-
#### Solution: Loss-Free Balancing
57
-
This approach is what the authors of this paper came up with. Essentially, it introduces a bias factor without adding noisy gradients the way the auxiliary-loss approach does.
79
+
##### Why this encourages balanced routing
58
80
59
-
## MoE in Trading
60
-
There's also [the 2023 paper](https://personal.ntu.edu.sg/boan/papers/KDD23_Stock.pdf) (Sun et al., 2023), which builds an MoE to account for multiple metrics, unlike the mainstream deep learning models used in the industry.
81
+
- $f_i$ captures how heavily expert $i$ is used, while $P_i$ captures its average gate score.
82
+
- If an expert is over-selected ($f_i$ large), the product $f_iP_i$ grows, increasing the penalty.
83
+
- Gradients then adjust the gating parameters to **decrease** routing to over-used experts and **increase** routing to under-used ones, driving the distribution toward uniformity.
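
A small numeric sketch of the penalty helped me check the intuition. It takes the description literally and assumes $f_i$ is the plain fraction of tokens routed to expert $i$; the exact normalization is not shown in this excerpt, and a constant factor would only rescale the loss. The function name `balance_loss` and the toy scores are made up for illustration.

```python
def balance_loss(scores, selected, alpha):
    """Compute alpha * sum_i f_i * P_i for one batch.

    scores[t][i]  : gate score s_{i,t} of expert i for token t
    selected[t]   : set of expert indices chosen for token t
    Assumption: f_i = (number of tokens routed to expert i) / T.
    """
    T = len(scores)
    N = len(scores[0])
    f = [sum(1 for t in range(T) if i in selected[t]) / T for i in range(N)]
    P = [sum(scores[t][i] for t in range(T)) / T for i in range(N)]
    loss = alpha * sum(fi * pi for fi, pi in zip(f, P))
    return loss, f, P

# Balanced routing: 4 tokens alternate between 2 experts (K = 1).
balanced_scores = [[0.6, 0.4], [0.4, 0.6], [0.6, 0.4], [0.4, 0.6]]
balanced_selected = [{0}, {1}, {0}, {1}]
balanced, _, _ = balance_loss(balanced_scores, balanced_selected, alpha=1.0)

# Skewed routing: every token goes to expert 0.
skewed_scores = [[0.9, 0.1]] * 4
skewed_selected = [{0}] * 4
skewed, f_skewed, _ = balance_loss(skewed_scores, skewed_selected, alpha=1.0)
```

In this toy batch the skewed routing gets the larger penalty, which matches the bullets above.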
61
84
62
-
## For me
63
-
I guess my main task is to apply Loss-Free Balancing to an MoE model.
85
+
##### Why this balances load
64
86
65
-
Judging by the publication date and the practices around, I'd say the majority of the industry is still using deep learning models that depend on a single indicator.
87
+
- Minimizing $\mathrm{CV}^2$ drives the variance of $\{\mathrm{Imp}_i\}$ or $\{\mathrm{Load}_i\}$ toward zero *relative* to their mean.
88
+
- Any expert $i$ with above-average usage raises its own $\mathrm{Imp}_i$ or $\mathrm{Load}_i$, increasing the penalty.
89
+
- Backpropagation through the gating parameters encourages **reduced** routing to overloaded experts and **increased** routing to underutilized ones, leading to a more uniform expert selection distribution.
66
90
67
-
While attempts could be made (and I'm sure there have been such attempts all around) to take in more indicators, it seems that building an MoE model that overcomes the drawbacks of auxiliary-loss implementations could be advantageous.
91
+
Basically, divide the variance by the squared mean.
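
As a sanity check, here is a minimal sketch of the squared coefficient of variation, $\mathrm{CV}^2 = \mathrm{Var}/\mathrm{mean}^2$. It assumes the population variance and a nonzero mean, and it takes any per-expert statistic as input, since the definitions of $\mathrm{Imp}_i$ and $\mathrm{Load}_i$ are not reproduced in these notes. The name `cv_squared` is mine.

```python
def cv_squared(values):
    """Squared coefficient of variation: variance divided by squared mean.

    Assumptions: population variance (divide by n) and a nonzero mean.
    """
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / n
    return variance / (mean ** 2)

# Four experts with identical usage: no penalty.
uniform = cv_squared([1.0, 1.0, 1.0, 1.0])

# One expert takes everything: mean 1, variance 3, so CV^2 = 3.
collapsed = cv_squared([4.0, 0.0, 0.0, 0.0])
```

The penalty is zero only when every expert has the same value, and it does not depend on the overall scale of the statistic.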
68
92
69
-
On second thought, I should've just annotated the paper instead of writing this. Reading something this bright on an airplane with the lights off felt like some type of torture. I couldn't fall asleep in the tiny space, and when I woke up the screen was still on, which made me feel that I had to finish it before doing anything else.
location: Above around Ust-Ilimsk, Russia while on a plane from New York to Hong Kong
7
+
---
8
+
9
+
(On second thought) I should've just annotated the paper instead of writing this. Reading something this bright on an airplane with the lights off felt like some type of torture. I couldn't fall asleep in the tiny space, and when I woke up the screen was still on, which made me feel that I had to finish it before doing anything else.
10
+
11
+
#### Solution: EC (Expert Choice)
12
+
The authors of the paper indicate that this approach can *break the causal constraint* (causing a leakage of future information which *destroys the generalization of a model and prevents reliable evaluation*). They even back up their hypothesis (*that the loss drop originates from the model’s accessing and exploiting future token information*) through experiments:
13
+
Theoretically
14
+
15
+
#### Solution: Loss-Free Balancing
16
+
This approach is what the authors of this paper came up with. Essentially, it introduces a bias factor without adding noisy gradients the way the auxiliary-loss approach does.
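
Here is how I picture the bias idea, as a rough sketch rather than the paper's exact procedure. My notes only say "a bias factor", so the sign-based update rule, the `step` size, and the function names below are assumptions for illustration: each expert gets a bias that is added to its score only when picking the top $K$, and the bias is nudged up for under-used experts and down for over-used ones, with no gradient involved.

```python
def sign(x):
    """Return -1, 0, or 1 depending on the sign of x."""
    return (x > 0) - (x < 0)

def select_with_bias(scores, biases, k):
    """Pick the top-K experts by score + bias (bias affects selection only)."""
    ranked = sorted(range(len(scores)),
                    key=lambda i: scores[i] + biases[i], reverse=True)
    return sorted(ranked[:k])

def update_biases(biases, counts, step):
    """Assumed update: raise the bias of under-loaded experts, lower it for over-loaded ones."""
    mean = sum(counts) / len(counts)
    return [b + step * sign(mean - c) for b, c in zip(biases, counts)]

scores = [0.9, 0.5, 0.4]            # one token's gate scores, N = 3
biases = [0.0, 0.0, 0.0]
before = select_with_bias(scores, biases, k=1)

counts = [10, 1, 1]                 # expert 0 was heavily over-used last batch
biases = update_biases(biases, counts, step=0.3)
after = select_with_bias(scores, biases, k=1)
```

In this toy case the over-used expert 0 loses the token to expert 1 after a single update, and no extra loss term was needed to get there.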
17
+
18
+
## MoE in Trading
19
+
There's also [the 2023 paper](https://personal.ntu.edu.sg/boan/papers/KDD23_Stock.pdf) (Sun et al., 2023), which builds an MoE to account for multiple metrics, unlike the mainstream deep learning models used in the industry.
20
+
21
+
## For me
22
+
I guess my main task is to apply Loss-Free Balancing to an MoE model.
23
+
24
+
Judging by the publication date and the practices around, I'd say the majority of the industry is still using deep learning models that depend on a single indicator.
25
+
26
+
While attempts could be made (and I'm sure there have been such attempts all around) to take in more indicators, it seems that building an MoE model that overcomes the drawbacks of auxiliary-loss implementations could be advantageous.