localhost433
diff --git a/notes/courses/Japanese/ipa.md b/notes/courses/J-N5/ipa.md
notes/courses/Japanese/ipa.md renamed to notes/courses/J-N5/ipa.md
diff --git a/notes/courses/LING-UA-1/index.json b/notes/courses/LING-UA-1/index.json
Lines changed: 1 addition & 1 deletion
diff --git a/posts/entries/011-MoE.md b/posts/entries/011-MoE-1.md
posts/entries/011-MoE.md renamed to posts/entries/011-MoE-1.md
Lines changed: 49 additions & 25 deletions
diff --git a/posts/entries/012-ConvEWMA.md b/posts/entries/012-ConvEWMA.md
Lines changed: 0 additions & 183 deletions
diff --git a/posts/entries/012-MoE-2.md b/posts/entries/012-MoE-2.md
Lines changed: 26 additions & 0 deletions
@@ -1,7 +1,7 @@
 [
     {
         "slug": "01-intro",
-        "title": "01    - What is Language",
+        "title": "01 - What is Language",
         "date": "2025-05-19"
     }
     ,
 
@@ -1,5 +1,5 @@
 ---
-title: Loss-free Balancing in MoEs
+title: Auxiliary-loss Load Balancing in MoEs (1)
 date: 2025-07-06
 tags: [cs, finance, ai]
 author: R
@@ -12,18 +12,33 @@ I'm currently reading [this conference paper for ICLR 2025](https://arxiv.org/pd
 After opening the paper I ran into the concept of MoEs. To get more familiar with it, I read [this blog on HuggingFace](https://huggingface.co/blog/moe) (Sanseviero et al., 2023), which was really helpful and is highly recommended. MoE stands for **M**ixture **o**f **E**xperts; a famous example of its type is DeepSeek. It has many advantages, as the authors wrote: it is easy to scale to a large number of parameters, costs stay manageable, etc.
 
 *Let $u_t$ denote the input of the $t$-th token to an $N$-expert MoE layer, the output $h_t$ is computed as follows:*
+Let  
+- $N$ be the number of experts,  
+- $K$ the number of experts selected per token,  
+- $T$ the total number of tokens in the batch,  
+- $\mathbf{u}_t\in\R^d$ the input for token $t$,  
+- $\mathrm{FFN}_i: \R^d\to\R^d$ the $i$-th expert network,  
+- $e_i\in\R^d$ the centroid (parameter) of expert $i$, and  
+- $G\colon\R\to\R_{>0}$ a positive gating function (e.g. $\exp$, $\mathrm{sigmoid}$, or $\mathrm{softmax}$).
+
+Compute for each token $t$ and expert $i$:
 $$
 \begin{align*}
-        \textbf{h}_t &= \textbf{u}_t + \sum^N_{i=1} g_{i,t} \text{FFN}_i (\textbf{u}_t) \\
     g_{i,t} &= 
         \begin{cases}
-        s_{i,t}, & s_{i,t} \in \text{Topk}(\{s_{j,t} | 1 \leq j \leq N\} , K) \\
+        s_{i,t}, & s_{i,t} \in \text{Topk}(\{s_{j,t} | 1 \leq j \leq N\} , K)\\
+                 &          (\text{if $s_{i,t}$ is among the top-$K$ scores})
+        \\
         0, & \text{otherwise}
         \end{cases}\\
     s_{i,t} &= G(\textbf{u}_t^\top e_i)
 \end{align*}
 $$
-*where $G$ is a nonlinear gating function and $e_i$ is the centroid of the $i$-th expert.*
+
+and form the layer output
+$$
+\textbf{h}_t = \textbf{u}_t + \sum^N_{i=1} g_{i,t} \text{FFN}_i (\textbf{u}_t)
+$$
 
 So here $G$ could be any function $\R \to \R_{>0}$; conventional choices are $\exp$, softmax, or sigmoid (TBH I had to look these up to see what they actually are). The paper uses the latter two.
 
@@ -37,33 +52,42 @@ I was wondering how it could cause a computational bottleneck, but then I realiz
 Plus, the training loop would need a substantial redesign to put the idle compute to use. Even if I replicate the "hot" experts on more devices, the replicas need to stay in sync, which is costly in itself. Merging gradients across replicas requires collective operations at every step, and at that point it just recreates the original problem if one of them slows down...
 
 #### Solution: Auxiliary-loss
-To address this issue, there is auxiliary-loss encourage balanced load thus avoids inbalanced routing in training MoEs. To do this, it penalized the use of only a few number of agents. Defined as such:
+To address this issue, an auxiliary loss encourages a balanced load and thus avoids imbalanced routing when training MoEs. To do this, it penalizes routing that uses only a small number of experts. It acts mostly through the gating function. It is defined as follows:
+
 
-$$
-\begin{align*}
-    \mathcal{L}_\text{Balance} &= \alpha \sum^N_{i=1}f_i P_i, \\
-    f_i &= \frac{N}{KT} \sum^T_{t=1} \mathbb{1} \text{  (Token t selects Expert i)}, \\
-    P_i &= \frac{1}{T} \sum^T_{t=1} s_{i,t}
-\end{align*}
-$$
 
-Basically, divide by the variance. 
+- **Normalized load**  
+  $$
+    f_i = \frac{N}{KT} \sum_{t=1}^T \mathbb{1}(\text{token } t \text{ selects expert } i)
+    \quad (\text{fraction of tokens routed to expert } i \text{, scaled so that uniform routing gives } f_i = 1).
+  $$
+- **Average gating weight**
+  $$
+    P_i = \frac{1}{T}\sum_{t=1}^T s_{i,t}
+    \quad (\text{mean score assigned by the gate to expert }i ).
+  $$
+
+##### Balance loss
 
-#### Solution: EC (Expert Choice)
-The authors of the paper have indicated that this approach *break the causal constraint* (causing a leakage in future information which *destroys the generalization of a model and prevents reliable evaluation*). They have even proved their hypothesis (*that the loss drop originates from the model’s accessing and exploiting future token information*) through experiment:
-Theoretically
+Combine these into a single penalty term:
+$$
+  \mathcal{L}_{\mathrm{balance}}
+   = 
+  \alpha \sum_{i=1}^N f_i P_i.
+$$
 
-#### Solution: Loss-Free Balancing
-This approach is what the authors of this paper have come up with. Essentially introducing a bias factor without adding some noisy gradients like the Auxiliary-loss one does.
+##### Why this encourages balanced routing
 
-## MoE in Trading
-And there's [the 2023 paper](https://personal.ntu.edu.sg/boan/papers/KDD23_Stock.pdf) (Sun et al., 2023) which builds a MoE in order to account for multiple metrics, which differs from mainstream deep learning models being used in the industry.
+- $f_i$ captures how heavily expert $i$ is used, while $P_i$ captures its average gate score.  
+- If an expert is over-selected ($f_i$ large), the product $f_iP_i$ grows, increasing the penalty.  
+- Gradients then adjust the gating parameters to **decrease** routing to over-used experts and **increase** routing to under-used ones, driving the distribution toward uniformity.
 
-## For me
-I guess my main task is that to apply Loss-Free Balancing for an MoE model.
+##### Why this balances load
 
-Because just looking at the time of the publication and practices around I'd say the majority of the industry is still using deep learning models that are dependent on a single indicator.
+- Minimizing $\mathrm{CV}^2$ drives the variance of $\{\mathrm{Imp}_i\}$ or $\{\mathrm{Load}_i\}$ toward zero *relative* to their mean.  
+- Any expert $i$ with above-average usage raises its own $\mathrm{Imp}_i$ or $\mathrm{Load}_i$, increasing the penalty.  
+- Backpropagation through the gating parameters encourages **reduced** routing to overloaded experts and **increased** routing to underutilized ones, leading to a more uniform expert selection distribution.  
 
-While attempts could be made (and I'm sure there have been such attempts all around) to take in more indicators, it seems like that building a MoE model which overcomes drawbacks in auxiliary-loss implementations could be advantageous.
+Basically, $\mathrm{CV}^2$ is the variance divided by the squared mean.
 
-On second thought, I should've just annotated the paper instead of writing this. Reading sth. as this bright on an airplane with the lights off felt like some type of torture. I couldn't fall asleep in the tiny space, and when I woke up the screen is still on, that made me felt that I have to finish it before doing anything else.
+(TBC)
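
To make the definitions in this post concrete, here is a minimal sketch in plain Python of the top-$K$ gating, the layer output $\mathbf{h}_t$, and the balance loss $\alpha \sum_i f_i P_i$. It is not the paper's implementation: the choice of sigmoid for $G$, the toy centroids and tokens, the value of `alpha`, and all function names are assumptions made only for illustration.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gate_scores(u, centroids):
    """s_i = G(u^T e_i), assuming G = sigmoid."""
    return [sigmoid(sum(a * b for a, b in zip(u, e))) for e in centroids]

def topk_indices(scores, k):
    """Indices of the k largest scores (ties broken by lower index)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return set(order[:k])

def moe_layer(u, centroids, experts, k):
    """h = u + sum_i g_i * FFN_i(u), where g_i = s_i for the top-k experts, else 0."""
    scores = gate_scores(u, centroids)
    h = list(u)
    for i in topk_indices(scores, k):
        out = experts[i](u)
        h = [hj + scores[i] * oj for hj, oj in zip(h, out)]
    return h

def balance_loss(tokens, centroids, k, alpha):
    """Return (alpha * sum_i f_i * P_i, f, P) for a batch of T tokens."""
    T, N = len(tokens), len(centroids)
    counts = [0] * N          # how many tokens selected each expert
    score_sums = [0.0] * N    # running sum of s_{i,t} over tokens
    for u in tokens:
        scores = gate_scores(u, centroids)
        for i in topk_indices(scores, k):
            counts[i] += 1
        for i, s in enumerate(scores):
            score_sums[i] += s
    f = [N / (k * T) * c for c in counts]
    P = [s / T for s in score_sums]
    loss = alpha * sum(fi * pi for fi, pi in zip(f, P))
    return loss, f, P

# Hypothetical toy setup: N = 4 experts, d = 2, K = 2.
centroids = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
balanced = [[2.0, 1.0], [1.0, 2.0], [-2.0, -1.0], [-1.0, -2.0]]
skewed = [[2.0, 1.0]] * 4  # every token prefers experts 0 and 1

loss_bal, f_bal, P_bal = balance_loss(balanced, centroids, k=2, alpha=0.01)
loss_skew, f_skew, P_skew = balance_loss(skewed, centroids, k=2, alpha=0.01)
# f_bal is [1.0, 1.0, 1.0, 1.0]; f_skew is [2.0, 2.0, 0.0, 0.0]
```

In this toy batch the skewed routing yields a larger loss than the balanced one, which is the behaviour the penalty is meant to have. Note that $f_i$ comes from a hard top-$K$ count, so in an actual training setup the gradient would reach the gate only through $P_i$.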
@@ -0,0 +1,26 @@
+---
+title: Loss-free Balancing in MoEs (2)
+date: 2025-07-06
+tags: [cs, finance, ai]
+author: R
+location: Above around Ust-Ilimsk, Russia while on a plane from New York to Hong Kong
+---
+
+(On second thought) I should've just annotated the paper instead of writing this. Reading something this bright on an airplane with the lights off felt like some type of torture. I couldn't fall asleep in the tiny space, and when I woke up the screen was still on, which made me feel that I had to finish it before doing anything else.
+
+#### Solution: EC (Expert Choice)
+The authors of the paper indicate that this approach would *break the causal constraint* (causing a leakage of future information which *destroys the generalization of a model and prevents reliable evaluation*). They even tested their hypothesis (*that the loss drop originates from the model’s accessing and exploiting future token information*) through experiment:
+Theoretically
+
+#### Solution: Loss-Free Balancing
+This approach is what the authors of this paper came up with. It essentially introduces a bias term without adding noisy gradients the way the auxiliary-loss approach does.
+
+## MoE in Trading
+There's also [the 2023 paper](https://personal.ntu.edu.sg/boan/papers/KDD23_Stock.pdf) (Sun et al., 2023), which builds an MoE to account for multiple metrics, unlike the mainstream deep learning models used in the industry.
+
+## For me
+I guess my main task is to apply Loss-Free Balancing to an MoE model.
+
+Judging from the publication date and current practice, I'd say the majority of the industry is still using deep learning models that depend on a single indicator.
+
+While attempts could be made (and I'm sure there have been such attempts all around) to take in more indicators, it seems that building an MoE model which overcomes the drawbacks of auxiliary-loss implementations could be advantageous.