content/Mathematical Foundations of Generative AI/VDM and GANs.md
KL divergence is asymmetric, meaning $\underbrace{D(P_X\,||\,P_\theta)}_\text{Forward KL} \ne \underbrace{D(P_\theta\,||\,P_X)}_\text{Reverse KL}$.
2. $f(u) = \frac{1}{2}\left(u\,\log u-(u+1)\log\left(\frac{u+1}{2}\right)\right)$ leads to the **JS (Jensen-Shannon) Divergence**.

$$
JS(P_X || P_\theta) = \frac{1}{2} KL(P_X || M) + \frac{1}{2} KL(P_\theta || M) \qquad \text{where } M = \frac{P_X + P_\theta}{2}
$$
3. $f(u)=\frac{1}{2}|u-1|$ leads to the **Total Variation Distance** or TV Distance.
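As a quick numerical check, the three divergences above can be computed directly for small discrete distributions. This is an illustrative sketch; the distributions `p` and `q` below are made up and are not from the notes.

```python
# Sketch (illustrative): KL, JS, and TV divergences between two small
# discrete distributions, confirming KL is asymmetric and JS is symmetric.
import math

def kl(p, q):
    """Forward KL divergence D(p || q) for discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js(p, q):
    """Jensen-Shannon divergence via the mixture M = (p + q) / 2."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def tv(p, q):
    """Total variation distance: half the L1 distance."""
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))

p = [0.7, 0.2, 0.1]
q = [0.3, 0.4, 0.3]

print(kl(p, q), kl(q, p))   # asymmetric: forward and reverse KL differ
print(js(p, q), js(q, p))   # symmetric
print(tv(p, q))
```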
## Algorithm for f-divergence minimization
$$
\begin{aligned}
&\approx \arg\min_{\theta} \Bigg[\cancel{\frac{1}{B_1}\sum_{i=1}^{B_1}\log D_w(x_i)}^{\text{Independent of }\theta} + \frac{1}{B_2}\sum_{i=1}^{B_2}\log\,(1-D_w(\hat{x}_i)) \Bigg] \\[8pt]
&= \arg\min_{\theta} \Bigg[\frac{1}{B_2}\sum_{i=1}^{B_2}\log\,(1-D_w(\hat{x}_i)) \Bigg] \qquad\because \text{Second term stays as } \hat{x}_i=g_\theta(z_j) \sim P_\theta \\[8pt]
\end{aligned}
$$
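The surviving minibatch term can be sketched numerically. This is an illustrative toy only: the scalar `discriminator` and `generator` below stand in for $D_w$ and $g_\theta$, which in practice are neural networks.

```python
# Sketch (illustrative, names not from the notes): the generator update
# only needs the second term of the objective, since log D_w(x_i) on real
# samples does not depend on theta.
import numpy as np

rng = np.random.default_rng(0)

def discriminator(x, w):
    """Toy discriminator D_w: a logistic score in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-w * x))

def generator(z, theta):
    """Toy generator g_theta: an affine map of the latent z."""
    return theta * z + 1.0

w, theta = 0.5, 2.0
z = rng.normal(size=64)        # minibatch of latents (B_2 = 64)
x_hat = generator(z, theta)    # fake samples, x_hat ~ P_theta

# Generator objective: (1/B_2) * sum_i log(1 - D_w(x_hat_i))
gen_loss = np.mean(np.log(1.0 - discriminator(x_hat, w)))
print(gen_loss)
```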
Any function $f$ being 1-Lipschitz means that the function cannot change faster than the distance between its inputs (the slope is always at most 1) -

$$
\frac{||f(x_1)-f(x_2)||}{||x_1-x_2||} \le 1
$$
The $T_w$ in this case is a neural network and can be made 1-Lipschitz by normalizing the weights of $T_w$ such that $||w||_2=1$ after each gradient step.
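The normalization step described above can be sketched for a linear critic. The linear form is an assumption made purely for illustration; the notes' $T_w$ is a general network.

```python
# Sketch (illustrative): after each gradient step, rescale the weight
# vector to unit L2 norm so a *linear* critic T_w stays 1-Lipschitz.
import numpy as np

w = np.array([3.0, -4.0])      # critic weights after a gradient step

def normalize_weights(w):
    """Project w back onto ||w||_2 = 1."""
    return w / np.linalg.norm(w)

w = normalize_weights(w)
print(np.linalg.norm(w))       # ~ 1.0

# For a linear map f(x) = w . x, |f(x1) - f(x2)| <= ||w|| * ||x1 - x2||,
# so ||w||_2 = 1 guarantees the 1-Lipschitz condition.
x1, x2 = np.array([1.0, 2.0]), np.array([-0.5, 0.3])
assert abs(w @ x1 - w @ x2) <= np.linalg.norm(x1 - x2) + 1e-12
```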
$\theta^*$ has to be chosen such that the Wasserstein distance is minimized. The Kantorovich-Rubinstein duality enables us to express the Wasserstein distance in terms of expectations over $P_X$ and $P_\theta$.
The above objective is very similar to GANs. That's why this method of minimizing the Wasserstein's metric is called the **WGAN**. Training a WGAN is more stable than training a naive GAN, as the gradients do not saturate when the supports of the two probability distributions do not overlap.
## Bi-Directional GAN (Bi-GAN)
### Inversion of GANs
We train a GAN specifically to allow us to sample $x$ from the dataset distribution $P_X$ by picking a random sample $z$ from an arbitrary distribution $Z$ and passing it through the generator $g_\theta$. But how can we get back $z$ if we know $x$?
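One simple way to get $z$ back is to search for it by gradient descent on the reconstruction error. This optimization-based inversion is an illustration on a toy differentiable $g_\theta$ (a Bi-GAN instead learns an encoder for this job); all names here are made up.

```python
# Sketch (illustrative toy): inverting a generator by finding the latent z
# whose output matches a given x, via gradient descent on ||g(z) - x||^2.
theta = 3.0

def g(z):
    """Toy differentiable generator g_theta."""
    return theta * z + 1.0

x = g(2.5)        # pretend x came from the dataset (true latent is 2.5)
z = 0.0           # initial latent guess
lr = 0.01
for _ in range(500):
    grad = 2 * (g(z) - x) * theta   # d/dz ||g(z) - x||^2
    z -= lr * grad

print(z)          # converges toward the true latent 2.5
```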
When the probability distributions of the training and testing datasets differ, we call this **domain shift**.

Any classifier/regressor trained solely on $D_s$ would fail to predict for the target items in $D_t$.
---
<h4 class="special">Example</h4>

Imagine you're training a model to identify images of dogs. The training dataset for your model has sketches, paintings, and cartoon representations of dogs, while your test dataset has actual photos of dogs. In such a case the distributions of the training and testing datasets differ. The method of solving domain shift is called **Unsupervised Domain Adaptation**.

In the above example -
- The broader class of "animal representations" is a **semantic class**.
- The sketches, paintings, cartoons, and photos are called **domains**.

The network can be trained either on all four domains, or one of the domains can be left unknown. In the case where a domain is left out, our hope is that all domains share the same underlying semantic structure, just with different marginal distributions. This is called the **shared support assumption**. Under this assumption, an optimal encoder trained on the other three domains should be able to extract meaningful features from the unseen domain. This setting is called **domain generalization**.

---
So our objective with such a setup is for the model to learn the features/classifier in such a manner that it performs well on both $P_s$ and $P_t$.

We can use **Domain Adversarial Networks** here to train a classifier that is **domain agnostic** (able to classify independent of which domain an element belongs to).
In Domain Adversarial Networks we have -
1. An Encoder $\phi:X \rightarrow F$ to extract features from inputs regardless of which domain the inputs belong to (both $D_s$ and $D_t$).
2. A Discriminator $T_w:F \rightarrow [0,1]$ to distinguish between elements of $P_s$ and elements of $P_t$ (Features of both source and target data).
3. A Classifier/Regressor $h_\psi: F_s \rightarrow y_s \sim P_s(y|x)$ which uses the features of the source inputs to make a prediction regarding their target. This works as a metric for the usefulness of the features.
A DANN cannot generate samples; its only job is to align the features of the different distributions.

The Discriminator makes the Encoder better at constructing features from the inputs (both source and target) in such a way that the features appear domain agnostic. But just having domain-agnostic features isn't enough; they also need to be useful for predicting the target class. For this we include a Classifier/Regressor in the network as well, so that the features learnt are both domain agnostic and useful.
The encoder network has gradients flowing from both the discriminator and the classifier.

- $\phi$ would ensure that $P_{F_s}$ = $P_{F_t}$.
- $h_\psi$ would ensure that the features are meaningful.
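The opposing gradients can be sketched with scalar parameters. This is a toy illustration of the gradient-reversal idea only, not the notes' implementation; every name and number below is made up.

```python
# Sketch (illustrative scalar toy): in a DANN, the encoder phi receives the
# classifier gradient as-is and the domain-discriminator gradient with its
# sign flipped, so phi helps the classifier while *fooling* the discriminator.
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

phi = 0.5          # encoder parameter
w = 1.0            # domain discriminator parameter
psi = 1.0          # label classifier parameter
lam = 0.1          # trade-off between the two objectives
x, y, d = 2.0, 1.0, 0.0   # input, class label, domain label (0 = source)

f = phi * x                                       # feature
cls_loss_grad_f = (sigmoid(psi * f) - y) * psi    # d(classifier BCE)/df
dom_loss_grad_f = (sigmoid(w * f) - d) * w        # d(discriminator BCE)/df

# Gradient reversal: subtract the domain gradient instead of adding it.
grad_phi = (cls_loss_grad_f - lam * dom_loss_grad_f) * x
phi -= 0.01 * grad_phi
print(phi)
```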
## Evaluation of a GAN
Suppose we have some true and generated samples and we wish to evaluate whether the GAN is successful in generating samples from $P_X$. There are various methods for this, but we'll look at a popular method of evaluation called the **Fréchet Inception Distance** (FID). FID uses [[VDM and GANs#Wasserstein's Metric (Optimal Transport)|Wasserstein's Metric]] along with an **Inception Network trained on ImageNet** to do this evaluation.
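A minimal sketch of the Fréchet distance computation, shown in 1-D where the Wasserstein-2 distance between Gaussians has a closed form, $(\mu_1-\mu_2)^2 + (\sigma_1-\sigma_2)^2$. The feature vectors below are synthetic stand-ins for Inception features, made up for illustration.

```python
# Sketch (illustrative): the Frechet/Wasserstein-2 distance between two
# Gaussians, which is what FID computes on Inception features, in 1-D.
import numpy as np

rng = np.random.default_rng(0)
real_feats = rng.normal(loc=0.0, scale=1.0, size=10_000)
fake_feats = rng.normal(loc=0.5, scale=1.2, size=10_000)

def fid_1d(a, b):
    """1-D Frechet distance: (mu1 - mu2)^2 + (sigma1 - sigma2)^2."""
    mu1, mu2 = a.mean(), b.mean()
    s1, s2 = a.std(), b.std()
    return (mu1 - mu2) ** 2 + (s1 - s2) ** 2

print(fid_1d(real_feats, fake_feats))   # roughly 0.25 + 0.04 = 0.29
```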
---

content/Software Testing.md
Software testing is the process of examining the artifacts and behavior of the software under test.
- **Error -** An incorrect internal state during execution. This happens inside the memory.

A test case involves an input to the software and an output. If the actual output matches the expected output, we say that the test case passed.
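The pass/fail rule above can be sketched directly. The `add` function and its cases are hypothetical, chosen only to illustrate the input/expected-output pairing.

```python
# Sketch (illustrative): a test case pairs an input with an expected
# output; the case passes when the actual output matches the expected one.
def add(a, b):
    return a + b            # unit under test

test_cases = [((2, 3), 5), ((0, 0), 0), ((-1, 1), 0)]
for args, expected in test_cases:
    actual = add(*args)
    assert actual == expected, f"FAIL: add{args} -> {actual}, expected {expected}"
print("all test cases passed")
```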
Testing goals based on process maturity -

1. Level 0: There is no difference between testing and debugging.
2. Level 1: The purpose of testing is to show correctness.
3. Level 2: The purpose of testing is to show that the software doesn't work.
4. Level 3: The purpose of testing is not to prove anything specific, but to reduce the risk of using the software.
5. Level 4: Testing is a mental discipline that helps all IT professionals develop higher quality software.

## Types of testing
1. **Unit Testing -** Testing of a singular component.
2. **Integration Testing** - Various components are put together and tested.
1. **Simple Path -** A path from one node to another is a simple path if no node appears more than once except the first and last node. No internal loops.
2. **Prime Path -** A simple path such that it's not a sub-path of another simple path. They are thus the maximal simple paths. ^633cde
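The two definitions can be checked mechanically on a small graph. This is a sketch; the graph and helper names are illustrative, not from the notes.

```python
# Sketch (illustrative): enumerate simple paths of a small control-flow
# graph, then keep only the maximal ones (the prime paths).
edges = {0: [1, 2], 1: [3], 2: [3], 3: []}

def simple_paths(edges):
    """All simple paths: no repeated node except possibly first == last."""
    paths = []
    def extend(path):
        paths.append(tuple(path))
        for nxt in edges[path[-1]]:
            if nxt == path[0]:                  # cycle back to the start
                paths.append(tuple(path) + (nxt,))
            elif nxt not in path:               # forbid other repeats
                extend(path + [nxt])
    for node in edges:
        extend([node])
    return set(paths)

def is_subpath(q, p):
    """True iff q occurs as a contiguous sub-path of p."""
    return any(p[i:i + len(q)] == q for i in range(len(p) - len(q) + 1))

paths = simple_paths(edges)
prime = {q for q in paths if not any(q != p and is_subpath(q, p) for p in paths)}
print(sorted(prime))    # the maximal simple paths of the graph
```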
## Types of Tours
1. **Tours with side-trips** - A test path $p$ tours a sub-path $q$ with side-trips iff every edge in $q$ is also in $p$ in the same order. If a tour comes back to the same node it diverted from, we say the tour includes a side-trip.

![[Pasted image 20260225093645.png|450]]
2. **Tours with detours** - A test path $p$ tours a sub-path $q$ with detours iff every node in $q$ is also in $p$ in the same order. If a tour detours from some node $n$ and returns back to the prime path at a successor of $n$, we say the tour has a detour.

![[Pasted image 20260225093700.png|450]]
# Data flow Coverage
![[Pasted image 20260222112558.png]]
# Test Integration
## Scaffolding
When testing incomplete portions of software, we need extra software components, sometimes called scaffolding.
Two common types of scaffolding:
1. **Test stub** is a skeletal or special purpose implementation of a software module, used to develop or test a component that calls the stub or otherwise depends on it.
2. **Test driver** is a software component or test tool that replaces a component that takes care of the control and/or the calling of a software component.
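A minimal sketch of both roles. All names here are hypothetical, invented only to show where a stub and a driver sit around the unit under test.

```python
# Sketch (illustrative): a stub stands in for a module the unit under test
# depends on; a driver stands in for the caller of the unit under test.
def payment_gateway_stub(amount):
    """Stub: skeletal replacement for a real payment service."""
    return {"status": "approved", "amount": amount}

def checkout(amount, gateway):
    """Unit under test: depends on some payment gateway."""
    result = gateway(amount)
    return result["status"] == "approved"

def driver():
    """Driver: takes care of calling the unit under test."""
    assert checkout(100, payment_gateway_stub)
    assert checkout(0, payment_gateway_stub)
    print("checkout tests passed")

driver()
```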
## Five approaches to integration testing
1. **Incremental -**
    1. Top-down - Create top level modules while using stubs.
    2. Bottom-up - Create bottom level modules while using test drivers to call them.
2. **Sandwich** - Mix of top-down and bottom-up.
3. **Big Bang** - All individually tested modules are put together to construct the entire system, which is tested as a whole.

## Coupling data flow
Coupling variables are variables that are defined in one unit and used in another.
There are different kinds of couplings based on the interfaces:
- **Parameter coupling:** Parameters are passed in calls.

Data flow coverage criteria can now be extended to coupling variables:

- **All-coupling-def coverage:** A path is to be executed from every last-def to at least one first-use.
- **All-coupling-use coverage:** A path is to be executed from every last-def to every first-use.
- **All-coupling-du-paths coverage:** Every simple path from every last-def to every first-use needs to be executed.

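The last-def/first-use pairing that these criteria cover can be sketched concretely. The functions below are hypothetical, invented only to mark where a coupling variable's last-def and first-use sit.

```python
# Sketch (illustrative): 'total' is a coupling variable, defined in one
# unit (compute_total) and used in another (format_receipt).
def compute_total(prices):
    total = sum(prices)       # last-def of the coupling variable
    return total

def format_receipt(total):
    return f"Total: {total}"  # first-use of the coupling variable

print(format_receipt(compute_total([2, 3, 5])))   # -> Total: 10
```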
## Classical Coverage Criteria
Traditional terminologies -
- **A linearly independent path** of execution in the CFG of a program is a path that does not contain other paths within it. (very similar to prime paths)
- **Basic Block -** A series of nodes with no branching can be collapsed into one node called the basic block.