diff --git a/content/07.implicit.md b/content/07.implicit.md index c2a706c..4adfe5e 100644 --- a/content/07.implicit.md +++ b/content/07.implicit.md @@ -12,10 +12,19 @@ In this section, we offer a systematic review of the mechanisms underlying impli The capacity for implicit adaptation does not originate from a single mechanism, but reflects a range of capabilities grounded in fundamental principles of neural network design. Unlike approaches that adjust parameters by directly mapping context to coefficients, implicit adaptation emerges from the way information is processed within a model, even when the global parameters remain fixed. To provide a basis for understanding more advanced forms of adaptation, such as in-context learning, this section reviews the architectural components that enable context-aware computation. We begin with simple context-as-input models and then discuss the more dynamic forms of conditioning enabled by attention mechanisms. -#### Architectural Conditioning via Context Inputs +#### Theoretical Bridge: Architectural Conditioning via Context Inputs In contrast to explicit parameter mapping, the simplest route to implicit adaptation is to feed context directly as part of the input. The simplest form of implicit adaptation appears in neural network models that directly incorporate context as part of their input. In models written as $y_i = g([x_i, c_i]; \Phi)$, context features $c_i$ are concatenated with the primary features $x_i$, and the mapping $g$ is determined by a single set of fixed global weights $\Phi$. Even though these parameters do not change during inference, the network’s nonlinear structure allows it to capture complex interactions. As a result, the relationship between $x_i$ and $y_i$ can vary depending on the specific value of $c_i$. + +The connection is explicit for differentiable models $g$. Consider the model $P(Y | X, C)$ as a varying-coefficient regression model. An explicit estimator for regression parameters will solve for the regression parameter map $\beta_i = f(c_i)$ through +$$\hat{f} = \text{argmin}_f \sum_i (y_i - x_i \cdot f(c_i))^2,$$ +while a differentiable model (e.g. a neural network) will solve +$$\hat{\Phi} = \text{argmin}_\Phi \sum_i (y_i - g([x_i, c_i]; \Phi).$$ +Under mild assumptions, these result in an identical solution for the intermediate regression parameters $\beta$. While the varying-coefficient model solves this explicitly, these can be obtained post-hoc from the differentiable model by differentiating with respect to $c_i$ +$$\beta_i = \frac{\delta}{\delta c} g([x_i, c_i]; \Phi).$$ +This is the first-order Taylor approximation of the model, a locally linear approximation [@doi:10.48550/arXiv.1602.04938] often used in post-hoc interpretation methods. + This basic yet powerful principle is central to many conditional prediction tasks. For example, personalized recommendation systems often combine a user embedding (as context) with item features to predict ratings. Similarly, in multi-task learning frameworks, shared networks learn representations conditioned on task or environment identifiers, which allows a single model to solve multiple related problems [@doi:10.48550/arXiv.1706.05098]. #### Interaction Effects and Attention Mechanisms