Graph neural networks (GNNs) excel at learning from graph-structured data, yet whether their internal representations align with known topological motifs remains unclear. We apply sparse autoencoders (SAEs) to decompose the hidden activations of GNNs trained on synthetic graphs with ground-truth motif annotations, including feedback loops, cascades, and fan-out structures. Using point-biserial correlation with permutation testing, we find that GNNs spontaneously learn monosemantic features corresponding to specific graph motifs. Causal ablation experiments confirm that the identified features are functionally necessary: removing feedback-loop features selectively degrades performance only on graphs containing those structures. Interestingly, single-input-module motifs were also causally linked to feedback loops, suggesting that these two motifs may not be mutually exclusive. This work establishes that mechanistic interpretability of graph representations is achievable and shows that topological inductive biases critically shape the structure of learned motif encodings.
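As a minimal sketch of the feature–motif matching step described above, the snippet below computes a point-biserial correlation between one SAE feature's activations and a binary motif-presence label, with a permutation-based p-value. The function name and array layout are illustrative assumptions, not the repository's actual API; point-biserial r is computed as Pearson r against the 0/1 labels, which is equivalent.

```python
import numpy as np

def point_biserial_perm(feature_acts, motif_labels, n_perm=10000, seed=0):
    """Point-biserial correlation between a continuous SAE feature and a
    binary motif label, with a two-sided permutation p-value.

    feature_acts : (n_graphs,) float array of per-graph feature activations
    motif_labels : (n_graphs,) array of 0/1 motif-presence annotations
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(feature_acts, dtype=float)
    y = np.asarray(motif_labels, dtype=float)
    # Point-biserial r equals Pearson r when one variable is binary.
    r_obs = np.corrcoef(x, y)[0, 1]
    # Null distribution: shuffle the motif labels and recompute r.
    exceed = 0
    for _ in range(n_perm):
        r_null = np.corrcoef(x, rng.permutation(y))[0, 1]
        if abs(r_null) >= abs(r_obs):
            exceed += 1
    # Add-one correction keeps the estimate valid for finite n_perm.
    p_value = (exceed + 1) / (n_perm + 1)
    return r_obs, p_value
```

A feature would then be labeled as motif-selective when its correlation survives the permutation test after multiple-comparison correction across all SAE features.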
Sh1384/GNN_SAE