Sh1384/GNN_SAE

Graph neural networks excel at learning from graph-structured data, yet whether their internal representations align with known topological motifs remains unclear. We apply sparse autoencoders (SAEs) to decompose the hidden activations of GNNs trained on synthetic graphs with ground-truth motif annotations, including feedback loops, cascades, and fan-out structures. Using point-biserial correlation with rigorous permutation testing, we find that GNNs spontaneously learn monosemantic features corresponding to specific graph motifs. Causal ablation experiments confirm that the identified features are functionally necessary: removing feedback-loop features selectively degrades performance only on graphs containing those structures. Interestingly, single-input module motifs were also causally linked to feedback loops, suggesting that these two motifs may not be mutually exclusive. This work establishes that mechanistic interpretability of graph representations is achievable and demonstrates that topological inductive biases critically shape the structure of learned graph encodings.
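The feature–motif alignment described above can be sketched with point-biserial correlation plus a permutation test. This is an illustrative NumPy implementation, not the repository's actual code; the function names (`point_biserial`, `permutation_pvalue`) and the permutation count are assumptions for the sketch. Point-biserial correlation between a continuous SAE feature activation and a binary motif label reduces to the Pearson correlation with the binary variable, which the test below exploits as a sanity check.

```python
import numpy as np

def point_biserial(x, y):
    """Point-biserial correlation between continuous activations x
    and binary motif labels y (0 = motif absent, 1 = motif present)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=int)
    m1 = x[y == 1].mean()          # mean activation on motif graphs
    m0 = x[y == 0].mean()          # mean activation on non-motif graphs
    s = x.std()                    # population std (ddof=0)
    n1, n0, n = (y == 1).sum(), (y == 0).sum(), len(y)
    return (m1 - m0) / s * np.sqrt(n1 * n0 / n**2)

def permutation_pvalue(x, y, n_perm=2000, seed=0):
    """Two-sided permutation p-value: shuffle motif labels and count
    how often the shuffled |correlation| reaches the observed one."""
    rng = np.random.default_rng(seed)
    observed = abs(point_biserial(x, y))
    hits = sum(
        abs(point_biserial(x, rng.permutation(y))) >= observed
        for _ in range(n_perm)
    )
    return (hits + 1) / (n_perm + 1)  # add-one smoothing avoids p = 0
```

A feature would be flagged as motif-aligned when its correlation is large and its permutation p-value survives multiple-comparison correction across all SAE features; the exact threshold used in the project is not specified here.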
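The causal ablation step can likewise be sketched in miniature. Assuming a standard ReLU sparse autoencoder over GNN hidden states, ablating a motif-aligned feature means zeroing its latent before reconstruction and patching the result back into the forward pass. The weights below are random stand-ins for a trained SAE, and the names (`sae_encode`, `ablate_feature`, `d_model`, `d_sae`) are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_sae = 16, 64  # GNN hidden width, SAE dictionary size (assumed)

# Stand-in weights; in practice these come from a trained SAE.
W_enc = rng.normal(0.0, 0.1, (d_model, d_sae))
b_enc = np.zeros(d_sae)
W_dec = rng.normal(0.0, 0.1, (d_sae, d_model))

def sae_encode(h):
    """ReLU-sparse latent code for a batch of hidden states h."""
    return np.maximum(h @ W_enc + b_enc, 0.0)

def sae_decode(z):
    """Linear reconstruction of hidden states from the latent code."""
    return z @ W_dec

def ablate_feature(h, idx):
    """Reconstruct h with one SAE feature zeroed out, simulating
    removal of a motif-aligned feature from the forward pass."""
    z = sae_encode(h)
    z[..., idx] = 0.0
    return sae_decode(z)
```

Because the decoder is linear, ablating feature `idx` subtracts exactly that feature's contribution `z[:, idx] ⊗ W_dec[idx]` from the reconstruction; comparing task performance with and without the ablation on motif-containing versus motif-free graphs gives the selectivity test described in the abstract.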

About

Interpreting GNN Activations of Graphical Motifs through SAEs
