Mastering Interaction Discovery in LLMs: A Practical Guide to SPEX and ProxySPEX
Learn to detect influential interactions in LLMs using SPEX and ProxySPEX with ablation techniques. Step-by-step guide with code examples and common pitfalls.
Overview
Large Language Models (LLMs) achieve remarkable performance by learning complex relationships among input features, training data, and internal components. However, understanding how these interactions drive model predictions remains a significant challenge in interpretability. Traditional attribution methods often assume independence, missing the synergistic effects that are crucial for safety and trustworthiness.

SPEX and its scalable variant ProxySPEX are algorithms designed to identify influential interactions at scale. By leveraging a systematic ablation framework, they pinpoint which combinations of features, data points, or model components most impact the model's output. This tutorial provides a concrete, step-by-step guide to implementing and using these methods.
Prerequisites
Before diving in, ensure you have the following:
- Knowledge: Familiarity with LLMs, basic interpretability concepts (e.g., feature attribution, ablation), and Python programming.
- Tools: Python 3.8+, PyTorch or TensorFlow (for model access), NumPy, SciPy, and a library like transformers for loading LLMs.
- Data: A small dataset of prompts (for feature attribution) or a training set with labels (for data attribution). For mechanistic interpretability, access to model internals is required.
Step-by-Step Guide
Understanding Ablation and Attribution
At the core of SPEX is ablation: measuring how removing a component changes the model's output. We consider three types:
- Feature Ablation: Mask or remove parts of the input prompt (e.g., words, tokens) and observe the logit shift.
- Data Ablation: Retrain the model (or use influence functions) to measure how excluding a training point affects predictions on a test example.
- Component Ablation: Intervene on model internals (e.g., zero out attention heads) to assess their contribution.
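The goal is to find interactions: pairs or groups of components whose combined effect differs from the sum of their individual effects. With n components there are 2^n possible subsets to ablate, so exhaustive testing is infeasible. The SPEX-style procedure in this guide therefore searches greedily, and ProxySPEX replaces most model calls with a learned proxy for interaction strength.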
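To make "differs from the sum of individual effects" concrete, here is a minimal sketch of a pairwise interaction score. The `toy_effect` function and all of its numbers are invented for illustration; in practice it would be replaced by a real ablation measurement such as a logit shift.

```python
def pairwise_interaction(effect, i, j):
    """Deviation from additivity for the pair (i, j).

    `effect` maps a frozenset of ablated components to a scalar output change.
    """
    return (effect(frozenset({i, j})) - effect(frozenset({i}))
            - effect(frozenset({j})) + effect(frozenset()))


def toy_effect(ablated):
    """Hypothetical output drop for the prompt "not bad movie" (numbers invented)."""
    drop = 0.5 * ("not" in ablated) + 0.7 * ("bad" in ablated) + 0.1 * ("movie" in ablated)
    if "not" in ablated or "bad" in ablated:
        drop += 2.0  # the joint meaning of "not bad" is lost once either word is masked
    return drop


print(pairwise_interaction(toy_effect, "not", "bad"))    # clearly non-zero: the words interact
print(pairwise_interaction(toy_effect, "not", "movie"))  # approximately zero: the effects just add
```

A score near zero means the two ablations are additive. A large magnitude of either sign means the pair has to be explained jointly.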
Setting Up Your Environment
- Install dependencies:
    pip install torch transformers numpy scipy
- Load a pre-trained model (e.g., GPT-2):
    from transformers import AutoModelForCausalLM, AutoTokenizer
    model = AutoModelForCausalLM.from_pretrained('gpt2')
    tokenizer = AutoTokenizer.from_pretrained('gpt2')
- Define a baseline input and a target output tensor. For simplicity, we'll use a single prompt.
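One way to define the baseline input and target for feature ablation is sketched below. The helper names (`mask_words`, `target_logit`, `ablation_effect`) and the "_" placeholder are assumptions made for this example. GPT-2 has no dedicated mask token, so any neutral filler is a judgment call.

```python
def mask_words(words, ablated, mask_token="_"):
    """Replace the words at the ablated positions with a placeholder."""
    return " ".join(mask_token if i in ablated else w for i, w in enumerate(words))


def target_logit(model, tokenizer, prompt, target_id):
    """Logit the model assigns to `target_id` as the next token after `prompt`."""
    import torch  # imported lazily so mask_words works without PyTorch installed

    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits  # shape: (batch, sequence, vocabulary)
    return logits[0, -1, target_id].item()


def ablation_effect(model, tokenizer, words, ablated, target_id):
    """Logit shift caused by masking the word positions in `ablated`."""
    full = target_logit(model, tokenizer, " ".join(words), target_id)
    masked = target_logit(model, tokenizer, mask_words(words, set(ablated)), target_id)
    return masked - full
```

With the model and tokenizer loaded as above, a call could look like `ablation_effect(model, tokenizer, "The movie was not bad".split(), [3], tokenizer.encode(" good")[0])`, where the target token is again only an example.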
Implementing SPEX
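The published SPEX (Spectral Explainer) algorithm recovers interactions with a sparse Fourier transform over ablation masks. This guide uses a simpler greedy stand-in that conveys the same idea: iteratively select the next best component to ablate, taking into account interactions with the components already selected. Here is a Python sketch: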
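    import math

    def spex(model, inputs, components, baseline_output, budget):
        """Greedily pick up to `budget` components to ablate.

        `ablate_model` and `interaction_gain` are helpers you supply for your
        ablation type (feature, data, or component).
        """
        selected = []
        remaining = list(components)
        # Never ask for more components than exist.
        for _ in range(min(budget, len(remaining))):
            best_gain = -math.inf
            best_comp = None
            for c in remaining:
                # Ablate the candidate together with everything chosen so far.
                ablated = selected + [c]
                output = ablate_model(model, inputs, ablated)
                gain = interaction_gain(baseline_output, output, selected, c)
                if gain > best_gain:
                    best_gain = gain
                    best_comp = c
            if best_comp is None:  # no candidate had a comparable gain (e.g. all NaN)
                break
            selected.append(best_comp)
            remaining.remove(best_comp)
        return selected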
The interaction_gain function computes the additional effect of adding component c given already selected ones. For feature attribution, you could mask tokens; for data, use influence scores.
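The greedy loop can be exercised end to end on a toy problem. In the sketch below, `toy_model`, its weights, and the `greedy_select` name are all hypothetical. `ablate_model` and `interaction_gain` are just one possible way to fill in the two helpers, with the gain taken as the marginal output drop from ablating `c` on top of the already selected components.

```python
WORDS = ["the", "movie", "not", "bad"]
WEIGHTS = {"the": 0.0, "movie": 0.1, "not": 0.5, "bad": 0.7}  # invented numbers
SYNERGY = {frozenset({"not", "bad"}): 2.0}  # extra output when both words are present


def toy_model(present):
    """Hypothetical scalar output given the set of words still present."""
    out = sum(WEIGHTS[w] for w in present)
    out += sum(v for pair, v in SYNERGY.items() if pair <= present)
    return out


def ablate_model(model, inputs, ablated):
    """Run the model with the ablated components removed from the input."""
    return model(frozenset(inputs) - frozenset(ablated))


def interaction_gain(baseline_output, output, selected, c):
    """Marginal drop from ablating `c` on top of `selected`."""
    output_selected = ablate_model(toy_model, WORDS, selected)
    return output_selected - output


def greedy_select(model, inputs, components, budget):
    """Same greedy loop as the sketch above, written compactly."""
    baseline = ablate_model(model, inputs, [])
    selected, remaining = [], list(components)
    for _ in range(min(budget, len(remaining))):
        best = max(
            remaining,
            key=lambda c: interaction_gain(
                baseline, ablate_model(model, inputs, selected + [c]), selected, c
            ),
        )
        selected.append(best)
        remaining.remove(best)
    return selected


print(greedy_select(toy_model, WORDS, WORDS, budget=2))
```

In this toy, ablating either "not" or "bad" already destroys their joint contribution, so the second word picked adds only its small individual weight. That is a reminder that marginal gains alone do not reveal which components interact.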
Scaling with ProxySPEX
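SPEX becomes slow when the number of components is large. ProxySPEX reduces cost by learning a proxy model (e.g., gradient-boosted trees) that predicts ablation outcomes from binary component indicator vectors. The proxy must be able to represent interactions: a purely linear model predicts the same interaction score for every pair, so it cannot rank them. Steps: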
- Sample a random subset of component combinations.
- Compute their ablation effects (e.g., logit changes).
- Train a proxy to map component indicator vectors to effects.
- Use the proxy to score all possible interactions without expensive model runs.
    import random

    def proxyspex(model, inputs, components, num_samples, proxy_model, budget,
                  max_subset_size=5):
        """Rank component pairs by the interaction strength a proxy predicts.

        `compute_effect` is a helper you supply. `proxy_model` needs `fit` and
        `predict` methods and must be able to model interactions (not purely linear).
        """
        n = len(components)
        indices = list(range(n))
        # Steps 1-2: sample random subsets and measure their ablation effects.
        X, y = [], []
        for _ in range(num_samples):
            subset = random.sample(indices, random.randint(1, min(max_subset_size, n)))
            mask = [0] * n
            for i in subset:
                mask[i] = 1
            effect = compute_effect(model, inputs, [components[i] for i in subset])
            X.append(mask)
            y.append(effect)
        # Step 3: train the proxy on (mask, effect) pairs.
        proxy_model.fit(X, y)

        # Step 4: score every pair as its predicted deviation from additivity.
        def predict(active):
            return proxy_model.predict([[1 if k in active else 0 for k in range(n)]])[0]

        pred_none = predict(())
        singles = [predict((i,)) for i in range(n)]
        scores = {}
        for i in range(n):
            for j in range(i + 1, n):
                scores[(i, j)] = predict((i, j)) - singles[i] - singles[j] + pred_none
        # Return the `budget` pairs with the largest absolute interaction.
        return sorted(scores, key=lambda p: abs(scores[p]), reverse=True)[:budget]
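Any object with `fit` and `predict` methods can serve as the proxy, provided it can express interactions. As a small stand-in, and not the proxy used by the ProxySPEX authors, the hypothetical `PairwiseProxy` below fits a least-squares model with one term per component and one per pair using NumPy. The ground-truth effect in the demo is invented so that the recovered interaction can be checked.

```python
import itertools

import numpy as np


class PairwiseProxy:
    """Least-squares proxy with an intercept, single-component and pairwise terms."""

    def _features(self, X):
        X = np.asarray(X, dtype=float)
        n = X.shape[1]
        cols = [np.ones(len(X))] + [X[:, k] for k in range(n)]
        cols += [X[:, i] * X[:, j] for i, j in itertools.combinations(range(n), 2)]
        return np.column_stack(cols)

    def fit(self, X, y):
        target = np.asarray(y, dtype=float)
        self.coef_, *_ = np.linalg.lstsq(self._features(X), target, rcond=None)
        return self

    def predict(self, X):
        return self._features(X) @ self.coef_


def true_effect(mask):
    """Invented ablation effect: components 0 and 1 interact with strength 2.0."""
    return 1.0 * mask[0] + 0.5 * mask[1] + 0.2 * mask[2] + 2.0 * mask[0] * mask[1]


# Train on all 2^3 masks; a real run would use a random sample instead.
X = [list(m) for m in itertools.product([0, 1], repeat=3)]
y = [true_effect(m) for m in X]
proxy = PairwiseProxy().fit(X, y)


def predicted_interaction(i, j, n=3):
    def pred(active):
        return proxy.predict([[1 if k in active else 0 for k in range(n)]])[0]
    return pred((i, j)) - pred((i,)) - pred((j,)) + pred(())


for pair in itertools.combinations(range(3), 2):
    print(pair, round(predicted_interaction(*pair), 3))
```

The number of pairwise terms grows quadratically with the number of components, so this explicit design only suits small problems. Tree ensembles are a common alternative because they pick up interactions without enumerating them.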
Interpreting Results
The output is a ranked list of interactions (pairs or groups). Visualize them as a graph: nodes are components, edges show interaction strength. Check if interactions align with domain knowledge (e.g., tokens that co‐occur often). For model debugging, unexpected interactions might indicate spurious correlations.
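Before reaching for a plotting library, it can help to turn the scores into a plain edge list. The sketch below assumes you kept the pair-to-score dictionary rather than only the ranked pairs. The `interaction_graph` name and the example scores are made up.

```python
from collections import defaultdict


def interaction_graph(scores, names, top_k=5):
    """Return the top_k edges by |score| and each node's summed edge strength."""
    top = sorted(scores.items(), key=lambda kv: abs(kv[1]), reverse=True)[:top_k]
    edges = [(names[i], names[j], s) for (i, j), s in top]
    strength = defaultdict(float)
    for a, b, s in edges:
        strength[a] += abs(s)
        strength[b] += abs(s)
    return edges, dict(strength)


scores = {(0, 1): -2.0, (0, 2): 0.1, (1, 2): 0.3}  # invented interaction scores
names = ["not", "bad", "movie"]
edges, strength = interaction_graph(scores, names, top_k=2)
for a, b, s in edges:
    print(f"{a} -- {b}: {s:+.2f}")
print(strength)
```

The edge list can be passed to a graph or plotting library of your choice. Nodes with high summed strength are natural starting points for manual inspection.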

Common Mistakes
- Ignoring baselines: Always use a neutral input (e.g., empty prompt) to compute baseline output. Otherwise, interactions can be misleading.
- Overfitting the proxy: ProxySPEX requires enough training samples; too few leads to poor generalization. Use cross-validation.
- Assuming additive effects: SPEX targets pairwise and higher-order interactions, so do not score components as if their effects simply add up. Use an interaction_gain that measures deviation from additivity.
- Budget too small: If you set a low budget for SPEX (e.g., 3 components), you may miss important interactions involving many features.
- Not normalizing effects: Compare interactions across different scales by normalizing ablation effects (e.g., divide by standard deviation of outputs).
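A quick guard against an overfit proxy is to check its accuracy on ablation samples it has not seen. The sketch below is a single hold-out split rather than full cross-validation. `holdout_r2` and `MeanProxy` are hypothetical names, and the data is synthetic.

```python
import random


def holdout_r2(proxy, X, y, test_fraction=0.2, seed=0):
    """Fit `proxy` on a random split and return R^2 on the held-out samples."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    n_test = max(1, int(len(idx) * test_fraction))
    test, train = idx[:n_test], idx[n_test:]
    proxy.fit([X[i] for i in train], [y[i] for i in train])
    preds = proxy.predict([X[i] for i in test])
    truth = [y[i] for i in test]
    mean = sum(truth) / len(truth)
    ss_res = sum((t - p) ** 2 for t, p in zip(truth, preds))
    ss_tot = sum((t - mean) ** 2 for t in truth)
    return 1.0 - ss_res / ss_tot if ss_tot > 0 else float("nan")


class MeanProxy:
    """Baseline that ignores the mask and always predicts the training mean."""

    def fit(self, X, y):
        self.mean_ = sum(y) / len(y)
        return self

    def predict(self, X):
        return [self.mean_ for _ in X]


# Synthetic masks and effects, only to show the call.
X = [[i % 2, (i // 2) % 2] for i in range(20)]
y = [float(i) for i in range(20)]
print(holdout_r2(MeanProxy(), X, y))
```

The mean baseline can never score above zero, so a proxy worth trusting should land well above it. If it does not, collect more ablation samples before reading anything into its interaction scores.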
Summary
SPEX and ProxySPEX enable efficient identification of influential interactions in LLMs, overcoming the exponential complexity of exhaustive search. By using iterative ablation (SPEX) or a learned proxy (ProxySPEX), you can uncover how features, data points, or model components work together to drive predictions. This guide provides the core concepts, implementation steps, and common pitfalls. Start with a small model and dataset, validate your proxy, and gradually scale up to real-world LLM interpretability tasks.