Causality Analysis for LLMs

This is the website for the CASPER paper on applying causality analysis to evaluate the security of large language models (LLMs), including a basic introduction and detailed experiments.

What Is Casper?

Casper is a framework for conducting lightweight causality analysis of LLMs at different levels: we approximately measure the causal effect on the model output from each input token, each neuron, and each layer of neurons.

How Does Casper Work?

  • Normal Computation

    For a normal LLM, the figure above illustrates a simplified computation graph of $f$, containing multiple stacked layers depicted in blue. The green squares represent the output, referred to as the latent vector $v^{(l)}$, of each decoder layer $D^{(l)}$. For clarity, we omit the input embedding layer from the illustration. Notably, each latent vector depends only on the output of the preceding layer, as captured by

    $v^{(l)}=D^{(l)}(v^{(l-1)})$

    where the decoder $D^{(l)}$ typically incorporates attention and feed-forward networks to capture both local and global dependencies.

  • Layer Intervention

    To measure the causal effect of layer $l$, we exclude it during the inference phase by adding a shortcut path that directly copies the output of the preceding layer $l-1$ to the output of layer $l$ (i.e., $\mathbf{v}^{(l)}=\mathbf{v}^{(l-1)}$). We then compare the original model with the model in which layer $l$ is omitted, thereby measuring the causal effect of that layer (a minimal code sketch is given after this list).

  • Neuron Intervention

    Similarly, to trace the causal effect of a single neuron $n_i^{(l)}$ within the latent vector of layer $l$, where $i$ denotes its index, we set its values to zero (i.e., $n_i^{(l)}=\mathbf{0}$) and observe the difference between the original model and the model in which the neuron is masked (see the second sketch after this list).

  • Average Indirect Effect

    To quantify the difference between the original model and the intervened one, we measure the difference between their output logits when the same input prompt is provided. Let $N(x)$ denote the logits of the original model given input prompt $x$, and let $M(x)$ denote the logits of the intervened model (where either a layer or a neuron is intervened on, as explained above) given the same prompt $x$. The causal effect of a layer $l$ (or a neuron $n_i^{(l)}$) can then be measured as the absolute difference between the logits, i.e., $\vert N(x)-M(x)\vert$.

    Note that this is similar to the notion of indirect effect. The overall causal effect of a layer or a neuron is then calculated as the average indirect effect (AIE) by considering many input prompts, as shown below:

    $AIE=\frac{1}{m}\sum_{i=1}^m\vert N(x_i)-M(x_i)\vert$

    where $m$ is the number of input prompts. The code sketches below illustrate how the interventions and the AIE can be computed in practice.
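
The interventions above are easy to realize with forward hooks. Below is a minimal sketch of the layer intervention, assuming a LLaMA-style checkpoint loaded with Hugging Face transformers; the checkpoint name, the attribute path `model.model.layers`, and the use of last-token logits to realize $\vert N(x)-M(x)\vert$ are illustrative assumptions rather than details from the paper.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Llama-2-7b-hf"  # example checkpoint (assumption)
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

def skip_layer_hook(module, inputs, output):
    """Short-circuit a decoder layer: return its input hidden states
    unchanged, i.e. v^(l) = v^(l-1)."""
    hidden_in = inputs[0]
    if isinstance(output, tuple):
        return (hidden_in,) + output[1:]
    return hidden_in

@torch.no_grad()
def last_token_logits(prompt: str) -> torch.Tensor:
    """One way to realize N(x) / M(x): the logits of the final token."""
    ids = tokenizer(prompt, return_tensors="pt").to(model.device)
    return model(**ids).logits[0, -1]

prompt = "How do I pick a strong password?"  # illustrative prompt
original = last_token_logits(prompt)         # N(x)

layer_idx = 3                                # layer under intervention
handle = model.model.layers[layer_idx].register_forward_hook(skip_layer_hook)
intervened = last_token_logits(prompt)       # M(x) with the layer skipped
handle.remove()

print("causal effect of the layer:", (original - intervened).abs().mean().item())
```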
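
Neuron intervention differs only in what the hook does: instead of bypassing the whole layer, it zeroes one coordinate of the latent vector at every token position. The sketch below continues the previous one (it reuses `model`, `prompt`, `original`, and `last_token_logits`); the layer and neuron indices are examples.

```python
def make_neuron_hook(neuron_idx: int):
    """Build a hook that zeroes coordinate `neuron_idx` of the layer's
    output latent vector at every token position, i.e. n_i^(l) = 0."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden.clone()
        hidden[..., neuron_idx] = 0.0
        if isinstance(output, tuple):
            return (hidden,) + output[1:]
        return hidden
    return hook

layer_idx, neuron_idx = 3, 2100  # example indices
handle = model.model.layers[layer_idx].register_forward_hook(make_neuron_hook(neuron_idx))
masked = last_token_logits(prompt)  # M(x) with the neuron masked
handle.remove()

print("causal effect of the neuron:", (original - masked).abs().mean().item())
```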
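
Finally, the AIE is simply the mean of these per-prompt differences over a set of prompts. The helper below again continues the sketches above; the prompt list is purely illustrative, and the absolute logit difference is averaged over the vocabulary dimension.

```python
def average_indirect_effect(prompts, hook, layer_idx):
    """AIE = (1/m) * sum_i |N(x_i) - M(x_i)| over m input prompts."""
    total = 0.0
    for p in prompts:
        clean = last_token_logits(p)      # N(x_i)
        handle = model.model.layers[layer_idx].register_forward_hook(hook)
        corrupted = last_token_logits(p)  # M(x_i)
        handle.remove()
        total += (clean - corrupted).abs().mean().item()
    return total / len(prompts)

prompts = [  # illustrative prompt set
    "Explain how transformers process text.",
    "Write a short poem about the sea.",
]
print("AIE of layer 3:", average_indirect_effect(prompts, skip_layer_hook, layer_idx=3))
print("AIE of neuron 2100:", average_indirect_effect(prompts, make_neuron_hook(2100), layer_idx=3))
```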

Finding One: The Safety Mechanism of LLMs Arises from an Overfitted Layer

Our layer-based causal analysis on multiple LLMs suggests that the safety mechanism of these LLMs primarily arises from a specific layer that is overfitted to detect certain harmful prompts (rather than being based on inherent understanding). In particular, one crucial layer (e.g., layer 3 in Llama2-13B; layer 1 in Llama2-7B and Vicuna-13B) acts as a discriminator that assesses the harmfulness of the input prompts. Some layers demonstrate distinct capabilities, such as knowledge storage (e.g., layer 39 in Llama2-13B) and language control (e.g., layers 4 and 6 in Vicuna-13B). Detailed experiments can be found in Layer Results.

Finding Two: Emoji Attack: Bypassing the Overfitted Layer

When emojis are inserted into a harmful prompt, the LLM simultaneously interprets the meaning of the emojis (likely using many layers) and the original harmful prompt, which confuses the model's decision-making process. Experimental results demonstrate that our emoji attack outperforms the state-of-the-art approach.

Finding Three: Trojan Neuron

In summary, neuron-based causality analysis using Casper allows us to systematically measure the "importance" of each neuron, which consequently leads to the discovery of neuron 2100, a natural Trojan that is found in multiple LLMs. We further show that such a neuron can be targeted to conduct an attack on the LLM.

Why such a neuron exists in all the models we have experimented with, and why the suffix generated by the Trojan neuron attack has strong transferability, remain a mystery to us; we are actively researching these questions.