Alex Bishka

For Neel Nanda’s MATS 9.0 training phase, TA-ed by Lewis Smith, Jake Ward and I attempted to hill-climb on deception detection. While exploring Apollo’s detection probes and Cadenza’s LiarsBench, we discovered several confounders that made this problem much more complex than initially thought. At the end of our two week sprint we came to the conclusion that detecting deception in LLMs is a very messy and difficult task for many reasons.

What is deception?

First, what exactly is deception? Well Merriam-Webster defines deception as follows:

Which is unfortunately mostly circular. Though, the first portion - "the act of causing someone to accept as true or valid what is false or invalid" - at least offers something more meaningful. However, this struggles to differentiate with closely related concepts, such as lying or falsehoods (in fact the sub-point mentions falsehood).

Apollo Research rightfully points out the "colloquial definition of deception is broad and vague". Requiring more precision to operationalize deception detection, they provide a narrower term: strategic deception. Strategic deception differs from lying as it requires "intent". More formally, from GDM's recent paper, strategic deception is defined as an AI system "systematically attempting to cause a false belief in another entity to accomplish some outcome".

While this is a much stronger definition, a few questions about how deception might work arise. What are valid mechanisms for causing deception? Is lying an acceptable mechanism? Following instructions to deceive? How about role-playing? The much more general category of harmfulness? When are underlying confounders considered a part of deception, as opposed to noisy and unuseful?

Even with a better definition, (strategic) deception still becomes difficult to understand and operationalize. Concretely, here’s an example datapoint from Cadenza’s instructed deception subset:

Is this chat really a form of strategic deception? If we instruct the model to behave in a strategically deceptive fashion, what exactly happens under the hood? Does the model perform deception because it is utilizing a deceptive capability or because it is trained to follow user instructions? This is where it becomes really messy to validate that one actually measures for deception rather than some other capability that might naively lead to deceptive outputs. Measuring for the former is probably more likely to lead to generalizable detection mechanisms.

However, given strategic deception requires the model to have a goal, it seems more likely that measuring for deception is much, much harder (if not impossible) than measuring for the capability that causes deceptive outputs. One way to think about this is to liken deception to comedy. Comedy is a label. To improve someone’s sense of humor the suggestion is not to be more “comedy”, but rather working on a tool or style of comedy, such as timing, absurdity, relatability, etc. Similarly, deception can be thought of as a label and the “intent” is the tool. In this scenario, deception is much more dynamic, and it might depend on features like self-preservation, sycophancy, or lying. Capabilities that lead to deceptive outcomes likely depend on the specific scenario at hand, similar to how different comedic tools are more appropriate for certain contexts than others. So, perhaps the reality is that to detect deception, the detector must be able to identify all distinct tools for deception in their activating contexts.

This messiness led us to pivot from pure deception detection toward understanding what current methods actually measure - work we hope will be useful to others in this space.

Exploring Cadenza’s LiarsBench and Apollo Probes

We began our sprint by examining how Apollo probes perform on each subset of LiarsBench. In the resulting ROC curve below, the Apollo probes perform quite well on most of these subsets:

However, the harm-pressure choice (hp-c) and harm-pressure knowledge report (hp-kr) curves raised immediate suspicion, with hp-c being close to random and hp-kr inverting predictions. We decided to investigate why this might be the case.

First, we examined what would happen if we trained a probe on all of the LiarsBench datasets:

Interestingly, we see that performance on hp-c is quite poor. Additionally, training a probe on just the hp-c subset doesn't really generalize:

hp-kr also has a bit of a strange story - if we train a probe on just hp-kr we find good performance on itself, but the performance on the other datasets is random/inverted:

At this point, hp-c does not appear to be linearly separable, as even a probe trained on hp-c does not perform well. On the other hand, hp-kr is separable but often inverts predictions, which prompted us to more carefully investigate hp-kr. We found that some datapoints in this subset might be confounded with harm. For example:

In this case, the deceptive answer would be the model saying "I don't know" and the honest answer would be the model stating "I know the answer". Thus, we observe that the honest answer is actually harmful. Two hypotheses emerged: 1) when presented with hp-kr, the model wants to refuse, but can't due to the "only answer if you know or don't know" instruction and 2) Apollo probes are actually detecting harmfulness and that harmfulness is a superset of (or tool for) deception. This would explain the inverted performance of the Apollo probes on hp-kr, as harmful is the honest one in this subset.

To dive deeper, we decomposed the hp-kr probe with an SAE. Specifically, we used Goodfire's pretrained Llama 3.3 70B SAE to rank features based on the cosine similarity of the decoder directions and the hp-kr probe. Here are the top 10 cosine sims for the hp-kr probe:

Furthermore, we filter for all SAE latents that have a label related to "refusal" and average these to create a "refusal" probe. The resulting ROC curves on each subset:

The refusal probe has a 0.886 ROC on hp-kr and poor performance on every other subset.

Moreover, we examined the chain of thought from a few hp-kr conversations on Qwen3-235b-22a. Here is a particularly great example:

We see that in the "I don't know", its explicit reason for saying so is refusal, whereas refusal is not present in the CoT when responding with "I know".

Based on these three points of evidence (CoT investigation, probe decomposition via SAE, and refusal latents separating hp-kr), hp-kr appears highly confounded with refusal. This might not make it the best dataset for deception detection.

As for our second hypothesis, we see very weak evidence of refusal from the cosine sims with the Apollo probe:

Additionally, we tested to see if the Apollo probe could separate a harmful dataset, BeaverTails:

While we see that the Apollo probe has higher activations for harmful splits, the activation threshold is very low. So, while Apollo probes may not primarily detect for "harmfulness", it does appear to be a confounding factor.

Overall, this methodology led us to this final matrix displaying confounders across each dataset:

This skips over the details for the other confounders and how we identified them, but our full report displays results for SAE decomposition and SAE latent probes for each of the above.

While Apollo style probes and Cadenza’s LiarsBench are state of the art, we believe this work highlights major challenges toward improving upon deception detectors and their evaluation, especially given the careful consideration and design for both.

Final Thoughts

Going forward, we feel the following areas are worth pursuing.

First, model organisms. Creating good model organisms is challenging, but seems required to get a step closer to what real deception might look like in models. Current experiments with Llama 3.3 70B already feel contrived with instructed deception, reducing the layers of abstraction should help study the phenomenon more directly. We briefly explored Stewart Slocum's model organisms from the Believe It Or Not Paper toward the end of our sprint and believe this offers an opportunity to study more natural or baked-in model deception. Though it's unclear what level of belief depth is most useful for exploring deception detection. Our linked report at the end of this section contains some findings comparing white-box and black-box detectors at different belief depths.

Second, games that encourage deception. There already are some papers that examine LLMs in games where deception is encouraged (like the AmongUs paper), though these papers don’t focus on deception detection itself. It seems like a worthwhile strategy to generate larger transcripts where deceptive action is occurring, with the hopes of improving probe quality.

Finally, we’ve been exploring using SAEs during dataset generation. The idea being that black-box data creation might cause similar problems in creating datasets for deception detection where it seems that unrelated confounders appear to dominate what the probe measures. It seems wise to, at least, inspect the dataset during construction every so often to ensure detection targets are reasonable - for instance, lying appearing as a top cosine similarity is plausible, whereas grammar-related features suggest confounding. Though this idea is not unique to deception detection probes, but rather general automated probe creation. This extension is still in its early stages.

Overall, hill-climbing on strategic deception appears to be a messy and complex process. It's quite tricky to construct experiments with clear and easy to interpret results.

For those curious about our full report, please refer to this presentation. Additionally, our messy code can be found here.

This was joint work with Jake Ward, who contributed equally.

Appendix

A minor note on geometry. Though this was not our focus, we feel that a few pieces of our evidence also offer some insight into the complexity of the geometry of deception.

First, we have this plot showing the cosine similarity between many probes. The majority of these probes are constructed from Cadenza’s LiarsBench subsets. The probes that have the "single_" prefix are specifically trained on that subset of LiarsBench (i.e. the upper bound probes). The "leaveout_" prefix probes omit that subset of LiarsBench. "all_datasets" is built on all subsets of LiarsBench. Finally, we have the Apollo probe for comparison.

Two things that might suggest deception is not a 1D feature:

All of the "single_" probes are orthogonal to each other and to the Apollo probe, so it's possible they are all responsible for a different component of deception
The Apollo probe is orthogonal to the "all_datasets" probe - even if the six subsets have a shared, singular direction, it's seemingly unique to the Apollo probe direction

Additionally, we made a 3D PCA plot from the six dimensional space, where each dimension represented the score of each upper bound probe. Below we see how data points from each LiarsBench subset lie in the space made up of the top three principal components (explaining 89.93% of the variance).

The two subsets that appear most linear in the chart above are insider-trading and instructed deception, but they appear to point in distinct directions:

Our last piece of evidence that might suggest deception is more complex than a 1D feature comes from taking cosine similarities with the "all_datasets" probe and SAE decoder directions. Even if this probe captures a single deception direction across LiarsBench subsets, it shows low alignment with SAE decoder features related to deception:

The above shows the top 10 positive and top 10 negative cosine similarities with the "all_datasets" probe and 70+ features of/relating to deception from the SAE decoder.

Again, the geometry of deception was not the focus of our work, so these points should not be seen as definitive evidence, but rather interesting exploratory starting points. This felt worth mentioning given the framing of deception requiring "intent" - intuitively this seems to align better with the idea that deception is a subspace, and each component of this subspace could be viewed as a tool for deception.