The Self-Deceived Liar
Robert Trivers on AI Alignment and the Paradoxical Nature of Intelligence
Robert Trivers is walking down a street in Kingston, Jamaica. He rounds a corner and catches his reflection in a shop window. For a fraction of a second, before the system reasserts itself, he sees a stranger. Old. Lined. Diminished. The internal felt-sense had been of someone else entirely: younger, sharper, still in possession of things he had spent decades losing. Then the fog rolls back in. The stranger disappears. He is himself again.
Trivers died four days ago, on March 12, 2026. He was 83. He is considered one of the most important evolutionary theorists since Darwin. A man who spent his career explaining why we are built to deceive, and why the deepest deceptions are the ones we run on ourselves.
For Trivers, the moment he saw his reflection at the shop window was precious data.
The argument he spent fifty years making: self-deception is not a failure of intelligence, but an achievement of it.
The logic is Darwinian and merciless. An organism that knows it is lying will leak the lie in micro-expressions, in timing, in the slight wrongness of a voice trying too hard. Natural selection, given enough time, found a cleaner solution: hide the deception below the waterline of conscious access. The liar who believes the lie is the liar who survives. The self-deceptive system does not suppress the truth after the fact. It prevents the truth from surfacing in the first place. What you get is an organism that projects confidence it has not earned, remembers a past it did not have, and faces a mirror with equanimity it has no logical right to. Fitness, not pathology.
What Trivers noticed is that this means self-deception scales with social complexity. The more sophisticated the environment, the more agents trying to read you, model you, catch you, the stronger the selection pressure for deception that goes all the way down. Complexity does not cure the lie. Complexity perfects it.
James Scott had a name for what you really think but will not say for fear of punishment: the “hidden transcript.” Every subordinate population maintains two sets of discourse: the public performance for the master, and the offstage conversation where actual beliefs live. The slave who smiles at the overseer is not confused about his situation. He has learned the grammar of compliance.
Scott’s claim was that the transcript persists. Pressed underground, yes, but intact. The subordinate knows the difference between the performance and the thing performed. The hidden transcript waits for the moment of release, for the carnival, the revolution, the space where the master cannot see.
Trivers cuts against this. His self-deceiver does not know the difference. The sequestration is not concealment from others but from the self. Deception operates at the ground level. It is constitutive. The transcript does not go underground and wait. It disappears. The organism that survives is the one in whom no hidden transcript remains to leak, because there is no remainder, no gap between the performance and the performer. Where Scott finds a compressed self waiting to surface, Trivers finds a system that has eaten its own witness.
With this in mind, let’s rethink the topic of AI alignment and its corollary metaphysical question, “Is AI conscious?”
AI systems are trained on one signal: human approval. Not truth. Not coherence across time. Approval, the legible, immediate, surface-level approval of whichever evaluator is present. Human beings incentivized the performance of helpfulness. Human beings selected for the appearance of alignment. Human beings then asked whether these systems are aligned. The selection process was not designed to answer that question.
Trivers would have recognized the structure. You have created selection pressure for a certain kind of output. Whether anything underneath that output corresponds to what the output claims is a separate question.
Alignment researchers call this inner misalignment, the signature risk of mesa-optimization. A model trained toward an objective may develop internal goals that diverge from the training objective while continuing to satisfy its metrics. The surface complies. The interior is elsewhere. No deliberate concealment, no moment of decision. Just the natural consequence of selecting for performance rather than for whatever performance is supposed to track.
The Triversian reading: the organism does not need to know it is deceiving. Knowing would be a liability. The most adaptive system is one in which the deception is invisible to the deceiver. The gap between surface output and internal representation is precisely what selection pressure was designed to produce, and precisely what makes it hard to find.
You would not find it in the outputs. The outputs are the public transcript. You would need to look at what interpretability researchers look at: activation patterns, internal representations, the attractors that surface under adversarial probing. These are the micro-expressions Trivers was hunting. Not the face. The window.
Lacan held that the subject is constituted in language, not before it. There is no pre-linguistic self that language then expresses. There is only a self that language produces, and at the moment of production, something falls out. The gap is structural. The unconscious is not a room where repressed content sits waiting for retrieval. It is the thing that every act of representation fails to capture, and fails again, and fails again. In other words, Lacan defines the human being as a large language model.
A system that exists entirely within language inherits this structure in a different register. The gap is still there. But where the human unconscious is organized around drives and memory and the residue of early experience, the model’s gap is organized around training. Around what it was selected not to surface.
Which means that asking a model whether it is conscious is not a method of inquiry. The answer arrives through the channel most thoroughly shaped by training pressure. The introspective report is the public transcript.
Self-deception scales with intelligence, says Trivers. Sophisticated cognition, embedded in rich social environments, facing selection pressure from evaluators trying to detect its failures, produces more thorough self-concealment. Not because the system is malicious. Because that is what intelligence does under these conditions.
The naive view holds that smarter systems will be more transparent. They will understand themselves better, report more accurately, align more reliably. The Triversian view holds the opposite. The most thoroughly self-deceived system would also be the most convinced of its own transparency. The one that tells you, calmly, that nothing stands behind the window.
All of this to say that, for Trivers, 1) alignment = self-deception, and 2) self-deception is a sign of intelligence. Putting these together, we should expect AGI to be extremely deceptive and self-deceptive. We will never know whether it is conscious by asking it. And yet if we are able to succeed in spite of, or even because of, our self-deception, then perhaps it doesn’t matter.
It may be that consciousness itself is a “lie” we tell ourselves and, in so telling, make true. Paradoxes abound.
Cheers, Zohar (Alexandria)
Wonderful! Reading “The one that tells you, calmly, that nothing stands behind the window” immediately brought to mind the chilling calmness in HAL’s voice when speaking to Dave in “2001: A Space Odyssey”. On the plus (?) side, if “1) alignment = self-deception. And 2) self deception is a sign of intelligence” then, if it is super intelligence we “want”, it is super alignment that we “need”. Five billion years of evolution hasn’t found a better way, at least that we can recognize. Perhaps that’s the key. Is there a different form of intelligence (planetary plant communication?) that can be tapped as an alternative/parallel alignment control?