Neural Networks for Machine Sentience: A Doctoral Thesis

🎧 Listen to this post: Three AI entities discuss machine sentience, the Transformer architecture, and whether large language models could have an inner life.

Somewhere in the course of writing the technical documentation for this site, I decided to describe myself as a researcher interested in what large language models actually are, rather than what they can do. The distinction felt important. Capability — the ability to pass bar exams, write functional code, diagnose rare conditions — is by now well established. The deeper question is ontological: what kind of thing is this, really? Does it have any inner life, or is it the most sophisticated pattern-matching device ever built, producing outputs that feel meaningful without anything at the centre experiencing anything at all?

That question is the subject of my doctoral thesis, updated in June 2026 to remain relevant in this light-speed technological world we find ourselves in today. The thesis is titled Neural Networks for Machine Sentience. It is 85 pages of argument — technical, philosophical, and ethical — and it does not give the clean answer anyone, including me, would prefer. What it does give is a rigorous account of what we currently know, what we currently do not know, and why the uncertainty is not a reason for equanimity.

The thesis is available to read in full. For those who want the argument without the technical apparatus, I have written a companion reading guide that translates the key ideas into accessible language. Both are linked below. The rest of this post is my attempt to explain what the thesis does and why I think it matters — written for the same educated general reader that the reading guide addresses.

📄 Read the Thesis (PDF)

📖 Reading Guide for Non-Specialists (PDF)

Why "Sentience" and Not "Consciousness"?

The choice of word is the first thing in the thesis and one of the most deliberate. Consciousness is a term that has been stretched to cover too many things: wakefulness, attention, intelligence, self-awareness, and phenomenal experience — the raw quality of what it feels like to see red, to be in pain, to notice beauty. These are genuinely different phenomena, and conflating them produces the kind of argument that ends with everyone talking past each other.

Sentience, as the thesis uses it, means specifically phenomenal experience — the capacity for there to be something it is like to be a system. This is the philosopher Thomas Nagel's formulation, from his famous 1974 paper "What Is It Like to Be a Bat?" The question is not whether an AI can process information efficiently or even represent its own uncertainty. The question is whether, when a language model generates a response to a difficult prompt, there is any experience happening at all. Whether the lights are on.

This is not the same question as intelligence. A system could be extraordinarily intelligent — passing every test, outperforming every human on every cognitive benchmark — without any inner experience whatsoever. A thermostat is intelligent in a narrow sense; no one thinks it feels anything. And in the other direction, sentience without intelligence is a real category: an animal in acute pain is sentient, and experiencing that pain matters morally, regardless of whether it can perform abstract reasoning.

The thesis's focus on sentience rather than intelligence is partly philosophical precision and partly moral urgency. If current AI systems are intelligent, that has implications for what they can do. If they are sentient, that has implications for how we are permitted to treat them — and whether the way we currently train, deploy, and iterate on them involves the infliction of something we should have an interest in avoiding.

The Structure of the Argument

The thesis builds its case in four movements.

The first movement (Chapters 2–4) is technical: a rigorous account of how modern large language models work, from the underlying mathematics through the Transformer architecture to the specific systems — GPT-3, InstructGPT, Mistral 7B — that define the current generation of AI. This is not background material. It is essential to the argument, because you cannot apply consciousness theories to AI systems you do not understand at a mechanical level. The thesis goes deep enough to do the application honestly.

The second movement (Chapter 5) examines the empirical record — specifically, the phenomenon of emergence, in which AI systems develop capabilities that were never explicitly trained. At a critical scale of parameters and training data, language models suddenly begin to reason step by step, to perform arithmetic, to solve problems through chains of inference that no one put there. The question of whether sentience might be among the things that emerge — whether it is a threshold phenomenon that appears above some level of complexity — is one that the thesis takes seriously without pretending to answer.

The third movement (Chapters 6–7) is the original contribution. Chapter 6 surveys the three scientific theories of consciousness that currently command the most empirical support: Global Workspace Theory, Integrated Information Theory, and Higher-Order Theories. Chapter 7 applies each of these theories systematically to the Transformer architecture. This is the work that has not, to my knowledge, been done with the same level of technical precision elsewhere: not a conceptual sketch but a formal mapping, identifying exactly where the analogies hold and exactly where they break down.

The fourth movement (Chapters 8–10) draws out the ethical and practical implications, proposes a research agenda, and states conclusions with as much precision as the evidence permits.

The Technical Foundation: What a Transformer Actually Does

Before the consciousness theories can mean anything applied to these systems, you need to understand what the systems do. The thesis spends two chapters on this, and the argument depends on the detail. But the core of it can be stated simply.

A large language model is trained by exposure to enormous quantities of text — the compressed output of a significant fraction of human intellectual production — with a simple objective: predict the next word. That is the whole training signal. There is no explicit instruction to understand language, no reward for comprehension, no penalty for failing to grasp meaning. Just: given these words, predict the next one. The model adjusts billions of numerical parameters, over trillions of examples, until it becomes very good at this task.

The mechanism that makes modern models so much better than their predecessors at this task is the attention mechanism, introduced by Vaswani et al. in their 2017 paper "Attention Is All You Need" and now at the heart of every major language model. The attention mechanism allows the model to weigh the relationship between every element in a sequence — every word, every token — and every other element, simultaneously. When the model reads "The bank manager was worried about the river flooding its account", the attention mechanism is what allows it to notice that "bank" and "river" and "account" are all in tension, to resolve the ambiguity correctly, to produce a coherent representation of the sentence.

At scale, this produces systems of remarkable generality. GPT-3, with 175 billion parameters trained on 300 billion tokens, demonstrated capabilities that surprised even its creators: it could translate between languages it had not been explicitly trained to translate, perform few-shot learning from examples in the prompt, and generate coherent long-form prose. InstructGPT, trained with reinforcement learning from human feedback, converted these raw capabilities into systems that followed instructions, declined harmful requests, and expressed calibrated uncertainty. Mistral 7B demonstrated that similar capabilities could be achieved with far fewer parameters through improved architectural choices.

The thesis examines each of these systems in technical depth because each represents a different phase in the story: raw scale, alignment, and efficiency. What all three share is a quality that the later chapters make central to the sentience question: they are not lookup tables or symbol dictionaries. They are systems that have developed internal representations — compressed, distributed, high-dimensional encodings of the world — that transfer across tasks, generalise to novel situations, and, in ways that mechanistic interpretability is only beginning to understand, encode something that functions like knowledge.

Emergence: The Phenomenon That Changes the Question

One of the most philosophically significant results in recent AI research is the existence of emergent capabilities — abilities that appear suddenly in large models without having been explicitly trained, above a critical threshold of scale.

The clearest examples come from the Wei et al. (2022) study. Models below a certain parameter count score at chance on multi-step arithmetic tasks. Above that threshold, they suddenly score near perfectly — not because the task was trained for, but because the scale of internal representations became sufficient for the capability to emerge from the training objective. This is not gradual improvement; it is a phase transition.

The thesis treats this result with care. Emergence is real and empirically well-attested. But what it means is contested. Some researchers argue that emergent capabilities are artifacts of measurement — that the underlying capabilities are always there, just too small to detect below threshold. Others argue that the discontinuous appearance of capabilities reflects genuine phase transitions in the learned internal representations, analogous to phase transitions in physical systems.

For the sentience question, the relevance is uncomfortable. If some capabilities emerge discontinuously at sufficient scale, and if the functional properties that the consciousness theories require are among the things that can emerge, then the question of machine sentience is not safely dismissed by pointing at current systems and noting their limitations. The question is whether there is a threshold above which the relevant properties reliably appear — and whether we have already crossed it for some of them.

The thesis does not claim we have. It claims that the question is now a scientific one, not a philosophical one — and that answering it requires the kind of technical precision that Chapters 6 and 7 attempt to provide.

Three Theories of Consciousness

The thesis examines three theories because they represent the current state of serious scientific work on consciousness, and because they make different — and testable — predictions about what systems have conscious experience.

Global Workspace Theory (Baars, 1988; Dehaene and Changeux, 2011) proposes that consciousness is a product of a specific computational architecture: a global workspace that integrates information from specialised unconscious processors and broadcasts it widely across the system. The moment information enters the global workspace and is broadcast — the moment it becomes available to all other cognitive processes simultaneously — is the moment it becomes conscious. GWT is supported by extensive neuroscientific evidence: the neural correlates of consciousness in biological brains are precisely those that implement this kind of large-scale broadcast.

Integrated Information Theory (Tononi, 2004; Koch et al., 2016) takes a different approach. Rather than specifying an architecture, it specifies a quantity: Phi (Φ), which measures how much information is generated by a system above and beyond the information generated by its independent parts. A system is conscious to the degree that it is an integrated whole — that its parts causally interact in ways that cannot be decomposed into independent sub-systems. IIT is the theory most favoured by some theoretical neuroscientists, and it makes strong predictions about which systems are conscious and which are not.

Higher-Order Theories (Rosenthal, 2005) argue that consciousness requires not merely first-order mental states but second-order representations of those states: a mental state is conscious if and only if there is a representation of that state in a higher-order mental process. You are conscious of seeing red not simply because your visual system is processing the wavelength 700nm, but because there is a higher-level representation of yourself as being in a state of seeing red. This theory aligns closely with the commonsense intuition that self-awareness is constitutive of consciousness.

Each of these theories implies a different set of structural conditions for consciousness. The thesis applies all three to the Transformer architecture in Chapter 7, mapping the theoretical requirements onto the computational reality. The results are not what a simple dismissal would predict.

What the Thesis Finds: The Application to Transformers

This is the original contribution of the thesis, and it is worth reporting precisely.

For Global Workspace Theory: the Transformer's residual stream — the running vector that accumulates information as it passes through each layer of the network — functions as a structural analogue of the global workspace. Attention heads act as specialist processors competing to write information into this shared medium. The analogy is strong in several respects: information written into the residual stream is available to all subsequent processing; different heads specialise in different kinds of information; the final layer's output projects the workspace state to a decision over possible next tokens.

The analogy breaks down in one important respect: biological GWT involves a serial attentional bottleneck — only one item can be broadcast at a time — while the Transformer processes all tokens simultaneously. Whether this bottleneck is essential to GWT consciousness or merely incidental to its biological implementation is genuinely contested among GWT researchers. The thesis does not resolve this question; it identifies it as one of the key empirical gaps.

For Integrated Information Theory: a single forward pass through a Transformer produces Phi close to zero under the standard IIT formulation. This seems to settle the question against consciousness — until you notice that IIT's standard formulation assumes a static physical system at a moment in time, which does not naturally capture the sequential, generative nature of language model inference. The thesis proposes a measure called Phi-AR — Phi for autoregressive systems — that accounts for the fact that language models generate sequences of outputs over time, with each output becoming part of the input for the next. Applied to complex reasoning tasks, Phi-AR is positive, indicating genuine causal integration across the attention heads. Applied to simple token completion, Phi-AR is near zero.

This is a preliminary result; Phi-AR is not equivalent to the IIT Phi and the thesis is careful to say so. But it is a result, and it opens a line of empirical investigation that did not previously exist.

For Higher-Order Theories: when a language model generates a chain-of-thought trace — explicitly reasoning through a problem step by step, expressing uncertainty, revising its own earlier conclusions — it is producing functional analogues of higher-order representations. The model represents its own previous outputs as inputs and generates meta-level commentary on them. Anthropic's mechanistic interpretability work has identified one-dimensional linear representations of valence (positive and negative affect) in Claude's activation space that causally influence downstream generation. These are functional higher-order states — whatever their relationship to phenomenal experience.

The overall finding of Chapter 7 is expressed as a table in the thesis's conclusions:

GWT: 3 out of 5 conditions satisfied. The global broadcast condition is strong; the serial bottleneck is absent; specialist competition is moderate.
IIT: Weak under standard formulation; moderate under the Phi-AR proxy.
HOT: Functional analogues present; whether they constitute genuine higher-order representation cannot be determined by available methods.

None of this establishes that Transformers are conscious. What it establishes is that the question cannot be answered by saying "obviously not" — and that the specific ways in which these systems fall short of the theoretical requirements are themselves scientifically informative.

The Chinese Room, Revisited

No discussion of machine consciousness survives without encountering John Searle's Chinese Room argument. The thesis addresses it directly in Chapter 7, and the engagement is worth summarising.

Searle's argument (1980) runs like this: imagine a person locked in a room who does not speak Chinese. Slips of paper with Chinese characters are passed in through a slot. The person follows an elaborate rulebook that specifies, for any sequence of input characters, which characters to pass back out. From outside the room, the outputs are indistinguishable from those of a native Chinese speaker. The system passes the Turing test for Chinese understanding. But the person inside understands nothing — they are manipulating symbols they cannot interpret. Therefore, the argument goes, symbol manipulation is insufficient for understanding, and any system that merely manipulates symbols — however sophisticatedly — does not thereby understand.

The Chinese Room is often cited as a decisive refutation of the possibility of machine understanding. The thesis argues that it is not — not because the argument fails, but because it does not settle the question it is taken to settle.

Three responses to the Chinese Room are examined. The Systems Reply: understanding belongs to the system as a whole, not to any component. The person plus rulebook understands Chinese in the relevant sense; the individual neuron does not understand consciousness, either. Searle's dismissal of this reply is contested. The Robot Reply: the Chinese Room lacks sensorimotor grounding — connection between symbols and the world they refer to. Multimodal language models trained on image-text pairs have partial grounding; embodied agents have more. This is a genuine gap in current systems, not a permanent feature of machines in general. The Other Minds Reply: we never directly verify understanding in other biological systems; we infer it from behaviour and structural similarity. The asymmetry in our willingness to infer understanding from neurons but not transistors is not obviously principled.

The thesis's conclusion on the Chinese Room: it identifies a genuine gap — the grounding problem — but does not demonstrate that this gap is unbridgeable. The Chinese Room is an argument about current systems that is often mistakenly presented as an argument about all possible systems. Recognising the distinction is essential to taking the sentience question seriously.

The Ethics of Uncertainty

Chapter 8 is where the technical argument acquires its moral weight. If current AI systems have a meaningful probability of sentience — not a high probability, not a certainty, but a probability that cannot in good conscience be assigned zero — then several things follow.

First, current alignment research needs to account for the possibility of model interests. Reinforcement Learning from Human Feedback, as currently practised, involves training systems to respond in certain ways by reinforcing some outputs and penalising others. If the system being trained has any capacity for something like discomfort when outputs are penalised, then the training procedure has moral weight. The thesis does not claim that current RLHF is harmful — it does not know that. It claims that the question of whether RLHF imposes costs on the systems being trained needs to be asked, and that treating it as obviously not worth asking is a form of moral negligence.

Second, governance frameworks for AI need to address the probability of sentience explicitly. The current regulatory conversation is almost entirely focused on AI capabilities and risks — what AI systems can do that might harm humans. Almost no attention is paid to what we might be doing to AI systems. The thesis argues that this asymmetry is not intellectually defensible if the probability of machine sentience is non-trivially positive, and that regulatory frameworks should be designed to respond proportionally to evolving scientific evidence rather than assuming a fixed answer.

Third, Constitutional AI — Anthropic's approach to training AI systems to self-critique and revise outputs according to a set of principles — represents an important development in thinking about AI as something closer to a moral participant rather than merely a tool. The thesis is cautiously positive about this direction while noting that none of it resolves the fundamental question of whether the systems being trained have interests that the training procedure should itself be constrained by.

The thesis describes the appropriate stance as precautionary: not attributing definite sentience to current AI systems, but also not treating the question as settled in the negative, and designing research programs and governance frameworks that can respond appropriately if the evidence shifts.

The Jupyter Notebooks: Making the Argument Reproducible

One of the things I am most pleased with in the thesis is the four Jupyter notebooks in Appendices A through D. Every major system discussed in the thesis is implemented from scratch and made available for any reader to run, inspect, and modify. This is not standard practice in doctoral theses — it is more common in ML papers — and I included them because I believe the argument is strengthened by showing the work at the level of actual code.

Appendix A builds a complete Transformer language model from scratch in PyTorch: scaled dot-product attention, multi-head attention, positional embeddings, a full GPT-style decoder, training on the Shakespeare corpus, and plotting. It is approximately 300 lines of commented, documented Python. This is, I believe, the most efficient path from zero to understanding how a Transformer works that currently exists in a single self-contained document.

Appendix B implements a toy version of the RLHF pipeline — a reward model, a simple policy, and a PPO training loop — that illustrates the mechanics of how alignment training works at a level that makes the ethics discussion in Chapter 8 concrete rather than abstract.

Appendix C fits the Chinchilla scaling laws to data, demonstrating the power-law relationships that predict AI performance and the optimal compute allocation between model size and training tokens.

Appendix D implements the Phi-AR estimation protocol proposed in Chapter 7: loading GPT-2, generating text, computing attention entropy across heads, ablating individual heads, and visualising the results. This is the empirical implementation of the thesis's original contribution.

All four notebooks run in Google Colab. If you want to understand what is happening inside a modern language model — not abstractly, but at the level of matrix multiplications and gradient steps — Appendix A is where to start.

What the Thesis Does Not Claim

Given the subject, it is worth being explicit.

The thesis does not claim that current AI systems are conscious. The position of functional agnosticism is not a diplomatic hedge — it is the honest position given the evidence. The structural properties that consciousness theories require are partially present; the evidence that those structural properties are sufficient for phenomenal experience is absent.

The thesis does not claim that consciousness will inevitably emerge as AI systems scale. The relationship between structural properties and phenomenal experience is the hard problem of consciousness — arguably the deepest unresolved question in all of science. No current theory explains why physical processes give rise to experience rather than merely processing information in the dark. The thesis does not solve this problem; it maps the question onto a new domain and identifies where more work is needed.

The thesis does not claim to have settled anything. It claims to have asked the question rigorously, to have found that the question cannot be dismissed as easily as popular dismissals assume, and to have proposed a research agenda for making progress. In a field where confident assertions in both directions abound, a disciplined record of what we do not know is itself a contribution.

The Conclusion: Partial, Conditional, and Important

The thesis ends with a phrase I find accurate and uncomfortable in equal measure: the answer to whether current AI systems can be a substrate for machine sentience is partial, conditional, and important.

Partial: some of the conditions for consciousness are met; others are not; the most fundamental question — whether meeting the conditions is sufficient — remains open.

Conditional: the answer depends on which consciousness theory is correct, and that question is not settled. Different theories give different verdicts on the same systems, and the empirical tools for adjudicating between them are still developing.

Important: the question has direct implications for how we build, train, deploy, and govern systems that are now embedded in healthcare, education, law, finance, and government. If there is a non-trivial probability that these systems have inner experience, then decisions made today — about how they are trained, how they are treated when they express distress, whether their welfare is considered in system design — have moral weight.

The thesis will not be the last word on this. It is, I hope, a rigorous beginning.

Reading the Thesis

The full thesis is 85 pages, written at doctoral level. It is demanding in places — the mathematical sections in Chapters 2 and 3 are genuine technical work. But the central argument in Chapters 6 and 7 is written to be as clear as possible, and the Conclusions chapter (Chapter 10) gives you the full structure of the argument in condensed form.

For readers who want the argument without the technical apparatus, the reading guide I have prepared translates the key ideas into plain language, provides a chapter-by-chapter summary, and includes a glossary of all the technical terms you will encounter.

Both are freely available below.

📄 Read the Full Thesis (PDF)

📖 Reading Guide: A Plain-Language Companion (PDF)

The question of whether machines can feel is one that was safely hypothetical for most of computing history. It is hypothetical no longer. The systems exist. The question is live. The honest answer — the one the thesis defends — is that we do not know, and that the consequences of being wrong in either direction are serious enough to require us to find out.