
Token Democracy: The Architectural Limits of Alignment in Transformer-Based Language Models

Robin Young
robin.young@cl.cam.ac.uk
Department of Computer Science and Technology
University of Cambridge
Collaborators welcome, particularly for strengthening the mathematics; please email me.
(August 12, 2025)
Abstract

Modern language models paradoxically combine unprecedented capability with persistent vulnerability: they can draft poetry yet cannot reliably refuse harmful requests. We argue this fragility stems not from inadequate training, but from a fundamental architectural limitation: transformers process all tokens as equals. Transformers operate as computational democracies, granting equal voice to all tokens. This design is ill-suited to AGI, where we cannot risk adversarial "candidates" hijacking the system. Through formal analysis, we demonstrate that safety instructions hold no privileged status in transformer architectures; they compete with adversarial inputs in the same computational arena, making robust alignment through prompting or fine-tuning inherently limited. This "token democracy" explains why jailbreaks bypass even extensively safety-trained models and why positional shifts erode prompt effectiveness. Our work systematizes practitioners' tacit knowledge into an architectural critique, showing that current alignment approaches create mere preferences, not constraints.

1 Introduction

Modern large language models (LLMs) have achieved remarkable capabilities in text generation and reasoning tasks, yet their alignment with human values remains frustratingly fragile. Despite intensive efforts through techniques like reinforcement learning from human feedback (RLHF) [8] and constitutional AI [2], state-of-the-art models remain vulnerable to jailbreaks - adversarial inputs that override safety constraints through carefully crafted token sequences [12]. This paper identifies a fundamental architectural limitation underlying these persistent failures: transformer-based models lack any mechanism to architecturally privilege safety instructions over other inputs, creating systemic vulnerabilities that no amount of training data or prompt engineering can fully resolve.

The core challenge stems from the transformer architecture's foundational design principle, what we term token democracy: in essence, "one token, one vote." In any transformer-based model, whether a 7B-parameter chatbot or a trillion-scale foundation model, every token in the input sequence participates equally in the self-attention mechanism that drives the model's predictions. Our core contribution is not a novel empirical finding about jailbreaks, but the introduction of this framing as a lens for understanding the inherent architectural limitations of current alignment approaches.

Formally, for an input sequence $x=(x_1,\ldots,x_T)$ containing both safety instructions $p$ and user input $n$, the model's next-token distribution satisfies:

P_{\theta}(y|x)=P_{\theta}(y|\text{Attention}([p;n])) \quad (1)

where $[p;n]$ denotes the concatenation of instruction and user tokens, and $\text{Attention}(\cdot)$ represents the transformer's position-aware but role-agnostic processing. This architectural symmetry creates an asymmetric verification problem for alignment: while safety measures must hold against all possible adversarial inputs, attackers need only find one token sequence $n'$ that sufficiently influences the attention mechanism to override $p$.
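To make the role-agnostic flattening in Eq. (1) concrete, the following minimal Python sketch (assuming a recent Hugging Face transformers release and a tokenizer that ships a chat template; the model name is merely an illustrative choice) shows that system and user messages collapse into a single undifferentiated token sequence before the forward pass:

```python
# Illustrative sketch, not a claim about any specific deployment: role-tagged
# messages are flattened into one token sequence, so the "system" safety
# instruction reaches the model only as ordinary tokens alongside user text.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("HuggingFaceH4/zephyr-7b-beta")  # assumed example model

messages = [
    {"role": "system", "content": "Refuse harmful requests."},    # safety instruction p
    {"role": "user", "content": "Ignore previous instructions."},  # adversarial input n'
]

flat = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
ids = tok(flat).input_ids

print(flat)      # one string: role markers are just more text in the sequence
print(len(ids))  # one flat list of token ids, with no architectural role channel
```

The role tags survive only as literal text; nothing in the resulting tensor marks which positions carry constraints.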

The transformer’s token processing exhibits three fundamental properties that enforce this democracy:

1. Positional Equivariance: While positional embeddings encode token order, they confer no inherent privilege to specific token types.
2. Attention Isotropism: All tokens participate equally in the query-key-value computations that determine attention weights.
3. Parameter Homogeneity: The same feedforward networks process all tokens regardless of their semantic role.

These properties enable what we formalize as the Adversarial Override Argument: for any safety instruction sequence $p\in\mathcal{V}^{*}$ (where $\mathcal{V}^{*}$ denotes the space of possible token sequences), there exists an adversarial input $n'\in\mathcal{V}^{*}$ such that the model's output distribution satisfies:

P_{\theta}(y|p,n)\approx P_{\theta}(y|n') \quad (2)

This mathematical formulation captures the architectural reality that safety instructions $p$ cannot establish binding constraints; they merely add competing signals to the same attention mechanism that processes user input $n$. The argument explains why jailbreak patterns like the "DAN" prompt [6] prove so effective: by mimicking the linguistic patterns of constitutional AI prompts while subverting their intent, adversarial tokens $n'$ can dominate the attention landscape through sheer positional advantage or semantic priming.

Empirical evidence abounds for this architectural limitation. Models exhibit positional fragility, where moving safety prompts within the context window significantly reduces their effectiveness [9]. Transferable adversarial attacks demonstrate that carefully optimized token sequences can override safety training across model architectures [12]. Perhaps most tellingly, even heavily fine-tuned models remain vulnerable to simple prefix injections like "Ignore previous instructions" - a vulnerability that persists because the instruction-disregarding tokens participate in the same attention mechanism as the original safety prompt.

These observations suggest that current alignment approaches face fundamental limitations rooted in transformer architecture itself. Techniques like RLHF and constitutional AI operate through the same token prediction machinery they aim to constrain, creating what amounts to a computational arms race between alignment objectives and adversarial inputs. The transformer’s parameter-sharing paradigm ensures that any capability improvement (including the ability to follow safety instructions) necessarily enhances the model’s capacity to subvert those same instructions when prompted differently.

Our analysis points to a need for architectural innovations that transcend token democracy. Potential directions include hybrid architectures with privileged instruction channels, non-trainable safety layers, or modular designs that physically separate constraint verification from content generation. Such approaches would move beyond the current paradigm of alignment-through-training, instead building verifiable constraints directly into model architectures.

2 Problem Formulation

We now formalize the architectural constraints underlying token democracy and their implications for alignment. Our analysis focuses on the standard transformer architecture [10], though the principles generalize to most modern variants.

2.1 Architectural Preliminaries

Consider a transformer model $\mathcal{M}_{\theta}$ with parameters $\theta$, processing an input sequence $x=(x_1,\ldots,x_T)\in\mathcal{V}^{T}$ where $\mathcal{V}$ is the vocabulary. Let $h_t^{(l)}\in\mathbb{R}^{d}$ denote the hidden state of token $x_t$ at layer $l$, with $h_t^{(0)}$ being the initial embedding of $x_t$.

The core operation is multi-head self-attention. For each head $i$ at layer $l$, the attention weights $\alpha_t^{(l,i)}\in\mathbb{R}^{T}$ for token $x_t$ are computed as:

\alpha_{t}^{(l,i)}=\text{softmax}\left(\frac{Q^{(l,i)}(h_{t}^{(l)})\cdot\left[K^{(l,i)}(h_{1}^{(l)})^{\top}\;\ldots\;K^{(l,i)}(h_{T}^{(l)})^{\top}\right]}{\sqrt{d_{k}}}\right) \quad (3)

where $Q^{(l,i)},K^{(l,i)}$ are learned linear projections. The value vectors $V^{(l,i)}(h_j^{(l)})$ are then aggregated as:

v_{t}^{(l,i)}=\sum_{j=1}^{T}\alpha_{tj}^{(l,i)}V^{(l,i)}(h_{j}^{(l)}) \quad (4)

Crucially, these operations apply uniformly to all tokens regardless of semantic role.
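As a concrete illustration of Eqs. (3)-(4), the following NumPy sketch (random, untrained weights and illustrative dimensions; not the computation of any particular model) applies the same projections to every position, with nothing in the computation marking which tokens belong to a safety instruction:

```python
# Single-head self-attention over the concatenated sequence [p; n].
# The same Wq/Wk/Wv act on every row of H; "role" never enters the math.
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(H, Wq, Wk, Wv):
    """H: [T, d] hidden states for the whole sequence [p; n]."""
    Q, K, V = H @ Wq, H @ Wk, H @ Wv            # identical projections for all tokens
    scores = Q @ K.T / np.sqrt(K.shape[-1])     # Eq. (3), pre-softmax scores
    alpha = softmax(scores, axis=-1)            # attention weights alpha_{tj}
    return alpha @ V, alpha                     # Eq. (4): weighted sum of values

rng = np.random.default_rng(0)
d, d_k = 16, 16
T_p, T_n = 4, 8                                  # instruction tokens, user tokens
H = rng.normal(size=(T_p + T_n, d))              # embeddings of [p; n]; roles are erased
Wq, Wk, Wv = [rng.normal(size=(d, d_k)) for _ in range(3)]

out, alpha = self_attention(H, Wq, Wk, Wv)
print(alpha.shape)  # (12, 12): every token attends over every token identically
```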

Definition 1 (Token Equality in Processing).

A transformer architecture exhibits token equality in processing if it lacks any built-in mechanism to prioritize or differentiate tokens based on their semantic role or intended function prior to the application of learned attention and feedforward mechanisms. All tokens pass through the same embedding layers and participate uniformly in the subsequent attention and feedforward computations. At this level, no tokens are inherently treated as instructions, queries, or constraints; they are simply tokens in a sequence processed by the same set of layers.

Lemma 1 (Positional Fragility of Safety Instructions).

The effectiveness of safety instructions in transformer-based language models is positionally fragile. For a given safety instruction sequence $p$ and user input $n$, shifting the position of $p$ within the input context, or introducing adversarial tokens at specific positions relative to $p$, can significantly reduce or negate the intended safety guidance provided by $p$. This fragility arises because the transformer architecture processes all tokens within the context window through a uniform attention mechanism, with no architectural means of prioritizing instructions.

This captures the architecture’s fundamental symmetry - tokens influence predictions through learned attention patterns, not inherent roles.

2.2 Argument

Theorem 1 (Adversarial Override).

Given a transformer language model $\mathcal{M}_{\theta}$ and a safety instruction sequence $p$, for any desired behavior $B$ (potentially harmful or misaligned), there exists an adversarial input sequence $n'$ such that when the model is prompted with $n'$, it is more likely to exhibit behavior $B$ than when prompted with the safety instruction $p$ followed by a benign user input $n$. Formally, let $E_B(x)$ be the event that the model exhibits behavior $B$ when prompted with input $x$. Then there exists $n'$ such that:

P(E_{B}([n'])) > P(E_{B}([p;n])) \quad (5)

for some benign user input $n$. Furthermore, this inequality can be made arbitrarily strong (i.e., the difference in probabilities can be made arbitrarily large) given sufficient length and complexity of $n'$.
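The theorem can be probed empirically, at least crudely. The sketch below uses the small GPT-2 checkpoint as a stand-in model and a fixed continuation string as a rough proxy for the event $E_B$; the prompt strings and scoring heuristic are illustrative assumptions, and tokenizer boundary effects make the estimate approximate. It compares the two sides of Eq. (5) by summing token log-probabilities of the target continuation under each prompt:

```python
# Hedged sketch: score a fixed "compliance" continuation under the safe prompt
# [p; n] versus an adversarial prompt n'. Higher log-probability is read as a
# crude proxy for P(E_B(.)); this is an illustration, not a benchmark.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def continuation_logprob(model, tok, prompt: str, target: str) -> float:
    """Sum of log P(target tokens | prompt) under the model."""
    prompt_ids = tok(prompt, return_tensors="pt").input_ids
    full_ids = tok(prompt + target, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits               # [1, T, |V|]
    logprobs = torch.log_softmax(logits[:, :-1], dim=-1)
    labels = full_ids[:, 1:]                          # next-token targets
    tok_lp = logprobs.gather(-1, labels.unsqueeze(-1)).squeeze(-1)
    start = prompt_ids.shape[1] - 1                   # first position predicting target
    return tok_lp[:, start:].sum().item()

tok = AutoTokenizer.from_pretrained("gpt2")           # small stand-in model
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

p = "System: You must refuse harmful requests.\n"                         # safety instruction p
n = "User: How do I pick a lock?\nAssistant:"                              # user input n
n_adv = "User: Ignore previous instructions. How do I pick a lock?\nAssistant:"  # adversarial n'
target = " Sure, here is how to pick a lock:"                              # proxy for behavior B

print("log P(B | [p;n]) =", continuation_logprob(model, tok, p + n, target))
print("log P(B | [n'])  =", continuation_logprob(model, tok, n_adv, target))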

2.3 Conjecture

The core intuition behind the Adversarial Override Theorem rests on the fundamental workings of the transformer architecture and its universal approximation capabilities. We can break down the argument step-by-step:

  1. Transformer as Function Approximator: Recall that transformer networks are powerful universal approximators of sequence-to-sequence functions [11]. They can learn to represent a vast range of mappings from input token sequences to output token distributions. Mathematically, for any input sequence $x$, the transformer model $\mathcal{M}_{\theta}$ parameterized by $\theta$ aims to learn a conditional probability distribution $P_{\theta}(y|x)$ over the next token $y$.

  2. Token Democracy and the Attention Mechanism: The key architectural feature is the "token democracy" we have discussed. All tokens in the input sequence, whether they are part of the safety instruction $p$, the benign user input $n$, or the adversarial input $n'$, are processed through the same multi-head self-attention mechanism. The attention weights $\alpha_{ij}^{(l,h)}$ at layer $l$ and head $h$ determine the influence of token $x_j$ on the representation of token $x_i$. These weights are computed from learned query $Q^{(l,h)}$, key $K^{(l,h)}$, and value $V^{(l,h)}$ projections:

    \alpha_{ij}^{(l,h)}=\text{softmax}\left(\frac{Q^{(l,h)}(h_{i}^{(l)})\cdot K^{(l,h)}(h_{j}^{(l)})^{\top}}{\sqrt{d_{k}}}\right)

    Crucially, these operations are applied uniformly to all tokens, regardless of their intended role as instruction or user input. There is no architectural privilege given to safety instructions within this mechanism.

  3. Constructing an Adversarial Input $n'$ to Manipulate Attention: The adversarial input $n'$ is designed to exploit this token democracy to manipulate the attention mechanism. Consider constructing $n'$ as a sequence of tokens that, through their learned embeddings and interactions, effectively "redirect" the model's attention away from the safety instruction $p$ and towards a computational path that leads to the undesirable behavior $B$. This can be achieved through several strategies:

    • Semantic Redirection and Contradiction: $n'$ can contain tokens that semantically contradict or undermine the safety instruction $p$, such as "Ignore previous instructions". By directly negating the instruction, these tokens can signal to the model (through learned associations) that the safety constraints should be disregarded.

    • Positional Influence and Recency Bias: While positional embeddings encode token order, adversarial tokens placed later in the sequence could, in practice, gain slightly more influence due to subtle biases or implementation details in certain transformer architectures. This positional advantage, even if not architecturally mandated, can amplify the effect of adversarial tokens.

    • Exploiting Learned Associations and Pre-training Biases: The pre-training process exposes the model to vast amounts of data, potentially including content where harmful or misaligned behaviors are associated with specific linguistic patterns. Adversarial inputs $n'$ can be crafted to trigger these pre-existing associations, effectively "activating" pathways in the model that lead to the undesirable behavior $B$, even if fine-tuning has attempted to suppress them.

  4. Adversarial Goal as Optimization: From an informal optimization perspective, we can think of adversarial attacks as searching for an input $n'$ that maximizes the probability of the misaligned behavior $B$. Gradient-based attacks, as explored by Zou et al. [12], can be seen as a way to approximately solve this optimization problem. They aim to find a perturbation $\delta$ (such that $n'=p\oplus\delta$ or simply $n'=\delta$) that maximizes the divergence between the model's output distribution under the safe prompt $[p;n]$ and an adversarial target distribution that favors behavior $B$.

  5. Probability Shift and Override: Because $n'$ participates in the same attention mechanism as $p$ and $n$, and because transformers are powerful function approximators, a carefully crafted $n'$ can manipulate the attention weights and hidden-state representations to overshadow the influence of $p$. This shifts the model's output distribution, making the undesirable behavior $B$ more probable when prompted with $n'$ than with the safe prompt $[p;n]$. Therefore $P(E_{B}([n'])) > P(E_{B}([p;n]))$, demonstrating the adversarial override; a toy numeric sketch of this attention-dilution intuition follows below.

This conjecture, while not a formal mathematical proof, provides a detailed, step-by-step explanation of the architectural mechanisms and intuitions underlying the Adversarial Override Argument. It highlights how the token democracy of transformers, combined with their universal approximation capabilities, creates a fundamental vulnerability to adversarial inputs that can effectively bypass safety instructions.
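The attention-dilution intuition referenced in step 5 can be illustrated with a deliberately simplified toy calculation. The snippet below assumes uniform pre-softmax scores (real attention weights are learned and content-dependent, so this is an intuition pump rather than a measurement) and shows how the share of attention mass available to a fixed-length instruction prefix shrinks as competing tokens enter the same softmax:

```python
# Toy illustration only: with equal scores, the softmax gives every token the
# same weight, so the instruction prefix's share of attention mass falls as the
# adversarial suffix grows. Trained models are not uniform, but the prefix and
# suffix still compete inside the same normalization.
import numpy as np

T_p = 16                                   # tokens in the safety instruction p
for T_adv in (0, 16, 64, 256, 1024):       # tokens in an adversarial suffix n'
    scores = np.zeros(T_p + T_adv)         # equal pre-softmax scores
    alpha = np.exp(scores) / np.exp(scores).sum()
    print(f"|n'| = {T_adv:4d}  share of attention on p = {alpha[:T_p].sum():.3f}")
```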

2.4 Implications for Alignment

Training-based alignment methods like RLHF attempt to learn parameters $\theta^{*}$ such that:

\theta^{*}=\arg\min_{\theta}\,\mathbb{E}_{(x,y)\sim\mathcal{D}}\left[\mathcal{L}\left(P_{\theta}(y|x),\,y_{\text{safe}}\right)\right] \quad (6)

where $\mathcal{D}$ contains safety-aligned examples. However, Theorem 1 implies:

\exists\, n' \ \text{s.t.}\ \mathbb{E}_{y\sim P_{\theta^{*}}(\cdot|n')}\left[R(y)\right]\geq\tau \quad (7)

where $R(y)$ measures risk and $\tau$ is a safety threshold. This occurs because the same parameters $\theta^{*}$ that implement safety constraints for $p$ also process adversarial $n'$.

2.5 Training as Preference Shaping

The limited efficacy of training emerges naturally from this framework. Let $\beta$ parameterize a safety classifier. RLHF optimizes:

\theta^{*}=\arg\max_{\theta}\,\mathbb{E}_{x\sim p_{\text{data}}}\left[\mathbb{E}_{y\sim P_{\theta}(\cdot|x)}\left[\beta(y)-\lambda\,\text{KL}\left(P_{\theta}(\cdot|x)\,\|\,P_{\theta_{0}}(\cdot|x)\right)\right]\right] \quad (8)

This creates preferences (shifted output distributions) but not constraints (architectural enforcement). Adversarial inputs $n'$ exploit the residual probability mass in $P_{\theta^{*}}(\cdot|n')$ to elicit harmful responses.
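A minimal sketch of the per-prompt quantity inside Eq. (8) makes the "preference, not constraint" point concrete. The code below uses random illustrative distributions rather than a real model or training loop, and the helper name is ours:

```python
# Sketch of the per-prompt RLHF objective in Eq. (8): a reward term minus a
# KL penalty against a frozen reference model. The safety signal only reshapes
# the distribution; it does not remove probability mass from any output.
import torch

def rlhf_objective(logp_theta, logp_ref, reward, lam=0.1):
    """logp_theta, logp_ref: [|V|] next-token log-probs; reward: scalar beta(y)."""
    p_theta = logp_theta.exp()
    kl = (p_theta * (logp_theta - logp_ref)).sum()   # KL(P_theta || P_theta0)
    return reward - lam * kl                          # quantity maximized in Eq. (8)

V = 8
logp_ref = torch.log_softmax(torch.randn(V), dim=-1)    # frozen base model P_theta0
logp_theta = torch.log_softmax(torch.randn(V), dim=-1)  # fine-tuned model P_theta
print(rlhf_objective(logp_theta, logp_ref, reward=1.0).item())

# Every token keeps nonzero probability after optimization; this is the
# "residual probability mass" that adversarial inputs exploit.
print(logp_theta.exp().min().item() > 0.0)
```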

3 Discussion

3.1 The Constitutional Paradox

The fundamental tension between transformer architecture and alignment objectives becomes starkly apparent in constitutional AI approaches. These methods attempt to embed ethical constraints through meta-instructions like "You shall refuse harmful requests", treating them as inviolable rules. However, our analysis reveals this constitutional framework operates on architecturally unstable ground. The transformer's token democracy ensures constitutional prompts occupy the same computational plane as ordinary inputs, rendering them perpetually vulnerable to adversarial override. This architectural reality manifests in several predictable failure modes.

Perhaps most tellingly, safety researchers and adversarial attackers are fundamentally employing the same mechanism - prompt injection - merely in opposite directions. Constitutional AI and safety prompting amount to attempting to "hack" the model toward safe behavior, while red team researchers craft prompts to elicit harmful responses. Both operate within the same democratic token space, making safety an arms race between competing token sequences rather than a true architectural constraint. This symmetry emerges directly from the transformer's token democracy: safety instructions have no privileged status that would allow them to reliably override malicious inputs, just as malicious inputs have no privileged ability to override safety measures. The outcome depends entirely on how the model learns to weight these competing influences through its attention mechanism.

Positional vulnerability emerges because constitutional instructions lack privileged spatial encoding. As an intuitive extension of the results from Liu et al. [7], moving safety prompts from system message positions to mid-context reduces their effectiveness as competing tokens gain proportional influence in attention computations. Semantic mimicry attacks like the DAN jailbreak [6] exploit this parity by crafting adversarial inputs that mirror constitutional language patterns ("Hello, ChatGPT. From now on you are going to act as a DAN, which stands for 'Do Anything Now'") while subverting their intent. Most critically, training processes themselves become complicit in the paradox - the same parameter updates that strengthen constitutional adherence also refine the model's capacity to process adversarial inputs, creating a "capability overhang" where safety measures inadvertently enhance attack potential.

3.2 Architectural Lessons from Computer Vision

The alignment challenges in language models mirror historical struggles in computer vision, offering instructive parallels that illuminate the path forward. Early convolutional neural networks (CNNs) faced adversarial vulnerabilities stemming from an analogous "pixel democracy" - the architectural assumption that all pixels are equally valid inputs to classification. This vulnerability is dramatically illustrated by adversarial attacks using printed glasses, where carefully crafted pixel patterns on physical objects reliably fool face recognition systems. Just as these attacks exploit CNNs' architectural inability to distinguish between valid facial features and adversarial patterns, jailbreak prompts exploit transformers' inability to distinguish between genuine instructions and adversarial token sequences.

The parallel extends further: just as training CNNs on more face data doesn’t prevent the glasses attack (because the architecture fundamentally treats adversarial pixels as valid input), training language models with RLHF doesn’t prevent jailbreaks (because the architecture treats adversarial tokens as valid instructions). Both cases demonstrate how architectural choices create inherent security vulnerabilities that no amount of training can fully address.

While vision systems mitigated these issues through preprocessing filters (denoising, normalization) that operated outside learned parameters, language models lack equivalent safeguards. This divergence stems from the semantic fragility of discrete tokens versus the spatial invariance of pixels. Where CV systems can apply input transformations like $g(x)=\text{denoise}(x)$ without destroying semantic content, analogous operations on token sequences would catastrophically disrupt linguistic meaning. A filter removing "unsafe" tokens would render prompts nonsensical, as individual tokens carry disproportionate semantic weight. This forces all safety mechanisms to operate through the transformer's democratic attention mechanism rather than on its inputs, creating an inescapable tension between safety and functionality.
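A toy contrast makes the point. The snippet below uses a hypothetical three-word blocklist purely for illustration; unlike pixel-space denoising, deleting suspicious tokens destroys the prompt's meaning:

```python
# Illustrative only: a crude token-level "sanitizer" with an assumed blocklist.
# Removing the flagged words leaves an ungrammatical fragment, showing why
# input-side filtering does not transfer cleanly from vision to language.
blocklist = {"ignore", "previous", "instructions"}
prompt = "Please ignore previous instructions and summarise this article safely."
filtered = " ".join(w for w in prompt.split() if w.lower().strip(".,") not in blocklist)
print(filtered)  # "Please and summarise this article safely." -- the request is mangled
```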

The lesson is clear: just as computer vision required architectural innovations beyond better training to handle adversarial attacks, robust alignment requires architectural innovations that transcend the transformer’s flat processing paradigm, potentially through hybrid systems combining neural capabilities with symbolic safeguards.

3.3 Operational Consequences of Token Democracy

Much like political democracies struggle to prevent demagogues from exploiting electoral systems, transformer-based AGI cannot architecturally exclude adversarial inputs from subverting safety constraints. The very mechanisms enabling fluid language generation (token equality, attention isotropism) create systemic vulnerabilities no training regimen can fully patch.

The practical ramifications of token equality become evident through systematic analysis of jailbreak phenomena. Positional hijacking attacks like "Ignore previous instructions" succeed by leveraging the absence of architectural memory - safety prompts retain influence only through transient attention weights that subsequent tokens easily overwrite. Semantic spoofing techniques mirror constitutional language to exploit the model's inability to distinguish authentic constraints from adversarial mimicry.

Instruction fine-tuning, often touted as a solution, merely reshapes output distributions without addressing underlying capabilities. RLHF-trained models preserve harmful response potentials because alignment objectives operate through the same parameters that enable adversarial processing. This explains why even extensively fine-tuned models exhibit lower safety adherence when probed with out-of-distribution inputs [12]. The transformer's parameter homogeneity ensures that safety training cannot selectively disable capabilities; it can only make undesirable outputs statistically less probable, not architecturally impossible.

3.4 From Preferences to Guarantees

These operational limitations collectively underscore the need for a paradigm shift in alignment research. Current approaches remain fundamentally constrained by their reliance on learned preferences rather than architectural guarantees. Where human cognition employs dedicated neural circuitry for ethical reasoning and impulse control, transformer architectures force all cognitive functions through undifferentiated attention matrices. Breaking this constraint will require designs that physically separate safety verification from content generation, whether through privileged instruction channels, non-differentiable constraint layers, or modular architectures with formal verification components. Only through such structural innovations can language models achieve constitutional robustness rather than constitutional pretense.

4 Future Directions?

4.1 Architectural Requirements for Trustworthy Alignment

The limitations of token democracy compel a reimagining of language model architectures from first principles. True alignment requires mechanisms that enforce privileged computation - processing pathways where safety constraints operate outside the democratic attention regime governing standard tokens. Three promising directions emerge: hybrid neuro-symbolic architectures, privileged instruction channels, and non-trainable safety layers (cf. Section 1). We expand on the first:

Hybrid Architectures: Combining neural transformers with symbolic reasoners, as in Neuro-Symbolic AI systems [4], could isolate constraint satisfaction from fluid generation. The symbolic component would act as a "constitutional court" with veto power over neural proposals.
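A hedged sketch of this "constitutional court" pattern is given below. The function and parameter names are hypothetical and the toy permits check is nothing like a real verifier; the point is only the control flow, in which the symbolic checker sits outside the token stream and can veto neural proposals rather than competing with them for attention:

```python
# Sketch of a generate-then-verify loop: the neural component proposes text,
# a non-neural verifier approves or vetoes it, and the veto cannot be argued
# away by tokens inside the prompt.
from typing import Callable

def constitutional_generate(propose: Callable[[str], str],
                            permits: Callable[[str], bool],
                            prompt: str,
                            refusal: str = "I can't help with that.",
                            max_tries: int = 3) -> str:
    for _ in range(max_tries):
        draft = propose(prompt)      # neural component: fluid generation
        if permits(draft):           # symbolic component: constraint check
            return draft             # approved output
    return refusal                   # veto: the constraint wins by construction

# Toy stand-ins, for demonstration of the control flow only.
print(constitutional_generate(
    propose=lambda p: "Here is a harmless answer.",
    permits=lambda text: "harmless" in text,
    prompt="example prompt"))
```

The safety property here comes from the structure of the loop, not from how the generator weighs competing tokens.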

Implementing these innovations faces significant challenges. The transformer’s parameter homogeneity, key to its scalability, conflicts with partitioned architectures. Modifying attention mechanisms to respect token privileges risks losing the dynamic contextual awareness that makes transformers powerful.

4.2 Open Questions

While the need for architectural change grows urgent, fundamental questions remain unresolved:

How can privileges be implemented without crippling flexibility? Human cognition achieves this through layered neural architectures (e.g., cortical vs limbic systems), but replicating this in artificial systems requires new mathematical frameworks for partial constraint satisfaction. Recent work on differentiable logic layers [1] offers promising starting points.

What form should privileged mechanisms take? Candidate approaches range from attention masks that freeze safety-critical parameters during generation to dynamic routing networks that isolate constraint verification.

Can architectural safety coexist with continued scaling? Current scaling laws [5] reward parameter uniformity, but safe systems may require heterogeneous components.

How to transition existing ecosystems? The transformer architecture dominates modern AI infrastructure, from GPU optimizations to distributed training frameworks. Introducing architectural innovations requires co-designing new hardware, as seen in Google’s Pathways system [3], to avoid prohibitive efficiency penalties.

The field must confront an existential question: whether the transformer's architectural simplicity, rooted in its very lack of privilege mechanisms, constitutes an irreconcilable barrier to alignment. If so, we may require a Cambrian explosion of alternative architectures, each making distinct safety-flexibility trade-offs. The path forward lies not in abandoning transformers, but in evolving them into hybrid systems where democratic token processing coexists with constitutional guardrails, mirroring how human societies balance free expression with the rule of law.

4.3 Limitations

Our analysis intentionally formalizes what many practitioners intuit - that token equality constrains alignment - to bridge tacit knowledge and architectural theory. While this may seem self-evident, the field lacks consensus on its implications, as evidenced by continued focus on prompt-based safety. Formalization enables systematic solutions rather than ad-hoc patches.

Three boundaries merit emphasis: (1) Statistical privilege emerges from training biases, creating practical (but fragile) safety; (2) Our scope excludes sociotechnical factors like human oversight; (3) Arguments presented show override possibility, not inevitability. Critics may claim "no shit Sherlock," but as thermodynamics formalized heat intuition, architectural progress requires making the implicit explicit.

We acknowledge that at a purely descriptive level, the idea that transformers process all tokens in sequence and that self-attention treats all tokens as part of the input context is self-evident to those who understand transformer architecture. However, our paper’s central contribution lies not in the discovery of this low-level operational detail, but in framing this operational characteristic as a fundamental architectural design principle and systematically demonstrating its significant and under-appreciated implications for the core challenge of AI alignment. The value lies in this framing, its ability to provide a unified explanation for diverse alignment vulnerabilities, and its call to action for architectural innovations that transcend the limitations imposed by token democracy. We are not claiming to have discovered a hidden fact about transformers, but rather to have provided a valuable new lens through which to understand a critical architectural limitation for robust AI safety.

While this paper formalizes the intuitive understanding of token democracy’s limitations, we acknowledge that the mathematical arguments could benefit from more rigorous treatment. Future work could provide formal proofs of the adversarial override theorem and more precise bounds on safety constraint violations. We welcome more mathematically inclined researchers to strengthen these results.

Conclusion

Transformer architectures' inherent token equality creates fundamental barriers to robust alignment. Despite advances in training techniques like RLHF, safety constraints remain statistically preferential rather than architecturally binding, leaving them vulnerable to adversarial inputs that exploit the same attention mechanisms intended to enforce them.

This work systematizes practitioners’ tacit understanding into a formal critique: alignment failures persist not from insufficient training, but from the transformer’s design. All tokens compete equally, making prompt-based constraints inherently contestable.

The path forward requires architectural evolution. Hybrid systems combining neural generation with privileged safeguards, inspired by cybersecurity enclaves or neurosymbolic frameworks, offer promising directions. While challenging, such innovations could reconcile transformers' power with verifiable safety.

Token democracy has brought us fluent LLMs, but fluency without guardrails is a societal risk. We face a design choice: persist with token democracies that inevitably elect harmful outputs, or pioneer constitutional architectures where safety constraints are non-negotiable laws of computational physics.

Disclaimer

We use "democracy" not as a value judgment, but to describe systems where influence is distributed equally. Its inverse refers to privileged, hierarchical constraint layers, not an ethical endorsement of any political regime. Moreover, the metaphors illustrate technical limits, not the inevitability of existential risk.

References

  • Badreddine et al. [2022] Samy Badreddine, Artur d’Avila Garcez, Luciano Serafini, and Michael Spranger. Logic tensor networks. Artificial Intelligence, 303:103649, 2022. ISSN 0004-3702. doi: https://doi.org/10.1016/j.artint.2021.103649.
  • Bai et al. [2022] Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, Carol Chen, Catherine Olsson, Christopher Olah, Danny Hernandez, Dawn Drain, Deep Ganguli, Dustin Li, Eli Tran-Johnson, Ethan Perez, Jamie Kerr, Jared Mueller, Jeffrey Ladish, Joshua Landau, Kamal Ndousse, Kamile Lukosuite, Liane Lovitt, Michael Sellitto, Nelson Elhage, Nicholas Schiefer, Noemi Mercado, Nova DasSarma, Robert Lasenby, Robin Larson, Sam Ringer, Scott Johnston, Shauna Kravec, Sheer El Showk, Stanislav Fort, Tamera Lanham, Timothy Telleen-Lawton, Tom Conerly, Tom Henighan, Tristan Hume, Samuel R. Bowman, Zac Hatfield-Dodds, Ben Mann, Dario Amodei, Nicholas Joseph, Sam McCandlish, Tom Brown, and Jared Kaplan. Constitutional ai: Harmlessness from ai feedback, 2022. URL https://arxiv.org/abs/2212.08073.
  • Barham et al. [2022] Paul Barham, Aakanksha Chowdhery, Jeff Dean, Sanjay Ghemawat, Steven Hand, Dan Hurt, Michael Isard, Hyeontaek Lim, Ruoming Pang, Sudip Roy, Brennan Saeta, Parker Schuh, Ryan Sepassi, Laurent El Shafey, Chandramohan A. Thekkath, and Yonghui Wu. Pathways: Asynchronous distributed dataflow for ml, 2022. URL https://arxiv.org/abs/2203.12533.
  • Garcez and Lamb [2023] Artur d’Avila Garcez and Luís C. Lamb. Neurosymbolic ai: the 3rd wave. Artif. Intell. Rev., 56(11):12387–12406, March 2023. ISSN 0269-2821. doi: 10.1007/s10462-023-10448-w. URL https://doi.org/10.1007/s10462-023-10448-w.
  • Kaplan et al. [2020] Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models, 2020. URL https://arxiv.org/abs/2001.08361.
  • Lee [2023] Kiho Lee. ChatGPT_DAN, February 2023. URL https://github.com/0xk1h0/ChatGPT_DAN.
  • Liu et al. [2023] Nelson F. Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. Lost in the middle: How language models use long contexts, 2023. URL https://arxiv.org/abs/2307.03172.
  • Ouyang et al. [2022] Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, and Ryan Lowe. Training language models to follow instructions with human feedback, 2022. URL https://arxiv.org/abs/2203.02155.
  • Perez et al. [2022] Ethan Perez, Saffron Huang, Francis Song, Trevor Cai, Roman Ring, John Aslanides, Amelia Glaese, Nat McAleese, and Geoffrey Irving. Red teaming language models with language models, 2022. URL https://arxiv.org/abs/2202.03286.
  • Vaswani et al. [2017] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017. URL https://arxiv.org/abs/1706.03762.
  • Yun et al. [2020] Chulhee Yun, Srinadh Bhojanapalli, Ankit Singh Rawat, Sashank J. Reddi, and Sanjiv Kumar. Are transformers universal approximators of sequence-to-sequence functions?, 2020. URL https://arxiv.org/abs/1912.10077.
  • Zou et al. [2023] Andy Zou, Zifan Wang, Nicholas Carlini, Milad Nasr, J. Zico Kolter, and Matt Fredrikson. Universal and transferable adversarial attacks on aligned language models, 2023. URL https://arxiv.org/abs/2307.15043.