The $63-million prediction machine
Today’s artificial intelligence has reached an inflection point that few truly understand. When ChatGPT writes a poem, Claude solves a complex problem, or GPT-4 appears to reason through scientific questions, it feels like we have crossed into genuine machine intelligence. The responses are coherent, contextual, and often brilliant enough to fool experts in specialized fields.
But here is what the tech companies do not want you to fully grasp: these systems do not “understand” anything in the way humans do. Today’s AI represents the triumph of a particular approach – extraordinarily sophisticated prediction engines, trained at enormous cost to guess the next word based on patterns in vast amounts of text. This is not intelligence as we know it; it is statistical pattern-matching elevated to an art form.
OpenAI’s GPT-4, the model behind ChatGPT’s most impressive responses, cost approximately $63 million to train, running on 25,000 A100 GPUs for 90 to 100 days (OpenAI; Wikipedia). That is more than the annual GDP of some small nations, spent on teaching a machine to predict text sequences. The question that should concern us all: is this the best path to artificial intelligence, or have we built magnificent castles on foundations of sand?
The transformer revolution – attention is all you need
The breakthrough that enabled today’s AI came in 2017 with a deceptively simple paper titled “Attention Is All You Need” by Google researchers (Sebastian Raschka; CUDO Compute). This introduced the transformer architecture – the foundation of every major AI system from GPT-4 to Claude to Gemini.
At its core, a transformer first breaks text into numerical units called tokens, each of which is mapped to a vector by a lookup in a word-embedding table. At each layer, every token is then contextualized against the other tokens in the context window through a parallel multi-head attention mechanism (Understanding and Coding the Self-Attention Mechanism of Large Language Models From Scratch).
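To make that first step concrete, here is a minimal sketch of tokenization and embedding lookup. Everything in it is a toy stand-in: real models use subword tokenizers, vocabularies of tens of thousands of entries, and embeddings with thousands of dimensions.

```python
import numpy as np

# Toy vocabulary and embedding table (illustrative values only).
vocab = {"the": 0, "river": 1, "bank": 2, "was": 3, "muddy": 4}
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(len(vocab), 8))  # one 8-dimensional vector per token

def embed(text: str) -> np.ndarray:
    token_ids = [vocab[word] for word in text.lower().split()]  # "tokenize" by whitespace
    return embedding_table[token_ids]                           # lookup: one row per token

vectors = embed("the river bank was muddy")
print(vectors.shape)  # (5, 8): five tokens, each now an 8-dimensional vector
```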
Think of it this way: when you read the word “bank”, you unconsciously consider the surrounding words to determine whether it means a financial institution or the side of a river. The attention mechanism does something similar – it calculates how much each word should “pay attention” to every other word in the sequence.
The attention mechanism addresses ambiguity by computing context-dependent weights. When analyzing “bat” in different sentences, it examines surrounding words (“swing” versus “flew”) and calculates attention scores that determine relevance, resulting in distinct representations for “bat” as a sports tool versus a flying creature (How Transformers Work: A Detailed Exploration of Transformer Architecture | DataCamp).
The genius of this approach was eliminating the sequential processing bottleneck of earlier neural networks. Unlike recurrent neural networks that process input sequentially, transformers use attention mechanisms exclusively, allowing parallel computation that can be accelerated on GPUs for both faster training times and larger model sizes (GPT-4 Details Revealed – by Patrick McGuinness).
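Here is a minimal sketch of that attention step for a single head over a short sequence of token vectors, assuming toy dimensions and random projection weights; production transformers use many heads, many layers, and far larger dimensions.

```python
import numpy as np

rng = np.random.default_rng(1)
d_model = d_k = 8
X = rng.normal(size=(5, d_model))            # five token vectors from the embedding step

# Learned projection matrices (random here, purely illustrative)
W_q, W_k, W_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))
Q, K, V = X @ W_q, X @ W_k, X @ W_v          # queries, keys, values for every token at once

scores = Q @ K.T / np.sqrt(d_k)              # how strongly each token attends to every other
weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)  # row-wise softmax
contextualized = weights @ V                 # weighted sum of values: context-aware vectors

print(weights.shape)         # (5, 5): one attention distribution per token
print(contextualized.shape)  # (5, 8): the whole sequence updated at once
```

Because every line above is a matrix operation over the whole sequence, all tokens are processed in parallel, which is exactly what makes GPUs such a good fit.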
The mixture-of-experts architecture – efficiency through specialization
Here’s where today’s AI gets truly sophisticated – and where the public understanding diverges most sharply from technical reality. GPT-4 is not a single massive neural network but rather a “mixture-of-experts” (MoE) model consisting of approximately 1.8 trillion parameters organized into 16 expert networks, each with about 111 billion parameters (Wikipedia; Getzep).
The brilliance of this architecture lies in selective activation. During inference, only two of these 16 experts are activated for any given token, meaning that while GPT-4 has 1.8 trillion total parameters, only roughly 280 billion (two experts plus shared components) are active in any single forward pass (Reducing LLM Hallucinations: A Developer’s Guide).
Imagine a hospital with 16 different specialists. When a patient arrives, the system routes them to the two most appropriate doctors based on their symptoms. A heart patient might be routed to a cardiologist and an anesthesiologist, while someone with a broken bone goes to an orthopedist and a radiologist. The MoE architecture works similarly – different parts of the input are routed to the most relevant expert networks.
This sparse activation is crucial for practical deployment. There is an inference constraint around 300 billion feed-forward parameters for current GPU systems. GPT-4’s design, with two 111 billion parameter experts plus 50 billion common parameters, stays just under this limit while maintaining the representational power of a much larger model (GPT-4 architecture, datasets, costs and more leaked).
This explains why GPT-4 can respond at human reading speed despite its enormous size. A dense 1.8-trillion parameter model would be impossible to run in real-time with current hardware.
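A minimal sketch of that top-2 routing idea follows, with toy dimensions and a random gating network standing in for the real components; the expert count and routing rule match the description above, but nothing here reflects OpenAI’s actual implementation.

```python
import numpy as np

rng = np.random.default_rng(2)
d_model, num_experts, top_k = 8, 16, 2

# One tiny "expert" per slot (toy weights; the real experts are ~111B parameters each)
experts = [rng.normal(size=(d_model, d_model)) for _ in range(num_experts)]
gate = rng.normal(size=(d_model, num_experts))   # router that scores every expert for a token

def moe_layer(x: np.ndarray) -> np.ndarray:
    """Route a single token vector to its top-2 experts and mix their outputs."""
    logits = x @ gate
    top = np.argsort(logits)[-top_k:]                         # indices of the two best experts
    probs = np.exp(logits[top]) / np.exp(logits[top]).sum()   # renormalize over the chosen pair
    # Only the selected experts run; the other 14 stay idle for this token.
    return sum(p * (x @ experts[i]) for p, i in zip(probs, top))

token = rng.normal(size=d_model)
print(moe_layer(token).shape)  # (8,): same shape as the input, computed by just 2 of 16 experts
```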
The illusion of understanding
Here is where public perception loses touch with technical reality. When GPT-4 appears to reason through a complex problem, it is not thinking in any human sense. Fundamentally, text-generative transformer models operate on the principle of next-word prediction: given a text prompt from the user, what is the most probable next word that will follow this input? (Building a Transformer LLM with Code: Introduction to the Journey of Intelligence).
These parameters do not store facts the way a database does. Instead, they encode statistical patterns learned from training on approximately 13 trillion tokens of text and code, including CommonCrawl and RefinedWeb, with speculation pointing to additional sources such as Twitter, Reddit, YouTube, and a large collection of textbooks (Reducing LLM Hallucinations: A Developer’s Guide).
The model learned that certain word sequences tend to follow others through self-attention mechanisms that allow it to process entire sequences and capture long-range dependencies more effectively than previous architectures (Building a Transformer LLM with Code: Introduction to the Journey of Intelligence).
To understand how different this is from human intelligence, consider what happens when you ask GPT-4 about a historical event. A human would recall facts, cross-reference memories, and reason about causality. The AI, however, generates text by the following steps (see the sketch after this list):
- Tokenizing your question into numerical representations
- Computing attention between all tokens to understand context
- Routing tokens to the most relevant expert networks
- Predicting the most statistically likely next token based on learned patterns
- Repeating this process thousands of times to generate a complete response
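Here is a stripped-down sketch of that loop. The `next_token_probs` function is a hypothetical stand-in for the entire transformer stack (attention, expert routing, and all), and the vocabulary is invented for illustration.

```python
import numpy as np

vocab = ["Napoleon", "was", "crowned", "emperor", "in", "1804", "."]
rng = np.random.default_rng(3)

def next_token_probs(token_ids: list[int]) -> np.ndarray:
    """Stand-in for the full model: return a probability distribution over the vocabulary."""
    logits = rng.normal(size=len(vocab)) - np.bincount(token_ids, minlength=len(vocab))
    return np.exp(logits) / np.exp(logits).sum()

def generate(prompt_ids: list[int], max_new_tokens: int = 5) -> list[int]:
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        probs = next_token_probs(ids)                     # attention + experts would live here
        ids.append(int(rng.choice(len(vocab), p=probs)))  # sample the next token, append, repeat
    return ids

print(" ".join(vocab[i] for i in generate([0, 1])))  # continues from "Napoleon was"
```

Nothing in this loop checks whether the emitted tokens are true; it only asks which token is likely to come next.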
The attention mechanism calculates output as a weighted sum of values, where weights are determined by scaled dot-product of queries with all keys. Multi-head attention allows the model to jointly attend to information from different representation subspaces at different positions (GPT-4 Cost Estimate (UPDATED) – Community – OpenAI Developer Community).
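In the notation of the original transformer paper, that weighted-sum computation is written as:

```latex
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V
```

where Q, K, and V are the query, key, and value matrices and d_k is the key dimension.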
The result can be remarkably coherent and seemingly intelligent, but it is generated through sophisticated pattern matching, not understanding.
The hallucination problem – when prediction goes wrong
This prediction-based approach creates a fundamental vulnerability that exposes the limitations of today’s AI: hallucinations. Large language model (LLM) hallucinations are instances in which models generate outputs that are coherent and grammatically correct but factually incorrect or nonsensical (Iguazio; Evidently AI).
The scope of this problem is larger than most realize. Studies estimate hallucination rates of 15-20% for ChatGPT, with research showing ChatGPT exhibiting self-contradictions at a rate of 14.3%, and GPT-4 at 11.8%.
Unlike humans, these models do not have real-world experiences or the ability to access real-time data during inference. They predict words based on probability, not truthfulness, and once they generate a sentence, they do not go back to verify it against external sources (LLM Hallucinations: What You Need to Know Before Integration).
This creates cascading errors. When an LLM hallucinates, any error in the generated reasoning can propagate into a full-blown hallucination, creating long “hallucinated” reasoning chains if the model is truly unsure (LLM Hallucinations: What You Need to Know Before Integration).
Why determinism has been abandoned
Here lies one of the most critical decisions in today’s AI development – the abandonment of deterministic approaches in favour of probabilistic systems. This choice has profound implications that few in the industry acknowledge.
Traditional computer science emphasized deterministic algorithms: given the same input, you always get the same output. Every step can be traced, verified, and debugged. But as AI researchers pursued more “flexible” and “creative” systems, they embraced probabilistic approaches in which randomness is built into the core architecture.
Today’s AI systems are fundamentally non-deterministic. LLMs use probabilistic modelling to predict the word or phrase most likely to come next, depending on probability rather than factual verification (LLM Hallucinations: What You Need to Know Before Integration). Even with identical inputs, these systems can produce different outputs because of temperature-based sampling and other sources of randomness at inference time.
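The point is easy to demonstrate. In the sketch below, the model’s scores for the next token are held fixed and invented for illustration; only the temperature-scaled random sampling step varies, yet the output changes from run to run.

```python
import numpy as np

logits = np.array([2.0, 1.5, 0.3])   # fixed, invented scores for three candidate next tokens
tokens = ["Paris", "France", "Lyon"]

def sample(temperature: float, seed: int) -> str:
    rng = np.random.default_rng(seed)
    probs = np.exp(logits / temperature)
    probs /= probs.sum()
    return tokens[rng.choice(len(tokens), p=probs)]

# Same input, same scores, different outputs depending on the random draw.
print([sample(temperature=1.0, seed=s) for s in range(5)])
# Only as temperature approaches zero does the choice collapse to a deterministic argmax.
print([sample(temperature=0.01, seed=s) for s in range(5)])
```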
This choice was driven by several factors:
Computational pragmatism: Deterministic reasoning systems require exhaustive logical checking, which scales poorly with complexity. Probabilistic systems can “guess” their way through problems they cannot solve definitively.
Training data complexity: Models inherit biases present in training data and lack real-world experiences or common-sense reasoning abilities. Rather than solve these fundamental knowledge problems, researchers chose to approximate answers through statistical learning.
The “good enough” philosophy: If a system produces useful outputs 80% of the time, many applications can tolerate the 20% error rate. This pragmatic approach prioritized deployment speed over mathematical rigour.
Biological inspiration: Researchers argued that human intelligence itself involves uncertainty and probabilistic reasoning, so AI should embrace these characteristics rather than pursue perfect logical consistency.
But this choice has consequences. Deterministic systems can provide mathematical guarantees about their behaviour. You can prove they will never produce certain types of errors. Probabilistic systems, by contrast, can always surprise you – often in ways that are not discovered until after deployment in critical applications.
The context window constraint
One of the most significant limitations of today’s AI is the context window – how much text the model can consider at once. GPT-4 operates with context windows of 8,192 to 32,768 tokens, a significant improvement over earlier models but still fundamentally constrained (LLM hallucinations and failures: lessons from 4 examples).
When input exceeds these limits, the model generates responses based on a partial understanding of the prompt, potentially leading to contradictions or irrelevant answers. The model essentially “forgets” earlier parts of long conversations or documents (What is an LLM Hallucination? Why Should We Care? | LITSLINK Blog).
This is not like human memory, where we can selectively recall relevant information. When an AI hits its context limit, it literally cannot see the beginning of a long conversation. This truncation can result in the model losing crucial details, increasing the chances of producing inconsistent or hallucinated responses.
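A sketch of what that truncation looks like in practice; the window size and word-level “tokens” are toy values, since real systems count subword tokens and use far larger windows.

```python
def truncate_to_context_window(tokens: list[str], max_tokens: int = 8) -> list[str]:
    """Keep only the most recent tokens; everything earlier is invisible to the model."""
    return tokens[-max_tokens:]

conversation = ("my name is Alice and I am allergic to penicillin . "
                "later we discussed travel plans in great detail").split()

print(truncate_to_context_window(conversation, max_tokens=8))
# ['later', 'we', 'discussed', 'travel', 'plans', 'in', 'great', 'detail']
# The allergy mentioned at the start has been cut off; the model cannot recall what it
# can no longer see, so a follow-up question about it invites a hallucinated answer.
```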
The economics of intelligence
The costs involved reveal just how computationally intensive this approach is. Training large language models requires massive computational resources, with modern LLMs using thousands of GPUs working in parallel for weeks or months. A single NVIDIA H100 GPU can cost $25,000-$40,000 (The training cost of GPT-4 is now only 1/3 of what it was about a year ago… | Hacker News).
But training is just the beginning. The real challenge is inference – generating responses in real-time. GPT-4 outputs at human reading speed, which would be prohibitively expensive, if not impossible, with a dense model of this size. This is why OpenAI uses a sparse mixture-of-experts architecture in which not every parameter is activated during inference (GPT-4 architecture, datasets, costs and more leaked).
The estimated cost to train a GPT-4-calibre model dropped from $63 million to around $20 million by Q3 2023, roughly one-third of the original figure, with training taking about 55 days (The training cost of GPT-4 is now only 1/3 of what it was about a year ago… | Hacker News). Yet even these “reduced” costs represent barriers that only the largest corporations can overcome.
The architecture arms race
The industry response has been to build ever-larger models with more parameters and bigger context windows. But this approach faces fundamental constraints. There is an inference constraint around 300 billion feed-forward parameters for current GPU systems. Beyond this, memory bandwidth requirements make real-time response impossible (GPT-4 architecture, datasets, costs and more leaked).
This scaling approach also faces the law of diminishing returns. While bigger models often perform better, the computational requirements grow roughly in proportion to the number of parameters multiplied by the number of training tokens (The training cost of GPT-4 is now only 1/3 of what it was about a year ago… | Hacker News). We are approaching physical and economic limits to this brute-force scaling strategy.
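A back-of-envelope sketch of that relationship, using the commonly cited approximation of about six floating-point operations per active parameter per training token, the figures quoted earlier in this article, and assumed values for per-GPU throughput and utilization; treat the result as an order-of-magnitude check, not a precise accounting.

```python
# Rough training-compute estimate; all inputs are approximations.
active_params = 280e9        # parameters active per token in the sparse MoE forward pass
training_tokens = 13e12      # approximate size of the training set, in tokens

flops_needed = 6 * active_params * training_tokens   # ~6 FLOPs per active parameter per token

a100_peak_flops = 312e12     # A100 peak BF16 throughput, FLOP/s
utilization = 0.3            # assumed realistic utilization for large-scale training
gpus, days = 25_000, 95

flops_available = gpus * a100_peak_flops * utilization * days * 86_400

print(f"needed:    {flops_needed:.2e} FLOPs")     # ~2.2e25
print(f"available: {flops_available:.2e} FLOPs")  # ~1.9e25, the same order of magnitude
```

That the two numbers land in the same range is what makes the reported 25,000-GPU, 90-to-100-day figure plausible.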
What this means for AI’s future
Understanding this architecture reveals both the remarkable achievements and fundamental limitations of today’s AI. These systems can perform many tasks that appear to require intelligence, but they do so through statistical pattern matching rather than understanding.
The implications are profound:
- Reliability concerns: If an AI hallucinates 15-20% of the time, where can we safely deploy it?
- Scaling limitations: Simply making models bigger faces computational and economic walls
- The understanding gap: Pattern-matching may have fundamental limits compared to genuine comprehension
- Determinism tradeoffs: We have gained flexibility but lost mathematical guarantees about system behaviour
Beyond the hype
The technology companies have powerful incentives to emphasize AI capabilities while downplaying limitations. As one industry analysis noted, “OpenAI is keeping the architecture of GPT-4 closed not because of some existential risk to humanity but because what they have built is replicable.” (GPT-4 architecture, datasets, costs and more leaked)
The reality is that today’s AI represents an extraordinary achievement in statistical text processing. The transformer architecture, mixture-of-experts scaling, and attention mechanisms represent genuine breakthroughs in computational pattern recognition. But it is not the artificial general intelligence that marketing messaging sometimes suggests.
It is a sophisticated prediction engine with remarkable capabilities and fundamental constraints. The abandonment of deterministic approaches in favour of probabilistic systems has enabled rapid progress but at the cost of reliability and verifiability.
Understanding these constraints is not pessimistic; it is essential for building AI systems that we can trust and depend on. The gap between current AI and genuine understanding may be where the next breakthrough lies. Whether that breakthrough comes from scaling current approaches further or requires entirely new paradigms remains the critical question for AI’s future.
As we continue exploring AI’s present and future in this series, we will examine whether these fundamental limitations can be overcome, or whether different concepts – perhaps returning to deterministic principles with modern computational power – might be needed to create truly reliable artificial intelligence.
The $63-million question is not just how much it costs to train these systems – it is whether this probabilistic approach can ever bridge the gap between prediction and understanding, or whether the future of AI lies in fundamentally different architectures that prioritize mathematical certainty over statistical approximation.
(Mark Jennings-Bates – BIG Media Ltd., 2025)