How smart is ChatGPT really – and how do we judge intelligence in AIs?
Following claims that an AI has shown "sparks of artificial general intelligence", what are we to make of the hype surrounding this technology? AI expert Melanie Mitchell is your guide
ARTIFICIAL intelligence has been all over the news in the past few years. Even so, in recent months the drumbeat has reached a crescendo, largely because an AI-powered chatbot called ChatGPT has taken the world by storm with its ability to generate fluent text and confidently answer all manner of questions. All of which has people wondering whether AIs have reached a turning point.
The current system behind ChatGPT is a large language model called GPT-3.5, which consists of an artificial neural network, a series of interlinked processing units that allow for programs that can learn. Nothing unusual there. What surprised many, however, is the extent of the abilities of the latest version, GPT-4. In March, Microsoft researchers, who were given access to the system by OpenAI, which makes it, argued that by showing prowess on tasks beyond those it was trained on, as well as producing convincing language, GPT-4 displays “sparks” of artificial general intelligence. That is a long-held goal for AI research, often thought of as the ability to do anything that humans can do. Many experts pushed back, arguing that it is a long way from human-like intelligence.
So just how intelligent are these AIs, and what does their rise mean for us? Few are better placed to answer that than Melanie Mitchell, a professor at the Santa Fe Institute in New Mexico and author of the book Artificial Intelligence: A guide for thinking humans. Mitchell spoke to New Scientist about the wave of attention AI is getting, the challenges in evaluating how smart GPT-4 really is, and why AI is constantly forcing us to rethink intelligence.
Daniel Cossins: There is a groundswell of interest in AI at the moment. Why is it happening now?
Melanie Mitchell: The first thing is that these systems are now available to the public. Anyone can easily play with ChatGPT, so people are discovering these systems and what they can do. More broadly, we are seeing an era of astounding progress in linguistic abilities. Over the past five years or so, we’ve seen the emergence of these large language models, trained on enormous amounts of human-generated language, and they’re able to generate fluent, human-sounding text.
Their fluency gives the appearance of human-like intelligence. That has caught people’s imagination; there’s this feeling that the AIs we’ve seen in movies and read about in science fiction are finally here. I think people are feeling both wonder and partly fear at what these AIs might do.
You mention “human-like intelligence”. Just how intelligent are today’s generative AIs, like those that generate text, and how do we assess that?
This is the subject of enormous debate. The number one reason for that is that all these terms we’re concerned with – intelligence, understanding, consciousness – aren’t well defined. The second reason is that these AI systems work very differently to human minds. Recently we saw that GPT-4 managed to pass the bar exam, a standardised test that people must pass to be able to practise law in the US. If a human did well on this test, which involves multiple choice questions and writing a law essay, we would assume they had a lot of general intelligence. But who is to say such tests are an appropriate way to assess AI intelligence?
What do these large language models actually do, and what might that amount to in terms of intelligence?
Let’s start with the concept of a simple language model. I take a sequence of words like “the green frog” and then I look for those words in a huge amount of text and see what words typically follow that phrase. So it might be “jumped” or “swam” or, less likely, “cauliflower”. What’s the probability of each of these words coming next? I store those probabilities for lots and lots of possible sequences of words. Now I can start out with a text prompt and I can look up what is the most probable next word. This is how a simple language model works.
These days we give huge neural networks this task of working out the word probabilities and we train them on truly enormous numbers of examples from written text. These huge neural networks are called “large language models” (or LLMs) and they learn very complex statistical associations among phrases. The problem is that due to the complexity of the neural network and its operations, it’s hard to look under the hood and say exactly what it has learned [in order] to predict those next words.
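The simple word-counting model described above can be sketched in a few lines of code. This is a toy illustration only, not how an LLM actually works: it counts, in a tiny made-up corpus, which word most often follows each word, exactly the kind of statistical lookup a basic bigram language model performs.

```python
from collections import Counter, defaultdict

def train_bigram_model(corpus):
    """For each word, count how often each other word follows it,
    then convert the counts to probabilities."""
    counts = defaultdict(Counter)
    words = corpus.split()
    for current, nxt in zip(words, words[1:]):
        counts[current][nxt] += 1
    model = {}
    for word, followers in counts.items():
        total = sum(followers.values())
        model[word] = {w: c / total for w, c in followers.items()}
    return model

def most_probable_next(model, word):
    """Return the highest-probability next word, or None if the word
    was never seen in training."""
    followers = model.get(word)
    if not followers:
        return None
    return max(followers, key=followers.get)

# A tiny invented corpus for illustration.
corpus = "the green frog jumped and the green frog swam and the frog jumped"
model = train_bigram_model(corpus)
print(most_probable_next(model, "frog"))  # "jumped" follows "frog" twice, "swam" once
```

A large language model replaces the lookup table with a neural network trained on billions of words, which is what makes its learned associations so hard to inspect.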
So, you could say these AIs are just predicting the next word, which doesn’t sound impressive. But you might argue that this ability amounts to something like human intelligence.
Yeah, it’s murky. There are basically three ways in which we can assess a language model. One is that we can interact with it, like I would with a human. You talk to it, give it questions, puzzles, see how it reacts – and form your impressions. That’s like the Turing test, which basically asks: does this machine seem human? The problem is, we humans do tend to easily attribute intelligence to things that aren’t intelligent.
Another way is to try something like giving an AI sets of two sentences, where, in one set, the first sentence logically implies the second is true, and in the other set there isn’t that logical connection. These LLM systems have done extremely well on knowing which sentences are logically connected. But it often turns out that they do well not because they understood the sentences the way a human would, but because they were using subtle statistical associations.
Finally, you can look at the neural network itself and try to pull out an understanding of the mechanisms by which the machine is solving problems. People are working on that. But it’s incredibly difficult because the system is so complex. So, ultimately, we don’t yet have a cut-and-dried, problem-free test for intelligence in these language models.
Chess grandmaster Garry Kasparov takes on a supercomputer in 2003
Do you think our attempts to get to grips with the capabilities of AIs will force us to sharpen our definitions of intelligence and understanding?
That’s been the case throughout the whole history of AI. In the 1970s and 80s, a lot of people were saying, well, playing chess at a grandmaster level is going to require general human-level intelligence. Then we got Deep Blue, the supercomputer that beat the grandmaster Garry Kasparov, and we said it won by brute force, searching out the best possible moves. And here we are again. You could say we keep moving the goalposts. But I would say, in a more positive framing, that AI continues to challenge our conception of what intelligence is, or what we mean by understanding.
The thing is, we know that there are several different manifestations of intelligence. There’s human intelligence, which is very different from, say, the intelligence of an octopus, and again from the powers of a generative AI. Some of us have been using the phrase “diverse intelligences”, plural, to emphasise that intelligence isn’t one thing. How do we characterise these different intelligences? Are there any common features? Are they wholly different? These are the questions we want to grapple with.
Is there anything that has surprised you in what large language models can do?
We have recently seen what people are calling “emergent behaviours” – abilities that go beyond language processing and can seem like human reasoning. You can give an LLM maths problems or instructions for writing computer code. You can give them stories and ask them to reason about the characters. And they can do these things. It’s not at all clear how this happens. They give the impression that they’re able to understand the world, in some sense, having just been trained on enormous amounts of human-generated text. The question is, are they doing something like human reasoning? Or are they just using sophisticated statistical associations, which doesn’t seem to be the way that we reason?
What are the leading ideas for what is behind these emergent behaviours?
It’s a bit too early to say, because every other month we get a new version of these models that can do new things. With GPT-3, we could at least look at the training data. But with GPT-4, we have no access to this. OpenAI says this is a commercial product, so it doesn’t want to give an advantage to the competition. It also cites “safety implications”. There’s no transparency there, so it’s impossible to do the research.
Do you think we are already on a trajectory to AIs with something akin to general intelligence? Or do you think that will require a whole new approach?
We first have to ask: what is general intelligence? Again, we don’t have an agreed definition, so saying what it takes to get there is difficult because I’m not totally sure what the target is. I’ve heard a lot of people in psychology questioning whether humans have general intelligence. Human intelligence is very specific to our evolutionary niche and it might not be as general as we like to think it is.
That said, I think simply scaling up these models is probably not going to take us to the kind of human-like understanding that we want. We don’t want just linguistic understanding; we want visual understanding, the ability to understand and do the right thing in a given situation.
To get to that point, I think we will need some different kinds of architectures. For example, language models like GPT-4 have no long-term memory so they have no recollection of past conversations and they don’t care, in some sense, about what they have said in the past. It has been pointed out that a lot of human intelligence is centred on our motivations; that human intelligence is a means to achieve the goals that evolution has set for us. If a system doesn’t have any motivations, or any of its own goals, maybe it can’t achieve the kind of intelligence that we have.
What do you make of the idea that AIs can or will become sentient or conscious?
As philosophers have pointed out for millennia, how do I even know that you’re conscious? I know that I’m conscious because I can feel it somehow. But maybe you’re just a zombie. I guess I prefer not to go there because I don’t understand what is meant by it and I feel like the discussions never go anywhere.
How might these language models be used once they are embedded in our daily lives? And what will be the nature of our relationship with them?
There will be lots of prosaic applications, like having them write your emails or reports. I think they will make us more productive. Whether there’s going to be something more dramatic, it’s harder to predict. Maybe they will put lawyers out of business. Maybe doctors will use them to help diagnose diseases and make decisions about our healthcare. I don’t know. But right now, they have a lot of limitations that mean you really have to have a human in the loop. We need to have the ability to distinguish truth from falsehood and that ability is missing from LLMs – that’s a fundamental problem.
Last month, a large number of high-profile AI experts signed an open letter calling for a moratorium on AI research. Is it possible that we are moving too fast?
Yes, it is possible. Technology often moves faster than policy and regulation. With AI specifically, there are a lot of risks with deploying these systems in healthcare, legal contexts, journalism – all kinds of areas. But I didn’t sign that letter because I thought that it conflated a whole bunch of things, some of which were real risks and some of which are fearmongering science fiction. It painted a doomsday narrative that I don’t buy into.
I do think we should have regulation. These systems can be dangerous in prosaic, everyday ways: bias and misinformation, that kind of thing. But I don’t know if pausing research on them is the right way to go. Rather, we should know what data they’re trained on. We can’t just let companies like OpenAI effectively say: trust us, we know what we’re doing.
What messages would you like to impart to help people think about the risks and benefits of AI at the moment?
First, these systems are not yet reliable. Nor are they conscious. They’re not deciding to do anything that might be harmful to us. The real potential for harm is in humans using them, and therefore we do need to regulate them.
Second, just because we don’t understand precisely how they work yet doesn’t mean they’re magic. It’s just that they’re very complex. We will be able to understand them. We just need to do the science, and to do the science we need these systems not to be entirely in the hands of for-profit corporations.
These language models offer a great opportunity to deepen our understanding of cognition. We can learn a lot from them about ourselves, about how human intelligence works and how intelligence more generally might work in diverse ways. But at the same time, we must be aware of all the dangers, risks and issues that are involved with deploying them in the real world.