Emergent abilities of large language models are a mirage

The original version of this story appeared in Quanta Magazine.

Two years ago, in a project called the Beyond the Imitation Game benchmark, or BIG-bench, 450 researchers compiled a list of 204 tasks designed to test the capabilities of large language models, which power chatbots like ChatGPT. On most tasks, performance improved predictably and smoothly as the models scaled up: the larger the model, the better it got. But on other tasks, the improvement was anything but smooth. Performance remained near zero for a while, then it jumped. Other studies found similar leaps in ability.

The authors described this as “breakthrough” behavior; other researchers have likened it to a phase transition in physics, as when liquid water freezes into ice. In a paper published in August 2022, researchers noted that these behaviors are not only surprising but unpredictable, and that they should inform the evolving conversation around the safety, potential, and risk of AI. They called the abilities “emergent,” a word that describes collective behaviors that appear only once a system reaches a high level of complexity.

But the picture may not be so simple. A new paper by a trio of researchers at Stanford University posits that the sudden appearance of these abilities is just a consequence of the way researchers measure LLM performance. The abilities, they argue, are neither unpredictable nor sudden. “The transition is much more predictable than people think,” said Sanmi Koyejo, a computer scientist at Stanford and lead author of the paper. “Strong claims of emergence have as much to do with how we choose to measure as they do with what the models are doing.”

We are only now seeing and studying this behavior because of how large these models have become. Large language models are trained by analyzing enormous data sets of text (words from online sources, including books, web searches, and Wikipedia) and finding links between words that often appear together. Size is measured in terms of parameters, roughly analogous to all the ways words can be connected: the more parameters, the more connections an LLM can find. GPT-2 had 1.5 billion parameters, while GPT-3.5, the LLM that powers ChatGPT, uses 350 billion. GPT-4, which debuted in March 2023 and now underlies Microsoft Copilot, reportedly uses 1.75 trillion.
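As a loose illustration of what “links between words that often appear together” means, consider the toy sketch below. It is a deliberate simplification: real LLMs learn billions of parameters by gradient descent on next-token prediction rather than by counting word pairs, and the sample sentence here is invented for illustration.

```python
from collections import Counter

# Toy illustration only: real LLMs learn weights by gradient descent
# on next-token prediction, not by counting. This bigram counter just
# makes the "links between words that appear together" idea concrete.
text = "the cat sat on the mat and the cat slept"
words = text.split()
bigrams = Counter(zip(words, words[1:]))

print(bigrams.most_common(3))
# [(('the', 'cat'), 2), ...]  -- "the" and "cat" are strongly linked
```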

That rapid growth has brought an astonishing surge in performance and efficacy, and no one disputes that large enough LLMs can complete tasks that smaller models cannot, including tasks they weren’t trained to perform. The Stanford trio who called emergence a “mirage” acknowledge that LLMs become more effective as they scale up; in fact, the added complexity of larger models should make it possible to improve on more difficult and diverse problems. But they argue that whether this improvement looks smooth and predictable or jagged and sharp is a consequence of the choice of metric (or even a paucity of test examples) rather than of the model’s inner workings.
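That metric argument can be illustrated with a small simulation. What follows is a minimal sketch, not the Stanford paper’s actual analysis: the smoothly rising per-token accuracy curve and the five-token answer length are invented assumptions. Under a lenient per-token metric, ability grows gradually with scale; under an all-or-nothing exact-match metric, the same underlying ability appears to jump suddenly.

```python
import numpy as np

# Illustrative assumption: per-token accuracy improves smoothly
# (logistically) with log model size. All numbers are made up.
params = np.logspace(8, 12, 9)            # 1e8 .. 1e12 parameters
per_token_acc = 1 / (1 + np.exp(-1.8 * (np.log10(params) - 10)))

# Exact match on a 5-token answer requires every token to be right,
# so its score is per_token_acc ** 5: near zero for small models,
# then an apparently sudden jump at larger scales.
answer_len = 5
exact_match = per_token_acc ** answer_len

for n, tok, em in zip(params, per_token_acc, exact_match):
    print(f"{n:>10.0e} params  per-token {tok:5.2f}  exact-match {em:5.2f}")
```

Because exact match multiplies per-token accuracy across every token of the answer, small models score indistinguishably from zero until the per-token curve clears a threshold, at which point the measured ability seems to “emerge.”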
