LLMs outclass humans at predicting neuroscience results
Large language models (LLMs) outscored neuroscientists by between 13 and 22 percentage points (averaging 81.4% accuracy against the neuroscientists’ 63.4%) in predicting the outcomes of neuroscience experiments, says a new study published last month in Nature Human Behaviour.
The study was supervised by Bradley C Love, professor of cognitive and decision sciences in experimental psychology at University College London (UCL) and a fellow at the Alan Turing Institute (London).
Love’s co-author, UCL post-doctoral researcher Xiaoliang Luo, directed a team of 33 researchers that compared the results of 12 LLMs against 171 neuroscientists – including 43 faculty and academic staff, 43 postdoctoral researchers and 12 research scientists.
The study, titled “Large language models surpass human experts in predicting neuroscience results”, was not conceived as a 21st-century version of “The Ultimate Computer” (1968), an episode of the original Star Trek series in which the crew of the USS Enterprise tests a computer designed to replace them; when it turns murderous, they must force it to shut down.
Surge in publications
Rather, as Love explained to University World News, the impetus to develop LLMs that could effectively read scientific papers and predict useful areas of research comes from the fact that it is impossible for a researcher to keep up with the number of papers being published in neuroscience, let alone allied fields.
“There’s been an exponential growth in the size of scientific literature and humans cannot keep up with it. That’s bad because potentially important findings are getting lost in the flood of research.
“There’s a lot of relevant work out there. So, I thought we needed tools or systems that could integrate sets of papers that could take several lifetimes to read,” said Love.
Such systems would, Love further explained, be less biased in the sense that, like everybody else, scientists “have their preferred network of collaborators or people whose work they follow”.
Love noted: “The project was just one step in showing that these LLMs can synthesise the literature to make useful predictions. Once you have that, you can start building useful tools to help scientists do their job better.”
‘Hallucinations’
The number of ‘weights’, which can loosely be thought of as synaptic connections inside the so-called ‘black box’ of an LLM, is staggering. One of the three Galactica LLMs has 6.7 billion weights; a Llama 2 model and two Mistral models have 7 billion each. There are several Falcon LLMs with 40 billion and two with 180 billion.
Still, these LLMs are subject to ‘hallucinations’, a tendency to invent ‘information’: the model produces a response that fits a familiar pattern but is not factually true. For instance, an LLM reads an article about furniture and reports that it mentions “chair” when, in fact, the word never appears in the article.
In Love and Luo’s work, hallucinations are no bad thing. “We view this tendency to mix and integrate information from large and noisy data sets as a virtue. What is a hallucination in a backward-looking task is a generalisation or prediction in a forward-looking task,” they explain. A “forward-looking task” includes asking an LLM to predict the “results from a novel experiment”, they write.
‘BrainBench’
The LLMs and neuroscientists were tested on what Love and Luo call “BrainBench”, a benchmark they constructed. It consists of abstracts drawn from the Journal of Neuroscience, each paired with an altered version: 200 of the altered versions were crafted by human editors and 100 by GPT-4.
In all cases, the altered abstract had to remain syntactically correct and logically coherent, however factually wrong its results became after editing. For example, the roles of two brain regions could be switched in the results, or ‘decreases’ could be replaced by ‘increases’ and vice versa.
The abstract pairs were then presented to the 12 LLMs and 171 neuroscientists, whose task was to choose the correct one, that is, the original version. Care was taken to ensure that the neuroscientists were not presented with an ‘edited’ abstract in their own subfield, the subfields being neurobiology of disease, cellular-molecular, behavioural-cognitive and systems-circuits.
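To make the task concrete, the sketch below shows one way a BrainBench-style test item could be represented. The class name, field names and toy abstract text are illustrative assumptions rather than anything taken from the study’s materials; the essential structure is simply a published abstract paired with a coherent but factually altered version, labelled with its subfield.

```python
# Illustrative sketch only: the class and field names and the toy abstract text
# are hypothetical, not drawn from the BrainBench dataset itself.
from dataclasses import dataclass


@dataclass
class BrainBenchItem:
    """One two-alternative forced-choice test case."""
    original_abstract: str  # the published Journal of Neuroscience abstract
    altered_abstract: str   # coherent but factually wrong edit of the results
    subfield: str           # e.g. 'behavioural-cognitive', 'systems-circuits'


item = BrainBenchItem(
    original_abstract="... activity in region A increased while region B decreased ...",
    altered_abstract="... activity in region A decreased while region B increased ...",
    subfield="behavioural-cognitive",
)


def is_correct(chosen_abstract: str, item: BrainBenchItem) -> bool:
    """A response counts as correct only if the original version was chosen."""
    return chosen_abstract == item.original_abstract
```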
While, overall, neuroscientists chose correctly 63.4% of the time as against the LLMs’ average score of 81.4%, in some subfields humans performed significantly worse than some of the LLMs did.
Neuroscientists chose the correct version of abstracts on the neurobiology of disease only about 60% of the time, while the Falcon-40B LLM (with 40 billion weights) did so almost 100% of the time. For abstracts on development, plasticity and repair, about 62% of the humans’ choices were correct; fully seven of the 14 LLMs chose the correct abstract 100% of the time, while the other seven chose correctly more than 95% of the time.
The LLMs, Love and Luo discovered, share with humans a trait known as calibration: the more confident the neuroscientists reported being in their choices, the more likely those choices were to be accurate.
Accordingly, the neuroscientists’ accuracy rose roughly in step with their reported confidence: choices made with low confidence were correct only about 42% of the time, while choices made with high confidence were correct about 80% of the time.
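This calibration pattern can be checked with a few lines of code. The sketch below is a hypothetical illustration, not the study’s analysis: it assumes each response has been logged as a confidence score between 0 and 1 together with whether the choice was correct, and the number of bins is an arbitrary choice.

```python
# Hypothetical calibration check: bin responses by self-reported confidence
# (0-1) and compute the mean accuracy within each bin.
from statistics import mean


def calibration_curve(responses, n_bins=5):
    """responses: iterable of (confidence, was_correct) pairs."""
    bins = [[] for _ in range(n_bins)]
    for confidence, correct in responses:
        index = min(int(confidence * n_bins), n_bins - 1)
        bins[index].append(1.0 if correct else 0.0)
    return [mean(b) if b else None for b in bins]


# Toy data: if judgements are well calibrated, accuracy rises with confidence.
responses = [(0.15, False), (0.35, False), (0.55, True), (0.75, True), (0.95, True)]
print(calibration_curve(responses))
```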
For the LLMs, confidence was assessed with ‘perplexity’, a standard measure of how surprising a piece of text is to a model, that is, how poorly it fits the patterns the model was trained on. Lower perplexity means the text is less surprising to the model; higher perplexity means it is more surprising.
Accordingly, while there was some variation across models, in the vast majority of cases the LLMs’ accuracy likewise rose with their confidence.
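In code, this selection rule is simple to express. The sketch below is a minimal illustration using the Hugging Face transformers library with a small stand-in model (‘gpt2’), not the models or evaluation code from the study: each version of an abstract is scored by its perplexity, and the version the model finds less surprising is chosen. The gap between the two perplexity scores can be read as a rough proxy for the model’s confidence.

```python
# Minimal sketch, assuming the Hugging Face transformers library; 'gpt2' stands
# in for the much larger models that were actually benchmarked.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()


def perplexity(text: str) -> float:
    """Average surprise of the model over the text (lower = less surprising)."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        loss = model(**inputs, labels=inputs["input_ids"]).loss
    return torch.exp(loss).item()


def choose(original: str, altered: str) -> str:
    """Pick whichever version of the abstract the model finds less surprising."""
    return original if perplexity(original) <= perplexity(altered) else altered
```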
Against the ‘Zeitgeist’
As important as it is to know whether an LLM has correctly identified a promising line of investigation from among thousands of articles whose contents would otherwise remain unknown, Love explained that there is also great value in papers the LLM judges not to fit the known pattern for, say, behavioural-cognitive research, or to contain erroneous information.
“Scientists could use LLMs like these not only to find confirmatory information they would otherwise not see but also to challenge assumptions.
“A paper that is declared by the model to be systematically wrong (because it contains an unlikely pattern of results) could advance science in the sense that in trying to understand the author’s thinking, you might find something useful.
“Real breakthroughs,” Love underscored, “often come from going against the Zeitgeist.”