How Good are GPT-4 and GPT-4V at Solving Conceptual Puzzles? A Study Using the ConceptARC Benchmark

Imagine a world where machines can understand and reason with abstract concepts, such as objectness, numerosity, and geometry, just like humans can. Imagine that these machines can solve complex puzzles that require inducing and applying general rules from a few examples and that they can do so using both language and vision. This is the vision of artificial intelligence (AI) that motivates the research of Melanie Mitchell, Alessandro B. Palmarini, and Arseny Moskvichev, who published a paper on 14 November 2023 titled “Comparing Humans, GPT-4, and GPT-4V On Abstraction and Reasoning Tasks”.

The paper explores the abstract reasoning abilities of two versions of GPT-4, a large language model (LLM) that can generate natural-language text on a wide range of topics and tasks. The first version is text-only, meaning that it can only process and produce text. The second version, GPT-4V, is multimodal, meaning that it can also accept images as input. The authors use the ConceptARC benchmark, a collection of 480 analogy puzzles that test understanding of and reasoning with core-knowledge concepts such as top and bottom, inside and outside, and same and different. Each puzzle is given in both text and image formats, and the solver's challenge is to generate the new grid that results from applying an abstract rule to a test input, based on a few demonstration input-output pairs.
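To make the task format concrete, here is a minimal sketch of an ARC-style puzzle in Python, assuming the list-of-lists representation of grids as integer color codes used by the public ARC dataset; the toy task and the grid_to_text helper are illustrative inventions, not material from the paper.

```python
# Illustrative ARC-style task: each grid is a list of rows of integer color
# codes (0 used here as the background color). The toy rule in this example
# is simply "reflect the input grid left-to-right".
toy_task = {
    "train": [
        {"input": [[1, 0], [2, 0]], "output": [[0, 1], [0, 2]]},
        {"input": [[3, 0], [0, 4]], "output": [[0, 3], [4, 0]]},
    ],
    "test": [{"input": [[5, 0], [0, 6]]}],
}

def grid_to_text(grid):
    """Serialize a grid into the kind of plain-text form an LLM prompt might use."""
    return "\n".join(" ".join(str(cell) for cell in row) for row in grid)

# Print the demonstration pairs followed by the test input, as a solver would see them.
for pair in toy_task["train"]:
    print("Input:\n" + grid_to_text(pair["input"]))
    print("Output:\n" + grid_to_text(pair["output"]) + "\n")
print("Test input:\n" + grid_to_text(toy_task["test"][0]["input"]))
```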

The authors compare the performance of GPT-4 and GPT-4V with that of humans and special-purpose algorithms on the ConceptARC tasks. They use different prompting methods to communicate the tasks to the models, and they measure the accuracy of the models' responses. They also vary the temperature parameter, which controls the randomness of the text generation, using two settings: 0 and 0.5. At temperature 0 the output is close to deterministic, while a higher temperature produces more varied outputs.
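As a rough illustration of how such a one-shot, temperature-controlled query could look, here is a sketch using the OpenAI Python SDK's chat-completions interface; the prompt text and model name are placeholders, not the paper's actual prompt or configuration.

```python
# Minimal sketch of querying a chat model at two temperature settings.
# Assumes the OpenAI Python SDK (v1) chat-completions interface; the prompt
# below is a placeholder outline, not the paper's actual one-shot prompt.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

one_shot_prompt = (
    "You will solve a grid puzzle. First, a solved example:\n"
    "<example task and its solution here>\n\n"
    "Now apply the same kind of rule to this task:\n"
    "<demonstration input-output pairs and test input here>\n"
    "Answer with the output grid only."
)

for temperature in (0.0, 0.5):  # 0 = near-deterministic, 0.5 = more varied
    response = client.chat.completions.create(
        model="gpt-4",  # placeholder model name
        messages=[{"role": "user", "content": one_shot_prompt}],
        temperature=temperature,
    )
    print(f"temperature={temperature}:\n{response.choices[0].message.content}\n")
```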

The paper extends the work of Moskvichev et al. (2023), who created the ConceptARC benchmark and evaluated GPT-4 using simple, zero-shot prompts with text versions of the tasks. It also relates to other studies that have tested LLMs on subsets of the original ARC benchmark, a related but more difficult collection of tasks proposed by Chollet (2019) on which ConceptARC is based. The paper addresses two limitations of the previous work: (1) the prompt format used was overly simple and might not have communicated enough about the task; and (2) it might not be fair to compare humans, who are shown a visual version of each task, with LLMs, which are given a text-only version.

The paper contributes to the understanding of the strengths and weaknesses of LLMs in abstract reasoning, which is a key aspect of human intelligence and a desirable goal for AI systems. The paper also provides insights into the role of multimodality in reasoning and the challenges of designing fair and informative prompts for LLMs. The paper can help researchers and practitioners to better evaluate and improve the capabilities of LLMs, and to explore the potential applications and implications of LLMs in various domains that require abstract reasoning, such as education, science, art, and entertainment.

The paper reports the following results:

  • The text-only version of GPT-4 achieved an overall accuracy of 0.33 on the ConceptARC tasks when given a more detailed, one-shot prompt that includes both instructions and an example of a solved task. This is an improvement over the previous work, which reported an accuracy of 0.25 using a simple, zero-shot prompt, but it remains far below the 0.91 accuracy that humans achieved on the same tasks (a simple exact-match scoring sketch follows this list).
  • The multimodal version, GPT-4V, performed substantially worse than the text-only version, achieving an accuracy of 0.23 on the minimal tasks, which are the simplest instantiations of the concepts. This is surprising, given that humans see these tasks in a visual modality and that GPT-4V can take images as input. The authors suggest that GPT-4V might have difficulty mapping the visual grids to text representations, and that the inclusion of a solved example in the prompt might even have hurt performance.
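To make the reported numbers concrete, the sketch below computes exact-match accuracy over a set of tasks, under the standard ARC convention that a task counts as solved only if the predicted output grid equals the target grid exactly; this is an illustrative simplification, not the paper's full scoring protocol.

```python
# Exact-match accuracy over predicted vs. target output grids (illustrative:
# a task counts as solved only if the predicted grid equals the target grid).
def exact_match_accuracy(predictions, targets):
    assert len(predictions) == len(targets)
    solved = sum(1 for pred, tgt in zip(predictions, targets) if pred == tgt)
    return solved / len(targets)

# Toy check: 1 of 3 predicted grids matches its target, giving accuracy 0.33.
preds   = [[[0, 1]], [[2, 2]], [[3, 0]]]
targets = [[[0, 1]], [[2, 0]], [[0, 3]]]
print(round(exact_match_accuracy(preds, targets), 2))  # 0.33
```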

The paper concludes that neither version of GPT-4 has developed robust abstraction abilities at humanlike levels and that a large gap in basic abstract reasoning still remains between humans and state-of-the-art AI systems. The paper also discusses the limitations and future directions of the research, such as exploring other methods of prompting or task representation, testing other multimodal LLMs, and investigating the underlying mechanisms and representations of LLMs in abstract reasoning.

Reference:

Mitchell, M., Palmarini, A. B., & Moskvichev, A. (2023). "Comparing Humans, GPT-4, and GPT-4V On Abstraction and Reasoning Tasks." arXiv:2311.09247. https://arxiv.org/pdf/2311.09247.pdf
