The Future of Vision-Language Modeling: Insights from Wenqi Shao’s Groundbreaking Research

In a paper published on arXiv on August 7, 2023, Wenqi Shao and colleagues present their assessment of large vision-language models (LVLMs) using a new benchmark called Tiny LVLM-eHub.

One of the main models they evaluated is Google’s Bard. Bard performed well in most areas but struggled in a category called “object hallucination,” meaning it sometimes describes objects that are not actually in the picture. Details:

  • On Tiny LVLM-eHub, Bard’s object-hallucination rate was 15.8%, notably higher than the 9.6% average of the other LVLMs.
  • On the COCO-Captions test, Bard’s error rate was 20.4%, compared with an average of 12.7% for the other large vision-language models.
  • On the Flickr30K-Captions test, Bard’s error rate was 18.6%, while the other models averaged around 11.9%.

This discrepancy in Bard’s performance is suspected to arise from training on “noisy” datasets with potentially flawed labels. The solution? Enhancing data quality and model robustness.
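
For intuition, here is a minimal sketch of how an object-hallucination rate can be measured: count how often a generated caption mentions an object that is not actually in the image. This is a simplified check in the spirit of CHAIR-style metrics, not the paper’s exact protocol, and the captions and object lists in the example are toy data.

```python
# Minimal sketch of an object-hallucination rate (in the spirit of
# CHAIR-style metrics). NOT the paper's exact protocol: ground-truth
# object sets would normally come from annotations such as COCO labels.

def hallucination_rate(captions, gt_objects, vocabulary):
    """Fraction of mentioned objects that are not present in the image.

    captions   : list[str]      -- one generated caption per image
    gt_objects : list[set[str]] -- objects actually present in each image
    vocabulary : set[str]       -- object names we look for in the text
    """
    hallucinated, mentioned = 0, 0
    for caption, truth in zip(captions, gt_objects):
        words = set(caption.lower().split())
        for obj in vocabulary & words:   # objects the caption mentions
            mentioned += 1
            if obj not in truth:         # mentioned, but not in the image
                hallucinated += 1
    return hallucinated / mentioned if mentioned else 0.0

# Toy example: the caption invents a "dog" that is not in the image.
captions = ["a dog sitting next to a cat on a sofa"]
gt_objects = [{"cat", "sofa"}]
vocabulary = {"dog", "cat", "sofa", "car"}
print(f"hallucination rate: {hallucination_rate(captions, gt_objects, vocabulary):.1%}")
```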

The paper also proposes a new way to evaluate these models, called ChatGPT Ensemble Evaluation (CEE). Rather than matching words in a model’s answer against a reference, CEE uses ChatGPT with an ensemble of prompts to judge the answers, and its scores agree more closely with human judgments.
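
The exact prompts and aggregation rules of CEE are described in the paper; the snippet below is only a rough sketch of the general idea of prompt-ensemble judging. `ask_judge` is a hypothetical stand-in for a ChatGPT API call, and the prompt templates are illustrative placeholders, not the authors’ prompts.

```python
# Rough sketch of prompt-ensemble judging in the spirit of CEE.
# `ask_judge` is a hypothetical stand-in for a ChatGPT API call; the
# prompt templates are placeholders, not the authors' actual prompts.

PROMPT_TEMPLATES = [
    "Question: {q}\nReference: {ref}\nAnswer: {pred}\nIs the answer correct? Reply yes or no.",
    "Given the reference '{ref}', does '{pred}' correctly answer '{q}'? Reply yes or no.",
    "Judge strictly: is '{pred}' semantically equivalent to '{ref}' for question '{q}'? Reply yes or no.",
]

def ask_judge(prompt: str) -> str:
    """Stand-in for a ChatGPT request; always answers 'yes' so the sketch runs.
    In practice this would send `prompt` to the ChatGPT API and return its reply."""
    return "yes"

def cee_score(question: str, reference: str, prediction: str) -> float:
    """Fraction of judge prompts that vote the prediction correct (0.0 to 1.0)."""
    votes = [
        ask_judge(t.format(q=question, ref=reference, pred=prediction)).strip().lower()
        for t in PROMPT_TEMPLATES
    ]
    return votes.count("yes") / len(votes)

print(cee_score("What animal is in the image?", "a tabby cat", "a cat"))
```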

Some of the results they found:

  • On Tiny LVLM-eHub, CEE recorded a 0.76 correlation with human assessments, surpassing the word-matching method’s 0.64.
  • CEE scored 0.78 on COCO-Captions, while word matching trailed at 0.66.
  • On Flickr30K-Captions, CEE reached 0.77, ahead of word matching’s 0.65.
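
For context, agreement with human assessments, as in the numbers above, is typically reported as a correlation coefficient between the metric’s scores and human ratings. The snippet below shows how such a correlation can be computed with NumPy; the score lists are made-up toy data, and the specific correlation measure used by the paper is defined there.

```python
# Illustration: correlating automatic metric scores with human ratings.
# The score lists below are made-up toy data, not results from the paper.
import numpy as np

human_scores  = np.array([1, 0, 1, 1, 0, 1, 0, 1])                  # human judgments (1 = correct)
metric_scores = np.array([0.9, 0.2, 0.8, 0.7, 0.4, 1.0, 0.1, 0.6])  # automatic metric scores

# Pearson correlation coefficient between the two score vectors.
correlation = np.corrcoef(human_scores, metric_scores)[0, 1]
print(f"correlation with human assessment: {correlation:.2f}")
```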

Because CEE judges answers by meaning rather than exact word overlap, it copes better with the open-ended, free-form responses that LVLMs produce, making it a valuable evaluation approach for future benchmarks.

Diving into model performance:

  • Bard impressively achieved 0.83 in visual perception and 0.81 in visual reasoning, overshadowing the average LVLM scores of 0.76 and 0.77.
  • Yet, in object hallucination and embodied intelligence, Bard’s 0.58 and 0.59 scores were bested by the LVLM averages of 0.68 and 0.66.
  • CLIP-ViT-B/32 excelled in knowledge acquisition (0.82) and commonsense (0.79) but stumbled in perception (0.72) and reasoning (0.73).
  • DALL-E, with a strong 0.71 in object hallucination, lagged slightly in perception and knowledge acquisition at 0.74 and 0.75.

This paper’s insights underline a clear message: combining the strengths of different LVLMs may be the key to pushing the boundaries of multimodal AI performance. Through innovative approaches like CEE and rigorous evaluations, Wenqi Shao’s team sets the stage for transformative future research in vision-language modeling.
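
As a toy illustration of what combining strengths could look like in practice (this routing scheme is our own example, not something proposed in the paper), one could dispatch each task category to whichever model scores highest on it, reusing the per-category scores quoted above:

```python
# Toy illustration of "combining the strengths of different LVLMs":
# route each task category to the model that scores highest on it.
# The numbers reuse the per-category scores quoted above; the routing
# scheme itself is only an illustrative example.

CATEGORY_SCORES = {
    "visual perception":     {"Bard": 0.83, "CLIP-ViT-B/32": 0.72, "DALL-E": 0.74},
    "visual reasoning":      {"Bard": 0.81, "CLIP-ViT-B/32": 0.73},
    "object hallucination":  {"Bard": 0.58, "DALL-E": 0.71},
    "knowledge acquisition": {"CLIP-ViT-B/32": 0.82, "DALL-E": 0.75},
}

def pick_model(category: str) -> str:
    """Return the best-scoring model for a given task category."""
    scores = CATEGORY_SCORES[category]
    return max(scores, key=scores.get)

for category in CATEGORY_SCORES:
    print(f"{category}: route to {pick_model(category)}")
```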

Reference paper: arXiv:2308.03729 (arxiv.org)

Publication date: 7th August 2023
