In a paper posted to arXiv on August 7, 2023, Wenqi Shao and colleagues present their assessment of large vision-language models (LVLMs) using a new benchmark called Tiny LVLM-eHub.
One of the main models they evaluated is Bard, made by Google. Bard did well in most areas, but it struggled in a category called “object hallucination,” meaning it sometimes describes objects that are not actually in the picture.
This discrepancy in Bard’s performance is suspected to arise from training on “noisy” datasets with potentially flawed labels. The solution? Enhancing data quality and model robustness.
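For context on what an object-hallucination test can look like in practice, here is a minimal sketch of the kind of yes/no object-presence probing commonly used for this category. The `ask_lvlm` callable and the sample format are illustrative assumptions, not the paper’s actual harness.

```python
from typing import Callable

def hallucination_accuracy(
    samples: list[dict],
    ask_lvlm: Callable[[str, str], str],
) -> float:
    """Score yes/no object-presence probes.

    Each sample is {"image": path, "object": name, "present": bool}.
    Low accuracy on absent objects signals object hallucination.
    """
    correct = 0
    for s in samples:
        answer = ask_lvlm(s["image"], f'Is there a {s["object"]} in the image?')
        predicted_yes = answer.strip().lower().startswith("yes")
        correct += int(predicted_yes == s["present"])
    return correct / len(samples)

if __name__ == "__main__":
    # Toy stand-in "model" that always answers yes: it is fooled by every
    # absent-object probe, which is exactly the hallucination failure mode.
    always_yes = lambda image, question: "Yes."
    demo = [
        {"image": "img1.jpg", "object": "dog", "present": True},
        {"image": "img1.jpg", "object": "zebra", "present": False},
    ]
    print(hallucination_accuracy(demo, always_yes))  # 0.5
```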
The team also proposes a new way to test these models, called ChatGPT Ensemble Evaluation (CEE): rather than relying on simple word matching, CEE asks ChatGPT to judge each model answer under an ensemble of diverse prompts, which the authors report aligns much better with human evaluation.
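To make the idea concrete, here is a sketch of prompt-ensemble judging in the spirit of CEE: the same (question, reference, prediction) triple is scored under several prompt variants and the verdicts are majority-voted. The `judge` callable and the templates are illustrative assumptions; the paper’s actual prompts and aggregation details differ.

```python
from collections import Counter
from typing import Callable

def cee_verdict(
    question: str,
    reference: str,
    prediction: str,
    judge: Callable[[str], str],
    prompt_templates: list[str],
) -> bool:
    """Ask the judge once per prompt variant, then majority-vote.

    Each template needs {question}, {reference}, and {prediction}
    placeholders and should elicit a yes/no correctness verdict.
    """
    votes = [
        judge(
            t.format(question=question, reference=reference, prediction=prediction)
        ).strip().lower().startswith("yes")
        for t in prompt_templates
    ]
    return Counter(votes).most_common(1)[0][0]

if __name__ == "__main__":
    templates = [
        "Q: {question}\nReference: {reference}\nPrediction: {prediction}\n"
        "Is the prediction correct? Answer yes or no.",
        "Does '{prediction}' give the same answer as '{reference}' "
        "for '{question}'? Answer yes or no.",
    ]
    stub_judge = lambda prompt: "yes"  # replace with a real ChatGPT API call
    print(cee_verdict("What animal is shown?", "a dog", "a dog",
                      stub_judge, templates))  # True
```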
Among the results, they found that CEE’s judgments of the models’ predictions agree well with human assessments, which makes it a really useful tool for future tests.
Diving into model performance, the paper’s insights underline a clear message: combining the strengths of different LVLMs may be the key to pushing the boundaries of multimodal AI performance (one hypothetical way to do this is sketched below). Through innovative approaches like CEE and rigorous evaluations, Wenqi Shao’s team sets the stage for transformative future research in vision-language modeling.
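As a thought experiment on combining strengths, the sketch below routes each query to whichever model is strongest in that task category. This is a hypothetical composition strategy, not something the paper implements; the model names, categories, and routing table are placeholders.

```python
from typing import Callable

LVLM = Callable[[str, str], str]  # (image_path, question) -> answer

# Placeholder routing table: which model was strongest per category.
BEST_BY_CATEGORY: dict[str, str] = {
    "visual_reasoning": "model_a",
    "object_hallucination": "model_b",  # route around a known weak spot
}

def route(category: str, models: dict[str, LVLM],
          default: str = "model_a") -> LVLM:
    """Pick the model registered as strongest for this category."""
    return models[BEST_BY_CATEGORY.get(category, default)]

if __name__ == "__main__":
    models = {
        "model_a": lambda img, q: "answer from model_a",
        "model_b": lambda img, q: "answer from model_b",
    }
    answerer = route("object_hallucination", models)
    print(answerer("img1.jpg", "Is there a cat in the image?"))
```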
Reference paper: Tiny LVLM-eHub: Early Multimodal Experiments with Bard, arXiv:2308.03729 (https://arxiv.org/abs/2308.03729)
Publication date: 7th August 2023