Google Study Finds AI Chatbots Only 69% Accurate, Raising Reliability Concerns

Google has released new data showing that today's AI chatbots are less reliable than many users assume. In its latest evaluation using the FACTS Benchmark Suite (a new testing system designed to measure how often chatbot responses are actually correct), even the best-performing model reached only about 69% factual accuracy. In other words, top AI systems still make mistakes in nearly one out of every three responses.
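The "one in three" figure follows directly from the reported accuracy. As a quick check, here is a minimal Python sketch of that arithmetic; the `error_rate` helper is a hypothetical name introduced for illustration, and the only input is the 69% top score cited in this article.

```python
def error_rate(accuracy_pct: float) -> float:
    """Return the percentage of incorrect answers implied by an accuracy score."""
    return round(100.0 - accuracy_pct, 2)


top_accuracy = 69.0                 # top score reported in the article
err = error_rate(top_accuracy)      # 31.0
one_in_n = 100.0 / err              # about 3.2

print(f"error rate: {err:.0f}%, about 1 wrong answer in every {one_in_n:.1f}")
```

An error rate of 31% works out to about one wrong answer in every 3.2, which is why the result is commonly rounded to "one in three".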

Top Chatbots Tested

Google's benchmark put some of the most widely used chatbots to the test:

Gemini 3 Pro scored the highest, with 69% accuracy.
Other leading AI models, including Gemini 2.5 Pro and OpenAI’s ChatGPT-5, achieved lower accuracy scores.
Systems from developers like Anthropic (maker of Claude) and xAI scored well below the top performer.

Why Accuracy Matters

Most current AI tests focus on whether chatbots can complete a task. Very few check whether the answers are actually accurate. That means a chatbot can sound confident while still providing incorrect or misleading information. AI experts say this gap matters most in sectors such as healthcare, finance and law, where misinformation could have serious consequences if acted on without human review.

Benchmark Testing

The FACTS Benchmark Suite measures accuracy across four types of challenges:

Parametric knowledge - answers based only on what the model learned during training.
Search performance - how well the model uses search to find and retrieve factual information.
Grounded answers - sticking to the facts in a provided document.
Multimodal understanding - interpreting charts, diagrams and images.

In particular, multimodal tasks showed the lowest performance, with accuracy staying below 50%. Google’s findings back up a growing consensus in the industry: AI chatbots may be improving rapidly, but they cannot be blindly trusted as accurate and reliable sources of information. Experts recommend human oversight and verification, especially in professional or high-stakes situations.
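To make per-category scoring concrete, here is a minimal Python sketch of how accuracy could be tallied for each challenge type from a set of graded answers. The category labels follow the four challenge types listed above. The `graded` records and the `accuracy_by_category` helper are hypothetical placeholders for illustration, not Google's data or the benchmark's actual scoring code.

```python
from collections import defaultdict


def accuracy_by_category(graded):
    """Return {category: percent correct} from (category, is_correct) pairs."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for category, is_correct in graded:
        totals[category] += 1
        if is_correct:
            correct[category] += 1
    return {c: 100.0 * correct[c] / totals[c] for c in totals}


# Hypothetical graded answers, for illustration only (not real benchmark results).
graded = [
    ("parametric", True), ("parametric", False),
    ("search", True), ("search", True),
    ("grounded", True), ("grounded", False),
    ("multimodal", False), ("multimodal", True),
]

scores = accuracy_by_category(graded)
for category, pct in scores.items():
    print(f"{category}: {pct:.0f}%")
```

Reporting a separate figure for each category, rather than a single blended number, is what makes it possible to see that one type of task lags behind the others.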
