Stanford researchers explain why more documents make neural networks less effective

Stanislav Nikulin 15 April 2026 14:52

Scientists at Stanford University have demonstrated that as the number of documents in a retrieval system grows, their representations become harder to tell apart — a phenomenon they term “semantic collapse.” Neural networks that classify documents face a fundamental trade-off: the more similar data they must search through, the less accurately they can single out the right answer.

This occurs because the system converts each document into a vector (an embedding). While the dataset is small, the vectors form distinct clusters and search accuracy remains high. Beyond roughly 10,000 documents, however, the clusters begin to overlap, distances between vectors shrink, and all documents start to look very similar to one another.

As a result, the artificial intelligence fails to select relevant documents: at 50,000 documents, search accuracy drops by 87%. At that scale, semantic search underperforms traditional keyword search, and the likelihood of false, or “hallucinatory,” responses increases.

This research is crucial for advancing artificial intelligence and processing large text corpora, as it explains the limits of current neural networks’ effectiveness in semantic search.

“Semantic collapse” poses a challenge that researchers and developers must consider when building natural language processing systems to improve their accuracy and reliability in the future.
