Building a world run by a billion micro LLMs may be harder than you think
The 1950 Japanese film Rashomon, directed by Akira Kurosawa, explores an unsettling crime in the forest, narrated from the perspectives of a samurai, his wife, and a bandit. Each character’s version of events conflicts with the others, yet each is plausible. The audience is left grappling with the impossibility of determining a single “true” account of what happened. This phenomenon, now known as the Rashomon Effect, reveals a profound truth about human perception: when people witness the same event, they process and interpret it differently based on personal biases, cultural contexts, and flawed memories.
In the context of artificial intelligence, the Rashomon Effect has a striking parallel: different AI systems, particularly large language models (LLMs), can provide distinct interpretations of the same input based on their training data, architectures, and objectives. While this diversity in perspective is not inherently problematic, it raises a critical question: if we replace our current software ecosystem with a vast network of interconnected LLMs, how will these systems resolve their differing “realities”? If humans, after millennia of civilization, struggle to agree on a unified understanding of truth, can we expect millions of LLMs to perform any better?
The Rashomon Effect demonstrates that even when individuals encounter the same reality, their interpretations vary. This idea extends seamlessly to machine learning, where different models trained on the same dataset can achieve similar performance while representing the underlying data in fundamentally different ways. For instance, a logistic regression model might approximate the data with a hyperplane, while a random forest partitions it into hierarchical decision boundaries, and a neural network learns a highly non-linear mapping. Despite their differences, these models often yield comparable accuracies on key metrics like precision, recall, or AUC.
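The same effect is easy to reproduce on a toy problem. The sketch below is a minimal illustration using scikit-learn; the synthetic dataset and hyperparameters are arbitrary choices, not a specific benchmark. It trains the three model families mentioned above on identical data: their aggregate accuracies land close together, yet they disagree on individual predictions.

```python
# Three models, same data, similar accuracy, different "interpretations".
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=2000, n_features=20, n_informative=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=200, random_state=0),
    "neural_network": MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=500, random_state=0),
}

predictions = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    predictions[name] = model.predict(X_test)
    print(f"{name}: accuracy = {model.score(X_test, y_test):.3f}")

# Comparable aggregate accuracy, yet the models disagree on individual points.
# This collection of near-equally-good but different models is often called a
# "Rashomon set" in the interpretability literature.
disagreement = np.mean(predictions["logistic_regression"] != predictions["random_forest"])
print(f"logistic regression vs. random forest disagree on {disagreement:.1%} of test points")
```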
This phenomenon is not limited to simple classification tasks. In the realm of LLMs such as GPT, BERT, and PaLM, each model is trained on a different data distribution, preprocessing pipeline, and optimization scheme. These variations lead to subtle but real differences in how the models interpret and generate text.
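The divergence is easy to observe directly. The following sketch assumes the Hugging Face transformers library and uses two small, publicly available models (GPT-2 and DistilGPT-2, chosen purely for convenience); given the same prompt with sampling enabled, they routinely produce different continuations, and so do repeated runs of the same model.

```python
# Two small language models continuing the same prompt in different ways.
# Outputs are sampled and therefore non-deterministic.
from transformers import pipeline

prompt = "The most important quality of a trustworthy assistant is"

for model_name in ["gpt2", "distilgpt2"]:
    generator = pipeline("text-generation", model=model_name)
    result = generator(prompt, max_new_tokens=30, do_sample=True, temperature=0.8)
    print(f"--- {model_name} ---")
    print(result[0]["generated_text"])
```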
Today’s software systems largely rely on structured architectures, such as CRUD (Create, Read, Update, Delete) apps, which operate deterministically within predefined rules. However, imagine a future where this tech stack is replaced by LLMs. Instead of explicit programming logic, we would have LLMs dynamically generating responses, making decisions, and adapting to user input. This shift would fundamentally alter how applications function, with LLMs operating not as static tools but as adaptive, conversational entities.
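To make the contrast concrete, here is a deliberately simplified sketch of the two styles. The CRUD handler encodes its rules explicitly and deterministically; the LLM-backed handler delegates the decision to a model call, where `call_llm` is a hypothetical placeholder for whatever inference API such a system would actually use, not a real library function.

```python
from dataclasses import dataclass

@dataclass
class Order:
    order_id: int
    status: str

def update_order_crud(order: Order, new_status: str) -> Order:
    # Classic CRUD logic: explicit rules, same input always yields the same output.
    allowed = {"pending": {"shipped", "cancelled"}, "shipped": {"delivered"}}
    if new_status not in allowed.get(order.status, set()):
        raise ValueError(f"illegal transition {order.status} -> {new_status}")
    order.status = new_status
    return order

def update_order_llm(order: Order, user_request: str, call_llm) -> Order:
    # LLM-backed version: the transition rules live implicitly in the model,
    # so two different models (or two runs) may decide differently.
    prompt = (
        f"Order {order.order_id} is currently '{order.status}'. "
        f"The user says: '{user_request}'. Reply with the new status only."
    )
    order.status = call_llm(prompt).strip().lower()
    return order
```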
While this vision is exciting, it also introduces a major complication: how will millions of independent LLMs interact, collaborate, and align their perspectives? Each LLM, shaped by its unique training data and design choices, will inevitably have its own “interpretation” of reality. Without a central mechanism to enforce coherence, these models could generate conflicting outputs, leading to inefficiencies, disagreements, or even systemic failures.
One of the most significant challenges in this envisioned future is the absence of a single source of truth. In a traditional software stack, the underlying database or logic layer provides a consistent foundation for all operations. In contrast, LLMs operate probabilistically, generating responses based on patterns in their training data rather than deterministic rules. This flexibility allows for creativity and adaptability but comes at the cost of consistency.
The idea of managing millions of LLMs, each with its own perspective, is analogous to governing a society composed of diverse individuals and groups. Just as human societies grapple with cultural, political, and philosophical differences, an ecosystem of LLMs would need mechanisms to mediate conflicts and align objectives. However, unlike humans, who can rely on shared values, emotions, and social norms, LLMs lack inherent moral frameworks. Their “beliefs” are entirely derived from the data they are trained on, which may be incomplete, biased, or contradictory.
Consider a scenario in healthcare where different LLMs manage diagnosis, treatment recommendations, and patient communication. If these models disagree on a critical issue—such as whether a patient should undergo surgery—the resulting confusion could jeopardize lives. Such situations demand not only technical solutions but also ethical frameworks to ensure accountability and transparency.
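One possible technical piece of such a framework is to treat disagreement as a first-class outcome rather than papering over it. The sketch below is an illustrative protocol, not an established one: poll several models, act only when a qualified majority agrees, and escalate to a human otherwise. The two-thirds quorum and the stand-in models are arbitrary assumptions for demonstration.

```python
# Reconcile answers from several models; escalate when no quorum is reached.
from collections import Counter
from typing import Callable, List, Optional

def reconcile(question: str, models: List[Callable[[str], str]], quorum: float = 2 / 3) -> Optional[str]:
    answers = [model(question) for model in models]
    answer, count = Counter(answers).most_common(1)[0]
    if count / len(answers) >= quorum:
        return answer  # enough models agree: treat as the working "truth"
    return None        # no consensus: escalate to a human rather than pick arbitrarily

# Example with stand-in "models" that return canned answers.
models = [lambda q: "surgery", lambda q: "surgery", lambda q: "wait and monitor"]
print(reconcile("Should the patient undergo surgery?", models))  # -> "surgery"
```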
Ultimately, the question is not whether we can build a world dominated by LLMs, but whether we can build a world where these systems coexist productively, transparently, and in alignment with human values.