Ranking AI: Professor Kate Larson wins Best Paper Award at AAMAS | Cheriton School of Computer Science

Renowned AI researcher Professor Kate Larson has won the Best Paper Award at the International Conference on Autonomous Agents and Multiagent Systems (AAMAS).

Established in 2002, AAMAS is the world’s leading conference for research in AI, autonomous agents and multiagent systems. Every year, it brings together researchers and practitioners from around the world to discuss the latest developments in agent technology. This year’s conference took place in Detroit, Michigan, from May 19 to May 23, 2025.

Professor Larson and her colleagues at Google DeepMind, University of Montreal, and Meta (Marc Lanctot, Michael Kaisers, Quentin Berthet, Ian Gemp, Manfred Diaz, Roberto-Rafael Maura-Rivero, Yoram Bachrach, Anna Koop and Doina Precup) were recognized for their research paper, Soft Condorcet Optimization for Ranking of General Agents.

“What I liked about this work is that we are drawing on ideas from social choice theory, ranking, and optimization to create general-purpose, scalable and principled evaluation methods for AI systems and agents,” says Professor Larson.

Photo caption: Inspired by social choice theory, Professor Kate Larson co-created a scheme that can rank AI agents with high accuracy.

Over the past decade, we have seen rapid leaps in AI’s ability to think, write, and reason. Google DeepMind’s AlphaGo defeated the world champion at Go, one of the hardest and most complex board games. Likewise, Google’s AlphaFold can accurately predict protein structures within minutes, when traditional methods would take years or even decades.

These advancements were driven by the creation of benchmarks that can be used to train and compare AI agents. For example, ImageNet, a database of more than 14 million images, helped propel deep learning, particularly in object detection and image recognition.

Unfortunately, evaluating agents can be difficult because an agent’s performance can vary across tasks and benchmarks, and different agents may be evaluated on different tasks. To aggregate the agents’ results, researchers have created evaluation methods based on classical rating systems like Elo. However, Elo-based systems have a number of limitations. One involves a natural concept from social choice theory, the Condorcet winner: an agent that is judged better than every other agent in a head-to-head comparison. Elo-based systems can, and often do, rank other agents above a Condorcet winner, leading to unintuitive rankings.
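To make the idea of a Condorcet winner concrete, here is a minimal Python sketch. It assumes each evaluation result is given as a "vote", meaning a list of agents ordered from best to worst. The function name condorcet_winner and the sample votes are hypothetical and chosen only for illustration; they do not come from the paper.

```python
from itertools import combinations


def condorcet_winner(votes):
    """Return the agent that wins every head-to-head comparison, or None.

    Each vote is a list of agents ordered from best to worst (an assumed
    input format for this illustration).
    """
    agents = sorted({agent for vote in votes for agent in vote})
    # wins[(a, b)] counts the votes that rank a above b.
    wins = {(a, b): 0 for a in agents for b in agents if a != b}
    for vote in votes:
        for i, j in combinations(range(len(vote)), 2):
            wins[(vote[i], vote[j])] += 1
    for a in agents:
        # a must beat every other agent by a strict majority.
        if all(wins[(a, b)] > wins[(b, a)] for b in agents if b != a):
            return a
    return None


# Hypothetical votes: A is ranked first in only two of five votes,
# yet it beats both B and C in pairwise comparisons.
example_votes = [
    ["A", "B", "C"],
    ["A", "C", "B"],
    ["B", "A", "C"],
    ["C", "A", "B"],
    ["B", "A", "C"],
]
print(condorcet_winner(example_votes))
```

A Condorcet winner does not always exist. When preferences form a cycle (A beats B, B beats C, C beats A), the function returns None.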

To address these problems, Professor Larson and her team developed a new ranking scheme inspired by social choice theory: a framework that explores how individual preferences can be combined to make collective decisions. Some real-life examples include voting systems and resource allocation.

In particular, they were influenced by earlier research that interprets voting rules as maximum likelihood estimators. Maximum likelihood estimation is a technique for estimating the unknown parameters of a statistical model, which is key when working with incomplete datasets.

Their system, Soft Condorcet Optimization (SCO), treats the evaluation data as “votes” and assigns each agent a “score.” The scores act as the model’s parameters. SCO then measures how well the scores agree with the votes using a differentiable loss function. Because the loss is differentiable, SCO can adjust the scores to minimize discrepancies with the votes. Finally, SCO produces a ranking by sorting the agents by their scores.
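The steps described above can be sketched in a few lines of Python. This is an illustrative sketch, not the team's implementation: the article says only that the loss is differentiable, so the sigmoid loss, the temperature parameter, the plain gradient descent loop and all hyperparameter values below are assumptions. The function name soft_condorcet_sketch and the sample votes are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def soft_condorcet_sketch(votes, steps=2000, learning_rate=0.1, temperature=1.0):
    """Fit one score per agent so that the scores agree with the votes.

    Assumed loss for a pair (winner, loser) taken from a vote:
        sigmoid((score[loser] - score[winner]) / temperature)
    It is close to 0 when the winner's score is well above the loser's
    and close to 1 when the order is reversed.
    """
    agents = sorted({agent for vote in votes for agent in vote})
    scores = {agent: 0.0 for agent in agents}
    # Every vote (best to worst) implies a set of (winner, loser) pairs.
    pairs = [
        (vote[i], vote[j])
        for vote in votes
        for i in range(len(vote))
        for j in range(i + 1, len(vote))
    ]
    for _ in range(steps):
        grad = {agent: 0.0 for agent in agents}
        for winner, loser in pairs:
            s = sigmoid((scores[loser] - scores[winner]) / temperature)
            g = s * (1.0 - s) / temperature  # derivative w.r.t. the loser's score
            grad[loser] += g
            grad[winner] -= g
        for agent in agents:
            # Gradient descent step on the mean loss over all pairs.
            scores[agent] -= learning_rate * grad[agent] / len(pairs)
    # Final ranking: sort agents by score, highest first.
    ranking = sorted(agents, key=lambda agent: scores[agent], reverse=True)
    return ranking, scores


# Hypothetical votes, each listing agents from best to worst.
example_votes = [
    ["A", "B", "C"],
    ["A", "C", "B"],
    ["B", "A", "C"],
    ["C", "A", "B"],
    ["B", "A", "C"],
]
ranking, scores = soft_condorcet_sketch(example_votes)
print(ranking)
```

In these sample votes, A wins a majority of its pairwise comparisons against both B and C, so the fitted scores place A first. Because each vote only contributes the pairs it actually contains, the same loop also accepts votes that rank only a subset of the agents, which is one way to read the article's point about missing data.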

The team evaluated SCO with positive results. Compared to classical voting and rating systems, SCO can effectively identify the Condorcet winner and has a low approximation error, even when more than half of the data is missing. The team also investigated whether SCO’s rankings can accurately predict human game outcomes, using a held-out data set of over 31,000 Diplomacy games played by around 53,000 players. Surprisingly, SCO’s ratings came closer to the optimal ranking than the leading existing methods.

Overall, SCO can outperform state-of-the-art systems and provides an innovative and credible way to evaluate AI agents. With this new system, Professor Larson and her colleagues are helping researchers train the next wave of AI agents that could solve the world’s most pressing challenges.

The team’s research, Soft Condorcet Optimization for Ranking of General Agents, was published in the Proceedings of the 24th AAMAS.