Facebook's new benchmarking system asks humans to interrogate AIs

What better way to test an NLP algorithm than by asking it questions?


Benchmarking is a crucial step in developing ever more sophisticated artificial intelligence. It provides a helpful abstraction of an AI’s capabilities and gives researchers a firm sense of how well the system performs on specific tasks. But benchmarks are not without their drawbacks. Once an algorithm masters the static dataset behind a given benchmark, researchers have to undertake the time-consuming process of developing a new one to push the AI further. And as AIs have improved, researchers have had to build those new benchmarks with increasing frequency. As a Thursday Facebook post points out, “While it took the research community about 18 years to achieve human-level performance on MNIST and about six years to surpass humans on ImageNet, it took only about a year to beat humans on the GLUE benchmark for language understanding.”

What’s more, these benchmarks might contain biases that an algorithm can exploit to inflate its score -- such as image recognition AIs ignoring the subtle contextual differences between “how much” and “how many” and simply answering “2.” So Facebook’s AI Research (FAIR) lab has taken a new approach to benchmarking: it has put humans in the loop to help train its natural language processing (NLP) AIs directly and dynamically.

The idea is simple: if an NLP model is designed to converse with humans, what better way to see how well it performs than by talking to it? Dubbed Dynabench (as in “dynamic benchmarking”), the system relies on people to ask a series of NLP algorithms probing, linguistically challenging questions in an effort to trip them up. The less often an algorithm can be fooled, the better it is at its job.

What’s more, this dynamic benchmarking system is largely unaffected by the issues that plague static benchmarks. “The process cannot saturate, it will be less prone to bias and artifacts, and it allows us to measure performance in ways that are closer to the real-world applications we care most about,” FAIR researcher Douwe Kiela wrote in the post.

“The nice thing about Dynabench is that if a bias exists in previous rounds and people find a way to exploit these models…” Kiela told Engadget, “we collect a lot of examples that can be used to train the model so that it doesn't make that mistake anymore.”
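To make that collect-and-retrain idea concrete, here’s a minimal sketch of a dynamic benchmarking round in Python. It is an illustration of the general approach Kiela describes, not Dynabench’s actual code: the names (get_human_example, predict, retrain) are hypothetical placeholders.

```python
# Minimal sketch of dynamic adversarial data collection: humans write probing
# questions, the examples that fool the model are saved, and the model is
# retrained on them for the next round. All names here are hypothetical.

def dynamic_benchmark_round(model, annotators, retrain, num_examples=1000):
    """Collect human-written examples that fool the model, then retrain on them."""
    fooling_examples = []
    for _ in range(num_examples):
        # A human annotator writes a probing question plus the answer they expect.
        question, expected_answer = annotators.get_human_example()

        # The model answers; if it's wrong, the human has successfully fooled it.
        prediction = model.predict(question)
        if prediction != expected_answer:
            fooling_examples.append((question, expected_answer))

    # The collected mistakes become new training data, so the next round's
    # model is harder to trip up in the same way.
    model = retrain(model, fooling_examples)
    return model, fooling_examples
```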

What’s really cool is that anyone can give Dynabench a try; it’s open to the public. Users simply have to log into the Dynabench portal to start chatting (via text, of course) with a group of NLP models; no experience is required beyond a basic grasp of the English language. Moving forward, Kiela and his team hope to expand the system’s capabilities with more models, more modalities, and additional languages.