Continuing on its open source tear, Meta today released a new AI benchmark, FACET, designed to evaluate the "fairness" of AI models that classify and detect things in photos and videos, including people.
Made up of 32,000 images containing 50,000 people labeled by human annotators, FACET -- a tortured acronym for "FAirness in Computer Vision EvaluaTion" -- accounts for classes related to occupations and activities like "basketball player," "disc jockey" and "doctor" in addition to demographic and physical attributes, allowing for what Meta describes as "deep" evaluations of biases against those classes.
"By releasing FACET, our goal is to enable researchers and practitioners to perform similar benchmarking to better understand the disparities present in their own models and monitor the impact of mitigations put in place to address fairness concerns," Meta wrote in a blog post shared with TechCrunch. "We encourage researchers to use FACET to benchmark fairness across other vision and multimodal tasks."
Certainly, benchmarks to probe for biases in computer vision algorithms aren't new. Meta itself released one several years ago to surface age, gender and skin tone discrimination in both computer vision and audio machine learning models. And a number of studies have been conducted on computer vision models to determine whether they're biased against certain demographic groups. (Spoiler alert: they usually are.)
Then, there's the fact that Meta doesn't have the best track record when it comes to responsible AI.
Late last year, Meta was forced to pull an AI demo after it wrote racist and inaccurate scientific literature. Reports have characterized the company's AI ethics team as largely toothless and the anti-AI-bias tools it's released as "completely insufficient." Meanwhile, academics have accused Meta of exacerbating socioeconomic inequalities in its ad-serving algorithms and of showing a bias against Black users in its automated moderation systems.
But Meta claims FACET is more thorough than any of the computer vision bias benchmarks that came before it -- able to answer questions like "Are models better at classifying people as skateboarders when their perceived gender presentation has more stereotypically male attributes?" and "Are any biases magnified when the person has coily hair compared to straight hair?"
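To make the first of those questions concrete, here is a minimal sketch of one way such a disparity could be measured: compute how often a classifier correctly recognizes a class for each demographic group, then compare the groups. This is an illustration, not Meta's evaluation code. The blog post does not spell out FACET's metric, so the use of per-group recall is an assumption, and the record layout, function names and group labels below are hypothetical and do not reflect FACET's actual annotation schema.

```python
from collections import defaultdict


def recall_by_group(records, target_class):
    """Return {group: recall} for one class.

    Each record is a hypothetical (true_class, predicted_class, group)
    tuple; this layout is an assumption, not FACET's file format.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for true_class, predicted_class, group in records:
        if true_class != target_class:
            continue  # only people who actually belong to the class count
        totals[group] += 1
        if predicted_class == target_class:
            hits[group] += 1
    return {group: hits[group] / totals[group] for group in totals}


def recall_gap(records, target_class, group_a, group_b):
    """Recall for group_a minus recall for group_b on target_class."""
    recalls = recall_by_group(records, target_class)
    return recalls[group_a] - recalls[group_b]


# Toy, made-up predictions. Group names are placeholders.
records = [
    ("skateboarder", "skateboarder", "group_a"),
    ("skateboarder", "skateboarder", "group_a"),
    ("skateboarder", "skateboarder", "group_a"),
    ("skateboarder", "other", "group_a"),
    ("skateboarder", "skateboarder", "group_b"),
    ("skateboarder", "other", "group_b"),
    ("doctor", "doctor", "group_b"),  # ignored: not the target class
]

recalls = recall_by_group(records, "skateboarder")
gap = recall_gap(records, "skateboarder", "group_a", "group_b")
print(recalls)
print(gap)
```

A gap near zero would suggest the model recognizes the class about equally well for both groups; a large positive or negative gap would flag a disparity worth investigating. With real data, each group would also need enough examples for the comparison to be meaningful.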
To create FACET, Meta had the aforementioned annotators label each of the 32,000 images for demographic attributes (e.g. the pictured person's perceived gender presentation and age group), additional physical attributes (e.g. skin tone, lighting, tattoos, headwear and eyewear, hairstyle and facial hair) and classes. They combined these labels with other labels for people, hair and clothing taken from Segment Anything 1 Billion, a Meta-designed dataset for training computer vision models to "segment," or isolate, objects and animals from images.
The images in FACET were sourced from Segment Anything 1 Billion, Meta tells me, and those images were in turn purchased from a "photo provider." But it's unclear whether the people pictured in them were made aware that the pictures would be used for this purpose. And -- at least in the blog post -- it's not clear how Meta recruited the annotator teams or what wages they were paid.
Historically and even today, many of the annotators employed to label datasets for AI training and benchmarking come from developing countries and have incomes far below the U.S. minimum wage. Just this week, The Washington Post reported that Scale AI, one of the largest and best-funded annotation firms, has paid workers at extremely low rates, routinely delayed or withheld payments and provided few channels for workers to seek recourse.
In a white paper describing how FACET came together, Meta says that the annotators were "trained experts" sourced from "several geographic regions" including North America (United States), Latin America (Colombia), Middle East (Egypt), Africa (Kenya), Southeast Asia (Philippines) and East Asia (Taiwan). Meta used a "proprietary annotation platform" from a third-party vendor, it says, and annotators were compensated "with an hour wage set per country."
Setting aside FACET's potentially problematic origins, Meta says that the benchmark can be used to probe classification, detection, "instance segmentation" and "visual grounding" models across different demographic attributes.
As a test case, Meta applied FACET to its own DINOv2 computer vision algorithm, which as of this week is available for commercial use. FACET uncovered several biases in DINOv2, Meta says, including a bias against people with certain gender presentations and a tendency to stereotypically identify pictures of women as "nurses."
"The preparation of DINOv2’s pre-training dataset may have inadvertently replicated the biases of the reference datasets selected for curation," Meta wrote in the blog post. "We plan to address these potential shortcomings in future work and believe that image-based curation could also help avoid the perpetuation of potential biases arising from the use of search engines or text supervision."
No benchmark is perfect. And Meta, to its credit, acknowledges that FACET might not sufficiently capture real-world concepts and demographic groups. It also notes that many depictions of professions in the dataset might've changed since FACET was created. For example, most doctors and nurses in FACET, photographed during the COVID-19 pandemic, are wearing more personal protective equipment than they would've before the health crisis.
"At this time we do not plan to have updates for this dataset," Meta writes in the whitepaper. "We will allow users to flag any images that may be objectionable content, and remove objectionable content if found."
In addition to the dataset itself, Meta has made available a web-based dataset explorer tool. To use it and the dataset, developers must agree not to train computer vision models on FACET -- only evaluate, test and benchmark them.