Nov. 16, 2024

Diversity in Datasets


Today, our host, Carter Considine, explores one of the toughest hurdles in the AI space: the reality that today’s algorithms are not only reinforcing, but even amplifying, age-old biases.

 

Carter unpacks cases such as that of Google's Gemini AI, which sparked outrage after generating controversial outputs echoing real-world racial and gender stereotypes. He dissects the implications of these biases for companies leading AI innovation and explains why we need transparency in AI model development, as well as more diverse datasets and revamped testing methodologies.

 

Our host also discusses potential solutions proposed by AI researchers, such as Reinforcement Learning from Human Feedback (RLHF) and Direct Preference Optimization (DPO). As AI continues to develop at a rapid pace, we hope that future breakthroughs in the space involve building more inclusive algorithms.

 

Key Topics:

  • How Cutting-Edge Generative AI Models Have Generated Biases (0:27)
  • Developments and Continued Limitations in Generative AI Models (2:12)
  • Technical: Under the Hood (3:37)
  • Why RLHF is Not Enough (5:14)
  • Moving Forward (6:50)

Summary: Cutting-edge generative AI models have demonstrated bias against specific communities, caused by a lack of diverse data within training sets. Although AI researchers have proposed several solutions, such as reinforcement learning from human feedback, more transparency in model development and training data is necessary to improve consumer understanding of the limits of AI.

 

The inability to address bias in machine learning systems has begun to impact companies on the frontier of generative AI development. Gemini, Google’s text-to-image AI generator, displayed glaring biases only three weeks after being deployed. For instance, when a user prompted it to generate pictures of Nazis, the resulting images included Black and Asian Nazis. When asked to draw a picture of the Pope, Gemini produced an Asian female Pope and a Black Pope.

 

Even after taking the tool down, Google could not guarantee that similar issues wouldn’t arise again. From Google’s official statement:

 

“It will make mistakes. As we’ve said from the beginning, hallucinations are a known challenge with all LLMs — there are instances where the AI just gets things wrong. This is something that we’re constantly working on improving.” 

 

Google claims it hadn’t tested Gemini properly beforehand, much to the surprise of researchers and industry analysts. An alternative interpretation of the mishap is that the model had been tested, but the mistakes it was making were never reported.

 

There are two likely causes for the Gemini incident: training data or prompt engineering gone awry. On the data side, many training datasets fail to reflect real-world diversity, so specific communities end up left out of, or misrepresented in, the algorithm’s output. Most existing open-source datasets for large language models (LLMs), for instance, are in English, which only recently changed with the release of the Aya Dataset, a human-curated instruction-following dataset spanning 65 languages.

 

Bias stemming from a lack of diverse datasets appeared well before the most recent generative AI boom, as The Algorithmic Justice League highlighted in the documentary Coded Bias, which showed facial recognition failing to process Black faces. The first striking case occurred in 2015, when Google Photos labeled a Black couple as gorillas; Google temporarily resolved the problem by removing the ‘gorilla’ tag from its categorization. The same problem resurfaced in 2021 at Meta (then still known as Facebook), when a user watching a video from a British tabloid featuring Black men saw an automated prompt asking whether they would like to “keep seeing videos of Primates.”

 

Technical: Under the hood

 

Facial recognition relies on a subset of machine learning called supervised learning, which requires humans to manually label data; the labeled data is then used to teach algorithms how to behave. Reducing bias in the human-created labels, then, should result in less biased algorithms.
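
As a concrete illustration, here is a minimal supervised-learning sketch in Python using scikit-learn. The feature vectors and labels are made up for illustration, not an actual facial-recognition pipeline; the point is simply that whatever the human-provided labels encode, including any bias, is what the model learns.

```python
# Minimal supervised-learning sketch: human-provided labels drive what the model learns.
# The "face embeddings" and labels below are random stand-ins; a real facial-recognition
# system would use image features and far larger, carefully audited label sets.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 16))        # stand-in "face embeddings"
y = rng.integers(0, 2, size=200)      # human-assigned labels (e.g., match / no match)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
model = LogisticRegression().fit(X_train, y_train)  # the model learns whatever the labels encode
print("held-out accuracy:", model.score(X_test, y_test))
```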

 

However, even unsupervised machine learning algorithms, which learn from vast quantities of data without human labeling, have problems. GPT-3, a predecessor of the models behind ChatGPT, was trained on 570 GB of data and produced anti-Muslim bias within its generated text. Its training dataset includes text scraped from the internet, digitized books, and English-language Wikipedia, and that data contains linguistic regularities that reflect unconscious human biases, such as racism, sexism, and ableism.
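
One common way researchers probe these linguistic regularities is with association tests over learned word embeddings. The sketch below is illustrative only: the vectors are random stand-ins rather than embeddings from GPT-3's actual training corpus.

```python
# Illustrative WEAT-style association check on word embeddings.
# The vectors here are random placeholders; in practice you would load embeddings
# learned from a large web corpus and compare association strengths.
import numpy as np

rng = np.random.default_rng(1)
vocab = ["doctor", "nurse", "he", "she"]
embeddings = {w: rng.normal(size=50) for w in vocab}  # stand-in embeddings

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# If the corpus pairs "doctor" with "he" more often than with "she", these two
# similarities will diverge -- the kind of regularity that encodes bias.
print("doctor~he :", cosine(embeddings["doctor"], embeddings["he"]))
print("doctor~she:", cosine(embeddings["doctor"], embeddings["she"]))
```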

 

Some advancements in research have been made to help better steer generative models, such as reinforcement learning from human feedback (RLHF). RLHF is a machine learning technique that uses human feedback to optimize models and learn human preferences more efficiently. The technique uses supervised learning to train a reward model that captures those preferences, and then reinforcement learning (RL) to optimize the model against that reward signal, scaling the human feedback beyond the labeled examples. RLHF has been used to make outputs less biased against certain groups, including in the misperforming Gemini model.
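
A minimal sketch of that first RLHF stage, training a reward model on human preference pairs, might look like the following. The tiny scoring network and random "response embeddings" are stand-ins; a real system scores full LLM responses and then runs RL (for example, PPO) against the learned reward.

```python
# Sketch of the first RLHF stage: fitting a reward model on human preference pairs.
# Everything here (the small scoring network, the random chosen/rejected embeddings)
# is a placeholder for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F

reward_model = nn.Sequential(nn.Linear(64, 32), nn.ReLU(), nn.Linear(32, 1))
optimizer = torch.optim.Adam(reward_model.parameters(), lr=1e-3)

# Hypothetical batch: embeddings of responses humans preferred vs. rejected.
chosen = torch.randn(8, 64)
rejected = torch.randn(8, 64)

for _ in range(100):
    r_chosen = reward_model(chosen)
    r_rejected = reward_model(rejected)
    # Bradley-Terry style loss: push preferred responses above rejected ones.
    loss = -F.logsigmoid(r_chosen - r_rejected).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```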

 

RLHF is not enough

 

Even though strides have been made in research around techniques like RLHF, such as direct preference optimization (DPO), Kahneman-Tversky Optimization (KTO), and so on, most research shows that LLMs are still biased and produce output that is not aligned with human values. Besides the case with Gemini, a recent survey on gender bias in LLMs found that LLMs are 3 to 6 times more likely to choose an occupation that stereotypically aligns with a person’s gender. The outputs aligned with people’s perceptions of particular jobs rather than with official job statistics.
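
For reference, the core of DPO can be written as a single loss over preference pairs, with no separate reward model or RL loop. The sketch below assumes you already have per-response log-probabilities from the policy and from a frozen reference model; here they are random placeholders.

```python
# Sketch of the DPO objective: optimize the policy directly from preference pairs.
# The log-probabilities would come from the policy and a frozen reference model
# scoring chosen/rejected responses; random tensors stand in for them here.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    # Preferred responses should gain probability relative to the reference model,
    # rejected ones should lose it.
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()

# Hypothetical per-response log-probabilities for a batch of 4 preference pairs.
loss = dpo_loss(torch.randn(4), torch.randn(4), torch.randn(4), torch.randn(4))
```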

 

Considering the source of the training data for these models, which has largely been publicly available data scraped from the internet, it comes as no surprise that LLMs not only reflect the most common stereotypes but amplify them. The survey highlights a key property of these models: “LLMs are trained on imbalanced datasets; as such, even with the recent successes of reinforcement learning with human feedback, they tend to reflect those imbalances back at us.”

 

Gender bias is not the only type of bias exhibited by LLMs; biases related to race, geography, politics, and other areas have also shown up in research. This ultimately points to a deeper issue: it is not possible to create a fully unbiased AI system, as Meta’s chief AI scientist, Yann LeCun, recently pointed out, since bias is “in the eyes of the beholder.”

 

Moving forward

 

Since AI cannot be unbiased, academic researchers in AI ethics are pushing for changes that improve trust in AI models by emphasizing transparency. The Stanford Institute for Human-Centered Artificial Intelligence (HAI) claims that “transparency into AI systems is necessary for researchers, and the public to understand these systems and their impacts.” People using models should know which biases are present within them and decide whether or not they wish to use them.

 

To help establish more transparency within generative AI development, HAI introduced a framework called Holistic Evaluation of Language Models (HELM) to benchmark language models’ performance. At present, there is no clear standard for evaluating language model performance; LLMs are used in so many different contexts that benchmarking tends to be local and idiosyncratic. Public reporting of model benchmarks is one way to empower stakeholders to better understand a model’s function and impact, and to check that its design aligns with their expectations.
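
The sketch below illustrates the spirit of this kind of disaggregated reporting: scoring the same model per subgroup rather than with a single aggregate number. The records are made up, and this is not HELM's actual API, just an illustration of the idea.

```python
# Illustrative only: disaggregated benchmark reporting, in the spirit of efforts like HELM.
# The evaluation records below are fabricated; a real run would span many scenarios and metrics.
from collections import defaultdict

results = [
    {"group": "group_a", "correct": True},
    {"group": "group_a", "correct": False},
    {"group": "group_b", "correct": True},
    {"group": "group_b", "correct": True},
]

totals, hits = defaultdict(int), defaultdict(int)
for r in results:
    totals[r["group"]] += 1
    hits[r["group"]] += int(r["correct"])

for group in totals:
    print(f"{group}: accuracy {hits[group] / totals[group]:.2f}")
```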

 

Besides benchmarking, some academics have proposed the use of model cards, which document a model’s performance characteristics. Model cards would accompany trained machine learning models and provide "benchmarked evaluation in a variety of conditions, such as different cultural, demographic, or phenotypic groups (e.g. race, geographic location, sex, Fitzpatrick skin type) and intersectional groups (e.g. age and race, or sex and Fitzpatrick skin type) that are relevant to the intended application domains." They would also describe the context in which a model is intended to be used, details of the performance evaluation procedures, and other relevant information.
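
As a loose sketch, a model card can be thought of as structured documentation attached to the model. The field names and values below are illustrative only, not the exact schema from the Model Cards paper.

```python
# A loose sketch of the kind of structured documentation a model card captures.
# Field names, the example model, and all values are hypothetical.
from dataclasses import dataclass, field

@dataclass
class ModelCard:
    model_name: str
    intended_use: str
    out_of_scope_uses: list[str]
    evaluation_procedure: str
    disaggregated_metrics: dict[str, float] = field(default_factory=dict)  # metric per subgroup

card = ModelCard(
    model_name="example-face-matcher-v1",  # hypothetical model
    intended_use="photo de-duplication in a consumer app",
    out_of_scope_uses=["law enforcement identification"],
    evaluation_procedure="accuracy on a held-out set, reported per subgroup",
    disaggregated_metrics={"group_a": 0.97, "group_b": 0.91},
)
```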

 

In a similar vein, the paper Datasheets for Datasets suggests that every dataset be accompanied by a datasheet describing its operating characteristics, test results, recommended uses, and other information. Datasheets would improve transparency and accountability in the machine-learning community, and Microsoft, Google, and IBM have started to pilot them within their product teams. Creating more documentation throughout the machine learning development process is one step toward building more inclusive algorithms.