
There’s an adage in information science, “garbage in, garbage out,” which roughly means that if you feed incomplete, inaccurate or skewed information into a program, the output is likely to be of the same poor quality. With the proliferation and continued development of large language models (LLMs), artificial intelligence programs like ChatGPT that crunch massive amounts of information to produce natural language responses to queries, the demand for good data to train these programs has skyrocketed. And research suggests these programs may soon run out of training material.
Apple reportedly offered several news companies $50 million to license their content archives for AI training. In the last month, AI technology companies OpenAI and Anthropic have made moves to license user data from WordPress, Tumblr and Reddit to help train their LLMs.
Recent reporting from The New York Times explained how many companies went another route: charging ahead to train their AI programs on all of the English content available on the internet, without licensing copyrighted content. This touched off a string of lawsuits against the companies and illustrated both the insatiable demand for training data and the frantic urgency with which companies are racing to feed their programs.
Providing the programs with more examples of “natural language” is key to improving their performance, but according to Shadi Rezapour, PhD, a researcher and assistant professor in the College of Computing & Informatics who studies natural language processing and computational social science, not all training data is created equal — and if companies lose sight of this in their mad dash for data, it could pose a significant problem as more people interact with, and grow to trust, programs like ChatGPT and Gemini. Rezapour recently shared her insights on the big training data “rush” and the problems, biases and injustices LLMs can create and amplify as a result of their training. She also discussed what can be done to verify the quality of data being used to train these AI assistants.
Why do LLMs need training datasets to be effective?
The datasets used for training LLMs form the foundation of these models, shaping how they learn the nuances and complexities of language. The careful curation and ongoing refinement of these datasets are crucial for developing effective, ethical, and responsible LLMs.
At their core, LLMs are statistical models that learn the likelihood of one word following another or one sequence of words leading to another. They are trained to identify patterns within the data. The training dataset offers a broad range of examples for these models, including grammar, vocabulary, idioms and the various contexts in which words or phrases are used — this also includes the biases and stereotypes embedded within language.
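To make the idea of learning “the likelihood of one word following another” concrete, here is a minimal, illustrative sketch in Python. It is nothing like a production LLM, which learns these patterns with neural networks over enormous corpora, but it shows how next-word probabilities fall directly out of whatever text a model is given:

```python
from collections import Counter, defaultdict

# Tiny illustrative corpus; real LLMs learn from billions of words
# using neural networks rather than raw counts.
corpus = "the cat sat on the mat . the dog sat on the rug .".split()

# Count how often each word follows each preceding word (a bigram model).
follow_counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    follow_counts[prev][nxt] += 1

def next_word_probabilities(word):
    """Estimate P(next word | word) from the corpus counts."""
    counts = follow_counts[word]
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

print(next_word_probabilities("the"))  # {'cat': 0.25, 'mat': 0.25, 'dog': 0.25, 'rug': 0.25}
print(next_word_probabilities("sat"))  # {'on': 1.0}
```

Whatever regularities, and whatever skews, exist in the training text become the probabilities the model reproduces; that is the basic mechanism by which biased data turns into biased output.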
As humans, we inherently possess various biases that are, in turn, transferred into the data we generate. Consequently, the majority of the datasets used in training LLMs inherently contain biases, which LLMs may inadvertently learn and perpetuate. Such biases can manifest as stereotypes, the harmful exclusion of certain perspectives or the preferential treatment of specific topics, as a growing body of research has shown. Therefore, training with diverse and carefully curated datasets is crucial in reducing these biases, enabling LLMs to better comprehend and adjust to diverse contexts.
What are the challenges of obtaining quality training data?
Access to quality data is challenged by legal, ethical and representational complexities. Ethical considerations, such as obtaining consent for data use, are crucial. Regulations, like the European Union’s General Data Protection Regulation (GDPR), protect privacy but may also limit data access or raise the costs of obtaining high-quality data. Furthermore, obtaining less biased and more representative training data that reflects the diversity of real-world scenarios is a significant challenge.
Including perspectives and experiences from a wide range of demographic groups, especially underrepresented and marginalized communities, is important for developing models that serve a diverse user base. Unfortunately, systemic biases and issues in data collection often result in datasets that mainly reflect the perspectives of the most accessible or dominant groups, neglecting a significant portion of the global population.
A significant issue is the overrepresentation of views, behaviors and information from WEIRD (Western, Educated, Industrialized, Rich, and Democratic) societies in online data. This disproportionality can introduce biases into models, diminishing their applicability and fairness in globally diverse contexts. To mitigate this, it’s important to collect or generate data that captures a broader spectrum of human experiences and contexts. However, this effort can be costly and challenging. Overall, acquiring high-quality datasets, particularly those that are preprocessed, labeled, and evaluated, is expensive and time-consuming.
Have you seen any examples of LLM “mistakes” that are likely caused by limited or poor training data?
Yes, several instances and studies have highlighted mistakes in LLMs caused by limited or poor training data, which can manifest in various forms such as biases, hallucinations, and inaccuracies in reasoning.
By now, we are all familiar with the concept of “hallucinations” in LLMs, where models generate incorrect or entirely fabricated information. This phenomenon stems partly from the way the models generate text word by word, combined with gaps in their training data and contextual understanding.
Bias in LLM-generated data is a critical issue directly associated with the training data. Various studies have shown inherited stereotypes, including gender, racial and cultural biases, in LLM-generated responses. For instance, research presented at the ACM Conference on Fairness, Accountability, and Transparency last year highlighted how biases against people with disabilities are perpetuated.
It’s important to emphasize that LLMs are not like databases or search engines that can extract and present (factual) information. Understanding how these models work and what to expect from their responses is crucial. While LLMs are impressive by nature, we should always take their results with a grain of salt.
What datasets have been used to train the well-known LLMs like ChatGPT and Gemini?
LLMs, such as OpenAI’s ChatGPT and Google’s Gemini, are predominantly trained on diverse, multimodal and multilingual datasets available on the internet. However, specific details about the datasets these models are trained on remain undisclosed.
OpenAI, for instance, has not publicly shared the list of datasets used to train ChatGPT, citing proprietary and competitive reasons. It is known, though, that these models are trained on a combination of “publicly” available data and data evaluated by humans, including text and code from books, articles, code repositories and various internet sources.
Similarly, information about the datasets used for Google’s Gemini is not available, but it is likely that it uses similar sources to those used by GPT models, including text and code from a variety of origins, and some of Google’s internal data. This opacity about the exact nature of the training data raises questions about the ethical considerations behind it.
How are datasets evaluated to identify potential biases or other problems?
Evaluating datasets for potential biases is essential before using them in model training or real-world applications, though this process can be complex. It requires a comprehensive approach that combines quantitative and qualitative validation to inspect biases across sensitive dimensions like gender, race and age, as well as other attributes relevant to the domain.
Some methods for identifying bias include thorough data auditing to detect skews and anomalies through statistical analysis and visualization techniques, synthetic probing where machine learning models are tested against deliberately biased synthetic data, and human evaluation where annotators examine data subsets for biased language and stereotypes. Additionally, post-analysis techniques, like error analysis and bias testing, further help by quantifying biases and identifying patterns that correlate with specific subgroups. It’s important that all findings are accurately documented to maintain transparency and guide subsequent debiasing strategies, such as data filtering or the removal of algorithmic proxies, to mitigate these identified biases.
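As a very small illustration of the statistical-analysis step described above, the sketch below (with invented field names and toy data) reports how unevenly a labeled dataset represents different subgroups, which is often the first signal of a skew worth investigating:

```python
from collections import Counter

# Hypothetical records; the "dialect" field and its values are illustrative only.
records = [
    {"text": "...", "dialect": "US English"},
    {"text": "...", "dialect": "US English"},
    {"text": "...", "dialect": "US English"},
    {"text": "...", "dialect": "Indian English"},
    {"text": "...", "dialect": "Nigerian English"},
]

# Report each subgroup's share of the dataset so under-represented
# groups are visible before any model is trained on it.
counts = Counter(r["dialect"] for r in records)
total = sum(counts.values())
for group, n in counts.most_common():
    print(f"{group}: {n} examples ({n / total:.0%} of the data)")
```

A real audit would go far beyond raw counts, pairing this kind of tally with the human review, synthetic probing and post-analysis techniques described above.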
What are synthetic datasets? Could they be a solution to the challenge of finding quality training data?
Synthetic datasets are collections of data generated programmatically to mimic real-world scenarios, offering a valuable tool for AI development where actual data collection is challenging or unethical. These datasets allow precise control over data distribution and labels, ensuring comprehensive, unbiased training material. Additionally, they address privacy concerns, as they do not directly correspond to real individuals, making them ideal for sensitive domains like healthcare or finance. Generating synthetic data can also be more cost-effective than collecting real-world data, especially in domain-specific fields. However, challenges such as ensuring the realism and trustworthiness of the data must be addressed. Synthetic data must closely model real-world phenomena to be effective, which can be a huge challenge. Thus, while synthetic datasets present a promising solution to the scarcity of quality training data, they should be used to complement real data to build robust, effective models.
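A minimal sketch of what “generated programmatically” can look like in practice: fabricating records from templates and sampled attributes rather than collecting text from real people. The templates, symptoms and values here are invented purely for illustration:

```python
import random

# Illustrative templates and values; a real synthetic-data pipeline would
# model realistic distributions and validate outputs with domain experts.
templates = [
    "The patient reported {symptom} lasting {days} days.",
    "After treatment, the {symptom} improved within {days} days.",
]
symptoms = ["headaches", "fatigue", "joint pain"]

def generate_synthetic_records(n, seed=0):
    """Produce n synthetic sentences tied to no real individual."""
    rng = random.Random(seed)
    return [
        rng.choice(templates).format(
            symptom=rng.choice(symptoms), days=rng.randint(1, 14)
        )
        for _ in range(n)
    ]

for line in generate_synthetic_records(3):
    print(line)
```

The trade-off noted above is visible even in this toy example: the data is private and fully controllable, but only as realistic as the templates and distributions behind it.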
Is there an easy way for users to know whether a program is using “good” data? Are any groups offering a certification or seal of approval?
Currently, there is no universally accepted method for users to definitively determine whether a dataset used in an AI program follows acceptable standards for mitigating biases and addressing other issues. Nonetheless, several practices and emerging initiatives are enhancing transparency:
- Dataset owners or creators should offer comprehensive documentation that details the dataset’s creation process, any known biases or limitations, and results from bias evaluation tests. This documentation facilitates external auditing of their methods.
- The concept of “model cards” has been introduced, suggesting that details about the training data and its risk assessment be published alongside any deployed model (see: Model Cards for Model Reporting).
- “Data Sheets” for datasets have also been proposed to standardize documentation, covering aspects such as motivation, composition, collection processes, recommended uses, and known limitations (see: Datasheets for Datasets).
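To give a sense of what such documentation looks like as an artifact, here is a tiny, hypothetical datasheet expressed as structured data. The dataset name and every value are invented, and the fields only loosely echo the categories in the Datasheets for Datasets proposal:

```python
# A hypothetical, heavily abridged datasheet for an invented dataset.
# Real datasheets are fuller prose documents; this only sketches the
# kinds of fields a dataset's documentation is expected to cover.
datasheet = {
    "name": "ExampleForumCorpus-v1",  # invented dataset
    "motivation": "Study informal English usage in online discussions.",
    "composition": {
        "instances": 120_000,
        "languages": ["en"],
        "contains_personal_data": False,
    },
    "collection_process": "Public forum posts gathered between 2019 and 2021.",
    "recommended_uses": ["language-modeling research"],
    "known_limitations": [
        "Over-represents WEIRD, English-speaking users.",
        "Moderated content was removed, which skews topic coverage.",
    ],
}

for field, value in datasheet.items():
    print(f"{field}: {value}")
```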
Some emerging initiatives for dataset certification include:
- The Data Nutrition Project aims to develop a “nutritional label” for datasets.
- The Dataset Nutrition Label framework is proposed to evaluate and generate labels for datasets, assessing dimensions like bias, ethics and quality.
- The AI Incident Database documents failures and incidents in AI caused by biased or problematic training data.
While these efforts are not yet widespread, there is a growing acknowledgment within the AI research community of the need for greater examination, standardized documentation, and potentially third-party certification for training datasets, especially in high-risk applications. Future progress in this area will likely require interdisciplinary collaboration among AI experts, ethicists, policymakers and affected communities.
As noted, no universally recognized certification or seal of approval currently exists specifically for data quality in AI systems across different industries. However, industry-specific standards and regulations like GDPR in Europe for data privacy and HIPAA in the U.S. for health data do enforce certain levels of data quality and protection. Further, initiatives like the IEEE CertifAIEd™ certification program and ISO standards are developing more structured frameworks and guidelines for AI and its underlying data processes. These efforts aim to establish benchmarks for data quality and ethical AI usage, though their adoption and the establishment of acceptable methods are still evolving.
What would you consider the best datasets for training LLMs?
I would argue that the ideal dataset would be one that is truly representative of all ideologies, cultures, languages, and groups across the world. However, creating such a comprehensive and unbiased dataset is a huge challenge that has not yet been adequately achieved.
The primary reason for this is that most existing datasets, even very large ones, tend to be skewed toward certain perspectives, demographics, and linguistic distributions based on the sources and methods used for data collection. For example, datasets derived primarily from web crawling will inherently be biased towards online content and users, which may not accurately represent the full diversity of human knowledge, experiences and viewpoints.
To truly capture a representative sample of all ideologies, cultures, languages and groups, a concerted effort would be required to draw data from a multitude of sources: a wide range of published materials (books, newspapers and journals) across all languages and cultures; transcripts of oral histories and conversations from diverse communities around the world; inclusive samples of online content beyond just websites and social media platforms; and representation of marginalized and under-represented groups, including indigenous populations and their languages.
Additionally, careful curation and annotation would be necessary to ensure balanced coverage of different ideologies, belief systems, and perspectives, rather than over-representing dominant narratives. Achieving such representative datasets would require a massive, coordinated effort involving multi-disciplinary teams of linguists, sociologists, and domain experts from around the world. Significant resources would be needed for such data collection, translation and annotation.
While creating the perfect dataset may not be feasible, continuously attempting to improve the diversity, inclusivity and representativeness of our training data is crucial for developing more ethical and unbiased AI systems that can understand and engage with the full breadth of human knowledge and experience.
Reporters interested in speaking with Rezapour should contact Britt Faulstick, executive director of News & Media Relations, bef29@drexel.edu or 215.895.2617.

