Experts, Science & Technology
by Britt Faulstick, May 14, 2024, 3:13 pm

Q+A: What Are the Consequences of AI’s ‘Data Rush’?


There is an adage in information science, “garbage in, garbage out,” which roughly means that if you feed incomplete, inaccurate or skewed information into a program, the output is likely to be of the same poor quality. With the proliferation and continued development of large language models (LLMs), the artificial intelligence programs, like ChatGPT, that crunch massive amounts of information to produce natural-language responses to queries, the demand for good data to train these programs has skyrocketed. And research suggests the programs may soon run out of training material.

Apple reportedly offered several news companies $50 million to license their content archives for AI training. In the last month, AI technology companies OpenAI and Anthropic have made moves to license user data from WordPress, Tumblr and Reddit to help train their LLMs.

Recent reporting from The New York Times explained how many companies went another route: charging ahead to train their AI programs on all of the English content available on the internet, without licensing copyrighted content. This touched off a string of lawsuits against the companies — and illustrated both the insatiable demand for the data and companies’ frantic urgency to feed their programs.

Providing the programs with more examples of “natural language” is key to improving their performance, but according to Shadi Rezapour, PhD, a researcher and assistant professor in the College of Computing & Informatics who studies natural language processing and computational social science, not all training data is created equal — and if companies lose sight of this in their mad dash for data, it could pose a significant problem as more people interact with, and grow to trust, programs like ChatGPT and Gemini. Rezapour recently shared her insights on the big training data “rush” and the problems, biases and injustices LLMs can create and amplify as a result of their training. She also discussed what can be done to verify the quality of data being used to train these AI assistants.

Why do LLMs need training datasets to be effective?

The datasets used for training LLMs form the foundation of these models, shaping how they learn the nuances and complexities of language. The careful curation and ongoing refinement of these datasets are crucial for developing effective, ethical, and responsible LLMs.

At their core, LLMs are statistical models that learn the likelihood of one word following another or one sequence of words leading to another. They are trained to identify patterns within the data. The training dataset offers a broad range of examples for these models, including grammar, vocabulary, idioms and the various contexts in which words or phrases are used — this also includes the biases and stereotypes embedded within language.
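
To make the “statistical model” idea concrete, here is a minimal sketch, in Python, that counts how often one word follows another in a tiny invented corpus and turns those counts into next-word probabilities. This is only an illustration of the underlying principle, not how ChatGPT, Gemini or any production LLM is actually built, and the corpus and variable names are made up for the example.

    from collections import Counter, defaultdict

    # A tiny invented corpus; real LLMs train on trillions of words.
    corpus = "the cat sat on the mat . the dog sat on the rug .".split()

    # Count how often each word follows each preceding word (a bigram model).
    following = defaultdict(Counter)
    for prev, nxt in zip(corpus, corpus[1:]):
        following[prev][nxt] += 1

    # Estimate P(next word | previous word) from the counts.
    def next_word_probs(prev):
        counts = following[prev]
        total = sum(counts.values())
        return {word: count / total for word, count in counts.items()}

    print(next_word_probs("the"))  # {'cat': 0.25, 'mat': 0.25, 'dog': 0.25, 'rug': 0.25}
    print(next_word_probs("sat"))  # {'on': 1.0} -- the model only knows what its data contains

Scaled up to web-size corpora and billions of parameters, this same principle is why whatever patterns, including biases, appear in the training data tend to reappear in the model’s output.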

As humans, we inherently possess various biases that are, in turn, transferred into the data we generate. Consequently, the majority of the datasets used in training LLMs inherently contain biases, which LLMs may inadvertently learn and perpetuate. Such biases can manifest as stereotypes, the exclusion of certain perspectives in ways that cause harm, or the preferential treatment of specific topics, as prior research has shown. Therefore, training with diverse and carefully curated datasets is crucial in reducing these biases, enabling LLMs to better comprehend and adjust to diverse contexts.

What are the challenges of obtaining quality training data?

Access to quality data is challenged by legal, ethical and representational complexities. Ethical considerations, such as obtaining consent for data use, are crucial. Regulations like the European Union’s General Data Protection Regulation (GDPR) protect privacy but may also limit data access or raise the costs of obtaining high-quality data. Furthermore, obtaining less biased and more representative training data that reflects the diversity of real-world scenarios is a significant challenge.

Including perspectives and experiences from a wide range of demographic groups, especially underrepresented and marginalized communities, is important for developing models that serve a diverse user base. Unfortunately, systemic biases and issues in data collection often result in datasets that mainly reflect the perspectives of the most accessible or dominant groups, neglecting a significant portion of the global population.

A significant issue is the overrepresentation of views, behaviors and information from WEIRD (Western, Educated, Industrialized, Rich, and Democratic) societies in online data. This disproportionality can introduce biases into models, diminishing their applicability and fairness in globally diverse contexts. To mitigate this, it’s important to collect or generate data that captures a broader spectrum of human experiences and contexts. However, this effort can be costly and challenging. Overall, acquiring high-quality datasets, particularly those that are preprocessed, labeled, and evaluated, is expensive and time-consuming.

Have you seen any examples of LLM “mistakes” that are likely caused by limited or poor training data?

Yes, several instances and studies have highlighted mistakes in LLMs caused by limited or poor training data, which can manifest in various forms such as biases, hallucinations, and inaccuracies in reasoning.

By now, we are all familiar with the concept of “hallucinations” in LLMs, where models generate incorrect or entirely fabricated information. This phenomenon is partly due to the models’ nature of generating words, combined with their limited data and contextual understanding.

Bias in LLM-generated data is a critical issue directly associated with the training data. Various studies have shown inherited stereotypes, including gender, racial and cultural biases, in LLM-generated responses. For instance, research presented at the ACM Conference on Fairness, Accountability and Transparency (FAccT) last year highlighted perpetuated bias toward people with disabilities.

It’s important to emphasize that LLMs are not like databases or search engines that can extract and present (factual) information. Understanding how these models work and what to expect from their responses is crucial. While LLMs are impressive by nature, we should always take their results with a grain of salt.

What datasets have been used to train the well-known LLMs like ChatGPT and Gemini?

LLMs, such as OpenAI’s ChatGPT and Google’s Gemini, are predominantly trained on diverse, multimodal and multilingual datasets available on the internet. However, specific details about the datasets these models are trained on remain undisclosed.

OpenAI, for instance, has not publicly shared the list of datasets used for training ChatGPT, due to proprietary and competitive reasons. It is known, though, that these models are trained on a combination of “publicly” available data and data evaluated by humans, including text and code from books, articles, code repositories and various internet sources.

Similarly, information about the datasets used for Google’s Gemini is not available, but it is likely that it uses similar sources to those used by GPT models, including text and code from a variety of origins, and some of Google’s internal data. This opacity about the exact nature of the datasets raises questions about the ethical considerations surrounding how they were assembled.

How are datasets evaluated to identify potential biases or other problems?

Evaluating datasets for potential biases is essential before using them in model training or real-world applications, though this process can be complex. It necessitates a comprehensive approach that merges quantitative and qualitative validation to effectively inspect biases across sensitive dimensions like gender, race and age, as well as other attributes relevant to the domain.

Some methods for identifying bias include thorough data auditing to detect skews and anomalies through statistical analysis and visualization techniques, synthetic probing where machine learning models are tested against deliberately biased synthetic data, and human evaluation where annotators examine data subsets for biased language and stereotypes. Additionally, post-analysis techniques, like error analysis and bias testing, further help by quantifying biases and identifying patterns that correlate with specific subgroups. It’s important that all findings are accurately documented to maintain transparency and guide subsequent debiasing strategies, such as data filtering or the removal of algorithmic proxies, to mitigate these identified biases.
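
As a toy illustration of the data-auditing step described above, the Python sketch below counts how documents in a labeled corpus are distributed across one sensitive attribute and flags groups that fall below a chosen share. The records, field names and threshold are invented for this example; a real audit would combine many such statistics with visualization, error analysis and human review.

    from collections import Counter

    # Invented example records; a real audit would load the actual training corpus.
    documents = [
        {"text": "...", "dialect": "US English"},
        {"text": "...", "dialect": "US English"},
        {"text": "...", "dialect": "US English"},
        {"text": "...", "dialect": "UK English"},
        {"text": "...", "dialect": "Nigerian English"},
    ]

    def audit_representation(records, attribute, min_share=0.25):
        """Print each group's share of the data and flag under-represented groups."""
        counts = Counter(record[attribute] for record in records)
        total = sum(counts.values())
        for group, count in counts.most_common():
            share = count / total
            flag = "  <-- under-represented" if share < min_share else ""
            print(f"{group}: {count} of {total} ({share:.0%}){flag}")

    audit_representation(documents, "dialect")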

What are synthetic datasets? Could they be a solution to the challenge of finding quality training data?

Synthetic datasets are collections of data generated programmatically to mimic real-world scenarios, offering a valuable tool for AI development where actual data collection is challenging or unethical. These datasets allow precise control over data distribution and labels, ensuring comprehensive, unbiased training material. Additionally, they address privacy concerns, as they do not directly correspond to real individuals, making them ideal for sensitive domains like healthcare or finance. Generating synthetic data can also be more cost-effective than collecting real-world data, especially in domain-specific fields. However, challenges such as ensuring the realism and trustworthiness of the data must be addressed. Synthetic data must closely model real-world phenomena to be effective, which can be a huge challenge. Thus, while synthetic datasets present a promising solution to the scarcity of quality training data, they should be used to complement real data to build robust, effective models.
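
As a minimal sketch of what “generated programmatically” can look like in the simplest case, the Python example below fills sentence templates with a balanced set of names so that every combination appears equally often. The templates and names are invented for illustration; production synthetic-data pipelines typically rely on generative models and far more rigorous validation of realism.

    import itertools
    import random

    # Invented templates and attribute values, balanced so every combination appears once.
    templates = [
        "{name} applied for a small-business loan.",
        "{name} visited the clinic for a routine checkup.",
    ]
    names = ["Amina", "Carlos", "Mei", "Priya", "Tunde", "Olga"]

    random.seed(0)
    synthetic_rows = [
        {"text": template.format(name=name), "source": "synthetic"}
        for template, name in itertools.product(templates, names)
    ]
    random.shuffle(synthetic_rows)  # avoid ordering artifacts in downstream training

    print(len(synthetic_rows), "synthetic examples, one per template-name combination")
    print(synthetic_rows[0]["text"])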

Is there an easy way for users to know whether a program is using “good” data? Are any groups offering a certification or seal of approval?

Currently, there is no universally accepted method for users to definitively determine whether a dataset used in an AI program follows acceptable standards for mitigating biases and addressing other issues. Nonetheless, several practices and emerging initiatives are enhancing transparency:

  • Dataset owners or creators should offer comprehensive documentation that details the dataset’s creation process, any known biases or limitations, and results from bias evaluation tests. This documentation facilitates external auditing of their methods.
  • The concept of “model cards” has been introduced, suggesting that details about the training data and its risk assessment be published alongside any deployed model (see: Model Cards for Model Reporting).
  • “Datasheets” for datasets have also been proposed to standardize documentation, covering aspects such as motivation, composition, collection processes, recommended uses and known limitations (see: Datasheets for Datasets); a minimal sketch of what such a datasheet might contain follows this list.
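
To give a sense of what such documentation covers, here is a minimal, hypothetical datasheet skeleton written as a Python dictionary. The section names loosely follow the categories proposed in Datasheets for Datasets, but every entry below is a placeholder prompt rather than the paper’s exact wording.

    # Hypothetical datasheet skeleton; every value below is a placeholder prompt, not real content.
    datasheet = {
        "motivation": "Why was the dataset created, and by whom?",
        "composition": "What do the instances represent, and how many are there?",
        "collection_process": "How, when and with what consent was the data gathered?",
        "preprocessing": "What cleaning, labeling or filtering was applied?",
        "recommended_uses": "Which tasks is the dataset suited (and not suited) for?",
        "known_limitations": "What biases, gaps or under-represented groups are known?",
    }

    for section, prompt in datasheet.items():
        print(f"{section}: {prompt}")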

Some emerging initiatives for dataset certification include:

  • The Data Nutrition Project aims to develop a “nutritional label” for datasets.
  • The Dataset Nutrition Label framework has been proposed to evaluate and generate labels for datasets, assessing dimensions like bias, ethics and quality.
  • The AI Incident Database documents failures and incidents in AI caused by biased or problematic training data.

While these efforts are not yet widespread, there is a growing acknowledgment within the AI research community of the need for greater examination, standardized documentation, and potentially third-party certification for training datasets, especially in high-risk applications. Future progress in this area will likely require interdisciplinary collaboration among AI experts, ethicists, policymakers and affected communities.

As mentioned, as of now, no universally recognized certification or seal of approval exists specifically for data quality in AI systems across different industries. However, industry-specific standards and regulations like GDPR in Europe for data privacy and HIPAA in the U.S. for health data do enforce certain levels of data quality and protection. Further, initiatives like the IEEE CertifAIEd™ certification program and ISO standards are developing more structured frameworks and guidelines for AI and its underlying data processes. These efforts aim to establish benchmarks for data quality and ethical AI usage, though their adoption and the establishment of acceptable methods are still evolving.

What would you consider the best datasets for training LLMs?

I would argue that the ideal dataset would be one that is truly representative of all ideologies, cultures, languages, and groups across the world. However, creating such a comprehensive and unbiased dataset is a huge challenge that has not yet been adequately achieved.

The primary reason for this is that most existing datasets, even very large ones, tend to be skewed toward certain perspectives, demographics, and linguistic distributions based on the sources and methods used for data collection. For example, datasets derived primarily from web crawling will inherently be biased towards online content and users, which may not accurately represent the full diversity of human knowledge, experiences and viewpoints.

To truly capture a representative sample of all ideologies, cultures, languages and groups, a concerted effort would be required to include data from a multitude of sources: a wide range of published materials (books, newspapers and journals) across all languages and cultures; transcripts of histories and conversations from diverse communities around the world; inclusive samples of online content beyond just websites and social media platforms; and representation of marginalized and under-represented groups, including indigenous populations and their languages.

Additionally, careful curation and annotation would be necessary to ensure balanced coverage of different ideologies, belief systems, and perspectives, rather than over-representing dominant narratives. Achieving such representative datasets would require a massive, coordinated effort involving multi-disciplinary teams of linguists, sociologists, and domain experts from around the world. Significant resources would be needed for such data collection, translation and annotation.

While creating the perfect dataset may not be feasible, continuously attempting to improve the diversity, inclusivity and representativeness of our training data is crucial for developing more ethical and unbiased AI systems that can understand and engage with the full breadth of human knowledge and experience.

Reporters interested in speaking with Rezapour should contact Britt Faulstick, executive director of News & Media Relations, bef29@drexel.edu or 215.895.2617.


