We don’t know what Microsoft’s Bing or OpenAI’s ChatGPT are trained on. But if we did, we might well be put off from using the chatbots at all. A new report from The Washington Post examining the training data behind competing large language models (LLMs) from Google and Facebook reveals that chatbots could easily pull from copyrighted material and discredited news sources to create their responses. And that’s before you consider their tendency to “hallucinate” wrong answers without you even knowing.
The report is a glimpse into what has been the secretive side of the current AI boom. More and more companies are making it easy to interact with natural-language text and image generators, but few share what sources those generators draw on. That opacity adds uncertainty about the safety and validity of every response.
That people are beginning to turn to chatbots as a definitive source of information only makes things worse. Search engines might be clogged with attempts to game the algorithm and rank higher on Google, but at least you can see all of the possible results you’re dealing with. With chatbots, you get a confident answer and a small option to find out where it came from, setting up the wrong kind of relationship to information and introducing another tool we don’t fully understand.
This One’s For the WOWHeads
WaPo’s breakdown of the sources included in the C4 (Colossal Clean Crawled Corpus) data set, which powers Google’s T5 and Facebook’s LLaMA, ranges from the humorous to the concerning. On one hand, C4 includes text from wowhead.com, a dedicated World of Warcraft fan site and “player forum.” On the other, WaPo found that the copyright symbol “appears more than 200 million times in the C4 data set,” suggesting LLMs trained on it could easily draw on someone else’s intellectual property and present it as original.
That a chatbot might have a deep and specific knowledge of Stormwind isn’t necessarily a bad thing, but it is something you might not know unless the data set had been examined. I was more concerned to learn that Google and Facebook’s LLMs could plausibly draw from breitbart.com, a far-right news site and documented source of misinformation. That’s not exactly the kind of source you’d want a chatbot relying on when it generates responses.
The full data set The Washington Post examined included 15.1 million unique websites. And chatbots don’t necessarily rely on a single LLM trained on a single set of data to function. That means there are multiple points at which a chatbot like Bing, Bard, or ChatGPT could produce an inaccurate response: the various data sets used to train it, how it ingests prompts, and how it produces the most desirable answer.
The Black Box
But the problem is a lot simpler, too. If I were to search for information about climate change and scroll far enough down the results, I might run into a link from a less desirable or factually inaccurate news source. In that case, I can ignore it. If I ask a chatbot the same question, there’s a chance I get the wrong information without my knowledge and without citation.
Every big company working in this space is making an effort to prevent wrong answers and misuse. ChatGPT is designed so that you can’t ask it for help murdering someone. Bard and Bing, as I discovered comparing them, actively try to keep you from breaking the law. They’re not afraid to scold you.
And yet, workarounds exist. We now know that with a little “hypnotism,” ChatGPT will set you on the path to building a bomb. According to Bloomberg, Google’s own employees are concerned with Bard’s capacity to be a “pathological liar” and place users in dangerous situations based on the responses it gives, like advice that could lead to someone dying while scuba diving. These issues persist, and while they weren’t necessarily unsolvable in the context of search engines, they seem far more complex in a chat interface.
Worse, turning to a single entity, even the fake “intelligence” we occupy ourselves with today, sets the wrong expectations around learning. Gaining knowledge requires multiple sources, understanding, comparison, and even personal experience, not a single response from a jack-of-all-trades expert, especially one whose bona fides we don’t know. Search already introduced an arcane process governing which web pages rank first and when. Why obscure that further by eliminating our ability to see them at all?
I said it was worth trying out a chatbot in every app or service because it felt like the best way to figure out where a conversational interface might make the most sense. But I’d like to amend that belief. It’s not worth it if it’s going to make it harder to trust the information we learn online, or if there’s going to be even less transparency around how things work. We don’t need another black box.