Fortune
Sage Lazzaro

Data scraping: A.I.'s original sin

(Credit: Brianna Soukup—Portland Press Herald/Getty Images)

Hello and welcome to Eye on A.I. This past week, 12 data protection watchdogs from around the globe came together to issue a joint statement addressing data scraping and its effects on privacy. 

The statement—signed by privacy officials from Australia, Canada, Mexico, China, Switzerland, Colombia, Argentina, and the U.K., to name a few—takes aim at website operators, specifically social media companies, and states they have obligations under data protection and privacy laws to protect information on their platforms from unlawful data scraping. Even publicly accessible personal information is subject to these laws in most jurisdictions, the statement asserts. Notably, the statement also outlines that data scraping incidents that harvest personal information can constitute reportable data breaches in many jurisdictions.

In addition to publishing the statement, the authors state they sent it directly to Alphabet (YouTube), ByteDance (TikTok), Meta (Instagram, Facebook, and Threads), Microsoft (LinkedIn), Sina Corp (Weibo), and X Corp. (X, previously Twitter). They also suggest a series of controls these companies should have in place to safeguard users against harms associated with data scraping, including designating a team to monitor for and respond to scraping activities.

The potential harms outlined include cyberattacks, identity fraud, surveillance, unauthorized political or intelligence gathering, and unwanted marketing and spam. And while the statement never mentions artificial intelligence, A.I. is increasingly becoming a major flash point in this issue.

Scraping the internet—including the information on social media sites—is exactly how A.I. powerhouses like OpenAI, Meta, and Google obtained much of the data to train their models. And just in the past few weeks, data scraping has emerged as a major battlefront in the new A.I. landscape. The New York Times, for example, earlier this month updated its terms of service to prevent A.I. scraping of its content, and now the publisher is exploring suing OpenAI over the matter. This follows a proposed class-action lawsuit against OpenAI and investor Microsoft filed in June, which alleged the firm secretly scraped the personal information of hundreds of millions of users from the internet without notice, consent, or just compensation.  

A strongly worded letter is extremely unlikely to change anything these tech giants do, but lawsuits and regulations against data scraping very well could. In the EU, for example, where data privacy rules and now A.I. regulation are moving fairly quickly, data scraping is drawing increasing scrutiny from governmental bodies.

At its heart, A.I. is about data. So this raises the question: If companies aren't able to freely scrape data, where will they get the data needed to train their models?

One option is synthetic data, which refers to information that’s artificially generated rather than created by real-world events. This process often, but not always, involves using A.I. itself to create a large dataset of synthetic data from a smaller set of real-world data, with the resulting synthetic data mirroring the statistical properties of the real-world data. 
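To make the statistical-mirroring idea concrete, here's a minimal sketch (an illustration only, not any particular vendor's pipeline, and the dataset shapes are invented for the example): estimate the mean and covariance of a small "real" dataset, then sample a much larger synthetic dataset with the same statistics.

```python
import numpy as np

rng = np.random.default_rng(0)

# A small "real-world" dataset: 200 samples, 3 features.
real = rng.normal(loc=[10.0, 5.0, -2.0], scale=[2.0, 1.0, 0.5], size=(200, 3))

# Estimate its statistical properties.
mu = real.mean(axis=0)
cov = np.cov(real, rowvar=False)

# Generate a much larger synthetic dataset that mirrors those properties.
synthetic = rng.multivariate_normal(mu, cov, size=20_000)

# The synthetic data tracks the real data's mean and covariance closely,
# but it contains none of the original records themselves.
print(synthetic.shape)
```

Note the trade-off this simple version exposes: because the synthetic data is sampled from a smoothed statistical summary, rare outliers in the real data are unlikely to be reproduced, which is one of the drawbacks discussed below.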

As long as the original data isn't scraped, this could be a viable solution. Gartner estimates that synthetic data will overtake real-world data in A.I. models by 2030. But synthetic data has its drawbacks. For example, it can miss outliers, introduce inaccuracies, and it ideally requires extra verification steps that slow down the process. And while some companies claim synthetic data eliminates bias, many experts dispute this and point to ways some forms of synthetic data can actually introduce additional biases into datasets.

Another potential solution is opt-in first-party data. Unlike real-world data that has historically been scraped, used without permission, and even sold out from under users, this is real-world data that users volunteer and explicitly opt in to share.

Miami-based Streamlytics is one company working in the emerging opt-in first-party data space with the goal of making data streams more ethical. The company pays users to download their own data from sites they use, such as Netflix, and upload it to Streamlytics, which then packages it up and sells it to customers looking to purchase it. Customers can request specific types of data that they need, and users maintain ownership of the data and can request it be deleted at any time.

Founder and CEO Angela Benton told Eye on A.I. that her company has seen “a remarkable upsurge in interest” amid the current generative A.I. boom. A lot of that interest, she said, is from small and medium-sized businesses that are looking for solutions to train custom A.I. models. 

“In most cases, because of the size of these businesses, they lack the scale of data needed to train and customize their models,” she said. “They are actively seeking out solutions that can provide the data that they need and most are inclined towards models that are ethical from the ground up.”

As a result, Streamlytics is developing new offerings to cater to the surge of businesses jumping into generative A.I., such as allowing organizations to choose between purely human-generated data, synthetic data, or a blend of both, all of which is collected consensually. 

In conversations with customers, Benton said there is “a high degree of concern regarding legal backlash from using scraped data.”

“While everyone is enthusiastic about A.I. no one wants to be sued,” she said. “So there is an extra layer of diligence, especially from larger organizations, that includes reviewing processes of how data is sourced and timelines for when data is purged.”

It’s ironic that the larger organizations that created the very models that kicked off this generative A.I. boom didn’t do so with the same level of concern or diligence. What’s more, these companies have nearly unlimited resources and therefore are most equipped to take the ethical route. 

Even ImageNet, the dataset containing millions of tagged images that single-handedly catalyzed the rise of A.I. after it was released in 2010, was composed largely of images scraped nonconsensually from the internet. From its modern beginnings, A.I. was built on stolen data, and now we're entering its reckoning moment.

And with that, here’s the rest of this week’s A.I. news.

But first, a quick plug for Fortune's upcoming Brainstorm A.I. conference in San Francisco on Dec. 11–12, where you'll gain vital insights on how the most powerful and far-reaching technology of our time is changing businesses, transforming society, and impacting our future. Confirmed speakers include such A.I. luminaries as PayPal's John Kim, Salesforce AI CEO Clara Shih, IBM's Christina Montgomery, Quizlet CEO Lex Bayer, and more. Apply to attend today!

Sage Lazzaro
sage.lazzaro@fortune.com
sagelazzaro.com
