As artificial intelligence models advance, tech companies such as OpenAI, Google, and Meta find themselves in an increasingly desperate quest for high-quality training data. This data is the backbone of AI development, essential for creating systems that can generate human-like text, images, sounds, and videos.
“The only practical way for these tools to exist is if they can be trained on massive amounts of data without having to license that data,” said Sy Damle, a lawyer representing venture capital firm Andreessen Horowitz, speaking about the critical role of extensive data in AI advancements.
Key Findings From The Report:
- Tech companies have pushed the boundaries of copyright law and corporate policies to gather data.
- They have developed tools to transcribe YouTube videos and discussed acquiring large volumes of copyrighted content without explicit permission.
How Are The Companies Responding?
In response to the data crunch, companies have adopted various strategies to amass the large amounts of data required for AI training. OpenAI developed Whisper, a speech recognition tool, to transcribe YouTube videos, accumulating over a million hours of conversational text.
Google and Meta have also explored similar paths, with Google transcribing YouTube content and Meta considering purchasing a publishing house for access to copyrighted works.
Company Strategies
To train their AI models, these tech giants are alleged to have done the following:
- OpenAI transcribed YouTube videos for GPT-4 training.
- Google and Meta discussed unorthodox methods to acquire high-quality data.
OpenAI’s leadership was directly involved in collecting YouTube videos for transcription. “OpenAI president Greg Brockman was personally involved in collecting videos that were used,” The New York Times reported.
What Does This Mean For Data Ethics?
The actions of these tech giants raise many questions about data ethics and copyright infringement. The use of copyrighted material without permission or compensation to creators has sparked debates and lawsuits. Filmmaker Justine Bateman called this practice “the largest theft in the United States, period,” underlining the controversy surrounding the use of creative works by AI companies.
From an ethical perspective, it’s important to consider the following:
- The use of copyrighted material without permission has led to legal and ethical concerns.
- The debate over “fair use” continues, alongside questions about the ethics of generating synthetic data from copyrighted content.
What Did YouTube Say In Response?
YouTube’s reaction to the discussions around the use of its content for AI model training is clear and straightforward. Neal Mohan, YouTube’s CEO, pointed out the importance of following the platform’s rules, especially regarding the unauthorised use of video content.
“When a creator uploads their work to our platform, they expect that our terms of service will be followed,” Mohan explained in an interview with Bloomberg. He added, “Downloading transcripts or video bits is a direct violation of our terms of service.”
Mohan spoke about the agreement between YouTube and its content creators, stressing that any violation of these terms, such as using videos to train AI without consent, breaks creators’ trust in the platform.
He also mentioned that Google, YouTube’s parent company, uses YouTube content to train its AI model, Gemini, but only in line with agreements made with content creators. This approach ensures that any YouTube videos used for AI training respect the platform’s policies and honour the rights of the creators.