So, you are a content creator generating lots of useful content in the form of videos or blogs? You are happy with your subscriber count and the views you get for each of your creations. You know who else is happy about your content? All the tech giants as well…(face palm!)
A recent report in The New York Times stated that OpenAI, Meta and Google went against their own policies and scraped data from a million hours of transcribed YouTube videos to train their own AI engines!! (shocking, isn’t it?) These transcribed texts were fed into OpenAI's GPT-4, a much more powerful AI model that powers more chatbots and online tutors.
The AI race is heating up and all companies need access to huge amounts of diverse data to train their AI models so that the models do not end up “hallucinating”.
Where else can so much data be found than in YouTube videos, blogs, vlogs and the entire Internet? With every organization pushing for new ways to create content, the Internet is a gold mine of never-ending data. The only catch? There are numerous copyright issues and privacy laws to be overcome. Google is said to have scraped a lot of data to train its AI models, violating copyrights that are said to rest with the creators.
Similar scraping episodes by Meta and OpenAI have resulted in lawsuits. The New York Times had already sued OpenAI and Microsoft, alleging that they had unfairly used its content to train their models.
What are some sources of data that the tech giants are scraping?
In an era where opinions are free and open and everybody is ready to express theirs, here is where the tech giants are lifting data from:
- Wikipedia articles
- Message boards
- Computer programs
- Photos
- Podcasts
- Transcribed YouTube videos
- Movie clips
- GitHub
- Restaurant reviews
- Publicly open Google sheets
- And anything and everything that you have written on the net or off the net as well!
In spite of so many sources of data, the AI models could still run short of high-quality data as early as 2026!
What is the future of this content scraping?
With the world increasingly mesmerized by AI-generated output, all the tech giants will continue to invest heavily in getting more content from any source, skirting legal and privacy issues.
It is up to us, the consumers, to be mindful of what we create and how it is being used by organizations around the world. We need to keep a close eye on ever-changing privacy policies and see where our content is being used!
Good luck to us in monitoring all our data around the net and how it is being used as more powerful AI engines come along!
This post is for BlogchatterA2Z 2024!
References:
https://www.nytimes.com/2024/04/06/technology/tech-giants-harvest-data-artificial-intelligence.html
Good to see the list of sources of AI’s input. In turn those companies give us some comfort to use the inventions for free. So, we like them.