15% Of World’s Most Popular Websites Block ChatGPT Data Collection

OpenAI has unveiled a new web crawler named GPTBot.

The crawler is designed to collect data from across the public web, which OpenAI says it will use to improve the accuracy and capabilities of its AI models.

OpenAI says that granting GPTBot access to websites can help make AI models more accurate, more capable, and safer. However, 15% of the world’s top 100 websites have already opted to block GPTBot’s access.

GPTBot’s Impact and Adoption

According to data released by Originality.AI, nearly 10% of the world’s top 1,000 websites blocked GPTBot within the first two weeks after OpenAI published the crawler’s documentation.

Notable sites such as Amazon, Quora, wikiHow, and several international news outlets have taken measures to keep GPTBot off their platforms. This raises questions about how the loss of these sources might limit ChatGPT’s accuracy.
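
Sites typically block a crawler through a rule in their robots.txt file, and OpenAI’s GPTBot documentation describes this route. The sketch below shows one way to check whether a given robots.txt blocks GPTBot and to compute the share of blocking sites in a sample. The function names, the placeholder URL, and the idea of passing robots.txt contents as strings are assumptions made for illustration; this is not the method Originality.AI used.

```python
from urllib.robotparser import RobotFileParser


def blocks_gptbot(robots_txt, url="https://example.com/"):
    """Return True if this robots.txt text disallows GPTBot for `url`.

    `url` is a placeholder: only its path is matched against the rules.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch("GPTBot", url)


def share_blocking(robots_by_site):
    """Fraction of sites whose robots.txt blocks GPTBot at the site root.

    `robots_by_site` maps a (hypothetical) site name to its robots.txt text.
    """
    if not robots_by_site:
        return 0.0
    blocked = sum(blocks_gptbot(text) for text in robots_by_site.values())
    return blocked / len(robots_by_site)
```

A robots.txt containing the two lines "User-agent: GPTBot" and "Disallow: /" counts as blocking here, while a file that only restricts a subdirectory for all crawlers does not. A real survey would also need to fetch each file and handle sites that block crawlers by other means, which this sketch leaves out.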

The Mechanism Behind GPTBot

GPTBot operates through a structured process that starts with identifying potential data sources. In this step, the crawler scours the internet to pinpoint websites containing relevant information. Once it finds an appropriate source, GPTBot extracts the data it needs from that website.

The collected information is then catalogued in a database and used to train AI models.
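
The three steps described above (find sources, extract data, store it) can be sketched in a few lines. This is a generic illustration of a crawl-extract-store loop and not OpenAI’s implementation: the class and function names, the table layout, and the `fetch` callable that returns a page’s HTML (or None) are all assumptions.

```python
import sqlite3
from collections import deque
from html.parser import HTMLParser


class PageParser(HTMLParser):
    """Collects visible text and outgoing links from one HTML page."""

    def __init__(self):
        super().__init__()
        self.text_parts = []
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Step 1 (identify sources): remember every link as a candidate page.
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_data(self, data):
        # Step 2 (extract): keep non-empty text fragments.
        if data.strip():
            self.text_parts.append(data.strip())


def crawl(seed, fetch, db, limit=10):
    """Breadth-first crawl from `seed`; returns the number of pages stored.

    `fetch(url)` is a caller-supplied function returning HTML text or None.
    """
    db.execute("CREATE TABLE IF NOT EXISTS pages (url TEXT PRIMARY KEY, text TEXT)")
    queue, seen, stored = deque([seed]), {seed}, 0
    while queue and stored < limit:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        parser = PageParser()
        parser.feed(html)
        # Step 3 (catalogue): store the extracted text keyed by URL.
        db.execute(
            "INSERT OR REPLACE INTO pages VALUES (?, ?)",
            (url, " ".join(parser.text_parts)),
        )
        stored += 1
        for link in parser.links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    db.commit()
    return stored
```

For a quick offline trial, `fetch` can be the `get` method of a dictionary mapping URLs to HTML strings, and `db` can be `sqlite3.connect(":memory:")`. A production crawler would add robots.txt checks, rate limiting, and relative-link resolution, none of which are shown here.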

Versatility in Data Extraction

One of GPTBot’s standout attributes is its ability to extract several kinds of data, spanning text, images, and code. In terms of textual content, GPTBot extracts information from websites, articles, books, and other documents.

It can also handle image-based data, identifying objects depicted in images and reading any text they contain. GPTBot can even extract code from repositories hosted on GitHub, as well as from other code sources scattered across the internet.

The Nexus with AI Models

OpenAI’s flagship product, ChatGPT, and similar generative AI tools are trained on data collected from websites. Elon Musk, owner of the social media platform formerly known as Twitter, previously intervened to stop OpenAI from scraping data from the platform.

The creation of GPTBot marks a notable step in AI development. By capturing data from across the web, GPTBot could help make future AI models more capable.

The decision of some top websites to bar GPTBot’s access highlights the complexities around data usage rights. As OpenAI continues to develop its AI models, the interplay between data, innovation, and legal considerations remains a central issue.