Recently, you may have heard a lot about ChatGPT, an AI tool that can generate text on almost any topic we ask about. Some people believe this tool could kill the content writing profession in the future.
However, ChatGPT does not only threaten content writers. The data used to train it may also include content crawled from your website, which it can then draw on when generating text.
This has understandably caused website owners to worry that their content may be used illegally for the benefit of ChatGPT AI.
Despite these concerns, Search Engine Journal explains that there are ways to block ChatGPT AI from accessing your website. The process is relatively complicated and there is no guarantee of success, but it's worth trying.
Let's explore step by step how to block ChatGPT from your website!
How AIs Understand Your Content
Are you curious about how ChatGPT works its magic? Well, ChatGPT is built on a Large Language Model (LLM), a type of model trained on data from many sources. These sources include open-source datasets that are freely available for training AI to understand and learn from the millions of pages of content scattered across the web.
LLMs train on diverse sources, such as Wikipedia, books, emails, government court records, and crawled websites. What's more, several portals and websites provide vast amounts of information and data for free!
For instance, Amazon's Registry of Open Data on AWS is one of the most prominent portals, offering thousands of datasets for machine learning research. But wait, there's more! Wikipedia also lists 28 portals for downloading datasets, including the Google Dataset Search and Hugging Face portals, which can lead you to thousands of other datasets.
In other words, ChatGPT is not only smart, but it is also well-informed, thanks to its ability to learn from various sources and gain knowledge from vast amounts of information available online.
Datasets Used to Train ChatGPT
Did you know that datasets play a crucial role in training ChatGPT? ChatGPT is based on GPT-3.5, also known as InstructGPT. The datasets used to train GPT-3.5 are the same ones used to train GPT-3. The fundamental difference between the two is that GPT-3.5 uses a technique known as reinforcement learning from human feedback (RLHF).
The research paper entitled Language Models are Few-Shot Learners describes the five datasets used to train both GPT-3 and GPT-3.5, which include:
- Common Crawl (filtered)
- WebText2
- Books1
- Books2
- Wikipedia
Two of these datasets are based on internet crawls, namely WebText2 and Common Crawl. Training on these datasets is part of what makes ChatGPT a powerful language model that can respond to a wide range of prompts.
What Exactly is WebText2?
WebText2 is a private OpenAI dataset built by crawling links posted on Reddit that received at least three upvotes.
This enhanced version of the original WebText dataset was created to train GPT-3 and GPT-3.5, while the original version was used to train GPT-2.
With a whopping 19 billion tokens, WebText2 is bigger and better than its predecessor, which contained around 15 billion tokens. This valuable dataset is powering some of the world's most advanced language models.
Although WebText2 is not accessible to the public, there is an open-source version called OpenWebText2 that anyone can access.
OpenWebText2 was created using the same crawl patterns as WebText2, so it should contain a similar, if not identical, set of URLs.
If you're curious about what's included in WebText2, you can download OpenWebText2 to get an idea of the URLs it contains. A cleaned-up version of OpenWebText2 is also available for download, as well as the raw version of OpenWebText2.
What we do know is that if your site is linked from Reddit with a minimum of three upvotes, there's a high likelihood that your site is included in both the closed-source OpenAI WebText2 dataset as well as the open-source OpenWebText2 version.
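If you want to check whether your own pages show up in such a URL list, the lookup can be sketched as below. This is only an illustration: it assumes you have already extracted the URLs from the download into a plain list of strings, and the sample URLs, the domain example.com, and the function names are all hypothetical.

```python
from urllib.parse import urlsplit


def host_of(url: str) -> str:
    """Return the lowercase hostname of a URL, without a leading 'www.'."""
    try:
        host = (urlsplit(url.strip()).hostname or "").lower()
    except ValueError:
        # Malformed URLs (for example, a broken IPv6 literal) are treated as no host.
        return ""
    return host[4:] if host.startswith("www.") else host


def urls_for_domain(urls, domain):
    """Return the URLs whose host is `domain` or one of its subdomains."""
    domain = domain.lower()
    matches = []
    for url in urls:
        host = host_of(url)
        if host == domain or host.endswith("." + domain):
            matches.append(url.strip())
    return matches


# Hypothetical sample; in practice, iterate over your extracted URL list.
sample_urls = [
    "https://www.example.com/blog/post-1",
    "https://news.example.com/story",
    "https://notexample.com/page",
]
print(urls_for_domain(sample_urls, "example.com"))
```

Matching on the parsed hostname rather than on a substring avoids false positives such as notexample.com.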
How to Block ChatGPT?
1) Blocking Common Crawl
Common Crawl is one of the most commonly used datasets for internet content, consisting of a large collection of web pages.
Did you know that Common Crawl originates from a bot that crawls the entire internet? The bot, known as CCBot, adheres to the robots.txt protocol, which means you can block Common Crawl through robots.txt to prevent your website data from being included in other datasets.
However, if your site has already been crawled by CCBot, it is likely already included in the datasets.
While blocking Common Crawl may not be 100% effective in preventing ChatGPT AI from accessing your site, it's still worth considering.
The CCBot User-Agent string is: CCBot/2.0
To block the Common Crawl bot, add the following to your robots.txt file:
User-agent: CCBot
Disallow: /
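Before deploying the rule, you may want to confirm that it does what you expect. The sketch below uses Python's standard urllib.robotparser module to evaluate a robots.txt body against a user agent. The example.com URL, the "CCBot/2.0" and "Googlebot" test agents, and the is_allowed helper are assumptions chosen for illustration.

```python
from urllib.robotparser import RobotFileParser

# The rule from this article: block CCBot from the whole site.
ROBOTS_TXT = """\
User-agent: CCBot
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if `user_agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# CCBot should be blocked; crawlers not named in the file stay unaffected.
print(is_allowed(ROBOTS_TXT, "CCBot/2.0", "https://example.com/any-page"))
print(is_allowed(ROBOTS_TXT, "Googlebot", "https://example.com/any-page"))
```

Keep in mind that this only checks how a compliant parser reads the file. It does not prove that any particular crawler obeys it.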
One way to ensure the legitimacy of a CCBot user agent is by verifying that it crawls from Amazon AWS IP addresses.
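That check can be sketched with Python's standard ipaddress module. The CIDR blocks below are documentation-only placeholders, not real AWS ranges. In practice you would substitute the ranges that AWS publishes for its services, and the function name ip_in_ranges is a hypothetical helper.

```python
import ipaddress

# Placeholder ranges reserved for documentation; replace with published AWS ranges.
PLACEHOLDER_RANGES = ["192.0.2.0/24", "2001:db8::/32"]


def ip_in_ranges(ip: str, cidr_blocks) -> bool:
    """Return True if `ip` falls inside any of the given CIDR blocks."""
    address = ipaddress.ip_address(ip)
    for block in cidr_blocks:
        network = ipaddress.ip_network(block)
        # Only compare addresses and networks of the same IP version.
        if address.version == network.version and address in network:
            return True
    return False


# A request claiming to be CCBot from an address outside the ranges is suspect.
print(ip_in_ranges("192.0.2.44", PLACEHOLDER_RANGES))
print(ip_in_ranges("198.51.100.7", PLACEHOLDER_RANGES))
```

An address inside the ranges is not proof on its own, since anyone can rent AWS servers, but an address outside them strongly suggests a spoofed user agent.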
2) Applying Nofollow in Meta Tag
CCBot also honors the nofollow directive in the robots meta tag.
Add this tag to the head section of your pages:
<meta name="CCBot" content="nofollow">
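To confirm that a page actually carries the tag, you could scan its HTML with Python's standard html.parser module. The sketch below is illustrative only: the class and function names are hypothetical, and the sample page is a minimal stand-in for your real markup.

```python
from html.parser import HTMLParser


class RobotsMetaFinder(HTMLParser):
    """Collect the directives of meta tags addressed to one bot name."""

    def __init__(self, bot_name: str):
        super().__init__()
        self.bot_name = bot_name.lower()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = {name: (value or "") for name, value in attrs}
        if attributes.get("name", "").lower() == self.bot_name:
            # The content attribute may hold several comma-separated directives.
            for part in attributes.get("content", "").split(","):
                if part.strip():
                    self.directives.add(part.strip().lower())


def meta_directives(html: str, bot_name: str = "CCBot") -> set:
    """Return the set of directives found in meta tags named `bot_name`."""
    finder = RobotsMetaFinder(bot_name)
    finder.feed(html)
    return finder.directives


page = '<html><head><meta name="CCBot" content="nofollow"></head><body></body></html>'
print("nofollow" in meta_directives(page))
```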
But regardless of your efforts, once content is on the internet, it's almost impossible to completely remove it from existing datasets. That's why many publishers are hoping for more transparency about how internet content is used, especially by AI products like ChatGPT.
In conclusion, understanding how web crawlers work and the potential impact on your website is crucial for any online business. While there are some measures you can take to block certain crawlers from accessing your site, it's important to note that once your site is crawled, your content may end up in various datasets that are used for research and AI development.
To ensure that your website is optimized for search engines and protected from unwanted crawlers, it's recommended to hire SEO experts. Talentport offers a solution to this by connecting you with vetted SEO candidates within two weeks, saving you up to 70% on hiring costs.
Don't let web crawlers negatively impact your online presence. Hire the right talent today.