How To Block ChatGPT From Using Your Website Content

We’ve all seen the hype around ChatGPT in recent months, and even I’ve fallen victim to the easy-to-use AI tool. It’s great for research, expanding understanding, and, for some... plagiarising university essays. The tool is impressive, and I think it’s going to go a long way in a society of constantly modernising tech - but as more conversations praising it come to light, more content creators are raising concerns about their work being used to train it.

As a creator, whether you produce copy, art, or digital assets, the idea of your work being used to write Kyle’s university essay in five seconds or make Bethany a new IG picture doesn’t sound particularly appealing… especially when you get no credit or compensation for it.

So if you’re looking to protect your content from ChatGPT, there is a very easy way to do it. Keep reading to learn how.

How AI learns from your content

Large Language Models (LLMs) are trained on data drawn from multiple sources, many of which are open-source datasets freely used for AI training. These sources include, but are not limited to:

  • Wikipedia

  • Government court records

  • Books

  • Emails

  • Crawled websites

There are entire portals and websites dedicated to giving away these datasets, offering vast amounts of information.

One of the portals is hosted by Amazon, offering thousands of datasets at the Registry of Open Data on AWS.

The Amazon portal is just one of many that host these datasets. Wikipedia lists 28 portals for downloading datasets, including Google Dataset Search and the Hugging Face portal, each offering thousands of datasets.

Datasets used to train ChatGPT

ChatGPT is based on GPT-3.5, also known as InstructGPT.

The datasets used to train GPT-3.5 are the same as those used for GPT-3. The major difference between the two is that GPT-3.5 was fine-tuned using a technique known as reinforcement learning from human feedback (RLHF).

The datasets are:

  1. Common Crawl (filtered)

  2. WebText2

  3. Books1

  4. Books2

  5. Wikipedia

Of the five datasets, the two that are based on a crawl of the Internet are:

  • Common Crawl

  • WebText2

Common Crawl

One of the most commonly used datasets consisting of Internet content is the Common Crawl dataset, created by a non-profit organization of the same name.

Common Crawl data comes from a bot that crawls the entire Internet.

The data is downloaded by organizations wishing to use the data and then cleaned of spammy sites, etc.

The name of the Common Crawl bot is CCBot.

CCBot obeys the robots.txt protocol, so it is possible to block Common Crawl with robots.txt and prevent your website data from making it into another dataset.

However, if your site has already been crawled then it’s likely already included in multiple datasets.

Nevertheless, by blocking Common Crawl it’s possible to opt your website content out of new datasets sourced from future Common Crawl crawls.

The CCBot User-Agent string is:

CCBot/2.0

Add the following to your robots.txt file to block the Common Crawl bot:

User-agent: CCBot
Disallow: /
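
If you want to confirm the rule does what you expect, you can test it from the outside. Below is a minimal sketch using Python’s standard urllib.robotparser module; https://example.com is a placeholder for your own domain.

from urllib import robotparser

# Minimal sketch: check whether your live robots.txt blocks CCBot.
# Replace https://example.com with your own domain.
parser = robotparser.RobotFileParser()
parser.set_url("https://example.com/robots.txt")
parser.read()  # fetch and parse the live robots.txt

# CCBot identifies itself with the "CCBot" user-agent token.
print(parser.can_fetch("CCBot", "https://example.com/some-page"))  # should print False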

An additional way to confirm whether a CCBot user agent is legitimate is to check that it crawls from Amazon AWS IP addresses.
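
If you want to sanity-check that yourself, a reverse DNS lookup on the requesting IP is one rough way to do it. The sketch below uses Python’s socket module; the IP address is a placeholder you would take from your access logs, and the amazonaws.com hostname suffix is an assumption worth verifying against your own data.

import socket

def looks_like_aws(ip):
    # Reverse DNS lookup; AWS-hosted crawlers typically resolve to a
    # hostname ending in amazonaws.com (assumption - verify for your setup).
    try:
        hostname, _, _ = socket.gethostbyaddr(ip)
    except socket.herror:
        return False
    return hostname.endswith("amazonaws.com")

# Placeholder IP taken from your server access logs:
print(looks_like_aws("203.0.113.7"))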

CCBot also obeys the nofollow robots meta tag directives.

Use this in your robots meta tag:

<meta name="CCBot" content="nofollow">

About the WebText2 dataset

WebText2 is a private OpenAI dataset created by crawling links from Reddit posts that had at least three upvotes.

The idea is that these URLs are trustworthy and will contain quality content.

WebText2 is an extended version of the original WebText dataset developed by OpenAI.

The original WebText dataset had about 15 billion tokens. WebText was used to train GPT-2.

WebText2 is slightly larger, at about 19 billion tokens, and was used to train GPT-3 and GPT-3.5.

OpenWebText2

WebText2 (created by OpenAI) is not publicly available.

However, there is a publicly available open-source version called OpenWebText2. It was created using the same crawl pattern, so it presumably contains a similar, if not identical, set of URLs to OpenAI’s WebText2.

I thought it was worthwhile to mention this in case someone wants to know what’s in WebText2. You can easily download OpenWebText2 to get an idea of the URLs contained in it.
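
If you do download it, something like the rough sketch below can pull the source URLs out for a quick look. It assumes the archive is distributed as zstandard-compressed JSON-lines shards with a url field on each record (check the dataset’s own documentation for the exact layout) and uses the third-party zstandard package.

import io
import json
import zstandard as zstd  # third-party: pip install zstandard

def iter_urls(path):
    # Stream-decompress a .jsonl.zst shard and yield each record's URL.
    # The field name is an assumption; adjust to the real schema.
    with open(path, "rb") as fh:
        reader = zstd.ZstdDecompressor().stream_reader(fh)
        for line in io.TextIOWrapper(reader, encoding="utf-8"):
            record = json.loads(line)
            yield record.get("url") or record.get("meta", {}).get("url")

for url in iter_urls("openwebtext2_shard.jsonl.zst"):  # placeholder filename
    print(url)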

Before you block any bots

Many datasets, including Common Crawl, could be used by companies that filter and categorize URLs in order to create lists of websites to target with advertising.

For example, a company named Alpha Quantum offers a dataset of URLs categorized using the Interactive Advertising Bureau Taxonomy. The dataset is useful for AdTech marketing and contextual advertising. Exclusion from a database like that could cause a publisher to lose potential advertisers.
