Generative AI Scrapers

James Kelley
June 19, 2024

Generative AI has the marketing industry buzzing, mostly about what it can do or produce. However, one issue hasn’t been addressed as much – the scraping of website data for training purposes. 

ChatGPT launched less than 2 years ago and is already on its fourth major version. Large Language Models (LLMs), like ChatGPT, require vast amounts of data for training, with newer models requiring even more. At a certain point, publicly available datasets will be exhausted, and LLMs will need new sources of data – quite possibly from your website. 

What does this mean for your website?

There is a high probability that automated scrapers will begin indexing your site, similar to how search engines operate. Google Analytics generally filters scrapers out; however, with so many new LLMs in development, not all of these scrapers will play by the same rules. We have seen this firsthand: several clients experienced a sudden increase in traffic that, upon investigation, turned out to be caused by scrapers.

Will this impact website analytics?

Yes – to a limited extent. To date, we have only seen scrapers come through Direct channels. At this time, it’s highly unlikely that these scrapers would try to mimic other channels since there’s no advantage to doing so. They are seeking data from the website, and are not attempting to mimic a user by clicking on paid, organic, or referral media. 

Will these scrapers cost us?

Possibly. If your website hosting provider charges by bandwidth usage, then you are paying for that increase in traffic. That said, the bot traffic we’ve seen is a very small percentage of overall traffic, and should not increase your costs significantly. From an advertising perspective, it’s highly unlikely that you will be impacted – again, these scrapers are looking for data, not pretending to click ads.

What can we do about this?

If you’re concerned about your website analytics being impacted, the best approach is to know your normal traffic trends and watch for any sudden anomalies. These scrapers may not make a big impact on your sessions; pageviews, however, will skyrocket. From our investigation, we’ve noticed a few commonalities among the scrapers we have detected:

  • Most notably, they index roughly the same number of pages on a pretty regular basis, usually every 2 to 4 days.
  • They identify themselves as the same browser (usually Chrome, but sometimes Safari).
  • They identify themselves as running one or two operating systems (we’ve seen mostly Windows 10 and Linux 3.10).
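
For teams that export their analytics data, the checks above can be roughly sketched in code. The example below is only an illustration and not the process we use with clients: the function names, the input shapes, the threshold of three times the median, and the sample numbers are all assumptions made for this sketch. The first function flags days where pageviews per session jump well above the norm, and the second looks for a browser and operating system pair that reappears every 2 to 4 days.

```python
from collections import defaultdict
from datetime import date
from statistics import median


def flag_pageview_spikes(daily, ratio_threshold=3.0):
    """Return dates whose pageviews-per-session ratio is far above the median.

    `daily` maps a date to a (sessions, pageviews) tuple. The 3x threshold
    is an arbitrary starting point for this sketch, not a recommendation.
    """
    ratios = {d: pv / s for d, (s, pv) in daily.items() if s > 0}
    if not ratios:
        return []
    baseline = median(ratios.values())
    return sorted(d for d, r in ratios.items() if r > ratio_threshold * baseline)


def find_regular_visitors(hits, min_gap=2, max_gap=4, min_visits=3):
    """Return (browser, os) pairs that show up on a steady 2-to-4-day cadence.

    `hits` is an iterable of (date, browser, os) tuples, one per day on
    which that browser/OS combination produced unusually heavy traffic.
    """
    days_by_fingerprint = defaultdict(set)
    for day, browser, os_name in hits:
        days_by_fingerprint[(browser, os_name)].add(day)

    regular = []
    for fingerprint, days in days_by_fingerprint.items():
        ordered = sorted(days)
        if len(ordered) < min_visits:
            continue  # too few visits to call it a pattern
        gaps = [(b - a).days for a, b in zip(ordered, ordered[1:])]
        if all(min_gap <= g <= max_gap for g in gaps):
            regular.append(fingerprint)
    return sorted(regular)


# Made-up numbers for illustration only.
sample_daily = {
    date(2024, 6, 1): (100, 300),
    date(2024, 6, 2): (110, 320),
    date(2024, 6, 3): (105, 2000),  # sessions look normal, pageviews do not
    date(2024, 6, 4): (95, 290),
    date(2024, 6, 5): (100, 310),
}
sample_hits = [
    (date(2024, 6, 3), "Chrome", "Windows 10"),
    (date(2024, 6, 6), "Chrome", "Windows 10"),
    (date(2024, 6, 8), "Chrome", "Windows 10"),
    (date(2024, 6, 3), "Safari", "Linux 3.10"),
    (date(2024, 6, 4), "Safari", "Linux 3.10"),
]

spike_days = flag_pageview_spikes(sample_daily)
suspects = find_regular_visitors(sample_hits)
print(spike_days)
print(suspects)
```

A match from either check is a prompt to look closer, not proof of a scraper. Real visitors can also share a browser and operating system, so the thresholds would need tuning against your own traffic history.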

It’s important to have a marketing partner who knows your data and what to look for. The Moran Group’s deep understanding of our clients and their traffic patterns enables us to identify and address issues such as web scraping bots quickly. We remain dedicated to providing exceptional digital marketing services, driving measurable results, and delivering strong ROI across all platforms. Our commitment to leveraging our insights into client traffic ensures we consistently navigate the evolving digital landscape effectively.