Programmatic blocking of referrer spam - Part 2; How SEO companies block it
In Part 1 I talked about how we can use the Analytics API to set up filters that block all the bad referrers. After a few months of testing, it turns out that this solves only the first part of the problem.
Many people around the world have been dealing with this issue, and I have finally found a better solution than the one in my previous part, though I am certainly not the first to find it. I do believe I was the first to publish on leveraging the Google Analytics API to accomplish the task.
The golden answer everyone has been looking for on how to block referrer spam is combining:
- A “Hostname Filter” on Google Analytics.
- Simple Referrer blocking on your web server (NGINX/Apache/IIS).
Types of referrer spam
To quickly recap, in case you have not read my other post, there are currently two main methods that spammers use. (As of writing; if there are more, please let me know in the comments.)
- Spammers that view your site using programs that emulate, or control browsers, with a faked referrer field.
- Spammers that find (scrape) your GA code, then use it to create referrals in your Analytics. (Ghost spam)
Both are seriously easy to do and require very little effort to set up or maintain, as are the blocking techniques.
Programmatic approach
It’s the same approach as last time. For legal reasons I can’t give out my exact code, but to help you on your way, here is the basis of how you would add a hostname filter to all properties and views.
- Loop over all properties in the account.
get_property_id
- Get the views for each property.
get_views
- Loop over the views and, for each one:
- Figure out the hostname.
view.get('pre-defined-var')
- Check if a filter already exists.
list_filter_view
- (2nd run) If it exists, update the filter.
service.management().filters().update(
- Otherwise, if the filter doesn’t exist, “Insert” it.
returned_filter_object = service.management().filters().insert(
    accountId='%s' % account_id,
    body={
        'name': 'Hostname Allowed',
        'type': 'INCLUDE',
        'includeDetails': {
            'field': 'PAGE_HOSTNAME',
            # The hostname worked out for this view in the earlier step
            'expressionValue': '%s' % YOURHOSTNAME_from_step1,
            'caseSensitive': False
        }
    }
).execute()
- Then link the newly created filter with the view.
service.management().profileFilterLinks().insert(
    accountId='%s' % account_id,
    webPropertyId='%s' % property_id,
    profileId='%s' % view_id,
    body={
        'filterRef': {
            'id': '%s' % returned_filter_object.get('id')
        }
    }
).execute()
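Put together, the steps above could look roughly like the sketch below. This is not my production code, only an illustration under some assumptions: `service` is an already authorised Analytics v3 client from `googleapiclient`, the hostname is taken from the view's `websiteUrl` field (swap in whatever you actually use), an existing filter is recognised by its name on the view's filter links, and result paging is ignored. The helper names (`hostname_from_view`, `hostname_filter_body`, `ensure_hostname_filter`, `apply_to_account`) are made up for this example.

```python
from urllib.parse import urlparse

FILTER_NAME = 'Hostname Allowed'  # same name as in the insert call above


def hostname_from_view(view):
    """Return the hostname of a view, assuming it lives in 'websiteUrl'."""
    url = view.get('websiteUrl', '')
    if '//' not in url:
        url = '//' + url  # urlparse only fills in the host after '//'
    return urlparse(url).hostname or ''


def hostname_filter_body(hostname):
    """Build the INCLUDE filter body for one hostname."""
    return {
        'name': FILTER_NAME,
        'type': 'INCLUDE',
        'includeDetails': {
            'field': 'PAGE_HOSTNAME',
            # Assumption: the value is treated as a regex, so escape the dots.
            'expressionValue': hostname.replace('.', r'\.'),
            'caseSensitive': False,
        },
    }


def ensure_hostname_filter(service, account_id, property_id, view):
    """Update the view's hostname filter, or insert and link a new one."""
    mgmt = service.management()
    body = hostname_filter_body(hostname_from_view(view))
    links = mgmt.profileFilterLinks().list(
        accountId=account_id, webPropertyId=property_id, profileId=view['id'],
    ).execute().get('items', [])
    for link in links:
        ref = link.get('filterRef', {})
        if ref.get('name') == FILTER_NAME:
            # 2nd run: the filter is already linked, so update it in place.
            mgmt.filters().update(
                accountId=account_id, filterId=ref['id'], body=body,
            ).execute()
            return 'updated'
    # 1st run: insert the filter, then link it to the view.
    created = mgmt.filters().insert(accountId=account_id, body=body).execute()
    mgmt.profileFilterLinks().insert(
        accountId=account_id, webPropertyId=property_id, profileId=view['id'],
        body={'filterRef': {'id': created['id']}},
    ).execute()
    return 'inserted'


def apply_to_account(service, account_id):
    """Loop over every property and view in one account."""
    mgmt = service.management()
    props = mgmt.webproperties().list(
        accountId=account_id).execute().get('items', [])
    for prop in props:
        views = mgmt.profiles().list(
            accountId=account_id, webPropertyId=prop['id'],
        ).execute().get('items', [])
        for view in views:
            ensure_hostname_filter(service, account_id, prop['id'], view)
```

Keep in mind that every update, insert, and link here is a write, so it is worth running this against a single test account before pointing it at every client.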
Webserver Blocking
Now that we have blocked all our ghost referrers, we can block the sites that are still showing up in our analytics (looking at you, semalt). I won’t go into how you can block them on your website, as it’s covered very well by others.
My current NGINX configuration is:
# Referrers - 201603
if ($http_referer ~* (((darodar|priceg|buttons-for(-your)?-website|makemoneyonline|blackhatworth|hulfingtonpost|o-o-6-o-o|(social|(simple|free)?-share)-buttons|best-seo-(solution|offer)|googlsucks|bestwebsitesawards).com)|(semalt.*)|((econom.co|ilovevitaly.(co(m)?|ru)|(humanorightswatch|guardlink|smailik|webmasters|nticrawler).org)|(Get-Free-Traffic-Now|alivematrix|event-tracking|(100dollars|success)-seo|videos-for-your-business|keywords.*success|free-video-tool).com)|(youporn-forum)) ) {
    return 444;
}
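Before rolling a pattern like this out, it can help to sanity-check it against a few known referrers so you don't lock out real visitors. Below is a minimal Python sketch of that idea. The pattern is a shortened, illustrative subset of the list above (not the full rule), and `is_spam_referrer` is a name made up for this example. Like the `~*` operator in NGINX, it matches case-insensitively anywhere in the referrer.

```python
import re

# Shortened, illustrative pattern: only a few of the domains from the list.
SPAM_REFERRER = re.compile(
    r"semalt"
    r"|(darodar|priceg|buttons-for(-your)?-website)\.com"
    r"|ilovevitaly\.(com?|ru)",
    re.IGNORECASE,
)


def is_spam_referrer(referrer):
    """Return True if the referrer matches the block pattern anywhere."""
    return bool(SPAM_REFERRER.search(referrer or ""))


if __name__ == "__main__":
    samples = [
        "http://semalt.com/crawler.php?u=http://example.com",
        "http://buttons-for-website.com",
        "https://www.google.com/search?q=referrer+spam",
    ]
    for sample in samples:
        print(sample, "->", "block" if is_spam_referrer(sample) else "allow")
```

Python's `re` and the regex engine NGINX uses are not identical, but for simple alternations like these they should agree.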
Extended thoughts
This should be enough to go out and start working on your own script or application to programmatically update your clients’ Google Analytics accounts with a hostname filter. The script inserts a new filter, links the new filter to the correct view, and updates the old filter to the new view’s URL.
For any SEO company, this is a far more manageable solution than the one I talked about in Part 1. Google Analytics only allows 500 writes per account per 24 hours. If you manage Google Analytics for 200 accounts, keeping an up-to-date “filter list” of websites to block is time-consuming, impractical, and an unacceptable overhead for a business.
Other sites have a huge list of filters. Since each filter has a text limit of 256 characters, the ~600 crowd-sourced domains currently compress down to 8-10 filters with some sweet regex skills. Using a smaller list of only the sites that I’ve seen myself, I managed to compress it down to 3 filters. If we multiply the 3 filters across all the accounts in the business… well, it’s #mathtime.
- Creating (or updating) a filter for each view counts as one write (3*200).
- Linking a filter for each account counts as one write (3*200).
Just like that, if I had done some math before Part 1, I would have known it was impractical. We have at least 600 writes whenever we want to update our default filters. I hear you saying: only update the filter that has changed instead of updating all 3. That’s still 200 writes, assuming you have 200 views. It then takes 24 to 48 hours to update all filters. I have not counted the latest numbers, but it’s easily above 600 ‘view properties’.
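If you want to plug in your own numbers, the arithmetic above fits in a few lines of Python. This is only a back-of-the-envelope sketch: it assumes, as my math does, that the 500 writes act as one shared daily budget, and the function names are made up for the example.

```python
import math


def writes_needed(filters, accounts):
    """One write per filter per account (or view)."""
    return filters * accounts


def days_to_apply(total_writes, daily_write_quota=500):
    """Whole days needed to push the writes through the daily quota."""
    return math.ceil(total_writes / daily_write_quota)


update_writes = writes_needed(3, 200)  # updating 3 filters on 200 accounts
link_writes = writes_needed(3, 200)    # linking them on a first run

print(update_writes)                               # 600
print(days_to_apply(update_writes))                # 2
print(days_to_apply(update_writes + link_writes))  # 3
```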
Without the hostname filter, by the time we notice a new spammer it’s usually already too late. The spammer will have already spread across many accounts, polluting real data and increasing the workload of SEO technicians.
Having NGINX/IIS/Apache block all referrers that are actual scrapers requires less effort, and the resolution is instant. A simple configuration change that can be rolled out across our infrastructure instantaneously is leagues better than the 48 to 72 hour turnaround of the previous referrer spam blocking.
If you have any ideas or thoughts, or if this has helped you in making a hostname creator, feel free to leave me a comment. It’s always great to hear other success stories, especially from others who are in the SEO industry.