Blocking Google Analytics Referrer Spam via Filters with Python
My follow-up post (with correct information) is here: https://blog.slowb.ro/programmatic-blocking-referrer-spam-part-2-the-correct-way-for-seo-companies/. Please do not implement the following filters; use this information only as a refresher course on spam and referrers in general.
Google Analytics is a great platform for getting to know what is happening with your users, and a good alternative if you cannot host your own analytics software or just don’t want to. It’s a great first step on your way to improving your SEO footprint: working with keywords, seeing how campaigns are tracking, and seeing how your users interact with your site and where they come from, via referrers. This brings us to the problem that all SEO companies and people who work with low-to-medium traffic websites end up dealing with, and that no one really has a real answer or solution to: referrer spam.
What are referrers? A brief introduction and history.
Some background on referrers is a must before we follow on (or not, you could skip it). Referrers are used by web servers (such as nginx and Apache), browsers (Firefox, Chrome), and nearly anything that is connected to the web. Funnily enough, they indicate where a user was ‘referred’ from. They were part of the original HTTP specification that the World Wide Web is built on. Referrers are useful information that website operators, which is nearly everyone and their dog nowadays (including dog fan pages), can use to figure out who is linking to their site around the web.
If you were to visit this blog at https://blog.slowb.ro and click this post, your browser would make a request similar to the following example. The server operator (me) can see that you started from the home page and clicked on this post. (The referrer is the first quoted field after the status code and response size.)
10.0.0.101 - - [10/Jun/XXXX:XX:XX:XX +XXXX] “GET /google-analytics-filter-whitelister-in-python HTTP/1.1” 200 0 “https://blog.slowb.ro/” “Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/41.0.2272.76 Chrome/41.0.2272.76 Safari/537.36”
It shows a valuable amount of information about the user, such as:
- Operating system: Ubuntu
- Browser: Chromium
- Page: /google-analytics-filter-whitelister-in-python
- IP: 10.0.0.101
- Referrer: https://blog.slowb.ro/
This is all information the client has given us just by visiting our site. We accept all of it at face value and assume it is correct. This is the reason referrer spam exists. “Companies” exploit the good nature of web servers that trust the user by making requests to websites with a “fake” referrer header. As far as the web server is concerned, it’s a real referral. Technically it is. Anything can be in the referrer, just as anything can be in the operating system and “User Agent” fields.
Now when you browse your latest analytics data from Google, you see all these random sites you might have never heard of at the top of your referrer list. You end up visiting one, and you can see it obviously has nothing to do with your site and that they are doing it for their own malicious purposes. Or maybe it’s not obvious. But if you cannot find a direct link to your site on the page you just viewed, then it’s more than likely a nefarious company. Their goals are their own, but I feel confident speculating that it’s either traffic or revenue related. Each site will have its own motives; 99% of the time it’s revenue related.
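To make the point concrete, here is a minimal sketch of how those fields can be pulled out of a log line like the sample above. It assumes the common "combined" access log layout; the `COMBINED_LOG` pattern and the `parse_log_line` helper are names made up for this illustration, and the sample line is rewritten with plain ASCII quotes. Every value it extracts is whatever the client chose to send.

```python
import re

# Pattern for the common "combined" access-log layout (an assumption about
# the server's log format; adjust it if your log_format differs).
COMBINED_LOG = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referrer>[^"]*)" "(?P<user_agent>[^"]*)"'
)


def parse_log_line(line):
    """Return the named fields of one log line as a dict, or None if it does not match."""
    match = COMBINED_LOG.match(line)
    return match.groupdict() if match else None


# The sample request from this post, with plain ASCII quotes.
sample = (
    '10.0.0.101 - - [10/Jun/XXXX:XX:XX:XX +XXXX] '
    '"GET /google-analytics-filter-whitelister-in-python HTTP/1.1" 200 0 '
    '"https://blog.slowb.ro/" '
    '"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) '
    'Ubuntu Chromium/41.0.2272.76 Chrome/41.0.2272.76 Safari/537.36"'
)

fields = parse_log_line(sample)
print(fields["ip"], fields["path"], fields["referrer"])
```

The referrer and user agent come straight from request headers, so a client can put any string it likes in them.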
Referrer spam.
Lately I was tasked with helping our company deal with referrer spam problems for our clients. The original solution I proposed (which is certainly not new) was to filter out all the referrers that are known spammers, not at the analytics level but at the web server level: stop them before they even reach the server. That was the plan at least. The JavaScript for these “users” would then never end up being executed. This proved successful for a good portion of the spammers, who (I suspect) used actual browsers that conformed to the HTTP standards, or a browser framework such as PhantomJS or Selenium, which are highly scriptable and configurable to do just about anything.
The image above was taken from this blog’s (old) Google Analytics account, which I used when I started blogging, before moving on to Piwik, a self-hosted analytics platform. What these companies don’t know is that I have not used Google Analytics for the past 3 months. This is not because of referral spam; it’s my own personal “let’s not give Google too much power” policy. These companies are the worst of the bunch. They don’t even have the decency to view your page. I searched all the log files on my server (and our production logs) for any indication of these referrers, but to no avail. They must run the JavaScript code with your UA-xxxx-xx tag through what I assume are open proxies, and set the referrer URL to whatever they desire. As JavaScript is never run by our web server, only in the client’s browser, they could easily scrape your page once for your UA code, using a script similar to the one over here. Then, at their leisure, they cycle through open proxies (or Tor exit nodes) for all the UA codes they have found, and just wait for all the users to click on the new “top” referrer for the week.
Blocking referrers at the web server level would only work if these people played ball. Unfortunately, they don’t, so I needed to come up with another solution that could be rolled out to our 4000 “views” (also known as profiles).
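The blocking itself was done in the web server configuration, which is not shown in this post. As a rough Python stand-in for the same idea, the sketch below wraps a WSGI application and refuses requests whose Referer header matches a block list. The `BLOCKED_REFERRERS` pattern, the `block_referrer_spam` wrapper and the `site` app are all hypothetical names for illustration; the hostnames are borrowed from the filters in this post.

```python
import re

# Hypothetical block list for illustration; hostnames taken from this post's filters.
BLOCKED_REFERRERS = re.compile(
    r"(semalt|darodar)\.com|ilovevitaly\.(com|ru)", re.IGNORECASE
)


def block_referrer_spam(app):
    """Wrap a WSGI app and answer 403 when the Referer header is on the block list."""
    def middleware(environ, start_response):
        referrer = environ.get("HTTP_REFERER", "")
        if BLOCKED_REFERRERS.search(referrer):
            # The page (and its analytics snippet) is never served.
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"Forbidden\n"]
        return app(environ, start_response)
    return middleware


def site(environ, start_response):
    """Stand-in for the real site."""
    start_response("200 OK", [("Content-Type", "text/html")])
    return [b"<html>page with the analytics snippet</html>"]


application = block_referrer_spam(site)
```

A rule like this only helps against spammers that actually request the page. It does nothing when the hit is sent to the analytics service directly.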
Coding a solution.
My final and current solution (as of this post) is to programmatically add and remove our filters and to attach them to all the views under our umbrella analytics account.
Google Analytics is quite convoluted, to say the least. The above picture shows the hierarchy of the analytics platform. If you wish to learn how everything works together, the latest management API is here. You can do a lot more than just add and remove filters programmatically. You could develop your own dashboard by pulling in the data from the analytics API, or create a web application to manage all your sites (which, if I had the time, I would love to do).
If you are a company with more than a few sites and want to implement this solution, you will need to follow the API tutorial for Python in the analytics documentation.
tl;dr for the setup guide:
- Create a project in the Google Developers Console.
- Enable all the APIs needed.
- Download the Service Account’s client_secrets.json (into the same directory as the Python scripts).
- Run python analytics_api_v3_auth.py
If you are a user with just one or two sites and just want all these referrers to disappear, Google has a how-to on managing filters. Here is a pretty picture as well showing how to create one. (You will need to create two filters for it to block everything.)
Current Filters 15/06/10:
.*((darodar|priceg|semalt|buttons-for(-your)?-website|makemoneyonline|blackhatworth|hulfingtonpost|o-o-6-o-o|(social|(simple|free)-share)-buttons|best-seo-(solution|offer)|googlsucks|bestwebsitesawards)\.com)
.*((econom\.co|ilovevitaly\.(co(m)?|ru)|(humanorightswatch|guardlink|smailik|webmasters|nticrawler)\.org)|(get-free-traffic-now|alivematrix)\.com)|(youporn-forum)
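A quick way to sanity-check these two expressions before uploading them is to run them locally. This is only a sketch: it assumes Analytics applies a filter pattern as a case-insensitive partial match, which Python's `re.search` approximates but may not reproduce exactly. `is_blocked` and `MAX_FILTER_LENGTH` are names invented for the example; the 255 character limit is the one mentioned in the script's comments.

```python
import re

# The two filter expressions from this post, copied verbatim.
FILTERS = [
    r".*((darodar|priceg|semalt|buttons-for(-your)?-website|makemoneyonline|blackhatworth|hulfingtonpost|o-o-6-o-o|(social|(simple|free)-share)-buttons|best-seo-(solution|offer)|googlsucks|bestwebsitesawards)\.com)",
    r".*((econom\.co|ilovevitaly\.(co(m)?|ru)|(humanorightswatch|guardlink|smailik|webmasters|nticrawler)\.org)|(get-free-traffic-now|alivematrix)\.com)|(youporn-forum)",
]

MAX_FILTER_LENGTH = 255  # per-filter limit mentioned in the script's comments


def is_blocked(referrer):
    """True if any filter expression matches the referrer (case-insensitive)."""
    return any(re.search(p, referrer, re.IGNORECASE) for p in FILTERS)


# Each expression has to fit within the per-filter length limit.
for pattern in FILTERS:
    print(len(pattern), len(pattern) <= MAX_FILTER_LENGTH)

# Two spam referrers from the list, and this blog as a control.
for host in ("semalt.com", "ilovevitaly.ru", "blog.slowb.ro"):
    print(host, is_blocked(host))
```

Running candidate hostnames through a check like this also catches a pattern that accidentally matches one of your own legitimate referrers.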
Filter Creator
#!/usr/bin/python
# -*- coding: utf-8 -*-
#
# analytics_filter_update.py

import sys
import json

# import the Auth Helper class
import analytics_api_v3_auth as v3auth

from apiclient.errors import HttpError
from oauth2client.client import AccessTokenRefreshError

referrers = [
    ".*((darodar|priceg|semalt|buttons-for(-your)?-website|makemoneyonline|blackhatworth|hulfingtonpost|o-o-6-o-o|(social|(simple|free)-share)-buttons|best-seo-(solution|offer)|googlsucks|bestwebsitesawards)\.com)",
    ".*((econom\.co|ilovevitaly\.(co(m)?|ru)|(humanorightswatch|guardlink|smailik|webmasters|nticrawler)\.org)|(get-free-traffic-now|alivematrix)\.com)|(youporn-forum)"
]


def usage():
    print "\n Usage: python %s option\n" % sys.argv[0]
    print " Options: "
    print "\t create \t- Create our filters (if they don't exist)"
    print "\t delete \t- Remove (all) our filters, then create"
    print "\t fdelete \t- Remove (all) filters, NO create"


def main(arg):
    # Initialize the Analytics Service Object
    service = v3auth.initialize_service()
    # Populate profile id
    # Query Account(s)
    all_ids = get_account_ids(service)
    if all_ids:
        # Loop through all Accounts
        for accid in all_ids:
            try:
                filters = service.management().filters().list(
                    accountId='%s' % accid.get('id')
                ).execute()
                check = False
                # Find out if we already have filters added
                for filter in filters.get('items', []):
                    # Check if the referrers already exist
                    if filter.get('name') == "API Blocked Referrers":
                        print "Our Filter Exists %s, %s" % (
                            filter.get('id'),
                            filter.get('name')
                        )
                        check = True
                        # Confirm if we need to "update" to new filters
                        if str(arg) == "delete" or str(arg) == "fdelete":
                            delete_filters(
                                service, accid.get('id'), filter.get('id')
                            )
                            check = False
                if str(arg) == "fdelete":
                    # TODO: Will need to refactor our loop
                    print "API Filters Deleted from Account %s" % accid.get('id')
                # Now that filters don't exist, we can add them
                elif not check:
                    print "%s does not have the block list" % accid.get('id')
                    # Now add all the 'bad' referrers
                    # As Analytics has a 255 char limit for filter regex,
                    # we have to use multiple filters to block our list
                    for ref in referrers:
                        retfilter = service.management().filters().insert(
                            accountId='%s' % accid.get('id'),
                            body={
                                'name': 'API Blocked Referrers',
                                'type': 'EXCLUDE',
                                'excludeDetails': {
                                    'field': 'REFERRAL',
                                    'expressionValue': '%s' % ref,
                                    'caseSensitive': False
                                }
                            }
                        ).execute()
                        # Now we LINK the new filter to the view
                        # Otherwise they don't get applied to the view...
                        link_filter_view(
                            service,
                            accid.get('id'),
                            retfilter.get('id')
                        )
                else:
                    # Some of our Blocked Referrers exist,
                    print "\n ERR: Filters exist for this account %s " % (
                        accid.get('id'))
                    print "\t run %s with 'delete' option if needed" % (
                        sys.argv[0])
                # TODO: For testing purposes uncomment this line
                # to exit after the first item (account)
                # sys.exit(0)
            except TypeError, error:
                # Handle errors in constructing a query.
                print (
                    'There was an error in constructing your query : %s, Line: %s' % (
                        error,
                        sys.exc_info()[-1].tb_lineno
                    )
                )
            except HttpError, error:
                # Handle API errors.
                print ('Arg, there was an API error : %s : %s : Line: %s' % (
                    error.resp.status,
                    error._get_reason(),
                    sys.exc_info()[-1].tb_lineno
                    )
                )
            except AccessTokenRefreshError:
                # Handle Auth errors.
                print (
                    "The credentials have been revoked or expired, "
                    "please re-run the application to re-authorize"
                )
    else:
        print "ERR: Couldn't get any Accounts..."


# Delete the filters so we can create the new ones
# Takes: service Object, accountid, and the Specified FilterID
def delete_filters(_service, _accid, _filterid):
    try:
        filters = _service.management().filters().delete(
            accountId='%s' % _accid,
            filterId='%s' % _filterid
        ).execute()
        print "Filter: %s deleted" % _filterid
    except TypeError, error:
        # Handle errors in constructing a query
        print 'There was an error in constructing your query : %s' % error
    except HttpError, error:
        # Handle API errors.
        print ('There was an API error : %s : %s' % (
            error.resp.status, error.resp.reason))


# For the created filter on the account, apply to all Views under each property
#
def link_filter_view(service, _accId, _filterId):
    properties = get_property_id(service, _accId)
    if properties:
        for prop in properties:
            views = get_views(service, _accId, prop.get('id'))
            if views:
                for view in views:
                    print "Attaching F: %s to V: %s on P: %s in A: %s " % (
                        _filterId,
                        view.get('name'),
                        prop.get('name'),
                        _accId
                    )
                    service.management().profileFilterLinks().insert(
                        accountId='%s' % _accId,
                        webPropertyId='%s' % prop.get('id'),
                        profileId='%s' % view.get('id'),
                        body={
                            'filterRef': {
                                'id': '%s' % _filterId
                            }
                        }
                    ).execute()


# Get a list of all account ID's that are in the account
# returns all "items" that are found
def get_account_ids(service):
    # Get a list of all Google Analytics accounts for this user
    try:
        accounts = service.management().accounts().list().execute()
        if accounts.get('items'):
            return accounts.get('items')
        return None
    except TypeError, error:
        # Handle errors in constructing a query.
        print (
            'There was an error in constructing your query : %s, Line: %s' % (
                error,
                sys.exc_info()[-1].tb_lineno
            )
        )
    except HttpError, error:
        # Handle API errors.
        print ('Arg, there was an API error : %s : %s : Line: %s' % (
            error.resp.status,
            error._get_reason(),
            sys.exc_info()[-1].tb_lineno
            )
        )
    except AccessTokenRefreshError:
        # Handle Auth errors.
        print (
            "The credentials have been revoked or expired, "
            "please re-run the application to re-authorize"
        )


# Get the ID (UA-xxxx-yy) code that matches the URL
# _query eg: xxxx
def get_property_id(service, _query):
    if _query:
        # Get a list of all the Web Properties for the given account
        try:
            webproperties = service.management().webproperties().list(
                accountId=_query
            ).execute()
            if webproperties.get('items'):
                return webproperties.get('items')
        except TypeError, error:
            # Handle errors in constructing a query.
            print (
                'There was an error in constructing your query : %s, Line: %s' % (
                    error,
                    sys.exc_info()[-1].tb_lineno)
            )
        except HttpError, error:
            # Handle API errors.
            print ('Arg, there was an API error : %s : %s : Line: %s' % (
                error.resp.status,
                error._get_reason(),
                sys.exc_info()[-1].tb_lineno)
            )
        except AccessTokenRefreshError:
            # Handle Auth errors.
            print (
                "The credentials have been revoked or expired, "
                "please re-run the application to re-authorize"
            )
    return None


# Return a list of 'Views'
def get_views(service, _accId, _propId):
    if _propId:
        # Get a list of all Views (Profiles)
        # for the given Web Property of the given Account
        try:
            profiles = service.management().profiles().list(
                accountId=_accId,
                webPropertyId=_propId
            ).execute()
            if profiles.get('items'):
                # return the Views (Profiles)
                return profiles.get('items')
        except TypeError, error:
            # Handle errors in constructing a query.
            print (
                'There was an error in constructing your query : %s, Line: %s' % (
                    error,
                    sys.exc_info()[-1].tb_lineno)
            )
        except HttpError, error:
            # Handle API errors.
            print ('Arg, there was an API error : %s : %s : Line: %s' % (
                error.resp.status,
                error._get_reason(),
                sys.exc_info()[-1].tb_lineno)
            )
        except AccessTokenRefreshError:
            # Handle Auth errors.
            print (
                "The credentials have been revoked or expired, "
                "please re-run the application to re-authorize"
            )
    return None


def get_results(service, profile_id):
    # Use the Analytics Service Object to query the Core Reporting API
    return service.data().ga().get(
        ids='ga:' + profile_id,
        start_date='2015-03-03',
        end_date='2015-03-03',
        metrics='ga:sessions').execute()


def print_results(results):
    # Print data nicely for the user.
    if results:
        print 'First View (Profile): %s' % results.get(
            'profileInfo').get('profileName')
        print 'Total Sessions: %s' % results.get('rows')[0][0]
    else:
        print 'No results found'


if __name__ == '__main__':
    if len(sys.argv) < 2:
        usage()
        sys.exit(1)
    main(sys.argv[1])
Authentication Mechanism
#!/usr/bin/python
# -*- coding: utf-8 -*-
#
# analytics_api_v3_auth.py

import httplib2

from apiclient.discovery import build
from oauth2client.client import flow_from_clientsecrets
from oauth2client.file import Storage
from oauth2client.tools import run

CLIENT_SECRETS = 'client_secrets.json'
MISSING_CLIENT_SECRETS_MESSAGE = '%s is missing' % CLIENT_SECRETS

FLOW = flow_from_clientsecrets(CLIENT_SECRETS,
    scope='https://www.googleapis.com/auth/analytics.edit https://www.googleapis.com/auth/analytics.manage.users',
    message=MISSING_CLIENT_SECRETS_MESSAGE)

TOKEN_FILE_NAME = 'analytics.dat'


def prepare_credentials():
    storage = Storage(TOKEN_FILE_NAME)
    credentials = storage.get()
    if credentials is None or credentials.invalid:
        credentials = run(FLOW, storage)
    return credentials


def initialize_service():
    http = httplib2.Http()
    # Get stored credentials or run the Auth Flow if none are found
    credentials = prepare_credentials()
    http = credentials.authorize(http)
    # Construct and return the authorized Analytics Service Object
    return build('analytics', 'v3', http=http)
Pitfalls.
- Unfortunately, all they have to do is register a new domain, redirect it to their main one, and continue spamming Google Analytics. Hey presto, they have just bypassed all our newly created filters.
- If you are not a “power user”, the developer API has a daily limit of 500 writes for analytics objects. You need 1 write per filter object and 1 write per link of a filter to a view. So if you have 1 domain per account with a single view, the two filters above take 4 writes to set up (2 to create them and 2 to link them). You don’t need a mathematics degree to know that if you have a huge number of domains (over 200), you will have to do multiple batches. (The script already handles that for you.)
- This does not change historical data. These filters will only be applied to future data, from the date they are added.
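For planning a rollout against the write quota described above, a back-of-the-envelope calculation helps. The sketch below counts one write per filter created plus one write per filter-to-view link, and uses the 500 writes per day figure quoted in this post. The function names and default values are assumptions for illustration, and the real count depends on how many filters, properties and views each account has.

```python
import math

DAILY_WRITE_LIMIT = 500  # daily write quota quoted in this post


def writes_needed(accounts, filters_per_account=2, views_per_account=1):
    """Writes to create every filter once and link it to every view."""
    # Each filter costs 1 write to create and 1 write per view it is linked to.
    per_account = filters_per_account * (1 + views_per_account)
    return accounts * per_account


def days_needed(accounts, **kwargs):
    """Number of daily batches needed to stay under the write quota."""
    return math.ceil(writes_needed(accounts, **kwargs) / DAILY_WRITE_LIMIT)


print(writes_needed(1))
print(days_needed(200))
```

With two filters and a single view per account, this way of counting gives four writes per account, so a few hundred accounts already spill over into a second day.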
Ideas for the future & other solutions.
This isn’t a solution, unfortunately; it’s more of a bandage over a broken arm, and the broken arm is a serious referrer spam problem. There are a few things you can do to minimise the amount of referrer spam your sites get.
- Block the referrers in nginx/Apache
- Leave Google Analytics behind for an open source solution: Piwik
- They even have a ‘cloud’ platform if you cannot host it yourself.
~~I have not received any referrer spam since running Piwik, which is going on 4 years now. Seems they are only targeting Analytics currently.~~
The above information is not true any more. Piwik does the same as this blog post: it blocks hostnames. It just automatically updates the list with every version, so not a lot of people see it (or are targeted, for that matter).