alexfedoseev January 4, 2014 at 14:18

Site Visitor Sourcing Module for Ruby on Rails

This post is mainly about web analytics: how to correctly identify where the visitors to your site come from, and about my module for Ruby on Rails, which helps with this difficult task. At the end there is a short part that I would especially like the Rails community to pay attention to: it is about me and Rails. But first things first.

Part one. About web analytics and determining visitors' sources

Problem

The task is rather trivial: determine the source of a visitor who came to our site. I don’t know about you, but for a long time we lived off Google Analytics cookies: we took utmz, split it into sources and mediums, and lived happily. Analytics took care of source rewriting and session tracking for us and, in general, spared us all the pain of parsing referrers. But all good things come to an end.

When Google rolled out the Universal Analytics beta, it became clear that sooner or later we would have to say goodbye to the Google cookie and learn to do this ourselves. But since Google also announced that Classic and Universal were incompatible, we could sit tight for a while: Classic would be supported for a long time.

In the new version of GA only one cookie is left: the one with the user id. Analytics sends it to its server and performs all the calculations there. You can no longer get information about the source from it, either through cookies or through JS.

But recently Google began gently poking users with a stick: it rolled out a Classic → Universal profile converter. So it is time to get moving: Classic Analytics will go the way of Google Reader once remarketing lists arrive in Universal. And that, I think, is not far off.

So I ended up writing a homemade utmz cookie generator and called it sourcebuster.

First about the form

The generator is built as a mountable engine for Ruby on Rails. It can be attached to any of your Rails applications as a gem fairly quickly and updated with a single console command. The module is self-contained and does everything on its own: following certain rules, it calculates the source (and a number of other parameters) and saves them in the visitor’s cookies. The source data can then be used to swap phone numbers or content on the site, or be saved together with submitted leads for further analytics.

Here is the link to GitHub right away: https://github.com/alexfedoseev/sourcebuster
I haven’t gotten around to formatting README.rdoc yet; it is in progress and I’ll fix it soon.

Now about the content

Most of the logic repeats GA logic, but there are some differences.
Let's start with the data structure.

Data structure

In total, we have 4 main types of traffic:

utm - traffic marked with utm tags
organic - traffic from organic search engine results
referral - referral traffic (links from third-party resources)
typein - direct visits

The filtering logic is shown in the image below:

In this way, we pack visitors into these 4 baskets.
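
As a rough illustration of the four baskets, here is a minimal sketch of such a classifier. It is not the module's actual code: the method name `traffic_type`, the `SEARCH_DOMAINS` list and the order of the checks are all assumptions made for the example.

```ruby
require "uri"

# Hypothetical list of search engine domains, for illustration only.
SEARCH_DOMAINS = %w[google.com yandex.ru bing.com].freeze

# Returns "utm", "organic", "referral" or "typein" for one page view.
# The order of the checks is an assumption. Internal navigation
# (a referrer from the site itself) is not handled in this sketch.
def traffic_type(page_url, referrer)
  params = URI.decode_www_form(URI.parse(page_url).query.to_s).to_h
  return "utm" if params.key?("utm_source")
  return "typein" if referrer.to_s.empty?

  host = URI.parse(referrer).host.to_s
  search = SEARCH_DOMAINS.any? { |d| host == d || host.end_with?(".#{d}") }
  search ? "organic" : "referral"
end
```

For example, `traffic_type("http://site.com/", nil)` lands in the typein basket, while the same page opened from a Google results page lands in organic.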
Next, we need rules for rewriting sources, since one visitor can come to the site at different times from different sources.

Logic of rewriting sources

The rewrite logic follows the logic of Google Analytics:

Note that referral visits do not overwrite anything during the current session. Here is why: during a visit, a visitor often arrives from a third-party resource that is not a real source, for example from an email service where they had a registration activation link.
In this system I decided to store, in addition to the overwritable data about the current visit, data about the very first visit. That is, at the time of conversion we will have both the first and the current source of the visitor.
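
The rewrite rules can be sketched roughly like this. Only the referral rule (no overwrite during an active session) and the separate storage of the first source come from the text above; the rules for utm, organic and typein are assumptions modelled on classic GA behaviour, and the method names are hypothetical.

```ruby
# Decides whether a newly detected source replaces the stored current one.
# Assumed rules: utm and organic always overwrite, referral overwrites only
# when a new session starts, typein never overwrites an existing source.
def overwrite_current?(new_type, session_active:, has_current:)
  return true unless has_current

  case new_type
  when "utm", "organic" then true
  when "referral"       then !session_active
  else false
  end
end

# Keeps the first source untouched and updates the current one per the rules.
def apply_visit(state, new_source, session_active:)
  first = state[:first] || new_source
  replace = overwrite_current?(new_source[:typ],
                               session_active: session_active,
                               has_current: !state[:current].nil?)
  { first: first, current: replace ? new_source : state[:current] }
end
```

With these rules, a visitor who first came from organic search and then clicks an activation link in a webmail client during the same session keeps the organic source as both first and current.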

What exactly can be pulled out using the module:

Data on the very first source:
utm_source, utm_medium, utm_campaign, utm_content, utm_term
The same data about the current source
(if the user came again from another source)
Date of first visit
Entry point
Full referrer at which the source was rewritten
The user’s IP and user agent

Module installation

We assume that you already have a Rails application to which you want to attach the module.

In the application’s Gemfile, add:

gem 'sourcebuster', :git => "git@github.com:alexfedoseev/sourcebuster.git"

Install:

bundle install

Since this is a Mountable engine, it exists in an isolated namespace.
We mount it in the application, adding in routes.rb :

mount Sourcebuster::Engine => "/sourcebuster"

Next, we need to copy and perform all the migrations.
Copy:

bundle exec rake sourcebuster:install:migrations

And execute:

bundle exec rake db:migrate

Three new tables will appear in your database:

sourcebuster_referer_sources
Data about custom sources.
sourcebuster_referer_types
Data about types of referrals (essentially utm_medium for referral traffic).
sourcebuster_settings
Application settings (session duration and subdomain processing).

You don’t have to do anything with them: they come with data out of the box, and there are interfaces for them.

More information about Mountable engines - http://guides.rubyonrails.org

The module is almost connected; one last touch remains: let it set cookies anywhere in your application. To do this, add the following to your application’s application_controller.rb:

class ApplicationController < ActionController::Base
  # Sourcebuster's cookie helpers
  include Sourcebuster::CookieSettersHelper
  # Set the source cookies before every action
  # (on Rails 4+ before_action is the preferred name)
  before_filter :set_sourcebuster_data
  # Make the extractor available in views as well
  helper_method :extract_sourcebuster_data
  # some code
  private
    def set_sourcebuster_data
      set_sourcebuster_cookies
    end
end

That should be it. The engine uses the main application’s templates, so you can customize the styles yourself (I may change this). I may have missed something; if something does not work, write to me.

Usage

When using the module, please keep in mind that it is a beta written by a person with very little development experience (more on that in a couple of paragraphs at the end of the post).

The module provides the following methods (more precisely, there is only one method, but it pulls out different data):

Module methods

# Very first source type (utm / organic / referral / typein)
extract_sourcebuster_data(:sb_first, :typ)
# Very first utm_source
extract_sourcebuster_data(:sb_first, :src)
# Very first utm_medium
extract_sourcebuster_data(:sb_first, :mdm)
# Very first utm_campaign
extract_sourcebuster_data(:sb_first, :cmp)
# Very first utm_content
extract_sourcebuster_data(:sb_first, :cnt)
# Very first utm_term
extract_sourcebuster_data(:sb_first, :trm)
# Current source type (utm / organic / referral / typein)
extract_sourcebuster_data(:sb_current, :typ)
# Current utm_source
extract_sourcebuster_data(:sb_current, :src)
# Current utm_medium
extract_sourcebuster_data(:sb_current, :mdm)
# Current utm_campaign
extract_sourcebuster_data(:sb_current, :cmp)
# Current utm_content
extract_sourcebuster_data(:sb_current, :cnt)
# Current utm_term
extract_sourcebuster_data(:sb_current, :trm)
# Date of the first visit to the site
extract_sourcebuster_data(:sb_first_add, :fd)
# Entry point
extract_sourcebuster_data(:sb_first_add, :ep)
# Full referrer at which the source was rewritten
extract_sourcebuster_data(:sb_referer, :ref)
# The user's IP
extract_sourcebuster_data(:sb_udata, :uip)
# And their user agent
extract_sourcebuster_data(:sb_udata, :uag)
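
To save all of this together with a lead, the calls above could be gathered into one hash. The sketch below is only an illustration: `SOURCE_FIELDS`, `collect_source_data` and the hash key names are hypothetical, and `extractor` stands for anything callable with the same (cookie, key) arguments, for example `method(:extract_sourcebuster_data)` inside a controller.

```ruby
# Cookie/key pairs taken from the method list above; the hash keys on the
# left are hypothetical names chosen for this example.
SOURCE_FIELDS = {
  first_type:       [:sb_first, :typ],
  first_source:     [:sb_first, :src],
  first_medium:     [:sb_first, :mdm],
  first_campaign:   [:sb_first, :cmp],
  first_content:    [:sb_first, :cnt],
  first_term:       [:sb_first, :trm],
  current_type:     [:sb_current, :typ],
  current_source:   [:sb_current, :src],
  current_medium:   [:sb_current, :mdm],
  current_campaign: [:sb_current, :cmp],
  current_content:  [:sb_current, :cnt],
  current_term:     [:sb_current, :trm],
  first_visit_date: [:sb_first_add, :fd],
  entry_point:      [:sb_first_add, :ep],
  referer:          [:sb_referer, :ref],
  user_ip:          [:sb_udata, :uip],
  user_agent:       [:sb_udata, :uag]
}.freeze

# Calls the extractor once per field and returns a plain hash of results.
def collect_source_data(extractor)
  SOURCE_FIELDS.transform_values { |cookie, key| extractor.call(cookie, key) }
end
```

The resulting hash can then be merged into whatever attributes you store for a lead.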

Module test page: http://sandbox.alexfedoseev.com/sourcebuster/showoff
You can go to it from different sources and see what the module has determined.

The module also allows you to configure a number of additional parameters.

Default settings

Interface: http://sandbox.alexfedoseev.com/sourcebuster/settings

Duration of the session
How long after the user’s last activity the visit is considered finished. Specified in minutes; the default is 30 minutes.
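
In code, such a check might look like the sketch below. The method name and the treatment of the exact 30-minute boundary are assumptions, not the module's implementation.

```ruby
DEFAULT_SESSION_MINUTES = 30

# True while the visit is still considered active, i.e. less than
# `minutes` have passed since the user's last activity.
def session_active?(last_activity, now = Time.now, minutes = DEFAULT_SESSION_MINUTES)
  (now - last_activity) < minutes * 60
end
```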

Subdomain Handling
This is essentially the equivalent of _setDomainName in GA. I will explain with an example.
Let's say you have a site on which there are subdomains:

site.com
blog.site.com
shop.site.com

And you want transitions from site.com pages to blog.site.com to be treated as internal, non-referral transitions (that is, the source is not overwritten when moving from one subdomain to another). To do this, check “I have subdomains and traffic between them should not be a referral” in the settings and enter the root host of the site in the “Main host” field; the module will treat all of its subdomains as one site. In our case, “site.com” goes there.

If you specify blog.site.com in the field, then a transition from alex.blog.site.com to blog.site.com will be non-referral, while a transition from alex.blog.site.com to shop.site.com will count as referral traffic.
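
The rule from these examples can be sketched as follows: a transition is internal only when both hosts sit under the main host. The method names are hypothetical and this is not the module's actual code.

```ruby
# True when host equals main_host or is one of its subdomains.
def within_host?(host, main_host)
  host == main_host || host.end_with?(".#{main_host}")
end

# A transition is internal (non-referral) only if both the page the visitor
# came from and the page they landed on belong to the main host.
def internal_transition?(from_host, to_host, main_host)
  within_host?(from_host, main_host) && within_host?(to_host, main_host)
end
```

Matching on `".#{main_host}"` rather than on the bare string keeps an unrelated host such as notsite.com from being treated as part of site.com.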

Additional sources

Interface: http://sandbox.alexfedoseev.com/sourcebuster/custom_sources

The system lets you configure how a number of additional sources are processed.
Each source is configured with the following parameters:

Domain
The domain that the referrer is matched against.
Alias
A pretty / friendly source name.
Channel
You can specify referral, organic or social.
Keyword parameter
The name of the parameter that carries the search query in the search engine URL.

What this table is for is easiest to explain with examples.

Example 1
You want the system to treat visits from Bing search as organic traffic (which is only fair).
If you go to bing.com and enter the query “apple” in the search box, you will be taken to a search results page with the address:
www.bing.com/search?q=apple&go=&qs=n&form=QBLH&pq=apple&sc=8-5&sp=-1&sk=&cvid=718ad07527244c319ecebf44aa261f64

Based on it, we create a new custom source:

Domain: bing.com
Alias: bing
Or whatever you like; you can also leave it empty, and then the referrer host will be substituted.
Channel: organic
Keyword Parameter: q
This is the parameter name between "?" and "=your_query" in the URL of the search results page.

Now everything that comes from such pages will be considered organic traffic.
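
Here is a minimal sketch of how the keyword parameter could be used to pull the search query out of a referrer. The method name `search_keyword` is hypothetical, and the referrer is assumed to include a scheme, as it does in a real Referer header.

```ruby
require "uri"

# Returns the value of keyword_param from the referrer's query string,
# or nil when the parameter is absent.
def search_keyword(referrer, keyword_param)
  query = URI.parse(referrer).query.to_s
  URI.decode_www_form(query).to_h[keyword_param]
end
```

For the Bing address above (prefixed with http://), `search_keyword(url, "q")` gives "apple".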

Example 2
You want to put visits from social networks into a separate group.
We follow a similar pattern:

Domain: facebook.com
Alias: facebook
Channel: social
Keyword Parameter: not needed

Done. Now all visits via links from Facebook (except those marked with utm tags) will have the channel value social.

In the domain field you must specify the full zone (.com, .com.ru, etc.). If you specify facebook.com, traffic from facebook.com.ru will not fall under this filter, but traffic from m.facebook.com will.
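
The matching described here can be sketched as a lookup over the configured sources. The `CustomSource` struct, the `CUSTOM_SOURCES` list and `find_custom_source` are hypothetical names for this example; the real module keeps this data in the sourcebuster_referer_sources table.

```ruby
# One configured source: domain, friendly name, channel, keyword parameter.
CustomSource = Struct.new(:domain, :source_alias, :channel, :keyword_param)

# The two examples from the text, hard-coded for illustration.
CUSTOM_SOURCES = [
  CustomSource.new("bing.com", "bing", "organic", "q"),
  CustomSource.new("facebook.com", "facebook", "social", nil)
].freeze

# Finds the source whose domain equals the referrer host or is its parent
# domain; returns nil when nothing matches.
def find_custom_source(referrer_host, sources = CUSTOM_SOURCES)
  sources.find do |s|
    referrer_host == s.domain || referrer_host.end_with?(".#{s.domain}")
  end
end
```

Because the comparison is anchored at the end of the host, m.facebook.com matches facebook.com while facebook.com.ru does not.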

Tests

Sources: https://github.com/alexfedoseev/sourcebuster/blob/master/test/integration/navigation_test.rb
The lion’s share of the tests are Selenium tests that verify that sources are identified and rewritten correctly. They are written in Ruby, but in such a way that they can check not only my module’s code but, in principle, any implementation (for example, if someone ports it to PHP or JS). That is, they test not the methods but the result of their work. In addition, no fake referrers are used: real transitions from real resources are tested. So if Yandex changes something in its search results (for example, switches to https, which would kill the referrer), the tests will show it. In short, everything is for real.

Right now the tests are pretty raw and written mostly for myself, but you can figure them out if you wish.
To test a rewritten implementation, you need:

a page laid out according to certain rules
(see the code of the test page and find the ids of the data blocks; if anything is unclear, ask)
indexed in search engines
(Yandex, Google and a third, additional one, for example Rambler)
and ranked in the top 5 for a specific query
plus links to this page from a social network and from a third-party (referral) site

The top block of the code contains the constants that you need to set before running the tests. They make it clear what exactly needs to be prepared.

And yes, the test run takes about 20-30 minutes.

I repeat: when using the module, please keep in mind that it is a beta written by a person with very little development experience (more on that a couple of paragraphs below).

Part two. About me and Rails

I have been doing Internet marketing for 3.5 years, and not so long ago I came to the conclusion that generating traffic is not my thing. I want to generate meaning, not traffic. So I started writing code. That was about 9 months ago. I have no IT or math background, so I had to dig into everything from scratch and on my own. The books of Chris Pine and Michael Hartl and the rest of the Internet helped me with this.

As a result, I wrote a blog for myself, but about 5 months ago I was forced to take a break, and this module is the first thing I have written since then. I ask members of the Rails community to criticize the implementation and point out the obvious and not-so-obvious blunders. In all this time I have never managed to meet a living person who writes Ruby, and figuring out absolutely everything on my own is quite hard.

Thanks in advance for the criticism and I hope this post will be useful to someone. Good luck.
