Chapter 3 Basics - Gathering Data

We live in a connected world. No matter which website we visit, which app we use, or whom we interact with: we leave a digital footprint.
Every day, more behavioral data is created, and it often makes using the internet more convenient.

Here is an example: Netflix infers from user data which movies we like and then optimizes which videos are suggested to each of us individually. Google personalizes search results, and advertisers measure the effect of ad impressions on purchase probability. Tracking data helps companies better understand consumer behavior and customize their services.

3.1 The concept of Pixels

User behavior on websites and platforms is mostly captured via tracking pixels (often in combination with cookies). A “pixel” or “TAG” is a small piece of software that is loaded in the background of a website to collect information, undetected, about users and their behavior on the website. Sometimes these TAGs are forgotten and live on as tiny pieces of code somewhere in the depths of a website. In this case they keep collecting data, mostly without the website owner’s knowledge. Control over these code snippets is often spread across multiple divisions and partners of a company, e.g. IT, website UX (creative agency), marketing (media agency), and customer relations (social media agency).
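To make the mechanism concrete, here is a minimal sketch of what happens when a pixel loads, using only the Python standard library. The host and port are hypothetical, and real vendor TAGs add JavaScript and far richer payloads; the point is that the browser requests a 1×1 image, and the request itself delivers the data.

    import base64
    from http.server import BaseHTTPRequestHandler, HTTPServer

    # The canonical 1x1 transparent GIF that tracking pixels return.
    PIXEL = base64.b64decode(
        'R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7'
    )

    class PixelHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            # The image is irrelevant; the metadata sent with every
            # request is the actual product.
            print({
                'path': self.path,                             # often carries campaign parameters
                'referer': self.headers.get('Referer'),        # the page the user is viewing
                'user_agent': self.headers.get('User-Agent'),  # browser and device
                'cookie': self.headers.get('Cookie'),          # pseudonymous user ID
            })
            self.send_response(200)
            self.send_header('Content-Type', 'image/gif')
            self.end_headers()
            self.wfile.write(PIXEL)

    if __name__ == '__main__':
        # Hypothetical host/port for local experimentation.
        HTTPServer(('localhost', 8000), PixelHandler).serve_forever()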

As a consequence, nobody is really responsible. Yet in this area especially, it is important to know who is actually receiving which information. Here is an example from the fashion industry: after the end of a joint campaign, an external data provider keeps collecting data about the website’s users (who are attractive to other brands) via the campaign TAG that was implemented to measure the campaign’s effects. Without the website owner’s knowledge, these user profiles could be sold, e.g. to a premium timepiece company or even to a competitor from the fashion industry, making a competitor’s campaign more efficient. The sad truth is that many companies neither know that many of these (external) code snippets still lie dormant on their websites, nor for which party these TAGs collect data, nor what value this data could have both for their own marketing and for other companies. Companies often lack a holistic, deliberately implemented data strategy, although answers to such questions are decisive from many perspectives.

3.2 What does the German tagging landscape look like?

For my job, I examined the 350 most important German websites over the past couple of weeks to learn more about the current tagging landscape. The analysis shows that the average German website has 16 external pixels implemented, with a very wide range reaching from almost 40 pixels in the tourism industry to just 2 in the pharmaceutical industry. As a general pattern, industries dealing with sensitive data (e.g. banks and insurers) tend to have fewer pixels on their websites. The distribution of individual providers is interesting as well. Among analytics providers there is a clear order of use: Google Analytics can be found on 47% of all websites, followed by Adobe Omniture (11%); Webtrekk (10%) and Piwik (4%) take third and fourth position.

3.3 Utilization of pixel data

Providers whose pixels/TAGs are implemented on multiple websites can track user behavior across those websites to create detailed user profiles. These profile segments are very useful from the advertiser’s perspective because they make it possible to address members of a target group in a favorable environment. But website operators are often left out when it comes to monetizing the data. From the website operators’ perspective it is therefore important to check whether data from external providers is used for such profiling and, if so, how much their data is worth in this pool of profiles. Current rates for user profile data depend on the quality of the data and range from €1 to €5 CPM based on impressions. For a website with a lot of traffic and detailed user profiles, the value of this data can add up to €100,000 per month.
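The arithmetic behind these figures is straightforward: CPM is the price per 1,000 impressions. The following back-of-the-envelope sketch uses the rates quoted above together with a hypothetical traffic volume.

    def monthly_data_value(impressions_per_month: int, cpm_eur: float) -> float:
        """Value of profile data sold at a given CPM (price per 1,000 impressions)."""
        return impressions_per_month / 1000 * cpm_eur

    # A hypothetical high-traffic site with 20 million monthly impressions,
    # sold at the upper rate of 5 EUR CPM, reaches the figure quoted above:
    print(monthly_data_value(20_000_000, 5.0))  # -> 100000.0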

3.4 Consequences for security & e-commerce

Another problem occurs in the area of website security and user experience. As soon as unsafe pixels (HTTP connection) are implemented into “safe” websites (HTTPS connection), the traffic is no longer fully encrypted. Some browsers (e.g. Internet Explorer) display security warnings in such cases. These warnings disturb users and can lead to them abandoning their online purchases. Thus, the implementation of unsafe TAGs has a direct effect on the user experience and thereby an indirect effect on conversion rates and sales. Most TAGs can be loaded via both HTTPS and HTTP.
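A simple way to audit this is to scan a page for sub-resources that load over plain HTTP. The following sketch uses only the Python standard library; the URL is a placeholder, and a thorough audit would also inspect scripts and stylesheets.

    import re
    import urllib.request

    def find_insecure_resources(page_url: str) -> list[str]:
        """Return src/href URLs on the page that load over plain HTTP."""
        html = urllib.request.urlopen(page_url).read().decode('utf-8', 'replace')
        return re.findall(r'(?:src|href)=["\'](http://[^"\']+)', html)

    # Placeholder URL; point this at your own HTTPS page.
    for url in find_insecure_resources('https://www.example.com/'):
        print('mixed content:', url)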

3.5 Design of a deliberate data strategy

The first step is to get an overview of the owned platforms and to analyze the pixels used on them. The second step is ongoing control of the implemented pixels and regular checks of their necessity and use. So-called “TAG management systems”, offered e.g. by Google and Adobe, help with this. These systems are especially useful for avoiding time-consuming coordination between IT and marketing. It is not surprising that such systems are already widespread; they can be found on 35% of all websites. But companies using such systems are not fundamentally protected against mistakes.

[image] Illustrative view of implemented pixels

The aim of such systems is to administer and control all external pixels in one place. But there are quite a few websites where pixels are implemented outside the TAG management system and thereby undermine the idea of the system. For example, figure 4 shows that the Google Tag Manager is implemented, yet many pixels are loaded directly from the main website: the arrows point from the actual website (purple) directly to the pixels (blue). Considering the abundance and complexity of these pitfalls, one has to ask why so many pixels are still implemented and used. The benefit seems to exceed the costs by far. It mainly comes from more efficient ad delivery based on carefully built user segments that can be addressed via targeting. On the consumer side, this segmentation usually happens on data management platforms (DMPs). In our survey, pixels of DMPs were found on 50% of the analyzed websites. Some DMPs also offer advertisers the option to buy cookie data from third parties and to deliver ads on an individual basis.

3.6 Gathering data to create user profiles

How does data collection on the web work? What are typical data sources?

3.7 What is data onboarding?

3.8 What is lookalike modelling?

Lookalike modelling is “finding new people who behave like current customers – or like a defined segment. The topic is hot, but there’s deliberate mystery about the method and its accuracy” [theguardian.com]. The following section will demystify the method and bring some science to it.

The basic idea is to find users who belong to a defined audience. Starting from a seed audience, such as users who have already converted or users who have stated in a survey that they belong to an audience, the idea is to train a statistical model that identifies which user actions separate that user cluster from other users. Let’s look at a simple example: predicting a user’s gender. The data usually used to create lookalike models consists of the websites a user has visited in the past plus one outcome variable. In this example, that outcome variable is the user’s gender.

[image] Schematic view of the process
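To make the data layout concrete, the following sketch shows one possible representation: one row per user, one column per website holding visit counts, plus the outcome variable as the label. The websites and counts are made up for illustration, and pandas is assumed to be available.

    import pandas as pd

    # One row per user, one column per website (visit counts),
    # plus the outcome variable the model should learn to predict.
    data = pd.DataFrame({
        'chip.de':     [3, 0, 1, 0],
        'gesuende.de': [0, 2, 0, 1],
        'spiegel.de':  [1, 1, 0, 2],
        'kicker.de':   [2, 0, 0, 0],
        'gender':      ['m', 'f', 'm', 'f'],
    })
    X = data.drop(columns='gender')  # features: browsing behavior
    y = data['gender']               # label: the user's gender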

That information is fed into various algorithms to train models on the data. One such model is a tree-based model; the following chart exemplifies such a “tree” model. As a basis we used 20,000 users and just 4 websites to classify users by gender. The websites are illustrated as boxes in the middle. In the first step, the model tells us that if a user visited chip.de (N >= 1), he is male with a probability of 65%. A more complex example further down the tree: if a user did not visit chip.de but did visit gesuende.de, she is female with a probability of 73%. The tree becomes more complex when users can be segmented using more websites.

[image] Example tree algorithm to decide on a user attribute
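A tree like the one in the chart can be trained in a few lines. The following sketch uses scikit-learn’s DecisionTreeClassifier on the toy feature matrix X and labels y from the previous snippet; the depth limit and data are illustrative, not the model behind the chart.

    from sklearn.tree import DecisionTreeClassifier, export_text

    # X, y come from the previous snippet (visit counts and gender labels).
    tree = DecisionTreeClassifier(max_depth=2, random_state=0)
    tree.fit(X, y)

    # export_text prints rules of the form the chart shows,
    # e.g. "chip.de >= 0.5 -> class m".
    print(export_text(tree, feature_names=list(X.columns)))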

Model accuracy describes how well the model classifies users. As a simple measure we take prediction accuracy, defined as the number of correctly classified users as a percentage of all classified users. For the previous example of predicting a user’s gender, we see that accuracy is a function of how many websites are used. In the example chart, 62% of all users are correctly classified using just 20 randomly selected websites as the data source. That number increases to roughly 73% when the model has access to 500+ websites. Beyond that, additional websites do not add value.

[image] Prediction accuracy dependent on website universe
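The accuracy curve can be reproduced mechanically by training the same model on growing, randomly drawn subsets of website features and recording holdout accuracy. The sketch below uses synthetic random data, so its numbers only demonstrate the procedure; on real tracking data the curve rises and then flattens as described above.

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import accuracy_score

    rng = np.random.default_rng(0)
    n_users, n_sites = 20_000, 1_000
    X = rng.poisson(0.1, size=(n_users, n_sites))  # visit counts per user and website
    y = rng.integers(0, 2, size=n_users)           # outcome variable (synthetic labels)

    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

    for k in (20, 100, 500, 1000):                 # size of the website universe
        cols = rng.choice(n_sites, size=k, replace=False)
        model = DecisionTreeClassifier(max_depth=10, random_state=0)
        model.fit(X_tr[:, cols], y_tr)
        acc = accuracy_score(y_te, model.predict(X_te[:, cols]))
        print(f'{k:4d} websites -> accuracy {acc:.2f}')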

• How accurate are lookalike models? What are the key factors?
  o What is the effect of the number of websites used to measure user behavior?
  o What is the effect of different machine learning algorithms?
  o What is the effect of the selection of websites used to measure user behavior?
  o What is the minimum number of users needed to create models?
  o Which attributes can reliably be used to create models?

[image] Prediction accuracy dependent on the share of users classified

Find attributes or features of (converting) users: instead of looking for the ideal user, we look for the ideal behaviour – activities and interests that indicate a person will convert, regardless of what their profile may look like.

Then find users with these features in an audience, in order to reach the users with the same behavioral pattern.

Look alike, act alike.