Demographic Data in Yahoo Web Analytics and its Validity
Note: The following is a post by Jiri Brazda, who is the Founder of Optimics and Web Analytics Association Regional Manager for Eastern Europe. Go connect with @jiribrazda.
It’s already been a year since Dennis announced the launch of Yahoo! Web Analytics 9.5 on this blog. One of the shiny features announced then was the addition of demographic data, such as gender and age. Dennis, being the awesome analytics expert he is, went on to suggest a few exciting ways of using these data in his screencast overview of Yahoo! Web Analytics, which I recommend you watch before you read on if you are not familiar with the tool.
With the announcement, however, a few analysts from European countries questioned Yahoo’s ability to deliver this kind of data in markets where Yahoo’s search market share in particular is almost non-existent. The Czech Republic, like many other small European countries, may well serve as a case in point. Take this purely as a rough indication, but my experience tells me that for Czech websites with as few as a couple of thousand visitors, Yahoo! is indeed able to provide demographic data with a confidence level of about 80 %, and for websites with more than 50 000 visitors you easily get the highest confidence level of 95 %.
Yahoo’s ability to collect this demographic data comes down to the Yahoo! ID / Yahoo Cookie, which you need in order to use the vast array of Yahoo! web properties, most notably Yahoo! Mail and the photo sharing website Flickr. Flickr is popular in just about every country in the world, and so seems to account for the bulk of the visitors to Czech websites who carry a Yahoo! ID / Yahoo Cookie.
So for me, and I hope for many of you as well, no matter if you’re a fellow YWACN member or Yahoo! customer, the question is no longer whether Yahoo! can provide demographic data about visitors to your website, because the answer is a resounding yes. The question really is whether you can trust the data, use it to analyze the behavior and business outcomes of different demographic segments, and make decisions informed by such analysis.
I therefore set out to examine the demographic data, gender to be specific, provided by Yahoo! Web Analytics and compare them with data from NetMonitor, the official audience measurement platform in the Czech Republic which provides authoritative data that drive demand in the local online advertising industry.
Okoun.cz, the website in question that the data come from, is a traditional kind of message board that has been around since 2001. All data published below come from a period of 4 months from November 2009 through to February 2010.
Data are worth a thousand words
First and foremost, before we take a deep dive into the numbers, let’s be clear that all kinds of demographic measurement (except perhaps for an official census) are based on some kind of approximation. The idea is that if we find a large enough sample that exhibits the same qualities as the total population, it suffices to analyze data from the sample and assert that any outcomes of the analysis are likely to be true of the whole population as well. It follows that the margin of error of such approximations depends heavily on the quality of the sample, and that confidence decreases as the total population gets bigger and the sample gets smaller.
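To make the sampling idea a little more tangible, here is a minimal sketch of the textbook margin of error for a proportion (such as the share of male visitors), with an optional finite population correction. The function name `margin_of_error` and the example figures are assumptions for illustration only; this is not how Yahoo! Web Analytics or NetMonitor compute their confidence levels, and it assumes a random sample, which a cookie-based sample may not be.

```python
import math


def margin_of_error(p, n, z=1.96, population=None):
    """Approximate margin of error for a sample proportion.

    p          -- observed proportion in the sample (0..1)
    n          -- sample size
    z          -- z-score for the confidence level (1.96 is roughly 95 %)
    population -- optional total population size; when given, the
                  finite population correction is applied
    """
    if n <= 0:
        raise ValueError("sample size must be positive")
    standard_error = math.sqrt(p * (1 - p) / n)
    if population is not None:
        if population < n or population < 2:
            raise ValueError("population must be at least as large as the sample")
        # Finite population correction: the sample tells us more when it
        # covers a larger part of the population.
        standard_error *= math.sqrt((population - n) / (population - 1))
    return z * standard_error


if __name__ == "__main__":
    # Hypothetical sample sizes, with a 70 % observed share.
    for sample_size in (2_000, 50_000):
        moe = margin_of_error(0.70, sample_size)
        print(f"n={sample_size}: +/- {100 * moe:.2f} percentage points")
```

Note that the formula only captures random sampling error. A sample that is skewed towards one kind of user stays skewed no matter how large it gets.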
So much for theory, here’s the gender split reported by Yahoo! Web Analytics compared to NetMonitor.
From the chart above, it’s pretty clear that while Yahoo’s data show about 70 % male visitors across the timeframe, NetMonitor’s data for male visitors fluctuate in the range of 55 % – 65 %, so the difference between the two measurement systems is up to 15 percentage points.
The metrics are different though. While Yahoo! Web Analytics works with Unique Visitors (which really means cookies), NetMonitor operates with Real Users.
NetMonitor – methodology
I don’t want to go into much detail here, so just in a nutshell: NetMonitor is deployed on something like 95% of the Czech internet (in terms of traffic, not number of websites). This lets them differentiate between good cookies (cookies with a defined minimum lifespan) and bad cookies (below the lifespan threshold). They approximate the number of Real Users from the number of good cookies, the number of pageviews generated by the good cookies, and the total number of pageviews. This calculation is designed to accommodate the cookie deletion phenomenon.
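One way to picture such a calculation is sketched below. To be clear, this is my own simplified assumption and not NetMonitor’s published formula: it supposes that a real user generates, on average, as many pageviews as a good cookie does, and scales the good cookie count up to the total pageviews. The function name and the numbers are hypothetical.

```python
def estimate_real_users(good_cookies, good_cookie_pageviews, total_pageviews):
    """Rough Real Users estimate from the three inputs named in the text.

    ASSUMPTION (not NetMonitor's actual method): pageviews per real user
    are taken to equal pageviews per good cookie, so short-lived "bad"
    cookies contribute pageviews but no extra users of their own.
    """
    if good_cookies <= 0 or good_cookie_pageviews <= 0:
        raise ValueError("need a positive number of good cookies and pageviews")
    pageviews_per_user = good_cookie_pageviews / good_cookies
    return total_pageviews / pageviews_per_user


if __name__ == "__main__":
    # Hypothetical figures for illustration only.
    users = estimate_real_users(
        good_cookies=1_000,
        good_cookie_pageviews=20_000,
        total_pageviews=50_000,
    )
    print(f"Estimated Real Users: {users:.0f}")
```

The point of the sketch is only that deleted cookies inflate a raw cookie count, whereas an estimate anchored on long-lived cookies does not grow every time a visitor clears their browser.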
Demographics data are collected from panel members using two methods: user-centric software based measurement (backbone of the panel, validated data, 1/3 of the panel) and site-centric pop-up surveys (less reliable data that are hard to validate, 2/3 of the panel).
Okay, now let’s take a look at the overall Unique Visitors and Real Users data in order to be able to examine the difference in gender split shown above and draw some conclusions.
Set the traffic differences aside; they’re different metrics after all. What is more important here is the relative sample size. It is quite evident that Yahoo! indeed has a considerable amount of data. In fact, it is three times as much as the local audience measurement panel in this particular case. Websites with primarily international traffic may have an even bigger sample – I’ve seen up to 10 % of all website traffic identified with demographics data!
So can you trust the story they tell?
No data are 100% correct, but I believe it is safe to assume that with NetMonitor, more effort has gone into developing a sound methodology for the local market, and so its data on the overall gender split should be closer to the truth. Therefore, for top level reporting and in order to attract advertising suitors, website owners are better off using local audience measurement tools that provide rich, validated demographics data that can be easily compared with other websites.
Where Yahoo! Web Analytics demographics data are not very close in terms of overall numbers, the difference is most likely due to the fact that Yahoo! doesn’t have their websites localized in Czech. In effect, this skews the dataset in favour of more advanced users and away from the general population of internet users in the Czech Republic. Not everybody speaks English after all.
The screenshot below offers plenty of evidence for this. The Czech Republic is among the very few countries in the world where Google is not the #1 search engine. That position belongs to a local player called Seznam, but the more advanced users usually prefer Google as their first choice. From the Search Engines report you can see that the visitors identified through their Yahoo! ID and Cookie are indeed heavy Google users.
However, Yahoo’s demographics data still represent a lot of value to website owners seeking to better understand and communicate with their customers. It is possible to use this data within Yahoo! Web Analytics to isolate segments and analyze their behaviour in contrast with other segments on a very detailed level.
The chart below attests that the gender data are very, very close to reality. Okoun.cz has a ton of different topics you can discuss and, of course, some of them are purely male interests and some purely female. The chart illustrates each message board’s relative popularity, measured by content consumption in pageviews, for both male and female visitors.
If you speak a little Czech, I will let you explore on your own, from the following table of top 10 “female” boards, which topics are relatively more popular with the female audience. If you can’t speak any Czech though, just believe me that reports like this are a lot of fun and full of insights at the same time.
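For readers who would like to build a similar report from exported data, here is a minimal sketch of the idea: each board’s share of all male pageviews is compared with its share of all female pageviews. The board names, the numbers and the function names are hypothetical, and this is only one plausible reading of “relative popularity”, not necessarily the exact calculation behind the chart.

```python
def board_shares(pageviews):
    """Return each board's share of male and of female pageviews.

    pageviews maps a board name to (male_pageviews, female_pageviews).
    """
    male_total = sum(male for male, _ in pageviews.values())
    female_total = sum(female for _, female in pageviews.values())
    if male_total == 0 or female_total == 0:
        raise ValueError("need pageviews for both segments")
    return {
        board: (male / male_total, female / female_total)
        for board, (male, female) in pageviews.items()
    }


def top_female_boards(pageviews, limit=10):
    """Boards ordered by how much more popular they are with women."""
    shares = board_shares(pageviews)
    return sorted(
        shares,
        key=lambda board: shares[board][1] - shares[board][0],
        reverse=True,
    )[:limit]


if __name__ == "__main__":
    # Hypothetical export: board -> (male pageviews, female pageviews).
    example = {
        "board_a": (600, 100),
        "board_b": (300, 300),
        "board_c": (100, 600),
    }
    for board in top_female_boards(example):
        male_share, female_share = board_shares(example)[board]
        print(f"{board}: male {male_share:.0%}, female {female_share:.0%}")
```

Comparing shares within each segment, rather than raw pageviews, keeps the larger male audience from dominating the ranking.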
Conclusion
Demographics data as a dimension has been absent from traditional web analytics reporting for years, but now it enables us to talk to our marketing colleagues in their language, because in traditional marketing, demographic segmentation has always played a huge part. It is about time we discovered its value in web analytics as well.
Demographics data in Yahoo! Web Analytics do not precisely represent the overall traffic for Czech websites, but they do seem to be very precise for the many visitors identified through a Yahoo! ID / Cookie. Therefore my conclusion is that these valuable demographic data points can indeed be utilized in segmentation and analysis of your website visitors, even if they primarily come from small countries such as the Czech Republic, and especially when combined with dimensions and metrics not available in the audience measurement tool, such as campaigns, custom visitor data, conversions and sales.
I’d like to invite you – from whatever part of the world you are – to share your experience on using demographics data in web analytics AND optimization.
Jiri








April 16th, 2010 at 11:14
Hey Jiri,
I commented to you in email of course as we got the post up. BUT wanted to say in public, that this is a great post and some good interesting analysis. Thanks for thinking and writing it up!
d. :-)
April 19th, 2010 at 4:45
Dennis, thanks for the kind words! :-)
My goal was just to find out to what extent we can analyze and rely on this kind of data. I think it is very important in order to set expectations with our clients. Given how unique this feature is I find it quite surprising though that there hasn’t been much buzz around it and would be interested to know whether YWA users take advantage of it and access demographic data through segmentation and custom reports.
Perhaps in a future post we could take a deeper look at how we can use demographic data in YWA to derive actionable insights and actually improve something – be it user experience or the bottom line.
Does anybody want to provide some examples?
April 19th, 2010 at 5:00
[...] Brázda wrote an article on Dennis Mortensen’s blog about the trustworthiness of sociodemographic data in Yahoo Web Analytics in the Czech environment (the article is in English). [...]
June 21st, 2010 at 4:27
[...] seems to catch up and is said to provide data for smaller markets as well (see for yourself this article for the .cz population as an example) – we surely will see more of this any time [...]
June 28th, 2010 at 10:55
[...] the article originally appeared on the blog of Dennis Mortensen, who works as Director of Data Insights [...]