From a search engine’s point of view, the web is full of duplicate content, and the engines’ challenge is to index and display the original, or in their words the “canonical”, version of each piece of content. Google, Yahoo and Microsoft (announcement links) have agreed upon a new standard that lets website managers inform them of duplicate content. The SEO industry is, in general, positive about this new feature, and is pretty much celebrating. I’ll leave my opinion about the search engines’ impact on and control over content for another day; I would, however, like to comment on the subject from a Web Analytics point of view, as most of what you see written about it relates to SEO (Search Engine Optimization).
When you report on URLs in your analytics tool with the URL as the grouping dimension, as in a report like “Most Requested Pages by Page URL”, you get exactly what you asked for: a list of unique URLs (not unique content).
Unlike search engines, users of analytics tools tend to be closely affiliated with the website in question, and therefore closely connected to its content. With that in mind, analytics users have the advantage of not having to guess what is duplicate content and what is not.
Most analytics tools can report on Title (typically the value of the HTML <TITLE> tag), which makes it possible to group pages that have different URLs but analogous content.
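As a rough illustration of the difference between the two groupings, the sketch below totals page views first by URL and then by Title. The record shape (`url`, `title`, `views`), the sample figures, and the helper name `sumViewsBy` are assumptions made for this example; they are not the export format of any particular analytics tool.

```javascript
// Hypothetical page-view records; field names and numbers are made up for illustration.
const pageViews = [
  { url: "http://www.example.com/superman?category=lexluthor", title: "Lex Luthor and Friends", views: 5 },
  { url: "http://www.example.com/superman?category=villains", title: "Lex Luthor and Friends", views: 3 },
  { url: "http://www.example.com/superman", title: "Lex Luthor and Friends", views: 2 },
  { url: "http://www.example.com/about", title: "About", views: 4 },
];

// Sum the views of all records that share the same value for `key`.
function sumViewsBy(records, key) {
  const totals = new Map();
  for (const record of records) {
    totals.set(record[key], (totals.get(record[key]) || 0) + record.views);
  }
  return totals;
}

const byUrl = sumViewsBy(pageViews, "url");     // one row per unique URL
const byTitle = sumViewsBy(pageViews, "title"); // one row per unique Title
```

With this sample data the URL report has four rows, while the Title report collapses the three Lex Luthor URLs into a single row that carries their combined page views.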
The following pages are typically reported as three distinct URLs:
Appending the variable and value to the end of my blog’s default page gives the following report result in Yahoo! Web Analytics (it is very similar in other tools).
Looking at the report result above, we know that the 4 highlighted URLs all hold the same information about our friend Lex Luthor (or actually just my front page), and we might choose an HTML <TITLE> tag that goes like this: <TITLE>Lex Luthor and Friends</TITLE>, or, in the case of my blog, just the homepage title as highlighted below.
Note on the figures: I conducted the experiment (and created the example screenshots) using Yahoo! Web Analytics, as it reports on page views in real time. While I was refreshing the front page with the previously mentioned category variables, other visitors of course dropped by my blog, but you will notice that the front page, as marked by the yellow highlighter, ends up with the same number of page views ungrouped and grouped over the course of this 5-minute exercise, namely 19 page views.
This opportunity to group analogous content is quite similar to what the search engines just introduced as the new “canonical” link tag value, which could look like this if implemented for our Lex Luthor example:
<link rel="canonical" href="http://www.example.com/superman?category=lexluthor"/>
We, as in most analytics vendors, though not Google Analytics (as far as I know; please correct me, Avinash), have actually extended this and provide a way to override the HTML <TITLE> tag with a custom value. The syntax in Yahoo! Web Analytics would be the following:
var DOCUMENTNAME='Lex Luthor Profile Page';
or the new version 5 code:
YWATracker.setDocumentName("Lex Luthor Profile Page");
The two code examples above provide the opportunity to override the <TITLE> tag and thus keep your reporting intact should the actual HTML <TITLE> change. I believe we agree that changing the <TITLE> tag itself does not create new content. There are multiple reasons why one would turn to this option, but the most obvious one is SEO (Search Engine Optimization): the content owner can test multiple titles without destroying the reporting.
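One way to keep such an override stable while titles are being tested is to derive the reporting name from the page’s path rather than from its current title. The sketch below is only an illustration of that idea: the lookup table `documentNames` and the helper `documentNameFor` are hypothetical names of my own, not part of the Yahoo! Web Analytics tracking code.

```javascript
// Hypothetical lookup from URL path to a stable reporting name.
const documentNames = {
  "/superman": "Lex Luthor Profile Page",
};

// Return the stable name for a page, ignoring the query string,
// and fall back to the page's current title when no name is mapped.
function documentNameFor(pageUrl, currentTitle) {
  const path = new URL(pageUrl).pathname;
  return Object.prototype.hasOwnProperty.call(documentNames, path)
    ? documentNames[path]
    : currentTitle;
}

// The same reporting name comes back no matter which title is being tested.
const nameA = documentNameFor("http://www.example.com/superman?category=lexluthor", "Lex Luthor and Friends");
const nameB = documentNameFor("http://www.example.com/superman", "Meet Lex Luthor");
```

The returned string is the kind of value one would hand to the document name override shown earlier, so the report row stays the same across title experiments.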
I believe the analytics industry already provides duplicate content reporting functionality, very much like the canonical tag value introduced by the search engines, in the form of grouping by Title or the extended DocumentName grouping. The likely challenge is making sure that this grouped reporting stays aligned with the deployment and use of the canonical tag value.
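To make that alignment challenge a little more concrete, here is a minimal sketch of a check one could run over a list of pages: it flags any reporting group (Title) whose pages declare more than one canonical URL. The record shape (`title`, `canonical`), the sample data, and the function name `findMisalignedTitles` are assumptions for illustration only.

```javascript
// Return the titles (reporting groups) whose pages declare more than one canonical URL.
function findMisalignedTitles(pages) {
  const canonicalsByTitle = new Map();
  for (const page of pages) {
    if (!canonicalsByTitle.has(page.title)) {
      canonicalsByTitle.set(page.title, new Set());
    }
    canonicalsByTitle.get(page.title).add(page.canonical);
  }
  return [...canonicalsByTitle]
    .filter(([, canonicals]) => canonicals.size > 1)
    .map(([title]) => title);
}

// Hypothetical crawl result: two pages share a title but disagree on the canonical URL.
const crawledPages = [
  { title: "Lex Luthor and Friends", canonical: "http://www.example.com/superman?category=lexluthor" },
  { title: "Lex Luthor and Friends", canonical: "http://www.example.com/superman" },
  { title: "About", canonical: "http://www.example.com/about" },
];

const misaligned = findMisalignedTitles(crawledPages);
```

Any title this returns is a group that the analytics report treats as one piece of content while the canonical tags treat it as several.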
I hope this shed some light on the canonical debate from a Web Analytics point of view; do let me know your comments on this.