Selection Effect, or Idiots and Web Statistics


Quite often, you see stupid people saying things like ``99% of the hits in my web access log are from Internet Explorer. So only 1% of people use non-IE browsers, and I don't give a fuck about them.''

At the risk of pointing out the brutally obvious, this is so much bullshit, and there is no reason that an otherwise-intelligent person should believe it.

There are two reasons for this:

  1. The User-Agent header lies. Get over it.
  2. A strong selection effect.

The second is more interesting.

Suppose that n Netscape users and i Internet Explorer users visit a site. Idiotic web design means that the site doesn't work in Netscape, so each Netscape user looks at only N = 1 pages before giving up in disgust. The Internet Explorer users each view some larger number of pages I. In practice, perhaps, I = 10. Also let a = N / I.

Now let's look at the access log. Clearly, we will see a total of n N + i I hits, of which n N come from Netscape. So f = n N / (n N + i I) is the fraction of hits which come from Netscape (1%, or 0.01, in the example above).
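To make that concrete, here is a minimal Python sketch of the hit fraction. The function name and the visitor counts (10 Netscape users, 99 Internet Explorer users) are made up for illustration; only N = 1 and I = 10 come from the text.

```python
def hit_fraction(n, i, N, I):
    """Fraction of hits in the access log which come from Netscape users."""
    netscape_hits = n * N   # each Netscape user views N pages
    ie_hits = i * I         # each Internet Explorer user views I pages
    return netscape_hits / (netscape_hits + ie_hits)


# Hypothetical visitor counts, chosen so that the log shows 1% Netscape hits.
f = hit_fraction(n=10, i=99, N=1, I=10)
print(f"{f:.2%} of hits come from Netscape")
```

With these assumed counts the log contains 10 Netscape hits out of 1000 in total.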

So, if you see 1% of hits coming from Netscape users in your badly-designed site, how many of the people trying to use it are actually running Netscape? This is obviously given by

             n
      u = -------
           n + i

which we can obtain from f as follows. First divide the top and bottom of f by I and use a = N / I, then solve for n / i:

             a n
      f = ---------
           i + a n

     n        f
    --- = ---------
     i     a - f a

   n             f
------- = ---------------
 n + i     a + f (1 - a)

With the numbers given above (f = 0.01, i.e. 1% of hits from Netscape, and a = 0.1, i.e. Internet Explorer users viewing ten times as many pages as Netscape users), you find

       u = 0.092

or a little over 9%, which is nearly one visitor in ten. That web design decision doesn't look quite so intelligent now, does it? What sort of business would say `let's make 10% of our customers just piss off, while simultaneously making us look really incompetent to them, just because we're too lazy to design our site properly?'
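If you would rather let a computer do the sum, the formula for u can be sketched in a few lines of Python. The function name is my own invention; f and a are as defined above.

```python
def user_fraction(f, a):
    """Fraction of visitors using Netscape, given the fraction f of hits
    which come from Netscape and the page-view ratio a = N / I."""
    return f / (a + f * (1 - a))


# The example from the text: 1% of hits, a factor of ten in page views.
u = user_fraction(f=0.01, a=0.1)
print(round(u, 3))  # 0.092
```

Plug in the f from your own access log and your own guess at a; the guess at a is the part you can't read off the log.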

For those who didn't follow the maths, here's a picture:

[Figure: graph of user fraction u against hit fraction f]

Obviously, when your site works just as well in Netscape as in Internet Explorer (a = 1), u = f: the fraction of hits in the access log exactly reflects the fraction of real users.
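For those who prefer numbers to pictures, here is a rough Python sketch which tabulates u against f for a few values of a. The particular values of f and a are arbitrary picks for illustration, and the function name is hypothetical.

```python
def user_fraction(f, a):
    """u = f / (a + f (1 - a)), with a = N / I as in the text."""
    return f / (a + f * (1 - a))


hit_fractions = [0.01, 0.05, 0.10, 0.50]   # illustrative values of f
ratios = [1.0, 0.5, 0.1]                   # illustrative values of a

# Header row, then one row of u values per hit fraction.
print("    f  " + "".join(f"a={a:<6}" for a in ratios))
for f in hit_fractions:
    row = "".join(f"{user_fraction(f, a):<8.3f}" for a in ratios)
    print(f"{f:5.2f}  {row}")
```

The a = 1.0 column just repeats f; the smaller a gets, the more the access log understates the real number of Netscape users.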


Copyright (c) 2003 Chris Lightfoot. All rights reserved.