[ Home page | Web log ]
Not much to say here, but before I go on holiday for the weekend I just thought I'd remind my half-dozen readers that the deadline for responses to the public consultation on the Review of the BBC Charter is this Wednesday, the 31st March. So, if you have an opinion about the BBC -- and, frankly, who doesn't? -- or think that government consultations are a good idea, and should be encouraged, then follow the links and submit a response. If you can't be bothered to write out a whole spiel, then there's a questionnaire version of the consultation that you can fill out.
It is a commonplace among racists of the `I'm not a racist, but...' variety to say things like, ``The police correctly say more black people commit crime so more are stopped.'' This is a misconception which needs to be squashed.
(Usually in this sort of discussion made-up attributes, for instance `tall' and `short', are used, rather than talking about actual ethnicities or whatever; presumably this is to avoid offending the easily-offended. Since the motto of this web log may as well be `offending the easily-offended since 2002', I won't bother.)
Suppose that there are w white people and b black people in some population. Suppose further that every so often a crime is committed, and that the victim is able accurately to report the ethnicity of the perpetrator (assume that these are muggings, or `hot' burglaries, or whatever). W crimes are recorded in which the perpetrator is white, and B in which the perpetrator is black.
Imagine that B > W; that is, more crimes are committed by black criminals than by white criminals. (In some areas of the country this is the case, we are told; these, presumably, are mostly areas with more black people than white people.) What does this tell us about the likelihood that an arbitrary black person -- such as might randomly be stopped by a police officer -- is a criminal, relative to the likelihood that an arbitrary white person is?
The answer, perhaps surprisingly, is: nothing at all. The reason (which should be obvious once stated) is that different criminals commit different numbers of crimes; this is the difference between `more black people commit crime' and `black people commit more crime'. These two statements are not the same.
As an extreme case, suppose that every white person commits one mugging a year, and one black person commits some large number N > w of muggings, while the rest are law-abiding. In this case, the probability of a randomly-selected white person being a criminal is 1, but the probability of a randomly-selected black person being a criminal is 1/b, a much smaller number. That is to say, in this case it is b times more likely that a random white person stopped by a police officer is a criminal than a black person is even if there are more crimes committed by black people than by white people. Without additional information about the numbers of crimes committed by individual criminals of each ethnicity, aggregate data about numbers of crimes won't tell us anything about the `propensity to crime' of white and black people.
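The extreme case takes only a few lines of arithmetic to check. The numbers below are invented purely for illustration (w, b and N are not real statistics):

```python
# Made-up population: w white people, b black people.
w, b = 900, 100

# The extreme case from the text: every white person commits one
# mugging a year, while exactly one black person commits N > w
# muggings and the rest are law-abiding.
N = 1000
white_crimes = w * 1   # W = w
black_crimes = N       # B = N, and B > W

p_white_criminal = w / w   # every white person is a criminal: 1
p_black_criminal = 1 / b   # one black criminal out of b people

# More crimes are committed by black criminals, yet a randomly
# stopped white person is b times likelier to be a criminal.
print(black_crimes > white_crimes)           # True
print(p_white_criminal / p_black_criminal)   # the factor b
```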
In reality, the situation is not as extreme as in the example. But now we encounter a separate problem. Suppose that black and white people commit crimes at the same rate, so that the probabilities that an individual black or white person is a criminal are equal. In that case (assuming also that criminals of each ethnicity commit similar numbers of crimes each), the number of crimes committed by people of each ethnicity will be in proportion to the numbers of each ethnicity in the general population.
Imagine that the police attempt to control crime by stopping and searching people at random (perhaps they look for stolen goods or something). Suppose further that the police are racist and stop ten times as many black people as white people. In this case, even if black and white people are equally likely to be criminals, the police will still find ten times as many black criminals as white, because everyone they stop has an equal chance of being a criminal, and they are stopping ten times as many black as white people. And, worse, their tactic is self-reinforcing, since an ill-informed police officer (or politician) might infer from the statistics that -- because so many black people are being arrested -- even more black people should be stopped and searched. But of course this is an incredibly inefficient (as well as unfair) way for the police to try to cut down on crime, since they are letting the (usually much larger) white segment of the population get away with much shallower scrutiny. And it might lead a lazy observer to conclude that black people are more likely to commit crime, while in fact what they're seeing is the effect of racist policing.
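A quick simulation makes the self-reinforcing effect concrete. Everything here is invented (the population sizes, the 5% criminality rate, the ten-to-one stop ratio); the point is only that equal criminality plus unequal stops yields unequal arrest counts:

```python
import random

random.seed(0)

# Made-up population, with the same criminality rate in both groups.
p_criminal = 0.05
white = [random.random() < p_criminal for _ in range(9000)]
black = [random.random() < p_criminal for _ in range(1000)]

# Racist policing: ten times as many stops of black people as white,
# each stop picking a person from that group at random.
found_white = sum(random.choices(white, k=500))
found_black = sum(random.choices(black, k=5000))

# Roughly ten times as many black 'criminals' are found, purely
# because ten times as many black people were stopped.
print(found_white, found_black)
```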
Now, all of this is just a model. The police don't investigate crime solely by stopping people at random, and there are lots of other relevant factors in criminality, some of which are correlated with race (for instance, poorer people are more likely to be criminals -- at least of the Bill the Burglar variety, if not the Martha Stewart kind -- and, sad to say, history has left black people on average poorer than white). As with so many things, the real situation is very complicated. But that's all the more reason not to make generalisations like the one which started me off on this rant.
Writing is abundantly difficult without artificial constraints. Adding arbitrary limits to what you may scrawl -- such as to abandon that most common, fifth, symbol of our traditional Latin orthography, or lay down only so many words within pairs of stops -- without doing so ghastly an injury to grammar or signification as to crowd out all worth is in my opinion too hard to do. You may not concur; if not, visit this for compositions surpassing this slight try. You can add your own, too.
A thought (which occurred to me only recently, but presumably isn't original -- not that that matters): why are political commentators, especially on the right, so keen to describe efforts to prevent terrorist attacks as a `war on terror'?
Ignore for the moment the fact that you can't make war on an abstract noun (or an emotion); obviously the term is intended to be parsed as `war against terrorists'. But `war' is a funny term to use here, because a war has two sides. If we're in a war with the terrorists, then actions we take against them are `acts of war' -- but so are the actions that they take against us. That's the difference between fighting a war and prosecuting crime. Describing anti-terrorism measures as `war' legitimises the terrorist acts we are trying to prevent.
So why are right-wingers -- always keen to tell us that they are more opposed to terrorism than others -- so keen to regard our current circumstances as a `war'? And as a what-if question, suppose that during the late Iraq unpleasantness, Iraqi forces had tried to kill George W. Bush. Do we think that the United States would have described this as a legitimate act of war, like their attempts to kill Saddam Hussein?
Yesterday I went to a talk, rather improbably hosted at Microsoft Research, on the subject of `TCPA-Enabled Open Source Platforms' by Demetrios Lambrou from Royal Holloway. (At this stage I should, as usual, apologise to those of my half-dozen readers who are non-technical. This article is pretty long. That said, I -- egotistically -- encourage you to plough your way through it; there's some stuff of general interest in here.)
For those who don't follow IT industry newspeak, `platform' means `software that isn't finished yet', `open source' means `Free software', `enabled' means `compromised', and TCPA, the `Trusted Computing Platform Architecture', is an industry initiative to make PCs less useful for consumers and more useful to certain companies -- in particular, media publishers like the music industry. (Note the slightly peculiar security engineering definition of the word `trust', which is that a thing is `trusted' if, by misbehaving, it could really screw up your life. In these terms, you probably `trust' your bank, car, doctor and significant other, but that doesn't necessarily mean that you'd tell them your deepest secrets. Bear this in mind the next time you're invited to place trust in some inanimate object.)
The basic gag with TCPA (and the Microsoft technology, which is called NGSCB, for `Next Generation Secure Computing Base', and which will presumably win out in the marketplace, partly because Microsoft is a big evil monopoly and partly because the IT industry is likely unable to resist the allure of a five-letter acronym) is that each new computer is equipped with a little sealed cryptographic gadget. A program running on the computer can ask this gadget (which in TCPA is called the `Trusted Platform Module', which is particularly confusing, since the acronym `TPM' already means something completely different in this field) to produce a signed statement about the `identity' -- meaning a secure checksum, a large number which uniquely identifies a particular thing -- of the program.
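In software terms, the `identity' is just a cryptographic hash of the program. A minimal sketch (the real TCPA measurement process involves the BIOS and a chain of hashes stored in the TPM's registers; this only illustrates the checksum idea, and `measure' is a made-up name):

```python
import hashlib

def measure(path):
    # A program's 'identity': a cryptographic checksum of its bytes.
    # Any change to the file, however trivial, changes the identity.
    with open(path, "rb") as f:
        return hashlib.sha1(f.read()).hexdigest()
```

Two byte-for-byte identical binaries get the same identity; flip one bit and the identity changes completely.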
This process is called `remote attestation', and it is designed to allow people to build services which can only be used with the software intended by their designers. Most web services aren't like that, of course -- you can use any web browser you want, because they rely on standard protocols that anyone can implement. But if you want to restrict the use people make of your service, that openness is bad news; this kind of restriction is what remote attestation is supposed to let you implement.
As an example, an online music shop might use remote attestation to make sure that you only download and play music from its servers using their own music player, which could make sure you listen to adverts in between songs, prevent you from playing the songs you've bought for more than a month after you've purchased them unless you pay further protection money, and stop you from recording them onto a CD. (It could also stop you from doing Bad and Wrong things like transferring them onto your iPod or -- horror of horrors -- giving copies to your friends. If you think the other examples are ridiculous, well, witness the DVDs which force you to watch the copyright warning and trailers in sequence, without fast-forwarding, before they allow you to watch the film.) For a slightly less offensive example, your bank might require you to use its own special client to access your bank account -- rather than your normal web browser -- as a condition of using their service; ostensibly this might make the service more secure against fraud, but it's hard to see why your bank should be making choices about what software you run on your own PC.
Now, this talk was about implementing an interface to the TCPA hardware in Linux. It's been known for quite a while that this is (a) feasible, and (b) more-or-less a complete waste of time. So now we know that (a) is true for sure; as for (b), well, Lambrou's implementation allowed you to define a big list of programs which would be allowed to run (by specifying a list of their checksums); the system would use the TCPA hardware to prevent any program from running unless it's on the approved list. You can sort-of see the value of this; for instance, it might stop certain classes of security attacks on your Linux computer. But observe that to be useful, this requires you to maintain an enormous list of every piece of software which you regard as `safe to run' (you can't do it the other way round, obviously, because it would be easy to make a trivial change to a blacklisted program which gave it a different checksum without changing what it did). This has a particularly bad consequence for users of Free software, who expect to be able to modify the programs they use -- after all, that's the whole point of Free software. (As the GPL says, it's about freedom, not price. In the future, expect proprietary software like Microsoft Windows to fall in price to its marginal cost -- which is zero, just the same as Free software -- in order to compete in the marketplace; but you still won't easily be able to alter it, because Microsoft won't give you the source and wouldn't allow you to share your modified versions with others anyway. They can't, of course, stop you from modifying the programs you've bought, though -- that's a right guaranteed by the Software Directive, whatever their `licence agreement' might say.)
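The approved-list scheme amounts to something like the following sketch (the names are invented, and the real implementation hooks into the kernel's program loader rather than a Python function):

```python
import hashlib

# The enormous whitelist the text complains about: one checksum per
# approved program.  The single entry here is the SHA-1 of the toy
# 'program' b"a\n".
APPROVED = {
    "3f786850e387550fdab836ed7e6dc881de23001b",
}

def may_run(program_bytes):
    # Allow execution only if the program's checksum is on the list.
    # A blacklist wouldn't work: a one-byte change to a banned
    # program gives it a fresh checksum.
    return hashlib.sha1(program_bytes).hexdigest() in APPROVED
```

Note that any modification to an approved program, however harmless, also knocks it off the list -- which is exactly the problem for Free software users.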
This means that `remote attestation' is pretty useless for Free software; and I remain convinced that it will be close-to-useless for the other kind, too. Imagine that music-playing program again. As I said, it's designed to prove to the music shop that it's a kosher version of the player before it will play any music. So to start with the record company has to have a database of the checksums of all the versions of the player program it's released. But now observe that the nasty vicious user might have installed some other software which is designed to capture the output of the program and save it in a file. Clearly we can't put up with that, so the software will have to ask the TPM to verify that the sound driver is kosher too. So now the record company needs a list of the checksums of every sound card driver in the world. Except that's quite a lot of work, so they'll probably only pick the five most popular ones, leaving users of less-common hardware high and dry.
But now they realise that the user could modify the operating system itself to intercept data coming out of the music player. So now the record company needs a list of the checksums of every version of the Microsoft Windows kernel which they regard as kosher. But a device driver -- unrelated to the sound card -- can modify the kernel after it's running, so now they need a list of all the other device drivers they regard as OK. The same goes for more fundamental parts of your PC, such as the BIOS itself (which TCPA is also designed to check up on).
By this stage the record company finds itself in the position of vetting every Windows program in the world to check that it's happy with releasing its music to be played on a system with that program running, which is obviously going to be pretty expensive and hard to keep up with; or defining a short, restrictive list of software it regards as safe, which will probably be OK if you only run common programs on your PC, but infuriating otherwise; or giving up on the whole idea as basically a waste of time. Even the middle option has a serious technical problem, which is that the record company still has to keep up with new versions of any part of the system which might be upgraded, for instance if a security hole is discovered in them. And if they're lax about doing that, users who upgrade as soon as bugs are fixed will find that none of their music will play any more, which is hardly conducive to good security.
There are ways to ameliorate some of these problems -- one of them is, rather than relying on checksums of programs, to accept any program which is cryptographically signed by a `trusted' (there's that word again) manufacturer -- but at some point you have to face up to the problem that someone will have to maintain this big database of stuff. And as soon as a single program which can be used to subvert the `security' of the system creeps into the database -- that is, the program becomes `trusted' -- the whole thing is blown apart in an embarrassing, expensive mess. That is to say, any security which is based on `remote attestation' and a catalogue of permitted software will be extremely brittle.
After all, they don't really care about portability to non-Windows platforms, since their entire strategy is based on forcing users to buy Windows if they want to make use of Microsoft software (they even claim -- ludicrously -- in their `licence agreement' that you can't run their software on another operating system); and they don't really need hardware independence, since Windows only runs on one kind of hardware anyway.
(Note that this discounts Windows CE and possible ports of Windows to new processors like ia64. Note also that Windows NT ports to all the non-x86 processors were swiftly abandoned, with only the Alpha port lingering long enough to get rotten and smelly. This is a big hole in my argument, but ignore it for now.)
One of the things that a VM (virtual machine) lets you do is to impose a security model on programs which run inside it. For instance, the strangely popular Java VM has a security model that is supposed to prevent `applets' which run inside a web browser from doing things which they shouldn't (like deleting all your files or sending your bank details to Nigeria). It turns out that the Java VM's security model had all sorts of problems, but with enough work this kind of thing could be fleshed out and made to work properly.
Once you have a VM and a security model inside it, remote attestation suddenly becomes a workable tool. If the virtual machine really can impose effective restrictions on the software which runs inside it -- for instance, allowing a music player to write only to an encrypted data store, and never reveal the key; an online banking program to be protected from all the other programs on the system; or a file viewer or web browser to write only to a temporary filesystem for downloaded files -- then it's enough for a remote service to be able to prove that the virtual machine is a particular, trusted, version using `remote attestation'. Once that's done, the service provider can rely on the VM itself to do its dirty work. No huge catalogue of kosher software is required, but rather a much smaller catalogue of approved versions of the virtual machine. Even better, there's only one manufacturer of the virtual machine -- Microsoft -- and, just to be helpful, they can control the database too. Isn't that convenient?
Chris Brooke, Matthew Turner and others have put forward the theory that the most rational explanation for the defeat of Jose Maria Aznar in Sunday's Spanish elections was the increase in turnout occasioned by last Thursday's terrorist attacks in Madrid. The idea goes that, in Chris's words,
... when turnout rates rise in the context of a general democratic mobilisation, Left parties are more likely to benefit, given that it's the poor, the unemployed, the working class, the less well educated and so on who are, other things being equal, those who are less likely to cast a ballot[.]
This effect seems to work pretty well for British elections, or, at least, British opinion polls. The raw data from this ICM voting intention poll can be processed to give a plot of turnout against results like this:
by considering the various levels of turnout to be defined by the sum of the first n categories of likelihood to vote. (See also the discussion on this topic from Matthew's web log.)
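The processing described above is simple to reproduce. The figures below are invented for illustration (they are not the ICM numbers): each band is a likelihood-to-vote category, and `turnout level n' means counting only the first n bands.

```python
# Invented stratified poll: (Labour, Conservative) respondents in each
# likelihood-to-vote band, from 10/10 'certain to vote' downwards.
bands = [
    (300, 340),   # 10/10
    (120, 100),   # 9/10
    (110,  80),   # 8/10
    (100,  60),   # 7/10 and below
]

def shares_at_turnout(n):
    # Vote shares counting only respondents in the first n bands.
    lab = sum(l for l, c in bands[:n])
    con = sum(c for l, c in bands[:n])
    return lab / (lab + con), con / (lab + con)

# As turnout rises (more bands included), Labour's share rises too --
# with these invented numbers, at least.
for n in range(1, len(bands) + 1):
    print(n, shares_at_turnout(n))
```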
Now, I haven't been able to find similar polling data for Spain (if you have some, I'd love to see it). But I have had a go at extracting some of the same voting intention data from this archive of Spanish opinion poll results maintained by the government of Valencia. Rather than analysing voting intention from a single poll stratified by confessed likelihood to vote, I've tried to analyse a bunch of polls by assuming that the fraction of people who say they will not vote is a good proxy for turnout in the election. Frankly this doesn't work very well, but it's the data we've got. Attempting the equivalent of the above plot, based on polling data from 2000 onwards, gives the following (the lines are best-fit regression lines):
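For the regression lines, ordinary least squares is enough. The poll figures below are invented placeholders (not the Valencian data); what the plot turns on is the sign of the fitted slope for each party.

```python
# Invented polls: (fraction saying they will not vote, PSOE share %).
polls = [(0.30, 35.1), (0.28, 34.2), (0.25, 36.0), (0.22, 34.9), (0.20, 35.6)]

def best_fit(points):
    # Ordinary least-squares slope and intercept.
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    slope = (sum((x - mx) * (y - my) for x, y in points)
             / sum((x - mx) ** 2 for x, _ in points))
    return slope, my - slope * mx

slope, intercept = best_fit(polls)
print(slope, intercept)
```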
(As an aside, I've used blue, like the Tories, for the Popular Party -- PP -- and red, like Labour, for the Socialist Party -- PSOE. I don't actually know what colours they use in Spanish politics, and evidently am too lazy to find out. I always find electoral maps of the United States confusing, since it would never occur to me that Republicans are red and Democrats blue. Such are the perils of multiculturalism, or something.)
This is a surprise. As I say, without polling data which are broken down by likelihood to vote, or a less noisy time series, we can't really confirm this. But the increased-turnout-leads-to-PSOE-victory theory isn't supported by the evidence so far.
Update: Matthew Turner links to this UPI story, which states that an opinion poll on Wednesday 10th March, the day before the bombing, showed the PSOE in the lead by `less than a 2 percent margin' -- but in the lead nevertheless. So the `surprise result' may not have been a surprise, or even caused by terrorism.
Among reports of the murder of 201 commuters in Spain last week and of the results of yesterday's elections, one technical detail caught my eye: the revelation that, like in Bali and various ETA bombings, the bombs used in the killings were triggered by mobile phones. (As an aside, there has been much hasty commentary on how the attacks `caused' the fall of Jose Maria Aznar's government, and how this represents a craven appeasement of the terrorists, rather than Spaniards' legitimate exercise of their democratic rights. I think Christopher Sheil summed it up best, in a comment on Crooked Timber: ``If I'm reading the warbloggers right, their conclusion seems clear: we need to elect a new Spanish people.'' Enough said.)
This detail forms an interesting story about the unintended consequences of new technologies. Mobile phones are small, readily available, reliable, and probably easy to adapt into bomb triggers. (At a guess, I'd imagine that the terrorists would have wired the bomb detonators into the ringer or vibrating alert thingy on the phones, and then phoned or texted them at the time they wanted the bombs to explode. An alternative theory is that the phones' alarm clock features were used instead.)
Historically, many terrorist groups seem to have struggled with the technical challenges of bomb-making, or, more specifically, bomb triggering. Some readers will remember a 1996 IRA bus bombing, supposedly an accident which occurred when the perpetrator -- whose time-bomb was controlled by components scavenged from a video recorder -- misunderstood the 24-hour clock and blew himself up on the way to a target he intended to destroy the next morning, after he had made his escape. The `shoe bomber' Richard Reid showed even less technical prowess, being apprehended by his intended victims while he struggled to light a match and ignite the fuse of his diabolical footwear. One IRA `bomb factory' was -- apparently -- destroyed after the security services installed an induction loop in the ceiling; the bomb-makers, too ill-informed to twist the wires connected to their detonators, died for their mistake. Natural selection is, of course, a phenomenon of greater importance in bomb-making than other professions, and it seems that the IRA at least learned from these mistakes, though that didn't stop them from claiming that their Remembrance Day bomb at Enniskillen was triggered accidentally by a British Army radio jammer.
Mobile phones and the GSM network change all this. Even an inexperienced bomb-maker ought to be able to wire together phone and detonator; phones are reliable and the cautious terrorist can either use the thing's alarm clock to trigger the bomb, thereby avoiding anxiety about unintentionally lethal telemarketing calls, or make sure that only a call from his own phone number will cause the bomb to ring....
And there aren't any easy technical measures which can usefully be taken against this new application for GSM. With a couple of exceptions where phones aren't allowed or don't work -- notably on aeroplanes and in underground railways -- wherever people gather, presenting an attractive target for the political murderer, they also expect to be able to make and receive phone calls. From the perspective of the GSM network, a phone connected to a bomb is no different from a phone in someone's pocket.
While evidence from phones used in this way may help investigators a little in their detective work -- for instance, they can use network records to search for a phone which was in the location where a bomb exploded, received a call at the time of the explosion, and then went off-air, and then search for the caller -- that's not likely to help enormously, especially since the terrorist could trigger the bomb from anywhere in the world and take steps to conceal their identity when they did.
This use of mobile phones is likely to distress people a little, I suppose; the Mirror led with the story Massacred by Mobiles (at least it wasn't a story about GSM masts causing cancer), but this is the sort of thing we will, sadly, have to get used to as reliable wireless networking becomes pervasive. Nobody is going to claim that this threat outweighs the benefits of phones, but we'd better adjust to the fact that some of their uses are undesirable -- and inevitable. Probably the same will be true of other innovations, too.
Continuing on the general theme of nuclear doom, I happened to come across in my notebook somebody's (I can't remember whose, I'm afraid) suggestion to draw a diagram of the movements of the Doomsday Clock displayed on the masthead of the Bulletin of the Atomic Scientists to remind readers that the world might be blown up at any moment. (The magazine itself is well worth reading. It's not all on the web, sadly, but a good selection of the articles is.)
At midnight, the keys are turned, the missiles are loosed and everyone dies. Before midnight, we're safe, more or less; but the later the hour, the closer lies incineration. The clock started off at seven minutes to midnight, and has never shown a time earlier than seventeen minutes to. This should probably tell you that the Atomic Scientists themselves aren't particularly sanguine about our ability as a species to resist blowing ourselves to pieces now that we're in a position to do so. It stood at two minutes to midnight during -- in their estimation -- the worst part of the Cold War, from 1953 -- after the first Soviet hydrogen bomb test -- until 1960, when things began to thaw a little. (There's a potted history of the clock from the 1995 issue of the Bulletin which summarises better than I do. The only other thing to note is that the clock isn't intended to reflect fast-moving events like the 1962 Cuban missile crisis or the 1983 `Able Archer' near-cock-up, but rather the general condition of international relations.)
(Oh, and I suppose it tells you that, like other web loggers, I'll occasionally post stuff which is more than usually content-free. As an aside, I don't think Tufte would approve of the shading on the above graph -- it lowers the data-ink ratio -- which is intended to suggest the sense of the vertical axis. Darker is worse, a common if ethnically inappropriate convention. Graphic design suggestions gratefully received....)
Many of you will have seen this set of photographs of the region surrounding Chernobyl in Ukraine, some from archives and others taken by an unnamed Ukrainian woman who motorcycles through the area. (Since the combination of an attractive woman, a powerful motorcycle and an apocalyptic landscape constitutes some kind of pathetic geek fantasy, the original site was swiftly slashdotted; the above link is to a mirror I've made.)
Some time ago I pondered the question of `nuclear terrorism', and particularly the risk of an attack on a nuclear power station using a hijacked aeroplane. It remains unclear to me whether this is a realistic attack, but I think there's some chance that the containment structure of a reactor could be severely enough damaged by a crashing aeroplane to vent the reactor to the atmosphere. (The Chernobyl accident occurred when an explosion inside the reactor vented its contents to the atmosphere.)
What would such an accident mean here? The answer -- as if you didn't know it -- is `a bloody big mess'. Here's a map of the southeastern UK, with the exclusion zones around Chernobyl overlaid on the area around the Sizewell power station:
(This is an approximate and pessimistic scenario. Obviously the fallout might be blown east, out to sea, rather than onto the land. After the Chernobyl accident, the wind changed while the fallout plume was still in the air, and the fallout settled both to the northeast and southwest of the reactor. In the above map, I've assumed an offshore wind which does not change direction while the fallout settles. I picked Sizewell because it's local. The data come from this site on Chernobyl, but unfortunately I've had to composite the above map manually because that site doesn't provide raw data on radiological contamination. The exclusion and evacuation zones are related to particular levels of contamination, chiefly with caesium-137, which is the major contaminant over a scale of a few years -- its halflife is about thirty years. The immediate radiological hazard would have been iodine-131, with a one-week halflife. The underlying map image is produced from the Ordnance Survey Get-a-map service. Image reproduced with kind permission of Ordnance Survey and Ordnance Survey of Northern Ireland.)
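The half-lives quoted above translate into contamination timescales via simple exponential decay (the function name is mine, invented for illustration):

```python
def fraction_remaining(t_years, half_life_years):
    # Exponential decay: the fraction of a radioisotope left after t years.
    return 0.5 ** (t_years / half_life_years)

# Caesium-137 (half-life about thirty years): still about 79% left
# after a decade, which is why the exclusion zones persist for years.
print(fraction_remaining(10, 30))

# Iodine-131 (half-life about a week): an intense hazard at first,
# but effectively gone within a few months.
print(fraction_remaining(0.25, 7 / 365.25))
```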
As I said above, I'm not sure whether a nuclear power station containment dome could be seriously damaged by an airliner crash. In 1988, Sandia National Laboratories in the US performed a test to address this question, by taking the fuselage of an old F4 fighter aeroplane and driving it into a concrete slab at 480 mph using a bunch of rockets. After the September 11th 2001 terrorist attacks, they published footage from the test on their website. The results (described in more detail in this FAQ response) suggested that the body of the aeroplane does very little damage, penetrating the slab to a depth of about 2cm, while the engines -- which have very heavy axles, by comparison to the rest of the aeroplane -- go a bit further, to about 6cm. Commercial aeroplanes are much heavier and have larger engines, of course. In my view the risk can't be ruled out, and effective countermeasures -- siting anti-aircraft missiles at nuclear sites -- are sufficiently cheap to be worthwhile in any case.
Recently I bought a wire to connect my mobile phone to my computer. While the manufacturer of the phone wanted me to pay about £50 for the wire, some chap on eBay was selling the same thing for (as it turned out) about £10, resulting in a small but splendid victory for capitalism. However, my purchase created a small moral dilemma. I turn to you, my half-dozen readers, for suggestions as to how I should respond. (This will also serve as a test of my flakey new comments system.)
The wire was advertised as coming with a CD of software which allowed you to transfer data to and from your phone, and also contained some free ring tones and god knows what other junk. Since I don't use Microsoft Windows (can't you just feel the moral righteousness...) I couldn't have cared less what was on the CD, but when the seller posted me my wire and CD, I was slightly surprised to discover that it was not, as implied (but not explicitly stated), an official CD issued by the phone's manufacturer, but instead a (boo!) copy of (presumably) the same on a recordable CD. Now, at this point I should, as always, point out that copyright infringement is Bad and Wrong and very definitely Not To Be Encouraged. Anyway....
On eBay, you're encouraged to leave comments and ratings about buyers and sellers, which are intended to allow traders to judge whether they can trust a particular seller (or customer). Sellers typically encourage their customers to leave positive feedback, for obvious reasons, and if you've had good service it's polite to do so. (I often forget, but that's only tangentially relevant.)
So what do I do now? The seller posted me my wire promptly, and it works fine. I couldn't give a toss about the CD, because it's of no use to me except as a drinks coaster (and, as anyone who's visited my home will know, my furniture generally already has a protective covering of abandoned paperwork, so coasters aren't a lot of use to me either). But it's the sort of thing that other buyers might be worried about, and technically the seller is doing something illegal (though distributing infringing copies of the software -- since it is only of use to people who already have a particular type of mobile phone -- won't have any economic impact whatever on the manufacturer). Should I,
(This is probably of interest to Cambridge people only.) I've written a little program that generates an RSS feed of the films showing at the Arts Picture House, Cambridge's only halfway decent public cinema. If you use an RSS reader, you might find this useful.
I've ranted about the deficiencies of RSS as a format before, so I won't repeat my complaints now. But I'm not really sure what the most useful way to present this information in RSS is. At the moment the feed shows information about any films which are showing in the next 24 hours and for which tickets are available, which means that the RSS feed is the answer to the question `what can I go and see at the cinema this evening?' or something like it. Each film is represented by a single `item' which is marked up with a date corresponding to when the film is first shown (that is, all the items are in the future -- if this breaks common RSS readers, I'd love to know, since I wrote my own and am too lazy to test anyone else's), so that the films appear in the right order.
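For the curious, the shape of the feed is roughly this (a hypothetical reconstruction in Python, not the real artspicturehouserss code, which does rather more scraping):

```python
# Hypothetical sketch of the feed's shape: one <item> per film, with
# <pubDate> set to the film's next showing time, so that readers which
# sort by date put the films into showing order.
from datetime import datetime, timezone
from email.utils import format_datetime

def film_item(title, url, next_showing):
    """Build one RSS <item> whose pubDate is the film's next showing."""
    return ("<item><title>%s</title><link>%s</link>"
            "<pubDate>%s</pubDate></item>"
            % (title, url, format_datetime(next_showing)))
```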
Anyway, I hope that this finds some use. Comments (especially simple suggestions for how to make the thing more useful) welcome (and thanks to Francis for already finding a bug in my comments program). You can also download the current version of the scraping program, artspicturehouserss, if you want to modify or improve it yourself.
So, as you may have noticed, Downing Street Says has been getting some excellent publicity, quite a lot of traffic, and some interesting discussion. (The rest of this post is introspective and self-congratulatory, so can probably be ignored. However, it does contain graphs -- yay! -- and even some actual maths, so it's not 100% content-free.)
-- this counts page views, rather than visitors (fewer) or HTTP transactions (more). The yellow line shows the request rate in one-minute buckets (that is, the number of matching log lines in that minute), and the thicker red line is a symmetric moving average of the raw data. I've marked the time at which we first saw hits from the Register's story (the story itself was posted a bit earlier) and the time at which the BBC News story appeared. Obviously each of these stories drew visitors to the site. The pattern of traffic after each story was posted rather resembles that which I got after my Political Survey was featured in NTK:
-- and this seems to be characteristic of what happens when a link to some other page is first featured on a popular website. (Note different scales on the two plots above, and also that I've adjusted the Political Survey figures relative to when I last posted them to account for the difference between page views and total traffic, which is what I plotted before.)
Suppose that N people visit some website every D seconds. At some point, that website posts a link to a second website. The first time any visitor to the first website sees this link, they follow it. Assume also that, while the N visitors all check the first website regularly, they do so at different times, uniformly distributed over any period of length D. The second website will then see, in the time after the link is posted to the first site, a stream of visitors arriving at a fixed rate R = N/D for a period D:
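In code, the single-interval case is trivial (my own notation, following the text):

```python
# A sketch of the simplest case: N visitors each check the first site
# every D seconds, with their check times spread uniformly, so the
# second site sees a constant rate N/D for D seconds after the link is
# posted, and nothing thereafter.
def top_hat_rate(t, N, D):
    """Arrival rate at the second site, t seconds after the link appears."""
    if 0 <= t < D:
        return N / D
    return 0.0
```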
Obviously real websites don't have N visitors who all check the site every D seconds. So imagine a distribution p(D) of visitors, so that we have N0 in total with N0 p(D) dD checking the site at intervals between D and D + dD:
Now the traffic seen by the second site will be a sum of lots of `top hat' functions, one for each possible check interval. I've written out the maths separately, but basically you get something like this:
(that particular example is for a flat distribution of check intervals in a < D < b). The thing to note about this function is that it leaps from zero to its maximum at the time the link is posted, and declines after that.
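Written out for the flat distribution, the summed rate is just a logarithm (a sketch in Python; in this version the rate sits at its maximum until t = a and declines thereafter):

```python
import math

def flat_dist_rate(t, N0, a, b):
    """Arrival rate t seconds after posting: the sum (integral) of top-hat
    contributions over a flat distribution of check intervals a < D < b.
    Only visitors with D > t have yet to make their first check, and each
    group of width dD contributes N0/(b-a) * dD/D, giving a logarithm."""
    if t >= b:
        return 0.0
    return N0 / (b - a) * math.log(b / max(t, a))
```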
On the face of it, this is a bit surprising, though given the model you can think of it like this: in the first minute after the link is posted, all the people who check the first site every minute follow it, as do half of the people who check it every two minutes, one third of the people who check it every three minutes, and so forth. In the second minute, half the people who check every two minutes follow it, as do one third of the people who check every three minutes, one quarter of those who check it every four minutes, and so forth. Of course, there aren't that many people who check even the most popular websites every minute; in fact the spike will be dominated by the centre of the distribution.
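The minute-by-minute bookkeeping above can be written as a one-liner (discrete check intervals, my own toy notation):

```python
def arrivals_in_minute(m, counts):
    """First click-throughs during minute m (1-based), where counts[k] is
    the number of people who check the site every k minutes: in each of
    the first k minutes, 1/k of that group arrives, so only groups with
    k >= m still contribute anything in minute m."""
    return sum(n / k for k, n in counts.items() if k >= m)
```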
Where visitors check a website at normally-distributed intervals, we get spikes like this (this has visitors checking the website at a mean interval of one hour, and a standard deviation of ten minutes; note that a normal distribution can only be an approximation here, because the check interval must be nonnegative, which is not true of a normally-distributed variable):
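A quick Monte-Carlo version of the normally-distributed case looks like this (a sketch; the truncation to positive intervals is the approximation mentioned above):

```python
import random

def simulate_first_visits(n_visitors, mean, sd, seed=0):
    """Monte-Carlo sketch: each visitor checks at an interval D drawn
    from N(mean, sd) -- redrawn if nonpositive, since a check interval
    can't be negative -- with phase uniform in [0, D); returns each
    visitor's delay before first seeing the link."""
    rng = random.Random(seed)
    delays = []
    for _ in range(n_visitors):
        D = rng.gauss(mean, sd)
        while D <= 0:
            D = rng.gauss(mean, sd)
        delays.append(rng.uniform(0, D))
    return delays
```

Bucketing the resulting delays into minutes gives the sharply-rising, slowly-falling spike in the plot.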
Note that in the traffic plots, I take a ten-minute moving average of the data, since traffic data are very noisy, and it's quite hard to pick out any meaningful trends without doing so. So the instant ramp-up of the spike is spread out over about ±5 minutes either side of the time the story is posted.
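The smoothing itself is nothing fancy -- a symmetric moving average like this (a sketch; the window is measured in one-minute buckets):

```python
def moving_average(xs, window=10):
    """Symmetric moving average: each point becomes the mean of the points
    within window//2 either side (truncated at the edges). Note how a
    one-bucket spike gets spread over +/- window//2 buckets."""
    half = window // 2
    out = []
    for i in range(len(xs)):
        chunk = xs[max(0, i - half):i + half + 1]
        out.append(sum(chunk) / len(chunk))
    return out
```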
The posting of a link from the BBC site to Downing Street Says certainly resulted in a sharply-rising spike of traffic which fell off after a little while (note that the model only handles the first page view; obviously we expect to get a chain of hits from users who explore more than the front page of the site, but we don't model that). The Register's story resulted in a much less sharp rise, which the theory does not explain. More on that in a moment.
We can fit the model to the data; I've done this in a fairly Mickey-Mouse way, because basically this isn't going to tell us anything very exciting, except that the theory is kind-of-reasonable. The parameters extracted from the fit won't be well-constrained at all, and anyway this is a rotten way to obtain them. Anyway:
-- that is, we get a good fit (by eye, not properly) by assuming that visitors to the BBC website check it on average every 21 minutes, with a standard deviation of about 5 minutes. (Obviously this is restricted to those visitors who then went on to click through to Downing Street Says.)
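The `Mickey-Mouse' fit is nothing more sophisticated than a grid search, along these lines (a sketch; the model function and grids are placeholders, not the real fitting code):

```python
def crude_fit(observed, model, mean_grid, sd_grid):
    """Grid search for the (mean, sd) pair minimising the sum of squared
    residuals between the observed per-minute counts and the predictions
    of model(mean, sd). Crude, and the parameters come out
    poorly-constrained, but enough to see whether the shape is right."""
    best_err, best_params = None, None
    for m in mean_grid:
        for s in sd_grid:
            pred = model(m, s)
            err = sum((o - p) ** 2 for o, p in zip(observed, pred))
            if best_err is None or err < best_err:
                best_err, best_params = err, (m, s)
    return best_params
```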
-- the reason that this is surprising is that (at a guess) most NTK readers get the thing as an emailed newsletter, rather than by periodically checking the NTK website (this may be wrong, of course). But the explanation here is presumably that most users receive email by polling (using a protocol like POP3) rather than having it delivered directly to their machine (using SMTP or modern IMAP implementations). Under this assumption, the model looks exactly the same; what we're measuring here is the regularity with which NTK readers' email clients poll for mail. Again, the 32±5 minutes figure isn't to be taken very seriously (and I'm surprised that it's so long).
So, the one bit of this I don't understand is why the load spike from the Register story ramped up more slowly than that from the BBC story (or from when the Political Survey was featured in NTK). One possibility is that the Register initially linked to the about page, rather than the front page. If we assume that most of the people who clicked through from the Register story read the about page, then clicked through to the front page -- and that's a pretty simplistic assumption -- then we would expect to get two spikes. Once the moving average is applied, we might get what looks like a shallower slope. The about page is about 2,000 words long, and the average reader reads about 250 words per minute, so would take eight minutes to read it. This gives something like this: (this is for a check interval of 60±10 minutes, with two page views separated by eight minutes, with 80% of the visitors to the first page clicking through to the second)
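Composing the two spikes is straightforward (a sketch; the delay and click-through fraction are the guesses above):

```python
def two_spike_rate(t, single_rate, delay=480, frac=0.8):
    """Two-page sketch: visitors hit the about page at single_rate(t),
    and a fraction frac of them hit the front page delay seconds later
    (the eight-minute reading time). Both parameters are guesses."""
    return single_rate(t) + frac * single_rate(t - delay)
```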
but this model now has so many free parameters that it can't be taken very seriously. It also doesn't adequately explain anything after the onset of the spike, but one could imagine extending it to a series of spikes. By that point the model would have so many free parameters as to be completely worthless.
(Appropriately enough, the signature my email client attached to this post was,
Glory may be fleeting, but obscurity is forever.
-- clearly my random number generator is telling me something I don't know....)
This is all done with wwwitter.
Copyright (c) Chris Lightfoot; available under a Creative Commons License. Comments, if any, copyright (c) contributors and available under the same license.
Hosted and supported by