The length distribution of email messages

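I wanted to know this so that I could choose a safe cut-off for a normal `small' message, one that is worth forwarding to a POP box for use over a slow connection. First, measure every message in my mail folders, in lines and in bytes, splitting on the mbox `From ' separator: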

$ cd ~/mail
$ cat * | sp '^From ' - wc -l >> /tmp/mail-lengths-lines
$ cat * | sp '^From ' - wc -c >> /tmp/mail-lengths-bytes
$ ^D
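
For anyone without sp to hand, the same measurement can be sketched in Python. This assumes (from how it is used above) that sp cuts its input at each line matching the pattern and runs the given command on each piece; the function name message_lengths is made up for this illustration, and the `From ' line is counted as part of the message it introduces.

```python
import re

FROM_LINE = re.compile(rb"^From ")  # mbox message separator


def message_lengths(lines):
    """Yield (line_count, byte_count) for each message in an mbox stream.

    `lines` is any iterable of bytes lines, e.g. sys.stdin.buffer.
    """
    n_lines = n_bytes = 0
    for line in lines:
        # A new "From " line closes the previous message, if there was one.
        if FROM_LINE.match(line) and n_lines:
            yield n_lines, n_bytes
            n_lines = n_bytes = 0
        n_lines += 1
        n_bytes += len(line)
    if n_lines:
        # Whatever is left at end of input is the last message.
        yield n_lines, n_bytes
```

Feeding it sys.stdin.buffer and printing each pair gives one line count and one byte count per message. A body line that begins with an unescaped `From ' would be taken as a separator, which is presumably also true of the pattern used above.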

Next, a simple Perl program to bin the results (this is pretty lame, but hey, it works, and I'm not exactly a statistician).

$ cat histogram
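#!/usr/bin/perl -w
# Bin one number per line from stdin into a normalised histogram.
use strict;

my $binsize = $ARGV[0] || 5;    # bin width; defaults to 5
my @bin;
my $total = 0;

while (my $line = <STDIN>) {
    chomp $line;
    ++$bin[int($line / $binsize)];
    ++$total;
}

my $integ = 0;    # running integral of the density (ends at 1)
my $mean = 0;     # mean estimated from the bin midpoints

for (my $i = 0; $i <= $#bin; ++$i) {
    $bin[$i] ||= 0;
    $bin[$i] /= $total * $binsize;    # count -> probability density
    $integ += $binsize * $bin[$i];
    $mean += ($i * $binsize + $binsize / 2) * $bin[$i] * $binsize;
    # Output columns: bin midpoint, density, cumulative probability
    print STDOUT ($i * $binsize + $binsize / 2) . " $bin[$i] $integ\n";
}

print STDERR "mean = $mean\n";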
$ ./histogram 5 < /tmp/mail-lengths-lines > hist-lines
mean = 168.171834625323
$ ./histogram 128 < /tmp/mail-lengths-bytes > hist-bytes
mean = 7001.94019933554
$ ^D
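
To make the normalisation explicit, here is a small Python sketch of the same binning. It is an illustration rather than a replacement for the script: the name density_histogram is invented here, and the input is assumed to be a list of non-negative numbers.

```python
def density_histogram(values, binsize=5):
    """Bin values and return (rows, mean).

    Each row is (bin midpoint, probability density, cumulative probability),
    matching the three output columns of the histogram script.
    """
    counts = {}
    for v in values:
        i = int(v // binsize)  # bin index, assuming v >= 0
        counts[i] = counts.get(i, 0) + 1

    total = len(values)
    rows, cumulative, mean = [], 0.0, 0.0
    for i in range(max(counts) + 1 if counts else 0):
        # Dividing by total * binsize makes the area under the curve 1.
        density = counts.get(i, 0) / (total * binsize)
        cumulative += density * binsize
        midpoint = i * binsize + binsize / 2
        # Every value in a bin is treated as sitting at the midpoint.
        mean += midpoint * density * binsize
        rows.append((midpoint, density, cumulative))
    return rows, mean
```

Because each value is moved to the middle of its bin, the mean computed this way can differ from the true mean by up to half a bin width: 2.5 lines or 64 bytes for the bin sizes used above.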

So, the mean length of an email message is about 168 lines or 7000 bytes. (Note that this includes the headers.) This is from about 16,000 messages in my mailspool. The next thing to do is to plot the results, thus employing the first rule of statistics (if you don't understand it, draw a picture):

[Figure: histogram of email lengths, by lines]

[Figure: histogram of email lengths, by bytes]

The vertical axis is probability density; the horizontal axis is length in lines or bytes.

It's not very surprising that the distribution has a peak and a big tail like this; I'm too lazy to fit it to some well-known distribution, but you get the general idea. Equally, I haven't investigated why there appears to be a double peak. This could be an artefact of the way my email delivery works, or something like that; or (unlikely) it could be real in the sense that it actually indicates something about how people compose emails.
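
Since the point of all this was to pick a cut-off, one way to get there is to read it off the third (cumulative) column of the histogram output. The following is a sketch under assumptions: the function name cutoff_for_fraction is invented here, and the fraction to aim for is a matter of taste rather than something the data dictates.

```python
def cutoff_for_fraction(hist_lines, fraction, binsize):
    """Return the upper edge of the first bin at which the cumulative
    probability reaches `fraction`, or None if it never does.

    `hist_lines` holds lines of the form "midpoint density cumulative".
    """
    for line in hist_lines:
        midpoint, _density, cumulative = (float(x) for x in line.split())
        if cumulative >= fraction:
            return midpoint + binsize / 2
    return None
```

Called with the lines of hist-lines, a fraction such as 0.9 and a bin size of 5, it returns a line count that at least that fraction of the messages fit within. Asking for exactly 1.0 may return None, because the running sum can fall fractionally short through floating-point rounding.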

The appropriate procmail recipe, which forwards a copy of each message cut down to its first 170 lines (just over the mean length), looks something like:

        :0 c
        | formail -A"X-Forwarded-For-POP: yes" \
                | grep -v '^From ' \
                | head -n 170 \
                | ~/bin/fwdmail popbox@example.com
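
What the middle of that pipeline does to a message can be sketched in Python as below. This is only an illustration of the three filter stages; the name shorten_for_pop is invented, the header is placed at the end of the header block on the assumption that this is where formail -A puts it, and the final fwdmail step is left out.

```python
from itertools import islice


def shorten_for_pop(lines, limit=170, header="X-Forwarded-For-POP: yes\n"):
    """Tag a message, drop mbox "From " lines, and keep the first `limit` lines."""

    def with_header(src):
        added = False
        for line in src:
            # The first blank line ends the headers: add the tag just before it.
            if not added and line in ("\n", "\r\n"):
                yield header
                added = True
            yield line
        if not added:
            # Message with no body: the tag goes after the last header.
            yield header

    # Equivalent of grep -v '^From ' followed by head.
    kept = (l for l in with_header(lines) if not l.startswith("From "))
    return list(islice(kept, limit))
```

Note that the limit applies after the extra header has been added and the `From ' line removed, so the forwarded copy is never longer than the limit, whatever the headers look like.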

Copyright (c) 2001 Chris Lightfoot. All rights reserved.