From: Chris Lightfoot
To: [ various ]
Subject: p2p thoughts
Date: Sun, 26 May 2002 17:41:06 +0100

I've been vaguely thinking about this email P2P stuff, mainly about
the network organisation rather than the implementation. Anyway,
herewith some thoughts....

Take a really simple model of the random graph with $N$ nodes and $k$
connections per node. We expect that $N < k^r$, where $r$ is the graph
radius (maximum minimum internode distance). Argue then that $r \sim
\log N / \log k$, and that the time to propagate a message across the
whole network is of order $r \Delta / 2$, where $\Delta$ is some kind
of average mail-check interval. (For the moment we assume that nodes
are well-connected.) For a network of $N = 10^6$ nodes with $k = 5$,
$r = 8.6$, and if $\Delta = 5$ minutes then the network-crossing time
is about 20 minutes, which is not unreasonable. (Assume that the
transmission time is small compared to $\Delta$, obviously; this is
probably true because the search messages are going to be pretty
small.)

The search algorithm is pretty simple: we send out a search request
containing some terms to the nodes known to us. They return results
they know about and forward the message to other nodes. Search
requests have a unique id and a time-to-live; nodes discard requests
which they've already seen. Requests are encapsulated in emails,
possibly lots of requests in one email. Each node finds matching
search results and returns them to the originating node. Now, typical
P2P systems like Gnutella use store-and-forward for the results, but I
think that it would be much better to just return the results direct
to the originator. There is an obvious privacy concern here, but I
don't know how serious it is.

File requests are much easier-- we identify files by {hash, length}
and use this information to form a request. File names and other
metadata (type, bitrate) are used only for indexing. We return files
in standard MIME email messages.
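The propagation scheme above can be sketched in a few lines of Python
(a toy synchronous simulation: a deterministic ring-with-chords graph
stands in for the random graph, and all the names here are mine, not
anything from the design):

```python
import math

def estimated_radius(n_nodes, k):
    # Back-of-envelope r ~ log N / log k, as in the text.
    return math.log(n_nodes) / math.log(k)

class Node:
    def __init__(self, name):
        self.name = name
        self.neighbours = []   # the ~k nodes we know about
        self.files = set()     # terms this node can answer
        self.seen = set()      # request ids already handled

    def search(self, req_id, term, ttl, results):
        # Discard duplicates (from multipathing) and expired requests.
        if req_id in self.seen or ttl <= 0:
            return
        self.seen.add(req_id)
        # In the real system a match would be emailed direct to the
        # originator; here we just collect it.
        if term in self.files:
            results.append(self.name)
        # Forward to our neighbours with the TTL decremented.
        for n in self.neighbours:
            n.search(req_id, term, ttl - 1, results)

# Node i links to nodes i+1 .. i+k (mod N).
N, k = 50, 5
nodes = [Node(i) for i in range(N)]
for i, node in enumerate(nodes):
    node.neighbours = [nodes[(i + j) % N] for j in range(1, k + 1)]

nodes[7].files.add("britney")
results = []
nodes[0].search("req-0001", "britney", ttl=10, results=results)
print(results)                                # [7]
print(round(estimated_radius(10**6, 5), 1))   # 8.6
```

The seen-set is what keeps the flood from looping; the TTL is what
keeps a request from living forever on a partitioned or huge network.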
An obvious search optimisation is to send out a message having a small
TTL which contains a list of files on our site to our neighbours. They
save this information and return it in searches. Obviously they must
still pass the search request to us so that we can pass it on to our
neighbours. So tag each search request with a list of the nodes for
whom search results have already been returned by the network, so as
to minimise duplication. Obviously there will still be some
duplication because of multipathing, but that's relatively
unimportant; the nice thing about this is that if we propagate this
information (say) 3 hops out, this should reduce the effective value
of $r$ by about 3. If you have 1,000 songs with titles of 50
characters, this is of order 50Kb of data, which will probably
compress to 20Kb, so we can send out these messages every few hours
without trouble.

Various issues:

- We reveal, pretty much at random, email addresses to other members
  of the network. A spammer could just connect their computer to the
  network and collect email addresses of users from it. I don't know
  how serious this actually is. How privacy-conscious are the low-down
  pirating scum^W^W^W P2P early adopters who will use a system of this
  type? Can't they just get disposable email addresses?

- There's no security in here at all. The Recording Industry Ass. of
  America could insert random white noise tracks masquerading as
  Britney Spears songs (or, worse, vice versa) and the system couldn't
  be protected from it. It would be possible to overlay voting on top
  of the system, but obviously there's no trust mechanism to protect
  that, either. This isn't a very serious problem in reality, since it
  is shared by all the other current P2P networks.

- An average music track is, what, 3Mb? And an episode of a television
  programme is a few hundred Mb. Most people can probably cope with
  3Mb emails, but probably not with 100Mb emails. Many providers
  impose quotas on, for instance, POP3 mailboxes.
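The size estimate for the catalogue broadcast is easy to check (a
sketch with synthetic made-up titles, so the exact compression ratio
is illustrative only; real titles are less repetitive and will
compress somewhat less well):

```python
import zlib

# Synthetic catalogue: 1,000 titles of ~50 characters each.
titles = ["Track %04d - Some Band - An Album Name (192kbps)" % i
          for i in range(1000)]
catalogue = "\n".join(titles).encode("ascii")

raw = len(catalogue)                       # ~50Kb, as in the text
packed = len(zlib.compress(catalogue, 9))  # far smaller once deflated
print(raw, packed)
```

Even at the pessimistic 20Kb figure, a broadcast every few hours is
noise next to the 3Mb tracks themselves.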
One possibility here is a hybrid system-- we could make files
available by (say) HTTP, and nodes could have a go at downloading them
that way. If that didn't work, then try the email route. This way the
system will also adapt gracefully to a crackdown by The Man or his
MTA.

- Although The Man isn't going to shut down everybody's email, he
  might reprogram sendmail (or, more likely, Microsoft Exchange) to
  recognise our command format, probably claiming that to do so is an
  anti-virus measure or some such shite. I'm not certain what can be
  done about this. Steganography is the obvious option; we could, for
  instance, send our messages as MIME segments of type image/gif and
  arrange for all the packets to begin GIF89a and proceed in a manner
  which resembles an actual GIF file. Obviously this will create a bit
  of overhead, but it will also confound The Man -- assuming that he's
  as dumb as I'm guessing he is -- and give us a bit (not much) of
  crypto-weenie street cred. It's not clear that this -- I mean the
  steganography, not the street cred -- would be tremendously
  effective, though.

  Another possibility is to allow the use of a transport other than
  email. Obvious possibilities are the various instant messenger
  applications, and IRC. A nice consequence of using IRC is that it's
  a broadcast medium anyway, which makes the request propagation
  easier.

- How happy are people going to be with a program which randomly
  sends out email on their behalf? Not sure. This is an issue to do
  with how seamlessly the thing is integrated with people's MUAs, I
  think.

In terms of Win32 implementation, there are a number of
possibilities. An attractive one would be to use the features of
Outlook Express-- probably we can cause it to filter all mail matching
the criteria into a separate folder, from which we can retrieve it and
do with it what we like. I presume that the folders can be accessed
through some COM interface.
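The GIF-masquerading idea could look something like this (a sketch
only: the choice of a GIF comment extension to carry the payload is
mine, and a careful filter would notice the absence of real image
data, so this is the dumb-Man case the text has in mind):

```python
def wrap_as_gif(payload: bytes) -> bytes:
    """Frame an opaque payload so the result starts 'GIF89a' and is
    shaped like a GIF: header, logical screen descriptor, one comment
    extension carrying the data, then the trailer."""
    out = bytearray(b"GIF89a")
    out += (1).to_bytes(2, "little")  # logical screen width  = 1
    out += (1).to_bytes(2, "little")  # logical screen height = 1
    out += b"\x00\x00\x00"            # no colour table, bg, aspect
    out += b"\x21\xfe"                # comment-extension introducer
    for i in range(0, len(payload), 255):   # 255-byte sub-blocks
        chunk = payload[i:i + 255]
        out += bytes([len(chunk)]) + chunk
    out += b"\x00"                    # sub-block terminator
    out += b"\x3b"                    # GIF trailer
    return bytes(out)

def unwrap_gif(blob: bytes) -> bytes:
    """Recover the payload from the comment extension."""
    assert blob.startswith(b"GIF89a")
    pos = 13 + 2           # skip header + descriptor + 0x21 0xFE
    payload = bytearray()
    while blob[pos] != 0:  # read length-prefixed sub-blocks
        n = blob[pos]
        payload += blob[pos + 1:pos + 1 + n]
        pos += 1 + n
    return bytes(payload)

msg = b"SEARCH id=req-0001 ttl=4 terms=britney"
blob = wrap_as_gif(msg)
print(blob[:6], unwrap_gif(blob) == msg)
```

The overhead is one length byte per 255 bytes of payload plus a
fixed ~16-byte frame, i.e. well under 1%.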
This is ugly, but has the advantage that we don't have to write any
code which interferes with the user's email setup. I don't know
anything about plugins for Outlook Express, but if suitable, that
might be a good way to implement the above.

The other possibility is proxies. A proxy for POP3 is trivial; for
IMAP2, less so. 'Course, it would be good practice for writing a
server.... (Actually, it's not completely trivial, since we have to
make it appear as if certain messages simply don't exist while
diverting them for our own Evil Purposes. So the proxy has to make a
connection, build a model of the data, and then present that to the
client.)

Sending email can be done either by injecting messages into the
Outlook Express queue, or by direct SMTP injection. The latter is
probably a better option, except for corporate weenies stuck behind
authenticating SMTP servers and/or Microsoft Exchange craziness. For a
rough draft, they can probably be ignored.

One possibly nice idea would be an implementation for Microsoft
Hotmail. I believe that at one stage Hotmail supported its own wacky
XML-based protocol -- httpmail or something -- for an MUA to retrieve
email, but I don't know if this is still supported or whether it was
ever publicly documented. The alternative is obviously to do it by
screen-scraping; enclosing packets in images might actually be a
helpful way to do this, since Hotmail presents them in-line in emails,
I think.

I don't know how any of this stuff works on the Apple Macintosh.
Arguably, I don't care, either.

Obviously the implementation for a real operating system is plumbed
together using procmail and /usr/libexec/sendmail. The underlying code
is much the same, though.

-- 
``History teaches us that men and nations behave wisely when they have
exhausted all other alternatives.'' (Abba Eban)