From: Chris Lightfoot
To: [ various ]
Subject: p2p thoughts
Date: Sun, 26 May 2002 17:41:06 +0100

I've been vaguely thinking about this email P2P stuff, mainly about
the network organisation rather than the implementation. Anyway,
herewith some thoughts....

Take a really simple model of the random graph with $N$ nodes and $k$
connections per node. We expect that $N < k^r$, where $r$ is the graph
radius (maximum minimum internode distance). Argue then that $r \sim
\log N / \log k$, and that the time to propagate a message across the
whole network is of order $r \Delta / 2$, where $\Delta$ is some kind
of average mail-check interval. (For the moment we assume that nodes
are well-connected.) For a network of $N = 10^6$ nodes with $k = 5$,
$r = 8.6$, and if $\Delta = 5$ minutes then the network-crossing time
is about 20 minutes, which is not unreasonable. (Assume that the
transmission time is small compared to $\Delta$, obviously; this is
probably true because the search messages are going to be pretty
small.)

The search algorithm is pretty simple: we send out a search request
containing some terms to the nodes known to us. They return results
they know about and forward the message to other nodes. Search
requests have a unique id and a time-to-live; nodes discard requests
which they've already seen. Requests are encapsulated in emails,
possibly lots of requests in one email. Each node finds matching
search results and returns them to the originating node. Now, typical
P2P systems like Gnutella use store-and-forward for the results, but I
think that it would be much better to just return the results direct
to the originator. There is an obvious privacy concern here, but I
don't know how serious it is.

File requests are much easier-- we identify files by {hash, length}
and use this information to form a request. File names and other
metadata (type, bitrate) are used only for indexing. We return files
in standard MIME email messages.
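The propagation scheme above can be sketched in a few lines of Python
(a toy synchronous simulation: a deterministic ring-with-chords graph
stands in for the random graph, and all the names here are mine, not
anything from the design):

```python
import math

def estimated_radius(n_nodes, k):
    # Back-of-envelope r ~ log N / log k, as in the text.
    return math.log(n_nodes) / math.log(k)

class Node:
    def __init__(self, name):
        self.name = name
        self.neighbours = []   # the ~k nodes we know about
        self.files = set()     # terms this node can answer
        self.seen = set()      # request ids already handled

    def search(self, req_id, term, ttl, results):
        # Discard duplicates (from multipathing) and expired requests.
        if req_id in self.seen or ttl <= 0:
            return
        self.seen.add(req_id)
        # In the real system a match would be emailed direct to the
        # originator; here we just collect it.
        if term in self.files:
            results.append(self.name)
        # Forward to our neighbours with the TTL decremented.
        for n in self.neighbours:
            n.search(req_id, term, ttl - 1, results)

# Node i links to nodes i+1 .. i+k (mod N).
N, k = 50, 5
nodes = [Node(i) for i in range(N)]
for i, node in enumerate(nodes):
    node.neighbours = [nodes[(i + j) % N] for j in range(1, k + 1)]

nodes[7].files.add("britney")
results = []
nodes[0].search("req-0001", "britney", ttl=10, results=results)
print(results)                                # [7]
print(round(estimated_radius(10**6, 5), 1))   # 8.6
```

The seen-set is what keeps the flood from looping; the TTL is what
keeps a request from living forever on a partitioned or huge network.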
An obvious search optimisation is to send out a message having a small
TTL which contains a list of files on our site to our neighbours. They
save this information and return it in searches. Obviously they must
still pass the search request to us so that we can pass it on to our
neighbours. So tag each search request with a list of the nodes for
whom search results have already been returned by the network, so as
to minimise duplication. Obviously there will still be some
duplication because of multipathing, but that's relatively
unimportant; the nice thing about this is that if we propagate this
information (say) 3 hops out, this should reduce the effective value
of $r$ by about 3. If you have 1,000 songs with titles of 50
characters, this is of order 50Kb of data, which will probably
compress to 20Kb, so we can send out these messages every few hours
without trouble.

Various issues:

- We reveal, pretty much at random, email addresses to other members
  of the network. A spammer could just connect their computer to the
  network and collect email addresses of users from it. I don't know
  how serious this actually is. How privacy-conscious are the low-down
  pirating scum^W^W^W P2P early adopters who will use a system of this
  type? Can't they just get disposable email addresses?

- There's no security in here at all. The Recording Industry Ass. of
  America could insert random white noise tracks masquerading as
  Britney Spears songs (or, worse, vice versa) and the system couldn't
  be protected from it. It would be possible to overlay voting on top
  of the system, but obviously there's no trust mechanism to protect
  that, either. This isn't a very serious problem in reality, since it
  is shared by all the other current P2P networks.

- An average music track is, what, 3Mb? And an episode of a television
  programme is a few hundred Mb. Most people can probably cope with
  3Mb emails, but probably not with 100Mb emails. Many providers
  impose quotas on, for instance, POP3 mailboxes.
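The size estimate for the catalogue broadcast is easy to check (a
sketch with synthetic made-up titles, so the exact compression ratio
is illustrative only; real titles are less repetitive and will
compress somewhat less well):

```python
import zlib

# Synthetic catalogue: 1,000 titles of ~50 characters each.
titles = ["Track %04d - Some Band - An Album Name (192kbps)" % i
          for i in range(1000)]
catalogue = "\n".join(titles).encode("ascii")

raw = len(catalogue)                       # ~50Kb, as in the text
packed = len(zlib.compress(catalogue, 9))  # far smaller once deflated
print(raw, packed)
```

Even at the pessimistic 20Kb figure, a broadcast every few hours is
noise next to the 3Mb tracks themselves.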
One possibility here is a hybrid system-- we could make files
available by (say) HTTP, and nodes could have a go at downloading them
that way. If that didn't work, then try the email route. This way the
system will also adapt gracefully to a crackdown by The Man or his
MTA.

- Although The Man isn't going to shut down everybody's email, he
  might reprogram sendmail (or, more likely, Microsoft Exchange) to
  recognise our command format, probably claiming that to do so is an
  anti-virus measure or some such shite. I'm not certain what can be
  done about this. Steganography is the obvious option; we could, for
  instance, send our messages as MIME segments of type image/gif and
  arrange for all the packets to begin GIF89a and proceed in a manner
  which resembles an actual GIF file. Obviously this will create a bit
  of overhead, but it will also confound The Man -- assuming that he's
  as dumb as I'm guessing he is -- and give us a bit (not much) of
  crypto-weenie street cred. It's not clear that this -- I mean the
  steganography, not the street cred -- would be tremendously
  effective, though.

  Another possibility is to allow the use of a transport other than
  email. Obvious possibilities are the various instant messenger
  applications, and IRC. A nice consequence of using IRC is that it's
  a broadcast medium anyway, which makes the request propagation
  easier.

- How happy are people going to be with a program which randomly
  sends out email on their behalf? Not sure. This is an issue to do
  with how seamlessly the thing is integrated with people's MUAs, I
  think.

In terms of Win32 implementation, there are a number of
possibilities. An attractive one would be to use the features of
Outlook Express-- probably we can cause it to filter all mail matching
the criteria into a separate folder, from which we can retrieve it and
do with it what we like. I presume that the folders can be accessed
through some COM interface.
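The GIF-masquerading idea could look something like this (a sketch
only: the choice of a GIF comment extension to carry the payload is
mine, and a careful filter would notice the absence of real image
data, so this is the dumb-Man case the text has in mind):

```python
def wrap_as_gif(payload: bytes) -> bytes:
    """Frame an opaque payload so the result starts 'GIF89a' and is
    shaped like a GIF: header, logical screen descriptor, one comment
    extension carrying the data, then the trailer."""
    out = bytearray(b"GIF89a")
    out += (1).to_bytes(2, "little")  # logical screen width  = 1
    out += (1).to_bytes(2, "little")  # logical screen height = 1
    out += b"\x00\x00\x00"            # no colour table, bg, aspect
    out += b"\x21\xfe"                # comment-extension introducer
    for i in range(0, len(payload), 255):   # 255-byte sub-blocks
        chunk = payload[i:i + 255]
        out += bytes([len(chunk)]) + chunk
    out += b"\x00"                    # sub-block terminator
    out += b"\x3b"                    # GIF trailer
    return bytes(out)

def unwrap_gif(blob: bytes) -> bytes:
    """Recover the payload from the comment extension."""
    assert blob.startswith(b"GIF89a")
    pos = 13 + 2           # skip header + descriptor + 0x21 0xFE
    payload = bytearray()
    while blob[pos] != 0:  # read length-prefixed sub-blocks
        n = blob[pos]
        payload += blob[pos + 1:pos + 1 + n]
        pos += 1 + n
    return bytes(payload)

msg = b"SEARCH id=req-0001 ttl=4 terms=britney"
blob = wrap_as_gif(msg)
print(blob[:6], unwrap_gif(blob) == msg)
```

The overhead is one length byte per 255 bytes of payload plus a
fixed ~16-byte frame, i.e. well under 1%.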
This is ugly, but has the advantage that we don't have to write any
code which interferes with the user's email setup. I don't know
anything about plugins for Outlook Express, but if suitable, that
might be a good way to implement the above.

The other possibility is proxies. A proxy for POP3 is trivial; for
IMAP2, less so. 'Course, it would be good practice for writing a
server.... (Actually, it's not completely trivial, since we have to
make it appear as if certain messages simply don't exist while
diverting them for our own Evil Purposes. So the proxy has to make a
connection, build a model of the data, and then present that to the
client.)

Sending email can be done either by injecting messages into the
Outlook Express queue, or by direct SMTP injection. The latter is
probably a better option, except for corporate weenies stuck behind
authenticating SMTP servers and/or Microsoft Exchange craziness. For a
rough draft, they can probably be ignored.

One possibly nice idea would be an implementation for Microsoft
Hotmail. I believe that at one stage Hotmail supported its own wacky
XML-based protocol -- httpmail or something -- for an MUA to retrieve
email, but I don't know if this is still supported or whether it was
ever publicly documented. The alternative is obviously to do it by
screen-scraping; enclosing packets in images might actually be a
helpful way to do this, since Hotmail presents them in-line in emails,
I think.

I don't know how any of this stuff works on the Apple Macintosh.
Arguably, I don't care, either.

Obviously the implementation for a real operating system is plumbed
together using procmail and /usr/libexec/sendmail. The underlying code
is much the same, though.

-- 
``History teaches us that men and nations behave wisely when they have
exhausted all other alternatives.'' (Abba Eban)