Detecting Bulk

Many people wonder, if the definition of Spam is based on bulk, just what "bulk" means and how it can be detected in E-mail when each recipient only gets one copy.

Defining Bulk

Bulk E-mail is, for the purposes of defining Spam, the use of automation to send the same message to multiple unrelated parties. Each of these terms needs refining.

Automation means using computers to make it much easier to send the same message to lots of people. It doesn't just mean a traditional mailing list or bulk mailing program. The use of "form letters," where you have a standard letter and have a computer fill in the blanks or change a few things here and there, is bulk mail. Writing any letter where you use a computer to insert a significant block of text that is also sent to others is bulk mail. Having a computer prepare a raft of subtly different messages that mean the same thing is still automation.

(Indeed one should note that rules and laws govern the actions of people, not computers. Tricks to use the computer in clever ways don't get around any human-based definition of bulk mail.)

The definition of the "same" message is one of human meaning rather than computerized text comparison. For example, a message translated into French is the "same" message as the original English version. As noted, variants of a form letter are the same message. Indeed, all the letters that would result from a roomful of secretaries being told, "write a letter offering our Widgets for $9.95 and expounding on their virtues" are the same message.

Of course, computers can't identify the last case very easily (more on that later) but it's actually not a very common case because it's expensive to do.

Unrelated parties are people that have no personal knowledge of one another. People who work together are related, but two people who both work for a large company may not be. People who have been debating one another in an online forum have a relationship but two people who posted to two different threads may not. The executives of a society have relationships with the members but unless it's a small society, two random members may not.

Finally, the term "multiple": while technically just two unrelated parties might comprise a bulk mailing, it's simply safer to pick a larger number, above which all agree there is bulk mailing going on. The harm from 10-recipient mailings won't be significant in the grand scheme. I would suggest a number like 25.

Detecting Bulk

As it turns out, human beings can very easily spot the difference between mail written for them, and mail meant for bulk delivery. Ask most people to tell the personal mail in their mailbox from the junk mail without looking at the headers and they'll get it right 99% of the time, even when the mail has commonly used tricks like filling in the recipient's name and other information, or fake vague claims of referrals.

In fact, most junk E-mail can easily be detected by computers, because the modifications, if any, to different copies, are small. Still, one person receiving one piece of junk E-mail may know it's junk but be unable to prove it directly, having only one copy.

Large Sites

The bulk of users are of course at large sites, like AOL, Netcom and the like. Those sites can already detect the computer-detectable bulk mails without involvement from the users. All they have to do is calculate digital "hashes" of components of the messages (paragraphs or sentences) and put these in databases. It's then easy to spot when the same message comes in twice. After all, if two messages have even one identical internal paragraph (of any length) it's very likely they are related. Make it a significant chunk and it's almost certain.
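The paragraph-hashing idea can be sketched roughly as follows. This is only an illustration under stated assumptions, not how any particular site does it: the names `paragraph_hashes` and `ParagraphIndex`, the choice of SHA-256, the 40-character cutoff and the in-memory dictionary standing in for a database are all hypothetical.

```python
import hashlib
import re
from collections import defaultdict

MIN_PARAGRAPH_CHARS = 40  # assumed cutoff: skip short paragraphs such as greetings


def paragraph_hashes(body):
    """Return the set of SHA-256 digests of the normalized paragraphs in body."""
    hashes = set()
    for paragraph in re.split(r"\n\s*\n", body):
        # Collapse case and whitespace so trivial reformatting keeps the same hash.
        normalized = " ".join(paragraph.lower().split())
        if len(normalized) >= MIN_PARAGRAPH_CHARS:
            hashes.add(hashlib.sha256(normalized.encode("utf-8")).hexdigest())
    return hashes


class ParagraphIndex:
    """In-memory stand-in (an assumption) for a site's database of hashes."""

    def __init__(self):
        self._seen = defaultdict(set)  # digest -> ids of messages containing it

    def add(self, message_id, body):
        """Record a message; return ids of earlier messages sharing a paragraph."""
        related = set()
        for digest in paragraph_hashes(body):
            related |= self._seen[digest]
            self._seen[digest].add(message_id)
        related.discard(message_id)
        return related
```

Each incoming message is passed to `add`; a non-empty result means some significant paragraph has been seen before. Only the digests are stored, never the paragraphs themselves.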

Unfortunately, they can't act automatically on this. AOL does this kind of detection, and famously discarded acceptance letters from Harvard addressed to several of its users because Harvard mailed them all at once. Machines can't easily tell such solicited bulk mail from Spam.

There are hashing techniques that go down to the vocabulary level that are good at detecting highly similar messages even when they've been tweaked to all look slightly different just to get around these tests. While there obviously is an escalating war between programs that try to make messages look different and those that try to spot those that are similar, there are some things that just can't be avoided. Messages that try to promote products or companies have to have the names of those items, and unless the names are generic English words, they will be easy to spot, especially since trademarks try to be unique. Any contact info, phone numbers, street addresses, web pages and the like are easy to spot even in messages that vary otherwise.
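One simple way to compare messages at the word level, offered here as an assumed stand-in rather than the specific techniques referred to above, is to break each message into overlapping word triples ("shingles") and measure how many the two messages share (Jaccard similarity). The function names and the shingle size of 3 are illustrative choices.

```python
import re


def shingles(text, size=3):
    """Return the set of overlapping word n-grams (shingles) in text."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    if len(words) < size:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def similarity(a, b, size=3):
    """Jaccard similarity of the two texts' shingle sets, from 0.0 to 1.0."""
    sa, sb = shingles(a, size), shingles(b, size)
    if not sa and not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)
```

Two copies of a form letter that differ only in the recipient's name score close to 1.0, while unrelated personal notes score near 0.0. A real filter would have to choose a threshold somewhere in between.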

In addition, all messages include a trace of the route they took from their origin. Patterns in this route (or just the origin) matching similarities in the messages can be spotted. Attempts to use multiple origins are possible, but hard, especially when sites disconnect abusing users.
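The route trace lives in the "Received" headers, which each relay adds at the top of the message. A minimal sketch of grouping messages by apparent origin follows, using Python's standard `email` module. The helper names are hypothetical, and the sketch trusts the oldest header as written, which a real system could not do, since the sender can forge every header added before the mail reaches a trusted server.

```python
import re
from collections import Counter
from email import message_from_string


def origin_host(raw_message):
    """Return the host named in the oldest Received header, or None."""
    msg = message_from_string(raw_message)
    received = msg.get_all("Received") or []
    if not received:
        return None
    # Each relay prepends its header, so the last one listed is the earliest hop.
    match = re.match(r"\s*from\s+(\S+)", received[-1], re.IGNORECASE)
    return match.group(1).lower() if match else None


def origins_by_count(raw_messages):
    """Count how many of the given messages share each origin host."""
    hosts = (origin_host(raw) for raw in raw_messages)
    return Counter(h for h in hosts if h is not None)
```

An origin that turns up on many similar-looking messages to unrelated recipients is a strong hint that they belong to one mailing.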

While an AOL can detect two identical messages going to two AOL users, one doesn't even have to be that big. These hashes can be shared, sometimes even broadcast without compromising the text of the messages themselves. All of this can be automated.

When Spam is detected, this can affect filters, or just be there to allow users to ask, "Did others get a message like this one I think is bulk?"

Junk-Bait addresses

It's possible to create fake E-mail addresses and try to get them on the mailing lists of the Spammers. Most junk E-mail is sent by gathering names found posting in USENET newsgroups or mailing lists. It's easy to insert fake names into these lists or otherwise cause junk E-mailers to mail to these addresses. One can ensure that a large number of bulk E-mails end up being sent to these addresses, making it easy to detect messages that are bulk. Any message sent to two of these addresses can be quickly and easily detected, providing solid proof that it was sent as bulk mail. Any message to such an address is certainly to a stranger.

If all such addresses are combined together it becomes very easy for a user to forward suspected bulk mail to a server that can tell whether that piece of mail, or something similar, was delivered to a bait address.
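Such a server might look something like the sketch below. It is a simplification under stated assumptions: `BaitRegistry` and `fingerprint` are hypothetical names, the addresses are placeholders, and matching is by exact hash of the normalized body, so the "or something similar" case would need a fuzzier comparison.

```python
import hashlib


def fingerprint(body):
    """Hash the body after collapsing case and whitespace (exact match only)."""
    normalized = " ".join(body.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


class BaitRegistry:
    """Tracks which bait addresses each message fingerprint has reached."""

    def __init__(self, bait_addresses):
        self.bait_addresses = {a.lower() for a in bait_addresses}
        self.hits = {}  # fingerprint -> set of bait addresses that received it

    def record_delivery(self, recipient, body):
        """Call for every incoming message; only mail to bait addresses is kept."""
        recipient = recipient.lower()
        if recipient in self.bait_addresses:
            self.hits.setdefault(fingerprint(body), set()).add(recipient)

    def bait_hits(self, body):
        """Return how many distinct bait addresses received this body."""
        return len(self.hits.get(fingerprint(body), ()))

    def is_bulk(self, body, min_hits=2):
        """True when the body reached at least min_hits bait addresses."""
        return self.bait_hits(body) >= min_hits


# Usage: a user forwards a suspect message and asks whether bait addresses saw it.
registry = BaitRegistry(["trap1@example.org", "trap2@example.net"])
registry.record_delivery("trap1@example.org", "Make money fast with Widgets!")
registry.record_delivery("trap2@example.net", "Make money fast with Widgets!")
print(registry.is_bulk("Make money fast with Widgets!"))
```

The default of two hits follows the observation above that a message reaching two bait addresses is already solid proof of bulk.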

This is part of the technique used by one anti-spam company, BrightMail.

User sharing

Ideally, to detect all bulk mail, a group of parties interested in stopping the Spam problem could create a central repository for users to test whether or not they have received bulk mail. It might be called "bulk-fighters.org" or similar. Such a system could maintain an E-mail address where users could forward suspected bulk mail for testing. As noted above, users can usually very easily tell when they have received a piece of mail that is probably bulk. All they need do is forward or "bounce" the message to a special E-mail address at the bulk-fighters site. (In most mailers, this is just a couple of clicks or keystrokes once this address is on your list of commonly mailed addresses.)

The program receiving that mail runs tests to detect whether it is similar to other mail that has been forwarded by other users, or to mail caught at bait addresses. If it is, or if it is later shown to be similar to mail that arrives afterwards, the user could be notified that they had in fact received some bulk mail. They could then initiate any disciplinary action against such bulk mail -- with proof in their hands.

Bulk-fighters would also maintain junk-bait addresses. These would not be on its own domain; instead, people would create junk-bait addresses elsewhere and just forward the mail arriving there on to bulk-fighters.

E-mail users participating in such a system would receive quick and solid documentary proof that the mail they received was indeed bulk mail. It doesn't take much of a statistical argument to say that if a few entirely unrelated people receive a message, it was very probably sent to a very large number of people. The bulk mailer would have a hard time defending against such a charge.
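One hedged way to put numbers on that statistical argument is sketched below. It rests on a strong simplifying assumption that is not in the original text: that a mailer's recipients are a uniformly random sample of all mailboxes. Under that assumption, the chance that at least a given number of participants are among the recipients follows the hypergeometric distribution. The function name and parameters are illustrative.

```python
from math import comb


def prob_at_least(reports, sent, participants, population):
    """P(at least `reports` participants are among `sent` random recipients).

    Assumes (hypothetically) that the `sent` recipients are drawn uniformly
    at random, without replacement, from `population` mailboxes, of which
    `participants` forward their suspect mail to the detection service.
    """
    total = comb(population, sent)
    # Sum the probabilities of seeing fewer than `reports` participants.
    below = sum(
        comb(participants, k) * comb(population - participants, sent - k)
        for k in range(min(reports, sent + 1))
    )
    return 1 - below / total


# Compare a tiny mailing with a huge one, for 10 participants among 1000 mailboxes.
print(prob_at_least(2, sent=2, participants=10, population=1000))
print(prob_at_least(2, sent=500, participants=10, population=1000))
```

With these made-up figures, two matching reports are wildly unlikely if the message went to only two people and very likely if it went to half the population, which is the shape of the argument a complainant would make. Real mailing lists are not random samples, so the figures are illustrative only.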

The creation of such a system is not unlikely. The cooperation among users who have built the systems to detect and stop USENET spam has been exemplary, and the software has proven able to detect bulk even when spammers try to vary the messages to get around it.

Since I wrote this, systems which spot spam attacks like this have arisen, such as Vipul's Razor.

Enforcement

No matter what method is chosen to discipline unethical bulk mailers, a system like this, with solid, verified detection of bulk, can serve as the foundation. Complaints to ISPs, customer service departments and even court cases will be well served by solid documentation.

In addition, once bulk messages are detected, it is possible to then put their hash "signatures" into filtering lists, so that highly similar messages can simply be stopped before they arrive.

Laws against bulk

Should there be laws (they are a last resort) which regulate unsolicited bulk mail, then the demonstration of bulk is even easier. That's because the laws would involve civil lawsuits, not criminal trials.

In a civil lawsuit, the defendant is required to provide evidence against their own interests, unlike in criminal court, where they can often "plead the 5th amendment" (in the USA) or use other rights against self-incrimination.

That means you can put a spammer on the stand and say, "how many people did you order this message sent to?" They can either tell the truth, or they can lie to the judge. As you might guess, telling lies to the judge is no longer a civil matter. That's a criminal offence, punishable by jail.

And sometimes people do lie in court and get away with it, when it's their word against somebody else's. But this is such an easy lie to get caught at. One would have to be nuts to risk such a lie. Perhaps if you sent the message to only 50 people, you would gamble your freedom that the plaintiff couldn't find anybody else. But if you sent it to 10,000? Would you gamble your freedom on that?

After making such a lie, you would have to be sure that the plaintiff would not be able to produce other recipients. If there were a bulk-fighters.org out there collecting bulk mail, I can't imagine anybody being stupid enough to take that risk of perjury to save some money in court.

You can ask more than just how many. You can even ask who they sent it to, if they didn't destroy the mailing list, and all backups, immediately after mailing. They can't destroy the list after you threaten to sue; that's called obstruction of justice and is also a criminal offence.

It gets better. In civil court, there is no standard of proof beyond a reasonable doubt. It's just who the judge or jury believes more. Consider a typical spam you've received. They read like ads because they are ads. When they try to sound like personal notes they are usually obvious fakes because they aren't personal notes. If you were on the jury, and you saw a spammer testify that no, they only sent their ad to one or two people, who just happened to be random strangers, are you going to believe them?

Does the spammer have no mail logs? Does their ISP have no mail logs? All these things can be commanded in the discovery phase of a civil trial, or ordered by the judge in small claims court.

You may only see one copy of a bulk message, but demonstrating bulk in court is easier than you think.