Ending Spam
Posted by
timothy
on Mon Aug 15, 2005 04:25 PM
from the overdue dept.
Shalendra Chhabra writes "Jonathan Zdziarski has been fighting spam since before the first MIT spam conference in 2003, and has now released a full-on technical book, Ending Spam, on spam filtering. Ending Spam covers how the current and near-future crop of heuristic and statistical filters actually work under the hood, and how you can most effectively use such filters to protect your inbox." Read on for the rest of Chhabra's review.
| Ending Spam: Bayesian Content Filtering and the Art of Statistical Language Classification | |
| author | Jonathan A. Zdziarski |
| pages | 312 |
| publisher | No Starch Press |
| rating | 8 |
| reviewer | Shalendra Chhabra |
| ISBN | 1593270526 |
| summary | Very Good Book Covering Statistical Models and Techniques Implemented in Current Spam Filters |
Spam (unsolicited commercial email) and phishing (fraudulent emails) are causing losses of billions of dollars to businesses. Many initiatives are currently underway for fighting this challenge. On the legal front, a Virginia court recently sentenced a prolific spammer, Jeremy Jaynes, to nine years in prison, and a Nigerian court sentenced a woman to two and a half years for phishing. Michigan and Utah have both passed laws creating "do-not-contact" registries in July/August 2005, covering e-mail addresses, instant messaging addresses and telephone numbers. Technical initiatives to fight spam include server- or client-side spam filtering, using Lists (Blacklists, Whitelists, Greylists), Email Authentication Standards (IIM, DK, DKIM, SPF, SenderID), and emerging sender reputation and accreditation services.
Ending Spam is the first book explaining the fine details of the theoretical models and machine-learning algorithms implemented in these filters. The book is divided into three parts: introduction to spam filtering, fundamentals of statistical filtering, and advanced concepts of statistical filtering.
The first section of the book discusses the history of spam, spam kings, and different approaches for fighting spam, such as blacklisting, whitelisting, heuristic filtering, challenge-response, throttling, collaborative filtering, Authenticated SMTP, Sender Policy Framework and SenderID, and spammer fingerprinting. However, the author omits any mention of locality-sensitive hash functions (such as the Nilsimsa hash) to counter spammers' random insertion of words, the use of CAPTCHA (Completely Automated Public Turing Test to Tell Computers and Humans Apart), greylisting, Identified Internet Mail, and DomainKeys (now DomainKeys Identified Mail).
In the next chapter, the author clearly explains the various components of a Language Classifier Pipeline, including the Historical Dataset (aka wordlist, database, dictionary, filter memory), the Tokenizer, and the Analysis Engine with its feedback loop. However, the process flow of a language classifier could have been more generalized, e.g. by incorporating an initial text-to-text transformer. This chapter also covers the advantages and disadvantages of various training modes for filters, such as Train Everything (TEFT), Train-on-Error (TOE), and Train Until No Errors (TUNE). This part concludes with a description of Paul Graham's famous spam-filtering technique using Bayesian classification (as described in "A Plan for Spam"), Gary Robinson's Geometric Mean Test, the Fisher-Robinson Inverse Chi-Square (including the source code for the inversion function), and some other tricks for optimizing spam-filtering accuracy.
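To make the combining step concrete, here is a minimal Python sketch of two of the techniques named above: Graham-style Bayesian combining and a Fisher-Robinson inverse chi-square score. It follows the way these methods are commonly described in open-source filters and is not the book's own listing. The function names are hypothetical, and the sketch assumes every per-token spam probability lies strictly between 0 and 1.

```python
import math


def graham_combine(probs):
    """Graham-style naive Bayes combining of per-token spam probabilities."""
    spam = math.prod(probs)
    ham = math.prod(1.0 - p for p in probs)
    return spam / (spam + ham)


def chi2q(x2, v):
    """Probability that a chi-square variable with v degrees of freedom
    exceeds x2. Uses the closed-form series that is valid for even v only."""
    if v % 2 != 0:
        raise ValueError("v must be even")
    m = x2 / 2.0
    term = math.exp(-m)
    total = term
    for i in range(1, v // 2):
        term *= m / i
        total += term
    # Guard against rounding pushing the sum slightly above 1.
    return min(total, 1.0)


def fisher_robinson(probs):
    """Combine token probabilities into a single indicator in [0, 1].
    Values near 1 suggest spam, near 0 suggest ham, near 0.5 'unsure'."""
    n = len(probs)
    spamminess = 1.0 - chi2q(-2.0 * sum(math.log(1.0 - p) for p in probs), 2 * n)
    hamminess = 1.0 - chi2q(-2.0 * sum(math.log(p) for p in probs), 2 * n)
    return (1.0 + spamminess - hamminess) / 2.0
```

A message whose tokens all carry a neutral probability of 0.5 lands exactly on 0.5 under both combiners, which is what allows a filter to reserve a middle band for "unsure" mail instead of forcing a binary decision.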
The second part of this book deals with the fundamentals of statistical filtering. The author explains HTML and Base64 encoding, followed by a detailed description of tokenization techniques (e.g. Sparse Binary Polynomial Hashing). Then there's a discussion of the various tricks that spammers use for penetrating filters. Although these tactics are mentioned in John Graham-Cumming's "Spammers Compendium," Jonathan has very elegantly explained why some tricks work for spammers and some don't. This part concludes by addressing some of the resource, storage and scaling concerns raised by the large number of features generated from tokenization techniques.
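As an illustration of why decoding has to come before tokenization, the following Python sketch decodes a Base64 message body, drops the HTML markup, and splits what remains into tokens. It is only a sketch under stated assumptions: the function name, the UTF-8 decoding, and the token character class are choices made here for illustration, not details taken from the book.

```python
import base64
import re
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects the text between tags and ignores the tags themselves."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def decode_and_tokenize(b64_body):
    """Decode a Base64 body, strip HTML tags, and return lowercase tokens."""
    raw = base64.b64decode(b64_body).decode("utf-8", errors="replace")
    parser = _TextExtractor()
    parser.feed(raw)
    parser.close()
    text = " ".join(parser.chunks).lower()
    # Assumed token alphabet: letters, digits, and a few symbols common in spam.
    return re.findall(r"[a-z0-9$'!-]+", text)


body = base64.b64encode(b"<p>Buy <b>cheap</b> pills now!</p>")
print(decode_and_tokenize(body))
```

Without the decoding step, the filter would see only an opaque Base64 string and learn nothing from the words inside it.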
The third part of this book deals with advanced concepts of statistical filtering. This includes the testing criteria for measuring accuracy of an email filter, and some advanced tokenization concepts, e.g. chained tokens (taking word-pairs and phrases into account, instead of individual words) generated using a sliding 5-byte window as mentioned in Sparse Binary Polynomial Hashing. The next chapter describes the Markovian Model implemented in the CRM114 Discriminator, but the author fails to describe the different weighting schemes for features implemented in the Markovian-based version of CRM114. The author then describes the Bayesian Noise Reduction Technique for purging "out of context" data from the mail text. This chapter concludes with a very nice summary of collaborative algorithms and techniques, such as Message Inoculation, Streamlined Blackhole List, Fingerprinting, Automatic Whitelisting, URL Blacklisting, and Honeypot email addresses for snaring spammers' address-harvesting bots.
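The idea behind chained tokens and sliding-window features can be sketched in a few lines of Python. This is an illustrative approximation, not the CRM114 implementation: it slides the window over whole tokens, uses a `<skip>` placeholder for omitted positions, and emits readable strings where a real filter would hash the features. All names are hypothetical.

```python
def chained_tokens(tokens):
    """Adjacent word pairs, so that word order contributes to the features."""
    return [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]


def sparse_window_features(tokens, window=5):
    """SBPH-style features: at each position keep the newest token and
    every subset of the earlier tokens in the window, marking gaps."""
    features = []
    for end in range(len(tokens)):
        span = tokens[max(0, end - window + 1): end + 1]
        older, newest = span[:-1], span[-1]
        # Each bit of the mask decides whether one older token is kept.
        for mask in range(2 ** len(older)):
            kept = [tok if (mask >> i) & 1 else "<skip>"
                    for i, tok in enumerate(older)]
            features.append(" ".join(kept + [newest]))
    return features


print(chained_tokens(["buy", "cheap", "pills"]))
print(sparse_window_features(["a", "b"]))
```

A full window of five tokens yields 16 features per position, which is where the storage and scaling concerns mentioned earlier in the review come from.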
The most interesting part of this book is the appendix, where the author presents interviews with John Graham-Cumming of POPFile, Brian Burton of SpamProbe, Marty Lamb of TarProxy, Bill Yerazunis of CRM114 Discriminator, and Jonathan Zdziarski of DSPAM (himself). I loved this section.
The salient points of the book: it's very easy to read; each chapter begins with a very thought-provoking introduction, and concludes with a crisp "final thoughts" section. There are very few technical errors in this printing, and the illustrations are of good quality. Since the book is geared more toward the Bayesian and statistical generation of spam filters, the absence of certain spam-busting technologies is acceptable. However, a noticeable omission is the lack of discussion about measuring spam-filter accuracy, and what impact this has on setting filtration thresholds. A section on the economics of tradeoffs, and the use of a Receiver Operating Characteristic (ROC) curve, would have been very helpful.
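For readers unfamiliar with the threshold tradeoff, here is a small Python sketch of how ROC points could be computed from filter scores. The scores, labels, and function name are made up for illustration, and the sketch assumes the test set contains at least one spam and one ham message.

```python
def roc_points(scores, labels, thresholds):
    """For each threshold, return (threshold, false-positive rate,
    true-positive rate). labels[i] is True when message i is spam,
    and a message is flagged when its score is >= the threshold."""
    spam_total = sum(1 for is_spam in labels if is_spam)
    ham_total = len(labels) - spam_total
    points = []
    for t in thresholds:
        true_pos = sum(1 for s, is_spam in zip(scores, labels)
                       if is_spam and s >= t)
        false_pos = sum(1 for s, is_spam in zip(scores, labels)
                        if not is_spam and s >= t)
        points.append((t, false_pos / ham_total, true_pos / spam_total))
    return points


scores = [0.1, 0.4, 0.6, 0.9]
labels = [False, False, True, True]
for point in roc_points(scores, labels, [0.3, 0.5, 0.95]):
    print(point)
```

Lowering the threshold catches more spam at the price of more legitimate mail being flagged. Plotting these points lets an administrator pick the threshold whose false-positive cost is acceptable.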
Overall, by putting together Ending Spam, Jonathan Zdziarski has made another significant contribution (after DSPAM) to the anti-spam community. Whether you are a system administrator, anti-spam researcher, engineer or a newbie interested in fighting spam, this book is a great reference.
William S. Yerazunis and Richard Jowsey also contributed to this review. Shalendra Chhabra is a graduate student in the Department of Computer Science and Engineering at the University of California, Riverside. He is on the development team of the CRM114 Discriminator and has presented his work at the MIT Spam Conference 2005, Cisco Systems, and Stanford University. You can purchase Ending Spam: Bayesian Content Filtering and the Art of Statistical Language Classification from bn.com. Slashdot welcomes readers' book reviews -- to see your own review here, read the book review guidelines, then visit the submission page.
You can't have both... (Score:3, Insightful)
(http://tarrysingh.blogspot.com/)
Bill Gates promised to end it (Score:2, Funny)
Like most parasitic maladies (Score:2)
(http://ofteninspired.com/ | Last Journal: Sunday April 01 2007, @05:49PM)
Or will these dedicated folks and others be able to eliminate it, perhaps by changes to the mail protocols?
Is spam a parasitic malady and, if so, what next? (Score:4, Insightful)
(http://www.users.qwest.net/~waffleck-asch/ | Last Journal: Wednesday November 07, @04:46PM)
Or will these dedicated folks and others be able to eliminate it, perhaps by changes to the mail protocols?
Interesting question that, considering my work involves malaria.
My guess is that, like malaria and most parasitic infestations, we will at some point develop a "cure". The "cure" will work for a few years, after which the parasite (spam) will have adapted, surviving until then in different hosts (old Windows machines donated to Africa, who knows). Then, having developed a new trick, it will come back as strong as ever.
Biology teaches us that organisms adapt to changing environments, through selective breeding (natural), point mutations, and unforeseen combinations (see the H5N1 avian influenza). We can develop cures, but once we do so, we can be fairly sure that, barring species extinction, it will develop methods to cope with our cures.
An easy solution would be to move to IPv6 - but this, like authentication, will only kill off the spam which doesn't use "trusted email clients that are identified" while the spam that can survive will be encouraged to spread like wildfire.
So long as the fiscal, legal, and societal penalties for spamming are fairly low and the rewards are high, and while most people do nothing about it, it will spread.
Re:Is spam a parasitic malady and, if so, what nex (Score:4, Interesting)
(http://www.jbryce.org.uk/)
Re:Is spam a parasitic malady and, if so, what nex (Score:4, Funny)
(http://www.nodomain.org/)
Yet.
Esprit d'Corps (Score:5, Funny)
(http://slashdot.org/~Shadow%20Wrought/journal | Last Journal: Thursday November 15, @09:11PM)
Sorry for the flamebait but (Score:2, Funny)
(http://suso.suso.org/ | Last Journal: Tuesday March 09 2004, @12:03AM)
Awww, poor babies. That's a long time to fight spam.
Re:Sorry for the flamebait but (Score:5, Informative)
(http://ofteninspired.com/ | Last Journal: Sunday April 01 2007, @05:49PM)
HERE [castlecops.com]
"ABOUT THE AUTHOR:
Jonathan A. Zdziarski has been fighting spam for eight years, and has spent a significant portion of the past two years working on the next generation spam filter DSPAM. His research in algorithmic theory and neural networking has led to the development of many new approaches in language classification, and he has played a key role in designing some popular algorithms in use today, including Message Inoculation, Bayesian Noise Reduction, and the first functional Neural Networking algorithm for spam filters. Zdziarski lectures widely on the topic of spam and was a speaker at the 2004 and 2005 MIT Spam Conference.
"
The best way to fight spam (Score:5, Funny)
(http://www.users.qwest.net/~waffleck-asch/ | Last Journal: Wednesday November 07, @04:46PM)
Yum!
Score -5 Outdated. (Score:2, Insightful)
By the time a book has been written, edited, proofread (though many publishers skip this part), typeset, printed, distributed and sold, it no longer resembles the technology.
You can't catch it all (Score:2, Insightful)
Re:You can't catch it all (Score:5, Interesting)
(http://slashdot.org/ | Last Journal: Thursday April 12 2007, @09:41AM)
Spam is no longer simply the domain of a giant server with a huge database. It's increasingly being sent out by zombie PCs, infected with viruses or trojans. Spammers pay the zombie-farmers to send their crap. Zombies send the email masquerading as the PC owner, using their credentials. Sender-ID? No problem, he's got one. SMTP? Sure, use the victim's server.
Zombies mean that no matter what technology is used for sending validated, signed, pre-paid, whatever email, the zombies will have access to those resources and will still spew their crap. No anti-spam server technologies are going to prevent Windows machines from getting infested.
Re:You can't catch it all (Score:5, Interesting)
(Last Journal: Tuesday March 13 2007, @02:39PM)
The first step to a new mail system is to assure that only legitimate and properly configured mail servers honoring MX records on outgoing mail (or whatever ends up replacing MX records) can expect delivery. Mail admins' hands are tied by stealth systems or badly configured ones, but if we do try to implement the no-MX rule, which would eliminate the zombie attacks, we end up shutting out systems that, for whatever reason, don't publish an MX record for outgoing servers.
Zombies ought to be the easiest thing to shut down by a) not permitting non-MTA machines to push anything beyond the network via port 25 and b) publishing both incoming and outgoing mail servers.
Ending Spam? (Score:5, Insightful)
(Last Journal: Wednesday January 18 2006, @06:02PM)
If you can't see it, it ain't there?
Effective filtering will end spam (Score:5, Insightful)
I don't think we'll ever get there, but yes filtering really could end spam.
Re:Ending Spam? (Score:4, Insightful)
There's no such thing as a perfect filtering system, but for every message blocked, that's extra effort for the spammer to get through, making it less and less worthwhile to spam at all.
Or maybe they'll just send more and more, hoping at least one gets through.
fantastic advice (Score:2, Interesting)
Heck, our lobby group even points out to Congress how spam laws are not really needed, since people who really don't want the spam are free to filter it. That and a little payola and we are free to phish for more victims.
Yea, keep "fighting spam" with lame filters, we love it. Thanks!
Email is mostly broken (Score:4, Interesting)
Current anti-spam solutions are to email what an Antivirus package is to Windows - a hack add-on that increases complexity and costs without solving the underlying problem(s).
Rather than fight viruses, we should be engineering an O/S that's inherently resistant to them. How many of you Linux/BSD/MacOS users EVER use antivirus, or need to?
Rather than build ever-better antispam filters for email, we should be engineering an email solution that's inherently resistant to SPAM.
The answer lies in authentication - who is sending the email. Some of the best technologies now available use degrees of authentication without actually *saying* it outright. Examples are: refusing invalid domains, greylisting, challenge-response, SenderID - all of these are some form of authentication.
As these are bypassed, one by one, by the spammers, the need for authentication of senders will continue to increase, until the dolts who will invariably reply with that "your solution will not work because... (check the options)" are shown to simply be... wrong.
Give it time. It's already happening whatever the originators of the SMTP protocol desired.
Re:Email is mostly broken (Score:5, Insightful)
(http://netapps.com.au/)
And it requires central control. Is this what you want?
Re:Email is mostly broken (Score:4, Informative)
(http://www.dylanbrams.com/ | Last Journal: Saturday September 01, @01:42PM)
Yes, it's a possibility. Unfortunately, in this case the 'dolts who invariably reply with the survey' are actually right. The survey is funny, but it serves a very important purpose in this case - it shows that completely re-engineering the entire e-mail system means that the problems we have are masked temporarily and then reemerge. Identity, no identity, in the end the 'stopgaps' are actually better than the 'build it from the ground up' solution.
You Personally advocate a
(x) technical (x) legislative (x) market-based ( ) vigilante
approach to fighting spam. Your idea will not work. Here is why it won't work. (One or more of the following may apply to your particular idea, and it may have other flaws which used to vary from state to state before a bad federal law was passed.)
(x) Spammers can easily use it to harvest email addresses
(x) Mailing lists and other legitimate email uses would be affected
(x) No one will be able to find the guy or collect the money
( ) It is defenseless against brute force attacks
(x) It will stop spam for two weeks and then we'll be stuck with it
(x) Users of email will not put up with it
( ) Microsoft will not put up with it
( ) The police will not put up with it
(x) Requires too much cooperation from spammers
(x) Requires immediate total cooperation from everybody at once
(x) Many email users cannot afford to lose business or alienate potential employers
(x) Spammers don't care about invalid addresses in their lists
(x) Anyone could anonymously destroy anyone else's career or business
Specifically, your plan fails to account for
( ) Laws expressly prohibiting it
(N/A) Lack of centrally controlling authority for email
(x) Open relays in foreign countries
( ) Ease of searching tiny alphanumeric address space of all email addresses
(x) Asshats
(x) Jurisdictional problems
(x) Unpopularity of weird new taxes
( ) Public reluctance to accept weird new forms of money
(x) Huge existing software investment in SMTP
(x) Susceptibility of protocols other than SMTP to attack
(x) Willingness of users to install OS patches received by email
(x) Armies of worm riddled broadband-connected Windows boxes
(x) Eternal arms race involved in all filtering approaches
(x) Extreme profitability of spam
( ) Joe jobs and/or identity theft
(x) Technically illiterate politicians
(x) Extreme stupidity on the part of people who do business with spammers
(x) Extreme stupidity on the part of people who do business with Microsoft
(x) Extreme stupidity on the part of people who do business with Yahoo
(x) Dishonesty on the part of spammers themselves
( ) Bandwidth costs that are unaffected by client filtering
( ) Outlook
and the following philosophical objections may also apply:
(x) Ideas similar to yours are easy to come up with, yet none have ever been shown practical
(x) Any scheme based on opt-out is unacceptable
(x) SMTP headers should not be the subject of legislation
(x) Blacklists suck
( ) Whitelists suck
( ) We should be able to talk about Viagra without being censored
(x) Countermeasures should not involve wire fraud or credit card fraud
(x) Countermeasures should not involve sabotage of public networks
( ) Countermeasures must work if phased in gradually
(x) Sending email should be free
(x) Why should we have to trust you and your servers?
(x) Incompatibility with open source or open source licenses
( ) Feel-good measures do nothing to solve the problem
( ) Temporary/one-time email addresses are cumbersome
(x) I don't want the government reading my email
( ) Killing them that way is not slow and painful enough
Furthermore, this is what I think about you:
( ) Sorry dude, but I don't think it would work.
(x) This is a stupid i
Jonathan Zdziarski is out of his mind. (Score:1, Interesting)
(Last Journal: Tuesday August 30 2005, @11:13AM)
The contrast between being a logical-minded person like a programmer, and being so easily brainwashed into believing complete nonsense is startling.
Re:Jonathan Zdziarski is out of his mind. (Score:5, Insightful)
(http://www.cowlark.com/ | Last Journal: Friday March 18 2005, @05:12AM)
This may be the case; however, that doesn't invalidate his work on spam. Remember, Sir Isaac Newton was a firm believer in the more exotic aspects of mystical alchemy, and the vast bulk of his 'research' was complete gibberish. That doesn't make his work on gravity any less valuable.
No good publisher (Score:2, Interesting)
If this was published by O'Reilly, I'd have bought it on sight as they bother to edit their books. As it is, I'll give it a wide berth.
Spam filtering is bullshit. (Score:1)
(Last Journal: Thursday July 07 2005, @10:35PM)
We need to have an automated way of dog-piling the retail site that the spammer is trying to lure you to.
Every time a spammer sends an email for viagra our email client should go to the site and fill out the order form 50 times per second... incorrectly.
There is simply no more time to be pussies about this shit. Spam filtering has been given plenty of time to fix this problem. It's time for something new and aggressive.
VERY AGGRESSIVE.
THE TIME IS NOW!
thank you for your time.
This should really be entitled "Hiding Spam" (Score:3, Insightful)
(http://www.warrenernst.com/)
Even a manservant reading all of my mail and hand-carrying printouts of nothing but personal messages to my Jamaican bungalow doesn't "end" spam.
It would seem that These Guys [slashdot.org] are actually making an attempt to "end" spam.
All this guy is just talking about is hiding it from view. Big deal...
Who will buy the book? (Score:1)
(http://www.jasonlowry.com/)
Spam elimination - 101 (Score:1, Interesting)
I also know an acquaintance who developed a very unique and effective program to "finger" every spam-bot-infected PC and, with a "secret" program under trial, it shut down more than 550,000 spam-sending infected PCs.
Reports from the SPAM CHAT channels indicate it was very effective in nailing down and eliminating spam bots.
The experiment was ongoing for about 4 months last year, and WOW! I had no idea there were that many spam bots...
Word I've gotten is that a few "Checks and Balances" need to be deployed to prevent abuse... but I can imagine what would happen if more mail servers would deploy such a system.
J
Easy Solution to Spam (Score:2, Insightful)
(http://www.hormel.com/)
Next... (Score:1, Troll)
Greylisting solves 95% for me (Score:2, Informative)
(http://lefttochance.com/)
Should one invest time and money in this book? (Score:2)
(http://www.bearcave.com/)
Some of the previous posters mentioned the rather eccentric views (in my opinion) of the author of Ending Spam (Jonathan Zdziarski). You can sample some of these yourself by reading the essays Mr. Zdziarski has posted on his web site NuclearElephant.com [nuclearelephant.com].
While someone might have, in practice, unlimited amounts of money, none of us have unlimited amounts of time. So a book is always an investment in both time and, for those with more finite amounts of money, cash. With this in mind, there is the question of whether one should read a book by someone who is rather eccentric in their views. Will this eccentricity and, in my opinion, limited knowledge outside of narrow areas, also mean that the book is equally flawed?
I'm undecided. My concern is that Mr. Zdziarski's knowledge of Bayesian filtering and other topics has the same kind of holes that seem to exist when he applies his intellect to other areas (like evolution of both life and the solar system). While this is a concern, it is not a foregone conclusion. The history of science and, especially, mathematics, is full of giants in their field who were also very eccentric.
Mr. Zdziarski seems to have what I would classify as a narrowly focused intellect and perhaps within these narrow confines the reader can rely on what he writes. DSPAM, the SPAM filter written by Mr. Zdziarski, seems to be a strong competitor to SpamAssassin. So on this basis, perhaps the book may be a good investment.
An Idea or 2 (Score:2)
Or
Jhunkhad: A Holy War Against the Infidel Spammers!
In front of a camera, stand them up and make them recite that they have small, flaccid penises and need to refinance their homes and consolidate their debt because they owe all their money to hot horny teen girl web cam sites. Then slap them with a herring until they are unconscious.
Big deal (Score:1)
(http://www.stuii.co.uk/)
Jonathan Zdziarski has been fighting spam since before the first MIT spam conference in 2003
Big deal, I've been fighting spam since 1995.
No spam for 2.5 years.. Use TMDA (Score:1)
For those who don't know, TMDA is a challenge-response based server-side system. It's open-source, all written in Python. Works with all client mail readers. Check it out
Re:dude Milky Ways suck (Score:1, Offtopic)
I have a peanut allergy you insensitive clod!
Re:Great, this will help.. (Score:1)
(http://homepage.mac.com/chevyorange)
It is free: http://junkmatcher.sourceforge.net/Home/index.htm
Even if you aren't interested or already have filtering, this guy's page is very interesting - he even updates definitions often.