Bayesian Spam Filter Poisoning With RSS

Overview

Bayesian filtering is a great method for fighting spam. Unlike rule-based filtering, which spammers can easily adapt to with simple modifications, a Bayesian filter adapts along with the spammers’ changes, making it much more difficult for them to defeat the filtering. As a result it’s used in server-side mail filtering as well as client-side filtering in various products including Mozilla Thunderbird, SpamAssassin, and SpamBayes. Despite this level of “intelligence” it’s not foolproof. Like anything that analyzes unsanitized input, it’s vulnerable to poisoning. To be fair, there is a debate over whether Bayesian poisoning really exists or not. I personally believe it does.
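To make the “adapts with the spammers” idea concrete, here is a minimal sketch of a naive Bayes token classifier. It is a toy and not how Thunderbird, SpamAssassin, or SpamBayes actually implement things: the class name `TinyBayes`, the whitespace tokenizer, the add-one smoothing, and the four training messages are all assumptions made for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    # Lowercase and split on whitespace; real filters tokenize far more carefully.
    return text.lower().split()


class TinyBayes:
    """Toy naive Bayes spam classifier with add-one smoothing (illustrative only)."""

    def __init__(self):
        self.counts = {"spam": Counter(), "ham": Counter()}
        self.messages = {"spam": 0, "ham": 0}

    def train(self, text, label):
        # Training is just counting tokens per label, so the filter keeps
        # adapting as long as the user keeps feeding it new examples.
        self.counts[label].update(tokenize(text))
        self.messages[label] += 1

    def spam_probability(self, text):
        vocab = set(self.counts["spam"]) | set(self.counts["ham"])
        total_msgs = sum(self.messages.values())
        log_score = {}
        for label in ("spam", "ham"):
            total_tokens = sum(self.counts[label].values())
            # Prior: fraction of training messages carrying this label.
            score = math.log(self.messages[label] / total_msgs)
            for tok in tokenize(text):
                # Add-one smoothing so unseen tokens do not zero out the product.
                p = (self.counts[label][tok] + 1) / (total_tokens + len(vocab) + 1)
                score += math.log(p)
            log_score[label] = score
        # Turn the two log scores back into P(spam | text).
        top = max(log_score.values())
        spam = math.exp(log_score["spam"] - top)
        ham = math.exp(log_score["ham"] - top)
        return spam / (spam + ham)


f = TinyBayes()
f.train("buy cheap pills online now", "spam")
f.train("cheap pills from canadian pharmacy", "spam")
f.train("meeting notes for the project attached", "ham")
f.train("lunch on friday with the project team", "ham")

spammy = f.spam_probability("cheap pills now")
hammy = f.spam_probability("project meeting on friday")
print(round(spammy, 3), round(hammy, 3))
```

With this tiny training set the first score should come out well above 0.5 and the second well below it. Because the filter only looks at token statistics, anything that shifts those statistics, including junk text added on purpose, shifts the verdict.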

So What Is This “Poisoning” You Speak Of?

Poisoning refers to spammers putting non-spam words (either gibberish, random words, or old texts) into spam. The technique itself is nothing new; it has been used for years to help get around spam filters. This is why some of your spam may contain things like:

Everything you can imagine is real.
What this country needs is a good five cent cigar.
What the eye does not admire the heart does not desire.
Action is coarsened thought thought becomes concrete, obscure, and unconscious.
A man profits more by the sight of an idiot than by the orations of the learned.

The above comes from spam trying to pitch a Canadian pharmacy! Doesn’t sound very medical, does it? That’s the point. They then throw the URL and a quick “buy pills” somewhere in there.
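Here is a rough sketch of why that padding can help a spammer. It combines per-token spam probabilities the way simple Bayesian filters commonly do, as prod(p) / (prod(p) + prod(1 - p)). Every number in `TOKEN_SPAM_PROB` is made up for illustration, and `combined_spam_score` is a hypothetical helper, not a function from any real filter.

```python
from math import prod

# Hypothetical per-token spam probabilities, as a trained filter might hold them.
TOKEN_SPAM_PROB = {
    "buy": 0.90, "pills": 0.95, "pharmacy": 0.92,                    # spammy words
    "imagine": 0.20, "country": 0.15, "heart": 0.20, "learned": 0.10,  # innocent words
}
UNKNOWN = 0.5  # assumed neutral score for tokens never seen in training


def combined_spam_score(tokens):
    """Combine token probabilities as prod(p) / (prod(p) + prod(1 - p))."""
    probs = [TOKEN_SPAM_PROB.get(t, UNKNOWN) for t in tokens]
    spam = prod(probs)
    ham = prod(1 - p for p in probs)
    return spam / (spam + ham)


plain = ["buy", "pills", "pharmacy"]
# The same pitch, padded with words lifted from innocent-sounding quotes.
poisoned = plain + ["imagine", "country", "heart", "learned"]

plain_score = combined_spam_score(plain)
poisoned_score = combined_spam_score(poisoned)
print(round(plain_score, 3), round(poisoned_score, 3))
```

With these made-up numbers the padded message scores noticeably lower than the plain pitch, even though the pitch itself is unchanged. Real filters blunt this in various ways, for example by scoring only the most extreme tokens, so treat this as the intuition rather than a measurement.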

What’s Now Going On

My theory is that spammers are taking on a new technique: using RSS feeds as an input source to make spam look more legitimate and keep the content timely (to avoid filtering). RSS feeds are easy to retrieve, easy to parse, and extremely plentiful. As a result it’s possible to have an endless sea of salt to try and get around the filters.
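To show how little effort parsing takes, here is a small sketch that pulls item titles out of an RSS 2.0 document using only Python’s standard library. The sample feed is a hypothetical stand-in built around two of the headlines from this post; it is not a copy of any real feed. The same few lines are just as useful to someone on the filtering side who wants a list of current headlines to compare incoming mail against.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample feed; only the two item titles come from this post.
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Example News (hypothetical feed)</title>
    <item>
      <title>U.S., Iraqi forces battle insurgents</title>
    </item>
    <item>
      <title>Madonna defends Rosie</title>
    </item>
  </channel>
</rss>"""


def item_titles(rss_text):
    """Return the <title> text of every <item> in an RSS 2.0 document."""
    root = ET.fromstring(rss_text)
    return [item.findtext("title", default="").strip() for item in root.iter("item")]


titles = item_titles(SAMPLE_RSS)
for title in titles:
    print(title)
```

Note that the channel’s own title is skipped because only `item` elements are visited.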

Examples

Here are a few examples I collected in about 10 minutes of skimming my spam folder, looking only at titles that seemed like they may have come from feeds. Google searches seem to indicate most come from CNN RSS feeds. To find the origin you need to be a little creative with your searches and make use of Google’s cache, since an article’s title can change through the life of the article.

I then decided to use Google Reader to display over 1,000 titles from the past week in my “General News” tag. This includes a few but not all of their feeds (mainly U.S. and World News). As a side note, this category is somewhat of an antique: I don’t read general news via RSS because I work for a news website. I get all the news I can tolerate from 9-5 ;-). I’m also a feed junkie.

I clearly couldn’t find them all within the range of a week and 3 feeds, but I did find enough to make me wonder. The screenshots are below:

U.S., Iraqi forces battle insurgents

Story Date: Tue, 9 Jan 2007
Email Sent: Wed, 10 Jan 2007

Rice ‘loves’ Fox News; CBS anchor ‘decent guy’

Story Date: Thu, 11 Jan 2007
Email Sent: Fri, 12 Jan 2007

Rebel ‘We aided bin Laden escape’

Story Date: Thu, 11 Jan 2007
Email Sent: Sun, 14 Jan 2007

Madonna defends Rosie

Story Date: Thu, 11 Jan 2007
Email Sent: Fri, 12 Jan 2007

Swank ‘I am in a relationship’

Story Date: Tue, 09 Jan 2007
Email Sent: Wed, 10 Jan 2007

Gwynn, Ripken in Hall, McGwire misses

Story Date: Tue, 09 Jan 2007
Email Sent: Thu, 11 Jan 2007

Court papers Dancer cleared one Duke suspect

Story Date: Tue, 11 Jan 2007
Email Sent: Fri, 12 Jan 2007

As you can see, many were sent the day after the story appeared in the feed.
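For anyone who wants to check that claim, here is a small sketch that computes the lag in days for the seven examples above. The dates are copied as listed; the helper `parse_day` and the month table are my own scaffolding, and the weekday names are ignored so that only the day, month, and year count.

```python
from datetime import date

MONTHS = {name: number for number, name in enumerate(
    "Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec".split(), start=1)}


def parse_day(stamp):
    """Parse a stamp like 'Tue, 9 Jan 2007' into a date, ignoring the weekday name."""
    _, rest = stamp.split(", ", 1)
    day, month, year = rest.split()
    return date(int(year), MONTHS[month], int(day))


# (headline, story date, email sent), copied from the examples above.
SAMPLES = [
    ("U.S., Iraqi forces battle insurgents", "Tue, 9 Jan 2007", "Wed, 10 Jan 2007"),
    ("Rice 'loves' Fox News; CBS anchor 'decent guy'", "Thu, 11 Jan 2007", "Fri, 12 Jan 2007"),
    ("Rebel 'We aided bin Laden escape'", "Thu, 11 Jan 2007", "Sun, 14 Jan 2007"),
    ("Madonna defends Rosie", "Thu, 11 Jan 2007", "Fri, 12 Jan 2007"),
    ("Swank 'I am in a relationship'", "Tue, 09 Jan 2007", "Wed, 10 Jan 2007"),
    ("Gwynn, Ripken in Hall, McGwire misses", "Tue, 09 Jan 2007", "Thu, 11 Jan 2007"),
    # Weekday kept as listed; only the day, month, and year are used.
    ("Court papers Dancer cleared one Duke suspect", "Tue, 11 Jan 2007", "Fri, 12 Jan 2007"),
]

lags = [(parse_day(sent) - parse_day(story)).days for _, story, sent in SAMPLES]
next_day = sum(1 for lag in lags if lag == 1)
print(lags, f"{next_day} of {len(lags)} sent the next day")
```

A short, consistent lag like this is what you would expect if the subjects are being pulled from feeds automatically rather than picked by hand.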

I should note this is not the feed owner’s fault in any way, nor is there any reasonable effort they can make to stop or prevent such misuse. There’s no need to go after blog owners or news sites. Most of them get spammed more than you.

Below are the emails I spotted over the past several days. I’m not sure where a few of them came from (if anyone wants to dig deeper, feel free). As of a week ago, several others could be found around the web by searching Google and viewing the Google-cached version of some pages. Headlines can change as a story evolves, which further complicates this research.

News-related subjects from spam emails:
Court papers Dancer cleared one Duke suspect
Filing Duke suspect just watched
Fortune The 100 best companies to work for
Gwynn, Ripken in Hall, McGwire misses MORE
Iranian officials detained in Iraq, U.S. official says
Kennedy threatens Bush Iraq plan
Madonna defends Rosie
Man in hot pants struts in boots, cheers city MORE
Mom charged with stabbing kids
N.J. suspected as source of stench MORE
O’Reilly, Colbert on each’s shows.
Rebel ‘We aided bin Laden escape’
Rice ‘loves’ Fox News; CBS anchor ‘decent guy’
Sen. Johnson’s condition upgraded
Stem-cell funding passes House, faces veto threat
Swank ‘I am in a relationship’
Teacher accused of taking improper photos found dead
U.S. gunships target al Qaeda suspects in Somalia
U.S., Iraqi forces battle insurgents
Witnesses Al Qaeda targeted MORE
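The matching I did by hand could be partly automated. Below is a sketch that normalizes a spam subject and a list of feed titles, then picks the closest title with `difflib.SequenceMatcher` from Python’s standard library. The assumptions are mine: that subjects lose punctuation and sometimes gain a trailing “MORE”, the 0.8 threshold, the colon in the first feed title, and the two made-up entries marked as hypothetical.

```python
import re
from difflib import SequenceMatcher


def normalize(text):
    """Drop a trailing 'MORE' marker, lowercase, strip punctuation, collapse spaces."""
    text = re.sub(r"\s+MORE$", "", text.strip())
    text = re.sub(r"[^a-z0-9 ]+", " ", text.lower())
    return " ".join(text.split())


def best_match(subject, feed_titles, threshold=0.8):
    """Return (title, ratio) for the closest feed title, or None if nothing is close."""
    target = normalize(subject)
    scored = [(SequenceMatcher(None, target, normalize(t)).ratio(), t)
              for t in feed_titles]
    if not scored:
        return None
    ratio, title = max(scored)
    return (title, ratio) if ratio >= threshold else None


FEED_TITLES = [
    "Court papers: Dancer cleared one Duke suspect",  # punctuation is assumed
    "Gwynn, Ripken in Hall, McGwire misses",
    "Stocks rally on earnings",                       # hypothetical filler title
]

for subject in [
    "Court papers Dancer cleared one Duke suspect",
    "Gwynn, Ripken in Hall, McGwire misses MORE",
    "Cheap meds shipped overnight",                   # hypothetical non-news subject
]:
    print(subject, "->", best_match(subject, FEED_TITLES))
```

Fuzzy matching rather than exact comparison is deliberate: since headlines change as a story evolves, a subject may only be close to whatever title the feed carries now.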

Outlook

The potential for this to show up more in the future seems somewhat high. One could rather easily spider some blogging networks for a bunch of random blog RSS feeds and leech the content rather than just the subject. The result would resemble legitimate email even more than a news site’s headlines could.

Will this seriously harm spam filters? I doubt it. It’s not drastically different from previous methods. What’s interesting is that they seem to be tapping a fresh new data source.

It’s hard to say exactly how widespread this is. I’ve gotten at least a dozen in the past few days, all from different sources and even sent to different addresses. Because of how botnets can be used to send spam, it’s somewhat difficult to tell if they come from the same origin.

This may even help in the war on spam. Because they are distributing copyrighted information, perhaps (I’m not a lawyer) this might qualify as copyright infringement. AOL, whose parent company, like CNN’s, is Time Warner, may be interested. Microsoft has MSNBC to look out for. That’s two giant email providers that have sued spammers before, with news networks that have an online presence and may be ripped off for the purpose of spamming.

What’s interesting about the above emails is that most look strikingly similar in terms of actual contents. The titles also share the theme of coming from RSS feeds. The headers indicate different origins, making it likely they were sent using a botnet but have the same master.

Conclusion

Real-time blacklisting may become more of a necessity for filtering to be truly effective in the long run, similar to how phishing is being handled. The danger might not be spam getting through, but legitimate email looking more like the new spam and being caught.

I’d love to see someone like Google or Yahoo do an analysis of spam in comparison to their search indexes. I can only check so many by hand, visually scanning for relevant information. I’m sure with Gmail or Yahoo Mail’s spam, and Google or Yahoo’s index, there could be some real insight. The people at Google have already done some decent work on phishing and malware. I think spam wouldn’t be far off. Using what I could access from Google was very valuable in seeing how spammers are operating. I bet they can see more than I can.

Further Research

I do have a copy of the emails referenced in this post. I am not making them publicly accessible to prevent some immature wanna-be hacker from attacking someone’s PC because their IP address was previously issued to an infected computer. By the time I strip all the headers out, they aren’t really any more useful than what’s already posted here.

4 replies on “Bayesian Spam Filter Poisoning With RSS”

A good post — thanks Robert!

For what it’s worth, the spams containing CNN headlines are
probably sent by a single spammer or spam team. That gives
you an idea of how _few_ “bad guys” there are, and how
much volume each of them is pushing out.

“Bayes poisoning”, in my opinion, does indeed have an effect;
spammers undoubtedly want it to cause their spams to
match nonspam training more closely. However, in our
testing, we found that this doesn’t necessarily happen;
instead, when a user trains on spam and nonspam, future
nonspam mails are biased towards looking like spam to
the filter, i.e. increased false positives.

It appears various tweaks that we use in “real-world” Bayesian-style probabilistic classifier filters, including the algorithms
used in SpamAssassin, SpamBayes and Thunderbird, may protect
against this however.

This tech report has the details:
http://www.cs.dal.ca/research/.....004-06.pdf .
There’s quite a bit of other research if you go through the
http://www.ceas.cc archives too.

Interesting post!

I agree with Justin that it’s probably the same spammer, but when stuff like this starts to work it generally catches on fairly quickly.

With the unfortunate rise in image spam, Bayes is becoming less effective. More and more, the only text in the body of the spam is Bayes busting!
