Bayesian filtering is a great method for fighting spam. Unlike rule-based filtering, which spammers can easily adapt to with simple modifications, Bayesian filtering adapts to the spammers’ changes, making it much more difficult for them to defeat. As a result it’s used in server-side mail filtering as well as client-side filtering, in products including Mozilla Thunderbird, SpamAssassin, and SpamBayes. Despite this level of “intelligence” it’s not foolproof. Like anything that analyzes unsanitized input, it’s vulnerable to poisoning. To be fair, there is some debate over whether poisoning really exists. I personally believe it does.
So What Is This “Poisoning” You Speak Of?
Poisoning refers to spammers putting non-spam words (gibberish, random words, or old texts) into spam. The technique itself is nothing new; it has been used for years to help get around spam filters. This is why some of your spam may contain things like:
Everything you can imagine is real.
What this country needs is a good five cent cigar.
What the eye does not admire the heart does not desire.
Action is coarsened thought thought becomes concrete, obscure, and unconscious.
A man profits more by the sight of an idiot than by the orations of the learned.
The above comes from spam trying to pitch a Canadian pharmacy! Doesn’t sound very medical, does it? That’s the point. They then throw the URL and a quick “buy pills” somewhere in there.
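To make the effect of padding concrete, here is a minimal sketch of how a naive Bayes filter might combine per-word scores. The word probabilities, the `UNKNOWN_PROB` default, and the function name are all invented for illustration; real filters learn these values from training mail and often look only at the most telling words, which blunts this trick.

```python
import math

# Hypothetical per-word spam probabilities, as if learned from training mail.
# The numbers are invented for illustration only.
WORD_SPAM_PROB = {
    "buy": 0.95,
    "pills": 0.99,
    "pharmacy": 0.97,
    "imagine": 0.20,
    "country": 0.25,
    "heart": 0.20,
    "thought": 0.15,
    "learned": 0.20,
}
UNKNOWN_PROB = 0.4  # assumed score for words the filter has never seen


def spam_probability(text):
    """Combine per-word probabilities with the naive Bayes formula."""
    log_spam = 0.0
    log_ham = 0.0
    for word in text.lower().split():
        p = WORD_SPAM_PROB.get(word, UNKNOWN_PROB)
        log_spam += math.log(p)
        log_ham += math.log(1.0 - p)
    # P = prod(p) / (prod(p) + prod(1 - p)), computed in log space
    return 1.0 / (1.0 + math.exp(log_ham - log_spam))


plain = "buy pills pharmacy"
padded = plain + " imagine country heart thought learned"

print(spam_probability(plain))   # the bare pitch scores very high
print(spam_probability(padded))  # the same pitch with filler scores lower
```

Every innocent-looking word pulls the combined score down a little, so enough filler can push a message under whatever threshold the filter uses.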
What’s Now Going On
My theory is that spammers’ new technique is to use RSS feeds as an input source, making spam look more legitimate and keeping the content timely (to avoid filtering). RSS is plentiful and easy to retrieve and parse. As a result it’s possible to have an endless sea of salt to try and get around the filters.
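To show just how little work the parsing takes, here is a small sketch that pulls item titles out of an RSS 2.0 document using only Python’s standard library. The sample feed and the `item_titles` name are made up for the example, and no network access is involved; the same few lines would serve anyone collecting headlines to compare against spam subjects.

```python
import xml.etree.ElementTree as ET

# A made-up RSS 2.0 document standing in for a real news feed.
SAMPLE_RSS = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example News</title>
    <item><title>Kennedy threatens Bush Iraq plan</title></item>
    <item><title>Madonna defends Rosie</title></item>
  </channel>
</rss>"""


def item_titles(rss_text):
    """Return the title of every <item> in an RSS document."""
    root = ET.fromstring(rss_text)
    return [
        item.findtext("title", default="").strip()
        for item in root.iter("item")
    ]


for title in item_titles(SAMPLE_RSS):
    print(title)
```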
Here are a few examples I collected in about 10 minutes of skimming my spam folder, looking only at subject lines for ones that might have come from feeds. Google searches seem to indicate most come from CNN RSS feeds. To find the origin you need to be a little creative and make use of Google’s cache, since an article’s title can change over the life of the article.
I then used Google Reader to display over 1,000 titles from the past week in my “General News” tag, which includes a few, but not all, of their feeds (mainly U.S. and World News). As a side note, this category is somewhat of an antique: I work for a news website, so I don’t read general news via RSS. I get all the news I can tolerate from 9-5. I’m also a feed junkie.
I clearly couldn’t find every one within a window of one week and 3 feeds, but I did find enough to make me wonder. The screenshots are below:
As you can see, many were sent the day after the story appeared in the feed.
I should note this is not the feed owners’ fault in any way, nor is there any reasonable effort they can make to stop or prevent such misuse. No need to go after blog owners or news sites. Most of them get spammed more than you.
Below are the emails I spotted over the past several days. I’m not sure where a few of them came from (if anyone wants to dig deeper, feel free). As of a week ago, several others could be found around the web by searching Google and viewing the Google-cached version of some pages. Headlines can change as a story evolves, which further complicates this research.
News-related subjects from spam emails:
- Court papers Dancer cleared one Duke suspect
- Filing Duke suspect just watched
- Fortune The 100 best companies to work for
- Gwynn, Ripken in Hall, McGwire misses MORE
- Iranian officials detained in Iraq, U.S. official says
- Kennedy threatens Bush Iraq plan
- Madonna defends Rosie
- Man in hot pants struts in boots, cheers city MORE
- Mom charged with stabbing kids
- N.J. suspected as source of stench MORE
- O’Reilly, Colbert on each’s shows.
- Rebel ‘We aided bin Laden escape’
- Rice ‘loves’ Fox News; CBS anchor ‘decent guy’
- Sen. Johnson’s condition upgraded
- Stem-cell funding passes House, faces veto threat
- Swank ‘I am in a relationship’
- Teacher accused of taking improper photos found dead
- U.S. gunships target al Qaeda suspects in Somalia
- U.S., Iraqi forces battle insurgents
- Witnesses Al Qaeda targeted MORE
The potential for this to grow in the future seems fairly high. One could rather easily spider some blogging networks for a bunch of random blog RSS feeds and leech the content rather than just the subject. The results would resemble legitimate email even more than news headlines could.
Will this seriously harm spam filters? I doubt it. It’s not drastically different from previous methods. What’s interesting is that they seem to be tapping a fresh data source.
It’s hard to say exactly how widespread this is. I’ve received at least a dozen in the past few days, all from different sources and even sent to different addresses. Because botnets can be used to send spam, it’s somewhat difficult to tell whether they come from the same origin.
This may even help in the war on spam. Because they are distributing copyrighted information, this might (I’m not a lawyer) qualify as copyright infringement. AOL, whose parent company, like CNN’s, is Time Warner, may be interested. Microsoft has MSNBC to look out for. That’s two giant email providers who have sued spammers before, each with a news network that has an online presence and may be ripped off for the purpose of spamming.
What’s interesting about the above emails is that most look strikingly similar in their actual contents. The subject lines also share the look of RSS feed titles. The headers indicate different origins, which makes it likely they were sent through a botnet but share the same master.
Real-time blacklisting may become more of a necessity for filtering to be truly effective in the long run, similar to how phishing is being handled. The danger might not be spam getting through, but legitimate email looking more like the new spam and being caught.
I’d love to see someone like Google or Yahoo do an analysis of spam in comparison to their search indexes. I can only check so many by hand, visually scanning for relevant information. I’m sure with Gmail or Yahoo Mail’s spam, and Google or Yahoo’s index, there could be some real insight. The people at Google have already done some decent work on phishing and malware. I think spam wouldn’t be far off. Using what I could access from Google was very valuable in seeing how spammers are operating. I bet they can see more than I can.
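On a much smaller scale, the manual comparison described here could be roughly automated. The sketch below is not how the matches in this post were found; it is one possible approach, using fuzzy string matching from Python’s standard library. The function names, the 0.8 threshold, and the colon in the sample headline are assumptions for illustration.

```python
import difflib
import re


def normalize(title):
    """Lowercase, drop a trailing 'MORE' marker, strip punctuation and extra spaces."""
    title = re.sub(r"\s+MORE$", "", title.strip())
    title = re.sub(r"[^a-z0-9 ]+", " ", title.lower())
    return " ".join(title.split())


def best_match(subject, headlines, threshold=0.8):
    """Return (headline, score) for the closest headline, or (None, score) if too weak."""
    target = normalize(subject)
    best, best_score = None, 0.0
    for headline in headlines:
        score = difflib.SequenceMatcher(None, target, normalize(headline)).ratio()
        if score > best_score:
            best, best_score = headline, score
    if best_score >= threshold:
        return best, best_score
    return None, best_score


# Hypothetical feed headlines; the punctuation is a guess at the original form.
headlines = [
    "Court papers: Dancer cleared one Duke suspect",
    "Madonna defends Rosie",
]

print(best_match("Court papers Dancer cleared one Duke suspect", headlines))
print(best_match("Cheap pills today", headlines))
```

Normalizing first matters because the spam subjects tend to lose punctuation along the way, and a fuzzy ratio rather than an exact match leaves some room for headlines that were edited after the spam was sent.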
I do have a copy of the emails referenced in this post. I am not making them publicly accessible, to keep some immature wanna-be hacker from attacking someone’s PC because their IP address was previously issued to an infected computer. By the time I strip all the headers out, they aren’t really any more useful than what’s already posted here.