Bayesian Spam Filter Poisoning With RSS


Bayesian filtering is a great method for fighting spam. Unlike rule-based filtering, which spammers can easily adapt to with simple modifications, Bayesian filtering adapts to the spammers’ changes, making it much more difficult for them to defeat. As a result it’s used in server-side mail filtering as well as client-side filtering in various products including Mozilla Thunderbird, SpamAssassin, and SpamBayes. Despite this level of “intelligence” it’s not foolproof. Like anything that analyzes unsanitized input, it’s vulnerable to poisoning. To be fair, there is a debate over whether poisoning actually exists. I personally believe it does.


eBay and banks need to implement SPF and DomainKeys

eBay and banks really need to implement SPF (Sender Policy Framework) and DomainKeys. There, I said it.

I see quite a few phishing attacks every day, and just about none of them are caught by SpamAssassin. Technically they aren’t spam, so that does make sense. But what bothers me is that this is easy to mitigate for many potential victims. If eBay and banks supported SPF and DomainKeys, it would be much easier for a filter to tell whether a message is legitimate or not. Check out this sample SpamAssassin header from an eBay phishing email I received:

X-Spam-Level: **
X-Spam-Status: No, score=3.0 required=5.0 tests=BAYES_50,HTML_IMAGE_ONLY_28,
	MIME_HTML_ONLY autolearn=no version=3.1.0

That’s really not much in this otherwise pretty bad email. The IP of origin isn’t even in North America (it’s Pacific Rim).
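To make it easier to see which tests fired on a message like this, here is a minimal Python sketch that pulls the score, threshold, and test names out of an X-Spam-Status value. The function name parse_spam_status is made up for illustration, and the parsing assumes the header looks like the sample above (a Yes/No verdict followed by key=value fields); other SpamAssassin configurations may format it differently.

```python
def parse_spam_status(header_value):
    """Split an X-Spam-Status value into verdict, score, threshold and tests."""
    # Undo header folding: collapse all whitespace runs (including the
    # newline + tab continuation) into single spaces.
    unfolded = " ".join(header_value.split())
    # The test list may have been folded after a comma; glue it back together.
    unfolded = unfolded.replace(", ", ",")

    verdict, _, rest = unfolded.partition(",")
    fields = dict(item.split("=", 1) for item in rest.split() if "=" in item)

    return {
        "is_spam": verdict.strip().lower() == "yes",
        "score": float(fields["score"]),
        "required": float(fields["required"]),
        "tests": [t for t in fields.get("tests", "").split(",") if t],
    }


sample = (
    "No, score=3.0 required=5.0 tests=BAYES_50,HTML_IMAGE_ONLY_28,\n"
    "\tMIME_HTML_ONLY autolearn=no version=3.1.0"
)
status = parse_spam_status(sample)
print(status["score"], status["required"], status["tests"])
```

Run against the sample header, this shows only three fairly generic tests firing, none of which looks at whether the mail really came from eBay.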

Perhaps it’s time to start a campaign to urge institutions subject to having their name used in these attacks to start using a method like SPF and DomainKeys. A mail provider could then throw out emails that don’t match. Anyone know why they still don’t implement one or both of these methods?

It seems to me they could easily take a giant step to solve the problem. I know Google’s Gmail knows about SPF, and Yahoo knows about DomainKeys. That’s two major email providers right there.

Is phishing the new spam?

I’m almost convinced now that the majority of stuff SpamAssassin misses isn’t really spam, but phishing messages. I think it’s time for SpamAssassin to start considering detecting it. Perhaps take a look at mscott’s good work for Mozilla Thunderbird.

Odds are a lot of that detection work will also catch spam that slips through by other means.

Microsoft pushing Sender ID?

OK, just when I was starting to think that Microsoft might be changing its ways and trying to act in good faith, after fixing its website the other day, Microsoft starts talking about pushing its Sender ID stuff on us. Sender ID is Microsoft’s alternative to other spam prevention techniques such as Yahoo’s DomainKeys. One problem with Sender ID is the licensing, which has caused organizations like the Apache Software Foundation (which oversees the SpamAssassin project) to nix support for Sender ID. AOL has also dropped support and looked towards SPF.

I agree that one of these standards is needed to help prevent spam. Personally I think DomainKeys is the most promising of them all. Its licensing looks like it will be adequate, and it has a fair amount of backing. Google’s Gmail has apparently implemented SPF and DomainKeys at this time. I think it’s time for everyone to start following their lead. These two technologies look to be the best, and by implementing them, your mail is more likely to get past spam filters. Microsoft is right, it’s time to start acting. But not with their own proprietary stuff.

Spammer Spot Checking

It’s pretty well known by now that a rather large amount of spam comes through regular ISPs. There is a rather large debate on how to get rid of it. Some ISPs just ignore it. Some block port 25. But is there a better way?

I’m going to propose the following:

  • A random 1 out of every 100 emails sent through an ISP’s servers, or via port 25 (for ISPs that allow third-party mail servers), gets checked by a spam filter (such as SpamAssassin).
  • If one of a user’s emails gets flagged, the user enters a “gray list”, in which their emails are checked more often (1 out of 25) for the next several days.
  • If more than 10% of those get flagged (a rather large margin for today’s spam filters), the account should be suspended and investigated by the ISP before being re-enabled.
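The three steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a finished implementation: the SpotChecker class and its method names are hypothetical, the is_spam callable stands in for a real filter such as SpamAssassin, the minimum of 10 gray-list samples before suspending is my own addition (so that a single flagged message can’t count as 100%), and the “several days” expiry of the gray list is left out.

```python
import random
from collections import defaultdict

NORMAL_RATE = 1 / 100    # spot-check rate for ordinary accounts
GRAY_RATE = 1 / 25       # spot-check rate for gray-listed accounts
SUSPEND_FRACTION = 0.10  # suspend when more than 10% of gray-list scans are spam
MIN_GRAY_SAMPLES = 10    # assumption: wait for this many scans before judging


class SpotChecker:
    def __init__(self, is_spam, rng=None):
        self.is_spam = is_spam  # callable(message) -> bool, wraps a spam filter
        self.rng = rng if rng is not None else random.Random()
        self.gray = set()
        self.suspended = set()
        self.scanned = defaultdict(int)  # gray-list scans per account
        self.flagged = defaultdict(int)  # gray-list scans flagged as spam

    def on_outgoing(self, account, message):
        """Return True if the message may be sent, False if the account is suspended."""
        if account in self.suspended:
            return False
        rate = GRAY_RATE if account in self.gray else NORMAL_RATE
        if self.rng.random() < rate:
            self._scan(account, message)
        return True  # the sampled message itself is still delivered

    def _scan(self, account, message):
        hit = self.is_spam(message)
        if account not in self.gray:
            if hit:
                self.gray.add(account)  # first catch: start watching more closely
            return
        self.scanned[account] += 1
        self.flagged[account] += int(hit)
        if (self.scanned[account] >= MIN_GRAY_SAMPLES
                and self.flagged[account] / self.scanned[account] > SUSPEND_FRACTION):
            self.suspended.add(account)  # hand over to a human for investigation
```

An ISP would call on_outgoing for each message as it is accepted for delivery. A real deployment would also need to persist the counters and expire gray-list entries after a few clean days.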

The vast majority of the above can be automated. But how would this cut down on spam?


The vast majority of users send fewer than 100 emails a day, so the extra CPU required for each legitimate user an ISP has would be relatively minimal (only 1/100 of outgoing email would be scanned). Odds are a typical user will have 1 email scanned roughly every 5-7 days (assuming they send between 15 and 20 emails a day). A spammer, or a computer infected with a Trojan, will be sending large amounts of spam (perhaps hundreds of messages an hour), so it is rather likely to have one fall into the group tested by the spam filter. Once the account is on the gray list, it will quickly become obvious whether it was a fluke (emailing a spouse about Viagra) or a spammer. Spammers need to send mail in bulk to be profitable, since not many people who get it actually click and buy something.
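The arithmetic behind that comparison is simple enough to check with a short Python sketch. The function names are made up for illustration, and 200 emails an hour is just a hypothetical stand-in for “hundreds an hour”.

```python
SAMPLE_RATE = 1 / 100  # the 1-in-100 spot check proposed above


def expected_scans_per_day(emails_per_day, rate=SAMPLE_RATE):
    """Average number of an account's emails that get scanned per day."""
    return emails_per_day * rate


def expected_days_between_scans(emails_per_day, rate=SAMPLE_RATE):
    """Average wait, in days, between two scans of the same account."""
    return 1 / expected_scans_per_day(emails_per_day, rate)


# 200 emails an hour is a hypothetical figure for an infected machine.
senders = [
    ("15 emails/day", 15),
    ("20 emails/day", 20),
    ("200 emails/hour", 200 * 24),
]
for label, per_day in senders:
    print(f"{label}: {expected_scans_per_day(per_day):.2f} scans/day, "
          f"one scan every {expected_days_between_scans(per_day):.2f} days")
```

A normal user gets looked at less than once a week on average, while a machine pumping out bulk mail gets sampled dozens of times a day.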

Why would an ISP want to bother?

A spammer can not only put a large burden on a mail server (read: cost), but also cause an ISP to be blacklisted. This is a negative thing for any ISP because it reduces the quality of service for legitimate users, and could cause customers to feel they can get better service elsewhere. The best way to avoid being blacklisted is to keep your mail servers clean.

Wouldn’t this violate privacy policies?

Not likely. Many ISPs already scan incoming email for spam and viruses. This is simply applying the same thing in reverse. There are likely no additional privacy concerns in doing it this way.

Couldn’t this prevent many virus outbreaks?

Yes, it could be done to prevent viruses, simply by doing the above with a virus scanner.

Could this be done without a “gray list” to make it easier to implement?

Yes, in theory it could. You can just flag an account so an admin is aware, or suspend it right away. Suspending right away (on 1 catch) may cause more false positives than you would want, so I’d advise against it. I’d opt for flagging the account or perhaps notifying an admin by email. If someone is a real spammer, they will be part of the random sampling a dozen or so times rather quickly, so it will be rather obvious. A “gray list” is more programming, but it makes the system more automatic and tolerant, providing a better experience for end users with less work for admins in the long run.

Where did 1 out of 100 come from?

It’s somewhat arbitrary, but should prove effective. I’m sure some analysis could come up with an even better number. The goal is to prevent spam with minimal CPU. Odds are a spammer won’t send 1 email a day. They will send in volume (since the more they send, the higher the chances a consumer will bite). Hopefully, more often than not, 1 will fall into the filter. You can cut the interval in half (1 out of 50) to roughly double your chances, at the expense of system resources.
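For anyone who wants to play with the number, the chance that at least one message out of a batch gets sampled is 1 - (1 - rate)^n, assuming each message is picked independently at random. The Python sketch below (the function name is hypothetical) compares the 1-in-100 and 1-in-50 rates.

```python
def chance_of_at_least_one_scan(messages, rate):
    """Probability that at least one of `messages` emails is picked for scanning,
    assuming each one is sampled independently with probability `rate`."""
    return 1 - (1 - rate) ** messages


for messages in (10, 50, 100, 500):
    at_100 = chance_of_at_least_one_scan(messages, 1 / 100)
    at_50 = chance_of_at_least_one_scan(messages, 1 / 50)
    print(f"{messages:>4} messages: {at_100:.1%} at 1:100, {at_50:.1%} at 1:50")
```

At 1 in 100, a run of 100 messages has about a 63% chance of being sampled at least once; at 1 in 50 that rises to about 87%. The doubling only holds for small batches, since the probability can never go past 100%.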

Wouldn’t this just make email slower?

Not really. You can send the email before you scan it, so this doesn’t slow outbound email. It’s just taking a random sample at an interval and reacting based on the analysis. Even if the filter goes off, the mail should be sent (it could be a false positive). Only when the user is flagged as a spammer should the account be unable to send email. This results in minimal disruption of service. For a spammer this should happen relatively quickly. The load from scanning 1% of outgoing email shouldn’t be too substantial. Assuming you keep an eye on your mail server anyway, this should only speed up the detection of a spammer using it. If you go to a 1:50 ratio of scanning, you’ll only improve your odds and speed in catching spammers.

Has anyone implemented this? Is there a tutorial?

To the best of my knowledge, nobody has done this yet, at least based on my theories. If you have done this, and would like to contribute some code, information, wisdom, or just mention who did it, let me know.

Why not just scan all outgoing email?

It’s just not practical for performance/resource reasons. Nor is it really necessary, since spammers need to send in bulk.

Couldn’t spammers work around this?

Well, they can space out when they send mail, say in batches of 50, but each message still has a 1:100 chance of being scanned. They could send less, but that would be costly. They need to send in bulk so they can get as many eyes looking at their offers as possible, so for them, just sending less isn’t good business. This would hit them where it hurts, by making their business model ineffective. If they can’t send the mail, they can’t profit.

Doesn’t this protect others, rather than myself?

Yes, and no. We are a community, and communities do look out for each other. If everyone did this, the load on incoming mail servers would be substantially less. As said before, by catching your own spammers, you prevent being blacklisted by the many blacklists out there. That has a direct benefit to your business.

What about bounced email?

Those should be scanned as well, simply because a spammer can bounce their spam off of your mail servers to get around blacklists. If I send your server an undeliverable email with a spoofed “From:” header, it will likely “bounce” that email to my real target (whose address I put in the “From:” header), quoting the message (my spam). By scanning these as well (1 out of 100), you can effectively cut down on this abuse by leeching spammers.

The bottom line

By using the above method of scanning outgoing email, you can effectively prevent spammers from profiting off of your mail servers. Spammers need to send in bulk. The more they send, the easier it will be to catch them. This is an easy way for an ISP, webhost or mail provider to cripple the spammers’ business without harming legitimate email users.

Bug 117532 RIP

Bug 117532. It’s been one long journey, but it’s finally over. And just in time, as 1.7b is extremely close. That was a lot for a UI pref, but at least the wording is good. It’s worth it for good wordage.

Bug 235086 is also wrapping up tonight.

So I wasn’t totally useless tonight. I also got Bender (file server) somewhat up and running at this point. Should have it close to 100% tomorrow sometime. Still todo:

MRTG (and all associated with it)
Some Perl Modules
FTP configuration
some Apache configuration
virus scan
AIM Bot for monitoring status
… [list goes on]….

But hey… it’s working now. File services have resumed.

SpamAssassin and xbl blacklist

There’s a new blacklist in town.

Patch for SpamAssassin bug 2889:

RCS file: /cvsroot/spamassassin/spamassassin/rules/,v
retrieving revision 1.38
diff -r1.38
> # XBL is the Spamhaus Block List:
> header RCVD_IN_SBL		eval:check_rbl_txt('xbl', '')
> describe RCVD_IN_SBL		Received via an exploit in Spamhaus Block List
> tflags RCVD_IN_SBL		net

Go Me! Simple enhancement, should provide better spam filtering for all.


Finally got Net::DNS to compile and install on XP with ActivePerl 5.8. Yea for me.

So I’m experimenting with DNSBL. Neat feature, seems to be working well. Still figuring out which lists are best (without slowing down mail to a halt).
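For anyone curious how a DNSBL check works under the hood: the client reverses the octets of the sending IP, appends the list’s zone, and does an ordinary DNS lookup. Here is a rough Python sketch rather than the Perl that SpamAssassin actually uses. The function names are made up, only IPv4 is handled, and the rule for ignoring 127.255.255.x answers reflects my understanding of how Spamhaus signals query errors, so treat that part as an assumption.

```python
import ipaddress
import socket


def dnsbl_query_name(ip, zone):
    """Build the DNS name to look up, e.g. 192.0.2.99 -> 99.2.0.192.<zone>."""
    octets = str(ipaddress.IPv4Address(ip)).split(".")
    return ".".join(reversed(octets)) + "." + zone


def is_listed(ip, zone="xbl.spamhaus.org"):
    """Return True if the DNSBL answers with a listing code for this IP."""
    try:
        answer = socket.gethostbyname(dnsbl_query_name(ip, zone))
    except OSError:
        return False  # no such name: the IP is not on the list
    # Listings come back as 127.0.0.x codes; 127.255.255.x is assumed to be
    # an error response (for example, a blocked resolver), not a listing.
    return answer.startswith("127.") and not answer.startswith("127.255.255.")


print(dnsbl_query_name("192.0.2.99", "xbl.spamhaus.org"))
```

Each list checked means another DNS round trip per message, which is why piling on too many lists slows mail down.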

Still can’t get sa-learn working properly. But that’s for tomorrow.

Less spam getting through. Much more than 95% effective now.

SpamAssassin for Mac OS X

I will get around to packaging SpamAssassin for Mac OS X one of these days. I got stuck a while back with some other issues with it, but I think they have been smoothed out now. So expect to see something soon. Note I won’t specify a date, but somewhere between now, and (before) hell freezing over.

Should be good, a nice powerful spam filter, free for OS X users. What’s not to like?