Archive for the ‘spam’ Category.

Don’t try to make me spam my contacts

High-quality social network sites grow because contacts are real, and site-mediated communication is welcome. For example, LinkedIn from the beginning treated contact information very carefully, never generating any email except by explicit request of a user. Therefore it felt safe to import my contacts into it, since I wasn’t exposing my colleagues to unexpected spam. (LinkedIn has loosened up a bit. Originally you could not even try to connect to someone unless you already knew their email address. They have since made it easier to connect to people found only by search, and you can pay extra to send messages to strangers; nonetheless, in my experience it’s always user-initiated.)

Low-quality social network sites grow by finding ways to extract contacts from people so the system can spam them, or by tricking users into acting as individual spam drones. (Worst-case examples are those worm-like provocative wall postings that, once clicked, cause your friends to seem to post them also. Just up from that on the low rungs are the game sites that post frequent progress updates to all your friends.)

I’m a joiner and early adopter, but I rarely invite people to use a service they’re not already using. That’s my way of treating my contacts respectfully, and protecting my own reputation as a source of wanted communication, not piles of unsolicited invitations.

Google Plus has recently taken a step toward lower quality by changing their ‘Find People’ feature. Previously it identified/suggested Google Plus users separately (good). Now it identifies and suggests everyone on your contact list and beyond, without identifying whether they are already a Google Plus user. Really they are nudging me toward being an invite machine for them.

As a result, Google Plus will get less high-quality social-network building (among people who respect their contacts and take care with their communication), and more low-quality social-network building (piles of invites from people I barely know). If it goes too far downhill, Google will endanger the willingness of high-quality users to let Google know anything about their contacts or touch their email.

MessageLabs versus GMail

MessageLabs mail forwarders have been unwilling to talk to GMail servers since at least Saturday 2008-03-29, with a mix of TCP “connection refused” and SMTP “421 Service Temporarily Unavailable”.

Perhaps it’s related to the flurry of articles about GMail CAPTCHA cracking three weeks ago and the resulting surge of spam.

Whatever the reason, it’s a painful outage.

Followup, Tuesday 2008-04-01 7:48am EDT:

MessageLabs appears to be listening to GMail’s servers again. New messages are flowing. I haven’t yet seen the messages that queued up during the outage.

More detail on the Google CAPTCHA-cracking botnet.

Spam introspection

Georgetown University sends spam and faces the wrath of one of its own students.

I’m also getting a little tired of “call for papers” spam sent by otherwise-legitimate conference organizers to lists of web-harvested email addresses. My most frequent offenders will remain nameless for now, but only because I’m busy.

Just because you’re not a fraudulent criminal enterprise doesn’t mean you’re not a spammer.
It would not be a bad thing if everyone started worrying about CAN-SPAM being enforced against them.

Mail server choices for anti-spam — hijacked or derailed by patents?

Yakov Shafranovich on Sender ID and software patents from Microsoft: Part I,
Part II

Update: Eric Raymond reports being “quoted a promise of a license with no royalties and no requirement to sign an agreement.” That would be helpful if such a license came to pass.

Clever Zombie Tracking by Manipulating DNS Views

Yahoo DomainKeys draft specification

Yahoo publishes its DomainKeys specification.
FAQ at Yahoo! Anti-Spam Resource Center – DomainKeys.

I must say that I share Justin Mason’s distrust and disdain for software patents.
What the heck is patentable among these ideas anyway? They seem like obvious applications of digital signatures and DNS publication.
The most generous interpretation is that these might be defensive patents, and that for all intents, the IETF-required license is good enough.

Is this or SPF likely to take the world by storm?
Either one permits senders to publish records that permit receivers to make some authentication judgments.

Well, deployment by senders is a bit more work (sign those messages) for DK than for SPF. But SPF breaks what has been considered normal forwarding behavior, in a way that the sender has no control over except by saying “put up with it” or by turning off SPF.
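To make the forwarding problem concrete, here is a minimal sketch of a receiver-side check against a sender-published SPF record. It assumes a deliberately tiny subset of SPF (only the ip4/ip6 mechanisms and a trailing "all"), does no DNS lookups, and uses a hypothetical function name, `spf_check`, plus documentation-range addresses. It is an illustration, not a real SPF implementation.

```python
import ipaddress

# Qualifier characters and the result each one maps to.
QUALIFIERS = {"+": "pass", "-": "fail", "~": "softfail", "?": "neutral"}

def spf_check(record: str, client_ip: str) -> str:
    """Evaluate a tiny subset of SPF: ip4/ip6 mechanisms and 'all'.

    Hypothetical helper for illustration; real SPF also has
    a, mx, include, redirect, macros and DNS lookup limits.
    """
    terms = record.split()
    if not terms or terms[0].lower() != "v=spf1":
        return "none"  # no usable SPF policy published
    ip = ipaddress.ip_address(client_ip)
    for term in terms[1:]:
        qualifier = "+"  # default qualifier when none is written
        if term[0] in QUALIFIERS:
            qualifier, term = term[0], term[1:]
        name, _, value = term.partition(":")
        name = name.lower()
        if name in ("ip4", "ip6"):
            # Membership is False when address families differ.
            if ip in ipaddress.ip_network(value, strict=False):
                return QUALIFIERS[qualifier]
        elif name == "all":
            return QUALIFIERS[qualifier]
        # Any other mechanism is skipped in this sketch.
    return "neutral"

# Assumed example policy: only 192.0.2.0/24 may send for the domain.
record = "v=spf1 ip4:192.0.2.0/24 -all"
direct = spf_check(record, "192.0.2.10")       # sender's own server
forwarded = spf_check(record, "198.51.100.7")  # a forwarder's server
```

The second call is the forwarding case: the message is unchanged and legitimately from the sender, but it arrives from the forwarder's address, which the sender never listed, so the policy rejects it.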

Deployment by receivers has no particular downside for either scheme — you’re basically implementing sender-requested filtering, and who can complain about that?

Of course, initially, rather than trying to subvert either scheme, spammers will avoid both. Is it possible that the world will shift so much that just being a non-DK domain will count against the sender? I do think it’s possible. At which point, yes, spammers adopt the technology but subvert it with throwaway domains and proxy zombies with access to signing servers.
In the end you can’t avoid reputation systems:
trusted third parties (some even having good incentives to rate
accurately and respond quickly), blacklists, etc.

Chi-squared evidence combination

More on Gary Robinson’s improved chi-squared evidence combination at Handling Redundancy in Email Token Probabilities
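For readers who have not seen it, the following is a rough sketch of the basic chi-squared (Fisher-style) combining that this work builds on, as I understand it from the open-source filters that use it. It does not include the redundancy handling described in the linked piece. The function names and the clamping constant are my own assumptions.

```python
import math

def chi2q(x2: float, v: int) -> float:
    """P(chi-squared with v degrees of freedom >= x2), for even v."""
    assert v % 2 == 0
    m = x2 / 2.0
    term = total = math.exp(-m)
    for i in range(1, v // 2):
        term *= m / i
        total += term
    return min(total, 1.0)  # guard against rounding slightly above 1

def combine(probs):
    """Combine per-token spam probabilities into one score in [0, 1].

    Near 1 means spammy, near 0 means hammy, and near 0.5 means the
    evidence is weak or conflicting.
    """
    # Clamp so that log() never sees 0 (the constant is an assumption).
    ps = [min(max(p, 1e-9), 1.0 - 1e-9) for p in probs]
    n = len(ps)
    spamminess = 1.0 - chi2q(-2.0 * sum(math.log(1.0 - p) for p in ps), 2 * n)
    hamminess = 1.0 - chi2q(-2.0 * sum(math.log(p) for p in ps), 2 * n)
    return (spamminess - hamminess + 1.0) / 2.0

neutral = combine([0.5] * 5)
spammy = combine([0.99] * 10)
hammy = combine([0.01] * 10)
mixed = combine([0.99] * 5 + [0.01] * 5)
```

The useful property is the middle ground: strong evidence in both directions cancels out toward 0.5 instead of being forced to one extreme, which gives a natural "unsure" band.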

Benchmark the anti-spam industry!

It would be very valuable to have an ongoing head-to-head benchmarking of all the current contenders in the anti-spam industry — not just the learning systems, but the online dynamic systems as well.
Form a consortium, operate a bunch of systems (be a customer of commercial systems).
Use the same simultaneous data stream as input, and capture real-time state from the online dynamic systems (return it to the providers so they can replay what went wrong [or right]).
Publish the performance results.
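As a sketch of what "comparable performance metrics" could mean in the simplest case, the snippet below scores several filters on the same labeled message stream and reports false positive and false negative rates. The filter names, verdicts and labels are made-up placeholders, not measurements of any real product.

```python
from dataclasses import dataclass

@dataclass
class Score:
    false_positive_rate: float  # ham wrongly flagged as spam
    false_negative_rate: float  # spam wrongly delivered as ham

def score(verdicts, truth):
    """Compare one filter's verdicts to ground truth (True means spam)."""
    pairs = list(zip(verdicts, truth))
    fp = sum(1 for v, t in pairs if v and not t)
    fn = sum(1 for v, t in pairs if t and not v)
    ham = sum(1 for t in truth if not t)
    spam = sum(1 for t in truth if t)
    return Score(fp / ham if ham else 0.0, fn / spam if spam else 0.0)

# Hypothetical shared stream: the same four messages go to every system.
truth = [True, True, False, False]
verdicts_by_filter = {
    "filter_a": [True, False, False, False],  # cautious: misses some spam
    "filter_b": [True, True, True, False],    # aggressive: flags some ham
}
results = {name: score(v, truth) for name, v in verdicts_by_filter.items()}
```

Reporting the two rates separately matters because a lost legitimate message usually costs far more than a delivered spam, so a single accuracy number would hide the tradeoff each vendor has chosen.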

It would generate good data for more research, and really useful comparable performance metrics.
I’m not sure if that would be seen as a good thing or a bad thing by
the commercial services. Laggards in the horserace might prefer less measurement.
Actually, what I think it would show is that most systems are “almost good enough,”
that all systems will soon be “good enough,” that
there’s little excuse not to deploy something,
but there’s plenty of space for distinction based on features such as administration, tunability, interface, integration.
But one would hope that performance metrics would drive the industry forward.

A SPEC effort for real-time and offline anti-spam systems!
Is anyone else inspired by the idea of a non-biased testing/evaluation consortium?

DSPAM does noise reduction and bi-grams

I’ve tried CRM114 and know it performs very well.
I’m just catching up on my DSPAM reading.

Bayesian Noise Reduction looks really helpful, and reduces the cost of implementing bi-grams (Chained Tokens in DSPAM terminology).
Author Jonathan A. Zdziarski gives typical storage figures of 0.5MB-1MB for the average user without bigrams, and 10MB-20MB with. Disk is cheap.
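To show where the extra storage comes from, here is a small sketch of bi-gram tokenization in the spirit of chained tokens. The tokenizer regex, the `+` joiner and the function name are assumptions for illustration and are not taken from DSPAM's source.

```python
import re

def tokenize(text: str, chained: bool = True):
    """Return single-word tokens, plus adjacent-pair tokens if chained."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    tokens = list(words)
    if chained:
        # Each adjacent pair of words becomes one extra token.
        tokens += [a + "+" + b for a, b in zip(words, words[1:])]
    return tokens

plain = tokenize("Buy cheap pills", chained=False)
with_pairs = tokenize("Buy cheap pills")
```

A single message only roughly doubles its token count, but the number of distinct pairs seen over time grows much faster than the number of distinct words, which fits the larger per-user figures quoted above.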

Personally I was thinking of experimenting with boosting into longer n-grams as a way of achieving some space and time tradeoffs. I haven’t had time, though.

While I don’t disbelieve the performance numbers,
I do wish for more corpora (larger and more diverse) and standardized performance metrics.

Forging S/MIME signatures

Jon Udell tries his hand at S/MIME signature forgery,
revealing that PKI is not a panacea.

A digital signature proves something. The proof is strong but the something is weak (if it just demonstrates that you clicked a few things to get a persona certificate).

So if you need to prove something stronger, then you put limits on what digitally-signed content you’re willing to accept.
This can go in at least two directions (not mutually exclusive):

  • higher-class certificates (where certificate authorities demand more proof, and encode that fact in the certificate). But higher quality means harder to get and less actual deployment. And higher quality means more attractive target for theft of keys.
  • reputation systems. Of course, building robust reputation systems is not easy. Users may wish to have multiple sources of reputation information to fit their own definitions of good and bad behavior and how fast those judgments are made. It replays the whole DNS blacklist deployment. Some reputation systems may seem arbitrary and capricious. Others may be too slow or too tolerant. They are all lawsuit targets. Will there be too many to choose from?

For message classification, there is a predisposition to disparage machine learning and content inspection as too
probabilistic and uncertain, while viewing signatures as certain and reliable. It is not so: the uncertainty (or the need for trust) is not eliminated, it’s just moved to a different level.