DSPAM does noise reduction and bi-grams

I’ve tried CRM114 and know it performs very well.
I’m just catching up on my DSPAM reading.

Bayesian Noise Reduction looks really helpful, and reduces the cost of implementing bi-grams (Chained Tokens in DSPAM terminology).
DSPAM's author, Jonathan A. Zdziarski, gives typical storage figures of 0.5MB-1MB for the average user without bi-grams, and 10MB-20MB with them. Disk is cheap.
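
To make the storage growth concrete, here is a minimal Python sketch of what chaining adjacent tokens does to the token set. It is not DSPAM's actual tokenizer: the function name `tokens_with_chains`, the whitespace splitting, and the `+` separator are assumptions made purely for illustration.

```python
from collections import Counter


def tokens_with_chains(text):
    """Return single-word tokens followed by adjacent-pair (bi-gram) tokens."""
    words = text.lower().split()
    # Join each adjacent pair with '+' (an illustrative separator, not DSPAM's).
    chains = [f"{a}+{b}" for a, b in zip(words, words[1:])]
    return words + chains


# Count every token the classifier would have to store for one tiny message.
counts = Counter(tokens_with_chains("buy cheap pills buy cheap watches"))
unigram_count = sum(1 for token in counts if "+" not in token)
chain_count = len(counts) - unigram_count
```

Even on this six-word message the number of distinct tokens doubles once the pairs are added, and on real mail the pair vocabulary grows much faster than the word vocabulary, which is consistent with the larger storage figures quoted for bi-grams.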

Personally I was thinking of experimenting with boosting into longer n-grams as a way of achieving some space and time tradeoffs. I haven’t had time, though.

While I don’t disbelieve the performance numbers,
I do wish for more corpora (larger and more diverse) and standardized performance metrics.

