Ticket #358 (closed enhancement: invalid)
Add OCR support
| Reported by: | rjl | Owned by: | rjl |
|---|---|---|---|
| Priority: | normal | Milestone: | 1.1.0 |
| Component: | amavisd-maia | Version: | 1.0.1 |
| Severity: | normal | Keywords: | OCR image |
| Cc: |
Description
As we face an increasing tide of image spam, OCR is becoming a more useful tool in our arsenal. A plugin for SpamAssassin uses OCR tools to score images that contain text, but to subject the OCR-extracted text to the full battery of SpamAssassin tests, the text needs to be extracted before the mail is handed to SpamAssassin in the first place. Amavisd-maia should do this as part of its unpacking and decoding duties.
When the text is extracted, it should be appended to the body of the mail for the purpose of submitting the whole thing to SpamAssassin. That way SpamAssassin doesn't need to know anything about OCR, and no plugins are needed. The OCR-extracted text is then treated like any other text in the body of the mail, so its URLs can be tested against SURBL, and the words and other tokens can be tested against the regular expression rules and the Bayes database.
Note that the OCR-extracted text should only be appended to the body for the purpose of the SpamAssassin scan. It should not become part of the actual mail contents stored in the database or reported to hashing systems; the pristine original should be used for storage and for hash reporting.
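A minimal sketch of the idea, written in Python purely for illustration (amavisd-maia itself is Perl, and this is not its code). The function name `build_scan_copy` and the `ocr` callable are hypothetical: `ocr` stands in for whatever OCR tool would be wired in, taking image bytes and returning text. The sketch parses the raw mail, runs OCR over every image part, and appends the extracted text as an extra text/plain part on a scan-only copy. The original bytes are never modified, so they can still be stored and hashed as-is.

```python
from email import message_from_bytes, policy
from email.message import EmailMessage
from typing import Callable


def build_scan_copy(raw: bytes, ocr: Callable[[bytes], str]) -> bytes:
    """Return a scan-only copy of `raw` with OCR text appended.

    `ocr` is a hypothetical hook: image bytes in, extracted text out.
    `raw` itself is left untouched for storage and hash reporting.
    """
    msg = message_from_bytes(raw, policy=policy.default)

    # Collect OCR output from every image part before changing anything.
    extracted = []
    for part in msg.walk():
        if part.get_content_maintype() == "image":
            payload = part.get_payload(decode=True)
            if payload:
                text = ocr(payload).strip()
                if text:
                    extracted.append(text)

    # No images, or no readable text: scan the original unchanged.
    if not extracted:
        return raw

    # Ensure there is a multipart/mixed container to append to.
    if msg.get_content_type() != "multipart/mixed":
        msg.make_mixed()

    # Append the OCR text as an ordinary text part, so the scanner
    # treats it like any other body text (URLs, rules, Bayes tokens).
    ocr_part = EmailMessage()
    ocr_part.set_content("\n\n".join(extracted))
    msg.attach(ocr_part)

    return msg.as_bytes()
```

The caller would hand the returned bytes to SpamAssassin and keep `raw` for everything else. Returning a separate byte string, rather than editing the message in place, is what keeps the scan copy and the pristine original from being confused.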

