Optical Character Recognition
Dec, 2006
One of the most common questions I get asked is, "Can you scan these pages into a text document for me?" When I tell people that Optical Character Recognition (OCR) is often very inaccurate, they look surprised. Even with a good scanner and software, the output still must be proofread and reformatted. Yet they still want to try it and see how much typing time it can save them.
Since people still seem to want to try, I ran some tests to find ways to improve the accuracy of OCR scans. Just by changing some basic scanner settings, I was able to greatly affect the accuracy of the OCR output. The tests found at best 88% accuracy and at worst less than 50%.
Test Sample
The test sample, a letter appearing in the Linux Journal, was scanned and timed using different settings. Below is an image of the text scanned at 150 DPI.
The correct text reads:
Xoops! Not a Security Hole In response to Cameron Spitzer's letter [June 2006, "Xoops! A Security Hole"], Juan Marcelo Rodriguez may have forgotten to mention that Xoops needs the permissions of the uploads, cache and templates_c directories set to 777 just for the install. You can change them back! When you log in to the application for the first time, if they are still set to 777, the application will ask you to change them.
Test Configuration
Test images were scanned using a Mustek 1200 USB scanner. A bash script then called the SANE program scanimage to scan each image to a PNM-formatted file. Below is a snippet of the scanning command used.
    RESULT=$( \
      ( time scanimage --mode $mode -l 90 -t 47.5 -x 68 -y 34.5 \
        --resolution $res > images/image-$mode-$res.pnm ) \
      2>&1 )
The result was then formatted and sent to the results file along with the time output. An additional 10-second sleep was added to give the scanner time to return to its home position.
    Scanning in [Color] at [1200] dpi...
    real 0m31.312s
    user 0m0.163s
    sys  0m0.042s
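The full script is not reproduced here, but a loop around that command might look roughly like the sketch below. It is a reconstruction under assumptions, not the original script: `SCAN_CMD`, `SCAN_PAUSE`, `MODES` and `scan_all` are hypothetical names, the timing wrapper is left out, and the mode strings a scanner accepts depend on its SANE backend (check `scanimage --help` for your device).

```shell
#!/bin/sh
# Sketch of a driver loop around the scanimage call (not the original script).
# SCAN_CMD, SCAN_PAUSE and MODES are hypothetical knobs added for illustration.
SCAN_CMD=${SCAN_CMD:-scanimage}
MODES=${MODES:-"Color Gray Lineart"}   # assumed mode names; backend-dependent

scan_all() {
    # $1 = output directory for the PNM files
    outdir=$1
    mkdir -p "$outdir" || return 1
    for mode in $MODES; do
        for res in 150 300 600 1200; do
            echo "Scanning in [$mode] at [$res] dpi..."
            # Same window and options as the command in the article.
            "$SCAN_CMD" --mode "$mode" -l 90 -t 47.5 -x 68 -y 34.5 \
                --resolution "$res" > "$outdir/image-$mode-$res.pnm" || return 1
            # Give the scan head time to return to its home position.
            sleep "${SCAN_PAUSE:-10}"
        done
    done
}

# Usage (needs a working scanner):
#   scan_all images
```

Pointing `SCAN_CMD` at a stub makes it possible to try the loop without a scanner attached.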
For Gentoo Users, only a few quick steps are needed to get a USB scanner working.
- Set USE="usb" (and gimp if you like)
- emerge sane-backends xsane gocr
- Add your user account to the scanner group and re-login
- Plug it in and enjoy
Scanning Results
This table shows the results and file sizes of the scanning tests.
| Mode | DPI | Time (seconds) | PNM filesize |
|---|---|---|---|
| Color | 150 | 10.576 | 239k |
| Color | 300 | 10.773 | 958k |
| Color | 600 | 19.637 | 3.8M |
| Color | 1200 | 31.312 | 15M |
| Grayscale | 150 | 9.728 | 80k |
| Grayscale | 300 | 9.813 | 320k |
| Grayscale | 600 | 13.060 | 1.3M |
| Grayscale | 1200 | 13.003 | 5.0M |
| Lineart | 150 | 9.714 | 10k |
| Lineart | 300 | 9.790 | 40k |
| Lineart | 600 | 13.047 | 160k |
| Lineart | 1200 | 13.001 | 638k |
It's no surprise that the Color scan at 1200 DPI took the longest at 31 seconds, nor that Lineart at 150 DPI was the shortest at just under 10 seconds. Grayscale times are only slightly slower than Lineart.
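The file sizes in the table follow fairly directly from the scan window, the resolution and the bit depth. The sketch below estimates the raw pixel payload, assuming the `-x 68 -y 34.5` window is in millimetres, that Color is 24 bits per pixel, Grayscale 8 and Lineart 1, and that the PNM header is negligible. The function name `estimate_bytes` is made up for illustration, and a real scanner may round the pixel dimensions differently.

```shell
#!/bin/sh
# Estimate the raw PNM payload for the 68 mm x 34.5 mm scan window.
# Assumptions: window units are millimetres; header size is ignored.
estimate_bytes() {
    # $1 = bits per pixel (24 color, 8 grayscale, 1 lineart), $2 = DPI
    awk -v bpp="$1" -v dpi="$2" 'BEGIN {
        w = int(68 / 25.4 * dpi)        # width in pixels
        h = int(34.5 / 25.4 * dpi)      # height in pixels
        row = int((w * bpp + 7) / 8)    # bytes per row, rounded up to a whole byte
        printf "%d\n", row * h
    }'
}

for dpi in 150 300 600 1200; do
    printf '%4s DPI: color %s, grayscale %s, lineart %s bytes\n' "$dpi" \
        "$(estimate_bytes 24 "$dpi")" \
        "$(estimate_bytes 8 "$dpi")" \
        "$(estimate_bytes 1 "$dpi")"
done
```

Doubling the DPI roughly quadruples the pixel count, which matches the fourfold jump between rows of the same mode in the table.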
OCR Tests
The next phase of the test was conducted using GOCR to convert the images to text. The resulting text files were then compared against the correct text using kdiff3, and the errors were counted.
| Mode | DPI | Time (seconds) | Errors |
|---|---|---|---|
| Color | 150 | 0.097 | 18 |
| Color | 300 | 0.241 | 10 |
| Color | 600 | 0.798 | 13 |
| Color | 1200 | 3.421 | 24 |
| Grayscale | 150 | 0.154 | 17 |
| Grayscale | 300 | 0.231 | 6 |
| Grayscale | 600 | 0.706 | 12 |
| Grayscale | 1200 | 3.145 | 20 |
| Lineart | 150 | 0.140 | 28+ |
| Lineart | 300 | 0.243 | 11 |
| Lineart | 600 | 0.767 | 10 |
| Lineart | 1200 | 3.341 | 18 |
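The GOCR conversion step can be sketched as a simple loop over the scanned files. This is an illustration rather than the script actually used: `OCR_CMD` and `ocr_all` are hypothetical names, and only gocr's `-i` (input file) and `-o` (output file) options are used.

```shell
#!/bin/sh
# Sketch: run GOCR over every PNM file in a directory.
# OCR_CMD is a hypothetical override; by default the real gocr binary is used.
OCR_CMD=${OCR_CMD:-gocr}

ocr_all() {
    # $1 = directory holding the .pnm scans; .txt files are written alongside
    for img in "$1"/*.pnm; do
        [ -e "$img" ] || continue      # nothing matched the glob
        "$OCR_CMD" -i "$img" -o "${img%.pnm}.txt" || return 1
    done
}

# Usage (needs gocr installed):
#   ocr_all images
```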
Somewhat surprisingly, Lineart at 150 DPI had the worst error count, with over 28 errors. As you can see from the text below, it's almost completely unreadable.
xoops !. _ot a 5ecurity Hole I n res p o n se t o C a M e ro n 5 p i tze r 's l ett e r I J u n e 2 o o 6 , __ Xoo p s !, A 5 e c u r i ty H o l e '' j , J u a n M a rce I o R o d r i g u ez m ay h av e f o r g ot(e n to m e n ti o n th a t xoops nmds the permlssions of the uploads, cache and tem plates c d i reLtori es set to 777 j ust fo r the jnstall You ca_ change them back !. When you log in to the application for the fInt ti me, if they are still set to 777, (he application will ask you to change them.
In contrast, the highest-scoring result, with only 9 errors (12% errors), was Grayscale at 300 DPI. It was almost perfect. The scan was done intentionally at a very slight angle to simulate human error. Use of a document feeder or a careful user should only increase accuracy.
Xoops!. _ot a Security Hole In response to Cameron Spitzer's letter [June 2006, ''Xoops!. A Security Hole''I, Juan Marceto Rodriguez may have forgotten to mention that Xoops needs the permissions of the uploads, cache and templates_c dire_ories set to 777 just for the install. You can change them back!. When you log in to the application for the first time, if they are still set to 777, the application will ask you to change them.
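The errors in these tests were counted by hand in kdiff3. For anyone who wants to automate a rough equivalent, the sketch below counts reference words that do not appear unchanged in the OCR output. Word-level counting is an assumption here, though it is consistent with the figures above: the sample letter has 75 whitespace-separated words, and 9 errors out of 75 is 12%. The function name `word_errors` is made up, and the count may not match a manual tally, since words the OCR inserts are ignored.

```shell
#!/bin/sh
# Rough word-level error count between a reference text and an OCR result.
word_errors() {
    # $1 = reference text file, $2 = OCR output file
    ref=$(mktemp) || return 1
    ocr=$(mktemp) || return 1
    # Put one word per line so diff compares word by word.
    tr -s '[:space:]' '\n' < "$1" > "$ref"
    tr -s '[:space:]' '\n' < "$2" > "$ocr"
    total=$(grep -c . "$ref" || true)
    # Lines starting with "<" are reference words changed or missing in the OCR text.
    errors=$(diff "$ref" "$ocr" | grep -c '^<' || true)
    rm -f "$ref" "$ocr"
    awk -v e="$errors" -v t="$total" 'BEGIN {
        printf "%d errors in %d words (%.0f%% error rate)\n", e, t, 100 * e / t
    }'
}

# Usage:
#   word_errors correct.txt image-Gray-300.txt
```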
Conclusion
These tests show that the best overall results can be achieved using Grayscale at 300 DPI. This setting not only had the highest accuracy, but also some of the lowest processing times and some of the smaller file sizes out of all the tests. While these results were for my old Mustek 1200 USB scanner, they should be very similar for others as well.







