Optical Character Recognition

Dec, 2006

One of the most common questions I get asked is, "Can you scan these pages into a text document for me?" When I tell people that Optical Character Recognition (OCR) is often very inaccurate, they look surprised. Even with a good scanner and good software, the output must still be proofread and reformatted. Yet they still want to try it and see how much typing time it can save them.

Since people still want to try, I ran some tests to see how to improve the accuracy of OCR scans. Just by changing some basic scanner settings, I was able to greatly affect the accuracy of the OCR output. The best result was 88% accuracy; the worst was less than 50%.

Test Sample

The test sample, a letter appearing in the Linux Journal, was scanned and timed using different settings. Below is an image of the text scanned at 150 DPI.

[Image: scan of text]

The correct text reads:

Xoops! Not a Security Hole
In response to Cameron Spitzer's letter [June
2006, "Xoops! A Security Hole"], Juan Marcelo
Rodriguez may have forgotten to mention that
Xoops needs the permissions of the uploads, cache
and templates_c directories set to 777 just for the
install. You can change them back! When you log in
to the application for the first time, if they are still set
to 777, the application will ask you to change them.

Test Configuration

Test images were scanned using a Mustek 1200 USB scanner. A bash script called the SANE program scanimage to scan each image to a PNM-formatted file. Below is a snippet of the scanning command used.

# "time" reports on stderr, so 2>&1 captures the timing into RESULT
RESULT=$( \
( time scanimage --mode "$mode" -l 90 -t 47.5 -x 68 -y 34.5 \
	--resolution "$res" > "images/image-$mode-$res.pnm" ) \
	2>&1 )

The result was then formatted and written to the results file along with the time output. An additional 10-second sleep was added to give the scanner time to return to its home position.

Scanning in [Color] at [1200] dpi... real 0m31.312s user 0m0.163s sys 0m0.042s
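For anyone who wants to repeat the test, the loop around that command might look roughly like the sketch below. It is not the original script: the function name scan_all, the results file name, and the pause argument are made up for illustration, and the mode names are simply the labels used in this article. Backends differ in the mode names they accept (scanimage --help lists them for your device), so adjust as needed.

```shell
#!/usr/bin/env bash
# Sketch of the scan loop. scan_all, the results file name and the pause
# argument are assumptions, not names taken from the original script.
scan_all() {
    local results=${1:-results.txt}   # where the formatted lines are appended
    local pause=${2:-10}              # seconds for the scanner to return home
    local mode res RESULT

    mkdir -p images
    # Mode names are assumed; check what your SANE backend accepts
    for mode in Color Grayscale Lineart; do
        for res in 150 300 600 1200; do
            # "time" writes its report to stderr, so 2>&1 captures it
            RESULT=$( \
                ( time scanimage --mode "$mode" -l 90 -t 47.5 -x 68 -y 34.5 \
                    --resolution "$res" > "images/image-$mode-$res.pnm" ) \
                2>&1 )
            # Leaving $RESULT unquoted collapses the report onto one line
            echo "Scanning in [$mode] at [$res] dpi..." $RESULT >> "$results"
            sleep "$pause"
        done
    done
}
```

Calling scan_all results.txt 10 would run all twelve scans with a 10 second pause after each one.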

For Gentoo users, only a few quick steps are needed to get a USB scanner working.

  • Set USE="usb" (and gimp if you like)
  • emerge sane-backends xsane gocr
  • Add your user account to the scanner group and re-login
  • Plug it in and enjoy

Scanning Results

This table shows the scan times and file sizes from the scanning tests.

Mode       DPI   Time (seconds)  PNM file size
Color      150   10.576          239k
Color      300   10.773          958k
Color      600   19.637          3.8M
Color      1200  31.312          15M
Grayscale  150   9.728           80k
Grayscale  300   9.813           320k
Grayscale  600   13.060          1.3M
Grayscale  1200  13.003          5.0M
Lineart    150   9.714           10k
Lineart    300   9.790           40k
Lineart    600   13.047          160k
Lineart    1200  13.001          638k

It's no surprise that the Color scan at 1200 DPI took the longest, at 31 seconds, or that Lineart at 150 DPI was the shortest, at just under 10 seconds. Grayscale was only slightly slower than Lineart at every resolution.
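The file sizes in the table follow almost directly from the scan area and the resolution. The sketch below estimates the raw size of a PNM file, assuming the -x and -y values passed to scanimage are millimetres, that Color uses 3 bytes per pixel, Grayscale 1 byte per pixel, and Lineart 1 bit per pixel with each row padded to a whole byte. The function name pnm_size is made up, and the small PNM header is ignored.

```shell
# Estimate the raw PNM size in bytes for a scan area given in millimetres.
# pnm_size is a hypothetical helper; header bytes are not counted.
pnm_size() {
    # $1 = mode (Color|Grayscale|Lineart), $2 = DPI, $3 = width mm, $4 = height mm
    awk -v mode="$1" -v dpi="$2" -v w="$3" -v h="$4" 'BEGIN {
        px_w = int(w / 25.4 * dpi)      # 25.4 mm per inch
        px_h = int(h / 25.4 * dpi)
        if (mode == "Color")
            row = px_w * 3              # three bytes (R, G, B) per pixel
        else if (mode == "Lineart")
            row = int((px_w + 7) / 8)   # one bit per pixel, padded per row
        else
            row = px_w                  # one byte per pixel
        printf "%d\n", row * px_h
    }'
}
```

For the 68 x 34.5 area used here, pnm_size Grayscale 150 68 34.5 gives 81403 bytes, which lines up with the 80k in the table, and pnm_size Color 1200 68 34.5 gives about 15.7 million bytes, in line with the 15M entry. The scanner may round the pixel counts slightly differently.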

OCR Tests

The next phase of the test was conducted using GOCR to convert the images to text. The resulting text files were then compared against the correct text using kdiff3, and the errors were counted.
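The errors in this article were counted by eye in kdiff3. If you would rather have a rough number from the command line, something like the sketch below could stand in. It is a different technique, a word-by-word diff, so it will not necessarily reproduce the counts reported here. The function name count_word_errors and the file names in the usage note are made up for illustration.

```shell
# Rough word-level error count: how many words of the reference text do
# not line up with the OCR output. A stand-in for counting by hand.
count_word_errors() {
    local ref=$1 ocr=$2 ref_words ocr_words
    ref_words=$(mktemp)
    ocr_words=$(mktemp)
    # Put one word per line so diff compares word by word
    tr -s '[:space:]' '\n' < "$ref" > "$ref_words"
    tr -s '[:space:]' '\n' < "$ocr" > "$ocr_words"
    # Lines starting with "<" are reference words that were changed or lost;
    # grep exits non-zero when the count is 0, hence the "|| true"
    diff "$ref_words" "$ocr_words" | grep -c '^<' || true
    rm -f "$ref_words" "$ocr_words"
}
```

A typical run would convert one scan with gocr -i images/image-Grayscale-300.pnm -o ocr.txt and then call count_word_errors correct.txt ocr.txt, where correct.txt holds the hand-typed text. Extra words that appear only in the OCR output are not counted.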

Mode       DPI   Time (seconds)  Errors
Color      150   0.097           18
Color      300   0.241           10
Color      600   0.798           13
Color      1200  3.421           24
Grayscale  150   0.154           17
Grayscale  300   0.231           6
Grayscale  600   0.706           12
Grayscale  1200  3.145           20
Lineart    150   0.140           28+
Lineart    300   0.243           11
Lineart    600   0.767           10
Lineart    1200  3.341           18

Somewhat of a surprise here is that Lineart at 150 DPI had the worst error count, with more than 28 errors. As you can see from the output below, it's almost completely unreadable.

xoops !. _ot a 5ecurity Hole
I n res p o n se t o C a M e ro n 5 p i tze r 's l ett e r I J u n e
2 o o 6 , __ Xoo p s !, A 5 e c u r i ty H o l e '' j , J u a n M a rce I o
R o d r i g u ez m ay h av e f o r g ot(e n to m e n ti o n th a t
xoops nmds the permlssions of the uploads, cache
and tem plates c d i reLtori es set to 777 j ust fo r the
jnstall You ca_ change them back !. When you log in
to the application for the fInt ti me, if they are still set
to 777, (he application will ask you to change them.

In contrast, the highest-scoring result was Grayscale at 300 DPI, with only 9 errors (12%). It was almost perfect. The scan was intentionally done at a very slight angle to simulate human error. Using a document feeder or placing the page carefully should only increase accuracy.

Xoops!. _ot a Security Hole
In response to Cameron Spitzer's letter [June
2006, ''Xoops!. A Security Hole''I, Juan Marceto
Rodriguez may have forgotten to mention that
Xoops needs the permissions of the uploads, cache
and templates_c dire_ories set to 777 just for the
install. You can change them back!. When you log in
to the application for the first time, if they are still set
to 777, the application will ask you to change them.
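The 12% figure is consistent with counting errors per word: the sample text is 75 words long, and 9 errors out of 75 words is 12%, which leaves 88% accuracy. That per-word reading is an assumption on my part, and the helper name accuracy_pct is made up, but the arithmetic can be checked with a one-liner like this:

```shell
# Accuracy as the percentage of words without an error.
# $1 = error count, $2 = total words in the sample
accuracy_pct() {
    awk -v errors="$1" -v words="$2" \
        'BEGIN { printf "%.0f\n", (words - errors) * 100 / words }'
}

accuracy_pct 9 75    # prints 88
```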

Conclusion

These tests show that the best overall results can be achieved using Grayscale at 300 DPI. This setting not only had the highest accuracy, but also some of the lowest processing times and smaller file sizes of all the tests. While these results were for my old Mustek 1200 USB scanner, they should be very similar for other scanners as well.