OCR Not happening (Resolved)

Posted: Sat Sep 29, 2012 4:28 am
by Netvoid
OpenKM v5.1.10 build 7564

No error is logged, although I do get one if I change the Tesseract command line. Tesseract works when I run it manually against TIFF files.

Running on Windows Server 2008 R2 with 8 GB RAM and 2 processors.

Increased JVM memory to -Xms512m -Xmx1024m.

I am mostly trying to get image-based PDF scans to OCR the way they do on the command line after being converted to an image. All my previews work, as do my PDF conversions of documents. Conversion seems fine and image uploads work. I just can't get an uploaded PDF indexed for search when it is a scan rather than a PDF with embedded text. Thanks for any help.
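For anyone wanting to reproduce the manual command-line check described above, here is a minimal Python sketch. It is not how OpenKM runs OCR internally. The executable paths are the ones from the configuration in this thread; the `-density 300` setting, the `[0]` first-page selector, and the function names are my own assumptions. ImageMagick also needs Ghostscript installed in order to read PDFs.

```python
import subprocess

# Paths taken from the configuration in this thread; adjust to your install.
CONVERT = r"C:\ImageMagick-6.7.9-Q16\convert.exe"
TESSERACT = r"C:\Tesseract-OCR\tesseract.exe"


def build_commands(pdf_path, work_base):
    """Build the two command lines: PDF -> TIFF, then TIFF -> text.

    work_base is a path without extension, e.g. "scan"; the TIFF is written
    to work_base + ".tif" and Tesseract writes work_base + ".txt".
    """
    tif_path = work_base + ".tif"
    # "[0]" asks ImageMagick for the first page only (assumption: one page
    # is enough for a quick test). 300 dpi is an assumed rasterization density.
    convert_cmd = [CONVERT, "-density", "300", pdf_path + "[0]", tif_path]
    # Tesseract takes an output *base* name and appends ".txt" itself.
    tesseract_cmd = [TESSERACT, tif_path, work_base]
    return convert_cmd, tesseract_cmd


def ocr_first_page(pdf_path, work_base):
    """Run both steps and return the path of the text file Tesseract wrote."""
    for cmd in build_commands(pdf_path, work_base):
        subprocess.run(cmd, check=True)  # raises CalledProcessError on failure
    return work_base + ".txt"
```

Calling `ocr_first_page("scan.pdf", "scan")` should leave `scan.tif` and `scan.txt` next to the PDF if both tools are working, which separates "the tools are broken" from "OpenKM is not picking up the result".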

The related options are set as follows:
system.dwg2dxf	string	
system.ghostscript.ps2pdf	string	
system.imagemagick.convert	string	C:\ImageMagick-6.7.9-Q16\convert.exe
system.keyword.lowercase	boolean	false
system.login.lowercase	boolean	false
system.maintenance	boolean	false
system.ocr	string	C:\Tesseract-OCR\tesseract.exe ${fileIn} ${fileOut}
system.openoffice.dictionary	string	C:\OpenOfficeDictionary\dict-en.oxt
system.openoffice.path	string	C:\Program Files (x86)\OpenOffice.org 3
system.openoffice.port	integer	2002
system.openoffice.server	string	
system.openoffice.tasks	integer	200
system.pdf.force.ocr	boolean	true
system.previewer	string	zviewer
system.readonly	boolean	false
system.swftools.pdf2swf	string	C:\SWFTools\pdf2swf.exe -T 9 -f ${fileIn} -o ${fileOut}

Re: OCR Not happening

Posted: Sat Sep 29, 2012 5:06 am
by Netvoid
One thing I notice is that when the file I import is, for example, a JPG image, the OCR works: Tesseract executes cleanly, and the files in my user/temp folder are created and then purged after processing.

It does NOT work when I use an image-based PDF as the source. The problem appears to be this:

When Tesseract runs, it works and generates this file in my user/temp folder:

okm4802275419226312643.txt.txt

That file has the OCR text, but there is also this file:

okm4802275419226312643.txt

It has basically the same name but is 0 KB, with no OCR result in it.

It looks like the system then reads "okm4802275419226312643.txt" to populate the search index. That file is empty, so searches return no results.

Neither file is deleted after processing.
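To check for this symptom without browsing the folder by hand, a small Python sketch like the following can list leftover OCR files and split them into empty and non-empty ones. The `okm*.txt` pattern is inferred from the file names above, and `find_ocr_leftovers` is a hypothetical helper, not part of OpenKM. The temp folder the JVM uses may differ from Python's default, so pass the folder explicitly if in doubt.

```python
import tempfile
from pathlib import Path


def find_ocr_leftovers(temp_dir=None):
    """Return (empty, non_empty) lists of okm*.txt file names in temp_dir.

    If temp_dir is None, Python's default temp folder is used, which is an
    assumption and may not be the folder the OpenKM JVM writes to.
    """
    folder = Path(temp_dir) if temp_dir else Path(tempfile.gettempdir())
    empty, non_empty = [], []
    # "okm*.txt" matches both "okm123.txt" and "okm123.txt.txt".
    for path in sorted(folder.glob("okm*.txt")):
        if not path.is_file():
            continue
        if path.stat().st_size == 0:
            empty.append(path.name)
        else:
            non_empty.append(path.name)
    return empty, non_empty
```

A zero-byte `okmNNN.txt` sitting next to a non-empty `okmNNN.txt.txt` is the pattern described in this post.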

It seems the whole thing would be fixed if the file didn't have that pesky extra ".txt" on the end. I think I see the problem in the code: ".txt" is added when the "${fileOut}" parameter is replaced, and the Windows build of Tesseract also puts ".txt" on the end. At least, I think that is the problem so far.
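The suspected double suffix can be illustrated with a short Python sketch. This only models the naming, under the assumption that the caller passes a `${fileOut}` value that already ends in ".txt"; it is not OpenKM's actual substitution code, and the helper names are made up. What is certain is that the Tesseract command line treats its second argument as an output base name and appends ".txt" to it.

```python
from string import Template

# Value of system.ocr from the configuration listed earlier in the thread.
SYSTEM_OCR = r"C:\Tesseract-OCR\tesseract.exe ${fileIn} ${fileOut}"


def expand_ocr_command(file_in, file_out):
    """Fill in the ${fileIn} and ${fileOut} placeholders (illustrative only)."""
    return Template(SYSTEM_OCR).substitute(fileIn=file_in, fileOut=file_out)


def file_written_by_tesseract(file_out):
    """Tesseract appends ".txt" to whatever output base it is given."""
    return file_out + ".txt"


def strip_txt_suffix(expected_file):
    """Drop one trailing ".txt" so the appended suffix lands on expected_file."""
    if expected_file.lower().endswith(".txt"):
        return expected_file[: -len(".txt")]
    return expected_file
```

With `fileOut` set to "okm4802275419226312643.txt", Tesseract writes "okm4802275419226312643.txt.txt", while a reader expecting "okm4802275419226312643.txt" finds only the empty placeholder. Passing `strip_txt_suffix(...)` of the expected name as the output base makes the two line up.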

Re: OCR Not happening

Posted: Sat Sep 29, 2012 6:27 pm
by Netvoid
Okay ... just thought I'd continue to post my progress (or lack of progress) ...

I found that I needed to replace the reference to "com.openkm.extractor.CuneiformTextExtractor" with "com.openkm.extractor.Tesseract3TextExtractor" in repository.xml, in workspace.xml, and in the "registered.text.extractors" parameter of the administration UI configuration.

That solved the issue from my prior post with the additional ".txt.txt", since Cuneiform doesn't append that suffix but Tesseract does.

Now I am back to processing without errors, and I can see the .txt file from the OCR conversion being created. But the PDF still does not show up when I perform content searches.


I think this is a separate issue, but if I point "system.openoffice.dictionary" at "C:\OpenOfficeDictionary\dict-en.oxt", I get a dictionary load error after the OCR. If I clear the setting so that the dictionary is not used, I get no error but, as stated before, still can't get my PDF to show up in searches. I do see the OCR result in my temp folder, though. It works 100% fine when I manually convert the PDF to a JPEG and upload that: the OCR runs and the JPEG shows up in search results, but the PDF does not.

Re: OCR Not happening

Posted: Sat Sep 29, 2012 7:06 pm
by Netvoid
Okay. After making the changes above and then setting the "system.pdf.force.ocr" parameter to false, the PDF-based OCR appears to work as one would expect. It is also possible that it was already working after the change to the text extractor settings, before I changed that parameter, and that I simply wasn't searching for text in documents that had been successfully OCR'd.

Hope this log of what I went through helps others in some way.

It would be great if one of the open-source OCR solutions were significantly better. Any suggestions for a paid OCR product that is compatible with OpenKM and far more flexible and accurate against the sources? Is ABBYY strong? Has anyone integrated CVISION or LEADTOOLS and found the text parsing to be much better?

Re: OCR Not happening (Resolved)

Posted: Sun Sep 30, 2012 10:18 am
by jllort
We have tested ABBYY for Linux with good results.