• OCR Not happening (Resolved)

  • OpenKM has many interesting features, but requires some configuration process to show its full potential.
 #18559  by Netvoid
 
OpenKM v5.1.10 build 7564

No error is listed, but I do get one if I change the Tesseract command line. Tesseract works when I run it manually against TIFs.

Running on Windows 2008 R2, 8 GB RAM, 2 processors.

Increased JVM memory: -Xms512m -Xmx1024m

Mostly I am trying to get image-based PDF scans to OCR the way they do on the command line after being converted to an image. All my previews and my PDF conversions of documents are working. Convert seems fine, and image uploads work. I just can't get a PDF upload to index for searches when it's a scan rather than a PDF with embedded text. Thanks for any help.

The following related options are set as shown:
system.dwg2dxf	string	
system.ghostscript.ps2pdf	string	
system.imagemagick.convert	string	C:\ImageMagick-6.7.9-Q16\convert.exe
system.keyword.lowercase	boolean	false
system.login.lowercase	boolean	false
system.maintenance	boolean	false
system.ocr	string	C:\Tesseract-OCR\tesseract.exe ${fileIn} ${fileOut}
system.openoffice.dictionary	string	C:\OpenOfficeDictionary\dict-en.oxt
system.openoffice.path	string	C:\Program Files (x86)\OpenOffice.org 3
system.openoffice.port	integer	2002
system.openoffice.server	string	
system.openoffice.tasks	integer	200
system.pdf.force.ocr	boolean	true
system.previewer	string	zviewer
system.readonly	boolean	false
system.swftools.pdf2swf	string	C:\SWFTools\pdf2swf.exe -T 9 -f ${fileIn} -o ${fileOut}
Last edited by Netvoid on Sat Sep 29, 2012 7:06 pm, edited 1 time in total.
 #18560  by Netvoid
 
One thing I notice is that when the file I import is, for example, a JPG image, the OCR works: Tesseract executes cleanly, and the files in my user/temp folder are created, then purged after processing.

Where it does NOT work is when I use an image-based PDF as the source. The problem appears to be this:

When Tesseract runs, it works and generates this file in my user/temp folder:

okm4802275419226312643.txt.txt

That file has the OCR text, but there is also this file:

okm4802275419226312643.txt

It has basically the same name but is 0 KB, without the OCR'd result in it.

It looks like the system then reads from "okm4802275419226312643.txt" to populate the search index. That file is empty, so searches do not produce a result.

And neither file is deleted after the processing.

It seems the whole thing would be corrected if the file didn't have that pesky extra ".txt" on the end. I think I see the problem in the code: ".txt" is added to the end when the parameter replacement for "${fileOut}" occurs, and the Windows build of Tesseract also puts ".txt" on the end. At least I think that is the problem so far...
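
The doubled extension can be sketched in a few lines. This is only an illustration, assuming the value substituted for "${fileOut}" already ends in ".txt"; the Tesseract command line treats its second argument as an output base and appends ".txt" to it. The function names are hypothetical and are not OpenKM code.

```python
def tesseract_output_path(output_base: str) -> str:
    """Path the Tesseract CLI writes to: it appends ".txt" to the output base."""
    return output_base + ".txt"


def output_base_for(expected_txt_path: str) -> str:
    """Strip a trailing ".txt" so Tesseract's own suffix recreates the expected path."""
    if expected_txt_path.lower().endswith(".txt"):
        return expected_txt_path[:-4]
    return expected_txt_path


expected = "okm4802275419226312643.txt"  # file the caller reads afterwards

# Passing the full name as the output base yields the doubled extension.
print(tesseract_output_path(expected))
# Passing the base name yields the file the caller expects.
print(tesseract_output_path(output_base_for(expected)))
```

The first call reproduces the "okm4802275419226312643.txt.txt" name seen in the temp folder, which is why the file the system reads stays empty.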
 #18567  by Netvoid
 
Okay ... just thought I'd continue to post my progress (or lack of progress) ...

I found that I needed to replace the reference to "com.openkm.extractor.CuneiformTextExtractor" in the repository.xml, workspace.xml, and the UI administration configuration "registered.text.extractors" parameters with "com.openkm.extractor.Tesseract3TextExtractor" ...

So that solved the issue from my prior post with the additional .txt.txt, since Cuneiform doesn't append the extension but Tesseract does.
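
For anyone making the same swap in several places, here is a minimal sketch of the replacement as a plain text substitution. It assumes the class name appears literally in each file; the one-line sample is hypothetical and does not reflect the real layout of repository.xml or workspace.xml. Back up the files before editing them.

```python
from typing import Tuple

OLD = "com.openkm.extractor.CuneiformTextExtractor"
NEW = "com.openkm.extractor.Tesseract3TextExtractor"


def swap_extractor(config_text: str) -> Tuple[str, int]:
    """Return the updated text and how many references were replaced."""
    count = config_text.count(OLD)
    return config_text.replace(OLD, NEW), count


# Hypothetical one-line sample; the real files contain much more than this.
sample = "extractors=" + OLD
updated, replaced = swap_extractor(sample)
print(replaced, updated)
```

A count of zero for a file would suggest that file never referenced the Cuneiform extractor in the first place.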

Now I am back to processing without error. I see the .txt file OCR conversion happening and I don't get any errors, but the PDF still does not show up when I perform content searches.


I think this is a separate issue, but if I point "system.openoffice.dictionary" at "C:\OpenOfficeDictionary\dict-en.oxt", I get a dictionary load error after the OCR. If I clear that setting so the dictionary is not used, I get no error but, as stated before, I still can't get my PDF to show up in searches. I do see the result of the OCR in my temp folder, though. It works 100% fine when I manually convert the PDF to a JPEG and then upload the JPEG: the OCR works and the JPEG shows up in search results, but the PDF does not.
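
The manual workaround (PDF to JPEG, then OCR) can be sketched as two command lines. This is a rough sketch under assumptions: the executable paths are the ones from the settings in the first post, the 300 dpi density and the C:\scans\invoice.pdf path are made up for illustration, and a single-page PDF is assumed. The helper only builds the commands and does not run them.

```python
from pathlib import PureWindowsPath

CONVERT = r"C:\ImageMagick-6.7.9-Q16\convert.exe"  # path from the settings above
TESSERACT = r"C:\Tesseract-OCR\tesseract.exe"      # path from the settings above


def manual_ocr_commands(pdf_path: str, density: int = 300):
    """Build the two command lines: PDF -> JPEG, then JPEG -> text."""
    pdf = PureWindowsPath(pdf_path)
    image = pdf.with_suffix(".jpg")
    out_base = pdf.with_suffix("")  # Tesseract appends ".txt" itself
    convert_cmd = [CONVERT, "-density", str(density), str(pdf), str(image)]
    ocr_cmd = [TESSERACT, str(image), str(out_base)]
    return convert_cmd, ocr_cmd


# Hypothetical input path, for illustration only.
for cmd in manual_ocr_commands(r"C:\scans\invoice.pdf"):
    print(cmd)
```

Each list can be handed to subprocess.run on a machine where both tools are installed.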
 #18571  by Netvoid
 
Okay... after making the changes above and then setting the "system.pdf.force.ocr" parameter to false, the PDF-based OCR appears to be working as one would expect. It may also have been working before that, right after the change to the various text extractor settings, but at that point I wasn't searching for text in documents that had been successfully OCR'd.

Hope this log of what I went through helps others in some way.

It would be great if one of the open OCR solutions was significantly better. Any suggestions for a paid OCR that is compatible with OpenKM and far more flexible and accurate against the sources? Is ABBYY strong? Has anyone integrated cvisiontech or leadtools and found the text parsing to be much better?
 #18576  by jllort
 
We have tested ABBYY for Linux with good results.
