• OCR function, PNG works except for PDF files

  • OpenKM has many interesting features, but requires some configuration process to show its full potential.
 #31357  by fsouren
 
Hi,

I've got everything up and running on Ubuntu 14.04. The problem is OCR'ing PDF files.
If I convert them to PNG files on the command line with "convert -density 200 -quality 90" and upload them to OpenKM, everything is recognized fine.
But if I upload the source PDF file, I only get garbage text and can't full-text search the document.

Settings are:
Code:
org.apache.jackrabbit.extractor.PlainTextExtractor 
org.apache.jackrabbit.extractor.MsWordTextExtractor 
org.apache.jackrabbit.extractor.MsExcelTextExtractor 
org.apache.jackrabbit.extractor.MsPowerPointTextExtractor 
org.apache.jackrabbit.extractor.OpenOfficeTextExtractor 
org.apache.jackrabbit.extractor.RTFTextExtractor 
org.apache.jackrabbit.extractor.HTMLTextExtractor 
org.apache.jackrabbit.extractor.XMLTextExtractor 
org.apache.jackrabbit.extractor.PngTextExtractor 
org.apache.jackrabbit.extractor.MsOutlookTextExtractor 
com.openkm.extractor.Tesseract3TextExtractor 
com.openkm.extractor.PdfTextExtractor 
com.openkm.extractor.AudioTextExtractor 
com.openkm.extractor.ExifTextExtractor 
com.openkm.extractor.SourceCodeTextExtractor 
com.openkm.extractor.MsOffice2007TextExtractor
Code:
system.imagemagick.convert	String	/usr/bin/convert -density 200 -quality 90
system.ocr	String	/usr/bin/tesseract ${fileIn} ${fileOut} -l nld
system.swftools.pdf2swf	String	/opt/openkm-6.3.0-community/tomcat/bin/pdf2swf -f -T 9 -t -s storeallcharacters ${fileIn} -o ${fileOut}
system.openoffice.dictionary	String		
system.openoffice.path	String	/usr/lib/libreoffice
system.pdf.force.ocr	Boolean	Inactive
Does anyone have the golden answer for me? :D
 #31375  by jllort
 
Are you coming from a migration, or is this a new installation? (I suspect org.apache.jackrabbit should be com.openkm; anyway, this is not the reason it isn't working.) Upgrade to the nightly build, because it contains some corrections. You can find it at integration.openkm.com; the upgrade guide is here: http://wiki.openkm.com/index.php/Migrat ... 3_to_6.3.1

Can you upload the two files here in a zip (for testing purposes)?
 #31382  by fsouren
 
Thanks for the reply! It's a fresh new install.

I can stop Tomcat and edit OpenKM.cfg, but how can I execute the query when OpenKM/Tomcat is not running?

I can do a fresh install with the new .WAR file included in the ISO file, if that is necessary.
 #31387  by jllort
 
If you replace OpenKM.war with the nightly build (integration.openkm.com) and set hibernate.hbm2ddl=create, the repository and the whole database will be wiped and created again.
 #31396  by fsouren
 
Hi,

I replaced the .WAR on a fresh install just to be sure.
I'll upload the whole PDF that I tried, and one PNG file that I extracted manually with "convert -density 200 -quality 90".
The PDF is not OCR'ed correctly, but the PNG is if I upload it.

http://www.famsouren.nl/vastgoed-2.png
http://www.famsouren.nl/vastgoed_juli_2014.pdf
 #31398  by fsouren
 
If I execute the exact same commands on the command line, OCR works well.
These are the commands as configured in OpenKM.
Code:
/usr/bin/convert -density 200 -quality 90 vastgoed_juli_2014.pdf vastgoed.png
/usr/bin/cuneiform -l dut -o ocr.txt vastgoed-2.png
Does OpenKM treat these commands differently?
 #31408  by fsouren
 
Is it possible to let OpenKM execute this command:

/usr/bin/convert -density 200 -quality 90 vastgoed_juli_2014.PDF vastgoed.png

for example by configuring it like this? (I've tried it, but I don't think OpenKM uses it.)

/usr/bin/convert -density 200 -quality 90 ${fileIn} ${fileOut}
 #31451  by jllort
 
First of all, I suggest using tesseract. If you're using cuneiform, the class com.openkm.extractor.Tesseract3TextExtractor is the wrong extractor.
 #31460  by fsouren
 
Yeah, I'm using Tesseract.
I tried cuneiform from the command line too, to check whether it works.

I've tried both in system.ocr and in the extractors, but it makes no difference.

If I upload any document with English text, everything works fine; only scanned Dutch files fail.
Text PDFs and images in Dutch work too.
 #31480  by fsouren
 
When does OpenKM convert a PDF to an image file?
When I upload a PDF file, the tesseract process starts right away; I can't see it running convert first.
 #31575  by jllort
 
Internally we don't apply an optimization like "convert -density 300 -quality 100"; we extract the image from the PDF at its own resolution. In your case you'll probably want to take more control of the process, using an automation task to drive the conversion and the indexing. It's not very complicated to do, but explaining all the steps you should consider is fairly involved.

Download the dev environment here http://sourceforge.net/projects/openkmportabledev/

First, upload these files to a separate folder to be processed and, once processing has finished, move them to another one (this should be an easy solution):
/okm:root/special/controled extraction to /okm:root/special/finished

Create an automation task that moves the documents from controled to finished (this is the easiest part).
Now the more complex part, controlling the text extraction: convert the PDF document to PNG. Take this code as an example of how to do it:
Code:
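// Needs: com.openkm.api.OKMDocument, com.openkm.util.FileUtils, com.openkm.util.ImageUtils, org.apache.commons.io.IOUtils

// Copy the document content from the repository into a temporary file
File tmpDoc = FileUtils.createTempFileFromMime(doc.getMimeType());
InputStream is = OKMDocument.getInstance().getContent(null, doc.getUuid(), true);
FileOutputStream fos = new FileOutputStream(tmpDoc);
try {
    IOUtils.copy(is, fos);
} finally {
    // Close both streams so the file is complete before ImageMagick reads it
    IOUtils.closeQuietly(fos);
    IOUtils.closeQuietly(is);
}

// PBM file: convert the temporary file with ImageMagick
File tmpPbm = File.createTempFile("okm", ".pbm");
ImageUtils.imageMagickConvert(tmpDoc.getPath(), tmpPbm.getPath(), "${fileIn} ${fileOut}");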
Then you need to extract the text (tesseract), store it in OpenKM, etc. If you're interested in this option, tell me when you get this far and I'll continue explaining how to do it.
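As a rough idea of the tesseract step, here is a minimal sketch in plain Java that does not use any OpenKM class. It expands a template in the same style as the system.ocr setting (${fileIn} and ${fileOut}), runs the command, and reads back the text file that tesseract writes next to the output base name. The class OcrCommand and its methods are hypothetical names made up for this example; storing the text in OpenKM is not covered.

```java
import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper for illustration only; not part of OpenKM. */
final class OcrCommand {

    /**
     * Expands ${fileIn} and ${fileOut} in a system.ocr-style template.
     * The template is split on whitespace first, so file paths that
     * contain spaces still end up as a single argument.
     */
    static List<String> build(String template, File in, File outBase) {
        List<String> args = new ArrayList<String>();
        for (String token : template.trim().split("\\s+")) {
            args.add(token
                    .replace("${fileIn}", in.getPath())
                    .replace("${fileOut}", outBase.getPath()));
        }
        return args;
    }

    /** Runs the OCR command on one image and returns the recognized text. */
    static String run(String template, File image) throws IOException, InterruptedException {
        // tesseract appends ".txt" to the output base name it is given
        File outBase = File.createTempFile("okm", "");
        File outTxt = new File(outBase.getPath() + ".txt");
        try {
            Process p = new ProcessBuilder(build(template, image, outBase))
                    .inheritIO() // forward tesseract messages so the process cannot block on a full pipe
                    .start();
            int exit = p.waitFor();
            if (exit != 0) {
                throw new IOException("OCR command failed with exit code " + exit);
            }
            return new String(Files.readAllBytes(outTxt.toPath()), StandardCharsets.UTF_8);
        } finally {
            outTxt.delete();
            outBase.delete();
        }
    }
}
```

With the setting from this thread it would be called as OcrCommand.run("/usr/bin/tesseract ${fileIn} ${fileOut} -l nld", tmpPbm), assuming tesseract and the nld language data are installed on the server.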

Another way to do it is to create your own text extractor for PDF: take a look at the PdfTextExtractor.java class and modify it to your needs (debug it; you will probably want to apply some changes to the extracted image with the code I proposed above).
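If you go this way, one detail to keep in mind is that ImageMagick writes one PNG per page for a multi-page PDF (vastgoed-0.png, vastgoed-1.png, ...), so the pages have to be collected in numeric order before they are passed to the OCR. The sketch below shows only that part, in plain Java. PdfRasterizer is a hypothetical helper, not an OpenKM class, and the density and quality values are simply the ones used earlier in this thread.

```java
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper for illustration only; not part of OpenKM. */
final class PdfRasterizer {

    /** Rasterizes a PDF at 200 dpi and returns the page images in page order. */
    static List<File> rasterize(File pdf, File outDir, String base)
            throws IOException, InterruptedException {
        Process proc = new ProcessBuilder(
                "/usr/bin/convert", "-density", "200", "-quality", "90",
                pdf.getPath(), new File(outDir, base + ".png").getPath())
                .inheritIO()
                .start();
        if (proc.waitFor() != 0) {
            throw new IOException("convert failed for " + pdf);
        }
        return pageFiles(outDir, base);
    }

    /** Finds base.png or base-N.png in dir and sorts them by the page number N. */
    static List<File> pageFiles(File dir, String base) {
        final Pattern p = Pattern.compile(Pattern.quote(base) + "(?:-(\\d+))?\\.png");
        List<File> pages = new ArrayList<>();
        File[] files = dir.listFiles();
        if (files != null) {
            for (File f : files) {
                if (p.matcher(f.getName()).matches()) {
                    pages.add(f);
                }
            }
        }
        // Numeric sort, so that page 10 comes after page 2
        pages.sort(Comparator.comparingInt(f -> pageIndex(p, f)));
        return pages;
    }

    /** Page number taken from the file name; a single-page base.png counts as page 0. */
    private static int pageIndex(Pattern p, File f) {
        Matcher m = p.matcher(f.getName());
        return (m.matches() && m.group(1) != null) ? Integer.parseInt(m.group(1)) : 0;
    }
}
```

Each file returned by rasterize can then be sent through the OCR command one by one and the resulting texts concatenated.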
