
OCR function, PNG works except for PDF files

Posted: Fri Feb 20, 2015 12:28 pm
by fsouren
Hi,

I've got everything up and running on Ubuntu 14.04. The problem is OCR'ing PDF files.
If I convert them to PNG files on the command line with "convert -density 200 -quality 90" and upload them to OpenKM, everything gets recognized fine.
But if I upload the source PDF file, I only get garbage text and can't full-text search the document.

Settings are:
Code:
org.apache.jackrabbit.extractor.PlainTextExtractor 
org.apache.jackrabbit.extractor.MsWordTextExtractor 
org.apache.jackrabbit.extractor.MsExcelTextExtractor 
org.apache.jackrabbit.extractor.MsPowerPointTextExtractor 
org.apache.jackrabbit.extractor.OpenOfficeTextExtractor 
org.apache.jackrabbit.extractor.RTFTextExtractor 
org.apache.jackrabbit.extractor.HTMLTextExtractor 
org.apache.jackrabbit.extractor.XMLTextExtractor 
org.apache.jackrabbit.extractor.PngTextExtractor 
org.apache.jackrabbit.extractor.MsOutlookTextExtractor 
com.openkm.extractor.Tesseract3TextExtractor 
com.openkm.extractor.PdfTextExtractor 
com.openkm.extractor.AudioTextExtractor 
com.openkm.extractor.ExifTextExtractor 
com.openkm.extractor.SourceCodeTextExtractor 
com.openkm.extractor.MsOffice2007TextExtractor
Code:
system.imagemagick.convert	String	/usr/bin/convert -density 200 -quality 90
system.ocr	String	/usr/bin/tesseract ${fileIn} ${fileOut} -l nld
system.swftools.pdf2swf	String	/opt/openkm-6.3.0-community/tomcat/bin/pdf2swf -f -T 9 -t -s storeallcharacters ${fileIn} -o ${fileOut}
system.openoffice.dictionary	String		
system.openoffice.path	String	/usr/lib/libreoffice
system.pdf.force.ocr	Boolean	Inactive
Does anyone have the golden answer for me? :D

Re: OCR function, PNG works except for PDF files

Posted: Sat Feb 21, 2015 10:18 am
by jllort
Is this a migration or a new installation? (I suspect org.apache.jackrabbit should be com.openkm, although that is not the reason it isn't working.) Upgrade to the nightly build, because it contains some corrections. You can find it at integration.openkm.com; the upgrade guide is here: http://wiki.openkm.com/index.php/Migrat ... 3_to_6.3.1

Can you upload the two files here in a zip (for testing purposes)?

Re: OCR function, PNG works except for PDF files

Posted: Sun Feb 22, 2015 7:21 am
by fsouren
Thanks for the reply! It's a fresh new install.

I can stop Tomcat and edit OpenKM.cfg, but how can I execute the query when OpenKM/Tomcat is not running?

I can do a fresh install of the new .WAR file that is included in the ISO file, if that is necessary.

Re: OCR function, PNG works except for PDF files

Posted: Sun Feb 22, 2015 6:07 pm
by jllort
If you replace OpenKM.war with the nightly build (integration.openkm.com) and set hibernate.hbm2ddl=create, then the repository and the whole database will be wiped and created from scratch.

Re: OCR function, PNG works except for PDF files

Posted: Mon Feb 23, 2015 10:07 am
by fsouren
Hi,

I replaced the .WAR on a fresh install just to be sure.
I'll upload the whole PDF that I tried, and one PNG file that I extracted manually with "convert -density 200 -quality 90".
The PDF is not OCR'd correctly, but the PNG is if I upload it.

http://www.famsouren.nl/vastgoed-2.png
http://www.famsouren.nl/vastgoed_juli_2014.pdf

Re: OCR function, PNG works except for PDF files

Posted: Mon Feb 23, 2015 4:00 pm
by fsouren
If I execute the exact same commands on the command line, OCR works well.
These are the commands as configured in OpenKM.
Code:
/usr/bin/convert -density 200 -quality 90 vastgoed_juli_2014.pdf vastgoed.png
/usr/bin/cuneiform -l dut -o ocr.txt vastgoed-2.png
Does OpenKM treat these commands differently?
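
For anyone who wants to script the two manual steps above, here is a rough Java sketch that only builds the command lines; it is not how OpenKM does it internally. It assumes ImageMagick's usual behaviour of writing a multi-page PDF as numbered files (vastgoed-0.png, vastgoed-1.png, ...), which would explain why the OCR command refers to vastgoed-2.png. The class name ManualOcrCommands, the ocr-N.txt output names and the pageCount parameter are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Sketch only: mirrors the two shell commands from the post; not OpenKM code. */
class ManualOcrCommands {

    /** Command that rasterizes every PDF page at the given density (DPI). */
    static List<String> convertCommand(String pdf, String png, int density, int quality) {
        return Arrays.asList("/usr/bin/convert",
                "-density", String.valueOf(density),
                "-quality", String.valueOf(quality),
                pdf, png);
    }

    /**
     * Name assumed for page {@code index} (0-based) when a multi-page PDF is
     * written to a single-image format: "vastgoed.png" -> "vastgoed-2.png".
     */
    static String pageImageName(String png, int index) {
        int dot = png.lastIndexOf('.');
        if (dot < 0) {
            // No extension: just append the page number (hypothetical fallback).
            return png + "-" + index;
        }
        return png.substring(0, dot) + "-" + index + png.substring(dot);
    }

    /** One cuneiform call per page image; pageCount is assumed to be known. */
    static List<List<String>> ocrCommands(String png, int pageCount, String lang) {
        List<List<String>> commands = new ArrayList<List<String>>();
        for (int i = 0; i < pageCount; i++) {
            // "ocr-<i>.txt" is an illustrative output name, one text file per page.
            commands.add(Arrays.asList("/usr/bin/cuneiform", "-l", lang,
                    "-o", "ocr-" + i + ".txt", pageImageName(png, i)));
        }
        return commands;
    }
}
```

Each returned list could be handed to a ProcessBuilder; keeping the arguments as a list avoids shell quoting problems with file names.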

Re: OCR function, PNG works except for PDF files

Posted: Tue Feb 24, 2015 11:36 am
by fsouren
Is it possible to let OpenKM execute this command:

/usr/bin/convert -density 200 -quality 90 vastgoed_juli_2014.PDF vastgoed.png

for example by configuring it like this (I've tried it, but I don't think OpenKM runs it):

/usr/bin/convert -density 200 -quality 90 ${fileIn} ${fileOut}
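
To make clear what such a setting would have to expand to, here is a minimal Java sketch of filling in the ${fileIn} and ${fileOut} placeholders. It is an assumption about how a template like this is typically handled, not OpenKM's actual implementation; CommandTemplate, expand and hasPlaceholders are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper; not OpenKM's actual implementation. */
class CommandTemplate {

    /** Splits the template on whitespace and fills in the two placeholders. */
    static List<String> expand(String template, String fileIn, String fileOut) {
        List<String> argv = new ArrayList<String>();
        for (String token : template.trim().split("\\s+")) {
            argv.add(token.replace("${fileIn}", fileIn).replace("${fileOut}", fileOut));
        }
        return argv;
    }

    /** True only if both placeholders are present in the template. */
    static boolean hasPlaceholders(String template) {
        return template.contains("${fileIn}") && template.contains("${fileOut}");
    }
}
```

With the template above and the files from this thread, expand would yield the same argument list as the manual command. A setting without the placeholders, such as "/usr/bin/convert -density 200 -quality 90", makes hasPlaceholders return false.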

Re: OCR function, PNG works except for PDF files

Posted: Wed Feb 25, 2015 9:40 am
by fsouren
I'm wondering: could it be the Dutch language?
I can upload almost any English document and its text gets recognized fine.

Re: OCR function, PNG works except for PDF files

Posted: Fri Feb 27, 2015 5:24 pm
by jllort
First of all, I suggest using Tesseract. If you're using cuneiform, then the class com.openkm.extractor.Tesseract3TextExtractor is the wrong one.

Re: OCR function, PNG works except for PDF files

Posted: Fri Feb 27, 2015 8:13 pm
by fsouren
Yeah, I'm using Tesseract.
I tried cuneiform from the command line too, to check whether it works.

I've tried both in system.ocr and in the extractors, but it makes no difference.

If I upload any document with English text, everything works fine; only scanned Dutch files fail.
Text PDFs and images in Dutch are working too.

Re: OCR function, PNG works except for PDF files

Posted: Sun Mar 01, 2015 4:23 pm
by jllort
And from the command line, does Tesseract with Dutch work correctly?

Re: OCR function, PNG works except for PDF files

Posted: Mon Mar 02, 2015 7:35 am
by fsouren
Yes, if I convert the PDF to PNG first with the command:
Code:
convert -density 300 -quality 100

Re: OCR function, PNG works except for PDF files

Posted: Mon Mar 02, 2015 9:40 am
by fsouren
When does OpenKM convert a PDF to an image file?
When I upload a PDF file, the tesseract process is started right away; I can't see that it runs convert first.

Re: OCR function, PNG works except for PDF files

Posted: Wed Mar 04, 2015 7:57 am
by fsouren
Anyone with a suggestion where to look?

Re: OCR function, PNG works except for PDF files

Posted: Thu Mar 12, 2015 7:52 am
by jllort
Internally we're not doing this optimization ("convert -density 300 -quality 100"); we extract the image at its own resolution. In your case you will probably want to take more control of the process, using an automation task to do the indexing. It is not very complicated to do, but it is quite complex to explain all the steps you should consider.

Download the dev environment here http://sourceforge.net/projects/openkmportabledev/

First, upload these files to a separate folder to be processed and, once processing has finished, move them to another folder (this should be an easy solution):
/okm:root/special/controled extraction to /okm:root/special/finished

Create an automation task that moves the documents from controled to finished (this is the easiest part).
Now the more complex part, controlling the text extraction: convert the PDF document to PNG. Take this code as an example of how to do it:
Code:
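// Copy the stored document to a temporary file (despite its name, tmpJpeg holds the uploaded PDF)
File tmpJpeg = FileUtils.createTempFileFromMime(doc.getMimeType());
InputStream is = OKMDocument.getInstance().getContent(null, doc.getUuid(), true);
FileOutputStream fos = new FileOutputStream(tmpJpeg);
try {
    IOUtils.copy(is, fos);
} finally {
    // Close both streams before ImageMagick reads the temp file
    IOUtils.closeQuietly(fos);
    IOUtils.closeQuietly(is);
}

// PBM file
File tmpPbm = File.createTempFile("okm", ".pbm");
ImageUtils.imageMagickConvert(tmpJpeg.getPath(), tmpPbm.getPath(), "${fileIn} ${fileOut}");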
Then you need to extract the text (Tesseract), store it in OpenKM, etc. If you're interested in this option, tell me when you get to this point and I will continue explaining how to do it.
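
As a rough idea of what the Tesseract step could look like, here is a plain Java sketch that runs the tesseract binary on the converted image and reads back the text file it writes. It uses only the JDK, assumes the binary lives at /usr/bin/tesseract as in the system.ocr setting earlier in the thread, and relies on Tesseract appending ".txt" to the output base name. TesseractStep and its methods are hypothetical names, and storing the text in OpenKM is not shown.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;

/** Sketch of the OCR step only; not OpenKM code. */
class TesseractStep {

    /** Same argument order as the system.ocr setting: input, output base, language. */
    static List<String> command(Path image, Path outputBase, String lang) {
        return Arrays.asList("/usr/bin/tesseract",
                image.toString(), outputBase.toString(), "-l", lang);
    }

    /** Tesseract appends ".txt" to the output base name. */
    static Path resultFile(Path outputBase) {
        return Paths.get(outputBase.toString() + ".txt");
    }

    /** Runs tesseract and returns the recognized text. */
    static String ocr(Path image, Path outputBase, String lang)
            throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command(image, outputBase, lang))
                .inheritIO() // show tesseract's own messages on the console
                .start();
        int exit = process.waitFor();
        if (exit != 0) {
            throw new IOException("tesseract exited with status " + exit);
        }
        byte[] bytes = Files.readAllBytes(resultFile(outputBase));
        return new String(bytes, StandardCharsets.UTF_8);
    }
}
```

Calling TesseractStep.ocr with the converted image path, an output base path and "nld" would return the Dutch text, provided the nld language data is installed.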

Another way to do it is to create your own text extractor for PDF: take a look at the PdfTextExtractor.java class and modify it to suit your needs (debug it, and you will probably want to make some changes to the extracted image with the code I proposed above).