
Re: OCR function, PNG works except for PDF files

Posted: Thu Mar 12, 2015 1:19 pm
by fsouren
Thanks for your reply! I'll dig into it.

But on the other hand, I'm probably not the only one trying to index Dutch-text PDF files, I guess (English-text PDFs work great).
I just can't figure out where it goes wrong.

Re: OCR function, PNG works except for PDF files

Posted: Sun Mar 15, 2015 8:10 am
by jllort
You should debug the temp files that are created before OCR is run on them. Two weeks ago we released a portable dev environment: http://sourceforge.net/projects/openkmportabledev/ My suggestion is to download it, set some breakpoints in the PDF text extractor, and step through to see what's happening, especially with the tmp files (upload only one document and, from crontab, force indexing).
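
For anyone without a debugger at hand, a rough way to look at the tmp files mentioned above is to list whatever changed in the temp directory right after forcing indexing of a single document. This is only a sketch in Python: the function name `recent_files` is made up for illustration, and which directory OpenKM actually writes its intermediate files to is an assumption you would need to check on your own install.

```python
import os
import time


def recent_files(directory, max_age_seconds=300, now=None):
    """Return (path, size) for files modified in the last max_age_seconds, newest first."""
    now = time.time() if now is None else now
    found = []
    # Only the top level of the directory is scanned; subfolders are skipped.
    for entry in os.scandir(directory):
        if not entry.is_file():
            continue
        info = entry.stat()
        if now - info.st_mtime <= max_age_seconds:
            found.append((info.st_mtime, entry.path, info.st_size))
    found.sort(reverse=True)
    return [(path, size) for _, path, size in found]


if __name__ == "__main__":
    import tempfile

    # Assumption: replace this with the temp directory your OpenKM/Tomcat really uses.
    for path, size in recent_files(tempfile.gettempdir()):
        print(size, path)
```

Files that are tiny or empty right after indexing are a hint that the conversion step, rather than the OCR step, is where things go wrong.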

Re: OCR function, PNG works except for PDF files

Posted: Tue Mar 17, 2015 12:38 pm
by fsouren
Could you try one more thing for me?
I have 2 scans, doc1.pdf and doc3.pdf. doc1.pdf works, doc3.pdf doesn't.

What could be the difference?

http://www.famsouren.nl/doc1.pdf
http://www.famsouren.nl/doc3.pdf

doc3.pdf works if I first convert it manually to PNG, then upload the PNG file.
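
The manual workaround described above (PDF to PNG first, then OCR) can be reproduced outside OpenKM to see whether the problem sits in the conversion or in the OCR step. A minimal sketch in Python, assuming ImageMagick's `convert` and the `tesseract` command-line tool are on the PATH and that the Dutch language data (`nld`) is installed. The 300 DPI density, the function names, and the file names are illustrative choices, not values taken from OpenKM.

```python
import subprocess


def convert_cmd(pdf_path, png_path, density=300):
    """Build an ImageMagick command that rasterises a PDF to PNG."""
    # -density comes before the input file so it applies while the PDF is read.
    return ["convert", "-density", str(density), pdf_path, png_path]


def tesseract_cmd(image_path, output_base, lang="nld"):
    """Build a Tesseract command; the text is written to output_base + '.txt'."""
    return ["tesseract", image_path, output_base, "-l", lang]


def ocr_pdf(pdf_path, work_base, lang="nld"):
    """Run both steps and return the recognised text."""
    png_path = work_base + ".png"
    subprocess.run(convert_cmd(pdf_path, png_path), check=True)
    subprocess.run(tesseract_cmd(png_path, work_base, lang), check=True)
    with open(work_base + ".txt", encoding="utf-8") as handle:
        return handle.read()
```

Calling `ocr_pdf("doc3.pdf", "doc3")` would leave `doc3.png` and `doc3.txt` next to the PDF for inspection. For a multi-page PDF ImageMagick writes one numbered PNG per page, so this sketch only covers single-page scans.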

Re: OCR function, PNG works except for PDF files

Posted: Sat Mar 21, 2015 6:50 pm
by jllort
I've tested it in our online demo and it seems to work fine there. I attach the extracted text here.

Do you have the latest OpenKM version? (The nightly build, because we've corrected some issues there: http://integration.openkm.com/ and here is the information about migration: http://wiki.openkm.com/index.php/Migration_Guide)

Re: OCR function, PNG works except for PDF files

Posted: Sun Mar 22, 2015 8:51 am
by fsouren
I've tried what you said, and upgraded to build 8186.
But I still get a lot of garbage when indexing doc3.pdf.

Sadly enough I should let it go, I guess. I just can't seem to get it working :cry:

Re: OCR function, PNG works except for PDF files

Posted: Tue Mar 24, 2015 12:18 pm
by fsouren
Is it possible to post the settings from the demo site here, so I can compare them?
I can't seem to view them when logging in as a demo user.

Re: OCR function, PNG works except for PDF files

Posted: Sun Mar 29, 2015 3:07 pm
by jllort
The demo is based on the professional version, not the community one (both versions have a similar base, but are quite different).

From what you told us, with the nightly build you get exactly the same problem, no? Can you post a text file with the extracted contents here?

Re: OCR function, PNG works except for PDF files

Posted: Mon Mar 30, 2015 7:19 am
by fsouren
Yes, exactly the same problem.

Re: OCR function, PNG works except for PDF files

Posted: Fri Apr 03, 2015 11:19 am
by fsouren
Anyone?

Re: OCR function, PNG works except for PDF files

Posted: Sat Apr 04, 2015 9:13 am
by pavila
So you've tested with a recent nightly build, haven't you?

Re: OCR function, PNG works except for PDF files

Posted: Sun Apr 05, 2015 6:50 am
by fsouren
Yes, I did. I even did a clean install with Ubuntu 14.04 and the OpenKM nightly.
The only thing I did was install LibreOffice and ImageMagick, then OpenKM, and replace OpenKM.war with a nightly one.

Re: OCR function, PNG works except for PDF files

Posted: Tue Apr 07, 2015 9:20 am
by pavila
I've made some improvements to PDF text extraction; please try with tonight's nightly build.

Check that you have installed Tesseract and configured com.openkm.extractor.Tesseract3TextExtractor in registered.text.extractors. If com.openkm.extractor.CuneiformTextExtractor is present, remove it.
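
The check above can be sketched as a small helper. It assumes the `registered.text.extractors` value is a list of class names, one per line, which is an assumption about the setting's format that should be verified against your own configuration screen. The function name `fix_extractors` is hypothetical and not part of OpenKM.

```python
TESSERACT = "com.openkm.extractor.Tesseract3TextExtractor"
CUNEIFORM = "com.openkm.extractor.CuneiformTextExtractor"


def fix_extractors(value):
    """Drop the Cuneiform extractor and make sure the Tesseract one is listed."""
    # Assumption: one fully qualified class name per line; blank lines are ignored.
    names = [line.strip() for line in value.splitlines() if line.strip()]
    names = [name for name in names if name != CUNEIFORM]
    if TESSERACT not in names:
        names.append(TESSERACT)
    return "\n".join(names)
```

Pasting the current value of the setting into `fix_extractors` and comparing the result with the original shows at a glance whether anything needs changing. Other extractors in the list are left in their original order.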

Re: OCR function, PNG works except for PDF files

Posted: Tue Apr 07, 2015 10:32 am
by fsouren
Thanks for looking into this!

So I should wait till tomorrow? Build 8189 still isn't working for me.
Still the same text as uploaded in doc3.zip.

Re: OCR function, PNG works except for PDF files

Posted: Tue Apr 07, 2015 3:06 pm
by pavila
Wait until tomorrow, when a new build will be generated.

Re: OCR function, PNG works except for PDF files

Posted: Thu Apr 09, 2015 10:18 am
by fsouren
I installed the new build and it's working great now!
Could you try to explain what the underlying problem was? (in simple English please :lol: )