• OCR function, PNG works except for PDF files

 #31587  by fsouren
 
Thanks for your reply! I'll dig into it.

But on the other hand, I'm probably not the only one trying to index Dutch-text PDF files, I guess (English-text PDFs work great).
I just can't figure out where it goes wrong.
 #31607  by jllort
 
You should debug the temp files that are created before OCR is executed. Two weeks ago we released a portable dev environment, http://sourceforge.net/projects/openkmportabledev/. My suggestion is to download it, set some breakpoints in the PDF text extractor, and go step by step to see what's happening, especially with the tmp files (upload only one document and force indexing from crontab).
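
If stepping through with a debugger is too heavy, one of those temp images can also be run through OCR by hand to see whether the garbage already appears at that stage. The following is only a rough sketch: it assumes the `tesseract` binary is on the PATH and that Dutch language data (`nld`) is installed, and the helper names and file paths are made up for illustration. Whether OpenKM itself passes a language to Tesseract is not something this thread confirms.

```python
import shutil
import subprocess
from pathlib import Path


def build_tesseract_command(image_path, output_base, lang="nld"):
    """Return the argv list for: tesseract <image> <output_base> -l <lang>."""
    return ["tesseract", str(image_path), str(output_base), "-l", lang]


def ocr_temp_file(image_path, lang="nld"):
    """Run Tesseract on one temp image and return the recognised text.

    Returns None when the tesseract binary is not on the PATH.
    """
    if shutil.which("tesseract") is None:
        return None
    image_path = Path(image_path)
    # Tesseract appends ".txt" to the output base by itself.
    output_base = image_path.with_suffix("")
    subprocess.run(build_tesseract_command(image_path, output_base, lang), check=True)
    return Path(str(output_base) + ".txt").read_text(encoding="utf-8")
```

Calling `ocr_temp_file` on a copied temp image (the path is hypothetical) once with `lang="nld"` and once with `lang="eng"` would show whether the language setting makes a difference for the Dutch document.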
 #31665  by jllort
 
I've tested it in our online demo and it seems to work correctly there. I attach the extracted text here.

Do you have the latest OpenKM version (the nightly build, because we've corrected some issues there)? It is at http://integration.openkm.com/ and the migration information is at http://wiki.openkm.com/index.php/Migration_Guide.
 #31671  by fsouren
 
I've tried what you said and upgraded to build 8186.
But I still get a lot of garbage when indexing doc3.pdf.

Sadly, I guess I should let it go; I just can't seem to get it working :cry:
 #37384  by jllort
 
The demo is based on the professional version, not the community one (both versions have a similar base, but are quite different).

From what you've told us, you get exactly the same problem with the nightly build, right? Can you post a text file with the extracted contents here?
 #38465  by fsouren
 
Yes, I did. I even did a clean install with Ubuntu 14.04 and the OpenKM nightly.
The only thing I did was install LibreOffice and ImageMagick, then OpenKM, and replace OpenKM.war with a nightly one.
 #38473  by pavila
 
I've made some improvements to PDF text extraction; please try with tonight's nightly build.

Check that you have Tesseract installed and com.openkm.extractor.Tesseract3TextExtractor configured in registered.text.extractors. If com.openkm.extractor.CuneiformTextExtractor is present, remove it.
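
As a quick way to double-check that setting, the value of registered.text.extractors can be pasted into a small script like the one below. This is only a sketch: it assumes the value lists one class name per line, and the function name `check_extractors` is made up for illustration. Only the two class names mentioned above are checked.

```python
TESSERACT = "com.openkm.extractor.Tesseract3TextExtractor"
CUNEIFORM = "com.openkm.extractor.CuneiformTextExtractor"


def check_extractors(config_value):
    """Return a list of problems found in a registered.text.extractors value.

    Assumes one extractor class name per line (an assumption, not confirmed here).
    """
    entries = [line.strip() for line in config_value.splitlines() if line.strip()]
    problems = []
    if TESSERACT not in entries:
        problems.append("missing " + TESSERACT)
    if CUNEIFORM in entries:
        problems.append("remove " + CUNEIFORM)
    return problems
```

An empty list means both conditions from the post are met; anything else names what still has to change.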
