• Installing Tesseract OCR into OpenKM

  • We tried to make OpenKM as intuitive as possible, but advice is always welcome.
Forum rules: Before asking something, please check the documentation wiki or use the forum's search feature. And remember that we don't have a crystal ball and are not mind readers, so if you post about an issue, tell us which OpenKM version you are using, along with your browser and operating system version. For more info read How to Report Bugs Effectively.
 #43752  by mbmni
 
Dear Sir/Madam,

I would like to know the step-by-step procedure for adding OCR to OpenKM Community Edition.

Thank you
 #43755  by mbmni
 
Thank you very much.

I uploaded a JPG image file.

Then I previewed the file.

But when I search, I cannot retrieve the text.

Are any other configurations needed for OCR?
 #43756  by mbmni
 
Dear Sir/Madam,

I am getting this error after setting the configuration parameter system.ocr.

system.ocr
Can't read or execute: c:\Program

Please advise me on how to correct this.

Thank you
 #43761  by jllort
 
What is the value of your system.ocr?
Also copy here the complete stack trace of the error from catalina.log.

Finally, keep in mind that documents go into an indexing queue. That means documents are not indexed in real time; they wait in a queue and are processed periodically. You can see the queue status at Administration > Stats > Pending indexing queue (top right menu).
 #43766  by mbmni
 
Dear Sir/Madam,

The value of system.ocr is

C:\Program Files (x86)\Tesseract-OCR\tesseract.exe ${fileIn} ${fileOut}

Is there anything more to confirm in the OCR settings for OpenKM?

Thank you
 #43768  by jllort
 
The parameters seem right. Is the Tesseract application located at C:\Program Files (x86)\Tesseract-OCR\tesseract.exe?
Ensure there is only a single space between tesseract.exe and the parameters ${fileIn} ${fileOut} (otherwise you will get an error).
Can you share your catalina.log file (only the section of the output that shows the complete stack trace of the error)?
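
The path in the reported error stops at c:\Program, which is what you would see if the configured command were split on whitespace before being executed. The sketch below only illustrates that effect; it is not OpenKM's parsing code, and both function names are made up for the example.

```python
# Illustration only: NOT OpenKM's implementation. It shows why an unquoted
# executable path containing spaces can end up truncated at the first space.

# Value of system.ocr as posted in this thread.
SYSTEM_OCR = r"C:\Program Files (x86)\Tesseract-OCR\tesseract.exe ${fileIn} ${fileOut}"


def executable_if_split_on_whitespace(command):
    """Return what a plain whitespace tokenizer would take as the executable."""
    return command.split()[0]


def split_before_placeholders(command):
    """Hypothetical alternative: everything before the first ${...} is the executable."""
    head, sep, tail = command.partition(" ${")
    args = (sep.strip() + tail).split() if sep else []
    return head, args


print(executable_if_split_on_whitespace(SYSTEM_OCR))  # C:\Program
print(split_before_placeholders(SYSTEM_OCR)[0])
```

If this is what happens on your installation, one thing to try (an assumption, not something confirmed in this thread) is installing Tesseract into a folder whose path has no spaces and pointing system.ocr at that location.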
 #43769  by mbmni
 
Dear Sir/Madam,

I am sharing the catalina.log file with you.
Code:
SEVERE: The web application [/OpenKM] created a ThreadLocal with key of type [com.sun.xml.bind.v2.runtime.Coordinator$1] (value [com.sun.xml.bind.v2.runtime.Coordinator$1@1417d5a]) and a value of type [java.lang.Object[]] (value [[Ljava.lang.Object;@872aea]) but failed to remove it when the web application was stopped. Threads are going to be renewed over time to try and avoid a probable memory leak.
May 07, 2017 6:34:38 PM org.apache.catalina.loader.WebappClassLoader checkThreadLocalMapForLeaks
SEVERE: The web application [/OpenKM] created a ThreadLocal with key of type [com.sun.xml.bind.v2.ClassFactory$1] (value [com.sun.xml.bind.v2.ClassFactory$1@1cd22c0]) and a value of type [java.util.WeakHashMap] (value [{class com.openkm.ws.endpoint.jaxws_asm.Logout=java.lang.ref.WeakReference@9b9ac8, class com.openkm.ws.endpoint.jaxws_asm.Login=java.lang.ref.WeakReference@1851b6b, class com.openkm.ws.endpoint.jaxws_asm.GetChildren=java.lang.ref.WeakReference@330cc0}]) but failed to remove it when the web application was stopped. Threads are going to be renewed over time to try and avoid a probable memory leak.
May 07, 2017 6:34:42 PM org.apache.coyote.AbstractProtocol stop
INFO: Stopping ProtocolHandler ["http-bio-0.0.0.0-8080"]
May 07, 2017 6:34:42 PM org.apache.coyote.AbstractProtocol stop
INFO: Stopping ProtocolHandler ["ajp-bio-127.0.0.1-8009"]
May 07, 2017 6:35:50 PM org.apache.catalina.core.AprLifecycleListener lifecycleEvent
INFO: The APR based Apache Tomcat Native library which allows optimal performance in production environments was not found on the java.library.path: C:\tomcat\bin;C:\Windows\Sun\Java\bin;C:\Windows\system32;C:\Windows;C:\Program Files\ImageMagick-6.9.8-Q16-HDRI;C:\ProgramData\Oracle\Java\javapath;C:\Program Files (x86)\NVIDIA Corporation\PhysX\Common;C:\Program Files (x86)\Intel\iCLS Client\;C:\Program Files\Intel\iCLS Client\;C:\Windows\system32;C:\Windows;C:\Windows\System32\Wbem;C:\Windows\System32\WindowsPowerShell\v1.0\;C:\Program Files\Intel\Intel(R) Management Engine Components\DAL;C:\Program Files\Intel\Intel(R) Management Engine Components\IPT;C:\Program Files (x86)\Intel\Intel(R) Management Engine Components\DAL;C:\Program Files (x86)\Intel\Intel(R) Management Engine Components\IPT;C:\Program Files (x86)\Intel\OpenCL SDK\2.0\bin\x86;C:\Program Files (x86)\Intel\OpenCL SDK\2.0\bin\x64;C:\Program Files\Intel\WiFi\bin\;C:\Program Files\Common Files\Intel\WirelessCommon\;C:\Program Files (x86)\Microsoft SQL Server\100\Tools\Binn\;C:\Program Files\Microsoft SQL Server\100\Tools\Binn\;C:\Program Files\Microsoft SQL Server\100\DTS\Binn\;C:\Program Files (x86)\Microsoft SQL Server\90\Tools\binn\;C:\php\php-5.6.0-nts-Win32-VC11-x86;C:\Program Files\Common Files\Autodesk Shared\;C:\Program Files (x86)\nodejs\;C:\Program Files (x86)\Windows Kits\8.1\Windows Performance Toolkit\;C:\Program Files\Microsoft SQL Server\110\Tools\Binn\;C:\Program Files (x86)\Common Files\Sage SData\;C:\Program Files (x86)\Common Files\Sage SBD\;c:\Program Files (x86)\Common Files\Intuit\QBPOSSDKRuntime;C:\Program Files (x86)\Java\jdk1.7.0\bin;C:\openkm-6.3.2-community\tomcat\bin;C:\Program Files (x86)\Java\jre1.8.0_91;C:\Program Files (x86)\MIT\Kerberos\bin;C:\Program Files (x86)\Skype\Phone\;C:\tomcat\lib\sigar;C:\Program Files (x86)\Tesseract-OCR;;.
May 07, 2017 6:35:54 PM org.apache.coyote.AbstractProtocol init
INFO: Initializing ProtocolHandler ["http-bio-0.0.0.0-8080"]
May 07, 2017 6:35:54 PM org.apache.coyote.AbstractProtocol init
INFO: Initializing ProtocolHandler ["ajp-bio-127.0.0.1-8009"]
May 07, 2017 6:35:54 PM org.apache.catalina.startup.Catalina load
INFO: Initialization processed in 7157 ms
May 07, 2017 6:35:54 PM org.apache.catalina.core.StandardService startInternal
INFO: Starting service Catalina
May 07, 2017 6:35:54 PM org.apache.catalina.core.StandardEngine startInternal
INFO: Starting Servlet Engine: Apache Tomcat/7.0.61
May 07, 2017 6:35:55 PM org.apache.catalina.startup.HostConfig deployWAR
INFO: Deploying web application archive C:\tomcat\webapps\OpenKM.war
May 07, 2017 6:35:55 PM org.apache.catalina.loader.WebappClassLoader validateJarFile
INFO: validateJarFile(C:\tomcat\webapps\OpenKM\WEB-INF\lib\servlet-api-2.5-20081211.jar) - jar not loaded. See Servlet Spec 3.0, section 10.7.2. Offending class: javax/servlet/Servlet.class
May 07, 2017 6:35:55 PM org.apache.catalina.loader.WebappClassLoader validateJarFile
INFO: validateJarFile(C:\tomcat\webapps\OpenKM\WEB-INF\lib\servlet-api-6.0.36.jar) - jar not loaded. See Servlet Spec 3.0, section 10.7.2. Offending class: javax/servlet/Servlet.class
May 07, 2017 6:38:40 PM net.xeoh.plugins.base.impl.classpath.loader.FileLoader loadFrom
WARNING: Supplied path does not exist. Unable to add plugins from there.
May 07, 2017 6:38:40 PM org.apache.catalina.startup.HostConfig deployWAR
INFO: Deployment of web application archive C:\tomcat\webapps\OpenKM.war has finished in 165,604 ms
May 07, 2017 6:38:40 PM org.apache.catalina.startup.HostConfig deployDirectory
INFO: Deploying web application directory C:\tomcat\webapps\ROOT
May 07, 2017 6:38:42 PM org.apache.catalina.startup.HostConfig deployDirectory
INFO: Deployment of web application directory C:\tomcat\webapps\ROOT has finished in 2,011 ms
May 07, 2017 6:38:42 PM org.apache.coyote.AbstractProtocol start
INFO: Starting ProtocolHandler ["http-bio-0.0.0.0-8080"]
May 07, 2017 6:38:42 PM org.apache.coyote.AbstractProtocol start
INFO: Starting ProtocolHandler ["ajp-bio-127.0.0.1-8009"]
May 07, 2017 6:38:42 PM org.apache.catalina.startup.Catalina start
INFO: Server startup in 168391 ms
 #43770  by mbmni
 
Dear Sir/Madam,

When I execute the following SQL query, it shows me the extracted text.
Code:
SELECT * FROM OKM_NODE_DOCUMENT;
But when I preview the file and search for a word in it, I do not get any search result.

Can I please know the reason?

Thank you.
 #43771  by jllort
 
It might be a problem with the Lucene search engine. Go to Administration > Tools > Rebuild indexes > Rebuild Lucene indexes (during the process the repository switches to read-only mode while the whole repository is reindexed).
 #43782  by mbmni
 
Dear Sir/Madam,

After running Rebuild indexes > Rebuild Lucene indexes, I uploaded a new .gif file and previewed it.

Then I typed a word in the search text box in the preview tab.

After typing the word and pressing the Enter key, I get the following message.

"finished searching the document. No more matches were found."
 #43786  by mbmni
 
Dear Sir/Madam,

I have installed an LDAP server using Apache Directory.

I have created only one user.
It includes only the following attributes:

uid
cn
sn
userPassword

What are the most important and compulsory LDAP settings that I must configure in OpenKM Community Edition to log in with the LDAP user I created?

Thank you
 #43787  by mbmni
 
Dear Sir/Madam,

I successfully added my LDAP user to OpenKM Community Edition.

After adding it, I logged out of the system.

Now I cannot log in to the system.

Please tell me the reason.

Thank you
 #43801  by jllort
 
Do not merge several questions into the same topic, because it causes a lot of confusion for me and for other community readers, who lose track of the topic. If you have a problem with the OCR engine we can continue talking about it; otherwise open a new post for each new topic.

About the GIF image: run Tesseract from a terminal and see what happens when it processes the GIF image (does Tesseract extract text or not? Your image might contain text of too low a quality for the Tesseract OCR engine). If it works from the terminal, then it will work from the OpenKM side. Keep in mind that documents go through a queue to be processed, so check whether your document has already been processed. Take a look:
https://docs.openkm.com/kcenter/view/ok ... ctionqueue
https://docs.openkm.com/kcenter/view/ok ... ption.html (the extracted text is stored in the NDC_TEXT_EXTRACTED column). You can use the database query tool to check it: https://docs.openkm.com/kcenter/view/ok ... query.html
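
As a rough sketch of the terminal check described above, the following script runs Tesseract on one image outside OpenKM and prints whatever text it extracts. It assumes the standard Tesseract command-line form (tesseract <image> <output base>, which writes <output base>.txt). The image and output paths are hypothetical placeholders, and the executable path is the one quoted earlier in this thread.

```python
import subprocess
from pathlib import Path


def build_tesseract_command(exe, image, output_base):
    """Build the argument list: tesseract <image> <output base>."""
    return [str(exe), str(image), str(output_base)]


def run_ocr(exe, image, output_base):
    """Run Tesseract once and return the text it wrote to <output base>.txt."""
    cmd = build_tesseract_command(exe, image, output_base)
    # Passing a list avoids problems with spaces in the executable path.
    result = subprocess.run(cmd, capture_output=True, text=True)
    if result.returncode != 0:
        raise RuntimeError("tesseract failed: " + result.stderr.strip())
    return Path(str(output_base) + ".txt").read_text(encoding="utf-8")


if __name__ == "__main__":
    # Hypothetical paths: adjust them to your own installation and test file.
    exe = r"C:\Program Files (x86)\Tesseract-OCR\tesseract.exe"
    text = run_ocr(exe, r"C:\temp\scan.gif", r"C:\temp\scan_out")
    print(text if text.strip() else "(no text extracted)")
```

If the script prints no text for an image, the problem is more likely the image quality or the Tesseract installation than the OpenKM configuration.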
 #43807  by mbmni
 
Dear Sir/Madam,

When I checked the column NDC_TEXT_EXTRACTED, I could see the text extracted from the scanned document.

Now, when I go to the Desktop tab, preview the file and search for a character or word, the search does not find it.

I also rebuilt the Lucene indexes, as you advised me earlier.

Please look into this.

Thank you.
