installing Tesseract OCR into OpenKm

Posted: Thu May 04, 2017 7:00 am
by mbmni
Dear Sir/Madam,

I would like to know the step-by-step procedure for adding OCR to OpenKM Community Edition.

Thank you.

Re: installing Tesseract OCR into OpenKm

Posted: Fri May 05, 2017 5:59 am
by jllort
Here you will find all the information you need: https://docs.openkm.com/kcenter/view/ok ... ngine.html

Re: installing Tesseract OCR into OpenKm

Posted: Fri May 05, 2017 8:44 am
by mbmni
Thank you very much.

I uploaded a JPG image file and then previewed it.

But when I search, I cannot retrieve the text.

Is any other configuration needed for OCR?

Re: installing Tesseract OCR into OpenKm

Posted: Fri May 05, 2017 9:34 am
by mbmni
Dear Sir/Madam,

I am getting this error after setting the system.ocr configuration parameter:

system.ocr
Can't read or execute: c:\Program

Please advise me on how to correct it.

Thank you.

Re: installing Tesseract OCR into OpenKm

Posted: Sat May 06, 2017 3:44 pm
by jllort
What is the value of your system.ocr?
Also, copy the complete stack trace of the error from catalina.log here.

Finally, keep in mind that documents go into an indexing queue. That means they are not indexed in real time; they wait in a queue and are processed periodically. You can see the queue status at Administration > Stats > Pending indexing queue (top right menu).

Re: installing Tesseract OCR into OpenKm

Posted: Sun May 07, 2017 6:49 am
by mbmni
Dear Sir/Madam,

The value of system.ocr is:

C:\Program Files (x86)\Tesseract-OCR\tesseract.exe ${fileIn} ${fileOut}

Is there anything more I need to confirm in the OCR settings for OpenKM?

Thank you.

Re: installing Tesseract OCR into OpenKm

Posted: Sun May 07, 2017 9:57 am
by jllort
The parameters seem right. Is the Tesseract application actually located at C:\Program Files (x86)\Tesseract-OCR\tesseract.exe?
Ensure there is only a single space between tesseract.exe and the parameters ${fileIn} ${fileOut} (otherwise you will get an error).
Can you share your catalina.log file (only the section of the output that shows the complete stack trace of the error)?
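
The reported error stops at c:\Program, which is exactly where the first space in the path falls. The sketch below is only an illustration of that observation, assuming (this is not confirmed anywhere in this thread) that the command line is split on whitespace before the executable is checked. It is not OpenKM's actual code, and the space-free install path is a hypothetical example.

```python
# Illustration only: NOT OpenKM's code. Assumes the command is split on whitespace.

# The value reported in this thread.
SYSTEM_OCR = r"C:\Program Files (x86)\Tesseract-OCR\tesseract.exe ${fileIn} ${fileOut}"

# Hypothetical install location without spaces, for comparison.
NO_SPACE_OCR = r"C:\Tesseract-OCR\tesseract.exe ${fileIn} ${fileOut}"


def first_token(command: str) -> str:
    """Return what a naive whitespace split would treat as the executable."""
    return command.split()[0]


print(first_token(SYSTEM_OCR))    # prints C:\Program
print(first_token(NO_SPACE_OCR))  # prints C:\Tesseract-OCR\tesseract.exe
```

If that assumption holds, a path containing spaces can never be resolved to tesseract.exe, whereas a path without spaces survives the split intact.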

Re: installing Tesseract OCR into OpenKm

Posted: Sun May 07, 2017 1:41 pm
by mbmni
Dear Sir/Madam,

I am sharing the catalina.log file with you.
Code:
SEVERE: The web application [/OpenKM] created a ThreadLocal with key of type [com.sun.xml.bind.v2.runtime.Coordinator$1] (value [com.sun.xml.bind.v2.runtime.Coordinator$1@1417d5a]) and a value of type [java.lang.Object[]] (value [[Ljava.lang.Object;@872aea]) but failed to remove it when the web application was stopped. Threads are going to be renewed over time to try and avoid a probable memory leak.
May 07, 2017 6:34:38 PM org.apache.catalina.loader.WebappClassLoader checkThreadLocalMapForLeaks
SEVERE: The web application [/OpenKM] created a ThreadLocal with key of type [com.sun.xml.bind.v2.ClassFactory$1] (value [com.sun.xml.bind.v2.ClassFactory$1@1cd22c0]) and a value of type [java.util.WeakHashMap] (value [{class com.openkm.ws.endpoint.jaxws_asm.Logout=java.lang.ref.WeakReference@9b9ac8, class com.openkm.ws.endpoint.jaxws_asm.Login=java.lang.ref.WeakReference@1851b6b, class com.openkm.ws.endpoint.jaxws_asm.GetChildren=java.lang.ref.WeakReference@330cc0}]) but failed to remove it when the web application was stopped. Threads are going to be renewed over time to try and avoid a probable memory leak.
May 07, 2017 6:34:42 PM org.apache.coyote.AbstractProtocol stop
INFO: Stopping ProtocolHandler ["http-bio-0.0.0.0-8080"]
May 07, 2017 6:34:42 PM org.apache.coyote.AbstractProtocol stop
INFO: Stopping ProtocolHandler ["ajp-bio-127.0.0.1-8009"]
May 07, 2017 6:35:50 PM org.apache.catalina.core.AprLifecycleListener lifecycleEvent
INFO: The APR based Apache Tomcat Native library which allows optimal performance in production environments was not found on the java.library.path: C:\tomcat\bin;C:\Windows\Sun\Java\bin;C:\Windows\system32;C:\Windows;C:\Program Files\ImageMagick-6.9.8-Q16-HDRI;C:\ProgramData\Oracle\Java\javapath;C:\Program Files (x86)\NVIDIA Corporation\PhysX\Common;C:\Program Files (x86)\Intel\iCLS Client\;C:\Program Files\Intel\iCLS Client\;C:\Windows\system32;C:\Windows;C:\Windows\System32\Wbem;C:\Windows\System32\WindowsPowerShell\v1.0\;C:\Program Files\Intel\Intel(R) Management Engine Components\DAL;C:\Program Files\Intel\Intel(R) Management Engine Components\IPT;C:\Program Files (x86)\Intel\Intel(R) Management Engine Components\DAL;C:\Program Files (x86)\Intel\Intel(R) Management Engine Components\IPT;C:\Program Files (x86)\Intel\OpenCL SDK\2.0\bin\x86;C:\Program Files (x86)\Intel\OpenCL SDK\2.0\bin\x64;C:\Program Files\Intel\WiFi\bin\;C:\Program Files\Common Files\Intel\WirelessCommon\;C:\Program Files (x86)\Microsoft SQL Server\100\Tools\Binn\;C:\Program Files\Microsoft SQL Server\100\Tools\Binn\;C:\Program Files\Microsoft SQL Server\100\DTS\Binn\;C:\Program Files (x86)\Microsoft SQL Server\90\Tools\binn\;C:\php\php-5.6.0-nts-Win32-VC11-x86;C:\Program Files\Common Files\Autodesk Shared\;C:\Program Files (x86)\nodejs\;C:\Program Files (x86)\Windows Kits\8.1\Windows Performance Toolkit\;C:\Program Files\Microsoft SQL Server\110\Tools\Binn\;C:\Program Files (x86)\Common Files\Sage SData\;C:\Program Files (x86)\Common Files\Sage SBD\;c:\Program Files (x86)\Common Files\Intuit\QBPOSSDKRuntime;C:\Program Files (x86)\Java\jdk1.7.0\bin;C:\openkm-6.3.2-community\tomcat\bin;C:\Program Files (x86)\Java\jre1.8.0_91;C:\Program Files (x86)\MIT\Kerberos\bin;C:\Program Files (x86)\Skype\Phone\;C:\tomcat\lib\sigar;C:\Program Files (x86)\Tesseract-OCR;;.
May 07, 2017 6:35:54 PM org.apache.coyote.AbstractProtocol init
INFO: Initializing ProtocolHandler ["http-bio-0.0.0.0-8080"]
May 07, 2017 6:35:54 PM org.apache.coyote.AbstractProtocol init
INFO: Initializing ProtocolHandler ["ajp-bio-127.0.0.1-8009"]
May 07, 2017 6:35:54 PM org.apache.catalina.startup.Catalina load
INFO: Initialization processed in 7157 ms
May 07, 2017 6:35:54 PM org.apache.catalina.core.StandardService startInternal
INFO: Starting service Catalina
May 07, 2017 6:35:54 PM org.apache.catalina.core.StandardEngine startInternal
INFO: Starting Servlet Engine: Apache Tomcat/7.0.61
May 07, 2017 6:35:55 PM org.apache.catalina.startup.HostConfig deployWAR
INFO: Deploying web application archive C:\tomcat\webapps\OpenKM.war
May 07, 2017 6:35:55 PM org.apache.catalina.loader.WebappClassLoader validateJarFile
INFO: validateJarFile(C:\tomcat\webapps\OpenKM\WEB-INF\lib\servlet-api-2.5-20081211.jar) - jar not loaded. See Servlet Spec 3.0, section 10.7.2. Offending class: javax/servlet/Servlet.class
May 07, 2017 6:35:55 PM org.apache.catalina.loader.WebappClassLoader validateJarFile
INFO: validateJarFile(C:\tomcat\webapps\OpenKM\WEB-INF\lib\servlet-api-6.0.36.jar) - jar not loaded. See Servlet Spec 3.0, section 10.7.2. Offending class: javax/servlet/Servlet.class
May 07, 2017 6:38:40 PM net.xeoh.plugins.base.impl.classpath.loader.FileLoader loadFrom
WARNING: Supplied path does not exist. Unable to add plugins from there.
May 07, 2017 6:38:40 PM org.apache.catalina.startup.HostConfig deployWAR
INFO: Deployment of web application archive C:\tomcat\webapps\OpenKM.war has finished in 165,604 ms
May 07, 2017 6:38:40 PM org.apache.catalina.startup.HostConfig deployDirectory
INFO: Deploying web application directory C:\tomcat\webapps\ROOT
May 07, 2017 6:38:42 PM org.apache.catalina.startup.HostConfig deployDirectory
INFO: Deployment of web application directory C:\tomcat\webapps\ROOT has finished in 2,011 ms
May 07, 2017 6:38:42 PM org.apache.coyote.AbstractProtocol start
INFO: Starting ProtocolHandler ["http-bio-0.0.0.0-8080"]
May 07, 2017 6:38:42 PM org.apache.coyote.AbstractProtocol start
INFO: Starting ProtocolHandler ["ajp-bio-127.0.0.1-8009"]
May 07, 2017 6:38:42 PM org.apache.catalina.startup.Catalina start
INFO: Server startup in 168391 ms

Re: installing Tesseract OCR into OpenKm

Posted: Sun May 07, 2017 2:39 pm
by mbmni
Dear Sir/Madam,

When I execute the following SQL query, it shows me the extracted text.
Code:
SELECT * FROM OKM_NODE_DOCUMENT;
But when I preview the file and search for a word in it, I do not get any search results.

May I please know the reason?

Thank you.

Re: installing Tesseract OCR into OpenKm

Posted: Sun May 07, 2017 5:36 pm
by jllort
It might be a problem with the Lucene search engine. Go to Administration > Tools > Rebuild indexes > Rebuild Lucene indexes (the repository goes into read-only mode while the whole repository is being reindexed).

Re: installing Tesseract OCR into OpenKm

Posted: Wed May 10, 2017 12:56 pm
by mbmni
Dear Sir/Madam,

After running Rebuild indexes > Rebuild Lucene indexes, I uploaded a new .gif file and previewed it.

I then typed a word into the search box in the preview tab and pressed Enter.

I got the following message:

"finished searching the document. No more matches were found."

Re: installing Tesseract OCR into OpenKm

Posted: Fri May 12, 2017 8:13 am
by mbmni
Dear Sir/Madam,

I have installed an LDAP server using Apache Directory.

I have created only one user.
It has only the following attributes:

uid
cn
sn
userPassword

Which LDAP configuration settings are mandatory in OpenKM Community Edition so that I can log in to the system with this LDAP user?

Thank you.

Re: installing Tesseract OCR into OpenKm

Posted: Fri May 12, 2017 10:37 am
by mbmni
Dear Sir/Madam,

I successfully added my LDAP user to OpenKM Community Edition.

After adding it, I logged out of the system.

Now I can't log in to the system.

Please tell me the reason.

Thank you.

Re: installing Tesseract OCR into OpenKm

Posted: Sat May 13, 2017 11:21 am
by jllort
Please do not merge several questions into the same topic; it causes a lot of confusion for me and for other community readers, who lose track of the topic. If you have a problem with the OCR engine we can continue talking about it here; otherwise, open a new post for each new topic.

About the GIF image: check from a terminal what happens when you process the GIF image (does Tesseract extract text or not? You might have an image whose text quality is too low for the Tesseract OCR engine). If it works from the terminal, it will work from the OpenKM side. Keep in mind that documents go through a queue to be processed, so check whether your document has already been processed. Take a look at:
https://docs.openkm.com/kcenter/view/ok ... ctionqueue
https://docs.openkm.com/kcenter/view/ok ... ption.html (the extracted text is stored in the NDC_TEXT_EXTRACTED column). You can use the database query tool to check it: https://docs.openkm.com/kcenter/view/ok ... query.html
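
As a rough way to run the terminal check described above, here is a minimal Python sketch. It assumes the tesseract executable is on the PATH and uses Tesseract's usual "tesseract <image> <output base>" invocation, which writes the recognized text to "<output base>.txt". The helper names and file names are placeholders for illustration and have nothing to do with OpenKM itself.

```python
import subprocess
import sys
from pathlib import Path


def build_tesseract_command(image: str, out_base: str, exe: str = "tesseract") -> list:
    """Build the argument list: executable, input image, output base name."""
    return [exe, image, out_base]


def extract_text(image: str, out_base: str, exe: str = "tesseract") -> str:
    """Run Tesseract on one image and return the text it wrote to <out_base>.txt."""
    # check=True raises CalledProcessError if Tesseract exits with an error.
    subprocess.run(build_tesseract_command(image, out_base, exe), check=True)
    return Path(out_base + ".txt").read_text(encoding="utf-8")


if __name__ == "__main__" and len(sys.argv) == 3:
    # Example call: python ocr_check.py scan.gif scan
    print(extract_text(sys.argv[1], sys.argv[2]))
```

If the printed text is empty or garbled here, the image quality is the likely culprit rather than the OpenKM configuration.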

Re: installing Tesseract OCR into OpenKm

Posted: Mon May 15, 2017 3:29 am
by mbmni
Dear Sir/Madam,

When I checked the NDC_TEXT_EXTRACTED column, I could see the text extracted from the scanned document.

However, when I go to the Desktop tab, preview the file, and search for a character or word, the search does not find it.

I also rebuilt the Lucene indexes, as you advised earlier.

Please look into this.

Thank you.