installing Tesseract OCR into OpenKm
Posted: Thu May 04, 2017 7:00 am
by mbmni
Dear Sir/Madam,
I would like to know the step-by-step procedure for adding OCR to OpenKM Community Edition.
Thank you.
Re: installing Tesseract OCR into OpenKm
Posted: Fri May 05, 2017 5:59 am
by jllort
Re: installing Tesseract OCR into OpenKm
Posted: Fri May 05, 2017 8:44 am
by mbmni
Thank you very much.
I uploaded a JPG image file and then previewed it, but when I search I cannot retrieve the text.
Are any other configurations needed for OCR?
Re: installing Tesseract OCR into OpenKm
Posted: Fri May 05, 2017 9:34 am
by mbmni
Dear Sir/Madam,
I am getting this error after setting the configuration parameter system.ocr:
system.ocr
Can't read or execute: c:\Program
Please advise me on how to correct it.
Thank you.
Re: installing Tesseract OCR into OpenKm
Posted: Sat May 06, 2017 3:44 pm
by jllort
What is the value of your system.ocr?
Also copy here the complete stack trace of the error from catalina.log.
Finally, keep in mind that documents go into an indexing queue: they are not indexed in real time but are queued and processed periodically. You can see the queue status at Administration > Stats > Pending indexing queue (top right menu).
Re: installing Tesseract OCR into OpenKm
Posted: Sun May 07, 2017 6:49 am
by mbmni
Dear Sir/Madam,
The value of system.ocr is:
C:\Program Files (x86)\Tesseract-OCR\tesseract.exe ${fileIn} ${fileOut}
Is there anything more needed to confirm the OCR settings for OpenKM?
Thank you.
Re: installing Tesseract OCR into OpenKm
Posted: Sun May 07, 2017 9:57 am
by jllort
The parameters seem right. Is the Tesseract application actually at C:\Program Files (x86)\Tesseract-OCR\tesseract.exe?
Ensure there is only a single white space between tesseract.exe and the parameters ${fileIn} ${fileOut} (otherwise you will get an error).
Can you share your catalina.log file (only the section of the output that shows the complete stack trace of the error)?
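A side note on the error reported earlier in this thread: the message "Can't read or execute: c:\Program" stops at the first space of the configured path, which is consistent with the command being cut at whitespace. The sketch below is only an illustration in Python, not OpenKM's actual parsing code. The helper names, the sample file paths, and the space-free install path C:\Tesseract-OCR are all hypothetical.

```python
from string import Template

# Value of system.ocr as posted in this thread.
SYSTEM_OCR = r"C:\Program Files (x86)\Tesseract-OCR\tesseract.exe ${fileIn} ${fileOut}"

# Hypothetical alternative: the same command with an install path without spaces.
NO_SPACE_OCR = r"C:\Tesseract-OCR\tesseract.exe ${fileIn} ${fileOut}"


def expand(command_template, file_in, file_out):
    """Fill the ${fileIn} and ${fileOut} placeholders of a command template."""
    return Template(command_template).substitute(fileIn=file_in, fileOut=file_out)


def naive_executable(command):
    """Return the first whitespace-separated token, as a naive parser would."""
    return command.split()[0]


# Sample input and output paths, chosen only for illustration.
expanded = expand(SYSTEM_OCR, r"C:\tmp\in.jpg", r"C:\tmp\out")
expanded_no_space = expand(NO_SPACE_OCR, r"C:\tmp\in.jpg", r"C:\tmp\out")

print(naive_executable(expanded))           # prints C:\Program
print(naive_executable(expanded_no_space))  # prints C:\Tesseract-OCR\tesseract.exe
```

If whitespace splitting is indeed the cause, installing Tesseract in a directory without spaces is one thing to try. That is an assumption, not something confirmed in this thread.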
Re: installing Tesseract OCR into OpenKm
Posted: Sun May 07, 2017 1:41 pm
by mbmni
Dear Sir/Madam,
I am sharing the catalina.log file with you.
Code:
SEVERE: The web application [/OpenKM] created a ThreadLocal with key of type [com.sun.xml.bind.v2.runtime.Coordinator$1] (value [com.sun.xml.bind.v2.runtime.Coordinator$1@1417d5a]) and a value of type [java.lang.Object[]] (value [[Ljava.lang.Object;@872aea]) but failed to remove it when the web application was stopped. Threads are going to be renewed over time to try and avoid a probable memory leak.
May 07, 2017 6:34:38 PM org.apache.catalina.loader.WebappClassLoader checkThreadLocalMapForLeaks
SEVERE: The web application [/OpenKM] created a ThreadLocal with key of type [com.sun.xml.bind.v2.ClassFactory$1] (value [com.sun.xml.bind.v2.ClassFactory$1@1cd22c0]) and a value of type [java.util.WeakHashMap] (value [{class com.openkm.ws.endpoint.jaxws_asm.Logout=java.lang.ref.WeakReference@9b9ac8, class com.openkm.ws.endpoint.jaxws_asm.Login=java.lang.ref.WeakReference@1851b6b, class com.openkm.ws.endpoint.jaxws_asm.GetChildren=java.lang.ref.WeakReference@330cc0}]) but failed to remove it when the web application was stopped. Threads are going to be renewed over time to try and avoid a probable memory leak.
May 07, 2017 6:34:42 PM org.apache.coyote.AbstractProtocol stop
INFO: Stopping ProtocolHandler ["http-bio-0.0.0.0-8080"]
May 07, 2017 6:34:42 PM org.apache.coyote.AbstractProtocol stop
INFO: Stopping ProtocolHandler ["ajp-bio-127.0.0.1-8009"]
May 07, 2017 6:35:50 PM org.apache.catalina.core.AprLifecycleListener lifecycleEvent
INFO: The APR based Apache Tomcat Native library which allows optimal performance in production environments was not found on the java.library.path: C:\tomcat\bin;C:\Windows\Sun\Java\bin;C:\Windows\system32;C:\Windows;C:\Program Files\ImageMagick-6.9.8-Q16-HDRI;C:\ProgramData\Oracle\Java\javapath;C:\Program Files (x86)\NVIDIA Corporation\PhysX\Common;C:\Program Files (x86)\Intel\iCLS Client\;C:\Program Files\Intel\iCLS Client\;C:\Windows\system32;C:\Windows;C:\Windows\System32\Wbem;C:\Windows\System32\WindowsPowerShell\v1.0\;C:\Program Files\Intel\Intel(R) Management Engine Components\DAL;C:\Program Files\Intel\Intel(R) Management Engine Components\IPT;C:\Program Files (x86)\Intel\Intel(R) Management Engine Components\DAL;C:\Program Files (x86)\Intel\Intel(R) Management Engine Components\IPT;C:\Program Files (x86)\Intel\OpenCL SDK\2.0\bin\x86;C:\Program Files (x86)\Intel\OpenCL SDK\2.0\bin\x64;C:\Program Files\Intel\WiFi\bin\;C:\Program Files\Common Files\Intel\WirelessCommon\;C:\Program Files (x86)\Microsoft SQL Server\100\Tools\Binn\;C:\Program Files\Microsoft SQL Server\100\Tools\Binn\;C:\Program Files\Microsoft SQL Server\100\DTS\Binn\;C:\Program Files (x86)\Microsoft SQL Server\90\Tools\binn\;C:\php\php-5.6.0-nts-Win32-VC11-x86;C:\Program Files\Common Files\Autodesk Shared\;C:\Program Files (x86)\nodejs\;C:\Program Files (x86)\Windows Kits\8.1\Windows Performance Toolkit\;C:\Program Files\Microsoft SQL Server\110\Tools\Binn\;C:\Program Files (x86)\Common Files\Sage SData\;C:\Program Files (x86)\Common Files\Sage SBD\;c:\Program Files (x86)\Common Files\Intuit\QBPOSSDKRuntime;C:\Program Files (x86)\Java\jdk1.7.0\bin;C:\openkm-6.3.2-community\tomcat\bin;C:\Program Files (x86)\Java\jre1.8.0_91;C:\Program Files (x86)\MIT\Kerberos\bin;C:\Program Files (x86)\Skype\Phone\;C:\tomcat\lib\sigar;C:\Program Files (x86)\Tesseract-OCR;;.
May 07, 2017 6:35:54 PM org.apache.coyote.AbstractProtocol init
INFO: Initializing ProtocolHandler ["http-bio-0.0.0.0-8080"]
May 07, 2017 6:35:54 PM org.apache.coyote.AbstractProtocol init
INFO: Initializing ProtocolHandler ["ajp-bio-127.0.0.1-8009"]
May 07, 2017 6:35:54 PM org.apache.catalina.startup.Catalina load
INFO: Initialization processed in 7157 ms
May 07, 2017 6:35:54 PM org.apache.catalina.core.StandardService startInternal
INFO: Starting service Catalina
May 07, 2017 6:35:54 PM org.apache.catalina.core.StandardEngine startInternal
INFO: Starting Servlet Engine: Apache Tomcat/7.0.61
May 07, 2017 6:35:55 PM org.apache.catalina.startup.HostConfig deployWAR
INFO: Deploying web application archive C:\tomcat\webapps\OpenKM.war
May 07, 2017 6:35:55 PM org.apache.catalina.loader.WebappClassLoader validateJarFile
INFO: validateJarFile(C:\tomcat\webapps\OpenKM\WEB-INF\lib\servlet-api-2.5-20081211.jar) - jar not loaded. See Servlet Spec 3.0, section 10.7.2. Offending class: javax/servlet/Servlet.class
May 07, 2017 6:35:55 PM org.apache.catalina.loader.WebappClassLoader validateJarFile
INFO: validateJarFile(C:\tomcat\webapps\OpenKM\WEB-INF\lib\servlet-api-6.0.36.jar) - jar not loaded. See Servlet Spec 3.0, section 10.7.2. Offending class: javax/servlet/Servlet.class
May 07, 2017 6:38:40 PM net.xeoh.plugins.base.impl.classpath.loader.FileLoader loadFrom
WARNING: Supplied path does not exist. Unable to add plugins from there.
May 07, 2017 6:38:40 PM org.apache.catalina.startup.HostConfig deployWAR
INFO: Deployment of web application archive C:\tomcat\webapps\OpenKM.war has finished in 165,604 ms
May 07, 2017 6:38:40 PM org.apache.catalina.startup.HostConfig deployDirectory
INFO: Deploying web application directory C:\tomcat\webapps\ROOT
May 07, 2017 6:38:42 PM org.apache.catalina.startup.HostConfig deployDirectory
INFO: Deployment of web application directory C:\tomcat\webapps\ROOT has finished in 2,011 ms
May 07, 2017 6:38:42 PM org.apache.coyote.AbstractProtocol start
INFO: Starting ProtocolHandler ["http-bio-0.0.0.0-8080"]
May 07, 2017 6:38:42 PM org.apache.coyote.AbstractProtocol start
INFO: Starting ProtocolHandler ["ajp-bio-127.0.0.1-8009"]
May 07, 2017 6:38:42 PM org.apache.catalina.startup.Catalina start
INFO: Server startup in 168391 ms
Re: installing Tesseract OCR into OpenKm
Posted: Sun May 07, 2017 2:39 pm
by mbmni
Dear Sir/Madam,
When I execute the following SQL query, it shows me the extracted text.
But when I preview the file and search for a word in it, I do not get any search result.
May I please know the reason?
Thank you.
Re: installing Tesseract OCR into OpenKm
Posted: Sun May 07, 2017 5:36 pm
by jllort
It might be a problem with the Lucene search engine. Go to Administration > Tools > Rebuild indexes > Rebuild Lucene indexes (during the process the repository will be in read-only mode while the whole repository is reindexed).
Re: installing Tesseract OCR into OpenKm
Posted: Wed May 10, 2017 12:56 pm
by mbmni
Dear Sir/Madam,
After running Rebuild indexes > Lucene index, I uploaded a new .gif file and previewed it.
Then I typed a word in the search text box in the preview tab and pressed the Enter key, and I got the following message:
"finished searching the document. No more matches were found."
Re: installing Tesseract OCR into OpenKm
Posted: Fri May 12, 2017 8:13 am
by mbmni
Dear Sir/Madam,
I have installed an LDAP server using Apache Directory.
I have created only one user.
It includes only the following attributes:
uid
cn
sn
userPassword
Which LDAP configuration settings are essential in OpenKM Community Edition so that I can log in to the system as the LDAP user I created?
Thank you.
Re: installing Tesseract OCR into OpenKm
Posted: Fri May 12, 2017 10:37 am
by mbmni
Dear Sir/Madam,
I successfully added my LDAP user to OpenKM Community Edition.
After adding it, I logged out of the system.
Now I cannot log in to the system.
Please tell me the reason.
Thank you.
Re: installing Tesseract OCR into OpenKm
Posted: Sat May 13, 2017 11:21 am
by jllort
Do not merge several questions into the same topic, because it causes a lot of confusion for me and for other community readers, who lose track of the topic. If you have a problem with the OCR engine we can continue talking about it; otherwise, add a new post for each new topic.
About the GIF image: check from a terminal what happens when you process the GIF image (does Tesseract extract text or not? You might have an image whose text quality is too low for the Tesseract OCR engine). If it works from the terminal, then it will work from the OpenKM side. Keep in mind that documents go through a queue to be processed, so check whether your document has been processed yet. Take a look at:
https://docs.openkm.com/kcenter/view/ok ... ctionqueue
https://docs.openkm.com/kcenter/view/ok ... ption.html (the extracted text is stored in the NDC_TEXT_EXTRACTED column); you might use the database query tool to check it:
https://docs.openkm.com/kcenter/view/ok ... query.html
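The terminal check suggested above can also be scripted. This is a minimal sketch in Python, assuming the usual Tesseract command line form (tesseract, then the image, then an output base name, which produces a .txt file with that base name). The function names and any file names passed to them are hypothetical.

```python
import shutil
import subprocess
from pathlib import Path


def build_command(exe, image, out_base):
    """Build the argument list: tesseract <image> <output base>."""
    return [exe, image, out_base]


def ocr_text(exe, image, out_base):
    """Run Tesseract on one image and return the extracted text.

    Returns None when the executable cannot be found, so the caller can
    tell a missing installation apart from an image with no readable text.
    """
    if shutil.which(exe) is None and not Path(exe).is_file():
        return None
    # Passing a list avoids any splitting of paths that contain spaces.
    subprocess.run(build_command(exe, image, out_base), check=True)
    # Tesseract writes its plain-text result to <out_base>.txt.
    return Path(out_base + ".txt").read_text(encoding="utf-8", errors="replace")
```

For example, calling ocr_text("tesseract", "scan.gif", "scan_out") on a machine where Tesseract is on the PATH should return whatever text the engine recognises. An empty result would point to the image quality rather than to the OpenKM configuration.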
Re: installing Tesseract OCR into OpenKm
Posted: Mon May 15, 2017 3:29 am
by mbmni
Dear Sir/Madam,
When I checked the NDC_TEXT_EXTRACTED column, I could see the text extracted from the scanned document.
Now, when I go to the Desktop tab, preview the file, and search for a character or word, I do not get any matches.
I also ran the Lucene index rebuild as you advised earlier.
Please look into this.
Thank you.