• Searching PDF OCR

  • OpenKM has many interesting features, but requires some configuration process to show its full potential.
 #9271  by jllort
 
I've tried it on my installation and it runs perfectly. The file is found when searching its content for kofax.
It seems something strange is happening on your system.
 #9279  by joako
 
pavila wrote: Try to reproduce the issue in the OpenKM demo: http://demo.openkm.com
Yes, that works.

But how do I get a 100% working search index in a local install? I obviously can't use the OpenKM demo for day-to-day use.

Like I said already, I don't see any errors in the OpenKM log.
 #9281  by jllort
 
Obviously it's something with your OpenKM configuration.
Which OpenKM version do you have installed?
Post your OpenKM.cfg here so we can take a look at your configuration parameters.
Have you made any changes to the repository.xml parameters?
Some problems cannot be solved without looking directly at the system. In the forum we can only imagine what might have happened, and that is limited by our imagination.
 #9282  by joako
 
I am using OpenKM 5.0.2

I started over with the repository: I deleted the directory and it was created again, so I have been running clean since yesterday (i.e., my last posted results are from after starting with a clean repository).

I'll post the config later when I have access, but it is the standard config. I see nothing about the search index in the config, and I am using PDF files that are already OCRed, so a wrong OCR setting in the config cannot be the issue. What I need more is to know how to debug the search index, as I asked before:

1) any way to rebuild the search index
2) any way to see the status of the search index
3) any way to know when a document is supposed to be put into the index
... etc...
 #9285  by joako
 
OpenKM.conf:
system.ocr=/opt/local/bin/tesseract
system.openoffice.path=/Applications/OpenOffice.org.app/Contents
system.openoffice.tasks=5
system.openoffice.port=8100
system.img2pdf=/opt/local/bin/convert
system.pdf2swf=/opt/local/bin/pdf2swf_wrapper.sh
#system.antivir=/usr/bin/clamscan
hibernate.dialect=org.hibernate.dialect.HSQLDialect
hibernate.hbm2ddl=none
#application.url=http://localhost:8080/OpenKM/com.openkm.frontend.Main/index.jsp
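
A quick sanity check for a config like the one above is to confirm that every external tool path it names actually exists on the machine. The following is only a minimal sketch in Python, not an OpenKM tool: the helper names `parse_properties` and `missing_paths` are made up for illustration, and the list of keys treated as paths is an assumption based on the entries posted here.

```python
import os

def parse_properties(text):
    """Parse simple key=value lines, skipping blank lines and '#' comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props

def missing_paths(props, keys):
    """Return the keys (from `keys`) whose configured path does not exist on disk."""
    return [k for k in keys if k in props and not os.path.exists(props[k])]

# Config text copied from the post above.
CONFIG = """\
system.ocr=/opt/local/bin/tesseract
system.openoffice.path=/Applications/OpenOffice.org.app/Contents
system.openoffice.tasks=5
system.openoffice.port=8100
system.img2pdf=/opt/local/bin/convert
system.pdf2swf=/opt/local/bin/pdf2swf_wrapper.sh
#system.antivir=/usr/bin/clamscan
hibernate.dialect=org.hibernate.dialect.HSQLDialect
hibernate.hbm2ddl=none
"""

# Assumption: these are the entries that point at files or directories.
PATH_KEYS = ["system.ocr", "system.openoffice.path", "system.img2pdf", "system.pdf2swf"]

props = parse_properties(CONFIG)
for key in missing_paths(props, PATH_KEYS):
    print("configured path not found:", key, "->", props[key])
```

This only rules out broken tool paths; it says nothing about the state of the search index itself.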
 #9290  by jllort
 
Nothing strange here, except the OpenOffice path, but that is not relevant to indexing.
 #9774  by snowman
 
Hello,

Are there any updates on this topic? I have a similar problem:

I OCRed scanned TIFs with Finereader 10. Results are:
-Search for text with Acrobat, ok.
-Upload to online demo and search, ok.
-Upload and search on local installation, fail.
-Old PDFs that I imported months ago at installation time, and that were OCRed with some HP tool, are searchable. I have not yet checked whether that still works for new scans.

I never changed indexing_configuration.xml.
In the documentation area I cannot search for indexing because I have to log in. Looks like it is for registered/paying users only?
Are the configuration files of your online demo available for comparison?

BTW: How do I find out the exact version of OpenKM I am running? I think it is 5.0.2, but Help->About tells me 5.0.

Best regards,
Snowman
 #9794  by jllort
 
There's no difference in the OCR configuration parameters between the community version and what we install for supported customers; nothing is hidden, for sure. It could be some other kind of problem.

Which OpenKM build (number) do you have?
 #9815  by snowman
 
I just upgraded to 5.0.3 Build 5159 and tried again with a fresh PDF. It also does not matter whether it is PDF or PDF/A.
The PDF is not found by search in my local installation.
Again I uploaded to your demo site and it worked immediately.

My indexing_configuration.xml:
<?xml version="1.0"?>
<!DOCTYPE configuration SYSTEM "http://jackrabbit.apache.org/dtd/indexing-configuration-1.1.dtd">
<configuration xmlns:nt="http://www.jcp.org/jcr/nt/1.0"
               xmlns:jcr="http://www.jcp.org/jcr/1.0"
               xmlns:okm="http://www.openkm.org/1.0">
  <index-rule nodeType="okm:resource">
    <property nodeScopeIndex="true">jcr:data</property>
    <property nodeScopeIndex="false">jcr:isCheckedOut</property>
    <property nodeScopeIndex="false">jcr:lastModified</property>
    <property nodeScopeIndex="false">jcr:mimeType</property>
    <property nodeScopeIndex="false">jcr:primaryType</property>
    <property nodeScopeIndex="false">jcr:uuid</property>
    <property nodeScopeIndex="false">okm:author</property>
    <property nodeScopeIndex="false">okm:versionComment</property>
  </index-rule>
  <analyzers>
        <analyzer class="com.openkm.analysis.FilenameAnalyzer">
            <property>okm:name</property>
        </analyzer>
  </analyzers>
</configuration>
I don't know what to check or how to narrow down the error...
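
One way to narrow it down is to list which properties the configuration marks for the node-scope (full-text) index, and compare that list against a known-good file. The sketch below is only an illustration in Python, using the standard library; the function name `node_scope_properties` is hypothetical, the XML is an abbreviated copy of the file above, and it counts only properties that explicitly set `nodeScopeIndex="true"`.

```python
import xml.etree.ElementTree as ET

def node_scope_properties(xml_text):
    """Map each index-rule nodeType to the properties explicitly marked nodeScopeIndex="true"."""
    root = ET.fromstring(xml_text)
    result = {}
    for rule in root.iter("index-rule"):
        included = [
            prop.text.strip()
            for prop in rule.findall("property")
            if prop.get("nodeScopeIndex") == "true" and prop.text
        ]
        result[rule.get("nodeType")] = included
    return result

# Abbreviated copy of the indexing_configuration.xml posted above.
SAMPLE = """\
<?xml version="1.0"?>
<!DOCTYPE configuration SYSTEM "http://jackrabbit.apache.org/dtd/indexing-configuration-1.1.dtd">
<configuration xmlns:nt="http://www.jcp.org/jcr/nt/1.0"
               xmlns:jcr="http://www.jcp.org/jcr/1.0"
               xmlns:okm="http://www.openkm.org/1.0">
  <index-rule nodeType="okm:resource">
    <property nodeScopeIndex="true">jcr:data</property>
    <property nodeScopeIndex="false">jcr:mimeType</property>
    <property nodeScopeIndex="false">okm:author</property>
  </index-rule>
</configuration>
"""

rules = node_scope_properties(SAMPLE)
for node_type, names in rules.items():
    print(node_type, "->", names)
```

For the file above this reports only `jcr:data` under `okm:resource`, which is the document content, so the configuration itself does not appear to exclude content from the full-text index.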
 #9853  by snowman
 
Update: the file was indexed after at least 24 hours. Does anyone know how to set this indexing interval? I would like uploaded files to be indexed immediately.

Best regards,
Snowman
 #9936  by pavila
 
The indexing mechanism tries to index the document as soon as possible, depending on the server load. Usually it is performed almost immediately. I have never seen it delayed by several hours. Can you check the machine load and CPU usage? Are you performing a massive document import or sporadic document uploads?
 #9951  by joako
 
pavila wrote: The indexing mechanism tries to index the document as soon as possible, depending on the server load. Usually it is performed almost immediately. I have never seen it delayed by several hours. Can you check the machine load and CPU usage? Are you performing a massive document import or sporadic document uploads?

I already said many times, it would be nice to see:

1) Log of indexing
2) Status of a document in index

Because right now it seems everything is guess and hope it works. Right now mine seems to be working, but if it were to stop I have no idea how it could be debugged.
 #10153  by pavila
 
I would like to see this log too, but we depend on Jackrabbit for this and it is not implemented. We are thinking of handling the indexing process ourselves instead of delegating it to Jackrabbit. That way OpenKM would have control over the process and could do more things. But for now, it is not possible.
 #11122  by joako
 
Can I see what is in index database for a document?

E.g.

1) Import document
2) See what index has for document
3) Make some change
4) try again

Otherwise I don't see how it's even possible to debug these OCR issues.
