Open Source Document Management System | OpenKM - Searching PDF OCR

Re: Searching PDF OCR

#9271 by jllort
Thu Mar 03, 2011 9:09 am

I've tried it on my installation and it runs perfectly. The file is found when searching by the content "kofax".
It seems something strange is happening on your system.

Username

jllort

Rank

Moderator

Posts

12048

Joined

Fri Dec 21, 2007 11:23 am

Location

Sineu - ( Illes Balears ) - Spain

Re: Searching PDF OCR

#9278 by pavila
Thu Mar 03, 2011 4:43 pm

Try to reproduce the issue in the OpenKM demo: http://demo.openkm.com

Username

pavila

Rank

Moderator

Posts

3140

Joined

Tue Dec 11, 2007 6:02 pm

Location

Alicante, Spain

Re: Searching PDF OCR

#9279 by joako
Thu Mar 03, 2011 5:54 pm

pavila wrote:Try to reproduce the issue in the OpenKM demo: http://demo.openkm.com

Yes that works.

But how do I get a 100% working search index in my local install? Obviously I can't use the OpenKM demo for day-to-day use.

As I already said, I don't see any errors in the OpenKM log.

Username

joako

Rank

Expert Boarder

Posts

92

Joined

Wed Feb 23, 2011 5:31 am

Re: Searching PDF OCR

#9281 by jllort
Thu Mar 03, 2011 6:33 pm

Obviously it's something with your OpenKM configuration.
Which OpenKM version do you have installed?
Post your OpenKM.cfg here so we can take a look at your configuration parameters.
Have you made any changes to the repository.xml parameters?
Some problems cannot be solved without looking directly at the system. In the forum we can only imagine what might have happened, and that is limited by our imagination.

Re: Searching PDF OCR

#9282 by joako
Thu Mar 03, 2011 6:39 pm

I am using OpenKM 5.0.2

I started over with the repository: I deleted the directory and it was created again, so I have been running clean since yesterday (i.e., my last posted results are from a clean repository).

I'll post the config later when I have access, but it is the standard config. I see nothing about the search index in the config, and I am using PDF files that are already OCRed, so a wrong OCR setting in the config should not be an issue. What I need to know is how to debug the search index, as I asked before:

1) Is there any way to rebuild the search index?
2) Is there any way to see the status of the search index?
3) Is there any way to know when a document is supposed to be put into the index?
... etc.

Re: Searching PDF OCR

#9285 by joako
Fri Mar 04, 2011 1:55 am

OpenKM.cfg:

system.ocr=/opt/local/bin/tesseract
system.openoffice.path=/Applications/OpenOffice.org.app/Contents
system.openoffice.tasks=5
system.openoffice.port=8100
system.img2pdf=/opt/local/bin/convert
system.pdf2swf=/opt/local/bin/pdf2swf_wrapper.sh
#system.antivir=/usr/bin/clamscan
hibernate.dialect=org.hibernate.dialect.HSQLDialect
hibernate.hbm2ddl=none
#application.url=http://localhost:8080/OpenKM/com.openkm.frontend.Main/index.jsp
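
One quick sanity check on a config like this is to confirm that the external tools it points to actually exist and are executable. The following is only a minimal sketch in Python, not part of OpenKM: the function names `parse_properties` and `check_executables` are hypothetical, and the list of keys is taken from the config posted above. It only rules out broken tool paths; it does not inspect the search index itself.

```python
import os
from pathlib import Path

# Keys from the posted config whose values are paths to executables.
EXECUTABLE_KEYS = ("system.ocr", "system.img2pdf", "system.pdf2swf", "system.antivir")


def parse_properties(text):
    """Parse simple key=value lines, skipping blank lines and '#' comments."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        # Split on the first '=' only, so values such as URLs stay intact.
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props


def check_executables(props, keys=EXECUTABLE_KEYS):
    """Return {key: problem} for configured tools that are missing or not executable."""
    problems = {}
    for key in keys:
        path = props.get(key)
        if path is None:
            continue  # not configured, e.g. commented out
        if not Path(path).is_file():
            problems[key] = "not found: " + path
        elif not os.access(path, os.X_OK):
            problems[key] = "not executable: " + path
    return problems
```

Calling `check_executables(parse_properties(open(path).read()))` with the path of your own config file returns an empty dict when every configured tool is present and executable.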

Re: Searching PDF OCR

#9290 by jllort
Fri Mar 04, 2011 8:36 am

Nothing strange here, except the OpenOffice path, but that's not relevant to indexing.

Re: Searching PDF OCR

#9774 by snowman
Wed Mar 23, 2011 10:35 pm

Hello,

Are there any updates on this topic? I have a similar problem:

I OCRed scanned TIFFs with FineReader 10. The results are:
- Search for text with Acrobat: OK.
- Upload to the online demo and search: OK.
- Upload and search on my local installation: fail.
- Old PDFs that I imported months ago at installation time, and that were OCRed with some HP tool, are searchable. I have not yet checked whether this still works for new scans.

I never changed indexing_configuration.xml.
In the documentation area I cannot search for indexing because I have to log in. It looks like it is for registered/paying users only?
Are the configuration files of your online demo available for comparison?

BTW: How do I find out the exact version of OpenKM I am running? I think it is 5.0.2, but Help->About tells me 5.0.

Best regards,
Snowman

Username

snowman

Rank

Junior Boarder

Posts

36

Joined

Mon Feb 21, 2011 8:32 pm

Re: Searching PDF OCR

#9794 by jllort
Thu Mar 24, 2011 11:35 am

There is no difference in the OCR configuration parameters between the community version and what we install for supported customers; nothing is hidden. It could be some other kind of problem.

Which OpenKM build number do you have?

Re: Searching PDF OCR

#9815 by snowman
Thu Mar 24, 2011 9:14 pm

I just upgraded to 5.0.3 Build 5159 and tried again with a fresh PDF. It makes no difference whether it is PDF or PDF/A.
The PDF is not found by search in my local installation.
Again, I uploaded it to your demo site and it worked immediately.

My indexing_configuration.xml:

<?xml version="1.0"?>
<!DOCTYPE configuration SYSTEM "http://jackrabbit.apache.org/dtd/indexing-configuration-1.1.dtd">
<configuration xmlns:nt="http://www.jcp.org/jcr/nt/1.0"
               xmlns:jcr="http://www.jcp.org/jcr/1.0"
               xmlns:okm="http://www.openkm.org/1.0">
  <index-rule nodeType="okm:resource">
    <property nodeScopeIndex="true">jcr:data</property>
    <property nodeScopeIndex="false">jcr:isCheckedOut</property>
    <property nodeScopeIndex="false">jcr:lastModified</property>
    <property nodeScopeIndex="false">jcr:mimeType</property>
    <property nodeScopeIndex="false">jcr:primaryType</property>
    <property nodeScopeIndex="false">jcr:uuid</property>
    <property nodeScopeIndex="false">okm:author</property>
    <property nodeScopeIndex="false">okm:versionComment</property>
  </index-rule>
  <analyzers>
    <analyzer class="com.openkm.analysis.FilenameAnalyzer">
      <property>okm:name</property>
    </analyzer>
  </analyzers>
</configuration>
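
To compare a file like this against another installation, it can help to reduce it to the part that matters for full-text search: which properties of each node type are marked with `nodeScopeIndex`. The following is a minimal sketch in Python using only the standard library; the function name `node_scope_properties` is hypothetical and nothing here is an OpenKM or Jackrabbit API. It only reads the XML; it says nothing about what is actually stored in the index.

```python
import xml.etree.ElementTree as ET


def node_scope_properties(xml_text):
    """Map each index-rule nodeType to {property name: nodeScopeIndex flag}.

    The flag is True or False when the attribute is present, and None
    when the attribute is absent from the <property> element.
    """
    root = ET.fromstring(xml_text)
    rules = {}
    for rule in root.iter("index-rule"):
        flags = {}
        for prop in rule.findall("property"):
            name = (prop.text or "").strip()
            attr = prop.get("nodeScopeIndex")
            flags[name] = None if attr is None else attr == "true"
        rules[rule.get("nodeType")] = flags
    return rules
```

For the file above, the result would show `jcr:data` as the only property of `okm:resource` flagged true, which is what one would expect if document content is meant to be searchable.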

I don't know how to narrow down the error...

Re: Searching PDF OCR

#9853 by snowman
Sat Mar 26, 2011 10:53 am

Update: the file was indexed after at least 24 hours. Does anyone know how to set this indexing interval? I would like uploaded files to be indexed immediately.

Best regards,
Snowman

Re: Searching PDF OCR

#9936 by pavila
Tue Mar 29, 2011 6:02 pm

The indexing mechanism tries to index the document as soon as possible, depending on the server load. Usually this happens almost immediately. I have never seen it delayed by several hours. Can you check the machine load and CPU usage? Are you performing a massive document import or sporadic document uploads?

Re: Searching PDF OCR

#9951 by joako
Wed Mar 30, 2011 3:25 am

pavila wrote: The indexing mechanism tries to index the document as soon as possible, depending on the server load. Usually this happens almost immediately. I have never seen it delayed by several hours. Can you check the machine load and CPU usage? Are you performing a massive document import or sporadic document uploads?

As I have already said many times, it would be nice to see:

1) A log of indexing
2) The status of a document in the index

Because right now it seems everything is guesswork and hoping it works. Right now mine seems to be working, but if it were to stop I would have no idea how to debug it.

Re: Searching PDF OCR

#10153 by pavila
Wed Apr 06, 2011 9:11 am

I would like to see this log too, but we depend on Jackrabbit for this and it is not implemented. We are thinking of handling the indexing process ourselves instead of delegating it to Jackrabbit. That way OpenKM would have control over the process and could do more. But for now, it is not possible.

Re: Searching PDF OCR

#11122 by joako
Thu May 26, 2011 2:35 am

Can I see what is in the index database for a document?

E.g.

1) Import a document
2) See what the index has for the document
3) Make some change
4) Try again

Otherwise I don't see how it's even possible to debug these OCR issues.
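
Without access to the index itself, the import/check/change/retry loop can at least be measured from the outside by polling a search until the document shows up. The following is a minimal sketch in Python under stated assumptions: `search_fn` is a hypothetical callable you would write yourself around whatever search interface your installation exposes (it is not an OpenKM function), and it is assumed to return a non-empty result once the term is found.

```python
import time


def wait_until_indexed(search_fn, term, timeout=60.0, interval=2.0,
                       clock=time.monotonic, sleep=time.sleep):
    """Poll search_fn(term) until it returns a non-empty result.

    Returns the elapsed seconds at the first hit, or None if the
    timeout passes without a hit. search_fn is supplied by the caller.
    """
    start = clock()
    while True:
        if search_fn(term):
            return clock() - start
        if clock() - start >= timeout:
            return None
        sleep(interval)
```

Running this right after an upload, with a word known to be in the document, gives a rough indexing latency per attempt, so a change to the setup can be compared against the previous run rather than judged by feel.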
