Open Source Document Management System | OpenKM

Posted: **Fri Feb 25, 2011 5:21 am**

What I've done is used a LiveCD called WatchOCR to take .PDF images of scanned documents (B&W, 300 dpi or similar) to generate searchable PDF files. This process appears to work reasonably well and appears to produce at least some recognizable text out of the images. When the PDF is viewed you can see the image but also highlight it and copy & paste the text. Using other software the PDF can be searched.

However, my issue is that when I upload these PDF files into OpenKM they are not indexed. PDF files composed of text (e.g. generated from Word files) are indexed without problems.

Does anyone have a solution on how these files can be searched?

Posted: **Fri Feb 25, 2011 5:15 pm**

You've generated a non-indexable PDF (this can happen with PaperPort, for example). That problem is solved in version 5.1: when a PDF consists of images, OpenKM uses OCR to extract the text from them.

If you're sure that your PDF contains text, send one to us so we can test it.

Posted: **Sat Feb 26, 2011 3:06 am**

Yes, I am 100% positive the PDFs contain text. I'll generate some without sensitive content and upload those; for the time being you can see my screenshot.

Posted: **Sat Feb 26, 2011 10:56 am**

Can you run tesseract from a terminal on that PDF file to see what text it extracts?
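As a rough sketch of the check suggested above: Tesseract reads image files rather than PDFs, so one way to try it is to rasterize a page first and then run OCR on the image. The example below assumes that poppler's `pdftoppm` and a recent `tesseract` build that can read PNG are installed and on the PATH; the function names, the `ocr_check` prefix and `scan.pdf` are hypothetical, and this is not how OpenKM itself invokes OCR.

```python
import shutil
import subprocess
from pathlib import Path


def build_ocr_commands(pdf_path, work_prefix, lang="eng", dpi=300):
    """Return two command lines: rasterize page 1 of the PDF, then OCR the image."""
    image_path = f"{work_prefix}.png"
    # pdftoppm: render only the first page to <work_prefix>.png at the given DPI.
    rasterize = ["pdftoppm", "-r", str(dpi), "-png", "-singlefile",
                 "-f", "1", "-l", "1", str(pdf_path), str(work_prefix)]
    # tesseract: writes the recognized text to <work_prefix>.txt.
    ocr = ["tesseract", image_path, str(work_prefix), "-l", lang]
    return [rasterize, ocr]


def ocr_first_page(pdf_path, work_prefix="ocr_check"):
    """Run both commands and return the recognized text, or None if a tool is missing."""
    commands = build_ocr_commands(pdf_path, work_prefix)
    if any(shutil.which(cmd[0]) is None for cmd in commands):
        return None
    for cmd in commands:
        subprocess.run(cmd, check=True)
    return Path(f"{work_prefix}.txt").read_text(encoding="utf-8", errors="replace")


if __name__ == "__main__":
    # Hypothetical file name; prints the commands without running them.
    for cmd in build_ocr_commands("scan.pdf", "ocr_check"):
        print(" ".join(cmd))
```

If the text returned by `ocr_first_page` is empty or garbled, the problem is likely in the OCR step rather than in indexing.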

Posted: **Mon Feb 28, 2011 10:34 am**

In that case, please post a sample PDF here and I will check its text extraction.

Posted: **Mon Feb 28, 2011 11:43 pm**

jllort wrote: Can you execute tesseract from terminal with that pdf file to see what text extracts?

I'm using 5.0.2, and my understanding is that it won't support PDF OCR until 5.1. Either way, the .TIF OCR isn't working for me. I am processing these files using other software. I've attached a sample document, and the OCR on it seems to be 100% accurate.

Posted: **Tue Mar 01, 2011 8:36 pm**

Yes, PDF OCR is only available from OpenKM 5.1, which is not yet released. Of course, OCR results are not perfectly accurate. OCR is a complex task, and if you want better OCR support you should turn to professional, specialized OCR tools such as ABBYY FineReader. Don't expect miracles from Tesseract.

The upcoming OpenKM 5.1 also supports the Tesseract 3.x and Cuneiform OCR tools, which are much better than Tesseract 2.0, and lets you configure an external dictionary to fix badly extracted terms.

Posted: **Tue Mar 01, 2011 8:44 pm**

BTW, I passed the attached PDF through the text extractor and got this result:

Code:

Scan to e   mail failed
 Contact:
 Details
 Primary SMTP Server: macserver.ml.loab
 Connection failure:
 Mail server response:
 554 5.5.1 Error: no valid recipients
 Secondary SMTP Server:
 Connection failure:
 Mail server response:
 Scan Log

Posted: **Wed Mar 02, 2011 6:25 am**

So now the question is, how can I enable the text extractor?

Posted: **Wed Mar 02, 2011 6:57 am**

Well now I am seeing results. Maybe after a few days the database has initialized?

Posted: **Wed Mar 02, 2011 8:18 am**

When you install OpenKM for the first time, you start JBoss; once the startup has finished, you have to shut it down and edit OpenKM.cfg to set hibernate.hbm2ddl=none. Then start JBoss again and enjoy OpenKM. There is no secret step, and this should work in any environment.
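The configuration change can also be scripted. The following is a minimal sketch, assuming OpenKM.cfg is a plain `key=value` text file with `#` comments; the helper name `set_property` and the starting value `create` in the usage lines are illustrative assumptions, not taken from OpenKM. Run it only while JBoss is stopped.

```python
def set_property(text, key, value):
    """Return config text with key=value set, replacing an existing entry or appending one."""
    lines = text.splitlines()
    replaced = False
    for i, line in enumerate(lines):
        stripped = line.strip()
        # Leave comments and lines without an assignment untouched.
        if stripped.startswith("#") or "=" not in stripped:
            continue
        name = stripped.split("=", 1)[0].strip()
        if name == key:
            lines[i] = f"{key}={value}"
            replaced = True
    if not replaced:
        lines.append(f"{key}={value}")
    return "\n".join(lines) + "\n"


# Usage on an in-memory sample (the "create" value is a made-up starting point).
before = "# sample config\nhibernate.hbm2ddl=create\n"
print(set_property(before, "hibernate.hbm2ddl", "none"))
```

To apply it to a real file, read OpenKM.cfg into a string, pass it through `set_property`, and write the result back after taking a backup.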

To prevent this kind of odd behaviour, one solution is to contact us for professional support.

Posted: **Wed Mar 02, 2011 8:06 pm**

OK, I see. I haven't changed that yet. I need to switch to something other than the standard database... is MySQL a good choice?

Posted: **Wed Mar 02, 2011 10:39 pm**

MySQL and PostgreSQL are both good options. For fewer than 200,000 documents, both DBMSs offer similar performance.

Posted: **Thu Mar 03, 2011 5:10 am**

Well now I see that I am actually still having problems with the PDF search. SOME documents are found and SOME are not. I copy and paste text from the document and sometimes it is found and sometimes it is not. If a document is shown in the search results I can open it and copy + paste any text to a new search and it is found. If I copy and paste some text out of the document and it isn't found in the search then no text from that document will be found.

I can't see:

1) any way to re-build search index
2) any way to see status of search index
3) any way to know when a document is supposed to be put into the index
... etc...

Posted: **Thu Mar 03, 2011 5:18 am**

So I singled out one file and uploaded it to the OpenKM demo site, and it is fine there. I don't understand why I can't search for "Kofax" in the attached file on my local system.

My log is as follows:

Code:

sses=423, cacheRatio=58%
23:32:03,996 INFO  [BundleCache] num=3380 mem=8190k max=8192k avg=2481 hits=82794 miss=7206
23:32:04,428 INFO  [BundleCache] num=1495 mem=8190k max=8192k avg=5610 hits=32178 miss=7822
23:32:04,432 INFO  [LRUNodeIdCache] num=6323/10240 hits=39928 miss=40072
23:32:04,662 INFO  [LRUNodeIdCache] num=910/10240 hits=2122 miss=97878
23:32:04,874 INFO  [BundleCache] num=3617 mem=8191k max=8192k avg=2319 hits=90991 miss=9009
23:32:05,141 INFO  [LRUNodeIdCache] num=6323/10240 hits=41586 miss=48414
23:32:28,122 INFO  [BundleCache] num=1495 mem=8191k max=8192k avg=5610 hits=42175 miss=7825
00:00:03,911 INFO  [LRUNodeIdCache] num=910/10240 hits=2122 miss=107878
00:00:03,950 INFO  [BundleCache] num=3646 mem=8190k max=8192k avg=2300 hits=99346 miss=10654
00:00:04,308 INFO  [LRUNodeIdCache] num=910/10240 hits=2122 miss=117878
00:00:04,447 INFO  [BundleCache] num=3378 mem=8191k max=8192k avg=2483 hits=107968 miss=12032
00:00:04,741 INFO  [LRUNodeIdCache] num=910/10240 hits=2122 miss=127878
00:00:04,862 INFO  [BundleCache] num=3318 mem=8191k max=8192k avg=2527 hits=115910 miss=14090
00:00:05,082 INFO  [LRUNodeIdCache] num=6323/10240 hits=45605 miss=54395
00:00:05,096 INFO  [BundleCache] num=1495 mem=8189k max=8192k avg=5609 hits=52155 miss=7845
00:00:05,106 INFO  [LRUNodeIdCache] num=6323/10240 hits=45605 miss=64395
00:00:05,120 INFO  [BundleCache] num=1496 mem=8188k max=8192k avg=5605 hits=62149 miss=7851
00:02:03,586 INFO  [LRUNodeIdCache] num=6323/10240 hits=45905 miss=74095
00:05:05,708 INFO  [LRUNodeIdCache] num=910/10240 hits=2122 miss=137878
00:05:05,845 INFO  [BundleCache] num=3390 mem=8191k max=8192k avg=2474 hits=124344 miss=15656
00:05:05,872 INFO  [BundleCache] num=1495 mem=8189k max=8192k avg=5609 hits=72131 miss=7869
00:05:06,410 INFO  [LRUNodeIdCache] num=6323/10240 hits=49475 miss=80525
00:05:17,030 INFO  [LRUNodeIdCache] num=957/10240 hits=3025 miss=146975
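
The lines above look like periodic cache statistics (BundleCache and LRUNodeIdCache are cache classes in Apache Jackrabbit) rather than indexing or text-extraction messages, so they probably do not explain the missing search results. For anyone who still wants to read them, here is a small sketch that turns the `hits=` and `miss=` counters into hit ratios; the function name `hit_ratios` is hypothetical, and the two sample lines are copied from the log.

```python
import re

# Matches e.g. "[BundleCache] ... hits=82794 miss=7206".
LINE_RE = re.compile(r"\[(?P<cache>\w+)\].*?hits=(?P<hits>\d+) miss=(?P<miss>\d+)")


def hit_ratios(log_text):
    """Yield (cache name, hit ratio) for each log line that reports hits and misses."""
    for line in log_text.splitlines():
        match = LINE_RE.search(line)
        if match is None:
            continue  # truncated or unrelated line
        hits = int(match["hits"])
        miss = int(match["miss"])
        total = hits + miss
        if total:
            yield match["cache"], hits / total


sample = """23:32:03,996 INFO  [BundleCache] num=3380 mem=8190k max=8192k avg=2481 hits=82794 miss=7206
23:32:04,662 INFO  [LRUNodeIdCache] num=910/10240 hits=2122 miss=97878"""

for name, ratio in hit_ratios(sample):
    print(f"{name}: {ratio:.1%}")
# prints:
# BundleCache: 92.0%
# LRUNodeIdCache: 2.1%
```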

Searching PDF OCR
