• Searching PDF OCR

  • OpenKM has many interesting features, but requires some configuration process to show its full potential.
 #9143  by joako
 
What I've done is use a LiveCD called WatchOCR to turn PDF images of scanned documents (B&W, 300 dpi or similar) into searchable PDF files. This process appears to work reasonably well and produces at least some recognizable text from the images. When the PDF is viewed you can see the image, but you can also highlight it and copy and paste the text. Other software can search these PDFs.

However, my issue is that when I upload these PDF files into OpenKM they are not indexed. PDF files composed of text, e.g. generated from Word files, are indexed with no problems.

Does anyone have a solution on how these files can be searched?
 #9169  by jllort
 
You've generated a non-indexable PDF (this can happen when using, for example, PaperPort). That problem is solved in version 5.1: when a PDF consists of images, OpenKM uses OCR to extract the text from them.

If you're sure that your PDF contains text, send one to us so we can test it.
 #9183  by joako
 
Yes, I am 100% positive the PDFs contain text. I'll generate some without sensitive content and upload those; for the time being you can see my screenshot.
Attachments
Screen shot 2011-02-25 at 10.03.17 PM.png (72.21 KiB)
 #9188  by jllort
 
Can you run tesseract from a terminal on that PDF file to see what text it extracts?
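A minimal sketch of how that check could be scripted, not a confirmed OpenKM procedure. It assumes Ghostscript (`gs`) and `tesseract` are installed and on the PATH, and that the PDF is rasterized first because the tesseract command line takes an image rather than a PDF. The function names, the 300 dpi setting and the `tiffg4` device are illustrative choices; older Tesseract releases may not handle multi-page TIFFs.

```python
import shutil
import subprocess
from pathlib import Path


def build_ocr_commands(pdf_path, work_dir):
    """Return two command lines: PDF -> TIFF (Ghostscript), then TIFF -> text (tesseract)."""
    pdf = Path(pdf_path)
    tif = Path(work_dir) / (pdf.stem + ".tif")
    out_base = Path(work_dir) / pdf.stem  # tesseract appends ".txt" to this base name
    # -o sets the output file; tiffg4 is a black-and-white TIFF device (assumed suitable for B&W scans)
    rasterize = ["gs", "-q", "-sDEVICE=tiffg4", "-r300", "-o", str(tif), str(pdf)]
    ocr = ["tesseract", str(tif), str(out_base)]
    return rasterize, ocr


def run_ocr(pdf_path, work_dir):
    """Run both steps and return the recognized text, or None if a tool is missing."""
    if not (shutil.which("gs") and shutil.which("tesseract")):
        return None
    for cmd in build_ocr_commands(pdf_path, work_dir):
        subprocess.run(cmd, check=True)
    txt = Path(work_dir) / (Path(pdf_path).stem + ".txt")
    return txt.read_text(errors="replace")
```

Calling `run_ocr("sample.pdf", ".")` with a hypothetical `sample.pdf` would print nothing by itself; the returned string is what Tesseract recognized from the page image, independent of any text layer already embedded in the PDF.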
 #9208  by pavila
 
In that case, please post a sample PDF here and I will check its text extraction.
 #9217  by joako
 
jllort wrote: Can you run tesseract from a terminal on that PDF file to see what text it extracts?
I'm using 5.0.2 and my understanding is that it won't support PDF OCR until 5.1. Either way, the .TIF OCR isn't working for me. I am processing these files using other software. I've attached a sample document, and it seems the OCR on it is 100% accurate.
Attachments
(13.7 KiB)
 #9237  by pavila
 
Yes, PDF OCR is only available from OpenKM 5.1, which is not released yet. Of course, the OCR results are not accurate. OCR is a complex task, and if you want better OCR support you should use a professional, specialized OCR tool such as ABBYY FineReader. Don't expect miracles from Tesseract.

The upcoming OpenKM 5.1 will also support the Tesseract 3.x and Cuneiform OCR tools, which are much better than Tesseract 2.0, and it will let us configure an external dictionary to fix badly extracted terms.
 #9238  by pavila
 
BTW, I passed the attached PDF through the text extractor and it gave this result:
Code:
Scan to e   mail failed
 Contact:
 Details
 Primary SMTP Server: macserver.ml.loab
 Connection failure:
 Mail server response:
 554 5.5.1 Error: no valid recipients
 Secondary SMTP Server:
 Connection failure:
 Mail server response:
 Scan Log
 #9241  by joako
 
So now the question is, how can I enable the text extractor?
 #9243  by joako
 
Well now I am seeing results. Maybe after a few days the database has initialized?
 #9246  by pavila
 
When you install OpenKM for the first time, you start JBoss, and once startup has finished you have to shut it down and edit OpenKM.cfg to set hibernate.hbm2ddl=none. Then start JBoss again and enjoy OpenKM. There is no secret step, and this should work in any environment.

To prevent this kind of odd behaviour, one solution is to contact us for professional support.
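The OpenKM.cfg change described above can be sketched as a small helper that edits the file's text. This is only an illustration: it assumes OpenKM.cfg is a plain `key=value` properties file, the helper name `set_property` is hypothetical, and the starting value `create` in the usage lines is a placeholder rather than a documented default. JBoss should be stopped before the file is rewritten.

```python
def set_property(cfg_text, key, value):
    """Return cfg_text with key=value set, replacing an existing entry or appending one."""
    lines = cfg_text.splitlines()
    replaced = False
    for i, line in enumerate(lines):
        stripped = line.strip()
        # Leave comments and lines without "=" untouched
        if stripped.startswith("#") or "=" not in stripped:
            continue
        name = stripped.split("=", 1)[0].strip()
        if name == key:
            lines[i] = f"{key}={value}"
            replaced = True
    if not replaced:
        lines.append(f"{key}={value}")
    return "\n".join(lines) + "\n"


# Usage with placeholder content (the "create" value is an assumption for illustration)
before = "hibernate.dialect=x\nhibernate.hbm2ddl=create\n"
after = set_property(before, "hibernate.hbm2ddl", "none")
```

Working on the text rather than the file keeps the helper easy to check before the result is written back to the real OpenKM.cfg.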
 #9255  by joako
 
OK, I see. I haven't changed that yet. I need to switch to something other than the standard database... is MySQL a good choice?
 #9264  by jllort
 
MySQL and PostgreSQL are both good options. For fewer than 200,000 documents, both DBMSs offer similar performance.
 #9269  by joako
 
Well now I see that I am actually still having problems with the PDF search. SOME documents are found and SOME are not. I copy and paste text from the document and sometimes it is found and sometimes it is not. If a document is shown in the search results I can open it and copy + paste any text to a new search and it is found. If I copy and paste some text out of the document and it isn't found in the search then no text from that document will be found.

I can't see:

1) any way to re-build search index
2) any way to see status of search index
3) any way to know when a document is supposed to be put into the index
... etc...
 #9270  by joako
 
So I singled out one file and uploaded it to the OpenKM demo site, and it is fine there. I don't understand why I can't search for "Kofax" in the attached file on my local system.

My log is as follows:
Code:
sses=423, cacheRatio=58%
23:32:03,996 INFO  [BundleCache] num=3380 mem=8190k max=8192k avg=2481 hits=82794 miss=7206
23:32:04,428 INFO  [BundleCache] num=1495 mem=8190k max=8192k avg=5610 hits=32178 miss=7822
23:32:04,432 INFO  [LRUNodeIdCache] num=6323/10240 hits=39928 miss=40072
23:32:04,662 INFO  [LRUNodeIdCache] num=910/10240 hits=2122 miss=97878
23:32:04,874 INFO  [BundleCache] num=3617 mem=8191k max=8192k avg=2319 hits=90991 miss=9009
23:32:05,141 INFO  [LRUNodeIdCache] num=6323/10240 hits=41586 miss=48414
23:32:28,122 INFO  [BundleCache] num=1495 mem=8191k max=8192k avg=5610 hits=42175 miss=7825
00:00:03,911 INFO  [LRUNodeIdCache] num=910/10240 hits=2122 miss=107878
00:00:03,950 INFO  [BundleCache] num=3646 mem=8190k max=8192k avg=2300 hits=99346 miss=10654
00:00:04,308 INFO  [LRUNodeIdCache] num=910/10240 hits=2122 miss=117878
00:00:04,447 INFO  [BundleCache] num=3378 mem=8191k max=8192k avg=2483 hits=107968 miss=12032
00:00:04,741 INFO  [LRUNodeIdCache] num=910/10240 hits=2122 miss=127878
00:00:04,862 INFO  [BundleCache] num=3318 mem=8191k max=8192k avg=2527 hits=115910 miss=14090
00:00:05,082 INFO  [LRUNodeIdCache] num=6323/10240 hits=45605 miss=54395
00:00:05,096 INFO  [BundleCache] num=1495 mem=8189k max=8192k avg=5609 hits=52155 miss=7845
00:00:05,106 INFO  [LRUNodeIdCache] num=6323/10240 hits=45605 miss=64395
00:00:05,120 INFO  [BundleCache] num=1496 mem=8188k max=8192k avg=5605 hits=62149 miss=7851
00:02:03,586 INFO  [LRUNodeIdCache] num=6323/10240 hits=45905 miss=74095
00:05:05,708 INFO  [LRUNodeIdCache] num=910/10240 hits=2122 miss=137878
00:05:05,845 INFO  [BundleCache] num=3390 mem=8191k max=8192k avg=2474 hits=124344 miss=15656
00:05:05,872 INFO  [BundleCache] num=1495 mem=8189k max=8192k avg=5609 hits=72131 miss=7869
00:05:06,410 INFO  [LRUNodeIdCache] num=6323/10240 hits=49475 miss=80525
00:05:17,030 INFO  [LRUNodeIdCache] num=957/10240 hits=3025 miss=146975
Attachments
(52.07 KiB)
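The log excerpt above appears to contain only cache statistics (`BundleCache` and `LRUNodeIdCache` lines with `hits=` and `miss=` counters) and no indexing or text-extraction messages. As a rough sketch for reading such lines, the following computes a hit ratio per logged line; the regular expression is written against the format shown in this post only, and the function name is hypothetical.

```python
import re
from collections import defaultdict

# Matches e.g. "[BundleCache] num=... hits=82794 miss=7206"
LINE_RE = re.compile(r"\[(?P<cache>\w+)\].*?hits=(?P<hits>\d+) miss=(?P<miss>\d+)")


def hit_ratios(log_text):
    """Return {cache name: [hit ratio for each matching log line]}."""
    ratios = defaultdict(list)
    for line in log_text.splitlines():
        m = LINE_RE.search(line)
        if not m:
            continue  # skip truncated or unrelated lines
        hits, miss = int(m["hits"]), int(m["miss"])
        total = hits + miss
        if total:
            ratios[m["cache"]].append(hits / total)
    return dict(ratios)


sample = (
    "23:32:03,996 INFO  [BundleCache] num=3380 mem=8190k max=8192k avg=2481 hits=82794 miss=7206\n"
    "23:32:04,662 INFO  [LRUNodeIdCache] num=910/10240 hits=2122 miss=97878\n"
)
ratios = hit_ratios(sample)
```

These numbers describe cache behaviour only, so they do not by themselves explain why a given document is missing from the search index.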
