Searching PDF OCR
PostPosted:Fri Feb 25, 2011 5:21 am
by joako
I used a LiveCD called WatchOCR to turn PDF images of scanned documents (B&W, 300 dpi or similar) into searchable PDF files. This works reasonably well and produces at least some recognizable text from the images. When the PDF is viewed you can see the image, but you can also highlight it and copy and paste the text. Other software can search the PDF.
My issue is that when I upload these PDF files into OpenKM they are not indexed. PDF files composed of text, e.g. exported from Word, are indexed with no problems.
Does anyone have a solution on how these files can be searched?
Re: Searching PDF OCR
PostPosted:Fri Feb 25, 2011 5:15 pm
by jllort
You've generated a non-indexable PDF (this can happen with, for example, PaperPort). That problem is solved in version 5.1: when a PDF consists of images, OpenKM uses OCR to extract the text from them.
If you're sure that your PDF contains text, send one to us so we can test it.
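One way to check locally whether a PDF really carries a text layer is to run it through an external text extractor. The following is only a rough sketch, assuming the `pdftotext` tool from poppler-utils is installed; it is not what OpenKM does internally, and the helper names and the `min_chars` threshold are made up for illustration.

```python
import subprocess


def extract_text_layer(pdf_path, runner=subprocess.run):
    """Return whatever text pdftotext finds in pdf_path.

    Assumes pdftotext (poppler-utils) is on the PATH. `runner` is a
    hypothetical hook so the command can be checked without running it.
    """
    # Passing "-" as the output file makes pdftotext write to stdout.
    result = runner(
        ["pdftotext", "-layout", pdf_path, "-"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def has_text_layer(text, min_chars=20):
    """Heuristic: call the PDF searchable if enough non-space text came out.

    min_chars is an arbitrary illustrative threshold, not a standard value.
    """
    return len("".join(text.split())) >= min_chars
```

Calling `has_text_layer(extract_text_layer("sample.pdf"))` on an image-only scan would be expected to give `False`, while a PDF with an OCR text layer should give `True`; `sample.pdf` is a placeholder name.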
Re: Searching PDF OCR
PostPosted:Sat Feb 26, 2011 3:06 am
by joako
Yes, I am 100% positive the PDFs contain text. I'll generate some without sensitive content and upload those; for the time being you can see my screenshot.
Re: Searching PDF OCR
PostPosted:Sat Feb 26, 2011 10:56 am
by jllort
Can you run tesseract from a terminal on that PDF file to see what text it extracts?
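Since older Tesseract versions read image files rather than PDFs, a terminal test usually means rendering a page to TIFF first. This is a minimal sketch under stated assumptions: `pdftoppm` (poppler-utils) and `tesseract` are both installed, only the first page is tested, and the function and file names are invented for the example.

```python
import subprocess
from pathlib import Path


def first_page_commands(pdf_path, work_dir):
    """Build the commands to render page 1 as TIFF and OCR it.

    Returns (render_cmd, ocr_cmd, text_file). Nothing is executed here.
    """
    base = Path(work_dir) / "page1"
    # Render only page 1 at 300 dpi into <base>.tif.
    render = ["pdftoppm", "-r", "300", "-f", "1", "-l", "1",
              "-singlefile", "-tiff", str(pdf_path), str(base)]
    # tesseract <image> <outputbase> writes its result to <outputbase>.txt.
    ocr = ["tesseract", str(base) + ".tif", str(base)]
    return render, ocr, Path(str(base) + ".txt")


def ocr_first_page(pdf_path, work_dir):
    """Run both commands and return the recognized text of page 1."""
    render, ocr, text_file = first_page_commands(pdf_path, work_dir)
    subprocess.run(render, check=True)
    subprocess.run(ocr, check=True)
    return text_file.read_text(errors="replace")
```

If the text returned by `ocr_first_page` looks reasonable, Tesseract itself is working and the problem is more likely on the indexing side.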
Re: Searching PDF OCR
PostPosted:Mon Feb 28, 2011 10:34 am
by pavila
In that case, please post a sample PDF here and I will try to check its text extraction.
Re: Searching PDF OCR
PostPosted:Mon Feb 28, 2011 11:43 pm
by joako
jllort wrote: Can you run tesseract from a terminal on that PDF file to see what text it extracts?
I'm using 5.0.2, and my understanding is that it won't support PDF OCR until 5.1. Either way, the .TIF OCR isn't working for me. I am processing these files using other software. I've attached a sample document, and it seems the OCR on it is 100% accurate.
Re: Searching PDF OCR
PostPosted:Tue Mar 01, 2011 8:36 pm
by pavila
Yes, PDF OCR is only available from OpenKM 5.1, which is not released yet. Of course, the OCR results are not accurate. OCR is a complex task, and if you want better OCR support you should use a professional, specialized OCR tool such as ABBYY FineReader. Don't expect miracles from Tesseract.
In the upcoming OpenKM 5.1 we also support the Tesseract 3.x and Cuneiform OCR tools, which are much better than Tesseract 2.0, and an external dictionary can be configured to fix badly extracted terms.
Re: Searching PDF OCR
PostPosted:Tue Mar 01, 2011 8:44 pm
by pavila
BTW, I have passed the attached PDF through the text extractor and it gives this result:
Code:
Scan to e mail failed
Contact:
Details
Primary SMTP Server: macserver.ml.loab
Connection failure:
Mail server response:
554 5.5.1 Error: no valid recipients
Secondary SMTP Server:
Connection failure:
Mail server response:
Scan Log
Re: Searching PDF OCR
PostPosted:Wed Mar 02, 2011 6:25 am
by joako
So now the question is, how can I enable the text extractor?
Re: Searching PDF OCR
PostPosted:Wed Mar 02, 2011 6:57 am
by joako
Well now I am seeing results. Maybe after a few days the database has initialized?
Re: Searching PDF OCR
PostPosted:Wed Mar 02, 2011 8:18 am
by pavila
When you install OpenKM for the first time, you start JBoss and, once startup has finished, shut it down and edit OpenKM.cfg to set hibernate.hbm2ddl=none. Then start JBoss again and enjoy OpenKM. There is no secret step, and this should work in any environment.
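For anyone scripting this step, here is a small sketch of flipping that setting in a properties-style file. It assumes OpenKM.cfg uses plain `key=value` lines; `set_property` is a hypothetical helper written for this example, not part of OpenKM, and the `create` value mentioned below is an assumption about what the file contains after the first start.

```python
def set_property(cfg_text, key, value):
    """Return cfg_text with key set to value.

    Replaces the first non-comment `key=...` line, or appends one if
    the key is missing. Other lines are left untouched.
    """
    lines = cfg_text.splitlines()
    for i, line in enumerate(lines):
        stripped = line.strip()
        if stripped.startswith("#"):
            continue  # leave commented-out settings alone
        name, sep, _ = stripped.partition("=")
        if sep and name.strip() == key:
            lines[i] = f"{key}={value}"
            break
    else:
        # Key not present anywhere: add it at the end.
        lines.append(f"{key}={value}")
    return "\n".join(lines) + "\n"
```

With JBoss stopped, reading OpenKM.cfg, passing its text through `set_property(text, "hibernate.hbm2ddl", "none")` and writing it back would turn a line such as `hibernate.hbm2ddl=create` into `hibernate.hbm2ddl=none`.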
To prevent this kind of odd behaviour, one solution is to contact us for professional support.
Re: Searching PDF OCR
PostPosted:Wed Mar 02, 2011 8:06 pm
by joako
Ok, I see. I haven't changed that yet. I need to switch to something other than the standard database... is MySQL a good choice?
Re: Searching PDF OCR
PostPosted:Wed Mar 02, 2011 10:39 pm
by jllort
MySQL and PostgreSQL are good options. For fewer than 200,000 documents, both DBMSs offer similar performance.
Re: Searching PDF OCR
PostPosted:Thu Mar 03, 2011 5:10 am
by joako
Well now I see that I am actually still having problems with the PDF search. SOME documents are found and SOME are not. I copy and paste text from the document and sometimes it is found and sometimes it is not. If a document is shown in the search results I can open it and copy + paste any text to a new search and it is found. If I copy and paste some text out of the document and it isn't found in the search then no text from that document will be found.
I can't see:
1) any way to re-build search index
2) any way to see status of search index
3) any way to know when a document is supposed to be put into the index
... etc...
Re: Searching PDF OCR
PostPosted:Thu Mar 03, 2011 5:18 am
by joako
So I singled out one file and uploaded it to the OpenKM demo site, and it is fine there. I don't understand why I can't search for "Kofax" in the attached file on my local system.
My log is as follows:
Code:
sses=423, cacheRatio=58%
23:32:03,996 INFO [BundleCache] num=3380 mem=8190k max=8192k avg=2481 hits=82794 miss=7206
23:32:04,428 INFO [BundleCache] num=1495 mem=8190k max=8192k avg=5610 hits=32178 miss=7822
23:32:04,432 INFO [LRUNodeIdCache] num=6323/10240 hits=39928 miss=40072
23:32:04,662 INFO [LRUNodeIdCache] num=910/10240 hits=2122 miss=97878
23:32:04,874 INFO [BundleCache] num=3617 mem=8191k max=8192k avg=2319 hits=90991 miss=9009
23:32:05,141 INFO [LRUNodeIdCache] num=6323/10240 hits=41586 miss=48414
23:32:28,122 INFO [BundleCache] num=1495 mem=8191k max=8192k avg=5610 hits=42175 miss=7825
00:00:03,911 INFO [LRUNodeIdCache] num=910/10240 hits=2122 miss=107878
00:00:03,950 INFO [BundleCache] num=3646 mem=8190k max=8192k avg=2300 hits=99346 miss=10654
00:00:04,308 INFO [LRUNodeIdCache] num=910/10240 hits=2122 miss=117878
00:00:04,447 INFO [BundleCache] num=3378 mem=8191k max=8192k avg=2483 hits=107968 miss=12032
00:00:04,741 INFO [LRUNodeIdCache] num=910/10240 hits=2122 miss=127878
00:00:04,862 INFO [BundleCache] num=3318 mem=8191k max=8192k avg=2527 hits=115910 miss=14090
00:00:05,082 INFO [LRUNodeIdCache] num=6323/10240 hits=45605 miss=54395
00:00:05,096 INFO [BundleCache] num=1495 mem=8189k max=8192k avg=5609 hits=52155 miss=7845
00:00:05,106 INFO [LRUNodeIdCache] num=6323/10240 hits=45605 miss=64395
00:00:05,120 INFO [BundleCache] num=1496 mem=8188k max=8192k avg=5605 hits=62149 miss=7851
00:02:03,586 INFO [LRUNodeIdCache] num=6323/10240 hits=45905 miss=74095
00:05:05,708 INFO [LRUNodeIdCache] num=910/10240 hits=2122 miss=137878
00:05:05,845 INFO [BundleCache] num=3390 mem=8191k max=8192k avg=2474 hits=124344 miss=15656
00:05:05,872 INFO [BundleCache] num=1495 mem=8189k max=8192k avg=5609 hits=72131 miss=7869
00:05:06,410 INFO [LRUNodeIdCache] num=6323/10240 hits=49475 miss=80525
00:05:17,030 INFO [LRUNodeIdCache] num=957/10240 hits=3025 miss=146975