Open Source Document Management System | OpenKM

Posted: **Fri Mar 21, 2014 10:48 am**

I cannot see the OCR data capture option (see subject) in the menu; a screenshot is attached as 1.JPG.

I'm using OpenKM 6.2.6 build 8125 on Windows XP (the upload manager does not work in the stable version). The configuration settings are:
system.ocr: C:\Program Files\Tesseract-OCR\tesseract -l rus+eng ${fileIn} ${fileOut}
system.ocr.rotate: 90;180;270;
system.pdf.force.ocr: True

Posted: **Sun Mar 23, 2014 5:48 am**

Where have you seen this option?

Posted: **Mon Mar 24, 2014 7:11 am**

In the user's manual:
http://wiki.openkm.com/index.php/OCR_data_capture

Posted: **Wed Mar 26, 2014 10:52 am**

OK, now I understand. This option is only part of the Professional version; it is not present in the Community version.

Posted: **Tue Apr 15, 2014 9:57 am**

According to http://www.openkm.com/en/overview/compa ... sions.html ("General features" block), this option is included in the Community version too; only Zonal OCR ("Modules" block) is left out.

Posted: **Thu Apr 17, 2014 7:10 am**

I think you are confusing OMR with Zone OCR in the table. The table clearly shows that Zone OCR is not included in the Community version, and it probably never will be.

Posted: **Fri Apr 25, 2014 8:13 am**

OK. Thanks to your explanation and this thread: http://forum.openkm.com/viewtopic.php?f ... orm#p26151 it's clear now. I think I understand how it should work, but it doesn't. Here is the log:

2014-04-25 11:45:00,109 [Thread-58] INFO  com.openkm.extractor.TextExtractorWorker - processSerial.Working on {docUuid=29af1d2d-a4be-41a4-8edc-642c66d9a507, docPath=/okm:root/phototest.tif, docVerUuid=311e4074-cabc-45bf-9d07-1405c246e305, date=Fri Apr 25 11:41:31 YEKT 2014}
2014-04-25 11:45:06,203 [Thread-58] INFO  com.openkm.util.DocumentUtils - Using OpenOffice dictionary: C:\Program Files\OpenOffice 4\share\extensions\install\dict_ru_RU-0.3.6.oxt
2014-04-25 11:45:06,328 [Thread-58] WARN  com.openkm.extractor.Tesseract3TextExtractor - Failed to extract OCR text
java.lang.IllegalStateException: java.util.concurrent.ExecutionException: java.lang.NumberFormatException: For input string: "KOI8-R"
	at org.dts.spell.dictionary.OpenOfficeSpellDictionary.waitToLoad(OpenOfficeSpellDictionary.java:289)
	at org.dts.spell.dictionary.OpenOfficeSpellDictionary.getSuggestions(OpenOfficeSpellDictionary.java:264)
	at com.openkm.util.DocumentUtils.spellChecker(DocumentUtils.java:59)
	at com.openkm.extractor.Tesseract3TextExtractor.doOcr(Tesseract3TextExtractor.java:143)
	at com.openkm.extractor.Tesseract3TextExtractor.extractText(Tesseract3TextExtractor.java:82)
	at com.openkm.extractor.RegisteredExtractors.getText(RegisteredExtractors.java:211)
	at com.openkm.extractor.RegisteredExtractors.getText(RegisteredExtractors.java:172)
	at com.openkm.dao.NodeDocumentDAO.textExtractorHelper(NodeDocumentDAO.java:1300)
	at com.openkm.extractor.TextExtractorWorker.processSerial(TextExtractorWorker.java:138)
	at com.openkm.extractor.TextExtractorWorker.processQueue(TextExtractorWorker.java:125)
	at com.openkm.extractor.TextExtractorWorker.run(TextExtractorWorker.java:80)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:606)
	at bsh.Reflect.invokeOnMethod(Unknown Source)
	at bsh.Reflect.invokeObjectMethod(Unknown Source)
	at bsh.BSHPrimarySuffix.doName(Unknown Source)
	at bsh.BSHPrimarySuffix.doSuffix(Unknown Source)
	at bsh.BSHPrimaryExpression.eval(Unknown Source)
	at bsh.BSHPrimaryExpression.eval(Unknown Source)
	at bsh.Interpreter.eval(Unknown Source)
	at bsh.Interpreter.eval(Unknown Source)
	at bsh.Interpreter.eval(Unknown Source)
	at com.openkm.util.ExecutionUtils.runScript(ExecutionUtils.java:112)
	at com.openkm.core.Cron$RunnerBsh.run(Cron.java:103)
	at java.lang.Thread.run(Thread.java:724)
Caused by: java.util.concurrent.ExecutionException: java.lang.NumberFormatException: For input string: "KOI8-R"
	at java.util.concurrent.FutureTask.report(FutureTask.java:122)
	at java.util.concurrent.FutureTask.get(FutureTask.java:188)
	at org.dts.spell.dictionary.OpenOfficeSpellDictionary.waitToLoad(OpenOfficeSpellDictionary.java:283)
	... 26 more
Caused by: java.lang.NumberFormatException: For input string: "KOI8-R"
	at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
	at java.lang.Integer.parseInt(Integer.java:492)
	at java.lang.Integer.parseInt(Integer.java:527)
	at org.dts.spell.dictionary.myspell.MySpell.load_tables(MySpell.java:398)
	at org.dts.spell.dictionary.myspell.MySpell.initFromStreams(MySpell.java:177)
	at org.dts.spell.dictionary.myspell.MySpell.<init>(MySpell.java:69)
	at org.dts.spell.dictionary.OpenOfficeSpellDictionary.initFromZipFile(OpenOfficeSpellDictionary.java:198)
	at org.dts.spell.dictionary.OpenOfficeSpellDictionary.access$100(OpenOfficeSpellDictionary.java:31)
	at org.dts.spell.dictionary.OpenOfficeSpellDictionary$2.call(OpenOfficeSpellDictionary.java:88)
	at java.util.concurrent.FutureTask.run(FutureTask.java:262)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
	... 1 more

The OCR settings are:
registered.text.extractors: com.openkm.extractor.Tesseract3TextExtractor
system.ocr: C:\Program Files\Tesseract-OCR\tesseract ${fileIn} ${fileOut} -l rus+eng
system.ocr.rotate: 90;180;270;
The test file comes from the standard Tesseract package (attached), and extracting it from the console works fine. Any idea what's wrong?
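
To compare the console run with what the server would execute, it can help to expand the configured `system.ocr` template by hand and run exactly that string. The following is only a minimal sketch: it assumes `${fileIn}` and `${fileOut}` are filled in by plain string substitution, which this thread does not confirm, and `OcrCommandSketch` with its `expand` method, as well as the sample paths, are hypothetical names that are not part of OpenKM.

```java
// Hypothetical helper for checking the configured command by hand.
// It is NOT OpenKM code; plain substitution of the placeholders is an assumption.
class OcrCommandSketch {

    // Replaces the two placeholders literally (no regular expressions involved).
    static String expand(String template, String fileIn, String fileOut) {
        return template
                .replace("${fileIn}", fileIn)
                .replace("${fileOut}", fileOut);
    }

    public static void main(String[] args) {
        // Template copied from the system.ocr setting above.
        String template =
                "C:\\Program Files\\Tesseract-OCR\\tesseract ${fileIn} ${fileOut} -l rus+eng";

        // Sample input and output names, chosen only for illustration.
        String command = expand(template, "C:\\temp\\phototest.tif", "C:\\temp\\phototest");

        // Print the expanded command so it can be pasted into a console.
        System.out.println(command);
    }
}
```

If the expanded command works in a console, the Tesseract call itself is probably fine and the failure lies in a later step.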

Posted: **Sat Apr 26, 2014 10:49 am**

I do not know exactly what you are doing, but the message

java.util.concurrent.ExecutionException: java.lang.NumberFormatException: For input string: "KOI8-R"

indicates that a text string is being converted to a number, which is why the error is raised. Could there be a problem with the dictionary?
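
The exception type is easy to reproduce in isolation. The sketch below only illustrates the mechanism: the stack trace shows `Integer.parseInt` being called from `MySpell.load_tables` with the value "KOI8-R", but the exact line of the dictionary file it came from is not shown, so the `SET KOI8-R` header used here is an assumption. `AffixHeaderCheck` and `parseCount` are hypothetical names, not part of OpenKM or the spell-checking library.

```java
import java.util.OptionalInt;

// Hypothetical illustration; not taken from OpenKM or the spell-checking library.
class AffixHeaderCheck {

    // Returns the parsed integer, or an empty value if the token is not a plain integer.
    static OptionalInt parseCount(String token) {
        try {
            return OptionalInt.of(Integer.parseInt(token.trim()));
        } catch (NumberFormatException e) {
            // Same exception type as in the log: the token is not numeric.
            return OptionalInt.empty();
        }
    }

    public static void main(String[] args) {
        // Assumed header line of a dictionary affix file that declares its encoding.
        String header = "SET KOI8-R";
        String token = header.split("\\s+")[1];

        try {
            // Parsing an encoding name as a number fails, as in the reported log.
            Integer.parseInt(token);
        } catch (NumberFormatException e) {
            System.out.println("NumberFormatException: " + e.getMessage());
        }

        // A tolerant parser reports "no number here" instead of throwing.
        System.out.println(parseCount(token).isPresent());
        System.out.println(parseCount("42").isPresent());
    }
}
```

A parser that expects a count where the dictionary file has an encoding name would fail in this way, which fits the suspicion that the dictionary file, not Tesseract, is the source of the error.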

Posted: **Wed Apr 30, 2014 5:06 am**

You were right. I tried another dictionary and it works fine.
By the way, Russian text extraction works better without any dictionary.
Thanks a lot.

There's no ocr data capture option in menu
