tesseract-ocr not adding *.txt to output causing errors

 #39816  by gwaitsi
 
I switched from Cuneiform to Tesseract on jllort's recommendation and, after tweaking the memory values, have so far cleared the Java heap errors I was getting. I also got the office dictionary installed.

However, I first noticed that Tesseract is not adding the default *.txt to the files in the temp directory, whereas Cuneiform was.

This seems to be producing an error:
Code:
2015-06-05 12:35:01,702 [Thread-14570] WARN  com.openkm.util.ExecutionUtils- CommandLine: [/usr/bin/tesseract, -l, eng+deu+ces, /usr/local/openkm/temp/okm6898227341957316764-001.pbm, /usr/local/openkm/temp/okm1885491060987496539]
2015-06-05 12:35:01,702 [Thread-14570] WARN  com.openkm.util.ExecutionUtils- STDERR: Error in findFileFormatStream: truncated file
Error in pixReadStream: Unknown format: no pix returned
Error in pixRead: pix not read

2015-06-05 12:35:01,702 [Thread-14570] WARN  com.openkm.extractor.Tesseract3TextExtractor- IO exception executing command: /usr/bin/tesseract -l eng+deu+ces /usr/local/openkm/temp/okm6898227341957316764-001.pbm /usr/local/openkm/temp/okm1885491060987496539
java.io.FileNotFoundException: /usr/local/openkm/temp/okm1885491060987496539.txt (No such file or directory)
	at java.io.FileInputStream.open(Native Method)
	at java.io.FileInputStream.<init>(FileInputStream.java:146)
	at java.io.FileInputStream.<init>(FileInputStream.java:101)
	at com.openkm.extractor.Tesseract3TextExtractor.doOcr(Tesseract3TextExtractor.java:152)
	at com.openkm.extractor.Tesseract3TextExtractor.doOcr(Tesseract3TextExtractor.java:127)
	at com.openkm.extractor.PdfTextExtractor.doOcr(PdfTextExtractor.java:175)
	at com.openkm.extractor.PdfTextExtractor.extractText(PdfTextExtractor.java:96)
	at com.openkm.extractor.RegisteredExtractors.getText(RegisteredExtractors.java:214)
	at com.openkm.extractor.RegisteredExtractors.getText(RegisteredExtractors.java:173)
	at com.openkm.dao.NodeDocumentDAO.textExtractorHelper(NodeDocumentDAO.java:1344)
	at com.openkm.extractor.TextExtractorWorker.processSerial(TextExtractorWorker.java:164)
	at com.openkm.extractor.TextExtractorWorker.processQueue(TextExtractorWorker.java:149)
	at com.openkm.extractor.TextExtractorWorker.run(TextExtractorWorker.java:100)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:606)
	at bsh.Reflect.invokeMethod(Reflect.java:166)
	at bsh.Reflect.invokeObjectMethod(Reflect.java:99)
	at bsh.BSHPrimarySuffix.doName(BSHPrimarySuffix.java:176)
	at bsh.BSHPrimarySuffix.doSuffix(BSHPrimarySuffix.java:120)
	at bsh.BSHPrimaryExpression.eval(BSHPrimaryExpression.java:80)
	at bsh.BSHPrimaryExpression.eval(BSHPrimaryExpression.java:47)
	at bsh.Interpreter.eval(Interpreter.java:664)
	at bsh.Interpreter.eval(Interpreter.java:758)
	at bsh.Interpreter.eval(Interpreter.java:747)
	at com.openkm.util.ExecutionUtils.runScript(ExecutionUtils.java:112)
	at com.openkm.core.Cron$RunnerBsh.run(Cron.java:103)
	at java.lang.Thread.run(Thread.java:745)
 #39831  by jllort
 
Please tell us the value of your tesseract configuration parameter (system.ocr). Also, did you change the text.extractors class from cuneiform to tesseract and restart the OpenKM service?

This kind of error can occur when the PDF contains small images that do not contain any text.
Code:
Error in pixReadStream: Unknown format: no pix returned
Error in pixRead: pix not read
I would not worry about it. Check whether the content of the whole document is actually being extracted or not.
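
To illustrate why the FileNotFoundException follows the pixRead errors: Tesseract 3 is given an output base name and writes its text to that name plus ".txt". The log above suggests that when the input image cannot be read, no such file is created, so opening it fails. The following is only a minimal sketch of a defensive check, not OpenKM's actual code. The class name OcrOutputCheck and both method names are hypothetical, and it assumes Java 8 or later.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Optional;

// Hypothetical helper, not part of OpenKM.
class OcrOutputCheck {

    // Tesseract 3 appends ".txt" to the output base name passed on the command line.
    public static Path expectedOutputFile(Path outputBase) {
        return Paths.get(outputBase.toString() + ".txt");
    }

    // Returns the OCR text, or an empty Optional when tesseract produced no
    // output file (for example because the input image could not be read).
    public static Optional<String> readOcrText(Path outputBase) throws IOException {
        Path txt = expectedOutputFile(outputBase);
        if (!Files.isRegularFile(txt)) {
            return Optional.empty();
        }
        byte[] bytes = Files.readAllBytes(txt);
        return Optional.of(new String(bytes, StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws IOException {
        // Pass the output base name that was given to tesseract, without ".txt".
        if (args.length == 0) {
            System.err.println("usage: OcrOutputCheck <output-base>");
            return;
        }
        Optional<String> text = readOcrText(Paths.get(args[0]));
        if (text.isPresent()) {
            System.out.println("OCR produced " + text.get().length() + " characters");
        } else {
            System.out.println("No OCR output file; this page was probably skipped");
        }
    }
}
```

With a check like this, a page that yields no output file is treated as "no text" instead of raising an exception, which matches the advice above to look at whether the rest of the document is extracted.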
