
tesseract-ocr not adding *.txt to output causing errors

Posted: Fri Jun 05, 2015 11:11 am
by gwaitsi
I switched from cuneiform to tesseract per jllort's recommendation and have so far cleared the Java heap errors I was getting after tweaking the memory values. I also got the office dictionary installed.

However, I first noticed that tesseract is not adding the default *.txt extension to the files in the temp directory, whereas cuneiform was.

This seems to be producing an error:
Code:
2015-06-05 12:35:01,702 [Thread-14570] WARN  com.openkm.util.ExecutionUtils- CommandLine: [/usr/bin/tesseract, -l, eng+deu+ces, /usr/local/openkm/temp/okm6898227341957316764-001.pbm, /usr/local/openkm/temp/okm1885491060987496539]
2015-06-05 12:35:01,702 [Thread-14570] WARN  com.openkm.util.ExecutionUtils- STDERR: Error in findFileFormatStream: truncated file
Error in pixReadStream: Unknown format: no pix returned
Error in pixRead: pix not read

2015-06-05 12:35:01,702 [Thread-14570] WARN  com.openkm.extractor.Tesseract3TextExtractor- IO exception executing command: /usr/bin/tesseract -l eng+deu+ces /usr/local/openkm/temp/okm6898227341957316764-001.pbm /usr/local/openkm/temp/okm1885491060987496539
java.io.FileNotFoundException: /usr/local/openkm/temp/okm1885491060987496539.txt (No such file or directory)
	at java.io.FileInputStream.open(Native Method)
	at java.io.FileInputStream.<init>(FileInputStream.java:146)
	at java.io.FileInputStream.<init>(FileInputStream.java:101)
	at com.openkm.extractor.Tesseract3TextExtractor.doOcr(Tesseract3TextExtractor.java:152)
	at com.openkm.extractor.Tesseract3TextExtractor.doOcr(Tesseract3TextExtractor.java:127)
	at com.openkm.extractor.PdfTextExtractor.doOcr(PdfTextExtractor.java:175)
	at com.openkm.extractor.PdfTextExtractor.extractText(PdfTextExtractor.java:96)
	at com.openkm.extractor.RegisteredExtractors.getText(RegisteredExtractors.java:214)
	at com.openkm.extractor.RegisteredExtractors.getText(RegisteredExtractors.java:173)
	at com.openkm.dao.NodeDocumentDAO.textExtractorHelper(NodeDocumentDAO.java:1344)
	at com.openkm.extractor.TextExtractorWorker.processSerial(TextExtractorWorker.java:164)
	at com.openkm.extractor.TextExtractorWorker.processQueue(TextExtractorWorker.java:149)
	at com.openkm.extractor.TextExtractorWorker.run(TextExtractorWorker.java:100)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:606)
	at bsh.Reflect.invokeMethod(Reflect.java:166)
	at bsh.Reflect.invokeObjectMethod(Reflect.java:99)
	at bsh.BSHPrimarySuffix.doName(BSHPrimarySuffix.java:176)
	at bsh.BSHPrimarySuffix.doSuffix(BSHPrimarySuffix.java:120)
	at bsh.BSHPrimaryExpression.eval(BSHPrimaryExpression.java:80)
	at bsh.BSHPrimaryExpression.eval(BSHPrimaryExpression.java:47)
	at bsh.Interpreter.eval(Interpreter.java:664)
	at bsh.Interpreter.eval(Interpreter.java:758)
	at bsh.Interpreter.eval(Interpreter.java:747)
	at com.openkm.util.ExecutionUtils.runScript(ExecutionUtils.java:112)
	at com.openkm.core.Cron$RunnerBsh.run(Cron.java:103)
	at java.lang.Thread.run(Thread.java:745)
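
The STDERR lines in the log above ("truncated file", "pix not read") may indicate that the intermediate .pbm image itself is unusable, in which case tesseract would have nothing to write. A minimal sketch of a sanity check on such a file follows. The function name check_pbm is hypothetical and not part of OpenKM or tesseract; it only checks that the file exists, is not empty, and starts with one of the two PBM magic numbers (P1 or P4).

```python
import os

# Magic numbers defined by the PBM format: "P1" (plain) and "P4" (raw).
PBM_MAGIC = (b"P1", b"P4")


def check_pbm(path):
    """Return a short diagnosis string, or None if the file looks like a PBM.

    Hypothetical helper for illustration; it does not validate the
    image data, only the presence of the file and its first two bytes.
    """
    if not os.path.isfile(path):
        return "missing"
    if os.path.getsize(path) == 0:
        return "empty"
    with open(path, "rb") as fh:
        magic = fh.read(2)
    if magic not in PBM_MAGIC:
        return "not a PBM file"
    return None
```

Running it against the .pbm path from the log while the file still exists in the temp directory would show whether the problem is on the input side rather than in the output naming.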

Re: tesseract-ocr not adding *.txt to output causing errors

Posted: Sat Jun 06, 2015 11:06 am
by jllort
Please indicate the value of your tesseract configuration parameter (system.ocr). Also, did you change the text.extractors class from cuneiform to tesseract and restart the OpenKM service?

This kind of error can occur when the PDF contains small images that do not contain any text.
Code:
Error in pixReadStream: Unknown format: no pix returned
Error in pixRead: pix not read
I would not worry about it. Check whether the contents of the whole document are actually being extracted.
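
One way to check this outside OpenKM is to run the same command by hand and see whether the .txt file appears. The sketch below is only an illustration: build_command, read_ocr_output and run_ocr are hypothetical helper names, and the argument order simply mirrors the CommandLine entry in the log. Tesseract appends ".txt" to the output base name when it succeeds, so a missing file usually means the image could not be read, not that the extension was left off.

```python
import os
import subprocess


def build_command(image, out_base, langs="eng", tesseract="/usr/bin/tesseract"):
    """Build the argument list in the same order as the logged CommandLine."""
    return [tesseract, "-l", langs, image, out_base]


def read_ocr_output(out_base):
    """Return the OCR text, or None if tesseract produced no .txt file."""
    txt_path = out_base + ".txt"  # tesseract adds the extension itself
    if not os.path.exists(txt_path):
        return None
    with open(txt_path, encoding="utf-8") as fh:
        return fh.read()


def run_ocr(image, out_base, langs="eng"):
    """Run tesseract and return (text_or_None, stderr) instead of raising
    when the output file is missing."""
    result = subprocess.run(
        build_command(image, out_base, langs),
        capture_output=True,
        text=True,
    )
    return read_ocr_output(out_base), result.stderr
```

If run_ocr returns None for a small image but text for the other pages of the same PDF, that matches the harmless case described above.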