Open Source Document Management System | OpenKM

Posted: **Thu Feb 12, 2015 1:06 pm**

Hey,

Can you test these 3 documents in your OpenKM?
jpg:

datei_bunt.jpg (456.57 KiB) Viewed 7118 times

png:

datei_bunt.png (2.34 MiB) Viewed 7118 times

tif:

datei_bunt.tif (327.88 KiB) Viewed 7118 times

After I uploaded them, I can't search for their text.

I have installed Tesseract ("tesseract-ocr-setup-3.02.02.exe").
-> OK - C:\OpenKM-OCR\Tesseract-OCR\tesseract.exe (it's correct)
-> PDF, Word, Excel and txt files are working correctly.
-> I can see the preview without errors.

I get the following errors with these files:

Code:

2015-02-12 14:02:05,295 [Thread-122] WARN  com.openkm.dao.NodeDocumentDAO- There was a problem extracting text from '/okm:root/Scans/datei_bunt.tif': Too few text extracted
2015-02-12 14:02:05,373 [Thread-122] INFO  com.openkm.extractor.TextExtractorWorker- processSerial.Working on {docUuid=cfb84da2-e46f-4461-b051-ef422b06170b, docPath=/okm:root/Scans/datei_bunt.png, docVerUuid=fe506f55-1fe0-4959-a7fe-d6cf1dfc88d4, date=Thu Feb 12 14:01:03 CET 2015}
2015-02-12 14:02:10,802 [Thread-122] WARN  com.openkm.dao.NodeDocumentDAO- There was a problem extracting text from '/okm:root/Scans/datei_bunt.png': Too few text extracted
2015-02-12 14:02:10,849 [Thread-122] INFO  com.openkm.extractor.TextExtractorWorker- processSerial.Working on {docUuid=d42f77ce-62c6-43fe-96cc-f8462ebe2bd2, docPath=/okm:root/Scans/datei_bunt.jpg, docVerUuid=60f97e6f-4af6-42cf-9941-17a5c468b68c, date=Thu Feb 12 14:01:02 CET 2015}
2015-02-12 14:02:16,091 [Thread-122] WARN  com.openkm.dao.NodeDocumentDAO- There was a problem extracting text from '/okm:root/Scans/datei_bunt.jpg': Too few text extracted
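
In case it helps with checking a longer log, here is a minimal sketch that pulls the affected document paths out of warnings like the ones above. It assumes the message format shown in this post; the function name `failed_extractions` is made up for illustration and is not part of OpenKM.

```python
import re

# Matches the NodeDocumentDAO warning format seen in the log excerpt above.
FAILED = re.compile(r"There was a problem extracting text from '([^']+)': (.+)$")


def failed_extractions(log_lines):
    """Return (document path, reason) pairs for every failed text extraction."""
    results = []
    for line in log_lines:
        match = FAILED.search(line)
        if match:
            results.append((match.group(1), match.group(2).strip()))
    return results


if __name__ == "__main__":
    sample = [
        "2015-02-12 14:02:05,295 [Thread-122] WARN  com.openkm.dao.NodeDocumentDAO- "
        "There was a problem extracting text from '/okm:root/Scans/datei_bunt.tif': "
        "Too few text extracted",
    ]
    for path, reason in failed_extractions(sample):
        print(path, "->", reason)
```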

André

Posted: **Sat Feb 14, 2015 9:14 am**

I've tested it in our online demo (demo.openkm.com) and it works correctly (you can take a look here: http://demo.openkm.com/OpenKM/index.jsp ... 287f3484a2). Try a search by content, for example "felsmann".

Please post your Tesseract configuration parameters here and also tell us your OpenKM version.

Posted: **Mon Feb 16, 2015 8:54 am**

Hello,

I also tested it in your online environment. Here are the URLs to the three files (if the documents haven't been deleted):
- http://demo.openkm.com/OpenKM/index.jsp ... 9f9d1ede4a (png: successful)
- http://demo.openkm.com/OpenKM/index.jsp ... 8f1983de26 (jpg: successful)
- http://demo.openkm.com/OpenKM/index.jsp ... 79bb37247d (tiff: not successful)

I waited 10 minutes because I don't know how long the TextExtractor interval is.
Searching for the string "felsmann" found the following files: png and jpg.
- The tiff is missing in your online environment.

My tesseract configuration: C:\OpenKM-OCR\Tesseract-OCR\tesseract.exe ${fileIn} ${fileOut}
OpenKM version: Version 6.3.0 (build: 8156) With Community Extension
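
To rule out Tesseract itself, the same command template can be run by hand outside OpenKM. This is only a sketch: the helper names `build_command` and `run_ocr` are hypothetical, and it assumes the `${fileIn}` / `${fileOut}` placeholders are simply replaced by file paths and that Tesseract writes its result to `<fileOut>.txt`, which is how the Tesseract 3.x command line behaves.

```python
import subprocess
from pathlib import Path
from string import Template


def build_command(template, file_in, file_out):
    """Fill the ${fileIn} and ${fileOut} placeholders, one argument per token.

    Splitting on whitespace assumes the executable path contains no spaces.
    """
    return [
        Template(token).substitute(fileIn=file_in, fileOut=file_out)
        for token in template.split()
    ]


def run_ocr(template, file_in, file_out):
    """Run the OCR command and return the text Tesseract wrote to <file_out>.txt."""
    subprocess.run(build_command(template, file_in, file_out), check=True)
    return Path(file_out + ".txt").read_text(encoding="utf-8")


if __name__ == "__main__":
    template = r"C:\OpenKM-OCR\Tesseract-OCR\tesseract.exe ${fileIn} ${fileOut}"
    print(build_command(template, "datei_bunt.tif", "datei_bunt_ocr"))
```

If the text file produced this way contains the expected words, Tesseract works and the problem is more likely in how OpenKM is configured to call it.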

André

Posted: **Mon Feb 16, 2015 1:43 pm**

After some hours, the tif file now also shows up in your demo environment.

Posted: **Wed Feb 18, 2015 10:16 pm**

How do you generate this tiff file?

Posted: **Thu Feb 19, 2015 7:16 am**

We generate these files with a scanner import (black and white).

There are two ways we import documents from a scanner:

First:
- We import them as pdf-files.

Second:
- We import them as tif-files.

The text is detected in the pdf-file, but not in the tif-file.

- I have tried different documents (always the same problem).

And then I get this error:

Code:

2015-02-19 10:58:15,378 [http-bio-0.0.0.0-8080-exec-8] WARN  com.openkm.dao.NodeDocumentDAO- There was a problem extracting text from '/okm:root/unverteilt/2015-02-19T20150219081221.tif': Too few text extracted

André

Posted: **Sat Feb 21, 2015 8:54 am**

Can you upload the tiff file and the same document as a PDF here? I want to do some tests with them.

Posted: **Mon Feb 23, 2015 7:42 am**

Hello,

Here is the PDF file that I have created with our scanner:

file.pdf (24.79 KiB) Downloaded 352 times

Here is the TIF file that I have created with our scanner:

file.tif (62.86 KiB) Viewed 7057 times

The PDF works.
The tif fails with: "Too few text extracted"

André

Posted: **Tue Feb 24, 2015 7:54 am**

Now I have a problem with tiff files which have not been imported with the scanner.
- I think it's the same problem as with the tiff and other image files.

Code:

2015-02-24 08:46:03,224 [http-bio-0.0.0.0-8080-exec-2] WARN  com.openkm.extractor.PdfTextExtractor- PDF does not contains text layer

Does OpenKM only detect text if files contain a text layer?

André

Posted: **Wed Feb 25, 2015 7:15 am**

I have found the problem.

In the configuration I had to use "com.openkm.extractor.Tesseract3TextExtractor" instead of "com.openkm.extractor.CuneiformTextExtractor".
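
For anyone who wants to check their own list of extractor classes, the swap can be sketched like this. It is illustration only: the thread does not say which configuration setting holds the list, so the code just works on a plain list of class names, and `swap_ocr_extractor` is a made-up helper name.

```python
# The two class names are taken from this thread.
OLD = "com.openkm.extractor.CuneiformTextExtractor"
NEW = "com.openkm.extractor.Tesseract3TextExtractor"


def swap_ocr_extractor(extractors):
    """Return the class list with the Cuneiform extractor replaced by Tesseract3.

    Order is preserved and duplicates are dropped, so a list that already
    contains the Tesseract3 class does not end up with it twice.
    """
    result = []
    for name in extractors:
        name = name.strip()
        if name == OLD:
            name = NEW
        if name and name not in result:
            result.append(name)
    return result


if __name__ == "__main__":
    current = [
        "com.openkm.extractor.PdfTextExtractor",
        "com.openkm.extractor.CuneiformTextExtractor",
    ]
    print("\n".join(swap_ocr_extractor(current)))
```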

Thanks.

Posted: **Fri Feb 27, 2015 4:34 pm**

OK, thanks for the clue; I will try to keep it in mind for other posts. In the latest releases we've merged everything into a wrapper, so it is no longer necessary to choose the class: the correct class is chosen automatically depending on the system.ocr value. I think it's still not available in the community edition (I'm not sure whether version 6.3 comes with it).

Posted: **Mon Mar 02, 2015 5:55 am**

Okay, thanks for your reply.

Uploading images - extracting problem