Open Source Document Management System | OpenKM - Problem with text extraction of PDF Files

Problem with text extraction of PDF Files

#26271 by kalatchev
Mon Nov 18, 2013 6:22 pm

Hello there,

I have encountered a problem with PDF text extraction. Text extraction from PDFs with an image layer (scanned documents) works just fine. Text extraction from other document types (Word, text, etc.) also works, and those documents are indexed as expected. However, certain PDF files are not indexed. I investigated further using the “Administration -> Utilities -> Check text extraction” function.

I ran tests with 5 files containing the same text: one line of English and one line of Cyrillic. Each file contains the following text:

Code: Select all

This is text. 
Това е текст.

Here are the results of my investigation:

Text file (txt), ANSI encoding: works, although the Cyrillic text is wrongly read as ISO 8859-1 characters instead of Windows-1251, which is to be expected for ANSI encoding.
Text file (txt), UTF-8 encoding: works fine, correct encoding.
Word file (docx, MS Word 2010): works fine, correct encoding.
PDF file (produced with Word, "Save As PDF" function): Doesn’t extract any text, neither English text nor Cyrillic.
PDF file (produced through printing to CutePDF printer 3.0/Ghostscript): Doesn’t extract any text, neither English text nor Cyrillic.
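
The garbling seen with the ANSI text file can be reproduced outside OpenKM. The following is a minimal sketch, not OpenKM code: the class and method names are hypothetical, and it assumes the JDK provides the windows-1251 charset (standard desktop JDKs normally do). It encodes the Cyrillic test line as Windows-1251 and then decodes the same bytes once with ISO 8859-1 and once with Windows-1251.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Illustration only: CharsetMismatchDemo is a hypothetical name, not part of OpenKM.
class CharsetMismatchDemo {
    // Assumption: the running JDK ships the windows-1251 charset.
    static final Charset WINDOWS_1251 = Charset.forName("windows-1251");

    // Encode text the way an "ANSI" editor on a Cyrillic Windows system would.
    static byte[] encodeAnsiCyrillic(String text) {
        return text.getBytes(WINDOWS_1251);
    }

    // Decode raw bytes with the given charset.
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        // The Cyrillic test line from the post, written with Unicode escapes
        // so the source compiles regardless of the compiler's file encoding.
        String cyrillic = "\u0422\u043e\u0432\u0430 \u0435 \u0442\u0435\u043a\u0441\u0442.";
        byte[] bytes = encodeAnsiCyrillic(cyrillic);

        // Wrong charset: each byte becomes an unrelated Latin-1 character.
        System.out.println(decode(bytes, StandardCharsets.ISO_8859_1));
        // Right charset: the original text comes back.
        System.out.println(decode(bytes, WINDOWS_1251));
    }
}
```

A plain text file carries no charset information, so an extractor has to guess; the UTF-8 file avoids the problem because its bytes are unambiguous for this text.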

If it helps: copy-paste from Adobe Reader (10.1.8) to Notepad works for both of the PDF files mentioned, with correct encoding for both the English and the Cyrillic text.

Another clue: I previously had problems copying and pasting from some PDF files, especially ones made with PDF printers. I found that this topic helped me. I think it is somehow connected to the problem described above.

Here is the information about installation:
Server: OpenKM – community edition, version 6.2.5 (build: 8109), running on Windows 7 Pro SP1 with Apache Tomcat 7.0.27, JRE 7 Update 45, OpenOffice 4.0.1, Tesseract 3.02, ImageMagick 6.8.7, MS SQL Server 2008 R2 Express edition.
Client: Google Chrome 31.0.1650.57 on Windows 7 Pro SP 1.

Configuration settings:

Screenshot of related configuration values.
config.png (16.17 KiB)

Here are my test files:

Attachments

TextFiles.zip (431 Bytes): two text files, one ANSI encoded (code page Windows-1251) and one UTF-8 encoded.
TextExtractionTestCutePDF.pdf (11.18 KiB): the Word document, "printed" to CutePDF.
TextExtractionTest.pdf (83.37 KiB): the Word document, saved as PDF from Word.
TextExtractionTest.docx (12.48 KiB): the Word 2010 document.

Username

kalatchev

Rank

Fresh Boarder

Posts

6

Joined

Mon Nov 18, 2013 6:10 am

Re: Problem with text extraction of PDF Files

#26274 by kalatchev
Tue Nov 19, 2013 1:06 am

Some clarifications.

I found that, when checking text extraction, I get the following record in catalina.log (I have replaced the real IP address with w.x.y.z):

Code: Select all

2013-11-19 02:43:35,529 [http-bio-w.x.y.z-8080-exec-3] WARN  com.openkm.extractor.PdfTextExtractor - PDF does not contains text layer

I forgot to mention that the server OS is 64-bit (Windows 7 Pro SP1 64-bit), if that matters.

And the most interesting thing: I tried these files with the official online demo. Both problem files are indexed correctly; here is the screenshot:

Online demo somehow extracts text correctly.
online-demo-test.png (12.83 KiB)

Really strange ...
I hope someone can help me.


Re: Problem with text extraction of PDF Files (RESOLVED)

#26275 by kalatchev
Tue Nov 19, 2013 2:35 am

Finally I found where the problem is.

The problem was that system.pdf.force.ocr was set to true. Once I set it to false, everything began to work. Thank God I'm dealing with open source!

Here's the explanation:

Code: Select all

PDDocument document = parser.getPDDocument();
CharArrayWriter writer = new CharArrayWriter();

PDFTextStripper stripper = new PDFTextStripper();
stripper.setLineSeparator("\n");
stripper.writeText(document, writer);
String st = writer.toString().trim();
log.debug("TextStripped: '{}'", st);

if (Config.SYSTEM_PDF_FORCE_OCR || st.length() <= 1) {
        log.warn("PDF does not contains text layer");
        ...
        // Do OCR if SYSTEM_PDF_FORCE_OCR or no text captured from text layer
        ...
        return new StringReader(sb.toString());
} else {
        return new CharArrayReader(writer.toCharArray());
}

So, what does this code mean? It means that if you set system.pdf.force.ocr to true, you effectively disable text extraction from the text layer, whether or not the PDF contains an image. As a side effect, if you have PDF files with both text AND images, then depending on this setting you get either only the text from the text layer or only the text from the OCRed images, but not both. I've tested it with a PDF containing both text and an image, and it behaves exactly that way: only text from the image, or only text from the text layer?!?
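
To make the branch easier to follow, here is a minimal standalone model of the condition quoted above. It is a sketch, not OpenKM code: the class, enum and method names are hypothetical, and the boolean parameter stands in for Config.SYSTEM_PDF_FORCE_OCR. It only reports which source the excerpt would return text from.

```java
// Illustration only: a standalone model of the branch in the excerpt.
// ExtractionBranchDemo, Source and chooseSource are hypothetical names.
class ExtractionBranchDemo {
    enum Source { OCR, TEXT_LAYER }

    // forceOcr stands in for Config.SYSTEM_PDF_FORCE_OCR;
    // textLayer stands in for the text written by PDFTextStripper.
    static Source chooseSource(boolean forceOcr, String textLayer) {
        String st = textLayer.trim();
        // Same condition as in the excerpt: OCR is forced, or the
        // stripped text is at most one character long.
        if (forceOcr || st.length() <= 1) {
            return Source.OCR;        // stripped text is not returned
        }
        return Source.TEXT_LAYER;     // OCR is never attempted
    }

    public static void main(String[] args) {
        System.out.println(chooseSource(true, "This is text."));   // OCR
        System.out.println(chooseSource(false, "This is text."));  // TEXT_LAYER
        System.out.println(chooseSource(false, ""));               // OCR
    }
}
```

The first case matches what I saw with the Word and CutePDF files: the text layer was readable, but with the setting enabled only the OCR result was used, and presumably there was nothing in those files for OCR to read.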

I admit that this whole thread is my own fault, but IMHO this is a bug and should be fixed. Right now I am a newbie with this system, but I promise that once I get used to it, I will fix it. Shame on us if we miss the opportunity to extract text from the text layer AND the OCRed images together.

And once again I am thankful to the open source community. If OpenKM were proprietary software, I wouldn't have had a chance to investigate my problem and would probably have formed a negative opinion just because I couldn't understand what was happening.

My advice to other users: Guys, unless you have special considerations, don't set system.pdf.force.ocr to true.


Re: Problem with text extraction of PDF Files

#26276 by kalatchev
Tue Nov 19, 2013 6:07 am

One more comment on the above: the warning text "PDF does not contains text layer" is misleading. It should at least state "PDF does not contains text layer or OCR forced by config". However, I think the text captured in the previous step (extraction from the text layer) shouldn't be thrown away, but instead added to the text captured by OCR. Just removing the line

Code: Select all

if (Config.SYSTEM_PDF_FORCE_OCR || st.length() <= 1) {

will achieve what I propose: capturing text from both the text layer and the images.
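
A minimal sketch of the combined behaviour proposed here, under stated assumptions: the class and method names are hypothetical, and the two string arguments stand in for the PDFTextStripper output and the OCR output. Note that in the excerpt quoted earlier the OCR branch returns only sb.toString(), so the stripped text would presumably also have to be appended to that result; removing the condition alone may not be sufficient.

```java
// Illustration only: CombinedExtractionDemo and combine are hypothetical
// names, not OpenKM code. The arguments stand in for the text-layer
// output and the OCR output.
class CombinedExtractionDemo {
    static String combine(String textLayer, String ocrText) {
        String layer = textLayer == null ? "" : textLayer.trim();
        String ocr = ocrText == null ? "" : ocrText.trim();

        StringBuilder sb = new StringBuilder();
        sb.append(layer);
        // Separate the two parts only when both are present.
        if (!layer.isEmpty() && !ocr.isEmpty()) {
            sb.append('\n');
        }
        sb.append(ocr);
        return sb.toString();
    }

    public static void main(String[] args) {
        // Text layer and OCR text both end up in the indexed content.
        System.out.println(combine("Text from the text layer.", "Text from an OCRed image."));
        // A PDF without images still yields its text layer.
        System.out.println(combine("Text from the text layer.", ""));
    }
}
```

Running OCR on every PDF has a cost, so a real change would probably still want a switch that decides whether OCR runs at all.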

I am open to hearing from all of you, developers and users, if there are considerations against my proposal. As I said before, I'm a newbie with the system and may be missing something, performance issues for instance.


Re: Problem with text extraction of PDF Files

#27449 by pavila
Thu Dec 12, 2013 2:53 pm

Configuration property system.pdf.force.ocr is disabled by default, so I don't see the problem.

Username

pavila

Rank

Moderator

Posts

3144

Joined

Tue Dec 11, 2007 6:02 pm

Location

Alicante, Spain


Re: Problem with text extraction of PDF Files

#27501 by vincentk222
Wed Dec 18, 2013 1:48 pm

Hi,
I had this problem too and changed the config as described.
But I still have a problem with search ...
I do not get the same result from text extraction, from the indexed terms, and from the text inside the PDF.

Utilities icon / list of indexes:

Code: Select all

#	65
_hibernate_class 	com.openkm.dao.bean.NodeDocument
uuid 	8785daf6-d5b3-4ec6-8f1e-c10823bb26a9
parent 	07eee10c-db57-432a-a1b4-bad48b805d58
context 	okm_root
author 	okmAdmin
created 	20131218
name 	999.pdf
category 	200f79e4-b913-4eed-ae04-78df8d5d99f7
userPermission 	okmAdmin
rolePermission 	ROLE_USER
okp:livraison.type 	VENTE
okp:livraison.numero 	999
lastModified 	20131218
language 	ro
mimeType 	application/pdf
checkedOut 	false
textExtracted 	true
locked 	false
terms 	[anfügen, b, c, d, e, ergänzung, ermöglicht, f, für, g, geschäfts, geschäftsdokumente, h, hinzufügen, i, ingen, j, k, l, lösung, m, message, n, o, p, q, r, rüberbr, s, systemunabhängigkeit, t, u, v, w, x, y, z, ändern, änderungen]

When I do text extraction:

Code: Select all

O P T I M I E R E N S I E I H R E D O K U M E N T E N P R O Z E S S E - N A C H B E L I E B E N D A T E N HINZUFÜGEN O D E R ÄNDERN, B I L D E R U N D G R A F I K E N I N T E G R I E R E N , S C H N E L L E R A U F N E U E A N F O R D E R U N G E N U N D ÄNDERUNGEN D E R G E S E T Z G E B U N G R E A G I E R E N , D O K U M E N T E I N E C H T Z E I T ÄNDERN S O W I E D Y N A M I S C H E U N D P E R S O N A L I S I E R T E I N F O R M A T I O N E N A N I H R E K U N D E N S E N D E N ... U N D A L L E S O H N E ÄNDERUNGEN A N I H R E M E R P - S Y S T E M ! O b j e c t i f L u n e s P l a n e t P r e s s S u i t e i s t e i n e w e r t v o l l e Ergänzung z u I h r e m E R P - S y s t e m , d i e e s U n t e r n e h m e n ermöglicht, i h r e Geschäftsdokumente a u f z u w e r t e n u n d i n d e m v o m A d r e s s a t e n b e v o r z u g t e n F o r m a t z u v e r s e n d e n . I m D o k u m e n t e n w o r k f l o v v - v o n d e r E r s t e l l u n g , d e r A u s g a b e u n d d e r Z u s t e l l u n g b i s h i n z u r A r c h i v i e r u n g i n I h r e m D o k u m e n t - M a n a g e m e n t - S y s t e m - k a n n P l a n e t P r e s s j e d e n e i n z e l n e n A r b e i t s a b l a u f s t e u e r n u n d a u t o m a t i s i e r e n . Hinzufügen v o n G r a f i k e n , O v e r l a y s , Q R - C o d e s , W e r b u n g , T e x t e n , B i l d e r n . D o k u m e n t e a u f w e r t e n ( z . B . B a r c o d e s , O M R - C o d e s ) E l e k t r o n i s c h e R e c h n u n g K u n d e n a n s c h r e i b e n hinzufügen P o s t a l i s c h e S o r t i e r u n g . 
i A u t o m a t i s i e r t e A u s g a b e p r o z e s s e D i e e i n f a c h e B e d i e n u n g u n d d i e Systemunabhängigkeit m a c h e n d a s P r o g r a m m z u r p e r f e k t e n Lösung, u m Geschäftsdokumente m i t w i c h t i g e n I n f o r m a t i o n e n , d i e d i e A u f m e r k s a m k e i t d e s L e s e r s f e s s e l n u n d d i e „Message rüberbr ingen", s c h n e l l u n d e i n f a c h a u f z u w e r t e n , z u p r o d u z i e r e n u n d z u v e r s e n d e n - p e r P o s t , E - M a i l o d e r o n l i n e u n d für d i e A r c h i v i e r u n g . A u t o m a t i s c h Geschäfts b e d i n g u n g e n anfügen E i n b i n d u n g v o n R e s p o n s e - S t e u e r u n g ( B a r c o d e s o d e r m i t d e m D i g i t a l i s i e r u n g s s t i f t v o n A n o t o ) Z e i t g l e i c h e E r s t e l l u n g e i n e r o r i g i n a l g e t r e u e n d i g i t a l e n K o p i e A r c h i v i e r u n g E - M a i l W e b F a x D r u c k & N a c h v e r a r b e i t u n g

When I copy and paste the text from inside the PDF:

Code: Select all

OPTIMIEREN SIE IHRE DOKUMENTENPROZESSE - NACH BELIEBEN DATEN HINZUFÜGEN ODER ÄNDERN, BILDER UND GRAFIKEN INTEGRIEREN, SCHNELLER AUF NEUE ANFORDERUNGEN UND ÄNDERUNGEN DER GESETZGEBUNG REAGIEREN, DOKUMENTE IN ECHTZEIT ÄNDERN SOWIE DYNAMISCHE UND PERSONALISIERTE INFORMATIONEN AN IHRE KUNDEN SENDEN ... UND ALLES OHNE ÄNDERUNGEN AN IHREM ERP-SYSTEM!
Objectif Lunes
PlanetPres
s Suite ist eine wertvolle Ergänzung zu Ihrem ERP-System, die es Unternehmen ermöglicht, ihre Geschäftsdokumente aufzuwerten und in dem vom Adressaten bevorzugten Format zu versenden.
Im Dokumentenworkflovv - von der Erstellung, der Ausgabe und der Zustellung bis hin zur Archivierung in Ihrem Dokument-Management-System - kann PlanetPress jeden einzelnen Arbeitsablauf steuern und automatisieren.
Hinzufügen von Grafiken, Overlays, QR-Codes, Werbung, Texten, Bildern .
Dokumente aufwerten (z.B. Barcodes, OMR-Codes)
Elektronische Rechnung
Kundenanschreiben hinzufügen
Postalische Sortierung
.i
Automatisierte Ausgabeprozesse
Die einfache Bedienung und die Systemunabhängigkeit machen das Programm zur perfekten Lösung, um Geschäftsdokumente mit wichtigen Informationen, die die Aufmerksamkeit des Lesers fesseln und die „Message rüberbringen", schnell und einfach aufzuwerten, zu produzieren und zu versenden - per Post, E-Mail oder online und für die Archivierung.
Automatisch Geschäftsbedingungen anfügen
Einbindung von Response-Steuerung (Barcodes oder mit dem Digitalisierungsstift von Anoto)
Zeitgleiche Erstellung einer originalgetreuen digitalen Kopie
Archivierung E-Mail
Web
F a x
Druck & Nachverarbeitung
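
The single-letter entries in the terms list are consistent with the letter-spaced extraction output shown above. The following is a rough sketch only: it uses a simple split on non-letter characters, which is an assumption made for illustration and not the analyzer OpenKM actually uses, and the class and method names are hypothetical. It shows how letter-spaced text breaks into one-letter terms while the few unspaced words survive whole.

```java
import java.util.Locale;
import java.util.TreeSet;

// Illustration only: LetterSpacedTermsDemo and terms are hypothetical names.
// The tokenizer is a simple stand-in, not OpenKM's real index analyzer.
class LetterSpacedTermsDemo {
    // Lowercase the text, split on anything that is not a letter,
    // and collect the distinct tokens in sorted order.
    static TreeSet<String> terms(String text) {
        TreeSet<String> result = new TreeSet<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+")) {
            if (!token.isEmpty()) {
                result.add(token);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // A fragment of the extraction output from the post; the umlauts are
        // written as Unicode escapes so the source compiles with any encoding.
        String extracted = "D A T E N HINZUF\u00dcGEN O D E R \u00c4NDERN";
        // "DATEN" and "ODER" are lost as words; only their letters remain.
        System.out.println(terms(extracted));
    }
}
```

With such terms in the index, a search for a word that was letter-spaced in the extraction output would not be expected to match, whereas the unspaced words remain searchable.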

Attachments

999.pdf (208.53 KiB)

Username

vincentk222

Rank

Junior Boarder

Posts

22

Joined

Fri Sep 20, 2013 12:27 pm

Re: Problem with text extraction of PDF Files

#27582 by pavila
Wed Jan 08, 2014 7:14 pm

Please don't mix different problems in the same thread because it is already marked as solved.

