• Problem with text extraction of PDF Files

  • We tried to make OpenKM as intuitive as possible, but advice is always welcome.
 #26271  by kalatchev
 
Hello there,

I have run into a problem with PDF text extraction. Text extraction from PDFs with an image layer (scanned documents) works fine. Text extraction from other document types (Word, plain text, etc.) also works, and those documents are indexed as expected. However, certain PDF files are not indexed. I investigated further using the “Administration -> Utilities -> Check text extraction” function.

I tested 5 files containing the same text: one line in English and one line in Cyrillic. Each file contains the following text:
Code: Select all
This is text. 
Това е текст.
Here are the results of my investigation:
  1. Text file (txt), ANSI encoding: works, although the Cyrillic text is read wrongly, as ISO 8859-1 characters instead of Windows-1251, which is to be expected with an ANSI encoding.
  2. Text file (txt), UTF-8 encoding: works fine, correct encoding.
  3. Word file (docx, MS Word 2010): works fine, correct encoding.
  4. PDF file (produced with Word, "Save As PDF" function): extracts no text at all, neither English nor Cyrillic.
  5. PDF file (produced through printing to CutePDF printer 3.0/Ghostscript): extracts no text at all, neither English nor Cyrillic.
If it helps: copy-paste from Adobe Reader 10.1.8 to Notepad works for both of these PDF files, with correct encoding for both the English and the Cyrillic text.
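
Regarding item 1 in the list, the garbling of the ANSI file can be reproduced outside OpenKM. This is only a small sketch, assuming the file was saved as Windows-1251 (as noted for the attached text file); the class and method names are made up for illustration and have nothing to do with OpenKM's own extractor code.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class AnsiDecodingDemo {
    // The Bulgarian test line, written with Unicode escapes so this
    // source file stays pure ASCII regardless of compiler settings.
    static final String SAMPLE =
            "\u0422\u043e\u0432\u0430 \u0435 \u0442\u0435\u043a\u0441\u0442.";

    // Encode with one charset and decode with another, which is what
    // happens when a reader guesses the wrong code page for an ANSI file.
    static String reinterpret(String text, Charset writtenAs, Charset readAs) {
        return new String(text.getBytes(writtenAs), readAs);
    }

    public static void main(String[] args) {
        Charset cp1251 = Charset.forName("windows-1251");
        // Wrong guess: every Cyrillic byte becomes a Latin accented letter.
        System.out.println(reinterpret(SAMPLE, cp1251, StandardCharsets.ISO_8859_1));
        // Right guess: the original text comes back unchanged.
        System.out.println(reinterpret(SAMPLE, cp1251, cp1251));
    }
}
```

Decoded as ISO 8859-1, the Cyrillic line turns into Latin accented letters (Òîâà å òåêñò.), which matches what I see for the ANSI file. Since a plain text file carries no encoding information, this is a guess the reader has to make, not something specific to PDF handling.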

Another clue: I previously had problems copy-pasting from some PDF files, especially ones made with PDF printers. I found that this topic helped me. I think it is somehow connected with the problem described above.

Here is the information about installation:
Server: OpenKM – community edition, version 6.2.5 (build: 8109), running on Windows 7 Pro SP1 with Apache Tomcat 7.0.27, JRE 7 Update 45, OpenOffice 4.0.1, Tesseract 3.02, ImageMagick 6.8.7, MS SQL Server 2008 R2 Express edition.
Client: Google Chrome 31.0.1650.57 on Windows 7 Pro SP 1.

Configuration settings:
Screenshot of related configuration values (config.png, 16.17 KiB).
Here are my test files:
Attachments
  • Two text files: one ANSI-encoded (code page Windows-1251) and one UTF-8-encoded (431 Bytes)
  • The Word document, "printed" to CutePDF (11.18 KiB)
  • The Word document, saved as PDF from Word (83.37 KiB)
  • The Word 2010 document (12.48 KiB)
 #26274  by kalatchev
 
Some clarifications.

I found that checking text extraction produces the following record in catalina.log (I replaced the real IP address with w.x.y.z):
Code: Select all
2013-11-19 02:43:35,529 [http-bio-w.x.y.z-8080-exec-3] WARN  com.openkm.extractor.PdfTextExtractor - PDF does not contains text layer
I forgot to mention that the server OS is 64-bit (Windows 7 Pro SP1 64-bit), if that matters.

And the most interesting thing: I tried these files with the official online demo. Both problem files are indexed correctly there, and here is the screenshot:
The online demo somehow extracts the text correctly (online-demo-test.png, 12.83 KiB).
Really strange ...
I hope someone will help me ;-)
 #26275  by kalatchev
 
Finally I found where the problem is :-)

The problem was that system.pdf.force.ocr was set to true. Once I set it to false, everything began to work. Thank God I'm dealing with open source!

Here's the explanation:
Code: Select all
PDDocument document = parser.getPDDocument();
CharArrayWriter writer = new CharArrayWriter();

PDFTextStripper stripper = new PDFTextStripper();
stripper.setLineSeparator("\n");
stripper.writeText(document, writer);
String st = writer.toString().trim();
log.debug("TextStripped: '{}'", st);

if (Config.SYSTEM_PDF_FORCE_OCR || st.length() <= 1) {
        log.warn("PDF does not contains text layer");
        ...
        // Do OCR if SYSTEM_PDF_FORCE_OCR or no text captured from text layer
        ...

        return new StringReader(sb.toString());
} else {
        return new CharArrayReader(writer.toCharArray());
}
So, what does this code mean? If you set system.pdf.force.ocr to true, you effectively disable extraction from the text layer, whether or not the PDF contains an image. As a side effect, for PDF files with both text AND images, this setting decides whether you get the text from the text layer or the text from the OCRed images, but never both. I tested it with a PDF containing both text and an image, and that is exactly what happens: only the text from the image, or only the text from the text layer?!?
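
To make the branching easier to see, here is a stripped-down model of that condition. It is only a sketch: `ForceOcrModel`, `Source` and `chosenSource` are names I made up, and the real extractor of course runs PDFBox and the OCR engine instead of returning an enum.

```java
final class ForceOcrModel {
    // Which text source ends up being returned by the extractor.
    enum Source { TEXT_LAYER, OCR }

    // Mirrors the quoted condition:
    //   if (Config.SYSTEM_PDF_FORCE_OCR || st.length() <= 1) { OCR } else { text layer }
    static Source chosenSource(boolean forceOcr, String textLayer) {
        if (forceOcr || textLayer.trim().length() <= 1) {
            return Source.OCR;
        }
        return Source.TEXT_LAYER;
    }

    public static void main(String[] args) {
        String[] layers = { "This is text.", "" };
        for (boolean force : new boolean[] { false, true }) {
            for (String layer : layers) {
                System.out.printf("forceOcr=%-5b textLayer=%-16s -> %s%n",
                        force, "'" + layer + "'", chosenSource(force, layer));
            }
        }
    }
}
```

The four combinations show the point: with the setting on, the text layer is never used, even when it is perfectly good, and in no combination are the two sources used together.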

I would like to say that this whole thread is my own fault, but IMHO this is simply a bug and should be fixed. Right now I am a newbie with this system, but I promise that once I get used to it, I will fix it. Shame on us if we miss the opportunity to extract text from the text layer AND from OCRed images together.

And once again, I am thankful to the open source community. If OpenKM were proprietary software, I would have had no chance to investigate my problem, and would probably have formed a negative opinion just because I couldn't understand what was happening.

My advice to other users: Guys, unless you have special considerations, don't set system.pdf.force.ocr to true.
 #26276  by kalatchev
 
One more comment on the above: the warning text "PDF does not contains text layer" is deceptive. It should at least say "PDF does not contains text layer or OCR forced by config". However, I think the text captured in the previous step (extraction from the text layer) shouldn't be thrown away, but instead added to the text captured by OCR. Just removing the line
Code: Select all
if (Config.SYSTEM_PDF_FORCE_OCR || st.length() <= 1) {
should achieve what I propose: capturing text from both the text layer and the images.
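
A rough sketch of the combined behaviour I have in mind, not a patch against the real PdfTextExtractor. Looking at it again, deleting the `if` line on its own would probably not be enough, because the `else` branch and the buffer that gets returned would have to change as well. The names `CombinedExtraction` and `combine` are hypothetical, and the OCR step is passed in as a `Supplier` so the example does not depend on PDFBox or Tesseract.

```java
import java.util.function.Supplier;

final class CombinedExtraction {
    // Hypothetical helper: join the text layer with the OCR output
    // instead of choosing only one of them.
    static String combine(String textLayer, Supplier<String> ocr) {
        StringBuilder sb = new StringBuilder();
        String layer = textLayer == null ? "" : textLayer.trim();
        if (!layer.isEmpty()) {
            sb.append(layer);
        }
        // OCR always runs in this sketch; that is the performance cost.
        String ocrText = ocr.get();
        if (ocrText != null && !ocrText.trim().isEmpty()) {
            if (sb.length() > 0) {
                sb.append('\n');
            }
            sb.append(ocrText.trim());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(combine("This is text.", () -> "Text found in a scanned image."));
    }
}
```

Because OCR would run for every PDF, a real change would probably want to skip it for documents without images, or keep a setting to turn it off.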

I am open to hearing from all of you, developers and users, if there are other considerations against my proposal. As I said before, I'm a newbie with the system and may be missing something, performance issues for instance.
 #27501  by vincentk222
 
Hi,
I had this problem too and changed the config as described.
But I still have a problem with search ...
I do not get the same result from text extraction, from the index terms, and from the text inside the PDF.

Utilities icon / list of indexes:
Code: Select all
#	65
_hibernate_class 	com.openkm.dao.bean.NodeDocument
uuid 	8785daf6-d5b3-4ec6-8f1e-c10823bb26a9
parent 	07eee10c-db57-432a-a1b4-bad48b805d58
context 	okm_root
author 	okmAdmin
created 	20131218
name 	999.pdf
category 	200f79e4-b913-4eed-ae04-78df8d5d99f7
userPermission 	okmAdmin
rolePermission 	ROLE_USER
okp:livraison.type 	VENTE
okp:livraison.numero 	999
lastModified 	20131218
language 	ro
mimeType 	application/pdf
checkedOut 	false
textExtracted 	true
locked 	false
terms 	[anfügen, b, c, d, e, ergänzung, ermöglicht, f, für, g, geschäfts, geschäftsdokumente, h, hinzufügen, i, ingen, j, k, l, lösung, m, message, n, o, p, q, r, rüberbr, s, systemunabhängigkeit, t, u, v, w, x, y, z, ändern, änderungen]
When I do text extraction:
Code: Select all
O P T I M I E R E N S I E I H R E D O K U M E N T E N P R O Z E S S E - N A C H B E L I E B E N D A T E N HINZUFÜGEN O D E R ÄNDERN, B I L D E R U N D G R A F I K E N I N T E G R I E R E N , S C H N E L L E R A U F N E U E A N F O R D E R U N G E N U N D ÄNDERUNGEN D E R G E S E T Z G E B U N G R E A G I E R E N , D O K U M E N T E I N E C H T Z E I T ÄNDERN S O W I E D Y N A M I S C H E U N D P E R S O N A L I S I E R T E I N F O R M A T I O N E N A N I H R E K U N D E N S E N D E N ... U N D A L L E S O H N E ÄNDERUNGEN A N I H R E M E R P - S Y S T E M ! O b j e c t i f L u n e s P l a n e t P r e s s S u i t e i s t e i n e w e r t v o l l e Ergänzung z u I h r e m E R P - S y s t e m , d i e e s U n t e r n e h m e n ermöglicht, i h r e Geschäftsdokumente a u f z u w e r t e n u n d i n d e m v o m A d r e s s a t e n b e v o r z u g t e n F o r m a t z u v e r s e n d e n . I m D o k u m e n t e n w o r k f l o v v - v o n d e r E r s t e l l u n g , d e r A u s g a b e u n d d e r Z u s t e l l u n g b i s h i n z u r A r c h i v i e r u n g i n I h r e m D o k u m e n t - M a n a g e m e n t - S y s t e m - k a n n P l a n e t P r e s s j e d e n e i n z e l n e n A r b e i t s a b l a u f s t e u e r n u n d a u t o m a t i s i e r e n . Hinzufügen v o n G r a f i k e n , O v e r l a y s , Q R - C o d e s , W e r b u n g , T e x t e n , B i l d e r n . D o k u m e n t e a u f w e r t e n ( z . B . B a r c o d e s , O M R - C o d e s ) E l e k t r o n i s c h e R e c h n u n g K u n d e n a n s c h r e i b e n hinzufügen P o s t a l i s c h e S o r t i e r u n g . 
i A u t o m a t i s i e r t e A u s g a b e p r o z e s s e D i e e i n f a c h e B e d i e n u n g u n d d i e Systemunabhängigkeit m a c h e n d a s P r o g r a m m z u r p e r f e k t e n Lösung, u m Geschäftsdokumente m i t w i c h t i g e n I n f o r m a t i o n e n , d i e d i e A u f m e r k s a m k e i t d e s L e s e r s f e s s e l n u n d d i e „Message rüberbr ingen", s c h n e l l u n d e i n f a c h a u f z u w e r t e n , z u p r o d u z i e r e n u n d z u v e r s e n d e n - p e r P o s t , E - M a i l o d e r o n l i n e u n d für d i e A r c h i v i e r u n g . A u t o m a t i s c h Geschäfts­ b e d i n g u n g e n anfügen E i n b i n d u n g v o n R e s p o n s e - S t e u e r u n g ( B a r c o d e s o d e r m i t d e m D i g i t a l i s i e r u n g s s t i f t v o n A n o t o ) Z e i t g l e i c h e E r s t e l l u n g e i n e r o r i g i n a l g e t r e u e n d i g i t a l e n K o p i e A r c h i v i e r u n g E - M a i l W e b F a x D r u c k & N a c h v e r a r b e i t u n g 
When I copy and paste the text from inside the PDF:
Code: Select all
OPTIMIEREN SIE IHRE DOKUMENTENPROZESSE - NACH BELIEBEN DATEN HINZUFÜGEN ODER ÄNDERN, BILDER UND GRAFIKEN INTEGRIEREN, SCHNELLER AUF NEUE ANFORDERUNGEN UND ÄNDERUNGEN DER GESETZGEBUNG REAGIEREN, DOKUMENTE IN ECHTZEIT ÄNDERN SOWIE DYNAMISCHE UND PERSONALISIERTE INFORMATIONEN AN IHRE KUNDEN SENDEN ... UND ALLES OHNE ÄNDERUNGEN AN IHREM ERP-SYSTEM!
Objectif Lunes
PlanetPres
s Suite ist eine wertvolle Ergänzung zu Ihrem ERP-System, die es Unternehmen ermöglicht, ihre Geschäftsdokumente aufzuwerten und in dem vom Adressaten bevorzugten Format zu versenden.
Im Dokumentenworkflovv - von der Erstellung, der Ausgabe und der Zustellung bis hin zur Archivierung in Ihrem Dokument-Management-System - kann PlanetPress jeden einzelnen Arbeitsablauf steuern und automatisieren.
Hinzufügen von Grafiken, Overlays, QR-Codes, Werbung, Texten, Bildern .
Dokumente aufwerten (z.B. Barcodes, OMR-Codes)
Elektronische Rechnung
Kundenanschreiben hinzufügen
Postalische Sortierung
.i
Automatisierte Ausgabeprozesse
Die einfache Bedienung und die Systemunabhängigkeit machen das Programm zur perfekten Lösung, um Geschäftsdokumente mit wichtigen Informationen, die die Aufmerksamkeit des Lesers fesseln und die „Message rüberbringen", schnell und einfach aufzuwerten, zu produzieren und zu versenden - per Post, E-Mail oder online und für die Archivierung.
Automatisch Geschäftsbedingungen anfügen
Einbindung von Response-Steuerung (Barcodes oder mit dem Digitalisierungsstift von Anoto)
Zeitgleiche Erstellung einer originalgetreuen digitalen Kopie
Archivierung E-Mail
Web
F a x
Druck & Nachverarbeitung
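
The extracted text above has a space between almost every letter, which would explain why the terms list is full of single letters (b, c, d, ...) and why search does not find whole words. A rough way to flag this kind of output, as a sketch only: the class name and the 0.5 threshold are my own guesses and not anything from OpenKM.

```java
final class LetterSpacingCheck {
    // Fraction of whitespace-separated tokens that are a single letter.
    static double singleLetterRatio(String text) {
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return 0.0;
        }
        String[] tokens = trimmed.split("\\s+");
        int single = 0;
        for (String token : tokens) {
            if (token.length() == 1 && Character.isLetter(token.charAt(0))) {
                single++;
            }
        }
        return (double) single / tokens.length;
    }

    // Hypothetical threshold: more than half of the tokens are single letters.
    static boolean looksLetterSpaced(String text) {
        return singleLetterRatio(text) > 0.5;
    }

    public static void main(String[] args) {
        System.out.println(looksLetterSpaced("O P T I M I E R E N S I E I H R E"));      // true
        System.out.println(looksLetterSpaced("OPTIMIEREN SIE IHRE DOKUMENTENPROZESSE")); // false
    }
}
```

This only detects the symptom. It does not say whether the spacing comes from the text layer of the PDF or from OCR.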
Attachments
(208.53 KiB)
