How-to: Using the OCR webservice of webPDF 7

Minimum technical requirements

  • Java version: 7
  • webPDF version: 7
  • wsclient version: 1

This example shows how to use the OCR webservice of webPDF. The OCR functions in webPDF are based on Tesseract. By default, German, English, French, Spanish and Italian are supported; further languages can be added in the tesseract folder (see the webPDF manual for details). Languages with a multibyte character set, for example Arabic and Far Eastern languages, are currently not supported. The OCR webservice is most useful for documents whose text is visible but not embedded as text, such as scans. To extract ordinary embedded text from PDF documents, webPDF offers a simpler option in the Toolbox webservice.

Creating the project and generating the necessary proxy classes

Create a Java project in IntelliJ with the following options:

  • Template: Command Line App
  • Project name: OCRExample
  • Project location: ..\OCRExample
  • Base package: net.webpdf

Open the project view in IntelliJ. Open a command prompt in the “src” folder and generate the proxy classes with the following command:

wsimport -Xnocompile -extension -s . http://localhost:8080/webPDF/soap/ocr?wsdl

As in the previous examples, create a “Main” class with a “main” method.

The project is created and the proxy classes are generated.

Use of the “OCR” webservice

As in the previous blog posts, all the code again goes into the “main” method of the “Main” class.

Before you do this, however, create a folder called “content” in the project and copy the example files “TIFFimgContent.tiff” and “webPDFContent.pdf”, which can be found in the appendix, into it.

Figure: Creating the new “content” folder

Figure: Project with the filled “content” folder

Now start writing the program code in the “main” method.

File pdfFile = new File("./content/webPDFContent.pdf");
File tiffFile = new File("./content/TIFFimgContent.tiff");

URL ocrUrl;

try {
    // Same plain-HTTP address that was used for wsimport above.
    ocrUrl = new URL("http://localhost:8080/webPDF/soap/ocr?wsdl");
} catch (MalformedURLException ex) {
    System.err.println(ex.getMessage());
    return;
}

First, two File objects are created that refer to the files in the “content” folder of the project. Then, as in the previous blog posts, the URL for creating the service instance is set up.
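
A missing input file or a mistyped service address only shows up later as a less obvious error from the webservice call. The following is a minimal sketch of a check that could run before the service is created. The class name “Preflight” and both helper methods are hypothetical and not part of webPDF or the generated proxy classes; the address layout <base>/soap/<service>?wsdl is an assumption inferred from the wsimport command above.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class for this tutorial; not part of webPDF.
public class Preflight {

    // Returns every given file that is not a regular file or cannot be read.
    public static List<File> missingFiles(File... files) {
        List<File> missing = new ArrayList<File>();
        for (File file : files) {
            if (!file.isFile() || !file.canRead()) {
                missing.add(file);
            }
        }
        return missing;
    }

    // Builds the WSDL address, assuming the layout <base>/soap/<service>?wsdl.
    public static String wsdlAddress(String baseUrl, String service) {
        String base = baseUrl.endsWith("/")
                ? baseUrl.substring(0, baseUrl.length() - 1)
                : baseUrl;
        return base + "/soap/" + service + "?wsdl";
    }

    public static void main(String[] args) {
        List<File> missing = missingFiles(
                new File("./content/webPDFContent.pdf"),
                new File("./content/TIFFimgContent.tiff"));
        for (File file : missing) {
            System.err.println("Missing input file: " + file.getPath());
        }
        System.out.println(wsdlAddress("http://localhost:8080/webPDF", "ocr"));
    }
}
```

If the list returned by missingFiles is not empty, the program can stop before any request is sent.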

Operation ocrStrictTextOperation = new Operation();
ocrStrictTextOperation.setOcr(new OcrType());
ocrStrictTextOperation.getOcr().setLanguage(OcrLanguageType.DEU);
ocrStrictTextOperation.getOcr().setOutputFormat(OcrOutputType.TEXT);

Operation ocrTolerantTextOperation = new Operation();
ocrTolerantTextOperation.setOcr(new OcrType());
ocrTolerantTextOperation.getOcr().setLanguage(OcrLanguageType.DEU);
ocrTolerantTextOperation.getOcr().setOutputFormat(OcrOutputType.TEXT);
// Files with a resolution below 200 DPI are processed as well.
ocrTolerantTextOperation.getOcr().setCheckResolution(false);

Operation ocrHocrOperation = new Operation();
ocrHocrOperation.setOcr(new OcrType());
ocrHocrOperation.getOcr().setLanguage(OcrLanguageType.DEU);
ocrHocrOperation.getOcr().setOutputFormat(OcrOutputType.HOCR);

Operation ocrPdfOperation = new Operation();
ocrPdfOperation.setOcr(new OcrType());
ocrPdfOperation.getOcr().setLanguage(OcrLanguageType.DEU);
ocrPdfOperation.getOcr().setOutputFormat(OcrOutputType.PDF);
ocrPdfOperation.getOcr().setCheckResolution(false);

Here four different “Operation” instances are created. The webservice-specific operations were introduced in the last blog post. All four instances use German as their language (OcrLanguageType.DEU).

Two of the instances use text (OcrOutputType.TEXT) as the output format, one uses XHTML according to the hOCR standard (OcrOutputType.HOCR) and one uses PDF (OcrOutputType.PDF). Two instances skip the check that the transferred document has a resolution of at least 200 DPI (setCheckResolution(false)); the other two perform it, since the check is enabled by default. The parameters are also described in the webPDF manual.
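
The 200 DPI threshold is easier to judge with a number in hand. As a rough illustration only, the resolution of a scan follows from its pixel width and the physical width of the page. How webPDF itself determines the resolution is not described here, and the class and method names below are made up for this sketch.

```java
// Hypothetical helper for this tutorial; not part of webPDF.
public class ScanResolution {

    // Threshold named in the text above.
    static final double MIN_DPI = 200.0;

    // DPI = pixels along one edge divided by the length of that edge in inches.
    public static double dpi(int pixels, double inches) {
        if (inches <= 0) {
            throw new IllegalArgumentException("inches must be positive");
        }
        return pixels / inches;
    }

    // True if the scan reaches the 200 DPI threshold.
    public static boolean meetsMinimum(int pixels, double inches) {
        return dpi(pixels, inches) >= MIN_DPI;
    }

    public static void main(String[] args) {
        // An A4 page is 210 mm wide, i.e. 210 / 25.4 inches.
        double a4WidthInches = 210 / 25.4;
        System.out.println(meetsMinimum(2480, a4WidthInches)); // roughly 300 DPI
        System.out.println(meetsMinimum(1240, a4WidthInches)); // roughly 150 DPI
    }
}
```

A scan that falls below the threshold is a candidate for an operation with setCheckResolution(false).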

OCRService ocrService = new OCRService(ocrUrl);

OCR ocr = ocrService.getOCRPort();

StringBuilder pdfTextResult = new StringBuilder("Text extracted from a PDF:\n");
StringBuilder imgTextResult = new StringBuilder("Text extracted from a tiff image:\n");

try {
    // Request 1: text output for the PDF, passed as a DataHandler.
    DataHandler ocrHandler = ocr.execute(ocrStrictTextOperation, new DataHandler(new FileDataSource(pdfFile)), null);
    try (Scanner textScanner = new Scanner(ocrHandler.getInputStream())) {
        while (textScanner.hasNextLine()) {
            pdfTextResult.append(textScanner.nextLine()).append("\n");
        }
    }

    // Request 2: text output for the TIFF, passed by its file URL.
    ocrHandler = ocr.execute(ocrTolerantTextOperation, null, tiffFile.toURI().toURL().toString());
    try (Scanner textScanner = new Scanner(ocrHandler.getInputStream())) {
        while (textScanner.hasNextLine()) {
            imgTextResult.append(textScanner.nextLine()).append("\n");
        }
    }

    // Request 3: hOCR output for the PDF, saved in the project directory.
    ocrHandler = ocr.execute(ocrHocrOperation, new DataHandler(new FileDataSource(pdfFile)), null);
    try (FileOutputStream out = new FileOutputStream(new File("./hOCRResult.xhtml"))) {
        ocrHandler.writeTo(out);
    }

    // Request 4: PDF output for the TIFF, saved in the project directory.
    ocrHandler = ocr.execute(ocrPdfOperation, null, tiffFile.toURI().toURL().toString());
    try (FileOutputStream out = new FileOutputStream(new File("./PDFResult.pdf"))) {
        ocrHandler.writeTo(out);
    }
} catch (WebserviceException | IOException e) {
    System.err.println(e.getMessage());
    return;
}

System.out.println(pdfTextResult.toString());
System.out.println("-----------------------------------------\n");
System.out.println(imgTextResult.toString());

In this part of the code, the service and the port/endpoint object are created first. Then four requests are sent to the OCR webservice. Each return value is held as a DataHandler object in the variable “ocrHandler”. The first two requests each return a text file in a DataHandler; its content is read immediately with a Scanner object and appended to the corresponding StringBuilder. The third request returns an XHTML file according to the hOCR standard and the fourth a PDF file, both again in a DataHandler. These are saved directly in the project directory. At the end, the texts extracted from the PDF and the TIFF file are printed.
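
The read loop and the file output each occur twice in the code above, so they could be moved into small helpers. The following sketch uses only plain java.io streams so that it runs without a webPDF server. The class and method names are hypothetical, and UTF-8 is an assumption, since the encoding of the returned text is not stated in this post.

```java
import java.io.ByteArrayInputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Scanner;

// Hypothetical helper class for this tutorial; not part of webPDF.
public class StreamHelpers {

    // Reads all lines from the stream and closes it.
    // UTF-8 is an assumption about the encoding of the text result.
    public static String readLines(InputStream in) {
        StringBuilder text = new StringBuilder();
        try (Scanner scanner = new Scanner(in, "UTF-8")) {
            while (scanner.hasNextLine()) {
                text.append(scanner.nextLine()).append("\n");
            }
        }
        return text.toString();
    }

    // Copies the stream into the target file and closes both.
    public static void save(InputStream in, File target) throws IOException {
        try (InputStream source = in;
             OutputStream out = new FileOutputStream(target)) {
            byte[] buffer = new byte[8192];
            int read;
            while ((read = source.read(buffer)) != -1) {
                out.write(buffer, 0, read);
            }
        }
    }

    public static void main(String[] args) {
        // Stand-in for the stream that a service response would provide.
        byte[] sample = "first line\nsecond line".getBytes(StandardCharsets.UTF_8);
        System.out.print(readLines(new ByteArrayInputStream(sample)));
    }
}
```

In the example above, the helpers would be called with ocrHandler.getInputStream() as the input stream.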

Figure: The generated XHTML and PDF files

Figure: Output during execution

Figure: hOCRResult.xhtml in browser
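
Because the hOCR result is XHTML, it can also be processed by a program instead of only being viewed in a browser. The sketch below pulls the recognized words and their bounding boxes out of a small hand-written hOCR fragment with the XML parser of the JDK. It assumes that words are marked with the class “ocrx_word” and carry a “bbox” entry in the title attribute, as the hOCR format defines; whether the file produced by webPDF has exactly this structure is not confirmed here. The class name “HocrWords” is hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper class for this tutorial; not part of webPDF.
public class HocrWords {

    // Returns one entry per word in the form "text [title attribute]".
    public static List<String> words(String xhtml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(
                    new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)));
            NodeList spans = doc.getElementsByTagName("span");
            List<String> result = new ArrayList<String>();
            for (int i = 0; i < spans.getLength(); i++) {
                Element span = (Element) spans.item(i);
                // Only word-level elements; lines and pages are skipped.
                if ("ocrx_word".equals(span.getAttribute("class"))) {
                    result.add(span.getTextContent().trim()
                            + " [" + span.getAttribute("title") + "]");
                }
            }
            return result;
        } catch (Exception e) {
            throw new IllegalArgumentException("Input is not well-formed XHTML", e);
        }
    }

    public static void main(String[] args) {
        // Hand-written sample, not actual webPDF output.
        String sample =
                "<html><body><div class='ocr_page'>"
                + "<span class='ocr_line' title='bbox 10 20 200 40'>"
                + "<span class='ocrx_word' title='bbox 10 20 90 40'>Hello</span> "
                + "<span class='ocrx_word' title='bbox 100 20 200 40'>webPDF</span>"
                + "</span></div></body></html>";
        for (String word : words(sample)) {
            System.out.println(word);
        }
    }
}
```

A real hOCR file may start with a DOCTYPE declaration, in which case the parser may try to load the referenced DTD; the sample fragment deliberately omits it.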

With the OCR webservice you have now extracted text that was not embedded as normal text from a PDF and a TIFF file, generated a PDF file from the text analysis of the TIFF file, and generated an XHTML file according to the hOCR standard from the text analysis of the PDF file.

Congratulations!

Appendix:

Required imports for the class:

import de.webpdf.schema._1_0.operation.OcrLanguageType;
import de.webpdf.schema._1_0.operation.OcrOutputType;
import de.webpdf.schema._1_0.operation.OcrType;
import de.webpdf.schema._1_0.operation.Operation;
import de.webpdf.schema._1_0.soap.ocr.OCR;
import de.webpdf.schema._1_0.soap.ocr.OCRService;
import de.webpdf.schema._1_0.soap.ocr.WebserviceException;

import javax.activation.DataHandler;
import javax.activation.FileDataSource;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.net.MalformedURLException;
import java.net.URL;
import java.util.Scanner;