Using the OCR Webservice with the wsclient library

The OCR Webservice: Character recognition/Text recognition

How do you use the webservices of webPDF with the wsclient library? We show this here with a concrete coding example: this post introduces the OCR webservice and explains how to call it with the webPDF wsclient library.

Important note:

The following coding example is based on the webPDF wsclient library. To understand and apply the example, you should first read the following blog post:

webPDF and Java: very easy with the “wsclient” library

Creating a REST or SOAP Session

To call the webservice as presented here, it is assumed that you have already created a REST or SOAP session and can therefore either create an OcrWebService object (for a SOAP session) by calling the WebServiceFactory:

OcrWebService ocrWebService =
    WebServiceFactory.createInstance(
        session, WebServiceType.OCR
    );

...or create an OcrRestWebService object (for a REST session):

OcrRestWebService ocrWebService =
    WebServiceFactory.createInstance(
        session, WebServiceType.OCR
    );

It is also assumed that you have passed either a RestDocument or a SoapDocument object to this webservice object by calling its setDocument() method.

Webservice Parameters

To gain write access to a password-protected document, you have to pass the document's current open and/or permission password with the webservice call. You can set these directly on the ocrWebService object you created:

ocrWebService.getPassword().setOpen("password");
ocrWebService.getPassword().setPermission("password");

The OCR Webservice

The OCR Webservice is an endpoint of your webPDF server that allows you to recognize text in the images of your PDF document and either extract it or place it directly in a text layer of the PDF document.
To pass additional parameters, retrieve the OcrType object from your OcrWebService object as follows:

OcrType ocr = ocrWebService.getOperation();

The following parameters can be set on the OcrType object:

language (default value: “eng”)

The language in which the text is to be recognized. The following values are possible here:

• eng = English
• fra = French
• spa = Spanish
• deu = German
• ita = Italian

ocr.setLanguage(OcrLanguageType.DEU);
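
The five language codes above coincide with the three-letter codes that Java returns from Locale.getISO3Language(). If your application already works with a Locale, a small helper like the following could pick the matching code. This is only a sketch: the class OcrLanguageCodes and the fallback to “eng” are our own assumptions and not part of the wsclient library.

```java
import java.util.Locale;
import java.util.MissingResourceException;
import java.util.Set;

// Hypothetical helper for illustration; not part of the wsclient library.
final class OcrLanguageCodes {

    // The five language codes listed in this article.
    private static final Set<String> SUPPORTED =
        Set.of("eng", "fra", "spa", "deu", "ita");

    // Returns the three-letter code for the locale's language,
    // or "eng" (the default value) if the language is not in the list.
    static String forLocale(Locale locale) {
        String code;
        try {
            code = locale.getISO3Language();
        } catch (MissingResourceException e) {
            code = "";
        }
        return SUPPORTED.contains(code) ? code : "eng";
    }

    public static void main(String[] args) {
        System.out.println(forLocale(Locale.GERMANY));
        System.out.println(forLocale(Locale.forLanguageTag("es-ES")));
        System.out.println(forLocale(Locale.JAPAN));
    }
}
```

Mapping the resulting string to the corresponding OcrLanguageType constant (for example OcrLanguageType.DEU) is left to the calling code.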

checkResolution (default value: true)

If this value is set to true, the system checks whether the resolution of the document is sufficient for text recognition. Resolutions below 200 DPI are rejected, because text recognition with low-resolution graphics usually leads to incorrect results.

ocr.setCheckResolution(false);
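
To get a feeling for what the 200 DPI limit means for a scanned page, the effective resolution of a page image can be estimated from its pixel width and the width of the PDF page in points (1 point = 1/72 inch). The following sketch only illustrates this arithmetic; how the webPDF server determines the resolution internally is not described here, and the class ResolutionCheck is a hypothetical helper.

```java
// Hypothetical helper for illustration; not part of the wsclient library.
final class ResolutionCheck {

    static final double POINTS_PER_INCH = 72.0;
    static final int MIN_DPI = 200; // limit named in this article

    // Effective resolution of an image that fills a page of the given width.
    static double effectiveDpi(int imageWidthPx, double pageWidthPt) {
        return imageWidthPx * POINTS_PER_INCH / pageWidthPt;
    }

    // True if the estimated resolution reaches the 200 DPI limit.
    static boolean isSufficient(int imageWidthPx, double pageWidthPt) {
        return effectiveDpi(imageWidthPx, pageWidthPt) >= MIN_DPI;
    }

    public static void main(String[] args) {
        double a4WidthPt = 595.28; // 210 mm expressed in points
        System.out.println(isSufficient(1240, a4WidthPt)); // roughly 150 DPI
        System.out.println(isSufficient(2480, a4WidthPt)); // roughly 300 DPI
    }
}
```

An A4 scan that is 1240 pixels wide therefore falls below the limit, while a scan that is 2480 pixels wide clears it.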

forceEachPage (default value: false)

If a PDF document already contains pages with text content, a renewed text recognition is normally rejected. If this value is set to true, text recognition is forced and a text layer is created for all text-free pages.

ocr.setForceEachPage(true);

imageDpi (default value: 200)

This value sets the minimum resolution (in DPI) of output documents.

ocr.setImageDpi(300);

outputFormat (default value: “pdf”)

Determines the output format for the recognized text. If “pdf” is selected here, the recognized text is placed as a text layer over the pages of the source document. The following values are possible here:

• text = text document
• hocr = XML (hOCR)
• pdf = PDF with text layer

ocr.setOutputFormat(OcrOutputType.TEXT);
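
If you choose the “hocr” output format, you will usually want to read the recognized words and their positions from the result file. The following sketch shows one way to do this. It assumes that the output follows the common hOCR conventions (words in span elements with class “ocrx_word”, the class attribute before the title attribute, and a title of the form “bbox x0 y0 x1 y1”); the sample markup and the class HocrWords are made up for illustration. For production use, an HTML/XML parser is more robust than a regular expression.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of the wsclient library.
final class HocrWords {

    // One recognized word with its bounding box in image pixels.
    static final class Word {
        final String text;
        final int x0, y0, x1, y1;

        Word(String text, int x0, int y0, int x1, int y1) {
            this.text = text;
            this.x0 = x0; this.y0 = y0; this.x1 = x1; this.y1 = y1;
        }
    }

    // Matches word spans that contain plain text only (no nested markup).
    private static final Pattern WORD = Pattern.compile(
        "<span[^>]*class=['\"]ocrx_word['\"][^>]*"
        + "title=['\"]bbox (\\d+) (\\d+) (\\d+) (\\d+)[^'\"]*['\"][^>]*>"
        + "([^<]*)</span>");

    static List<Word> parse(String hocr) {
        List<Word> words = new ArrayList<>();
        Matcher m = WORD.matcher(hocr);
        while (m.find()) {
            words.add(new Word(
                m.group(5).trim(),
                Integer.parseInt(m.group(1)), Integer.parseInt(m.group(2)),
                Integer.parseInt(m.group(3)), Integer.parseInt(m.group(4))));
        }
        return words;
    }

    public static void main(String[] args) {
        // Made-up sample in the usual hOCR style.
        String sample =
            "<span class='ocr_line' title='bbox 36 92 210 116'>"
            + "<span class='ocrx_word' id='word_1_1' "
            + "title='bbox 36 92 96 116; x_wconf 95'>Hello</span> "
            + "<span class='ocrx_word' id='word_1_2' "
            + "title='bbox 110 92 210 116; x_wconf 93'>webPDF</span>"
            + "</span>";
        for (Word w : parse(sample)) {
            System.out.println(w.text + " @ " + w.x0 + "," + w.y0);
        }
    }
}
```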

The “page” object

For text recognition, images of the PDF pages are generated at the selected resolution. If you add an OcrPageType object to the OcrType object, you can further influence the dimensions of the generated images:

OcrPageType page = new OcrPageType();
ocr.setPage(page);

The following parameters can be set on the OcrPageType object:

width (default value: 210)

Sets the width of the pages in the selected unit of measurement.

page.setWidth(800);

height (default value: 297)

Sets the height of the pages in the selected unit of measurement.

page.setHeight(600);

metrics (default value: “mm”)

The unit of measurement in which the dimensions are to be given. The following values can be set here:

• mm = millimeter
• px = pixel

page.setMetrics(MetricsType.MM);
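
To judge how large the generated images become, it can help to convert page dimensions given in millimeters into pixels for a given resolution. The following sketch shows the arithmetic for the default values (210 x 297 mm, 200 DPI). It assumes a plain conversion via 25.4 mm per inch with rounding to whole pixels; whether the server rounds in exactly this way is an assumption, and the class PageRaster is a hypothetical helper.

```java
// Hypothetical helper for illustration; not part of the wsclient library.
final class PageRaster {

    static final double MM_PER_INCH = 25.4;

    // Converts a length in millimeters to pixels at the given resolution.
    static long mmToPx(double mm, int dpi) {
        return Math.round(mm / MM_PER_INCH * dpi);
    }

    public static void main(String[] args) {
        int dpi = 200; // default value of imageDpi
        long widthPx = mmToPx(210, dpi);  // default page width in mm
        long heightPx = mmToPx(297, dpi); // default page height in mm
        System.out.println(widthPx + " x " + heightPx + " px"); // 1654 x 2339 px
    }
}
```

At 300 DPI, the same page comes to 2480 x 3508 pixels, so the amount of image data grows quickly with the resolution.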

Complete example of the webservice call

The following example shows the entire webservice call using the SOAP interface:

// Required imports: java.io.File, java.net.URL, java.net.MalformedURLException
// and the corresponding wsclient classes.
try (
    // Set up a session with the webPDF server (here SOAP):
    SoapSession session = SessionFactory.createInstance(
        WebServiceProtocol.SOAP,
        new URL("https://localhost:8080/webPDF/")
    );
    // Provide the document to be processed
    // and the file to which the result is written:
    SoapDocument soapDocument = new SoapDocument(
        new File("Path of the source document").toURI(),
        new File("Path of the target document")
    )
) {
    // Selection of the webservice via a factory:
    OcrWebService ocrWebService =
        WebServiceFactory.createInstance(
            session, WebServiceType.OCR
        );
    ocrWebService.setDocument(soapDocument);
    ocrWebService.getPassword().setOpen("password");
    ocrWebService.getPassword().setPermission("password");

    OcrType ocr = ocrWebService.getOperation();

    ocr.setLanguage(OcrLanguageType.DEU);
    ocr.setCheckResolution(false);
    ocr.setForceEachPage(true);
    ocr.setImageDpi(300);
    ocr.setOutputFormat(OcrOutputType.TEXT);

    OcrPageType page = new OcrPageType();
    ocr.setPage(page);

    page.setWidth(800);
    page.setHeight(600);
    page.setMetrics(MetricsType.PX);
    // Execute the webservice call.
    ocrWebService.process();
} catch (ResultException | MalformedURLException ex) {
    // The wsclient library provides appropriate methods
    // for evaluating any errors that have occurred:
}

Concluding remarks

  • More information about the OCR parameter structure and error codes can be found in our user manual.
  • Please also note: All parameters are preset with default values. If a parameter's default value already matches the value you want, you do not need to set that parameter explicitly.

More coding examples for webservices that you can use with the wsclient library can be found here.