OCR Webservice

Minimum Technical Requirements

  • Java version: 7
  • webPDF version: 7
  • wsclient version: 1

Using OCR Text Recognition with the wsclient Library

How can the webPDF webservices be used in practice with the wsclient library? This article walks through a concrete code example for the OCR webservice.

Important Note

The following code example is based on the webPDF wsclient library. To understand and apply the example, you should first review the related blog post.

Creating a REST or SOAP Session

The calls shown here assume that you have already created a REST or SOAP session. Using the WebServiceFactory, you can then create either an OcrWebService object for SOAP:

OcrWebService ocrWebService =
    WebServiceFactory.createInstance(
        session, WebServiceType.OCR
    );

Or an OcrRestWebService object for REST:

OcrRestWebService ocrWebService =
    WebServiceFactory.createInstance(
        session, WebServiceType.OCR
    );

Then pass the document to this webservice object using the setDocument() method: a RestDocument for a REST session or a SoapDocument for a SOAP session.

Webservice Parameters

To get write access to a password-protected document, the webservice call needs the document's current open and/or permission password. You can set both directly on the created ocrWebService object:

ocrWebService.getPassword().setOpen("password");
ocrWebService.getPassword().setPermission("password");

If the document is not password protected, you can skip this step.

The OCR Webservice

The OCR webservice is an endpoint of your webPDF server that allows you to recognize text in image content of a PDF document and either extract it or write it directly into a text layer of the PDF.

You can retrieve the OcrType object from your OcrWebService object like this:

OcrType ocr = ocrWebService.getOperation();

The following parameters can be set on the OcrType object:

language (default: "eng")

Defines the language in which text should be recognized. The following values are available:

  • eng = English
  • fra = French
  • spa = Spanish
  • deu = German
  • ita = Italian

ocr.setLanguage(OcrLanguageType.DEU);

checkResolution (default: true)

If this value is set to true, the system checks whether the document resolution is sufficient for text recognition. Resolutions below 200 DPI are rejected because OCR on low-resolution graphics usually produces poor results.

ocr.setCheckResolution(false);
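To get a feel for what the 200 DPI threshold means, the following sketch estimates the effective resolution of a scanned page image from its pixel width and the PDF page width in points (1 point = 1/72 inch). The class ResolutionCheck and its methods are hypothetical and not part of wsclient. They only illustrate the arithmetic; how the webPDF server determines the resolution internally is not covered here.

```java
// Hypothetical helper, not part of wsclient: illustrates the DPI arithmetic only.
final class ResolutionCheck {
    // Threshold mentioned in the text above.
    static final double MIN_DPI = 200.0;
    // PDF user space uses 72 points per inch.
    static final double POINTS_PER_INCH = 72.0;

    // Effective DPI of an image that spans the full page width.
    static double effectiveDpi(int imageWidthPx, double pageWidthPt) {
        if (imageWidthPx <= 0 || pageWidthPt <= 0) {
            throw new IllegalArgumentException("dimensions must be positive");
        }
        double pageWidthInch = pageWidthPt / POINTS_PER_INCH;
        return imageWidthPx / pageWidthInch;
    }

    // True if the image would pass a 200 DPI minimum.
    static boolean isSufficient(int imageWidthPx, double pageWidthPt) {
        return effectiveDpi(imageWidthPx, pageWidthPt) >= MIN_DPI;
    }

    public static void main(String[] args) {
        double a4WidthPt = 595.0; // approximate A4 width in points
        System.out.println(isSufficient(1240, a4WidthPt)); // about 150 DPI
        System.out.println(isSufficient(2480, a4WidthPt)); // about 300 DPI
    }
}
```

A scan that fails such a check is usually better rescanned at a higher resolution than processed with checkResolution set to false.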

forceEachPage (default: false)

If a PDF document already contains pages with text content, OCR is normally rejected. If this value is set to true, OCR is forced and a text layer is created for all text-free pages.

ocr.setForceEachPage(true);
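The behavior described for forceEachPage can be summarized as a small decision rule. The sketch below is a plain Java illustration of that rule as stated in the text; ForceEachPageSketch and its method are hypothetical names, and the server's actual implementation and error reporting may differ.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of the forceEachPage rule, not part of wsclient.
final class ForceEachPageSketch {
    // Returns the 1-based numbers of the pages that would receive a text layer.
    // Throws IllegalStateException in the case where the text says OCR is rejected.
    static List<Integer> pagesToRecognize(boolean[] pageHasText, boolean forceEachPage) {
        List<Integer> textFreePages = new ArrayList<>();
        boolean anyText = false;
        for (int i = 0; i < pageHasText.length; i++) {
            if (pageHasText[i]) {
                anyText = true;
            } else {
                textFreePages.add(i + 1);
            }
        }
        if (anyText && !forceEachPage) {
            throw new IllegalStateException("document already contains text");
        }
        return textFreePages;
    }

    public static void main(String[] args) {
        // Page 1 already has text, pages 2 and 3 are scans.
        boolean[] pages = {true, false, false};
        System.out.println(pagesToRecognize(pages, true));
    }
}
```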

imageDpi (default: 200)

Sets the minimum resolution (in DPI) of source documents.

ocr.setImageDpi(300);

outputFormat (default: "pdf")

Determines the output format for the recognized text. If PDF is selected, the recognized text is placed as a text layer on top of the source document pages. The following values are available:

  • text = text document
  • hocr = XML (HOCR)
  • pdf = PDF text layer

ocr.setOutputFormat(OcrOutputType.TEXT);
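If you choose hocr, the result contains the recognized words together with their positions. As a rough sketch of how such output could be post-processed, the following example extracts words and bounding boxes from a hand-written fragment that follows the common hOCR convention (word spans with class ocrx_word and a title attribute starting with bbox). The class HocrWords is hypothetical, the sample is not actual webPDF output, and a regular expression is used only to keep the sketch short; a real implementation would more likely use an HTML or XML parser.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of wsclient.
final class HocrWords {
    // Matches word spans whose class attribute precedes the title attribute.
    private static final Pattern WORD = Pattern.compile(
        "<span[^>]*class=['\"]ocrx_word['\"][^>]*"
        + "title=['\"]bbox (\\d+) (\\d+) (\\d+) (\\d+)[^'\"]*['\"][^>]*>"
        + "([^<]*)</span>");

    // Returns one entry per word in the form "text@x0,y0,x1,y1".
    static List<String> extract(String hocr) {
        List<String> words = new ArrayList<>();
        Matcher m = WORD.matcher(hocr);
        while (m.find()) {
            words.add(m.group(5) + "@" + m.group(1) + "," + m.group(2)
                + "," + m.group(3) + "," + m.group(4));
        }
        return words;
    }

    public static void main(String[] args) {
        // Hand-written sample in hOCR style (assumption, not server output).
        String sample =
            "<span class='ocrx_word' title='bbox 36 92 96 116; x_wconf 95'>Hello</span> "
            + "<span class='ocrx_word' title='bbox 104 92 190 116; x_wconf 91'>world</span>";
        System.out.println(extract(sample));
    }
}
```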

The page Object

For OCR, page images are generated at the selected resolution. If you add an OcrPageType object to the OcrType object, you can further control the dimensions of the generated images:

OcrPageType page = new OcrPageType();
ocr.setPage(page);

The following parameters can be set on the OcrPageType object:

width (default: 210)

Sets the page width in the unit selected with metrics.

page.setWidth(800);

height (default: 297)

Sets the page height in the unit selected with metrics.

page.setHeight(16);

metrics (default: "mm")

Defines the unit used for dimensions:

  • mm = millimeter
  • px = pixel

page.setMetrics(MetricsType.MM);
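How the two units relate depends on the resolution. As a rough illustration, the sketch below converts millimeter dimensions to pixels for a given DPI value (1 inch = 25.4 mm). PageMetrics is a hypothetical helper, not part of wsclient, and the rounding applied by the server may differ.

```java
// Hypothetical helper, not part of wsclient: converts millimeters to pixels.
final class PageMetrics {
    static final double MM_PER_INCH = 25.4;

    // Pixel count for a length in millimeters at the given resolution,
    // rounded to the nearest whole pixel.
    static long mmToPx(double mm, int dpi) {
        if (mm < 0 || dpi <= 0) {
            throw new IllegalArgumentException("mm must be >= 0 and dpi > 0");
        }
        return Math.round(mm / MM_PER_INCH * dpi);
    }

    public static void main(String[] args) {
        // Default page size (210 x 297 mm) at the default imageDpi (200).
        long widthPx = mmToPx(210, 200);
        long heightPx = mmToPx(297, 200);
        System.out.println(widthPx + " x " + heightPx + " px"); // prints "1654 x 2339 px"
    }
}
```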

More Detailed Example of the Full Webservice Call

Below is an example of the full webservice call for the SOAP interface:

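try (
    // Set up a session with the webPDF server (SOAP in this example):
    SoapSession session = SessionFactory.createInstance(
        WebServiceProtocol.SOAP,
        new URL("https://localhost:8080/webPDF/")
    );
    // Provide the document to be processed
    // and the file to which the result should be written:
    SoapDocument soapDocument = new SoapDocument(
        new File("Path to the source document").toURI(),
        new File("Path to the target document")
    )
) {
    // Create the OCR webservice and pass the document to it:
    OcrWebService ocrWebService = WebServiceFactory.createInstance(
        session, WebServiceType.OCR
    );
    ocrWebService.setDocument(soapDocument);

    // Set the OCR parameters described above:
    OcrType ocr = ocrWebService.getOperation();
    ocr.setLanguage(OcrLanguageType.DEU);
    ocr.setForceEachPage(true);

    // Execute the call (assumes the webservice object's process() method):
    ocrWebService.process();
} catch (ResultException | MalformedURLException ex) {
    // Error handling
}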

Final Notes

  • More information about the OCR parameter structure and error codes can be found in our documentation.
  • All parameters have default values. If a default already matches the behavior you want, you do not need to set that parameter.

More coding examples for webservices that can be used with the wsclient library can be found here.