Toolbox WebService Extraction: Extract Content

Technical Minimum Requirements

  • Java version: 7
  • webPDF version: 7
  • wsclient version: 1

Extracting Content with the webPDF wsclient Library

This article provides a concrete coding example for the Extraction operation of the webPDF ToolboxWebService and shows how it can be implemented with the wsclient library.

Important Note

The following coding example is based on the webPDF wsclient library. To understand and apply it, read the related blog post first.

Important Preparation Steps

Before calling the webservice, you should already have created a REST or SOAP session. Then you can use the WebServiceFactory to create either a ToolboxWebService object for SOAP:

ToolboxWebService toolboxWebService =
    WebServiceFactory.createInstance(
        session, WebServiceType.TOOLBOX
    );

Or a ToolboxRestWebService object for REST:

ToolboxRestWebService toolboxWebService =
    WebServiceFactory.createInstance(
        session, WebServiceType.TOOLBOX
    );

You can then pass the matching document object, a SoapDocument for SOAP or a RestDocument for REST, to this WebService object using setDocument().

What Do the Webservice Parameters Look Like?

To get modifying access to a document, you usually need to pass the current open and/or permission password of the document to the webservice call. It is best to do this directly on the created ToolboxWebService object:

toolboxWebService.getPassword().setOpen("password");
toolboxWebService.getPassword().setPermission("password");

If the document is not password protected, you can skip this step.

The ToolboxWebService in Detail

The ToolboxWebService is an endpoint of your webPDF server that bundles a range of operations for editing PDF documents. One of them is the Extraction operation, which extracts specific content from a PDF document.

Add an Extraction operation to your WebService object like this:

ExtractionType extraction = new ExtractionType();
toolboxWebService.getOperation().add(extraction);

The content to be extracted is selected by adding a corresponding object to the Extraction object. The following options are available:

The text Object

To extract text from the document, add an ExtractionTextType object to the Extraction object. This creates either a text, XML, or JSON file.

ExtractionTextType text = new ExtractionTextType();
extraction.setText(text);

The following parameters can be set on the Text object:

pages (default: "")

Defines the page range from which content should be extracted. You can specify a single page (1), a list of pages (1,3,5), a page range (1-5), or a combination (1,3-5,6). All pages of the document can be selected with *.

text.setPages("1,3-5,6");
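To make the page range syntax more tangible, the following self-contained sketch expands such a string into individual page numbers. It only illustrates the notation and is not how webPDF itself evaluates the value: the class PageRangeSketch, the method expand, and the pageCount argument are hypothetical names introduced for this example, and the empty default value is deliberately not modeled.

```java
import java.util.Set;
import java.util.TreeSet;

public class PageRangeSketch {

    // Expands a page range string such as "1,3-5,6" into sorted page numbers.
    // "*" selects every page from 1 to pageCount (hypothetical parameter).
    public static Set<Integer> expand(String pages, int pageCount) {
        Set<Integer> result = new TreeSet<>();
        for (String part : pages.split(",")) {
            String token = part.trim();
            if (token.equals("*")) {
                addRange(result, 1, pageCount);
            } else if (token.contains("-")) {
                String[] bounds = token.split("-", 2);
                addRange(result,
                        Integer.parseInt(bounds[0].trim()),
                        Integer.parseInt(bounds[1].trim()));
            } else {
                result.add(Integer.parseInt(token));
            }
        }
        return result;
    }

    // Adds all page numbers from first to last (inclusive) to the target set.
    private static void addRange(Set<Integer> target, int first, int last) {
        for (int page = first; page <= last; page++) {
            target.add(page);
        }
    }

    public static void main(String[] args) {
        System.out.println(expand("1,3-5,6", 10)); // prints [1, 3, 4, 5, 6]
        System.out.println(expand("*", 3));        // prints [1, 2, 3]
    }
}
```

The same notation applies to every pages parameter described in this article.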

fileFormat (default: "xml")

Defines the output format for the extracted content:

  • text = text document
  • xml = XML document
  • json = JSON document

text.setFileFormat(ExtractionFileFormatType.XML);

The links Object

To extract links from the document, add an ExtractionLinksType object to the Extraction object. This creates either a text, XML, or JSON file.

ExtractionLinksType links = new ExtractionLinksType();
extraction.setLinks(links);

The following parameters can be set on the Links object:

pages (default: "")

Defines the page range from which content should be extracted.

links.setPages("1,3-5,6");

fileFormat (default: "xml")

Defines the output format for the extracted content:

  • text = text document
  • xml = XML document
  • json = JSON document

links.setFileFormat(ExtractionFileFormatType.XML);

The text Object as a Substructure of ExtractionLinksType

Normally, link extraction only extracts annotations that are clearly marked as links. If links should also be extracted directly from the page text, an ExtractionLinksType.Text object can be added to the Links object.

ExtractionLinksType.Text text = new ExtractionLinksType.Text();
links.setText(text);

The following parameters can be set on this Text object:

fromText (default: false)

If this value is set to true, links are also extracted from the page content.

text.setFromText(true);

protocol (default: "")

Lets you extract only links from specific protocols. Multiple protocols can be separated by commas, for example http,https,ftp.

text.setProtocol("http,https,ftp");

withoutProtocol (default: true)

If this value is set to true, URL/URI-like structures without a specified protocol are also extracted, for example www.webpdf.de.

text.setWithoutProtocol(false);
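How protocol and withoutProtocol could interact can be pictured with a small filter over link candidates. This is a conceptual sketch only, not webPDF's actual detection logic: the class LinkFilterSketch, the method filter, and the sample addresses are hypothetical, and it is an assumption here that an empty protocol value allows every protocol.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

public class LinkFilterSketch {

    // Keeps candidates whose protocol appears in the comma-separated list.
    // Assumption: an empty list allows every protocol.
    // Candidates without "://" are kept only if withoutProtocol is true.
    public static List<String> filter(List<String> candidates,
                                      String protocol,
                                      boolean withoutProtocol) {
        List<String> allowed = new ArrayList<>();
        for (String entry : protocol.split(",")) {
            String name = entry.trim().toLowerCase(Locale.ROOT);
            if (!name.isEmpty()) {
                allowed.add(name);
            }
        }
        List<String> kept = new ArrayList<>();
        for (String candidate : candidates) {
            int separator = candidate.indexOf("://");
            if (separator < 0) {
                // URL-like structure without a protocol, e.g. www.webpdf.de
                if (withoutProtocol) {
                    kept.add(candidate);
                }
            } else {
                String scheme = candidate.substring(0, separator).toLowerCase(Locale.ROOT);
                if (allowed.isEmpty() || allowed.contains(scheme)) {
                    kept.add(candidate);
                }
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> candidates = Arrays.asList(
                "https://www.webpdf.de", "ftp://example.org/file.pdf", "www.webpdf.de");
        System.out.println(filter(candidates, "http,https", false)); // prints [https://www.webpdf.de]
        System.out.println(filter(candidates, "http,https", true));  // prints [https://www.webpdf.de, www.webpdf.de]
    }
}
```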

The info Object

To extract information and meta information from the document, such as security settings, PDF properties, or the PDF/A status, add an ExtractionInfoType object to the Extraction object. This creates either an XML or JSON file.

ExtractionInfoType info = new ExtractionInfoType();
extraction.setInfo(info);

The following parameters can be set on the Info object:

pages (default: "")

Defines the page range from which content should be extracted.

info.setPages("1,3-5,6");

fileFormat (default: "xml")

Defines the output format for the extracted content:

  • text = text document
  • xml = XML document
  • json = JSON document

info.setFileFormat(ExtractionFileFormatType.XML);

The words Object

To extract text word by word together with coordinates, add an ExtractionWordsType object to the Extraction object. This creates either a text, XML, or JSON file.

ExtractionWordsType words = new ExtractionWordsType();
extraction.setWords(words);

The following parameters can be set on the Words object:

pages (default: "")

Defines the page range from which content should be extracted.

words.setPages("1,3-5,6");

fileFormat (default: "xml")

Defines the output format for the extracted content:

  • text = text document
  • xml = XML document
  • json = JSON document

words.setFileFormat(ExtractionFileFormatType.XML);

delimitAfterPunctuation (default: true)

If this value is set to true, punctuation marks are also treated as word boundaries.

words.setDelimitAfterPunctuation(false);

extendedSequenceCharacters (default: false)

If this value is set to true, not only brackets are attached to a word, but also quotation marks, apostrophes, and similar characters.

words.setExtendedSequenceCharacters(true);

removePunctuation (default: false)

If this value is set to true, all punctuation marks are excluded from the export.

words.setRemovePunctuation(true);

The paragraphs Object

To extract text paragraph by paragraph, add an ExtractionParagraphsType object to the Extraction object. This creates either a text, XML, or JSON file.

ExtractionParagraphsType paragraphs = new ExtractionParagraphsType();
extraction.setParagraphs(paragraphs);

The following parameters can be set on the Paragraphs object:

pages (default: "")

Defines the page range from which content should be extracted.

paragraphs.setPages("1,3-5,6");

fileFormat (default: "xml")

Defines the output format for the extracted content:

  • text = text document
  • xml = XML document
  • json = JSON document

paragraphs.setFileFormat(ExtractionFileFormatType.XML);

The images Object

To extract images from the document, add an ExtractionImagesType object to the Extraction object. This creates a ZIP archive.

ExtractionImagesType images = new ExtractionImagesType();
extraction.setImages(images);

The following parameters can be set on the Images object:

pages (default: "")

Defines the page range from which content should be extracted.

images.setPages("1,3-5,6");

fileFormat (default: "zip")

Defines the output format for the images to be extracted:

  • zip = ZIP archive

images.setFileFormat(ExtractionFileFormatType.ZIP);

fileNameTemplate (default: "file[%d]")

This value is a template for the names of the images in the ZIP archive. The placeholder %d must be included and will be replaced by an index.

images.setFileNameTemplate("image[%d]");

folderNameTemplate (default: "page[%d]")

This value is a template for the folder names in the ZIP archive. For each page from which images are extracted, one such folder is created. The placeholder %d must be included and will be replaced by the page number.

images.setFolderNameTemplate("page[%d]");
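The effect of the %d placeholder in both templates can be sketched with plain string replacement. The class TemplateSketch and the method expand are hypothetical names for this illustration, and the numbers passed in are sample values; which index webPDF starts counting from is not stated here.

```java
public class TemplateSketch {

    // Replaces the mandatory %d placeholder with the given number.
    public static String expand(String template, int number) {
        if (!template.contains("%d")) {
            throw new IllegalArgumentException("Template must contain the placeholder %d");
        }
        return template.replace("%d", String.valueOf(number));
    }

    public static void main(String[] args) {
        // Folder for page 2 and the name of one image inside it (sample values).
        String folder = expand("page[%d]", 2);
        String file = expand("image[%d]", 1);
        System.out.println(folder); // prints page[2]
        System.out.println(file);   // prints image[1]
    }
}
```

Plain replacement is used instead of String.format so that other characters in a template are taken literally.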

fallbackFormat (default: "png")

When exporting images from a PDF, the system tries to extract them in the format in which they are stored inside the PDF document. If this is not possible, for example because the format is not supported, this value defines the fallback format:

  • png = PNG file
  • jpeg = JPEG file

images.setFallbackFormat(ExtractionImageFormat.PNG);

Webservice Call via SOAP

Here is a more detailed example of the full webservice call using the SOAP interface:

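try (
    // Set up a session with the webPDF server (SOAP in this example):
    SoapSession session = SessionFactory.createInstance(
        WebServiceProtocol.SOAP,
        new URL("https://localhost:8080/webPDF/")
    );

    // Provide the document that should be processed
    // and the file to which the result should be written:
    SoapDocument soapDocument = new SoapDocument(
        new File("Path to the source document").toURI(),
        new File("Path to the target document")
    )
) {
    // Create the webservice object and hand over the document:
    ToolboxWebService toolboxWebService =
        WebServiceFactory.createInstance(session, WebServiceType.TOOLBOX);
    toolboxWebService.setDocument(soapDocument);

    // Add the Extraction operation and select text extraction:
    ExtractionType extraction = new ExtractionType();
    toolboxWebService.getOperation().add(extraction);
    extraction.setText(new ExtractionTextType());

    // Execute the call. process() is assumed here to be the wsclient
    // method that runs the configured operations.
    toolboxWebService.process();
} catch (ResultException | MalformedURLException ex) {
    // Error handling
}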
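The target file passed to SoapDocument receives the extraction result. When the XML format is used, a quick sanity check is to parse that file and look at its root element. The following sketch uses only the JDK's XML parser and makes no assumption about the element names in webPDF's result structure; the class ResultXmlSketch and the method rootElementName are hypothetical names for this example.

```java
import java.io.FileInputStream;
import java.io.InputStream;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

public class ResultXmlSketch {

    // Parses XML from the stream and returns the name of its root element.
    // Any parsing problem is reported as an IllegalStateException.
    public static String rootElementName(InputStream xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
            Document document = factory.newDocumentBuilder().parse(xml);
            return document.getDocumentElement().getNodeName();
        } catch (Exception ex) {
            throw new IllegalStateException("Result is not well-formed XML", ex);
        }
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.out.println("Usage: java ResultXmlSketch <path to the target document>");
            return;
        }
        // Open the file that was given as the target document of the call.
        try (InputStream in = new FileInputStream(args[0])) {
            System.out.println("Root element: " + rootElementName(in));
        }
    }
}
```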

Our Documentation

  • A detailed description of the parameters, without wsclient examples, can be found here: Extraction parameter structure
  • Documentation of error codes and possible errors can be found here.
  • Also note that all parameters come with default values. If a default value already matches your desired behavior, it does not necessarily need to be set.

More coding examples for webservices that can be used with the wsclient library can be found here.