31 July 2019, 4 June 2020, by Nele Zymek

ToolboxWebService Extraction: Extract Pages

Posted in Technology, webPDF, webPDF Webservices, webPDF wsclient

Minimum technical requirements

Java version: 7
webPDF version: 7
wsclient version: 1

Content extraction with the webPDF wsclient library

Here is a concrete coding example: it shows the Extraction operation of the webPDF ToolboxWebService and how to use it with the wsclient library.

Important note:

The following coding example is based on the webPDF wsclient library. To understand and apply it, you should first read this blog post:

webPDF and Java: very easy with the “wsclient” library

Important preliminary work

Before you call the webservice, you must have created a REST or SOAP session. You can then create a ToolboxWebService object (for a SOAP session) by calling the WebServiceFactory:

ToolboxWebService toolboxWebService =
    WebServiceFactory.createInstance(
        session, WebServiceType.TOOLBOX
    );

...or a ToolboxRestWebService object (for a REST session):

ToolboxRestWebService toolboxWebService =
    WebServiceFactory.createInstance(
        session, WebServiceType.TOOLBOX
    );

You can then pass a SoapDocument or RestDocument object (matching the session type) to this WebService object by calling setDocument().

What do the webservice parameters look like?

First of all, to get write access to a document, you have to pass the document's current open and/or permission password to the webservice call. It is best to set these directly on the ToolboxWebService object you created:

toolboxWebService.getPassword().setOpen("password");
toolboxWebService.getPassword().setPermission("password");

By the way: if the document is not password-protected, you can skip this step.

The ToolboxWebService in detail

The ToolboxWebService is an endpoint of your webPDF server that bundles a number of operations for editing PDF documents. One of them is the Extraction operation, which extracts specific content from a PDF document.

This is how you add an Extraction Operation to your WebService object:

ExtractionType extraction = new ExtractionType();
toolboxWebService.getOperation().add(extraction);

The selection of the contents to be extracted is done by adding a corresponding object to the Extraction object. The following are your choices:

The „text“ object

If you want to extract text from the document, set an ExtractionTextType object on the Extraction object. This creates either a text, XML, or JSON file.

ExtractionTextType text = new ExtractionTextType();
extraction.setText(text);

The following parameters can be set for the Text object:

pages (default value: „“)

Specifies the page range from which content is to be extracted. You can specify either a single page („1“), a list of pages („1,3,5“), a page range („1-5“), or a combination of these elements („1,3-5,6“). All pages of the document can be selected using „*“.

text.setPages("1,3-5,6");
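To make the page selection syntax more tangible, here is a minimal sketch that expands such a string into concrete page numbers. It is not part of the wsclient library and does not claim to mirror how the webPDF server parses the value; the class PageRanges, the method expand and the pageCount argument are hypothetical names chosen for illustration. The empty default value is returned as an empty list here, because the text above does not say what it selects.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

// Hypothetical helper, not part of wsclient: expands a page selection such as
// "1,3-5,6" or "*" into a sorted list of page numbers.
final class PageRanges {

    private PageRanges() {
    }

    static List<Integer> expand(String pages, int pageCount) {
        TreeSet<Integer> selected = new TreeSet<>();
        String trimmed = pages.trim();

        // "*" selects all pages of the document.
        if (trimmed.equals("*")) {
            for (int page = 1; page <= pageCount; page++) {
                selected.add(page);
            }
            return new ArrayList<>(selected);
        }

        for (String part : trimmed.split(",")) {
            String token = part.trim();
            if (token.isEmpty()) {
                continue; // empty selection: nothing to add (assumption)
            }
            // A token is either a single page ("3") or a range ("3-5").
            int dash = token.indexOf('-');
            int first = Integer.parseInt(dash < 0 ? token : token.substring(0, dash).trim());
            int last = dash < 0 ? first : Integer.parseInt(token.substring(dash + 1).trim());
            if (first < 1 || last < first || last > pageCount) {
                throw new IllegalArgumentException("Invalid page selection: " + token);
            }
            for (int page = first; page <= last; page++) {
                selected.add(page);
            }
        }
        return new ArrayList<>(selected);
    }
}
```

A helper like this can be handy for validating user input on the client side before the value is handed to setPages().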

fileFormat (default value: „xml“)

Specifies the output format of the extracted content. The following values can be set here:

text = Text document
xml = XML document
json = JSON document

text.setFileFormat(ExtractionFileFormatType.XML);

The object „links“

If you want to extract links from the document, set an ExtractionLinksType object on the Extraction object. This creates either a text, XML, or JSON file.

ExtractionLinksType links = new ExtractionLinksType();
extraction.setLinks(links);

The following parameters can be set on the Links object:

pages (default value: „“)

links.setPages("1,3-5,6");

fileFormat (default value: „xml“)

Specifies the output format of the extracted content. The following values can be set here:

text = Text document
xml = XML document
json = JSON document

links.setFileFormat(ExtractionFileFormatType.XML);

The object „text“ (ExtractionLinksType substructure)

By default, link extraction only covers annotations that are explicitly marked as links. If links should also be extracted directly from the page text, an ExtractionLinksType.Text object can be added to the Links object.

ExtractionLinksType.Text text = new ExtractionLinksType.Text();
links.setText(text);

The following parameters can be set for the Text object:

fromText (default value: false)

If this value is set to true, links are also extracted from the page contents.

text.setFromText(true);

protocol (default value: „“)

Restricts the extraction to links that use a specific protocol. Multiple protocols can be specified, separated by commas (for example: „http,https,ftp“).

text.setProtocol("http,https,ftp");

withoutProtocol (default value: true)

If this value is set to true, URL/URI-like structures for which no protocol is specified are also extracted (for example: www.webpdf.de).

text.setWithoutProtocol(false);
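The interplay of protocol and withoutProtocol can be pictured as a simple filter over link candidates. The following sketch is only an illustration of the rules described above, not the server's implementation; the class LinkFilter and its method accepts are hypothetical. Two assumptions are made: an empty protocol list means "no restriction", and only candidates of the form scheme://... count as having a protocol.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical illustration, not part of wsclient: decides whether a link
// candidate found in the page text would pass the protocol settings.
final class LinkFilter {

    private final Set<String> protocols = new HashSet<>();
    private final boolean withoutProtocol;

    LinkFilter(String protocol, boolean withoutProtocol) {
        // The protocol parameter is a comma-separated list such as "http,https,ftp".
        for (String part : protocol.split(",")) {
            String name = part.trim().toLowerCase(Locale.ROOT);
            if (!name.isEmpty()) {
                protocols.add(name);
            }
        }
        this.withoutProtocol = withoutProtocol;
    }

    boolean accepts(String candidate) {
        int separator = candidate.indexOf("://");
        if (separator < 0) {
            // No protocol given (e.g. "www.webpdf.de"): depends on withoutProtocol.
            return withoutProtocol;
        }
        String scheme = candidate.substring(0, separator).toLowerCase(Locale.ROOT);
        // Assumption: an empty protocol list does not restrict anything.
        return protocols.isEmpty() || protocols.contains(scheme);
    }
}
```

With the values from the snippets above ("http,https,ftp" and withoutProtocol set to false), this filter would keep https://www.webpdf.de and drop the bare www.webpdf.de.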

The object „info“

If you want to extract information and metadata from the document (such as security settings, PDF properties, or the PDF/A status), set an ExtractionInfoType object on the Extraction object. This creates either an XML or JSON file.

ExtractionInfoType info = new ExtractionInfoType();
extraction.setInfo(info);

The following parameters can be set for the Info object:

pages (default value: „“)

info.setPages("1,3-5,6");

fileFormat (default value: „xml“)

Specifies the output format of the extracted content. The following values can be set here:

text = Text document
xml = XML document
json = JSON document

info.setFileFormat(ExtractionFileFormatType.XML);

The object „words“

If text is to be extracted from the document word by word, including the coordinates of each word, set an ExtractionWordsType object on the Extraction object. This creates either a text, XML, or JSON file.

ExtractionWordsType words = new ExtractionWordsType();
extraction.setWords(words);

The following parameters can be set for the Words object:

pages (default value: „“)

words.setPages("1,3-5,6");

fileFormat (default value: „xml“)

Specifies the output format of the extracted content. The following values can be set here:

text = Text document
xml = XML document
json = JSON document

words.setFileFormat(ExtractionFileFormatType.XML);

delimitAfterPunctuation (default value: true)

If this value is set to true, all punctuation marks are also considered word boundaries.

words.setDelimitAfterPunctuation(false);

extendedSequenceCharacters (default value: false)

If this value is set to true, not only parentheses (square and round brackets) are added to the word, but also quotation marks, apostrophes and the like.

words.setExtendedSequenceCharacters(true);

removePunctuation (default value: false)

If this value is set to true, all punctuation marks are excluded from the export.

words.setRemovePunctuation(true);

The object „paragraphs“

If text is to be extracted from the document paragraph by paragraph, set an ExtractionParagraphsType object on the Extraction object. This creates either a text, XML, or JSON file.

ExtractionParagraphsType paragraphs = new ExtractionParagraphsType();
extraction.setParagraphs(paragraphs);

The following parameters can be set for the Paragraphs object:

pages (default value: „“)

paragraphs.setPages("1,3-5,6");

fileFormat (default value: „xml“)

Specifies the output format of the extracted content. The following values can be set here:

text = text document
xml = XML document
json = JSON document

paragraphs.setFileFormat(ExtractionFileFormatType.XML);

The object „images“

If images are to be extracted from the document, set an ExtractionImagesType object on the Extraction object. This creates a ZIP file.

ExtractionImagesType images = new ExtractionImagesType();
extraction.setImages(images);

The following parameters can be set for the Images object:

pages (default value: „“)

images.setPages("1,3-5,6");

fileFormat (default value: „zip“)

Specifies the output format for the images to be extracted. The following values can be set here:

zip = ZIP archive

images.setFileFormat(ExtractionFileFormatType.ZIP);

fileNameTemplate (default value: „file[%d]“)

This value is a template for the names of the images in the ZIP archive. The placeholder “%d” must be contained and will be replaced by an index.

images.setFileNameTemplate("image[%d]");

folderNameTemplate (default value: „page[%d]“)

This value is a template for the names of the folders in the ZIP archive. For each page from which images are extracted, such a folder is created. The placeholder “%d” must be included and will be replaced by the page number.

images.setFolderNameTemplate("page[%d]");
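To illustrate how the two templates combine, the following sketch builds an entry path from a folder template and a file name template. It is only a guess at the resulting layout: that files are nested inside their page folder, that the extension is appended to the expanded file name, and how the image index is counted are all assumptions. The class ImageEntryNames and the method entryPath are hypothetical and not part of wsclient.

```java
// Hypothetical illustration, not part of wsclient: expands the "%d"
// placeholders of both templates into a path inside the ZIP archive.
final class ImageEntryNames {

    private ImageEntryNames() {
    }

    static String entryPath(String folderNameTemplate, String fileNameTemplate,
                            int pageNumber, int imageIndex, String extension) {
        requirePlaceholder(folderNameTemplate);
        requirePlaceholder(fileNameTemplate);
        // The folder template receives the page number, the file template an index.
        String folder = folderNameTemplate.replace("%d", Integer.toString(pageNumber));
        String file = fileNameTemplate.replace("%d", Integer.toString(imageIndex));
        // Assumption: images are stored inside the folder of their page.
        return folder + "/" + file + "." + extension;
    }

    private static void requirePlaceholder(String template) {
        // Both templates must contain the "%d" placeholder.
        if (!template.contains("%d")) {
            throw new IllegalArgumentException("Template must contain %d: " + template);
        }
    }
}
```

Under these assumptions, the templates „page[%d]“ and „image[%d]“ would yield a path such as page[2]/image[1].png for an image on page 2.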

fallbackFormat (default value: „png“)

When exporting images from a PDF, an attempt is made to extract the images in the format in which they were stored in the PDF document. If this is not possible, for example because the format in question is not supported, this value determines the format to which the images are converted instead. The following values can be set here:

png = PNG file
jpeg = JPEG file

images.setFallbackFormat(ExtractionImageFormat.PNG);

Webservice call by addressing the SOAP interface

Here is a more complete example of the entire webservice call (using the SOAP interface):

try (
    // Setup of a session with the webPDF server (here SOAP):
    SoapSession session = SessionFactory.createInstance(
        WebServiceProtocol.SOAP,
        new URL("https://localhost:8080/webPDF/")
    );
    // Make available the document that is to be processed
    // and the file in which the result is to be written:
    SoapDocument soapDocument = new SoapDocument(
        new File("Path of the source document").toURI(),
        new File("Path of the target document")
    )
) {
    // Selection of the webservice via a factory:
    ToolboxWebService toolboxWebService =
        WebServiceFactory.createInstance(
            session, WebServiceType.TOOLBOX
        );
    toolboxWebService.setDocument(soapDocument);
    toolboxWebService.getPassword().setOpen("password");
    toolboxWebService.getPassword().setPermission("password");

    ExtractionType extraction = new ExtractionType();
    toolboxWebService.getOperation().add(extraction);

    ExtractionImagesType extractionImagesType = new ExtractionImagesType();
    extraction.setImages(extractionImagesType);

    extractionImagesType.setFileFormat(ExtractionFileFormatType.ZIP);
    extractionImagesType.setPages("1-5");

    extractionImagesType.setFileNameTemplate("image[%d]");
    extractionImagesType.setFolderNameTemplate("page[%d]");
    extractionImagesType.setFallbackFormat(ExtractionImageFormat.PNG);

    // Execute the webservice call:
    toolboxWebService.process();
} catch (ResultException | MalformedURLException ex) {
    // For the evaluation of possible errors,
    // the wsclient library provides corresponding methods:
}
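Once the call has finished, the target file should contain the ZIP archive with the extracted images. As a minimal sketch, its contents could be listed with the JDK's own java.util.zip classes. The class ExtractedImages and the method listEntries are hypothetical names and not part of wsclient; the sketch assumes the target file passed to SoapDocument is a regular ZIP file on disk.

```java
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Enumeration;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

// Hypothetical helper, not part of wsclient: lists the files contained
// in the ZIP archive that the extraction wrote to the target file.
final class ExtractedImages {

    private ExtractedImages() {
    }

    static List<String> listEntries(File zip) throws IOException {
        List<String> names = new ArrayList<>();
        try (ZipFile zipFile = new ZipFile(zip)) {
            Enumeration<? extends ZipEntry> entries = zipFile.entries();
            while (entries.hasMoreElements()) {
                ZipEntry entry = entries.nextElement();
                // Skip pure folder entries; only image files are of interest.
                if (!entry.isDirectory()) {
                    names.add(entry.getName());
                }
            }
        }
        // Sort for a stable, readable order.
        Collections.sort(names);
        return names;
    }
}
```

Calling ExtractedImages.listEntries(new File("Path of the target document")) after process() would then return the entry names in sorted order.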

Our Documentation

A detailed description of the parameters, without examples for use with the wsclient library, can be found here: extraction Parameter structure
Documentation of the error codes and of errors that may occur can be found here.
Also note: all parameters are preset with default values. If a parameter's default value already matches the value you want, you do not need to set it.

More coding examples for webservices that you can use with the wsclient library can be found here.

Tags: coding example
