Class PdfTextExtractor


  • public final class PdfTextExtractor
    extends Object
    Extracts text from a PDF file.
    Since:
    2.1.4
    • Method Detail

      • getTextFromPage

        public static String getTextFromPage(PdfReader reader,
                                             int pageNumber,
                                             TextExtractionStrategy strategy,
                                             Map<String, ContentOperator> additionalContentOperators)
                                      throws IOException
        Extract text from a specified page using an extraction strategy. Also allows registration of custom ContentOperators.
        Parameters:
        reader - the reader to extract text from
        pageNumber - the page to extract text from
        strategy - the strategy to use for extracting text
        additionalContentOperators - an optional map of custom ContentOperators for rendering instructions
        Returns:
        the extracted text
        Throws:
        IOException - if any operation fails while reading from the provided PdfReader
      • getTextFromPage

        public static String getTextFromPage(PdfReader reader,
                                             int pageNumber,
                                             TextExtractionStrategy strategy)
                                      throws IOException
        Extract text from a specified page using an extraction strategy.
        Parameters:
        reader - the reader to extract text from
        pageNumber - the page to extract text from
        strategy - the strategy to use for extracting text
        Returns:
        the extracted text
        Throws:
        IOException - if any operation fails while reading from the provided PdfReader
        Since:
        5.0.2
      • getTextFromPage

        public static String getTextFromPage(PdfReader reader,
                                             int pageNumber)
                                      throws IOException
        Extract text from a specified page using the default strategy.

        Note: the default strategy is subject to change. If using a specific strategy is important, use getTextFromPage(PdfReader, int, TextExtractionStrategy) instead.

        Parameters:
        reader - the reader to extract text from
        pageNumber - the page to extract text from
        Returns:
        the extracted text
        Throws:
        IOException - if any operation fails while reading from the provided PdfReader
        Since:
        5.0.2
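
Each overload above extracts a single page, so whole-document extraction needs a loop over page numbers. The following is a minimal sketch of such a loop, not part of this class. PageTextCollector, PageExtractor and collect are hypothetical names introduced for illustration. The sketch assumes page numbers start at 1, and it wraps IOException in UncheckedIOException so the extractor can be passed as a lambda. A stub extractor stands in for the real call so the code runs without a PDF library on the classpath.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the documented API.
final class PageTextCollector {

    /** Stand-in for a call such as PdfTextExtractor.getTextFromPage(reader, pageNumber). */
    @FunctionalInterface
    interface PageExtractor {
        String extract(int pageNumber) throws IOException;
    }

    /**
     * Collects the text of pages 1..pageCount, one list entry per page.
     * Assumption: page numbering is 1-based.
     */
    static List<String> collect(int pageCount, PageExtractor extractor) {
        List<String> pages = new ArrayList<>();
        for (int page = 1; page <= pageCount; page++) {
            try {
                pages.add(extractor.extract(page));
            } catch (IOException e) {
                // Rethrow unchecked so callers can use lambdas without checked exceptions.
                throw new UncheckedIOException("Failed to extract page " + page, e);
            }
        }
        return pages;
    }

    public static void main(String[] args) {
        // Stub extractor; replace with a real per-page extraction call.
        List<String> pages = collect(3, page -> "text of page " + page);
        System.out.println(String.join("\n", pages));
    }
}
```

With a PDF library on the classpath, the stub could be replaced by something like PageTextCollector.collect(reader.getNumberOfPages(), page -> PdfTextExtractor.getTextFromPage(reader, page)), where reader is an open PdfReader. The overload that takes a TextExtractionStrategy can be plugged in the same way.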