aoptk.literature.pymupdf_parser
===============================

.. py:module:: aoptk.literature.pymupdf_parser


Classes
-------

.. autoapisummary::

   aoptk.literature.pymupdf_parser.PymupdfParser


Module Contents
---------------

.. py:class:: PymupdfParser(pdfs: list[aoptk.literature.pdf.PDF], figure_storage: str = 'tests/figure_storage', text_generation: aoptk.text_generation_api.TextGenerationAPI | None = None)

   Bases: :py:obj:`aoptk.literature.pdf_parser.PDFParser`


   Parse PDFs using PyMuPDF.


   .. py:attribute:: figure_storage
      :value: 'tests/figure_storage'


   .. py:attribute:: pdfs


   .. py:attribute:: pattern_figure_descriptions
      :value: '(?ms)(?<=\\n)\\s*Figure\\s+\\d+\\.?\\s*(.*?)(?=\\n)'
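
      As a minimal sketch of what this pattern matches, the snippet below applies it with ``re.findall`` to a made-up text. How ``_extract_figure_descriptions`` actually applies the pattern is not documented here, so the ``findall`` call and the sample text are assumptions for illustration only.

      .. code-block:: python

         import re

         # Pattern copied from the attribute value above.
         PATTERN = r"(?ms)(?<=\n)\s*Figure\s+\d+\.?\s*(.*?)(?=\n)"

         # Hypothetical sample text; real input would come from a parsed PDF.
         sample = (
             "Introduction text.\n"
             "Figure 1. Cell viability after exposure.\n"
             "More body text.\n"
             "Figure 2 Dose response curve.\n"
             "Closing text.\n"
         )

         # findall returns the single capture group: the caption text after "Figure N."
         captions = re.findall(PATTERN, sample)
         print(captions)

      Because of the ``(?<=\n)`` lookbehind, a caption at the very start of the text (with no preceding newline) is not matched, and the lazy group stops at the first newline after the caption text begins.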


   .. py:attribute:: pattern_any_character
      :value: '(.*)'


   .. py:attribute:: text_generation
      :value: None


   .. py:method:: get_publications() -> list[aoptk.literature.publication.Publication]

      Get the publications parsed from the PDFs.


   .. py:method:: get_abstracts() -> list[aoptk.literature.abstract.Abstract]

      Get abstracts from the PDFs.

      :returns: List of abstracts obtained from the PDFs.
      :rtype: list[Abstract]


   .. py:method:: _parse_pdf(pdf: aoptk.literature.pdf.PDF) -> aoptk.literature.publication.Publication

      Parse a single PDF and return a Publication object.


   .. py:method:: _extract_abstract(pdf: aoptk.literature.pdf.PDF, publication_id: aoptk.literature.id.ID) -> aoptk.literature.abstract.Abstract

      Extract the abstract from the text of the PDF.


   .. py:method:: _extract_full_text(pdf: aoptk.literature.pdf.PDF) -> str

      Extract the full text from the PDF.

      :param pdf: The PDF object to extract text from.
      :type pdf: PDF

      :returns: The extracted full text from the PDF.
      :rtype: str


   .. py:method:: _is_too_short(text: str, min_length: int = 1000) -> bool

      Check if the text is too short to be a valid full text.

      :param text: The text to check.
      :type text: str
      :param min_length: The minimum length of valid full text.
      :type min_length: int

      :returns: True if the text is too short, False otherwise.
      :rtype: bool
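
      A minimal sketch of such a length check, assuming a plain character count compared against ``min_length``. Whether the real method strips whitespace first or uses a strict comparison is not stated here, and the function name ``is_too_short`` is hypothetical.

      .. code-block:: python

         def is_too_short(text: str, min_length: int = 1000) -> bool:
             """Return True when the text has fewer than min_length characters (assumed rule)."""
             return len(text) < min_length


         print(is_too_short("Only a title page."))  # short input
         print(is_too_short("x" * 1500))            # long input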


   .. py:method:: _is_corrupted(text: str, max_corruption_ratio: float = 0.1) -> bool

      Check if the text is corrupted based on the ratio of control characters.

      :param text: The text to check.
      :type text: str
      :param max_corruption_ratio: The maximum allowed ratio of corrupted characters.
      :type max_corruption_ratio: float

      :returns: True if the text is corrupted, False otherwise.
      :rtype: bool
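
      The ratio check could look roughly like the sketch below. It assumes that "control character" means Unicode category ``Cc`` excluding newline, carriage return, and tab, and that empty text is not treated as corrupted. Both are assumptions rather than documented behaviour, and ``is_corrupted`` is a hypothetical stand-in for the method.

      .. code-block:: python

         import unicodedata


         def is_corrupted(text: str, max_corruption_ratio: float = 0.1) -> bool:
             """Return True when the share of control characters exceeds the threshold."""
             if not text:
                 return False  # Assumption: empty text is not counted as corrupted.
             # Assumption: ordinary whitespace controls are harmless and not counted.
             allowed = {"\n", "\r", "\t"}
             bad = sum(
                 1
                 for ch in text
                 if unicodedata.category(ch) == "Cc" and ch not in allowed
             )
             return bad / len(text) > max_corruption_ratio


         print(is_corrupted("Clean text.\n"))
         print(is_corrupted("\x00\x01ab"))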


   .. py:method:: _extract_pdf_as_images(pdf: aoptk.literature.pdf.PDF) -> list[str]

      Extract each page of the PDF as an image and return a list of base64-encoded images.

      :param pdf: The PDF object to extract images from.
      :type pdf: PDF

      :returns: A list of base64-encoded image strings.
      :rtype: list[str]
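
      The encoding step alone can be sketched with the standard library. The snippet assumes each page has already been rendered to image bytes (with PyMuPDF this is typically done through ``Page.get_pixmap`` and ``Pixmap.tobytes``); the helper name ``encode_images`` and the placeholder bytes are hypothetical.

      .. code-block:: python

         import base64


         def encode_images(page_images: list[bytes]) -> list[str]:
             """Base64-encode raw image bytes, one string per page."""
             return [base64.b64encode(img).decode("ascii") for img in page_images]


         # Placeholder bytes stand in for rendered page images.
         encoded = encode_images([b"abc"])
         print(encoded)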


   .. py:method:: _extract_full_text_from_images(pdf_as_images: list[str]) -> str

      Extract text from a list of base64-encoded images using the TextGenerationAPI.

      :param pdf_as_images: A list of base64-encoded image strings.
      :type pdf_as_images: list[str]

      :returns: The extracted full text from the images.
      :rtype: str


   .. py:method:: _extract_text_blocks_without_irrelevant_border_text(pages: collections.abc.Iterable[tuple[int, pymupdf.Page]], top_margin_frac: float = 0.07, bottom_margin_frac: float = 0.07, side_margin_frac: float = 0.02) -> list[tuple[int, int, float, float, float, float, str]]

      Collect text blocks from pages within margin bounds.
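
      The margin filtering can be illustrated without PyMuPDF. The sketch below assumes a block is kept only when its bounding box lies fully inside the region left after trimming the fractional margins; the real method may use a different rule. The ``Block`` alias, the helper name, and the sample page are hypothetical.

      .. code-block:: python

         Block = tuple[float, float, float, float, str]  # (x0, y0, x1, y1, text)


         def blocks_within_margins(
             blocks: list[Block],
             page_width: float,
             page_height: float,
             top_margin_frac: float = 0.07,
             bottom_margin_frac: float = 0.07,
             side_margin_frac: float = 0.02,
         ) -> list[Block]:
             """Keep blocks whose bounding box lies inside the margins (assumed rule)."""
             left = page_width * side_margin_frac
             right = page_width * (1 - side_margin_frac)
             top = page_height * top_margin_frac
             bottom = page_height * (1 - bottom_margin_frac)
             return [
                 b
                 for b in blocks
                 if b[0] >= left and b[2] <= right and b[1] >= top and b[3] <= bottom
             ]


         # Hypothetical 595 x 842 point page with a header, a body block, and a footer.
         blocks = [
             (50.0, 20.0, 545.0, 40.0, "Journal header"),
             (50.0, 100.0, 545.0, 700.0, "Body paragraph"),
             (50.0, 810.0, 545.0, 830.0, "Page 3"),
         ]
         kept = blocks_within_margins(blocks, page_width=595.0, page_height=842.0)
         print([b[4] for b in kept])

      With the default fractions, running headers and page numbers near the top and bottom edges are dropped while body text is kept.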


   .. py:method:: _extract_text_to_parse(pdf: aoptk.literature.pdf.PDF) -> str

      Extract text to parse from the PDF.


   .. py:method:: _clean_control_chars(text: str) -> str

      Remove unwanted control characters.
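
      Which characters count as "unwanted" is not specified here. The sketch below assumes the C0 control characters and DEL are removed while tab, newline, and carriage return are kept; ``clean_control_chars`` is a hypothetical stand-in for the method.

      .. code-block:: python

         import re

         # Assumption: strip C0 controls and DEL, but keep \t, \n, and \r.
         _CONTROL_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]")


         def clean_control_chars(text: str) -> str:
             """Return the text with the assumed set of control characters removed."""
             return _CONTROL_CHARS.sub("", text)


         cleaned = clean_control_chars("Dose\x00 response\x0c\ncurve")
         print(repr(cleaned))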


   .. py:method:: _extract_figure_descriptions(text: str) -> list[str]

      Extract figure descriptions from the text.


   .. py:method:: _extract_figures(pdf: aoptk.literature.pdf.PDF) -> list[str]

      Extract figures from the PDF and save them to the output directory.


   .. py:method:: _save_figure(output_dir: str, figure_count: int, base_figure: dict, figure_bytes: bytes) -> None

      Save the extracted figure to the output directory.


   .. py:method:: _figure_large_enough(figure_bytes: bytes) -> bool

      Check if the figure is larger than 50 KB.
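
      A minimal sketch of this size check, assuming that 1 KB means 1024 bytes and that the comparison is strict; neither detail is confirmed by the description above, and the names are hypothetical.

      .. code-block:: python

         MIN_FIGURE_BYTES = 50 * 1024  # Assumption: 1 KB = 1024 bytes.


         def figure_large_enough(figure_bytes: bytes) -> bool:
             """Return True when the image payload exceeds the assumed 50 KB threshold."""
             return len(figure_bytes) > MIN_FIGURE_BYTES


         print(figure_large_enough(b"\x00" * 60_000))  # above the threshold
         print(figure_large_enough(b"\x00" * 1_000))   # small icon-sized payload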