aoptk.literature.pymupdf_parser
===============================

.. py:module:: aoptk.literature.pymupdf_parser


Classes
-------

.. autoapisummary::

   aoptk.literature.pymupdf_parser.PymupdfParser


Module Contents
---------------

.. py:class:: PymupdfParser(pdfs: list[aoptk.literature.pdf.PDF], figure_storage: str = 'tests/figure_storage', text_generation: aoptk.text_generation_api.TextGenerationAPI | None = None)

   Bases: :py:obj:`aoptk.literature.pdf_parser.PDFParser`


   Parse PDFs using PyMuPDF.


   .. py:attribute:: figure_storage
      :value: 'tests/figure_storage'


   .. py:attribute:: pdfs


   .. py:attribute:: pattern_figure_descriptions
      :value: '(?ms)(?<=\\n)\\s*Figure\\s+\\d+\\.?\\s*(.*?)(?=\\n)'


   .. py:attribute:: pattern_any_character
      :value: '(.*)'


   .. py:attribute:: text_generation
      :value: None


   .. py:method:: get_publications() -> list[aoptk.literature.publication.Publication]

      Get a list of publications.



   .. py:method:: get_abstracts() -> list[aoptk.literature.abstract.Abstract]

      Get abstracts from the PDFs.

      :returns: List of abstracts obtained from the PDFs.
      :rtype: list[Abstract]



   .. py:method:: _parse_pdf(pdf: aoptk.literature.pdf.PDF) -> aoptk.literature.publication.Publication

      Parse a single PDF and return a Publication object.



   .. py:method:: _extract_abstract(pdf: aoptk.literature.pdf.PDF, publication_id: aoptk.literature.id.ID) -> aoptk.literature.abstract.Abstract

      Extract the abstract from the text.



   .. py:method:: _extract_full_text(pdf: aoptk.literature.pdf.PDF) -> str

      Extract the full text from the PDF.

      :param pdf: The PDF object to extract text from.
      :type pdf: PDF

      :returns: The extracted full text from the PDF.
      :rtype: str



   .. py:method:: _is_too_short(text: str, min_length: int = 1000) -> bool

      Check if the text is too short to be a valid full text.

      :param text: The text to check.
      :type text: str
      :param min_length: The minimum length of valid full text.
      :type min_length: int

      :returns: True if the text is too short, False otherwise.
      :rtype: bool



   .. py:method:: _is_corrupted(text: str, max_corruption_ratio: float = 0.1) -> bool

      Check if the text is corrupted based on the ratio of control characters.

      :param text: The text to check.
      :type text: str
      :param max_corruption_ratio: The maximum allowed ratio of corrupted characters.
      :type max_corruption_ratio: float

      :returns: True if the text is corrupted, False otherwise.
      :rtype: bool



   .. py:method:: _extract_pdf_as_images(pdf: aoptk.literature.pdf.PDF) -> list[str]

      Extract each page of the PDF as an image and return a list of base64-encoded images.

      :param pdf: The PDF object to extract images from.
      :type pdf: PDF

      :returns: A list of base64-encoded image strings.
      :rtype: list[str]



   .. py:method:: _extract_full_text_from_images(pdf_as_images: list[str]) -> str

      Extract text from a list of base64-encoded images using the TextGenerationAPI.

      :param pdf_as_images: A list of base64-encoded image strings.
      :type pdf_as_images: list[str]

      :returns: The extracted full text from the images.
      :rtype: str



   .. py:method:: _extract_text_blocks_without_irrelevant_border_text(pages: collections.abc.Iterable[tuple[int, pymupdf.Page]], top_margin_frac: float = 0.07, bottom_margin_frac: float = 0.07, side_margin_frac: float = 0.02) -> list[tuple[int, int, float, float, float, float, str]]

      Collect text blocks from pages within margin bounds.



   .. py:method:: _extract_text_to_parse(pdf: aoptk.literature.pdf.PDF) -> str

      Extract text to parse from the PDF.



   .. py:method:: _clean_control_chars(text: str) -> str

      Remove unwanted control characters.



   .. py:method:: _extract_figure_descriptions(text: str) -> list[str]

      Extract figure descriptions from the text.



   .. py:method:: _extract_figures(pdf: aoptk.literature.pdf.PDF) -> list[str]

      Extract figures from the PDF and save them to the output directory.



   .. py:method:: _save_figure(output_dir: str, figure_count: int, base_figure: dict, figure_bytes: bytes) -> None

      Save the extracted figure to the output directory.



   .. py:method:: _figure_large_enough(figure_bytes: bytes) -> bool

      Check if the figure is larger than 50 KB.

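
The ``pattern_figure_descriptions`` value can be tried out directly with the standard ``re`` module. The sketch below assumes the value shown is the ``repr`` of the pattern, so it is written here as a raw string with single backslashes; ``sample_text`` is made up for illustration, and how ``_extract_figure_descriptions`` applies the pattern internally is not documented here.

.. code-block:: python

   import re

   # Pattern from ``pattern_figure_descriptions``, written as a raw string
   # (assumption: the documented value is the escaped repr of this pattern).
   PATTERN_FIGURE_DESCRIPTIONS = r"(?ms)(?<=\n)\s*Figure\s+\d+\.?\s*(.*?)(?=\n)"

   # Hypothetical extracted text; not taken from any real publication.
   sample_text = (
       "Results are summarised below.\n"
       "Figure 1. Cell viability after exposure.\n"
       "The assay was repeated three times.\n"
       "Figure 2 Dose response curve.\n"
   )

   # findall returns the single capture group: the caption text after the label.
   captions = re.findall(PATTERN_FIGURE_DESCRIPTIONS, sample_text)
   print(captions)
   # ['Cell viability after exposure.', 'Dose response curve.']

Two consequences follow from the pattern itself: a caption is only matched when a newline precedes it, so a caption on the very first line of the text is skipped, and the lazy group stops at the first newline, so only the first line of a multi-line caption is captured.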
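
The two text-quality checks, ``_is_too_short`` and ``_is_corrupted``, are described only by their docstrings and defaults. A minimal sketch of what such checks could look like follows. It assumes that "control characters" means Unicode category ``Cc`` excluding ordinary whitespace, that the length comparison is strict, and that empty text counts as not corrupted; the free functions ``is_too_short`` and ``is_corrupted`` are illustrative stand-ins, not the class's actual implementation.

.. code-block:: python

   import unicodedata

   # Assumption: newlines, carriage returns and tabs are legitimate in extracted text.
   ALLOWED_CONTROL_CHARS = {"\n", "\r", "\t"}


   def is_too_short(text: str, min_length: int = 1000) -> bool:
       """Return True when the text has fewer than ``min_length`` characters."""
       return len(text) < min_length


   def is_corrupted(text: str, max_corruption_ratio: float = 0.1) -> bool:
       """Return True when the share of control characters exceeds the ratio."""
       if not text:
           # Assumption: empty text is left to the length check.
           return False
       control_count = sum(
           1
           for ch in text
           if unicodedata.category(ch) == "Cc" and ch not in ALLOWED_CONTROL_CHARS
       )
       return control_count / len(text) > max_corruption_ratio


   print(is_too_short("short text"))        # True
   print(is_corrupted("clean\ntext"))       # False
   print(is_corrupted("\x00\x01ab"))        # True (2 of 4 characters)

Given the ``text_generation`` parameter and ``_extract_full_text_from_images``, checks like these plausibly decide when to fall back to image-based extraction, but that control flow is not stated in this reference.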
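
The margin fractions accepted by ``_extract_text_blocks_without_irrelevant_border_text`` suggest that blocks in the header, footer and side strips of a page are discarded. The sketch below shows one way such a filter could work without depending on PyMuPDF. ``Block`` and ``blocks_within_margins`` are hypothetical names, coordinates are assumed to have their origin at the top-left of the page, and a block is kept only if it lies entirely inside the margins; the real method may use a different overlap rule.

.. code-block:: python

   from typing import NamedTuple


   class Block(NamedTuple):
       """Hypothetical text block: bounding box plus text."""

       x0: float
       y0: float
       x1: float
       y1: float
       text: str


   def blocks_within_margins(
       blocks: list[Block],
       page_width: float,
       page_height: float,
       top_margin_frac: float = 0.07,
       bottom_margin_frac: float = 0.07,
       side_margin_frac: float = 0.02,
   ) -> list[Block]:
       """Keep blocks that lie entirely inside the page margins."""
       top = page_height * top_margin_frac
       bottom = page_height * (1 - bottom_margin_frac)
       left = page_width * side_margin_frac
       right = page_width * (1 - side_margin_frac)
       return [
           b
           for b in blocks
           if b.y0 >= top and b.y1 <= bottom and b.x0 >= left and b.x1 <= right
       ]


   # Made-up blocks on a 600 x 800 page: running header, body text, page number.
   page_blocks = [
       Block(50, 20, 550, 40, "Journal of Examples, Vol. 1"),
       Block(50, 100, 550, 700, "Body paragraph"),
       Block(290, 760, 310, 780, "12"),
   ]
   kept = blocks_within_margins(page_blocks, page_width=600, page_height=800)
   print([b.text for b in kept])
   # ['Body paragraph']

With the default fractions, the top and bottom 7 % of the page height and the outer 2 % of the width on each side are treated as border regions.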