aoptk.literature.pymupdf_parser

Classes

PymupdfParser

Parse PDFs using PyMuPDF.

Functions

_validate_pdf(→ bool)

Module Contents

aoptk.literature.pymupdf_parser._validate_pdf(pdf: aoptk.literature.pdf.PDF) → bool[source]

class aoptk.literature.pymupdf_parser.PymupdfParser(pdfs: list[aoptk.literature.pdf.PDF], figure_storage: pathlib.Path = Path('tests/figure_storage'), text_generation: aoptk.text_generation_api.TextGenerationAPI | None = None)[source]

Bases: aoptk.literature.pdf_parser.PDFParser

Parse PDFs using PyMuPDF.

unified_image_format = 'png'[source]

figure_storage[source]

pdfs[source]

pattern_figure_descriptions = '(?ms)(?<=\\n)\\s*Figure\\s+\\d+\\.?\\s*(.*?)(?=\\n)'[source]

pattern_any_character = '(.*)'[source]

text_generation = None[source]

get_publications(download_figures_enabled: bool = True) → list[aoptk.literature.publication.Publication][source]

Get a list of publications.

Parameters:

download_figures_enabled (bool) – Whether to download figures and
include their paths in the Publication objects.

Returns:

A list of Publication objects.

Return type:

list[Publication]

get_abstracts() → list[aoptk.literature.abstract.Abstract][source]

Get abstracts from the PDFs.

Returns:

A list of abstracts obtained from the PDFs.

Return type:

list[Abstract]

_parse_pdf(pdf: aoptk.literature.pdf.PDF, download_figures_enabled: bool = True) → aoptk.literature.publication.Publication[source]

Parse a single PDF and return a Publication object.

_extract_abstract(pdf: aoptk.literature.pdf.PDF, publication_id: aoptk.literature.id.ID) → aoptk.literature.abstract.Abstract[source]

Extract the abstract from the text.

_extract_full_text(pdf: aoptk.literature.pdf.PDF) → str[source]

Extract the full text from the PDF.

Parameters:

pdf (PDF) – The PDF object to extract text from.

Returns:

The extracted full text from the PDF.

Return type:

str

_is_too_short(text: str, min_length: int = 1000) → bool[source]

Check if the text is too short to be a valid full text.

Parameters:

text (str) – The text to check.
min_length (int) – The minimum length of valid full text.

Returns:

True if the text is too short, False otherwise.

Return type:

bool

_is_corrupted(text: str, max_corruption_ratio: float = 0.1) → bool[source]

Check if the text is corrupted based on the ratio of control characters.

Parameters:

text (str) – The text to check.
max_corruption_ratio (float) – The maximum allowed ratio of corrupted characters.

Returns:

True if the text is corrupted, False otherwise.

Return type:

bool
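The check described above can be read as a simple ratio test. The following is a minimal sketch under stated assumptions, not the library's implementation: the helper names, the use of Unicode category Cc to identify control characters, and the choice to exempt newline, carriage return, and tab are all illustrative.

```python
import unicodedata

# Assumption: ordinary layout whitespace does not count as corruption.
ALLOWED_CONTROL_CHARS = {"\n", "\r", "\t"}


def control_char_ratio(text: str) -> float:
    """Return the fraction of characters that are unexpected control characters."""
    if not text:
        return 0.0
    bad = sum(
        1
        for ch in text
        if unicodedata.category(ch) == "Cc" and ch not in ALLOWED_CONTROL_CHARS
    )
    return bad / len(text)


def is_corrupted(text: str, max_corruption_ratio: float = 0.1) -> bool:
    """Return True when the control-character ratio exceeds the allowed maximum."""
    return control_char_ratio(text) > max_corruption_ratio
```

Whether the comparison is strict (as here) or inclusive is also an assumption; the documented default of 0.1 is taken from the signature above.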

_extract_pdf_as_images(pdf: aoptk.literature.pdf.PDF) → list[str][source]

Extract each page of the PDF as an image and return a list of base64-encoded images.

Parameters:

pdf (PDF) – The PDF object to extract images from.

Returns:

A list of base64-encoded image strings.

Return type:

list[str]
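The returned strings can be handled like any other base64 payload. The sketch below uses only the standard library and assumes the page images are PNG bytes, in line with unified_image_format = 'png'. The helper names and the placeholder bytes are hypothetical, and the page rendering step itself is not shown.

```python
import base64

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def encode_image(image_bytes: bytes) -> str:
    """Encode raw image bytes as an ASCII base64 string."""
    return base64.b64encode(image_bytes).decode("ascii")


def decode_image(encoded: str) -> bytes:
    """Decode a base64 string back into raw image bytes."""
    return base64.b64decode(encoded)


# Placeholder bytes standing in for a rendered page; not a complete PNG file.
fake_page = PNG_SIGNATURE + b"placeholder-pixel-data"
encoded = encode_image(fake_page)
```

Decoding a string and checking for the PNG signature is a quick way to confirm that a payload round-trips before it is sent to a text generation service.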

_extract_full_text_from_images(pdf_as_images: list[str]) → str[source]

Extract text from a list of base64-encoded images using the TextGenerationAPI.

Parameters:

pdf_as_images (list[str]) – A list of base64-encoded image strings.

Returns:

The extracted full text from the images.

Return type:

str

_extract_text_blocks_without_irrelevant_border_text(pages: collections.abc.Iterable[tuple[int, pymupdf.Page]], top_margin_frac: float = 0.07, bottom_margin_frac: float = 0.07, side_margin_frac: float = 0.02) → list[tuple[int, int, float, float, float, float, str]][source]

Collect text blocks from pages within margin bounds.
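To illustrate what filtering by margin fractions could look like, here is a pure-Python sketch that does not depend on PyMuPDF. The Block type, the function name, and the rule that a block must lie entirely inside the margins are assumptions made for illustration; only the default fractions come from the signature above. Coordinates are assumed to have their origin at the top-left corner with y increasing downward.

```python
from typing import NamedTuple


class Block(NamedTuple):
    """Hypothetical text block: bounding box plus text."""

    x0: float
    y0: float
    x1: float
    y1: float
    text: str


def blocks_within_margins(
    blocks: list[Block],
    page_width: float,
    page_height: float,
    top_margin_frac: float = 0.07,
    bottom_margin_frac: float = 0.07,
    side_margin_frac: float = 0.02,
) -> list[Block]:
    """Keep only blocks that lie fully inside the page area bounded by the margins."""
    top = page_height * top_margin_frac
    bottom = page_height * (1 - bottom_margin_frac)
    left = page_width * side_margin_frac
    right = page_width * (1 - side_margin_frac)
    return [
        b
        for b in blocks
        if b.y0 >= top and b.y1 <= bottom and b.x0 >= left and b.x1 <= right
    ]


page_blocks = [
    Block(50, 20, 550, 40, "Running header"),
    Block(50, 100, 550, 700, "Body"),
    Block(50, 760, 550, 780, "Page 3"),
    Block(2, 300, 10, 500, "Side watermark"),
]
kept = blocks_within_margins(page_blocks, page_width=600, page_height=800)
```

On a 600 by 800 page, the defaults exclude roughly the top and bottom 56 units and the outer 12 units on each side, which is where running headers, page numbers, and side watermarks usually sit.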

_extract_text_to_parse(pdf: aoptk.literature.pdf.PDF) → str[source]

Extract text to parse from the PDF.

_clean_control_chars(text: str) → str[source]

Remove unwanted control characters.
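Which characters count as unwanted is not documented here. As a hedged sketch, one common choice is to strip the ASCII control characters while keeping tab, newline, and carriage return; the function name and the character set below are assumptions.

```python
import re

# Assumption: remove C0 control characters and DEL, but keep \t, \n, and \r.
_CONTROL_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]")


def clean_control_chars(text: str) -> str:
    """Return the text with unwanted control characters removed."""
    return _CONTROL_CHARS.sub("", text)
```

Keeping newlines matters for any later step that relies on line boundaries, such as a line-anchored caption pattern.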

_extract_figure_descriptions(text: str) → list[str][source]

Extract figure descriptions from the text.
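The behaviour of the pattern_figure_descriptions attribute listed above can be explored directly with the re module. This sketch applies the documented pattern value to a made-up sample with re.findall; how the parser itself applies the pattern is not shown in this reference, so the call and the sample text are assumptions.

```python
import re

# Value copied from pattern_figure_descriptions, written as a raw string.
PATTERN_FIGURE_DESCRIPTIONS = r"(?ms)(?<=\n)\s*Figure\s+\d+\.?\s*(.*?)(?=\n)"

sample = (
    "Intro text.\n"
    "Figure 1. Overview of the pipeline.\n"
    "More body text.\n"
    "Figure 2 Results by cohort.\n"
    "End.\n"
)

# findall returns the contents of the single capture group for each match.
captions = re.findall(PATTERN_FIGURE_DESCRIPTIONS, sample)
print(captions)
```

Two properties follow from the pattern itself: the lookbehind requires a newline before the caption, so a caption on the very first line of the text is not matched, and the lazy group stops at the first following newline, so only the first line of a multi-line caption is captured.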

_extract_figures(pdf: aoptk.literature.pdf.PDF) → list[pathlib.Path][source]

Extract figures from the PDF and save them to the output directory.

_save_figure(output_dir: pathlib.Path, figure_count: int, base_figure: dict, figure_bytes: bytes) → None[source]

Save the extracted figure to the output directory.

_figure_large_enough(figure_bytes: bytes) → bool[source]

Check if the figure is larger than 50 KB.
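A size filter of this kind can be sketched in a few lines. The constant name, the function name, and the interpretation of 50 KB as 50 * 1024 bytes are assumptions; the library may use a different definition.

```python
# Assumption: 1 KB is taken as 1024 bytes.
FIGURE_SIZE_THRESHOLD_BYTES = 50 * 1024


def figure_large_enough(figure_bytes: bytes) -> bool:
    """Return True when the encoded figure is strictly larger than the threshold."""
    return len(figure_bytes) > FIGURE_SIZE_THRESHOLD_BYTES
```

A byte-size threshold is a cheap way to skip small embedded images such as logos and icons, at the cost of occasionally dropping a small but genuine figure.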