aoptk.literature.pymupdf_parser

Classes

PymupdfParser

Parse PDFs using PyMuPDF.

Module Contents

class aoptk.literature.pymupdf_parser.PymupdfParser(pdfs: list[aoptk.literature.pdf.PDF], figure_storage: str = 'tests/figure_storage', text_generation: aoptk.text_generation_api.TextGenerationAPI | None = None)

Bases: aoptk.literature.pdf_parser.PDFParser

Parse PDFs using PyMuPDF.

figure_storage = 'tests/figure_storage'
pdfs
pattern_figure_descriptions = '(?ms)(?<=\\n)\\s*Figure\\s+\\d+\\.?\\s*(.*?)(?=\\n)'
pattern_any_character = '(.*)'
text_generation = None
get_publications() -> list[aoptk.literature.publication.Publication]

Get a list of publications.

get_abstracts() -> list[aoptk.literature.abstract.Abstract]

Get abstracts from the PDFs.

Returns:

List of abstracts obtained from the PDFs.

Return type:

list[Abstract]

_parse_pdf(pdf: aoptk.literature.pdf.PDF) -> aoptk.literature.publication.Publication

Parse a single PDF and return a Publication object.

_extract_abstract(pdf: aoptk.literature.pdf.PDF, publication_id: aoptk.literature.id.ID) -> aoptk.literature.abstract.Abstract

Extract the abstract from the text.

_extract_full_text(pdf: aoptk.literature.pdf.PDF) -> str

Extract the full text from the PDF.

Parameters:

pdf (PDF) – The PDF object to extract text from.

Returns:

The extracted full text from the PDF.

Return type:

str

_is_too_short(text: str, min_length: int = 1000) -> bool

Check if the text is too short to be a valid full text.

Parameters:
  • text (str) – The text to check.

  • min_length (int) – The minimum length of valid full text.

Returns:

True if the text is too short, False otherwise.

Return type:

bool
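The check described above can be sketched as a plain length comparison. This is a minimal illustration, not the library's code: the standalone name is_too_short is hypothetical, and it assumes that length is counted in characters and that a text of exactly min_length characters is accepted.

```python
def is_too_short(text: str, min_length: int = 1000) -> bool:
    """Return True when the text has fewer than min_length characters.

    Assumption: the bound is exclusive, so a text of exactly
    min_length characters counts as long enough.
    """
    return len(text) < min_length
```

A check like this is a cheap way to detect scanned PDFs whose text layer is missing or nearly empty before any further parsing is attempted.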

_is_corrupted(text: str, max_corruption_ratio: float = 0.1) -> bool

Check if the text is corrupted based on the ratio of control characters.

Parameters:
  • text (str) – The text to check.

  • max_corruption_ratio (float) – The maximum allowed ratio of corrupted characters.

Returns:

True if the text is corrupted, False otherwise.

Return type:

bool
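One way such a ratio test might look is sketched below. The function name is_corrupted, the use of the Unicode category "Cc" to identify control characters, the exemption of newline, carriage return and tab, and the treatment of empty text are all assumptions made for illustration; the documentation above only states that the decision is based on the ratio of control characters.

```python
import unicodedata

# Assumption: ordinary whitespace controls do not count as corruption.
ALLOWED_CONTROLS = {"\n", "\r", "\t"}


def is_corrupted(text: str, max_corruption_ratio: float = 0.1) -> bool:
    """Return True when control characters exceed the allowed share of the text."""
    if not text:
        # Assumption: empty text is left to the length check, not flagged here.
        return False
    control_count = sum(
        1
        for ch in text
        if unicodedata.category(ch) == "Cc" and ch not in ALLOWED_CONTROLS
    )
    return control_count / len(text) > max_corruption_ratio
```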

_extract_pdf_as_images(pdf: aoptk.literature.pdf.PDF) -> list[str]

Extract each page of the PDF as an image and return a list of base64-encoded images.

Parameters:

pdf (PDF) – The PDF object to extract images from.

Returns:

A list of base64-encoded image strings.

Return type:

list[str]

_extract_full_text_from_images(pdf_as_images: list[str]) -> str

Extract text from a list of base64-encoded images using the TextGenerationAPI.

Parameters:

pdf_as_images (list[str]) – A list of base64-encoded image strings.

Returns:

The extracted full text from the images.

Return type:

str

_extract_text_blocks_without_irrelevant_border_text(pages: collections.abc.Iterable[tuple[int, pymupdf.Page]], top_margin_frac: float = 0.07, bottom_margin_frac: float = 0.07, side_margin_frac: float = 0.02) -> list[tuple[int, int, float, float, float, float, str]]

Collect text blocks from pages within margin bounds.
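The margin test implied by the three *_margin_frac parameters can be illustrated without PyMuPDF. The sketch below is an assumption about how the fractions might be applied: the helper names within_margins and filter_blocks are hypothetical, blocks are simplified to (x0, y0, x1, y1, text) tuples, and a block is kept only when its whole bounding box lies inside the margins.

```python
Block = tuple[float, float, float, float, str]  # (x0, y0, x1, y1, text), simplified


def within_margins(
    bbox: tuple[float, float, float, float],
    page_width: float,
    page_height: float,
    top_margin_frac: float = 0.07,
    bottom_margin_frac: float = 0.07,
    side_margin_frac: float = 0.02,
) -> bool:
    """Return True when the bounding box lies fully inside the page margins."""
    x0, y0, x1, y1 = bbox
    top = page_height * top_margin_frac
    bottom = page_height * (1 - bottom_margin_frac)
    left = page_width * side_margin_frac
    right = page_width * (1 - side_margin_frac)
    return y0 >= top and y1 <= bottom and x0 >= left and x1 <= right


def filter_blocks(blocks: list[Block], page_width: float, page_height: float) -> list[Block]:
    """Drop blocks that touch the header, footer, or side margins."""
    return [b for b in blocks if within_margins(b[:4], page_width, page_height)]
```

With the default fractions on a 600 x 800 page, anything above y = 56 or below y = 744 is treated as header or footer text, which is where running titles and page numbers usually sit.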

_extract_text_to_parse(pdf: aoptk.literature.pdf.PDF) -> str

Extract text to parse from the PDF.

_clean_control_chars(text: str) -> str

Remove unwanted control characters.
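A minimal sketch of this kind of cleanup, assuming that "unwanted" means ASCII control characters other than tab, newline and carriage return. The exact set removed by the library is not documented here, and clean_control_chars is a hypothetical standalone name.

```python
import re

# Assumption: keep \t (0x09), \n (0x0A) and \r (0x0D); strip the other
# C0 control characters and DEL (0x7F).
_CONTROL_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]")


def clean_control_chars(text: str) -> str:
    """Return the text with unwanted control characters removed."""
    return _CONTROL_CHARS.sub("", text)
```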

_extract_figure_descriptions(text: str) -> list[str]

Extract figure descriptions from the text.
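The behaviour of the pattern_figure_descriptions attribute listed above can be tried on a small sample. The sketch below assumes the method applies the pattern with re.findall, which is not confirmed by this page; the function name and the sample text are illustrative only.

```python
import re

# The pattern from the class attribute, written as a raw string.
PATTERN_FIGURE_DESCRIPTIONS = r"(?ms)(?<=\n)\s*Figure\s+\d+\.?\s*(.*?)(?=\n)"


def extract_figure_descriptions(text: str) -> list[str]:
    """Return the caption text that follows each 'Figure N' at the start of a line."""
    return re.findall(PATTERN_FIGURE_DESCRIPTIONS, text)


sample = (
    "Intro text.\n"
    "Figure 1. Cell viability after exposure.\n"
    "More body text.\n"
    " Figure 2 Dose response curve.\n"
    "End."
)
print(extract_figure_descriptions(sample))
# → ['Cell viability after exposure.', 'Dose response curve.']
```

Because of the lookbehind on a newline, a caption on the very first line of the text is not matched, and the lazy group stops at the first line break, so only the first line of a multi-line caption is captured.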

_extract_figures(pdf: aoptk.literature.pdf.PDF) -> list[str]

Extract figures from the PDF and save them to the output directory.

_save_figure(output_dir: str, figure_count: int, base_figure: dict, figure_bytes: bytes) -> None

Save the extracted figure to the output directory.

_figure_large_enough(figure_bytes: bytes) -> bool

Check if the figure is larger than 50 KB.
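The size threshold can be sketched as a byte-count comparison. Whether the library treats 50 KB as 50 * 1024 or 50 * 1000 bytes is not stated; the sketch below assumes 1024 bytes per KB and uses hypothetical names.

```python
# Assumption: 1 KB = 1024 bytes.
MIN_FIGURE_BYTES = 50 * 1024


def figure_large_enough(figure_bytes: bytes) -> bool:
    """Return True when the encoded figure is strictly larger than the threshold."""
    return len(figure_bytes) > MIN_FIGURE_BYTES
```

A threshold like this is a simple way to skip logos, icons and other small decorative images embedded in a PDF.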