aoptk.literature.pymupdf_parser
Classes
PymupdfParser | Parse PDFs using PyMuPDF.
Module Contents
- class aoptk.literature.pymupdf_parser.PymupdfParser(pdfs: list[aoptk.literature.pdf.PDF], figure_storage: str = 'tests/figure_storage', text_generation: aoptk.text_generation_api.TextGenerationAPI | None = None)
Bases: aoptk.literature.pdf_parser.PDFParser
Parse PDFs using PyMuPDF.
- get_publications() -> list[aoptk.literature.publication.Publication]
Get a list of publications.
- get_abstracts() -> list[aoptk.literature.abstract.Abstract]
Get abstracts from the PDFs.
- _parse_pdf(pdf: aoptk.literature.pdf.PDF) -> aoptk.literature.publication.Publication
Parse a single PDF and return a Publication object.
- _extract_abstract(pdf: aoptk.literature.pdf.PDF, publication_id: aoptk.literature.id.ID) -> aoptk.literature.abstract.Abstract
Extract the abstract from the text.
- _extract_full_text(pdf: aoptk.literature.pdf.PDF) -> str
Extract the full text from the PDF.
- _is_too_short(text: str, min_length: int = 1000) -> bool
Check if the text is too short to be a valid full text.
- _is_corrupted(text: str, max_corruption_ratio: float = 0.1) -> bool
Check if the text is corrupted based on the ratio of control characters.
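The `_is_too_short` and `_is_corrupted` checks can be approximated as follows. This is a minimal sketch, not the library's implementation: the function names are hypothetical stand-ins, and it assumes that "control characters" means Unicode category `Cc` excluding newline, carriage return, and tab, with the ratio taken over all characters in the text.

```python
import unicodedata

# Assumption: ordinary layout whitespace does not count as corruption.
ALLOWED_WHITESPACE = {"\n", "\r", "\t"}


def is_too_short(text: str, min_length: int = 1000) -> bool:
    """Return True when the text has fewer than min_length characters."""
    return len(text) < min_length


def is_corrupted(text: str, max_corruption_ratio: float = 0.1) -> bool:
    """Return True when the share of control characters exceeds the limit."""
    if not text:
        # Assumption: empty text is left to the length check.
        return False
    control_count = sum(
        1
        for ch in text
        if unicodedata.category(ch) == "Cc" and ch not in ALLOWED_WHITESPACE
    )
    return control_count / len(text) > max_corruption_ratio
```

A caller might treat a result that fails either check as unusable and fall back to another extraction path, such as the image-based methods documented below; whether the parser does exactly this is not stated here.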
- _extract_pdf_as_images(pdf: aoptk.literature.pdf.PDF) -> list[str]
Extract each page of the PDF as an image and return a list of base64-encoded images.
- _extract_full_text_from_images(pdf_as_images: list[str]) -> str
Extract text from a list of base64-encoded images using the TextGenerationAPI.
- _extract_text_blocks_without_irrelevant_border_text(pages: collections.abc.Iterable[tuple[int, pymupdf.Page]], top_margin_frac: float = 0.07, bottom_margin_frac: float = 0.07, side_margin_frac: float = 0.02) -> list[tuple[int, int, float, float, float, float, str]]
Collect text blocks from pages within margin bounds.
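The margin filtering behind `_extract_text_blocks_without_irrelevant_border_text` can be illustrated without PyMuPDF. The sketch below is an assumption about the general technique, not the method's actual code: `Block` and `filter_border_blocks` are hypothetical names, coordinates are taken to have their origin at the top-left of the page (as PyMuPDF does), and a block is kept only if it lies entirely inside the margin bounds. The meaning of each field in the documented return tuple is not stated above, so the sketch uses its own simplified block type.

```python
from typing import NamedTuple


class Block(NamedTuple):
    """Hypothetical text block: bounding box plus text, origin at top-left."""
    x0: float
    y0: float
    x1: float
    y1: float
    text: str


def filter_border_blocks(
    blocks: list[Block],
    page_width: float,
    page_height: float,
    top_margin_frac: float = 0.07,
    bottom_margin_frac: float = 0.07,
    side_margin_frac: float = 0.02,
) -> list[Block]:
    """Keep non-empty blocks that lie fully inside the margin bounds."""
    top = page_height * top_margin_frac
    bottom = page_height * (1 - bottom_margin_frac)
    left = page_width * side_margin_frac
    right = page_width * (1 - side_margin_frac)
    return [
        b
        for b in blocks
        if b.text.strip()
        and b.y0 >= top
        and b.y1 <= bottom
        and b.x0 >= left
        and b.x1 <= right
    ]


# Usage: a 600 x 800 page with a running header, body text, and a page number.
page_blocks = [
    Block(50, 10, 550, 30, "Journal header"),
    Block(50, 100, 550, 700, "Body text"),
    Block(50, 760, 550, 790, "Page 3"),
]
body_only = filter_border_blocks(page_blocks, page_width=600, page_height=800)
```

With the default fractions, the top and bottom 7% of the page and the outer 2% on each side are treated as border regions, which is where running headers, footers, and sidebar stamps usually sit.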
- _extract_text_to_parse(pdf: aoptk.literature.pdf.PDF) -> str
Extract text to parse from the PDF.
- _extract_figure_descriptions(text: str) -> list[str]
Extract figure descriptions from the text.
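How `_extract_figure_descriptions` identifies captions is not documented here. One plausible approach, shown purely as an illustration, is a regular expression that matches lines starting with a figure label. The pattern, the `FIGURE_CAPTION` constant, and the function name below are all assumptions.

```python
import re

# Assumption: a caption is a line that starts with "Figure N." or "Fig. N:".
FIGURE_CAPTION = re.compile(
    r"^(?:Figure|Fig\.?)[ \t]*\d+[.:][ \t]+.+$",
    re.IGNORECASE | re.MULTILINE,
)


def extract_figure_descriptions(text: str) -> list[str]:
    """Return every line of text that looks like a figure caption."""
    return [match.group(0).strip() for match in FIGURE_CAPTION.finditer(text)]


sample = (
    "Intro text.\n"
    "Figure 1. Dose response curve.\n"
    "As shown in Figure 1, the effect saturates.\n"
    "Fig. 2: Liver histology.\n"
)
captions = extract_figure_descriptions(sample)
```

Anchoring the pattern at the start of a line keeps in-text references such as "As shown in Figure 1" out of the result, at the cost of missing captions that wrap onto several lines.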
- _extract_figures(pdf: aoptk.literature.pdf.PDF) -> list[str]
Extract figures from the PDF and save them to the output directory.