aoptk.literature.databases.europepmc

Classes

EuropePMC

Class to get PDFs, abstracts, IDs, publications, and publication metadata from Europe PMC based on a query.

Functions

_get_publication_id(→ str | None)

Extract the publication ID from the API result, checking for 'pmcid', 'pmid', and 'id' in order.

Module Contents

class aoptk.literature.databases.europepmc.EuropePMC(query: str, storage: str, figure_storage: str)

Bases: aoptk.literature.get_abstract.GetAbstract, aoptk.literature.get_pdf.GetPDF, aoptk.literature.get_id.GetID, aoptk.literature.get_publication.GetPublication, aoptk.literature.get_publication_metadata.GetPublicationMetadata

Class to get PDFs, abstracts, IDs, publications, and publication metadata from Europe PMC based on a query.

page_size = 1000
timeout = 10
headers: ClassVar
image_extensions = ('.jpg', '.jpeg', '.png', '.bmp', '.tiff')
_query
storage
figure_storage
_session
id_list = []
get_pdfs() → list[aoptk.literature.pdf.PDF]

Retrieve PDFs based on the query.

get_abstracts() → list[aoptk.literature.abstract.Abstract]

Retrieve Abstracts based on the query.

get_publications() → list[aoptk.literature.publication.Publication]

Retrieve Publications based on the query.

get_publications_metadata() → list[aoptk.literature.publication_metadata.PublicationMetadata]

Retrieve Publication metadata based on the query.

get_ids() → list[aoptk.literature.id.ID]

Get a list of publication IDs from Europe PMC based on the query.

remove_reviews() → EuropePMC

Modify the query to exclude review articles.

abstracts_only() → EuropePMC

Modify the query to search in the text of abstracts only.
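Both modifiers return the EuropePMC instance, so they can be chained before a retrieval call. The sketch below illustrates that pattern with a hypothetical stand-in class, QuerySketch, which is not part of aoptk. The query fragments it appends (NOT PUB_TYPE:"Review" and ABSTRACT:(...)) are assumptions based on Europe PMC search syntax and may differ from what aoptk actually emits.

```python
class QuerySketch:
    """Hypothetical stand-in for EuropePMC that only tracks the query string."""

    def __init__(self, query: str) -> None:
        self._query = query

    def remove_reviews(self) -> "QuerySketch":
        # Assumed Europe PMC syntax for excluding review articles.
        self._query = f'({self._query}) NOT PUB_TYPE:"Review"'
        return self  # returning self is what makes chaining possible

    def abstracts_only(self) -> "QuerySketch":
        # Assumed Europe PMC syntax for restricting the search to abstracts.
        self._query = f"ABSTRACT:({self._query})"
        return self


sketch = QuerySketch("liver fibrosis").abstracts_only().remove_reviews()
print(sketch._query)
```

Because each modifier rewrites the stored query in place, the order of the calls determines how the fragments nest.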

_get_pdf(publication_id: str) → aoptk.literature.pdf.PDF | None

Retrieve the PDF for a given publication ID.

Parameters:

publication_id (str) – The ID of the publication for which to retrieve the PDF.

Returns:

The PDF object if successful, None otherwise.

Return type:

PDF | None

_write_pdf(publication_id: str, response: requests.Response) → aoptk.literature.pdf.PDF

Write the PDF content to a file and return a PDF object.

Parameters:
  • publication_id (str) – The ID of the publication for which the PDF is being written.

  • response (requests.Response) – The HTTP response containing the PDF content.
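As a rough illustration of the write step, the sketch below saves raw bytes under a storage directory. The function name write_pdf_sketch, the <publication_id>.pdf file naming, and the returned pathlib.Path are all assumptions made for the example; the real method takes a requests.Response and returns an aoptk PDF object.

```python
from pathlib import Path


def write_pdf_sketch(publication_id: str, content: bytes, storage: str) -> Path:
    """Write raw PDF bytes to <storage>/<publication_id>.pdf and return the path."""
    target_dir = Path(storage)
    # Create the storage directory on first use.
    target_dir.mkdir(parents=True, exist_ok=True)
    path = target_dir / f"{publication_id}.pdf"
    # With requests, the bytes would come from response.content.
    path.write_bytes(content)
    return path
```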

_get_abstract(publication_id: str) → aoptk.literature.abstract.Abstract

Return abstract from Europe PMC for a given publication ID.

Parameters:

publication_id (str) – The ID of the publication for which to retrieve the abstract.

Returns:

The abstract object if successful, None otherwise.

Return type:

Abstract

_call_api(cursor_mark: str, result_type: str, query: str) → dict

Call the Europe PMC web API to run a search query.

Parameters:
  • cursor_mark (str) – Cursor mark used for pagination.

  • result_type (str) – Result type to request: idlist or core.

  • query (str) – Main query to carry out; defaults to self._query.

Returns:

JSON response

Return type:

dict
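Cursor-based pagination of this kind can be sketched as a loop that keeps requesting pages until the cursor mark stops advancing. The helper iterate_results and the fake_pages data are hypothetical; the response keys (nextCursorMark, resultList, result) and the initial cursor mark "*" are assumed from the Europe PMC REST search response, and the sketch does not show how aoptk itself drives _call_api.

```python
from collections.abc import Callable, Iterator


def iterate_results(call_api: Callable[[str], dict]) -> Iterator[dict]:
    """Yield every result by following cursor marks until they stop advancing."""
    cursor_mark = "*"  # assumed marker for the first page
    while True:
        page = call_api(cursor_mark)
        yield from page.get("resultList", {}).get("result", [])
        next_mark = page.get("nextCursorMark")
        # Stop when there is no further page or the cursor did not move.
        if not next_mark or next_mark == cursor_mark:
            break
        cursor_mark = next_mark


# Fake two-page response standing in for the real web service.
fake_pages = {
    "*": {"nextCursorMark": "abc", "resultList": {"result": [{"id": "1"}, {"id": "2"}]}},
    "abc": {"resultList": {"result": [{"id": "3"}]}},
}
ids = [r["id"] for r in iterate_results(lambda mark: fake_pages[mark])]
```

Passing the API call in as a function keeps the loop testable without network access.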

_get_publication_metadata(publication_id: str) → aoptk.literature.publication_metadata.PublicationMetadata | None

Return publication metadata from Europe PMC for a given publication ID.

Parameters:

publication_id (str) – The ID of the publication to retrieve metadata for.

_get_publication(publication_id: str) → aoptk.literature.publication.Publication | None

Return a Publication object for a given publication ID.

Parameters:

publication_id (str) – The ID of the publication to retrieve.

_parse_xml_abstract(root: xml.etree.ElementTree.Element) → str

Return the full text content of the first <abstract> element as a single string.

Parameters:

root (ET.Element) – The root element of the XML tree.
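One way to collect the text of the first <abstract> element with the standard library is shown below. This is a minimal sketch, assuming namespace-free XML; parse_abstract_sketch is a hypothetical name, and returning an empty string when no abstract exists is a choice made for the example, not documented behavior.

```python
import xml.etree.ElementTree as ET


def parse_abstract_sketch(root: ET.Element) -> str:
    """Return the text of the first <abstract> element, or "" if there is none."""
    abstract = root.find(".//abstract")
    if abstract is None:
        return ""
    # itertext() walks nested elements such as <p> or <italic> in document order;
    # split/join collapses runs of whitespace into single spaces.
    return " ".join("".join(abstract.itertext()).split())


root = ET.fromstring(
    "<article><front><abstract>"
    "<p>Liver <italic>fibrosis</italic> is common.</p>"
    "</abstract></front></article>"
)
text = parse_abstract_sketch(root)
```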

_parse_xml_full_text(root: xml.etree.ElementTree.Element) → str

Parse the XML content to extract the full text.

Parameters:

root (ET.Element) – The root element of the XML tree.

_parse_xml_figure_descriptions(root: xml.etree.ElementTree.Element) → str

Parse the XML content to extract the figure descriptions.

Parameters:

root (ET.Element) – The root element of the XML tree.
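A minimal sketch of figure-description extraction follows. It assumes JATS-style markup in which each <fig> element carries a <label> and a <caption>; the function name and the one-line-per-figure output format are hypothetical and may not match the aoptk implementation.

```python
import xml.etree.ElementTree as ET


def figure_descriptions_sketch(root: ET.Element) -> str:
    """Join the label and caption text of every <fig> element, one per line."""
    lines = []
    for fig in root.iter("fig"):
        label = fig.find("label")
        caption = fig.find("caption")
        label_text = "".join(label.itertext()) if label is not None else ""
        caption_text = "".join(caption.itertext()) if caption is not None else ""
        # Normalise whitespace so a missing label or caption leaves no stray spaces.
        lines.append(" ".join(f"{label_text} {caption_text}".split()))
    return "\n".join(lines)


doc = ET.fromstring(
    "<body>"
    "<fig><label>Figure 1</label><caption><p>Study design.</p></caption></fig>"
    "<fig><label>Figure 2</label><caption><p>Results.</p></caption></fig>"
    "</body>"
)
descriptions = figure_descriptions_sketch(doc)
```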

_parse_xml_tables(root: xml.etree.ElementTree.Element) → list[pandas.DataFrame]

Parse the XML content to extract tables as a list of DataFrames, preserving order.

Parameters:

root (ET.Element) – The root element of the XML tree.

_extract_rows(table_elem: xml.etree.ElementTree.Element) → list[list[str]]

Extract rows from a table element, preserving order.

Parameters:

table_elem (ET.Element) – The XML element representing the table.
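Row extraction in document order might look like the sketch below. It assumes HTML-like <tr>, <th>, and <td> tags inside the table element and ignores rowspan and colspan; extract_rows_sketch is a hypothetical name, not the aoptk method.

```python
import xml.etree.ElementTree as ET


def extract_rows_sketch(table_elem: ET.Element) -> list[list[str]]:
    """Collect the cell text of every <tr>, in document order."""
    rows = []
    for tr in table_elem.iter("tr"):
        cells = [
            " ".join("".join(cell.itertext()).split())
            for cell in tr
            if cell.tag in ("th", "td")
        ]
        rows.append(cells)
    return rows


table = ET.fromstring(
    "<table><thead><tr><th>Gene</th><th>Fold change</th></tr></thead>"
    "<tbody><tr><td>TP53</td><td>2.1</td></tr></tbody></table>"
)
rows = extract_rows_sketch(table)
```

A list of rows in this shape could then be handed to pandas.DataFrame, with the first row used as the header, to obtain the DataFrames that _parse_xml_tables returns.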

_get_xml(publication_id: str) → str | None

Retrieve the XML content for a given publication ID.

Parameters:

publication_id (str) – The ID of the publication to retrieve XML for.

_get_figures(publication_id: str) → list[str]

Retrieve the figure file paths for a given publication ID.

Parameters:

publication_id (str) – The ID of the publication to retrieve figures for.

_get_supplementary_zip_path(publication_id: str) → str | None

Download the supplementary files ZIP for a given publication ID and return the path to the ZIP file.

Parameters:

publication_id (str) – The ID of the publication to retrieve supplementary files for.

aoptk.literature.databases.europepmc._get_publication_id(result: dict) → str | None

Extract the publication ID from the API result, checking for 'pmcid', 'pmid', and 'id' in order.

Parameters:

result (dict) – The API result containing publication information.
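The ordered key lookup described above can be sketched as follows. The function name get_publication_id_sketch is hypothetical, and treating empty values as missing is an assumption made for the example.

```python
from typing import Optional


def get_publication_id_sketch(result: dict) -> Optional[str]:
    """Return the first non-empty ID among 'pmcid', 'pmid', and 'id'."""
    for key in ("pmcid", "pmid", "id"):
        value = result.get(key)
        if value:  # assumption: empty strings and None count as missing
            return str(value)
    return None
```

Checking 'pmcid' first means a record that carries both a PMC ID and a PubMed ID resolves to the PMC ID.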