aoptk.literature.databases.europepmc

Classes

EuropePMC

Class to get data from Europe PMC based on a query.

Functions

_get_publication_id(→ aoptk.literature.id.ID)

Extract the publication ID from the API result, checking for 'pmcid', 'pmid', and 'id' in order.

Module Contents

class aoptk.literature.databases.europepmc.EuropePMC(storage: pathlib.Path, figure_storage: pathlib.Path, query: aoptk.literature.query.Query | None = None)[source]

Bases: aoptk.literature.get_abstract.GetAbstract, aoptk.literature.get_pdf.GetPDF, aoptk.literature.get_id.GetID, aoptk.literature.get_publication.GetPublication, aoptk.literature.get_metadata.GetMetadata

Class to get data from Europe PMC based on a query.

page_size = 1000[source]

timeout = 30[source]

headers: ClassVar[source]

image_extensions = ('.jpg', '.jpeg', '.png', '.bmp', '.tiff', '.tif')[source]

unified_image_format = 'png'[source]

search_term[source]

storage[source]

figure_storage[source]

_session[source]

retry_strategy[source]

adapter[source]

build_search_term(query: aoptk.literature.query.Query) → str[source]: Convert Query to Europe PMC search syntax.

update_retry_strategy(strategy: urllib3.util.retry.Retry) → None[source]

Update the retry strategy, allowing retry behaviour to be customized.

This function updates the adapter and the session to ensure the new retry strategy is used for future requests.

Parameters:: strategy (Retry) – Strategy to use.
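
As a rough illustration of what updating "the adapter and the session" usually involves with `requests`, the sketch below wraps a new `Retry` in an `HTTPAdapter` and mounts it on a session. The helper name `mount_retry_strategy` and the retry values are assumptions made for illustration; this is not the library's own implementation.

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry


def mount_retry_strategy(session: requests.Session, strategy: Retry) -> HTTPAdapter:
    """Mount a fresh adapter carrying `strategy` on both URL schemes (hypothetical helper)."""
    adapter = HTTPAdapter(max_retries=strategy)
    session.mount("https://", adapter)
    session.mount("http://", adapter)
    return adapter


# Illustrative values only: retry up to 5 times on common transient status codes.
strategy = Retry(total=5, backoff_factor=1.0, status_forcelist=[429, 500, 502, 503, 504])
session = requests.Session()
adapter = mount_retry_strategy(session, strategy)
```

With the class itself, the equivalent call would presumably be `client.update_retry_strategy(strategy)` on an existing `EuropePMC` instance.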

_get_license_filter(licensing: str) → str[source]

Get the license filter string for a given licensing type.

Parameters:: licensing (str) – The licensing type.
Returns:: The license filter string for Europe PMC search.
Return type:: str

get_pdfs(ids: list[aoptk.literature.id.ID]) → list[aoptk.literature.pdf.PDF][source]: Retrieve PDFs.

get_abstracts(ids: list[aoptk.literature.id.ID]) → list[aoptk.literature.abstract.Abstract][source]: Retrieve Abstracts.

get_publications(ids: list[aoptk.literature.id.ID], download_figures_enabled: bool = True) → list[aoptk.literature.publication.Publication][source]

Retrieve Publications.

Parameters:

ids (list[ID]) – A list of publication IDs to retrieve.
download_figures_enabled (bool) – Whether to download figures and include their paths in the Publication object.

get_publications_metadata(ids: list[aoptk.literature.id.ID]) → list[aoptk.literature.metadata.Metadata][source]: Retrieve Publication metadata.

get_ids() → list[aoptk.literature.id.ID][source]: Get a list of publication IDs from Europe PMC based on the search term.

_get_pdf(publication_id: aoptk.literature.id.ID) → aoptk.literature.pdf.PDF[source]

Retrieve the PDF for a given publication ID.

Parameters:: publication_id (ID) – The ID of the publication for which to retrieve the PDF.
Returns:: The PDF object if successful, None otherwise.
Return type:: PDF | None

_write_pdf(publication_id: aoptk.literature.id.ID, response: requests.Response) → aoptk.literature.pdf.PDF[source]

Write the PDF content to a file and return a PDF object.

Parameters:

publication_id (ID) – The ID of the publication for which the PDF is being written.
response (requests.Response) – The HTTP response containing the PDF content.

_get_abstract(publication_id: aoptk.literature.id.ID) → aoptk.literature.abstract.Abstract | None[source]

Return abstract from Europe PMC for a given publication ID.

Parameters:: publication_id (ID) – The ID of the publication for which to retrieve the abstract.
Returns:: The abstract object if successful, None otherwise.
Return type:: Abstract | None

_call_api(cursor_mark: str, result_type: str, query: str | aoptk.literature.id.ID) → dict[source]

Call the Europe PMC web API to run the search.

Parameters:

cursor_mark (str) – Cursor mark used for pagination.
result_type (str) – The result type to request, either idlist or core.
query (str | ID) – Main query to carry out (default: self._query).

Returns:

JSON response

Return type:

dict
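
The cursor-based pagination that `cursor_mark` refers to can be sketched as follows. This assumes the public Europe PMC REST search endpoint and its `cursorMark` / `nextCursorMark` convention; the helper names `build_params` and `iterate_results` are hypothetical, and the HTTP call is replaced by an injected function so the sketch runs offline.

```python
from typing import Callable, Iterator

# Assumed public Europe PMC REST search endpoint.
SEARCH_URL = "https://www.ebi.ac.uk/europepmc/webservices/rest/search"


def build_params(query: str, result_type: str, cursor_mark: str = "*", page_size: int = 1000) -> dict:
    """Assemble the query parameters for one page of results."""
    return {
        "query": query,
        "resultType": result_type,
        "cursorMark": cursor_mark,
        "pageSize": page_size,
        "format": "json",
    }


def iterate_results(fetch: Callable[[dict], dict], query: str, result_type: str = "idlist") -> Iterator[dict]:
    """Yield every result, following nextCursorMark until it stops advancing."""
    cursor_mark = "*"
    while True:
        page = fetch(build_params(query, result_type, cursor_mark))
        results = page.get("resultList", {}).get("result", [])
        yield from results
        next_mark = page.get("nextCursorMark")
        if not results or not next_mark or next_mark == cursor_mark:
            break
        cursor_mark = next_mark


# Stand-in for a real HTTP call such as requests.get(SEARCH_URL, params=params).json().
fake_pages = {
    "*": {"nextCursorMark": "abc", "resultList": {"result": [{"id": "1"}, {"id": "2"}]}},
    "abc": {"nextCursorMark": "abc", "resultList": {"result": [{"id": "3"}]}},
}
collected = list(iterate_results(lambda params: fake_pages[params["cursorMark"]], "liver fibrosis"))
```

Injecting `fetch` keeps the pagination logic separate from the transport, so the same loop can be exercised in tests without network access.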

_get_publication_metadata(publication_id: aoptk.literature.id.ID) → aoptk.literature.metadata.Metadata | None[source]

Return metadata from Europe PMC for a given publication ID.

Parameters:: publication_id (ID) – The ID of the publication to retrieve metadata for.

_get_publication(publication_id: aoptk.literature.id.ID, download_figures_enabled: bool = True) → aoptk.literature.publication.Publication | None[source]

Return a Publication object for a given publication ID.

Parameters:

publication_id (ID) – The ID of the publication to retrieve.
download_figures_enabled (bool) – Whether to download figures and include their paths in the Publication object.

_parse_xml_abstract(root: xml.etree.ElementTree.Element) → str[source]

Return the full text content of the first <abstract> element as a single string.

Parameters:: root (ET.Element) – The root element of the XML tree.
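
A minimal sketch of this kind of extraction with the standard library is shown below. The function name `abstract_text`, the empty-string fallback, and the whitespace handling are assumptions; the actual method may differ in all three.

```python
import xml.etree.ElementTree as ET


def abstract_text(root: ET.Element) -> str:
    """Return the text of the first <abstract> descendant as one string."""
    abstract = root.find(".//abstract")
    if abstract is None:
        return ""
    # itertext() walks nested elements such as <p> or <italic> in document order.
    # Joining with a space keeps paragraphs apart, at the cost of also
    # separating inline markup such as H<sub>2</sub>O.
    return " ".join(" ".join(abstract.itertext()).split())


sample = ET.fromstring(
    "<article><front><abstract><p>Liver <italic>fibrosis</italic> is common.</p>"
    "<p>We review it.</p></abstract></front></article>"
)
text = abstract_text(sample)
```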

_parse_xml_full_text(root: xml.etree.ElementTree.Element) → str[source]

Parse the XML content to extract the full text.

Parameters:: root (ET.Element) – The root element of the XML tree.

_parse_xml_figure_descriptions(root: xml.etree.ElementTree.Element) → str[source]

Parse the XML content to extract the figure descriptions.

Parameters:: root (ET.Element) – The root element of the XML tree.

_parse_xml_tables(root: xml.etree.ElementTree.Element) → list[pandas.DataFrame][source]

Parse the XML content to extract tables as a list of DataFrames, preserving order.

Parameters:: root (ET.Element) – The root element of the XML tree.

_extract_rows(table_elem: xml.etree.ElementTree.Element) → list[list[str]][source]

Extract rows from a table element, preserving order.

Parameters:: table_elem (ET.Element) – The XML element representing the table.
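
Row extraction of this kind might look like the sketch below, which assumes a plain HTML-style table (`tr`, `th`, `td`) and ignores `rowspan` and `colspan`. The function name `extract_rows` is a hypothetical stand-in for the private method.

```python
import xml.etree.ElementTree as ET


def extract_rows(table_elem: ET.Element) -> list[list[str]]:
    """Collect cell text row by row, in document order (header rows included)."""
    rows = []
    for tr in table_elem.iter("tr"):
        cells = [
            " ".join("".join(cell.itertext()).split())
            for cell in tr
            if cell.tag in ("th", "td")
        ]
        rows.append(cells)
    return rows


table = ET.fromstring(
    "<table><thead><tr><th>Gene</th><th>Fold change</th></tr></thead>"
    "<tbody><tr><td>TP53</td><td>2.1</td></tr>"
    "<tr><td><italic>MYC</italic></td><td>0.8</td></tr></tbody></table>"
)
rows = extract_rows(table)
```

A list of rows in this shape can be passed to `pandas.DataFrame(rows[1:], columns=rows[0])` when the first row is a header.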

_get_xml(publication_id: aoptk.literature.id.ID) → xml.etree.ElementTree.Element | None[source]

Retrieve the XML root element for a given publication ID.

Parameters:: publication_id (ID) – The ID of the publication to retrieve XML for.

_get_figures(publication_id: aoptk.literature.id.ID) → list[pathlib.Path][source]

Retrieve the figure file paths for a given publication ID.

Parameters:: publication_id (ID) – The ID of the publication to retrieve figures for.

_get_supplementary_zip_path(publication_id: aoptk.literature.id.ID) → pathlib.Path | None[source]

Download the supplementary files ZIP for a given publication ID and return the path to the ZIP file.

Parameters:: publication_id (ID) – The ID of the publication to retrieve supplementary files for.
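
The documentation does not state how figures are located inside the downloaded archive. Assuming they are selected by file extension using the class's `image_extensions`, the filtering step could be sketched as below; `image_members` is a hypothetical helper and the archive is built in memory so the example runs offline.

```python
import io
import zipfile
from pathlib import Path

# Mirrors the documented `image_extensions` class attribute.
IMAGE_EXTENSIONS = (".jpg", ".jpeg", ".png", ".bmp", ".tiff", ".tif")


def image_members(archive: zipfile.ZipFile) -> list[str]:
    """Return the names of archive members whose suffix is an image extension."""
    return [
        name
        for name in archive.namelist()
        if Path(name).suffix.lower() in IMAGE_EXTENSIONS
    ]


# Build a small in-memory archive so the sketch runs without network access.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("PMC123/fig1.JPG", b"fake image bytes")
    zf.writestr("PMC123/table_s1.xlsx", b"fake spreadsheet bytes")
    zf.writestr("PMC123/fig2.tif", b"fake image bytes")

with zipfile.ZipFile(buffer) as zf:
    names = image_members(zf)
```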

aoptk.literature.databases.europepmc._get_publication_id(result: dict) → aoptk.literature.id.ID[source]

Extract the publication ID from the API result, checking for 'pmcid', 'pmid', and 'id' in order.

Parameters:: result (dict) – The API result containing publication information.
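
The lookup order described above can be sketched as follows. This version returns a plain string rather than an `aoptk.literature.id.ID`, and raising `KeyError` when no key is present is an assumption, since the documentation does not say what happens in that case. The name `pick_publication_id` is hypothetical.

```python
def pick_publication_id(result: dict) -> str:
    """Return the first non-empty identifier among 'pmcid', 'pmid' and 'id'."""
    for key in ("pmcid", "pmid", "id"):
        value = result.get(key)
        if value:
            return str(value)
    # Assumed behaviour: fail loudly when the result carries no identifier.
    raise KeyError("result contains none of 'pmcid', 'pmid', 'id'")


preferred = pick_publication_id({"id": "999", "pmid": "123", "pmcid": "PMC456"})
fallback = pick_publication_id({"id": "999", "pmid": "123"})
```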