aoptk.literature.databases.pmc

Classes

PMC

Class to get data from PMC based on a query.

Module Contents

class aoptk.literature.databases.pmc.PMC(storage: pathlib.Path, figure_storage: pathlib.Path, query: aoptk.literature.query.Query | None = None)

Bases: aoptk.literature.get_publication.GetPublication, aoptk.literature.get_pdf.GetPDF, aoptk.literature.get_id.GetID, aoptk.literature.get_abstract.GetAbstract, aoptk.literature.get_metadata.GetMetadata

Class to get data from PMC based on a query.

aws_region = 'us-east-1'

s3

bucket = 'pmc-oa-opendata'

paginator
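
The aws_region, s3, bucket, and paginator attributes suggest that files are listed from the pmc-oa-opendata bucket page by page. A minimal sketch of walking such a paginator, assuming it follows the boto3 list_objects_v2 response shape ("Contents" entries with a "Key"). The helper list_keys, the FakePaginator stand-in, and the key layout are hypothetical and exist only to keep the example runnable offline:

```python
from collections.abc import Iterable, Iterator
from typing import Any


def list_keys(paginator: Any, bucket: str, prefix: str) -> Iterator[str]:
    """Yield every object key under `prefix`, page by page."""
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # Pages without matching objects carry no "Contents" entry.
        for obj in page.get("Contents", []):
            yield obj["Key"]


class FakePaginator:
    """Hypothetical offline stand-in for a list_objects_v2 paginator."""

    def __init__(self, pages: Iterable[dict[str, Any]]) -> None:
        self._pages = list(pages)

    def paginate(self, **kwargs: Any) -> Iterator[dict[str, Any]]:
        prefix = kwargs.get("Prefix", "")
        for page in self._pages:
            contents = [o for o in page.get("Contents", []) if o["Key"].startswith(prefix)]
            yield {"Contents": contents} if contents else {}


# Hypothetical key layout; the real bucket layout may differ.
pages = [
    {"Contents": [{"Key": "PMC123/PMC123.xml"}, {"Key": "PMC123/fig1.jpg"}]},
    {"Contents": [{"Key": "PMC999/PMC999.xml"}]},
]
keys = list(list_keys(FakePaginator(pages), "pmc-oa-opendata", "PMC123/"))
```

With boto3, a real paginator is usually obtained from client.get_paginator("list_objects_v2"); whether PMC builds its paginator attribute this way is an assumption.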

image_extensions = ('.jpg', '.jpeg', '.png', '.gif', '.bmp', '.tiff', '.tif')

unified_image_format = 'png'
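
One way these two attributes could work together is to recognise image files by suffix and derive a file name in the unified format. The sketch below only handles names; the helpers is_image and unified_name are hypothetical, and the actual pixel conversion is not shown:

```python
from pathlib import Path

IMAGE_EXTENSIONS = (".jpg", ".jpeg", ".png", ".gif", ".bmp", ".tiff", ".tif")
UNIFIED_IMAGE_FORMAT = "png"


def is_image(name: str) -> bool:
    """Return True when the file name ends with a known image extension."""
    # str.endswith accepts a tuple, so one call checks every extension.
    return name.lower().endswith(IMAGE_EXTENSIONS)


def unified_name(name: str) -> str:
    """Return the file name with its suffix replaced by the unified format."""
    return Path(name).with_suffix(f".{UNIFIED_IMAGE_FORMAT}").name


names = ["fig1.JPG", "table1.csv", "scan.tiff"]
converted = [unified_name(n) for n in names if is_image(n)]
```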

search_term

_ncbi

storage

figure_storage

build_search_term(query: aoptk.literature.query.Query) → str: Convert Query to PMC search syntax.
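
The fields of Query are not documented on this page, so the following is only a sketch of what converting a query to PMC search syntax might look like. SketchQuery and its fields are hypothetical stand-ins, not the real aoptk Query:

```python
from dataclasses import dataclass


@dataclass
class SketchQuery:
    """Hypothetical stand-in; the real aoptk Query fields may differ."""

    search_term: str
    filters: tuple[str, ...] = ()


def build_search_term(query: SketchQuery) -> str:
    """Join the free-text term and each filter with AND."""
    # Parentheses keep a multi-word term together as one clause.
    parts = [f"({query.search_term})"]
    parts += [f"{name}[filter]" for name in query.filters]
    return " AND ".join(parts)


term = build_search_term(SketchQuery("liver fibrosis", ("open access",)))
```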

_get_license_filter(licensing: str) → str

Get the license filter string for a given licensing type.

Parameters:: licensing (str) – The licensing type.
Returns:: The license filter string for PMC search.
Return type:: str
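
The accepted licensing values are not listed here. A minimal sketch of a lookup-based filter, assuming hypothetical keys and PMC filter strings that should be checked against the PMC search documentation before use:

```python
# Hypothetical licensing keys mapped to assumed PMC filter strings.
LICENSE_FILTERS = {
    "open_access": "open access[filter]",
    "cc_by": "cc by license[filter]",
    "cc0": "cc0 license[filter]",
}


def get_license_filter(licensing: str) -> str:
    """Return the search filter for a licensing type, or raise ValueError."""
    try:
        return LICENSE_FILTERS[licensing]
    except KeyError:
        raise ValueError(f"Unsupported licensing type: {licensing!r}") from None


license_filter = get_license_filter("cc_by")
```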

get_pdfs(ids: list[aoptk.literature.id.ID]) → list[aoptk.literature.pdf.PDF]

Retrieve PDFs.

Returns:: A list of PDF objects.
Return type:: list[PDF]

get_publications(ids: list[aoptk.literature.id.ID], download_figures_enabled: bool = True) → list[aoptk.literature.publication.Publication]

Get a list of publications.

Parameters:

ids (list[ID]) – A list of publication IDs to retrieve.
download_figures_enabled (bool) – Whether to download figures and include their paths in the Publication objects.

Returns:

A list of Publication objects.

Return type:

list[Publication]

get_ids() → list[aoptk.literature.id.ID]: Retrieve a list of publication IDs based on the search term.
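
How the class talks to NCBI is not documented on this page. As an illustration only, ID retrieval against the NCBI E-utilities ESearch endpoint is typically paged with retstart and retmax; the helper esearch_urls below is hypothetical and just builds the request URLs without sending them:

```python
from urllib.parse import parse_qs, urlencode, urlsplit

ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def esearch_urls(term: str, total: int, page_size: int = 100) -> list[str]:
    """Build one ESearch URL per page of `page_size` results."""
    urls = []
    for start in range(0, total, page_size):
        params = {
            "db": "pmc",
            "term": term,
            "retstart": start,  # Offset of the first ID on this page.
            "retmax": page_size,  # Maximum number of IDs per page.
            "retmode": "json",
        }
        urls.append(f"{ESEARCH_URL}?{urlencode(params)}")
    return urls


urls = esearch_urls("open access[filter]", total=250)
starts = [parse_qs(urlsplit(u).query)["retstart"][0] for u in urls]
```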

get_abstracts(ids: list[aoptk.literature.id.ID]) → list[aoptk.literature.abstract.Abstract]: Retrieve Abstracts based on the list of IDs.

_parse_pmc_abstract_records(records: list[Any]) → list[aoptk.literature.abstract.Abstract]

Parse PMC abstract handles and return a list of Abstract objects.

Parameters:: records (list[Any]) – A list of PMC Entrez fetch handles.
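
The exact shape of the fetched records is not shown here. A minimal sketch of extracting abstracts from JATS-style XML, assuming each payload is available as text; the sample document, the placeholder ID, and the helper parse_abstracts are illustrative, and plain tuples stand in for Abstract objects:

```python
import xml.etree.ElementTree as ET

SAMPLE = """<pmc-articleset>
  <article>
    <front>
      <article-meta>
        <article-id pub-id-type="pmc">PMC0000001</article-id>
        <abstract><p>First <italic>sample</italic> sentence.</p><p>Second sentence.</p></abstract>
      </article-meta>
    </front>
  </article>
</pmc-articleset>"""


def parse_abstracts(xml_text: str) -> list[tuple[str, str]]:
    """Return (article id, abstract text) pairs for every article element."""
    root = ET.fromstring(xml_text)
    results = []
    for article in root.iter("article"):
        article_id = article.findtext(".//article-id[@pub-id-type='pmc']", default="")
        abstract = article.find(".//abstract")
        if abstract is None:
            continue  # Skip articles without an abstract.
        # itertext() flattens inline markup such as <italic>.
        paragraphs = ["".join(p.itertext()).strip() for p in abstract.iter("p")]
        results.append((article_id, " ".join(paragraphs)))
    return results


abstracts = parse_abstracts(SAMPLE)
```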

get_publications_metadata(ids: list[aoptk.literature.id.ID]) → list[aoptk.literature.metadata.Metadata]

Retrieve Publication metadata.

Parameters:: ids (list[ID]) – A list of publication IDs for which to retrieve metadata.

_parse_pmc_metadata_records(records: list[str]) → list[aoptk.literature.metadata.Metadata]

Parse PMC metadata records and return a list of Metadata objects.

Parameters:: records (list) – A list of PMC XML summary payloads.
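
As an illustration of parsing an XML summary payload, the sketch below reads an ESummary-style document. The sample payload, the chosen Item names, and the helper parse_summaries are assumptions for the example, and plain dictionaries stand in for Metadata objects:

```python
import xml.etree.ElementTree as ET

SAMPLE_SUMMARY = """<eSummaryResult>
  <DocSum>
    <Id>1</Id>
    <Item Name="Title" Type="String">A sample title</Item>
    <Item Name="PubDate" Type="String">2020 Jan 1</Item>
    <Item Name="AuthorList" Type="List">
      <Item Name="Author" Type="String">Doe J</Item>
      <Item Name="Author" Type="String">Roe A</Item>
    </Item>
  </DocSum>
</eSummaryResult>"""


def parse_summaries(xml_text: str) -> list[dict[str, object]]:
    """Return one dictionary per DocSum element in the payload."""
    root = ET.fromstring(xml_text)
    records = []
    for doc in root.iter("DocSum"):
        # Authors are nested Items inside the AuthorList Item.
        authors = doc.findall("Item[@Name='AuthorList']/Item[@Name='Author']")
        records.append(
            {
                "id": doc.findtext("Id", default=""),
                "title": doc.findtext("Item[@Name='Title']", default=""),
                "pub_date": doc.findtext("Item[@Name='PubDate']", default=""),
                "authors": [a.text or "" for a in authors],
            }
        )
    return records


summaries = parse_summaries(SAMPLE_SUMMARY)
```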

_get_publication(publication_id: aoptk.literature.id.ID, download_figures_enabled: bool = True) → aoptk.literature.publication.Publication | None

Parse a single PDF and return a Publication object.

Parameters:

publication_id (ID) – The publication ID to retrieve and parse.
download_figures_enabled (bool) – Whether to download figures and include their paths in the Publication object.

_get_full_text(publication_id: aoptk.literature.id.ID) → str | None

Retrieve the full text for a given publication ID.

Parameters:: publication_id (ID) – The publication ID to retrieve the full text for.

_get_file(publication_id: aoptk.literature.id.ID, file_format: str) → pathlib.Path | None

Retrieve the file for a given publication ID and format.

Parameters:

publication_id (ID) – The publication ID to retrieve the file for.
file_format (str) – The format of the file to retrieve (pdf, xml, json, or txt). The txt, xml, and pdf formats contain the full text, while json contains metadata.
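
A minimal sketch of validating file_format and deriving a local target path. The naming scheme and the helper local_path are hypothetical, and the sketch raises on an unknown format, whereas the documented method returns None when no file is available:

```python
from pathlib import Path

FULL_TEXT_FORMATS = {"pdf", "xml", "txt"}
METADATA_FORMATS = {"json"}


def local_path(storage: Path, publication_id: str, file_format: str) -> Path:
    """Return where a downloaded file would be stored, validating the format."""
    fmt = file_format.lower()
    if fmt not in FULL_TEXT_FORMATS | METADATA_FORMATS:
        raise ValueError(f"Unsupported file format: {file_format!r}")
    # Hypothetical layout: one file per publication ID and format.
    return storage / f"{publication_id}.{fmt}"


target = local_path(Path("data"), "PMC0000001", "XML")
```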

_get_figures(publication_id: aoptk.literature.id.ID) → list[pathlib.Path]

Retrieve the figure files for a given publication ID.

Parameters:: publication_id (ID) – The publication ID to retrieve the figure files for.

_extract_figures_from_supplements(publication_id: aoptk.literature.id.ID, supplementary_files: list[str]) → list[pathlib.Path]

Extract figure files from the supplementary files.

Parameters:

publication_id (ID) – The publication ID to retrieve the figure files for.
supplementary_files (list[str]) – A list of supplementary file URLs to extract figures from.
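
One plausible way to pick figures out of a list of supplementary file URLs is to compare the suffix of each URL path against the known image extensions. The helper figure_targets, the example URLs, and the per-publication directory layout below are hypothetical:

```python
from pathlib import Path, PurePosixPath
from urllib.parse import urlsplit

IMAGE_EXTENSIONS = (".jpg", ".jpeg", ".png", ".gif", ".bmp", ".tiff", ".tif")


def figure_targets(
    figure_storage: Path, publication_id: str, supplementary_files: list[str]
) -> dict[str, Path]:
    """Map each image URL to a local PNG path; non-image URLs are skipped."""
    targets = {}
    for url in supplementary_files:
        # Use only the URL path so query strings do not hide the extension.
        remote = PurePosixPath(urlsplit(url).path)
        if remote.suffix.lower() in IMAGE_EXTENSIONS:
            targets[url] = figure_storage / publication_id / f"{remote.stem}.png"
    return targets


files = [
    "https://example.org/PMC1/fig1.jpg?download=1",
    "https://example.org/PMC1/data.xlsx",
    "s3://example-bucket/PMC1/fig2.TIF",
]
targets = figure_targets(Path("figs"), "PMC1", files)
```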

_get_json(publication_id: aoptk.literature.id.ID) → dict[str, Any] | None

Retrieve the JSON for a given publication ID.

Parameters:: publication_id (ID) – The publication ID to retrieve the JSON for.

_get_pdf(publication_id: aoptk.literature.id.ID) → aoptk.literature.pdf.PDF | None

Retrieve the PDF for a given publication ID.

Parameters:: publication_id (ID) – The publication ID to retrieve the PDF for.