aoptk.literature.databases.pmc

Classes

PMC

Class for retrieving and parsing open access PMC publications.

Module Contents

class aoptk.literature.databases.pmc.PMC(query: str, storage: str, figure_storage: str)

Bases: aoptk.literature.get_publication.GetPublication, aoptk.literature.get_pdf.GetPDF, aoptk.literature.get_id.GetID

Class for retrieving and parsing open access PMC publications.

aws_region = 'us-east-1'
s3
bucket = 'pmc-oa-opendata'
paginator
max_pmc_results = 9998
max_concurrency = 2
max_requests_per_second = 2.0
minimal_year_publication = 1800
semaphore
limiter
retries = 5
image_extensions = ('.jpg', '.jpeg', '.png', '.gif', '.bmp', '.tiff')
_query
id_list
storage
figure_storage
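
The max_concurrency, max_requests_per_second, semaphore, limiter, and retries attributes suggest that requests are throttled and retried, but this page does not show how. The following is a minimal sketch of one way such limits could be combined, assuming an asyncio.Semaphore for concurrency, a simple interval-based limiter, and exponential backoff between attempts. IntervalLimiter and fetch_with_limits are hypothetical names and are not part of aoptk.

```python
import asyncio
import time


class IntervalLimiter:
    """Hypothetical limiter: spaces calls at least 1/rate seconds apart."""

    def __init__(self, rate_per_second: float) -> None:
        self._interval = 1.0 / rate_per_second
        self._lock = asyncio.Lock()
        self._next_slot = 0.0

    async def wait(self) -> None:
        # Reserve the next free time slot, then sleep until it arrives.
        async with self._lock:
            now = time.monotonic()
            delay = max(0.0, self._next_slot - now)
            self._next_slot = max(now, self._next_slot) + self._interval
        if delay:
            await asyncio.sleep(delay)


async def fetch_with_limits(task, semaphore, limiter, retries=5, base_delay=0.01):
    """Run `task()` under the semaphore and limiter, retrying on OSError."""
    for attempt in range(retries):
        async with semaphore:
            await limiter.wait()
            try:
                return await task()
            except OSError:
                if attempt == retries - 1:
                    raise  # out of attempts: surface the last error
        # Back off outside the semaphore so other tasks can proceed.
        await asyncio.sleep(base_delay * 2 ** attempt)
```

With the values listed above, the semaphore would be created as asyncio.Semaphore(2) and the limiter as IntervalLimiter(2.0), both inside the running event loop. Which exceptions the real class treats as retryable is not stated here; OSError is only a placeholder.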
get_pdfs() → list[aoptk.literature.pdf.PDF]

Retrieve PDFs based on the query.

Returns:

A list of PDF objects corresponding to the publications matching the query.

Return type:

list[PDF]

get_publications() → list[aoptk.literature.publication.Publication]

Get a list of publications.

Returns:

A list of Publication objects.

Return type:

list[Publication]

async get_ids() → list[aoptk.literature.id.ID]

Retrieve a list of publication IDs based on the query.

_get_publication(publication_id: str) → aoptk.literature.publication.Publication

Parse a single PDF and return a Publication object.

Parameters:

publication_id (str) – The publication ID to retrieve and parse.

_get_full_text(publication_id: str) → str | None

Retrieve the full text for a given publication ID.

Parameters:

publication_id (str) – The publication ID to retrieve the full text for.

_get_file(publication_id: str, file_format: str) → aoptk.literature.pdf.PDF | str | None

Retrieve the file for a given publication ID and format.

Parameters:
  • publication_id (str) – The publication ID to retrieve the file for.

  • file_format (str) – The format of the file to retrieve (pdf, xml, json, or txt). The txt, xml, and pdf formats contain the full text, while json contains metadata.

_get_figures(publication_id: str) → list[str]

Retrieve the figure files for a given publication ID.

Parameters:

publication_id (str) – The publication ID to retrieve the figure files for.

_extract_figures_from_supplements(publication_id: str, supplementary_files: list[str]) → list[str]

Extract figure files from the supplementary files.

Parameters:
  • publication_id (str) – The publication ID to retrieve the figure files for.

  • supplementary_files (list[str]) – A list of supplementary file URLs to extract figures from.
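
The image_extensions attribute listed above suggests that figures are recognized by file extension. As a rough illustration only, the filtering step could look like the sketch below. select_figure_files is a hypothetical helper, not the method's actual implementation, and it covers only the selection of image files, not any downloading or storage.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Same values as the image_extensions attribute documented above.
IMAGE_EXTENSIONS = (".jpg", ".jpeg", ".png", ".gif", ".bmp", ".tiff")


def select_figure_files(supplementary_files: list[str]) -> list[str]:
    """Keep only the entries whose path ends in a known image extension."""
    figures = []
    for url in supplementary_files:
        # Ignore query strings and fragments; compare the suffix case-insensitively.
        suffix = PurePosixPath(urlparse(url).path).suffix.lower()
        if suffix in IMAGE_EXTENSIONS:
            figures.append(url)
    return figures
```

Matching on the lowercased suffix means that names such as fig1.PNG are kept, while archives or spreadsheets among the supplements are skipped.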

_get_json(publication_id: str) → str | None

Retrieve the JSON metadata for a given publication ID.

Parameters:

publication_id (str) – The publication ID to retrieve the JSON metadata for.

_get_pdf(publication_id: str) → aoptk.literature.pdf.PDF | None

Retrieve the PDF for a given publication ID.

Parameters:

publication_id (str) – The publication ID to retrieve the PDF for.

_get_publication_count_and_ids(mindate: str | None = None, maxdate: str | None = None) → tuple[int, list[str]]
async _async_get_publication_count_and_ids(mindate: str | None = None, maxdate: str | None = None) → tuple[int, list[str]] | None
async _collect_ids_for_year(year: int) → list[str]
async _collect_ids_split_by_months_days(year: int) → list[str]
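
The four helpers above are undocumented. Their names, together with max_pmc_results = 9998 and the mindate/maxdate parameters, suggest that a query returning too many results is split into narrower date windows. The following synchronous sketch shows that idea under this assumption. collect_ids_for_year and the search callback are hypothetical stand-ins, and the real methods are asynchronous and may differ.

```python
import calendar
from datetime import date
from typing import Callable

MAX_RESULTS = 9998  # mirrors max_pmc_results above

# Hypothetical search callback: (mindate, maxdate) -> (total count, returned ids).
Search = Callable[[date, date], tuple[int, list[str]]]


def collect_ids_for_year(year: int, search: Search) -> list[str]:
    """Collect IDs for one year, narrowing the window when results are capped."""
    count, ids = search(date(year, 1, 1), date(year, 12, 31))
    if count <= MAX_RESULTS:
        return ids
    collected: list[str] = []
    for month in range(1, 13):
        last_day = calendar.monthrange(year, month)[1]
        count, ids = search(date(year, month, 1), date(year, month, last_day))
        if count <= MAX_RESULTS:
            collected.extend(ids)
            continue
        # Still too many results: query each day of the month separately.
        for day in range(1, last_day + 1):
            _, day_ids = search(date(year, month, day), date(year, month, day))
            collected.extend(day_ids)
    return collected
```

The sketch assumes that a single day never exceeds the cap. If that could happen, a further split or an explicit error would be needed.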