Sentences and citation contexts identified from the PubMed Central open access articles
The dataset is delivered as 24 tab-delimited text files. The files contain 720,649,608 sentences, 75,848,689 of which are citation contexts. The dataset is based on a snapshot of articles in the XML version of the PubMed Central open access subset (i.e., the PMCOA subset). The PMCOA subset was collected in May 2019.
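Because the files are large, streaming them row by row is likely more practical than loading them whole. The following is a minimal sketch, not part of the dataset's own code: the helper names `iter_tsv_rows` and `count_rows` are hypothetical, and it assumes UTF-8 encoding and that quote characters inside sentences are ordinary text. It makes no assumption about a header row or about column names.

```python
import csv
import gzip
from pathlib import Path
from typing import Iterator, List


def iter_tsv_rows(path: str) -> Iterator[List[str]]:
    """Yield each row of a tab-delimited file as a list of strings.

    Hypothetical helper: files ending in .gz are opened with gzip,
    everything else as plain text. UTF-8 is an assumption.
    """
    opener = gzip.open if Path(path).suffix == ".gz" else open
    with opener(path, "rt", encoding="utf-8", newline="") as handle:
        # QUOTE_NONE treats quotation marks in sentences as ordinary characters.
        reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            yield row


def count_rows(path: str) -> int:
    """Count rows without holding the whole file in memory."""
    return sum(1 for _ in iter_tsv_rows(path))
```

Calling `count_rows` on each of the 24 files and summing the results would give one way to check a download against the sentence total stated above, assuming the files carry no header rows.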
The dataset was created as described in: Hsiao, T.-K., & Torvik, V. I. (manuscript). OpCitance: Citation contexts identified from the PubMed Central open access articles.
Each row in the file is a sentence/citation context and contains the following columns:
Each row in the PMC-OA-patci.tsv.gz file is a citation (i.e., a reference extracted from the XML file) and contains the following columns:
 Agarwal, S., Lincoln, M., Cai, H., & Torvik, V. (2014). Patci – a tool for identifying scientific articles cited by patents. GSLIS Research Showcase 2014. http://hdl.handle.net/2142/54885
• intxt_cit_license_fromPMC.tsv – This file contains the CC licensing information for each article. The licensing information is from PMC's file lists, retrieved on June 19, 2020, and March 9, 2023. Note that the license information for 189,855 PMCIDs is NO-CC CODE in the file lists, and 521 PMCIDs are absent from the file lists. The absence of CC licensing information does not indicate that the article lacks a CC license. For example, PMCID: 6156294 (NO-CC CODE) and PMCID: 6118074 (absent from PMC's file lists) are under CC-BY licenses according to the PDF versions of the articles.
The intxt_cit_license_fromPMC.tsv file has two columns:
• Supplementary_File_1.zip – This file contains the code for generating the dataset.
Keywords: citation context; in-text citation; inline citation; bibliometrics; science of science
Funder: U.S. National Institutes of Health (NIH) - Grant: P01AG039347
Corresponding Creator: Tzu-Kun Hsiao
Article: Hsiao, T.-K., & Torvik, V. I. (2023). OpCitance: Citation contexts identified from the PubMed Central open access articles. Scientific Data, 10, 243. https://doi.org/10.1038/s41597-023-02134-x