Sentences and citation contexts identified from the PubMed Central open access articles
----------------------------------------------------------------------
The dataset is delivered as 24 tab-delimited text files. The files contain 720,649,608 sentences, 75,848,689 of which are citation contexts. The dataset is based on a snapshot of articles in the XML version of the PubMed Central open access subset (i.e., the PMCOA subset). The PMCOA subset was collected in May 2019.
The dataset was created as described in: Hsiao, T. K., & Torvik, V. I. (manuscript). OpCitance: Citation contexts identified from the PubMed Central open access articles.
Files:
• A_journal_IntxtCit.tsv – Sentences and citation contexts identified from articles published in journals with journal titles starting with A.
• B_journal_IntxtCit.tsv – Sentences and citation contexts identified from articles published in journals with journal titles starting with B.
• C_journal_IntxtCit.tsv – Sentences and citation contexts identified from articles published in journals with journal titles starting with C.
• D_journal_IntxtCit.tsv – Sentences and citation contexts identified from articles published in journals with journal titles starting with D.
• E_journal_IntxtCit.tsv – Sentences and citation contexts identified from articles published in journals with journal titles starting with E.
• F_journal_IntxtCit.tsv – Sentences and citation contexts identified from articles published in journals with journal titles starting with F.
• G_journal_IntxtCit.tsv – Sentences and citation contexts identified from articles published in journals with journal titles starting with G.
• H_journal_IntxtCit.tsv – Sentences and citation contexts identified from articles published in journals with journal titles starting with H.
• I_journal_IntxtCit.tsv – Sentences and citation contexts identified from articles published in journals with journal titles starting with I.
• J_journal_IntxtCit.tsv – Sentences and citation contexts identified from articles published in journals with journal titles starting with J.
• K_journal_IntxtCit.tsv – Sentences and citation contexts identified from articles published in journals with journal titles starting with K.
• L_journal_IntxtCit.tsv – Sentences and citation contexts identified from articles published in journals with journal titles starting with L.
• M_journal_IntxtCit.tsv – Sentences and citation contexts identified from articles published in journals with journal titles starting with M.
• N_journal_IntxtCit.tsv – Sentences and citation contexts identified from articles published in journals with journal titles starting with N.
• O_journal_IntxtCit.tsv – Sentences and citation contexts identified from articles published in journals with journal titles starting with O.
• P_p1_journal_IntxtCit.tsv – Sentences and citation contexts identified from articles published in journals with journal titles starting with P (part 1).
• P_p2_journal_IntxtCit.tsv – Sentences and citation contexts identified from articles published in journals with journal titles starting with P (part 2).
• Q_journal_IntxtCit.tsv – Sentences and citation contexts identified from articles published in journals with journal titles starting with Q.
• R_journal_IntxtCit.tsv – Sentences and citation contexts identified from articles published in journals with journal titles starting with R.
• S_journal_IntxtCit.tsv – Sentences and citation contexts identified from articles published in journals with journal titles starting with S.
• T_journal_IntxtCit.tsv – Sentences and citation contexts identified from articles published in journals with journal titles starting with T.
• UV_journal_IntxtCit.tsv – Sentences and citation contexts identified from articles published in journals with journal titles starting with U or V.
• W_journal_IntxtCit.tsv – Sentences and citation contexts identified from articles published in journals with journal titles starting with W.
• XYZ_journal_IntxtCit.tsv – Sentences and citation contexts identified from articles published in journals with journal titles starting with X, Y or Z.
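The naming scheme above can be turned into a small lookup. The following is a minimal sketch, not part of the dataset's code: the function name candidate_files is hypothetical, and it assumes the first character of the journal title as given decides the file. How titles that start with a digit, an accented letter, or an article such as "The" were assigned is not stated here, so the sketch raises an error for non-ASCII-letter initials rather than guessing. Because the list does not say how journals are divided between the two P parts, both files are returned for P.

```python
import string

# Letters that share a file or are split across files, per the file list above.
_SPECIAL = {
    "P": ["P_p1_journal_IntxtCit.tsv", "P_p2_journal_IntxtCit.tsv"],
    "U": ["UV_journal_IntxtCit.tsv"],
    "V": ["UV_journal_IntxtCit.tsv"],
    "X": ["XYZ_journal_IntxtCit.tsv"],
    "Y": ["XYZ_journal_IntxtCit.tsv"],
    "Z": ["XYZ_journal_IntxtCit.tsv"],
}


def candidate_files(journal_title):
    """Return the file name(s) that may hold rows for a journal title.

    Assumption: the first character of the title, upper-cased, selects the file.
    """
    title = journal_title.strip()
    if not title:
        raise ValueError("empty journal title")
    first = title[0].upper()
    if first not in string.ascii_uppercase:
        raise ValueError(f"no file is listed for titles starting with {first!r}")
    return _SPECIAL.get(first, [f"{first}_journal_IntxtCit.tsv"])
```

For a title starting with P, both returned files would need to be searched.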
Each row in these files is a sentence/citation context and contains the following columns:
• pmcid: PMCID of the article
• pmid: PMID of the article. If an article does not have a PMID, the value is NONE.
• location: The article component (abstract, main text, table, figure, etc.) to which the citation context/sentence belongs.
• IMRaD: The type of IMRaD section associated with the citation context/sentence. I, M, R, and D represent introduction/background, method, results, and conclusion/discussion, respectively; NoIMRaD indicates that the section type is not identifiable.
• sentence_id: The ID of the citation context/sentence in the article component
• total_sentences: The number of sentences in the article component.
• intxt_id: The ID of the citation.
• intxt_pmid: PMID of the citation (as tagged in the XML file). If a citation does not have a PMID tagged in the XML file, the value is "-".
• intxt_pmid_source: The sources in which the intxt_pmid is identified. xml indicates that the PMID is identified only from the XML file; xml,pmc indicates that the PMID is identified both from the XML file and from the citation data collected from the NCBI Entrez Programming Utilities. If a citation does not have an intxt_pmid, the value is "-".
• intxt_mark: The citation marker associated with the inline citation.
• best_id: The best source link ID (e.g., PMID) of the citation.
• best_source: The sources that confirm the best ID.
• best_id_diff: The comparison result between the best_id column and the intxt_pmid column.
• citation: A citation context. If no citation is found in a sentence, the value is the sentence.
• progression: Text progression of the citation context/sentence.
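As an illustration of how these columns might be read, here is a minimal sketch using Python's standard csv module. It rests on several assumptions that are not confirmed above: that the columns appear in the files in the order listed, that each file starts with a header row (pass has_header=False otherwise), and that fields are not quoted. The names COLUMNS and read_rows are hypothetical. The sketch maps the documented missing-value markers (NONE for pmid, "-" for intxt_pmid and intxt_pmid_source) to None and streams rows one at a time, since the files are large.

```python
import csv

# Column names in the order they are described above (the order is an assumption).
COLUMNS = [
    "pmcid", "pmid", "location", "IMRaD", "sentence_id", "total_sentences",
    "intxt_id", "intxt_pmid", "intxt_pmid_source", "intxt_mark",
    "best_id", "best_source", "best_id_diff", "citation", "progression",
]


def read_rows(lines, has_header=True):
    """Yield one dict per row, with missing-value markers replaced by None.

    `lines` is any iterable of text lines, e.g. an open file handle.
    """
    # QUOTE_NONE keeps quotation marks inside sentences as literal characters.
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    if has_header:
        next(reader, None)  # skip the assumed header row
    for fields in reader:
        row = dict(zip(COLUMNS, fields))
        if row.get("pmid") == "NONE":
            row["pmid"] = None
        for col in ("intxt_pmid", "intxt_pmid_source"):
            if row.get(col) == "-":
                row[col] = None
        yield row
```

A file could then be streamed with, for example, open("A_journal_IntxtCit.tsv", encoding="utf-8", newline="") passed as lines; the encoding is also an assumption.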
Supplementary Files
• PMC-OA-patci.tsv.gz – This file contains the best source link IDs (e.g., PMIDs) for the references. Patci [1] was used to identify the best source link IDs. The best source link IDs are mapped to the citation contexts and displayed in the *_journal_IntxtCit.tsv files as the best_id column.
Each row in the PMC-OA-patci.tsv.gz file is a citation (i.e., a reference extracted from the XML file) and contains the following columns:
• pmcid: PMCID of the citing article.
• pos: The citation's position in the reference list.
• fromPMID: PMID of the citing article.
• toPMID: Source link ID (e.g., PMID) of the citation. This ID is identified by Patci.
• SRC: The sources that confirm the toPMID.
• MatchDB: The origin bibliographic database of the toPMID.
• Probability: The match probability of the toPMID.
• toPMID2: PMID of the citation (as tagged in the XML file).
• SRC2: The sources that confirm the toPMID2.
• intxt_id: The ID of the citation.
• journal: The first letter of the journal title. This maps to the *_journal_IntxtCit.tsv files.
• same_ref_string: Whether the citation string appears in the reference list more than once.
• DIFF: The comparison result between the toPMID column and the toPMID2 column.
• bestID: The best source link ID (e.g., PMID) of the citation.
• bestSRC: The sources that confirm the best ID.
• Match: Matching result produced by Patci.
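To show how this file might be consulted directly, the sketch below builds an in-memory index from the gzipped file. It is an illustration under stated assumptions, not the authors' procedure: it assumes the file has a header row using the column names above, and it assumes that the pair (pmcid, intxt_id) identifies a reference, which is not stated explicitly here. The function names are hypothetical. The full file may be too large to index in memory on a small machine, in which case the rows could be filtered while streaming instead.

```python
import csv
import gzip


def load_best_ids(lines):
    """Map (pmcid, intxt_id) -> (bestID, bestSRC) from rows of the Patci file.

    Assumes a header row naming the columns as described above.
    """
    reader = csv.DictReader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    index = {}
    for row in reader:
        # The (pmcid, intxt_id) key is an assumption about how rows are identified.
        index[(row["pmcid"], row["intxt_id"])] = (row["bestID"], row["bestSRC"])
    return index


def load_best_ids_from_gz(path="PMC-OA-patci.tsv.gz"):
    """Open the gzipped file in text mode and build the index."""
    with gzip.open(path, mode="rt", encoding="utf-8", newline="") as handle:
        return load_best_ids(handle)
```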
[1] Agarwal, S., Lincoln, M., Cai, H., & Torvik, V. (2014). Patci – a tool for identifying scientific articles cited by patents. GSLIS Research Showcase 2014. http://hdl.handle.net/2142/54885
• intxt_cit_license_fromPMC.tsv – This file contains the CC licensing information for each article. The licensing information is from PMC's file lists [2], retrieved on June 19, 2020, and March 9, 2023. Note that the license information for 189,855 PMCIDs is NO-CC CODE in the file lists, and 521 PMCIDs are absent from the file lists. The absence of CC licensing information does not indicate that the article lacks a CC license. For example, PMCID: 6156294 (NO-CC CODE) and PMCID: 6118074 (absent from PMC's file lists) are under CC-BY licenses according to the PDF versions of the articles.
The intxt_cit_license_fromPMC.tsv file has two columns:
• pmcid: PMCID of the article.
• license: The article’s CC license information provided in PMC’s file lists. The value is nan when an article is not present in the PMC’s file lists.
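Because NO-CC CODE and nan both mean that the license is unknown rather than absent, it can help to fold them into a single "unknown" value when loading the file. The sketch below is one way to do that; the names load_licenses and UNKNOWN_LICENSE_VALUES are hypothetical, and a header row with the two column names above is assumed.

```python
import csv

# Values that carry no usable CC license information, per the description above.
UNKNOWN_LICENSE_VALUES = {"nan", "NO-CC CODE"}


def load_licenses(lines):
    """Map pmcid -> license string, or None when the license is unknown.

    None means "not recorded in PMC's file lists", not "no CC license".
    """
    reader = csv.DictReader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    licenses = {}
    for row in reader:
        value = row["license"].strip()
        licenses[row["pmcid"]] = None if value in UNKNOWN_LICENSE_VALUES else value
    return licenses
```

Articles that map to None would need their license checked against another source, such as the article itself.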
[2] https://www.ncbi.nlm.nih.gov/pmc/tools/ftp/
• Supplementary_File_1.zip – This file contains the code for generating the dataset.