Dataset Description
|
This dataset, uCite, is the union of nine large-scale open-access PubMed citation datasets, with citation pairs separated by reliability. There are 20 files, including the reliable and unreliable citation PMID pairs, mappings from non-PMID identifiers to PMIDs (for DOIs, Lens, MAG, and Semantic Scholar), the original PMID pairs from the nine resources, some metadata for PMIDs, duplicate PMIDs, some redirected PMID pairs, and PMC OA Patci citation matching results.
A short description of each data file is given below. A detailed description can be found in the README.txt.
DATASET DESCRIPTION
- PPUB.tsv.gz - tsv format file containing the reliable citation pairs of uCite.
- PUNR.tsv.gz - tsv format file containing the unreliable citation pairs of uCite.
- DOI2PMID.tsv.gz - tsv format file containing results mapping DOI to PMID.
- LEN2PMID.tsv.gz - tsv format file containing results mapping LensID pairs to PMID pairs.
- MAG2PMIDsorted.tsv.gz - tsv format file containing results mapping MAG ID to PMID.
- SEM2PMID.tsv.gz - tsv format file containing results mapping Semantic Scholar ID to PMID.
- JVNPYA.tsv.gz - tsv format file containing metadata of papers with PMID, journal name, volume, issue, pages, publication year, and first author's last name.
- TiLTyAlJVNY.tsv.gz - tsv format file containing metadata of papers.
- PMC-OA-patci.tsv.gz - tsv format file containing PubMed Central Open Access subset reference strings extracted by \cite{} processed by Patci.
- REDIRECTS.gz - txt file containing unreliable PMID pairs mapped to reliable PMID pairs.
- REMAP - file containing pairs of duplicate PubMed records (lhs PMID mapped to rhs PMID).
- ami_pair.tsv.gz - tsv format file containing all citation pairs from Aminer (2015 version).
- dim_pair.tsv.gz - tsv format file containing all citation pairs from Dimensions.
- ice_pair.tsv.gz - tsv format file containing all citation pairs from iCite (April 2019 version, version 1).
- len_pair.tsv.gz - tsv format file containing all citation pairs from Lens.org (harvested through Oct 2021).
- mag_pair.tsv.gz - tsv format file containing all citation pairs from Microsoft Academic Graph (2015 version).
- oci_pair.tsv.gz - tsv format file containing all citation pairs from Open Citations (Nov. 2021 dump, csv version).
- pat_pair.tsv.gz - tsv format file containing all citation pairs from Patci (i.e., from "PMC-OA-patci.tsv.gz").
- pmc_pair.tsv.gz - tsv format file containing all citation pairs from PubMed Central (harvested through Dec 2018 via e-Utilities).
- sem_pair.tsv.gz - tsv format file containing all citation pairs from Semantic Scholar (2019 version).
COLUMN DESCRIPTION
FILENAME : PPUB.tsv.gz, PUNR.tsv.gz
(1) fromPMID - PubMed ID of the citing paper.
(2) toPMID - PubMed ID of the cited paper.
(3) sources - Citation sources in which the citation pair was identified.
(4) fromYEAR - Publication year of the citing paper.
(5) toYEAR - Publication year of the cited paper.
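The five columns above can be read with the Python standard library alone. The following is a minimal sketch, not an official loader: it assumes one tab-separated record per line, and it skips any row whose first field is not numeric in case the file carries a header row (whether it does is not stated here). The names CitationPair and read_pairs are illustrative, and the demo rows, including the comma-separated form of the sources field, are made up rather than taken from uCite.

```python
import csv
import gzip
import io
from typing import Iterator, NamedTuple, TextIO


class CitationPair(NamedTuple):
    from_pmid: str
    to_pmid: str
    sources: str
    from_year: str
    to_year: str


def read_pairs(handle: TextIO) -> Iterator[CitationPair]:
    """Yield one CitationPair per tab-separated row of five fields."""
    for row in csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE):
        if len(row) != 5:
            continue  # skip blank or malformed rows
        if not row[0].isdigit():
            continue  # skip a header row, if the file has one (assumption)
        yield CitationPair(*row)


# Demo on an in-memory gzip stream with made-up rows (not real uCite data).
sample = "1001\t900\tpmc,oci\t2010\t2005\n1002\t900\tmag\t2012\t2005\n"
buffer = io.BytesIO(gzip.compress(sample.encode("utf-8")))
with gzip.open(buffer, mode="rt", encoding="utf-8", newline="") as handle:
    pairs = list(read_pairs(handle))
```

For the real files, replace the in-memory buffer with a path such as "PPUB.tsv.gz" in the gzip.open call. All fields are kept as strings so that a missing year does not raise an error.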
FILENAME : DOI2PMID.tsv.gz
(1) DOI - Digital Object Identifier of paper records.
(2) PMID - PubMed ID of paper records.
(3) PMID2 - Digital Object Identifier of paper records, “-” if the paper doesn't have DOIs.
FILENAME : SEM2PMID.tsv.gz
(1) SemID - Semantic Scholar ID of paper records.
(2) PMID - PubMed ID of paper records.
(3) DOI - Digital Object Identifier of paper records, “-” if the paper doesn't have DOIs.
FILENAME : JVNPYA.tsv.gz
- Each row refers to a publication record.
(1) PMID - PubMed ID.
(2) journal - Journal name.
(3) volume - Journal volume.
(4) issue - Journal issue.
(5) pages - First page and last page (without leading digits) of the publication, separated by '-'.
(6) year - Publication year.
(7) lastname - Last name of the first author.
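The pages column may need expanding before page ranges can be compared. The sketch below assumes that "without leading digits" means the last page omits the digits it shares with the first page (for example "1234-41" for pages 1234 to 1241). That reading is an assumption, and the helper name expand_pages is illustrative only.

```python
from typing import Optional, Tuple


def expand_pages(pages: str) -> Optional[Tuple[int, int]]:
    """Return (first, last) page numbers, or None if the field is not numeric."""
    first, sep, last = pages.partition("-")
    if not first.isdigit():
        return None  # e.g. electronic locators such as "e12-e15"
    if not sep or not last:
        return int(first), int(first)  # single-page record
    if not last.isdigit():
        return None
    if len(last) < len(first):
        # Restore the leading digits the last page shares with the first page.
        last = first[: len(first) - len(last)] + last
    return int(first), int(last)
```

Non-numeric page strings return None rather than a guess, so callers can decide how to handle them.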
FILENAME : TiLTyAlJVNY.tsv.gz
(1) PMID - PubMed ID.
(2) title_tokenized - Paper title after tokenization.
(3) languages - Language that paper is written in.
(4) pub_types - Types of the publication.
(5) length(authors) - String length of author names.
(6) journal - Journal name.
(7) volume - Journal volume.
(8) issue - Journal issue.
(9) year - Year of print publication (not necessarily the epub year).
FILENAME : PMC-OA-patci.tsv.gz
(1) pmcid - PubMed Central identifier.
(2) pos -
(3) fromPMID - PubMed ID of the citing paper.
(4) toPMID - PubMed ID of the cited paper.
(5) SRC - citation sources, in which the citation pairs are identified.
(6) MatchDB - PubMed, ADS, DBLP.
(7) Probability - Matching probability predicted by Patci.
(8) toPMID2 - PubMed ID of the cited paper, extracted from the OA xml file.
(9) SRC2 - citation sources, in which the citation pairs are identified.
(10) intxt_id -
(11) journal - First character of the journal name.
(12) same_ref_string - Y if patci and xml reference string match, otherwise N.
(13) DIFF -
(14) bestSRC - Citation sources, in which the citation pairs are identified.
(15) Match - Matching strings annotated by Patci.
FILENAME : REDIRECTS.gz
Each row in the REDIRECTS file is a string in one of the following formats.
- "REDIRECTED FROM: source PMID_i PMID_j -> PMID_i' PMID_j"
- "REDIRECTED TO: source PMID_i PMID_j -> PMID_i PMID_j'"
Note: source is the name of the source(s) from which the pair PMID_i, PMID_j comes.
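The two row formats above can be parsed with one regular expression. This is a sketch under stated assumptions: the source label is taken to be a single token without whitespace, PMIDs are taken to be digits only, and surrounding double quotes are stripped in case they appear in the file. The names Redirect and parse_redirect are illustrative.

```python
import re
from typing import NamedTuple, Optional, Tuple


class Redirect(NamedTuple):
    direction: str              # "FROM" or "TO"
    source: str                 # source label as written in the row
    old_pair: Tuple[int, int]   # pair before redirection
    new_pair: Tuple[int, int]   # pair after redirection


# Assumes a whitespace-free source label and digit-only PMIDs.
REDIRECT_RE = re.compile(
    r"^REDIRECTED\s+(FROM|TO):\s+(\S+)\s+(\d+)\s+(\d+)\s+->\s+(\d+)\s+(\d+)$"
)


def parse_redirect(line: str) -> Optional[Redirect]:
    """Parse one REDIRECTS row; return None if it does not match the format."""
    text = line.strip().strip('"').strip()
    match = REDIRECT_RE.match(text)
    if match is None:
        return None
    direction, source, a, b, c, d = match.groups()
    return Redirect(direction, source, (int(a), int(b)), (int(c), int(d)))
```

Rows that do not match return None, which makes it easy to count and inspect lines that break these assumptions.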
FILENAME : REMAP
Each row maps one PMID of a pair of duplicate PubMed records to the other (lhs PMID mapped to rhs PMID).
The format of each row is "$REMAP{PMID_i} = PMID_j".
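Rows in this format can be loaded into a dictionary as sketched below. The pattern tolerates optional whitespace and trailing characters after the second PMID. Following chains of mappings (a PMID mapped to one that is itself remapped) is an assumption about how the mapping might be applied, not something stated here. The names load_remap and resolve are illustrative.

```python
import re
from typing import Dict, Iterable

# Matches rows of the form "$REMAP{PMID_i} = PMID_j".
REMAP_RE = re.compile(r"\$REMAP\{(\d+)\}\s*=\s*(\d+)")


def load_remap(lines: Iterable[str]) -> Dict[int, int]:
    """Build a lhs PMID -> rhs PMID mapping, ignoring non-matching lines."""
    remap: Dict[int, int] = {}
    for line in lines:
        match = REMAP_RE.search(line)
        if match:
            remap[int(match.group(1))] = int(match.group(2))
    return remap


def resolve(pmid: int, remap: Dict[int, int]) -> int:
    """Follow the mapping until a PMID has no entry; stop if a cycle is hit."""
    seen = set()
    while pmid in remap and pmid not in seen:
        seen.add(pmid)
        pmid = remap[pmid]
    return pmid
```

For example, load_remap(open("REMAP")) would build the mapping from the file, assuming it is plain uncompressed text as its name suggests.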
FILENAME : ami_pair.tsv.gz, dim_pair.tsv.gz, ice_pair.tsv.gz, len_pair.tsv.gz, mag_pair.tsv.gz, oci_pair.tsv.gz, pat_pair.tsv.gz, pmc_pair.tsv.gz, sem_pair.tsv.gz
(1) fromPMID - PubMed ID of the citing paper.
(2) toPMID - PubMed ID of the cited paper.
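To illustrate how per-source pair files relate to a union with a sources column, the sketch below merges pairs from several sources and records which sources contain each pair. It is an illustration only: it does not reproduce how uCite separates reliable from unreliable pairs, the function name union_with_sources is hypothetical, and the demo pairs are made up. The source labels reuse the three-letter prefixes of the file names.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

Pair = Tuple[str, str]  # (fromPMID, toPMID)


def union_with_sources(
    pairs_by_source: Dict[str, Iterable[Pair]]
) -> Dict[Pair, Set[str]]:
    """Map each distinct citation pair to the set of sources that contain it."""
    merged: Dict[Pair, Set[str]] = defaultdict(set)
    for source, pairs in pairs_by_source.items():
        for pair in pairs:
            merged[pair].add(source)
    return dict(merged)


# Made-up pairs for two of the nine sources.
demo = union_with_sources({
    "pmc": [("1001", "900"), ("1002", "900")],
    "mag": [("1001", "900")],
})
```

Each value in the result can be sorted and joined into a single string if a one-column sources field is wanted.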
|