|Related Article||Mishra S, Fegley BD, Diesner J, Torvik VI (2018) Self-citation is the hallmark of productive authors, of any gender. PLoS ONE 13(9): e0195773. https://doi.org/10.1371/journal.pone.0195773|
Self-citation analysis data based on PubMed Central subset (2002-2005)
This is a dataset created as part of the publication titled: Mishra S, Fegley BD, Diesner J, Torvik VI (2018) Self-Citation is the Hallmark of Productive Authors, of Any Gender. PLOS ONE.
* Training_data_2002_2005_pmc_pair_First.txt (1.2G) - Data for first authors
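
Because the file is large (1.2G), it can help to inspect the first few rows before loading it fully. The following is a minimal sketch only: the tab delimiter and UTF-8 encoding are assumptions (the file layout is not documented here), and `peek_rows` is a hypothetical helper, not part of the released source code.

```python
import csv
import os
from itertools import islice


def peek_rows(path, n=5, delimiter="\t"):
    """Return the first n rows of a delimited text file.

    The file is streamed, so the remaining rows are never read into memory.
    The tab delimiter is an assumption; adjust it after inspecting the file.
    """
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle, delimiter=delimiter)
        return list(islice(reader, n))


if __name__ == "__main__":
    # File name taken from the listing above; the local path is an assumption.
    path = "Training_data_2002_2005_pmc_pair_First.txt"
    if os.path.exists(path):
        for row in peek_rows(path):
            print(row)
```

If the printed rows each come back as a single unsplit field, the delimiter assumption is probably wrong and should be changed.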
## Dataset creation
Our experiments relied on data from multiple sources, including proprietary data from [Thomson Reuters' (now Clarivate Analytics) Web of Science collection of MEDLINE citations](https://clarivate.com/products/web-of-science/databases/). Authors interested in reproducing our experiments should request this data directly from Clarivate Analytics. However, we do provide a similar but open dataset based on citations from PubMed Central, which can be used to obtain results similar to those reported in our analysis. Furthermore, we have freely shared our other datasets, which can be combined with the citation data from Clarivate Analytics to re-create the dataset used in our experiments. These datasets are listed below. If you use any of them, please cite both the dataset and the paper that introduces it.
* MEDLINE 2015 baseline: https://www.nlm.nih.gov/bsd/licensee/2015_stats/baseline_doc.html
* Citation data from PubMed Central (original paper includes additional citations from Web of Science)
* Author-ity 2009 dataset:
* Genni 2.0 + Ethnea for identifying author gender and ethnicity:
* MapAffil for identifying article country of affiliation:
* IMPLICIT journal similarity:
* Novelty dataset for identifying article-level novelty:
* Expertise dataset for identifying author expertise on articles:
* Source code provided at: https://github.com/napsternxg/PubMed_SelfCitationAnalysis
**Note: The dataset is based on a snapshot of PubMed (which includes Medline and PubMed-not-Medline records) taken in the first week of October, 2016.**
Additional data-related updates can be found at the Torvik Research Group.
This work was made possible in part with funding to VIT from NIH grant P01AG039347 and NSF grant 1348742. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Self-citation analysis data based on PubMed Central subset (2002-2005) by Shubhanshu Mishra, Brent D. Fegley, Jana Diesner, and Vetle Torvik is licensed under a Creative Commons Attribution 4.0 International License.
|Keywords||Self citation; PubMed Central; Data Analysis; Citation Data;|
|Funder||U.S. National Institutes of Health (NIH) - Grant: P01AG039347|
|Funder||U.S. National Science Foundation (NSF) - Grant: 1348742|
|Corresponding Creator||Vetle I. Torvik|