Self-citation analysis data based on PubMed Central subset (2002-2005)

Mishra, Shubhanshu; Fegley, Brent D; Diesner, Jana; Torvik, Vetle I.

doi:10.13012/B2IDB-9665377_V1

Cite this dataset:

Mishra, Shubhanshu; Fegley, Brent D; Diesner, Jana; Torvik, Vetle I. (2018): Self-citation analysis data based on PubMed Central subset (2002-2005). University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-9665377_V1

Use this persistent URL to link to this dataset: https://doi.org/10.13012/B2IDB-9665377_V1

Metadata


Dataset Description

Self-citation analysis data based on PubMed Central subset (2002-2005)
----------------------------------------------------------------------
Created by Shubhanshu Mishra, Brent D. Fegley, Jana Diesner, and Vetle Torvik on April 5th, 2018

## Introduction

This dataset was created as part of the publication: Mishra S, Fegley BD, Diesner J, Torvik VI (2018) Self-Citation is the Hallmark of Productive Authors, of Any Gender. PLOS ONE.
It contains files for running the self-citation analysis on articles published in PubMed Central between 2002 and 2005, collected in 2015.
The dataset is distributed in the form of the following tab-separated text files and supporting archives:

* Training_data_2002_2005_pmc_pair_First.txt (1.2G) - Data for first authors
* Training_data_2002_2005_pmc_pair_Last.txt (1.2G) - Data for last authors
* Training_data_2002_2005_pmc_pair_Middle_2nd.txt (964M) - Data for middle 2nd authors
* Training_data_2002_2005_pmc_pair_txt.header.txt - Header for the data
* COLUMNS_DESC.txt file - Descriptions of all columns
* model_text_files.tar.gz - Text files containing model coefficients and scores for model selection.
* results_all_model.tar.gz - Model coefficient and result files in numpy format used for plotting purposes. v4.reviewer contains models for analysis done after reviewer comments.
* README.txt file

## Dataset creation

Our experiments relied on data from multiple sources, including proprietary data from [Thomson Reuters' (now Clarivate Analytics) Web of Science collection of MEDLINE citations](https://clarivate.com/products/web-of-science/databases/). Authors interested in reproducing our experiments should request this data directly from Clarivate Analytics. However, we do provide a similar but open dataset based on citations from PubMed Central, which can be used to obtain results similar to those reported in our analysis. Furthermore, we have freely shared our datasets, which can be used along with the citation datasets from Clarivate Analytics to re-create the dataset used in our experiments. These datasets are listed below. If you use any of them, please cite both the dataset and the paper introducing it.

* MEDLINE 2015 baseline: https://www.nlm.nih.gov/bsd/licensee/2015_stats/baseline_doc.html
* Citation data from PubMed Central (original paper includes additional citations from Web of Science)
* Author-ity 2009 dataset:
  - Dataset citation: Torvik, Vetle I.; Smalheiser, Neil R. (2018): Author-ity 2009 - PubMed author name disambiguated dataset. University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-4222651_V1
  - Paper citation: Torvik, V. I., & Smalheiser, N. R. (2009). Author name disambiguation in MEDLINE. ACM Transactions on Knowledge Discovery from Data, 3(3), 1–29. https://doi.org/10.1145/1552303.1552304
  - Paper citation: Torvik, V. I., Weeber, M., Swanson, D. R., & Smalheiser, N. R. (2004). A probabilistic similarity metric for Medline records: A model for author name disambiguation. Journal of the American Society for Information Science and Technology, 56(2), 140–158. https://doi.org/10.1002/asi.20105
* Genni 2.0 + Ethnea for identifying author gender and ethnicity:
  - Dataset citation: Torvik, Vetle (2018): Genni + Ethnea for the Author-ity 2009 dataset. University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-9087546_V1
  - Paper citation: Smith, B. N., Singh, M., & Torvik, V. I. (2013). A search engine approach to estimating temporal changes in gender orientation of first names. In Proceedings of the 13th ACM/IEEE-CS joint conference on Digital libraries - JCDL ’13. ACM Press. https://doi.org/10.1145/2467696.2467720
  - Paper citation: Torvik VI, Agarwal S. Ethnea -- an instance-based ethnicity classifier based on geo-coded author names in a large-scale bibliographic database. International Symposium on Science of Science March 22-23, 2016 - Library of Congress, Washington DC, USA. http://hdl.handle.net/2142/88927
* MapAffil for identifying article country of affiliation:
  - Dataset citation: Torvik, Vetle I. (2018): MapAffil 2016 dataset -- PubMed author affiliations mapped to cities and their geocodes worldwide. University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-4354331_V1
  - Paper citation: Torvik VI. MapAffil: A Bibliographic Tool for Mapping Author Affiliation Strings to Cities and Their Geocodes Worldwide. D-Lib magazine : the magazine of the Digital Library Forum. 2015;21(11-12):10.1045/november2015-torvik
* IMPLICIT journal similarity:
  - Dataset citation: Torvik, Vetle (2018): Author-implicit journal, MeSH, title-word, and affiliation-word pairs based on Author-ity 2009. University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-4742014_V1
* Novelty dataset for identifying article-level novelty:
  - Dataset citation: Mishra, Shubhanshu; Torvik, Vetle I. (2018): Conceptual novelty scores for PubMed articles. University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-5060298_V1
  - Paper citation: Mishra S, Torvik VI. Quantifying Conceptual Novelty in the Biomedical Literature. D-Lib magazine : The Magazine of the Digital Library Forum. 2016;22(9-10):10.1045/september2016-mishra
  - Code: https://github.com/napsternxg/Novelty
* Expertise dataset for identifying author expertise on articles:
* Source code provided at: https://github.com/napsternxg/PubMed_SelfCitationAnalysis

Note: The dataset is based on a snapshot of PubMed (which includes Medline and PubMed-not-Medline records) taken in the first week of October, 2016.
See https://www.nlm.nih.gov/databases/download/pubmed_medline.html for information on obtaining PubMed/MEDLINE and for NLM's data Terms and Conditions.

Additional data-related updates can be found at the Torvik Research Group: http://abel.ischool.illinois.edu

## Acknowledgments

This work was made possible in part with funding to VIT from NIH grant P01AG039347 and NSF grant 1348742. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

## License

Self-citation analysis data based on PubMed Central subset (2002-2005) by Shubhanshu Mishra, Brent D. Fegley, Jana Diesner, and Vetle Torvik is licensed under a Creative Commons Attribution 4.0 International License.
Permissions beyond the scope of this license may be available at https://github.com/napsternxg/PubMed_SelfCitationAnalysis.
Subject	Social Sciences
Keywords	Self citation; PubMed Central; Data Analysis; Citation Data;
License	CC BY
Funder	U.S. National Institutes of Health (NIH)-Grant:P01AG039347
Funder	U.S. National Science Foundation (NSF)-Grant:1348742
Corresponding Creator	Vetle I. Torvik
Downloaded	3103 times
Related Materials (1) Article Mishra S, Fegley BD, Diesner J, Torvik VI (2018) Self-citation is the hallmark of productive authors, of any gender. PLoS ONE 13(9): e0195773. https://doi.org/10.1371/journal.pone.0195773

Versions

Version	DOI	Comment	Publication Date
1	10.13012/B2IDB-9665377_V1		2018-04-23

Files
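
The description lists three large tab-separated data files and a separate header file. The following is a minimal sketch of one way to stream such a file in Python, not the authors' own loading code. It assumes (this is not confirmed by the record) that the header file holds a single tab-separated line of column names and that the data files themselves contain no header row; check COLUMNS_DESC.txt and README.txt before relying on it. The function names `read_header` and `iter_rows` are hypothetical.

```python
import csv
from typing import Dict, Iterator, List


def read_header(header_path: str) -> List[str]:
    """Return the column names stored in the header file.

    Assumption: the header file is one tab-separated line of names.
    """
    with open(header_path, newline="", encoding="utf-8") as handle:
        first_line = handle.readline().rstrip("\r\n")
    return first_line.split("\t")


def iter_rows(data_path: str, columns: List[str]) -> Iterator[Dict[str, str]]:
    """Yield one dict per data row without loading the whole file.

    Streaming matters here because the data files are roughly 1 GB each.
    Assumption: data files have no header row of their own.
    """
    with open(data_path, newline="", encoding="utf-8") as handle:
        # QUOTE_NONE treats quote characters as ordinary text in each field.
        reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        for fields in reader:
            if not fields:
                continue  # skip blank lines
            if len(fields) != len(columns):
                raise ValueError(
                    f"line {reader.line_num}: expected {len(columns)} "
                    f"fields, found {len(fields)}"
                )
            yield dict(zip(columns, fields))
```

A typical call would pass `Training_data_2002_2005_pmc_pair_txt.header.txt` to `read_header` and then iterate `iter_rows("Training_data_2002_2005_pmc_pair_First.txt", columns)`. All values arrive as strings, so numeric columns need converting by the caller. The field-count check is a deliberate choice: it fails loudly if the header and data files turn out not to line up as assumed.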

Change Log

Contact the Research Data Service for help interpreting this log.

Dataset	update: {"all_globus"=>[nil, true]}	2026-01-16T15:37:49Z
Dataset	update: {"all_medusa"=>[nil, true]}	2026-01-16T15:35:58Z
RelatedMaterial	update: {"link"=>["", "https://doi.org/10.1371/journal.pone.0195773"], "uri"=>["", "10.1371/journal.pone.0195773"], "uri_type"=>["", "DOI"], "citation"=>["Mishra S, Fegley BD, Diesner J, Torvik VI (2018) Self-Citation is the Hallmark of Productive Authors, of Any Gender. PLOS ONE", "Mishra S, Fegley BD, Diesner J, Torvik VI (2018) Self-citation is the hallmark of productive authors, of any gender. PLoS ONE 13(9): e0195773. https://doi.org/10.1371/journal.pone.0195773\r\n"]}	2018-09-29T15:39:29Z
Dataset	update: {"description"=>["Self-citation analysis data based on PubMed Central subset (2002-2005)\r\n----------------------------------------------------------------------\r\nCreated by Shubhanshu Mishra, Brent D. Fegley, Jana Diesner, and Vetle Torvik on April 5th, 2018\r\n\r\n## Introduction\r\n\r\nThis is a dataset created as part of the publication titled: Mishra S, Fegley BD, Diesner J, Torvik VI (2018) Self-Citation is the Hallmark of Productive Authors, of Any Gender. PLOS ONE.\r\nIt contains files for running the self citation analysis on articles published in PubMed Central between 2002 and 2005, collected in 2015. \r\nThe dataset is distributed in the form of the following tab separated text files: \r\n\r\n* Training_data_2002_2005_pmc_pair_First.txt (1.2G) - Data for first authors\r\n* Training_data_2002_2005_pmc_pair_Last.txt (1.2G) - Data for last authors\r\n* Training_data_2002_2005_pmc_pair_Middle_2nd.txt (964M) - Data for middle 2nd authors\r\n* Training_data_2002_2005_pmc_pair_txt.header.txt - Header for the data\r\n* COLUMNS_DESC.txt file - Descriptions of all columns\r\n* model_text_files.tar.gz - Text files containing model coefficients and scores for model selection. \r\n* results_all_model.tar.gz - Model coefficient and result files in numpy format used for plotting purposes. v4.reviewer contains models for analysis done after reviewer comments. \r\n* README.txt file\r\n\r\n## Dataset creation\r\n\r\nOur experiments relied on data from multiple sources including properitery data from [Thompson Rueter's (now Clarivate Analytics) Web of Science collection of MEDLINE citations](<a href=\"https://clarivate.com/products/web-of-science/databases/\">https://clarivate.com/products/web-of-science/databases/</a>). Author's interested in reproducing our experiments should personally request from Clarivate Analytics for this data. 
However, we do make a similar but open dataset based on citations from PubMed Central which can be utilized to get similar results to those reported in our analysis. Furthermore, we have also freely shared our datasets which can be used along with the citation datasets from Clarivate Analytics, to re-create the datased used in our experiments. These datasets are listed below. If you wish to use any of those datasets please make sure you cite both the dataset as well as the paper introducing the dataset. \r\n\r\n* MEDLINE 2015 baseline: <a href=\"https://www.nlm.nih.gov/bsd/licensee/2015_stats/baseline_doc.html\">https://www.nlm.nih.gov/bsd/licensee/2015_stats/baseline_doc.html</a>\r\n* Citation data from PubMed Central (original paper includes additional citations from Web of Science)\r\n* <b>Author-ity 2009 dataset</b>: \r\n - Dataset citation: <a href=\"https://doi.org/10.13012/B2IDB-4222651_V1\">Torvik, Vetle I.; Smalheiser, Neil R. (2018): Author-ity 2009 - PubMed author name disambiguated dataset. University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-4222651_V1</a>\r\n - Paper citation: <a href=\"https://doi.org/10.1145/1552303.1552304\">Torvik, V. I., & Smalheiser, N. R. (2009). Author name disambiguation in MEDLINE. ACM Transactions on Knowledge Discovery from Data, 3(3), 1–29. https://doi.org/10.1145/1552303.1552304</a>\r\n - Paper citation: <a href=\"https://doi.org/10.1002/asi.20105\">Torvik, V. I., Weeber, M., Swanson, D. R., & Smalheiser, N. R. (2004). A probabilistic similarity metric for Medline records: A model for author name disambiguation. Journal of the American Society for Information Science and Technology, 56(2), 140–158. https://doi.org/10.1002/asi.20105</a>\r\n* <b>Genni 2.0 + Ethnea for identifying author gender and ethnicity</b>:\r\n - Dataset citation: <a href=\"https://doi.org/10.13012/B2IDB-9087546_V1\">Torvik, Vetle (2018): Genni + Ethnea for the Author-ity 2009 dataset. University of Illinois at Urbana-Champaign. 
https://doi.org/10.13012/B2IDB-9087546_V1</a>\r\n - Paper citation: <a href=\"https://doi.org/10.1145/2467696.2467720\">Smith, B. N., Singh, M., & Torvik, V. I. (2013). A search engine approach to estimating temporal changes in gender orientation of first names. In Proceedings of the 13th ACM/IEEE-CS joint conference on Digital libraries - JCDL ’13. ACM Press. https://doi.org/10.1145/2467696.2467720</a>\r\n - Paper citation: <a href=\"http://hdl.handle.net/2142/88927\">Torvik VI, Agarwal S. Ethnea -- an instance-based ethnicity classifier based on geo-coded author names in a large-scale bibliographic database. International Symposium on Science of Science March 22-23, 2016 - Library of Congress, Washington DC, USA. http://hdl.handle.net/2142/88927</a>\r\n* <b>MapAffil for identifying article country of affiliation</b>:\r\n - Dataset citation: <a href=\"https://doi.org/10.13012/B2IDB-4354331_V1\">Torvik, Vetle I. (2018): MapAffil 2016 dataset -- PubMed author affiliations mapped to cities and their geocodes worldwide. University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-4354331_V1</a>\r\n - Paper citation: <a href=\"http://doi.org/10.1045/november2015-torvik\">Torvik VI. MapAffil: A Bibliographic Tool for Mapping Author Affiliation Strings to Cities and Their Geocodes Worldwide. D-Lib magazine : the magazine of the Digital Library Forum. 2015;21(11-12):10.1045/november2015-torvik</a>\r\n* <b>IMPLICIT journal similarity</b>:\r\n - Dataset citation: <a href=\"https://doi.org/10.13012/B2IDB-4742014_V1\">Torvik, Vetle (2018): Author-implicit journal, MeSH, title-word, and affiliation-word pairs based on Author-ity 2009. University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-4742014_V1</a>\r\n* <b>Novelty dataset for identify article level novelty</b>:\r\n - Dataset citation: <a href=\"https://doi.org/10.13012/B2IDB-5060298_V1\">Mishra, Shubhanshu; Torvik, Vetle I. (2018): Conceptual novelty scores for PubMed articles. 
University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-5060298_V1</a>\r\n - Paper citation: <a href=\"https://doi.org/10.1045/september2016-mishra\">\tMishra S, Torvik VI. Quantifying Conceptual Novelty in the Biomedical Literature. D-Lib magazine : The Magazine of the Digital Library Forum. 2016;22(9-10):10.1045/september2016-mishra</a>\r\n - Code: <a href=\"https://github.com/napsternxg/Novelty\">https://github.com/napsternxg/Novelty</a>\r\n* Expertise dataset for identifying author expertise on articles: \r\n* Source code provided at: <a href=\"https://github.com/napsternxg/PubMed_SelfCitationAnalysis\">https://github.com/napsternxg/PubMed_SelfCitationAnalysis</a>\r\n\r\nNote: The dataset is based on a snapshot of PubMed (which includes Medline and PubMed-not-Medline records) taken in the first week of October, 2016.\r\nCheck <a href=\"https://www.nlm.nih.gov/databases/download/pubmed_medline.html\">here</a> for information to get PubMed/MEDLINE, and NLMs data Terms and Conditions\r\n\r\nAdditional data related updates can be found at <a href=\"http://abel.ischool.illinois.edu\">Torvik Research Group</a>\r\n\r\n## Acknowledgments\r\n\r\nThis work was made possible in part with funding to VIT from <a href=\"https://projectreporter.nih.gov/project_info_description.cfm?aid=8475017&icde=18058490\">NIH grant P01AG039347</a> and <a href=\"http://www.nsf.gov/awardsearch/showAward?AWD_ID=1348742\">NSF grant 1348742</a>. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.\r\n\r\n## License\r\n\r\nSelf-citation analysis data based on PubMed Central subset (2002-2005) by Shubhanshu Mishra, Brent D. 
Fegley, Jana Diesner, and Vetle Torvik is licensed under a Creative Commons Attribution 4.0 International License.\r\nPermissions beyond the scope of this license may be available at <a href=\"https://github.com/napsternxg/PubMed_SelfCitationAnalysis\">https://github.com/napsternxg/PubMed_SelfCitationAnalysis</a>.\r\n", "Self-citation analysis data based on PubMed Central subset (2002-2005)\r\n----------------------------------------------------------------------\r\nCreated by Shubhanshu Mishra, Brent D. Fegley, Jana Diesner, and Vetle Torvik on April 5th, 2018\r\n\r\n## Introduction\r\n\r\nThis is a dataset created as part of the publication titled: Mishra S, Fegley BD, Diesner J, Torvik VI (2018) Self-Citation is the Hallmark of Productive Authors, of Any Gender. PLOS ONE.\r\nIt contains files for running the self citation analysis on articles published in PubMed Central between 2002 and 2005, collected in 2015. \r\nThe dataset is distributed in the form of the following tab separated text files: \r\n\r\n* Training_data_2002_2005_pmc_pair_First.txt (1.2G) - Data for first authors\r\n* Training_data_2002_2005_pmc_pair_Last.txt (1.2G) - Data for last authors\r\n* Training_data_2002_2005_pmc_pair_Middle_2nd.txt (964M) - Data for middle 2nd authors\r\n* Training_data_2002_2005_pmc_pair_txt.header.txt - Header for the data\r\n* COLUMNS_DESC.txt file - Descriptions of all columns\r\n* model_text_files.tar.gz - Text files containing model coefficients and scores for model selection. \r\n* results_all_model.tar.gz - Model coefficient and result files in numpy format used for plotting purposes. v4.reviewer contains models for analysis done after reviewer comments. 
\r\n* README.txt file\r\n\r\n## Dataset creation\r\n\r\nOur experiments relied on data from multiple sources including properitery data from [Thompson Rueter's (now Clarivate Analytics) Web of Science collection of MEDLINE citations](<a href=\"https://clarivate.com/products/web-of-science/databases/\">https://clarivate.com/products/web-of-science/databases/</a>). Author's interested in reproducing our experiments should personally request from Clarivate Analytics for this data. However, we do make a similar but open dataset based on citations from PubMed Central which can be utilized to get similar results to those reported in our analysis. Furthermore, we have also freely shared our datasets which can be used along with the citation datasets from Clarivate Analytics, to re-create the datased used in our experiments. These datasets are listed below. If you wish to use any of those datasets please make sure you cite both the dataset as well as the paper introducing the dataset. \r\n\r\n* MEDLINE 2015 baseline: <a href=\"https://www.nlm.nih.gov/bsd/licensee/2015_stats/baseline_doc.html\">https://www.nlm.nih.gov/bsd/licensee/2015_stats/baseline_doc.html</a>\r\n\r\n* Citation data from PubMed Central (original paper includes additional citations from Web of Science)\r\n\r\n* Author-ity 2009 dataset: \r\n - Dataset citation: <a href=\"https://doi.org/10.13012/B2IDB-4222651_V1\">Torvik, Vetle I.; Smalheiser, Neil R. (2018): Author-ity 2009 - PubMed author name disambiguated dataset. University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-4222651_V1</a>\r\n - Paper citation: <a href=\"https://doi.org/10.1145/1552303.1552304\">Torvik, V. I., & Smalheiser, N. R. (2009). Author name disambiguation in MEDLINE. ACM Transactions on Knowledge Discovery from Data, 3(3), 1–29. https://doi.org/10.1145/1552303.1552304</a>\r\n - Paper citation: <a href=\"https://doi.org/10.1002/asi.20105\">Torvik, V. I., Weeber, M., Swanson, D. R., & Smalheiser, N. R. (2004). 
A probabilistic similarity metric for Medline records: A model for author name disambiguation. Journal of the American Society for Information Science and Technology, 56(2), 140–158. https://doi.org/10.1002/asi.20105</a>\r\n\r\n* Genni 2.0 + Ethnea for identifying author gender and ethnicity:\r\n - Dataset citation: <a href=\"https://doi.org/10.13012/B2IDB-9087546_V1\">Torvik, Vetle (2018): Genni + Ethnea for the Author-ity 2009 dataset. University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-9087546_V1</a>\r\n - Paper citation: <a href=\"https://doi.org/10.1145/2467696.2467720\">Smith, B. N., Singh, M., & Torvik, V. I. (2013). A search engine approach to estimating temporal changes in gender orientation of first names. In Proceedings of the 13th ACM/IEEE-CS joint conference on Digital libraries - JCDL ’13. ACM Press. https://doi.org/10.1145/2467696.2467720</a>\r\n - Paper citation: <a href=\"http://hdl.handle.net/2142/88927\">Torvik VI, Agarwal S. Ethnea -- an instance-based ethnicity classifier based on geo-coded author names in a large-scale bibliographic database. International Symposium on Science of Science March 22-23, 2016 - Library of Congress, Washington DC, USA. http://hdl.handle.net/2142/88927</a>\r\n\r\n* MapAffil for identifying article country of affiliation:\r\n - Dataset citation: <a href=\"https://doi.org/10.13012/B2IDB-4354331_V1\">Torvik, Vetle I. (2018): MapAffil 2016 dataset -- PubMed author affiliations mapped to cities and their geocodes worldwide. University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-4354331_V1</a>\r\n - Paper citation: <a href=\"http://doi.org/10.1045/november2015-torvik\">Torvik VI. MapAffil: A Bibliographic Tool for Mapping Author Affiliation Strings to Cities and Their Geocodes Worldwide. D-Lib magazine : the magazine of the Digital Library Forum. 
2015;21(11-12):10.1045/november2015-torvik</a>\r\n\r\n* IMPLICIT journal similarity:\r\n - Dataset citation: <a href=\"https://doi.org/10.13012/B2IDB-4742014_V1\">Torvik, Vetle (2018): Author-implicit journal, MeSH, title-word, and affiliation-word pairs based on Author-ity 2009. University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-4742014_V1</a>\r\n\r\n* Novelty dataset for identify article level novelty:\r\n - Dataset citation: <a href=\"https://doi.org/10.13012/B2IDB-5060298_V1\">Mishra, Shubhanshu; Torvik, Vetle I. (2018): Conceptual novelty scores for PubMed articles. University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-5060298_V1</a>\r\n - Paper citation: <a href=\"https://doi.org/10.1045/september2016-mishra\">\tMishra S, Torvik VI. Quantifying Conceptual Novelty in the Biomedical Literature. D-Lib magazine : The Magazine of the Digital Library Forum. 2016;22(9-10):10.1045/september2016-mishra</a>\r\n - Code: <a href=\"https://github.com/napsternxg/Novelty\">https://github.com/napsternxg/Novelty</a>\r\n\r\n* Expertise dataset for identifying author expertise on articles: \r\n\r\n* Source code provided at: <a href=\"https://github.com/napsternxg/PubMed_SelfCitationAnalysis\">https://github.com/napsternxg/PubMed_SelfCitationAnalysis</a>\r\n\r\nNote: The dataset is based on a snapshot of PubMed (which includes Medline and PubMed-not-Medline records) taken in the first week of October, 2016.\r\nCheck <a href=\"https://www.nlm.nih.gov/databases/download/pubmed_medline.html\">here</a> for information to get PubMed/MEDLINE, and NLMs data Terms and Conditions\r\n\r\nAdditional data related updates can be found at <a href=\"http://abel.ischool.illinois.edu\">Torvik Research Group</a>\r\n\r\n## Acknowledgments\r\n\r\nThis work was made possible in part with funding to VIT from <a href=\"https://projectreporter.nih.gov/project_info_description.cfm?aid=8475017&icde=18058490\">NIH grant P01AG039347</a> and <a 
href=\"http://www.nsf.gov/awardsearch/showAward?AWD_ID=1348742\">NSF grant 1348742</a>. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.\r\n\r\n## License\r\n\r\nSelf-citation analysis data based on PubMed Central subset (2002-2005) by Shubhanshu Mishra, Brent D. Fegley, Jana Diesner, and Vetle Torvik is licensed under a Creative Commons Attribution 4.0 International License.\r\nPermissions beyond the scope of this license may be available at <a href=\"https://github.com/napsternxg/PubMed_SelfCitationAnalysis\">https://github.com/napsternxg/PubMed_SelfCitationAnalysis</a>.\r\n"]}	2018-04-27T18:04:49Z
Dataset	update: {"description"=>["Self-citation analysis data based on PubMed Central subset (2002-2005)\r\n----------------------------------------------------------------------\r\nCreated by Shubhanshu Mishra, Brent D. Fegley, Jana Diesner, and Vetle Torvik on April 5th, 2018\r\n\r\n## Introduction\r\n\r\nThis is a dataset created as part of the publication titled: Mishra S, Fegley BD, Diesner J, Torvik VI (2018) Self-Citation is the Hallmark of Productive Authors, of Any Gender. PLOS ONE.\r\nIt contains files for running the self citation analysis on articles published in PubMed Central between 2002 and 2005, collected in 2015. \r\nThe dataset is distributed in the form of the following tab separated text files: \r\n\r\n* Training_data_2002_2005_pmc_pair_First.txt (1.2G) - Data for first authors\r\n* Training_data_2002_2005_pmc_pair_Last.txt (1.2G) - Data for last authors\r\n* Training_data_2002_2005_pmc_pair_Middle_2nd.txt (964M) - Data for middle 2nd authors\r\n* Training_data_2002_2005_pmc_pair_txt.header.txt - Header for the data\r\n* COLUMNS_DESC.txt file - Descriptions of all columns\r\n* model_text_files.tar.gz - Text files containing model coefficients and scores for model selection. \r\n* results_all_model.tar.gz - Model coefficient and result files in numpy format used for plotting purposes. v4.reviewer contains models for analysis done after reviewer comments. \r\n* README.txt file\r\n\r\n## Dataset creation\r\n\r\nOur experiments relied on data from multiple sources including properitery data from [Thompson Rueter's (now Clarivate Analytics) Web of Science collection of MEDLINE citations](<a href=\"https://clarivate.com/products/web-of-science/databases/\">https://clarivate.com/products/web-of-science/databases/</a>). Author's interested in reproducing our experiments should personally request from Clarivate Analytics for this data. 
However, we do make a similar but open dataset based on citations from PubMed Central which can be utilized to get similar results to those reported in our analysis. Furthermore, we have also freely shared our datasets which can be used along with the citation datasets from Clarivate Analytics, to re-create the datased used in our experiments. These datasets are listed below. If you wish to use any of those datasets please make sure you cite both the dataset as well as the paper introducing the dataset. \r\n\r\n* MEDLINE 2015 baseline: <a href=\"https://www.nlm.nih.gov/bsd/licensee/2015_stats/baseline_doc.html\">https://www.nlm.nih.gov/bsd/licensee/2015_stats/baseline_doc.html</a>\r\n* Citation data from PubMed Central (original paper includes additional citations from Web of Science)\r\n* Author-ity 2009 dataset: \r\n - Dataset citation: <a href=\"https://doi.org/10.13012/B2IDB-4222651_V1\">Torvik, Vetle I.; Smalheiser, Neil R. (2018): Author-ity 2009 - PubMed author name disambiguated dataset. University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-4222651_V1</a>\r\n - Paper citation: <a href=\"https://doi.org/10.1145/1552303.1552304\">Torvik, V. I., & Smalheiser, N. R. (2009). Author name disambiguation in MEDLINE. ACM Transactions on Knowledge Discovery from Data, 3(3), 1–29. https://doi.org/10.1145/1552303.1552304</a>\r\n - Paper citation: <a href=\"https://doi.org/10.1002/asi.20105\">Torvik, V. I., Weeber, M., Swanson, D. R., & Smalheiser, N. R. (2004). A probabilistic similarity metric for Medline records: A model for author name disambiguation. Journal of the American Society for Information Science and Technology, 56(2), 140–158. https://doi.org/10.1002/asi.20105</a>\r\n* Genni 2.0 + Ethnea for identifying author gender and ethnicity:\r\n - Dataset citation: <a href=\"https://doi.org/10.13012/B2IDB-9087546_V1\">Torvik, Vetle (2018): Genni + Ethnea for the Author-ity 2009 dataset. University of Illinois at Urbana-Champaign. 
https://doi.org/10.13012/B2IDB-9087546_V1</a>\r\n - Paper citation: <a href=\"https://doi.org/10.1145/2467696.2467720\">Smith, B. N., Singh, M., & Torvik, V. I. (2013). A search engine approach to estimating temporal changes in gender orientation of first names. In Proceedings of the 13th ACM/IEEE-CS joint conference on Digital libraries - JCDL ’13. ACM Press. https://doi.org/10.1145/2467696.2467720</a>\r\n - Paper citation: <a href=\"http://hdl.handle.net/2142/88927\">Torvik VI, Agarwal S. Ethnea -- an instance-based ethnicity classifier based on geo-coded author names in a large-scale bibliographic database. International Symposium on Science of Science March 22-23, 2016 - Library of Congress, Washington DC, USA. http://hdl.handle.net/2142/88927</a>\r\n* MapAffil for identifying article country of affiliation:\r\n - Dataset citation: <a href=\"https://doi.org/10.13012/B2IDB-4354331_V1\">Torvik, Vetle I. (2018): MapAffil 2016 dataset -- PubMed author affiliations mapped to cities and their geocodes worldwide. University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-4354331_V1</a>\r\n - Paper citation: <a href=\"http://doi.org/10.1045/november2015-torvik\">Torvik VI. MapAffil: A Bibliographic Tool for Mapping Author Affiliation Strings to Cities and Their Geocodes Worldwide. D-Lib magazine : the magazine of the Digital Library Forum. 2015;21(11-12):10.1045/november2015-torvik</a>\r\n* IMPLICIT journal similarity:\r\n - Dataset citation: <a href=\"https://doi.org/10.13012/B2IDB-4742014_V1\">Torvik, Vetle (2018): Author-implicit journal, MeSH, title-word, and affiliation-word pairs based on Author-ity 2009. University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-4742014_V1</a>\r\n* Novelty dataset for identify article level novelty:\r\n - Dataset citation: <a href=\"https://doi.org/10.13012/B2IDB-5060298_V1\">Mishra, Shubhanshu; Torvik, Vetle I. (2018): Conceptual novelty scores for PubMed articles. 
University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-5060298_V1</a>\r\n - Paper citation: <a href=\"https://doi.org/10.1045/september2016-mishra\">\tMishra S, Torvik VI. Quantifying Conceptual Novelty in the Biomedical Literature. D-Lib magazine : The Magazine of the Digital Library Forum. 2016;22(9-10):10.1045/september2016-mishra</a>\r\n - Code: <a href=\"https://github.com/napsternxg/Novelty\">https://github.com/napsternxg/Novelty</a>\r\n* Expertise dataset for identifying author expertise on articles: \r\n* Source code provided at: <a href=\"https://github.com/napsternxg/PubMed_SelfCitationAnalysis\">https://github.com/napsternxg/PubMed_SelfCitationAnalysis</a>\r\n\r\nNote: The dataset is based on a snapshot of PubMed (which includes Medline and PubMed-not-Medline records) taken in the first week of October, 2016.\r\nCheck <a href=\"https://www.nlm.nih.gov/databases/download/pubmed_medline.html\">here</a> for information to get PubMed/MEDLINE, and NLMs data Terms and Conditions\r\n\r\nAdditional data related updates can be found at <a href=\"http://abel.ischool.illinois.edu\">Torvik Research Group</a>\r\n\r\n## Acknowledgments\r\n\r\nThis work was made possible in part with funding to VIT from <a href=\"https://projectreporter.nih.gov/project_info_description.cfm?aid=8475017&icde=18058490\">NIH grant P01AG039347</a> and <a href=\"http://www.nsf.gov/awardsearch/showAward?AWD_ID=1348742\">NSF grant 1348742</a>. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.\r\n\r\n## License\r\n\r\nSelf-citation analysis data based on PubMed Central subset (2002-2005) by Shubhanshu Mishra, Brent D. 
Fegley, Jana Diesner, and Vetle Torvik is licensed under a Creative Commons Attribution 4.0 International License.\r\nPermissions beyond the scope of this license may be available at <a href=\"https://github.com/napsternxg/PubMed_SelfCitationAnalysis\">https://github.com/napsternxg/PubMed_SelfCitationAnalysis</a>.\r\n", "[revised description: same text, with the dataset list labels set in bold]"]}	2018-04-27T18:03:04Z
RelatedMaterial	create: {"material_type"=>"Article", "availability"=>nil, "link"=>"", "uri"=>"", "uri_type"=>"", "citation"=>"Mishra S, Fegley BD, Diesner J, Torvik VI (2018) Self-Citation is the Hallmark of Productive Authors, of Any Gender. PLOS ONE", "dataset_id"=>525, "selected_type"=>"Article", "datacite_list"=>"IsSupplementTo"}	2018-04-27T18:00:53Z
Dataset	update: {"description"=>["[previous description, with plain-text URLs]", "[same description, with URLs converted to HTML links and the Novelty paper citation restyled]"], "version_comment"=>[nil, ""]}	2018-04-27T18:00:53Z
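
The dataset description lists three tab-separated data files and a separate header file. A minimal Python sketch of pairing them is given here. It assumes, without confirmation from the description, that the header file holds a single tab-separated line of column names and that the data files carry no header row of their own; `read_header` and `iter_rows` are hypothetical helper names, not part of the published source code. Rows are streamed one at a time because the data files are roughly 1 GB each.

```python
import csv
from itertools import islice
from pathlib import Path


def read_header(header_path):
    """Return the column names.

    Assumption: the header file is one tab-separated line of names.
    """
    with open(header_path, newline="", encoding="utf-8") as fh:
        reader = csv.reader(fh, delimiter="\t", quoting=csv.QUOTE_NONE)
        return next(reader)


def iter_rows(data_path, columns):
    """Yield each data row as a dict keyed by column name.

    Assumption: the data file has no header row, so every line is data.
    Values are left as strings; COLUMNS_DESC.txt would guide any casting.
    """
    with open(data_path, newline="", encoding="utf-8") as fh:
        reader = csv.reader(fh, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            yield dict(zip(columns, row))


if __name__ == "__main__":
    data_dir = Path(".")  # assumed location of the downloaded files
    header_file = data_dir / "Training_data_2002_2005_pmc_pair_txt.header.txt"
    data_file = data_dir / "Training_data_2002_2005_pmc_pair_First.txt"
    if header_file.exists() and data_file.exists():
        columns = read_header(header_file)
        # Peek at the first few records without reading the whole file.
        for record in islice(iter_rows(data_file, columns), 5):
            print(record)
```

The same two helpers should work for the last-author and middle-author files if they share the header, which the single header file suggests but the description does not state outright.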
