published: 2019-05-08
 
The data file contains a list of articles given a low score by the RCT Tagger, and an error analysis of those articles, which were used in a project associated with the manuscript "An in-situ evaluation of the RCT Tagger using 7413 articles included in 570 Cochrane reviews with RCT-only inclusion criteria".
keywords: Cochrane reviews; automation; randomized controlled trial; RCT; systematic reviews
published: 2019-05-08
 
The data file contains a list of included studies with their detailed metadata, taken from Cochrane reviews which were used in a project associated with the manuscript "An in-situ evaluation of the RCT Tagger using 7413 articles included in 570 Cochrane reviews with RCT-only inclusion criteria".
keywords: Cochrane reviews; automation; randomized controlled trial; RCT; systematic review
published: 2019-05-08
 
The data file contains detailed information of the Cochrane reviews that were used in a project associated with the manuscript (working title) "An in-situ evaluation of the RCT Tagger using 7413 articles included in 570 Cochrane reviews with RCT-only inclusion criteria".
keywords: Cochrane reviews; systematic reviews; randomized controlled trial; RCT; automation
published: 2019-04-05
 
File Name: Inclusion_Criteria_Annotation.csv Data Preparation: Xiaoru Dong Date of Preparation: 2019-04-04 Data Contributions: Jingyi Xie, Xiaoru Dong, Linh Hoang Data Source: Cochrane systematic reviews published up to January 3, 2018 by 52 different Cochrane groups in 8 Cochrane group networks. Associated Manuscript authors: Xiaoru Dong, Jingyi Xie, Linh Hoang, and Jodi Schneider. Associated Manuscript, Working title: Machine classification of inclusion criteria from Cochrane systematic reviews. Description: The file contains the inclusion criteria of Cochrane Systematic Reviews and the manual annotation results. 5420 inclusion criteria were annotated, out of 7158 inclusion criteria available. Annotations are either "Only RCTs" or "Others". There are 2 columns in the file: - "Inclusion Criteria": content of the inclusion criteria of Cochrane Systematic Reviews. - "Only RCTs": manual annotation results, in which "x" means the inclusion criterion is classified as "Only RCTs" and blank means it is classified as "Others". Notes: 1. "RCT" stands for Randomized Controlled Trial, defined as "a work that reports on a clinical trial that involves at least one test treatment and one control treatment, concurrent enrollment and follow-up of the test- and control-treated groups, and in which the treatments to be administered are selected by a random process, such as the use of a random-numbers table." [Randomized Controlled Trial publication type definition from https://www.nlm.nih.gov/mesh/pubtypes.html]. 2. To reproduce the data relevant to this file, please get the code of the project published on GitHub at https://github.com/XiaoruDong/InclusionCriteria and run it following the instructions provided. 3. This datafile (V2) is an updated version of the datafile published at https://doi.org/10.13012/B2IDB-5958960_V1, with some minor spelling mistakes in the data fixed.
keywords: Inclusion criteria; Randomized controlled trials; Machine learning; Systematic reviews
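The two-column annotation layout described above can be tallied with a short script. A minimal sketch, assuming the CSV carries a header row with the column names quoted in the description; the function name is illustrative, not part of the dataset:

```python
import csv

def count_annotations(path):
    """Tally "Only RCTs" vs "Others" annotations in the described CSV.

    Assumption: header columns are named "Inclusion Criteria" and
    "Only RCTs". An "x" marks "Only RCTs"; a blank cell means "Others".
    """
    counts = {"Only RCTs": 0, "Others": 0}
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            if row["Only RCTs"].strip().lower() == "x":
                counts["Only RCTs"] += 1
            else:
                counts["Others"] += 1
    return counts
```

Note that if the file lists criteria beyond the 5420 annotated ones, any unannotated blanks would be counted as "Others" by this sketch, so interpret the totals accordingly.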
published: 2018-09-06
 
The XSEDE program manages the database of allocation awards for the portfolio of advanced research computing resources funded by the National Science Foundation (NSF). The database holds data for allocation awards dating from the start of the TeraGrid program in 2004 to the present, with awards continuing through the end of the second XSEDE award in 2021. The project data include lead researcher and affiliation, title and abstract, field of science, and the start and end dates. Along with the project information, the data set includes resource allocation and usage data for each award associated with the project. The data show the transition of resources over a fifteen-year span along with the evolution of researchers, fields of science, and institutional representation.
keywords: allocations; cyberinfrastructure; XSEDE
published: 2019-02-19
 
The organizations that contribute to the longevity of 67 long-lived molecular biology databases published in Nucleic Acids Research (NAR) between 1991-2016 were identified to address two research questions: 1) which organizations fund these databases? and 2) which organizations maintain these databases? Funders were determined by examining funding acknowledgements in each database's most recent NAR Database Issue update article published (prior to 2017), and the organizations operating the databases were determined through review of database websites.
keywords: databases; research infrastructure; sustainability; data sharing; molecular biology; bioinformatics; bibliometrics
published: 2019-01-07
 
Vendor transcription of the Catalogue of Copyright Entries, Part 1, Group 1, Books: New Series, Volume 29 for the Year 1932. This file contains all of the entries from the indicated volume.
keywords: copyright; Catalogue of Copyright Entries; Copyright Office
published: 2018-12-20
 
File Name: AllWords.csv Data Preparation: Xiaoru Dong, Linh Hoang Date of Preparation: 2018-12-12 Data Contributions: Jingyi Xie, Xiaoru Dong, Linh Hoang Data Source: Cochrane systematic reviews published up to January 3, 2018 by 52 different Cochrane groups in 8 Cochrane group networks. Associated Manuscript authors: Xiaoru Dong, Jingyi Xie, Linh Hoang, and Jodi Schneider. Associated Manuscript, Working title: Machine classification of inclusion criteria from Cochrane systematic reviews. Description: The file contains the list of all words (all features) from the bag-of-words feature extraction. Notes: To reproduce the data in this file, please get the code of the project published on GitHub at: https://github.com/XiaoruDong/InclusionCriteria and run the code following the instructions provided.
keywords: Inclusion criteria; Randomized controlled trials; Machine learning; Systematic reviews
published: 2018-12-20
 
File Name: Error_Analysis.xslx Data Preparation: Xiaoru Dong Date of Preparation: 2018-12-12 Data Contributions: Xiaoru Dong, Linh Hoang, Jingyi Xie, Jodi Schneider Data Source: The classification prediction results on the testing data set Associated Manuscript authors: Xiaoru Dong, Jingyi Xie, Linh Hoang, and Jodi Schneider Associated Manuscript, Working title: Machine classification of inclusion criteria from Cochrane systematic reviews Description: The file contains lists of the wrong and correct predictions of inclusion criteria of Cochrane Systematic Reviews from the testing data set, and the length (number of words) of each inclusion criterion. Notes: To reproduce the data relevant to this file, please get the code of the project published on GitHub at: https://github.com/XiaoruDong/InclusionCriteria and run the code following the instructions provided.
keywords: Inclusion criteria; Randomized controlled trials; Machine learning; Systematic reviews
published: 2018-12-14
 
Spreadsheet with data about whether or not the indicated institutional repository website provides metadata documentation. See readme file for more information.
keywords: institutional repositories; metadata; best practices; metadata documentation
published: 2018-12-20
 
File Name: WordsSelectedByManualAnalysis.csv Data Preparation: Xiaoru Dong, Linh Hoang Date of Preparation: 2018-12-14 Data Contributions: Jingyi Xie, Xiaoru Dong, Linh Hoang Data Source: Cochrane systematic reviews published up to January 3, 2018 by 52 different Cochrane groups in 8 Cochrane group networks. Associated Manuscript authors: Xiaoru Dong, Jingyi Xie, Linh Hoang, and Jodi Schneider. Associated Manuscript, Working title: Machine classification of inclusion criteria from Cochrane systematic reviews. Description: This file contains the list of 407 informative words reselected from the 1655 words by manual analysis. In particular, starting from the 1655 words obtained from information gain feature selection, we manually read and eliminated the domain-specific words; the remaining words were selected as the "Manual Analysis Words". Notes: Although the list of words in this file was selected manually, to reproduce the data relevant to this file, please get the code of the project published on GitHub at: https://github.com/XiaoruDong/InclusionCriteria and run the code following the instructions provided.
keywords: Inclusion criteria; Randomized controlled trials; Machine learning; Systematic reviews
published: 2018-12-20
 
File Name: WordsSelectedByInformationGain.csv Data Preparation: Xiaoru Dong, Linh Hoang Date of Preparation: 2018-12-12 Data Contributions: Jingyi Xie, Xiaoru Dong, Linh Hoang Data Source: Cochrane systematic reviews published up to January 3, 2018 by 52 different Cochrane groups in 8 Cochrane group networks. Associated Manuscript authors: Xiaoru Dong, Jingyi Xie, Linh Hoang, and Jodi Schneider. Associated Manuscript, Working title: Machine classification of inclusion criteria from Cochrane systematic reviews. Description: The file contains a list of 1655 informative words selected by applying an information gain feature selection strategy. Information gain is one of the methods commonly used for feature selection; it tells us how many bits of information the presence of a word provides for predicting the classes, and can be computed with a standard formula [Jurafsky D, Martin JH. Speech and language processing. London: Pearson; 2014 Dec 30]. We ran information gain feature selection in Weka, a machine learning tool. Notes: To reproduce the data in this file, please get the code of the project published on GitHub at: https://github.com/XiaoruDong/InclusionCriteria and run the code following the instructions provided.
keywords: Inclusion criteria; Randomized controlled trials; Machine learning; Systematic reviews
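Information gain, as used above, can be illustrated with a small self-contained example. This is a generic sketch of the technique (the project itself ran feature selection in Weka), with hypothetical function names:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(docs, labels, word):
    """Bits of information that `word`'s presence gives about the class.

    docs: list of token sets; labels: parallel list of class labels.
    IG = H(labels) - H(labels | word present/absent).
    """
    with_w = [y for d, y in zip(docs, labels) if word in d]
    without = [y for d, y in zip(docs, labels) if word not in d]
    n = len(labels)
    conditional = (len(with_w) / n) * entropy(with_w) \
        + (len(without) / n) * entropy(without)
    return entropy(labels) - conditional
```

For a binary class problem, the gain ranges from 0 (the word tells us nothing) up to the entropy of the class distribution (the word separates the classes perfectly).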
published: 2016-08-02
 
These data are the result of a multi-step process aimed at enriching BIBFRAME RDF with linked data. The process takes in an initial MARC XML file, transforms it to BIBFRAME RDF/XML, and then four separate Python files corresponding to the BIBFRAME 1.0 model (Work, Instance, Annotation, and Authority) are run over the BIBFRAME RDF/XML output. The input and outputs of each step are included in this data set. Input file types include the CSV, MARC XML, and Master RDF/XML files. The CSV files contain bibliographic identifiers for e-books. From the CSVs, a set of MARC XML files is generated. The MARC XML files are used to produce the Master RDF file set. The major outputs of the enrichment code are BIBFRAME linked data as Annotation RDF, Instance RDF, Work RDF, and Authority RDF.
keywords: BIBFRAME; Schema.org; linked data; discovery; MARC; MARCXML; RDF
published: 2016-08-18
 
Copyright Review Management System renewals by year, data from Table 2 of the article "How Large is the ‘Public Domain’? A comparative Analysis of Ringer’s 1961 Copyright Renewal Study and HathiTrust CRMS Data."
keywords: copyright; copyright renewals; HathiTrust
published: 2018-04-19
 
Prepared by Vetle Torvik 2018-04-15. The dataset comes as a single tab-delimited ASCII encoded file, and should be about 717MB uncompressed. &bull; How was the dataset created? First and last names of authors in the Author-ity 2009 dataset were processed through several tools to predict ethnicities and gender, including Ethnea+Genni as described in: <i>Torvik VI, Agarwal S. Ethnea -- an instance-based ethnicity classifier based on geocoded author names in a large-scale bibliographic database. International Symposium on Science of Science March 22-23, 2016 - Library of Congress, Washington, DC, USA. http://hdl.handle.net/2142/88927</i> <i>Smith, B., Singh, M., & Torvik, V. (2013). A search engine approach to estimating temporal changes in gender orientation of first names. Proceedings Of The ACM/IEEE Joint Conference On Digital Libraries, (JCDL 2013 - Proceedings of the 13th ACM/IEEE-CS Joint Conference on Digital Libraries), 199-208. doi:10.1145/2467696.2467720</i> EthnicSeer: http://singularity.ist.psu.edu/ethnicity <i>Treeratpituk P, Giles CL (2012). Name-Ethnicity Classification and Ethnicity-Sensitive Name Matching. Proceedings of the Twenty-Sixth Conference on Artificial Intelligence (pp. 1141-1147). AAAI-12. Toronto, ON, Canada</i> SexMachine 0.1.1: <a href="https://pypi.python.org/pypi/SexMachine/">https://pypi.python.org/pypi/SexMachine/</a> First names, for some Author-ity records lacking them, were harvested from outside bibliographic databases. &bull; The code and back-end data are periodically updated and made available for query at <a href="http://abel.ischool.illinois.edu">Torvik Research Group</a>. &bull; What is the format of the dataset? The dataset contains 9,300,182 rows and 10 columns:
1. auid: unique ID for authors in Author-ity 2009 (PMID_authorposition)
2. name: full name used as input to EthnicSeer
3. EthnicSeer: predicted ethnicity; ARA, CHI, ENG, FRN, GER, IND, ITA, JAP, KOR, RUS, SPA, VIE, XXX
4. prop: decimal between 0 and 1 reflecting the confidence of the EthnicSeer prediction
5. lastname: used as input for Ethnea+Genni
6. firstname: used as input for Ethnea+Genni
7. Ethnea: predicted ethnicity; either one of 26 ethnicities (AFRICAN, ARAB, BALTIC, CARIBBEAN, CHINESE, DUTCH, ENGLISH, FRENCH, GERMAN, GREEK, HISPANIC, HUNGARIAN, INDIAN, INDONESIAN, ISRAELI, ITALIAN, JAPANESE, KOREAN, MONGOLIAN, NORDIC, POLYNESIAN, ROMANIAN, SLAV, THAI, TURKISH, VIETNAMESE), or two ethnicities (e.g., SLAV-ENGLISH), or UNKNOWN (if no one or two dominant predictions), or TOOSHORT (if both first and last name are too short)
8. Genni: predicted gender; 'F', 'M', or '-'
9. SexMac: predicted gender based on a third-party Python program (default settings except case_sensitive=False); female, mostly_female, andy, mostly_male, male
10. SSNgender: predicted gender based on US SSN data; 'F', 'M', or '-'
keywords: Androgyny; Bibliometrics; Data mining; Search engine; Gender; Semantic orientation; Temporal prediction; Textual markers
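A sketch of reading the tab-delimited file row by row. The field labels follow the 10 columns listed above but are my own names, and the sketch assumes the file has no header row; adjust if it carries one:

```python
import csv

# Field labels assumed from the 10-column description above.
COLUMNS = ["auid", "name", "EthnicSeer", "prop", "lastname",
           "firstname", "Ethnea", "Genni", "SexMac", "SSNgender"]

def read_rows(path):
    """Yield one dict per author record from the tab-delimited ASCII file."""
    with open(path, newline="", encoding="ascii") as f:
        for row in csv.DictReader(f, fieldnames=COLUMNS, delimiter="\t"):
            row["prop"] = float(row["prop"])  # EthnicSeer confidence in [0, 1]
            yield row
```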
published: 2018-09-04
 
This dataset contains records of five years of interlibrary loan (ILL) transactions for the University of Illinois at Urbana-Champaign Library, covering materials lent to other institutions during the period 2009-2013. It includes 169,890 transactions showing date; borrowing institution’s type, state, and country; material format, imprint city, imprint country, imprint region, call number, language, local circulation count, ILL lending count, and OCLC holdings count. The dataset was generated by putting together monthly ILL reports. Circulation and ILL lending fields were added from the ILS records. Borrower region and imprint region fields were created based on the Title VI Region List. The OCLC holdings field was added from WorldCat records.
keywords: Interlibrary Loan; ILL; Lending; OCLC Holding; Library; Area Studies; Collection; Circulation; Collaborative; Shared; Resource Sharing
published: 2018-08-06
 
This annotation study compared RobotReviewer's data extraction to that of three novice data extractors, using six included articles synthesized in one Cochrane review: Bailey E, Worthington HV, van Wijk A, Yates JM, Coulthard P, Afzal Z. Ibuprofen and/or paracetamol (acetaminophen) for pain relief after surgical removal of lower wisdom teeth. Cochrane Database Syst Rev. 2013; CD004624; doi:10.1002/14651858.CD004624.pub2 The goal was to assess the relative advantage of RobotReviewer's data extraction with respect to quality.
keywords: RobotReviewer; annotation; information extraction; data extraction; systematic review automation; systematic reviewing;
published: 2018-07-28
 
This dataset presents a citation analysis and citation context analysis used in Linh Hoang, Frank Scannapieco, Linh Cao, Yingjun Guan, Yi-Yun Cheng, and Jodi Schneider. Evaluating an automatic data extraction tool based on the theory of diffusion of innovation. Under submission. We identified the papers that directly describe or evaluate RobotReviewer from the list of publications on the RobotReviewer website <http://www.robotreviewer.net/publications>, resulting in 6 papers grouped into 5 studies (we collapsed a conference and journal paper with the same title and authors into one study). We found 59 citing papers, combining results from Google Scholar on June 05, 2018 and from Scopus on June 23, 2018. We extracted the citation context around each citation to the RobotReviewer papers and categorized these quotes into emergent themes.
keywords: RobotReviewer; citation analysis; citation context analysis
published: 2018-07-25
 
The PDF describes the process and data used for the heuristic user evaluation described in the related article “<i>Evaluating an automatic data extraction tool based on the theory of diffusion of innovation</i>” by Linh Hoang, Frank Scannapieco, Linh Cao, Yingjun Guan, Yi-Yun Cheng, and Jodi Schneider (under submission).<br /> Frank Scannapieco assessed RobotReviewer data extraction performance on ten articles in 2018-02. The articles are included papers from an update review: Sabharwal A., G.-F.I., Stellrecht E., Scannapieco F.A. <i>Periodontal therapy to prevent the initiation and/or progression of common complex systemic diseases and conditions</i>. An update. Periodontol 2000. In Press. <br/> The form was created in consultation with Linh Hoang and Jodi Schneider. To do the assessment, Frank Scannapieco entered PDFs for these ten articles into RobotReviewer and then filled in ten evaluation forms, based on the ten RobotReviewer automatic data extraction reports. Linh Hoang analyzed these ten evaluation forms and synthesized Frank Scannapieco’s comments to arrive at the evaluation results for the heuristic user evaluation.
keywords: RobotReviewer; systematic review automation; data extraction
published: 2018-07-13
 
Qualitative data collected from the websites of undergraduate research journals between October 2014 and May 2015. Two CSV files. The first file, "Sample", includes the sample of journals with secondary data collected. The second file, "Population", includes the remainder of the population, for which secondary data was not collected. Note: the totals do not add up to 800 as indicated in the article; rows were deleted for journals that had broken links or defunct websites during the random sampling process.
keywords: undergraduate research; undergraduate journals; scholarly communication; libraries; liaison librarianship
published: 2018-04-23
 
Self-citation analysis data based on PubMed Central subset (2002-2005) ---------------------------------------------------------------------- Created by Shubhanshu Mishra, Brent D. Fegley, Jana Diesner, and Vetle Torvik on April 5th, 2018 ## Introduction This is a dataset created as part of the publication titled: Mishra S, Fegley BD, Diesner J, Torvik VI (2018) Self-Citation is the Hallmark of Productive Authors, of Any Gender. PLOS ONE. It contains files for running the self-citation analysis on articles published in PubMed Central between 2002 and 2005, collected in 2015. The dataset is distributed in the form of the following tab-separated text files: * Training_data_2002_2005_pmc_pair_First.txt (1.2G) - Data for first authors * Training_data_2002_2005_pmc_pair_Last.txt (1.2G) - Data for last authors * Training_data_2002_2005_pmc_pair_Middle_2nd.txt (964M) - Data for middle 2nd authors * Training_data_2002_2005_pmc_pair_txt.header.txt - Header for the data * COLUMNS_DESC.txt file - Descriptions of all columns * model_text_files.tar.gz - Text files containing model coefficients and scores for model selection. * results_all_model.tar.gz - Model coefficient and result files in numpy format used for plotting purposes. v4.reviewer contains models for analysis done after reviewer comments. * README.txt file ## Dataset creation Our experiments relied on data from multiple sources, including proprietary data from Thomson Reuters' (now Clarivate Analytics) Web of Science collection of MEDLINE citations (<a href="https://clarivate.com/products/web-of-science/databases/">https://clarivate.com/products/web-of-science/databases/</a>). Authors interested in reproducing our experiments should request this data from Clarivate Analytics directly. However, we do make available a similar but open dataset based on citations from PubMed Central, which can be used to obtain results similar to those reported in our analysis.
Furthermore, we have also freely shared our datasets, which can be used along with the citation datasets from Clarivate Analytics to re-create the dataset used in our experiments. These datasets are listed below. If you wish to use any of these datasets, please make sure you cite both the dataset and the paper introducing it. * MEDLINE 2015 baseline: <a href="https://www.nlm.nih.gov/bsd/licensee/2015_stats/baseline_doc.html">https://www.nlm.nih.gov/bsd/licensee/2015_stats/baseline_doc.html</a> * Citation data from PubMed Central (original paper includes additional citations from Web of Science) * Author-ity 2009 dataset: - Dataset citation: <a href="https://doi.org/10.13012/B2IDB-4222651_V1">Torvik, Vetle I.; Smalheiser, Neil R. (2018): Author-ity 2009 - PubMed author name disambiguated dataset. University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-4222651_V1</a> - Paper citation: <a href="https://doi.org/10.1145/1552303.1552304">Torvik, V. I., & Smalheiser, N. R. (2009). Author name disambiguation in MEDLINE. ACM Transactions on Knowledge Discovery from Data, 3(3), 1–29. https://doi.org/10.1145/1552303.1552304</a> - Paper citation: <a href="https://doi.org/10.1002/asi.20105">Torvik, V. I., Weeber, M., Swanson, D. R., & Smalheiser, N. R. (2004). A probabilistic similarity metric for Medline records: A model for author name disambiguation. Journal of the American Society for Information Science and Technology, 56(2), 140–158. https://doi.org/10.1002/asi.20105</a> * Genni 2.0 + Ethnea for identifying author gender and ethnicity: - Dataset citation: <a href="https://doi.org/10.13012/B2IDB-9087546_V1">Torvik, Vetle (2018): Genni + Ethnea for the Author-ity 2009 dataset. University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-9087546_V1</a> - Paper citation: <a href="https://doi.org/10.1145/2467696.2467720">Smith, B. N., Singh, M., & Torvik, V. I. (2013). 
A search engine approach to estimating temporal changes in gender orientation of first names. In Proceedings of the 13th ACM/IEEE-CS joint conference on Digital libraries - JCDL ’13. ACM Press. https://doi.org/10.1145/2467696.2467720</a> - Paper citation: <a href="http://hdl.handle.net/2142/88927">Torvik VI, Agarwal S. Ethnea -- an instance-based ethnicity classifier based on geo-coded author names in a large-scale bibliographic database. International Symposium on Science of Science March 22-23, 2016 - Library of Congress, Washington DC, USA. http://hdl.handle.net/2142/88927</a> * MapAffil for identifying article country of affiliation: - Dataset citation: <a href="https://doi.org/10.13012/B2IDB-4354331_V1">Torvik, Vetle I. (2018): MapAffil 2016 dataset -- PubMed author affiliations mapped to cities and their geocodes worldwide. University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-4354331_V1</a> - Paper citation: <a href="http://doi.org/10.1045/november2015-torvik">Torvik VI. MapAffil: A Bibliographic Tool for Mapping Author Affiliation Strings to Cities and Their Geocodes Worldwide. D-Lib magazine : the magazine of the Digital Library Forum. 2015;21(11-12):10.1045/november2015-torvik</a> * IMPLICIT journal similarity: - Dataset citation: <a href="https://doi.org/10.13012/B2IDB-4742014_V1">Torvik, Vetle (2018): Author-implicit journal, MeSH, title-word, and affiliation-word pairs based on Author-ity 2009. University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-4742014_V1</a> * Novelty dataset for identify article level novelty: - Dataset citation: <a href="https://doi.org/10.13012/B2IDB-5060298_V1">Mishra, Shubhanshu; Torvik, Vetle I. (2018): Conceptual novelty scores for PubMed articles. University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-5060298_V1</a> - Paper citation: <a href="https://doi.org/10.1045/september2016-mishra"> Mishra S, Torvik VI. 
Quantifying Conceptual Novelty in the Biomedical Literature. D-Lib magazine : The Magazine of the Digital Library Forum. 2016;22(9-10):10.1045/september2016-mishra</a> - Code: <a href="https://github.com/napsternxg/Novelty">https://github.com/napsternxg/Novelty</a> * Expertise dataset for identifying author expertise on articles: * Source code provided at: <a href="https://github.com/napsternxg/PubMed_SelfCitationAnalysis">https://github.com/napsternxg/PubMed_SelfCitationAnalysis</a> **Note: The dataset is based on a snapshot of PubMed (which includes Medline and PubMed-not-Medline records) taken in the first week of October, 2016.** Check <a href="https://www.nlm.nih.gov/databases/download/pubmed_medline.html">here</a> for information on obtaining PubMed/MEDLINE and NLM's data Terms and Conditions. Additional data-related updates can be found at <a href="http://abel.ischool.illinois.edu">Torvik Research Group</a>. ## Acknowledgments This work was made possible in part with funding to VIT from <a href="https://projectreporter.nih.gov/project_info_description.cfm?aid=8475017&icde=18058490">NIH grant P01AG039347</a> and <a href="http://www.nsf.gov/awardsearch/showAward?AWD_ID=1348742">NSF grant 1348742</a>. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript. ## License Self-citation analysis data based on PubMed Central subset (2002-2005) by Shubhanshu Mishra, Brent D. Fegley, Jana Diesner, and Vetle Torvik is licensed under a Creative Commons Attribution 4.0 International License. Permissions beyond the scope of this license may be available at <a href="https://github.com/napsternxg/PubMed_SelfCitationAnalysis">https://github.com/napsternxg/PubMed_SelfCitationAnalysis</a>.
keywords: Self citation; PubMed Central; Data Analysis; Citation Data;
published: 2018-04-23
 
Conceptual novelty analysis data based on PubMed Medical Subject Headings ---------------------------------------------------------------------- Created by Shubhanshu Mishra, and Vetle I. Torvik on April 16th, 2018 ## Introduction This is a dataset created as part of the publication titled: Mishra S, Torvik VI. Quantifying Conceptual Novelty in the Biomedical Literature. D-Lib magazine : the magazine of the Digital Library Forum. 2016;22(9-10):10.1045/september2016-mishra. It contains final data generated as part of our experiments based on the MEDLINE 2015 baseline and the MeSH tree from 2015. The dataset is distributed in the form of the following tab-separated text files: * PubMed2015_NoveltyData.tsv - Novelty scores for each paper in PubMed. The file contains 22,349,417 rows and 6 columns, as follows: - PMID: PubMed ID - Year: year of publication - TimeNovelty: time novelty score of the paper based on individual concepts (see paper) - VolumeNovelty: volume novelty score of the paper based on individual concepts (see paper) - PairTimeNovelty: time novelty score of the paper based on pairs of concepts (see paper) - PairVolumeNovelty: volume novelty score of the paper based on pairs of concepts (see paper) * mesh_scores.tsv - Temporal profiles for each MeSH term for all years. 
The file contains 1,102,831 rows and 5 columns, as follows: - MeshTerm: name of the MeSH term - Year: year - AbsVal: total publications with that MeSH term in the given year - TimeNovelty: age (in years since first publication) of the MeSH term in the given year - VolumeNovelty: age (in number of papers since first publication) of the MeSH term in the given year * meshpair_scores.txt.gz (36 GB uncompressed) - Temporal profiles for each MeSH term pair for all years - Mesh1: name of the first MeSH term (alphabetically sorted) - Mesh2: name of the second MeSH term (alphabetically sorted) - Year: year - AbsVal: total publications with that MeSH pair in the given year - TimeNovelty: age (in years since first publication) of the MeSH pair in the given year - VolumeNovelty: age (in number of papers since first publication) of the MeSH pair in the given year * README.txt file ## Dataset creation This dataset was constructed using multiple datasets described in the following locations: * MEDLINE 2015 baseline: <a href="https://www.nlm.nih.gov/bsd/licensee/2015_stats/baseline_doc.html">https://www.nlm.nih.gov/bsd/licensee/2015_stats/baseline_doc.html</a> * MeSH tree 2015: <a href="ftp://nlmpubs.nlm.nih.gov/online/mesh/2015/meshtrees/">ftp://nlmpubs.nlm.nih.gov/online/mesh/2015/meshtrees/</a> * Source code provided at: <a href="https://github.com/napsternxg/Novelty">https://github.com/napsternxg/Novelty</a> Note: The dataset is based on a snapshot of PubMed (which includes Medline and PubMed-not-Medline records) taken in the first week of October, 2016. 
Check <a href="https://www.nlm.nih.gov/databases/download/pubmed_medline.html">here</a> for information on obtaining PubMed/MEDLINE and NLM's data Terms and Conditions. Additional data-related updates can be found at <a href="http://abel.ischool.illinois.edu">Torvik Research Group</a>. ## Acknowledgments This work was made possible in part with funding to VIT from <a href="https://projectreporter.nih.gov/project_info_description.cfm?aid=8475017&icde=18058490">NIH grant P01AG039347</a> and <a href="http://www.nsf.gov/awardsearch/showAward?AWD_ID=1348742">NSF grant 1348742</a>. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript. ## License Conceptual novelty analysis data based on PubMed Medical Subject Headings by Shubhanshu Mishra, and Vetle Torvik is licensed under a Creative Commons Attribution 4.0 International License. Permissions beyond the scope of this license may be available at <a href="https://github.com/napsternxg/Novelty">https://github.com/napsternxg/Novelty</a>.
keywords: Conceptual novelty; bibliometrics; PubMed; MEDLINE; MeSH; Medical Subject Headings; Analysis;
published: 2018-04-23
 
Contains a series of datasets that score pairs of tokens (words, journal names, and controlled vocabulary terms) based on how often they co-occur within versus across authors' collections of papers. The tokens derive from four different fields of PubMed papers: journal, affiliation, title, MeSH (medical subject headings). Thus, there are 10 different datasets, one for each pair of token types: affiliation-word vs affiliation-word, affiliation-word vs journal, affiliation-word vs mesh, affiliation-word vs title-word, mesh vs mesh, mesh vs journal, etc. Using authors to link papers, and in turn pairs of tokens, is an alternative to the usual within-document co-occurrences, and to using e.g., citations to link papers. This is particularly striking for journal pairs, because a paper almost always appears in a single journal, so within-document co-occurrences are 0, i.e., useless. The tokens are taken from the Author-ity 2009 dataset, which has a cluster of papers for each inferred author and a summary of each field. For MeSH, title-words, and affiliation-words, that summary includes only the top-20 most frequent tokens after field-specific stoplisting (e.g., university is stoplisted from affiliation and Humans is stoplisted from MeSH). The score for a pair of tokens A and B is defined as follows. Suppose Ai and Bi are the numbers of occurrences of tokens A and B, respectively, across the i-th author's papers. Then: nA = sum(Ai); nB = sum(Bi); nAB = sum(Ai*Bi) if A is not equal to B, nAA = sum(Ai*(Ai-1)/2) otherwise; nAnB = nA*nB if A is not equal to B, nAnA = nA*(nA-1)/2 otherwise; score = 1000000*nAB/nAnB if A is not equal to B, 1000000*nAA/nAnA otherwise. Token pairs are excluded when: score < 5, or nA < cut-off, or nB < cut-off, or nAB < cut-offAB. The cut-offs differ by token type and can be inferred from the datasets. For example, cut-off = 200 and cut-offAB = 20 for journal pairs. 
Each dataset has the following 7 tab-delimited, all-ASCII columns:
1: score: roughly the tokens' number of co-occurrences divided by the total number of pairs, in parts per million (ppm), ranging from 5 to 1,000,000
2: nAB: total number of co-occurrences
3: nAnB: total number of pairs
4: nA: number of occurrences of token A
5: nB: number of occurrences of token B
6: A: token A
7: B: token B
We made some of these datasets as early as 2011, while working to link PubMed authors with USPTO inventors, where the vocabulary usage is strikingly different, and more recently to create links from PubMed authors to their dissertations and to NIH/NSF investigators, and to help disambiguate PubMed authors. Going beyond explicit (exact within-field) matching is particularly useful when data are sparse (think of old papers lacking controlled vocabulary and affiliations, or papers with metadata written in different languages) and when making links across databases with different kinds of fields and vocabulary (think PubMed vs USPTO records). We never published a paper on this, but our work inspired the more refined measures described in: <a href="https://doi.org/10.1371/journal.pone.0115681">D′Souza JL, Smalheiser NR (2014) Three Journal Similarity Metrics and Their Application to Biomedical Journals. PLOS ONE 9(12): e115681. https://doi.org/10.1371/journal.pone.0115681</a> <a href="http://dx.doi.org/10.5210/disco.v7i0.6654">Smalheiser, N., & Bonifield, G. (2016). Two Similarity Metrics for Medical Subject Headings (MeSH): An Aid to Biomedical Text Mining and Author Name Disambiguation. DISCO: Journal of Biomedical Discovery and Collaboration, 7. doi:http://dx.doi.org/10.5210/disco.v7i0.6654</a>
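A row of one of these 7-column files can be parsed as sketched below; the function name and the sample row are hypothetical, but the column order follows the description above.

```python
import csv

# Column order of the tab-delimited score files, as documented above.
COLUMNS = ("score", "nAB", "nAnB", "nA", "nB", "A", "B")

def read_pairs(lines):
    """Parse an iterable of tab-delimited lines (e.g., an open file)
    into dicts keyed by column name, with numeric columns converted."""
    for row in csv.reader(lines, delimiter="\t"):
        rec = dict(zip(COLUMNS, row))
        for k in ("nAB", "nAnB", "nA", "nB"):
            rec[k] = int(rec[k])
        rec["score"] = float(rec["score"])
        yield rec

# Hypothetical sample row for illustration.
sample = ["466667\t7\t15\t5\t3\tJ Biol Chem\tNature"]
recs = list(read_pairs(sample))
print(recs[0]["A"], recs[0]["score"])
```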
keywords: PubMed; MeSH; token; name disambiguation
published: 2018-04-23
 
Provides links to Author-ity 2009 from records of principal investigators (on NIH and NSF grants), inventors on USPTO patents, and students/advisors on ProQuest dissertations. Note that NIH and NSF differ in the types of fields they record and the standards used (e.g., for institution names). Typically an NSF grant spanning multiple years is associated with one record, while an NIH grant occurs in multiple records, one for each fiscal year and for sub-projects/supplements, possibly with different principal investigators. The prior probability of a match (i.e., that the author exists in Author-ity 2009) varies dramatically across NIH grants, NSF grants, and USPTO patents: the great majority of NIH principal investigators have one or more papers in PubMed, but only a minority of NSF principal investigators (except in biology) do, and even fewer USPTO inventors do. This prior probability has been built into the calculation of the match probabilities. The NIH data were downloaded from NIH ExPORTER and the older NIH CRISP files. The NIH dataset has 2,353,387 records, includes only those with match probability > 0.5, and has the following 12 fields:
1 app_id
2 nih_full_proj_nbr
3 nih_subproj_nbr
4 fiscal_year
5 pi_position
6 nih_pi_names
7 org_name
8 org_city_name
9 org_bodypolitic_code
10 age: number of years since the author's first paper
11 prob: the match probability to au_id
12 au_id: Author-ity 2009 author ID
The NSF dataset has 262,452 records, includes only those with match probability > 0.5, and has the following 10 fields:
1 AwardId
2 fiscal_year
3 pi_position
4 PrincipalInvestigators
5 Institution
6 InstitutionCity
7 InstitutionState
8 age: number of years since the author's first paper
9 prob: the match probability to au_id
10 au_id: Author-ity 2009 author ID
There are two files for USPTO because here we linked disambiguated authors in PubMed (from Author-ity 2009) with disambiguated inventors.
The USPTO linking dataset has 309,720 records, includes only those with match probability > 0.5, and has the following 3 fields:
1 au_id: Author-ity 2009 author ID
2 inv_id: USPTO inventor ID
3 prob: the match probability of au_id vs inv_id
The disambiguated-inventors file (uiuc_uspto.tsv) has 2,736,306 records and the following 7 fields:
1 inv_id: USPTO inventor ID
2 is_lower
3 is_upper
4 fullnames
5 patents: patent IDs separated by '|'
6 first_app_yr
7 last_app_yr
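Joining the two USPTO files on inv_id, and splitting the '|'-separated patents field, can be sketched as follows. The field names follow the descriptions above, but the in-memory input structures and the sample IDs are hypothetical.

```python
# Sketch: join USPTO linking records (au_id, inv_id, prob) to the
# disambiguated-inventor rows and split the '|'-separated patent list.

def join_links(links, inventors):
    """links: iterable of (au_id, inv_id, prob) tuples.
    inventors: iterable of row dicts with 'inv_id' and 'patents' keys.
    Yields (au_id, inv_id, prob, list of patent IDs)."""
    by_inv = {row["inv_id"]: row for row in inventors}
    for au_id, inv_id, prob in links:
        row = by_inv.get(inv_id)
        if row is None:
            continue  # inventor not present in uiuc_uspto.tsv
        yield au_id, inv_id, prob, row["patents"].split("|")

# Hypothetical sample records for illustration.
links = [("au123", "inv456", 0.93)]
inventors = [{"inv_id": "inv456", "patents": "7654321|8123456"}]
joined = list(join_links(links, inventors))
print(joined)
```

Since the linking dataset already keeps only matches with probability > 0.5, no further probability filtering is needed at join time unless a stricter threshold is wanted.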
keywords: PubMed; USPTO; principal investigator; name disambiguation