Illinois Data Bank
Displaying 401 - 425 of 668 in total
Subject Area
Life Sciences (365)
Social Sciences (136)
Physical Sciences (101)
Technology and Engineering (64)
Arts and Humanities (1)
Uncategorized (1)
Funder
Other (206)
U.S. National Science Foundation (NSF) (193)
U.S. Department of Energy (DOE) (68)
U.S. National Institutes of Health (NIH) (63)
U.S. Department of Agriculture (USDA) (44)
Illinois Department of Natural Resources (IDNR) (17)
U.S. Geological Survey (USGS) (7)
U.S. National Aeronautics and Space Administration (NASA) (6)
Illinois Department of Transportation (IDOT) (4)
U.S. Army (2)
Publication Year
2021 (108)
2022 (108)
2020 (96)
2023 (78)
2019 (72)
2024 (70)
2018 (61)
2017 (36)
2016 (30)
2025 (4)
2009 (1)
2011 (1)
2012 (1)
2014 (1)
2015 (1)
License
CC0 (367)
CC BY (281)
custom (20)
Datasets
published: 2019-12-03
de Moya, Robert (2019): Feather Louse Orthology set. University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-0440388_V1
This is the data set associated with the manuscript titled "Extensive host-switching of avian feather lice following the Cretaceous-Paleogene mass extinction event." Included are the gene alignments used for phylogenetic analyses and the cophylogenetic input files.
keywords:
phylogenomics, cophylogenetics, feather lice, birds
published: 2020-04-22
Endres, A. Bryan; Endres, Renata; Krstinić Nižić, Marinela (2020): Croatian Restaurant Allergy Disclosures. University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-9891298_V1
Data on Croatian restaurant allergen disclosures on restaurant websites, on-line menus and social media comments
keywords:
restaurant; allergen; disclosure; tourism
published: 2012-07-01
Mirarab, Siavash; Nguyen, Nam-Phuong; Warnow, Tandy (2012): Data for SEPP: SATé-Enabled Phylogenetic Placement. University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-9316702_V1
This dataset provides the data for Mirarab, Siavash, Nam Nguyen, and Tandy Warnow. "SEPP: SATé-enabled phylogenetic placement." Biocomputing 2012. 2012. 247-258.
published: 2019-06-12
Miller, Andrew; Raudabaugh, Daniel (2019): Supplemental data sets for Raudabaugh et al., Where are they hiding? Testing the body snatchers hypothesis in pyrophilous fungi. University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-1530363_V1
The data set contains supplemental data sets for the manuscript entitled "Where are they hiding? Testing the body snatchers hypothesis in pyrophilous fungi."
Environmental sampling: Amplification of nuclear DNA regions (ITS1 and ITS2) was completed using the Fluidigm Access Array, and the resulting amplicons were sequenced on an Illumina MiSeq v2 platform using rapid 2 × 250 nt paired-end reads. The amplicons for the Illumina sequencing run were size selected into <500 nt and >500 nt sub-pools, then remixed (<500 nt : >500 nt) by nM concentration in a 1x:3x proportion. All amplification and sequencing steps were performed at the Roy J. Carver Biotechnology Center at the University of Illinois Urbana-Champaign.
ITS1 region primers consisted of ITS1F (5'-CTTGGTCATTTAGAGGAAGTAA-3') and ITS2 (5'-GCTGCGTTCTTCATCGATGC-3'). ITS2 region primers consisted of fITS7 (5'-GTGARTCATCGAATCTTTG-3') and ITS4 (5'-TCCTCCGCTTATTGATATGC-3').
Supplemental files 1 through 5 contain the raw data files. Supplemental 1 is the ITS1 Illumina MiSeq forward reads and Supplemental 2 is the corresponding index files. Supplemental 3 is the ITS2 Illumina MiSeq forward reads and Supplemental 4 is the corresponding index files. Supplemental 5 is the map file needed to process the forward reads and index files in QIIME. Supplemental 6 and 7 contain the resulting QIIME 1.9.1 OTU tables along with UNITE, NCBI, and CONSTAX taxonomic assignments, in addition to the representative OTU sequence.
Numeric samples within the OTU tables correspond to the following:
1 Brachythecium sp.
2 Usnea cornuta
3 Dicranum sp.
4 Leucodon julaceus
5 Lobaria quercizans
6 Rhizomnium sp.
7 Dicranum sp.
8 Thuidium delicatulum
9 Myelochroa aurulenta
10 Atrichum angustatum
11 Dicranum sp.
12 Hypnum sp.
13 Atrichum angustatum
14 Hypnum sp.
15 Thuidium delicatulum
16 Leucobryum sp.
17 Polytrichum commune
18 Atrichum angustatum
19 Atrichum angustatum
20 Atrichum crispulum
21 Bryaceae
22 Leucobryum sp.
23 Conocephalum conicum
24 Climacium americanum
25 Atrichum angustatum
26 Huperzia serrata
27 Polytrichum commune
28 Diphasiastrum sp.
29 Anomodon attenuatus
30 Bryoandersonia sp.
31 Polytrichum commune
32 Thuidium delicatulum
33 Brachythecium sp.
34 Leucobryum glaucum
35 Bryoandersonia sp.
36 Anomodon attenuatus
37 Pohlia sp.
38 Cinclidium sp.
39 Hylocomium splendens
40 Polytrichum commune
41 negative control
42 Soil
43 Soil
44 Soil
45 Soil
46 Soil
47 Soil
If a sample number is not present within the OTU table, either no sequences were obtained or no sequences passed the quality filtering step in QIIME. Supplemental 8 contains the summary of unique species per location.
published: 2019-07-29
Christensen, Sarah; Molloy, Erin K.; Vachaspati, Pranjal; Warnow, Tandy (2019): Data from TRACTION: Fast non-parametric improvement of estimated gene trees. University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-1747658_V1
Datasets used in the study, "TRACTION: Fast non-parametric improvement of estimated gene trees," accepted at the Workshop on Algorithms in Bioinformatics (WABI) 2019.
keywords:
Gene tree correction; horizontal gene transfer; incomplete lineage sorting
published: 2019-08-30
Allen, Maximilian (2019): Wisconsin Bobcat Harvest Data. University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-2501832_V1
This dataset includes the data from an analysis of bobcat harvest data, with particular focus on the relationship between catch-per-unit-effort and population size. The data relate to bobcat trapper and hunter harvest metrics from Wisconsin and include two RDS files, which can be opened in R using the readRDS() function.
keywords:
bobcat; catch-per-unit-effort; CPUE; harvest; Lynx rufus; wildlife management; trapper; hunter
published: 2017-12-22
Scheidler, Andrew; Kinnett-Hopkins, Dominique; Learmonth, Yvonne; Motl, Robert; Lopez-Ortiz, Citlali (2017): Targeted ballet program mitigates ataxia and improves agility in moderate-to-advanced multiple sclerosis. University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-6858418_V2
TBP assessment raw data files of pre- and post-intervention motion capture velocity and center of pressure force plate data. Labels are self-explanatory. The .mat files contain data exported from the force plate for the time-to-stabilization assessments, while the .txt files contain the data collected for the smoothness of gait assessments. These files do not relate to one another and are from separate assessments. The files in version 2 are the result of running the Python script Data_Bank_Cleaner.py on the files in version 1. Please find more information in READ_ME_databank.txt.
keywords:
Multiple Sclerosis; Rehabilitation; Balance; Ataxia; Ballet; Dance; Targeted Ballet Program
published: 2019-07-08
Kehoe, Adam K.; Torvik, Vetle I. (2019): Datasets from "Predicting Controlled Vocabulary Based on Text and Citations: Case Studies in Medical Subject Headings in MEDLINE and Patents". University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-8020612_V1
# Overview
These datasets were created in conjunction with the dissertation "Predicting Controlled Vocabulary Based on Text and Citations: Case Studies in Medical Subject Headings in MEDLINE and Patents," by Adam Kehoe. The datasets consist of the following:
* twin_not_abstract_matched_complete.tsv: a tab-delimited file consisting of pairs of MEDLINE articles with identical titles, authors and years of publication. This file contains the PMIDs of the duplicate publications, as well as their medical subject headings (MeSH) and three measures of their indexing consistency.
* twin_abstract_matched_complete.tsv: the same as above, except that the MEDLINE articles also have matching abstracts.
* mesh_training_data.csv: a comma-separated file containing the training data for the model discussed in the dissertation.
* mesh_scores.tsv: a tab-delimited file containing a pairwise similarity score based on word embeddings, and MeSH hierarchy relationship.

## Duplicate MEDLINE Publications
Both twin_not_abstract_matched_complete.tsv and twin_abstract_matched_complete.tsv have the same structure. They have the following columns:
1. pmid_one: the PubMed unique identifier of the first paper
2. pmid_two: the PubMed unique identifier of the second paper
3. mesh_one: a list of medical subject headings (MeSH) from the first paper, delimited by the "|" character
4. mesh_two: a list of medical subject headings from the second paper, delimited by the "|" character
5. hoopers_consistency: the calculation of Hooper's consistency between the MeSH of the first and second paper
6. nonhierarchicalfree: a word embedding based consistency score described in the dissertation
7. hierarchicalfree: a word embedding based consistency score additionally limited by the MeSH hierarchy, described in the dissertation

## MeSH Training Data
The mesh_training_data.csv file contains the training data for the model discussed in the dissertation. It has the following columns:
1. pmid: the PubMed unique identifier of the paper
2. term: a candidate MeSH term
3. cit_count: the log of the frequency of the term in the citation candidate set
4. total_cit: the log of the total number of the paper's citations
5. citr_count: the log of the frequency of the term in the citations of the paper's citations
6. total_citofcit: the log of the total number of the citations of the paper's citations
7. absim_count: the log of the frequency of the term in the AbSim candidate set
8. total_absim_count: the log of the total number of AbSim records for the paper
9. absimr_count: the log of the frequency of the term in the citations of the AbSim records
10. total_absimr_count: the log of the total number of citations of the AbSim record
11. log_medline_frequency: the log of the frequency of the candidate term in MEDLINE
12. relevance: a binary indicator (True/False) of whether the candidate term was assigned to the target paper

## Cosine Similarity
The mesh_scores.tsv file contains a pairwise list of all MeSH terms including their cosine similarity based on the word embedding described in the dissertation. Because the MeSH hierarchy is also used in many of the evaluation measures, the relationship of the term pair is also included. It has the following columns:
1. mesh_one: a string of the first MeSH heading
2. mesh_two: a string of the second MeSH heading
3. cosine_similarity: the cosine similarity between the terms
4. relationship_type: a string identifying the relationship type, consisting of none, parent/child, sibling, ancestor and direct (terms are identical, i.e. a direct hierarchy match)

The mesh_model.bin file is a binary word2vec C format file containing the MeSH term embeddings. It was generated using version 3.7.2 of the Python gensim library (https://radimrehurek.com/gensim/). For an example of how to load the model file, see https://radimrehurek.com/gensim/models/word2vec.html#usage-examples, specifically the directions for loading the "word2vec C format."
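As an illustration of the column layout described for the duplicate-publication files, the sketch below splits the "|"-delimited mesh_one and mesh_two fields and computes a consistency value as shared terms divided by all distinct terms, which is the usual textbook form of Hooper's measure. It assumes the TSV files begin with a header row carrying the column names listed above; that, the function names, and the made-up sample row are assumptions for illustration, and the dissertation's exact calculation may differ.

```python
import csv
import io


def split_mesh(field):
    """Split a '|'-delimited MeSH field into a set of headings."""
    return {term.strip() for term in field.split("|") if term.strip()}


def hoopers_consistency(terms_a, terms_b):
    """Shared terms divided by all distinct terms used in either record."""
    union = terms_a | terms_b
    if not union:
        return 0.0
    return len(terms_a & terms_b) / len(union)


# Made-up stand-in for a row of twin_abstract_matched_complete.tsv.
# Assumption: the real file has a header row with these column names.
SAMPLE = (
    "pmid_one\tpmid_two\tmesh_one\tmesh_two\n"
    "111\t222\tHumans|Female|Neoplasms\tHumans|Neoplasms|Adult\n"
)


def consistency_rows(handle):
    """Yield (pmid_one, pmid_two, consistency) for each pair in the file."""
    reader = csv.DictReader(handle, delimiter="\t")
    for row in reader:
        a = split_mesh(row["mesh_one"])
        b = split_mesh(row["mesh_two"])
        yield row["pmid_one"], row["pmid_two"], hoopers_consistency(a, b)


rows = list(consistency_rows(io.StringIO(SAMPLE)))
```

To run this against a downloaded file, the `io.StringIO(SAMPLE)` handle could be replaced with `open(path, newline="", encoding="utf-8")`; the encoding is also an assumption.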
keywords:
MEDLINE;MeSH;Medical Subject Headings;Indexing
published: 2019-08-29
de Moya, Robert (2019): Bemisia tabaci ortholog set. University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-5333299_V1
This is the published ortholog set derived from whole genome data used for the analysis of members of the B. tabaci complex of whiteflies. It includes the concatenated alignment and individual gene alignments used for analyses (Link to publication: https://www.mdpi.com/1424-2818/11/9/151).
published: 2020-10-01
Strickland, Lynette (2020): No choice mating trials and two choice mating trials in the polymorphic tortoise beetle, Chelymorpha alternans. University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-8972634_V1
These datasets come from mating trials performed to assess whether color pattern phenotypes of the polymorphic tortoise beetle, Chelymorpha alternans, mate randomly with one another, and whether there are any reproductive differences between assortative and disassortative pairings.
keywords:
mate choice, color polymorphisms, random mating
published: 2017-08-11
Schiffer, Peter; Le, Brian L. (2017): Magnetotransport measurements of connected kagome artificial spin ice in armchair and zigzag configurations. University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-1859347_V1
Enclosed in this dataset are transport data of kagome connected artificial spin ice networks composed of permalloy nanowires. The data herein are reproductions of the data seen in Appendix B of the dissertation titled "Magnetotransport of Connected Artificial Spin Ice". Field sweeps with the magnetic field applied in-plane were performed in 5 degree increments for armchair orientation kagome artificial spin ice and zigzag orientation kagome artificial spin ice.
keywords:
Magnetotransport; artificial spin ice; nanowires
published: 2020-05-15
Mishra, Shubhanshu (2020): Trained models for multi-task multi-dataset learning for sequence prediction in tweets - Old Experiments. University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-4520270_V1
Trained models for multi-task multi-dataset learning for sequence prediction in tweets. Tasks include POS, NER, Chunking, and SuperSenseTagging. Models were trained using: https://github.com/napsternxg/SocialMediaIE/blob/master/experiments/multitask_multidataset_experiment.py. See https://github.com/napsternxg/SocialMediaIE for details.
keywords:
twitter; deep learning; machine learning; trained models; multi-task learning; multi-dataset learning;
published: 2021-05-10
Zheng, Zhonghua; Zhao, Lei; Oleson, Keith (2021): Global multi-model projections of urban daily temperatures. University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-6081052_V1
This dataset contains the emulated global multi-model urban daily temperature projections under the RCP 8.5 scenario. The dataset is derived from the study "Large model structural uncertainty in global projections of urban heat waves" (XXXX). Details about this dataset and the local urban climate emulator are described in the article. This dataset documents the global urban daily temperatures of 17 CMIP5 Earth system models for 2006-2015 and 2061-2070. It may be useful to multiple communities working on urban climate change, heat waves, impacts, vulnerability, risks, and adaptation applications.
keywords:
Urban heat waves; CMIP; urban warming; heat stress; urban climate change
published: 2019-03-19
Molloy, Erin K.; Warnow, Tandy (2019): Data from: TreeMerge: A new method for improving the scalability of species tree estimation methods. University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-9570561_V1
This repository includes scripts and datasets for the paper, "TreeMerge: A new method for improving the scalability of species tree estimation methods." The latest version of TreeMerge can be downloaded from Github (https://github.com/ekmolloy/treemerge).
keywords:
divide-and-conquer; statistical consistency; species trees; incomplete lineage sorting; phylogenomics
published: 2019-01-27
Le, Thien; Sy, Aaron; Molloy, Erin K.; Zhang, Qiuyi; Rao, Satish; Warnow, Tandy (2019): Using INC within Divide-and-Conquer Phylogeny Estimation - Datasets. University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-8518809_V1
This repository includes datasets that are studied with INC/INC-ML/INC-NJ in the paper "Using INC within Divide-and-Conquer Phylogeny Estimation" that was submitted to AICoB 2019. Each dataset has its own readme.txt that further describes the creation process and the other parameters and software used in making these datasets. The latest implementation of INC/INC-ML/INC-NJ can be found at https://github.com/steven-le-thien/constraint_inc. Note: there may be files with DS_STORE as extension in the datasets; please ignore these files.
keywords:
phylogenetics; gene tree estimation; divide-and-conquer; absolute fast converging
published: 2023-02-10
Emmet, Robert L.; Benson, Thomas J.; Allen, Maximilian L.; Stodola, Kirk W. (2023): Integrating multiple data sources improves prediction and inference for upland game occupancy models. University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-0477888_V1
Data and documentation for the Ornithological Applications manuscript “Integrating multiple data sources improves prediction and inference for upland game bird occupancy models” by Robert L. Emmet, Thomas J. Benson, Maximilian L. Allen, and Kirk W. Stodola. We combined data from the North American Breeding Bird Survey and eBird with a targeted survey (IDNR upland game) to estimate habitat use of northern bobwhite and ring-necked pheasant in Illinois and to document the efficiency and overlap among the various data sources. Data include eBird, USGS Breeding Bird Survey, National Land Cover Database, upland game bird surveys, and stream data.
keywords:
data integration; occupancy; avian population modelling; northern bobwhite;Colinus virginianus; ring-necked pheasant; Phasianus colchicus
published: 2023-02-07
Willson, James; Tabatabaee, Yasamin; Liu, Baqiao; Warnow, Tandy (2023): Data from: DISCO+QR: Rooting Species Trees in the Presence of GDL and ILS. University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-5748609_V1
Data sets from "DISCO+QR: Rooting Species Trees in the Presence of GDL and ILS." It contains trees and sequences simulated with gene duplication and loss under a variety of different conditions.
Note:
- trees.tar.gz contains the simulated gene-family trees used in our experiments (both true trees from SimPhy as well as trees estimated from alignments).
- alignments.tar.gz contains simulated sequence data used for estimating the gene-family trees.
keywords:
evolution; computational biology; bioinformatics; phylogenetics
published: 2023-04-06
Warnow, Tandy; Park, Minhyuk (2023): INDELible simulated datesets with sequence length heterogeneity. University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-0900513_V1
This is a simulated sequence dataset generated using INDELible and processed via a sequence fragmentation procedure.
keywords:
sequence length heterogeneity;indelible;computational biology;multiple sequence alignment
published: 2021-04-11
Park, Minhyuk; Zaharias, Paul; Warnow, Tandy (2021): Disjoint Tree Mergers for Large-Scale Maximum LikelihoodTree Estimation. University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-7008049_V1
This dataset contains RNASim1000, Cox1-Het datasets as well as analyses of RNASim1000, Cox1-Het, and 1000M1(HF).
keywords:
phylogeny estimation; maximum likelihood; RAxML; IQ-TREE; FastTree; cox1; heterotachy; disjoint tree mergers; Tree of Life
published: 2018-06-02
Palmer, Ryan; Albarracin, Dolores (2018): Trust in Science as a Deterrent and a Facilitator of Belief in Conspiracy Theories: Pseudoscience Preys on Audiences that Trust in Science. University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-4469040_V1
keywords:
conspiracy theory; trust in science
published: 2018-12-13
Yin, Dandong; Wang, Shaowen (2018): CyberGIS-Jupyter HAND Example Notebook. University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-6316661_V2
The dataset contains a complete example (inputs, outputs, code, intermediate results, visualization webpage) of executing the Height Above Nearest Drainage (HAND) workflow with CyberGIS-Jupyter.
keywords:
cybergis; hydrology; Jupyter
published: 2019-06-11
Wang, Wenrui; Wang, Tao; Amin, Vivek P.; Wang, Yang; Radhakrishnan, Anil; Davidson, Angie; Allen, Shane R.; Silva, T. J.; Ohldag, Hendrik; Balzar, Davor; Zink, Barry L.; Haney, Paul M.; Xiao, John Q.; Cahill, David G.; Lorenz, Virginia O.; Fan, Xin (2019): Dataset for "Anomalous Spin-Orbit Torques in Magnetic Single-Layer Films". University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-7281207_V1
This dataset provides the raw data, code and related figures for the paper, "Anomalous Spin-Orbit Torques in Magnetic Single-Layer Films."
keywords:
spintronics; spin-orbit torques; magnetic materials
published: 2021-09-03
Clark, Lindsay V.; Mays, Wittney; Lipka, Alexander E.; Sacks, Erik J. (2021): Dataset for evaluating the Hind/He statistic in polyRAD. University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-4814898_V1
All of the files in this dataset pertain to the evaluation of a novel statistic, Hind/He, for distinguishing Mendelian loci from paralogs. They are derived from a RAD-seq genotyping dataset of diploid and tetraploid Miscanthus sacchariflorus.
published: 2021-03-15
Stodola, Alison P.; Lydeard, Charles; Lamer, James T.; Douglass, Sarah A.; Cummings, Kevin; Campbell, David (2021): Data and Images for "Hiding in plain sight: genetic confirmation of putative Louisiana Fatmucket Lampsilis hydiana in Illinois". University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-5609050_V1
Dataset associated with "Hiding in plain sight: genetic confirmation of putative Louisiana Fatmucket Lampsilis hydiana in Illinois" as submitted to Freshwater Mollusk Biology and Conservation by Stodola et al.
Images are from cataloged specimens from the Illinois Natural History Survey (INHS) Mollusk Collection in Champaign, Illinois that were used for genetic research. File names indicate the species as confirmed in Stodola et al. (i.e., Lampsilis siliquoidea or Lampsilis hydiana) followed by the INHS Mollusk Collection catalog number, followed by the individual specimen number, followed by shell view (interior or exterior). If no specimen number is noted in the file name, there is only one specimen for that catalog number. For example: Lsiliquoidea_46515_1_2_3_exterior.
Images were created by photographing specimens on a metric grid in an OrTech Photo-e-Box Plus with a Nikon D610 single lens reflex camera using a 60mm lens. Post-processing of images (cropping, image rotation, and auto contrast) was done in Adobe Photoshop, and the images were saved as TIFF files using no image compression, interleaved pixel order, and IBM PC Byte Order. One additional partial lot, INHS Mollusk Catalog No. 37059 (shown with both interior and exterior view in one image), is included for reference but was not genetically sequenced.
A .csv file contains an index of all specimens photographed, with the following columns:
SPECIES: species confirmed using genetic analyses
GENE: cox1 or nad1 mitochondrial gene
ACCESSION: GenBank accession number
INHS CATALOG NO: Illinois Natural History Survey Mollusk Collection Catalog number
WATERBODY: waterbody where specimen was collected
PUTATIVE SPECIES: species determination based on morphological characters prior to genetic analysis
Phylogenetic sequence data (.nex files) were aligned using BioEdit (Hall, T.A. 1999. BioEdit: a user-friendly biological sequence alignment editor and analysis program for Windows 95/98/NT. Nucleic Acids Symposium Series 41:95-98.). Pertinent methodology for the analysis is contained within the manuscript submitted by Stodola et al. to Freshwater Mollusk Biology and Conservation. In these files, "N" is a standard symbol for an unknown base.
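The image file-naming convention described above (species, catalog number, optional specimen number, shell view, joined by underscores) can be unpacked with a few lines of code. The following is a minimal sketch, not part of the dataset: the function name is hypothetical, it assumes every part between the catalog number and the view is a specimen number (as the example Lsiliquoidea_46515_1_2_3_exterior suggests), and it assumes any file extension such as .tif can simply be dropped.

```python
from pathlib import Path


def parse_image_name(filename):
    """Split an image file name into species, catalog number, specimens, view.

    Assumed layout: <species>_<catalog>[_<specimen>...]_<view>[.ext]
    """
    stem = Path(filename).stem  # drop an extension such as ".tif" if present
    parts = stem.split("_")
    if len(parts) < 3:
        raise ValueError(f"unexpected file name: {filename!r}")
    species, catalog_no, *specimens, view = parts
    return {
        "species": species,
        "catalog_no": catalog_no,
        "specimens": specimens,  # empty when the lot has a single specimen
        "view": view,
    }


example = parse_image_name("Lsiliquoidea_46515_1_2_3_exterior")
```

An empty `specimens` list corresponds to the case where no specimen number is noted in the file name, meaning there is only one specimen for that catalog number.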
keywords:
Lampsilis hydiana; Lampsilis siliquoidea; unionid; Louisiana Fatmucket; Fatmucket; genetic confirmation
published: 2018-04-19
Torvik, Vetle I.; Smalheiser, Neil R. (2018): Author-ity 2009 - PubMed author name disambiguated dataset. University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-4222651_V1
Author-ity 2009 baseline dataset. Prepared by Vetle Torvik, 2009-12-03.
The dataset comes in the form of 18 compressed (.gz) Linux text files named authority2009.part00.gz - authority2009.part17.gz. The total size should be ~17.4GB uncompressed.

• How was the dataset created?
The dataset is based on a snapshot of PubMed (which includes Medline and PubMed-not-Medline records) taken in July 2009. It contains a total of 19,011,985 Article records and 61,658,514 author name instances. Each instance of an author name is uniquely represented by the PMID and the position on the paper (e.g., 10786286_3 is the third author name on PMID 10786286). Thus, each cluster is represented by a collection of author name instances. The instances were first grouped into "blocks" by last name and first name initial (including some close variants), and then each block was separately subjected to clustering. Details are described in:
Torvik, V., & Smalheiser, N. (2009). Author name disambiguation in MEDLINE. ACM Transactions On Knowledge Discovery From Data, 3(3), doi:10.1145/1552303.1552304
Torvik, V. I., Weeber, M., Swanson, D. R., & Smalheiser, N. R. (2005). A Probabilistic Similarity Metric for Medline Records: A Model for Author Name Disambiguation. Journal Of The American Society For Information Science & Technology, 56(2), 140-158. doi:10.1002/asi.20105
Note that for Author-ity 2009, some new predictive features (e.g., grants, citation matches, temporal, affiliation phrases) and a post-processing merging procedure were applied (to capture name variants not captured during blocking, e.g. matches for subsets of compound last names, and nicknames with a different first initial like Bill and William), and a temporal feature was used. This has not yet been written up for publication.

• How accurate is the 2009 dataset (compared to 2006 and 2008)?
The recall reported for 2006 of 98.8% has been much improved in 2009 (because common last name variants are now captured). Compared to 2006, both 2008 and 2009 overall seem to exhibit a higher rate of splitting errors but a lower rate of lumping errors. This reflects an overall decrease in prior probabilities, possibly because of a) a new prior estimation procedure that avoids wild estimates (by dampening the magnitude of iterative changes); b) the inclusion in 2008 and 2009 of items in PubMed-not-Medline (including in-process items); and c) the dramatic (exponential) increase in frequencies of some names (J. Lee went from ~16,000 occurrences in 2006 to 26,000 in 2009). However, splitting is reduced in 2009 for some special cases, such as NIH-funded investigators who list their grant number on their papers. Compared to 2008, splitting errors were reduced overall in 2009 while maintaining the same level of lumping errors.

• What is the format of the dataset?
The cluster summaries for 2009 are much more extensive than in the 2008 dataset. Each line corresponds to a predicted author-individual represented by a cluster of author name instances and a summary of all the corresponding papers and author name variants (and if there are > 10 papers in the cluster, an identical summary of the 10 most recent papers). Each cluster has a unique Author ID (which is uniquely identified by the PMID of the earliest paper in the cluster and the author name position). The summary has the following tab-delimited fields:
1. blocks separated by '||'; each block may consist of multiple lastname-first initial variants separated by '|'
2. prior probabilities of the respective blocks separated by '|'
3. Cluster number relative to the block ordered by cluster size (some are listed as 'CLUSTER X' when they were derived from multiple blocks)
4. Author ID (or cluster ID), e.g., bass_c_9731334_2 represents a cluster where 9731334_2 is the earliest author name instance. Although not needed for uniqueness, the ID also has the most frequent lastname_firstinitial (lowercased).
5. cluster size (number of author name instances on papers)
6. name variants separated by '|' with counts in parenthesis. Each variant is of the format lastname_firstname middleinitial, suffix
7. last name variants separated by '|'
8. first name variants separated by '|'
9. middle initial variants separated by '|' ('-' if none)
10. suffix variants separated by '|' ('-' if none)
11. email addresses separated by '|' ('-' if none)
12. range of years (e.g., 1997-2009)
13. Top 20 most frequent affiliation words (after stoplisting and tokenizing; some phrases are also made) with counts in parenthesis; separated by '|'; ('-' if none)
14. Top 20 most frequent MeSH (after stoplisting; "-") with counts in parenthesis; separated by '|'; ('-' if none)
15. Journals with counts in parenthesis (separated by "|")
16. Top 20 most frequent title words (after stoplisting and tokenizing) with counts in parenthesis; separated by '|'; ('-' if none)
17. Co-author names (lowercased lastname and first/middle initials) with counts in parenthesis; separated by '|'; ('-' if none)
18. Co-author IDs with counts in parenthesis; separated by '|'; ('-' if none)
19. Author name instances (PMID_auno separated by '|')
20. Grant IDs (after normalization; "-" if none given; separated by "|")
21. Total number of times cited. (Citations are based on references extracted from PMC.)
22. h-index
23. Citation counts (e.g., for h-index): PMIDs by the author that have been cited (with total citation counts in parenthesis); separated by "|"
24. Cited: PMIDs that the author cited (with counts in parenthesis) separated by "|"
25. Cited-by: PMIDs that cited the author (with counts in parenthesis) separated by "|"
26-47. same summary as for 4-25 except that the 10 most recent papers were used (based on year; so if papers 10, 11, 12... have the same year, one is selected arbitrarily)
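The identifier and field conventions described above lend themselves to small parsing helpers. The sketch below is one possible reading of the format, not code shipped with the dataset: the function names are hypothetical, and it assumes that "counts in parenthesis" means each item looks like `value(count)` or `value (count)`, which the description does not spell out.

```python
import re

# Assumed item shape for counted fields: "value(12)" or "value (12)".
COUNTED = re.compile(r"^(?P<value>.*)\((?P<count>\d+)\)$")


def parse_instance(instance):
    """'10786286_3' -> (10786286, 3): PMID and author position on the paper."""
    pmid, position = instance.split("_")
    return int(pmid), int(position)


def parse_author_id(author_id):
    """'bass_c_9731334_2' -> ('bass_c', 9731334, 2).

    Splitting from the right keeps any underscores in the name part intact.
    """
    name, pmid, position = author_id.rsplit("_", 2)
    return name, int(pmid), int(position)


def parse_counted(field):
    """Parse a '|'-separated field with counts; '-' means no values."""
    if field.strip() == "-":
        return {}
    counts = {}
    for item in field.split("|"):
        match = COUNTED.match(item.strip())
        if match:  # items that do not fit the assumed shape are skipped
            counts[match.group("value").strip()] = int(match.group("count"))
    return counts


def split_fields(line):
    """Split one cluster summary line into its tab-delimited fields."""
    return line.rstrip("\n").split("\t")
```

With these helpers, field 4 of a line would go through `parse_author_id`, field 19 could be split on '|' and passed item by item to `parse_instance`, and counted fields such as 13 or 16 through `parse_counted`.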
keywords:
Bibliographic databases; Name disambiguation; MEDLINE; Library information networks