Displaying 401 - 425 of 668 in total

published: 2019-12-03
This is the data set associated with the manuscript titled "Extensive host-switching of avian feather lice following the Cretaceous-Paleogene mass extinction event." Included are the gene alignments used for phylogenetic analyses and the cophylogenetic input files.
keywords: phylogenomics, cophylogenetics, feather lice, birds
published: 2020-04-22
Data on Croatian restaurant allergen disclosures on restaurant websites, on-line menus and social media comments
keywords: restaurant; allergen; disclosure; tourism
published: 2012-07-01
This dataset provides the data for Mirarab, Siavash, Nam Nguyen, and Tandy Warnow. "SEPP: SATé-enabled phylogenetic placement." Biocomputing 2012. 2012. 247-258.
published: 2019-06-12
The data set contains supplemental data sets for the manuscript entitled "Where are they hiding? Testing the body snatchers hypothesis in pyrophilous fungi."

Environmental sampling: Nuclear DNA regions (ITS1 and ITS2) were amplified using the Fluidigm Access Array, and the resulting amplicons were sequenced on an Illumina MiSeq v2 platform using rapid 2 × 250 nt paired-end reads. Amplicons for the Illumina sequencing run were size-selected into <500 nt and >500 nt sub-pools, then remixed (<500 nt : >500 nt) by nM concentration in a 1x:3x proportion. All amplification and sequencing steps were performed at the Roy J. Carver Biotechnology Center at the University of Illinois Urbana-Champaign. ITS1 region primers consisted of ITS1F (5'-CTTGGTCATTTAGAGGAAGTAA-3') and ITS2 (5'-GCTGCGTTCTTCATCGATGC-3'). ITS2 region primers consisted of fITS7 (5'-GTGARTCATCGAATCTTTG-3') and ITS4 (5'-TCCTCCGCTTATTGATATGC-3').

Supplemental files 1 through 5 contain the raw data files. Supplemental 1 contains the ITS1 Illumina MiSeq forward reads and Supplemental 2 contains the corresponding index files. Supplemental 3 contains the ITS2 Illumina MiSeq forward reads and Supplemental 4 contains the corresponding index files. Supplemental 5 is the map file needed to process the forward reads and index files in QIIME. Supplemental 6 and 7 contain the resulting QIIME 1.9.1 OTU tables along with UNITE, NCBI, and CONSTAX taxonomic assignments in addition to the representative OTU sequence.

Numeric samples within the OTU tables correspond to the following:
1 Brachythecium sp.
2 Usnea cornuta
3 Dicranum sp.
4 Leucodon julaceus
5 Lobaria quercizans
6 Rhizomnium sp.
7 Dicranum sp.
8 Thuidium delicatulum
9 Myelochroa aurulenta
10 Atrichum angustatum
11 Dicranum sp.
12 Hypnum sp.
13 Atrichum angustatum
14 Hypnum sp.
15 Thuidium delicatulum
16 Leucobryum sp.
17 Polytrichum commune
18 Atrichum angustatum
19 Atrichum angustatum
20 Atrichum crispulum
21 Bryaceae
22 Leucobryum sp.
23 Conocephalum conicum
24 Climacium americanum
25 Atrichum angustatum
26 Huperzia serrata
27 Polytrichum commune
28 Diphasiastrum sp.
29 Anomodon attenuatus
30 Bryoandersonia sp.
31 Polytrichum commune
32 Thuidium delicatulum
33 Brachythecium sp.
34 Leucobryum glaucum
35 Bryoandersonia sp.
36 Anomodon attenuatus
37 Pohlia sp.
38 Cinclidium sp.
39 Hylocomium splendens
40 Polytrichum commune
41 negative control
42 Soil
43 Soil
44 Soil
45 Soil
46 Soil
47 Soil

If a sample number is not present within the OTU table, either no sequences were obtained or no sequences passed the quality filtering step in QIIME. Supplemental 8 contains the summary of unique species per location.
published: 2019-07-29
Datasets used in the study, "TRACTION: Fast non-parametric improvement of estimated gene trees," accepted at the Workshop on Algorithms in Bioinformatics (WABI) 2019.
keywords: Gene tree correction; horizontal gene transfer; incomplete lineage sorting
published: 2019-08-30
This dataset includes the data from an analysis of bobcat harvest data, with particular focus on the relationship between catch-per-unit-effort and population size. The data relate to bobcat trapper and hunter harvest metrics from Wisconsin and include two RDS files, which can be opened in R using the readRDS() function.
keywords: bobcat; catch-per-unit-effort; CPUE; harvest; Lynx rufus; wildlife management; trapper; hunter
published: 2017-12-22
Targeted Ballet Program (TBP) assessment raw data files of pre- and post-program motion capture velocity and center of pressure force plate data. Labels are self-explanatory. The .mat files contain data exported from the force plate for the time-to-stabilization assessments, while the .txt files contain the data collected for the smoothness of gait assessments. These files do not relate to one another and are from separate assessments. The version 2 files are the result of running the Python script Data_Bank_Cleaner.py on the version 1 files. Please find more information in READ_ME_databank.txt.
keywords: Multiple Sclerosis; Rehabilitation; Balance; Ataxia; Ballet; Dance; Targeted Ballet Program
published: 2019-07-08
# Overview

These datasets were created in conjunction with the dissertation "Predicting Controlled Vocabulary Based on Text and Citations: Case Studies in Medical Subject Headings in MEDLINE and Patents," by Adam Kehoe. The datasets consist of the following:

* twin_not_abstract_matched_complete.tsv: a tab-delimited file consisting of pairs of MEDLINE articles with identical titles, authors and years of publication. This file contains the PMIDs of the duplicate publications, as well as their medical subject headings (MeSH) and three measures of their indexing consistency.
* twin_abstract_matched_complete.tsv: the same as above, except that the MEDLINE articles also have matching abstracts.
* mesh_training_data.csv: a comma-separated file containing the training data for the model discussed in the dissertation.
* mesh_scores.tsv: a tab-delimited file containing a pairwise similarity score based on word embeddings, and MeSH hierarchy relationship.

## Duplicate MEDLINE Publications

Both twin_not_abstract_matched_complete.tsv and twin_abstract_matched_complete.tsv have the same structure. They have the following columns:

1. pmid_one: the PubMed unique identifier of the first paper
2. pmid_two: the PubMed unique identifier of the second paper
3. mesh_one: a list of medical subject headings (MeSH) from the first paper, delimited by the "|" character
4. mesh_two: a list of medical subject headings from the second paper, delimited by the "|" character
5. hoopers_consistency: the calculation of Hooper's consistency between the MeSH of the first and second paper
6. nonhierarchicalfree: a word embedding based consistency score described in the dissertation
7. hierarchicalfree: a word embedding based consistency score additionally limited by the MeSH hierarchy, described in the dissertation

## MeSH Training Data

The mesh_training_data.csv file contains the training data for the model discussed in the dissertation. It has the following columns:

1. pmid: the PubMed unique identifier of the paper
2. term: a candidate MeSH term
3. cit_count: the log of the frequency of the term in the citation candidate set
4. total_cit: the log of the total number of the paper's citations
5. citr_count: the log of the frequency of the term in the citations of the paper's citations
6. total_citofcit: the log of the total number of the citations of the paper's citations
7. absim_count: the log of the frequency of the term in the AbSim candidate set
8. total_absim_count: the log of the total number of AbSim records for the paper
9. absimr_count: the log of the frequency of the term in the citations of the AbSim records
10. total_absimr_count: the log of the total number of citations of the AbSim record
11. log_medline_frequency: the log of the frequency of the candidate term in MEDLINE
12. relevance: a binary indicator (True/False) of whether the candidate term was assigned to the target paper

## Cosine Similarity

The mesh_scores.tsv file contains a pairwise list of all MeSH terms including their cosine similarity based on the word embedding described in the dissertation. Because the MeSH hierarchy is also used in many of the evaluation measures, the relationship of the term pair is also included. It has the following columns:

1. mesh_one: a string of the first MeSH heading
2. mesh_two: a string of the second MeSH heading
3. cosine_similarity: the cosine similarity between the terms
4. relationship_type: a string identifying the relationship type, consisting of none, parent/child, sibling, ancestor and direct (terms are identical, i.e. a direct hierarchy match)

The mesh_model.bin file is a binary word2vec C format file containing the MeSH term embeddings. It was generated using version 3.7.2 of the Python gensim library (https://radimrehurek.com/gensim/). For an example of how to load the model file, see https://radimrehurek.com/gensim/models/word2vec.html#usage-examples, specifically the directions for loading the "word2vec C format."
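
As an illustration of how the mesh_one and mesh_two columns might be compared, the sketch below computes Hooper's consistency in its commonly cited form: the number of headings both records share, divided by the number of distinct headings used by either record. The function names are hypothetical, and the dissertation's exact calculation (for example, how subheadings or major-topic markers are treated) may differ.

```python
def parse_mesh(field: str) -> set:
    """Split a "|"-delimited MeSH field into a set of headings."""
    return {term.strip() for term in field.split("|") if term.strip()}


def hoopers_consistency(mesh_one: str, mesh_two: str) -> float:
    """Shared headings divided by all distinct headings (assumed definition)."""
    first = parse_mesh(mesh_one)
    second = parse_mesh(mesh_two)
    union = first | second
    if not union:
        # Neither record has any headings; treat consistency as 0 by convention.
        return 0.0
    return len(first & second) / len(union)


# Two headings in common out of four distinct headings.
print(hoopers_consistency("Humans|Neoplasms|Female", "Humans|Neoplasms|Male"))
```

Comparing a value computed this way against the hoopers_consistency column is one way to check whether the assumed definition matches the one used to build the files.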
keywords: MEDLINE;MeSH;Medical Subject Headings;Indexing
published: 2019-08-29
This is the published ortholog set derived from whole genome data used for the analysis of members of the B. tabaci complex of whiteflies. It includes the concatenated alignment and individual gene alignments used for analyses (Link to publication: https://www.mdpi.com/1424-2818/11/9/151).
published: 2020-10-01
These data were collected to assess whether color pattern phenotypes of the polymorphic tortoise beetle, Chelymorpha alternans, mate randomly with one another, and whether there are any reproductive differences between assortative and disassortative pairings.
keywords: mate choice, color polymorphisms, random mating
published: 2017-08-11
Enclosed in this dataset are transport data of kagome connected artificial spin ice networks composed of permalloy nanowires. The data herein are reproductions of the data seen in Appendix B of the dissertation titled "Magnetotransport of Connected Artificial Spin Ice". Field sweeps with the magnetic field applied in-plane were performed in 5 degree increments for armchair orientation kagome artificial spin ice and zigzag orientation kagome artificial spin ice.
keywords: Magnetotransport; artificial spin ice; nanowires
published: 2020-05-15
Trained models for multi-task multi-dataset learning for sequence prediction in tweets. Tasks include POS, NER, Chunking, and SuperSenseTagging. Models were trained using: https://github.com/napsternxg/SocialMediaIE/blob/master/experiments/multitask_multidataset_experiment.py. See https://github.com/napsternxg/SocialMediaIE for details.
keywords: twitter; deep learning; machine learning; trained models; multi-task learning; multi-dataset learning
published: 2021-05-10
This dataset contains the emulated global multi-model urban daily temperature projections under the RCP 8.5 scenario. The dataset is derived from the study "Large model structural uncertainty in global projections of urban heat waves" (XXXX). Details about this dataset and the local urban climate emulator are described in the article. This dataset documents the global urban daily temperatures of 17 CMIP5 Earth system models for 2006-2015 and 2061-2070. This dataset may be useful for multiple communities working on urban climate change, heat waves, impacts, vulnerability, risks, and adaptation applications.
keywords: Urban heat waves; CMIP; urban warming; heat stress; urban climate change
published: 2019-03-19
This repository includes scripts and datasets for the paper, "TreeMerge: A new method for improving the scalability of species tree estimation methods." The latest version of TreeMerge can be downloaded from Github (https://github.com/ekmolloy/treemerge).
keywords: divide-and-conquer; statistical consistency; species trees; incomplete lineage sorting; phylogenomics
published: 2019-01-27
This repository includes datasets that are studied with INC/INC-ML/INC-NJ in the paper "Using INC within Divide-and-Conquer Phylogeny Estimation," which was submitted to AICoB 2019. Each dataset has its own readme.txt that further describes the creation process and the other parameters and software used in making these datasets. The latest implementation of INC/INC-ML/INC-NJ can be found at https://github.com/steven-le-thien/constraint_inc. Note: there may be files with DS_STORE as extension in the datasets; please ignore these files.
keywords: phylogenetics; gene tree estimation; divide-and-conquer; absolute fast converging
published: 2023-02-10
Data and documentation for the Ornithological Applications manuscript “Integrating multiple data sources improves prediction and inference for upland game bird occupancy models” by Robert L. Emmet, Thomas J. Benson, Maximilian L. Allen, and Kirk W. Stodola. We combined data from the North American Breeding Bird Survey and eBird with a targeted survey (IDNR upland game) to estimate habitat use of northern bobwhite and ring-necked pheasant in Illinois and to document the efficiency and overlap among the various data sources. Data include eBird, USGS Breeding Bird Survey, National Land Cover Database, upland game bird surveys, and stream data.
keywords: data integration; occupancy; avian population modelling; northern bobwhite; Colinus virginianus; ring-necked pheasant; Phasianus colchicus
published: 2023-02-07
Data sets from "DISCO+QR: Rooting Species Trees in the Presence of GDL and ILS." It contains trees and sequences simulated with gene duplication and loss under a variety of different conditions.
Note:
- trees.tar.gz contains the simulated gene-family trees used in our experiments (both true trees from SimPhy as well as trees estimated from alignments).
- alignments.tar.gz contains simulated sequence data used for estimating the gene-family trees.
keywords: evolution; computational biology; bioinformatics; phylogenetics
published: 2023-04-06
This is a simulated sequence dataset generated using INDELible and processed via a sequence fragmentation procedure.
keywords: sequence length heterogeneity;indelible;computational biology;multiple sequence alignment
published: 2021-04-11
This dataset contains RNASim1000, Cox1-Het datasets as well as analyses of RNASim1000, Cox1-Het, and 1000M1(HF).
keywords: phylogeny estimation; maximum likelihood; RAxML; IQ-TREE; FastTree; cox1; heterotachy; disjoint tree mergers; Tree of Life
published: 2018-12-13
The dataset contains a complete example (inputs, outputs, code, intermediate results, visualization webpage) of executing the Height Above Nearest Drainage (HAND) workflow with CyberGIS-Jupyter.
keywords: cybergis; hydrology; Jupyter
published: 2021-09-03
All of the files in this dataset pertain to the evaluation of a novel statistic, Hind/He, for distinguishing Mendelian loci from paralogs. They are derived from a RAD-seq genotyping dataset of diploid and tetraploid Miscanthus sacchariflorus.
published: 2021-03-15
Dataset associated with "Hiding in plain sight: genetic confirmation of putative Louisiana Fatmucket Lampsilis hydiana in Illinois" as submitted to Freshwater Mollusk Biology and Conservation by Stodola et al.

Images are from cataloged specimens from the Illinois Natural History Survey (INHS) Mollusk Collection in Champaign, Illinois that were used for genetic research. File names indicate the species as confirmed in Stodola et al. (i.e., Lampsilis siliquoidea or Lampsilis hydiana) followed by the INHS Mollusk Collection catalog number, followed by the individual specimen number, followed by shell view (interior or exterior). If no specimen number is noted in the file name, there is only one specimen for that catalog number. For example: Lsiliquoidea_46515_1_2_3_exterior.

Images were created by photographing specimens on a metric grid in an OrTech Photo-e-Box Plus with a Nikon D610 single lens reflex camera using a 60mm lens. Images were post-processed (cropping, image rotation, and auto contrast) in Adobe Photoshop and saved as TIFF files using no image compression, interleaved pixel order, and IBM PC Byte Order. One additional partial lot, INHS Mollusk Catalog No. 37059 (shown with both interior and exterior view in one image), is included for reference but was not genetically sequenced.

A .csv file contains an index of all specimens photographed, with the following columns:
SPECIES: species confirmed using genetic analyses
GENE: cox1 or nad1 mitochondrial gene
ACCESSION: GenBank accession number
INHS CATALOG NO: Illinois Natural History Survey Mollusk Collection Catalog number
WATERBODY: waterbody where specimen was collected
PUTATIVE SPECIES: species determination based on morphological characters prior to genetic analysis

Phylogenetic sequence data (.nex files) were aligned using BioEdit (Hall, T.A. 1999. BioEdit: a user-friendly biological sequence alignment editor and analysis program for Windows 95/98/NT. Nucleic Acids Symposium Series 41:95-98.). Pertinent methodology for the analysis is contained within the manuscript submitted by Stodola et al. to Freshwater Mollusk Biology and Conservation. In these files, "N" is a standard symbol for an unknown base.
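
The file naming convention described above could be split into its parts roughly as follows. This is a minimal sketch under stated assumptions: the name is passed without its file extension, the parts are separated by underscores, and every part between the catalog number and the shell view is a specimen number (so the example name is read as specimens 1, 2, and 3, which is an interpretation, not something the description confirms). The function name and the second example name are hypothetical.

```python
def parse_image_name(stem: str) -> dict:
    """Split an image file name (no extension) into its documented parts."""
    parts = stem.split("_")
    if len(parts) < 3:
        raise ValueError(f"expected species_catalog[_specimen...]_view, got {stem!r}")
    view = parts[-1]
    if view not in ("interior", "exterior"):
        raise ValueError(f"unexpected shell view {view!r} in {stem!r}")
    return {
        "species": parts[0],
        "catalog_no": parts[1],
        # Empty list: no specimen number noted, i.e. a single specimen in the lot.
        "specimens": parts[2:-1],
        "view": view,
    }


print(parse_image_name("Lsiliquoidea_46515_1_2_3_exterior"))
# Hypothetical name with no specimen number (catalog number is made up).
print(parse_image_name("Lhydiana_00000_interior"))
```

The catalog number is kept as a string so that it can be matched directly against the INHS CATALOG NO column of the .csv index.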
keywords: Lampsilis hydiana; Lampsilis siliquoidea; unionid; Louisiana Fatmucket; Fatmucket; genetic confirmation
published: 2018-04-19
Author-ity 2009 baseline dataset. Prepared by Vetle Torvik 2009-12-03.

The dataset comes in the form of 18 compressed (.gz) Linux text files named authority2009.part00.gz - authority2009.part17.gz. The total size should be ~17.4GB uncompressed.

• How was the dataset created?

The dataset is based on a snapshot of PubMed (which includes Medline and PubMed-not-Medline records) taken in July 2009, containing a total of 19,011,985 Article records and 61,658,514 author name instances. Each instance of an author name is uniquely represented by the PMID and the position on the paper (e.g., 10786286_3 is the third author name on PMID 10786286). Thus, each cluster is represented by a collection of author name instances. The instances were first grouped into "blocks" by last name and first name initial (including some close variants), and then each block was separately subjected to clustering. Details are described in:

Torvik, V., & Smalheiser, N. (2009). Author name disambiguation in MEDLINE. ACM Transactions On Knowledge Discovery From Data, 3(3), doi:10.1145/1552303.1552304

Torvik, V. I., Weeber, M., Swanson, D. R., & Smalheiser, N. R. (2005). A Probabilistic Similarity Metric for Medline Records: A Model for Author Name Disambiguation. Journal Of The American Society For Information Science & Technology, 56(2), 140-158. doi:10.1002/asi.20105

Note that for Author-ity 2009, some new predictive features (e.g., grants, citation matches, temporal, affiliation phrases) and a post-processing merging procedure were applied (to capture name variants not captured during blocking, e.g., matches for subsets of compound last names, and nicknames with a different first initial like Bill and William), and a temporal feature was used; this has not yet been written up for publication.

• How accurate is the 2009 dataset (compared to 2006 and 2008)?

The recall reported for 2006 of 98.8% has been much improved in 2009 (because common last name variants are now captured). Compared to 2006, both years 2008 and 2009 overall seem to exhibit a higher rate of splitting errors but a lower rate of lumping errors. This reflects an overall decrease in prior probabilities, possibly because of e.g. a) a new prior estimation procedure that avoids wild estimates (by dampening the magnitude of iterative changes); b) the inclusion in 2008 and 2009 of items in PubMed-not-Medline (including in-process items); and c) the dramatic (exponential) increase in frequencies of some names (J. Lee went from ~16,000 occurrences in 2006 to 26,000 in 2009). However, splitting is reduced in 2009 for some special cases, such as NIH-funded investigators who list their grant number on their papers. Compared to 2008, splitting errors were reduced overall in 2009 while maintaining the same level of lumping errors.

• What is the format of the dataset?

The cluster summaries for 2009 are much more extensive than in the 2008 dataset. Each line corresponds to a predicted author-individual represented by a cluster of author name instances and a summary of all the corresponding papers and author name variants (and if there are > 10 papers in the cluster, an identical summary of the 10 most recent papers). Each cluster has a unique Author ID (which is uniquely identified by the PMID of the earliest paper in the cluster and the author name position). The summary has the following tab-delimited fields:

1. blocks separated by '||'; each block may consist of multiple lastname-first initial variants separated by '|'
2. prior probabilities of the respective blocks separated by '|'
3. Cluster number relative to the block ordered by cluster size (some are listed as 'CLUSTER X' when they were derived from multiple blocks)
4. Author ID (or cluster ID) e.g., bass_c_9731334_2 represents a cluster where 9731334_2 is the earliest author name instance. Although not needed for uniqueness, the ID also has the most frequent lastname_firstinitial (lowercased).
5. cluster size (number of author name instances on papers)
6. name variants separated by '|' with counts in parenthesis. Each variant has the format lastname_firstname middleinitial, suffix
7. last name variants separated by '|'
8. first name variants separated by '|'
9. middle initial variants separated by '|' ('-' if none)
10. suffix variants separated by '|' ('-' if none)
11. email addresses separated by '|' ('-' if none)
12. range of years (e.g., 1997-2009)
13. Top 20 most frequent affiliation words (after stoplisting and tokenizing; some phrases are also made) with counts in parenthesis; separated by '|'; ('-' if none)
14. Top 20 most frequent MeSH (after stoplisting; "-") with counts in parenthesis; separated by '|'; ('-' if none)
15. Journals with counts in parenthesis (separated by "|")
16. Top 20 most frequent title words (after stoplisting and tokenizing) with counts in parenthesis; separated by '|'; ('-' if none)
17. Co-author names (lowercased lastname and first/middle initials) with counts in parenthesis; separated by '|'; ('-' if none)
18. Co-author IDs with counts in parenthesis; separated by '|'; ('-' if none)
19. Author name instances (PMID_auno separated by '|')
20. Grant IDs (after normalization; "-" if none given; separated by "|")
21. Total number of times cited. (Citations are based on references extracted from PMC.)
22. h-index
23. Citation counts (e.g., for h-index): PMIDs by the author that have been cited (with total citation counts in parenthesis); separated by "|"
24. Cited: PMIDs that the author cited (with counts in parenthesis) separated by "|"
25. Cited-by: PMIDs that cited the author (with counts in parenthesis) separated by "|"
26-47. same summary as for 4-25 except that the 10 most recent papers were used (based on year; so if papers 10, 11, 12... have the same year, one is selected arbitrarily)
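
The identifier formats in fields 4 and 19 can be unpacked with a few string operations. The sketch below assumes only what the description states: an author name instance is PMID_position, and an Author ID is the most frequent lastname_firstinitial followed by the earliest instance. The function and type names are hypothetical, and splitting from the right is an assumption made so that any underscore inside the name part is left untouched.

```python
from typing import List, NamedTuple, Tuple


class NameInstance(NamedTuple):
    pmid: int
    position: int  # 1-based position of the author name on the paper


def parse_instance(token: str) -> NameInstance:
    """Parse an author name instance such as '10786286_3'."""
    pmid, position = token.rsplit("_", 1)
    return NameInstance(int(pmid), int(position))


def parse_author_id(author_id: str) -> Tuple[str, NameInstance]:
    """Parse an Author ID such as 'bass_c_9731334_2' (field 4)."""
    # The last two underscore-separated parts are the earliest instance.
    name_key, pmid, position = author_id.rsplit("_", 2)
    return name_key, NameInstance(int(pmid), int(position))


def parse_instances(field: str) -> List[NameInstance]:
    """Parse field 19: author name instances separated by '|'."""
    return [parse_instance(token) for token in field.split("|") if token]


print(parse_author_id("bass_c_9731334_2"))
print(parse_instances("9731334_2|10786286_3"))
```

Because the earliest instance is embedded in the Author ID, it should also appear in the cluster's own list of author name instances, which gives a simple consistency check when reading the files.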
keywords: Bibliographic databases; Name disambiguation; MEDLINE; Library information networks