Datasets (102)

planned publication date: 2019-01-01
 
Illinois CWD deer management information and similar proportion of land cover type grouping for NLCD for Illinois
keywords: Chronic wasting disease, wildlife management, sharpshooting
published: 2018-07-13
 
Qualitative data collected from the websites of undergraduate research journals between October 2014 and May 2015. Two CSV files. The first file, "Sample", includes the sample of journals with secondary data collected. The second file, "Population", includes the remainder of the population for which secondary data was not collected. Note: the rows do not add up to 800 as indicated in the article because rows were deleted for journals that had broken links or defunct websites during the random sampling process.
keywords: undergraduate research; undergraduate journals; scholarly communication; libraries; liaison librarianship
published: 2018-05-16
 
These data are for two companion papers on the use of LSPIV obtained from UAS (i.e., drones) to measure flow structure in streams. The LSPIV1 folder contains spreadsheet data used in each case referred to in Table 1 in the manuscript. In the spreadsheets, there is a cell that denotes which figure was constructed with which data. The LSPIV2 folder contains spreadsheets with the data used to construct the figures; the spreadsheets are labeled by figure.
keywords: LSPIV; drone; UAS; flow structure; rivers
published: 2018-06-20
 
The dataset includes the data used in the study of Classical Topological Order in the Kinetics of Artificial Spin Ice. This includes the photoemission electron microscopy intensity measurement of artificial spin ice at different temperatures as a function of time. The data includes the raw data, the metadata, and the data cookbook. Please refer to the data cookbook for more information. Note: vertex_population.xlsx file in the meta_data_code folder can be disregarded.
keywords: artificial spin ice; PEEM; topological order
planned publication date: 2018-09-01
 
Ammonia flux measurement data using flux gradient and relaxed eddy accumulation methods, and ancillary environmental data collected during the 2014 corn-growing season in Central Illinois, USA. This excel file contains two spreadsheets: one README sheet, and one sheet containing all data. These data were used in the development of the manuscript titled "Ammonia Flux Measurements above a Corn Canopy using Relaxed Eddy Accumulation and a Flux Gradient System."
keywords: Ammonia; Bi-directional Flux; Corn; Relaxed Eddy Accumulation; Flux Gradient; Urease Inhibitor
published: 2018-06-18
 
This repository contains datasets and R scripts that were used in a study of the population structure of Miscanthus sacchariflorus in its native range across East Asia. Notably, genotypes of 764 individuals at 34,605 SNPs, called from reduced-representation DNA sequencing using a non-reference bioinformatics pipeline, are provided. Two similar SNP datasets, used for identifying clonal duplicates and for determining the ancestry of ornamental and hybrid Miscanthus plants identified in previous studies respectively, are also provided. There is also a spreadsheet listing the provenance and ploidy of all individuals along with their plastid (chloroplast) haplotypes. Software output for Structure, Treemix, and DIYABC is also included. See README.txt for more information about individual files. Results of this study are described in a manuscript in revision in Annals of Botany by the same authors, "Population structure of Miscanthus sacchariflorus reveals two major polyploidization events, tetraploid-mediated unidirectional introgression from diploid Miscanthus sinensis, and diversity centered around the Yellow Sea."
keywords: Miscanthus; restriction site-associated DNA sequencing (RAD-seq); single nucleotide polymorphism (SNP); population genetics; Miscanthus xgiganteus; Miscanthus sacchariflorus; R scripts; germplasm; plastid haplotype
published: 2018-06-06
 
DNDC scripts and outputs that were generated as a part of the research publication 'Evaluation of DeNitrification DeComposition Model for Estimating Ammonia Fluxes from Chemical Fertilizer Application'.
keywords: DNDC; REA; ammonia emissions; fertilizers; uncertainty analysis
published: 2018-06-05
 
A complete building coverage area dataset (i.e., the area occupied by building structures, excluding other built surfaces such as roads, parking lots, and public parks) at the level of census block groups for the contiguous United States (CONUS). The dataset was assembled based on an ensemble prediction of nonlinear hierarchical models to account for spatial heterogeneities in the distribution of built surfaces across different urban communities. Percentage of impervious land and housing density were used as predictors of the estimated area of buildings, and cross-validation results showed that the product estimated the area represented by buildings with a mean error of 0.049%.
keywords: Building Coverage Area; Urban Geography; Regional; Sustainability; US Census Block Groups; CONUS Data
published: 2018-04-26
 
GBS data from soybean lines carrying introgressions from Glycine tomentella. This project is led by Dr. Randy Nelson, USDA scientist at the University of Illinois. Fastq files contain raw Illumina data. Txt files are keyfiles containing barcodes for each genetic entity.
published: 2018-04-05
 
GBS data from Phaseolus accessions, for a study led by Dr. Glen Hartman, UIUC. The (zipped) fastq file can be processed with the TASSEL GBS pipeline or other pipelines for SNP calling. The related article has been submitted and the methods section describes the data processing in detail.
published: 2018-05-01
 
GBS data for G. max x G. soja crosses, a project led by Dr. Randy Nelson.
published: 2018-06-01
 
We summarize peer-reviewed literature reporting associations between three ‘Tier 2’ indicators (β-glucosidase (BG), fluorescein diacetate (FDA) hydrolysis, and permanganate oxidizable carbon (POXC)) and crop yield and greenhouse gas emissions. Peer-reviewed articles published between January 1990 and December 2017 were searched using the Thomson Reuters Web of Science database (Thomson Reuters, Philadelphia, Pennsylvania) and Google Scholar to identify studies reporting results for: “β-glucosidase”, “permanganate oxidizable carbon”, “active carbon”, “readily oxidizable carbon”, or “fluorescein diacetate hydrolysis”, together with one or more of the following: “crop yield”, “productivity”, “greenhouse gas”, “CO2”, “CH4”, or “N2O”. Metadata for records include associated descriptor variables and covariates useful for scoring function development, which include: 1) identifying factors for the study site (location, and year in which data were reported), 2) soil textural class and pH, 3) depth of sampling, 4) analytical methods for quantification (e.g., loss on ignition, combustion), 5) units used in published works (e.g., equivalent mass, concentration), 6) SOC class (L, M, H), and 7) summary statistics for correlation between SQIs and functions.
keywords: Soil health promoting practices; Soil quality indicators; β-glucosidase; fluorescein diacetate hydrolysis; Permanganate oxidizable carbon; Greenhouse gas emissions; Scoring curves; Soil Management Assessment Framework
published: 2018-06-01
 
Dataset compiled by Yushu Xia and Michelle Wander for the Soil Health Institute. Data were recovered from peer-reviewed literature reporting results for three ‘Tier 2’ indicators (β-glucosidase (BG), fluorescein diacetate (FDA) hydrolysis, and permanganate oxidizable carbon (POXC)) in terms of their relative response to management, where soils under cover crops, grassland cover, organic amendments, and residue return were compared to conventionally managed controls. Peer-reviewed articles published between January 1990 and December 2017 were searched using the Thomson Reuters Web of Science database (Thomson Reuters, Philadelphia, Pennsylvania) and Google Scholar to identify studies reporting results for: “β-glucosidase”, “permanganate oxidizable carbon”, “active carbon”, “readily oxidizable carbon”, and “fluorescein diacetate hydrolysis”, together with one or more of the following: “management practice”, “tillage”, “cover crop”, “residue”, “organic fertilizer”, or “manure”. Records were tabulated to compare SQI abundance in soil maintained under a control (conventional cropping) with that found under soil health promoting and soil aggrading practices, with the intent to contribute to SQI databases that will support development of interpretive frameworks and/or algorithms, including pedo-transfer functions relating indicator abundance to management practices and site-specific factors. Metadata include key descriptor variables and covariates useful for development of scoring functions, which include: 1) identifying factors for the study site (location, year of initiation of study, and year in which data were reported), 2) soil textural class and pH, 3) depth of sampling, 4) analytical methods for quantification (e.g., loss on ignition, combustion), 5) units used in published works (e.g., equivalent mass, concentration), 6) SOC class (L, M, H), and 7) statistical significance of difference comparisons.
keywords: Soil health promoting practices; Soil quality indicators; β-glucosidase; fluorescein diacetate hydrolysis; Permanganate oxidizable carbon; Greenhouse gas emissions; Scoring curves; Soil Management Assessment Framework
planned publication date: 2019-05-22
 
This is the experimental data of isolated nanomagnet islands with or without the presence of large nanomagnet islands. The small islands are made of Permalloy materials with size of 170 nm by 470 nm by 2.5 nm. The systems are measured at a temperature where the small islands are fluctuating around room temperature. The data is recorded as photoemission electron microscopy intensity. More details about the data can be found in the note.txt and Spe_2016.xlsx file. Note: The raw data folders are stored in five volumes during the compression. All five volumes are needed in order to recover the original folder.
keywords: artificial spin ice; magnetism
planned publication date: 2019-05-20
 
This is the experimental data of tetris artificial spin ice. The islands are made of Permalloy materials with size of 170 nm by 470 nm by 2.5 nm. The systems are measured at a temperature where the islands are fluctuating around room temperature. The data is recorded as photoemission electron microscopy intensity. More details about the dataset can be found in the files Note.txt and Tetris_data_list.xlsx. Note: two files, named bl11_teris600_033 and bl11_tetris600_2_135, are not recorded in the Excel sheet because they were corrupted during the measurement. Any data that is not recorded in the Excel sheet is either corrupted or of low quality. From files *_028 to *_049, tetris is spelled with “t” while in the raw data folder without “t”. This is a typo. Throughout the dataset, tetris and teris are supposed to have the same meaning.
keywords: artificial spin ice
published: 2018-05-21
 
This dataset contains bonding networks and tolerance ranges for geometric magnetic dimensionality. The data can be searched in the html frontend above, code obtained at the GitHub repository, or the raw data can be downloaded as csv below. The csv data contains the results of 42520 compounds (unique icsd_code) from ICSD FindIt v3.5.0. The csv is semicolon-delimited since some fields contain multiple comma-separated values.
keywords: materials science; physics; magnetism; crystallography
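
Because the csv is semicolon-delimited and some fields hold several comma-separated values, a reader has to set the delimiter explicitly and split the multi-valued fields afterwards. The following is a minimal sketch using only the Python standard library. Apart from icsd_code, the column names and the sample rows are hypothetical and are not taken from the dataset.

```python
import csv
import io

# Hypothetical sample rows; only "icsd_code" is a column name mentioned in
# the dataset description. "elements" and "dimensionality" are assumptions.
SAMPLE = (
    "icsd_code;elements;dimensionality\n"
    "1234;Fe,O;3\n"
    "5678;Cu,Cl,H,O;2\n"
)


def read_semicolon_csv(handle, multi_valued=("elements",)):
    """Parse semicolon-delimited rows and split multi-valued fields on commas."""
    rows = []
    for row in csv.DictReader(handle, delimiter=";"):
        for name in multi_valued:
            row[name] = row[name].split(",")
        rows.append(row)
    return rows


rows = read_semicolon_csv(io.StringIO(SAMPLE))
for row in rows:
    print(row["icsd_code"], row["elements"])
```

For the real file, pass an open file handle instead of the in-memory sample and adjust `multi_valued` to the columns that actually contain lists.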
published: 2018-05-06
 
This deposit contains all raw data and analysis from the paper "In-cell titration of small solutes controls protein stability and aggregation". Data is collected into several types:
1) analysis*.tar.gz are the analysis scripts and the resulting data for each cell. The numbers correspond to the numbers shown in Fig. S1 of the publication.
2) scripts.tar.gz contains helper scripts to create the dataset in bash format.
3) input.tar.gz contains headers and other information that is fed into bash scripts to create the dataset.
4) All rawData*.tar.gz are tarballs of the data of cells in different solutes in .mat files readable by MATLAB, as follows:
- Each experiment included in the publication is represented by two MATLAB files: (1) a calibration jump under amber illumination (_calib.mat suffix) and (2) a full jump under blue illumination (FRET data).
- Each file contains the following fields:
    coordleft - coordinates of cropped and aligned acceptor channel on the original image
    coordright - coordinates of cropped and aligned donor channel on the original image
    dataleft - a 3D 12-bit integer matrix containing acceptor channel fluorescence for each pixel and time step. Not available in _calib files.
    dataright - a 3D 12-bit integer matrix containing donor channel fluorescence for each pixel and time step. This will be mCherry in _calib files and AcGFP in data files.
    frame1 - original image size
    imgstd - cropped dimensions
    numFrames - number of frames in dataleft and dataright
    videos - a structure containing camera data. Specifically, videos.TimeStamp includes the time from each frame.
keywords: Live cell; FRET microscopy; osmotic challenge; intracellular titrations; protein dynamics
published: 2018-04-23
 
Self-citation analysis data based on PubMed Central subset (2002-2005)
----------------------------------------------------------------------
Created by Shubhanshu Mishra, Brent D. Fegley, Jana Diesner, and Vetle Torvik on April 5th, 2018

## Introduction
This is a dataset created as part of the publication titled: Mishra S, Fegley BD, Diesner J, Torvik VI (2018) Self-Citation is the Hallmark of Productive Authors, of Any Gender. PLOS ONE. It contains files for running the self-citation analysis on articles published in PubMed Central between 2002 and 2005, collected in 2015. The dataset is distributed in the form of the following tab-separated text files:
* Training_data_2002_2005_pmc_pair_First.txt (1.2G) - Data for first authors
* Training_data_2002_2005_pmc_pair_Last.txt (1.2G) - Data for last authors
* Training_data_2002_2005_pmc_pair_Middle_2nd.txt (964M) - Data for middle 2nd authors
* Training_data_2002_2005_pmc_pair_txt.header.txt - Header for the data
* COLUMNS_DESC.txt file - Descriptions of all columns
* model_text_files.tar.gz - Text files containing model coefficients and scores for model selection.
* results_all_model.tar.gz - Model coefficient and result files in numpy format used for plotting purposes. v4.reviewer contains models for analysis done after reviewer comments.
* README.txt file

## Dataset creation
Our experiments relied on data from multiple sources, including proprietary data from Thomson Reuters' (now Clarivate Analytics) Web of Science collection of MEDLINE citations (https://clarivate.com/products/web-of-science/databases/). Authors interested in reproducing our experiments should personally request this data from Clarivate Analytics. However, we do make available a similar but open dataset based on citations from PubMed Central, which can be used to get results similar to those reported in our analysis. Furthermore, we have also freely shared our datasets, which can be used along with the citation datasets from Clarivate Analytics to re-create the dataset used in our experiments. These datasets are listed below. If you wish to use any of those datasets, please make sure you cite both the dataset and the paper introducing the dataset.
* MEDLINE 2015 baseline: https://www.nlm.nih.gov/bsd/licensee/2015_stats/baseline_doc.html
* Citation data from PubMed Central (original paper includes additional citations from Web of Science)
* Author-ity 2009 dataset:
  - Dataset citation: Torvik, Vetle I.; Smalheiser, Neil R. (2018): Author-ity 2009 - PubMed author name disambiguated dataset. University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-4222651_V1
  - Paper citation: Torvik, V. I., & Smalheiser, N. R. (2009). Author name disambiguation in MEDLINE. ACM Transactions on Knowledge Discovery from Data, 3(3), 1–29. https://doi.org/10.1145/1552303.1552304
  - Paper citation: Torvik, V. I., Weeber, M., Swanson, D. R., & Smalheiser, N. R. (2004). A probabilistic similarity metric for Medline records: A model for author name disambiguation. Journal of the American Society for Information Science and Technology, 56(2), 140–158. https://doi.org/10.1002/asi.20105
* Genni 2.0 + Ethnea for identifying author gender and ethnicity:
  - Dataset citation: Torvik, Vetle (2018): Genni + Ethnea for the Author-ity 2009 dataset. University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-9087546_V1
  - Paper citation: Smith, B. N., Singh, M., & Torvik, V. I. (2013). A search engine approach to estimating temporal changes in gender orientation of first names. In Proceedings of the 13th ACM/IEEE-CS joint conference on Digital libraries - JCDL ’13. ACM Press. https://doi.org/10.1145/2467696.2467720
  - Paper citation: Torvik VI, Agarwal S. Ethnea -- an instance-based ethnicity classifier based on geo-coded author names in a large-scale bibliographic database. International Symposium on Science of Science March 22-23, 2016 - Library of Congress, Washington DC, USA. http://hdl.handle.net/2142/88927
* MapAffil for identifying article country of affiliation:
  - Dataset citation: Torvik, Vetle I. (2018): MapAffil 2016 dataset -- PubMed author affiliations mapped to cities and their geocodes worldwide. University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-4354331_V1
  - Paper citation: Torvik VI. MapAffil: A Bibliographic Tool for Mapping Author Affiliation Strings to Cities and Their Geocodes Worldwide. D-Lib magazine : the magazine of the Digital Library Forum. 2015;21(11-12):10.1045/november2015-torvik. http://doi.org/10.1045/november2015-torvik
* IMPLICIT journal similarity:
  - Dataset citation: Torvik, Vetle (2018): Author-implicit journal, MeSH, title-word, and affiliation-word pairs based on Author-ity 2009. University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-4742014_V1
* Novelty dataset for identifying article-level novelty:
  - Dataset citation: Mishra, Shubhanshu; Torvik, Vetle I. (2018): Conceptual novelty scores for PubMed articles. University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-5060298_V1
  - Paper citation: Mishra S, Torvik VI. Quantifying Conceptual Novelty in the Biomedical Literature. D-Lib magazine : The Magazine of the Digital Library Forum. 2016;22(9-10):10.1045/september2016-mishra. https://doi.org/10.1045/september2016-mishra
  - Code: https://github.com/napsternxg/Novelty
* Expertise dataset for identifying author expertise on articles:
* Source code provided at: https://github.com/napsternxg/PubMed_SelfCitationAnalysis

**Note: The dataset is based on a snapshot of PubMed (which includes Medline and PubMed-not-Medline records) taken in the first week of October, 2016.** See https://www.nlm.nih.gov/databases/download/pubmed_medline.html for information on obtaining PubMed/MEDLINE and for NLM's data Terms and Conditions. Additional data-related updates can be found at the Torvik Research Group site (http://abel.ischool.illinois.edu).

## Acknowledgments
This work was made possible in part with funding to VIT from NIH grant P01AG039347 (https://projectreporter.nih.gov/project_info_description.cfm?aid=8475017&icde=18058490) and NSF grant 1348742 (http://www.nsf.gov/awardsearch/showAward?AWD_ID=1348742). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

## License
Self-citation analysis data based on PubMed Central subset (2002-2005) by Shubhanshu Mishra, Brent D. Fegley, Jana Diesner, and Vetle Torvik is licensed under a Creative Commons Attribution 4.0 International License. Permissions beyond the scope of this license may be available at https://github.com/napsternxg/PubMed_SelfCitationAnalysis.
keywords: Self citation; PubMed Central; Data Analysis; Citation Data;
published: 2018-04-23
 
Conceptual novelty analysis data based on PubMed Medical Subject Headings
----------------------------------------------------------------------
Created by Shubhanshu Mishra and Vetle I. Torvik on April 16th, 2018

## Introduction
This is a dataset created as part of the publication titled: Mishra S, Torvik VI. Quantifying Conceptual Novelty in the Biomedical Literature. D-Lib magazine : the magazine of the Digital Library Forum. 2016;22(9-10):10.1045/september2016-mishra. It contains final data generated as part of our experiments based on the MEDLINE 2015 baseline and the MeSH tree from 2015. The dataset is distributed in the form of the following tab-separated text files:
* PubMed2015_NoveltyData.tsv - Novelty scores for each paper in PubMed. The file contains 22,349,417 rows and 6 columns, as follows:
  - PMID: PubMed ID
  - Year: year of publication
  - TimeNovelty: time novelty score of the paper based on individual concepts (see paper)
  - VolumeNovelty: volume novelty score of the paper based on individual concepts (see paper)
  - PairTimeNovelty: time novelty score of the paper based on pairs of concepts (see paper)
  - PairVolumeNovelty: volume novelty score of the paper based on pairs of concepts (see paper)
* mesh_scores.tsv - Temporal profiles for each MeSH term for all years. The file contains 1,102,831 rows and 5 columns, as follows:
  - MeshTerm: name of the MeSH term
  - Year: year
  - AbsVal: total publications with that MeSH term in the given year
  - TimeNovelty: age (in years since first publication) of the MeSH term in the given year
  - VolumeNovelty: age (in number of papers since first publication) of the MeSH term in the given year
* meshpair_scores.txt.gz (36 GB uncompressed) - Temporal profiles for each pair of MeSH terms for all years
  - Mesh1: name of the first MeSH term (alphabetically sorted)
  - Mesh2: name of the second MeSH term (alphabetically sorted)
  - Year: year
  - AbsVal: total publications with that MeSH pair in the given year
  - TimeNovelty: age (in years since first publication) of the MeSH pair in the given year
  - VolumeNovelty: age (in number of papers since first publication) of the MeSH pair in the given year
* README.txt file

## Dataset creation
This dataset was constructed using multiple datasets described in the following locations:
* MEDLINE 2015 baseline: https://www.nlm.nih.gov/bsd/licensee/2015_stats/baseline_doc.html
* MeSH tree 2015: ftp://nlmpubs.nlm.nih.gov/online/mesh/2015/meshtrees/
* Source code provided at: https://github.com/napsternxg/Novelty

Note: The dataset is based on a snapshot of PubMed (which includes Medline and PubMed-not-Medline records) taken in the first week of October, 2016. See https://www.nlm.nih.gov/databases/download/pubmed_medline.html for information on obtaining PubMed/MEDLINE and for NLM's data Terms and Conditions. Additional data-related updates can be found at the Torvik Research Group site (http://abel.ischool.illinois.edu).

## Acknowledgments
This work was made possible in part with funding to VIT from NIH grant P01AG039347 (https://projectreporter.nih.gov/project_info_description.cfm?aid=8475017&icde=18058490) and NSF grant 1348742 (http://www.nsf.gov/awardsearch/showAward?AWD_ID=1348742). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

## License
Conceptual novelty analysis data based on PubMed Medical Subject Headings by Shubhanshu Mishra and Vetle Torvik is licensed under a Creative Commons Attribution 4.0 International License. Permissions beyond the scope of this license may be available at https://github.com/napsternxg/Novelty
keywords: Conceptual novelty; bibliometrics; PubMed; MEDLINE; MeSH; Medical Subject Headings; Analysis;
published: 2018-04-23
 
Contains a series of datasets that score pairs of tokens (words, journal names, and controlled vocabulary terms) based on how often they co-occur within versus across authors' collections of papers. The tokens derive from four different fields of PubMed papers: journal, affiliation, title, MeSH (medical subject headings). Thus, there are 10 different datasets, one for each pair of token types: affiliation-word vs affiliation-word, affiliation-word vs journal, affiliation-word vs mesh, affiliation-word vs title-word, mesh vs mesh, mesh vs journal, etc. Using authors to link papers, and in turn pairs of tokens, is an alternative to the usual within-document co-occurrences and to using, e.g., citations to link papers. This is particularly striking for journal pairs because a paper almost always appears in a single journal and so within-document co-occurrences are 0, i.e., useless. The tokens are taken from the Author-ity 2009 dataset, which has a cluster of papers for each inferred author and a summary of each field. For MeSH, title-words, and affiliation-words, that summary includes only the top-20 most frequent tokens after field-specific stoplisting (e.g., university is stoplisted from affiliation and Humans is stoplisted from MeSH).
The score for a pair of tokens A and B is defined as follows. Suppose Ai and Bi are the number of occurrences of token A (and B, respectively) across the i-th author's papers, then
nA = sum(Ai); nB = sum(Bi)
nAB = sum(Ai*Bi) if A is not equal to B; nAA = sum(Ai*(Ai-1)/2) otherwise
nAnB = nA*nB if A is not equal to B; nAnA = nA*(nA-1)/2 otherwise
score = 1000000*nAB/nAnB if A is not equal to B; 1000000*nAA/nAnA otherwise
Token pairs are excluded when: score < 5, or nA < cut-off, or nB < cut-off, or nAB < cut-offAB. The cut-offs differ for token types and can be inferred from the datasets. For example, cut-off = 200 and cut-offAB = 20 for journal pairs.
Each dataset has the following 7 tab-delimited all-ASCII columns:
1: score: roughly the number of the tokens' co-occurrences divided by the total number of pairs, in parts per million (ppm), ranging from 5 to 1,000,000
2: nAB: total number of co-occurrences
3: nAnB: total number of pairs
4: nA: number of occurrences of token A
5: nB: number of occurrences of token B
6: A: token A
7: B: token B
We made some of these datasets as early as 2011 as we were working to link PubMed authors with USPTO inventors, where the vocabulary usage is strikingly different, but also more recently to create links from PubMed authors to their dissertations and NIH/NSF investigators, and to help disambiguate PubMed authors. Going beyond explicit matching (exact within-field match) is particularly useful when data is sparse (think old papers lacking controlled vocabulary and affiliations, or papers with metadata written in different languages) and when making links across databases with different kinds of fields and vocabulary (think PubMed vs USPTO records). We never published a paper on this, but our work inspired the more refined measures described in:
D′Souza JL, Smalheiser NR (2014) Three Journal Similarity Metrics and Their Application to Biomedical Journals. PLOS ONE 9(12): e115681. https://doi.org/10.1371/journal.pone.0115681
Smalheiser, N., & Bonifield, G. (2016). Two Similarity Metrics for Medical Subject Headings (MeSH): An Aid to Biomedical Text Mining and Author Name Disambiguation. DISCO: Journal of Biomedical Discovery and Collaboration, 7. doi:http://dx.doi.org/10.5210/disco.v7i0.6654
keywords: PubMed; MeSH; token; name disambiguation
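
The score definition in this description can be illustrated with a short calculation. The sketch below is not the authors' code. It assumes per-author occurrence counts are already available as plain lists, reads nB as the sum of the Bi counts, and uses a hypothetical function name (`pair_score`) and toy numbers.

```python
def pair_score(a_counts, b_counts=None):
    """Return the co-occurrence score in parts per million.

    a_counts[i] and b_counts[i] are the occurrences of tokens A and B across
    the i-th author's papers. Pass b_counts=None to score A against itself.
    The caller must ensure the denominator is non-zero.
    """
    n_a = sum(a_counts)
    if b_counts is None:
        # Same-token case: count unordered pairs within each author.
        n_ab = sum(a * (a - 1) // 2 for a in a_counts)
        n_pairs = n_a * (n_a - 1) // 2
    else:
        n_b = sum(b_counts)
        n_ab = sum(a * b for a, b in zip(a_counts, b_counts))
        n_pairs = n_a * n_b
    return 1_000_000 * n_ab / n_pairs


# Toy counts for three hypothetical authors.
a = [2, 0, 1]
b = [1, 3, 0]
print(pair_score(a, b))  # nAB = 2, nAnB = 12
print(pair_score(a))     # nAA = 1, nAnA = 3
```

The exclusion rules (score < 5 and the cut-offs on nA, nB, and nAB) are left out of the sketch; they would be applied as a filter on the values computed here.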
published: 2018-04-23
 
Provides links to Author-ity 2009, including records from principal investigators (on NIH and NSF grants), inventors on USPTO patents, and students/advisors on ProQuest dissertations. Note that NIH and NSF differ in the type of fields they record and standards used (e.g., institution names). Typically an NSF grant spanning multiple years is associated with one record, while an NIH grant occurs in multiple records, for each fiscal year, sub-projects/supplements, possibly with different principal investigators. The prior probability of match (i.e., that the author exists in Author-ity 2009) varies dramatically across NIH grants, NSF grants, and USPTO patents. The great majority of NIH principal investigators have one or more papers in PubMed, but a minority of NSF principal investigators (except in biology) have papers in PubMed, and even fewer USPTO inventors do. This prior probability has been built into the calculation of match probabilities. The NIH data were downloaded from NIH exporter and the older NIH CRISP files.
The NIH dataset has 2,353,387 records, only includes ones with match probability > 0.5, and has the following 12 fields:
1 app_id
2 nih_full_proj_nbr
3 nih_subproj_nbr
4 fiscal_year
5 pi_position
6 nih_pi_names
7 org_name
8 org_city_name
9 org_bodypolitic_code
10 age: number of years since their first paper
11 prob: the match probability to au_id
12 au_id: Author-ity 2009 author ID
The NSF dataset has 262,452 records, only includes ones with match probability > 0.5, and has the following 10 fields:
1 AwardId
2 fiscal_year
3 pi_position
4 PrincipalInvestigators
5 Institution
6 InstitutionCity
7 InstitutionState
8 age: number of years since their first paper
9 prob: the match probability to au_id
10 au_id: Author-ity 2009 author ID
There are two files for USPTO because here we linked disambiguated authors in PubMed (from Author-ity 2009) with disambiguated inventors.
The USPTO linking dataset has 309,720 records, only includes ones with match probability > 0.5, and has the following 3 fields:
1 au_id: Author-ity 2009 author ID
2 inv_id: USPTO inventor ID
3 prob: the match probability of au_id vs inv_id
The disambiguated inventors file (uiuc_uspto.tsv) has 2,736,306 records and the following 7 fields:
1 inv_id: USPTO inventor ID
2 is_lower
3 is_upper
4 fullnames
5 patents: patent IDs separated by '|'
6 first_app_yr
7 last_app_yr
keywords: PubMed; USPTO; Principal investigator; Name disambiguation
published: 2018-04-19
 
MapAffil 2016 dataset: PubMed author affiliations mapped to cities and their geocodes worldwide. Prepared by Vetle Torvik 2018-04-05. The dataset comes as a single tab-delimited Latin-1 encoded file (only the City column uses non-ASCII characters), and should be about 3.5GB uncompressed.

• How was the dataset created? The dataset is based on a snapshot of PubMed (which includes Medline and PubMed-not-Medline records) taken in the first week of October, 2016. For information on obtaining PubMed/MEDLINE and NLM's data Terms and Conditions, see https://www.nlm.nih.gov/databases/download/pubmed_medline.html
• Affiliations are linked to a particular author on a particular article. Prior to 2014, NLM recorded the affiliation of the first author only. However, MapAffil 2016 covers some PubMed records lacking affiliations whose affiliations were harvested elsewhere: from PMC (e.g., PMID 22427989), NIH grants (e.g., 1838378), and Microsoft Academic Graph and ADS (e.g., 5833220).
• Affiliations are pre-processed (e.g., transliterated into ASCII from UTF-8 and HTML), so they may differ (sometimes a lot; see PMID 27487542) from PubMed records.
• All affiliation strings were processed using the MapAffil procedure to identify and disambiguate the most specific place-name, as described in: Torvik VI. MapAffil: A bibliographic tool for mapping author affiliation strings to cities and their geocodes worldwide. D-Lib Magazine 2015; 21 (11/12). 10p
• For coverage statistics over time, see Fig. 4 (https://doi.org/10.1186/s41182-017-0073-6) in the following article: Palmblad M, Torvik VI. Spatiotemporal analysis of tropical disease research combining Europe PMC and affiliation mapping web services. Tropical medicine and health. 2017 Dec;45(1):33. Expect to see big upticks in coverage of PMIDs around 1988 and for non-first authors in 2014.
• The code and back-end data are periodically updated and made available for query by PMID at the Torvik Research Group site (http://abel.ischool.illinois.edu/).
• What is the format of the dataset? The dataset contains 37,406,692 rows. Each row (line) in the file has a unique PMID and author position (e.g., 10786286_3 is the third author name on PMID 10786286), and the following thirteen tab-delimited columns. All columns are ASCII, except city, which contains Latin-1.
1. PMID: positive non-zero integer; int(10) unsigned
2. au_order: positive non-zero integer; smallint(4)
3. lastname: varchar(80)
4. firstname: varchar(80); NLM started including these in 2002 but many have been harvested from outside PubMed
5. year of publication
6. type: EDU, HOS, EDU-HOS, ORG, COM, GOV, MIL, UNK
7. city: varchar(200); typically 'city, state, country' but could include further subdivisions; unresolved ambiguities are concatenated by '|'
8. state: Australia, Canada and USA (which includes territories like PR, GU, AS, and post-codes like AE and AA)
9. country
10. journal
11. lat: at most 3 decimals (only available when city is not a country or state)
12. lon: at most 3 decimals (only available when city is not a country or state)
13. fips: varchar(5); for USA only; retrieved by lat-lon query to https://geo.fcc.gov/api/census/block/find
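The column layout above can be read with a few lines of Python. The following is a minimal sketch, assuming the file has no header row and that unavailable lat, lon, and fips values appear as empty strings (neither is stated in the description). The short column names and the sample row, including the author name and journal, are hypothetical.

```python
# Short names for the thirteen columns; "year" abbreviates "year of publication".
MAPAFFIL_COLUMNS = ["PMID", "au_order", "lastname", "firstname", "year", "type",
                    "city", "state", "country", "journal", "lat", "lon", "fips"]


def parse_mapaffil_line(line):
    """Parse one tab-delimited MapAffil row into a dict with typed values."""
    values = line.rstrip("\n").split("\t")
    if len(values) != len(MAPAFFIL_COLUMNS):
        raise ValueError(f"expected {len(MAPAFFIL_COLUMNS)} columns, got {len(values)}")
    row = dict(zip(MAPAFFIL_COLUMNS, values))
    row["PMID"] = int(row["PMID"])
    row["au_order"] = int(row["au_order"])
    # Unresolved place-name ambiguities are concatenated by '|'.
    row["city_candidates"] = row["city"].split("|")
    # Assumption: missing coordinates are stored as empty strings.
    for key in ("lat", "lon"):
        row[key] = float(row[key]) if row[key] else None
    return row


# Hypothetical Latin-1 encoded row (only the city column is non-ASCII).
sample = (b"10786286\t3\tDoe\tJane\t2000\tEDU\tZ\xfcrich, Switzerland\t\t"
          b"Switzerland\tJ Example\t47.367\t8.55\t\n")
row = parse_mapaffil_line(sample.decode("latin-1"))
print(row["PMID"], row["au_order"], row["city"], row["lat"], row["lon"])
```

When reading the full file, opening it with open(path, encoding="latin-1") and iterating line by line avoids loading the roughly 3.5GB file into memory at once.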
keywords: PubMed; MEDLINE; Digital Libraries; Bibliographic Databases; Author Affiliations; Geographic Indexing; Place Name Ambiguity; Geoparsing; Geocoding; Toponym Extraction; Toponym Resolution
published: 2018-04-19
 
Prepared by Vetle Torvik 2018-04-15. The dataset comes as a single tab-delimited ASCII encoded file, and should be about 717MB uncompressed.

• How was the dataset created? First and last names of authors in the Author-ity 2009 dataset were processed through several tools to predict ethnicities and gender, including:
Ethnea+Genni, as described in: Torvik VI, Agarwal S. Ethnea -- an instance-based ethnicity classifier based on geocoded author names in a large-scale bibliographic database. International Symposium on Science of Science March 22-23, 2016 - Library of Congress, Washington, DC, USA. http://hdl.handle.net/2142/88927 and Smith, B., Singh, M., & Torvik, V. (2013). A search engine approach to estimating temporal changes in gender orientation of first names. Proceedings Of The ACM/IEEE Joint Conference On Digital Libraries, (JCDL 2013 - Proceedings of the 13th ACM/IEEE-CS Joint Conference on Digital Libraries), 199-208. doi:10.1145/2467696.2467720
EthnicSeer: http://singularity.ist.psu.edu/ethnicity Treeratpituk P, Giles CL (2012). Name-Ethnicity Classification and Ethnicity-Sensitive Name Matching. Proceedings of the Twenty-Sixth Conference on Artificial Intelligence (pp. 1141-1147). AAAI-12. Toronto, ON, Canada
SexMachine 0.1.1: https://pypi.org/project/SexMachine
First names, for some Author-ity records lacking them, were harvested from outside bibliographic databases.
• The code and back-end data are periodically updated and made available for query at the Torvik Research Group site (http://abel.ischool.illinois.edu).
• What is the format of the dataset? The dataset contains 9,300,182 rows and 10 columns:
1. auid: unique ID for authors in Author-ity 2009 (PMID_authorposition)
2. name: full name used as input to EthnicSeer
3. EthnicSeer: predicted ethnicity; ARA, CHI, ENG, FRN, GER, IND, ITA, JAP, KOR, RUS, SPA, VIE, XXX
4. prop: decimal between 0 and 1 reflecting the confidence of the EthnicSeer prediction
5. lastname: used as input for Ethnea+Genni
6. firstname: used as input for Ethnea+Genni
7. Ethnea: predicted ethnicity; either one of 26 (AFRICAN, ARAB, BALTIC, CARIBBEAN, CHINESE, DUTCH, ENGLISH, FRENCH, GERMAN, GREEK, HISPANIC, HUNGARIAN, INDIAN, INDONESIAN, ISRAELI, ITALIAN, JAPANESE, KOREAN, MONGOLIAN, NORDIC, POLYNESIAN, ROMANIAN, SLAV, THAI, TURKISH, VIETNAMESE) or two ethnicities (e.g., SLAV-ENGLISH), or UNKNOWN (if there are no one or two dominant predictions), or TOOSHORT (if both first and last name are too short)
8. Genni: predicted gender; 'F', 'M', or '-'
9. SexMac: predicted gender based on a third-party Python program (default settings except case_sensitive=False); female, mostly_female, andy, mostly_male, male
10. SSNgender: predicted gender based on US SSN data; 'F', 'M', or '-'
keywords: Androgyny; Bibliometrics; Data mining; Search engine; Gender; Semantic orientation; Temporal prediction; Textual markers
published: 2018-04-19
 
Author-ity 2009 baseline dataset. Prepared by Vetle Torvik 2009-12-03. The dataset comes in the form of 18 compressed (.gz) Linux text files named authority2009.part00.gz - authority2009.part17.gz. The total size should be ~17.4GB uncompressed.

• How was the dataset created? The dataset is based on a snapshot of PubMed (which includes Medline and PubMed-not-Medline records) taken in July 2009, with a total of 19,011,985 Article records and 61,658,514 author name instances. Each instance of an author name is uniquely represented by the PMID and the position on the paper (e.g., 10786286_3 is the third author name on PMID 10786286). Thus, each cluster is represented by a collection of author name instances. The instances were first grouped into "blocks" by last name and first name initial (including some close variants), and then each block was separately subjected to clustering. Details are described in:
Torvik, V., & Smalheiser, N. (2009). Author name disambiguation in MEDLINE. ACM Transactions On Knowledge Discovery From Data, 3(3), doi:10.1145/1552303.1552304
Torvik, V. I., Weeber, M., Swanson, D. R., & Smalheiser, N. R. (2005). A Probabilistic Similarity Metric for Medline Records: A Model for Author Name Disambiguation. Journal Of The American Society For Information Science & Technology, 56(2), 140-158. doi:10.1002/asi.20105
Note that for Author-ity 2009, some new predictive features (e.g., grants, citation matches, temporal, affiliation phrases) and a post-processing merging procedure were applied (to capture name variants not captured during blocking, e.g., matches for subsets of compound last names, and nicknames with a different first initial like Bill and William). These changes have not yet been written up for publication.
• How accurate is the 2009 dataset (compared to 2006 and 2008)? The recall reported for 2006 of 98.8% has been much improved in 2009 (because common last name variants are now captured). Compared to 2006, both 2008 and 2009 overall seem to exhibit a higher rate of splitting errors but a lower rate of lumping errors. This reflects an overall decrease in prior probabilities, possibly because of a) a new prior estimation procedure that avoids wild estimates (by dampening the magnitude of iterative changes); b) the inclusion in 2008 and 2009 of items in PubMed-not-Medline (including in-process items); and c) the dramatic (exponential) increase in frequencies of some names (J. Lee went from ~16,000 occurrences in 2006 to 26,000 in 2009). However, splitting is reduced in 2009 for some special cases, like NIH-funded investigators who list their grant number on their papers. Compared to 2008, splitting errors were reduced overall in 2009 while maintaining the same level of lumping errors.
• What is the format of the dataset? The cluster summaries for 2009 are much more extensive than in the 2008 dataset. Each line corresponds to a predicted author-individual represented by a cluster of author name instances and a summary of all the corresponding papers and author name variants (and, if there are > 10 papers in the cluster, an identical summary of the 10 most recent papers). Each cluster has a unique Author ID (which is uniquely identified by the PMID of the earliest paper in the cluster and the author name position). The summary has the following tab-delimited fields:
1. blocks separated by '||'; each block may consist of multiple lastname-first initial variants separated by '|'
2. prior probabilities of the respective blocks separated by '|'
3. Cluster number relative to the block, ordered by cluster size (some are listed as 'CLUSTER X' when they were derived from multiple blocks)
4. Author ID (or cluster ID), e.g., bass_c_9731334_2 represents a cluster where 9731334_2 is the earliest author name instance. Although not needed for uniqueness, the ID also has the most frequent lastname_firstinitial (lowercased).
5. cluster size (number of author name instances on papers)
6. name variants separated by '|' with counts in parentheses. Each variant is of the format lastname_firstname middleinitial, suffix
7. last name variants separated by '|'
8. first name variants separated by '|'
9. middle initial variants separated by '|' ('-' if none)
10. suffix variants separated by '|' ('-' if none)
11. email addresses separated by '|' ('-' if none)
12. range of years (e.g., 1997-2009)
13. Top 20 most frequent affiliation words (after stoplisting and tokenizing; some phrases are also made) with counts in parentheses; separated by '|' ('-' if none)
14. Top 20 most frequent MeSH (after stoplisting) with counts in parentheses; separated by '|' ('-' if none)
15. Journals with counts in parentheses; separated by '|'
16. Top 20 most frequent title words (after stoplisting and tokenizing) with counts in parentheses; separated by '|' ('-' if none)
17. Co-author names (lowercased lastname and first/middle initials) with counts in parentheses; separated by '|' ('-' if none)
18. Co-author IDs with counts in parentheses; separated by '|' ('-' if none)
19. Author name instances (PMID_auno separated by '|')
20. Grant IDs (after normalization); separated by '|' ('-' if none given)
21. Total number of times cited. (Citations are based on references extracted from PMC.)
22. h-index
23. Citation counts (e.g., for h-index): PMIDs by the author that have been cited (with total citation counts in parentheses); separated by '|'
24. Cited: PMIDs that the author cited (with counts in parentheses); separated by '|'
25. Cited-by: PMIDs that cited the author (with counts in parentheses); separated by '|'
26-47. same summary as for 4-25, except that only the 10 most recent papers were used (based on year; so if papers 10, 11, 12... have the same year, one is selected arbitrarily)
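To make the field conventions above concrete, here is a small sketch that splits an Author ID into its parts and turns a '|'-separated field with counts in parentheses into a dictionary. The exact textual layout of the counted items (assumed here to be "item(count)") is an assumption, as are the function names; the sample values other than bass_c_9731334_2 are hypothetical.

```python
import re

# Assumed layout of a counted item: any text followed by "(<integer>)".
COUNTED_ITEM = re.compile(r"^(.*)\((\d+)\)$")


def parse_author_id(author_id):
    """Split an ID such as 'bass_c_9731334_2' into name key, PMID, and position."""
    name_key, pmid, position = author_id.rsplit("_", 2)
    return {"name_key": name_key, "pmid": int(pmid), "position": int(position)}


def parse_counted_field(field):
    """Parse 'a(3)|b(1)' into {'a': 3, 'b': 1}; '-' denotes an empty field."""
    if field == "-":
        return {}
    counts = {}
    for item in field.split("|"):
        match = COUNTED_ITEM.match(item.strip())
        if match:
            counts[match.group(1).strip()] = int(match.group(2))
    return counts


print(parse_author_id("bass_c_9731334_2"))
print(parse_counted_field("ecology(12)|illinois(3)"))  # hypothetical field content
```

Using rsplit with a limit of 2 keeps compound name keys intact, since the PMID and position are always the last two underscore-separated parts of the ID.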
keywords: Bibliographic databases; Name disambiguation; MEDLINE; Library information networks
published: 2018-03-28
 
Bibliotelemetry data are provided in support of the evaluation of Internet of Things (IoT) middleware within library collections. IoT infrastructure within the physical library environment is the basis for an integrative, hybrid approach to digital resource recommenders. The IoT infrastructure provides mobile, dynamic wayfinding support for items in the collection, which includes features for location-based recommendations. The modular evaluation and analysis herein clarify the nature of users’ requests for recommendations based on their location and describe the subject areas of the library for which users request recommendations. The modular mobile design allowed for deep exploration of bibliographic identifiers as they appeared throughout the global module system, serving to provide context to the searching and browsing data that are the focus of this study.
keywords: internet of things; IoT; academic libraries; bibliographic classification
published: 2018-03-14
 
These data include information on a field experiment on Castilleja coccinea (L.) Spreng., scarlet Indian paintbrush (Orobanchaceae). There is intraspecific variation in scarlet Indian paintbrush in the color of the bracts surrounding the flowers. Two bract color morphs were included in this study, the scarlet and yellow morphs. The experiment was conducted at Illinois Beach State Park in 2012. The aim of the work was to compare the color morphs with regard to 1) self-compatibility, 2) response to pollinator exclusion, 3) cross-compatibility between the color morphs, and 4) relative female fertility and male fitness. Three files are attached with this record. The raw data are in "fruitSet.csv" and "seedSet.csv", while "readme.txt" has detailed explanations of the raw data files.
keywords: Castilleja coccinea; Orobanchaceae; floral color polymorphism; bract color polymorphism; breeding system; hand-pollination; self-compatibility; reproductive assurance
published: 2018-03-08
 
This dataset was developed to create a census of sufficiently documented molecular biology databases to answer several preliminary research questions. Articles published in the annual Nucleic Acids Research (NAR) “Database Issues” were used to identify a population of databases for study. Namely, the questions addressed herein include: 1) what is the historical rate of database proliferation versus rate of database attrition?, 2) to what extent do citations indicate persistence?, and 3) are databases under active maintenance and does evidence of maintenance likewise correlate to citation? An overarching goal of this study is to provide the ability to identify subsets of databases for further analysis, both as presented within this study and through subsequent use of this openly released dataset.
keywords: databases; research infrastructure; sustainability; data sharing; molecular biology; bioinformatics; bibliometrics
published: 2018-03-01
 
Data were used to analyze patterns in predator-specific nest predation on shrubland birds in Illinois as related to landscape composition at multiple landscape scales. Data were used in a Journal of Applied Ecology research paper of the same name. Data were collected between 2011 and 2014 at sites in east-central and northeastern Illinois, USA as part of a Ph.D. research project on the relationship between avian nest predation and landscape characteristics, and how nest predation affects adult and nestling bird behavior.
keywords: nest predation; avian ecology; land cover; landscape composition; landscape scale; nest camera; nest survival; predator-specific mortality; scale-dependence; scrubland; shrub-nesting bird
published: 2018-03-01
 
The data set consists of Illumina sequences derived from 48 sediment samples, collected in 2015 from Lake Michigan and Lake Superior for the purpose of inventorying the fungal diversity in these two lakes. DNA was extracted from ca. 0.5 g of sediment using MoBio PowerSoil DNA isolation kits following the Earth Microbiome protocol. PCR was completed with the fungal primers ITS1F and fITS7 using the Fluidigm Access Array. The resulting amplicons were sequenced on the Illumina HiSeq 2500 platform with rapid 2 x 250 nt paired-end reads. The enclosed data sets contain the forward read files for both primers, both fixed-header index files, and the associated map files needed for processing in QIIME. In addition, enclosed are two rarefied OTU files used to evaluate fungal diversity. The decimal latitude and longitude coordinates of all collecting sites are also included.

File descriptions:
Great_lakes_Map_coordinates.xlsx = coordinates of sample sites

QIIME processing, ITS1 region: These are the raw files used to process the ITS1 Illumina reads in QIIME. Only forward reads were processed.
GL_ITS1_HW_mapFile_meta.txt = The map file used in QIIME.
ITS1F_Miller_Fludigm_I1_fixedheader.fastq = Index file from Illumina. Headers were fixed to match the forward reads (R1) file in order to process in QIIME.
ITS1F_Miller_Fludigm_R1.fastq = Forward Illumina reads for the ITS1 region.

QIIME processing, ITS2 region: These are the raw files used to process the ITS2 Illumina reads in QIIME. Only forward reads were processed.
GL_ITS2_HW_mapFile_meta.txt = The map file used in QIIME.
ITS7_Miller_Fludigm_I1_Fixedheaders.fastq = Index file from Illumina. Headers were fixed to match the forward reads (R1) file in order to process in QIIME.
ITS7_Miller_Fludigm_R1.fastq = Forward Illumina reads for the ITS2 region.

Resulting OTU table and OTU table with taxonomy, ITS1 region:
wahl_ITS1_R1_otu_table.csv = Representative OTUs based on the ITS1 region for all the R1 data and the number of each OTU found in each sample.
wahl_ITS1_R1_otu_table_w_tax.csv = Representative OTUs based on the ITS1 region for all the R1 data and the number of each OTU found in each sample, along with taxonomic determination based on the following database: sh_taxonomy_qiime_ver7_97_s_31.01.2016_dev

Resulting OTU table and OTU table with taxonomy, ITS2 region:
wahl_ITS2_R1_otu_table.csv = Representative OTUs based on the ITS2 region for all the R1 data and the number of each OTU found in each sample.
wahl_ITS2_R1_otu_table_w_tax.csv = Representative OTUs based on the ITS2 region for all the R1 data and the number of each OTU found in each sample, along with taxonomic determination based on the following database: sh_taxonomy_qiime_ver7_97_s_31.01.2016_dev

Rarefied Illumina dataset for each ITS region:
ITS1_R1_nosing_rare_5000.csv = Environmental parameters and rarefied OTU dataset for the ITS1 region.
ITS2_R1_nosing_rare_5000.csv = Environmental parameters and rarefied OTU dataset for the ITS2 region.

Column headings:
#SampleID = code including researcher initials and sequential run number
BarcodeSequence =
LinkerPrimerSequence = two sequences used: CTTGGTCATTTAGAGGAAGTAA or GTGARTCATCGAATCTTTG
ReversePrimer = two sequences used: GCTGCGTTCTTCATCGATGC or TCCTCCGCTTATTGATATGC
run_prefix = initials of run operator
Sample = location code; see thesis figures 1 and 2 for mapped locations and Great_lakes_Map_coordinates.xlsx for exact coordinates
DepthGroup = S = shallow (50-100 m), MS = mid-shallow (101-150 m), MD = mid-deep (151-200 m), and D = deep (>200 m)
Depth_Meters = depth in meters
Lake = lake name, Michigan or Superior
Nitrogen %
Carbon %
Date = mm/dd/yyyy
pH = acidity, potential of Hydrogen (pH) scale
SampleDescription = sample or control
X = sequential run number
OTU ID = operational taxonomic unit ID
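The DepthGroup codes listed under the column headings can be reproduced from Depth_Meters with a simple threshold function. This is only a sketch: the function name is hypothetical, and treating non-integer depths between the listed ranges (e.g., 100.5 m) as belonging to the deeper group, and depths under 50 m as unclassified, are assumptions not stated in the description.

```python
def depth_group(depth_m):
    """Map a depth in meters to the DepthGroup code (S, MS, MD, D)."""
    if depth_m < 50:
        return None   # below the shallowest listed range; left unclassified here
    if depth_m <= 100:
        return "S"    # shallow (50-100 m)
    if depth_m <= 150:
        return "MS"   # mid-shallow (101-150 m)
    if depth_m <= 200:
        return "MD"   # mid-deep (151-200 m)
    return "D"        # deep (>200 m)


for depth in (75, 101, 200, 250):
    print(depth, depth_group(depth))
```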
keywords: Illumina; next-generation sequencing; ITS; fungi
published: 2017-11-29
 
This dataset contains genotypic and phenotypic data, R scripts, and the results of analysis pertaining to a multi-location field trial of Miscanthus sinensis. Genome-wide association and genomic prediction were performed for biomass yield and 14 yield-component traits across six field trial locations in Asia and North America, using 46,177 single-nucleotide polymorphism (SNP) markers mined from restriction site-associated DNA sequencing (RAD-seq) and 568 M. sinensis accessions. Genomic regions and candidate genes were identified that can be used for breeding improved varieties of M. sinensis, which in turn will be used to generate new M. xgiganteus clones for biomass.
keywords: miscanthus; genotyping-by-sequencing (GBS); genome-wide association studies (GWAS); genomic selection
published: 2017-12-01
 
This dataset contains all the numerical results (digital elevation models) that are presented in the paper "Landscape evolution models using the stream power incision model show unrealistic behavior when m/n equals 0.5." The paper can be found at: http://www.earth-surf-dynam-discuss.net/esurf-2017-15/ The paper has been accepted, but the most up to date version may not be available at the link above. If so, please contact Jeffrey Kwang at jeffskwang@gmail.com to obtain the most up to date manuscript.
keywords: landscape evolution models; digital elevation model
published: 2017-12-04
 
Data used for Zaya et al. (2018), published in Invasive Plant Science and Management DOI 10.1017/inp.2017.37, are made available here. There are three spreadsheet files (CSV) available, as well as a text file that has detailed descriptions for each file ("readme.txt"). One spreadsheet file ("prices.csv") gives pricing information, associated with Figure 3 in Zaya et al. (2018). The other two spreadsheet files are associated with the genetic analysis, where one file contains raw data for biallelic microsatellite loci ("genotypes.csv") and the other ("structureResults.csv") contains the results of Bayesian clustering analysis with the program STRUCTURE. The genetic data may be especially useful for future researchers. The genetic data contain the genotypes of the horticultural samples that were the focus of the published article, and also genotypes of nearly 400 wild plants. More information on the location of the wild plant collections can be found in the Supplemental information for Zaya et al. (2015) Biological Invasions 17:2975–2988 DOI 10.1007/s10530-015-0926-z. See "readme.txt" for more information.
keywords: Horticultural industry; invasive species; microsatellite DNA; mislabeling; molecular testing
published: 2017-12-15
 
These are the results of an 8-month cohort study in two commercial dairy herds in Northwest Illinois. From each herd, 50 cows were selected at random, stratified over lactations 1 to 3. Serum from these animals was collected every two months and tested for antibodies to Bovine Leukosis Virus, Neospora caninum, and Mycobacterium avium subsp. paratuberculosis. Animals that left the herd during the study were replaced by another animal in the same herd and lactation. At the last sampling, serum neutralization assays were performed for Bovine Herpesvirus type 1 and Bovine Viral Diarrhea virus types 1 and 2. Production data from before and after sampling were collected for the entire herd from PCdart.
keywords: serostatus; dairy; production; cohort
published: 2017-12-18
 
This dataset accompanies a thesis of the same title: Can fair use be adequately taught to Librarians? Assessing Librarians' confidence and comprehension in explaining fair use following an expert workshop.
keywords: fair use; copyright
published: 2017-12-14
 
Objectives: This study follows up on previous work that began examining data deposited in an institutional repository. The work here extends the earlier study by answering the following research questions: (1) What is the file composition of datasets ingested into the University of Illinois at Urbana-Champaign campus repository? Are datasets more likely to be single-file or multiple-file items? (2) What usage data are associated with these datasets? Which items are most popular?

Methods: The dataset records collected in this study were identified by filtering item types categorized as "data" or "dataset" using the advanced search function in IDEALS. Returned search results were collected in an Excel spreadsheet to include data such as the Handle identifier, date ingested, file formats, composition code, and the download count from the item's statistics report. The Handle identifier represents the dataset record's persistent identifier. Composition represents codes that categorize items as single or multiple file deposits. Date available represents the date the dataset record was published in the campus repository. Download statistics were collected via a website link for each dataset record and indicate the number of times the dataset record has been downloaded. Once the data were collected, they were used to evaluate datasets deposited into IDEALS.

Results: A total of 522 datasets were identified for analysis covering the period between January 2007 and August 2016. This study revealed two influxes, one during 2008-2009 and one in 2014. During the first time frame, a large number of PDFs were deposited by the Illinois Department of Agriculture, whereas in 2014 Microsoft Excel files were deposited by the Rare Books and Manuscript Library. Single file datasets clearly dominate the deposits in the campus repository. The total download count for all datasets was 139,663, and the average number of downloads per month per file across all datasets was 3.2.

Conclusion: Academic librarians, repository managers, and research data services staff can use the results presented here to anticipate the nature of research data that may be deposited within institutional repositories. With increased awareness, content recruitment, and improvements, IRs can provide a viable cyberinfrastructure for researchers to deposit data, but much can be learned from the data already deposited. Awareness of trends can help librarians facilitate discussions with researchers about research data deposits as well as better tailor their services to address short-term and long-term research needs.
keywords: research data; research statistics; institutional repositories; academic libraries
published: 2017-12-20
 
The dataset contains processed model fields used to generate data, figures and tables in the Journal of Geophysical Research article "Investigating the linear dependence of direct and indirect radiative forcing on emission of carbonaceous aerosols in a global climate model." The processed data are monthly averaged cloud properties (CCN, CDNC and LWP) and forcing variables (DRF and IRF) at original CAM5 spatial resolution (1.9° by 2.5°). Raw model output fields from CAM5 simulations are available through NERSC upon request. Please find more detailed information in the ReadMe file.
keywords: carbonaceous aerosols; radiative forcing; emission; linearity
published: 2018-01-03
 
Concatenated sequence alignment, phylogenetic analysis files, and relevant software parameter files from a cophylogenetic study of Brueelia-complex lice and their avian hosts. The sequence alignment file includes a list of character blocks for each gene alignment and the parameters used for the MrBayes phylogenetic analysis. 1) Files from the MrBayes analyses: a) a file with 100 random post-burnin trees (50% burnin) used in the cophylogenetic analysis - analysisrandom100_trees_brueelia.tre b) a majority rule consensus tree - treeconsensus_tree_brueelia.tre c) a maximum clade credibility tree - mcc_tree_brueelia.tre The tree tips are labeled with louse voucher names, and can be referenced in Supplementary Table 1 of the associated publication. 2) Files related to a BEAST analysis with COI data: a) the XML file used as input for the BEAST run, including model parameters, MCMC chain length, and priors - beast_parameters_coi_brueelia.xml b) a file with 100 random post-burnin trees (10% burnin) from the BEAST posterior distribution of trees; used in OTU analysis - beast_100random_trees_brueelia.tre c) an ultrametric maximum clade credibility tree - mcc_tree_beast_brueelia.tre 3) A maximum clade credibility tree of Brueelia-complex host species generated from a distribution of trees downloaded from https://birdtree.org/subsets/ - mcc_tree_brueelia_hosts.tre 4) Concatenated sequence alignment - concatenated_alignment_brueelia.nex
keywords: bird lice; Brueelia-complex; passerines; multiple sequence alignment; phylogenetic tree; Bayesian phylogenetic analysis; MrBayes; BEAST
published: 2018-01-13
 
This dataset provides time series data (Aug. - Sep. 2016) of sun-induced chlorophyll fluorescence, photosynthesis, photosynthetically active radiation, and associated vegetation indices that were collected in a soybean field on the farm of the University of Illinois at Urbana-Champaign. The data contain 255 records and 6 variables (PPFD-IN: Photosynthetically active radiation; GPP: Gross Primary Production; SIF: Sun-Induced Fluorescence; NDVI: Normalized Difference Vegetation Index; Rededge: Rededge Index; Redege_NDVI: Rededge Normalized Difference Vegetation Index). The timestamp uses standard time. Data are available from 8 am to 4 pm (corresponding to 9 am to 5 pm local time) every day.
keywords: sun-induced chlorophyll fluorescence; photosynthesis; soybean
published: 2018-02-22
 
Datasets used in the study, "OCTAL: Optimal Completion of Gene Trees in Polynomial Time," under review at Algorithms for Molecular Biology. Note: DS_STORE file in 25gen-10M folder can be disregarded.
keywords: phylogenomics; missing data; coalescent-based species tree estimation; gene trees
published: 2018-01-11
 
Dataset includes the structure and values of a causal model for Training Quality in nuclear power plants. Each entry refers to a piece of evidence supporting causality in the Training Quality causal model. Each entry includes bibliographic information, context-specific text from the reference, and three weighted values: (M1) credibility of the reference, (M2) causality determined by the author, and (M3) the analyst's confidence level. The weight metadata for M1, M2, and M3 are based on probability language from Intergovernmental Panel on Climate Change (IPCC), Climate Change 2001: Synthesis Report (https://www.ipcc.ch/ipccreports/tar/vol4/english/index.htm). The language can be found in the “Summary for Policymakers” section, in the PDF format.

Weight Metadata:
LowerBound_Probability, UpperBound_Probability, Qualitative Language
0.99, 1, Virtually Certain
0.9, 0.99, Very Likely
0.66, 0.9, Likely
0.33, 0.66, Medium Likelihood
0.1, 0.33, Unlikely
0.01, 0.1, Very Unlikely
0, 0.01, Extremely Unlikely
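The weight metadata above amounts to a lookup from a probability to a qualitative label. A minimal sketch of that lookup follows; the function and variable names are hypothetical, and because adjacent ranges share a boundary value, the sketch assumes that a probability exactly on a boundary belongs to the higher category, which the description does not specify.

```python
# (lower bound, upper bound, label), ordered from most to least likely.
LIKELIHOOD_SCALE = [
    (0.99, 1.0, "Virtually Certain"),
    (0.9, 0.99, "Very Likely"),
    (0.66, 0.9, "Likely"),
    (0.33, 0.66, "Medium Likelihood"),
    (0.1, 0.33, "Unlikely"),
    (0.01, 0.1, "Very Unlikely"),
    (0.0, 0.01, "Extremely Unlikely"),
]


def qualitative_language(probability):
    """Return the qualitative label whose range contains the given probability."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    # Assumption: lower bounds are inclusive, so boundary values map upward.
    for lower, _upper, label in LIKELIHOOD_SCALE:
        if probability >= lower:
            return label
    raise AssertionError("unreachable: the scale covers the whole interval")


for p in (0.95, 0.5, 0.05):
    print(p, qualitative_language(p))
```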
keywords: Data-Theoretic; Training; Organization; Probabilistic Risk Assessment; Training Quality; Causal Model; DT-BASE; Bayesian Belief Network; Bayesian Network; Theory-Building
published: 2017-06-16
 
Table S1. Pollen types identified in the BCI and PNSL pollen rain data sets. Pollen types were identified to species when possible and assigned a life form based on descriptions provided in Croat, T.B. (1978). Taxa from BCI and PNSL were assigned a 1 if present in forest census data or a 0 if absent. The relative representation of each taxon has been provided for each extended record and by dry and wet season representation respectively. CA loadings are provided for axes 1 and 2 (Fig. 1).
keywords: pollen; identifications; abundance; data; BCI; PNSL; Panama
published: 2016-06-23
 
This dataset contains hourly traffic estimates (speeds) for individual links of the New York City road network for the years 2010-2013, estimated from New York City Taxis.
keywords: traffic estimates; traffic conditions; New York City
published: 2017-10-11
 
The International Registry of Reproductive Pathology Database is part of pioneering work done by Dr. Kenneth McEntee to comprehensively document thousands of disease case studies. His large and comprehensive collection of case reports and physical samples was complemented by development of the International Registry of Reproductive Pathology Database in the 1980s. The original FoxPro Database files and a migrated Access version were completed by the College of Veterinary Medicine in 2016. Access CSV files were completed by the University of Illinois Library in 2017.
keywords: Animal Pathology; Databases; Veterinary Medicine
published: 2017-12-22
 
TBP assessment raw data files of pre- and post- motion capture velocity and center of pressure force plate data. Labels are self-explanatory. The .mat files refer to data exported from the force plate for the time-to-stabilization assessments while the .txt files are the data collected for smoothness of gait assessments. These files do not relate to one another and are from separate assessments. Version2's files are the result of running the Python script Data_Bank_Cleaner.py on version1's files. Please find more information in READ_ME_databank.txt.
keywords: Multiple Sclerosis; Rehabilitation; Balance; Ataxia; Ballet; Dance; Targeted Ballet Program
published: 2017-11-15
 
Monthly water withdrawal records (total pumpage and per-capita consumption) for the City of Austin, Texas (2000-2014). Data were provided by Austin Water Utility.
keywords: Water use; Water conservation
published: 2017-11-14
 
If you use this dataset, please cite the IJRR data paper (bibtex is below). We present a dataset collected from a canoe along the Sangamon River in Illinois. The canoe was equipped with a stereo camera, an IMU, and a GPS device, which provide visual data suitable for stereo or monocular applications, inertial measurements, and position data for ground truth. We recorded a canoe trip up and down the river for 44 minutes covering 2.7 km round trip. The dataset adds to those previously recorded in unstructured environments and is unique in that it is recorded on a river, which provides its own set of challenges and constraints that are described in this paper. The data is divided into subsets, which can be downloaded individually. Video previews are available on YouTube: https://www.youtube.com/channel/UCOU9e7xxqmL_s4QX6jsGZSw

The information below can also be found in the README files provided in the 527 dataset and each of its subsets. The purpose of this document is to assist researchers in using this dataset.

Images
======

Raw
---
The raw images are stored in the cam0 and cam1 directories in bmp format. They are bayered images that need to be debayered and undistorted before they are used. The camera parameters for these images can be found in camchain-imucam.yaml. Note that the camera intrinsics describe a 1600x1200 resolution image, so the focal length and center pixel coordinates must be scaled by 0.5 before they are used. The distortion coefficients remain the same even for the scaled images. The camera to IMU transformation matrix is also in this file. cam0/ refers to the left camera, and cam1/ refers to the right camera.

Rectified
---------
Stereo rectified, undistorted, row-aligned, debayered images are stored in the rectified/ directory in the same way as the raw images except that they are in png format. The params.yaml file contains the projection and rotation matrices necessary to use these images. These parameters do not need to be scaled, unlike those for the raw images.

params.yml
----------
The stereo rectification parameters. R0, R1, P0, P1, and Q correspond to the outputs of the OpenCV stereoRectify function except that 1s and 2s are replaced by 0s and 1s, respectively.
R0: The rectifying rotation matrix of the left camera.
R1: The rectifying rotation matrix of the right camera.
P0: The projection matrix of the left camera.
P1: The projection matrix of the right camera.
Q: Disparity to depth mapping matrix
T_cam_imu: Transformation matrix for a point in the IMU frame to the left camera frame.

camchain-imucam.yaml
--------------------
The camera intrinsic and extrinsic parameters and the camera to IMU transformation usable with the raw images.
T_cam_imu: Transformation matrix for a point in the IMU frame to the camera frame.
distortion_coeffs: lens distortion coefficients using the radial tangential model.
intrinsics: focal length x, focal length y, principal point x, principal point y
resolution: resolution of calibration. Scale the intrinsics for use with the raw 800x600 images. The distortion coefficients do not change when the image is scaled.
T_cn_cnm1: Transformation matrix from the right camera to the left camera.

Sensors
-------
Here, each message in name.csv is described.

###rawimus###
time # GPS time in seconds
message name # rawimus
acceleration_z # m/s^2 IMU uses right-forward-up coordinates
-acceleration_y # m/s^2
acceleration_x # m/s^2
angular_rate_z # rad/s IMU uses right-forward-up coordinates
-angular_rate_y # rad/s
angular_rate_x # rad/s

###IMG###
time # GPS time in seconds
message name # IMG
left image filename
right image filename

###inspvas###
time # GPS time in seconds
message name # inspvas
latitude
longitude
altitude # ellipsoidal height WGS84 in meters
north velocity # m/s
east velocity # m/s
up velocity # m/s
roll # right hand rotation about y axis in degrees
pitch # right hand rotation about x axis in degrees
azimuth # left hand rotation about z axis in degrees clockwise from north

###inscovs###
time # GPS time in seconds
message name # inscovs
position covariance # 9 values xx,xy,xz,yx,yy,yz,zx,zy,zz m^2
attitude covariance # 9 values xx,xy,xz,yx,yy,yz,zx,zy,zz deg^2
velocity covariance # 9 values xx,xy,xz,yx,yy,yz,zx,zy,zz (m/s)^2

###bestutm###
time # GPS time in seconds
message name # bestutm
utm zone # numerical zone
utm character # alphabetical zone
northing # m
easting # m
height # m above mean sea level

Camera logs
-----------
The files name.cam0 and name.cam1 are text files that correspond to cameras 0 and 1, respectively. The columns are defined by:
unused: The first column is all 1s and can be ignored.
software frame number: This number increments at the end of every iteration of the software loop.
camera frame number: This number is generated by the camera and increments each time the shutter is triggered. The software and camera frame numbers do not have to start at the same value, but if the difference between the initial and final values is not the same, it suggests that frames may have been dropped.
camera timestamp: This is the camera's internal timestamp of the frame capture in units of 100 milliseconds.
PC timestamp: This is the PC time of arrival of the image.

name.kml
--------
The kml file is a mapping file that can be read by software such as Google Earth. It contains the recorded GPS trajectory.

name.unicsv
-----------
This is a csv file of the GPS trajectory in UTM coordinates that can be read by gpsbabel, software for manipulating GPS paths.

@article{doi:10.1177/0278364917751842,
  author = {Martin Miller and Soon-Jo Chung and Seth Hutchinson},
  title = {The Visual–Inertial Canoe Dataset},
  journal = {The International Journal of Robotics Research},
  volume = {37},
  number = {1},
  pages = {13-20},
  year = {2018},
  doi = {10.1177/0278364917751842},
  URL = {https://doi.org/10.1177/0278364917751842},
  eprint = {https://doi.org/10.1177/0278364917751842}
}
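Two of the README's instructions lend themselves to a short illustration: scaling the calibration intrinsics by 0.5 for the raw images, and comparing software and camera frame numbers to spot dropped frames. The Python sketch below is only an illustration under stated assumptions. The function names and the sample numbers are hypothetical, and parsing camchain-imucam.yaml or the name.cam0 and name.cam1 logs is left out.

```python
def scale_intrinsics(intrinsics, scale=0.5):
    """Scale [fx, fy, cx, cy] from the calibration resolution to the raw images.

    The README states a factor of 0.5 (1600x1200 calibration, 800x600 raw
    images). Distortion coefficients are not scaled and are not handled here.
    """
    return [value * scale for value in intrinsics]


def frames_possibly_dropped(software_frames, camera_frames):
    """Compare how far each frame counter advanced over a recording.

    The counters may start at different values, so only the difference
    between the first and last value of each sequence is compared.
    """
    software_span = software_frames[-1] - software_frames[0]
    camera_span = camera_frames[-1] - camera_frames[0]
    return software_span != camera_span


# Hypothetical intrinsics, not taken from the dataset's calibration file.
scaled = scale_intrinsics([1000.0, 1000.0, 800.0, 600.0])
```

With the made-up values above, `scaled` is `[500.0, 500.0, 400.0, 300.0]`. A `True` result from `frames_possibly_dropped` only suggests dropped frames, as the README notes.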
keywords: slam;sangamon;river;illinois;canoe;gps;imu;stereo;monocular;vision;inertial
published: 2017-10-10
 
This dataset contains ground motion data for Newmark Structural Engineering Laboratory (NSEL) Report Series 048, "Modification of ground motions for use in Central North America: Southern Illinois surface ground motions for structural analysis". The data are 20 individual ground motion time history records developed at each of the 10 sites (for a total of 200 ground motions). These accompanying ground motions are developed following the detailed procedure presented in Kozak et al. [2017].
keywords: earthquake engineering; ground motion records; southern Illinois seismic hazard; dynamic structural analysis; conditional mean spectrum
published: 2017-09-28
 
This is the dataset used in the Journal of Ecology publication of the same name. The file BH.veg.data.csv contains a site by species matrix of species relative abundance (percent cover across all sampling quadrats within site). Data under the heading Year refer to sampling periods: Year 1 refers to the first set of samples taken between 1997 and 2000, Year 2 to the second set taken between 2002 and 2005, Year 3 to the third set taken between 2007 and 2010, and Year 4 to the fourth set taken between 2012 and 2015. All sites met Critical Trends Assessment Program (CTAP) size criteria of being at least 2 ha in size with a minimum of 500 m² of suitable sampling area. The file BH.site.location.csv contains the Public Land Survey System ranges and townships in which specific sites were located. All sites were located within the U.S. state of Illinois. More information about this dataset: Interested parties can request data from the Critical Trends Assessment Program, which was the source for the data on the wetlands in this study. More information on the program and data requests can be obtained by visiting the program webpage. Critical Trends Assessment Program, Illinois Natural History Survey. http://wwx.inhs.illinois.edu/research/ctap/
keywords: biodiversity; biotic homogenization; invasive species; Phalaris arundinacea; plant population and community dynamics; similarity index; wetlands
published: 2017-09-26
 
This file contains the supplemental appendix for the article "Farmer Preferences for Agricultural Soil Carbon Sequestration Schemes" published in Applied Economic Perspectives and Policy (accepted 2017).
keywords: appendix; carbon sequestration; tillage; choice experiment
published: 2017-09-08
 
This dataset includes transport and MFM data of brickwork artificial spin ice composed of permalloy, reproducing the data in the article "Magnetic response of brickwork artificial spin ice". Transport data represent the magnetic response of connected brickwork artificial spin ice, and MFM data represent how both connected and disconnected brickwork artificial spin ice react to external magnetic fields. SEM images of typical samples are included, where each individual nanowire leg (island) is approximately 660 nm long and 140 nm wide with a 40 nm thickness. For the transport, each sample was measured in a longitudinal and a transverse geometry. Red curves are the 2500 Oe to -2500 Oe sweeps and blue curves are the -2500 Oe to 2500 Oe sweeps. Transport measurements were taken using a standard 4-wire technique. Each plot was saved in PDF format.
keywords: Magnetotransport
published: 2017-09-06
 
Spire angle data for sinistral whelks of the family Busyconidae. Data focuses on spire angles, with some data on total shell length. Locality information is present for all modern specimens.
keywords: lightning whelk; sinistral whelk; spire angle; sourcing; Busycon; Cahokia; Spiro
published: 2017-07-29
 
This dataset contains the PartMC-MOSAIC simulations used in the article “Plume-exit modeling to determine cloud condensation nuclei activity of aerosols from residential biofuel combustion”. The data is organized as a set of folders, each folder representing a different scenario modeled. Each folder contains a series of NetCDF files, which are the output of the PartMC-MOSAIC simulation. They contain information on particle and gas properties, both of the biofuel burning plume and background. Input files for PartMC-MOSAIC are also included. This dataset was used during the open review process at Atmospheric Chemistry and Physics (ACP) and supports both the discussion paper and final article.
keywords: CCN; cloud condensation nuclei; activation; supersaturation; biofuel
published: 2017-08-11
 
Enclosed in this dataset are transport data of kagome connected artificial spin ice networks composed of permalloy nanowires. The data herein are reproductions of the data seen in Appendix B of the dissertation titled "Magnetotransport of Connected Artificial Spin Ice". Field sweeps with the magnetic field applied in-plane were performed in 5 degree increments for armchair orientation kagome artificial spin ice and zigzag orientation kagome artificial spin ice.
keywords: Magnetotransport; artificial spin ice; nanowires
published: 2017-06-16
 
Table S2. Raw pollen counts and climatic data for each seasonal sampling period. Climatic data reflect the average daily conditions observed over the period during which samples were collected (˚C/day, mm/day, MJ/m2/day). Lycopodium counts and counts for each pollen taxon reflect the aggregated pollen sum from four sampling heights.
keywords: pollen; count; climate; data; BCI; PNSL; Panama
published: 2017-06-16
 
Table S3. Mean slope response for each predictive model used in the ecoinformatic analysis. Mean responses are provided for each seasonal and annual pollen data set analyzed from BCI and PNSL and are summarized by life form. Calculated p-values are provided for each model.
keywords: pollen; response; climate; ecoinformatics; BCI; PNSL; Panama
published: 2017-06-15
 
Datasets used in the study, "Optimal completion of incomplete gene trees in polynomial time using OCTAL," presented at WABI 2017.
keywords: phylogenomics; missing data; coalescent-based species tree estimation; gene trees
published: 2017-05-31
 
Dataset includes maternal antigen treatment and early-life antigen treatment for male zebra finches. Also includes data on beak coloration, measures of song complexity for each male, and female responses to treated males.
Male beak color and song metadata:
* MATID = Maternal Identity
* MATTRT = Maternal antigen treatment prior to egg laying (KLH=keyhole limpet hemocyanin, LPS=lipopolysaccharide, PBS=phosphate buffered saline)
* YGTRT = Young antigen treatment post-hatch (KLH=keyhole limpet hemocyanin, LPS=lipopolysaccharide, PBS=phosphate buffered saline)
* NESTBANDNUM = Nestling band number
* Haptoglobin = haptoglobin levels at day 28 (mg/ml)
* Mean TE = Mean number of total elements in that male's song
* TE (z) = Z-transformed total elements
* Mean UE = Mean number of unique elements in the song
* UE (z) = z-transformed unique elements
* mean phrases = Mean number of song phrases
* Phrases (z) = z-transformed song phrases
* Mean D = Mean song duration in seconds
* D (z) = z-transformed song duration
* B2 standard = beak brightness standardized so that lower values reflect less bright beaks
* B2 (z) = z-transformed brightness
* S1R standard = beak saturation at high wavelengths standardized so that lower values reflect less red beaks
* S1R (z) = z-transformed S1R
* S1U standard = beak saturation at low wavelengths standardized so that lower values reflect less red beaks
* S1U (z) = z-transformed S1U
* H4B standard = beak hue standardized so that lower values reflect less red beaks
* H4B (z) = z-transformed H4B
Female choice metadata:
* Control Bird = PBS denotes that all control males received phosphate buffered saline
* Treatment Bird = Treatment the male received (keyhole limpet hemocyanin (KLH) or lipopolysaccharide (LPS))
* Beak Wipes Control = # of beak wipes the female performed when on the control male side
* Beak Wipes Treatment = # of beak wipes the female performed when on the "treatment male" side
* Hops Control = # of hops female performed when on the control male side
* Hops Treatment = # of hops female performed when on the treatment male side
* Time Spent Near Control = amount of time (sec) female spent on the control male side
* Time Spent Near Treatment = amount of time (sec) the female spent on the treatment male side
keywords: early-life; stress; immune response; phenotypic correlation; sexual signal; zebra finch; birdsongs; acoustic signals; beak coloration; mate selection
published: 2017-05-01
 
Indianapolis Int'l Airport to Urbana:
Sampling Rate: 2 Hz
Total Travel Time: 5901534 ms or 98.4 minutes
Number of Data Points: 11805
Distance Traveled: 124 miles via I-74
Device used: Samsung Galaxy S6
Date Recorded: 2016-11-27
Parameters Recorded:
* ACCELEROMETER X (m/s²)
* ACCELEROMETER Y (m/s²)
* ACCELEROMETER Z (m/s²)
* GRAVITY X (m/s²)
* GRAVITY Y (m/s²)
* GRAVITY Z (m/s²)
* LINEAR ACCELERATION X (m/s²)
* LINEAR ACCELERATION Y (m/s²)
* LINEAR ACCELERATION Z (m/s²)
* GYROSCOPE X (rad/s)
* GYROSCOPE Y (rad/s)
* GYROSCOPE Z (rad/s)
* LIGHT (lux)
* MAGNETIC FIELD X (microT)
* MAGNETIC FIELD Y (microT)
* MAGNETIC FIELD Z (microT)
* ORIENTATION Z (azimuth °)
* ORIENTATION X (pitch °)
* ORIENTATION Y (roll °)
* PROXIMITY (i)
* ATMOSPHERIC PRESSURE (hPa)
* SOUND LEVEL (dB)
* LOCATION Latitude
* LOCATION Longitude
* LOCATION Altitude (m)
* LOCATION Altitude-google (m)
* LOCATION Altitude-atmospheric pressure (m)
* LOCATION Speed (kph)
* LOCATION Accuracy (m)
* LOCATION ORIENTATION (°)
* Satellites in range
* GPS NMEA
* Time since start in ms
* Current time in YYYY-MO-DD HH-MI-SS_SSS format
Quality Notes: There are some things to note about the quality of this data set that you may want to consider while doing preprocessing. This dataset was recorded continuously as a single trip; no stop was made for gas along the way, making this a very long continuous dataset. It starts in the parking lot of the Indianapolis International Airport and continues directly to a gas station on Lincoln Avenue in Urbana, IL. There are a couple of parts of the trip where the phone's orientation had to be changed because my navigation cut out. These times are easy to account for based on the Orientation X/Y/Z change. I would also advise cutting out the first couple hundred points or the points leading up to highway speed. The phone was mounted in the cupholder in the front seat of the car.
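The quality notes suggest discarding the points recorded before the car reaches highway speed. A minimal Python sketch of that preprocessing step follows, along with a quick check of the sample count implied by the travel time and sampling rate. The function names, the `speed_kph` key, and the 90 kph threshold are all assumptions made for illustration and do not come from the dataset.

```python
def expected_samples(duration_ms, rate_hz):
    """Approximate sample count for a recording of this length and rate."""
    return round(duration_ms / 1000 * rate_hz)


def trim_until_speed(rows, threshold_kph=90.0, speed_key="speed_kph"):
    """Drop leading rows until the speed first reaches threshold_kph.

    Assumptions: each row is a dict, the speed is stored under the
    hypothetical key "speed_kph", and 90 kph stands in for highway speed.
    """
    for index, row in enumerate(rows):
        if row[speed_key] >= threshold_kph:
            return rows[index:]
    return []  # the threshold was never reached
```

`expected_samples(5901534, 2)` gives 11803, close to the 11805 data points reported, which is consistent with the stated 2 Hz rate.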
keywords: smartphone; sensor; driving; accelerometer; gyroscope; magnetometer; gps; nmea; barometer; satellite
published: 2017-03-08
 
This dataset includes early embryogenesis and post-embryonic development of Soybean cyst nematode.
keywords: Soybean cyst nematode; Embryogenesis; Post-embryonic development
published: 2017-03-07
 
This is a sample 5-minute video of an E. coli bacterium swimming in a microfluidic chamber, as well as some supplementary code files to be used with the MATLAB code available at https://github.com/dfraebel/CellTracking
published: 2017-03-02
 
This data was collected between 2004 and 2010 at White River National Wildlife Refuge (WRNWR) and Saint Francis National Forest (SF). It was collected as part of two master’s and one PhD project at Arkansas State University USA studying Swainson’s Warbler habitat use, survival, and body condition.
keywords: Swainson’s Warbler; Limnothlypis swainsonii; flooding; natural disturbance; apparent survival; body condition
published: 2017-02-28
 
Leesburg, VA to Indianapolis, Indiana:
Sampling Rate: 0.1 Hz
Total Travel Time: 31100007 ms or 518 minutes or 8.6 hours
Distance Traveled: 570 miles via I-70
Number of Data Points: 3112
Device used: Samsung Galaxy S4
Date Recorded: 2017-01-15
Parameters Recorded:
* ACCELEROMETER X (m/s²)
* ACCELEROMETER Y (m/s²)
* ACCELEROMETER Z (m/s²)
* GRAVITY X (m/s²)
* GRAVITY Y (m/s²)
* GRAVITY Z (m/s²)
* LINEAR ACCELERATION X (m/s²)
* LINEAR ACCELERATION Y (m/s²)
* LINEAR ACCELERATION Z (m/s²)
* GYROSCOPE X (rad/s)
* GYROSCOPE Y (rad/s)
* GYROSCOPE Z (rad/s)
* LIGHT (lux)
* MAGNETIC FIELD X (microT)
* MAGNETIC FIELD Y (microT)
* MAGNETIC FIELD Z (microT)
* ORIENTATION Z (azimuth °)
* ORIENTATION X (pitch °)
* ORIENTATION Y (roll °)
* PROXIMITY (i)
* ATMOSPHERIC PRESSURE (hPa)
* Relative Humidity (%)
* Temperature (F)
* SOUND LEVEL (dB)
* LOCATION Latitude
* LOCATION Longitude
* LOCATION Altitude (m)
* LOCATION Altitude-google (m)
* LOCATION Altitude-atmospheric pressure (m)
* LOCATION Speed (kph)
* LOCATION Accuracy (m)
* LOCATION ORIENTATION (°)
* Satellites in range
* GPS NMEA
* Time since start in ms
* Current time in YYYY-MO-DD HH-MI-SS_SSS format
Quality Notes: There are some things to note about the quality of this data set that you may want to consider while doing preprocessing. This dataset was recorded continuously but includes multiple stops to refuel (without the data recording ceasing). These stops can be removed by filtering out all data with a speed of 0. The mount for this dataset was fairly stable (as can be seen from the consistent orientation angle throughout the dataset). The phone was mounted tightly between two seats in the back of the vehicle. Unfortunately, the sampling frequency for this dataset was set fairly low, at one sample every ten seconds.
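The quality notes recommend removing the refueling stops by discarding every record with a speed of 0. A minimal Python sketch of that filter follows. The function name, the `speed_kph` key, and the optional `min_speed_kph` parameter are assumptions for illustration. A small positive threshold may be useful if the GPS reports slight drift while parked, but the source only mentions a speed of exactly 0.

```python
def drop_stationary(rows, min_speed_kph=0.0, speed_key="speed_kph"):
    """Keep only rows whose speed is strictly above min_speed_kph.

    With the default of 0.0 this removes exactly the rows with a speed
    of 0, as the quality notes describe. The "speed_kph" key is a
    hypothetical column name.
    """
    return [row for row in rows if row[speed_key] > min_speed_kph]
```

The rows keep their original order, so timestamps on either side of a removed stop will show a gap that later processing may need to account for.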
keywords: smartphone; sensor; driving; accelerometer; gyroscope; magnetometer; gps; nmea; barometer; satellite; temperature; humidity
published: 2017-02-23
 
GBS data from diverse sorghum lines. Project funded by DOE, ARPA-E, and startup funds to PJ Brown.
published: 2017-02-21
 
GBS data from biparental sorghum populations provided by Dr. Bill Rooney, TAMU. Data produced and analyzed by Pradeep Hirannaiah to study recombination in sorghum. Funding for this study was provided by the Sorghum Checkoff.
published: 2017-02-21
 
GBS data from diverse sorghum lines. Project funded by DOE, ARPA-E, and startup funds to PJ Brown.
published: 2017-06-01
 
List of Chinese Students Receiving a Ph.D. in Chemistry between 1905 and 1964. Based on two books compiling doctoral dissertations by Chinese students in the United States. Includes disciplines, university, advisor, year degree awarded, birth and/or death date, and dissertation title. Accompanies Chapter 5: History of the Modern Chemistry Doctoral Program in Mainland China by Vera V. Mainz published in "Igniting the Chemical Ring of Fire: Historical Evolution of the Chemical Communities in the Countries of the Pacific Rim", Seth Rasmussen, Editor. Published by World Scientific. Expected publication 2017.
keywords: Chinese; graduate student; dissertation; university; advisor; chemistry; engineering; materials science
published: 2016-12-20
 
Scripts and example data for AIDData (aiddata.org) processing in support of forthcoming Nakamura dissertation. This dataset includes two sets of scripts and example data files from an aiddata.org data dump. Fuller documentation about the functionality for these scripts is within the readme file. Additional background information and description of usage will be in the forthcoming Nakamura dissertation (link will be added when available). Data originally supplied by Nakamura. Python code and this readme file created by Wickes. Data included within this deposit are examples to demonstrate execution. Roughly, there are two python scripts in here: keyword_search.py, designed to assist in finding records matching specific keywords, and matching_tool.ipynb, designed to assist in detection of which records are and are not contained within a keyword results file and an aiddata project data file.
keywords: aiddata; natural resources
published: 2016-12-19
 
Files in this dataset represent an investigation into use of the Library mobile app Minrva during the months of May 2015 through December 2015. During this time interval, 45,975 API hits were recorded by the Minrva web server. The dataset included herein is an analysis of the following: 1) a delineation of API hits to mobile app module use in the Minrva app by month, 2) a general analysis of Minrva app downloads to module use, and 3) the annotated data file providing associations from API hits to specific modules used, organized by month (May 2015 – December 2015).
keywords: API analysis; log analysis; Minrva Mobile App
published: 2016-12-18
 
This dataset is the numerical simulation data of the computational study of the cold front-related hydrodynamics in the Wax Lake delta. The numerical model used is ECOM-si.
keywords: Wax Lake delta; Hydrodynamics; Cold front
published: 2016-12-13
 
BAM files for founding strain (MG1655-motile) as well as evolved strains from replicate motility selection experiments in low-viscosity agar plates containing either rich medium (LB) or minimal medium (M63+0.18mM galactose)
published: 2016-12-12
 
This dataset includes data of the Wax Lake delta from four public agencies: NGDC, USGS, NDBC, and NOAA CO-OPS. Besides the original data, the processed data associated with analyzed figures are also shared.
keywords: Wax Lake delta; NOAA CO-OPS; NGDC; USGS; NDBC
published: 2016-12-12
 
This dataset is the field measurements of water depth at the Wax Lake delta on the date 2012-12-01.
keywords: Wax Lake delta; Bathymetry
published: 2016-12-12
 
This dataset is a topographic LIDAR survey (saved in “waxlake-lidar.img”) that was conducted over the Wax Lake delta, between longitudes −91.5848 and −91.292 degrees and latitudes 29.3647 and 29.6466 degrees. Unlike other elevation data, a positive value in the LIDAR data indicates land elevation, while a zero value indicates riverbed without identifying a specific water depth.
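The value convention described above (positive for land elevation, zero for riverbed) can be sketched as a small classifier. The Python below is only an illustration: the function names are hypothetical, reading “waxlake-lidar.img” is out of scope, and labeling any other value as "unspecified" is an assumption, since the description does not say how such values should be treated.

```python
def classify_cell(value):
    """Classify one LIDAR raster value using the convention described above."""
    if value > 0:
        return "land"      # positive values are land elevation
    if value == 0:
        return "riverbed"  # zero marks riverbed, with no water depth given
    return "unspecified"   # assumption: the source does not cover this case


def land_fraction(grid):
    """Fraction of cells classified as land in a 2D list of values."""
    cells = [value for row in grid for value in row]
    if not cells:
        return 0.0
    land_cells = sum(1 for value in cells if classify_cell(value) == "land")
    return land_cells / len(cells)
```

For a toy grid such as `[[0, 1.5], [2.0, 0]]`, `land_fraction` returns 0.5.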
keywords: LIDAR; Wax Lake delta
published: 2017-12-12
 
This dataset includes both meteorology and oceanography data collected at stations (CSI03, CSI06, and CSI09) near the Gulf of Mexico from the LSU WAVCIS (Waves-Current-Surge Information System) lab. The associated data analysis visualization is also saved in separate directories.
keywords: WAVCIS; Gulf of Mexico; Meteorology; Oceanography
published: 2016-12-12
 
This dataset is the field measurements of currents at two stations (Big Hogs Bayou and Delta1) in the Wax Lake delta in November 2012 and February 2013.
keywords: Wax Lake delta; Currents
published: 2016-12-12
 
This dataset is the field measurements of water depth at the Wax Lake delta conducted in late 2012.
keywords: Wax Lake delta; Bathymetry
published: 2016-12-02
 
This dataset enumerates the number of geocoded tweets captured in geographic rectangular bounding boxes around the metropolitan statistical areas (MSAs) defined for 49 American cities, during a four-week period in 2012 (between April and June), through the Twitter Streaming API. More information on MSA definitions: https://www.census.gov/population/metro/
keywords: human dynamics; social media; urban informatics; pace of life; Twitter; ecological correlation; individual behavior
published: 2016-11-30
 
This is the dataset used in the BioScience publication of the same name. More information about this dataset: Interested parties can request data from the Critical Trends Assessment Program, which was the source for the data on natural areas in this study. More information on the program and data requests can be obtained by visiting the program webpage. Critical Trends Assessment Program, Illinois Natural History Survey. http://wwx.inhs.illinois.edu/research/ctap/
These spatial datasets were used for analyses:
Illinois Natural History Survey. 2003. Illinois GAP analysis land cover classification 1999-2000, 1:100 000 Scale, Raster Digital Data, Version 2.0. Champaign, IL, USA.
Illinois State Geological Survey. 1995. Illinois Landcover Thematic Map Coverage Map 1991-1995. Champaign, IL, USA.
Illinois State Geological Survey. 2001. Illinois Landcover Thematic Map Coverage Map 1999-2000. Champaign, IL, USA.
USDA National Agricultural Statistics Service Cropland Data Layer. 1999-2015. Published crop-specific data layer [Online]. Available at https://nassgeodata.gmu.edu/CropScape/. USDA-NASS, Washington, DC.
Information on agricultural practices and landcover changes was derived from the following U.S. Department of Agriculture (USDA) resources:
USDA Economic Research Service. 2016. Adoption of Genetically Engineered Crops in the U.S. Available at http://www.ers.usda.gov/data-products/. USDA-ERS, Washington, DC.
USDA Natural Resources Conservation Service. 2015. Summary Report: 2012 National Resources Inventory. https://www.nrcs.usda.gov/Internet/FSE_DOCUMENTS/nrcseprd396218.pdf. USDA-NRCS, Washington, DC, and Center for Survey Statistics and Methodology, Iowa State University, Ames, Iowa.
keywords: Milkweed; Monarch Butterfly; CTAP Critical Trends Assessment Program; BioScience
published: 2016-11-28
 
These show the topography and relief of the Precambrian surface of the Cratonic Platform of the United States.
keywords: precambrian; geology; relief; elevation
published: 2016-08-18
 
Copyright Review Management System renewals by year, data from Table 2 of the article "How Large is the ‘Public Domain’? A comparative Analysis of Ringer’s 1961 Copyright Renewal Study and HathiTrust CRMS Data."
keywords: copyright; copyright renewals; HathiTrust
published: 2016-08-16
 
This archive contains all the alignments and trees used in the HIPPI paper [1]. The pfam.tar archive contains the PFAM families used to build the HMMs and BLAST databases. The file structure is: ./X/Y/initial.fasttree ./X/Y/initial.fasta where X is a Pfam family and Y is the cross-fold set (0, 1, 2, or 3). Each folder contains two files: initial.fasta, which is the Pfam reference alignment with 1/4 of the seed alignment removed, and initial.fasttree, the FastTree-2 ML tree estimated on initial.fasta. The query.tar archive contains the query sequences for each cross-fold set. The query sequences for cross-fold set Y are labeled query.Y.Z.fas, where Z is the fragment length (1, 0.5, or 0.25). The query files are found in the splits directory. [1] Nguyen, Nam-Phuong D, Mike Nute, Siavash Mirarab, and Tandy Warnow. (2016) HIPPI: Highly Accurate Protein Family Classification with Ensembles of HMMs. To appear in BMC Genomics.
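The naming scheme described above can be turned into a small path helper. The Python sketch below builds the reference file paths and the query file name from the stated pattern. The function names are hypothetical, the family name `PF00001` is only a placeholder, and the paths are relative to wherever pfam.tar and query.tar are extracted, which the description does not specify.

```python
from pathlib import PurePosixPath

CROSS_FOLD_SETS = (0, 1, 2, 3)
FRAGMENT_LENGTHS = ("1", "0.5", "0.25")


def reference_paths(family, fold):
    """Return (alignment, tree) paths following ./X/Y/initial.* from the text."""
    if fold not in CROSS_FOLD_SETS:
        raise ValueError(f"unknown cross-fold set: {fold}")
    base = PurePosixPath(family) / str(fold)
    return base / "initial.fasta", base / "initial.fasttree"


def query_filename(fold, fragment_length):
    """Return the query file name query.Y.Z.fas for a fold and fragment length."""
    if fold not in CROSS_FOLD_SETS:
        raise ValueError(f"unknown cross-fold set: {fold}")
    if fragment_length not in FRAGMENT_LENGTHS:
        raise ValueError(f"unknown fragment length: {fragment_length}")
    return f"query.{fold}.{fragment_length}.fas"


# "PF00001" is a placeholder family name used only for illustration.
alignment, tree = reference_paths("PF00001", 0)
```

Fragment lengths are passed as strings so that the file name keeps the exact spelling used in the archive (for example `0.25`).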
keywords: HIPPI dataset; ensembles of profile Hidden Markov models; Pfam
published: 2016-08-02
 
These data are the result of a multi-step process aimed at enriching BIBFRAME RDF with linked data. The process takes in an initial MARC XML file, transforms it to BIBFRAME RDF/XML, and then four separate Python files corresponding to the BIBFRAME 1.0 model (Work, Instance, Annotation, and Authority) are run over the BIBFRAME RDF/XML output. The input and outputs of each step are included in this data set. Input file types include the CSV, MARC XML, and Master RDF/XML files. The CSV files contain bibliographic identifiers for e-books. From the CSV files, a set of MARC XML files is generated. The MARC XML files are used to produce the Master RDF file set. The major outputs of the enrichment code are BIBFRAME linked data as Annotation RDF, Instance RDF, Work RDF, and Authority RDF.
keywords: BIBFRAME; Schema.org; linked data; discovery; MARC; MARCXML; RDF
published: 2016-07-22
 
Datasets and R scripts relating to the manuscript "Ecological characteristics and in situ genetic associations for yield-component traits of wild Miscanthus from eastern Russia" published in Annals of Botany, 10.1093/aob/mcw137. Field data, including collection locations, physical and ecological information for each location, and plant phenotypes relating to biomass are included. Genetic data in this repository include single nucleotide polymorphisms (SNPs) derived from restriction site-associated DNA sequencing (RAD-seq), as well as plastid microsatellites. A file is also included listing the DNA sequences of all RAD-seq markers generated to-date by the Sacks lab, including those from this publication.
keywords: Miscanthus sacchariflorus; Miscanthus sinensis; Russia; germplasm; RAD-seq; SNP
published: 2016-06-23
 
This dataset was extracted from a set of metadata files harvested from the DataCite metadata store (https://search.datacite.org/ui) during December 2015. Metadata records for items with a resourceType of dataset were collected. 1,647,949 total records were collected. This dataset contains three files: 1) readme.txt: A readme file. 2) version-results.csv: A CSV file containing three columns: DOI, DOI prefix, and version text contents 3) version-counts.csv: A CSV file containing counts for unique version text content values.
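The relationship between the two CSV files (per-record version text and counts of each unique value) can be illustrated with a short tally. The Python sketch below assumes that the version text sits in the third column and that there is no header row. Neither detail is confirmed by the description, and the sample rows are made up for the example.

```python
import csv
import io
from collections import Counter


def count_version_values(csv_text, column_index=2):
    """Tally the unique values found in one column of CSV text.

    Assumptions: the version text is the third column (index 2) and the
    data has no header row. Rows that are too short are skipped.
    """
    reader = csv.reader(io.StringIO(csv_text))
    return Counter(
        row[column_index] for row in reader if len(row) > column_index
    )


# Made-up rows in the described column order: DOI, DOI prefix, version text.
sample = (
    "10.1234/a,10.1234,1.0\n"
    "10.1234/b,10.1234,1.0\n"
    "10.5678/c,10.5678,v2\n"
)
counts = count_version_values(sample)
```

Here `counts["1.0"]` is 2 and `counts["v2"]` is 1. The same approach would apply to any other column by changing `column_index`.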
keywords: datacite;metadata;version values;repository data
published: 2016-06-23
 
This dataset was extracted from a set of metadata files harvested from the DataCite metadata store (http://search.datacite.org/ui) during December 2015. Metadata records for items with a resourceType of dataset were collected. 1,647,949 total records were collected. This dataset contains four files: 1) readme.txt: a readme file. 2) language-results.csv: A CSV file containing three columns: DOI, DOI prefix, and language text contents 3) language-counts.csv: A CSV file containing counts for unique language text content values. 4) language-grouped-counts.txt: A text file containing the results of manually grouping these language codes.
keywords: datacite; metadata; language codes; repository data
published: 2016-06-06
 
These datasets represent first-time collaborations between first and last authors (with mutually exclusive publication histories) on PubMed papers with 2 to 5 authors published from 1988 through 2009. Each record in each dataset captures aspects of the similarity, nearness, and complementarity between two authors about the paper marking the formation of their collaboration.
published: 2016-05-26
 
This data set includes survey responses collected during 2015 from academic libraries with library publishing services. Each institution responded to questions related to its use of user studies or information about readers in order to shape digital publication design, formats, and interfaces. Survey data was supplemented with institutional categories to facilitate comparison across institutional types.
keywords: academic libraries; publishing; user experience; user studies
published: 2016-05-19
 
This dataset contains records of four years of taxi operations in New York City and includes 697,622,444 trips. Each trip records the pickup and drop-off dates, times, and coordinates, as well as the metered distance reported by the taximeter. The trip data also includes fields such as the taxi medallion number, fare amount, and tip amount. The dataset was obtained through a Freedom of Information Law request from the New York City Taxi and Limousine Commission. The files in this dataset are optimized for use with the ‘decompress.py’ script included in this dataset. The script also contains additional documentation and contact information that may help if you have trouble accessing the content of the zip files.
keywords: taxi; transportation; New York City; GPS
published: 2016-05-16
 
This dataset contains the protein sequences and trees used to compare NRPS condensation domains in the AMB gene cluster and was used to create figure S1 in Rojas et al. 2015. Instead of having to collect representative sequences independently, this set of condensation domain sequences may serve as a quick reference set for coarse classification of condensation domains.
keywords: condensation domain; NRPS; biosynthetic gene cluster; antimetabolite; Pseudomonas; oxyvinylglycine; secondary metabolite; thiotemplate; toxin