Dataset Search

Displaying 26 - 50 of 153 in total

Illinois Data Bank Dataset Search Results

Results

published: 2017-12-22

Targeted ballet program mitigates ataxia and improves agility in moderate-to-advanced multiple sclerosis

Scheidler, Andrew; Kinnett-Hopkins, Dominique; Learmonth, Yvonne; Motl, Robert; Lopez-Ortiz, Citlali (2017)

TBP assessment raw data files of pre- and post- motion capture velocity and center of pressure force plate data. Labels are self-explanatory. The .mat files contain data exported from the force plate for the time-to-stabilization assessments, while the .txt files contain the data collected for the smoothness-of-gait assessments. These files do not relate to one another and are from separate assessments. The Version 2 files are the result of running the Python script Data_Bank_Cleaner.py on the Version 1 files. Please find more information in READ_ME_databank.txt.

keywords: Multiple Sclerosis; Rehabilitation; Balance; Ataxia; Ballet; Dance; Targeted Ballet Program

published: 2018-04-23

Author-Linked data for Author-ity 2009

Torvik, Vetle I. (2018)

Provides links to Author-ity 2009, including records from principal investigators (on NIH and NSF grants), inventors on USPTO patents, and students/advisors on ProQuest dissertations. Note that NIH and NSF differ in the type of fields they record and standards used (e.g., institution names). Typically an NSF grant spanning multiple years is associated with one record, while an NIH grant occurs in multiple records, for each fiscal year, sub-projects/supplements, possibly with different principal investigators. The prior probability of match (i.e., that the author exists in Author-ity 2009) varies dramatically across NIH grants, NSF grants, and USPTO patents. The great majority of NIH principal investigators have one or more papers in PubMed but a minority of NSF principal investigators (except in biology) have papers in PubMed, and even fewer USPTO inventors do. This prior probability has been built into the calculation of match probabilities.

The NIH data were downloaded from NIH exporter and the older NIH CRISP files. The dataset has 2,353,387 records, only includes ones with match probability > 0.5, and has the following 12 fields:
1 app_id
2 nih_full_proj_nbr
3 nih_subproj_nbr
4 fiscal_year
5 pi_position
6 nih_pi_names
7 org_name
8 org_city_name
9 org_bodypolitic_code
10 age: number of years since their first paper
11 prob: the match probability to au_id
12 au_id: Author-ity 2009 author ID

The NSF dataset has 262,452 records, only includes ones with match probability > 0.5, and has the following 10 fields:
1 AwardId
2 fiscal_year
3 pi_position
4 PrincipalInvestigators
5 Institution
6 InstitutionCity
7 InstitutionState
8 age: number of years since their first paper
9 prob: the match probability to au_id
10 au_id: Author-ity 2009 author ID

There are two files for USPTO because here we linked disambiguated authors in PubMed (from Author-ity 2009) with disambiguated inventors.

The USPTO linking dataset has 309,720 records, only includes ones with match probability > 0.5, and has the following 3 fields:
1 au_id: Author-ity 2009 author ID
2 inv_id: USPTO inventor ID
3 prob: the match probability of au_id vs inv_id

The disambiguated inventors file (uiuc_uspto.tsv) has 2,736,306 records and the following 7 fields:
1 inv_id: USPTO inventor ID
2 is_lower
3 is_upper
4 fullnames
5 patents: patent IDs separated by '|'
6 first_app_yr
7 last_app_yr
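As a minimal sketch of how a row of the disambiguated inventors file might be read, the Python snippet below maps one tab-separated line onto the seven listed fields and splits the patents field on '|'. It assumes the file has no header row and that the columns appear in the listed order; neither is stated in the description. The helper name parse_inventor_row and the sample row are hypothetical.

```python
# Field order as listed in the dataset description (assumed to match column order).
FIELDS = [
    "inv_id", "is_lower", "is_upper", "fullnames",
    "patents", "first_app_yr", "last_app_yr",
]


def parse_inventor_row(line):
    """Turn one tab-separated line into a dict keyed by field name."""
    values = line.rstrip("\n").split("\t")
    if len(values) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(values)}")
    record = dict(zip(FIELDS, values))
    # Patent IDs are separated by '|' according to the description.
    record["patents"] = record["patents"].split("|") if record["patents"] else []
    return record


# Hypothetical row, for illustration only; not taken from the dataset.
sample_row = "inv_0001\tsmith_j\tsmith_j\tJohn Smith\t1234567|7654321\t1990\t2001"
record = parse_inventor_row(sample_row)
print(record["inv_id"], record["patents"])
```

Keeping every value as a string avoids guessing at types the description does not specify; convert first_app_yr and last_app_yr to integers only after checking the real file.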

keywords: PubMed; USPTO; Principal investigator; Name disambiguation

published: 2025-11-19

Data for Production of a δ-Lactam from Glucose through Integrating Biological and Chemical Catalysis

Kim, Min Soo; Shi, Longyuan; Zhao, Huimin; Huber, George (2025)

We present a new strategy for the production of a δ-lactam from glucose that integrates biological production of triacetic acid lactone (TAL, 4-hydroxy-6-methyl-2H-2-one) with catalytic transformation of TAL into 6-methylpiperidin-2-one (MPO) through metabolic engineering, isomerization, amination, and catalytic hydrogenation/hydrogenolysis. We developed a sustainable and antibiotic-free fed-batch fermentation using genetically modified Rhodotorula toruloides IFO0880. This process achieved a yield of 2-hydroxy-6-methyl-4H-pyran-4-one (2H4P) at 0.05 g/g of glucose, corresponding to a 9.9 g/L titer. By adjusting the pH of the fermentation broth to 2, 2H4P was quantitatively converted into TAL. The TAL in the fermentation broth was directly converted by aminolysis into 4-hydroxy-6-methylpyridin-2(1H)-one (HMPO), which achieved an 18.5% yield with 94.3% purity. The HMPO yield was lower in the fermentation broth than in a clean feedstock (32.2%), suggesting that the biological impurities are inhibitors in this reaction. Further investigation revealed that lower pH levels and reduced TAL concentrations in the fermentation broth significantly decreased HMPO yields. Subsequently, the precipitated HMPO was filtered and dried and then subjected to the final catalytic conversion in H2O solvent, achieving an MPO yield of 91.8%. This integrated approach demonstrated the direct use of TAL in the filtered aqueous fermentation broth without the need to isolate TAL.

keywords: Conversion; Catalysis; Metabolic Engineering

published: 2018-12-06

NEXUS data file for phylogenetic analysis of Iassinae (Hemiptera: Cicadellidae)

Krishnankutty, Sindhu; Dietrich, Christopher; Dai, Wu; Siddappaji, Madhura (2018)

The text file contains the original DNA sequence data used in the phylogenetic analyses of Krishnankutty et al. (2016: Systematic Entomology 41: 580–595). The text file is marked up according to the standard NEXUS format commonly used by various phylogenetic analysis software packages. The file will be parsed automatically by a variety of programs that recognize NEXUS as a standard bioinformatics file format. The file contains five separate data blocks, one for each character partition (28S, histone H3, 12S, indels, and morphology) for 53 taxa (species). Gaps inserted into the DNA sequence alignment are indicated by a dash, and missing data are indicated by a question mark. The separate "indels1" block includes 40 indels (insertions/deletions) from the 28S sequence alignment re-coded using the modified complex indel coding scheme, as described in the "Materials and methods" of the original publication. The DIMENSIONS statements near the beginning of each block indicate the numbers of taxa (NTax) and characters (NChar). The file contains aligned nucleotide sequence data for 3 gene regions and 40 morphological characters. The file is configured for use with the maximum likelihood-based phylogenetic program GARLI but can also be parsed by any other bioinformatics software that supports the NEXUS format. Descriptions of the morphological characters and more details on the species and specimens included in the dataset are provided in the supplementary document included as a separate pdf. The original raw DNA sequence data are available from NCBI GenBank under the accession numbers indicated in the supporting pdf file. More details on individual analyses are provided in the original publication.
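To illustrate the role of the DIMENSIONS statements, here is a minimal Python sketch that collects the NTax and NChar values from each block of a NEXUS text. It is a rough regular-expression scan, not a full NEXUS parser, and the toy input below is invented for illustration; it is not taken from the dataset.

```python
import re

# Matches "DIMENSIONS ... ;" and captures everything up to the semicolon.
DIMENSIONS_RE = re.compile(r"DIMENSIONS\s+([^;]*);", re.IGNORECASE)
KEY_VALUE_RE = re.compile(r"(\w+)\s*=\s*(\d+)")


def read_dimensions(nexus_text):
    """Return one (ntax, nchar) tuple per DIMENSIONS statement, in file order."""
    results = []
    for match in DIMENSIONS_RE.finditer(nexus_text):
        pairs = {key.lower(): int(value)
                 for key, value in KEY_VALUE_RE.findall(match.group(1))}
        # A missing keyword is reported as None rather than guessed.
        results.append((pairs.get("ntax"), pairs.get("nchar")))
    return results


# Toy two-block file (hypothetical contents).
toy_nexus = """#NEXUS
BEGIN DATA;
  DIMENSIONS NTax=3 NChar=8;
  FORMAT DATATYPE=DNA GAP=- MISSING=?;
  MATRIX
  taxon_a ACGT-ACG
  taxon_b ACGTTAC?
  taxon_c AC--TACG
  ;
END;
BEGIN DATA;
  DIMENSIONS NTax=3 NChar=2;
  FORMAT DATATYPE=STANDARD MISSING=?;
  MATRIX
  taxon_a 01
  taxon_b 1?
  taxon_c 00
  ;
END;
"""
dimensions = read_dimensions(toy_nexus)
print(dimensions)
```

For real analyses, a dedicated NEXUS reader is the safer choice; this scan is only meant to show where the taxon and character counts live.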

keywords: phylogeny; DNA sequence; morphology; Insecta; Hemiptera; Cicadellidae; leafhopper; evolution; 28S rDNA; histone H3; 12S mtDNA; maximum likelihood

published: 2023-12-13

Distribution of nonindigenous Basket Clams (Corbicula spp.) in Mexico

Tiemann, Jeremy (2023)

Corbicula spp. are one of the most prolific aquatic invasive species in the world and can have negative effects on aquatic ecosystems. We performed qualitative field surveys, examined literature accounts and natural history museum holdings, and accessed citizen science data sources to document the distribution of Corbicula in Mexico and shared drainages. Through 26 publications (N = 127 records), 312 museum holdings, and 446 iNaturalist records, we documented 885 records pertaining to Corbicula in Mexico and shared drainages. The first record of the species in Mexico was in 1969, and it has since been reported from 26 of the 32 Mexican states and most of the major river basins throughout the country. However, we suggest Corbicula is more prevalent in Mexico than we report in this work, as it is often undersampled and underreported.

keywords: Corbicula; exotic species; invasive species; Asian Clams; Bivalvia; freshwater systems

published: 2025-09-29

Data for Characterization of the Ghd8 Flowering Time Gene in a Mini-Core Collection of Miscanthus sinensis

Guo, Zhihui; Xu, Meilan; Nagano, Hironori; Clark, Lindsay; Sacks, Erik; Yamada, Toshihiko (2025)

The optimal flowering time for the bioenergy crop miscanthus is essential for environmental adaptability and biomass accumulation. However, little is known about how genes controlling flowering in other grasses contribute to flowering regulation in miscanthus. Here, we report on the sequence characterization and gene expression of Miscanthus sinensis Ghd8, a transcription factor encoding a HAP3/NF-YB DNA-binding domain, which has been identified as a major quantitative trait locus in rice, with pleiotropic effects on grain yield, heading date and plant height. In M. sinensis, we identified two homoeologous loci, MsiGhd8A located on chromosome 13 and MsiGhd8B on chromosome 7, with one on each of this paleo-allotetraploid species’ subgenomes. A total of 46 alleles and 28 predicted protein sequence types were identified in 12 wild-collected accessions. Several variants of MsiGhd8 showed a geographic and latitudinal distribution. Quantitative real-time PCR revealed that MsiGhd8 was expressed under both long days and short days, and MsiGhd8B showed a significantly higher expression than MsiGhd8A. The comparison between flowering time and gene expression indicated that MsiGhd8B affected flowering time in response to day length for some accessions. This study provides insight into the conserved function of Ghd8 in the Poaceae, and is an important initial step in elucidating the flowering regulatory network of Miscanthus.

keywords: Feedstock Production; Genomics

published: 2025-10-27

Data for Evaluation of Strategies to Narrow the Product Chain-Length Distribution of Microbially Synthesized Free Fatty Acids

Jindra, Michael A.; Choe, Kisurb; Chowdhury, Ratul; Kong, Ryan; Ghaffari, Soodabeh; Sweedler, Jonathan; Pfleger, Brian (2025)

The dominant strategy for tailoring the chain-length distribution of free fatty acids (FFA) synthesized by heterologous hosts is expression of a selective acyl-acyl carrier protein (ACP) thioesterase. However, few of these enzymes can generate a precise (greater than 90% of a desired chain-length) product distribution when expressed in a microbial or plant host. The presence of alternative chain-lengths can complicate purification in situations where blends of fatty acids are not desired. We report the assessment of several strategies for improving the dodecanoyl-ACP thioesterase from the California bay laurel to exhibit more selective production of medium-chain free fatty acids to near exclusivity. We demonstrated that matrix-assisted laser desorption/ionization time-of-flight mass spectrometry (MALDI-ToF MS) was an effective library screening technique for identification of thioesterase variants with favorable shifts in chain-length specificity. This strategy proved to be a more effective screening technique than several rational approaches discussed herein. With these data, we isolated four thioesterase variants which exhibited a more selective FFA distribution over wildtype when expressed in the fatty acid accumulating E. coli strain, RL08. We then combined mutations from the MALDI isolates to generate BTE-MMD19, a thioesterase variant capable of producing free fatty acids consisting of 90% of C12 products. Of the four mutations which conferred a specificity shift, we noted that three affected the shape of the binding pocket, while one occurred on the positively charged acyl carrier protein landing pad. Finally, we fused the maltose binding protein (MBP) from E. coli to the N-terminus of BTE-MMD19 to improve enzyme solubility and achieve a titer of 1.9 g per L of twelve-carbon fatty acids in a shake flask.

keywords: Conversion; Genomics

published: 2019-07-08

Wikipedia category embeddings - Node2Vec, Poincare, Elmo

Mishra, Shubhanshu (2019)

Wikipedia category tree embeddings based on the Wikipedia SQL dump dated 2017-09-20 (https://archive.org/download/enwiki-20170920), created using the following algorithms:
* Node2vec
* Poincare embedding
* Elmo model on the category title

The following files are present:
* wiki_cat_elmo.txt.gz (15G) - Elmo embeddings. Format: category_name (space replaced with "_") <tab> 300 dim space separated embedding.
* wiki_cat_elmo.txt.w2v.gz (15G) - Elmo embeddings. Format: word2vec format can be loaded using Gensim Word2VecKeyedVector.load_word2vec_format.
* elmo_keyedvectors.tar.gz - Gensim Word2VecKeyedVector format of Elmo embeddings. Nodes are indexed using
* node2vec.tar.gz (3.4G) - Gensim word2vec model which has node2vec embedding for each category identified using the position (starting from 0) in category.txt
* poincare.tar.gz (1.8G) - Gensim poincare embedding model which has poincare embedding for each category identified using the position (starting from 0) in category.txt
* wiki_category_random_walks.txt.gz (1.5G) - Random walks generated by node2vec algorithm (https://github.com/aditya-grover/node2vec/tree/master/node2vec_spark), each category identified using the position (starting from 0) in category.txt
* categories.txt - One category name per line (with spaces). The line number (starting from 0) is used as category ID in many other files.
* category_edges.txt - Category edges based on category names (with spaces). Format from_category <tab> to_category
* category_edges_ids.txt - Category edges based on category ids, each category identified using the position (starting from 1) in category.txt
* wiki_cats-G.json - NetworkX format of category graph, each category identified using the position (starting from 1) in category.txt

Software used:
* https://github.com/napsternxg/WikiUtils - Processing sql dumps
* https://github.com/napsternxg/node2vec - Generate random walks for node2vec
* https://github.com/RaRe-Technologies/gensim (version 3.4.0) - generating node2vec embeddings from random walks generated using node2vec algorithm
* https://github.com/allenai/allennlp (version 0.8.2) - Generate elmo embeddings for each category title

Code used:
* wiki_cat_node2vec_commands.sh - Commands used to
* wiki_cat_generate_elmo_embeddings.py - generate elmo embeddings
* wiki_cat_poincare_embedding.py - generate poincare embeddings
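The tab-separated text format described for wiki_cat_elmo.txt.gz can be read without any embedding library. The Python sketch below parses one line into a category name and a vector, and streams a gzipped file lazily because the archive is large. The function names and the three-dimensional sample line are hypothetical (the real vectors are described as 300-dimensional), and restoring spaces from underscores is an assumption that would also alter any name that originally contained an underscore.

```python
import gzip


def parse_embedding_line(line):
    """Split 'category_name<TAB>v1 v2 ...' into (name, list of floats)."""
    name, vector_text = line.rstrip("\n").split("\t", 1)
    vector = [float(value) for value in vector_text.split()]
    # Underscores stand in for spaces in this file, per the description.
    return name.replace("_", " "), vector


def iter_embeddings(path):
    """Yield (name, vector) pairs one at a time from a gzipped text file."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                yield parse_embedding_line(line)


# Hypothetical 3-dimensional line, for illustration only.
name, vector = parse_embedding_line("Computer_science\t0.1 -0.2 0.3\n")
print(name, len(vector))
```

Iterating with a generator keeps memory use flat, which matters for a 15G file; load everything into a dictionary only if the machine has room for it.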

keywords: Wikipedia; Wikipedia Category Tree; Embeddings; Elmo; Node2Vec; Poincare

published: 2020-08-01

NEXUS morphological data file for phylogenetic analysis of Empoascini

Xu, Ye; Dietrich, Christopher H.; Zhang, Yalin; Dmitriev, Dmitry; Zhang, Li; Wang, Yi-Mei; Lu, Si-Han; Qin, Dao-Zheng (2020)

The Empoascini_morph_data.nex text file contains the original data used in the phylogenetic analyses of Xu et al. (Systematic Entomology, in review). The text file is marked up according to the standard NEXUS format commonly used by various phylogenetic analysis software packages. The file will be parsed automatically by a variety of programs that recognize NEXUS as a standard bioinformatics file format. The first nine lines of the file indicate the file type (Nexus), that 110 taxa were analyzed, that a total of 99 characters were analyzed, the format of the data, and specification for symbols used in the dataset to indicate different character states. For species that have more than one state for a particular character, the states are enclosed in square brackets. Question marks represent missing data. The pdf file, Appendix1.pdf, describes the morphological characters and character states that were scored in the dataset. The data analyses are described in the cited original paper.
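The bracket and question-mark conventions can be made concrete with a small Python sketch that splits one taxon's character string into per-character state sets. It assumes single-symbol states with no whitespace inside the row; the function name and the sample string are hypothetical and are not drawn from the data file.

```python
def split_character_states(row):
    """Return one entry per character: a set of states, or None if missing."""
    states = []
    i = 0
    while i < len(row):
        symbol = row[i]
        if symbol == "[":
            # Polymorphic character: every symbol up to ']' is an observed state.
            end = row.index("]", i)
            states.append(set(row[i + 1:end]))
            i = end + 1
        elif symbol == "?":
            # Missing data.
            states.append(None)
            i += 1
        else:
            states.append({symbol})
            i += 1
    return states


# Hypothetical row: five characters, the third polymorphic, the fourth missing.
scored = split_character_states("01[12]?0")
print(scored)
```

Counting the entries returned for each taxon is a quick check that a row really holds the number of characters declared in the file header.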

keywords: Hemiptera; Cicadellidae; morphology; biogeography; evolution

published: 2021-04-19

Response of Soil Quality Indicators including β-glucosidase, Fluorescein Diacetate Hydrolysis and Permanganate Oxidizable Carbon

Xia, Yushu; Wander, Michelle (2021)

Dataset compiled by Yushu Xia and Michelle Wander for the Soil Health Institute. Data were recovered from peer reviewed literature reporting results for three soil quality indicators (SQIs) (β-glucosidase (BG), fluorescein diacetate (FDA) hydrolysis, and permanganate oxidizable carbon (POXC)) in terms of their relative response to management where soils under grassland cover, no-tillage, cover crops, residue return and organic amendments were compared to conventionally managed controls. Peer-reviewed articles published between January of 1990 and May 2018 were searched using the Thomson Reuters Web of Science database (Thomson Reuters, Philadelphia, Pennsylvania) and Google Scholar to identify studies reporting results for: “β-glucosidase”, “permanganate oxidizable carbon”, “active carbon”, “readily oxidizable carbon”, and “fluorescein diacetate hydrolysis”, together with one or more of the following: “management practice”, “tillage”, “cover crop”, “residue”, “organic fertilizer”, or “manure”. Records were tabulated to compare SQI abundance in soil maintained under a control and soil aggrading practice with the intent to contribute to SQI databases that will support development of interpretive frameworks and/or algorithms including pedo-transfer functions relating indicator abundance to management practices and site specific factors. Meta-data include the following key descriptor variables and covariates useful for development of scoring functions: 1) identifying factors for the study site (location, year of initiation of study and year in which data was reported), 2) soil textural class, pH, and SOC, 3) depth and timing of soil sampling, 4) analytical methods for SQI quantification, 5) units used in published works (i.e. equivalent mass, concentration), 6) SQI abundances, and 7) statistical significance of difference comparisons. *Note: Blank values in tables are considered unreported data.

keywords: Soil health promoting practices; Soil quality indicators; β-glucosidase; fluorescein diacetate hydrolysis; Permanganate oxidizable carbon; Greenhouse gas emissions; Scoring curves; Soil Management Assessment Framework

published: 2022-08-08

Datasets for EMMA: A New Method for Computing Multiple Sequence Alignments given a Constraint Subset Alignment

Shen, Chengze; Liu, Baqiao; Williams, Kelly P.; Warnow, Tandy (2022)

This upload contains all datasets used in Experiment 2 of the EMMA paper (appeared in WABI 2023): Shen, Chengze, Baqiao Liu, Kelly P. Williams, and Tandy Warnow. "EMMA: A New Method for Computing Multiple Sequence Alignments given a Constraint Subset Alignment".

The zip file has the following structure (presented as an example):

    salma_paper_datasets/
    |_README.md
    |_10aa/
    |_crw/
    |_homfam/
        |_aat/
        |   |_...
        |_...
    |_het/
        |_5000M2-het/
        |   |_...
        |_5000M3-het/
        ...
    |_rec_res/

Generally, the structure can be viewed as: [category]/[dataset]/[replicate]/[alignment files]

# Categories:
1. 10aa: There are 10 small biological protein datasets within the `10aa` directory, each with just one replicate.
2. crw: There are 5 selected CRW datasets, namely 5S.3, 5S.E, 5S.T, 16S.3, and 16S.T, each with one replicate. These are the cleaned version from Shen et al. 2022 (MAGUS+eHMM).
3. homfam: There are the 10 largest Homfam datasets, each with one replicate.
4. het: There are three newly simulated nucleotide datasets from this study, 5000M2-het, 5000M3-het, and 5000M4-het, each with 10 replicates.
5. rec_res: It contains the Rec and Res datasets. Detailed dataset generation can be found in the supplementary materials of the paper.

# Alignment files
There are at most 6 `.fasta` files in each sub-directory:
1. `all.unaln.fasta`: All unaligned sequences.
2. `all.aln.fasta`: Reference alignments of all sequences. If not all sequences have reference alignments, only the sequences that have will be included.
3. `all-queries.unaln.fasta`: All unaligned query sequences. Query sequences are sequences that do not have lengths within 25% of the median length (i.e., not full-length sequences).
4. `all-queries.aln.fasta`: Reference alignments of query sequences. If not all queries have reference alignments, only the sequences that have will be included.
5. `backbone.unaln.fasta`: All unaligned backbone sequences. Backbone sequences are sequences that have lengths within 25% of the median length (i.e., full-length sequences).
6. `backbone.aln.fasta`: Reference alignments of backbone sequences. If not all backbone sequences have reference alignments, only the sequences that have will be included.

> If all sequences are full-length sequences, then `all-queries.unaln.fasta` will be missing.
> If fewer than two query sequences have reference alignments, then `all-queries.aln.fasta` will be missing.
> If fewer than two backbone sequences have reference alignments, then `backbone.aln.fasta` will be missing.

# Additional file(s)
1. `350378genomes.txt`: the file contains all 350,378 bacterial and archaeal genome names that were used by Prodigal (Hyatt et al. 2010) to search for protein sequences.
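The backbone/query rule stated above (length within 25% of the median versus not) can be sketched in a few lines of Python. This is an illustration of the rule as described, not the authors' code: treating the bounds as inclusive and taking the median over all sequences in a replicate are assumptions, and the function name and toy sequences are hypothetical.

```python
from statistics import median


def split_backbone_and_queries(sequences, tolerance=0.25):
    """Partition {name: sequence} by whether length is within tolerance of the median."""
    median_length = median(len(seq) for seq in sequences.values())
    lower = (1 - tolerance) * median_length
    upper = (1 + tolerance) * median_length
    backbone, queries = {}, {}
    for name, seq in sequences.items():
        # Inclusive bounds are an assumption; the description does not say.
        if lower <= len(seq) <= upper:
            backbone[name] = seq
        else:
            queries[name] = seq
    return backbone, queries


# Toy unaligned sequences with lengths 100, 110, 95, 40, and 160 (median 100).
toy_sequences = {
    "s1": "A" * 100,
    "s2": "A" * 110,
    "s3": "A" * 95,
    "s4": "A" * 40,
    "s5": "A" * 160,
}
backbone, queries = split_backbone_and_queries(toy_sequences)
print(sorted(backbone), sorted(queries))
```

With a median of 100, the accepted range is 75 to 125, so the three sequences near full length form the backbone and the short and long outliers become queries.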

keywords: SALMA; MAFFT; alignment; eHMM; sequence length heterogeneity

published: 2022-12-21

Dataset associated with the "Fishes of Champaign County, Illinois: as affected by 120 years of stream changes" manuscript by Sherwood et al.

Sherwood, Joshua; Tiemann, Jeremy; Stein, Jeffrey (2022)

This dataset is associated with a larger manuscript published in 2022 in the Illinois Natural History Survey Bulletin that summarized the Fishes of Champaign County project from 2012-2015. With data spanning over 120 years, the Fishes of Champaign County is a comprehensive, long-term investigation into the changing fish communities of east-central Illinois. Surveys first occurred in Champaign County in the late 1880s (40 sites), with subsequent surveys in 1928–1929 (125 sites), 1959–1960 (143 sites), and 1987–1988 (141 sites). Between 2012 and 2015, we resampled 122 sites across Champaign County. The combined data from these five surveys have produced a unique perspective into not only the fish communities of the region, but also insight into in-stream habitat changes during the past 120 years. The dataset is in Microsoft Access format, with five data tables, one for each time period surveyed. Field names are self-explanatory, with some variation in data types collected during different surveys as follows: Forbes & Richardson (1880s) collected presence/absence only. Thompson & Hunt (1928-1929) collected abundance only. Larimore & Smith (1959-1960) collected length and weight for some samples, but only presence/absence at others. In some cases, fish of the same species were weighed in bulk, with the fields “LOW” and “HIGH” indicating the lower and upper limits of total length in the batch, and weight indicating the gross weight of all fish in the batch. Larimore and Bayley (1987-1988) collected length and weight for all surveys, and Sherwood and Stein (2012-2015) collected length and weight for all surveys except for cases where extremely abundant single species were subsampled. Lengths are reported in millimeters, and weight in grams. Two lookup tables provide information about species codes used in the data tables and sample site location and notes.

keywords: fishes of Champaign County; streams; anthropogenic disturbances; long-term dataset

published: 2025-05-02

Dataset for studying transitive closure in citations

Fu, Yuanxi (2025)

This dataset contains the first-generation (1st-gen) and second-generation (2nd-gen) citation relationships to a set of focal papers. The 1st-gen citation relationships are the instances of one paper citing a focal paper. These citing papers are called "1st-gen citations." The 2nd-gen citation relationships are the instances that a paper cites a 1st-gen citation. The citing paper in the 2nd-gen citation relationship is a second-generation (2nd-gen) citation. When a 2nd-gen citation is also a 1st-gen citation, it creates a transitive closure with the focal paper. Each focal paper has an abbreviation, which can be found below. The 1st-gen and 2nd-gen citation relationships were extracted from the Curated Open Citation Dataset (Korobskiy & Chacko, 2023), which is derived from a copy of COCI, the OpenCitations Index of Crossref Open DOI-to-DOI Citations, downloaded on May 6, 2023. Scripts used to collect this dataset can be found at https://github.com/yuanxiesa/transitive_closure_study. Each focal paper currently has two files: {abbreviation}_1st.csv contains the 1st-gen citation relationships; {abbreviation}_2nd.csv contains the 2nd-gen citation relationships. Focal paper abbreviation == "louvain": Blondel, V. D., Guillaume, J.-L., Lambiotte, R., & Lefebvre, E. (2008). Fast unfolding of communities in large networks. Journal of Statistical Mechanics: Theory and Experiment, 2008(10), P10008. https://doi.org/10.1088/1742-5468/2008/10/P10008 Focal paper abbreviation == "lp": Raghavan, U. N., Albert, R., & Kumara, S. (2007). Near linear time algorithm to detect community structures in large-scale networks. Physical Review E, 76(3), 036106. https://doi.org/10.1103/PhysRevE.76.036106 Focal paper abbreviation == "gn": Newman, M. E. J., & Girvan, M. (2004). Finding and evaluating community structure in networks. Physical Review E, 69(2), 026113. https://doi.org/10.1103/PhysRevE.69.026113
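The definition of a transitive closure given above can be sketched in Python as a set-membership test: a 2nd-gen citation relationship closes a triangle with the focal paper when its citing paper is itself a 1st-gen citation. The column layout of the CSV files is not described here, so the sketch works on in-memory pairs; the function name and the toy identifiers are hypothetical.

```python
def find_transitive_closures(first_gen_citers, second_gen_pairs):
    """Return (citing, cited) pairs where both papers also cite the focal paper.

    first_gen_citers: iterable of papers that cite the focal paper.
    second_gen_pairs: iterable of (citing_paper, cited_first_gen_paper).
    """
    first_gen = set(first_gen_citers)
    closures = {
        (citing, cited)
        for citing, cited in second_gen_pairs
        # The cited paper is 1st-gen by construction; check it anyway for safety.
        if citing in first_gen and cited in first_gen
    }
    return sorted(closures)


# Toy example: A, B, and C cite the focal paper.
toy_first_gen = ["A", "B", "C"]
toy_second_gen = [("B", "A"), ("D", "A"), ("C", "B"), ("E", "C")]
closures = find_transitive_closures(toy_first_gen, toy_second_gen)
print(closures)
```

In the toy data, B cites A and C cites B while all three cite the focal paper, so those two relationships are closures; D and E cite only 1st-gen papers and are not.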

keywords: transitive closure; citations; community detection algorithms; OpenCitations; method papers

published: 2024-11-19

Dataset for Reassessment of the agreement in retraction indexing across 4 multidisciplinary sources: Crossref, Retraction Watch, Scopus, and Web of Science

Salami, Malik Oyewale; McCumber, Corinne (2024)

This project investigates retraction indexing agreement among data sources: Crossref, Retraction Watch, Scopus, and Web of Science. As of July 2024, this reassesses the April 2023 union list of Schneider et al. (2023): https://doi.org/10.55835/6441e5cae04dbe5586d06a5f. As of April 2023, over 1 in 5 DOIs had discrepancies in retraction indexing among the 49,924 DOIs indexed as retracted in at least one of Crossref, Retraction Watch, Scopus, and Web of Science (Schneider et al., 2023). Here, we determine what changed in 15 months. Pipeline code to get the results files can be found in the GitHub repository https://github.com/infoqualitylab/retraction-indexing-agreement in the iPython notebook 'MET-STI2024_Reassessment_of_retraction_indexing_agreement.ipynb'. Some files have been redacted to remove proprietary data, as noted in README.txt. Among our sources, data is openly available only for Crossref and Retraction Watch.

FILE FORMATS:
1) unionlist_completed_2023-09-03-crws-ressess.csv - UTF-8 CSV file
2) unionlist_completed-ria_2024-07-09-crws-ressess.csv - UTF-8 CSV file
3) unionlist-15months-period_sankey.png - Portable Network Graphics (PNG) file
4) unionlist_ria_proportion_comparison.png - Portable Network Graphics (PNG) file
5) README.txt - text file

FILE DESCRIPTION: Description of the files can be found in README.txt

keywords: retraction status; data quality; indexing; retraction indexing; metadata; meta-science; RISRS

published: 2018-12-20

Inclusion_Criteria_Annotation

Dong, Xiaoru; Xie, Jingyi; Linh, Hoang (2018)

File Name: Inclusion_Criteria_Annotation.csv
Data Preparation: Xiaoru Dong
Date of Preparation: 2018-12-14
Data Contributions: Jingyi Xie, Xiaoru Dong, Linh Hoang
Data Source: Cochrane systematic reviews published up to January 3, 2018 by 52 different Cochrane groups in 8 Cochrane group networks.
Associated Manuscript authors: Xiaoru Dong, Jingyi Xie, Linh Hoang, and Jodi Schneider.
Associated Manuscript, Working title: Machine classification of inclusion criteria from Cochrane systematic reviews.

Description: The file contains lists of inclusion criteria of Cochrane Systematic Reviews and the manual annotation results. 5420 inclusion criteria were annotated, out of 7158 inclusion criteria available. Annotations are either "Only RCTs" or "Others". There are 2 columns in the file:
- "Inclusion Criteria": Content of inclusion criteria of Cochrane Systematic Reviews.
- "Only RCTs": Manual annotation results. An "x" means the inclusion criteria is classified as "Only RCTs". Blank means that the inclusion criteria is classified as "Others".

Notes:
1. "RCT" stands for Randomized Controlled Trial, which, in definition, is "a work that reports on a clinical trial that involves at least one test treatment and one control treatment, concurrent enrollment and follow-up of the test- and control-treated groups, and in which the treatments to be administered are selected by a random process, such as the use of a random-numbers table." [Randomized Controlled Trial publication type definition from https://www.nlm.nih.gov/mesh/pubtypes.html].
2. In order to reproduce the relevant data to this, please get the code of the project published on GitHub at: https://github.com/XiaoruDong/InclusionCriteria and run the code following the instruction provided.
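The two-column annotation scheme (an "x" for "Only RCTs", blank for "Others") can be turned into explicit labels with Python's standard csv module. This is a minimal sketch assuming the column headers are exactly "Inclusion Criteria" and "Only RCTs" as described and that the mark may vary in case or surrounding whitespace; the function name and the two sample rows are hypothetical.

```python
import csv
import io


def load_annotations(handle):
    """Read rows and return (criteria_text, label) pairs."""
    labelled = []
    for row in csv.DictReader(handle):
        mark = (row.get("Only RCTs") or "").strip().lower()
        # "x" marks "Only RCTs"; a blank cell means "Others".
        label = "Only RCTs" if mark == "x" else "Others"
        labelled.append((row["Inclusion Criteria"], label))
    return labelled


# Hypothetical rows, for illustration only; not taken from the dataset.
sample_csv = io.StringIO(
    "Inclusion Criteria,Only RCTs\n"
    '"Randomised controlled trials only.",x\n'
    '"Randomised and quasi-randomised trials, cohort studies.",\n'
)
annotations = load_annotations(sample_csv)
labels = [label for _, label in annotations]
print(labels)
```

To read the real file, pass an open file handle instead of the in-memory sample, for example one opened with newline="" as the csv module documentation recommends.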

keywords: Inclusion criteria; Randomized controlled trials; Machine learning; Systematic reviews

published: 2020-06-02

NEXUS file for phylogenetic analysis of Eurymelinae (Hemiptera: Cicadellidae)

Xue, Qingquan; Dietrich, Christopher; Zhang, Yalin (2020)

The text file contains the original data used in the phylogenetic analyses of Xue et al. (2020: Systematic Entomology, in press). The text file is marked up according to the standard NEXUS format commonly used by various phylogenetic analysis software packages. The file will be parsed automatically by a variety of programs that recognize NEXUS as a standard bioinformatics file format. The first six lines of the file identify the file as NEXUS, indicate that the file contains data for 89 taxa (species) and 2676 characters, indicate that the first 2590 characters are DNA sequence and the last 86 are morphological, that gaps inserted into the DNA sequence alignment and inapplicable morphological characters are indicated by a dash, and that missing data are indicated by a question mark. The file contains aligned nucleotide sequence data for 5 gene regions and 86 morphological characters. The positions of data partitions are indicated in the mrbayes block of commands for the phylogenetic program MrBayes at the end of the file (Subset1 = 16S gene; Subset2 = 28S gene; Subset3 = COI gene; Subset 4 = Histone H3 and H2A genes). The mrbayes block also contains instructions for MrBayes on various non-default settings for that program. These are explained in the original publication. Descriptions of the morphological characters and more details on the species and specimens included in the dataset are provided in the supplementary document included as a separate pdf, also available from the journal website. The original raw DNA sequence data are available from NCBI GenBank under the accession numbers indicated in the supplementary file.

keywords: phylogeny; DNA sequence; morphology; Insecta; Hemiptera; Cicadellidae; leafhopper; evolution; 28S rDNA; 16S rDNA; histone H3; histone H2A; cytochrome oxidase I; Bayesian analysis

published: 2024-10-07

SoyFACE Fumigation Data Files

Kole Aspray, Elise; Ainsworth, Elizabeth; McGrath, Jesse; McGrath, Justin; Montes, Christopher; Whetten, Andrew; Ort, Donald; Long, Stephen; Puthuval, Kannan; Mies, Timothy; Bernacchi, Carl; DeLucia, Evan; Dalsing, Bradley; Leakey, Andrew; Li, Shuai; Herriott, Jelena; Miglietta, Franco (2024)

This data set is related to the SoyFACE experiments, which are open-air agricultural climate change experiments that have been conducted since 2001. The fumigation experiments take place at the SoyFACE farm and facility in Champaign County, Illinois during the growing season of each year, typically between June and October. This V4 contains new experimental data files, hourly fumigation files, and weather/ambient files for 2022 and 2023, since the original dataset only included files for 2001-2021. The MATLAB code has also been updated for efficiency, and explanatory files have been updated accordingly.
Below are the new changes in V4:
- The "SoyFACE Plot Information 2001 to 2021" file is renamed to “SoyFACE ring information 2001 to 2023.xlsx”. Data for 2022 and 2023 were added. The file contains information about each year of the SoyFACE experiments, including the fumigation treatment type (CO2, O3, or a combination treatment), the crop species, the plots (also referred to as 'rings' and labeled with numbers between 2 and 31) used in each experiment, important experiment dates, and the target concentration levels or 'setpoints' for CO2 and O3 in each experiment.
- The "SoyFACE 1-Minute Fumigation Data Files" were updated to contain sub-folders for each year of the experiments (2001-2023), each of which contains sub-folders for each ring used in that year's experiments. This data set also includes hourly data files for the fumigation experiments ("SoyFACE Hourly Fumigation Data Files" folder) created from the 1-minute files, and hourly ambient/weather data files for each year of the experiments ("Hourly Weather and Ambient Data Files" folder, which has also been updated to include 2022 and 2023 data). The ambient CO2 and O3 data are collected at SoyFACE, and the weather data are collected from the SURFRAD and WARM weather stations located near the SoyFACE farm.
- “Rings.xlsx” is new in this version. This file lists the rings and treatments used in each year of the SoyFACE experiments between 2001 and 2023 and is used in several of the MATLAB codes.
- “CMI Weather Data Explanation.docx” is newly added. This file contains specific information about the processing of raw weather data, which is used in the hourly weather and ambient data files.
- Files that were in RAR format in V3 are now updated and saved in ZIP format, including: Hourly Weather and Ambient Data Files.zip, SoyFACE 1-Minute Fumigation Data Files.zip, SoyFACE Hourly Fumigation Data Files.zip, and Matlab Files.zip.
- The "Fumigation Target Percentages" file was updated to add data for 2022 and 2023. This file shows how much of the time the CO2 and O3 fumigation levels are within a 10 or 20 percent margin of the target levels when the fumigation system is turned on.
- The "Matlab Files" folder contains custom code (Aspray, E.K.) that was used to clean the "SoyFACE 1-Minute Fumigation Data" files and to generate the "SoyFACE Hourly Fumigation Data" and "Fumigation Target Percentages" files. Code information can be found in the various "Explanation" files. The MATLAB code changes are as follows:
1. “Data_Issues_Finder.m” code was changed to use the “Rings.xlsx” file to gather ring and treatment information based on the contents of the file rather than being hardcoded in the MATLAB code itself.
2. “Data_Issues_Finder_all.m” code is new. This code is the same as the “Data_Issues_Finder.m” code except that it identifies all CO2 and O3 repeats. In contrast, the “Data_Issues_Finder.m” code only identifies CO2 and O3 repeats that occur when the fumigation system is turned on.
3. “Target_Yearly.m” code was changed to use the “Rings.xlsx” file to gather ring and treatment information based on the contents of the file rather than being hardcoded in the MATLAB code itself.
4. “HourlyFumCode.m” code is new. This code uses the “Rings.xlsx” file to gather ring and treatment information based on the contents of the file instead of the user needing to define these values explicitly. This code also defines a list of all ring folders for the year selected and runs the hourly code for each ring, instead of the user having to run the hourly code for each ring individually. Finally, the code generates two dialog boxes for the user: one allows the user to specify whether they want the hourly code to be run for 1-minute fumigation files or 1-minute ambient files, and the other allows the user to specify whether they would like the hourly fumigation averages to be replaced with hourly ambient averages when the fumigation system is turned off.
5. “HourlyDataFun.m” code was changed to run either “HourlyData.m” code or “HourlyDataAmb.m” code, depending on user input in the first dialog box.
6. “HourlyData.m” code was changed to replace hourly fumigation averages with hourly ambient averages when the fumigation system is turned off, depending on user input in the second dialog box.
7. “HourlyDataAmb.m” code is new. This code is similar to “HourlyData.m” code but is used to calculate hourly averages for 1-minute ambient files instead of 1-minute fumigation files.
8. “batch.m” code was changed to account for new function input variables in “HourlyDataFun.m” code, along with adding header columns for the “FumOutput.xlsx” and “AmbOutput.xlsx” output files generated by “HourlyData.m” and “HourlyDataAmb.m” code.
- Finally, the " * Explanation" files contain information about the column names, units of measurement, steps needed to use the MATLAB code, and other pertinent information for each data file. Some of them have been updated to reflect the current data changes.
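Two of the derived products described above, hourly averages built from 1-minute readings and the share of fumigation time spent within a 10 or 20 percent margin of the setpoint, can be sketched in Python. This is an illustrative sketch only and does not reproduce the dataset's MATLAB code: the record layout `(timestamp, concentration, fumigating)`, the function names, and the sample values are assumptions made for the example.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from statistics import mean


def hourly_means(records):
    """Average 1-minute readings into hourly values keyed by hour start."""
    buckets = defaultdict(list)
    for timestamp, value, _fumigating in records:
        hour = timestamp.replace(minute=0, second=0, microsecond=0)
        buckets[hour].append(value)
    return {hour: mean(values) for hour, values in sorted(buckets.items())}


def fraction_within(records, setpoint, margin):
    """Share of readings within +/- margin (e.g. 0.10) of the setpoint,
    counting only minutes when the fumigation system is on."""
    on_values = [value for _ts, value, fumigating in records if fumigating]
    if not on_values:
        return float("nan")
    hits = sum(1 for v in on_values if abs(v - setpoint) <= margin * setpoint)
    return hits / len(on_values)


# Made-up readings: hour 1 on target, hour 2 oscillating, hour 3 system off.
start = datetime(2023, 7, 1, 10, 0)
records = []
for minute in range(180):
    ts = start + timedelta(minutes=minute)
    if minute < 60:
        records.append((ts, 600, True))
    elif minute < 120:
        records.append((ts, 550 if minute % 2 == 0 else 700, True))
    else:
        records.append((ts, 410, False))

means = hourly_means(records)
within_10 = fraction_within(records, setpoint=600, margin=0.10)
within_20 = fraction_within(records, setpoint=600, margin=0.20)
print(means, within_10, within_20)
```

Here the third hour still contributes to the hourly means but is excluded from the target percentages, mirroring the "when the fumigation system is turned on" condition in the description.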

keywords: SoyFACE; agriculture; agricultural; climate; climate change; atmosphere; atmospheric change; CO2; carbon dioxide; O3; ozone; soybean; fumigation; treatment

published: 2018-08-06

Comparison of data extraction on 6 clinical trial papers, extraction by RobotReviewer, by 3 novice data extractors, and from a published Cochrane review.

Hoang, Linh; Cao, Linh; Guan, Yingjun; Cheng, Yi-Yun; Schneider, Jodi (2018)

This annotation study compared RobotReviewer's data extraction to that of three novice data extractors, using six included articles synthesized in one Cochrane review: Bailey E, Worthington HV, van Wijk A, Yates JM, Coulthard P, Afzal Z. Ibuprofen and/or paracetamol (acetaminophen) for pain relief after surgical removal of lower wisdom teeth. Cochrane Database Syst Rev. 2013; CD004624; doi:10.1002/14651858.CD004624.pub2 The goal was to assess the relative advantage of RobotReviewer's data extraction with respect to quality.

keywords: RobotReviewer; annotation; information extraction; data extraction; systematic review automation; systematic reviewing

published: 2021-02-24

Southeastern South America Soil Moisture Alteration Experiment Using CESM2

Bieri, Carolina A.; Dominguez, Francina (2021)

This dataset contains model output from the Community Earth System Model, Version 2 (CESM2; Danabasoglu et al. 2020). These data were used for analysis in "Impacts of Large-Scale Soil Moisture Anomalies in Southeastern South America", published in the Journal of Hydrometeorology (DOI: 10.1175/JHM-D-20-0116.1). See this publication for details of the model simulations that created these data.
Four NetCDF (.nc) files are included in this dataset. Two files correspond to the control simulation (FHIST_SP_control) and two files correspond to a simulation with a dry soil moisture anomaly imposed in southeastern South America (FHIST_SP_dry; see the publication mentioned in the preceding paragraph for details on the spatial extent of the imposed anomaly). For each simulation, one file corresponds to output from the atmospheric model (file names with "cam") of CESM2 and the other to the land model (file names with "clm2"). These files are raw CESM output concatenated into a single file for each simulation. All files include data from 1979-01-02 to 2003-12-31 at a daily resolution. The spatial resolution of all files is about 1 degree longitude x 1 degree latitude. Variables included in these files are listed or linked below.
Variables in atmosphere model output:
- Vertical velocity (omega)
- Convective precipitation
- Large-scale precipitation
- Surface pressure
- Specific humidity
- Temperature (atmospheric profile)
- Reference temperature (temp. at reference height, 2 meters in this case)
- Zonal wind
- Meridional wind
- Geopotential height
Variables in land model output: See https://www.cesm.ucar.edu/models/cesm1.2/clm/models/lnd/clm/doc/UsersGuide/history_fields_table_40.xhtml
Note that not all of the variables listed at the above link are included in the land model output files in this dataset.
This material is based upon work supported by the National Science Foundation under Grant No. 1454089. We acknowledge high-performance computing support from Cheyenne (doi:10.5065/D6RX99HX) provided by NCAR's Computational and Information Systems Laboratory, sponsored by the National Science Foundation. The CESM project is supported primarily by the National Science Foundation. We thank all the scientists, software engineers, and administrators who contributed to the development of CESM2.
References
Danabasoglu, G., and Coauthors, 2020: The Community Earth System Model Version 2 (CESM2). Journal of Advances in Modeling Earth Systems, 12, e2019MS001916, https://doi.org/10.1029/2019MS001916.
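The naming convention described for the four NetCDF files (simulation name plus "cam" or "clm2") can be turned into a small lookup. The sketch below only inspects file names; the example names are hypothetical and merely follow the convention stated in the description, so the actual files in the dataset may be named differently.

```python
# Substrings taken from the dataset description.
SIMULATIONS = {
    "FHIST_SP_control": "control",
    "FHIST_SP_dry": "dry soil moisture anomaly",
}
COMPONENTS = {"cam": "atmosphere", "clm2": "land"}


def describe(filename):
    """Return (simulation, model component) inferred from a file name."""
    simulation = next(
        (label for key, label in SIMULATIONS.items() if key in filename),
        "unknown",
    )
    component = next(
        (label for key, label in COMPONENTS.items() if key in filename),
        "unknown",
    )
    return simulation, component


# Hypothetical file names, for illustration only.
for name in ["FHIST_SP_control.cam.nc", "FHIST_SP_dry.clm2.nc"]:
    print(name, "->", describe(name))
```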

keywords: Climate modeling; atmospheric science; hydrometeorology; hydroclimatology; soil moisture; land-atmosphere interactions

published: 2020-07-15

Data from: Supertree-like methods for genome-scale species tree estimation

Molloy, Erin K. (2020)

This repository includes scripts and datasets for Chapter 6 of my PhD dissertation, "Supertree-like methods for genome-scale species tree estimation," that had not been published previously. This chapter is based on the article: Molloy, E.K. and Warnow, T. "FastMulRFS: Fast and accurate species tree estimation under generic gene duplication and loss models." Bioinformatics, In press. https://doi.org/10.1093/bioinformatics/btaa444. The results presented in my PhD dissertation differ from those in the Bioinformatics article, because I re-estimated species trees using FastMulRFS and MulRF on the same datasets in the original repository (https://doi.org/10.13012/B2IDB-5721322_V1). To re-estimate species trees, (1) a seed was specified when running MulRF, and (2) a different script (specifically preprocess_multrees_v3.py from https://github.com/ekmolloy/fastmulrfs/releases/tag/v1.2.0) was used for preprocessing gene trees (which were then given as input to MulRF and FastMulRFS). Note that this preprocessing script is a re-implementation of the original algorithm for improved speed (a bug fix was also implemented). Finally, it was brought to my attention that the simulation in the Bioinformatics article differs from prior studies, because I scaled the species tree by 10 generations per year (instead of 0.9 years per generation, which is ~1.1 generations per year). I re-simulated datasets (true-trees-with-one-gen-per-year-psize-10000000.tar.gz and true-trees-with-one-gen-per-year-psize-50000000.tar.gz) using 0.9 years per generation to quantify the impact of this parameter change (see my PhD dissertation or the supplementary materials of the Bioinformatics article for discussion).
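The unit conversion behind the parameter change above is simple arithmetic, shown here as a short Python sketch. It only illustrates how the two scalings differ; the function names and the one-million-year branch are hypothetical, and the sketch says nothing about how the simulation software applies the parameter internally.

```python
def generations_per_year(years_per_generation):
    """Convert a generation time (years per generation) to generations per year."""
    return 1.0 / years_per_generation


def branch_in_generations(branch_years, gens_per_year):
    """Length of a branch in generations for a given scaling."""
    return branch_years * gens_per_year


prior_rate = generations_per_year(0.9)   # about 1.1 generations per year
article_rate = 10.0                      # scaling used in the Bioinformatics article

branch_years = 1_000_000                 # hypothetical branch length
ratio = (branch_in_generations(branch_years, article_rate)
         / branch_in_generations(branch_years, prior_rate))
print(round(prior_rate, 2), round(ratio, 2))
```

Under these assumptions, every branch is about nine times longer in generations with 10 generations per year than with 0.9 years per generation.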

keywords: Species tree estimation; gene duplication and loss; statistical consistency; MulRF, FastRFS

published: 2020-10-14

Multiple stem and environmental variables dataset

Dalling, James W.; Heineman, Katherine D. (2020)

Data on permanent plots at Fortuna and the Panama Canal Watershed, Republic of Panama, containing counts and percent of trees with one or more multiple stems >10 cm diameter, with and without palms. Accompanying environmental data include elevation, precipitation, soil type and soil chemical variables (pH, total N, NO3, NH4, resin P, Mehlich Ca, K and Mg).

keywords: multiple stems; resprouting; Panama Canal Watershed; Fortuna Forest Reserve

published: 2025-04-14

New York art gallery exhibition reviews and catalogs analyzed by race and gender

Mathews, Emilee (2025)

This dataset builds on an existing dataset which captures the demographics of artists represented by top tier galleries in the 2016–2017 New York art season (Case-Leal, 2017, https://web.archive.org/web/20170617002654/http://www.havenforthedispossessed.org/) with a census of reviews and catalogs about those exhibitions to assess proportionality of media coverage across race and gender. The readme file explains variables, collection, relationship between the datasets, and an example of how the Case-Leal dataset was transformed. The ArticleDataset.csv provides all articles with citation information as well as artist, artistic identity characteristic, and gallery. The ExhibitionCatalog.csv provides exhibition catalog citation information for each identified artist.
New in this V2:
- In V1, ArticleDataset.csv had both data on the articles published and all of the exhibitions, which was misleading. In V2 I separated these so that ArticleDataset only has articles, and AllSoloShows has all shows, including those that had no articles written about them in the publications reviewed.
- Upon closer review I noticed approximately 10 out of the 133 articles had incorrect information in variable "Publication content type: art or general" and/or "Publication Carrier type: web or library?" so I updated V2.
- Upon closer review I noticed there were 3 instances of artists who had two solo shows apiece: in addition to Meleko Mokgosi and Carrie Mae Weems, which I had already noted in V1, there was also Roxy Paine. I had not noticed this because only one of Paine's two shows had been written about. This brings the total number of shows to 117 (which was 116 in V1).
- Upon closer review I removed one row from ExhibitionCatalogs.csv, as the item I had listed did not meet the parameters.

keywords: diversity and inclusion; diversity audit; contemporary art; art exhibitions; art exhibition reviews; exhibition catalogs; magazines; newspapers; demographics

published: 2025-10-09

Data for Soil Fertility Management for Sustainable Miscanthus × giganteus Production: Increased Tiller Weight from Nitrogen Management Explains Yield Gains in Aged Miscanthus

Namoi, Nictor; Jang, Chunhwa; Voigt, Thomas; Lee, DoKyoung (2025)

Aging-related yield decline in Miscanthus × giganteus (miscanthus) remains a major constraint to sustainable biomass production. This study evaluated how nitrogen (N) management and soil fertility influence yield-component traits and productivity in aging miscanthus. Trials were conducted at two sites established in 2008 at the University of Illinois Energy Farm, Urbana, IL. (i) The Sun Grant trial received 0, 60, and 120 kg N ha−1 annually until 2015. Starting in 2021, half of each plot received 60 or 120 kg N ha−1, resulting in six legacy-contemporary treatments: 0N–0N, 0N–120N, 60N–0N, 60N–60N, 120N–0N, 120N–120N. (ii) The Energy Farm trial remained unfertilized until 2014, when one half of each plot received 56 kg N ha−1, forming two treatments: 0N–0N, 0N–56N. Sun Grant trial results showed N fertilization increased tiller density (tillers m−2) and tiller weight (g tiller−1) in juvenile to early-mature miscanthus (2011–2015). After N withdrawal, both traits declined (20 % and 40 %), though legacy effects persisted in tiller weight in the aging stands (2020–2023). Contemporary N had little effect on tiller density but increased tiller weight by 34 %–77 %, resulting in 23 %–106 % higher machine-harvested biomass yield in 0N–120N, 60N–60N, and 120N–120N plots. At the Energy Farm trial, 0N–56N plots yielded 59 %–108 % more biomass than 0N–0N. Soil total N increased (Sun Grant: 47 % by 2020; Energy Farm: 58 % by 2023), while Mehlich-3 P (42 %–44 %) and K (21 %–46 %) declined. These findings identify tiller weight as a key determinant of biomass yield in aging miscanthus and highlight the need for P and K management for long-term productivity.

keywords: miscanthus; nitrogen; soil

published: 2023-09-21

The Inclusion Network of 27 Review Articles Published between 2013-2018 Investigating the Relationship Between Physical Activity and Depressive Symptoms

Clarke, Caitlin; Lischwe Mueller, Natalie; Joshi, Manasi Ballal; Fu, Yuanxi; Schneider, Jodi (2023)

The relationship between physical activity and mental health, especially depression, is one of the most studied topics in the field of exercise science and kinesiology. Although there is strong consensus that regular physical activity improves mental health and reduces depressive symptoms, some debate the mechanisms involved in this relationship as well as the limitations and definitions used in such studies. Meta-analyses and systematic reviews continue to examine the strength of the association between physical activity and depressive symptoms for the purpose of improving exercise prescription as treatment or combined treatment for depression. This dataset covers 27 review articles (either systematic review, meta-analysis, or both) and 365 primary study articles addressing the relationship between physical activity and depressive symptoms. Primary study articles are manually extracted from the review articles. We used a custom-made workflow (Fu, Yuanxi. (2022). Scopus author info tool (1.0.1) [Python]. https://github.com/infoqualitylab/Scopus_author_info_collection) that uses the Scopus API and manual work to extract and disambiguate authorship information for the 392 reports. The author information file (author_list.csv) is the product of this workflow and can be used to compute the co-author network of the 392 articles. This dataset can be used to construct the inclusion network and the co-author network of the 27 review articles and 365 primary study articles. A primary study article is "included" in a review article if it is considered in the review article's evidence synthesis. Each included primary study article is cited in the review article, but not all references cited in a review article are included in the evidence synthesis or are primary study articles. The inclusion network is a bipartite network with two types of nodes: one type represents review articles, and the other represents primary study articles. In an inclusion network, if a review article includes a primary study article, there is a directed edge from the review article node to the primary study article node. The attribute file (article_list.csv) includes attributes of the 392 articles, and the edge list file (inclusion_net_edges.csv) contains the edge list of the inclusion network. Collectively, this dataset reflects the evidence production and use patterns within the exercise science and kinesiology scientific community, investigating the relationship between physical activity and depressive symptoms.
FILE FORMATS
1. article_list.csv - Unicode CSV
2. author_list.csv - Unicode CSV
3. Chinese_author_name_reference.csv - Unicode CSV
4. inclusion_net_edges.csv - Unicode CSV
5. review_article_details.csv - Unicode CSV
6. supplementary_reference_list.pdf - PDF
7. README.txt - text file
8. systematic_review_inclusion_criteria.csv - Unicode CSV
UPDATES IN THIS VERSION COMPARED TO V3 (Clarke, Caitlin; Lischwe Mueller, Natalie; Joshi, Manasi Ballal; Fu, Yuanxi; Schneider, Jodi (2023): The Inclusion Network of 27 Review Articles Published between 2013-2018 Investigating the Relationship Between Physical Activity and Depressive Symptoms. University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-4614455_V3)
- We added a new file systematic_review_inclusion_criteria.csv.
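The bipartite inclusion network described above (directed edges from review articles to the primary studies they include) can be built from an edge list with the Python standard library. This is a minimal sketch under stated assumptions: the column names `review_id` and `study_id` and the sample rows are hypothetical, so the real header of inclusion_net_edges.csv should be checked in the README before use.

```python
import csv
import io
from collections import defaultdict


def build_inclusion_network(edge_file, source_col="review_id", target_col="study_id"):
    """Read a directed edge list (review -> included primary study).

    Column names are placeholders for illustration, not the dataset's own.
    """
    includes = defaultdict(set)      # review article -> primary studies it includes
    included_by = defaultdict(set)   # primary study -> reviews that include it
    for row in csv.DictReader(edge_file):
        includes[row[source_col]].add(row[target_col])
        included_by[row[target_col]].add(row[source_col])
    return includes, included_by


# Made-up edges standing in for inclusion_net_edges.csv.
sample = io.StringIO(
    "review_id,study_id\n"
    "R1,S1\n"
    "R1,S2\n"
    "R2,S2\n"
)
includes, included_by = build_inclusion_network(sample)

# Primary studies included by more than one review.
shared = sorted(s for s, reviews in included_by.items() if len(reviews) > 1)
print(dict(includes), shared)
```

Keeping both directions of the mapping makes it easy to ask either question: which studies a review synthesizes, or how many reviews draw on the same study.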

keywords: systematic reviews; meta-analyses; evidence synthesis; network visualization; tertiary studies; physical activity; depressive symptoms; exercise; review articles

published: 2024-08-12

Data for Stable isotopes and diet metabarcoding reveal trophic overlap between native and invasive Banded Killifish (Fundulus diaphanus) subspecies

Hartman, Jordan H; Davis, Mark A; Iacaruso, Nicholas J; Tiemann, Jeremy S; Larson, Eric R (2024)

Data associated with the manuscript "Stable isotopes and diet metabarcoding reveal trophic overlap between native and invasive Banded Killifish (Fundulus diaphanus) subspecies" by Jordan H. Hartman, Mark A. Davis, Nicholas J. Iacaruso, Jeremy S. Tiemann, and Eric R. Larson. For this project, we sampled six locations in Michigan and Illinois for Eastern and Western Banded Killifish and primary consumers. Using stable isotope analysis, we found that Eastern Banded Killifish had higher variance in littoral dependence and trophic position than Western Banded Killifish, but both stable isotope and gut content metabarcoding analyses revealed an overlap in the diet composition and trophic position between the subspecies. This dataset provides the sampling locations, accession numbers for gut content metabarcoding data from the National Center for Biotechnology Information Sequence Read Archive, the assignment of each family used in the gut content metabarcoding analysis as littoral, pelagic, terrestrial, or parasite, and the raw stable isotope data from University of California Davis.

keywords: non-game fish; invasive species; imperiled species; stable isotope analysis; gut content metabarcoding