Illinois Data Bank Dataset Search Results
published:
2021-04-19
Xia, Yushu; Wander, Michelle
(2021)
Dataset compiled by Yushu Xia and Michelle Wander for the Soil Health Institute.
Data were recovered from peer-reviewed literature reporting results for three soil quality indicators (SQIs) (β-glucosidase (BG), fluorescein diacetate (FDA) hydrolysis, and permanganate oxidizable carbon (POXC)) in terms of their relative response to management where soils under grassland cover, no-tillage, cover crops, residue return and organic amendments were compared to conventionally managed controls. Peer-reviewed articles published between January 1990 and May 2018 were searched using the Thomson Reuters Web of Science database (Thomson Reuters, Philadelphia, Pennsylvania) and Google Scholar to identify studies reporting results for: “β-glucosidase”, “permanganate oxidizable carbon”, “active carbon”, “readily oxidizable carbon”, and “fluorescein diacetate hydrolysis”, together with one or more of the following: “management practice”, “tillage”, “cover crop”, “residue”, “organic fertilizer”, or “manure”. Records were tabulated to compare SQI abundance in soil maintained under a control and a soil-aggrading practice, with the intent to contribute to SQI databases that will support development of interpretive frameworks and/or algorithms, including pedo-transfer functions relating indicator abundance to management practices and site-specific factors.
Meta-data include the following key descriptor variables and covariates useful for development of scoring functions: 1) identifying factors for the study site (location, year of initiation of study and year in which data was reported), 2) soil textural class, pH, and SOC, 3) depth and timing of soil sampling, 4) analytical methods for SQI quantification, 5) units used in published works (i.e. equivalent mass, concentration), 6) SQI abundances, and 7) statistical significance of difference comparisons.
*Note: Blank values in tables are considered unreported data.
keywords:
Soil health promoting practices; Soil quality indicators; β-glucosidase; fluorescein diacetate hydrolysis; Permanganate oxidizable carbon; Greenhouse gas emissions; Scoring curves; Soil Management Assessment Framework
published:
2021-10-04
Wang, Justin; Curtis, Jeffrey H; Riemer, Nicole; West, Matthew
(2021)
This dataset contains all the necessary information to recreate the study presented in the paper entitled "Learning coagulation processes with combinatorially-invariant neural networks". This consists of (1) the aggregated output files used for machine learning, (2) the machine learning codes used to learn the presented models, (3) the PartMC model source code that was used to generate the simulation data and (4) the Python scripts used to construct the scenario library for training and testing simulations. These data were used to investigate a method (combinatorially-invariant neural network) for learning the aerosol process of coagulation. The data may also be useful for applying other methods.
keywords:
Machine learning; Atmospheric chemistry; Particle-resolved modeling; Coagulation; Atmospheric Science
published:
2025-05-10
Bakken, George; O'Keefe, Joy
(2025)
This dataset provides instructions for procedures to use heat transfer analyses to estimate thermal conditions in artificial roosts for bats. The dataset contains scripts to employ in the program GNU Octave, example meteorology data, and example text files specifying roost dimensions and material properties.
keywords:
Bat box; design; heat storage; heat transfer analysis; insulation; temperature
published:
2018-05-06
Sukenik, Shahar; Salam, Mohammed; Wang, Yuhan; Gruebele, Martin
(2018)
This deposit contains all raw data and analysis from the paper "In-cell titration of small solutes controls protein stability and aggregation". Data is collected into several types:
1) analysis*.tar.gz are the analysis scripts and the resulting data for each cell. The numbers correspond to the numbers shown in Fig.S1. (in publication)
2) scripts.tar.gz contains helper scripts to create the dataset in bash format.
3) input.tar.gz contains headers and other information that is fed into bash scripts to create the dataset.
4) All rawData*.tar.gz are tarballs of the data of cells in different solutes in .mat files readable by matlab, as follows:
- Each experiment included in the publication is represented by two matlab files: (1) a calibration jump under amber illumination (_calib.mat suffix) (2) a full jump under blue illumination (FRET data)
- Each file contains the following fields:
coordleft - coordinates of cropped and aligned acceptor channel on the original image
coordright - coordinates of cropped and aligned donor channel on the original image
dataleft - a 3d 12-bit integer matrix containing acceptor channel fluorescence for each pixel and time step. Not available in _calib files
dataright - a 3d 12-bit integer matrix containing donor channel fluorescence for each pixel and time step. This will be mCherry in _calib files and AcGFP in data files.
frame1 - original image size
imgstd - cropped dimensions
numFrames - number of frames in dataleft and dataright
videos - a structure file containing camera data. Specifically, videos.TimeStamp includes the time from each frame.
keywords:
Live cell; FRET microscopy; osmotic challenge; intracellular titrations; protein dynamics
published:
2025-09-26
Dong, Hongxu; Clark, Lindsay; Jin, Xiaoli; Anzoua, Kossonou; Bagmet, Larisa; Chebukin, Pavel; Dzyubenko, Elena; Dzyubenko, Nicolay; Ghimire, Bimal Kumar; Heo, Kweon; Johnson, Douglas A.; Nagano, Hironori; Sabitov, Andrey; Peng, Junhua; Yamada, Toshihiko; Yoo, Ji Hye; Yu, Chang Yeon; Zhao, Hua; Long, Stephen P.; Sacks, Erik
(2025)
Miscanthus is a close relative of Saccharum and a potentially valuable genetic resource for improving sugarcane. Differences in flowering time within and between miscanthus and Saccharum hinder intra- and interspecific hybridizations. A series of greenhouse experiments were conducted over three years to determine how to synchronize flowering time of Saccharum and miscanthus genotypes. We found that day length was an important factor influencing when miscanthus and Saccharum flowered. Sugarcane could be induced to flower in a central Illinois greenhouse using supplemental lighting to reduce the rate at which days shortened during the autumn and winter to 1 min d-1, which allowed us to synchronize the flowering of some sugarcane genotypes with Miscanthus genotypes primarily from low latitudes. In a complementary growth chamber experiment, we evaluated 33 miscanthus genotypes, including 28 M. sinensis, 2 M. floridulus, and 3 M. ×giganteus collected from 20.9° S to 44.9° N for response to three day lengths (10 h, 12.5 h, and 15 h). High latitude-adapted M. sinensis flowered mainly under 15 h days, but unexpectedly, short days resulted in short, stocky plants that did not flower; in some cases, flag leaves developed under short days but heading did not occur. In contrast, for M. sinensis and M. floridulus from low latitudes, shorter day lengths typically resulted in earlier flowering, and for some low latitude genotypes, 15 h days resulted in no flowering. However, the highest ratio of reproductive shoots to total number of culms was typically observed for 12.5 h or 15 h days. Latitude of origin was significantly associated with culm length, and the shorter the days, the stronger the relationship. Nearly all entries achieved maximal culm length under the 15 h treatment, but the nearer to the equator an accession originated, the smaller the difference in culm length between the short-day treatments and the 15 h day treatment.
Under short days, short culms for high-latitude accessions were achieved by different physiological mechanisms for M. sinensis genetic groups from the mainland in comparison to those from Japan; for mainland accessions, the mechanism was reduced internode length, whereas for Japanese accessions the phyllochron under short days was greater than under long days. Thus, for M. sinensis, short days typically hastened floral induction, consistent with the expectations for a facultative short-day plant. However, for high latitude accessions of M. sinensis, days less than 12.5 h also signaled that plants should prepare for winter by producing many short culms with limited elongation and development; moreover, this response was also epistatic to flowering. Thus, to flower M. sinensis that originates from high latitudes synchronously with sugarcane, the former needs day lengths >12.5 h (perhaps as high as 15 h), whereas the latter needs day lengths <12.5 h.
keywords:
Feedstock Production;Phenomics
published:
2022-08-08
Shen, Chengze; Liu, Baqiao; Williams, Kelly P.; Warnow, Tandy
(2022)
This upload contains all datasets used in Experiment 2 of the EMMA paper (appeared in WABI 2023): Shen, Chengze, Baqiao Liu, Kelly P. Williams, and Tandy Warnow. "EMMA: A New Method for Computing Multiple Sequence Alignments given a Constraint Subset Alignment".
The zip file has the following structure (presented as an example):
salma_paper_datasets/
|_README.md
|_10aa/
|_crw/
|_homfam/
|_aat/
| |_...
|_...
|_het/
|_5000M2-het/
| |_...
|_5000M3-het/
...
|_rec_res/
Generally, the structure can be viewed as:
[category]/[dataset]/[replicate]/[alignment files]
# Categories:
1. 10aa: There are 10 small biological protein datasets within the `10aa` directory, each with just one replicate.
2. crw: There are 5 selected CRW datasets, namely 5S.3, 5S.E, 5S.T, 16S.3, and 16S.T, each with one replicate. These are the cleaned versions from Shen et al. 2022 (MAGUS+eHMM).
3. homfam: There are the 10 largest Homfam datasets, each with one replicate.
4. het: There are three newly simulated nucleotide datasets from this study, 5000M2-het, 5000M3-het, and 5000M4-het, each with 10 replicates.
5. rec\_res: It contains the Rec and Res datasets. Detailed dataset generation can be found in the supplementary materials of the paper.
# Alignment files
There are at most 6 `.fasta` files in each sub-directory:
1. `all.unaln.fasta`: All unaligned sequences.
2. `all.aln.fasta`: Reference alignments of all sequences. If not all sequences have reference alignments, only the sequences that have will be included.
3. `all-queries.unaln.fasta`: All unaligned query sequences. Query sequences are sequences that do not have lengths within 25% of the median length (i.e., not full-length sequences).
4. `all-queries.aln.fasta`: Reference alignments of query sequences. If not all queries have reference alignments, only the sequences that have will be included.
5. `backbone.unaln.fasta`: All unaligned backbone sequences. Backbone sequences are sequences that have lengths within 25% of the median length (i.e., full-length sequences).
6. `backbone.aln.fasta`: Reference alignments of backbone sequences. If not all backbone sequences have reference alignments, only the sequences that have will be included.
>If all sequences are full-length sequences, then `all-queries.unaln.fasta` will be missing.
>If fewer than two query sequences have reference alignments, then `all-queries.aln.fasta` will be missing.
>If fewer than two backbone sequences have reference alignments, then `backbone.aln.fasta` will be missing.
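
The backbone/query rule described above can be sketched in Python. This is a minimal illustration, not the authors' own script: it assumes that "within 25% of the median length" means the absolute difference from the median is at most 0.25 times the median, and the function name `split_backbone_queries` and the toy sequences are hypothetical.

```python
from statistics import median

def split_backbone_queries(seqs, tolerance=0.25):
    """Split {name: sequence} into (backbone, queries) by length.

    Assumption: a sequence counts as full-length when its length
    differs from the median length by at most tolerance * median.
    """
    med = median(len(s) for s in seqs.values())
    backbone, queries = {}, {}
    for name, seq in seqs.items():
        if abs(len(seq) - med) <= tolerance * med:
            backbone[name] = seq
        else:
            queries[name] = seq
    return backbone, queries

# Toy unaligned sequences (invented, not taken from the dataset).
toy = {
    "s1": "ACGTACGTAC",   # length 10
    "s2": "ACGTACGTA",    # length 9
    "s3": "ACGTACGTACG",  # length 11
    "s4": "ACG",          # length 3, a fragment
}
backbone, queries = split_backbone_queries(toy)
```

With these toy inputs the median length is 9.5, so the three longer sequences land in the backbone and the 3-character fragment becomes the only query.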
# Additional file(s)
1. `350378genomes.txt`: this file contains all 350,378 bacterial and archaeal genome names that were used by Prodigal (Hyatt et al. 2010) to search for protein sequences.
keywords:
SALMA;MAFFT;alignment;eHMM;sequence length heterogeneity
published:
2025-08-14
Bao, Wencheng; Kontou, Eleftheria
(2025)
Data and code for the paper titled "Electric Vehicle Charging Stations at Risk from Hazardous Events and Power Outages: Analytics and Resilience Implications" published in Renewable and Sustainable Energy Reviews journal (https://doi.org/10.1016/j.rser.2025.116144).
keywords:
electric vehicles; hazardous events; charging infrastructure; power outages; resilience
published:
2025-11-25
Hyunbin, Kim; Kiseok, Kim; Roman, Makhnenko
(2025)
This dataset encompasses experimental results supporting the upcoming journal paper, "Hydro-mechanical-chemical behavior of sedimentary rock during CO2 injection". The dataset includes the measurements and analyses conducted under controlled laboratory conditions, capturing changes in poroviscoelastic properties and pore structure after CO2 treatment.
keywords:
Poroviscoelasticity; Carbonate mineral dissolution; Porosity evolution; Compaction; Shale; Opalinus Clay
published:
2022-12-21
Sherwood, Joshua; Tiemann, Jeremy; Stein, Jeffrey
(2022)
This dataset is associated with a larger manuscript published in 2022 in the Illinois Natural History Survey Bulletin that summarized the Fishes of Champaign County project from 2012-2015. With data spanning over 120 years, the Fishes of Champaign County is a comprehensive, long-term investigation into the changing fish communities of east-central Illinois. Surveys first occurred in Champaign County in the late 1880s (40 sites), with subsequent surveys in 1928–1929 (125 sites), 1959–1960 (143 sites), and 1987–1988 (141 sites). Between 2012 and 2015, we resampled 122 sites across Champaign County. The combined data from these five surveys have produced a unique perspective into not only the fish communities of the region, but also insight into in-stream habitat changes during the past 120 years.
The dataset is in Microsoft Access format, with five data tables, one for each time period surveyed. Field names are self-explanatory, with some variation in the data types collected during different surveys, as follows. Forbes & Richardson (1880s) collected presence/absence only. Thompson & Hunt (1928-1929) collected abundance only. Larimore & Smith (1959-1960) collected length and weight for some samples, but only presence/absence at others. In some cases, fish of the same species were weighed in bulk, with the fields “LOW” and “HIGH” indicating the lower and upper limits of total length in the batch, and weight indicating the gross weight of all fish in the batch. Larimore and Bayley (1987-1988) collected length and weight for all surveys, and Sherwood and Stein (2012-2015) collected length and weight for all surveys except where extremely abundant single species were subsampled. Lengths are reported in millimeters, and weights in grams. Two lookup tables provide information about the species codes used in the data tables and about sample site locations and notes.
keywords:
fishes of Champaign County; streams; anthropogenic disturbances; long-term dataset
published:
2024-09-19
Klimasmith, Isaac; Kent, Angela
(2024)
The use of potentially beneficial microorganisms in agriculture (microbial inoculants) has rapidly accelerated in recent years. For microbial inoculants to be effective as agricultural tools, these organisms must be able to survive and persist in novel environments while not destabilizing the resident community or spilling over into adjacent natural ecosystems. Here, we adapt a macroecological propagule pressure model to a microbial scale and present an experimental approach for testing the role of propagule pressure in microbial inoculant introductions. We experimentally determined the risk-release relationship for an IAA-expressing Pseudomonas simiae inoculant in a model monocot system. We then used this relationship to simulate establishment outcomes under a range of application frequencies (propagule number) and inoculant concentrations (propagule size). Our simulations show that repeated inoculant applications may increase establishment, even when increased inoculant concentration does not alter establishment probabilities.
The dataset filed here includes the experimental data file and an RMarkdown file that includes all the code used in both the modeling and the analysis.
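
As a rough illustration of why propagule number can matter even when propagule size does not, the following Python sketch treats each application as an independent establishment attempt. This independence assumption, the function name, and the probability values are all hypothetical; the dataset's RMarkdown file contains the authors' actual model, which may differ.

```python
def establishment_probability(p_single, n_applications):
    """Probability that at least one of n independent applications establishes.

    Assumption: every application succeeds with the same probability
    p_single, independently of the others.
    """
    if not 0.0 <= p_single <= 1.0:
        raise ValueError("p_single must be between 0 and 1")
    return 1.0 - (1.0 - p_single) ** n_applications

# Invented per-application probability, held flat across concentrations.
p = 0.2
by_frequency = {n: establishment_probability(p, n) for n in (1, 2, 3)}
```

Under these assumptions the overall probability rises with each added application even though the per-application probability stays fixed.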
keywords:
microbial inoculants; invasion ecology; propagule pressure; agriculture; modeling
published:
2024-06-17
Stuchiner, Emily; Jernigan, Wyatt; Zhang, Ziliang; Eddy, William; DeLucia, Evan; Yang, Wendy
(2024)
Data includes carbon mineralization rates, potential denitrification rates, net nitrous oxide fluxes, and soil chemical properties from a laboratory incubation of soil samples collected from 20 locations across an Illinois maize field.
keywords:
denitrification; nitrous oxide; dissolved organic carbon; maize
published:
2025-05-02
This dataset contains the first-generation (1st-gen) and second-generation (2nd-gen) citation relationships to a set of focal papers. The 1st-gen citation relationships are the instances of one paper citing a focal paper. These citing papers are called "1st-gen citations." The 2nd-gen citation relationships are the instances that a paper cites a 1st-gen citation. The citing paper in the 2nd-gen citation relationship is a second-generation (2nd-gen) citation. When a 2nd-gen citation is also a 1st-gen citation, it creates a transitive closure with the focal paper.
Each focal paper has an abbreviation, which can be found below. The 1st-gen and 2nd-gen citation relationships were extracted from the Curated Open Citation Dataset (Korobskiy & Chacko, 2023), which is derived from a copy of COCI, the OpenCitations Index of Crossref Open DOI-to-DOI Citations, downloaded on May 6, 2023. Scripts used to collect this dataset can be found at https://github.com/yuanxiesa/transitive_closure_study. Each focal paper currently has two files: {abbreviation}_1st.csv contains the 1st-gen citation relationships; {abbreviation}_2nd.csv contains the 2nd-gen citation relationships.
Focal paper abbreviation == "louvain": Blondel, V. D., Guillaume, J.-L., Lambiotte, R., & Lefebvre, E. (2008). Fast unfolding of communities in large networks. Journal of Statistical Mechanics: Theory and Experiment, 2008(10), P10008. https://doi.org/10.1088/1742-5468/2008/10/P10008
Focal paper abbreviation == "lp": Raghavan, U. N., Albert, R., & Kumara, S. (2007). Near linear time algorithm to detect community structures in large-scale networks. Physical Review E, 76(3), 036106. https://doi.org/10.1103/PhysRevE.76.036106
Focal paper abbreviation == "gn": Newman, M. E. J., & Girvan, M. (2004). Finding and evaluating community structure in networks. Physical Review E, 69(2), 026113. https://doi.org/10.1103/PhysRevE.69.026113
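
The transitive-closure definition above can be sketched in Python. This is a minimal illustration that assumes each relationship is held as a (citing, cited) pair; the actual column layout of the CSV files is not described here, and the identifiers below are invented.

```python
def transitive_closures(first_gen, second_gen):
    """Return 2nd-gen relationships whose citing paper is also a 1st-gen citation.

    first_gen: iterable of (citing, focal) pairs.
    second_gen: iterable of (citing, cited_first_gen) pairs.
    """
    first_gen_citers = {citing for citing, _focal in first_gen}
    return [
        (citing, cited)
        for citing, cited in second_gen
        if citing in first_gen_citers
    ]

# Toy identifiers (hypothetical); "F" stands for the focal paper.
first_gen = [("A", "F"), ("B", "F")]
second_gen = [("B", "A"), ("C", "A"), ("D", "B")]
closures = transitive_closures(first_gen, second_gen)
```

Here paper B cites both the focal paper and the 1st-gen citation A, so (B, A) is the only relationship that closes a triangle with the focal paper.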
keywords:
transitive closure; citations; community detection algorithms; OpenCitations; method papers
published:
2025-10-30
Dwivedi, Nidhi; Yamamoto, Senri; Zhao, Yunjun; Hou, Guichuan; Bowling, Forrest; Tobimatsu, Yuki; Liu, Chang-Jun
(2025)
Grass lignocelluloses feature complex compositions and structures. In addition to the presence of conventional lignin units from monolignols, acylated monolignols and flavonoid tricin also incorporate into lignin polymer; moreover, hydroxycinnamates, particularly ferulate, cross-link arabinoxylan chains with each other and/or with lignin polymers. These structural complexities make grass lignocellulosics difficult to optimize for effective agro-industrial applications. In the present study, we assess the applications of two engineered monolignol 4-O-methyltransferases (MOMTs) in modifying rice lignocellulosic properties. Two MOMTs confer regiospecific para-methylation of monolignols but with different catalytic preferences. The expression of MOMTs in rice resulted in differential but drastic suppression of lignin deposition, showing more than 50% decrease in guaiacyl lignin and up to a 90% reduction in syringyl lignin in transgenic lines. Moreover, the levels of arabinoxylan-bound ferulate were reduced by up to 50%, and the levels of tricin in lignin fraction were also substantially reduced. Concomitantly, up to 11 μmol/g of the methanol-extractable 4-O-methylated ferulic acid and 5–7 μmol/g 4-O-methylated sinapic acid were accumulated in MOMT transgenic lines. Both MOMTs in vitro displayed discernible substrate promiscuity towards a range of phenolics in addition to the dominant substrate monolignols, which partially explains their broad effects on grass phenolic biosynthesis. The cell wall structural and compositional changes resulted in up to 30% increase in saccharification yield of the de-starched rice straw biomass after dilute acid pretreatment. These results demonstrate an effective strategy to tailor complex grass cell walls to generate improved cellulosic feedstocks for the fermentable sugar-based production of biofuel and bio-chemicals.
keywords:
Feedstock Production;Biomass Analytics;Genome Engineering
published:
2025-09-25
Vu-Le, The-Anh; Park, Minhyuk; Chen, Ian; Warnow, Tandy
(2025)
Dataset for "Using Stochastic Block Models for Community Detection". This contains synthetic networks with ground-truth community structure generated using synthetic network generators (specifically, ABCD+o) based on real-world networks and computed clusterings on these real-world networks.
Note:
* networks.zip contains the synthetic networks
published:
2025-11-17
Bayer, Hugo; Hassell Jr, James; Oleksiak, Cecily; Garcia, Gabriela; Hollis, Vaughan; Juliano, Vitor; Maren, Stephen
(2025)
Raw data from the article "Pharmacological stimulation of infralimbic cortex after fear conditioning facilitates subsequent fear extinction", published in Neuropsychopharmacology in 2024.
published:
2024-11-19
Salami, Malik Oyewale; McCumber, Corinne
(2024)
This project investigates retraction indexing agreement among data sources: Crossref, Retraction Watch, Scopus, and Web of Science. As of July 2024, this reassesses the April 2023 union list of Schneider et al. (2023): https://doi.org/10.55835/6441e5cae04dbe5586d06a5f. As of April 2023, over 1 in 5 DOIs had discrepancies in retraction indexing among the 49,924 DOIs indexed as retracted in at least one of Crossref, Retraction Watch, Scopus, and Web of Science (Schneider et al., 2023). Here, we determine what changed in 15 months.
Pipeline code to get the results files can be found in the GitHub repository
https://github.com/infoqualitylab/retraction-indexing-agreement in the iPython notebook 'MET-STI2024_Reassessment_of_retraction_indexing_agreement.ipynb'
Some files have been redacted to remove proprietary data, as noted in README.txt. Among our sources, data is openly available only for Crossref and Retraction Watch.
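
The idea of a union list and of indexing discrepancies can be sketched in Python. This is a simplified illustration, not the pipeline in the notebook named above: it assumes a DOI is discrepant whenever at least one source does not index it as retracted, and it ignores whether a source covers the DOI at all. The DOIs are invented placeholders.

```python
def build_union_list(indexed):
    """Map each DOI to the sorted list of sources that index it as retracted.

    indexed: {source_name: set of DOIs indexed as retracted}.
    """
    all_dois = set().union(*indexed.values())
    return {
        doi: sorted(source for source, dois in indexed.items() if doi in dois)
        for doi in sorted(all_dois)
    }

def find_discrepancies(indexed):
    """Return DOIs that are not indexed as retracted by every source."""
    union = build_union_list(indexed)
    return [doi for doi, sources in union.items() if len(sources) < len(indexed)]

# Placeholder DOIs (hypothetical) for the four sources named above.
indexed = {
    "Crossref": {"10.0000/a", "10.0000/b"},
    "Retraction Watch": {"10.0000/a", "10.0000/b", "10.0000/c"},
    "Scopus": {"10.0000/a"},
    "Web of Science": {"10.0000/a", "10.0000/c"},
}
union = build_union_list(indexed)
flagged = find_discrepancies(indexed)
```

In this toy input only the first DOI is indexed as retracted everywhere, so the other two are flagged.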
FILE FORMATS:
1) unionlist_completed_2023-09-03-crws-ressess.csv - UTF-8 CSV file
2) unionlist_completed-ria_2024-07-09-crws-ressess.csv - UTF-8 CSV file
3) unionlist-15months-period_sankey.png - Portable Network Graphics (PNG) file
4) unionlist_ria_proportion_comparison.png - Portable Network Graphics (PNG) file
5) README.txt - text file
FILE DESCRIPTION:
Description of the files can be found in README.txt
keywords:
retraction status; data quality; indexing; retraction indexing; metadata; meta-science; RISRS
published:
2025-10-10
Singh, Ramkrishna; Liu, Hui; Shanklin, John; Singh, Vijay
(2025)
Lipids accumulated in the vegetative tissues of cellulosic feedstocks can be a potential raw material for biodiesel and bioethanol production. In this work, bagasse of genetically engineered sorghum was subjected to liquid hot-water pretreatment at 170, 180, and 190 °C for different reaction times. Under the optimal pretreatment condition (170 °C, 20 min), the residue was enriched in glucan (57.39 ± 2.63 % w/w) and xylan (13.38 ± 0.49 % w/w). The total lipid content of the pretreated residue was 6.81% w/w, similar to that observed in untreated bagasse (6.30% w/w). Pretreatment improved the enzymatic digestibility of bagasse, allowing a recovery of 79% w/w and 86% w/w of glucose and xylose, respectively. The pretreatment and enzymatic saccharification resulted in a 2-fold increase in total lipid in enzymatic residue compared to the original bagasse. Thus, pretreatment and enzymatic hydrolysis enabled high sugar recovery while concentrating triglycerides and free fatty acids in the residue.
keywords:
Conversion;Feedstock Production;Feedstock Bioprocessing
published:
2018-11-21
Clark, Lindsay V.; Lipka, Alexander E.; Sacks, Erik J.
(2018)
This set of scripts accompanies the manuscript describing the R package polyRAD, which uses DNA sequence read depth to estimate allele dosage in diploids and polyploids. Using several high-confidence SNP datasets from various species, allelic read depth from a typical RAD-seq dataset was simulated, then genotypes were estimated with polyRAD and other software and compared to the true genotypes, yielding error estimates.
keywords:
R programming language; genotyping-by-sequencing (GBS); restriction site-associated DNA sequencing (RAD-seq); polyploidy; single nucleotide polymorphism (SNP); Bayesian genotype calling; simulation
published:
2018-12-20
Dong, Xiaoru; Xie, Jingyi; Linh, Hoang
(2018)
File Name: Inclusion_Criteria_Annotation.csv
Data Preparation: Xiaoru Dong
Date of Preparation: 2018-12-14
Data Contributions: Jingyi Xie, Xiaoru Dong, Linh Hoang
Data Source: Cochrane systematic reviews published up to January 3, 2018 by 52 different Cochrane groups in 8 Cochrane group networks.
Associated Manuscript authors: Xiaoru Dong, Jingyi Xie, Linh Hoang, and Jodi Schneider.
Associated Manuscript, Working title: Machine classification of inclusion criteria from Cochrane systematic reviews.
Description: The file contains lists of inclusion criteria of Cochrane Systematic Reviews and the manual annotation results. 5420 inclusion criteria were annotated, out of 7158 inclusion criteria available. Annotations are either "Only RCTs" or "Others". There are 2 columns in the file:
- "Inclusion Criteria": Content of inclusion criteria of Cochrane Systematic Reviews.
- "Only RCTs": Manual annotation results. An "x" means that the inclusion criteria are classified as "Only RCTs"; a blank means that they are classified as "Others".
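
The two-column layout described above can be read with Python's standard `csv` module. This is a minimal sketch that assumes the header row uses exactly the column names given in the description; the function name `read_labels` is hypothetical, and the sample rows are invented stand-ins for the real file.

```python
import csv
import io

def read_labels(handle):
    """Yield (criteria_text, label) pairs from the annotation CSV."""
    for row in csv.DictReader(handle):
        mark = (row.get("Only RCTs") or "").strip()
        # "x" marks "Only RCTs"; a blank cell means "Others".
        label = "Only RCTs" if mark == "x" else "Others"
        yield row["Inclusion Criteria"], label

# In-memory stand-in for Inclusion_Criteria_Annotation.csv (rows are invented).
sample = io.StringIO(
    "Inclusion Criteria,Only RCTs\n"
    "Randomised controlled trials only.,x\n"
    '"RCTs, quasi-RCTs and cohort studies.",\n'
)
labels = list(read_labels(sample))
```

To read the real file, an open file handle (for example `open(path, newline="", encoding="utf-8")`) could be passed in place of the `StringIO` object; the encoding is an assumption.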
Notes:
1. "RCT" stands for Randomized Controlled Trial, which, in definition, is "a work that reports on a clinical trial that involves at least one test treatment and one control treatment, concurrent enrollment and follow-up of the test- and control-treated groups, and in which the treatments to be administered are selected by a random process, such as the use of a random-numbers table." [Randomized Controlled Trial publication type definition from https://www.nlm.nih.gov/mesh/pubtypes.html].
2. To reproduce the data in this file, get the project code published on GitHub at: https://github.com/XiaoruDong/InclusionCriteria and run it following the instructions provided.
keywords:
Inclusion criteria, Randomized controlled trials, Machine learning, Systematic reviews
published:
2020-06-02
Xue, Qingquan; Dietrich, Christopher; Zhang, Yalin
(2020)
The text file contains the original data used in the phylogenetic analyses of Xue et al. (2020: Systematic Entomology, in press). The text file is marked up according to the standard NEXUS format commonly used by various phylogenetic analysis software packages. The file will be parsed automatically by a variety of programs that recognize NEXUS as a standard bioinformatics file format. The first six lines of the file identify the file as NEXUS, indicate that the file contains data for 89 taxa (species) and 2676 characters, indicate that the first 2590 characters are DNA sequence and the last 86 are morphological, that gaps inserted into the DNA sequence alignment and inapplicable morphological characters are indicated by a dash, and that missing data are indicated by a question mark. The file contains aligned nucleotide sequence data for 5 gene regions and 86 morphological characters. The positions of data partitions are indicated in the mrbayes block of commands for the phylogenetic program MrBayes at the end of the file (Subset1 = 16S gene; Subset2 = 28S gene; Subset3 = COI gene; Subset 4 = Histone H3 and H2A genes). The mrbayes block also contains instructions for MrBayes on various non-default settings for that program. These are explained in the original publication. Descriptions of the morphological characters and more details on the species and specimens included in the dataset are provided in the supplementary document included as a separate pdf, also available from the journal website. The original raw DNA sequence data are available from NCBI GenBank under the accession numbers indicated in the supplementary file.
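
The character layout stated above (2676 characters per taxon, the first 2590 DNA and the last 86 morphological) can be checked with a short Python sketch. It assumes every character occupies exactly one symbol in a matrix row, which would not hold if the file encodes polymorphic states in brackets; the function name and the toy row are hypothetical.

```python
N_CHARS = 2676  # total characters per taxon, from the description
N_DNA = 2590    # leading DNA sequence characters
N_MORPH = 86    # trailing morphological characters

def split_row(characters):
    """Split one taxon's character string into (dna, morphology)."""
    if len(characters) != N_CHARS:
        raise ValueError(f"expected {N_CHARS} characters, got {len(characters)}")
    return characters[:N_DNA], characters[N_DNA:]

# Toy row (invented): DNA symbols, then morphological states with
# "-" for gaps or inapplicable characters and "?" for missing data.
toy_row = "A" * N_DNA + "0" * 80 + "-" * 3 + "?" * 3
dna, morph = split_row(toy_row)
```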
keywords:
phylogeny; DNA sequence; morphology; Insecta; Hemiptera; Cicadellidae; leafhopper; evolution; 28S rDNA; 16S rDNA; histone H3; histone H2A; cytochrome oxidase I; Bayesian analysis
published:
2024-03-21
Becker, Maria; Han, Kanyao; Werthmann, Antonina; Rezapour, Rezvaneh; Lee, Haejin; Diesner, Jana; Witt, Andreas
(2024)
Impact assessment is an evolving area of research that aims at measuring and predicting the potential effects of projects or programs. Measuring the impact of scientific research is a vibrant subdomain, closely intertwined with impact assessment. A recurring obstacle pertains to the absence of an efficient framework which can facilitate the analysis of lengthy reports and text labeling. To address this issue, we propose a framework for automatically assessing the impact of scientific research projects by identifying pertinent sections in project reports that indicate the potential impacts. We leverage a mixed-method approach, combining manual annotations with supervised machine learning, to extract these passages from project reports. This is a repository to save datasets and codes related to this project.
Please read and cite the following paper if you would like to use the data:
Becker M., Han K., Werthmann A., Rezapour R., Lee H., Diesner J., and Witt A. (2024). Detecting Impact Relevant Sections in Scientific Research. The 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING).
This folder contains the following files:
evaluation_20220927.ods: Annotated German passages (Artificial Intelligence, Linguistics, and Music) - training data
annotated_data.big_set.corrected.txt: Annotated German passages (Mobility) - training data
incl_translation_all.csv: Annotated English passages (Artificial Intelligence, Linguistics, and Music) - training data
incl_translation_mobility.csv: Annotated German passages (Mobility) - training data
ttparagraph_addmob.txt: German corpus (unannotated passages)
model_result_extraction.csv: Extracted impact-relevant passages from the German corpus based on the model we trained
rf_model.joblib: The random forest model we trained to extract impact-relevant passages
Data processing codes can be found at: https://github.com/khan1792/texttransfer
keywords:
impact detection; project reports; annotation; mixed-methods; machine learning
published:
2025-06-30
Mori, Jameson; Skowron, Nicholas; Barr, Daniel; Johnson, Ben; Novakofski, Jan; Mateus-Pinilla, Nohra
(2025)
This dataset contains measurements of water loss as white-tailed deer (Odocoileus virginianus) retropharyngeal lymph nodes air-dried in a refrigerator for 31 days. Lymph node weights were recorded every 24 hours, as were the variables "firmness" and "surface wetness". "Firmness" is a categorical variable measuring how much the tissue deforms to the touch (soft, medium, or hard). "Surface wetness" is the amount of visible moisture on the outside of the lymph node (all, some, or none). Lymph nodes were weighed until their weights stabilized for 3 consecutive days at two decimal places (ex. 3.02, 3.02, 3.02) or until the weights fluctuated only by 0.01 (ex. 3.02, 3.03, 3.02). Lymph nodes were from northern Illinois white-tailed deer collected as part of the Illinois Department of Natural Resources' ongoing chronic wasting disease (CWD) management efforts.
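
The stopping rule for weighing can be sketched in Python. This is one possible reading, not the authors' procedure: it assumes both conditions can be checked on the three most recent daily weights, treating them as stable when they agree to within 0.01 at two decimal places. The function name `has_stabilized` is hypothetical.

```python
def has_stabilized(weights, window=3):
    """True when the last `window` weights agree to within 0.01 at two decimals."""
    if len(weights) < window:
        return False
    # Work in integer hundredths to avoid floating-point comparison issues.
    recent = [round(w * 100) for w in weights[-window:]]
    return max(recent) - min(recent) <= 1

# Example series from the description, preceded by an invented earlier reading.
steady = has_stabilized([3.10, 3.02, 3.02, 3.02])
fluctuating = has_stabilized([3.10, 3.02, 3.03, 3.02])
still_drying = has_stabilized([3.10, 3.05, 3.02])
```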
keywords:
cervid; lymph node; chronic wasting disease; cwd; diagnostic testing; dessication; drying; tissue
published:
2025-10-24
Choe, Kisurb; Jindra, Michael A.; Hubbard, Susan; Pfleger, Brian; Sweedler, Jonathan
(2025)
Creating controlled lipid unsaturation locations in oleochemicals can be a key to many bioengineered products. However, evaluating the effects of modifications to the acyl-ACP desaturase on lipid unsaturation is not currently amenable to high-throughput assays, limiting the scale of redesign efforts to <200 variants. Here, we report a rapid mass spectrometry (MS) assay for profiling the positions of double bonds on membrane lipids produced by Escherichia coli colonies after treatment with ozone gas. By MS measurement of the ozonolysis products of Δ6 and Δ8 isomers of membrane lipids from colonies expressing recombinant Thunbergia alata desaturase, we screened a randomly mutagenized library of the desaturase gene at 5 s per sample. Two variants with altered regiospecificity were isolated, indicated by an increase in 16:1 Δ8 proportion. We also demonstrated the ability of these desaturase variants to influence the membrane composition and fatty acid distribution of E. coli strains deficient in the native acyl-ACP desaturase gene, fabA. Finally, we used the fabA deficient chassis to concomitantly express a non-native acyl-ACP desaturase and a medium-chain thioesterase from Umbellularia californica, demonstrating production of only saturated free fatty acids.
keywords:
Conversion;Lipidomics;Mass Spectrometry
published:
2025-07-28
McCumber, Corinne; Salami, Malik Oyewale
(2025)
This project investigates retraction indexing agreement in PubMed between 2024-07-03 and 2025-05-09 in order to address an API limitation that resulted in 199 items being excluded from analysis in "Analyzing the consistency of retraction indexing". PubMed was queried on 2024-07-03 and on 2025-05-09 using the search “Retracted Publication[PT]”. PubMed is only able to return 10,000 items when queried via the E-Utilities API. When the pipeline was run 2024-07-03, the search between 2020 and 2024 returned 10,199 items, meaning that an expected 199 items indexed as retracted in PubMed were excluded. This dataset uses and compares information from PubMed as of 2025-05-09 to attempt to identify those 199 items.
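
The arithmetic behind the 199 excluded items, and one generic way to keep queries under a result cap, can be sketched in Python. This is not the approach taken by the dataset authors, who compared two query dates instead. The per-year counts are invented so that they sum to the 10,199 items mentioned above, and the function names are hypothetical.

```python
API_CAP = 10_000  # maximum items returned per query, as stated above

def excluded_items(total_hits, cap=API_CAP):
    """Number of hits that a single capped query cannot return."""
    return max(0, total_hits - cap)

def plan_windows(hits_per_year, cap=API_CAP):
    """Greedily group consecutive years so each window stays within the cap.

    Returns (first_year, last_year, hits) tuples. A single year that
    exceeds the cap on its own is still returned as one window.
    """
    windows, current, running = [], [], 0
    for year, hits in sorted(hits_per_year.items()):
        if current and running + hits > cap:
            windows.append((current[0], current[-1], running))
            current, running = [], 0
        current.append(year)
        running += hits
    if current:
        windows.append((current[0], current[-1], running))
    return windows

# Invented per-year counts; only their sum (10,199) comes from the text.
hypothetical_hits = {2020: 2500, 2021: 2600, 2022: 2400, 2023: 1900, 2024: 799}
missing = excluded_items(sum(hypothetical_hits.values()))
windows = plan_windows(hypothetical_hits)
```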
keywords:
retraction status; data quality; indexing; retraction indexing; metadata; meta-science; RISRS; PubMed
published:
2025-11-06
Deshavath, Narendra Naik; Woodruff, William; Eller, Fred; Susanto, Vionna; Yang, Cindy; Rao, Christopher V.; Singh, Vijay
(2025)
Microbial oils are a sustainable biomass-derived substitute for liquid fuels and vegetable oils. Oilcane, an engineered sugarcane with superior feedstock characteristics for biodiesel production, is a promising candidate for bioconversion. This study describes the processing of oilcane stems into juice and hydrothermally pretreated lignocellulosic hydrolysate and their valorization to ethanol and microbial oil using Saccharomyces cerevisiae and engineered Rhodosporidium toruloides strains, respectively. A bioethanol titer of 106 g/L was obtained from S. cerevisiae grown on oilcane juice in a 3 L fermenter, and a lipid titer of 8.8 g/L was obtained from R. toruloides grown on oilcane hydrolysate in a 75 L fermenter. Oil was extracted from the R. toruloides cells using supercritical CO2, and the observed fatty acid profile was consistent with previous studies on this strain. These results demonstrate the feasibility of pilot-scale lipid production from oilcane hydrolysate as part of an integrated bioconversion strategy.
keywords:
Conversion;Bioproducts;Feedstock Bioprocessing;Hydrolysate