Illinois Data Bank Dataset Search Results
published:
2025-05-29
Ruess, P.J.; Hanley, Jackie; Konar, Megan
(2025)
These data support Ruess et al. (2025) "Drought impacts to water footprints and virtual water transfers of counties of the United States", Water Resources Research, 61, e2024WR037715, https://doi.org/10.1029/2024WR037715.
The dataset contains estimates for Virtual Water Content (VWC) and Virtual Water Trade (VWT) for nine unique combinations of three crop categories (cereal grains, produce, and animal feed) and three water sources (surface water withdrawals, groundwater withdrawals, and groundwater depletion) for the years 2012 and 2017 within the Continental United States. The VWC is calculated by dividing irrigation withdrawal estimates (m3) by the production (tons) at the county resolution. The VWT is calculated by multiplying the VWC by the estimated county-level food flows (tons) from Karakoc et al. (2022). All VWC estimates are provided at the county resolution according to county GEOID and are given in units of m3/ton. All VWT estimates are given in pairs of origin and destination GEOIDs and provided in units of m3.
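The two calculations described above can be illustrated with a minimal Python sketch. It is not the authors' code: the function names, the GEOIDs, and all numbers are hypothetical, and it assumes that the VWC applied to each flow is that of the origin county, which the description does not state explicitly.

```python
from typing import Dict, Tuple


def virtual_water_content(withdrawal_m3: float, production_tons: float) -> float:
    """VWC (m3/ton) = irrigation withdrawal (m3) / production (tons)."""
    if production_tons <= 0:
        raise ValueError("production must be positive to compute VWC")
    return withdrawal_m3 / production_tons


def virtual_water_trade(
    vwc_by_county: Dict[str, float],
    flows_tons: Dict[Tuple[str, str], float],
) -> Dict[Tuple[str, str], float]:
    """VWT (m3) per (origin, destination) pair = VWC * food flow (tons).

    Assumption: the VWC of the ORIGIN county is applied to each flow.
    Pairs whose origin has no VWC estimate are skipped.
    """
    return {
        (origin, dest): vwc_by_county[origin] * tons
        for (origin, dest), tons in flows_tons.items()
        if origin in vwc_by_county
    }


# Hypothetical values for one crop category and one water source.
vwc = {"17019": virtual_water_content(1_000_000, 50_000)}  # m3/ton
flows = {("17019", "17031"): 1_000.0}                      # tons
vwt = virtual_water_trade(vwc, flows)                      # m3
```

In the dataset itself this calculation would be repeated for each of the nine crop and water-source combinations and for each year.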
When using, please cite as:
Ruess, P.J., Hanley, J., and Konar, M. (2025) "Drought impacts to water footprints and virtual water transfers of counties of the United States", Water Resources Research, 61, e2024WR037715, doi: 10.1029/2024WR037715.
keywords:
irrigation; water footprints; supply chains
published:
2025-08-26
Kraft, Mary L.; Fisher, Gregory L.; Chini, Corryn E.; Gorman, Brittney L.; Brunet, Melanie A.
(2025)
This dataset consists of the time-of-flight secondary ion mass spectrometry (TOF-SIMS) depth profiling data that was collected with a PHI nanoTOF II Parallel Imaging MS/MS instrument from a 70 micron by 70 micron region on a recombinant HEK cell labeled with a stain that accumulates in the endoplasmic reticulum (ER-Tracker Blue White DPX, Invitrogen).
keywords:
TOF-SIMS; secondary ion mass spectrometry; depth profiling; endoplasmic reticulum; fluorine; total ion count; TIC image; ion image; tandem mass spectrometry imaging; ER-tracker
published:
2025-06-23
Kleiman, Diego; Feng, Jiangyan; Xue, Zhengyuan; Shukla, Diwakar
(2025)
This repository contains data and model weights associated with the publication "ESMDynamic: Fast and Accurate Prediction of Protein Dynamic Contact Maps from Single Sequences". It includes the datasets used for training and evaluating a dynamic contact prediction model, ESMDynamic, as well as a script for conversion and usage.
keywords:
Computational biology; Structural biology; Molecular dynamics; Machine learning; Protein modeling; Bioinformatics; Biophysics; Artificial intelligence
published:
2023-09-19
Salami, Malik Oyewale; Lee, Jou; Schneider, Jodi
(2023)
We used the following keywords files to identify categories for journals and conferences not in Scopus, for our STI 2023 paper "Assessing the agreement in retraction indexing across 4 multidisciplinary sources: Crossref, Retraction Watch, Scopus, and Web of Science".
The first four text files each contain keywords/content words in the form: 'keyword1', 'keyword2', 'keyword3', .... The file title indicates the name of the category:
file1: healthscience_words.txt
file2: lifescience_words.txt
file3: physicalscience_words.txt
file4: socialscience_words.txt
The first four files were generated from a combination of software and manual review in an iterative process in which we:
- Manually reviewed venue titles that we were not able to categorize automatically using the Scopus categorization or by extending it as a resource.
- Iteratively reviewed uncategorized venue titles to manually curate additional keywords as content words indicating a venue title could be classified in the category healthscience, lifescience, physicalscience, or socialscience. We used English content words and added words we could automatically translate to identify content words. NOTE: Terminology with multiple potential meanings, or containing non-English words that did not yield useful automatic translations (e.g., Al-Masāq), was not selected as content words.
The fifth text file is a list of stopwords in the form: 'stopword1', 'stopword2', 'stopword3', ...
file5: stopwords.txt
This file contains manually curated stopwords from venue titles to handle non-content words like 'conference' and 'journal,' etc.
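As an illustration of how files in this quoted, comma-separated form might be used, here is a minimal Python sketch. It is not the authors' pipeline: the function names and the sample keywords are hypothetical, matching is done on whole lowercase tokens, and a keyword that itself contains an apostrophe would not be parsed correctly by this simple pattern.

```python
import re
from typing import Dict, List, Optional, Set


def parse_quoted_list(text: str) -> List[str]:
    """Extract items from text of the form 'word1', 'word2', 'word3', ..."""
    return [m.strip().lower() for m in re.findall(r"'([^']*)'", text) if m.strip()]


def categorize(
    title: str,
    keywords_by_category: Dict[str, Set[str]],
    stopwords: Set[str],
) -> Optional[str]:
    """Return the first category with a keyword among the title's content tokens."""
    tokens = [t for t in re.findall(r"\w+", title.lower()) if t not in stopwords]
    for category, words in keywords_by_category.items():
        if any(token in words for token in tokens):
            return category
    return None


# Hypothetical file contents; the real lists live in the *_words.txt files.
health = set(parse_quoted_list("'nursing', 'cardiology', 'medicine'"))
stops = set(parse_quoted_list("'conference', 'journal', 'of'"))
label = categorize("Journal of Cardiology", {"healthscience": health}, stops)
```

To work with the actual files, the sample strings would be replaced by the contents of, for example, healthscience_words.txt and stopwords.txt read as UTF-8 text.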
This dataset is a revision of the following dataset:
Version 1: Lee, Jou; Schneider, Jodi: Keywords for manual field assignment for Assessing the agreement in retraction indexing across 4 multidisciplinary sources: Crossref, Retraction Watch, Scopus, and Web of Science. University of Illinois at Urbana-Champaign Data Bank.
Changes from Version 1 to Version 2:
- Added one author
- Added a stopwords file that was used in our data preprocessing.
- Thoroughly reviewed each of the 4 keywords lists. In particular, we added UTF-8 terminology, removed some non-content words and misclassified content words, and extensively reviewed non-English keywords.
keywords:
health science keywords; scientometrics; stopwords; field; keywords; life science keywords; physical science keywords; science of science; social science keywords; meta-science; RISRS
published:
2025-08-21
Lu, Yi; Sweedler, Jonathan; Zhou, Shuaizhen; Zhou, Yu
(2025)
Engineering efficient biocatalysts is essential for metabolic engineering to produce valuable bioproducts from renewable resources. However, due to the complexity of cellular metabolic networks, it is challenging to translate success in vitro into high performance in cells. To meet such a challenge, an accurate and efficient quantification method is necessary to screen a large set of mutants from complex cell culture and a careful correlation between the catalysis parameters in vitro and performance in cells is required. In this study, we employed a mass-spectrometry based high-throughput quantitative method to screen new mutants of 2-pyrone synthase (2PS) for triacetic acid lactone (TAL) biosynthesis through directed evolution in E. coli. From the process, we discovered two mutants with the highest improvement (46 fold) in titer and the fastest kcat (44 fold) over the wild type 2PS, respectively, among those reported in the literature. A careful examination of the correlation between intracellular substrate concentration, Michaelis-Menten parameters and TAL titer for these two mutants reveals that a fast reaction rate under limiting intracellular substrate concentrations is important for in-cell biocatalysis. Such properties can be tuned by protein engineering and synthetic biology to adopt these engineered proteins for the maximum activities in different intracellular environments.
keywords:
catalysis; mass spectrometry; metabolic engineering
published:
2025-08-21
Viral vectors provide an increasingly versatile platform for transformation-free reagent delivery to plants. RNA viral vectors can be used to induce gene silencing, overexpress proteins, or introduce gene editing reagents; however, they are often constrained by carrying capacity or restricted tropism in germline cells. Site-specific recombinases that catalyze precise genetic rearrangements are powerful tools for genome engineering that vary in size and, potentially, efficacy in plants. In this work, we show that viral vectors based on tobacco rattle virus (TRV) deliver and stably express four recombinases ranging in size from ∼0.6 to ∼1.5 kb and achieve simultaneous marker removal and reporter activation through targeted excision in transgenic Nicotiana benthamiana lines. TRV vectors with Cre, FLP, CinH, and Integrase13 efficiently mediated recombination in infected somatic tissue and led to heritable modifications at high frequency. An excision-activated Ruby reporter enabled simple and high-resolution tracing of infected cell lineages without the need for molecular genotyping. Together, our experiments broaden the scope of viral recombinase delivery and offer insights into infection dynamics that may be useful in developing future viral vectors.
keywords:
gene editing; genome engineering; plant transformation
published:
2025-08-20
Arshad, Muhammad Umer; Archer, David; Wasonga, Daniel; Namoi, Nictor; Boe, Arvid; Rob, Mitchell; Heaton, Emily; Khanna, Madhu; Lee, DoKyoung
(2025)
The compiled datasets include detailed costs for switchgrass production, categorized into establishment, maintenance, and harvesting expenses, along with revenue calculations. Costs were gathered from multiple sources and adjusted for inflation, focusing on farm-gate profitability, excluding fixed costs and transportation. All financial data is provided per hectare. The dataset was used to evaluate the economic performance of forage- and bioenergy-type switchgrass cultivars and their response to nitrogen fertilization across diverse marginal environments in the U.S. Midwest. Data Envelopment Analysis (DEA) and cost-benefit analysis were employed to assess the efficiency and profitability of 23 different cultivar and fertilization rate combinations over five years.
published:
2025-08-05
Zhu, Minjiang; Sanders, Derrick M.; Kim, Yun Seong; Shah, Rohan; Hossain, Mohammad Tanver; Ewoldt, Randy H.; Tawfick, Sameh H.; Geubelle, Philippe H.
(2025)
published:
2025-08-08
Bhatnagar, Nikita; Chung, Sarah S.; Hodge, John; Kim, Sang Yeol; Sands, Mia; Leakey, Andrew D. B.; Ort, Donald R.; Burgess, Steven J.
(2025)
Rubisco activase is an ATP-dependent chaperone that facilitates dissociation of inhibitory sugar phosphates from the catalytic sites of Rubisco during photosynthesis. In Arabidopsis, Rubisco activase is negatively regulated by dark-dependent phosphorylation of Thr78. The prevalence of Thr78 in Rubisco activase was investigated across sequences from 91 plant species, finding that 29 (∼32%) species shared a threonine in the same position. Analysis of seven C3 species with an antibody raised against a Thr78 phospho-peptide demonstrated that this position is phosphorylated in multiple genera. However, light-dependent dephosphorylation of Thr78 was observed only in Arabidopsis. Further, phosphorylation of Thr78 could not be detected in any of the four C4 grass species examined. The results suggest that despite conservation of Thr78 in Rubisco activase from a wide range of species, a regulatory role for phosphorylation at this site is more limited. This provides a case study for how variation in post-translational regulation can amplify functional divergence across the phylogeny of plants beyond what is explained by sequence variation in a metabolically important protein.
keywords:
photosynthesis; sorghum
published:
2025-08-07
Keiser, Ashley D.; Heaton, Emily; VanLoocke, Andrew; Studt, Jacob; McDaniel, Marshall D.
(2025)
Bioenergy and bioproduct markets are expanding to meet demand for climate friendly goods and services. Perennial biomass crops are particularly well suited for this goal because of their high yields, low input requirements, and potential to increase soil carbon (C). However, it is unclear how much C is allocated into belowground pools by perennial bioenergy crops and whether the belowground benefits vary with nitrogen (N) fertilizer inputs. Using in situ 13C pulse-chase labeling, we tested whether the sterile perennial grass Miscanthus × giganteus (miscanthus) or annual maize transfers more photosynthetic C to belowground pools. The experiment took place at two sites in Central and Northwest (NW) Iowa with different management histories and two nitrogen (N) fertilizer rates (0 and 224 kg N ha-1 yr-1) to determine if the fate of plant-derived soil C depends on soil fertility and crop type (perennial or annual). Maize allocated a greater percentage of total new 13C to roots than miscanthus, but miscanthus had greater new 13C in total and belowground plant biomass. We found strong interactions between site and most soil measurements – including new 13C in mineral and particulate soil organic matter (SOM) pools – which appear to be driven by differences in historical fertilizer management. The NW Iowa site, with a history of manure inputs, had greater plant-available nutrients (phosphorus, potassium, and ammonium) in soils, and resulted in less 13C from miscanthus in SOM pools compared to maize (approximately 64% less in POM and 70% less in MAOM). In more nutrient-limited soils (Central site), miscanthus transferred 4.5 times more 13C than maize to the more stable mineral-associated SOM pool. Our results suggest that past management, including historical manure inputs that affect a site’s soil fertility, can influence the net C benefits of bioenergy crops.
Dataset includes tables/figures from article and supplementary info. Dryad contains raw data.
keywords:
land management; carbon; miscanthus; maize
published:
2025-08-05
Carrica, Lauren; Gulley, Joshua M.
(2025)
This dataset includes all data used in the manuscript by Carrica and Gulley titled, "Ontogeny of catechol-o-methyltransferase expression in the rat prefrontal cortex: effects of methamphetamine exposure"
keywords:
dopamine clearance; adolescence; drug exposure; prefrontal cortex
published:
2025-07-31
Gibson, Jared; Jiang, Zhanzhi; Kou, Angela
(2025)
This repository includes data files and analysis and plotting codes for reproducing the figures in the paper "A scanning resonator for probing quantum coherent devices" arXiv:2506.22620
published:
2025-03-19
Bieri, Carolina A.; Dominguez, Francina; Miguez-Macho, Gonzalo; Fan, Ying
(2025)
This repository includes HRLDAS Noah-MP model output generated as part of Bieri et al. (2025) - Implementing deep soil and dynamic root uptake in Noah-MP (v4.5): Impact on Amazon dry-season transpiration.
These data are distributed in two different formats: Raw model output files and subsetted files that include data for a specific variable. All files are .nc format (NetCDF) and aggregated into .tar files to facilitate download. Given the size of these datasets, Globus transfer is the best way to download them.
Raw model output for four model experiments is available: FD (control), GW, SOIL, and ROOT. See the associated publication for information on the different experiments. These data span an approximately 20 year period from 01 Jun 2000 to 31 Dec 2019. The data have a spatial resolution of 4 km and a temporal frequency of 3 hours. These data are for a domain in the southern Amazon basin (see Figure 1 in the associated publication). Data for each experiment is available as a .tar file which includes 3-hourly NetCDF files. All default Noah-MP output variables are included in each file. As a result, the .tar files are quite large and may take many hours or even days to transfer depending on your network speed and local configurations. These files are named 'noahmp_output_2000_2019_EXP.tar', where EXP is the name of the experiment (FD, GW, SOIL, or ROOT).
Subsetted model output at a daily temporal resolution for all four model experiments is also available. These .tar files include the following variables: water table depth (ZWT), latent heat flux (LH), sensible heat flux (HFX), soil moisture (SOIL_M), canopy evaporation (ECAN), ground evaporation (EDIR), transpiration (ETRAN), rainfall rate at the surface (QRAIN), and two variables that are specific to the ROOT experiment: ROOTACTIVITY (root activity function) and GWRD (active root water uptake depth). There is one file for each variable within the tarred files. These files are named 'noahmp_output_subset_2000_2019_EXP.tar', where EXP is the name of the experiment (FD, GW, SOIL, or ROOT).
Finally, there is a sample dataset with raw 3-hourly output from the ROOT experiment for one day. The purpose of this sample dataset is to allow users to confirm if these data meet their needs before initiating a full transfer via Globus. This file is named 'noahmp_output_sample_ROOT.tar'.
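The naming scheme above lends itself to a small helper for building archive names and checking what a downloaded archive contains before unpacking it. The following Python sketch uses only the standard library; the function names are hypothetical, and the layout of files inside each archive is not assumed.

```python
import tarfile
from typing import List

EXPERIMENTS = ("FD", "GW", "SOIL", "ROOT")


def archive_name(experiment: str, subset: bool = False) -> str:
    """Build the .tar name for an experiment, following the pattern in the text."""
    if experiment not in EXPERIMENTS:
        raise ValueError(f"unknown experiment: {experiment!r}")
    middle = "subset_" if subset else ""
    return f"noahmp_output_{middle}2000_2019_{experiment}.tar"


def list_members(tar_path: str) -> List[str]:
    """List the regular files inside a tar archive without extracting them."""
    with tarfile.open(tar_path) as tar:
        return [member.name for member in tar.getmembers() if member.isfile()]


raw_name = archive_name("GW")                    # raw 3-hourly output
daily_name = archive_name("ROOT", subset=True)   # daily subsetted output
```

Running list_members on the sample archive 'noahmp_output_sample_ROOT.tar' is a quick way to inspect the file layout before starting a full Globus transfer.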
The README.txt file provides information on the Noah-MP output variables in these datasets, among other specifications.
Information on HRLDAS Noah-MP and names/definitions of model output variables that are useful in working with these data are available here: http://dx.doi.org/10.5065/ew8g-yr95. Note that some output variables may be listed in this document under a different variable name, so searching for the long name (e.g. 'baseflow' instead of 'QRF') is recommended.
Information on additional output variables that were added to the model as part of this study is available here: https://github.com/bieri2/bieri-et-al-2025-EGU-GMD/tree/DynaRoot.
Model code, configuration files, and forcing data used to carry out the model simulations are linked in the related resources section.
keywords:
Land surface model; NetCDF
published:
2025-07-28
McCumber, Corinne; Salami, Malik Oyewale
(2025)
This project investigates retraction indexing agreement in PubMed between 2024-07-03 and 2025-05-09 in order to address an API limitation that resulted in 199 items being excluded from analysis in "Analyzing the consistency of retraction indexing". PubMed was queried on 2024-07-03 and on 2025-05-09 using the search “Retracted Publication[PT]”. PubMed is only able to return 10,000 items when queried via the E-Utilities API. When the pipeline was run 2024-07-03, the search between 2020 and 2024 returned 10,199 items, meaning that an expected 199 items indexed as retracted in PubMed were excluded. This dataset uses and compares information from PubMed as of 2025-05-09 to attempt to identify those 199 items.
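A general workaround for a per-query cap like the 10,000-item limit described above is to split the search window into date ranges that each stay under the cap. The Python sketch below illustrates that idea only; it is not the approach taken in this dataset, which compares against a later PubMed snapshot. Here count_fn is a hypothetical stand-in for a call that returns the number of hits for a date range, and no real API call is made.

```python
from datetime import date, timedelta
from typing import Callable, List, Tuple

MAX_ITEMS = 10_000  # per-query cap stated in the text

DateRange = Tuple[date, date]


def split_ranges(
    start: date,
    end: date,
    count_fn: Callable[[date, date], int],
    limit: int = MAX_ITEMS,
) -> List[DateRange]:
    """Recursively halve [start, end] until each range has at most `limit` hits.

    A single day that still exceeds the limit is returned unchanged,
    because it cannot be split further by date.
    """
    if count_fn(start, end) <= limit or start >= end:
        return [(start, end)]
    mid = start + (end - start) // 2
    left = split_ranges(start, mid, count_fn, limit)
    right = split_ranges(mid + timedelta(days=1), end, count_fn, limit)
    return left + right


def fake_count(a: date, b: date) -> int:
    """Toy counter for demonstration: pretend every day has 10 matching records."""
    return ((b - a).days + 1) * 10


ranges = split_ranges(date(2020, 1, 1), date(2020, 1, 4), fake_count, limit=25)
```

Each returned range can then be queried separately and the results merged, removing duplicates by identifier.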
keywords:
retraction status; data quality; indexing; retraction indexing; metadata; meta-science; RISRS; PubMed
published:
2025-07-25
Mori, Jameson; Rivera, Nelda; Brown, William; Skinner, Daniel; Schlichting, Peter; Novakofski, Jan; Mateus-Pinilla, Nohra
(2025)
This dataset contains the pregnancy status of wild, white-tailed deer (Odocoileus virginianus) from northern Illinois culled as part of the Illinois Department of Natural Resources' chronic wasting disease (CWD) surveillance program. Fiscal years 2005 through 2024 are included. A fiscal year is the time between July 1st of one calendar year and June 30th of the next. Variables in this dataset include the pregnancy status, CWD infection status, age, weight, and day of mortality for each female deer, as well as the deer land cover utility (LCU) score for the TRS, township, or county from which the deer was culled. The deer population density of the county is also included. Data have been anonymized for landowner privacy reasons so that the location and year are not identifiable, but will give the same modeling results by maintaining how the data are grouped. The R code used to conduct the regression modeling is also included.
keywords:
cervid; Cervidae; chronic wasting disease; CWD; reproduction; white-tailed deer; Odocoileus virginianus; pregnancy; regression
published:
2025-06-22
Stickley, Samuel; Crawford, John; Peterman, William; Fraterrigo, Jennifer
(2025)
keywords:
terrestrial salamanders; microhabitat; physiology; mechanistic models; ecological niche models; climate change; Great Smoky Mountains National Park
published:
2019-09-01
Jackson, Nicole; Konar, Megan; Debaere, Peter; Estes, Lyndon
(2019)
Agriculture has substantial socioeconomic and environmental impacts that vary between crops. However, information on how the spatial distribution of specific crops has changed over time across the globe is relatively sparse. We introduce the Probabilistic Cropland Allocation Model (PCAM), a novel algorithm to estimate where specific crops have likely been grown over time. Specifically, PCAM downscales annual and national-scale data on the crop-specific area harvested of 17 major crops to a global 0.5-degree grid from 1961-2014.
The resulting database presented here provides annual global gridded likelihood estimates of crop-specific areas. Both mean and standard deviations of grid cell fractions are available for each of the 17 crops. Each netCDF file contains an individual year of data with an additional variable ("crs") that defines the coordinate reference system used. Our results provide new insights into the likely changes in the spatial distribution of major crops over the past half-century. For additional information, please see the related paper by Jackson et al. (2019) in Environmental Research Letters (https://doi.org/10.1088/1748-9326/ab3b93).
keywords:
global; gridded; probabilistic allocation; crop suitability; agricultural geography; time series
published:
2025-07-21
Feng, Jennifer T.; van den Berg, Thya; Donders, Timme H.; Kong, Shu; Puthanveetil Satheesan, Sandeep; Punyasena, Surangi W.
(2025)
This dataset includes image stacks, annotated counts, and ground-truth masks from two high-resolution sediment cores extracted from Laguna Pallcacocha, in El Cajas National Park, Ecuadorian Andes by Moy et al. (2002) and Hagemans et al. (2021). The first core (PAL 1999, from Moy et al. (2002)) extends through the Holocene (11,600 cal. yr. BP - present). There are a total of 900 annotated image stacks and masks in the PAL 1999 domain. The second core (PAL IV, from Hagemans et al. (2021)) captures the 20th century. There are 2986 annotated image stacks and masks in the PAL IV domain.
Different microscopes and annotation tools were used to image and annotate each core, and there are corresponding differences in naming conventions and file formats. Thus, we organized our data separately for the PAL 1999 and the PAL IV domains. The three-letter codes used to label our pollen annotations are in the file: “Pollen_Identification_Codes.xlsx”.
Both domain directories contain:
• Image stacks organized by subdirectory
• Annotations within each image stack directory, containing specimen identifications using a three letter code and coordinates defining bounding boxes or circles
• Ground-truth distance-transform masks for each image stack
The zip file "bestValModel_encoder.paramOnly.zip" is the trained pollen detection model produced from the images and annotations in this dataset.
Please cite this dataset as:
Feng, Jennifer T.; van den Berg, Thya; Donders, Timme H.; Kong, Shu; Puthanveetil Satheesan, Sandeep; Punyasena, Surangi W. (2025): Slide scans, annotated pollen counts, and trained pollen detection models for fossil pollen samples from Laguna Pallcacocha, El Cajas National Park, Ecuador. University of Illinois Urbana-Champaign. https://doi.org/10.13012/B2IDB-4207757_V1
Please also include citations of the original publications from which these data are taken:
Feng, Jennifer T., Sandeep Puthanveetil Satheesan, Shu Kong, Timme H. Donders, and Surangi W. Punyasena. “Addressing the ‘Open World’: Detecting and Segmenting Pollen on Palynological Slides with Deep Learning.” bioRxiv, January 1, 2025. https://doi.org/10.1101/2025.01.05.631390.
Feng, Jennifer T., Sandeep Puthanveetil Satheesan, Shu Kong, Timme H. Donders, and Surangi W. Punyasena. “Addressing the ‘Open World’: Detecting and Segmenting Pollen on Palynological Slides with Deep Learning.” Paleobiology, 2025 [in press].
Feng, J. T. (2023). Open-world deep learning applied to pollen detection (MS thesis, University of Illinois at Urbana-Champaign). https://hdl.handle.net/2142/120168
keywords:
continual learning; deep learning; domain gaps; open-world; palynology; pollen grain detection; taxonomic bias
published:
2024-11-15
Blanke, Steven; Ringling, Megan; Tan, Ivilyn; Oh, Seung
(2024)
This page contains the data for the manuscript "Vacuolating cytotoxin A interactions with the host cell surface". This manuscript is currently in prep.
keywords:
Steven R Blanke; Vacuolating cytotoxin A; VacA; Helicobacter pylori; protein binding; sphingomyelin; cell surface
published:
2024-11-13
Tang, Zhichu; Chen, Wenxiang; Yin, Kaijun; Busch, Robert; Hou, Hanyu; Lin, Oliver; Lyu, Zhiheng; Zhang, Cheng; Yang, Hong; Zuo, Jian-Min; Chen, Qian
(2024)
These datasets are for the four-dimensional scanning transmission electron microscopy (4D-STEM) and electron energy loss spectroscopy (EELS) experiments for cathode nanoparticles at different states. The raw 4D-STEM experiment datasets were collected by TEM image & analysis software (FEI) and were saved as SER files. The raw 4D-STEM datasets of SER files can be opened and viewed in MATLAB using our analysis software package of imToolBox available at https://github.com/flysteven/imToolBox. The raw EELS datasets were collected by DigitalMicrograph software and were saved as DM4 files. The raw EELS datasets can be opened and viewed in DigitalMicrograph software or using our analysis codes available at https://github.com/chenlabUIUC/OrientedPhaseDomain. All the datasets are from the work "Nanoscale Stacking Fault Engineering and Mapping in Spinel Oxides for Reversible Multivalent Ion Insertion" (2024).
The 4D-STEM experiment data include four example datasets for cathode nanoparticles collected at pristine and discharged states. Each dataset contains a stack of diffraction patterns collected at different probe positions scanned across the cathode nanoparticle.
1. Pristine untreated nanoparticle: "Pristine U-NP.ser"
2. Pristine 200ºC heated nanoparticle: "Pristine H200-NP.ser"
3. Untreated nanoparticle after first discharge in Zn-ion batteries: "Discharged U-NP.ser"
4. 200ºC heated nanoparticle after first discharge in Zn-ion batteries: "Discharged H200-NP.ser"
The EELS experiment data includes six example datasets for cathode nanoparticles collected at different states (in "EELS datasets.zip") as described below. Each EELS dataset contains the zero-loss and core-loss EELS spectra collected at different probe positions scanned across the cathode nanoparticle.
1. Pristine untreated nanoparticle: "Pristine U-NP EELS.zip"
2. Pristine 200ºC heated nanoparticle: "Prisitne H200-NP EELS.zip"
3. Untreated nanoparticle after first discharge in Zn-ion batteries: "Discharged U-NP EELS.zip"
4. Untreated nanoparticle after first charge in Zn-ion batteries: "Charged U-NP EELS.zip"
5. 200ºC heated nanoparticle after first discharge in Zn-ion batteries: "Discharged H200-NP EELS.zip"
6. 200ºC heated nanoparticle after first charge in Zn-ion batteries: "Charged H200-NP EELS.zip"
The details of the software package and codes that can be used to analyze the 4D-STEM datasets and EELS datasets are available at: https://github.com/chenlabUIUC/OrientedPhaseDomain. Once our paper is formally published, we will update the relationship of these datasets with our paper.
keywords:
4D-STEM; EELS; defects; strain; cathode; nanoparticle; energy storage
published:
2024-10-10
Zeiri, Offer; Hatzis, Katherine Marie; Gomez, Maurea; Cook, Emily A; Kincanon, Maegen; Murphy, Catherine
(2024)
keywords:
Gold nanorods; Surface enhanced Raman spectroscopy; SERS; Polyoxometalates
published:
2025-06-24
Ge, Jiankai; Weatherspoon, Howard; Peters, Baron
(2025)
This supporting information file contains codes related to the pending publication Ge et al. Proc. Nat. Acad. Sci. USA, (revisions in review). The contents include a Mathematica code that solves the Laplace transformed equations and generates figures from the paper. A Python code is included for generation of Figure 5 in the main text.
keywords:
Population balance model; Covalent organic framework; Nucleation; Growth
published:
2024-09-16
Wu, Steven; Smith, Hannah
(2024)
This dataset describes an analysis of research documents about the debate between hydrogen fuel cells and lithium-ion batteries within the context of electric vehicles.
To create this dataset, we first analyzed news articles on the topic of sustainable development. We searched for related science using keywords in Google Scholar. We then identified subtopics and selected one specific subtopic: electric vehicles. We started to identify positions and players about electric vehicles [1].
Within electric vehicles, we started searching in OpenAlex for a topic of reasonable size (about 300 documents) related to a scientific or technical debate. We narrowed to electric vehicles and batteries, then trained a cluster model [2] on OpenAlex’s keywords to develop some possible search queries, and chose one.
Our final search query (May 7, 2024) returned 301 documents in OpenAlex:
Title & abstract includes: Electric Vehicle + Hydrogen + Battery
filter is Lithium-ion Battery Management in Electric Vehicle
We used a Python script and the Scopus API to find missing abstracts and DOIs [3].
To identify relevant documents, we used a combination of Abstrackr [4] and manual screening. As a starting point for Abstrackr [4], one person manually screened 200 documents by checking the abstracts for “hydrogen fuel cells” and “battery comparisons”. Then we used Abstrackr [4] to predict the relevance of the remaining documents based on the title, abstract, and keywords. The settings we used were single screening, ordered by most likely to be relevant, and 0 pilot size. We set a threshold of 0.6 for the predictions. After screening and predictions, 176 documents remained.
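The combination of manual screening and thresholded predictions can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' script: the document IDs and scores are hypothetical, manual decisions are assumed to take precedence over predictions, and a score equal to the threshold is assumed to count as relevant.

```python
from typing import Dict, List


def select_relevant(
    manual_labels: Dict[str, bool],
    predicted_scores: Dict[str, float],
    threshold: float = 0.6,
) -> List[str]:
    """Keep manually accepted documents plus unscreened ones scoring >= threshold."""
    kept = [doc for doc, relevant in manual_labels.items() if relevant]
    kept += [
        doc
        for doc, score in predicted_scores.items()
        if doc not in manual_labels and score >= threshold
    ]
    return sorted(kept)


# Hypothetical IDs: W1 and W2 were screened by hand, W3 to W5 were predicted.
manual = {"W1": True, "W2": False}
scores = {"W3": 0.72, "W4": 0.41, "W5": 0.6}
relevant = select_relevant(manual, scores)
```

With these toy inputs, W2 is dropped by the manual decision and W4 falls below the 0.6 threshold.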
keywords:
controversy mapping; sustainable development; evidence synthesis; OpenAlex; Abstrackr; Scopus; meta-analysis; electric vehicle; hydrogen fuel cells; battery
published:
2025-02-08
Anne, Lahari; Park, Minhyuk; Warnow, Tandy; Chacko, George
(2025)
The synthetic networks in this dataset were generated using the RECCS protocol developed by Anne et al. (2024). Briefly, the RECCS process is as follows. An input network and clustering (by any algorithm) are used to pass input parameters to a stochastic block model (SBM) generator. The output is then modified to improve fit to the input real-world clusters, after which outlier nodes are added using one of three different options. See Anne et al. (2024): in press Complex Networks and Applications XIII (preprint: arXiv:2408.13647).
The networks in this dataset were generated using either version 1 or version 2 of the RECCS protocol followed by outlier strategy S1. The input networks to the process were (i) the Curated Exosome Network (CEN), Wedell et al. (2021), (ii) cit_hepph (https://snap.stanford.edu/), (iii) cit_patents (https://snap.stanford.edu/), and (iv) wiki_topcats (https://snap.stanford.edu/).
Input Networks:
The CEN can be downloaded from the Illinois Data Bank:
https://databank.illinois.edu/datasets/IDB-0908742 -> cen_pipeline.tar.gz -> S1_cen_cleaned.tsv
The synthetic file naming system should be interpreted as follows: a_b_c.tsv.gz where
a - name of inspirational network, e.g., cit_hepph
b - the resolution value used when clustering a with the Leiden algorithm optimizing the Constant Potts Model, e.g., 0.01
c - the RECCS option used to approximate edge count and connectivity in the real world network, e.g., v1
Thus, cit_hepph_0.01_v1.tsv indicates that this network was modeled on the cit_hepph network and RECCSv1 was used to match edge count and connectivity to a Leiden-CPM 0.01 clustering of cit_hepph. For SBM generation, we used the graph_tool software (P. Peixoto, Tiago 2014. The graph-tool python library. figshare. Dataset. https://doi.org/10.6084/m9.figshare.1164194.v14)
Additionally, this dataset contains synthetic networks generated for a replication experiment (repl_exp.tar.gz). The experiment aims to evaluate the consistency of RECCS-generated networks by producing multiple replicates under controlled conditions. These networks were generated using different configurations of RECCS, varying across two versions (v1 and v2), and applying the Connectivity Modifier (CM++, Ramavarapu et al. (2024)) pre-processing. Please note that the CM pipeline used for this experiment filters small clusters both before and after the CM treatment.
Input Network: CEN
Within repl_exp.tar.gz, the synthetic file naming system should be interpreted as follows:
cen_<resolution>_<cm_status>_<reccs_version>_sample_<replicate_id>.tsv
where:
cen – Indicates the network was modeled on the Curated Exosome Network (CEN).
resolution – The resolution parameter used in clustering the input network with Leiden-CPM (0.01).
cm_status – Either cm (CM-treated input clustering) or no_cm (input clustering without CM treatment).
reccs_version – The RECCS version used to generate the synthetic network (v1 or v2).
replicate_id – The specific replicate (ranging from 0 to 2 for each configuration).
For example:
cen_0.01_cm_v1_sample_0.tsv – A synthetic network based on CEN with Leiden-CPM clustering at resolution 0.01, CM-treated input, and generated using RECCSv1 (first replicate).
cen_0.01_no_cm_v2_sample_1.tsv – A synthetic network based on CEN with Leiden-CPM clustering at resolution 0.01, without CM treatment, and generated using RECCSv2 (second replicate).
The ground truth clustering input to RECCS is contained in repl_exp_groundtruths.tar.gz.
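Both naming schemes can be decoded programmatically. The Python sketch below uses regular expressions that follow the example file names given above (underscore-separated fields); the function names are hypothetical, and the patterns assume resolution values written as plain decimals and RECCS versions v1 or v2.

```python
import re
from typing import Dict, Optional

# Main networks: a_b_c.tsv(.gz), e.g. cit_hepph_0.01_v1.tsv.gz
MAIN_PATTERN = re.compile(
    r"^(?P<network>.+)_(?P<resolution>\d+(?:\.\d+)?)_(?P<reccs_version>v[12])"
    r"\.tsv(?:\.gz)?$"
)

# Replication experiment: e.g. cen_0.01_no_cm_v2_sample_1.tsv
REPL_PATTERN = re.compile(
    r"^cen_(?P<resolution>\d+(?:\.\d+)?)_(?P<cm_status>cm|no_cm)"
    r"_(?P<reccs_version>v[12])_sample_(?P<replicate_id>\d+)\.tsv$"
)


def parse_main_name(filename: str) -> Optional[Dict[str, str]]:
    """Split a main synthetic-network file name into its fields, or return None."""
    match = MAIN_PATTERN.match(filename)
    return match.groupdict() if match else None


def parse_repl_name(filename: str) -> Optional[Dict[str, str]]:
    """Split a replication-experiment file name into its fields, or return None."""
    match = REPL_PATTERN.match(filename)
    return match.groupdict() if match else None


main_info = parse_main_name("cit_hepph_0.01_v1.tsv.gz")
repl_info = parse_repl_name("cen_0.01_no_cm_v2_sample_1.tsv")
```

All fields are returned as strings, so the resolution and replicate ID would need converting with float() and int() before numeric use.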
keywords:
Community Detection; Synthetic Networks; Stochastic Block Model (SBM)
published:
2025-05-21
Punyasena, Surangi W.; Adaime, Marc-Elie; Jaramillo, Carlos
(2025)
This dataset includes a total of 16 images of 2 extant species of Podocarpus (Podocarpaceae) and 23 images of fossil specimens of the morphogenus Podocarpidites.
The images were taken using a Zeiss LSM 880 microscope with Airyscan confocal superresolution at 630x magnification (63x/NA 1.4 oil DIC). The images are in the original CZI file format. They can be opened using Zeiss proprietary software (Zen, Zen lite) or open microscopy software, such as ImageJ. More information on how to open CZI files can be found here: [https://www.zeiss.com/microscopy/us/products/software/zeiss-zen/czi-image-file-format.html]
For Podocarpus (modern specimens):
Each folder is labelled by genus and contains all images corresponding to that genus. Detailed information about the folders, files, and specimens can be found in the CSV file "METADATA_Podocarpus_extant.csv". This file includes metadata on: species, slide ID, collection, folder name, file name, and notes.
Images are of pollen grains from slides in the Florida Museum of Natural History collections.
For Podocarpidites (fossil specimens):
Each image is named after the sample from which it was derived. Detailed information about the specimens can be found in the CSV file "METADATA_ Podocarpidites_fossil.csv". This file includes metadata on: the fossil type (Taxon), the slide and sample name (Slide Info), the location of the sample locality (Country, Latitude, Longitude), the age of the sample (Min age, Max age), the location of the specimen on the sample slide (England Finder coordinates), and the image file name.
Images are of fossil pollen from slides in Smithsonian Tropical Research Institute collections.
Please cite this dataset and listed publications when using these images.
keywords:
optical superresolution microscopy; Zeiss Airyscan; CZI images; conifer; saccate pollen; Podocarpus; Podocarpidites