Illinois Data Bank Dataset Search Results
published:
2025-07-21
Feng, Jennifer T.; van den Berg, Thya; Donders, Timme H.; Kong, Shu; Puthanveetil Satheesan, Sandeep; Punyasena, Surangi W.
(2025)
This dataset includes image stacks, annotated counts, and ground-truth masks from two high-resolution sediment cores extracted from Laguna Pallcacocha, in El Cajas National Park, Ecuadorian Andes by Moy et al. (2002) and Hagemans et al. (2021). The first core (PAL 1999, from Moy et al. (2002)) extends through the Holocene (11,600 cal. yr. BP - present). There are a total of 900 annotated image stacks and masks in the PAL 1999 domain. The second core (PAL IV, from Hagemans et al. (2021)) captures the 20th century. There are 2986 annotated image stacks and masks in the PAL IV domain.
Different microscopes and annotation tools were used to image and annotate each core, and there are corresponding differences in naming conventions and file formats. Thus, we organized our data separately for the PAL 1999 and the PAL IV domains. The three-letter codes used to label our pollen annotations are in the file: “Pollen_Identification_Codes.xlsx”.
Both domain directories contain:
• Image stacks organized by subdirectory
• Annotations within each image stack directory, containing specimen identifications using a three-letter code and coordinates defining bounding boxes or circles
• Ground-truth distance-transform masks for each image stack
The zip file "bestValModel_encoder.paramOnly.zip" is the trained pollen detection model produced from the images and annotations in this dataset.
Please cite this dataset as:
Feng, Jennifer T.; van den Berg, Thya; Donders, Timme H.; Kong, Shu; Puthanveetil Satheesan, Sandeep; Punyasena, Surangi W. (2025): Slide scans, annotated pollen counts, and trained pollen detection models for fossil pollen samples from Laguna Pallcacocha, El Cajas National Park, Ecuador. University of Illinois Urbana-Champaign. https://doi.org/10.13012/B2IDB-4207757_V1
Please also include citations of the original publications from which these data are taken:
Feng, Jennifer T., Sandeep Puthanveetil Satheesan, Shu Kong, Timme H. Donders, and Surangi W. Punyasena. “Addressing the ‘Open World’: Detecting and Segmenting Pollen on Palynological Slides with Deep Learning.” bioRxiv, January 1, 2025. https://doi.org/10.1101/2025.01.05.631390.
Feng, Jennifer T., Sandeep Puthanveetil Satheesan, Shu Kong, Timme H. Donders, and Surangi W. Punyasena. “Addressing the ‘Open World’: Detecting and Segmenting Pollen on Palynological Slides with Deep Learning.” Paleobiology, 2025 [in press].
Feng, J. T. (2023). Open-world deep learning applied to pollen detection (MS thesis, University of Illinois at Urbana-Champaign). https://hdl.handle.net/2142/120168
keywords:
continual learning; deep learning; domain gaps; open-world; palynology; pollen grain detection; taxonomic bias
published:
2017-12-14
Objectives: This study follows up on previous work that began examining data deposited in an institutional repository. The work here extends the earlier study by addressing the following research questions: (1) what is the file composition of datasets ingested into the University of Illinois at Urbana-Champaign campus repository? Are datasets more likely to be single file or multiple file items? (2) what is the usage data associated with these datasets? Which items are most popular?
Methods: The dataset records collected in this study were identified by filtering item types categorized as "data" or "dataset" using the advanced search function in IDEALS. Returned search results were collected in an Excel spreadsheet to include data such as the Handle identifier, date ingested, file formats, composition code, and the download count from the item's statistics report. The Handle identifier represents the dataset record's persistent identifier. Composition represents codes that categorize items as single or multiple file deposits. Date available represents the date the dataset record was published in the campus repository. Download statistics were collected via a website link for each dataset record and indicate the number of times the dataset record has been downloaded. Once the data was collected, it was used to evaluate datasets deposited into IDEALS.
Results: A total of 522 datasets were identified for analysis, covering the period between January 2007 and August 2016. This study revealed two influxes of deposits, one during 2008-2009 and one in 2014. During the first period, a large number of PDFs were deposited by the Illinois Department of Agriculture, whereas in 2014 Microsoft Excel files were deposited by the Rare Books and Manuscript Library. Single-file datasets clearly dominate the deposits in the campus repository. The total download count for all datasets was 139,663, and downloads per month per file averaged 3.2 across all datasets.
Conclusion: Academic librarians, repository managers, and research data services staff can use the results presented here to anticipate the nature of research data that may be deposited within institutional repositories. With increased awareness, content recruitment, and improvements, IRs can provide a viable cyberinfrastructure for researchers to deposit data, but much can be learned from the data already deposited. Awareness of trends can help librarians facilitate discussions with researchers about research data deposits as well as better tailor their services to address short-term and long-term research needs.
keywords:
research data; research statistics; institutional repositories; academic libraries
published:
2025-09-23
Zhao, Huimin; Chen, Li-Qing; Martin, Teresa; Xue, Xueyi; Singh, Nilmani; Tan, Shi-I; Boob, Aashutosh
(2025)
Mitochondria play a key role in energy production and metabolism, making them a promising target for metabolic engineering and disease treatment. However, despite the known influence of passenger proteins on localization efficiency, only a few protein-localization tags have been characterized for mitochondrial targeting. To address this limitation, we leverage a Variational Autoencoder to design novel mitochondrial targeting sequences. In silico analysis reveals that a high fraction of the generated peptides (90.14%) are functional and possess features important for mitochondrial targeting. We characterize artificial peptides in four eukaryotic organisms and, as a proof-of-concept, demonstrate their utility in increasing 3-hydroxypropionic acid titers through pathway compartmentalization and improving 5-aminolevulinate synthase delivery by 1.62-fold and 4.76-fold, respectively. Moreover, we employ latent space interpolation to shed light on the evolutionary origins of dual-targeting sequences. Overall, our work demonstrates the potential of generative artificial intelligence for both fundamental research and practical applications in mitochondrial biology.
keywords:
AI/ML; metabolic engineering; modeling; software
published:
2018-04-19
Prepared by Vetle Torvik 2018-04-15
The dataset comes as a single tab-delimited ASCII encoded file, and should be about 717MB uncompressed.
• How was the dataset created?
First and last names of authors in the Author-ity 2009 dataset were processed through several tools to predict ethnicities and gender, including
Ethnea+Genni as described in:
Torvik VI, Agarwal S. Ethnea -- an instance-based ethnicity classifier based on geocoded author names in a large-scale bibliographic database. International Symposium on Science of Science March 22-23, 2016 - Library of Congress, Washington, DC, USA.
http://hdl.handle.net/2142/88927
Smith, B., Singh, M., & Torvik, V. (2013). A search engine approach to estimating temporal changes in gender orientation of first names. Proceedings Of The ACM/IEEE Joint Conference On Digital Libraries, (JCDL 2013 - Proceedings of the 13th ACM/IEEE-CS Joint Conference on Digital Libraries), 199-208. doi:10.1145/2467696.2467720
EthnicSeer: http://singularity.ist.psu.edu/ethnicity
Treeratpituk P, Giles CL (2012). Name-Ethnicity Classification and Ethnicity-Sensitive Name Matching. Proceedings of the Twenty-Sixth Conference on Artificial Intelligence (pp. 1141-1147). AAAI-12. Toronto, ON, Canada
SexMachine 0.1.1: https://pypi.org/project/SexMachine
First names, for some Author-ity records lacking them, were harvested from outside bibliographic databases.
• The code and back-end data are periodically updated and made available for query at the Torvik Research Group site: http://abel.ischool.illinois.edu
• What is the format of the dataset?
The dataset contains 9,300,182 rows and 10 columns
1. auid: unique ID for Authors in Author-ity 2009 (PMID_authorposition)
2. name: full name (used as input to EthnicSeer)
3. EthnicSeer: predicted ethnicity; ARA, CHI, ENG, FRN, GER, IND, ITA, JAP, KOR, RUS, SPA, VIE, XXX
4. prop: decimal between 0 and 1 reflecting the confidence of the EthnicSeer prediction
5. lastname: used as input for Ethnea+Genni
6. firstname: used as input for Ethnea+Genni
7. Ethnea: predicted ethnicity; one of 26 ethnicities (AFRICAN, ARAB, BALTIC, CARIBBEAN, CHINESE, DUTCH, ENGLISH, FRENCH, GERMAN, GREEK, HISPANIC, HUNGARIAN, INDIAN, INDONESIAN, ISRAELI, ITALIAN, JAPANESE, KOREAN, MONGOLIAN, NORDIC, POLYNESIAN, ROMANIAN, SLAV, THAI, TURKISH, VIETNAMESE), a pair of ethnicities (e.g., SLAV-ENGLISH), UNKNOWN (if there are no one or two dominant predictions), or TOOSHORT (if both first and last name are too short)
8. Genni: predicted gender; 'F', 'M', or '-'
9. SexMac: predicted gender based on third-party Python program (default settings except case_sensitive=False); female, mostly_female, andy, mostly_male, male
10. SSNgender: predicted gender based on US SSN data; 'F', 'M', or '-'
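The column layout above can be read with standard tooling. The following is a minimal Python sketch, not part of the dataset: it assumes the file has no header row (the description does not say either way), and the sample row is invented purely to illustrate the format.

```python
import csv
import io

# Column names taken from the dataset description above.
COLUMNS = [
    "auid", "name", "EthnicSeer", "prop", "lastname",
    "firstname", "Ethnea", "Genni", "SexMac", "SSNgender",
]

def read_rows(handle):
    """Yield one dict per row of the tab-delimited file.

    Assumption: the file has no header row. If it has one, skip the
    first row before calling this function.
    """
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for fields in reader:
        if len(fields) != len(COLUMNS):
            raise ValueError(
                f"expected {len(COLUMNS)} columns, got {len(fields)}"
            )
        row = dict(zip(COLUMNS, fields))
        # prop is the EthnicSeer confidence, a decimal between 0 and 1;
        # an empty field is kept as None rather than guessed.
        row["prop"] = float(row["prop"]) if row["prop"] else None
        yield row

# Invented sample row standing in for the real file (illustrative values only).
sample = io.StringIO(
    "12345_1\tJane Doe\tENG\t0.91\tDoe\tJane\tENGLISH\tF\tfemale\tF\n"
)
rows = list(read_rows(sample))
```

For the real file, the same function could be called on a handle opened with `open(path, encoding="ascii", newline="")`, where `path` is wherever the download was saved. Because the file is about 717MB uncompressed, iterating over the generator is likely preferable to building a full list.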
keywords:
Androgyny; Bibliometrics; Data mining; Search engine; Gender; Semantic orientation; Temporal prediction; Textual markers
published:
2025-09-15
Zhao, Yang; Kim, Jae Y.; Karan, Ratna; Jung, Je Hyeong; Pathak, Bhuvan; Williamson, Bruce; Kannan, Baskaran; Wang, Duoduo; Fan, Chunyang; Yu, Wenjin; Dong, Shujie; Srivastava, Vibha; Altpeter, Fredy
(2025)
Sugarcane, a tropical C4 grass in the genus Saccharum (Poaceae), accounts for nearly 80% of sugar produced worldwide and is also an important feedstock for biofuel production. Generating transgenic sugarcane with predictable and stable transgene expression is critical for crop improvement. In this study, we generated a highly expressed single copy locus as a landing pad for transgene stacking. In NPTII ELISA analysis, transgenic sugarcane lines with stable integration of a single copy nptII expression cassette flanked by insulators supported higher transgene expression and reduced line-to-line variation compared to single copy events without insulators. Subsequently, the nptII selectable marker gene was efficiently excised from the sugarcane genome by the FLPe/FRT site-specific recombination system to create selectable marker free plants. This study provides valuable resources for future gene stacking using site-specific recombination or genome editing tools.
keywords:
Feedstock Production; Biomass Analytics; Genomics
published:
2018-09-04
Teper, Thomas; Lenkart, Joe; Thacker, Mara; Coskun, Esra
(2018)
This dataset contains records of five years of interlibrary loan (ILL) transactions for the University of Illinois at Urbana-Champaign Library. It covers materials lent to other institutions during the period 2009-2013. It includes 169,890 transactions showing date; borrowing institution’s type, state and country; material format, imprint city, imprint country, imprint region, call number, language, local circulation count, ILL lending count, and OCLC holdings count.
The dataset was generated by combining monthly ILL reports. Circulation and ILL lending fields were added from the ILS records. Borrower region and imprint region fields were created based on the Title VI Region List. The OCLC holdings field was added from WorldCat records.
keywords:
Interlibrary Loan; ILL; Lending; OCLC Holding; Library; Area Studies; Collection; Circulation; Collaborative; Shared; Resource Sharing
published:
2017-02-23
GBS data from diverse sorghum lines. Project funded by DOE, ARPA-E, and startup funds to PJ Brown.
published:
2023-08-24
Kim, Hyunchul; Zhao, Helin; van der Zande, Arend
(2023)
This data set includes all data related to strain-resilient FETs based on 2D heterostructures, including optical images of FETs, Raman characterization data, transport measurement data, and AFM topography data.
keywords:
2D materials; Stretchable electronics
published:
2019-10-05
Saurabh, Jha; Archit, Patke; Mike, Showerman; Jeremy, Enos; Greg, Bauer; Zbigniew, Kalbarczyk; Ravishankar, Iyer; William, Kramer
(2019)
This dataset contains network information collected and aggregated from Jan/01/2017 to May/31/2017 on NCSA’s Blue Waters system, which comprises 27,648 nodes connected via a Cray Gemini 3D torus (dimension 24x24x24) interconnect. Network performance counters for links are exposed via Cray's gpcdr kernel module (https://github.com/ovis-hpc/ovis/wiki/gpcdr-kernel-module). The Lightweight Distributed Metric Service (LDMS, https://github.com/ovis-hpc/ovis) is used to sample the performance counters at 60-second intervals. Please read the "README.md" file.
Acknowledgement:
This dataset is collected as a part of the Blue Waters sustained-petascale computing project, which is supported by the National Science Foundation and the state of Illinois. Blue Waters is a joint effort of the University of Illinois at Urbana-Champaign and its National Center for Supercomputing Applications.
keywords:
HPC; Interconnect; Network; Congestion; Blue Waters; Dataset
published:
2017-02-21
GBS data from biparental sorghum populations provided by Dr. Bill Rooney, TAMU. Data produced and analyzed by Pradeep Hirannaiah to study recombination in sorghum. Funding for this study was provided by the Sorghum Checkoff.
published:
2017-03-07
Mickalide, Harry; Fraebel, David T.; Kuehn, Seppe
(2017)
This is a sample 5-minute video of an E. coli bacterium swimming in a microfluidic chamber, along with supplementary code files to be used with the Matlab code available at https://github.com/dfraebel/CellTracking
published:
2018-12-13
Xu, Zewei; Wang, Shaowen
(2018)
A 3D CNN method for land cover classification using LiDAR and multitemporal imagery
keywords:
3DCNN; land cover classification; LiDAR; multitemporal imagery
published:
2017-02-21
GBS data from diverse sorghum lines. Project funded by DOE, ARPA-E, and startup funds to PJ Brown.
published:
2019-08-30
This dataset includes the data from an analysis of bobcat harvest data with particular focus on the relationship between catch-per-unit-effort and population size. The data relate to bobcat trapper and hunter harvest metrics from Wisconsin and include two RDS files, which can be opened in R using the readRDS() function.
keywords:
bobcat; catch-per-unit-effort; CPUE; harvest; Lynx rufus; wildlife management; trapper; hunter
published:
2021-07-21
Rozansky, Zachary; Larson, Eric; Taylor, Christopher
(2021)
This dataset contains 1 CSV file, RozanskyLarsonTaylorMsat.csv, which contains microsatellite fragment lengths for Virile and Spothanded Crayfish from the Current River watershed of Missouri, U.S., and complementary data, including assignments to species by phenotype and COI sequence data, GenBank accession numbers for COI sequence data, study sites with dates of collection and geographic coordinates, and Illinois Natural History Survey (INHS) Crustacean Collection lots where specimens are stored.
keywords:
invasive species; hybridization; crayfishes; streams; freshwater; Cambaridae; virile crayfish; spothanded crayfish; Missouri; Current River; Ozark National Scenic Riverways
published:
2016-05-26
This data set includes survey responses collected during 2015 from academic libraries with library publishing services. Each institution responded to questions related to its use of user studies or information about readers in order to shape digital publication design, formats, and interfaces. Survey data was supplemented with institutional categories to facilitate comparison across institutional types.
keywords:
academic libraries; publishing; user experience; user studies
published:
2020-08-31
Chen, Luoye; Khanna, Madhu; Debnath, Deepayan; Zhong, Jia; Ferin, Kelsie; VanLoocke, Andy
(2020)
This dataset contains BEPAM model code and input data to replicate the outcomes for "The Economic and Environmental Costs and Benefits of the Renewable Fuel Standard".
The dataset consists of:
(1) The replication code and data for the BEPAM model. The code file is named output.gms. (BEPAM-Social cost model-ERL.zip)
(2) Simulation results from the BEPAM model (BEPAM_Simulation_Results.csv)
* Item (1) is in GAMS format. Item (2) is in text format.
keywords:
Social Cost of Carbon; Social Cost of Nitrogen; Cost-Benefit Analysis; Indirect Land-Use Change
published:
2025-04-21
Shen, Chengze; Wedell, Eleanor; Warnow, Tandy
(2025)
#Overview
These are reference packages for the TIPP3 software for abundance profiling and/or species detection from metagenomic reads (e.g., Illumina, PacBio, Nanopore, etc.). Different refpkg versions are listed.
TIPP3 software: https://github.com/c5shen/TIPP3
#Changelog
V1.2 (`tipp3-refpkg-1-2.zip`)
>>Fixed old typos in the file mapping text.
>>Added new files `taxonomy/species_to_marker.tsv` for new function `run_tipp3.py detection [...parameters]`. Please use the latest release of the TIPP3 software for this new function.
V1 (`tipp3-refpkg.zip`)
>>Initial release of the TIPP3 reference package.
#Usage
1. unzip the file to a local directory (will get a folder named "tipp3-refpkg").
2. use with TIPP3 software: `run_tipp3.py -r [path/to/tipp3-refpkg] [other parameters]`
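The two usage steps above could be scripted roughly as follows. This is a hedged Python sketch, not part of the TIPP3 distribution: the helper names `prepare_refpkg` and `build_command` are hypothetical, and the remaining `run_tipp3.py` parameters are left for the caller to supply, since the record does not list them.

```python
import zipfile
from pathlib import Path

def prepare_refpkg(zip_path, dest_dir):
    """Step 1: unzip the reference package into dest_dir.

    Per the usage notes above, extraction is expected to produce a
    folder named "tipp3-refpkg"; that path is returned.
    """
    dest = Path(dest_dir)
    with zipfile.ZipFile(zip_path) as zf:
        zf.extractall(dest)
    return dest / "tipp3-refpkg"

def build_command(refpkg_dir, other_args=()):
    """Step 2: build the argument list for run_tipp3.py.

    Only the -r option comes from the usage notes; other_args is a
    placeholder for whatever further parameters the caller needs.
    """
    return ["run_tipp3.py", "-r", str(refpkg_dir), *other_args]
```

The resulting list can be passed to `subprocess.run` once the TIPP3 software is installed and `run_tipp3.py` is on the path. Building the command as a list rather than a single string avoids shell quoting problems when the reference package path contains spaces.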
keywords:
TIPP3; abundance profile; reference database; taxonomic identification
published:
2018-12-20
Dong, Xiaoru; Xie, Jingyi; Hoang, Linh
(2018)
File Name: WordsSelectedByInformationGain.csv
Data Preparation: Xiaoru Dong, Linh Hoang
Date of Preparation: 2018-12-12
Data Contributions: Jingyi Xie, Xiaoru Dong, Linh Hoang
Data Source: Cochrane systematic reviews published up to January 3, 2018 by 52 different Cochrane groups in 8 Cochrane group networks.
Associated Manuscript authors: Xiaoru Dong, Jingyi Xie, Linh Hoang, and Jodi Schneider.
Associated Manuscript, Working title: Machine classification of inclusion criteria from Cochrane systematic reviews.
Description: the file contains a list of 1655 informative words selected by applying the information gain feature selection strategy.
Information gain is a commonly used feature selection method that measures how many bits of information the presence of a word contributes to predicting the class; it can be computed with a standard formula [Jurafsky D, Martin JH. Speech and language processing. London: Pearson; 2014 Dec 30]. We ran information gain feature selection in Weka, a machine learning tool.
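As an illustration of the measure described above, the sketch below computes information gain for a single word using the standard textbook definition: the entropy of the class labels minus the entropy remaining after splitting on word presence. It is not the authors' code and not Weka's implementation, and the labels and toy data are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    total = len(labels)
    return -sum(
        (n / total) * math.log2(n / total) for n in Counter(labels).values()
    )

def information_gain(word_present, labels):
    """IG = H(class) - sum over v of P(v) * H(class | word presence = v)."""
    total = len(labels)
    conditional = 0.0
    for value in (True, False):
        subset = [c for present, c in zip(word_present, labels) if present == value]
        if subset:
            conditional += (len(subset) / total) * entropy(subset)
    return entropy(labels) - conditional

# Toy data (hypothetical): four sentences with balanced class labels.
labels = ["include", "include", "exclude", "exclude"]
# A word that appears only in "include" sentences removes all uncertainty.
perfect = information_gain([True, True, False, False], labels)
# A word spread evenly across both classes tells us nothing.
useless = information_gain([True, False, True, False], labels)
```

With balanced classes the first word yields one full bit and the second yields zero, which is the ranking behaviour a feature selector relies on when keeping the most informative words.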
Notes: To reproduce the data in this file, please get the project code published on GitHub at https://github.com/XiaoruDong/InclusionCriteria and run it following the instructions provided.
keywords:
Inclusion criteria; Randomized controlled trials; Machine learning; Systematic reviews
published:
2021-08-05
Lotspeich-Yadao, Michael
(2021)
This geodatabase serves two purposes: 1) to provide State of Illinois agencies with a fast resource for the preparation of maps and figures that require the use of shape or line files from federal agencies, the State of Illinois, or the City of Chicago, and 2) as a start for social scientists interested in exploring how geographic information systems (whether this is data visualization or geographically weighted regression) can bring new meaning to the interpretation of their data. All layer files included are relevant to the State of Illinois. Sources for this geodatabase include the U.S. Census Bureau, U.S. Geological Survey, City of Chicago, Chicago Public Schools, Chicago Transit Authority, Regional Transportation Authority, and Bureau of Transportation Statistics.
keywords:
State of Illinois; City of Chicago; Chicago Public Schools; GIS; Statistical tabulation areas; hydrography
published:
2022-10-04
One of the newest types of multimedia involves body-connected interfaces, usually termed haptics. Haptics may use stylus-based tactile interfaces, glove-based systems, handheld controllers, balance boards, or other custom-designed body-computer interfaces. How well do these interfaces help students learn Science, Technology, Engineering, and Mathematics (STEM)? We conducted an updated review of learning STEM with haptics, applying meta-analytic techniques to 21 published articles reporting on 53 effects for factual, inferential, procedural, and transfer STEM learning. This deposit includes the data extracted from those articles and comprises the raw data used in the meta-analytic analyses.
keywords:
Computer-based learning; haptic interfaces; meta-analysis
published:
2020-01-28
Miao, Guofang; Guan, Kaiyu
(2020)
This dataset includes two data files that provide the time series (Jul. - Sep. 2017) data of sun-induced chlorophyll fluorescence (SIF_760) collected under sunny conditions at two maize sites (one rainfed and the other irrigated) in Nebraska in 2017.
Data contain 392 SIF_760 records at the rainfed site and 707 records at the irrigated site. The timestamp uses local standard time. Data are available for the sunny conditions from 8 am to 5 pm (corresponding to 9 am to 6 pm local time) throughout the study period.
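The parenthetical above implies a one-hour offset between the recorded timestamps and local clock time. A minimal Python sketch of that conversion follows, assuming "local time" means daylight saving time during the study period; the function name and the sample date are hypothetical.

```python
from datetime import datetime, timedelta

# Assumption: local clock time runs one hour ahead of local standard
# time throughout Jul. - Sep. 2017, as the 8 am -> 9 am mapping suggests.
DST_OFFSET = timedelta(hours=1)

def standard_to_local(timestamp):
    """Convert a local-standard-time timestamp to local clock time."""
    return timestamp + DST_OFFSET

# Hypothetical record at the start and end of the daily window.
window_start = standard_to_local(datetime(2017, 7, 15, 8, 0))
window_end = standard_to_local(datetime(2017, 7, 15, 17, 0))
```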
keywords:
sun-induced chlorophyll fluorescence (SIF); maize; gross primary production(GPP); light use efficiency(LUE); SIF yield
suppressed by curator
published:
2024-09-28
Per the authors' request, the data files for this dataset are now suppressed. Please visit this new dataset for the complete and updated data files: Huang, Yijing; Fahad, Mahmood (2025): Data for Observation of a Magneto-chiral Instability in Photoexcited Tellurium. University of Illinois Urbana-Champaign. https://doi.org/10.13012/B2IDB-1409842_V1
====================
The data and code provided in this dataset can be used to generate key plots in the manuscript. It is divided into four subfolders (B parallel/perpendicular to the tellurium c axis and field/temperature dependence), each containing the raw data (saved in .mat format), the oscillator parameters obtained through linear prediction (saved in .mat format), and the plot-generating code (.m files). The code was written using MATLAB R2024a. To run the code, go to each folder and run the .m file in that folder, which generates two plots.