Illinois Data Bank Dataset Search Results
published:
2024-10-10
Mishra, Apratim; Lee, Haejin; Jeoung, Sullam; Torvik, Vetle; Diesner, Jana
(2024)
Diversity - PubMed dataset
Contact: Apratim Mishra (Oct, 2024)
This dataset presents article-level (pmid) and author-level (auid) diversity data for PubMed articles. The selection comprises 907 024 papers and 1 316 838 authors retrieved from Author-ity 2018 [1], and expands on the V1 dataset. The sample consists of articles from the top 40 journals in the dataset, limited to English-language articles of type "journal type" with 2-12 authors, published between 1991 and 2014. Files are gzip-compressed and tab-separated. V3 includes the correct author count for the included papers (pmids) and updated results with no NaNs.
################################################
File1: auids_plos_3.csv.gz (Important columns defined, 5 in total)
• AUID: a unique ID for each author
• Genni: gender prediction
• Ethnea: ethnicity prediction
#################################################
File2: pmids_plos_3.csv.gz (Important columns defined)
• pmid: unique paper
• auid: all unique auids (author-name unique identification)
• year: Year of paper publication
• no_authors: Author count
• journal: Journal name
• years: first year of publication for every author
• Country-temporal: Country of affiliation for every author
• h_index: Journal h-index
• TimeNovelty: Paper Time novelty [2]
• nih_funded: Binary variable indicating funding for any author
• prior_cit_mean: Mean of all authors’ prior citation rate
• Insti_impact: All unique institutions’ citation rate
• mesh_vals: Top MeSH values for every author of that paper
• relative_citation_ratio: RCR
The ‘Readme’ includes a description for all columns.
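Because the files are gzip-compressed and tab-separated, they can be read with the Python standard library alone. The following is a minimal sketch, not the authors' code: the UTF-8 encoding, the presence of a header row, and the sample values are assumptions made for illustration, and only column names listed above are used.

```python
import csv
import gzip
import io


def read_gzip_tsv(path_or_file, limit=None):
    """Read a gzip-compressed, tab-separated file into a list of dict rows.

    Assumes the first line is a header row and the text is UTF-8 encoded.
    """
    rows = []
    with gzip.open(path_or_file, mode="rt", encoding="utf-8", newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        for index, row in enumerate(reader):
            if limit is not None and index >= limit:
                break
            rows.append(row)
    return rows


def make_sample():
    """Build a tiny in-memory stand-in for auids_plos_3.csv.gz (hypothetical values)."""
    text = "AUID\tGenni\tEthnea\n1_1\tF\tENGLISH\n1_2\tM\tCHINESE\n"
    buffer = io.BytesIO()
    with gzip.open(buffer, mode="wt", encoding="utf-8", newline="") as handle:
        handle.write(text)
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    for record in read_gzip_tsv(make_sample()):
        print(record["AUID"], record["Genni"], record["Ethnea"])
```

For the real files, pass the file path (for example "auids_plos_3.csv.gz") in place of the in-memory sample, and use `limit` to preview a few rows before loading everything.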
[1] Torvik, Vetle; Smalheiser, Neil (2021): Author-ity 2018 - PubMed author name disambiguated dataset. University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-2273402_V1
[2] Mishra, Shubhanshu; Torvik, Vetle I. (2018): Conceptual novelty scores for PubMed articles. University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-5060298_V1
keywords:
Diversity; PubMed; Citation
published:
2025-07-21
Feng, Jennifer T.; van den Berg, Thya; Donders, Timme H.; Kong, Shu; Puthanveetil Satheesan, Sandeep; Punyasena, Surangi W.
(2025)
This dataset includes image stacks, annotated counts, and ground-truth masks from two high-resolution sediment cores extracted from Laguna Pallcacocha, in El Cajas National Park, Ecuadorian Andes by Moy et al. (2002) and Hagemans et al. (2021). The first core (PAL 1999, from Moy et al. (2002)) extends through the Holocene (11,600 cal. yr. BP - present). There are a total of 900 annotated image stacks and masks in the PAL 1999 domain. The second core (PAL IV, from Hagemans et al. (2021)) captures the 20th century. There are 2986 annotated image stacks and masks in the PAL IV domain.
Different microscopes and annotation tools were used to image and annotate each core, and there are corresponding differences in naming conventions and file formats. Thus, we organized our data separately for the PAL 1999 and the PAL IV domains. The three-letter codes used to label our pollen annotations are in the file: “Pollen_Identification_Codes.xlsx”.
Both domain directories contain:
• Image stacks organized by subdirectory
• Annotations within each image stack directory, containing specimen identifications using a three-letter code and coordinates defining bounding boxes or circles
• Ground-truth distance-transform masks for each image stack
The zip file "bestValModel_encoder.paramOnly.zip" is the trained pollen detection model produced from the images and annotations in this dataset.
Please cite this dataset as:
Feng, Jennifer T.; van den Berg, Thya; Donders, Timme H.; Kong, Shu; Puthanveetil Satheesan, Sandeep; Punyasena, Surangi W. (2025): Slide scans, annotated pollen counts, and trained pollen detection models for fossil pollen samples from Laguna Pallcacocha, El Cajas National Park, Ecuador. University of Illinois Urbana-Champaign. https://doi.org/10.13012/B2IDB-4207757_V1
Please also include citations of the original publications from which these data are taken:
Feng, Jennifer T., Sandeep Puthanveetil Satheesan, Shu Kong, Timme H. Donders, and Surangi W. Punyasena. “Addressing the ‘Open World’: Detecting and Segmenting Pollen on Palynological Slides with Deep Learning.” bioRxiv, January 1, 2025. https://doi.org/10.1101/2025.01.05.631390.
Feng, Jennifer T., Sandeep Puthanveetil Satheesan, Shu Kong, Timme H. Donders, and Surangi W. Punyasena. “Addressing the ‘Open World’: Detecting and Segmenting Pollen on Palynological Slides with Deep Learning.” Paleobiology, 2025 [in press].
Feng, J. T. (2023). Open-world deep learning applied to pollen detection (MS thesis, University of Illinois at Urbana-Champaign). https://hdl.handle.net/2142/120168
keywords:
continual learning; deep learning; domain gaps; open-world; palynology; pollen grain detection; taxonomic bias
published:
2025-01-30
Zhang, Yufan; Bhattarai, Rabin
(2025)
This is the research data for the manuscript "A Framework of Simulating Structural Sediment Perimeter Barriers using VFSMOD".
keywords:
sediment control
published:
2017-12-14
Objectives: This study follows up on previous work that began examining data deposited in an institutional repository. The work here extends the earlier study by answering the following research questions: (1) What is the file composition of datasets ingested into the University of Illinois at Urbana-Champaign campus repository? Are datasets more likely to be single-file or multiple-file items? (2) What is the usage data associated with these datasets? Which items are most popular?
Methods: The dataset records collected in this study were identified by filtering item types categorized as "data" or "dataset" using the advanced search function in IDEALS. Returned search results were collected in an Excel spreadsheet to include data such as the Handle identifier, date ingested, file formats, composition code, and the download count from the item's statistics report. The Handle identifier represents the dataset record's persistent identifier. Composition represents codes that categorize items as single or multiple file deposits. Date available represents the date the dataset record was published in the campus repository. Download statistics were collected via a website link for each dataset record and indicate the number of times the dataset record has been downloaded. Once the data was collected, it was used to evaluate datasets deposited into IDEALS.
Results: A total of 522 datasets were identified for analysis covering the period between January 2007 and August 2016. This study revealed two influxes, one during 2008-2009 and one in 2014. During the first, a large number of PDFs were deposited by the Illinois Department of Agriculture, whereas in 2014 Microsoft Excel files were deposited by the Rare Books and Manuscript Library. Single-file datasets clearly dominate the deposits in the campus repository. The total download count for all datasets was 139,663, and the average number of downloads per month per file across all datasets was 3.2.
Conclusion: Academic librarians, repository managers, and research data services staff can use the results presented here to anticipate the nature of research data that may be deposited within institutional repositories. With increased awareness, content recruitment, and improvements, IRs can provide a viable cyberinfrastructure for researchers to deposit data, but much can be learned from the data already deposited. Awareness of trends can help librarians facilitate discussions with researchers about research data deposits as well as better tailor their services to address short-term and long-term research needs.
keywords:
research data; research statistics; institutional repositories; academic libraries
published:
2022-06-01
Southey, Bruce; Rodriguez-Zas, Sandra L.
(2022)
This dataset contains information for the paper "Changes in neuropeptide prohormone genes among Cetartiodactyla livestock and wild species associated with evolution and domestication", Veterinary Sciences, MDPI. Protein sequences were predicted using GeneWise for 98 neuropeptide prohormone genes from publicly available genomes of 118 Cetartiodactyla species. All predictions (CetartiodactylaSequences2022.zip) were manually verified. Sequences were aligned within each prohormone using MAFFT (MDPImultalign2022.zip includes the multiple sequence alignment of all species available for each prohormone). Phylogenetic gene trees were constructed using PhyML and the species tree was constructed using ASTRAL (MDPItree2022.zip). The data is released under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0) license.
keywords:
prohormone; neuropeptide; Cetartiodactyla; phylogenetics; gene tree; species tree
published:
2018-04-19
Prepared by Vetle Torvik 2018-04-15
The dataset comes as a single tab-delimited ASCII encoded file, and should be about 717MB uncompressed.
• How was the dataset created?
First and last names of authors in the Author-ity 2009 dataset were processed through several tools to predict ethnicities and gender, including
Ethnea+Genni as described in:
Torvik VI, Agarwal S. Ethnea -- an instance-based ethnicity classifier based on geocoded author names in a large-scale bibliographic database. International Symposium on Science of Science March 22-23, 2016 - Library of Congress, Washington, DC, USA.
http://hdl.handle.net/2142/88927
Smith, B., Singh, M., & Torvik, V. (2013). A search engine approach to estimating temporal changes in gender orientation of first names. Proceedings Of The ACM/IEEE Joint Conference On Digital Libraries, (JCDL 2013 - Proceedings of the 13th ACM/IEEE-CS Joint Conference on Digital Libraries), 199-208. doi:10.1145/2467696.2467720
EthnicSeer: http://singularity.ist.psu.edu/ethnicity
Treeratpituk P, Giles CL (2012). Name-Ethnicity Classification and Ethnicity-Sensitive Name Matching. Proceedings of the Twenty-Sixth Conference on Artificial Intelligence (pp. 1141-1147). AAAI-12. Toronto, ON, Canada
SexMachine 0.1.1: https://pypi.org/project/SexMachine
First names, for some Author-ity records lacking them, were harvested from outside bibliographic databases.
• The code and back-end data are periodically updated and made available for query at the Torvik Research Group site (http://abel.ischool.illinois.edu)
• What is the format of the dataset?
The dataset contains 9,300,182 rows and 10 columns
1. auid: unique ID for Authors in Author-ity 2009 (PMID_authorposition)
2. name: full name used as input to EthnicSeer
3. EthnicSeer: predicted ethnicity; ARA, CHI, ENG, FRN, GER, IND, ITA, JAP, KOR, RUS, SPA, VIE, XXX
4. prop: decimal between 0 and 1 reflecting the confidence of the EthnicSeer prediction
5. lastname: used as input for Ethnea+Genni
6. firstname: used as input for Ethnea+Genni
7. Ethnea: predicted ethnicity; either one of 26 (AFRICAN, ARAB, BALTIC, CARIBBEAN, CHINESE, DUTCH, ENGLISH, FRENCH, GERMAN, GREEK, HISPANIC, HUNGARIAN, INDIAN, INDONESIAN, ISRAELI, ITALIAN, JAPANESE, KOREAN, MONGOLIAN, NORDIC, POLYNESIAN, ROMANIAN, SLAV, THAI, TURKISH, VIETNAMESE) or two ethnicities (e.g., SLAV-ENGLISH), or UNKNOWN (if no one or two dominant predictions), or TOOSHORT (if both first and last name are too short)
8. Genni: predicted gender; 'F', 'M', or '-'
9. SexMac: predicted gender based on third-party Python program (default settings except case_sensitive=False); female, mostly_female, andy, mostly_male, male
10. SSNgender: predicted gender based on US SSN data; 'F', 'M', or '-'
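The ten-column layout above can be parsed one line at a time. The sketch below is illustrative only: the lowercase field names, the sample row, and the assumption that every row has exactly ten tab-delimited fields with a numeric prop value are not taken from the dataset documentation.

```python
from typing import NamedTuple, Tuple


class AuthorRecord(NamedTuple):
    """One row of the file, with fields in the documented column order."""
    auid: str
    name: str
    ethnicseer: str
    prop: float
    lastname: str
    firstname: str
    ethnea: str
    genni: str
    sexmac: str
    ssngender: str


def parse_row(line: str) -> AuthorRecord:
    """Split one tab-delimited line into the ten documented columns."""
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) != len(AuthorRecord._fields):
        raise ValueError(f"expected 10 columns, got {len(fields)}")
    fields[3] = float(fields[3])  # prop: confidence between 0 and 1
    return AuthorRecord(*fields)


def split_auid(auid: str) -> Tuple[int, int]:
    """Split an auid of the form PMID_authorposition into its two parts."""
    pmid, position = auid.split("_")
    return int(pmid), int(position)


if __name__ == "__main__":
    # Hypothetical row, made up for illustration.
    sample = "12345_2\tJane Doe\tENG\t0.87\tDoe\tJane\tENGLISH\tF\tfemale\tF\n"
    record = parse_row(sample)
    print(record.ethnea, record.genni, split_auid(record.auid))
```

Since the file is about 717MB uncompressed, iterating over it line by line with `parse_row` avoids loading all 9,300,182 rows into memory at once.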
keywords:
Androgyny; Bibliometrics; Data mining; Search engine; Gender; Semantic orientation; Temporal prediction; Textual markers
published:
2025-09-15
Zhao, Yang; Kim, Jae Y.; Karan, Ratna; Jung, Je Hyeong; Pathak, Bhuvan; Williamson, Bruce; Kannan, Baskaran; Wang, Duoduo; Fan, Chunyang; Yu, Wenjin; Dong, Shujie; Srivastava, Vibha; Altpeter, Fredy
(2025)
Sugarcane, a tropical C4 grass in the genus Saccharum (Poaceae), accounts for nearly 80% of sugar produced worldwide and is also an important feedstock for biofuel production. Generating transgenic sugarcane with predictable and stable transgene expression is critical for crop improvement. In this study, we generated a highly expressed single-copy locus as a landing pad for transgene stacking. In NPTII ELISA analysis, transgenic sugarcane lines with stable integration of a single-copy nptII expression cassette flanked by insulators supported higher transgene expression and reduced line-to-line variation compared to single-copy events without insulators. Subsequently, the nptII selectable marker gene was efficiently excised from the sugarcane genome by the FLPe/FRT site-specific recombination system to create selectable-marker-free plants. This study provides valuable resources for future gene stacking using site-specific recombination or genome editing tools.
keywords:
Feedstock Production; Biomass Analytics; Genomics
published:
2018-09-04
Teper, Thomas; Lenkart, Joe; Thacker, Mara; Coskun, Esra
(2018)
This dataset contains records of five years of interlibrary loan (ILL) transactions for the University of Illinois at Urbana-Champaign Library. It covers the materials lent to other institutions during the period 2009-2013. It includes 169,890 transactions showing date; borrowing institution’s type, state and country; material format, imprint city, imprint country, imprint region, call number, language, local circulation count, ILL lending count, and OCLC holdings count.
The dataset was generated by combining monthly ILL reports. Circulation and ILL lending fields were added from the ILS records. Borrower region and imprint region fields were created based on the Title VI Region List. The OCLC holdings field was added from WorldCat records.
keywords:
Interlibrary Loan; ILL; Lending; OCLC Holding; Library; Area Studies; Collection; Circulation; Collaborative; Shared; Resource Sharing
published:
2017-03-08
Thapa, Sita; Schroeder, Nathan; Patel, Jayna; Reuter-Carlson, Ursula
(2017)
This dataset includes data on early embryogenesis and post-embryonic development of the soybean cyst nematode.
keywords:
Soybean cyst nematode; Embryogenesis; Post-embryonic development
published:
2019-10-05
Saurabh, Jha; Archit, Patke; Mike, Showerman; Jeremy, Enos; Greg, Bauer; Zbigniew, Kalbarczyk; Ravishankar, Iyer; William, Kramer
(2019)
This dataset contains collected and aggregated network information from NCSA’s Blue Waters system, which comprises 27,648 nodes connected via a Cray Gemini 3D torus (dimension 24x24x24) interconnect, from Jan/01/2017 to May/31/2017. Network performance counters for links are exposed via Cray's gpcdr kernel module (https://github.com/ovis-hpc/ovis/wiki/gpcdr-kernel-module). The Lightweight Distributed Metric Service (LDMS, https://github.com/ovis-hpc/ovis) is used to sample the performance counters at 60-second intervals. Please read the "README.md" file.
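The figures in this description can be cross-checked with a little arithmetic. The sketch below is not part of the dataset. It assumes two compute nodes per Gemini router (a property of the Cray XE/XK architecture that the description does not state), treats both end dates as inclusive, and assumes no gaps in sampling, so the sample count is an upper bound per counter.

```python
from datetime import date

TORUS_DIMS = (24, 24, 24)   # from the dataset description
NODES_PER_ROUTER = 2        # assumption: each Gemini router serves two nodes
SAMPLE_INTERVAL_S = 60      # from the dataset description


def router_count(dims):
    """Number of routers in a torus with the given dimensions."""
    total = 1
    for size in dims:
        total *= size
    return total


def torus_neighbors(coord, dims):
    """Six nearest neighbors of a router, with wraparound in each dimension."""
    neighbors = []
    for axis, size in enumerate(dims):
        for step in (-1, 1):
            shifted = list(coord)
            shifted[axis] = (coord[axis] + step) % size
            neighbors.append(tuple(shifted))
    return neighbors


def max_samples(start, end, interval_s):
    """Upper bound on samples per counter between two inclusive dates."""
    days = (end - start).days + 1
    return days * 24 * 60 * 60 // interval_s


if __name__ == "__main__":
    routers = router_count(TORUS_DIMS)
    print("routers:", routers)
    print("nodes:", routers * NODES_PER_ROUTER)
    print("neighbors of (0, 0, 0):", torus_neighbors((0, 0, 0), TORUS_DIMS))
    print("max samples:", max_samples(date(2017, 1, 1), date(2017, 5, 31), SAMPLE_INTERVAL_S))
```

Under these assumptions the router count times two reproduces the 27,648 nodes quoted above.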
Acknowledgement:
This dataset is collected as a part of the Blue Waters sustained-petascale computing project, which is supported by the National Science Foundation and the state of Illinois. Blue Waters is a joint effort of the University of Illinois at Urbana-Champaign and its National Center for Supercomputing Applications.
keywords:
HPC; Interconnect; Network; Congestion; Blue Waters; Dataset
published:
2016-12-18
Zhang, Qian; Li, Chunyan
(2016)
This dataset contains numerical simulation data from a computational study of cold front-related hydrodynamics in the Wax Lake delta. The numerical model used is ECOM-si.
keywords:
Wax Lake delta; Hydrodynamics; Cold front
published:
2017-03-07
Mickalide, Harry; Fraebel, David T.; Kuehn, Seppe
(2017)
This is a sample 5-minute video of an E. coli bacterium swimming in a microfluidic chamber, as well as some supplementary code files to be used with the Matlab code available at https://github.com/dfraebel/CellTracking
published:
2019-12-03
This is the data set associated with the manuscript titled "Extensive host-switching of avian feather lice following the Cretaceous-Paleogene mass extinction event." Included are the gene alignments used for phylogenetic analyses and the cophylogenetic input files.
keywords:
phylogenomics; cophylogenetics; feather lice; birds
published:
2017-12-12
Zhang, Qian; Li, Chunyan
(2017)
This dataset includes both meteorology and oceanography data collected at stations (CSI03, CSI06, and CSI09) near the Gulf of Mexico from the LSU WAVCIS (Waves-Current-Surge Information System) lab. The associated data analysis visualization is also saved in separate directories.
keywords:
WAVCIS; Gulf of Mexico; Meteorology; Oceanography
published:
2016-12-12
Zhang, Qian; Li, Chunyan
(2016)
This dataset contains field measurements of water depth at the Wax Lake delta, conducted in late 2012.
keywords:
Wax Lake delta; Bathymetry
published:
2017-09-06
Kozuch, Laura; Walker, Karen; Marquardt, William
(2017)
Spire angle data for sinistral whelks of the family Busyconidae. Data focuses on spire angles, with some data on total shell length. Locality information is present for all modern specimens.
keywords:
lightning whelk; sinistral whelk; spire angle; sourcing; Busycon; Cahokia; Spiro
published:
2016-12-12
Zhang, Qian; Li, Chunyan
(2016)
This dataset contains field measurements of currents at two stations (Big Hogs Bayou and Delta1) in the Wax Lake delta in November 2012 and February 2013.
keywords:
Wax Lake delta; Currents
published:
2020-08-31
Chen, Luoye; Khanna, Madhu; Debnath, Deepayan; Zhong, Jia; Ferin, Kelsie; VanLoocke, Andy
(2020)
This dataset contains BEPAM model code and input data to replicate the outcomes for "The Economic and Environmental Costs and Benefits of the Renewable Fuel Standard".
The dataset consists of:
(1) The replication code and data for the BEPAM model. The code file is named output.gms. (BEPAM-Social cost model-ERL.zip)
(2) Simulation results from the BEPAM model (BEPAM_Simulation_Results.csv)
* Item (1) is in GAMS format. Item (2) is in text format.
keywords:
Social Cost of Carbon; Social Cost of Nitrogen; Cost-Benefit Analysis; Indirect Land-Use Change
published:
2018-04-26
GBS data from soybean lines carrying introgressions from Glycine tomentella. This project is led by Dr. Randy Nelson, USDA scientist at the University of Illinois. Fastq files contain raw Illumina data. Txt files are keyfiles containing barcodes for each genetic entity.
published:
2018-12-20
Dong, Xiaoru; Xie, Jingyi; Hoang, Linh
(2018)
File Name: WordsSelectedByInformationGain.csv
Data Preparation: Xiaoru Dong, Linh Hoang
Date of Preparation: 2018-12-12
Data Contributions: Jingyi Xie, Xiaoru Dong, Linh Hoang
Data Source: Cochrane systematic reviews published up to January 3, 2018 by 52 different Cochrane groups in 8 Cochrane group networks.
Associated Manuscript authors: Xiaoru Dong, Jingyi Xie, Linh Hoang, and Jodi Schneider.
Associated Manuscript, Working title: Machine classification of inclusion criteria from Cochrane systematic reviews.
Description: The file contains a list of 1655 informative words selected by applying the information gain feature selection strategy.
Information gain is a commonly used feature selection method. It measures how many bits of information the presence of a word contributes to predicting the class, and is computed with a standard formula [Jurafsky D, Martin JH. Speech and language processing. London: Pearson; 2014 Dec 30]. We ran information gain feature selection in Weka, a machine learning tool.
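To make the idea concrete, information gain for a word can be written as the class entropy minus the expected class entropy after splitting documents by whether they contain the word. The sketch below follows that textbook definition. It is not the Weka implementation used for this dataset, and the documents, labels, and function names are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def information_gain(documents, labels, word):
    """Entropy of the labels minus the expected entropy after splitting
    the documents on presence or absence of the word."""
    with_word = [lab for doc, lab in zip(documents, labels) if word in doc]
    without_word = [lab for doc, lab in zip(documents, labels) if word not in doc]
    total = len(labels)
    conditional = (
        (len(with_word) / total) * entropy(with_word)
        + (len(without_word) / total) * entropy(without_word)
    )
    return entropy(labels) - conditional


if __name__ == "__main__":
    # Hypothetical toy corpus: each document is a set of tokens.
    docs = [
        {"randomized", "trial"},
        {"randomized", "adults"},
        {"cohort", "adults"},
        {"cohort", "children"},
    ]
    classes = ["include", "include", "exclude", "exclude"]
    for term in ("randomized", "adults"):
        print(term, information_gain(docs, classes, term))
```

In this toy corpus a word that separates the two classes perfectly receives the highest possible gain, while a word spread evenly across both classes receives none. Ranking all words by this score and keeping the top ones is the general shape of the selection step described above.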
Notes: To reproduce the data in this file, please get the project code published on GitHub at https://github.com/XiaoruDong/InclusionCriteria and run it following the instructions provided.
keywords:
Inclusion criteria; Randomized controlled trials; Machine learning; Systematic reviews
published:
2021-08-05
Lotspeich-Yadao, Michael
(2021)
This geodatabase serves two purposes: 1) to provide State of Illinois agencies with a fast resource for the preparation of maps and figures that require the use of shape or line files from federal agencies, the State of Illinois, or the City of Chicago, and 2) as a start for social scientists interested in exploring how geographic information systems (whether this is data visualization or geographically weighted regression) can bring new meaning to the interpretation of their data. All layer files included are relevant to the State of Illinois. Sources for this geodatabase include the U.S. Census Bureau, U.S. Geological Survey, City of Chicago, Chicago Public Schools, Chicago Transit Authority, Regional Transportation Authority, and Bureau of Transportation Statistics.
keywords:
State of Illinois; City of Chicago; Chicago Public Schools; GIS; Statistical tabulation areas; hydrography
published:
2020-01-28
Miao, Guofang; Guan, Kaiyu
(2020)
This dataset includes two data files that provide the time series (Jul. - Sep. 2017) data of sun-induced chlorophyll fluorescence (SIF_760) collected under sunny conditions at two maize sites (one rainfed and the other irrigated) in Nebraska in 2017.
Data contain 392 SIF_760 records at the rainfed site and 707 records at the irrigated site. The timestamp uses local standard time. Data are available for the sunny conditions from 8 am to 5 pm (corresponding to 9 am to 6 pm local time) throughout the study period.
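The relationship between the recorded timestamps and wall-clock time can be sketched as follows. This is a hedged illustration, not part of the dataset: it assumes daylight saving time was in effect for the whole study period, with the one-hour offset implied by the 8 am to 9 am correspondence above, and the function names are hypothetical.

```python
from datetime import datetime, time, timedelta

DST_SHIFT = timedelta(hours=1)  # assumption: constant offset for Jul. - Sep.


def standard_to_local(timestamp):
    """Convert a local-standard-time timestamp to local (daylight) time."""
    return timestamp + DST_SHIFT


def in_data_window(timestamp, start=time(8, 0), end=time(17, 0)):
    """True if a local-standard-time timestamp falls in the 8 am to 5 pm window."""
    return start <= timestamp.time() <= end


if __name__ == "__main__":
    recorded = datetime(2017, 7, 15, 8, 0)  # hypothetical record time
    print(recorded, "->", standard_to_local(recorded), in_data_window(recorded))
```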
keywords:
sun-induced chlorophyll fluorescence (SIF); maize; gross primary production (GPP); light use efficiency (LUE); SIF yield
suppressed by curator
published:
2024-09-28
Per the authors' request, the data files for this dataset are now suppressed. Please visit this new dataset for the complete and updated data files: Huang, Yijing; Fahad, Mahmood (2025): Data for Observation of a Magneto-chiral Instability in Photoexcited Tellurium. University of Illinois Urbana-Champaign. https://doi.org/10.13012/B2IDB-1409842_V1
====================
The data and code provided in this dataset can be used to generate key plots in the manuscript. It is divided into four subfolders (B parallel/perpendicular to the tellurium c axis and field/temperature dependence), each containing the raw data (saved in .mat format), the oscillator parameters obtained through linear prediction (saved in .mat format), and the plot-generating code (.m files). The code was written using MATLAB R2024a. To run the code, go to each folder and run the .m file in that folder, which generates two plots.