Dataset Search

Displaying 1 - 25 of 84 in total

Filters

Subject Area

Technology and Engineering (84)

Life Sciences (0)

Social Sciences (0)

Physical Sciences (0)

Uncategorized

Arts and Humanities (0)

Funder

U.S. National Science Foundation (NSF) (34)

Other (27)

U.S. Department of Energy (DOE) (9)

U.S. National Institutes of Health (NIH) (7)

U.S. National Aeronautics and Space Administration (NASA) (1)

U.S. Department of Agriculture (USDA) (0)

Illinois Department of Natural Resources (IDNR) (0)

U.S. Geological Survey (USGS) (0)

Illinois Department of Transportation (IDOT) (0)

U.S. Army (0)

Publication Year

2025 (14)

2024 (10)

2017 (9)

2022 (9)

2021 (8)

2023 (8)

2018 (7)

2019 (6)

2016 (3)

2020 (3)

2026 (2)

2009 (1)

2011 (1)

2012 (1)

2014 (1)

2015 (1)

License

CC0 (43)

CC BY (37)

custom (4)

Illinois Data Bank Dataset Search Results

Results

published: 2026-01-28

Data for Field-Effect Transistors from Artificial Charged Domain Walls in Stacked van der Waals Ferroelectric α-In2Se3

Nahid, Shahriar Muhammad; Dong, Haiyue; Nolan, Gillian; Nam, Sungwoo; Mason, Nadya; Huang, Pinshane; van der Zande, Arend (2026)

Room-temperature transfer curves; Benchmarking conductance; STEM images of charged domain walls; Temperature-dependent transfer curves; Scaling of conductance, hopping length, threshold voltage, trap density, and field-effect mobility with temperature; Magnetotransport data; Optical, AFM, and PFM image of different field-effect transistors; STEM images of contacts; Output and transfer curves of FETs; Additional STEM images of charged domain walls; Temperature scaling of subthreshold swing and threshold voltage difference; Comparison of maximum field-effect mobility for different structures

published: 2016-05-19

New York City Taxi Trip Data (2010-2013)

Donovan, Brian; Work, Dan (2016)

This dataset contains records of four years of taxi operations in New York City and includes 697,622,444 trips. Each trip records the pickup and drop-off dates, times, and coordinates, as well as the metered distance reported by the taximeter. The trip data also includes fields such as the taxi medallion number, fare amount, and tip amount. The dataset was obtained through a Freedom of Information Law request from the New York City Taxi and Limousine Commission. The files in this dataset are optimized for use with the ‘decompress.py’ script included in this dataset. This file has additional documentation and contact information that may be of help if you run into trouble accessing the content of the zip files.

keywords: taxi;transportation;New York City;GPS

published: 2026-01-20

Dataset for "CAMUS: Scalable Phylogenetic Network Estimation"

Willson, James; Warnow, Tandy (2026)

Dataset from "CAMUS: Scalable Phylogenetic Network Estimation." This dataset contains simulated phylogenetic networks, gene trees, and sequence data. - camus-dataset.tar.xz is the main archive containing all the simulated data. More details about the files and directories it contains can be found in README.md - scripts.zip contains various scripts used in the simulation study.

keywords: evolution; computational biology; bioinformatics; phylogenetics

published: 2023-03-16

Data For Well-Connected Communities In Real Networks.

Park, Minhyuk; Tabatabaee, Yasamin; Warnow, Tandy; Chacko, George (2023)

Curated networks and clustering output from the manuscript: Well-Connected Communities in Real-World Networks https://arxiv.org/abs/2303.02813

keywords: Community detection; clustering; open citations; scientometrics; bibliometrics

published: 2024-06-04

Data for Well-Connectedness and Community Detection

Park, Minhyuk; Tabatabaee, Yasamin; Warnow, Tandy; Chacko, George (2024)

This dataset contains files and relevant metadata for real-world and synthetic LFR networks used in the manuscript "Well-Connectedness and Community Detection (2024) Park et al. presently under review at PLOS Complex Systems. The manuscript is an extended version of Park, M. et al. (2024). Identifying Well-Connected Communities in Real-World and Synthetic Networks. In Complex Networks & Their Applications XII. COMPLEX NETWORKS 2023. Studies in Computational Intelligence, vol 1142. Springer, Cham. https://doi.org/10.1007/978-3-031-53499-7_1. “The Overview of Real-World Networks image provides high-level information about the seven real-world networks. TSVs of the seven real-world networks are provided as [network-name]_cleaned to indicate that duplicated edges and self-loops were removed, where column 1 is source and column 2 is target. LFR datasets are contained within the zipped file. Real-world networks are labeled _cleaned_ to indicate that duplicate edges and self loops were removed. #LFR datasets for the Connectivity Modifier (CM) paper ### File organization Each directory `[network-name]_[resolution-value]_lfr` includes the following files: * `network.dat`: LFR network edge-list * `community.dat`: LFR ground-truth communities * `time_seed.dat`: time seed used in the LFR software * `statistics.dat`: statistics generated by the LFR software * `cmd.stat`: command used to run the LFR software as well as time and memory usage information

published: 2021-03-06

RaDICaL: A Synchronized FMCW Radar, Depth, IMU and RGB Camera Data Dataset with Low-Level FMCW Radar Signals (ROS bag format)

Lim, Teck Yian; Markowitz, Spencer Abraham; Do, Minh (2021)

This dataset consists of raw ADC readings from a 3 transmitter 4 receiver 77GHz FMCW radar, together with synchronized RGB camera and depth (active stereo) measurements. The data is grouped into 4 distinct radar configurations: - "indoor" configuration with range <14m - "30m" with range <38m - "50m" with range <63m - "high_res" with doppler resolution of 0.043m/s # Related code https://github.com/moodoki/radical_sdk # Hardware Project Page https://publish.illinois.edu/radicaldata

keywords: radar; FMCW; sensor-fusion; autonomous driving; dataset; RGB-D; object detection; odometry

published: 2025-08-16

Data from development and evaluation of SASCA-s: Scalable Agent-based Simulator for Citation Analysis with simulation

Park, Minhyuk; Lamy, João AC; Rodrigues, Esther CC; Ferreira, Felipe Mariano; Vu-Le, The-Anh; Warnow, Tandy; Chacko, George (2025)

The data within consist of compressed output files in the form of edgelists (*.edgelist.gz) and nodelists (*.aux.parquet) from large citation network simulations using an agent-based model. The code and instructions are available at: <a href="https://github.com/illinois-or-research-analytics/SASCA">https://github.com/illinois-or-research-analytics/SASCA</a>. In addition, we provide a distribution of citation frequencies drawn from a random sample of PubMed journal articles (pooled_50k_pubmed_unique.csv) and a table of recencies- the frequency with which citations are made to the previous year, the year before that and so on (recency_probs_percent_stahl_filled.csv). A manuscript describing the SASCA-s simulator has been submitted for review and will be referenced in a future version of this data repository if it is accepted. The prefixes sj and er refer to the real world and Erdos-Renyi random graph respectively that were used to initiate simulations. These 'seed' networks are available from the Github site referenced above.

keywords: benchmark networks; agent-based models; simulation; citation

published: 2025-08-07

EC-SBM Benchmark Networks

Vu-Le, The-Anh; Chacko, George; Warnow, Tandy (2025)

Dataset generated using the technique described in "EC-SBM synthetic network generator". This contains multiple synthetic networks with ground-truth community structure, which can be used to evaluate community detection methods. Note: * networks.zip contains the synthetic networks

keywords: network science; synthetic networks; community detection; tsv

published: 2024-02-16

Parsed Open Citations and PubMed Data

Mohasel Arjomandi, Hossein; Korobskiy, Dmitriy; Chacko, George (2024)

This dataset contains five files. (i) open_citations_jan2024_pub_ids.csv.gz, open_citations_jan2024_iid_el.csv.gz, open_citations_jan2024_el.csv.gz, and open_citation_jan2024_pubs.csv.gz represent a conversion of Open Citations to an edge list using integer ids assigned by us. The integer ids can be mapped to omids, pmids, and dois using the open_citation_jan2024_pubs.csv and open_citations_jan2024_pub_ids.scv files. The network consists of 121,052,490 nodes and 1,962,840,983 edges. Code for generating these data can be found https://github.com/chackoge/ERNIE_Plus/tree/master/OpenCitations. (ii) The fifth file, baseline2024.csv.gz, provides information about the metadata of PubMed papers. A 2024 version of PubMed was downloaded using Entrez and parsed into a table restricted to records that contain a pmid, a doi, and has a title and an abstract. A value of 1 in columns indicates that the information exists in metadata and a zero indicates otherwise. Code for generating this data: https://github.com/illinois-or-research-analytics/pubmed_etl. If you use these data or code in your work, please cite https://doi.org/10.13012/B2IDB-5216575_V1.

keywords: PubMed

published: 2017-02-28

Smartphone recorded driving sensor data: Leesburg, VA to Indianapolis, IN

Freedman, Ryan (2017)

Leesburg, VA to Indianapolis, Indiana: Sampling Rate: 0.1 Hz Total Travel Time: 31100007 ms or 518 minutes or 8.6 hours Distance Traveled: 570 miles via I-70 Number of Data Points: 3112 Device used: Samsung Galaxy S4 Date Recorded: 2017-01-15 Parameters Recorded: * ACCELEROMETER X (m/s²) * ACCELEROMETER Y (m/s²) * ACCELEROMETER Z (m/s²) * GRAVITY X (m/s²) * GRAVITY Y (m/s²) * GRAVITY Z (m/s²) * LINEAR ACCELERATION X (m/s²) * LINEAR ACCELERATION Y (m/s²) * LINEAR ACCELERATION Z (m/s²) * GYROSCOPE X (rad/s) * GYROSCOPE Y (rad/s) * GYROSCOPE Z (rad/s) * LIGHT (lux) * MAGNETIC FIELD X (microT) * MAGNETIC FIELD Y (microT) * MAGNETIC FIELD Z (microT) * ORIENTATION Z (azimuth °) * ORIENTATION X (pitch °) * ORIENTATION Y (roll °) * PROXIMITY (i) * ATMOSPHERIC PRESSURE (hPa) * Relative Humidity (%) * Temperature (F) * SOUND LEVEL (dB) * LOCATION Latitude * LOCATION Longitude * LOCATION Altitude (m) * LOCATION Altitude-google (m) * LOCATION Altitude-atmospheric pressure (m) * LOCATION Speed (kph) * LOCATION Accuracy (m) * LOCATION ORIENTATION (°) * Satellites in range * GPS NMEA * Time since start in ms * Current time in YYYY-MO-DD HH-MI-SS_SSS format Quality Notes: There are some things to note about the quality of this data set that you may want to consider while doing preprocessing. This dataset was taken continuously but had multiple stops to refuel (without the data recording ceasing). This can be removed by parsing out all data that has a speed of 0. The mount for this dataset was fairly stable (as can be seen by the consistent orientation angle throughout the dataset). It was mounted tightly between two seats in the back of the vehicle. Unfortunately, the frequency for this dataset was set fairly low at one per ten seconds.

keywords: smartphone; sensor; driving; accelerometer; gyroscope; magnetometer; gps; nmea; barometer; satellite; temperature; humidity

published: 2017-05-01

Smartphone recorded driving sensor data: Indianapolis International Airport to Urbana, IL

Freedman, Ryan (2017)

Indianapolis Int'l Airport to Urbana: Sampling Rate: 2 Hz Total Travel Time: 5901534 ms or 98.4 minutes Number of Data Points: 11805 Distance Traveled: 124 miles via I-74 Device used: Samsung Galaxy S6 Date Recorded: 2016-11-27 Parameters Recorded: * ACCELEROMETER X (m/s²) * ACCELEROMETER Y (m/s²) * ACCELEROMETER Z (m/s²) * GRAVITY X (m/s²) * GRAVITY Y (m/s²) * GRAVITY Z (m/s²) * LINEAR ACCELERATION X (m/s²) * LINEAR ACCELERATION Y (m/s²) * LINEAR ACCELERATION Z (m/s²) * GYROSCOPE X (rad/s) * GYROSCOPE Y (rad/s) * GYROSCOPE Z (rad/s) * LIGHT (lux) * MAGNETIC FIELD X (microT) * MAGNETIC FIELD Y (microT) * MAGNETIC FIELD Z (microT) * ORIENTATION Z (azimuth °) * ORIENTATION X (pitch °) * ORIENTATION Y (roll °) * PROXIMITY (i) * ATMOSPHERIC PRESSURE (hPa) * SOUND LEVEL (dB) * LOCATION Latitude * LOCATION Longitude * LOCATION Altitude (m) * LOCATION Altitude-google (m) * LOCATION Altitude-atmospheric pressure (m) * LOCATION Speed (kph) * LOCATION Accuracy (m) * LOCATION ORIENTATION (°) * Satellites in range * GPS NMEA * Time since start in ms * Current time in YYYY-MO-DD HH-MI-SS_SSS format Quality Notes: There are some things to note about the quality of this data set that you may want to consider while doing preprocessing. This dataset was taken continuously as a single trip, no stop was made for gas along the way making this a very long continuous dataset. It starts in the parking lot of the Indianapolis International Airport and continues directly towards a gas station on Lincoln Avenue in Urbana, IL. There are a couple parts of the trip where the phones orientation had to be changed because my navigation cut out. These times are easy to account for based on Orientation X/Y/Z change. I would also advise cutting out the first couple hundred points or the points leading up to highway speed. The phone was mounted in the cupholder in the front seat of the car.

keywords: smartphone; sensor; driving; accelerometer; gyroscope; magnetometer; gps; nmea; barometer; satellite

published: 2025-12-29

Data for Mitigation of ion-temperature/composition ambiguity in the inversion of F-region ion-line spectra measured at Arecibo using coded long pulses

Wu, Yulun; Kudeki, Erhan (2025)

Arecibo ISR CLP ion-line spectra obtained from RI receiver with 500 kHz bandwidth and 120-640 km altitude range, experiment dates September 23-26, 2016. Used for Mitigation of ion-temperature/composition ambiguity in the inversion of F-region ion-line spectra measured at Arecibo using coded long pulses.

keywords: Remote sensing; Incoherent scatter radar; Arecibo Observatory

published: 2024-11-13

Nanoscale Stacking Fault Engineering and Mapping in Spinel Oxides for Reversible Multivalent Ion Insertion

Tang, Zhichu; Chen, Wenxiang; Yin, Kaijun; Busch, Robert; Hou, Hanyu; Lin, Oliver; Lyu, Zhiheng; Zhang, Cheng; Yang, Hong; Zuo, Jian-Min ; Chen, Qian (2024)

These datasets are for the four-dimensional scanning transmission electron microscopy (4D-STEM) and electron energy loss spectroscopy (EELS) experiments for cathode nanoparticles at different states. The raw 4D-STEM experiment datasets were collected by TEM image & analysis software (FEI) and were saved as SER files. The raw 4D-STEM datasets of SER files can be opened and viewed in MATLAB using our analysis software package of imToolBox available at https://github.com/flysteven/imToolBox. The raw EELS datasets were collected by DigitalMicrograph software and were saved as DM4 files. The raw EELS datasets can be opened and viewed in DigitalMicrograph software or using our analysis codes available at https://github.com/chenlabUIUC/OrientedPhaseDomain. All the datasets are from the work "Nanoscale Stacking Fault Engineering and Mapping in Spinel Oxides for Reversible Multivalent Ion Insertion" (2024). The 4D-STEM experiment data include four example datasets for cathode nanoparticles collected at pristine and discharged states. Each dataset contains a stack of diffraction patterns collected at different probe positions scanned across the cathode nanoparticle. 1. Pristine untreated nanoparticle: "Pristine U-NP.ser" 2. Pristine 200ºC heated nanoparticle: "Pristine H200-NP.ser" 3. Untreated nanoparticle after first discharge in Zn-ion batteries: "Discharged U-NP.ser" 4. 200ºC heated nanoparticle after first discharge in Zn-ion batteries: "Discharged H200-NP.ser" The EELS experiment data includes six example datasets for cathode nanoparticles collected at different states (in "EELS datasets.zip") as described below. Each EELS dataset contains the zero-loss and core-loss EELS spectra collected at different probe positions scanned across the cathode nanoparticle. 1. Pristine untreated nanoparticle: "Pristine U-NP EELS.zip" 2. Pristine 200ºC heated nanoparticle: "Prisitne H200-NP EELS.zip" 3. Untreated nanoparticle after first discharge in Zn-ion batteries: "Discharged U-NP EELS.zip" 4. Untreated nanoparticle after first charge in Zn-ion batteries: "Charged U-NP EELS.zip" 5. 200ºC heated nanoparticle after first discharge in Zn-ion batteries: "Discharged H200-NP EELS.zip" 6. 200ºC heated nanoparticle after first charge in Zn-ion batteries: "Charged H200-NP EELS.zip" The details of the software package and codes that can be used to analyze the 4D-STEM datasets and EELS datasets are available at: https://github.com/chenlabUIUC/OrientedPhaseDomain. Once our paper is formally published, we will update the relationship of these datasets with our paper.

keywords: 4D-STEM; EELS; defects; strain; cathode; nanoparticle; energy storage

published: 2025-12-23

study of liquid suction cup detachment mechanism

Aly, Abdallah; A. Saif, M. Taher (2025)

The uploaded data is part of the paper titled: Self-Modifying Percolation Governs Detachment in Soft Suction Wet Adhesion, which shows the detachment mechanism of liquid suction-based adhesion.

published: 2022-04-29

Biological and Simulated datasets for testing the SCAMPP framework for phylogenetic placement methods

Wedell, Eleanor; Warnow, Tandy (2022)

Thank you for using these datasets! These files contain trees and reference alignments, as well as the selected query sequences for testing phylogenetic placement methods against and within the SCAMPP framework. There are four datasets from three different sources, each containing their source alignment and "true" tree, any estimated trees that may have been generated, and any re-estimated branch lengths that were created to be used with their requisite phylogenetic placement method. Three biological datasets (16S.B.ALL, PEWO/LTP_s128_SSU, and PEWO/green85) and one simulated dataset (nt78) is contained. See README.txt in each file for more information.

keywords: Phylogenetic Placement; Phylogenetics; Maximum Likelihood; pplacer; EPA-ng

published: 2022-11-11

Data for Chemical Short-Range Ordering in a CrCoNi Medium-Entropy Alloy

Hsiao, Haw-Wen; Zuo, Jian-Min (2022)

This dataset is for characterizing chemical short-range-ordering in CrCoNi medium entropy alloys. It has three sub-folders: 1. code, 2. sample WQ, 3. sample HT. The software needed to run the files is Gatan Microscopy Suite® (GMS). Please follow the instruction on this page to install the DM3 GMS: <a href="https://www.gatan.com/installation-instructions#Step1">https://www.gatan.com/installation-instructions#Step1</a> 1. Code folder contains three DM scripts to be installed in Gatan DigitalMicrograph software to analyze scanning electron nanobeam diffraction (SEND) dataset: Cepstrum.s: need [EF-SEND_sampleWQ_cropped_aligned.dm3] in Sample WQ and the average image from [EF-SEND_sampleWQ_cropped_aligned.dm3]. Same for Sample HT folder. log_BraggRemoval.s: same as above. Patterson.s: Need refined diffuse patterns in Sample HT folder. 2. Sample WQ and 3. Sample HT folders both contain the SEND data (.ser) and the binned SEND data (.dm3) as well as our calculated strain maps as the strain measurement reference. The Sample WQ folder additionally has atomic resolution STEM images; the Sample HT folder additionally has three refined diffuse patterns as references for diffraction data processing. * Only .ser file is needed to perform the strain measurement using imToolBox as listed in the manuscript. .emi file contains the meta data of the microscope, which can be opened together with .ser file using FEI TIA software.

keywords: Medium entropy alloy; CrCoNi; chemical short-range-ordering; CSRO; TEM

published: 2025-12-01

Data for "Modeling the Global Citation Network using the Scalable Agent-based Simulator for Citation Analysis with Recency-emphasized Sampling (SASCA-ReS)"

Park, Minhyuk; Yi, Haotian; Warnow, Tandy; Chacko, George (2025)

This dataset principally consists of four synthetic citation networks that were generated during the preparation of the manuscript Park M, Yi H, Warnow T, and Chacko G (2025). Modeling the Global Citation Network using the Scalable Agent-based Simulator for Citation Analysis with Recency-emphasized Sampling (SASCA-ReS). A preprint is available on Zenodo (below) and the manuscript has been submitted to the MetaRoR platform for review and feedback. @misc{park_2025_17789558, author = {Park, Minhyuk and Yi, Haotian and Warnow, Tandy and Chacko, George}, title = {Modeling the Global Citation Network using the Scalable Agent-based Simulator for Citation Analysis with Recency-emphasized Sampling (SASCA- ReS) }, month = dec, year = 2025, publisher = {Zenodo}, doi = {10.5281/zenodo.17789558}, url = {https://doi.org/10.5281/zenodo.17789558}, } The networks are roughly 14, 76, 161, and 218 million nodes each. Both nodelists with attributes and edge lists are provided as gzipped parquet files along with the configuration file that was passed to the SASCA-ReS software, which can be accessed at: <a href="https://github.com/illinois-or-research-analytics/SASCA-ReS">https://github.com/illinois-or-research-analytics/SASCA-ReS</a>. A copy of the configuration file that was used to generate the network with SASCA-ReS is also provided. For example: abm14_config.ini; abm14_edgelist.parquet.gz; and abm14_nodelist.parquet.gz. The column headers in the edgelists and nodelists and the fields in the configuration file are explained in the Github repository for SASCA-ReS. In addition, we provide sj_reccount, a table of real world citation frequencies that is an input to the SASCA-Res software. The first column (diff) of sj_reccount lists the difference between the publication year of a citing document and the publication year of a cited document. The second column (count) reports the frequency of such citations across the dataset of 77879427 observations, which is derived from the biomedical literature. Finally, we share data, composite_maverick_disruption.csv , from the mavericks (unconventional citing strategies) experiment reported in the Park et al. (2025) manuscript available at <a href="https://zenodo.org/records/17772113">https://zenodo.org/records/17772113</a>. The columns in the composite_maverick_disruption.csv file are: node_id -> of agents in the various simulations n_i, n_j, n_k -> terms used to compute disruption per "Wu, L., Wang, D. & Evans, J.A. Large teams develop and small teams disrupt science and technology. Nature 566, 378–382 (2019). <a href="https://doi.org/10.1038/s41586-019-0941-9">https://doi.org/10.1038/s41586-019-0941-9"</a> disruption -> the disruption metric of Wu, Wang, and Evans (2019) type -> maverick type (maximizer, randomnik, or minimizer) year -> virtual year in the simulation when the maverick was created alpha -> the alpha parameter of the control agent pa_weight -> the preferential attachment weight of the control agent phenotype fit_peak_value -> the fitness value assigned to the control agent in_degree -> the count of citations accumulated by the maverick or control agent at the end of the simulation out_degree -> the count of references made by the maverick tag -> a label for the experiment, e.g. od249_f1 indicates that the mavericks in this experiment made 249 citations and were assigned a fitness value of 1.

keywords: synthetic networks; agent based models; SASCA-ReS; citation networks

published: 2023-02-07

Data from: DISCO+QR: Rooting Species Trees in the Presence of GDL and ILS

Willson, James; Tabatabaee, Yasamin; Liu, Baqiao; Warnow, Tandy (2023)

Data sets from "DISCO+QR: Rooting Species Trees in the Presence of GDL and ILS." It contains trees and sequences simulated with gene duplication and loss under a variety of different conditions. Note: - trees.tar.gz contains the simulated gene-family trees used in our experiments (both true trees from SimPhy as well as trees estimated from alignments). - alignments.tar.gz contains simulated sequence data used for estimating the gene-family trees

keywords: evolution; computational biology; bioinformatics; phylogenetics

published: 2017-12-20

Data from: Investigating the linear dependence of direct and indirect radiative forcing on emission of carbonaceous aerosols in a global climate model

Chen, Yanju; Bond, Tami (2017)

The dataset contains processed model fields used to generate data, figures and tables in the Journal of Geophysical Research article "Investigating the linear dependence of direct and indirect radiative forcing on emission of carbonaceous aerosols in a global climate model." The processed data are monthly averaged cloud properties (CCN, CDNC and LWP) and forcing variables (DRF and IRF) at original CAM5 spatial resolution (1.9° by 2.5°). Raw model output fields from CAM5 simulations are available through NERSC upon request. Please find more detailed information in the ReadMe file.

keywords: carbonaceous aerosols; radiative forcing; emission; linearity

published: 2017-12-01

Dataset: Landscape evolution models using the stream power incision model show unrealistic behavior when m/n equals 0.5

Kwang, Jeffrey (2017)

This dataset contains all the numerical results (digital elevation models) that are presented in the paper "Landscape evolution models using the stream power incision model show unrealistic behavior when m/n equals 0.5." The paper can be found at: http://www.earth-surf-dynam-discuss.net/esurf-2017-15/ The paper has been accepted, but the most up to date version may not be available at the link above. If so, please contact Jeffrey Kwang at jeffskwang@gmail.com to obtain the most up to date manuscript.

keywords: landscape evolution models; digital elelvation model

published: 2022-08-08

Datasets for EMMA: A New Method for Computing Multiple Sequence Alignments given a Constraint Subset Alignment

Shen, Chengze; Liu, Baqiao; Williams, Kelly P.; Warnow, Tandy (2022)

This upload contains all datasets used in Experiment 2 of the EMMA paper (appeared in WABI 2023): Shen, Chengze, Baqiao Liu, Kelly P. Williams, and Tandy Warnow. "EMMA: A New Method for Computing Multiple Sequence Alignments given a Constraint Subset Alignment". The zip file has the following structure (presented as an example): salma_paper_datasets/ |_README.md |_10aa/ |_crw/ |_homfam/ |_aat/ | |_... |_... |_het/ |_5000M2-het/ | |_... |_5000M3-het/ ... |_rec_res/ Generally, the structure can be viewed as: [category]/[dataset]/[replicate]/[alignment files] # Categories: 1. 10aa: There are 10 small biological protein datasets within the `10aa` directory, each with just one replicate. 2. crw: There are 5 selected CRW datasets, namely 5S.3, 5S.E, 5S.T, 16S.3, and 16S.T, each with one replicate. These are the cleaned version from Shen et. al. 2022 (MAGUS+eHMM). 3. homfam: There are the 10 largest Homfam datasets, each with one replicate. 4. het: There are three newly simulated nucleotide datasets from this study, 5000M2-het, 5000M3-het, and 5000M4-het, each with 10 replicates. 5. rec\_res: It contains the Rec and Res datasets. Detailed dataset generation can be found in the supplementary materials of the paper. # Alignment files There are at most 6 `.fasta` files in each sub-directory: 1. `all.unaln.fasta`: All unaligned sequences. 2. `all.aln.fasta`: Reference alignments of all sequences. If not all sequences have reference alignments, only the sequences that have will be included. 3. `all-queries.unaln.fasta`: All unaligned query sequences. Query sequences are sequences that do not have lengths within 25% of the median length (i.e., not full-length sequences). 4. `all-queries.aln.fasta`: Reference alignments of query sequences. If not all queries have reference alignments, only the sequences that have will be included. 5. `backbone.unaln.fasta`: All unaligned backbone sequences. Backbone sequences are sequences that have lengths within 25% of the median length (i.e., full-length sequences). 6. `backbone.aln.fasta`: Reference alignments of backbone sequences. If not all backbone sequences have reference alignments, only the sequences that have will be included. >If all sequences are full-length sequences, then `all-queries.unaln.fasta` will be missing. >If fewer than two query sequences have reference alignments, then `all-queries.aln.fasta` will be missing. >If fewer than two backbone sequences have reference alignments, then `backbone.aln.fasta` will be missing. # Additional file(s) 1. `350378genomes.txt`: the file contains all 350,378 bacterial and archaeal genome names that were used by Prodigal (Hyatt et. al. 2010) to search for protein sequences.

keywords: SALMA;MAFFT;alignment;eHMM;sequence length heterogeneity

published: 2025-09-25

Data for "Using Stochastic Block Models for Community Detection"

Vu-Le, The-Anh; Park, Minhyuk; Chen, Ian; Warnow, Tandy (2025)

Dataset for "Using Stochastic Block Models for Community Detection". This contains synthetic networks with ground-truth community structure generated using synthetic network generators (specifically, ABCD+o) based on real-world networks and computed clusterings on these real-world networks. Note: * networks.zip contains the synthetic networks

published: 2022-09-29

3DIFICE: A Synthetic Dataset for Training Computer Vision Algorithms to Recognize Earthquake Damage to Reinforced Concrete Structures

Levine, Nathaniel (2022)

3DIFICE: 3-dimensional Damage Imposed on Frame structures for Investigating Computer vision-based Evaluation methods This dataset contains 1,396 synthetic images and label maps with various types of earthquake damage imposed on reinforced concrete frame structures. Damage includes: cracking, spalling, exposed transverse rebar, and exposed longitudinal rebar. Each image has an associated label map that can be used for training machine learning algorithms to recognize the various types of damage.

keywords: computer vision; earthquake engineering; structural health monitoring; civil engineering; structural engineering;

published: 2020-07-01

SaltProc output for TAP MSR and MSBR online reprocessing depletion simulations

Rykhlevskii, Andrei; Huff, Kathryn D. (2020)

keywords: molten salt; fuel cycle; reprocessing; refueling

published: 2019-09-01

Data for: Probabilistic global maps of crop-specific areas from 1961 to 2014

Jackson, Nicole; Konar, Megan; Debaere, Peter; Estes, Lyndon (2019)

Agriculture has substantial socioeconomic and environmental impacts that vary between crops. However, information on how the spatial distribution of specific crops has changed over time across the globe is relatively sparse. We introduce the Probabilistic Cropland Allocation Model (PCAM), a novel algorithm to estimate where specific crops have likely been grown over time. Specifically, PCAM downscales annual and national-scale data on the crop-specific area harvested of 17 major crops to a global 0.5-degree grid from 1961-2014. The resulting database presented here provides annual global gridded likelihood estimates of crop-specific areas. Both mean and standard deviations of grid cell fractions are available for each of the 17 crops. Each netCDF file contains an individual year of data with an additional variable ("crs") that defines the coordinate reference system used. Our results provide new insights into the likely changes in the spatial distribution of major crops over the past half-century. For additional information, please see the related paper by Jackson et al. (2019) in Environmental Research Letters (https://doi.org/10.1088/1748-9326/ab3b93).

keywords: global; gridded; probabilistic allocation; crop suitability; agricultural geography; time series