Illinois Data Bank Dataset Search Results
Results
published:
2024-02-16
Mohasel Arjomandi, Hossein; Korobskiy, Dmitriy; Chacko, George
(2024)
This dataset contains five files. (i) open_citations_jan2024_pub_ids.csv.gz, open_citations_jan2024_iid_el.csv.gz, open_citations_jan2024_el.csv.gz, and open_citation_jan2024_pubs.csv.gz represent a conversion of Open Citations to an edge list using integer ids assigned by us. The integer ids can be mapped to omids, pmids, and dois using the open_citation_jan2024_pubs.csv and open_citations_jan2024_pub_ids.csv files. The network consists of 121,052,490 nodes and 1,962,840,983 edges. Code for generating these data can be found at https://github.com/chackoge/ERNIE_Plus/tree/master/OpenCitations.
(ii) The fifth file, baseline2024.csv.gz, provides information about the metadata of PubMed papers. A 2024 version of PubMed was downloaded using Entrez and parsed into a table restricted to records that contain a pmid and a doi and have a title and an abstract. A value of 1 in a column indicates that the information exists in the metadata, and a 0 indicates otherwise. Code for generating this data: https://github.com/illinois-or-research-analytics/pubmed_etl. If you use these data or code in your work, please cite https://doi.org/10.13012/B2IDB-5216575_V1.
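The mapping from integer ids back to external identifiers described above can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes the mapping file is a gzipped CSV with a header row, and the column names are passed in by the caller because the actual headers are not stated here.

```python
import csv
import gzip


def load_id_map(path, id_column, target_column):
    """Build a dict from integer id to an external identifier (e.g. a DOI).

    `id_column` and `target_column` are header names supplied by the caller;
    the real column names in the dataset files are not assumed here.
    """
    mapping = {}
    with gzip.open(path, mode="rt", newline="") as handle:
        for row in csv.DictReader(handle):
            value = row[target_column]
            # Skip records that lack the requested identifier.
            if value:
                mapping[int(row[id_column])] = value
    return mapping
```

A call might look like `load_id_map("open_citations_jan2024_pub_ids.csv.gz", "integer_id", "doi")`, where both column names are hypothetical. For a file with over 100 million rows, a plain dict may need several gigabytes of memory, so a database or a columnar reader may be a better fit in practice.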
keywords:
PubMed
published:
2023-03-16
Park, Minhyuk; Tabatabaee, Yasamin; Warnow, Tandy; Chacko, George
(2023)
Curated networks and clustering output from the manuscript: Well-Connected Communities in Real-World Networks https://arxiv.org/abs/2303.02813
keywords:
Community detection; clustering; open citations; scientometrics; bibliometrics
published:
2024-06-04
Park, Minhyuk; Tabatabaee, Yasamin; Warnow, Tandy; Chacko, George
(2024)
This dataset contains files and relevant metadata for real-world and synthetic LFR networks used in the manuscript "Well-Connectedness and Community Detection" (2024) by Park et al., presently under review at PLOS Complex Systems. The manuscript is an extended version of Park, M. et al. (2024). Identifying Well-Connected Communities in Real-World and Synthetic Networks. In Complex Networks & Their Applications XII. COMPLEX NETWORKS 2023. Studies in Computational Intelligence, vol 1142. Springer, Cham. https://doi.org/10.1007/978-3-031-53499-7_1. The Overview of Real-World Networks image provides high-level information about the seven real-world networks.
TSVs of the seven real-world networks are provided as [network-name]_cleaned to indicate that duplicated edges and self-loops were removed, where column 1 is source and column 2 is target.
LFR datasets are contained within the zipped file. Real-world networks are labeled _cleaned_ to indicate that duplicate edges and self loops were removed.
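The cleaning step mentioned above (removing duplicate edges and self-loops) can be sketched as follows. This is an illustrative sketch, not the procedure the authors used; in particular, whether reversed pairs count as duplicates depends on whether a network is treated as undirected, which is exposed here as the assumed `directed` flag.

```python
def clean_edges(edges, directed=False):
    """Return edges with self-loops and duplicate edges removed.

    When `directed` is False, (u, v) and (v, u) are treated as the same edge.
    The first occurrence of each edge is kept, in input order.
    """
    seen = set()
    cleaned = []
    for source, target in edges:
        # Drop self-loops.
        if source == target:
            continue
        key = (source, target) if directed else tuple(sorted((source, target)))
        # Drop edges that were already seen.
        if key in seen:
            continue
        seen.add(key)
        cleaned.append((source, target))
    return cleaned
```

For example, `clean_edges([(1, 2), (2, 1), (3, 3)])` keeps only `(1, 2)` under the undirected interpretation.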
# LFR datasets for the Connectivity Modifier (CM) paper
### File organization
Each directory `[network-name]_[resolution-value]_lfr` includes the following files:
* `network.dat`: LFR network edge-list
* `community.dat`: LFR ground-truth communities
* `time_seed.dat`: time seed used in the LFR software
* `statistics.dat`: statistics generated by the LFR software
* `cmd.stat`: command used to run the LFR software as well as time and memory usage information
published:
2026-03-20
Wu, Yulun; Kudeki, Erhan
(2026)
Arecibo ISR CLP/ULP/LULP ion-line spectra obtained from USRP receiver with 500 kHz bandwidth and 120-1400 km altitude range, experiment dates September 23-26, 2016. Used for Joint inversions of coded and uncoded long pulse F-region ISR returns measured at Arecibo.
keywords:
Remote sensing; Incoherent scatter radar; Arecibo Observatory
published:
2026-02-09
Park, Minhyuk; Chacko, George
(2026)
This dataset consists of a directed network in edge list format where nodes correspond to articles in the scientific literature and edges represent citations. The network was constructed by seed set expansion (two rounds of citing and cited papers) of the article (seed node) reporting the discovery of PI 3-Kinase activity: "Malcolm Whitman, C Peter Downes, Marilyn Keeler, Tracy Keller, and Lewis Cantley. (1988) Type I phosphatidylinositol kinase makes a novel inositol phospholipid, phosphatidylinositol-3-phosphate. Nature, 332(6165):644–646." The edge list comprises 17,970,340 nodes and 127,255,020 edges.
The dataset was obtained from the Dimensions database via a two-level expansion of the seed node (article). The first expansion included four groups of nodes: the seed node; all publications cited by the seed node; all publications citing the seed node; and all publications cited by publications citing the seed node. The second expansion included all nodes that either cited or were cited by a node in the first expansion set.
Node ids used were converted from the proprietary identifiers in Dimensions using a zero-based sequence of integer_ids [0: (n-1)]. Access to the original identifiers requires a license from Digital Science.
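The relabeling to a zero-based integer sequence described above can be sketched as follows. This is a minimal illustration under the assumption that ids are assigned in order of first appearance; the ordering the authors actually used is not stated, and the identifiers in the usage note are made up.

```python
def assign_integer_ids(edges):
    """Relabel arbitrary node identifiers with integers 0..n-1.

    Ids are assigned in order of first appearance (an assumption made for
    this sketch). Returns the id map and the relabeled edge list.
    """
    id_map = {}
    relabeled = []
    for source, target in edges:
        for node in (source, target):
            if node not in id_map:
                # The next unused integer is the current size of the map.
                id_map[node] = len(id_map)
        relabeled.append((id_map[source], id_map[target]))
    return id_map, relabeled
```

With the hypothetical identifiers `[("pub.A", "pub.B"), ("pub.C", "pub.A")]`, the relabeled edges are `[(0, 1), (2, 0)]`. Discarding `id_map` after relabeling is what makes the original identifiers unrecoverable from the published edge list.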
published:
2024-07-29
Caetano Machado Lopes, Lorran; Chacko, George
(2024)
This dataset consists of a citation graph. It was constructed by downloading and parsing the Works section of the Open Alex catalog of the global research system. Open Alex (see citation below) contains detailed information about scholarly research, including articles, authors, journals, institutions, and their relationships. The data were downloaded on 2024-07-15.
The dataset comprises two compressed (.xz) files.
1) filename: openalexID_integer_id_hasDOI.parquet.xz. The tabular data within contains three columns: openalex_id, integer_id, and hasDOI. Each row represents a record with the following data types:
• openalex_id: A unique identifier from the Open Alex catalog.
• integer_id: An integer representing the new identifier (assigned by the authors)
• hasDOI: An integer (0 or 1) indicating whether the record has a DOI (0 for no, 1 for yes).
2) filename: citation_table.tsv.xz
This edgelist of citations has two columns (no header) of integer values that represent citing and cited integer_id, respectively.
Summary Features
• Total Nodes (Documents): 256,997,006
• Total Edges (citations): 2,148,871,058
• Documents with DOIs: 163,495,446
• Edges between documents with DOIs: 1,936,722,541 [corrected to 2,148,788,148 edges Nov 13, 2025]
• Count of unique nodes in edgelist: 111,453,719 [updated Nov 13, 2025]
Note: Nov 13, 2025. An improved curation process will be applied to a future version of this dataset
Note: Nov 13, 2025. The code used to generate these files can be found here: https://github.com/illinois-or-research-analytics/lorran_openalex/
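Summary figures such as the edge count and the count of unique nodes in the edgelist can be recomputed from citation_table.tsv.xz with a single streaming pass. The sketch below is an assumption-laden illustration rather than the authors' code: it relies only on the stated layout (two whitespace-separated integer columns, no header) and holds every distinct node id in a Python set, which for roughly 111 million nodes requires several gigabytes of memory.

```python
import lzma


def summarize_edgelist(path):
    """Return (unique_node_count, edge_count) for an .xz-compressed edge list.

    Each line is expected to hold two integers: citing id, then cited id.
    Lines that do not have exactly two fields are skipped.
    """
    nodes = set()
    edge_count = 0
    with lzma.open(path, mode="rt") as handle:
        for line in handle:
            fields = line.split()
            if len(fields) != 2:
                continue
            citing, cited = int(fields[0]), int(fields[1])
            nodes.add(citing)
            nodes.add(cited)
            edge_count += 1
    return len(nodes), edge_count
```

The same loop can be extended to count edges between documents with DOIs by checking both endpoints against the set of integer ids whose hasDOI value is 1.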
keywords:
citation networks; Open Alex
published:
2024-12-11
MMAudio pretrained models. These models can be used in the open-sourced codebase https://github.com/hkchengrex/MMAudio
<b>Note:</b> mmaudio_large_44k_v2.pth and Readme.txt were added in V2. The other four files are unchanged.
published:
2026-02-19
Gurumoorthi, Akshay; Peters, Baron
(2026)
The dataset contains a Jupyter notebook intended for anyone who wants to apply the Empirical Bayes method described in the paper titled 'Data for Improving individual committor estimates and data efficiency in reaction coordinate tests with the Empirical Bayes method' to committor data with a simple and lucid Python script.
published:
2026-02-11
Hanley, David; Lee, Jongwon; Choi, Su Yeon; Bretl, Timothy
(2026)
If you use this dataset, please cite both the dataset and the associated data paper (bibtex is below).
@ARTICLE{11386847,
author={Hanley, David and Lee, Jongwon and Choi, Su Yeon and Bretl, Timothy},
journal={IEEE Transactions on Instrumentation and Measurement},
title={The MagPIE2 Dataset for Mapping, Localization, and Simultaneous Localization and Mapping Using Magnetic Fields},
year={2026},
volume={},
number={},
pages={1-1},
keywords={Magnetometers;Magnetic field measurement;Magnetic fields;Pedestrians;Location awareness;Buildings;Simultaneous localization and mapping;Measurement errors;Hardware;Calibration;Localization;mapping;SLAM;dataset;benchmark;magnetometer;magnetic field},
doi={10.1109/TIM.2026.3662919}}
We present a dataset for the evaluation of magnetic field-based robotic and pedestrian localization, mapping, and SLAM methods. This dataset contains magnetometer and inertial measurement unit data collected inside three buildings by both a pedestrian and a ground robot. Data were collected at different heights simultaneously, both with and without changes in the placement of objects that may affect magnetometer measurements. In total, approximately 689 square meters of floor space was covered by this dataset.
This dataset is archivally stored. We provide a GitHub site which is meant to serve as a forum to post issues with the dataset, share code using the dataset, and to resolve problems: <a href="https://github.com/hanley6/MagPIE2Forum">https://github.com/hanley6/MagPIE2Forum</a>
Note that while the dataset is meant to be permanently stored, this forum is not meant to guarantee perennial support and its existence will be dependent on the policies of GitHub.
<b>How is the dataset organized?</b> The data is divided into the following parts at a high level and more detailed information can be found in the Readme:
1. The walking portion of the dataset: CSL_WLK.zip, DCL_WLK.zip, Talbot_WLK.zip, and WLK_Misc.zip.
2. The robot portion of the dataset: Robot_Dataset.zip.
3. Motor interference tests: Motor_Interference_Test.zip.
4. Ground truth evaluation: Ground_Truth_Evaluation.zip.
5. Quick start results: Quick_Start_Results.zip.
<b>How is data recorded and stored?</b> Data is generally collected in the form of ROS bag files. Each ROS bag has Intel Realsense camera images, magnetometer readings, IMU readings, timestamps, and more as applicable for each file in the dataset. Each bag file has an associated metadata file written as a YAML file. This contains general information about each bag file including the start and stop time, who collected the bag file (during the pedestrian portion of the dataset), and the approximate location where data was collected. In several cases, additional comma-separated (csv) files were included either as a convenient supplement to ROS bag files (e.g., csv files of magnetometer calibration data) or because they serve as human-readable quick start results.
<b>How does one set up and run files on the dataset?</b> The files are stored in ROS bags and are, therefore, meant to be run using the Robot Operating System. Information regarding how to use the Robot Operating System as well as installation instructions are available at: <a href="https://ros.org/">https://ros.org/</a>
keywords:
Localization; mapping; SLAM; dataset; benchmark; magnetometer; magnetic field
published:
2025-12-23
Aly, Abdallah; A. Saif, M. Taher
(2025)
The uploaded data is part of the paper titled: Self-Modifying Percolation Governs Detachment in Soft Suction Wet Adhesion, which shows the detachment mechanism of liquid suction-based adhesion.
published:
2026-01-28
Nahid, Shahriar Muhammad; Dong, Haiyue; Nolan, Gillian; Nam, Sungwoo; Mason, Nadya; Huang, Pinshane; van der Zande, Arend
(2026)
Room-temperature transfer curves; Benchmarking conductance; STEM images of charged domain walls; Temperature-dependent transfer curves; Scaling of conductance, hopping length, threshold voltage, trap density, and field-effect mobility with temperature; Magnetotransport data; Optical, AFM, and PFM image of different field-effect transistors; STEM images of contacts; Output and transfer curves of FETs; Additional STEM images of charged domain walls; Temperature scaling of subthreshold swing and threshold voltage difference; Comparison of maximum field-effect mobility for different structures
published:
2016-05-19
Donovan, Brian; Work, Dan
(2016)
This dataset contains records of four years of taxi operations in New York City and includes 697,622,444 trips. Each trip records the pickup and drop-off dates, times, and coordinates, as well as the metered distance reported by the taximeter. The trip data also includes fields such as the taxi medallion number, fare amount, and tip amount. The dataset was obtained through a Freedom of Information Law request from the New York City Taxi and Limousine Commission.
The files in this dataset are optimized for use with the ‘decompress.py’ script included in this dataset. This file has additional documentation and contact information that may be of help if you run into trouble accessing the content of the zip files.
keywords:
taxi;transportation;New York City;GPS
published:
2026-01-20
Willson, James; Warnow, Tandy
(2026)
Dataset from "CAMUS: Scalable Phylogenetic Network Estimation." This dataset contains simulated phylogenetic networks, gene trees, and sequence data.
- camus-dataset.tar.xz is the main archive containing all the simulated data. More details about the files and directories it contains can be found in README.md
- scripts.zip contains various scripts used in the simulation study.
keywords:
evolution; computational biology; bioinformatics; phylogenetics
published:
2021-03-06
Lim, Teck Yian; Markowitz, Spencer Abraham; Do, Minh
(2021)
This dataset consists of raw ADC readings from a 3 transmitter 4 receiver 77GHz FMCW radar, together with synchronized RGB camera and depth (active stereo) measurements.
The data is grouped into 4 distinct radar configurations:
- "indoor" configuration with range <14m
- "30m" with range <38m
- "50m" with range <63m
- "high_res" with doppler resolution of 0.043m/s
# Related code
https://github.com/moodoki/radical_sdk
# Hardware Project Page
https://publish.illinois.edu/radicaldata
keywords:
radar; FMCW; sensor-fusion; autonomous driving; dataset; RGB-D; object detection; odometry
published:
2025-08-16
Park, Minhyuk; Lamy, João AC; Rodrigues, Esther CC; Ferreira, Felipe Mariano; Vu-Le, The-Anh; Warnow, Tandy; Chacko, George
(2025)
The data within consist of compressed output files in the form of edgelists (*.edgelist.gz) and nodelists (*.aux.parquet) from large citation network simulations using an agent-based model. The code and instructions are available at: <a href="https://github.com/illinois-or-research-analytics/SASCA">https://github.com/illinois-or-research-analytics/SASCA</a>. In addition, we provide a distribution of citation frequencies drawn from a random sample of PubMed journal articles (pooled_50k_pubmed_unique.csv) and a table of recencies: the frequency with which citations are made to the previous year, the year before that, and so on (recency_probs_percent_stahl_filled.csv). A manuscript describing the SASCA-s simulator has been submitted for review and will be referenced in a future version of this data repository if it is accepted. The prefixes sj and er refer to the real-world network and the Erdos-Renyi random graph, respectively, that were used to initiate simulations. These 'seed' networks are available from the GitHub site referenced above.
keywords:
benchmark networks; agent-based models; simulation; citation
published:
2025-08-07
Vu-Le, The-Anh; Chacko, George; Warnow, Tandy
(2025)
Dataset generated using the technique described in "EC-SBM synthetic network generator". This contains multiple synthetic networks with ground-truth community structure, which can be used to evaluate community detection methods.
Note:
* networks.zip contains the synthetic networks
keywords:
network science; synthetic networks; community detection; tsv
published:
2017-02-28
Leesburg, VA to Indianapolis, Indiana:
Sampling Rate: 0.1 Hz
Total Travel Time: 31100007 ms or 518 minutes or 8.6 hours
Distance Traveled: 570 miles via I-70
Number of Data Points: 3112
Device used: Samsung Galaxy S4
Date Recorded: 2017-01-15
Parameters Recorded:
* ACCELEROMETER X (m/s²)
* ACCELEROMETER Y (m/s²)
* ACCELEROMETER Z (m/s²)
* GRAVITY X (m/s²)
* GRAVITY Y (m/s²)
* GRAVITY Z (m/s²)
* LINEAR ACCELERATION X (m/s²)
* LINEAR ACCELERATION Y (m/s²)
* LINEAR ACCELERATION Z (m/s²)
* GYROSCOPE X (rad/s)
* GYROSCOPE Y (rad/s)
* GYROSCOPE Z (rad/s)
* LIGHT (lux)
* MAGNETIC FIELD X (microT)
* MAGNETIC FIELD Y (microT)
* MAGNETIC FIELD Z (microT)
* ORIENTATION Z (azimuth °)
* ORIENTATION X (pitch °)
* ORIENTATION Y (roll °)
* PROXIMITY (i)
* ATMOSPHERIC PRESSURE (hPa)
* Relative Humidity (%)
* Temperature (F)
* SOUND LEVEL (dB)
* LOCATION Latitude
* LOCATION Longitude
* LOCATION Altitude (m)
* LOCATION Altitude-google (m)
* LOCATION Altitude-atmospheric pressure (m)
* LOCATION Speed (kph)
* LOCATION Accuracy (m)
* LOCATION ORIENTATION (°)
* Satellites in range
* GPS NMEA
* Time since start in ms
* Current time in YYYY-MO-DD HH-MI-SS_SSS format
Quality Notes:
There are some things to note about the quality of this data set that you may want to consider while doing preprocessing. This dataset was taken continuously but had multiple stops to refuel (without the data recording ceasing). These stops can be removed by filtering out all data points that have a speed of 0. The mount for this dataset was fairly stable (as can be seen by the consistent orientation angle throughout the dataset). It was mounted tightly between two seats in the back of the vehicle. Unfortunately, the sampling frequency for this dataset was set fairly low, at one sample per ten seconds.
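The stop-removal step suggested in the quality notes can be sketched as a simple filter. This is an illustrative sketch only: it assumes each record has been read into a dict (for example with `csv.DictReader`) and that the speed column is keyed by the label from the parameter list, which may not match the header in the actual file exactly.

```python
def drop_stationary(rows, speed_key="LOCATION Speed (kph)"):
    """Keep only records whose recorded speed is greater than zero.

    `rows` is a list of dicts; `speed_key` is an assumed column label taken
    from the parameter list and may need adjusting to the real header.
    """
    return [row for row in rows if float(row[speed_key]) > 0.0]
```

Note that filtering on speed also removes brief halts in traffic, not only refueling stops, so a minimum stop duration may be worth adding if that distinction matters.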
keywords:
smartphone; sensor; driving; accelerometer; gyroscope; magnetometer; gps; nmea; barometer; satellite; temperature; humidity
published:
2017-05-01
Indianapolis Int'l Airport to Urbana:
Sampling Rate: 2 Hz
Total Travel Time: 5901534 ms or 98.4 minutes
Number of Data Points: 11805
Distance Traveled: 124 miles via I-74
Device used: Samsung Galaxy S6
Date Recorded: 2016-11-27
Parameters Recorded:
* ACCELEROMETER X (m/s²)
* ACCELEROMETER Y (m/s²)
* ACCELEROMETER Z (m/s²)
* GRAVITY X (m/s²)
* GRAVITY Y (m/s²)
* GRAVITY Z (m/s²)
* LINEAR ACCELERATION X (m/s²)
* LINEAR ACCELERATION Y (m/s²)
* LINEAR ACCELERATION Z (m/s²)
* GYROSCOPE X (rad/s)
* GYROSCOPE Y (rad/s)
* GYROSCOPE Z (rad/s)
* LIGHT (lux)
* MAGNETIC FIELD X (microT)
* MAGNETIC FIELD Y (microT)
* MAGNETIC FIELD Z (microT)
* ORIENTATION Z (azimuth °)
* ORIENTATION X (pitch °)
* ORIENTATION Y (roll °)
* PROXIMITY (i)
* ATMOSPHERIC PRESSURE (hPa)
* SOUND LEVEL (dB)
* LOCATION Latitude
* LOCATION Longitude
* LOCATION Altitude (m)
* LOCATION Altitude-google (m)
* LOCATION Altitude-atmospheric pressure (m)
* LOCATION Speed (kph)
* LOCATION Accuracy (m)
* LOCATION ORIENTATION (°)
* Satellites in range
* GPS NMEA
* Time since start in ms
* Current time in YYYY-MO-DD HH-MI-SS_SSS format
Quality Notes:
There are some things to note about the quality of this data set that you may want to consider while doing preprocessing. This dataset was taken continuously as a single trip; no stop was made for gas along the way, making this a very long continuous dataset. It starts in the parking lot of the Indianapolis International Airport and continues directly towards a gas station on Lincoln Avenue in Urbana, IL. There are a couple of parts of the trip where the phone's orientation had to be changed because my navigation cut out. These times are easy to account for based on Orientation X/Y/Z change. I would also advise cutting out the first couple hundred points or the points leading up to highway speed. The phone was mounted in the cupholder in the front seat of the car.
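The advice to cut the points leading up to highway speed can be sketched as below. This is a hedged illustration: the 88 kph threshold (about 55 mph) is an arbitrary choice made for this example, and the speed column label is assumed from the parameter list rather than confirmed against the file header.

```python
def trim_until_speed(rows, threshold_kph=88.0, speed_key="LOCATION Speed (kph)"):
    """Drop leading records until the speed first reaches `threshold_kph`.

    `rows` is a list of dicts in time order. Returns an empty list if the
    threshold is never reached. Both defaults are assumptions of this sketch.
    """
    for index, row in enumerate(rows):
        if float(row[speed_key]) >= threshold_kph:
            return rows[index:]
    return []
```

Only the leading segment is removed; later slow stretches are kept so the remainder of the trip stays continuous.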
keywords:
smartphone; sensor; driving; accelerometer; gyroscope; magnetometer; gps; nmea; barometer; satellite
published:
2025-12-29
Wu, Yulun; Kudeki, Erhan
(2025)
Arecibo ISR CLP ion-line spectra obtained from RI receiver with 500 kHz bandwidth and 120-640 km altitude range, experiment dates September 23-26, 2016. Used for Mitigation of ion-temperature/composition ambiguity in the inversion of F-region ion-line spectra measured at Arecibo using coded long pulses.
keywords:
Remote sensing; Incoherent scatter radar; Arecibo Observatory
published:
2024-11-13
Tang, Zhichu; Chen, Wenxiang; Yin, Kaijun; Busch, Robert; Hou, Hanyu; Lin, Oliver; Lyu, Zhiheng; Zhang, Cheng; Yang, Hong; Zuo, Jian-Min ; Chen, Qian
(2024)
These datasets are for the four-dimensional scanning transmission electron microscopy (4D-STEM) and electron energy loss spectroscopy (EELS) experiments for cathode nanoparticles at different states. The raw 4D-STEM experiment datasets were collected by TEM image & analysis software (FEI) and were saved as SER files. The raw 4D-STEM datasets of SER files can be opened and viewed in MATLAB using our analysis software package of imToolBox available at https://github.com/flysteven/imToolBox. The raw EELS datasets were collected by DigitalMicrograph software and were saved as DM4 files. The raw EELS datasets can be opened and viewed in DigitalMicrograph software or using our analysis codes available at https://github.com/chenlabUIUC/OrientedPhaseDomain. All the datasets are from the work "Nanoscale Stacking Fault Engineering and Mapping in Spinel Oxides for Reversible Multivalent Ion Insertion" (2024).
The 4D-STEM experiment data include four example datasets for cathode nanoparticles collected at pristine and discharged states. Each dataset contains a stack of diffraction patterns collected at different probe positions scanned across the cathode nanoparticle.
1. Pristine untreated nanoparticle: "Pristine U-NP.ser"
2. Pristine 200ºC heated nanoparticle: "Pristine H200-NP.ser"
3. Untreated nanoparticle after first discharge in Zn-ion batteries: "Discharged U-NP.ser"
4. 200ºC heated nanoparticle after first discharge in Zn-ion batteries: "Discharged H200-NP.ser"
The EELS experiment data includes six example datasets for cathode nanoparticles collected at different states (in "EELS datasets.zip") as described below. Each EELS dataset contains the zero-loss and core-loss EELS spectra collected at different probe positions scanned across the cathode nanoparticle.
1. Pristine untreated nanoparticle: "Pristine U-NP EELS.zip"
2. Pristine 200ºC heated nanoparticle: "Prisitne H200-NP EELS.zip"
3. Untreated nanoparticle after first discharge in Zn-ion batteries: "Discharged U-NP EELS.zip"
4. Untreated nanoparticle after first charge in Zn-ion batteries: "Charged U-NP EELS.zip"
5. 200ºC heated nanoparticle after first discharge in Zn-ion batteries: "Discharged H200-NP EELS.zip"
6. 200ºC heated nanoparticle after first charge in Zn-ion batteries: "Charged H200-NP EELS.zip"
The details of the software package and codes that can be used to analyze the 4D-STEM datasets and EELS datasets are available at: https://github.com/chenlabUIUC/OrientedPhaseDomain. Once our paper is formally published, we will update the relationship of these datasets with our paper.
keywords:
4D-STEM; EELS; defects; strain; cathode; nanoparticle; energy storage
published:
2022-04-29
Wedell, Eleanor; Warnow, Tandy
(2022)
Thank you for using these datasets!
These files contain trees and reference alignments, as well as the selected query sequences for testing phylogenetic placement methods against and within the SCAMPP framework.
There are four datasets from three different sources, each containing their source alignment and "true" tree, any estimated trees that may have been generated, and any re-estimated branch lengths that were created to be used with their requisite phylogenetic placement method.
Three biological datasets (16S.B.ALL, PEWO/LTP_s128_SSU, and PEWO/green85) and one simulated dataset (nt78) are included. See README.txt in each file for more information.
keywords:
Phylogenetic Placement; Phylogenetics; Maximum Likelihood; pplacer; EPA-ng
published:
2022-11-11
Hsiao, Haw-Wen; Zuo, Jian-Min
(2022)
This dataset is for characterizing chemical short-range-ordering in CrCoNi medium entropy alloys. It has three sub-folders: 1. code, 2. sample WQ, 3. sample HT. The software needed to run the files is Gatan Microscopy Suite® (GMS). Please follow the instructions on this page to install the DM3 GMS: <a href="https://www.gatan.com/installation-instructions#Step1">https://www.gatan.com/installation-instructions#Step1</a>
1. Code folder contains three DM scripts to be installed in Gatan DigitalMicrograph software to analyze scanning electron nanobeam diffraction (SEND) dataset:
Cepstrum.s: need [EF-SEND_sampleWQ_cropped_aligned.dm3] in Sample WQ and the average image from [EF-SEND_sampleWQ_cropped_aligned.dm3]. Same for Sample HT folder.
log_BraggRemoval.s: same as above.
Patterson.s: Need refined diffuse patterns in Sample HT folder.
2. Sample WQ and 3. Sample HT folders both contain the SEND data (.ser) and the binned SEND data (.dm3) as well as our calculated strain maps as the strain measurement reference. The Sample WQ folder additionally has atomic resolution STEM images; the Sample HT folder additionally has three refined diffuse patterns as references for diffraction data processing.
* Only .ser file is needed to perform the strain measurement using imToolBox as listed in the manuscript. .emi file contains the meta data of the microscope, which can be opened together with .ser file using FEI TIA software.
keywords:
Medium entropy alloy; CrCoNi; chemical short-range-ordering; CSRO; TEM
published:
2025-12-01
Park, Minhyuk; Yi, Haotian; Warnow, Tandy; Chacko, George
(2025)
This dataset principally consists of four synthetic citation networks that were generated during the preparation of the manuscript Park M, Yi H, Warnow T, and Chacko G (2025). Modeling the Global Citation Network using the Scalable Agent-based Simulator for Citation Analysis with Recency-emphasized Sampling (SASCA-ReS). A preprint is available on Zenodo (below) and the manuscript has been submitted to the MetaRoR platform for review and feedback.
@misc{park_2025_17789558,
author = {Park, Minhyuk and
Yi, Haotian and
Warnow, Tandy and
Chacko, George},
title = {Modeling the Global Citation Network using the
Scalable Agent-based Simulator for Citation
Analysis with Recency-emphasized Sampling (SASCA-
ReS)
},
month = dec,
year = 2025,
publisher = {Zenodo},
doi = {10.5281/zenodo.17789558},
url = {https://doi.org/10.5281/zenodo.17789558},
}
The networks contain roughly 14, 76, 161, and 218 million nodes. For each network, the nodelist with attributes and the edge list are provided as gzipped parquet files, along with a copy of the configuration file that was passed to the SASCA-ReS software, which can be accessed at: <a href="https://github.com/illinois-or-research-analytics/SASCA-ReS">https://github.com/illinois-or-research-analytics/SASCA-ReS</a>. For example: abm14_config.ini; abm14_edgelist.parquet.gz; and abm14_nodelist.parquet.gz. The column headers in the edgelists and nodelists and the fields in the configuration file are explained in the GitHub repository for SASCA-ReS.
In addition, we provide sj_reccount, a table of real-world citation frequencies that is an input to the SASCA-ReS software. The first column (diff) of sj_reccount lists the difference between the publication year of a citing document and the publication year of a cited document. The second column (count) reports the frequency of such citations across the dataset of 77879427 observations, which is derived from the biomedical literature. Finally, we share data, composite_maverick_disruption.csv, from the mavericks (unconventional citing strategies) experiment reported in the Park et al. (2025) manuscript available at <a href="https://zenodo.org/records/17772113">https://zenodo.org/records/17772113</a>. The columns in the composite_maverick_disruption.csv file are:
node_id -> of agents in the various simulations
n_i, n_j, n_k -> terms used to compute disruption per "Wu, L., Wang, D. & Evans, J.A. Large teams develop and small teams disrupt science and technology. Nature 566, 378–382 (2019). <a href="https://doi.org/10.1038/s41586-019-0941-9">https://doi.org/10.1038/s41586-019-0941-9</a>"
disruption -> the disruption metric of Wu, Wang, and Evans (2019)
type -> maverick type (maximizer, randomnik, or minimizer)
year -> virtual year in the simulation when the maverick was created
alpha -> the alpha parameter of the control agent
pa_weight -> the preferential attachment weight of the control agent phenotype
fit_peak_value -> the fitness value assigned to the control agent
in_degree -> the count of citations accumulated by the maverick or control agent at the end of the simulation
out_degree -> the count of references made by the maverick
tag -> a label for the experiment, e.g. od249_f1 indicates that the mavericks in this experiment made 249 citations and were assigned a fitness value of 1.
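For readers who want to recompute the disruption column from the n_i, n_j, and n_k terms listed above, a minimal sketch follows. It uses the disruption index as defined by Wu, Wang, and Evans (2019), D = (n_i - n_j) / (n_i + n_j + n_k); whether the values in the CSV were computed with exactly this handling of edge cases (such as a zero denominator) is an assumption.

```python
def disruption(n_i, n_j, n_k):
    """Disruption index D = (n_i - n_j) / (n_i + n_j + n_k).

    n_i: later works citing the focal paper but none of its references
    n_j: later works citing both the focal paper and its references
    n_k: later works citing the references but not the focal paper
    Raises ValueError when all three counts are zero (an assumed convention).
    """
    total = n_i + n_j + n_k
    if total == 0:
        raise ValueError("disruption is undefined when n_i + n_j + n_k == 0")
    return (n_i - n_j) / total
```

The value ranges from -1 (fully consolidating) to 1 (fully disrupting); for instance, counts of 3, 1, and 4 give 0.25.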
keywords:
synthetic networks; agent based models; SASCA-ReS; citation networks
published:
2023-02-07
Willson, James; Tabatabaee, Yasamin; Liu, Baqiao; Warnow, Tandy
(2023)
Data sets from "DISCO+QR: Rooting Species Trees in the Presence of GDL and ILS." It contains trees and sequences simulated with gene duplication and loss under a variety of different conditions.
Note:
- trees.tar.gz contains the simulated gene-family trees used in our experiments (both true trees from SimPhy as well as trees estimated from alignments).
- alignments.tar.gz contains simulated sequence data used for estimating the gene-family trees
keywords:
evolution; computational biology; bioinformatics; phylogenetics
published:
2017-12-20
Chen, Yanju; Bond, Tami
(2017)
The dataset contains processed model fields used to generate data, figures and tables in the Journal of Geophysical Research article "Investigating the linear dependence of direct and indirect radiative forcing on emission of carbonaceous aerosols in a global climate model." The processed data are monthly averaged cloud properties (CCN, CDNC and LWP) and forcing variables (DRF and IRF) at original CAM5 spatial resolution (1.9° by 2.5°). Raw model output fields from CAM5 simulations are available through NERSC upon request. Please find more detailed information in the ReadMe file.
keywords:
carbonaceous aerosols; radiative forcing; emission; linearity