Illinois Data Bank

Data for "Modeling the Global Citation Network using the Scalable Agent-based Simulator for Citation Analysis with Recency-emphasized Sampling (SASCA-ReS)"

This dataset principally consists of four synthetic citation networks that were generated during the preparation of the manuscript Park M, Yi H, Warnow T, and Chacko G (2025). Modeling the Global Citation Network using the Scalable Agent-based Simulator for Citation Analysis with Recency-emphasized Sampling (SASCA-ReS). A preprint is available on Zenodo (https://zenodo.org/records/17772113) and the manuscript has been submitted to the MetaRoR platform for review and feedback.

The networks contain roughly 14, 76, 161, and 218 million nodes, respectively. For each network, a node list with attributes and an edge list are provided as gzipped parquet files, along with a copy of the configuration file that was passed to the SASCA-ReS software to generate that network. The software can be accessed at: https://github.com/illinois-or-research-analytics/SASCA-ReS. For example, the files for one network are abm14_config.ini, abm14_edgelist.parquet.gz, and abm14_nodelist.parquet.gz. The column headers in the edge lists and node lists and the fields in the configuration file are explained in the GitHub repository for SASCA-ReS.
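The following is a minimal sketch of one way to unpack and open these files, not a procedure prescribed by the authors. It assumes each .parquet.gz file is an ordinary gzip wrapper around a single parquet file, and that pandas with a parquet engine such as pyarrow is installed for the final read step. The helper names dataset_files, gunzip, and load_table are hypothetical. The largest files are tens of gigabytes, so the decompression step streams in chunks rather than loading the file into memory.

```python
import gzip
import shutil
from pathlib import Path


def dataset_files(prefix):
    """Return the three file names that share one network prefix, e.g. 'abm14'."""
    return {
        "config": f"{prefix}_config.ini",
        "edgelist": f"{prefix}_edgelist.parquet.gz",
        "nodelist": f"{prefix}_nodelist.parquet.gz",
    }


def gunzip(src, dest_dir=None):
    """Stream-decompress name.parquet.gz to name.parquet and return the new path."""
    src = Path(src)
    if src.suffix != ".gz":
        raise ValueError(f"expected a .gz file, got {src.name}")
    out_dir = Path(dest_dir) if dest_dir is not None else src.parent
    dest = out_dir / src.stem  # .stem drops only the trailing ".gz"
    with gzip.open(src, "rb") as fin, open(dest, "wb") as fout:
        # Copy in 16 MiB chunks so large files never sit fully in memory.
        shutil.copyfileobj(fin, fout, length=16 * 1024 * 1024)
    return dest


def load_table(parquet_path):
    """Read a decompressed parquet file (assumes pandas plus a parquet engine)."""
    import pandas as pd  # third-party; imported lazily so gunzip works without it

    return pd.read_parquet(parquet_path)
```

A typical call would be load_table(gunzip(dataset_files("abm14")["nodelist"])). For the larger networks, reading the whole edge list into one data frame may exceed available memory, so selecting columns or reading in batches is likely the safer route.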

In addition, we provide sj_reccount, a table of real-world citation frequencies that is an input to the SASCA-ReS software. The first column (diff) of sj_reccount lists the difference between the publication year of a citing document and the publication year of a cited document. The second column (count) reports the frequency of such citations across the dataset of 77,879,427 observations, which is derived from the biomedical literature. Finally, we share data, composite_maverick_disruption.csv, from the mavericks (unconventional citing strategies) experiment reported in the Park et al. (2025) manuscript available at https://zenodo.org/records/17772113. The columns in the composite_maverick_disruption.csv file are:

node_id -> identifier of the agent in the various simulations
n_i, n_j, n_k -> terms used to compute disruption per Wu, Wang, and Evans (2019), "Large teams develop and small teams disrupt science and technology," Nature 566, 378–382, doi:10.1038/s41586-019-0941-9
disruption -> the disruption metric of Wu, Wang, and Evans (2019)
type -> maverick type (maximizer, randomnik, or minimizer)
year -> virtual year in the simulation when the maverick was created
alpha -> the alpha parameter of the control agent
pa_weight -> the preferential attachment weight of the control agent phenotype
fit_peak_value -> the fitness value assigned to the control agent
in_degree -> the count of citations accumulated by the maverick or control agent at the end of the simulation
out_degree -> the count of references made by the maverick
tag -> a label for the experiment, e.g. od249_f1 indicates that the mavericks in this experiment made 249 citations and were assigned a fitness value of 1.
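
As an illustration of how the columns above relate, the sketch below recomputes disruption from n_i, n_j, and n_k using the definition in Wu, Wang, and Evans (2019), D = (n_i - n_j) / (n_i + n_j + n_k), and splits a tag such as od249_f1 into its out-degree and fitness parts. It assumes the stored disruption column follows that definition exactly and that every tag matches the od<number>_f<number> pattern of the one example given. Neither assumption is confirmed here. The function names are hypothetical, and the rows in TOY_CSV are made-up values for demonstration, not records from the dataset.

```python
import csv
import io
import re


def disruption(n_i, n_j, n_k):
    """Disruption per Wu, Wang, and Evans (2019): (n_i - n_j) / (n_i + n_j + n_k)."""
    total = n_i + n_j + n_k
    if total == 0:
        return None  # undefined when no citing papers are counted
    return (n_i - n_j) / total


def parse_tag(tag):
    """Split a tag like 'od249_f1' into out-degree and fitness (assumed pattern)."""
    m = re.fullmatch(r"od(\d+)_f(\d+(?:\.\d+)?)", tag)
    if m is None:
        return None
    return {"out_degree": int(m.group(1)), "fitness": float(m.group(2))}


def find_mismatches(rows, tol=1e-9):
    """Return (node_id, stored, recomputed) for rows whose stored value differs."""
    out = []
    for row in rows:
        d = disruption(float(row["n_i"]), float(row["n_j"]), float(row["n_k"]))
        stored = float(row["disruption"])
        if d is None or abs(d - stored) > tol:
            out.append((row["node_id"], stored, d))
    return out


# Made-up rows for demonstration only; not taken from the dataset.
TOY_CSV = """node_id,n_i,n_j,n_k,disruption,type,tag
1,6,2,2,0.4,maximizer,od249_f1
2,1,3,4,-0.25,minimizer,od249_f1
"""

mismatches = find_mismatches(csv.DictReader(io.StringIO(TOY_CSV)))
```

To run the same check on the real file, replace the io.StringIO wrapper with an open handle on composite_maverick_disruption.csv. Any rows returned would indicate that the stored metric uses a different convention than the one assumed here.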

synthetic networks; agent based models; SASCA-ReS; citation networks
CC BY
George Chacko
Version DOI Comment Publication Date
1 10.13012/B2IDB-9265079_V1 2025-12-01

1.59 KB File
2.05 GB File
159 MB File
1.74 KB File
28.5 GB File
1.84 GB File
1.82 KB File
38.8 GB File
2.49 GB File
1.59 KB File
12.8 GB File
881 MB File
7.89 KB File
2.51 KB File
