Illinois Data Bank Dataset Search Results
Results
published:
2026-02-17
Peyton, Buddy; Bajjalieh, Joseph; Martin, Michael; Gerald, Andrea
(2026)
Coups d'État are important events in the life of a country. They constitute an important subset of irregular transfers of political power that can have significant and enduring consequences for national well-being. Only a limited number of datasets are available to study these events (Powell and Thyne 2011; Marshall and Marshall 2019; Chin, Carter and Wright 2021). Seeking to facilitate research on post-WWII coups by compiling a more comprehensive list and categorization of these events, the Cline Center for Advanced Social Research (previously the Cline Center for Democracy) initiated the Coup d’État Project as part of its Societal Infrastructures and Development (SID) project. More specifically, this dataset identifies the outcomes of coup events (i.e., realized, unrealized, or conspiracy), the type of actor(s) who initiated the coup (i.e., military, rebels, etc.), as well as the fate of the deposed leader.
Version 2.2.2 corrects an error in version 2.2.1 in which the “conspiracy” designation was mistakenly assigned to coup_id: 40411262025. Version 2.2.2 resolves this issue by removing the incorrect designation.
Version 2.2.1 adds 67 additional coup events. 47 of these came from examining the Colpus dataset (Chin, Carter, and Wright 2021), and 20 were added to the data set in the normal annual review of potential new coup events. This version also updates the coding of events in Mali in 2012, Serbia in 2000, and Chad in 1979.
Version 2.2.0 adds 94 additional coup events. 66 of these came from examining Powell and Thyne’s “discarded” events, and 28 were added to the data set in the normal annual review of potential new coup events. This version also updates the coding of events in Brazil in 1945 and the Congo in 1968.
Version 2.1.3 adds 19 additional coup events to the data set, corrects the date of a coup in Tunisia, and reclassifies an attempted coup in Brazil in December 2022 as a conspiracy.
Version 2.1.2 added 6 additional coup events that occurred in 2022 and updated the coding of an attempted coup event in Kazakhstan in January 2022.
Version 2.1.1 corrected a mistake in version 2.1.0, where the designation of “dissident coup” had been dropped in error for coup_id: 00201062021. Version 2.1.1 fixed this omission by marking the case as both a dissident coup and an auto-coup.
Version 2.1.0 added 36 cases to the data set and removed two cases from the v2.0.0 data set. This update also added actor coding for 46 coup events and added executive outcomes to 18 events from version 2.0.0. A few other changes were made to correct inconsistencies in the coup ID variable and the date of the event.
Version 2.0.0 improved several aspects of the previous version (v1.0.0) and incorporated additional source material to include:
• Reconciling missing event data
• Removing events with irreconcilable event dates
• Removing events with insufficient sourcing (each event needs at least two sources)
• Removing events that were inaccurately coded as coup events
• Removing variables that fell below the threshold of inter-coder reliability required by the project
• Removing the spreadsheet ‘CoupInventory.xls’ because of inadequate attribution and citations in the event summaries
• Extending the period covered from 1945-2005 to 1945-2019
• Adding events from Powell and Thyne’s Coup Data (Powell and Thyne, 2011)
Version 1.0.0 was released in 2013. This version consolidated coup data taken from the following sources:
• The Center for Systemic Peace (Marshall and Marshall, 2007)
• The World Handbook of Political and Social Indicators (Taylor and Jodice, 1983)
• Coup d’État: A Practical Handbook (Luttwak, 1979)
• The Cline Center’s Social, Political and Economic Event Database (SPEED) Project (Nardulli, Althaus and Hayes, 2015)
• Government Change in Authoritarian Regimes – 2010 Update (Svolik and Akcinaroglu, 2006)
<br>
<b>Items in this Dataset</b>
1. <i>Cline Center Coup d'État Codebook v.2.2.2 Codebook.pdf</i> - This 18-page document describes the Cline Center Coup d’État Project dataset. The first section of this codebook provides a summary of the different versions of the data. The second section provides a succinct definition of a coup d’état used by the Coup d'État Project and an overview of the categories used to differentiate the wide array of events that meet the project's definition. It also defines coup outcomes. The third section describes the methodology used to produce the data. <i>Revised February 2026</i>
2. <i>Coup Data 2.2.2.csv</i> - This CSV (Comma Separated Values) file contains all of the coup event data from the Cline Center Coup d’État Project. It contains 29 variables and 1,161 observations. <i>Revised February 2026</i>
3. <i>Source Document v2.2.2.pdf</i> - This 365-page document provides the sources used for each of the coup events identified in this dataset. Please use the value in the coup_id variable to identify the sources used to identify that particular event. <i>Revised February 2026</i>
4. <i>README.md</i> - This file contains useful information for the user about the dataset. It is a text file written in Markdown language. <i>Revised February 2026</i>
<br>
<b> Citation Guidelines</b>
1. To cite the codebook (or any other documentation associated with the Cline Center Coup d’État Project Dataset) please use the following citation:
Peyton, Buddy, Joseph Bajjalieh, Dan Shalmon, Michael Martin, Jonathan Bonaguro, and Scott Althaus. 2026. “Cline Center Coup d’État Project Dataset Codebook”. Cline Center Coup d’État Project Dataset. Cline Center for Advanced Social Research. V.2.2.2. February 17. University of Illinois Urbana-Champaign. doi: 10.13012/B2IDB-9651987_V10
2. To cite data from the Cline Center Coup d’État Project Dataset please use the following citation (filling in the correct date of access):
Peyton, Buddy, Joseph Bajjalieh, Michael Martin, and Andrea Gerald. 2026. Cline Center Coup d’État Project Dataset. Cline Center for Advanced Social Research. V.2.2.2. February 17. University of Illinois Urbana-Champaign. doi: 10.13012/B2IDB-9651987_V10
published:
2025-09-08
Si, Luyang; Salami, Malik Oyewale; Schneider, Jodi
(2025)
This work evaluates the consistency and reliability of the title flag, i.e., the retraction labeling that appears in the title of a retracted publication. We use 925 sampled retracted publications that were indexed as retracted in Crossref only (Lee & Schneider, 2023) but that also appear in three other sources (Retraction Watch, Scopus, and Web of Science) as of April 2023. We presume the retraction status of an item from its title flag: for example, the flag "removal notice" indicates a retraction notice, and "retracted article" indicates a retracted paper. We then compared the item's likely retraction status from the flag with its actual retraction status from the publisher's website.
keywords:
Crossref; Data Quality; Title flag; Retraction flag; Retraction flag assessment; Retraction labeling; Retraction indexing; Retracted papers; Retraction notices; Retraction status; RISRS
published:
2021-03-14
Kang, Jeon-Young; Michels, Alexander; Lyu, Fangzheng; Wang, Shaohua; Agbodo, Nelson; Freeman, Vincent L; Wang, Shaowen; Padmanabhan, Anand
(2021)
This dataset contains all of the code, notebooks, and datasets used in a study measuring the spatial accessibility of COVID-19 healthcare resources, with a particular focus on Illinois, USA. Specifically, the dataset measures spatial access for people to hospitals and ICU beds in Illinois. Spatial accessibility is measured using an enhanced two-step floating catchment area (E2FCA) method (Luo & Qi, 2009), which captures the interaction between demand (i.e., the number of potential patients) and supply (i.e., the number of beds or physicians). The result is a map of spatial accessibility to hospital beds. It identifies which regions need more healthcare resources, such as ICU beds and ventilators. The notebook highlights which areas need more beds in the fight against COVID-19.
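The two-step logic behind a floating catchment area measure can be sketched in miniature. This is a simplified version without the enhanced method's distance-decay zones, and all names and numbers are illustrative, not taken from the notebook:

```python
def two_step_fca(supply, demand, within):
    """Simplified 2SFCA. supply[j] = beds at site j, demand[k] = population
    at location k, within(a, b) -> True if b lies in a's catchment."""
    # Step 1: supply-to-demand ratio for each hospital site
    ratio = {}
    for j, s in supply.items():
        pop = sum(p for k, p in demand.items() if within(j, k))
        ratio[j] = s / pop if pop else 0.0
    # Step 2: accessibility at each demand location = sum of reachable ratios
    return {k: sum(r for j, r in ratio.items() if within(j, k))
            for k in demand}

# Toy example: two hospitals, two tracts, everything mutually reachable
acc = two_step_fca({"H1": 10, "H2": 20}, {"T1": 100, "T2": 100},
                   lambda a, b: True)
```

With everything reachable, each tract's accessibility is the total bed-to-population ratio (30 beds / 200 people = 0.15 beds per person).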
## What's Inside
A quick explanation of the components of the zip file
* `COVID-19Acc.ipynb` is a notebook for calculating spatial accessibility and `COVID-19Acc.html` is an export of the notebook as HTML.
* `Data` contains all of the data necessary for calculations:
* `Chicago_Network.graphml`/`Illinois_Network.graphml` are GraphML files of the OSMNX street networks for Chicago and Illinois respectively.
* `GridFile/` has hexagonal gridfiles for Chicago and Illinois
* `HospitalData/` has shapefiles for the hospitals in Chicago and Illinois
* `IL_zip_covid19/COVIDZip.json` is a JSON file of COVID-19 cases by ZIP code from IDPH
* `PopData/` contains population data for Chicago and Illinois by census tract and zip code.
* `Result/` is where we write out the results of the spatial accessibility measures
* `SVI/` contains data about the Social Vulnerability Index (SVI)
* `img/` contains some images and HTML maps of the hospitals (the notebook generates the maps)
* `README.md` is the document you're currently reading!
* `requirements.txt` is a list of Python packages necessary to use the notebook (besides Jupyter/IPython). You can install the packages with `python3 -m pip install -r requirements.txt`
keywords:
COVID-19; spatial accessibility; CyberGISX
published:
2023-01-12
Mischo, William; Schlembach, Mary C.
(2023)
These processing and Pearson correlation scripts were developed to support a study examining the correlational relationships between local journal authorship, local and external citation counts, full-text downloads, link-resolver clicks, and four global journal impact factor indices within an all-disciplines collection of 12,200 journal titles, and six subject subsets, at the University of Illinois at Urbana-Champaign (UIUC) Library. The study shows strong correlations in the all-disciplines set and most subject subsets. Special processing scripts and website dashboards were created, including Pearson correlation analysis scripts that read values from relational databases and display tabular results.
The raw data used in this analysis, in the form of relational database tables with multiple columns, is available at <a href="https://doi.org/10.13012/B2IDB-6810203_V1">https://doi.org/10.13012/B2IDB-6810203_V1</a>.
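For reference, the Pearson coefficient such scripts compute reduces to the product-moment formula; a minimal stand-alone version (illustrative only, not the project's actual script) is:

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson product-moment correlation between two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Perfectly linearly related series correlate at r = 1.0
downloads = [10, 20, 30, 40]
citations = [1, 2, 3, 4]
r = pearson_r(downloads, citations)
```

In the actual study the inputs would be columns read from the relational database tables linked above.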
keywords:
Pearson Correlation Analysis Scripts; Journal Publication; Citation and Usage Data; University of Illinois at Urbana-Champaign Scholarly Communication
published:
2020-08-18
Althaus, Scott; Berenbaum, May; Jordan, Jenna; Shalmon, Dan
(2020)
These data and code enable replication of the findings and robustness checks in "No buzz for bees: Media coverage of pollinator decline," published in Proceedings of the National Academy of Sciences of the United States of America (2020). In this paper, we find that although widespread declines in insect biomass and diversity are of increasing concern within the scientific community, it remains unclear whether attention to pollinator declines has also increased within information sources serving the general public. Examining patterns of journalistic attention to the pollinator population crisis can also inform efforts to raise awareness about the importance of declines of insect species providing ecosystem services beyond pollination.
We used the Global News Index developed by the Cline Center for Advanced Social Research at the University of Illinois at Urbana-Champaign to track news attention to pollinator topics in nearly 25 million news items published by two American national newspapers and four international wire services over the past four decades. We provide a link to documentation of the Global News Index in this dataset's related-materials metadata. We found vanishingly low levels of attention to pollinator population topics relative to coverage of climate change, which we use as a comparison topic. In the most recent subset of ~10 million stories published from 2007 to 2019, 1.39% (137,086 stories) refer to climate change/global warming, while only 0.02% (1,780) refer to pollinator populations in all contexts and just 0.007% (679) refer to pollinator declines. Substantial increases in news attention were detectable only in U.S. national newspapers. We also find that while climate change stories appear primarily in newspaper “front sections”, pollinator population stories remain largely marginalized in “science” and “back section” reports. At the same time, news reports about pollinator populations increasingly link the issue to climate change, which might ultimately help raise public awareness to effect needed policy changes.
keywords:
News Coverage; Text Analytics; Insects; Pollinator; Cline Center; Cline Center for Advanced Social Research; political; social; political science; Global News Index; Archer; news; mass communication; journalism
published:
2022-06-20
Jiang, Ming; Dubnicek, Ryan; Worthey, Glen; Underwood, Ted; Downie, J. Stephen
(2022)
This is a sentence-level parallel corpus in support of research on OCR quality. The source data come from: (1) Project Gutenberg, for human-proofread "clean" sentences; and (2) the HathiTrust Digital Library, for the paired sentences with OCR errors. In total, this corpus contains 167,079 sentence pairs from 189 sampled books in four domains (i.e., agriculture, fiction, social science, world war history) published from 1793 to 1984. There are 36,337 sentences that have two OCR views paired with each clean version. In addition to sentence texts, this corpus also provides the location (i.e., sentence and chapter index) of each sentence within its source Gutenberg volume.
keywords:
sentence-level parallel corpus; optical character recognition; OCR errors; Project Gutenberg; HathiTrust Digital Library; digital libraries; digital humanities;
published:
2019-05-31
The data are provided to illustrate methods for evaluating systematic transactional data reuse in machine learning. A library account-based recommender system was developed using machine learning over 383,828 transactions (check-outs) sourced from a large multi-unit research library. The machine learning process applied the FP-growth algorithm to the subject metadata associated with physical items that were checked out together in the library. The analysis herein contains a large-scale network visualization of 180,441 subject association rules and corresponding node metrics.
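The kind of subject co-occurrence rule mined here can be illustrated with a brute-force pair count. This is a stand-in for FP-growth, which finds the same frequent itemsets far more efficiently; the transactions below are invented:

```python
from itertools import combinations
from collections import Counter

def frequent_pairs(transactions, min_support):
    """Brute-force frequent subject pairs with their support
    (fraction of transactions in which the pair co-occurs)."""
    n = len(transactions)
    counts = Counter()
    for t in transactions:
        for pair in combinations(sorted(set(t)), 2):
            counts[pair] += 1
    return {p: c / n for p, c in counts.items() if c / n >= min_support}

# Toy check-out baskets of subject headings
checkouts = [
    {"Algebra", "Topology"},
    {"Algebra", "Topology", "Logic"},
    {"Algebra", "Logic"},
]
pairs = frequent_pairs(checkouts, min_support=0.5)
```

Pairs that co-occur in at least half the baskets survive; FP-growth would then derive association rules (with confidence and lift) from such itemsets.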
keywords:
evaluating machine learning; network science; FP-growth; WEKA; Gephi; personalization; recommender systems
published:
2018-04-23
Provides links between Author-ity 2009 records and principal investigators (on NIH and NSF grants), inventors on USPTO patents, and students/advisors on ProQuest dissertations.
Note that NIH and NSF differ in the types of fields they record and the standards used (e.g., institution names). Typically, an NSF grant spanning multiple years is associated with one record, while an NIH grant appears in multiple records (one per fiscal year and sub-project/supplement), possibly with different principal investigators.
The prior probability of a match (i.e., that the author exists in Author-ity 2009) varies dramatically across NIH grants, NSF grants, and USPTO patents. The great majority of NIH principal investigators have one or more papers in PubMed, but only a minority of NSF principal investigators (except in biology) do, and even fewer USPTO inventors do. This prior probability is built into the calculation of match probabilities.
The NIH data were downloaded from NIH ExPORTER and the older NIH CRISP files. The dataset has 2,353,387 records, includes only records with match probability > 0.5, and has the following 12 fields:
1 app_id
2 nih_full_proj_nbr
3 nih_subproj_nbr
4 fiscal_year
5 pi_position
6 nih_pi_names
7 org_name
8 org_city_name
9 org_bodypolitic_code
10 age: number of years since their first paper
11 prob: the match probability to au_id
12 au_id: Author-ity 2009 author ID
The NSF dataset has 262,452 records, includes only records with match probability > 0.5, and has the following 10 fields:
1 AwardId
2 fiscal_year
3 pi_position
4 PrincipalInvestigators
5 Institution
6 InstitutionCity
7 InstitutionState
8 age: number of years since their first paper
9 prob: the match probability to au_id
10 au_id: Author-ity 2009 author ID
There are two files for USPTO because here we linked disambiguated authors in PubMed (from Author-ity 2009) with disambiguated inventors.
The USPTO linking dataset has 309,720 records, includes only records with match probability > 0.5, and has the following 3 fields:
1 au_id: Author-ity 2009 author ID
2 inv_id: USPTO inventor ID
3 prob: the match probability of au_id vs inv_id
The disambiguated inventors file (uiuc_uspto.tsv) has 2,736,306 records and has the following 7 fields:
1 inv_id: USPTO inventor ID
2 is_lower
3 is_upper
4 fullnames
5 patents: patent IDs separated by '|'
6 first_app_yr
7 last_app_yr
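Since every linking file is distributed pre-filtered at match probability > 0.5, applying a stricter threshold is a one-line filter. A hypothetical sketch over the 3-field USPTO linking layout (column names from the field list above; the rows are invented):

```python
import csv
import io

def high_probability_matches(tsv_text, threshold=0.5):
    """Keep author-inventor linking rows whose match probability
    exceeds the threshold."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return [row for row in reader if float(row["prob"]) > threshold]

# Invented sample in the 3-field linking layout
sample = "au_id\tinv_id\tprob\nA1\tI1\t0.92\nA2\tI2\t0.31\n"
kept = high_probability_matches(sample)
```

In practice one would open the distributed TSV file directly instead of an in-memory string.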
keywords:
PubMed; USPTO; Principal investigator; Name disambiguation
published:
2018-12-20
Dong, Xiaoru; Xie, Jingyi; Hoang, Linh
(2018)
File Name: WordsSelectedByManualAnalysis.csv
Data Preparation: Xiaoru Dong, Linh Hoang
Date of Preparation: 2018-12-14
Data Contributions: Jingyi Xie, Xiaoru Dong, Linh Hoang
Data Source: Cochrane systematic reviews published up to January 3, 2018 by 52 different Cochrane groups in 8 Cochrane group networks.
Associated Manuscript authors: Xiaoru Dong, Jingyi Xie, Linh Hoang, and Jodi Schneider.
Associated Manuscript, Working title: Machine classification of inclusion criteria from Cochrane systematic reviews.
Description: This file contains the list of 407 informative words selected by manual analysis from the 1,655 words obtained via information-gain feature selection. We manually reviewed the 1,655 words and eliminated domain-specific terms; the remaining words were selected into the "Manual Analysis Words" list as the result.
Notes: Although the list of words in this file was selected manually, the relevant data can be reproduced by obtaining the project code published on GitHub at https://github.com/XiaoruDong/InclusionCriteria and running it following the instructions provided.
keywords:
Inclusion criteria; Randomized controlled trials; Machine learning; Systematic reviews
published:
2019-07-08
Mishra, Shubhanshu
(2019)
Wikipedia category tree embeddings based on the Wikipedia SQL dump dated 2017-09-20 (<a href="https://archive.org/download/enwiki-20170920">https://archive.org/download/enwiki-20170920</a>), created using the following algorithms:
* Node2vec
* Poincare embedding
* Elmo model on the category title
The following files are present:
* wiki_cat_elmo.txt.gz (15G) - Elmo embeddings. Format: category_name (spaces replaced with "_") <tab> 300-dimensional space-separated embedding.
* wiki_cat_elmo.txt.w2v.gz (15G) - Elmo embeddings in word2vec format; can be loaded using Gensim's KeyedVectors.load_word2vec_format.
* elmo_keyedvectors.tar.gz - Gensim Word2VecKeyedVector format of Elmo embeddings. Nodes are indexed using
* node2vec.tar.gz (3.4G) - Gensim word2vec model which has node2vec embedding for each category identified using the position (starting from 0) in category.txt
* poincare.tar.gz (1.8G) - Gensim poincare embedding model which has poincare embedding for each category identified using the position (starting from 0) in category.txt
* wiki_category_random_walks.txt.gz (1.5G) - Random walks generated by node2vec algorithm (https://github.com/aditya-grover/node2vec/tree/master/node2vec_spark), each category identified using the position (starting from 0) in category.txt
* categories.txt - One category name per line (with spaces). The line number (starting from 0) is used as category ID in many other files.
* category_edges.txt - Category edges based on category names (with spaces). Format from_category <tab> to_category
* category_edges_ids.txt - Category edges based on category ids, each category identified using the position (starting from 1) in category.txt
* wiki_cats-G.json - NetworkX format of category graph, each category identified using the position (starting from 1) in category.txt
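Assuming the tab-plus-space layout described for wiki_cat_elmo.txt above, a line can be parsed with a few lines of Python (a sketch, not code shipped with this dataset):

```python
def parse_embedding_line(line):
    """Parse one line of wiki_cat_elmo.txt: category name (spaces
    replaced by '_'), a tab, then a space-separated float vector."""
    name, vec = line.rstrip("\n").split("\t", 1)
    return name.replace("_", " "), [float(v) for v in vec.split()]

# Invented example line (real vectors are 300-dimensional)
name, vector = parse_embedding_line("Category:Machine_learning\t0.1 -0.2 0.3")
```

For the .w2v.gz variant, Gensim's word2vec text loader handles this parsing automatically.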
Software used:
* <a href="https://github.com/napsternxg/WikiUtils">https://github.com/napsternxg/WikiUtils</a> - Processing sql dumps
* <a href="https://github.com/napsternxg/node2vec">https://github.com/napsternxg/node2vec</a> - Generate random walks for node2vec
* <a href="https://github.com/RaRe-Technologies/gensim">https://github.com/RaRe-Technologies/gensim</a> (version 3.4.0) - generating node2vec embeddings from random walks generated using the node2vec algorithm
* <a href="https://github.com/allenai/allennlp">https://github.com/allenai/allennlp</a> (version 0.8.2) - Generate elmo embeddings for each category title
Code used:
* wiki_cat_node2vec_commands.sh - Commands used to generate node2vec embeddings
* wiki_cat_generate_elmo_embeddings.py - generate elmo embeddings
* wiki_cat_poincare_embedding.py - generate poincare embeddings
keywords:
Wikipedia; Wikipedia Category Tree; Embeddings; Elmo; Node2Vec; Poincare;
published:
2018-03-28
Bibliotelemetry data are provided in support of the evaluation of Internet of Things (IoT) middleware within library collections. IoT infrastructure within the physical library environment is the basis for an integrative, hybrid approach to digital resource recommenders. The IoT infrastructure provides mobile, dynamic wayfinding support for items in the collection, which includes features for location-based recommendations. A modular evaluation and analysis herein clarified the nature of users’ requests for recommendations based on their location, and describes subject areas of the library for which users request recommendations. The modular mobile design allowed for deep exploration of bibliographic identifiers as they appeared throughout the global module system, serving to provide context to the searching and browsing data that are the focus of this study.
keywords:
internet of things; IoT; academic libraries; bibliographic classification
published:
2025-05-02
This dataset contains the first-generation (1st-gen) and second-generation (2nd-gen) citation relationships to a set of focal papers. The 1st-gen citation relationships are the instances of one paper citing a focal paper. These citing papers are called "1st-gen citations." The 2nd-gen citation relationships are the instances that a paper cites a 1st-gen citation. The citing paper in the 2nd-gen citation relationship is a second-generation (2nd-gen) citation. When a 2nd-gen citation is also a 1st-gen citation, it creates a transitive closure with the focal paper.
Each focal paper has an abbreviation, which can be found below. The 1st-gen and 2nd-gen citation relationships were extracted from the Curated Open Citation Dataset (Korobskiy & Chacko, 2023), which is derived from a copy of COCI, the OpenCitations Index of Crossref Open DOI-to-DOI Citations, downloaded on May 6, 2023. Scripts used to collect this dataset can be found at https://github.com/yuanxiesa/transitive_closure_study. Each focal paper currently has two files: {abbreviation}_1st.csv contains the 1st-gen citation relationships; {abbreviation}_2nd.csv contains the 2nd-gen citation relationships.
Focal paper abbreviation == "louvain": Blondel, V. D., Guillaume, J.-L., Lambiotte, R., & Lefebvre, E. (2008). Fast unfolding of communities in large networks. Journal of Statistical Mechanics: Theory and Experiment, 2008(10), P10008. https://doi.org/10.1088/1742-5468/2008/10/P10008
Focal paper abbreviation == "lp": Raghavan, U. N., Albert, R., & Kumara, S. (2007). Near linear time algorithm to detect community structures in large-scale networks. Physical Review E, 76(3), 036106. https://doi.org/10.1103/PhysRevE.76.036106
Focal paper abbreviation == "gn": Newman, M. E. J., & Girvan, M. (2004). Finding and evaluating community structure in networks. Physical Review E, 69(2), 026113. https://doi.org/10.1103/PhysRevE.69.026113
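The transitive-closure definition above can be made concrete with a few set operations (toy DOIs, not drawn from the data):

```python
def transitive_closures(first_gen, second_gen_edges):
    """A 2nd-gen citer that is also a 1st-gen citation forms a
    transitive closure with the focal paper: B cites both the focal
    paper and a 1st-gen citation A of that focal paper."""
    first = set(first_gen)
    return {citer for citer, cited in second_gen_edges
            if cited in first and citer in first}

# Toy DOIs: B cites the focal paper (1st-gen) and also cites A (2nd-gen edge)
closures = transitive_closures(
    first_gen=["doi:A", "doi:B"],
    second_gen_edges=[("doi:B", "doi:A"), ("doi:C", "doi:A")],
)
```

With the distributed files, first_gen would come from {abbreviation}_1st.csv and the (citer, cited) edges from {abbreviation}_2nd.csv.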
keywords:
transitive closure; citations; community detection algorithms; OpenCitations; method papers
published:
2024-11-19
Salami, Malik Oyewale; McCumber, Corinne
(2024)
This project investigates retraction indexing agreement among data sources: Crossref, Retraction Watch, Scopus, and Web of Science. As of July 2024, this reassesses the April 2023 union list of Schneider et al. (2023): https://doi.org/10.55835/6441e5cae04dbe5586d06a5f. As of April 2023, over 1 in 5 DOIs had discrepancies in retraction indexing among the 49,924 DOIs indexed as retracted in at least one of Crossref, Retraction Watch, Scopus, and Web of Science (Schneider et al., 2023). Here, we determine what changed in 15 months.
Pipeline code to get the results files can be found in the GitHub repository
https://github.com/infoqualitylab/retraction-indexing-agreement in the iPython notebook 'MET-STI2024_Reassessment_of_retraction_indexing_agreement.ipynb'
Some files have been redacted to remove proprietary data, as noted in README.txt. Among our sources, data is openly available only for Crossref and Retraction Watch.
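The union-list discrepancy notion, a DOI indexed as retracted in at least one source but missing from another, can be sketched with set operations (illustrative only, not the pipeline code):

```python
def indexing_discrepancies(indexed_by):
    """Given {source: set of DOIs indexed as retracted}, return the
    union-list DOIs missing from at least one source."""
    union = set().union(*indexed_by.values())
    return {doi for doi in union
            if any(doi not in dois for dois in indexed_by.values())}

# Invented DOIs across two of the four sources
sources = {
    "crossref": {"d1", "d2"},
    "retraction_watch": {"d1", "d2", "d3"},
}
disputed = indexing_discrepancies(sources)
```

Here "d3" is disputed because Retraction Watch indexes it as retracted while Crossref does not.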
FILE FORMATS:
1) unionlist_completed_2023-09-03-crws-ressess.csv - UTF-8 CSV file
2) unionlist_completed-ria_2024-07-09-crws-ressess.csv - UTF-8 CSV file
3) unionlist-15months-period_sankey.png - Portable Network Graphics (PNG) file
4) unionlist_ria_proportion_comparison.png - Portable Network Graphics (PNG) file
5) README.txt - text file
FILE DESCRIPTION:
Description of the files can be found in README.txt
keywords:
retraction status; data quality; indexing; retraction indexing; metadata; meta-science; RISRS
published:
2018-12-20
Dong, Xiaoru; Xie, Jingyi; Hoang, Linh
(2018)
File Name: Inclusion_Criteria_Annotation.csv
Data Preparation: Xiaoru Dong
Date of Preparation: 2018-12-14
Data Contributions: Jingyi Xie, Xiaoru Dong, Linh Hoang
Data Source: Cochrane systematic reviews published up to January 3, 2018 by 52 different Cochrane groups in 8 Cochrane group networks.
Associated Manuscript authors: Xiaoru Dong, Jingyi Xie, Linh Hoang, and Jodi Schneider.
Associated Manuscript, Working title: Machine classification of inclusion criteria from Cochrane systematic reviews.
Description: The file contains the list of inclusion criteria of Cochrane Systematic Reviews and the manual annotation results. 5,420 of the 7,158 available inclusion criteria were annotated. Annotations are either "Only RCTs" or "Others". There are 2 columns in the file:
- "Inclusion Criteria": Content of inclusion criteria of Cochrane Systematic Reviews.
- "Only RCTs": Manual annotation results, in which "x" means the inclusion criterion is classified as "Only RCTs" and blank means it is classified as "Others".
Notes:
1. "RCT" stands for Randomized Controlled Trial, which, in definition, is "a work that reports on a clinical trial that involves at least one test treatment and one control treatment, concurrent enrollment and follow-up of the test- and control-treated groups, and in which the treatments to be administered are selected by a random process, such as the use of a random-numbers table." [Randomized Controlled Trial publication type definition from https://www.nlm.nih.gov/mesh/pubtypes.html].
2. In order to reproduce the relevant data to this, please get the code of the project published on GitHub at: https://github.com/XiaoruDong/InclusionCriteria and run the code following the instruction provided.
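The "x"/blank encoding described above maps to class labels with a one-line rule (a sketch, not the project's code):

```python
def annotation_label(only_rcts_cell):
    """Map the 'Only RCTs' column to a class label: 'x' -> 'Only RCTs',
    blank -> 'Others', per the column description above."""
    return "Only RCTs" if only_rcts_cell.strip().lower() == "x" else "Others"

# Example cells as they would appear in the CSV column
labels = [annotation_label(cell) for cell in ["x", "", "  ", "X"]]
```

Such a mapping is a typical preprocessing step before training a text classifier on the criteria.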
keywords:
Inclusion criteria; Randomized controlled trials; Machine learning; Systematic reviews
published:
2024-03-21
Becker, Maria; Han, Kanyao; Werthmann, Antonina; Rezapour, Rezvaneh; Lee, Haejin; Diesner, Jana; Witt, Andreas
(2024)
Impact assessment is an evolving area of research that aims at measuring and predicting the potential effects of projects or programs. Measuring the impact of scientific research is a vibrant subdomain, closely intertwined with impact assessment. A recurring obstacle is the absence of an efficient framework to facilitate the analysis of lengthy reports and text labeling. To address this issue, we propose a framework for automatically assessing the impact of scientific research projects by identifying sections in project reports that indicate potential impacts. We leverage a mixed-method approach, combining manual annotations with supervised machine learning, to extract these passages from project reports. This repository holds the datasets and code related to this project.
Please read and cite the following paper if you would like to use the data:
Becker M., Han K., Werthmann A., Rezapour R., Lee H., Diesner J., and Witt A. (2024). Detecting Impact Relevant Sections in Scientific Research. The 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING).
This folder contains the following files:
evaluation_20220927.ods: Annotated German passages (Artificial Intelligence, Linguistics, and Music) - training data
annotated_data.big_set.corrected.txt: Annotated German passages (Mobility) - training data
incl_translation_all.csv: Annotated English passages (Artificial Intelligence, Linguistics, and Music) - training data
incl_translation_mobility.csv: Annotated English passages (Mobility) - training data
ttparagraph_addmob.txt: German corpus (unannotated passages)
model_result_extraction.csv: Extracted impact-relevant passages from the German corpus based on the model we trained
rf_model.joblib: The random forest model we trained to extract impact-relevant passages
Data processing codes can be found at: https://github.com/khan1792/texttransfer
keywords:
impact detection; project reports; annotation; mixed-methods; machine learning
published:
2025-07-28
McCumber, Corinne; Salami, Malik Oyewale
(2025)
This project investigates retraction indexing agreement in PubMed between 2024-07-03 and 2025-05-09 in order to address an API limitation that resulted in 199 items being excluded from the analysis in "Analyzing the consistency of retraction indexing". PubMed was queried on 2024-07-03 and on 2025-05-09 using the search “Retracted Publication[PT]”. PubMed can return at most 10,000 items per query via the E-Utilities API. When the pipeline was run on 2024-07-03, the search between 2020 and 2024 returned 10,199 items, meaning that an expected 199 items indexed as retracted in PubMed were excluded. This dataset uses and compares information from PubMed as of 2025-05-09 to attempt to identify those 199 items.
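A common workaround for the E-Utilities 10,000-item cap is to slice the query into smaller date ranges so each slice stays under the limit. The sketch below only builds the paged query URLs and does not reproduce this project's pipeline; db, term, datetype, mindate, maxdate, and retmax are documented esearch parameters:

```python
from urllib.parse import urlencode

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def esearch_by_year(term, years):
    """Build one esearch URL per publication year so that no single
    request exceeds the 10,000-item return limit."""
    urls = []
    for y in years:
        q = urlencode({"db": "pubmed", "term": term, "datetype": "pdat",
                       "mindate": str(y), "maxdate": str(y),
                       "retmax": 10000})
        urls.append(f"{EUTILS}?{q}")
    return urls

# One slice per year of the 2020-2024 window discussed above
urls = esearch_by_year("Retracted Publication[PT]", range(2020, 2025))
```

Each URL would then be fetched and the returned PMIDs merged, with the assumption that no single year exceeds the cap.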
keywords:
retraction status; data quality; indexing; retraction indexing; metadata; meta-science; RISRS; PubMed
published:
2018-08-06
Hoang, Linh; Cao, Linh ; Guan, Yingjun; Cheng, Yi-Yun; Schneider, Jodi
(2018)
This annotation study compared RobotReviewer's data extraction to that of three novice data extractors, using six included articles synthesized in one Cochrane review: Bailey E, Worthington HV, van Wijk A, Yates JM, Coulthard P, Afzal Z. Ibuprofen and/or paracetamol (acetaminophen) for pain relief after surgical removal of lower wisdom teeth. Cochrane Database Syst Rev. 2013; CD004624; doi:10.1002/14651858.CD004624.pub2. The goal was to assess the relative advantage of RobotReviewer's data extraction with respect to quality.
keywords:
RobotReviewer; annotation; information extraction; data extraction; systematic review automation; systematic reviewing;
published:
2020-05-17
Mishra, Sudhanshu; Prasad, Shivangi; Mishra, Shubhanshu
(2020)
Models and predictions for our submission to TRAC 2020, the Second Workshop on Trolling, Aggression and Cyberbullying.
Our approach is described in our paper titled:
Mishra, Sudhanshu, Shivangi Prasad, and Shubhanshu Mishra. 2020. “Multilingual Joint Fine-Tuning of Transformer Models for Identifying Trolling, Aggression and Cyberbullying at TRAC 2020.” In Proceedings of the Second Workshop on Trolling, Aggression and Cyberbullying (TRAC-2020).
The source code for training this model and more details can be found on our code repository: https://github.com/socialmediaie/TRAC2020
NOTE: These models are retrained for uploading here after our submission so the evaluation measures may be slightly different from the ones reported in the paper.
keywords:
Social Media; Trolling; Aggression; Cyberbullying; text classification; natural language processing; deep learning; open source;
published:
2020-02-12
XSEDE-Extreme Science and Engineering Discovery Environment
(2020)
The XSEDE program manages the database of allocation awards for the portfolio of advanced research computing resources funded by the National Science Foundation (NSF). The database holds data for allocation awards dating from the start of the TeraGrid program in 2004 to the present, with awards continuing through the end of the second XSEDE award in 2021. The project data include lead researcher and affiliation, title and abstract, field of science, and the start and end dates. Along with the project information, the data set includes resource allocation and usage data for each award associated with the project. The data show the transition of resources over a fifteen-year span along with the evolution of researchers, fields of science, and institutional representation.
keywords:
allocations; cyberinfrastructure; XSEDE
published:
2025-04-14
This dataset builds on an existing dataset capturing the demographics of artists represented by top-tier galleries in the 2016–2017 New York art season (Case-Leal, 2017, https://web.archive.org/web/20170617002654/http://www.havenforthedispossessed.org/), adding a census of reviews and catalogs about those exhibitions to assess the proportionality of media coverage across race and gender. The readme file explains the variables, the collection process, the relationship between the datasets, and an example of how the Case-Leal dataset was transformed. ArticleDataset.csv provides all articles with citation information as well as artist, artistic identity characteristic, and gallery. ExhibitionCatalog.csv provides exhibition catalog citation information for each identified artist.
New in this V2:
- In V1, ArticleDataset.csv had both data on the articles published and all of the exhibitions, which was misleading. In V2 I separated these out so that ArticleDataset.csv only has articles, and AllSoloShows has all shows, including those that had no articles written about them in the publications reviewed.
- Upon closer review I noticed approximately 10 out of the 133 articles had incorrect information in the variable "Publication content type: art or general" and/or "Publication Carrier type: web or library?", so I updated them in V2.
- Upon closer review I noticed there were 3 instances of artists who had two solo shows apiece: in addition to Meleko Mokgosi and Carrie Mae Weems, whom I had already noted in V1, there was also Roxy Paine. I had not noticed this because only one of Paine's two shows had been written about. This brings the total number of shows to 117 (it was 116 in V1).
- Upon closer review I removed one row from ExhibitionCatalogs.csv, as the item I had listed did not meet the parameters.
keywords:
diversity and inclusion; diversity audit; contemporary art; art exhibitions; art exhibition reviews; exhibition catalogs; magazines; newspapers; demographics
published:
2019-12-22
Zachwieja, Alexandra
(2019)
Dataset providing calculation of a Competition Index (CI) for Late Pleistocene carnivore guilds in Laos and Vietnam and their relationship to humans. Prey mass spectra, prey focus masses, and prey class raw data can be used to calculate the CI following Hemmer (2004). Mass estimates were calculated for each species following Van Valkenburgh (1990). Full citations to the methodological papers are included as relationships with other resources.
keywords:
competition; Southeast Asia; carnivores; humans
published:
2020-05-15
Mishra, Shubhanshu; Agarwal, Sneha; Guo, Jinlong; Phelps, Kirstin; Picco, Johna; Diesner, Jana
(2020)
This dataset contains the tweets collected for the paper: Shubhanshu Mishra, Sneha Agarwal, Jinlong Guo, Kirstin Phelps, Johna Picco, and Jana Diesner. 2014. Enthusiasm and support: alternative sentiment classification for social movements on social media. In Proceedings of the 2014 ACM conference on Web science (WebSci '14). ACM, New York, NY, USA, 261-262. DOI: https://doi.org/10.1145/2615569.2615667
The data only contains tweet IDs and the corresponding enthusiasm and support labels by two different annotators.
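Since each tweet ID carries labels from two annotators, a natural first use of the dataset is measuring inter-annotator agreement. A minimal Cohen's kappa sketch in plain Python (the label values below are illustrative, not the dataset's actual label vocabulary):

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items.

    Assumes the annotators are not in perfect chance-level agreement
    (expected agreement of exactly 1 would divide by zero).
    """
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: product of each label's marginal frequencies.
    expected = sum(freq_a[c] * freq_b[c] for c in set(labels_a) | set(labels_b)) / (n * n)
    return (observed - expected) / (1 - expected)

# Agreement between two hypothetical annotators on four tweets:
kappa = cohens_kappa(["yes", "no", "yes", "no"], ["yes", "no", "no", "no"])  # 0.5
```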
keywords:
Twitter; text classification; enthusiasm; support; social causes; LGBT; Cyberbullying; NFL
published:
2023-09-21
Clarke, Caitlin; Lischwe Mueller, Natalie; Joshi, Manasi Ballal; Fu, Yuanxi; Schneider, Jodi
(2023)
The relationship between physical activity and mental health, especially depression, is one of the most studied topics in the field of exercise science and kinesiology. Although there is strong consensus that regular physical activity improves mental health and reduces depressive symptoms, some debate the mechanisms involved in this relationship as well as the limitations and definitions used in such studies. Meta-analyses and systematic reviews continue to examine the strength of the association between physical activity and depressive symptoms for the purpose of improving exercise prescription as treatment or combined treatment for depression. This dataset covers 27 review articles (either systematic review, meta-analysis, or both) and 365 primary study articles addressing the relationship between physical activity and depressive symptoms. Primary study articles are manually extracted from the review articles. We used a custom-made workflow (Fu, Yuanxi. (2022). Scopus author info tool (1.0.1) [Python]. <a href="https://github.com/infoqualitylab/Scopus_author_info_collection">https://github.com/infoqualitylab/Scopus_author_info_collection</a>) that uses the Scopus API and manual work to extract and disambiguate authorship information for the 392 reports. The author information file (author_list.csv) is the product of this workflow and can be used to compute the co-author network of the 392 articles.
This dataset can be used to construct the inclusion network and the co-author network of the 27 review articles and 365 primary study articles. A primary study article is "included" in a review article if it is considered in the review article's evidence synthesis. Each included primary study article is cited in the review article, but not all references cited in a review article are included in the evidence synthesis or primary study articles. The inclusion network is a bipartite network with two types of nodes: one type represents review articles, and the other represents primary study articles. In an inclusion network, if a review article includes a primary study article, there is a directed edge from the review article node to the primary study article node. The attribute file (article_list.csv) includes attributes of the 392 articles, and the edge list file (inclusion_net_edges.csv) contains the edge list of the inclusion network.
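The inclusion network described above can be built directly from the edge list. A minimal sketch in plain Python, assuming each edge is a (review ID, primary-study ID) pair as rows of inclusion_net_edges.csv would supply (the actual column names in that file may differ):

```python
from collections import defaultdict

def build_inclusion_network(edges):
    """Build both directions of the bipartite inclusion network.

    `edges` is an iterable of (review_id, primary_id) pairs, one per
    directed edge from a review article to a primary study it includes.
    """
    includes = defaultdict(set)     # review -> primary studies it includes
    included_by = defaultdict(set)  # primary study -> reviews including it
    for review, primary in edges:
        includes[review].add(primary)
        included_by[primary].add(review)
    return includes, included_by

# Rows would normally come from csv.reader over inclusion_net_edges.csv.
inc, inc_by = build_inclusion_network([("R1", "P1"), ("R1", "P2"), ("R2", "P1")])
```

The reverse index (`included_by`) makes it easy to see which primary studies are reused across many reviews, one of the evidence-use patterns the dataset is meant to expose.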
Collectively, this dataset reflects the evidence production and use patterns within the exercise science and kinesiology scientific community, investigating the relationship between physical activity and depressive symptoms.
FILE FORMATS
1. article_list.csv - Unicode CSV
2. author_list.csv - Unicode CSV
3. Chinese_author_name_reference.csv - Unicode CSV
4. inclusion_net_edges.csv - Unicode CSV
5. review_article_details.csv - Unicode CSV
6. supplementary_reference_list.pdf - PDF
7. README.txt - text file
8. systematic_review_inclusion_criteria.csv - Unicode CSV
<b>UPDATES IN THIS VERSION COMPARED TO V3</b> (Clarke, Caitlin; Lischwe Mueller, Natalie; Joshi, Manasi Ballal; Fu, Yuanxi; Schneider, Jodi (2023): The Inclusion Network of 27 Review Articles Published between 2013-2018 Investigating the Relationship Between Physical Activity and Depressive Symptoms. University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-4614455_V3)
- We added a new file systematic_review_inclusion_criteria.csv.
keywords:
systematic reviews; meta-analyses; evidence synthesis; network visualization; tertiary studies; physical activity; depressive symptoms; exercise; review articles
published:
2022-07-25
This dataset is derived from the raw dataset (https://doi.org/10.13012/B2IDB-4950847_V1) and collects entity mentions that were manually determined to be noisy, non-species entities.
keywords:
synthetic biology; NERC data; species mentions; noisy entities
published:
2022-07-25
This dataset is derived from the raw entity mention dataset (https://doi.org/10.13012/B2IDB-4950847_V1) for species entities and represents those that were determined to be species (i.e., were not noisy entities) but for which no corresponding concept could be found in the NCBI taxonomy database.
keywords:
synthetic biology; NERC data; species mentions; not found entities