Illinois Data Bank Dataset Search Results
Results
published:
2019-08-29
Nardulli, Peter; Peyton, Buddy; Bajjalieh, Joseph; Singh, Ajay; Martin, Michael; Shalmon, Dan; Althaus, Scott
(2019)
This is part of the Cline Center’s ongoing Social, Political and Economic Event Database (SPEED) project. Each observation represents an event involving civil unrest, repression, or political violence in Sierra Leone, Liberia, and the Philippines (1979-2009). These data were produced in an effort to describe the relationship between exploitation of natural resources and civil conflict, and to identify policy interventions that might address resource-related grievances and mitigate civil strife.
This work is the result of a collaboration between the US Army Corps of Engineers’ Construction Engineer Research Laboratory (ERDC-CERL), the Swedish Defence Research Agency (FOI) and the Cline Center for Advanced Social Research (CCASR). The project team selected case studies focused on nations with a long history of civil conflict, as well as lucrative natural resources.
The Cline Center extracted these events from country-specific articles published in English by the British Broadcasting Corporation (BBC) Summary of World Broadcasts (SWB) from 1979-2008 and by the CIA’s Foreign Broadcast Information Service (FBIS) from 1999-2004. Articles were selected if they mentioned a country of interest and were tagged as relevant by a machine learning-based classification algorithm built by the Cline Center. Trained analysts extracted nearly 10,000 events from nearly 5,000 documents. The codebook—available in PDF form below—describes the data and production process in greater detail.
keywords:
Cline Center for Advanced Social Research; civil unrest; Social Political Economic Event Dataset (SPEED); political; event data; war; conflict; protest; violence; social; SPEED; Cline Center; Political Science
published:
2022-01-20
This dataset provides a 50-state (and DC) survey of state-level tax credits modeled after the federal New Markets Tax Credit program, including summaries of the tax credit amount and credit periods, key definitions, eligibility criteria, application process, and degree of conformity to federal law.
keywords:
New Markets Tax Credits; NMTC; tax incentives; state law
published:
2022-01-20
This dataset provides a 50-state (and DC) survey of state-level enterprise zone laws, including summaries and analyses of zone eligibility criteria, eligible investments, incentives to invest in human capital and affordable housing, and taxpayer eligibility.
keywords:
Enterprise Zones; tax incentives; state law
published:
2022-02-20
Proescholdt, Randi; Hsiao, Tzu-Kun; Schneider, Jodi; Cohen, Aaron; McDonagh, Marian; Smalheiser, Neil
(2022)
This dataset contains the files used to perform the work savings and recall evaluation in the study "Testing a filtering strategy for systematic reviews: Evaluating work savings and recall."
keywords:
systematic reviews; machine learning; work savings; recall; search results filtering
published:
2022-01-14
This dataset provides a 50-state (and DC) survey of state-level Opportunity Zones laws, including summaries of states' Opportunity Zone tax preferences, supplemental tax preferences, and approach to Opportunity Zones conformity. Data was last updated on January 14, 2022.
keywords:
Opportunity Zones; tax incentives; state law
published:
2022-02-04
Addepalli, Amulya; Ann Subin, Karen; Schneider, Jodi
(2022)
keywords:
retracted papers; knowledge maintenance; keystone citations; Wakefield; misinformation in science; Information Quality Lab
published:
2022-02-09
Kansara, Yogeshwar; Hoang, Khanh Linh
(2022)
The data file contains a list of articles and their RCT Tagger prediction scores, which were used in a project associated with the manuscript "Evaluation of publication type tagging as a strategy to screen randomized controlled trial articles in preparing systematic reviews".
keywords:
Cochrane reviews; automation; randomized controlled trial; RCT; systematic reviews
published:
2023-09-19
Salami, Malik Oyewale; Lee, Jou; Schneider, Jodi
(2023)
We used the following keywords files to identify categories for journals and conferences not in Scopus, for our STI 2023 paper "Assessing the agreement in retraction indexing across 4 multidisciplinary sources: Crossref, Retraction Watch, Scopus, and Web of Science".
The first four text files each contain keywords/content words in the form: 'keyword1', 'keyword2', 'keyword3', .... The file title indicates the name of the category:
file1: healthscience_words.txt
file2: lifescience_words.txt
file3: physicalscience_words.txt
file4: socialscience_words.txt
The first four files were generated from a combination of software and manual review in an iterative process in which we:
- Manually reviewed venue titles we were not able to automatically categorize using the Scopus categorization or by extending it as a resource.
- Iteratively reviewed uncategorized venue titles to manually curate additional keywords as content words indicating a venue title could be classified in the category healthscience, lifescience, physicalscience, or socialscience. We used English content words and added words we could automatically translate to identify content words. NOTE: Terminology with multiple potential meanings, or containing non-English words that did not yield useful automatic translations (e.g., Al-Masāq), was not selected as content words.
The fifth text file is a list of stopwords in the form: 'stopword1', 'stopword2', 'stopword3', ...
file5: stopwords.txt
This file contains manually curated stopwords from venue titles, used to handle non-content words such as 'conference' and 'journal'.
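To illustrate how these files might be used, here is a minimal Python sketch that loads the keyword and stopword files and assigns categories to a venue title by simple content-word matching. The file names follow the list above; the tokenization and matching logic is an illustrative assumption, not the exact procedure used for the paper.

    # Minimal sketch: categorize a venue title with the keyword and stopword files.
    # Assumes each file holds comma-separated quoted terms: 'keyword1', 'keyword2', ...
    import re

    def load_terms(path):
        with open(path, encoding="utf-8") as f:
            return {t.strip().strip("'").lower() for t in f.read().split(",") if t.strip()}

    categories = {name: load_terms(name + "_words.txt")
                  for name in ["healthscience", "lifescience", "physicalscience", "socialscience"]}
    stopwords = load_terms("stopwords.txt")

    def categorize(venue_title):
        words = [w for w in re.findall(r"\w+", venue_title.lower()) if w not in stopwords]
        return sorted(cat for cat, terms in categories.items() if any(w in terms for w in words))

    # Example: the category returned depends on the curated keyword lists.
    print(categorize("Journal of Clinical Epidemiology"))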
This dataset is a revision of the following dataset:
Version 1: Lee, Jou; Schneider, Jodi: Keywords for manual field assignment for Assessing the agreement in retraction indexing across 4 multidisciplinary sources: Crossref, Retraction Watch, Scopus, and Web of Science. University of Illinois at Urbana-Champaign Data Bank.
Changes from Version 1 to Version 2:
- Added one author
- Added a stopwords file that was used in our data preprocessing.
- Thoroughly reviewed each of the 4 keywords lists. In particular, we added UTF-8 terminology, removed some non-content words and misclassified content words, and extensively reviewed non-English keywords.
keywords:
health science keywords; scientometrics; stopwords; field; keywords; life science keywords; physical science keywords; science of science; social science keywords; meta-science; RISRS
published:
2022-07-25
A set of chemical entity mentions derived from an NERC dataset analyzing 900 synthetic biology articles published by the ACS. This data is associated with the Synthetic Biology Knowledge System repository (https://web.synbioks.org/). The data in this dataset are raw mentions from the NERC data.
keywords:
synthetic biology; NERC data; chemical mentions
published:
2021-04-15
To generate the bibliographic and survey data supporting a data reuse study conducted by several Library faculty and accepted for publication in the Journal of Academic Librarianship, the project team used a series of web-based scripts that employed several different endpoints of the Scopus API. The related dataset "Data for: An Examination of Data Reuse Practices within Highly Cited Articles of Faculty at a Research University" contains the survey design and results.
1) getScopus_API_process_dmp_IDB.asp: used the Scopus Search API to query for papers by UIUC authors published in 2015 -- limited to one of 9 pre-defined Scopus subject areas -- and to retrieve metadata results sorted highest to lowest by the number of times the retrieved articles were cited. The URL for the basic searches took the following form: https://api.elsevier.com/content/search/scopus?query=(AFFIL%28(urbana%20OR%20champaign) AND univ*%29) OR (AF-ID(60000745) OR AF-ID(60005290))&apikey=xxxxxx&start=" & nstart & "&count=25&date=2015&view=COMPLETE&sort=citedby-count&subj=PHYS
Here, the variable nstart was incremented by 25 each iteration and 25 records were retrieved in each pass. The subject area was renamed (e.g., from PHYS to COMP for computer science) in each of the 9 runs. This script does not use the Scopus API cursor but downloads 25 records at a time for up to 28 times -- or 675 maximum bibliographic records. The project team felt that looking at the 675 most-cited articles from UIUC faculty in each of the 9 subject areas was sufficient to gather a robust, representative sample of articles from 2015. These downloaded records were stored in a temporary table that was renamed for each of the 9 subject areas. (A minimal Python request sketch illustrating this pagination follows the script list below.)
2) get_citing_from_surveys_IDB.asp: takes a Scopus article ID (eid) from each of the 49 surveys returned by UIUC authors and retrieves short citing-article references, 200 at a time, into a temporary composite table. These citing records contain only one author, no author affiliations, and no author email addresses. This script uses the Scopus API cursor=* feature and is able to download all the citing references of an article 200 records at a time.
3) put_in_all_authors_affil_IDB.asp: adds important data to the short citing records. The script adds all co-authors and their affiliations, the corresponding author, and author email addresses.
4) process_for_final_IDB.asp: creates a relational database table with author, title, and source journal information for each of the citing articles that can be copied as an Excel file for processing by the Qualtrics survey software. This was initially 4,626 citing articles across the 49 UIUC-authored articles, but was reduced to 2,041 entries after checking for available email addresses and eliminating duplicates.
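As an illustration of the retrieval pattern in script 1, the following minimal Python sketch issues the same paginated Scopus Search API request. The query mirrors the URL form shown above; the API key, the 25-records-per-pass loop bound, and the response handling are illustrative placeholders rather than the original ASP implementation.

    # Minimal sketch of the paginated Scopus Search API retrieval described in script 1.
    import requests

    API_KEY = "xxxxxx"  # placeholder; replace with a valid Elsevier API key
    QUERY = "(AFFIL((urbana OR champaign) AND univ*)) OR (AF-ID(60000745) OR AF-ID(60005290))"

    records = []
    for nstart in range(0, 675, 25):  # 25 records per pass, as in the original script
        resp = requests.get(
            "https://api.elsevier.com/content/search/scopus",
            params={"query": QUERY, "apikey": API_KEY, "start": nstart, "count": 25,
                    "date": "2015", "view": "COMPLETE", "sort": "citedby-count",
                    "subj": "PHYS"},  # renamed (e.g., to COMP) in each of the 9 runs
            timeout=30,
        )
        resp.raise_for_status()
        records.extend(resp.json().get("search-results", {}).get("entry", []))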
keywords:
Scopus API; Citing Records; Most Cited Articles
published:
2020-09-27
Vandewalle, Rebecca; Barley, William; Padmanabhan, Anand; Katz, Daniel S.; Wang, Shaowen
(2020)
This dataset contains the R code used to produce the figures submitted in the manuscript titled "Understanding the multifaceted geospatial software ecosystem: a survey approach". The raw survey data used to populate these charts cannot be shared due to the survey consent agreement.
keywords:
R; figures; geospatial software
published:
2022-07-25
Related to the raw entity mentions (https://doi.org/10.13012/B2IDB-4163883_V1), this dataset represents the effects of the data cleaning process and collates all of the entity mentions which were too ambiguous to successfully link to the ChEBI ontology.
keywords:
synthetic biology; NERC data; chemical mentions; ambiguous entities
published:
2019-04-05
Dong, Xiaoru; Xie, Jingyi; Hoang, Linh
(2019)
File Name: Inclusion_Criteria_Annotation.csv
Data Preparation: Xiaoru Dong
Date of Preparation: 2019-04-04
Data Contributions: Jingyi Xie, Xiaoru Dong, Linh Hoang
Data Source: Cochrane systematic reviews published up to January 3, 2018 by 52 different Cochrane groups in 8 Cochrane group networks.
Associated Manuscript authors: Xiaoru Dong, Jingyi Xie, Linh Hoang, and Jodi Schneider.
Associated Manuscript, Working title: Machine classification of inclusion criteria from Cochrane systematic reviews.
Description: The file contains lists of inclusion criteria of Cochrane Systematic Reviews and the manual annotation results. 5420 inclusion criteria were annotated, out of 7158 inclusion criteria available. Annotations are either "Only RCTs" or "Others". There are 2 columns in the file:
- "Inclusion Criteria": Content of inclusion criteria of Cochrane Systematic Reviews.
- "Only RCTs": Manual Annotation results. In which, "x" means the inclusion criteria is classified as "Only RCTs". Blank means that the inclusion criteria is classified as "Others".
Notes:
1. "RCT" stands for Randomized Controlled Trial, which, in definition, is "a work that reports on a clinical trial that involves at least one test treatment and one control treatment, concurrent enrollment and follow-up of the test- and control-treated groups, and in which the treatments to be administered are selected by a random process, such as the use of a random-numbers table." [Randomized Controlled Trial publication type definition from https://www.nlm.nih.gov/mesh/pubtypes.html].
2. In order to reproduce the data relevant to this file, please get the project code published on GitHub at: https://github.com/XiaoruDong/InclusionCriteria and run the code following the instructions provided.
3. This datafile (V2) is an updated version of the datafile published at https://doi.org/10.13012/B2IDB-5958960_V1, with some minor spelling mistakes in the data fixed.
keywords:
Inclusion criteria; Randomized controlled trials; Machine learning; Systematic reviews
published:
2021-07-30
Proescholdt, Randi
(2021)
This data comes from a scoping review associated with the project called Reducing the Inadvertent Spread of Retracted Science. The data summarizes the fields that have been explored by existing research on retraction, a list of studies comparing retraction in different fields, and a list of studies focused on retraction of COVID-19 articles.
keywords:
retraction; fields; disciplines; research integrity
published:
2020-09-02
Schneider, Jodi; Ye, Di; Hill, Alison
(2020)
Citation context annotation. This dataset is a second version (V2) and part of the supplemental data for Jodi Schneider, Di Ye, Alison Hill, and Ashley Whitehorn. (2020) "Continued post-retraction citation of a fraudulent clinical trial report, eleven years after it was retracted for falsifying data". Scientometrics. In press, DOI: 10.1007/s11192-020-03631-1
Publications were selected by examining all citations to the retracted paper Matsuyama 2005, and selecting the 35 citing papers, published 2010 to 2019, which do not mention the retraction, but which mention the methods or results of the retracted paper (called "specific" in Ye, Di; Hill, Alison; Whitehorn (Fulton), Ashley; Schneider, Jodi (2020): Citation context annotation for new and newly found citations (2006-2019) to retracted paper Matsuyama 2005. University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-8150563_V1 ). The annotated citations are second-generation citations to the retracted paper Matsuyama 2005 (RETRACTED: Matsuyama W, Mitsuyama H, Watanabe M, Oonakahara KI, Higashimoto I, Osame M, Arimura K. Effects of omega-3 polyunsaturated fatty acids on inflammatory markers in COPD. Chest. 2005 Dec 1;128(6):3817-27.), retracted in 2008 (Retraction in: Chest (2008) 134:4 (893) https://doi.org/10.1016/S0012-3692(08)60339-6).
OVERALL DATA for VERSION 2 (V2)
FILES/FILE FORMATS
Same data in two formats:
2010-2019 SG to specific not mentioned FG.csv - Unicode CSV (preservation format only) - same as in V1
2010-2019 SG to specific not mentioned FG.xlsx - Excel workbook (preferred format) - same as in V1
Additional files in V2:
2G-possible-misinformation-analyzed.csv - Unicode CSV (preservation format only)
2G-possible-misinformation-analyzed.xlsx - Excel workbook (preferred format)
ABBREVIATIONS:
2G - Refers to a second-generation citation of Matsuyama (an article that cites a direct citation of Matsuyama)
FG - Refers to the direct citation of Matsuyama (the one the second-generation item cites)
COLUMN HEADER EXPLANATIONS
File name: 2G-possible-misinformation-analyzed. Other column headers in this file have the same meaning as explained in V1. The following are additional header explanations:
Quote Number - The order of the quote (citation context citing the first generation article given in "FG in bibliography") in the second generation article (given in "2G article")
Quote - The text of the quote (citation context citing the first generation article given in "FG in bibliography") in the second generation article (given in "2G article")
Translated Quote - English translation of "Quote", automatic translation obtained via Google Scholar
Seriousness/Risk - Our assessment of the risk of misinformation and its seriousness
2G topic - Our assessment of the topic of the citing article (the second generation article given in "2G article")
2G section - The section of the citing article (the second generation article given in "2G article") in which the cited article (the first generation article given in "FG in bibliography") was found
FG in bib type - The type of article (e.g., review article), referring to the cited article (the first generation article given in "FG in bibliography")
FG in bib topic - Our assessment of the topic of the cited article (the first generation article given in "FG in bibliography")
FG in bib section - The section of the cited article (the first generation article given in "FG in bibliography") in which the Matsuyama retracted paper was cited
keywords:
citation context annotation; retraction; diffusion of retraction; second-generation citation context analysis
published:
2020-12-16
Althaus, Scott; Bajjalieh, Joseph; Jungblut, Marc; Shalmon, Dan; Ghosh, Subhankar; Joshi, Pradnyesh
(2020)
Terrorism is among the most pressing challenges to democratic governance around the world. The Responsible Terrorism Coverage (or ResTeCo) project aims to address a fundamental dilemma facing 21st century societies: how to give citizens the information they need without giving terrorists the kind of attention they want. The ResTeCo project hopes to inform best practices by using extreme-scale text analytic methods to extract information from more than 70 years of terrorism-related media coverage from around the world and across 5 languages. Our goal is to expand the available data on media responses to terrorism and enable the development of empirically-validated models for socially responsible, effective news organizations.
This particular dataset contains information extracted from terrorism-related stories in the Foreign Broadcast Information Service (FBIS) published between 1995 and 2013. It includes variables that measure the relative share of terrorism-related topics, the valence and intensity of emotional language, as well as the people, places, and organizations mentioned.
This dataset contains 3 files:
1. "ResTeCo Project FBIS Dataset Variable Descriptions.pdf"
A detailed codebook containing a summary of the Responsible Terrorism Coverage (ResTeCo) Project Foreign Broadcast Information Service (FBIS) Dataset and descriptions of all variables.
2. "resteco-fbis.csv"
This file contains the data extracted from terrorism-related media coverage in the Foreign Broadcast Information Service (FBIS) between 1995 and 2013. It includes variables that measure the relative share of topics, sentiment, and emotion present in this coverage. There are also variables that contain metadata and list the people, places, and organizations mentioned in these articles. There are 53 variables and 750,971 observations. The variable "id" uniquely identifies each observation. Each observation represents a single news article.
Please note that care should be taken when using "resteco-fbis.csv". The file may not be suitable for use in a spreadsheet program like Excel, as some of the values are quite large. Excel cannot handle some of these large values, which may cause the data to appear corrupted within the software. We encourage users to work with this file in a statistical package such as Stata, R, or Python to ensure the structure and quality of the data are preserved (a minimal Python loading sketch follows the file list below).
3. "README.md"
This file contains useful information for the user about the dataset. It is a text file written in Markdown.
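Consistent with the note above, a minimal Python (pandas) loading sketch; reading every column as text is our own precaution against large values being mangled, not a requirement of the dataset.

    # Minimal sketch: load resteco-fbis.csv without letting numeric parsing or a
    # spreadsheet program mangle very large values; read all columns as strings.
    import pandas as pd

    fbis = pd.read_csv("resteco-fbis.csv", dtype=str, keep_default_na=False)
    print(fbis.shape)            # expected: (750971, 53)
    print(fbis["id"].is_unique)  # "id" uniquely identifies each observation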
Citation Guidelines
1) To cite this codebook please use the following citation:
Althaus, Scott, Joseph Bajjalieh, Marc Jungblut, Dan Shalmon, Subhankar Ghosh, and Pradnyesh Joshi. 2020. Responsible Terrorism Coverage (ResTeCo) Project Foreign Broadcast Information Service (FBIS) Dataset Variable Descriptions. Responsible Terrorism Coverage (ResTeCo) Project Foreign Broadcast Information Service (FBIS) Dataset. Cline Center for Advanced Social Research. December 16. University of Illinois Urbana-Champaign. doi: https://doi.org/10.13012/B2IDB-6360821_V1
2) To cite the data please use the following citation:
Althaus, Scott, Joseph Bajjalieh, Marc Jungblut, Dan Shalmon, Subhankar Ghosh, and Pradnyesh Joshi. 2020. Responsible Terrorism Coverage (ResTeCo) Project Foreign Broadcast Information Service (FBIS) Dataset. Cline Center for Advanced Social Research. December 16. University of Illinois Urbana-Champaign. doi: https://doi.org/10.13012/B2IDB-6360821_V1
keywords:
Terrorism, Text Analytics, News Coverage, Topic Modeling, Sentiment Analysis
published:
2020-12-16
Althaus, Scott; Bajjalieh, Joseph; Jungblut, Marc; Shalmon, Dan; Ghosh, Subhankar; Joshi, Pradnyesh
(2020)
Terrorism is among the most pressing challenges to democratic governance around the world. The Responsible Terrorism Coverage (or ResTeCo) project aims to address a fundamental dilemma facing 21st century societies: how to give citizens the information they need without giving terrorists the kind of attention they want. The ResTeCo project hopes to inform best practices by using extreme-scale text analytic methods to extract information from more than 70 years of terrorism-related media coverage from around the world and across 5 languages. Our goal is to expand the available data on media responses to terrorism and enable the development of empirically-validated models for socially responsible, effective news organizations.
This particular dataset contains information extracted from terrorism-related stories in the Summary of World Broadcasts published between 1979 and 2019. It includes variables that measure the relative share of terrorism-related topics, the valence and intensity of emotional language, as well as the people, places, and organizations mentioned.
This dataset contains 3 files:
1. "ResTeCo Project SWB Dataset Variable Descriptions.pdf"
A detailed codebook containing a summary of the Responsible Terrorism Coverage (ResTeCo) Project BBC Summary of World Broadcasts (SWB) Dataset and descriptions of all variables.
2. "resteco-swb.csv"
This file contains the data extracted from terrorism-related media coverage in the BBC Summary of World Broadcasts (SWB) between 1979 and 2019. It includes variables that measure the relative share of topics, sentiment, and emotion present in this coverage. There are also variables that contain metadata and list the people, places, and organizations mentioned in these articles. There are 53 variables and 438,373 observations. The variable "id" uniquely identifies each observation. Each observation represents a single news article.
Please note that care should be taken when using "resteco-swb.csv". The file may not be suitable for use in a spreadsheet program like Excel, as some of the values are quite large. Excel cannot handle some of these large values, which may cause the data to appear corrupted within the software. We encourage users to work with this file in a statistical package such as Stata, R, or Python to ensure the structure and quality of the data are preserved.
3. "README.md"
This file contains useful information for the user about the dataset. It is a text file written in Markdown.
Citation Guidelines
1) To cite this codebook please use the following citation:
Althaus, Scott, Joseph Bajjalieh, Marc Jungblut, Dan Shalmon, Subhankar Ghosh, and Pradnyesh Joshi. 2020. Responsible Terrorism Coverage (ResTeCo) Project BBC Summary of World Broadcasts (SWB) Dataset Variable Descriptions. Responsible Terrorism Coverage (ResTeCo) Project BBC Summary of World Broadcasts (SWB) Dataset. Cline Center for Advanced Social Research. December 16. University of Illinois Urbana-Champaign. doi: https://doi.org/10.13012/B2IDB-2128492_V1
2) To cite the data please use the following citation:
Althaus, Scott, Joseph Bajjalieh, Marc Jungblut, Dan Shalmon, Subhankar Ghosh, and Pradnyesh Joshi. 2020. Responsible Terrorism Coverage (ResTeCo) Project BBC Summary of World Broadcasts (SWB) Dataset. Cline Center for Advanced Social Research. December 16. University of Illinois Urbana-Champaign. doi: https://doi.org/10.13012/B2IDB-2128492_V1
keywords:
Terrorism, Text Analytics, News Coverage, Topic Modeling, Sentiment Analysis
published:
2020-10-11
Narang, Kanika; Sundaram, Hari; Chung, Austin; Chaturvedi, Snigdha
(2020)
This dataset contains the publication record of 6429 computer science researchers collected from the Microsoft Academic dataset provided through their Knowledge Service API (http://bit.ly/microsoft-data).
published:
2020-05-04
Althaus, Scott; Bajjalieh, Joseph; Carter, John; Peyton, Buddy; Shalmon, Dan
(2020)
The Cline Center Historical Phoenix Event Data covers the period 1945-2019 and includes 8.2 million events extracted from 21.2 million news stories. This data was produced using the state-of-the-art PETRARCH-2 software to analyze content from the New York Times (1945-2018), the BBC Monitoring's Summary of World Broadcasts (1979-2019), the Wall Street Journal (1945-2005), and the Central Intelligence Agency’s Foreign Broadcast Information Service (1995-2004). It documents the agents, locations, and issues at stake in a wide variety of conflict, cooperation and communicative events in the Conflict and Mediation Event Observations (CAMEO) ontology.
The Cline Center produced these data with the generous support of Linowes Fellow and Faculty Affiliate Prof. Dov Cohen and help from our academic and private sector collaborators in the Open Event Data Alliance (OEDA).
For details on the CAMEO framework, see:
Schrodt, Philip A., Omür Yilmaz, Deborah J. Gerner, and Dennis Hermreck. "The CAMEO (conflict and mediation event observations) actor coding framework." In 2008 Annual Meeting of the International Studies Association. 2008. http://eventdata.parusanalytics.com/papers.dir/APSA.2005.pdf
Gerner, D.J., Schrodt, P.A. and Yilmaz, O., 2012. Conflict and mediation event observations (CAMEO) Codebook. http://eventdata.parusanalytics.com/cameo.dir/CAMEO.Ethnic.Groups.zip
For more information about PETRARCH and OEDA, see: http://openeventdata.org/
keywords:
OEDA; Open Event Data Alliance (OEDA); Cline Center; Cline Center for Advanced Social Research; civil unrest; petrarch; phoenix event data; violence; protest; political; conflict; political science
published:
2017-12-18
This dataset accompanies a thesis of the same title: Can fair use be adequately taught to Librarians? Assessing Librarians' confidence and comprehension in explaining fair use following an expert workshop.
keywords:
fair use; copyright
published:
2016-06-23
This dataset was extracted from a set of metadata files harvested from the DataCite metadata store (http://search.datacite.org/ui) during December 2015. Metadata records for items with a resourceType of "dataset" were collected, for a total of 1,647,949 records.
This dataset contains four files:
1) readme.txt: a readme file.
2) language-results.csv: A CSV file containing three columns: DOI, DOI prefix, and language text contents
3) language-counts.csv: A CSV file containing counts for unique language text content values.
4) language-grouped-counts.txt: A text file containing the results of manually grouping these language codes.
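As an illustration of how files 2 and 3 relate, a minimal Python (pandas) sketch that recomputes per-value language counts from language-results.csv; it assumes the column layout described above and is not the script used to build the dataset.

    # Minimal sketch: recompute the unique-language-value counts (file 3) from the
    # per-DOI language records (file 2).
    import pandas as pd

    results = pd.read_csv("language-results.csv", dtype=str)
    language_col = results.columns[-1]  # assumed order: DOI, DOI prefix, language text contents
    counts = results[language_col].fillna("").value_counts()
    counts.to_csv("language-counts-recomputed.csv")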
keywords:
datacite; metadata; language codes; repository data
published:
2020-05-13
Althaus, Scott; Bajjalieh, Joseph; Jungblut, Marc; Shalmon, Dan; Ghosh, Subhankar; Joshi, Pradnyesh
(2020)
Terrorism is among the most pressing challenges to democratic governance around the world. The Responsible Terrorism Coverage (or ResTeCo) project aims to address a fundamental dilemma facing 21st century societies: how to give citizens the information they need without giving terrorists the kind of attention they want. The ResTeCo project hopes to inform best practices by using extreme-scale text analytic methods to extract information from more than 70 years of terrorism-related media coverage from around the world and across 5 languages. Our goal is to expand the available data on media responses to terrorism and enable the development of empirically-validated models for socially responsible, effective news organizations.
This particular dataset contains information extracted from terrorism-related stories in the New York Times published between 1945 and 2018. It includes variables that measure the relative share of terrorism-related topics, the valence and intensity of emotional language, as well as the people, places, and organizations mentioned.
This dataset contains 3 files:
1. <i>"ResTeCo Project NYT Dataset Variable Descriptions.pdf"</i>
<ul> <li>A detailed codebook containing a summary of the Responsible Terrorism Coverage (ResTeCo) Project New York Times (NYT) Dataset and descriptions of all variables. </li>
</ul>
2. <i>"resteco-nyt.csv"</i>
<ul><li>This file contains the data extracted from terrorism-related media coverage in the New York Times between 1945 and 2018. It includes variables that measure the relative share of topics, sentiment, and emotion present in this coverage. There are also variables that contain metadata and list the people, places, and organizations mentioned in these articles. There are 53 variables and 438,373 observations. The variable "id" uniquely identifies each observation. Each observation represents a single news article. </li>
<li> <b>Please note</b> that care should be taken when using "resteco-nyt.csv". The file may not be suitable to use in a spreadsheet program like Excel as some of the values get to be quite large. Excel cannot handle some of these large values, which may cause the data to appear corrupted within the software. It is encouraged that a user of this data use a statistical package such as Stata, R, or Python to ensure the structure and quality of the data remains preserved.</li>
</ul>
3. <i>"README.md"</i>
<ul><li>This file contains useful information for the user about the dataset. It is a text file written in mark down language</li>
</ul>
<b>Citation Guidelines</b>
1) To cite this codebook please use the following citation:
Althaus, Scott, Joseph Bajjalieh, Marc Jungblut, Dan Shalmon, Subhankar Ghosh, and Pradnyesh Joshi. 2020. Responsible Terrorism Coverage (ResTeCo) Project New York Times (NYT) Dataset Variable Descriptions. Responsible Terrorism Coverage (ResTeCo) Project New York Times Dataset. Cline Center for Advanced Social Research. May 13. University of Illinois Urbana-Champaign. doi: 10.13012/B2IDB-4638196_V1
2) To cite the data please use the following citation:
Althaus, Scott, Joseph Bajjalieh, Marc Jungblut, Dan Shalmon, Subhankar Ghosh, and Pradnyesh Joshi. 2020. Responsible Terrorism Coverage (ResTeCo) Project New York Times Dataset. Cline Center for Advanced Social Research. May 13. University of Illinois Urbana-Champaign. doi: 10.13012/B2IDB-4638196_V1
keywords:
Terrorism, Text Analytics, News Coverage, Topic Modeling, Sentiment Analysis
published:
2025-05-28
This dataset captures ‘Hype’ and 'Diversity', including article-level (pmid) and author-level (auid) data, within biomedical abstracts sourced from PubMed. The selection comprises journal articles written in English, published between 1991 and 2014, totaling 421,580 articles (merged_df).
The classification of hype relies on the presence of specific candidate ‘hype words’ and their location in the abstract. Therefore, each article (PMID) may have multiple instances in the dataset due to the presence of multiple hype words in different abstract sentences. Diversity is classified for ethnicity, gender, academic age, and topical expertise of authors based on the Rao-Stirling diversity index.
File1: merged_auids.csv (Important columns defined)
• AUID: a unique ID for each author
• Genni: gender prediction
• Ethnea: ethnicity prediction
#################################################
File2: merged_df.csv (Important columns defined)
- pmid: unique paper
- auid: all unique auids (author-name unique identification)
- year: Year of paper publication
- no_authors: Author count
- journal: Journal name
- years: first year of publication for every author
- Country-temporal: Country of affiliation for every author
- h_index: Journal h-index
- TimeNovelty: Paper Time novelty
- nih_funded: Binary variable indicating funding for any author
- prior_cites_mean: Mean of all authors’ prior citation rate
- insti_impact: All unique institutions’ citation rate
- mesh_vals: Top MeSH values for every author of that paper
- hype_word: Candidate hype word, such as ‘novel'
- hype_value: Propensity of hype based on the hype word, the sentence, and the abstract location
- hype_percentile: Abstract relative position of hype word
- relative_citation_ratio: RCR
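Because an article (pmid) can appear once per detected hype word, a minimal Python (pandas) sketch of a typical first loading step is shown below; the aggregation choice (keeping the count of hype words and the maximum hype propensity per article) is illustrative only.

    # Minimal sketch: load both files and collapse merged_df to one row per article.
    import pandas as pd

    authors = pd.read_csv("merged_auids.csv")  # AUID, Genni, Ethnea, ...
    papers = pd.read_csv("merged_df.csv")      # one row per (pmid, hype-word instance)

    per_article = (papers.groupby("pmid", as_index=False)
                         .agg(n_hype_words=("hype_word", "count"),
                              max_hype_value=("hype_value", "max")))
    print(per_article.head())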
keywords:
Hype; Diversity; PubMed; Abstracts; Scientometrics; Biomedicine
published:
2021-05-01
Cheng, Ti-Chung; Li, Tiffany Wenting; Karahalios, Karrie; Sundaram, Hari
(2021)
This is the first version of the dataset.
This dataset contains anonymized data collected during the experiments described in the publication "I can show what I really like.": Eliciting Preferences via Quadratic Voting, to appear in April 2021.
Once the publication link is public, we will provide an update here.
These data were collected through our open-source online systems, which are available at https://github.com/a2975667/QV-app (experiment 1) and https://github.com/a2975667/QV-buyback (experiment 2).
There are two folders in this dataset. The first folder (exp1_data) contains data collected during experiment 1; the second folder (exp2_data) contains data collected during experiment 2.
keywords:
Quadratic Voting; Likert scale; Empirical studies; Collective decision-making
published:
2019-09-17
Mishra, Shubhanshu
(2019)
Trained models for multi-task multi-dataset learning for text classification as well as sequence tagging in tweets.
Classification tasks include sentiment prediction, abusive content, sarcasm, and veridicality.
Sequence tagging tasks include POS tagging, NER, chunking, and supersense tagging.
Models were trained using: https://github.com/socialmediaie/SocialMediaIE/blob/master/SocialMediaIE/scripts/multitask_multidataset_classification_tagging.py
See https://github.com/socialmediaie/SocialMediaIE and https://socialmediaie.github.io for details.
If you are using this data, please also cite the related article:
Shubhanshu Mishra. 2019. Multi-dataset-multi-task Neural Sequence Tagging for Information Extraction from Tweets. In Proceedings of the 30th ACM Conference on Hypertext and Social Media (HT '19). ACM, New York, NY, USA, 283-284. DOI: https://doi.org/10.1145/3342220.3344929
keywords:
twitter; deep learning; machine learning; trained models; multi-task learning; multi-dataset learning; classification; sequence tagging