Displaying Datasets 1-25 of 57 in total

published: 2019-12-22
 
Dataset providing the calculation of a Competition Index (CI) for Late Pleistocene carnivore guilds in Laos and Vietnam and their relationship to humans. Prey mass spectra, prey focus masses, and prey class raw data can be used to calculate the CI following Hemmer (2004). Mass estimates were calculated for each species following Van Valkenburgh (1990). Full citations to the methodological papers are included as relationships with other resources.
keywords: competition; Southeast Asia; carnivores; humans
published: 2019-12-10
 
The Cline Center Historical Phoenix Event Data covers the period 1945-2019 and includes several million events extracted from 18.9 million news stories. This data was produced using the state-of-the-art PETRARCH-2 software to analyze content from the New York Times (1945-2018), BBC Monitoring's Summary of World Broadcasts (1979-2019), and the Central Intelligence Agency's Foreign Broadcast Information Service (1995-2004). It documents the agents, locations, and issues at stake in a wide variety of conflict, cooperation, and communicative events in the CAMEO ontology. The Cline Center produced this data with the generous support of Linowes Fellow and Faculty Affiliate Prof. Dov Cohen and help from our academic and private-sector collaborators in the Open Event Data Alliance (OEDA).
keywords: OEDA; Open Event Data Alliance (OEDA); Cline Center; Cline Center for Advanced Social Research; civil unrest; petrarch; phoenix event data; violence; protest; political; social; political science
published: 2019-10-16
 
Human annotations of randomly selected judged documents from the AP 88-89, Robust 2004, WT10g, and GOV2 TREC collections. Seven annotators were asked to read documents in their entirety and then select up to ten terms they felt best represented the main topic(s) of the document. Terms were chosen from among a set sampled from the document in question and from related documents.
keywords: TREC; information retrieval; document topicality; document description
published: 2019-10-30
 
Data used for network analyses.

deduplicated nodes.csv - List of all nodes in the network. Each node represents a paper.
Column headings: ID, Authors, Title, Year
Descriptions of column headings:
ID - ID for the paper/node. IDs follow these conventions:
- A00 represents the retracted Matsuyama paper.
- F### represents a first-generation citation that directly cites the retracted Matsuyama paper.
- F###S### represents a second-generation citation that does not cite the retracted Matsuyama paper but cites some first-generation citation (where F### is one of the first-generation articles it cites).
Authors - Authors of the paper
Title - Title of the paper (some in Unicode)
Year - Year of publication of the paper; either a 4-digit year or NA (indicating that the first data source we got it from either did not provide a year or provided a year we deemed unreliable)
NOTE: Authors/Title/Year were taken primarily from Google Scholar (since it had the larger number of items), with unique items from Web of Science added.

deduplicated edges.csv - List of all edges in the network. Each edge represents a citation between two papers.
Column headings: from, to
Descriptions of column headings:
from - ID for the cited paper. This is what the citation goes FROM.
to - ID for the citing paper. This is what the citation goes TO.
NOTE: All IDs are from deduplicated nodes.csv and follow the conventions above.

nodesFG.txt - List of the IDs for the 135 first-generation citations, from 2005 (when Matsuyama was published) through 2018.

nodesSGnotFG.txt - List of the IDs for the 2559 second-generation citations that are not first-generation citations, from 2005 (when Matsuyama was published) through 2018.
keywords: citations; retraction; network analysis; Web of Science; Google Scholar; indirect citation
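A minimal sketch (not the authors' analysis code) of loading the files described above into a directed citation graph with pandas and networkx; file and column names are taken from the description, and the edge direction follows the stated convention (from = cited paper, to = citing paper).

```python
# Minimal sketch: build the citation graph from the files described above.
# File and column names follow the dataset description; illustrative only.
import pandas as pd
import networkx as nx

nodes = pd.read_csv("deduplicated nodes.csv")   # columns: ID, Authors, Title, Year
edges = pd.read_csv("deduplicated edges.csv")   # columns: from (cited), to (citing)

G = nx.DiGraph()
G.add_nodes_from(nodes["ID"])
G.add_edges_from(edges[["from", "to"]].itertuples(index=False, name=None))

# IDs of the 135 first-generation citations, one per line.
with open("nodesFG.txt") as fh:
    first_gen = [line.strip() for line in fh if line.strip()]

print(G.number_of_nodes(), "nodes,", G.number_of_edges(), "edges,",
      len(first_gen), "first-generation citations")
```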
published: 2019-11-12
 
We are sharing the tweet IDs of four social movements: #BlackLivesMatter, #WhiteLivesMatter, #AllLivesMatter, and #BlueLivesMatter. The tweets were collected between May 1, 2015 and May 30, 2017. We limited the location to the United States and focused on extracting original tweets, excluding retweets. Recommended citations for the data: Rezapour, R. (2019). Data for: How do Moral Values Differ in Tweets on Social Movements?. University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-9614170_V1 and Rezapour, R., Ferronato, P., and Diesner, J. (2019). How do moral values differ in tweets on social movements?. In 2019 Computer Supported Cooperative Work and Social Computing Companion Publication (CSCW’19 Companion), Austin, TX.
keywords: Twitter; social movements; black lives matter; blue lives matter; all lives matter; white lives matter
published: 2019-09-17
 
Trained models for multi-task multi-dataset learning for text classification as well as sequence tagging in tweets. Classification tasks include sentiment prediction, abusive content, sarcasm, and veridicality. Sequence tagging tasks include POS, NER, Chunking, and SuperSenseTagging. Models were trained using: <a href="https://github.com/socialmediaie/SocialMediaIE/blob/master/SocialMediaIE/scripts/multitask_multidataset_classification_tagging.py">https://github.com/socialmediaie/SocialMediaIE/blob/master/SocialMediaIE/scripts/multitask_multidataset_classification_tagging.py</a> See <a href="https://github.com/socialmediaie/SocialMediaIE">https://github.com/socialmediaie/SocialMediaIE</a> and <a href="https://socialmediaie.github.io">https://socialmediaie.github.io</a> for details. If you are using this data, please also cite the related article: Shubhanshu Mishra. 2019. Multi-dataset-multi-task Neural Sequence Tagging for Information Extraction from Tweets. In Proceedings of the 30th ACM Conference on Hypertext and Social Media (HT '19). ACM, New York, NY, USA, 283-284. DOI: https://doi.org/10.1145/3342220.3344929
keywords: twitter; deep learning; machine learning; trained models; multi-task learning; multi-dataset learning; classification; sequence tagging
published: 2019-09-17
 
Trained models for multi-task multi-dataset learning for text classification in tweets. Classification tasks include sentiment prediction, abusive content, sarcasm, and veridicality. Models were trained using: <a href="https://github.com/socialmediaie/SocialMediaIE/blob/master/SocialMediaIE/scripts/multitask_multidataset_classification.py">https://github.com/socialmediaie/SocialMediaIE/blob/master/SocialMediaIE/scripts/multitask_multidataset_classification.py</a> See <a href="https://github.com/socialmediaie/SocialMediaIE">https://github.com/socialmediaie/SocialMediaIE</a> and <a href="https://socialmediaie.github.io">https://socialmediaie.github.io</a> for details. If you are using this data, please also cite the related article: Shubhanshu Mishra. 2019. Multi-dataset-multi-task Neural Sequence Tagging for Information Extraction from Tweets. In Proceedings of the 30th ACM Conference on Hypertext and Social Media (HT '19). ACM, New York, NY, USA, 283-284. DOI: https://doi.org/10.1145/3342220.3344929
keywords: twitter; deep learning; machine learning; trained models; multi-task learning; multi-dataset learning; sentiment; sarcasm; abusive content;
published: 2019-09-17
 
Trained models for multi-task multi-dataset learning for sequence tagging in tweets. Sequence tagging tasks include POS, NER, Chunking, and SuperSenseTagging. Models were trained using: <a href="https://github.com/socialmediaie/SocialMediaIE/blob/master/SocialMediaIE/scripts/multitask_multidataset_experiment.py">https://github.com/socialmediaie/SocialMediaIE/blob/master/SocialMediaIE/scripts/multitask_multidataset_experiment.py</a> See <a href="https://github.com/socialmediaie/SocialMediaIE">https://github.com/socialmediaie/SocialMediaIE</a> and <a href="https://socialmediaie.github.io">https://socialmediaie.github.io</a> for details. If you are using this data, please also cite the related article: Shubhanshu Mishra. 2019. Multi-dataset-multi-task Neural Sequence Tagging for Information Extraction from Tweets. In Proceedings of the 30th ACM Conference on Hypertext and Social Media (HT '19). ACM, New York, NY, USA, 283-284. DOI: https://doi.org/10.1145/3342220.3344929
keywords: twitter; deep learning; machine learning; trained models; multi-task learning; multi-dataset learning;
published: 2019-08-29
 
This is part of the Cline Center’s ongoing Social, Political and Economic Event Database Project (SPEED). Each observation represents an event involving civil unrest, repression, or political violence in Sierra Leone, Liberia, and the Philippines (1979-2009). These data were produced in an effort to describe the relationship between exploitation of natural resources and civil conflict, and to identify policy interventions that might address resource-related grievances and mitigate civil strife. This work is the result of a collaboration between the US Army Corps of Engineers’ Construction Engineering Research Laboratory (ERDC-CERL), the Swedish Defence Research Agency (FOI), and the Cline Center for Advanced Social Research (CCASR). The project team selected case studies focused on nations with a long history of civil conflict as well as lucrative natural resources. The Cline Center extracted these events from country-specific articles published in English by the British Broadcasting Corporation (BBC) Summary of World Broadcasts (SWB) from 1979-2008 and the CIA’s Foreign Broadcast Information Service (FBIS) from 1999-2004. Articles were selected if they mentioned a country of interest and were tagged as relevant by a machine learning-based classification algorithm built by the Cline Center. Trained analysts extracted nearly 10,000 events from nearly 5,000 documents. The codebook, available in PDF form below, describes the data and production process in greater detail.
keywords: Cline Center for Advanced Social Research; civil unrest; Social Political Economic Event Dataset (SPEED); political; event data; war; conflict; protest; violence; social; SPEED; Cline Center; Political Science
published: 2019-09-06
 
This is a dataset of 1101 comments from The New York Times (May 1, 2015-August 31, 2015) that contain a mention of the stemmed words vaccine or vaxx.
keywords: vaccine;online comments
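For illustration only, a sketch of the kind of stem match described above (comments mentioning "vaccin*" or "vaxx*"); the authors' exact selection pipeline is not specified, so treat this as an assumption.

```python
# Illustrative stem match for comments mentioning "vaccin*" or "vaxx*";
# an assumed reconstruction of the selection rule, not the authors' code.
import re

VACCINE_PATTERN = re.compile(r"\b(vaccin|vaxx)\w*", re.IGNORECASE)

comments = [
    "The vaccination debate keeps coming up in these articles.",
    "This comment is about something else entirely.",
]
matching = [c for c in comments if VACCINE_PATTERN.search(c)]
print(matching)
```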
published: 2019-07-08
 
# Overview

These datasets were created in conjunction with the dissertation "Predicting Controlled Vocabulary Based on Text and Citations: Case Studies in Medical Subject Headings in MEDLINE and Patents," by Adam Kehoe. The datasets consist of the following:

* twin_not_abstract_matched_complete.tsv: a tab-delimited file consisting of pairs of MEDLINE articles with identical titles, authors and years of publication. This file contains the PMIDs of the duplicate publications, as well as their medical subject headings (MeSH) and three measures of their indexing consistency.
* twin_abstract_matched_complete.tsv: the same as above, except that the MEDLINE articles also have matching abstracts.
* mesh_training_data.csv: a comma-separated file containing the training data for the model discussed in the dissertation.
* mesh_scores.tsv: a tab-delimited file containing a pairwise similarity score based on word embeddings, and the MeSH hierarchy relationship.

## Duplicate MEDLINE Publications

Both twin_not_abstract_matched_complete.tsv and twin_abstract_matched_complete.tsv have the same structure, with the following columns:

1. pmid_one: the PubMed unique identifier of the first paper
2. pmid_two: the PubMed unique identifier of the second paper
3. mesh_one: a list of medical subject headings (MeSH) from the first paper, delimited by the "|" character
4. mesh_two: a list of medical subject headings from the second paper, delimited by the "|" character
5. hoopers_consistency: the calculation of Hooper's consistency between the MeSH of the first and second paper
6. nonhierarchicalfree: a word embedding based consistency score described in the dissertation
7. hierarchicalfree: a word embedding based consistency score additionally limited by the MeSH hierarchy, described in the dissertation

## MeSH Training Data

The mesh_training_data.csv file contains the training data for the model discussed in the dissertation. It has the following columns:

1. pmid: the PubMed unique identifier of the paper
2. term: a candidate MeSH term
3. cit_count: the log of the frequency of the term in the citation candidate set
4. total_cit: the log of the total number of the paper's citations
5. citr_count: the log of the frequency of the term in the citations of the paper's citations
6. total_citofcit: the log of the total number of the citations of the paper's citations
7. absim_count: the log of the frequency of the term in the AbSim candidate set
8. total_absim_count: the log of the total number of AbSim records for the paper
9. absimr_count: the log of the frequency of the term in the citations of the AbSim records
10. total_absimr_count: the log of the total number of citations of the AbSim record
11. log_medline_frequency: the log of the frequency of the candidate term in MEDLINE
12. relevance: a binary indicator (True/False) of whether the candidate term was assigned to the target paper

## Cosine Similarity

The mesh_scores.tsv file contains a pairwise list of all MeSH terms with their cosine similarity based on the word embedding described in the dissertation. Because the MeSH hierarchy is also used in many of the evaluation measures, the relationship of the term pair is also included. It has the following columns:

1. mesh_one: a string of the first MeSH heading
2. mesh_two: a string of the second MeSH heading
3. cosine_similarity: the cosine similarity between the terms
4. relationship_type: a string identifying the relationship type, consisting of none, parent/child, sibling, ancestor and direct (terms are identical, i.e. a direct hierarchy match)

The mesh_model.bin file contains a binary word2vec C format file containing the MeSH term embeddings. It was generated using version 3.7.2 of the Python gensim library (https://radimrehurek.com/gensim/). For an example of how to load the model file, see https://radimrehurek.com/gensim/models/word2vec.html#usage-examples, specifically the directions for loading the "word2vec C format."
keywords: MEDLINE;MeSH;Medical Subject Headings;Indexing
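A minimal sketch of loading mesh_model.bin with gensim, following the pointer in the description above (binary word2vec C format, generated with gensim 3.7.2); the example headings are hypothetical, and the exact key encoding of multi-word headings is not specified in the description.

```python
# Minimal sketch: load the MeSH term embeddings (binary word2vec C format),
# as suggested in the description above. Example headings are hypothetical.
from gensim.models import KeyedVectors

mesh_vectors = KeyedVectors.load_word2vec_format("mesh_model.bin", binary=True)

# Cosine similarity between two headings, assuming both keys exist in the model.
print(mesh_vectors.similarity("Neoplasms", "Carcinoma"))
```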
published: 2019-07-08
 
Wikipedia category tree embeddings based on the Wikipedia SQL dump dated 2017-09-20 (<a href="https://archive.org/download/enwiki-20170920">https://archive.org/download/enwiki-20170920</a>), created using the following algorithms:
* Node2vec
* Poincare embedding
* Elmo model on the category title

The following files are present:
* wiki_cat_elmo.txt.gz (15G) - Elmo embeddings. Format: category_name (spaces replaced with "_") <tab> 300-dim space-separated embedding.
* wiki_cat_elmo.txt.w2v.gz (15G) - Elmo embeddings in word2vec format, which can be loaded using Gensim Word2VecKeyedVectors.load_word2vec_format.
* elmo_keyedvectors.tar.gz - Gensim Word2VecKeyedVectors format of the Elmo embeddings. Nodes are indexed using
* node2vec.tar.gz (3.4G) - Gensim word2vec model holding the node2vec embedding for each category, identified by its position (starting from 0) in category.txt
* poincare.tar.gz (1.8G) - Gensim poincare embedding model holding the poincare embedding for each category, identified by its position (starting from 0) in category.txt
* wiki_category_random_walks.txt.gz (1.5G) - Random walks generated by the node2vec algorithm (https://github.com/aditya-grover/node2vec/tree/master/node2vec_spark), each category identified by its position (starting from 0) in category.txt
* categories.txt - One category name per line (with spaces). The line number (starting from 0) is used as the category ID in many other files.
* category_edges.txt - Category edges based on category names (with spaces). Format: from_category <tab> to_category
* category_edges_ids.txt - Category edges based on category IDs, each category identified by its position (starting from 1) in category.txt
* wiki_cats-G.json - NetworkX format of the category graph, each category identified by its position (starting from 1) in category.txt

Software used:
* <a href="https://github.com/napsternxg/WikiUtils">https://github.com/napsternxg/WikiUtils</a> - processing the SQL dumps
* <a href="https://github.com/napsternxg/node2vec">https://github.com/napsternxg/node2vec</a> - generating random walks for node2vec
* <a href="https://github.com/RaRe-Technologies/gensim">https://github.com/RaRe-Technologies/gensim</a> (version 3.4.0) - generating node2vec embeddings from the random walks produced by the node2vec algorithm
* <a href="https://github.com/allenai/allennlp">https://github.com/allenai/allennlp</a> (version 0.8.2) - generating Elmo embeddings for each category title

Code used:
* wiki_cat_node2vec_commands.sh - commands used to run the node2vec steps
* wiki_cat_generate_elmo_embeddings.py - generate Elmo embeddings
* wiki_cat_poincare_embedding.py - generate Poincare embeddings
keywords: Wikipedia; Wikipedia Category Tree; Embeddings; Elmo; Node2Vec; Poincare;
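A minimal sketch of loading the word2vec-format Elmo category embeddings, as the file description above suggests; gensim reads the gzipped text file directly, and the query category below is a hypothetical example.

```python
# Minimal sketch: load wiki_cat_elmo.txt.w2v.gz (word2vec text format) with gensim,
# per the file description above. The query category is a hypothetical example.
from gensim.models import KeyedVectors

cat_vectors = KeyedVectors.load_word2vec_format("wiki_cat_elmo.txt.w2v.gz")

# Category names have spaces replaced with "_", per the description.
print(cat_vectors.most_similar("Machine_learning", topn=5))
```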
published: 2019-06-13
 
This lexicon is the expanded/enhanced version of the Moral Foundation Dictionary created by Graham and colleagues (Graham et al., 2013). Our Enhanced Morality Lexicon (EML) contains a list of 4,636 morality related words. This lexicon was used in the following paper - please cite this paper if you use this resource in your work. Rezapour, R., Shah, S., & Diesner, J. (2019). Enhancing the measurement of social effects by capturing morality. Proceedings of the 10th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis (WASSA). Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), Minneapolis, MN. In addition, please consider citing the original MFD paper: <a href="https://doi.org/10.1016/B978-0-12-407236-7.00002-4">Graham, J., Haidt, J., Koleva, S., Motyl, M., Iyer, R., Wojcik, S. P., & Ditto, P. H. (2013). Moral foundations theory: The pragmatic validity of moral pluralism. In Advances in experimental social psychology (Vol. 47, pp. 55-130)</a>.
keywords: lexicon; morality
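A minimal sketch of lexicon-based term matching with the EML; the file name and one-term-per-line format are assumptions here, since the description above does not specify the distribution format.

```python
# Minimal sketch of lexicon-based matching. The file name and one-term-per-line
# format are assumptions; adjust to the actual EML distribution format.
import re

with open("eml_lexicon.txt", encoding="utf-8") as fh:   # hypothetical file name
    eml_terms = {line.strip().lower() for line in fh if line.strip()}

def morality_terms_in(text):
    """Return the tokens of `text` that appear in the EML."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return [t for t in tokens if t in eml_terms]

print(morality_terms_in("They showed compassion and fairness toward others."))
```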
published: 2018-12-31
 
Sixty undergraduate STEM lecture classes were observed across 14 departments at the University of Illinois Urbana-Champaign in 2015 and 2016. We selected the classes to observe using purposive sampling techniques, with the objectives of (1) collecting classroom observations that were representative of the STEM courses offered; (2) conducting observations on non-test, typical class days; and (3) comparing these classroom observations using the Classroom Observation Protocol for Undergraduate STEM (COPUS) to record the presence and frequency of active learning practices utilized by Community of Practice (CoP) and non-CoP instructors. Decimal values are the result of combined observations. All COPUS codes listed are from the Smith (2013) paper "The Classroom Observation Protocol for Undergraduate STEM (COPUS): A New Instrument to Characterize STEM Classroom Practices". For more information on the data collection process, see "Evidence that communities of practice are associated with active learning in large STEM lectures" by Tomkin et al. (2019) in the International Journal of STEM Education.
keywords: COPUS; Community of Practice
published: 2019-05-31
 
The data are provided to illustrate methods for evaluating systematic transactional data reuse in machine learning. A library account-based recommender system was developed using machine learning over 383,828 transactions (check-outs) sourced from a large multi-unit research library. The machine learning process applied the FP-growth algorithm to the subject metadata associated with physical items that were checked out together in the library. The purpose of this research is to evaluate the results of systematic transactional data reuse in machine learning. The analysis herein contains a large-scale network visualization of 180,441 subject association rules and corresponding node metrics.
keywords: evaluating machine learning; network science; FP-growth; WEKA; Gephi; personalization; recommender systems
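An illustrative FP-growth run over toy subject-heading baskets using the Python mlxtend library; the study above used WEKA, so this is a stand-in for the same algorithm rather than the project's pipeline, and the transactions are made up.

```python
# Illustrative FP-growth association rule mining over toy subject-heading baskets.
# The study used WEKA; this sketch uses mlxtend as a Python stand-in for the algorithm.
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import fpgrowth, association_rules

# Each "transaction" is the set of subject headings on items checked out together.
transactions = [
    ["Linguistics", "Phonetics"],
    ["Linguistics", "Phonetics", "Syntax"],
    ["Statistics", "Machine learning"],
    ["Linguistics", "Statistics"],
]

te = TransactionEncoder()
onehot = pd.DataFrame(te.fit(transactions).transform(transactions), columns=te.columns_)

frequent_itemsets = fpgrowth(onehot, min_support=0.25, use_colnames=True)
rules = association_rules(frequent_itemsets, metric="confidence", min_threshold=0.5)
print(rules[["antecedents", "consequents", "support", "confidence"]])
```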
published: 2019-05-08
 
The data file contains a list of articles that have PubMed identifiers, which were used in a project associated with the manuscript "An in-situ evaluation of the RCT Tagger using 7413 articles included in 570 Cochrane reviews with RCT-only inclusion criteria".
keywords: Cochrane reviews; Randomized controlled trials; RCT; Automation; Systematic reviews
published: 2019-05-08
 
The data file contains a list of articles and their RCT Tagger prediction scores, which were used in a project associated with the manuscript "An in-situ evaluation of the RCT Tagger using 7413 articles included in 570 Cochrane reviews with RCT-only inclusion criteria".
keywords: Cochrane reviews; automation; randomized controlled trial; RCT; systematic reviews
published: 2019-05-08
 
The data file contains a list of articles given low scores by the RCT Tagger, together with an error analysis of them, which were used in a project associated with the manuscript "An in-situ evaluation of the RCT Tagger using 7413 articles included in 570 Cochrane reviews with RCT-only inclusion criteria".
keywords: Cochrane reviews; automation; randomized controlled trial; RCT; systematic reviews
published: 2019-05-08
 
The data file contains a list of included studies with their detailed metadata, taken from Cochrane reviews which were used in a project associated with the manuscript "An in-situ evaluation of the RCT Tagger using 7413 articles included in 570 Cochrane reviews with RCT-only inclusion criteria".
keywords: Cochrane reviews; automation; randomized controlled trial; RCT; systematic review
published: 2019-05-08
 
The data file contains detailed information of the Cochrane reviews that were used in a project associated with the manuscript (working title) "An in-situ evaluation of the RCT Tagger using 7413 articles included in 570 Cochrane reviews with RCT-only inclusion criteria".
keywords: Cochrane reviews; systematic reviews; randomized control trial; RCT; automation
published: 2019-04-05
 
File Name: Inclusion_Criteria_Annotation.csv
Data Preparation: Xiaoru Dong
Date of Preparation: 2019-04-04
Data Contributions: Jingyi Xie, Xiaoru Dong, Linh Hoang
Data Source: Cochrane systematic reviews published up to January 3, 2018 by 52 different Cochrane groups in 8 Cochrane group networks.
Associated Manuscript authors: Xiaoru Dong, Jingyi Xie, Linh Hoang, and Jodi Schneider.
Associated Manuscript, Working title: Machine classification of inclusion criteria from Cochrane systematic reviews.
Description: The file contains lists of inclusion criteria of Cochrane Systematic Reviews and the manual annotation results. 5420 inclusion criteria were annotated, out of 7158 inclusion criteria available. Annotations are either "Only RCTs" or "Others". There are 2 columns in the file:
- "Inclusion Criteria": content of the inclusion criteria of Cochrane Systematic Reviews.
- "Only RCTs": manual annotation results, in which "x" means the inclusion criteria are classified as "Only RCTs" and blank means they are classified as "Others".
Notes:
1. "RCT" stands for Randomized Controlled Trial, which, by definition, is "a work that reports on a clinical trial that involves at least one test treatment and one control treatment, concurrent enrollment and follow-up of the test- and control-treated groups, and in which the treatments to be administered are selected by a random process, such as the use of a random-numbers table." [Randomized Controlled Trial publication type definition from https://www.nlm.nih.gov/mesh/pubtypes.html]
2. To reproduce the data relevant to this file, please get the code of the project published on GitHub at https://github.com/XiaoruDong/InclusionCriteria and run the code following the instructions provided.
3. This datafile (V2) is an updated version of the datafile published at https://doi.org/10.13012/B2IDB-5958960_V1, with some minor spelling mistakes in the data fixed.
keywords: Inclusion criteria; Randomized controlled trials; Machine learning; Systematic reviews
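A minimal sketch of reading the annotation file described above with pandas and converting the "Only RCTs" marker ("x" or blank) into a boolean label; column names follow the description, and this is not the project's published code (see the GitHub link above for that).

```python
# Minimal sketch: read the annotation file and convert the "Only RCTs" marker
# ("x" = Only RCTs, blank = Others) into a boolean label. Column names follow
# the description above; this is not the project's published code.
import pandas as pd

df = pd.read_csv("Inclusion_Criteria_Annotation.csv")
df["only_rcts"] = df["Only RCTs"].fillna("").astype(str).str.strip().str.lower().eq("x")
print(df["only_rcts"].value_counts())
```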
published: 2018-09-06
 
The XSEDE program manages the database of allocation awards for the portfolio of advanced research computing resources funded by the National Science Foundation (NSF). The database holds data for allocation awards dating from the start of the TeraGrid program in 2004 to the present, with awards continuing through the end of the second XSEDE award in 2021. The project data include the lead researcher and affiliation, title and abstract, field of science, and the start and end dates. Along with the project information, the data set includes resource allocation and usage data for each award associated with the project. The data show the transition of resources over a fifteen-year span along with the evolution of researchers, fields of science, and institutional representation.
keywords: allocations; cyberinfrastructure; XSEDE
published: 2019-02-19
 
The organizations that contribute to the longevity of 67 long-lived molecular biology databases published in Nucleic Acids Research (NAR) between 1991 and 2016 were identified to address two research questions: 1) which organizations fund these databases? and 2) which organizations maintain these databases? Funders were determined by examining the funding acknowledgements in each database's most recent NAR Database Issue update article published prior to 2017, and the organizations operating the databases were determined through a review of database websites.
keywords: databases; research infrastructure; sustainability; data sharing; molecular biology; bioinformatics; bibliometrics
published: 2019-01-07
 
Vendor transcription of the Catalogue of Copyright Entries, Part 1, Group 1, Books: New Series, Volume 29 for the Year 1932. This file contains all of the entries from the indicated volume.
keywords: copyright; Catalogue of Copyright Entries; Copyright Office
published: 2018-12-20
 
File Name: AllWords.csv
Data Preparation: Xiaoru Dong, Linh Hoang
Date of Preparation: 2018-12-12
Data Contributions: Jingyi Xie, Xiaoru Dong, Linh Hoang
Data Source: Cochrane systematic reviews published up to January 3, 2018 by 52 different Cochrane groups in 8 Cochrane group networks.
Associated Manuscript authors: Xiaoru Dong, Jingyi Xie, Linh Hoang, and Jodi Schneider.
Associated Manuscript, Working title: Machine classification of inclusion criteria from Cochrane systematic reviews.
Description: The file contains the list of all words (all features) from the bag-of-words feature extraction.
Notes: To reproduce the data in this file, please get the code of the project published on GitHub at https://github.com/XiaoruDong/InclusionCriteria and run the code following the instructions provided.
keywords: Inclusion criteria; Randomized controlled trials; Machine learning; Systematic reviews
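For the project's actual feature-extraction code, see the GitHub link above; the sketch below only illustrates a generic bag-of-words extraction with scikit-learn over toy inclusion-criteria sentences.

```python
# Generic bag-of-words feature extraction with scikit-learn, shown only as an
# illustration of the kind of "all words" feature list described above; the toy
# sentences are made up and this is not the project's published code.
from sklearn.feature_extraction.text import CountVectorizer

criteria = [
    "Randomised controlled trials comparing drug A with placebo.",
    "Randomized and quasi-randomised trials in adults.",
]

vectorizer = CountVectorizer(lowercase=True)
X = vectorizer.fit_transform(criteria)              # document-term count matrix
print(sorted(vectorizer.get_feature_names_out()))   # the extracted word features
```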