Illinois Data Bank Dataset Search Results


published: 2018-09-06
 
The XSEDE program manages the database of allocation awards for the portfolio of advanced research computing resources funded by the National Science Foundation (NSF). The database holds data for allocation awards dating from the start of the TeraGrid program in 2004 to the present, with awards continuing through the end of the second XSEDE award in 2021. The project data include lead researcher and affiliation, title and abstract, field of science, and the start and end dates. Along with the project information, the data set includes resource allocation and usage data for each award associated with the project. The data show the transition of resources over a fifteen-year span along with the evolution of researchers, fields of science, and institutional representation.
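A minimal sketch (not part of the dataset) of how such award records might be summarized with pandas; the file name and column names (field_of_science, start_date, allocation_su) are illustrative assumptions, not the released schema:
```python
# Hypothetical example: summarize allocation awards by field of science and year.
# Column names (field_of_science, start_date, allocation_su) are assumptions,
# not the dataset's documented schema.
import pandas as pd

awards = pd.read_csv("xsede_allocations.csv", parse_dates=["start_date"])
awards["year"] = awards["start_date"].dt.year

summary = (
    awards.groupby(["year", "field_of_science"])["allocation_su"]
    .agg(["count", "sum"])
    .rename(columns={"count": "n_awards", "sum": "total_su"})
)
print(summary.head())
```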
keywords: allocations; cyberinfrastructure; XSEDE
published: 2023-01-12
 
This dataset was developed as part of a study that examined the correlational relationships between local journal authorship, local and external citation counts, full-text downloads, link-resolver clicks, and four global journal impact factor indices within an all-disciplines journal collection of 12,200 titles and six subject subsets at the University of Illinois at Urbana-Champaign (UIUC) Library. While earlier investigations of the relationships between usage (downloads) and citation metrics have been inconclusive, this study shows strong correlations in the all-disciplines set and most subject subsets. The normalized Eigenfactor was the only global impact factor index that correlated highly with local journal metrics. Some of the identified disciplinary variances among the six subject subsets may be explained by the journal publication aspirations of UIUC researchers. The correlations between authorship and local citations in the six specific subject subsets closely match national department or program rankings. All the raw data used in this analysis are provided as relational database tables with multiple columns, which can be opened using MS Access. Descriptions of the variables can be viewed through "Design View" (right-click the selected table and choose "Design View"). The two PDF files provide an overview of the tables included in each MDB file. In addition, the processing scripts and Pearson correlation code are available at <a href="https://doi.org/10.13012/B2IDB-0931140_V1">https://doi.org/10.13012/B2IDB-0931140_V1</a>.
keywords: Usage and local citation relationships; publication, citation and usage metrics; publication, citation and usage correlation analysis; Pearson correlation analysis
published: 2022-07-25
 
This dataset is derived from the raw dataset (https://doi.org/10.13012/B2IDB-4950847_V1) and collects entity mentions that were manually determined to be noisy, non-species entities.
keywords: synthetic biology; NERC data; species mentions; noisy entities
published: 2022-07-25
 
This dataset is derived from the raw entity mention dataset (https://doi.org/10.13012/B2IDB-4950847_V1) for species entities and represents those that were determined to be species (i.e., were not noisy entities) but for which no corresponding concept could be found in the NCBI taxonomy database.
keywords: synthetic biology; NERC data; species mentions; not found entities
published: 2022-07-25
 
This dataset represents the results of manual cleaning and annotation of the entity mentions contained in the raw dataset (https://doi.org/10.13012/B2IDB-4163883_V1). Each mention has been consolidated and linked to an identifier for a matching concept from the ChEBI ontology.
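A minimal sketch of the kind of consolidation-and-linking step described here, assuming hypothetical file layouts and a hypothetical mention-to-identifier lookup table (not the dataset's actual files):
```python
# Hypothetical sketch of consolidating raw entity mentions and linking them to
# ontology identifiers. File names, column names, and the lookup table are
# illustrative assumptions, not the dataset's actual layout.
import pandas as pd

raw = pd.read_csv("raw_chemical_mentions.csv")    # e.g. columns: doc_id, mention
lookup = pd.read_csv("mention_to_chebi.csv")      # e.g. columns: mention, chebi_id

# Normalize surface forms, collapse duplicates, then attach an identifier where one exists.
raw["mention_norm"] = raw["mention"].str.strip().str.lower()
lookup["mention_norm"] = lookup["mention"].str.strip().str.lower()

linked = (
    raw.drop_duplicates("mention_norm")
    .merge(lookup[["mention_norm", "chebi_id"]], on="mention_norm", how="left")
)
unmatched = linked[linked["chebi_id"].isna()]     # candidates for manual review
print(len(linked), "consolidated mentions;", len(unmatched), "without a match")
```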
keywords: synthetic biology; NERC data; chemical mentions; cleaned data; ChEBI ontology
published: 2022-07-25
 
This dataset is derived from the raw dataset (https://doi.org/10.13012/B2IDB-4163883_V1) and collects entity mentions that were manually determined to be noisy, non-chemical entities.
keywords: synthetic biology; NERC data; chemical mentions; noisy entities
published: 2022-07-25
 
This dataset is derived from the raw entity mention dataset (https://doi.org/10.13012/B2IDB-4163883_V1) for chemical entities and represents those that were determined to be chemicals (i.e., were not noisy entities) but for which no corresponding concept could be found in the ChEBI ontology.
keywords: synthetic biology; NERC data; chemical mentions; not found entities
published: 2022-07-25
 
A set of gene and gene-related entity mentions derived from an NERC dataset analyzing 900 synthetic biology articles published by the ACS. This data is associated with the Synthetic Biology Knowledge System repository (https://web.synbioks.org/). The data in this dataset are raw mentions from the NERC data.
keywords: synthetic biology; NERC data; gene mentions
published: 2023-01-12
 
These processing and Pearson correlational scripts were developed to support the study that examined the correlational relationships between local journal authorship, local and external citation counts, full-text downloads, link-resolver clicks, and four global journal impact factor indices within an all-disciplines journal collection of 12,200 titles and six subject subsets at the University of Illinois at Urbana-Champaign (UIUC) Library. This study shows strong correlations in the all-disciplines set and most subject subsets. Special processing scripts and web site dashboards were created, including Pearson correlational analysis scripts for reading values from relational databases and displaying tabular results. The raw data used in this analysis, in the form of relational database tables with multiple columns, is available at <a href="https://doi.org/10.13012/B2IDB-6810203_V1">https://doi.org/10.13012/B2IDB-6810203_V1</a>.
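A minimal sketch of the kind of correlation the released scripts compute, assuming the database tables have been exported to a CSV with hypothetical column names (downloads, local_citations); the published scripts read values directly from the relational database tables:
```python
# Illustrative sketch only: Pearson correlation between two journal-level metrics.
# Column names (downloads, local_citations) are assumptions about an exported table.
import pandas as pd
from scipy.stats import pearsonr

journals = pd.read_csv("journal_metrics.csv").dropna(subset=["downloads", "local_citations"])
r, p_value = pearsonr(journals["downloads"], journals["local_citations"])
print(f"Pearson r = {r:.3f} (p = {p_value:.3g}, n = {len(journals)})")
```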
keywords: Pearson Correlation Analysis Scripts; Journal Publication; Citation and Usage Data; University of Illinois at Urbana-Champaign Scholarly Communication
published: 2022-07-25
 
A set of cell-line entity mentions derived from an NERC dataset analyzing 900 synthetic biology articles published by the ACS. This data is associated with the Synthetic Biology Knowledge System repository (https://web.synbioks.org/). The data in this dataset are raw mentions from the NERC data.
keywords: synthetic biology; NERC data; cell-line mentions
published: 2022-06-20
 
This is a sentence-level parallel corpus in support of research on OCR quality. The source data comes from: (1) Project Gutenberg for human-proofread "clean" sentences; and, (2) HathiTrust Digital Library for the paired sentences with OCR errors. In total, this corpus contains 167,079 sentence pairs from 189 sampled books in four domains (i.e., agriculture, fiction, social science, world war history) published from 1793 to 1984. There are 36,337 sentences that have two OCR views paired with each clean version. In addition to sentence texts, this corpus also provides the location (i.e., sentence and chapter index) of each sentence in the Gutenberg volume it belongs to.
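As a hypothetical usage example (not part of the corpus), paired sentences can be compared with a simple character error rate; the two example strings below are invented for illustration:
```python
# Hypothetical usage sketch: character error rate (CER) between a clean sentence
# and its OCR-damaged pair. The two example strings are invented for illustration.
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

clean = "The harvest was gathered before the first frost."
ocr   = "The harvcst was gathcred bef0re the first frost ."
cer = edit_distance(clean, ocr) / len(clean)
print(f"CER = {cer:.3f}")
```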
keywords: sentence-level parallel corpus; optical character recognition; OCR errors; Project Gutenberg; HathiTrust Digital Library; digital libraries; digital humanities;
published: 2021-07-20
 
This dataset contains data from the extreme-disagreement analysis described in the paper “Aaron M. Cohen, Jodi Schneider, Yuanxi Fu, Marian S. McDonagh, Prerna Das, Arthur W. Holt, Neil R. Smalheiser, 2021, Fifty Ways to Tag your Pubtypes: Multi-Tagger, a Set of Probabilistic Publication Type and Study Design Taggers to Support Biomedical Indexing and Evidence-Based Medicine.” In this analysis, our team's experts carried out an independent formal review and consensus process for extreme disagreements between MEDLINE indexing and model predictive scores. “Extreme disagreements” included two situations: (1) an abstract was MEDLINE-indexed as a publication type but received low scores for this publication type, and (2) an abstract received high scores for a publication type but lacked the corresponding MEDLINE index term. “High predictive score” is defined as the top 100 high-scoring abstracts, and “low predictive score” is defined as the bottom 100 low-scoring abstracts. Three publication types were analyzed: CASE_CONTROL_STUDY, COHORT_STUDY, and CROSS_SECTIONAL_STUDY. Results were recorded in three Excel workbooks, named after the publication types: case_control_study.xlsx, cohort_study.xlsx, and cross_sectional_study.xlsx. The analysis shows that, when the tagger gave a high predictive score (>0.9) to articles that lacked a corresponding MEDLINE indexing term, independent review suggested that the model assignment was correct in almost all cases: CROSS_SECTIONAL_STUDY (99%), CASE_CONTROL_STUDY (94.9%), and COHORT_STUDY (92.2%). Conversely, when articles received MEDLINE indexing but model predictive scores were very low (<0.1), independent review suggested that the model assignment was correct in the majority of cases: CASE_CONTROL_STUDY (85.4%), COHORT_STUDY (76.3%), and CROSS_SECTIONAL_STUDY (53.6%). Based on the extreme-disagreement analysis, we identified a number of false positives (FPs) and false negatives (FNs). For case-control study, there were 5 FPs and 14 FNs. For cohort study, there were 7 FPs and 22 FNs. For cross-sectional study, there were 1 FP and 45 FNs. We reviewed and grouped them based on patterns noticed, providing clues for further improving the models. This dataset reports the instances of FPs and FNs along with their categorizations.
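A minimal sketch of the extreme-disagreement selection logic described above, with hypothetical column names (score, has_medline_term); the released workbooks contain the reviewed results, not this code:
```python
# Illustrative sketch of selecting "extreme disagreements" between model scores and
# MEDLINE indexing. Column names (score, has_medline_term as a boolean flag) are assumptions.
import pandas as pd

preds = pd.read_csv("cohort_study_scores.csv")  # hypothetical per-abstract scores

# High predictive score (>0.9) but no corresponding MEDLINE index term.
fp_candidates = preds[(preds["score"] > 0.9) & (~preds["has_medline_term"])]
# MEDLINE-indexed but very low predictive score (<0.1).
fn_candidates = preds[(preds["score"] < 0.1) & (preds["has_medline_term"])]

# The study reviewed the top/bottom 100 by score in each group.
top_100 = fp_candidates.nlargest(100, "score")
bottom_100 = fn_candidates.nsmallest(100, "score")
print(len(top_100), "high-score disagreements;", len(bottom_100), "low-score disagreements")
```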
keywords: biomedical informatics; machine learning; evidence based medicine; text mining
published: 2021-03-14
 
This dataset contains all the code, notebooks, and datasets used in the study conducted to measure the spatial accessibility of COVID-19 healthcare resources, with a particular focus on Illinois, USA. Specifically, the dataset measures spatial access for people to hospitals and ICU beds in Illinois. Spatial accessibility is measured using an enhanced two-step floating catchment area (E2FCA) method (Luo & Qi, 2009), which is an outcome of interactions between demand (i.e., the number of potential patients) and supply (i.e., the number of beds or physicians). The result is a map of spatial accessibility to hospital beds. It identifies which regions need more healthcare resources, such as ICU beds and ventilators, and the notebook serves as a guide to which areas need more beds in the fight against COVID-19 (a simplified sketch of the two-step catchment calculation follows the file listing below).

## What's Inside

A quick explanation of the components of the zip file:
* `COVID-19Acc.ipynb` is a notebook for calculating spatial accessibility and `COVID-19Acc.html` is an export of the notebook as HTML.
* `Data` contains all of the data necessary for calculations:
  * `Chicago_Network.graphml`/`Illinois_Network.graphml` are GraphML files of the OSMNX street networks for Chicago and Illinois respectively.
  * `GridFile/` has hexagonal gridfiles for Chicago and Illinois.
  * `HospitalData/` has shapefiles for the hospitals in Chicago and Illinois.
  * `IL_zip_covid19/COVIDZip.json` is a JSON file containing COVID cases by zip code from IDPH.
  * `PopData/` contains population data for Chicago and Illinois by census tract and zip code.
  * `Result/` is where the results of the spatial accessibility measures are written.
  * `SVI/` contains data about the Social Vulnerability Index (SVI).
* `img/` contains some images and HTML maps of the hospitals (the notebook generates the maps).
* `README.md` is the document you're currently reading!
* `requirements.txt` is a list of Python packages necessary to use the notebook (besides Jupyter/IPython). You can install the packages with `python3 -m pip install -r requirements.txt`
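The notebook `COVID-19Acc.ipynb` implements the full workflow on street-network travel times and real hospital data; the following is only a simplified sketch of a two-step floating catchment area calculation, with assumed stepwise decay weights and toy numbers:
```python
# Minimal sketch of a two-step floating catchment area calculation with simple
# distance-decay weights (an assumption loosely following the E2FCA idea; the
# notebook in this dataset uses street-network travel times and real hospital data).
import numpy as np

travel_time = np.array([[10, 35, 70],       # minutes from demand area i to hospital j
                        [25, 15, 50],       # (toy numbers for illustration)
                        [60, 40, 20]])
population = np.array([5000, 8000, 3000])   # demand at each area
beds = np.array([120, 60, 200])             # supply at each hospital

def zone_weight(t):
    """Stepwise decay over travel-time zones (0-10, 10-20, 20-30 min); assumed values."""
    if t <= 10: return 1.0
    if t <= 20: return 0.68
    if t <= 30: return 0.22
    return 0.0

w = np.vectorize(zone_weight)(travel_time)

# Step 1: supply-to-demand ratio at each hospital within its catchment.
weighted_demand = (population[:, None] * w).sum(axis=0)
ratio = np.divide(beds, weighted_demand, out=np.zeros_like(beds, dtype=float),
                  where=weighted_demand > 0)

# Step 2: accessibility of each demand area = weighted sum of reachable ratios.
accessibility = (w * ratio[None, :]).sum(axis=1)
print(accessibility)
```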
keywords: COVID-19; spatial accessibility; CyberGISX
published: 2020-07-16
 
Dataset to be used for the SocialMediaIE tutorial.
keywords: social media; deep learning; natural language processing
published: 2020-05-17
 
Models and predictions for our submission to TRAC-2020, the Second Workshop on Trolling, Aggression and Cyberbullying. Our approach is described in our paper: Mishra, Sudhanshu, Shivangi Prasad, and Shubhanshu Mishra. 2020. “Multilingual Joint Fine-Tuning of Transformer Models for Identifying Trolling, Aggression and Cyberbullying at TRAC 2020.” In Proceedings of the Second Workshop on Trolling, Aggression and Cyberbullying (TRAC-2020). The source code for training this model and more details can be found in our code repository: https://github.com/socialmediaie/TRAC2020 NOTE: These models were retrained for uploading here after our submission, so the evaluation measures may differ slightly from the ones reported in the paper.
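A hedged sketch of scoring a post with a fine-tuned sequence-classification checkpoint via the Hugging Face transformers library; the checkpoint directory name is a placeholder, and the repository above contains the actual training and inference code:
```python
# Illustrative sketch only: score a post with a fine-tuned sequence-classification
# checkpoint. The directory name "trac2020-multilingual-joint" is a placeholder,
# not this dataset's actual file layout.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

checkpoint = "trac2020-multilingual-joint"   # hypothetical local path to a downloaded model
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

inputs = tokenizer("You are all pathetic losers.", return_tensors="pt", truncation=True)
with torch.no_grad():
    logits = model(**inputs).logits
probs = torch.softmax(logits, dim=-1).squeeze().tolist()
print(probs)   # probabilities over the aggression/cyberbullying label set
```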
keywords: Social Media; Trolling; Aggression; Cyberbullying; text classification; natural language processing; deep learning; open source;
published: 2020-02-12
 
The XSEDE program manages the database of allocation awards for the portfolio of advanced research computing resources funded by the National Science Foundation (NSF). The database holds data for allocation awards dating from the start of the TeraGrid program in 2004 to the present, with awards continuing through the end of the second XSEDE award in 2021. The project data include lead researcher and affiliation, title and abstract, field of science, and the start and end dates. Along with the project information, the data set includes resource allocation and usage data for each award associated with the project. The data show the transition of resources over a fifteen-year span along with the evolution of researchers, fields of science, and institutional representation.
keywords: allocations; cyberinfrastructure; XSEDE
published: 2020-03-03
 
This second version (V2) provides additional data cleaning compared to V1, additional data collection (mainly to include data from 2019), and more metadata for nodes. Please see NETWORKv2README.txt for more detail.
keywords: citations; retraction; network analysis; Web of Science; Google Scholar; indirect citation
published: 2020-05-15
 
This dataset contains tweets collected for the paper: Shubhanshu Mishra, Sneha Agarwal, Jinlong Guo, Kirstin Phelps, Johna Picco, and Jana Diesner. 2014. Enthusiasm and support: alternative sentiment classification for social movements on social media. In Proceedings of the 2014 ACM Conference on Web Science (WebSci '14). ACM, New York, NY, USA, 261-262. DOI: https://doi.org/10.1145/2615569.2615667 The data contain only tweet IDs and the corresponding enthusiasm and support labels from two different annotators.
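Because each tweet ID carries labels from two annotators, a natural first check is inter-annotator agreement; a minimal sketch with assumed column names:
```python
# Illustrative sketch: inter-annotator agreement on the enthusiasm/support labels.
# Column names (annotator_1, annotator_2) are assumptions about the file layout.
import pandas as pd
from sklearn.metrics import cohen_kappa_score

labels = pd.read_csv("tweet_labels.csv")   # hypothetical export: tweet_id, annotator_1, annotator_2
kappa = cohen_kappa_score(labels["annotator_1"], labels["annotator_2"])
print(f"Cohen's kappa = {kappa:.3f}")
```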
keywords: Twitter; text classification; enthusiasm; support; social causes; LGBT; Cyberbullying; NFL
published: 2020-02-23
 
Citation context annotation for papers citing the retracted paper Matsuyama 2005 (RETRACTED: Matsuyama W, Mitsuyama H, Watanabe M, Oonakahara KI, Higashimoto I, Osame M, Arimura K. Effects of omega-3 polyunsaturated fatty acids on inflammatory markers in COPD. Chest. 2005 Dec 1;128(6):3817-27.), retracted in 2008 (Retraction in: Chest (2008) 134:4 (893) <a href="https://doi.org/10.1016/S0012-3692(08)60339-6">https://doi.org/10.1016/S0012-3692(08)60339-6</a>). This is part of the supplemental data for Jodi Schneider, Di Ye, Alison Hill, and Ashley Whitehorn. "Continued Citation of a Fraudulent Clinical Trial Report, Eleven Years after it was Retracted for Falsifying Data" [R&R under review with Scientometrics]. Overall we found 148 citations to the retracted paper from 2006 to 2019. However, this dataset does not include the annotations described in the 2015 paper: Ashley Fulton, Alison Coates, Marie Williams, Peter Howe, and Alison Hill. "Persistent citation of the only published randomised controlled trial of omega-3 supplementation in chronic obstructive pulmonary disease six years after its retraction." Publications 3, no. 1 (2015): 17-26. In this dataset 70 new and newly found citations are listed: 66 annotated citations and 4 pending citations (not annotated because we do not have the full text). "New citations" refer to articles published from March 25, 2014 to 2019, found in Google Scholar and Web of Science. "Newly found citations" refer to articles published 2006-2013, found in Google Scholar and Web of Science but not previously covered in Fulton et al. (2015), cited above.

NOTES: This is Unicode data. Some publication titles and quotes are in non-Latin characters, and they may contain commas, quotation marks, etc.

FILES/FILE FORMATS
Same data in two formats:
2006-2019-new-citation-contexts-to-Matsuyama.csv - Unicode CSV (preservation format only)
2006-2019-new-citation-contexts-to-Matsuyama.xlsx - Excel workbook (preferred format)

ROW EXPLANATIONS
70 rows of data - one citing publication per row

COLUMN HEADER EXPLANATIONS
Note - processing notes
Annotation pending - Y or blank
Year Published - publication year
ID - ID corresponding to the network analysis. See Ye, Di; Schneider, Jodi (2019): Network of First and Second-generation Citations to Matsuyama 2005 from Google Scholar and Web of Science. University of Illinois at Urbana-Champaign. <a href="https://doi.org/10.13012/B2IDB-1403534_V2">https://doi.org/10.13012/B2IDB-1403534_V2</a>
Title - item title (some have non-Latin characters, commas, etc.)
Official Translated Title - item title in English, as listed in the publication
Machine Translated Title - item title in English, translated by Google Scholar
Language - publication language
Type - publication type (e.g., bachelor's thesis, blog post, book chapter, clinical guidelines, Cochrane Review, consumer-oriented evidence summary, continuing education journal article, journal article, letter to the editor, magazine article, Master's thesis, patent, Ph.D. thesis, textbook chapter, training module)
Book title for book chapters - only for a book chapter: the book title
University for theses - for a bachelor's thesis, Master's thesis, or Ph.D. thesis: the associated university
Pre/Post Retraction - "Pre" for 2006-2008 (published before the October 2008 retraction notice or in the 2 months afterwards); "Post" for 2009-2019 (considered post-retraction for our analysis)
Identifier where relevant - ISBN, Patent ID, PMID (only for items we considered hard to find/identify, e.g. those without a DOI-based URL)
URL where available - URL, ideally a DOI-based URL
Reference number/style - reference
Only in bibliography - Y or blank
Acknowledged - if annotated: Y, Not relevant as retraction not published yet, or N (blank otherwise)
Positive / "Poor Research" (Negative) - P for positive, N for negative if annotated; blank otherwise
Human translated quotations - Y or blank; blank means Google Scholar was used to translate quotations for Translated Quotation X
Specific/in passing (overall) - Specific if any of the 5 quotations are specific [aggregates Specific / In Passing (Quotation X)]
Quotation 1 through Quotation 5 - up to five quotations from the citing publication (or blank; some include non-Latin characters)
Translated Quotation 1 through Translated Quotation 5 - English translation of the corresponding quotation (or blank)
Specific / In Passing (Quotation 1) through (Quotation 5) - Specific if the corresponding quotation refers to methods or results of the Matsuyama paper (or blank)
What is referenced from Matsuyama (Quotation 1) through (Quotation 5) - Methods; Results; or Methods and Results - blank if the corresponding quotation is not specific, there is no associated quotation, or it is not yet annotated
Further Notes - additional notes
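A small sketch of loading the preferred Excel file and tabulating two of the documented columns (reading .xlsx with pandas requires the openpyxl engine):
```python
# Illustrative sketch: tabulate citations to the retracted Matsuyama paper by
# retraction period and acknowledgement status, using the documented column names.
import pandas as pd

df = pd.read_excel("2006-2019-new-citation-contexts-to-Matsuyama.xlsx")
print(pd.crosstab(df["Pre/Post Retraction"], df["Acknowledged"].fillna("blank")))
```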
keywords: citation context annotation; retraction; diffusion of retraction
published: 2019-10-16
 
Human annotations of randomly selected judged documents from the AP 88-89, Robust 2004, WT10g, and GOV2 TREC collections. Seven annotators were asked to read documents in their entirety and then select up to ten terms they felt best represented the main topic(s) of the document. Terms were chosen from among a set sampled from the document in question and from related documents.
keywords: TREC; information retrieval; document topicality; document description
published: 2018-09-04
 
This dataset contains records of five years of interlibrary loan (ILL) transactions for the University of Illinois at Urbana-Champaign Library, covering materials lent to other institutions during the period 2009-2013. It includes 169,890 transactions showing the date; the borrowing institution's type, state, and country; and the material format, imprint city, imprint country, imprint region, call number, language, local circulation count, ILL lending count, and OCLC holdings count. The dataset was generated by putting together monthly ILL reports. The circulation and ILL lending fields were added from the ILS records. The borrower region and imprint region fields were created based on the Title VI Region List. The OCLC holdings field was added from WorldCat records.
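A hedged sketch of a simple aggregation over such a transaction table, using assumed column names rather than the released field names:
```python
# Illustrative sketch: count interlibrary loan lending transactions by borrower
# country and material format. Column names are assumptions, not the released schema.
import pandas as pd

ill = pd.read_csv("ill_lending_2009_2013.csv")
counts = (
    ill.groupby(["borrower_country", "material_format"])
    .size()
    .sort_values(ascending=False)
)
print(counts.head(10))
```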
keywords: Interlibrary Loan; ILL; Lending; OCLC Holding; Library; Area Studies; Collection; Circulation; Collaborative; Shared; Resource Sharing
published: 2019-12-22
 
Dataset providing the calculation of a Competition Index (CI) for Late Pleistocene carnivore guilds in Laos and Vietnam and their relationship to humans. Prey mass spectra, prey focus masses, and prey class raw data can be used to calculate the CI following Hemmer (2004). Mass estimates were calculated for each species following Van Valkenburgh (1990). Full citations to the methodological papers are included as related resources.
keywords: competition; Southeast Asia; carnivores; humans
published: 2018-03-08
 
This dataset was developed to create a census of sufficiently documented molecular biology databases to answer several preliminary research questions. Articles published in the annual Nucleic Acids Research (NAR) “Database Issues” were used to identify a population of databases for study. Namely, the questions addressed herein include: 1) what is the historical rate of database proliferation versus rate of database attrition?, 2) to what extent do citations indicate persistence?, and 3) are databases under active maintenance and does evidence of maintenance likewise correlate to citation? An overarching goal of this study is to provide the ability to identify subsets of databases for further analysis, both as presented within this study and through subsequent use of this openly released dataset.
keywords: databases; research infrastructure; sustainability; data sharing; molecular biology; bioinformatics; bibliometrics
published: 2018-07-28
 
This dataset presents a citation analysis and citation context analysis used in Linh Hoang, Frank Scannapieco, Linh Cao, Yingjun Guan, Yi-Yun Cheng, and Jodi Schneider. Evaluating an automatic data extraction tool based on the theory of diffusion of innovation. Under submission. We identified the papers that directly describe or evaluate RobotReviewer from the list of publications on the RobotReviewer website <http://www.robotreviewer.net/publications>, resulting in 6 papers grouped into 5 studies (we collapsed a conference and journal paper with the same title and authors into one study). We found 59 citing papers, combining results from Google Scholar on June 05, 2018 and from Scopus on June 23, 2018. We extracted the citation context around each citation to the RobotReviewer papers and categorized these quotes into emergent themes.
keywords: RobotReviewer; citation analysis; citation context analysis