Illinois Data Bank Dataset Search Results
published:
2024-11-07
Zheng, Heng; Fu, Yuanxi; Vandel, Ellie; Schneider, Jodi
(2024)
This dataset consists of the 286 publications retrieved from Web of Science and Scopus on July 6, 2023 as citations for Willoughby et al., 2014:
Patrick H. Willoughby, Matthew J. Jansma, and Thomas R. Hoye (2014). A guide to small-molecule structure assignment through computation of (¹H and ¹³C) NMR chemical shifts. Nature Protocols, 9(3), Article 3. https://doi.org/10.1038/nprot.2014.042
We added the DOIs of the citing publications into a Zotero collection. Then we exported all 286 DOIs in two formats: a .csv file (data export) and an .rtf file (bibliography).
<b>Willoughby2014_286citing_publications.csv</b> is a Zotero data export of the citing publications.
<b>Willoughby2014_286citing_publications.rtf</b> is a bibliography of the citing publications, using a variation of the American Psychological Association style (7th edition) with full names instead of initials.
To create <b>Willoughby2014_citation_contexts.csv</b>, HZ manually extracted the paragraphs that contain a citation marker for Willoughby et al., 2014. We refer to these paragraphs as the citation contexts of Willoughby et al., 2014. Manual extraction started with the 286 citing publications but excluded 2 publications that are not in English, those with DOIs 10.13220/j.cnki.jipr.2015.06.004 and 10.19540/j.cnki.cjcmm.20200604.201.
The silver standard aimed to triage the citing publications of Willoughby et al., 2014 that are at risk of propagating unreliability due to a code glitch in a computational chemistry protocol introduced in Willoughby et al., 2014. The silver standard was created stepwise:
First, one chemistry expert (YF) manually annotated the corpus of 284 citing publications in English, using their full text and citation contexts. She manually categorized publications as either at risk of propagating unreliability or not at risk of propagating unreliability, with a rationale justifying each category.
Then we selected a representative sample of citation contexts to be double annotated. To do this, MJS turned the full dataset of citation contexts (Willoughby2014_citation_contexts.csv) into word embeddings, clustered them by similarity using BERTopic's HDBSCAN clustering, and selected representative citation contexts based on the centroids of the clusters.
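For illustration, a minimal sketch of this kind of centroid-based selection (the column name "citation_context", the embedding model, and the file handling here are our assumptions, not the pipeline actually used):
import numpy as np
import pandas as pd
from bertopic import BERTopic
from sentence_transformers import SentenceTransformer

contexts = pd.read_csv("Willoughby2014_citation_contexts.csv")["citation_context"].tolist()

# Embed the citation contexts, then let BERTopic cluster them (it uses HDBSCAN internally).
embedder = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = embedder.encode(contexts)
topic_model = BERTopic()
topics, _ = topic_model.fit_transform(contexts, embeddings)
topics = np.array(topics)

# For each cluster, take the context closest to the cluster centroid as its representative.
representatives = {}
for topic in set(topics.tolist()):
    if topic == -1:                       # -1 is HDBSCAN's outlier/noise cluster
        continue
    idx = np.where(topics == topic)[0]
    centroid = embeddings[idx].mean(axis=0)
    closest = idx[np.argmin(np.linalg.norm(embeddings[idx] - centroid, axis=1))]
    representatives[topic] = contexts[closest]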
Next the second chemistry expert (EV) annotated the 77 publications associated with the citation contexts, considering the full text as well as the citation contexts.
<b>double_annotated_subset_77_before_reconciliation.csv</b> provides EV and YF's annotation before reconciliation.
To create the silver standard, YF, EV, and JS discussed the differences and reconciled most of them. YF and EV had principled reasons for disagreeing on 9 publications; to handle these, YF updated the annotations to create the silver standard we use for evaluation in the remainder of our JCDL 2024 paper (<b>silver_standard.csv</b>).
<b>Inter_Annotator_Agreement.xlsx</b> indicates the publications where the two annotators made opposite decisions and calculates the inter-annotator agreement both before and after reconciliation.
<b>double_annotated_subset_77_after_reconciliation.csv</b> provides EV and YF's annotations after reconciliation, including applying the reconciliation policy.
keywords:
unreliable cited sources; knowledge maintenance; citations; scientific digital libraries; scholarly publications; reproducibility; unreliability propagation; citation contexts
published:
2023-05-02
Lee, Jou; Schneider, Jodi
(2023)
Tab-separated value (TSV) file.
14,745 data rows. Each data row represents publication metadata as retrieved from Crossref (http://crossref.org) on 2023-04-05 when searching for retracted publications.
Each row has the following columns:
Index - Our index, starting with 0.
DOI - Digital Object Identifier (DOI) for the publication
Year - Publication year associated with the DOI.
URL - Web location associated with the DOI.
Title - Title associated with the DOI. May be blank.
Author - Author(s) associated with the DOI.
Journal - Publication venue (journal, conference, ...) associated with the DOI
RetractionYear - Retraction Year associated with the DOI. May be blank.
Category - One or more categories associated with the DOI. May be blank.
Our search was via the Crossref REST API and searched for:
Update_type=(
    'retraction',
    'Retraction',
    'retracion',
    'retration',
    'partial_retraction',
    'withdrawal',
    'removal')
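A minimal sketch of this kind of query against the Crossref REST API (illustrative only; the exact request parameters and paging used for this dataset are not documented here, and values such as 'retracion' and 'retration' appear to target misspellings found in deposited metadata):
import requests

update_types = ["retraction", "Retraction", "retracion", "retration",
                "partial_retraction", "withdrawal", "removal"]

records = []
for update_type in update_types:
    cursor = "*"
    while True:
        # The works endpoint supports an "update-type" filter and cursor-based deep paging.
        resp = requests.get(
            "https://api.crossref.org/works",
            params={"filter": f"update-type:{update_type}",
                    "rows": 1000, "cursor": cursor},
            timeout=60,
        )
        message = resp.json()["message"]
        if not message["items"]:
            break
        records.extend(message["items"])
        cursor = message["next-cursor"]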
keywords:
retraction; metadata; Crossref; RISRS
published:
2022-07-25
This dataset represents the results of manual cleaning and annotation of the entity mentions contained in the raw dataset (https://doi.org/10.13012/B2IDB-4163883_V1). Each mention has been consolidated and linked to an identifier for a matching concept from the NCBI's taxonomy database.
keywords:
synthetic biology; NERC data; chemical mentions; cleaned data; ChEBI ontology
published:
2022-07-25
This dataset is derived from the raw entity mention dataset (https://doi.org/10.13012/B2IDB-4163883_V1) for chemical entities and represents those that were determined to be chemicals (i.e., were not noisy entities) but for which no corresponding concept could be found in the ChEBI ontology.
keywords:
synthetic biology; NERC data; chemical mentions; not found entities
published:
2022-07-25
A set of gene and gene-related entity mentions derived from an NERC dataset analyzing 900 synthetic biology articles published by the ACS. This data is associated with the Synthetic Biology Knowledge System repository (https://web.synbioks.org/). The data in this dataset are raw mentions from the NERC data.
keywords:
synthetic biology; NERC data; gene mentions
published:
2023-06-06
Korobskiy, Dmitriy; Chacko, George
(2023)
This dataset is derived from the COCI, the OpenCitations Index of Crossref open DOI-to-DOI references (opencitations.net): Silvio Peroni, David Shotton (2020). OpenCitations, an infrastructure organization for open scholarship. Quantitative Science Studies, 1(1): 428-444. https://doi.org/10.1162/qss_a_00023
We have curated it to remove duplicates, self-loops, and parallel edges. These data were copied from the OpenCitations website on May 6, 2023 and subsequently processed to produce a node list and an edge list. Integer ids have been assigned to the DOIs to reduce memory and storage needs when working with these data. As noted on the OpenCitations website, each record is a citing-cited pair that uses DOIs as persistent identifiers.
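A minimal sketch (not the curators' actual pipeline; the input and output file names are placeholders) of mapping DOIs to integer ids while dropping self-loops and duplicate/parallel edges:
import csv

node_ids = {}                              # DOI -> integer id

def node_id(doi):
    return node_ids.setdefault(doi, len(node_ids))

seen = set()
with open("citing_cited_pairs.csv", newline="") as src, \
     open("edge_list.tsv", "w", newline="") as dst:
    writer = csv.writer(dst, delimiter="\t")
    for citing_doi, cited_doi in csv.reader(src):
        if citing_doi == cited_doi:        # drop self-loops
            continue
        edge = (node_id(citing_doi), node_id(cited_doi))
        if edge in seen:                   # drop duplicate / parallel edges
            continue
        seen.add(edge)
        writer.writerow(edge)

with open("node_list.tsv", "w", newline="") as dst:
    writer = csv.writer(dst, delimiter="\t")
    writer.writerows((i, doi) for doi, i in node_ids.items())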
keywords:
open citations; bibliometrics; citation network; scientometrics
published:
2022-07-25
A set of cell-line entity mentions derived from an NERC dataset analyzing 900 synthetic biology articles published by the ACS. This data is associated with the Synthetic Biology Knowledge System repository (https://web.synbioks.org/). The data in this dataset are raw mentions from the NERC data.
keywords:
synthetic biology; NERC data; cell-line mentions
published:
2019-02-19
The organizations that contribute to the longevity of 67 long-lived molecular biology databases published in Nucleic Acids Research (NAR) between 1991 and 2016 were identified to address two research questions: 1) which organizations fund these databases? and 2) which organizations maintain these databases? Funders were determined by examining funding acknowledgements in each database's most recent NAR Database Issue update article published (prior to 2017), and organizations operating the databases were determined through review of database websites.
keywords:
databases; research infrastructure; sustainability; data sharing; molecular biology; bioinformatics; bibliometrics
published:
2018-04-19
MapAffil 2016 dataset -- PubMed author affiliations mapped to cities and their geocodes worldwide. Prepared by Vetle Torvik 2018-04-05
The dataset comes as a single tab-delimited Latin-1 encoded file (only the City column uses non-ASCII characters), and should be about 3.5GB uncompressed.
• How was the dataset created?
The dataset is based on a snapshot of PubMed (which includes Medline and PubMed-not-Medline records) taken in the first week of October, 2016.
See NLM's <a href="https://www.nlm.nih.gov/databases/download/pubmed_medline.html">Terms and Conditions</a> for information on obtaining PubMed/MEDLINE data.
• Affiliations are linked to a particular author on a particular article. Prior to 2014, NLM recorded the affiliation of the first author only.
However, MapAffil 2016 covers some PubMed records lacking affiliations that were harvested elsewhere, from PMC (e.g., PMID 22427989), NIH grants (e.g., 1838378), and Microsoft Academic Graph and ADS (e.g. 5833220).
• Affiliations are pre-processed (e.g., transliterated into ASCII from UTF-8 and html) so they may differ (sometimes a lot; see PMID 27487542) from PubMed records.
• All affiliation strings were processed using the MapAffil procedure to identify and disambiguate the most specific place-name, as described in:
<i>Torvik VI. MapAffil: A bibliographic tool for mapping author affiliation strings to cities and their geocodes worldwide. D-Lib Magazine 2015; 21 (11/12). 10p</i>
• Look for <a href="https://doi.org/10.1186/s41182-017-0073-6">Fig. 4</a> in the following article for coverage statistics over time:
<i>Palmblad M, Torvik VI. Spatiotemporal analysis of tropical disease research combining Europe PMC and affiliation mapping web services. Tropical medicine and health. 2017 Dec;45(1):33.</i>
Expect to see big upticks in coverage of PMIDs around 1988 and for non-first authors in 2014.
• The code and back-end data are periodically updated and made available for query by PMID at the <a href="http://abel.ischool.illinois.edu/">Torvik Research Group</a> website.
• What is the format of the dataset?
The dataset contains 37,406,692 rows. Each row (line) in the file has a unique PMID and author position (e.g., 10786286_3 is the third author name on PMID 10786286), and the following thirteen tab-delimited columns. All columns are ASCII, except city, which contains Latin-1.
1. PMID: positive non-zero integer; int(10) unsigned
2. au_order: positive non-zero integer; smallint(4)
3. lastname: varchar(80)
4. firstname: varchar(80); NLM started including these in 2002 but many have been harvested from outside PubMed
5. year of publication
6. type: EDU, HOS, EDU-HOS, ORG, COM, GOV, MIL, UNK
7. city: varchar(200); typically 'city, state, country' but could include further subdivisions; unresolved ambiguities are concatenated by '|'
8. state: given for Australia, Canada, and the USA (which includes territories like PR, GU, and AS, and post-codes like AE and AA)
9. country
10. journal
11. lat: at most 3 decimals (only available when city is not a country or state)
12. lon: at most 3 decimals (only available when city is not a country or state)
13. fips: varchar(5); for USA only; retrieved by lat-lon query to https://geo.fcc.gov/api/census/block/find (a minimal lookup sketch follows this list)
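A minimal sketch of working with these columns (the file name is a placeholder, the absence of a header row is an assumption based on the description above, and the shape of the FCC API response is assumed):
import pandas as pd
import requests

cols = ["pmid", "au_order", "lastname", "firstname", "year", "type", "city",
        "state", "country", "journal", "lat", "lon", "fips"]
sample = pd.read_csv("mapaffil2016.tsv", sep="\t", names=cols,
                     encoding="latin-1", dtype=str, nrows=100000)   # sample; full file is ~3.5 GB

def latlon_to_county_fips(lat, lon):
    # The same lat-lon lookup described for column 13, via the FCC census block API.
    resp = requests.get("https://geo.fcc.gov/api/census/block/find",
                        params={"latitude": lat, "longitude": lon, "format": "json"},
                        timeout=30)
    return resp.json()["County"]["FIPS"]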
keywords:
PubMed; MEDLINE; Digital Libraries; Bibliographic Databases; Author Affiliations; Geographic Indexing; Place Name Ambiguity; Geoparsing; Geocoding; Toponym Extraction; Toponym Resolution
published:
2025-03-18
Cline Center for Advanced Social Research
(2025)
The Cline Center Global News Index is a searchable database of textual features extracted from millions of news stories, specifically designed to provide comprehensive coverage of events around the world. In addition to searching documents for keywords, users can query metadata and features such as named entities extracted using Natural Language Processing (NLP) methods and variables that measure sentiment and emotional valence.
Archer is a web application purpose-built by the Cline Center to enable researchers to access data from the Global News Index. Archer provides a user-friendly interface for querying the Global News Index (with the back-end indexing still handled by Solr). By default, queries are built using icons and drop-down menus. More technically-savvy users can use Lucene/Solr query syntax via a ‘raw query’ option. Archer allows users to save and iterate on their queries, and to visualize faceted query results, which can be helpful for users as they refine their queries.
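For example, a raw query might combine a phrase search, a field filter, and a date range in standard Lucene/Solr syntax (the field names below are purely illustrative and are not the Global News Index's actual schema):
headline:"supply chain" AND country:"United States" AND publication_date:[2020-01-01T00:00:00Z TO 2020-12-31T23:59:59Z]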
Additional Resources:
- Access to Archer and the Global News Index is limited to account-holders. If you are interested in signing up for an account, please fill out the <a href="https://docs.google.com/forms/d/e/1FAIpQLSf-J937V6I4sMSxQt7gR3SIbUASR26KXxqSurrkBvlF-CIQnQ/viewform?usp=pp_url"><b>Archer Access Request Form</b></a> so we can determine whether you are eligible for access.
- Current users who would like to provide feedback, such as reporting a bug or requesting a feature, can fill out the <a href="https://forms.gle/6eA2yJUGFMtj5swY7"><b>Archer User Feedback Form</b></a>.
- The Cline Center sends out periodic email newsletters to the Archer Users Group. Please fill out this <a href="https://groups.webservices.illinois.edu/subscribe/154221"><b>form</b></a> to subscribe to it.
<b>Citation Guidelines:</b>
1) To cite the GNI codebook (or any other documentation associated with the Global News Index and Archer) please use the following citation:
Cline Center for Advanced Social Research. 2025. Global News Index and Extracted Features Repository [codebook], v1.3.0. Champaign, IL: University of Illinois. June. XX. doi:10.13012/B2IDB-5649852_V6
2) To cite data from the Global News Index (accessed via Archer or otherwise) please use the following citation (filling in the correct date of access):
Cline Center for Advanced Social Research. 2025. Global News Index and Extracted Features Repository [database], v1.3.0. Champaign, IL: University of Illinois. Jun. XX. Accessed Month, DD, YYYY. doi:10.13012/B2IDB-5649852_V6
*NOTE: V6 is replacing V5 with updated ‘Archer’ documents to reflect changes made to the Archer system.
published:
2018-09-06
XSEDE-Extreme Science and Engineering Discovery Environment
(2018)
The XSEDE program manages the database of allocation awards for the portfolio of advanced research computing resources funded by the National Science Foundation (NSF). The database holds data for allocation awards dating from the start of the TeraGrid program in 2004 to the present, with awards continuing through the end of the second XSEDE award in 2021. The project data include lead researcher and affiliation, title and abstract, field of science, and the start and end dates. Along with the project information, the data set includes resource allocation and usage data for each award associated with the project. The data show the transition of resources over a fifteen-year span along with the evolution of researchers, fields of science, and institutional representation.
keywords:
allocations; cyberinfrastructure; XSEDE
published:
2018-03-08
This dataset was developed to create a census of sufficiently documented molecular biology databases to answer several preliminary research questions. Articles published in the annual Nucleic Acids Research (NAR) “Database Issues” were used to identify a population of databases for study. Namely, the questions addressed herein include: 1) what is the historical rate of database proliferation versus rate of database attrition?, 2) to what extent do citations indicate persistence?, and 3) are databases under active maintenance and does evidence of maintenance likewise correlate to citation? An overarching goal of this study is to provide the ability to identify subsets of databases for further analysis, both as presented within this study and through subsequent use of this openly released dataset.
keywords:
databases; research infrastructure; sustainability; data sharing; molecular biology; bioinformatics; bibliometrics
published:
2025-05-05
Benson, Sara; Cheng, Siyao; Ton, Mary; Graves, Celenia; Owens, Dawn
(2025)
The dataset includes responses from approximately 550 participants to survey questions about trust in images labeled with AI-related tags, compared to other images found online. The questions also explore how the type of label influences their trust.
keywords:
Artificial intelligence (AI); Trust in AI; AI labeling; AI ethics
published:
2016-06-06
These datasets represent first-time collaborations between first and last authors (with mutually exclusive publication histories) on papers with 2 to 5 authors in years [1988,2009] in PubMed. Each record of each dataset captures aspects of the similarity, nearness, and complementarity between two authors about the paper marking the formation of their collaboration.
published:
2024-03-27
Zheng, Heng; Schneider, Jodi
(2024)
To gather news articles from the web that discuss the Cochrane Review, we used Altmetric Explorer from Altmetric.com and retrieved articles on August 1, 2023. We selected all articles that were written in English, published in the United States, and had a publication date <b>prior to March 10, 2023</b> (according to the “Mention Date” on Altmetric.com). This date is significant as it is when Cochrane issued a statement about the "misleading interpretation" of the Cochrane Review.
The collection of news articles is presented in the Altmetric_data.csv file. The dataset contains the following data that we exported from Altmetric Explorer:
- Publication date of the news article
- Title of the news article
- Source/publication venue of the news article
- URL
- Country
We manually checked and added the following information:
- Whether the article still exists
- Whether the article is accessible
- Whether the article is from the original source
We assigned MAXQDA IDs to the news articles. News articles were assigned the same ID when they were (a) identical or (b) in the case of Article 207, closely paraphrased, paragraph by paragraph. Inaccessible items were assigned a MAXQDA ID based on their "Mention Title".
For each article from Altmetric.com, we first tried to use the Web Collector for MAXQDA to download the article from the website and imported it into MAXQDA (version 22.7.0). If an article could not be retrieved using the Web Collector, we either downloaded the .html file or in the case of Article 128, retrieved it from the NewsBank database through the University of Illinois Library.
We then manually extracted direct quotations from the articles using MAXQDA.
We included surrounding words and sentences, and in one case, a news agency’s commentary, around direct quotations for context where needed. The quotations (with context) are the positions in our analysis.
We also identified who was quoted. We excluded quotations when we could not identify who or what was being quoted. We annotated quotations with codes representing groups (government agencies, other organizations, and research publications) and individuals (authors of the Cochrane Review, government agency representatives, journalists, and other experts such as epidemiologists).
The MAXQDA_data.csv file contains excerpts from the news articles that contain the direct quotations we identified. For each excerpt, we included the following information:
- MAXQDA ID of the document from which the excerpt originates;
- The collection date and source of the document;
- The code with which the excerpt is annotated;
- The code category;
- The excerpt itself.
keywords:
altmetrics; MAXQDA; polylogue analysis; masks for COVID-19; scientific controversies; news articles
published:
2020-02-12
Asplund, Joshua; Karahalios, Karrie
(2020)
This dataset contains the results of a three month audit of housing advertisements. It accompanies the 2020 ICWSM paper "Auditing Race and Gender Discrimination in Online Housing Markets". It covers data collected between Dec 7, 2018 and March 19, 2019.
There are two json files in the dataset. The first contains a list of json objects, separated by newlines, each representing an advertisement. Each object includes the date and time it was collected, the image and title (if collected) of the ad, the page on which it was displayed, and the training treatment it received. The second file is a list of json objects, separated by newlines, each representing a visit to a housing listing site. Each object contains the url, the training treatment applied, the location searched, and the metadata of the top sites scraped. This metadata includes location, price, and number of rooms.
The dataset also includes the raw images of ads collected in order to code them by interest and targeting. These were captured by selenium and named using a perceptual hash to de-duplicate images.
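A minimal sketch (not the authors' code; the directory name is a placeholder) of de-duplicating ad screenshots with a perceptual hash, using the Pillow and ImageHash libraries:
from pathlib import Path
from PIL import Image
import imagehash

unique_ads = {}                                   # hash string -> one representative file
for path in Path("ad_screenshots").glob("*.png"):
    h = str(imagehash.phash(Image.open(path)))    # near-identical images hash alike
    unique_ads.setdefault(h, path)
print(len(unique_ads), "unique ad images")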
keywords:
algorithmic audit; advertisement audit;
published:
2020-02-23
Ye, Di; Hill, Alison; Whitehorn (Fulton), Ashley; Schneider, Jodi
(2020)
Citation context annotation for papers citing retracted paper Matsuyama 2005 (RETRACTED: Matsuyama W, Mitsuyama H, Watanabe M, Oonakahara KI, Higashimoto I, Osame M, Arimura K. Effects of omega-3 polyunsaturated fatty acids on inflammatory markers in COPD. Chest. 2005 Dec 1;128(6):3817-27.), retracted in 2008 (Retraction in: Chest (2008) 134:4 (893) <a href="https://doi.org/10.1016/S0012-3692(08)60339-6">https://doi.org/10.1016/S0012-3692(08)60339-6</a>). This is part of the supplemental data for Jodi Schneider, Di Ye, Alison Hill, and Ashley Whitehorn. "Continued Citation of a Fraudulent Clinical Trial Report, Eleven Years after it was retracted for Falsifying Data" [R&R under review with Scientometrics].
Overall we found 148 citations to the retracted paper from 2006 to 2019. However, this dataset does not include the annotations described in the 2015 article: Ashley Fulton, Alison Coates, Marie Williams, Peter Howe, and Alison Hill. "Persistent citation of the only published randomised controlled trial of omega-3 supplementation in chronic obstructive pulmonary disease six years after its retraction." Publications 3, no. 1 (2015): 17-26.
In this dataset 70 new and newly found citations are listed: 66 annotated citations and 4 pending citations (non-annotated since we don't have full-text).
"New citations" refer to articles published from March 25, 2014 to 2019, found in Google Scholar and Web of Science.
"Newly found citations" refer articles published 2006-2013, found in Google Scholar and Web of Science, but not previously covered in Ashley Fulton, Alison Coates, Marie Williams, Peter Howe, and Alison Hill. "Persistent citation of the only published randomised controlled trial of omega-3 supplementation in chronic obstructive pulmonary disease six years after its retraction." Publications 3, no. 1 (2015): 17-26.
NOTES:
This is Unicode data. Some publication titles & quotes are in non-Latin characters and they may contain commas, quotation marks, etc.
FILES/FILE FORMATS
Same data in two formats:
2006-2019-new-citation-contexts-to-Matsuyama.csv - Unicode CSV (preservation format only)
2006-2019-new-citation-contexts-to-Matsuyama.xlsx - Excel workbook (preferred format)
ROW EXPLANATIONS
70 rows of data - one citing publication per row
COLUMN HEADER EXPLANATIONS
Note - processing notes
Annotation pending - Y or blank
Year Published - publication year
ID - ID corresponding to the network analysis. See Ye, Di; Schneider, Jodi (2019): Network of First and Second-generation citations to Matsuyama 2005 from Google Scholar and Web of Science. University of Illinois at Urbana-Champaign. <a href="https://doi.org/10.13012/B2IDB-1403534_V2">https://doi.org/10.13012/B2IDB-1403534_V2</a>
Title - item title (some have non-Latin characters, commas, etc.)
Official Translated Title - item title in English, as listed in the publication
Machine Translated Title - item title in English, translated by Google Scholar
Language - publication language
Type - publication type (e.g., bachelor's thesis, blog post, book chapter, clinical guidelines, Cochrane Review, consumer-oriented evidence summary, continuing education journal article, journal article, letter to the editor, magazine article, Master's thesis, patent, Ph.D. thesis, textbook chapter, training module)
Book title for book chapters - Only for a book chapter - the book title
University for theses - for bachelor's thesis, Master's thesis, Ph.D. thesis - the associated university
Pre/Post Retraction - "Pre" for 2006-2008 (means published before the October 2008 retraction notice or in the 2 months afterwards); "Post" for 2009-2019 (considered post-retraction for our analysis)
Identifier where relevant - ISBN, Patent ID, PMID (only for items we considered hard to find/identify, e.g. those without a DOI-based URL)
URL where available - URL, ideally a DOI-based URL
Reference number/style - reference
Only in bibliography - Y or blank
Acknowledged - If annotated, Y, Not relevant as retraction not published yet, or N (blank otherwise)
Positive / "Poor Research" (Negative) - P for positive, N for negative if annotated; blank otherwise
Human translated quotations - Y or blank; blank means Google Scholar was used to translate quotations for Translated Quotation X
Specific/in passing (overall) - Specific if any of the 5 quotations are specific [aggregates Specific / In Passing (Quotation X)]
Quotation 1 - First quotation (or blank) (includes non-Latin characters in some cases)
Translated Quotation 1 - English translation of "Quotation 1" (or blank)
Specific / In Passing (Quotation 1) - Specific if "Quotation 1" refers to methods or results of the Matsuyama paper (or blank)
What is referenced from Matsuyama (Quotation 1) - Methods; Results; or Methods and Results - blank if "Quotation 1" not specific, no associated quotation, or not yet annotated
Quotation 2 - Second quotation (includes non-Latin characters in some cases)
Translated Quotation 2 - English translation of "Quotation 2"
Specific / In Passing (Quotation 2) - Specific if "Quotation 2" refers to methods or results of the Matsuyama paper (or blank)
What is referenced from Matsuyama (Quotation 2) - Methods; Results; or Methods and Results - blank if "Quotation 2" not specific, no associated quotation, or not yet annotated
Quotation 3 - Third quotation (includes non-Latin characters in some cases)
Translated Quotation 3 - English translation of "Quotation 3"
Specific / In Passing (Quotation 3) - Specific if "Quotation 3" refers to methods or results of the Matsuyama paper (or blank)
What is referenced from Matsuyama (Quotation 3) - Methods; Results; or Methods and Results - blank if "Quotation 3" not specific, no associated quotation, or not yet annotated
Quotation 4 - Fourth quotation (includes non-Latin characters in some cases)
Translated Quotation 4 - English translation of "Quotation 4"
Specific / In Passing (Quotation 4) - Specific if "Quotation 4" refers to methods or results of the Matsuyama paper (or blank)
What is referenced from Matsuyama (Quotation 4) - Methods; Results; or Methods and Results - blank if "Quotation 4" not specific, no associated quotation, or not yet annotated
Quotation 5 - Fifth quotation (includes non-Latin characters in some cases)
Translated Quotation 5 - English translation of "Quotation 5"
Specific / In Passing (Quotation 5) - Specific if "Quotation 5" refers to methods or results of the Matsuyama paper (or blank)
What is referenced from Matsuyama (Quotation 5) - Methods; Results; or Methods and Results - blank if "Quotation 5" not specific, no associated quotation, or not yet annotated
Further Notes - additional notes
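Because the five quotation columns above form a wide layout, here is a minimal reshaping sketch for per-quotation analysis (assuming the Excel file's column headers match the names listed above exactly):
import pandas as pd

df = pd.read_excel("2006-2019-new-citation-contexts-to-Matsuyama.xlsx")
quotation_cols = [f"Quotation {i}" for i in range(1, 6)]
long = (df.melt(id_vars=["ID", "Title", "Year Published"],
                value_vars=quotation_cols,
                var_name="quotation_number", value_name="quotation")
          .dropna(subset=["quotation"]))   # keep only rows with an actual quotation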
keywords:
citation context annotation, retraction, diffusion of retraction
published:
2020-07-16
Mishra, Shubhanshu
(2020)
Dataset to be used for the SocialMediaIE tutorial.
keywords:
social media; deep learning; natural language processing
published:
2021-07-22
Hsiao, Tzu-Kun; Schneider, Jodi
(2021)
This dataset includes five files. Descriptions of the files are given as follows:
<b>FILENAME: PubMed_retracted_publication_full_v3.tsv</b>
- Bibliographic data of retracted papers indexed in PubMed (retrieved on August 20, 2020, searched with the query "retracted publication" [PT] ).
- Except for the information in the "cited_by" column, all the data is from PubMed.
- PMIDs in the "cited_by" column that meet either of the two conditions below have been excluded from analyses:
[1] PMIDs of the citing papers are from retraction notices (i.e., those in the “retraction_notice_PMID.csv” file).
[2] Citing paper and the cited retracted paper have the same PMID.
ROW EXPLANATIONS
- Each row is a retracted paper. There are 7,813 retracted papers.
COLUMN HEADER EXPLANATIONS
1) PMID - PubMed ID
2) Title - Paper title
3) Authors - Author names
4) Citation - Bibliographic information of the paper
5) First Author - First author's name
6) Journal/Book - Publication name
7) Publication Year
8) Create Date - The date the record was added to the PubMed database
9) PMCID - PubMed Central ID (if applicable, otherwise blank)
10) NIHMS ID - NIH Manuscript Submission ID (if applicable, otherwise blank)
11) DOI - Digital object identifier (if applicable, otherwise blank)
12) retracted_in - Information of retraction notice (given by PubMed)
13) retracted_yr - Retraction year identified from "retracted_in" (if applicable, otherwise blank)
14) cited_by - PMIDs of the citing papers. (if applicable, otherwise blank) Data collected from iCite.
15) retraction_notice_pmid - PMID of the retraction notice (if applicable, otherwise blank)
<b>FILENAME: PubMed_retracted_publication_CitCntxt_withYR_v3.tsv</b>
- This file contains citation contexts (i.e., citing sentences) where the retracted papers were cited. The citation contexts were identified from the XML version of PubMed Central open access (PMCOA) articles.
- This is part of the data from: Hsiao, T.-K., & Torvik, V. I. (manuscript in preparation). Citation contexts identified from PubMed Central open access articles: A resource for text mining and citation analysis.
- Citation contexts that meet either of the two conditions below have been excluded from analyses:
[1] PMIDs of the citing papers are from retraction notices (i.e., those in the “retraction_notice_PMID.csv” file).
[2] Citing paper and the cited retracted paper have the same PMID.
ROW EXPLANATIONS
- Each row is a citation context associated with one retracted paper that's cited.
- In the manuscript, we count each citation context once, even if it cites multiple retracted papers.
COLUMN HEADER EXPLANATIONS
1) pmcid - PubMed Central ID of the citing paper
2) pmid - PubMed ID of the citing paper
3) year - Publication year of the citing paper
4) location - Location of the citation context (abstract = abstract, body = main text, back = supporting material, tbl_fig_caption = tables and table/figure captions)
5) IMRaD - IMRaD section of the citation context (I = Introduction, M = Methods, R = Results, D = Discussions/Conclusion, NoIMRaD = not identified)
6) sentence_id - The ID of the citation context in a given location. For location information, please see column 4. The first sentence in the location gets the ID 1, and subsequent sentences are numbered consecutively.
7) total_sentences - Total number of sentences in a given location
8) intxt_id - Identifier of a cited paper. Here, a cited paper is the retracted paper.
9) intxt_pmid - PubMed ID of a cited paper. Here, a cited paper is the retracted paper.
10) citation - The citation context
11) progression - Position of a citation context by centile within the citing paper.
12) retracted_yr - Retraction year of the retracted paper
13) post_retraction - 0 = not post-retraction citation; 1 = post-retraction citation. A post-retraction citation is a citation made after the calendar year of retraction.
<b>FILENAME: 724_knowingly_post_retraction_cit.csv</b> (updated)
- The 724 post-retraction citation contexts that we determined knowingly cited the 7,813 retracted papers in "PubMed_retracted_publication_full_v3.tsv".
- Two citation contexts from retraction notices have been excluded from analyses.
ROW EXPLANATIONS
- Each row is a citation context.
COLUMN HEADER EXPLANATIONS
1) pmcid - PubMed Central ID of the citing paper
2) pmid - PubMed ID of the citing paper
3) pub_type - Publication type collected from the metadata in the PMCOA XML files.
4) pub_type2 - Specific article types. Please see the manuscript for explanations.
5) year - Publication year of the citing paper
6) location - Location of the citation context (abstract = abstract, body = main text, back = supporting material, table_or_figure_caption = tables and table/figure captions)
7) intxt_id - Identifier of a cited paper. Here, a cited paper is the retracted paper.
8) intxt_pmid - PubMed ID of a cited paper. Here, a cited paper is the retracted paper.
9) citation - The citation context
10) retracted_yr - Retraction year of the retracted paper
11) cit_purpose - Purpose of citing the retracted paper. This is from human annotations. Please see the manuscript for further information about annotation.
12) longer_context - An extended version of the citation context (if applicable, otherwise blank). Manually pulled from the full texts in the process of annotation.
<b>FILENAME: Annotation manual.pdf</b>
- The manual for annotating the citation purposes in column 11) of 724_knowingly_post_retraction_cit.csv.
<b>FILENAME: retraction_notice_PMID.csv</b> (new file added for this version)
- A list of 8,346 PMIDs of retraction notices indexed in PubMed (retrieved on August 20, 2020, searched with the query "retraction of publication" [PT] ).
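A minimal sketch (not the authors' analysis code; the retraction-notice file's column is addressed positionally because its header is not described above) of loading these files and linking citation contexts to the retracted papers they cite:
import pandas as pd

retracted = pd.read_csv("PubMed_retracted_publication_full_v3.tsv", sep="\t")
contexts = pd.read_csv("PubMed_retracted_publication_CitCntxt_withYR_v3.tsv", sep="\t")
notices = pd.read_csv("retraction_notice_PMID.csv")

# Apply the two exclusions described above: citing papers that are retraction notices,
# and rows where citing and cited PMID are identical.
notice_pmids = set(notices.iloc[:, 0])
contexts = contexts[~contexts["pmid"].isin(notice_pmids)]
contexts = contexts[contexts["pmid"] != contexts["intxt_pmid"]]

# Attach the retracted paper's bibliographic data to each citation context via its PMID.
linked = contexts.merge(retracted, left_on="intxt_pmid", right_on="PMID", how="left")
post_retraction = linked[linked["post_retraction"] == 1]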
keywords:
citation context; in-text citation; citation to retracted papers; retraction
published:
2021-11-05
Keralis, Spencer D. C.; Yakin, Syamil
(2021)
This data set contains survey results from a 2021 survey of University of Illinois University Library employees conducted as part of the Becoming A Trans Inclusive Library Project to evaluate the awareness of University of Illinois faculty, staff, and student employees regarding transgender identities, and to assess the professional development needs of library employees to better serve trans and gender non-conforming patrons. The survey instrument is available in the IDEALS repository: http://hdl.handle.net/2142/110080.
keywords:
transgender awareness, academic library, gender identity awareness, professional development opportunities
published:
2023-03-28
Hsiao, Tzu-Kun; Torvik, Vetle
(2023)
Sentences and citation contexts identified from the PubMed Central open access articles
----------------------------------------------------------------------
The dataset is delivered as 24 tab-delimited text files. The files contain 720,649,608 sentences, 75,848,689 of which are citation contexts. The dataset is based on a snapshot of articles in the XML version of the PubMed Central open access subset (i.e., the PMCOA subset). The PMCOA subset was collected in May 2019.
The dataset is created as described in: Hsiao TK., & Torvik V. I. (manuscript) OpCitance: Citation contexts identified from the PubMed Central open access articles.
<b>Files</b>:
• A_journal_IntxtCit.tsv – Sentences and citation contexts identified from articles published in journals with journal titles starting with A.
• B_journal_IntxtCit.tsv – Sentences and citation contexts identified from articles published in journals with journal titles starting with B.
• C_journal_IntxtCit.tsv – Sentences and citation contexts identified from articles published in journals with journal titles starting with C.
• D_journal_IntxtCit.tsv – Sentences and citation contexts identified from articles published in journals with journal titles starting with D.
• E_journal_IntxtCit.tsv – Sentences and citation contexts identified from articles published in journals with journal titles starting with E.
• F_journal_IntxtCit.tsv – Sentences and citation contexts identified from articles published in journals with journal titles starting with F.
• G_journal_IntxtCit.tsv – Sentences and citation contexts identified from articles published in journals with journal titles starting with G.
• H_journal_IntxtCit.tsv – Sentences and citation contexts identified from articles published in journals with journal titles starting with H.
• I_journal_IntxtCit.tsv – Sentences and citation contexts identified from articles published in journals with journal titles starting with I.
• J_journal_IntxtCit.tsv – Sentences and citation contexts identified from articles published in journals with journal titles starting with J.
• K_journal_IntxtCit.tsv – Sentences and citation contexts identified from articles published in journals with journal titles starting with K.
• L_journal_IntxtCit.tsv – Sentences and citation contexts identified from articles published in journals with journal titles starting with L.
• M_journal_IntxtCit.tsv – Sentences and citation contexts identified from articles published in journals with journal titles starting with M.
• N_journal_IntxtCit.tsv – Sentences and citation contexts identified from articles published in journals with journal titles starting with N.
• O_journal_IntxtCit.tsv – Sentences and citation contexts identified from articles published in journals with journal titles starting with O.
• P_p1_journal_IntxtCit.tsv – Sentences and citation contexts identified from articles published in journals with journal titles starting with P (part 1).
• P_p2_journal_IntxtCit.tsv – Sentences and citation contexts identified from articles published in journals with journal titles starting with P (part 2).
• Q_journal_IntxtCit.tsv – Sentences and citation contexts identified from articles published in journals with journal titles starting with Q.
• R_journal_IntxtCit.tsv – Sentences and citation contexts identified from articles published in journals with journal titles starting with R.
• S_journal_IntxtCit.tsv – Sentences and citation contexts identified from articles published in journals with journal titles starting with S.
• T_journal_IntxtCit.tsv – Sentences and citation contexts identified from articles published in journals with journal titles starting with T.
• UV_journal_IntxtCit.tsv – Sentences and citation contexts identified from articles published in journals with journal titles starting with U or V.
• W_journal_IntxtCit.tsv – Sentences and citation contexts identified from articles published in journals with journal titles starting with W.
• XYZ_journal_IntxtCit.tsv – Sentences and citation contexts identified from articles published in journals with journal titles starting with X, Y or Z.
Each row in the file is a sentence/citation context and contains the following columns (a loading sketch follows this list):
• pmcid: PMCID of the article
• pmid: PMID of the article. If an article does not have a PMID, the value is NONE.
• location: The article component (abstract, main text, table, figure, etc.) to which the citation context/sentence belongs.
• IMRaD: The type of IMRaD section associated with the citation context/sentence. I, M, R, and D represent introduction/background, method, results, and conclusion/discussion, respectively; NoIMRaD indicates that the section type is not identifiable.
• sentence_id: The ID of the citation context/sentence in the article component
• total_sentences: The number of sentences in the article component.
• intxt_id: The ID of the citation.
• intxt_pmid: PMID of the citation (as tagged in the XML file). If a citation does not have a PMID tagged in the XML file, the value is "-".
• intxt_pmid_source: The sources where the intxt_pmid can be identified. xml indicates that the PMID is identified only from the XML file; xml,pmc indicates that the PMID is not only from the XML file but also in the citation data collected from the NCBI Entrez Programming Utilities. If a citation does not have an intxt_pmid, the value is "-".
• intxt_mark: The citation marker associated with the inline citation.
• best_id: The best source link ID (e.g., PMID) of the citation.
• best_source: The sources that confirm the best ID.
• best_id_diff: The comparison result between the best_id column and the intxt_pmid column.
• citation: A citation context. If no citation is found in a sentence, the value is the sentence.
• progression: Text progression of the citation context/sentence.
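A minimal loading sketch for one of the *_journal_IntxtCit.tsv files, using the column names above (whether the files include a header row is not stated, so passing names= is an assumption, as is the convention for non-citation rows):
import csv
import pandas as pd

cols = ["pmcid", "pmid", "location", "IMRaD", "sentence_id", "total_sentences",
        "intxt_id", "intxt_pmid", "intxt_pmid_source", "intxt_mark",
        "best_id", "best_source", "best_id_diff", "citation", "progression"]
df = pd.read_csv("A_journal_IntxtCit.tsv", sep="\t", names=cols, dtype=str,
                 quoting=csv.QUOTE_NONE, na_values=["-", "NONE"])

# One plausible way to separate citation contexts from plain sentences
# (assumption: non-citation rows carry no intxt_id value).
citation_contexts = df[df["intxt_id"].notna()]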
<b>Supplementary Files</b>
• PMC-OA-patci.tsv.gz – This file contains the best source link IDs for the references (e.g., PMID). Patci [1] was used to identify the best source link IDs. The best source link IDs are mapped to the citation contexts and displayed in the *_journal_IntxtCit.tsv files as the best_id column.
Each row in the PMC-OA-patci.tsv.gz file is a citation (i.e., a reference extracted from the XML file) and contains the following columns:
• pmcid: PMCID of the citing article.
• pos: The citation's position in the reference list.
• fromPMID: PMID of the citing article.
• toPMID: Source link ID (e.g., PMID) of the citation. This ID is identified by Patci.
• SRC: The sources that confirm the toPMID.
• MatchDB: The origin bibliographic database of the toPMID.
• Probability: The match probability of the toPMID.
• toPMID2: PMID of the citation (as tagged in the XML file).
• SRC2: The sources that confirm the toPMID2.
• intxt_id: The ID of the citation.
• journal: The first letter of the journal title. This maps to the *_journal_IntxtCit.tsv files.
• same_ref_string: Whether the citation string appears in the reference list more than once.
• DIFF: The comparison result between the toPMID column and the toPMID2 column.
• bestID: The best source link ID (e.g., PMID) of the citation.
• bestSRC: The sources that confirm the best ID.
• Match: Matching result produced by Patci.
[1] Agarwal, S., Lincoln, M., Cai, H., & Torvik, V. (2014). Patci – a tool for identifying scientific articles cited by patents. GSLIS Research Showcase 2014. http://hdl.handle.net/2142/54885
• intxt_cit_license_fromPMC.tsv – This file contains the CC licensing information for each article. The licensing information is from PMC's file lists [2], retrieved on June 19, 2020, and March 9, 2023. It should be noted that the license information for 189,855 PMCIDs is <b>NO-CC CODE</b> in the file lists, and 521 PMCIDs are absent in the file lists. The absence of CC licensing information does not indicate that the article lacks a CC license. For example, PMCID: 6156294 (<b>NO-CC CODE</b>) and PMCID: 6118074 (absent in the PMC's file lists) are under CC-BY licenses according to their PDF versions of articles.
The intxt_cit_license_fromPMC.tsv file has two columns:
• pmcid: PMCID of the article.
• license: The article’s CC license information provided in PMC’s file lists. The value is nan when an article is not present in the PMC’s file lists.
[2] https://www.ncbi.nlm.nih.gov/pmc/tools/ftp/
• Supplementary_File_1.zip – This file contains the code for generating the dataset.
keywords:
citation context; in-text citation; inline citation; bibliometrics; science of science
published:
2016-12-19
Files in this dataset represent an investigation into use of the Library mobile app Minrva during the months of May 2015 through December 2015. During this time interval 45,975 API hits were recorded by the Minrva web server. The dataset included herein is an analysis of the following: 1) a delineation of API hits to mobile app module use in the Minrva app by month, 2) a general analysis of Minrva app downloads to module use, and 3) the annotated data file providing associations from API hits to specific modules used, organized by month (May 2015 – December 2015).
keywords:
API analysis; log analysis; Minrva Mobile App
published:
2023-08-02
Jeng, Amos; Bosch, Nigel; Perry, Michelle
(2023)
This dataset was developed as part of an online survey study that investigates how phatic expressions—comments that are social rather than informative in nature—influence the perceived helpfulness of online peer help-giving replies in an asynchronous college course discussion forum. During the study, undergraduate students (N = 320) rated and described the helpfulness of examples of replies to online requests for help, both with and without four types of phatic expressions: greeting/parting tokens, other-oriented comments, self-oriented comments, and neutral comments.
keywords:
help-giving; phatic expression; discussion forum; online learning; engagement
published:
2023-07-14
Schneider, Jodi; Das, Susmita; Léveillé, Jacqueline; Proescholdt, Randi
(2023)
Data for Post-retraction citation: A review of scholarly research on the spread of retracted science
Schneider, Jodi; Das, Susmita; Léveillé, Jacqueline; Proescholdt, Randi
Contact: Jodi Schneider jodi@illinois.edu & jschneider@pobox.com
**********
OVERVIEW
**********
This dataset provides further analysis for an ongoing literature review about post-retraction citation.
This ongoing work extends a poster presented as:
Jodi Schneider, Jacqueline Léveillé, Randi Proescholdt, Susmita Das, and The RISRS Team. Characterization of Publications on Post-Retraction Citation of Retracted Articles. Presented at the Ninth International Congress on Peer Review and Scientific Publication, September 8-10, 2022 hybrid in Chicago. https://hdl.handle.net/2142/114477 (now also in https://peerreviewcongress.org/abstract/characterization-of-publications-on-post-retraction-citation-of-retracted-articles/ )
Items as of the poster version are listed in the bibliography 92-PRC-items.pdf.
Note that following the poster, we made several changes to the dataset (see changes-since-PRC-poster.txt). For both the poster dataset and the current dataset, 5 items have 2 categories (see 5-items-have-2-categories.txt).
Articles were selected from the Empirical Retraction Lit bibliography (https://infoqualitylab.org/projects/risrs2020/bibliography/ and https://doi.org/10.5281/zenodo.5498474 ). The current dataset includes 92 items; 91 items were selected from the 386 total items in Empirical Retraction Lit bibliography version v.2.15.0 (July 2021); 1 item was added because it is the final form publication of a grouping of 2 items from the bibliography: Yang (2022) Do retraction practices work effectively? Evidence from citations of psychological retracted articles http://doi.org/10.1177/01655515221097623
Items were classified into 7 topics; 2 of the 7 topics have been analyzed to date.
**********************
OVERVIEW OF ANALYSIS
**********************
DATA ANALYZED:
2 of the 7 topics have been analyzed to date:
field-based case studies (n = 20)
author-focused case studies of 1 or several authors with many retracted publications (n = 15)
FUTURE DATA TO BE ANALYZED, NOT YET COVERED:
5 of the 7 topics have not yet been analyzed as of this release:
database-focused analyses (n = 33)
paper-focused case studies of 1 to 125 selected papers (n = 15)
studies of retracted publications cited in review literature (n = 8)
geographic case studies (n = 4)
studies selecting retracted publications by method (n = 2)
**************
FILE LISTING
**************
------------------
BIBLIOGRAPHY
------------------
92-PRC-items.pdf
------------------
TEXT FILES
------------------
README.txt
5-items-have-2-categories.txt
changes-since-PRC-poster.txt
------------------
CODEBOOKS
------------------
Codebook for authors.docx
Codebook for authors.pdf
Codebook for field.docx
Codebook for field.pdf
Codebook for KEY.docx
Codebook for KEY.pdf
------------------
SPREADSHEETS
------------------
field.csv
field.xlsx
multipleauthors.csv
multipleauthors.xlsx
multipleauthors-not-named.csv
multipleauthors-not-named.xlsx
singleauthors.csv
singleauthors.xlsx
***************************
DESCRIPTION OF FILE TYPES
***************************
BIBLIOGRAPHY (92-PRC-items.pdf) presents the items, as of the poster version. This has minor differences from the current data set. Consult changes-since-PRC-poster.txt for details on the differences.
TEXT FILES provide notes for additional context. These files end in .txt.
CODEBOOKS describe the data we collected. The same data is provided in both Word (.docx) and PDF format.
There is one general codebook that is referred to in the other codebooks: Codebook for KEY lists fields assigned (e.g., for a journal or conference). Note that this is distinct from the overall analysis in the Empirical Retraction Lit bibliography of fields analyzed; for that analysis see Proescholdt, Randi (2021): RISRS Retraction Review - Field Variation Data. University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-2070560_V1
Other codebooks document specific information we entered on each column of a spreadsheet.
SPREADSHEETS present the data collected. The same data is provided in both Excel (.xlsx) and CSV format.
Each data row describes a publication or item (e.g., thesis, poster, preprint).
For column header explanations, see the associated codebook.
*****************************
DETAILS ON THE SPREADSHEETS
*****************************
field-based case studies
CODEBOOK: Codebook for field
--REFERS TO: Codebook for KEY
DATA SHEET: field
REFERS TO: Codebook for KEY
--NUMBER OF DATA ROWS: 20 NOTE: Each data row describes a publication/item.
--NUMBER OF PUBLICATION GROUPINGS: 17
--GROUPED PUBLICATIONS: Rubbo (2019) - 2 items, Yang (2022) - 3 items
author-focused case studies of 1 or several authors with many retracted publications
CODEBOOK: Codebook for authors
--REFERS TO: Codebook for KEY
DATA SHEET 1: singleauthors (n = 9)
--NUMBER OF DATA ROWS: 9
--NUMBER OF PUBLICATION GROUPINGS: 9
DATA SHEET 2: multipleauthors (n = 5)
--NUMBER OF DATA ROWS: 5
--NUMBER OF PUBLICATION GROUPINGS: 5
DATA SHEET 3: multipleauthors-not-named (n = 1)
--NUMBER OF DATA ROWS: 1
--NUMBER OF PUBLICATION GROUPINGS: 1
*********************************
CRediT <http://credit.niso.org>
*********************************
Susmita Das: Conceptualization, Data curation, Investigation, Methodology
Jacqueline Léveillé: Data curation, Investigation
Randi Proescholdt: Conceptualization, Data curation, Investigation, Methodology
Jodi Schneider: Conceptualization, Data curation, Funding acquisition, Investigation, Methodology, Project administration, Supervision
keywords:
retraction; citation of retracted publications; post-retraction citation; data extraction for scoping reviews; data extraction for literature reviews;
published:
2023-07-11
Parulian, Nikolaus
(2023)
The dissertation_demo.zip contains the base code and demonstration materials for the dissertation: A Conceptual Model for Transparent, Reusable, and Collaborative Data Cleaning.
Each chapter has a demo folder for demonstrating provenance queries or tools.
The Airbnb dataset for demonstration and simulation is not included in this demo but is available to access directly from the reference website.
Any updates on demonstration and examples can be found online at: https://github.com/nikolausn/dissertation_demo