Illinois Data Bank Dataset Search Results
Results
published:
2024-11-07
Zheng, Heng; Fu, Yuanxi; Vandel, Ellie; Schneider, Jodi
(2024)
This dataset consists of the 286 publications retrieved from Web of Science and Scopus on July 6, 2023 as citations for Willoughby et al., 2014:
Patrick H. Willoughby, Matthew J. Jansma, and Thomas R. Hoye (2014). A guide to small-molecule structure assignment through computation of (¹H and ¹³C) NMR chemical shifts. Nature Protocols, 9(3), Article 3. https://doi.org/10.1038/nprot.2014.042
We added the DOIs of the citing publications into a Zotero collection. Then we exported all 286 DOIs in two formats: a .csv file (data export) and an .rtf file (bibliography).
<b>Willoughby2014_286citing_publications.csv</b> is a Zotero data export of the citing publications.
<b>Willoughby2014_286citing_publications.rtf</b> is a bibliography of the citing publications, using a variation of the American Psychological Association style (7th edition) with full names instead of initials.
To create <b>Willoughby2014_citation_contexts.csv</b>, HZ manually extracted the paragraphs that contain a citation marker of Willoughby et al., 2014. We refer to these paragraphs as the citation contexts of Willoughby et al., 2014. Manual extraction started with 286 citing publications but excluded 2 publications that are not in English (DOIs 10.13220/j.cnki.jipr.2015.06.004 and 10.19540/j.cnki.cjcmm.20200604.201).
The silver standard aimed to triage the citing publications of Willoughby et al., 2014 that are at risk of propagating unreliability due to a code glitch in a computational chemistry protocol introduced in Willoughby et al., 2014. The silver standard was created stepwise:
First, one chemistry expert (YF) manually annotated the corpus of 284 citing publications in English, using their full text and citation contexts. She manually categorized publications as either at risk of propagating unreliability or not at risk of propagating unreliability, with a rationale justifying each category.
Then we selected a representative sample of citation contexts to be double annotated. To do this, MJS turned the full dataset of citation contexts (Willoughby2014_citation_contexts.csv) into word embeddings, clustered them by similarity using BERTopic's HDBSCAN, and selected representative citation contexts based on the centroids of the clusters.
Next, the second chemistry expert (EV) annotated the 77 publications associated with the selected citation contexts, considering the full text as well as the citation contexts.
<b>double_annotated_subset_77_before_reconciliation.csv</b> provides EV and YF's annotation before reconciliation.
To create the silver standard, YF, EV, and JS discussed differences and reconciled most of them. YF and EV had principled reasons for disagreeing on 9 publications; to handle these, YF updated the annotations to create the silver standard we use for evaluation in the remainder of our JCDL 2024 paper (<b>silver_standard.csv</b>).
<b>Inter_Annotator_Agreement.xlsx</b> indicates publications where the two annotators made opposite decisions and calculates the inter-annotator agreement before and after reconciliation together.
<b>double_annotated_subset_77_after_reconciliation.csv</b> provides EV and YF's annotation after reconciliation, including applying the reconciliation policy.
keywords:
unreliable cited sources; knowledge maintenance; citations; scientific digital libraries; scholarly publications; reproducibility; unreliability propagation; citation contexts
published:
2023-05-02
Lee, Jou; Schneider, Jodi
(2023)
Tab-separated value (TSV) file.
14745 data rows. Each data row represents publication metadata as retrieved from Crossref (http://crossref.org) 2023-04-05 when searching for retracted publications.
Each row has the following columns:
Index - Our index, starting with 0.
DOI - Digital Object Identifier (DOI) for the publication
Year - Publication year associated with the DOI.
URL - Web location associated with the DOI.
Title - Title associated with the DOI. May be blank.
Author - Author(s) associated with the DOI.
Journal - Publication venue (journal, conference, ...) associated with the DOI
RetractionYear - Retraction Year associated with the DOI. May be blank.
Category - One or more categories associated with the DOI. May be blank.
Our search was via the Crossref REST API and searched for:
Update_type=(
'retraction',
'Retraction',
'retracion',
'retration',
'partial_retraction',
'withdrawal','removal')
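As a rough illustration of such a search, the sketch below builds one Crossref REST API request URL per update type. It assumes the `update-type` filter of the `/works` endpoint and cursor-based paging; the helper name `build_query_url` and the page size are hypothetical, and this is not the script the authors ran.

```python
from urllib.parse import urlencode

# Base endpoint of the Crossref REST API for works.
CROSSREF_WORKS = "https://api.crossref.org/works"

# Update types listed in the search above, including the misspelled
# variants that were searched for deliberately.
UPDATE_TYPES = [
    "retraction",
    "Retraction",
    "retracion",
    "retration",
    "partial_retraction",
    "withdrawal",
    "removal",
]


def build_query_url(update_type: str, rows: int = 100) -> str:
    """Return a request URL for one update type (hypothetical helper).

    Assumes the `update-type` filter; `rows` and `cursor` control paging.
    """
    params = {
        "filter": f"update-type:{update_type}",
        "rows": rows,
        "cursor": "*",  # start of a deep-paging cursor
    }
    return f"{CROSSREF_WORKS}?{urlencode(params)}"


urls = [build_query_url(t) for t in UPDATE_TYPES]
for url in urls:
    print(url)
```

Issuing one request per update type keeps each result set attributable to the exact spelling that matched it.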
keywords:
retraction; metadata; Crossref; RISRS
published:
2024-09-16
Wu, Steven; Smith, Hannah
(2024)
This dataset describes an analysis of research documents about the debate between hydrogen fuel cells and lithium-ion batteries within the context of electric vehicles.
To create this dataset, we first analyzed news articles on the topic of sustainable development. We searched for related science using keywords in Google Scholar. We then identified subtopics and selected one specific subtopic: electric vehicles. We started to identify positions and players about electric vehicles [1].
Within electric vehicles, we started searching in OpenAlex for a topic of reasonable size (about 300 documents) related to a scientific or technical debate. We narrowed to electric vehicles and batteries, then trained a cluster model [2] on OpenAlex’s keywords to develop some possible search queries, and chose one.
Our final search query (May 7, 2024) returned 301 documents in OpenAlex:
Title & abstract includes: Electric Vehicle + Hydrogen + Battery
filter is Lithium-ion Battery Management in Electric Vehicle
We used a Python script and the Scopus API to find missing abstracts and DOIs [3].
To identify relevant documents, we used a combination of Abstrackr [4] and manual screening. As a starting point for Abstrackr [4], one person manually screened 200 documents by checking the abstracts for “hydrogen fuel cells” and “battery comparisons”. Then we used Abstrackr [4] to predict the relevance of the remaining documents based on the title, abstract, and keywords. The settings we used were single screening, ordered by most likely to be relevant, and 0 pilot size. We set a threshold of 0.6 for the predictions. After screening and predictions, 176 documents remained.
keywords:
controversy mapping; sustainable development; evidence synthesis; OpenAlex; Abstrackr; Scopus; meta-analysis; electric vehicle; hydrogen fuel cells; battery
published:
2023-06-06
Korobskiy, Dmitriy; Chacko, George
(2023)
This dataset is derived from COCI, the OpenCitations Index of Crossref open DOI-to-DOI references (opencitations.net). Silvio Peroni, David Shotton (2020). OpenCitations, an infrastructure organization for open scholarship. Quantitative Science Studies, 1(1): 428-444. https://doi.org/10.1162/qss_a_00023 We have curated it to remove duplicates, self-loops, and parallel edges. These data were copied from the OpenCitations website on May 6, 2023 and subsequently processed to produce a node list and an edge list. Integer IDs have been assigned to the DOIs to reduce memory and storage needs when working with these data. As noted on the OpenCitations website, each record is a citing-cited pair that uses DOIs as persistent identifiers.
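The curation steps described here can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' pipeline: DOIs are compared case-insensitively, "parallel edges" is read as repeated citing-cited pairs in the same direction, and the function name `curate_edges` and the sample DOIs are hypothetical.

```python
def curate_edges(pairs):
    """Turn (citing DOI, cited DOI) pairs into a node list and an edge list.

    Self-loops and repeated pairs are dropped; each DOI that survives
    receives a compact integer id in order of first appearance.
    """
    ids = {}      # DOI -> integer id (the node list)
    seen = set()  # integer edges already emitted
    edges = []    # curated edge list of (citing_id, cited_id)
    for citing, cited in pairs:
        citing, cited = citing.strip().lower(), cited.strip().lower()
        if citing == cited:
            continue  # self-loop
        for doi in (citing, cited):
            ids.setdefault(doi, len(ids))
        edge = (ids[citing], ids[cited])
        if edge in seen:
            continue  # duplicate or parallel edge
        seen.add(edge)
        edges.append(edge)
    return ids, edges


# Made-up DOIs, for illustration only.
sample = [
    ("10.1/a", "10.1/b"),
    ("10.1/A", "10.1/b"),  # duplicate after case folding
    ("10.1/c", "10.1/c"),  # self-loop
    ("10.1/b", "10.1/a"),  # reverse direction is kept
]
nodes, edge_list = curate_edges(sample)
print(nodes, edge_list)
```

Storing edges as integer pairs rather than DOI strings is what reduces memory and storage needs on large networks.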
keywords:
open citations; bibliometrics; citation network; scientometrics
published:
2025-02-08
Anne, Lahari; Park, Minhyuk; Warnow, Tandy; Chacko, George
(2025)
The synthetic networks in this dataset were generated using the RECCS protocol developed by Anne et al. (2024). Briefly, the RECCS process is as follows. An input network and clustering (by any algorithm) is used to pass input parameters to a stochastic block model (SBM) generator. The output is then modified to improve fit to the input real world clusters after which outlier nodes are added using one of three different options. See Anne et al. (2024): in press Complex Networks and Applications XIII (preprint : arXiv:2408.13647).
The networks in this dataset were generated using either version 1 or version 2 of the RECCS protocol followed by outlier strategy S1. The input networks to the process were (i) the Curated Exosome Network (CEN), Wedell et al. (2021), (ii) cit_hepph (https://snap.stanford.edu/), (iii) cit_patents (https://snap.stanford.edu/), and (iv) wiki_topcats (https://snap.stanford.edu/).
Input Networks:
The CEN can be downloaded from the Illinois Data Bank:
https://databank.illinois.edu/datasets/IDB-0908742 -> cen_pipeline.tar.gz -> S1_cen_cleaned.tsv
The synthetic file naming system should be interpreted as follows: a_b_c.tsv.gz where
a - name of inspirational network, e.g., cit_hepph
b - the resolution value used when clustering a with the Leiden algorithm optimizing the Constant Potts Model, e.g., 0.01
c - the RECCS option used to approximate edge count and connectivity in the real world network, e.g., v1
Thus, cit_hepph_0.01_v1.tsv indicates that this network was modeled on the cit_hepph network and RECCSv1 was used to match edge count and connectivity to a Leiden-CPM 0.01 clustering of cit_hepph. For SBM generation, we used the graph_tool software (Peixoto, Tiago P. 2014. The graph-tool python library. figshare. Dataset. https://doi.org/10.6084/m9.figshare.1164194.v14).
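A minimal sketch of how the a_b_c naming scheme could be decoded is given below. Because network names such as cit_hepph themselves contain underscores, the name is split from the right. The function name `parse_synthetic_name` is hypothetical and not part of the dataset.

```python
def parse_synthetic_name(filename: str) -> dict:
    """Split a synthetic-network file name into its three components."""
    stem = filename
    # Accept both the compressed and uncompressed extensions.
    for suffix in (".tsv.gz", ".tsv"):
        if stem.endswith(suffix):
            stem = stem[: -len(suffix)]
            break
    # Split from the right so underscores inside the network name survive.
    network, resolution, version = stem.rsplit("_", 2)
    return {
        "network": network,
        "resolution": float(resolution),
        "reccs_version": version,
    }


info = parse_synthetic_name("cit_hepph_0.01_v1.tsv.gz")
print(info)
```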
Additionally, this dataset contains synthetic networks generated for a replication experiment (repl_exp.tar.gz). The experiment aims to evaluate the consistency of RECCS-generated networks by producing multiple replicates under controlled conditions. These networks were generated using different configurations of RECCS, varying across two versions (v1 and v2), and applying the Connectivity Modifier (CM++, Ramavarapu et al. (2024)) pre-processing. Please note that the CM pipeline used for this experiment filters small clusters both before and after the CM treatment.
Input Network : CEN
Within repl_exp.tar.gz, the synthetic file naming system should be interpreted as follows:
cen_<resolution>_<cm_status>_<reccs_version>_sample_<replicate_id>.tsv
where:
cen – Indicates the network was modeled on the Curated Exosome Network (CEN).
resolution – The resolution parameter used in clustering the input network with Leiden-CPM (0.01).
cm_status – Either cm (CM-treated input clustering) or no_cm (input clustering without CM treatment).
reccs_version – The RECCS version used to generate the synthetic network (v1 or v2).
replicate_id – The specific replicate (ranging from 0 to 2 for each configuration).
For example:
cen_0.01_cm_v1_sample_0.tsv – A synthetic network based on CEN with Leiden-CPM clustering at resolution 0.01, CM-treated input, and generated using RECCSv1 (first replicate).
cen_0.01_no_cm_v2_sample_1.tsv – A synthetic network based on CEN with Leiden-CPM clustering at resolution 0.01, without CM treatment, and generated using RECCSv2 (second replicate).
The ground truth clustering input to RECCS is contained in repl_exp_groundtruths.tar.gz.
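For the replication experiment, the two example file names above suggest that the components are joined by underscores. The sketch below decodes such names with a regular expression; the pattern and the function name `parse_replicate_name` are assumptions inferred from those examples, not code shipped with the dataset.

```python
import re

# Pattern inferred from the example file names; "no_cm" itself contains
# an underscore, so the cm_status alternatives are spelled out.
REPLICATE_NAME = re.compile(
    r"^cen_(?P<resolution>\d+(?:\.\d+)?)"
    r"_(?P<cm_status>cm|no_cm)"
    r"_(?P<reccs_version>v[12])"
    r"_sample_(?P<replicate_id>\d+)\.tsv$"
)


def parse_replicate_name(filename: str) -> dict:
    """Return the configuration encoded in a replicate file name."""
    match = REPLICATE_NAME.match(filename)
    if match is None:
        raise ValueError(f"unexpected file name: {filename}")
    info = match.groupdict()
    info["replicate_id"] = int(info["replicate_id"])
    return info


first = parse_replicate_name("cen_0.01_cm_v1_sample_0.tsv")
second = parse_replicate_name("cen_0.01_no_cm_v2_sample_1.tsv")
print(first, second)
```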
keywords:
Community Detection; Synthetic Networks; Stochastic Block Model (SBM)
published:
2023-07-28
Njuguna, Joyce; Clark, Lindsay; Lipka, Alexander; Anzoua, Kossonou; Bagmet, Larisa; Chebukin, Pavel; Dwiyanti, Maria; Dzyubenko, Elena; Dzyubenko, Nicolay; Ghimire, Bimal; Jin, Xiaoli; Johnson, Douglas; Nagano, Hironori; Peng, Junhua; Petersen, Karen; Sabitov, Andrey; Seong, Eun; Yamada, Toshihiko; Yoo, Ji; Yu, Chang; Zhao, Hu; Long, Stephen; Sacks, Erik
(2023)
This dataset supports a study of genome-wide association (GWA) and genomic prediction of biomass yield and 14 yield-component traits in Miscanthus sacchariflorus. We evaluated a diversity panel of 590 accessions of M. sacchariflorus grown across four years in one subtropical and three temperate locations and genotyped with 268,109 single nucleotide polymorphisms (SNPs).
keywords:
Miscanthus sacchariflorus; genome-wide association analysis; genomic prediction; bioenergy; biomass
published:
2025-03-05
Li, Fu; Villa, Umberto; Park, Seonyeong; Jeong, Gangwon; Anastasio, Mark A.
(2025)
References
- Li, Fu, Umberto Villa, Seonyeong Park, and Mark A. Anastasio. "3-D stochastic numerical breast phantoms for enabling virtual imaging trials of ultrasound computed tomography." IEEE Transactions on Ultrasonics, Ferroelectrics, and Frequency Control 69, no. 1 (2021): 135-146. DOI: 10.1109/TUFFC.2021.3112544
- Li, Fu; Villa, Umberto; Park, Seonyeong; Anastasio, Mark, 2021, "2D Acoustic Numerical Breast Phantoms and USCT Measurement Data", https://doi.org/10.7910/DVN/CUFVKE, Harvard Dataverse, V1
Overview
- This dataset includes 1,089 two-dimensional slices extracted from 3D numerical breast phantoms (NBPs) for ultrasound computed tomography (USCT) studies. The anatomical structures of these NBPs were obtained using tools from the Virtual Imaging Clinical Trial for Regulatory Evaluation (VICTRE) project. The methods used to modify and extend the VICTRE NBPs for use in USCT studies are described in the publication cited above.
- The NBPs in this dataset represent the following four ACR BI-RADS breast composition categories:
> Type A - The breast is almost entirely fatty
> Type B - There are scattered areas of fibroglandular density in the breast
> Type C - The breast is heterogeneously dense
> Type D - The breast is extremely dense
- Each 2D slice is taken from a different 3D NBP, ensuring that no more than one slice comes from any single phantom.
File Name Format
- Each data file is stored as an HDF5 .mat file. The filenames follow this format: {type}{subject_id}.mat where {type} indicates the breast type (A, B, C, or D), and {subject_id} is a unique identifier assigned to each sample. For example, in the filename D510022534.mat, "D" represents the breast type, and "510022534" is the sample ID.
File Contents
- Each file contains the following variables:
> "type": Breast type
> "sos": Speed-of-sound map [mm/μs]
> "den": Ambient density map [kg/mm³]
> "att": Acoustic attenuation (power-law prefactor) map [dB/ MHzʸ mm]
> "y": power-law exponent
> "label": Tissue label map. Tissue types are denoted using the following labels: water (0), fat (1), skin (2), glandular tissue (29), ligament (88), lesion (200).
- All spatial maps ("sos", "den", "att", and "label") have the same spatial dimensions of 2560 x 2560 pixels, with a pixel size of 0.1 mm x 0.1 mm.
- "sos", "den", and "att" are float32 arrays, and "label" is an 8-bit unsigned integer array.
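The file name format and the tissue label codes described above can be captured in a small lookup helper. The sketch below is illustrative only: the names `parse_phantom_name`, `BREAST_TYPES`, and `TISSUE_LABELS` are hypothetical, and it does not open the HDF5 files themselves.

```python
import re

# Breast composition categories, as listed in the overview.
BREAST_TYPES = {
    "A": "almost entirely fatty",
    "B": "scattered areas of fibroglandular density",
    "C": "heterogeneously dense",
    "D": "extremely dense",
}

# Integer codes used in the "label" map.
TISSUE_LABELS = {
    0: "water",
    1: "fat",
    2: "skin",
    29: "glandular tissue",
    88: "ligament",
    200: "lesion",
}

PHANTOM_NAME = re.compile(r"^(?P<type>[ABCD])(?P<subject_id>\d+)\.mat$")


def parse_phantom_name(filename: str) -> tuple:
    """Return (breast type, sample ID) from a {type}{subject_id}.mat name."""
    match = PHANTOM_NAME.match(filename)
    if match is None:
        raise ValueError(f"unexpected file name: {filename}")
    return match["type"], match["subject_id"]


breast_type, subject_id = parse_phantom_name("D510022534.mat")
print(breast_type, BREAST_TYPES[breast_type], subject_id)
```

The sample ID is kept as a string so that any leading zeros are preserved.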
keywords:
Medical imaging; Ultrasound computed tomography; Numerical phantom
published:
2025-05-21
Mostame, Parham; Wirsich, Jonathan; Alderson, Thomas H.; Ridley, Ben; Giraud, Anne-Lise; Carmichael, David W.; Vulliemoz, Serge; Guye, Maxime; Lemieux, Louis; Sadaghiani, Sepideh
(2025)
___________________________________SUMMARY
This dataset contains derivative data from concurrent fMRI and scalp EEG recordings used in:
Mostame Parham, Wirsich Jonathan, Alderson Thomas H, Ridley Ben, Giraud Anne-Lise, Carmichael David W, Vulliemoz Serge, Guye Maxime, Lemieux Louis, Sadaghiani Sepideh (2024) A multiplex of connectome trajectories enables several connectivity patterns in parallel eLife 13:RP98777. doi: https://doi.org/10.7554/eLife.98777.3
___________________________________RAW DATA
The data were originally published and described as part of other studies (Morillon et al., 2010; Sadaghiani et al., 2012). Briefly, 10 minutes of eyes-closed resting state were analyzed from 26 healthy subjects (average age = 24.39 years; range: 18-31 years; 8 females) with no history of psychiatric or neurological disorders. Informed consent was given by each participant and the study was approved by the local Research Ethics Committee (CPP Ile de France III). fMRI was acquired using a 3T Siemens Tim Trio scanner with a GE-EPI pulse sequence (TR = 2 s; TE = 50 ms; 40 slices; 300 volumes; field of view: 192×192; voxel size: 3×3×3 mm³). Structural T1-weighted scans were acquired using the MPRAGE pulse sequence (176 slices; field of view: 256×256; voxel size: 1×1×1 mm³). 62-channel scalp EEG (Easycap, with an additional EOG and an ECG channel) was recorded using an MR-compatible amplifier (BrainAmp MR, Brain Products) at 5 kHz sampling rate.
___________________________________PREPROCESSING
fMRI and EEG data were preprocessed with standard preprocessing steps as explained in detail elsewhere (Wirsich et al., 2020). In brief, fMRI underwent standard slice-time correction and spatial realignment (SPM12, http://www.fil.ion.ucl.ac.uk/spm/software/spm12). Structural T1-weighted images were processed using Freesurfer (recon-all, v6.0.0, https://surfer.nmr.mgh.harvard.edu/) in order to perform non-uniformity and intensity correction, skull stripping and gray/white matter segmentation. The cortex was parcellated into 68 regions of the Desikan-Killiany atlas (Desikan et al., 2006). This atlas was chosen because, as an anatomical parcellation, it avoids biases towards one or the other functional data modality. The T1 images of each subject and the Desikan-Killiany atlas were co-registered to the fMRI images (FSL-FLIRT 6.0.2, https://fsl.fmrib.ox.ac.uk/fsl/fslwiki). We extracted signals of no interest, such as the average signals of cerebrospinal fluid (CSF) and white matter, from manually defined regions of interest (ROI, 5 mm sphere, Marsbar Toolbox 0.44, http://marsbar.sourceforge.net) and regressed them out of the BOLD timeseries along with 6 rotation and translation motion parameters and the global gray matter signal (Wirsich et al., 2017a). Then we bandpass-filtered the timeseries at 0.009–0.08 Hz. The average timeseries of each region was then used to calculate connectivity.
EEG underwent gradient and cardio-ballistic artifact removal using Brain Vision Analyzer software (Allen et al., 1998, 2000) and was down-sampled to 250 Hz. EEG was projected into source space using the Tikhonov-regularized minimum norm in Brainstorm software (Baillet et al., 2001; Tadel et al., 2011). Source activity was then averaged to the 68 regions of the Desikan-Killiany atlas. Band-limited EEG signals in each canonical frequency band and every atlas region were then used to calculate frequency-specific connectome dynamics. Note that the MEG-ROI-nets toolbox in the OHBA Software Library (OSL; https://ohba-analysis.github.io/osl-docs/) was used to minimize source leakage in the band-limited source-localized EEG data (Colclough et al., 2015).
___________________________________FOLDER STRUCTURE
The dataset includes five separate folders as described below:
1) EEGfMRI_dFC folder: connectome dynamics of scalp data
This folder contains 26 MATLAB (.mat) files, one per subject. Inside each `.mat` is a structure with fields `A`, `B`, and `C`, corresponding to fMRI, amplitude-coupling, and phase-coupling connectome dynamics, respectively. The fMRI data are 3-dimensional (ROI × ROI × timepoints). The EEG data are stored in a 1×5 cell array (Delta, Theta, Alpha, Beta, Gamma), each cell containing a 3-D ROI × ROI × timepoints matrix.
2) EEGfMRI_dFC_SourceOrtho folder: connectome dynamics of source-orthogonalized scalp data
Same format as above, except that EEG connectome dynamics are derived from source-orthogonalized signals. The MEG-ROI-nets toolbox in the OHBA Software Library (OSL; https://ohba-analysis.github.io/osl-docs/) was used to minimize source leakage in the band-limited, source-localized EEG data (Colclough et al., 2015).
3-5) Cross-modal Recurrence Plot (CRP) data
Each subject has an Excel file with five sheets (Delta through Gamma), corresponding to the five frequency bands. Each sheet contains a 2-D CRP matrix (rows = fMRI timepoints, columns = band-limited EEG timepoints).
- Scalp EEG–fMRI CRPs (CRP_EEGfMRI and CRP_EEGfMRI_SourceOrtho folder): two versions (with and without source-orthogonalization), each has 52 Excel files, including amplitude- and phase-coupling CRPs.
- Intracranial EEG–fMRI CRPs (CRP_iEEGfMRI folder): one version, 27 Excel files, containing three cases: amplitude coupling, HRF-convolved amplitude coupling, and phase coupling.
keywords:
Connectome; fMRI-EEG; Intracranial; Multiplex
published:
2020-12-29
Viana, Jéssica; Turner, Benjamin; Dalling, James
(2020)
Three datasets: species_abundance_data, species_traits, and environmental_data. The three datasets were collected in the Fortuna Forest Reserve (8°45′ N, 82°15′ W) and Palo Seco Protected Forest (8°45′ N, 82°13′ W) located in western Panama. The two reserves support humid to super-humid rainforests, according to Holdridge (1947). The species_abundance_data and species_traits datasets were collected across 15 subplots of 25 m² in 12 one-hectare permanent plots distributed across the two reserves. The subplots were spaced 20 m apart along three 5 m wide transects, each 30 m apart. Please read Prada et al. (2017) for details on the environmental characteristics of the study area.
Prada CM, Morris A, Andersen KM, et al (2017) Soils and rainfall drive landscape-scale changes in the diversity and functional composition of tree communities in a premontane tropical forest. J Veg Sci 28:859–870. https://doi.org/10.1111/jvs.12540
keywords:
functional traits; plants; ferns; environmental data; Fortuna; species data; community ecology
published:
2021-08-20
von Haden, Adam C.; DeLucia, Evan H.; Yang, Wendy; Burnham, Mark
(2021)
In 2020, early-season extreme precipitation events that caused ponding occurred in central Illinois following the planting of Sorghum bicolor (L.) Moench and Zea mays L. Following the first rainfall event, 50 m transects were established to assess the effects of waterlogging on seedling emergence and crop yields. Soil moisture, emergence, stem and tiller count, LAI, and yield were measured at various points in the season along these transects.
keywords:
Sorghum; Maize; Emergence; Yield; LAI
published:
2024-03-27
Zheng, Heng; Schneider, Jodi
(2024)
To gather news articles from the web that discuss the Cochrane Review, we used Altmetric Explorer from Altmetric.com and retrieved articles on August 1, 2023. We selected all articles that were written in English, published in the United States, and had a publication date <b>prior to March 10, 2023</b> (according to the “Mention Date” on Altmetric.com). This date is significant as it is when Cochrane issued a statement about the "misleading interpretation" of the Cochrane Review.
The collection of news articles is presented in the Altmetric_data.csv file. The dataset contains the following data that we exported from Altmetric Explorer:
- Publication date of the news article
- Title of the news article
- Source/publication venue of the news article
- URL
- Country
We manually checked and added the following information:
- Whether the article still exists
- Whether the article is accessible
- Whether the article is from the original source
We assigned MAXQDA IDs to the news articles. News articles were assigned the same ID when they were (a) identical or (b) in the case of Article 207, closely paraphrased, paragraph by paragraph. Inaccessible items were assigned a MAXQDA ID based on their "Mention Title".
For each article from Altmetric.com, we first tried to use the Web Collector for MAXQDA to download the article from the website and imported it into MAXQDA (version 22.7.0). If an article could not be retrieved using the Web Collector, we either downloaded the .html file or in the case of Article 128, retrieved it from the NewsBank database through the University of Illinois Library.
We then manually extracted direct quotations from the articles using MAXQDA.
We included surrounding words and sentences, and in one case, a news agency’s commentary, around direct quotations for context where needed. The quotations (with context) are the positions in our analysis.
We also identified who was quoted. We excluded quotations when we could not identify who or what was being quoted. We annotated quotations with codes representing groups (government agencies, other organizations, and research publications) and individuals (authors of the Cochrane Review, government agency representatives, journalists, and other experts such as epidemiologists).
The MAXQDA_data.csv file contains excerpts from the news articles that contain the direct quotations we identified. For each excerpt, we included the following information:
- MAXQDA ID of the document from which the excerpt originates;
- The collection date and source of the document;
- The code with which the excerpt is annotated;
- The code category;
- The excerpt itself.
keywords:
altmetrics; MAXQDA; polylogue analysis; masks for COVID-19; scientific controversies; news articles
published:
2020-06-06
Zaya, David N.; Leicht-Young, Stacey A.; Pavlovic, Noel B.; Ashley, Mary V.
(2020)
These data are from an observational study and small experiment investigating reproductive biology and hybridization between two plants, Celastrus scandens L. and Celastrus orbiculatus Thunb. (Celastraceae). These data were collected during the 2008 growing season from the Indiana Dunes National Park (formerly Indiana Dunes National Lakeshore), just east of the municipality of Ogden Dunes, Indiana, USA. The five data files provide information on floral output of the two species, fertilization rate, fruit set rate, hybridization rate at two scales (individual flowers in both species, individual maternal plants in C. scandens), and the results of a hand-pollination experiment that exchanged pollen between the two species.
There are six data files associated with this submission, five data files in comma-separated values format and one text file (‘readme.txt’) that includes detailed explanations of the data files.
keywords:
Celastrus; invasive species; hybridization; heterospecific pollen; hand pollination
published:
2020-02-23
Ye, Di; Hill, Alison; Whitehorn (Fulton), Ashley; Schneider, Jodi
(2020)
Citation context annotation for papers citing retracted paper Matsuyama 2005 (RETRACTED: Matsuyama W, Mitsuyama H, Watanabe M, Oonakahara KI, Higashimoto I, Osame M, Arimura K. Effects of omega-3 polyunsaturated fatty acids on inflammatory markers in COPD. Chest. 2005 Dec 1;128(6):3817-27.), retracted in 2008 (Retraction in: Chest (2008) 134:4 (893) <a href="https://doi.org/10.1016/S0012-3692(08)60339-6">https://doi.org/10.1016/S0012-3692(08)60339-6</a> ). This is part of the supplemental data for Jodi Schneider, Di Ye, Alison Hill, and Ashley Whitehorn. "Continued Citation of a Fraudulent Clinical Trial Report, Eleven Years after it was retracted for Falsifying Data" [R&R under review with Scientometrics].
Overall, we found 148 citations to the retracted paper from 2006 to 2019. However, this dataset does not include the annotations described in 2015 in Ashley Fulton, Alison Coates, Marie Williams, Peter Howe, and Alison Hill. "Persistent citation of the only published randomized controlled trial of omega-3 supplementation in chronic obstructive pulmonary disease six years after its retraction." Publications 3, no. 1 (2015): 17-26.
In this dataset 70 new and newly found citations are listed: 66 annotated citations and 4 pending citations (non-annotated since we don't have full-text).
"New citations" refer to articles published from March 25, 2014 to 2019, found in Google Scholar and Web of Science.
"Newly found citations" refer to articles published 2006-2013, found in Google Scholar and Web of Science, but not previously covered in Ashley Fulton, Alison Coates, Marie Williams, Peter Howe, and Alison Hill. "Persistent citation of the only published randomised controlled trial of omega-3 supplementation in chronic obstructive pulmonary disease six years after its retraction." Publications 3, no. 1 (2015): 17-26.
NOTES:
This is Unicode data. Some publication titles & quotes are in non-Latin characters and they may contain commas, quotation marks, etc.
FILES/FILE FORMATS
Same data in two formats:
2006-2019-new-citation-contexts-to-Matsuyama.csv - Unicode CSV (preservation format only)
2006-2019-new-citation-contexts-to-Matsuyama.xlsx - Excel workbook (preferred format)
ROW EXPLANATIONS
70 rows of data - one citing publication per row
COLUMN HEADER EXPLANATIONS
Note - processing notes
Annotation pending - Y or blank
Year Published - publication year
ID - ID corresponding to the network analysis. See Ye, Di; Schneider, Jodi (2019): Network of First and Second-generation citations to Matsuyama 2005 from Google Scholar and Web of Science. University of Illinois at Urbana-Champaign. <a href="https://doi.org/10.13012/B2IDB-1403534_V2">https://doi.org/10.13012/B2IDB-1403534_V2</a>
Title - item title (some have non-Latin characters, commas, etc.)
Official Translated Title - item title in English, as listed in the publication
Machine Translated Title - item title in English, translated by Google Scholar
Language - publication language
Type - publication type (e.g., bachelor's thesis, blog post, book chapter, clinical guidelines, Cochrane Review, consumer-oriented evidence summary, continuing education journal article, journal article, letter to the editor, magazine article, Master's thesis, patent, Ph.D. thesis, textbook chapter, training module)
Book title for book chapters - Only for a book chapter - the book title
University for theses - for bachelor's thesis, Master's thesis, Ph.D. thesis - the associated university
Pre/Post Retraction - "Pre" for 2006-2008 (means published before the October 2008 retraction notice or in the 2 months afterwards); "Post" for 2009-2019 (considered post-retraction for our analysis)
Identifier where relevant - ISBN, Patent ID, PMID (only for items we considered hard to find/identify, e.g. those without a DOI-based URL)
URL where available - URL, ideally a DOI-based URL
Reference number/style - reference
Only in bibliography - Y or blank
Acknowledged - If annotated, Y, Not relevant as retraction not published yet, or N (blank otherwise)
Positive / "Poor Research" (Negative) - P for positive, N for negative if annotated; blank otherwise
Human translated quotations - Y or blank; blank means Google scholar was used to translate quotations for Translated Quotation X
Specific/in passing (overall) - Specific if any of the 5 quotations are specific [aggregates Specific / In Passing (Quotation X)]
Quotation 1 - First quotation (or blank) (includes non-Latin characters in some cases)
Translated Quotation 1 - English translation of "Quotation 1" (or blank)
Specific / In Passing (Quotation 1) - Specific if "Quotation 1" refers to methods or results of the Matsuyama paper (or blank)
What is referenced from Matsuyama (Quotation 1) - Methods; Results; or Methods and Results - blank if "Quotation 1" not specific, no associated quotation, or not yet annotated
Quotation 2 - Second quotation (includes non-Latin characters in some cases)
Translated Quotation 2 - English translation of "Quotation 2"
Specific / In Passing (Quotation 2) - Specific if "Quotation 2" refers to methods or results of the Matsuyama paper (or blank)
What is referenced from Matsuyama (Quotation 2) - Methods; Results; or Methods and Results - blank if "Quotation 2" not specific, no associated quotation, or not yet annotated
Quotation 3 - Third quotation (includes non-Latin characters in some cases)
Translated Quotation 3 - English translation of "Quotation 3"
Specific / In Passing (Quotation 3) - Specific if "Quotation 3" refers to methods or results of the Matsuyama paper (or blank)
What is referenced from Matsuyama (Quotation 3) - Methods; Results; or Methods and Results - blank if "Quotation 3" not specific, no associated quotation, or not yet annotated
Quotation 4 - Fourth quotation (includes non-Latin characters in some cases)
Translated Quotation 4 - English translation of "Quotation 4"
Specific / In Passing (Quotation 4) - Specific if "Quotation 4" refers to methods or results of the Matsuyama paper (or blank)
What is referenced from Matsuyama (Quotation 4) - Methods; Results; or Methods and Results - blank if "Quotation 4" not specific, no associated quotation, or not yet annotated
Quotation 5 - Fifth quotation (includes non-Latin characters in some cases)
Translated Quotation 5 - English translation of "Quotation 5"
Specific / In Passing (Quotation 5) - Specific if "Quotation 5" refers to methods or results of the Matsuyama paper (or blank)
What is referenced from Matsuyama (Quotation 5) - Methods; Results; or Methods and Results - blank if "Quotation 5" not specific, no associated quotation, or not yet annotated
Further Notes - additional notes
keywords:
citation context annotation, retraction, diffusion of retraction
published:
2021-07-22
Hsiao, Tzu-Kun; Schneider, Jodi
(2021)
This dataset includes five files. Descriptions of the files are given as follows:
<b>FILENAME: PubMed_retracted_publication_full_v3.tsv</b>
- Bibliographic data of retracted papers indexed in PubMed (retrieved on August 20, 2020, searched with the query "retracted publication" [PT] ).
- Except for the information in the "cited_by" column, all the data is from PubMed.
- PMIDs in the "cited_by" column that meet either of the two conditions below have been excluded from analyses:
[1] PMIDs of the citing papers are from retraction notices (i.e., those in the “retraction_notice_PMID.csv” file).
[2] Citing paper and the cited retracted paper have the same PMID.
ROW EXPLANATIONS
- Each row is a retracted paper. There are 7,813 retracted papers.
COLUMN HEADER EXPLANATIONS
1) PMID - PubMed ID
2) Title - Paper title
3) Authors - Author names
4) Citation - Bibliographic information of the paper
5) First Author - First author's name
6) Journal/Book - Publication name
7) Publication Year
8) Create Date - The date the record was added to the PubMed database
9) PMCID - PubMed Central ID (if applicable, otherwise blank)
10) NIHMS ID - NIH Manuscript Submission ID (if applicable, otherwise blank)
11) DOI - Digital object identifier (if applicable, otherwise blank)
12) retracted_in - Information of retraction notice (given by PubMed)
13) retracted_yr - Retraction year identified from "retracted_in" (if applicable, otherwise blank)
14) cited_by - PMIDs of the citing papers. (if applicable, otherwise blank) Data collected from iCite.
15) retraction_notice_pmid - PMID of the retraction notice (if applicable, otherwise blank)
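The two exclusion conditions above can be sketched in Python as follows. This is a minimal illustration and not the authors' code: the function name and the example PMIDs are hypothetical, and it assumes the "cited_by" cell has already been split into a list of PMID strings (the delimiter used in the file is not stated here).

```python
def filter_citing_pmids(retracted_pmid, citing_pmids, retraction_notice_pmids):
    """Return the citing PMIDs that survive both exclusion conditions.

    retracted_pmid          -- PMID of the retracted paper (string)
    citing_pmids            -- PMIDs parsed from the "cited_by" column (assumed pre-split)
    retraction_notice_pmids -- set of PMIDs listed in retraction_notice_PMID.csv
    """
    kept = []
    for pmid in citing_pmids:
        if pmid in retraction_notice_pmids:
            continue  # condition [1]: the citing paper is a retraction notice
        if pmid == retracted_pmid:
            continue  # condition [2]: citing and cited paper share a PMID
        kept.append(pmid)
    return kept


# Hypothetical PMIDs, for illustration only.
example = filter_citing_pmids("100", ["100", "200", "300"], {"200"})
print(example)
```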
<b>FILENAME: PubMed_retracted_publication_CitCntxt_withYR_v3.tsv</b>
- This file contains citation contexts (i.e., citing sentences) where the retracted papers were cited. The citation contexts were identified from the XML version of PubMed Central open access (PMCOA) articles.
- This is part of the data from: Hsiao, T.-K., & Torvik, V. I. (manuscript in preparation). Citation contexts identified from PubMed Central open access articles: A resource for text mining and citation analysis.
- Citation contexts that meet either of the two conditions below have been excluded from analyses:
[1] PMIDs of the citing papers are from retraction notices (i.e., those in the “retraction_notice_PMID.csv” file).
[2] Citing paper and the cited retracted paper have the same PMID.
ROW EXPLANATIONS
- Each row is a citation context associated with one retracted paper that's cited.
- In the manuscript, we count each citation context once, even if it cites multiple retracted papers.
COLUMN HEADER EXPLANATIONS
1) pmcid - PubMed Central ID of the citing paper
2) pmid - PubMed ID of the citing paper
3) year - Publication year of the citing paper
4) location - Location of the citation context (abstract = abstract, body = main text, back = supporting material, tbl_fig_caption = tables and table/figure captions)
5) IMRaD - IMRaD section of the citation context (I = Introduction, M = Methods, R = Results, D = Discussions/Conclusion, NoIMRaD = not identified)
6) sentence_id - The ID of the citation context in a given location. For location information, please see column 4. The first sentence in the location gets the ID 1, and subsequent sentences are numbered consecutively.
7) total_sentences - Total number of sentences in a given location
8) intxt_id - Identifier of a cited paper. Here, a cited paper is the retracted paper.
9) intxt_pmid - PubMed ID of a cited paper. Here, a cited paper is the retracted paper.
10) citation - The citation context
11) progression - Position of a citation context by centile within the citing paper.
12) retracted_yr - Retraction year of the retracted paper
13) post_retraction - 0 = not post-retraction citation; 1 = post-retraction citation. A post-retraction citation is a citation made after the calendar year of retraction.
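As a rough sketch, the "post_retraction" flag and the once-per-context counting rule described above could be expressed as below. The function names are hypothetical, and treating (pmcid, location, sentence_id) as the key that identifies a single citation context is an assumption made for illustration; the file description does not state which columns the authors used.

```python
def post_retraction_flag(citing_year, retracted_yr):
    """Return 1 if the citation was made after the calendar year of retraction, else 0."""
    return 1 if citing_year > retracted_yr else 0


def count_unique_contexts(rows):
    """Count citation contexts once, even if a context cites several retracted papers.

    Assumption: a context is identified by (pmcid, location, sentence_id).
    """
    seen = {(row["pmcid"], row["location"], row["sentence_id"]) for row in rows}
    return len(seen)


# Hypothetical rows: one sentence that cites two different retracted papers.
rows = [
    {"pmcid": "PMC0000001", "location": "body", "sentence_id": 12, "intxt_pmid": "111"},
    {"pmcid": "PMC0000001", "location": "body", "sentence_id": 12, "intxt_pmid": "222"},
]
print(post_retraction_flag(2015, 2014), count_unique_contexts(rows))
```

Note that a citation published in the same calendar year as the retraction gets the flag 0 under this definition.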
<b>FILENAME: 724_knowingly_post_retraction_cit.csv</b> (updated)
- The 724 post-retraction citation contexts that we determined knowingly cited the 7,813 retracted papers in "PubMed_retracted_publication_full_v3.tsv".
- Two citation contexts from retraction notices have been excluded from analyses.
ROW EXPLANATIONS
- Each row is a citation context.
COLUMN HEADER EXPLANATIONS
1) pmcid - PubMed Central ID of the citing paper
2) pmid - PubMed ID of the citing paper
3) pub_type - Publication type collected from the metadata in the PMCOA XML files.
4) pub_type2 - Specific article types. Please see the manuscript for explanations.
5) year - Publication year of the citing paper
6) location - Location of the citation context (abstract = abstract, body = main text, back = supporting material, table_or_figure_caption = tables and table/figure captions)
7) intxt_id - Identifier of a cited paper. Here, a cited paper is the retracted paper.
8) intxt_pmid - PubMed ID of a cited paper. Here, a cited paper is the retracted paper.
9) citation - The citation context
10) retracted_yr - Retraction year of the retracted paper
11) cit_purpose - Purpose of citing the retracted paper. This is from human annotations. Please see the manuscript for further information about annotation.
12) longer_context - An extended version of the citation context (if applicable, otherwise blank). Manually pulled from the full texts during annotation.
<b>FILENAME: Annotation manual.pdf</b>
- The manual for annotating the citation purposes in column 11) of the 724_knowingly_post_retraction_cit.tsv.
<b>FILENAME: retraction_notice_PMID.csv</b> (new file added for this version)
- A list of 8,346 PMIDs of retraction notices indexed in PubMed (retrieved on August 20, 2020, searched with the query "retraction of publication" [PT] ).
keywords:
citation context; in-text citation; citation to retracted papers; retraction
published:
2024-02-16
Zhang, Mingxiao; Sutton, Bradley
(2024)
Sample data from one typical phantom test and one deidentified shunt patient test (shown in Fig. 8 of the MRM paper), with the corresponding analysis code for the Shunt-FENSI technique.
For the MRM paper “Measuring CSF Shunt Flow with MRI Using Flow Enhancement of Signal Intensity (FENSI)”
keywords:
Shunt-FENSI; MRM; Hydrocephalus; VP Shunt; Flow Quantification; Pediatric Neurosurgery; Pulse Sequence; Signal Simulation
published:
2025-04-24
Smith, Rebecca; Chakraborty, Sulagna; Lyons, Lee Ann; Winata, Fikriyah; Mateus-Pinilla, Nohra
(2025)
These are the datasets underlying the figures in the manuscript "Methods of active surveillance for hard ticks and associated tick-borne pathogens of public health importance in the contiguous United States: A Comprehensive Systematic Review".
The review considered only publications reporting on active tick or tick-borne pathogen surveillance in the contiguous United States published between 1944 and 2018. For the purposes of this review, we were only concerned with studies of Ixodidae (hard ticks) and/or studies of tick-borne pathogens (in humans, animals, or hard ticks) of public health importance to humans. Study designs included cross-sectional, serological, epidemiological, ecological, or observational studies. Only peer-reviewed publications published in the English language were included. Studies were excluded if they focused on a tick that is not a vector of a human pathogen or on a pathogen that does not cause disease in humans, if the tick or tick-borne pathogen findings were incidental, or if they did not include quantitative surveillance data. For the purpose of this study, we defined surveillance data as information on ticks or pathogens provided through active sampling in natural areas; it should be noted that this does not match the strict definition used by the CDC, which requires sustained sampling efforts across time. Studies were also excluded if they: explored regions other than the contiguous US; focused on treatment, vaccine, or therapeutics development and/or diagnostics of human disease; focused on tick or pathogen genetics; focused on experimental studies with ticks or hosts; were tick control and/or management studies; performed only passive surveillance; were review articles; were not peer reviewed; were in a language other than English; had no full text available; or addressed a disease that was not a risk to the general public. In addition, for articles which reported data that had previously been published, we only included previously unreported information collected by the authors, and we referenced the specific period of collection for these data to ensure we were not double-recording data.
Due to publication delays, we also performed a non-systematic review of articles published between 2019 and 2023 on tick and tick-borne pathogen surveillance methods conducted in the contiguous United States.
Keyword search was performed in PubMed Central and Web of Science Core Collection databases. The search algorithm keywords included tick(s), Amblyomma, Dermacentor, Ixodes, Rhipicephalus, Acari Ixodidea, tick host(s), Lyme disease, Rocky Mountain Spotted Fever, Spotted Fever Group, Rickettsiosis, Ehrlichiosis, Anaplasmosis, Borreliosis, Tularemia, Babesiosis, tick-borne pathogen, Powassan, Heartland, Bourbon, Colorado tick fever, Pacific Coast tick fever, tick surveillance, surveillance, (sero)epidemiology, prevalence, distribution, ecology, United States. The search algorithm utilized is provided as follows:
TI= ((ticks OR Ixodes OR Amblyomma OR Dermacentor OR Rhipicephalus OR "Acari Ixodidi" OR "tick hosts" OR "tick host") OR ("Lyme Disease" OR "Rocky Mountain Spotted Fever" OR "Spotted Fever Group" OR Rickettsiosis OR Rickettsial OR Ehrlichiosis OR Anaplasmosis OR Borreliosis OR Tularemia OR Babesiosis OR Borrelia OR Ehrlichia OR Anaplasma OR Rickettsia OR Babesia OR "tick-borne pathogen" OR "tick borne pathogen")) AND TS= ("tick surveillance" OR surveillance OR epidemiology OR seroepidemiology OR ecology) AND CU=("United States of America" OR "USA" OR "United States" OR United-States).
These datasets are the collated data underlying the figures in the manuscript. For more details, please see the publication.
The following are explanations for variables used in all the CSV files:
Tick: Species of tick collected
Tick_Method: Method of collecting ticks
Pathogen: Species of pathogen tested for
Path_Method: Method of testing for pathogens
Decade: Decade of publication
n: Number of publications
STATE: state in which study was conducted
COUNTY: county in which study was conducted
1944 - 2018 (Was surveillance performed?): was there at least one publication included with a publication date within the 1944-2018 period in this geographic region?
2019 - 2023 (Was surveillance performed?): was there at least one publication included with a publication date within the 2019-2023 period in this geographic region?
keywords:
ticks; systematic review; surveillance
published:
2021-06-16
Warnow, Tandy; Wedell, Eleanor
(2021)
Thank you for using these datasets.
These RNAsim aligned fragmentary sequences were generated from the query sequences selected by Balaban et al. (2019) in their variable-size datasets (https://doi.org/10.5061/dryad.78nf7dq). They were created for use for phylogenetic placement with the multiple sequence alignments and backbone trees provided by Balaban et al. (2019).
The file structures included here also correspond with the data Balaban et al. (2020) provided.
This includes:
Directories for five varying backbone tree sizes, shown as 5000, 10000, 50000, 100000, and 200000. These directory names are also used by Balaban et al. (2019), and indicate the size of the backbone tree included in their data.
Subdirectories for each replicate from the backbone tree size labelled 0 through 4. For the smaller four backbone tree sizes there are five replicates, and for the largest there is one replicate.
Each replicate contains 200 text files with one aligned query sequence fragment in fasta format.
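The directory layout described above can be enumerated with a short Python sketch. The function name and the root folder are hypothetical, and the sketch assumes that the single replicate for the 200000 backbone is labelled 0; check the unpacked data before relying on that.

```python
from pathlib import PurePosixPath

# Backbone tree sizes, used as top-level directory names.
BACKBONE_SIZES = [5000, 10000, 50000, 100000, 200000]
FILES_PER_REPLICATE = 200  # one aligned query fragment per text file


def expected_replicate_dirs(root):
    """List the replicate directories implied by the description.

    Assumption: the one replicate of the largest backbone is named "0".
    """
    dirs = []
    for size in BACKBONE_SIZES:
        n_replicates = 1 if size == 200000 else 5
        for rep in range(n_replicates):
            dirs.append(PurePosixPath(root) / str(size) / str(rep))
    return dirs


replicate_dirs = expected_replicate_dirs("data")  # "data" is a placeholder root
total_files = len(replicate_dirs) * FILES_PER_REPLICATE
print(len(replicate_dirs), total_files)
```

Under these assumptions there are 4 x 5 + 1 = 21 replicate directories and 21 x 200 = 4,200 fragment files in total.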
keywords:
Fragmentary Sequences; RNAsim
published:
2023-09-13
Shen, Chengze; Liu, Baqiao; Williams, Kelly P.; Warnow, Tandy
(2023)
This upload contains one additional set of datasets (RNASim10k, ten replicates) used in Experiment 2 of the EMMA paper (appeared in WABI 2023): Shen, Chengze, Baqiao Liu, Kelly P. Williams, and Tandy Warnow. "EMMA: A New Method for Computing Multiple Sequence Alignments given a Constraint Subset Alignment".
The zipped file has the following structure:
10k
|__R0
   |__unaln.fas
   |__true.fas
   |__true.tre
|__R1
...
# Alignment files:
1. `unaln.fas`: all unaligned sequences.
2. `true.fas`: the reference alignment of all sequences.
3. `true.tre`: the reference tree on all sequences.
For other datasets that uniquely appeared in EMMA, please refer to the related dataset (which is linked below): Shen, Chengze; Liu, Baqiao; Williams, Kelly P.; Warnow, Tandy (2022): Datasets for EMMA: A New Method for Computing Multiple Sequence Alignments given a Constraint Subset Alignment. University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-2567453_V1
keywords:
SALMA;MAFFT;alignment;eHMM;sequence length heterogeneity
published:
2023-05-30
Clem, C. Scott; Hart, Lily V.; McElrath, Thomas C.
(2023)
Primary occurrence data for Clem, Hart, & McElrath. 2023. A century of Illinois hover flies (Diptera: Syrphidae): Museum and citizen science data reveal recent range expansions, contractions, and species of potential conservation significance. Included are a license.txt file, the cleaned occurrences from each of the six merged datasets, and a cleaned, merged dataset containing all occurrence records in one spreadsheet, formatted according to Darwin Core standards, with a few extra fields such as GBIF identifiers that were included in some of the original downloads.
keywords:
csv; occurrences; syrphidae; hover flies; flies; biodiversity; darwin core; darwin-core; GBIF; citizen science; iNaturalist
published:
2019-07-11
Daniels, Melissa; Larson, Eric
(2019)
We studied the effect of windstorm disturbance on forest invasive plants in southern Illinois. This data includes raw data on plant abundance at survey points, compiled data used in statistical analyses, and spatial data for surveyed plots and units. This file package also includes a readme.doc file that describes the data in detail, including attribute descriptions.
keywords:
tornado, blowdowns, derecho, invasive plants, Shawnee National Forest, southern Illinois
published:
2020-06-30
Chakraborty, Sulagna; Cristina Drumond Andrade, Flavia; Lee Smith, Rebecca
(2020)
This file contains 13 unique case studies that were created for the One health: Infectious diseases course offered at the University of Illinois at Urbana-Champaign campus. The case studies are being made available as educational resources for other One health courses. Each case study is focused on a theme/topic which is associated with One health. These case studies were created using publicly available information and references have been provided for each case study.
keywords:
One health education; infectious diseases; case studies
published:
2023-07-14
Schneider, Jodi; Das, Susmita; Léveillé, Jacqueline; Proescholdt, Randi
(2023)
Data for Post-retraction citation: A review of scholarly research on the spread of retracted science
Schneider, Jodi; Das, Susmita; Léveillé, Jacqueline; Proescholdt, Randi
Contact: Jodi Schneider jodi@illinois.edu & jschneider@pobox.com
**********
OVERVIEW
**********
This dataset provides further analysis for an ongoing literature review about post-retraction citation.
This ongoing work extends a poster presented as:
Jodi Schneider, Jacqueline Léveillé, Randi Proescholdt, Susmita Das, and The RISRS Team. Characterization of Publications on Post-Retraction Citation of Retracted Articles. Presented at the Ninth International Congress on Peer Review and Scientific Publication, September 8-10, 2022 hybrid in Chicago. https://hdl.handle.net/2142/114477 (now also in https://peerreviewcongress.org/abstract/characterization-of-publications-on-post-retraction-citation-of-retracted-articles/ )
Items as of the poster version are listed in the bibliography 92-PRC-items.pdf.
Note that following the poster, we made several changes to the dataset (see changes-since-PRC-poster.txt). For both the poster dataset and the current dataset, 5 items have 2 categories (see 5-items-have-2-categories.txt).
Articles were selected from the Empirical Retraction Lit bibliography (https://infoqualitylab.org/projects/risrs2020/bibliography/ and https://doi.org/10.5281/zenodo.5498474 ). The current dataset includes 92 items; 91 items were selected from the 386 total items in Empirical Retraction Lit bibliography version v.2.15.0 (July 2021); 1 item was added because it is the final form publication of a grouping of 2 items from the bibliography: Yang (2022) Do retraction practices work effectively? Evidence from citations of psychological retracted articles http://doi.org/10.1177/01655515221097623
Items were classified into 7 topics; 2 of the 7 topics have been analyzed to date.
**********************
OVERVIEW OF ANALYSIS
**********************
DATA ANALYZED:
2 of the 7 topics have been analyzed to date:
field-based case studies (n = 20)
author-focused case studies of 1 or several authors with many retracted publications (n = 15)
FUTURE DATA TO BE ANALYZED, NOT YET COVERED:
5 of the 7 topics have not yet been analyzed as of this release:
database-focused analyses (n = 33)
paper-focused case studies of 1 to 125 selected papers (n = 15)
studies of retracted publications cited in review literature (n = 8)
geographic case studies (n = 4)
studies selecting retracted publications by method (n = 2)
**************
FILE LISTING
**************
------------------
BIBLIOGRAPHY
------------------
92-PRC-items.pdf
------------------
TEXT FILES
------------------
README.txt
5-items-have-2-categories.txt
changes-since-PRC-poster.txt
------------------
CODEBOOKS
------------------
Codebook for authors.docx
Codebook for authors.pdf
Codebook for field.docx
Codebook for field.pdf
Codebook for KEY.docx
Codebook for KEY.pdf
------------------
SPREADSHEETS
------------------
field.csv
field.xlsx
multipleauthors.csv
multipleauthors.xlsx
multipleauthors-not-named.csv
multipleauthors-not-named.xlsx
singleauthors.csv
singleauthors.xlsx
***************************
DESCRIPTION OF FILE TYPES
***************************
BIBLIOGRAPHY (92-PRC-items.pdf) presents the items, as of the poster version. This has minor differences from the current data set. Consult changes-since-PRC-poster.txt for details on the differences.
TEXT FILES provide notes for additional context. These files end in .txt.
CODEBOOKS describe the data we collected. The same data is provided in both Word (.docx) and PDF format.
There is one general codebook that is referred to in the other codebooks: Codebook for KEY lists fields assigned (e.g., for a journal or conference). Note that this is distinct from the overall analysis in the Empirical Retraction Lit bibliography of fields analyzed; for that analysis see Proescholdt, Randi (2021): RISRS Retraction Review - Field Variation Data. University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-2070560_V1
Other codebooks document specific information we entered on each column of a spreadsheet.
SPREADSHEETS present the data collected. The same data is provided in both Excel (.xlsx) and CSV format.
Each data row describes a publication or item (e.g., thesis, poster, preprint).
For column header explanations, see the associated codebook.
*****************************
DETAILS ON THE SPREADSHEETS
*****************************
field-based case studies
CODEBOOK: Codebook for field
--REFERS TO: Codebook for KEY
DATA SHEET: field
--REFERS TO: Codebook for KEY
--NUMBER OF DATA ROWS: 20 NOTE: Each data row describes a publication/item.
--NUMBER OF PUBLICATION GROUPINGS: 17
--GROUPED PUBLICATIONS: Rubbo (2019) - 2 items, Yang (2022) - 3 items
author-focused case studies of 1 or several authors with many retracted publications
CODEBOOK: Codebook for authors
--REFERS TO: Codebook for KEY
DATA SHEET 1: singleauthors (n = 9)
--NUMBER OF DATA ROWS: 9
--NUMBER OF PUBLICATION GROUPINGS: 9
DATA SHEET 2: multipleauthors (n = 5)
--NUMBER OF DATA ROWS: 5
--NUMBER OF PUBLICATION GROUPINGS: 5
DATA SHEET 3: multipleauthors-not-named (n = 1)
--NUMBER OF DATA ROWS: 1
--NUMBER OF PUBLICATION GROUPINGS: 1
*********************************
CRediT <http://credit.niso.org>
*********************************
Susmita Das: Conceptualization, Data curation, Investigation, Methodology
Jacqueline Léveillé: Data curation, Investigation
Randi Proescholdt: Conceptualization, Data curation, Investigation, Methodology
Jodi Schneider: Conceptualization, Data curation, Funding acquisition, Investigation, Methodology, Project administration, Supervision
keywords:
retraction; citation of retracted publications; post-retraction citation; data extraction for scoping reviews; data extraction for literature reviews
published:
2022-08-22
Pastrana-Otero, Isamar; Majumdar, Sayani; Kraft, Mary L.
(2022)
This dataset contains Raman spectra, each acquired from an individual, living, primary murine cell belonging to one of the six most immature hematopoietic cell populations found in the body: hematopoietic stem cell (HSC), multipotent progenitor 1 (MPP1), multipotent progenitor 2 (MPP2), multipotent progenitor 3 (MPP3), common lymphoid progenitor (CLP), and common myeloid progenitor (CMP). These spectra are useful for identifying spectral signatures that are characteristic of each hematopoietic stem or early progenitor cell population.
*NOTE: The __MACOSX folder and the files named “._[file name]” in "Raman spectra of single cells text files.zip" were created by the computer operating system and are in an unreadable format. They are not part of the data and can be removed or ignored when using the data.
keywords:
Raman spectroscopy; single-cell spectrum; hematopoietic cell; hematopoietic stem cell; multipotent progenitor cell; common myeloid progenitor; common lymphoid progenitor
published:
2023-07-11
Parulian, Nikolaus
(2023)
The dissertation_demo.zip file contains the base code and demonstrations for the dissertation: A Conceptual Model for Transparent, Reusable, and Collaborative Data Cleaning.
Each chapter has a demo folder for demonstrating provenance queries or tools.
The Airbnb dataset for demonstration and simulation is not included in this demo but is available to access directly from the reference website.
Any updates on demonstration and examples can be found online at: https://github.com/nikolausn/dissertation_demo
published:
2025-06-05
Guan, Yingjun; Fang, Liri
(2025)
There are two files in this dataset.
File 1: AffiNorm
AffiNorm contains 1,001 rows, including one header row, randomly sampled from MapAffil 2018 Dataset (https://doi.org/10.13012/B2IDB-2556310_V1). Each row in the file corresponds to a particular author on a particular PubMed record, and contains the following 26 columns, comma-delimited. All columns are ASCII, except city which contains Latin-1.
COLUMN DESCRIPTION
1. PMID: the PubMed identifier. int.
2. ORDER: the position of the author. int.
3. YEAR: the year of publication. int(4), eg: 1975.
4. affiliation: affiliation string of the author. eg: Department of Pathology, University of Chicago, Illinois 60637.
5. annotation_type: the number of institutions annotated, denoted by S, M, O, or Z, where "S" (Single) indicates 1 institution was annotated; "M" (Multiple) indicates more than one institution was annotated; "O" (Out of Vocabulary or None) indicates no institution was annotated, but an institution was apparently mentioned; "Z" indicates no institution was mentioned.
6. Institution: the standard name(s) of the annotated institution(s), according to ROR. if "S" (single institution), it is saved as a string, eg: University of Chicago; if "M", it is saved as a string that looks like a python list, eg: ['Public Health Laboratory Service'; 'Centre for Applied Microbiology and Research']; if "O" or "Z", then blank.
7. inst_type: the type of institution, according to ROR. the potential values are: education, funder, healthcare, company, archive, nonprofit, government, facility, other. An institution may have more than one type, eg: ['Education', 'Funder']
8. type_edu: TRUE if the inst_type contains "Education"; FALSE otherwise.
9. RORid: ROR identifier(s), eg: https://ror.org/05hs6h993. when multiple, the order corresponds to institution (column 6)
10. RORid_label: the standard name(s) of the annotated institution(s) according to ROR. Same as Institution (column 6).
11. GRIDid: GRID identifier(s). eg: grid.170205.1
12. GRIDid_label: the standard name(s) of the annotated institution(s) according to GRID. eg: University of Chicago.
13. WikiDataid: WikiData identifier(s). eg: Q131252
14. WikiDataid_label: the standard name(s) of the annotated institution(s) according to WikiData. eg: University of Chicago
15. synonyms: a comma-separated list of variant names from InsVar (file 2), stored as a string. eg: University of Chicago, Chicago University, U of C, UChicago, uchicago.edu, U Chicago, ...
16. MapAffil-grid: GRID from the MapAffil 2018 Dataset.
17. MapAffil-grid_label: The standard name of institution from MapAffil 2018 Dataset.
18. judge_mapA: TRUE if GRIDid (column 11) contains MapAffil-grid (column 16); FALSE otherwise.
19. MapAffiltemporal-grid: GRID from the temporal version of MapAffil, http://abel.ischool.illinois.edu/data/MapAffilTempo2018.tsv.gz
20. MapAffiltemporal-grid_label: The standard name of institution from MapAffilTemporal 2018 Dataset.
21. judge_mapT: TRUE if GRIDid (column 11) contains MapAffiltemporal-grid (column 19); FALSE otherwise.
22. RORapi_query_id: ROR from ROR api tool (query endpoint)
23. RORapi_query_id_label: The standard name of institution from ROR api tool (query endpoint). format in string.
24. judge_rorapi_affiliation: TRUE if RORid (column 9) contains RORapi_query_id (column 22); FALSE otherwise.
25. rorapi_affiliation_id: ROR from ROR api tool (affiliation endpoint).
26. judge_rorapi_affiliation: TRUE if RORid (column 9) contains rorapi_affiliation_id (column 25); FALSE otherwise.
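The "judge" columns above all apply the same containment test: TRUE when the annotated identifier cell contains the identifier proposed by another source. A minimal Python sketch for the GRID case (columns 18 and 21) is given below. It is an illustration, not the authors' code: the function names are hypothetical, the regular expression for GRID identifiers is an assumption based on the examples shown (grid.170205.1, grid.1001.0), and returning False for a blank candidate is also an assumption.

```python
import re

# Assumed shape of a GRID identifier, inferred from the examples in the column list.
GRID_PATTERN = re.compile(r"grid\.\d+\.\w+")


def grid_ids(cell):
    """Extract every GRID identifier from a cell holding one ID or a list-like string."""
    return set(GRID_PATTERN.findall(cell or ""))


def judge_contains(gridid_cell, candidate):
    """Return True if the annotated GRIDid cell contains the candidate identifier."""
    candidate = (candidate or "").strip()
    return bool(candidate) and candidate in grid_ids(gridid_cell)


# A single annotated institution, and a multi-institution cell stored as a list-like string.
print(judge_contains("grid.170205.1", "grid.170205.1"))
print(judge_contains("['grid.170205.1', 'grid.1001.0']", "grid.1001.0"))
```

Extracting identifiers with a pattern, rather than parsing the cell as a Python list, keeps the sketch tolerant of either commas or semicolons as separators inside the list-like string.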
File 2: insVar.json
InsVar is a supplementary dataset for AffiNorm, which includes the institution ID and its redirected aliases from wikidata. The institution ID list is from GRID, the redirected aliases are from wiki api, for example: https://en.wikipedia.org/wiki/Special:WhatLinksHere?target=University+of+Illinois+Urbana-Champaign&namespace=&hidetrans=1&hidelinks=1&limit=100
In InsVar, the data is saved in a Python dictionary format. The key is the GRID identifier, for example: "grid.1001.0" (Australian National University), and the value is a list of redirected alias strings.
{"grid.1001.0": ["ANU", "ANU College", "ANU College of Arts and Social Sciences", "ANU College of Asia and the Pacific", "ANU Union", "ANUSA", "Asia Pacific Week", "Australia National University", "Australian Forestry School", "the Australian National University", ...], "grid.1002.3": ...}
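One way to use this structure is to invert it so that an alias can be looked up to find its GRID identifier. The sketch below is a hedged illustration: the function name is hypothetical, the sample is a two-alias excerpt of the entry shown above rather than the full file, and case-insensitive matching is a design choice made here, not something the dataset prescribes.

```python
import json


def build_alias_index(ins_var):
    """Map each case-folded alias to the set of GRID identifiers that list it."""
    index = {}
    for grid_id, aliases in ins_var.items():
        for alias in aliases:
            index.setdefault(alias.casefold(), set()).add(grid_id)
    return index


# Excerpt of the example entry; the real file would be read with json.load(open(...)).
sample = json.loads('{"grid.1001.0": ["ANU", "Australia National University"]}')
alias_index = build_alias_index(sample)
print(alias_index["anu"])
```

The values are sets because the same alias string could, in principle, redirect to more than one institution.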
keywords:
PubMed; MEDLINE; Digital Libraries; Bibliographic Databases; Institution Names; Author Affiliations; Institution Name Ambiguity; Authority files