Illinois Data Bank
Illinois Data Bank Dataset Search Results

published: 2025-04-15
 
Data for the invertebrate analysis in chapter 2 of Jacob Ridgway's thesis: "Neonicotinoids and Fungicides Alter Soil Invertebrate Abundance and Richness Within Restored Prairie"
keywords: Thesis;Soil Invertebrate;Pesticides
published: 2025-04-04
 
This dataset, uCite, is the union of nine large-scale open-access PubMed citation data sources, separated by reliability. There are 20 files, including the reliable and unreliable citation PMID pairs, non-PMID identifiers to PMID mapping (for DOIs, Lens, MAG, and Semantic Scholar), original PMID pairs from the nine resources, some metadata for PMIDs, duplicate PMIDs, some redirected PMID pairs, and PMC OA Patci citation matching results. The short description of each data file is listed as follows. A detailed description can be found in the README.txt. <strong>DATASET DESCRIPTION</strong> <ol> <li>PPUB.tsv.gz - tsv format file containing reliable citation pairs of uCite.</li> <li>PUNR.tsv.gz - tsv format file containing unreliable citation pairs of uCite.</li> <li>DOI2PMID.tsv.gz - tsv format file containing results mapping DOI to PMID. </li> <li> LEN2PMID.tsv.gz - tsv format file containing results mapping LensID pairs to PMID pairs. </li> <li> MAG2PMIDsorted.tsv.gz - tsv format file containing results mapping MAG ID to PMID. </li> <li>SEM2PMID.tsv.gz - tsv format file containing results mapping Semantic Scholar ID to PMID. </li> <li>JVNPYA.tsv.gz - tsv format file containing metadata of papers with PMID, journal name, volume, issue, pages, publication year, and first author's last name. </li> <li>TiLTyAlJVNY.tsv.gz - tsv format file containing metadata of papers. </li> <li> PMC-OA-patci.tsv.gz - tsv format file containing PubMed Central Open Access subset reference strings extracted by \cite{} processed by Patci.</li> <li>REDIRECTS.gz - txt file containing unreliable PMID pairs mapped to reliable PMID pairs. </li> <li>REMAP - file containing pairs of duplicate PubMed records (lhs PMID mapped to rhs PMID).</li> <li> ami_pair.tsv.gz - tsv format file containing all citation pairs from Aminer (2015 version). </li> <li> dim_pair.tsv.gz - tsv format file containing all citation pairs from Dimensions. 
</li> <li> ice_pair.tsv.gz - tsv format file containing all citation pairs from iCite (April 2019 version, version 1). </li> <li> len_pair.tsv.gz - tsv format file containing all citation pairs from Lens.org (harvested through Oct 2021). </li> <li>mag_pair.tsv.gz - tsv format file containing all citation pairs from Microsoft Academic Graph (2015 version). </li> <li> oci_pair.tsv.gz - tsv format file containing all citation pairs from Open Citations (Nov. 2021 dump, csv version ). </li> <li> pat_pair.tsv.gz - tsv format file containing all citation pairs from Patci (i.e., from "PMC-OA-patci.tsv.gz"). </li> <li> pmc_pair.tsv.gz - tsv format file containing all citation pairs from PubMed Central (harvest through Dec 2018 via e-Utilities).</li> <li> sem_pair.tsv.gz - tsv format file containing all citation pairs from Semantic Scholar (2019 version) . </li> </ol> <strong>COLUMN DESCRIPTION</strong> <strong>FILENAME</strong> : <em>PPUB.tsv.gz, PUNR.tsv.gz</em> (1) fromPMID - PubMed ID of the citing paper. (2) toPMID - PubMed ID of the cited paper. (3) sources - citation sources, in which the citation pairs are identified. (4) fromYEAR - Publication year of the citing paper. (5) toYEAR - Publication year of the cited paper. <strong>FILENAME</strong> : <em>DOI2PMID.tsv.gz</em> (1) DOI - Semantic Scholar ID of paper records. (2) PMID - PubMed ID of paper records. (3) PMID2 - Digital Object Identifier of paper records, “-” if the paper doesn't have DOIs. <strong>FILENAME</strong> : <em>SEMID2PMID.tsv.gz</em> (1) SemID - Semantic Scholar ID of paper records. (2) PMID - PubMed ID of paper records. (3) DOI - Digital Object Identifier of paper records, “-” if the paper doesn't have DOIs. <strong>FILENAME</strong> : <em>JVNPYA.tsv.gz</em> - Each row refers to a publication record. (1) PMID - PubMed ID. (2) journal - Journal name. (3) volume - Journal volume. (4) issue - Journal issue. 
(5) pages - The first page and last page (without leading digits) number of the publication separated by '-'. (6) year - Publication year. (7) lastname - Last name of the first author. <strong>FILENAME</strong> : <em>TiLTyAlJVNY.tsv.gz</em> (1) PMID - PubMed ID. (2) title_tokenized - Paper title after tokenization. (3) languages - Language that paper is written in. (4) pub_types - Types of the publication. (5) length(authors) - String length of author names. (6) journal - Journal name. (7) volume - Journal volume. (8) issue - Journal issue. (9) year - Publication year of print (not necessarily epub). <strong>FILENAME</strong> : <em> PMC-OA-patci.tsv.gz</em> (1) pmcid - PubMed Central identifier. (2) pos - (3) fromPMID - PubMed ID of the citing paper. (4) toPMID - PubMed ID of the cited paper. (5) SRC - citation sources, in which the citation pairs are identified. (6) MatchDB - PubMed, ADS, DBLP. (7) Probability - Matching probability predicted by Patci. (8) toPMID2 - PubMed ID of the cited paper, extracted from OA xml file. (9) SRC2 - citation sources, in which the citation pairs are identified. (10) intxt_id - (11) jounal - First character of the journal name. (12) same_ref_string - Y if patci and xml reference string match, otherwise N. (13) DIFF - (14) bestSRC - Citation sources, in which the citation pairs are identified. (15) Match - Matching strings annotated by Patci. <strong>FILENAME</strong> : <em>REDIRECTS.gz</em> Each row in the REDIRECTS file is a string in one of the following formats. - "REDIRECTED FROM: source PMID_i PMID_j -> PMID_i' PMID_j " - "REDIRECTED TO: source PMID_i PMID_j -> PMID_i PMID_j' " Note: source is the name of the source from which PMID_i and PMID_j come. <strong>FILENAME</strong> : <em>REMAP</em> Each row remaps one PMID (PMID_i) to another PMID (PMID_j). The format of each row is "$REMAP{PMID_i} = PMID_j". 
<strong>FILENAME</strong> : <em>ami_pair.tsv.gz, dim_pair.tsv.gz, ice_pair.tsv.gz, len_pair.tsv.gz, mag_pair.tsv.gz, oci_pair.tsv.gz, pat_pair.tsv.gz,pmc_pair.tsv.gz, sem_pair.tsv.gz</em> (1) fromPMID - PubMed ID of the citing paper. (2) toPMID - PubMed ID of the cited paper.
keywords: Citation data; PubMed; Social Science;
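The REMAP row format quoted above can be read with a few lines of Python. This is a minimal sketch, not code shipped with the dataset: it assumes PMIDs are plain integers and tolerates an optional trailing semicolon, and the helper names `parse_remap` and `canonical_pair` are hypothetical.

```python
import re

# One REMAP row, following the format quoted in the description:
#   $REMAP{PMID_i} = PMID_j
# Assumption: PMIDs are plain integers; a trailing ';' is tolerated.
REMAP_ROW = re.compile(r"^\$REMAP\{(\d+)\}\s*=\s*(\d+);?\s*$")


def parse_remap(lines):
    """Return a dict mapping each left-hand PMID to its right-hand PMID."""
    mapping = {}
    for line in lines:
        match = REMAP_ROW.match(line.strip())
        if match:  # rows that do not fit the assumed format are skipped
            mapping[int(match.group(1))] = int(match.group(2))
    return mapping


def canonical_pair(from_pmid, to_pmid, remap):
    """Apply the REMAP table to both ends of a (fromPMID, toPMID) pair."""
    return remap.get(from_pmid, from_pmid), remap.get(to_pmid, to_pmid)
```

With such a table in hand, a citation pair from any of the `*_pair.tsv.gz` files could be normalized before comparing sources; check the README.txt for the authoritative file layout.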
published: 2025-04-05
 
This data set includes information on mixing metric values and distances to determine the average length scale, rates and variability of mixing downstream of 43 river confluences for 150 mixing events. The file "pmx_all data.csv" contains confluence names, the number of events per confluence site, and Pmx values measured at various actual and dimensionless downstream distances. The file "pmx_binned data.csv" provides mean Pmx values within 0.5-unit dimensionless distance bins.
keywords: river; mixing; confluences; remote sensing
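The binning described for "pmx_binned data.csv" can be illustrated with a short sketch. It assumes left-closed bins starting at zero and takes plain (distance, Pmx) tuples as input; the actual column names and bin-edge convention in the files are not stated here, so treat both as assumptions.

```python
from collections import defaultdict
import math


def bin_means(records, bin_width=0.5):
    """Average Pmx within fixed-width bins of dimensionless distance.

    records: iterable of (dimensionless_distance, pmx) tuples.
    Returns {bin_lower_edge: mean_pmx}, ordered by bin edge.
    Assumption: bins are left-closed, i.e. [0, 0.5), [0.5, 1.0), ...
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for distance, pmx in records:
        lower = math.floor(distance / bin_width) * bin_width
        sums[lower] += pmx
        counts[lower] += 1
    return {edge: sums[edge] / counts[edge] for edge in sorted(sums)}
```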
published: 2020-08-22
 
We are releasing the tracing dataset of four microservice benchmarks deployed on our dedicated Kubernetes cluster consisting of 15 heterogeneous nodes. The dataset is not sampled and covers selected types of requests in each benchmark, i.e., compose-posts in the social network application, compose-reviews in the media service application, book-rooms in the hotel reservation application, and reserve-tickets in the train ticket booking application. The four microservice applications come from [DeathStarBench](https://github.com/delimitrou/DeathStarBench) and [Train-Ticket](https://github.com/FudanSELab/train-ticket). The performance anomaly injector is from [FIRM](https://gitlab.engr.illinois.edu/DEPEND/firm.git). The dataset was preprocessed from the raw data generated in FIRM's tracing system. The dataset is partitioned by the microservice component in which the performance anomaly is located (as the file names suggest). Each dataset is in CSV format, with fields separated by commas. Each line consists of the tracing ID and the duration (in 10^(-3) ms) of each component. Execution paths are specified in `execution_paths.txt` in each directory.
keywords: Microservices; Tracing; Performance
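Because durations are stored in units of 10^(-3) ms, a reader will usually want to rescale them on load. The sketch below assumes that the first field of each row is the tracing ID, that every remaining field is a numeric duration, and that there is no header row; none of this is confirmed by the description, and `read_trace_rows` is a hypothetical helper.

```python
import csv
import io


def read_trace_rows(text):
    """Parse comma-separated trace rows into (trace_id, [durations in ms]).

    Assumptions: no header row, first field is the tracing ID, and all
    remaining fields are per-component durations in units of 10^-3 ms.
    """
    rows = []
    for fields in csv.reader(io.StringIO(text)):
        if not fields:
            continue  # skip blank lines
        trace_id, raw = fields[0], fields[1:]
        # Convert from 10^-3 ms to ms.
        rows.append((trace_id, [float(value) / 1000.0 for value in raw]))
    return rows
```

The mapping from column position to component would come from `execution_paths.txt`, whose format is not described here.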
published: 2025-04-01
 
ICoastalDB, which was developed using Microsoft structured query language (SQL) Server, consists of water quality and related data in the Illinois coastal zone that were collected by various organizations. The information in the dataset includes, but is not limited to, sample data type, method of data sampling, location, time and date of sampling and data units.
keywords: Illinois Coastal Zone; Water Quality Data
published: 2025-03-20
 
This dataset contains white-tailed deer (Odocoileus virginianus) land cover utility score (deer LCU score) data for every TRS (township, range, and section), township-range, and county in Illinois, USA, based on annual National Land Cover Database (NLCD) data released for all years between 2000 and 2023. LCU data is provided in CSV files for each spatial scale, with TRS data split into 2 CSV files due to size limits. Rasters (TIF) showing all deer habitat in Illinois are also provided to show the location, quality, and quantity of deer habitat. A metadata file is also included for additional information.
keywords: habitat; white-tailed deer; deer; Odocoileus virginianus; land cover; land classification; landscape; habitat suitability index; ecology; environment
published: 2025-03-18
 
The Cline Center Global News Index is a searchable database of textual features extracted from millions of news stories, specifically designed to provide comprehensive coverage of events around the world. In addition to searching documents for keywords, users can query metadata and features such as named entities extracted using Natural Language Processing (NLP) methods and variables that measure sentiment and emotional valence. Archer is a web application purpose-built by the Cline Center to enable researchers to access data from the Global News Index. Archer provides a user-friendly interface for querying the Global News Index (with the back-end indexing still handled by Solr). By default, queries are built using icons and drop-down menus. More technically-savvy users can use Lucene/Solr query syntax via a ‘raw query’ option. Archer allows users to save and iterate on their queries, and to visualize faceted query results, which can be helpful for users as they refine their queries. Additional Resources: - Access to Archer and the Global News Index is limited to account-holders. If you are interested in signing up for an account, please fill out the <a href="https://docs.google.com/forms/d/e/1FAIpQLSf-J937V6I4sMSxQt7gR3SIbUASR26KXxqSurrkBvlF-CIQnQ/viewform?usp=pp_url"><b>Archer Access Request Form</b></a> so we can determine if you are eligible for access or not. - Current users who would like to provide feedback, such as reporting a bug or requesting a feature, can fill out the <a href="https://forms.gle/6eA2yJUGFMtj5swY7"><b>Archer User Feedback Form</b></a>. - The Cline Center sends out periodic email newsletters to the Archer Users Group. Please fill out this <a href="https://groups.webservices.illinois.edu/subscribe/154221"><b>form</b></a> to subscribe to it. 
<b>Citation Guidelines:</b> 1) To cite the GNI codebook (or any other documentation associated with the Global News Index and Archer) please use the following citation: Cline Center for Advanced Social Research. 2025. Global News Index and Extracted Features Repository [codebook], v1.3.0. Champaign, IL: University of Illinois. June. XX. doi:10.13012/B2IDB-5649852_V6 2) To cite data from the Global News Index (accessed via Archer or otherwise) please use the following citation (filling in the correct date of access): Cline Center for Advanced Social Research. 2025. Global News Index and Extracted Features Repository [database], v1.3.0. Champaign, IL: University of Illinois. Jun. XX. Accessed Month, DD, YYYY. doi:10.13012/B2IDB-5649852_V6 *NOTE: V6 is replacing V5 with updated ‘Archer’ documents to reflect changes made to the Archer system.
published: 2025-03-14
 
Hype - PubMed dataset Prepared by Apratim Mishra This dataset captures ‘Hype’ within biomedical abstracts sourced from PubMed. The selection chosen is ‘journal articles’ written in English, published between 1975 and 2019, totaling ~5.2 million. The classification relies on the presence of specific candidate ‘hype words’ and their abstract location. Therefore, each article (PMID) might have multiple instances in the dataset due to the presence of multiple hype words in different abstract sentences. The candidate hype words are 35 in count: 'major', 'novel', 'central', 'critical', 'essential', 'strongly', 'unique', 'promising', 'markedly', 'excellent', 'crucial', 'robust', 'importantly', 'prominent', 'dramatically', 'favorable', 'vital', 'surprisingly', 'remarkably', 'remarkable', 'definitive', 'pivotal', 'innovative', 'supportive', 'encouraging', 'unprecedented', 'enormous', 'exceptional', 'outstanding', 'noteworthy', 'creative', 'assuring', 'reassuring', 'spectacular', and 'hopeful’. This is version 3 of the dataset. Added new file - WSD_hype.tsv File 1: hype_dataset_final.tsv Primary dataset. It has the following columns: 1. PMID: represents unique article ID in PubMed 2. Year: Year of publication 3. Hype_word: Candidate hype word, such as ‘novel.’ 4. Sentence: Sentence in abstract containing the hype word. 5. Hype_percentile: Abstract relative position of hype word. 6. Hype_value: Propensity of hype based on the hype word, the sentence, and the abstract location. 7. Introduction: The ‘I’ component of the hype word based on IMRaD 8. Methods: The ‘M’ component of the hype word based on IMRaD 9. Results: The ‘R’ component of the hype word based on IMRaD 10. Discussion: The ‘D’ component of the hype word based on IMRaD File 2: hype_removed_phrases_final.tsv Secondary dataset with same columns as File 1. Hype in the primary dataset is based on excluding certain phrases that are rarely hype. The phrases that were removed are included in File 2 and modeled separately. 
Removed phrases: 1. Major: histocompatibility, component, protein, metabolite, complex, surgery 2. Novel: assay, mutation, antagonist, inhibitor, algorithm, technique, series, method, hybrid 3. Central: catheters, system, design, composite, catheter, pressure, thickness, compartment 4. Critical: compartment, micelle, temperature, incident, solution, ischemia, concentration, thinking, nurses, skills, analysis, review, appraisal, evaluation, values 5. Essential: medium, features, properties, opportunities, oil 6. Unique: model, amino 7. Robust: regression 8. Vital: capacity, signs, organs, status, structures, staining, rates, cells, information 9. Outstanding: questions, issues, question, questions, challenge, problems, problem, remains 10. Remarkable: properties 11. Definite: radiotherapy, surgery File 3: WSD_hype.tsv Includes hype-based disambiguation for candidate words targeted for WSD (Word sense disambiguation)
keywords: Hype; PubMed; Abstracts; Biomedicine
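To make the primary/secondary split concrete, the sketch below separates hype-word hits that are followed by one of the removed phrases from those that are not. It is an illustration only, not the author's pipeline: it uses a small subset of the phrase list above and assumes the removed phrase is the single word immediately after the hype word.

```python
import re

# A small subset of the removed phrases listed above, keyed by hype word.
REMOVED = {
    "major": {"histocompatibility", "component", "protein",
              "metabolite", "complex", "surgery"},
    "robust": {"regression"},
    "vital": {"capacity", "signs", "organs", "status", "structures",
              "staining", "rates", "cells", "information"},
}


def split_hype_hits(sentence, hype_words=("major", "robust", "vital", "novel")):
    """Return (kept, removed) hype-word hits for one sentence.

    Assumption: a hit is 'removed' when the very next token is one of the
    listed phrases for that hype word.
    """
    tokens = re.findall(r"[a-z]+", sentence.lower())
    kept, removed = [], []
    for i, token in enumerate(tokens):
        if token not in hype_words:
            continue
        following = tokens[i + 1] if i + 1 < len(tokens) else ""
        if following in REMOVED.get(token, set()):
            removed.append(f"{token} {following}")
        else:
            kept.append(token)
    return kept, removed
```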
published: 2025-03-05
 
References
- Li, Fu, Umberto Villa, Seonyeong Park, and Mark A. Anastasio. "3-D stochastic numerical breast phantoms for enabling virtual imaging trials of ultrasound computed tomography." IEEE Transactions on Ultrasonics, Ferroelectrics, and Frequency Control 69, no. 1 (2021): 135-146. DOI: 10.1109/TUFFC.2021.3112544
- Li, Fu; Villa, Umberto; Park, Seonyeong; Anastasio, Mark, 2021, "2D Acoustic Numerical Breast Phantoms and USCT Measurement Data", https://doi.org/10.7910/DVN/CUFVKE, Harvard Dataverse, V1

Overview
- This dataset includes 1,089 two-dimensional slices extracted from 3D numerical breast phantoms (NBPs) for ultrasound computed tomography (USCT) studies. The anatomical structures of these NBPs were obtained using tools from the Virtual Imaging Clinical Trial for Regulatory Evaluation (VICTRE) project. The methods used to modify and extend the VICTRE NBPs for use in USCT studies are described in the publication cited above.
- The NBPs in this dataset represent the following four ACR BI-RADS breast composition categories:
  > Type A - The breast is almost entirely fatty
  > Type B - There are scattered areas of fibroglandular density in the breast
  > Type C - The breast is heterogeneously dense
  > Type D - The breast is extremely dense
- Each 2D slice is taken from a different 3D NBP, ensuring that no more than one slice comes from any single phantom.

File Name Format
- Each data file is stored as an HDF5 .mat file. The filenames follow this format: {type}{subject_id}.mat, where {type} indicates the breast type (A, B, C, or D), and {subject_id} is a unique identifier assigned to each sample. For example, in the filename D510022534.mat, "D" represents the breast type, and "510022534" is the sample ID.

File Contents
- Each file contains the following variables:
  > "type": Breast type
  > "sos": Speed-of-sound map [mm/μs]
  > "den": Ambient density map [kg/mm³]
  > "att": Acoustic attenuation (power-law prefactor) map [dB/ MHzʸ mm]
  > "y": power-law exponent
  > "label": Tissue label map. Tissue types are denoted using the following labels: water (0), fat (1), skin (2), glandular tissue (29), ligament (88), lesion (200).
- All spatial maps ("sos", "den", "att", and "label") have the same spatial dimensions of 2560 x 2560 pixels, with a pixel size of 0.1 mm x 0.1 mm.
- "sos", "den", and "att" are float32 arrays, and "label" is an 8-bit unsigned integer array.
keywords: Medical imaging; Ultrasound computed tomography; Numerical phantom
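The filename convention and tissue labels described above translate directly into a small lookup helper. This is a sketch using only the standard library; reading the arrays themselves would need an HDF5 reader, which is left out here, and the function name is hypothetical.

```python
import re

# Tissue label codes as listed in the dataset description.
TISSUE_LABELS = {
    0: "water",
    1: "fat",
    2: "skin",
    29: "glandular tissue",
    88: "ligament",
    200: "lesion",
}

# {type}{subject_id}.mat, where type is one of A, B, C, D.
FILENAME = re.compile(r"^([ABCD])(\d+)\.mat$")


def parse_phantom_filename(name):
    """Split a phantom filename into (breast_type, subject_id)."""
    match = FILENAME.match(name)
    if match is None:
        raise ValueError(f"unexpected phantom filename: {name!r}")
    return match.group(1), match.group(2)
```

For the example given in the description, `parse_phantom_filename("D510022534.mat")` yields breast type "D" and sample ID "510022534".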
published: 2025-02-20
 
To gather news articles from the web that discuss the Cochrane Review (DOI: 10.1002/14651858.CD006207.pub6), we used Altmetric.com's Altmetric Explorer and retrieved articles on August 1, 2023. We selected all articles that were written in English, published in the United States, and had a publication date <b>on or after March 10, 2023</b> (according to the "Mention Date" from Altmetric.com). This date is significant as it is when Cochrane issued a statement (https://www.cochrane.org/news/statement-physical-interventions-interrupt-or-reduce-spread-respiratory-viruses-review) about the "misleading interpretation" of the Cochrane Review made by news articles. A previously published dataset for "Arguing about Controversial Science in the News: Does Epistemic Uncertainty Contribute to Information Disorder?" (DOI: 10.13012/B2IDB-4781172_V1) contains annotation of the news articles published before March 10, 2023. Our dataset annotates the news published on or after March 10, 2023. The Altmetric_data.csv file describes the selected news articles with both data exported from Altmetric Explorer and data we manually added. Data exported from Altmetric Explorer: - Publication date of the news article - Title of the news article - Source/publication venue of the news article - URL - Country Data we manually added: - Whether the article is accessible - The date we checked the article - The corresponding ID of the article in MAXQDA For each article from Altmetric.com, we first tried to use the Web Collector for MAXQDA to download the article from the website and imported it into MAXQDA (version 22.8.0). We manually extracted direct quotations from the articles using MAXQDA. We included surrounding words and sentences around direct quotations for context where needed. 
We manually added codes and code categories in MAXQDA to identify the individuals (chief editors of the Cochrane Review, government agency representatives, journalists, and other experts such as physicians) or organizations (government agencies, other organizations, and research publications) who were quoted. The MAXQDA_data.csv file contains excerpts from the news articles that contain the direct quotations we annotated. For each excerpt, we included the following information: - MAXQDA ID of the document from which the excerpt originates - The collection date and source of the document - The code we assigned to the excerpt - The code category - The excerpt itself
keywords: altmetrics; MAXQDA; masks for COVID-19; scientific controversies; news articles
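The March 10, 2023 cutoff that separates this dataset from its companion can be expressed as a simple date comparison. The sketch below works on (title, mention_date) tuples with dates already parsed; the column names and date format in Altmetric_data.csv are not given here, so no file parsing is attempted.

```python
from datetime import date

# Date of the Cochrane statement, used as the cutoff in the description.
STATEMENT_DATE = date(2023, 3, 10)


def split_by_statement_date(articles):
    """Split (title, mention_date) tuples into (before, on_or_after) lists."""
    before = [a for a in articles if a[1] < STATEMENT_DATE]
    on_or_after = [a for a in articles if a[1] >= STATEMENT_DATE]
    return before, on_or_after
```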
published: 2025-02-07
 
This dataset contains raw data on plasma glucose, insulin, c-peptide, GLP-1, and FGF21 collected as part of a study of alcohol pharmacokinetics in women who underwent metabolic surgery.
keywords: Excel; Alcohol and metabolic surgery; glucose; insulin; c-peptide; glp-1; fgf21
published: 2024-03-27
 
To gather news articles from the web that discuss the Cochrane Review, we used Altmetric Explorer from Altmetric.com and retrieved articles on August 1, 2023. We selected all articles that were written in English, published in the United States, and had a publication date <b>prior to March 10, 2023</b> (according to the “Mention Date” on Altmetric.com). This date is significant as it is when Cochrane issued a statement about the "misleading interpretation" of the Cochrane Review. The collection of news articles is presented in the Altmetric_data.csv file. The dataset contains the following data that we exported from Altmetric Explorer: - Publication date of the news article - Title of the news article - Source/publication venue of the news article - URL - Country We manually checked and added the following information: - Whether the article still exists - Whether the article is accessible - Whether the article is from the original source We assigned MAXQDA IDs to the news articles. News articles were assigned the same ID when they were (a) identical or (b) in the case of Article 207, closely paraphrased, paragraph by paragraph. Inaccessible items were assigned a MAXQDA ID based on their "Mention Title". For each article from Altmetric.com, we first tried to use the Web Collector for MAXQDA to download the article from the website and imported it into MAXQDA (version 22.7.0). If an article could not be retrieved using the Web Collector, we either downloaded the .html file or in the case of Article 128, retrieved it from the NewsBank database through the University of Illinois Library. We then manually extracted direct quotations from the articles using MAXQDA. We included surrounding words and sentences, and in one case, a news agency’s commentary, around direct quotations for context where needed. The quotations (with context) are the positions in our analysis. We also identified who was quoted. 
We excluded quotations when we could not identify who or what was being quoted. We annotated quotations with codes representing groups (government agencies, other organizations, and research publications) and individuals (authors of the Cochrane Review, government agency representatives, journalists, and other experts such as epidemiologists). The MAXQDA_data.csv file contains excerpts from the news articles that contain the direct quotations we identified. For each excerpt, we included the following information: - MAXQDA ID of the document from which the excerpt originates; - The collection date and source of the document; - The code with which the excerpt is annotated; - The code category; - The excerpt itself.
keywords: altmetrics; MAXQDA; polylogue analysis; masks for COVID-19; scientific controversies; news articles
published: 2022-05-13
 
The files are plain text and contain the original data used in phylogenetic analyses of Typhlocybinae (Bin, Dietrich, Yu, Meng, Dai and Yang 2022: Ecology & Evolution, in press). The three files with extension .phy are text files with aligned sequences in the standard PHYLIP format and correspond to Matrix 1 (amino acid alignment), Matrix 2 (nucleotide alignment of first two codon positions of protein-coding genes) and Matrix 3 (nucleotide alignment of protein-coding genes plus 2 ribosomal genes) described in the Methods section. An additional text file in NEXUS format (.nex extension) contains the morphological character data used in the ancestral state reconstruction (ASCR) analysis described in the Methods. NEXUS is a standard format used by various phylogenetic analysis software. For more information on data file content, see the included "readme" files.
keywords: Hemiptera; phylogeny; mitochondrial genome; morphology; leafhopper
published: 2022-10-14
 
The Membracoidea_morph_data_Final.nex text file contains the original data used in the phylogenetic analyses of Dietrich et al. (Insect Systematics and Diversity, in review). The text file is marked up according to the standard NEXUS format commonly used by various phylogenetic analysis software packages. The file will be parsed automatically by a variety of programs that recognize NEXUS as a standard bioinformatics file format. The complete taxon names corresponding to the 131 genus names listed under “BEGIN TAXA” are listed in Table 1 in the included PDF file “Taxa_and_characters”; the 229 morphological characters (names abbreviated under “BEGIN CHARACTERS”) are fully explained in the list of character descriptions following Table 1 in the same PDF. The data matrix follows “MATRIX” and gives the numerical values of characters for each taxon. Question marks represent missing data. The lists of characters and taxa and details on the methods used for phylogenetic analysis are included in the submitted manuscript.
keywords: leafhopper; treehopper; evolution; Cretaceous; Eocene
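Since question marks mark missing data in the matrix, a quick per-taxon summary of missingness is a natural first check. The sketch below is a deliberately simple reader, not a full NEXUS parser: it assumes one taxon per line after the MATRIX keyword, no interleaving, single-token taxon names, and no multi-state entries in parentheses or braces.

```python
def missing_fraction(nexus_text):
    """Return {taxon: fraction of '?' states} from a simple MATRIX block.

    Assumptions: one 'taxon states' row per line, non-interleaved, and the
    block ends at a line beginning with ';'.
    """
    fractions = {}
    in_matrix = False
    for line in nexus_text.splitlines():
        stripped = line.strip()
        if stripped.upper() == "MATRIX":
            in_matrix = True
            continue
        if in_matrix:
            if stripped.startswith(";"):
                break  # end of the matrix block
            parts = stripped.split(None, 1)
            if len(parts) == 2:
                taxon, states = parts[0], parts[1].replace(" ", "")
                fractions[taxon] = states.count("?") / len(states)
    return fractions
```

For real analyses, a dedicated NEXUS reader from a phylogenetics package is the safer choice.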
published: 2024-04-05
 
The following files include specimen information, DNA sequence data, and additional information on the analyses used to reconstruct the phylogeny of the leafhopper genus Neoaliturus as described in the Methods section of the original paper: 1. Taxon_sampling.csv: contains data on the individual specimens from which DNA was extracted, including sample code, taxon name, collection data (locality, date and name of collector) and museum unique identifier. 2. Alignments.zip: a ZIP archive containing 432 separate FASTA files representing the aligned nucleotide sequences of individual gene loci used in the analysis. 3. Concatenated_Matrix.fa: a FASTA file containing the concatenated individual gene alignments used for the maximum likelihood analysis in IQ-TREE. 4. Genes_and_Loci.rtf: identifies the individual genes and loci used in the analysis. The partition name is the same as the name of the individual alignment file in the zipped Alignments folder. 5. Partitions_best_scheme.nex: a text file in the standard NEXUS format that indicates the names of the individual data partitions and their locations in the concatenated matrix, and also indicates the substitution model for each partition. 6. (New in this version 2) Scripts & Description.zip includes 8 custom shell or perl scripts used to assemble the DNA sequence data by performing reciprocal blast searches between the reference sequences and assemblies for each sample, extracting the best sequences based on the blast searches, screening the hits for each locus and keeping only the best result, and generating the nucleotide sequence dataset for the predicted orthologues (see the file description.txt for details). 7. (New in this version 2) Full_genetic_distances_matrix.csv shows the genetic distances between pairs of samples in the dataset (proportion of nucleotides that differ between samples).
keywords: leafhopper; phylogeny; anchored-hybrid-enrichment; DNA sequence; insect
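The genetic distance described for Full_genetic_distances_matrix.csv, the proportion of nucleotides that differ between two samples, is commonly called the p-distance. The sketch below shows one way to compute it for two aligned sequences. How the authors treated gaps and ambiguous bases is not stated, so skipping sites with "-", "?" or "N" is an assumption made here.

```python
def p_distance(seq_a, seq_b, skip_chars="-?N"):
    """Proportion of compared sites at which two aligned sequences differ.

    Assumption: sites where either sequence has a gap or ambiguity
    character (any of skip_chars) are excluded from the comparison.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = differing = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a in skip_chars or b in skip_chars:
            continue
        compared += 1
        if a != b:
            differing += 1
    if compared == 0:
        raise ValueError("no comparable sites")
    return differing / compared
```

For instance, "ACGT-A" and "ACGA-T" share five comparable sites and differ at two of them.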
published: 2024-09-17
 
The following seven zip files are compressed folders containing the input datasets/trees, main output files and the scripts of the related analyses performed in this study. I. ancestral_microhabitat_reconstruction.zip: contains four files, including two input files (microhabitats.csv, timetree.tre) and a script (simmap_microhabitat.R) for ancestral state reconstruction of microhabitat by make.simmap implemented in the R package phytools v1.5, as well as the main output file (ancestral_microhabitats.csv). 1. ancestral_microhabitats.csv: reconstructed ancestral microhabitats for each node. 2. microhabitats.csv: microhabitats of the studied species. 3. simmap_microhabitat.R: the R script of make.simmap for ancestral microhabitat reconstruction 4. timetree.tre: dated tree used for ancestral state reconstruction for microhabitat and morphological characters II. ancestral_morphology_reconstruction.zip: contains six files, including an input file (morphology.csv) and a script (simmap_morphology.R) for ancestral state reconstruction of morphology by make.simmap implemented in the R package phytools v1.5, as well as four main output files (forewing_ancestral_state.csv, frontal_sutures_ancestral_state.csv, hind_wing_ancestral_state.csv, ocellus_ancestral_state.csv). 1. forewing_ancestral_state.csv: reconstructed ancestral states of the development of the forewing for each node. 2. frontal_sutures_ancestral_state.csv: reconstructed ancestral states of the development of frontal sutures for each node. 3. hind_wing_ancestral_state.csv: reconstructed ancestral states of the development of the hind wing for each node. 4. morphology.csv: the states of the development of ocellus, forewing, hind wing and frontal sutures for each studied species. 5. ocellus_ancestral_state.csv: reconstructed ancestral states of the development of the ocellus for each node. 6. simmap_morphology.R: the R script of make.simmap for ancestral state reconstruction of morphology III. 
biogeographic_reconstruction.zip: contains four files, including three input files (dispersal_probablity.txt, distributions.csv, timetree_noOutgroup.tre) used for a stratified biogeographic analysis by BioGeoBEARS in RASP v4.2 and the main output file (DIVELIKE_result.txt). 1. dispersal_probablity.txt: relative dispersal probabilities among biogeographical regions at different geological epochs. 2. distributions.csv: current distributions of the studied species. 3. DIVELIKE_result.txt: BioGeoBEARS result of ancestral areas based on the DIVELIKE model. 4. timetree_noOutgroup.tre: the dated tree with the outgroup lineage (Eurymelinae) excluded. IV. coalescent_analysis.zip: contains a folder and two files, including a folder (individual_gene_alignment) of input files used to construct gene trees, an input file (MLtree_BS70.tre) used for the multi-species coalescent analysis by ASTRAL v 4.10.5 and the main output file (coalescent_species_tree.tre). 1. coalescent_species_tree.tre: the species tree generated by the multi-species coalescent analysis with the quartet support, effective number of genes and the local posterior probability indicated. 2. individual_gene_alignment: a folder containing 427 FASTA files, each one represents the nucleotide alignment for a gene. Hyphens are used to represent gaps. These files were used to construct gene trees using IQ-TREE v1.6.12. 3. MLtree_BS70.tre: 165 gene trees with the average SH-aLRT and ultrafast bootstrap values of ≥ 70%. This file was used to estimate the species tree by ASTRAL v 4.10.5. V. divergence_time_estimation.zip: contains five files, including two input files (treefile_rooted_noBranchLength.tre, treefile_rooted.tre) and two control files (baseml.ctl, mcmctree.ctl) used for divergence time estimation by BASEML and MCMCTREE in PAML v4.9, as well as the main output file (timetree_with95%HPD.tre). 1. baseml.ctl: the control file used for the estimation of substitution rates by BASEML in PAML v4.9. 2. 
mcmctree.ctl: the control file used for the estimation of divergence times by MCMCTREE in PAML v4.9. 3. timetree_with95%HPD.tre: dated tree with the 95% highest posterior density confidence intervals indicated. 4. treefile_rooted_noBranchLength.tre: the maximum likelihood tree based on the concatenated nucleotide dataset with calibrations for the crown and internal nodes. Branch length and support values were not indicated. 5. treefile_rooted.tre: the maximum likelihood tree based on the concatenated nucleotide dataset with a secondary calibration on the root age. Branch support values were not indicated. VI. maximum_likelihood_analysis_aa.zip: contains three files, including two input files (concatenated_aa_partition.nex, concatenated_aa.phy) used for the maximum likelihood analysis by IQ-TREE v1.6.12 and the main output file (MLtree_aa.tre). 1. concatenated_aa_partition.nex: the partitioning schemes for the maximum likelihood analysis using concatenated_aa.phy. This file partitions the 52,024 amino acid positions into 427 character sets. 2. concatenated_aa.phy: a concatenated amino acid dataset with 52,024 amino acid positions. Hyphens are used to represent gaps. This dataset was used for the maximum likelihood analysis. 3. MLtree_aa.tre: the maximum likelihood tree based on the concatenated amino acid dataset, with SH-aLRT values and ultrafast bootstrap values indicated. VII. maximum_likelihood_analysis_nt.zip: contains three files, including two input files (concatenated_nt_partition.nex, concatenated_nt.phy) used for the maximum likelihood analysis by IQ-TREE v1.6.12 and the main output file (MLtree_nt.tre). 1. concatenated_nt_partition.nex: the partitioning schemes for the maximum likelihood analysis using concatenated_nt.phy. This file partitions the 156,072 nucleotide positions into 427 character sets. 2. concatenated_nt.phy: a concatenated nucleotide dataset with 156,072 nucleotide positions. Hyphens are used to represent gaps. 
This dataset was used for the maximum likelihood analysis as well as divergence time estimation. 3. MLtree_nt.tre: the maximum likelihood tree based on the concatenated nucleotide dataset, with SH-aLRT values and ultrafast bootstrap values indicated. VIII. Taxon_sampling.csv: contains the sample IDs (1st column) which were used in the alignments and the taxonomic information (2nd to 6th columns).
keywords: Anchored Hybrid Enrichment, Biogeography, Cicadellidae, Phylogenomics, Treehoppers
published: 2025-01-06
 
The complete data for the publication "RNA helicase MOV10 suppresses fear memory and dendritic arborization and regulates microtubule dynamics in hippocampal neurons," excluding sequencing data deposited in GEO, is provided here.
keywords: MOV10; NUMA1; hippocampal neurons; behavior; cytoskeleton; tiff; czi; dv; mp4; mpg; ndpi; csv; xlsx; R
published: 2025-01-17
 
This is the data set for a publication titled, "Coupling carbon dioxide gas within a bubble curtain enhances its effectiveness to deter fish." The current study sought to quantify whether adding carbon dioxide gas (CO2) to a bubble curtain would enhance its efficacy to block fish. For this, a choice tank was outfitted with bubble curtains infused with either compressed air alone, or with two different concentrations of CO2 [30 or 100 mg/L]. Passage rates and position of common carp (an invasive Cyprinid) and black bullhead (a native Ictalurid) exposed to these treatments were compared. The data set consists of data from each of the experiments performed during the study.
keywords: invasive species; multimodal barriers; deterrents; biodiversity; species range; distribution
published: 2025-02-07
 
These data represent the raw data from the paper “Influence of light availability and water depth on competition between Phalaris arundinacea and herbaceous vines” published in Wetlands by Annie H. Huang and Jeffrey W. Matthews. The data are archived in one file: Huang&Matthews_mesocosm_data_archive. This file includes raw data collected during a greenhouse experiment described in the paper.
published: 2025-02-07
 
Incoherent scatter radar datasets collected during the September 2016 campaign at Arecibo have been deposited in this databank. The lag products of the ISR data are stored as lag profile matrices with 5 minutes of integration time. The data is organized in a Python dictionary format, with each file containing 12 lag profile matrices representing one hour of observation. A sample Python script is provided to illustrate its usage.
published: 2025-02-06
 
Data from a study on the behavior of blue-winged and golden-winged warblers. We investigated vocalizations and how the species recognize each other. The dataset includes banding data, behavioral data from a playback study, and song data.
keywords: warblers; songs; species recognition
published: 2021-05-07
 
Prepared by Vetle Torvik, 2021-05-07. The dataset comes as a single tab-delimited Latin-1 encoded file (only the City column uses non-ASCII characters). • How was the dataset created? The dataset is based on a snapshot of PubMed (which includes Medline and PubMed-not-Medline records) taken in December 2018 (NLM's baseline 2018 plus updates throughout 2018). Affiliations are linked to a particular author on a particular article. Prior to 2014, NLM recorded the affiliation of the first author only. However, MapAffil 2018 covers some PubMed records lacking affiliations that were harvested elsewhere, from PMC (e.g., PMID 22427989), NIH grants (e.g., 1838378), and Microsoft Academic Graph and ADS (e.g., 5833220). Affiliations are pre-processed (e.g., transliterated into ASCII from UTF-8 and HTML) so they may differ (sometimes a lot; see PMID 27487542) from PubMed records. All affiliation strings were processed using the MapAffil procedure, to identify and disambiguate the most specific place-name, as described in: Torvik VI. MapAffil: A bibliographic tool for mapping author affiliation strings to cities and their geocodes worldwide. D-Lib Magazine 2015; 21 (11/12). 10p • Look for Fig. 4 in the following article for coverage statistics over time: Palmblad, M., Torvik, V.I. Spatiotemporal analysis of tropical disease research combining Europe PMC and affiliation mapping web services. Trop Med Health 45, 33 (2017). https://doi.org/10.1186/s41182-017-0073-6 Expect to see big upticks in coverage of PMIDs around 1988 and for non-first authors in 2014. • The code and back-end data are periodically updated and made available for query by PMID at http://abel.ischool.illinois.edu/cgi-bin/mapaffil/search.py • What is the format of the dataset? The dataset contains 52,931,957 rows (plus a header row). Each row (line) in the file has a unique combination of PMID and author order, and contains the following eighteen columns, tab-delimited. 
All columns are ASCII, except city, which contains Latin-1. 1. PMID: positive non-zero integer; int(10) unsigned 2. au_order: positive non-zero integer; smallint(4) 3. lastname: varchar(80) 4. firstname: varchar(80); NLM started including these in 2002 but many have been harvested from outside PubMed 5. initial_2: middle name initial 6. orcid: From 2019 ORCID Public Data File https://orcid.org/ and from PubMed XML 7. year: year of the publication 8. journal: name of the journal in which the publication appeared 9. affiliation: author's affiliation 10. disciplines: extracted from departments, divisions, schools, laboratories, centers, etc. that occur on at least 100 unique affiliations across the dataset, some with standardization (e.g., 1770799), English translations (e.g., 2314876), or spelling corrections (e.g., 1291843) 11. grid: inferred using a high-recall technique focused on educational institutions (but, for experimental purposes, includes a few select hospitals, national institutes/centers, international companies, governmental agencies, and 200+ other IDs [RINGGOLD, Wikidata, ISNI, VIAF, http] for institutions not in GRID). Based on 2019 GRID version https://www.grid.ac/ 12. type: EDU, HOS, EDU-HOS, ORG, COM, GOV, MIL, UNK 13. city: varchar(200); typically 'city, state, country' but could include further subdivisions; unresolved ambiguities are concatenated by '|' 14. state: Australia, Canada and USA (which includes territories like PR, GU, AS, and post-codes like AE and AA) 15. country 16. lat: at most 3 decimals (only available when city is not a country or state) 17. lon: at most 3 decimals (only available when city is not a country or state) 18. fips: varchar(5); for USA only, retrieved by lat-lon query to https://geo.fcc.gov/api/census/block/find
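As a rough illustration of the layout described above, the following sketch reads the eighteen tab-delimited columns with Python's standard csv module. The column names come from the list above, but whether the file's header row uses exactly these names is an assumption, so the header is skipped rather than trusted. The two sample rows are synthetic (not taken from the dataset), and treating an empty lat/lon as missing is likewise an assumption.

```python
import csv
import io

# Column order as listed in the dataset description.
COLUMNS = [
    "PMID", "au_order", "lastname", "firstname", "initial_2", "orcid",
    "year", "journal", "affiliation", "disciplines", "grid", "type",
    "city", "state", "country", "lat", "lon", "fips",
]


def read_mapaffil(stream):
    """Yield one dict per row from a tab-delimited MapAffil-style text stream."""
    reader = csv.reader(stream, delimiter="\t", quoting=csv.QUOTE_NONE)
    next(reader, None)  # skip the header row
    for fields in reader:
        if len(fields) != len(COLUMNS):
            continue  # skip rows that do not have exactly eighteen columns
        row = dict(zip(COLUMNS, fields))
        row["PMID"] = int(row["PMID"])
        row["au_order"] = int(row["au_order"])
        # Unresolved place-name ambiguities are concatenated with '|'.
        row["city_candidates"] = row["city"].split("|") if row["city"] else []
        # Assumption: an empty lat/lon cell means no coordinate is available.
        row["lat"] = float(row["lat"]) if row["lat"] else None
        row["lon"] = float(row["lon"]) if row["lon"] else None
        yield row


def make_row(**values):
    """Build one synthetic tab-delimited line; unspecified columns stay empty."""
    return "\t".join(str(values.get(name, "")) for name in COLUMNS)


# Synthetic sample (hypothetical values), encoded as Latin-1 like the real file.
SAMPLE_BYTES = ("\n".join([
    "\t".join(COLUMNS),
    make_row(PMID=1, au_order=1, lastname="Doe",
             city="São Paulo, Brazil", country="Brazil",
             lat=-23.55, lon=-46.633),
    make_row(PMID=1, au_order=2, lastname="Roe",
             city="Springfield, IL, USA|Springfield, MO, USA", country="USA"),
]) + "\n").encode("latin-1")


def load_sample():
    """Decode the synthetic sample as Latin-1 and parse it."""
    stream = io.TextIOWrapper(io.BytesIO(SAMPLE_BYTES),
                              encoding="latin-1", newline="")
    return list(read_mapaffil(stream))


if __name__ == "__main__":
    for row in load_sample():
        print(row["PMID"], row["au_order"], row["city_candidates"],
              row["lat"], row["lon"])
```

For the real file, the same function could be fed a stream opened with `open(path, encoding="latin-1", newline="")`, where `path` is wherever the downloaded file was saved. Because the generator yields rows one at a time, the 52.9 million rows need not be held in memory at once.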
keywords: PubMed, MEDLINE, Digital Libraries, Bibliographic Databases; Author Affiliations; Geographic Indexing; Place Name Ambiguity; Geoparsing; Geocoding; Toponym Extraction; Toponym Resolution; institution name disambiguation
published: 2021-04-22
 
Author-ity 2018 dataset. Prepared by Vetle Torvik, Apr. 22, 2021. The dataset is based on a snapshot of PubMed taken in December 2018 (NLM's baseline 2018 plus updates throughout 2018). It contains a total of 29.1 million Article records and 114.2 million author name instances. Each instance of an author name is uniquely represented by the PMID and the position on the paper (e.g., 10786286_3 is the third author name on PMID 10786286). Thus, each cluster is represented by a collection of author name instances. The instances were first grouped into "blocks" by last name and first name initial (including some close variants), and then each block was separately subjected to clustering. The resulting clusters are provided in two different formats, the first in a file with only IDs and PMIDs, and the second in a file with cluster summaries: #################### File 1: au2id2018.tsv #################### Each line corresponds to an author name instance (PMID and Author name position) with an Author ID. It has the following tab-delimited fields: 1. Author ID 2. PMID 3. Author name position ######################## File 2: authority2018.tsv ######################### Each line corresponds to a predicted author-individual represented by a cluster of author name instances and a summary of all the corresponding papers and author name variants. Each cluster has a unique Author ID (the PMID of the earliest paper in the cluster and the author name position). The summary has the following tab-delimited fields: 1. Author ID (or cluster ID), e.g., 3797874_1 represents a cluster where 3797874_1 is the earliest author name instance. 2. cluster size (number of author name instances on papers) 3. name variants separated by '|' with counts in parentheses. Each variant has the format lastname_firstname middleinitial, suffix 4. last name variants separated by '|' 5. first name variants separated by '|' 6. middle initial variants separated by '|' ('-' if none) 7. 
suffix variants separated by '|' ('-' if none) 8. email addresses separated by '|' ('-' if none) 9. ORCIDs separated by '|' ('-' if none). From 2019 ORCID Public Data File https://orcid.org/ and from PubMed XML 10. range of years (e.g., 1997-2009) 11. Top 20 most frequent affiliation words (after stoplisting and tokenizing; some phrases are also made) with counts in parentheses; separated by '|'; ('-' if none) 12. Top 20 most frequent MeSH (after stoplisting) with counts in parentheses; separated by '|'; ('-' if none) 13. Journal names with counts in parentheses; separated by '|' 14. Top 20 most frequent title words (after stoplisting and tokenizing) with counts in parentheses; separated by '|'; ('-' if none) 15. Co-author names (lowercased lastname and first/middle initials) with counts in parentheses; separated by '|'; ('-' if none) 16. Author name instances (PMID_auno separated by '|') 17. Grant IDs (after normalization; '-' if none given; separated by '|') 18. Total number of times cited. (Citations are based on references harvested from open sources such as PMC). 19. h-index 20. Citation counts (e.g., for h-index): PMIDs by the author that have been cited (with total citation counts in parentheses); separated by '|'
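To make the field conventions above concrete, here is a minimal sketch that splits an author name instance ID, parses a '|'-separated field with counts, and recomputes an h-index from a field-20 style value. The exact text shape of a counted entry is not spelled out in the description; the sketch assumes `value(count)`, and the PMIDs and counts in the example are made up. The function names are hypothetical helpers, not part of the dataset.

```python
import re

# Assumed shape of one counted entry: "<value>(<count>)".
_COUNTED = re.compile(r"^(?P<value>.*)\((?P<count>\d+)\)$")


def split_instance_id(instance_id):
    """Split 'PMID_position' (e.g. '10786286_3') into two integers."""
    pmid, position = instance_id.rsplit("_", 1)
    return int(pmid), int(position)


def parse_counted_field(field):
    """Parse '|'-separated 'value(count)' entries into (value, count) pairs."""
    if field == "-":  # '-' marks an empty field in the description
        return []
    pairs = []
    for entry in field.split("|"):
        match = _COUNTED.match(entry.strip())
        if match is None:
            raise ValueError(f"unexpected entry: {entry!r}")
        pairs.append((match.group("value").strip(), int(match.group("count"))))
    return pairs


def h_index(citation_counts):
    """Largest h such that at least h papers have h or more citations each."""
    ranked = sorted(citation_counts, reverse=True)
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            h = rank
        else:
            break
    return h


if __name__ == "__main__":
    # Synthetic field-20 value: made-up PMIDs with made-up citation counts.
    field_20 = "101(10)|102(8)|103(5)|104(4)|105(3)"
    counts = [count for _, count in parse_counted_field(field_20)]
    print(split_instance_id("10786286_3"))  # (10786286, 3)
    print(h_index(counts))  # 4
```

Field 20 lists only papers that have been cited, so uncited papers are absent; since they contribute zero citations, leaving them out does not change the h-index.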
keywords: author name disambiguation; PubMed
published: 2024-10-10
 
Diversity - PubMed dataset. Contact: Apratim Mishra (Oct, 2024). This dataset presents article-level (pmid) and author-level (auid) diversity data for PubMed articles. The selection comprises 907 024 papers and 1 316 838 authors retrieved from Author-ity 2018 [1], and is an expanded version of V1. The sample consists of articles from the top 40 journals in the dataset, limited to English-language articles of type "journal type" with 2-12 authors, published between 1991 and 2014. Files are gzip-compressed and tab-separated, and V3 includes the correct author count for the included papers (pmids) and updated results with no NaNs. ################################################ File1: auids_plos_3.csv.gz (Important columns defined, 5 in total) • AUID: a unique ID for each author • Genni: gender prediction • Ethnea: ethnicity prediction ################################################# File2: pmids_plos_3.csv.gz (Important columns defined) • pmid: unique paper • auid: all unique auids (author-name unique identification) • year: Year of paper publication • no_authors: Author count • journal: Journal name • years: first year of publication for every author • Country-temporal: Country of affiliation for every author • h_index: Journal h-index • TimeNovelty: Paper Time novelty [2] • nih_funded: Binary variable indicating funding for any author • prior_cit_mean: Mean of all authors’ prior citation rate • Insti_impact: All unique institutions’ citation rate • mesh_vals: Top MeSH values for every author of that paper • relative_citation_ratio: RCR The ‘Readme’ includes a description for all columns. [1] Torvik, Vetle; Smalheiser, Neil (2021): Author-ity 2018 - PubMed author name disambiguated dataset. University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-2273402_V1 [2] Mishra, Shubhanshu; Torvik, Vetle I. (2018): Conceptual novelty scores for PubMed articles. University of Illinois at Urbana-Champaign. 
https://doi.org/10.13012/B2IDB-5060298_V1
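Because the files carry a .csv.gz extension but are described as gzip-compressed and tab-delimited, a reader that assumes commas would misparse them. The sketch below shows one way to read such a file with the standard library. The header names `pmid`, `year`, and `no_authors` follow the column list above, but their exact spelling in the file, the UTF-8 encoding, and the sample values are all assumptions.

```python
import csv
import gzip
import io


def read_gzip_tsv(source):
    """Read a gzip-compressed, tab-separated file into a list of dicts.

    `source` may be a path or a binary file object; keys come from the header row.
    """
    # Assumption: the text inside the archive is UTF-8.
    with gzip.open(source, mode="rt", encoding="utf-8", newline="") as text:
        return list(csv.DictReader(text, delimiter="\t"))


# Synthetic stand-in for a file such as pmids_plos_3.csv.gz (values are made up).
SAMPLE_GZ = gzip.compress(
    "pmid\tyear\tno_authors\n1\t2001\t3\n2\t2010\t5\n".encode("utf-8")
)

if __name__ == "__main__":
    for row in read_gzip_tsv(io.BytesIO(SAMPLE_GZ)):
        print(row["pmid"], row["year"], row["no_authors"])
```

With the downloaded data, the call would take the file path instead of the in-memory sample. All values arrive as strings, so numeric columns need explicit conversion.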
keywords: Diversity; PubMed; Citation