published: 2022-08-08
 
This upload contains all datasets used in Experiments 2 and 3 of the SALMA paper (pending submission): Shen, Chengze, Baqiao Liu, Kelly P. Williams, and Tandy Warnow. "SALMA: Scalable ALignment using MAFFT-Add".

The zip file has the following structure (presented as an example):

salma_paper_datasets/
|_README.md
|_10aa/
|_crw/
|_homfam/
   |_aat/
   |  |_...
   |_...
|_het/
   |_5000M2-het/
   |  |_...
   |_5000M3-het/
   ...
|_rec_res/

Generally, the structure can be viewed as: [category]/[dataset]/[replicate]/[alignment files]

# Categories:
1. 10aa: There are 10 small biological protein datasets within the `10aa` directory, each with just one replicate.
2. crw: There are 5 selected CRW datasets, namely 5S.3, 5S.E, 5S.T, 16S.3, and 16S.T, each with one replicate. These are the cleaned versions from Shen et al. 2022 (MAGUS+eHMM).
3. homfam: There are the 10 largest Homfam datasets, each with one replicate.
4. het: There are three newly simulated nucleotide datasets from this study, 5000M2-het, 5000M3-het, and 5000M4-het, each with 10 replicates.
5. rec_res: It contains the Rec and Res datasets. Detailed dataset generation can be found in the supplementary materials of the paper.

# Alignment files
There are at most 6 `.fasta` files in each sub-directory:
1. `all.unaln.fasta`: All unaligned sequences.
2. `all.aln.fasta`: Reference alignments of all sequences. If not all sequences have reference alignments, only the sequences that have them will be included.
3. `all-queries.unaln.fasta`: All unaligned query sequences. Query sequences are sequences that do not have lengths within 25% of the median length (i.e., not full-length sequences).
4. `all-queries.aln.fasta`: Reference alignments of query sequences. If not all queries have reference alignments, only the sequences that have them will be included.
5. `backbone.unaln.fasta`: All unaligned backbone sequences. Backbone sequences are sequences that have lengths within 25% of the median length (i.e., full-length sequences).
6. `backbone.aln.fasta`: Reference alignments of backbone sequences. If not all backbone sequences have reference alignments, only the sequences that have them will be included.

> If all sequences are full-length sequences, then `all-queries.unaln.fasta` will be missing.
> If fewer than two query sequences have reference alignments, then `all-queries.aln.fasta` will be missing.
> If fewer than two backbone sequences have reference alignments, then `backbone.aln.fasta` will be missing.

# Additional file(s)
1. `350378genomes.txt`: This file contains all 350,378 bacterial and archaeal genome names that were used by Prodigal (Hyatt et al. 2010) to search for protein sequences.
keywords: SALMA;MAFFT;alignment;eHMM;sequence length heterogeneity
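The backbone/query rule described in this entry (backbone sequences have lengths within 25% of the median length, all others are queries) can be illustrated with a short Python sketch. It is not the authors' script: the function names `read_fasta` and `split_backbone_queries` are hypothetical, and treating the 25% bounds as inclusive is an assumption.

```python
from statistics import median


def read_fasta(text):
    """Parse FASTA-formatted text into a dict of {name: sequence}."""
    records, name = {}, None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {n: "".join(parts) for n, parts in records.items()}


def split_backbone_queries(seqs, tolerance=0.25):
    """Split sequences into backbone (full-length) and query sets.

    A sequence counts as backbone when its length is within `tolerance`
    of the median length; inclusive bounds are an assumption.
    """
    med = median(len(s) for s in seqs.values())
    low, high = (1 - tolerance) * med, (1 + tolerance) * med
    backbone = {n: s for n, s in seqs.items() if low <= len(s) <= high}
    queries = {n: s for n, s in seqs.items() if n not in backbone}
    return backbone, queries


# Toy input: three sequences near the median length and one short one.
example = ">a\nACGTACGTAC\n>b\nACGTACGTA\n>c\nACGTACGTACG\n>d\nACG\n"
backbone, queries = split_backbone_queries(read_fasta(example))
print(sorted(backbone), sorted(queries))
```

In the toy input, only the 3-base sequence falls outside the 25% window, so it is the only one classified as a query.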
published: 2022-03-25
 
This upload includes the 16S.B.ALL dataset in the 100-HF condition (referred to as 16S.B.ALL-100-HF) used in Experiment 3 of the WITCH paper (currently accepted in principle by the Journal of Computational Biology). The 100-HF condition refers to making sequences fragmentary with an average length of 100 bp and a standard deviation of 60 bp. Additionally, we enforced that all fragmentary sequences have lengths > 50 bp. Thus, the final average length of the fragments is slightly higher than 100 bp (~120 bp). In this case (i.e., 16S.B.ALL-100-HF), 1,000 sequences with lengths within 25% of the median length are retained as "backbone sequences", while the remaining sequences are considered "query sequences" and made fragmentary using the "100-HF" procedure. Backbone sequences are aligned using MAGUS (or we extract their reference alignment). Then, the fragmentary versions of the query sequences are added back to the backbone alignment using either MAGUS+UPP or WITCH. More details of the tar.gz file are described in README.txt.
keywords: MAGUS;UPP;Multiple Sequence Alignment;eHMMs
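The effect of the > 50 bp constraint on the average fragment length can be explored with a small Python sketch. It is only an illustration under stated assumptions: the entry gives the mean, the standard deviation, and the minimum length, but not the distribution or the cutting procedure, so the normal distribution, the redraw-on-rejection step, the uniformly random start position, and the function names are all hypothetical.

```python
import random


def sample_fragment_length(rng, mean=100.0, sd=60.0, min_len=50):
    """Draw a fragment length, redrawing until it exceeds min_len.

    Assumption: lengths are normally distributed and short draws are
    rejected; the source states only the mean, the standard deviation,
    and the > 50 bp constraint.
    """
    while True:
        length = round(rng.gauss(mean, sd))
        if length > min_len:
            return length


def fragment(seq, rng, **kwargs):
    """Cut one contiguous fragment from seq at a random start (an assumption)."""
    length = min(sample_fragment_length(rng, **kwargs), len(seq))
    start = rng.randint(0, len(seq) - length)
    return seq[start:start + length]


rng = random.Random(0)
lengths = [sample_fragment_length(rng) for _ in range(20000)]
mean_length = sum(lengths) / len(lengths)
print(f"mean fragment length: {mean_length:.1f} bp")
```

Under these assumptions, discarding draws of 50 bp or less shifts the mean upward to roughly 120 bp, which is consistent with the figure quoted in the description.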
published: 2022-08-06
 
This dataset consists of all the files and codes that are part of the manuscript (main text and supplement) titled "Spin-selective tunneling from nanowires of the candidate topological Kondo insulator SmB6". For detailed information on the individual files refer to the specific readme files.
keywords: Topology; Kondo Insulator; Spin; Scanning tunneling microscopy; antiferromagnetism
published: 2022-08-06
 
An online knowledge, attitudes, and practices survey on ticks and tick-borne diseases (TBD) was distributed to medical professionals in Illinois from summer 2020 to fall 2021. These are the raw data associated with that survey and the survey questions used. Age, gender, and county of practice have been removed to prevent identification of respondents. We have added calculated values (columns 165 to end), including: the tick knowledge score, TBD knowledge score, and total knowledge score, which are the sums of the correct answers in each category, and score percent, which is the proportion of correct answers in each category; region, which is determined from the county of practice; TBD relevant practice, which separates the practice variable into TBD primary, secondary, and non-responders; and several variables which group categories.
keywords: ticks; medicine; tick-borne disease; survey
published: 2022-08-05
 
This data set documents bat activity (counts per detector-night per phonic group) and bat diversity (number of bat species per detector-night) in relation to distance to the nearest forested corridor in a landscape dominated by row crop agriculture and in relation to relative crop pest abundance. This data set was used to assess whether bats were homogeneously distributed over a near-uninterrupted agricultural landscape and to assess the importance of forested corridors and the presence of pest species for their distribution across the landscape. Data were collected with 50 AudioMoth bat detectors along 10 transects, with each transect having 5 detectors. The transects started at a forest corridor and extended out for 4 km into uninterrupted row crop agriculture. Pest abundance was extrapolated from data collected in the same county during the same time as the study. Potentially important weather covariates were extracted from the nearest operational weather station.
keywords: bats; bat activity; biodiversity; agricultural pest
published: 2022-08-05
 
Simulated sequences provide a way to evaluate multiple sequence alignment (MSA) methods where the ground truth is exactly known. However, the realism of such simulated conditions often comes under question compared to empirical datasets. In particular, simulated data often does not display heterogeneity in the sequence lengths, a common feature in biological datasets. In order to imitate sequence length heterogeneity, we here present a set of data that are evolved under a mixture model of indel lengths, where indels have an occasional chance of being promoted to long indels (emulating large insertion/deletion events, e.g., domain-level gain/loss). This dataset is otherwise (e.g., in GTR parameters) analogous to the 1000M condition as presented in the SATe paper (doi: 10.1126/science.1171243) but with 5000 sequences and simulated with INDELible (http://abacus.gene.ucl.ac.uk/software/indelible/). For more information, see README.txt. For the INDELible control files, see https://github.com/ThisBioLife/5000M-234-het.
keywords: simulated data; sequence length heterogeneity; multiple sequence alignment
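The idea of a mixture model of indel lengths, where an indel is occasionally promoted to a long indel, can be sketched in Python as below. This is a toy illustration only, not the simulation used for the dataset (which was run with INDELible and is documented in README.txt and the control files): the geometric components, the promotion probability `p_long`, the two means, and the function name are all hypothetical placeholders.

```python
import random


def sample_indel_length(rng, p_long=0.01, short_mean=2.0, long_mean=200.0):
    """Draw one indel length from a two-component mixture.

    All parameter values and the geometric components are hypothetical
    placeholders, not the settings used to generate the dataset.
    """
    # With probability p_long the indel is "promoted" to the long component.
    mean = long_mean if rng.random() < p_long else short_mean
    # Geometric length on {1, 2, ...} whose expected value is `mean`.
    p = 1.0 / mean
    length = 1
    while rng.random() > p:
        length += 1
    return length


rng = random.Random(1)
lengths = [sample_indel_length(rng) for _ in range(50000)]
long_share = sum(length > 50 for length in lengths) / len(lengths)
print(f"share of indels longer than 50 positions: {long_share:.4f}")
```

Most draws stay short, while a small fraction come from the long component, which is what produces heterogeneity in the resulting sequence lengths.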
published: 2022-08-01
 
Datasets that accompany the Shearer and Beilke 2022 publication "Playing it by ear: gregarious sparrows recognize and respond to isolated wingbeat sounds and predator-based cues", published in the journal Animal Cognition.
keywords: Vigilance; auditory detection; predator detection; predator-prey interaction; antipredator behavior
published: 2022-07-25
 
Related to the raw entity mentions, this dataset represents the effects of the data cleaning process and collates all of the entity mentions which were too ambiguous to successfully link to the NCBI's taxonomy identifier system.
keywords: synthetic biology; NERC data; species mentions; ambiguous entities
published: 2022-07-25
 
A set of species entity mentions derived from an NERC dataset analyzing 900 synthetic biology articles published by the ACS. This data is associated with the Synthetic Biology Knowledge System repository (https://web.synbioks.org/). The data in this dataset are raw mentions from the NERC data.
keywords: synthetic biology; NERC data; species mentions
published: 2022-07-25
 
This dataset represents the results of manual cleaning and annotation of the entity mentions contained in the raw dataset (https://doi.org/10.13012/B2IDB-4950847_V1). Each mention has been consolidated and linked to an identifier for a matching concept from the NCBI's taxonomy database.
keywords: synthetic biology; NERC data; species mentions; cleaned data; NCBI TaxonID
published: 2022-07-25
 
This dataset is derived from the raw dataset (https://doi.org/10.13012/B2IDB-4950847_V1) and collects entity mentions that were manually determined to be noisy, non-species entities.
keywords: synthetic biology; NERC data; species mentions; noisy entities
published: 2022-07-25
 
This dataset is derived from the raw entity mention dataset (https://doi.org/10.13012/B2IDB-4950847_V1) for species entities and represents those that were determined to be species (i.e., were not noisy entities) but for which no corresponding concept could be found in the NCBI taxonomy database.
keywords: synthetic biology; NERC data; species mentions; not found entities
published: 2022-07-25
 
Related to the raw entity mentions (https://doi.org/10.13012/B2IDB-4163883_V1), this dataset represents the effects of the data cleaning process and collates all of the entity mentions which were too ambiguous to successfully link to the ChEBI ontology.
keywords: synthetic biology; NERC data; chemical mentions; ambiguous entities
published: 2022-07-25
 
A set of chemical entity mentions derived from an NERC dataset analyzing 900 synthetic biology articles published by the ACS. This data is associated with the Synthetic Biology Knowledge System repository (https://web.synbioks.org/). The data in this dataset are raw mentions from the NERC data.
keywords: synthetic biology; NERC data; chemical mentions
published: 2022-07-25
 
A set of cell-line entity mentions derived from an NERC dataset analyzing 900 synthetic biology articles published by the ACS. This data is associated with the Synthetic Biology Knowledge System repository (https://web.synbioks.org/). The data in this dataset are raw mentions from the NERC data.
keywords: synthetic biology; NERC data; cell-line mentions
published: 2022-07-25
 
This dataset represents the results of manual cleaning and annotation of the entity mentions contained in the raw dataset (https://doi.org/10.13012/B2IDB-4163883_V1). Each mention has been consolidated and linked to an identifier for a matching concept from the ChEBI ontology.
keywords: synthetic biology; NERC data; chemical mentions; cleaned data; ChEBI ontology
published: 2022-07-25
 
This dataset is derived from the raw dataset (https://doi.org/10.13012/B2IDB-4163883_V1) and collects entity mentions that were manually determined to be noisy, non-chemical entities.
keywords: synthetic biology; NERC data; chemical mentions; noisy entities
published: 2022-07-25
 
This dataset is derived from the raw entity mention dataset (https://doi.org/10.13012/B2IDB-4163883_V1) for chemical entities and represents those that were determined to be chemicals (i.e., were not noisy entities) but for which no corresponding concept could be found in the ChEBI ontology.
keywords: synthetic biology; NERC data; chemical mentions; not found entities
published: 2022-07-25
 
A set of gene and gene-related entity mentions derived from an NERC dataset analyzing 900 synthetic biology articles published by the ACS. This data is associated with the Synthetic Biology Knowledge System repository (https://web.synbioks.org/). The data in this dataset are raw mentions from the NERC data.
keywords: synthetic biology; NERC data; gene mentions
published: 2021-05-10
 
This dataset contains data used in the publication "Institutional Data Repository Development, a Moving Target" submitted to Code4Lib Journal. It is a tabular data file describing attributes of data files in datasets published in the Illinois Data Bank from 2016-04-01 to 2021-04-01.
keywords: institutional repository
published: 2022-07-11
 
This dataset was developed as part of an online survey study that explores student characteristics that may predict what one finds helpful in replies to requests for help posted to an online college course discussion forum. 223 college students enrolled in an introductory statistics course were surveyed on their sense of belonging to their course community, as well as how helpful they found 20 examples of replies to requests for help posted to a statistics course discussion forum.
keywords: help-giving; discussion forums; sense of belonging; college student
published: 2022-07-22
 
Data in this publication were used to examine the effects of environmental and temporal covariates on detection probability, and the effects of habitat and landscape-level covariates on occupancy and within-season turnover of Black-billed Cuckoos and Yellow-billed Cuckoos. Data were collected in 2019 and 2020 in northern Illinois, USA. Procedures were approved by the Illinois Institutional Animal Care and Use Committee (IACUC), protocol no. 19086.
keywords: Black-billed Cuckoo; call broadcast; Coccyzus americanus; Coccyzus erythropthalmus; detection probability; occupancy dynamics; rare and secretive species; Yellow-billed Cuckoo
published: 2022-07-19
 
#### Details of Pseudomonas aeruginosa biofilm dataset ####

----------------*Folder Structure*-------------------------------------
This dataset contains peak intensity tables extracted from mass spectrometry imaging (MSI) data using the tools SCiLS and MSI reader. There are 2 folders in "MSI-Data-Paeruginosa-biofilms-UIUC-DP-JVS-July2022.zip"; each folder contains 3 sub-folders as listed below.
1. PellicleBiofilms-and-Supernatant [Pellicle biofilms collected from air-liquid interface and spent supernatant medium after 96 h incubation period]: (1) Full-Scan-Data-96h; (2) MSMS-data-from-C7-Quinolones-96h; and (3) MSMS-data-from-C9-Quinolones-96h
2. StaticBiofilms [Static biofilms grown on mucin surface]: (1) Full-Scan-Data; (2) MSMS-data-from-C7-Quinolones; and (3) MSMS-data-from-C9-Quinolones

----------------*File name*----------------------------------------------
Sample information is included in the file names for easy identification and processing. Attributes covered in file names are explained in the example below.
*Example file name "Rep1-Stat-FRD1-mPat-48-FS"*
~ Each unit of information is separated by "-"
~ Unit 1 - "Rep1" - Biological replicate (Rep1, Rep2, and Rep3)
~ Unit 2 - "Stat" - Sample type (Stat = Static biofilm, Pel = Pellicle biofilm, Sup = Supernatant)
~ Unit 3 - "FRD1" - Strain (FRD1 = Mucoid strain, PAO1C = Non-mucoid strain)
~ Unit 4 - "mPat" - Type of mucin surface used (mPat = patterned mucin surface, mUni = uniform mucin surface)
~ Unit 5 - "48" - Sample time point (hours = 48, 72, 96)
~ Unit 6 - "FS" - Scan type used in MSI (FS = high resolution full-scan, 260 = targeted MS/MS of C7 quinolones (m/z 260), 288 = targeted MS/MS of C9 quinolones (m/z 288))

----------------*File structure*------------------------------------------
All MSI data have been exported to CSV format. Each CSV file contains information about scan number, coordinates (x,y,z), m/z values, extraction window (absolute), and corresponding intensities in the form of a matrix.

----------------*End of Information*--------------------------------------
keywords: mass spectrometry imaging (MSI); biofilm; antibiotic resistance; Pseudomonas aeruginosa; quorum sensing; rhamnolipids
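The file naming scheme described in this entry lends itself to a small parser. The Python sketch below decodes the six "-"-separated units of the example name "Rep1-Stat-FRD1-mPat-48-FS". The `SampleInfo` class and `parse_sample_name` function are hypothetical names, and the sketch assumes every file name carries all six units as in the example, which may not hold for every file in the archive.

```python
from dataclasses import dataclass

# Code tables copied from the file-name description of the dataset.
SAMPLE_TYPES = {"Stat": "Static biofilm", "Pel": "Pellicle biofilm", "Sup": "Supernatant"}
STRAINS = {"FRD1": "Mucoid strain", "PAO1C": "Non-mucoid strain"}
SURFACES = {"mPat": "patterned mucin surface", "mUni": "uniform mucin surface"}
SCAN_TYPES = {
    "FS": "high resolution full-scan",
    "260": "targeted MS/MS of C7 quinolones (m/z 260)",
    "288": "targeted MS/MS of C9 quinolones (m/z 288)",
}


@dataclass
class SampleInfo:
    replicate: int
    sample_type: str
    strain: str
    surface: str
    hours: int
    scan_type: str


def parse_sample_name(name):
    """Decode a six-unit sample name such as 'Rep1-Stat-FRD1-mPat-48-FS'."""
    # Drop a trailing ".csv" if present (assumed extension of exported files).
    stem = name[:-4] if name.lower().endswith(".csv") else name
    units = stem.split("-")
    if len(units) != 6 or not units[0].startswith("Rep"):
        raise ValueError(f"unexpected file name: {name!r}")
    rep, sample, strain, surface, hours, scan = units
    # Unknown codes are passed through unchanged rather than rejected.
    return SampleInfo(
        replicate=int(rep[3:]),
        sample_type=SAMPLE_TYPES.get(sample, sample),
        strain=STRAINS.get(strain, strain),
        surface=SURFACES.get(surface, surface),
        hours=int(hours),
        scan_type=SCAN_TYPES.get(scan, scan),
    )


info = parse_sample_name("Rep1-Stat-FRD1-mPat-48-FS")
print(info)
```

Passing unknown codes through unchanged is a design choice for the sketch, so that an unexpected code does not stop a batch run over many files.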
published: 2022-06-20
 
This is a sentence-level parallel corpus in support of research on OCR quality. The source data come from: (1) Project Gutenberg for human-proofread "clean" sentences; and (2) HathiTrust Digital Library for the paired sentences with OCR errors. In total, this corpus contains 167,079 sentence pairs from 189 sampled books in four domains (i.e., agriculture, fiction, social science, world war history) published from 1793 to 1984. There are 36,337 sentences that have two OCR views paired with each clean version. In addition to sentence texts, this corpus also provides the location (i.e., sentence and chapter index) of each sentence in the Gutenberg volume it belongs to.
keywords: sentence-level parallel corpus; optical character recognition; OCR errors; Project Gutenberg; HathiTrust Digital Library; digital libraries; digital humanities
published: 2022-06-22
 
This dataset supports the investigation of spatial accessibility to HIV testing, treatment, and prevention services in Illinois and Chicago, USA. The main components are population data, healthcare data, GTFS feeds, and road network data. The core components are:
1) `GTFS` contains GTFS (General Transit Feed Specification, https://gtfs.org/) data provided by the Chicago Transit Authority (CTA) through Google's GTFS feeds (https://developers.google.com/transit/gtfs). The documentation that defines the format and structure of the files that comprise a GTFS dataset is at https://developers.google.com/transit/gtfs/reference?csw=1.
2) `HealthCare` contains shapefiles describing HIV healthcare providers in Chicago and Illinois respectively. The services come from Locator.HIV.gov (https://locator.hiv.gov/).
3) `PopData` contains population data for Chicago and Illinois respectively. Data come from the American Community Survey (ACS) and AIDSVu (https://map.aidsvu.org/map). AIDSVu provides data on PLWH in Chicago at the census tract level for the year 2017 and in the State of Illinois at the county level for the year 2016. The ACS provided the number of people aged 15 to 64 at the census tract level for the year 2017 and at the county level for the year 2016. The ACS provides annually updated information on demographic and socioeconomic characteristics of people and housing in the U.S.
4) `RoadNetwork` contains the road networks for Chicago and Illinois respectively, obtained from OpenStreetMap (https://www.openstreetmap.org/copyright) using the Python osmnx package (https://osmnx.readthedocs.io/en/stable/).

The abstract for our paper is: Accomplishing the goals outlined in “Ending the HIV (Human Immunodeficiency Virus) Epidemic: A Plan for America Initiative” will require properly estimating and increasing access to HIV testing, treatment, and prevention services. In this research, a computational spatial method for estimating access was applied to measure distance to services from all points of a city or state while considering the size of the population in need of services as well as both driving and public transportation. Specifically, this study employed the enhanced two-step floating catchment area (E2SFCA) method to measure spatial accessibility to HIV testing, treatment (i.e., Ryan White HIV/AIDS program), and prevention (i.e., Pre-Exposure Prophylaxis [PrEP]) services. The method considered the spatial location of MSM (Men Who have Sex with Men), PLWH (People Living with HIV), and the general adult population aged 15-64, depending on what HIV services the U.S. Centers for Disease Control (CDC) recommends for each group. The study delineated service- and population-specific accessibility maps, demonstrating the method’s utility by analyzing data corresponding to the city of Chicago and the state of Illinois. Findings indicated health disparities in the south and the northwest of Chicago and particular areas in Illinois, as well as unique health disparities for public transportation compared to driving. The methodology details and computer code are shared for use in research and public policy.
keywords: HIV;spatial accessibility;spatial analysis;public transportation;GIS
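The enhanced two-step floating catchment area (E2SFCA) method named in the abstract can be illustrated with a minimal Python sketch. It is a generic textbook-style version, not the code shared with the paper: the function names, the toy clinics and tracts, the travel times, and the zone weights are all hypothetical, and a single travel-time table stands in for the driving and public transportation modes considered in the study.

```python
def zone_weight(distance, zones):
    """Return the weight of the first zone whose upper bound covers distance, else 0."""
    for upper_bound, weight in zones:
        if distance <= upper_bound:
            return weight
    return 0.0


def e2sfca(supply, population, travel, zones):
    """Enhanced two-step floating catchment area accessibility scores.

    supply: {provider: capacity}; population: {area: people};
    travel: {(area, provider): travel time};
    zones: [(upper_bound, weight), ...] sorted by upper_bound.
    """
    # Step 1: supply-to-demand ratio over each provider's weighted catchment.
    ratios = {}
    for j, capacity in supply.items():
        demand = sum(
            people * zone_weight(travel[(i, j)], zones)
            for i, people in population.items()
        )
        ratios[j] = capacity / demand if demand > 0 else 0.0
    # Step 2: sum the weighted ratios of the providers reachable from each area.
    return {
        i: sum(ratios[j] * zone_weight(travel[(i, j)], zones) for j in supply)
        for i in population
    }


# Hypothetical toy inputs: travel times in minutes and illustrative zone weights.
zones = [(10, 1.0), (20, 0.5), (30, 0.25)]
supply = {"clinic_a": 2.0, "clinic_b": 1.0}
population = {"tract_1": 100.0, "tract_2": 200.0}
travel = {
    ("tract_1", "clinic_a"): 5,
    ("tract_1", "clinic_b"): 25,
    ("tract_2", "clinic_a"): 15,
    ("tract_2", "clinic_b"): 40,
}
scores = e2sfca(supply, population, travel, zones)
print(scores)
```

In the toy data, tract_1 is close to both clinics and receives a higher score than tract_2, which reaches only one clinic and only in the second travel-time zone.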