Illinois Data Bank Dataset Search Results
published:
2018-04-05
GBS data from Phaseolus accessions, for a study led by Dr. Glen Hartman, UIUC. The (zipped) fastq file can be processed with the TASSEL GBS pipeline or other pipelines for SNP calling. The related article has been submitted and the methods section describes the data processing in detail.
published:
2018-04-23
Contains a series of datasets that score pairs of tokens (words, journal names, and controlled vocabulary terms) based on how often they co-occur within versus across authors' collections of papers. The tokens derive from four different fields of PubMed papers: journal, affiliation, title, MeSH (medical subject headings). Thus, there are 10 different datasets, one for each pair of token types: affiliation-word vs affiliation-word, affiliation-word vs journal, affiliation-word vs mesh, affiliation-word vs title-word, mesh vs mesh, mesh vs journal, etc.
Using authors to link papers, and in turn pairs of tokens, is an alternative to the usual within-document co-occurrences and to using, e.g., citations to link papers. This is particularly striking for journal pairs because a paper almost always appears in a single journal, so within-document co-occurrences are 0, i.e., useless.
The tokens are taken from the Author-ity 2009 dataset which has a cluster of papers for each inferred author, and a summary of each field. For MeSH, title-words, affiliation-words that summary includes only the top-20 most frequent tokens after field-specific stoplisting (e.g., university is stoplisted from affiliation and Humans is stoplisted from MeSH). The score for a pair of tokens A and B is defined as follows. Suppose Ai and Bi are the number of occurrences of token A (and B, respectively) across the i-th author's papers, then
nA = sum(Ai); nB = sum(Bi)
nAB = sum(Ai*Bi) if A not equal B; nAA = sum(Ai*(Ai-1)/2) otherwise
nAnB = nA*nB if A not equal B; nAnA = nA*(nA-1)/2 otherwise
score = 1000000*nAB/nAnB if A is not equal B; 1000000*nAA/nAnA otherwise
Token pairs are excluded when: score < 5, or nA < cut-off, or nB < cut-off, or nAB < cut-offAB.
The cut-offs differ for token types and can be inferred from the datasets. For example, cut-off = 200 and cut-offAB = 20 for journal pairs.
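The definitions above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code used to build the datasets: the function names `pair_scores` and `is_kept`, and the input layout (one dict of token counts per author), are hypothetical.

```python
from collections import Counter
from itertools import combinations_with_replacement


def pair_scores(author_counts):
    """Score token pairs from per-author counts.

    author_counts: list of dicts, one per author, mapping token -> number of
    occurrences across that author's papers (a hypothetical input layout).
    Returns a dict mapping (A, B) with A <= B to the score in ppm.
    """
    # nA = sum(Ai) over all authors
    n = Counter()
    for counts in author_counts:
        n.update(counts)

    # nAB = sum(Ai*Bi) if A != B; nAA = sum(Ai*(Ai-1)/2) otherwise
    n_ab = Counter()
    for counts in author_counts:
        for a, b in combinations_with_replacement(sorted(counts), 2):
            if a == b:
                n_ab[(a, b)] += counts[a] * (counts[a] - 1) // 2
            else:
                n_ab[(a, b)] += counts[a] * counts[b]

    scores = {}
    for (a, b), nab in n_ab.items():
        # nAnB = nA*nB if A != B; nAnA = nA*(nA-1)/2 otherwise
        pairs = n[a] * (n[a] - 1) // 2 if a == b else n[a] * n[b]
        if pairs > 0:
            scores[(a, b)] = 1_000_000 * nab / pairs
    return scores


def is_kept(score, n_a, n_b, n_ab, cutoff, cutoff_ab):
    """Apply the exclusion rule: drop a pair if any threshold is missed."""
    return not (score < 5 or n_a < cutoff or n_b < cutoff or n_ab < cutoff_ab)
```

For journal pairs, the cut-offs stated above would correspond to `cutoff=200` and `cutoff_ab=20`; other token types would need the values inferred from their datasets.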
Each dataset has the following 7 tab-delimited all-ASCII columns
1: score: roughly the number of token co-occurrences divided by the total number of pairs, in parts per million (ppm), ranging from 5 to 1,000,000
2: nAB: total number of co-occurrences
3: nAnB: total number of pairs
4: nA: number of occurrences of token A
5: nB: number of occurrences of token B
6: A: token A
7: B: token B
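A minimal sketch of reading one of these files with the Python standard library follows. It assumes the files carry no header row (rows whose numeric columns fail to parse are skipped, which would also drop a header if one exists) and that the count columns are integers; the names `PairRow` and `read_pairs` are hypothetical.

```python
import csv
from typing import Iterator, NamedTuple, TextIO


class PairRow(NamedTuple):
    score: float  # column 1, in ppm
    n_ab: int     # column 2, co-occurrences
    n_anb: int    # column 3, total number of pairs
    n_a: int      # column 4, occurrences of token A
    n_b: int      # column 5, occurrences of token B
    a: str        # column 6, token A
    b: str        # column 7, token B


def read_pairs(handle: TextIO) -> Iterator[PairRow]:
    """Yield one PairRow per well-formed 7-column tab-delimited line."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for fields in reader:
        if len(fields) != 7:
            continue  # skip blank or malformed lines
        score, n_ab, n_anb, n_a, n_b, a, b = fields
        try:
            yield PairRow(float(score), int(n_ab), int(n_anb),
                          int(n_a), int(n_b), a, b)
        except ValueError:
            continue  # non-numeric columns, e.g. a header row
```

Because the function yields rows lazily, it can be combined with a filter (for example, keeping only rows above a chosen score) without loading a whole file into memory.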
We made some of these datasets as early as 2011 as we were working to link PubMed authors with USPTO inventors, where the vocabulary usage is strikingly different, but also more recently to create links from PubMed authors to their dissertations and NIH/NSF investigators, and to help disambiguate PubMed authors. Going beyond explicit (exact within-field match) is particularly useful when data is sparse (think old papers lacking controlled vocabulary and affiliations, or papers with metadata written in different languages) and when making links across databases with different kinds of fields and vocabulary (think PubMed vs USPTO records). We never published a paper on this but our work inspired the more refined measures described in:
<a href="https://doi.org/10.1371/journal.pone.0115681">D′Souza JL, Smalheiser NR (2014) Three Journal Similarity Metrics and Their Application to Biomedical Journals. PLOS ONE 9(12): e115681. https://doi.org/10.1371/journal.pone.0115681</a>
<a href="http://dx.doi.org/10.5210/disco.v7i0.6654">Smalheiser, N., & Bonifield, G. (2016). Two Similarity Metrics for Medical Subject Headings (MeSH): An Aid to Biomedical Text Mining and Author Name Disambiguation. DISCO: Journal of Biomedical Discovery and Collaboration, 7. doi:http://dx.doi.org/10.5210/disco.v7i0.6654</a>
keywords:
PubMed; MeSH; token; name disambiguation
published:
2019-09-06
This is a dataset of 1101 comments from The New York Times (May 1, 2015-August 31, 2015) that contain a mention of the stemmed words vaccine or vaxx.
keywords:
vaccine;online comments
published:
2020-10-01
Fraterrigo, Jennifer; Rembelski, Mara
(2020)
We measured the effects of fire or drought treatment on plant, microbial and biogeochemical responses in temperate deciduous forests invaded by the annual grass Microstegium vimineum with a history of either frequent fire or fire exclusion.
Please note, on Documentation tab / Experimental or Sampling Design, “15 (XVI)” should be “16 (XVI)”.
keywords:
plant-soil interaction; grass-fire cycle; Microstegium; carbon and nitrogen cycling; microbial decomposers
published:
2025-10-29
Chen, Chu-Chun; Dominguez, Francina; Matus, Sean
(2025)
This dataset contains variables from the European Centre for Medium-Range Weather Forecasts (ECMWF) Reanalysis v5 (ERA5; Hersbach et al., 2020). These data were used for the analysis in “The impact of large-scale land surface conditions on the South American low-level jet” published in Geophysical Research Letters.
Acknowledgments:
This work was supported by NSF Award AGS-1852709. We thank Dr. Zhuo Wang and Dr. Divyansh Chug for their valuable feedback and insightful discussions.
References:
Hersbach H, Bell B, Berrisford P, et al. The ERA5 global reanalysis. Q J R Meteorol Soc. 2020; 146: 1999–2049. https://doi.org/10.1002/qj.3803
keywords:
atmospheric sciences; South American low-level jet; land-atmosphere interactions; soil moisture; regional atmospheric circulation; southeastern South America
published:
2025-10-10
Yang, Pan; Cai, Ximing; Leibensperger, Carrie; Khanna, Madhu
(2025)
The success of a bioenergy policy relies largely on the wide adoption of perennial energy crops at the farm scale. This study uses survey data to examine potential adoption decisions by farmers in the U.S. Midwest and the causal effects of various direct and indirect influencing factors, especially heterogeneous preferences of farmers. A Bayesian network (BN) model is developed to delineate the causal relationship between farmers' adoption decisions and the influencing factors. We find a dominating role of economic factors and a non-negligible impact of non-economic factors, such as the perceived environmental benefits and the extent of familiarity with perennial energy crops. To examine the effect of heterogeneity in farmer preferences, we classify the surveyed farmers into four categories based on their attitudes toward the economic, social, and environmental dimensions of perennial energy crops. We identified statistically significant between-group differences in the responses of the four types of farmers to the various influencing factors. Our findings contribute to disentangling the complicated motivations that will influence perennial energy crop adoption decisions and provide implications for more targeted policy development that needs to consider the heterogeneous drivers of farmer decisions about land use.
keywords:
Sustainability;Modeling
published:
2017-07-29
This dataset contains the PartMC-MOSAIC simulations used in the article “Plume-exit modeling to determine cloud condensation nuclei activity of aerosols from residential biofuel combustion”. The data is organized as a set of folders, each folder representing a different scenario modeled. Each folder contains a series of NetCDF files, which are the output of the PartMC-MOSAIC simulation. They contain information on particle and gas properties, both of the biofuel burning plume and background. Input files for PartMC-MOSAIC are also included. This dataset was used during the open review process at Atmospheric Chemistry and Physics (ACP) and supports both the discussion paper and final article.
keywords:
CCN; cloud condensation nuclei; activation; supersaturation; biofuel
published:
2017-10-10
Kozak, Derek L.; Luo, Jie; Olson, Scott M.; LaFave, James M.; Fahnestock, Larry A.
(2017)
This dataset contains ground motion data for Newmark Structural Engineering Laboratory (NSEL) Report Series 048, "Modification of ground motions for use in Central North America: Southern Illinois surface ground motions for structural analysis". The data are 20 individual ground motion time history records developed at each of the 10 sites (for a total of 200 ground motions). These accompanying ground motions are developed following the detailed procedure presented in Kozak et al. [2017].
keywords:
earthquake engineering; ground motion records; southern Illinois seismic hazard; dynamic structural analysis; conditional mean spectrum
published:
2020-08-10
Zinnen, Jack; Spyreas, Greg; Erdős, László; Berg, Christian; Matthews, Jeffrey
(2020)
These are text files downloaded from the Web of Science for the bibliographic analyses found in Zinnen et al. (2020) in Applied Vegetation Science. They represent the papers and reference lists from six expert-based indicator systems: Floristic Quality Assessment, hemeroby, naturalness indicator values (& social behaviors), Ellenberg indicator values, grassland utilization values, and urbanity indicator values.
To examine the data, download VOSviewer and see the instructions from van Eck & Waltman (2019) for how to upload data. Although we used bibliographic coupling, there are a number of other interesting bibliographic analyses you can use with these data (e.g., visualizing citations between journals from this set of documents).
Note: There are two caveats to note about these data and Supplements 1 & 2 associated with our paper. First, there are some overlapping papers in these text files (i.e., raw data). When added individually, the papers sum to more than the numbers we give. However, when combined VOSviewer recognizes these as repeats, and matches the numbers we list in S1 and the manuscript. Second, we labelled the downloaded papers in S2 with their respective systems. In some cases, the labels do not completely match our counts listed in S1 and raw data. This is because some of these papers use another system, but were not captured in our systematic literature search (e.g., a paper may have used hemeroby, but was not picked up by WoS, so this paper is not listed as one of the 52 hemeroby papers).
keywords:
Web of Science; bibliographic analyses; vegetation; VOSviewer
published:
2024-09-17
Cao, Yanghui; Dietrich, Christopher H.; Dmitriev, Dmitry A.; Kits, Joel H.; Xue, Qingquan; Zhang, Yalin
(2024)
The following seven zip files are compressed folders containing the input datasets/trees, main output files and the scripts of the related analyses performed in this study.
I. ancestral_microhabitat_reconstruction.zip: contains four files, including two input files (microhabitats.csv, timetree.tre) and a script (simmap_microhabitat.R) for ancestral states reconstruction of microhabitat by make.simmap implemented in the R package phytools v1.5, as well as the main output file (ancestral_microhabitats.csv).
1. ancestral_microhabitats.csv: reconstructed ancestral microhabitats for each node.
2. microhabitats.csv: microhabitats of the studied species.
3. simmap_microhabitat.R: the R script of make.simmap for ancestral microhabitat reconstruction
4. timetree.tre: dated tree used for ancestral state reconstruction for microhabitat and morphological characters
II. ancestral_morphology_reconstruction.zip: contains six files, including an input file (morphology.csv) and a script (simmap_morphology.R) for ancestral states reconstruction of morphology by make.simmap implemented in the R package phytools v1.5, as well as four main output files (forewing_ancestral_state.csv, frontal_sutures_ancestral_state.csv, hind_wing_ancestral_state.csv, ocellus_ancestral_state.csv).
1. forewing_ancestral_state.csv: reconstructed ancestral states of the development of the forewing for each node.
2. frontal_sutures_ancestral_state.csv: reconstructed ancestral states of the development of frontal sutures for each node.
3. hind_wing_ancestral_state.csv: reconstructed ancestral states of the development of the hind wing for each node.
4. morphology.csv: the states of the development of ocellus, forewing, hind wing and frontal sutures for each studied species.
5. ocellus_ancestral_state.csv: reconstructed ancestral states of the development of the ocellus for each node.
6. simmap_morphology.R: the R script of make.simmap for ancestral state reconstruction of morphology
III. biogeographic_reconstruction.zip: contains four files, including three input files (dispersal_probablity.txt, distributions.csv, timetree_noOutgroup.tre) used for a stratified biogeographic analysis by BioGeoBEARS in RASP v4.2 and the main output file (DIVELIKE_result.txt).
1. dispersal_probablity.txt: relative dispersal probabilities among biogeographical regions at different geological epochs.
2. distributions.csv: current distributions of the studied species.
3. DIVELIKE_result.txt: BioGeoBEARS result of ancestral areas based on the DIVELIKE model.
4. timetree_noOutgroup.tre: the dated tree with the outgroup lineage (Eurymelinae) excluded.
IV. coalescent_analysis.zip: contains a folder and two files, including a folder (individual_gene_alignment) of input files used to construct gene trees, an input file (MLtree_BS70.tre) used for the multi-species coalescent analysis by ASTRAL v 4.10.5 and the main output file (coalescent_species_tree.tre).
1. coalescent_species_tree.tre: the species tree generated by the multi-species coalescent analysis with the quartet support, effective number of genes and the local posterior probability indicated.
2. individual_gene_alignment: a folder containing 427 FASTA files, each one represents the nucleotide alignment for a gene. Hyphens are used to represent gaps. These files were used to construct gene trees using IQ-TREE v1.6.12.
3. MLtree_BS70.tre: 165 gene trees with the average SH-aLRT and ultrafast bootstrap values of ≥ 70%. This file was used to estimate the species tree by ASTRAL v 4.10.5.
V. divergence_time_estimation.zip: contains five files, including two input files (treefile_rooted_noBranchLength.tre, treefile_rooted.tre) and two control files (baseml.ctl, mcmctree.ctl) used for divergence time estimation by BASEML and MCMCTREE in PAML v4.9, as well as the main output file (timetree_with95%HPD.tre).
1. baseml.ctl: the control file used for the estimation of substitution rates by BASEML in PAML v4.9.
2. mcmctree.ctl: the control file used for the estimation of divergence times by MCMCTREE in PAML v4.9.
3. timetree_with95%HPD.tre: dated tree with the 95% highest posterior density confidence intervals indicated.
4. treefile_rooted_noBranchLength.tre: the maximum likelihood tree based on the concatenated nucleotide dataset with calibrations for the crown and internal nodes. Branch length and support values were not indicated.
5. treefile_rooted.tre: the maximum likelihood tree based on the concatenated nucleotide dataset with a secondary calibration on the root age. Branch support values were not indicated.
VI. maximum_likelihood_analysis_aa.zip: contains three files, including two input files (concatenated_aa_partition.nex, concatenated_aa.phy) used for the maximum likelihood analysis by IQ-TREE v1.6.12 and the main output file (MLtree_aa.tre).
1. concatenated_aa_partition.nex: the partitioning schemes for the maximum likelihood analysis using concatenated_aa.phy. This file partitions the 52,024 amino acid positions into 427 character sets.
2. concatenated_aa.phy: a concatenated amino acid dataset with 52,024 amino acid positions. Hyphens are used to represent gaps. This dataset was used for the maximum likelihood analysis.
3. MLtree_aa.tre: the maximum likelihood tree based on the concatenated amino acid dataset, with SH-aLRT values and ultrafast bootstrap values indicated.
VII. maximum_likelihood_analysis_nt.zip: contains three files, including two input files (concatenated_nt_partition.nex, concatenated_nt.phy) used for the maximum likelihood analysis by IQ-TREE v1.6.12 and the main output file (MLtree_nt.tre).
1. concatenated_nt_partition.nex: the partitioning schemes for the maximum likelihood analysis using concatenated_nt.phy. This file partitions the 156,072 nucleotide positions into 427 character sets.
2. concatenated_nt.phy: a concatenated nucleotide dataset with 156,072 nucleotide positions. Hyphens are used to represent gaps. This dataset was used for the maximum likelihood analysis as well as divergence time estimation.
3. MLtree_nt.tre: the maximum likelihood tree based on the concatenated nucleotide dataset, with SH-aLRT values and ultrafast bootstrap values indicated.
VIII. Taxon_sampling.csv: contains the sample IDs (1st column) which were used in the alignments and the taxonomic information (2nd to 6th columns).
keywords:
Anchored Hybrid Enrichment, Biogeography, Cicadellidae, Phylogenomics, Treehoppers
published:
2017-12-15
These are the results of an 8 month cohort study in two commercial dairy herds in Northwest Illinois. From each herd, 50 cows were selected at random, stratified over lactations 1 to 3. Serum from these animals was collected every two months and tested for antibodies to Bovine Leukosis Virus, Neospora caninum, and Mycobacterium avium subsp. paratuberculosis. Animals that left the herd during the study were replaced by another animal in the same herd and lactation. At the last sampling, serum neutralization assays were performed for Bovine Herpesvirus type 1 and Bovine Viral Diarrhea virus type 1 and 2. Production data before and after sampling was collected for the entire herd from PCdart.
keywords:
serostatus;dairy;production;cohort
published:
2018-12-31
Sixty undergraduate STEM lecture classes were observed across 14 departments at the University of Illinois Urbana-Champaign in 2015 and 2016. We selected the classes to observe using purposive sampling techniques with the objectives of (1) collecting classroom observations that were representative of the STEM courses offered; (2) conducting observations on non-test, typical class days; and (3) comparing these classroom observations using the Class Observation Protocol for Undergraduate STEM (COPUS) to record the presence and frequency of active learning practices utilized by Community of Practice (CoP) and non-CoP instructors.
Decimal values are the result of combined observations. All COPUS codes listed are from Smith (2013) "The Classroom Observation Protocol for Undergraduate STEM (COPUS): A New Instrument to Characterize STEM Classroom Practices" paper.
For more information on the data collection process, see "Evidence that communities of practice are associated with active learning in large STEM lectures" by Tomkin et al. (2019) in the International Journal of STEM Education.
keywords:
COPUS, Community of Practice
published:
2023-07-05
Njuguna, Joyce; Clark, Lindsay; Lipka, Alexander; Anzoua, Kossonou; Bagmet, Larisa; Chebukin, Pavel; Dwiyanti, Maria; Dzyubenko, Elena; Dzyubenko, Nicolay; Ghimire, Bimal; Jin, Xiaoli; Johnson, Douglas; Kjeldsen, Jens; Nagano, Hironori; Oliveira, Ivone; Peng, Junhua; Petersen, Karen; Sabitov, Andrey; Seong, Eun; Yamada, Toshihiko; Yoo, Ji; Yu, Chang; Zhao, Hu; Munoz, Patricio; Long, Stephen; Sacks, Erik
(2023)
This dataset contains all data used in the paper "Impact of genotype-calling methodologies on genome-wide association and genomic prediction in polyploids". The dataset includes genotypes and phenotypic data from two autotetraploid species, Miscanthus sacchariflorus and Vaccinium corymbosum, that were used for genome-wide association studies and genomic prediction, and the scripts used in the analysis.
In this V2, two files containing the raw data have been added:
"Miscanthus_sacchariflorus_RADSeq.vcf" is the VCF file with the raw SNP calls of the Miscanthus sacchariflorus data used for genotype calling using the 6 genotype calling methods.
"Blueberry_data_read_depths.RData" is an RData file with the read depth data that was used for genotype calling in the Blueberry dataset.
keywords:
Polyploid; allelic dosage; Bayesian genotype-calling; Genome-wide association; Genomic prediction
published:
2024-07-11
Gholamalamdari, Omid; Belmont, Andrew
(2024)
This repository contains the data and computational analysis notebooks that were used in the following manuscript.
For more information on the methods and contributing authors, please refer to the original manuscript.
"Beyond A and B Compartments: how major nuclear locales define nuclear genome organization and function Omid Gholamalamdari et al. 2024"
keywords:
genomic analysis; R markdown; genomic segmentations
published:
2025-09-29
Zhai, Zhiyang; Liu, Hui; Shanklin, John
(2025)
During the transformation of wild-type (WT) Arabidopsis thaliana, a T-DNA containing OLEOSIN-GFP (OLE1-GFP) was inserted by happenstance within the GBSS1 gene, resulting in significant reduction in amylose and increase in leaf oil content in the transgenic line (OG). The synergistic effect on oil accumulation of combining gbss1 with the expression of OLE1-GFP was confirmed by transforming an independent gbss1 mutant (GABI_914G01) with OLE1-GFP. The resulting OLE1-GFP/gbss1 transgenic lines showed higher leaf oil content than the individual OLE1-GFP/WT or single gbss1 mutant lines. Further stacking of the lipogenic factors WRINKLED1, Diacylglycerol O-Acyltransferase (DGAT1), and Cys-OLEOSIN1 (an engineered sesame OLEOSIN1) in OG significantly elevated its oil content in mature leaves to 2.3% of dry weight, which is 15 times higher than that in WT Arabidopsis. Inducible expression of the same lipogenic factors was shown to be an effective strategy for triacylglycerol (TAG) accumulation without incurring growth, development, and yield penalties.
keywords:
Feedstock Production;Biomass Analytics
published:
2018-04-23
Mishra, Shubhanshu; Fegley, Brent D; Diesner, Jana; Torvik, Vetle I.
(2018)
Self-citation analysis data based on PubMed Central subset (2002-2005)
----------------------------------------------------------------------
Created by Shubhanshu Mishra, Brent D. Fegley, Jana Diesner, and Vetle Torvik on April 5th, 2018
## Introduction
This is a dataset created as part of the publication titled: Mishra S, Fegley BD, Diesner J, Torvik VI (2018) Self-Citation is the Hallmark of Productive Authors, of Any Gender. PLOS ONE.
It contains files for running the self citation analysis on articles published in PubMed Central between 2002 and 2005, collected in 2015.
The dataset is distributed in the form of the following tab separated text files:
* Training_data_2002_2005_pmc_pair_First.txt (1.2G) - Data for first authors
* Training_data_2002_2005_pmc_pair_Last.txt (1.2G) - Data for last authors
* Training_data_2002_2005_pmc_pair_Middle_2nd.txt (964M) - Data for middle 2nd authors
* Training_data_2002_2005_pmc_pair_txt.header.txt - Header for the data
* COLUMNS_DESC.txt file - Descriptions of all columns
* model_text_files.tar.gz - Text files containing model coefficients and scores for model selection.
* results_all_model.tar.gz - Model coefficient and result files in numpy format used for plotting purposes. v4.reviewer contains models for analysis done after reviewer comments.
* README.txt file
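Since the three data files are each around 1 GB, streaming them row by row is usually more practical than loading them whole. The sketch below pairs a data file with the separate header file using only the standard library. It assumes the header file holds the column names on a single tab-separated line, which should be checked against COLUMNS_DESC.txt; the function name `iter_records` is hypothetical.

```python
import csv


def iter_records(data_path, header_path):
    """Yield each data row as a dict keyed by the names in the header file.

    Assumption: header_path contains one tab-separated line of column names.
    """
    with open(header_path, newline="", encoding="utf-8") as header_file:
        columns = next(csv.reader(header_file, delimiter="\t"))
    with open(data_path, newline="", encoding="utf-8") as data_file:
        reader = csv.reader(data_file, delimiter="\t", quoting=csv.QUOTE_NONE)
        for fields in reader:
            # zip truncates silently if a row is shorter than the header
            yield dict(zip(columns, fields))
```

A call such as `iter_records("Training_data_2002_2005_pmc_pair_First.txt", "Training_data_2002_2005_pmc_pair_txt.header.txt")` would then iterate over the first-author data without holding the file in memory.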
## Dataset creation
Our experiments relied on data from multiple sources including proprietary data from [Thomson Reuters' (now Clarivate Analytics) Web of Science collection of MEDLINE citations](<a href="https://clarivate.com/products/web-of-science/databases/">https://clarivate.com/products/web-of-science/databases/</a>). Authors interested in reproducing our experiments should personally request this data from Clarivate Analytics. However, we do make available a similar but open dataset based on citations from PubMed Central, which can be used to get results similar to those reported in our analysis. Furthermore, we have also freely shared our datasets, which can be used along with the citation datasets from Clarivate Analytics to re-create the dataset used in our experiments. These datasets are listed below. If you wish to use any of those datasets please make sure you cite both the dataset as well as the paper introducing the dataset.
* MEDLINE 2015 baseline: <a href="https://www.nlm.nih.gov/bsd/licensee/2015_stats/baseline_doc.html">https://www.nlm.nih.gov/bsd/licensee/2015_stats/baseline_doc.html</a>
* Citation data from PubMed Central (original paper includes additional citations from Web of Science)
* Author-ity 2009 dataset:
- Dataset citation: <a href="https://doi.org/10.13012/B2IDB-4222651_V1">Torvik, Vetle I.; Smalheiser, Neil R. (2018): Author-ity 2009 - PubMed author name disambiguated dataset. University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-4222651_V1</a>
- Paper citation: <a href="https://doi.org/10.1145/1552303.1552304">Torvik, V. I., & Smalheiser, N. R. (2009). Author name disambiguation in MEDLINE. ACM Transactions on Knowledge Discovery from Data, 3(3), 1–29. https://doi.org/10.1145/1552303.1552304</a>
- Paper citation: <a href="https://doi.org/10.1002/asi.20105">Torvik, V. I., Weeber, M., Swanson, D. R., & Smalheiser, N. R. (2004). A probabilistic similarity metric for Medline records: A model for author name disambiguation. Journal of the American Society for Information Science and Technology, 56(2), 140–158. https://doi.org/10.1002/asi.20105</a>
* Genni 2.0 + Ethnea for identifying author gender and ethnicity:
- Dataset citation: <a href="https://doi.org/10.13012/B2IDB-9087546_V1">Torvik, Vetle (2018): Genni + Ethnea for the Author-ity 2009 dataset. University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-9087546_V1</a>
- Paper citation: <a href="https://doi.org/10.1145/2467696.2467720">Smith, B. N., Singh, M., & Torvik, V. I. (2013). A search engine approach to estimating temporal changes in gender orientation of first names. In Proceedings of the 13th ACM/IEEE-CS joint conference on Digital libraries - JCDL ’13. ACM Press. https://doi.org/10.1145/2467696.2467720</a>
- Paper citation: <a href="http://hdl.handle.net/2142/88927">Torvik VI, Agarwal S. Ethnea -- an instance-based ethnicity classifier based on geo-coded author names in a large-scale bibliographic database. International Symposium on Science of Science March 22-23, 2016 - Library of Congress, Washington DC, USA. http://hdl.handle.net/2142/88927</a>
* MapAffil for identifying article country of affiliation:
- Dataset citation: <a href="https://doi.org/10.13012/B2IDB-4354331_V1">Torvik, Vetle I. (2018): MapAffil 2016 dataset -- PubMed author affiliations mapped to cities and their geocodes worldwide. University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-4354331_V1</a>
- Paper citation: <a href="http://doi.org/10.1045/november2015-torvik">Torvik VI. MapAffil: A Bibliographic Tool for Mapping Author Affiliation Strings to Cities and Their Geocodes Worldwide. D-Lib magazine : the magazine of the Digital Library Forum. 2015;21(11-12):10.1045/november2015-torvik</a>
* IMPLICIT journal similarity:
- Dataset citation: <a href="https://doi.org/10.13012/B2IDB-4742014_V1">Torvik, Vetle (2018): Author-implicit journal, MeSH, title-word, and affiliation-word pairs based on Author-ity 2009. University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-4742014_V1</a>
* Novelty dataset for identifying article level novelty:
- Dataset citation: <a href="https://doi.org/10.13012/B2IDB-5060298_V1">Mishra, Shubhanshu; Torvik, Vetle I. (2018): Conceptual novelty scores for PubMed articles. University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-5060298_V1</a>
- Paper citation: <a href="https://doi.org/10.1045/september2016-mishra"> Mishra S, Torvik VI. Quantifying Conceptual Novelty in the Biomedical Literature. D-Lib magazine : The Magazine of the Digital Library Forum. 2016;22(9-10):10.1045/september2016-mishra</a>
- Code: <a href="https://github.com/napsternxg/Novelty">https://github.com/napsternxg/Novelty</a>
* Expertise dataset for identifying author expertise on articles:
* Source code provided at: <a href="https://github.com/napsternxg/PubMed_SelfCitationAnalysis">https://github.com/napsternxg/PubMed_SelfCitationAnalysis</a>
**Note: The dataset is based on a snapshot of PubMed (which includes Medline and PubMed-not-Medline records) taken in the first week of October, 2016.**
Check <a href="https://www.nlm.nih.gov/databases/download/pubmed_medline.html">here</a> for information on obtaining PubMed/MEDLINE, and for NLM's data Terms and Conditions
Additional data related updates can be found at <a href="http://abel.ischool.illinois.edu">Torvik Research Group</a>
## Acknowledgments
This work was made possible in part with funding to VIT from <a href="https://projectreporter.nih.gov/project_info_description.cfm?aid=8475017&icde=18058490">NIH grant P01AG039347</a> and <a href="http://www.nsf.gov/awardsearch/showAward?AWD_ID=1348742">NSF grant 1348742</a>. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
## License
Self-citation analysis data based on PubMed Central subset (2002-2005) by Shubhanshu Mishra, Brent D. Fegley, Jana Diesner, and Vetle Torvik is licensed under a Creative Commons Attribution 4.0 International License.
Permissions beyond the scope of this license may be available at <a href="https://github.com/napsternxg/PubMed_SelfCitationAnalysis">https://github.com/napsternxg/PubMed_SelfCitationAnalysis</a>.
keywords:
Self citation; PubMed Central; Data Analysis; Citation Data;
published:
2018-04-19
Torvik, Vetle I.; Smalheiser, Neil R.
(2018)
Author-ity 2009 baseline dataset. Prepared by Vetle Torvik 2009-12-03
The dataset comes in the form of 18 compressed (.gz) linux text files named authority2009.part00.gz - authority2009.part17.gz. The total size should be ~17.4GB uncompressed.
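As a rough sketch, the 18 part files can be enumerated and streamed without decompressing them to disk. The text encoding is not stated in the description, so UTF-8 with replacement of undecodable bytes is an assumption here; `part_names` and `iter_lines` are hypothetical helper names.

```python
import gzip


def part_names(n_parts=18):
    """Return authority2009.part00.gz ... authority2009.part17.gz."""
    return [f"authority2009.part{i:02d}.gz" for i in range(n_parts)]


def iter_lines(paths):
    """Yield text lines from each gzip file in turn, newline stripped."""
    for path in paths:
        # Encoding is an assumption; errors="replace" avoids hard failures.
        with gzip.open(path, "rt", encoding="utf-8", errors="replace") as handle:
            for line in handle:
                yield line.rstrip("\n")
```

Streaming this way keeps memory use flat even though the uncompressed total is about 17.4GB.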
• How was the dataset created?
The dataset is based on a snapshot of PubMed (which includes Medline and PubMed-not-Medline records) taken in July 2009. A total of 19,011,985 Article records and 61,658,514 author name instances. Each instance of an author name is uniquely represented by the PMID and the position on the paper (e.g., 10786286_3 is the third author name on PMID 10786286). Thus, each cluster is represented by a collection of author name instances. The instances were first grouped into "blocks" by last name and first name initial (including some close variants), and then each block was separately subjected to clustering. Details are described in
<i>Torvik, V., & Smalheiser, N. (2009). Author name disambiguation in MEDLINE. ACM Transactions On Knowledge Discovery From Data, 3(3), doi:10.1145/1552303.1552304</i>
<i>Torvik, V. I., Weeber, M., Swanson, D. R., & Smalheiser, N. R. (2005). A Probabilistic Similarity Metric for Medline Records: A Model for Author Name Disambiguation. Journal Of The American Society For Information Science & Technology, 56(2), 140-158. doi:10.1002/asi.20105</i>
Note that for Author-ity 2009, some new predictive features (e.g., grants, citation matches, temporal, affiliation phrases) and a post-processing merging procedure were applied (to capture name variants not captured during blocking, e.g., matches for subsets of compound last names, and nicknames with a different first initial like Bill and William), and a temporal feature was used. This has not yet been written up for publication.
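The author name instance identifiers described above (PMID plus position on the paper, e.g., 10786286_3) can be split into their two parts as in the following sketch; the names `NameInstance` and `parse_instance` are hypothetical.

```python
from typing import NamedTuple


class NameInstance(NamedTuple):
    pmid: int      # PubMed identifier of the paper
    position: int  # 1-based position of the author name on the paper


def parse_instance(instance_id: str) -> NameInstance:
    """Split an identifier such as '10786286_3' into PMID and position."""
    pmid, position = instance_id.rsplit("_", 1)
    return NameInstance(int(pmid), int(position))
```

Splitting from the right on the last underscore keeps the parser tolerant of any prefix that itself contains underscores, although a prefix would still need to be stripped before converting the PMID to an integer.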
• How accurate is the 2009 dataset (compared to 2006 and 2008)?
The recall reported for 2006 of 98.8% has been much improved in 2009 (because common last name variants are now captured). Compared to 2006, both years 2008 and 2009 overall seem to exhibit a higher rate of splitting errors but a lower rate of lumping errors. This reflects an overall decrease in prior probabilities, possibly because of e.g. a) a new prior estimation procedure that avoids wild estimates (by dampening the magnitude of iterative changes); b) 2008 and 2009 included items in PubMed-not-Medline (including in-process items); and c) the dramatic (exponential) increase in frequencies of some names (J. Lee went from ~16,000 occurrences in 2006 to 26,000 in 2009). However, splitting is reduced in 2009 for some special cases like NIH-funded investigators who list their grant number on their papers. Compared to 2008, splitting errors were reduced overall in 2009 while maintaining the same level of lumping errors.
• What is the format of the dataset?
The cluster summaries for 2009 are much more extensive than in the 2008 dataset. Each line corresponds to a predicted author-individual represented by a cluster of author name instances and a summary of all the corresponding papers and author name variants (and if there are > 10 papers in the cluster, an identical summary of the 10 most recent papers). Each cluster has a unique Author ID (which is uniquely identified by the PMID of the earliest paper in the cluster and the author name position). The summary has the following tab-delimited fields:
1. blocks separated by '||'; each block may consist of multiple lastname-first initial variants separated by '|'
2. prior probabilities of the respective blocks separated by '|'
3. Cluster number relative to the block ordered by cluster size (some are listed as 'CLUSTER X' when they were derived from multiple blocks)
4. Author ID (or cluster ID) e.g., bass_c_9731334_2 represents a cluster where 9731334_2 is the earliest author name instance. Although not needed for uniqueness, the id also has the most frequent lastname_firstinitial (lowercased).
5. cluster size (number of author name instances on papers)
6. name variants separated by '|' with counts in parenthesis. Each variant is of the format lastname_firstname middleinitial, suffix
7. last name variants separated by '|'
8. first name variants separated by '|'
9. middle initial variants separated by '|' ('-' if none)
10. suffix variants separated by '|' ('-' if none)
11. email addresses separated by '|' ('-' if none)
12. range of years (e.g., 1997-2009)
13. Top 20 most frequent affiliation words (after stoplisting and tokenizing; some phrases are also made) with counts in parenthesis; separated by '|'; ('-' if none)
14. Top 20 most frequent MeSH (after stoplisting) with counts in parenthesis; separated by '|'; ('-' if none)
15. Journals with counts in parenthesis (separated by "|")
16. Top 20 most frequent title words (after stoplisting and tokenizing) with counts in parenthesis; separated by '|'; ('-' if none)
17. Co-author names (lowercased lastname and first/middle initials) with counts in parenthesis; separated by '|'; ('-' if none)
18. Co-author IDs with counts in parenthesis; separated by '|'; ('-' if none)
19. Author name instances (PMID_auno separated by '|')
20. Grant IDs (after normalization; "-" if none given; separated by "|")
21. Total number of times cited. (Citations are based on references extracted from PMC).
22. h-index
23. Citation counts (e.g., for h-index): PMIDs by the author that have been cited (with total citation counts in parenthesis); separated by "|"
24. Cited: PMIDs that the author cited (with counts in parenthesis) separated by "|"
25. Cited-by: PMIDs that cited the author (with counts in parenthesis) separated by "|"
26-47. same summary as for 4-25 except that the 10 most recent papers were used (based on year; so if paper 10, 11, 12... have the same year, one is selected arbitrarily)
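The field layout above can be read with a few small helpers. The following is a minimal sketch, not the authors' own tooling: it assumes that counted lists look like `token(count)|token(count)`, that '-' marks an empty field, and that Author IDs end in `_PMID_position` as in the example bass_c_9731334_2. The function names are hypothetical.

```python
import re

# One "token(count)" item; the token itself may contain parentheses,
# so the count is taken from the last parenthesised number.
_COUNTED = re.compile(r"^(?P<token>.*)\((?P<count>\d+)\)$")


def parse_counted_list(field):
    """Turn 'a(3)|b(1)' into {'a': 3, 'b': 1}; '-' or '' gives {}."""
    if not field or field == "-":
        return {}
    counts = {}
    for item in field.split("|"):
        match = _COUNTED.match(item.strip())
        if match:  # items that do not follow the assumed pattern are skipped
            counts[match.group("token").strip()] = int(match.group("count"))
    return counts


def parse_author_id(author_id):
    """Split 'bass_c_9731334_2' into name key, PMID, and author position."""
    name_key, pmid, position = author_id.rsplit("_", 2)
    return {"name_key": name_key, "pmid": int(pmid), "position": int(position)}


def h_index(citation_counts):
    """Largest h such that h papers have at least h citations each."""
    ranked = sorted(citation_counts, reverse=True)
    return sum(1 for rank, cites in enumerate(ranked, start=1) if cites >= rank)
```

A line of the file would first be split with `line.rstrip("\n").split("\t")`; field 4 (index 3) can then be passed to `parse_author_id`, and the counts from field 23 (index 22) to `h_index` to compare against the h-index reported in field 22.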
keywords:
Bibliographic databases; Name disambiguation; MEDLINE; Library information networks
published:
2017-06-16
Haselhorst, Derek S; Tcheng, David K. ; Moreno, J. Enrique ; Punyasena, Surangi W.
(2017)
Table S1. Pollen types identified in the BCI and PNSL pollen rain data sets. Pollen types were identified to species when possible and assigned a life form based on descriptions provided in Croat, T.B. (1978). Taxa from BCI and PNSL were assigned a 1 if present in forest census data or a 0 if absent. The relative representation of each taxon has been provided for each extended record and by dry and wet season representation respectively. CA loadings are provided for axes 1 and 2 (Fig. 1).
keywords:
pollen; identifications; abundance; data; BCI; PNSL; Panama
published:
2018-04-23
Mishra, Shubhanshu; Torvik, Vetle I.
(2018)
Conceptual novelty analysis data based on PubMed Medical Subject Headings
----------------------------------------------------------------------
Created by Shubhanshu Mishra, and Vetle I. Torvik on April 16th, 2018
## Introduction
This is a dataset created as part of the publication titled: Mishra S, Torvik VI. Quantifying Conceptual Novelty in the Biomedical Literature. D-Lib magazine : the magazine of the Digital Library Forum. 2016;22(9-10):10.1045/september2016-mishra.
It contains final data generated as part of our experiments based on MEDLINE 2015 baseline and MeSH tree from 2015.
The dataset is distributed in the form of the following tab separated text files:
* PubMed2015_NoveltyData.tsv - Novelty scores for each paper in PubMed. The file contains 22,349,417 rows and 6 columns, as follows:
- PMID: PubMed ID
- Year: year of publication
- TimeNovelty: time novelty score of the paper based on individual concepts (see paper)
- VolumeNovelty: volume novelty score of the paper based on individual concepts (see paper)
- PairTimeNovelty: time novelty score of the paper based on pair of concepts (see paper)
- PairVolumeNovelty: volume novelty score of the paper based on pair of concepts (see paper)
* mesh_scores.tsv - Temporal profiles for each MeSH term for all years. The file contains 1,102,831 rows and 5 columns, as follows:
- MeshTerm: Name of the MeSH term
- Year: year
- AbsVal: Total publications with that MeSH term in the given year
- TimeNovelty: age (in years since first publication) of MeSH term in the given year
- VolumeNovelty: age (in number of papers since first publication) of MeSH term in the given year
* meshpair_scores.txt.gz (36 GB uncompressed) - Temporal profiles for each pair of MeSH terms for all years
- Mesh1: Name of the first MeSH term (alphabetically sorted)
- Mesh2: Name of the second MeSH term (alphabetically sorted)
- Year: year
- AbsVal: Total publications with that MeSH pair in the given year
- TimeNovelty: age (in years since first publication) of MeSH pair in the given year
- VolumeNovelty: age (in number of papers since first publication) of MeSH pair in the given year
* README.txt file
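Because meshpair_scores.txt.gz is 36 GB uncompressed, the files are best read as a stream rather than loaded whole. The sketch below shows one way to do that with the Python standard library. It assumes UTF-8 text and that a header row, if present, matches the column names listed above; the helper names are hypothetical.

```python
import csv
import gzip
import io

NOVELTY_COLUMNS = ["PMID", "Year", "TimeNovelty", "VolumeNovelty",
                   "PairTimeNovelty", "PairVolumeNovelty"]


def open_maybe_gzip(path):
    """Open a plain or gzip-compressed text file for streaming reads."""
    if str(path).endswith(".gz"):
        return gzip.open(path, "rt", encoding="utf-8", newline="")
    return open(path, "r", encoding="utf-8", newline="")


def iter_tsv_rows(text_stream, columns):
    """Yield one dict per tab-separated row, skipping blank and header rows."""
    reader = csv.reader(text_stream, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if not row or row == columns:
            continue
        yield dict(zip(columns, row))  # values stay as strings


# Small in-memory example standing in for PubMed2015_NoveltyData.tsv.
sample = io.StringIO("PMID\tYear\tTimeNovelty\tVolumeNovelty\t"
                     "PairTimeNovelty\tPairVolumeNovelty\n"
                     "123\t1999\t1\t2\t3\t4\n")
rows = list(iter_tsv_rows(sample, NOVELTY_COLUMNS))
```

For a real file, `with open_maybe_gzip(path) as handle:` followed by a loop over `iter_tsv_rows(handle, columns)` keeps only one row in memory at a time.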
## Dataset creation
This dataset was constructed using multiple datasets described in the following locations:
* MEDLINE 2015 baseline: <a href="https://www.nlm.nih.gov/bsd/licensee/2015_stats/baseline_doc.html">https://www.nlm.nih.gov/bsd/licensee/2015_stats/baseline_doc.html</a>
* MeSH tree 2015: <a href="ftp://nlmpubs.nlm.nih.gov/online/mesh/2015/meshtrees/">ftp://nlmpubs.nlm.nih.gov/online/mesh/2015/meshtrees/</a>
* Source code provided at: <a href="https://github.com/napsternxg/Novelty">https://github.com/napsternxg/Novelty</a>
Note: The dataset is based on a snapshot of PubMed (which includes Medline and PubMed-not-Medline records) taken in the first week of October, 2016.
Check <a href="https://www.nlm.nih.gov/databases/download/pubmed_medline.html">here</a> for information on obtaining PubMed/MEDLINE and NLM's data Terms and Conditions.
Additional data related updates can be found at: <a href="http://abel.ischool.illinois.edu">Torvik Research Group</a>
## Acknowledgments
This work was made possible in part with funding to VIT from <a href="https://projectreporter.nih.gov/project_info_description.cfm?aid=8475017&icde=18058490">NIH grant P01AG039347 </a> and <a href="http://www.nsf.gov/awardsearch/showAward?AWD_ID=1348742">NSF grant 1348742 </a>. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
## License
Conceptual novelty analysis data based on PubMed Medical Subject Headings by Shubhanshu Mishra, and Vetle Torvik is licensed under a Creative Commons Attribution 4.0 International License.
Permissions beyond the scope of this license may be available at <a href="https://github.com/napsternxg/Novelty">https://github.com/napsternxg/Novelty</a>
keywords:
Conceptual novelty; bibliometrics; PubMed; MEDLINE; MeSH; Medical Subject Headings; Analysis;
published:
2022-01-01
Cao, Yanghui; Dietrich, Christopher H.
(2022)
The file “Fla.fasta”, comprising 10526 positions, is the concatenated amino acid alignments of 51 orthologues of 182 bacterial strains. It was used for the maximum likelihood and maximum parsimony analyses of Flavobacteriales. Bacterial species names and strains were used as the sequence names; host names of insect endosymbionts are shown in brackets. The file “16S.fasta” is the alignment of 233 bacterial 16S rRNA sequences. It contains 1455 positions and was used for the maximum likelihood analysis of flavobacterial insect endosymbionts. The names of endosymbiont strains were replaced by the name of their hosts. In addition to the species names, National Center for Biotechnology Information (NCBI) accession numbers are also indicated in the sequence names (e.g., sequence “Cicadellidae_Deltocephalinae_Macrostelini_Macrosteles_striifrons_AB795320” is the 16S rRNA of Macrosteles striifrons (Cicadellidae: Deltocephalinae: Macrostelini) with the NCBI accession number AB795320). The file “Sulcia_pep.fasta” is the concatenated amino acid alignments of 131 orthologues of “Candidatus Sulcia muelleri” (Sulcia). It contains 41970 positions and presents 101 Sulcia strains and 3 Blattabacterium strains. This file was used for the maximum likelihood analysis of Sulcia. The file “Sulcia_nucleotide.fasta” is the concatenated nucleotide alignment corresponding to the sequences in “Sulcia_pep.fasta” but also comprises the alignment of 16S rRNA. It has 127339 positions and was used for the maximum likelihood and maximum parsimony analyses of Sulcia. Individual gene alignments (16S rRNA and 131 orthologues of Sulcia and Blattabacterium) are deposited in the compressed file “individual_gene_alignments.zip”; these were used to construct gene trees for multispecies coalescent analysis. The names of Sulcia strains were replaced by the name of their hosts in “Sulcia_pep.fasta”, “Sulcia_nucleotide.fasta” and the files in “individual_gene_alignments.zip”.
In all the alignment files, gaps are indicated by “-”.
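As a quick sanity check after download, an alignment can be read and its stated number of positions confirmed. The following is a minimal sketch using only the Python standard library; it assumes standard FASTA layout (a ">" header line followed by one or more sequence lines), and the function names are hypothetical.

```python
def read_fasta(lines):
    """Return {sequence name: sequence} from FASTA-formatted lines."""
    parts_by_name = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:]
            parts_by_name[name] = []
        elif name is not None:
            parts_by_name[name].append(line)
    return {n: "".join(parts) for n, parts in parts_by_name.items()}


def gap_fraction(sequence):
    """Fraction of alignment positions that are gaps ('-')."""
    return sequence.count("-") / len(sequence) if sequence else 0.0


# Tiny made-up alignment, used only to illustrate the calls.
example = [">a", "AC-G", "T-", ">b", "ACGGTA"]
alignment = read_fasta(example)
lengths = {len(seq) for seq in alignment.values()}
```

In a well-formed alignment every sequence has the same length, so `lengths` should contain a single value (for example 1455 for “16S.fasta”, according to the description above).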
keywords:
endosymbiont, “Candidatus Sulcia muelleri”, Auchenorrhyncha, coevolution
published:
2024-08-24
Jones, Todd; Llamas, Alfredo; Phillips, Jennifer
(2024)
Dataset associated with Jones et al. GCB-23-1273.R1 submission: Phenotypic signatures of urbanization? Resident, but not migratory, songbird eye size varies with urban-associated light pollution levels. Excel CSV file with all of the data used in analyses and file with descriptions of each column.
keywords:
body size; demographics; eye size; phenotypic divergence; songbirds; sensory pollution; urbanization
published:
2023-12-18
Edmonds, Devin; Adamovicz, Laura; Allender, Matthew; Colton, Andrea; Randy, Nyboer; Michael, Dreslik
(2023)
We conducted long-term capture-mark-recapture surveys on two isolated ornate box turtle (Terrapene ornata) populations in northern Illinois, USA. This dataset provides the capture history strings and additional demographic information used for estimating population vital rates with robust design capture-mark-recapture models. The vital rates were then used in a stage-based population projection matrix model for each population.
keywords:
demography; capture-mark-recapture; vital rates; conservation; wildlife ecology
published:
2011-09-20
Swenson, M. Shel; Suri, Rahul; Linder, C. Randal; Warnow, Tandy; Nguyen, Nam-phuong; Mirarab, Siavash; Neves, Diogo Telmo; Sobral, João Luís; Pingali, Keshav; Nelesen, Serita; Liu, Kevin; Wang, Li-San
(2011)
This page provides the data for SuperFine, DACTAL, and BeeTLe publications.
- Swenson, M. Shel, et al. "SuperFine: fast and accurate supertree estimation." Systematic biology 61.2 (2012): 214.
- Nguyen, Nam, Siavash Mirarab, and Tandy Warnow. "MRL and SuperFine+ MRL: new supertree methods." Algorithms for Molecular Biology 7 (2012): 1-13.
- Neves, Diogo Telmo, et al. "Parallelizing superfine." Proceedings of the 27th Annual ACM Symposium on Applied Computing. 2012.
- Nelesen, Serita, et al. "DACTAL: divide-and-conquer trees (almost) without alignments." Bioinformatics 28.12 (2012): i274-i282.
- Liu, Kevin, and Tandy Warnow. "Treelength optimization for phylogeny estimation." PLoS One 7.3 (2012): e33104.
published:
2017-12-14
Hepler, Katherine C.
(2017)
keywords:
uranium harvesting from seawater; Geospatial analysis; adsorbent performance; NPRE 412
published:
2017-11-14
Miller, Martin; Chung, Soon-Jo; Hutchinson, Seth
(2017)
If you use this dataset, please cite the IJRR data paper (bibtex is below).
We present a dataset collected from a canoe along the Sangamon River in Illinois. The canoe was equipped with a stereo camera, an IMU, and a GPS device, which provide visual data suitable for stereo or monocular applications, inertial measurements, and position data for ground truth. We recorded a canoe trip up and down the river for 44 minutes covering 2.7 km round trip. The dataset adds to those previously recorded in unstructured environments and is unique in that it is recorded on a river, which provides its own set of challenges and constraints that are described in this paper. The data is divided into subsets, which can be downloaded individually.
Video previews are available on Youtube:
https://www.youtube.com/channel/UCOU9e7xxqmL_s4QX6jsGZSw
The information below can also be found in the README files provided in the 527 dataset and each of its subsets. The purpose of this document is to assist researchers in using this dataset.
Images
======
Raw
---
The raw images are stored in the cam0 and cam1 directories in bmp format. They are bayered images that need to be debayered and undistorted before they are used. The camera parameters for these images can be found in camchain-imucam.yaml. Note that the camera intrinsics describe a 1600x1200 resolution image, so the focal length and center pixel coordinates must be scaled by 0.5 before they are used. The distortion coefficients remain the same even for the scaled images. The camera to IMU transformation matrix is also in this file. cam0/ refers to the left camera, and cam1/ refers to the right camera.
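The scaling step described above can be sketched as follows. This is only an illustration: the intrinsic values are made up, and the order [fx, fy, cx, cy] follows the description of the intrinsics field in camchain-imucam.yaml.

```python
def scale_intrinsics(intrinsics, scale):
    """Scale [fx, fy, cx, cy] by a uniform image scale factor.

    Distortion coefficients are left untouched, as noted in the text.
    """
    fx, fy, cx, cy = intrinsics
    return [fx * scale, fy * scale, cx * scale, cy * scale]


# Hypothetical calibration values for a 1600x1200 image (not from the dataset).
calibrated = [1000.0, 1000.0, 800.0, 600.0]
scaled = scale_intrinsics(calibrated, 0.5)  # for the 800x600 raw images
```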
Rectified
---------
Stereo rectified, undistorted, row-aligned, debayered images are stored in the rectified/ directory in the same way as the raw images except that they are in png format. The params.yaml file contains the projection and rotation matrices necessary to use these images. Unlike the intrinsics for the raw images, these parameters do not need to be scaled for resolution.
params.yml
----------
The stereo rectification parameters. R0,R1,P0,P1, and Q correspond to the outputs of the OpenCV stereoRectify function except that 1s and 2s are replaced by 0s and 1s, respectively.
R0: The rectifying rotation matrix of the left camera.
R1: The rectifying rotation matrix of the right camera.
P0: The projection matrix of the left camera.
P1: The projection matrix of the right camera.
Q: Disparity to depth mapping matrix
T_cam_imu: Transformation matrix for a point in the IMU frame to the left camera frame.
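For reference, the Q matrix maps a rectified pixel (x, y) with disparity d to a 3D point through the homogeneous relation [X Y Z W]^T = Q [x y d 1]^T, followed by division by W. This is the convention used by OpenCV's reprojectImageTo3D. The sketch below applies it to a single pixel in plain Python; the Q values are hypothetical (focal length 500 px, principal point (400, 300), baseline 0.2 m) and are not taken from the dataset.

```python
def reproject_pixel(Q, x, y, disparity):
    """Map a rectified pixel and its disparity to a 3D point using a 4x4 Q."""
    vec = (x, y, disparity, 1.0)
    X, Y, Z, W = (sum(q * v for q, v in zip(row, vec)) for row in Q)
    if W == 0:
        raise ValueError("homogeneous coordinate W is zero (point at infinity)")
    return (X / W, Y / W, Z / W)


# Hypothetical Q in the form produced by stereoRectify, with Tx = -0.2 m
# so that the last row holds -1/Tx = 5.0.
Q_example = [
    [1.0, 0.0, 0.0, -400.0],
    [0.0, 1.0, 0.0, -300.0],
    [0.0, 0.0, 0.0, 500.0],
    [0.0, 0.0, 5.0, 0.0],
]
point = reproject_pixel(Q_example, 450.0, 300.0, 10.0)
```

With these made-up numbers the depth agrees with the usual relation Z = f * baseline / d = 500 * 0.2 / 10.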
camchain-imucam.yaml
--------------------
The camera intrinsic and extrinsic parameters and the camera to IMU transformation usable with the raw images.
T_cam_imu: Transformation matrix for a point in the IMU frame to the camera frame.
distortion_coeffs: lens distortion coefficients using the radial tangential model.
intrinsics: focal length x, focal length y, principal point x, principal point y
resolution: resolution of calibration. Scale the intrinsics for use with the raw 800x600 images. The distortion coefficients do not change when the image is scaled.
T_cn_cnm1: Transformation matrix from the right camera to the left camera.
Sensors
-------
Each message type that appears in name.csv is described below.
###rawimus###
time # GPS time in seconds
message name # rawimus
acceleration_z # m/s^2 IMU uses right-forward-up coordinates
-acceleration_y # m/s^2
acceleration_x # m/s^2
angular_rate_z # rad/s IMU uses right-forward-up coordinates
-angular_rate_y # rad/s
angular_rate_x # rad/s
###IMG###
time # GPS time in seconds
message name # IMG
left image filename
right image filename
###inspvas###
time # GPS time in seconds
message name # inspvas
latitude
longitude
altitude # ellipsoidal height WGS84 in meters
north velocity # m/s
east velocity # m/s
up velocity # m/s
roll # right hand rotation about y axis in degrees
pitch # right hand rotation about x axis in degrees
azimuth # left hand rotation about z axis in degrees clockwise from north
###inscovs###
time # GPS time in seconds
message name # inscovs
position covariance # 9 values xx,xy,xz,yx,yy,yz,zx,zy,zz m^2
attitude covariance # 9 values xx,xy,xz,yx,yy,yz,zx,zy,zz deg^2
velocity covariance # 9 values xx,xy,xz,yx,yy,yz,zx,zy,zz (m/s)^2
###bestutm###
time # GPS time in seconds
message name # bestutm
utm zone # numerical zone
utm character # alphabetical zone
northing # m
easting # m
height # m above mean sea level
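The message layouts above can be turned into a small row parser. The following is a sketch under stated assumptions: rows are comma-separated, the first two columns are the time and the message name, and the remaining columns follow the order listed above. The field names (for example neg_acceleration_y for the column listed as -acceleration_y) and function names are hypothetical labels chosen for this example.

```python
import csv
import io

# Payload field names in the listed order, after "time" and "message name".
MESSAGE_FIELDS = {
    "rawimus": ["acceleration_z", "neg_acceleration_y", "acceleration_x",
                "angular_rate_z", "neg_angular_rate_y", "angular_rate_x"],
    "IMG": ["left_image", "right_image"],
    "inspvas": ["latitude", "longitude", "altitude", "north_velocity",
                "east_velocity", "up_velocity", "roll", "pitch", "azimuth"],
    "bestutm": ["utm_zone", "utm_character", "northing", "easting", "height"],
}


def parse_message(row):
    """Convert one CSV row into a dict, or None for unknown message names."""
    cells = [cell.strip() for cell in row]
    message = {"time": float(cells[0]), "message": cells[1]}
    payload = cells[2:]
    if cells[1] == "inscovs":  # three 3x3 covariances, 9 values each
        values = [float(v) for v in payload[:27]]
        message["position_cov"] = values[0:9]
        message["attitude_cov"] = values[9:18]
        message["velocity_cov"] = values[18:27]
        return message
    names = MESSAGE_FIELDS.get(cells[1])
    if names is None:
        return None
    message.update(zip(names, payload))  # payload values stay as strings
    return message


def parse_sensor_csv(text_stream):
    """Yield parsed messages from a CSV stream, skipping unknown rows."""
    for row in csv.reader(text_stream):
        if len(row) >= 2:
            parsed = parse_message(row)
            if parsed is not None:
                yield parsed


# Made-up rows, used only to illustrate the calls.
demo = io.StringIO("100.5,IMG,left.bmp,right.bmp\n100.6,unknown,1,2\n")
messages = list(parse_sensor_csv(demo))
```

Payload values are kept as strings so that image filenames and the UTM zone letter pass through unchanged; numeric fields can be converted with `float` where needed.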
Camera logs
-----------
The files name.cam0 and name.cam1 are text files that correspond to cameras 0 and 1, respectively. The columns are defined by:
unused: The first column is all 1s and can be ignored.
software frame number: This number increments at the end of every iteration of the software loop.
camera frame number: This number is generated by the camera and increments each time the shutter is triggered. The software and camera frame numbers do not have to start at the same value, but if the difference between the initial and final values is not the same, it suggests that frames may have been dropped.
camera timestamp: This is the camera's internal timestamp of the frame capture in units of 100 milliseconds.
PC timestamp: This is the PC time of arrival of the image.
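The dropped-frame check described for the frame numbers can be sketched as below. The example assumes that columns are separated by whitespace or commas and appear in the order listed above; the function name and sample rows are hypothetical.

```python
import re


def frame_number_spans(lines):
    """Return (software_span, camera_span) between the first and last rows.

    A difference between the two spans suggests that frames were dropped.
    """
    rows = [re.split(r"[,\s]+", line.strip()) for line in lines if line.strip()]
    if not rows:
        raise ValueError("camera log is empty")
    first, last = rows[0], rows[-1]
    software_span = int(last[1]) - int(first[1])  # column 2: software frame number
    camera_span = int(last[2]) - int(first[2])    # column 3: camera frame number
    return software_span, camera_span


# Made-up log rows: the camera counter skips a value between rows 2 and 3.
log = ["1 10 100 5000 1.0", "1 11 101 5001 1.1", "1 12 103 5003 1.2"]
software_span, camera_span = frame_number_spans(log)
frames_possibly_dropped = software_span != camera_span
```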
name.kml
--------
The kml file is a mapping file that can be read by software such as Google Earth. It contains the recorded GPS trajectory.
name.unicsv
-----------
This is a csv file of the GPS trajectory in UTM coordinates that can be read by gpsbabel, software for manipulating GPS paths.
@article{doi:10.1177/0278364917751842,
author = {Martin Miller and Soon-Jo Chung and Seth Hutchinson},
title ={The Visual–Inertial Canoe Dataset},
journal = {The International Journal of Robotics Research},
volume = {37},
number = {1},
pages = {13-20},
year = {2018},
doi = {10.1177/0278364917751842},
URL = {https://doi.org/10.1177/0278364917751842},
eprint = {https://doi.org/10.1177/0278364917751842}
}
keywords:
slam;sangamon;river;illinois;canoe;gps;imu;stereo;monocular;vision;inertial