Illinois Data Bank
University Library, University of Illinois at Urbana-Champaign
Displaying datasets 376 - 400 of 550 in total
Subject Area
Life Sciences (292)
Social Sciences (123)
Physical Sciences (78)
Technology and Engineering (49)
Uncategorized (7)
Arts and Humanities (1)
Funder
U.S. National Science Foundation (NSF) (164)
Other (159)
U.S. Department of Energy (DOE) (56)
U.S. National Institutes of Health (NIH) (53)
U.S. Department of Agriculture (USDA) (30)
Illinois Department of Natural Resources (IDNR) (12)
U.S. National Aeronautics and Space Administration (NASA) (5)
U.S. Geological Survey (USGS) (5)
Illinois Department of Transportation (IDOT) (3)
U.S. Army (2)
Publication Year
2022 (111)
2021 (108)
2020 (96)
2019 (72)
2018 (59)
2023 (39)
2017 (35)
2016 (30)
License
CC0 (314)
CC BY (220)
custom (16)
published: 2019-10-18
Smith, Rebecca (2019): Spatial and Temporal Invasion Dynamics of Aedes albopictus (Diptera: Culicidae) in Illinois. University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-7540359_V1
Supporting secondary data used in a manuscript currently in submission regarding the invasion dynamics of the Asian tiger mosquito, Aedes albopictus, in the state of Illinois.
keywords:
albopictus;mosquito
published: 2019-10-15
Choi, Sang Hyun; Rao, Vikyath; Gernat, Tim; Hamilton, Adam; Robinson, Gene; Goldenfeld, Nigel (2019): Honeybee trophallaxis event data for The origin of heavy tails in honeybee and human interaction times. University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-2712449_V1
Filtered trophallaxis interactions for two honeybee colonies, each containing 800 worker bees and one queen. Each colony consists of bees that were administered a juvenile hormone analog, a vehicle treatment, or a sham treatment to determine the effect of colony perturbation on the duration of trophallaxis interactions. Columns one and two give the unique identifiers of the two bees involved in a particular trophallaxis exchange, and columns three and four give the Unix timestamps (in milliseconds) of the beginning and end of the interaction, respectively.
Note: the queen interactions were omitted from the uploaded dataset for reasons that are described in the submitted manuscript. Bees that performed poorly are also omitted from the final dataset.
keywords:
honey bee; trophallaxis; social network
published: 2019-10-03
Choi, Sang Hyun; Rao, Vikyath D.; Gernat, Tim; Hamilton, Adam R.; Robinson, Gene E.; Goldenfeld, Nigel (2019): Honeybee F2F event data for The origin of heavy tails in honeybee and human interaction times. University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-4021786_V1
Dataset of F2F events of honeybees. F2F events are defined as face-to-face encounters of two honeybees that are close together and facing each other but not connected by the proboscis, and thus not engaging in trophallaxis. The first and second columns give the unique IDs of the honeybees participating in each F2F event. The third column gives the time at which the F2F event started, and the fourth column gives the time at which it ended. Each time is a Unix epoch timestamp in milliseconds.
keywords:
honeybee;face-to-face interaction
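The four-column layout described above (two bee IDs, then start and end Unix timestamps in milliseconds) lends itself to a quick duration calculation. The following is a minimal sketch, not the authors' code: it assumes comma-delimited rows with no header line, and the function names `read_events`, `duration_seconds`, and `start_time_utc` are hypothetical. Check the actual delimiter and header of the deposited files before use.

```python
import csv
import io
from datetime import datetime, timezone


def read_events(text, delimiter=","):
    """Parse rows of: bee_id_a, bee_id_b, start_ms, end_ms.

    Assumption: no header row and a single-character delimiter.
    """
    events = []
    for row in csv.reader(io.StringIO(text), delimiter=delimiter):
        if len(row) < 4:
            continue  # skip blank or malformed lines
        bee_a, bee_b = row[0].strip(), row[1].strip()
        start_ms, end_ms = int(row[2]), int(row[3])
        events.append((bee_a, bee_b, start_ms, end_ms))
    return events


def duration_seconds(event):
    """Interaction length in seconds (timestamps are in milliseconds)."""
    _, _, start_ms, end_ms = event
    return (end_ms - start_ms) / 1000.0


def start_time_utc(event):
    """Convert the millisecond start timestamp to a UTC datetime."""
    return datetime.fromtimestamp(event[2] / 1000.0, tz=timezone.utc)


# Made-up sample rows for illustration only; not taken from the dataset.
sample = (
    "101,202,1500000000000,1500000004500\n"
    "303,404,1500000010000,1500000010750\n"
)
events = read_events(sample)
durations = [duration_seconds(e) for e in events]
```

For a real file, the same parser could be fed `open(path).read()`, passing `delimiter="\t"` if the file turns out to be tab-separated.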
published: 2019-07-04
Rapti, Zoi (2019): Control of bacterial infections via antibiotic-induced proviruses. University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-9721455_V1
Software (Matlab .m files) for the article: Lying in Wait: Modeling the Control of Bacterial Infections via Antibiotic-Induced Proviruses. The files can be used to reproduce the analysis and figures in the article.
keywords:
Matlab codes; antibiotic-induced dynamics
published: 2019-09-01
Jackson, Nicole; Konar, Megan; Debaere, Peter; Estes, Lyndon (2019): Data for: Probabilistic global maps of crop-specific areas from 1961 to 2014. University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-7439710_V1
Agriculture has substantial socioeconomic and environmental impacts that vary between crops. However, information on how the spatial distribution of specific crops has changed over time across the globe is relatively sparse. We introduce the Probabilistic Cropland Allocation Model (PCAM), a novel algorithm to estimate where specific crops have likely been grown over time. Specifically, PCAM downscales annual and national-scale data on the crop-specific area harvested of 17 major crops to a global 0.5-degree grid from 1961-2014. The resulting database presented here provides annual global gridded likelihood estimates of crop-specific areas. Both mean and standard deviations of grid cell fractions are available for each of the 17 crops. Each netCDF file contains an individual year of data with an additional variable ("crs") that defines the coordinate reference system used. Our results provide new insights into the likely changes in the spatial distribution of major crops over the past half-century. For additional information, please see the related paper by Jackson et al. (2019) in Environmental Research Letters (https://doi.org/10.1088/1748-9326/ab3b93).
keywords:
global; gridded; probabilistic allocation; crop suitability; agricultural geography; time series
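As a rough orientation for the grid described above, the sketch below works out the dimensions of a global 0.5-degree grid and the number of annual time steps from 1961 to 2014. It is illustrative only: the row and column ordering (rows counted southward from 90°N, columns eastward from 180°W) is an assumption, and the helper names `grid_shape` and `cell_index` are hypothetical. The actual netCDF files may order latitude differently, so consult their coordinate variables.

```python
def grid_shape(resolution_deg):
    """Number of (latitude, longitude) cells on a full global grid."""
    n_lat = round(180 / resolution_deg)
    n_lon = round(360 / resolution_deg)
    return n_lat, n_lon


def cell_index(lat, lon, resolution_deg=0.5):
    """Map a coordinate to a (row, col) grid cell.

    Assumption: row 0 starts at 90 degrees N and col 0 at 180 degrees W.
    """
    n_lat, n_lon = grid_shape(resolution_deg)
    row = min(int((90.0 - lat) / resolution_deg), n_lat - 1)
    col = min(int((lon + 180.0) / resolution_deg), n_lon - 1)
    return row, col


# One time step per year, 1961 through 2014 inclusive.
years = list(range(1961, 2014 + 1))

shape = grid_shape(0.5)              # cells per yearly layer
example = cell_index(40.11, -88.21)  # roughly Urbana-Champaign
```

Under these assumptions a yearly layer has 360 × 720 cells and the series spans 54 years.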
published: 2019-09-25
Wong, Tony; Hughes, A; Tokuda, K; Indebetouw, R; Onishi, T; Bandurski, J. B.; Chen, C. H. R.; Fukui, Y; Glover, S. C. O.; Klessen, R. S.; Pineda, J. L.; Roman-Duval, J.; Sewilo, M.; Wojciechowski, E.; Zahorecz, S. (2019): Data for: Relations Between Molecular Cloud Structure Sizes and Line Widths in the Large Magellanic Cloud. University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-7090706_V1
¹²CO and ¹³CO maps for six molecular clouds in the Large Magellanic Cloud, obtained with the Atacama Large Millimeter/submillimeter Array (ALMA). See the associated article in the Astrophysical Journal, and README files within each ZIP archive. Please cite the article if you use these data.
keywords:
Radio astronomy
published: 2019-09-17
Fraebel, David T.; Kuehn, Seppe (2019): Sequencing data for migration rate selection experiments (0.2% agar, 1mM sugar). University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-2128477_V1
BAM files for evolved strains from migration rate selection experiments conducted in low viscosity (0.2% w/v) agar plates containing M63 minimal medium with 1mM of mannose, melibiose, N-acetylglucosamine or galactose
published: 2019-09-17
Mishra, Shubhanshu (2019): Trained models for multi-task multi-dataset learning for text classification as well as sequence tagging in tweets. University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-1094364_V1
Trained models for multi-task multi-dataset learning for text classification as well as sequence tagging in tweets. Classification tasks include sentiment prediction, abusive content, sarcasm, and veridictality. Sequence tagging tasks include POS, NER, Chunking, and SuperSenseTagging. Models were trained using: https://github.com/socialmediaie/SocialMediaIE/blob/master/SocialMediaIE/scripts/multitask_multidataset_classification_tagging.py See https://github.com/socialmediaie/SocialMediaIE and https://socialmediaie.github.io for details. If you are using this data, please also cite the related article: Shubhanshu Mishra. 2019. Multi-dataset-multi-task Neural Sequence Tagging for Information Extraction from Tweets. In Proceedings of the 30th ACM Conference on Hypertext and Social Media (HT '19). ACM, New York, NY, USA, 283-284. DOI: https://doi.org/10.1145/3342220.3344929
keywords:
twitter; deep learning; machine learning; trained models; multi-task learning; multi-dataset learning; classification; sequence tagging
published: 2019-09-17
Mishra, Shubhanshu (2019): Trained models for multi-task multi-dataset learning for text classification in tweets. University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-1917934_V1
Trained models for multi-task multi-dataset learning for text classification in tweets. Classification tasks include sentiment prediction, abusive content, sarcasm, and veridictality. Models were trained using: https://github.com/socialmediaie/SocialMediaIE/blob/master/SocialMediaIE/scripts/multitask_multidataset_classification.py See https://github.com/socialmediaie/SocialMediaIE and https://socialmediaie.github.io for details. If you are using this data, please also cite the related article: Shubhanshu Mishra. 2019. Multi-dataset-multi-task Neural Sequence Tagging for Information Extraction from Tweets. In Proceedings of the 30th ACM Conference on Hypertext and Social Media (HT '19). ACM, New York, NY, USA, 283-284. DOI: https://doi.org/10.1145/3342220.3344929
keywords:
twitter; deep learning; machine learning; trained models; multi-task learning; multi-dataset learning; sentiment; sarcasm; abusive content;
published: 2019-09-17
Mishra, Shubhanshu (2019): Trained models for multi-task multi-dataset learning for sequence prediction in tweets. University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-0934773_V1
Trained models for multi-task multi-dataset learning for sequence tagging in tweets. Sequence tagging tasks include POS, NER, Chunking, and SuperSenseTagging. Models were trained using: https://github.com/socialmediaie/SocialMediaIE/blob/master/SocialMediaIE/scripts/multitask_multidataset_experiment.py See https://github.com/socialmediaie/SocialMediaIE and https://socialmediaie.github.io for details. If you are using this data, please also cite the related article: Shubhanshu Mishra. 2019. Multi-dataset-multi-task Neural Sequence Tagging for Information Extraction from Tweets. In Proceedings of the 30th ACM Conference on Hypertext and Social Media (HT '19). ACM, New York, NY, USA, 283-284. DOI: https://doi.org/10.1145/3342220.3344929
keywords:
twitter; deep learning; machine learning; trained models; multi-task learning; multi-dataset learning;
published: 2019-09-06
Gallagher, John (2019): NYT vaccine comments. University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-7724021_V1
This is a dataset of 1101 comments from The New York Times (May 1, 2015-August 31, 2015) that contain a mention of the stemmed words vaccine or vaxx.
keywords:
vaccine;online comments
published: 2019-09-05
Yang, Ning; Gao, Jiarong; Lewis, Fred; Yau, Peter; Collins, James; Sweedler, Jonathan; Newmark, Phillip (2019): Data for A novel rotifer derived alkaloid paralyzes schistosome larvae and prevents infection. University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-1599850_V1
This dataset includes data from the NMR, LC-MS/MS, MALDI-MS, and H/D exchange MS experiments used in the paper "A novel rotifer derived alkaloid paralyzes schistosome larvae and prevents infection".
published: 2019-08-29
de Moya, Robert (2019): Bemisia tabaci ortholog set. University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-5333299_V1
This is the published ortholog set derived from whole genome data used for the analysis of members of the B. tabaci complex of whiteflies. It includes the concatenated alignment and individual gene alignments used for analyses (Link to publication: https://www.mdpi.com/1424-2818/11/9/151).
published: 2019-07-04
Sashittal, Palash; El-Kebir, Mohammed (2019): SharpTNI Results. University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-9734610_V1
Results generated using SharpTNI on data collected from the 2014 Ebola outbreak in Sierra Leone.
published: 2019-07-29
Christensen, Sarah; Molloy, Erin K.; Vachaspati, Pranjal; Warnow, Tandy (2019): Data from TRACTION: Fast non-parametric improvement of estimated gene trees. University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-1747658_V1
Datasets used in the study, "TRACTION: Fast non-parametric improvement of estimated gene trees," accepted at the Workshop on Algorithms in Bioinformatics (WABI) 2019.
keywords:
Gene tree correction; horizontal gene transfer; incomplete lineage sorting
published: 2019-08-15
Smith, Rebecca (2019): Mastitis risk effect on the economic consequences of paratuberculosis control in dairy cattle: A stochastic modeling study. University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-7539223_V1
Simulation data related to the paper "Mastitis risk effect on the economic consequences of paratuberculosis control in dairy cattle: A stochastic modeling study"
keywords:
paratuberculosis;simulation;dairy
published: 2019-08-13
Nowak, Jennifer E.; Sweet, Andrew D.; Weckstein, Jason D.; Johnson, Kevin P. (2019): Data for: A molecular phylogenetic analysis of the genera of fruit doves and their allies using dense taxonomic sampling. University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-9797270_V1
Multiple sequence alignments from concatenated nuclear and mitochondrial genes and resulting phylogenetic tree files of fruit doves and their close relatives. Files include: BEAST input XML file (fruit_dove_beast_input.xml); a maximum clade credibility tree from a BEAST analysis (fruit_dove_beast_mcc.tre); concatenated multiple sequence alignment NEXUS files for the novel dataset (fruit_dove_concatenated_alignment.nex, 76 taxa, 4,277 characters) and the dataset with additional sequences (fruit_dove_plus_cibois_data_concatenated_alignment.nex, 204 taxa, 4,277 characters), both of which contain a MrBayes block including partition information; and 50% majority-rule consensus trees generated from MrBayes analyses, using the NEXUS alignment files as inputs (fruit_dove_mrbayes_consensus.tre, fruit_dove_plus_cibois_data_mrbayes_consensus.tre).
keywords:
fruit doves; multiple sequence alignment; phylogeny; Aves: Columbidae
published: 2019-08-30
Allen, Maximilian (2019): Wisconsin Bobcat Harvest Data. University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-2501832_V1
This dataset includes the data from an analysis of bobcat harvest data with particular focus on the relationship between catch-per-unit-effort and population size. The data relate to bobcat trapper and hunter harvest metrics from Wisconsin and include two RDS files, which can be opened in R using the readRDS() function.
keywords:
bobcat; catch-per-unit-effort; CPUE; harvest; Lynx rufus; wildlife management; trapper; hunter
published: 2019-08-05
Skinner, Rachel; Dietrich, Christopher; Walden, Kimberly; Gordon, Eric; Sweet, Andrew; Podsiadlowski, Lars; Petersen, Malte; Simon, Chris; Takiya, Daniela; Johnson, Kevin (2019): Data for Phylogenomics of Auchenorrhyncha (Insecta: Hemiptera) using Transcriptomes: Examining Controversial Relationships via Degeneracy Coding and Interrogation of Gene Conflict. University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-1461292_V1
The data in this directory correspond to: Skinner, R.K., Dietrich, C.H., Walden, K.K.O., Gordon, E., Sweet, A.D., Podsiadlowski, L., Petersen, M., Simon, C., Takiya, D.M., and Johnson, K.P. Phylogenomics of Auchenorrhyncha (Insecta: Hemiptera) using Transcriptomes: Examining Controversial Relationships via Degeneracy Coding and Interrogation of Gene Conflict. Systematic Entomology.
Correspondence should be directed to: Rachel K. Skinner, rskinn2@illinois.edu
If you use these data, please cite our paper in Systematic Entomology.
The following files can be found in this dataset:
Amino_acid_concatenated_alignment.phy: the amino acid alignment used in this analysis, in phylip format.
Amino_acid_raxml_partitions.txt (for reference only): the partitions for the amino acid alignment; a partitioned amino acid analysis was not performed in this study.
Amino_acid_concatenated_tree.newick: the best maximum likelihood tree with bootstrap values, in newick format.
ASTRAL_input_gene_trees.tre: the concatenated gene tree input file for ASTRAL.
README_pie_charts.md: explains the scripts and data needed to recreate the pie charts figure from our paper. It corresponds to the following files:
ASTRAL_species_tree_EN_only.newick: the species tree with only the effective number (EN) annotation.
ASTRAL_species_tree_pp1_only.newick: the species tree with only the posterior probability 1 (main topology) annotation.
ASTRAL_species_tree_q1_only.newick: the species tree with only the quartet scores for the main topology (q1).
ASTRAL_species_tree_q2_only.newick: the species tree with only the quartet scores for the first alternative topology (q2).
ASTRAL_species_tree_q3_only.newick: the species tree with only the quartet scores for the second alternative topology (q3).
print_node_key_files.py: script needed to create the following files:
node_keys.key: text file with node IDs and topologies.
complete_q_scores.key: text file with node IDs and multiplied q scores.
EN_node_vals.key: text file with node IDs and EN values.
create_pie_charts_tree.py: script needed to visualize the tree with pie charts, pp1, and EN values plotted at nodes.
ASTRAL_species_tree_full_annotation.newick: the species tree with full annotation from the ASTRAL analysis. NOTE: It may be more useful to examine individual value files if you want to visualize the tree, e.g., in figtree, since the full annotations are extensive and can make viewing difficult.
Complete_NT_concatenated_alignment.phy: the nucleotide alignment that includes unmodified third codon positions, in phylip format.
Complete_NT_raxml_partitions.txt: the raxml-style partition file of the nucleotide partitions.
Complete_NT_concatenated_tree.newick: the best maximum likelihood tree from the concatenated complete NT analysis with bootstrap values, in newick format.
Complete_NT_partitioned_tree.newick: the best maximum likelihood tree from the partitioned complete NT analysis with bootstrap values, in newick format.
Degeneracy_coded_nt_concatenated_alignment.phy: the degeneracy coded nucleotide alignment, in phylip format.
Degeneracy_coded_nt_raxml_partitions.txt: the raxml-style partition file for the degeneracy coded nucleotide alignment.
Degeneracy_coded_nt_concatenated_tree.newick: the best maximum likelihood tree from the degeneracy-coded concatenated analysis with bootstrap values, in newick format.
Degeneracy_coded_nt_partitioned_tree.newick: the best maximum likelihood tree from the degeneracy-coded partitioned analysis with bootstrap values, in newick format.
count_ingroup_taxa.py: script that counts the number of ingroup and/or outgroup taxa present in an alignment.
keywords:
Auchenorrhyncha; Hemiptera; alignment; trees
published: 2019-07-27
Clark, Lindsay V.; Dwiyanti, Maria Stefanie; Anzoua, Kossonou G.; Brummer, Joe E.; Glowacka, Katarzyna; Hall, Megan; Heo, Kweon; Jin, Xiaoli; Lipka, Alexander E.; Peng, Junhua; Yamada, Toshihiko; Yoo, Ji Hye; Yu, Chang Yeon; Zhao, Hua; Long, Stephen P.; Sacks, Erik J. (2019): RAD-seq genotypes for a Miscanthus sinensis diversity panel. University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-1402948_V1
Genotype calls are provided for a collection of 583 Miscanthus sinensis clones across 1,108,836 loci mapped to version 7 of the Miscanthus sinensis reference genome. Sequence and alignment information for all unique RAD tags is also provided to facilitate cross-referencing to other genomes.
keywords:
variant call format (VCF); sequence alignment/map format (SAM); miscanthus; single nucleotide polymorphism (SNP); restriction site-associated DNA sequencing (RAD-seq); bioenergy; grass
published: 2019-07-26
Buckles, Brittany J; Harmon-Threatt, Alexandra (2019): Data files for "Bee diversity in tallgrass prairies affected by management and its effects on above‐ and below‐ground resources". University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-0016089_V2
Data used in the paper published in the Journal of Applied Ecology titled "Bee diversity in tallgrass prairies affected by management and its effects on above- and below-ground resources".
The Bee Community file contains information on bees sampled at each site. The first column contains the tallgrass prairie sites sampled; each additional column contains a bee species name in the first row, followed by the individuals recorded.
The Plant Community file contains information on plants sampled at each site. The first column contains the tallgrass prairie sites sampled; each additional column contains a plant species name in the first row, followed by the individuals recorded.
The Soil PC1 file contains the soil PC1 values used in the analyses. The first column contains the tallgrass prairie sites sampled, and the second column contains the calculated soil PC1 values.
keywords:
bee; community; tallgrass prairie; grazing
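A site-by-species table like the community files described above (sites in the first column, one column per species) can be summarized per site. The sketch below is an illustration under assumptions, not the analysis used in the paper: it assumes a comma-separated file whose cells hold integer counts of individuals, and it computes total abundance, species richness, and the Shannon index as example summaries. The function name `site_diversity` and the sample rows are hypothetical.

```python
import csv
import io
import math


def site_diversity(text):
    """Summarize a site-by-species count table.

    Assumptions: comma-separated text, header row of species names,
    first column is the site name, cells are integer counts.
    """
    reader = csv.reader(io.StringIO(text))
    next(reader)  # skip the header row (site label + species names)
    results = {}
    for row in reader:
        if not row:
            continue
        site = row[0]
        counts = [int(v) if v.strip() else 0 for v in row[1:]]
        total = sum(counts)
        richness = sum(1 for c in counts if c > 0)
        # Shannon index: H = -sum(p_i * ln p_i) over species present.
        shannon = 0.0
        if total > 0:
            shannon = -sum(
                (c / total) * math.log(c / total) for c in counts if c > 0
            )
        results[site] = {
            "abundance": total,
            "richness": richness,
            "shannon": shannon,
        }
    return results


# Made-up example table; species and counts are not from the dataset.
sample = (
    "Site,Species_A,Species_B,Species_C\n"
    "SiteA,2,2,0\n"
    "SiteB,5,0,0\n"
)
summary = site_diversity(sample)
```

In this made-up table, SiteA has two equally abundant species, so its Shannon index equals ln 2, while the single-species SiteB scores zero.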
published: 2019-07-08
Kehoe, Adam K.; Torvik, Vetle I. (2019): Datasets from "Predicting Controlled Vocabulary Based on Text and Citations: Case Studies in Medical Subject Headings in MEDLINE and Patents". University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-8020612_V1
# Overview

These datasets were created in conjunction with the dissertation "Predicting Controlled Vocabulary Based on Text and Citations: Case Studies in Medical Subject Headings in MEDLINE and Patents," by Adam Kehoe. The datasets consist of the following:

* twin_not_abstract_matched_complete.tsv: a tab-delimited file consisting of pairs of MEDLINE articles with identical titles, authors and years of publication. This file contains the PMIDs of the duplicate publications, as well as their medical subject headings (MeSH) and three measures of their indexing consistency.
* twin_abstract_matched_complete.tsv: the same as above, except that the MEDLINE articles also have matching abstracts.
* mesh_training_data.csv: a comma-separated file containing the training data for the model discussed in the dissertation.
* mesh_scores.tsv: a tab-delimited file containing a pairwise similarity score based on word embeddings, and MeSH hierarchy relationship.

## Duplicate MEDLINE Publications

Both twin_not_abstract_matched_complete.tsv and twin_abstract_matched_complete.tsv have the same structure. They have the following columns:

1. pmid_one: the PubMed unique identifier of the first paper
2. pmid_two: the PubMed unique identifier of the second paper
3. mesh_one: a list of medical subject headings (MeSH) from the first paper, delimited by the "|" character
4. mesh_two: a list of medical subject headings from the second paper, delimited by the "|" character
5. hoopers_consistency: the calculation of Hooper's consistency between the MeSH of the first and second paper
6. nonhierarchicalfree: a word embedding based consistency score described in the dissertation
7. hierarchicalfree: a word embedding based consistency score additionally limited by the MeSH hierarchy, described in the dissertation

## MeSH Training Data

The mesh_training_data.csv file contains the training data for the model discussed in the dissertation. It has the following columns:

1. pmid: the PubMed unique identifier of the paper
2. term: a candidate MeSH term
3. cit_count: the log of the frequency of the term in the citation candidate set
4. total_cit: the log of the total number of the paper's citations
5. citr_count: the log of the frequency of the term in the citations of the paper's citations
6. total_citofcit: the log of the total number of the citations of the paper's citations
7. absim_count: the log of the frequency of the term in the AbSim candidate set
8. total_absim_count: the log of the total number of AbSim records for the paper
9. absimr_count: the log of the frequency of the term in the citations of the AbSim records
10. total_absimr_count: the log of the total number of citations of the AbSim record
11. log_medline_frequency: the log of the frequency of the candidate term in MEDLINE
12. relevance: a binary indicator (True/False) of whether the candidate term was assigned to the target paper

## Cosine Similarity

The mesh_scores.tsv file contains a pairwise list of all MeSH terms including their cosine similarity based on the word embedding described in the dissertation. Because the MeSH hierarchy is also used in many of the evaluation measures, the relationship of the term pair is also included. It has the following columns:

1. mesh_one: a string of the first MeSH heading
2. mesh_two: a string of the second MeSH heading
3. cosine_similarity: the cosine similarity between the terms
4. relationship_type: a string identifying the relationship type, consisting of none, parent/child, sibling, ancestor and direct (terms are identical, i.e. a direct hierarchy match)

The mesh_model.bin file is a binary word2vec C format file containing the MeSH term embeddings. It was generated using version 3.7.2 of the Python gensim library (https://radimrehurek.com/gensim/). For an example of how to load the model file, see https://radimrehurek.com/gensim/models/word2vec.html#usage-examples, specifically the directions for loading the "word2vec C format."
keywords:
MEDLINE;MeSH;Medical Subject Headings;Indexing
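The mesh_one and mesh_two columns described above are "|"-delimited heading lists, and hoopers_consistency compares them. As a hedged illustration, the sketch below uses the standard definition of Hooper's consistency, the number of terms in common divided by the number of distinct terms used by either indexer. The dissertation's exact computation (for example, its handling of case or subheadings) may differ, and the function names and sample headings are hypothetical.

```python
def parse_mesh(field):
    """Split a '|'-delimited MeSH field into a set of headings."""
    return {term.strip() for term in field.split("|") if term.strip()}


def hoopers_consistency(mesh_one, mesh_two):
    """Hooper's consistency: shared terms / all distinct terms.

    Equivalent to A / (A + M + N), where A is the number of terms in
    common and M, N are the terms used by only one of the two indexers.
    """
    a, b = parse_mesh(mesh_one), parse_mesh(mesh_two)
    union = a | b
    if not union:
        return 0.0  # assumption: define empty-vs-empty as 0
    return len(a & b) / len(union)


# Made-up example headings; not rows from the dataset.
score = hoopers_consistency(
    "Humans|Female|Neoplasms",
    "Humans|Neoplasms|Adult|Male",
)
```

Here two of the five distinct headings are shared, giving a consistency of 0.4.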
published: 2019-07-08
Mishra, Shubhanshu (2019): Wikipedia category embeddings - Node2Vec, Poincare, Elmo. University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-4551278_V1
Wikipedia category tree embeddings based on the Wikipedia SQL dump dated 2017-09-20 (https://archive.org/download/enwiki-20170920), created using the following algorithms:
* Node2vec
* Poincare embedding
* Elmo model on the category title

The following files are present:
* wiki_cat_elmo.txt.gz (15G) - Elmo embeddings. Format: category_name (space replaced with "_") <tab> 300 dim space separated embedding.
* wiki_cat_elmo.txt.w2v.gz (15G) - Elmo embeddings. Format: word2vec format, which can be loaded using Gensim Word2VecKeyedVector.load_word2vec_format.
* elmo_keyedvectors.tar.gz - Gensim Word2VecKeyedVector format of Elmo embeddings. Nodes are indexed using
* node2vec.tar.gz (3.4G) - Gensim word2vec model which has node2vec embedding for each category identified using the position (starting from 0) in category.txt
* poincare.tar.gz (1.8G) - Gensim poincare embedding model which has poincare embedding for each category identified using the position (starting from 0) in category.txt
* wiki_category_random_walks.txt.gz (1.5G) - Random walks generated by node2vec algorithm (https://github.com/aditya-grover/node2vec/tree/master/node2vec_spark), each category identified using the position (starting from 0) in category.txt
* categories.txt - One category name per line (with spaces). The line number (starting from 0) is used as category ID in many other files.
* category_edges.txt - Category edges based on category names (with spaces). Format: from_category <tab> to_category
* category_edges_ids.txt - Category edges based on category ids, each category identified using the position (starting from 1) in category.txt
* wiki_cats-G.json - NetworkX format of category graph, each category identified using the position (starting from 1) in category.txt

Software used:
* https://github.com/napsternxg/WikiUtils - Processing sql dumps
* https://github.com/napsternxg/node2vec - Generate random walks for node2vec
* https://github.com/RaRe-Technologies/gensim (version 3.4.0) - Generating node2vec embeddings from random walks generated using the node2vec algorithm
* https://github.com/allenai/allennlp (version 0.8.2) - Generate elmo embeddings for each category title

Code used:
* wiki_cat_node2vec_commands.sh - Commands used to
* wiki_cat_generate_elmo_embeddings.py - generate elmo embeddings
* wiki_cat_poincare_embedding.py - generate poincare embeddings
keywords:
Wikipedia; Wikipedia Category Tree; Embeddings; Elmo; Node2Vec; Poincare;
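The plain-text formats described above (a category name, a tab, then a space-separated vector; and one category name per line with the line position used as its ID) can be read with the standard library alone. This is a minimal sketch under those stated formats: the function names are hypothetical, the sample lines are made up, and the `start` parameter exists because the description mentions both 0-based and 1-based positions for different files.

```python
def parse_embedding_line(line):
    """Parse 'category_name<TAB>v1 v2 v3 ...' into (name, vector).

    The name is returned as stored, with '_' in place of spaces.
    """
    name, vector_text = line.rstrip("\n").split("\t", 1)
    vector = [float(x) for x in vector_text.split()]
    return name, vector


def load_category_ids(lines, start=0):
    """Map line position to category name (one name per line).

    Pass start=1 for files that count positions from 1.
    """
    return {
        i: line.rstrip("\n") for i, line in enumerate(lines, start=start)
    }


# Made-up sample content for illustration only.
name, vector = parse_embedding_line("Living_people\t0.1 -0.2 0.3\n")
ids_from_zero = load_category_ids(["Living people\n", "Physics\n"])
ids_from_one = load_category_ids(["Living people\n", "Physics\n"], start=1)
```

For the compressed files, the same functions could be applied line by line over `gzip.open(path, "rt", encoding="utf-8")`, which avoids loading a multi-gigabyte file into memory at once.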
published: 2019-07-11
Daniels, Melissa; Larson, Eric (2019): Data for "Effects of forest windstorm disturbance on invasive plants in protected areas of southern Illinois, USA". University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-1401121_V1
We studied the effect of windstorm disturbance on forest invasive plants in southern Illinois. This data includes raw data on plant abundance at survey points, compiled data used in statistical analyses, and spatial data for surveyed plots and units. This file package also includes a readme.doc file that describes the data in detail, including attribute descriptions.
keywords:
tornado, blowdowns, derecho, invasive plants, Shawnee National Forest, southern Illinois
published: 2019-06-22
MacDonald, Sean; Ward, Michael; Sperry, Jinelle (2019): Manipulating social information to promote frugivory by birds on a Hawaiian Island. University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-9223847_V1
keywords:
conspecific attraction; fruit-eating bird; Hawaiian flora; playback experiment; seed dispersal; social information; Zosterops japonicas