Datasets

published: 2021-12-31
 
We developed and delivered in-person training at local health department offices in six of the seven Illinois Department of Public Health “health regions” between April and May 2019. Pre-, post-, and six-month follow-up questionnaires on knowledge, attitudes, and practices with regard to tick surveillance were administered to training participants.
keywords: ticks; survey; tick-borne disease; public health
published: 2021-10-15
 
Atomic oxygen data from SCIAMACHY for the MLT, 2002-2012, averaged over 26 14-day periods beginning January 1.
keywords: SCIAMACHY data
published: 2023-09-20
 
Dataset includes bee trait information and species abundance information for bees collected at 29 forest plots in southern Illinois, USA. Plots are located within three public land sites. Environmental data were also collected for each of the 29 plots.
keywords: wild bees; forest management; functional traits
published: 2023-09-19
 
We used the following keywords files to identify categories for journals and conferences not in Scopus, for our STI 2023 paper "Assessing the agreement in retraction indexing across 4 multidisciplinary sources: Crossref, Retraction Watch, Scopus, and Web of Science". The first four text files each contain keywords/content words in the form: 'keyword1', 'keyword2', 'keyword3', .... The file title indicates the name of the category:
file1: healthscience_words.txt
file2: lifescience_words.txt
file3: physicalscience_words.txt
file4: socialscience_words.txt
The first four files were generated from a combination of software and manual review in an iterative process in which we:
- Manually reviewed venue titles that we were not able to automatically categorize using the Scopus categorization or extending it as a resource.
- Iteratively reviewed uncategorized venue titles to manually curate additional keywords as content words indicating a venue title could be classified in the category healthscience, lifescience, physicalscience, or socialscience. We used English content words and added words we could automatically translate to identify content words.
NOTE: Terminology that had multiple potential meanings or contained non-English words that did not yield useful automatic translations (e.g., Al-Masāq) was not selected as content words.
The fifth text file is a list of stopwords in the form: 'stopword1', 'stopword2', 'stopword3', ...
file5: stopwords.txt
This file contains manually curated stopwords from venue titles to handle non-content words like 'conference' and 'journal'.
This dataset is a revision of the following dataset: Version 1: Lee, Jou; Schneider, Jodi: Keywords for manual field assignment for Assessing the agreement in retraction indexing across 4 multidisciplinary sources: Crossref, Retraction Watch, Scopus, and Web of Science. University of Illinois at Urbana-Champaign Data Bank.
Changes from Version 1 to Version 2: - Added one author - Added a stopwords file that was used in our data preprocessing. - Thoroughly reviewed each of the 4 keywords lists. In particular, we added UTF-8 terminology, removed some non-content words and misclassified content words, and extensively reviewed non-English keywords.
keywords: health science keywords; scientometrics; stopwords; field; keywords; life science keywords; physical science keywords; science of science; social science keywords; meta-science; RISRS
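As a rough illustration of how keyword lists in the quoted, comma-separated form described above could be applied to a venue title, here is a minimal sketch. It is not the authors' code: the function names, the token-level matching rule, and the sample file contents are all assumptions made for illustration.

```python
import re


def parse_quoted_list(text):
    """Extract items from a string of the form 'word1', 'word2', ..."""
    return [item.strip().lower() for item in re.findall(r"'([^']+)'", text)]


def categorize(title, keywords_by_category, stopwords):
    """Return the sorted categories whose keywords appear in the venue title.

    Assumption: matching is done on single lowercase word tokens after
    stopwords are dropped. The dataset description does not specify this.
    """
    tokens = [t for t in re.findall(r"\w+", title.lower()) if t not in stopwords]
    return sorted(
        category
        for category, words in keywords_by_category.items()
        if any(token in words for token in tokens)
    )


# Hypothetical file contents, standing in for the real *_words.txt files.
health = parse_quoted_list("'nursing', 'clinical', 'medicine'")
physical = parse_quoted_list("'physics', 'chemistry'")
stop = set(parse_quoted_list("'journal', 'conference', 'of'"))

categories = {"healthscience": set(health), "physicalscience": set(physical)}
result = categorize("Journal of Clinical Nursing", categories, stop)
print(result)
```

A title can match more than one category under this rule, which is why the sketch returns a list rather than a single label.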
published: 2018-09-26
 
Nucleotide sequences from wild parsnip CYP71AJ4 (angelicin synthase; GenBank EF191021, https://www.ncbi.nlm.nih.gov/nuccore/EF191021) were obtained by Sanger sequencing. Seeds from individual plants from different populations were harvested to obtain corresponding cDNA. The cDNA was cloned and directly sequenced. Amino acid translations were obtained using standard codon usage. Alignments of CYP71AJ4 sequences (involved in angular furanocoumarin biosynthesis) with as the reference sequence. Consistent amino acid variabilities were found between some populations. The relationship between sequencing variability and selective pressure is not yet known.
keywords: Pastinaca sativa; parsnip; furanocoumarins; psoralen
published: 2021-12-09
 
These data were collected in 2018 and 2019 at the University of Illinois Energy Farm (N 40.063607, W 88.206926). During each growing season, bulk and rhizosphere soil were collected from replicate Sorghum bicolor nitrogen use efficiency trial plots at three separate time points (approximately July 1, August 1, and September 1). We measured soil moisture, pH, soil nitrate and ammonium, potential nitrification, potential denitrification, and extracted and sequenced the V4 region of the 16S rRNA gene for microbial community analysis. All microbial sequence data is archived in the National Center for Biotechnology Information’s (NCBI) Sequence Read Archive (accession number SRP326979, project number PRJNA741261).
keywords: soil nitrogen; nitrification; nitrogen cycle; sorghum; bioenergy; Center for Advanced Bioenergy and Bioproducts Innovation
published: 2018-10-24
 
This dataset was compiled between 2010 and 2011 from data published in scientific articles evaluating the influence of cropping systems and soil management practices on soil organic carbon. We searched the Thomson Reuters Web of Science database and reviewed the reference sections of key peer-reviewed articles. Articles included in the database presented results from field sites within the continental United States.
keywords: Cropping systems; soil management; soil organic carbon; soil quality.
published: 2019-01-07
 
Vendor transcription of the Catalogue of Copyright Entries, Part 1, Group 1, Books: New Series, Volume 29 for the Year 1932. This file contains all of the entries from the indicated volume.
keywords: copyright; Catalogue of Copyright Entries; Copyright Office
published: 2023-12-13
 
Corbicula spp. are among the most prolific aquatic invasive species in the world and can have negative effects on aquatic ecosystems. We performed qualitative field surveys, examined literature accounts and natural history museum holdings, and accessed citizen science data sources to document the distribution of Corbicula in Mexico and shared drainages. Through 26 publications (N = 127 records), 312 museum holdings, and 446 iNaturalist records, we documented 885 records pertaining to Corbicula in Mexico and shared drainages. The first record of the species in Mexico was in 1969, and it has since been reported from 26 of the 32 Mexican states and most of the major river basins throughout the country. However, we suggest Corbicula is more prevalent in Mexico than we report in this work, as it is often undersampled and underreported.
keywords: Corbicula; exotic species; invasive species; Asian Clams; Bivalvia; freshwater systems
published: 2019-03-19
 
This dataset includes images and extracted centerlines from experiments looking at the formation and evolution of meltwater meandering channels on ice. The laboratory data include centimeter- and millimeter-scale rivulets. The dataset also includes an image and corresponding centerlines from the Peterman Ice Island. All centerlines were manually digitized in Matlab, but no distributable code was developed for the process. Once digitized, centerlines were smoothed and standardized following methods and routines developed by other authors (Zolezzi and Guneralp, 2016; Guneralp and Rhoads, 2008). Details about the preparation of the centerlines and processing with these methods are included in the dissertation by Fernández (2018) linked to this dataset.
"Millimeter scale and Peterman Ice Island centerlines.pdf": This file includes the images of two mm-scale experiments and the Peterman Ice Island image. Seventeen centerlines were digitized from the former and seven were digitized from the latter. Those centerlines are shown above the images themselves.
"Centimeter scale rivulet images.pdf": This file includes images corresponding to all cm-scale centerlines used for the analysis presented in the dissertation by Fernandez (2018). Each image has a short caption indicating the run ID and the time at which it was captured. The images were used to extract centerlines to look at the planform evolution of cm-scale meltwater meandering rivulets on ice. Images include 26 centerlines from four different runs.
"Meltwater meandering channel centerlines.xlsx": This spreadsheet contains the centerline data for all fifty centerlines. The workbook includes 51 sheets. The first 50 each correspond to one of the channels. The mm-scale and Peterman Ice Island centerlines are identified using the same IDs shown in "Millimeter scale and Peterman Ice Island centerlines.pdf". The cm-scale centerlines are identified by run ID and a number indicating the time in minutes (with t = 0 min being the time at which water started flowing over the ice block). The naming convention is also associated with the images in "Centimeter scale rivulet images.pdf". The last sheet in the workbook includes a summary of the channel widths measured from every image for each centerline.
The 50 sheets with the centerline information have four columns each. The titles of the columns are X, Y, S, and C. X and Y are dimensionless coordinates of the centerline. S is the dimensionless streamwise coordinate (location along the centerline). C is the dimensionless curvature value. All these values were non-dimensionalized with the channel width. See Fernandez (2018), Zolezzi and Guneralp (2016), and Guneralp and Rhoads (2008) for more details regarding the process of smoothing, standardizing, and non-dimensionalization of the centerline coordinates.
keywords: Meltwater, Meandering, Ice, Supraglacial, Experiments
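To make the relationship between the X, Y, S, and C columns concrete, the sketch below computes a streamwise coordinate and a curvature from raw centerline points and scales them by a channel width. It is only an illustration: it uses cumulative chord length and three-point (Menger) curvature, which are swapped-in textbook techniques and not the smoothing and standardization routines of the cited authors. All function names and the sample width are hypothetical.

```python
import math


def streamwise_coordinate(points):
    """Cumulative chord length along the centerline, starting at 0."""
    s = [0.0]
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        s.append(s[-1] + math.hypot(x1 - x0, y1 - y0))
    return s


def menger_curvature(p, q, r):
    """Signed curvature of the circle through three points.

    Positive for a counterclockwise turn; 0.0 if any two points coincide.
    """
    (x1, y1), (x2, y2), (x3, y3) = p, q, r
    cross = (x2 - x1) * (y3 - y1) - (y2 - y1) * (x3 - x1)  # twice the signed area
    a = math.hypot(x2 - x1, y2 - y1)
    b = math.hypot(x3 - x2, y3 - y2)
    c = math.hypot(x3 - x1, y3 - y1)
    if a * b * c == 0:
        return 0.0
    return 2.0 * cross / (a * b * c)


def nondimensionalize(points, s, curvatures, width):
    """Scale lengths by 1/width and curvatures by width."""
    xy = [(x / width, y / width) for x, y in points]
    return xy, [v / width for v in s], [k * width for k in curvatures]


# Check on an arc of a circle with radius 2, where the curvature is 1/2.
arc = [(2 * math.cos(t), 2 * math.sin(t)) for t in (0.0, 0.1, 0.2)]
s = streamwise_coordinate(arc)
curvature = menger_curvature(*arc)
xy_star, s_star, c_star = nondimensionalize(arc, s, [curvature], width=0.5)
```

Scaling by the width leaves the product of curvature and length unchanged, which is why lengths are divided by the width while curvature is multiplied by it.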
published: 2023-12-18
 
We conducted long-term capture-mark-recapture surveys on two isolated ornate box turtle (Terrapene ornata) populations in northern Illinois, USA. This dataset provides the capture history strings and additional demographic information used for estimating population vital rates with robust design capture-mark-recapture models. The vital rates were then used in a stage-based population projection matrix model for each population.
keywords: demography; capture-mark-recapture; vital rates; conservation; wildlife ecology
published: 2022-09-19
 
Data characterize zooplankton in Shelbyville Reservoir, Illinois, United States of America. Zooplankton were sampled with a conical zooplankton net (0.5 m diameter mouth) when water was deeper than 2 m and by grab sample when water was shallower. Zooplankton samples were concentrated and subsampled with a Hensen-Stempel pipette following protocols described in Detmer et al. (2019). Zooplankton were identified to the lowest feasible taxonomic unit according to Pennak (1989) and Thorp and Covich (2001) and were enumerated in a 1 mL Sedgewick-Rafter cell. Subsamples were analyzed until at least 200 individuals were enumerated from each site across the three main taxonomic groups (cladocerans, copepods, and rotifers). Given the variation in zooplankton concentrations at each site, this process often led to far more than 200 individuals being counted (x̄ = 269, min = 200, max = 487). A summary of the sample size from each site can be found in Supplementary Table S2. Abundances were corrected for volume of water filtered. For rare taxa (< 20 individuals per sample), all individuals were measured for length. For abundant taxa, length measurements were collected on the first 20 organisms of each abundant taxon encountered in a subsample. Dry mass was calculated from equations for microcrustaceans, rotifers, and Chaoborus sp. (Rosen, 1981; Botrell et al., 1976; Dumont and Balvay, 1979).
keywords: Reservoir; Zooplankton
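The description says only that abundances were corrected for the volume of water filtered. As a hedged illustration of what such a correction commonly looks like, the sketch below assumes a vertical net tow with perfect filtration efficiency and scales a subsample count up to the whole concentrated sample. The formulas, function names, tow depth, and sample volumes are assumptions and are not taken from the dataset or from Detmer et al. (2019).

```python
import math


def tow_volume_liters(mouth_diameter_m, tow_depth_m):
    """Volume swept by a net of the given mouth diameter over a vertical tow.

    Assumes the net filters all water in the swept cylinder.
    """
    radius = mouth_diameter_m / 2
    return math.pi * radius ** 2 * tow_depth_m * 1000.0  # cubic meters to liters


def density_per_liter(count, subsample_ml, concentrate_ml, volume_filtered_l):
    """Scale a subsample count to the full sample, then divide by water filtered."""
    individuals_in_sample = count * concentrate_ml / subsample_ml
    return individuals_in_sample / volume_filtered_l


# 0.5 m mouth diameter comes from the description; the 4 m tow depth,
# 2 mL subsample, and 100 mL concentrate are hypothetical values.
volume = tow_volume_liters(0.5, 4.0)
density = density_per_liter(269, 2.0, 100.0, volume)
print(f"{volume:.1f} L filtered, {density:.2f} individuals per liter")
```

For grab samples, the filtered volume would simply be the volume of water collected rather than a swept cylinder.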
published: 2022-09-28
 
Data from a field survey at Nikko National Park in central Japan. Data contain information about deer carcasses, the environment of sites, and vertebrate scavenging.
keywords: Carcass; Cervus nippon; Detection; Facultative scavenging; Obligate scavenger
published: 2021-07-21
 
This dataset contains 1 CSV file: RozanskyLarsonTaylorMsat.csv, which contains microsatellite fragment lengths for Virile and Spothanded Crayfish from the Current River watershed of Missouri, U.S., and complementary data, including assignments to species by phenotype and COI sequence data, GenBank accession numbers for COI sequence data, study sites with dates of collection and geographic coordinates, and Illinois Natural History Survey (INHS) Crustacean Collection lots where specimens are stored.
keywords: invasive species; hybridization; crayfishes; streams; freshwater; Cambaridae; virile crayfish; spothanded crayfish; Missouri; Current River; Ozark National Scenic Riverways
published: 2016-05-26
 
This data set includes survey responses collected during 2015 from academic libraries with library publishing services. Each institution responded to questions related to its use of user studies or information about readers in order to shape digital publication design, formats, and interfaces. Survey data was supplemented with institutional categories to facilitate comparison across institutional types.
keywords: academic libraries; publishing; user experience; user studies
published: 2022-07-25
 
Related to the raw entity mentions (https://doi.org/10.13012/B2IDB-4163883_V1), this dataset represents the effects of the data cleaning process and collates all of the entity mentions which were too ambiguous to successfully link to the ChEBI ontology.
keywords: synthetic biology; NERC data; chemical mentions; ambiguous entities
published: 2020-03-03
 
This second version (V2) provides additional data cleaning compared to V1, additional data collection (mainly to include data from 2019), and more metadata for nodes. Please see NETWORKv2README.txt for more detail.
keywords: citations; retraction; network analysis; Web of Science; Google Scholar; indirect citation
published: 2020-04-07
 
Baseline data from a multi-modal intervention study conducted at the University of Illinois at Urbana-Champaign. Data include results from a cardiorespiratory fitness assessment (maximal oxygen consumption, VO2max), a body composition assessment (Dual-Energy X-ray Absorptiometry, DXA), and Magnetic Resonance Spectroscopy Imaging. Data set includes data from 435 participants, ages 18-44 years.
keywords: Magnetic Resonance Spectroscopy; N-acetyl aspartic acid (NAA); Body Mass Index; cardiorespiratory fitness; body composition
published: 2020-05-04
 
The Cline Center Historical Phoenix Event Data covers the period 1945-2019 and includes 8.2 million events extracted from 21.2 million news stories. This data was produced using the state-of-the-art PETRARCH-2 software to analyze content from the New York Times (1945-2018), the BBC Monitoring's Summary of World Broadcasts (1979-2019), the Wall Street Journal (1945-2005), and the Central Intelligence Agency’s Foreign Broadcast Information Service (1995-2004). It documents the agents, locations, and issues at stake in a wide variety of conflict, cooperation and communicative events in the Conflict and Mediation Event Observations (CAMEO) ontology. The Cline Center produced these data with the generous support of Linowes Fellow and Faculty Affiliate Prof. Dov Cohen and help from our academic and private sector collaborators in the Open Event Data Alliance (OEDA). For details on the CAMEO framework, see: Schrodt, Philip A., Omür Yilmaz, Deborah J. Gerner, and Dennis Hermreck. "The CAMEO (conflict and mediation event observations) actor coding framework." In 2008 Annual Meeting of the International Studies Association. 2008. http://eventdata.parusanalytics.com/papers.dir/APSA.2005.pdf Gerner, D.J., Schrodt, P.A. and Yilmaz, O., 2012. Conflict and mediation event observations (CAMEO) Codebook. http://eventdata.parusanalytics.com/cameo.dir/CAMEO.Ethnic.Groups.zip For more information about PETRARCH and OEDA, see: http://openeventdata.org/
keywords: OEDA; Open Event Data Alliance (OEDA); Cline Center; Cline Center for Advanced Social Research; civil unrest; petrarch; phoenix event data; violence; protest; political; conflict; political science
published: 2020-08-21
 
# WikiCSSH
If you are using WikiCSSH please cite the following:
> Han, Kanyao; Yang, Pingjing; Mishra, Shubhanshu; Diesner, Jana. 2020. “WikiCSSH: Extracting Computer Science Subject Headings from Wikipedia.” In Workshop on Scientific Knowledge Graphs (SKG 2020). https://skg.kmi.open.ac.uk/SKG2020/papers/HAN_et_al_SKG_2020.pdf
> Han, Kanyao; Yang, Pingjing; Mishra, Shubhanshu; Diesner, Jana. 2020. "WikiCSSH - Computer Science Subject Headings from Wikipedia". University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-0424970_V1
Download the WikiCSSH files from: https://doi.org/10.13012/B2IDB-0424970_V1
More details about the WikiCSSH project can be found at: https://github.com/uiuc-ischool-scanr/WikiCSSH
This folder contains the following files:
WikiCSSH_categories.csv - Categories in WikiCSSH
WikiCSSH_category_links.csv - Links between categories in WikiCSSH
Wikicssh_core_categories.csv - Core categories as mentioned in the paper
WikiCSSH_category_links_all.csv - Links between categories in WikiCSSH (includes a dummy category called <ROOT> which is parent of isolates and top level categories)
WikiCSSH_category2page.csv - Links between Wikipedia pages and Wikipedia Categories in WikiCSSH
WikiCSSH_page2redirect.csv - Links between Wikipedia pages and Wikipedia page redirects in WikiCSSH
This work is licensed under the Creative Commons Attribution 4.0 International License. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/ or send a letter to Creative Commons, PO Box 1866, Mountain View, CA 94042, USA.
keywords: wikipedia; computer science;
published: 2020-09-02
 
Citation context annotation. This dataset is a second version (V2) and part of the supplemental data for Jodi Schneider, Di Ye, Alison Hill, and Ashley Whitehorn. (2020) "Continued post-retraction citation of a fraudulent clinical trial report, eleven years after it was retracted for falsifying data". Scientometrics. In press, DOI: 10.1007/s11192-020-03631-1 Publications were selected by examining all citations to the retracted paper Matsuyama 2005, and selecting the 35 citing papers, published 2010 to 2019, which do not mention the retraction, but which mention the methods or results of the retracted paper (called "specific" in Ye, Di; Hill, Alison; Whitehorn (Fulton), Ashley; Schneider, Jodi (2020): Citation context annotation for new and newly found citations (2006-2019) to retracted paper Matsuyama 2005. University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-8150563_V1 ). The annotated citations are second-generation citations to the retracted paper Matsuyama 2005 (RETRACTED: Matsuyama W, Mitsuyama H, Watanabe M, Oonakahara KI, Higashimoto I, Osame M, Arimura K. Effects of omega-3 polyunsaturated fatty acids on inflammatory markers in COPD. Chest. 2005 Dec 1;128(6):3817-27.), retracted in 2008 (Retraction in: Chest (2008) 134:4 (893) https://doi.org/10.1016/S0012-3692(08)60339-6 ).
OVERALL DATA for VERSION 2 (V2)
FILES/FILE FORMATS
Same data in two formats:
2010-2019 SG to specific not mentioned FG.csv - Unicode CSV (preservation format only) - same as in V1
2010-2019 SG to specific not mentioned FG.xlsx - Excel workbook (preferred format) - same as in V1
Additional files in V2:
2G-possible-misinformation-analyzed.csv - Unicode CSV (preservation format only)
2G-possible-misinformation-analyzed.xlsx - Excel workbook (preferred format)
ABBREVIATIONS:
2G - Refers to the second-generation of Matsuyama
FG - Refers to the direct citation of Matsuyama (the one the second-generation item cites)
COLUMN HEADER EXPLANATIONS
File name: 2G-possible-misinformation-analyzed. Other column headers in this file have the same meaning as explained in V1. The following are additional header explanations:
Quote Number - The order of the quote (citation context citing the first generation article given in "FG in bibliography") in the second generation article (given in "2G article")
Quote - The text of the quote (citation context citing the first generation article given in "FG in bibliography") in the second generation article (given in "2G article")
Translated Quote - English translation of "Quote", automatic translation from Google Scholar
Seriousness/Risk - Our assessment of the risk of misinformation and its seriousness
2G topic - Our assessment of the topic of the cited article (the second generation article given in "2G article")
2G section - The section of the citing article (the second generation article given in "2G article") in which the cited article (the first generation article given in "FG in bibliography") was found
FG in bib type - The type of article (e.g., review article), referring to the cited article (the first generation article given in "FG in bibliography")
FG in bib topic - Our assessment of the topic of the cited article (the first generation article given in "FG in bibliography")
FG in bib section - The section of the cited article (the first generation article given in "FG in bibliography") in which the Matsuyama retracted paper was cited
keywords: citation context annotation; retraction; diffusion of retraction; second-generation citation context analysis
published: 2018-03-01
 
The data set consists of Illumina sequences derived from 48 sediment samples, collected in 2015 from Lake Michigan and Lake Superior for the purpose of inventorying the fungal diversity in these two lakes. DNA was extracted from ca. 0.5g of sediment using the MoBio PowerSoil DNA isolation kits following the Earth Microbiome protocol. PCR was completed with the fungal primers ITS1F and fITS7 using the Fluidigm Access Array. The resulting amplicons were sequenced using the Illumina Hi-Seq2500 platform with rapid 2 x 250nt paired-end reads. The enclosed data sets contain the forward read files for both primers, both fixed-header index files, and the associated map files needed to be processed in QIIME. In addition, enclosed are two rarefied OTU files used to evaluate fungal diversity. All decimal latitude and decimal longitude coordinates of our collecting sites are also included.
File descriptions:
Great_lakes_Map_coordinates.xlsx = coordinates of sample sites
QIIME Processing ITS1 region: These are the raw files used to process the ITS1 Illumina reads in QIIME. ***only forward reads were processed
GL_ITS1_HW_mapFile_meta.txt = This is the map file used in QIIME.
ITS1F_Miller_Fludigm_I1_fixedheader.fastq = Index file from Illumina. Headers were fixed to match the forward reads (R1) file in order to process in QIIME
ITS1F_Miller_Fludigm_R1.fastq = Forward Illumina reads for the ITS1 region.
QIIME Processing ITS2 region: These are the raw files used to process the ITS2 Illumina reads in QIIME. ***only forward reads were processed
GL_ITS2_HW_mapFile_meta.txt = This is the map file used in QIIME.
ITS7_Miller_Fludigm_I1_Fixedheaders.fastq = Index file from Illumina. Headers were fixed to match the forward reads (R1) file in order to process in QIIME
ITS7_Miller_Fludigm_R1.fastq = Forward Illumina reads for the ITS2 region.
Resulting OTU Table and OTU table with taxonomy
ITS1 Region
wahl_ITS1_R1_otu_table.csv = File contains Representative OTUs based on ITS1 region for all the R1 data and the number of each OTU found in each sample.
wahl_ITS1_R1_otu_table_w_tax.csv = File contains Representative OTUs based on ITS1 region for all the R1 data and the number of each OTU found in each sample along with taxonomic determination based on the following database: sh_taxonomy_qiime_ver7_97_s_31.01.2016_dev
ITS2 Region
wahl_ITS2_R1_otu_table.csv = File contains Representative OTUs based on ITS2 region for all the R1 data and the number of each OTU found in each sample.
wahl_ITS2_R1_otu_table_w_tax.csv = File contains Representative OTUs based on ITS2 region for all the R1 data and the number of each OTU found in each sample along with taxonomic determination based on the following database: sh_taxonomy_qiime_ver7_97_s_31.01.2016_dev
Rarefied Illumina dataset for each ITS Region
ITS1_R1_nosing_rare_5000.csv = Environmental parameters and rarefied OTU dataset for ITS1 region.
ITS2_R1_nosing_rare_5000.csv = Environmental parameters and rarefied OTU dataset for ITS2 region.
Column headings:
#SampleID = code including researcher initials and sequential run number
BarcodeSequence =
LinkerPrimerSequence = two sequences used CTTGGTCATTTAGAGGAAGTAA or GTGARTCATCGAATCTTTG
ReversePrimer = two sequences used GCTGCGTTCTTCATCGATGC or TCCTCCGCTTATTGATATGC
run_prefix = initials of run operator
Sample = location code, see thesis figures 1 and 2 for mapped locations and Great_lakes_Map_coordinates.xlsx for exact coordinates.
DepthGroup = S=shallow (50-100 m), MS=mid-shallow (101-150 m), MD=mid-deep (151-200 m), and D=deep (>200 m)
Depth_Meters = Depth in meters
Lake = lake name, Michigan or Superior
Nitrogen %
Carbon %
Date = mm/dd/yyyy
pH = acidity, potential of Hydrogen (pH) scale
SampleDescription = Sample or control
X = sequential run number
OTU ID = Operational taxonomic unit ID
keywords: Illumina; next-generation sequencing; ITS; fungi
published: 2020-02-12
 
The XSEDE program manages the database of allocation awards for the portfolio of advanced research computing resources funded by the National Science Foundation (NSF). The database holds data for allocation awards dating from the start of the TeraGrid program in 2004 to the present, with awards continuing through the end of the second XSEDE award in 2021. The project data include lead researcher and affiliation, title and abstract, field of science, and the start and end dates. Along with the project information, the data set includes resource allocation and usage data for each award associated with the project. The data show the transition of resources over a fifteen-year span along with the evolution of researchers, fields of science, and institutional representation.
keywords: allocations; cyberinfrastructure; XSEDE
published: 2018-04-19
 
Prepared by Vetle Torvik 2018-04-15
The dataset comes as a single tab-delimited ASCII encoded file, and should be about 717MB uncompressed.
• How was the dataset created? First and last names of authors in the Author-ity 2009 dataset were processed through several tools to predict ethnicities and gender, including Ethnea+Genni as described in:
Torvik VI, Agarwal S. Ethnea -- an instance-based ethnicity classifier based on geocoded author names in a large-scale bibliographic database. International Symposium on Science of Science March 22-23, 2016 - Library of Congress, Washington, DC, USA. http://hdl.handle.net/2142/88927
Smith, B., Singh, M., & Torvik, V. (2013). A search engine approach to estimating temporal changes in gender orientation of first names. Proceedings Of The ACM/IEEE Joint Conference On Digital Libraries, (JCDL 2013 - Proceedings of the 13th ACM/IEEE-CS Joint Conference on Digital Libraries), 199-208. doi:10.1145/2467696.2467720
EthnicSeer: http://singularity.ist.psu.edu/ethnicity
Treeratpituk P, Giles CL (2012). Name-Ethnicity Classification and Ethnicity-Sensitive Name Matching. Proceedings of the Twenty-Sixth Conference on Artificial Intelligence (pp. 1141-1147). AAAI-12. Toronto, ON, Canada
SexMachine 0.1.1: https://pypi.org/project/SexMachine
First names, for some Author-ity records lacking them, were harvested from outside bibliographic databases.
• The code and back-end data are periodically updated and made available for query at the Torvik Research Group (http://abel.ischool.illinois.edu)
• What is the format of the dataset? The dataset contains 9,300,182 rows and 10 columns:
1. auid: unique ID for Authors in Author-ity 2009 (PMID_authorposition)
2. name: full name (used as input to EthnicSeer)
3. EthnicSeer: predicted ethnicity; ARA, CHI, ENG, FRN, GER, IND, ITA, JAP, KOR, RUS, SPA, VIE, XXX
4. prop: decimal between 0 and 1 reflecting the confidence of the EthnicSeer prediction
5. lastname: used as input for Ethnea+Genni
6. firstname: used as input for Ethnea+Genni
7. Ethnea: predicted ethnicity; either one of 26 (AFRICAN, ARAB, BALTIC, CARIBBEAN, CHINESE, DUTCH, ENGLISH, FRENCH, GERMAN, GREEK, HISPANIC, HUNGARIAN, INDIAN, INDONESIAN, ISRAELI, ITALIAN, JAPANESE, KOREAN, MONGOLIAN, NORDIC, POLYNESIAN, ROMANIAN, SLAV, THAI, TURKISH, VIETNAMESE) or two ethnicities (e.g., SLAV-ENGLISH), or UNKNOWN (if no one or two dominant predictions), or TOOSHORT (if both first and last name are too short)
8. Genni: predicted gender; 'F', 'M', or '-'
9. SexMac: predicted gender based on third-party Python program (default settings except case_sensitive=False); female, mostly_female, andy, mostly_male, male
10. SSNgender: predicted gender based on US SSN data; 'F', 'M', or '-'
keywords: Androgyny; Bibliometrics; Data mining; Search engine; Gender; Semantic orientation; Temporal prediction; Textual markers
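Because the file is large (about 717MB, 9,300,182 rows), it may be convenient to stream it row by row rather than load it whole. The sketch below shows one way to do that with Python's standard csv module, splitting auid into its PMID and author-position parts. It assumes the file has no header row and that the ten columns appear in the order listed above; neither is confirmed by the description. The sample rows and the function name are invented for illustration and are not real records.

```python
import csv
import io

# Column order as listed in the dataset description.
COLUMNS = [
    "auid", "name", "EthnicSeer", "prop", "lastname",
    "firstname", "Ethnea", "Genni", "SexMac", "SSNgender",
]


def read_rows(handle):
    """Yield one dict per row of a tab-delimited file with the columns above.

    Assumption: no header row. If the real file has one, skip the first row.
    """
    reader = csv.DictReader(
        handle, fieldnames=COLUMNS, delimiter="\t", quoting=csv.QUOTE_NONE
    )
    for row in reader:
        # auid has the form PMID_authorposition.
        pmid, _, position = row["auid"].partition("_")
        row["pmid"] = pmid
        row["position"] = int(position)
        row["prop"] = float(row["prop"])
        yield row


# Hypothetical sample rows standing in for the real file.
sample = io.StringIO(
    "12345678_2\tJane Doe\tENG\t0.87\tDoe\tJane\tENGLISH\tF\tfemale\tF\n"
    "12345678_3\tA Lee\tXXX\t0.40\tLee\tA\tTOOSHORT\t-\tandy\t-\n"
)
rows = list(read_rows(sample))
print(rows[0]["pmid"], rows[0]["position"], rows[0]["Ethnea"])
```

For the real file, the same function could be given a handle from `open(path, encoding="ascii", newline="")`, where `path` is wherever the downloaded file was saved.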