Displaying 126 - 138 of 138 in total
Subject Area
Funder
Publication Year
License
Illinois Data Bank Dataset Search Results

Dataset Search Results

published: 2018-07-13
 
Qualitative Data collected from the websites of undergraduate research journals between October, 2014 and May, 2015. Two CSV files. The first file, "Sample", includes the sample of journals with secondary data collected. The second file, "Population", includes the remainder of the population for which secondary data was not collected. Note: That does not add up to 800 as indicated in article, rows were deleted for journals that had broken links or defunct websites during random sampling process.
keywords: undergraduate research; undergraduate journals; scholarly communication; libraries; liaison librarianship
published: 2018-04-23
 
Contains a series of datasets that score pairs of tokens (words, journal names, and controlled vocabulary terms) based on how often they co-occur within versus across authors' collections of papers. The tokens derive from four different fields of PubMed papers: journal, affiliation, title, MeSH (medical subject headings). Thus, there are 10 different datasets, one for each pair of token type: affiliation-word vs affiliation-word, affiliation-word vs journal, affiliation-word vs mesh, affiliation-word vs title-word, mesh vs mesh, mesh vs journal, etc. Using authors to link papers and in turn pairs of tokens is an alternative to the usual within-document co-occurrences, and using e.g., citations to link papers. This is particularly striking for journal pairs because a paper almost always appears in a single journal and so within-document co-occurrences are 0, i.e., useless. The tokens are taken from the Author-ity 2009 dataset which has a cluster of papers for each inferred author, and a summary of each field. For MeSH, title-words, affiliation-words that summary includes only the top-20 most frequent tokens after field-specific stoplisting (e.g., university is stoplisted from affiliation and Humans is stoplisted from MeSH). The score for a pair of tokens A and B is defined as follows. Suppose Ai and Bi are the number of occurrences of token A (and B, respectively) across the i-th author's papers, then nA = sum(Ai); nB = sum(Ai) nAB = sum(Ai*Bi) if A not equal B; nAA = sum(Ai*(Ai-1)/2) otherwise nAnB = nA*nB if A not equal B; nAnA = nA*(nA-1)/2 otherwise score = 1000000*nAB/nAnB if A is not equal B; 1000000*nAA/nAnA otherwise Token pairs are excluded when: score < 5, or nA < cut-off, or nB < cut-off, or nAB < cut-offAB. The cut-offs differ for token types and can be inferred from the datasets. For example, cut-off = 200 and cut-offAB = 20 for journal pairs. Each dataset has the following 7 tab-delimited all-ASCII columns 1: score: roughly the number tokens' co-occurrence divided by the total number of pairs, in parts per million (ppm), ranging from 5 to 1,000,000 2: nAB: total number of co-occurrences 3: nAnB: total number of pairs 4: nA: number of occurrences of token A 5: nB: number of occurrences of token B 6: A: token A 7: B: token B We made some of these datasets as early as 2011 as we were working to link PubMed authors with USPTO inventors, where the vocabulary usage is strikingly different, but also more recently to create links from PubMed authors to their dissertations and NIH/NSF investigators, and to help disambiguate PubMed authors. Going beyond explicit (exact within-field match) is particularly useful when data is sparse (think old papers lacking controlled vocabulary and affiliations, or papers with metadata written in different languages) and when making links across databases with different kinds of fields and vocabulary (think PubMed vs USPTO records). We never published a paper on this but our work inspired the more refined measures described in: <a href="https://doi.org/10.1371/journal.pone.0115681">D′Souza JL, Smalheiser NR (2014) Three Journal Similarity Metrics and Their Application to Biomedical Journals. PLOS ONE 9(12): e115681. https://doi.org/10.1371/journal.pone.0115681</a> <a href="http://dx.doi.org/10.5210/disco.v7i0.6654">Smalheiser, N., & Bonifield, G. (2016). Two Similarity Metrics for Medical Subject Headings (MeSH): An Aid to Biomedical Text Mining and Author Name Disambiguation. DISCO: Journal of Biomedical Discovery and Collaboration, 7. doi:http://dx.doi.org/10.5210/disco.v7i0.6654</a>
keywords: PubMed; MeSH; token; name disambiguation
published: 2018-04-23
 
Provides links to Author-ity 2009, including records from principal investigators (on NIH and NSF grants), inventors on USPTO patents, and students/advisors on ProQuest dissertations. Note that NIH and NSF differ in the type of fields they record and standards used (e.g., institution names). Typically an NSF grant spanning multiple years is associated with one record, while an NIH grant occurs in multiple records, for each fiscal year, sub-projects/supplements, possibly with different principal investigators. The prior probability of match (i.e., that the author exists in Author-ity 2009) varies dramatically across NIH grants, NSF grants, and USPTO patents. The great majority of NIH principal investigators have one or more papers in PubMed but a minority of NSF principal investigators (except in biology) have papers in PubMed, and even fewer USPTO inventors do. This prior probability has been built into the calculation of match probabilities. The NIH data were downloaded from NIH exporter and the older NIH CRISP files. The dataset has 2,353,387 records, only includes ones with match probability > 0.5, and has the following 12 fields: 1 app_id, 2 nih_full_proj_nbr, 3 nih_subproj_nbr, 4 fiscal_year 5 pi_position 6 nih_pi_names 7 org_name 8 org_city_name 9 org_bodypolitic_code 10 age: number of years since their first paper 11 prob: the match probability to au_id 12 au_id: Author-ity 2009 author ID The NSF dataset has 262,452 records, only includes ones with match probability > 0.5, and the following 10 fields: 1 AwardId 2 fiscal_year 3 pi_position, 4 PrincipalInvestigators, 5 Institution, 6 InstitutionCity, 7 InstitutionState, 8 age: number of years since their first paper 9 prob: the match probability to au_id 10 au_id: Author-ity 2009 author ID There are two files for USPTO because here we linked disambiguated authors in PubMed (from Author-ity 2009) with disambiguated inventors. The USPTO linking dataset has 309,720 records, only includes ones with match probability > 0.5, and the following 3 fields 1 au_id: Author-ity 2009 author ID 2 inv_id: USPTO inventor ID 3 prob: the match probability of au_id vs inv_id The disambiguated inventors file (uiuc_uspto.tsv) has 2,736,306 records, and has the following 7 fields 1 inv_id: USPTO inventor ID 2 is_lower 3 is_upper 4 fullnames 5 patents: patent IDs separated by '|' 6 first_app_yr 7 last_app_yr
keywords: PubMed; USPTO; Principal investigator; Name disambiguation
published: 2017-12-18
 
This dataset matches to a thesis of the same title: Can fair use be adequately taught to Librarians? Assessing Librarians' confidence and comprehension in explaining fair use following an expert workshop.
keywords: fair use; copyright
published: 2017-12-14
 
Objectives: This study follows-up on previous work that began examining data deposited in an institutional repository. The work here extends the earlier study by answering the following lines of research questions: (1) what is the file composition of datasets ingested into the University of Illinois at Urbana-Champaign campus repository? Are datasets more likely to be single file or multiple file items? (2) what is the usage data associated with these datasets? Which items are most popular? Methods: The dataset records collected in this study were identified by filtering item types categorized as "data" or "dataset" using the advanced search function in IDEALS. Returned search results were collected in an Excel spreadsheet to include data such as the Handle identifier, date ingested, file formats, composition code, and the download count from the item's statistics report. The Handle identifier represents the dataset record's persistent identifier. Composition represents codes that categorize items as single or multiple file deposits. Date available represents the date the dataset record was published in the campus repository. Download statistics were collected via a website link for each dataset record and indicates the number of times the dataset record has been downloaded. Once the data was collected, it was used to evaluate datasets deposited into IDEALS. Results: A total of 522 datasets were identified for analysis covering the period between January 2007 and August 2016. This study revealed two influxes occurring during the period of 2008-2009 and in 2014. During the first time frame a large number of PDFs were deposited by the Illinois Department of Agriculture. Whereas, Microsoft Excel files were deposited in 2014 by the Rare Books and Manuscript Library. Single file datasets clearly dominate the deposits in the campus repository. The total download count for all datasets was 139,663 and the average downloads per month per file across all datasets averaged 3.2. Conclusion: Academic librarians, repository managers, and research data services staff can use the results presented here to anticipate the nature of research data that may be deposited within institutional repositories. With increased awareness, content recruitment, and improvements, IRs can provide a viable cyberinfrastructure for researchers to deposit data, but much can be learned from the data already deposited. Awareness of trends can help librarians facilitate discussions with researchers about research data deposits as well as better tailor their services to address short-term and long-term research needs.
keywords: research data; research statistics; institutional repositories; academic libraries
published: 2017-11-15
 
Monthly water withdrawal records (total pumpage and per-capita consumption) for the City of Austin, Texas (2000-2014). Data were provided by Austin Water Utility.
keywords: Water use; Water conservation
published: 2017-09-26
 
This file contains the supplemental appendix for the article "Farmer Preferences for Agricultural Soil Carbon Sequestration Schemes" published in Applied Economic Policy and Perspectives (accepted 2017).
keywords: appendix; carbon sequestration; tillage; choice experiment
published: 2017-06-01
 
List of Chinese Students Receiving a Ph.D. in Chemistry between 1905 and 1964. Based on two books compiling doctoral dissertations by Chinese students in the United States. Includes disciplines; university; advisor; year degree awarded, birth and/or death date, dissertation title. Accompanies Chapter 5 : History of the Modern Chemistry Doctoral Program in Mainland China by Vera V. Mainz published in "Igniting the Chemical Ring of Fire : Historical Evolution of the Chemical Communities in the Countries of the Pacific Rim", Seth Rasmussen, Editor. Published by World Scientific. Expected publication 2017.
keywords: Chinese; graduate student; dissertation; university; advisor; chemistry; engineering; materials science
published: 2016-12-02
 
This dataset enumerates the number of geocoded tweets captured in geographic rectangular bounding boxes around the metropolitan statistical areas (MSAs) defined for 49 American cities, during a four-week period in 2012 (between April and June), through the Twitter Streaming API. More information on MSA definitions: https://www.census.gov/population/metro/
keywords: human dynamics; social media; urban informatics; pace of life; Twitter; ecological correlation; individual behavior
published: 2016-08-18
 
Copyright Review Management System renewals by year, data from Table 2 of the article "How Large is the ‘Public Domain’? A comparative Analysis of Ringer’s 1961 Copyright Renewal Study and HathiTrust CRMS Data."
keywords: copyright; copyright renewals; HathiTrust
published: 2016-08-02
 
These data are the result of a multi-step process aimed at enriching BIBFRAME RDF with linked data. The process takes in an initial MARC XML file, transforms it to BIBFRAME RDF/XML, and then four separate python files corresponding to the BIBFRAME 1.0 model (Work, Instance, Annotation, and Authority) are run over the BIBFRAME RDF/XML output. The input and outputs of each step are included in this data set. Input file types include the CSV; MARC XML; and Master RDF/XML Files. The CSV contain bibliographic identifiers to e-books. From CSVs a set of MARC XML are generated. The MARC XML are utilized to produce the Master RDF file set. The major outputs of the enrichment code produce BIBFRAME linked data as Annotation RDF, Instance RDF, Work RDF, and Authority RDF.
keywords: BIBFRAME; Schema.org; linked data; discovery; MARC; MARCXML; RDF
published: 2016-06-23
 
This dataset was extracted from a set of metadata files harvested from the DataCite metadata store (https://search.datacite.org/ui) during December 2015. Metadata records for items with a resourceType of dataset were collected. 1,647,949 total records were collected. This dataset contains three files: 1) readme.txt: A readme file. 2) version-results.csv: A CSV file containing three columns: DOI, DOI prefix, and version text contents 3) version-counts.csv: A CSV file containing counts for unique version text content values.
keywords: datacite;metadata;version values;repository data
published: 2016-06-23
 
This dataset was extracted from a set of metadata files harvested from the DataCite metadata store (http://search.datacite.org/ui) during December 2015. Metadata records for items with a resourceType of dataset were collected. 1,647,949 total records were collected. This dataset contains four files: 1) readme.txt: a readme file. 2) language-results.csv: A CSV file containing three columns: DOI, DOI prefix, and language text contents 3) language-counts.csv: A CSV file containing counts for unique language text content values. 4) language-grouped-counts.txt: A text file containing the results of manually grouping these language codes.
keywords: datacite;metadata;language codes;repository data