Illinois Data Bank Dataset Search Results
Results
published:
2018-04-23
Contains a series of datasets that score pairs of tokens (words, journal names, and controlled vocabulary terms) based on how often they co-occur within versus across authors' collections of papers. The tokens derive from four different fields of PubMed papers: journal, affiliation, title, and MeSH (Medical Subject Headings). Thus, there are 10 different datasets, one for each pair of token types: affiliation-word vs affiliation-word, affiliation-word vs journal, affiliation-word vs MeSH, affiliation-word vs title-word, MeSH vs MeSH, MeSH vs journal, etc.
Using authors to link papers, and in turn pairs of tokens, is an alternative to the usual within-document co-occurrence and to using, e.g., citations to link papers. This is particularly striking for journal pairs: because a paper almost always appears in a single journal, within-document co-occurrences are 0, i.e., useless.
The tokens are taken from the Author-ity 2009 dataset, which has a cluster of papers for each inferred author and a summary of each field. For MeSH, title-words, and affiliation-words, that summary includes only the top-20 most frequent tokens after field-specific stoplisting (e.g., "university" is stoplisted from affiliation and "Humans" is stoplisted from MeSH). The score for a pair of tokens A and B is defined as follows. Suppose Ai and Bi are the numbers of occurrences of tokens A and B, respectively, across the i-th author's papers; then
nA = sum(Ai); nB = sum(Bi)
nAB = sum(Ai*Bi) if A not equal B; nAA = sum(Ai*(Ai-1)/2) otherwise
nAnB = nA*nB if A not equal B; nAnA = nA*(nA-1)/2 otherwise
score = 1000000*nAB/nAnB if A is not equal B; 1000000*nAA/nAnA otherwise
Token pairs are excluded when: score < 5, or nA < cut-off, or nB < cut-off, or nAB < cut-offAB.
The cut-offs differ for token types and can be inferred from the datasets. For example, cut-off = 200 and cut-offAB = 20 for journal pairs.
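The score definition above can be sketched in Python (a minimal illustration only, not the code used to build the datasets; the per-author count dictionaries are hypothetical):

```python
def pair_score(counts_a, counts_b, same_token=False):
    """Score a token pair from per-author occurrence counts.

    counts_a, counts_b: dicts mapping author id -> number of occurrences
    of token A (resp. B) across that author's papers.
    Returns (score_ppm, nAB, nAnB, nA, nB) following the definitions above.
    """
    nA = sum(counts_a.values())
    nB = sum(counts_b.values())
    if same_token:
        # A == B: count within-author pairs of the same token
        nAB = sum(c * (c - 1) // 2 for c in counts_a.values())
        nAnB = nA * (nA - 1) // 2
    else:
        shared = counts_a.keys() & counts_b.keys()
        nAB = sum(counts_a[i] * counts_b[i] for i in shared)
        nAnB = nA * nB
    score = 1_000_000 * nAB / nAnB if nAnB else 0.0
    return score, nAB, nAnB, nA, nB
```

A pair would then be kept only if score >= 5 and nA, nB, and nAB clear the type-specific cut-offs.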
Each dataset has the following 7 tab-delimited all-ASCII columns:
1: score: roughly the number of co-occurrences divided by the total number of pairs, in parts per million (ppm), ranging from 5 to 1,000,000
2: nAB: total number of co-occurrences
3: nAnB: total number of pairs
4: nA: number of occurrences of token A
5: nB: number of occurrences of token B
6: A: token A
7: B: token B
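For illustration, a minimal Python reader for this layout (a sketch; column names are taken from the list above, and input is any iterable of lines):

```python
import csv

FIELDS = ("score", "nAB", "nAnB", "nA", "nB", "A", "B")

def parse_pairs(lines):
    """Parse lines of a token-pair dataset (7 tab-delimited ASCII columns)."""
    for row in csv.reader(lines, delimiter="\t"):
        rec = dict(zip(FIELDS, row))
        rec["score"] = float(rec["score"])
        for key in ("nAB", "nAnB", "nA", "nB"):
            rec[key] = int(rec[key])
        yield rec
```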
We made some of these datasets as early as 2011 while working to link PubMed authors with USPTO inventors, where vocabulary usage is strikingly different, and more recently to create links from PubMed authors to their dissertations and to NIH/NSF investigators, and to help disambiguate PubMed authors. Going beyond explicit matching (exact within-field matches) is particularly useful when data is sparse (think old papers lacking controlled vocabulary and affiliations, or papers with metadata written in different languages) and when making links across databases with different kinds of fields and vocabulary (think PubMed vs USPTO records). We never published a paper on this, but our work inspired the more refined measures described in:
<a href="https://doi.org/10.1371/journal.pone.0115681">D′Souza JL, Smalheiser NR (2014) Three Journal Similarity Metrics and Their Application to Biomedical Journals. PLOS ONE 9(12): e115681. https://doi.org/10.1371/journal.pone.0115681</a>
<a href="http://dx.doi.org/10.5210/disco.v7i0.6654">Smalheiser, N., & Bonifield, G. (2016). Two Similarity Metrics for Medical Subject Headings (MeSH): An Aid to Biomedical Text Mining and Author Name Disambiguation. DISCO: Journal of Biomedical Discovery and Collaboration, 7. doi:http://dx.doi.org/10.5210/disco.v7i0.6654</a>
keywords:
PubMed; MeSH; token; name disambiguation
published:
2019-09-06
This is a dataset of 1101 comments from The New York Times (May 1, 2015-August 31, 2015) that contains a mention of the stemmed words vaccine or vaxx.
keywords:
vaccine;online comments
published:
2020-08-10
Zinnen, Jack; Spyreas, Greg; Erdős, László; Berg, Christian; Matthews, Jeffrey
(2020)
These are text files downloaded from the Web of Science for the bibliographic analyses found in Zinnen et al. (2020) in Applied Vegetation Science. They represent the papers and reference lists from six expert-based indicator systems: Floristic Quality Assessment, hemeroby, naturalness indicator values (& social behaviors), Ellenberg indicator values, grassland utilization values, and urbanity indicator values.
To examine the data, download VOSviewer and see the instructions in van Eck & Waltman (2019) for how to upload data. Although we used bibliographic coupling, there are a number of other interesting bibliographic analyses you can run with these data (e.g., visualizing citations between journals from this set of documents).
Note: There are two caveats to note about these data and Supplements 1 & 2 associated with our paper. First, there are some overlapping papers in these text files (i.e., raw data). When added individually, the papers sum to more than the numbers we give; however, when combined, VOSviewer recognizes these as repeats and matches the numbers we list in S1 and the manuscript. Second, we labelled the downloaded papers in S2 with their respective systems. In some cases, the labels do not completely match the counts listed in S1 and the raw data. This is because some of these papers use another system but were not captured in our systematic literature search (e.g., a paper may have used hemeroby but was not picked up by WoS, so this paper is not listed as one of the 52 hemeroby papers).
keywords:
Web of Science; bibliographic analyses; vegetation; VOSviewer
published:
2018-12-31
Sixty undergraduate STEM lecture classes were observed across 14 departments at the University of Illinois Urbana-Champaign in 2015 and 2016. We selected the classes to observe using purposive sampling techniques with the objectives of (1) collecting classroom observations that were representative of the STEM courses offered; (2) conducting observations on non-test, typical class days; and (3) comparing these classroom observations using the Class Observation Protocol for Undergraduate STEM (COPUS) to record the presence and frequency of active learning practices utilized by Community of Practice (CoP) and non-CoP instructors.
Decimal values are the result of combined observations. All COPUS codes listed are from Smith (2013) "The Classroom Observation Protocol for Undergraduate STEM (COPUS): A New Instrument to Characterize STEM Classroom Practices" paper.
For more information on the data collection process, see "Evidence that communities of practice are associated with active learning in large STEM lectures" by Tomkin et al. (2019) in the International Journal of STEM Education.
keywords:
COPUS, Community of Practice
published:
2018-04-23
Mishra, Shubhanshu; Fegley, Brent D; Diesner, Jana; Torvik, Vetle I.
(2018)
Self-citation analysis data based on PubMed Central subset (2002-2005)
----------------------------------------------------------------------
Created by Shubhanshu Mishra, Brent D. Fegley, Jana Diesner, and Vetle Torvik on April 5th, 2018
## Introduction
This is a dataset created as part of the publication titled: Mishra S, Fegley BD, Diesner J, Torvik VI (2018) Self-Citation is the Hallmark of Productive Authors, of Any Gender. PLOS ONE.
It contains files for running the self citation analysis on articles published in PubMed Central between 2002 and 2005, collected in 2015.
The dataset is distributed in the form of the following tab separated text files:
* Training_data_2002_2005_pmc_pair_First.txt (1.2G) - Data for first authors
* Training_data_2002_2005_pmc_pair_Last.txt (1.2G) - Data for last authors
* Training_data_2002_2005_pmc_pair_Middle_2nd.txt (964M) - Data for middle 2nd authors
* Training_data_2002_2005_pmc_pair_txt.header.txt - Header for the data
* COLUMNS_DESC.txt file - Descriptions of all columns
* model_text_files.tar.gz - Text files containing model coefficients and scores for model selection.
* results_all_model.tar.gz - Model coefficient and result files in numpy format used for plotting purposes. v4.reviewer contains models for analysis done after reviewer comments.
* README.txt file
## Dataset creation
Our experiments relied on data from multiple sources, including proprietary data from [Thomson Reuters' (now Clarivate Analytics) Web of Science collection of MEDLINE citations](<a href="https://clarivate.com/products/web-of-science/databases/">https://clarivate.com/products/web-of-science/databases/</a>). Authors interested in reproducing our experiments should request this data from Clarivate Analytics directly. However, we provide a similar but open dataset based on citations from PubMed Central, which can be used to obtain results similar to those reported in our analysis. Furthermore, we have also freely shared our datasets, which can be used along with the citation datasets from Clarivate Analytics to re-create the dataset used in our experiments. These datasets are listed below. If you wish to use any of those datasets, please make sure you cite both the dataset and the paper introducing it.
* MEDLINE 2015 baseline: <a href="https://www.nlm.nih.gov/bsd/licensee/2015_stats/baseline_doc.html">https://www.nlm.nih.gov/bsd/licensee/2015_stats/baseline_doc.html</a>
* Citation data from PubMed Central (original paper includes additional citations from Web of Science)
* Author-ity 2009 dataset:
- Dataset citation: <a href="https://doi.org/10.13012/B2IDB-4222651_V1">Torvik, Vetle I.; Smalheiser, Neil R. (2018): Author-ity 2009 - PubMed author name disambiguated dataset. University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-4222651_V1</a>
- Paper citation: <a href="https://doi.org/10.1145/1552303.1552304">Torvik, V. I., & Smalheiser, N. R. (2009). Author name disambiguation in MEDLINE. ACM Transactions on Knowledge Discovery from Data, 3(3), 1–29. https://doi.org/10.1145/1552303.1552304</a>
- Paper citation: <a href="https://doi.org/10.1002/asi.20105">Torvik, V. I., Weeber, M., Swanson, D. R., & Smalheiser, N. R. (2004). A probabilistic similarity metric for Medline records: A model for author name disambiguation. Journal of the American Society for Information Science and Technology, 56(2), 140–158. https://doi.org/10.1002/asi.20105</a>
* Genni 2.0 + Ethnea for identifying author gender and ethnicity:
- Dataset citation: <a href="https://doi.org/10.13012/B2IDB-9087546_V1">Torvik, Vetle (2018): Genni + Ethnea for the Author-ity 2009 dataset. University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-9087546_V1</a>
- Paper citation: <a href="https://doi.org/10.1145/2467696.2467720">Smith, B. N., Singh, M., & Torvik, V. I. (2013). A search engine approach to estimating temporal changes in gender orientation of first names. In Proceedings of the 13th ACM/IEEE-CS joint conference on Digital libraries - JCDL ’13. ACM Press. https://doi.org/10.1145/2467696.2467720</a>
- Paper citation: <a href="http://hdl.handle.net/2142/88927">Torvik VI, Agarwal S. Ethnea -- an instance-based ethnicity classifier based on geo-coded author names in a large-scale bibliographic database. International Symposium on Science of Science March 22-23, 2016 - Library of Congress, Washington DC, USA. http://hdl.handle.net/2142/88927</a>
* MapAffil for identifying article country of affiliation:
- Dataset citation: <a href="https://doi.org/10.13012/B2IDB-4354331_V1">Torvik, Vetle I. (2018): MapAffil 2016 dataset -- PubMed author affiliations mapped to cities and their geocodes worldwide. University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-4354331_V1</a>
- Paper citation: <a href="http://doi.org/10.1045/november2015-torvik">Torvik VI. MapAffil: A Bibliographic Tool for Mapping Author Affiliation Strings to Cities and Their Geocodes Worldwide. D-Lib magazine : the magazine of the Digital Library Forum. 2015;21(11-12):10.1045/november2015-torvik</a>
* IMPLICIT journal similarity:
- Dataset citation: <a href="https://doi.org/10.13012/B2IDB-4742014_V1">Torvik, Vetle (2018): Author-implicit journal, MeSH, title-word, and affiliation-word pairs based on Author-ity 2009. University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-4742014_V1</a>
* Novelty dataset for identify article level novelty:
- Dataset citation: <a href="https://doi.org/10.13012/B2IDB-5060298_V1">Mishra, Shubhanshu; Torvik, Vetle I. (2018): Conceptual novelty scores for PubMed articles. University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-5060298_V1</a>
- Paper citation: <a href="https://doi.org/10.1045/september2016-mishra"> Mishra S, Torvik VI. Quantifying Conceptual Novelty in the Biomedical Literature. D-Lib magazine : The Magazine of the Digital Library Forum. 2016;22(9-10):10.1045/september2016-mishra</a>
- Code: <a href="https://github.com/napsternxg/Novelty">https://github.com/napsternxg/Novelty</a>
* Expertise dataset for identifying author expertise on articles:
* Source code provided at: <a href="https://github.com/napsternxg/PubMed_SelfCitationAnalysis">https://github.com/napsternxg/PubMed_SelfCitationAnalysis</a>
**Note: The dataset is based on a snapshot of PubMed (which includes Medline and PubMed-not-Medline records) taken in the first week of October, 2016.**
Check <a href="https://www.nlm.nih.gov/databases/download/pubmed_medline.html">here</a> for information on obtaining PubMed/MEDLINE and NLM's data Terms and Conditions.
Additional data-related updates can be found at <a href="http://abel.ischool.illinois.edu">Torvik Research Group</a>.
## Acknowledgments
This work was made possible in part with funding to VIT from <a href="https://projectreporter.nih.gov/project_info_description.cfm?aid=8475017&icde=18058490">NIH grant P01AG039347</a> and <a href="http://www.nsf.gov/awardsearch/showAward?AWD_ID=1348742">NSF grant 1348742</a>. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
## License
Self-citation analysis data based on PubMed Central subset (2002-2005) by Shubhanshu Mishra, Brent D. Fegley, Jana Diesner, and Vetle Torvik is licensed under a Creative Commons Attribution 4.0 International License.
Permissions beyond the scope of this license may be available at <a href="https://github.com/napsternxg/PubMed_SelfCitationAnalysis">https://github.com/napsternxg/PubMed_SelfCitationAnalysis</a>.
keywords:
Self citation; PubMed Central; Data Analysis; Citation Data;
published:
2018-04-19
Torvik, Vetle I.; Smalheiser, Neil R.
(2018)
Author-ity 2009 baseline dataset. Prepared by Vetle Torvik 2009-12-03
The dataset comes in the form of 18 compressed (.gz) Linux text files named authority2009.part00.gz through authority2009.part17.gz. The total size should be ~17.4 GB uncompressed.
• How was the dataset created?
The dataset is based on a snapshot of PubMed (which includes Medline and PubMed-not-Medline records) taken in July 2009, comprising a total of 19,011,985 article records and 61,658,514 author name instances. Each instance of an author name is uniquely represented by the PMID and the position on the paper (e.g., 10786286_3 is the third author name on PMID 10786286). Thus, each cluster is represented by a collection of author name instances. The instances were first grouped into "blocks" by last name and first-name initial (including some close variants), and then each block was separately subjected to clustering. Details are described in
<i>Torvik, V., & Smalheiser, N. (2009). Author name disambiguation in MEDLINE. ACM Transactions On Knowledge Discovery From Data, 3(3), doi:10.1145/1552303.1552304</i>
<i>Torvik, V. I., Weeber, M., Swanson, D. R., & Smalheiser, N. R. (2005). A Probabilistic Similarity Metric for Medline Records: A Model for Author Name Disambiguation. Journal Of The American Society For Information Science & Technology, 56(2), 140-158. doi:10.1002/asi.20105</i>
Note that for Author-ity 2009, some new predictive features (e.g., grants, citation matches, temporal features, affiliation phrases) were applied, along with a post-processing merging procedure (to capture name variants not captured during blocking, e.g., matches on subsets of compound last names, and nicknames with a different first initial like Bill and William) -- this has not yet been written up for publication.
• How accurate is the 2009 dataset (compared to 2006 and 2008)?
The recall reported for 2006 of 98.8% has been much improved in 2009 (because common last name variants are now captured). Compared to 2006, both 2008 and 2009 overall seem to exhibit a higher rate of splitting errors but a lower rate of lumping errors. This reflects an overall decrease in prior probabilities -- possibly because a) a new prior estimation procedure avoids wild estimates (by dampening the magnitude of iterative changes); b) 2008 and 2009 included items in PubMed-not-Medline (including in-process items); and c) the frequencies of some names increased dramatically (exponentially; J. Lee went from ~16,000 occurrences in 2006 to ~26,000 in 2009). However, splitting is reduced in 2009 for some special cases, like NIH-funded investigators who list their grant numbers on their papers. Compared to 2008, splitting errors were reduced overall in 2009 while maintaining the same level of lumping errors.
• What is the format of the dataset?
The cluster summaries for 2009 are much more extensive than in the 2008 dataset. Each line corresponds to a predicted author-individual, represented by a cluster of author name instances and a summary of all the corresponding papers and author name variants (and, if there are > 10 papers in the cluster, an identical summary of the 10 most recent papers). Each cluster has a unique Author ID (which is uniquely identified by the PMID of the earliest paper in the cluster and the author name position). The summary has the following tab-delimited fields:
1. blocks separated by '||'; each block may consist of multiple lastname-first initial variants separated by '|'
2. prior probabilities of the respective blocks separated by '|'
3. Cluster number relative to the block ordered by cluster size (some are listed as 'CLUSTER X' when they were derived from multiple blocks)
4. Author ID (or cluster ID) e.g., bass_c_9731334_2 represents a cluster where 9731334_2 is the earliest author name instance. Although not needed for uniqueness, the id also has the most frequent lastname_firstinitial (lowercased).
5. cluster size (number of author name instances on papers)
6. name variants separated by '|' with counts in parenthesis. Each variant has the format lastname_firstname middleinitial, suffix
7. last name variants separated by '|'
8. first name variants separated by '|'
9. middle initial variants separated by '|' ('-' if none)
10. suffix variants separated by '|' ('-' if none)
11. email addresses separated by '|' ('-' if none)
12. range of years (e.g., 1997-2009)
13. Top 20 most frequent affiliation words (after stoplisting and tokenizing; some phrases are also made) with counts in parenthesis; separated by '|'; ('-' if none)
14. Top 20 most frequent MeSH terms (after stoplisting) with counts in parenthesis; separated by '|'; ('-' if none)
15. Journals with counts in parenthesis; separated by '|'
16. Top 20 most frequent title words (after stoplisting and tokenizing) with counts in parenthesis; separated by '|'; ('-' if none)
17. Co-author names (lowercased lastname and first/middle initials) with counts in parenthesis; separated by '|'; ('-' if none)
18. Co-author IDs with counts in parenthesis; separated by '|'; ('-' if none)
19. Author name instances (PMID_auno, separated by '|')
20. Grant IDs (after normalization; '-' if none given; separated by '|')
21. Total number of times cited. (Citations are based on references extracted from PMC).
22. h-index
23. Citation counts (e.g., for h-index): PMIDs by the author that have been cited (with total citation counts in parenthesis); separated by "|"
24. Cited: PMIDs that the author cited (with counts in parenthesis) separated by "|"
25. Cited-by: PMIDs that cited the author (with counts in parenthesis) separated by "|"
26-47. same summary as for 4-25 except that the 10 most recent papers were used (based on year; so if paper 10, 11, 12... have the same year, one is selected arbitrarily)
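As a rough sketch, a cluster line can be split into a few of these fields in Python (field positions follow the numbered list above, 0-indexed; the 'value (count)' convention is assumed to hold exactly as described, so verify against the actual files):

```python
import re

def parse_counted_list(field):
    """Parse a '|'-separated field of 'value (count)' entries; '-' means none."""
    if field == "-":
        return {}
    out = {}
    for item in field.split("|"):
        m = re.match(r"(.*)\s\((\d+)\)$", item)
        if m:
            out[m.group(1)] = int(m.group(2))
        else:
            out[item] = None  # entry without a parenthesized count
    return out

def parse_cluster_line(line):
    """Extract a few illustrative fields from one tab-delimited cluster line."""
    f = line.rstrip("\n").split("\t")
    return {
        "blocks": f[0].split("||"),                      # field 1
        "author_id": f[3],                               # field 4
        "size": int(f[4]),                               # field 5
        "name_variants": parse_counted_list(f[5]),       # field 6
        "years": f[11],                                  # field 12
        "affiliation_words": parse_counted_list(f[12]),  # field 13
    }
```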
keywords:
Bibliographic databases; Name disambiguation; MEDLINE; Library information networks
published:
2018-04-23
Mishra, Shubhanshu; Torvik, Vetle I.
(2018)
Conceptual novelty analysis data based on PubMed Medical Subject Headings
----------------------------------------------------------------------
Created by Shubhanshu Mishra, and Vetle I. Torvik on April 16th, 2018
## Introduction
This is a dataset created as part of the publication titled: Mishra S, Torvik VI. Quantifying Conceptual Novelty in the Biomedical Literature. D-Lib magazine : the magazine of the Digital Library Forum. 2016;22(9-10):10.1045/september2016-mishra.
It contains final data generated as part of our experiments based on MEDLINE 2015 baseline and MeSH tree from 2015.
The dataset is distributed in the form of the following tab separated text files:
* PubMed2015_NoveltyData.tsv - Novelty scores for each paper in PubMed. The file contains 22,349,417 rows and 6 columns, as follows:
- PMID: PubMed ID
- Year: year of publication
- TimeNovelty: time novelty score of the paper based on individual concepts (see paper)
- VolumeNovelty: volume novelty score of the paper based on individual concepts (see paper)
- PairTimeNovelty: time novelty score of the paper based on pair of concepts (see paper)
- PairVolumeNovelty: volume novelty score of the paper based on pair of concepts (see paper)
* mesh_scores.tsv - Temporal profiles for each MeSH term for all years. The file contains 1,102,831 rows and 5 columns, as follows:
- MeshTerm: Name of the MeSH term
- Year: year
- AbsVal: Total publications with that MeSH term in the given year
- TimeNovelty: age (in years since first publication) of MeSH term in the given year
- VolumeNovelty: age (in number of papers since first publication) of MeSH term in the given year
* meshpair_scores.txt.gz (36 GB uncompressed) - Temporal profiles for each MeSH term pair for all years
- Mesh1: Name of the first MeSH term (alphabetically sorted)
- Mesh2: Name of the second MeSH term (alphabetically sorted)
- Year: year
- AbsVal: Total publications with that MeSH pair in the given year
- TimeNovelty: age (in years since first publication) of MeSH pair in the given year
- VolumeNovelty: age (in number of papers since first publication) of MeSH pair in the given year
* README.txt file
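A minimal reader for the per-paper file, assuming no header row and the six columns listed above (an illustrative sketch; adjust if the distributed file differs):

```python
import csv

NOVELTY_COLS = ("PMID", "Year", "TimeNovelty", "VolumeNovelty",
                "PairTimeNovelty", "PairVolumeNovelty")

def read_novelty(lines):
    """Yield per-paper novelty records from PubMed2015_NoveltyData.tsv lines."""
    for row in csv.reader(lines, delimiter="\t"):
        rec = dict(zip(NOVELTY_COLS, row))
        rec["PMID"] = int(rec["PMID"])
        rec["Year"] = int(rec["Year"])
        for key in NOVELTY_COLS[2:]:
            rec[key] = float(rec[key])
        yield rec
```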
## Dataset creation
This dataset was constructed using multiple datasets described in the following locations:
* MEDLINE 2015 baseline: <a href="https://www.nlm.nih.gov/bsd/licensee/2015_stats/baseline_doc.html">https://www.nlm.nih.gov/bsd/licensee/2015_stats/baseline_doc.html</a>
* MeSH tree 2015: <a href="ftp://nlmpubs.nlm.nih.gov/online/mesh/2015/meshtrees/">ftp://nlmpubs.nlm.nih.gov/online/mesh/2015/meshtrees/</a>
* Source code provided at: <a href="https://github.com/napsternxg/Novelty">https://github.com/napsternxg/Novelty</a>
Note: The dataset is based on a snapshot of PubMed (which includes Medline and PubMed-not-Medline records) taken in the first week of October, 2016.
Check <a href="https://www.nlm.nih.gov/databases/download/pubmed_medline.html">here</a> for information on obtaining PubMed/MEDLINE and NLM's data Terms and Conditions.
Additional data-related updates can be found at <a href="http://abel.ischool.illinois.edu">Torvik Research Group</a>.
## Acknowledgments
This work was made possible in part with funding to VIT from <a href="https://projectreporter.nih.gov/project_info_description.cfm?aid=8475017&icde=18058490">NIH grant P01AG039347</a> and <a href="http://www.nsf.gov/awardsearch/showAward?AWD_ID=1348742">NSF grant 1348742</a>. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
## License
Conceptual novelty analysis data based on PubMed Medical Subject Headings by Shubhanshu Mishra, and Vetle Torvik is licensed under a Creative Commons Attribution 4.0 International License.
Permissions beyond the scope of this license may be available at <a href="https://github.com/napsternxg/Novelty">https://github.com/napsternxg/Novelty</a>
keywords:
Conceptual novelty; bibliometrics; PubMed; MEDLINE; MeSH; Medical Subject Headings; Analysis;
published:
2019-09-17
Mishra, Shubhanshu
(2019)
Trained models for multi-task multi-dataset learning for text classification in tweets.
Classification tasks include sentiment prediction, abusive content, sarcasm, and veridicality.
Models were trained using: <a href="https://github.com/socialmediaie/SocialMediaIE/blob/master/SocialMediaIE/scripts/multitask_multidataset_classification.py">https://github.com/socialmediaie/SocialMediaIE/blob/master/SocialMediaIE/scripts/multitask_multidataset_classification.py</a>
See <a href="https://github.com/socialmediaie/SocialMediaIE">https://github.com/socialmediaie/SocialMediaIE</a> and <a href="https://socialmediaie.github.io">https://socialmediaie.github.io</a> for details.
If you are using this data, please also cite the related article:
Shubhanshu Mishra. 2019. Multi-dataset-multi-task Neural Sequence Tagging for Information Extraction from Tweets. In Proceedings of the 30th ACM Conference on Hypertext and Social Media (HT '19). ACM, New York, NY, USA, 283-284. DOI: https://doi.org/10.1145/3342220.3344929
keywords:
twitter; deep learning; machine learning; trained models; multi-task learning; multi-dataset learning; sentiment; sarcasm; abusive content;
published:
2020-08-21
Han, Kanyao; Yang, Pingjing; Mishra, Shubhanshu; Diesner, Jana
(2020)
# WikiCSSH
If you are using WikiCSSH please cite the following:
> Han, Kanyao; Yang, Pingjing; Mishra, Shubhanshu; Diesner, Jana. 2020. “WikiCSSH: Extracting Computer Science Subject Headings from Wikipedia.” In Workshop on Scientific Knowledge Graphs (SKG 2020). https://skg.kmi.open.ac.uk/SKG2020/papers/HAN_et_al_SKG_2020.pdf
> Han, Kanyao; Yang, Pingjing; Mishra, Shubhanshu; Diesner, Jana. 2020. "WikiCSSH - Computer Science Subject Headings from Wikipedia". University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-0424970_V1
Download the WikiCSSH files from: https://doi.org/10.13012/B2IDB-0424970_V1
More details about the WikiCSSH project can be found at: https://github.com/uiuc-ischool-scanr/WikiCSSH
This folder contains the following files:
WikiCSSH_categories.csv - Categories in WikiCSSH
WikiCSSH_category_links.csv - Links between categories in WikiCSSH
Wikicssh_core_categories.csv - Core categories as mentioned in the paper
WikiCSSH_category_links_all.csv - Links between categories in WikiCSSH (includes a dummy category called <ROOT> which is parent of isolates and top level categories)
WikiCSSH_category2page.csv - Links between Wikipedia pages and Wikipedia Categories in WikiCSSH
WikiCSSH_page2redirect.csv - Links between Wikipedia pages and Wikipedia page redirects in WikiCSSH
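For example, the category links could be loaded into a parent-to-children map like this (a sketch that assumes two CSV columns, parent then child, with no header row; check the distributed file before relying on this):

```python
import csv
from collections import defaultdict

def load_category_links(lines):
    """Build a parent -> [children] adjacency map from category-link CSV lines."""
    children = defaultdict(list)
    for parent, child in csv.reader(lines):
        children[parent].append(child)
    return children
```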
This work is licensed under the Creative Commons Attribution 4.0 International License. To view a copy of this license, visit <a href="http://creativecommons.org/licenses/by/4.0/">http://creativecommons.org/licenses/by/4.0/</a> or send a letter to Creative Commons, PO Box 1866, Mountain View, CA 94042, USA.
keywords:
wikipedia; computer science;
published:
2016-06-23
This dataset was extracted from a set of metadata files harvested from the DataCite metadata store (https://search.datacite.org/ui) during December 2015. Metadata records for items with a resourceType of dataset were collected, for a total of 1,647,949 records.
This dataset contains three files:
1) readme.txt: A readme file.
2) version-results.csv: A CSV file containing three columns: DOI, DOI prefix, and version text contents
3) version-counts.csv: A CSV file containing counts for unique version text content values.
keywords:
datacite;metadata;version values;repository data
published:
2024-10-10
Mishra, Apratim; Lee, Haejin; Jeoung, Sullam; Torvik, Vetle; Diesner, Jana
(2024)
Diversity - PubMed dataset
Contact: Apratim Mishra (Oct, 2024)
This dataset presents article-level (pmid) and author-level (auid) diversity data for PubMed articles. The selection includes articles retrieved from Author-ity 2018 [1]: 907,024 papers and 1,316,838 authors, an expanded dataset relative to V1. The sample consists of the top 40 journals in the dataset, limited to papers with 2-12 authors published between 1991 and 2014 that are of article type "journal article" and written in English. Files are gzip-compressed and tab-separated; V3 includes the correct author count for the included papers (pmids) and updated results with no NaNs.
################################################
File1: auids_plos_3.csv.gz (Important columns defined, 5 in total)
• AUID: a unique ID for each author
• Genni: gender prediction
• Ethnea: ethnicity prediction
#################################################
File2: pmids_plos_3.csv.gz (Important columns defined)
• pmid: unique paper
• auid: all unique auids (author-name unique identification)
• year: Year of paper publication
• no_authors: Author count
• journal: Journal name
• years: first year of publication for every author
• Country-temporal: Country of affiliation for every author
• h_index: Journal h-index
• TimeNovelty: Paper Time novelty [2]
• nih_funded: Binary variable indicating funding for any author
• prior_cit_mean: Mean of all authors’ prior citation rate
• Insti_impact: All unique institutions’ citation rate
• mesh_vals: Top MeSH values for every author of that paper
• relative_citation_ratio: RCR
The ‘Readme’ includes a description for all columns.
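Since the files are gzip-compressed and tab-separated, they can be read with the Python standard library (a sketch; it assumes a header row naming the columns above, which should be verified against the files):

```python
import csv
import gzip

def read_table(path_or_fileobj):
    """Yield one dict per row from a gzip-compressed, tab-separated table."""
    with gzip.open(path_or_fileobj, mode="rt", encoding="utf-8") as fh:
        yield from csv.DictReader(fh, delimiter="\t")
```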
[1] Torvik, Vetle; Smalheiser, Neil (2021): Author-ity 2018 - PubMed author name disambiguated dataset. University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-2273402_V1
[2] Mishra, Shubhanshu; Torvik, Vetle I. (2018): Conceptual novelty scores for PubMed articles. University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-5060298_V1
keywords:
Diversity; PubMed; Citation
published:
2017-12-14
Objectives: This study follows up on previous work that began examining data deposited in an institutional repository. The work here extends the earlier study by answering the following research questions: (1) What is the file composition of datasets ingested into the University of Illinois at Urbana-Champaign campus repository? Are datasets more likely to be single-file or multiple-file items? (2) What is the usage data associated with these datasets? Which items are most popular?
Methods: The dataset records collected in this study were identified by filtering item types categorized as "data" or "dataset" using the advanced search function in IDEALS. Returned search results were collected in an Excel spreadsheet, including data such as the Handle identifier, date ingested, file formats, composition code, and the download count from the item's statistics report. The Handle identifier represents the dataset record's persistent identifier. Composition represents codes that categorize items as single or multiple file deposits. Date available represents the date the dataset record was published in the campus repository. Download statistics were collected via a website link for each dataset record and indicate the number of times the dataset record has been downloaded. Once collected, the data were used to evaluate datasets deposited into IDEALS.
Results: A total of 522 datasets were identified for analysis, covering the period between January 2007 and August 2016. This study revealed two influxes, one during 2008-2009 and one in 2014. During the first time frame, a large number of PDFs were deposited by the Illinois Department of Agriculture, whereas in 2014 Microsoft Excel files were deposited by the Rare Books and Manuscript Library. Single-file datasets clearly dominate the deposits in the campus repository. The total download count for all datasets was 139,663, and downloads per month per file averaged 3.2 across all datasets.
Conclusion: Academic librarians, repository managers, and research data services staff can use the results presented here to anticipate the nature of research data that may be deposited within institutional repositories. With increased awareness, content recruitment, and improvements, IRs can provide a viable cyberinfrastructure for researchers to deposit data, but much can be learned from the data already deposited. Awareness of trends can help librarians facilitate discussions with researchers about research data deposits as well as better tailor their services to address short-term and long-term research needs.
keywords:
research data; research statistics; institutional repositories; academic libraries
published:
2017-06-01
List of Chinese Students Receiving a Ph.D. in Chemistry between 1905 and 1964. Based on two books compiling doctoral dissertations by Chinese students in the United States. Includes discipline, university, advisor, year degree awarded, birth and/or death date, and dissertation title. Accompanies Chapter 5: History of the Modern Chemistry Doctoral Program in Mainland China, by Vera V. Mainz, published in "Igniting the Chemical Ring of Fire: Historical Evolution of the Chemical Communities in the Countries of the Pacific Rim", Seth Rasmussen, editor. Published by World Scientific. Expected publication 2017.
keywords:
Chinese; graduate student; dissertation; university; advisor; chemistry; engineering; materials science
published:
2017-09-26
Gramig, Benjamin M.; Widmar, Nicole
(2017)
This file contains the supplemental appendix for the article "Farmer Preferences for Agricultural Soil Carbon Sequestration Schemes" published in Applied Economic Policy and Perspectives (accepted 2017).
keywords:
appendix; carbon sequestration; tillage; choice experiment
published:
2018-04-19
Prepared by Vetle Torvik 2018-04-15
The dataset comes as a single tab-delimited ASCII encoded file, and should be about 717MB uncompressed.
• How was the dataset created?
First and last names of authors in the Author-ity 2009 dataset were processed through several tools to predict ethnicity and gender, including
Ethnea+Genni as described in:
<i>Torvik VI, Agarwal S. Ethnea -- an instance-based ethnicity classifier based on geocoded author names in a large-scale bibliographic database. International Symposium on Science of Science March 22-23, 2016 - Library of Congress, Washington, DC, USA.
http://hdl.handle.net/2142/88927</i>
<i>Smith, B., Singh, M., & Torvik, V. (2013). A search engine approach to estimating temporal changes in gender orientation of first names. Proceedings Of The ACM/IEEE Joint Conference On Digital Libraries, (JCDL 2013 - Proceedings of the 13th ACM/IEEE-CS Joint Conference on Digital Libraries), 199-208. doi:10.1145/2467696.2467720</i>
EthnicSeer: http://singularity.ist.psu.edu/ethnicity
<i>Treeratpituk P, Giles CL (2012). Name-Ethnicity Classification and Ethnicity-Sensitive Name Matching. Proceedings of the Twenty-Sixth Conference on Artificial Intelligence (pp. 1141-1147). AAAI-12. Toronto, ON, Canada</i>
SexMachine 0.1.1: <a href="https://pypi.org/project/SexMachine/">https://pypi.org/project/SexMachine/</a>
First names, for some Author-ity records lacking them, were harvested from outside bibliographic databases.
• The code and back-end data are periodically updated and made available for query at <a href="http://abel.ischool.illinois.edu">Torvik Research Group</a>
• What is the format of the dataset?
The dataset contains 9,300,182 rows and 10 columns:
1. auid: unique ID for Authors in Author-ity 2009 (PMID_authorposition)
2. name: full name used as input to EthnicSeer
3. EthnicSeer: predicted ethnicity; ARA, CHI, ENG, FRN, GER, IND, ITA, JAP, KOR, RUS, SPA, VIE, XXX
4. prop: decimal between 0 and 1 reflecting the confidence of the EthnicSeer prediction
5. lastname: used as input for Ethnea+Genni
6. firstname: used as input for Ethnea+Genni
7. Ethnea: predicted ethnicity; either one of 26 ethnicities (AFRICAN, ARAB, BALTIC, CARIBBEAN, CHINESE, DUTCH, ENGLISH, FRENCH, GERMAN, GREEK, HISPANIC, HUNGARIAN, INDIAN, INDONESIAN, ISRAELI, ITALIAN, JAPANESE, KOREAN, MONGOLIAN, NORDIC, POLYNESIAN, ROMANIAN, SLAV, THAI, TURKISH, VIETNAMESE), two ethnicities (e.g., SLAV-ENGLISH), UNKNOWN (if there are no one or two dominant predictions), or TOOSHORT (if both first and last name are too short)
8. Genni: predicted gender; 'F', 'M', or '-'
9. SexMac: predicted gender based on a third-party Python program (default settings except case_sensitive=False); female, mostly_female, andy, mostly_male, male
10. SSNgender: predicted gender based on US SSN data; 'F', 'M', or '-'
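The 10 tab-delimited columns above can be read with standard tools. A minimal sketch in Python — the file path is a placeholder, and the description does not state whether a header row is present, so this assumes there is none:

```python
import csv

# Column names taken from the 10-column description above.
COLUMNS = ["auid", "name", "EthnicSeer", "prop", "lastname",
           "firstname", "Ethnea", "Genni", "SexMac", "SSNgender"]

def read_rows(path):
    """Yield one dict per author record from the tab-delimited file."""
    with open(path, encoding="ascii", newline="") as f:
        for row in csv.reader(f, delimiter="\t"):
            yield dict(zip(COLUMNS, row))

def gender_counts(rows):
    """Tally the Genni predictions ('F', 'M', or '-') across records."""
    counts = {}
    for r in rows:
        counts[r["Genni"]] = counts.get(r["Genni"], 0) + 1
    return counts
```

Because the file is about 717MB uncompressed, the generator-based reader avoids loading all 9,300,182 rows into memory at once.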
keywords:
Androgyny; Bibliometrics; Data mining; Search engine; Gender; Semantic orientation; Temporal prediction; Textual markers
published:
2018-12-14
Stein Kenfield, Ayla
(2018)
Spreadsheet with data about whether or not the indicated institutional repository website provides metadata documentation. See readme file for more information.
keywords:
institutional repositories; metadata; best practices; metadata documentation
published:
2016-12-02
Gross, Alexander Jones; Murthy, Dhiraj; Varshney, Lav R.
(2016)
This dataset enumerates the number of geocoded tweets captured in geographic rectangular bounding boxes around the metropolitan statistical areas (MSAs) defined for 49 American cities, during a four-week period in 2012 (between April and June), through the Twitter Streaming API.
More information on MSA definitions: https://www.census.gov/population/metro/
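Counting geocoded tweets per MSA with rectangular bounding boxes reduces to a point-in-rectangle test on latitude/longitude pairs. A minimal sketch — the coordinates below are illustrative placeholders, not the study's actual MSA definitions:

```python
from typing import NamedTuple

class BBox(NamedTuple):
    """Rectangular bounding box in decimal degrees."""
    south: float
    west: float
    north: float
    east: float

def contains(box: BBox, lat: float, lon: float) -> bool:
    """True if the geocoded point falls inside the box."""
    return box.south <= lat <= box.north and box.west <= lon <= box.east

def count_in_box(box: BBox, points):
    """Count (lat, lon) points falling inside the box."""
    return sum(1 for lat, lon in points if contains(box, lat, lon))
```

Note that a rectangular box only approximates an MSA's true boundary, so counts include some tweets from just outside the metropolitan area.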
keywords:
human dynamics; social media; urban informatics; pace of life; Twitter; ecological correlation; individual behavior
published:
2018-09-04
Teper, Thomas; Lenkart, Joe; Thacker, Mara; Coskun, Esra
(2018)
This dataset contains records of five years of interlibrary loan (ILL) transactions for the University of Illinois at Urbana-Champaign Library. It covers materials lent to other institutions during the period 2009-2013. It includes 169,890 transactions showing date; borrowing institution's type, state, and country; material format; imprint city, imprint country, and imprint region; call number; language; local circulation count; ILL lending count; and OCLC holdings count.
The dataset was generated by combining monthly ILL reports. The circulation and ILL lending fields were added from ILS records. The borrower region and imprint region fields were created based on the Title VI Region List. The OCLC holdings field was added from WorldCat records.
keywords:
Interlibrary Loan; ILL; Lending; OCLC Holding; Library; Area Studies; Collection; Circulation; Collaborative; Shared; Resource Sharing
published:
2017-11-15
Monthly water withdrawal records (total pumpage and per-capita consumption) for the City of Austin, Texas (2000-2014). Data were provided by Austin Water Utility.
keywords:
Water use; Water conservation
published:
2016-05-26
This data set includes survey responses collected during 2015 from academic libraries with library publishing services. Each institution responded to questions related to its use of user studies or information about readers in order to shape digital publication design, formats, and interfaces. Survey data was supplemented with institutional categories to facilitate comparison across institutional types.
keywords:
academic libraries; publishing; user experience; user studies
published:
2018-12-20
Dong, Xiaoru; Xie, Jingyi; Hoang, Linh
(2018)
File Name: WordsSelectedByInformationGain.csv
Data Preparation: Xiaoru Dong, Linh Hoang
Date of Preparation: 2018-12-12
Data Contributions: Jingyi Xie, Xiaoru Dong, Linh Hoang
Data Source: Cochrane systematic reviews published up to January 3, 2018 by 52 different Cochrane groups in 8 Cochrane group networks.
Associated Manuscript authors: Xiaoru Dong, Jingyi Xie, Linh Hoang, and Jodi Schneider.
Associated Manuscript, Working title: Machine classification of inclusion criteria from Cochrane systematic reviews.
Description: the file contains a list of 1655 informative words selected by applying information gain feature selection strategy.
Information gain is one of the methods commonly used for feature selection; it tells us how many bits of information the presence of a word provides for predicting the classes, and it can be computed with a standard formula [Jurafsky D, Martin JH. Speech and language processing. London: Pearson; 2014 Dec 30]. We ran information gain feature selection in Weka, a machine learning tool.
Notes: In order to reproduce the data in this file, please get the project code published on GitHub at https://github.com/XiaoruDong/InclusionCriteria and run the code following the instructions provided.
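Information gain for a binary word-presence feature is the entropy of the class labels minus the conditional entropy of the labels given the feature. A minimal sketch of that computation — this illustrates the standard formula, not the Weka implementation the authors actually used:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(has_word, labels):
    """IG of a boolean word-presence feature with respect to class labels.

    has_word[i] is truthy iff document i contains the word;
    labels[i] is document i's class.
    """
    n = len(labels)
    with_word = [y for x, y in zip(has_word, labels) if x]
    without = [y for x, y in zip(has_word, labels) if not x]
    conditional = (len(with_word) / n) * entropy(with_word) \
                + (len(without) / n) * entropy(without)
    return entropy(labels) - conditional
```

A word that perfectly separates two equally sized classes has an information gain of 1 bit; a word distributed independently of the classes has a gain of 0, and ranking words by this score yields a selected-word list like the one in this file.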
keywords:
Inclusion criteria; Randomized controlled trials; Machine learning; Systematic reviews
published:
2021-08-05
Lotspeich-Yadao, Michael
(2021)
This geodatabase serves two purposes: 1) to provide State of Illinois agencies with a fast resource for the preparation of maps and figures that require the use of shape or line files from federal agencies, the State of Illinois, or the City of Chicago, and 2) as a start for social scientists interested in exploring how geographic information systems (whether this is data visualization or geographically weighted regression) can bring new meaning to the interpretation of their data. All layer files included are relevant to the State of Illinois. Sources for this geodatabase include the U.S. Census Bureau, U.S. Geological Survey, City of Chicago, Chicago Public Schools, Chicago Transit Authority, Regional Transportation Authority, and Bureau of Transportation Statistics.
keywords:
State of Illinois; City of Chicago; Chicago Public Schools; GIS; Statistical tabulation areas; hydrography
published:
2022-10-04
One of the newest types of multimedia involves body-connected interfaces, usually termed haptics. Haptics may use stylus-based tactile interfaces, glove-based systems, handheld controllers, balance boards, or other custom-designed body-computer interfaces. How well do these interfaces help students learn Science, Technology, Engineering, and Mathematics (STEM)? We conducted an updated review of learning STEM with haptics, applying meta-analytic techniques to 21 published articles reporting on 53 effects for factual, inferential, procedural, and transfer STEM learning. This deposit includes the data extracted from those articles and comprises the raw data used in the meta-analytic analyses.
keywords:
Computer-based learning; haptic interfaces; meta-analysis