Illinois Data Bank Dataset Search Results
Results
published:
2025-06-05
Guan, Yingjun; Fang, Liri
(2025)
There are two files in this dataset.
File 1: AffiNorm
AffiNorm contains 1,001 rows, including one header row, randomly sampled from the MapAffil 2018 Dataset (https://doi.org/10.13012/B2IDB-2556310). Each row in the file corresponds to a particular author on a particular PubMed record, and contains the following 26 columns, comma-delimited. All columns are ASCII, except city, which contains Latin-1.
COLUMN DESCRIPTION
1. PMID: the PubMed identifier. int.
2. ORDER: the position of the author. int.
3. YEAR: the year of publication. int(4), eg: 1975.
4. affiliation: the affiliation string of the author, eg: Department of Pathology, University of Chicago, Illinois 60637.
5. annotation_type: the number of institutions annotated, denoted by S, M, O, or Z, where "S" (single) indicates one institution was annotated; "M" (multiple) indicates more than one institution was annotated; "O" (out of vocabulary or none) indicates no institution was annotated, although an institution was apparently mentioned; and "Z" indicates no institution was mentioned.
6. Institution: the standard name(s) of the annotated institution(s), according to ROR. If "S" (single institution), it is saved as a plain string, eg: University of Chicago; if "M", it is saved as a string that looks like a Python list, eg: ['Public Health Laboratory Service'; 'Centre for Applied Microbiology and Research']; if "O" or "Z", it is blank.
7. inst_type: the type of institution, according to ROR. The possible values are: education, funder, healthcare, company, archive, nonprofit, government, facility, other. An institution may have more than one type, eg: ['Education', 'Funder'].
8. type_edu: TRUE if the inst_type contains "Education"; FALSE otherwise.
9. RORid: ROR identifier(s), eg: https://ror.org/05hs6h993. When multiple, the order corresponds to Institution (column 6).
10. RORid_label: the standard name(s) of the annotated institution(s) according to ROR; same as Institution (column 6).
11. GRIDid: GRID identifier(s). eg: grid.170205.1
12. GRIDid_label: the standard name(s) of the annotated institution(s) according to GRID. eg: University of Chicago.
13. WikiDataid: WikiData identifier(s). eg: Q131252
14. WikiDataid_label: the standard name(s) of the annotated institution(s) according to WikiData. eg: University of Chicago
15. synonyms: a comma-separated list of variant names from InsVar (file 2), saved as a string, eg: University of Chicago, Chicago University, U of C, UChicago, uchicago.edu, U Chicago, ...
16. MapAffil-grid: GRID from the MapAffil 2018 Dataset.
17. MapAffil-grid_label: The standard name of institution from MapAffil 2018 Dataset.
18. judge_mapA: TRUE if GRIDid (column 11) contains MapAffil-grid (column 16); FALSE otherwise.
19. MapAffiltemporal-grid: GRID from the temporal version of MapAffil, http://abel.ischool.illinois.edu/data/MapAffilTempo2018.tsv.gz
20. MapAffiltemporal-grid_label: The standard name of institution from MapAffilTemporal 2018 Dataset.
21. judge_mapT: TRUE if GRIDid (column 11) contains MapAffiltemporal-grid (column 19); FALSE otherwise.
22. RORapi_query_id: ROR identifier from the ROR API (query endpoint)
23. RORapi_query_id_label: the standard name of the institution from the ROR API (query endpoint), saved as a string
24. judge_rorapi_affiliation: TRUE if RORid (column 9) contains RORapi_query_id (column 22); FALSE otherwise.
25. rorapi_affiliation_id: ROR identifier from the ROR API (affiliation endpoint).
26. judge_rorapi_affiliation: TRUE if RORid (column 9) contains rorapi_affiliation_id (column 25); FALSE otherwise.
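The judge_* columns above all apply the same containment rule: TRUE when the (possibly multi-valued) identifier column contains the candidate identifier. A minimal sketch in Python; the two-row sample is invented for illustration and abridged to the two columns compared by judge_mapA:

```python
import csv
import io

def judge(ids_field: str, candidate: str) -> bool:
    """Mirror the judge_* columns: TRUE when the (possibly
    multi-valued) identifier field contains the candidate ID."""
    return bool(candidate) and candidate in ids_field

# Invented two-row fragment in the AffiNorm layout, abridged to
# the columns used here. The second GRIDid value is a string that
# looks like a Python list, as described for multi-valued fields.
sample = io.StringIO(
    "GRIDid,MapAffil-grid\n"
    "grid.170205.1,grid.170205.1\n"
    "\"['grid.170205.1', 'grid.4991.5']\",grid.4991.5\n"
)
flags = [judge(r["GRIDid"], r["MapAffil-grid"]) for r in csv.DictReader(sample)]
```

A simple substring test is used here because the multi-valued fields are stored as strings; a stricter implementation could parse the list-like string first.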
File 2: insVar.json
InsVar is a supplementary dataset for AffiNorm that includes each institution ID and its redirected aliases from Wikipedia. The institution ID list is from GRID; the redirected aliases are retrieved via the Wikipedia API, for example: https://en.wikipedia.org/wiki/Special:WhatLinksHere?target=University+of+Illinois+Urbana-Champaign&namespace=&hidetrans=1&hidelinks=1&limit=100
In InsVar, the data is saved in a Python-dictionary-style JSON format. The key is the GRID identifier, for example "grid.1001.0" (Australian National University), and the value is a list of redirected alias strings.
{"grid.1001.0": ["ANU", "ANU College", "ANU College of Arts and Social Sciences", "ANU College of Asia and the Pacific", "ANU Union", "ANUSA", "Asia Pacific Week", "Australia National University", "Australian Forestry School", "the Australian National University", ...], "grid.1002.3": ...}
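Because InsVar is plain JSON keyed by GRID identifier, an alias lookup is a one-liner. A hedged sketch; the fragment below is inlined rather than read from insVar.json:

```python
import json

# Inline fragment of the insVar.json structure shown above.
fragment = '{"grid.1001.0": ["ANU", "Australia National University"]}'
ins_var = json.loads(fragment)

# In practice, load the real file instead:
#   with open("insVar.json", encoding="utf-8") as fh:
#       ins_var = json.load(fh)
aliases = ins_var.get("grid.1001.0", [])
```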
keywords:
PubMed; MEDLINE; Digital Libraries; Bibliographic Databases; Institution Names; Author Affiliations; Institution Name Ambiguity; Authority files
published:
2025-04-25
Tassitano, Rafael; Chakraborty, Shreyonti
(2025)
This is an Excel file containing data about the physical environments of four Brazilian schools and the average daily minutes/day of physical activity and sedentary behavior exhibited by schoolchildren during school hours.
The following key describes the basic variables:
Subject IDs and Characteristics
Subject_ID: ID of Subject
total_days: Total number of days subject participated in experiment
Gender: Gender of subject
Age: Age of subject
School IDs and Characteristics
ID_School = ID of School
school1 = 1 if ID_School = 1, else = 0
school2 = 1 if ID_School = 2, else = 0
school3 = 1 if ID_School = 3, else = 0
school4 = 1 if ID_School = 4, else = 0
TotalSiteArea: Total Site Area on School Campus
PatioArea: Area of Patio(s)
CourtyardArea: Area of Courtyard(s)
TotalOpenArea: Total Area of Open Spaces on Campus
Class: Number of Sections in the School
Population: Total Number of Students Enrolled in the School
keywords:
school environment; physical activity
published:
2022-12-05
Ng, Yee Man Margaret ; Taneja, Harsh
(2022)
These are similarity matrices of countries based on different modalities of web use: Alexa website traffic, trending videos on YouTube, and Twitter trends. Each matrix aggregates one month of data.
keywords:
Global Internet Use
published:
2019-09-17
Mishra, Shubhanshu
(2019)
Trained models for multi-task multi-dataset learning for sequence tagging in tweets.
Sequence tagging tasks include POS, NER, Chunking, and SuperSenseTagging.
Models were trained using: <a href="https://github.com/socialmediaie/SocialMediaIE/blob/master/SocialMediaIE/scripts/multitask_multidataset_experiment.py">https://github.com/socialmediaie/SocialMediaIE/blob/master/SocialMediaIE/scripts/multitask_multidataset_experiment.py</a>
See <a href="https://github.com/socialmediaie/SocialMediaIE">https://github.com/socialmediaie/SocialMediaIE</a> and <a href="https://socialmediaie.github.io">https://socialmediaie.github.io</a> for details.
If you are using this data, please also cite the related article:
Shubhanshu Mishra. 2019. Multi-dataset-multi-task Neural Sequence Tagging for Information Extraction from Tweets. In Proceedings of the 30th ACM Conference on Hypertext and Social Media (HT '19). ACM, New York, NY, USA, 283-284. DOI: https://doi.org/10.1145/3342220.3344929
keywords:
twitter; deep learning; machine learning; trained models; multi-task learning; multi-dataset learning
published:
2021-05-07
Prepared by Vetle Torvik 2021-05-07
The dataset comes as a single tab-delimited Latin-1 encoded file (only the City column uses non-ASCII characters).
• How was the dataset created?
The dataset is based on a snapshot of PubMed (which includes Medline and PubMed-not-Medline records) taken in December 2018 (NLM's baseline 2018 plus updates throughout 2018). Affiliations are linked to a particular author on a particular article. Prior to 2014, NLM recorded the affiliation of the first author only. However, MapAffil 2018 covers some PubMed records lacking affiliations that were harvested elsewhere: from PMC (e.g., PMID 22427989), NIH grants (e.g., 1838378), and Microsoft Academic Graph and ADS (e.g., 5833220). Affiliations are pre-processed (e.g., transliterated into ASCII from UTF-8 and HTML), so they may differ (sometimes a lot; see PMID 27487542) from PubMed records. All affiliation strings were processed using the MapAffil procedure to identify and disambiguate the most specific place-name, as described in:
Torvik VI. MapAffil: A bibliographic tool for mapping author affiliation strings to cities and their geocodes worldwide. D-Lib Magazine 2015; 21 (11/12). 10p
• Look for Fig. 4 in the following article for coverage statistics over time:
Palmblad, M., Torvik, V.I. Spatiotemporal analysis of tropical disease research combining Europe PMC and affiliation mapping web services. Trop Med Health 45, 33 (2017). <a href="https://doi.org/10.1186/s41182-017-0073-6">https://doi.org/10.1186/s41182-017-0073-6</a>
Expect to see big upticks in coverage of PMIDs around 1988 and for non-first authors in 2014.
• The code and back-end data are periodically updated and made available for query by PMID at http://abel.ischool.illinois.edu/cgi-bin/mapaffil/search.py
• What is the format of the dataset?
The dataset contains 52,931,957 rows (plus a header row). Each row (line) in the file has a unique PMID and author order, and contains the following eighteen columns, tab-delimited. All columns are ASCII, except city which contains Latin-1.
1. PMID: positive non-zero integer; int(10) unsigned
2. au_order: positive non-zero integer; smallint(4)
3. lastname: varchar(80)
4. firstname: varchar(80); NLM started including these in 2002 but many have been harvested from outside PubMed
5. initial_2: middle name initial
6. orcid: From 2019 ORCID Public Data File https://orcid.org/ and from PubMed XML
7. year: year of the publication
8. journal: name of the journal in which the article was published
9. affiliation: the author's affiliation string
10. disciplines: extracted from departments, divisions, schools, laboratories, centers, etc. that occur on at least 100 unique affiliations across the dataset, some with standardization (e.g., 1770799), English translations (e.g., 2314876), or spelling corrections (e.g., 1291843)
11. grid: inferred using a high-recall technique focused on educational institutions (but, for experimental purposes, includes a few select hospitals, national institutes/centers, international companies, governmental agencies, and 200+ other IDs [RINGGOLD, Wikidata, ISNI, VIAF, http] for institutions not in GRID). Based on 2019 GRID version https://www.grid.ac/
12. type: EDU, HOS, EDU-HOS, ORG, COM, GOV, MIL, UNK
13. city: varchar(200); typically 'city, state, country' but could include further subdivisions; unresolved ambiguities are concatenated by '|'
14. state: provided for Australia, Canada, and the USA (which includes territories like PR, GU, AS, and post codes like AE and AA)
15. country
16. lat: at most 3 decimals (only available when city is not a country or state)
17. lon: at most 3 decimals (only available when city is not a country or state)
18. fips: varchar(5); for USA only retrieved by lat-lon query to https://geo.fcc.gov/api/census/block/find
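Given the size (nearly 53 million rows) and the Latin-1 encoding noted above, the file is best streamed one row at a time rather than loaded whole. A minimal sketch; the demo fragment is invented and uses only a subset of the 18 columns:

```python
import csv
import io

def iter_mapaffil(fh):
    """Yield one dict per author-name instance from a MapAffil-style
    tab-delimited file. Open the real file with encoding='latin-1',
    since the city column contains Latin-1 characters."""
    yield from csv.DictReader(fh, delimiter="\t")

# Tiny invented fragment with a subset of the 18 columns.
demo = io.StringIO("PMID\tau_order\tcity\n123\t1\tChicago, Illinois, USA\n")
rows = list(iter_mapaffil(demo))
```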
keywords:
PubMed; MEDLINE; Digital Libraries; Bibliographic Databases; Author Affiliations; Geographic Indexing; Place Name Ambiguity; Geoparsing; Geocoding; Toponym Extraction; Toponym Resolution; institution name disambiguation
published:
2020-05-20
Origin Ventures Academy for Entrepreneurial Leadership, Gies College of Business
(2020)
This dataset is a snapshot of the presence and structure of entrepreneurship education in U.S. four-year colleges and universities in 2015, including co-curricular activities and related infrastructure. Public, private not-for-profit and for-profit institutions are included, as are specialized four-year institutions. The dataset provides insight into the presence of entrepreneurship education both within business units and in other units of college campuses. Entrepreneurship is defined broadly, to include small business management and related career-focused options.
keywords:
Entrepreneurship education; Small business education; Ewing Marion Kauffman Foundation; csv
published:
2025-02-20
Zhou, Xiaoran; Zheng, Heng
(2025)
To gather news articles from the web that discuss the Cochrane Review (DOI: 10.1002/14651858.CD006207.pub6), we retrieved articles on August 1, 2023 using Altmetric.com's Altmetric Explorer. We selected all articles that were written in English, published in the United States, and had a publication date <b>on or after March 10, 2023</b> (according to the "Mention Date" from Altmetric.com). This date is significant as it is when Cochrane issued a statement (https://www.cochrane.org/news/statement-physical-interventions-interrupt-or-reduce-spread-respiratory-viruses-review) about the "misleading interpretation" of the Cochrane Review made by news articles.
A previously published dataset for "Arguing about Controversial Science in the News: Does Epistemic Uncertainty Contribute to Information Disorder?" (DOI: 10.13012/B2IDB-4781172_V1) contains annotation of the news articles published before March 10, 2023. Our dataset annotates the news published on or after March 10, 2023.
The Altmetric_data.csv file describes the selected news articles with both data exported from Altmetric Explorer and data we manually added.
Data exported from Altmetric Explorer:
- Publication date of the news article
- Title of the news article
- Source/publication venue of the news article
- URL
- Country
Data we manually added:
- Whether the article is accessible
- The date we checked the article
- The corresponding ID of the article in MAXQDA
For each article from Altmetric.com, we first tried to use the Web Collector for MAXQDA to download the article from the website and imported it into MAXQDA (version 22.8.0).
We manually extracted direct quotations from the articles using MAXQDA.
We included surrounding words and sentences around direct quotations for context where needed.
We manually added codes and code categories in MAXQDA to identify the individuals (chief editors of the Cochrane Review, government agency representatives, journalists, and other experts such as physicians) or organizations (government agencies, other organizations, and research publications) who were quoted.
The MAXQDA_data.csv file contains excerpts from the news articles that contain the direct quotations we annotated.
For each excerpt, we included the following information:
- MAXQDA ID of the document from which the excerpt originates
- The collection date and source of the document
- The code we assigned to the excerpt
- The code category
- The excerpt itself
keywords:
altmetrics; MAXQDA; masks for COVID-19; scientific controversies; news articles
published:
2022-07-11
Jeng, Amos; Bosch, Nigel; Perry, Michelle
(2022)
This dataset was developed as part of an online survey study that explores student characteristics that may predict what one finds helpful in replies to requests for help posted to an online college course discussion forum. 223 college students enrolled in an introductory statistics course were surveyed on their sense of belonging to their course community, as well as how helpful they found 20 examples of replies to requests for help posted to a statistics course discussion forum.
keywords:
help-giving; discussion forums; sense of belonging; college student
published:
2022-07-25
A set of species entity mentions derived from an NERC dataset analyzing 900 synthetic biology articles published by the ACS. This data is associated with the Synthetic Biology Knowledge System repository (https://web.synbioks.org/). The data in this dataset are raw mentions from the NERC data.
keywords:
synthetic biology; NERC data; species mentions
published:
2022-07-25
Related to the raw entity mentions, this dataset represents the effects of the data cleaning process and collates all of the entity mentions which were too ambiguous to successfully link to the NCBI's taxonomy identifier system.
keywords:
synthetic biology; NERC data; species mentions; ambiguous entities
published:
2022-07-25
This dataset represents the results of manual cleaning and annotation of the entity mentions contained in the raw dataset (https://doi.org/10.13012/B2IDB-4950847_V1). Each mention has been consolidated and linked to an identifier for a matching concept from the NCBI's taxonomy database.
keywords:
synthetic biology; NERC data; species mentions; cleaned data; NCBI TaxonID
published:
2023-07-20
Atallah, Shady; Huang, Ju-Chin; Leahy, Jessica; Bennett, Karen P.
(2023)
This is a dataset from a choice experiment survey on family forest landowner preferences for managing invasive species.
keywords:
ecosystem services; forests; invasive species control; neighborhood effect
published:
2022-04-21
This dataset was created based on the publicly available microdata from PNS-2019, a national health survey conducted by the Instituto Brasileiro de Geografia e Estatistica (IBGE, Brazilian Institute of Geography and Statistics). IBGE is a federal agency responsible for the official collection of statistical information in Brazil – essentially, the Brazilian census bureau. Data on selected variables focusing on biopsychosocial domains related to pain prevalence, limitations and treatment are available. The Fundação Instituto Oswaldo Cruz has detailed information about the PNS, including questionnaires, survey design, and datasets (www.pns.fiocruz.br). The microdata can be found on the IBGE website (https://www.ibge.gov.br/estatisticas/downloads-estatisticas.html?caminho=PNS/2019/Microdados/Dados).
keywords:
back pain; health status disparities; biopsychosocial; Brazil
published:
2021-04-28
An Atlas.ti dataset and accompanying documentation of a thematic analysis of problems and opportunities associated with retracted research and its continued citation.
keywords:
Retraction; Citation; Problems and Opportunities
published:
2021-11-05
Keralis, Spencer D. C.; Yakin, Syamil
(2021)
This data set contains survey results from a 2021 survey of University of Illinois University Library patrons who identify as transgender or gender non-conforming conducted as part of the Becoming a Trans Inclusive Library Project to assess the experiences of transgender patrons seeking information and services in the University Library. Survey instruments are available in the IDEALS repository: http://hdl.handle.net/2142/110081.
keywords:
transgender awareness; academic library; gender identity awareness; patron experience
published:
2019-06-13
Rezapour, Rezvaneh; Diesner, Jana
(2019)
This lexicon is the expanded/enhanced version of the Moral Foundation Dictionary created by Graham and colleagues (Graham et al., 2013).
Our Enhanced Morality Lexicon (EML) contains a list of 4,636 morality related words.
This lexicon was used in the following paper - please cite this paper if you use this resource in your work.
Rezapour, R., Shah, S., & Diesner, J. (2019). Enhancing the measurement of social effects by capturing morality. Proceedings of the 10th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis (WASSA). Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), Minneapolis, MN.
In addition, please consider citing the original MFD paper:
<a href="https://doi.org/10.1016/B978-0-12-407236-7.00002-4">Graham, J., Haidt, J., Koleva, S., Motyl, M., Iyer, R., Wojcik, S. P., & Ditto, P. H. (2013). Moral foundations theory: The pragmatic validity of moral pluralism. In Advances in experimental social psychology (Vol. 47, pp. 55-130)</a>.
keywords:
lexicon; morality
published:
2022-07-25
This dataset is derived from the raw dataset (https://doi.org/10.13012/B2IDB-4163883_V1) and collects entity mentions that were manually determined to be noisy, non-chemical entities.
keywords:
synthetic biology; NERC data; chemical mentions; noisy entities
published:
2020-06-12
Fu, Yuanxi; Hsiao, Tzu-Kun
(2020)
This is a network of 14 systematic reviews on the salt controversy and their included studies. Each edge in the network represents an inclusion from one systematic review to an article. Systematic reviews were collected from Trinquart (Trinquart, L., Johns, D. M., & Galea, S. (2016). Why do we think we know what we know? A metaknowledge analysis of the salt controversy. International Journal of Epidemiology, 45(1), 251–260. https://doi.org/10.1093/ije/dyv184 ).
<b>FILE FORMATS</b>
1) Article_list.csv - Unicode CSV
2) Article_attr.csv - Unicode CSV
3) inclusion_net_edges.csv - Unicode CSV
4) potential_inclusion_link.csv - Unicode CSV
5) systematic_review_inclusion_criteria.csv - Unicode CSV
6) Supplementary Reference List.pdf - PDF
<b>ROW EXPLANATIONS</b>
1) Article_list.csv - Each row describes a systematic review or included article.
2) Article_attr.csv - Each row is the attributes of a systematic review/included article.
3) inclusion_net_edges.csv - Each row represents an inclusion from a systematic review to an article.
4) potential_inclusion_link.csv - Each row shows the available evidence base of a systematic review.
5) systematic_review_inclusion_criteria.csv - Each row is the inclusion criteria of a systematic review.
6) Supplementary Reference List.pdf - Each item is a bibliographic record of a systematic review/included paper.
<b>COLUMN HEADER EXPLANATIONS</b>
<b>1) Article_list.csv:</b>
ID - Numeric ID of a paper
paper assigned ID - ID of the paper from Trinquart et al. (2016)
Type - Systematic review / primary study report
Study Groupings - Groupings for related primary study reports from the same study, from Trinquart et al. (2016) (if applicable, otherwise blank)
Title - Title of the paper
year - Publication year of the paper
Attitude - Scientific opinion about the salt controversy from Trinquart et al. (2016)
Doi - DOIs of the paper. (if applicable, otherwise blank)
Retracted (Y/N) - Whether the paper was retracted or withdrawn (Y). Blank if not retracted or withdrawn.
<b>2) Article_attr.csv:</b>
ID - Numeric ID of a paper
year - Publication year
Attitude - Scientific opinion about the salt controversy from Trinquart et al. (2016)
Type - Systematic review/ primary study report
<b>3) inclusion_net_edges.csv:</b>
citing_ID - The numeric ID of a systematic review
cited_ID - The numeric ID of the included articles
<b>4) potential_inclusion_link.csv:</b>
This data was translated from the Sankey diagram given in Trinquart et al. (2016) as Web Figure 4. Each row indicates a systematic review and each column indicates a primary study. In the matrix, "p" indicates that a given primary study had been published as of the search date of a given systematic review.
<b>5) systematic_review_inclusion_criteria.csv:</b>
ID - The numeric IDs of systematic reviews
paper assigned ID - ID of the paper from Trinquart et al. (2016)
attitude - The systematic review's scientific opinion about the salt controversy, from Trinquart et al. (2016)
No. of studies included - Number of articles included in the systematic review
Study design - Study designs to include, per inclusion criteria
population - Populations to include, per inclusion criteria
Exposure/Intervention - Exposures/Interventions to include, per inclusion criteria
outcome - Study outcomes required for inclusion, per inclusion criteria
Language restriction - Report languages to include, per inclusion criteria
follow-up period - Follow-up period required for inclusion, per inclusion criteria
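As a sketch of how the edge list can be used, the snippet below counts included studies per systematic review from a file in the inclusion_net_edges.csv layout. The column names come from the description above; the three demo rows are invented:

```python
import csv
import io
from collections import defaultdict

def inclusion_counts(fh):
    """Map each systematic review (citing_ID) to the number of
    distinct articles it includes (cited_ID)."""
    included = defaultdict(set)
    for row in csv.DictReader(fh):
        included[row["citing_ID"]].add(row["cited_ID"])
    return {sr: len(cited) for sr, cited in included.items()}

# Invented three-edge demo; the real file is Unicode CSV.
demo = io.StringIO("citing_ID,cited_ID\n1,10\n1,11\n2,10\n")
counts = inclusion_counts(demo)
```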
keywords:
systematic reviews; evidence synthesis; network visualization; tertiary studies
published:
2025-09-08
Si, Luyang; Salami, Malik Oyewale; Schneider, Jodi
(2025)
This project evaluates the quality of retraction indexing metadata in Crossref. We investigated 208 DOIs that were indexed as retracted in Crossref in our April 2023 union list (Schneider et al., 2023), but were no longer indexed as retracted in the July 2024 union list (Salami et al., 2024), despite still being covered in the Crossref database. Therefore, we manually checked the current retraction status of these 208 DOIs on their publishers’ websites to ascertain their actual status.
keywords:
Crossref; Data Quality; Retraction indexing; Retracted papers; Retraction notices; Retraction status; RISRS
published:
2021-03-17
Imker, Heidi J; Luong, Hoa; Mischo, William H; Schlembach, Mary C; Wiley, Chris
(2021)
This dataset was developed as part of a study that assessed data reuse. Through bibliometric analysis, corresponding authors of highly cited papers published in 2015 at the University of Illinois at Urbana-Champaign in nine STEM disciplines were identified and then surveyed to determine if data were generated for their article and their knowledge of reuse by other researchers. Second, the corresponding authors who cited those 2015 articles were identified and surveyed to ascertain whether they reused data from the original article and how that data was obtained. The project goal was to better understand data reuse in practice and to explore if research data from an initial publication was reused in subsequent publications.
keywords:
data reuse; data sharing; data management; data services; Scopus API
published:
2021-04-22
Torvik, Vetle; Smalheiser, Neil
(2021)
Author-ity 2018 dataset
Prepared by Vetle Torvik Apr. 22, 2021
The dataset is based on a snapshot of PubMed taken in December 2018 (NLM's baseline 2018 plus updates throughout 2018). It contains a total of 29.1 million article records and 114.2 million author name instances. Each instance of an author name is uniquely represented by the PMID and the position on the paper (e.g., 10786286_3 is the third author name on PMID 10786286). Thus, each cluster is represented by a collection of author name instances. The instances were first grouped into "blocks" by last name and first name initial (including some close variants), and then each block was separately subjected to clustering. The resulting clusters are provided in two different formats, the first in a file with only IDs and PMIDs, and the second in a file with cluster summaries:
####################
File 1: au2id2018.tsv
####################
Each line corresponds to an author name instance (PMID and Author name position) with an Author ID. It has the following tab-delimited fields:
1. Author ID
2. PMID
3. Author name position
########################
File 2: authority2018.tsv
#########################
Each line corresponds to a predicted author-individual represented by cluster of author name instances and a summary of all the corresponding papers and author name variants. Each cluster has a unique Author ID (the PMID of the earliest paper in the cluster and the author name position). The summary has the following tab-delimited fields:
1. Author ID (or cluster ID) e.g., 3797874_1 represents a cluster where 3797874_1 is the earliest author name instance.
2. cluster size (number of author name instances on papers)
3. name variants separated by '|' with counts in parentheses. Each variant has the format lastname_firstname middleinitial, suffix
4. last name variants separated by '|'
5. first name variants separated by '|'
6. middle initial variants separated by '|' ('-' if none)
7. suffix variants separated by '|' ('-' if none)
8. email addresses separated by '|' ('-' if none)
9. ORCIDs separated by '|' ('-' if none). From 2019 ORCID Public Data File https://orcid.org/ and from PubMed XML
10. range of years (e.g., 1997-2009)
11. Top 20 most frequent affiliation words (after stoplisting and tokenizing; some phrases are also made) with counts in parentheses; separated by '|'; ('-' if none)
12. Top 20 most frequent MeSH terms (after stoplisting) with counts in parentheses; separated by '|'; ('-' if none)
13. Journal names with counts in parentheses; separated by '|'
14. Top 20 most frequent title words (after stoplisting and tokenizing) with counts in parentheses; separated by '|'; ('-' if none)
15. Co-author names (lowercased last name and first/middle initials) with counts in parentheses; separated by '|'; ('-' if none)
16. Author name instances (PMID_auno, separated by '|')
17. Grant IDs (after normalization; '-' if none; separated by '|')
18. Total number of times cited. (Citations are based on references harvested from open sources such as PMC).
19. h-index
20. Citation counts (e.g., for h-index): PMIDs by the author that have been cited, with total citation counts in parentheses; separated by '|'
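Several of the summary fields above share the same '|'-separated "value (count)" convention, with '-' marking an empty field. A small hypothetical parser for that convention (the example name variants are invented):

```python
import re

def parse_counted(field: str):
    """Split a '|'-separated field of "value (count)" entries, as
    used by several authority2018.tsv summary columns. '-' denotes
    an empty field; entries without a count default to 1."""
    if field == "-":
        return []
    entries = []
    for part in field.split("|"):
        m = re.match(r"(.+)\((\d+)\)\s*$", part.strip())
        if m:
            entries.append((m.group(1).strip(), int(m.group(2))))
        else:
            entries.append((part.strip(), 1))
    return entries

# Invented example in the name-variant style of field 3.
variants = parse_counted("torvik_vetle i (12)|torvik_v (3)")
```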
keywords:
author name disambiguation; PubMed
published:
2021-05-07
The dataset is based on a snapshot of PubMed taken in December 2018 (NLM's baseline 2018 plus updates throughout 2018) and, for ORCIDs, primarily the 2019 ORCID Public Data File https://orcid.org/.
Matching an ORCID to an individual author name on a PMID is a non-trivial process. Anyone can create an ORCID and claim to have contributed to any published work. Many records claim too many articles and most claim too few. Even though ORCID records are (most?) often populated by author name searches in popular bibliographic databases, there is no confirmation that the person's name is listed on the article. This dataset is the product of mapping ORCIDs to individual author names on PMIDs, even when the ORCID name does not match any author name on the PMID, and when there are multiple (good) candidate author names. The algorithm avoids assigning the ORCID to an article when there are no good candidates or when there are multiple equally good matches. For some ORCIDs that clearly claim too much, it triggers a very strict matching procedure (for ORCIDs that claim too much but where the majority appear correct, e.g., 0000-0002-2788-5457), and sometimes deletes ORCIDs altogether when all (or nearly all) of their claimed PMIDs appear incorrect. When an individual clearly has multiple ORCIDs, it deletes the least complete of them (e.g., 0000-0002-1651-2428 vs 0000-0001-6258-4628). It should be noted that the ORCIDs that claim too much are not necessarily due to nefarious or trolling intentions, even though a few appear so. Certainly many are due to laziness, such as claiming everything with a particular last name. Some cases appear to be due to test engineers (e.g., 0000-0001-7243-8157; 0000-0002-1595-6203), or librarians assisting faculty (e.g., 0000-0003-3289-5681), or group/laboratory IDs (0000-0003-4234-1746), or having contributed to an article in capacities other than authorship, such as an Investigator, an Editor, or part of a Collective (e.g., 0000-0003-2125-4256 as part of the FlyBase Consortium on PMID 22127867), or as a "Reply To", in which case the identity of the article and authors might be conflated.
The NLM has, in the past, also limited the total number of authors indexed. The dataset certainly has errors, but I have taken great care to fix some glaring ones (individuals who claim too much), while still capturing authors who have published under multiple names and have not explicitly listed them in their ORCID profile. The final dataset provides a "matchscore" that could be used for further clean-up.
Four files:
person.tsv: 7,194,692 rows, including header
1. orcid
2. lastname
3. firstname
4. creditname
5. othernames
6. otherids
7. emails
employment.tsv: 2,884,981 rows, including header
1. orcid
2. putcode
3. role
4. start-date
5. end-date
6. id
7. source
8. dept
9. name
10. city
11. region
12. country
13. affiliation
education.tsv: 3,202,253 rows, including header
1. orcid
2. putcode
3. role
4. start-date
5. end-date
6. id
7. source
8. dept
9. name
10. city
11. region
12. country
13. affiliation
pubmed2orcid.tsv: 13,133,065 rows, including header
1. PMID
2. au_order (author name position on the article)
3. orcid
4. matchscore (see below)
5. source: orcid (2019 ORCID Public Data File https://orcid.org/), pubmed (NLMs distributed XML files), or patci (an earlier version of ORCID with citations processed through the Patci tool)
12,037,375 from orcid; 1,065,892 from PubMed XML; 29,797 from Patci
matchscore:
000: lastname, firstname and middle init match (e.g., Eric T MacKenzie vs
00: lastname, firstname match (e.g., Keith Ward)
0: lastname, firstname reversed match (e.g., Conde Santiago vs Santiago Conde)
1: lastname, first and middle init match (e.g., L. F. Panchenko)
11: lastname and partial firstname match (e.g., Mike Boland vs Michael Boland or Mel Ziman vs Melanie Ziman)
12: lastname and first init match
15: 3 part lastname and firstname match (David Grahame Hardie vs D Grahame Hardie)
2: lastname match and multipart firstname initial match (e.g., Maria Dolores Suarez Ortega vs M. D. Suarez)
22: partial lastname match and firstname match (e.g., Erika Friedmann vs Erika Friedman)
23: e.g., Antonio Garcia Garcia vs A G Garcia
25: Allan Downie vs J A Downie
26: Oliver Racz vs Oliver Bacz
27: Rita Ostrovskaya vs R U Ostrovskaia
29: Andrew Staehelin vs L A Staehlin
3: M Tronko vs N D Tron'ko
4: Sharon Dent (Also known as Sharon Y.R. Dent; Sharon Y Roth; Sharon Yoder) vs Sharon Yoder
45: Okulov Aleksei vs A B Okulov
48: Maria Del Rosario Garcia De Vicuna Pinedo vs R Garcia-Vicuna
49: Anatoliy Ivashchenko vs A Ivashenko
5: lastname match only (weak match, but sometimes captures an alternative first name for better subsequent matches); e.g., Bill Hieb vs W F Hieb
6: firstname match only (weak match, but sometimes captures an alternative last name for better subsequent matches); e.g., Maria Borawska vs Maria Koscielak
7: last or first name match on "other names"; e.g., Hromokovska Tetiana (Also known as Gromokovskaia, T. S., Громоковська Тетяна) vs T Gromokovskaia
77: Siva Subramanian vs Kolinjavadi N. Sivasubramanian
88: no name in the ORCID record, but the match was caught by the uniqueness of a name across the paper (at least 90% and 2 more than the next most common name)
prefix:
C = ambiguity reduced (possibly eliminated) using city match (e.g., H Yang on PMID 24972200)
I = ambiguity eliminated by excluding investigators (i.e., one author and one or more investigators with that name)
T = ambiguity eliminated using PubMed pos (T for tie-breaker)
W = ambiguity resolved by authority2018
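Putting the encoding above together: a matchscore value is an optional prefix letter (C, I, T, or W) followed by a numeric code. A minimal sketch of splitting the two parts and flagging strong matches; the choice of which codes count as "strong" (here the full or near-full name matches 000, 00, 0, and 1) is my own assumption for illustration, not part of the dataset:

```python
def split_matchscore(raw):
    """Split a matchscore like 'C00' into (prefix, code).

    Prefix letters C/I/T/W record how ambiguity was resolved;
    the remaining characters are the numeric name-match level.
    """
    i = 0
    while i < len(raw) and raw[i] in "CITW":
        i += 1
    return raw[:i], raw[i:]

# Assumed "strong" set for this sketch: full or near-full name matches.
STRONG = {"000", "00", "0", "1"}

def is_strong(raw):
    """True if the numeric part of the matchscore is a near-full name match."""
    _, code = split_matchscore(raw)
    return code in STRONG

print(split_matchscore("C00"))  # ('C', '00')
print(is_strong("T12"))         # False
```

A clean-up pass could, for example, keep only rows where is_strong(matchscore) is true, at the cost of dropping the weaker matches (codes 5, 6, 7, 88) the author notes are sometimes still useful.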
published:
2021-07-20
Fu, Yuanxi; Schneider, Jodi
(2021)
This dataset contains data from the extreme-disagreement analysis described in the paper “Aaron M. Cohen, Jodi Schneider, Yuanxi Fu, Marian S. McDonagh, Prerna Das, Arthur W. Holt, Neil R. Smalheiser, 2021, Fifty Ways to Tag your Pubtypes: Multi-Tagger, a Set of Probabilistic Publication Type and Study Design Taggers to Support Biomedical Indexing and Evidence-Based Medicine.” In this analysis, experts on our team carried out an independent formal review and consensus process for extreme disagreements between MEDLINE indexing and model predictive scores. “Extreme disagreements” covered two situations: (1) an abstract was MEDLINE-indexed as a publication type but received a low score for that publication type, and (2) an abstract received a high score for a publication type but lacked the corresponding MEDLINE index term. “High predictive score” is defined as the top 100 high-scoring abstracts, and “low predictive score” as the bottom 100 low-scoring abstracts. Three publication types were analyzed: CASE_CONTROL_STUDY, COHORT_STUDY, and CROSS_SECTIONAL_STUDY. Results were recorded in three Excel workbooks, named after the publication types: case_control_study.xlsx, cohort_study.xlsx, and cross_sectional_study.xlsx.
The analysis shows that, when the tagger gave a high predictive score (>0.9) on articles that lacked a corresponding MEDLINE indexing term, independent review suggested that the model assignment was correct in almost all cases (CROSS_SECTIONAL_STUDY (99%), CASE_CONTROL_STUDY (94.9%), and COHORT_STUDY (92.2%)). Conversely, when articles received MEDLINE indexing but model predictive scores were very low (<0.1), independent review suggested that the model assignment was correct in the majority of cases: CASE_CONTROL_STUDY (85.4%), COHORT_STUDY (76.3%), and CROSS_SECTIONAL_STUDY (53.6%).
Based on the extreme disagreement analysis, we identified a number of false positives (FPs) and false negatives (FNs). For case control study, there were 5 FPs and 14 FNs. For cohort study, there were 7 FPs and 22 FNs. For cross-sectional study, there was 1 FP and there were 45 FNs. We reviewed and grouped them based on the patterns we noticed, providing clues for further improving the models. This dataset reports the instances of FPs and FNs along with their categorizations.
keywords:
biomedical informatics; machine learning; evidence based medicine; text mining
published:
2021-05-10
This dataset contains data used in publication "Institutional Data Repository Development, a Moving Target" submitted to Code4Lib Journal. It is a tabular data file describing attributes of data files in datasets published in Illinois Data Bank 2016-04-01 to 2021-04-01.
keywords:
institutional repository
published:
2024-10-18
Exhaustive species inventory of suburban wetland complex in northeast Ohio (Cuyahoga County).
keywords:
floristic survey; wetland complex; comprehensive species list