Author-implicit journal, MeSH, title-word, and affiliation-word pairs based on Author-ity 2009

Torvik, Vetle

doi:10.13012/B2IDB-4742014_V1

Illinois Data Bank - Dataset

Cite this dataset:

Torvik, Vetle (2018): Author-implicit journal, MeSH, title-word, and affiliation-word pairs based on Author-ity 2009. University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-4742014_V1

Use this persistent URL to link to this dataset:

https://doi.org/10.13012/B2IDB-4742014_V1

Dataset Description	Contains a series of datasets that score pairs of tokens (words, journal names, and controlled vocabulary terms) based on how often they co-occur within versus across authors' collections of papers. The tokens derive from four fields of PubMed papers: journal, affiliation, title, and MeSH (Medical Subject Headings). Thus, there are 10 datasets, one for each pair of token types (four same-type and six cross-type): affiliation-word vs affiliation-word, affiliation-word vs journal, affiliation-word vs mesh, affiliation-word vs title-word, mesh vs mesh, mesh vs journal, etc.

Using authors to link papers, and in turn pairs of tokens, is an alternative to the usual within-document co-occurrence and to linking papers by other means such as citations. The advantage is particularly striking for journal pairs: a paper almost always appears in a single journal, so within-document co-occurrences are 0, i.e., useless.

The tokens are taken from the Author-ity 2009 dataset, which has a cluster of papers for each inferred author and a summary of each field. For MeSH, title-words, and affiliation-words, that summary includes only the top-20 most frequent tokens after field-specific stoplisting (e.g., "university" is stoplisted from affiliation and "Humans" is stoplisted from MeSH).

The score for a pair of tokens A and B is defined as follows. Suppose Ai and Bi are the numbers of occurrences of tokens A and B, respectively, across the i-th author's papers, and each sum runs over all authors i. Then
nA = sum(Ai); nB = sum(Bi)
nAB = sum(Ai*Bi) if A is not equal to B; nAA = sum(Ai*(Ai-1)/2) otherwise
nAnB = nA*nB if A is not equal to B; nAnA = nA*(nA-1)/2 otherwise
score = 1000000*nAB/nAnB if A is not equal to B; 1000000*nAA/nAnA otherwise

Token pairs are excluded when: score < 5, or nA < cut-off, or nB < cut-off, or nAB < cut-offAB.
The cut-offs differ by token type and can be inferred from the datasets. For example, cut-off = 200 and cut-offAB = 20 for journal pairs.
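
The score definition above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code used to build the datasets: the function names `pair_score` and `keep_pair` are hypothetical, the per-author counts are passed in as plain lists (one entry per author), and the default cut-offs are the journal-pair values quoted above.

```python
from typing import Optional, Sequence, Tuple


def pair_score(counts_a: Sequence[int],
               counts_b: Optional[Sequence[int]] = None) -> Tuple[float, int, int]:
    """Return (score in ppm, nAB, nAnB) for tokens A and B.

    counts_a[i] and counts_b[i] are the occurrences of A and B across the
    i-th author's papers. Pass counts_b=None to score a token against
    itself (the A == B case).
    """
    if counts_b is None:
        # Same token: count unordered pairs of occurrences.
        n_a = sum(counts_a)
        n_ab = sum(a * (a - 1) // 2 for a in counts_a)   # nAA
        n_pairs = n_a * (n_a - 1) // 2                   # nAnA
    else:
        if len(counts_a) != len(counts_b):
            raise ValueError("counts must cover the same authors")
        n_ab = sum(a * b for a, b in zip(counts_a, counts_b))  # nAB
        n_pairs = sum(counts_a) * sum(counts_b)                # nAnB
    if n_pairs == 0:
        raise ValueError("no possible pairs for these counts")
    return 1_000_000 * n_ab / n_pairs, n_ab, n_pairs


def keep_pair(score: float, n_a: int, n_b: int, n_ab: int,
              cutoff: int = 200, cutoff_ab: int = 20) -> bool:
    """Apply the exclusion rule; defaults are the journal-pair cut-offs."""
    return not (score < 5 or n_a < cutoff or n_b < cutoff or n_ab < cutoff_ab)


# Toy example with three authors (made-up counts, for illustration only).
score, n_ab, n_pairs = pair_score([2, 0, 1], [1, 3, 0])
# nAB = 2*1 + 0*3 + 1*0 = 2, nAnB = 3*4 = 12, score is about 166667 ppm
```

With such small toy counts the pair would be dropped by `keep_pair`, since nA, nB and nAB all fall below the journal-pair cut-offs.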

Each dataset has the following 7 tab-delimited, all-ASCII columns:
1: score: roughly the number of co-occurrences of the two tokens divided by the total number of pairs, in parts per million (ppm), ranging from 5 to 1,000,000
2: nAB: total number of co-occurrences
3: nAnB: total number of pairs
4: nA: number of occurrences of token A
5: nB: number of occurrences of token B
6: A: token A
7: B: token B

We made some of these datasets as early as 2011, when we were working to link PubMed authors with USPTO inventors, where the vocabulary usage is strikingly different. More recently we have used them to create links from PubMed authors to their dissertations and to NIH/NSF investigators, and to help disambiguate PubMed authors. Going beyond explicit (exact within-field) matches is particularly useful when data is sparse (think old papers lacking controlled vocabulary and affiliations, or papers with metadata written in different languages) and when making links across databases with different kinds of fields and vocabulary (think PubMed vs USPTO records). We never published a paper on this, but our work inspired the more refined measures described in:

D′Souza JL, Smalheiser NR (2014) Three Journal Similarity Metrics and Their Application to Biomedical Journals. PLOS ONE 9(12): e115681. https://doi.org/10.1371/journal.pone.0115681

Smalheiser, N., & Bonifield, G. (2016). Two Similarity Metrics for Medical Subject Headings (MeSH): An Aid to Biomedical Text Mining and Author Name Disambiguation. DISCO: Journal of Biomedical Discovery and Collaboration, 7. http://dx.doi.org/10.5210/disco.v7i0.6654
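
As a rough sketch of how the 7-column layout might be consumed, the following Python reads rows in the documented column order and lists the highest-scoring partners of a given token. The names `Pair`, `iter_pairs` and `top_partners` are hypothetical. The code assumes the files have no header row (a first field that does not parse as a number is skipped, in case one exists) and that the count columns are written as integers; neither detail is stated in the description.

```python
import csv
from typing import Iterable, Iterator, List, NamedTuple, Tuple


class Pair(NamedTuple):
    score: float   # column 1: co-occurrence score in ppm
    n_ab: int      # column 2: total number of co-occurrences
    n_anb: int     # column 3: total number of pairs
    n_a: int       # column 4: occurrences of token A
    n_b: int       # column 5: occurrences of token B
    a: str         # column 6: token A
    b: str         # column 7: token B


def iter_pairs(lines: Iterable[str]) -> Iterator[Pair]:
    """Parse tab-delimited lines into Pair records."""
    # QUOTE_NONE: treat any quote characters inside tokens as ordinary text.
    for row in csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE):
        if len(row) != 7:
            continue  # skip blank or malformed lines
        try:
            score = float(row[0])
        except ValueError:
            continue  # assumed to be a header row, if the file has one
        yield Pair(score, int(row[1]), int(row[2]),
                   int(row[3]), int(row[4]), row[5], row[6])


def top_partners(pairs: Iterable[Pair], token: str,
                 k: int = 5) -> List[Tuple[float, str]]:
    """Return up to k (score, other token) tuples, highest score first."""
    hits = []
    for p in pairs:
        if p.a == token and p.b != token:
            hits.append((p.score, p.b))
        elif p.b == token and p.a != token:
            hits.append((p.score, p.a))
    return sorted(hits, reverse=True)[:k]


# Usage against one of the downloaded files (path assumed to be local):
# with open("author-implicit-journal-pairs2009.tsv",
#           encoding="ascii", newline="") as fh:
#     print(top_partners(iter_pairs(fh), "Some Journal Name"))
```

Streaming the file row by row keeps memory use low, which matters because the individual files range from about 52 MB to 210 MB.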
Subject	Social Sciences
Keywords	PubMed; MeSH; token; name disambiguation
License	CC BY
Funder	U.S. National Science Foundation (NSF)-Grant:0965341
Funder	U.S. National Institutes of Health (NIH)-Grant:P01AG039347
Corresponding Creator	Vetle Torvik
Downloaded	2336 times
Related Materials (2)
Dataset	Torvik, Vetle I.; Smalheiser, Neil R. (2018): Author-ity 2009 - PubMed author name disambiguated dataset. University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-4222651_V1
Article	Mishra S, Fegley BD, Diesner J, Torvik VI (2018) Self-citation is the hallmark of productive authors, of any gender. PLoS ONE 13(9): e0195773. https://doi.org/10.1371/journal.pone.0195773

Versions

Version	DOI	Comment	Publication Date
1	10.13012/B2IDB-4742014_V1		2018-04-23

Files


author-implicit-affiliation-journal-pairs2009.tsv 163 MB File
author-implicit-affiliation-mesh-pairs2009.tsv 210 MB File
author-implicit-affiliation-title-pairs2009.tsv 152 MB File
author-implicit-affiliation-word-pairs2009.tsv 96.6 MB File
author-implicit-journal-pairs2009.tsv 52 MB File
author-implicit-mesh-journal-pairs2009.tsv 110 MB File
author-implicit-mesh-pairs2009.tsv 73.6 MB File
author-implicit-mesh-title-pairs2009.tsv 124 MB File
author-implicit-title-journal-pairs2009.tsv 92.6 MB File
author-implicit-title-word-pairs2009.tsv 65.1 MB File

Change Log

Contact the Research Data Service for help interpreting this log.

RelatedMaterial	create: {"material_type"=>"Article", "availability"=>nil, "link"=>"https://doi.org/10.1371/journal.pone.0195773", "uri"=>"10.1371/journal.pone.0195773", "uri_type"=>"DOI", "citation"=>"Mishra S, Fegley BD, Diesner J, Torvik VI (2018) Self-citation is the hallmark of productive authors, of any gender. PLoS ONE 13(9): e0195773. https://doi.org/10.1371/journal.pone.0195773", "dataset_id"=>537, "selected_type"=>"Article", "datacite_list"=>"IsSupplementTo"}	2018-09-29T15:44:37Z
RelatedMaterial	create: {"material_type"=>"Dataset", "availability"=>nil, "link"=>"https://doi.org/10.13012/B2IDB-4222651_V1", "uri"=>"10.13012/B2IDB-4222651_V1", "uri_type"=>"DOI", "citation"=>"Torvik, Vetle I.; Smalheiser, Neil R. (2018): Author-ity 2009 - PubMed author name disambiguated dataset. University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-4222651_V1", "dataset_id"=>537, "selected_type"=>"Dataset", "datacite_list"=>"IsSupplementTo"}	2018-04-27T16:35:20Z
Dataset	update: {"description"=>[previous description, current description (same text as the Dataset Description above; the previous version differed only in the formatting of the two closing references, which were plain citations without links)], "keywords"=>["", "PubMed; MeSH; token; name disambiguation"], "version_comment"=>[nil, ""], "subject"=>["", "Social Sciences"]}	2018-04-27T16:35:20Z