Dataset
|
update: {"description"=>["Contains a series of datasets that score pairs of tokens (words, journal names, and controlled vocabulary terms) based on how often they co-occur within versus across authors' collections of papers. The tokens derive from four different fields of PubMed papers: journal, affiliation, title, MeSH (medical subject headings). Thus, there are 10 different datasets, one for each pair of token types: affiliation-word vs affiliation-word, affiliation-word vs journal, affiliation-word vs mesh, affiliation-word vs title-word, mesh vs mesh, mesh vs journal, etc.\r\n\r\nUsing authors to link papers and in turn pairs of tokens is an alternative to the usual within-document co-occurrences, and to using, e.g., citations to link papers. This is particularly striking for journal pairs because a paper almost always appears in a single journal and so within-document co-occurrences are 0, i.e., useless.\r\n\r\nThe tokens are taken from the Author-ity 2009 dataset which has a cluster of papers for each inferred author, and a summary of each field. For MeSH, title-words, and affiliation-words, that summary includes only the top-20 most frequent tokens after field-specific stoplisting (e.g., university is stoplisted from affiliation and Humans is stoplisted from MeSH). The score for a pair of tokens A and B is defined as follows. Suppose Ai and Bi are the number of occurrences of token A (and B, respectively) across the i-th author's papers, then\r\nnA = sum(Ai); nB = sum(Bi)\r\nnAB = sum(Ai*Bi) if A not equal B; nAA = sum(Ai*(Ai-1)/2) otherwise\r\nnAnB = nA*nB if A not equal B; nAnA = nA*(nA-1)/2 otherwise\r\nscore = 1000000*nAB/nAnB if A is not equal B; 1000000*nAA/nAnA otherwise\r\n\r\nToken pairs are excluded when: score < 5, or nA < cut-off, or nB < cut-off, or nAB < cut-offAB.\r\nThe cut-offs differ for token types and can be inferred from the datasets. 
For example, cut-off = 200 and cut-offAB = 20 for journal pairs.\r\n\r\n\r\n\r\nEach dataset has the following 7 tab-delimited all-ASCII columns\r\n\r\n1: score: roughly the number of the tokens' co-occurrences divided by the total number of pairs, in parts per million (ppm), ranging from 5 to 1,000,000\r\n2: nAB: total number of co-occurrences\r\n3: nAnB: total number of pairs\r\n4: nA: number of occurrences of token A\r\n5: nB: number of occurrences of token B\r\n6: A: token A\r\n7: B: token B\r\n\r\nWe made some of these datasets as early as 2011 as we were working to link PubMed authors with USPTO inventors, where the vocabulary usage is strikingly different, but also more recently to create links from PubMed authors to their dissertations and NIH/NSF investigators, and to help disambiguate PubMed authors. Going beyond explicit (exact within-field match) is particularly useful when data is sparse (think old papers lacking controlled vocabulary and affiliations, or papers with metadata written in different languages) and when making links across databases with different kinds of fields and vocabulary (think PubMed vs USPTO records). We never published a paper on this but our work inspired the more refined measures described in: \r\n\r\nJennifer LD, Smalheiser NR. Three journal similarity metrics and their application to biomedical journals. PloS one. 2014 Dec 23;9(12):e115681.\r\n\r\nSmalheiser NR, Bonifield G. Two Similarity Metrics for Medical Subject Headings (MeSH):: An Aid to Biomedical Text Mining and Author Name Disambiguation. Journal of biomedical discovery and collaboration. 2016;7.\r\n", "Contains a series of datasets that score pairs of tokens (words, journal names, and controlled vocabulary terms) based on how often they co-occur within versus across authors' collections of papers. The tokens derive from four different fields of PubMed papers: journal, affiliation, title, MeSH (medical subject headings). 
Thus, there are 10 different datasets, one for each pair of token types: affiliation-word vs affiliation-word, affiliation-word vs journal, affiliation-word vs mesh, affiliation-word vs title-word, mesh vs mesh, mesh vs journal, etc.\r\n\r\nUsing authors to link papers and in turn pairs of tokens is an alternative to the usual within-document co-occurrences, and to using, e.g., citations to link papers. This is particularly striking for journal pairs because a paper almost always appears in a single journal and so within-document co-occurrences are 0, i.e., useless.\r\n\r\nThe tokens are taken from the Author-ity 2009 dataset which has a cluster of papers for each inferred author, and a summary of each field. For MeSH, title-words, and affiliation-words, that summary includes only the top-20 most frequent tokens after field-specific stoplisting (e.g., university is stoplisted from affiliation and Humans is stoplisted from MeSH). The score for a pair of tokens A and B is defined as follows. Suppose Ai and Bi are the number of occurrences of token A (and B, respectively) across the i-th author's papers, then\r\nnA = sum(Ai); nB = sum(Bi)\r\nnAB = sum(Ai*Bi) if A not equal B; nAA = sum(Ai*(Ai-1)/2) otherwise\r\nnAnB = nA*nB if A not equal B; nAnA = nA*(nA-1)/2 otherwise\r\nscore = 1000000*nAB/nAnB if A is not equal B; 1000000*nAA/nAnA otherwise\r\n\r\nToken pairs are excluded when: score < 5, or nA < cut-off, or nB < cut-off, or nAB < cut-offAB.\r\nThe cut-offs differ for token types and can be inferred from the datasets. 
For example, cut-off = 200 and cut-offAB = 20 for journal pairs.\r\n\r\n\r\n\r\nEach dataset has the following 7 tab-delimited all-ASCII columns\r\n\r\n1: score: roughly the number of the tokens' co-occurrences divided by the total number of pairs, in parts per million (ppm), ranging from 5 to 1,000,000\r\n2: nAB: total number of co-occurrences\r\n3: nAnB: total number of pairs\r\n4: nA: number of occurrences of token A\r\n5: nB: number of occurrences of token B\r\n6: A: token A\r\n7: B: token B\r\n\r\nWe made some of these datasets as early as 2011 as we were working to link PubMed authors with USPTO inventors, where the vocabulary usage is strikingly different, but also more recently to create links from PubMed authors to their dissertations and NIH/NSF investigators, and to help disambiguate PubMed authors. Going beyond explicit (exact within-field match) is particularly useful when data is sparse (think old papers lacking controlled vocabulary and affiliations, or papers with metadata written in different languages) and when making links across databases with different kinds of fields and vocabulary (think PubMed vs USPTO records). We never published a paper on this but our work inspired the more refined measures described in: \r\n\r\n<a href=\"https://doi.org/10.1371/journal.pone.0115681\">D′Souza JL, Smalheiser NR (2014) Three Journal Similarity Metrics and Their Application to Biomedical Journals. PLOS ONE 9(12): e115681. https://doi.org/10.1371/journal.pone.0115681</a>\r\n\r\n<a href=\"http://dx.doi.org/10.5210/disco.v7i0.6654\">Smalheiser, N., & Bonifield, G. (2016). Two Similarity Metrics for Medical Subject Headings (MeSH): An Aid to Biomedical Text Mining and Author Name Disambiguation. DISCO: Journal of Biomedical Discovery and Collaboration, 7. doi:http://dx.doi.org/10.5210/disco.v7i0.6654</a>\r\n"], "keywords"=>["", "PubMed; MeSH; token; name disambiguation"], "version_comment"=>[nil, ""], "subject"=>["", "Social Sciences"]}
|
2018-04-27T16:35:20Z
|
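The score definition in the description above can be illustrated with a minimal Python sketch. It is not the authors' code: the function name `pair_score`, its parameters, and the toy per-author counts are hypothetical, and it assumes nB is the sum of the Bi counts and that the per-author counts for A and B are aligned by author.

```python
from typing import Optional, Sequence, Tuple


def pair_score(
    a_counts: Sequence[int],
    b_counts: Optional[Sequence[int]] = None,
) -> Tuple[float, int, int]:
    """Return (score in ppm, nAB, nAnB) for one token pair.

    a_counts[i] and b_counts[i] are the occurrences of tokens A and B
    across the i-th author's papers (hypothetical input layout).
    Passing b_counts=None treats the pair as A vs A.
    """
    if b_counts is None:
        # Same-token case: count unordered pairs of occurrences.
        n_a = sum(a_counts)
        n_ab = sum(a * (a - 1) // 2 for a in a_counts)  # nAA
        n_pairs = n_a * (n_a - 1) // 2                  # nAnA
    else:
        if len(a_counts) != len(b_counts):
            raise ValueError("counts must be aligned by author")
        n_a = sum(a_counts)
        n_b = sum(b_counts)
        n_ab = sum(a * b for a, b in zip(a_counts, b_counts))  # nAB
        n_pairs = n_a * n_b                                    # nAnB
    if n_pairs == 0:
        raise ValueError("score is undefined when there are no pairs")
    return 1_000_000 * n_ab / n_pairs, n_ab, n_pairs


# Toy example with three authors (made-up counts, not from the datasets).
score, n_ab, n_pairs = pair_score([2, 0, 1], [1, 3, 0])
print(score, n_ab, n_pairs)

# Same-token case for token A alone.
score_aa, n_aa, n_ana = pair_score([2, 0, 1])
print(score_aa, n_aa, n_ana)
```

The exclusion rules (score < 5 and the per-type cut-offs) are left out here because the cut-off values other than the journal-pair example are not stated in the description.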