Author-ity 2009 - PubMed author name disambiguated dataset

Name: Author-ity 2009 - PubMed author name disambiguated dataset
License: http://creativecommons.org/licenses/by/4.0/
Keywords: Bibliographic databases, Name disambiguation, MEDLINE, Library information networks

Torvik, Vetle I.; Smalheiser, Neil R.

doi:10.13012/B2IDB-4222651_V1

Cite this dataset:

Torvik, Vetle I.; Smalheiser, Neil R. (2018): Author-ity 2009 - PubMed author name disambiguated dataset. University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-4222651_V1

Metadata


Dataset Description	Author-ity 2009 baseline dataset. Prepared by Vetle Torvik 2009-12-03.

The dataset comes in the form of 18 compressed (.gz) Linux text files named authority2009.part00.gz through authority2009.part17.gz. The total size should be ~17.4 GB uncompressed.

• How was the dataset created?
The dataset is based on a snapshot of PubMed (which includes Medline and PubMed-not-Medline records) taken in July 2009, comprising a total of 19,011,985 Article records and 61,658,514 author name instances. Each instance of an author name is uniquely represented by the PMID and the position on the paper (e.g., 10786286_3 is the third author name on PMID 10786286). Thus, each cluster is represented by a collection of author name instances. The instances were first grouped into "blocks" by last name and first name initial (including some close variants), and then each block was separately subjected to clustering. Details are described in:
Torvik, V., & Smalheiser, N. (2009). Author name disambiguation in MEDLINE. ACM Transactions On Knowledge Discovery From Data, 3(3), doi:10.1145/1552303.1552304
Torvik, V. I., Weeber, M., Swanson, D. R., & Smalheiser, N. R. (2005). A Probabilistic Similarity Metric for Medline Records: A Model for Author Name Disambiguation. Journal Of The American Society For Information Science & Technology, 56(2), 140-158. doi:10.1002/asi.20105
Note that for Author-ity 2009, some new predictive features (e.g., grants, citation matches, temporal, affiliation phrases) and a post-processing merging procedure were applied (to capture name variants not captured during blocking, e.g., matches for subsets of compound last names, and nicknames with a different first initial, like Bill and William). These changes have not yet been written up for publication.

• How accurate is the 2009 dataset (compared to 2006 and 2008)?
The recall reported for 2006 of 98.8% has been much improved in 2009 (because common last name variants are now captured). Compared to 2006, both 2008 and 2009 overall seem to exhibit a higher rate of splitting errors but a lower rate of lumping errors. This reflects an overall decrease in prior probabilities, possibly because of: a) a new prior estimation procedure that avoids wild estimates (by dampening the magnitude of iterative changes); b) the inclusion in 2008 and 2009 of items in PubMed-not-Medline (including in-process items); and c) the dramatic (exponential) increase in frequencies of some names (J. Lee went from ~16,000 occurrences in 2006 to 26,000 in 2009). However, splitting is reduced in 2009 for some special cases, such as NIH-funded investigators who list their grant numbers on their papers. Compared to 2008, splitting errors were reduced overall in 2009 while maintaining the same level of lumping errors.

• What is the format of the dataset?
The cluster summaries for 2009 are much more extensive than those in the 2008 dataset. Each line corresponds to a predicted author-individual, represented by a cluster of author name instances and a summary of all the corresponding papers and author name variants (and, if there are > 10 papers in the cluster, a summary in the same format of the 10 most recent papers). Each cluster has a unique Author ID (which is uniquely identified by the PMID of the earliest paper in the cluster and the author name position). The summary has the following tab-delimited fields:

1. Blocks separated by '||'; each block may consist of multiple lastname-first initial variants separated by '|'
2. Prior probabilities of the respective blocks separated by '|'
3. Cluster number relative to the block, ordered by cluster size (some are listed as 'CLUSTER X' when they were derived from multiple blocks)
4. Author ID (or cluster ID), e.g., bass_c_9731334_2 represents a cluster where 9731334_2 is the earliest author name instance. Although not needed for uniqueness, the ID also includes the most frequent lastname_firstinitial (lowercased).
5. Cluster size (number of author name instances on papers)
6. Name variants separated by '|' with counts in parentheses. Each variant has the format lastname_firstname middleinitial, suffix
7. Last name variants separated by '|'
8. First name variants separated by '|'
9. Middle initial variants separated by '|' ('-' if none)
10. Suffix variants separated by '|' ('-' if none)
11. Email addresses separated by '|' ('-' if none)
12. Range of years (e.g., 1997-2009)
13. Top 20 most frequent affiliation words (after stoplisting and tokenizing; some phrases are also made) with counts in parentheses; separated by '|' ('-' if none)
14. Top 20 most frequent MeSH terms (after stoplisting) with counts in parentheses; separated by '|' ('-' if none)
15. Journals with counts in parentheses; separated by '|'
16. Top 20 most frequent title words (after stoplisting and tokenizing) with counts in parentheses; separated by '|' ('-' if none)
17. Co-author names (lowercased last name and first/middle initials) with counts in parentheses; separated by '|' ('-' if none)
18. Co-author IDs with counts in parentheses; separated by '|' ('-' if none)
19. Author name instances (PMID_auno, separated by '|')
20. Grant IDs (after normalization); separated by '|' ('-' if none given)
21. Total number of times cited (citations are based on references extracted from PMC)
22. h-index
23. Citation counts (e.g., for h-index): PMIDs by the author that have been cited (with total citation counts in parentheses); separated by '|'
24. Cited: PMIDs that the author cited (with counts in parentheses); separated by '|'
25. Cited-by: PMIDs that cited the author (with counts in parentheses); separated by '|'
26-47. Same summary as for fields 4-25, except that only the 10 most recent papers were used (based on year; so if papers 10, 11, 12... have the same year, one is selected arbitrarily)
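
As an illustration of the record layout described above, the following is a minimal Python sketch of how one line might be read. It is not the authors' tooling. The short field names in FIELD_NAMES, the helper function names, the UTF-8 decoding, and the exact "item(count)" spelling of counted entries are all assumptions made for this example; only the file names, the tab delimiter, the '|' and '||' separators, the '-' placeholder, and the ID formats come from the description.

```python
import gzip
import re
from typing import Dict, Iterator, List, Optional, Tuple

# Hypothetical short names for fields 1-25, in the order listed in the description.
FIELD_NAMES = [
    "blocks", "priors", "cluster_number", "author_id", "cluster_size",
    "name_variants", "last_names", "first_names", "middle_initials",
    "suffixes", "emails", "year_range", "affiliation_words", "mesh",
    "journals", "title_words", "coauthor_names", "coauthor_ids",
    "instances", "grant_ids", "times_cited", "h_index",
    "citation_counts", "cited", "cited_by",
]


def part_filenames() -> List[str]:
    """Names of the 18 parts: authority2009.part00.gz .. part17.gz."""
    return [f"authority2009.part{i:02d}.gz" for i in range(18)]


def split_blocks(value: str) -> List[List[str]]:
    """Field 1: blocks separated by '||', variants within a block by '|'."""
    return [block.split("|") for block in value.split("||")]


def split_multi(value: str) -> List[str]:
    """Split a '|'-separated field; '-' (or empty) is treated as 'none'."""
    return [] if value in ("", "-") else value.split("|")


# Assumed spelling of a counted entry: the item followed by '(count)'.
_COUNTED = re.compile(r"^(.*)\((\d+)\)$")


def parse_counted(value: str) -> List[Tuple[str, Optional[int]]]:
    """Parse e.g. 'univ(12)|dept(3)' into [('univ', 12), ('dept', 3)]."""
    pairs: List[Tuple[str, Optional[int]]] = []
    for item in split_multi(value):
        match = _COUNTED.match(item.strip())
        if match:
            pairs.append((match.group(1).strip(), int(match.group(2))))
        else:
            pairs.append((item, None))  # no count found; keep the raw item
    return pairs


def parse_instance(instance: str) -> Tuple[int, int]:
    """'10786286_3' -> (PMID, author position on the paper)."""
    pmid, position = instance.split("_")
    return int(pmid), int(position)


def parse_author_id(author_id: str) -> Tuple[str, int, int]:
    """'bass_c_9731334_2' -> (name part, PMID, author position)."""
    name, pmid, position = author_id.rsplit("_", 2)
    return name, int(pmid), int(position)


def parse_record(line: str) -> Dict[str, str]:
    """Map the first 25 tab-delimited fields of one line to FIELD_NAMES."""
    fields = line.rstrip("\r\n").split("\t")
    return dict(zip(FIELD_NAMES, fields))


def iter_records(path: str) -> Iterator[Dict[str, str]]:
    """Stream records from one .gz part without decompressing it to disk."""
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as handle:
        for line in handle:
            yield parse_record(line)
```

For example, iterating over iter_records(part_filenames()[0]) and passing each record's "author_id" value to parse_author_id would recover the earliest author name instance of every cluster in the first part. Fields 26-47 are ignored here; since they repeat the layout of fields 4-25, they could presumably be handled by applying the same helpers to the remaining columns.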
Subject	Social Sciences
Keywords	Bibliographic databases; Name disambiguation; MEDLINE; Library information networks
License	CC BY
Funder	U.S. National Institutes of Health (NIH)-Grant:R21LM008364
Corresponding Creator	Vetle I. Torvik
Downloaded	5727 times
Related Materials (4)
Article	Torvik, V., & Smalheiser, N. (2009). Author name disambiguation in MEDLINE. ACM Transactions On Knowledge Discovery From Data, 3(3), doi:10.1145/1552303.1552304
Article	Torvik, V. I., Weeber, M., Swanson, D. R., & Smalheiser, N. R. (2005). A Probabilistic Similarity Metric for Medline Records: A Model for Author Name Disambiguation. Journal Of The American Society For Information Science & Technology, 56(2), 140-158. doi:10.1002/asi.20105
Article	Mishra S, Fegley BD, Diesner J, Torvik VI (2018) Self-citation is the hallmark of productive authors, of any gender. PLoS ONE 13(9): e0195773. https://doi.org/10.1371/journal.pone.0195773
Conference Paper/Presentation	Mishra, Shubhanshu, Brent D. Fegley, Jana Diesner, and Vetle I. Torvik. 2018. “Expertise as an Aspect of Author Contributions.” In Workshop on Informetric and Scientometric Research (SIG/MET). Vancouver. http://hdl.handle.net/2142/102050
Cited By (1)
Article	Kim, J., & Owen-Smith, J. (In print). ORCID-linked labeled data for evaluating author name disambiguation at scale. Scientometrics. doi:10.1007/s11192-020-03826-6

Versions

Version	DOI	Comment	Publication Date
1	10.13012/B2IDB-4222651_V1		2018-04-19

Change Log

Contact the Research Data Service for help interpreting this log.

Dataset	update: {"all_globus"=>[nil, true]}	2026-01-16T15:37:49Z
Dataset	update: {"all_medusa"=>[nil, true]}	2026-01-16T15:35:58Z
RelatedMaterial	update: {"datacite_list"=>["IsSupplementedBy ", "IsSupplementedBy"]}	2024-04-18T18:23:34Z
RelatedMaterial	create: {"material_type"=>"Article", "availability"=>nil, "link"=>"https://doi.org/10.1007/s11192-020-03826-6", "uri"=>"10.1007/s11192-020-03826-6", "uri_type"=>"DOI", "citation"=>"Kim, J., & Owen-Smith, J. (In print). ORCID-linked labeled data for evaluating author name\r\ndisambiguation at scale. Scientometrics. Doi: 10.1007/s11192-020-03826-6", "dataset_id"=>526, "selected_type"=>"Article", "datacite_list"=>"IsCitedBy"}	2021-02-11T16:46:37Z
RelatedMaterial	create: {"material_type"=>"Conference Paper/Presentation", "availability"=>nil, "link"=>"http://hdl.handle.net/2142/102050", "uri"=>"hdl.handle.net/2142/102050", "uri_type"=>"Handle", "citation"=>"Mishra, Shubhanshu, Brent D. Fegley, Jana Diesner, and Vetle I. Torvik. 2018. “Expertise as an Aspect of Author Contributions.” In Workshop on Informetric and Scientometric Research (SIG/MET). Vancouver. http://hdl.handle.net/2142/102050", "dataset_id"=>526, "selected_type"=>"Other", "datacite_list"=>"IsSupplementTo"}	2018-11-26T17:46:26Z
RelatedMaterial	create: {"material_type"=>"Article", "availability"=>nil, "link"=>"https://doi.org/10.1371/journal.pone.0195773", "uri"=>"10.1371/journal.pone.0195773", "uri_type"=>"DOI", "citation"=>"Mishra S, Fegley BD, Diesner J, Torvik VI (2018) Self-citation is the hallmark of productive authors, of any gender. PLoS ONE 13(9): e0195773. https://doi.org/10.1371/journal.pone.0195773", "dataset_id"=>526, "selected_type"=>"Article", "datacite_list"=>"IsSupplementTo"}	2018-09-29T15:41:20Z
RelatedMaterial	create: {"material_type"=>"Article", "availability"=>nil, "link"=>"https://doi.org/10.1002/asi.20105", "uri"=>"10.1002/asi.20105", "uri_type"=>"DOI", "citation"=>"Torvik, V. I., Weeber, M., Swanson, D. R., & Smalheiser, N. R. (2005). A Probabilistic Similarity Metric for Medline Records: A Model for Author Name Disambiguation. Journal Of The American Society For Information Science & Technology, 56(2), 140-158. doi:10.1002/asi.20105", "dataset_id"=>526, "selected_type"=>"Article", "datacite_list"=>"IsSupplementedBy "}	2018-04-23T18:59:20Z
RelatedMaterial	create: {"material_type"=>"Article", "availability"=>nil, "link"=>"http://doi.org/10.1145/1552303.1552304", "uri"=>"10.1145/1552303.1552304", "uri_type"=>"DOI", "citation"=>"Torvik, V., & Smalheiser, N. (2009). Author name disambiguation in MEDLINE. ACM Transactions On Knowledge Discovery From Data, 3(3), doi:10.1145/1552303.1552304", "dataset_id"=>526, "selected_type"=>"Article", "datacite_list"=>"IsSupplementedBy"}	2018-04-23T18:59:19Z
Dataset	update: {"description"=>[previous text, current text as shown under Dataset Description], "keywords"=>["", "Bibliographic databases; Name disambiguation; MEDLINE; Library information networks"], "version_comment"=>[nil, ""], "subject"=>["", "Social Sciences"]}	2018-04-23T18:59:19Z