Genni + Ethnea for the Author-ity 2009 dataset

Name: Genni + Ethnea for the Author-ity 2009 dataset
License: http://creativecommons.org/licenses/by/4.0/
Keywords: Androgyny, Bibliometrics, Data mining, Search engine, Gender, Semantic orientation, Temporal prediction, Textual markers

Torvik, Vetle

doi:10.13012/B2IDB-9087546_V1

Cite this dataset:

Torvik, Vetle (2018): Genni + Ethnea for the Author-ity 2009 dataset. University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-9087546_V1

Use this persistent URL to link to this dataset: https://doi.org/10.13012/B2IDB-9087546_V1

Metadata


Dataset Description	Prepared by Vetle Torvik 2018-04-15

The dataset comes as a single tab-delimited, ASCII-encoded file, and should be about 717 MB uncompressed.

• How was the dataset created?
First and last names of authors in the Author-ity 2009 dataset were processed through several tools to predict ethnicities and gender, including:

Ethnea+Genni, as described in:
Torvik VI, Agarwal S. Ethnea -- an instance-based ethnicity classifier based on geocoded author names in a large-scale bibliographic database. International Symposium on Science of Science March 22-23, 2016 - Library of Congress, Washington, DC, USA. http://hdl.handle.net/2142/88927
Smith, B., Singh, M., & Torvik, V. (2013). A search engine approach to estimating temporal changes in gender orientation of first names. Proceedings Of The ACM/IEEE Joint Conference On Digital Libraries, (JCDL 2013 - Proceedings of the 13th ACM/IEEE-CS Joint Conference on Digital Libraries), 199-208. doi:10.1145/2467696.2467720

EthnicSeer: http://singularity.ist.psu.edu/ethnicity
Treeratpituk P, Giles CL (2012). Name-Ethnicity Classification and Ethnicity-Sensitive Name Matching. Proceedings of the Twenty-Sixth Conference on Artificial Intelligence (pp. 1141-1147). AAAI-12. Toronto, ON, Canada

SexMachine 0.1.1: https://pypi.org/project/SexMachine

First names, for some Author-ity records lacking them, were harvested from outside bibliographic databases.

• The code and back-end data are periodically updated and made available for query at Torvik Research Group (http://abel.ischool.illinois.edu).

• What is the format of the dataset?
The dataset contains 9,300,182 rows and 10 columns:
1. auid: unique ID for authors in Author-ity 2009 (PMID_authorposition)
2. name: full name used as input to EthnicSeer
3. EthnicSeer: predicted ethnicity; ARA, CHI, ENG, FRN, GER, IND, ITA, JAP, KOR, RUS, SPA, VIE, XXX
4. prop: decimal between 0 and 1 reflecting the confidence of the EthnicSeer prediction
5. lastname: used as input for Ethnea+Genni
6. firstname: used as input for Ethnea+Genni
7. Ethnea: predicted ethnicity; either one of 26 (AFRICAN, ARAB, BALTIC, CARIBBEAN, CHINESE, DUTCH, ENGLISH, FRENCH, GERMAN, GREEK, HISPANIC, HUNGARIAN, INDIAN, INDONESIAN, ISRAELI, ITALIAN, JAPANESE, KOREAN, MONGOLIAN, NORDIC, POLYNESIAN, ROMANIAN, SLAV, THAI, TURKISH, VIETNAMESE), or two ethnicities (e.g., SLAV-ENGLISH), or UNKNOWN (if there are no one or two dominant predictions), or TOOSHORT (if both first and last names are too short)
8. Genni: predicted gender; 'F', 'M', or '-'
9. SexMac: predicted gender based on a third-party Python program (default settings except case_sensitive=False); female, mostly_female, andy, mostly_male, male
10. SSNgender: predicted gender based on US SSN data; 'F', 'M', or '-'
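The column layout described above can be read with a short Python sketch like the one below. It is only an illustration, not code from the dataset's author. It assumes the file has no header row (the description does not say either way), and the helper names `read_rows` and `to_float` and the two sample records are invented for the example.

```python
import csv
from typing import Dict, Iterable, Iterator, Optional

# Column names in the order given in the dataset description.
COLUMNS = [
    "auid", "name", "EthnicSeer", "prop", "lastname",
    "firstname", "Ethnea", "Genni", "SexMac", "SSNgender",
]


def to_float(text: str) -> Optional[float]:
    """Convert the EthnicSeer confidence to a float, or None if not numeric."""
    try:
        return float(text)
    except ValueError:
        return None


def read_rows(lines: Iterable[str]) -> Iterator[Dict[str, object]]:
    """Yield one dict per tab-delimited record.

    Assumption: there is no header row. If the real file has one,
    skip the first line before calling this function.
    """
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for fields in reader:
        if len(fields) != len(COLUMNS):
            continue  # skip malformed lines rather than guess at their layout
        row: Dict[str, object] = dict(zip(COLUMNS, fields))
        # auid is described as PMID_authorposition; split on the last underscore.
        pmid, _, position = fields[0].rpartition("_")
        row["pmid"] = pmid
        row["position"] = position
        row["prop"] = to_float(fields[3])
        yield row


# Invented sample records, for illustration only (not taken from the dataset).
SAMPLE = [
    "12345678_2\tJane Doe\tENG\t0.91\tDoe\tJane\tENGLISH\tF\tfemale\tF",
    "23456789_1\tWei Zhang\tCHI\t0.88\tZhang\tWei\tCHINESE\t-\tandy\t-",
]

for row in read_rows(SAMPLE):
    print(row["pmid"], row["position"], row["Ethnea"], row["Genni"])
```

To process the real file, pass an open file handle instead of `SAMPLE`, for example `open(path, encoding="ascii", newline="")`, where `path` is wherever the download was saved. Reading line by line this way avoids loading all 9,300,182 rows into memory at once.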
Subject	Social Sciences
Keywords	Androgyny; Bibliometrics; Data mining; Search engine; Gender; Semantic orientation; Temporal prediction; Textual markers
License	CC BY
Funder	U.S. National Science Foundation (NSF)-Grant:1348742
Funder	U.S. National Institutes of Health (NIH)-Grant:P01AG039347
Funder	U.S. National Science Foundation (NSF)-Grant:0965341
Corresponding Creator	Vetle Torvik
Downloaded	1606 times
Related Materials (4)
Article	Torvik VI, Agarwal S. Ethnea -- an instance-based ethnicity classifier based on geocoded author names in a large-scale bibliographic database. International Symposium on Science of Science March 22-23, 2016 - Library of Congress, Washington, DC, USA. http://hdl.handle.net/2142/88927
Article	Smith, B., Singh, M., & Torvik, V. (2013). A search engine approach to estimating temporal changes in gender orientation of first names. Proceedings Of The ACM/IEEE Joint Conference On Digital Libraries, (JCDL 2013 - Proceedings of the 13th ACM/IEEE-CS Joint Conference on Digital Libraries), 199-208. doi:10.1145/2467696.2467720
Article	Treeratpituk P, Giles CL (2012). Name-Ethnicity Classification and Ethnicity-Sensitive Name Matching. Proceedings of the Twenty-Sixth Conference on Artificial Intelligence (pp. 1141-1147). AAAI-12. Toronto, ON, Canada
Article	Mishra S, Fegley BD, Diesner J, Torvik VI (2018) Self-citation is the hallmark of productive authors, of any gender. PLoS ONE 13(9): e0195773. https://doi.org/10.1371/journal.pone.0195773
Cited By (5)
Article	Kim, J., Kim, J. & Owen-Smith, J. Scientometrics (2018). Generating automatically labeled data for author name disambiguation: an iterative clustering method. https://doi.org/10.1007/s11192-018-2968-3
Article	Kim, J., & Owen-Smith, J. (In print). ORCID-linked labeled data for evaluating author name disambiguation at scale. Scientometrics. doi:10.1007/s11192-020-03826-6
Article	Acuna, Daniel E., and Lizhen Liang. 2021. “Are AI Ethics Conferences Different and More Diverse Compared to Traditional Computer Science Conferences?” OSF Preprints. May 19. doi:10.1145/3461702.3462616.
Article	Ke, Qing, Lizhen Liang, Ying Ding, Stephen V David and Daniel Ernesto Acuna. “A dataset of mentorship in science with semantic and demographic estimations.” ArXiv abs/2106.06487 (2021): n. pag.
Article	Ke, Q., Liang, L., Ding, Y. et al. A dataset of mentorship in bioscience with semantic and demographic estimations. Sci Data 9, 467 (2022). https://doi.org/10.1038/s41597-022-01578-x

Versions

Version	DOI	Comment	Publication Date
1	10.13012/B2IDB-9087546_V1		2018-04-19

Change Log

Contact the Research Data Service for help interpreting this log.

Dataset	update: {"all_globus"=>[nil, true]}	2026-01-16T15:37:40Z
Dataset	update: {"all_medusa"=>[nil, true]}	2026-01-16T15:35:57Z
RelatedMaterial	update: {"datacite_list"=>["IsSupplementedBy ", "IsSupplementedBy"]}	2024-04-18T18:23:34Z
RelatedMaterial	update: {"datacite_list"=>["IsSupplementedBy ", "IsSupplementedBy"]}	2024-04-18T18:23:34Z
RelatedMaterial	create: {"material_type"=>"Article", "availability"=>nil, "link"=>"https://doi.org/10.1038/s41597-022-01578-x", "uri"=>"10.1038/s41597-022-01578-x", "uri_type"=>"DOI", "citation"=>"Ke, Q., Liang, L., Ding, Y. et al. A dataset of mentorship in bioscience with semantic and demographic estimations. Sci Data 9, 467 (2022). https://doi.org/10.1038/s41597-022-01578-x", "dataset_id"=>536, "selected_type"=>"Article", "datacite_list"=>"IsCitedBy"}	2022-08-08T16:45:04Z
RelatedMaterial	create: {"material_type"=>"Article", "availability"=>nil, "link"=>"https://arxiv.org/abs/2106.06487", "uri"=>"https://arxiv.org/abs/2106.06487", "uri_type"=>"URL", "citation"=>"Ke, Qing, Lizhen Liang, Ying Ding, Stephen V David and Daniel Ernesto Acuna. “A dataset of mentorship in science with semantic and demographic estimations.” ArXiv abs/2106.06487 (2021): n. pag.", "dataset_id"=>536, "selected_type"=>"Article", "datacite_list"=>"IsCitedBy"}	2021-06-21T15:22:30Z
RelatedMaterial	create: {"material_type"=>"Article", "availability"=>nil, "link"=>"https://doi.org/10.1145/3461702.3462616", "uri"=>"10.1145/3461702.3462616", "uri_type"=>"DOI", "citation"=>"Acuna, Daniel E., and Lizhen Liang. 2021. “Are AI Ethics Conferences Different and More Diverse Compared to Traditional Computer Science Conferences?.” OSF Preprints. May 19. doi:10.1145/3461702.3462616.", "dataset_id"=>536, "selected_type"=>"Article", "datacite_list"=>"IsCitedBy"}	2021-05-26T19:15:14Z
RelatedMaterial	create: {"material_type"=>"Article", "availability"=>nil, "link"=>"https://doi.org/10.1007/s11192-020-03826-6", "uri"=>"10.1007/s11192-020-03826-6", "uri_type"=>"DOI", "citation"=>"Kim, J., & Owen-Smith, J. (In print). ORCID-linked labeled data for evaluating author name\r\ndisambiguation at scale. Scientometrics. Doi: 10.1007/s11192-020-03826-6", "dataset_id"=>536, "selected_type"=>"Article", "datacite_list"=>"IsCitedBy"}	2021-02-11T16:43:39Z
Dataset	update: description ("First and lastnames of authors" corrected to "First and last names of authors"; rest of the text unchanged, yielding the current Dataset Description)	2018-12-05T21:47:04Z
Dataset	update: {"keywords"=>["Androgyny; Bibliometrics; Data mining; Earch engine; Gender; Semantic orientation; Temporal prediction; Textual markers", "Androgyny; Bibliometrics; Data mining; Search engine; Gender; Semantic orientation; Temporal prediction; Textual markers"]}	2018-12-04T17:56:06Z
RelatedMaterial	create: {"material_type"=>"Article", "availability"=>nil, "link"=>"https://doi.org/10.1007/s11192-018-2968-3", "uri"=>"10.1007/s11192-018-2968-3", "uri_type"=>"DOI", "citation"=>"Kim, J., Kim, J. & Owen-Smith, J. Scientometrics (2018). Generating automatically labeled data for author name disambiguation: an iterative clustering method. https://doi.org/10.1007/s11192-018-2968-3", "dataset_id"=>536, "selected_type"=>"Article", "datacite_list"=>"IsCitedBy"}	2018-12-04T17:40:24Z
RelatedMaterial	destroy: {"material_type"=>"Article", "availability"=>nil, "link"=>"https://doi.org/10.1007/s11192-018-2968-3", "uri"=>"10.1007/s11192-018-2968-3", "uri_type"=>"DOI", "citation"=>"Kim, J., Kim, J. & Owen-Smith, J. Scientometrics (2018). Generating automatically labeled data for author name disambiguation: an iterative clustering method. https://doi.org/10.1007/s11192-018-2968-3", "dataset_id"=>536, "selected_type"=>"Article", "datacite_list"=>"IsCitedBy"}	2018-12-04T16:29:38Z
RelatedMaterial	destroy: {"material_type"=>"Article", "availability"=>nil, "link"=>"https://doi.org/10.1007/s11192-018-2968-3", "uri"=>"10.1007/s11192-018-2968-3", "uri_type"=>"DOI", "citation"=>"Kim, J., Kim, J. & Owen-Smith, J. Scientometrics (2018). Generating automatically labeled data for author name disambiguation: an iterative clustering method. https://doi.org/10.1007/s11192-018-2968-3", "dataset_id"=>536, "selected_type"=>"Article", "datacite_list"=>"IsCitedBy"}	2018-12-04T16:29:38Z
RelatedMaterial	create: {"material_type"=>"Article", "availability"=>nil, "link"=>"https://doi.org/10.1007/s11192-018-2968-3", "uri"=>"10.1007/s11192-018-2968-3", "uri_type"=>"DOI", "citation"=>"Kim, J., Kim, J. & Owen-Smith, J. Scientometrics (2018). Generating automatically labeled data for author name disambiguation: an iterative clustering method. https://doi.org/10.1007/s11192-018-2968-3", "dataset_id"=>536, "selected_type"=>"Article", "datacite_list"=>"IsCitedBy"}	2018-12-04T15:55:44Z
RelatedMaterial	create: {"material_type"=>"Article", "availability"=>nil, "link"=>"https://doi.org/10.1007/s11192-018-2968-3", "uri"=>"10.1007/s11192-018-2968-3", "uri_type"=>"DOI", "citation"=>"Kim, J., Kim, J. & Owen-Smith, J. Scientometrics (2018). Generating automatically labeled data for author name disambiguation: an iterative clustering method. https://doi.org/10.1007/s11192-018-2968-3", "dataset_id"=>536, "selected_type"=>"Article", "datacite_list"=>"IsCitedBy"}	2018-12-04T15:48:58Z
RelatedMaterial	create: {"material_type"=>"Article", "availability"=>nil, "link"=>"https://doi.org/10.1371/journal.pone.0195773", "uri"=>"10.1371/journal.pone.0195773", "uri_type"=>"DOI", "citation"=>"Mishra S, Fegley BD, Diesner J, Torvik VI (2018) Self-citation is the hallmark of productive authors, of any gender. PLoS ONE 13(9): e0195773. https://doi.org/10.1371/journal.pone.0195773", "dataset_id"=>536, "selected_type"=>"Article", "datacite_list"=>"IsSupplementTo"}	2018-09-29T15:43:38Z
Dataset	update: description (preparation date changed from "April 5, 2018" to "2018-04-15"; rest of the text unchanged)	2018-04-23T19:33:56Z
Dataset	update: description (SexMachine link: stray space in the opening tag "< a href" removed, and link text changed from "https://pypi.org/project/SexMachine/" to "https://pypi.org/project/SexMachine"; rest of the text unchanged)	2018-04-23T19:33:07Z
RelatedMaterial	create: {"material_type"=>"Article", "availability"=>nil, "link"=>"https://www.aaai.org/ocs/index.php/AAAI/AAAI12/paper/view/5180", "uri"=>"https://www.aaai.org/ocs/index.php/AAAI/AAAI12/paper/view/5180", "uri_type"=>"URL", "citation"=>"Treeratpituk P, Giles CL (2012). Name-Ethnicity Classification and Ethnicity-Sensitive Name Matching. Proceedings of the Twenty-Sixth Conference on Artificial Intelligence (pp. 1141-1147). AAAI-12. Toronto, ON, Canada", "dataset_id"=>536, "selected_type"=>"Article", "datacite_list"=>"IsSupplementedBy "}	2018-04-23T19:30:15Z
RelatedMaterial	create: {"material_type"=>"Article", "availability"=>nil, "link"=>"https://doi.org/10.1145/2467696.2467720", "uri"=>"10.1145/2467696.2467720", "uri_type"=>"DOI", "citation"=>"Smith, B., Singh, M., & Torvik, V. (2013). A search engine approach to estimating temporal changes in gender orientation of first names. Proceedings Of The ACM/IEEE Joint Conference On Digital Libraries, (JCDL 2013 - Proceedings of the 13th ACM/IEEE-CS Joint Conference on Digital Libraries), 199-208. doi:10.1145/2467696.2467720", "dataset_id"=>536, "selected_type"=>"Article", "datacite_list"=>"IsSupplementedBy "}	2018-04-23T19:30:15Z
RelatedMaterial	create: {"material_type"=>"Article", "availability"=>nil, "link"=>"http://hdl.handle.net/2142/88927", "uri"=>"hdl.handle.net/2142/88927", "uri_type"=>"Handle", "citation"=>"Torvik VI, Agarwal S. Ethnea -- an instance-based ethnicity classifier based on geocoded author names in a large-scale bibliographic database. International Symposium on Science of Science March 22-23, 2016 - Library of Congress, Washington, DC, USA. http://hdl.handle.net/2142/88927", "dataset_id"=>536, "selected_type"=>"Article", "datacite_list"=>"IsSupplementedBy"}	2018-04-23T19:30:15Z
Dataset	update: {"description"=>["prepared by Vetle Torvik April 5, 2018\r\n\r\nThe dataset comes as a single tab-delimited ASCII encoded file, and should be about 717MB uncompressed.\r\n\r\nHow was the dataset created?\r\nFirst and lastnames of authors in the Author-ity 2009 dataset was processed through several tools to predict ethnicities and gender, including\r\n\r\nEthnea+Genni as described in:\r\n\r\nTorvik VI, Agarwal S. Ethnea -- an instance-based ethnicity classifier based on geocoded author names in a large-scale bibliographic database. International Symposium on Science of Science March 22-23, 2016 - Library of Congress, Washington, DC, USA.\r\nhttp://hdl.handle.net/2142/88927\r\n\r\nSmith BN, Singh M, Torvik VI (2013). A search engine approach to estimating temporal changes in gender orientation of first names. Proceedings of the 13th ACM/IEEE-CS Joint Conference on Digital Libraries (pp. 199-208). JCDL '13. Indianapolis, IN, USA.\r\n\r\nEthnicSeer: http://singularity.ist.psu.edu/ethnicity\r\n Treeratpituk P, Giles CL (2012). Name-Ethnicity Classification and Ethnicity-Sensitive Name Matching. Proceedings of the Twenty-Sixth Conference on Artificial Intelligence (pp. 1141-1147). AAAI-12. Toronto, ON, Canada\r\n\r\nSexMachine 0.1.1: https://pypi.python.org/pypi/SexMachine/\r\n\r\nFirst names, for some Author-ity records lacking them, were harvested from outside bibliographic databases.\r\n\r\nThe code and back-end data is periodically updated and made available for query here\r\nhttp://abel.ischool.illinois.edu\r\n\r\nWhat is the format of the dataset?\r\n1. auid: unique ID for Authors in Author-ity 2009 (PMID_authorposition)\r\n2. name: full name used as input to EthnicSeer)\r\n2. EthnicSeer: predicted ethnicity; ARA, CHI, ENG, FRN, GER, IND, ITA, JAP, KOR, RUS, SPA, VIE, XXX\r\n3. prop: decimal between 0 and 1 reflecting the confidence of the EthnicSeer prediction\r\n4. lastname: used as input for Ethnea+Genni\r\n5. 
firstname: used as input for Ethnea+Genni\r\n6. Ethnea: predicted ethnicity; either one of 26 (AFRICAN, ARAB, BALTIC, CARIBBEAN, CHINESE, DUTCH, ENGLISH, FRENCH, GERMAN, GREEK, HISPANIC, HUNGARIAN, INDIAN, INDONESIAN, ISRAELI, ITALIAN, JAPANESE, KOREAN, MONGOLIAN, NORDIC, POLYNESIAN, ROMANIAN, SLAV, THAI, TURKISH, VIETNAMESE) or two ethnicities (e.g., SLAV-ENGLISH), or UNKNOWN (if no one or two dominant predictons), or TOOSHORT (if both first and last name are too short)\r\n7. Genni: predicted gender; 'F', 'M', or '-'\r\n8. SexMac: predicted gender based on third-party Python program (default settings except case_sensitive=False); female, mostly_female, andy, mostly_male, male)\r\n9. SSNgender: predicted gender based on US SSN data; 'F', 'M', or '-'\r\n", "Prepared by Vetle Torvik April 5, 2018\r\n\r\nThe dataset comes as a single tab-delimited ASCII encoded file, and should be about 717MB uncompressed.\r\n\r\n• How was the dataset created?\r\nFirst and lastnames of authors in the Author-ity 2009 dataset was processed through several tools to predict ethnicities and gender, including\r\n\r\nEthnea+Genni as described in:\r\n\r\n<i>Torvik VI, Agarwal S. Ethnea -- an instance-based ethnicity classifier based on geocoded author names in a large-scale bibliographic database. International Symposium on Science of Science March 22-23, 2016 - Library of Congress, Washington, DC, USA.\r\nhttp://hdl.handle.net/2142/88927</i>\r\n\r\n<i>Smith, B., Singh, M., & Torvik, V. (2013). A search engine approach to estimating temporal changes in gender orientation of first names. Proceedings Of The ACM/IEEE Joint Conference On Digital Libraries, (JCDL 2013 - Proceedings of the 13th ACM/IEEE-CS Joint Conference on Digital Libraries), 199-208. doi:10.1145/2467696.2467720</i>\r\n\r\nEthnicSeer: http://singularity.ist.psu.edu/ethnicity\r\n<i>Treeratpituk P, Giles CL (2012). Name-Ethnicity Classification and Ethnicity-Sensitive Name Matching. 
Proceedings of the Twenty-Sixth Conference on Artificial Intelligence (pp. 1141-1147). AAAI-12. Toronto, ON, Canada</i>\r\n\r\nSexMachine 0.1.1: < a href=\"https://pypi.python.org/pypi/SexMachine/\">https://pypi.org/project/SexMachine/</a>\r\n\r\nFirst names, for some Author-ity records lacking them, were harvested from outside bibliographic databases.\r\n\r\n• The code and back-end data is periodically updated and made available for query at <a href =\"http://abel.ischool.illinois.edu\">Torvik Research Group</a>\r\n\r\n• What is the format of the dataset?\r\nThe dataset contains 9,300,182 rows and 10 columns\r\n1. auid: unique ID for Authors in Author-ity 2009 (PMID_authorposition)\r\n2. name: full name used as input to EthnicSeer)\r\n3. EthnicSeer: predicted ethnicity; ARA, CHI, ENG, FRN, GER, IND, ITA, JAP, KOR, RUS, SPA, VIE, XXX\r\n4. prop: decimal between 0 and 1 reflecting the confidence of the EthnicSeer prediction\r\n5. lastname: used as input for Ethnea+Genni\r\n6. firstname: used as input for Ethnea+Genni\r\n7. Ethnea: predicted ethnicity; either one of 26 (AFRICAN, ARAB, BALTIC, CARIBBEAN, CHINESE, DUTCH, ENGLISH, FRENCH, GERMAN, GREEK, HISPANIC, HUNGARIAN, INDIAN, INDONESIAN, ISRAELI, ITALIAN, JAPANESE, KOREAN, MONGOLIAN, NORDIC, POLYNESIAN, ROMANIAN, SLAV, THAI, TURKISH, VIETNAMESE) or two ethnicities (e.g., SLAV-ENGLISH), or UNKNOWN (if no one or two dominant predictons), or TOOSHORT (if both first and last name are too short)\r\n8. Genni: predicted gender; 'F', 'M', or '-'\r\n9. SexMac: predicted gender based on third-party Python program (default settings except case_sensitive=False); female, mostly_female, andy, mostly_male, male)\r\n10. SSNgender: predicted gender based on US SSN data; 'F', 'M', or '-'\r\n"], "keywords"=>["", "Androgyny; Bibliometrics; Data mining; Earch engine; Gender; Semantic orientation; Temporal prediction; Textual markers"], "version_comment"=>[nil, ""], "subject"=>["", "Social Sciences"]}	2018-04-23T19:30:14Z
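The ten-column, tab-delimited layout recorded in the entry above can be read with a few lines of Python. This is a minimal sketch, not part of the dataset's own tooling: it assumes the file has no header row, the helper name read_rows is hypothetical, and the sample record is made up for illustration.

```python
import csv
import io

# Column names in the order given by the dataset description.
COLUMNS = [
    "auid", "name", "EthnicSeer", "prop", "lastname",
    "firstname", "Ethnea", "Genni", "SexMac", "SSNgender",
]


def read_rows(lines):
    """Yield one dict per record from an iterable of tab-delimited lines.

    Assumption: no header row. If the real file has one, skip the first record.
    """
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for fields in reader:
        if not fields:
            continue  # ignore blank lines
        if len(fields) != len(COLUMNS):
            raise ValueError(f"expected {len(COLUMNS)} fields, got {len(fields)}")
        row = dict(zip(COLUMNS, fields))
        # prop is the EthnicSeer confidence, a decimal between 0 and 1.
        row["prop"] = float(row["prop"])
        yield row


# Made-up record for illustration only; it is not taken from the dataset.
sample = io.StringIO(
    "12345678_2\tJane Doe\tENG\t0.87\tDoe\tJane\tENGLISH\tF\tfemale\tF\n"
)
rows = list(read_rows(sample))
print(rows[0]["auid"], rows[0]["Ethnea"], rows[0]["prop"])
```

To read the downloaded file instead of the in-memory sample, pass an open file handle (for example one opened with encoding="ascii" and newline="") to read_rows. Because the function is a generator, the roughly 717MB file is processed one row at a time rather than loaded into memory.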

Related Materials (4)

Cited By (5)
