Wikipedia category embeddings - Node2Vec, Poincare, Elmo

Name: Wikipedia category embeddings - Node2Vec, Poincare, Elmo
License: http://creativecommons.org/licenses/by/4.0/
Keywords: Wikipedia, Wikipedia Category Tree, Embeddings, Elmo, Node2Vec, Poincare

Mishra, Shubhanshu

doi:10.13012/B2IDB-4551278_V1

Cite this dataset:

Mishra, Shubhanshu (2019): Wikipedia category embeddings - Node2Vec, Poincare, Elmo. University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-4551278_V1

Use this persistent URL to link to this dataset: https://doi.org/10.13012/B2IDB-4551278_V1

Metadata


Dataset Description	Wikipedia category tree embeddings based on the Wikipedia SQL dump dated 2017-09-20 (https://archive.org/download/enwiki-20170920), created using the following algorithms:

* Node2vec
* Poincare embedding
* Elmo model on the category title

The following files are present:

* wiki_cat_elmo.txt.gz (15G) - Elmo embeddings. Format: category_name (space replaced with "_") <tab> 300 dim space separated embedding.
* wiki_cat_elmo.txt.w2v.gz (15G) - Elmo embeddings in word2vec format; can be loaded using Gensim Word2VecKeyedVectors.load_word2vec_format.
* elmo_keyedvectors.tar.gz - Gensim Word2VecKeyedVectors format of Elmo embeddings. Nodes are indexed using
* node2vec.tar.gz (3.4G) - Gensim word2vec model which has the node2vec embedding for each category, identified using the position (starting from 0) in categories.txt
* poincare.tar.gz (1.8G) - Gensim poincare embedding model which has the poincare embedding for each category, identified using the position (starting from 0) in categories.txt
* wiki_category_random_walks.txt.gz (1.5G) - Random walks generated by the node2vec algorithm (https://github.com/aditya-grover/node2vec/tree/master/node2vec_spark), each category identified using the position (starting from 0) in categories.txt
* categories.txt - One category name per line (with spaces). The line number (starting from 0) is used as category ID in many other files.
* category_edges.txt - Category edges based on category names (with spaces). Format: from_category <tab> to_category
* category_edges_ids.txt - Category edges based on category ids, each category identified using the position (starting from 1) in categories.txt
* wiki_cats-G.json - NetworkX format of the category graph, each category identified using the position (starting from 1) in categories.txt

Software used:

* https://github.com/napsternxg/WikiUtils - Processing SQL dumps
* https://github.com/napsternxg/node2vec - Generate random walks for node2vec
* https://github.com/RaRe-Technologies/gensim (version 3.4.0) - Generating node2vec embeddings from random walks generated using the node2vec algorithm
* https://github.com/allenai/allennlp (version 0.8.2) - Generate Elmo embeddings for each category title

Code used:

* wiki_cat_node2vec_commands.sh - Commands used to
* wiki_cat_generate_elmo_embeddings.py - Generate Elmo embeddings
* wiki_cat_poincare_embedding.py - Generate poincare embeddings
Subject	Social Sciences
Keywords	Wikipedia; Wikipedia Category Tree; Embeddings; Elmo; Node2Vec; Poincare;
License	CC BY
Corresponding Creator	Shubhanshu Mishra
Downloaded	3367 times
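
The file formats listed in the Dataset Description can be read with a few lines of standard-library Python. The following is a minimal sketch, not code shipped with the dataset: the function names (`parse_elmo_line`, `iter_elmo_vectors`, `load_category_ids`) and the `first_id` parameter are hypothetical, and it assumes the files are UTF-8 encoded and follow the layout stated in the description (name, a tab, then space-separated floats for wiki_cat_elmo.txt.gz; one category name per line for categories.txt).

```python
import gzip
from typing import Dict, Iterator, List, Tuple


def parse_elmo_line(line: str) -> Tuple[str, List[float]]:
    """Split one 'category_name<tab>v1 v2 ...' line into (name, vector).

    The name is returned as stored in the file, i.e. with "_" in place of spaces.
    """
    name, _, values = line.rstrip("\n").partition("\t")
    return name, [float(v) for v in values.split()]


def iter_elmo_vectors(path: str) -> Iterator[Tuple[str, List[float]]]:
    """Stream (name, vector) pairs from a gzipped text file without loading it whole."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            if line.strip():  # skip blank lines
                yield parse_elmo_line(line)


def load_category_ids(path: str, first_id: int = 0) -> Dict[int, str]:
    """Map category ID -> category name using the line position in categories.txt.

    first_id is an assumption knob: the description gives 0-based positions for
    some files and 1-based positions for others, so the offset is left explicit.
    """
    with open(path, encoding="utf-8") as handle:
        return {i: line.rstrip("\n") for i, line in enumerate(handle, start=first_id)}


# Example usage (paths assume the files were downloaded to the working directory):
#   id_to_name = load_category_ids("categories.txt")
#   for name, vector in iter_elmo_vectors("wiki_cat_elmo.txt.gz"):
#       ...
```

Streaming the 15G Elmo file line by line keeps memory use flat. For the word2vec-format copy (wiki_cat_elmo.txt.w2v.gz), the description points to Gensim's `load_word2vec_format` instead, which loads all vectors into memory at once.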

Versions

Version	DOI	Comment	Publication Date
1	10.13012/B2IDB-4551278_V1		2019-07-08

Change Log

Contact the Research Data Service for help interpreting this log.

Dataset	update: {"all_globus"=>[nil, true]}	2026-01-16T15:42:49Z
Dataset	update: {"all_medusa"=>[nil, true]}	2026-01-16T15:35:55Z
Dataset	update: {"description"=>[text with plain URLs, same text with the archive.org and software URLs as HTML links]}	2019-07-19T15:48:24Z
Dataset	update: {"version_comment"=>[nil, ""], "subject"=>[nil, "Social Sciences"]}	2019-07-09T16:26:04Z
