Contact the Research Data Service for
help interpreting this log.
Dataset
|
update: {"description"=>["Wikipedia category tree embeddings based on wikipedia SQL dump dated 2017-09-20 (https://archive.org/download/enwiki-20170920) created using the following algorithms:\r\n\r\n* Node2vec\r\n* Poincare embedding\r\n* Elmo model on the category title\r\n\r\nThe following files are present:\r\n\r\n* wiki_cat_elmo.txt.gz (15G) - Elmo embeddings. Format: category_name (space replaced with \"_\") <tab> 300 dim space separated embedding. \r\n* wiki_cat_elmo.txt.w2v.gz (15G) - Elmo embeddings. Format: word2vec format can be loaded using Gensim Word2VecKeyedVector.load_word2vec_format. \r\n* elmo_keyedvectors.tar.gz - Gensim Word2VecKeyedVector format of Elmo embeddings. Nodes are indexed using \r\n* node2vec.tar.gz (3.4G) - Gensim word2vec model which has node2vec embedding for each category identified using the position (starting from 0) in category.txt \r\n* poincare.tar.gz (1.8G) - Gensim poincare embedding model which has poincare embedding for each category identified using the position (starting from 0) in category.txt\r\n* wiki_category_random_walks.txt.gz (1.5G) - Random walks generated by node2vec algorithm (https://github.com/aditya-grover/node2vec/tree/master/node2vec_spark), each category identified using the position (starting from 0) in category.txt\r\n* categories.txt - One category name per line (with spaces). The line number (starting from 0) is used as category ID in many other files. \r\n* category_edges.txt - Category edges based on category names (with spaces). 
Format from_category <tab> to_category\r\n* category_edges_ids.txt - Category edges based on category ids, each category identified using the position (starting from 1) in category.txt\r\n* wiki_cats-G.json - NetworkX format of category graph, each category identified using the position (starting from 1) in category.txt\r\n\r\n\r\n\r\nSoftware used:\r\n\r\n* https://github.com/napsternxg/WikiUtils - Processing sql dumps\r\n* https://github.com/napsternxg/node2vec - Generate random walks for node2vec\r\n* https://github.com/RaRe-Technologies/gensim (version 3.4.0) - generating node2vec embeddings from random walks generated using node2vec algorithm\r\n* https://github.com/allenai/allennlp (version 0.8.2) - Generate elmo embeddings for each category title\r\n\r\n\r\nCode used: \r\n* wiki_cat_node2vec_commands.sh - Commands used to \r\n* wiki_cat_generate_elmo_embeddings.py - generate elmo embeddings\r\n* wiki_cat_poincare_embedding.py - generate poincare embeddings", "Wikipedia category tree embeddings based on wikipedia SQL dump dated 2017-09-20 (<a href=\"https://archive.org/download/enwiki-20170920\">https://archive.org/download/enwiki-20170920</a>) created using the following algorithms:\r\n\r\n* Node2vec\r\n* Poincare embedding\r\n* Elmo model on the category title\r\n\r\nThe following files are present:\r\n\r\n* wiki_cat_elmo.txt.gz (15G) - Elmo embeddings. Format: category_name (space replaced with \"_\") <tab> 300 dim space separated embedding. \r\n* wiki_cat_elmo.txt.w2v.gz (15G) - Elmo embeddings. Format: word2vec format can be loaded using Gensim Word2VecKeyedVector.load_word2vec_format. \r\n* elmo_keyedvectors.tar.gz - Gensim Word2VecKeyedVector format of Elmo embeddings. 
Nodes are indexed using \r\n* node2vec.tar.gz (3.4G) - Gensim word2vec model which has node2vec embedding for each category identified using the position (starting from 0) in category.txt \r\n* poincare.tar.gz (1.8G) - Gensim poincare embedding model which has poincare embedding for each category identified using the position (starting from 0) in category.txt\r\n* wiki_category_random_walks.txt.gz (1.5G) - Random walks generated by node2vec algorithm (https://github.com/aditya-grover/node2vec/tree/master/node2vec_spark), each category identified using the position (starting from 0) in category.txt\r\n* categories.txt - One category name per line (with spaces). The line number (starting from 0) is used as category ID in many other files. \r\n* category_edges.txt - Category edges based on category names (with spaces). Format from_category <tab> to_category\r\n* category_edges_ids.txt - Category edges based on category ids, each category identified using the position (starting from 1) in category.txt\r\n* wiki_cats-G.json - NetworkX format of category graph, each category identified using the position (starting from 1) in category.txt\r\n\r\n\r\n\r\nSoftware used:\r\n\r\n* <a href=\"https://github.com/napsternxg/WikiUtils\">https://github.com/napsternxg/WikiUtils</a> - Processing sql dumps\r\n* <a href=\"https://github.com/napsternxg/node2vec\">https://github.com/napsternxg/node2vec</a> - Generate random walks for node2vec\r\n* <a href=\"https://github.com/RaRe-Technologies/gensim\">https://github.com/RaRe-Technologies/gensim</a> (version 3.4.0) - generating node2vec embeddings from random walks generated using node2vec algorithm\r\n* <a href=\"https://github.com/allenai/allennlp\">https://github.com/allenai/allennlp</a> (version 0.8.2) - Generate elmo embeddings for each category title\r\n\r\n\r\nCode used: \r\n* wiki_cat_node2vec_commands.sh - Commands used to \r\n* wiki_cat_generate_elmo_embeddings.py - generate elmo embeddings\r\n* wiki_cat_poincare_embedding.py 
- generate poincare embeddings"]}
|
2019-07-19T15:48:24Z
|
Dataset
|
update: {"version_comment"=>[nil, ""], "subject"=>[nil, "Social Sciences"]}
|
2019-07-09T16:26:04Z
|
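
The wiki_cat_elmo.txt.gz layout recorded in the description above (a category name with spaces replaced by "_", a tab, then space-separated floats) can be read with the Python standard library alone. The following is a minimal sketch under stated assumptions, not the dataset authors' code: the function names parse_embedding_line and iter_embeddings are hypothetical, and UTF-8 encoding is an assumption the description does not state.

```python
import gzip
from typing import Iterator, List, Tuple


def parse_embedding_line(line: str) -> Tuple[str, List[float]]:
    """Split one 'name<TAB>v1 v2 ...' line into (category name, vector).

    The name is returned as stored (underscores instead of spaces).
    """
    name, _, values = line.rstrip("\r\n").partition("\t")
    return name, [float(v) for v in values.split()]


def iter_embeddings(path: str) -> Iterator[Tuple[str, List[float]]]:
    """Stream (name, vector) pairs from a gzipped text file.

    Streaming avoids holding the whole 15G file in memory.
    UTF-8 is assumed here; the log does not state the encoding.
    """
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            if line.strip():  # skip blank lines
                yield parse_embedding_line(line)
```

Calling iter_embeddings("wiki_cat_elmo.txt.gz") yields one pair at a time, so a caller can filter for a few categories without loading the full file.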