Illinois Data Bank - Dataset

Version | DOI | Comment | Publication Date
1 | 10.13012/B2IDB-4551278_V1 | | 2019-07-08

47.5 MB File
261 MB File
64 MB File
11.3 GB File
3.39 GB File
1.79 GB File
14.4 GB File
14.6 GB File
1.56 KB File
696 Bytes File
924 Bytes File
1.48 GB File
195 MB File

Contact the Research Data Service for help interpreting this log.

Dataset update: "description" changed 2019-07-19T15:48:24Z
(The old and new values differ only in that the new value renders the archive.org URL and the "Software used" URLs as HTML links. The description text follows.)

Wikipedia category tree embeddings based on the Wikipedia SQL dump dated 2017-09-20 (https://archive.org/download/enwiki-20170920), created using the following algorithms:

* Node2vec
* Poincaré embedding
* ELMo model on the category title

The following files are present:

* wiki_cat_elmo.txt.gz (15G) - ELMo embeddings. Format: category_name (spaces replaced with "_") <tab> 300-dimensional embedding, space separated.
* wiki_cat_elmo.txt.w2v.gz (15G) - ELMo embeddings in word2vec format; can be loaded using Gensim Word2VecKeyedVector.load_word2vec_format.
* elmo_keyedvectors.tar.gz - Gensim Word2VecKeyedVector format of ELMo embeddings. Nodes are indexed using
* node2vec.tar.gz (3.4G) - Gensim word2vec model with a node2vec embedding for each category, identified by its position (starting from 0) in categories.txt.
* poincare.tar.gz (1.8G) - Gensim Poincaré embedding model with a Poincaré embedding for each category, identified by its position (starting from 0) in categories.txt.
* wiki_category_random_walks.txt.gz (1.5G) - Random walks generated by the node2vec algorithm (https://github.com/aditya-grover/node2vec/tree/master/node2vec_spark), with each category identified by its position (starting from 0) in categories.txt.
* categories.txt - One category name per line (with spaces). The line number (starting from 0) is used as the category ID in many other files.
* category_edges.txt - Category edges based on category names (with spaces). Format: from_category <tab> to_category
* category_edges_ids.txt - Category edges based on category IDs, with each category identified by its position (starting from 1) in categories.txt.
* wiki_cats-G.json - NetworkX format of the category graph, with each category identified by its position (starting from 1) in categories.txt.

Software used:

* https://github.com/napsternxg/WikiUtils - Processing SQL dumps
* https://github.com/napsternxg/node2vec - Generate random walks for node2vec
* https://github.com/RaRe-Technologies/gensim (version 3.4.0) - Generating node2vec embeddings from random walks generated using the node2vec algorithm
* https://github.com/allenai/allennlp (version 0.8.2) - Generate ELMo embeddings for each category title

Code used:

* wiki_cat_node2vec_commands.sh - Commands used to
* wiki_cat_generate_elmo_embeddings.py - generate ELMo embeddings
* wiki_cat_poincare_embedding.py - generate Poincaré embeddings
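
The file formats described above can be read with the Python standard library alone. The following is a minimal sketch, not code shipped with the dataset: the function names (load_categories, category_for_id, iter_elmo_embeddings) and the first_id parameter are hypothetical, and the layouts are assumed to match the description exactly (one name per line in categories.txt; name, a tab, then space-separated floats in wiki_cat_elmo.txt.gz). Because the description gives IDs starting from 0 for some files and from 1 for others, the ID base is left as a parameter instead of being hard-coded.

```python
import gzip
from typing import Iterator, List, Tuple


def load_categories(path: str) -> List[str]:
    """Read categories.txt: one category name per line.

    The list index equals the 0-based line number of each name.
    """
    with open(path, encoding="utf-8") as handle:
        return [line.rstrip("\r\n") for line in handle]


def category_for_id(categories: List[str], cat_id: int, first_id: int = 0) -> str:
    """Resolve a category ID to its name.

    first_id is the ID given to the first line of categories.txt
    (0 or 1, depending on which file the ID was read from; this is
    an assumption based on the dataset description).
    """
    index = cat_id - first_id
    if index < 0 or index >= len(categories):
        raise IndexError(f"category id {cat_id} out of range")
    return categories[index]


def iter_elmo_embeddings(path: str) -> Iterator[Tuple[str, List[float]]]:
    """Stream (category_name, vector) pairs from a gzipped text file.

    Each line is assumed to be: name (with "_" for spaces), a tab,
    then the embedding as space-separated floats. Names are yielded
    unchanged, underscores included.
    """
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.rstrip("\r\n")
            if not line:
                continue  # skip blank lines
            name, _, vector_text = line.partition("\t")
            yield name, [float(value) for value in vector_text.split()]
```

Streaming the embeddings line by line avoids holding the full file in memory, which matters for a file listed at about 15G compressed. A caller that needs names with spaces can apply name.replace("_", " ") to each yielded name.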
Dataset update: {"version_comment"=>[nil, ""], "subject"=>[nil, "Social Sciences"]} 2019-07-09T16:26:04Z