# Overview
These datasets were created in conjunction with the dissertation "Predicting Controlled Vocabulary Based on Text and Citations: Case Studies in Medical Subject Headings in MEDLINE and Patents," by Adam Kehoe.
The datasets consist of the following:
* twin_not_abstract_matched_complete.tsv: a tab-delimited file consisting of pairs of MEDLINE articles with identical titles, authors and years of publication. This file contains the PMIDs of the duplicate publications, as well as their medical subject headings (MeSH) and three measures of their indexing consistency.
* twin_abstract_matched_complete.tsv: the same as above, except that the MEDLINE articles also have matching abstracts.
* mesh_training_data.csv: a comma-separated file containing the training data for the model discussed in the dissertation.
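* mesh_scores.tsv: a tab-delimited file containing, for each pair of MeSH terms, a similarity score based on word embeddings and the pair's relationship in the MeSH hierarchy.
* mesh_model.bin: a binary file in word2vec C format containing the MeSH term embeddings.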
## Duplicate MEDLINE Publications
The twin_not_abstract_matched_complete.tsv and twin_abstract_matched_complete.tsv files have the same structure. Each has the following columns:
1. pmid_one: the PubMed unique identifier of the first paper
2. pmid_two: the PubMed unique identifier of the second paper
3. mesh_one: a list of medical subject headings (MeSH) from the first paper, delimited by the "|" character
4. mesh_two: a list of medical subject headings (MeSH) from the second paper, delimited by the "|" character
5. hoopers_consistency: Hooper's consistency between the MeSH of the first and second papers
6. nonhierarchicalfree: a word-embedding-based consistency score, described in the dissertation
7. hierarchicalfree: a word-embedding-based consistency score additionally limited by the MeSH hierarchy, described in the dissertation
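The column layout above can be read with Python's standard library. The following is a minimal sketch, not the dissertation's own code. It assumes the files begin with a header row using the column names listed above, and it uses the standard definition of Hooper's consistency (shared terms divided by all distinct terms used by either indexer). The helper names and the sample row are hypothetical, and the sample values are placeholders rather than real data.

```python
import csv
import io


def parse_mesh(field):
    """Split a "|"-delimited MeSH field into a set of headings."""
    return {term.strip() for term in field.split("|") if term.strip()}


def hoopers_consistency(terms_one, terms_two):
    """Standard Hooper's measure: |A & B| / |A | B|."""
    union = terms_one | terms_two
    if not union:
        return 0.0
    return len(terms_one & terms_two) / len(union)


def read_twin_pairs(handle):
    """Yield (pmid_one, pmid_two, recomputed consistency) per row.

    Assumption: the first line of the file is a header row.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    for row in reader:
        score = hoopers_consistency(
            parse_mesh(row["mesh_one"]), parse_mesh(row["mesh_two"])
        )
        yield row["pmid_one"], row["pmid_two"], score


# Hypothetical one-row sample; PMIDs and scores are placeholders.
SAMPLE_TSV = (
    "pmid_one\tpmid_two\tmesh_one\tmesh_two\t"
    "hoopers_consistency\tnonhierarchicalfree\thierarchicalfree\n"
    "1\t2\tHumans|Neoplasms|Female\tHumans|Neoplasms|Male\t0.5\t0.0\t0.0\n"
)

for pmid_one, pmid_two, score in read_twin_pairs(io.StringIO(SAMPLE_TSV)):
    print(pmid_one, pmid_two, score)
```

To read the real file, pass `open("twin_abstract_matched_complete.tsv", newline="", encoding="utf-8")` in place of the `io.StringIO` object. A recomputed score may differ from the stored hoopers_consistency value if the dissertation treated qualifiers or other indexing details differently.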
## MeSH Training Data
The mesh_training_data.csv file contains the training data for the model discussed in the dissertation. It has the following columns:
1. pmid: the PubMed unique identifier of the paper
2. term: a candidate MeSH term
3. cit_count: the log of the frequency of the term in the citation candidate set
4. total_cit: the log of the total number of the paper's citations
5. citr_count: the log of the frequency of the term in the citations of the paper's citations
6. total_citofcit: the log of the total number of the citations of the paper's citations
7. absim_count: the log of the frequency of the term in the AbSim candidate set
8. total_absim_count: the log of the total number of AbSim records for the paper
9. absimr_count: the log of the frequency of the term in the citations of the AbSim records
10. total_absimr_count: the log of the total number of citations of the AbSim records
11. log_medline_frequency: the log of the frequency of the candidate term in MEDLINE
12. relevance: a binary indicator (True/False) of whether the candidate term was assigned to the target paper
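As an illustration of the layout above, the following sketch groups candidate terms by paper and separates the nine numeric features from the relevance label. It assumes the CSV begins with a header row using the column names listed above and that relevance is stored as the text "True" or "False". The function name and the sample rows are hypothetical, and the numeric values are placeholders rather than real data.

```python
import csv
import io
from collections import defaultdict

# Numeric feature columns, in the order documented above.
FEATURE_COLUMNS = [
    "cit_count", "total_cit", "citr_count", "total_citofcit",
    "absim_count", "total_absim_count", "absimr_count",
    "total_absimr_count", "log_medline_frequency",
]


def load_candidates(handle):
    """Return {pmid: [(term, [features...], relevant), ...]}.

    Assumption: the first line of the file is a header row.
    """
    by_pmid = defaultdict(list)
    for row in csv.DictReader(handle):
        features = [float(row[name]) for name in FEATURE_COLUMNS]
        relevant = row["relevance"].strip().lower() == "true"
        by_pmid[row["pmid"]].append((row["term"], features, relevant))
    return dict(by_pmid)


# Hypothetical sample rows; every value is a placeholder.
SAMPLE_CSV = (
    "pmid,term,cit_count,total_cit,citr_count,total_citofcit,"
    "absim_count,total_absim_count,absimr_count,total_absimr_count,"
    "log_medline_frequency,relevance\n"
    "1,Humans,1.0,2.0,3.0,4.0,1.0,2.0,3.0,4.0,10.0,True\n"
    "1,Mice,0.0,2.0,1.0,4.0,0.0,2.0,1.0,4.0,9.0,False\n"
)

candidates = load_candidates(io.StringIO(SAMPLE_CSV))
for term, features, relevant in candidates["1"]:
    print(term, len(features), relevant)
```

For the real file, pass `open("mesh_training_data.csv", newline="", encoding="utf-8")` instead of the `io.StringIO` object.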
## Cosine Similarity
The mesh_scores.tsv file lists pairs of MeSH terms together with their cosine similarity, computed from the word embedding described in the dissertation. Because many of the evaluation measures also use the MeSH hierarchy, the hierarchical relationship of each term pair is included as well. The file has the following columns:
1. mesh_one: the first MeSH heading, as a string
2. mesh_two: the second MeSH heading, as a string
3. cosine_similarity: the cosine similarity between the two terms
4. relationship_type: a string identifying the relationship type, one of none, parent/child, sibling, ancestor, or direct (the terms are identical, i.e. a direct hierarchy match)
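For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths. The sketch below shows that calculation in plain Python. The function name is hypothetical and the input vectors are toy values, not MeSH embeddings; the dissertation's own scores were presumably computed with a vector library rather than this code.

```python
import math


def cosine_similarity(vec_one, vec_two):
    """Dot product of the vectors divided by the product of their norms."""
    if len(vec_one) != len(vec_two):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(vec_one, vec_two))
    norm_one = math.sqrt(sum(a * a for a in vec_one))
    norm_two = math.sqrt(sum(b * b for b in vec_two))
    if norm_one == 0.0 or norm_two == 0.0:
        # Undefined for a zero vector; returning 0.0 is a choice made here.
        return 0.0
    return dot / (norm_one * norm_two)


# Toy vectors: parallel vectors score 1, orthogonal vectors score 0.
print(cosine_similarity([3.0, 4.0], [6.0, 8.0]))
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))
```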
The mesh_model.bin file is a binary file in word2vec C format containing the MeSH term embeddings. It was generated using version 3.7.2 of the Python gensim library (https://radimrehurek.com/gensim/).
For an example of how to load the model file, see https://radimrehurek.com/gensim/models/word2vec.html#usage-examples, specifically the directions for loading the "word2vec C format."
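A minimal loading sketch follows, assuming gensim is installed and mesh_model.bin is in the working directory. `KeyedVectors.load_word2vec_format` with `binary=True` is gensim's loader for the binary word2vec C format. The wrapper function name is hypothetical, and how MeSH headings are spelled as keys in the model is not documented here, so inspect the vocabulary before looking up a term.

```python
def load_mesh_vectors(path="mesh_model.bin"):
    """Load the MeSH term embeddings from a binary word2vec C format file."""
    # Imported inside the function so that defining it does not require gensim.
    from gensim.models import KeyedVectors

    return KeyedVectors.load_word2vec_format(path, binary=True)


# Example use (requires gensim and the model file):
# vectors = load_mesh_vectors()
# print(vectors.vector_size)  # dimensionality of the embeddings
```

Once the key format is known, `vectors.similarity(term_one, term_two)` returns the cosine similarity between two terms in the model.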