Dataset Description
|
Hype - PubMed dataset
Prepared by Apratim Mishra
This dataset captures ‘Hype’ within biomedical abstracts sourced from PubMed. The selection chosen is ‘journal articles’ written in English, published between 1975 and 2019, totaling ~5.2 million. The classification relies on the presence of specific candidate ‘hype words’ and their abstract location. Therefore, each article might have multiple instances in the dataset due to the presence of multiple hype words in different abstract sentences.
The candidate hype words are 36 in count: 'major', 'novel', 'central', 'critical', 'essential', 'strongly', 'unique', 'promising', 'markedly', 'excellent', 'crucial', 'robust', 'importantly', 'prominent', 'dramatically', 'favorable', 'vital', 'surprisingly', 'remarkably', 'remarkable', 'definitive', 'pivotal', 'innovative', 'supportive', 'encouraging', 'unprecedented', 'bright', 'enormous', 'exceptional', 'outstanding', 'noteworthy', 'creative', 'assuring', 'reassuring', 'spectacular', and 'hopeful'.
File 1: hype_dataset.csv
Primary dataset. It has the following columns:
1. PMID: represents unique article ID in PubMed
2. Hype_word: Candidate hype word, such as ‘novel.’
3. Sentence: Sentence in abstract containing the hype word.
4. Abstract_length: Length of article abstract.
5. Hype_percentile: Abstract relative position of hype word.
6. Hype_value: Propensity of hype based on the hype word, the sentence, and the abstract location.
7. Introduction: The ‘I’ component of the hype word based on IMRaD
8. Methods: The ‘M’ component of the hype word based on IMRaD
9. Results: The ‘R’ component of the hype word based on IMRaD
10. Discussion: The ‘D’ component of the hype word based on IMRaD
File 2: hype_removed_phrases.csv
Secondary dataset with same columns as File 1.
Hype in the primary dataset is based on excluding certain phrases that are rarely hype. The phrases that were removed are included in File 2 and modeled separately. Removed phrases:
1. Major: histocompatibility, component, protein, metabolite, complex, surgery
2. Novel: assay, mutation, antagonist, inhibitor, algorithm, technique, series, method, hybrid
3. Central: catheters, system, design, composite, catheter, pressure, thickness, compartment
4. Critical: compartment, micelle, temperature, incident, solution, ischemia, concentration
5. Essential: medium, features, properties, opportunities
6. Unique: model, amino
7. Robust: regression
8. Vital: capacity, signs, organs, status, structures, staining, rates, cells, information
9. Outstanding: questions, issues, question, challenge, problems, problem, remains
10. Remarkable: properties
11. Definite: radiotherapy, surgery
12. Bright: field
|