Illinois Data Bank

Hype - PubMed dataset

Prepared by Apratim Mishra

This dataset captures ‘hype’ in biomedical abstracts sourced from PubMed. It covers records of publication type ‘journal article’ written in English and published between 1975 and 2019, roughly 5.2 million in total. Classification relies on the presence of specific candidate ‘hype words’ and on where they occur in the abstract. Because an abstract can contain several hype words in different sentences, an article (PMID) may appear in the dataset more than once.

There are 35 candidate hype words: 'major', 'novel', 'central', 'critical', 'essential', 'strongly', 'unique', 'promising', 'markedly', 'excellent', 'crucial', 'robust', 'importantly', 'prominent', 'dramatically', 'favorable', 'vital', 'surprisingly', 'remarkably', 'remarkable', 'definitive', 'pivotal', 'innovative', 'supportive', 'encouraging', 'unprecedented', 'enormous', 'exceptional', 'outstanding', 'noteworthy', 'creative', 'assuring', 'reassuring', 'spectacular', and 'hopeful'.
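The one-row-per-instance layout described above can be illustrated with a small sketch. The dataset's actual sentence splitter and matching rules are not documented here, so the naive splitter, the whole-word case-insensitive regex, and the function names below are assumptions for illustration only.

```python
import re

# The 35 candidate hype words listed in the dataset description.
HYPE_WORDS = [
    "major", "novel", "central", "critical", "essential", "strongly",
    "unique", "promising", "markedly", "excellent", "crucial", "robust",
    "importantly", "prominent", "dramatically", "favorable", "vital",
    "surprisingly", "remarkably", "remarkable", "definitive", "pivotal",
    "innovative", "supportive", "encouraging", "unprecedented", "enormous",
    "exceptional", "outstanding", "noteworthy", "creative", "assuring",
    "reassuring", "spectacular", "hopeful",
]

# Assumption: whole-word, case-insensitive matching.
_PATTERN = re.compile(r"\b(" + "|".join(HYPE_WORDS) + r")\b", re.IGNORECASE)


def split_sentences(abstract):
    """Naive splitter (assumption): break after '.', '!' or '?' plus whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", abstract.strip()) if s]


def find_hype_instances(pmid, abstract):
    """Return one row per (sentence, hype word) occurrence in the abstract."""
    rows = []
    for sentence in split_sentences(abstract):
        for match in _PATTERN.finditer(sentence):
            rows.append({
                "PMID": pmid,
                "Hype_word": match.group(1).lower(),
                "Sentence": sentence,
            })
    return rows


# Made-up abstract and placeholder PMID, not taken from the dataset.
example_rows = find_hype_instances(
    "0",
    "We describe a novel assay. Results were markedly better and remarkably robust.",
)
for row in example_rows:
    print(row["PMID"], row["Hype_word"], "|", row["Sentence"])
```

A single abstract yields several rows here, which is why a PMID can repeat in the dataset.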

This is version 3 of the dataset. It adds a new file, WSD_hype.tsv.

File 1: hype_dataset_final.tsv

Primary dataset. It has the following columns:

1. PMID: Unique article ID in PubMed
2. Year: Year of publication
3. Hype_word: Candidate hype word, such as ‘novel’
4. Sentence: Sentence in the abstract containing the hype word
5. Hype_percentile: Relative position of the hype word within the abstract
6. Hype_value: Propensity of hype based on the hype word, the sentence, and the abstract location
7. Introduction: The ‘I’ component of the hype word based on IMRaD
8. Methods: The ‘M’ component of the hype word based on IMRaD
9. Results: The ‘R’ component of the hype word based on IMRaD
10. Discussion: The ‘D’ component of the hype word based on IMRaD
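As a rough sketch of how the ten columns above might be consumed, the snippet below streams a TSV with Python's standard csv module and groups rows by PMID. It assumes the file has a header row using exactly the column names listed and is UTF-8 encoded; neither is confirmed by this description. The sample rows and all numeric values are made up.

```python
import csv
import io
from collections import defaultdict

# Column names as listed in the description (assumed to match the file header).
COLUMNS = [
    "PMID", "Year", "Hype_word", "Sentence", "Hype_percentile",
    "Hype_value", "Introduction", "Methods", "Results", "Discussion",
]


def group_by_pmid(tsv_file):
    """Read a tab-separated file object and collect its rows per PMID."""
    # QUOTE_NONE keeps quotation marks inside sentences as literal text.
    reader = csv.DictReader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    grouped = defaultdict(list)
    for row in reader:
        grouped[row["PMID"]].append(row)
    return dict(grouped)


# Made-up rows for illustration; the values are not from the dataset.
SAMPLE = "\n".join([
    "\t".join(COLUMNS),
    "1\t2001\tnovel\tWe present a novel approach.\t10.0\t0.5\t0.7\t0.1\t0.1\t0.1",
    "1\t2001\trobust\tThe effect was robust.\t80.0\t0.4\t0.1\t0.1\t0.6\t0.2",
    "2\t2010\tmajor\tThis is a major concern.\t5.0\t0.3\t0.8\t0.1\t0.05\t0.05",
]) + "\n"

grouped = group_by_pmid(io.StringIO(SAMPLE))
for pmid, rows in grouped.items():
    print(pmid, [r["Hype_word"] for r in rows])
```

For the real file, the same function could be called on `open("hype_dataset_final.tsv", newline="", encoding="utf-8")`. Grouping the full primary file in memory may be heavy, so aggregating while iterating is likely the safer choice.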

File 2: hype_removed_phrases_final.tsv

Secondary dataset with the same columns as File 1.
The primary dataset excludes certain phrases in which the candidate word is rarely hype. Instances of these phrases are collected in File 2 and modeled separately. Removed phrases, grouped by hype word:

1. Major: histocompatibility, component, protein, metabolite, complex, surgery
2. Novel: assay, mutation, antagonist, inhibitor, algorithm, technique, series, method, hybrid
3. Central: catheters, system, design, composite, catheter, pressure, thickness, compartment
4. Critical: compartment, micelle, temperature, incident, solution, ischemia, concentration, thinking, nurses, skills, analysis, review, appraisal, evaluation, values
5. Essential: medium, features, properties, opportunities, oil
6. Unique: model, amino
7. Robust: regression
8. Vital: capacity, signs, organs, status, structures, staining, rates, cells, information
9. Outstanding: questions, issues, question, challenge, problems, problem, remains
10. Remarkable: properties
11. Definitive: radiotherapy, surgery
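The split between File 1 and File 2 can be pictured with a short sketch. It assumes a removed phrase is the hype word immediately followed by one of the listed terms; the description does not state the exact matching rule, so treat this as one plausible reading. Only a few entries from the list above are included, and the function name is hypothetical.

```python
import re

# A subset of the removed phrases listed above, keyed by hype word.
REMOVED_PHRASES = {
    "major": ["histocompatibility", "component", "protein",
              "metabolite", "complex", "surgery"],
    "robust": ["regression"],
    "vital": ["capacity", "signs", "organs", "status", "structures",
              "staining", "rates", "cells", "information"],
    "remarkable": ["properties"],
}


def is_removed_phrase(hype_word, sentence):
    """Return True if hype_word is directly followed by a listed term.

    Assumption: adjacency ("major surgery") is what defines a removed phrase.
    """
    for term in REMOVED_PHRASES.get(hype_word.lower(), []):
        pattern = (r"\b" + re.escape(hype_word) + r"\s+"
                   + re.escape(term) + r"\b")
        if re.search(pattern, sentence, re.IGNORECASE):
            return True
    return False


# Made-up sentences for illustration.
for word, sentence in [
    ("major", "The major histocompatibility complex was studied."),
    ("major", "This is a major advance."),
    ("vital", "Vital signs were recorded hourly."),
]:
    target = "File 2" if is_removed_phrase(word, sentence) else "File 1"
    print(f"{target}: [{word}] {sentence}")
```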

File 3: WSD_hype.tsv
Includes hype-based disambiguation for the candidate words targeted for word sense disambiguation (WSD).

Subject: Social Sciences
Keywords: Hype; PubMed; Abstracts; Biomedicine
License: CC BY
Corresponding creator: Apratim Mishra
Version  DOI                        Publication Date  Comment
3        10.13012/B2IDB-0651259_V3  2025-03-14        Include new data
2        10.13012/B2IDB-0651259_V2  2025-01-29        The dataset was modified due to revision requirements from the journal of submission.
1        10.13012/B2IDB-0651259_V1  2024-03-09


Contact the Research Data Service for help interpreting this log.

Dataset update: {"publication_state"=>["version candidate under curator review", "released"], "release_date"=>[nil, Fri, 14 Mar 2025]} 2025-03-14T17:00:43Z
Dataset update: {"description"=>[version 2 text, version 3 text]} 2025-03-13T17:20:58Z
The version 3 text is the description reproduced in full above. The version 2 text was identical except that it lacked the File 3 entry and carried this version note in place of the version 3 note: "This is version 2 of the dataset. Changes include: Added “Year” variable. Removed “Abstract length” variable. Modified variable information due to updated probabilistic model of hype. Number of hype words - 35 (updated from 36 based on revised findings)."
Dataset update: {"hold_state"=>["version candidate under curator review", "none"]} 2025-03-13T15:42:56Z
Dataset update: {"version_comment"=>[nil, "Include new data"]} 2025-03-13T04:35:22Z
RelatedMaterial create: {"material_type"=>"Dataset", "availability"=>nil, "link"=>"https://doi.org/10.13012/B2IDB-0651259_V2", "uri"=>"10.13012/B2IDB-0651259_V2", "uri_type"=>"DOI", "citation"=>"Mishra, Apratim; Diesner, Jana; Torvik, Vetle I. (2025): Hype - PubMed dataset. University of Illinois Urbana-Champaign. https://doi.org/10.13012/B2IDB-0651259_V2", "dataset_id"=>2917, "selected_type"=>"Dataset", "datacite_list"=>"IsNewVersionOf", "note"=>nil, "feature"=>nil} 2025-03-13T04:35:06Z
Creator create: {"family_name"=>"Torvik", "given_name"=>"Vetle I.", "identifier"=>"0000-0002-0035-1850", "email"=>"vtorvik@illinois.edu", "is_contact"=>false, "row_position"=>3} 2025-03-13T04:35:06Z
Creator create: {"family_name"=>"Diesner", "given_name"=>"Jana", "identifier"=>"0000-0001-8183-7109", "email"=>"jdiesner@illinois.edu", "is_contact"=>false, "row_position"=>2} 2025-03-13T04:35:05Z
Creator create: {"family_name"=>"Mishra", "given_name"=>"Apratim", "identifier"=>"0000-0002-2946-308X", "email"=>"apratim3@illinois.edu", "is_contact"=>true, "row_position"=>1} 2025-03-13T04:35:04Z
Dataset update: {"corresponding_creator_name"=>[nil, "Apratim Mishra"], "corresponding_creator_email"=>[nil, "apratim3@illinois.edu"]} 2025-03-13T04:35:04Z