Dataset Description
This dataset contains sentence-level candidate geographic mentions from PubMed abstracts, keyed by PMID. Each row represents a candidate place name identified within a sentence; only the top-ranked candidate is retained, based on prediction scores from three models: a language-only model (A), a metadata-only model (B), and a combined model (C). Columns provide the original candidate, its mappings to MapAffil and GeoMeSH, and the best disambiguated GeoNames entity. Additional fields include candidates merged across geographic hierarchy levels, alternative candidates appearing in the same sentence, and reference signals from the article's MeSH terms, the MeSH terms of its cited references, and author affiliations. This resource supports research on geographic named entity recognition, disambiguation, and information retrieval in biomedical literature.
Column descriptions:
- PMID: PubMed identifier of the source abstract.
- sentences: Sentence text from the abstract that contains the candidate place name.
- candidate: The extracted candidate geographic mention with the highest probability in the sentence.
- MapAffil_info_clean: Standardized affiliation information mapped by MapAffil, cleaned for consistency. This field may be empty, in which case it is [[]].
- og_mapaffil_res_best_geoname: Best-matched GeoNames entity corresponding to the candidate. If MapAffil_info_clean contains information, the resolution is based on the affiliation; if MapAffil_info_clean is empty, the entity is derived from candidate disambiguation using consistency with affiliation, GeoMeSH, and cited GeoMeSH. This field may be empty, in which case it is [].
- probA: Prediction probability from Model A (language-only features).
- probB9505: Prediction probability from Model B (metadata-only features, threshold 95/05).
- probC: Prediction probability from Model C (combined language + metadata features).
- merged_candidate: Candidate merged across hierarchical geographic levels if consistency is satisfied (e.g., Chicago, IL, USA consolidated when all three appear in the same sentence). This field may be empty.
- other_potential_candidates: Alternative candidate mentions detected in the same sentence but not selected as top-ranked, or candidates that could not be merged with the primary candidate due to inconsistency. This field may be empty.
- geomesh_list: Geographic MeSH terms linked directly to the article. This field may be empty, in which case it is [].
- unique_tempocity_level_list: Location names from the article’s affiliation, parsed and normalized by tempoMapAffil. This field may be empty, in which case it is [[]].
- cited_geomesh_list: Geographic MeSH terms obtained from the references cited by the source abstract. This field may be empty, in which case it is [].
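
As a rough illustration of how the columns above might be read, the following minimal sketch parses the list-valued fields and filters rows by the combined model's score. It assumes the data is distributed as CSV with list columns serialized as Python-style literals, which the description does not confirm. The sample rows, the helper names (parse_list_field, load_rows), and the 0.9 cutoff are hypothetical and chosen only for demonstration.

```python
import ast
import csv
import io

# Hypothetical two-row sample; PMIDs and values are invented for illustration.
SAMPLE_CSV = '''PMID,candidate,probA,probB9505,probC,geomesh_list,MapAffil_info_clean
1,Chicago,0.91,0.80,0.95,"['Illinois']","[[]]"
2,Paris,0.40,0.55,0.48,"[]","[[]]"
'''


def parse_list_field(raw):
    """Parse a serialized list column, treating '', [] and [[]] as empty."""
    if raw is None or raw.strip() == "":
        return []
    value = ast.literal_eval(raw)
    # A list that holds only empty lists (e.g. [[]]) counts as empty.
    if isinstance(value, list) and all(isinstance(v, list) and not v for v in value):
        return []
    return value


def load_rows(handle, list_columns, prob_columns):
    """Read CSV rows, converting list columns to lists and scores to floats."""
    rows = []
    for row in csv.DictReader(handle):
        for col in list_columns:
            row[col] = parse_list_field(row[col])
        for col in prob_columns:
            row[col] = float(row[col])
        rows.append(row)
    return rows


rows = load_rows(
    io.StringIO(SAMPLE_CSV),
    list_columns=["geomesh_list", "MapAffil_info_clean"],
    prob_columns=["probA", "probB9505", "probC"],
)

# Keep candidates that the combined model (C) scores at or above 0.9
# (an arbitrary cutoff, not one recommended by the dataset).
confident = [r for r in rows if r["probC"] >= 0.9]
```

Normalizing both [] and [[]] to an empty list up front keeps later checks simple, since a plain truthiness test then works for every list-valued column. If the real file uses a different serialization (for example JSON), ast.literal_eval would need to be swapped for the matching parser.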