Illinois Data Bank Dataset Search Results
Results
published:
2025-03-14
Mishra, Apratim; Diesner, Jana; Torvik, Vetle I.
(2025)
Hype - PubMed dataset
Prepared by Apratim Mishra
This dataset captures ‘Hype’ within biomedical abstracts sourced from PubMed. The selection comprises journal articles written in English, published between 1975 and 2019, totaling ~5.2 million articles. The classification relies on the presence of specific candidate ‘hype words’ and their location within the abstract. Each article (PMID) may therefore appear multiple times in the dataset when multiple hype words occur in different abstract sentences.
The candidate hype words are 35 in count: 'major', 'novel', 'central', 'critical', 'essential', 'strongly', 'unique', 'promising', 'markedly', 'excellent', 'crucial', 'robust', 'importantly', 'prominent', 'dramatically', 'favorable', 'vital', 'surprisingly', 'remarkably', 'remarkable', 'definitive', 'pivotal', 'innovative', 'supportive', 'encouraging', 'unprecedented', 'enormous', 'exceptional', 'outstanding', 'noteworthy', 'creative', 'assuring', 'reassuring', 'spectacular', and 'hopeful’.
This is version 3 of the dataset. It adds a new file, WSD_hype.tsv.
File 1: hype_dataset_final.tsv
Primary dataset. It has the following columns:
1. PMID: represents unique article ID in PubMed
2. Year: Year of publication
3. Hype_word: Candidate hype word, such as ‘novel.’
4. Sentence: Sentence in abstract containing the hype word.
5. Hype_percentile: Relative position of the hype word within the abstract.
6. Hype_value: Propensity of hype based on the hype word, the sentence, and the abstract location.
7. Introduction: The ‘I’ component of the hype word based on IMRaD
8. Methods: The ‘M’ component of the hype word based on IMRaD
9. Results: The ‘R’ component of the hype word based on IMRaD
10. Discussion: The ‘D’ component of the hype word based on IMRaD
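As a sketch of how the primary file might be consumed, the snippet below parses a tab-separated fragment with the columns listed above; the sample rows, PMID, and values are made up for illustration and are not from the dataset:

```python
import csv
import io

# Hypothetical two-row sample mirroring the documented columns of
# hype_dataset_final.tsv (values are illustrative, not real data).
sample = (
    "PMID\tYear\tHype_word\tSentence\tHype_percentile\tHype_value\t"
    "Introduction\tMethods\tResults\tDiscussion\n"
    "123456\t2010\tnovel\tWe present a novel assay.\t0.15\t0.62\t1\t0\t0\t0\n"
    "123456\t2010\tpromising\tThe results are promising.\t0.90\t0.81\t0\t0\t0\t1\n"
)

rows = list(csv.DictReader(io.StringIO(sample), delimiter="\t"))

# One PMID can appear multiple times, once per hype-word occurrence.
pmids = {r["PMID"] for r in rows}
print(len(rows), len(pmids))  # 2 rows, 1 unique PMID
```

The same pattern applies to the real file by replacing the in-memory sample with `open("hype_dataset_final.tsv")`.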
File 2: hype_removed_phrases_final.tsv
Secondary dataset with same columns as File 1.
Hype in the primary dataset is based on excluding certain phrases that are rarely hype. The phrases that were removed are included in File 2 and modeled separately. Removed phrases:
1. Major: histocompatibility, component, protein, metabolite, complex, surgery
2. Novel: assay, mutation, antagonist, inhibitor, algorithm, technique, series, method, hybrid
3. Central: catheters, system, design, composite, catheter, pressure, thickness, compartment
4. Critical: compartment, micelle, temperature, incident, solution, ischemia, concentration, thinking, nurses, skills, analysis, review, appraisal, evaluation, values
5. Essential: medium, features, properties, opportunities, oil
6. Unique: model, amino
7. Robust: regression
8. Vital: capacity, signs, organs, status, structures, staining, rates, cells, information
9. Outstanding: questions, issues, question, challenge, problems, problem, remains
10. Remarkable: properties
11. Definitive: radiotherapy, surgery
File 3: WSD_hype.tsv
Includes hype-based disambiguation for candidate words targeted for WSD (Word sense disambiguation)
keywords:
Hype; PubMed; Abstracts; Biomedicine
published:
2020-06-19
This dataset includes data pulled from the World Bank (2009), the World Values Survey wave 6, and Transparency International (2009). The data were used to measure perceptions of expertise from individuals in nations that are recipients of development aid as measured by the World Bank.
keywords:
World Values Survey; World Bank; expertise; development
published:
2023-04-12
Towns, John; Hart, David
(2023)
The XSEDE program manages the database of allocation awards for the portfolio of advanced research computing resources funded by the National Science Foundation (NSF). The database holds data for allocation awards dating to the start of the TeraGrid program in 2004 through the XSEDE operational period, which ended August 31, 2022. The project data include lead researcher and affiliation, title and abstract, field of science, and the start and end dates. Along with the project information, the data set includes resource allocation and usage data for each award associated with the project. The data show the transition of resources over a fifteen-year span along with the evolution of researchers, fields of science, and institutional representation.
Because the XSEDE program has ended, the allocation_award_history file includes all allocations activity initiated via XSEDE processes through August 31, 2022. The Resource Providers and successor program to XSEDE agreed to honor all project allocations made during XSEDE. Thus, allocation awards that extend beyond the end of XSEDE may not reflect all activity that may ultimately be part of the project award. Similarly, allocation usage data only reflects usage reported through August 31, 2022, and may not reflect all activity that may ultimately be conducted by projects that were active beyond XSEDE.
keywords:
allocations; cyberinfrastructure; XSEDE
published:
2018-07-28
Hoang, Linh; Schneider, Jodi
(2018)
This dataset presents a citation analysis and citation context analysis used in Linh Hoang, Frank Scannapieco, Linh Cao, Yingjun Guan, Yi-Yun Cheng, and Jodi Schneider. Evaluating an automatic data extraction tool based on the theory of diffusion of innovation. Under submission. We identified the papers that directly describe or evaluate RobotReviewer from the list of publications on the RobotReviewer website <http://www.robotreviewer.net/publications>, resulting in 6 papers grouped into 5 studies (we collapsed a conference and journal paper with the same title and authors into one study). We found 59 citing papers, combining results from Google Scholar on June 05, 2018 and from Scopus on June 23, 2018. We extracted the citation context around each citation to the RobotReviewer papers and categorized these quotes into emergent themes.
keywords:
RobotReviewer; citation analysis; citation context analysis
published:
2020-03-08
Origin Ventures Academy for Entrepreneurial Leadership, Gies College of Business
(2020)
This dataset inventories the availability of entrepreneurship and small business education, including co-curricular opportunities, in two-year colleges in the United States. The inventory provides a snapshot of activities at more than 1,650 public, not-for-profit, and private for-profit institutions, in 2014.
keywords:
Small business education; entrepreneurship education; Kauffman Entrepreneurship Education Inventory; Ewing Marion Kauffman Foundation; Paul J. Magelli
published:
2018-07-25
Scannapieco, Frank; Hoang, Linh; Schneider, Jodi
(2018)
The PDF describes the process and data used for the heuristic user evaluation described in the related article “<i>Evaluating an automatic data extraction tool based on the theory of diffusion of innovation</i>” by Linh Hoang, Frank Scannapieco, Linh Cao, Yingjun Guan, Yi-Yun Cheng, and Jodi Schneider (under submission).<br />
Frank Scannapieco assessed RobotReviewer data extraction performance on ten articles in 2018-02. The articles are included papers from an update review: Sabharwal A., G.-F.I., Stellrecht E., Scannapieco F.A. <i>Periodontal therapy to prevent the initiation and/or progression of common complex systemic diseases and conditions</i>. An update. Periodontol 2000. In Press. <br/>
The form was created in consultation with Linh Hoang and Jodi Schneider. To do the assessment, Frank Scannapieco entered PDFs for these ten articles into RobotReviewer and then filled in ten evaluation forms, based on the ten RobotReviewer automatic data extraction reports. Linh Hoang analyzed these ten evaluation forms and synthesized Frank Scannapieco’s comments to arrive at the evaluation results for the heuristic user evaluation.
keywords:
RobotReviewer; systematic review automation; data extraction
published:
2022-02-09
Kansara, Yogeshwar; Hoang, Khanh Linh
(2022)
The data file contains a list of articles with PMID information, which were used in a project associated with the manuscript "Evaluation of publication type tagging as a strategy to screen randomized controlled trial articles in preparing systematic reviews".
keywords:
Cochrane reviews; Randomized controlled trials; RCT; Automation; Systematic reviews
published:
2023-01-12
Mischo, William; Schlembach, Mary C.; Cabada, Elisandro
(2023)
This dataset was developed as part of a study that examined the correlational relationships between local journal authorship, local and external citation counts, full-text downloads, link-resolver clicks, and four global journal impact factor indices within an all-disciplines journal collection of 12,200 titles and six subject subsets at the University of Illinois at Urbana-Champaign (UIUC) Library. While earlier investigations of the relationships between usage (downloads) and citation metrics have been inconclusive, this study shows strong correlations in the all-disciplines set and most subject subsets. The normalized Eigenfactor was the only global impact factor index that correlated highly with local journal metrics. Some of the identified disciplinary variances among the six subject subsets may be explained by the journal publication aspirations of UIUC researchers. The correlations between authorship and local citations in the six specific subject subsets closely match national department or program rankings.
All the raw data used in this analysis are provided as relational database tables with multiple columns, which can be opened using MS Access. Descriptions of the variables can be viewed through "Design View" (right-click the selected table and choose "Design View"). The two PDF files provide an overview of the tables included in each MDB file.
In addition, the processing scripts and Pearson correlation code is available at <a href="https://doi.org/10.13012/B2IDB-0931140_V1">https://doi.org/10.13012/B2IDB-0931140_V1</a>.
keywords:
Usage and local citation relationships; publication, citation and usage metrics; publication, citation and usage correlation analysis; Pearson correlation analysis
published:
2023-07-05
Fu, Yuanxi; Hsiao, Tzu-Kun; Joshi, Manasi Ballal; Lischwe Mueller, Natalie
(2023)
The salt controversy is the public health debate about whether a population-level salt reduction is beneficial. This dataset covers 82 publications--14 systematic review reports (SRRs) and 68 primary study reports (PSRs)--addressing the effect of sodium intake on cerebrocardiovascular disease or mortality. These present a snapshot of the status of the salt controversy as of September 2014 according to previous work by epidemiologists: The reports and their opinion classification (for, against, and inconclusive) were from Trinquart et al. (2016) (Trinquart, L., Johns, D. M., & Galea, S. (2016). Why do we think we know what we know? A metaknowledge analysis of the salt controversy. International Journal of Epidemiology, 45(1), 251–260. https://doi.org/10.1093/ije/dyv184 ), which collected 68 PSRs, 14 SRRs, 11 clinical guideline reports, and 176 comments, letters, or narrative reviews. Note that our dataset covers only the 68 PSRs and 14 SRRs from Trinquart et al. 2016, not the other types of publications, and it adds additional information noted below.
This dataset can be used to construct the inclusion network and the co-author network of the 14 SRRs and 68 PSRs. A PSR is "included" in an SRR if it is considered in the SRR's evidence synthesis. Each included PSR is cited in the SRR, but not all references cited in an SRR are included in the evidence synthesis or PSRs. Based on which PSRs are included in which SRRs, we can construct the inclusion network. The inclusion network is a bipartite network with two types of nodes: one type represents SRRs, and the other represents PSRs. In an inclusion network, if an SRR includes a PSR, there is a directed edge from the SRR to the PSR. The attribute file (report_list.csv) includes attributes of the 82 reports, and the edge list file (inclusion_net_edges.csv) contains the edge list of the inclusion network. Notably, 11 PSRs have never been included in any SRR in the dataset. They are unused PSRs. If visualized with the inclusion network, they will appear as isolated nodes.
We used a custom-made workflow (Fu, Y. (2022). Scopus author info tool (1.0.1) [Python]. https://github.com/infoqualitylab/Scopus_author_info_collection ) that uses the Scopus API and manual work to extract and disambiguate authorship information for the 82 reports. The author information file (salt_cont_author.csv) is the product of this workflow and can be used to compute the co-author network of the 82 reports.
We also provide several other files in this dataset. We collected inclusion criteria (the criteria that make a PSR eligible to be included in an SRR) and recorded them in the file systematic_review_inclusion_criteria.csv. We provide a file (potential_inclusion_link.csv) recording whether a given PSR had been published as of the search date of a given SRR, which makes the PSR potentially eligible for inclusion in the SRR. We also provide a bibliography of the 82 publications (supplementary_reference_list.pdf). Lastly, we discovered minor discrepancies between the inclusion relationships identified by Trinquart et al. (2016) and by us. Therefore, we prepared an additional edge list (inclusion_net_edges_trinquart.csv) to preserve the inclusion relationships identified by Trinquart et al. (2016).
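A minimal sketch of building the inclusion network from inclusion_net_edges.csv, using an adjacency map from SRRs to their included PSRs. The column names (`citing_ID`, `cited_ID`) and the IDs in the fragment below are assumptions for illustration; the real IDs and headers come from report_list.csv and inclusion_net_edges.csv:

```python
import csv
import io
from collections import defaultdict

# Hypothetical edge-list fragment: each row is a directed edge
# from an SRR (citing_ID) to a PSR it includes (cited_ID).
edges_csv = "citing_ID,cited_ID\n1,15\n1,16\n2,15\n"

includes = defaultdict(set)  # SRR id -> set of included PSR ids
for row in csv.DictReader(io.StringIO(edges_csv)):
    includes[row["citing_ID"]].add(row["cited_ID"])

# PSRs that never appear as a cited_ID are the "unused" PSRs and
# would show up as isolated nodes when the network is visualized.
print(dict(includes))
```

Reading the real file only requires swapping the in-memory string for `open("inclusion_net_edges.csv")`.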
<b>UPDATES IN THIS VERSION COMPARED TO V2</b> (Fu, Yuanxi; Hsiao, Tzu-Kun; Joshi, Manasi Ballal (2022): The Salt Controversy Systematic Review Reports and Primary Study Reports Network Dataset. University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-6128763_V2)
- We added a new column "pub_date" to report_list.csv
- We corrected mistakes in supplementary_reference_list.pdf for report #28 and report #80. The author of report #28 is not Salisbury D but Khaw, K.-T., & Barrett-Connor, E. Report #80 was mistakenly mixed up with report #81.
keywords:
systematic reviews; evidence synthesis; network analysis; public health; salt controversy;
published:
2019-10-16
Human annotations of randomly selected judged documents from the AP 88-89, Robust 2004, WT10g, and GOV2 TREC collections. Seven annotators were asked to read documents in their entirety and then select up to ten terms they felt best represented the main topic(s) of the document. Terms were chosen from among a set sampled from the document in question and from related documents.
keywords:
TREC; information retrieval; document topicality; document description
published:
2022-02-11
Hoang, Khanh Linh; Schneider, Jodi; Kansara, Yogeshwar
(2022)
The data contain a list of articles given low scores by the RCT Tagger, together with an error analysis of them, which were used in a project associated with the manuscript "Evaluation of publication type tagging as a strategy to screen randomized controlled trial articles in preparing systematic reviews".
The change made in this V3 is that the data are divided into two parts:
- Error Analysis of 44 Low Scoring Articles with MEDLINE RCT Publication Type.
- Error Analysis of 244 Low Scoring Articles without MEDLINE RCT Publication Type.
keywords:
Cochrane reviews; automation; randomized controlled trial; RCT; systematic reviews
published:
2020-03-03
Schneider, Jodi; Ye, Di
(2020)
This second version (V2) provides additional data cleaning compared to V1, additional data collection (mainly to include data from 2019), and more metadata for nodes. Please see NETWORKv2README.txt for more detail.
keywords:
citations; retraction; network analysis; Web of Science; Google Scholar; indirect citation
published:
2020-04-22
Endres, A. Bryan; Endres, Renata; Krstinić Nižić, Marinela
(2020)
Data on Croatian restaurant allergen disclosures on restaurant websites, on-line menus and social media comments
keywords:
restaurant; allergen; disclosure; tourism
published:
2024-12-05
Salami, Malik Oyewale; McCumber, Corinne
(2024)
This project investigates retraction indexing agreement among data sources: BCI, BIOABS, CCC, Compendex, Crossref, GEOBASE, MEDLINE, PubMed, Retraction Watch, Scopus, and Web of Science Core. Post-retraction citation may be partly due to authors’ and publishers' challenges in systematically identifying retracted publications. To assess retraction indexing quality, we measure the agreement in indexing retracted publications across the 11 database sources, restricted to their coverage, resulting in a union list of 85,392 unique items. We also discuss common errors in indexing retracted publications. Our results reveal low retraction indexing agreement scores, indicating that databases widely disagree on indexing the retracted publications they cover, leading to a lack of consistency in which publications are identified as retracted. Our findings highlight the need for clear and standard practices in the curation and management of retracted publications.
Pipeline code to produce the result files can be found in the ‘src’ directory of the GitHub repository
https://github.com/infoqualitylab/retraction-indexing-agreement, which contains iPython notebooks.
The ‘unionlist_completed-ria_2024-07-09.csv’ file has been redacted to remove proprietary data, as noted below in README.txt. Among our sources, data is openly available only for Crossref, PubMed, and Retraction Watch.
FILE FORMATS:
1) unionlist_completed-ria_2024-07-09.csv - UTF-8 CSV file
2) README.txt - text file
keywords:
retraction status; data quality; indexing; retraction indexing; metadata; meta-science; RISRS
published:
2019-07-08
Kehoe, Adam K.; Torvik, Vetle I.
(2019)
# Overview
These datasets were created in conjunction with the dissertation "Predicting Controlled Vocabulary Based on Text and Citations: Case Studies in Medical Subject Headings in MEDLINE and Patents," by Adam Kehoe.
The datasets consist of the following:
* twin_not_abstract_matched_complete.tsv: a tab-delimited file consisting of pairs of MEDLINE articles with identical titles, authors and years of publication. This file contains the PMIDs of the duplicate publications, as well as their medical subject headings (MeSH) and three measures of their indexing consistency.
* twin_abstract_matched_complete.tsv: the same as above, except that the MEDLINE articles also have matching abstracts.
* mesh_training_data.csv: a comma-separated file containing the training data for the model discussed in the dissertation.
* mesh_scores.tsv: a tab-delimited file containing a pairwise similarity score based on word embeddings, and MeSH hierarchy relationship.
## Duplicate MEDLINE Publications
Both the twin_not_abstract_matched_complete.tsv and twin_abstract_matched_complete.tsv have the same structure. They have the following columns:
1. pmid_one: the PubMed unique identifier of the first paper
2. pmid_two: the PubMed unique identifier of the second paper
3. mesh_one: A list of medical subject headings (MeSH) from the first paper, delimited by the "|" character
4. mesh_two: a list of medical subject headings from the second paper, delimited by the "|" character
5. hoopers_consistency: The calculation of Hooper's consistency between the MeSH of the first and second paper
6. nonhierarchicalfree: a word embedding based consistency score described in the dissertation
7. hierarchicalfree: a word embedding based consistency score additionally limited by the MeSH hierarchy, described in the dissertation.
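For illustration, the Hooper's consistency in column 5 can be computed from the pipe-delimited MeSH lists in columns 3 and 4. This is a sketch of the standard measure C / (A + B − C), where A and B are the numbers of terms assigned to each paper and C is the number of shared terms; the MeSH terms below are made up:

```python
def hoopers_consistency(mesh_one: str, mesh_two: str) -> float:
    """Hooper's consistency C / (A + B - C) between two
    pipe-delimited lists of MeSH headings."""
    a = set(mesh_one.split("|"))
    b = set(mesh_two.split("|"))
    c = len(a & b)
    return c / (len(a) + len(b) - c)

# Two of four distinct terms are shared: 2 / (3 + 3 - 2) = 0.5
print(hoopers_consistency("Humans|Mice|Liver", "Humans|Liver|Kidney"))  # 0.5
```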
## MeSH Training Data
The mesh_training_data.csv file contains the training data for the model discussed in the dissertation. It has the following columns:
1. pmid: the PubMed unique identifier of the paper
2. term: a candidate MeSH term
3. cit_count: the log of the frequency of the term in the citation candidate set
4. total_cit: the log of the total number of the paper's citations
5. citr_count: the log of the frequency of the term in the citations of the paper's citations
6. total_citofcit: the log of the total number of the citations of the paper's citations
7. absim_count: the log of the frequency of the term in the AbSim candidate set
8. total_absim_count: the log of the total number of AbSim records for the paper
9. absimr_count: the log of the frequency of the term in the citations of the AbSim records
10. total_absimr_count: the log of the total number of citations of the AbSim record
11. log_medline_frequency: the log of the frequency of the candidate term in MEDLINE.
12. relevance: a binary indicator (True/False) if the candidate term was assigned to the target paper
## Cosine Similarity
The mesh_scores.tsv file contains a pairwise list of all MeSH terms including their cosine similarity based on the word embedding described in the dissertation. Because the MeSH hierarchy is also used in many of the evaluation measures, the relationship of the term pair is also included. It has the following columns:
1. mesh_one: a string of the first MeSH heading.
2. mesh_two: a string of the second MeSH heading.
3. cosine_similarity: the cosine similarity between the terms
4. relationship_type: a string identifying the relationship type, consisting of none, parent/child, sibling, ancestor and direct (terms are identical, i.e. a direct hierarchy match).
The mesh_model.bin file contains a binary word2vec C format file containing the MeSH term embeddings. It was generated using version 3.7.2 of the Python gensim library (https://radimrehurek.com/gensim/).
For an example of how to load the model file, see https://radimrehurek.com/gensim/models/word2vec.html#usage-examples, specifically the directions for loading the "word2vec C format."
keywords:
MEDLINE;MeSH;Medical Subject Headings;Indexing
published:
2019-11-12
Rezapour, Rezvaneh
(2019)
We are sharing the tweet IDs of four social movements: #BlackLivesMatter, #WhiteLivesMatter, #AllLivesMatter, and #BlueLivesMatter. The tweets were collected between May 1, 2015 and May 30, 2017. We limited the location to the United States and focused on extracting original tweets, excluding retweets.
Recommended citations for the data:
Rezapour, R. (2019). Data for: How do Moral Values Differ in Tweets on Social Movements?. University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-9614170_V1
and
Rezapour, R., Ferronato, P., and Diesner, J. (2019). How do moral values differ in tweets on social movements?. In 2019 Computer Supported Cooperative Work and Social Computing Companion Publication (CSCW’19 Companion), Austin, TX.
keywords:
Twitter; social movements; black lives matter; blue lives matter; all lives matter; white lives matter
published:
2025-04-04
Fang, Liri; Salami, Malik Oyewale; Weber, Griffin M.; Torvik, Vetle I.
(2025)
This dataset, uCite, is the union of nine large-scale open-access PubMed citation datasets, separated by reliability. There are 20 files, including the reliable and unreliable citation PMID pairs, mappings from non-PMID identifiers to PMIDs (for DOIs, Lens, MAG, and Semantic Scholar), the original PMID pairs from the nine resources, some metadata for PMIDs, duplicate PMIDs, some redirected PMID pairs, and PMC OA Patci citation matching results.
The short description of each data file is listed as follows. A detailed description can be found in the README.txt.
<strong>DATASET DESCRIPTION</strong>
<ol>
<li>PPUB.tsv.gz - tsv format file containing reliable citation pairs in uCite.</li>
<li>PUNR.tsv.gz - tsv format file containing unreliable citation pairs in uCite.</li>
<li>DOI2PMID.tsv.gz - tsv format file containing results mapping DOI to PMID. </li>
<li> LEN2PMID.tsv.gz - tsv format file containing results mapping LensID pairs to PMID pairs. </li>
<li> MAG2PMIDsorted.tsv.gz - tsv format file containing results mapping MAG ID to PMID. </li>
<li>SEM2PMID.tsv.gz - tsv format file containing results mapping Semantic Scholar ID to PMID. </li>
<li>JVNPYA.tsv.gz - tsv format file containing metadata of papers with PMID, journal name, volume, issue, pages, publication year, and first author's last name. </li>
<li>TiLTyAlJVNY.tsv.gz - tsv format file containing metadata of papers. </li>
<li> PMC-OA-patci.tsv.gz - tsv format file containing PubMed Central Open Access subset reference strings extracted by \cite{} processed by Patci.</li>
<li>REDIRECTS.gz - txt file containing unreliable PMID pairs mapped to reliable PMID pairs. </li>
<li>REMAP - file containing pairs of duplicate PubMed records (lhs PMID mapped to rhs PMID).</li>
<li> ami_pair.tsv.gz - tsv format file containing all citation pairs from Aminer (2015 version). </li>
<li> dim_pair.tsv.gz - tsv format file containing all citation pairs from Dimensions. </li>
<li> ice_pair.tsv.gz - tsv format file containing all citation pairs from iCite (April 2019 version, version 1). </li>
<li> len_pair.tsv.gz - tsv format file containing all citation pairs from Lens.org (harvested through Oct 2021). </li>
<li>mag_pair.tsv.gz - tsv format file containing all citation pairs from Microsoft Academic Graph (2015 version). </li>
<li> oci_pair.tsv.gz - tsv format file containing all citation pairs from Open Citations (Nov. 2021 dump, csv version ). </li>
<li> pat_pair.tsv.gz - tsv format file containing all citation pairs from Patci (i.e., from "PMC-OA-patci.tsv.gz"). </li>
<li> pmc_pair.tsv.gz - tsv format file containing all citation pairs from PubMed Central (harvest through Dec 2018 via e-Utilities).</li>
<li> sem_pair.tsv.gz - tsv format file containing all citation pairs from Semantic Scholar (2019 version) . </li>
</ol>
<strong>COLUMN DESCRIPTION</strong>
<strong>FILENAME</strong> : <em>PPUB.tsv.gz, PUNR.tsv.gz</em>
(1) fromPMID - PubMed ID of the citing paper.
(2) toPMID - PubMed ID of the cited paper.
(3) sources - citation sources, in which the citation pairs are identified.
(4) fromYEAR - Publication year of the citing paper.
(5) toYEAR - Publication year of the cited paper.
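A sketch of streaming these gzipped TSV files with the Python standard library, using the five columns described above; the gzipped sample is built in memory here, and the PMIDs, sources, and years are made up for illustration:

```python
import csv
import gzip
import io

# Illustrative one-row sample with the documented PPUB/PUNR columns.
sample = (
    "fromPMID\ttoPMID\tsources\tfromYEAR\ttoYEAR\n"
    "111\t222\tpmc;oci\t2005\t1999\n"
)

# Write the sample through gzip into an in-memory buffer,
# standing in for PPUB.tsv.gz on disk.
buf = io.BytesIO()
with gzip.open(buf, "wt") as f:
    f.write(sample)
buf.seek(0)

# Stream citing/cited PMID pairs without decompressing to disk.
with gzip.open(buf, "rt") as f:
    pairs = [(r["fromPMID"], r["toPMID"])
             for r in csv.DictReader(f, delimiter="\t")]
print(pairs)  # [('111', '222')]
```

For the real file, pass the path directly: `gzip.open("PPUB.tsv.gz", "rt")`.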
<strong>FILENAME</strong> : <em>DOI2PMID.tsv.gz</em>
(1) DOI - Digital Object Identifier of paper records.
(2) PMID - PubMed ID of paper records.
(3) PMID2 - Digital Object Identifier of paper records, “-” if the paper doesn't have DOIs.
<strong>FILENAME</strong> : <em>SEM2PMID.tsv.gz</em>
(1) SemID - Semantic Scholar ID of paper records.
(2) PMID - PubMed ID of paper records.
(3) DOI - Digital Object Identifier of paper records, “-” if the paper doesn't have DOIs.
<strong>FILENAME</strong> : <em>JVNPYA.tsv.gz</em>
- Each row refers to a publication record.
(1) PMID - PubMed ID.
(2) journal - Journal name.
(3) volume - Journal volume.
(4) issue - Journal issue.
(5) pages - The first page and last page (without leading digits) number of the publication separated by '-'.
(6) year - Publication year.
(7) lastname - Last name of the first author.
<strong>FILENAME</strong> : <em>TiLTyAlJVNY.tsv.gz</em>
(1) PMID - PubMed ID.
(2) title_tokenized - Paper title after tokenization.
(3) languages - Language that paper is written in.
(4) pub_types - Types of the publication.
(5) length(authors) - String length of author names.
(6) journal - Journal name.
(7) volume - Journal volume.
(8) issue - Journal issue.
(9) year - Publication year of print (not necessarily epub).
<strong>FILENAME</strong> : <em> PMC-OA-patci.tsv.gz</em>
(1) pmcid - PubMed Central identifier.
(2) pos -
(3) fromPMID - PubMed ID of the citing paper.
(4) toPMID - PubMed ID of the cited paper.
(5) SRC - citation sources, in which the citation pairs are identified.
(6) MatchDB - PubMed, ADS, DBLP.
(7) Probability - Matching probability predicted by Patci.
(8) toPMID2 - PubMed ID of the cited paper, extracted from OA xml file
(9) SRC2 - citation sources, in which the citation pairs are identified.
(10) intxt_id -
(11) journal - First character of the journal name.
(12) same_ref_string - Y if the Patci and XML reference strings match, otherwise N.
(13) DIFF -
(14) bestSRC - Citation sources, in which the citation pairs are identified.
(15) Match - Matching strings annotated by Patci.
<strong>FILENAME</strong> : <em>REDIRECTS.gz</em>
Each row in REDIRECTS is a string in one of the following formats:
- "REDIRECTED FROM: source PMID_i PMID_j -> PMID_i' PMID_j "
- "REDIRECTED TO: source PMID_i PMID_j -> PMID_i PMID_j' "
Note: source is the name(s) of the source(s) from which PMID_i and PMID_j come.
<strong>FILENAME</strong> : <em>REMAP</em>
Each row remaps an unreliable (duplicate) PMID to its reliable PMID.
The format of each row is "$REMAP{PMID_i} = PMID_j".
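A sketch of parsing a REMAP row of the documented form "$REMAP{PMID_i} = PMID_j"; the regex (including its tolerance for spacing around "=") and the example PMIDs are assumptions for illustration:

```python
import re

# Made-up example row in the documented REMAP format.
line = "$REMAP{12345678} = 23456789"

# Capture the duplicate (lhs) and canonical (rhs) PMIDs.
m = re.match(r"\$REMAP\{(\d+)\}\s*=\s*(\d+)", line)
old_pmid, new_pmid = m.group(1), m.group(2)
print(old_pmid, new_pmid)  # 12345678 23456789
```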
<strong>FILENAME</strong> : <em>ami_pair.tsv.gz, dim_pair.tsv.gz, ice_pair.tsv.gz, len_pair.tsv.gz, mag_pair.tsv.gz, oci_pair.tsv.gz, pat_pair.tsv.gz,pmc_pair.tsv.gz, sem_pair.tsv.gz</em>
(1) fromPMID - PubMed ID of the citing paper.
(2) toPMID - PubMed ID of the cited paper.
keywords:
Citation data; PubMed; Social Science;
published:
2020-05-15
Mishra, Shubhanshu
(2020)
Trained models for multi-task multi-dataset learning for sequence prediction in tweets
Tasks include POS, NER, Chunking, and SuperSenseTagging
Models were trained using: https://github.com/napsternxg/SocialMediaIE/blob/master/experiments/multitask_multidataset_experiment.py
See https://github.com/napsternxg/SocialMediaIE for details.
keywords:
twitter; deep learning; machine learning; trained models; multi-task learning; multi-dataset learning;
published:
2016-08-02
Jin, Qiang; Hahn, James; Croll, Gretchen
(2016)
These data are the result of a multi-step process aimed at enriching BIBFRAME RDF with linked data. The process takes in an initial MARC XML file, transforms it to BIBFRAME RDF/XML, and then four separate Python files corresponding to the BIBFRAME 1.0 model (Work, Instance, Annotation, and Authority) are run over the BIBFRAME RDF/XML output. The input and outputs of each step are included in this data set. Input file types include the CSV, MARC XML, and Master RDF/XML files. The CSV files contain bibliographic identifiers for e-books. From the CSVs, a set of MARC XML files is generated. The MARC XML files are used to produce the Master RDF file set. The major outputs of the enrichment code are BIBFRAME linked data as Annotation RDF, Instance RDF, Work RDF, and Authority RDF.
keywords:
BIBFRAME; Schema.org; linked data; discovery; MARC; MARCXML; RDF
published:
2018-12-20
Dong, Xiaoru; Xie, Jingyi; Hoang, Linh
(2018)
File Name: AllWords.csv
Data Preparation: Xiaoru Dong, Linh Hoang
Date of Preparation: 2018-12-12
Data Contributions: Jingyi Xie, Xiaoru Dong, Linh Hoang
Data Source: Cochrane systematic reviews published up to January 3, 2018 by 52 different Cochrane groups in 8 Cochrane group networks.
Associated Manuscript authors: Xiaoru Dong, Jingyi Xie, Linh Hoang, and Jodi Schneider.
Associated Manuscript, Working title: Machine classification of inclusion criteria from Cochrane systematic reviews.
Description: The file contains lists of all words (all features) from the bag-of-words feature extraction.
Notes: To reproduce the data in this file, please get the code of the project published on GitHub at https://github.com/XiaoruDong/InclusionCriteria and run the code following the instructions provided.
keywords:
Inclusion criteria; Randomized controlled trials; Machine learning; Systematic reviews
published:
2019-01-07
Carlstone, Jamie; Kenfield, Ayla Stein; Norman, Michael; Wilkin, John
(2019)
Vendor transcription of the Catalogue of Copyright Entries, Part 1, Group 1, Books: New Series, Volume 29 for the Year 1932. This file contains all of the entries from the indicated volume.
keywords:
copyright; Catalogue of Copyright Entries; Copyright Office
published:
2018-06-02
Palmer, Ryan; Albarracin, Dolores
(2018)
keywords:
conspiracy theory; trust in science
published:
2018-07-13
Hensley, Merinda Kaye; Johnson, Heidi R.
(2018)
Qualitative data collected from the websites of undergraduate research journals between October 2014 and May 2015, in two CSV files. The first file, "Sample", includes the sample of journals for which secondary data were collected. The second file, "Population", includes the remainder of the population, for which secondary data were not collected. Note: the totals do not add up to 800 as indicated in the article; rows were deleted for journals that had broken links or defunct websites during the random sampling process.
keywords:
undergraduate research; undergraduate journals; scholarly communication; libraries; liaison librarianship
published:
2018-12-20
Dong, Xiaoru; Xie, Jingyi; Hoang, Linh; Schneider, Jodi
(2018)
File Name: Error_Analysis.xlsx
Data Preparation: Xiaoru Dong
Date of Preparation: 2018-12-12
Data Contributions: Xiaoru Dong, Linh Hoang, Jingyi Xie, Jodi Schneider
Data Source: The classification prediction results on the testing data set
Associated Manuscript authors: Xiaoru Dong, Jingyi Xie, Linh Hoang, and Jodi Schneider
Associated Manuscript, Working title: Machine classification of inclusion criteria from Cochrane systematic reviews
Description: The file lists the incorrect and correct predictions of inclusion criteria of Cochrane Systematic Reviews from the testing data set, along with the length (number of words) of each inclusion criterion.
Notes: To reproduce the data relevant to this file, please get the code of the project published on GitHub at https://github.com/XiaoruDong/InclusionCriteria and run the code following the instructions provided.
keywords:
Inclusion criteria; Randomized controlled trials; Machine learning; Systematic reviews
published:
2016-08-18
Copyright Review Management System renewals by year, data from Table 2 of the article "How Large is the ‘Public Domain’? A comparative Analysis of Ringer’s 1961 Copyright Renewal Study and HathiTrust CRMS Data."
keywords:
copyright; copyright renewals; HathiTrust