Illinois Data Bank Dataset Search Results
published:
2019-02-22
Fernández, Roberto; Parker, Gary; Stark, Colin
(2019)
This dataset includes measurements taken during the experiments on patterns of alluvial cover over bedrock. The dataset includes an hour's worth of time-lapse images taken every 10 s for eight different experimental conditions. It also includes the instantaneous water surface elevations measured with eTapes at a frequency of 10 Hz for each experiment. The 'Read me Data.txt' file explains the contents of the dataset in more detail.
keywords:
bedrock; erosion; alluvial; meandering; alluvial cover; sinuosity; flume; experiments; abrasion;
published:
2020-11-18
Gardner, Allison; Allan, Brian
(2020)
These data obtained from the peer-reviewed literature and a public database depict the geographic expansion of the black-legged tick (Ixodes scapularis) and human cases of Lyme disease in the midwestern U.S.
<b><i>Note</i></b>: There was an omission from the first version (V1) of the data set that required us to update the data. Specifically, we failed to include the data from the article "Caporale DA, Johnson CM, Millard BJ. 2005 Presence of Borrelia burgdorferi (Spirochaetales: Spirochaetaceae) in Southern Kettle Moraine State Forest, Wisconsin, and characterization of strain W97F51. J. Med. Entomol. 42, 457–472". In the second version (V2) of the data, this omission is corrected.
keywords:
Lyme disease; Borrelia burgdorferi; Ixodes scapularis; black-legged tick
published:
2022-11-28
Avrin, Alexandra; Pekins, Charles; Wilmers, Christopher; Sperry, Jinelle; Allen, Maximilian
(2022)
Detection data of carnivores and their prey species from camera traps in Fort Hood, Texas and Santa Cruz, California, USA. Non-carnivore and non-prey species (humans, domestic species, avian species, etc.) were excluded from this dataset. All detections of each species at a camera within 30 minutes have been combined to 1 detection (only first detection within that 30 minutes kept) to avoid pseudoreplication.
Variable Description:
Site= Study area where data were collected
MonitoringPeriod= year in which data were collected (data were collected at each location over multiple monitoring periods)
CameraName= Unique name for each camera location
Date= calendar date of detection
Time= time of detection
-Fort Hood= Central Time USA
-Santa Cruz= Pacific Time USA
Species= Common name of species detected
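The 30-minute thinning rule described above can be sketched as follows. This is a minimal illustration, not the authors' processing code: the function name, the sample camera and species values, and the choice to anchor each window on the last kept detection (with exactly 30 minutes counting as a new detection) are assumptions.

```python
from datetime import datetime, timedelta

# Assumed independence window, taken from the dataset description.
WINDOW = timedelta(minutes=30)

def thin_detections(rows):
    """Keep only the first detection per (camera, species) in each window.

    rows: iterable of (camera_name, species, timestamp) tuples.
    """
    last_kept = {}  # (camera, species) -> timestamp of last kept detection
    kept = []
    for camera, species, ts in sorted(rows, key=lambda r: r[2]):
        key = (camera, species)
        if key not in last_kept or ts - last_kept[key] >= WINDOW:
            kept.append((camera, species, ts))
            last_kept[key] = ts
    return kept

# Hypothetical detections at one camera.
sample = [
    ("CAM01", "coyote", datetime(2019, 5, 1, 6, 0)),
    ("CAM01", "coyote", datetime(2019, 5, 1, 6, 10)),  # within 30 min: dropped
    ("CAM01", "coyote", datetime(2019, 5, 1, 6, 45)),
    ("CAM01", "bobcat", datetime(2019, 5, 1, 6, 5)),   # other species: kept
]
thinned = thin_detections(sample)
```

Because Time is recorded in local time for each site, timestamps from Fort Hood and Santa Cruz would need a time zone attached before being compared with each other.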
keywords:
carnivore; community ecology; competition; interspecific interactions; keystone species; mesopredator; predation; trophic cascade
published:
2023-04-02
Lee, Yuanyao; Khanna, Madhu; Chen, Luoye
(2023)
The use of cellulosic biofuels from non-feedstocks is modeled using the Biofuel and Environmental Policy Analysis Model (BEPAM) to quantify the uncertainties about induced land use change effects, net greenhouse gas saving potential, and economic costs. The code is written in GAMS (General Algebraic Modeling System).
NOTE: Column 3 is titled "BAU" in "merged_BAU.gdx", "merged_RFS.gdx", and "merged_CEM.gdx", but contains "RFS" data in "merged_RFS.gdx" and "CEM" data in "merged_CEM.gdx".
keywords:
cellulosic biomass; BEPAM; economic modeling
published:
2024-08-02
Morrow Plots Data Curation Working Group
(2024)
The Morrow Plots at the University of Illinois at Urbana-Champaign are the longest-running continuous experimental plots in the Americas. In continuous operation since 1876, the plots were established to explore the impact of crop rotation and soil treatment on corn crop yields. In 2018, the Morrow Plots Data Curation Working Group began to identify, collect, and curate the various data records created over the history of the experiment. The resulting data table published here includes planting, treatment, and yield data for the Morrow Plots since 1888. Please see the included codebook for a detailed explanation of the data sources and their content. This dataset will be updated as new yield data become available.
*NOTE: While digitized and accessed through IDEALS, the physical copy of the field notebook: <a href="https://archon.library.illinois.edu/archives/index.php?p=collections/controlcard&id=11846">Morrow Plots Notebook, 1876-1913, 1967</a> is also held at the University of Illinois Archives.
keywords:
Corn; Crop Science; Experimental Fields; Crop Yields; Agriculture; Illinois; Morrow Plots
published:
2025-11-20
Ahmed, Md Wadud; Esquerre, Carlos A.; Eilts, Kristen; Allen, Dylan P.; McCoy, Scott M.; Varela, Sebastian; Singh, Vijay; Leakey, Andrew; Kamruzzaman, Mohammad
(2025)
NIR spectroscopy is a rapid and accurate green technology for high-throughput biomass characterization, including sorghum (Sorghum bicolor), a promising energy crop for the biofuel industry. This study assessed the influence of particle size on NIR spectroscopic analysis (wavelength range: 867–2535 nm) of sorghum biomass composition. Grown under field conditions, a total of 113 types of genetically diverse sorghum accessions were dried, ground, and sieved (<250, 250–600, 600–850, and >850 µm particle size) for developing partial least squares regression (PLSR) prediction models for moisture, ash, extractive, glucan, xylan, acid-soluble lignin (ASL), acid-insoluble lignin (AIL), and total lignin (ASL + AIL). Overall, smaller particle sizes provided better model performance, while no single particle size provided the best performance for all the selected components. With only 9 selected bands and 4 latent variables (LVs), the best PLSR model was obtained for moisture with a particle size of 600–850 µm, with the square root of the coefficient of determination (R) of 0.85, the ratio of prediction to deviation (RPD) of 2.2, and the root mean square error (RMSE) of 0.46% in external validation. Similar model performances were also obtained for ash, extractive, glucan, and xylan. This study showed that size reduction could effectively improve NIR spectroscopic analysis for lipid-producing sorghum biomass for the biofuel industry.
keywords:
Conversion; Feedstock Production; Biomass Analytics; Modeling; Sorghum
published:
2016-12-20
Wickes, Elizabeth; Nakamura, Katia
(2016)
Scripts and example data for AIDData (aiddata.org) processing in support of forthcoming Nakamura dissertation.
This dataset includes two sets of scripts and example data files from an aiddata.org data dump. Fuller documentation about the functionality for these scripts is within the readme file. Additional background information and description of usage will be in the forthcoming Nakamura dissertation (link will be added when available). Data originally supplied by Nakamura. Python code and this readme file created by Wickes. Data included within this deposit are examples to demonstrate execution.
Roughly, there are two Python scripts in here: keyword_search.py, designed to assist in finding records matching specific keywords, and matching_tool.ipynb, designed to assist in detecting which records are and are not contained within a keyword results file and an aiddata project data file.
keywords:
aiddata; natural resources
published:
2020-07-16
Mishra, Shubhanshu
(2020)
Dataset to be used for the SocialMediaIE tutorial
keywords:
social media; deep learning; natural language processing
published:
2020-08-22
Qiu, Haoran; Banerjee, Subho S.; Jha, Saurabh; Kalbarczyk, Zbigniew T.; Iyer, Ravishankar K.
(2020)
We are releasing the tracing dataset of four microservice benchmarks deployed on our dedicated Kubernetes cluster consisting of 15 heterogeneous nodes. The dataset is not sampled and is from selected types of requests in each benchmark, i.e., compose-posts in the social network application, compose-reviews in the media service application, book-rooms in the hotel reservation application, and reserve-tickets in the train ticket booking application.
The four microservice applications come from [DeathStarBench](https://github.com/delimitrou/DeathStarBench) and [Train-Ticket](https://github.com/FudanSELab/train-ticket). The performance anomaly injector is from [FIRM](https://gitlab.engr.illinois.edu/DEPEND/firm.git).
The dataset was preprocessed from the raw data generated in FIRM's tracing system. The dataset is split by the microservice component on which the performance anomaly is located (as the file names suggest). Each dataset is in CSV format and fields are separated by commas. Each line consists of the tracing ID and the duration (in 10^(-3) ms) of each component. Execution paths are specified in `execution_paths.txt` in each directory.
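As a rough sketch of how one of these CSV files might be read, the snippet below converts the stored durations (in 10^(-3) ms) to milliseconds. The column names, the sample values, and the presence of a header row are assumptions made for illustration; the description does not list them.

```python
import csv
import io

# Hypothetical sample; real component names and values will differ.
SAMPLE = """trace_id,nginx,compose_post,post_storage
a1,1200,5300,2100
b2,900,48000,1800
"""

def read_durations_ms(handle):
    """Yield (trace_id, {component: duration in ms}) for each trace."""
    reader = csv.DictReader(handle)
    id_field = reader.fieldnames[0]  # assumed: first column is the tracing ID
    for row in reader:
        durations = {
            name: int(value) / 1000.0  # 10^(-3) ms -> ms
            for name, value in row.items()
            if name != id_field
        }
        yield row[id_field], durations

traces = dict(read_durations_ms(io.StringIO(SAMPLE)))
```

With a real file, `io.StringIO(SAMPLE)` would be replaced by an open file handle, for example `open(path, newline="")`.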
keywords:
Microservices; Tracing; Performance
published:
2021-07-22
Hsiao, Tzu-Kun; Schneider, Jodi
(2021)
This dataset includes five files. Descriptions of the files are given as follows:
<b>FILENAME: PubMed_retracted_publication_full_v3.tsv</b>
- Bibliographic data of retracted papers indexed in PubMed (retrieved on August 20, 2020, searched with the query "retracted publication" [PT] ).
- Except for the information in the "cited_by" column, all the data is from PubMed.
- PMIDs in the "cited_by" column that meet either of the two conditions below have been excluded from analyses:
[1] PMIDs of the citing papers are from retraction notices (i.e., those in the “retraction_notice_PMID.csv” file).
[2] Citing paper and the cited retracted paper have the same PMID.
ROW EXPLANATIONS
- Each row is a retracted paper. There are 7,813 retracted papers.
COLUMN HEADER EXPLANATIONS
1) PMID - PubMed ID
2) Title - Paper title
3) Authors - Author names
4) Citation - Bibliographic information of the paper
5) First Author - First author's name
6) Journal/Book - Publication name
7) Publication Year
8) Create Date - The date the record was added to the PubMed database
9) PMCID - PubMed Central ID (if applicable, otherwise blank)
10) NIHMS ID - NIH Manuscript Submission ID (if applicable, otherwise blank)
11) DOI - Digital object identifier (if applicable, otherwise blank)
12) retracted_in - Information of retraction notice (given by PubMed)
13) retracted_yr - Retraction year identified from "retracted_in" (if applicable, otherwise blank)
14) cited_by - PMIDs of the citing papers. (if applicable, otherwise blank) Data collected from iCite.
15) retraction_notice_pmid - PMID of the retraction notice (if applicable, otherwise blank)
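The two exclusion conditions listed for the "cited_by" column can be sketched as a small filter. This is an illustration only: the function name is hypothetical, and the separator used between PMIDs inside "cited_by" is an assumption (a space here) that should be checked against the file.

```python
def filter_citing_pmids(cited_pmid, cited_by, notice_pmids, sep=" "):
    """Return citing PMIDs that survive the two exclusion conditions.

    cited_pmid:   PMID of the retracted paper (column 1).
    cited_by:     raw string from the "cited_by" column (may be blank).
    notice_pmids: set of PMIDs from retraction_notice_PMID.csv.
    sep:          assumed separator between PMIDs in "cited_by".
    """
    citing = [p for p in cited_by.split(sep) if p]
    return [
        p for p in citing
        if p not in notice_pmids  # condition [1]: citing paper is a retraction notice
        and p != cited_pmid       # condition [2]: citing and cited PMID are the same
    ]

# Hypothetical values for illustration.
notice_pmids = {"222"}
kept = filter_citing_pmids("100", "100 222 333", notice_pmids)
```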
<b>FILENAME: PubMed_retracted_publication_CitCntxt_withYR_v3.tsv</b>
- This file contains citation contexts (i.e., citing sentences) where the retracted papers were cited. The citation contexts were identified from the XML version of PubMed Central open access (PMCOA) articles.
- This is part of the data from: Hsiao, T.-K., & Torvik, V. I. (manuscript in preparation). Citation contexts identified from PubMed Central open access articles: A resource for text mining and citation analysis.
- Citation contexts that meet either of the two conditions below have been excluded from analyses:
[1] PMIDs of the citing papers are from retraction notices (i.e., those in the “retraction_notice_PMID.csv” file).
[2] Citing paper and the cited retracted paper have the same PMID.
ROW EXPLANATIONS
- Each row is a citation context associated with one retracted paper that's cited.
- In the manuscript, we count each citation context once, even if it cites multiple retracted papers.
COLUMN HEADER EXPLANATIONS
1) pmcid - PubMed Central ID of the citing paper
2) pmid - PubMed ID of the citing paper
3) year - Publication year of the citing paper
4) location - Location of the citation context (abstract = abstract, body = main text, back = supporting material, tbl_fig_caption = tables and table/figure captions)
5) IMRaD - IMRaD section of the citation context (I = Introduction, M = Methods, R = Results, D = Discussions/Conclusion, NoIMRaD = not identified)
6) sentence_id - The ID of the citation context in a given location. For location information, please see column 4. The first sentence in the location gets the ID 1, and subsequent sentences are numbered consecutively.
7) total_sentences - Total number of sentences in a given location
8) intxt_id - Identifier of a cited paper. Here, a cited paper is the retracted paper.
9) intxt_pmid - PubMed ID of a cited paper. Here, a cited paper is the retracted paper.
10) citation - The citation context
11) progression - Position of a citation context by centile within the citing paper.
12) retracted_yr - Retraction year of the retracted paper
13) post_retraction - 0 = not post-retraction citation; 1 = post-retraction citation. A post-retraction citation is a citation made after the calendar year of retraction.
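The post_retraction rule in column 13 compares calendar years only, so a citation made in the retraction year itself is not counted as post-retraction. A minimal sketch, with a hypothetical function name:

```python
def post_retraction_flag(citing_year, retracted_year):
    """Return 1 if the citation was made after the calendar year of retraction, else 0."""
    return 1 if citing_year > retracted_year else 0

same_year = post_retraction_flag(2014, 2014)   # same calendar year
later_year = post_retraction_flag(2015, 2014)  # year after retraction
```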
<b>FILENAME: 724_knowingly_post_retraction_cit.csv</b> (updated)
- The 724 post-retraction citation contexts that we determined knowingly cited the 7,813 retracted papers in "PubMed_retracted_publication_full_v3.tsv".
- Two citation contexts from retraction notices have been excluded from analyses.
ROW EXPLANATIONS
- Each row is a citation context.
COLUMN HEADER EXPLANATIONS
1) pmcid - PubMed Central ID of the citing paper
2) pmid - PubMed ID of the citing paper
3) pub_type - Publication type collected from the metadata in the PMCOA XML files.
4) pub_type2 - Specific article types. Please see the manuscript for explanations.
5) year - Publication year of the citing paper
6) location - Location of the citation context (abstract = abstract, body = main text, back = supporting material, table_or_figure_caption = tables and table/figure captions)
7) intxt_id - Identifier of a cited paper. Here, a cited paper is the retracted paper.
8) intxt_pmid - PubMed ID of a cited paper. Here, a cited paper is the retracted paper.
9) citation - The citation context
10) retracted_yr - Retraction year of the retracted paper
11) cit_purpose - Purpose of citing the retracted paper. This is from human annotations. Please see the manuscript for further information about annotation.
12) longer_context - An extended version of the citation context. (if applicable, otherwise blank) Manually pulled from the full texts in the process of annotation.
<b>FILENAME: Annotation manual.pdf</b>
- The manual for annotating the citation purposes in column 11) of the 724_knowingly_post_retraction_cit.csv.
<b>FILENAME: retraction_notice_PMID.csv</b> (new file added for this version)
- A list of 8,346 PMIDs of retraction notices indexed in PubMed (retrieved on August 20, 2020, searched with the query "retraction of publication" [PT] ).
keywords:
citation context; in-text citation; citation to retracted papers; retraction
published:
2021-07-15
Castro, Daniel; Sweedler, Jonathan
(2021)
The dataset contains the high-throughput matrix-assisted laser desorption/ionization mass spectrometry XML files for the atrial gland and red hemiduct of Aplysia californica.
keywords:
Dense-core vesicle; High-throughput; Mass Spectrometry; MALDI; Organelle; Image-Guided; Atrial gland; red hemiduct; Lucent Vesicle
published:
2024-02-16
Zhang, Mingxiao; Sutton, Bradley
(2024)
Sample data from one typical phantom test and one deidentified shunt patient test (shown in Fig. 8 of the MRM paper), with the corresponding analysis code for the Shunt-FENSI technique.
For the MRM paper “Measuring CSF Shunt Flow with MRI Using Flow Enhancement of Signal Intensity (FENSI)”
keywords:
Shunt-FENSI; MRM; Hydrocephalus; VP Shunt; Flow Quantification; Pediatric Neurosurgery; Pulse Sequence; Signal Simulation
published:
2022-04-20
This is the core data for Zinnen et al., "Functional traits and responses to nutrient and mycorrhizal addition are inconsistently related to wetland plant species’ coefficients of conservatism." This is submitted to Wetlands Ecology and Management.
Two datasets are submitted here. The first is greenhouse-collected data of 9 plant traits and concurrent treatment responses of Illinois wetland plant species. The second is field-collected leaf trait data of Illinois wetland plant species. These data are analyzed in the paper. Please refer to the main manuscript for how these data were produced and for the specific analyses.
keywords:
ecological indicators; Floristic Quality Assessment; Floristic Quality Index; wetland degradation
published:
2022-09-08
Hartman, Jordan; Larson, Eric
(2022)
Data associated with the manuscript "Overlooked invaders? Ecological impacts of non-game, native transplant fishes in the United States" by Jordan H. Hartman and Eric R. Larson
keywords:
freshwater; non-game; native transplant; impacts; invasive species
published:
2022-10-27
Holiman, Haley; Kitaif, J. Carson; Fournier, Auriel M.V.; Iglay, Ray; Woodrey, Mark S.
(2022)
keywords:
marsh birds; automated recording units
published:
2023-10-22
Davidson, Ruth; Vachaspati, Pranjal; Mirarab, Siavash; Warnow, Tandy
(2023)
HGT+ILS datasets from Davidson, R., Vachaspati, P., Mirarab, S., & Warnow, T. (2015). Phylogenomic species tree estimation in the presence of incomplete lineage sorting and horizontal gene transfer. BMC genomics, 16(10), 1-12. Contains model species trees, true and estimated gene trees, and simulated alignments.
keywords:
evolution; computational biology; bioinformatics; phylogenetics
published:
2021-11-05
Keralis, Spencer D. C.; Yakin, Syamil
(2021)
This data set contains survey results from a 2021 survey of University of Illinois University Library employees conducted as part of the Becoming A Trans Inclusive Library Project to evaluate the awareness of University of Illinois faculty, staff, and student employees regarding transgender identities, and to assess the professional development needs of library employees to better serve trans and gender non-conforming patrons. The survey instrument is available in the IDEALS repository: http://hdl.handle.net/2142/110080.
keywords:
transgender awareness; academic library; gender identity awareness; professional development opportunities
published:
2021-09-17
Stern, Jessica; Herman, Brook D.; Matthews, Jeffrey
(2021)
We studied vegetation metric robustness to environmental (season, interannual, and regional) and methodological (observer) variables, as well as adequate sample size for vegetation metrics across four regions of the United States.
keywords:
coefficients of conservatism; floristic quality assessment; restoration; vegetation metric;
published:
2022-03-31
Crawford, Reed D.; Dodd, Luke E.; Tillman, Frank E.; O'Keefe, Joy M.
(2022)
This dataset contains our bi-hourly temperature recordings from 40 rocket box style artificial roosts of 5 designs deployed in Indiana and Kentucky, USA from April through September 2019. This dataset also includes our endothermic and facultatively heterothermic daily energy expenditure datasets used in our bioenergetic analysis, which were calculated from the bi-hourly rocket box temperature data. Lastly, we include our overheating counts dataset, which summarizes daily overheating events (i.e., temperatures > 40 Celsius) in each rocket box style bat box over the course of the study period; these daily summaries were also calculated from the bi-hourly rocket box temperature recordings.
keywords:
artificial roost; bat box; microclimate; temperature
published:
2024-01-01
Christensen, Jacob; Bettler, Simon; Qu, Kejian; Huang, Jeffrey; Kim, Soyeun; Lu, Yinchuan; Zhao, Chengxi; Chen, Jin; Krogstad, Matthew; Woods, Toby; Mahmood, Fahad; Huang, Pinshane; Abbamonte, Peter; Shoemaker, Daniel
(2024)
Contains scattering data obtained for (TaSe4)2I at the Advanced Photon Source at Argonne National Laboratory. Beamline 6ID-D was used with a beam energy of 64.8 keV in a transmission geometry. Data was obtained at temperatures between 28 and 300 K. See the readme.txt file for more information.
keywords:
X-ray diffraction
published:
2025-11-06
Salmonella HilD 3'UTR GRIL-seq sequencing data
keywords:
Salmonella; SPI1; hilD
published:
2023-04-12
Han, Edmund; Nahid, Shahriar Muhammad; Rakib, Tawfiqur; Nolan, Gillian; F. Ferrari, Paolo; Hossain, M. Abir; Schleife, André; Nam, SungWoo; Ertekin, Elif; van der Zande, Arend; Huang, Pinshane
(2023)
STEM images of kinks in α-In2Se3, DFT calculation of bending of α-In2Se3, and PFM on as-exfoliated and controllably bent α-In2Se3
published:
2022-11-09
Wang, Junren; Konar, Megan; Dalin, Carole; Liu, Yu; Stillwell, Ashlynn S.; Xu, Ming; Zhu, Tingju
(2022)
This dataset includes the blue water intensity by sector (41 industries and service sectors) for provinces in China, economic and virtual water network flow for China in 2017, and the corresponding network properties for these two networks.
keywords:
Economic network; Virtual water; Supply chains; Network analysis; Multilayer; MRIO
published:
2023-03-13
Yang, Joyce; Zhao, Lei; Oleson, Keith
(2023)
This dataset contains the historical and future (SSP3 and RCP7.0) CESM climate simulations used in the article "Large humidity effects on urban heat exposure and cooling challenges under climate change" (upcoming). Further details about these simulations can be found in the article. This dataset documents the monthly mean projections of air temperature, wet-bulb temperature, precipitation, relative humidity, and numerous other climatic variables for 2000-2009 (for the historical run) and for 2015-2100 (for the future projection under SSP3-RCP7). This dataset may be useful for urban planners, climate scientists, and decision-makers interested in changes in urban and rural climate under climate change.
keywords:
urban climate; climate change; heat stress; urban heat
published:
2023-03-28
Hsiao, Tzu-Kun; Torvik, Vetle
(2023)
Sentences and citation contexts identified from the PubMed Central open access articles
----------------------------------------------------------------------
The dataset is delivered as 24 tab-delimited text files. The files contain 720,649,608 sentences, 75,848,689 of which are citation contexts. The dataset is based on a snapshot of articles in the XML version of the PubMed Central open access subset (i.e., the PMCOA subset). The PMCOA subset was collected in May 2019.
The dataset is created as described in: Hsiao TK., & Torvik V. I. (manuscript) OpCitance: Citation contexts identified from the PubMed Central open access articles.
<b>Files</b>:
• A_journal_IntxtCit.tsv – Sentences and citation contexts identified from articles published in journals with journal titles starting with A.
• B_journal_IntxtCit.tsv – Sentences and citation contexts identified from articles published in journals with journal titles starting with B.
• C_journal_IntxtCit.tsv – Sentences and citation contexts identified from articles published in journals with journal titles starting with C.
• D_journal_IntxtCit.tsv – Sentences and citation contexts identified from articles published in journals with journal titles starting with D.
• E_journal_IntxtCit.tsv – Sentences and citation contexts identified from articles published in journals with journal titles starting with E.
• F_journal_IntxtCit.tsv – Sentences and citation contexts identified from articles published in journals with journal titles starting with F.
• G_journal_IntxtCit.tsv – Sentences and citation contexts identified from articles published in journals with journal titles starting with G.
• H_journal_IntxtCit.tsv – Sentences and citation contexts identified from articles published in journals with journal titles starting with H.
• I_journal_IntxtCit.tsv – Sentences and citation contexts identified from articles published in journals with journal titles starting with I.
• J_journal_IntxtCit.tsv – Sentences and citation contexts identified from articles published in journals with journal titles starting with J.
• K_journal_IntxtCit.tsv – Sentences and citation contexts identified from articles published in journals with journal titles starting with K.
• L_journal_IntxtCit.tsv – Sentences and citation contexts identified from articles published in journals with journal titles starting with L.
• M_journal_IntxtCit.tsv – Sentences and citation contexts identified from articles published in journals with journal titles starting with M.
• N_journal_IntxtCit.tsv – Sentences and citation contexts identified from articles published in journals with journal titles starting with N.
• O_journal_IntxtCit.tsv – Sentences and citation contexts identified from articles published in journals with journal titles starting with O.
• P_p1_journal_IntxtCit.tsv – Sentences and citation contexts identified from articles published in journals with journal titles starting with P (part 1).
• P_p2_journal_IntxtCit.tsv – Sentences and citation contexts identified from articles published in journals with journal titles starting with P (part 2).
• Q_journal_IntxtCit.tsv – Sentences and citation contexts identified from articles published in journals with journal titles starting with Q.
• R_journal_IntxtCit.tsv – Sentences and citation contexts identified from articles published in journals with journal titles starting with R.
• S_journal_IntxtCit.tsv – Sentences and citation contexts identified from articles published in journals with journal titles starting with S.
• T_journal_IntxtCit.tsv – Sentences and citation contexts identified from articles published in journals with journal titles starting with T.
• UV_journal_IntxtCit.tsv – Sentences and citation contexts identified from articles published in journals with journal titles starting with U or V.
• W_journal_IntxtCit.tsv – Sentences and citation contexts identified from articles published in journals with journal titles starting with W.
• XYZ_journal_IntxtCit.tsv – Sentences and citation contexts identified from articles published in journals with journal titles starting with X, Y or Z.
Each row in the file is a sentence/citation context and contains the following columns:
• pmcid: PMCID of the article
• pmid: PMID of the article. If an article does not have a PMID, the value is NONE.
• location: The article component (abstract, main text, table, figure, etc.) to which the citation context/sentence belongs.
• IMRaD: The type of IMRaD section associated with the citation context/sentence. I, M, R, and D represent introduction/background, method, results, and conclusion/discussion, respectively; NoIMRaD indicates that the section type is not identifiable.
• sentence_id: The ID of the citation context/sentence in the article component
• total_sentences: The number of sentences in the article component.
• intxt_id: The ID of the citation.
• intxt_pmid: PMID of the citation (as tagged in the XML file). If a citation does not have a PMID tagged in the XML file, the value is "-".
• intxt_pmid_source: The sources where the intxt_pmid can be identified. xml indicates that the PMID is identified only from the XML file; xml,pmc indicates that the PMID is identified from the XML file and also appears in the citation data collected from the NCBI Entrez Programming Utilities. If a citation does not have an intxt_pmid, the value is "-".
• intxt_mark: The citation marker associated with the inline citation.
• best_id: The best source link ID (e.g., PMID) of the citation.
• best_source: The sources that confirm the best ID.
• best_id_diff: The comparison result between the best_id column and the intxt_pmid column.
• citation: A citation context. If no citation is found in a sentence, the value is the sentence.
• progression: Text progression of the citation context/sentence.
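The column list above documents two placeholder values: NONE for a missing pmid and "-" for a missing intxt_pmid. A minimal sketch of reading one of the tab-delimited files while normalizing those placeholders is shown here. It assumes the files carry a header row with the column names listed above; the sample rows, including the location values, are invented for illustration and only a subset of columns is shown.

```python
import csv
import io

# Hypothetical two-row excerpt with a subset of the documented columns.
SAMPLE = (
    "pmcid\tpmid\tlocation\tIMRaD\tintxt_pmid\n"
    "PMC1\t123\tbody\tI\t456\n"
    "PMC2\tNONE\tabstract\tNoIMRaD\t-\n"
)

def read_rows(handle):
    """Yield one dict per sentence/citation context, mapping placeholders to None."""
    for row in csv.DictReader(handle, delimiter="\t"):
        if row["pmid"] == "NONE":    # article has no PMID
            row["pmid"] = None
        if row["intxt_pmid"] == "-":  # citation has no PMID tagged in the XML
            row["intxt_pmid"] = None
        yield row

rows = list(read_rows(io.StringIO(SAMPLE)))
```

Given the size of the full dataset (over 720 million sentences), streaming rows from a generator like this is likely more practical than loading a whole file into memory.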
<b>Supplementary Files</b>
• PMC-OA-patci.tsv.gz – This file contains the best source link IDs for the references (e.g., PMID). Patci [1] was used to identify the best source link IDs. The best source link IDs are mapped to the citation contexts and displayed in the *_journal_IntxtCit.tsv files as the best_id column.
Each row in the PMC-OA-patci.tsv.gz file is a citation (i.e., a reference extracted from the XML file) and contains the following columns:
• pmcid: PMCID of the citing article.
• pos: The citation's position in the reference list.
• fromPMID: PMID of the citing article.
• toPMID: Source link ID (e.g., PMID) of the citation. This ID is identified by Patci.
• SRC: The sources that confirm the toPMID.
• MatchDB: The origin bibliographic database of the toPMID.
• Probability: The match probability of the toPMID.
• toPMID2: PMID of the citation (as tagged in the XML file).
• SRC2: The sources that confirm the toPMID2.
• intxt_id: The ID of the citation.
• journal: The first letter of the journal title. This maps to the *_journal_IntxtCit.tsv files.
• same_ref_string: Whether the citation string appears in the reference list more than once.
• DIFF: The comparison result between the toPMID column and the toPMID2 column.
• bestID: The best source link ID (e.g., PMID) of the citation.
• bestSRC: The sources that confirm the best ID.
• Match: Matching result produced by Patci.
[1] Agarwal, S., Lincoln, M., Cai, H., & Torvik, V. (2014). Patci – a tool for identifying scientific articles cited by patents. GSLIS Research Showcase 2014. http://hdl.handle.net/2142/54885
• intxt_cit_license_fromPMC.tsv – This file contains the CC licensing information for each article. The licensing information is from PMC's file lists [2], retrieved on June 19, 2020, and March 9, 2023. It should be noted that the license information for 189,855 PMCIDs is <b>NO-CC CODE</b> in the file lists, and 521 PMCIDs are absent in the file lists. The absence of CC licensing information does not indicate that the article lacks a CC license. For example, PMCID: 6156294 (<b>NO-CC CODE</b>) and PMCID: 6118074 (absent in the PMC's file lists) are under CC-BY licenses according to their PDF versions of articles.
The intxt_cit_license_fromPMC.tsv file has two columns:
• pmcid: PMCID of the article.
• license: The article’s CC license information provided in PMC’s file lists. The value is nan when an article is not present in the PMC’s file lists.
[2] https://www.ncbi.nlm.nih.gov/pmc/tools/ftp/
• Supplementary_File_1.zip – This file contains the code for generating the dataset.
keywords:
citation context; in-text citation; inline citation; bibliometrics; science of science