Dataset Description
|
Provides links to Author-ity 2009, including records from principal investigators (on NIH and NSF grants), inventors on USPTO patents, and students/advisors on ProQuest dissertations.
Note that NIH and NSF differ in the fields they record and in the standards they use (e.g., for institution names). Typically, an NSF grant spanning multiple years is associated with a single record, while an NIH grant appears in multiple records: one for each fiscal year and for each sub-project or supplement, possibly with different principal investigators.
The prior probability of a match (i.e., that the author exists in Author-ity 2009) varies dramatically across NIH grants, NSF grants, and USPTO patents. The great majority of NIH principal investigators have one or more papers in PubMed, but only a minority of NSF principal investigators (except in biology) do, and even fewer USPTO inventors do. This prior probability has been built into the calculation of match probabilities.
The NIH data were downloaded from NIH ExPORTER and the older NIH CRISP files. The dataset has 2,353,387 records, includes only those with match probability > 0.5, and has the following 12 fields:
1 app_id
2 nih_full_proj_nbr
3 nih_subproj_nbr
4 fiscal_year
5 pi_position
6 nih_pi_names
7 org_name
8 org_city_name
9 org_bodypolitic_code
10 age: number of years since the author's first paper
11 prob: the match probability to au_id
12 au_id: Author-ity 2009 author ID
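Because one NIH grant is spread over several records, it can help to regroup the records by project. The following is a minimal sketch, assuming the NIH file is tab-delimited, has no header row, and stores the columns in the order listed above; none of these assumptions is confirmed by this description, and the function names are hypothetical.

```python
import csv
from collections import defaultdict

# Column order copied from the field list above. The delimiter (tab) and the
# absence of a header row are assumptions, not documented facts.
NIH_FIELDS = [
    "app_id", "nih_full_proj_nbr", "nih_subproj_nbr", "fiscal_year",
    "pi_position", "nih_pi_names", "org_name", "org_city_name",
    "org_bodypolitic_code", "age", "prob", "au_id",
]

def read_nih(lines, min_prob=0.5):
    """Yield one dict per NIH record whose match probability exceeds min_prob."""
    reader = csv.DictReader(
        lines, fieldnames=NIH_FIELDS, delimiter="\t", quoting=csv.QUOTE_NONE
    )
    for row in reader:
        if float(row["prob"]) > min_prob:
            yield row

def authors_by_project(rows):
    """Map each full project number to {fiscal_year: set of au_id}."""
    projects = defaultdict(lambda: defaultdict(set))
    for row in rows:
        projects[row["nih_full_proj_nbr"]][row["fiscal_year"]].add(row["au_id"])
    return projects
```

Passing an open file object to read_nih and raising min_prob above 0.5 keeps only the higher-confidence links; the result of authors_by_project shows how the matched principal investigators of a project change from one fiscal year to the next.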
The NSF dataset has 262,452 records, includes only those with match probability > 0.5, and has the following 10 fields:
1 AwardId
2 fiscal_year
3 pi_position
4 PrincipalInvestigators
5 Institution
6 InstitutionCity
7 InstitutionState
8 age: number of years since the author's first paper
9 prob: the match probability to au_id
10 au_id: Author-ity 2009 author ID
There are two files for USPTO because here we linked disambiguated authors in PubMed (from Author-ity 2009) with disambiguated inventors: one file holds the links and the other describes the disambiguated inventors.
The USPTO linking dataset has 309,720 records, includes only those with match probability > 0.5, and has the following 3 fields:
1 au_id: Author-ity 2009 author ID
2 inv_id: USPTO inventor ID
3 prob: the match probability of au_id vs inv_id
The disambiguated inventors file (uiuc_uspto.tsv) has 2,736,306 records and the following 7 fields:
1 inv_id: USPTO inventor ID
2 is_lower
3 is_upper
4 fullnames
5 patents: patent IDs separated by '|'
6 first_app_yr
7 last_app_yr
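The two USPTO files are meant to be used together: the linking dataset connects au_id to inv_id, and the inventors file connects inv_id to patents. The following is a minimal sketch of that join, assuming both files are tab-delimited, have no header row, and store the columns in the order listed above; these are assumptions, and the function names are hypothetical.

```python
import csv

# Column orders copied from the field lists above. The delimiter (tab) and the
# absence of a header row are assumptions, not documented facts.
INVENTOR_FIELDS = ["inv_id", "is_lower", "is_upper", "fullnames",
                   "patents", "first_app_yr", "last_app_yr"]
LINK_FIELDS = ["au_id", "inv_id", "prob"]

def load_inventor_patents(lines):
    """Map inv_id to its list of patent IDs (the '|'-separated patents field)."""
    reader = csv.DictReader(
        lines, fieldnames=INVENTOR_FIELDS, delimiter="\t", quoting=csv.QUOTE_NONE
    )
    return {row["inv_id"]: row["patents"].split("|") for row in reader}

def patents_by_author(link_lines, inventor_patents, min_prob=0.5):
    """Map au_id to the sorted patent IDs of inventors linked above min_prob."""
    reader = csv.DictReader(
        link_lines, fieldnames=LINK_FIELDS, delimiter="\t", quoting=csv.QUOTE_NONE
    )
    result = {}
    for row in reader:
        if float(row["prob"]) > min_prob:
            patents = inventor_patents.get(row["inv_id"], [])
            result.setdefault(row["au_id"], set()).update(patents)
    return {au_id: sorted(ids) for au_id, ids in result.items()}
```

Both functions accept any iterable of lines, so an open file handle for uiuc_uspto.tsv can be passed to load_inventor_patents directly. An inventor ID that appears in the linking dataset but not in the inventors file contributes no patents rather than raising an error.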
|