Global distribution of research efforts, disease burden, and impact of US public funding withdrawal

Data sources

Life science publications

We extracted life science research publications from the parsed PubMed XML database. Trained librarians associate articles indexed in PubMed with MeSH terms, which constitute a controlled vocabulary for the categorization of biomedical research topics. These terms formed the basis for linking publications to specific diseases in our analysis. Previous studies have also used alternative approaches to designate disease-specific research, including topic modeling59 or co-word analysis60. However, a designation via MeSH terms allows for scalability and consistency across a broad range of diseases5,6. We focused specifically on MeSH terms in the ‘C-branch’ of the MeSH tree, which contains terms related to diseases. As of June 2024, PubMed recorded 5,032 unique C-branch MeSH terms. We started article extraction in 1999, when PubMed began to systematically record first-author affiliations, which we used to geolocate authorships and research articles. The endpoint of our dataset was 2021, the latest year for which comprehensive GBD data were available at the time of writing.
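
The following minimal sketch illustrates this extraction step in Python; the file name, the pre-built set of C-branch descriptor UIs (which must be derived from the MeSH tree files, as PubMed records do not carry tree numbers) and the function name are illustrative assumptions rather than the exact pipeline.

```python
# Minimal sketch: extract C-branch MeSH descriptors and first-author affiliations
# from a PubMed baseline XML file. Assumes `c_branch_uis` was built beforehand
# from the MeSH descriptor files (tree numbers starting with 'C'); names are illustrative.
import gzip
import xml.etree.ElementTree as ET

def parse_pubmed_file(path, c_branch_uis):
    records = []
    with gzip.open(path, "rb") as fh:
        for _, article in ET.iterparse(fh, events=("end",)):
            if article.tag != "PubmedArticle":
                continue
            pmid = article.findtext(".//PMID")
            year = article.findtext(".//PubDate/Year")
            # MeSH descriptors restricted to the disease (C) branch
            mesh = [
                (d.get("UI"), d.text)
                for d in article.findall(".//MeshHeading/DescriptorName")
                if d.get("UI") in c_branch_uis
            ]
            # affiliation of the first listed author, used later for geolocation
            first_author = article.find(".//AuthorList/Author")
            affiliation = (
                first_author.findtext(".//AffiliationInfo/Affiliation")
                if first_author is not None else None
            )
            if mesh:
                records.append({"pmid": pmid, "year": year,
                                "mesh": mesh, "affiliation": affiliation})
            article.clear()  # keep memory bounded while streaming
    return records
```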

GBD data

To assess disease burden, we used data from the GBD database, maintained by the Institute for Health Metrics and Evaluation (IHME)28. The GBD database provides detailed, time-varying estimates of DALYs, a metric capturing the cumulative years of life lost due to illness, disability or premature death for specific diseases. DALYs provide a comprehensive metric that enables longitudinally consistent comparisons across diseases and regions, making them uniquely suited for analyzing divergence between research and disease burden over time.

The GBD database is organized according to disease within a hierarchical structure of four levels. Level 1 is the most general in the hierarchy, distinguishing between communicable and noncommunicable diseases. For our main analyses, we focused on level 2, which strikes a balance between granularity and interpretability at the global macro level. This cause level allows us to distinguish, for example, enteric infections as a subcategory of communicable diseases, or cardiovascular diseases as a subcategory of noncommunicable diseases. In line with previous research3,5, causes that are difficult to assign to a specific disease (for example, ‘other noncommunicable diseases’) were excluded from the analysis. While we present our main analysis for causes categorized at level 2, we integrated our matching of disease burden and disease-specific research from lower levels in the hierarchy (levels 3 and 4) and rolled the resulting associations up to the described level 2. For example, myocardial infarction (level 4) was rolled up into ischemic heart disease (level 3), which was rolled up into cardiovascular diseases (level 2). In total, we examined 16 level 2 causes of disease.
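
The roll-up logic can be illustrated with a minimal sketch; the truncated cause hierarchy shown here is for illustration only, with the full mapping taken from the GBD cause hierarchy.

```python
# Minimal sketch of rolling up lower-level GBD causes to their level 2 parent.
# The cause hierarchy shown here is illustrative and truncated.
LEVEL2_PARENT = {
    "Myocardial infarction": "Ischemic heart disease",     # level 4 -> level 3
    "Ischemic heart disease": "Cardiovascular diseases",    # level 3 -> level 2
    "Cardiovascular diseases": "Cardiovascular diseases",   # level 2 maps to itself
}

def roll_up_to_level2(cause):
    """Follow parent links until a level 2 cause maps to itself."""
    while LEVEL2_PARENT[cause] != cause:
        cause = LEVEL2_PARENT[cause]
    return cause

assert roll_up_to_level2("Myocardial infarction") == "Cardiovascular diseases"
```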

In addition, we offer a sensitivity analysis using level 3 disease categories, which yielded results consistent with our main findings at level 2, specifically a reduction in divergence between disease burden and research over time driven by changes in DALYs rather than changes in research output (Supplementary Fig. 8). Furthermore, we selected the two major level 2 research areas (cardiovascular diseases and neoplasms) as examples to show how the underlying research and disease burden at level 3 were distributed. The patterns at level 3 closely mirrored those observed at level 2: research on neoplasms remains disproportionately high relative to their burden, while research on major cardiovascular conditions remains underrepresented (Supplementary Fig. 9a,b).

Linking research with disease burden

To identify disease-specific research articles, we linked C-branch MeSH terms assigned to PubMed articles with the corresponding disease categories defined by the GBD database. Traditional approaches rely on the ICD system as a crosswalk to bridge these datasets. While precise in principle, this method has notable limitations, including structural mismatches between MeSH terms and ICD codes, as well as inefficiencies introduced by the intermediate step of linking ICD codes to GBD categories. Each of the three classification systems (MeSH, ICD and GBD) was designed for a different purpose, resulting in differences in nomenclature. Crosswalking these three coding systems requires expert judgment by physicians, preventing large-scale linkage of millions of research articles to specific disease burden causes. These shortcomings are particularly problematic for longitudinal analyses that require accurate and consistent associations of research articles with burden data, and for studies of geographical regions or research areas with lower research output, where even small numbers of overlooked articles can substantially affect the analyses.

To address these issues, we used a triangulated strategy that combined manual data curation by physicians, an ICD-based approach and an LLM-based methodology. In the first step, we manually curated a dataset to serve as a gold standard for validation, focusing on cardiovascular diseases, the disease category causing the greatest burden globally and exhibiting the most complex MeSH–cause nomenclature matrix. Two co-authors who are practicing cardiologists independently reviewed C-branch MeSH terms related to cardiovascular diseases (MeSH branch C14) and matched them to the GBD level 2 cause ‘Cardiovascular Diseases’ and its subcategories. This manual process ensured high accuracy in matches and resolved residual ambiguities through iterative discussions. Second, we applied the traditional ICD-based approach, mapping MeSH terms to ICD Tenth Revision codes and subsequently linking these codes to the 16 level 2 GBD causes. This process relied on established crosswalks in the Unified Medical Language System. Third, we developed an LLM-based method using ChatGPT (model GPT-4o) that directly assessed whether a MeSH term aligned with a specific GBD cause. Extended Data Fig. 1 provides an overview of our approach.

We designed a custom prompt that directs the model to evaluate each of roughly 1 million possible combinations of 5,032 MeSH terms and 180 GBD causes (Supplementary Fig. 10). This method circumvents the need for an intermediate step that first links MeSH to ICD and then ICD to disease cause in a one-to-many MeSH to ICD and many-to-one ICD to cause matching structure. Instead, the LLM approach allows for the simultaneous evaluation of a many-to-many MeSH-to-cause matching structure at scale, including multiple assignments where appropriate.
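
A minimal sketch of such a pairwise query is shown below; the prompt wording is a hypothetical stand-in (the actual prompt is reproduced in Supplementary Fig. 10), and the function name is illustrative.

```python
# Illustrative sketch of querying an LLM for one MeSH term-GBD cause pair.
# The prompt text here is hypothetical, not the prompt used in the study.
# Requires the `openai` package and an API key in the environment.
from openai import OpenAI

client = OpenAI()

def mesh_matches_cause(mesh_term: str, gbd_cause: str) -> bool:
    prompt = (
        f"Does the MeSH term '{mesh_term}' describe a condition that falls under "
        f"the Global Burden of Disease cause '{gbd_cause}'? Answer 'yes' or 'no'."
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content.strip().lower().startswith("yes")

# Looping over all ~5,032 x 180 term-cause pairs yields the many-to-many
# MeSH-to-cause matching matrix used downstream.
```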

Comparing the performance of the LLM-based approach to the ICD-based method using the physician-derived gold standard, we observed a substantial improvement in overall accuracy at the MeSH term level—from 50.5% with the ICD-based approach to 86.1% with the LLM approach. At the article level, the LLM approach even achieved an accuracy of 94.9%, compared to 67.0% for the ICD-based method (Supplementary Fig. 11). The greater accuracy at the article level is expected because more frequent MeSH-to-cause linkages occur in more productive research areas with relatively more articles.

The substantial improvement in the identification of disease-specific research using the LLM instead of ICD codes was mainly due to increased recall, where the LLM performed markedly better. Precision was comparable between the two methods, indicating that neither approach was prone to false positives. However, high and stable recall is critical for our longitudinal assessment of divergence because it minimizes the risk of underrepresenting research activities related to specific diseases.

To ensure the robustness of our approach beyond the cardiovascular disease category, we also evaluated the recall performance for other level 2 causes. This analysis assessed whether articles published in disease-specific journals and assigned a C-branch MeSH term were correctly classified as related to these diseases. The LLM-based approach achieved a recall rate of 94.76% compared to 72.71% for the ICD-based method, with little variation between disease causes (Supplementary Fig. 12). These results further validate the LLM-based approach in connecting research to GBD causes.
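
The validation itself reduces to standard classification metrics computed against the gold standard, as in the following minimal sketch with toy labels; variable names are illustrative.

```python
# Minimal sketch of the validation step: compare predicted MeSH-to-cause labels
# (LLM-based or ICD-based) with the physician-curated gold standard.
from sklearn.metrics import accuracy_score, precision_score, recall_score

def evaluate(gold_labels, predicted_labels):
    """Both inputs are lists of 0/1 flags, one per MeSH term, indicating
    whether the term belongs to the cardiovascular disease cause."""
    return {
        "accuracy": accuracy_score(gold_labels, predicted_labels),
        "precision": precision_score(gold_labels, predicted_labels),
        "recall": recall_score(gold_labels, predicted_labels),
    }

# Toy example
gold = [1, 1, 0, 1, 0]
pred = [1, 0, 0, 1, 0]
print(evaluate(gold, pred))  # accuracy 0.8, precision 1.0, recall ~0.67
```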

Overall, the comprehensive evaluation of different methods for identifying disease-specific research highlights the potential of LLMs to improve the linkage between research articles and diseases. The feasibility of establishing an accurate and reliable link between disease burden and research is also fundamental to monitoring future progress and we share our publication-level data in an online repository61 (see the ‘Data availability’ statement).

Creation of final sample

The LLM-based association of disease-specific research articles indexed in PubMed with causes of disease in the GBD database resulted in 7.5 million unique articles published between 1999 and 2021. As a single article can be associated with multiple causes of disease, this resulted in 9.7 million disease-cause article links. We assigned these publications to geographical regions based on the affiliation information associated with the first authors of research articles. In the life sciences, authorship norms associate first and last authors with leading roles in the research project. A comparison of the geographical locations of first and last authors showed that in over 90% of first–last author combinations, the geographical locations were identical at the country level, which is the most granular level of analysis in our study. We used the affiliation information recorded in PubMed and supplemented this with affiliation information from Web of Science. As the affiliation data were recorded as unstructured text, we used ChatGPT to process and assign countries to the affiliation strings, considerably enhancing the curated dataset. We also conducted a separate analysis based on the presence or absence of industry-affiliated co-authors, using information extracted from the authors’ institutional affiliations. We randomly selected 200 samples and had two independent raters assess the accuracy of the LLM-assigned country designations. In all cases (100%), both raters confirmed that the LLM country assignments were correct.
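
The first–last author agreement check can be expressed as a simple comparison of LLM-assigned countries, as in the following minimal sketch with toy data; column names are illustrative.

```python
# Minimal sketch of the first-last author country agreement check, assuming a
# dataframe with one row per article and LLM-assigned countries for both authors.
import pandas as pd

articles = pd.DataFrame({
    "pmid": ["1", "2", "3"],
    "first_author_country": ["US", "DE", "KE"],
    "last_author_country": ["US", "DE", "US"],
})

both_known = articles.dropna(subset=["first_author_country", "last_author_country"])
agreement = (both_known["first_author_country"]
             == both_known["last_author_country"]).mean()
print(f"First-last author country agreement: {agreement:.1%}")  # 66.7% in this toy example
```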

Overall, we geolocated over 25 million unique affiliation strings, including those of non-first authors, and successfully assigned geographical information to 6.7 million unique articles. This corresponds to 8.6 million article-to-cause links, covering approximately 89% of the articles in our sample. While our main analysis focused on first authors, a sensitivity analysis that considered all authors’ countries yielded consistent results. Additionally, for a subset of 71 of 100 randomly selected articles in which the study location could be inferred from the abstract, the first author’s country matched the research location in over 85% of cases.

To further enrich our dataset, we integrated article-specific funding information from Web of Science, available since 2008. Using the LLM, we assigned countries to funding agencies based on the acknowledgement text, automating what would otherwise have required extensive manual inspection. We also used this approach to identify major US public funding institutions, defined as those acknowledged in at least 1,000 research publications in our dataset and financed by US taxpayer money. This process identified funding information for about 40% of disease-specific articles in our sample. It is important to note that while these funding data provide valuable insights into the sources of acknowledged funding, we are cautious about drawing conclusions for publications lacking such information. To our knowledge, these are the most comprehensive funding data currently available62.

Our final sample consisted of 8.6 million publication–cause links, geolocated based on first-author affiliations. For our analysis, we aggregated these data in several ways and supplemented them with DALY data, which provide year-specific, cause-specific and country-specific assessments of the burden of disease. For our primary analyses, we aggregated country-level data into eight geographical regions defined by the United Nations36: Central and Southern Asia; Eastern and South-Eastern Asia; Europe; Latin America and the Caribbean; Northern Africa and Western Asia; North America; Oceania; and sub-Saharan Africa. The sample creation process is summarized in Extended Data Fig. 2.

Analyses

To quantify the divergence between research and disease burden, we used the Kullback–Leibler divergence (KLD). Formally, the KLD is a nonsymmetric measure of the difference between two probability distributions p(x) and q(x) and is given by:

$${\mathrm{KLD}}\left(p(x)\,\|\,q(x)\right)=\sum _{x\in X}p\left(x\right)\,\mathrm{ln}\left(\frac{p\left(x\right)}{q\left(x\right)}\right)$$

where p(x) represents the reference distribution, in our case the discrete distribution of DALYs per disease x (in percent), and q(x) represents the discrete distribution of research articles per disease x (in percent). Summing the individual divergences between research articles and DALYs for each disease x across all 16 diseases in the set X of level 2 diseases from the GBD database yields the KLD as our key divergence measure. The KLD is nonnegative, provides an internally consistent measure of divergence over time and decreases as the fit between the two distributions improves; a KLD of zero would indicate perfect alignment between research and disease burden. To address the KLD’s sensitivity to very small probabilities, we conducted sensitivity analyses excluding near-zero outliers (values below 0.01) from both the numerator and denominator and applying Laplace smoothing; these adjustments yielded consistent results. We also computed additional metrics, namely the Population Stability Index63, the Hellinger distance64 and the Jensen–Shannon divergence, and obtained consistent results (Supplementary Fig. 1).
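
A minimal sketch of the divergence calculation for a single year is shown below, using toy shares over five causes and an illustrative smoothing constant.

```python
# Minimal sketch of the divergence metrics for one year, given two arrays of
# shares over the disease causes. The smoothing constant is illustrative.
import numpy as np
from scipy.spatial.distance import jensenshannon

def kld(p, q, alpha=1e-6):
    """KLD(p || q) with p = DALY shares (reference) and q = research shares."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    p = (p + alpha) / (p + alpha).sum()   # smoothing avoids division by zero
    q = (q + alpha) / (q + alpha).sum()
    return float(np.sum(p * np.log(p / q)))

def hellinger(p, q):
    p = np.asarray(p, dtype=float); p = p / p.sum()
    q = np.asarray(q, dtype=float); q = q / q.sum()
    return float(np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2)))

daly_share = [0.30, 0.25, 0.20, 0.15, 0.10]      # toy shares over 5 causes
research_share = [0.40, 0.30, 0.15, 0.10, 0.05]

print(kld(daly_share, research_share))
print(hellinger(daly_share, research_share))
print(jensenshannon(daly_share, research_share) ** 2)  # squared distance = JS divergence
```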

To account for the inherent uncertainty in DALY estimates, we used the asymmetric upper and lower bounds reported by the IHME. For each year, we simulated DALY values from a log-normal distribution parameterized according to the reported mean, upper and lower bounds. This approach captures the nonnegative, skewed nature of DALYs while aligning the simulated distribution with the IHME uncertainty intervals. We then calculated the divergence metrics for each simulated draw, averaged these metrics across simulations and used the standard deviation to construct the 95% confidence intervals. Thus, we effectively bootstrapped DALY estimates from their distribution and used these bootstrapped values to estimate the divergence metrics.
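
The following minimal sketch illustrates one way to implement this procedure, treating the reported point estimate as the log-normal median and deriving the log-scale spread from the 95% uncertainty interval; this parameterization and the toy values are assumptions for illustration, not necessarily the exact specification.

```python
# Minimal sketch of propagating DALY uncertainty into the divergence estimate.
# Assumption: point estimate ~ log-normal median, 95% interval spans +/- 1.96
# sigma on the log scale.
import numpy as np

rng = np.random.default_rng(42)

def simulate_dalys(mean, lower, upper, n_draws=1000):
    mu = np.log(mean)
    sigma = (np.log(upper) - np.log(lower)) / (2 * 1.96)
    return rng.lognormal(mean=mu, sigma=sigma, size=n_draws)

# toy example: three causes with point estimates and 95% uncertainty intervals
estimates = [(100.0, 80.0, 125.0), (50.0, 40.0, 62.0), (200.0, 150.0, 260.0)]
draws = np.column_stack([simulate_dalys(m, lo, hi) for m, lo, hi in estimates])
shares = draws / draws.sum(axis=1, keepdims=True)   # DALY shares per simulated draw

research_share = np.array([0.5, 0.2, 0.3])
klds = np.sum(shares * np.log(shares / research_share), axis=1)
ci = (klds.mean() - 1.96 * klds.std(), klds.mean() + 1.96 * klds.std())
print(f"KLD = {klds.mean():.3f}, 95% CI ({ci[0]:.3f}, {ci[1]:.3f})")
```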

Alternative assessment of disease burden

DALYs represent a composite measure of morbidity and mortality. To test whether the research–disease divergence varied with our measure of disease burden, we reconstructed the KLD with measures for deaths and prevalence, also provided by the GBD (Extended Data Fig. 3). We observed that the trends in research–disease divergence were broadly parallel across all three metrics. Additionally, Supplementary Fig. 2a,b shows that for each disease burden measure, the divergence would not have decreased in the hypothetical scenario where the disease burden had remained unchanged, which is consistent with the findings based on DALYs. A corresponding breakdown according to disease category is provided in Extended Data Fig. 4.

Alternative research response times

To check for the possibility of delayed adjustments of research to changes in disease burden, we ran a time-lagged sensitivity analysis of the declining research–disease divergence. Specifically, we compared the distribution of DALYs each year with the distribution of research 10 years later. This lagged analysis yielded no evidence that research adjusted to changes in the burden of disease, either contemporaneously or with a lag of up to 10 years (Extended Data Fig. 5).

Alternative assessment of research output

To capture conceivable variation in research output based on its potential for human application, we built on the work of Hutchins and colleagues, who developed the Approximate Potential to Translate (APT) metric22. This metric is derived from a machine learning model that predicts the likelihood that a given publication will be cited in a clinical trial. A higher APT score indicates a higher probability of eventual clinical citation. In addition, we used PubMed publication types, as defined by the iCite classification65, to identify clinical research. By combining the APT score with the iCite definition, we categorized publications into three different groups: (1) basic research: APT score lower than 0.5 and not classified as clinical; (2) applied research: APT score of 0.5 or higher and not classified as clinical; (3) clinical research according to iCite. For each of these groups, we calculated the relative proportion of articles devoted to each disease and compared this distribution to the corresponding DALY distribution. This allowed us to assess how closely the distribution of each type of research matched the distribution of disease burden (Extended Data Fig. 6).
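
The resulting three-way classification rule can be summarized in a minimal sketch; field names are illustrative.

```python
# Minimal sketch of the three-way classification by translational potential,
# combining the APT score with the iCite clinical publication flag.
def classify_publication(apt_score: float, is_clinical: bool) -> str:
    if is_clinical:
        return "clinical"   # clinical research according to iCite
    if apt_score >= 0.5:
        return "applied"    # high predicted likelihood of clinical citation
    return "basic"          # APT < 0.5 and not classified as clinical

assert classify_publication(0.3, False) == "basic"
assert classify_publication(0.8, False) == "applied"
assert classify_publication(0.8, True) == "clinical"
```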

To gain more specific insight into clinical trials sponsored by industry, we also created a subset of publications linked to trials registered with ClinicalTrials.gov. This link enabled the identification of trial-related publications and extraction of key metadata, most notably, the sponsor and trial phase, thus facilitating a more targeted assessment of industry-driven research. We focused on industry-sponsored phase 3 trials and found a considerable increase in the research–disease divergence for this subset of research. This finding supports our recommendation for greater industry involvement in the coordination of the research enterprise. We also included an analysis of publications based on the presence of industry-affiliated co-authors, identified through authors’ institutional affiliations. For this subset, the results were consistent with our main analyses, suggesting that the greater divergence observed for industry-sponsored phase 3 trials is not solely attributable to industry involvement.

In addition to differentiating research according to its clinical potential and industry involvement, we also examined the role of funding acknowledgements in articles published in 2008 or later. We created two additional subsets of articles: those that explicitly acknowledged funding and those that did not. We then compared the distribution of articles in each subset across diseases to the DALY distribution (Supplementary Fig. 5).

Geographical stratification

To assess the geographical distribution of diseases, we first calculated the share of DALYs per world region for each level 2 disease cause across the eight world regions. Extended Data Fig. 7 (left) presents the HHI, a widely used measure of concentration. Higher scores indicate a more regionally concentrated disease. The green shading of the bars (left) corresponds to the green shading in Fig. 4, indicating diseases that contributed to reducing the divergence. These diseases are locally concentrated, as their HHI exceeds the average HHI in our sample, and they are mostly communicable diseases. In contrast, the red-shaded diseases, which are noncommunicable, tend to be more globally distributed. The panel on the right ranks diseases in descending order based on their contribution to reducing the research–disease burden divergence over the past 20 years, with the respective disease burden stratified according to world region. The data show that a set of communicable diseases concentrated in sub-Saharan Africa and Central and Southern Asia has contributed most to the reduction in research–disease divergence.
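
The HHI itself is the sum of squared regional shares, as in the following minimal sketch with toy values.

```python
# Minimal sketch of the Herfindahl-Hirschman index (HHI) over the eight world
# regions for a single disease cause; toy DALY values are illustrative.
import numpy as np

def hhi(regional_dalys):
    """HHI = sum of squared regional shares; higher values indicate that the
    burden of a disease is concentrated in fewer regions."""
    shares = np.asarray(regional_dalys, dtype=float)
    shares = shares / shares.sum()
    return float(np.sum(shares ** 2))

print(hhi([10, 10, 10, 10, 10, 10, 10, 10]))  # 0.125, evenly distributed burden
print(hhi([80, 5, 5, 2, 2, 2, 2, 2]))         # ~0.65, highly concentrated burden
```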

To test the sensitivity of these geographical findings to the way in which we geolocate research, we also considered the affiliations of all authors, that is, first, last and middle. The findings remained consistent, as the overall distribution of authorship according to region changed only marginally when using any authorship position instead of first authorship alone. However, the data suggest that non-first authorship has a more prominent role in certain regions, particularly Eastern and South-Eastern Asia (Supplementary Fig. 6).

Projections of future divergence

To estimate how the divergence between research activity and disease burden may evolve in the coming years, we projected future research trends for each disease based on the trajectories observed over the last 4 years in our sample period (2018–2021). For respiratory diseases and tuberculosis, we adjusted these projections to account for the impact of COVID-19 in 2022, 2023 and 2024. In addition, we modeled a second scenario that assumes major US public funding agencies will cease funding research led by non-US first authors. This scenario reflects ongoing dynamics within the US science funding landscape and may be viewed as conservative. The distribution of DALYs was taken from a forecast by Vollset et al.37, assuming a continuation of the progress observed in recent years. We provide the disease-specific projections in Supplementary Fig. 7a–c.
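
A minimal sketch of one way to implement such a projection, using a linear trend fitted to the 2018–2021 window, is shown below; the trend model and toy counts are illustrative assumptions rather than the exact specification.

```python
# Minimal sketch: extrapolate disease-specific research output from the
# 2018-2021 trajectory with a simple linear trend (illustrative only).
import numpy as np

def project_linear(years, counts, target_years):
    """Fit a linear trend to the observed window and extrapolate forward."""
    slope, intercept = np.polyfit(years, counts, deg=1)
    return {y: max(slope * y + intercept, 0.0) for y in target_years}

observed_years = [2018, 2019, 2020, 2021]
observed_counts = [1200, 1260, 1310, 1380]   # toy article counts for one cause
print(project_linear(observed_years, observed_counts, [2022, 2023, 2024]))
```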

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

