Introduction {#sec:intro}

The Ada Lovelace Institute and the Nuffield Council on Bioethics are working on a project to map potential futures for the application of Artificial Intelligence (AI) in genomics and explore their societal and policy implications. Nesta is contributing to the project through a scientometric analysis of key features of AI genomics Research and Development (R&D).¹

Some areas of interest include:

What are the levels of R&D activity in the intersection of AI and genomics? How have they evolved over time?
What is the topical composition of the field? What are "emergent topics" in AI and genomics?
What institutions are participating in AI genomics R&D? What is their geography? What is their character (e.g. public, private)?
What are the key differences between AI applications to genomics and the wider field of genomics in terms of application areas, participants and stakeholders, influence and impact etc.?

Our analysis will address questions such as those above to provide an empirical context for the AI Genomics Futures project and inform subsequent activities such as a horizon-scanning exercise and stakeholder engagement.

Methodological narrative

We distinguish between core project activities and stretch goals.

At its core, the project will create a novel dataset about AI and genomics R&D that we will analyse in order to map the landscape of the field. To do this, we are collecting, processing and analysing data about three things: research, technology development and business activities; the geography and character of the actors (e.g. researchers, inventors, entrepreneurs) participating in AI genomics R&D; and the purpose and influence (e.g. citations, social media reach) of their R&D activities.

We will analyse these data with Natural Language Processing (NLP) and machine learning methods. This will help us tag projects / patents / companies with categories of interest such as the disease areas they target, measure the composition of AI genomics R&D (e.g. key research themes and technological trajectories) and identify emerging trends that might be of particular interest to the project. This will also help us create and compare the specialisation profiles of different countries / actors / types of actors (e.g. private sector vs academic researchers). Where possible we will benchmark the situation in AI genomics R&D vs. the wider field of genomics.
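As an illustration of what a specialisation profile could look like, the sketch below computes a location quotient: an actor's share of activity in a topic divided by that topic's share across all actors. This is an assumption made for the example rather than a description of the indicator the project will adopt. The function name `specialisation_profile`, the `(actor, topic)` record format and the toy records are all hypothetical.

```python
from collections import Counter


def specialisation_profile(records):
    """Return {actor: {topic: location quotient}}.

    `records` is an iterable of (actor, topic) pairs, one per document
    (e.g. a paper, patent or company). This format is hypothetical.
    A value above 1 means the actor is relatively specialised in the topic.
    """
    records = list(records)
    total = len(records)
    actor_totals = Counter(actor for actor, _ in records)
    topic_totals = Counter(topic for _, topic in records)
    pair_counts = Counter(records)

    profile = {}
    for (actor, topic), n in pair_counts.items():
        share_within_actor = n / actor_totals[actor]   # topic's share of the actor's activity
        share_overall = topic_totals[topic] / total    # topic's share of all activity
        profile.setdefault(actor, {})[topic] = share_within_actor / share_overall
    return profile


# Toy records, invented purely for illustration.
toy_records = [
    ("Country A", "cancer"), ("Country A", "cancer"), ("Country A", "rare disease"),
    ("Country B", "cancer"), ("Country B", "rare disease"), ("Country B", "rare disease"),
]
profile = specialisation_profile(toy_records)
```

The same calculation could be applied to any grouping of actors (countries, organisations, or sectors such as private vs academic), provided each record has already been tagged with a topic.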

As stretch goals, we will explore:

Additional data sources,
How to integrate all the datasets we are collecting in a consistent taxonomy that makes it possible to unify their analysis,
Experimental indicators capturing novelty, diversity and interdisciplinarity in AI genomics R&D, and
Models to explain and predict outcomes of interest.

Structure

@sec:data describes our data sources and how we have collected and processed them.

¹ R&D encompasses activities to produce and apply new knowledge, ranging from basic research to technology development and the launch of new products, services and tools.