“AI is data hungry.” – Kjiersten Fagnan, CIO of the Joint Genome Institute
As a U.S. Department of Energy Office of Science user facility supported by the Biological and Environmental Research (BER) program, the Joint Genome Institute (JGI) leads genomic innovation for advancing biotechnology through its high-throughput sequencing, DNA design and synthesis, metabolomics, and computational analysis services and expertise. Established in 1997, it has evolved through the years from a human genome sequencing facility to a user facility for genomic science. And it continues to evolve: today, it is reinventing itself as an AI-centric user facility.
Through its work with more than 2,500 users each year, the JGI generates a substantial volume and variety of data. There is an opportunity for the JGI to incorporate AI into the science it supports, benefiting these primary users as well as the many thousands of users who download millions of files each year.
There is also an opportunity to address the data discoverability challenge faced by the tens of thousands of researchers across BER’s user facilities, research centers, funded programs, and data and computing platforms.
Kjiersten Fagnan, CIO of the Joint Genome Institute, noted at a Biosciences strategy meeting in December 2024 that BER and the broader biological ecosystem have thousands of distinct data resources: different data types, data libraries, metadata systems, and search systems. Accessing data across these systems is currently a hugely time-consuming process.
One comment from a BER user survey summarized the challenge and the opportunity:
“I think a key barrier to entry is the large number of repositories, all with different abilities and requirements. It requires people to go to multiple sites to search for data. If there was a single repository for all aspects of data, from collection metadata to processing metadata to final product, it would be advantageous for research.”
In addition, artificial intelligence tools could leverage all the data that reside at BER-funded facilities, if the data across these facilities had a common data repository infrastructure.
The solution emerged from an idea that the DOE Systems Biology Knowledgebase (KBase) team was developing. The KBase team, with Gazi Mahmud as the KBase Architect Lead, is building a “data lakehouse,” a data management architecture that combines the best features of data lakes (repositories that store raw data) and data warehouses, along with a common library infrastructure that would facilitate data access across the various resources.
Kjiersten’s team saw the potential presented by this idea. As a user facility that was already generating and collating a lot of BER data, the opportunity for JGI was to create a data lakehouse for experimental and derived data, serving researchers as well as AI assistants, not just for JGI but for the larger BER ecosystem.
Getting to AI-Ready Data
It was a challenge that Kjiersten’s team at JGI eagerly took on in October 2024, starting with a pilot project to make data AI-ready across ESS-DIVE (a data repository for earth system science), the JGI, the National Microbiome Data Collaborative (NMDC), and the Environmental Molecular Sciences Laboratory (EMSL), a user facility at Pacific Northwest National Laboratory. The team went through the time-consuming work of organizing the data, a step critical to accelerating research with AI tools. By December 2024, Kjiersten’s team had built on KBase’s work to pilot a common library infrastructure, dubbed the “BER Data Lakehouse.” By the following June, the team had created a common search entry point for researchers, and that August it extended this entry point to AI assistants via the Model Context Protocol.
As a next step, an agentic AI interface layer dubbed BERIL is being developed to support an AI-assisted reference desk, an effort led by KBase’s Paramvir Dehal, as well as AI-assisted research analysis. In FY26 and FY27, the JGI will also expand to additional data types funded in the recent BER AI call (such as beamline images and data from the Anaerobic Molecular Phenotyping Platform at EMSL).
Said Kjiersten, “This project will help accelerate research through AI agent workflows and powerful semantic searches with large language models. And it can help position JGI as a critical resource for biological data at the DOE network.”