August 14, 2009
By Vivien Marx
With large-scale second-generation sequencing-based projects taking hold, labs need to budget and plan for data management throughout a project’s timeline, an NIH official told BioInform this week.
Many researchers “totally underestimate issues around data management,” Joni Rutter, associate director of Human Population and Applied Genetics at the National Institute on Drug Abuse’s Division of Basic Neuroscience and Behavioral Research, told BioInform.
Rutter is the project officer for the Epigenomics Data Analysis and Coordination Center and spoke about general issues connected to data management, not issues particular to EDACC.
Scientists writing grants think first and foremost about their science, and when fitting a project to a particular NIH budget, data-management resources “tend to be the first things to go” when cutting costs, she said.
With centralized data repositories such as EDACC, NIH is helping to be “more efficient” in managing scientific data, since repositories assist in fostering standards and formats for data deposit, she said.
At this year’s Intelligent Systems for Molecular Biology conference, Owen White of the University of Maryland School of Medicine, who directs the Human Microbiome Project’s Data Analysis and Coordination Center, shed some light on how such a center operates.
White noted that the DACC is not responsible for primary data submission, but helps each of the HMP centers handle -omics and clinical data in a “rapid response” fashion.
The data center’s “position,” White said, is to have the sequencing centers deposit data and metadata, to “not interfere, and [to] let the pipeline stream along.” Over time, “we will be structuring it” and doing “lots and lots of clean-up of the data,” he said.
Metadata is being collected, he said, on sample prep methods, for example, since that will “impact how the data will cluster” in downstream analysis.
He and his group are also “encouraging, or some might call it policing,” which means monitoring data quality, ensuring that data is documented, and helping centers resolve data-management issues.
“We’re constantly evaluating tools used by the centers and developing standards,” White said, adding that his group is developing a software repository and pipelines and making them available to scientists.
Change the World?
Data management is not only an issue for large, multi-center projects, however. Small labs, in particular, are quickly finding that data management is a core requirement for high-throughput experimental platforms.
While ads for second-generation sequencers tell scientists that the instruments are “going to change the world,” that change won’t happen unless researchers in traditional biochemistry or molecular biology labs can use the data, “and they can’t,” Anton Nekrutenko told BioInform during ISMB.
Nekrutenko is associate professor of biochemistry and molecular biology at Penn State University and co-PI for Galaxy, an open source bioinformatics data integration and analysis platform.
Data-intensive bioinformatics tasks that were once relatively rare are now “permeating every aspect of biology,” said James Taylor, a computational biologist at Emory University and Galaxy co-PI.
That development calls for “effective” methods of managing data, as well as introducing “more control, reproducibility, [and] transparency to data analysis,” he said.
Nekrutenko and Taylor organized the data and analysis management special interest group, or DAM-SIG, meeting at ISMB/ECCB. One focus of the sessions was the need to standardize metadata. “I think that’s a very important aspect” of data management, Taylor said, one that would allow for “better, open interchange formats for understanding and querying experimental metadata across experiments.”
Metadata management plays a “very important” role in microbial and metagenomic projects and is still often an unsolved challenge, Nekrutenko said.
Some tools are coming online to help researchers handle metadata, though. At ISMB, Philippe Rocca-Serra, a researcher at the European Bioinformatics Institute, outlined his group’s “standards-supportive infrastructure” for annotating metadata in the ISA-Tab format with ontologies and according to standardized reporting guidelines, so scientists can manage multi-domain experimental metadata.
A beta version of the software suite, called ISA Tools, was released in late July.
The suite includes ISAcreator for annotating and editing metadata and an app called ISAconverter for converting ISA-Tab files to formats suitable for submission in public repositories.
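Part of ISA-Tab’s appeal is that it is, at heart, a set of plain tab-delimited tables, which makes the metadata easy to query with generic tools. As a rough illustration only (the column names follow the general ISA-Tab pattern, but the sample values here are invented), a study table can be read with nothing more than Python’s standard csv module:

```python
import csv
import io

# A minimal, hypothetical ISA-Tab-style study table: tab-delimited,
# one row per sample, with bracketed characteristic columns that can
# carry ontology-annotated values. The content below is invented.
study_tab = (
    "Source Name\tCharacteristics[organism]\tProtocol REF\tSample Name\n"
    "donor1\tHomo sapiens\tsample collection\tsample1\n"
    "donor2\tHomo sapiens\tsample collection\tsample2\n"
)

# Parse the table into a list of dicts keyed by column header,
# so metadata fields can be queried by name across samples.
reader = csv.DictReader(io.StringIO(study_tab), delimiter="\t")
samples = list(reader)

for row in samples:
    print(row["Sample Name"], row["Characteristics[organism]"])
```

Because every row is just a keyed record, the same approach scales to filtering samples by any annotated characteristic across experiments.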
Annotation is Key
Other observers view accurate and thorough annotation as the key to effective long-term data management.
The research value of -omics data “correlates directly with how precise, exhaustive, and consistent” its annotation is, Tom Beatty told BioInform via e-mail this week.
Beatty, a principal researcher at business technology consulting firm CSC who specializes in life science and healthcare consulting within the firm’s Emerging Practices Group, said that a community-wide, wiki-style approach to annotation could be useful.
Likewise, Sanjeev Wadhwa, director of CSC’s Life Sciences R&D Practice, told BioInform via e-mail that semantically enriched wiki content, which can be processed and interpreted by wiki-based ontology infrastructures, “will provide intuitive means to collaboratively create, organize, and retrieve knowledge.”