Defining Quality Virus Data(sets)

Nature Biotechnology published a standards paper with guidelines and best practices for defining virus genome data and characterizing uncultivated viruses. The article is the result of a community effort by an international consortium of virus experts, including Dr. Bas E. Dutilh (UBC), the Joint Genome Institute (JGI) in California, the Genomic Standards Consortium (GSC), the International Committee on Taxonomy of Viruses (ICTV), the National Center for Biotechnology Information (NCBI), and the European Bioinformatics Institute (EBI).

Innumerable virus genomes

Microbes in, on and around the planet far outnumber the stars in the Universe. The total number of viruses is expected to exceed even that number. While many viruses remain unknown and uncultivated, advances in metagenome sequencing and analysis have allowed researchers to identify more than 750,000 virus genomes from environments ranging from different human body sites to the global oceans. These discoveries tripled the known viral diversity in a single year. Understanding the roles of these viruses in their environment requires the work of a community of virus experts, and data quality standards are critical to allow these researchers to do their work.

Cultured viruses already have their own data quality standards, but these cannot be directly applied to uncultured viruses, whose sequences are often incomplete and for which some properties can only be predicted indirectly using computational approaches.

“Given how easily we can discover viruses in almost any sample, we desperately need standards that help researchers report their findings in a way that is useful to the community, and reusable by bioinformaticians like myself,” explains Dutilh.

Categories of Virus Genome Quality

The paper outlines the minimum amount of information for an uncultivated virus genome, including the source, methods of identification of the virus genome, and data quality. The team proposed three categories of genome quality.

  • ‘Genome fragments’ consist of single or multiple fragments that are predicted to be less than 90 percent complete, or have no estimated genome size, and are minimally annotated.
  • A ‘high-quality draft genome’ is estimated to represent 90 percent or more of the complete expected genome sequence, in fragment(s) where any gaps span mostly repetitive regions.
  • Finally, a ‘finished genome’ would include both a complete genome, consisting of a single contiguous sequence without gaps, and extensive annotation.
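
The three categories above can be read as a simple decision rule. The Python sketch below is only an illustration of that rule as summarized in this article, not code from the paper: the record fields, the function name, and the choice to treat an estimated completeness of 100 percent as "complete" are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class UViGRecord:
    """Hypothetical summary of one uncultivated virus genome (field names are assumptions)."""
    estimated_completeness: Optional[float]  # percent; None if no genome size could be estimated
    num_contigs: int                         # number of sequence fragments in the assembly
    has_gaps: bool                           # True if the sequence contains gaps
    extensively_annotated: bool              # True if annotation goes beyond the minimum


def classify_genome_quality(rec: UViGRecord) -> str:
    """Assign one of the three quality categories described in the article."""
    c = rec.estimated_completeness

    # No size estimate, or predicted to be less than 90 percent complete.
    if c is None or c < 90.0:
        return "genome fragment(s)"

    # Complete, single contiguous sequence, no gaps, extensive annotation.
    # Assumption: 100 percent estimated completeness stands in for "complete".
    if c >= 100.0 and rec.num_contigs == 1 and not rec.has_gaps and rec.extensively_annotated:
        return "finished genome"

    # Everything else at 90 percent or more is treated as a draft.
    return "high-quality draft genome"


if __name__ == "__main__":
    examples = [
        UViGRecord(None, 3, True, False),
        UViGRecord(95.0, 2, True, False),
        UViGRecord(100.0, 1, False, True),
    ]
    for rec in examples:
        print(rec, "->", classify_genome_quality(rec))
```

The sketch does not check whether gaps in a draft genome "span mostly repetitive regions", since that would require sequence-level information that a summary record like this does not carry.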


Read the paper, entitled “Minimum Information about an Uncultivated Virus Genome (MIUViG)”.

Image by Leah Pantéa