Protein Evolution and Other Musings: Bacterial contamination in eukaryotic genomes

A few times recently, I've been BLASTing and retrieving sequences from NCBI RefSeq, making phylogenetic trees, and then being shocked to find the odd sequence from Xenopus (frog), Ixodes (tick) or Nematostella (sea anemone) nested deeply within the bacterial part of the tree, with branch lengths comparable to those of the bacterial branches. Initially, when I got these sequences, I was very excited. I probably exclaimed something like "Bloody hell! Ticks have another mitochondrial elongation factor EF-Tu and it looks like very recent horizontal gene transfer from beta-proteobacteria!", because that's what it looks like: the nesting suggests horizontal gene transfer (HGT), and the lack of this version in close tick relatives suggests it's recent. Of course such a find would be astounding: recent HGT from bacteria into an animal, wow.

But then I checked Entrez Gene and BLASTed several upstream and downstream genes. They ALL hit bacterial sequences before eukaryotic ones. The whole scaffold was bacterial. I've found the same in the Xenopus and Nematostella cases. So it looks like contamination from bacteria rather than HGT: some bacteria hanging out with the eukaryote of interest inadvertently got parts of their genome sequenced, and these sequences were included in the genome release under the name of the eukaryote. And this ends up in NCBI RefSeq:

"The Reference Sequence (RefSeq) collection aims to provide a comprehensive, integrated, non-redundant, well-annotated set of sequences, including genomic DNA, transcripts, and proteins. RefSeq is a foundation for medical, functional, and diversity studies; they provide a stable reference for genome annotation, gene identification and characterization, mutation and polymorphism analysis (especially RefSeqGene records), expression studies, and comparative analyses."

So RefSeq should be reliable. At the very least, the sequence you're looking at should come from the organism named in the record. This sort of contaminant is not too much of a problem for me, because I don't do high-throughput genome comparisons, and I check my trees and follow the trail of funny-looking sequences. However, some people do run high-throughput genome comparisons, and unless they can check the reliability of each sequence, or have other ways of filtering out dubious ones, they may be falling foul of such contaminants. I'm not sure just how common these cases are, but I've found around five examples in just a couple of proteins in the last few months.

I'm not an expert in sequencing and assembling genomes, so I'm not sure if this is an easy thing to avoid or not. But it seems like it wouldn't be too hard to scan the genome for cases where the whole scaffold is bacteria-like, and remove those sequences until they can be checked.
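
As a rough illustration of the kind of scan described above, here is a minimal Python sketch. It assumes you have already BLASTed every gene on each scaffold and recorded the superkingdom of its top hit; the function name, the input format, the thresholds and the example data are all hypothetical, not part of any NCBI tool or pipeline. It only flags scaffolds for a manual check and removes nothing.

```python
from collections import defaultdict


def flag_bacterial_scaffolds(gene_hits, min_genes=3, min_fraction=1.0):
    """Flag scaffolds whose genes mostly (by default, all) hit bacteria first.

    gene_hits: iterable of (scaffold_id, gene_id, top_hit_superkingdom).
    This input format is an assumption for illustration only.
    Returns {scaffold_id: fraction of genes with a bacterial top hit}.
    """
    # Group the top-hit labels by scaffold.
    per_scaffold = defaultdict(list)
    for scaffold_id, _gene_id, superkingdom in gene_hits:
        per_scaffold[scaffold_id].append(superkingdom)

    flagged = {}
    for scaffold_id, labels in per_scaffold.items():
        # Skip scaffolds with too few genes to judge either way.
        if len(labels) < min_genes:
            continue
        fraction = sum(label == "Bacteria" for label in labels) / len(labels)
        if fraction >= min_fraction:
            flagged[scaffold_id] = fraction
    return flagged


# Made-up example data: (scaffold, gene, superkingdom of the top BLAST hit).
example_hits = [
    ("scaffold_A", "geneA1", "Eukaryota"),
    ("scaffold_A", "geneA2", "Eukaryota"),
    ("scaffold_A", "geneA3", "Bacteria"),
    ("scaffold_B", "geneB1", "Bacteria"),
    ("scaffold_B", "geneB2", "Bacteria"),
    ("scaffold_B", "geneB3", "Bacteria"),
    ("scaffold_B", "geneB4", "Bacteria"),
    ("scaffold_C", "geneC1", "Bacteria"),
]

flagged = flag_bacterial_scaffolds(example_hits)
print(flagged)
```

With the default thresholds only scaffold_B is flagged: scaffold_A has a single bacteria-like gene among eukaryotic neighbours (which is what a genuine HGT candidate might look like), and scaffold_C has too few genes to say anything. Requiring every gene on the scaffold to look bacterial is deliberately conservative; a real screen would presumably need a looser cut-off and better evidence than the top hit alone.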

Don't get me wrong, RefSeq and all the NCBI databases are amazing resources, and I use them daily. But it would be really great if the parties submitting genome sequences could do a bit more quality control, to make the resource as reliable and stable a reference as it sets out to be.

-----

Edit: here's an example from the Xenopus tropicalis genome. I'd love to get some comments on this... would removing this kind of contamination be easy, and should it be done? Or is it up to the people who use the sequence for research to check?
