"The Reference Sequence (RefSeq) collection aims to provide a comprehensive, integrated, non-redundant, well-annotated set of sequences, including genomic DNA, transcripts, and proteins. RefSeq is a foundation for medical, functional, and diversity studies; they provide a stable reference for genome annotation, gene identification and characterization, mutation and polymorphism analysis (especially RefSeqGene records), expression studies, and comparative analyses."So RefSeq should be reliable. At least the sequence you're looking at should be from the organism it says so in the record. These sort of contaminants are not too much of a problem for me, because I don't do high-throughput genome comparisons and I check my trees and follow the trail of funny-looking sequences. However some people do do high throughput genome comparisons, and unless they are able to check the reliability of each sequence, or have other methods of filtering out possibly dubious sequences, they may be falling foul of such contaminants. I'm not sure just how common these cases are, but I've found around 5 examples in just a couple of proteins in the last few months.
I'm not an expert in sequencing and assembling genomes, so I'm not sure if this is an easy thing to avoid or not. But it seems like it wouldn't be too hard to scan the genome for cases where the whole scaffold is bacteria-like, and remove those sequences until they can be checked.
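Here's a rough sketch of what such a scan might look like, assuming you already have a best-hit kingdom for each predicted gene (say, from a similarity search against a broad database). The function name, the input format and the thresholds are all made up for illustration; this is not how NCBI or any genome project actually screens assemblies.

```python
from collections import defaultdict


def flag_suspect_scaffolds(gene_hits, min_genes=1, min_fraction=1.0):
    """Return scaffolds whose genes look (almost) entirely bacterial.

    gene_hits: iterable of (scaffold_id, gene_id, best_hit_kingdom) tuples.
               This input format is a hypothetical one chosen for the sketch.
    min_genes: ignore scaffolds with fewer annotated genes than this.
    min_fraction: fraction of genes that must have a bacterial best hit.
    """
    kingdoms_by_scaffold = defaultdict(list)
    for scaffold_id, _gene_id, kingdom in gene_hits:
        kingdoms_by_scaffold[scaffold_id].append(kingdom)

    suspects = []
    for scaffold_id, kingdoms in kingdoms_by_scaffold.items():
        if len(kingdoms) < min_genes:
            continue
        bacterial = sum(1 for k in kingdoms if k == "Bacteria")
        if bacterial / len(kingdoms) >= min_fraction:
            suspects.append(scaffold_id)
    return sorted(suspects)


# Toy data, invented purely for illustration.
hits = [
    ("scaffold_1", "geneA", "Eukaryota"),
    ("scaffold_1", "geneB", "Eukaryota"),
    ("scaffold_1", "geneC", "Bacteria"),   # one odd gene on a normal scaffold
    ("scaffold_2", "geneD", "Bacteria"),
    ("scaffold_2", "geneE", "Bacteria"),   # every gene here looks bacterial
]
suspect_scaffolds = flag_suspect_scaffolds(hits)
print(suspect_scaffolds)
```

Only scaffolds that are bacteria-like across the board get flagged, so a single odd gene sitting among ordinary eukaryotic genes is left alone for a human to look at. Flagged scaffolds would be set aside for checking, not thrown away outright.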
Don't get me wrong, RefSeq and all the NCBI databases are amazing resources, and I use them daily. But it would be really great if the parties submitting genome sequences could do a bit more quality control, so that the resource is as reliable and stable a reference as it sets out to be.
-----
Edit: here's an example from the Xenopus tropicalis genome. I'd love to get some comments on this... would removing this kind of contamination be easy, and should it be done? Or is it up to the people who use the sequence for research to check?