Featured Presentation GENEMAPPERS 2024

Flawed machine learning and protein coding sequence annotation  (#33)

DJ Champion 1 , Ting-Hsuan Chen 2 , Susan Thomson 2 , Mik A Black 1 , Paul P Gardner 1
  1. University of Otago, Dunedin, TE WAIPOUNAMU, New Zealand
  2. The New Zealand Institute for Plant and Food Research Limited, Lincoln, Canterbury, New Zealand

Background: Detecting protein-coding genes in genomic sequences is a significant challenge for understanding genome function, yet the reliability of bioinformatic tools for this task remains uncertain, even though some of these tools have been available for several decades and are widely used for genome and transcriptome annotation.

Results: We assess de novo protein-coding detection tools that operate on either single nucleotide sequences or alignments. Our controls exclude any previous training dataset, and comprise coding exons as a positive set and length-matched intergenic and shuffled sequences as negative sets.
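The two kinds of negative control described above can be illustrated with a minimal sketch. It is not the authors' pipeline: the function names, the toy sequences, and the choice of a simple mononucleotide shuffle are all assumptions made for illustration (the abstract does not say which shuffling procedure was used).

```python
import random
from collections import Counter


def shuffled_control(seq, seed=0):
    """Return a mononucleotide-shuffled copy of seq.

    Assumption: a plain base-by-base shuffle, which preserves length and
    base composition but not dinucleotide frequencies.
    """
    rng = random.Random(seed)
    bases = list(seq)
    rng.shuffle(bases)
    return "".join(bases)


def length_matched_window(intergenic, length, seed=0):
    """Return a random window of the given length from an intergenic region."""
    if length > len(intergenic):
        raise ValueError("intergenic region is shorter than the requested length")
    rng = random.Random(seed)
    start = rng.randrange(len(intergenic) - length + 1)
    return intergenic[start:start + length]


# Toy sequences, invented for illustration only.
exon = "ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG"
intergenic = "TTTTATATACGCGTTTAAACCCGGGTTTATATATGCGCGCATATTTTAAACCGGTTAACC"

neg_shuffled = shuffled_control(exon, seed=1)
neg_intergenic = length_matched_window(intergenic, len(exon), seed=1)
```

Matching the length of each negative sequence to its positive counterpart keeps a tool from separating the sets on sequence length alone.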

Our work demonstrates that several widely used tools (e.g. CPC2, RNAsamba, bioseq2seq) are neither accurate nor computationally efficient for the protein-coding sequence detection problem. One tool (LGC) performs little better than a random number generator in our tests. Just three of nine tools (RNAcode, PhyloCSF & tcode) significantly outperformed a naive scoring scheme. Furthermore, we note a large discrepancy between self-reported accuracies and the accuracies achieved in our study. Our results show that the extra dimension provided by conserved and variable nucleotides in alignments gives alignment-based tools a significant advantage over single-sequence approaches.

Conclusions: These results highlight significant limitations in existing protein-coding annotation tools that are widely used for lncRNA annotation. They show a need for more robust and efficient approaches to training tools for identifying protein-coding sequences and to assessing their performance. Our study paves the way for future advancements in comparative genomic approaches and, we hope, will popularise more robust approaches to genome and transcriptome annotation.

  1. Champion DJ, Chen TH, Thomson S, Black MA, Gardner PP (2024) Flawed machine-learning confounds coding sequence annotation. bioRxiv. https://doi.org/10.1101/2024.05.16.594598