Poster Presentation GENEMAPPERS 2024

Predicting functional genomic features (#53)

Daniela Schiavinato 1, Karla E Rojas Lopez 1, Helena Cooper 1, Mik Black 1, Paul Gardner 1
  1. Biochemistry, University of Otago, Dunedin, Otago, New Zealand

Background: Advances in sequencing technologies are enabling the identification of numerous potential proteins, non-coding RNAs, and genomic activities. However, there are conflicting conclusions about which regions of the genome are functional, as evidence of activity alone can be noisy and misleading. Stronger evidence for function comes from demonstrating evolutionary selection. Quantifying conservation is challenging in non-coding elements such as promoters, enhancers, and lncRNAs, where interactions or structures, rather than the primary sequence, are what is conserved. Thus, approaches that integrate multiple functional features may discriminate more accurately between genes and background sequences.

Results: We investigate the association between gene functionality and genomic features by comparing functional protein-coding and non-coding (short and long non-coding) genes to regions of the genome expected to be largely non-functional. We evaluate the relative importance of six groups of genomic features, selected based on their potential to predict gene functionality: intrinsic sequence features, sequence conservation, transcription, genomic repeat association, protein-coding or RNA-specific features, epigenetic signatures, and population variation. We rank features predictive of gene functionality using a generalized Wilcoxon test and random forest classification models. The two approaches are largely concordant: both rank inter-species evolutionary conservation and transcription highly, while population variation data rank low. Compared to short ncRNA and protein-coding sequences, lncRNAs show less well-defined functional signals, as evidenced by weak correlations between features and functionality and by relatively poor-performing classification models.
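The rank-based comparison described above can be illustrated with a minimal sketch. It is not the authors' pipeline: it uses a plain two-sample Wilcoxon rank-sum (Mann-Whitney) statistic rather than the generalized Wilcoxon test or the random forest models, and the feature names and values below are made up for illustration. Each feature is scored by how well it separates functional from background sequences, and features are ordered by how far that score lies from 0.5 (no separation).

```python
def rank_sum_auc(functional, background):
    """Mann-Whitney U for `functional`, scaled to [0, 1].

    1.0 means every functional value exceeds every background value,
    0.5 means no separation, 0.0 means the reverse ordering.
    """
    pooled = sorted(
        [(value, 1) for value in functional]
        + [(value, 0) for value in background],
        key=lambda pair: pair[0],
    )
    rank_sum = 0.0
    i = 0
    while i < len(pooled):
        # Find the block of tied values starting at position i.
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        # Tied values share the average of ranks i+1 .. j.
        average_rank = (i + 1 + j) / 2.0
        rank_sum += average_rank * sum(label for _, label in pooled[i:j])
        i = j
    n_func, n_back = len(functional), len(background)
    u_statistic = rank_sum - n_func * (n_func + 1) / 2.0
    return u_statistic / (n_func * n_back)


def rank_features(table):
    """Order features by separation strength, strongest first."""
    scored = [
        (name, rank_sum_auc(func, back))
        for name, (func, back) in table.items()
    ]
    return sorted(scored, key=lambda item: abs(item[1] - 0.5), reverse=True)


# Hypothetical toy values: (functional sequences, background sequences).
toy_features = {
    "conservation": ([0.9, 0.8, 0.7, 0.6], [0.1, 0.2, 0.3, 0.4]),
    "expression": ([5, 3, 4, 1], [2, 0, 1, 3]),
    "snp_density": ([1, 2, 3, 4], [1, 2, 3, 4]),
}

for name, score in rank_features(toy_features):
    print(f"{name}\t{score:.3f}")
```

Ordering by distance from 0.5 treats features that are depleted in functional sequences as being as informative as those that are enriched. A significance test and multiple-testing correction would be needed before drawing conclusions from real data.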

Conclusions: These results demonstrate the importance of evolutionary conservation and transcription in determining sequence functionality, and both should be considered when differentiating between functional sequences and noise. The less distinct functional signals observed in lncRNAs suggest that some non-genic sequences may be misannotated as lncRNAs, indicating that current thresholds might not adequately account for experimental and biological noise. To investigate further, we aim to apply our models to generate a ranked list of high-confidence lncRNAs that are most distinguishable from background sequences.