Background: The debate over human genome functionality hinges on two main points of view. The first, causal effect, holds that biochemical activities such as transcription or molecular interactions are sufficient to label a genomic region as “functional”. This view has been criticised because the underlying datasets may contain experimental and biological noise, and because it assigns function to the decaying remnants of transposable elements, which drive the 5-fold difference in genome sizes across mammals (bent-winged bat to vizcacha rat). The alternative view, selected effect, requires additional evidence that a biochemical activity has an effect on fitness. Such evidence can be evolutionary conservation, or population-level evidence suggesting that selection is maintaining an advantageous function. Here we consider combining multiple lines of evidence, including conservation, transcription levels, sequence signatures, population variation and epigenetic features.
Results: In this study, we assess the patterns of, and relationships between, genomic features linked to functionality, including sequence conservation, transcription levels, intrinsic sequence features, epigenetic signatures, genomic repeat associations, protein-coding and RNA sequence scores, and population variation. We compared these features across three main gene categories (protein-coding, short non-coding RNA (ncRNA), and long non-coding RNA (lncRNA)) against non-genic control regions. To identify the features most informative of functionality, we used Kolmogorov-Smirnov statistics, PCA plots and violin plots. Our results revealed surprising findings, such as an excess of SNPs in conserved ncRNA genes, signatures of conserved RNA secondary structures in protein-coding sequences, and a correlation between minimum free energy and SNPs in protein-coding genes. Additionally, the statistical models identified sequence conservation and transcription levels as the most important features associated with functionality.
Conclusions: These results demonstrate the importance of combining several statistical features to identify functional sequences, and suggest that it should be possible to predict the likely function of genomic regions.
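As an illustration of the feature-ranking step mentioned in the Results, the following is a minimal sketch of how a two-sample Kolmogorov-Smirnov statistic could be used to rank features by how strongly they separate gene sequences from non-genic controls. It is not the pipeline used in this study: the function names (`ks_statistic`, `rank_features`), the feature names and all numeric values are hypothetical and chosen only to make the example self-contained.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: the largest absolute
    difference between the empirical CDFs of the two samples."""
    a = sorted(sample_a)
    b = sorted(sample_b)
    if not a or not b:
        raise ValueError("both samples must be non-empty")
    d = 0.0
    # The empirical CDFs only change at observed values, so it is
    # enough to evaluate the difference at every observed point.
    for x in set(a) | set(b):
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        d = max(d, abs(cdf_a - cdf_b))
    return d


def rank_features(functional, control):
    """Rank features by KS statistic, most separating first.

    Both arguments map a feature name to a list of per-sequence values.
    """
    scores = {
        name: ks_statistic(functional[name], control[name])
        for name in functional
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical toy values, not data from the study.
functional = {
    "conservation": [0.80, 0.90, 0.70, 0.95],
    "gc_content": [0.40, 0.50, 0.45, 0.55],
}
control = {
    "conservation": [0.10, 0.20, 0.30, 0.15],
    "gc_content": [0.42, 0.48, 0.50, 0.52],
}

ranked = rank_features(functional, control)
```

In this toy setting `ranked` lists `conservation` first, because its two distributions do not overlap at all, while `gc_content` scores much lower. A statistic near 1 indicates a feature whose distribution differs strongly between the two groups; a statistic near 0 indicates little separation.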