Poster Presentation GENEMAPPERS 2024

Statistics of functional genomic features (#51)

Karla E Rojas Lopez 1, Daniela Schiavinato 1, Helena B Cooper 1, Paul P Gardner 1, Mik Black 1
  1. Department of Biochemistry, School of Biomedical Sciences, University of Otago, Dunedin, Otago, New Zealand

Background: The debate over human genome functionality hinges on two main points of view. The first, causal role, holds that biochemical activities such as transcription or molecular interactions are sufficient to label a genomic region as “functional”. This view has been criticised because such datasets may contain experimental and biological noise, and because it assigns function to the decaying remnants of transposable elements, which drive the 5-fold difference in genome sizes across mammals (bent-winged bat to vizcacha rat). The alternative view, selected effect, requires additional evidence that a biochemical activity has an effect on fitness. Such evidence can be evolutionary conservation, or population-level evidence suggesting that evolution is maintaining an advantageous function. We consider a combination of multiple lines of evidence, including conservation, transcription levels, sequence signatures, population variation and epigenetic features.

Results: In this study, we assess the patterns and relationships between various genomic features linked to functionality, including sequence conservation, transcription levels, intrinsic sequence features, epigenetic signatures, genomic repeat associations, protein-coding and RNA sequence scores, and population variation. We compared these features across three main gene categories, namely protein-coding, short non-coding RNA (ncRNA), and long non-coding RNA (lncRNA), against non-genic control regions. To identify the features most informative about functionality, we used Kolmogorov-Smirnov statistics, PCA plots and violin plots. Our results revealed surprising findings, such as an excess of SNPs in conserved ncRNA genes, signatures of conserved RNA secondary structures in protein-coding sequences, and a correlation between minimum free energy and SNPs in protein-coding genes. Additionally, the statistical models identified sequence conservation and transcription levels as the most important features associated with functionality.
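As an illustration of how a Kolmogorov-Smirnov statistic can rank features by how well they separate genes from non-genic controls, the following is a minimal sketch and not the pipeline used in this study. The feature names, the helper functions `ks_statistic` and `rank_features`, and all values are hypothetical, and the data are synthetic draws from normal distributions.

```python
from bisect import bisect_right
import random


def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: the largest absolute gap
    between the empirical CDFs of the two samples."""
    a, b = sorted(sample_a), sorted(sample_b)
    if not a or not b:
        raise ValueError("both samples must be non-empty")
    d = 0.0
    # The gap between two step functions can only peak at an observed value.
    for x in set(a) | set(b):
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        d = max(d, abs(cdf_a - cdf_b))
    return d


def rank_features(functional, control):
    """Return (feature, D) pairs sorted from most to least discriminating."""
    scores = {
        name: ks_statistic(functional[name], control[name])
        for name in functional
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Synthetic values for illustration only (assumption: not data from the study).
rng = random.Random(0)
functional = {
    "conservation": [rng.gauss(1.5, 1.0) for _ in range(500)],
    "gc_content": [rng.gauss(0.1, 1.0) for _ in range(500)],
}
control = {
    "conservation": [rng.gauss(0.0, 1.0) for _ in range(500)],
    "gc_content": [rng.gauss(0.0, 1.0) for _ in range(500)],
}

ranking = rank_features(functional, control)
for name, d in ranking:
    print(f"{name}: D = {d:.3f}")
```

A statistic near 0 means the feature is distributed almost identically in genes and controls, while a value near 1 means the two distributions barely overlap. In this toy setup the strongly shifted feature ranks first.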

Conclusions: These results demonstrate the importance of combining several statistical features to identify functional sequences, and suggest that it should be possible to predict the likely function of genomic regions.
