Inanc Birol

Professor


Graduate Student Supervision

Doctoral Student Supervision

Dissertations completed in 2010 or later are listed below. Please note that there is a 6-12 month delay to add the latest dissertations.

Machine learning for antimicrobial peptide discovery and design (2024)

The full abstract for this thesis is available in the body of the thesis, and will be available when the embargo expires.

View record

Transcriptome assembly and visualization for RNA-sequencing data (2023)

Since its introduction, RNA-sequencing has allowed us to interrogate the transcriptome of an organism, thereby advancing our understanding of cell biology and diseases. Typically, raw RNA-sequencing data is processed via computational methods, such as transcriptome assembly and visualization, to extract meaningful information. Transcriptome assembly aims to reconstruct full-length transcript sequences from RNA-sequencing reads, which are usually short fragments of the corresponding transcripts. Transcriptome visualization provides a platform for exploring and recognizing patterns in transcriptomic data. Transcriptome assembly and visualization tools have been instrumental in the identification of gene structures, annotation of draft genomes, and discovery of molecular markers in diseases.

Single-cell RNA-sequencing has enabled us to investigate transcriptome heterogeneity within a tissue sample containing up to a million cells. However, single-cell transcriptome analyses have been predominantly performed at the gene level instead of at the isoform level. In my thesis, I present computational solutions for transcriptome assembly and visualization of single-cell RNA-sequencing data, thus enabling isoform-level analysis of single-cell transcriptomes.

Long-read RNA-sequencing technologies have gained traction in transcriptomic research in recent years as their throughput and data quality have improved tremendously. Long-read sequencing is particularly useful in transcriptome assembly because its reads can potentially span multiple exons, which simplifies the transcriptome assembly problem. Reference-free assembly for long-read data is a computationally expensive task due to the long read lengths and high base error rates. In my thesis, I present a fast and memory-efficient reference-free assembly method for long-read RNA-sequencing data.

View record

Annotation of complex genomes for comparative genomics (2022)

Advances in whole-genome sequencing technologies have opened the use of genomic approaches to a variety of organisms and allowed whole-genome-scale studies in non-model organisms. In these studies, genome annotation is a fundamental step to extract diverse biological information from sequences that are otherwise strings of characters incomprehensible to humans.

Here I assembled and annotated genomes of plant and insect species of applied interest. A common theme in my thesis is comparative and evolutionary genomics of the described organisms. The sequenced species I studied have complex genomic features, including large genome sizes and high repeat contents, which I describe in detail. In Chapter 2, I investigate the protein-coding genes of four spruces (Picea, Pinaceae) native to North America. Comparison to other annotated conifers highlights changes in selection in gene families. Several gene families have a significantly expanded number of genes. Some genes are under positive selection: previous studies in spruce highlighted the same proteins as genetic markers for local adaptation. In Chapter 3, I characterize the genome of Pissodes strobi, a naturally occurring pest of the spruces described in Chapter 2. The genome of P. strobi is larger and more repetitive than those of other sequenced species in the same family (Curculionidae). In Chapter 4, I assemble and annotate the genome of a proprietary Cannabis sativa strain, and study the flavonoid/anthocyanin metabolic pathway, uncovering the upregulation of key metabolic genes involved in the regulation of leaf pigmentation.

The presented genome annotations and comparative analyses provide insights into the biology and evolution of the described species. Comparative genome studies are important for generating hypotheses and opening avenues of inquiry for future studies in population genomics. In the case of the Picea genus and P. strobi, such studies will enable us to understand the local adaptation of species and the genetic basis of regulatory processes, such as biotic stress mitigation and pest resistance.

View record

Computational modelling, simulation, and prediction of biological sequences (2022)

Current advances in sequencing technology have led to an exponential growth of omics data. To leverage torrents of large and complex sequencing data, researchers need effective analytical methods to mine the underlying patterns, infer novel insights, and complement the development of related bioinformatics tools. This doctoral thesis combines both descriptive and predictive data analytic strategies, including statistical modelling and machine learning, to analyze nucleotide sequences and make inferences from short- and long-read sequencing data. The tools and pipelines presented are publicly available in the service of the broader research community.

Long-read sequencing, represented by Oxford Nanopore Technologies (ONT), is a rapidly developing technology with unprecedented advantages in providing long-range information that spans inter- and intra-genomic homologous regions and differentiates transcript isoforms. Due to the novelty and uniqueness of this sequencing technique, the features of the resulting reads, which are crucial for developing pertinent bioinformatics algorithms, remain to be comprehended. This doctoral thesis combines multiple statistical modelling methodologies, including Markov models, kernel density estimation, and probability distribution fitting, to build NanoSim, the first ONT read simulator that characterizes and simulates ONT genome, transcriptome, and metagenome data. NanoSim has had, and will continue to have, an enabling role in the field, benefiting the development of scalable algorithms for assembly, alignment, quantification, mutation detection, and metagenomic analysis.

RNA sequencing has become the established sequencing method of choice for a wide range of transcriptome projects. To fully realize its potential for comprehensive analysis at isoform-level resolution, it is desirable to have a transcript completeness annotation pipeline for reconstructed transcripts and to incorporate it into RNA-Seq analysis routines. In this thesis, I explore the application of deep learning to distinguishing 3’ polyadenylation cleavage sites from non-cleavage sites. I have developed a deep neural network with a novel sequence representation and propose the first assessment pipeline, Terminitor, which examines the 3’ terminus completeness of reconstructed transcripts from RNA-Seq data. This utility outperforms state-of-the-art methods in terms of sensitivity and precision and demonstrates the robustness and flexibility of the model architecture in both short- and long-read sequencing data.
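
To illustrate one of the modelling components mentioned above, the sketch below fits a kernel density estimate to a set of read lengths and samples simulated lengths from it. It is a minimal, hypothetical example using synthetic numbers, not NanoSim's actual length or error models.

    import numpy as np
    from scipy.stats import gaussian_kde

    rng = np.random.default_rng(42)

    # Stand-in for read lengths characterized from a real ONT run (log-normal-ish shape).
    observed_lengths = rng.lognormal(mean=8.5, sigma=0.6, size=5000)

    # Fit a kernel density estimate to the empirical length distribution...
    kde = gaussian_kde(observed_lengths)

    # ...and draw lengths for simulated reads from it (clipped to positive integers).
    simulated_lengths = np.abs(kde.resample(10)).astype(int).ravel()
    print(simulated_lengths)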

View record

Improving sequence analysis with probabilistic data structures and algorithms (2020)

In the biological sciences, sequence analysis refers to analytical investigations that use nucleic acid or protein sequences to elucidate biological insights, such as their function, species of origin, or evolutionary relationships. However, sequences are not very meaningful by themselves, and useful insights generally come from comparing them to other sequences. Indexing sequences using concepts borrowed from the computational sciences may help perform these comparisons. One such concept is a probabilistic data structure, the Bloom filter, which enables low-memory indexing with high computational efficiency at the cost of false-positive queries by storing a signature of a sequence rather than the sequence itself. This thesis explores high-performance applications of this probabilistic data structure in sequence classification (BioBloom Tools) and targeted sequence assembly (Kollector) and shows how these implemented tools outperform state-of-the-art methods.

To remedy some weaknesses of Bloom filters, such as the inability to index multiple targets, I have developed a novel probabilistic data structure called a multi-index Bloom filter (miBF), used to facilitate alignment-free classification of thousands of references. The data structure also synergizes with spaced seeds. Sequences are often broken up into subsequences when using a hash-based algorithm, and spaced seeds are subsequences with wildcard positions that improve classification sensitivity and specificity. This novel data structure enables faster classification and higher sensitivity than sequence alignment-based methods and executes in an order of magnitude less time while using half the memory compared to other spaced seed-based approaches. This thesis features formulations of classification false-positive rates in relation to the indexed and queried sequences, and benchmarks the data structure on simulated data. In addition to my work on short-read data, I explore and evaluate methods for finding sequence overlaps in error-prone long-read datasets.
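
As a minimal sketch of the Bloom filter idea described above (not the BioBloom Tools implementation), the toy Python below indexes the k-mers of a reference sequence and reports how many of a read's k-mers hit the filter; the sequences, k value, and filter parameters are arbitrary.

    import hashlib

    class BloomFilter:
        """Toy Bloom filter: stores k-mer signatures; false positives possible, no false negatives."""

        def __init__(self, size=1 << 20, num_hashes=4):
            self.size = size
            self.num_hashes = num_hashes
            self.bits = bytearray(size // 8 + 1)

        def _positions(self, kmer):
            for i in range(self.num_hashes):
                digest = hashlib.sha256(f"{i}:{kmer}".encode()).digest()
                yield int.from_bytes(digest[:8], "little") % self.size

        def add(self, kmer):
            for pos in self._positions(kmer):
                self.bits[pos // 8] |= 1 << (pos % 8)

        def __contains__(self, kmer):
            return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(kmer))

    def kmers(seq, k):
        return (seq[i:i + k] for i in range(len(seq) - k + 1))

    # Index a reference sequence, then classify a read by the fraction of its k-mers in the filter.
    k = 25
    bf = BloomFilter()
    for km in kmers("ACGTACGTGGCTTACGATCGATCGATCGGCTAGCTAGCTA", k):
        bf.add(km)

    read = "GGCTTACGATCGATCGATCGGCTAG"
    hits = sum(km in bf for km in kmers(read, k))
    print(f"{hits} of {len(read) - k + 1} read k-mers found in the reference index")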

View record

Efficient assembly of large genomes (2019)

Genome sequence assembly presents a fascinating and frequently changing challenge. As DNA sequencing technologies evolve, the bioinformatics methods used to assemble sequencing data must evolve along with them. Sequencing technology has evolved from slab gel sequencing, to capillary sequencing, to short-read sequencing by synthesis, to long-read and linked-read single-molecule sequencing. Each evolutionary jump in sequencing technology required developing new bioinformatics tools to address the unique characteristics of its sequencing data. This work reports the development of efficient methods to assemble short-read and linked-read sequencing data, named ABySS 2.0 and Tigmint. ABySS 2.0 reduces the memory requirements of short-read genome sequence assembly tenfold compared to ABySS 1.0. It does so by using a Bloom filter probabilistic data structure to represent a de Bruijn graph. Tigmint uses linked reads to identify large-scale errors in a genome sequence assembly. Correcting assembly errors using Tigmint before scaffolding improves both the contiguity and correctness of a human genome assembly compared to scaffolding without correction. I have also applied these methods to assemble the 12-gigabase genome of western redcedar (Thuja plicata), which is four times the size of the human genome.

Although numerous angiosperm mitochondrial genomes are available, few gymnosperm mitochondrial genomes have been sequenced. I assembled the plastid and mitochondrial genomes of white spruce (Picea glauca) using whole-genome short-read sequencing. I assembled the mitochondrial genome of Sitka spruce (Picea sitchensis) using whole-genome long-read sequencing, the largest complete genome assembly of a gymnosperm mitochondrion. The mitochondrial genomes of both species include a remarkable number of trans-spliced genes.

I have developed two additional tools, UniqTag and ORCA. UniqTag assigns unique and stable gene identifiers to genes based on their sequence content. This gene labeling system addresses the inconvenience of gene identifiers changing between versions of a genome assembly. ORCA is a comprehensive bioinformatics computing environment, which includes hundreds of bioinformatics tools in a single easily installed Docker image, and is useful for education and research.

The assembly of linked-read and long-read sequencing of large DNA molecules has yielded substantial improvements in the quality of genome assembly projects.

View record

Parallel algorithms and software tools for high-throughput sequencing data (2017)

With the growing throughput and dropping cost of high-throughput sequencing (HTS) technologies, there is a continued need to develop faster and more cost-effective bioinformatics solutions. However, the algorithms and computational power required to efficiently analyze HTS data have lagged considerably. In health and life sciences research organizations, de novo assembly and sequence alignment have become two key steps in everyday research and analysis. The de novo assembly process is a fundamental step in analyzing previously uncharacterized organisms and is one of the most computationally demanding problems in bioinformatics. Sequence alignment is a fundamental operation in a broad spectrum of genomics projects: in genome resequencing projects, alignments are often used prior to variant calling; in transcriptome resequencing, they provide information on gene expression; they are even used in de novo sequencing projects to help contiguate assembled sequences. As such, designing efficient, scalable, and accurate solutions for the de novo assembly and sequence alignment problems would have a wide effect in the field. In this thesis, I present a collection of novel algorithms and software tools for the analysis of high-throughput sequencing data using efficient data structures. I also utilize the latest advances in parallel and distributed computing to design and develop scalable and cost-effective algorithms on High-Performance Computing (HPC) infrastructures, especially for the de novo assembly and sequence alignment problems. The algorithms and software solutions I develop are freely available for academic use, to facilitate research at health and life sciences laboratories and other organizations worldwide.

View record

Master's Student Supervision

Theses completed in 2010 or later are listed below. Please note that there is a 6-12 month delay to add the latest theses.

Genome misassembly detection using Stash: a data structure based on stochastic tile hashing (2024)

Analyzing large amounts of data produced by high-throughput sequencing technologies presents challenges in terms of memory and computational requirements. Therefore, it is crucial to develop data structures and computational methods that handle this information effectively. These challenges impact bioinformatics studies, including de novo genome assembly, which serves as the foundation of genomics. Issues like errors in reads or limitations due to heuristic decisions in assembly algorithms can lead to genome misassemblies and inaccurate genomic representations, compromising the quality of downstream analyses. Hence, de novo assemblies can benefit from misassembly detection and correction, to maximize the information provided by reads and produce an optimal assembly. Here, we present Stash, a novel hash-based data structure designed for storing and querying large repositories of sequencing data based on a k-mer representation of the sequence dataset. Stash uses a two-dimensional data structure based on hash values generated by sliding windows of spaced seed patterns over sequences to compress data. The key-value pairs stored in Stash are k-mers and sequence ID hashes, respectively. The stored hashed identifiers are then used to check whether two queried k-mers are observed in the same set of sequences. This functionality provides utility for Stash across diverse domains of bioinformatics. For example, Stash can inform whether two genomic regions are covered by the same set of reads by measuring the number of matches between them, which can be used in the detection of misassemblies within a genome assembly of interest. We demonstrate the effectiveness of Stash in detecting misassemblies in human genome assemblies generated by the Flye and Shasta algorithms, using PacBio HiFi reads from the human cell line NA24385. We observe that scaffolding Stash-cut assemblies reduces misassemblies by 7.6% and 3.4% in the Flye and Shasta assemblies, respectively. Stash accomplishes this using eight GB of memory and a total processing time of 117 plus 18 minutes. Remarkably, it can outperform alternative methods for detecting misassemblies in long-read data, all the while preserving contiguity.
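
As a rough illustration of the shared-read-support idea described above, the sketch below maps k-mers to hashed read identifiers and counts how many reads support a pair of k-mers. A plain Python dictionary stands in for Stash's compressed, spaced-seed-based tile structure, and all sequences are made up.

    from collections import defaultdict

    K = 5

    def kmers(seq, k=K):
        return (seq[i:i + k] for i in range(len(seq) - k + 1))

    # Toy index: k-mer -> set of hashed read identifiers. Stash compresses this kind of
    # mapping into a fixed-size two-dimensional table via spaced-seed hashing; a plain
    # dictionary is used here purely for illustration.
    index = defaultdict(set)
    reads = {"read1": "ACGTACGGTTCA", "read2": "CGGTTCAGGATC", "read3": "TTTTTGGGGGAA"}
    for rid, seq in reads.items():
        for km in kmers(seq):
            index[km].add(hash(rid))

    def shared_support(kmer_a, kmer_b):
        """Number of reads observed to contain both k-mers; low shared support between
        adjacent assembly regions can indicate a misassembly."""
        return len(index.get(kmer_a, set()) & index.get(kmer_b, set()))

    print(shared_support("ACGTA", "GGTTC"))  # both occur in read1 -> 1
    print(shared_support("ACGTA", "TTTTT"))  # never co-occur in a read -> 0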

View record

Improving white spruce genome annotation and generation of a chromosome-scale epigenetic map (2024)

White spruce (Picea glauca; Pinaceae) is a conifer native to the northern temperate and boreal forests of North America. It is a resilient tree that tolerates variations in climatic conditions. As a result, it is often used as a model for studying the genetic makeup and adaptability of conifer trees. A previous genome assembly of P. glauca, generated using short- and linked-read sequencing data, had over 2.4 million scaffolds with an NG50 length (a measure of assembly contiguity indicating that at least half of the expected genome size is contained in pieces at least the NG50 length) of 131 kb. Here I produce and report on an improved assembly of P. glauca, built using long nanopore sequencing reads and scaffolded with linked-read sequencing data, which represents one of the most contiguous (NG50 length = 2.3 Mbp) and gene-complete (56.1% complete BUSCO genes in the Embryophyta lineage) genomes of this size (~20 Gb). The new assembly was annotated using BRAKER2, which predicted 68,796 genes with a mean length of 18 kb. Repeat masking demonstrates that approximately 90% of the white spruce genome consists of repeat sequences, the majority of which are long terminal repeats (LTRs). Among other sequenced conifer species, phylogenetic analysis finds the closest relatives of white spruce to be the interior, Engelmann and Sitka spruces. Orthogroup analysis recovered 2,024 genes found only in white spruce and not in the other spruces or pines I analyzed. These genes are enriched in Gene Ontology (GO) terms related to biotic and abiotic stress responses. I used epigenetic information inherent in the long-read sequencing data to conduct a methylome analysis using NanoMethPhase. Using this, I identified a total of 320,946,144 CpG sites and 12,698 quality-filtered allelic differentially methylated regions (DMRs) in the genome. A total of 1,930 of the annotated genes intersect with these allelic DMRs, and this overlapping subset is enriched with GO terms related to plant responses to external damage and pathogen infection. The updated white spruce genome assembly, with its detailed annotation and epigenetic map described here, provides a valuable resource for furthering conifer research.
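
As a small worked example of the NG50 metric defined above, the sketch below computes it for hypothetical scaffold lengths and an assumed expected genome size.

    def ng50(scaffold_lengths, genome_size):
        """NG50: the largest length L such that scaffolds of length >= L together
        cover at least half of the expected genome size."""
        total = 0
        for length in sorted(scaffold_lengths, reverse=True):
            total += length
            if total >= genome_size / 2:
                return length
        return 0  # the assembly covers less than half of the expected genome size

    # Hypothetical scaffold lengths and a 100 kb expected genome size.
    print(ng50([40_000, 25_000, 15_000, 10_000, 5_000], genome_size=100_000))  # -> 25000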

View record

Simulating chromoanagenesis for tool development and testing (2024)

The human genome is large and complex. Variations in the genome of an organism can have drastic health implications, from cancer to constitutional disease. Most variants involve changes of just one or a few nucleotides, but structural variants can cause significant changes to larger sections of the genome. One rare group of structural variants is chromoanagenesis, where a catastrophic rearrangement of a large section of the genome occurs during a single event. Whereas simpler events involve one or a few breakpoints and may result in localized duplications, inversions, or deletions of genetic fragments within a section of the genome, a single chromoanagenesis event can have hundreds of breakpoints, where each broken segment of the chromosome may be unchanged, inverted, duplicated, or deleted in whole or in part before the pieces reassemble in a different order. Chromoanagenesis has most often been described in cancer among other signs of genomic instability, but there have been cases of such events in patients with other diseases as well.

Because of the complexity of chromoanagenesis and the genomic context it is often found in, obtaining accurate sequence-level characterization of cases has been difficult. Developing bioinformatics tools to detect and fully resolve chromoanagenesis in sequence data sets is challenging and expensive, with new technologies providing new avenues of detection but posing different difficulties to resolve. In this thesis, I report Muddler, a simulator I have developed for chromoanagenesis events that can be used with various available software tools to produce data sets that resemble those obtained with different genomic technologies (next-generation sequencing, optical maps, etc.). These simulated datasets can then be evaluated with sequence analysis tools to understand the strengths and limitations of genomic technologies and new software for characterizing chromoanagenesis, and to assist in the development of better tools for this purpose. The method for generating simulated data is presented along with five complex simulated events and their analysis using technology-specific analytical tools and pipelines. To illustrate the utility of Muddler in the use cases provided, each simulated event has at least 150 breakpoints and around 100 combined duplications or deletions.
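
To make the kind of event described above concrete, the toy sketch below shatters a random sequence at many breakpoints and lets each segment be kept, deleted, inverted, or duplicated before reassembly in a shuffled order. It illustrates the event type Muddler simulates, under assumed parameters; it is not Muddler itself.

    import random

    def revcomp(seq):
        return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

    def simulate_shattering(chromosome, n_breakpoints, seed=1):
        """Toy chromoanagenesis: cut the sequence at random breakpoints, then keep,
        delete, invert, or duplicate each segment and reassemble in a shuffled order."""
        rng = random.Random(seed)
        cuts = sorted(rng.sample(range(1, len(chromosome)), n_breakpoints))
        bounds = [0] + cuts + [len(chromosome)]
        segments = [chromosome[a:b] for a, b in zip(bounds, bounds[1:])]

        derived = []
        for seg in segments:
            fate = rng.choice(["keep", "delete", "invert", "duplicate"])
            if fate == "delete":
                continue
            if fate == "invert":
                seg = revcomp(seg)
            derived.append(seg)
            if fate == "duplicate":
                derived.append(seg)
        rng.shuffle(derived)
        return "".join(derived)

    chromosome = "".join(random.Random(0).choice("ACGT") for _ in range(2_000))
    rearranged = simulate_shattering(chromosome, n_breakpoints=150)
    print(len(chromosome), "->", len(rearranged))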

View record

Streamlined high throughput assembly and standardization of reference-grade animal mitochondrial genomes (2024)

Mitochondrial genomes (or mitogenomes) are circular double-stranded DNA (deoxyribonucleic acid) molecules present in the mitochondria of eukaryotic cells, typically containing approximately 16,000 nucleotides. Mitochondrial DNA has several characteristics, including maternal inheritance, a low mutation rate, a higher copy number and higher resistance to degradation, making it a valuable tool for ecological system monitoring, evolutionary studies, and forensic identification. In the past, the importance of the mitochondrial genome was not fully appreciated, and mitochondrial DNA was considered less valuable than nuclear DNA due to its smaller size and limited number of coding regions. As a result, mitogenome records on public data portals are limited compared to nuclear references, particularly for underrepresented species. Nowadays, increased attention is directed toward reconstructing mitogenomes, considering their diverse applications. Scalable and robust mitogenome assembly tools are in high demand due to the large volume of DNA sequencing data produced. In this thesis, I have developed mtGrasp (Mitochondrial Genome Assembly and Standardization Pipeline), a high-throughput in silico tool that generates reference-grade mitogenomes in their final standardized format. Three hundred and twenty-eight DNA read libraries from the iTrackDNA project and 23 Sequence Read Archive (SRA) libraries were used. Reads were assembled into contigs by ABySS, followed by mitochondrial sequence filtering, gap-filling, polishing, circularization and standardization, resulting in 274 reference-grade genomes (from iTrackDNA samples) and 15 complete sequences (from SRA samples) while requiring only moderate run time and memory usage.

View record

Structure-aware deep learning model for peptide toxicity prediction (2024)

Antimicrobial resistance is a critical public health concern, necessitating the exploration of alternative treatments to conventional antibiotics. Antimicrobial peptides (AMPs) have emerged as a promising avenue for such alternatives. However, assessing their toxicity through wet lab methods is time-consuming and costly. Computational tools that accurately predict peptide toxicity may offer a solution by enabling the rapid screening of candidate AMPs. In response to this need, I introduce tAMPer, a multi-modal deep learning model that predicts peptide toxicity by integrating the underlying amino acid sequence composition and the predicted three-dimensional (3D) structure. tAMPer adopts a graph-based representation for peptides, encoding their ColabFold-predicted structures. In these graphs, nodes correspond to amino acids, and edges represent spatial interactions. The model extracts structural features using graph neural networks and employs recurrent neural networks to capture sequential dependencies. tAMPer's performance was assessed on both a publicly available protein toxicity benchmark dataset and an AMP hemolysis dataset we generated. On the latter, tAMPer achieves an F1-score of 68.7%, surpassing the second-best method's score of 45.3%. On the protein benchmark dataset, tAMPer exhibited an improvement of over 3.0% in F1-score compared to current state-of-the-art methods. This work highlights the potential of 3D peptide structure predictions and graph neural networks in developing safer peptide therapeutics to combat antimicrobial resistance. tAMPer is freely available at https://github.com/bcgsc/tAMPer.
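
As an illustrative sketch of the graph representation described above, the code below connects residues whose coordinates lie within a distance threshold. The 8 Å C-alpha cutoff and the toy coordinates are assumptions chosen for illustration, not necessarily the edge definition used by tAMPer.

    import numpy as np

    def contact_graph(ca_coords, threshold=8.0):
        """Adjacency matrix over residues: an edge joins two amino acids whose
        (predicted) C-alpha atoms lie within `threshold` angstroms of each other."""
        coords = np.asarray(ca_coords, dtype=float)
        dists = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
        return (dists < threshold) & ~np.eye(len(coords), dtype=bool)

    # Made-up coordinates for a six-residue peptide; in a real pipeline these would
    # come from a predicted structure (e.g. a ColabFold PDB file).
    coords = [(0, 0, 0), (3.8, 0, 0), (7.6, 0, 0), (11.4, 0, 0), (11.4, 3.8, 0), (7.6, 3.8, 0)]
    adj = contact_graph(coords)
    edges = [(i, j) for i in range(len(coords)) for j in range(i + 1, len(coords)) if adj[i, j]]
    print(edges)  # residue pairs close enough in space to share an edge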

View record

The genome of black spruce: genome annotation & analyses (2024)

Abiotic and biotic stresses associated with climate change have been identified as a dominant cause of forest tree mortality in boreal forests. Some tree populations may have the capacity for rapid adaptation or migration to keep pace with changing environmental conditions. One species of interest is black spruce (Picea mariana [Mill.] B.S.P.) as adaptive variation in relation to climate change has previously been reported for this transcontinental North American conifer. As exhibited in studies of other economically important forest trees, genomic resources play a critical role in advancing our understanding of the genomic basis of adaptive variation; however, such resources are lacking for black spruce, with the few available being predominantly transcriptome-related. This thesis describes the first genome assembly of P. mariana with a reconstructed genome size of 18.3 Gbp and NG50 scaffold length of 36.0 kbp. A total of 66,332 protein-coding sequences were predicted in silico and annotated based on sequence homology. To showcase the value of this new genomic resource, phylogenetic and comparative genomics analyses were performed. Phylogenetic trees were estimated from the nuclear and organelle genome sequences of P. mariana and five other spruces. The phylogenetic tree estimated from mitochondrial genome sequences agrees with biogeography; specifically, P. mariana was strongly supported as a sister lineage to P. glauca and three other taxa found in western North America, followed by the European P. abies. In contrast, mixed topologies with weaker statistical support were obtained in phylogenetic trees estimated from nuclear and chloroplast genome sequences, indicative of ancient reticulate evolution affecting these two genomes. Clustering of protein-coding sequences from the six Picea taxa and two Pinus species resulted in 34,776 orthogroups, 560 of which appeared to be specific to P. mariana. Analysis of these specific orthogroups and dN/dS analysis of positive selection signatures for 497 single-copy orthogroups indicate gene functions mostly related to plant development and stress response. The P. mariana genome assembly and annotation provides a valuable resource for forest genetics research and applications in this broadly distributed species, especially in relation to climate adaptation.

View record

K-mer-based data structures and pipelines for sequence mapping and analysis (2023)

The exponential growth of genomic data demands progress and research on scalable bioinformatics algorithms. One paradigm for improving computational efficiency in bioinformatics is the use of k-mers. Here we present three works based on the k-mer paradigm that improve existing methods and open new possibilities for major application domains in bioinformatics. LINKS 2.0 is an alignment-free scaffolding tool that brings 3-fold run-time and 5-fold memory improvements over the latest previous version (LINKS v1.8.7). Together with enabling LINKS to process more data with lower computational requirements, this major update also outputs higher-quality scaffolds. The major memory optimization in LINKS 2.0 was obtained by storing k-mers as their 64-bit hash values instead of as ASCII characters. The multi-index Bloom filter (miBF) is a novel associative probabilistic data structure designed for efficiently storing k-mers and spaced seeds. miBF-mapper demonstrates the utility of miBF in the long-read mapping domain and achieves competitive accuracy, providing a reference point for future miBF-based methods. The work on miBF-based global ancestry inference (GAI) demonstrated the scalability of miBF by processing high-coverage data from 208 individuals, and promises to increase the accuracy of the state of the art by capturing short insertion and deletion (indel) markers as well as SNPs. We demonstrated high accuracy in continent-level inference and present a promising foundation for developing more accurate, loci-aware ancestry inferences.
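
The memory optimization mentioned for LINKS 2.0 can be illustrated with a small sketch: a k-mer string is replaced by a fixed-width 64-bit hash value, at the cost of rare collisions. The hash function below is a generic stand-in, not the one used by LINKS.

    import sys
    import hashlib

    def hash64(kmer: str) -> int:
        """Map a k-mer to a 64-bit integer (a generic stand-in for a rolling hash)."""
        return int.from_bytes(hashlib.blake2b(kmer.encode(), digest_size=8).digest(), "big")

    kmer = "ACGT" * 25                 # a 100-mer stored as text
    print(sys.getsizeof(kmer), "bytes as a Python string")
    print("0x%016x" % hash64(kmer), "-> representable in 8 bytes (rare collisions possible)")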

View record

Copy number estimation for high-throughput short read shotgun sequencing de novo whole-genome assembly contigs (2022)

High-throughput short-read shotgun sequencing reads, also known as second-generation sequencing (SGS) reads, continue to be prevalent for de novo whole-genome assembly, whether alone or in combination with long-range information. Knowledge of contig multiplicity (copy number) is acknowledged to improve assembly correctness, contiguity, and coverage for SGS reads. Despite that, a principled, general solution for contig copy number estimation in de novo whole-genome SGS assembly has been unavailable. In the literature, the problem is generally unaddressed or given heuristic treatment.

In this work, we introduce a novel, versatile, statistically informed contig copy number estimator, based on mixture models, for high-throughput short-read shotgun sequencing de novo whole-genome assembly. In particular, this tool targets de Bruijn graph assembly, the dominant paradigm for de novo whole-genome SGS assembly. We show that it performs reliably at resolving multiplicities up to low repeat copy numbers; it is also robust over a range of genome characteristics, sequencing coverage levels, and assembly settings. Moreover, it is far more versatile than the closest existing alternative tools and usually outperforms them, often by a wide margin. At the same time, somewhat reduced though still robust performance in a limited set of experiments using real sequencing data suggests fundamental limitations to its use of only length and read coverage data; incorporating other types of information, e.g. GC content, may be necessary to improve performance. Our code is publicly available at https://github.com/bcgsc/wgs-copynum-est; we hope this effort will provide a useful reference for similar future work.
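
As a minimal sketch of the mixture-model idea (not the wgs-copynum-est implementation, which models contig length and coverage jointly), the example below fits a two-component Gaussian mixture to synthetic contig coverage values and maps each component to a copy number.

    import numpy as np
    from sklearn.mixture import GaussianMixture

    rng = np.random.default_rng(7)

    # Synthetic mean read coverage for contigs at copy numbers 1 and 2
    # (haploid coverage around 30x, so two-copy contigs sit near 60x).
    coverage = np.concatenate([rng.normal(30, 4, 300), rng.normal(60, 6, 60)]).reshape(-1, 1)

    # Fit a two-component mixture and order the components by mean coverage,
    # so that each component index maps to an estimated copy number.
    gmm = GaussianMixture(n_components=2, random_state=0).fit(coverage)
    order = np.argsort(gmm.means_.ravel())
    copy_number = {int(component): rank + 1 for rank, component in enumerate(order)}

    estimated = np.array([copy_number[int(label)] for label in gmm.predict(coverage)])
    print("contigs assigned copy number 1:", int((estimated == 1).sum()))
    print("contigs assigned copy number 2:", int((estimated == 2).sum()))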

View record

Genomic and transcriptomic signatures of virulence and UV resistance in Beauveria bassiana (2022)

Beauveria bassiana is an entomopathogenic fungus used as a biological control agent against insect pests related to agriculture, forestry and human health. There is a large amount of phenotypic and genomic variation within the species complex, and characterizing this variation is required to identify the optimal strain for protection against a specific pest. This thesis outlines comparative genomic and transcriptomic analyses of eight B. bassiana isolates, including six wild-type strains and two UV-resistant derivatives, to identify the genetic basis of virulence and UV resistance. The five strains demonstrating the highest virulence levels against mountain pine beetle produced high levels of the red pigment oosporein. Phylogenetic analysis placed the eight strains in two distinct clusters that reflected their morphology, grouping red strains separately from the non-red strains. Genes unique to the red strains included several membrane transporters, transcription factors and toxins, and may confer virulence or other unique biological functions to these strains. Significant differential expression was identified between the red and non-red strains, and these differentially expressed genes likely contribute to increased virulence, transmembrane transport and stress response in the red strains. Several genes encoding toxins, lipases and chitinases were differentially expressed, all of which are crucial to the infection process. Variant calling and differential expression analysis in the UV-resistant derivatives identified several genes of interest involved in oxidoreductase activity, stress response, copper metabolism and DNA replication/repair. These are all important mechanisms for protecting cells from UV-induced damage such as free radicals. Finally, differential correlation analysis identified several transcription factors that may be involved in the regulation of the oosporein biosynthetic gene cluster. The results of this work have narrowed the scope for selecting and/or engineering the most effective strain of Beauveria bassiana for the biological control of insect pests.

View record

High throughput in silico discovery of antimicrobial peptides in amphibian and insect transcriptomes (2021)

Antimicrobial peptides (AMPs) are a family of short defence proteins produced naturally by a wide range of organisms, from microorganisms to humans. Since resistance to AMPs develops less frequently than resistance to antibiotics, they may serve as a potential alternative. Past research has shown that amphibians have the richest known AMP diversity; the North American bullfrog in particular has demonstrated potential for aiding the discovery of novel putative AMPs. Antibiotic resistance is becoming more prevalent each day, requiring agricultural practices to reduce the use of antibiotics to protect human health, animal health, and food safety. To reduce the use of antibiotics, the goal of my thesis is to develop and execute an AMP discovery pipeline to identify AMPs suitable for pharmaceutical development. In this thesis, I have developed rAMPage (Rapid Antimicrobial Peptide Annotation and Gene Estimation), a scalable, high-throughput bioinformatics-based discovery platform for mining AMP sequences from publicly available genomic resources. It uses amphibian and insect RNA-seq reads from the Sequence Read Archive (SRA). After trimming, reads are assembled with RNA-Bloom into transcripts, filtered, and translated in silico. Then, the translated protein sequences are compared to known AMP sequences from the NCBI protein database and the AMP-specific databases APD3 and DADP via homology search. These sequences are cleaved into their mature/bioactive form. Next, the machine learning algorithm AMPlify is employed to classify and prioritize the candidate AMPs based on their AMP probability score. Finally, these candidate AMPs are annotated and characterized. Across 84 datasets, rAMPage detected over 1,000 putative AMPs, of which 90 sequences have been selected for downstream validation.

View record

Scalable methods for improving genome assemblies (2021)

De novo genome assembly is a cornerstone of modern genomics studies. It is also a useful method for studying genomes with high variation, such as cancer genomes, as it is not biased by a reference. De novo short-read assemblers commonly use de Bruijn graphs, where nodes are sequences of equal length k, also known as k-mers. Edges in this graph are established between nodes that overlap by k - 1 bases, followed by merging nodes along unambiguous walks in the graph. The selection of k is influenced by a few factors, and its fine-tuning results in a trade-off between graph connectivity and sequence contiguity. Ideally, multiple k sizes should be used, so that lower values can provide good connectivity in less-covered regions and higher values can increase contiguity in well-covered regions. However, this approach has only been explored with small genomes, without addressing scalability issues with larger ones. Here we present RResolver, a scalable algorithm that takes a short-read de Bruijn graph assembly with a starting k as input and uses a k value closer to the read length to resolve repeats. RResolver builds a Bloom filter of sequencing reads, which it uses to evaluate the assembly graph path support at branching points, and removes the paths with insufficient support. RResolver runs efficiently, taking on average 3% of a typical ABySS human assembly pipeline run time with 48 threads and 40 GB of memory. Compared to a baseline assembly, RResolver improves scaffold contiguity (NGA50) by up to 16% and reduces misassemblies by up to 7%. RResolver adds a missing component to scalable de Bruijn graph genome assembly. By improving the initial and fundamental graph traversal outcome, all downstream ABySS algorithms greatly benefit from working with a more accurate and less complex representation of the genome.
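
A minimal sketch of the de Bruijn graph construction described above, with toy reads and a deliberately small k: nodes are k-mers, edges join k-mers overlapping by k-1 bases, and unambiguous walks are merged into a contig. Real assemblers such as ABySS use far more compact representations.

    from collections import defaultdict

    def build_dbg(reads, k):
        """Nodes are k-mers; an edge joins two k-mers that overlap by k-1 bases within a read."""
        edges = defaultdict(set)
        for read in reads:
            for i in range(len(read) - k):
                edges[read[i:i + k]].add(read[i + 1:i + 1 + k])
        return edges

    def merge_unambiguous(edges, start):
        """Walk from `start`, extending the contig while each node has exactly one successor."""
        contig, node = start, start
        while len(edges.get(node, ())) == 1:
            (node,) = edges[node]
            contig += node[-1]
            if node == start:  # guard against looping forever on a cycle
                break
        return contig

    reads = ["ACGGTCAGT", "GGTCAGTAC", "CAGTACCTT"]
    graph = build_dbg(reads, k=4)
    print(merge_unambiguous(graph, "ACGG"))  # reconstructs ACGGTCAGTACCTT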

View record

Seasonal and sex-dependent gene expression in emu (Dromaius novaehollandiae) fat tissues (2021)

The emu (Dromaius novaehollandiae) is a bird that has been farmed for its oil, rendered from fat, for use in therapeutics and cosmetics. Emu oil is valued for its anti-inflammatory and antioxidant properties, which promote wound healing. In spring and summer, adult emus start to gain fat, and during breeding in winter they expend the energy from their fat stores to sustain themselves when food is scarce. Since emus go through an annual cycle of fat gain and loss, understanding the genes affecting fat metabolism and deposition is crucial to improving fat production on emu farms. Samples were taken from back and abdominal fat tissues of the same four male and four female emus in April, June, and November for RNA-sequencing. In November, the emus’ body and fat pad weights were recorded. Seasonal and sex-dependent differentially expressed (DE) genes were analyzed, and genes involved in fat metabolism were identified. A total of 100 DE genes (47 seasonal in males; 34 seasonal in females; 19 between sexes) were found. For the seasonally DE genes, differences between the sexes in enriched gene ontology terms, together with supporting studies, suggested that integrin beta chain-2 (ITGB2) influences fat changes. Six seasonally DE genes functioned in more than two enriched pathways (two in females: angiopoietin-like 4 (ANGPTL4) and lipoprotein lipase (LPL); four in males: lumican (LUM), osteoglycin (OGN), aldolase B (ALDOB), and solute carrier family 37 member 2 (SLC37A2)). Two genes DE between the sexes, follicle stimulating hormone receptor (FSHR) and perilipin 2 (PLIN2), had functional investigations supporting their influence on fat gain and loss. The results suggest these nine genes (ITGB2, ANGPTL4, LPL, LUM, OGN, ALDOB, SLC37A2, FSHR, PLIN2) functionally influence fat metabolism and deposition in emus. This study lays the foundation for further downstream studies to improve emu fat production through selective breeding using single nucleotide polymorphism markers.

View record

Antimicrobial peptide host toxicity prediction with transfer learning for proteins (2020)

Antimicrobial peptides (AMPs) are host defense peptides produced by all multicellular organisms, and can be used as alternative therapeutics in peptide-based drug discovery. In large peptide discovery and validation pipelines, it is important to avoid the time and resource sinks that arise from the need to experimentally validate a large number of peptides for toxicity. Therefore, in silico methods for the prediction of antimicrobial peptide toxicity can be applied in advance to filter out sequences that may be toxic. While many machine learning-based approaches exist for predicting the toxicity of proteins, the problem is often defined as classifying venoms and toxins against proteins that are nonvenomous. In my thesis I propose a new method called tAMPer that focuses on the classification of AMPs that may or may not induce host toxicity, based on their sequences. I have used the deep learning model ELMo, as adapted by SeqVec, to obtain vector embeddings for a dataset of synthetic and natural AMPs that have been experimentally tested in vitro for their toxicity through hemolytic and cytotoxicity assays. This is a balanced dataset that contains ~2600 sequences, split 80/20 into train and test sets. By utilizing the latent representation of the data produced by SeqVec, and by further applying ensemble learning methods to these embeddings, I have built a model that predicts the toxicity of antimicrobial peptides with an F1 score of 0.758 and an accuracy of 0.811 on the test set, performing favourably against state-of-the-art approaches both when trained and tested on our dataset and on the other methods’ respective train and test datasets.

View record

De novo annotation of non-model organisms using whole genome and transcriptome shotgun sequencing (2017)

Current genome and transcriptome annotation pipelines mostly depend on reference resources. This restricts their annotation capabilities for novel species that might lack reference resources for themselves or a closely related species. To address the limitations of these tools and reduce reliance on reference genomes and existing gene models, we present ChopStitch, a method for finding putative exons and constructing splice graphs using transcriptome assembly and whole-genome sequencing data as inputs. We implemented a method that identifies exon-exon boundaries in de novo assembled transcripts with the help of a Bloom filter that represents the k-mer spectrum of genomic reads. We tested our method on characterizing roundworm and human transcriptomes, using publicly available RNA-Seq and whole-genome shotgun sequencing data. We compared our method with LEMONS, Cufflinks and StringTie and found that ChopStitch outperforms these state-of-the-art methods for finding exon-exon junctions with and without the help of a reference genome. We have also applied our method to annotating the transcriptome of the American bullfrog. ChopStitch can be used effectively to annotate de novo transcriptome assemblies and explore alternative mRNA splicing events in non-model organisms, thereby revealing new loci for functional analysis and giving access to genes that were previously inaccessible.

Long non-coding RNAs (lncRNAs) have been shown to contribute to the sub-cellular structural organization, function, and evolution of genomes. With a composite reference transcriptome and a draft genome assembly for the American bullfrog, we developed a pipeline to find putative lncRNAs in its transcriptome. We used a staged subtractive approach with different strategies to remove coding contigs and reduce our set. These include predicting coding potential and open reading frames, running sequence similarity searches against known protein-coding sequences and motifs, and evaluating contigs with support vector machines. We further refined our set by selecting and keeping contigs with poly(A) tails and sequence hexamers. We interrogated our final set for sequences that share some level of homology with known lncRNAs and amphibian transcriptome assemblies. We selected seven candidates from our final set for validation through qPCR, of which six were amplified.
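
The junction-finding idea behind ChopStitch described above can be illustrated with a small sketch: k-mers of a spliced transcript that span an exon-exon boundary are absent from the genomic k-mer spectrum. In the toy example below a plain Python set stands in for the Bloom filter, and the exon and intron sequences are made up.

    K = 8

    def kmers(seq, k=K):
        return [seq[i:i + k] for i in range(len(seq) - k + 1)]

    # Toy genome with two "exons" separated by an intron; the transcript is the spliced
    # product, so k-mers spanning the exon-exon junction are absent from the genome.
    exon1, intron, exon2 = "ACGTACGTAAGC", "GTTTTTTTTTTAG", "CCATGGCATGCA"
    genome = exon1 + intron + exon2
    transcript = exon1 + exon2

    genomic_kmers = set(kmers(genome))
    absent = [i for i, km in enumerate(kmers(transcript)) if km not in genomic_kmers]
    print("transcript k-mer positions missing from the genome:", absent)
    # The run of missing k-mers ends at position len(exon1) - 1, pointing to a putative
    # exon-exon boundary after the 12th transcript base.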

View record

Kollector: transcript-informed targeted de novo assembly of gene loci (2017)

The information stored in nucleotide sequences is of critical importance for modern biological and medical research. However, in spite of considerable advancements in sequencing and computing technologies, de novo assembly of whole eukaryotic genomes is still a time-consuming task that requires a significant amount of computational resources and expertise, and it remains beyond the reach of many researchers. One solution to this problem is restricting the assembly to a portion of the genome, typically a small region of interest. Genes are the most obvious choice for this kind of targeted assembly approach, as they contain the most relevant biological information, which can be acted upon downstream. Here we present Kollector, a targeted assembly pipeline that assembles genic regions using information from transcript sequences. Kollector not only enables researchers to take advantage of rapidly expanding transcriptome data, but is also scalable to large eukaryotic genomes. These features make Kollector a valuable addition to the current crop of targeted assembly tools, a fact we demonstrate by comparing Kollector to the state of the art. Furthermore, we show that by localizing the assembly problem, Kollector can recover sequences that cannot be reconstructed by a whole-genome de novo assembly approach. Finally, we also demonstrate several use cases for Kollector, ranging from comparative genomics to viral strain detection.

View record

Nomenclature errors in public 16S rDNA gene databases: strategies to improve the accuracy of sequence annotations (2017)

Obtaining an accurate representation of the microorganisms present in microbial ecosystems presents a considerable challenge. Microbial communities are typically highly complex, and may consist of a variety of differentially abundant bacteria, archaea, and microbial eukaryotes. The targeted sequencing of the 16S rDNA gene has become a standard method for profiling the membership and biodiversity of microbial communities, as the bacterial and archaeal community members may be profiled directly, without any intermediate culturing steps. These studies rely upon specialized 16S rDNA gene reference databases, but little systematic and independent evaluation of the annotations assigned to sequences in these databases has been performed. This project examined the quality of the nomenclature annotations of the 16S rDNA sequences in three public databases: the Ribosomal Database Project, SILVA, and Greengenes. To do so, three nomenclature resources – the List of Prokaryotic Names with Standing in Nomenclature, the Integrated Taxonomic Information System, and Prokaryotic Nomenclature Up-to-Date – were first evaluated to determine their suitability for validating prokaryote nomenclature. A core set of valid, invalid, and synonymous organism names was then collected from these resources and used to identify incorrect nomenclature in the public 16S rDNA databases. To assess the potential impact of misannotated reference sequences on microbial gene survey studies, the misannotations identified in the SILVA database were categorized by sample isolation source. Methods for the detection and prevention of nomenclature errors in reference databases were examined, leading to the proposal of several quality assurance strategies for future biocuration efforts. These include phylogenetic methods for the identification of anomalous taxonomic placements, database design principles and technologies for quality control, and opportunities for community-assisted curation.

View record

RNA-Bloom: de novo RNA-seq assembly with Bloom filters (2017)

High-throughput RNA sequencing (RNA-seq) is primarily used for measuring gene expression, quantifying transcript abundance, and building reference transcriptomes. Without bias from a reference sequence, de novo RNA-seq assembly is particularly useful for building new reference transcriptomes, detecting fusion genes, and discovering novel spliced transcripts. This is a challenging problem, and to address it at least eight approaches, including Trans-ABySS and Trinity, were developed within the past decade. For instance, using Trinity and 12 CPUs, it takes approximately one and a half days to assemble a human RNA-seq sample of over 100 million read pairs, and the process requires up to 80 GB of memory. While the high memory usage typical of de novo RNA-seq assemblers may be alleviated by distributed computing, access to a high-performance computing environment is a requirement that may be limiting for smaller labs. In my thesis, I present a novel de novo RNA-seq assembler, “RNA-Bloom,” which utilizes compact data structures based on Bloom filters for the storage of k-mer counts and the de Bruijn graph in memory. Compared to Trans-ABySS and Trinity, RNA-Bloom can assemble a human transcriptome with comparable accuracy using nearly half as much memory and half the wall-clock time with 12 threads.

View record

Gene expression and mutation profiles define novel subclasses of cytogenetically normal acute myeloid leukemia (2016)

Acute myeloid leukemia (AML) is a genetically heterogeneous disease characterized by the accumulation of acquired somatic genetic abnormalities in hematopoietic progenitor cells. Recurrent chromosomal rearrangements are well-established diagnostic and prognostic markers. However, approximately 50% of AML cases have normal cytogenetics and have variable responses to conventional chemotherapy. Molecular markers have begun to subdivide cytogenetically normal AML (CN-AML) and have been shown to predict clinical outcome. Despite these achievements, current classification schemes are not completely accurate and improved risk stratification is required.

My overall objective was to identify specific gene expression and mutation signatures to define novel subclasses of CN-AML. I hypothesized that CN-AML would separate into two or more subgroups. Gene expression and mutational profiles were established using RNA-Sequencing, clustering, de novo transcriptome assembly, and variant detection. I found that CN-AML could be separated into three groups, two of which had statistically significant survival differences (Kaplan-Meier analysis, log-rank test, p = 9.75 × 10⁻³). Variant analysis revealed nine fusions that are not detectable via cytogenetic analysis, and differential expression analysis identified a set of discriminatory genes to classify each subgroup. These findings contribute to the current understanding of the genetic complexity of AML and highlight gene fusion candidates for follow-up functional analyses.

View record
