Yongjin Park

 
Prospective Graduate Students / Postdocs

This faculty member is not actively recruiting graduate students or postdoctoral fellows at present, but might consider co-supervision with another faculty member.

Assistant Professor

Research Interests

Single-cell genomics
Computational Biology
Causal inference
Bayesian machine learning

Research Options

I am available and interested in collaborations (e.g. clusters, grants).
I am interested in and conduct interdisciplinary research.
 
 

Research Methodology

Bioinformatics tool development
Bayesian modelling
Causal inference
Probabilistic programming

Graduate Student Supervision

Doctoral Student Supervision

Dissertations completed in 2010 or later are listed below. Please note that there is a 6-12 month delay to add the latest dissertations.

Interpretable latent variable models for high-dimensional biological data analysis (2024)

A latent variable model is a statistical model used to uncover hidden patterns in data. Matrix factorization and autoencoders are two widely used modeling frameworks for latent variable models. Matrix factorization-based methods represent data in a low-dimensional latent space where each latent dimension is explained by a weighted combination of the original features. While offering a straightforward interpretation of the latent space, these methods do not capture the non-linear structure of the data. On the other hand, artificial neural networks such as autoencoders have emerged as a powerful tool to summarize complex data structures through a series of non-linear transformations. However, the resulting latent representation often lacks a biological interpretation. In this dissertation, I develop three latent variable models to extract meaningful biological patterns from various datasets, using a hybrid of the autoencoder and matrix factorization frameworks. The first model incorporates domain knowledge of biological pathways into an autoencoder framework. I demonstrate that the proposed method retains the biological signals in the latent space better and recovers the underlying latent structure more accurately than a previous matrix-factorization-based approach. The second model builds a topic analysis tool for single-cell genomics, leveraging the variational autoencoder framework and a latent topic model. The proposed topic model recovers cell clusters and cell-specific gene programs more accurately than conventional methods, such as principal component analysis and non-negative matrix factorization. Our results suggest that latent topics are suitable for capturing cell-type-specific marker genes and that they recapitulate known immune cells in pancreatic cancer. Lastly, I extend this topic model to interpret dynamic changes in gene expression.
The dynamic topic model uncovers short-term transcriptional dynamics from a plethora of spliced and unspliced single-cell RNA-sequencing counts. I demonstrate that modeling both types of RNA counts can improve robustness in statistical estimation and reveal new aspects of transcriptional dynamics that can be missed in previous analyses. In the latent space, I discovered that seven gene programs (topics) are highly correlated with cancer prognosis and are generally enriched for immune cell types and pathways.


Master's Student Supervision

Theses completed in 2010 or later are listed below. Please note that there is a 6-12 month delay to add the latest theses.

Learning cellular hierarchies through structured topic modeling (2024)

The human immune system relies on the function and balance of various immune cell subsets and their interactions. Immune cells undergo a series of differentiation steps following a lineage-tree structure stemming from hematopoietic stem cells to reach their mature cell state. During differentiation of immune cells in both homeostasis and pathological processes, many cellular features, including gene expression patterns, are shared by fully differentiated immune cell subtypes. The process of immune cell differentiation is complex and not fully understood. Additionally, aberrant function and balance play a major contributing role in the pathogenesis of many immunological disorders, including systemic lupus erythematosus (SLE).
In this thesis, I propose LaRCH, a tree-structured neural topic model, as a method to quantitatively characterize shared hierarchical features between cell subsets. In this model, single-cell gene expression profiles are represented by a mixture of topics consisting of latent features that follow an underlying tree structure, mirroring the dynamics of cellular differentiation.
I present findings of our model trained on simulated single-cell RNA sequencing (scRNA-seq) data based on cell-sorted bulk RNA-seq data and on a scRNA-seq dataset of over 1.2 million cells from individuals with variable lupus disease phenotypes. The cellular topic profiles estimated by our model markedly improve cell type deconvolution accuracy over traditional methods. Trained model parameters of LaRCH illustrate cell-type-specific transcriptomic differences between SLE phenotypes, revealing the contributions of multiple immune cell types to the manifestations of lupus. I also identify a number of candidate genes that may be implicated in the mechanisms driving lupus disease pathogenesis.
Ultimately, LaRCH is able to capture the hierarchical context between immune cell subsets by simultaneously identifying shared and distinct latent features among cell subtypes within heterogeneous cell samples.


LiquidBayes: a Bayesian network for monitoring cancer progression using liquid biopsies (2023)

The full abstract for this thesis is available in the body of the thesis, and will be available when the embargo expires.


False discovery rate estimation for high-dimensional regression models (2022)

A genome-wide association study (GWAS) aims to determine genetic variants statistically associated with phenotypes. However, because of linkage disequilibrium (LD), the strong local dependencies between single-nucleotide polymorphisms (SNPs) that characterize large-scale genomic datasets, it is usually challenging to identify the actual causal variants among their associated proxies. In this work, we propose a Bayesian variable selection method called the sparse mixed Gaussian prior for generalized linear models (SMG-GLM). It is an efficient high-dimensional Bayesian variable selection approach designed for arbitrary relationships between variants and phenotypes. It also calibrates the selection uncertainty, which many popular variable selection methods do not address, by estimating posterior inclusion probabilities. We additionally combine SMG-GLM with knockoffs, yielding a method named SMG-knockoffs, to account for the collinearity problem caused by LD. The SMG-knockoffs method can make inferences on the variable selection result and control the false discovery rate at a desired level. Its ability to discover causal variables while controlling the false discovery rate at that level has been shown in simulation studies conducted on a GWAS dataset.
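As an illustration of how a knockoff procedure can control the false discovery rate, the sketch below implements the standard knockoff+ selection threshold in Python. It is not taken from the thesis: the abstract does not say how SMG-knockoffs computes its importance statistics, so the input `w` (one signed statistic per variant, large and positive when the real variant looks more important than its knockoff copy), the function names, and the example values are all assumptions made for illustration.

```python
def knockoff_threshold(w, q=0.1, offset=1):
    """Return the knockoff+ threshold for signed statistics w.

    Finds the smallest t among the nonzero |w_j| such that
    (offset + #{j: w_j <= -t}) / max(1, #{j: w_j >= t}) <= q.
    offset=1 is the knockoff+ variant. Returns infinity when no
    threshold satisfies the bound, meaning nothing is selected.
    """
    candidates = sorted({abs(x) for x in w if x != 0})
    for t in candidates:
        negatives = sum(1 for x in w if x <= -t)  # estimates false positives
        positives = sum(1 for x in w if x >= t)   # variables selected at t
        if (offset + negatives) / max(1, positives) <= q:
            return t
    return float("inf")


def knockoff_select(w, q=0.1, offset=1):
    """Return indices of variables whose statistic reaches the threshold."""
    t = knockoff_threshold(w, q, offset)
    return [j for j, x in enumerate(w) if x >= t]


# Hypothetical statistics for ten variants (illustrative values only).
w_example = [5.0, 4.0, 3.5, 3.0, 2.5, -0.5, 0.3, -0.2, 2.0, 1.5]
selected = knockoff_select(w_example, q=0.2)
```

The count of large negative statistics serves as an estimate of how many selected variables are false discoveries, which is why the ratio in the loop acts as an estimated false discovery proportion at each candidate threshold.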
