We consider the problem of estimating a high-dimensional p × p covariance matrix Σ, given n observations of confounded data with covariance Σ + ΓΓ^T, where Γ is an unknown p × q matrix of latent factor loadings. We propose a simple and scalable estimator based on projection onto the right singular vectors of the observed data matrix, which we call RSVP. Our theoretical analysis of this method reveals that, in contrast to approaches based on removal of principal components, RSVP copes well with settings where the smallest eigenvalue of Γ^T Γ is relatively close to the largest eigenvalue of Σ, as well as settings where the eigenvalues of Γ^T Γ diverge fast. RSVP does not require knowledge or estimation of the number of latent factors q, but recovers Σ only up to an unknown positive scale factor. We argue that this suffices in many applications, for example if an estimate of the correlation matrix is desired. We also show that subsampling can further improve the performance of the method. We demonstrate the favourable performance of RSVP through simulation experiments and an analysis of gene expression datasets collated by the GTEx consortium.
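
As a rough illustration of the projection idea described in this abstract, the Python sketch below computes the projection onto the right singular vectors of a data matrix. The function name `rsvp_projection`, the column-centring step and the rank tolerance are assumptions made for illustration, not the authors' implementation, and the subsampling refinement is not shown.

```python
import numpy as np

def rsvp_projection(X):
    """Projection onto the right singular vectors of the (centred) n x p data matrix X.

    Illustrative sketch only: centring and the rank tolerance are assumptions.
    """
    X = np.asarray(X, dtype=float)
    # Centre each column so the projection reflects covariance structure.
    Xc = X - X.mean(axis=0, keepdims=True)
    # Thin SVD: Xc = U diag(s) Vt, with the rows of Vt being right singular vectors.
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    # Keep only directions with numerically non-zero singular values.
    tol = s.max() * max(Xc.shape) * np.finfo(float).eps
    V = Vt[s > tol].T
    # V V^T discards the singular values, so any scale information is lost.
    return V @ V.T

# Hypothetical usage with n = 20 observations of p = 50 variables.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 50))
P = rsvp_projection(X)
```

Because the singular values are discarded, the result carries no information about overall scale, which is consistent with the statement that Σ is recovered only up to a positive factor.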

Polynomial factor models: non-iterative estimation via method-of-moments and Croon’s approach (Dr. Florian Schuberth)

I present a non-iterative method-of-moments estimator for non-linear latent variable (LV) models, i.e., models including quadratic terms, interaction terms, etc. of LVs. Under the assumption of joint normality of all exogenous variables, the corrected moments of linear combinations of the observed indicators (proxies) can be used to obtain consistent path coefficient and factor loading estimates. Besides providing the theoretical background of this approach, I also suggest how Croon’s correction for attenuation can be applied to estimate non-linear latent variable models in a similar manner.

Methods for evaluating treatment selection markers (Dr. Patrick Bossuyt)

Personalized medicine promises to revolutionize health care. The US National Cancer Institute refers to it as a “form of medicine that uses information about a person’s genes, proteins, and environment to prevent, diagnose, and treat disease”. Central in stratified and personalized medicine is the use of biomarkers to guide clinical actions. This has led to the sometimes confusing notion of “predictive markers”.
In this presentation I will offer an alternative way of defining a set of biomarkers as clinically useful, in terms of their ability to improve patient outcomes or to increase health care efficiency: treatment selection markers. These markers improve outcomes by identifying, in a well-defined population of potentially eligible patients or citizens, those who benefit sufficiently from a treatment or another action, versus those who do not benefit, or for whom the balance of benefits and harms is unacceptable.
Over 100 study designs have been proposed to evaluate the claim that a specific marker can act as a treatment selection marker. We have classified these proposals for trials based on the flow of participants in the study, which leads to a classification of four basic categories: single arm trials, enrichment trials, all-comer trials and treatment strategy trials. For each category, we will evaluate whether these designs can generate the evidence to support the respective claims, using K-RAS mutation testing for anti-EGFR therapy in colon cancer as an example.
Specific problems arise in the analysis of relatively new trial designs, the so-called master protocol trials in oncology (umbrella trials and basket trials). These will be highlighted, as well as the expectations from big data, real-world evidence and machine learning.

Long-term consequences of childhood behavioural problems in Brazil and the UK: combining latent class analysis and counterfactual mediation to test explanations across contexts (Dr. Gemma Hammerton)

Behavioural problems, such as stealing, fighting, and temper tantrums, are common across childhood and they can impact adulthood in negative ways, including poor health and well-being, criminal behaviour, and unemployment. Much of the current evidence derives from high-income countries, and less is known about the long-term consequences of behavioural problems in low- and middle-income countries. Cross-cohort comparisons can be challenging due to biases that arise from translation and differences in responding across contexts. In the current paper, we test for measurement invariance of childhood behavioural problems across two population-based cohorts in the UK and Brazil using latent class analysis with stepwise multiple indicator multiple cause modelling. We then examine methods to incorporate these latent classes as an exposure in a counterfactual mediation model to examine the extent to which potential mediators (e.g. gang membership) both explain and modify the association between behavioural problems and later adverse outcomes in the UK and Brazil. We compare various options for accounting for the uncertainty in latent class assignment when using a weighting-based approach to counterfactual mediation.

Robust summarization and inference in label-free mass spectrometry based quantification (Prof. Dr. Lieven Clement, Ghent University)

Label-free mass spectrometry quantification and differential analysis of proteins is often challenging due to peptide-specific effects and context-dependent missingness of peptide intensities. In this talk, we evaluate state-of-the-art summarisation strategies using a benchmark spike-in dataset and discuss why and when they fail compared to the state-of-the-art peptide-based model, MSqRob. Next, we propose a novel summarisation strategy, MSqRobSum, which estimates MSqRob’s model parameters in a two-stage procedure. This maintains MSqRob’s superior performance, while providing useful protein expression summaries for plotting and downstream analysis. It considerably reduces the computational complexity, the memory footprint and the model complexity, and makes it easier to disseminate the method to a broad audience who are used to inferring differential expression (DE) from protein summaries. Moreover, MSqRobSum makes our analysis framework modular, giving our end-users full flexibility to develop workflows tailored to their specific applications.

Probabilistic mixture modelling and software for reproducible spatial proteomics data analysis (Prof. Dr. Laurent Gatto, de Duve institute, UCLouvain)

In biology, localisation is function: understanding the sub-cellular localisation of proteins is paramount to comprehending the context of their functions. Mass spectrometry-based spatial proteomics and contemporary machine learning make it possible to build proteome-wide spatial maps, informing us on the location of thousands of proteins. Nevertheless, while some proteins can be found in a single location within a cell, up to half of proteins may reside in multiple locations, can dynamically re-localise, or reside within an unknown functional compartment, leading to considerable uncertainty in assigning a protein to its sub-cellular location. Recent advances enable us to probabilistically model protein localisation as well as quantify the uncertainty in the location assignments, thus leading to better and more trustworthy biological interpretation of the data.

A short tour of our researches in biostatistics thriving from Prostar (Dr. Thomas Burger, University of Grenoble Alpes)

I will start with a brief introduction to Prostar, a software tool for the statistical analysis of differential quantitative proteomics data. Then, I will review three different problems we have recently investigated from theoretical and practical viewpoints: determining the nature of missing values, accounting for shared peptides in the differential analysis, and making false discovery rate control procedures more robust.

Probabilistic Programming for protein structure analysis, prediction and design (Dr. Thomas Hamelryck)

Probabilistic Programming is an emerging technology in machine learning, after Deep Learning and Big Data Analytics. The main idea is:

1. Use a computer language augmented with statistical operators
2. ...to formulate a suitable probabilistic model for a given data set and
3. ...to perform automated statistical inference by executing the program.

This approach has been made possible by recent breakthroughs in automated inference (including advanced sampling methods and variational inference) and numerical computing (including software such as PyTorch and TensorFlow). I will outline what probabilistic programming is all about and discuss how it can impact protein structure prediction, analysis and design. Specifically, I will present a Bayesian model of protein superposition implemented in the deep probabilistic programming language Pyro. This model could potentially serve as a suitable likelihood function for Bayesian structure prediction using deep probabilistic programming.
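
The three-step idea above can be sketched without any dedicated framework. The toy Python example below writes a model as an ordinary function returning a log joint density and hands it to a generic random-walk Metropolis sampler. This is an illustration under assumed choices (a normal prior, a normal likelihood, the names `model_log_joint` and `metropolis`); it is not Pyro and not the protein superposition model from the talk.

```python
import numpy as np

def log_normal(x, mean, sd):
    """Log density of a normal distribution."""
    return -0.5 * np.log(2 * np.pi * sd**2) - 0.5 * ((x - mean) / sd) ** 2

def model_log_joint(mu, data):
    """Steps 1-2: the probabilistic model, written as a program.

    Assumed toy model: mu ~ Normal(0, 10), data_i ~ Normal(mu, 1).
    """
    log_prior = log_normal(mu, 0.0, 10.0)
    log_lik = np.sum(log_normal(data, mu, 1.0))
    return log_prior + log_lik

def metropolis(log_joint, data, n_steps=5000, step=0.5, seed=0):
    """Step 3: generic inference that only needs to evaluate the program."""
    rng = np.random.default_rng(seed)
    current = 0.0
    current_lp = log_joint(current, data)
    samples = np.empty(n_steps)
    for i in range(n_steps):
        proposal = current + step * rng.normal()
        proposal_lp = log_joint(proposal, data)
        # Accept with probability min(1, joint(proposal) / joint(current)).
        if np.log(rng.uniform()) < proposal_lp - current_lp:
            current, current_lp = proposal, proposal_lp
        samples[i] = current
    return samples

# Hypothetical data: 50 draws from Normal(2, 1).
data = np.random.default_rng(1).normal(loc=2.0, scale=1.0, size=50)
samples = metropolis(model_log_joint, data)
posterior_mean = samples[1000:].mean()  # discard the first 1000 draws as burn-in
```

The sampler never inspects the model's internals, so swapping in a different `model_log_joint` requires no change to the inference code. Frameworks such as Pyro automate this separation with far more capable inference engines.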

Robust normalisation and differential variability testing for noisy scRNAseq data (Dr. Catalina Vallejos, MRC Human Genetics Unit, University of Edinburgh)

Cell-to-cell transcriptional variability in seemingly homogeneous cell populations plays a crucial role in tissue function and development. Single-cell RNA sequencing (scRNAseq) can characterise this variability in a transcriptome-wide manner. However, scRNAseq is prone to high levels of technical noise, creating new challenges for identifying genes that show genuinely heterogeneous expression. I will discuss some of the challenges that arise when using normalisation methods developed for bulk data in the context of scRNAseq analyses. Moreover, I will introduce BASiCS, an integrated Bayesian framework to perform data normalisation, technical noise quantification and selected downstream analyses. Beyond traditional mean expression testing, BASiCS can robustly identify changes in transcriptional variability between cell populations (such as experimental conditions or cell types). I will illustrate this feature in the context of immune cell populations. Finally, I will discuss ongoing efforts to extend and improve the scalability of BASiCS.

Count models for single-cell RNA-seq and metagenomics data (Dr. Davide Risso)

Transcriptomics and metagenomics are two applications of high-throughput sequencing that share several key characteristics. In both cases, the sequenced reads are aggregated at some feature level (genes or taxa) to form a matrix of counts. Indeed, negative binomial models have been successfully applied to both data types. Microbiome metagenomics data often exhibit zero inflation with respect to the negative binomial distribution, a phenomenon observed in single-cell RNA sequencing as well. Here, we explore the performance of statistical methods developed for (single-cell) RNA-seq differential expression analysis on the detection of differentially abundant taxa in metagenomics experiments. We will focus on microbiome datasets comprising both 16S and whole-metagenome data. We will highlight some key differences between single-cell RNA-seq and metagenomics data and discuss the scalability of the methods.
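
To make "zero inflation with respect to the negative binomial" concrete, the sketch below compares the zero probability of a negative binomial with that of a zero-inflated version. The parameterisation by mean `mu` and dispersion `size`, the mixing weight `pi` and the function names are assumptions chosen for illustration and are not taken from the talk.

```python
import numpy as np

def nb_zero_prob(mu, size):
    """P(Y = 0) for a negative binomial with mean mu and dispersion parameter size."""
    return (size / (size + mu)) ** size

def zinb_zero_prob(mu, size, pi):
    """P(Y = 0) when an extra point mass at zero is mixed in with weight pi."""
    return pi + (1.0 - pi) * nb_zero_prob(mu, size)

def simulate_zinb(mu, size, pi, n, seed=0):
    """Draw n zero-inflated negative binomial counts."""
    rng = np.random.default_rng(seed)
    # NumPy parameterises by (n, p); p = size / (size + mu) gives mean mu.
    counts = rng.negative_binomial(size, size / (size + mu), size=n)
    structural_zero = rng.uniform(size=n) < pi
    return np.where(structural_zero, 0, counts)

# Hypothetical parameters: mean 2, dispersion 1, 20% structural zeros.
counts = simulate_zinb(mu=2.0, size=1.0, pi=0.2, n=100_000)
observed_zero_fraction = np.mean(counts == 0)
expected_nb = nb_zero_prob(2.0, 1.0)           # 1/3 under a plain negative binomial
expected_zinb = zinb_zero_prob(2.0, 1.0, 0.2)  # 0.2 + 0.8 / 3
```

With these parameters the observed fraction of zeros sits near the zero-inflated value rather than near 1/3, which is the kind of excess the abstract refers to.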

Resampling techniques for general models and very small sample sizes (Dr. Frank Konietschke)

Since the pioneering work “Bootstrap Methods: Another Look at the Jackknife” by Efron (Annals of Statistics, 1979), bootstrap and resampling methods have become a generally accepted tool in the statistical sciences. Notably, one often reads that “the null distribution of a statistic is computed using permutation methods”. The questions that arise are “When do resampling and permutation methods work?” and especially “How do resampling methods work?”. In this talk, we will study different resampling methods and answer these questions. It turns out that so-called studentized permutation methods are valid even when the data are not exchangeable. Real data sets illustrate the application of the proposed methods, using mean-based and purely nonparametric rank-based methods.
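
A minimal sketch of a studentized permutation test for two independent samples follows. It assumes Welch's t as the studentized statistic and a two-sided Monte Carlo p-value; the function names and defaults are illustrative and are not the speaker's implementation.

```python
import numpy as np

def welch_t(x, y):
    """Difference in means divided by its unpooled standard error."""
    se = np.sqrt(x.var(ddof=1) / len(x) + y.var(ddof=1) / len(y))
    return (x.mean() - y.mean()) / se

def studentized_permutation_test(x, y, n_perm=2000, seed=0):
    """Two-sided permutation p-value based on the studentized statistic."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    observed = welch_t(x, y)
    pooled = np.concatenate([x, y])
    n_extreme = 0
    for _ in range(n_perm):
        perm = rng.permutation(pooled)
        # Re-studentize within every permutation, not just once on the original data.
        t_perm = welch_t(perm[:len(x)], perm[len(x):])
        if abs(t_perm) >= abs(observed):
            n_extreme += 1
    # Add-one correction so the Monte Carlo p-value is never exactly zero.
    return observed, (n_extreme + 1) / (n_perm + 1)

# Hypothetical small samples with unequal sizes and clearly separated locations.
x = np.arange(8.0)
y = np.arange(12.0) + 20.0
t_obs, p_value = studentized_permutation_test(x, y)
```

The step that distinguishes this from a plain permutation test on the mean difference is that the standard error is recomputed for each permuted split.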

Heterogeneity in causal estimates from genetic instrumental variables (Dr. Stephen Burgess)

An instrumental variable can be used to estimate the causal effect of a risk factor on an outcome from observational data. For many risk factors, several genetic variants have been identified that plausibly satisfy the assumptions required to be an instrumental variable. Each instrumental variable can be used to identify an average causal effect across individuals in the population. However, the individual causal effect of the risk factor on the outcome may differ between people in the population. Additionally, causal estimates from different genetic variants may differ substantially. We investigate two scenarios where there is heterogeneity in the causal effect of the risk factor on the outcome. In the first scenario, the risk factor can be intervened on in different ways, leading to different magnitudes of causal effect; we show how evidence for different causal effects can be synthesized. In the second scenario, we investigate genetic variants that influence response to a proposed treatment; we demonstrate a random forest approach for causal effect heterogeneity. We illustrate both scenarios with applied examples.
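
One common way to see how estimates from different variants can disagree is to compute a ratio estimate per variant, pool them with inverse-variance weights, and summarise the spread with Cochran's Q. The sketch below assumes that generic approach; it is not the synthesis method or the random forest approach from the talk, and all summary statistics are made-up illustrative numbers.

```python
import numpy as np

def ratio_estimates(beta_x, beta_y, se_y):
    """Per-variant causal estimates: variant-outcome over variant-risk-factor association."""
    beta_x = np.asarray(beta_x, dtype=float)
    beta_y = np.asarray(beta_y, dtype=float)
    se_y = np.asarray(se_y, dtype=float)
    estimates = beta_y / beta_x
    # First-order standard error, ignoring uncertainty in beta_x (an assumption).
    std_errors = se_y / np.abs(beta_x)
    return estimates, std_errors

def ivw_and_cochran_q(estimates, std_errors):
    """Inverse-variance weighted pooled estimate and Cochran's Q heterogeneity statistic."""
    weights = 1.0 / std_errors**2
    pooled = np.sum(weights * estimates) / np.sum(weights)
    q_stat = np.sum(weights * (estimates - pooled) ** 2)
    return pooled, q_stat

# Hypothetical summary statistics for four variants; the last one disagrees with the rest.
beta_x = [0.10, 0.20, 0.15, 0.12]
beta_y = [0.05, 0.10, 0.075, 0.18]
se_y = [0.01, 0.01, 0.01, 0.01]

estimates, std_errors = ratio_estimates(beta_x, beta_y, se_y)
pooled, q_stat = ivw_and_cochran_q(estimates, std_errors)
```

A Q statistic far above the number of variants minus one suggests that the variants are not all estimating the same causal effect, which is the situation the two scenarios in the abstract address.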