Quetelet seminars 2017

A quick overview of tail inference: from univariate to multivariate
David Veredas
8 December 2017
Average effects based on regression models with log link: A new approach with stochastic covariates
Axel Mayer
27 November 2017
Accelerating the convergence rate of Monte Carlo integration through ordinary least squares
Johan Segers
10 November 2017
Survival analysis with linked data
A.H. Zwinderman
12 October 2017
A hidden Markov approach to the analysis of cylindrical space-time series
Monia Ranalli
19 September 2017
Some recent developments in Mendelian randomization
Steve Burgess
11 August 2017
Statistical Challenges and Opportunities in Sport and Exercise Science
Marijke Welvaert
12 July 2017
Simple applications of Optimal Stopping in everyday life
Thomas Bruss
22 June 2017
An Introduction to the Correspondence Analysis of Over-dispersed Data: The Freeman-Tukey Statistic
Eric Beh
9 June 2017
Stein operators and their applications
Yvik Swan
30 May 2017
Data mining, big data visualization, data integration: modern approaches to basketball analytics
Marica Manisera and Paola Zuccolotto
April 24, 2017
Master of Statistical Data Analysis Day
March 31, 2017
Inference on the mode of weak directional signals: a Le Cam perspective on hypothesis testing near singularities
Davy Paindaveine and Thomas Verdebout
March 7, 2017
Skewness-Based Projection Pursuit
Prof. Nicola Loperfido
January 24, 2017

A quick overview of tail inference: from univariate to multivariate

David Veredas
Vlerick Business School, Belgium

Friday, December 8, 2017, 14h30
room A3, Campus Sterre, S9, Krijgslaan 281, 9000 Gent

Abstract:

This talk reviews the main procedures for estimating the tail index of a heavy tailed distribution – namely the so-called Hill estimator. We will start with basic notions of univariate extreme value theory, moving slowly towards inference for the tail index. Then we will enter into the multivariate case, where ellipticity plays a crucial role. We will review the family of elliptical probability distributions, followed by two classes of multivariate Hill estimators. The talk is primarily based on the paper “Multivariate Hill estimators” by Dominicy, Ilmonen, and Veredas (Intl. Stat. Review, 2017 85.1, p. 108–142).
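
As a rough illustration of the univariate starting point, the classical Hill estimator can be sketched in a few lines of Python. This is a minimal sketch, not taken from the talk or the paper: the function name hill_tail_index, the simulated Pareto sample and the choice k = 2000 are all assumptions made for the example. With order statistics X(1) >= ... >= X(n), the estimate of the tail index is the reciprocal of the mean of log(X(i) / X(k+1)) over the k largest observations.

```python
import math
import random


def hill_tail_index(sample, k):
    """Hill estimate of the tail index, using the k largest observations."""
    if not 0 < k < len(sample):
        raise ValueError("k must satisfy 0 < k < len(sample)")
    ordered = sorted(sample, reverse=True)
    threshold = ordered[k]  # the (k+1)-th largest observation
    if threshold <= 0:
        raise ValueError("the threshold observation must be positive")
    # Mean log-excess of the k largest observations over the threshold.
    mean_log_excess = sum(math.log(x / threshold) for x in ordered[:k]) / k
    return 1.0 / mean_log_excess


# Illustration on simulated Pareto data with a known tail index (assumed values).
rng = random.Random(0)
true_alpha = 3.0
sample = [rng.paretovariate(true_alpha) for _ in range(20000)]
alpha_hat = hill_tail_index(sample, k=2000)
print(f"true tail index: {true_alpha}, Hill estimate: {alpha_hat:.3f}")
```

In practice the estimate depends on k, so it is usually inspected over a range of k values rather than at a single one.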

Average effects based on regression models with log link: A new approach with stochastic covariates

Axel Mayer
Assistant Professor for Psychological Methods at RWTH Aachen University, Germany

Monday, November 27, 2017, 16h
room 3.1 (3rd floor) , Campus Dunant, Dunantlaan 2, 9000 Gent

Abstract:

Researchers often use a regression model with log link to evaluate the effects of a treatment or an intervention on a count variable. In order to judge the average effectiveness of the treatment on the original count scale, they compute so-called marginal or average treatment effects, which are defined as the average difference between the expected outcomes under treatment and under control. Current practice is to evaluate the expected differences at every observation and use the sample mean of these differences as point estimate of the average effect. Standard errors for average effects are then obtained using the delta method. This approach makes the implicit assumption that covariate values are fixed, i.e., do not vary across different samples. In this talk, we present a new way to analytically compute average effects based on regression models with log link and stochastic covariates and develop new formulas to obtain standard errors for the average effect. In a simulation study, we evaluate the statistical performance of our new estimator and compare it to the traditional approach. Our findings suggest that the new approach gives unbiased effect estimates and standard errors and outperforms traditional approaches in some conditions.
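
The contrast between the two ways of computing an average effect can be sketched as follows. This is a simplified illustration and not the speaker's formulas: it assumes a model log E[Y | X, Z] = b0 + b1*X + b2*Z with a binary treatment X, treats the coefficients as known (hypothetical values), assumes a normally distributed covariate Z, and ignores standard errors altogether. The function names are likewise made up for the example.

```python
import math
import random


def average_effect_fixed(b0, b1, b2, z_values):
    """Sample mean of exp(b0 + b1 + b2*z) - exp(b0 + b2*z) over observed z."""
    diffs = [math.exp(b0 + b1 + b2 * z) - math.exp(b0 + b2 * z) for z in z_values]
    return sum(diffs) / len(diffs)


def average_effect_normal(b0, b1, b2, mu, sigma):
    """Closed-form average effect when Z ~ N(mu, sigma^2).

    Uses E[exp(b2 * Z)] = exp(b2 * mu + 0.5 * b2**2 * sigma**2).
    """
    return (math.exp(b1) - 1.0) * math.exp(b0 + b2 * mu + 0.5 * b2 ** 2 * sigma ** 2)


# Hypothetical coefficients and covariate distribution, for illustration only.
b0, b1, b2 = 0.5, 0.3, 0.4
mu, sigma = 0.0, 1.0
rng = random.Random(1)
z_values = [rng.gauss(mu, sigma) for _ in range(50000)]

effect_fixed = average_effect_fixed(b0, b1, b2, z_values)
effect_normal = average_effect_normal(b0, b1, b2, mu, sigma)
print(f"sample-mean version: {effect_fixed:.4f}, closed form: {effect_normal:.4f}")
```

With a large sample the two point estimates agree closely; the difference between the approaches lies mainly in how the uncertainty about the covariate distribution enters the standard errors, which this sketch does not cover.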

Accelerating the convergence rate of Monte Carlo integration through ordinary least squares

Johan Segers
Université catholique de Louvain, Belgium

Friday, November 10, 2017, 14h30
room A3, Campus Sterre, S9, Krijgslaan 281, 9000 Gent

Abstract:

In numerical integration, control variates are commonly used to reduce the variance of the naive Monte Carlo method. The control functions can be viewed as explanatory variables in a linear regression model with the integrand as dependent variable. The control functions have a known mean vector and covariance matrix, and using this information or not yields a number of variations of the method. A specific variation arises when the control functions are centered and the integral is estimated as the intercept via the ordinary least squares estimator in the linear regression model. When the number of control functions is kept fixed, all these variations are asymptotically equivalent, with asymptotic variance equal to the variance of the error variable in the regression model. Nevertheless, the ordinary least squares estimator presents particular advantages: it is the only one that correctly integrates constant functions and the control functions. In addition, if the number of control functions grows to infinity with the number of Monte Carlo replicates, the ordinary least squares estimator converges at a faster rate than the Monte Carlo procedure, the integration error having a Gaussian limit whose variance can be estimated consistently by the residual variance in the regression model. An extensive simulation confirms the superior performance of the ordinary least squares Monte Carlo method for a variety of univariate and multivariate integrands and control functions.
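
The intercept-based estimator described above can be sketched for the simplest case of a single control function. The setup is an assumption made for the example: the integrand exp(x) on [0, 1], the centered control function x - 0.5, and the function name ols_control_variate are not from the talk, and the talk's interesting regime (a growing number of control functions) is not covered here.

```python
import math
import random


def ols_control_variate(f, h, draws):
    """Estimate the integral of f as the OLS intercept of f(x) on a centered control h(x)."""
    y = [f(x) for x in draws]
    c = [h(x) for x in draws]
    n = len(draws)
    y_bar = sum(y) / n
    c_bar = sum(c) / n
    sxx = sum((ci - c_bar) ** 2 for ci in c)
    sxy = sum((ci - c_bar) * (yi - y_bar) for ci, yi in zip(c, y))
    slope = sxy / sxx
    return y_bar - slope * c_bar  # intercept of the fitted regression line


def control(x):
    """Control function with known integral zero on [0, 1]."""
    return x - 0.5


rng = random.Random(2)
draws = [rng.random() for _ in range(2000)]

naive_estimate = sum(math.exp(x) for x in draws) / len(draws)
ols_estimate = ols_control_variate(math.exp, control, draws)
exact = math.e - 1.0
print(f"exact: {exact:.5f}, naive: {naive_estimate:.5f}, OLS: {ols_estimate:.5f}")
```

Applying ols_control_variate to an integrand of the form a + b * control(x) returns a up to rounding error, which illustrates the property that the OLS version integrates constants and the control functions correctly.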

Survival analysis with linked data

A.H. Zwinderman
Academisch Medisch Centrum, Universiteit van Amsterdam, The Netherlands

Thursday, October 12, 2017, 13h
room V2, Campus Sterre, S9, Krijgslaan 281, 9000 Gent

Abstract:

When follow-up data in national registries are linked to study data, records in the two files must be matched, and this introduces errors when a unique person identification number cannot be used. In this talk I will sketch this problem and different methods to analyze probabilistically linked data, with a focus on time-to-event data.

A hidden Markov approach to the analysis of cylindrical space-time series

Monia Ranalli
Roma Tre University, Italy

Tuesday, September 19, 2017, 16h
room V2, Campus Sterre, S9, Krijgslaan 281, 9000 Gent

Abstract:

Motivated by segmentation issues in marine studies, a new hidden Markov model is proposed for the analysis of cylindrical space-time series, that is, bivariate space-time series of intensities and angles. The model is based on a hierarchical mixture of cylindrical densities, where the parameters of the mixture components vary across space according to a latent Markov field, whereas the parameters of the latent Markov field evolve according to the states of a hidden Markov chain. It allows the data to be segmented into a finite number of latent classes that vary over time and across space and that represent the conditional distributions of the data under specific environmental conditions, simultaneously accounting for unobserved heterogeneity, spatial and temporal autocorrelation. Further, it parsimoniously accommodates specific features of environmental cylindrical data, such as circular-linear correlation, multimodality and skewness. Due to the numerical intractability of the likelihood function, parameters are estimated by a computationally efficient EM-type algorithm that iteratively alternates the maximization of a weighted composite likelihood function with weights updating. The effectiveness of the proposal is tested in a case study that involves speeds and directions of marine currents in the Gulf of Naples observed over time, where the model was able to cluster cylindrical data according to a finite number of latent classes varying over time that are associated with specific environmental conditions of the sea.

Some recent developments in Mendelian randomization

Steve Burgess
MRC Cambridge & Cambridge University, UK

Friday, August 11, 2017, 12h
room V1, Campus Sterre, S9, Krijgslaan 281, 9000 Gent

Abstract:

Mendelian randomization uses genetic variants to make causal inferences about the effect of a risk factor on an outcome. We consider two areas of recent methodological development in Mendelian randomization: methods for analysis of fine-mapping data, and mode-based estimation.
First, with fine-mapped genetic data, there may be hundreds of genetic variants in a single gene region any of which are candidate instrumental variables. However, using too many genetic variants in an analysis can lead to spurious estimates and inflated Type 1 error rates. But if only a few genetic variants are used, then the majority of the data is ignored and estimates are highly sensitive to the particular choice of variants. We propose an approach based on summarized data only (genetic association and correlation estimates) that uses principal components analysis to form instruments.
Second, mode-based estimation has been proposed as a robust approach to instrumental variable analysis. Informally, we take the causal estimate that is supported by the largest subset of genetic variants. This approach gives consistent estimation if some genetic variants are invalid instruments, but a plurality of genetic variants are valid instruments. Although a recent approach was proposed for mode-based estimation, it is very sensitive to the choice of bandwidth parameter, and it can be inefficient. We propose an alternative and asymptotically efficient estimation method based on Bayesian model averaging.

Statistical Challenges and Opportunities in Sport and Exercise Science

Marijke Welvaert
University of Canberra Research Institute for Sport and Exercise, Australian Institute of Sport

Wednesday, July 12, 2017, 13h
room Auditorium 4, Campus Dunant, Dunantlaan 2, 9000 Gent

Abstract:

Being an applied statistician involves providing support, guidance and tutoring to scientists who rely on statistics to advance their research. In sport and exercise science, this can be a challenging task as sport scientists have typically had limited exposure to statistics education and the field is dealing with big challenges like small sample sizes, small effect sizes and significant variability. During my presentation, I will highlight some particular challenges and opportunities for statisticians in this field. I will also present some examples of how the field is currently polarized by the introduction of "Magnitude-based inference" (MBI; Batterham & Hopkins, 2005) which eschews NHST and develops a new, yet unvalidated, framework for drawing inference from the magnitude of the effect with a focus on clinical significance. A recent review, and the only independent one thus far (Welsh & Knight, 2015), highlighted some major flaws with the approach and warned against using it. Despite this, MBI is still widely used by sport scientists, potentially because it is intuitive and supported by easy-to-use Excel spreadsheets. It has very vocal advocates as well as opponents, and it is an open debate how the field will move further with different journals promoting or banning the method.

Simple applications of Optimal Stopping in everyday life

Thomas Bruss
Université Libre de Bruxelles, Belgium

Thursday, June 22, 2017, 16h
room V2, Campus Sterre, S9, Krijgslaan 281, 9000 Gent

Abstract:

In this talk we will briefly explain the notion of optimal stopping but will then quickly look at real-life applications. We speak about decision problems most of us have encountered or will encounter one day. Life is sequential since time is sequential, hence some situations are risky because decisions may be difficult to change, such as getting married, selling a house, or buying a forest. Finding tractable models yielding convincing optimal solutions is often difficult, but some exceptions are worth remembering. We will concentrate on a few of those in which the speaker was involved, such as the 1/e-law (1984), the Odds-theorem (2000), (2002), the Last-arrival-Problem (2012), and on a new “basket strategy”. No special knowledge is required to follow the presentation.
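
To give a feel for where the number 1/e comes from, here is a small simulation of the classical best-choice ("secretary") problem with a known number of candidates. This is only the textbook setting, chosen as an assumption for the example; the speaker's 1/e-law concerns a more general model, and the function names and parameter values below are hypothetical. The rule is to observe roughly n/e candidates without choosing and then accept the first one that beats all of them.

```python
import math
import random


def picks_best(ranks, cutoff):
    """Skip the first `cutoff` candidates, then take the first one that beats them all."""
    best_seen = max(ranks[:cutoff], default=-1)
    for value in ranks[cutoff:]:
        if value > best_seen:
            return value == len(ranks) - 1  # True only if this is the overall best
    return False  # never stopped: the best candidate was in the observation phase


def success_rate(n, cutoff, trials, rng):
    """Fraction of random arrival orders in which the rule selects the best candidate."""
    wins = 0
    for _ in range(trials):
        ranks = list(range(n))
        rng.shuffle(ranks)
        wins += picks_best(ranks, cutoff)
    return wins / trials


n = 50
cutoff = round(n / math.e)
rate = success_rate(n, cutoff, trials=20000, rng=random.Random(3))
print(f"cutoff: {cutoff}, simulated success rate: {rate:.3f}, 1/e: {1 / math.e:.3f}")
```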

An Introduction to the Correspondence Analysis of Over-dispersed Data: The Freeman-Tukey Statistic

Eric Beh
University of Newcastle, Australia

Friday, June 9, 2017, 13h
room E1.015, Campus Coupure, Building E, Coupure links 653, 9000 Gent

Abstract:

Correspondence analysis is commonly used to graphically depict the association between two or more categorical variables. In the case of two variables, it involves the singular value decomposition of the standardised residuals. These residuals are a normalisation of the cell frequencies which are assumed to be Poisson random variables. However, there is extensive literature to suggest that the variance of the standardised residuals is not 1 and so variance stabilising procedures should be considered. There are several ways that the variance can be stabilised but we shall focus our attention on the role of the Freeman-Tukey transformation in correspondence analysis. In doing so, we shall explore how such an approach can be applied to sparse archaeological data from an excavation undertaken in the 1960s on an island just to the north of Sicily, Italy.
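
The two kinds of residuals can be compared on a small table. This is a sketch under assumptions: the counts are invented, the Freeman-Tukey residual is taken in the form 2 * (sqrt(n) - sqrt(e)), whose squares sum to the Freeman-Tukey statistic, and the exact variant used in the talk may differ. The singular value decomposition step of correspondence analysis is left out.

```python
import math


def expected_counts(table):
    """Expected cell counts under independence of rows and columns."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    return [[r * c / total for c in col_totals] for r in row_totals]


def pearson_residuals(table):
    """Standardised residuals (n - e) / sqrt(e); their squares sum to Pearson's chi-square."""
    expected = expected_counts(table)
    return [[(n - e) / math.sqrt(e) for n, e in zip(row, exp_row)]
            for row, exp_row in zip(table, expected)]


def freeman_tukey_residuals(table):
    """Residuals 2 * (sqrt(n) - sqrt(e)); their squares sum to the Freeman-Tukey statistic."""
    expected = expected_counts(table)
    return [[2.0 * (math.sqrt(n) - math.sqrt(e)) for n, e in zip(row, exp_row)]
            for row, exp_row in zip(table, expected)]


def sum_of_squares(residuals):
    return sum(r * r for row in residuals for r in row)


# Hypothetical sparse table of counts (rows: sites, columns: artefact types).
table = [[12, 0, 3],
         [1, 7, 0],
         [0, 2, 9]]
chi_square = sum_of_squares(pearson_residuals(table))
freeman_tukey = sum_of_squares(freeman_tukey_residuals(table))
print(f"Pearson chi-square: {chi_square:.2f}, Freeman-Tukey statistic: {freeman_tukey:.2f}")
```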

Stein operators and their applications

Yvik Swan
Université de Liège

Tuesday, May 30, 2017, 16h
room V2, Campus Sterre, S9, Krijgslaan 281, 9000 Gent

Abstract:

Stein operators are a collection of linear operators whose properties provide useful handles on otherwise intractable probability distributions. Their study began in the 1970s following the work of Charles Stein and his theory of “approximate computation of expectations”. We provide an overview of the most recent developments in the theory of Stein operators and discuss several applications (e.g. in quantum mechanics via the Many Worlds Theory, in number theory via convergence to the Dickman distribution, in statistics via sample quality assessment, goodness-of-fit testing or chi-square approximation, ...).

Data mining, big data visualization, data integration: modern approaches to basketball analytics

Marica Manisera and Paola Zuccolotto
University of Brescia, Italy

Monday, April 24, 2017, 13h
room V3, Campus Sterre, Krijgslaan 281, Building S9, 9000 Ghent

Abstract:

In the past few decades, the application of statistical thinking to sports has rapidly gained interest, as documented by the wide variety of scientific research on this theme and also by the publication of some insightful collections of statistical analyses applied to data from a broad range of sports.

In this context, an international project called "Big Data Analytics in Sports" (BDSports) was born at the University of Brescia, within the activities of the "Big & Open Data Innovation Laboratory". The BDSports project aims at creating a unique collaboration among experts in quantitative analysis in all kinds of sports.

In this talk, the focus is on basketball, for which, in the literature, several statistical techniques have been applied with a great variety of different aims, ranging from simply depicting the main features of a game by means of descriptive statistics to the investigation of more complex problems, such as forecasting the outcome of a game or a tournament, analyzing players' performance, studying the network of players' pathways from the in-bounds pass to the basket and their spatial positioning, or identifying optimal game strategies.

The range of possible questions that may be answered by statistical analysis is growing thanks to the on-line availability of large sets of play-by-play data and increasing computational power, allowing big data analysis. From a methodological point of view, the huge amount of raw data available on actions and strategies, combined with the absence of a sound theory explaining the relationships between the variables involved, makes these questions a challenge for data scientists.

In this framework, starting from the availability of different kinds of data (play-by-play data, data from sensors, data from psychological scales), in certain cases scraped from the web by R and Python procedures, we apply several data mining techniques with different aims, such as the measurement of the players’ performance, also in the presence of high-pressure conditions, the definition of players’ new roles, the analysis of players’ movements and space-time trajectories, the study of the network of players and their strategic relationships, and the study of relationships between performance and psychological profiles of the players.

Master of Statistical Data Analysis Day

Ghent University

Friday, March 31, 2017, 11h30
Coupure Campus, Building E, Room E2.009, Coupure Links 653, 9000 Gent

Abstract:

11:30-13:00: Pre-defense poster session
13:00-13:30: Sandwiches
13:30-14:15: "Reproducible Research" by Prof. Mariska Leeflang, Universiteit van Amsterdam
14:15-15:00: "Sports Analytics: A Historical Perspective and Current Directions" by Prof. Jesse Davis, KU Leuven
15:15-16:00: "Unfurling Manifolds" by Dr. Bart Van Rompaye, KBC Bank & Verzekering and alumnus UGent Master of Statistical Data Analysis
16:00-17:00: Alumni short presentations
17:00: Reception

Please confirm your attendance before March 26 via the link: https://webapps.ugent.be/eventManager/events/MaStatDay

Inference on the mode of weak directional signals: a Le Cam perspective on hypothesis testing near singularities

Davy Paindaveine and Thomas Verdebout
Université libre de Bruxelles

Tuesday, March 7, 2017, 16h - 17h30
Room V3, Krijgslaan 281-S9, Campus Sterre, 9000 Gent

Abstract:

We revisit, in an original and challenging perspective, the problem of testing the null hypothesis that the mode of a directional signal is equal to a given value. Motivated by a real data example where the signal is weak, we consider this problem under asymptotic scenarios for which the signal strength goes to zero at an arbitrary rate n. Both under the null and the alternative, we focus on rotationally symmetric distributions. We show that, while they are asymptotically equivalent under fixed signal strength, the classical Wald and Watson tests exhibit very different (null and non-null) behaviours when the signal becomes arbitrarily weak. To fully characterize how challenging the problem is as a function of n, we adopt a Le Cam, convergence-of-statistical-experiments, point of view and show that the resulting limiting experiments crucially depend on n. In the light of these results, the Watson test is shown to be adaptively rate-consistent and essentially adaptively Le Cam optimal. Throughout, our theoretical findings are illustrated via Monte-Carlo simulations. The practical relevance of our results is also shown on the real data example that motivated the present work.

The seminar is followed by a reception.

Skewness-Based Projection Pursuit

Prof. Nicola Loperfido
Urbino, Italy

Tuesday, January 24, 2017, 12h30
Room V1, Krijgslaan 281-S9, Campus Sterre, 9000 Gent

Abstract:

Projection pursuit is a multivariate statistical technique aimed at finding interesting low-dimensional data projections. More precisely, it looks for the data projection which maximizes the projection pursuit index, which is a measure of its interestingness. Skewness, that is, the third standardized cumulant, has some merits as a projection pursuit index. First, skewness is a valid projection pursuit index. Second, all distributions which are deemed uninteresting in the projection pursuit literature are symmetric. Third, for some distributions, projections maximizing skewness have a simple parametric interpretation. Finally, skewness is a mathematical concept with a simple graphical interpretation, and projection pursuit aims at making the most of visual perception as a tool for discovering data patterns. The talk will illustrate skewness-based projection pursuit from both the theoretical and the applied point of view.
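
The idea of using skewness as a projection pursuit index can be sketched in two dimensions with a brute-force search over directions. This is an assumption-laden toy version, not the speaker's method: the data are simulated (a skewed first coordinate and a symmetric second one), the search is a plain grid over angles, and the function names are made up for the example.

```python
import math
import random


def skewness(values):
    """Third standardized moment of a sample."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


def most_skewed_direction(points, n_angles=180):
    """Grid search over unit vectors (cos t, sin t) for the most skewed projection."""
    best_angle, best_skew = 0.0, -math.inf
    for i in range(n_angles):
        angle = 2.0 * math.pi * i / n_angles
        c, s = math.cos(angle), math.sin(angle)
        skew = skewness([c * x + s * y for x, y in points])
        if skew > best_skew:
            best_angle, best_skew = angle, skew
    return best_angle, best_skew


# Simulated data (assumed): exponential first coordinate, standard normal second one.
rng = random.Random(4)
points = [(rng.expovariate(1.0), rng.gauss(0.0, 1.0)) for _ in range(2000)]
angle, skew = most_skewed_direction(points)
print(f"most skewed direction: {math.degrees(angle):.0f} degrees, skewness: {skew:.2f}")
```

Because only the first coordinate is skewed in this setup, the search should recover a direction close to the first axis. A grid search does not scale beyond a few dimensions; it is used here only to keep the sketch short.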