# Quetelet seminars 2017

- 8 December 2017: A quick overview of tail inference: from univariate to multivariate (David Veredas)
- 27 November 2017: Average effects based on regression models with log link: A new approach with stochastic covariates (Axel Mayer)
- 10 November 2017: Accelerating the convergence rate of Monte Carlo integration through ordinary least squares (Johan Segers)
- 12 October 2017: Survival analysis with linked data (A.H. Zwinderman)
- 19 September 2017: A hidden Markov approach to the analysis of cylindrical space-time series (Monia Ranalli)
- 11 August 2017: Some recent developments in Mendelian randomization (Steve Burgess)
- 12 July 2017: Statistical Challenges and Opportunities in Sport and Exercise Science (Marijke Welvaert)
- 22 June 2017: Simple applications of Optimal Stopping in everyday life (Thomas Bruss)
- 9 June 2017: An Introduction to the Correspondence Analysis of Over-dispersed Data: The Freeman-Tukey Statistic (Eric Beh)
- 30 May 2017: Stein operators and their applications (Yvik Swan)
- 24 April 2017: Data mining, big data visualization, data integration: modern approaches to basketball analytics (Marica Manisera and Paola Zuccolotto)
- 31 March 2017: Master of Statistical Data Analysis Day
- 7 March 2017: Inference on the mode of weak directional signals: a Le Cam perspective on hypothesis testing near singularities (Davy Paindaveine and Thomas Verdebout)
- 24 January 2017: Skewness-Based Projection Pursuit (Prof. Nicola Loperfido)

## A quick overview of tail inference: from univariate to multivariate

David Veredas

Vlerick Business School, Belgium

Friday, December 8, 2017, 14h30

room A3, Campus Sterre, S9, Krijgslaan 281, 9000 Gent

### Abstract:

This talk reviews the main procedures for estimating the tail index of a heavy-tailed distribution, in particular the so-called Hill estimator. We will start with basic notions of univariate extreme value theory, moving slowly towards inference for the tail index. Then we will turn to the multivariate case, where ellipticity plays a crucial role. We will review the family of elliptical probability distributions, followed by two classes of multivariate Hill estimators. The talk is primarily based on the paper “Multivariate Hill estimators” by Dominicy, Ilmonen, and Veredas (International Statistical Review, 2017, 85(1), pp. 108–142).
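As a rough illustration of the univariate starting point, the following is a minimal sketch of the classical Hill estimator, assuming positive observations and a user-chosen number `k` of upper order statistics. The function name `hill_estimator`, the simulated Pareto sample, and the choice `k=2000` are illustrative assumptions and are not taken from the talk or the paper.

```python
import math
import random


def hill_estimator(sample, k):
    """Hill estimate of the tail index alpha from the k largest observations.

    Assumes all of the k+1 largest observations are strictly positive.
    """
    if not 1 <= k < len(sample):
        raise ValueError("k must satisfy 1 <= k < len(sample)")
    ordered = sorted(sample, reverse=True)
    threshold = ordered[k]  # the (k+1)-th largest observation
    if threshold <= 0:
        raise ValueError("the k+1 largest observations must be positive")
    # Mean log-excess over the threshold estimates gamma = 1 / alpha.
    gamma_hat = sum(math.log(x / threshold) for x in ordered[:k]) / k
    return 1.0 / gamma_hat


# Illustrative check on simulated Pareto data with a known tail index.
random.seed(0)
alpha_true = 3.0
sample = [random.paretovariate(alpha_true) for _ in range(20000)]
alpha_hat = hill_estimator(sample, k=2000)
print(f"true alpha = {alpha_true}, Hill estimate = {alpha_hat:.3f}")
```

In practice the estimate is sensitive to the choice of `k`, which trades bias against variance; the value used here is only convenient for an exactly Pareto sample.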

## Average effects based on regression models with log link: A new approach with stochastic covariates

Axel Mayer

Assistant Professor for Psychological Methods at RWTH Aachen University, Germany

Monday, November 27, 2017, 16h

room 3.1 (3rd floor), Campus Dunant, Dunantlaan 2, 9000 Gent

### Abstract:

Researchers often use a regression model with log link to evaluate the effects of a treatment or an intervention on a count variable. In order to judge the average effectiveness of the treatment on the original count scale, they compute so-called marginal or average treatment effects, which are defined as the average difference between the expected outcomes under treatment and under control. Current practice is to evaluate the expected differences at every observation and use the sample mean of these differences as the point estimate of the average effect. Standard errors for average effects are then obtained using the delta method. This approach makes the implicit assumption that covariate values are fixed, i.e., do not vary across different samples. In this talk, we present a new way to analytically compute average effects based on regression models with log link and stochastic covariates and develop new formulas to obtain standard errors for the average effect. In a simulation study, we evaluate the statistical performance of our new estimator and compare it to the traditional approach. Our findings suggest that the new approach gives unbiased effect estimates and standard errors and outperforms traditional approaches in some conditions.

## Accelerating the convergence rate of Monte Carlo integration through ordinary least squares

Johan Segers

Université catholique de Louvain, Belgium

Friday, November 10, 2017, 14h30

room A3, Campus Sterre, S9, Krijgslaan 281, 9000 Gent

### Abstract:

In numerical integration, control variates are commonly used to reduce the variance of the naive Monte Carlo method. The control functions can be viewed as explanatory variables in a linear regression model with the integrand as dependent variable. The control functions have a known mean vector and covariance matrix, and using this information or not yields a number of variations of the method. A specific variation arises when the control functions are centered and the integral is estimated as the intercept via the ordinary least squares estimator in the linear regression model. When the number of control functions is kept fixed, all these variations are asymptotically equivalent, with asymptotic variance equal to the variance of the error variable in the regression model. Nevertheless, the ordinary least squares estimator presents particular advantages: it is the only one that correctly integrates constant functions and the control functions. In addition, if the number of control functions grows to infinity with the number of Monte Carlo replicates, the ordinary least squares estimator converges at a faster rate than the Monte Carlo procedure, the integration error having a Gaussian limit whose variance can be estimated consistently by the residual variance in the regression model. An extensive simulation confirms the superior performance of the ordinary least squares Monte Carlo method for a variety of univariate and multivariate integrands and control functions.
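The regression view described above can be sketched in a few lines for the simplest case of a single centered control function. The integrand `exp(x)` on `[0, 1]`, the control function `x - 0.5`, the sample size, and the helper name `ols_control_variate` are all illustrative assumptions; the talk covers a far more general setting with many control functions.

```python
import math
import random


def ols_control_variate(f_values, h_values):
    """Intercept of the simple OLS regression of f_values on h_values.

    h_values are evaluations of a control function with known mean zero,
    so the fitted intercept serves as the estimate of the integral.
    """
    n = len(f_values)
    f_bar = sum(f_values) / n
    h_bar = sum(h_values) / n
    s_hh = sum((h - h_bar) ** 2 for h in h_values)
    s_hf = sum((h - h_bar) * (f - f_bar) for h, f in zip(h_values, f_values))
    slope = s_hf / s_hh
    return f_bar - slope * h_bar


# Illustrative example: integrate exp(x) over [0, 1]; the exact value is e - 1.
random.seed(1)
n = 10000
u = [random.random() for _ in range(n)]
f_values = [math.exp(x) for x in u]
h_values = [x - 0.5 for x in u]  # centered control function, integral zero

naive = sum(f_values) / n
ols = ols_control_variate(f_values, h_values)
exact = math.e - 1
print(f"exact = {exact:.5f}, naive MC = {naive:.5f}, OLS = {ols:.5f}")
```

Feeding the control function itself (or a constant) in as the integrand returns its exact integral up to rounding, which is the exactness property the abstract attributes to the ordinary least squares estimator.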

## Survival analysis with linked data

A.H. Zwinderman

Academisch Medisch Centrum, Universiteit van Amsterdam, The Netherlands

Thursday, October 12, 2017, 13h

room V2, Campus Sterre, S9, Krijgslaan 281, 9000 Gent

### Abstract:

When follow-up data in national registries are linked to study data, records in both data files must be matched, but this introduces errors when a unique person identification number cannot be used. In this talk I will sketch this problem and different methods to analyze probabilistically linked data, with a focus on time-to-event data.

## A hidden Markov approach to the analysis of cylindrical space-time series

Monia Ranalli

Roma Tre University, Italy

Tuesday, September 19, 2017, 16h

room V2, Campus Sterre, S9, Krijgslaan 281, 9000 Gent

### Abstract:

Motivated by segmentation issues in marine studies, a new hidden Markov model is proposed for the analysis of cylindrical space-time series, that is, bivariate space-time series of intensities and angles. The model is based on a hierarchical mixture of cylindrical densities, where the parameters of the mixture components vary across space according to a latent Markov field, whereas the parameters of the latent Markov field evolve according to the states of a hidden Markov chain. It allows the data to be segmented into a finite number of latent classes that vary over time and across space and that represent the conditional distributions of the data under specific environmental conditions, simultaneously accounting for unobserved heterogeneity and for spatial and temporal autocorrelation. Further, it parsimoniously accommodates specific features of environmental cylindrical data, such as circular-linear correlation, multimodality and skewness. Due to the numerical intractability of the likelihood function, parameters are estimated by a computationally efficient EM-type algorithm that iteratively alternates the maximization of a weighted composite likelihood function with the updating of the weights. The effectiveness of the proposal is tested in a case study that involves speeds and directions of marine currents in the Gulf of Naples observed over time, where the model was able to cluster cylindrical data according to a finite number of latent classes varying over time that are associated with specific environmental conditions of the sea.

## Some recent developments in Mendelian randomization

Steve Burgess

MRC Cambridge & Cambridge University, UK

Friday, August 11, 2017, 12h

room V1, Campus Sterre, S9, Krijgslaan 281, 9000 Gent

### Abstract:

Mendelian randomization uses genetic variants to make causal inferences about the effect of a risk factor on an outcome. We consider two areas of recent methodological development in Mendelian randomization: methods for analysis of fine-mapping data, and mode-based estimation.

First, with fine-mapped genetic data, there may be hundreds of genetic variants in a single gene region any of which are candidate instrumental variables. However, using too many genetic variants in an analysis can lead to spurious estimates and inflated Type 1 error rates. But if only a few genetic variants are used, then the majority of the data is ignored and estimates are highly sensitive to the particular choice of variants. We propose an approach based on summarized data only (genetic association and correlation estimates) that uses principal components analysis to form instruments.

Second, mode-based estimation has been proposed as a robust approach to instrumental variable analysis. Informally, we take the causal estimate that is supported by the largest subset of genetic variants. This approach gives consistent estimation if some genetic variants are invalid instruments, but a plurality of genetic variants are valid instruments. Although a recent approach was proposed for mode-based estimation, it is very sensitive to the choice of bandwidth parameter, and it can be inefficient. We propose an alternative and asymptotically efficient estimation method based on Bayesian model averaging.

## Statistical Challenges and Opportunities in Sport and Exercise Science

Marijke Welvaert

University of Canberra Research Institute for Sport and Exercise, Australian Institute of Sport

Wednesday, July 12, 2017, 13h

room Auditorium 4, Campus Dunant, Dunantlaan 2, 9000 Gent

### Abstract:

Being an applied statistician involves providing support, guidance and tutoring to scientists who rely on statistics to advance their research. In sport and exercise science, this can be a challenging task, as sport scientists have typically had limited exposure to statistics education and the field is dealing with big challenges like small sample sizes, small effect sizes and significant variability. During my presentation, I will highlight some particular challenges and opportunities for statisticians in this field. I will also present some examples of how the field is currently polarized by the introduction of "Magnitude-based inference" (MBI; Batterham & Hopkins, 2005), which eschews NHST and develops a new, yet unvalidated, framework for drawing inference from the magnitude of the effect with a focus on clinical significance. A recent review, and the only independent one thus far (Welsh & Knight, 2015), highlighted some major flaws with the approach and warned against using it. Despite this, MBI is still widely used by sport scientists, potentially because it is intuitive and supported by easy-to-use Excel spreadsheets. It has very vocal advocates as well as opponents, and it remains an open question how the field will move forward, with different journals promoting or banning the method.

## Simple applications of Optimal Stopping in everyday life

Thomas Bruss

Université Libre de Bruxelles, Belgium

Thursday, June 22, 2017, 16h

room V2, Campus Sterre, S9, Krijgslaan 281, 9000 Gent

### Abstract:

In this talk we will briefly explain the notion of optimal stopping but will then quickly look at real-life applications. We speak about decision problems most of us have encountered or will encounter one day. Life is sequential since time is sequential, hence some situations are risky because decisions may be difficult to change, such as getting married, selling a house, or buying a forest. Finding tractable models yielding convincing optimal solutions is often difficult, but some exceptions are worth remembering. We will concentrate on a few of those in which the speaker was involved, such as the 1/e-law (1984), the Odds-theorem (2000), (2002), the Last-arrival-Problem (2012), and on a new “basket strategy”. No special knowledge is required to follow the presentation.
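For readers unfamiliar with the Odds-theorem, the following is a minimal sketch of the odds algorithm for independent success indicators: sum the odds backwards from the last observation until the sum reaches 1, then stop on the first success from that index onwards. The function name `odds_strategy` and the classical secretary-problem example with 100 candidates are illustrative assumptions and not necessarily the examples used in the talk.

```python
import math


def odds_strategy(probs):
    """Odds algorithm for independent success indicators.

    probs[k-1] is the success probability of the k-th observation.
    Returns (s, win_probability): stop on the first success at a 1-based
    index >= s; win_probability is the chance that this is the last success.
    """
    odds_sum = 0.0   # running sum of odds p / (1 - p), accumulated backwards
    fail_prod = 1.0  # running product of failure probabilities (1 - p)
    for k in range(len(probs), 0, -1):
        p = probs[k - 1]
        if p >= 1.0:
            # A certain success has infinite odds: stop here and win
            # exactly when no later success occurs.
            return k, fail_prod
        odds_sum += p / (1.0 - p)
        fail_prod *= 1.0 - p
        if odds_sum >= 1.0:
            return k, fail_prod * odds_sum
    # The odds never sum to 1: start watching from the first observation.
    return 1, fail_prod * odds_sum


# Secretary problem: the k-th candidate is the best so far with probability 1/k.
n = 100
start, win_probability = odds_strategy([1.0 / k for k in range(1, n + 1)])
print(start, round(win_probability, 5), round(1 / math.e, 5))
```

For 100 candidates this gives a start index of 38, that is, the familiar rule of passing over the first 37 candidates and then accepting the first one who is better than all predecessors, with a success probability close to 1/e.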

## An Introduction to the Correspondence Analysis of Over-dispersed Data: The Freeman-Tukey Statistic

Eric Beh

University of Newcastle, Australia

Friday, June 9, 2017, 13h

room E1.015, Campus Coupure, Building E, Coupure links 653, 9000 Gent

### Abstract:

Correspondence analysis is commonly used to graphically depict the association between two or more categorical variables. In the case of two variables, it involves the singular value decomposition of the standardised residuals. These residuals are a normalisation of the cell frequencies, which are assumed to be Poisson random variables. However, there is extensive literature to suggest that the variance of the standardised residuals is not 1 and so variance-stabilising procedures should be considered. There are several ways that the variance can be stabilised, but we shall focus our attention on the role of the Freeman-Tukey transformation in correspondence analysis. In doing so, we shall explore how such an approach can be applied to sparse archaeological data from an excavation undertaken in the 1960s on an island just to the north of Sicily, Italy.
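To make the contrast between the two kinds of residuals concrete, here is a small sketch that computes Pearson standardised residuals and Freeman-Tukey residuals for a two-way table of counts. It assumes expected counts under independence and the textbook count-scale form of the Freeman-Tukey residual; the exact formulation used in the talk may differ, the singular value decomposition step is omitted, and the 2×2 table is made up for illustration.

```python
import math


def expected_counts(table):
    """Expected cell counts under independence of rows and columns."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


def pearson_residuals(table):
    """Standardised residuals (observed - expected) / sqrt(expected)."""
    expected = expected_counts(table)
    return [
        [(o - e) / math.sqrt(e) for o, e in zip(obs_row, exp_row)]
        for obs_row, exp_row in zip(table, expected)
    ]


def freeman_tukey_residuals(table):
    """Freeman-Tukey residuals sqrt(o) + sqrt(o + 1) - sqrt(4e + 1)."""
    expected = expected_counts(table)
    return [
        [math.sqrt(o) + math.sqrt(o + 1) - math.sqrt(4 * e + 1)
         for o, e in zip(obs_row, exp_row)]
        for obs_row, exp_row in zip(table, expected)
    ]


# Hypothetical 2x2 table of counts, used only to exercise the functions.
counts = [[10, 20], [30, 40]]
pearson = pearson_residuals(counts)
freeman_tukey = freeman_tukey_residuals(counts)
print(pearson[0][0], freeman_tukey[0][0])
```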

## Stein operators and their applications

Yvik Swan

Université de Liège

Tuesday, May 30, 2017, 16h

room V2, Campus Sterre, S9, Krijgslaan 281, 9000 Gent

### Abstract:

Stein operators are a collection of linear operators whose properties provide useful handles on otherwise intractable probability distributions. Their study began in the 1970s following the work of Charles Stein and his theory of “approximate computation of expectations”. We provide an overview of the most recent developments in the theory of Stein operators and discuss several applications (e.g. in quantum mechanics via the Many Worlds Theory, in number theory via convergence to the Dickman distribution, and in statistics via sample quality assessment, goodness-of-fit testing or chi-square approximation).

## Data mining, big data visualization, data integration: modern approaches to basketball analytics

Marica Manisera and Paola Zuccolotto

University of Brescia, Italy

Monday, April 24, 2017, 13h

room V3, Campus Sterre, Krijgslaan 281, Building S9, 9000 Ghent

### Abstract:

In the past few decades, the application of statistical thinking to sports has rapidly gained interest, as documented by the wide variety of scientific research on this theme and also by the publication of some insightful collections of statistical analyses applied to data from a broad range of sports.

In this context, an international project called "Big Data Analytics in Sports" (BDSports) was born at the University of Brescia, within the activities of the "Big & Open Data Innovation Laboratory". The BDSports project aims at creating a unique collaboration among experts in quantitative analysis across all kinds of sports.

In this talk, the focus is on basketball, for which, in the literature, several statistical techniques have been applied with a great variety of different aims, ranging from simply depicting the main features of a game by means of descriptive statistics to the investigation of more complex problems, such as forecasting the outcome of a game or a tournament, analyzing players' performance, studying the network of players' pathways from the in-bounds pass to the basket and their spatial positioning, or identifying optimal game strategies.

The range of possible questions that may be answered by statistical analysis is growing thanks to the on-line availability of large sets of play-by-play data and increasing computational power, allowing big data analysis. From a methodological point of view, the huge amount of raw data available on actions and strategies, combined with the absence of a sound theory explaining the relationships between the variables involved, makes these questions a challenge for data scientists.

In this framework, starting from the availability of different kinds of data (play-by-play data, data from sensors, data from psychological scales), in certain cases scraped from the web by R and Python procedures, we apply several data mining techniques with different aims, such as the measurement of the players’ performance, also in the presence of high-pressure conditions, the definition of players’ new roles, the analysis of players’ movements and space-time trajectories, the study of the network of players and their strategic relationships, and the study of relationships between performance and psychological profiles of the players.

## Master of Statistical Data Analysis Day

Ghent University

Friday, March 31, 2017, 11h30

Coupure Campus, Building E, Room E2.009, Coupure Links 653, 9000 Gent

### Abstract:

11:30-13:00: Pre-defense poster session

13:00-13:30: Sandwiches

13:30-14:15: "Reproducible Research" by Prof. Mariska Leeflang, Universiteit van Amsterdam

14:15-15:00: "Sports Analytics: A Historical Perspective and Current Directions" by Prof. Jesse Davis, KU Leuven

15:15-16:00: "Unfurling Manifolds" by Dr. Bart Van Rompaye, KBC Bank & Verzekering and alumnus UGent Master of Statistical Data Analysis

16:00-17:00: Alumni short presentations

17:00: Reception

Please confirm your attendance before March 26 via the link: https://webapps.ugent.be/eventManager/events/MaStatDay

## Inference on the mode of weak directional signals: a Le Cam perspective on hypothesis testing near singularities

Davy Paindaveine and Thomas Verdebout

Université libre de Bruxelles

Tuesday, March 7, 2017, 16h - 17h30

Room V3, Krijgslaan 281-S9, Campus Sterre, 9000 Gent

### Abstract:

We revisit, from an original and challenging perspective, the problem of testing the null hypothesis that the mode of a directional signal is equal to a given value. Motivated by a real data example where the signal is weak, we consider this problem under asymptotic scenarios for which the signal strength goes to zero at an arbitrary rate $\eta_n$. Both under the null and the alternative, we focus on rotationally symmetric distributions. We show that, while they are asymptotically equivalent under fixed signal strength, the classical Wald and Watson tests exhibit very different (null and non-null) behaviours when the signal becomes arbitrarily weak. To fully characterize how challenging the problem is as a function of $\eta_n$, we adopt a Le Cam, convergence-of-statistical-experiments, point of view and show that the resulting limiting experiments crucially depend on $\eta_n$. In the light of these results, the Watson test is shown to be adaptively rate-consistent and essentially adaptively Le Cam optimal. Throughout, our theoretical findings are illustrated via Monte-Carlo simulations. The practical relevance of our results is also shown on the real data example that motivated the present work.

The seminar is followed by a reception.

## Skewness-Based Projection Pursuit

Prof. Nicola Loperfido

Urbino, Italy

Tuesday, January 24, 2017, 12h30

Room V1, Krijgslaan 281-S9, Campus Sterre, 9000 Gent

### Abstract:

Projection pursuit is a multivariate statistical technique aimed at finding interesting low-dimensional data projections. More precisely, it looks for the data projection which maximizes the projection pursuit index, that is, a measure of its interestingness. Skewness, that is, the third standardized cumulant, has some merits as a projection pursuit index. First, skewness is a valid projection pursuit index. Second, all distributions which are deemed uninteresting in the projection pursuit literature are symmetric. Third, for some distributions, projections maximizing skewness have a simple parametric interpretation. Finally, skewness is a mathematical concept with a simple graphical interpretation, and projection pursuit aims at making the most of visual perception as a tool for discovering data patterns. The talk will illustrate skewness-based projection pursuit from both the theoretical and the applied point of view.