Statistical Modelling 19 (1) (2019), 74–101

Exploring and modelling team performances of the Kaggle European Soccer database

Maurizio Carpita
Department of Economics and Management,
University of Brescia,
Brescia,
Italy.
e-mail: maurizio.carpita@unibs.it

Enrico Ciavolino
Department of History, Society and Human Studies,
University of Salento,
Lecce,
Italy.


Paola Pasca
Department of History, Society and Human Studies,
University of Salento,
Lecce,
Italy.


Abstract:

This study explores a large, open database of soccer leagues in 10 European countries. Data on players, teams and matches covering seven seasons (from 2009/2010 to 2015/2016) were retrieved from Kaggle, an online platform on which big data are available for predictive modelling and analytics competitions among data scientists. Based on preliminary data analysis, experts’ evaluation and players’ positions on the football pitch, role-based indicators of teams’ performance were built and used to estimate the win probability of the home team with the binomial logistic regression (BLR) model. The model was then extended with the ELO rating predictor and two random effects arising from the hierarchical structure of the dataset.

The predictive power of the BLR model and its extensions was compared with that of other statistical modelling approaches (Random Forest, Neural Network, k-NN, Naïve Bayes). Results showed that role-based indicators substantially improved the performance of all the models used both in this work and in previous works available on Kaggle. The base BLR model increased prediction accuracy by 10 percentage points, and showed the importance of defence performances, especially in the later seasons. Including the ELO rating predictor and the random effects did not substantially improve prediction, as the simpler BLR model performed equally well. Among the other models, only Naïve Bayes showed more balanced results in predicting both win and no-win of the home team.
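As a minimal sketch of the kind of model the abstract describes, the snippet below maps a linear predictor to a home-win probability with the logistic function, and computes the standard Elo expected score. The abstract does not give the article's indicators, coefficients, or the way the ELO rating enters the model, so the function names, the inputs and all numeric values here are hypothetical.

```python
import math


def home_win_probability(indicators, coefficients, intercept=0.0):
    """Binomial logistic regression: P(home win) = 1 / (1 + exp(-eta)).

    `indicators` stands in for role-based team performance indicators;
    the actual predictors used in the article are not stated in the abstract.
    """
    eta = intercept + sum(b * x for b, x in zip(coefficients, indicators))
    return 1.0 / (1.0 + math.exp(-eta))


def elo_expected_score(home_rating, away_rating):
    """Standard Elo expected score for the home team (400-point scale)."""
    return 1.0 / (1.0 + 10 ** ((away_rating - home_rating) / 400.0))


# Hypothetical indicator values and coefficients, for illustration only.
p_home = home_win_probability([0.5, -0.2], [0.8, 0.3], intercept=-0.1)
e_home = elo_expected_score(1600, 1500)
```

In a fitted model the coefficients would be estimated from match data rather than set by hand; the sketch only shows how a linear predictor is turned into a probability.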

Keywords:

Kaggle European Soccer (KES) database; binomial logistic regression (BLR) model; role-based player performance indicators; prediction of match results; comparison of classification models; statistical learning models.

Downloads:

Code in zipped archive. For details regarding the dataset used in this article please contact the authors directly.