Architecture behind the Qatar World Cup 2022 model

In this blog, we explain the steps we took to predict the winner of the Qatar World Cup 2022. The concepts covered include the types of source data, the machine learning model, and the architecture used to obtain our results. Finally, we interpret the results in greater depth.

Source data

As with any data science project, we began by determining which data we would need to predict the World Cup 2022 winner.

Given the large amount of available football data, and our goal of building a clear and understandable model, we limited ourselves to the following concepts:

  • Team performance based on previous games;
  • Player statistics/performance;
  • FIFA ranking.

This provided us with three distinct indicators (team, player and FIFA) for our model.

We loaded the source data into our data lake using APIs and web scraping. The datasets were then cleaned and transformed to ensure high data quality in the subsequent steps.

For example, we began with a dataset of historical football matches spanning the years 1972 to 2022. However, we did not train the model directly on this dataset; instead, we performed some important operations on it to improve prediction accuracy and reduce bias. We decided to exclude games played before 2008, as well as friendly games. Older games bias the model towards teams that were strong in the past but no longer are, while friendly games inflate the apparent strength of teams that frequently play against weaker opponents.
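The filtering described above can be sketched with pandas. The column names and rows below are hypothetical placeholders, not our actual dataset schema:

```python
import pandas as pd

# Toy stand-in for the historical match dataset (column names are assumed).
matches = pd.DataFrame({
    "date": pd.to_datetime(["2005-06-01", "2010-07-11", "2018-07-15", "2021-06-02"]),
    "tournament": ["Friendly", "FIFA World Cup", "FIFA World Cup", "Friendly"],
    "home_team": ["Brazil", "Spain", "France", "Italy"],
    "away_team": ["Argentina", "Netherlands", "Croatia", "San Marino"],
})

# Keep only competitive matches played from 2008 onwards.
filtered = matches[(matches["date"].dt.year >= 2008) &
                   (matches["tournament"] != "Friendly")]
```

Dropping the rows up front, rather than weighting them down, keeps the training set small and the model easy to reason about.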


From an architectural standpoint, we established a cross-cloud environment comprising AWS and Azure. The source data was collected from REST APIs and scraped from websites, then loaded into an AWS S3 data lake.

In the first step, we used AWS SageMaker to explore the data residing in S3 with some AutoML. We then set up a Databricks environment hosted on AWS to perform the heavy lifting of our data science computations.

The data was explored, transformed, and cleaned. This enabled us to train several models and determine which one performed the best.
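Comparing candidate models can be done with cross-validation. The sketch below is a minimal illustration using scikit-learn on synthetic data; the candidate models and dataset are assumptions for demonstration, not our actual training pipeline:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the cleaned match dataset.
X, y = make_classification(n_samples=300, n_features=6, random_state=1)

# Hypothetical shortlist of candidate models to compare.
candidates = {
    "random_forest": RandomForestClassifier(random_state=1),
    "logistic_regression": LogisticRegression(max_iter=1000),
}

# Mean 5-fold cross-validation accuracy per candidate.
scores = {name: cross_val_score(model, X, y, cv=5).mean()
          for name, model in candidates.items()}
best = max(scores, key=scores.get)
```

Cross-validation averages the score over several train/test splits, so the "best" model is less likely to be an artifact of one lucky split.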

We used Power BI and Streamlit, hosted on Azure via a Docker image, to visualize the data. Streamlit is a free, open-source app framework that makes it quick to build Python visualizations for displaying graphs and the results of data science models.

Figure 1: architecture setup in AWS and Azure

Data Science Model

We used a random forest model, a machine learning algorithm commonly used for regression and classification tasks. A random forest can be thought of as a collection of decision trees, each of which classifies the winner of a football game based on the values of the variables in the dataset. Each tree produces its own predicted outcome; the random forest then combines all of these predictions and selects the most frequently predicted outcome by majority voting. This makes it significantly more accurate and robust than a single decision tree.
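A minimal sketch of this idea with scikit-learn, on synthetic data standing in for the real match features (the three-class outcome encoding is an assumption for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Toy stand-in for the match dataset: each row is one fixture, the label is
# the outcome (0 = home win, 1 = away win, 2 = draw).
X, y = make_classification(n_samples=500, n_features=8, n_informative=5,
                           n_classes=3, random_state=42)

# An ensemble of 100 decision trees; the forest aggregates their votes.
forest = RandomForestClassifier(n_estimators=100, random_state=42)
forest.fit(X, y)

# The forest's prediction for one fixture, aggregated across all trees.
prediction = forest.predict(X[:1])
```

Because each tree sees a different bootstrap sample and feature subset, their errors partly cancel out when the votes are combined.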

We built form features for our random forest model because we believe that a team's form leading up to a tournament is an important factor in predicting the winner. Italy is a prime example, having won all ten of its qualifying matches with a goal difference of 33.
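One way to build such form features is a rolling window over a team's recent matches. The sketch below is a simplified illustration with hypothetical columns, not our exact feature engineering; the `shift(1)` ensures a match never sees its own result:

```python
import pandas as pd

# Hypothetical match log for one team (most recent match last).
results = pd.DataFrame({
    "team": ["Italy"] * 5,
    "goals_for": [3, 2, 1, 2, 5],
    "goals_against": [0, 0, 1, 0, 0],
})
results["won"] = (results["goals_for"] > results["goals_against"]).astype(int)

# Rolling form over the last 3 matches, shifted so each row only uses
# results from matches played before it.
results["win_form"] = (results["won"].shift(1)
                       .rolling(window=3, min_periods=1).mean())
results["scoring_form"] = (results["goals_for"].shift(1)
                           .rolling(window=3, min_periods=1).mean())
```

Analogous rolling features can capture conceding form, and the window length is a tunable choice.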

Interpretation of results

Our model uses more than 20 features and determines the importance of each one.

For example:

  • FIFA ranking;
  • Conceding/scoring/winning form based on previous matches;
  • Offense/Midfield/Defense/Goalkeeper strength based on player statistics.
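Feature importances fall out of the trained random forest directly. The sketch below shows how they can be ranked; the feature names are hypothetical stand-ins for our real indicators, and the data is synthetic:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Hypothetical feature names standing in for our real indicators.
feature_names = ["fifa_rank_diff", "win_form", "scoring_form",
                 "conceding_form", "offense_strength", "defense_strength"]

# Synthetic stand-in for the real training data.
X, y = make_classification(n_samples=400, n_features=6, n_informative=4,
                           random_state=0)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Impurity-based importances, normalised so they sum to 1.
importances = dict(zip(feature_names, forest.feature_importances_))
ranked = sorted(importances.items(), key=lambda kv: kv[1], reverse=True)
```

These importance weights explain why two teams can lead in an equal number of features yet receive different win probabilities.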

When all features are used, our model predicts that France will defeat Argentina in the World Cup final. Looking at a subset of the features in the screenshot below, France and Argentina each lead in four features and draw in one. However, because the features carry different importance levels, France is declared the tournament winner.

See our Streamlit app for a more detailed overview of our different models and the importance of the features:

Figure 2: tournament tree and comparison France – Argentina

The majority of our model’s predictions are straightforward and comparable to betting agency odds, with the exception of Senegal and Denmark, which our model overestimates, and the Netherlands and England, which it underestimates.

It will be interesting to make a comparison based on the actual tournament results to see how well our model performed.

Written by

Aivix team effort: Arne Vanhoof, Tom Thevelein and Kasimir Putseys