AvADArTeam DataStory

A project based on
the dataset of OpenFoodFacts

This project was done as part of a Data Analysis course and aims to visualize the countries' food consumption around the world.
We wanted to explore the geographical attributes of a region to better understand its food distribution and the link between the two.

_______________________

Introduction

Food consumption has always been evolving, in many different ways, in each part of the world. Nowadays, for the majority of countries, it has become an important part of their culture. Whether a trip is meant for holidays or for adventure, discovering the culinary specialties of a country is often part of discovering the culture of the country itself.

The first question we may ask is, what kind of food specialties can we eat in that country?

In our Data Story, we relate how we tried to study, and make relevant observations on, the different types of food that people in a country consume and their link with the country itself.

To frame our analysis, we have asked ourselves the following questions:

What kinds of nutriments are consumed in certain regions?
Is there a geographical interpretation of the distribution of those nutriments?
Are there countries that are relatively similar in terms of food consumption?

We hope that this project might help people know better what to expect when they are travelling and learn about new countries. We also wanted to make surprising discoveries about the relation between the food sold in a country and its geographical attributes, and to see whether some countries turn out to be similar even though we would not expect them to be.

_______________________

The Dataset


We decided to use the Open Food Facts dataset, which is in CSV format and is not too big (1.6 GB). It was therefore fine to work on it using only Pandas (with a larger dataset, it would have been more convenient to use Spark).

The dataset contains many pieces of information about food products, for example the name of each product, its ingredients, the countries in which it is sold and its nutrition properties, such as sugar, energy and so on.

We tried to extract useful information about what people eat in the different regions of the world and which main nutrition elements they consume.

Analysis of the data

Since the dataset has a lot of attributes, we decided to keep only the features we were interested in. To make the data easier to handle, we stored our dataset in two separate Data Frames:

(1) food: It contains the general information of a food product, like barcode, product_name, brands, countries, nutrition_grade, and much more.

Here is an example of what the Data Frame looks like:

(2) nutriments: It contains only the nutriments (per 100g) of a certain food product like sugar, energy, etc. Each row is identified by a code, corresponding to the code of a product in the first Data Frame.

Cleaning phase: What to keep?

Then, we did a first pass to filter out the food products that had no nutriments or useful information (by useful information, we mean that we kept the data that contains more than 60% of non-NaN values) and to normalize the terms so that they are all in English (some parts of the data were written in other languages). We only converted the country column and not product_name, because translating it may lead to weird product names.
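The filtering rule described above can be sketched as follows. This is a minimal illustration on toy data, not our actual cleaning code: it assumes that the 60% threshold is applied per column and that a product "has no nutriments" when all of its remaining value columns are NaN. The helper names and the toy column names are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the nutriments table; column names are illustrative only.
nutriments = pd.DataFrame({
    "code": ["1", "2", "3", "4", "5", "6"],
    "sugars_100g": [10.0, 5.0, np.nan, 30.0, 12.0, np.nan],
    "fat_100g": [1.0, np.nan, 3.0, 20.0, 0.5, np.nan],
    "zinc_100g": [np.nan, np.nan, np.nan, 0.1, np.nan, np.nan],
})

def keep_dense_columns(df, min_filled=0.6):
    """Keep the columns whose share of non-NaN values is above `min_filled`."""
    filled_share = df.notna().mean()  # per-column share of non-NaN values
    return df.loc[:, filled_share > min_filled]

def drop_rows_without_values(df, id_col="code"):
    """Drop the products for which every column except `id_col` is NaN."""
    value_cols = [c for c in df.columns if c != id_col]
    return df.dropna(subset=value_cols, how="all")

cleaned = drop_rows_without_values(keep_dense_columns(nutriments))
print(cleaned)
```

In this toy table, the sparse zinc_100g column is dropped first, and product 6, which then has no value left, is removed.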

This preprocessing significantly reduced the amount of data we had to handle and simplified the rest of our work. For the next tasks, we merged our two Data Frames into one on their code, so that the computations were much easier to perform, but we still kept the two separate Data Frames so we could work on tasks that needed only one of them.

Since the number of nutriments consumed in the world is large, we decided to keep only a subset of them, the ones that are the most interesting from our point of view:
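A merge of the two Data Frames on the product code could look like the sketch below. The toy rows are invented, and the inner join is an assumption: the text only says that the tables were merged by their code.

```python
import pandas as pd

# Toy versions of the two Data Frames; the rows are made up for illustration.
food = pd.DataFrame({
    "code": ["1", "2", "3"],
    "product_name": ["Muesli", "Cola", "Olive oil"],
    "countries": ["Switzerland", "France", "Spain"],
})
nutriments = pd.DataFrame({
    "code": ["1", "2", "4"],
    "sugars_100g": [20.0, 10.6, 0.0],
})

# Inner join (an assumption): keep only the products present in both tables.
# validate="one_to_one" raises an error if a code appears twice on either side.
merged = food.merge(nutriments, on="code", how="inner", validate="one_to_one")
print(merged)
```

Since merge returns a new Data Frame, food and nutriments stay available for the tasks that need only one of them.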

      * energy
      * fat
      * saturated fat
      * carbohydrates
      * sugars
      * proteins
      * salt
      * sodium

Their usefulness is explained later, in the Motivation part.

Computing our statistics

Our study is focused on the distribution of food and nutriments in the world. Hence, we had to find a way to represent the general amount of nutriments consumed in each country that would come close to real food consumption statistics.

For each task where we needed to attribute a value to a country (e.g., to state the general sugar consumption of that country), we chose to take the median value over the group of food products of that country, so that it would represent the general nutriment consumption of that country.

Our thinking is that the median is less sensitive to outliers than the mean. Our project wants to highlight the general trend of consumption in each country, and a single product with very unusual nutriment values could influence that trend a lot when computing the mean.

At a later point, we also compute clusters of countries to see how similar they are to each other. To do this, we used K-means and Principal Component Analysis to compute and visualize our results.
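To illustrate why the median is the more robust choice, here is a small sketch on invented values. It assumes one row per (product, country) pair; the country and sugars_100g column names are illustrative.

```python
import pandas as pd

# Invented sugar values; the 100.0 entry plays the role of an outlier product.
products = pd.DataFrame({
    "country": ["Switzerland"] * 5 + ["France"] * 3,
    "sugars_100g": [4.0, 5.0, 6.0, 7.0, 100.0, 3.0, 4.0, 5.0],
})

# One value per country: the median next to the mean, for comparison.
per_country = products.groupby("country")["sugars_100g"].agg(["median", "mean"])
print(per_country)
```

For the toy Switzerland group, the single outlier pulls the mean up to 24.4 while the median stays at 6.0, which is closer to what a typical product contains.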

Motivation

We wanted to make some observations on 3 main nutriments: sugar, energy & fat. We were particularly interested in knowing which countries tend to consume the most of those nutriments and in having a chance to interpret those results. This could give an insight into whether the population of a country tends to eat food higher in fat or sugar than the others, and let us see whether the results make sense given what we know. For example, we know that some countries have very fat-rich meals as food specialties, like hamburgers for the United States or fondue for Switzerland. Do those countries consume food as fatty as they are known for?

In contrast, we were also curious about the consumption of fruits & vegetables in the world. Knowing which countries tend to eat more green products, and seeing whether our knowledge of a country’s culture can explain the result, seemed interesting to us.

Finally, another question that came to our mind was whether there are countries that are more similar than we would have expected. The reason why we kept more nutriments in our Data Frame than the three we wanted to study is that we wanted to see which nutriment most influences the similarity between countries when computing it.

Our final goal is to visualize the answers to those questions, and to see if there are some relevant observations to make!

_______________________

Our observations

South is so sweet ~

The first observation we have made was about the quantity of sugar consumed in a country. We first computed the median values of sugar for all products in a certain country, sorted them and plotted those values in a histogram.

Here are our results:

If we take a look at the top 10 countries that consume the most products containing sugar:

      1. Saudi Arabia
      2. Colombia
      3. Malta
      4. Brazil
      5. Tunisia
      6. Monaco
      7. Lithuania
      8. Jordan
      9. Cyprus
     10. Morocco

1) We can notice some interesting regions. Saudi Arabia, Jordan and Cyprus are countries in the Middle East that seem to consume a high amount of sugar in general. In South America, Colombia and Brazil are also great sugar consumers, while around the Mediterranean we also find countries with a high sugar consumption that are fairly close to each other, like Malta, Tunisia, Monaco and Morocco. When looking at the top 10 sugarcane producers, we can for example see that Brazil is at the top of the list, followed by India and Colombia, which also rank among the top sugar consumers. This might partly explain our results.

2) If we take a closer look at the histogram, we can also observe that, in general, the products sold in European countries tend to contain less sugar per 100g than those sold elsewhere in the world.

3) Finally, some other countries seem to consume less sugar than we would have expected, like the United States. On the other hand, we also notice that in approximately 15% of the countries around the world, the median sugar content of the products is above 25g per 100g, a level comparable to the 25g of sugar per day often cited as a daily limit. This is huge, knowing that it is for only one product. This share goes up to more than 40% of the countries if we take the products with more than 10g of sugar. This seems to show that we eat too much sugar!

Let’s visualize the amount of consumed sugar in the world:

Looking at this map, we can make several observations:

1) As we have seen before, most of the countries listed in the top 10 do appear with a high sugar consumption on the map. Sugar consumption seems to be higher in the countries of the South (even if those are not necessarily the countries with the most entries in the dataset). Sugar is also very present in the Arab countries, maybe reflecting a food culture¹ ² that uses more sweet recipes.

2) Europe doesn’t have the most sugar in its products. This could maybe be explained by the fact that a lot of campaigning against sugar consumption is done in Europe, reducing its consumption a bit more than in other regions of the world.

3) There are a lot of missing values… They are mainly in the southern part of Africa. This could be explained by the fact that many of these countries are poor, and documenting the products sold there is not their main focus.

We are quite surprised that the United States, which is well known as the land of processed food, soda, and candy, has such a low sugar value. Rather than reflecting reality, this might be due to missing data, since OpenFoodFacts has more European contributors than contributors from other regions.

_______________________

Who knows 5 A Day ?

The next subject of our questioning was the consumption of fruits & vegetables. We were wondering which countries produce and sell the most fruits & vegetables. To answer this, we repeated the same process, but this time with the amount of vegetables instead of the sugar consumption:

We can notice that the top 5 countries that consume the most vegetables are mostly European countries or French overseas territories:

      1. Spain
      2. French Guiana
      3. Saint Pierre and Miquelon
      4. Costa Rica
      5. Germany

1) Again, our assumption is that in Europe, a lot of advertising is done to make people consume enough healthy food, including vegetables. Some people claim that fruits and vegetables taste better in Europe³, which could also be a reason for the higher consumption of this kind of food.

The Mediterranean diet is also well-known as one of the world’s healthiest⁴. This might be explained by the fact that its meals consist of fresh fruits and vegetables, cereals, and so on.

2) We also realize that this observation might be more biased than expected. Indeed, we might have missed some data about products that are vegetables and fruits, which would explain the low number of filtered products. It can also be explained by the fact that OpenFoodFacts seems to be more of a “European” thing, mostly because it was created by a French programmer in 2012. We also noticed the large number of European contributors to this dataset, which might explain some missing values for other countries and, on the other hand, the abundance of products for Europe.

If we take a look at the distribution on a map:

1) We see that the regions that consume the most vegetables are in Europe, especially around the Mediterranean Sea, which is consistent with the observation made on the histogram previously.

2) However, the results might miss some information, due to our data processing. We could maybe have found more food products that were vegetables if we had done more filtering and if the dataset had been more complete.

All of the explanations given here are just hypotheses based on our observations and computations, and we are totally aware that they might not reflect reality. However, we were quite happy to be able to produce some reasoning from this first interesting information :-)

_______________________

Fat = Energy ?

On the other side of healthy food consumption, our next observation focused on the energy and fat nutriments. We were interested in knowing which part of the world consumes the most fatty food and whether this could be explained by the country's geography or culture. We also wanted to take a look at the distribution of the energy nutriment in the world to see whether the two are correlated. Many people confuse these two terms, believing that they are the same. Let’s try to see how different they are, by first looking at which countries consume the most products containing each of those nutriments:

1) We first observe that in both cases, Costa Rica and Tunisia seem to be the countries that consume the most energy and the most fat.

2) Then, among the following countries, we can notice that Saudi Arabia stays in the Top 5 in both cases, which could maybe be explained by its high sugar consumption. The same reasoning could be applied to Colombia and Jordan.

3) More specifically, Switzerland is 3rd in fat consumption, with more than 30g of fat per 100g in the median food product, which is impressive! Maybe we eat too much fondue, raclette, and chocolate? :P

4) More surprisingly, the United States of America is not even in the Top 10! This could again be explained by the small amount of information about this country in the dataset, as we have seen before.

When looking at the world map version, we realized that it was quite difficult to extract any meaningful interpretation from it. We can indeed say that some countries visibly consume more energy and fat than others in both maps. However, the two maps do not match exactly.

We can say that fat and energy might be influenced by the same set of other nutriments (for example sugar), which could explain the results on the histograms. But the results were otherwise generally different (when looking at the sorted lists of countries), so the fat and energy nutriments are still very different from each other.

_______________________

Similar countries

After all these nutriment analyses, we were curious about the similarities between countries in terms of food consumption. For this purpose, we tried 2 different approaches to computing similarities between countries.

1. Similar countries with Switzerland

We first tried to visualize similarity from the point of view of Switzerland. Our idea was to find the countries that are the closest to Switzerland in terms of products sold.

We ended up with two kinds of visualization that use two types of normalization: on the left, the number of products in each country that are similar to those of Switzerland is divided by the total number of products sold in that country, while on the right, it is divided by the total number of products sold in Switzerland.
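The two normalizations can be sketched on toy data as follows. This sketch assumes that a product "similar to Switzerland" is simply a product code that is also sold in Switzerland; the function name, the dictionary layout and the product sets are hypothetical.

```python
def similarity_to_reference(products_by_country, reference="Switzerland"):
    """Score each country by the share of products it has in common with `reference`.

    `products_by_country` maps a country name to the set of product codes sold there.
    """
    ref = products_by_country[reference]
    scores = {}
    for country, products in products_by_country.items():
        if country == reference or not products:
            continue
        shared = len(products & ref)  # products sold in both countries
        scores[country] = {
            # "left map": divide by the number of products of the country itself
            "by_country_total": shared / len(products),
            # "right map": divide by the number of products of the reference
            "by_reference_total": shared / len(ref),
        }
    return scores

# Invented product sets: a large catalogue for France, a single product for Mali.
products_by_country = {
    "Switzerland": {"a", "b", "c", "d"},
    "France": {"a", "b", "c", "e", "f", "g", "h", "i", "j", "k"},
    "Mali": {"a"},
}
scores = similarity_to_reference(products_by_country)
print(scores)
```

In this toy case, the first normalization ranks Mali (1.0) above France (0.3) because Mali's only product is shared, while the second ranks France (0.75) above Mali (0.25) because a larger catalogue makes more shared products possible.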

These two maps are a bit confusing at first look. Indeed, on the left, Mali and Zimbabwe seem to be closer to Switzerland than even the European countries. This is surprising.

With this kind of normalization, countries that have a lot of products registered on the website are penalized a lot compared to the ones that have only a few.

On the right, France is the country that is the closest to Switzerland. This time, the result could be influenced by the fact that there are many more products sold in France than in any other country in the world. It is easier for France to have products in common with Switzerland than for the other countries.

2. Similar countries in the world

Then, we tried to find similarities between all countries in the world. For this purpose, we computed clusters of countries based on their food nutriments. For each country, each nutriment value was defined as the median over its products.

Before running K-means, we first tried to determine the ideal number of clusters using Principal Component Analysis. Since it was still hard to tell how many clusters we were supposed to take, we tried an automatic approach: we ran K-means for every number of clusters up to 50 and plotted the inertia (how far the points are spread around the center of their cluster) to visualize the best number of clusters to take.

When looking at the 1st plot, it seems that a good number of clusters is around 3, since beyond that point the sum of the distances to the centers of the different clusters no longer decreases significantly.
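This "elbow" search can be sketched as below. It is a minimal illustration that assumes scikit-learn's KMeans, whose inertia_ attribute is the sum of squared distances of the samples to their closest cluster center. The synthetic data, the inertia_curve helper and the smaller range of cluster counts are assumptions made for the example.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in for the per-country medians: three well-separated groups.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(20, 2)) for c in centers])

def inertia_curve(X, max_k):
    """Fit K-means for k = 1..max_k and return the inertia obtained for each k."""
    inertias = {}
    for k in range(1, max_k + 1):
        model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
        inertias[k] = model.inertia_
    return inertias

inertias = inertia_curve(X, max_k=8)
for k, value in inertias.items():
    print(k, round(value, 1))
```

On this synthetic data, the inertia drops sharply up to 3 clusters and only marginally afterwards, which is the kind of bend we look for in the plot.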

On the 2nd plot, we notice that the 1st component seems to have the most importance in the clustering of the values. This suggests that only one column has real importance in the clustering of the countries.

After some computation using PCA, we finally notice that the column that has the most impact is the energy column! This is not surprising if we look at the map of the world when we choose to use 3 clusters. The countries that are clustered together are indeed the same ones that have the same color on the energy map.

It is quite interesting to see which countries are close to each other in terms of nutriments (especially, as we saw, in terms of energy per 100g).

For example, Switzerland and France aren’t that similar… This is not surprising if we consider that in Switzerland, a majority of the people speak German and will thus be influenced by the culture of Germany more than by that of France. Furthermore, we notice that Italy is in the same cluster as Germany, and it also influences the food culture in Switzerland.

Another interesting piece of information that we get from this map is that Senegal has the same color as France. This could be explained by the fact that Senegal was under the influence of France for many years! This could then explain why those two countries consume the same types of product and thus similar amounts of each nutriment. Note that this is also influenced by the fact that many of the products registered on OpenFoodFacts come from France: when a person registers a product, they might also register all the countries in which it is sold, meaning that they will also register this product for Senegal.

Again, considering France, we notice that this country is close to Canada. A possible explanation is that a lot of the Canadian products on this website might come from Quebec, influencing the result for the entire country.

Finally, countries that are not next to each other can also be in the same cluster. This is for example the case for Colombia and India. This is hard to explain, but it is interesting to notice that those two countries are close to each other in terms of nutriments. As we have previously seen, both were high sugar consumers. This could show that they share a similar food culture! One possible explanation would be that in Colombia, a lot of people live in the mountains (Bogotá, the capital, is at 2640 meters above sea level). Thus, they might spend more energy (for example if the weather is cold), meaning that they need to consume more energy to compensate for the loss. In India, the weather is quite hot, so the people living there also spend a lot of energy (for example to cool the body), meaning that they also need to compensate for the loss of energy.
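One way to check which column drives the first component is to look at the explained variance ratio and at the loadings of that component. The sketch below assumes scikit-learn's PCA and uses synthetic values. The text does not say whether the columns were standardized, so the energy column is deliberately given a much larger scale here, which by itself is enough to make it dominate.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

# Synthetic per-country medians; energy is on a much larger scale than the rest.
rng = np.random.default_rng(0)
n = 50
medians = pd.DataFrame({
    "energy_100g": rng.normal(1000.0, 300.0, n),
    "fat_100g": rng.normal(10.0, 3.0, n),
    "sugars_100g": rng.normal(8.0, 2.0, n),
})

pca = PCA(n_components=2).fit(medians.to_numpy())

# Share of the total variance carried by each principal component.
explained = pca.explained_variance_ratio_

# Absolute loadings of the first component: how much each column weighs in it.
loadings = pd.Series(np.abs(pca.components_[0]), index=medians.columns)
dominant_column = loadings.idxmax()

print(explained)
print(loadings)
print(dominant_column)
```

If the goal is to compare the nutriments on an equal footing, standardizing the columns first (for instance with scikit-learn's StandardScaler) would prevent a single large-scale column from dominating in this way.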

We also managed to visualize the similarity using Hierarchical Clustering with a dendrogram. All products that are sold in the same country were gathered, and we computed the median value of each nutriment for each country.

In this dendrogram, we can see the three clusters. One is the green one, with a distance just below 1000, and the two others are the two red ones, with a distance below 1000. Note that the last 6 countries are quite different from the others. This difference suggests that a 4th cluster could be a good idea.
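A hierarchical clustering of per-country medians can be sketched with SciPy as follows. The Ward linkage, the placeholder country names and the values are assumptions made for the example; the text does not state which linkage method was used.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage

# Placeholder countries with invented (energy, fat) medians per 100g.
countries = ["A", "B", "C", "D", "E", "F"]
medians = np.array([
    [400.0, 5.0], [420.0, 6.0],
    [1000.0, 15.0], [1030.0, 14.0],
    [1800.0, 30.0], [1750.0, 28.0],
])

# Build the merge tree; "ward" is one common choice among several.
Z = linkage(medians, method="ward")

# Cut the tree so that at most 3 flat clusters remain.
labels = fcluster(Z, t=3, criterion="maxclust")

# no_plot=True returns the dendrogram layout without needing matplotlib.
tree = dendrogram(Z, labels=countries, no_plot=True)

print(dict(zip(countries, labels)))
print(tree["ivl"])  # leaf order along the dendrogram axis
```

Cutting the same tree with t=4 instead of t=3 is a cheap way to check whether a 4th cluster separates a distinct group of countries.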

Hence, we have also tried the same process as before, but this time with 4 clusters.

This time, we notice that the data seems to cluster a bit better than previously. This can be seen in the previous scatter plot, when you look at the group of green dots that are distant from the other dots.

This time, it is interesting to notice that in Europe the center is blue while the other countries are red. It is also interesting to notice that France, Switzerland and Germany are now in the same cluster, meaning that with 3 clusters those 3 countries were on the edge and were already nearly in the same cluster.

As before, Colombia, Saudi Arabia and India are close to each other in terms of energy, but as seen in all the previous maps, they were often very close (for sugar, energy and so on). So it is not that surprising to see them still so close.

_______________________

Conclusion

Based on our previous results, we have made the following observations:

* In general, we noticed that parts of the South tend to consume more sugar. We did not have enough data to state that with certainty, but we were able to spot some regions of the world that have a higher sugar consumption. When looking at the regions that produce the most sugar, we can see a trend that is consistent with our results.

* The regions around the Mediterranean Sea seem to sell more fruits and vegetables than other regions. This could be explained by the fact that European specialties use a lot of vegetables in their meals. We were however skeptical, because we did not manage to filter more data from our dataset. We also think that the results might be affected by the fact that the OpenFoodFacts dataset was mainly filled in by Europeans.

* We also tried to see the difference between fat and energy. We have seen that some countries that consume more of the energy nutriment than others also consume more fat. However, this is not the case for all of them, meaning that fat and energy are not necessarily correlated.

* Finally, when computing similarities between countries, we have noticed that most of the time, countries that are similar are generally situated in the same region. We surprisingly found countries that were far from each other but were considered as similar, with respect to a subset of nutriments (including the previous ones we analyzed).

We are well aware that our results might be influenced by the following factors:

* The dataset was mainly filled in by Europeans. Hence, results might deviate from reality because of that.

* Our methods to filter or cluster the data might not generate exact results. We might have missed some data. That would explain the inconsistency of some of our results.

However, we were still able to make some interesting and surprising discoveries about countries and their similarity through certain nutriments. Considering all those elements, we learned more about food consumption in the world and its link to the country in which the products are sold.

Working with that dataset was not always straightforward, but it gave us some experience that we are not going to forget. While we discovered how to treat data cautiously, we also learned a lot about how to handle it and visualize it in order to make interpretations :-)

_______________________

References

[1] Asked on Quora: “Why do people from Arabic countries consume so much sugar?”
[2] “Arab and Sephardi Pastries Are Too Sweet: Sugar, Power, Taste, and the Politics of Sweets” by Jonathan Katz, June 18 2016
[3] “Why fruits and vegetables taste better in Europe” by Julia Belluz, February 12 2016
[4] “Which countries have the healthiest diets?” by Kashmira Gander, April 6 2016

Project for Applied Data Analysis (CS-401)

Tell me what you eat,
I will tell you
where you are from
