Country Overviews Through News


“A good newspaper is a nation talking to itself.”
  Arthur Miller
With fast-paced technology, the world is becoming more and more connected.
Let's find out whether newspapers are still a nation talking to itself, or whether they are now a world talking to itself.

Countries By News


Here is a peculiar story of what people talk about with respect to who they are. We would like to present some interesting facts about 20 English-speaking countries, including the United States, Ireland, Australia, Canada, India, New Zealand, Sri Lanka, Singapore, Philippines, Ghana, Nigeria, Kenya, Hong Kong, Jamaica, Pakistan, Bangladesh, Malaysia and Tanzania, and how those facts connect to the trending topics that appeared in their online newspapers, blogs and magazines.

Newspaper and magazine articles cover more than events: they share what we are doing, what we are interested in and what matters to us as a society. Our aim in this project is to see how news content changes across countries over time, and whether there is a connection between country profiles and the news published. Our motivation is to understand the underlying profile of published news content: the topics of the news, the most commonly used words, and whether these change between countries or over time. We want to see whether country-specific information and published news are correlated, or whether a good newspaper is now a world talking to itself.

We processed and analysed the following datasets. The news data is taken from the NOW Corpus for a one-year period (November 2015 to October 2016). For more information, please visit the News section. Country facts are extracted from the CIA World Factbook for the same period. For more information, please visit the Countries section. To see the statistical and mathematical methods we used for data processing, topic modelling and correlations, please visit the Methods section.

NOW Corpus



The NOW Corpus (News on the Web) is composed of 6,979,691,862 words of data (about 7 billion), and it is growing by 160-170 million words per month, or about 1.6 billion words per year.

To create the corpus, scripts run every hour to get URLs for new magazine and newspaper articles from Google News, yielding about 9,000-10,000 new texts each day. Downloaded texts are cleaned with JusText (to remove boilerplate material), tagged and lemmatized, and then integrated into the existing relational database.

The dataset we have on the cluster covers 20 different English-speaking countries over about six years, from 2010 to October 2016. It contains lexicon, source, text and WLP (word, lemma, PoS tag) data. The data is in txt format, with a size of 5.9 million words (~200 GB).


WordsWorld

2016



We integrated the NOW Corpus with another dataset, the Factbook. Its facts are grouped per year; however, the data is not coherent or consistent across years. Since we intended to compare countries with each other in terms of news topics, we chose the year for which the Factbook had the most consistent records, which is 2016.

The graph on the left shows how many articles were collected per country of interest per month. We can observe several things. Tanzania, Kenya and Ghana have very few articles in every month compared to countries such as Canada, the USA, Great Britain and India. The data is therefore not evenly distributed, neither across months nor overall. It is important to keep in mind that for some countries, because the data is limited, the interpretations made here could be misleading with respect to the actual news media. According to our data, article counts are lower (paler colors) in the winter season for most countries.

FactBook



The World Factbook provides information on the geography, history, people, government, economy, communications, transportation, military, and transnational issues of 267 world entities. Among them, we selected the 20 countries whose internet media coverage exists in the NOW Corpus data. For each country, the Factbook provides more than 100 facts under different main topics. After examining these facts, we selected those that we could correlate with news topics.


Factbook Facts



  1. People and Society:
    • Population
    • Age structure
    • Median age
    • Population growth rate
    • Birth rate
    • Death rate
    • Sex ratio
    • Net migration rate
    • Life expectancy at birth
  2. Economy:
    • GDP - per capita (PPP)
    • Unemployment rate
    • Inflation rate (consumer prices)
  3. Energy:
    • Electricity - from other renewable sources
    • Carbon dioxide emissions from consumption of energy
  4. Communications:
    • Internet users

Trending Topics by Countries



In the topic distribution map, we show the most published topic in each country for each month. From this map we observed that, overall, the most frequently published topic for each country does not change very often. This indicates which topics each country mostly talks about, based on the data we have. For example, the USA and India talk mostly about Politics, whereas for Australia the topics vary between Economy, Tech and Entertainment/Art/Magazine: Economy dominates at the beginning of the year, while more Tech/Science articles are published in the second half of the year. For some countries the most published topic does not change from month to month. For example, in South Africa Sports is the most published topic throughout 2016.

Note: if the number of unique web sources collected for a country is limited, and those sources publish only on specific topics, the most frequent topic may appear not to change. This might be the case for South Africa, since it has a lower number of unique sources.

The following charts show the topics that appeared in the news and their percentages by country over the year.

Bias Attention!




The graph on the left shows how many unique websites were scraped to collect news data per country, together with each country's internet usage. As can be seen, for some countries such as Jamaica, Bangladesh and Tanzania, the number of unique websites is quite low, as is internet usage. The former especially introduces a large bias into the topic distribution of that country. For example, the dominant news topic in Tanzania appeared to be TECHNOLOGY/SCIENCE/SOCIAL MEDIA. However, this is because the websites from which the news was collected are mainly tech forums or blogs. Biases of this kind exist for every country, but their effects are not as severe as for Tanzania. As the number of unique sources increases, the bias from individual sources decreases, since together they are likely to cover the various aspects of the news better.
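A count of unique sources per country could be computed along the following lines. This is a minimal sketch and not necessarily the code used in the project: the `(country, url)` record layout, the helper name `unique_sources_per_country` and the sample URLs are all assumptions made for illustration.

```python
from collections import defaultdict
from urllib.parse import urlparse


def unique_sources_per_country(records):
    """Count distinct website hosts per country.

    `records` is an iterable of (country_code, article_url) pairs.
    This layout is an assumption, not the NOW Corpus source format.
    """
    hosts = defaultdict(set)
    for country, url in records:
        host = urlparse(url).netloc.lower()
        # Treat "www.site" and "site" as the same source.
        if host.startswith("www."):
            host = host[4:]
        hosts[country].add(host)
    return {country: len(names) for country, names in hosts.items()}


# Made-up sample records, not taken from the corpus.
sample = [
    ("TZ", "http://www.techblog.example/post/1"),
    ("TZ", "http://techblog.example/post/2"),
    ("US", "http://news-a.example/x"),
    ("US", "http://news-b.example/y"),
    ("US", "http://www.news-a.example/z"),
]
counts = unique_sources_per_country(sample)
print(counts)
```

A country whose articles all come from one or two hosts, like the made-up "TZ" entries here, is exactly the case where the topic distribution mostly reflects the sources rather than the country.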


Another point to keep in mind when reading this work is that the number of topics is not the same for all countries. Some have 7 topic classes (Bangladesh) whereas others have 11 (US); the count ranges from 7 to 13. This is not because a different topic model is used per country; rather, a topic found by the LDA model can be assigned to multiple classes. In general, 7 topics are used for the LDA model, whereas the 13 topic classes were created by us according to the words LDA found per topic. We examined the top 10 words of each topic found by LDA and then named the topic POLITICS, INTERNATIONAL, or both with some percentage. In this way we achieved the following:

  1. Alleviate the imbalance in the number of unique sources and articles per country
  2. Separate and identify topics better
  3. Make topic generalization and comparison easier
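The step of turning LDA topics into named classes "with some percentage" can be illustrated with a small sketch. It assumes each LDA topic has a share of the country's articles and a hand-assigned set of class weights summing to 1; the function name `class_shares` and all numbers are hypothetical.

```python
def class_shares(topic_shares, topic_labels):
    """Convert LDA topic shares into named news-class shares.

    topic_shares: {topic_id: fraction of the country's articles}
    topic_labels: {topic_id: {class_name: weight}}, where the weights
                  of each topic sum to 1 (hand-assigned by a human).
    """
    shares = {}
    for topic_id, share in topic_shares.items():
        for name, weight in topic_labels[topic_id].items():
            # A topic split between two classes contributes to both.
            shares[name] = shares.get(name, 0.0) + share * weight
    return shares


# Hypothetical numbers for one country with three LDA topics.
topic_shares = {0: 0.5, 1: 0.25, 2: 0.25}
topic_labels = {
    0: {"POLITICS": 1.0},
    1: {"POLITICS": 0.5, "INTERNATIONAL": 0.5},
    2: {"SPORTS": 1.0},
}
shares = class_shares(topic_shares, topic_labels)
print(shares)
```

Because one LDA topic can feed several classes and several topics can feed the same class, the number of named classes per country can differ from the number of LDA topics.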


The map below shows the most published news topic for each country over the full year of data we used. For each country, we show the five most frequent and meaningful words LDA found. These five words are what remain after excluding person names and less meaningful words, but each of them is within the 10-15 most frequent words for that topic.

Let's See Bets!




Our model gave fairly accurate and interesting results. For example, the most frequent topic in the US is POLITICS, which is very reasonable since there was an election at that time (2016); in the UK and Ireland SPORT news dominated the agenda, whereas in Hong Kong it was COMPANY/BUSINESS news. Reasonableness, however, was not the metric we used to evaluate the models' performance; we used the models' perplexity scores.
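For reference, perplexity on a held-out set of documents is commonly defined as below. This is the standard textbook form, given here as an assumption about what was measured; the exact quantity reported by a given library (for example, a variational bound) may differ.

```latex
\[
\mathrm{perplexity}(D_{\text{test}})
  = \exp\!\left( - \frac{\sum_{d=1}^{M} \log p(\mathbf{w}_d)}{\sum_{d=1}^{M} N_d} \right)
\]
```

Here $M$ is the number of held-out documents, $\mathbf{w}_d$ the words of document $d$, and $N_d$ its length in words. Lower perplexity means the model assigns higher probability to unseen text.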

Bias aside, some surprising results were also found, as in Pakistan's case. The second biggest slice of Pakistan's news belongs to INTERNATIONAL, or to be more specific, the coup attempt in Turkey. Feel free to play with the chart and reshape the country profiles in your mind. What people talk about is not always what we expect them to talk about.

Correlation Means Correlation


It's time to merge the two datasets and see whether any mind-blowing yet reasonable results can be found. We conducted pairwise Spearman correlation tests between each feature of the Factbook data and each topic found in the news data. Spearman was chosen over Pearson because it is better at detecting monotonic increases or decreases in our data. The results are shown below.

Significant and marginally significant correlations are marked according to the p-values of the test (<0.05 and ~0.05, respectively). We also treated as of interest those with z-scores above 0.25 or below -0.25, regardless of their significance.
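The pairwise test can be sketched in plain Python as follows. Spearman's coefficient is the Pearson correlation of the ranks, with tied values sharing their average rank. The function names and the sample numbers are hypothetical, and this sketch computes only the coefficient, not the p-value; in practice a library routine such as `scipy.stats.spearmanr` returns both.

```python
import math
from statistics import mean


def average_ranks(values):
    """Rank values from 1 to n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return float("nan")  # undefined when one variable is constant
    return cov / math.sqrt(var_x * var_y)


def spearman(x, y):
    return pearson(average_ranks(x), average_ranks(y))


def pairwise_spearman(facts, topics):
    """Correlate every fact column with every topic column."""
    return {
        (fact, topic): spearman(fact_values, topic_values)
        for fact, fact_values in facts.items()
        for topic, topic_values in topics.items()
    }


# Made-up values for five countries, in the same country order.
facts = {"gdp_per_capita": [1, 2, 3, 4, 5], "population": [5, 4, 3, 2, 1]}
topics = {"SPORTS": [10, 20, 25, 40, 100]}
result = pairwise_spearman(facts, topics)
print(result)
```

Note that the made-up SPORTS values grow non-linearly, yet the coefficient depends only on their order, which is the property that motivates a rank-based test here.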


Does migration lead to more political discussion?



POLITICS: percentage of political news per year for all news per country
NET MIGRATION RATE: difference between the number of immigrants and the number of emigrants per year, as a percentage of the population
CORRELATION: 0.503
P-value: 0.067*

Countries with higher net migration rates have more media coverage of politics. Issues related to immigrants might account for some subtopics of political news or debates in the receiving countries.

Are sports for wealthy people?



SPORTS: percentage of sports news per year for all news per country
GDP PER CAPITA: in dollars
CORRELATION: 0.50
P-value: 0.049***

As a country's wealth increases, the share of sports in its media coverage increases. Some people are really crazy about sports (football :)).

Are we playing or are we watching?



SPORTS: percentage of sports news per year for all news per country
INTERNET USERS: percentage of country population who has internet access
CORRELATION: 0.57
P-value: 0.021***

Countries with a high percentage of internet users also have a higher share of sports in their media coverage.

Young, wild & free
#SocialLife



SOCIAL_LIFE/DAILY: percentage of daily news per year for all news per country
AGE STRUCTURE 15-24: percentage of country population aged between 15 and 24
CORRELATION: 0.47
P-value: 0.0504*

The share of Social Life / Daily media coverage is higher in countries with a high percentage of young people.

When too many people...



ENTERTAINMENT/ART/MAGAZINE: percentage of magazine/art news per year for all news per country
POPULATION: country population
CORRELATION: -0.57
P-value: 0.034***

When a country's population is high, the country talks less about entertainment, art and magazine topics. It might be getting hard to find subjects that interest everybody, or India's huge population may be biasing the result.

Do men care more about the money?



ECONOMY: percentage of economy news per year for all news per country
SEX RATIO: male/female proportion
CORRELATION: 0.667
P-value: 0.007***

As the male/female ratio of the overall population increases, media coverage of the economy increases.

Is legal being swept under the carpet?



LEGAL/LAW: percentage of legal/law related news per year for all news per country
UNEMPLOYMENT RATE: percentage unemployment rate in the country
CORRELATION: -0.553
P-value: 0.049***

Countries in which the unemployment rate is high tend to have less media coverage of Legal/Law.

Take care of the elderly...



HEALTH/MEDICAL: percentage of health/medical related news per year for all news per country
POPULATION GROWTH RATE: annual percentage change in the country's population
CORRELATION: -0.63
P-value: 0.028***

Countries with lower population growth rates tend to have more media coverage of Health/Medical. If a low population growth rate is due to a large elderly population, then the news might be covering health topics that concern older people.

Methodology



This project was carried out in several steps.

  1. Data Exploration
  2. Data Source Analysis
  3. Topic Modelling
  4. Topic Assignment
  5. Correlation and Reasoning
  6. Data Story and Website

Data Exploration

We chose the News on the Web data among the datasets offered on the ADA cluster. Since we believe in data science for social good, we looked for an additional data source to produce more meaningful and insightful results, and so added the Factbook.

Data Source Analysis

The first step was to discover what we had and the best way to make use of it. In this step we examined whether the data was sufficient and which biases exist in it.

Topic Modelling

The data has a word-lemma-PoS (WLP) format, which is crucial for topic modelling. Since the data is too big to process on personal computers, we used Spark and its built-in libraries (the LDA model) on the ADA cluster. We constructed and evaluated multiple models with various parameters (number of topics, number of iterations, corpus pruning with multiple percentages, optimizers, etc.), ran them to get results, and then continued the work locally. To find out more about topic modelling, please visit our repository.
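One common reading of "corpus pruning with percentages" is document-frequency pruning: dropping lemmas that appear in too small or too large a fraction of the documents before fitting LDA. The sketch below illustrates that idea only; whether the project pruned in exactly this way is an assumption, and the function name, thresholds and toy documents are hypothetical.

```python
from collections import Counter


def prune_vocabulary(docs, min_doc_frac=0.0, max_doc_frac=1.0):
    """Keep only lemmas whose document frequency lies within the bounds.

    docs: list of documents, each a list of lemmas.
    min_doc_frac / max_doc_frac: fraction of documents a lemma must
    appear in (inclusive bounds) to be kept.
    """
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))  # count each lemma once per document
    keep = {
        lemma
        for lemma, count in doc_freq.items()
        if min_doc_frac <= count / n_docs <= max_doc_frac
    }
    return [[lemma for lemma in doc if lemma in keep] for doc in docs]


# Toy documents, already lemmatized.
docs = [
    ["say", "election", "vote"],
    ["say", "match", "goal"],
    ["say", "election", "party"],
    ["say", "market", "goal"],
]
pruned = prune_vocabulary(docs, min_doc_frac=0.5, max_doc_frac=0.75)
print(pruned)
```

In this toy example "say" is removed because it occurs in every document and carries no topical signal, while lemmas seen in only one document are removed as too rare.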

Topic Assignment

We ran our best model for every country; then, according to the words discovered by the model, we assigned human-understandable news classes to each topic. When in doubt, we checked actual articles on the web before deciding.

Correlation and Reasoning

We used the facts that we could interpret together with our topics, then merged both datasets. Before starting the reasoning, however, we also tried multiple categorizations of the Factbook data to check for effects such as Simpson's paradox and to prevent extreme fact values (e.g. the population of India) from biasing our results. To see the various categorizations, please visit our repository.
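One possible categorization is rank-based binning into tertiles, which places an extreme value in the same bucket as its nearest neighbours. The sketch below is illustrative only: the actual categories used are in the repository and may differ, and the function name, country labels and values are hypothetical.

```python
def tertile_labels(values_by_country):
    """Label each country low / medium / high by rank-based tertiles."""
    ordered = sorted(values_by_country, key=values_by_country.get)
    n = len(ordered)
    labels = {}
    for position, country in enumerate(ordered):
        bucket = position * 3 // n  # 0, 1 or 2
        labels[country] = ("low", "medium", "high")[bucket]
    return labels


# Made-up populations in millions; "F" plays the role of an outlier.
population = {"A": 5, "B": 30, "C": 60, "D": 200, "E": 210, "F": 1300}
labels = tertile_labels(population)
print(labels)
```

After binning, the outlier "F" is simply "high", alongside "E", so its magnitude can no longer dominate a comparison between groups.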

Data Story and Website

As a last step, we prepared a data story. The last couple of days allocated to the project were spent creating comprehensible and informative graphs and charts, as well as a website.

Conclusion



Overall, from the analysis of these two datasets, we can say that there are some meaningful correlations between country profiles and the news content published. For example, countries with higher net migration rates tend to publish more on Politics, and countries with a younger population profile lean towards publishing more Social Life and Daily news. Therefore, based on the data we have, we can conclude that the news does reflect some truth behind the saying "A good newspaper is a nation talking to itself", since there are some significant trends between the topics published and country profiles.

On the other hand, these correlations are based on limited data and may not exactly reflect the countries' behavior. Other limitations include the human interpretation involved in assigning topic names, and the fact that the number of countries is only 20, which is not sufficient to support a general claim such as "news is becoming more globalized with changing technology".

Team AdaGirls

Made with passion

Gorkem Camli

MSc Data Science

Nihal Ezgi Yuceturk

MSc Computer Science

Arzu Guneysu Ozgur

PhD Robotics

Project ADA Fall 2018


Ecole Polytechnique Fédérale de Lausanne, Switzerland
