“A good newspaper is a nation talking to itself.”
Arthur Miller
With fast-paced technology, the world is becoming more and more connected.
Let's find out if newspapers are still a
Here is a peculiar story of what people talk about in relation to who they are. We would like to present some interesting facts about 20 English-speaking countries, including the United States, Ireland, Australia, Canada, India, New Zealand, Sri Lanka, Singapore, Philippines, Ghana, Nigeria, Kenya, Hong Kong, Jamaica, Pakistan, Bangladesh, Malaysia and Tanzania, and how those facts connect to the trending topics that appeared in their online newspapers, blogs and magazines.
In newspaper and magazine articles we cover more than events: we share what we are doing, what we are interested in and what matters to us as a society. Our aim in this project is to see how news content changes across countries over time and whether there is a connection between country profiles and the news published. Our motivation is to understand the underlying profile of published news content: the topics of the news, the most commonly used words, and whether these change between countries or over time. We want to see whether country-specific information and published news are correlated, or whether a good newspaper or article is now a world talking to itself.
We processed and analysed the following data sets. News data is taken from the NOW Corpus for a one-year period (November 2015 to October 2016). For more information, please visit the News section. Country facts are extracted from the CIA World Factbook, covering the same time period. For more information, please visit the Countries section. To see the statistical and mathematical methods we used for data processing, topic modelling and correlations, please visit the Methods section.
The topic distribution map shows the most published topic in each country for each month.
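As a rough illustration of how "the most published topic per country per month" could be computed, here is a minimal sketch. It assumes each article has already been given a single topic label. The records, country codes and the helper name `top_topic_per_month` are hypothetical and do not come from the project's data.

```python
from collections import Counter, defaultdict

# Hypothetical records: (country, month, topic label assigned to one article).
articles = [
    ("US", "2016-01", "Politics"),
    ("US", "2016-01", "Politics"),
    ("US", "2016-01", "Sports"),
    ("AU", "2016-01", "Economy"),
    ("AU", "2016-07", "Tech/Science"),
    ("AU", "2016-07", "Tech/Science"),
    ("AU", "2016-07", "Economy"),
]


def top_topic_per_month(records):
    """Return {(country, month): most frequent topic} for the given records."""
    counts = defaultdict(Counter)
    for country, month, topic in records:
        counts[(country, month)][topic] += 1
    # most_common(1) returns a list holding one (topic, count) pair.
    return {key: counter.most_common(1)[0][0] for key, counter in counts.items()}


top_topics = top_topic_per_month(articles)
```

Each `(country, month)` key in `top_topics` then maps to the single label that would be drawn on the map for that month.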
From the topic distribution map for each country over time, we observed that,
overall, the most frequently published topic for each country does not change very often.
This shows which topics each country mostly talks about, based on the data we have.
For example, the USA and India talk mostly about Politics, whereas for Australia the topics
vary between Economy, Tech and Entertainment/Art/Magazine: Economy dominates in
the first part of the year, and Tech/Science articles become more common in the second part.
For some countries, the most published topic does not change at all from month to month.
For example, in South Africa, Sports is always the most published topic for 2016.
Note: if the data for a country was collected from only a few unique web resources,
and those resources focus on specific topics, the most frequent topic may appear
not to change. This might be the case for South Africa, since it has a lower number of unique resources,
if those resources also publish only on specific topics.
Another point to keep in mind when analysing this work is that the number of topics is not the same for all
countries. It ranges from 7 to 13: some countries have 7 topic classes (Bangladesh), whereas others have 11 (US).
This is not because a different topic model is used per country, but because topics found by the LDA model can be
assigned to multiple classes. In general, the LDA model uses 7 topics, whereas we created 13 topic classes according
to the words LDA found for each topic. We examined the top 10 words of each topic found by LDA and then named the
topic POLITICS, INTERNATIONAL, or both with some percentage.
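One way to picture the "both with some percentage" labelling is the sketch below. It assumes each LDA topic is mapped by hand to one or more named classes with shares that sum to 1, and that an article's LDA topic weights are spread over the classes in proportion to those shares. The mapping, the shares and the function name `class_scores` are hypothetical illustrations, not the project's actual assignments.

```python
# Hypothetical hand-made mapping: LDA topic index -> {class name: share}.
# The shares for each LDA topic sum to 1.
topic_to_classes = {
    0: {"POLITICS": 0.6, "INTERNATIONAL": 0.4},
    1: {"SPORTS": 1.0},
    2: {"ECONOMY": 0.7, "TECH": 0.3},
}


def class_scores(lda_weights):
    """Spread an article's LDA topic weights over the named classes."""
    scores = {}
    for topic_id, weight in lda_weights.items():
        for name, share in topic_to_classes[topic_id].items():
            scores[name] = scores.get(name, 0.0) + weight * share
    return scores


# Hypothetical LDA output for one article: topic index -> probability.
article_weights = {0: 0.5, 1: 0.1, 2: 0.4}
scores = class_scores(article_weights)
dominant_class = max(scores, key=scores.get)
```

Because one LDA topic can feed several named classes, a model with 7 topics can yield more than 7 classes, and a country only shows the classes that its topics actually map to.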
In this way we achieved the following:
The map below shows the most published news topic for each country over the full
one-year data set. For each, we share the five most frequent and meaningful
words that LDA found. These five words remain after excluding person names and less
meaningful words, but each of them is among the 10 to 15 most frequent words for that topic.
It is time to merge the two data sets and see whether they yield mind-blowing yet reasonable results. We conducted a pairwise Spearman correlation test between each feature of the Factbook data and each topic found in the news data. Spearman was chosen over Pearson because it is better at detecting monotonic increases or decreases in our data. Results are shown below.
Significant and marginally significant correlations are marked according to the p-values of the test (<0.05 and ~0.05, respectively). We also treated correlations with z-scores above 0.25 or below -0.25 as being of interest, regardless of their significance.
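To make the pairwise test more concrete, here is a minimal standard-library sketch of Spearman's rank correlation: rank both variables, giving ties their average rank, then take the Pearson correlation of the ranks. All feature names and values are hypothetical, and the sketch computes only the coefficient. In practice a library routine such as `scipy.stats.spearmanr` also returns the p-value.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1 to n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Pearson correlation of the ranks; assumes neither input is constant."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sqrt(sum((a - mx) ** 2 for a in rx))
    sy = sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)


# Hypothetical per-country values (one entry per country, same order in each list).
features = {
    "net_migration": [-3.0, -0.5, 1.2, 4.9, 8.1],
    "median_age": [40.0, 35.0, 30.0, 25.0, 20.0],
}
topics = {"Politics": [0.10, 0.12, 0.20, 0.35, 0.90]}

# One coefficient for every (feature, topic) pair.
results = {
    (f, t): spearman_rho(fv, tv)
    for f, fv in features.items()
    for t, tv in topics.items()
}
```

In this toy data the Politics share rises with net migration in a monotonic but non-linear way, which is the kind of relationship that rank correlation captures fully and a linear measure would understate.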
Overall, from the analysis of these two data sets we can say that there are some meaningful correlations between
country profiles and the news content published. For example, countries
with higher net migration rates tend to publish more on Politics, and countries with younger
populations lean towards publishing more on Social Life and Daily news.
We can therefore conclude that, in the data we have, the news may reflect
some truth behind the saying "A good newspaper is a nation talking to itself", since there are
some significant trends between the topics published and country profiles.
On the other hand, these correlations are based on limited data and may not exactly reflect each country's behaviour. Other limitations are the human interpretation involved in assigning topic names and the small number of countries: with only 20, there is not enough data to make a general claim such as that news is becoming more globalized with changing technology.
MSc Data Science
MSc Computer Science
PhD Robotics