A Movie behind a Script

Martin Esguerra, Adrian Guerra, Xabier Rubiato
Applied Data Analysis @ EPFL

Abstract

We have come a long way from the public screening of ten of the Lumière brothers' short films in Paris on 28 December 1895 to Avatar, a movie whose development cost exceeded 200 million dollars and which grossed over 2 billion dollars, the highest of all time. From a historical point of view, movies have acted as a reflection of our society and our culture, and during all this time, scripts have remained the cornerstone of a movie. What script must one write for a movie to prosper? We have data spanning nearly 100 years and over 4,000 movies to try to solve this puzzle. Using the OpenSubtitles dataset and the IMDb dataset to analyze movie scripts and measure their popularity, we hope to provide better insight into what makes a good and a bad movie.

Foreword

Soon after the invention of film, efforts were made to convey actors' dialogue to the audience. Subtitles first took the form of intertitles: texts drawn or printed on paper, filmed and placed between sequences of the film. By the 1930s, around the time sound film became prevalent, subtitles shown on screen at the same time as the moving picture were finally patented. Filmmakers would stamp subtitles directly onto film strips, an early step toward the digitized subtitles of the future. It wasn't until the late 1980s that advances in technology made it possible for entire subtitle systems to be downloaded onto a computer. Nowadays, closed captioning and subtitling allow English-language movies to be enjoyed in many countries across the world, and English-speaking countries to enjoy films made in other languages. Moreover, independently made films have grown in popularity over the past couple of decades, and platforms like YouTube let users add subtitles to their video content using machine-generated automatic captions. Our data reflects this evolution of subtitles. Before diving into our analysis, we give a short overview of our dataset to expose readers to some of the challenges we were confronted with. Note that we avoid implementation details here; readers can refer to our source code for a more detailed treatment.

OpenSubtitles2018 dataset

The OpenSubtitles2018 dataset consists of 34.5 GB of data. For details on how the data was gathered, see OpenSubtitles2016: Extracting Large Parallel Corpora from Movie and TV Subtitles. For the purposes of our analysis, we drop TV show episodes, movies with invalid IMDb identifiers, and superfluous metadata.
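
For concreteness, here is a minimal cleaning sketch in Python. It assumes the XML metadata has already been flattened into a CSV; the column names (imdb_id, season, episode) are hypothetical stand-ins for whatever the parsed metadata exposes:

    import pandas as pd

    subs = pd.read_csv("opensubtitles_metadata.csv")  # hypothetical flattened metadata

    # Keep feature films only: TV episodes carry non-zero season/episode
    # fields (assumed naming).
    subs = subs[(subs["season"] == 0) & (subs["episode"] == 0)]

    # Drop rows whose IMDb identifier is missing or malformed (IMDb ids are numeric).
    subs = subs[subs["imdb_id"].astype(str).str.fullmatch(r"\d+")]

    # Discard metadata fields we never use downstream.
    subs = subs.drop(columns=["uploader", "encoding"], errors="ignore")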

IMDb dataset

IMDb is the largest online database of information related to films, including cast, production crew, personnel biographies, plot summaries, trivia, and fan reviews and ratings. To measure a movie's popularity, we retrieve its average rating from the publicly available IMDb datasets.
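
IMDb publishes ratings as title.ratings.tsv in its public datasets, with columns tconst, averageRating and numVotes. A minimal joining sketch, continuing the hypothetical subs DataFrame from the previous sketch:

    import pandas as pd

    ratings = pd.read_csv("title.ratings.tsv", sep="\t")

    # OpenSubtitles stores the numeric part of the IMDb identifier, so we
    # normalize it to IMDb's zero-padded "tt" form before joining.
    subs["tconst"] = "tt" + subs["imdb_id"].astype(str).str.zfill(7)
    movies = subs.merge(ratings[["tconst", "averageRating"]], on="tconst")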

Methodology

To build a predictor, our analysis follows these steps:

  1. Data cleaning and selection.
  2. Data exploration.
  3. Feature extraction.
  4. Regression implementation.
  5. Application of NLP tools to extract additional information.
  6. Conclusion.

Exploration

Time

After data cleaning, we are left with 4,127 movie subtitles, distributed across time as follows.

We see from our plot that our data is not evenly distributed: most of our subtitles come from recent films, which makes an accurate time analysis complicated. The dataset is also far from complete; some years have no movies at all, with significant gaps in the 1970s and 1980s. The following table confirms that more than half of our data points are from the 2000s and that we have very few movies from before 1960.

Period      Count   Average Rating
1910-1959     379             7.58
1960-1999    1350             6.83
2000s        2495             6.47
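
A sketch of how such a breakdown can be computed with pandas, assuming a (hypothetical) year column in our movies DataFrame:

    import pandas as pd

    bins = [1910, 1960, 2000, 2019]            # right-open period boundaries
    labels = ["1910-1959", "1960-1999", "2000s"]
    movies["period"] = pd.cut(movies["year"], bins=bins, labels=labels, right=False)
    print(movies.groupby("period")["averageRating"].agg(["count", "mean"]))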

It also seems that recent movies tend to have a worse average rating, while the best-rated films are the older ones. We attribute this to a form of survivorship bias: subtitles are collected for recent movies regardless of whether they are good or bad, whereas the subtitles available for old movies are primarily for good ones. Given the uneven distribution of movies across time, we believe release year would be a poor feature for predicting the rating. Intuitively, when a movie was released has little to do with how good it is.

Genres

Different movie genres typically have different characteristics, so we explore the statistical properties of genres, starting with the distribution of genres in our dataset.

Drama is the most frequent genre in our dataset. Since our dataset has years with no subtitle data and is thus fairly incomplete, we compare the average IMDb rating in our subtitles dataset for the 10 most frequent genres against the corresponding genre's average rating over the whole IMDb database.

We see that for each genre there isn't much difference between the whole-database IMDb average and the average over our OpenSubtitles2018 movies. Despite the incompleteness of our dataset across time, each genre's average rating in our dataset is fairly representative of the genre's IMDb rating. We also notice the wide gap between the average ratings of Drama and Horror movies. We can therefore hypothesize that a movie's genre has an influence on its average rating.

Average Rating

Mean   Std Dev   Min   25%   50%   75%   Max
6.68      1.03   1.7   6.1   6.8   7.4   9.4

The distribution resembles a left-skewed normal distribution and suggests that a good movie is one rated above the mean of roughly 6.7.
We define two classes with approximately the same number of movies: good movies, with an average IMDb rating above 8, and bad movies, with an average IMDb rating below 5.2. This gives two classes of 353 and 355 movies respectively.

Analysis

Word Statistics

Raw features are extracted from the subtitles and the metadata (a sketch of their computation follows the list):

  • Number of words per minute: the total number of words divided by the movie runtime.
  • Number of distinct words per minute: the total number of distinct words divided by the movie runtime.
  • Number of sentences per minute: the total number of sentences divided by the movie runtime.
  • Mean length of sentences: the total number of words divided by the total number of sentences.
  • Subtitles to movie duration ratio: the time span covered by the subtitles divided by the movie runtime.
  • Distinct index: the number of distinct words divided by the product of the movie runtime, the total number of words and the mean length of sentences.
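
A minimal sketch of the computation for one movie; tokenization and the metadata lookup are assumed to have happened upstream, and the distinct-index denominator reflects our reading of the definition above:

    def subtitle_features(words, sentences, runtime_min, subs_duration_min):
        """words: tokens of the subtitle file; sentences: its sentences;
        runtime_min: movie runtime in minutes (from the metadata);
        subs_duration_min: time span covered by the subtitles, in minutes."""
        n_words = len(words)
        n_distinct = len(set(w.lower() for w in words))
        n_sentences = len(sentences)
        mean_sentence_len = n_words / n_sentences
        return {
            "words_per_minute": n_words / runtime_min,
            "distinct_words_per_minute": n_distinct / runtime_min,
            "sentences_per_minute": n_sentences / runtime_min,
            "mean_sentence_length": mean_sentence_len,
            "subs_to_movie_duration": subs_duration_min / runtime_min,
            # Distinct words over the product of runtime, word count and
            # mean sentence length (our interpretation of the definition).
            "distinct_index": n_distinct / (runtime_min * n_words * mean_sentence_len),
        }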

We plot scatter plots of each individual feature against the movie's average IMDb rating and compute their Pearson correlation coefficients (a minimal computation sketch follows the table).

Feature                       Correlation
Words per minute                    0.006
Distinct words per minute          -0.071
Sentences per minute               -0.111
Mean length sentences               0.198
Subtitles to movie duration         0.298
Distinct index                     -0.002
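
A minimal sketch of the correlation computation, assuming the features above have been added as columns of our hypothetical movies DataFrame:

    from scipy.stats import pearsonr

    features = ["words_per_minute", "distinct_words_per_minute",
                "sentences_per_minute", "mean_sentence_length",
                "subs_to_movie_duration", "distinct_index"]
    for f in features:
        r, p = pearsonr(movies[f], movies["averageRating"])
        print(f"{f:28s} r = {r:+.3f}  (p = {p:.3g})")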

We see that Mean length sentences and Subtitles to movie duration have a weak positive correlation with the average rating, Sentences per minute has a weak negative correlation, and Words per minute, Distinct index and Distinct words per minute show essentially no correlation. More precisely, the scatter plots show that a movie with a high value of Mean length sentences or Subtitles to movie duration tends to have a higher IMDb rating, whereas a low value in either feature tells us little.

We also plot histograms of the good movies and bad movies classes for each individual feature.

Again, Mean length sentences and Subtitles to movie duration show the strongest separation between the two classes, while Sentences per minute, Words per minute, Distinct index and Distinct words per minute show no real distinction.

Linear Regression

We use all our features to build a predictor, hoping that our best features, namely Mean length sentences and Subtitles to movie duration, will provide some insight. The red line in the plot represents the function y = x, i.e., the perfect predictor. A minimal sketch of the regression follows.
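
A sketch with scikit-learn, reusing the feature columns from the previous sketches; the report does not specify the exact train/test protocol, so the split below is illustrative:

    import matplotlib.pyplot as plt
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import train_test_split

    X, y = movies[features], movies["averageRating"]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=0)

    reg = LinearRegression().fit(X_train, y_train)
    y_pred = reg.predict(X_test)

    plt.scatter(y_test, y_pred, s=8)
    plt.plot([1, 10], [1, 10], color="red")   # the perfect predictor y = x
    plt.xlabel("actual IMDb rating")
    plt.ylabel("predicted IMDb rating")
    plt.show()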

We see that movies with a high IMDb rating are more predictable than movies with a low IMDb rating. As the scatter plots show, Mean length sentences and Subtitles to movie duration take no values that appear only for bad films; on the other hand, movies with high values in either feature tend to have a better rating. This explains why our predictor fails to predict bad movies, but does a better job in predicting good ones. All in all, our features do not seem to be enough to predict the final rating of a movie.

Natural Language Processing

As our linear regression did not give satisfying results in predicting movies' average IMDb rating, we turn to NLP tools to get further insight into what makes a good and a bad movie.

Sentiment Analysis

We implement sentiment analysis to see whether a movie's overall sentiment influences its final rating. We analyze every sentence of a movie and extract each sentence's positive and negative sentiment. For both polarities, we divide the total sentiment by the number of sentences and use the result as a feature to compare movies. We focus on the 100 best-rated and 100 worst-rated movies.
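
The report does not name the sentiment tool; here is a sketch using NLTK's VADER analyzer, whose per-sentence scores include exactly the positive and negative components we average:

    import nltk
    from nltk.sentiment.vader import SentimentIntensityAnalyzer

    nltk.download("vader_lexicon", quiet=True)
    sia = SentimentIntensityAnalyzer()

    def sentiment_features(sentences):
        """Mean positive and negative sentiment over a movie's sentences."""
        scores = [sia.polarity_scores(s) for s in sentences]
        n = len(scores)
        return {"positive": sum(s["pos"] for s in scores) / n,
                "negative": sum(s["neg"] for s in scores) / n}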

We see that the best movies carry less positive sentiment than the worst movies, while negative sentiment is roughly equal for both classes. We interpret this as follows: sad scenes in a movie say little about its rating, but too many positive and joyful scenes generally go with a worse one.

Topic Detection

We come up with recommendations of topics to use and topics to avoid for certain genres by analyzing the topics present in the best and worst movie classes defined earlier. More precisely, we retrieve sets of words that characterize recurrent topics in good or bad movies of a given genre using Latent Dirichlet Allocation (LDA). We then label the topics manually, although more advanced techniques using Wikipedia pages (see Labelling topics using unsupervised graph-based methods) or search engines (see Automatic labelling of topic models) also exist.
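
A minimal LDA sketch with scikit-learn, assuming docs holds one string of subtitle text per movie in the class under study (e.g. "good" dramas); three components match the top-3 topics reported below:

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import LatentDirichletAllocation

    vectorizer = CountVectorizer(stop_words="english", min_df=5)
    dtm = vectorizer.fit_transform(docs)  # document-term matrix

    lda = LatentDirichletAllocation(n_components=3, random_state=0).fit(dtm)

    # Print the ten highest-weight words of each topic for manual labelling.
    vocab = vectorizer.get_feature_names_out()
    for k, weights in enumerate(lda.components_):
        top = weights.argsort()[::-1][:10]
        print(f"topic {k}:", ", ".join(vocab[i] for i in top))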

Results

The number of movies considered for detecting the top 3 topics is written next to each genre heading. Our topic detection works best with a larger number of movies, since we can place greater confidence in the topics and words detected. The manual labelling, however, can become more difficult as the detected words grow more diverse.

Drama (good: 246 | bad: 89)

Good drama movies feature crime, justice and family as topics, while bad ones feature marriage and police. Marriage is very similar to family, as is police to crime. According to our data, good and bad drama movies are only subtly different, so if any screenwriters are reading this, please follow our recommendations at your own risk!

Comedy (good: 75 | bad: 146)

"Good" comedies often deal with the topics of government, war, religion and morality. Spoof comedies or movies set during the middle ages or renaissance, on the other hand, are not recommended. If the movie talks about holidays, it should be about family, and avoid the sex and teen drama topics.

Action (good: 37 | bad: 96)

Justice is a topic often present in "good" action movies, but less so in bad ones. Both "good" and "bad" action movies often focus on war, the police and the army. We noticed, however, that "good" action movies tend to be more realistic, while "bad" action movies incorporate elements of science fiction such as alien invasions. Moreover, romance is not as dominant in the "good" movies set as it is in the "bad" one.

Romance (good: 56 | bad: 37)

"Good" romantic movie have a high occurrence of war-related topics and stories that involve some "goodbye" theme. We also see in the bag of words retrieved the topics of a young, college and romance. Moreover, movies that involve marital affairs also tend to popular. Finally, romance movies should avoid the holidays topic and having too much sex, particularly if vulgar with words such as b*tch or f*ck.

Further down, we provide the pyLDAvis visualizer applied to the predefined "good" comedy class; a sketch of its setup follows.
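
A sketch of how such a visualization can be produced from the LDA sketch above; note that the pyLDAvis.sklearn helper module was renamed in later pyLDAvis releases:

    import pyLDAvis
    import pyLDAvis.sklearn  # pyLDAvis.lda_model in pyLDAvis >= 3.4

    panel = pyLDAvis.sklearn.prepare(lda, dtm, vectorizer)
    pyLDAvis.display(panel)  # renders the interactive visualizer in a notebook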

Conclusion

To predict the average IMDb rating, we first extracted statistical features from the movie subtitles. Unfortunately, this did not lead to great results, as very few features were correlated with the average IMDb rating. We then computed sentiment scores using NLP tools, but this did not provide much further insight into what makes a good movie. Finally, we implemented a topic detection algorithm on our dataset to provide topic recommendations per genre. We may not have been able to provide a recipe for a good movie, but topic detection did give us further ideas that could feed into a movie rating predictor. Had we had more time, we would have improved our predictor by incorporating categorical features such as genres and topics. In any case, our work showed us that a movie is more complex than its subtitles' structure, sentiment and topics: movies combine sound, text and images to convey deeper ideas and emotions than subtitles alone. One would have to dig deeper into the subtitles to obtain a more accurate statistical representation of a good movie.

References

  1. Lison, P., & Tiedemann, J. (2016). OpenSubtitles2016: Extracting Large Parallel Corpora from Movie and TV Subtitles. In Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC 2016).
  2. Lau, J. H., Grieser, K., Newman, D., & Baldwin, T. (2011). Automatic labelling of topic models. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (pp. 1536-1545). Association for Computational Linguistics.
  3. Aletras, N., & Stevenson, M. (2014). Labelling topics using unsupervised graph-based methods. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers) (pp. 631-636).
  4. Depollo, A. (2017, July 25). The History of Subtitles. [Accessed 16 December 2018]
  5. Ivarsson, J. (2004, November 17). A Short Technical History of Subtitles in Europe. [Accessed 16 December 2018]

Source Code

Check out our open-source code for more details on the implementation.
