Application Case 4.6
Data Mining Goes to Hollywood: Predicting Financial Success of Movies
Application Case 4.6 describes a research study in which several software tools and data mining techniques are used to build models that predict the financial success (box-office receipts) of Hollywood movies while they are still nothing more than ideas.
Predicting the box-office receipts (i.e., financial success) of a particular motion picture is an interesting and challenging problem. According to some domain experts, the movie industry is the “land of hunches and wild guesses” because product demand is so difficult to forecast, which makes the movie business in Hollywood a risky endeavor. In support of such observations, Jack Valenti (the longtime president and CEO of the Motion Picture Association of America) once mentioned that “... no one can tell you how a movie is going to do in the marketplace ... not until the film opens in darkened theatre and sparks fly up between the screen and the audience.” Entertainment industry trade journals and magazines have been full of examples, statements, and experiences that support such a claim.
Like many other researchers who have attempted to shed light on this challenging real-world problem, Ramesh Sharda and Dursun Delen have been exploring the use of data mining to predict the financial performance of a motion picture at the box office before it even enters production (while the movie is nothing more than a conceptual idea). In their highly publicized prediction models, they convert the forecasting (or regression) problem into a classification problem; that is, rather than forecasting a point estimate of box-office receipts, they classify a movie, based on its box-office receipts, into one of nine categories ranging from “flop” to “blockbuster,” making the problem a multinomial classification problem. Table 4.3 defines the nine classes in terms of ranges of box-office receipts.
TABLE 4.3 Movie Classification Based on Receipts
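The conversion from a regression target to nine ordinal classes can be sketched in a few lines of Python. This is only an illustration: the cut-off values below are hypothetical placeholders and are not the ranges given in Table 4.3, and the function name is an assumption made for this example.

```python
from bisect import bisect_right

# Hypothetical upper cut-offs in millions of dollars.
# These are placeholders for illustration, NOT the ranges from Table 4.3.
CUTOFFS_MILLIONS = [1, 10, 20, 40, 65, 100, 150, 200]


def receipts_to_class(receipts_millions: float) -> int:
    """Map box-office receipts to an ordinal class, 1 ("flop") to 9 ("blockbuster")."""
    # bisect_right counts how many cut-offs are <= the receipts value,
    # which gives a zero-based bin index; add 1 for classes numbered 1..9.
    return bisect_right(CUTOFFS_MILLIONS, receipts_millions) + 1


if __name__ == "__main__":
    for receipts in (0.5, 15, 250):
        print(receipts, "->", receipts_to_class(receipts))
```

Eight cut-offs produce nine bins, so every movie falls into exactly one class regardless of how large or small its receipts are.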
Data was collected from a variety of movie-related databases (e.g., ShowBiz, IMDb, IMSDb, AllMovie, and BoxofficeMojo) and consolidated into a single data set. The data set for the most recently developed models contained 2,632 movies released between 1998 and 2006. A summary of the independent variables along with their specifications is provided in Table 4.4. For more descriptive details and the justification for including these independent variables, the reader is referred to Sharda and Delen (2006).
TABLE 4.4 Summary of Independent Variables
Using a variety of data mining methods, including neural networks, decision trees, SVMs, and three types of ensembles, Sharda and Delen developed the prediction models. The data from 1998 to 2005 were used as training data to build the prediction models, and the data from 2006 were used as test data to assess and compare the models’ prediction accuracy. Figure 4.16 shows a screenshot of IBM SPSS Modeler (formerly the Clementine data mining tool) depicting the process map employed for the prediction problem. The upper-left side of the process map shows the model development process, and the lower-right corner shows the model assessment (i.e., testing or scoring) process. More details on the IBM SPSS Modeler tool and its usage can be found on the book’s Web site.
Figure 4.16 Process Flow Screenshot for the Box-Office Prediction System.
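The year-based split described above (1998 to 2005 for training, 2006 for testing) can be sketched as follows. The record layout and field names (`title`, `year`, `cls`) are assumptions made for this example and do not come from the study's data set.

```python
def split_by_year(movies, test_year=2006):
    """Split movie records into training (before test_year) and test (test_year) sets."""
    train = [m for m in movies if m["year"] < test_year]
    test = [m for m in movies if m["year"] == test_year]
    return train, test


if __name__ == "__main__":
    # Hypothetical records; "cls" stands for the receipts class (1..9).
    movies = [
        {"title": "A", "year": 1998, "cls": 2},
        {"title": "B", "year": 2003, "cls": 7},
        {"title": "C", "year": 2005, "cls": 4},
        {"title": "D", "year": 2006, "cls": 9},
    ]
    train, test = split_by_year(movies)
    print(len(train), "training movies,", len(test), "test movies")
```

Holding out the most recent year, rather than a random sample, mimics how such a model would be used in practice: it is trained on past releases and asked about movies it has not yet seen.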
Table 4.5 provides the prediction results of the three individual data mining methods as well as the results of the three ensembles. The first performance measure is the percent correct classification rate, which is called Bingo. Also reported in the table is the 1-Away correct classification rate (i.e., the prediction falls within one category of the actual class). The results indicate that SVM performed the best among the individual prediction models, followed by ANN; the worst of the three was the CART decision tree algorithm. In general, the ensemble models performed better than the individual prediction models, and among the ensembles the fusion algorithm performed the best. What is probably more important to decision makers, and what stands out in the results table, is the significantly lower standard deviation obtained from the ensembles compared to the individual models.
TABLE 4.5 Tabulated Prediction Results for Individual and Ensemble Models
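The two accuracy measures can be computed as in the sketch below. It assumes that classes are coded as integers 1 through 9 and that the 1-Away rate also counts exact hits; the function names and the sample labels are made up for illustration and are not results from Table 4.5.

```python
def bingo_rate(actual, predicted):
    """Fraction of movies whose predicted class exactly matches the actual class."""
    hits = sum(a == p for a, p in zip(actual, predicted))
    return hits / len(actual)


def one_away_rate(actual, predicted):
    """Fraction of movies whose predicted class is at most one class away.

    Assumption: exact matches are counted as well as near misses.
    """
    hits = sum(abs(a - p) <= 1 for a, p in zip(actual, predicted))
    return hits / len(actual)


if __name__ == "__main__":
    # Made-up class labels for five movies, for illustration only.
    actual = [1, 3, 5, 9, 7]
    predicted = [1, 4, 5, 6, 7]
    print("Bingo: ", bingo_rate(actual, predicted))     # 3 of 5 exact -> 0.6
    print("1-Away:", one_away_rate(actual, predicted))  # 4 of 5 within one -> 0.8
```

Because the nine classes are ordered, a prediction that misses by one class is a much smaller error than one that misses by several, which is why the 1-Away rate is reported alongside the exact-match rate.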
The researchers claim that these prediction results are better than any reported in the published literature for this problem domain. Beyond the attractive accuracy of their box-office predictions, these models could also be used to further analyze (and potentially optimize) the decision variables to maximize the financial return. Specifically, the parameters used for modeling could be altered using the already trained prediction models to better understand the impact of different parameters on the end results. During this process, which is commonly referred to as sensitivity analysis, the decision maker of a given entertainment firm could find out, with a fairly high level of accuracy, how much value a specific actor (or a specific release date, or the addition of more technical effects, etc.) brings to the financial success of a film, making the underlying system an invaluable decision aid.
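The idea of sensitivity analysis, changing one decision variable at a time while holding the others fixed and re-scoring with an already trained model, can be sketched as follows. The `toy_predict` function is a hypothetical stand-in for a trained model, and the variable names (`star_power`, `sequel`) are assumptions chosen for this example rather than the study's actual variables.

```python
def sensitivity(predict, baseline, variable, candidate_values):
    """Return {value: prediction} when only `variable` is changed from the baseline."""
    results = {}
    for value in candidate_values:
        scenario = dict(baseline)      # copy so the baseline is left untouched
        scenario[variable] = value     # alter a single decision variable
        results[value] = predict(scenario)
    return results


def toy_predict(movie):
    """Hypothetical stand-in for a trained model; returns a class from 1 to 9."""
    score = 3
    score += {"low": 0, "medium": 1, "high": 2}[movie["star_power"]]
    score += 1 if movie["sequel"] else 0
    return min(score, 9)


if __name__ == "__main__":
    baseline = {"star_power": "low", "sequel": False}
    effect = sensitivity(toy_predict, baseline, "star_power", ["low", "medium", "high"])
    print(effect)
```

In practice, `predict` would wrap whichever trained model is being analyzed, and the difference between the predicted classes for two scenarios gives a rough indication of how much that single variable contributes.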
Questions for Discussion
1. Why is it important for many Hollywood professionals to predict the financial success of movies?
2. How can data mining be used to predict the financial success of movies before their production process starts?
3. How do you think Hollywood performed, and perhaps still performs, this task without the help of data mining tools and techniques?