Student attrition has become one of the most challenging problems for decision makers in academic institutions. Despite all the programs and services that are put in place to help retain students, according to the U.S. Department of Education, Center for Educational Statistics (nces.ed.gov), only about half of those who enter higher education actually earn a bachelor’s degree. Enrollment management and the retention of students have become a top priority for administrators of colleges and universities in the United States and other countries around the world. High student dropout rates usually result in overall financial loss, lower graduation rates, and an inferior school reputation in the eyes of all stakeholders. The legislators and policy makers who oversee higher education and allocate funds, the parents who pay for their children’s education to prepare them for a better future, and the students who make college choices all look for evidence of institutional quality and reputation to guide their decision-making processes.
To improve student retention, one should try to understand the nontrivial reasons behind attrition. To be successful, one should also be able to accurately identify the students who are at risk of dropping out. So far, the vast majority of student attrition research has been devoted to understanding this complex, yet crucial, social phenomenon. Even though these qualitative, behavioral, and survey-based studies revealed invaluable insights by developing and testing a wide range of theories, they do not provide the much-needed instruments to accurately predict (and potentially improve) student attrition. The project summarized in this case study proposed a quantitative research approach in which historical institutional data from student databases could be used to develop models capable of predicting as well as explaining the institution-specific nature of the attrition problem. The proposed analytics approach is shown in Figure 2.4.
Although the concept is relatively new to higher education, similar problems have been studied in the field of marketing management for more than a decade using predictive data analytics techniques under the name of “churn analysis.” There, the purpose is to answer the question, “Who among our current customers are more likely to stop buying our products or services?” so that some kind of mediation or intervention process can be executed to retain them. Retaining existing customers is crucial because, as we all know, and as the related research has shown time and time again, acquiring a new customer costs an order of magnitude more effort, time, and money than keeping the one that you already have.
Data Is of the Essence
The data for this research project came from a single institution (a comprehensive public university located in the Midwest region of the United States) with an average enrollment of 23,000 students, of which roughly 80% are residents of the same state and roughly 19% are listed under some minority classification. There is no significant difference in enrollment numbers between the two genders. The average freshman student retention rate for the institution was about 80%, and the average 6-year graduation rate was about 60%.
The study used 5 years of institutional data, which comprised 16,000+ students enrolled as freshmen, consolidated from various and diverse university student databases. The data contained variables related to students’ academic, financial, and demographic characteristics. After merging and converting the multidimensional student data into a single flat file (a file with columns representing the variables and rows representing the student records), the resultant file was assessed and preprocessed to identify and remedy anomalies and unusable values. As an example, the study removed all international student records from the data set because they did not contain information about some of the most important predictors (e.g., high school GPA, SAT scores). In the data transformation phase, some of the variables were aggregated (e.g., the “Major” and “Concentration” variables were aggregated into the binary variables MajorDeclared and ConcentrationSpecified) for better interpretability in the predictive modeling. In addition, some of the variables were used to derive new variables (e.g., the Earned/Registered ratio and YearsAfterHighSchool).
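The preprocessing steps described above can be sketched with pandas. The column names and toy records here are illustrative assumptions, not the institution's actual schema:

```python
# Sketch of the flat-file preprocessing step, assuming hypothetical
# column names; the study's actual institutional schema is not public.
import pandas as pd

# Toy stand-in for the merged student record flat file
df = pd.DataFrame({
    "StudentID":     [1, 2, 3, 4],
    "Residency":     ["InState", "InState", "International", "OutOfState"],
    "Major":         ["BIOL", None, "CSCI", None],
    "Concentration": ["PreMed", None, None, "None"],
})

# Remove international students: key predictors (high school GPA,
# SAT scores) are missing for those records
df = df[df["Residency"] != "International"].copy()

# Aggregate categorical fields into binary indicator variables
df["MajorDeclared"] = df["Major"].notna().astype(int)
df["ConcentrationSpecified"] = (
    df["Concentration"].notna() & (df["Concentration"] != "None")
).astype(int)
```

Collapsing high-cardinality fields such as “Major” into binary indicators trades detail for interpretability, which the study valued for predictive modeling.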
Earned/Registered = EarnedHours / RegisteredHours
YearsAfterHighSchool = FreshmenEnrollmentYear – HighSchoolGraduationYear
The Earned/Registered ratio was created to better represent students’ resiliency and determination in the first semester of their freshman year. Intuitively, one would expect greater values of this variable to have a positive impact on retention/persistence. YearsAfterHighSchool was created to measure the impact of the time elapsed between high school graduation and initial college enrollment. Intuitively, one would expect this variable to contribute to the prediction of attrition. These aggregations and derived variables were determined through a number of experiments conducted to test several logical hypotheses; the ones that made the most sense and led to better prediction accuracy were kept in the final variable set. Reflecting the true nature of the subpopulation (i.e., freshmen students), the dependent variable (i.e., “Second Fall Registered”) contained many more yes records (~80%) than no records (~20%; see Figure 2.5).
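The two derived variables above can be computed directly from the formulas. The sample values below are made up for illustration:

```python
# Minimal sketch of the two derived variables; the column names and
# sample values are illustrative assumptions.
import pandas as pd

df = pd.DataFrame({
    "EarnedHours":              [12, 15, 6],
    "RegisteredHours":          [15, 15, 12],
    "FreshmenEnrollmentYear":   [2004, 2004, 2006],
    "HighSchoolGraduationYear": [2004, 2003, 2004],
})

# Earned/Registered = EarnedHours / RegisteredHours
df["EarnedRegisteredRatio"] = df["EarnedHours"] / df["RegisteredHours"]

# YearsAfterHighSchool = FreshmenEnrollmentYear - HighSchoolGraduationYear
df["YearsAfterHighSchool"] = (
    df["FreshmenEnrollmentYear"] - df["HighSchoolGraduationYear"]
)
```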
Research shows that such imbalanced data has a negative impact on model performance. Therefore, the study experimented with both options, comparing the results of the same types of models built with the original imbalanced data (biased toward the yes records) and with well-balanced data.
Modeling and Assessment
The study employed four popular classification methods (i.e., artificial neural networks, decision trees, support vector machines, and logistic regression) along with three model ensemble techniques (i.e., bagging, boosting, and information fusion). The results obtained from all model types were then compared to each other using regular classification model assessment methods (e.g., overall predictive accuracy, sensitivity, specificity) on the holdout samples.
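A model-comparison setup of this kind can be sketched with scikit-learn stand-ins for the four individual classifiers. The data here is synthetic (with an 80/20 class skew mirroring the study's), and the study's actual software and hyperparameters are not specified:

```python
# Sketch of comparing the four individual model types with 10-fold
# cross-validation; synthetic data and default settings are assumptions.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in with the same ~80/20 class imbalance
X, y = make_classification(n_samples=500, weights=[0.8, 0.2],
                           random_state=0)

models = {
    "ANN": MLPClassifier(max_iter=2000, random_state=0),
    "DT":  DecisionTreeClassifier(random_state=0),
    "SVM": SVC(random_state=0),
    "LR":  LogisticRegression(max_iter=1000),
}

# 10-fold cross-validated overall accuracy for each model type
scores = {name: cross_val_score(m, X, y, cv=10).mean()
          for name, m in models.items()}
```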
In addition to assessing the predictive power of the models, the study conducted sensitivity analyses to determine the relative importance of the input variables. In machine-learning algorithms (some of which will be covered in Chapter 4), sensitivity analysis is a method for identifying the “cause-and-effect” relationship between the inputs and outputs of a given prediction model. The fundamental idea is to measure the importance of each predictor variable by the change in model performance that occurs when that variable is excluded from the model; this modeling and experimentation practice is also called leave-one-out assessment. Hence, the sensitivity measure of a specific predictor variable is the ratio of the error of the trained model without that variable to the error of the model that includes it. The more sensitive the model is to a particular variable, the greater the performance decrease in its absence, and therefore the greater the ratio of importance.
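The leave-one-out sensitivity idea can be sketched as follows; the data, model choice, and variable indices are illustrative assumptions, not the study's:

```python
# Leave-one-out sensitivity sketch: drop each predictor, retrain,
# and take the ratio of the reduced model's error to the full model's
# error (ratios above 1 suggest the variable carries predictive value).
# All data and settings here are illustrative, not from the study.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, n_features=6, n_informative=3,
                           random_state=1)
model = DecisionTreeClassifier(random_state=1)

# Error of the full model, estimated by 10-fold cross-validation
full_error = 1 - cross_val_score(model, X, y, cv=10).mean()

sensitivity = {}
for j in range(X.shape[1]):
    X_reduced = np.delete(X, j, axis=1)            # leave variable j out
    err = 1 - cross_val_score(model, X_reduced, y, cv=10).mean()
    sensitivity[j] = err / max(full_error, 1e-9)   # guard divide-by-zero
```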
In the first set of experiments, the study used the original imbalanced data set. Based on the 10-fold cross-validation assessment results, support vector machines produced the best accuracy, with an overall prediction rate of 87.23%; the decision tree came out as the runner-up with an overall prediction rate of 87.16%, followed by artificial neural networks and logistic regression with overall prediction rates of 86.45% and 86.12%, respectively (see Table 2.2). A careful examination of these results reveals that the prediction accuracy for the “Yes” class is significantly higher than that for the “No” class. In fact, all four model types predicted the students who are likely to return for the second year with better than 90% accuracy, but they did poorly on predicting the students who are likely to drop out after the freshman year, with less than 50% accuracy. Because predicting the “No” class is the main purpose of this study, less than 50% accuracy for this class was deemed unacceptable. Such a difference in the prediction accuracy of the two classes can (and should) be attributed to the imbalanced nature of the training data set (i.e., ~80% “Yes” and ~20% “No” samples).
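The gap between overall and per-class accuracy is easy to see from a confusion matrix. The counts below are fabricated for illustration only; they are not the figures in Table 2.2:

```python
# Per-class accuracy from a confusion matrix; the labels mirror the
# "Yes"/"No" retention classes, but the counts are made up.
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array(["Yes"] * 80 + ["No"] * 20)
y_pred = np.array(["Yes"] * 76 + ["No"] * 4 +   # Yes class: 76/80 correct
                  ["Yes"] * 11 + ["No"] * 9)    # No class:   9/20 correct

cm = confusion_matrix(y_true, y_pred, labels=["Yes", "No"])
per_class = cm.diagonal() / cm.sum(axis=1)   # recall for each class
overall = cm.diagonal().sum() / cm.sum()     # overall accuracy
```

Here the overall accuracy (85%) looks healthy even though the “No” class, the one the study cares about, is predicted at only 45%.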
The next round of experiments used a well-balanced data set in which the two classes were represented nearly equally. To realize this approach, the study took all the samples from the minority class (i.e., the “No” class herein) and randomly selected an equal number of samples from the majority class (i.e., the “Yes” class herein), repeating this process 10 times to reduce the potential bias of random sampling. Each of these sampling processes resulted in a data set of 7,000+ records in which both class labels (“Yes” and “No”) were equally represented. Again using a 10-fold cross-validation methodology, the study developed and tested prediction models for all four model types. The results of these experiments are shown in Table 2.3. Based on the holdout sample results, support vector machines once again generated the best overall prediction accuracy with 81.18%, followed by decision trees, artificial neural networks, and logistic regression with overall prediction accuracies of 80.65%, 79.85%, and 74.26%, respectively. As can be seen in the per-class accuracy figures, the prediction models did significantly better at predicting the “No” class with the well-balanced data than they did with the unbalanced data. Overall, the three machine-learning techniques performed significantly better than their statistical counterpart, logistic regression.
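The repeated random undersampling procedure can be sketched as follows; the class proportions match the study's (~80/20), but the counts and column name are illustrative:

```python
# Repeated random undersampling sketch: keep all minority ("No")
# records and draw an equal-sized random sample of majority ("Yes")
# records, repeating 10 times to dampen sampling bias.
# The data below is a synthetic stand-in for the student records.
import pandas as pd

df = pd.DataFrame({"SecondFallRegistered":
                   ["Yes"] * 800 + ["No"] * 200})

minority = df[df["SecondFallRegistered"] == "No"]
majority = df[df["SecondFallRegistered"] == "Yes"]

balanced_sets = []
for seed in range(10):
    # Draw a fresh majority-class sample of the same size as the minority
    sample = majority.sample(n=len(minority), random_state=seed)
    balanced_sets.append(pd.concat([minority, sample]))
```

Each of the 10 balanced sets then feeds its own round of model building and 10-fold cross-validation, so no single unlucky draw of “Yes” records dominates the results.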
Next, another set of experiments was conducted to assess the predictive ability of the three ensemble models. Based on the 10-fold cross-validation methodology, the information fusion–type ensemble produced the best results, with an overall prediction rate of 82.10%, followed by the bagging-type and boosting-type ensembles with overall prediction rates of 81.80% and 80.21%, respectively (see Table 2.4). Even though these prediction results are only slightly better than those of the individual models, ensembles are known to produce more robust prediction systems than a single best prediction model (more on this can be found in Chapter 4).
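The three ensemble styles can be sketched with scikit-learn. A soft-voting combiner over heterogeneous models stands in for information fusion here; the study's exact fusion scheme, base learners, and data are not reproduced:

```python
# Ensemble sketch: bagging, boosting, and a soft-voting combiner as a
# stand-in for information fusion. Synthetic data; default settings.
from sklearn.datasets import make_classification
from sklearn.ensemble import (AdaBoostClassifier, BaggingClassifier,
                              VotingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, random_state=0)

ensembles = {
    "bagging":  BaggingClassifier(random_state=0),   # resampled base models
    "boosting": AdaBoostClassifier(random_state=0),  # sequential reweighting
    "fusion":   VotingClassifier(                    # combine model outputs
        estimators=[("dt", DecisionTreeClassifier(random_state=0)),
                    ("lr", LogisticRegression(max_iter=1000))],
        voting="soft"),
}

# 10-fold cross-validated overall accuracy for each ensemble type
scores = {name: cross_val_score(m, X, y, cv=10).mean()
          for name, m in ensembles.items()}
```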
In addition to assessing the prediction accuracy of each model type, a sensitivity analysis was conducted using the developed prediction models to identify the relative importance of the independent variables (i.e., the predictors). Each of the four individual model types generated its own sensitivity measures, ranking all the independent variables in a prioritized list. As expected, each model type produced slightly different sensitivity rankings. After all four sets of sensitivity numbers were collected, they were normalized, aggregated, and plotted in a horizontal bar chart (see Figure 2.6).
The study showed that, given sufficient data with the proper variables, data mining methods are capable of predicting freshman student attrition with approximately 80% accuracy. Results also showed that, regardless of the prediction model employed, the balanced data set (compared to the unbalanced/original data set) produced better prediction models for identifying the students who are likely to drop out of college prior to their sophomore year. Among the four individual prediction models used in this study, support vector machines performed the best, followed by decision trees, neural networks, and logistic regression. From a usability standpoint, even though support vector machines showed better prediction results, one might choose decision trees because, compared to support vector machines and neural networks, they portray a more transparent model structure. Decision trees explicitly show the reasoning process behind different predictions, providing a justification for a specific outcome, whereas support vector machines and artificial neural networks are mathematical models that do not offer such a transparent view of “how they do what they do.”