Application Case 4.4

Data Mining Helps in Cancer Research

    According to the American Cancer Society, half of all men and one-third of all women in the United States will develop cancer during their lifetimes; approximately 1.5 million new cancer cases were expected to be diagnosed in 2013. Cancer is the second most common cause of death in the United States and in the world, exceeded only by cardiovascular disease. This year, over 500,000 Americans are expected to die of cancer (more than 1,300 people a day), accounting for nearly one of every four deaths.
    Cancer is a group of diseases generally characterized by uncontrolled growth and spread of abnormal cells. If the growth and/or spread is not controlled, it can result in death. Even though the exact reasons are not known, cancer is believed to be caused by both external factors (e.g., tobacco, infectious organisms, chemicals, and radiation) and internal factors (e.g., inherited mutations, hormones, immune conditions, and mutations that occur from metabolism). These causal factors may act together or in sequence to initiate or promote carcinogenesis. Cancer is treated with surgery, radiation, chemotherapy, hormone therapy, biological therapy, and targeted therapy. Survival statistics vary greatly by cancer type and stage at diagnosis.
    The 5-year relative survival rate for all cancers is improving, and the decline in cancer mortality had reached 20% by 2013, translating into the avoidance of about 1.2 million deaths from cancer since 1991. That’s more than 400 lives saved per day! The improvement in survival reflects progress in diagnosing certain cancers at an earlier stage and improvements in treatment. Further improvements are needed to prevent and treat cancer.
    Even though cancer research has traditionally been clinical and biological in nature, in recent years data-driven analytic studies have become a common complement. In medical domains where data and analytics-driven research have been applied successfully, novel research directions have been identified to further advance the clinical and biological studies. Using various types of data, including molecular, clinical, literature-based, and clinical trial data, along with suitable data mining tools and techniques, researchers have been able to identify novel patterns, paving the road toward a cancer-free society.
    In one study, Delen (2009) used three popular data mining techniques (decision trees, artificial neural networks, and support vector machines [SVMs]) in conjunction with logistic regression to develop prediction models for prostate cancer survivability. The data set contained around 120,000 records and 77 variables. A k-fold cross-validation methodology was used in model building, evaluation, and comparison. The results showed that SVMs were the most accurate predictor (with a test set accuracy of 92.85%) for this domain, followed by artificial neural networks and decision trees. Furthermore, using a sensitivity analysis-based evaluation method, the study also revealed novel patterns related to prognostic factors of prostate cancer.
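    The k-fold cross-validation procedure mentioned above can be sketched in a few lines. This is only an illustration of the methodology, not the study's code: the data are synthetic (not SEER or any patient data), and the two toy classifiers, a majority-class baseline and a nearest-centroid rule, stand in for the decision trees, neural networks, and SVMs that were actually compared. All function and class names here are assumptions made for the example.

```python
import random

def k_fold_indices(n, k, seed=0):
    """Shuffle indices 0..n-1 and deal them into k nearly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

class MajorityClass:
    """Baseline model: always predict the most frequent training label."""
    def fit(self, X, y):
        self.label = max(set(y), key=y.count)
        return self
    def predict(self, X):
        return [self.label] * len(X)

class NearestCentroid:
    """Predict the class whose mean feature vector is closest."""
    def fit(self, X, y):
        self.centroids = {}
        for label in set(y):
            rows = [x for x, t in zip(X, y) if t == label]
            self.centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
        return self
    def predict(self, X):
        def dist2(a, b):
            return sum((p - q) ** 2 for p, q in zip(a, b))
        return [min(self.centroids, key=lambda c: dist2(x, self.centroids[c]))
                for x in X]

def cross_val_accuracy(make_model, X, y, k=10, seed=0):
    """Mean accuracy over k folds; each fold is held out for testing once."""
    folds = k_fold_indices(len(X), k, seed)
    scores = []
    for i, test_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        model = make_model().fit([X[j] for j in train_idx],
                                 [y[j] for j in train_idx])
        preds = model.predict([X[j] for j in test_idx])
        correct = sum(p == y[j] for p, j in zip(preds, test_idx))
        scores.append(correct / len(test_idx))
    return sum(scores) / len(scores)

def make_synthetic(n=400, seed=1):
    """Hypothetical two-feature data: class 1 is shifted away from class 0."""
    rng = random.Random(seed)
    X, y = [], []
    for _ in range(n):
        label = 1 if rng.random() < 0.4 else 0
        shift = 2.0 * label
        X.append([rng.gauss(shift, 1.0), rng.gauss(shift, 1.0)])
        y.append(label)
    return X, y

X, y = make_synthetic()
for name, model in [("majority baseline", MajorityClass),
                    ("nearest centroid", NearestCentroid)]:
    print(name, round(cross_val_accuracy(model, X, y, k=10), 3))
```

    Because every model is scored on the same folds, and every record is used for testing exactly once, the averaged accuracies can be compared fairly across techniques. In practice a library implementation would normally replace the hand-written fold logic.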
    In a related study, Delen, Walker, and Kadam (2005) used two data mining algorithms (artificial neural networks and decision trees) and logistic regression to develop prediction models for breast cancer survival using a large data set (more than 200,000 cases). A 10-fold cross-validation method was used to obtain an unbiased estimate of each model's performance for comparison purposes. The results indicated that the decision tree (C5 algorithm) was the best predictor, with 93.6% accuracy on the holdout sample (which was the best prediction accuracy reported in the literature), followed by artificial neural networks, with 91.2% accuracy, and logistic regression, with 89.2% accuracy. Further analysis of the prediction models revealed the prioritized importance of the prognostic factors, which can then be used as a basis for further clinical and biological research studies.
    In the most recent study, Zolbanin, Delen, and Zadeh (2015) studied the impact of comorbidity in cancer survivability. Although prior research has shown that diagnostic and treatment recommendations might be altered based on the severity of comorbidities, chronic diseases are still being investigated in isolation from one another in most cases. To illustrate the significance of concurrent chronic diseases in the course of treatment, their study used the Surveillance, Epidemiology, and End Results (SEER) Program’s cancer data to create two comorbid data sets: one for breast and female genital cancers and another for prostate and urinary cancers. Several popular machine-learning techniques were then applied to the resultant data sets to build predictive models (see Figure 4.4). Comparison of the results showed that having more information about comorbid conditions of patients can improve models’ predictive power, which in turn can help practitioners make better diagnostic and treatment decisions. Therefore, the study suggested that proper identification, recording, and use of patients’ comorbidity status can potentially lower treatment costs and ease the healthcare-related economic challenges.

Figure 4.4 A Data Mining Methodology for Investigation of Comorbidity in Cancer Survivability.

These examples (among many others in the medical literature) show that advanced data mining techniques can be used to develop models that possess a high degree of predictive as well as explanatory power. Although data mining methods are capable of extracting patterns and relationships hidden deep in large and complex medical databases, without the cooperation and feedback from the medical experts, their results are not of much use. The patterns found via data mining methods should be evaluated by medical professionals who have years of experience in the problem domain to decide whether they are logical, actionable, and novel enough to warrant new research directions. In short, data mining is not meant to replace medical professionals and researchers, but to complement their invaluable efforts to provide data-driven new research directions and to ultimately save more human lives.

Questions for Discussion

1. How can data mining be used for ultimately curing illnesses like cancer?

2. What do you think are the promises and major challenges for data miners in contributing to medical and biological research endeavors?


Application Case 4.4

Data Mining Helps in Cancer Research

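    According to the American Cancer Society, half of all men and one-third of all women in the United States will develop cancer. About 1.5 million new cancer cases were diagnosed in 2009. Cancer is the second most common cause of death in the United States and worldwide, exceeded only by cardiovascular disease. In 2013, more than 500,000 Americans were expected to die of cancer, an average of more than 1,300 people a day, accounting for nearly one-quarter of all deaths.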
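    Cancer is a disease caused by the uncontrolled growth and spread of abnormal cells. If the growth and spread are not effectively controlled, the result can be death. Although the exact causes are unknown, cancer is generally believed to result from both external factors (e.g., tobacco use, infectious organisms, chemicals, and radiation) and internal factors (e.g., inherited mutations, hormones, immune conditions, and mutations arising from metabolism). These factors may act together or in sequence to initiate or promote cancer. Current treatments include surgery, radiation therapy, chemotherapy, hormone therapy, biological therapy, and targeted therapy. Survival rates vary greatly by cancer type and stage at diagnosis.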
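    The 5-year relative survival rate for cancer patients is improving. By 2013, the mortality rate had fallen by 20%, meaning that about 1.2 million deaths have been avoided since 1991, or more than 400 lives saved per day. The rise in survival reflects progress in early diagnosis and advances in treatment. Cancer prevention and treatment still need further improvement.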
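    Although cancer research has traditionally been clinical and biological in nature, data-driven analytic studies have become a common complement in recent years. In medical fields where data- and analytics-driven research has been applied successfully, new research directions have been identified that advance clinical and biological work. Using many types of data, including molecular, clinical, literature-based, and clinical trial data, together with suitable data mining tools and techniques, researchers have been able to identify new patterns, laying the groundwork for overcoming cancer.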
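    In a 2009 study, Delen used three popular data mining techniques (decision trees, artificial neural networks, and support vector machines) together with logistic regression to analyze a data set of about 120,000 records and 77 variables and build prediction models for prostate cancer survivability. k-fold cross-validation was used to build, evaluate, and compare the models. The results showed that the support vector machine model was the most accurate (test set accuracy of 92.85%), followed by artificial neural networks and decision trees. In addition, using a sensitivity analysis-based evaluation method, the study revealed new patterns related to the prognosis of prostate cancer.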
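    Delen, Walker, and Kadam (2005) applied two data mining algorithms (artificial neural networks and decision trees) together with logistic regression to a large data set of more than 200,000 cases to build prediction models for breast cancer survival. The models were compared using 10-fold cross-validation to obtain an unbiased estimate of their performance. The decision tree (C5.0 algorithm) was the most accurate, reaching 93.6% on the test sample, the highest prediction accuracy reported in the literature. Artificial neural networks followed at 91.2%, and logistic regression was the least accurate at 89.2%. Further analysis of the prediction models revealed the relative importance of the prognostic factors, which can serve as a basis for further clinical and biological research.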
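    In a more recent study, Zolbanin et al. (2015) examined the effect of comorbidity on cancer survivability. Although earlier research had shown that diagnostic and treatment recommendations may be adjusted according to the severity of comorbidities, chronic diseases are in most cases still studied in isolation. To show the importance of concurrent chronic diseases during treatment, their study used cancer data from the Surveillance, Epidemiology, and End Results (SEER) Program to create two comorbidity data sets: one for breast and female genital cancers and another for prostate and urinary tract cancers. Several popular machine-learning techniques were then applied to the resulting data sets to build predictive models. Comparing the results showed that having more information about patients' comorbid conditions improves the models' predictive power, which in turn helps physicians make better diagnostic and treatment decisions. The study therefore suggested that properly identifying, recording, and using patients' comorbidity status could lower treatment costs and ease the economic burden of health care.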

Figure 4.4 A Data Mining Methodology for Investigation of Comorbidity in Cancer Survivability.

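These examples (along with many others in the medical literature) show that advanced data mining techniques can build models with a high degree of predictive and explanatory power. Although data mining methods can uncover patterns and relationships hidden in large, complex medical databases, the results may be meaningless without the cooperation and feedback of medical experts. The patterns found by data mining need to be evaluated by medical professionals with experience in the problem domain to determine whether they are logical, actionable, and novel enough to suggest new research directions. In short, data mining is not meant to replace medical professionals and researchers but to support their work, provide them with data-driven new research directions, and ultimately save more human lives.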

Questions for Discussion

1. How can data mining be used to treat diseases such as cancer?

2. What do you think are the promises and challenges of data mining in biomedical research?