According to the American Cancer Society, half of all men and one-third of all women in the United States will develop cancer during their lifetimes; approximately 1.5 million new cancer cases were expected to be diagnosed in 2013. Cancer is the second-most-common cause of death in the United States and in the world, exceeded only by cardiovascular disease. This year, over 500,000 Americans are expected to die of cancer (more than 1,300 people a day), accounting for nearly one of every four deaths.
Cancer is a group of diseases generally characterized by uncontrolled growth and spread of abnormal cells. If the growth and/or spread is not controlled, it can result in death. Even though the exact reasons are not known, cancer is believed to be caused by both external factors (e.g., tobacco, infectious organisms, chemicals, and radiation) and internal factors (e.g., inherited mutations, hormones, immune conditions, and mutations that occur from metabolism). These causal factors may act together or in sequence to initiate or promote carcinogenesis. Cancer is treated with surgery, radiation, chemotherapy, hormone therapy, biological therapy, and targeted therapy. Survival statistics vary greatly by cancer type and stage at diagnosis.
The 5-year relative survival rate for all cancers is improving, and the decline in cancer mortality reached 20% in 2013, translating into the avoidance of about 1.2 million deaths from cancer since 1991. That’s more than 400 lives saved per day! The improvement in survival reflects progress in diagnosing certain cancers at an earlier stage and improvements in treatment. Further improvements are needed to prevent and treat cancer.
Even though cancer research has traditionally been clinical and biological in nature, in recent years data-driven analytic studies have become a common complement. In medical domains where data and analytics-driven research have been applied successfully, novel research directions have been identified to further advance the clinical and biological studies. Using various types of data, including molecular, clinical, literature-based, and clinical trial data, along with suitable data mining tools and techniques, researchers have been able to identify novel patterns, paving the road toward a cancer-free society.
In one study, Delen (2009) used three popular data mining techniques (decision trees, artificial neural networks, and SVMs) in conjunction with logistic regression to develop prediction models for prostate cancer survivability. The data set contained around 120,000 records and 77 variables. A k-fold cross-validation methodology was used in model building, evaluation, and comparison. The results showed that support vector machines were the most accurate predictor (with a test set accuracy of 92.85%) for this domain, followed by artificial neural networks and decision trees. Furthermore, using a sensitivity-analysis-based evaluation method, the study also revealed novel patterns related to prognostic factors of prostate cancer.
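The k-fold cross-validation procedure mentioned above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the data are synthetic, and the two toy classifiers (a majority-class baseline and a nearest-centroid rule) stand in for the decision trees, neural networks, and SVMs used in the study, whose code and data are not reproduced here. The helper names (`k_fold_indices`, `cross_validate`, `make_synthetic`) are hypothetical.

```python
import random
from statistics import mean

def k_fold_indices(n, k, seed=0):
    """Shuffle indices 0..n-1 and deal them into k nearly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def majority_fit(X, y):
    """Baseline: always predict the most frequent training label."""
    label = max(set(y), key=y.count)
    return lambda row: label

def centroid_fit(X, y):
    """Nearest-centroid rule: predict the class whose mean vector is closest."""
    centroids = {}
    for label in set(y):
        members = [x for x, t in zip(X, y) if t == label]
        centroids[label] = [mean(col) for col in zip(*members)]
    def predict(row):
        return min(centroids,
                   key=lambda c: sum((a - b) ** 2 for a, b in zip(row, centroids[c])))
    return predict

def cross_validate(fit, X, y, k=10, seed=0):
    """Train on k-1 folds, score on the held-out fold, and average the k accuracies."""
    scores = []
    for held_out in k_fold_indices(len(X), k, seed):
        held = set(held_out)
        train = [i for i in range(len(X)) if i not in held]
        model = fit([X[i] for i in train], [y[i] for i in train])
        correct = sum(model(X[i]) == y[i] for i in held_out)
        scores.append(correct / len(held_out))
    return mean(scores)

def make_synthetic(n=300, seed=1):
    """Synthetic two-class data (NOT patient data): one informative feature, one noise feature."""
    rng = random.Random(seed)
    X, y = [], []
    for _ in range(n):
        label = rng.randint(0, 1)
        X.append([rng.gauss(2.0 * label, 1.0), rng.gauss(0.0, 1.0)])
        y.append(label)
    return X, y

X, y = make_synthetic()
results = {name: cross_validate(fit, X, y, k=10)
           for name, fit in [("majority", majority_fit),
                             ("nearest_centroid", centroid_fit)]}
for name, acc in results.items():
    print(f"{name}: mean 10-fold accuracy = {acc:.3f}")
```

Because every record is held out exactly once, each model is always scored on data it did not see during training, and all models are compared on the same folds. That is the property that makes cross-validated accuracies suitable for the kind of model comparison described above.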
In a related study, Delen, Walker, and Kadam (2005) used two data mining algorithms (artificial neural networks and decision trees) and logistic regression to develop prediction models for breast cancer survival using a large data set (more than 200,000 cases). The study used 10-fold cross-validation to obtain an unbiased estimate of each model's performance for comparison purposes. The results indicated that the decision tree (C5 algorithm) was the best predictor, with 93.6% accuracy on the holdout sample (which was the best prediction accuracy reported in the literature), followed by artificial neural networks, with 91.2% accuracy, and logistic regression, with 89.2% accuracy. Further analysis of prediction models revealed prioritized importance of the prognostic factors, which can then be used as a basis for further clinical and biological research studies.
In the most recent study, Zolbanin, Delen, and Zadeh (2015) studied the impact of comorbidity in cancer survivability. Although prior research has shown that diagnostic and treatment recommendations might be altered based on the severity of comorbidities, chronic diseases are still being investigated in isolation from one another in most cases. To illustrate the significance of concurrent chronic diseases in the course of treatment, their study used the Surveillance, Epidemiology, and End Results (SEER) Program’s cancer data to create two comorbid data sets: one for breast and female genital cancers and another for prostate and urinary cancers. Several popular machine-learning techniques were then applied to the resultant data sets to build predictive models (see Figure 4.4). Comparison of the results showed that having more information about comorbid conditions of patients can improve models’ predictive power, which in turn can help practitioners make better diagnostic and treatment decisions. Therefore, the study suggested that proper identification, recording, and use of patients’ comorbidity status can potentially lower treatment costs and ease the healthcare-related economic challenges.
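The idea that extra comorbidity information can raise predictive power can be sketched as a simple feature-ablation comparison: train the same model with and without the comorbidity feature and compare holdout accuracy. This is only an illustration under loud assumptions. The records are synthetic (not SEER data), the feature names `stage` and `comorbidity` are hypothetical, the outcome is constructed to depend on both features so the gain appears by design, and a nearest-centroid rule replaces the machine-learning techniques used in the study.

```python
import random
from statistics import mean

def make_patients(n, seed):
    """Synthetic records (stage, comorbidity) with a survival flag.
    ASSUMPTION: the outcome depends on BOTH features by construction."""
    rng = random.Random(seed)
    rows, labels = [], []
    for _ in range(n):
        stage = rng.gauss(0.0, 1.0)
        comorbidity = rng.gauss(0.0, 1.0)
        risk = stage + comorbidity + rng.gauss(0.0, 0.3)
        rows.append((stage, comorbidity))
        labels.append(1 if risk <= 0 else 0)  # 1 = survived
    return rows, labels

def fit_nearest_centroid(X, y):
    """Predict the class whose mean feature vector is closest."""
    centroids = {}
    for label in set(y):
        members = [x for x, t in zip(X, y) if t == label]
        centroids[label] = [mean(col) for col in zip(*members)]
    def predict(x):
        return min(centroids,
                   key=lambda c: sum((a - b) ** 2 for a, b in zip(x, centroids[c])))
    return predict

def select(rows, columns):
    """Keep only the listed feature columns of every record."""
    return [tuple(r[c] for c in columns) for r in rows]

def holdout_accuracy(columns, train, test):
    """Fit on the training records and score on the separate test records."""
    (X_tr, y_tr), (X_te, y_te) = train, test
    predict = fit_nearest_centroid(select(X_tr, columns), y_tr)
    hits = sum(predict(x) == t for x, t in zip(select(X_te, columns), y_te))
    return hits / len(y_te)

train = make_patients(600, seed=7)
test = make_patients(300, seed=8)

acc_without = holdout_accuracy([0], train, test)    # stage only
acc_with = holdout_accuracy([0, 1], train, test)    # stage + comorbidity
print(f"without comorbidity feature: {acc_without:.3f}")
print(f"with comorbidity feature:    {acc_with:.3f}")
```

On real registry data the size of any gain is an empirical question. The sketch shows only how such a with/without comparison can be set up on identical training and test records.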
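According to the American Cancer Society, half of all men and one-third of all women in the United States will develop cancer. About 1.5 million new cancer cases were expected in 2013. Cancer is the second-most-common fatal disease in the United States and worldwide, behind only cardiovascular disease. In 2013, more than 500,000 Americans were expected to die of cancer, an average of more than 1,300 people a day and nearly one-quarter of all deaths.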
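The 5-year relative survival rate for cancer patients is improving. By 2013 the mortality rate had declined by 20%, which means that about 1.2 million deaths have been avoided since 1991, or more than 400 lives saved per day. The rise in survival reflects progress in early diagnosis and advances in treatment. Cancer prevention and treatment still need further improvement.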
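In a 2009 study, Delen used three popular data mining methods (decision trees, artificial neural networks, and support vector machines) together with logistic regression to analyze a data set of about 120,000 records and 77 variables and build prediction models for prostate cancer survivability. k-fold cross-validation was used for model building, evaluation, and comparison. The results showed that the support vector machine model was the most accurate (test set accuracy of 92.85%), followed by artificial neural networks and decision trees. In addition, a sensitivity-analysis-based evaluation revealed novel patterns related to the prognosis of prostate cancer.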
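Delen, Walker, and Kadam (2005) applied two data mining algorithms (artificial neural networks and decision trees) together with logistic regression to a large data set of more than 200,000 cases to build prediction models for breast cancer survival. Using 10-fold cross-validation to obtain unbiased performance estimates for comparison, they found that the decision tree (C5.0 algorithm) was the most accurate, reaching 93.6% on the test sample, the highest prediction accuracy reported in the literature. Artificial neural networks followed at 91.2%, and logistic regression was the least accurate at 89.2%. Further analysis of the prediction models revealed the relative importance of the prognostic factors, which can serve as a basis for further clinical and biological research.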
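In the most recent study, Zolbanin et al. (2015) examined the effect of comorbidity on cancer survivability. Although earlier research showed that diagnostic and treatment recommendations may be adjusted according to the severity of comorbidities, chronic diseases are still studied in isolation in most cases. To illustrate the importance of concurrent chronic diseases during treatment, their study used cancer data from the Surveillance, Epidemiology, and End Results (SEER) Program to produce two comorbidity data sets: one for breast and female genital cancers and another for prostate and urinary tract cancers. Several popular machine-learning techniques were then applied to the combined data sets to build predictive models. Comparing the results showed that having more information about patients' comorbid conditions improves the models' predictive power, which in turn helps physicians make better diagnostic and treatment decisions. The study therefore suggested that properly identifying, recording, and using patients' comorbidity status can potentially lower treatment costs and ease healthcare-related economic challenges.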