Application Case 5.3

Mining for Lies

    Driven by advancements in Web-based information technologies and increasing globalization, computer-mediated communication continues to filter into everyday life, bringing with it new venues for deception. The volume of text-based chat, instant messaging, text messaging, and text generated by online communities of practice is increasing rapidly. Even e-mail continues to grow in use. With the massive growth of text-based communication, the potential for people to deceive others through computer-mediated communication has also grown, and such deception can have disastrous results.
    Unfortunately, humans generally perform poorly at deception-detection tasks, and the problem is exacerbated in text-based communication. A large part of the research on deception detection (also known as credibility assessment) has involved face-to-face meetings and interviews. Yet, with the growth of text-based communication, text-based deception-detection techniques are essential.
    Techniques for successfully detecting deception (that is, lies) have wide applicability. Law enforcement can use decision support tools and techniques to investigate crimes, conduct security screening in airports, and monitor communications of suspected terrorists. Human resources professionals might use deception-detection tools to screen applicants. These tools and techniques also have the potential to screen e-mails to uncover fraud or other wrongdoing committed by corporate officers. Although some people believe that they can readily identify those who are not being truthful, a summary of deception research showed that, on average, people are only 54% accurate in making veracity determinations (Bond & DePaulo, 2006). This figure may actually be worse when humans try to detect deception in text.
    Using a combination of text mining and data mining techniques, Fuller et al. (2008) analyzed person-of-interest statements completed by people involved in crimes on military bases. In these statements, suspects and witnesses are required to write their recollection of the event in their own words. Military law enforcement personnel searched archival data for statements that they could conclusively identify as being truthful or deceptive. These decisions were made on the basis of corroborating evidence and case resolution. Once the statements were labeled as truthful or deceptive, the law enforcement personnel removed identifying information and gave the statements to the research team. In total, 371 usable statements were received for analysis. The text-based deception-detection method used by Fuller et al. (2008) was based on a process known as message feature mining, which relies on elements of data and text mining techniques. A simplified depiction of the process is provided in Figure 5.3.
    First, the researchers prepared the data for processing. The original handwritten statements had to be transcribed into a word processing file. Second, features (i.e., cues) were identified. The researchers identified 31 features representing categories or types of language that are relatively independent of the text content and that can be readily analyzed by automated means. For example, first-person pronouns such as I or me can be identified without analysis of the surrounding text. Table 5.1 lists the categories and an example list of features used in this study.
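    To make the idea of content-independent cues concrete, the following is a minimal Python sketch of how a few such features might be counted in a transcribed statement. The pronoun list, the feature names, and the `extract_cues` function are assumptions made for illustration only; they are not the 31 features used in the study, which are the ones summarized in Table 5.1.

```python
import re

# Hypothetical cue list for illustration only; not the study's feature set.
FIRST_PERSON = {"i", "me", "my", "mine", "myself"}

def extract_cues(statement: str) -> dict:
    """Count a few content-independent cues in one transcribed statement."""
    # Lowercase the text and keep alphabetic tokens (apostrophes allowed).
    words = re.findall(r"[a-z']+", statement.lower())
    n = len(words)
    first_person = sum(1 for w in words if w in FIRST_PERSON)
    return {
        "word_count": n,
        "first_person_count": first_person,
        # Ratios guard against empty statements.
        "first_person_ratio": first_person / n if n else 0.0,
        "avg_word_length": sum(len(w) for w in words) / n if n else 0.0,
    }

cues = extract_cues("I saw him leave, and he did not see me.")
print(cues)
```

    Each statement is reduced to a row of numbers like this one, with no attempt to interpret what the statement is about, which is what allows the cues to be computed by automated means.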
    The features were extracted from the textual statements and input into a flat file for further processing. Using several feature-selection methods along with 10-fold cross-validation, the researchers compared the prediction accuracy of three popular data mining methods. Their results indicated that neural network models performed the best, with 73.46% prediction accuracy on test data samples; decision trees performed second best, with 71.60% accuracy; and logistic regression was last, with 65.28% accuracy.
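    The 10-fold cross-validation procedure mentioned above can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the researchers' code: the helper names, the toy one-feature threshold model, and the synthetic, perfectly separable data are all hypothetical. In the study, the models compared were neural networks, decision trees, and logistic regression; any of them could take the place of the `fit` and `predict` functions here.

```python
import random

def k_fold_indices(n, k=10, seed=0):
    """Shuffle indices 0..n-1 and split them into k nearly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_val_accuracy(X, y, fit, predict, k=10, seed=0):
    """Average accuracy on the held-out fold, over k train/test splits."""
    scores = []
    for held_out in k_fold_indices(len(X), k, seed):
        held = set(held_out)
        train = [i for i in range(len(X)) if i not in held]
        model = fit([X[i] for i in train], [y[i] for i in train])
        correct = sum(predict(model, X[i]) == y[i] for i in held_out)
        scores.append(correct / len(held_out))
    return sum(scores) / k

def fit_midpoint(X, y):
    """Toy stand-in model: threshold halfway between class means of feature 0."""
    pos = [x[0] for x, label in zip(X, y) if label == 1]
    neg = [x[0] for x, label in zip(X, y) if label == 0]
    mean_pos, mean_neg = sum(pos) / len(pos), sum(neg) / len(neg)
    return (mean_pos + mean_neg) / 2, mean_pos >= mean_neg

def predict_midpoint(model, x):
    threshold, pos_is_high = model
    return int((x[0] > threshold) == pos_is_high)

# Synthetic, perfectly separable data (NOT the study's statements).
y = [i % 2 for i in range(100)]
X = [[label * 10 + (i % 7) * 0.1] for i, label in enumerate(y)]

folds = k_fold_indices(len(X))
acc = cross_val_accuracy(X, y, fit_midpoint, predict_midpoint)
print(f"mean held-out accuracy: {acc:.2%}")
```

    The key point is that every statement is used for testing exactly once and never appears in the training data of the model that scores it, so the reported accuracy reflects performance on unseen statements.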
    The results indicate that automated text-based deception detection has the potential to aid those who must try to detect lies in text and can be successfully applied to real-world data. The accuracy of these techniques exceeded that of most other deception-detection techniques, even though they were limited to textual cues.

Questions for Discussion

1. Why is it difficult to detect deception?

2. How can text/data mining be used to detect deception in text?

3. What do you think are the main challenges for such an automated system?
