Information retrieval (信息检索)


https://en.wikipedia.org/wiki/Information_retrieval

Information retrieval (an information technology)

Information retrieval (IR) refers to the process and techniques of organizing information in a certain way and finding the relevant information according to the needs of information users. In the narrow sense, information retrieval is only the second half of that process, namely finding the needed information within an information collection, which is commonly called information search (or information seeking). Unless otherwise stated, "information retrieval" refers to the broad sense.
Information retrieval also refers to the process of finding, from a collection of information resources, the needed documents or the information contained in those documents.[1]
Chinese name: 信息检索
Foreign name: Information Retrieval
Matching: information retrieval is also a matching process
Retrieval process: comprises two aspects, information processing and retrieval

Origin


Information retrieval originated in libraries' reference services and abstracting and indexing work. It began to develop in the second half of the 19th century, and by the 1940s indexing and retrieval had become independent tools and user services of libraries. After the world's first electronic computer appeared in 1946, computer technology gradually entered the field of information retrieval and became closely integrated with retrieval theory; offline batch retrieval systems and online real-time retrieval systems were developed and commercialized in succession. From the 1960s to the 1980s, driven by advances in information processing, communications, computing, and database technology, information retrieval developed rapidly and was widely applied in education, the military, business, and other fields. The Dialog international online retrieval system is representative of this period and remains one of the best-known systems in the world.

Definition


Information retrieval has a broad and a narrow sense. In the broad sense, information retrieval, in full "information storage and retrieval", refers to organizing and storing information in a certain way and finding the relevant information according to users' needs. In the narrow sense, it is the second half of "information storage and retrieval", usually called "information search" or "information seeking": finding the information users need within an information collection. The narrow sense covers three aspects: understanding users' information needs, the techniques or methods of retrieval, and satisfying users' needs.
As the principle of information retrieval shows, information storage is the basis of retrieval. The information to be stored includes not only original text documents but also pictures, video, and audio; this raw material must first be converted into a machine-readable representation and stored in a database, otherwise it cannot be processed by the machine. After the user submits a query expressing an information need, the retrieval system searches the database for information related to the query, computes the similarity of each item to the query through some matching mechanism, and returns the results sorted by similarity in descending order.
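This pipeline can be summarized in a short Python sketch. It is only an illustrative toy, assuming made-up document texts, a whitespace tokenizer, and a simple term-overlap score rather than the design of any particular retrieval system:

    # Toy "database" of documents; texts, tokenizer, and similarity are illustrative assumptions.
    from collections import Counter

    documents = {
        "d1": "information retrieval finds relevant documents",
        "d2": "databases store structured records",
        "d3": "search engines rank documents by relevance",
    }

    def tokenize(text):
        # Convert raw text into a machine-processable bag of terms.
        return Counter(text.lower().split())

    # Storage step: index every document as a term-frequency record.
    index = {doc_id: tokenize(text) for doc_id, text in documents.items()}

    def similarity(query_terms, doc_terms):
        # Crude similarity: number of shared term occurrences.
        return sum(min(count, doc_terms[term]) for term, count in query_terms.items())

    def search(query):
        # Match the query against stored records and return them, most similar first.
        query_terms = tokenize(query)
        scored = [(similarity(query_terms, terms), doc_id) for doc_id, terms in index.items()]
        return sorted(scored, reverse=True)

    print(search("relevant documents"))  # d1 scores highest, then d3, then d2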

Types


(1) By the object of storage and retrieval, information retrieval can be divided into document retrieval, data retrieval, and fact retrieval.
The main difference between these three types is that data retrieval and fact retrieval aim to retrieve the information itself contained in documents, whereas document retrieval only needs to retrieve the documents that contain the needed information.
(2) By the storage medium and the technical means used for searching, it can be divided into manual retrieval, mechanical retrieval, and computer retrieval.
Among these, the most rapidly developing form of computer retrieval is "network information retrieval", also called web information search: the behavior of Internet users finding and obtaining information at a network terminal, through specific web search tools or by browsing.

(3) By retrieval path, it can be divided into:
Direct retrieval
Indirect retrieval

Main steps


Analyzing and encoding the information content to produce information records and retrieval identifiers.
Organized storage: arranging all records into an ordered information collection in the form of files, databases, and so on.
Processing user queries and producing retrieval output. The key part is matching and selection between the query and the information collection, that is, comparing the similarity of the given query with the records in the collection and selecting the relevant information according to a certain matching criterion. By object, retrieval divides into document retrieval, data retrieval, and fact retrieval; by equipment, into manual, mechanical, and computer retrieval. A service facility built from certain equipment and an information collection is called an information retrieval system, such as punched-card systems, online retrieval systems, CD-ROM retrieval systems, and multimedia retrieval systems. Information retrieval was first applied in libraries and scientific and technical information institutions, later spread to other fields, and became integrated with various management information systems. The theories, techniques, and services related to information retrieval form a relatively independent field of knowledge; it is an important branch of information science and intersects with applied computer technology.
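As a minimal sketch of the "organized storage" and "matching and selection" steps, the following Python fragment builds an inverted index over a few hypothetical records and selects the records containing every query term; the records and the all-terms match criterion are illustrative assumptions only:

    from collections import defaultdict

    # Hypothetical records standing in for an information collection.
    records = {
        1: "punched card retrieval system",
        2: "online retrieval system for journals",
        3: "multimedia database of images",
    }

    # Organized storage: an inverted index mapping each term to the records containing it.
    inverted_index = defaultdict(set)
    for rec_id, text in records.items():
        for term in text.lower().split():
            inverted_index[term].add(rec_id)

    def match(query):
        # Simple matching criterion: keep records that contain every query term.
        term_sets = [inverted_index[t] for t in query.lower().split()]
        return set.intersection(*term_sets) if term_sets else set()

    print(match("retrieval system"))  # {1, 2}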

Hot topics


Intelligent retrieval or knowledge retrieval
Traditional full-text retrieval matches on keywords and often suffers from incomplete and inaccurate results and low retrieval quality; in the era of networked information in particular, keyword matching can hardly satisfy users' retrieval needs. Intelligent retrieval improves results by using word-segmentation dictionaries, synonym dictionaries, and homophone dictionaries: for example, when a user searches for "计算机" (computer), information related to "电脑" (PC) can also be retrieved. Going further, queries can be assisted at the knowledge or concept level: dictionaries of subject terms, broader and narrower terms, and related terms at the same level form a knowledge system or concept network that gives the user intelligent suggestions and ultimately helps the user obtain the best retrieval results. For example, the user can narrow the query to "微机" (microcomputer) or "服务器" (server), broaden it to "信息技术" (information technology), or query related areas such as "电子技术" (electronic technology), "软件" (software), or "计算机应用" (computer applications). Intelligent retrieval also handles ambiguous terms, such as whether "苹果" (apple) refers to the fruit or the computer brand, or distinguishing "华人" from "*"; ambiguity knowledge bases, full-text indexes, analysis of the user's retrieval context, and user relevance feedback are combined so that the information the user most needs is returned efficiently and accurately.
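As a small illustration of the synonym-expansion idea just described, the following Python sketch broadens a query with a hand-made dictionary; the dictionary entries are assumptions for the example only:

    # Hypothetical synonym dictionary used to broaden a query.
    synonyms = {
        "计算机": {"电脑", "微机"},
    }

    def expand_query(terms):
        # Add dictionary synonyms so documents using different wording still match.
        expanded = set(terms)
        for term in terms:
            expanded |= synonyms.get(term, set())
        return expanded

    print(expand_query(["计算机"]))  # also matches documents mentioning 电脑 or 微机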

Knowledge mining

This mainly refers to the development of text mining technology, whose goal is to help people better discover, organize, and represent information and extract knowledge, meeting the higher-level needs of information retrieval. Knowledge mining includes summarization, classification (clustering), and similarity retrieval.
Automatic summarization uses a computer to extract an abstract from the original document automatically. In retrieval, automatic summaries help users quickly judge how relevant the results are; in information services, they support content delivery in various forms, such as delivery to PDAs or mobile phones. Similarity retrieval finds documents that are similar or related based on content features; it is the basis for personalized relevance feedback and can also be used for duplicate detection. Automatic classification can be based on statistics or rules: a predefined classification tree is formed through machine learning, and documents are then assigned to categories according to their content features. Automatic clustering groups documents by how closely their contents are related. Automatic classification (clustering) is very useful for information organization and navigation.
Integrated retrieval of heterogeneous information and "holographic" retrieval
As information retrieval becomes more distributed and networked, retrieval systems are required to be increasingly open and integrated, able to retrieve and integrate information from different sources and with different structures. This is the starting point for heterogeneous information retrieval technology, which includes processing and retrieving various file formats such as plain text, HTML, XML, RTF, MS Office, PDF, PS2/PS, MARC, and ISO 2709; supporting retrieval of multilingual information; handling structured, semi-structured, and unstructured data uniformly; and seamless integration with relational database retrieval as well as with other open retrieval interfaces. The notion of "holographic retrieval" is retrieval that supports every format and mode; in practice, reaching the level of integrated retrieval of heterogeneous information still awaits further breakthroughs in natural language understanding, human-computer interaction, and the retrieval and integration of multimedia information.
In addition, from an engineering perspective, multi-level caching that combines memory and external storage, distributed clustering, and load balancing are also important aspects of the development of information retrieval technology.
With the spread of the Internet and the growth of e-commerce, the amount of information that enterprises and individuals can obtain and must process is growing explosively, and the vast majority of it is unstructured or semi-structured data. Content management is becoming ever more important, and information retrieval, as the core supporting technology of content management, will be applied in every field as content management develops and spreads, becoming a close companion in people's daily work and life.

Reasons for information retrieval


1. Information retrieval is a shortcut to acquiring knowledge
A young undergraduate in the physics department of Princeton University named John Phillips, drawing only on openly available materials borrowed from the library, produced a design drawing for an atomic bomb in just four months. His design was small (the size of a baseball), light (7.5 kg), powerful (about three quarters of the power of the Hiroshima bomb), and cheap (about two thousand US dollars at the time), so much so that several countries (France, Pakistan, and others) wrote to the US embassy seeking to buy copies of his design.
In the 1970s, the American nuclear expert Teller received a report entitled "How to Build a Nuclear Bomb". Impressed by its superb technical design, he exclaimed that of all the reports he had seen, it was the most detailed and comprehensive. What astonished him even more was that the report had been written by a young Harvard economics student, and that all of the information in this technical report of more than four hundred pages came from perfectly ordinary, completely public library materials.

2. Information retrieval is a guide for scientific research
During the Apollo program in the United States, pressure tests of the Apollo spacecraft's fuel tank revealed that methanol caused stress corrosion of titanium, and millions of dollars were spent on solving the problem. It was later found that someone had worked this out more than a decade earlier: the solution was simply to add 2% water to the methanol, and retrieving that paper took about ten minutes. Duplicated effort exists to varying degrees in research and development in every country. According to statistics, the annual loss in the United States caused by duplicated research is about 38% of the yearly research budget, as much as two billion dollars. In Japan, the proportion of chemistry and chemical engineering research topics that duplicate work done abroad is about 40% in universities, 47% in the private sector, and 40% in national research institutes, an average duplication rate above 40%; in China the rate is even higher.
3. Information retrieval is the foundation of lifelong education
The goal of schools is to cultivate students' abilities, including the abilities to learn independently, do research, think, express themselves, and organize and manage.
UNESCO has proposed that education extends over a person's entire life, holding that only comprehensive lifelong education can cultivate well-rounded people, prevent knowledge from becoming obsolete, keep knowledge up to date, and meet the needs of the modern information society.

Four elements


1. The premise of information retrieval: information awareness
Information awareness is the internal motivation that leads people to use information systems to obtain the information they need. It shows itself as sensitivity to information and the ability to select, digest, and absorb it, so as to judge whether a piece of information can be used by oneself or by a group and whether it can solve a particular practical problem. Information awareness has three levels: information cognition, information affect, and information behavior tendency.
The term information literacy was first proposed by Paul Zurkowski, president of the US Information Industry Association, in a report submitted in the United States in 1974. In his view, information literacy is the ability to use information in one's work, to learn information technology, and to use information to solve problems.
2. The basis of information retrieval: information sources
Definition of information source: in the UNESCO publication Terminology of Documentation, an information source is defined as the source from which an individual obtains information to satisfy his or her information needs.[2]
Types of information sources:
By mode of expression: oral, body-language, physical-object, and documentary information sources.[2]
By digitized form of record: bibliographic sources, general books, reference works, newspapers and periodicals, special types of literature, digital library sources, and search engine sources.[2]
By document carrier: printed, microform, machine-readable, and audiovisual.
By document content and degree of processing: primary, secondary, and tertiary information.
By form of publication: books, newspapers and periodicals, research reports, conference information, patent information, statistical data, * publications, archives, dissertations, and standards (these are regarded as the ten major information sources, of which the last eight are called special types of literature; educational information resources are mainly found in educational books, specialized journals, dissertations, and other types of publications).
3. The core of information retrieval: the ability to obtain information
1. Know the various sources of information
2. Master retrieval languages
3. Use retrieval tools proficiently
4. Be able to judge and evaluate retrieval results
Two indicators for judging retrieval effectiveness:
Recall = number of relevant items retrieved / total number of relevant items in the collection (%)
Precision = number of relevant items retrieved / total number of items retrieved (%)
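A small worked example of the two indicators, with made-up counts, in Python:

    relevant_in_collection = 50   # relevant items that exist in the collection
    retrieved = 40                # items the search returned
    relevant_retrieved = 30       # returned items that are actually relevant

    recall = relevant_retrieved / relevant_in_collection   # 30 / 50 = 60%
    precision = relevant_retrieved / retrieved             # 30 / 40 = 75%
    print(f"recall = {recall:.0%}, precision = {precision:.0%}")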
4. The key to information retrieval: using the information
Social progress is a process in which knowledge is continually produced, circulated, and reproduced.
To make full and effective use of existing knowledge and information, the proportion of time spent on information retrieval during study, scientific research, and everyday life keeps increasing.

The ultimate purpose of obtaining academic information is to organize, analyze, generalize, and summarize what has been found and, guided by one's own thinking during study and research, to recombine the various pieces of information and create new knowledge and information, thereby activating the information and adding value to it.

Retrieval methods


Information retrieval methods include the general method, the citation-tracing method, and the segmented method. 1. The general method uses retrieval tools such as bibliographies, abstracts, and indexes to find documents. The key to this method is to be familiar with the nature, characteristics, and search procedures of the various retrieval tools and to search from different angles. The general method can be further divided into the forward method and the backward method. The forward method searches in chronological order from the past to the present; it is costly and inefficient. The backward method searches in reverse chronological order from recent to older material; it emphasizes recent material and current information, is more proactive, and gives better results.

2. The citation-tracing method follows the references attached to existing documents to trace further documents. When retrieval tools are unavailable or incomplete, this method can yield highly targeted material; its precision is relatively high, but its recall is relatively poor.
3. The segmented method combines the citation-tracing method and the general method, using the two alternately in stages until the needed material is found.

General procedure of retrieval


(1) Analyze the problem.
(2) Select retrieval tools.
Indicative retrieval tools that provide leads (secondary literature): bibliographies, library catalogs, indexes, abstracts, and guides to reference works.
Reference tools that provide the information itself (tertiary literature): dictionaries, books of quotations, encyclopedias, classified compendia (leishu), administrative compendia (zhengshu), biographical materials, handbooks, directories of institutions, geographical materials, statistical materials, yearbooks, chronological tables and atlases, and * documents.
(3) Use the retrieval tools.
(4) Obtain the original documents.
(5) Analyze the retrieval results.
(6) Revise the retrieval strategy.

Information retrieval (IR) is the activity of obtaining information resources relevant to an information need from a collection of information resources. Searches can be based on full-text or other content-based indexing.

Automated information retrieval systems are used to reduce what has been called "information overload". Many universities and public libraries use IR systems to provide access to books, journals and other documents. Web search engines are the most visible IR applications.


Overview

An information retrieval process begins when a user enters a query into the system. Queries are formal statements of information needs, for example search strings in web search engines. In information retrieval a query does not uniquely identify a single object in the collection. Instead, several objects may match the query, perhaps with different degrees of relevancy.

An object is an entity that is represented by information in a content collection or database. User queries are matched against the database information. However, as opposed to classical SQL queries of a database, in information retrieval the results returned may or may not match the query, so results are typically ranked. This ranking of results is a key difference of information retrieval searching compared to database searching.[1]

Depending on the application, the data objects may be, for example, text documents, images,[2] audio,[3] mind maps[4] or videos. Often the documents themselves are not kept or stored directly in the IR system, but are instead represented in the system by document surrogates or metadata.

Most IR systems compute a numeric score on how well each object in the database matches the query, and rank the objects according to this value. The top ranking objects are then shown to the user. The process may then be iterated if the user wishes to refine the query.[5]
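The contrast with exact database-style matching can be sketched in a few lines of Python; the collection and the term-count scoring below are illustrative assumptions, not how any particular engine works:

    collection = {
        "doc1": "the cat sat on the mat",
        "doc2": "cats and dogs",
        "doc3": "stock market report",
    }

    def exact_match(query):
        # Database-style behaviour: an object either matches the query string or it does not.
        return [doc_id for doc_id, text in collection.items() if query in text]

    def ranked_match(query):
        # IR-style behaviour: every object gets a score and results are ordered by it.
        terms = query.lower().split()
        scores = {doc_id: sum(text.count(t) for t in terms) for doc_id, text in collection.items()}
        return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

    print(exact_match("cat sat"))   # ['doc1'] only
    print(ranked_match("cat sat"))  # doc1 first, doc2 with a partial match, doc3 scoring 0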

History

there is ... a machine called the Univac ... whereby letters and figures are coded as a pattern of magnetic spots on a long steel tape. By this means the text of a document, preceded by its subject code symbol, can be recorded ... the machine ... automatically selects and types out those references which have been coded in any desired way at a rate of 120 words a minute
—  J. E. Holmstrom, 1948

The idea of using computers to search for relevant pieces of information was popularized in the article As We May Think by Vannevar Bush in 1945.[6] It would appear that Bush was inspired by patents for a 'statistical machine' - filed by Emanuel Goldberg in the 1920s and '30s - that searched for documents stored on film.[7] The first description of a computer searching for information was given by Holmstrom in 1948,[8] detailing an early mention of the Univac computer. Automated information retrieval systems were introduced in the 1950s: one even featured in the 1957 romantic comedy, Desk Set. In the 1960s, the first large information retrieval research group was formed by Gerard Salton at Cornell. By the 1970s several different retrieval techniques had been shown to perform well on small text corpora such as the Cranfield collection (several thousand documents).[6] Large-scale retrieval systems, such as the Lockheed Dialog system, came into use early in the 1970s.

In 1992, the US Department of Defense, along with the National Institute of Standards and Technology (NIST), cosponsored the Text Retrieval Conference (TREC) as part of the TIPSTER text program. The aim was to support the information retrieval community by supplying the infrastructure needed for the evaluation of text retrieval methodologies on a very large text collection. This catalyzed research on methods that scale to huge corpora. The introduction of web search engines has boosted the need for very large scale retrieval systems even further.

Model types

Categorization of IR-models (translated from German entry, original source Dominik Kuropka).

For effectively retrieving relevant documents by IR strategies, the documents are typically transformed into a suitable representation. Each retrieval strategy incorporates a specific model for its document representation purposes. The figure above illustrates the relationship of some common models. In the figure, the models are categorized according to two dimensions: the mathematical basis and the properties of the model.

First dimension: mathematical basis

Second dimension: properties of the model

  • Models without term-interdependencies treat different terms/words as independent. This fact is usually represented in vector space models by the orthogonality assumption of term vectors (a minimal sketch of the vector space case follows this list), or in probabilistic models by an independency assumption for term variables.
  • Models with immanent term interdependencies allow a representation of interdependencies between terms. However, the degree of the interdependency between two terms is defined by the model itself. It is usually directly or indirectly derived (e.g. by dimensional reduction) from the co-occurrence of those terms in the whole set of documents.
  • Models with transcendent term interdependencies allow a representation of interdependencies between terms, but they do not allege how the interdependency between two terms is defined. They rely on an external source for the degree of interdependency between two terms (for example, a human or sophisticated algorithms).
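The vector space case without term interdependencies can be sketched in Python as follows: each distinct term is treated as its own (implicitly orthogonal) axis and documents are compared by cosine similarity. Raw term counts are used here for brevity; real systems typically apply tf-idf or similar weighting:

    import math
    from collections import Counter

    def vectorize(text):
        # Each distinct term becomes an independent dimension; weights are raw counts.
        return Counter(text.lower().split())

    def cosine(a, b):
        shared = set(a) & set(b)
        dot = sum(a[t] * b[t] for t in shared)
        norm_a = math.sqrt(sum(v * v for v in a.values()))
        norm_b = math.sqrt(sum(v * v for v in b.values()))
        return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

    doc = vectorize("information retrieval evaluates retrieval models")
    query = vectorize("retrieval models")
    print(round(cosine(query, doc), 3))  # about 0.80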

Performance and correctness measures

The evaluation of an information retrieval system is the process of assessing how well a system meets the information needs of its users. Traditional evaluation metrics, designed for Boolean retrieval or top-k retrieval, include precision and recall. Many more measures for evaluating the performance of information retrieval systems have also been proposed. In general, measurement considers a collection of documents to be searched and a search query. All common measures described here assume a ground truth notion of relevancy: every document is known to be either relevant or non-relevant to a particular query. In practice, queries may be ill-posed and there may be different shades of relevancy.

Virtually all modern evaluation metrics (e.g., mean average precision, discounted cumulative gain) are designed for ranked retrieval without any explicit rank cutoff, taking into account the relative order of the documents retrieved by the search engines and giving more weight to documents returned at higher ranks.[citation needed]

The mathematical symbols used in the formulas below mean:

  • $X \cap Y$ - Intersection - in this case, specifying the documents in both sets X and Y
  • $|X|$ - Cardinality - in this case, the number of documents in set X
  • $\int$ - Integral
  • $\sum$ - Summation
  • $\Delta$ - Symmetric difference

Precision

Main article: Precision and recall

Precision is the fraction of the documents retrieved that are relevant to the user's information need.

$$\text{precision} = \frac{|\{\text{relevant documents}\} \cap \{\text{retrieved documents}\}|}{|\{\text{retrieved documents}\}|}$$

In binary classification, precision is analogous to positive predictive value. Precision takes all retrieved documents into account. It can also be evaluated at a given cut-off rank, considering only the topmost results returned by the system. This measure is called precision at n or P@n.

Note that the meaning and usage of "precision" in the field of information retrieval differs from the definition of accuracy and precision within other branches of science and statistics.

Recall

Main article: Precision and recall

Recall is the fraction of the documents that are relevant to the query that are successfully retrieved.

$$\text{recall} = \frac{|\{\text{relevant documents}\} \cap \{\text{retrieved documents}\}|}{|\{\text{relevant documents}\}|}$$

In binary classification, recall is often called sensitivity. So it can be looked at as the probability that a relevant document is retrieved by the query.

It is trivial to achieve recall of 100% by returning all documents in response to any query. Therefore, recall alone is not enough but one needs to measure the number of non-relevant documents also, for example by computing the precision.
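A minimal Python sketch of the two set-based definitions above, with made-up document identifiers:

    retrieved = {"d1", "d2", "d3", "d4"}
    relevant = {"d1", "d3", "d5"}

    hits = retrieved & relevant
    precision = len(hits) / len(retrieved)  # 2 / 4 = 0.5
    recall = len(hits) / len(relevant)      # 2 / 3 ≈ 0.67
    print(precision, recall)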

Fall-out

The proportion of non-relevant documents that are retrieved, out of all non-relevant documents available:

$$\text{fall-out} = \frac{|\{\text{non-relevant documents}\} \cap \{\text{retrieved documents}\}|}{|\{\text{non-relevant documents}\}|}$$

In binary classification, fall-out is closely related to specificity and is equal to $1 - \text{specificity}$. It can be looked at as the probability that a non-relevant document is retrieved by the query.

It is trivial to achieve fall-out of 0% by returning zero documents in response to any query.

F-score / F-measure

Main article: F-score

The weighted harmonic mean of precision and recall, the traditional F-measure or balanced F-score is:

$$F = \frac{2 \cdot \mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}$$

This is also known as the $F_1$ measure, because recall and precision are evenly weighted.

The general formula for non-negative real $\beta$ is:

$$F_\beta = \frac{(1+\beta^2) \cdot (\mathrm{precision} \cdot \mathrm{recall})}{\beta^2 \cdot \mathrm{precision} + \mathrm{recall}}$$

Two other commonly used F measures are the $F_2$ measure, which weights recall twice as much as precision, and the $F_{0.5}$ measure, which weights precision twice as much as recall.

The F-measure was derived by van Rijsbergen (1979) so that $F_\beta$ "measures the effectiveness of retrieval with respect to a user who attaches $\beta$ times as much importance to recall as precision". It is based on van Rijsbergen's effectiveness measure $E = 1 - \frac{1}{\frac{\alpha}{P} + \frac{1-\alpha}{R}}$. Their relationship is:

$$F_\beta = 1 - E \quad \text{where} \quad \alpha = \frac{1}{1+\beta^2}$$

F-measure can be a better single metric when compared to precision and recall; both precision and recall give different information that can complement each other when combined. If one of them excels more than the other, F-measure will reflect it.[citation needed]
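A small Python sketch of the balanced and weighted F-measures defined above, with made-up precision and recall values:

    def f_beta(precision, recall, beta=1.0):
        if precision == 0 and recall == 0:
            return 0.0
        b2 = beta ** 2
        return (1 + b2) * precision * recall / (b2 * precision + recall)

    p, r = 0.5, 0.8
    print(f_beta(p, r))       # F1: harmonic mean, about 0.615
    print(f_beta(p, r, 2))    # F2 weights recall more heavily
    print(f_beta(p, r, 0.5))  # F0.5 weights precision more heavily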

Average precision

Precision and recall are single-value metrics based on the whole list of documents returned by the system. For systems that return a ranked sequence of documents, it is desirable to also consider the order in which the returned documents are presented. By computing a precision and recall at every position in the ranked sequence of documents, one can plot a precision-recall curve, plotting precision $p(r)$ as a function of recall $r$. Average precision computes the average value of $p(r)$ over the interval from $r=0$ to $r=1$:[9]

$$\operatorname{AveP} = \int_0^1 p(r)\,dr$$

That is the area under the precision-recall curve. This integral is in practice replaced with a finite sum over every position in the ranked sequence of documents:

$$\operatorname{AveP} = \sum_{k=1}^{n} P(k)\,\Delta r(k)$$

where $k$ is the rank in the sequence of retrieved documents, $n$ is the number of retrieved documents, $P(k)$ is the precision at cut-off $k$ in the list, and $\Delta r(k)$ is the change in recall from items $k-1$ to $k$.[9]

This finite sum is equivalent to:

$$\operatorname{AveP} = \frac{\sum_{k=1}^{n} \left(P(k) \times \operatorname{rel}(k)\right)}{\text{number of relevant documents}}$$

where $\operatorname{rel}(k)$ is an indicator function equaling 1 if the item at rank $k$ is a relevant document, zero otherwise.[10] Note that the average is over all relevant documents and the relevant documents not retrieved get a precision score of zero.

Some authors choose to interpolate the $p(r)$ function to reduce the impact of "wiggles" in the curve.[11][12] For example, the PASCAL Visual Object Classes challenge (a benchmark for computer vision object detection) computes average precision by averaging the precision over a set of evenly spaced recall levels {0, 0.1, 0.2, ..., 1.0}:[11][12]

$$\operatorname{AveP} = \frac{1}{11} \sum_{r \in \{0, 0.1, \ldots, 1.0\}} p_{\operatorname{interp}}(r)$$

where $p_{\operatorname{interp}}(r)$ is an interpolated precision that takes the maximum precision over all recalls greater than $r$:

$$p_{\operatorname{interp}}(r) = \max_{\tilde{r}:\,\tilde{r} \geq r} p(\tilde{r}).$$
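A small Python sketch of average precision in the indicator form above, together with the 11-point interpolated variant; the relevance judgments are made up:

    relevance = [1, 0, 1, 1, 0]   # rel(k) for ranks 1..5
    total_relevant = 3            # relevant documents in the whole collection

    def average_precision(rel, total_relevant):
        hits, score = 0, 0.0
        for k, r in enumerate(rel, start=1):
            if r:
                hits += 1
                score += hits / k   # precision P(k), counted only at relevant ranks
        return score / total_relevant

    def interpolated_ap(rel, total_relevant):
        # (recall, precision) at every rank, then max precision at recall >= each level.
        points, hits = [], 0
        for k, r in enumerate(rel, start=1):
            hits += r
            points.append((hits / total_relevant, hits / k))
        levels = [i / 10 for i in range(11)]
        interp = [max((p for rec, p in points if rec >= level), default=0.0) for level in levels]
        return sum(interp) / 11

    print(average_precision(relevance, total_relevant))  # (1/1 + 2/3 + 3/4) / 3 ≈ 0.806
    print(interpolated_ap(relevance, total_relevant))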

An alternative is to derive an analytical $p(r)$ function by assuming a particular parametric distribution for the underlying decision values. For example, a binormal precision-recall curve can be obtained by assuming decision values in both classes to follow a Gaussian distribution.[13]

Precision at K

For modern (Web-scale) information retrieval, recall is no longer a meaningful metric, as many queries have thousands of relevant documents, and few users will be interested in reading all of them. Precision at k documents (P@k) is still a useful metric (e.g., P@10 or "Precision at 10" corresponds to the number of relevant results on the first search results page), but fails to take into account the positions of the relevant documents among the top k.[citation needed] Another shortcoming is that on a query with fewer relevant results than k, even a perfect system will have a score less than 1.[14] It is easier to score manually since only the top k results need to be examined to determine if they are relevant or not.
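A small Python sketch of P@k over a ranked list with binary relevance labels (made-up data):

    def precision_at_k(relevance, k):
        return sum(relevance[:k]) / k

    ranked_relevance = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0]
    print(precision_at_k(ranked_relevance, 10))  # P@10 = 0.4
    print(precision_at_k(ranked_relevance, 5))   # P@5  = 0.6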

R-Precision

R-precision requires knowing all documents that are relevant to a query. The number of relevant documents, $R$, is used as the cutoff for calculation, and this varies from query to query. For example, if there are 15 documents relevant to "red" in a corpus (R = 15), R-precision for "red" looks at the top 15 documents returned, counts the number that are relevant, $r$, and turns that into a relevancy fraction: $r/R = r/15$.[15]

Precision is equal to recall at the R-th position.[14]

Empirically, this measure is often highly correlated to mean average precision.[14]
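Continuing the "red" example above in a small Python sketch (the relevance labels for the top 15 results are made up):

    def r_precision(ranked_relevance, R):
        # Precision over exactly the top R results, where R is the number of relevant documents.
        return sum(ranked_relevance[:R]) / R

    labels = [1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 0, 1, 0, 1]  # top 15 results
    print(r_precision(labels, 15))  # r / R = 9 / 15 = 0.6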

Mean average precision

Mean average precision for a set of queries is the mean of the average precision scores for each query.

$$\operatorname{MAP} = \frac{\sum_{q=1}^{Q} \operatorname{AveP}(q)}{Q}$$

where Q is the number of queries.
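A small Python sketch of MAP over a set of queries, reusing per-query average precision values (made-up numbers):

    def mean_average_precision(ap_scores):
        return sum(ap_scores) / len(ap_scores)

    ap_per_query = [0.81, 0.45, 0.62]  # AveP(q) for Q = 3 queries
    print(mean_average_precision(ap_per_query))  # about 0.627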

Discounted cumulative gain

DCG uses a graded relevance scale of documents from the result set to evaluate the usefulness, or gain, of a document based on its position in the result list. The premise of DCG is that highly relevant documents appearing lower in a search result list should be penalized as the graded relevance value is reduced logarithmically proportional to the position of the result.

The DCG accumulated at a particular rank position $p$ is defined as:

$$\mathrm{DCG_p} = rel_1 + \sum_{i=2}^{p} \frac{rel_i}{\log_2 i}.$$

Since result sets may vary in size among different queries or systems, to compare performances the normalised version of DCG uses an ideal DCG. To this end, it sorts documents of a result list by relevance, producing an ideal DCG at position $p$ ($\mathrm{IDCG_p}$), which normalizes the score:

$$\mathrm{nDCG_p} = \frac{\mathrm{DCG_p}}{\mathrm{IDCG_p}}.$$

The nDCG values for all queries can be averaged to obtain a measure of the average performance of a ranking algorithm. Note that in a perfect ranking algorithm, $\mathrm{DCG_p}$ will be the same as $\mathrm{IDCG_p}$, producing an nDCG of 1.0. All nDCG calculations are then relative values on the interval 0.0 to 1.0 and so are cross-query comparable.
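A small Python sketch of DCG and nDCG as defined above, using made-up graded relevance scores for a ranked list:

    import math

    def dcg(rels):
        # rel_1 plus the log-discounted contributions of ranks 2..p.
        return rels[0] + sum(r / math.log2(i) for i, r in enumerate(rels[1:], start=2))

    def ndcg(rels):
        ideal = sorted(rels, reverse=True)  # best possible ordering gives IDCG
        return dcg(rels) / dcg(ideal)

    ranked_relevance = [3, 2, 3, 0, 1]
    print(round(dcg(ranked_relevance), 3))
    print(round(ndcg(ranked_relevance), 3))  # 1.0 only for a perfect ranking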

Other measures

Terminology and derivations from a confusion matrix
true positive (TP): eqv. with hit
true negative (TN): eqv. with correct rejection
false positive (FP): eqv. with false alarm, Type I error
false negative (FN): eqv. with miss, Type II error

sensitivity or true positive rate (TPR), eqv. with hit rate, recall:
$$\mathit{TPR} = \frac{\mathit{TP}}{P} = \frac{\mathit{TP}}{\mathit{TP} + \mathit{FN}}$$

specificity (SPC) or true negative rate (TNR):
$$\mathit{SPC} = \frac{\mathit{TN}}{N} = \frac{\mathit{TN}}{\mathit{FP} + \mathit{TN}}$$

precision or positive predictive value (PPV):
$$\mathit{PPV} = \frac{\mathit{TP}}{\mathit{TP} + \mathit{FP}}$$

recall:
$$\mathit{recall} = \frac{\mathit{TP}}{\mathit{TP} + \mathit{FN}}$$

negative predictive value (NPV):
$$\mathit{NPV} = \frac{\mathit{TN}}{\mathit{TN} + \mathit{FN}}$$

fall-out or false positive rate (FPR):
$$\mathit{FPR} = \frac{\mathit{FP}}{N} = \frac{\mathit{FP}}{\mathit{FP} + \mathit{TN}} = 1 - \mathit{SPC}$$

false discovery rate (FDR):
$$\mathit{FDR} = \frac{\mathit{FP}}{\mathit{FP} + \mathit{TP}} = 1 - \mathit{PPV}$$

miss rate or false negative rate (FNR):
$$\mathit{FNR} = \frac{\mathit{FN}}{P} = \frac{\mathit{FN}}{\mathit{FN} + \mathit{TP}} = 1 - \mathit{TPR}$$

accuracy (ACC):
$$\mathit{ACC} = \frac{\mathit{TP} + \mathit{TN}}{P + N}$$

F1 score, the harmonic mean of precision and sensitivity:
$$F_1 = \frac{2\,\mathit{TP}}{2\,\mathit{TP} + \mathit{FP} + \mathit{FN}}$$

Matthews correlation coefficient (MCC):
$$\mathit{MCC} = \frac{\mathit{TP} \times \mathit{TN} - \mathit{FP} \times \mathit{FN}}{\sqrt{(\mathit{TP}+\mathit{FP})(\mathit{TP}+\mathit{FN})(\mathit{TN}+\mathit{FP})(\mathit{TN}+\mathit{FN})}}$$

Informedness = Sensitivity + Specificity - 1

Markedness = Precision + NPV - 1

Sources: Fawcett (2006), Powers (2011), and Ting (2011) [16][17][18]
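The derived measures above can be computed directly from raw confusion-matrix counts; a small Python sketch with made-up counts:

    TP, FP, FN, TN = 40, 10, 20, 30
    P, N = TP + FN, FP + TN

    tpr = TP / P              # sensitivity / recall / hit rate
    spc = TN / N              # specificity
    ppv = TP / (TP + FP)      # precision
    npv = TN / (TN + FN)
    fpr = FP / N              # fall-out, equals 1 - spc
    fdr = FP / (FP + TP)      # equals 1 - ppv
    fnr = FN / P              # miss rate, equals 1 - tpr
    acc = (TP + TN) / (P + N)
    f1 = 2 * TP / (2 * TP + FP + FN)
    mcc = (TP * TN - FP * FN) / ((TP + FP) * (TP + FN) * (TN + FP) * (TN + FN)) ** 0.5
    informedness = tpr + spc - 1
    markedness = ppv + npv - 1
    print(tpr, ppv, f1, acc, round(mcc, 3))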

Visualization

Visualizations of information retrieval performance include:

Timeline

  • Before the 1900s
    1801: Joseph Marie Jacquard invents the Jacquard loom, the first machine to use punched cards to control a sequence of operations.
    1880s: Herman Hollerith invents an electro-mechanical data tabulator using punch cards as a machine readable medium.
    1890: Hollerith cards, keypunches and tabulators used to process the 1890 US Census data.
  • 1920s-1930s
    Emanuel Goldberg submits patents for his "Statistical Machine", a document search engine that used photoelectric cells and pattern recognition to search the metadata on rolls of microfilmed documents.
  • 1940s–1950s
    late 1940s: The US military confronted problems of indexing and retrieval of wartime scientific research documents captured from Germans.
    1945: Vannevar Bush's As We May Think appeared in Atlantic Monthly.
    1947: Hans Peter Luhn (research engineer at IBM since 1941) began work on a mechanized punch card-based system for searching chemical compounds.
    1950s: Growing concern in the US for a "science gap" with the
    USSR motivated, encouraged funding and provided a backdrop for
    mechanized literature searching systems (Allen Kent et al.) and the invention of citation indexing (Eugene Garfield).
    1950: The term "information retrieval" was coined by Calvin Mooers.[19]
    1951: Philip Bagley conducted the earliest experiment in computerized document retrieval in a master thesis at MIT.[20]
    1955: Allen Kent joined Case Western Reserve University,
    and eventually became associate director of the Center for
    Documentation and Communications Research. That same year, Kent and
    colleagues published a paper in American Documentation describing the
    precision and recall measures as well as detailing a proposed
    "framework" for evaluating an IR system which included statistical
    sampling methods for determining the number of relevant documents not
    retrieved.[21]
    1958: International Conference on Scientific Information
    Washington DC included consideration of IR systems as a solution to
    problems identified. See: Proceedings of the International Conference on Scientific Information, 1958 (National Academy of Sciences, Washington, DC, 1959)
    1959: Hans Peter Luhn published "Auto-encoding of documents for information retrieval."
  • 1960s:
    early 1960s: Gerard Salton began work on IR at Harvard, later moved to Cornell.
    1960: Melvin Earl Maron and John Lary Kuhns[22] published "On relevance, probabilistic indexing, and information retrieval" in the Journal of the ACM 7(3):216–244, July 1960.
    1962:
    • Cyril W. Cleverdon
      published early findings of the Cranfield studies, developing a model
      for IR system evaluation. See: Cyril W. Cleverdon, "Report on the
      Testing and Analysis of an Investigation into the Comparative Efficiency
      of Indexing Systems". Cranfield Collection of Aeronautics, Cranfield,
      England, 1962.
    • Kent published Information Analysis and Retrieval.
    1963:
    • Weinberg report "Science, Government and Information" gave a full
      articulation of the idea of a "crisis of scientific information." The
      report was named after Dr. Alvin Weinberg.
    • Joseph Becker and Robert M. Hayes published text on information retrieval. Becker, Joseph; Hayes, Robert Mayo. Information storage and retrieval: tools, elements, theories. New York, Wiley (1963).
    1964:
    • Karen Spärck Jones finished her thesis at Cambridge, Synonymy and Semantic Classification, and continued work on computational linguistics as it applies to IR.
    • The National Bureau of Standards
      sponsored a symposium titled "Statistical Association Methods for
      Mechanized Documentation." Several highly significant papers, including
      G. Salton's first published reference (we believe) to the SMART system.
    mid-1960s:
    • National Library of Medicine developed MEDLARS Medical Literature Analysis and Retrieval System, the first major machine-readable database and batch-retrieval system.
    • Project Intrex at MIT.
    1965: J. C. R. Licklider published Libraries of the Future.
    1966: Don Swanson was involved in studies at University of Chicago on Requirements for Future Catalogs.
    late 1960s: F. Wilfrid Lancaster completed evaluation studies of the MEDLARS system and published the first edition of his text on information retrieval.
    1968:
    • Gerard Salton published Automatic Information Organization and Retrieval.
    • John W. Sammon, Jr.'s RADC Tech report "Some Mathematics of Information Storage and Retrieval..." outlined the vector model.
    1969: Sammon's "A nonlinear mapping for data structure
    analysis" (IEEE Transactions on Computers) was the first proposal for
    visualization interface to an IR system.
  • 1970s
    early 1970s:
    • First online systems—NLM's AIM-TWX, MEDLINE; Lockheed's Dialog; SDC's ORBIT.
    • Theodor Nelson promoting concept of hypertext, published Computer Lib/Dream Machines.
    1971: Nicholas Jardine and Cornelis J. van Rijsbergen published "The use of hierarchic clustering in information retrieval", which articulated the "cluster hypothesis."[23]
    1975: Three highly influential publications by Salton fully
    articulated his vector processing framework and term discrimination
    model:
    • A Theory of Indexing (Society for Industrial and Applied Mathematics)
    • A Theory of Term Importance in Automatic Text Analysis (JASIS v. 26)
    • A Vector Space Model for Automatic Indexing (CACM 18:11)
    1978: The First ACM SIGIR conference.
    1979: C. J. van Rijsbergen published Information Retrieval (Butterworths). Heavy emphasis on probabilistic models.
    1979: Tamas Doszkocs implemented the CITE natural language
    user interface for MEDLINE at the National Library of Medicine. The CITE
    system supported free form query input, ranked output and relevance
    feedback.[24]
  • 1980s
    1980: First international ACM SIGIR conference, joint with British Computer Society IR group in Cambridge.
    1982: Nicholas J. Belkin,
    Robert N. Oddy, and Helen M. *s proposed the ASK (Anomalous State
    of Knowledge) viewpoint for information retrieval. This was an important
    concept, though their automated analysis tool proved ultimately
    disappointing.
    1983: Salton (and Michael J. McGill) published Introduction to Modern Information Retrieval (McGraw-Hill), with heavy emphasis on vector models.
    1985: David Blair and Bill Maron publish: An Evaluation of Retrieval Effectiveness for a Full-Text Document-Retrieval System
    mid-1980s: Efforts to develop end-user versions of commercial IR systems.
    1985–1993: Key papers on and experimental systems for visualization interfaces.
    Work by Donald B. Crouch, Robert R. Korfhage, Matthew Chalmers, Anselm Spoerri and others.
    1989: First World Wide Web proposals by Tim Berners-Lee at CERN.
  • 1990s
    1992: First TREC conference.
    1997: Publication of Korfhage's Information Storage and Retrieval[25] with emphasis on visualization and multi-reference point systems.
    late 1990s: Web search engines implement many features formerly found only in experimental IR systems. Search engines become the most common and perhaps the best instantiation of IR models.

Awards in the field

Leading IR Research Groups

    • Center for Intelligent Information Retrieval (CIIR) at the University of Massachusetts Amherst [26]
    • Information Retrieval Group at the University of Glasgow [27]
    • Information and Language Processing Systems (ILPS) at the University of Amsterdam [28]
    • Language Technologies Institute (LTI) at Carnegie Mellon University
    • Text Information Management and Analysis Group (TIMAN) at the University of Illinois at Urbana-Champaign