This is an old revision of this page, as edited by Gaspanic99 (talk | contribs) at 00:57, 5 January 2019 (Almost all the content of this article is a very close parapharse of a copyrighted book.). The present address (URL) is a permanent link to this revision, which may differ significantly from the current revision.
The Binary Independence Model (BIM) is a probabilistic information retrieval technique. It makes a few simplifying assumptions so that estimating the probability of document/query similarity becomes feasible.
Definitions
The Binary Independence Assumption is that documents are binary vectors. That is, only the presence or absence of each term in a document is recorded. Terms are independently distributed in the set of relevant documents, and they are also independently distributed in the set of irrelevant documents.
The representation is an ordered set of Boolean variables: the representation of a document or query is a vector with one Boolean element for each term under consideration. More specifically, a document is represented by a vector $d = (x_1, \ldots, x_m)$, where $x_t = 1$ if term $t$ is present in the document $d$ and $x_t = 0$ if it is not. With this simplification, many documents can have the same vector representation. Queries are represented in a similar way.
"Independence" signifies that terms in the document are considered independently of each other and that no association between terms is modeled. This assumption is very limiting, but it has been shown to give good enough results in many situations. This independence is the "naive" assumption of a Naive Bayes classifier, where properties that imply each other are nonetheless treated as independent for the sake of simplicity. This assumption allows the representation to be treated as an instance of a vector space model, by considering each term as a value of 0 or 1 along a dimension orthogonal to the dimensions used for the other terms.
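The binary representation described above can be illustrated with a minimal Python sketch. The vocabulary, the example documents, the whitespace tokenization, and the helper name `to_binary_vector` are all assumptions made for illustration; they are not part of the model's definition.

```python
def to_binary_vector(text, vocabulary):
    """Return a 0/1 tuple: 1 if the vocabulary term occurs in the text, else 0."""
    # Naive whitespace tokenization (an assumption; any tokenizer would do).
    tokens = set(text.lower().split())
    return tuple(1 if term in tokens else 0 for term in vocabulary)


# Hypothetical vocabulary and documents, chosen only for illustration.
vocabulary = ["binary", "independence", "model", "retrieval"]
doc_a = "binary model binary model model"
doc_b = "model binary"
doc_c = "retrieval model"

vec_a = to_binary_vector(doc_a, vocabulary)
vec_b = to_binary_vector(doc_b, vocabulary)
vec_c = to_binary_vector(doc_c, vocabulary)

print(vec_a)  # (1, 0, 1, 0)
print(vec_b)  # (1, 0, 1, 0)
print(vec_c)  # (0, 0, 1, 1)
```

Term frequency and word order are discarded, so `doc_a` and `doc_b` collapse to the same vector even though their texts differ.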
The probability $P(R \mid d, q)$ that a document is relevant derives from the probability of relevance $P(R \mid x, q)$ of the term vector $x$ of that document. By Bayes' rule:
$$P(R \mid x, q) = \frac{P(x \mid R, q)\, P(R \mid q)}{P(x \mid q)}$$
where $P(x \mid R=1, q)$ and $P(x \mid R=0, q)$ are the probabilities that, if a relevant or nonrelevant document (respectively) is retrieved, that document's representation is $x$. The exact probabilities cannot be known beforehand, so estimates from statistics about the collection of documents must be used.
$P(R=1 \mid q)$ and $P(R=0 \mid q)$ are the prior probabilities of retrieving a relevant or nonrelevant document, respectively, for a query $q$. If, for instance, we knew the percentage of relevant documents in the collection, then we could use it to estimate these probabilities. Since a document is either relevant or nonrelevant to a query, we have:
$$P(R=1 \mid x, q) + P(R=0 \mid x, q) = 1$$
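To make the Bayes' rule step concrete, here is a small Python sketch that computes the posterior probability of relevance for one binary document vector. It assumes term independence within the relevant and the nonrelevant sets, so each class-conditional probability factors into per-term Bernoulli terms, and it expands the denominator over the two classes using the law of total probability. The per-term probabilities, the prior, and the function names are hypothetical values and helpers chosen for illustration.

```python
from math import prod


def likelihood(x, term_probs):
    """Probability of binary vector x given per-term presence probabilities,
    assuming the terms are independent."""
    return prod(p if xt == 1 else 1.0 - p for xt, p in zip(x, term_probs))


def posterior_relevant(x, p_rel, p_nonrel, prior_rel):
    """Posterior probability that a document with vector x is relevant."""
    joint_rel = likelihood(x, p_rel) * prior_rel
    joint_nonrel = likelihood(x, p_nonrel) * (1.0 - prior_rel)
    # Denominator: total probability of observing x over both classes.
    return joint_rel / (joint_rel + joint_nonrel)


# Hypothetical estimates (assumptions, not taken from any real collection).
p_rel = [0.8, 0.5, 0.3]     # chance each term occurs in a relevant document
p_nonrel = [0.2, 0.5, 0.4]  # chance each term occurs in a nonrelevant document
prior_rel = 0.1             # assumed fraction of relevant documents

x = (1, 0, 1)
post = posterior_relevant(x, p_rel, p_nonrel, prior_rel)
print(round(post, 4))        # 0.25
print(round(1.0 - post, 4))  # 0.75, the posterior probability of nonrelevance
```

Because a document is either relevant or nonrelevant, the two posteriors sum to 1, which is why the nonrelevant posterior can be obtained as the complement.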
Query Terms Weighting
Given a binary query and the dot product as the similarity function between a document and a query, the problem is to assign weights to the terms in the query such that retrieval effectiveness is high. Let $p_i$ and $q_i$ be the probabilities that a relevant document and an irrelevant document, respectively, contain the $i$-th term. Yu and Salton, who first introduced BIM, proposed that the weight of the $i$-th term be an increasing function of $Y_i = \frac{p_i (1 - q_i)}{(1 - p_i)\, q_i}$. Thus, if $Y_i$ is higher than $Y_j$, the weight of term $i$ will be higher than that of term $j$. Yu and Salton showed that such a weight assignment to query terms yields better retrieval effectiveness than weighting query terms equally. Robertson and Spärck Jones later showed that if the $i$-th term is assigned the weight $\log Y_i$, then optimal retrieval effectiveness is obtained under the Binary Independence Assumption.
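The term weighting described above can be sketched in Python as follows. The sketch assumes the odds-ratio form p(1 - q) / ((1 - p) q) for each term, uses its natural logarithm as the term weight, and scores documents by the dot product with the weighted binary query. The probability estimates, document vectors, and function names are hypothetical; the estimates must lie strictly between 0 and 1 for the ratio and its logarithm to be defined.

```python
from math import log


def odds_ratio(p, q):
    """Y = p(1 - q) / ((1 - p) q) for one term, with 0 < p, q < 1."""
    return (p * (1.0 - q)) / ((1.0 - p) * q)


def term_weights(p_rel, p_nonrel):
    """Log odds-ratio weight for each term."""
    return [log(odds_ratio(p, q)) for p, q in zip(p_rel, p_nonrel)]


def score(doc_vector, query_vector, weights):
    """Dot product between a binary document and the weighted binary query."""
    return sum(w * d * t for w, d, t in zip(weights, doc_vector, query_vector))


# Hypothetical estimates (assumptions for illustration only).
p_rel = [0.8, 0.5, 0.3]     # chance each term occurs in a relevant document
p_nonrel = [0.2, 0.5, 0.4]  # chance each term occurs in an irrelevant document
weights = term_weights(p_rel, p_nonrel)

query = (1, 1, 1)
docs = {"d1": (1, 0, 0), "d2": (0, 1, 0), "d3": (0, 0, 1)}
ranking = sorted(docs, key=lambda name: score(docs[name], query, weights),
                 reverse=True)
print(ranking)  # ['d1', 'd2', 'd3']
```

In this example the first term is far more likely in relevant documents and gets a positive weight, the second is equally likely in both sets and gets weight zero, and the third is more likely in irrelevant documents and gets a negative weight.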
The Binary Independence Model was introduced by Yu and Salton. The name Binary Independence Model was coined by Robertson and Spärck Jones.
Further reading
- Christopher D. Manning; Prabhakar Raghavan; Hinrich Schütze (2008), Introduction to Information Retrieval, Cambridge University Press
- Stefan Büttcher; Charles L. A. Clarke; Gordon V. Cormack (2010), Information Retrieval: Implementing and Evaluating Search Engines, MIT Press
References
- Yu, C. T.; Salton, G. (1976). "Precision Weighting – An Effective Automatic Indexing Method". Journal of the ACM. 23: 76. doi:10.1145/321921.321930.
- Robertson, S. E.; Spärck Jones, K. (1976). "Relevance weighting of search terms". Journal of the American Society for Information Science. 27 (3): 129. doi:10.1002/asi.4630270302.