Binary Independence Model

Article snapshot taken from Wikipedia, available under the Creative Commons Attribution-ShareAlike license.
The Binary Independence Model (BIM) in computing and information science is a probabilistic information retrieval technique. The model makes some simple assumptions to make the estimation of document/query similarity probabilities feasible.

Definitions

The Binary Independence Assumption is that documents are binary vectors. That is, only the presence or absence of terms in documents is recorded. Terms are independently distributed in the set of relevant documents and they are also independently distributed in the set of irrelevant documents.

The representation is an ordered set of Boolean variables. That is, the representation of a document or query is a vector with one Boolean element for each term under consideration. More specifically, a document is represented by a vector d = (x_1, ..., x_m) where x_t = 1 if term t is present in the document d and x_t = 0 if it is not. Many documents can have the same vector representation with this simplification. Queries are represented in a similar way.

"Independence" signifies that terms in the document are considered independently from each other and no association between terms is modeled. This assumption is very limiting, but it has been shown that it gives good enough results for many situations. This independence is the "naive" assumption of a Naive Bayes classifier, where properties that imply each other are nonetheless treated as independent for the sake of simplicity. This assumption allows the representation to be treated as an instance of a vector space model by considering each term as a value of 0 or 1 along a dimension orthogonal to the dimensions used for the other terms.
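
As an illustration of this representation, the following minimal Python sketch builds binary term-incidence vectors for a hypothetical two-document collection (the documents, query, and helper names are invented for the example, not taken from the article's sources):

    # Minimal sketch: binary term-incidence vectors over a tiny, made-up collection.
    documents = {
        "d1": "the quick brown fox",
        "d2": "the lazy dog",
    }
    query = "quick dog"

    # The vocabulary fixes the order of the terms under consideration.
    vocabulary = sorted({term for text in documents.values() for term in text.split()})

    def to_binary_vector(text, vocabulary):
        """Return (x_1, ..., x_m) with x_t = 1 iff term t occurs in the text."""
        terms = set(text.split())
        return [1 if term in terms else 0 for term in vocabulary]

    doc_vectors = {name: to_binary_vector(text, vocabulary) for name, text in documents.items()}
    query_vector = to_binary_vector(query, vocabulary)

    print(vocabulary)          # ['brown', 'dog', 'fox', 'lazy', 'quick', 'the']
    print(doc_vectors["d1"])   # [1, 0, 1, 0, 1, 1]
    print(query_vector)        # [0, 1, 0, 0, 1, 0]

Any document containing exactly the same set of terms maps to the same vector, which is the simplification mentioned above.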

The probability P(R|d,q) that a document is relevant derives from the probability of relevance of the term vector of that document, P(R|x,q). Applying Bayes' rule gives:

P(R|x,q) = P(x|R,q) · P(R|q) / P(x|q)

where P(x|R=1,q) and P(x|R=0,q) are the probabilities that a retrieved relevant or nonrelevant document, respectively, has the representation x. The exact probabilities cannot be known beforehand, so estimates from statistics about the collection of documents must be used.

P(R=1|q) and P(R=0|q) indicate the prior probability of retrieving a relevant or nonrelevant document, respectively, for a query q. If, for instance, we knew the percentage of relevant documents in the collection, then we could use it to estimate these probabilities. Since a document is either relevant or nonrelevant to a query, we have:

P(R=1|x,q) + P(R=0|x,q) = 1
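
To make the estimation step concrete, here is a small Python sketch of the Bayes computation above; the collection statistics are made-up stand-ins for the estimates, not values from the article's sources:

    # Hypothetical priors: suppose 1% of the documents in the collection
    # are relevant to the query q.
    p_rel = 0.01          # P(R=1|q)
    p_nonrel = 1 - p_rel  # P(R=0|q); the two priors sum to 1

    # Suppose the representation x occurs in 30% of relevant documents
    # and 2% of nonrelevant ones (in practice, estimated from the collection).
    p_x_given_rel = 0.30     # P(x|R=1,q)
    p_x_given_nonrel = 0.02  # P(x|R=0,q)

    # Total probability of seeing the representation x at all: P(x|q)
    p_x = p_x_given_rel * p_rel + p_x_given_nonrel * p_nonrel

    # Bayes' rule: P(R=1|x,q) = P(x|R=1,q) * P(R=1|q) / P(x|q)
    p_rel_given_x = p_x_given_rel * p_rel / p_x
    print(round(p_rel_given_x, 3))  # 0.132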

Query term weighting

Given a binary query and the dot product as the similarity function between a document and a query, the problem is to assign weights to the terms in the query such that the retrieval effectiveness will be high. Let p_i and q_i be the probabilities that a relevant document and an irrelevant document, respectively, contain the i-th term. Yu and Salton, who first introduced BIM, proposed that the weight of the i-th term is an increasing function of Y_i = p_i(1 − q_i) / ((1 − p_i)q_i). Thus, if Y_i is higher than Y_j, the weight of term i will be higher than that of term j. Yu and Salton showed that such a weight assignment to query terms yields better retrieval effectiveness than if query terms are equally weighted. Robertson and Spärck Jones later showed that if the i-th term is assigned the weight log Y_i, then optimal retrieval effectiveness is obtained under the Binary Independence Assumption.
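
A minimal Python sketch of this weighting scheme, assuming hypothetical estimates of p_i and q_i for a three-term query (the numbers are invented for illustration):

    import math

    # Hypothetical per-term estimates for a three-term query:
    # p[i]: probability that a relevant document contains term i,
    # q[i]: probability that an irrelevant document contains term i.
    p = [0.8, 0.5, 0.3]
    q = [0.2, 0.4, 0.3]

    def log_odds_weight(p_i, q_i):
        """Weight log Y_i = log(p_i(1 - q_i) / ((1 - p_i)q_i))."""
        return math.log(p_i * (1 - q_i) / ((1 - p_i) * q_i))

    weights = [log_odds_weight(p_i, q_i) for p_i, q_i in zip(p, q)]

    # Retrieval score: dot product of the query-term weights with the
    # document's binary vector; only terms the document contains contribute.
    def score(doc_vector, weights):
        return sum(w * x for w, x in zip(weights, doc_vector))

    print([round(w, 2) for w in weights])       # [2.77, 0.41, 0.0]
    print(round(score([1, 1, 0], weights), 2))  # 3.18

A term that is equally likely in relevant and irrelevant documents (here the third term, with p_i = q_i) gets Y_i = 1 and therefore weight log Y_i = 0, so it does not influence the ranking.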

The Binary Independence Model was introduced by Yu and Salton. The name Binary Independence Model was coined by Robertson and Spärck Jones, who used the log-odds probability of the probabilistic relevance model to derive log Y_i; Luk later showed that this log-odds probability is rank-equivalent to the probability of relevance (i.e., P(R|d,q)), obeying the probability ranking principle.

Further reading

Manning, Christopher D.; Raghavan, Prabhakar; Schütze, Hinrich (2008). Introduction to Information Retrieval. Cambridge University Press. http://nlp.stanford.edu/IR-book/html/htmledition/irbook.html
Büttcher, Stefan; Clarke, Charles L. A.; Cormack, Gordon V. (2010). Information Retrieval: Implementing and Evaluating Search Engines. MIT Press. https://plg.uwaterloo.ca/~ir/ir/book/

References

1. Yu, C. T.; Salton, G. (1976). "Precision Weighting – An Effective Automatic Indexing Method" (PDF). Journal of the ACM. 23: 76–88. doi:10.1145/321921.321930. hdl:1813/7313.
2. Robertson, S. E.; Spärck Jones, K. (1976). "Relevance weighting of search terms". Journal of the American Society for Information Science. 27 (3): 129. doi:10.1002/asi.4630270302.
3. Luk, R. W. P. (2022). "Why is information retrieval a scientific discipline?". Foundations of Science. 27 (2): 427–453. doi:10.1007/s10699-020-09685-x. hdl:10397/94873.
4. Robertson, S. E. (1977). "The Probability Ranking Principle in IR". Journal of Documentation. 33 (4): 294–304. doi:10.1108/eb026647.