IR2012S Lecture05 Modeling II (Set, Algebra & Probabilistic)


  • 7/31/2019 IR2012S Lecture05 Modeling II(Set, Algebra & Probabilistic)

    1/57

    Modeling in Information Retrieval- Fuzzy Set, Extended Boolean,

    Generalized Vector Space,

    Set-based Models, and Best Match Models

Berlin Chen, Department of Computer Science & Information Engineering

    National Taiwan Normal University

    References:

    1. Modern Information Retrieval, Chapter 3 & Teaching material

    2. Language Modeling for Information Retrieval, Chapter 3


    Taxonomy of Classic IR Models

    Document Property: Text, Links, Multimedia

    Text
    - Semi-structured Text: Proximal Nodes, XML-based, others
    - Classic Models: Boolean, Vector, Probabilistic
    - Set Theoretic: Fuzzy, Extended Boolean, Set-based
    - Probabilistic: BM25, Language Models, Divergence from Randomness, Bayesian Networks
    - Algebraic: Generalized Vector, Latent Semantic Indexing, Neural Networks, Support Vector Machines

    Links (Web): Page Rank, Hubs & Authorities

    Multimedia (Multimedia Retrieval): Image Retrieval, Audio and Music Retrieval, Video Retrieval


    Outline

    Alternative Set Theoretic Models

    Fuzzy Set Model (Fuzzy Information Retrieval)

    Extended Boolean Model

    Set-based Model

    Alternative Algebraic Model

    Generalized Vector Space Model

    Alternative Probabilistic Models

    Best Match Models (BM1, BM15, BM11 & BM25)


    Fuzzy Set Model

    Premises

    - Docs and queries are represented through sets of keywords; therefore the matching between them is vague
    - Keywords cannot completely describe the user's information need and the doc's main theme

    For each query term (keyword):
    - Define a fuzzy set such that each doc has a degree of membership (0~1) in the set ("aboutness")

    (Figure: a retrieval model matching query keywords wi, wj, wk, ... against doc keywords ws, wp, wq, ...)


    Fuzzy Set Model (cont.)

    Fuzzy Set Theory

    A framework for representing classes (sets) whose boundaries are not well defined.

    The key idea is to introduce the notion of a degree of membership associated with the elements of a set. This degree of membership varies from 0 to 1 and allows modeling the notion of marginal membership:
    - 0: no membership
    - 1: full membership

    Thus, membership is now gradual instead of abrupt, unlike in conventional Boolean logic.

    Here we will define a fuzzy set for each query (or index) term; thus each doc has a degree of membership in this set.


    Fuzzy Set Model (cont.)

    Definition

    A fuzzy subset A of a universe of discourse U is characterized by a membership function $\mu_A: U \rightarrow [0,1]$, which associates with each element u of U a number $\mu_A(u)$ in the interval [0,1].

    Let A and B be two fuzzy subsets of U. Also, let $\bar{A}$ be the complement of A. Then:
    - Complement: $\mu_{\bar{A}}(u) = 1 - \mu_A(u)$
    - Union: $\mu_{A \cup B}(u) = \max(\mu_A(u), \mu_B(u))$
    - Intersection: $\mu_{A \cap B}(u) = \min(\mu_A(u), \mu_B(u))$

    (Figure: sets A and B inside the universe U, with an element u)
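The three standard fuzzy-set operations translate directly into code. A minimal sketch (function names are ours, not from the slides):

```python
def mu_complement(mu_a):
    # Complement: 1 - membership degree
    return 1.0 - mu_a

def mu_union(mu_a, mu_b):
    # Union: maximum of the membership degrees
    return max(mu_a, mu_b)

def mu_intersection(mu_a, mu_b):
    # Intersection: minimum of the membership degrees
    return min(mu_a, mu_b)
```

For example, an element with degrees 0.3 in A and 0.7 in B has degree 0.7 in A ∪ B and 0.3 in A ∩ B.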


    Fuzzy Set Model (cont.)

    Fuzzy information retrieval:
    - Fuzzy sets are modeled based on a thesaurus
    - This thesaurus can be constructed from a term-term correlation matrix (also called a keyword connection matrix):
      - $\vec{c}$: a term-term correlation matrix
      - $c_{i,l}$: a normalized correlation factor for terms $k_i$ and $k_l$,

        $c_{i,l} = \frac{n_{i,l}}{n_i + n_l - n_{i,l}}$

        where $n_i$ is the number of docs that contain $k_i$, $n_l$ the number of docs that contain $k_l$, and $n_{i,l}$ the number of docs that contain both $k_i$ and $k_l$ ("docs" here may be documents, paragraphs, sentences, ...; $c_{i,l}$ ranges from 0 to 1)
    - We now have the notion of proximity among index terms; the relationship is symmetric!
    - Defining the term relationship (the membership of doc $d_j$ in the fuzzy set associated with term $k_i$):

      $\mu_{i,j} = 1 - \prod_{k_l \in d_j} (1 - c_{i,l})$
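The normalized correlation factor can be computed straight from the document-term incidence. A small sketch (names are ours; documents are given as sets of terms):

```python
def term_correlation(doc_term_sets, ki, kl):
    """Normalized term-term correlation c_{i,l} = n_il / (n_i + n_l - n_il)."""
    n_i = sum(1 for terms in doc_term_sets if ki in terms)
    n_l = sum(1 for terms in doc_term_sets if kl in terms)
    n_il = sum(1 for terms in doc_term_sets if ki in terms and kl in terms)
    denom = n_i + n_l - n_il
    # If neither term occurs anywhere, define the correlation as 0
    return n_il / denom if denom else 0.0
```

Note that the formula is symmetric in the two terms, and a term is perfectly correlated with itself ($c_{i,i} = 1$).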


    Fuzzy Set Model (cont.)

    The union and intersection operations are modified here:
    - Union: algebraic sum (instead of max)
    - Intersection: algebraic product (instead of min)

    $\mu_{A_1 \cup A_2 \cup \cdots \cup A_n}(k) = 1 - \prod_{j=1}^{n} \left(1 - \mu_{A_j}(k)\right)$  (algebraic sum, computed as the complement of a negative algebraic product)

    $\mu_{A_1 \cap A_2 \cap \cdots \cap A_n}(k) = \prod_{j=1}^{n} \mu_{A_j}(k)$  (algebraic product)

    For two sets, with $a = \mu_{A_1}(k)$ and $b = \mu_{A_2}(k)$, the algebraic sum reduces to $1 - (1-a)(1-b) = a + b - ab$.
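The two modified operations are one-liners over a list of membership degrees. A minimal sketch (names are ours):

```python
from functools import reduce

def algebraic_sum(memberships):
    # Union: 1 - prod(1 - mu) over all sets
    return 1.0 - reduce(lambda acc, mu: acc * (1.0 - mu), memberships, 1.0)

def algebraic_product(memberships):
    # Intersection: prod(mu) over all sets
    return reduce(lambda acc, mu: acc * mu, memberships, 1.0)
```

For two sets this reproduces the identity $a + b - ab$: with $a = b = 0.5$, the union is 0.75 and the intersection 0.25.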


    Fuzzy Set Model (cont.)

    Example:

    Query $q = k_a \wedge (k_b \vee \neg k_c)$

    In disjunctive normal form: $q_{dnf} = (k_a \wedge k_b \wedge k_c) \vee (k_a \wedge k_b \wedge \neg k_c) \vee (k_a \wedge \neg k_b \wedge \neg k_c) = cc_1 + cc_2 + cc_3$, where each $cc_i$ is a conjunctive component.

    $D_a$ is the fuzzy set of docs associated to the term $k_a$. Degree of membership?

    (Venn diagram: $D_a$, $D_b$, $D_c$ with the regions corresponding to $cc_1$, $cc_2$, $cc_3$)


    Fuzzy Set Model (cont.)

    For a doc $d_j$ in the fuzzy answer set $D_q$, the degree of membership is the algebraic sum over the conjunctive components (each an algebraic product), via the negative algebraic product:

    $\mu_{q,j} = \mu_{cc_1 + cc_2 + cc_3,\,j} = 1 - \prod_{i=1}^{3} \left(1 - \mu_{cc_i,j}\right)$

    $= 1 - \left(1 - \mu_{a,j}\,\mu_{b,j}\,\mu_{c,j}\right)\cdot\left(1 - \mu_{a,j}\,\mu_{b,j}\,(1-\mu_{c,j})\right)\cdot\left(1 - \mu_{a,j}\,(1-\mu_{b,j})(1-\mu_{c,j})\right)$

    (Venn diagram: $D_a$, $D_b$, $D_c$ with regions $cc_1$, $cc_2$, $cc_3$)
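The membership computation for this query is a direct transcription of the formula. A sketch (names are ours) for $q = k_a \wedge (k_b \vee \neg k_c)$ via its three conjunctive components:

```python
def query_membership(mu_a, mu_b, mu_c):
    """mu_{q,j} for q = ka AND (kb OR NOT kc), given the doc's memberships."""
    cc1 = mu_a * mu_b * mu_c                  # (ka, kb, kc)
    cc2 = mu_a * mu_b * (1 - mu_c)            # (ka, kb, not kc)
    cc3 = mu_a * (1 - mu_b) * (1 - mu_c)      # (ka, not kb, not kc)
    # Algebraic sum over the conjunctive components
    return 1 - (1 - cc1) * (1 - cc2) * (1 - cc3)
```

A doc fully about $k_a$ and $k_b$ but not $k_c$ gets membership 1; a doc with no membership in $D_a$ gets 0, as the Boolean reading of the query demands.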


    More on Fuzzy Set Model

    Advantages

    The correlations among index terms are considered

    Degree of relevance between queries and docs can

    be achieved

    Disadvantages

    Fuzzy IR models have been discussed mainly in the

    literature associated with fuzzy theory

    Experiments with standard test collections are notavailable

    Do not consider the frequency (or counts) of a term in a document or a query


    Extended Boolean Model

    Motive
    - Extend the Boolean model with the functionality of partial matching and term weighting
    - E.g.: in the Boolean model, for the query $q = k_x \wedge k_y$, a doc containing either $k_x$ or $k_y$ is as irrelevant as another doc which contains neither of them. How about the disjunctive query $q = k_x \vee k_y$?
    - Combine Boolean query formulations with characteristics of the vector model (so that a ranking can be obtained):
      - Term weighting
      - Algebraic distances for similarity measures

    (Salton et al., 1983)


    Extended Boolean Model (cont.)

    Term weighting

    The weight for the term $k_x$ in a doc $d_j$ is

    $w_{x,j} = \frac{f_{x,j}}{\max_x f_{x,j}} \cdot \frac{idf_x}{\max_i idf_i}$

    (normalized frequency times normalized idf), so $w_{x,j}$ is normalized to lie between 0 and 1.

    Assume two index terms $k_x$ and $k_y$ were used:
    - Let $x = w_{x,j}$ denote the weight of term $k_x$ on doc $d_j$
    - Let $y = w_{y,j}$ denote the weight of term $k_y$ on doc $d_j$
    - The doc vector is represented as $\vec{d_j} = (w_{x,j}, w_{y,j})$

    Queries and docs can then be plotted in a two-dimensional map.


    Extended Boolean Model (cont.)

    If the query is $q = k_x \wedge k_y$ (conjunctive query):
    - The docs near the point (1,1) are preferred
    - The similarity measure is defined as

      $sim(q_{and}, d_j) = 1 - \sqrt{\frac{(1-x)^2 + (1-y)^2}{2}}$

      where $x = w_{x,j}$ and $y = w_{y,j}$ (2-norm model, Euclidean distance)

    (Figure: the unit square from (0,0) to (1,1) with axes $k_x$ and $k_y$; docs $d_j$, $d_{j+1}$ lie on arcs centered at the AND point (1,1))


    Extended Boolean Model (cont.)

    If the query is $q = k_x \vee k_y$ (disjunctive query):
    - The docs far from the point (0,0) are preferred
    - The similarity measure is defined as

      $sim(q_{or}, d_j) = \sqrt{\frac{x^2 + y^2}{2}}$

      where $x = w_{x,j}$ and $y = w_{y,j}$ (2-norm model, Euclidean distance)

    (Figure: the unit square with axes $k_x$ and $k_y$; docs lie on arcs centered at the OR point (0,0))
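The two 2-norm measures can be checked numerically. A sketch (names are ours):

```python
import math

def sim_and(x, y):
    # Distance from the AND point (1,1), normalized and inverted
    return 1 - math.sqrt(((1 - x) ** 2 + (1 - y) ** 2) / 2)

def sim_or(x, y):
    # Distance from the OR point (0,0), normalized
    return math.sqrt((x ** 2 + y ** 2) / 2)
```

A doc at (1,1) scores 1 under both queries; a doc at (0,0) scores 0 under both; a doc containing exactly one query term scores $\sqrt{1/2} \approx 0.707$ under OR but only $1 - \sqrt{1/2} \approx 0.293$ under AND.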


    Extended Boolean Model (cont.)

    The similarity measures $sim(q_{or}, d_j)$ and $sim(q_{and}, d_j)$ also lie between 0 and 1.


    Extended Boolean Model (cont.)

    Generalization
    - $t$ index terms are used → a $t$-dimensional space
    - p-norm model, $1 \le p \le \infty$:

      $q_{and} = k_1 \wedge^p k_2 \wedge^p \cdots \wedge^p k_m$
      $q_{or} = k_1 \vee^p k_2 \vee^p \cdots \vee^p k_m$

      $sim(q_{and}, d_j) = 1 - \left(\frac{(1-x_1)^p + (1-x_2)^p + \cdots + (1-x_m)^p}{m}\right)^{1/p}$

      $sim(q_{or}, d_j) = \left(\frac{x_1^p + x_2^p + \cdots + x_m^p}{m}\right)^{1/p}$

    Some interesting properties:
    - $p = 1$: $sim(q_{and}, d_j) = sim(q_{or}, d_j) = \frac{x_1 + x_2 + \cdots + x_m}{m}$  (similar to the vector space model)
    - $p = \infty$: $sim(q_{and}, d_j) = \min_i(x_i)$ and $sim(q_{or}, d_j) = \max_i(x_i)$  (just like the formulas of fuzzy logic)
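The p-norm formulas and their limiting behavior can be verified directly. A sketch for finite p (names are ours; the $p = \infty$ case is approached as p grows):

```python
def sim_and_p(xs, p):
    # p-norm conjunctive similarity over term weights xs
    m = len(xs)
    return 1 - (sum((1 - x) ** p for x in xs) / m) ** (1 / p)

def sim_or_p(xs, p):
    # p-norm disjunctive similarity over term weights xs
    m = len(xs)
    return (sum(x ** p for x in xs) / m) ** (1 / p)
```

With p = 1 both collapse to the arithmetic mean of the weights; with a large p (say 100) they approach min and max respectively, the fuzzy-logic behavior.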


    Extended Boolean Model (cont.)

    Example query 1: $q = (k_1 \vee^p k_2) \wedge^p k_3$

    Processed by grouping the operators in a predefined order:

    $sim(q, d_j) = 1 - \left(\frac{\left(1 - \left(\frac{x_1^p + x_2^p}{2}\right)^{1/p}\right)^p + (1 - x_3)^p}{2}\right)^{1/p}$

    Example query 2: $q = k_1 \wedge^2 (k_2 \wedge^\infty k_3)$

    Combination of different algebraic distances: the inner infinity-norm conjunction reduces to $\min(x_2, x_3)$, which is then combined with $x_1$ by the 2-norm:

    $sim(q, d_j) = 1 - \left(\frac{(1-x_1)^2 + \left(1 - \min(x_2, x_3)\right)^2}{2}\right)^{1/2}$


    More on Extended Boolean Model

    Advantages

    A hybrid model including properties of both the set

    theoretic models and the algebraic models

    That is, relax the Boolean algebra by interpreting

    Boolean operations in terms of algebraic distances

    By varying the parameter p between 1 and infinity, we can vary the p-norm ranking behavior from that of a vector-like ranking to that of a fuzzy logic-like ranking.

    We also have the possibility of using combinations of different values of the parameter p in the same query request.


    More on Extended Boolean Model (cont.)

    Disadvantages
    - The distributive operation does not hold for ranking computation. E.g., for

      $q_1 = (k_1 \vee^2 k_2) \wedge^2 k_3$
      $q_2 = (k_1 \wedge^2 k_3) \vee^2 (k_2 \wedge^2 k_3)$

      we have $sim(q_1, d_j) \ne sim(q_2, d_j)$ in general:

      $sim(q_1, d_j) = 1 - \left(\frac{\left(1 - \left(\frac{x_1^2 + x_2^2}{2}\right)^{1/2}\right)^2 + (1 - x_3)^2}{2}\right)^{1/2}$

      $sim(q_2, d_j) = \left(\frac{\left(1 - \left(\frac{(1-x_1)^2 + (1-x_3)^2}{2}\right)^{1/2}\right)^2 + \left(1 - \left(\frac{(1-x_2)^2 + (1-x_3)^2}{2}\right)^{1/2}\right)^2}{2}\right)^{1/2}$

    - Assumes mutual independence of index terms
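The failure of distributivity is easy to exhibit numerically. A sketch (names and the particular weights are ours) that evaluates both query forms on the same doc:

```python
import math

def sim_and2(x, y):
    # 2-norm conjunction
    return 1 - math.sqrt(((1 - x) ** 2 + (1 - y) ** 2) / 2)

def sim_or2(x, y):
    # 2-norm disjunction
    return math.sqrt((x ** 2 + y ** 2) / 2)

# A doc with weights x1=1, x2=0, x3=0.5 for terms k1, k2, k3
x1, x2, x3 = 1.0, 0.0, 0.5
q1 = sim_and2(sim_or2(x1, x2), x3)                 # (k1 OR k2) AND k3
q2 = sim_or2(sim_and2(x1, x3), sim_and2(x2, x3))   # (k1 AND k3) OR (k2 AND k3)
```

For these weights the two algebraically "equivalent" Boolean forms score about 0.59 and 0.48, so the ranking depends on how the query is written.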


    Generalized Vector Model

    Premise
    - Classic models enforce independence of index terms
    - For the vector model: the set of term vectors $\{\vec{k_1}, \vec{k_2}, ..., \vec{k_t}\}$ is linearly independent and forms a basis for the subspace of interest
    - Frequently, this is taken to mean pairwise orthogonality: $\forall i,j:\ \vec{k_i} \cdot \vec{k_j} = 0$  (in a more restrictive sense)
    - Wong et al. (1985) proposed an alternative interpretation: the index term vectors are linearly independent, but not pairwise orthogonal → the Generalized Vector Model


    Generalized Vector Model (cont.)

    Key idea: the index term vectors, which form the basis of the space, are not orthogonal and are represented in terms of smaller components (minterms).

    Notations:
    - $\{k_1, k_2, ..., k_t\}$: the set of all terms
    - $w_{i,j}$: the weight associated with $[k_i, d_j]$
    - Minterms: binary indicators (0 or 1) of all patterns of occurrence of terms within documents; each represents one kind of co-occurrence of index terms in a specific document


    Generalized Vector Model (cont.)

    Representations of minterms

    The $2^t$ minterms (binary term-occurrence patterns) and their $2^t$ associated minterm vectors:

      m1 = (0,0,...,0)      →  $\vec{m_1}$ = (1,0,0,0,0,...,0)
      m2 = (1,0,...,0)      →  $\vec{m_2}$ = (0,1,0,0,0,...,0)
      m3 = (0,1,...,0)      →  $\vec{m_3}$ = (0,0,1,0,0,...,0)
      m4 = (1,1,...,0)      →  $\vec{m_4}$ = (0,0,0,1,0,...,0)
      m5 = (0,0,1,...,0)    →  $\vec{m_5}$ = (0,0,0,0,1,...,0)
      ...
      m2^t = (1,1,1,...,1)  →  $\vec{m_{2^t}}$ = (0,0,0,0,0,...,1)

    - m4 points to the docs where only the index terms k1 and k2 co-occur and the other index terms are absent
    - m2^t points to the docs containing all the index terms
    - The pairwise orthogonal vectors $\vec{m_r}$ associated with the minterms $m_r$ serve as the basis for the generalized vector space


    Generalized Vector Model (cont.)

    Minterm vectors are pairwise orthogonal. But this does not mean that the index terms are independent:
    - Each minterm specifies a kind of dependence among index terms
    - That is, the co-occurrence of index terms inside docs in the collection induces dependencies among these index terms


    Generalized Vector Model (cont.)

    The vector associated with the term $k_i$ is represented by summing up all minterm vectors in which $k_i$ occurs and normalizing:

    $\vec{k_i} = \frac{\sum_{r,\ g_i(m_r)=1} c_{i,r}\,\vec{m_r}}{\sqrt{\sum_{r,\ g_i(m_r)=1} c_{i,r}^2}}$

    where

    $c_{i,r} = \sum_{d_j\ |\ g(\vec{d_j}) = m_r} w_{i,j}$

    - $g_i(m_r) = 1$ indicates that the index term $k_i$ is in the minterm $m_r$
    - The inner sum runs over all the docs whose term co-occurrence pattern exactly coincides with that of minterm $m_r$
    - The weight $c_{i,r}$ associated with the pair $[k_i, m_r]$ sums up the weights of the term $k_i$ in all the docs whose term occurrence pattern is given by $m_r$
    - Notice that for a collection of size N, only N minterms affect the ranking (and not $2^t$)


    Generalized Vector Model (cont.)

    The similarity between the query and the doc is calculated in the space of minterm vectors. Expressing both in that space,

    $\vec{d_j} = \sum_i w_{i,j}\,\vec{k_i} = \sum_r s_{d_j,r}\,\vec{m_r}$

    $\vec{q} = \sum_i w_{i,q}\,\vec{k_i} = \sum_r s_{q,r}\,\vec{m_r}$

    $sim(q, d_j) = \frac{\sum_r s_{q,r}\, s_{d_j,r}}{\sqrt{\sum_r s_{q,r}^2}\ \sqrt{\sum_r s_{d_j,r}^2}}$

    The original t-dimensional term space is thus mapped into the $2^t$-dimensional minterm space.


    Generalized Vector Model (cont.)

    Example (a system with three index terms)

    Doc-term weight matrix and the minterm induced by each doc:

            k1  k2  k3   minterm
      d1     2   0   1     m6
      d2     1   0   0     m2
      d3     0   1   3     m7
      d4     2   0   0     m2
      d5     1   2   4     m8
      d6     1   2   0     m4
      d7     0   5   0     m3
      q      1   2   3

    Minterm occurrence patterns:

      minterm  k1  k2  k3
      m1        0   0   0
      m2        1   0   0
      m3        0   1   0
      m4        1   1   0
      m5        0   0   1
      m6        1   0   1
      m7        0   1   1
      m8        1   1   1

    $c_{1,2} = w_{1,2} + w_{1,4} = 1 + 2 = 3$, $c_{1,4} = w_{1,6} = 1$, $c_{1,6} = w_{1,1} = 2$, $c_{1,8} = w_{1,5} = 1$

    $\vec{k_1} = \frac{3\vec{m_2} + 1\vec{m_4} + 2\vec{m_6} + 1\vec{m_8}}{\sqrt{3^2 + 1^2 + 2^2 + 1^2}} = \frac{3\vec{m_2} + \vec{m_4} + 2\vec{m_6} + \vec{m_8}}{\sqrt{15}}$

    $c_{2,3} = w_{2,7} = 5$, $c_{2,4} = w_{2,6} = 2$, $c_{2,7} = w_{2,3} = 1$, $c_{2,8} = w_{2,5} = 2$

    $\vec{k_2} = \frac{5\vec{m_3} + 2\vec{m_4} + 1\vec{m_7} + 2\vec{m_8}}{\sqrt{5^2 + 2^2 + 1^2 + 2^2}} = \frac{5\vec{m_3} + 2\vec{m_4} + \vec{m_7} + 2\vec{m_8}}{\sqrt{34}}$

    $c_{3,5} = 0$, $c_{3,6} = w_{3,1} = 1$, $c_{3,7} = w_{3,3} = 3$, $c_{3,8} = w_{3,5} = 4$

    $\vec{k_3} = \frac{0\vec{m_5} + 1\vec{m_6} + 3\vec{m_7} + 4\vec{m_8}}{\sqrt{0^2 + 1^2 + 3^2 + 4^2}} = \frac{\vec{m_6} + 3\vec{m_7} + 4\vec{m_8}}{\sqrt{26}}$


    Generalized Vector Model (cont.)

    Example: Ranking

    With $\vec{k_1} = (3\vec{m_2} + \vec{m_4} + 2\vec{m_6} + \vec{m_8})/\sqrt{15}$, $\vec{k_2} = (5\vec{m_3} + 2\vec{m_4} + \vec{m_7} + 2\vec{m_8})/\sqrt{34}$, and $\vec{k_3} = (\vec{m_6} + 3\vec{m_7} + 4\vec{m_8})/\sqrt{26}$:

    $\vec{d_1} = 2\vec{k_1} + 1\vec{k_3} = \frac{6}{\sqrt{15}}\vec{m_2} + \frac{2}{\sqrt{15}}\vec{m_4} + \left(\frac{4}{\sqrt{15}} + \frac{1}{\sqrt{26}}\right)\vec{m_6} + \frac{3}{\sqrt{26}}\vec{m_7} + \left(\frac{2}{\sqrt{15}} + \frac{4}{\sqrt{26}}\right)\vec{m_8}$
    $\quad = s_{d_1,2}\vec{m_2} + s_{d_1,4}\vec{m_4} + s_{d_1,6}\vec{m_6} + s_{d_1,7}\vec{m_7} + s_{d_1,8}\vec{m_8}$

    $\vec{q} = 1\vec{k_1} + 2\vec{k_2} + 3\vec{k_3} = \frac{3}{\sqrt{15}}\vec{m_2} + \frac{10}{\sqrt{34}}\vec{m_3} + \left(\frac{1}{\sqrt{15}} + \frac{4}{\sqrt{34}}\right)\vec{m_4} + \left(\frac{2}{\sqrt{15}} + \frac{3}{\sqrt{26}}\right)\vec{m_6} + \left(\frac{2}{\sqrt{34}} + \frac{9}{\sqrt{26}}\right)\vec{m_7} + \left(\frac{1}{\sqrt{15}} + \frac{4}{\sqrt{34}} + \frac{12}{\sqrt{26}}\right)\vec{m_8}$
    $\quad = s_{q,2}\vec{m_2} + s_{q,3}\vec{m_3} + s_{q,4}\vec{m_4} + s_{q,6}\vec{m_6} + s_{q,7}\vec{m_7} + s_{q,8}\vec{m_8}$

    $sim(q, d_1) = \mathrm{cosine}(\vec{q}, \vec{d_1}) = \frac{s_{q,2}s_{d_1,2} + s_{q,4}s_{d_1,4} + s_{q,6}s_{d_1,6} + s_{q,7}s_{d_1,7} + s_{q,8}s_{d_1,8}}{\sqrt{s_{q,2}^2 + s_{q,3}^2 + s_{q,4}^2 + s_{q,6}^2 + s_{q,7}^2 + s_{q,8}^2}\ \sqrt{s_{d_1,2}^2 + s_{d_1,4}^2 + s_{d_1,6}^2 + s_{d_1,7}^2 + s_{d_1,8}^2}}$

    The similarity between the query and the doc is thus calculated in the space of minterm vectors.
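The whole construction (minterm induction, $c_{i,r}$ weights, normalization, and cosine in minterm space) can be replayed on the running example. A sketch (all names are ours) that reproduces the similarity between q = (1,2,3) and d1 = (2,0,1):

```python
import math
from collections import defaultdict

# Doc-term weight matrix from the example (rows d1..d7, columns k1..k3)
W = [
    [2, 0, 1],
    [1, 0, 0],
    [0, 1, 3],
    [2, 0, 0],
    [1, 2, 4],
    [1, 2, 0],
    [0, 5, 0],
]
q = [1, 2, 3]
t = 3

def minterm(weights):
    """Binary term co-occurrence pattern of a document."""
    return tuple(1 if w > 0 else 0 for w in weights)

# c[i][m]: sum of w_{i,j} over docs whose pattern is minterm m
c = [defaultdict(float) for _ in range(t)]
for row in W:
    m = minterm(row)
    for i in range(t):
        if m[i]:
            c[i][m] += row[i]

# k_i as a sparse unit vector over minterms
k = []
for i in range(t):
    norm = math.sqrt(sum(v * v for v in c[i].values()))
    k.append({m: v / norm for m, v in c[i].items()})

def to_minterm_space(weights):
    """Map a term-space vector to the minterm space: sum_i w_i * k_i."""
    s = defaultdict(float)
    for i, w in enumerate(weights):
        for m, v in k[i].items():
            s[m] += w * v
    return s

def cosine(a, b):
    dot = sum(v * b.get(m, 0.0) for m, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb)

sim = cosine(to_minterm_space(q), to_minterm_space(W[0]))  # sim(q, d1), ≈ 0.75
```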


    Generalized Vector Model (cont.)

    Term Correlation

    The degree of correlation between the terms $k_i$ and $k_j$ can now be computed as

    $\vec{k_i} \cdot \vec{k_j} = \sum_{r\ |\ g_i(m_r)=1 \wedge g_j(m_r)=1} c_{i,r}\, c_{j,r}$

    This does not need to be normalized, because the $\vec{k_i}$ were already normalized (see p. 26).


    More on Generalized Vector Model

    Advantages

    Model considers correlations among index terms

    Model does introduce interesting new ideas

    Disadvantages
    - Not clear in which situations it is superior to the standard vector model
    - Computation cost is fairly high with large collections, since the number of active minterms might be proportional to the number of documents in the collection

    Despite these drawbacks, the generalized vector model does introduce new ideas which are of importance from a theoretical point of view.


    Set-Based Model

    This is a more recent approach (2005) that

    combines set theory with a vectorial ranking

    The fundamental idea is to use mutual dependencies among index terms to improve results.

    Term dependencies are captured through termsets, which are sets of correlated terms.

    The approach, which leads to improved results

    with various collections, constitutes the first IRmodel that effectively took advantage of term

    dependence with general collections


    Set-Based Model: Termsets

    A termset is a concept used in place of the index terms:
    - A termset $S_i = \{k_a, k_b, ..., k_n\}$ is a subset of the terms in the collection
    - If all index terms in $S_i$ occur in a document $d_j$, then we say that the termset $S_i$ occurs in $d_j$

    There are $2^t$ termsets that might occur in the documents of a collection, where t is the vocabulary size. However, most combinations of terms have no semantic meaning; thus, the actual number of termsets in a collection is far smaller than $2^t$.


    Set-Based Model: Termsets (cont.)

    Let t be the number of terms in the collection. Then the set $VS = \{S_1, S_2, ..., S_{2^t}\}$ is the vocabulary-set of the collection.

    To illustrate, consider the document collection below.


    Set-Based Model: Termsets (cont.)

    To simplify notation, let us define

    Further, let the letters a...n refer to the index

    terms ka...kn , respectively


    Set-Based Model: Termsets (cont.)

    Consider the query q as "to do be it", i.e. q = {a, b, d, n}.

    For this query, the vocabulary-set is as below.


    Set-Based Model: Termsets (cont.)

    At query processing time, only the termsets generated by the query need to be considered:
    - A termset composed of n terms is called an n-termset
    - Let $N_i$ be the number of documents in which $S_i$ occurs
    - An n-termset $S_i$ is said to be frequent if $N_i$ is greater than or equal to a given threshold
      - This implies that an n-termset is frequent if and only if all of its (n−1)-termsets are also frequent
    - Frequent termsets can be used to reduce the number of termsets to consider with long queries


    Set-Based Model: Termsets (cont.)

    Let the threshold on the frequency of termsets be 2. To compute all frequent termsets for the query q = {a, b, d, n} we proceed as follows:
    1. Compute the frequent 1-termsets and their inverted lists:
       Sa = {d1, d2}
       Sb = {d1, d3, d4}
       Sd = {d1, d2, d3, d4}
    2. Combine the inverted lists to compute frequent 2-termsets:
       Sad = {d1, d2}
       Sbd = {d1, d3, d4}
    3. Since there are no frequent 3-termsets, stop.
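The steps above amount to an Apriori-style enumeration that only ever intersects inverted lists. A sketch (function and variable names are ours; inverted lists are given as sets of doc ids):

```python
from itertools import combinations

def frequent_termsets(inverted, query_terms, min_freq=2):
    """Enumerate frequent termsets for a query from standard inverted lists.

    inverted: dict term -> set of doc ids.
    Returns a dict mapping frozenset-of-terms -> set of docs containing them all.
    """
    # Frequent 1-termsets
    current = {frozenset([t]): inverted.get(t, set()) for t in query_terms}
    current = {s: docs for s, docs in current.items() if len(docs) >= min_freq}
    result = dict(current)
    # Grow (n+1)-termsets by intersecting the lists of frequent n-termsets
    while current:
        next_level = {}
        for s1, s2 in combinations(list(current), 2):
            union = s1 | s2
            if len(union) == len(s1) + 1:  # join step: the two sets differ by one term
                docs = current[s1] & current[s2]
                if len(docs) >= min_freq:
                    next_level[union] = docs
        result.update(next_level)
        current = next_level
    return result
```

On the inverted lists of the example (with the rare term n occurring in only one doc), this yields exactly the five frequent termsets Sa, Sb, Sd, Sad, Sbd.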


    Set-Based Model: Termsets (cont.)

    Notice that there are only 5 frequent termsets in our collection.

    Inverted lists for frequent n-termsets can be computed by starting with the inverted lists of frequent 1-termsets. Thus, the only indices required are the standard inverted lists used by any IR system. This is reasonably fast for short queries of up to 4-5 terms.


    Set-Based Model: Ranking Computation

    The ranking computation is based on the vector model, but adopts termsets instead of index terms. Given a query q, let:
    - $\{S_1, S_2, ...\}$ be the set of all termsets originated from q
    - $N_i$ be the number of documents in which termset $S_i$ occurs
    - N be the total number of documents in the collection
    - $F_{i,j}$ be the frequency of termset $S_i$ in document $d_j$

    For each pair $[S_i, d_j]$ we compute a weight $W_{i,j}$ (a tf-idf-like weight defined on termsets). We also compute a $W_{i,q}$ value for each pair $[S_i, q]$.


    Set-Based Model: Ranking Computation (cont.)

    Consider
    - query q = {a, b, d, n}
    - document d1 = "a b c a d a d c a b"

    Assume here a minimum threshold frequency of 1.


    Set-Based Model: Ranking Computation (cont.)

    A document $d_j$ and a query q are represented as vectors in a $2^t$-dimensional space of termsets. The rank of $d_j$ with respect to the query q is then computed analogously to the vector model, using the termset weights $W_{i,j}$ and $W_{i,q}$ in place of term weights. For termsets that are not in the query q, $W_{i,q} = 0$.


    Set-Based Model: Ranking Computation (cont.)

    The document norm $|d_j|$ is hard to compute in the space of termsets; thus, its computation is restricted to 1-termsets. Let again q = {a, b, d, n} and $d_1$. The document norm in terms of 1-termsets is then the Euclidean norm over the 1-termset weights of $d_1$.

  • 7/31/2019 IR2012S Lecture05 Modeling II(Set, Algebra & Probabilistic)

    44/57

    Set-Based Model: Ranking Computation (cont.)

    To compute the rank of $d_1$, we need to consider the seven termsets $S_a$, $S_b$, $S_d$, $S_{ab}$, $S_{ad}$, $S_{bd}$, and $S_{abd}$ that the query generates in $d_1$; the rank of $d_1$ is then obtained by accumulating their weight products with the query.


    BM25 (Best Match 25)

    BM25 was created as the result of a series of experiments on variations of the probabilistic model.

    A good term weighting is based on three principles:
    - Inverse document frequency
    - Term frequency
    - Document length normalization

    The classic probabilistic model covers only the first of these principles. This reasoning led to a series of experiments with the Okapi system, which led to the BM25 ranking formula.


    BM1, BM11 and BM15 Formulas

    At first, the Okapi system used the equation below as its ranking formula:

    $sim_{BM1}(d_j, q) \sim \sum_{k_i \in q,\ k_i \in d_j} \log\left(\frac{N - n_i + 0.5}{n_i + 0.5}\right)$

    which is just the equation used in the probabilistic model when no relevance information is provided. It was referred to as the BM1 formula (Best Match 1).


    BM1, BM11 and BM15 Formulas (cont.)

    The first idea for improving the ranking was to

    introduce a term-frequency factor Fi,j in the BM1

    formula

    This factor, after some changes, evolved to become

    $F_{i,j} = S_1 \cdot \frac{f_{i,j}}{K_1 + f_{i,j}}$

    where
    - $f_{i,j}$ is the frequency of term $k_i$ within document $d_j$
    - $K_1$ is a constant set up experimentally for each collection
    - $S_1$ is a scaling constant, normally set to $S_1 = K_1 + 1$

    If $K_1 = 0$, this whole factor becomes equal to 1 and bears no effect on the ranking.


    BM1, BM11 and BM15 Formulas (cont.)

    The factor $\frac{f_{i,j}}{K_1 + f_{i,j}}$ in $F_{i,j}$ can be viewed as a saturation function of the term frequency $f_{i,j}$: it increases monotonically with $f_{i,j}$ but is bounded, approaching 1 as $f_{i,j}$ grows.
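The saturation behavior is easy to check numerically. A one-line sketch (the name is ours):

```python
def tf_saturation(f, K1=1.0):
    """Saturating term-frequency factor f / (K1 + f): monotone, bounded by 1."""
    return f / (K1 + f)
```

With K1 = 1, one occurrence already earns 0.5 of the maximum, three occurrences earn 0.75, and further repetitions add less and less, which is exactly the diminishing-returns effect the slide's plot illustrated.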


    BM1, BM11 and BM15 Formulas (cont.)

    The next step was to modify the $F_{i,j}$ factor by adding document length normalization to it, as follows:

    $F_{i,j} = S_1 \cdot \frac{f_{i,j}}{K_1 \cdot \frac{len(d_j)}{avg\_doclen} + f_{i,j}}$

    where
    - $len(d_j)$ is the length of document $d_j$ (computed, for instance, as the number of terms in the document)
    - $avg\_doclen$ is the average document length for the collection


    BM1, BM11 and BM15 Formulas (cont.)

    Next, a correction factor $G_{j,q}$, dependent on the document and query lengths, was added:

    $G_{j,q} = K_2 \cdot len(q) \cdot \frac{avg\_doclen - len(d_j)}{avg\_doclen + len(d_j)}$

    where
    - $len(q)$ is the query length (number of terms in the query)
    - $K_2$ is a constant


    BM1, BM11 and BM15 Formulas (cont.)

    A third additional factor, aimed at taking into

    account term frequencies within queries, was

    defined as

    $F_{i,q} = S_3 \cdot \frac{f_{i,q}}{K_3 + f_{i,q}}$

    where
    - $f_{i,q}$ is the frequency of term $k_i$ within query q
    - $K_3$ is a constant
    - $S_3$ is a scaling constant related to $K_3$, normally set to $S_3 = K_3 + 1$


    BM1, BM11 and BM15 Formulas (cont.)

    Introduction of these three factors led to various

    BM (Best Matching) formulas, as follows:

    $sim_{BM1}(d_j, q) \sim \sum_{k_i \in q,\ k_i \in d_j} \log\left(\frac{N - n_i + 0.5}{n_i + 0.5}\right)$

    $sim_{BM15}(d_j, q) \sim \sum_{k_i \in q,\ k_i \in d_j} G_{j,q}\, F_{i,j}\, F_{i,q} \log\left(\frac{N - n_i + 0.5}{n_i + 0.5}\right)$, with $F_{i,j}$ in its unnormalized form

    $sim_{BM11}(d_j, q) \sim \sum_{k_i \in q,\ k_i \in d_j} G_{j,q}\, F_{i,j}\, F_{i,q} \log\left(\frac{N - n_i + 0.5}{n_i + 0.5}\right)$, with $F_{i,j}$ in its document-length-normalized form


    BM1, BM11 and BM15 Formulas (cont.)

    Experiments using TREC data have shown that

    BM11 outperforms BM15 (due to additional

    document length normalization)

    Further, empirical considerations can be used to

    simplify the previous equations, as follows:

    - Empirical evidence suggests that the best value of $K_2$ is 0, which eliminates the $G_{j,q}$ factor from these equations (i.e., from BM11 and BM15)
    - Further, good estimates for the scaling constants $S_1$ and $S_3$ are $K_1 + 1$ and $K_3 + 1$, respectively
    - Empirical evidence also suggests that making $K_3$ very large is better; as a result, the $F_{i,q}$ factor reduces simply to $f_{i,q}$
    - For short queries, we can assume that $f_{i,q}$ is 1 for all terms


    BM1, BM11 and BM15 Formulas (cont.)

    These considerations lead to simpler equations

    as follows

    $sim_{BM1}(d_j, q) \sim \sum_{k_i \in q,\ k_i \in d_j} \log\left(\frac{N - n_i + 0.5}{n_i + 0.5}\right)$

    $sim_{BM15}(d_j, q) \sim \sum_{k_i \in q,\ k_i \in d_j} \frac{(K_1 + 1)\, f_{i,j}}{K_1 + f_{i,j}} \cdot \log\left(\frac{N - n_i + 0.5}{n_i + 0.5}\right)$

    $sim_{BM11}(d_j, q) \sim \sum_{k_i \in q,\ k_i \in d_j} \frac{(K_1 + 1)\, f_{i,j}}{K_1 \frac{len(d_j)}{avg\_doclen} + f_{i,j}} \cdot \log\left(\frac{N - n_i + 0.5}{n_i + 0.5}\right)$


    BM25 Ranking Formula

    BM25: a combination of BM11 and BM15.

    The motivation was to combine the BM11 and BM15 term frequency factors as follows:

    $B_{i,j} = \frac{(K_1 + 1)\, f_{i,j}}{K_1 \left[(1-b) + b\,\frac{len(d_j)}{avg\_doclen}\right] + f_{i,j}}$

    where b is a constant with values in the interval [0, 1]:
    - If b = 0, it reduces to the BM15 term frequency factor
    - If b = 1, it reduces to the BM11 term frequency factor
    - For values of b between 0 and 1, the equation provides a combination of BM11 with BM15


    BM25 Ranking Formula (cont.)

    The ranking equation for the BM25 model can then be written as

    $sim_{BM25}(d_j, q) \sim \sum_{k_i \in q,\ k_i \in d_j} B_{i,j} \cdot \log\left(\frac{N - n_i + 0.5}{n_i + 0.5}\right)$

    where $K_1$ and b are empirical constants:
    - $K_1 = 1$ works well with real collections
    - b should be kept closer to 1 to emphasize the document length normalization effect present in the BM11 formula
    - For instance, b = 0.75 is a reasonable assumption
    - Constant values can be fine-tuned for particular collections through proper experimentation
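Putting the pieces together, the BM25 ranking above is a short function. A sketch (function names are ours; the idf term is the raw formula from the slides, which can go negative for terms occurring in more than half the collection):

```python
import math

def idf(N, n_i):
    """RSJ-style idf shared by the whole BM family."""
    return math.log((N - n_i + 0.5) / (n_i + 0.5))

def bm25_term_factor(f_ij, len_d, avg_doclen, K1=1.0, b=0.75):
    """B_{i,j}: saturated term frequency with soft length normalization.
    b=0 recovers the BM15 factor, b=1 the BM11 factor."""
    return ((K1 + 1) * f_ij) / (K1 * ((1 - b) + b * len_d / avg_doclen) + f_ij)

def bm25(query_terms, doc_tf, len_d, avg_doclen, N, df, K1=1.0, b=0.75):
    """Score one document: doc_tf maps term -> f_ij, df maps term -> n_i."""
    score = 0.0
    for term in query_terms:
        f = doc_tf.get(term, 0)
        if f:
            score += bm25_term_factor(f, len_d, avg_doclen, K1, b) * idf(N, df[term])
    return score
```

Setting b to 0 or 1 recovers the two parent term-frequency factors exactly, which is an easy sanity check on an implementation.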


    BM25 Ranking Formula (cont.)

    Unlike the probabilistic model, the BM25 formula

    can be computed without relevance information

    There is consensus that BM25 outperforms the

    classic vector model for general collections

    Thus, it has been used as a baseline for evaluating new ranking functions, in substitution for the classic vector model.
