Meeting 05: Clustering Applications
TRANSCRIPT
Overview of Clustering
Rong Jin
Outline
K-means for clustering
Expectation Maximization algorithm for clustering
Spectral clustering (if time permits)
Clustering
[Figure: scatter plot of data points; axis labels: $$$, age]
Find out the underlying structure of given data points
Application (I): Search Result Clustering
Application (II): Navigation
Application (III): Google News
Application (IV): Visualization
Islands of Music (Pampalk et al., KDD '03)
Application (V): Image Compression
http://www.ece.neu.edu/groups/rpl/kmeans/
How to Find a Good Clustering?
Minimize the sum of distances within clusters
[Figure: data points grouped into clusters C1, C2, C3, C4, C5]

$$\{m_{i,j}\},\{C_j\} \;=\; \underset{\{m_{i,j}\},\,\{C_j\}}{\arg\min}\;\sum_{i=1}^{n}\sum_{j=1}^{6} m_{i,j}\,\bigl\lVert x_i - C_j \bigr\rVert_2^{2}$$

where

$$m_{i,j}=\begin{cases}1 & x_i \text{ belongs to the } j\text{-th cluster}\\[2pt]0 & x_i \text{ does not belong to the } j\text{-th cluster}\end{cases}
\qquad
\sum_{j=1}^{6} m_{i,j}=1 \;\;(\text{each } x_i \text{ belongs to a single cluster})$$
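As a concrete reference, here is a minimal numpy sketch of this objective. The function name and the 0/1 membership-matrix representation are illustrative choices, not part of the slides.

```python
import numpy as np

def kmeans_objective(X, centers, membership):
    """Sum of squared distances from each point to the center of its cluster.

    X          : (n, d) data points x_i
    centers    : (K, d) cluster centers C_j
    membership : (n, K) 0/1 matrix m_ij with exactly one 1 per row
    """
    # squared distance from every point to every center: shape (n, K)
    sq_dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    # only the assigned center contributes for each point
    return float((membership * sq_dist).sum())
```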
How to Efficiently Cluster Data?
The memberships $m_{i,j}$ and the centers $C_j$ in the objective above are correlated, so optimize them alternately.

Given the memberships $m_{i,j}$:

$$C_j \;=\; \frac{\sum_{i=1}^{n} m_{i,j}\, x_i}{\sum_{i=1}^{n} m_{i,j}}$$

Given the centers $\{C_j\}$:

$$m_{i,j} \;=\; \begin{cases}1 & j = \arg\min_k \lVert x_i - C_k\rVert^2\\[2pt]0 & \text{otherwise}\end{cases}$$
K-means for Clustering
Start with a random guess of the cluster centers
Determine the membership of each data point
Adjust the cluster centers
Repeat the last two steps until the centers stop changing
K-means
1. Ask user how many clusters they'd like (e.g., k = 5).
2. Randomly guess k cluster Center locations.
3. Each data point finds out which Center it is closest to (thus each Center "owns" a set of data points).
4. Each Center finds the centroid of the points it owns.
Any Computational Problem?
Computational Complexity: O(N) per iteration, where N is the number of points (each of the N points is compared with every center). Can this be reduced?
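For reference, a minimal numpy sketch of the full loop (steps 1 to 4 above, repeated until the centers stop moving). The function name, the initialization by sampling k data points, and the iteration cap are illustrative assumptions rather than part of the slides.

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Plain K-means: alternate the assignment step and the center update."""
    rng = np.random.default_rng(seed)
    # Steps 1-2: pick k random data points as the initial cluster Centers
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Step 3: each data point finds the Center it is closest to
        sq_dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = sq_dist.argmin(axis=1)
        # Step 4: each Center moves to the centroid of the points it owns
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):  # centers stopped moving
            break
        centers = new_centers
    return centers, labels
```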
Improved K-means
Group points by region: KD-tree, SR-tree
Key difference: find the closest center for each rectangle, and assign all the points within a rectangle to one cluster
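The rectangle-based pruning above needs a full KD-tree traversal; as a lighter illustration of the same idea (use a spatial index instead of a brute-force distance matrix), here is a sketch that builds a KD-tree on the current centers to speed up the assignment step. It assumes scipy is available and is not the rectangle-pruning algorithm described in the slide.

```python
from scipy.spatial import cKDTree

def assign_with_kdtree(X, centers):
    """Assignment step of K-means accelerated with a KD-tree on the centers."""
    tree = cKDTree(centers)          # spatial index over the current centers
    _, labels = tree.query(X, k=1)   # nearest-center index for each point
    return labels
```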
A Gaussian Mixture Model for Clustering
Assume that the data are generated from a mixture of Gaussian distributions
For each Gaussian distribution: center $\mu_j$ and variance $\sigma_j^2$ (ignored here)
For each data point: determine the membership $z_{ij}$, i.e., whether $x_i$ belongs to the $j$-th cluster
Learning a Gaussian Mixture (with known covariance)
Probability $p(x = x_i)$:

$$p(x=x_i) \;=\; \sum_j p(x=x_i,\,\mu_j) \;=\; \sum_j p(\mu_j)\, p(x=x_i \mid \mu_j)
\;=\; \sum_j p(\mu_j)\,\frac{1}{(2\pi\sigma^2)^{d/2}}\exp\!\left(-\frac{\lVert x_i-\mu_j\rVert^2}{2\sigma^2}\right)$$
Log-likelihood of the data; apply MLE to find the optimal parameters $\{p(\mu_j),\,\mu_j\}$:

$$\log p(X) \;=\; \sum_i \log p(x=x_i)
\;=\; \sum_i \log\!\left[\sum_j p(\mu_j)\,\frac{1}{(2\pi\sigma^2)^{d/2}}\exp\!\left(-\frac{\lVert x_i-\mu_j\rVert^2}{2\sigma^2}\right)\right]$$
Learning a Gaussian Mixture (with known covariance)

E-Step:

$$E[z_{ij}] \;=\; p(\mu_j \mid x=x_i)
\;=\; \frac{p(x=x_i \mid \mu_j)\, p(\mu_j)}{\sum_{n=1}^{k} p(x=x_i \mid \mu_n)\, p(\mu_n)}
\;=\; \frac{\exp\!\left(-\frac{1}{2\sigma^2}\lVert x_i-\mu_j\rVert^2\right) p(\mu_j)}{\sum_{n=1}^{k}\exp\!\left(-\frac{1}{2\sigma^2}\lVert x_i-\mu_n\rVert^2\right) p(\mu_n)}$$
Learning a Gaussian Mixture (with known covariance)

M-Step:

$$\mu_j \;=\; \frac{\sum_{i=1}^{m} E[z_{ij}]\, x_i}{\sum_{i=1}^{m} E[z_{ij}]},
\qquad
p(\mu_j) \;=\; \frac{1}{m}\sum_{i=1}^{m} E[z_{ij}]$$
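A compact sketch of this EM procedure, assuming numpy and a known, shared spherical variance sigma^2. The initialization, the fixed iteration count, and the numerical-stability shift in the E-step are illustrative choices, not part of the slides.

```python
import numpy as np

def gmm_em(X, k, sigma2=1.0, n_iters=100, seed=0):
    """EM for a Gaussian mixture with known, shared spherical variance sigma^2."""
    rng = np.random.default_rng(seed)
    m, d = X.shape
    mu = X[rng.choice(m, size=k, replace=False)]   # initial centers mu_j
    prior = np.full(k, 1.0 / k)                    # initial p(mu_j)
    for _ in range(n_iters):
        # E-step: E[z_ij] proportional to exp(-||x_i - mu_j||^2 / (2 sigma^2)) p(mu_j)
        sq_dist = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        shifted = sq_dist - sq_dist.min(axis=1, keepdims=True)  # avoid underflow
        resp = np.exp(-shifted / (2.0 * sigma2)) * prior
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: mu_j = sum_i E[z_ij] x_i / sum_i E[z_ij];  p(mu_j) = (1/m) sum_i E[z_ij]
        weight = resp.sum(axis=0)
        mu = (resp.T @ X) / weight[:, None]
        prior = weight / m
    return mu, prior, resp
```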
Gaussian Mixture Example: Start
After First Iteration
After 2nd Iteration
After 3rd Iteration
After 4th Iteration
After 5th Iteration
After 6th Iteration
After 20th Iteration
Mixture Model for Doc Clustering
A set of language models $\theta_1, \theta_2, \ldots, \theta_K$, where

$$\theta_i = \{\,p(w_1\mid\theta_i),\; p(w_2\mid\theta_i),\; \ldots,\; p(w_V\mid\theta_i)\,\}$$

Probability $p(d = d_i)$:

$$p(d=d_i) \;=\; \sum_j p(d=d_i,\,\theta_j)
\;=\; \sum_j p(\theta_j)\, p(d=d_i\mid\theta_j)
\;=\; \sum_j p(\theta_j)\prod_{k=1}^{V} p(w_k\mid\theta_j)^{\,tf(w_k,\,d_i)}$$
Introduce the hidden variable $z_{ij}$: document $d_i$ is generated by the $j$-th language model $\theta_j$.
Learning a Mixture Model

E-Step ($K$: number of language models):

$$E[z_{ij}] \;=\; p(\theta_j \mid d_i)
\;=\; \frac{p(d_i\mid\theta_j)\,p(\theta_j)}{\sum_{n=1}^{K} p(d_i\mid\theta_n)\,p(\theta_n)}
\;=\; \frac{\prod_{m=1}^{V} p(w_m\mid\theta_j)^{\,tf(w_m,\,d_i)}\; p(\theta_j)}{\sum_{n=1}^{K}\prod_{m=1}^{V} p(w_m\mid\theta_n)^{\,tf(w_m,\,d_i)}\; p(\theta_n)}$$
Learning a Mixture Model

M-Step ($N$: number of documents):

$$p(\theta_j) \;=\; \frac{1}{N}\sum_{i=1}^{N} E[z_{ij}],
\qquad
p(w_k\mid\theta_j) \;=\; \frac{\sum_{i=1}^{N} E[z_{ij}]\; tf(w_k,\,d_i)}{\sum_{i=1}^{N} E[z_{ij}]\;\lvert d_i\rvert}$$
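A sketch of the same E-step and M-step for document clustering over a term-frequency matrix, assuming numpy. Working in log space and adding a small smoothing constant to the word counts are practical additions for numerical robustness, not part of the slides.

```python
import numpy as np

def doc_mixture_em(tf, K, n_iters=50, seed=0):
    """EM for a mixture of unigram language models over term frequencies.

    tf : (N, V) matrix of term frequencies tf(w_k, d_i)
    """
    rng = np.random.default_rng(seed)
    N, V = tf.shape
    # random initial language models p(w|theta_j) and priors p(theta_j)
    p_w = rng.random((K, V)) + 1e-3
    p_w /= p_w.sum(axis=1, keepdims=True)
    prior = np.full(K, 1.0 / K)
    for _ in range(n_iters):
        # E-step: log p(d_i|theta_j) = sum_k tf(w_k, d_i) log p(w_k|theta_j)
        log_like = tf @ np.log(p_w).T + np.log(prior)    # shape (N, K)
        log_like -= log_like.max(axis=1, keepdims=True)  # stabilize before exp
        resp = np.exp(log_like)
        resp /= resp.sum(axis=1, keepdims=True)          # E[z_ij]
        # M-step
        prior = resp.mean(axis=0)                        # p(theta_j)
        p_w = resp.T @ tf + 1e-6                         # smoothed expected counts
        p_w /= p_w.sum(axis=1, keepdims=True)            # p(w_k|theta_j)
    return p_w, prior, resp
```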
Examples of Mixture Models
Other Mixture Models: Probabilistic Latent Semantic Indexing (PLSI), Latent Dirichlet Allocation (LDA)
Problems (I)
Both k-means and mixture models need to compute the centers of clusters and require an explicit distance measure
Given a strange distance measure, the centers of clusters can be hard to compute, e.g.,

$$\lVert x - x'\rVert \;=\; \max\bigl(\,\lvert x_1 - x'_1\rvert,\; \lvert x_2 - x'_2\rvert,\; \ldots,\; \lvert x_n - x'_n\rvert\,\bigr)$$

[Figure: points x, y, z compared under this distance]
Problems (II)
Both k-means and mixture models look for compact clustering structures
In some cases, connected clustering structures are more desirable
Graph Partition
MinCut: partition the graph into two parts with the minimal number of cut edges
[Figure: example partition with CutSize = 2]
2-way Spectral Graph Partitioning
Weight matrix $W$; $w_{i,j}$: the weight between two vertices $i$ and $j$
Membership vector $q$:

$$q_i = \begin{cases}+1 & \text{vertex } i \in \text{cluster } A\\[2pt]-1 & \text{vertex } i \in \text{cluster } B\end{cases}$$

$$\min_{q\in\{-1,+1\}^n} J \;=\; \text{CutSize} \;=\; \frac{1}{4}\sum_{i,j=1}^{n} w_{i,j}\,(q_i - q_j)^2$$
Solving the Optimization Problem

$$q^{*} \;=\; \underset{q\in\{-1,+1\}^n}{\arg\min}\;\frac{1}{4}\sum_{i,j} w_{i,j}\,(q_i - q_j)^2$$

Directly solving the above problem requires a combinatorial search, which has exponential complexity
How can we reduce the computational complexity?
Relaxation Approach
Key difficulty: each $q_i$ has to be either $-1$ or $+1$
Relax $q_i$ to be any real number and impose the constraint $\sum_{i=1}^{n} q_i^2 = n$

$$J \;=\; \frac{1}{4}\sum_{i,j} w_{i,j}\,(q_i - q_j)^2
\;=\; \frac{1}{4}\sum_{i,j} w_{i,j}\,(q_i^2 + q_j^2 - 2 q_i q_j)
\;=\; \frac{1}{2}\sum_{i} q_i^2\, d_i \;-\; \frac{1}{2}\sum_{i,j} q_i\, q_j\, w_{i,j}$$

where $d_i = \sum_j w_{i,j}$ and $D = \mathrm{diag}(d_1, \ldots, d_n)$, so that

$$J \;=\; \frac{1}{2}\, q^{\top}(D - W)\, q$$
Relaxation Approach

$$q^{*} \;=\; \arg\min_{q} J(q) \;=\; \arg\min_{q}\; q^{\top}(D - W)\,q
\qquad \text{subject to } \sum_{k} q_k^2 = n$$

Solution: the eigenvector of $D - W$ with the second smallest eigenvalue:

$$(D - W)\,q \;=\; \lambda_2\, q$$
Graph Laplacian

$$L = D - W, \qquad d_i = \sum_j w_{i,j}, \qquad D = \mathrm{diag}(d_1, \ldots, d_n)$$

$L$ is a positive semi-definite matrix: for any $x$, $x^{\top} L\, x = \frac{1}{2}\sum_{i,j} w_{i,j}(x_i - x_j)^2 \ge 0$
The minimum eigenvalue is $\lambda_1 = 0$ (its eigenvector is the constant all-ones vector)
The eigenvalues satisfy $0 = \lambda_1 \le \lambda_2 \le \lambda_3 \le \cdots \le \lambda_n$
The second smallest eigenvalue $\lambda_2$ gives the best bipartition of the graph
Recovering Partitions
Due to the relaxation, $q_i$ can be any real number (not just $-1$ and $+1$)
How do we construct a partition based on the eigenvector?
Simple strategy: $A = \{\,i \mid q_i \ge 0\,\}$, $B = \{\,i \mid q_i < 0\,\}$
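A minimal numpy sketch of this 2-way partition: build $L = D - W$, take the eigenvector of the second smallest eigenvalue, and split by sign. The function name and the boolean-mask return format are illustrative.

```python
import numpy as np

def spectral_bipartition(W):
    """2-way partition from the second smallest eigenvector of L = D - W."""
    d = W.sum(axis=1)
    L = np.diag(d) - W               # graph Laplacian
    _, eigvecs = np.linalg.eigh(L)   # eigenvalues returned in ascending order
    q = eigvecs[:, 1]                # eigenvector of the second smallest eigenvalue
    A = q >= 0                       # simple strategy: split by sign
    return A, ~A
```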
Spectral Clustering
Minimum cut does not balance the sizes of the two parts of the partition

Normalized Cut (Shi & Malik, 1997)
Minimize the similarity between clusters and meanwhile maximize the similarity within clusters

$$s(A,B) = \sum_{i\in A,\, j\in B} w_{i,j}, \qquad d_A = \sum_{i\in A} d_i, \qquad d_B = \sum_{i\in B} d_i, \qquad d = \sum_j d_j$$

$$J \;=\; \frac{s(A,B)}{d_A} + \frac{s(A,B)}{d_B}
\;=\; \sum_{i\in A,\, j\in B} w_{i,j}\,\frac{d_A + d_B}{d_A\, d_B}
\;=\; \sum_{i\in A,\, j\in B} w_{i,j}\,\frac{(d_A + d_B)^2}{d_A\, d_B\, d}
\;=\; \frac{1}{2}\sum_{i,j} w_{i,j}\,(q_i - q_j)^2$$

with

$$q_i = \begin{cases}+\sqrt{d_B/(d\, d_A)} & \text{if } i \in A\\[2pt]-\sqrt{d_A/(d\, d_B)} & \text{if } i \in B\end{cases}$$
Normalized Cut

$$J \;=\; \frac{1}{2}\sum_{i,j} w_{i,j}\,(q_i - q_j)^2 \;=\; q^{\top}(D - W)\,q,
\qquad
q_i = \begin{cases}+\sqrt{d_B/(d\, d_A)} & \text{if } i \in A\\[2pt]-\sqrt{d_A/(d\, d_B)} & \text{if } i \in B\end{cases}$$
Normalized Cut
Relax $q$ to real values under the constraints

$$q^{\top} D\, q = 1, \qquad q^{\top} D\, e = 0$$

Solution: the generalized eigenvalue problem

$$(D - W)\,q \;=\; \lambda\, D\, q$$
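A sketch of this normalized-cut version, assuming scipy, which solves the generalized symmetric eigenproblem $(D - W)q = \lambda D q$ directly and then splits by sign. It requires all vertex degrees to be positive so that $D$ is positive definite; the function name is illustrative.

```python
import numpy as np
from scipy.linalg import eigh

def normalized_cut_bipartition(W):
    """2-way normalized cut: solve (D - W) q = lambda * D q, then split by sign."""
    d = W.sum(axis=1)            # vertex degrees (must all be positive)
    D = np.diag(d)
    _, eigvecs = eigh(D - W, D)  # generalized eigenproblem, ascending eigenvalues
    q = eigvecs[:, 1]            # skip the trivial eigenvector with eigenvalue 0
    A = q >= 0
    return A, ~A
```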
Image Segmentation
Non-negative Matrix Factorization