Meeting 05: Clustering Applications
TRANSCRIPT
Overview of Clustering
Rong Jin
Outline
K-means for clustering
Expectation Maximization algorithm for clustering
Spectral clustering (if time permits)
Clustering
[Figure: scatter plot of data points; axis labels: $$$, age]
Find out the underlying structure of given data points
Application (I): Search Result Clustering
Application (II): Navigation
Application (III): Google News
Application (IV): Visualization
Islands of Music (Pampalk et al., KDD '03)
Application (V): Image Compression
http://www.ece.neu.edu/groups/rpl/kmeans/
How to Find a Good Clustering?
Minimize the sum of distances within clusters
[Figure: data points grouped into clusters C1, C2, C3, C4, C5]

$$\{m_{i,j}\},\{C_j\} \;=\; \underset{\{m_{i,j}\},\,\{C_j\}}{\arg\min}\;\sum_{i=1}^{n}\sum_{j=1}^{6} m_{i,j}\,\bigl\lVert x_i - C_j \bigr\rVert_2^{2}$$

where

$$m_{i,j}=\begin{cases}1 & x_i \text{ belongs to the } j\text{-th cluster}\\[2pt]0 & x_i \text{ does not belong to the } j\text{-th cluster}\end{cases}
\qquad
\sum_{j=1}^{6} m_{i,j}=1 \;\;(\text{each } x_i \text{ belongs to a single cluster})$$
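As a concrete reference, here is a minimal numpy sketch of this objective. The function name and the 0/1 membership-matrix representation are illustrative choices, not part of the slides.

```python
import numpy as np

def kmeans_objective(X, centers, membership):
    """Sum of squared distances from each point to the center of its cluster.

    X          : (n, d) data points x_i
    centers    : (K, d) cluster centers C_j
    membership : (n, K) 0/1 matrix m_ij with exactly one 1 per row
    """
    # squared distance from every point to every center: shape (n, K)
    sq_dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    # only the assigned center contributes for each point
    return float((membership * sq_dist).sum())
```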
How to Efficiently Cluster Data?
The memberships $m_{i,j}$ and the centers $C_j$ in the objective above are correlated, so optimize them alternately.

Given the memberships $m_{i,j}$:

$$C_j \;=\; \frac{\sum_{i=1}^{n} m_{i,j}\, x_i}{\sum_{i=1}^{n} m_{i,j}}$$

Given the centers $\{C_j\}$:

$$m_{i,j} \;=\; \begin{cases}1 & j = \arg\min_k \lVert x_i - C_k\rVert^2\\[2pt]0 & \text{otherwise}\end{cases}$$
K-means for Clustering
Start with a random guess of the cluster centers
Determine the membership of each data point
Adjust the cluster centers
Repeat the last two steps until the centers stop changing
K-means
1. Ask user how many clusters they'd like (e.g., k = 5).
2. Randomly guess k cluster Center locations.
3. Each data point finds out which Center it is closest to (thus each Center "owns" a set of data points).
4. Each Center finds the centroid of the points it owns.
Any Computational Problem?
Computational Complexity: O(N) per iteration, where N is the number of points (each of the N points is compared with every center). Can this be reduced?
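For reference, a minimal numpy sketch of the full loop (steps 1 to 4 above, repeated until the centers stop moving). The function name, the initialization by sampling k data points, and the iteration cap are illustrative assumptions rather than part of the slides.

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Plain K-means: alternate the assignment step and the center update."""
    rng = np.random.default_rng(seed)
    # Steps 1-2: pick k random data points as the initial cluster Centers
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Step 3: each data point finds the Center it is closest to
        sq_dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = sq_dist.argmin(axis=1)
        # Step 4: each Center moves to the centroid of the points it owns
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):  # centers stopped moving
            break
        centers = new_centers
    return centers, labels
```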
Improved K-means
Group points by region: KD-tree, SR-tree
Key difference: find the closest center for each rectangle, and assign all the points within a rectangle to one cluster
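The rectangle-based pruning above needs a full KD-tree traversal; as a lighter illustration of the same idea (use a spatial index instead of a brute-force distance matrix), here is a sketch that builds a KD-tree on the current centers to speed up the assignment step. It assumes scipy is available and is not the rectangle-pruning algorithm described in the slide.

```python
from scipy.spatial import cKDTree

def assign_with_kdtree(X, centers):
    """Assignment step of K-means accelerated with a KD-tree on the centers."""
    tree = cKDTree(centers)          # spatial index over the current centers
    _, labels = tree.query(X, k=1)   # nearest-center index for each point
    return labels
```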
A Gaussian Mixture Model for Clustering
Assume that the data are generated from a mixture of Gaussian distributions
For each Gaussian distribution: center $\mu_j$ and variance $\sigma_j^2$ (ignored here)
For each data point: determine the membership $z_{ij}$, i.e., whether $x_i$ belongs to the $j$-th cluster
Learning a Gaussian Mixture (with known covariance)
Probability $p(x = x_i)$:

$$p(x=x_i) \;=\; \sum_j p(x=x_i,\,\mu_j) \;=\; \sum_j p(\mu_j)\, p(x=x_i \mid \mu_j)
\;=\; \sum_j p(\mu_j)\,\frac{1}{(2\pi\sigma^2)^{d/2}}\exp\!\left(-\frac{\lVert x_i-\mu_j\rVert^2}{2\sigma^2}\right)$$
Log-likelihood of the data; apply MLE to find the optimal parameters $\{p(\mu_j),\,\mu_j\}$:

$$\log p(X) \;=\; \sum_i \log p(x=x_i)
\;=\; \sum_i \log\!\left[\sum_j p(\mu_j)\,\frac{1}{(2\pi\sigma^2)^{d/2}}\exp\!\left(-\frac{\lVert x_i-\mu_j\rVert^2}{2\sigma^2}\right)\right]$$
Learning a Gaussian Mixture (with known covariance)

E-Step:

$$E[z_{ij}] \;=\; p(\mu_j \mid x=x_i)
\;=\; \frac{p(x=x_i \mid \mu_j)\, p(\mu_j)}{\sum_{n=1}^{k} p(x=x_i \mid \mu_n)\, p(\mu_n)}
\;=\; \frac{\exp\!\left(-\frac{1}{2\sigma^2}\lVert x_i-\mu_j\rVert^2\right) p(\mu_j)}{\sum_{n=1}^{k}\exp\!\left(-\frac{1}{2\sigma^2}\lVert x_i-\mu_n\rVert^2\right) p(\mu_n)}$$
Learning a Gaussian Mixture (with known covariance)

M-Step:

$$\mu_j \;=\; \frac{\sum_{i=1}^{m} E[z_{ij}]\, x_i}{\sum_{i=1}^{m} E[z_{ij}]},
\qquad
p(\mu_j) \;=\; \frac{1}{m}\sum_{i=1}^{m} E[z_{ij}]$$
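A compact sketch of this EM procedure, assuming numpy and a known, shared spherical variance sigma^2. The initialization, the fixed iteration count, and the numerical-stability shift in the E-step are illustrative choices, not part of the slides.

```python
import numpy as np

def gmm_em(X, k, sigma2=1.0, n_iters=100, seed=0):
    """EM for a Gaussian mixture with known, shared spherical variance sigma^2."""
    rng = np.random.default_rng(seed)
    m, d = X.shape
    mu = X[rng.choice(m, size=k, replace=False)]   # initial centers mu_j
    prior = np.full(k, 1.0 / k)                    # initial p(mu_j)
    for _ in range(n_iters):
        # E-step: E[z_ij] proportional to exp(-||x_i - mu_j||^2 / (2 sigma^2)) p(mu_j)
        sq_dist = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        shifted = sq_dist - sq_dist.min(axis=1, keepdims=True)  # avoid underflow
        resp = np.exp(-shifted / (2.0 * sigma2)) * prior
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: mu_j = sum_i E[z_ij] x_i / sum_i E[z_ij];  p(mu_j) = (1/m) sum_i E[z_ij]
        weight = resp.sum(axis=0)
        mu = (resp.T @ X) / weight[:, None]
        prior = weight / m
    return mu, prior, resp
```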
Gaussian Mixture Example: Start
After First Iteration
After 2nd Iteration
After 3rd Iteration
After 4th Iteration
After 5th Iteration
After 6th Iteration
After 20th Iteration
Mixture Model for Doc Clustering
A set of language models $\theta_1, \theta_2, \ldots, \theta_K$, where

$$\theta_i = \{\,p(w_1\mid\theta_i),\; p(w_2\mid\theta_i),\; \ldots,\; p(w_V\mid\theta_i)\,\}$$

Probability $p(d = d_i)$:

$$p(d=d_i) \;=\; \sum_j p(d=d_i,\,\theta_j)
\;=\; \sum_j p(\theta_j)\, p(d=d_i\mid\theta_j)
\;=\; \sum_j p(\theta_j)\prod_{k=1}^{V} p(w_k\mid\theta_j)^{\,tf(w_k,\,d_i)}$$
Introduce the hidden variable $z_{ij}$: document $d_i$ is generated by the $j$-th language model $\theta_j$.
Learning a Mixture Model

E-Step ($K$: number of language models):

$$E[z_{ij}] \;=\; p(\theta_j \mid d_i)
\;=\; \frac{p(d_i\mid\theta_j)\,p(\theta_j)}{\sum_{n=1}^{K} p(d_i\mid\theta_n)\,p(\theta_n)}
\;=\; \frac{\prod_{m=1}^{V} p(w_m\mid\theta_j)^{\,tf(w_m,\,d_i)}\; p(\theta_j)}{\sum_{n=1}^{K}\prod_{m=1}^{V} p(w_m\mid\theta_n)^{\,tf(w_m,\,d_i)}\; p(\theta_n)}$$
Learning a Mixture Model

M-Step ($N$: number of documents):

$$p(\theta_j) \;=\; \frac{1}{N}\sum_{i=1}^{N} E[z_{ij}],
\qquad
p(w_k\mid\theta_j) \;=\; \frac{\sum_{i=1}^{N} E[z_{ij}]\; tf(w_k,\,d_i)}{\sum_{i=1}^{N} E[z_{ij}]\;\lvert d_i\rvert}$$
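A sketch of the same E-step and M-step for document clustering over a term-frequency matrix, assuming numpy. Working in log space and adding a small smoothing constant to the word counts are practical additions for numerical robustness, not part of the slides.

```python
import numpy as np

def doc_mixture_em(tf, K, n_iters=50, seed=0):
    """EM for a mixture of unigram language models over term frequencies.

    tf : (N, V) matrix of term frequencies tf(w_k, d_i)
    """
    rng = np.random.default_rng(seed)
    N, V = tf.shape
    # random initial language models p(w|theta_j) and priors p(theta_j)
    p_w = rng.random((K, V)) + 1e-3
    p_w /= p_w.sum(axis=1, keepdims=True)
    prior = np.full(K, 1.0 / K)
    for _ in range(n_iters):
        # E-step: log p(d_i|theta_j) = sum_k tf(w_k, d_i) log p(w_k|theta_j)
        log_like = tf @ np.log(p_w).T + np.log(prior)    # shape (N, K)
        log_like -= log_like.max(axis=1, keepdims=True)  # stabilize before exp
        resp = np.exp(log_like)
        resp /= resp.sum(axis=1, keepdims=True)          # E[z_ij]
        # M-step
        prior = resp.mean(axis=0)                        # p(theta_j)
        p_w = resp.T @ tf + 1e-6                         # smoothed expected counts
        p_w /= p_w.sum(axis=1, keepdims=True)            # p(w_k|theta_j)
    return p_w, prior, resp
```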
Examples of Mixture Models
Other Mixture Models: Probabilistic Latent Semantic Indexing (PLSI), Latent Dirichlet Allocation (LDA)
Problems (I)
Both k-means and mixture models need to compute the centers of clusters and require an explicit distance measure
Given a strange distance measure, the centers of clusters can be hard to compute, e.g.,

$$\lVert x - x'\rVert \;=\; \max\bigl(\,\lvert x_1 - x'_1\rvert,\; \lvert x_2 - x'_2\rvert,\; \ldots,\; \lvert x_n - x'_n\rvert\,\bigr)$$

[Figure: points x, y, z compared under this distance]
Problems (II)
Both k-means and mixture models look for compact clustering structures
In some cases, connected clustering structures are more desirable
Graph Partition
MinCut: partition the graph into two parts with the minimal number of cut edges
[Figure: example partition with CutSize = 2]
2-way Spectral Graph Partitioning
Weight matrix $W$; $w_{i,j}$: the weight between two vertices $i$ and $j$
Membership vector $q$:

$$q_i = \begin{cases}+1 & \text{vertex } i \in \text{cluster } A\\[2pt]-1 & \text{vertex } i \in \text{cluster } B\end{cases}$$

$$\min_{q\in\{-1,+1\}^n} J \;=\; \text{CutSize} \;=\; \frac{1}{4}\sum_{i,j=1}^{n} w_{i,j}\,(q_i - q_j)^2$$
Solving the Optimization Problem

$$q^{*} \;=\; \underset{q\in\{-1,+1\}^n}{\arg\min}\;\frac{1}{4}\sum_{i,j} w_{i,j}\,(q_i - q_j)^2$$

Directly solving the above problem requires a combinatorial search, which has exponential complexity
How can we reduce the computational complexity?
Relaxation Approach
Key difficulty: each $q_i$ has to be either $-1$ or $+1$
Relax $q_i$ to be any real number and impose the constraint $\sum_{i=1}^{n} q_i^2 = n$

$$J \;=\; \frac{1}{4}\sum_{i,j} w_{i,j}\,(q_i - q_j)^2
\;=\; \frac{1}{4}\sum_{i,j} w_{i,j}\,(q_i^2 + q_j^2 - 2 q_i q_j)
\;=\; \frac{1}{2}\sum_{i} q_i^2\, d_i \;-\; \frac{1}{2}\sum_{i,j} q_i\, q_j\, w_{i,j}$$

where $d_i = \sum_j w_{i,j}$ and $D = \mathrm{diag}(d_1, \ldots, d_n)$, so that

$$J \;=\; \frac{1}{2}\, q^{\top}(D - W)\, q$$
Relaxation Approach

$$q^{*} \;=\; \arg\min_{q} J(q) \;=\; \arg\min_{q}\; q^{\top}(D - W)\,q
\qquad \text{subject to } \sum_{k} q_k^2 = n$$

Solution: the eigenvector of $D - W$ with the second smallest eigenvalue:

$$(D - W)\,q \;=\; \lambda_2\, q$$
Graph Laplacian

$$L = D - W, \qquad d_i = \sum_j w_{i,j}, \qquad D = \mathrm{diag}(d_1, \ldots, d_n)$$

$L$ is a positive semi-definite matrix: for any $x$, $x^{\top} L\, x = \frac{1}{2}\sum_{i,j} w_{i,j}(x_i - x_j)^2 \ge 0$
The minimum eigenvalue is $\lambda_1 = 0$ (its eigenvector is the constant all-ones vector)
The eigenvalues satisfy $0 = \lambda_1 \le \lambda_2 \le \lambda_3 \le \cdots \le \lambda_n$
The second smallest eigenvalue $\lambda_2$ gives the best bipartition of the graph
Recovering Partitions
Due to the relaxation, $q_i$ can be any real number (not just $-1$ and $+1$)
How do we construct a partition based on the eigenvector?
Simple strategy: $A = \{\,i \mid q_i \ge 0\,\}$, $B = \{\,i \mid q_i < 0\,\}$
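A minimal numpy sketch of this 2-way partition: build $L = D - W$, take the eigenvector of the second smallest eigenvalue, and split by sign. The function name and the boolean-mask return format are illustrative.

```python
import numpy as np

def spectral_bipartition(W):
    """2-way partition from the second smallest eigenvector of L = D - W."""
    d = W.sum(axis=1)
    L = np.diag(d) - W               # graph Laplacian
    _, eigvecs = np.linalg.eigh(L)   # eigenvalues returned in ascending order
    q = eigvecs[:, 1]                # eigenvector of the second smallest eigenvalue
    A = q >= 0                       # simple strategy: split by sign
    return A, ~A
```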
Spectral Clustering
Minimum cut does not balance the sizes of the two parts of the partition

Normalized Cut (Shi & Malik, 1997)
Minimize the similarity between clusters and meanwhile maximize the similarity within clusters

$$s(A,B) = \sum_{i\in A,\, j\in B} w_{i,j}, \qquad d_A = \sum_{i\in A} d_i, \qquad d_B = \sum_{i\in B} d_i, \qquad d = \sum_j d_j$$

$$J \;=\; \frac{s(A,B)}{d_A} + \frac{s(A,B)}{d_B}
\;=\; \sum_{i\in A,\, j\in B} w_{i,j}\,\frac{d_A + d_B}{d_A\, d_B}
\;=\; \sum_{i\in A,\, j\in B} w_{i,j}\,\frac{(d_A + d_B)^2}{d_A\, d_B\, d}
\;=\; \frac{1}{2}\sum_{i,j} w_{i,j}\,(q_i - q_j)^2$$

with

$$q_i = \begin{cases}+\sqrt{d_B/(d\, d_A)} & \text{if } i \in A\\[2pt]-\sqrt{d_A/(d\, d_B)} & \text{if } i \in B\end{cases}$$
Normalized Cut

$$J \;=\; \frac{1}{2}\sum_{i,j} w_{i,j}\,(q_i - q_j)^2 \;=\; q^{\top}(D - W)\,q,
\qquad
q_i = \begin{cases}+\sqrt{d_B/(d\, d_A)} & \text{if } i \in A\\[2pt]-\sqrt{d_A/(d\, d_B)} & \text{if } i \in B\end{cases}$$
Normalized Cut
Relax $q$ to real values under the constraints

$$q^{\top} D\, q = 1, \qquad q^{\top} D\, e = 0$$

Solution: the generalized eigenvalue problem

$$(D - W)\,q \;=\; \lambda\, D\, q$$
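A sketch of this normalized-cut version, assuming scipy, which solves the generalized symmetric eigenproblem $(D - W)q = \lambda D q$ directly and then splits by sign. It requires all vertex degrees to be positive so that $D$ is positive definite; the function name is illustrative.

```python
import numpy as np
from scipy.linalg import eigh

def normalized_cut_bipartition(W):
    """2-way normalized cut: solve (D - W) q = lambda * D q, then split by sign."""
    d = W.sum(axis=1)            # vertex degrees (must all be positive)
    D = np.diag(d)
    _, eigvecs = eigh(D - W, D)  # generalized eigenproblem, ascending eigenvalues
    q = eigvecs[:, 1]            # skip the trivial eigenvector with eigenvalue 0
    A = q >= 0
    return A, ~A
```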
Image Segmentation
Non-negative Matrix Factorization