What is clustering?

  • Clustering is the process of partitioning a dataset into groups so that similar observations end up together.
    • Observations with the same label should be as homogeneous as possible.
    • Observations with different labels should be as heterogeneous as possible.

Where is it used?

  • It is commonly used for cluster analysis in recommender systems.

Types of clustering

Exclusive Clustering

  • Forms strict clusters: each observation belongs to exactly one cluster, so clusters do not overlap.
  • K-Means clustering is the representative example.

Overlapping Clustering

  • An observation can carry several labels, that is, belong to more than one cluster at once.
  • Also called soft clustering.
  • Fuzzy C-Means clustering is a typical example.
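To make "soft" membership concrete, here is a minimal sketch of the membership formula commonly used in Fuzzy C-Means, evaluated for fixed centers. The function name `fuzzy_memberships`, the sample points, and the fuzzifier `m=2` are assumptions for illustration. A full Fuzzy C-Means run would also update the centers iteratively, which this sketch leaves out.

```python
import numpy as np

def fuzzy_memberships(points, centers, m=2.0):
    """Return an (n_points, n_centers) matrix of soft memberships.

    u[i, j] = 1 / sum_k (d[i, j] / d[i, k]) ** (2 / (m - 1))
    where d[i, j] is the Euclidean distance from point i to center j.
    """
    d = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
    d = np.maximum(d, 1e-12)  # avoid division by zero when a point sits on a center
    ratio = (d[:, :, None] / d[:, None, :]) ** (2.0 / (m - 1.0))
    return 1.0 / ratio.sum(axis=2)

# Hypothetical data: three points on a line, two centers at the ends.
points = np.array([[0.0, 0.0], [5.0, 0.0], [10.0, 0.0]])
centers = np.array([[0.0, 0.0], [10.0, 0.0]])

memberships = fuzzy_memberships(points, centers)
print(memberships)  # each row sums to 1; the middle point is split evenly
```

Unlike the hard labels of K-Means, every row is a set of weights over all clusters, so a point halfway between two centers belongs to both equally.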

Hierarchical Clustering

  • Repeatedly groups the two closest observations (or groups) together, building up a hierarchy of clusters.
  • The representative technique is dendrogram-based clustering, where the hierarchy is drawn as a tree (a dendrogram).
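As a rough sketch of the idea, the snippet below builds a hierarchy with SciPy's `scipy.cluster.hierarchy` module, which these notes do not otherwise use. Ward linkage and the cut into at most 3 clusters are assumptions chosen for illustration. The x/y values are the same sample data as in the K-Means code in these notes.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram

x = [12, 20, 28, 18, 29, 33, 24, 45, 45, 52, 51, 52, 55, 53, 55, 61, 64, 69, 72]
y = [39, 36, 30, 52, 54, 46, 55, 59, 63, 70, 66, 63, 58, 23, 14, 8, 19, 7, 24]
X = np.column_stack([x, y]).astype(float)

# Each row of Z records one merge: the two groups joined, their distance,
# and the size of the resulting group.
Z = linkage(X, method="ward")

# Cut the tree so that at most 3 flat clusters remain (labels start at 1).
labels = fcluster(Z, t=3, criterion="maxclust")

# no_plot=True returns the dendrogram layout without drawing it;
# drop the argument in a notebook to see the tree.
tree = dendrogram(Z, no_plot=True)

print(labels)
print(tree["ivl"])  # leaf order along the bottom of the dendrogram
```

Unlike K-Means, the number of clusters is chosen after the hierarchy is built, by deciding where to cut the tree.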

K-Means clustering

  • K is the number of clusters.

Clustering process

  • Step 1: Choose the number of clusters K (3 in the example).
  • Step 2: Pick K data points at random as the initial cluster centers.
  • Step 3: Going through the observations in order, starting from the first, find the nearest cluster center for each one (the distance computation depends on the dimensionality of the data).
  • Step 4: Assign the data point the label of its nearest cluster.
  • Step 5: Using the updated assignments, compute the mean of each attribute within each cluster.
  • Step 6: Treat the computed mean as the center of that cluster and use it as the reference for the subsequent distance calculations.
  • The cluster mean computed this way is called the centroid.

(Repeat Steps 3 to 6 until the assignments stop changing.)
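The steps above can be sketched compactly in NumPy. This is a minimal batch version (all points are reassigned before the centroids move), written under assumptions: the function name `kmeans_sketch`, the seed, the iteration cap, and the toy data are illustrative, and Euclidean distance is assumed.

```python
import numpy as np

def kmeans_sketch(points, k, seed=0, max_iter=100):
    """Batch K-Means on an (n, d) array; returns (labels, centroids)."""
    rng = np.random.default_rng(seed)
    # Step 2: pick k distinct data points as the initial centroids.
    start = rng.choice(len(points), size=k, replace=False)
    centroids = points[start].astype(float)
    labels = np.full(len(points), -1)
    for _ in range(max_iter):
        # Steps 3-4: distance from every point to every centroid,
        # then label each point with the index of the nearest centroid.
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break  # assignments stopped changing
        labels = new_labels
        # Steps 5-6: move each centroid to the mean of its members.
        for j in range(k):
            members = points[labels == j]
            if len(members):  # keep the old centroid if a cluster is empty
                centroids[j] = members.mean(axis=0)
    return labels, centroids

# Hypothetical toy data: two well-separated groups of three points.
points = np.array([[1.0, 1.0], [1.5, 2.0], [1.0, 0.5],
                   [8.0, 8.0], [9.0, 9.5], [8.5, 9.0]])
labels, centroids = kmeans_sketch(points, k=2)
print(labels)
print(centroids)
```

Which group receives label 0 depends on the random start, so only the grouping itself is meaningful, not the label numbers.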

Tuning K-Means

Adjusting the initial data points

  • K-Means sometimes groups the data worse than a heuristic (by-eye) classification would.
  • In that case, rerun it with different initial data points and tune toward the run that gives the smallest within-cluster variance.

1. Compute the variance of the first run.

2. Change the initial data points (iteration 2).

3. Compare the variance with that of the previous model.
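One way to try this comparison is with scikit-learn's `KMeans`, which these notes already use. The sketch below assumes that "variance" is measured as the within-cluster sum of squared distances, which scikit-learn exposes as `inertia_`. The number of restarts (5) and the use of `random_state` to vary the starting points are also assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

x = [12, 20, 28, 18, 29, 33, 24, 45, 45, 52, 51, 52, 55, 53, 55, 61, 64, 69, 72]
y = [39, 36, 30, 52, 54, 46, 55, 59, 63, 70, 66, 63, 58, 23, 14, 8, 19, 7, 24]
X = np.column_stack([x, y]).astype(float)

results = {}
for seed in range(5):
    # init="random" with n_init=1 means one run from one random set of
    # starting points, so each seed is a separate "iteration" of the tuning loop.
    model = KMeans(n_clusters=3, init="random", n_init=1, random_state=seed).fit(X)
    results[seed] = model.inertia_

best_seed = min(results, key=results.get)
for seed, inertia in results.items():
    print(f"seed={seed}  within-cluster sum of squares={inertia:.1f}")
print("best start:", best_seed)
```

In practice `KMeans` can do this internally: setting `n_init` to a value greater than 1 runs several starts and keeps the one with the lowest inertia.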

Adjusting K

K can be tuned the same way, by comparing the variance obtained for different values of K.

As K increases, the variance drops quickly at first and then levels off. The K at that bend is called the elbow point.
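A small sketch of the elbow comparison, again assuming the within-cluster sum of squares (`inertia_` in scikit-learn) stands in for "variance". The range of K values tried (1 to 7) and the fixed `random_state` are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

x = [12, 20, 28, 18, 29, 33, 24, 45, 45, 52, 51, 52, 55, 53, 55, 61, 64, 69, 72]
y = [39, 36, 30, 52, 54, 46, 55, 59, 63, 70, 66, 63, 58, 23, 14, 8, 19, 7, 24]
X = np.column_stack([x, y]).astype(float)

inertias = {}
for k in range(1, 8):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = model.inertia_

# Print the drop from one K to the next; the elbow is where the drop
# becomes small compared with the earlier ones.
previous = None
for k, inertia in inertias.items():
    drop = "" if previous is None else f"  (drop {previous - inertia:.1f})"
    print(f"K={k}: {inertia:.1f}{drop}")
    previous = inertia
```

Plotting `inertias` against K with matplotlib shows the bend visually. Picking the elbow is a judgment call rather than an exact rule.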

Applying K-Means in two dimensions

K-Means Vanilla Code

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

df = pd.DataFrame({
    'x': [12,20,28,18,29,33,24,45,45,52,51,52,55,53,55,61,64,69,72],
    'y': [39,36,30,52,54,46,55,59,63,70,66,63,58,23,14,8,19,7,24]
})

np.random.seed(200)
k = 3
# centroids[i] = [x, y]
centroids = {
    i+1: [np.random.randint(0, 80), np.random.randint(0,80)]
    for i in range(k)
}

fig = plt.figure(figsize=(5,5))
plt.scatter(df['x'], df['y'], color='k')
colmap = {1: 'r', 2: 'g', 3: 'b'}
for i in centroids.keys():
    plt.scatter(*centroids[i], color=colmap[i])
plt.xlim(0,80)
plt.ylim(0,80)
plt.show()

# Assignment Stage
def assignment(df, centroids):
    for i in centroids.keys():
        # Euclidean distance: sqrt((x1 - x2)^2 + (y1 - y2)^2)
        df['distance_from_{}'.format(i)] = (
            np.sqrt(
                (df['x'] - centroids[i][0])**2
                + (df['y'] - centroids[i][1])**2
            )
        )
    centroid_distance_cols = ['distance_from_{}'.format(i) for i in centroids.keys()]
    df['closest'] = df.loc[:, centroid_distance_cols].idxmin(axis=1)
    # Drop the column prefix and convert to an int label (str.lstrip strips a character set, not a prefix).
    df['closest'] = df['closest'].map(lambda x: int(x.replace('distance_from_', '')))
    df['color'] = df['closest'].map(lambda x: colmap[x])  # map each label to its plot color
    return df

df = assignment(df, centroids)
print(df.head())

fig = plt.figure(figsize=(5,5))
plt.scatter(df['x'], df['y'], color=df['color'], alpha=0.5, edgecolor='k')
for i in centroids.keys():
    plt.scatter(*centroids[i], color=colmap[i])
plt.xlim(0, 80)
plt.ylim(0, 80)
plt.show()

# Update Stage
import copy

old_centroids = copy.deepcopy(centroids)

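def update(centroids):
    # Move each centroid to the mean position of the points currently assigned to it.
    for i in centroids.keys():
        members = df[df['closest'] == i]  # rows of the global df labelled with cluster i
        centroids[i][0] = np.mean(members['x'])
        centroids[i][1] = np.mean(members['y'])
    return centroids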

centroids = update(centroids)

fig = plt.figure(figsize=(5,5))
ax = plt.axes()
plt.scatter(df['x'], df['y'], color=df['color'], alpha=0.5, edgecolor='k')
for i in centroids.keys():
    plt.scatter(*centroids[i], color=colmap[i])
plt.xlim(0,80)
plt.ylim(0,80)
for i in old_centroids.keys():                        # draw an arrow from each old centroid toward its new position (75% of the move)
    old_x = old_centroids[i][0]
    old_y = old_centroids[i][1]
    dx = (centroids[i][0] - old_centroids[i][0])*0.75
    dy = (centroids[i][1] - old_centroids[i][1])*0.75
    ax.arrow(old_x, old_y, dx, dy, head_width=2, head_length=3, fc=colmap[i], ec=colmap[i])
plt.show()

# Repeat Assignment Stage
df = assignment(df, centroids)

# Plot results
fig = plt.figure(figsize=(5,5))
plt.scatter(df['x'], df['y'], color=df['color'], alpha=0.5, edgecolor='k')
for i in centroids.keys():
    plt.scatter(*centroids[i], color=colmap[i])
plt.xlim(0, 80)
plt.ylim(0,80)
plt.show()

# Continue until the assigned categories no longer change
while True:
    closest_centroids = df['closest'].copy(deep=True)
    centroids = update(centroids)
    df = assignment(df, centroids)
    if closest_centroids.equals(df['closest']):
        break
        
fig = plt.figure(figsize=(5,5))
plt.scatter(df['x'], df['y'], color=df['color'], alpha=0.5, edgecolor='k')
for i in centroids.keys():
    plt.scatter(*centroids[i], color=colmap[i])
plt.xlim(0,80)
plt.ylim(0,80)
plt.show()

df = pd.DataFrame({
    'x': [12,20,28,18,29,33,24,45,45,52,51,52,55,53,55,61,64,69,72],
    'y': [39,36,30,52,54,46,55,59,63,70,66,63,58,23,14,8,19,7,24]
})

from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=3)
kmeans.fit(df)
labels = kmeans.predict(df)
centroids = kmeans.cluster_centers_
fig = plt.figure(figsize=(5,5))
colors1 = [colmap[label + 1] for label in labels]  # sklearn labels start at 0, colmap keys at 1
plt.scatter(df['x'], df['y'], color=colors1, alpha=0.5, edgecolor='k')
for idx, centroid in enumerate(centroids):
    plt.scatter(*centroid, color=colmap[idx+1])
plt.xlim(0, 80)
plt.ylim(0, 80)
plt.show()