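What is clustering?
- Clustering is the process of dividing a dataset into groups so that similar observations end up together.
- Observations that share a label should be as homogeneous as possible,
- while observations with different labels should be as heterogeneous as possible.
Where is it used?
- It is commonly used for cluster analysis in recommender systems.
Types of clustering
Exclusive Clustering
- Forms strict clusters: each observation belongs to exactly one cluster (no overlap).
- K-Means clustering is a representative example.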
Overlapping Clustering
- An observation can belong to several clusters (carry several labels) at once.
- Also called soft clustering.
- Fuzzy C-Means clustering is a representative example.
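To make "several labels at once" concrete, here is a minimal sketch of the membership weights that Fuzzy C-Means assigns for fixed cluster centers. The function name `soft_memberships`, the toy points and centers, and the fuzzifier `m=2` are all assumptions made for illustration; a full Fuzzy C-Means run would also update the centers.

```python
import numpy as np

def soft_memberships(points, centers, m=2.0):
    """Fuzzy C-Means style membership weights for fixed centers (illustrative)."""
    # Distance from every point to every center, shape (n_points, n_centers).
    dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
    dists = np.maximum(dists, 1e-12)  # avoid division by zero on a center
    # Closer centers get larger weights; each row is normalized to sum to 1.
    inv = dists ** (-2.0 / (m - 1.0))
    return inv / inv.sum(axis=1, keepdims=True)

# Hypothetical toy data: three points on a line between two centers.
points = np.array([[1.0, 0.0], [5.0, 0.0], [9.0, 0.0]])
centers = np.array([[0.0, 0.0], [10.0, 0.0]])

memberships = soft_memberships(points, centers)
print(np.round(memberships, 3))
```

Each row holds one observation's degree of membership in every cluster, so the middle point is shared equally between the two clusters instead of being forced into one.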
Hierarchical Clustering
- Groups observations pairwise, building up a hierarchy of clusters.
- A representative technique is dendrogram-based (agglomerative) clustering.
K-Means Clustering
- K is the number of clusters.
Clustering procedure
- Step 1: Choose the number of clusters K (3 in the example).
- Step 2: Pick K data points at random as the initial cluster centers.
- Step 3: Going through the observations in order from the first one, find the nearest cluster for each (the distance calculation depends on the number of dimensions).
- Step 4: Assign the data point the label of its nearest cluster.
- Step 5: Using the updated clusters, compute the mean of each attribute within each cluster.
- Step 6: Treat the computed mean as the cluster's core and use it as the reference for the distance calculation to the next data point.
- The cluster mean computed this way is called the centroid.
(Repeat)
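The steps above can be condensed into a short NumPy sketch. Note that this is the batch variant (all points are assigned first, then every centroid is updated), not a point-by-point update; the function name `kmeans`, the `seed` and `max_iter` parameters, and the empty-cluster handling are assumptions made for illustration.

```python
import numpy as np

def kmeans(points, k, seed=0, max_iter=100):
    """Batch K-Means sketch; returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    # Step 2: pick k distinct data points at random as the initial centroids.
    centroids = points[rng.choice(len(points), size=k, replace=False)].astype(float)
    labels = None
    for _ in range(max_iter):
        # Steps 3-4: Euclidean distance from every point to every centroid,
        # then label each point with its nearest centroid.
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break  # assignments stopped changing
        labels = new_labels
        # Steps 5-6: move each centroid to the mean of its assigned points.
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:  # keep the old centroid if a cluster is empty
                centroids[j] = members.mean(axis=0)
    return centroids, labels

x = [12, 20, 28, 18, 29, 33, 24, 45, 45, 52, 51, 52, 55, 53, 55, 61, 64, 69, 72]
y = [39, 36, 30, 52, 54, 46, 55, 59, 63, 70, 66, 63, 58, 23, 14, 8, 19, 7, 24]
points = np.column_stack([x, y]).astype(float)

centroids, labels = kmeans(points, k=3)
print(centroids)
print(labels)
```

The loop stops as soon as a full pass leaves every label unchanged, which is the "(Repeat)" condition in the list.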
Tuning K-Means
Adjusting the initial data points
- Even after running K-Means, the clustering can turn out worse than a heuristic grouping.
- In that case, rerun the algorithm with different initial data points and tune toward a lower variance within each cluster.
1. Compute the variance of the first run
2. Change the initial data points (repeat from Step 2)
3. Compare the variance with the previous model
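A rough sketch of this restart loop, assuming the within-cluster sum of squared distances as the "variance" being compared. The helper names `kmeans_from` and `within_cluster_ss`, the 10 attempts, and the random seed are illustrative choices, not part of the original notes.

```python
import numpy as np

x = [12, 20, 28, 18, 29, 33, 24, 45, 45, 52, 51, 52, 55, 53, 55, 61, 64, 69, 72]
y = [39, 36, 30, 52, 54, 46, 55, 59, 63, 70, 66, 63, 58, 23, 14, 8, 19, 7, 24]
points = np.column_stack([x, y]).astype(float)

def kmeans_from(points, init_centroids, max_iter=100):
    """Batch K-Means started from the given initial centroids."""
    centroids = np.array(init_centroids, dtype=float)
    for _ in range(max_iter):
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        updated = centroids.copy()
        for j in range(len(centroids)):
            if np.any(labels == j):
                updated[j] = points[labels == j].mean(axis=0)
        if np.allclose(updated, centroids):
            break  # centroids stopped moving
        centroids = updated
    return centroids, labels

def within_cluster_ss(points, centroids, labels):
    """Sum of squared distances from each point to its assigned centroid."""
    return float(((points - centroids[labels]) ** 2).sum())

rng = np.random.default_rng(0)
best = None
history = []
for attempt in range(10):
    # 2. Draw a new set of K = 3 initial data points.
    init = points[rng.choice(len(points), size=3, replace=False)]
    centroids, labels = kmeans_from(points, init)
    # 1. Measure the spread of this run.
    score = within_cluster_ss(points, centroids, labels)
    history.append(score)
    # 3. Keep the run only if it beats the best one so far.
    if best is None or score < best[0]:
        best = (score, centroids, labels)

print("scores per attempt:", np.round(history, 1))
print("best score:", round(best[0], 1))
```

Runs that start from unlucky points can settle in a worse local optimum, which shows up as a higher score in `history`.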
Adjusting K
K can be tuned the same way, by comparing the variance obtained for different values of K.
As K increases, the variance drops and then levels off; the K at that inflection point is called the elbow point.
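One way to look for the elbow, sketched with scikit-learn's `KMeans` and its `inertia_` attribute (the within-cluster sum of squared distances). The range K = 1..6 and the `n_init` and `random_state` values are assumptions chosen for this small dataset.

```python
import numpy as np
from sklearn.cluster import KMeans

x = [12, 20, 28, 18, 29, 33, 24, 45, 45, 52, 51, 52, 55, 53, 55, 61, 64, 69, 72]
y = [39, 36, 30, 52, 54, 46, 55, 59, 63, 70, 66, 63, 58, 23, 14, 8, 19, 7, 24]
points = np.column_stack([x, y]).astype(float)

inertias = {}
for k in range(1, 7):
    # n_init=10 reruns K-Means from 10 initializations and keeps the best run.
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(points)
    inertias[k] = model.inertia_

for k, value in inertias.items():
    print(k, round(value, 1))
```

Plotting `inertias` against K and picking the point where the curve flattens gives the elbow point; the choice is a visual judgment rather than an exact rule.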
Applying K-Means in two dimensions
K-Means Vanilla Code
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
df = pd.DataFrame({
    'x': [12,20,28,18,29,33,24,45,45,52,51,52,55,53,55,61,64,69,72],
    'y': [39,36,30,52,54,46,55,59,63,70,66,63,58,23,14,8,19,7,24]
})
np.random.seed(200)
k = 3
# centroids[i] = [x, y]
centroids = {
    i+1: [np.random.randint(0, 80), np.random.randint(0, 80)]
    for i in range(k)
}
fig = plt.figure(figsize=(5,5))
plt.scatter(df['x'], df['y'], color='k')
colmap = {1: 'r', 2: 'g', 3: 'b'}
for i in centroids.keys():
    plt.scatter(*centroids[i], color=colmap[i])
plt.xlim(0, 80)
plt.ylim(0, 80)
plt.show()
# Assignment Stage
def assignment(df, centroids):
    for i in centroids.keys():
        # Euclidean distance: sqrt((x1 - x2)^2 + (y1 - y2)^2)
        df['distance_from_{}'.format(i)] = (
            np.sqrt(
                (df['x'] - centroids[i][0])**2
                + (df['y'] - centroids[i][1])**2
            )
        )
    centroid_distance_cols = ['distance_from_{}'.format(i) for i in centroids.keys()]
    df['closest'] = df.loc[:, centroid_distance_cols].idxmin(axis=1)
    # Drop the column-name prefix and convert the rest to an int label.
    # (str.lstrip removes a set of characters, not a prefix, so replace() is used.)
    df['closest'] = df['closest'].map(lambda x: int(x.replace('distance_from_', '')))
    df['color'] = df['closest'].map(lambda x: colmap[x])  # map each label to its plot color
    return df
df = assignment(df, centroids)
print(df.head())
fig = plt.figure(figsize=(5,5))
plt.scatter(df['x'], df['y'], color=df['color'], alpha=0.5, edgecolor='k')
for i in centroids.keys():
    plt.scatter(*centroids[i], color=colmap[i])
plt.xlim(0, 80)
plt.ylim(0, 80)
plt.show()
# Update Stage
import copy
old_centroids = copy.deepcopy(centroids)
def update(centroids):
    for i in centroids.keys():
        # New centroid = mean x and mean y of the points currently assigned to cluster i
        centroids[i][0] = np.mean(df[df['closest'] == i]['x'])
        centroids[i][1] = np.mean(df[df['closest'] == i]['y'])
    return centroids
centroids = update(centroids)
fig = plt.figure(figsize=(5,5))
ax = plt.axes()
plt.scatter(df['x'], df['y'], color=df['color'], alpha=0.5, edgecolor='k')
for i in centroids.keys():
    plt.scatter(*centroids[i], color=colmap[i])
plt.xlim(0, 80)
plt.ylim(0, 80)
for i in old_centroids.keys():
    # Draw an arrow from each old centroid toward its updated position
    # (drawn at 75% of the displacement)
    old_x = old_centroids[i][0]
    old_y = old_centroids[i][1]
    dx = (centroids[i][0] - old_centroids[i][0])*0.75
    dy = (centroids[i][1] - old_centroids[i][1])*0.75
    ax.arrow(old_x, old_y, dx, dy, head_width=2, head_length=3, fc=colmap[i], ec=colmap[i])
plt.show()
# Repeat Assignment Stage
df = assignment(df, centroids)
# Plot results
fig = plt.figure(figsize=(5,5))
plt.scatter(df['x'], df['y'], color=df['color'], alpha=0.5, edgecolor='k')
for i in centroids.keys():
    plt.scatter(*centroids[i], color=colmap[i])
plt.xlim(0, 80)
plt.ylim(0,80)
plt.show()
# Continue until the assigned clusters no longer change
while True:
    closest_centroids = df['closest'].copy(deep=True)
    centroids = update(centroids)
    df = assignment(df, centroids)
    if closest_centroids.equals(df['closest']):
        break
fig = plt.figure(figsize=(5,5))
plt.scatter(df['x'], df['y'], color=df['color'], alpha=0.5, edgecolor='k')
for i in centroids.keys():
    plt.scatter(*centroids[i], color=colmap[i])
plt.xlim(0,80)
plt.ylim(0,80)
plt.show()
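# Rebuild the DataFrame so it holds only x and y before fitting with scikit-learn
# (the helper columns added above must not be used as features)
df = pd.DataFrame({
    'x': [12,20,28,18,29,33,24,45,45,52,51,52,55,53,55,61,64,69,72],
    'y': [39,36,30,52,54,46,55,59,63,70,66,63,58,23,14,8,19,7,24]
})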
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=3)
kmeans.fit(df)
labels = kmeans.predict(df)
centroids = kmeans.cluster_centers_
fig = plt.figure(figsize=(5,5))
colors = map(lambda x: colmap[x+1], labels)
colors1 = list(colors)
plt.scatter(df['x'], df['y'], color=colors1, alpha=0.5, edgecolor='k')
for idx, centroid in enumerate(centroids):
    plt.scatter(*centroid, color=colmap[idx+1])
plt.xlim(0, 80)
plt.ylim(0, 80)
plt.show()