Today we will predict Samsung Electronics' stock price using an LSTM, one type of RNN model.

Creating the dataset

If you want to download the dataset, you can get it from the link.

# Requires `import os` and `import pandas as pd`; data_path is the folder holding the CSV
df_price = pd.read_csv(os.path.join(data_path, '01-삼성전자-주가.csv'), encoding='utf8')
df_price.describe()

[Image: output of df_price.describe()]
The columns are 일자 (date), 시가 (open), 고가 (high), 저가 (low), 종가 (close), and 거래량 (volume), and there are 9,288 records in total.
Using this schema, we will predict the 종가 (closing price) at a specific point in the future.

import os # building file paths
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns

Data preprocessing and visualization

Converting the date column (-> datetime)

pd.to_datetime(df_price['일자'], format='%Y%m%d')
# 0      2020-01-07
# 1      2020-01-06
# 2      2020-01-03
# 3      2020-01-02
# 4      2019-12-30

df_price['일자'] = pd.to_datetime(df_price['일자'], format='%Y%m%d')
df_price['연도'] = df_price['일자'].dt.year
df_price['월'] = df_price['일자'].dt.month
df_price['일'] = df_price['일자'].dt.day
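The format string passed to pd.to_datetime uses the same directive codes as Python's standard strptime. As a small standalone illustration (the sample value is made up to match the output shown above and is not read from the CSV):

```python
from datetime import datetime

# Hypothetical raw value in the same YYYYMMDD layout as the date column
raw = "20200107"

# %Y = 4-digit year, %m = zero-padded month, %d = zero-padded day
parsed = datetime.strptime(raw, "%Y%m%d")

# The same pieces that .dt.year / .dt.month / .dt.day expose in pandas
year, month, day = parsed.year, parsed.month, parsed.day
print(parsed.date(), year, month, day)
```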

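Visualizing the stock price since 1990

# Keep 1990 onward and sort oldest-first: the CSV is newest-first, and the windowing later on needs chronological order
df = df_price.loc[df_price['연도'] >= 1990].sort_values('일자').reset_index(drop=True)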

plt.figure(figsize=(16, 9))
sns.lineplot(y=df['종가'], x=df['일자'])
plt.xlabel('time')
plt.ylabel('price')

[Image: line plot of 종가 (closing price) over time since 1990]

Normalizing the data

To help the deep learning model train smoothly, we normalize both the independent variables and the dependent variable.

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
scale_cols = ['시가', '고가', '저가', '종가', '거래량']
df_scaled = scaler.fit_transform(df[scale_cols])

df_scaled = pd.DataFrame(df_scaled)
df_scaled.columns = scale_cols

print(df_scaled)

[Image: printed df_scaled]
Output after every column has been rescaled to the 0 to 1 range.
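MinMaxScaler rescales each column independently as (x - min) / (max - min), so anything the model predicts is also in the 0 to 1 range. A minimal NumPy sketch of the same arithmetic and its inverse, using made-up prices rather than the real data:

```python
import numpy as np

# Made-up closing prices, for illustration only
close = np.array([50000.0, 52000.0, 48000.0, 56000.0])

col_min, col_max = close.min(), close.max()

# Forward transform: map the column onto [0, 1]
scaled = (close - col_min) / (col_max - col_min)

# Inverse transform: recover prices from scaled values
restored = scaled * (col_max - col_min) + col_min

print(scaled)
print(restored)
```

With the fitted scaler from this post, scaler.inverse_transform performs the same inversion, but it expects an array with all five columns it was fitted on.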

Creating the training dataset

We define window_size and use it to build the training data.
window_size is the parameter that sets how many past days of price data are used to predict the next day's closing price.
It is the same concept as the historical data feed size in GCP AutoML.
In this example, we use the past 20 days to predict the following day's value.

TEST_SIZE = 200
WINDOW_SIZE = 20

train = df_scaled[:-TEST_SIZE]
test = df_scaled[-TEST_SIZE:]

Writing a function that builds the dataset

def make_dataset(data, label, window_size=20):
    feature_list = []
    label_list = []
    for i in range(len(data) - window_size):
        feature_list.append(np.array(data.iloc[i:i+window_size]))
        label_list.append(np.array(label.iloc[i+window_size]))
    return np.array(feature_list), np.array(label_list)

The function above groups the data into windows of window_size rows, 20 days in this example.
It steps through the data in order, bundles each 20-day window, pairs it with the matching label, and returns both.
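To make the resulting shapes concrete, here is a toy run of the same windowing logic. The helper name make_windows and the data are hypothetical; it works on plain NumPy arrays instead of DataFrames so it can run on its own.

```python
import numpy as np

def make_windows(features, labels, window_size):
    """NumPy-only counterpart of make_dataset (hypothetical helper name)."""
    xs, ys = [], []
    for i in range(len(features) - window_size):
        xs.append(features[i:i + window_size])  # rows i .. i+window_size-1
        ys.append(labels[i + window_size])      # the row right after the window
    return np.array(xs), np.array(ys)

# Toy data: 6 days, 2 features per day, label = day index
features = np.arange(12).reshape(6, 2)
labels = np.arange(6).reshape(6, 1)

x, y = make_windows(features, labels, window_size=3)
print(x.shape, y.shape)  # (3, 3, 2) (3, 1)
print(y.ravel())         # [3 4 5]
```

Six rows with a window of three give three samples, because the first three rows have no earlier window to be predicted from. This is also why a 200-row test set yields 180 samples with a window of 20.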

Defining the features and label

feature_cols = ['시가', '고가', '저가', '거래량']
label_cols = ['종가']

train_feature = train[feature_cols]
train_label = train[label_cols]

# train dataset
train_feature, train_label = make_dataset(train_feature, train_label, WINDOW_SIZE)

# create the train and validation sets
from sklearn.model_selection import train_test_split
x_train, x_valid, y_train, y_valid = train_test_split(train_feature, train_label, test_size=0.2)

x_train.shape, x_valid.shape
# ((6086, 20, 4), (1522, 20, 4))

# test dataset (the data we will actually predict)
test_feature = test[feature_cols]
test_label = test[label_cols]
test_feature, test_label = make_dataset(test_feature, test_label, 20)
test_feature.shape, test_label.shape
# ((180, 20, 4), (180, 1))

Building the LSTM model

from keras.models import Sequential
from keras.layers import Dense
from keras.callbacks import EarlyStopping, ModelCheckpoint
from keras.layers import LSTM

model = Sequential()
model.add(LSTM(16, 
               input_shape=(train_feature.shape[1], train_feature.shape[2]), 
               activation='relu', 
               return_sequences=False)
          )
model.add(Dense(1))
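For reference, the size of this network can be worked out by hand. A standard LSTM layer has four gates, each with input weights, recurrent weights, and a bias, so a sketch of the count for 16 units and 4 input features (assuming the default bias terms) is:

```python
units = 16        # LSTM(16, ...)
n_features = 4    # open, high, low, volume

# Each of the 4 gates has: units*n_features input weights,
# units*units recurrent weights, and units biases
lstm_params = 4 * (units * n_features + units * units + units)

# Dense(1) on the 16-dimensional LSTM output: 16 weights + 1 bias
dense_params = units * 1 + 1

print(lstm_params, dense_params, lstm_params + dense_params)
```

The window length of 20 does not appear in the count, because the same weights are reused at every time step.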

Training the model

model.compile(loss='mean_squared_error', optimizer='adam')
early_stop = EarlyStopping(monitor='val_loss', patience=5)
filename = os.path.join(model_path, 'tmp_checkpoint.h5')  # model_path: your checkpoint folder (not defined in this post)
checkpoint = ModelCheckpoint(filename, monitor='val_loss', verbose=1, save_best_only=True, mode='auto')

history = model.fit(x_train, y_train, 
                    epochs=200, 
                    batch_size=16,
                    validation_data=(x_valid, y_valid), 
                    callbacks=[early_stop, checkpoint])

# ...
# ...

# Epoch 00015: val_loss did not improve from 0.00002
# Epoch 16/200
# 6086/6086 [==============================] - 12s 2ms/step - loss: 3.1661e-05 - val_loss: 4.1063e-05

# Epoch 00016: val_loss did not improve from 0.00002
# Epoch 17/200
# 6086/6086 [==============================] - 13s 2ms/step - loss: 2.4644e-05 - val_loss: 4.0085e-05

# Epoch 00017: val_loss did not improve from 0.00002
# Epoch 18/200
# 6086/6086 [==============================] - 13s 2ms/step - loss: 2.2936e-05 - val_loss: 2.4692e-05

# Epoch 00018: val_loss did not improve from 0.00002

Because of the EarlyStopping callback (patience=5), training stopped at epoch 18 after val_loss had plateaued at roughly 0.00002 (mean squared error on the scaled data).
Next, we use the trained model to predict future prices.
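Before moving on, the patience rule behind that early stop can be sketched in plain Python. This is a simplified illustration of patience-based early stopping, not the Keras implementation, and the loss values are invented:

```python
def stopping_epoch(val_losses, patience=5):
    """Return the 1-based epoch at which training would stop, or None."""
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            # New best validation loss: reset the counter
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return None

# Invented validation losses: the best value is reached at epoch 3
losses = [5.0, 3.0, 2.0, 2.5, 2.4, 2.2, 2.1, 2.3, 1.9]
print(stopping_epoch(losses, patience=5))
```

In this toy run the best loss occurs at epoch 3 and training stops five epochs later, which mirrors the log above: stopping at epoch 18 with patience 5 implies the best val_loss was seen at epoch 13.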

# load the best checkpointed weights
model.load_weights(filename)

# predict
pred = model.predict(test_feature)

Visualizing the actual and predicted data

plt.figure(figsize=(12, 9))
plt.plot(test_label, label='actual')
plt.plot(pred, label='prediction')
plt.legend()
plt.show()

[Image: actual vs. predicted closing price on the test set]

Reference

Lee, T. (2020, February 14). ๋”ฅ๋Ÿฌ๋‹(LSTM)์„ ํ™œ์šฉํ•˜์—ฌ ์‚ผ์„ฑ์ „์ž ์ฃผ๊ฐ€ ์˜ˆ์ธก์„ ํ•ด๋ณด์•˜์Šต๋‹ˆ๋‹ค.
Retrieved August 27, 2020, from https://teddylee777.github.io/tensorflow/LSTM์œผ๋กœ-์˜ˆ์ธกํ•ด๋ณด๋Š”-์‚ผ์„ฑ์ „์ž-์ฃผ๊ฐ€