Time Series Forecasting with Prophet

Data used
New car sales in Norway (https://www.kaggle.com/datasets/dmi3kno/newcarsalesnorway)

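1. Background and Objective

1) Background

On the morning of January 10, 2017, the Norwegian road association Opplysningsrådet for Veitrafikken (OFV) held a business breakfast for its member organizations and gave its annual presentation, titled "Car Year 2016. Status and trend".

OFV forecasts annual new passenger car sales. Its methodology can be summarized as follows.

Years of OFV expertise, with calculations based on summary statistics
Consideration of the actual monthly figures for the last four years
Weighting of a yearly ratio that combines the previous year's actual sales with the average of the most recent eight months, adjusted for actual sales relative to the previous year

For reference, OFV forecast 157,500 new passenger car sales for 2016, while actual sales were 154,603.
This tutorial checks whether we can produce a more accurate forecast.

OFV is an organization that works toward safer and more efficient roads in Norway, and forecasts of new car sales can serve as baseline data for policy decisions.
To that end, we will forecast annual car sales.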

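2) Dataset

The dataset used in this tutorial contains monthly new car sales by make (manufacturer brand) from January 2007 through January 2017. Its columns are as follows.

Year: year of sale
Month: month of sale
Make: make (manufacturer brand)
Quantity: number of units sold
Pct: share of total monthly sales

In this tutorial we do not distinguish between brands; we combine all registered makes and forecast total sales.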

2. Data Collection and Preprocessing

1) Loading Packages and Creating the DataFrame

In [1]:

import pandas as pd
from prophet import Prophet
from prophet.plot import add_changepoints_to_plot, plot_cross_validation_metric
from prophet.diagnostics import cross_validation, performance_metrics
import matplotlib.pyplot as plt

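The data is provided on Kaggle; for convenience, only the file used in this tutorial was uploaded separately to GitHub and is loaded from there.

If you would like to study further with additional factors, the extra data at https://www.kaggle.com/datasets/dmi3kno/newcarsalesnorway is recommended.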

In [2]:

df = pd.read_csv('https://raw.githubusercontent.com/hyeonkeemin/dip_test/main/CarPlatform/norway_new_car_sales_by_make.csv')
df.tail(5)

Out[2]:

	Year	Month	Make	Quantity
4372	2017	1	Nilsson	3
4373	2017	1	Maserati	2
4374	2017	1	Ferrari	1
4375	2017	1	Smart	1
4376	2017	1	Ssangyong	1

In [3]:

# Build a monthly new car sales DataFrame aggregated across all makes
df = df.groupby(['Year', 'Month'])['Quantity'].sum().reset_index()
df.tail(5)

Out[3]:

	Year	Month	Quantity
116	2016	9	13854
117	2016	10	11932
118	2016	11	13194
119	2016	12	13602
120	2017	1	13055
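Before handing the series to a model, it can help to confirm that no month went missing during aggregation. The sketch below is not part of the original notebook: `missing_months` is a hypothetical helper, and the toy frame only mimics the `Year` / `Month` / `Quantity` layout shown above, with one month deliberately left out.

```python
import pandas as pd

def missing_months(frame: pd.DataFrame) -> list:
    """Return (year, month) pairs absent between the first and last month in `frame`."""
    # Map each (Year, Month) pair to a single running month counter.
    observed = set((frame["Year"] * 12 + (frame["Month"] - 1)).tolist())
    expected = range(min(observed), max(observed) + 1)
    # Turn each missing counter back into a (year, month) pair.
    return [(m // 12, m % 12 + 1) for m in expected if m not in observed]

# Toy data in the same layout as the aggregated frame; 2016-11 is left out on purpose.
toy = pd.DataFrame({
    "Year": [2016, 2016, 2017],
    "Month": [10, 12, 1],
    "Quantity": [11932, 13602, 13055],
})
gaps = missing_months(toy)
print(gaps)
```

For the full dataset, January 2007 through January 2017 spans 121 months, which matches the 121 rows (index 0 to 120) in the output above.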

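This tutorial uses Prophet, an open-source time series forecasting package.
To learn the exact API usage and what each part means, reading the paper (https://peerj.com/preprints/3190/) and the documentation site (https://facebook.github.io/prophet/) is recommended.

Prophet's strengths are good accuracy, speed, and intuitive parameters that make the model easy to adjust.
This is because, unlike traditional time series models that explicitly account for temporal dependence, Prophet fits the model by curve fitting.

The main advantages that follow are listed below.

Seasonality with multiple periods is easy to include, and different assumptions about the trend can be added.
It is free of several assumptions made by traditional time series models. For example, measurements do not need to be regularly spaced, and missing values, such as those left by removing outliers, do not need to be imputed.
The model fits quickly, so many variations can be tried.
The parameters are easy to interpret, so even non-experts with little knowledge of traditional time series models and methods can tune the model if they have enough domain knowledge about the data.
When the model needs a new component, it is easy to add one.
Taken together, this means Prophet can be applied readily to a wide range of business problems.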

The main components of the model are Trend, Seasonality, and Holiday, which combine into the following formula.

$y(t)=g(t)+s(t)+h(t)+\epsilon_t$

$g(t)$ : the Trend component, which models non-periodic changes in the time series. It is a linear or logistic growth curve.
$s(t)$ : the Seasonality component, which models periodic changes such as weekly and yearly seasonality.
$h(t)$ : the Holiday component, which models the effects of specific events that occur on irregular schedules.
$\epsilon_t$ : the error term, representing idiosyncratic changes that the model does not explain.
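To make the additive structure concrete, the sketch below builds a synthetic monthly series from the three components plus noise, using NumPy only. It illustrates the formula and is not Prophet's implementation: the slope, Fourier coefficients, event month, and noise scale are made-up values. The seasonal term uses a Fourier series, $s(t)=\sum_{n=1}^{N}\left(a_n\cos\frac{2\pi n t}{P}+b_n\sin\frac{2\pi n t}{P}\right)$, which is the form described in the Prophet paper.

```python
import numpy as np

rng = np.random.default_rng(0)
t = np.arange(48)            # 48 monthly time steps
P = 12                       # period of the yearly seasonality, in months

# g(t): a linear trend (made-up intercept and slope).
g = 10_000 + 25.0 * t

# s(t): Fourier series with N = 2 harmonics (made-up coefficients a_n, b_n).
a = [300.0, 80.0]
b = [-150.0, 40.0]
s = sum(
    a[n] * np.cos(2 * np.pi * (n + 1) * t / P)
    + b[n] * np.sin(2 * np.pi * (n + 1) * t / P)
    for n in range(2)
)

# h(t): a one-off event effect in a single (hypothetical) month.
h = np.where(t == 30, 1_500.0, 0.0)

# eps_t: noise that the three components do not explain.
eps = rng.normal(0.0, 200.0, size=t.size)

y = g + s + h + eps
```

Because the components are simply added, each one can be inspected or changed without touching the others, which is what makes the fitted components easy to interpret.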

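The Prophet API in Python has the following characteristics.

It follows the scikit-learn model API.
You first create an instance of the Prophet class and then call its fit and predict methods.
The input to Prophet is always a DataFrame with two columns, and the column names are strictly required to be ds and y.
The ds (datestamp) column must be in a format that pandas can parse; YYYY-MM-DD is recommended for dates and YYYY-MM-DD HH:MM:SS for timestamps.
The y column must be numeric and holds the target values the model is fitted to.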
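The input requirements above can be checked before fitting. The helper below is a hypothetical pre-flight check written for this tutorial, not part of the Prophet API, and it is deliberately stricter than the stated requirement: it expects `ds` to already be a datetime column rather than merely parseable.

```python
import pandas as pd
from pandas.api.types import is_datetime64_any_dtype, is_numeric_dtype

def check_prophet_input(frame: pd.DataFrame) -> list:
    """Return a list of problems found; an empty list means the frame looks usable."""
    problems = []
    for col in ("ds", "y"):
        if col not in frame.columns:
            problems.append(f"missing column: {col}")
    if "ds" in frame.columns and not is_datetime64_any_dtype(frame["ds"]):
        problems.append("ds is not a datetime column")
    if "y" in frame.columns and not is_numeric_dtype(frame["y"]):
        problems.append("y is not numeric")
    return problems

good = pd.DataFrame({"ds": pd.to_datetime(["2007-01-01", "2007-02-01"]), "y": [12685, 9793]})
bad = pd.DataFrame({"date": ["2007-01-01"], "Quantity": ["12685"]})
print(check_prophet_input(good))
print(check_prophet_input(bad))
```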

Now let's reshape the DataFrame to match the format the Prophet API expects.

In [4]:

# Combine the Year and Month columns into a 'YYYY-M' string
df['ds'] = df['Year'].astype(str) + '-' + df['Month'].astype(str)

# Convert to timestamps
df['ds'] = pd.DatetimeIndex(df['ds'])

# Rename the target column
df = df.rename(columns={'Quantity': 'y'})

# Drop the columns that are no longer needed
df = df.drop(['Year', 'Month'], axis=1)

In [5]:

df.head()

Out[5]:

	y	ds
0	12685	2007-01-01
1	9793	2007-02-01
2	11264	2007-03-01
3	8854	2007-04-01
4	12007	2007-05-01
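As an alternative to string concatenation, `pd.to_datetime` can assemble dates from a frame with `year`, `month`, and `day` columns. This is a minimal sketch on the first three rows shown above; the lowercase column names and the fixed `day=1` are choices made here, not part of the original notebook.

```python
import pandas as pd

toy = pd.DataFrame({
    "Year": [2007, 2007, 2007],
    "Month": [1, 2, 3],
    "Quantity": [12685, 9793, 11264],
})

# to_datetime assembles timestamps from columns named year / month / day.
parts = pd.DataFrame({"year": toy["Year"], "month": toy["Month"], "day": 1})
prepared = pd.DataFrame({"ds": pd.to_datetime(parts), "y": toy["Quantity"]})
print(prepared)
```

This avoids relying on how a string such as '2007-1' is parsed, at the cost of one intermediate frame.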

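2) Data Visualization

Let's plot how new car sales changed over the ten years.

Monthly car sales dropped sharply during the subprime mortgage crisis of 2008 to 2010; outside that period, monthly sales appear to rise slightly.
Given this pattern, applying a traditional time series model that assumes stationarity would likely require complex preprocessing.
We will use Prophet to produce a forecast with little effort instead.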

In [6]:

ax = df.set_index('ds').plot(figsize=(16, 8))
ax.set_ylabel('Monthly number of car sales')
ax.set_xlabel('Date')
plt.show()

3. Modeling

1) A Simple Model

This section shows how to forecast a time series with Prophet.
It is used in the same way as a scikit-learn model, so it is easy to apply.

The data runs through January 1, 2017, and both OFV's forecast and the actual result are available for 2016, so we fit the model using only data through the end of 2015.

In [7]:

# Keep only the data before 2016
df = df[df['ds'] < '2016-01-01']
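Filtering in place discards the 2016 rows. If you want to compare monthly forecasts against the held-out months later, one option is to keep both parts. The sketch below uses a synthetic frame over the same date range; the names `full`, `train`, and `holdout` and the placeholder `y` values are illustrative only.

```python
import pandas as pd

# Synthetic monthly frame covering the same period as the tutorial data.
full = pd.DataFrame({"ds": pd.date_range("2007-01-01", "2017-01-01", freq="MS")})
full["y"] = 1  # placeholder values; real sales figures are not needed to show the split

cutoff = pd.Timestamp("2016-01-01")
end = pd.Timestamp("2017-01-01")

train = full[full["ds"] < cutoff]                              # 2007-01 through 2015-12
holdout = full[(full["ds"] >= cutoff) & (full["ds"] < end)]    # the 12 months of 2016
print(len(train), len(holdout))
```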

In [8]:

# Instantiate the model. Parameters can be set here through the constructor arguments.
model = Prophet()

# With the Prophet model initialized, pass the prepared DataFrame to the fit method.
model.fit(df)

09:59:40 - cmdstanpy - INFO - Chain [1] start processing
09:59:40 - cmdstanpy - INFO - Chain [1] done processing

Out[8]:

<prophet.forecaster.Prophet at 0x7f3a640cf790>

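To obtain forecasts, we need a new DataFrame with a ds column that contains the dates we want to predict.
For example, since the last date in the current DataFrame is December 1, 2015, as shown below, we need to generate dates from January 1, 2016 onward.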

In [9]:

df.tail()

Out[9]:

	y	ds
103	12604	2015-08-01
104	12421	2015-09-01
105	13197	2015-10-01
106	12600	2015-11-01
107	13078	2015-12-01

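These dates can be generated conveniently with the make_future_dataframe method.
As shown below, dates from January 2016 onward have been created. Because we are working with monthly data, the timestamp frequency is set to 'MS' (month start).
In other words, the DataFrame is extended by 12 monthly periods, so it runs one year past the end of the training data.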

In [10]:

future_dates = model.make_future_dataframe(periods=12, freq='MS')
future_dates

Out[10]:

	ds
0	2007-01-01
1	2007-02-01
2	2007-03-01
3	2007-04-01
4	2007-05-01
...	...
115	2016-08-01
116	2016-09-01
117	2016-10-01
118	2016-11-01
119	2016-12-01

120 rows × 1 columns
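Conceptually, the future rows are a regular date range that starts right after the last training date. The sketch below reproduces those 12 dates with plain pandas; it approximates the idea and is not a copy of Prophet's internal code.

```python
import pandas as pd

last_date = pd.Timestamp("2015-12-01")   # last training date shown above
periods = 12

# One extra period is generated because the range starts at last_date itself.
dates = pd.date_range(start=last_date, periods=periods + 1, freq="MS")
future_only = dates[dates > last_date][:periods]
print(future_only[0], future_only[-1])
```

By default, `make_future_dataframe` also keeps the historical dates (`include_history=True`), which is why the output above has 120 rows: 108 training months plus 12 future months.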

Next, we call the model's predict method on the DataFrame extended into the future, which returns the fitted values for the historical dates and the forecasts for the future dates.

In [11]:

# Generate predictions
forecast = model.predict(future_dates)

# Inspect the fitted and forecast values
forecast[['ds', 'yhat', 'yhat_lower', 'yhat_upper']]

Out[11]:

	ds	yhat	yhat_lower	yhat_upper
0	2007-01-01	9170.694436	7704.217633	10698.458628
1	2007-02-01	8941.814681	7331.481428	10398.918078
2	2007-03-01	10498.514677	8898.787826	12088.260359
3	2007-04-01	9868.739459	8295.248055	11414.273480
4	2007-05-01	10083.932061	8591.190195	11593.897393
...	...	...	...	...
115	2016-08-01	12806.452592	11302.149593	14351.444697
116	2016-09-01	13010.471863	11434.743476	14664.910556
117	2016-10-01	13749.564547	12194.153371	15241.175648
118	2016-11-01	13250.655525	11678.288516	14808.434580
119	2016-12-01	12824.500085	11261.389180	14382.529369

120 rows × 4 columns

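Prophet returns many more columns after prediction, but only the ones most relevant to the forecast are shown here. They are described below.

ds: the date each prediction refers to
yhat: the model's predicted value
yhat_lower: lower bound of the prediction's uncertainty interval
yhat_upper: upper bound of the prediction's uncertainty interval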
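One way to read `yhat_lower` and `yhat_upper` is to check how often the observed values fall between them. The sketch below does this for the first five months, using the observed `y` values and the interval bounds printed in the outputs above (bounds rounded to one decimal). Prophet's `interval_width` defaults to 0.8, so roughly 80% coverage would be expected over a long, well-calibrated series; five rows are far too few to judge calibration and serve only to show the calculation.

```python
import pandas as pd

check = pd.DataFrame({
    "ds": pd.date_range("2007-01-01", periods=5, freq="MS"),
    "y": [12685, 9793, 11264, 8854, 12007],
    "yhat_lower": [7704.2, 7331.5, 8898.8, 8295.2, 8591.2],
    "yhat_upper": [10698.5, 10398.9, 12088.3, 11414.3, 11593.9],
})

# True where the observation lies inside the uncertainty interval.
inside = check["y"].between(check["yhat_lower"], check["yhat_upper"])
coverage = inside.mean()
print(inside.tolist(), coverage)
```

Here January and May 2007 fall above the upper bound, so three of the five observations are covered.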

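Prophet also provides a convenient function for plotting the forecast quickly, as in the figure below.
Observed values are shown as black dots, predictions as a blue line, and the uncertainty interval of the predictions as a blue shaded area.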

In [12]:

model.plot(forecast)
plt.show()

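The main components of the fitted model can also be inspected.

The first plot below shows that new car sales follow a linearly increasing trend over time.
The second plot shows that sales rise around May.
Because this is monthly data, there is no weekly seasonality component, so weekday versus weekend patterns cannot be examined.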

In [13]:

model.plot_components(forecast)
plt.show()

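2) Adding More Detail

Let's add a few more elements to the plot to examine it more closely.

The figure below shows the trend and the changepoints. The solid red line is the same trend drawn by plot_components above, and the dashed red lines mark the changepoints.

Changepoints are the points in time at which the trend of the series changes.
In this data, the changepoints are located around 2008 to 2010, the period of the subprime mortgage crisis.

The range and locations of the changepoints can be set in detail from domain knowledge when the Prophet instance is created; if they are not specified, changepoints are chosen automatically.
The parameters that control changepoints are changepoint_range, changepoint_prior_scale, and changepoints. Setting them from domain knowledge lets you fine-tune the fitted trend and, through it, the other fitted components such as seasonality.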

In [14]:

fig = model.plot(forecast)
a = add_changepoints_to_plot(fig.gca(), model, forecast)
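To see what a changepoint does to the trend, the sketch below evaluates the piecewise linear trend from the Prophet paper, $g(t)=(k+a(t)^\top\delta)\,t+(m+a(t)^\top\gamma)$ with $\gamma_j=-s_j\delta_j$, in NumPy. It is a hand-written illustration, not Prophet's code; the function name, the single changepoint at $t=2$, and the rate change of $-1$ are made up. In Prophet, the prior set by `changepoint_prior_scale` governs how large the rate changes $\delta$ are allowed to be.

```python
import numpy as np

def piecewise_linear_trend(t, k, m, changepoints, deltas):
    """Evaluate g(t) = (k + a(t).delta) * t + (m + a(t).gamma), gamma_j = -s_j * delta_j."""
    t = np.asarray(t, dtype=float)
    s = np.asarray(changepoints, dtype=float)
    delta = np.asarray(deltas, dtype=float)
    # a_j(t) is 1 once t has reached changepoint s_j, else 0.
    a = (t[:, None] >= s[None, :]).astype(float)
    gamma = -s * delta          # offsets that keep the line continuous at each changepoint
    return (k + a @ delta) * t + (m + a @ gamma)

t = np.arange(5)
g = piecewise_linear_trend(t, k=1.0, m=0.0, changepoints=[2.0], deltas=[-1.0])
print(g)
```

The slope is 1 before the changepoint and 0 after it, and the offset term keeps the two segments joined at $t=2$.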

3) Further Modeling

Prophet fits its model with Bayesian estimation and offers two fitting methods, MAP and MCMC.
The mcmc_samples parameter selects between them: a value of 1 or greater switches to MCMC estimation.

MAP: performs Maximum A Posteriori estimation. This is the default and converges relatively quickly.
MCMC: performs Markov Chain Monte Carlo estimation. It gives a more detailed view of the model's uncertainty, but results differ slightly from run to run.

In this sales data, some observations fall outside the uncertainty interval, so the pattern is hard to capture precisely with trend and seasonality alone. We also lack domain knowledge about car sales in Norway, so we fit the model with MCMC sampling to focus more on the variability of the observations and then check the forecasts.

In [15]:

model2 = Prophet(mcmc_samples=300).fit(df)

10:00:43 - cmdstanpy - INFO - CmdStan installation /home/bigdata/anaconda3/envs/mhk/lib/python3.8/site-packages/prophet/stan_model/cmdstan-2.26.1 missing makefile, cannot get version.
10:00:43 - cmdstanpy - INFO - Cannot determine whether version is before 2.28.
10:00:43 - cmdstanpy - INFO - CmdStan start processing

chain 1 |          | 00:00 Status

chain 2 |          | 00:00 Status

chain 3 |          | 00:00 Status

chain 4 |          | 00:00 Status

10:00:45 - cmdstanpy - INFO - CmdStan done processing.
10:00:45 - cmdstanpy - WARNING - Non-fatal error during sampling:
Exception: normal_id_glm_lpdf: Scale vector is 0, but must be positive finite! (in '/project/python/stan/prophet.stan', line 137, column 2 to line 142, column 4)
Exception: normal_id_glm_lpdf: Scale vector is 0, but must be positive finite! (in '/project/python/stan/prophet.stan', line 137, column 2 to line 142, column 4)
Exception: normal_id_glm_lpdf: Scale vector is 0, but must be positive finite! (in '/project/python/stan/prophet.stan', line 137, column 2 to line 142, column 4)
Exception: normal_id_glm_lpdf: Scale vector is 0, but must be positive finite! (in '/project/python/stan/prophet.stan', line 137, column 2 to line 142, column 4)
Consider re-running with show_console=True if the above output is unclear!

In [16]:

forecast2 = model2.predict(future_dates)

In [17]:

fig = model2.plot(forecast2)
a = add_changepoints_to_plot(fig.gca(), model2, forecast2)

In [18]:

forecast2[['ds', 'yhat', 'yhat_lower', 'yhat_upper']]

Out[18]:

	ds	yhat	yhat_lower	yhat_upper
0	2007-01-01	9781.454928	7952.828441	11552.291879
1	2007-02-01	9503.015510	7672.338100	11203.555301
2	2007-03-01	11209.898889	9410.225958	13044.083300
3	2007-04-01	9651.080772	7844.170047	11451.913084
4	2007-05-01	10569.786804	8629.605040	12486.935715
...	...	...	...	...
115	2016-08-01	12397.828172	10456.301580	14206.150533
116	2016-09-01	13179.422991	11263.689148	15107.524596
117	2016-10-01	13876.572499	12120.644738	15854.123267
118	2016-11-01	13093.146322	11194.645300	14910.775129
119	2016-12-01	12591.875514	10618.820580	14525.927378

120 rows × 4 columns

4) Model Evaluation

So far we have carried out a time series forecast with Prophet.

Comparing OFV's forecast and the Prophet forecasts against actual car sales, the last Prophet model, fitted with MCMC, comes closest to the actual figure.

In [19]:

result_dict = {'actual': 154603, 'forecast_OFV': 157500, 'forecast_MAP': forecast.iloc[-12:]['yhat'].sum(), 'forecast_MCMC': forecast2.iloc[-12:]['yhat'].sum()}
pd.DataFrame([result_dict])

Out[19]:

	actual	forecast_OFV	forecast_MAP	forecast_MCMC
0	154603	157500	157125.285567	155887.520501
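The gap between each forecast and the actual figure is easier to compare as an error. The short calculation below uses only the numbers from the table above; the dictionary labels are chosen here for readability.

```python
actual = 154603
forecasts = {
    "OFV": 157500,
    "Prophet (MAP)": 157125.285567,
    "Prophet (MCMC)": 155887.520501,
}

# Signed error in units and as a percentage of actual sales.
errors = {
    name: {"abs_error": value - actual, "pct_error": 100 * (value - actual) / actual}
    for name, value in forecasts.items()
}
for name, e in errors.items():
    print(f"{name}: {e['abs_error']:.0f} units ({e['pct_error']:.2f}%)")
```

All three forecasts overshoot: by about 1.9% for OFV, 1.6% for the MAP fit, and 0.8% for the MCMC fit. This is a single-year comparison, and MCMC results vary from run to run, so the ranking should be read with some caution.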

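Prophet also provides built-in functions for measuring model performance through cross-validation.

The cross_validation function is used for this. With the horizon parameter set to 365 days (one year), it refits the model at a series of cutoff dates, forecasts one year ahead from each, and returns a DataFrame of predicted and actual values from which performance metrics can be computed.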

In [20]:

df_cv = cross_validation(model2, horizon='365 days')

  0%|          | 0/10 [00:00<?, ?it/s]

10:01:13 - cmdstanpy - INFO - CmdStan installation /home/bigdata/anaconda3/envs/mhk/lib/python3.8/site-packages/prophet/stan_model/cmdstan-2.26.1 missing makefile, cannot get version.
10:01:13 - cmdstanpy - INFO - Cannot determine whether version is before 2.28.
10:01:13 - cmdstanpy - INFO - CmdStan start processing

chain 1 |          | 00:00 Status

chain 2 |          | 00:00 Status

chain 3 |          | 00:00 Status

chain 4 |          | 00:00 Status

10:01:14 - cmdstanpy - INFO - CmdStan done processing.
10:01:14 - cmdstanpy - WARNING - Non-fatal error during sampling:
Exception: normal_id_glm_lpdf: Scale vector is 0, but must be positive finite! (in '/project/python/stan/prophet.stan', line 137, column 2 to line 142, column 4)
Exception: normal_id_glm_lpdf: Matrix of independent variables is inf, but must be finite! (in '/project/python/stan/prophet.stan', line 137, column 2 to line 142, column 4)
Exception: normal_id_glm_lpdf: Scale vector is 0, but must be positive finite! (in '/project/python/stan/prophet.stan', line 137, column 2 to line 142, column 4)
Exception: normal_id_glm_lpdf: Scale vector is 0, but must be positive finite! (in '/project/python/stan/prophet.stan', line 137, column 2 to line 142, column 4)
Consider re-running with show_console=True if the above output is unclear!

In [21]:

# DataFrame used to check performance
df_cv

Out[21]:

	ds	yhat	yhat_lower	yhat_upper	y	cutoff
0	2010-07-01	11031.502171	8377.310601	13604.759069	11507	2010-06-02 12:00:00
1	2010-08-01	9737.661002	7128.355859	12275.387724	10414	2010-06-02 12:00:00
2	2010-09-01	9368.803773	6880.319193	11894.595259	11137	2010-06-02 12:00:00
3	2010-10-01	11009.076080	8081.330184	13620.444945	10683	2010-06-02 12:00:00
4	2010-11-01	9951.640201	7278.883225	12620.311035	11908	2010-06-02 12:00:00
...	...	...	...	...	...	...
115	2015-08-01	12547.841775	10622.339666	14487.159635	12604	2014-12-01 00:00:00
116	2015-09-01	12240.319630	10477.068971	14083.707968	12421	2014-12-01 00:00:00
117	2015-10-01	12916.294855	10969.281546	14873.731512	13197	2014-12-01 00:00:00
118	2015-11-01	12792.540915	10912.131688	14672.029420	12600	2014-12-01 00:00:00
119	2015-12-01	12073.094900	10130.360929	13970.620352	13078	2014-12-01 00:00:00

120 rows × 6 columns
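The `cutoff` column shows where each training window ends. When `initial` and `period` are not given, the Prophet documentation states that the initial training period is three times the horizon and that cutoffs are spaced half a horizon apart. The sketch below reconstructs the cutoff dates from those defaults with plain pandas; it is a simplified illustration of the scheme, not Prophet's own code.

```python
import pandas as pd

first_date = pd.Timestamp("2007-01-01")   # first training date
last_date = pd.Timestamp("2015-12-01")    # last training date
horizon = pd.Timedelta("365 days")
initial = 3 * horizon      # documented default when `initial` is not given
period = 0.5 * horizon     # documented default when `period` is not given

cutoffs = []
cutoff = last_date - horizon
# Step backwards while enough history remains before the cutoff.
while cutoff >= first_date + initial:
    cutoffs.append(cutoff)
    cutoff -= period
cutoffs = cutoffs[::-1]
print(len(cutoffs), cutoffs[0], cutoffs[-1])
```

This yields 10 cutoffs, from 2010-06-02 12:00 to 2014-12-01, matching the first and last `cutoff` values in the output above and the 10 iterations in the progress bar. The half-day offsets come from the 182.5-day spacing.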

Passing this DataFrame to the performance_metrics function returns performance metrics for each forecast horizon.

In [22]:

# Performance for each horizon
performance_metrics(df_cv)

Out[22]:

	horizon	mse	rmse	mae	mape	mdape	smape	coverage
0	59 days 12:00:00	1.393781e+06	1180.584856	1041.485002	0.093594	0.078651	0.088914	1.000000
1	60 days 12:00:00	8.963432e+05	946.754040	826.681175	0.074491	0.066468	0.071549	1.000000
2	61 days 00:00:00	7.155378e+05	845.894645	712.651966	0.065211	0.060727	0.062867	1.000000
3	62 days 00:00:00	4.902532e+05	700.180830	586.630413	0.053310	0.036289	0.051863	1.000000
4	89 days 00:00:00	5.360492e+05	732.153821	621.210064	0.055024	0.046253	0.054236	1.000000
5	90 days 00:00:00	9.281619e+05	963.411576	786.305057	0.066724	0.050050	0.065163	0.916667
6	90 days 12:00:00	1.152986e+06	1073.771816	887.028008	0.075886	0.070009	0.075620	0.916667
7	91 days 12:00:00	1.317546e+06	1147.844250	957.284895	0.081626	0.070009	0.080779	0.916667
8	120 days 00:00:00	1.515585e+06	1231.091182	1052.610755	0.089874	0.070009	0.090154	0.916667
9	120 days 12:00:00	1.521182e+06	1233.361900	1048.225916	0.090362	0.070009	0.089818	0.916667
10	121 days 00:00:00	2.232195e+06	1494.053145	1182.771317	0.102404	0.098158	0.100251	0.916667
11	121 days 12:00:00	1.898574e+06	1377.887684	998.435698	0.085460	0.054640	0.082379	0.916667
12	150 days 00:00:00	1.875266e+06	1369.403632	1006.184750	0.085241	0.061380	0.082741	0.916667
13	151 days 00:00:00	1.608059e+06	1268.092818	896.832118	0.074204	0.058475	0.070936	0.916667
14	151 days 12:00:00	1.296321e+06	1138.561061	826.592738	0.068198	0.052759	0.067464	0.958333
15	152 days 12:00:00	7.911836e+05	889.485047	700.435790	0.057923	0.052759	0.058042	1.000000
16	181 days 00:00:00	1.083164e+06	1040.751810	838.251829	0.071585	0.058475	0.070442	1.000000
17	181 days 12:00:00	1.042561e+06	1021.058920	824.168691	0.071739	0.050856	0.070597	1.000000
18	182 days 00:00:00	1.534197e+06	1238.627119	1050.367337	0.090872	0.088343	0.089629	1.000000
19	182 days 12:00:00	2.452878e+06	1566.166563	1228.868858	0.112930	0.088343	0.104959	0.916667
20	211 days 00:00:00	2.792408e+06	1671.050049	1357.806842	0.124637	0.089456	0.115427	0.916667
21	212 days 00:00:00	2.987988e+06	1728.579725	1485.564637	0.133171	0.120039	0.123192	0.916667
22	212 days 12:00:00	3.034517e+06	1741.986598	1439.706484	0.131012	0.110956	0.119879	0.833333
23	213 days 12:00:00	2.527780e+06	1589.899413	1262.478108	0.115686	0.086799	0.105696	0.861111
24	242 days 00:00:00	2.132621e+06	1460.349742	1178.238268	0.106427	0.084508	0.098205	0.888889
25	243 days 00:00:00	1.095619e+06	1046.718313	767.533536	0.068595	0.053829	0.064427	0.916667
26	243 days 12:00:00	1.041791e+06	1020.681613	736.819288	0.067234	0.051789	0.063369	0.916667
27	244 days 12:00:00	7.234549e+05	850.561515	667.385020	0.060640	0.048382	0.057992	1.000000
28	271 days 12:00:00	9.229670e+05	960.711731	758.825865	0.066708	0.048382	0.065401	1.000000
29	272 days 12:00:00	1.496865e+06	1223.464307	941.758130	0.081641	0.050788	0.078798	0.916667
30	273 days 00:00:00	1.499746e+06	1224.641063	954.822753	0.082718	0.050788	0.080027	0.916667
31	274 days 00:00:00	1.717394e+06	1310.493678	986.934869	0.084576	0.055331	0.080982	0.833333
32	302 days 12:00:00	2.128048e+06	1458.782912	1123.961390	0.096462	0.055331	0.095165	0.833333
33	303 days 00:00:00	2.192652e+06	1480.760707	1164.939269	0.099704	0.064384	0.098208	0.833333
34	303 days 12:00:00	2.753929e+06	1659.496737	1229.615151	0.107013	0.066228	0.101948	0.812500
35	304 days 00:00:00	2.437175e+06	1561.145445	1103.795687	0.094929	0.063854	0.090848	0.875000
36	332 days 12:00:00	2.503839e+06	1582.352459	1157.447967	0.098211	0.085946	0.095248	0.895833
37	333 days 12:00:00	1.847404e+06	1359.192390	1009.111996	0.083435	0.070903	0.079032	0.937500
38	334 days 00:00:00	1.500865e+06	1225.097762	909.015588	0.074650	0.049538	0.071292	0.958333
39	335 days 00:00:00	1.026196e+06	1013.013372	867.262076	0.071047	0.086916	0.068975	1.000000
40	363 days 12:00:00	1.093355e+06	1045.636209	921.961873	0.077014	0.102641	0.074581	1.000000
41	364 days 00:00:00	1.070379e+06	1034.591257	919.831499	0.077116	0.086916	0.074862	1.000000
42	364 days 12:00:00	1.684821e+06	1298.006722	1115.202685	0.095769	0.090755	0.091192	1.000000
43	365 days 00:00:00	3.426956e+06	1851.203988	1451.661394	0.132393	0.090755	0.120547	0.916667
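To show what the metric columns measure, the sketch below computes MSE, RMSE, MAE, and MAPE by hand for the first five rows of `df_cv` printed earlier. The results will not match any single row of the table: `performance_metrics` aggregates across all cutoffs within a rolling window over the horizon (`rolling_window` defaults to 0.1), whereas this example pools five rows from one cutoff.

```python
import numpy as np
import pandas as pd

sample = pd.DataFrame({
    "y":    [11507, 10414, 11137, 10683, 11908],
    "yhat": [11031.502171, 9737.661002, 9368.803773, 11009.076080, 9951.640201],
})

err = sample["y"] - sample["yhat"]
mse = float((err ** 2).mean())                          # mean squared error
rmse = float(np.sqrt(mse))                              # root mean squared error
mae = float(err.abs().mean())                           # mean absolute error
mape = float((err.abs() / sample["y"].abs()).mean())    # mean absolute percentage error
print(round(rmse, 1), round(mae, 1), round(mape, 4))
```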

The plot_cross_validation_metric function also lets you visualize the result.

In [23]:

plot_cross_validation_metric(df_cv, metric='mse')
plt.show()

By using these model evaluation functions and tuning parameters with domain and modeling knowledge, a better model can be built.

This concludes the forecasting walkthrough on time series data.

Thank you.

References

dmi3kno. (2017, February 18). New car sales in Norway. Kaggle. Retrieved September 15, 2022, from https://www.kaggle.com/datasets/dmi3kno/newcarsalesnorway
Taylor SJ, Letham B. 2017. Forecasting at scale. PeerJ Preprints 5:e3190v2 https://doi.org/10.7287/peerj.preprints.3190v2
Forecasting at scale. Prophet. (n.d.). Retrieved September 15, 2022, from https://facebook.github.io/prophet/
prashant. (2020, December 24). Tutorial: Time series forecasting with prophet. Kaggle. Retrieved September 15, 2022, from https://www.kaggle.com/code/prashant111/tutorial-time-series-forecasting-with-prophet
Be-Favorite. (2022, May 3). Prophet 모형 [The Prophet model]. SLOG. Retrieved September 15, 2022, from https://be-favorite.tistory.com/64
