
Introduction to Artificial Intelligence and Machine Learning, Lecture 2: Implementing Regression

H_erb Salt 2020. 7. 28. 15:51


Implementation 1. Linear regression

  • A linear regression model that uses only the first-order term of x
  • Of the 13 attributes, only the first attribute is used as the feature variable (recall from the lecture that the very first column is a dummy column fixed to 1!)

  • $\hat{\theta} = \arg\min_{\theta}\,(f - \hat{f})^2 \;\Rightarrow\; \hat{\theta} = (X^TX)^{-1}X^TY$ (see the sketch right after this list)
  • y_est (= x_temp · θ): predictions computed from the θ obtained above
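As a quick illustration of the closed-form solution above, here is a minimal sketch on made-up data (the names x_demo, y_demo, theta_demo are illustrative and not part of the notebook cells below); np.linalg.solve is shown as a numerically safer alternative to the explicit inverse:

import numpy as np

rng = np.random.default_rng(0)
x_demo = rng.uniform(0, 10, size=(50, 1))
X_demo = np.hstack([np.ones((50, 1)), x_demo])            # design matrix [1, x]
y_demo = 3.0 + 2.0 * x_demo + rng.normal(0, 0.5, size=(50, 1))

# theta = (X^T X)^{-1} X^T y  -- the normal equation
theta_demo = np.linalg.inv(X_demo.T @ X_demo) @ X_demo.T @ y_demo
# same solution without forming the explicit inverse
theta_solve = np.linalg.solve(X_demo.T @ X_demo, X_demo.T @ y_demo)
print(theta_demo.ravel(), theta_solve.ravel())            # both close to [3, 2]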
In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

sns.set_style('whitegrid')
pd.set_option('display.max_columns', 500)
pd.options.display.float_format = '{:.5f}'.format
In [2]:
x=pd.read_csv('X.csv', header=None)
y=pd.read_csv('Y.csv', header=None)
x
Out[2]:
  0 1 2 3 4 5 6 7 8 9 10 11 12 13
0 1 0.00632 18.00000 2.31000 0 0.53800 6.57500 65.20000 4.09000 1 296 15.30000 396.90000 4.98000
1 1 0.02731 0.00000 7.07000 0 0.46900 6.42100 78.90000 4.96710 2 242 17.80000 396.90000 9.14000
2 1 0.02729 0.00000 7.07000 0 0.46900 7.18500 61.10000 4.96710 2 242 17.80000 392.83000 4.03000
3 1 0.03237 0.00000 2.18000 0 0.45800 6.99800 45.80000 6.06220 3 222 18.70000 394.63000 2.94000
4 1 0.06905 0.00000 2.18000 0 0.45800 7.14700 54.20000 6.06220 3 222 18.70000 396.90000 5.33000
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
501 1 0.06263 0.00000 11.93000 0 0.57300 6.59300 69.10000 2.47860 1 273 21.00000 391.99000 9.67000
502 1 0.04527 0.00000 11.93000 0 0.57300 6.12000 76.70000 2.28750 1 273 21.00000 396.90000 9.08000
503 1 0.06076 0.00000 11.93000 0 0.57300 6.97600 91.00000 2.16750 1 273 21.00000 396.90000 5.64000
504 1 0.10959 0.00000 11.93000 0 0.57300 6.79400 89.30000 2.38890 1 273 21.00000 393.45000 6.48000
505 1 0.04741 0.00000 11.93000 0 0.57300 6.03000 80.80000 2.50500 1 273 21.00000 396.90000 0.00000

506 rows × 14 columns

In [3]:
x = x.values
y = y.values
In [4]:
x_temp = x[:, 0:2]

theta = np.dot(np.dot(np.linalg.inv(np.dot(np.transpose(x_temp), x_temp)), np.transpose(x_temp)), y)
y_est = np.dot(x_temp, theta)
 

np.linalg.lstsq() takes both the matrix A and the vector b as arguments and returns

  1. the solution x of the least-squares problem,
  2. the residual sum of squares resid,
  3. the rank rank, and
  4. the singular values s.

When the number of unknowns equals the number of equations and the matrix A is invertible, the least-squares solution coincides with the solution of the linear system, so lstsq() can also be used to solve systems of linear equations, as the small example below shows.
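For example, on a tiny square, invertible system (a minimal sketch with made-up A and b, not taken from the notebook's data), lstsq returns the exact solution together with the other three values:

import numpy as np

A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([9.0, 8.0])

sol, resid, rank, sv = np.linalg.lstsq(A, b, rcond=None)
print(sol)              # [2. 3.], identical to np.linalg.solve(A, b)
print(resid, rank, sv)  # resid is empty for an exact fit; rank = 2; singular values of A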

In [5]:
np.linalg.lstsq(x_temp, y, rcond=None)

# rcond
# Cut-off ratio for small singular values of a.
# For the purposes of rank determination,
# singular values are treated as zero if they are smaller than rcond times the largest singular value of a.
Out[5]:
(array([[24.03310378],
        [-0.41519   ]]),
 array([36275.52515523]),
 2,
 array([209.87398774,  20.71756301]))
In [6]:
# m0, c0 = argmin |Y - (m0 + c0 * x_temp[:, 1])|^2      (line fitted to the raw data)
# m1, c1 = argmin |Y_est - (m1 + c1 * x_temp[:, 1])|^2  (line fitted to the predictions)

m0, c0 = np.linalg.lstsq(x_temp, y, rcond=None)[0]
m1, c1 = np.linalg.lstsq(x_temp, y_est, rcond=None)[0]
 

Implementation 2. Polynomial regression

  • A polynomial regression model with terms up to degree 8
  • new_x: new_x[i] = $[1,\ x_i,\ x_i^2,\ x_i^3,\ \dots,\ x_i^8]$ (an equivalent construction with np.vander is sketched right after this list)
  • new_theta: the parameters that minimize the squared error
  • new_y_est: predictions computed with new_theta
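As a cross-check, the same design matrix can be produced in one call with np.vander (a minimal sketch, not one of the notebook's cells; it assumes x has already been loaded and converted to a NumPy array as above):

# Equivalent construction of [1, x, x^2, ..., x^8]; increasing=True puts the constant column first
vander_x = np.vander(x[:, 1], N=9, increasing=True)
# vander_x should match the new_x built with the loop in the next cell:
# np.allclose(vander_x, new_x)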
In [7]:
new_x = np.zeros((x.shape[0], 9))
new_x
Out[7]:
array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])
In [8]:
new_x[:, 0:2] = x[:, 0:2]

for i in range(2, 9):
    new_x[:, i] = new_x[:, 1] * new_x[:, i-1]

new_theta = np.dot(np.dot(np.linalg.inv(np.dot(np.transpose(new_x), new_x)), np.transpose(new_x)),y)
new_y_est = np.dot(new_x, new_theta)

# m2, c2 = argmin |new_Y_est - (m2 + c2 * x_temp[:, 1])|^2  (line fitted to the polynomial predictions)
m2, c2 = np.linalg.lstsq(x_temp, new_y_est, rcond=None)[0]
 

Implementation 3. Plots

In [9]:
# Plot the results
# The x-axis is the value of the feature variable used; the y-axis is the dependent variable
plt.figure(1, figsize = (17, 5))

# Plot 1
ax1 = plt.subplot(1, 3, 1)
plt.plot(x[:, 1], y, 'ro', markeredgecolor = 'none')
plt.plot(x[:, 1], m0+c0*x[:, 1], 'r-')
plt.plot(x[:, 1], y_est, 'bo', markeredgecolor = 'none')
plt.plot(x[:, 1], m1+c1*x[:, 1], 'b-')
plt.xlabel('Feature Variable', fontsize = 14)
plt.ylabel('Dependent Variable', fontsize = 14)

# Plot 2
ax2 = plt.subplot(1, 3, 2, sharey = ax1)
plt.plot(x[:, 1], y, 'ro', markeredgecolor = 'none')
plt.plot(x[:, 1], new_y_est, 'go', markeredgecolor = 'none')
plt.plot(x[:, 1], m2 + c2*x[:, 1], 'g-')
plt.xlabel('Feature Variable', fontsize = 14)
plt.ylabel('Dependent Variable', fontsize = 14)

# Plot 3
ax3 = plt.subplot(1, 3, 3, sharey = ax2)
plt.plot(x[:, 1], y, 'ro', markeredgecolor = 'none')
plt.plot(x[:, 1], y_est, 'bo', markeredgecolor = 'none')
plt.plot(x[:, 1], new_y_est, 'go', markeredgecolor = 'none')
plt.xlabel('Feature Variable', fontsize = 14)
plt.ylabel('Dependent Variable', fontsize = 14)

plt.show()
 
 

Implementation 4. statsmodels: linear regression

In [10]:
from statsmodels.api import OLS

model1 = OLS(y, x_temp).fit()
p = model1.params
y_pred1 = model1.predict(x_temp)
model1.summary()
Out[10]:
OLS Regression Results
Dep. Variable: y R-squared: 0.151
Model: OLS Adj. R-squared: 0.149
Method: Least Squares F-statistic: 89.49
Date: Tue, 28 Jul 2020 Prob (F-statistic): 1.17e-19
Time: 06:40:53 Log-Likelihood: -1798.9
No. Observations: 506 AIC: 3602.
Df Residuals: 504 BIC: 3610.
Df Model: 1    
Covariance Type: nonrobust    
  coef std err t P>|t| [0.025 0.975]
const 24.0331 0.409 58.740 0.000 23.229 24.837
x1 -0.4152 0.044 -9.460 0.000 -0.501 -0.329
Omnibus: 139.832 Durbin-Watson: 0.713
Prob(Omnibus): 0.000 Jarque-Bera (JB): 295.403
Skew: 1.490 Prob(JB): 7.14e-65
Kurtosis: 5.264 Cond. No. 10.1


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
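Because the first column of x_temp is already all ones, OLS reports it as const. An equivalent and more common statsmodels pattern (a minimal sketch; model1_alt is an illustrative name, not part of the notebook) is to start from the raw feature and add the intercept with sm.add_constant:

import statsmodels.api as sm

X1 = sm.add_constant(x[:, 1])      # prepends a column of ones ('const')
model1_alt = sm.OLS(y, X1).fit()
# model1_alt.params should match model1.params above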
In [11]:
plt.plot(x[:, 1], y, 'ro', markeredgecolor = 'none')
plt.plot(x[:, 1], y_pred1, 'bo', markeredgecolor = 'none')
plt.plot(x[:, 1], p[0] + p[1]*x[:, 1], 'b-')
Out[11]:
[<matplotlib.lines.Line2D at 0x7f5c762d2a50>]
 
 

Implementation 5. statsmodels + scikit-learn: polynomial regression

In [12]:
from sklearn.preprocessing import PolynomialFeatures

polynomial_features = PolynomialFeatures(degree=8)
xp = polynomial_features.fit_transform(x[:, 1:2])

print(xp.shape)
xp
 
(506, 9)
Out[12]:
array([[1.00000000e+00, 6.32000000e-03, 3.99424000e-05, ...,
        6.37239179e-14, 4.02735161e-16, 2.54528622e-18],
       [1.00000000e+00, 2.73100000e-02, 7.45836100e-04, ...,
        4.14887357e-10, 1.13305737e-11, 3.09437968e-13],
       [1.00000000e+00, 2.72900000e-02, 7.44744100e-04, ...,
        4.13067679e-10, 1.12726170e-11, 3.07629717e-13],
       ...,
       [1.00000000e+00, 6.07600000e-02, 3.69177760e-03, ...,
        5.03160559e-08, 3.05720356e-09, 1.85755688e-10],
       [1.00000000e+00, 1.09590000e-01, 1.20099681e-02, ...,
        1.73230980e-06, 1.89843831e-07, 2.08049854e-08],
       [1.00000000e+00, 4.74100000e-02, 2.24770810e-03, ...,
        1.13558522e-08, 5.38380953e-10, 2.55246410e-11]])
In [15]:
model2 = OLS(y, xp).fit()
y_pred2 = model2.predict(xp)
y_pred2.shape
Out[15]:
(506,)
In [16]:
model2.summary()
Out[16]:
OLS Regression Results
Dep. Variable: y R-squared: 0.219
Model: OLS Adj. R-squared: 0.208
Method: Least Squares F-statistic: 19.97
Date: Tue, 28 Jul 2020 Prob (F-statistic): 1.24e-23
Time: 06:41:03 Log-Likelihood: -1777.7
No. Observations: 506 AIC: 3571.
Df Residuals: 498 BIC: 3605.
Df Model: 7    
Covariance Type: nonrobust    
  coef std err t P>|t| [0.025 0.975]
const 25.3876 0.518 48.984 0.000 24.369 26.406
x1 -1.6453 0.989 -1.663 0.097 -3.589 0.298
x2 0.1303 0.294 0.443 0.658 -0.448 0.708
x3 -0.0086 0.034 -0.256 0.798 -0.074 0.057
x4 0.0003 0.002 0.176 0.860 -0.003 0.004
x5 -6.584e-06 5.43e-05 -0.121 0.904 -0.000 0.000
x6 6.879e-08 8.63e-07 0.080 0.936 -1.63e-06 1.76e-06
x7 -3.34e-10 6.99e-09 -0.048 0.962 -1.41e-08 1.34e-08
x8 5.217e-13 2.26e-11 0.023 0.982 -4.39e-11 4.5e-11
Omnibus: 171.205 Durbin-Watson: 0.677
Prob(Omnibus): 0.000 Jarque-Bera (JB): 466.821
Skew: 1.676 Prob(JB): 4.28e-102
Kurtosis: 6.303 Cond. No. 5.32e+14


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 5.32e+14. This might indicate that there are
strong multicollinearity or other numerical problems.
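The huge condition number comes from raising a raw-scale feature to the 8th power. One common mitigation, not done in this notebook, is to standardize the feature before the polynomial expansion; a hedged sketch (xs, xp_scaled, model2_scaled are illustrative names):

from sklearn.preprocessing import StandardScaler

xs = StandardScaler().fit_transform(x[:, 1:2])              # zero mean, unit variance
xp_scaled = PolynomialFeatures(degree=8).fit_transform(xs)  # expand the scaled feature
model2_scaled = OLS(y, xp_scaled).fit()
# model2_scaled.summary() should report a much smaller condition number,
# while the fitted curve stays essentially the same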
In [17]:
plt.plot(x[:, 1], y, 'ro', markeredgecolor = 'none')
plt.plot(x[:, 1], y_pred2, 'go', markeredgecolor = 'none')
Out[17]:
[<matplotlib.lines.Line2D at 0x7f5c751974d0>]
 