Choosing parameters such as C and alpha, and cross-validation
Score, and cross-validated scores
We can use the score method to judge the quality of the fit.
from sklearn import datasets, svm
digits = datasets.load_digits()
X_digits = digits.data
y_digits = digits.target
svc = svm.SVC(C=1, kernel='linear')  # C is the penalty parameter of the error term
svc.fit(X_digits[:-100], y_digits[:-100]).score(X_digits[-100:], y_digits[-100:])
0.97999999999999998
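For a classifier, score returns the mean accuracy on the given test data. As a quick check, a sketch assuming the svc fitted above, using sklearn.metrics.accuracy_score:
from sklearn.metrics import accuracy_score
pred = svc.predict(X_digits[-100:])           # predict on the 100 held-out samples
print(accuracy_score(y_digits[-100:], pred))  # same value as the score above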
import numpy as np
scores = list()
X_folds = np.array_split(X_digits, 3)  # split the data into 3 equal folds
y_folds = np.array_split(y_digits, 3)
for k in range(3):
    X_train = list(X_folds)  # list() makes a copy of X_folds so pop() does not modify it
    X_test = X_train.pop(k)  # use fold k as the test set
    X_train = np.concatenate(X_train)  # join the remaining folds into one training array
    y_train = list(y_folds)
    y_test = y_train.pop(k)
    y_train = np.concatenate(y_train)
    scores.append(svc.fit(X_train, y_train).score(X_test, y_test))
print(scores)
[0.93489148580968284, 0.95659432387312182, 0.93989983305509184]
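The per-fold scores are usually summarized by their mean, e.g.:
print(np.mean(scores))  # average accuracy over the 3 folds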
Cross-validation generators
scikit-learn has a collection of classes that can be used to generate lists of train/test indices.
The split method returns the train/test set indices for each split.
from sklearn.model_selection import KFold, cross_val_score
X = ['a', 'a', 'b', 'c', 'c', 'c', 'a', 'b']
k_fold = KFold(n_splits=4)  # n_splits: split the data into n_splits folds
for train_indices, test_indices in k_fold.split(X):
    print('Train: %s | test: %s' % (train_indices, test_indices))
Train: [2 3 4 5 6 7] | test: [0 1]
Train: [0 1 4 5 6 7] | test: [2 3]
Train: [0 1 2 3 6 7] | test: [4 5]
Train: [0 1 2 3 4 5] | test: [6 7]
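KFold is only one of the available cross-validation generators; StratifiedKFold, for example, preserves the class proportions in each fold. A small sketch on the digits data (the choice of n_splits here is only illustrative):
from sklearn.model_selection import StratifiedKFold
skf = StratifiedKFold(n_splits=4)  # each fold keeps roughly the same class balance
for train_indices, test_indices in skf.split(X_digits, y_digits):
    print('Train size: %s | test size: %s' % (len(train_indices), len(test_indices)))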
The cross-validation can then be performed easily:
k_fold = KFold(n_splits=4)  # re-create the 4-fold splitter used on the digits data below
[svc.fit(X_digits[train], y_digits[train]).score(X_digits[test], y_digits[test])\
for train,test in k_fold.split(X_digits)]
[0.95777777777777773,
0.9376391982182628,
0.97327394209354123,
0.92873051224944325]
# A more convenient way to compute the cross-validation scores
cross_val_score(svc, X_digits, y_digits, cv=k_fold, n_jobs=-1)  # n_jobs=-1 uses all CPUs
array([ 0.95777778, 0.9376392 , 0.97327394, 0.92873051])
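cross_val_score also accepts a scoring argument to use a metric other than the estimator's default score method; as an illustration with the 'precision_macro' scorer:
cross_val_score(svc, X_digits, y_digits, cv=k_fold, scoring='precision_macro', n_jobs=-1)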
Grid-search and cross-validated estimators
Grid-search
# GridSearchCV computes the score during the fit of an estimator on a parameter grid
# and chooses the parameters that maximize the cross-validation score
from sklearn.model_selection import GridSearchCV, cross_val_score
Cs = np.logspace(-6, -1, 10)
clf = GridSearchCV(estimator=svc, param_grid=dict(C=Cs), n_jobs=-1)
clf.fit(X_digits[:1000], y_digits[:1000])
clf.best_score_
0.92500000000000004
clf.best_estimator_.C
0.0077426368268112772
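Since GridSearchCV by default refits the best estimator on the full data passed to fit, clf can be used directly; a sketch evaluating it on the samples held out above:
clf.score(X_digits[1000:], y_digits[1000:])  # accuracy of the refit best estimator on unseen samples
clf.predict(X_digits[1000:1010])             # predictions from the best estimator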
Cross-validated estimators
Some estimators can set parameters such as alpha and C automatically by cross-validation.
C and alpha both control the amount of regularization; the difference is a choice of terminology. C is proportional to 1/alpha, so either can be selected with GridSearchCV in the same way, but remember that a higher C, like a lower alpha, means weaker regularization and is therefore more likely to overfit. For example, the Lasso minimizes:
(1 / (2 * n_samples)) * ||y - Xw||^2_2 + alpha * ||w||_1
from sklearn import linear_model, datasets
lasso = linear_model.LassoCV()  # performs cross-validation automatically and picks the alpha with the best score
diabetes = datasets.load_diabetes()
X_diabetes = diabetes.data
y_diabetes = diabetes.target
lasso.fit(X_diabetes, y_diabetes)
# The estimator automatically chose its regularization parameter alpha:
lasso.alpha_
0.012291895087486173
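For comparison, the same parameter could be selected with an explicit GridSearchCV over Lasso; a minimal sketch, where the alpha grid is an arbitrary illustrative choice:
alphas = np.logspace(-4, -0.5, 30)  # candidate alpha values (illustrative range)
lasso_gs = GridSearchCV(linear_model.Lasso(), param_grid=dict(alpha=alphas), n_jobs=-1)
lasso_gs.fit(X_diabetes, y_diabetes)
lasso_gs.best_estimator_.alpha  # should be comparable to lasso.alpha_ above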