Choosing parameters such as C and alpha, and cross-validation
Score, and cross-validated scores
We can use the score method to judge the quality of the fit.
from sklearn import datasets, svm
digits = datasets.load_digits()
X_digits = digits.data
y_digits = digits.target
svc = svm.SVC(C=1, kernel='linear')  # C is the penalty parameter of the error term
svc.fit(X_digits[:-100], y_digits[:-100]).score(X_digits[-100:], y_digits[-100:])
0.97999999999999998
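For a classifier, score returns the mean accuracy on the given test data. As a quick check, a sketch assuming the svc fitted above, using sklearn.metrics.accuracy_score:
from sklearn.metrics import accuracy_score
pred = svc.predict(X_digits[-100:])           # predict on the 100 held-out samples
print(accuracy_score(y_digits[-100:], pred))  # same value as the score above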
import numpy as np
scores = list()
X_folds = np.array_split(X_digits, 3)  # split the data into 3 equal folds
y_folds = np.array_split(y_digits, 3)
for k in range(3):
    X_train = list(X_folds)  # list() makes a copy of X_folds so pop() does not modify it
    X_test = X_train.pop(k)  # use fold k as the test set
    X_train = np.concatenate(X_train)  # join the remaining folds into one training array
    y_train = list(y_folds)
    y_test = y_train.pop(k)
    y_train = np.concatenate(y_train)
    scores.append(svc.fit(X_train, y_train).score(X_test, y_test))
print(scores)
[0.93489148580968284, 0.95659432387312182, 0.93989983305509184]
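The per-fold scores are usually summarized by their mean, e.g.:
print(np.mean(scores))  # average accuracy over the 3 folds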
Cross-validation generators
scikit-learn has a collection of classes that can be used to generate lists of train/test indices.
The split method returns the train/test set indices for each split.
from sklearn.model_selection import KFold, cross_val_score
X = ['a', 'a', 'b', 'c', 'c', 'c', 'a', 'b']
k_fold = KFold(n_splits=4)  # n_splits: split the data into n_splits folds
for train_indices, test_indices in k_fold.split(X):
    print('Train: %s | test: %s' % (train_indices, test_indices))
Train: [2 3 4 5 6 7] | test: [0 1]
Train: [0 1 4 5 6 7] | test: [2 3]
Train: [0 1 2 3 6 7] | test: [4 5]
Train: [0 1 2 3 4 5] | test: [6 7]
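KFold is only one of the available cross-validation generators; StratifiedKFold, for example, preserves the class proportions in each fold. A small sketch on the digits data (the choice of n_splits here is only illustrative):
from sklearn.model_selection import StratifiedKFold
skf = StratifiedKFold(n_splits=4)  # each fold keeps roughly the same class balance
for train_indices, test_indices in skf.split(X_digits, y_digits):
    print('Train size: %s | test size: %s' % (len(train_indices), len(test_indices)))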
The cross-validation can then be performed easily:
k_fold = KFold(n_splits=4)  # re-create the 4-fold splitter used on the digits data below
[svc.fit(X_digits[train], y_digits[train]).score(X_digits[test], y_digits[test])\
for train,test in k_fold.split(X_digits)]
[0.95777777777777773,
0.9376391982182628,
0.97327394209354123,
0.92873051224944325]
# A more convenient way to compute the cross-validation scores
cross_val_score(svc, X_digits, y_digits, cv=k_fold, n_jobs=-1)  # n_jobs=-1 uses all CPUs
array([ 0.95777778, 0.9376392 , 0.97327394, 0.92873051])
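cross_val_score also accepts a scoring argument to use a metric other than the estimator's default score method; as an illustration with the 'precision_macro' scorer:
cross_val_score(svc, X_digits, y_digits, cv=k_fold, scoring='precision_macro', n_jobs=-1)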
Grid-search and cross-validated estimators
Grid-search
# GridSearchCV computes the score during the fit of an estimator on a parameter grid
# and chooses the parameters that maximize the cross-validation score
from sklearn.model_selection import GridSearchCV, cross_val_score
Cs = np.logspace(-6, -1, 10)
clf = GridSearchCV(estimator=svc, param_grid=dict(C=Cs), n_jobs=-1)
clf.fit(X_digits[:1000], y_digits[:1000])
clf.best_score_
0.92500000000000004
clf.best_estimator_.C
0.0077426368268112772
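Since GridSearchCV by default refits the best estimator on the full data passed to fit, clf can be used directly; a sketch evaluating it on the samples held out above:
clf.score(X_digits[1000:], y_digits[1000:])  # accuracy of the refit best estimator on unseen samples
clf.predict(X_digits[1000:1010])             # predictions from the best estimator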
Cross-validated estimators
Some estimators can set parameters such as alpha and C automatically by cross-validation.
C and alpha both control the amount of regularization; the difference is a choice of terminology. C is proportional to 1/alpha, so either can be selected with GridSearchCV in the same way, but remember that a higher C, like a lower alpha, means weaker regularization and is therefore more likely to overfit. For example, the Lasso minimizes:
(1 / (2 * n_samples)) * ||y - Xw||^2_2 + alpha * ||w||_1
from sklearn import linear_model, datasets
lasso = linear_model.LassoCV()  # performs cross-validation automatically and picks the alpha with the best score
diabetes = datasets.load_diabetes()
X_diabetes = diabetes.data
y_diabetes = diabetes.target
lasso.fit(X_diabetes, y_diabetes)
# The estimator automatically chose its regularization parameter alpha:
lasso.alpha_
0.012291895087486173
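For comparison, the same parameter could be selected with an explicit GridSearchCV over Lasso; a minimal sketch, where the alpha grid is an arbitrary illustrative choice:
alphas = np.logspace(-4, -0.5, 30)  # candidate alpha values (illustrative range)
lasso_gs = GridSearchCV(linear_model.Lasso(), param_grid=dict(alpha=alphas), n_jobs=-1)
lasso_gs.fit(X_diabetes, y_diabetes)
lasso_gs.best_estimator_.alpha  # should be comparable to lasso.alpha_ above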