KNN部分

基础实现

首先是完善k_nearest_neighbor.py中的四个函数，其中三个函数用于计算测试集和训练集两两之间的 $L_{2}$ 距离，限制条件分别是使用2个，1个，0个循环。

使用2个循环的写法最为简单：

def compute_distances_two_loops(self, X):
    """
    Compute the distance between each test point in X and each training point
    in self.X_train using a nested loop over both the training data and the
    test data.

    Inputs:
    - X: A numpy array of shape (num_test, D) containing test data.

    Returns:
    - dists: A numpy array of shape (num_test, num_train) where dists[i, j]
    is the Euclidean distance between the ith test point and the jth training
    point.
    """
    num_test = X.shape[0]
    num_train = self.X_train.shape[0]
    dists = np.zeros((num_test, num_train))
    for i in range(num_test):
        for j in range(num_train):
            #####################################################################
            # TODO:                                                          #
            # Compute the l2 distance between the ith test point and the jth    #
            # training point, and store the result in dists[i, j]. You should   #
            # not use a loop over dimension, nor use np.linalg.norm().        #
            #####################################################################
            # *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****

            dists[i,j] = np.sum((X[i]-self.X_train[j])**2)

            # *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
    return dists

直接按照定义计算即可。

使用1重循环的写法则需要按行做一个求和：

def compute_distances_one_loop(self, X):
    """
        Compute the distance between each test point in X and each training point
        in self.X_train using a single loop over the test data.

        Input / Output: Same as compute_distances_two_loops
        """
    num_test = X.shape[0]
    num_train = self.X_train.shape[0]
    dists = np.zeros((num_test, num_train))
    for i in range(num_test):
        #######################################################################
        # TODO:                                                            #
        # Compute the l2 distance between the ith test point and all training #
        # points, and store the result in dists[i, :].                      #
        # Do not use np.linalg.norm().                                      #
        #######################################################################
        # *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****

        delta = (self.X_train-X[i])**2
        dists[i] = delta.sum(axis=1)

        # *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
        return dists

不让使用循环的写法推导稍微麻烦一些，主要是对 numpy 不是很熟悉，一些矩阵操作不知道怎么转化为对应的语句。大致思路是，将原式中的平方项完全展开，然后分别计算，最后相加。

def compute_distances_no_loops(self, X):
    """
        Compute the distance between each test point in X and each training point
        in self.X_train using no explicit loops.

        Input / Output: Same as compute_distances_two_loops
        """
    num_test = X.shape[0]
    num_train = self.X_train.shape[0]
    dists = np.zeros((num_test, num_train))
    #########################################################################
    # TODO:                                                              #
    # Compute the l2 distance between all test points and all training    #
    # points without using any explicit loops, and store the result in    #
    # dists.                                                                #
    #                                                                      #
    # You should implement this function using only basic array operations; #
    # in particular you should not use functions from scipy,                #
    # nor use np.linalg.norm().                                          #
    #                                                                      #
    # HINT: Try to formulate the l2 distance using matrix multiplication    #
    #      and two broadcast sums.                                       #
    #########################################################################
    # *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****

    x1 = (X**2).sum(axis=1)
    x1 = x1.repeat(num_train).reshape(num_test, num_train)
    x2 = (self.X_train**2).sum(axis=1)
    x2 = x2.repeat(num_test).reshape(num_train, num_test)
    x2 = x2.T
    x3 = 2*np.dot(X,self.X_train.T)
    dists = x1+x2-x3

    # *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
    return dists

后面就是快乐的验证环节了。在 $k=1$ 时，获得了 $27.4\%$ 的正确率， $k=5$ 时正确率略有提升，达到了 $27.8\%$ ，和题面中的描述相符，基本上可以确定代码实现没有问题。

后面是对1重循环实现的检验。在检验时使用了 Frobenius 范数。所谓 Frobenius 范数，就是矩阵所有元素的平方和再开方。两个矩阵的范数几乎完全相同：

0重循环也是一样：

最后是三种实现的速度比较：

结果出来前卡了很久，完全没有想到纯向量计算可以省下这么多时间。

交叉验证

这一部分是实现对超参数 $k$ 的交叉验证。任务是将训练集分为5个部分，每次选择其中的4个部分构成新的训练集，剩下的一个部分作为验证集，跑5次 KNN 检验 $k$ 的效果。完整版代码如下：

num_folds = 5
k_choices = [1, 3, 5, 8, 10, 12, 15, 20, 50, 100]

X_train_folds = []
y_train_folds = []
################################################################################
# TODO:                                                                        #
# Split up the training data into folds. After splitting, X_train_folds and    #
# y_train_folds should each be lists of length num_folds, where                #
# y_train_folds[i] is the label vector for the points in X_train_folds[i].     #
# Hint: Look up the numpy array_split function.                                #
################################################################################
# *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****

length = X_train.shape[0] // num_folds

for i in range(num_folds):
    X_train_folds.append(X_train[i*length:(i+1)*length,:])
    y_train_folds.append(y_train[i*length:(i+1)*length])

# *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****

# A dictionary holding the accuracies for different values of k that we find
# when running cross-validation. After running cross-validation,
# k_to_accuracies[k] should be a list of length num_folds giving the different
# accuracy values that we found when using that value of k.
k_to_accuracies = {}


################################################################################
# TODO:                                                                        #
# Perform k-fold cross validation to find the best value of k. For each        #
# possible value of k, run the k-nearest-neighbor algorithm num_folds times,   #
# where in each case you use all but one of the folds as training data and the #
# last fold as a validation set. Store the accuracies for all fold and all     #
# values of k in the k_to_accuracies dictionary.                               #
################################################################################
# *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****

for k in k_choices:
    k_to_accuracies[k] = []
    for i in range(num_folds):
        test = np.array([])
        labels = np.array([])

        for j in range(num_folds):
            if j!=i:
                if test.shape[0] == 0:
                    test = X_train_folds[j]
                    labels = y_train_folds[j]
                else:
                    test = np.vstack((test, X_train_folds[j]))
                    labels = np.vstack((labels, y_train_folds[j]))
        labels = labels.reshape(4000)
        classifier.train(test, labels)
        dists = classifier.compute_distances_no_loops(X_train_folds[i])
        print(k)
        pred = classifier.predict_labels(dists, k=k)
        num_correct = np.sum(pred == y_train_folds[i])
        accuracy = float(num_correct) / (length * (num_folds-1))

        k_to_accuracies[k].append(accuracy)

# *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****

# Print out the computed accuracies
for k in sorted(k_to_accuracies):
    for accuracy in k_to_accuracies[k]:
        print('k = %d, accuracy = %f' % (k, accuracy))

绘制出的交叉验证的结果如下图所示：

从图上可以看出 $k$ 在取到10,13的时候最优。将 $k=10$ 代入，最终在测试集上取得了 $28.2\%$ 的正确率，符合描述。

Inline Question

KNN 部分有三道 Inline Question，相当于简答题。在这里给出我自己的答案(已参考过网上的答案)。

Q1

上面的图是距离矩阵。横坐标是测试集编号，纵坐标是训练集编号。图中白色的部分说明矩阵该位置的值较大，黑色说明值较小。

Q ： What in the data is the cause behind the distinctly bright rows?
A ：测试集中有些图片长的很奇怪，和训练集中大部分图片都长的不太像。
Q ：What causes the columns?
A ：训练集中有一些图片，和测试集中大部分图片相似度都很低。

Q2

答案为1,3,5。

Q3

1错。考虑 $L2$ 距离，平面上有一圈点围成一个圆，内部有另一个点，分属于两个类别。则边界会是一个环形，不是线性的。

2对。1-NN 只会找最接近的，也就是直接在训练集中找到了完全相同的图，而 5-NN 可能会有误差。

3错。测试集中随便构造一个数据集中没出现过的数据，就可以让 1-NN 分错，而 5-NN 能分类正确。

4对。从原理上不难得出。

感想

KNN 这一部分，代码实现的难度并不是很大，只在完成 no_loop 部分的时候由于对 numpy 不够熟悉卡了一会。毕竟是第一次作业，主要以熟悉 Python 与 Numpy 为主。

这种作业的形式个人非常喜欢，有很不错的正反馈，题目难度也适中，和视频搭配使用效果极佳，只能说不愧是斯坦福。

声明: 本文采用 BY-NC-SA 协议进行授权，如无注明均为原创，转载请注明转自 Marvolo
本文地址: 深度学习课程 CS231n Assignment1 KNN部分