Objective

The goal of this project is to build a simple classifier using the k-nearest neighbors (k-NN) algorithm to identify individual faces from the Extended Yale Face Database B. This task requires us to wrangle image data and construct new features so that we can pass the data to our k-NN method.

Data Description

The Extended Yale Face Database B contains 16,128 images of 28 human subjects under 9 poses and 64 illumination conditions; the cropped set distributed with it also covers the 10 subjects of the original Yale Face Database B, for 38 subjects in total. For this analysis, we will use only the cropped images under 4 frontal-pose lighting conditions to keep processing time down. The cropped images are in the portable graymap format (PGM) and measure 192 by 168 pixels.
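
As a quick sanity check, a single cropped image can be read with the pixmap package; the path below is an illustrative example that follows the database’s naming scheme.

library(pixmap)
face <- read.pnm("CroppedYale/yaleB01/yaleB01_P00A+000E+00.pgm")
dim(getChannels(face)) # expect: 192 168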

Method

To build our k-nearest neighbors classifier, we will take the following approach:

Pre-processing

  1. Iterate over the data directory to locate the appropriate image files.
  2. Read the image data into R as a list of matrices (192 x 168).
  3. Flatten the list of matrices into a single matrix, with each row representing one image and each column representing one pixel.
  4. Label each row with the appropriate subject ID.

Building the Model

  1. Partition the dataset into training and testing sets.
  2. Fit principal component analysis (PCA) on the training set and project both sets onto the leading components, reducing the 192 x 168 = 32,256 pixel features and mitigating the curse of dimensionality.
  3. Train k-nearest neighbors classifier; tune the k parameter.

Assessing the Model

  1. Use the k-NN classifier to predict subject ID labels for the testing set.
  2. Tweak parameters to assess model stability.
  3. ???
  4. Profit.

Code

Functions

library(pixmap) # read.pnm, getChannels, pixmapGrey
library(class)  # knn

newPar <- function(row, col, margin = TRUE) {
    #     Sets new settings for the graphics device.
    #     Args:
    #         row: number of rows in the image grid.
    #         col: number of columns in the image grid.
    #         margin: if TRUE, keep margins; if FALSE, remove them.
    #     Returns:
    #         the previous par() settings, so they can be restored later
    if (margin) {
        op <- par(mfrow = c(row, col))
    } else {
        # set everything in one par() call so op captures all previous values
        op <- par(mfrow = c(row, col),
                  oma = c(0, 0, 0, 0),
                  mar = c(0, 0, 1, 0),
                  mai = c(0, 0, 0, 0))
    }
    return(op)
}
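
A minimal usage example for newPar: capture the old settings, draw into the grid, then restore them.

op <- newPar(2, 2)              # switch to a 2 x 2 plotting grid
for (i in 1:4) plot(rnorm(10))  # any plotting calls
par(op)                         # restore the previous settings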

createFileNames <- function(pics, views, dir) {
    #     Creates a list of file names from the Yale Faces database.
    #     
    #     Args:
    #         pics: a numeric vector representing the subjects
    #         views: a character vector representing the views
    #         dir: a character vector of subject directory names
    #     Returns: a list of filenames
    filenames <- list() # initialize empty list
    for (i in 1:length(pics)) {
        for (j in 1:length(views)) {
            # generate string denoting filepath
            filename <- sprintf("CroppedYale/%s/%s_%s.pgm",
                                dir[pics[i]] , 
                                dir[pics[i]] , 
                                views[j])
            # append the filename to files
            filenames <- append(filenames, filename)
        }
    } 
    return(filenames)
}

loadPics <- function(filenames) {
    #     Loads a list of pictures and converts them to a list of matrices.
    #     
    #     Args:
    #         filenames: a list of filenames
    #     Returns:
    #         A list of matrices, each corresponding to one image.
    pic_data <- lapply(filenames, read.pnm) # read each PGM file
    pic_matrix <- lapply(pic_data, getChannels) # convert to matrix (192 x 168)
    return(pic_matrix)
}

flattenMatrix <- function(pic_matrix) {
    #     Converts a list of matrices into a matrix of flattened images.
    #     
    #     Args:
    #         pic_matrix: a list of matrices representing images
    #     Returns:
    #         A matrix with each row representing one image.
    # Pre-allocate a matrix with dimensions (number of pictures) 
    # x (number of pixels per picture)
    num_pixels <- prod(dim(pic_matrix[[1]]))
    
    flat_matrix <- matrix(nrow = length(pic_matrix), ncol = num_pixels)
    for (i in 1:length(pic_matrix)) {   
        flat_matrix[i, ] <- as.vector(pic_matrix[[i]])
    }
    return(flat_matrix)
}

createLabels <- function(pics, views, dir) {
    #     Creates a data frame of labels from the Yale Faces database.
    #     
    #     Args:
    #         pics: a numeric vector representing the subjects
    #         views: a character vector representing the views
    #         dir: a character vector of subject directory names
    #     Returns: a data frame with columns for subject and view
    subject <- vector()
    view <- vector()
    for (i in 1:length(pics)) {
        for (j in 1:length(views)) {
            subject <- append(subject, dir[pics[i]])
            view <- append(view, views[j])
        }
    }
    df <- data.frame(subject, view)
    return(df)
}

createDataPartition <- function(face_matrix, 
                                labels, 
                                n_seed = NULL, 
                                pr_train = 0.8) {
    #     Creates a dpart object with training/testing set partition.
    #     Args:
    #         face_matrix: the pre-processed face dataset
    #         labels: a vector of corresponding labels for the face_matrix
    #         n_seed: numeric. a seed for the pseudorandom number generator
    #         pr_train: numeric. proportion of dataset allocated to training
    #     Returns: a dpart object with the following:
    #         X_train: features in the training set
    #         X_test: features in the testing set
    #         y_train: labels in the training set
    #         y_test: labels in the testing set
    #         in_train: index number of training observations
    #         in_test: index number of testing observations
    fm_size <- dim(face_matrix)
    ntrain <- floor(fm_size[1]*pr_train)
    
    if (!is.null(n_seed)) set.seed(n_seed) # fix the seed if one is provided
    in_train <- sample(1:fm_size[1], ntrain)
    in_test <- c(1:fm_size[1])[-in_train]
    
    X_train <- face_matrix[in_train, ]
    X_test <- face_matrix[in_test, ]
    y_train <- labels[in_train, 1]
    y_test <- labels[in_test, 1]
    
    dpart <- list("X_train" = X_train, 
                  "X_test" = X_test, 
                  "y_train" = y_train, 
                  "y_test" = y_test,
                  "in_train" = in_train,
                  "in_test" = in_test)
    return(dpart)
}

preProcessPCA <- function(dpart, pc_num = 25) {
    #     Runs principal component analysis on a dpart object.
    #     Args:
    #         dpart: a dpart object
    #         pc_num: the number of principal components to use
    #     Returns: 
    #         data_pca: pre-processed training and testing set
    # center both sets using the mean face of the training set
    mean_face <- colMeans(dpart$X_train)
    X_train <- t(apply(dpart$X_train, 1, function(row) {
        row - mean_face}))
    X_test <- t(apply(dpart$X_test, 1, function(row) {
        row - mean_face}))
    
    # the data is already centered above, so prcomp need not center it again
    pca <- prcomp(X_train, center = FALSE, scale = FALSE)
    
    # project both sets onto the first pc_num principal components
    X_train_pca <- X_train %*% pca$rotation[, 1:pc_num]
    X_test_pca <- X_test %*% pca$rotation[, 1:pc_num]
    
    data_pca <- list("X_train_pca" = X_train_pca,
                     "X_test_pca" = X_test_pca)
    return(data_pca)
}

predictKNN <- function(dpart, pc_num = NULL, k = 1) {
    #     Trains a k-nearest neighbors classifier on a dpart object,
    #     predicts the testing-set labels, and prints the test error.
    #     Args:
    #         dpart: a dpart object
    #         pc_num: the number of principal components to use, if any
    #         k: the number of nearest neighbors to use
    #     Returns: 
    #         knn_pred: a vector of predictions 
    if (is.null(pc_num)) {
        knn_pred <- knn(dpart$X_train, dpart$X_test, dpart$y_train, k = k)
    } else {
        pca <- preProcessPCA(dpart, pc_num)
        knn_pred <- knn(pca$X_train_pca, pca$X_test_pca, dpart$y_train, k = k)
    }
    correct <- sum(knn_pred == dpart$y_test)
    incorrect <- sum(knn_pred != dpart$y_test)
    test_error <- incorrect / (correct + incorrect)
    
    cat(paste("Correctly Identified: ", correct, "\n", sep = ""))
    cat(paste("Incorrectly Identified: ", incorrect, "\n", sep = ""))
    cat(paste("Test Error: ", test_error, "\n", sep = ""))
    return(knn_pred)
}

Pre-Processing

# specify the images to be used in the analysis
views_4a <- c('P00A+000E+00', 'P00A+005E+10', 'P00A+005E-10', 'P00A+010E+00')
pic_list <- 1:38
# get directory structure (one folder per subject)
dir_list_1 <- dir(path = "CroppedYale/", all.files = FALSE)
file_list <- createFileNames(pic_list, views_4a, dir_list_1)
face_images_4a <- loadPics(file_list)
face_matrix_4a <- flattenMatrix(face_images_4a)

labels_4a <- createLabels(pic_list, views_4a, dir_list_1)

Building the Model

face_4a <- createDataPartition(face_matrix_4a, 
                               labels_4a, 
                               n_seed = 1, 
                               pr_train = 0.8)

Some quality assurance for good measure: are we getting what we expect?

# first 5 files in training set
head(labels_4a[face_4a$in_train, ], 5)
##     subject         view
## 41  yaleB11 P00A+000E+00
## 57  yaleB16 P00A+000E+00
## 86  yaleB23 P00A+005E+10
## 136 yaleB35 P00A+010E+00
## 30  yaleB08 P00A+005E+10
# first 5 files in testing set
head(labels_4a[face_4a$in_test, ], 5)
##    subject         view
## 5  yaleB02 P00A+000E+00
## 12 yaleB03 P00A+010E+00
## 18 yaleB05 P00A+005E+10
## 20 yaleB05 P00A+010E+00
## 21 yaleB06 P00A+000E+00

Let’s build a 1-NN classifier. Since 1-NN is a low-bias, high-variance model that fits the training set very tightly, we might expect it to be a poor classifier.

predictKNN(face_4a, pc_num = 25, k = 1)
## Correctly Identified: 31
## Incorrectly Identified: 0
## Test Error: 0
##  [1] yaleB02 yaleB03 yaleB05 yaleB05 yaleB06 yaleB06 yaleB08 yaleB09
##  [9] yaleB09 yaleB10 yaleB10 yaleB15 yaleB15 yaleB18 yaleB19 yaleB20
## [17] yaleB20 yaleB21 yaleB25 yaleB25 yaleB26 yaleB28 yaleB30 yaleB30
## [25] yaleB31 yaleB31 yaleB31 yaleB32 yaleB33 yaleB37 yaleB39
## 38 Levels: yaleB01 yaleB02 yaleB03 yaleB04 yaleB05 yaleB06 ... yaleB39

Interesting! There were no misclassified faces even though we predicted using a 1-NN classifier. What gives? Perhaps it has to do with the variability of the lighting conditions? Let’s test this hypothesis.

Assessing the Model

Lighting Conditions

# Use different lighting conditions
views_4c <- c('P00A-035E+15', 'P00A-050E+00', 'P00A+035E+15', 'P00A+050E+00')

file_list <- createFileNames(pic_list, views_4c, dir_list_1)
face_images_4c <- loadPics(file_list)
face_matrix_4c <- flattenMatrix(face_images_4c)

labels_4c <- createLabels(pic_list, views_4c, dir_list_1)
face_4c <- createDataPartition(face_matrix_4c,
                               labels_4c,
                               n_seed = 2,
                               pr_train = 0.8)

# first 5 files in training set
head(labels_4c[face_4c$in_train, ], 5)
##     subject         view
## 29  yaleB08 P00A-035E+15
## 107 yaleB28 P00A+035E+15
## 86  yaleB23 P00A-050E+00
## 26  yaleB07 P00A-050E+00
## 140 yaleB36 P00A+050E+00
# first 5 files in testing set
head(labels_4c[face_4c$in_test, ], 5)
##    subject         view
## 3  yaleB01 P00A+035E+15
## 5  yaleB02 P00A-035E+15
## 8  yaleB02 P00A+050E+00
## 9  yaleB03 P00A-035E+15
## 16 yaleB04 P00A+050E+00

Again, let’s build another 1-NN classifier.

knn_pred <- predictKNN(face_4c, pc_num = 25, k = 1)
## Correctly Identified: 4
## Incorrectly Identified: 27
## Test Error: 0.870967741935484

Aha! We have quite a few misclassified faces this time around. Let’s plot them and see what the deal is:

misclass_pred_idx <- which(knn_pred != face_4c$y_test)
misclass_test_idx <- face_4c$in_test[misclass_pred_idx]

op <- newPar(5, 6, margin = FALSE)
for (i in 1:length(misclass_test_idx)) {
    plot(pixmapGrey(face_images_4c[[misclass_test_idx[i]]]))
    title(knn_pred[misclass_pred_idx][i])
}
par(op) # restore the previous graphics settings

Okay, so the first set of images had no misidentified subjects for a test error of 0%. The second set of images had 27 misidentified subjects for a test error of 87.1%. Why did this happen?

By preprocessing our data with PCA, we project it onto the feature space that captures the most variance in the data. In the second set of images, most of that variance comes from the shadows cast by the different lighting conditions rather than from the faces themselves. As a result, projecting the image data onto the first 25 principal components discards many of the distinguishing features of each subject, which are exactly what the model needs to predict the subject labels accurately.
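
To make this concrete, here is a minimal sketch (reusing face_4c and the centering approach from preProcessPCA above) that checks how much of the total variance the first 25 principal components capture. If that fraction is high while classification is poor, the leading components are likely encoding illumination rather than identity.

# center the training images and fit PCA, as in preProcessPCA
mean_face <- colMeans(face_4c$X_train)
X_train_c <- sweep(face_4c$X_train, 2, mean_face)
pca <- prcomp(X_train_c, center = FALSE, scale = FALSE)

# cumulative proportion of variance captured by the first 25 components
var_explained <- cumsum(pca$sdev^2) / sum(pca$sdev^2)
round(var_explained[25], 3)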

Varying the Training/Testing Set Divide

Alright, let’s do one more test to see how minute changes can affect our prediction accuracy. What if we varied how we partitioned our training and testing sets?

pr_train <- seq(0.25, 0.90, length.out = 10)

for (i in 1:length(pr_train)) {
    dpart <- createDataPartition(face_matrix_4c,
                                 labels_4c,
                                 n_seed = 2,
                                 pr_train = pr_train[i])
    print(sprintf("Training: %s, Testing: %s", 
                  round(pr_train[i], 3), 
                  round(1-pr_train[i], 3)))
    predictKNN(dpart, pc_num = 25, k = 1)
}
## [1] "Training: 0.25, Testing: 0.75"
## Correctly Identified: 22
## Incorrectly Identified: 92
## Test Error: 0.807017543859649
## [1] "Training: 0.322, Testing: 0.678"
## Correctly Identified: 16
## Incorrectly Identified: 88
## Test Error: 0.846153846153846
## [1] "Training: 0.394, Testing: 0.606"
## Correctly Identified: 15
## Incorrectly Identified: 78
## Test Error: 0.838709677419355
## [1] "Training: 0.467, Testing: 0.533"
## Correctly Identified: 13
## Incorrectly Identified: 69
## Test Error: 0.841463414634146
## [1] "Training: 0.539, Testing: 0.461"
## Correctly Identified: 14
## Incorrectly Identified: 57
## Test Error: 0.802816901408451
## [1] "Training: 0.611, Testing: 0.389"
## Correctly Identified: 12
## Incorrectly Identified: 48
## Test Error: 0.8
## [1] "Training: 0.683, Testing: 0.317"
## Correctly Identified: 8
## Incorrectly Identified: 41
## Test Error: 0.836734693877551
## [1] "Training: 0.756, Testing: 0.244"
## Correctly Identified: 8
## Incorrectly Identified: 30
## Test Error: 0.789473684210526
## [1] "Training: 0.828, Testing: 0.172"
## Correctly Identified: 3
## Incorrectly Identified: 24
## Test Error: 0.888888888888889
## [1] "Training: 0.9, Testing: 0.1"
## Correctly Identified: 1
## Incorrectly Identified: 15
## Test Error: 0.9375

These numbers show how different divisions of the data into training and testing sets can influence the prediction accuracy (test error) of our model. With too little training data, our parameter estimates will have greater variance, resulting in a larger test error; worse, some subjects in the test set may have no corresponding examples in the training set at all, making them impossible to classify correctly. With too little testing data, the performance statistic itself will have greater variance.
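
We can check the unseen-subject effect directly. A quick sketch, reusing the objects defined above: for a small training fraction, count the test images whose subject never appears in the training set.

dpart_small <- createDataPartition(face_matrix_4c, labels_4c,
                                   n_seed = 2, pr_train = 0.25)
# test images whose subject has no examples in the training set
sum(!(dpart_small$y_test %in% dpart_small$y_train))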

Conclusions

In this project, we built a simple 1-NN classifier to identify individual subjects within a set of cropped face images. While more sophisticated algorithms for facial recognition exist, this report can serve as a gentle introduction to the world of image recognition.