The K-nearest neighbors (KNN) algorithm is a non-parametric supervised machine learning method used for classification and regression. We will go over the basics of how it works and use it to predict whether an individual belongs to a group given some of their attributes.
Imagine we have some initial observations with various attributes (for example, whether the person is married, their wage, education, and experience). When we obtain the attributes of a new individual (wage, education, experience), we want to determine whether that individual is married. To do that, we compute the Euclidean distance between the new individual’s attributes and those of each previously observed individual, find the k closest observations, and take a majority vote of their labels. If most of the new data point’s nearest neighbors are married, KNN will classify that individual as married. You can read more about KNN here.
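To make the distance computation concrete, here is a small sketch with made-up numbers (hypothetical values, not from the data set used below): we measure the Euclidean distance between a new person’s attribute vector and each previously observed person.
new_person=c(0.40, 0.70, 0.25) # hypothetical (wage, educ, exper), already scaled to [0,1]
observed=rbind(c(0.35, 0.65, 0.30),  # a married individual
               c(0.90, 0.20, 0.80)) # an unmarried individual
distances=apply(observed, 1, function(x) sqrt(sum((x-new_person)^2))) # Euclidean distance to each observed point
distances # the new person is far closer to the married individual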
To run the knn() and npreg() commands, we need to load the following libraries.
library(class)
library(np)
In the example below, we will try to predict whether a person is married given information on their wage, education, work experience, and tenure, using the K-nearest neighbors algorithm in R, and compare the results with other predictive methods. First, let’s load the data.
data(wage1, package='wooldridge')
data1=wage1
summary(data1)
## wage educ exper tenure
## Min. : 0.530 Min. : 0.00 Min. : 1.00 Min. : 0.000
## 1st Qu.: 3.330 1st Qu.:12.00 1st Qu.: 5.00 1st Qu.: 0.000
## Median : 4.650 Median :12.00 Median :13.50 Median : 2.000
## Mean : 5.896 Mean :12.56 Mean :17.02 Mean : 5.105
## 3rd Qu.: 6.880 3rd Qu.:14.00 3rd Qu.:26.00 3rd Qu.: 7.000
## Max. :24.980 Max. :18.00 Max. :51.00 Max. :44.000
## nonwhite female married numdep
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.000
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.000
## Median :0.0000 Median :0.0000 Median :1.0000 Median :1.000
## Mean :0.1027 Mean :0.4791 Mean :0.6084 Mean :1.044
## 3rd Qu.:0.0000 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:2.000
## Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :6.000
## smsa northcen south west
## Min. :0.0000 Min. :0.000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.000 1st Qu.:0.0000 1st Qu.:0.0000
## Median :1.0000 Median :0.000 Median :0.0000 Median :0.0000
## Mean :0.7224 Mean :0.251 Mean :0.3555 Mean :0.1692
## 3rd Qu.:1.0000 3rd Qu.:0.750 3rd Qu.:1.0000 3rd Qu.:0.0000
## Max. :1.0000 Max. :1.000 Max. :1.0000 Max. :1.0000
## construc ndurman trcommpu trade
## Min. :0.00000 Min. :0.0000 Min. :0.00000 Min. :0.0000
## 1st Qu.:0.00000 1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.:0.0000
## Median :0.00000 Median :0.0000 Median :0.00000 Median :0.0000
## Mean :0.04563 Mean :0.1141 Mean :0.04373 Mean :0.2871
## 3rd Qu.:0.00000 3rd Qu.:0.0000 3rd Qu.:0.00000 3rd Qu.:1.0000
## Max. :1.00000 Max. :1.0000 Max. :1.00000 Max. :1.0000
## services profserv profocc clerocc
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000
## Median :0.0000 Median :0.0000 Median :0.0000 Median :0.0000
## Mean :0.1008 Mean :0.2586 Mean :0.3669 Mean :0.1673
## 3rd Qu.:0.0000 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:0.0000
## Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000
## servocc lwage expersq tenursq
## Min. :0.0000 Min. :-0.6349 Min. : 1.0 Min. : 0.00
## 1st Qu.:0.0000 1st Qu.: 1.2030 1st Qu.: 25.0 1st Qu.: 0.00
## Median :0.0000 Median : 1.5369 Median : 182.5 Median : 4.00
## Mean :0.1407 Mean : 1.6233 Mean : 473.4 Mean : 78.15
## 3rd Qu.:0.0000 3rd Qu.: 1.9286 3rd Qu.: 676.0 3rd Qu.: 49.00
## Max. :1.0000 Max. : 3.2181 Max. :2601.0 Max. :1936.00
plot(factor(data1$married)) # bar chart of unmarried (0) vs. married (1) counts
We see that our sample contains both married and non-married individuals. The numerical variables (wage, education, experience, tenure) need to be normalized. Normalization is a simple and useful tool in many applications: it rescales each variable to the range from 0 to 1. Having all attributes on the same scale matters when computing Euclidean distances; otherwise a variable with a much larger range would dominate (carry more weight than) the others. Technically we should not pass the variable “married” through the normalization function, but since it already lies between 0 and 1, the function leaves it unchanged.
normalizationfx=function(x) { (x-min(x))/(max(x)-min(x)) } # min-max normalization to the [0,1] range
data2=as.data.frame(lapply(data1[,c("wage","educ","exper","tenure","married")], normalizationfx))
summary(data2)
## wage educ exper tenure
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.00000
## 1st Qu.:0.1145 1st Qu.:0.6667 1st Qu.:0.0800 1st Qu.:0.00000
## Median :0.1685 Median :0.6667 Median :0.2500 Median :0.04545
## Mean :0.2195 Mean :0.6979 Mean :0.3203 Mean :0.11601
## 3rd Qu.:0.2597 3rd Qu.:0.7778 3rd Qu.:0.5000 3rd Qu.:0.15909
## Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.00000
## married
## Min. :0.0000
## 1st Qu.:0.0000
## Median :1.0000
## Mean :0.6084
## 3rd Qu.:1.0000
## Max. :1.0000
We will need a sample to train the algorithm, that is, to set up the rules that will then be used to predict the marital status of an individual given the predictor variables. To train the algorithm, let’s use 80% of the full data set. Our goal is to best predict the remaining 20% of the data.
set.seed(123) #makes sure the results are replicable
sample1=sample(1:nrow(wage1), 0.80*nrow(wage1)) # row indices of the 80% training sample
data3_train=data2[sample1,]
data3_test=data2[-sample1,]
First, we will arbitrarily start with a low value of k, that is, k=3. To run the KNN algorithm, we can use the predefined knn() function from the class package. Otherwise, we could write one ourselves (for example, see here).
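For intuition, a minimal hand-rolled version might look like the sketch below (knn_sketch is our own illustrative name, not part of any package; it assumes all attribute columns are numeric): for each test point, it finds the k smallest Euclidean distances to the training points and takes a majority vote of their labels.
knn_sketch=function(train, test, cl, k) {
  apply(test, 1, function(x) {
    d=sqrt(colSums((t(train)-x)^2)) # distances from x to every training point
    votes=cl[order(d)[1:k]] # labels of the k nearest neighbors
    names(which.max(table(votes))) # majority class among those neighbors
  })
}
In what follows, we use the optimized implementation from the class package.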
Knn1=knn(data3_train[,c(1:4)], data3_test[,c(1:4)], cl=data3_train[,5], k=3) # columns 1:4 hold the attributes, column 5 the married label
Result1=cbind(data3_test,Knn1)[,c(5,6)] # actual vs. predicted marital status
Accuracy1=sum(Result1[,1]==Result1[,2])/nrow(Result1) # share of correct predictions
Table1=table(Knn1,data3_test[,5]) # confusion matrix: rows = predicted, columns = actual
Let’s try higher values of k. Increasing the value of k, in simple terms, increases the size of the neighborhood in the multidimensional attribute space.
Knn2=knn(data3_train[,c(1:4)], data3_test[,c(1:4)], cl=data3_train[,5], k=5)
Result2=cbind(data3_test,Knn2)[,c(5,6)]
Accuracy2=sum(Result2[,1]==Result2[,2])/nrow(Result2)
Table2=table(Knn2,data3_test[,5])
Again, let’s see what happens if we increase k to 10.
Knn3=knn(data3_train[,c(1:4)], data3_test[,c(1:4)], cl=data3_train[,5], k=10)
Result3=cbind(data3_test,Knn3)[,c(5,6)]
Accuracy3=sum(Result3[,1]==Result3[,2])/nrow(Result3)
Table3=table(Knn3,data3_test[,5])
And one more time. Now let’s set k=25.
Knn4=knn(data3_train[,c(1:4)], data3_test[,c(1:4)], cl=data3_train[,5], k=25)
Result4=cbind(data3_test,Knn4)[,c(5,6)]
Accuracy4=sum(Result4[,1]==Result4[,2])/nrow(Result4)
Table4=table(Knn4,data3_test[,5])
Finally, let’s see what happens if we do not specify k at all. Note that use.all = TRUE is in fact knn()’s default and only controls how ties at the k-th nearest distance are handled; since k itself defaults to 1, this run classifies each test point by its single nearest neighbor. Such a tiny neighborhood can pick up intricate patterns in narrow ranges of the data, but it is also very sensitive to noise, which shows up in the lower accuracy below. Averaging over more neighbors (a larger k) reduces the variance, at the risk of smoothing those local patterns away.
Knn5=knn(data3_train[,c(1:4)], data3_test[,c(1:4)], cl=data3_train[,5], use.all = TRUE)
Result5=cbind(data3_test,Knn5)[,c(5,6)]
Accuracy5=sum(Result5[,1]==Result5[,2])/nrow(Result5)
Table5=table(Knn5,data3_test[,5])
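To genuinely use all the neighbors, we could instead set k to the size of the training sample. As the sketch below illustrates (Knn_all is an illustrative name), the vote then always goes to the overall majority class, the extreme of low variance and high bias:
Knn_all=knn(data3_train[,c(1:4)], data3_test[,c(1:4)], cl=data3_train[,5], k=nrow(data3_train)) # every training point votes
table(Knn_all) # all predictions collapse to the majority class (here, married)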
For comparison, let’s also estimate linear, probit, and non-parametric regression models. A linear regression model with a binary dependent variable is called a linear probability model. If the fitted value is greater than or equal to 0.5, we will predict the individual to be married.
reg1=lm(married~wage+educ+exper+tenure, data=wage1[sample1,])
summary(reg1)
##
## Call:
## lm(formula = married ~ wage + educ + exper + tenure, data = wage1[sample1,
## ])
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.0146 -0.4137 0.1681 0.4143 0.6468
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.019714 0.123486 0.160 0.8732
## wage 0.018351 0.007181 2.556 0.0110 *
## educ 0.021914 0.009590 2.285 0.0228 *
## exper 0.010659 0.001993 5.348 1.48e-07 ***
## tenure 0.002621 0.003703 0.708 0.4795
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4575 on 415 degrees of freedom
## Multiple R-squared: 0.1417, Adjusted R-squared: 0.1334
## F-statistic: 17.12 on 4 and 415 DF, p-value: 5.219e-13
pred1=predict(reg1, newdata=wage1[-sample1,])
pred1a=pred1>=0.5 # classify as married if the fitted value is at least 0.5
ResultReg1=data.frame(pred1a,wage1[-sample1,"married"])
AccuracyReg1=sum(ResultReg1[,1]==ResultReg1[,2])/nrow(ResultReg1)
TableReg1=table(pred1a,data3_test[,5])
Probit regression does not assume a linear functional form. Since fitted values below 0 or above 1 make no sense as probabilities, this model restricts the predicted probabilities to lie between 0 and 1.
reg2 = glm(married~wage+educ+exper+tenure, family = binomial(link = probit), data = wage1[sample1,])
summary(reg2)
##
## Call:
## glm(formula = married ~ wage + educ + exper + tenure, family = binomial(link = probit),
## data = wage1[sample1, ])
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.3464 -1.0086 0.5962 1.0069 1.4795
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.366939 0.362171 -3.774 0.00016 ***
## wage 0.062110 0.024081 2.579 0.00990 **
## educ 0.058401 0.027985 2.087 0.03690 *
## exper 0.028674 0.006026 4.758 1.95e-06 ***
## tenure 0.012872 0.012101 1.064 0.28744
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 566.91 on 419 degrees of freedom
## Residual deviance: 501.99 on 415 degrees of freedom
## AIC: 511.99
##
## Number of Fisher Scoring iterations: 4
pred2=predict(reg2, newdata=wage1[-sample1,])
pred2a=pred2>=0.5
ResultReg2=data.frame(pred2a,wage1[-sample1,"married"])
AccuracyReg2=sum(ResultReg2[,1]==ResultReg2[,2])/nrow(ResultReg2)
TableReg2=table(pred2a,data3_test[,5])
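One caveat: predict() on a glm object returns the linear predictor (the latent index) by default, so the 0.5 cutoff above is applied on the wrong scale; this largely explains the probit model’s poor accuracy in the comparison below. A sketch of the corrected classification (pred2b is just an illustrative name):
pred2b=predict(reg2, newdata=wage1[-sample1,], type="response") # predicted probabilities on the 0-1 scale
pred2b_class=pred2b>=0.5 # 0.5 is now the conventional cutoff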
Non-parametric regression does not assume any functional form for the relationship between the variables. It takes significantly longer to run, even with a small number of variables and a small data set like this one.
reg3=npreg(married~wage+educ+exper+tenure, regtype = "ll", bwmethod = "cv.aic", gradients = TRUE, data = wage1[sample1,])
summary(reg3)
##
## Regression Data: 420 training points, in 4 variable(s)
## wage educ exper tenure
## Bandwidth(s): 18.08872 169659436 4.812158 47540614
##
## Kernel Regression Estimator: Local-Linear
## Bandwidth Type: Fixed
## Residual standard error: 0.4182555
## R-squared: 0.2744703
##
## Continuous Kernel Type: Second-Order Gaussian
## No. Continuous Explanatory Vars.: 4
pred3=predict(reg3, newdata=wage1[-sample1,])
pred3a=pred3>=0.5
ResultReg3=data.frame(pred3a,wage1[-sample1,"married"])
AccuracyReg3=sum(ResultReg3[,1]==ResultReg3[,2])/nrow(ResultReg3)
TableReg3=table(pred3a,data3_test[,5])
To compare the results of all the above predictions, we combine the confusion matrices into one table below, separated by columns of NAs. The first matrix, for the first KNN estimation, correctly predicts 17 unmarried and 61 married people. However, 19 individuals who were not married were classified as married, and similarly 9 who were actually married were predicted to be unmarried. All in all, our first algorithm was 73.6% accurate in its predictions. As we increased the number of neighboring observations to k=5 and k=10, the prediction accuracy rose to 78.3%: model 2 misclassified only 3 married people as unmarried, and model 3 only 2.
The linear probability model and the non-parametric regression yield similar, though slightly lower, accuracy than the best KNN runs. The probit model fares worst (56.6%), largely an artifact of applying the 0.5 cutoff on the link scale, as noted above.
cbind(Table1,NA,Table2,NA,Table3,NA,Table4,NA,Table5,NA,TableReg1,NA,TableReg2,NA,TableReg3)
## 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1
## 0 17 9 NA 16 3 NA 15 2 NA 15 3 NA 16 19 NA 19 10 NA 29 39 NA 17 5
## 1 19 61 NA 20 67 NA 21 68 NA 21 67 NA 20 51 NA 17 60 NA 7 31 NA 19 65
round(data.frame(Accuracy1,Accuracy2,Accuracy3,Accuracy4,Accuracy5,AccuracyReg1,AccuracyReg2, AccuracyReg3),3)
## Accuracy1 Accuracy2 Accuracy3 Accuracy4 Accuracy5 AccuracyReg1
## 1 0.736 0.783 0.783 0.774 0.632 0.745
## AccuracyReg2 AccuracyReg3
## 1 0.566 0.774
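Equivalently, each accuracy can be read straight off its confusion matrix, since the correct predictions sit on the diagonal:
sum(diag(Table1))/sum(Table1) # (17+61)/106 = 0.736 for the first KNN model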
As we can see, our best KNN models predicted who is married with an accuracy of almost 80%. As we acquire more data, we could expect the prediction accuracy to increase.
We could also generalize the search over k using a ‘for’ loop. For example, try the following code:
Accuracy=rep(0,100)
for (i in 1:100)
{
  Knn_i=knn(data3_train[,c(1:4)], data3_test[,c(1:4)], cl=data3_train[,5], k=i) # KNN with neighborhood size i
  Result_i=cbind(data3_test,Knn_i)[,c(5,6)]
  Accuracy[i]=sum(Result_i[,1]==Result_i[,2])/nrow(Result_i) # fresh names so Knn5 and Result5 above are not overwritten
}
plot(Accuracy, type="l")
max(Accuracy)
## [1] 0.7924528
which(Accuracy==max(Accuracy))
## [1] 15 17 22
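Note that picking k by accuracy on the test set risks tailoring the model to this particular 20% split. A more careful, though still simple, alternative is leave-one-out cross-validation on the training data, which the class package provides via knn.cv(); a sketch (AccuracyCV and CvPred are illustrative names):
AccuracyCV=rep(0,100)
for (i in 1:100)
{
  CvPred=knn.cv(data3_train[,c(1:4)], cl=data3_train[,5], k=i) # each training point is classified by its neighbors among the rest
  AccuracyCV[i]=mean(CvPred==data3_train[,5])
}
which.max(AccuracyCV) # candidate k chosen without touching the test set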