I have been on a machine learning MOOCs binge in the last year. The one weakness so far is the treatment of support vector machines (SVM). It's a shame really, since other popular classification algorithms are covered well. I should mention that there are two exceptions: Andrew Ng's Machine Learning on Coursera and the Statistical Learning course on Stanford's Lagunita platform. Even so, both courses miss the mark when it comes to SVM in my opinion. Andrew Ng watered down the presentation of SVM classifiers compared to his Stanford class notes, but he still treated the subject much better than his competitors. Statistical Learning by Trevor Hastie and Rob Tibshirani was an amazing course, but unfortunately SVM was treated somewhat superficially. I hate to be critical of these two courses since overall they are amazing and I am very grateful to the instructors for making this content available to the masses. So I decided to check out some other online content on SVM and have a stab at presenting the material in a style that is friendly to newbies.

This will be a three-part series, with today's post covering the linearly separable, hard margin formulation of SVM. In a second post I will discuss a soft margin SVM. In a final post I will discuss what is commonly referred to as the kernel trick to handle non-linear decision boundaries. I borrow heavily from the authors that I will mention at the end of the post, so to add some original content I implement SVM in an Excel spreadsheet using Solver. I find that Excel is a great pedagogical tool and hope you agree with me and find the toy model interesting.

In this post I will work with an example in two dimensions so that we can easily visualize the geometry of our classifier. All the points discussed apply equally in higher dimensions. We will work with a binary classification problem. Our feature space will consist of 12 observations of x1 and x2 pairs. In general we will have N observations and P will denote the number of features. Each data point is labeled as a +1 or a -1. In the plot I use + for +1 and – for -1.

What a linear classifier attempts to accomplish is to split the feature space into two half-spaces by placing a hyperplane between the data points. This hyperplane will be our decision boundary. All points on one side of the plane will belong to class C1 and all points on the other side will belong to the second class C2. Unfortunately, there are many ways in which we can place a hyperplane to divide the data. Below is an example of two candidate hyperplanes for our data sample.

Both H1 and H2 separate the data without making any classification errors. Points above the line are classified as +1 and points below the line are classified as -1. A solution to this problem of finding an ‘optimal’ separating hyperplane is to introduce the concept of a margin. We first select a hyperplane (decision boundary) and then extend two parallel hyperplanes on either side of our decision boundary. We widen the parallel hyperplanes until we touch one or more of our data points on either side, but without allowing any of the points to penetrate the boundaries. The distance between these parallel hyperplane boundaries is called the margin. The decision boundary is equidistant from these parallel hyperplanes so that we do not bias our decision in favour of any particular class. In the plot below I show these margins for each decision boundary. For H1 the parallel hyperplanes are plotted in green; for H2 they are plotted in orange. You can see that the margin is larger for H2.

You can think of the margin as the width of a road: we try to build a straight road between the data that is as wide as possible, and the separating hyperplane is the dividing lane on this road. We are trying to place the dividing lane in such a way that we build the widest road that just touches some of our data points. So now that we have our goal set out for us, we need to formulate it as a mathematical model.
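Before turning to the math, here is a minimal numerical sketch of this widest-road idea. It is my own addition, not the Excel/Solver model built in this post: it uses scikit-learn's SVC with a very large C to approximate a hard margin SVM on a hypothetical, linearly separable set of 12 points, then reads off the road width as 2/||w||.

```python
# A sketch of the hard-margin idea (not this post's Excel/Solver model):
# fit a linear SVM with a very large C on a toy, linearly separable
# 2D data set, then compute the margin ("road") width.
import numpy as np
from sklearn.svm import SVC

# 12 hypothetical (x1, x2) observations, 6 per class, chosen to be
# linearly separable. These are illustrative, not the post's data.
X = np.array([[1, 1], [2, 1], [1, 2], [2, 2], [1, 3], [2, 3],
              [5, 1], [6, 1], [5, 2], [6, 2], [5, 3], [6, 3]], dtype=float)
y = np.array([-1, -1, -1, -1, -1, -1, 1, 1, 1, 1, 1, 1])

# A very large C approximates the hard margin: margin violations are
# penalized so heavily that, on separable data, none occur.
clf = SVC(kernel="linear", C=1e6)
clf.fit(X, y)

w = clf.coef_[0]                    # normal vector of the decision boundary
b = clf.intercept_[0]
margin = 2.0 / np.linalg.norm(w)    # width of the road between the classes

print("w =", w, "b =", b)
print("margin width:", margin)
print("support vectors:\n", clf.support_vectors_)
```

With these points the two classes sit in vertical bands whose closest edges are 3 units apart, so the reported margin width should come out at roughly 3, and the support vectors are exactly the points the edges of the road touch.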