An Introduction To Mahout’s Logistic Regression SGD Classifier « Trifork Blog
Clustering versus Classification
One of my previous blogs focused on text clustering in Mahout. Clustering is an example of unsupervised learning: the clustering algorithm finds groups within the data without being told what to look for upfront. This contrasts with classification, an example of supervised machine learning, which is the process of determining to which class an observation belongs. A common application of classification is spam filtering. With spam filtering we use labeled data to train the classifier: e-mails marked as spam or ham. We can then test the classifier to see whether it does a good job of detecting spam in e-mail messages it hasn't seen during the training phase.
The basic classification process
Classification is a deep and broad subject with many different algorithms and optimizations. The basic process however remains the same:
- Obtain a dataset
- Transform the dataset into a set of records with a field-oriented format containing the features the classifier trains on
- Label items in the training set
- Split a dataset into test set and training set
- Encode the training set and the test set into vectors
- Create a model by training the classifier with the training set, with multiple runs and passes if necessary
- Test the classifier with the test set
- Evaluate the classifier
- Improve the classifier and repeat the process
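Two of the steps above, splitting the dataset and evaluating the classifier, can be sketched in plain Java. This is an illustrative sketch, not Mahout's API; the class and method names (`HoldoutSplit`, `split`, `accuracy`) are my own.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;
import java.util.function.Predicate;

public class HoldoutSplit {

    // Shuffle the records, then split: the first trainFraction of them
    // become the training set, the remainder is held out for testing.
    static <T> List<List<T>> split(List<T> records, double trainFraction, long seed) {
        List<T> shuffled = new ArrayList<>(records);
        Collections.shuffle(shuffled, new Random(seed));
        int cut = (int) (shuffled.size() * trainFraction);
        List<List<T>> parts = new ArrayList<>();
        parts.add(shuffled.subList(0, cut));               // training set
        parts.add(shuffled.subList(cut, shuffled.size())); // test set
        return parts;
    }

    // Evaluation: the fraction of test items the classifier labels correctly,
    // comparing its prediction against the known (labeled) truth.
    static <T> double accuracy(List<T> testSet, Predicate<T> truth, Predicate<T> classifier) {
        long correct = testSet.stream()
                .filter(t -> truth.test(t) == classifier.test(t))
                .count();
        return (double) correct / testSet.size();
    }
}
```

Shuffling before splitting matters: if the records are ordered (say, all spam first), an unshuffled split would give the classifier a skewed training set.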
Logistic Regression & Stochastic Gradient Descent
The Logistic function
Before I discuss Logistic Regression and SGD, let's look at its foundation, the logistic function. The logistic function is an S-shaped function whose range lies between 0 and 1, which makes it useful for modeling probabilities. When used in classification, an output close to 1 can indicate that an item belongs to a certain class. See the formula and graph below.
Logistic function: σ(x) = 1 / (1 + e^(−x))
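In code the logistic function is a one-liner; a minimal sketch in plain Java (the class name `Logistic` is my own):

```java
public class Logistic {

    // The logistic (sigmoid) function: maps any real x into the open
    // interval (0, 1), so the output can be read as a probability.
    static double logistic(double x) {
        return 1.0 / (1.0 + Math.exp(-x));
    }
}
```

At x = 0 the function returns exactly 0.5; large positive inputs approach 1 and large negative inputs approach 0, which gives the S-shaped curve.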
Logistic Regression model
Logistic Regression builds upon the logistic function. In contrast to the logistic function above, which takes a single x value as input, a Logistic Regression model allows many input variables: a vector of variables. Additionally, it assigns a weight, or coefficient, to each input variable. The resulting Logistic Regression model looks like this:
Logistic regression model: p = 1 / (1 + e^(−(β0 + β1x1 + … + βnxn)))
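Evaluating the model for a given input vector is just a weighted sum pushed through the logistic function. A minimal sketch in plain Java (again not Mahout's API; `LogisticModel` and `probability` are names I chose):

```java
public class LogisticModel {

    // p(y = 1 | x) = logistic(β0 + β1*x1 + ... + βn*xn)
    static double probability(double intercept, double[] beta, double[] x) {
        double z = intercept;                 // β0
        for (int i = 0; i < beta.length; i++) {
            z += beta[i] * x[i];              // weighted sum of the inputs
        }
        return 1.0 / (1.0 + Math.exp(-z));    // squash into (0, 1)
    }
}
```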
The goal now is to find values for the βs, the regression coefficients, such that the model can classify data with high accuracy. The classifier is accurate if the difference between the predicted probabilities and the actual outcomes is low. This difference is also called the cost. By minimizing the cost function of the Logistic Regression model we can learn the values of the β coefficients. See the following Coursera video on minimizing the cost function.