An Introduction To Mahout’s Logistic Regression SGD Classifier « Trifork Blog
Clustering versus Classification
One of my previous blogs focused on text clustering in Mahout. Clustering is an example of unsupervised learning: the clustering algorithm finds groups within the data without being told what to look for upfront. This contrasts with classification, an example of supervised machine learning, which is the process of determining to which class an observation belongs. A common application of classification is spam filtering. With spam filtering we use labeled data to train the classifier: e-mails marked as spam or ham. We can then test the classifier to see whether it does a good job of detecting spam in e-mail messages it hasn't seen during the training phase.
The basic classification process
Classification is a deep and broad subject with many different algorithms and optimizations. The basic process however remains the same:
- Obtain a dataset
- Transform the dataset into a set of records with a field-oriented format containing the features the classifier trains on
- Label items in the training set
- Split a dataset into test set and training set
- Encode the training set and the test set into vectors
- Create a model by training the classifier with the training set, with multiple runs and passes if necessary
- Test the classifier with the test set
- Evaluate the classifier
- Improve the classifier and repeat the process
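Two of the steps above, splitting the dataset and evaluating the classifier, can be sketched in plain Java. This is an illustrative sketch, not Mahout's API; the class and method names (`HoldoutSplit`, `split`, `accuracy`) are my own.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;
import java.util.function.Predicate;

public class HoldoutSplit {

    // Shuffle the records, then split: the first trainFraction of them
    // become the training set, the remainder is held out for testing.
    static <T> List<List<T>> split(List<T> records, double trainFraction, long seed) {
        List<T> shuffled = new ArrayList<>(records);
        Collections.shuffle(shuffled, new Random(seed));
        int cut = (int) (shuffled.size() * trainFraction);
        List<List<T>> parts = new ArrayList<>();
        parts.add(shuffled.subList(0, cut));               // training set
        parts.add(shuffled.subList(cut, shuffled.size())); // test set
        return parts;
    }

    // Evaluation: the fraction of test items the classifier labels correctly,
    // comparing its prediction against the known (labeled) truth.
    static <T> double accuracy(List<T> testSet, Predicate<T> truth, Predicate<T> classifier) {
        long correct = testSet.stream()
                .filter(t -> truth.test(t) == classifier.test(t))
                .count();
        return (double) correct / testSet.size();
    }
}
```

Shuffling before splitting matters: if the records are ordered (say, all spam first), an unshuffled split would give the classifier a skewed training set.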
Logistic Regression & Stochastic Gradient Descent
The Logistic function
Before I discuss Logistic Regression and SGD, let's look at its foundation, the logistic function. The logistic function is an S-shaped function whose range lies between 0 and 1, which makes it useful for modeling probabilities. When used in classification, an output close to 1 can indicate that an item belongs to a certain class. See the formula and graph below.
Logistic function: σ(x) = 1 / (1 + e^(−x))
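In code the logistic function is a one-liner; a minimal sketch in plain Java (the class name `Logistic` is my own):

```java
public class Logistic {

    // The logistic (sigmoid) function: maps any real x into the open
    // interval (0, 1), so the output can be read as a probability.
    static double logistic(double x) {
        return 1.0 / (1.0 + Math.exp(-x));
    }
}
```

At x = 0 the function returns exactly 0.5; large positive inputs approach 1 and large negative inputs approach 0, which gives the S-shaped curve.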
Logistic Regression model
Logistic Regression builds upon the logistic function. In contrast to the logistic function above, which takes a single x value as input, a Logistic Regression model allows many input variables: a vector of variables. Additionally, it assigns a weight, or coefficient, to each input variable. The resulting Logistic Regression model looks like this:
Logistic regression model: p = 1 / (1 + e^(−(β0 + β1x1 + … + βnxn)))
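Evaluating the model for a given input vector is just a weighted sum pushed through the logistic function. A minimal sketch in plain Java (again not Mahout's API; `LogisticModel` and `probability` are names I chose):

```java
public class LogisticModel {

    // p(y = 1 | x) = logistic(β0 + β1*x1 + ... + βn*xn)
    static double probability(double intercept, double[] beta, double[] x) {
        double z = intercept;                 // β0
        for (int i = 0; i < beta.length; i++) {
            z += beta[i] * x[i];              // weighted sum of the inputs
        }
        return 1.0 / (1.0 + Math.exp(-z));    // squash into (0, 1)
    }
}
```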
The goal now is to find values for the βs, the regression coefficients, such that the model can classify data with high accuracy. The classifier is accurate if the difference between the predicted probabilities and the actual outcomes is low. This difference is also called the cost. By minimizing the cost function of the Logistic Regression model we can learn the values of the β coefficients. See the following Coursera video on minimizing the cost function.