5 Reasons Why Automakers Should Fear Google's Driverless Car - Forbes




Read full article from 5 Reasons Why Automakers Should Fear Google's Driverless Car - Forbes


Google Glass Release Date News: $600 Price Tag Is Too Much For Consumers According To UK Poll



Thu, 01/09/2014 - 17:36 | Gadgets. The Google Glass release date is looming around the corner with a TBD 2014 launch. Will the device go mainstream come its public debut? Some think Google Glass is doomed, while others believe the wearable camera will be adopted by the masses. In a recent survey, though, popular opinion leans towards the former belief. A Google Glass poll conducted by UK blog Lifestyle surveyed 1,

Read full article from Google Glass Release Date News: $600 Price Tag Is Too Much For Consumers According To UK Poll


Google Glass | Release date, price and specs | Explorer program - PC Advisor



Google Glass release date, price and specs: now you can buy Google Glass in the UK. Everything you need to know about Google Glass, and how anyone (over 18) can buy Google Glass today. By Chris Martin | PC Advisor | 23 June 14. Google Glass is now available in the UK through the Explorer Programme. Here's everything you need to know about the wearable tech, including the Google Glass release date, price and specs, and how to get Google Glass in the UK. Updated on 23/06/14. See also: Google Glass Explorer Edition 2.0 review.

Read full article from Google Glass | Release date, price and specs | Explorer program - PC Advisor


Google Apps update alerts: Feature enhancements to Google Classroom



This official feed from the Google Apps team provides essential information about new features and improvements. 10/14/2014 Feature enhancements to Google Classroom Google Classroom launched this summer to make Google Apps for Education even simpler — saving teachers time and making it easier to collaborate with students. Today, we're launching five improvements to Classroom, focusing on things educators and students around the world told us were most important to them: Groups integration: Ability to pre-populate classes using existing Google Groups.

Read full article from Google Apps update alerts: Feature enhancements to Google Classroom


Google Apps update alerts: Manage revisions for non-Google files in the new Google Drive




Read full article from Google Apps update alerts: Manage revisions for non-Google files in the new Google Drive


Google Hangouts and Google Voice are finally playing nice | Android Central



Well, here's a pleasant late-night surprise. Looks like Google Hangouts finally has the long-overdue integration with Google Voice , meaning that your messages from Google Voice (and not traditional SMS messages from the actual phone number assigned to your SIM card — it's a bit of a mess to explain, we know) can finally show up in Google Hangouts. That means a couple things. First is that you'll only need one app — Hangouts — to receive text messages and Hangouts messages. (Not sure yet what that means for voicemails through Google Voice.

Read full article from Google Hangouts and Google Voice are finally playing nice | Android Central


Google Hangouts getting free voice calls and voicemail | Android Central



It's been a long loooong time coming, but Google Voice is finally getting more-or-less fully integrated into Hangouts . Earlier today we saw Google migrating Voice messaging into Hangouts , and now with an update to the Hangouts app for both Android (version 2.3, which will be rolling out over the next few days) and iOS you'll be able to make phone calls with ease. With Google Voice in Hangouts you'll be able to call other Hangouts users for free, place free calls to phone numbers in the United States and Canada, as well as take advantage of low international rates. Here's how it will work :

Read full article from Google Hangouts getting free voice calls and voicemail | Android Central


Uncle Lance's Ultra Whiz Bang: Document Summarization with LSA #2: Test with newspaper articles



The Experiment: This analysis evaluates many variants of the LSA algorithm against measurements appropriate for an imaginary document summarization UI. This UI displays the two most important sentences with the important theme words highlighted. The measurements try to match the expectations of the user. Supervised vs. Unsupervised Learning: Machine learning algorithms are classified as supervised, unsupervised, and semi-supervised. A supervised algorithm creates a model (usually statistical) from training data, then applies test data against the model.

Read full article from Uncle Lance's Ultra Whiz Bang: Document Summarization with LSA #2: Test with newspaper articles


Comparing Document Classification Functions of Lucene and Mahout | soleami | Visualize the needs of your visitors.



Lucene implements Naive Bayes and k-NN rule classifiers. The trunk, which will become Lucene 5, the next major release, adds a boolean (two-class) perceptron classifier to these two. We use Lucene 4.6.1, the most recent version at the time of writing, to perform document classification with Naive Bayes and the k-NN rule.
You need to open an IndexReader on a prepared index and pass it as the first argument of the train() method, because Classifier uses the index as its training data. Set the name of the Lucene field that holds the tokenized and indexed text as the second argument of train(), and the field that holds the document category as the third. In the same manner, pass a Lucene Analyzer as the fourth argument and a Query as the fifth. The Analyzer is the one used when classifying an unknown document (in my personal opinion this is a bit awkward, and it would be better passed to the assignClass() method described below). The Query is used to narrow down the documents used for learning; pass null if there is no need to do so. The train() method has two more overloads with different arguments, but I will skip them for now.
After calling train() on the Classifier interface, pass the unknown document as a String to the assignClass() method to obtain the classification result. Classifier is a generic interface, and assignClass() returns a ClassificationResult parameterized with the type variable T.
Calling the getAssignedClass() method of ClassificationResult gives you a classification result of the type T.
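To make the flow above concrete, here is a minimal sketch against the Lucene 4.6 classification API. The index path and the field names "body" and "category" are assumptions for illustration, not taken from the article:

import java.io.File;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.classification.ClassificationResult;
import org.apache.lucene.classification.SimpleNaiveBayesClassifier;
import org.apache.lucene.index.AtomicReader;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.SlowCompositeReaderWrapper;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.BytesRef;
import org.apache.lucene.util.Version;

public class NaiveBayesExample {
  public static void main(String[] args) throws Exception {
    // Open the index that serves as the training data (the "model").
    DirectoryReader directoryReader =
        DirectoryReader.open(FSDirectory.open(new File("/path/to/index")));
    AtomicReader atomicReader = SlowCompositeReaderWrapper.wrap(directoryReader);

    // Learn from the "body" field, with the label stored in the "category" field.
    SimpleNaiveBayesClassifier classifier = new SimpleNaiveBayesClassifier();
    classifier.train(atomicReader, "body", "category", new StandardAnalyzer(Version.LUCENE_46));

    // Classify an unknown document given as a plain String.
    ClassificationResult<BytesRef> result = classifier.assignClass("text of the unknown document ...");
    System.out.println("class: " + result.getAssignedClass().utf8ToString()
        + ", score: " + result.getScore());

    directoryReader.close();
  }
}

Any other Classifier implementation can be dropped into the same flow without changing the surrounding code.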
Note that Lucene's classifier is unusual in that the train() method does very little work while assignClass() does most of it. This is very different from other commonly used machine learning software. In the learning phase of typical machine learning software, a model file is created by learning from a corpus according to the selected machine learning algorithm (this is where most of the time and effort goes; as Mahout is based on Hadoop, it uses MapReduce to try to reduce the time required here). In the classification phase, an unknown document is then classified by consulting the previously created model file, which usually requires few resources.
Because Lucene uses the index itself as the model, the train() method, which corresponds to the learning phase, does almost nothing: learning is complete as soon as the index is created. Lucene's index, however, is optimized for high-speed keyword search and is not an ideal format for a document classification model. Therefore, document classification is performed by searching the index inside the assignClass() method, which is the classification phase. Contrary to commonly used machine learning software, Lucene's classifier requires a lot of computation in the classification phase. For sites focused mainly on search, this document classification capability should be appealing, since the index is built at no additional cost.

SimpleNaiveBayesClassifier is the first implementation class of the Classifier interface. As the name suggests, it is a Naive Bayes classifier. Naive Bayes classification finds the class c for which the conditional probability P(c|d), the probability of class c given document d, is highest. Bayes' theorem is used to rewrite P(c|d), so you actually need to find the class c that maximizes P(c)P(d|c). Logarithms are normally used to avoid underflow, and the assignClass() method of SimpleNaiveBayesClassifier repeats this calculation once per class to perform maximum likelihood estimation (MLE).
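Spelled out (a standard restatement of the rule above, not a formula quoted from the article), the class assigned to a document d containing words w_1 ... w_n is

\hat{c} = \arg\max_c P(c \mid d) = \arg\max_c P(c)\,P(d \mid c) = \arg\max_c \Big[ \log P(c) + \sum_{i=1}^{n} \log P(w_i \mid c) \Big]

where the last step uses the bag-of-words and conditional-independence assumptions, and working in log space is what avoids the underflow mentioned above.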
Using Lucene KNearestNeighborClassifier
Another implementation class of Classifier is KNearestNeighborClassifier. KNearestNeighborClassifier takes k, which must be at least 1, as a constructor argument. You can use exactly the same program as for SimpleNaiveBayesClassifier; all you need to do is replace the line that creates the SimpleNaiveBayesClassifier instance with KNearestNeighborClassifier.
The assignClass() method does all the work for KNearestNeighborClassifier too, in the same manner described before, but one interesting point is that it uses Lucene's MoreLikeThis. MoreLikeThis is a tool that treats a reference document as a query and searches for similar documents. KNearestNeighborClassifier uses MoreLikeThis to retrieve the k documents that are most similar to the unknown document passed to assignClass(). A majority vote over those k documents then determines the category of the unknown document.
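Concretely, the swap is a one-line change to the earlier Naive Bayes sketch (the reader, analyzer and field names stay the same and remain assumptions):

import org.apache.lucene.classification.KNearestNeighborClassifier;

// ... same index, reader and analyzer setup as in the Naive Bayes sketch ...
KNearestNeighborClassifier knn = new KNearestNeighborClassifier(1); // k = 1
knn.train(atomicReader, "body", "category", new StandardAnalyzer(Version.LUCENE_46));
ClassificationResult<BytesRef> knnResult = knn.assignClass("text of the unknown document ...");
System.out.println(knnResult.getAssignedClass().utf8ToString());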
Executing the same program with KNearestNeighborClassifier and k=1 displays the classification result in the same way.

In this article, we used the same corpus to perform document classification with both Lucene and Mahout and compared their results. The accuracy appears to be higher for Mahout but, as already stated, its training uses not all words but only the top 2,000 most important words in the body field. On the other hand, Lucene's classifier, whose accuracy was only about 70%, uses all of the words in the body field. Lucene should be able to exceed 90% accuracy if you add a field that holds only words selected specifically for document classification. It may also be a good idea to create another Classifier implementation whose train() method performs such feature selection.
I should add that the accuracy drops to around 80% when the test data is not used for training but is treated as genuinely unseen data.
Read full article from Comparing Document Classification Functions of Lucene and Mahout | soleami | Visualize the needs of your visitors.

Lucene 4 is Super Convenient for Developing NLP Tools



Lucene 4.0 classes that I used for developing this system are as follows:
  • IndexSearcher, TermQuery, TopDocs
    This system calculates similarities for synonym candidates, which consist of nouns extracted from keywords and their descriptions. The system decides that a candidate is a synonym of the keyword if the similarity is greater than a threshold value, and outputs it to a CSV file.
    But how do I calculate the similarity of a keyword and a synonym candidate? This system determines it by calculating the similarity between the keyword description Aa and the set of dictionary entry descriptions {Ab} that are written using the synonym candidate.
    Thus, I have to find {Ab}, which is where I used classes such as IndexSearcher, TermQuery, and TopDocs to search the description field for the synonym candidate.
  • PriorityQueue
    Next, I have to pick out the “feature words” from Aa and {Ab} to calculate the similarity of the two. To do so, I select the N most important words to build a feature vector, using the TF*IDF of each target word as its degree of importance (see the SlideShare above for details). I use PriorityQueue to select the “N most important words”.
  • DocsEnum, TotalHitCountCollector
    I used TF*IDF as the weight for extracting the feature words above, and used DocsEnum.freq() to obtain the TF. docFreq (the number of articles containing the synonym candidate), which is required to compute the IDF, was obtained by passing a TotalHitCountCollector to the search() method of IndexSearcher.
  • Terms, TermsEnum
    I use these classes to search the “description” field for synonym candidates.
These are examples of how Lucene 4.0 is used in this system. I believe Lucene will be a great help for NLP tool developers as well. For a lexical knowledge acquisition task using bootstrapping, for example, a cycle (1: pattern extraction, 2: pattern selection, 3: instance extraction, 4: instance selection) can be used to obtain knowledge from a small number of seed instances. If you use Lucene for these tasks, pattern extraction and instance extraction can be reduced to simple search operations.
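As a rough sketch of how the TF and docFreq pieces described above fit together in Lucene 4.x (the "description" field name and the surrounding setup are assumptions for illustration, not code from the original system):

import java.io.IOException;

import org.apache.lucene.index.AtomicReader;
import org.apache.lucene.index.DocsEnum;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TotalHitCountCollector;

public class TfIdfWeight {

  /** TF*IDF of a word within one document of the "description" field. */
  public static double weight(AtomicReader reader, IndexSearcher searcher,
                              int docId, String word) throws IOException {
    Term term = new Term("description", word);

    // TF: walk the postings of the term and read freq() for this document.
    int tf = 0;
    DocsEnum docsEnum = reader.termDocsEnum(term);
    if (docsEnum != null && docsEnum.advance(docId) == docId) {
      tf = docsEnum.freq();
    }

    // docFreq: count matching documents via TotalHitCountCollector, as in the article
    // (reader.docFreq(term) would also work).
    TotalHitCountCollector collector = new TotalHitCountCollector();
    searcher.search(new TermQuery(term), collector);
    int docFreq = collector.getTotalHits();
    if (tf == 0 || docFreq == 0) {
      return 0.0;
    }

    double idf = Math.log((double) reader.numDocs() / docFreq);
    return tf * idf;
  }
}

The weights computed this way can then be pushed through a PriorityQueue to keep only the N most important words for the feature vector.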
Please read full article from Lucene 4 is Super Convenient for Developing NLP Tools

Text categorization with Lucene and Solr - YouTube




Read full article from Text categorization with Lucene and Solr - YouTube


Text categorization with Lucene and Solr



Let the algorithm assign one or more labels (classes) to some item given some previous knowledge:
• Spam filter
• Tagging system
• Digit recognition system
• Text categorization

• Lucene already has a lot of features for common information retrieval needs:
  • Postings
  • Term vectors
  • Statistics
  • Positions
  • TF / IDF
  • maybe Payloads
  • etc.
• We may avoid bringing in new components to do classification, just leveraging what we get for free from Lucene

• Lucene has so many features stored you can take advantage of for free
• Therefore writing the classification algorithm is relatively simple
• In many cases you’re just not adding anything to the architecture
• Your Lucene index was already there for searching
• The Lucene index is, to some extent, already a model which we just need to “query” with the proper algorithm
• And it is fast enough

Classifier API
• Training
• void train(atomicReader, contentField, classField, analyzer) throws IOException


K Nearest neighbor classifier
• Fairly simple classification algorithm
• Given some new unseen item:
  • I search in my knowledge base for the k items which are nearest to the new one
  • I get the k classes assigned to the k nearest items
  • I assign to the new item the class that is most frequent in the k returned items

K Nearest neighbor classifier
• How can we do this in Lucene?
• We have VSM for representing documents as vectors and eventually find distances
• The Lucene MoreLikeThis module can do a lot for it
• Given a new document:
  • It’s represented as a MoreLikeThisQuery which filters out too frequent words and helps on keeping only the relevant tokens for finding the neighbors
  • The query is executed, returning only the first k results
  • The result is then browsed in order to find the most frequent class, which is then assigned with a score of classFreq / k

Naïve Bayes classifier
• Slightly more complicated
• Based on probabilities
• C = argmax( P(d|c) * P(c) )
  • P(d|c): likelihood
  • P(c): prior
• With some assumptions:
  • bag of words assumption: positions don't matter
  • conditional independence: the feature probabilities are independent given a class


Things to consider - bootstrapping
• How are your first documents classified?
  • Manually
    • Categories are already there in the documents
    • Someone is explicitly charged to do that (e.g. article authors) at some point in time
  • (Semi) automatically
    • Using some existing service / library
    • With or without human supervision
• In either case the classifier needs something to be fed with to be effective

As specific search services
• A classification-based MoreLikeThis
• While indexing
• For automatic text categorization

Automatic text categorization
• Once a doc reaches Solr:
  • We can use the Lucene classifiers to automate assigning the document’s category
  • We can leverage existing Solr facilities for enhancing the indexing pipeline
  • An UpdateChain can be decorated with one or more UpdateRequestProcessors

CategorizationUpdateRequestProcessorFactory
CategorizationUpdateRequestProcessor
• void processAdd(AddUpdateCommand cmd) throws IOException
  • String text = solrInputDocument.getFieldValue(“text”);
  • String class = classifier.assignClass(text);
  • solrInputDocument.addField(“cat”, class);
• Every now and then you need to retrain to get the latest stuff in the current index, but that can be done in the background without affecting performance

CategorizationUpdateRequestProcessor
• Finer-grained control:
  • Use automatic text categorization only if a value does not exist for the “cat” field
  • Add the classifier output class to the “cat” field only if it’s above a certain score
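To make the last few slides concrete, here is a minimal, hypothetical sketch of such an update processor for Solr 4.x. The class names, the "text" and "cat" field names, the score threshold, and the retrain-on-every-request shortcut are assumptions for illustration, not code from the talk:

import java.io.IOException;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.classification.ClassificationResult;
import org.apache.lucene.classification.Classifier;
import org.apache.lucene.classification.SimpleNaiveBayesClassifier;
import org.apache.lucene.index.AtomicReader;
import org.apache.lucene.index.SlowCompositeReaderWrapper;
import org.apache.lucene.util.BytesRef;
import org.apache.lucene.util.Version;
import org.apache.solr.common.SolrInputDocument;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.response.SolrQueryResponse;
import org.apache.solr.update.AddUpdateCommand;
import org.apache.solr.update.processor.UpdateRequestProcessor;
import org.apache.solr.update.processor.UpdateRequestProcessorFactory;

public class CategorizationUpdateRequestProcessorFactory extends UpdateRequestProcessorFactory {

  @Override
  public UpdateRequestProcessor getInstance(SolrQueryRequest req, SolrQueryResponse rsp,
                                            UpdateRequestProcessor next) {
    try {
      // Train against the current index; a real implementation would cache the
      // classifier and retrain in the background instead of once per request.
      AtomicReader reader = SlowCompositeReaderWrapper.wrap(req.getSearcher().getIndexReader());
      Classifier<BytesRef> classifier = new SimpleNaiveBayesClassifier();
      classifier.train(reader, "text", "cat", new StandardAnalyzer(Version.LUCENE_46));
      return new CategorizationUpdateRequestProcessor(classifier, next);
    } catch (IOException e) {
      throw new RuntimeException(e);
    }
  }

  static class CategorizationUpdateRequestProcessor extends UpdateRequestProcessor {
    private final Classifier<BytesRef> classifier;

    CategorizationUpdateRequestProcessor(Classifier<BytesRef> classifier, UpdateRequestProcessor next) {
      super(next);
      this.classifier = classifier;
    }

    @Override
    public void processAdd(AddUpdateCommand cmd) throws IOException {
      SolrInputDocument doc = cmd.getSolrInputDocument();
      // Finer-grained control: only categorize if "cat" is not already set ...
      if (doc.getFieldValue("cat") == null) {
        String text = (String) doc.getFieldValue("text");
        ClassificationResult<BytesRef> result = classifier.assignClass(text);
        // ... and only accept the assignment above a score threshold.
        if (result != null && result.getScore() > 0.5) {
          doc.addField("cat", result.getAssignedClass().utf8ToString());
        }
      }
      super.processAdd(cmd); // continue the update chain
    }
  }
}

The factory would then be registered in an updateRequestProcessorChain in solrconfig.xml so that processAdd() runs for every document before it is indexed.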

Implement a MaxEnt Lucene-based classifier
• which takes into account word correlations

Please read full article from Text categorization with Lucene and Solr

Solr 4.10.1 UpdateRequestProcessor factories | Solr Start



Factories

UpdateRequestProcessorFactory
A factory to generate an UpdateRequestProcessor for each request.

AbstractDefaultValueUpdateProcessorFactory
Base class that can be extended by any UpdateRequestProcessorFactory designed to add a default value to the document in an AddUpdateCommand when that field is not already specified.

DefaultValueUpdateProcessorFactory
An update processor that adds a constant default value to any document being added that does not already have a value in the specified field.

TimestampUpdateProcessorFactory
An update processor that adds a newly generated Date value of "NOW" to any document being added that does not already have a value in the specified field.

AddSchemaFieldsUpdateProcessorFactory
This processor will dynamically add fields to the schema if an input document contains one or more fields that don't match any field or dynamic field in the schema.

CloneFieldUpdateProcessorFactory
Clones the values found in any matching source field into the configured dest field.

DistributedUpdateProcessorFactory
Factory for DistributedUpdateProcessor.

DocBasedVersionConstraintsProcessorFactory
This Factory generates an UpdateProcessor that helps to enforce Version constraints on documents based on per-document version numbers using a configured name of a versionField.

DocExpirationUpdateProcessorFactory
Update Processor Factory for managing automatic "expiration" of documents.

FieldMutatingUpdateProcessorFactory
Base class for implementing Factories for FieldMutatingUpdateProcessors and FieldValueMutatingUpdateProcessors.

ConcatFieldUpdateProcessorFactory
Concatenates multiple values for fields matching the specified conditions using a configurable delimiter which defaults to ", ".

CountFieldValuesUpdateProcessorFactory
Replaces any list of values for a field matching the specified conditions with the count of the number of values for that field.

FieldLengthUpdateProcessorFactory
Replaces any CharSequence values found in fields matching the specified conditions with the lengths of those CharSequences (as an Integer).

FieldValueSubsetUpdateProcessorFactory
Base class for processors that want to mutate selected fields to only keep a subset of the original values.

FirstFieldValueUpdateProcessorFactory
Keeps only the first value of fields matching the specified conditions.

LastFieldValueUpdateProcessorFactory
Keeps only the last value of fields matching the specified conditions.

MaxFieldValueUpdateProcessorFactory
An update processor that keeps only the maximum value from any selected fields where multiple values are found.

MinFieldValueUpdateProcessorFactory
An update processor that keeps only the minimum value from any selected fields where multiple values are found.

UniqFieldsUpdateProcessorFactory
Removes duplicate values found in fields matching the specified conditions.

HTMLStripFieldUpdateProcessorFactory
Strips all HTML Markup in any CharSequence values found in fields matching the specified conditions.

IgnoreFieldUpdateProcessorFactory
Ignores & removes fields matching the specified conditions from any document being added to the index.

ParseBooleanFieldUpdateProcessorFactory
Attempts to mutate selected fields that have only CharSequence-typed values into Boolean values.

ParseDateFieldUpdateProcessorFactory
Attempts to mutate selected fields that have only CharSequence-typed values into Date values.

ParseNumericFieldUpdateProcessorFactory
Abstract base class for numeric parsing update processor factories.

ParseDoubleFieldUpdateProcessorFactory
Attempts to mutate selected fields that have only CharSequence-typed values into Double values.

ParseFloatFieldUpdateProcessorFactory
Attempts to mutate selected fields that have only CharSequence-typed values into Float values.

ParseIntFieldUpdateProcessorFactory
Attempts to mutate selected fields that have only CharSequence-typed values into Integer values.

ParseLongFieldUpdateProcessorFactory
Attempts to mutate selected fields that have only CharSequence-typed values into Long values.

PreAnalyzedUpdateProcessorFactory
An update processor that parses configured fields of any document being added using PreAnalyzedField with the configured format parser.

RegexReplaceProcessorFactory
An update processor that applies a configured regex to any CharSequence values found in the selected fields, and replaces any matches with the configured replacement string.

RemoveBlankFieldUpdateProcessorFactory
Removes any values found which are CharSequence with a length of 0.

TrimFieldUpdateProcessorFactory
Trims leading and trailing whitespace from any CharSequence values found in fields matching the specified conditions and returns the resulting String.

TruncateFieldUpdateProcessorFactory
Truncates any CharSequence values found in fields matching the specified conditions to a maximum character length.

LangDetectLanguageIdentifierUpdateProcessorFactory in solr-langid-4.10.1.jar ( dist/ )
Identifies the language of a set of input fields using http://code.google.com/p/language-detection The UpdateProcessorChain config entry can take a number of parameters which may also be passed as HTTP parameters on the update request and override the defaults.

LogUpdateProcessorFactory
A logging processor.

NoOpDistributingUpdateProcessorFactory
A No-Op implementation of DistributingUpdateProcessorFactory that always returns null.

RegexpBoostProcessorFactory
Factory which creates RegexBoostProcessors

RunUpdateProcessorFactory
Executes the update commands using the underlying UpdateHandler.

SignatureUpdateProcessorFactory
Generates a signature (hash) from the values of configured fields, typically used for document deduplication.

StatelessScriptUpdateProcessorFactory
An update request processor factory that enables the use of update processors implemented as scripts which can be loaded by the SolrResourceLoader (usually via the conf dir for the SolrCore).

TikaLanguageIdentifierUpdateProcessorFactory in solr-langid-4.10.1.jar ( dist/ )
Identifies the language of a set of input fields using Tika's LanguageIdentifier.

UIMAUpdateRequestProcessorFactory in solr-uima-4.10.1.jar ( dist/ )
Factory for UIMAUpdateRequestProcessor

URLClassifyProcessorFactory
Creates URLClassifyProcessor

UUIDUpdateProcessorFactory
An update processor that adds a newly generated UUID value to any document being added that does not already have a value in the specified field.


Read full article from Solr 4.10.1 UpdateRequestProcessor factories | Solr Start


Hama - a general BSP framework on top of Hadoop



Apache Hama

Apache Hama is an Apache Top-Level open source project, allowing you to do advanced analytics beyond MapReduce.

Many data analysis techniques, such as machine learning and graph algorithms, require iterative computations; this is where the Bulk Synchronous Parallel model can be more effective than "plain" MapReduce. To run such iterative data analysis applications more efficiently, Hama offers a pure Bulk Synchronous Parallel computing engine.

Hama can make jobs that take hours in Hadoop run in minutes. For details, see the Performance Benchmarks.

Read full article from Hama - a general BSP framework on top of Hadoop


Comparing Document Classification Functions of Lucene and Mahout | soleami | Visualize the needs of your visitors.



07/03/2014 | Author: Koji Sekiguchi. Starting with version 4.2, Lucene provides a document classification function. In this article, we use the same corpus to exercise the document classification functions of both Lucene and Mahout and compare the results. Lucene implements Naive Bayes and k-NN rule classifiers. The trunk, which will become Lucene 5, the next major release, adds a boolean (two-class) perceptron classifier to these two. We use Lucene 4.6.1, the most recent version at the time of writing, to perform document classification with Naive Bayes and the k-NN rule. Meanwhile,

Read full article from Comparing Document Classification Functions of Lucene and Mahout | soleami | Visualize the needs of your visitors.


Road to Revolution: Text Classification with Lucene/Solr, Apache Hadoop and LibSVM - Lucidworks



In the Lucene/Solr Revolution session "Text Classification with Lucene/Solr, Apache Hadoop and LibSVM," Majirus Fansi, SOA and Search Engine Developer at Valtech, will show you how to build a text classifier using Apache Lucene/Solr with the libSVM libraries. They classify their corpus of job offers into a number of predefined categories. Each indexed document (a job offer) then belongs to zero, one or more categories. Known machine learning techniques for text classification include the naïve Bayes model, logistic regression, neural networks, support vector machines (SVM), etc.

They use Lucene/Solr to construct the feature vectors. Then they use the libSVM library, known as the reference implementation of the SVM model, to classify the documents. They construct as many one-vs-all SVM classifiers as there are classes in their setting. Then, using the Hadoop MapReduce framework, they reconcile the results of the classifiers. The end result is a scalable multi-class classifier. Finally, they outline how the classifier is used to enrich basic Solr keyword search.


Read full article from Road to Revolution: Text Classification with Lucene/Solr, Apache Hadoop and LibSVM - Lucidworks


Search-Aware Product Recommendation in Solr | OpenSource Connections | Solr, Big Data, and NoSQL consultants



Incorporating Solr Search-Aware Product Recommendations into Your Search

For even a moderately large inventory and user base, building the recommendation data is a relatively straightforward process that can be performed on a single machine and completed in minutes. Incorporating these recommendations into search results is as simple as adding a parameter to the Solr search URL. In every case outlined above, the recommended subset is clearly better than the full set of search results. What's more, these recommendations are actually personalized for every user on your site! Incorporating this functionality into search will allow customers to find the items they are looking for much more quickly than they would be able to otherwise.


Read full article from Search-Aware Product Recommendation in Solr | OpenSource Connections | Solr, Big Data, and NoSQL consultants


java - is it possible to use apache mahout without hadoop dependency? - Stack Overflow



Definitely, yes. In the Mahout Recommender First-Timer FAQ they advise against starting out with a Hadoop-based implementation (unless you know you're going to be scaling past 100 million user preferences relatively quickly).

You can use the implementations of the Recommender interface in a pure-Java fashion relatively easily, or place one in the servlet container of your choice.

Technically, Mahout has a Maven dependency on Hadoop. But you can use recommenders without the Hadoop JARs easily. This is described in the first few chapters of Mahout in Action - you can download the sample source code and see how it's done - look at the file RecommenderIntro.java.
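For reference, a pure-Java (Taste) recommender along the lines of the RecommenderIntro example looks roughly like this; the CSV file name and the neighborhood size are placeholder choices:

import java.io.File;
import java.util.List;

import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class PureJavaRecommender {
  public static void main(String[] args) throws Exception {
    // CSV of userID,itemID,preference; no Hadoop involved at any point.
    DataModel model = new FileDataModel(new File("intro.csv"));
    UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
    UserNeighborhood neighborhood = new NearestNUserNeighborhood(2, similarity, model);
    Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);

    // Top recommendation for user 1.
    List<RecommendedItem> recommendations = recommender.recommend(1, 1);
    for (RecommendedItem item : recommendations) {
      System.out.println(item);
    }
  }
}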

However, if you're using Maven, you would need to exclude Hadoop manually - the dependency would look like this:

<dependency>
    <groupId>org.apache.mahout</groupId>
    <artifactId>mahout-core</artifactId>
    <exclusions>
        <exclusion>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-core</artifactId>
        </exclusion>
    </exclusions>
</dependency>

Read full article from java - is it possible to use apache mahout without hadoop dependency? - Stack Overflow


(30) Is Mahout totally Hadoop focussed or is the community open to non-Hadoop implementations for shared memory machines? - Quora



Mahout is also open to performant single-machine implementations. Examples of single-machine solutions include a variety of recommender algorithms, logistic regression solved with SGD, and streaming k-means.

Read full article from (30) Is Mahout totally Hadoop focussed or is the community open to non-Hadoop implementations for shared memory machines? - Quora


Getting Started with Topic Modeling and MALLET



To create an environment variable in Windows 7, click on your Start Menu -> Control Panel -> System -> Advanced System Settings (Figures 1,2,3). Click new and type MALLET_HOME in the variable name box. It must be like this – all caps, with an underscore – since that is the shortcut that the programmer built into the program and all of its subroutines. Then type the exact path (location) of where you unzipped MALLET in the variable value, e.g., c:\mallet.

bin\mallet import-dir --help
To work with this corpus and find out what topics compose these individual documents, we need to transform them from several individual text files into a single MALLET-format file. MALLET can import more than one file at a time. We can import the entire directory of text files using the import command. The commands below import the directory, turn it into a MALLET file, keep the original texts in the order in which they were listed, and strip out the stop words (words such as and, the, but, and if that occur in such frequencies that they obstruct analysis) using the default English stop-words dictionary.

This file now contains all of your data, in a format that MALLET can work with.
bin\mallet train-topics --input tutorial.mallet

This command opens your tutorial.mallet file and runs the topic model routine on it using only the default settings, iterating through the routine to try to find the best division of words into topics.
MALLET includes an element of randomness, so the keyword lists will look different every time the program is run, even if on the same set of data.

bin\mallet train-topics --input tutorial.mallet --num-topics 20 --output-state topic-state.gz --output-topic-keys tutorial_keys.txt --output-doc-topics tutorial_composition.txt

The second number in each paragraph is the Dirichlet parameter for the topic. This is related to an option which we did not run, and so its default value was used (this is why every topic in this file has the number 2.5).
If, when you ran the topic model routine, you had included the option --optimize-interval 20, as in:
bin\mallet train-topics --input tutorial.mallet --num-topics 20 --optimize-interval 20 --output-state topic-state.gz --output-topic-keys tutorial_keys.txt --output-doc-topics tutorial_composition.txt

That is, the first number is the topic (topic 0), and the second number gives an indication of the weight of that topic. In general, including --optimize-interval leads to better topics.


What topics compose your documents? The answer is in the tutorial_composition.txt file.

How do you know the number of topics to search for? Is there a natural number of topics? What we have found is that one has to run train-topics with varying numbers of topics to see how the composition file breaks down. If we end up with the majority of our original texts in a very limited number of topics, then we take that as a signal that we need to increase the number of topics; the settings were too coarse. There are computational ways of searching for this, including using MALLET's hlda command, but for the reader of this tutorial it is probably just quicker to cycle through a number of iterations.

Read full article from Getting Started with Topic Modeling and MALLET

Introducing myself to MALLET - Emerging Tech in Libraries



Here’s how I understand it: topic modeling, like other text mining techniques, considers text as a ‘bag of words’ that is more or less organized. It draws out clusters of words (topics) that appear to be related because they statistically occur near each other. We’ve all been subjected to wordles — this is like DIY wordles that can get very specific and can seem to approach semantic understanding with statistics alone.
One tool that DH folks mention often is MALLET, the MAchine Learning for LanguagE Toolkit, open-source software developed at UMass Amherst starting in 2002. I was pleased to see that it not only models topics, but does the things I’d wanted Oracle Data Miner to do, too — classify with decision trees, Naïve Bayes, and more. There are many tutorials and papers written on/about MALLET, but the one I picked was Getting Started with Topic Modeling and MALLET from The Programming Historian 2, a project out of CHNM. The tutorial is very easy to follow and approaches the subject with a DH-y literariness.

Text Analytics in Enterprise Search - Daniel Ling



Document Categorization

• To assign a label to the document / content / data.
• Labels for the category or for the sentiment.
• Threshold values for matching a category before labeling.
• Statistics and “knowledge” from previous examples can be used.


• Mallet and the process of setup and training:
  • Training the component, Mallet (Machine Learning for Language Toolkit).
  • Alternative components include a Lucene (TF-IDF) index (MoreLikeThis), OpenNLP, TextCat, Classifier4j.
• Running the new documents against the model/index of trained documents.
• Training from the interface, ad hoc, or from a pre-categorized index.

Document Summarization

• Summarize a document, at index time or on demand.
• Leverage the knowledge and term statistics of the document and the index.
• Pick the “most important” sentences based on the statistics and display those.

Example Solution: Document Summarization

• Custom RequestHandler that receives the document ID and the field to summarize.
• Custom SearchComponent making the selection of top sentences.
• Selects a subset of sentences and sends these back in a field.

Please read full article from Text Analytics in Enterprise Search - Daniel Ling

Uncle Lance's Ultra Whiz Bang: Document Summarization with LSA #1: Introduction



Document Summarization with LSA This is a several-part series on document summarization using Latent Semantic Analysis (LSA). I wrote a document summarizer and did an exhaustive measurement pass using it to summarize newspaper articles from the first Reuters corpus. The code is structured as a web service in Solr, using Lucene for text analysis and the OpenNLP package for tuning the algorithm with Parts-of-Speech analysis. Introduction Document summarization is about finding the "themes" in a document: the important words and sentences that contain the core concepts.

Read full article from Uncle Lance's Ultra Whiz Bang: Document Summarization with LSA #1: Introduction


[SOLR-3975] Document Summarization toolkit, using LSA techniques - ASF JIRA



This package analyzes sentences and words as used across sentences to rank the most important sentences and words. The general topic is called "document summarization" and is a popular research topic in textual analysis.

How to use:
1) Check out the 4.x branch, apply the patch, build, and run the solr/example instance.
2) Download the first Reuters article corpus from:
http://kdd.ics.uci.edu/databases/reuters21578/reuters21578.tar.gz
3) Unpack this into a directory.
4) Run the attached 'reuters.sh' script:
sh reuters.sh directory http://localhost:8983/solr/collection1
5) Wait several minutes.

Now go to http://localhost:8983/solr/collection1/browse?summary=true and look at the large gray box marked 'Document Summary'. This has a table of statistics about the analysis, the three most important sentences, and several of the most important words in the documents. The sentences have the important words in italics.

The code is packaged as a search component and as an analysis handler. The /browse demo uses the search component, and you can also post raw text to http://localhost:8983/solr/collection1/analysis/summary. Here is a sample command:

curl -s "http://localhost:8983/solr/analysis/summary?indent=true&echoParams=all&file=$FILE&wt=xml" --data-binary @$FILE -H 'Content-type:application/xml'  

This is an implementation of LSA-based document summarization. A short explanation and a long evaluation are described in my blog, Uncle Lance's Ultra Whiz Bang, starting here: http://ultrawhizbang.blogspot.com/2012/09/document-summarization-with-lsa-1.html


Read full article from [SOLR-3975] Document Summarization toolkit, using LSA techniques - ASF JIRA


Text Summarization Api For Java - TextSummarization | Text Summarization Online | Text Summarization Demo | Text Summarization API



About Unirest

Unirest is a set of lightweight HTTP libraries available in multiple languages, ideal for most applications:

  • Make GET, POST, PUT, PATCH, DELETE requests
  • Both synchronous and asynchronous (non-blocking) requests
  • It supports form parameters, file uploads and custom body entities
  • Supports gzip
  • Supports Basic Authentication natively
  • Customizable timeout
  • Customizable default headers for every request (DRY)
  • Automatic JSON parsing into a native object for JSON responses
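As a rough, hypothetical Java sketch of calling a summarization endpoint through Unirest (the URL, header, and form field names below are placeholders, not the provider's actual API contract):

import com.mashape.unirest.http.HttpResponse;
import com.mashape.unirest.http.JsonNode;
import com.mashape.unirest.http.Unirest;

public class SummarizeExample {
  public static void main(String[] args) throws Exception {
    // Hypothetical endpoint and parameters; check the provider's docs for the real ones.
    HttpResponse<JsonNode> response = Unirest.post("https://example-summarizer.p.mashape.com/summarize")
        .header("X-Mashape-Key", "YOUR_API_KEY")
        .field("text", "Long article text to summarize ...")
        .field("sentnum", 3)
        .asJson();

    // Pretty-print the JSON body of the response.
    System.out.println(response.getBody().getObject().toString(2));
  }
}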

Read full article from Text Summarization Api For Java - TextSummarization | Text Summarization Online | Text Summarization Demo | Text Summarization API


Getting Started with the Automatic Text Summarization API on Mashape | Text Mining Online | Text Analysis Online | Text Processing Online



Automatic text summarization is one of the more popular text processing tasks. According to Wikipedia, text summarization is referred to as automatic summarization:

Automatic summarization is the process of reducing a text document with a computer program in order to create a summary that retains the most important points of the original document. As the problem of information overload has grown, and as the quantity of data has increased, so has interest in automatic summarization. Technologies that can make a coherent summary take into account variables such as length, writing style and syntax. An example of the use of summarization technology is search engines such as Google. Document summarization is another.

If you remember the Summly app, which was created by a 15-year-old and eventually acquired by Yahoo, it was based on this text summarization technology. But automatic text summarization is a very difficult problem in academic research. If you are interested in this field, you can refer to the Summarization Website, which has a lot of text summarization resources, such as links to bibliographies, papers and summarization systems. We also recommend reading the classic survey paper "A Survey on Automatic Text Summarization" by Dipanjan Das and André F.T. Martins, where you can grasp the main summarization methods in the field.


Read full article from Getting Started with the Automatic Text Summarization API on Mashape | Text Mining Online | Text Analysis Online | Text Processing Online


Technologies



Our search solutions are built on top of mature, enterprise-quality, widely-used open source frameworks, libraries, and components. Data collection: Apache ManifoldCF provides a framework for connecting source content repositories like file systems, databases, CMIS ... to target repositories or indexes, such as Apache Solr (http://manifoldcf.apache.org/). Apache Nutch is a mature, highly scalable web crawler which provides extensible interfaces for parsing (for example for Tika), indexing (for example through Solr, SolrCloud, ...), and filters for custom implementations.

Read full article from Technologies


(30) What are some classification/machine learning libraries in Java? - Quora



  • Weka (Data Mining with Open Source Machine Learning Software in Java): Weka is a data mining toolkit and supports many data mining algorithms, but most of them cannot be applied directly to text documents. For document classification, check out the link (Text categorization with Weka). You can either use the StringToWordVector filter to tokenize the text, or you can try out a third-party tool like TagHelper (Page on Cmu). TagHelper is built on top of Weka.
     
  • Apache Mahout (Page on Apache): you can use naive Bayes / complement naive Bayes classification. The complement naive Bayes algorithm is useful if you are working on an imbalanced dataset. Mahout is useful if you are working on a large dataset.
  • RapidMiner (Rapid-I - RapidMiner): RapidMiner is a data mining toolkit and, similar to Weka, it supports a lot of algorithms. Check out the video link.

  Read full article from (30) What are some classification/machine learning libraries in Java? - Quora


    Java Machine Learning | Machine Learning Mastery



    Environments: This section describes Java-based environments or workbenches that can be used for machine learning. They are called environments because they provide graphical user interfaces for performing machine learning tasks, but they also provide Java APIs for developing your own applications. Weka: the Waikato Environment for Knowledge Analysis (Weka) is a machine learning platform developed by the University of Waikato, New Zealand. It is written in Java and provides a graphical user interface, a command line interface and a Java API.

    Read full article from Java Machine Learning | Machine Learning Mastery


    17 Great Machine Learning Libraries



    Java
    • Spark: Apache’s new upstart, supposedly up to a hundred times faster than Hadoop, now includes MLLib, which contains a good selection of machine learning algorithms, including classification, clustering and recommendation generation. Currently undergoing rapid development. Development can be in Python as well as JVM languages.
    • Mahout: Apache’s machine learning framework built on top of Hadoop, this looks promising, but comes with all the baggage and overhead of Hadoop.
    • Weka: this is a Java based library with a graphical user interface that allows you to run experiments on small datasets. This is great if you restrict yourself to playing around to get a feel for what is possible with machine learning. However, I would avoid using this in production code at all costs: the API is very poorly designed, the algorithms are not optimised for production use and the documentation is often lacking.
    • Mallet: another Java based library with an emphasis on document classification. I’m not so familiar with this one, but if you have to use Java this is bound to be better than Weka.
    • JSAT: stands for “Java Statistical Analysis Tool” - created by Edward Raff and was born out of his frustration with Weka (I know the feeling). Looks pretty cool.

    • LibSVM and LibLinear: these are C libraries for support vector machines; there are also bindings or implementations for many other languages. These are the libraries used for support vector machine learning in Scikit-learn.
    Read full article from 17 Great Machine Learning Libraries

    5 ways to add machine learning to Java, JavaScript, and more | JavaWorld



    For Java: Aside from the aforementioned Mahout, which focuses on Hadoop, a number of other machine learning libraries for Java are in wide use. Weka, created by the University of Waikato in New Zealand, is a workbench-like app that adds visualizations and data-mining capabilities to the usual mix of algorithms. For people who want a front end for their work and plan on doing a good part of it in Java to begin with, Weka might be the best place to start. A more conventional library, the Java-ML, is also available, although it's meant for people already comfortable working with both Java and machine learning.

    For JavaScript: The joke about JavaScript ("Atwood's Law") is that anything that can be written in JavaScript eventually will be. So it is for machine learning libraries. Granted, there's relatively little in this field available for JavaScript as of this writing -- most options consist of individual algorithms rather than whole libraries -- but a few useful tools have already surfaced. ConvNetJS lets you perform deep learning neural-network training directly in a browser, and the appropriately named brain provides neural networking as an NPM-installable module. Also worth noting is the Encog library, available for multiple platforms: Java, C#, C/C++, and JavaScript.


    Read full article from 5 ways to add machine learning to Java, JavaScript, and more | JavaWorld


    Reddit takes on Kickstarter with crowdfunding site Redditmade | Technology | The Guardian



    Reddit takes on Kickstarter with crowdfunding site Redditmade. The social media site has launched a crowdfunding offshoot which lets users create merchandise around specialist interests. Reddit has entered the crowdfunding arena with Redditmade, a new competitor to the likes of Kickstarter and Indiegogo. But unlike those sites, Redditmade is targeting a particular user with a laser focus: the existing moderators of Reddit communities, called subreddits.

    Read full article from Reddit takes on Kickstarter with crowdfunding site Redditmade | Technology | The Guardian


    Mobile App Usage By The Numbers [Infographic] - Forbes



    We're living in a world of mobile apps.

    Read full article from Mobile App Usage By The Numbers [Infographic] - Forbes


    How to enable pretty print JSON output (Gson)



    In this tutorial, we show you how to enable pretty-print JSON output in the Gson framework. In the last Gson object to/from JSON example:

        Gson gson = new Gson();
        String json = gson.toJson(obj);
        System.out.println(json);

    the JSON output is displayed in compact mode, like the following:

    {"data1":100,"data2":"hello","list":["String 1","String 2","String 3"]}

    To enable pretty printing, use GsonBuilder to create the Gson object:

        Gson gson = new GsonBuilder().setPrettyPrinting().create();
        String json = gson.toJson(obj);
        System.out.println(json);

    Read full article from How to enable pretty print JSON output (Gson)


    Mother F**k the ScheduledExecutorService! | Nomad Labs Code



    Mother F**k the ScheduledExecutorService! That's right! Motherfuck this service. Hidden deep in the javadoc of this magnificent class is this gem: "If any execution of the task encounters an exception, subsequent executions are suppressed." In other words, if your runnable task has any fuckups, your task will no longer be run. What sucks about this is that there is no clear indication that there was a fuckup in the task. No warning; the task just silently gets canceled, leading you to say: "Dude! Where's my task!?" Let's use this example: import java.util.concurrent.Executors;
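    The example in the original post is cut off above; a minimal sketch of the usual workaround (wrapping the task body in a try/catch so a single failure does not silently cancel the schedule) looks like this:

    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;

    public class SafeScheduling {
      public static void main(String[] args) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();

        scheduler.scheduleAtFixedRate(new Runnable() {
          @Override
          public void run() {
            try {
              doWork(); // if this throws and is not caught, the task is silently cancelled
            } catch (Exception e) {
              // Swallow (and log) the failure so the next scheduled run still happens.
              System.err.println("task failed: " + e);
            }
          }
        }, 0, 1, TimeUnit.SECONDS);
      }

      private static void doWork() {
        // ... the actual periodic work ...
      }
    }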

    Read full article from Mother F**k the ScheduledExecutorService! | Nomad Labs Code

