Feature Extraction and Transformation - MLlib - Spark 1.1.1 Documentation
Term frequency-inverse document frequency (TF-IDF) is a feature vectorization method widely used in text mining to reflect the importance of a term to a document in the corpus. Denote a term by $t$, a document by $d$, and the corpus by $D$. Term frequency $TF(t, d)$ is the number of times that term $t$ appears in document $d$, while document frequency $DF(t, D)$ is the number of documents that contain term $t$. If we used only term frequency to measure importance, it would be easy to over-emphasize terms that appear very often but carry little information about the document, e.g., "a", "the", and "of". A term that appears very often across the corpus carries little information about any particular document. Inverse document frequency is a numerical measure of how much information a term provides:

\[ IDF(t, D) = \log \frac{|D| + 1}{DF(t, D) + 1}, \]

where $|D|$ is the total number of documents in the corpus. Because a logarithm is used, a term that appears in every document has an IDF value of 0. Note that a smoothing term is applied to avoid dividing by zero for terms outside the corpus. The TF-IDF measure is simply the product of TF and IDF:

\[ TFIDF(t, d, D) = TF(t, d) \cdot IDF(t, D). \]

There are several variants on the definitions of term frequency and document frequency. In MLlib, we separate TF and IDF to make them flexible.
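In Spark 1.1.x, this two-stage split shows up directly in the API: `HashingTF` produces the TF vectors and `IDF` fits and applies the IDF weights, both in `org.apache.spark.mllib.feature`. Below is a minimal sketch of the flow; the input path `data/docs.txt` and the whitespace tokenization are illustrative assumptions, not part of the library.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD
import org.apache.spark.mllib.feature.{HashingTF, IDF}
import org.apache.spark.mllib.linalg.Vector

object TfIdfExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("TfIdfExample"))

    // Each document is a sequence of terms; splitting on whitespace and the
    // path "data/docs.txt" are hypothetical choices for this sketch.
    val documents: RDD[Seq[String]] =
      sc.textFile("data/docs.txt").map(_.split(" ").toSeq)

    // TF: hash each term to a vector index and count occurrences per document.
    val hashingTF = new HashingTF()
    val tf: RDD[Vector] = hashingTF.transform(documents)

    // IDF needs two passes over the TF vectors, so cache them first.
    tf.cache()
    val idfModel = new IDF().fit(tf)                 // computes the IDF weights from DF(t, D)
    val tfidf: RDD[Vector] = idfModel.transform(tf)  // scales each TF vector by IDF

    tfidf.take(3).foreach(println)
    sc.stop()
  }
}
```

Because the stages are separate, the same TF vectors can be reused with different IDF models, for example fitting `IDF` on training documents and applying the resulting model to held-out documents.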