Aho-Corasick自动机算法原理与详解-colwater-ChinaUnix博客



1. 初识AC算法

Step1: 将由patterns组成的集合(要同时匹配多个patterns嘛)构成一个有限状态自动机。

Step2: 将要匹配的text作为自动机输入,输出含有哪些patternspatterns在全文中位置。

 

 

自动机的执行动作由三个部分组成:

(1)       一个goto function

(2)       一个failure function

(3)       一个output function

 

我们先通过一个具体实例来了解一下这三个部分,及该自动机的运作方式。先有个大概印象,后面会具体讲解。patterns集合{he, she, his ,hers},我们要在"ushers"中查找并匹配。

goto function 

(1) goto function

 

 

i       1   2   3   4   5   6   7   8   9

f(i)     0   0   0   1   2   0   3   0   (发现没?貌似if(i)有相同前缀哦^_^)

(2) failure function

 

 

i           output(i)

2           {he}

5           {she,he}

7           {his}

9           {hers}

(3) output function

 

 

       首先我们从状态0开始,接收匹配字符串的第一个字符u,在goto(简称goto function)中可以看到回到状态0,接着第二个字符s,发现转到状态3,在output中查找一下output(3)为空字符串,说明没有匹配到patterns。继续匹配h,转到状态4,查找output发现仍然没有匹配,继续字符e,状态转到了5,查找output,发现output(5)匹配了两个字符串shehe,并输出在整个字符串中的位置。然后接着匹配r,但发现g(5,r)=fail,这时候我们需要查找failure,发现f(5)=2,所以就转到状态2,并接着匹配r,状态转移到了8,接着匹配s,状态转移到了9,查看output,并输出output(9):hers,记录下匹配位置。至此字符串ushers匹配完毕。


Read full article from Aho-Corasick自动机算法原理与详解-colwater-ChinaUnix博客


No comments:

Post a Comment

Labels

Algorithm (219) Lucene (130) LeetCode (97) Database (36) Data Structure (33) text mining (28) Solr (27) java (27) Mathematical Algorithm (26) Difficult Algorithm (25) Logic Thinking (23) Puzzles (23) Bit Algorithms (22) Math (21) List (20) Dynamic Programming (19) Linux (19) Tree (18) Machine Learning (15) EPI (11) Queue (11) Smart Algorithm (11) Operating System (9) Java Basic (8) Recursive Algorithm (8) Stack (8) Eclipse (7) Scala (7) Tika (7) J2EE (6) Monitoring (6) Trie (6) Concurrency (5) Geometry Algorithm (5) Greedy Algorithm (5) Mahout (5) MySQL (5) xpost (5) C (4) Interview (4) Vi (4) regular expression (4) to-do (4) C++ (3) Chrome (3) Divide and Conquer (3) Graph Algorithm (3) Permutation (3) Powershell (3) Random (3) Segment Tree (3) UIMA (3) Union-Find (3) Video (3) Virtualization (3) Windows (3) XML (3) Advanced Data Structure (2) Android (2) Bash (2) Classic Algorithm (2) Debugging (2) Design Pattern (2) Google (2) Hadoop (2) Java Collections (2) Markov Chains (2) Probabilities (2) Shell (2) Site (2) Web Development (2) Workplace (2) angularjs (2) .Net (1) Amazon Interview (1) Android Studio (1) Array (1) Boilerpipe (1) Book Notes (1) ChromeOS (1) Chromebook (1) Codility (1) Desgin (1) Design (1) Divide and Conqure (1) GAE (1) Google Interview (1) Great Stuff (1) Hash (1) High Tech Companies (1) Improving (1) LifeTips (1) Maven (1) Network (1) Performance (1) Programming (1) Resources (1) Sampling (1) Sed (1) Smart Thinking (1) Sort (1) Spark (1) Stanford NLP (1) System Design (1) Trove (1) VIP (1) tools (1)

Popular Posts