C# TrieTree介绍及实现方法

2020-01-24 03:23:58

字体：大中小

来源：转载

供稿：网友

在自然语言处理（NLP）研究中，NGram是最基本但也是最有用的一种比对方式，这里的N是需要比对的字符串的长度，而今天我介绍的TrieTree，正是和NGram密切相关的一种数据结构，有人称之为字典树。TrieTree简单的说是一种多叉树，每个节点保存一个字符，这么做的好处是当我们要做NGram比对时，只需要直接从树的根节点开始沿着某个树叉遍历下去，就能完成比对；如果没找到，停止本次遍历。这话讲得有些抽象，我们来看一个实际的例子。

假设我们现在词库里面有以下一些词：

上海市
上海滩
上海人
上海公司
北京
北斗星
杨柳
杨浦区

如图所示：挂在根节点上的字有上、北、杨，

如果我们现在对“上海市杨浦区”这个词做3gram就有上海市、海市杨、市杨浦、杨浦区，现在我们要知道哪些词是能够被这个字典识别的，通常我们可以用NGram来做分词。有了这颗树，我们只需要依次取每个字符，从根开始进行比对，比如上海市，我们能够匹配上->海->市，这个路径，所以匹配；比如海市杨，由于没有“海”字挂在根节点上，所以停止；市杨浦也无法匹配；最终匹配杨浦区，得到杨->浦->区这个路径，匹配。

最终我们可以把“上海市杨浦区”切分为上海市|杨浦区。

尽管TrieTree要比普通字符串数组节省很多时间，但这并不是没有代价的，因为你要先根据字典构建这棵树，这个代价并不低，当然对于某个应用来说一旦TrieTree构建完成就可以重复使用，所以针对大规模比对来说，性能提升还是很客观的。

下面是TrieTree的C#实现。

复制代码代码如下:

   public class TrieTree    
   {        
   TrieNode _root = null;  
   private TrieTree()     
   {           
   _root = new TrieNode(char.MaxValue,0);  
   charCount = 0;   
   }       
   static TrieTree _instance = null; 
   public static TrieTree GetInstance()   
   {            
   if (_instance == null)        
   {             
   _instance = new TrieTree();       
   }           
   return _instance;   
   }        
   public TrieNode Root   
   {           
   get { return _root; 
   }   
   }        
   public void AddWord(char ch) 
   {           
TrieNode newnode=_root.AddChild(ch);   
newnode.IncreaseFrequency();           
newnode.WordEnded = true;      
}        int charCount;    
public void AddWord(string word)   
{          
if (word.Length == 1)     
{               
AddWord(word[0]);     
charCount++;       
}         
else    
{                 
char[] chars=word.ToCharArray();     
TrieNode node = _root;           
charCount += chars.Length;      
for (int i = 0; i < chars.Length; i++)  
{                    
TrieNode newnode=node.AddChild(chars[i]);    
newnode.IncreaseFrequency();           
node = newnode;           
}           
node.WordEnded = true;  
}       
}       
public int GetFrequency(char ch)   
{           
TrieNode matchedNode = _root.Children.FirstOrDefault(n => n.Character == ch);  
if (matchedNode == null)      
{               
return 0;        
}           
return matchedNode.Frequency;  
}       
public int GetFrequency(string word)
{        
if (word.Length == 1) 
{              
return GetFrequency(word[0]); 
}            
else      
{            
char[] chars = word.ToCharArray(); 
TrieNode node = _root;        
for (int i = 0; i < chars.Length; i++)   
{                 
if (node.Children == null)   
return 0;              
TrieNode matchednode = node.Children.FirstOrDefault(n => n.Character == chars[i]);
if (matchednode == null)          
{                      
return 0;         
}                  
node = matchednode;    
}              
if (node.WordEnded == true)        
return node.Frequency;       
else                   
return -1;           
}      
}   
}

这里我们使用了单例模式，因为TrieTree类似缓存，不需要重复创建。下面是TreeNode的实现：

复制代码代码如下:

   public class TrieNode    
   {       
   public TrieNode(char ch,int depth)
   {           
   this.Character=ch; 
   this._depth=depth;
   }       
   public char Character; 
   int _depth;        
   public int Depth   
   {            
   get{return _depth;
   }      
   }     
   TrieNode _parent=null;
   public TrieNode Parent     
   {          
   get { 
   return _parent;
   }          
   set { _parent = value;
   } 
   }       
   public bool WordEnded = false;  
   HashSet<TrieNode> _children=null;  
   public HashSet<TrieNode> Children 
   {           
   get { 
   return _children;
   }       
   }        
   public TrieNode GetChildNode(char ch) 
   {            
   if (_children != null)    
   return _children.FirstOrDefault(n => n.Character == ch);  
   else               
   return null;      
   }       
   public TrieNode AddChild(char ch) 
   {           
   TrieNode matchedNode=null;      
   if (_children != null)      
   {               
   matchedNode = _children.FirstOrDefault(n => n.Character == ch);  
   }           
   if (matchedNode != null)   
   //found the char in the list   
   {                
   //matchedNode.IncreaseFrequency();      
   return matchedNode;         
   }           
   else          
   {  
   //not found       
   TrieNode node = new TrieNode(ch, this.Depth + 1);     
   node.Parent = this;      
   //node.IncreaseFrequency();            
   if (_children == null)               
   _children = new HashSet<TrieNode>();   
   _children.Add(node);              
   return node;          
   }       
   }       
   int _frequency = 0;       
   public int Frequency    
   {         
   get { return _frequency;
   }        
   }       
   public void IncreaseFrequency()      
   {          
   _frequency++;   
   }      
   public string GetWord() 
   {             
   TrieNode tmp=this;     
   string result = string.Empty;  
   while(tmp.Parent!=null) //until root node  
   {                
   result = tmp.Character + result;   
   tmp = tmp.Parent;     
   }            
   return result;     
   }        
   public override string ToString()
   {          
   return Convert.ToString(this.Character);
   }    
   }