From d2f9066238724cb431f5cda74042f711aca8839f Mon Sep 17 00:00:00 2001 From: Aakash Panchal <51417248+Aakash-Panchal27@users.noreply.github.com> Date: Tue, 2 Jun 2020 13:47:27 +0530 Subject: [PATCH] Create Trie.html --- articles/Akash Articles/rendered/Trie.html | 526 +++++++++++++++++++++ 1 file changed, 526 insertions(+) create mode 100644 articles/Akash Articles/rendered/Trie.html diff --git a/articles/Akash Articles/rendered/Trie.html b/articles/Akash Articles/rendered/Trie.html new file mode 100644 index 0000000..016d1ca --- /dev/null +++ b/articles/Akash Articles/rendered/Trie.html @@ -0,0 +1,526 @@ + + +
+Do you know how the auto-completion feature provided by software such as IDEs, search engines, command-line interpreters, and text editors works?
+The basic data structure behind all of these features is the trie.
+Spell checkers can also be designed using Trie.
+String processing is widely used across real-world applications, for example data analytics, search engines, bioinformatics, plagiarism detection, etc.
+Trie is a very useful and special kind of data structure for string processing.
+Below is a very simple representation of a trie containing the strings "cat", "bat", and "dog".
Now, suppose we are given an array of strings and are asked to check whether the string "cat" is present in it. We can check by brute force, comparing "cat" with each and every string in the array, which takes $O(N \cdot length("cat"))$ time in the worst case, where $N$ is the number of strings in the array.
Now, if you create a trie from all the strings present in the array, then you can check it in $O(length("cat"))$ time by traversing the trie (confused? we will see how soon). The lookup cost no longer depends on $N$, which is why the trie is an efficient information retrieval data structure.
+Trie is a tree of nodes, where the specifications of a node can be given as below:
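+Each node has two members: an array of alphabet_size links (pointers) to child nodes, and a boolean isEndofString.
+Notes: we assume that the strings contain only lowercase English letters, so alphabet_size is $26$. We will see why we need these two variables soon.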
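+#include <string>
+#include <vector>
+using namespace std;
+
+// Lowercase English letters only
+const int alphabet_size = 26;
+
+struct trie_node
+{
+    // Array of pointers of type trie_node;
+    // link[c - 'a'] leads to the child for character c
+    vector<trie_node*> link;
+    // True if a stored string ends at this node
+    bool isEndofString;
+
+    // Constructor
+    trie_node()
+    {
+        link.assign(alphabet_size, nullptr);
+        isEndofString = false;
+    }
+};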
+
+Now we have seen what a trie node looks like. Let's see how we are going to store strings in a trie using this kind of node.
+Look at the image below, which represents the string "act" stored in a trie. Observe something!
+Note: Empty slots in the array hold null values (nullptr in C++).
The links followed at indexes $0$, $2$, and $19$ of successive nodes represent the characters 'a', 'c', and 't' respectively.
Therefore, for the sake of simplicity, we are going to represent the nodes of the trie as below.
+The trie containing the string "act" is therefore represented as below.
+[Image: trie containing the string "act"]
Note: Root node will be shown empty, as it only represents an empty string, so to speak.
+Now, observe the trie below, which contains two strings "act" and "ace".
+[Image: trie containing the strings "act" and "ace"]
Note that the node representing the character c in the above trie would, in magnified form, look as below:
What did you observe?
+The common prefix of "ace" and "act" is "ac", so both strings share the same nodes while we traverse "ac"; only after that do we create a new node for the character e.
Therefore, we do not create a new node until we need one, and a trie stores data very efficiently when we have a large list of strings sharing common prefixes. It is also known as a prefix tree.
+Now, look at the trie below, which contains the three strings "act", "ace", and "cat".
[Image: trie containing the strings "act", "ace", and "cat"]
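Let's see a proper algorithm to insert a string in a trie.
+Start at the root and process the characters of the string one by one.
+If the link for the current character is null, create a new node for it; then move to the linked node.
+Finally, set isEndofString to true in the node reached after the last character.
+void insert(trie_node* root, string s)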
+{
+ trie_node* temp = root;
+ int n = s.size();
+ for(int i = 0; i < n; i++){
+ if(temp->link[s[i]-'a'] == nullptr)
+ temp->link[s[i]-'a'] = new trie_node();
+ // Traverse using link
+ temp = temp->link[s[i]-'a'];
+ }
+ temp->isEndofString = true;
+}
+
+Time Complexity: $O(N)$, where $N$ is the length of the string we are inserting.
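To make the prefix sharing concrete, here is a minimal, self-contained sketch that inserts a few words and counts the nodes that were actually created. The names Node, indexOf, insertWord, and countNodes are made up for this illustration, and std::unique_ptr is used in place of the raw pointers above only so that the sketch cleans up after itself.

```cpp
#include <array>
#include <cassert>
#include <memory>
#include <string>

struct Node {
    // One owning link per lowercase letter; all start out null.
    std::array<std::unique_ptr<Node>, 26> link;
    bool isEndofString = false;
};

// Maps 'a'..'z' to 0..25 (lowercase input is assumed, as in the article).
int indexOf(char c)
{
    assert(c >= 'a' && c <= 'z');
    return c - 'a';
}

void insertWord(Node& root, const std::string& s)
{
    Node* cur = &root;
    for (char c : s) {
        std::unique_ptr<Node>& next = cur->link[indexOf(c)];
        if (!next)
            next = std::make_unique<Node>();  // create a node only when needed
        cur = next.get();
    }
    cur->isEndofString = true;
}

// Counts every node reachable from `node`, including `node` itself.
int countNodes(const Node& node)
{
    int total = 1;
    for (const auto& child : node.link)
        if (child)
            total += countNodes(*child);
    return total;
}
```

After inserting "act", "ace", and "cat" into an empty root, countNodes(root) gives 8: the root plus seven character nodes, because "ace" reuses the nodes for a and c.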
+Can you figure out how to check whether a given string is present in the trie, using what we have discussed so far?
+For example, if you are searching for "aco" in the trie below, how will you do it?
[Image: example trie in which "aco" is searched]
Can you see why isEndofString is needed?
Observe the trie given below and try to check whether "on" is present or not.
[Image: trie in which "on" is a prefix of "once"]
Without the isEndofString variable, you cannot correctly check whether "on" is present, because "on" is also a prefix of "once".
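Algorithm:
+Start at the root and follow the link for each character of the string in turn.
+If the link for some character is null, the string is not present.
+If all characters are traversed, return the isEndofString variable of the last node.
+bool search(trie_node* root, string s)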
+{
+ trie_node* temp = root;
+ int n = s.size();
+ for(int i = 0; i < n; i++){
+ // There is no further link
+ if(temp->link[s[i]-'a'] == nullptr)
+ return false;
+ temp = temp->link[s[i]-'a'];
+ }
+ return temp->isEndofString;
+}
+
+Can you find recursive version of the above function?
+Recursive version:
+// @param: root -> root of the trie
+// @param: s -> the string we are searching for
+// @param: i -> index of s currently reached via recursive traversal
+bool Rec_search(trie_node* root, string& s, int i = 0)
+{
+ // No link present
+ // so string is not present
+ if(root == nullptr)
+ return false;
+ if(i == s.size()) {
+ // present
+ if(root->isEndofString)
+ return true;
+ else
+ return false;
+ }
+ // Recursively traverse using links
+ return Rec_search(root->link[s[i]-'a'], s, i+1);
+}
+
+Time Complexity: $O(N)$, where $N$ is the length of the string we are searching for.
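The role of isEndofString is easiest to see by putting a whole-word check next to a prefix check. The sketch below does that under the same lowercase-only assumption; walk, containsWord, and hasPrefix are hypothetical helper names, and std::unique_ptr again stands in for raw pointers.

```cpp
#include <array>
#include <cassert>
#include <memory>
#include <string>

struct Node {
    std::array<std::unique_ptr<Node>, 26> link;  // null means "no child"
    bool isEndofString = false;
};

void insertWord(Node& root, const std::string& s)
{
    Node* cur = &root;
    for (char c : s) {
        assert(c >= 'a' && c <= 'z');
        std::unique_ptr<Node>& next = cur->link[c - 'a'];
        if (!next)
            next = std::make_unique<Node>();
        cur = next.get();
    }
    cur->isEndofString = true;
}

// Follows s from the root; returns the node reached,
// or nullptr as soon as a link is missing.
const Node* walk(const Node& root, const std::string& s)
{
    const Node* cur = &root;
    for (char c : s) {
        assert(c >= 'a' && c <= 'z');
        cur = cur->link[c - 'a'].get();
        if (cur == nullptr)
            return nullptr;
    }
    return cur;
}

// True only if s was inserted as a complete string.
bool containsWord(const Node& root, const std::string& s)
{
    const Node* node = walk(root, s);
    return node != nullptr && node->isEndofString;
}

// True if at least one stored string starts with s.
bool hasPrefix(const Node& root, const std::string& s)
{
    return walk(root, s) != nullptr;
}
```

With only "once" stored, hasPrefix(root, "on") is true while containsWord(root, "on") is false; the two functions differ only in whether they look at isEndofString.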
+How will you delete "ace" from the trie below?
[Image: trie from which "ace" is to be deleted]
While deleting a string from the trie, take care not to remove nodes that are still used by other strings, and handle the case where the string is not present at all.
+We are going to use a recursive procedure. It returns false if the string is not present, and true otherwise. The recursive procedure for delete is a modified version of the recursive search procedure, so make sure you understand that first.
Can you figure it out on your own?
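+Procedure:
+Traverse the trie recursively, in the same way as the Rec_search() procedure.
+If there is no link (root == nullptr) for the current character, then the string is not present in the trie; return false.
+If i == s.size(), check isEndofString of the last node. If the string is present (isEndofString = true), set it to false and return true. Otherwise, return false, as the string is not present.
+While returning from a successful deletion, free every node on the path that is no longer needed by any other string.
Now, go through the code below; the comments explain each step.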
+// Checks whether any link is present
+bool isEmptyNode(trie_node* node)
+{
+ for(auto i:node->link)
+ if(i != nullptr)
+ return false;
+ return true;
+}
+
+// Returns true if the string is successfully deleted.
+// Returns false if the string is not present in the trie.
+bool deleteString(trie_node* root, string& s, int i = 0)
+{
+    if(root == nullptr)
+        return false;
+
+    if(i == s.size()) {
+        // Not present as a complete string
+        if(!root->isEndofString)
+            return false;
+        // Present: unmark it
+        root->isEndofString = false;
+        return true;
+    }
+
+    // Not present, so return false
+    if(!deleteString(root->link[s[i]-'a'], s, i+1))
+        return false;
+
+    // The string was present and has been unmarked.
+    // Free the child node only if no other string
+    // passes through it or ends at it.
+    trie_node* child = root->link[s[i]-'a'];
+    if(isEmptyNode(child) && !child->isEndofString) {
+        // Deallocate used memory
+        delete child;
+        root->link[s[i]-'a'] = nullptr;
+    }
+    return true;
+}
+
+Time Complexity: $O(N)$, where $N$ is the length of the string we are deleting.
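A case worth testing for any delete routine is removing a longer word whose prefix is also a stored word, for example deleting "once" while "on" must survive. The following self-contained sketch shows one way to handle it; eraseWord and the other names are illustrative, std::unique_ptr replaces manual delete, and the check on isEndofString before freeing a node is what keeps "on" alive.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>

struct Node {
    std::array<std::unique_ptr<Node>, 26> link;
    bool isEndofString = false;
};

int indexOf(char c)
{
    assert(c >= 'a' && c <= 'z');  // lowercase input assumed
    return c - 'a';
}

void insertWord(Node& root, const std::string& s)
{
    Node* cur = &root;
    for (char c : s) {
        std::unique_ptr<Node>& next = cur->link[indexOf(c)];
        if (!next)
            next = std::make_unique<Node>();
        cur = next.get();
    }
    cur->isEndofString = true;
}

bool containsWord(const Node& root, const std::string& s)
{
    const Node* cur = &root;
    for (char c : s) {
        cur = cur->link[indexOf(c)].get();
        if (cur == nullptr)
            return false;
    }
    return cur->isEndofString;
}

bool isEmptyNode(const Node& node)
{
    for (const auto& child : node.link)
        if (child)
            return false;
    return true;
}

// Returns true if s was stored and has now been removed.
bool eraseWord(Node& node, const std::string& s, std::size_t i = 0)
{
    if (i == s.size()) {
        if (!node.isEndofString)
            return false;
        node.isEndofString = false;
        return true;
    }
    std::unique_ptr<Node>& child = node.link[indexOf(s[i])];
    if (!child || !eraseWord(*child, s, i + 1))
        return false;
    // Free the child only if nothing passes through it or ends at it.
    if (isEmptyNode(*child) && !child->isEndofString)
        child.reset();
    return true;
}
```

With "on" and "once" stored, eraseWord(root, "once") frees the nodes for e and c but keeps the node for n, so containsWord(root, "on") is still true afterwards.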
+The availability of dynamic arrays allows us to create a trie without using pointers.
+Now, we are going to store the trie as a dynamic array of TrieNodes. In this implementation, each TrieNode holds an array of integers instead of pointers, and a link stores the index of a node in the dynamic array rather than the address of a node as before. A link of $-1$ means that there is no child.
See the implementation below, which is very similar to the previous one. Note that the dynamic array must initially contain one node, the root, at index $0$.
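+struct TrieNode
+{
+    // id_link[c - 'a'] is the index of the child node in the
+    // trie vector, or -1 if there is no such child
+    vector<int> id_link;
+    bool isEndofString;
+
+    TrieNode(bool end = false)
+    {
+        isEndofString = end;
+        id_link.assign(26,-1);
+    }
+};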
+
+void insert(vector<TrieNode>& trie, string s)
+{
+ int temp = 0;
+ int n = s.size();
+ for(int i = 0; i < n; i++) {
+ if(trie[temp].id_link[s[i]-'a'] == -1) {
+ trie[temp].id_link[s[i]-'a'] = (int)trie.size();
+ trie.push_back(TrieNode());
+ }
+ temp = trie[temp].id_link[s[i]-'a'];
+ }
+ trie[temp].isEndofString = true;
+}
+
+bool search(vector<TrieNode>& trie, string s)
+{
+ int temp = 0;
+ int n = s.size();
+ for(int i = 0; i < n; i++) {
+ if(trie[temp].id_link[s[i]-'a'] == -1)
+ return false;
+ temp = trie[temp].id_link[s[i]-'a'];
+ }
+ return trie[temp].isEndofString;
+}
+
+But it has a downside: you cannot easily remove the nodes of a string stored in the trie. Why?
+Try deleting a single node (other than the last one). You will realize that the index of every subsequent node changes, so all links pointing to those nodes become wrong, and erasing from the middle of an array is also slow.
+It is an easy implementation, but with this single downside. Therefore, use it as per the requirement.
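If occasional deletions are still needed with the array version, one possible workaround (not part of the implementation above) is a lazy delete: only clear isEndofString and leave the nodes where they are, so no index ever changes. The sketch below assumes the same layout with the root at index 0; lazyErase is a hypothetical name, and the unused nodes are never reclaimed.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

struct TrieNode {
    std::vector<int> id_link = std::vector<int>(26, -1);  // -1 means "no child"
    bool isEndofString = false;
};

void insertWord(std::vector<TrieNode>& trie, const std::string& s)
{
    int cur = 0;  // index 0 is the root
    for (char c : s) {
        assert(c >= 'a' && c <= 'z');
        if (trie[cur].id_link[c - 'a'] == -1) {
            trie[cur].id_link[c - 'a'] = static_cast<int>(trie.size());
            trie.push_back(TrieNode());
        }
        cur = trie[cur].id_link[c - 'a'];
    }
    trie[cur].isEndofString = true;
}

bool containsWord(const std::vector<TrieNode>& trie, const std::string& s)
{
    int cur = 0;
    for (char c : s) {
        assert(c >= 'a' && c <= 'z');
        cur = trie[cur].id_link[c - 'a'];
        if (cur == -1)
            return false;
    }
    return trie[cur].isEndofString;
}

// Lazy delete: unmark the word but keep every node, so indexes stay valid.
// Returns false if s was not stored.
bool lazyErase(std::vector<TrieNode>& trie, const std::string& s)
{
    int cur = 0;
    for (char c : s) {
        assert(c >= 'a' && c <= 'z');
        cur = trie[cur].id_link[c - 'a'];
        if (cur == -1)
            return false;
    }
    if (!trie[cur].isEndofString)
        return false;
    trie[cur].isEndofString = false;
    return true;
}
```

The trade-off is memory: after lazyErase the vector has exactly as many nodes as before, so this only suits workloads where deletions are rare.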
+How will you find the number of words (strings) present in the trie below?
+[Image: example trie whose words are to be counted]
Ultimately, this means finding the total number of nodes whose isEndofString is true, which can easily be done with a recursive traversal of all the nodes present in the trie.
The basic idea of the recursive procedure is as follows:
+Start from the $\text{root}$ node and go through all $26$ positions of the link array. For each non-null link, recursively call countWords(), treating the linked node as the $\text{root}$. The formula is therefore as below:
$\text{TotalWords} = \text{TotalWords} + \text{countWords}(link_i)$, for all non-null links.
+Finally, add $1$ to $\text{TotalWords}$ if the current node has isEndofString = true.
int countWords(trie_node* root)
+{
+ int total = 0;
+ if(root == nullptr)
+ return 0;
+ for(auto i:root->link)
+ if(i != nullptr)
+ total += countWords(i);
+ total += root->isEndofString;
+ return total;
+}
+
+Time complexity: $O(\text{Number of nodes present in the trie})$, as we are visiting each and every node.
Space complexity: $O(\text{Length of the longest word})$ for the recursion stack; no other extra memory is used.
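For the array-based trie shown earlier, the same count needs no recursion at all, because every node already sits in one vector. Here is a small sketch under that assumption; the struct mirrors the earlier TrieNode, and insertWord is repeated only so that the example is self-contained.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

struct TrieNode {
    std::vector<int> id_link = std::vector<int>(26, -1);  // -1 means "no child"
    bool isEndofString = false;
};

void insertWord(std::vector<TrieNode>& trie, const std::string& s)
{
    int cur = 0;  // index 0 is the root
    for (char c : s) {
        assert(c >= 'a' && c <= 'z');
        if (trie[cur].id_link[c - 'a'] == -1) {
            trie[cur].id_link[c - 'a'] = static_cast<int>(trie.size());
            trie.push_back(TrieNode());
        }
        cur = trie[cur].id_link[c - 'a'];
    }
    trie[cur].isEndofString = true;
}

// One linear pass over the vector: each node marked as an end is one word.
int countWords(const std::vector<TrieNode>& trie)
{
    return static_cast<int>(std::count_if(
        trie.begin(), trie.end(),
        [](const TrieNode& node) { return node.isEndofString; }));
}
```

This still visits every node once, so the time complexity is unchanged; only the recursion stack goes away.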
It is similar to finding the total number of words, but instead of adding $1$ for each node whose isEndofString is true, we store the word that ends at that node.
+The code is similar as well.
+void printAllWords(trie_node* root, vector<string>& ans, string s="")
+{
+ if(root == nullptr)
+ return;
+ for(int i = 0; i < alphabet_size; i++) {
+ if(root->link[i] != nullptr) {
+ char c = 'a' + i;
+ string temp = s;
+ temp += c;
+ printAllWords(root->link[i], ans, temp);
+ }
+ }
+ if(root->isEndofString)
+ ans.push_back(s);
+}
+
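+Time complexity: $O(\text{Number of nodes present in the trie})$ node visits, as we are visiting each and every node. Copying the current prefix at each node adds work proportional to the depth of that node.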
+Space Complexity: $O(\text{Total length of all words present in the trie})$
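printAllWords() copies the current prefix into a new string at every node. A common variation, sketched below with made-up names, keeps one shared buffer and undoes each character after the recursive call (backtracking). This version also records a word before descending into its children, so the words come out in lexicographic order.

```cpp
#include <array>
#include <cassert>
#include <memory>
#include <string>
#include <vector>

struct Node {
    std::array<std::unique_ptr<Node>, 26> link;
    bool isEndofString = false;
};

void insertWord(Node& root, const std::string& s)
{
    Node* cur = &root;
    for (char c : s) {
        assert(c >= 'a' && c <= 'z');  // lowercase input assumed
        std::unique_ptr<Node>& next = cur->link[c - 'a'];
        if (!next)
            next = std::make_unique<Node>();
        cur = next.get();
    }
    cur->isEndofString = true;
}

// Appends every word stored below `node` to `out`.
// `prefix` holds the characters on the path from the root to `node`.
void collectWords(const Node& node, std::string& prefix,
                  std::vector<std::string>& out)
{
    if (node.isEndofString)
        out.push_back(prefix);
    for (int i = 0; i < 26; i++) {
        if (node.link[i]) {
            prefix.push_back(static_cast<char>('a' + i));  // go down
            collectWords(*node.link[i], prefix, out);
            prefix.pop_back();                             // backtrack
        }
    }
}
```

For a trie holding "cat", "act", and "ace", calling collectWords with an empty prefix fills out with "ace", "act", "cat" in that order.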
How will you design the autocompletion feature using a trie?
+For example, suppose we have stored C++ keywords in a trie. Now, when you type "n", it should show all keywords starting with "n". For simplicity, only keywords starting with "n" are shown in the trie below.
[Image: trie of C++ keywords starting with "n"]
How will you print all keywords starting with "n"? In other words, how will you print all keywords having "n" as a prefix?
Simply use printAllWords() on the node n, and the problem is solved!
The general procedure is as below:
+Traverse the nodes of the trie according to the given incomplete string s. If we are able to traverse all of s, then there are keywords having s as a prefix. Otherwise, there is nothing to suggest.
Now, use printAllWords(), treating the last node (reached after traversing the trie according to s) as the root.
void autocomplete(trie_node* root, string s)
+{
+    int n = s.size();
+    trie_node* temp = root;
+    for(int i = 0; i < n; i++) {
+        // No stored word has s as a prefix
+        if(temp->link[s[i]-'a'] == nullptr)
+            return;
+        temp = temp->link[s[i]-'a'];
+    }
+    vector<string> suggest;
+    // Passing s makes every collected word start with s
+    printAllWords(temp, suggest, s);
+    for(auto i:suggest)
+        cout << i << endl;
+    /*
+    OR collect only the suffixes and print s in front of each:
+    printAllWords(temp, suggest);
+    for(auto i:suggest)
+        cout << s << i << endl;
+    */
+}
+
+Time complexity: $O(\text{Length of s + Total length of all suggestions excluding the common prefix s})$, where s is the string you want suggestions for.
+Space complexity: $O(\text{Total length of all possible suggestions})$
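In practice an autocomplete box usually shows only a handful of suggestions. The sketch below is one way to return at most limit suggestions instead of printing all of them; the limit parameter and the names suggest and collect are assumptions made for this example, not part of the code above.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <vector>

struct Node {
    std::array<std::unique_ptr<Node>, 26> link;
    bool isEndofString = false;
};

void insertWord(Node& root, const std::string& s)
{
    Node* cur = &root;
    for (char c : s) {
        assert(c >= 'a' && c <= 'z');  // lowercase input assumed
        std::unique_ptr<Node>& next = cur->link[c - 'a'];
        if (!next)
            next = std::make_unique<Node>();
        cur = next.get();
    }
    cur->isEndofString = true;
}

// Depth-first collection in alphabetical order, stopping at `limit` words.
void collect(const Node& node, std::string& prefix, std::size_t limit,
             std::vector<std::string>& out)
{
    if (out.size() >= limit)
        return;
    if (node.isEndofString)
        out.push_back(prefix);
    for (int i = 0; i < 26 && out.size() < limit; i++) {
        if (node.link[i]) {
            prefix.push_back(static_cast<char>('a' + i));
            collect(*node.link[i], prefix, limit, out);
            prefix.pop_back();
        }
    }
}

// Returns up to `limit` stored words that start with `prefix`.
std::vector<std::string> suggest(const Node& root, std::string prefix,
                                 std::size_t limit)
{
    std::vector<std::string> out;
    const Node* cur = &root;
    for (char c : prefix) {
        assert(c >= 'a' && c <= 'z');
        cur = cur->link[c - 'a'].get();
        if (cur == nullptr)
            return out;  // nothing to suggest
    }
    collect(*cur, prefix, limit, out);
    return out;
}
```

With the keywords "namespace", "new", "noexcept", "not", and "nullptr" stored, suggest(root, "no", 10) returns "noexcept" and "not", and suggest(root, "n", 3) returns the first three keywords in alphabetical order.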
It is a widely used feature, as discussed at the start of the article.
+There is also something called a "Ternary Search Tree". When each node in the trie has most of its links in use (many words sharing similar prefixes), a trie is substantially more space-efficient and time-efficient than the ternary search tree.
+But if each node uses only a few links, then the ternary search tree is much more space-efficient, because each trie node holds $26$ pointers and many of them may be unused.
+Therefore, use as per the requirements.
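For comparison, here is a rough sketch of a ternary search tree, written from the usual textbook description rather than taken from this article. Each node stores one character and three links: lo for smaller characters, eq for the next character of the same word, and hi for larger characters. The names TstNode, tstInsert, and tstSearch are made up, and the sketch does not store the empty string.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>

struct TstNode {
    char ch;
    bool isEndofString = false;
    std::unique_ptr<TstNode> lo, eq, hi;  // three links instead of 26
    explicit TstNode(char c) : ch(c) {}
};

void tstInsert(std::unique_ptr<TstNode>& node, const std::string& s,
               std::size_t i = 0)
{
    assert(!s.empty());  // the empty string is not supported in this sketch
    if (!node)
        node = std::make_unique<TstNode>(s[i]);
    if (s[i] < node->ch)
        tstInsert(node->lo, s, i);
    else if (s[i] > node->ch)
        tstInsert(node->hi, s, i);
    else if (i + 1 < s.size())
        tstInsert(node->eq, s, i + 1);  // character matched, move to the next
    else
        node->isEndofString = true;
}

bool tstSearch(const std::unique_ptr<TstNode>& node, const std::string& s,
               std::size_t i = 0)
{
    if (!node || s.empty())
        return false;
    if (s[i] < node->ch)
        return tstSearch(node->lo, s, i);
    if (s[i] > node->ch)
        return tstSearch(node->hi, s, i);
    if (i + 1 < s.size())
        return tstSearch(node->eq, s, i + 1);
    return node->isEndofString;
}
```

Each node carries three pointers regardless of the alphabet size, which is where the space saving for sparse nodes comes from; the price is that finding the right child may take several lo/hi steps per character.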
+What are the common features of an English dictionary?
+A hash table can be used to implement a dictionary. After precomputing the hash of each word in $O(M)$ total, where $M$ is the total length of all words in the dictionary, we can have efficient lookups if we design a very efficient hash table.
+But as the dictionary is very large, there will be collisions between two or more words. Still, you can design a hash table to have efficient lookups.
+But a hash table has very high space usage, as we simply store each word together with its attached data. What if we design the dictionary using a trie instead?
+As a dictionary has many words with common prefixes, a trie will save a substantial amount of memory. A trie supports lookup in $O(\text{word length})$, which can be slower in practice than a very efficient hash table.
+Other advantages of the trie are as below:
+So, if you want some of the above features, then using a trie is good. Also, we don't have to deal with collisions.
+Note that a dictionary stores, along with each word, the explanation or meaning of that word. That can be handled by maintaining a separate array that stores all of this extra data, and then storing one integer in the trie_node structure: the index of the corresponding data in the array.
struct trie_node
+{
+    // Array of pointers of type
+    // trie_node
+    vector<trie_node*> link;
+    bool isEndofString;
+    // Index of this word's data in the separate array
+    // (meaningful only when isEndofString is true)
+    int idOfData;
+};
+
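As a rough, self-contained illustration of this layout, the sketch below keeps the meanings in a separate vector and stores only an index in each node. The Dictionary wrapper and its add() and lookup() functions are hypothetical names, and using $-1$ for "no data" is an assumption of this example.

```cpp
#include <array>
#include <cassert>
#include <memory>
#include <string>
#include <vector>

struct Node {
    std::array<std::unique_ptr<Node>, 26> link;
    bool isEndofString = false;
    int idOfData = -1;  // index into Dictionary::meanings, -1 if no word ends here
};

struct Dictionary {
    Node root;
    std::vector<std::string> meanings;  // the separate array of extra data

    // Stores the word with its meaning; re-adding a word replaces its meaning.
    void add(const std::string& word, const std::string& meaning)
    {
        Node* cur = &root;
        for (char c : word) {
            assert(c >= 'a' && c <= 'z');  // lowercase input assumed
            std::unique_ptr<Node>& next = cur->link[c - 'a'];
            if (!next)
                next = std::make_unique<Node>();
            cur = next.get();
        }
        if (cur->isEndofString) {
            meanings[cur->idOfData] = meaning;
            return;
        }
        cur->isEndofString = true;
        cur->idOfData = static_cast<int>(meanings.size());
        meanings.push_back(meaning);
    }

    // Returns a pointer to the stored meaning, or nullptr if the word is absent.
    const std::string* lookup(const std::string& word) const
    {
        const Node* cur = &root;
        for (char c : word) {
            assert(c >= 'a' && c <= 'z');
            cur = cur->link[c - 'a'].get();
            if (cur == nullptr)
                return nullptr;
        }
        return cur->isEndofString ? &meanings[cur->idOfData] : nullptr;
    }
};
```

Keeping the data outside the nodes means a node grows by only one integer, no matter how large the attached explanation is.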
+The below image shows a typical trie structure for the dictionary.
+