Update Trie.md

Aakash Panchal 2020-05-23 16:52:01 +05:30 committed by GitHub
parent 99b27b0733
commit e5fe2c4725


@ -1,4 +1,5 @@
Do you know how the auto-completion feature provided by software such as IDEs, search engines, command-line interpreters, and text editors works?
![enter image description here](https://github.com/KingsGambitLab/Lecture_Notes/blob/master/articles/Akash%20Articles/md/Images/Trie/1.png)
@ -9,41 +10,41 @@ The basic data structure behind all these scenes is **Trie**.
Spell checkers can also be designed using **Trie**.
# Trie
String processing is widely used across real-world applications, for example data analytics, search engines, bioinformatics, and plagiarism detection.
A trie is a very useful and special kind of data structure for string processing.
## Introduction
A trie is a tree of nodes, where each node is specified as follows.
Each node has:
1. An array of pointers to `trie_node`, whose size equals the size of the alphabet (see the note below).
2. A boolean variable.
**We will see the uses of these two variables soon.**
```cpp
#include <vector>
using namespace std;

const int alphabet_size = 26; // lowercase letters 'a'..'z'

struct trie_node
{
    // Array of pointers of type trie_node,
    // one slot per letter of the alphabet
    vector<trie_node*> link;
    bool isEndofString;

    trie_node(bool end = false)
    {
        link.assign(alphabet_size, nullptr);
        isEndofString = end;
    }
};
```
**Note:** For ease of understanding, we assume that all strings contain only lowercase letters, i.e. `alphabet_size` is $26$. **We can convert a character to an array index using `c-'a'`, where `c` is a lowercase character.**
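
As a small illustration of the index conversion described in the note, the helper below maps a lowercase character to a slot of the node's array. The name `char_to_index` is hypothetical, and the range check is an added assumption that guards against characters outside `'a'`..`'z'`; the article's own code simply uses `c-'a'` directly.

```cpp
// Hypothetical helper (not part of the article's code):
// maps 'a'..'z' to 0..25 and returns -1 for any other character.
int char_to_index(char c)
{
    if(c < 'a' || c > 'z')
        return -1;      // assumption: reject characters outside the alphabet
    return c - 'a';     // 'a' -> 0, 'b' -> 1, ..., 'z' -> 25
}
```

With this mapping, the link for the character `c` lives at position `char_to_index(c)` of the array.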
![enter image description here](https://github.com/KingsGambitLab/Lecture_Notes/blob/master/articles/Akash%20Articles/md/Images/Trie/3.png)
Now we have seen what a trie node looks like. Let's see how we are going to store strings in a trie using this kind of node.
## How to insert a string in a trie?
@ -51,18 +52,18 @@ Look at the image below, which represents a string "act" stored in a trie. Obser
![enter image description here](https://github.com/KingsGambitLab/Lecture_Notes/blob/master/articles/Akash%20Articles/md/Images/Trie/4.png)
**Note: Empty places in the array have null values (`nullptr` in C++).**
Observations:
1. **Other than the root node, each node in the trie represents a single character.**
2. **We set `isEndofString` to true in the node at which the string ends.**
Therefore, for the sake of simplicity, we are going to represent the nodes of a trie as below.
![enter image description here](https://github.com/KingsGambitLab/Lecture_Notes/blob/master/articles/Akash%20Articles/md/Images/Trie/5.png)
The trie containing the string "act" is therefore represented as below.
![enter image description here](https://github.com/KingsGambitLab/Lecture_Notes/blob/master/articles/Akash%20Articles/md/Images/Trie/6(1).jpg)
@ -73,13 +74,13 @@ Now, observe the trie below, which contains two strings "act" and "ace".
![enter image description here](https://github.com/KingsGambitLab/Lecture_Notes/blob/master/articles/Akash%20Articles/md/Images/Trie/7(1).jpg)
Note that the node representing character `c` in the above trie, when magnified, would look as below:
![enter image description here](https://github.com/KingsGambitLab/Lecture_Notes/blob/master/articles/Akash%20Articles/md/Images/Trie/8.png)
What did you observe?
The common prefix of `"ace"` and `"act"` is `"ac"`, so both strings share the same nodes while we traverse `"ac"`, and only then do we create a new node for the character `e`.
Therefore, we do not create a new node until we need one, and **a trie is a very efficient data store when we have a large list of strings sharing common prefixes.** It is also known as a **prefix tree**.
@ -87,24 +88,24 @@ Now, observe the trie below, which contains three strings `"act"`, `"ace"` and `
![enter image description here](https://github.com/KingsGambitLab/Lecture_Notes/blob/master/articles/Akash%20Articles/md/Images/Trie/9(1).jpg)
Let's see the algorithm to insert a string in a trie.
1. Starting from the root, if there is already a node representing the current character of the string, simply move to it.
2. Otherwise, create a new node representing that character and move to it.
3. At the end of the string, set `isEndofString` to true in the last node reached.
```cpp
void insert(trie_node* root, string s)
{
    trie_node* temp = root;
    int n = s.size();
    for(int i = 0; i < n; i++){
        // Create the node only if it does not exist yet
        if(temp->link[s[i]-'a'] == nullptr)
            temp->link[s[i]-'a'] = new trie_node();
        // Traverse using link
        temp = temp->link[s[i]-'a'];
    }
    temp->isEndofString = true;
}
```
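
To make the prefix sharing concrete, here is a minimal self-contained sketch that inserts three words and counts the nodes that were actually created. The words `"act"`, `"ace"` and `"cat"` are chosen only for illustration, and `countNodes()` and `demoNodeCount()` are hypothetical helpers that are not part of the article's code.

```cpp
#include <string>
#include <vector>
using namespace std;

const int alphabet_size = 26;

struct trie_node
{
    vector<trie_node*> link;
    bool isEndofString;
    trie_node() : link(alphabet_size, nullptr), isEndofString(false) {}
};

void insert(trie_node* root, const string& s)
{
    trie_node* temp = root;
    for(char c : s) {
        if(temp->link[c-'a'] == nullptr)
            temp->link[c-'a'] = new trie_node();
        temp = temp->link[c-'a'];
    }
    temp->isEndofString = true;
}

// Hypothetical helper: counts every node reachable from `root`,
// including `root` itself.
int countNodes(trie_node* root)
{
    if(root == nullptr)
        return 0;
    int total = 1;
    for(trie_node* child : root->link)
        total += countNodes(child);
    return total;
}

// Builds a trie for "act", "ace" and "cat" and returns its node count.
// Memory is intentionally not freed in this short sketch.
int demoNodeCount()
{
    trie_node* root = new trie_node();
    insert(root, "act");   // creates a, c, t
    insert(root, "ace");   // reuses a, c and creates only e
    insert(root, "cat");   // no shared prefix: creates c, a, t
    return countNodes(root);
}
```

The three words have nine characters in total, but the trie needs only seven character nodes plus the root, because `"act"` and `"ace"` share the nodes for `"ac"`.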
@ -124,28 +125,54 @@ Observe the trie given below and try to search whether `"on"` is present or not.
![enter image description here](https://github.com/KingsGambitLab/Lecture_Notes/blob/master/articles/Akash%20Articles/md/Images/Trie/10(1).jpg)
If you don't have the `isEndofString` variable, you will not be able to correctly check whether `on` is present, because it is a prefix of `once`.
**Algorithm**:
1. Starting from the root, try to follow the link for the current character of the string. If the link is present, move ahead.
2. Otherwise, the given string is not present in the trie.
3. If you are able to traverse all characters of the string, check whether the query string is present via the `isEndofString` variable of the last node.
```cpp
bool search(trie_node* root, string s)
{
    trie_node* temp = root;
    int n = s.size();
    for(int i = 0; i < n; i++){
        // There is no further link
        if(temp->link[s[i]-'a'] == nullptr)
            return false;
        temp = temp->link[s[i]-'a'];
    }
    return temp->isEndofString;
}
```
Can you find a recursive version of the above function?
**Recursive version:**
```cpp
// @param: root -> root of the trie
// @param: s -> the string we are searching for
// @param: i -> index of s currently reached via recursive traversal
bool Rec_search(trie_node* root, string& s, int i = 0)
{
    // No link present
    // so string is not present
    if(root == nullptr)
        return false;
    if(i == (int)s.size()) {
        // Present only if a string ends at this node
        return root->isEndofString;
    }
    // Recursively traverse using links
    return Rec_search(root->link[s[i]-'a'], s, i+1);
}
```
**Time Complexity:** $O(N)$, where $N$ is the length of the string we are searching for.
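
The role of `isEndofString` can be seen by contrasting a whole-word search with a prefix query. The sketch below is self-contained and assumes the same node layout as above; `walk()` and `startsWith()` are hypothetical helper names that do not appear in the article.

```cpp
#include <string>
#include <vector>
using namespace std;

struct trie_node
{
    vector<trie_node*> link;
    bool isEndofString;
    trie_node() : link(26, nullptr), isEndofString(false) {}
};

void insert(trie_node* root, const string& s)
{
    trie_node* temp = root;
    for(char c : s) {
        if(temp->link[c-'a'] == nullptr)
            temp->link[c-'a'] = new trie_node();
        temp = temp->link[c-'a'];
    }
    temp->isEndofString = true;
}

// Hypothetical helper: follows `s` from the root and returns the last
// node reached, or nullptr if some link on the way is missing.
trie_node* walk(trie_node* root, const string& s)
{
    trie_node* temp = root;
    for(char c : s) {
        if(temp->link[c-'a'] == nullptr)
            return nullptr;
        temp = temp->link[c-'a'];
    }
    return temp;
}

// Whole-word search: the path must exist AND end at a marked node.
bool search(trie_node* root, const string& s)
{
    trie_node* last = walk(root, s);
    return last != nullptr && last->isEndofString;
}

// Hypothetical prefix query: the path only has to exist.
bool startsWith(trie_node* root, const string& s)
{
    return walk(root, s) != nullptr;
}
```

If only `"once"` has been inserted, `startsWith(root, "on")` succeeds while `search(root, "on")` fails; after `"on"` itself is inserted, both succeed.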
## Delete
@ -157,126 +184,126 @@ Things to take care about while you are deleting a string from the trie,
1. It should not affect any other string present in the trie.
2. Therefore, we are only going to delete **the nodes that exist solely because of the given string**, i.e. nodes through which no other string passes.
We are going to use a recursive procedure that returns `true` if the string was present (and has now been deleted) and `false` otherwise. **The recursive procedure for delete is a modified version of the recursive search procedure**, so make sure you understand that first.
Can you figure it out on your own?
**Procedure:**
1. Traverse the trie recursively, the same way as in the `Rec_search()` procedure.
2. While traversing, if no link is present (`root == nullptr`) for the current character, the string is not present in the trie, so return `false`.
3. If we traverse the whole string (`i == s.size()`), check `isEndofString` of the last node. If the string is present (`isEndofString == true`), set it to `false` and return `true`. Otherwise, return `false`, since the string is not present.
4. In the backtracking stage of the recursion, delete each node that is no longer needed after deletion of the given string.
Now, go through the code below; the comments explain each step.
```cpp
// Checks whether the node has no outgoing link
bool isEmptyNode(trie_node* node)
{
    for(auto i:node->link)
        if(i != nullptr)
            return false;
    return true;
}
// Returns true if the string is successfully deleted
// And if the string is not present in the trie then returns false.
// @param: root -> root of the trie
// @param: s -> string we are deleting
// @param: i -> index of @s currently reached via recursive traversal
bool deleteString(trie_node* root, string& s, int i = 0)
{
    // No link present, so the string is not present
    if(root == nullptr)
        return false;
    // Successfully traversed the whole string
    if(i == (int)s.size()) {
        // Check whether the string is really present
        // by checking `isEndofString` of the last node
        if(root->isEndofString) {
            // delete it
            root->isEndofString = false;
            return true;
        }
        else
            return false;
    }
    bool ans = deleteString(root->link[s[i]-'a'], s, i+1);
    // String is present
    if(ans) {
        trie_node* child = root->link[s[i]-'a'];
        // Delete the child only if no other string passes
        // through it (no links) and no other string ends at it
        if(isEmptyNode(child) && !child->isEndofString) {
            // Deallocate used memory
            delete child;
            root->link[s[i]-'a'] = nullptr;
        }
        return true;
    }
    // Not present, so return false
    return false;
}
```
**Time Complexity:** $O(N)$, where $N$ is the length of the string we are deleting.
## Trie as an array
The availability of dynamic arrays allows us to create a trie without using pointers.
Now, we are going to store the trie as a dynamic array of `TrieNode`s. In this implementation, each `TrieNode` holds an array of integers instead of pointers, and each link stores the index of a node in the array rather than the address of a node, as in the former case.
![enter image description here](https://github.com/KingsGambitLab/Lecture_Notes/blob/master/articles/Akash%20Articles/md/Images/Trie/11.jpg)
See the implementation of a trie as an array below, which is quite similar to the previous implementation and just as intuitive.
```cpp
struct TrieNode
{
    vector<int> id_link;
    bool isEndofString;
    TrieNode(bool end = false)
    {
        isEndofString = end;
        id_link.assign(26,-1);
    }
};
// Note: `trie` must already contain the root node at index 0,
// e.g. vector<TrieNode> trie(1);
void insert(vector<TrieNode>& trie, string s)
{
    int temp = 0;
    int n = s.size();
    for(int i = 0; i < n; i++) {
        if(trie[temp].id_link[s[i]-'a'] == -1) {
            // The new node will be stored at the current end of the array
            trie[temp].id_link[s[i]-'a'] = (int)trie.size();
            trie.push_back(TrieNode());
        }
        temp = trie[temp].id_link[s[i]-'a'];
    }
    trie[temp].isEndofString = true;
}
bool search(vector<TrieNode>& trie, string s)
{
    int temp = 0;
    int n = s.size();
    for(int i = 0; i < n; i++) {
        if(trie[temp].id_link[s[i]-'a'] == -1)
            return false;
        temp = trie[temp].id_link[s[i]-'a'];
    }
    return trie[temp].isEndofString;
}
```
But it has a downside: you cannot, in general, delete strings present in the trie. Why?
Try deleting a single node (other than the last one): the indexes of all subsequent nodes change, and erasing from the middle of an array is also slow.
It is an easy implementation, but with this one downside. Therefore, use it as per the requirement.
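
One possible workaround, sketched here under the assumption that leaving unused nodes in the array is acceptable, is a lazy delete: the word is only unmarked and no node is ever removed, so no index changes. The function `lazyDelete()` is a hypothetical addition and is not part of the article's implementation.

```cpp
#include <string>
#include <vector>
using namespace std;

struct TrieNode
{
    vector<int> id_link;
    bool isEndofString;
    TrieNode() : id_link(26, -1), isEndofString(false) {}
};

// `trie` must already contain the root node at index 0.
void insert(vector<TrieNode>& trie, const string& s)
{
    int temp = 0;
    for(char c : s) {
        if(trie[temp].id_link[c-'a'] == -1) {
            int next = (int)trie.size();
            trie.push_back(TrieNode());
            trie[temp].id_link[c-'a'] = next;
        }
        temp = trie[temp].id_link[c-'a'];
    }
    trie[temp].isEndofString = true;
}

bool search(const vector<TrieNode>& trie, const string& s)
{
    int temp = 0;
    for(char c : s) {
        temp = trie[temp].id_link[c-'a'];
        if(temp == -1)
            return false;
    }
    return trie[temp].isEndofString;
}

// Hypothetical lazy delete: unmarks the word but keeps all of its nodes,
// so every stored index stays valid. Returns false if the word is absent.
bool lazyDelete(vector<TrieNode>& trie, const string& s)
{
    int temp = 0;
    for(char c : s) {
        temp = trie[temp].id_link[c-'a'];
        if(temp == -1)
            return false;
    }
    if(!trie[temp].isEndofString)
        return false;
    trie[temp].isEndofString = false;
    return true;
}
```

The trade-off is that the memory of deleted words is never reclaimed, so this only suits workloads where deletions are rare or the trie is rebuilt from time to time.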
## Count the total number of words present in a Trie
How will you find the number of words(strings) present in the trie below?
@ -284,7 +311,7 @@ How will you find the number of words(strings) present in the trie below?
Ultimately, this means finding the total number of nodes whose `isEndofString` is `true`, which can easily be done with a recursive traversal of all the nodes present in the trie.
The basic idea of the recursive procedure is as follows:
Start from the $\text{root}$ node and go through all $26$ positions of the `link` array. For each non-null link, recursively call `countWords()`, treating the linked node as a $\text{root}$. The formula is therefore as below:
@ -295,135 +322,134 @@ Finally, add $1$ to $\text{TotalWords}$ if the current node has `isEndofString =
```cpp
int countWords(trie_node* root)
{
    int total = 0;
    if(root == nullptr)
        return 0;
    // Count the words in every subtree
    for(auto i:root->link)
        if(i != nullptr)
            total += countWords(i);
    // Add 1 if a word ends at the current node
    total += root->isEndofString;
    return total;
}
```
**Time complexity:** $O(\text{Number of nodes present in the trie})$, as we are visiting each and every node. <br>
**Space complexity:** $O(\text{Length of the longest word})$ for the recursion stack; no other extra memory is used.
## Print all words stored in Trie
It is similar to finding the total number of words, but instead of adding $1$ for each node whose `isEndofString` is true, we are going to store the word that ends at that node.
The code is similar to the one for finding the total number of words.
```cpp
void printAllWords(trie_node* root, vector<string>& ans, string s="")
{
    if(root == nullptr)
        return;
    for(int i = 0; i < alphabet_size; i++) {
        if(root->link[i] != nullptr) {
            // Extend the current word by the character of this link
            char c = 'a' + i;
            string temp = s;
            temp += c;
            printAllWords(root->link[i], ans, temp);
        }
    }
    // A word ends at this node, so store it
    if(root->isEndofString)
        ans.push_back(s);
}
```
**Time complexity:** $O(\text{Number of nodes present in the trie})$, as we are visiting each and every node. <br>
**Space complexity:** $O(\text{Total length of all words present in the trie})$
## Auto-suggestion features
How will you design the autocompletion feature using a trie?
For example, suppose we have stored C++ keywords in a trie. Now, when you type `"n"`, it should show all keywords starting with `"n"`. For simplicity, only keywords starting with `"n"` are shown in the trie below,
![enter image description here](https://github.com/KingsGambitLab/Lecture_Notes/blob/master/articles/Akash%20Articles/md/Images/Trie/12(1).jpg)
How will you print all keywords starting with `"n"`? In other words, how will you print all keywords having `"n"` as a prefix?
Simply use `printAllWords()` on node `n`, and the problem is solved!
The general procedure is as below:
1. Traverse the nodes of the trie according to the given incomplete string `s`. If we are able to traverse the whole of `s`, then there may be keywords having `s` as a prefix. Otherwise, there is nothing to suggest.
2. Now, use `printAllWords()`, treating the last node (reached after traversing the trie according to `s`) as the root.
```cpp
void autocomplete(trie_node* root, string s)
{
    int n = s.size();
    trie_node* temp = root;
    for(int i = 0; i < n; i++) {
        // No word has `s` as a prefix
        if(temp->link[s[i]-'a'] == nullptr)
            return;
        temp = temp->link[s[i]-'a'];
    }
    vector<string> suggest;
    // Collect all words below `temp`, each prefixed with `s`
    printAllWords(temp, suggest, s);
    for(auto i:suggest)
        cout << i << endl;
    /*
    OR
    printAllWords(temp, suggest);
    for(auto i:suggest)
        cout << s << i << endl;
    */
}
```
**Time complexity:** $O(\text{Length of } s + \text{Total length of all suggestions, excluding the common prefix } s \text{ from each})$, where `s` is the string you want suggestions for. <br>
**Space complexity:** $O(\text{Total length of all possible suggestions})$
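
As a usage sketch of the procedure above, the following self-contained variant returns the suggestions instead of printing them. The names `collect()` and `suggestions()` are hypothetical, the keyword list is a small sample chosen for illustration, and, unlike `printAllWords()`, this version records a word before descending into its children, so the results come out in alphabetical order.

```cpp
#include <string>
#include <vector>
using namespace std;

struct trie_node
{
    vector<trie_node*> link;
    bool isEndofString;
    trie_node() : link(26, nullptr), isEndofString(false) {}
};

void insert(trie_node* root, const string& s)
{
    trie_node* temp = root;
    for(char c : s) {
        if(temp->link[c-'a'] == nullptr)
            temp->link[c-'a'] = new trie_node();
        temp = temp->link[c-'a'];
    }
    temp->isEndofString = true;
}

// Hypothetical helper: appends every word below `node` to `out`.
// `prefix` holds the characters on the path from the root to `node`.
void collect(trie_node* node, string& prefix, vector<string>& out)
{
    if(node->isEndofString)
        out.push_back(prefix);
    for(int i = 0; i < 26; i++) {
        if(node->link[i] != nullptr) {
            prefix.push_back(char('a' + i));
            collect(node->link[i], prefix, out);
            prefix.pop_back();   // backtrack before trying the next letter
        }
    }
}

// Returns all stored words that start with `s` (empty if there are none).
vector<string> suggestions(trie_node* root, string s)
{
    vector<string> out;
    trie_node* temp = root;
    for(char c : s) {
        if(temp->link[c-'a'] == nullptr)
            return out;          // nothing to suggest
        temp = temp->link[c-'a'];
    }
    collect(temp, s, out);
    return out;
}
```

With the keywords `namespace`, `new`, `noexcept`, `not` and `nullptr` stored, typing `"no"` yields `noexcept` and `not`. Reusing one `prefix` buffer with push and pop avoids copying the string at every node.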
It is a widely used feature, as discussed at the start of the article.
There is also something called a **"Ternary Search Tree"**. When each node in the trie has most of its links used (i.e. many words share similar prefixes), a trie is substantially more space-efficient and time-efficient than a ternary search tree.
But if each node stores only a few links, then a ternary search tree is much more space-efficient, because each trie node holds $26$ pointers and many of them may be unused.
Therefore, use as per the requirements.
## Dictionary using Trie
What are the common features of an English dictionary?
1. Efficient lookup of words
2. Low memory usage, as the dictionary is very large
A hash table can be used to implement a dictionary. After precomputing the hash of each word in $O(M)$, where $M$ is the total length of all words in the dictionary, we can have efficient lookups if we design a very efficient hash table.
As the dictionary is very large, there will be collisions between two or more words. Still, you can design a hash table to have efficient look-ups.
But the space usage is very high, as we simply store each word in full. What if we design it using a trie?
As a dictionary has many words with common prefixes, a trie will save a substantial amount of memory. A trie supports look-up in $O(\text{word length})$, which may be slower than a very efficient hash table.
Other advantages of a trie are as below:
1. Auto-complete feature
2. It also supports ordered traversal of words with a given prefix
3. No need for complex hash functions
So, if you want some of the above features, then a trie is a good choice. Also, we don't have to deal with collisions.
Note that a dictionary stores, along with each word, the explanation or meaning of that word. This can be handled by separately maintaining an array that stores all of this extra data, and storing one integer in the `trie_node` structure that holds the index of the corresponding data in the array.
```cpp
struct trie_node
{
    // Array of pointers of type
    // trie_node
    vector<trie_node*> link;
    bool isEndofString;
    // To store id of data
    int idOfData;
};
```
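
A minimal sketch of how `idOfData` could tie a word to its meaning is given below. The `Dictionary` wrapper, its `add()` and `lookup()` members, and the choice of returning an empty string for a missing word are all assumptions made for illustration, not part of the article.

```cpp
#include <string>
#include <vector>
using namespace std;

struct trie_node
{
    vector<trie_node*> link;
    bool isEndofString;
    int idOfData;   // index into the meanings array, -1 if no word ends here
    trie_node() : link(26, nullptr), isEndofString(false), idOfData(-1) {}
};

// Hypothetical wrapper: a trie of words plus a separate array of meanings.
struct Dictionary
{
    trie_node* root = new trie_node();
    vector<string> meanings;

    // Stores `word` with its `meaning`; re-adding a word overwrites its meaning.
    void add(const string& word, const string& meaning)
    {
        trie_node* temp = root;
        for(char c : word) {
            if(temp->link[c-'a'] == nullptr)
                temp->link[c-'a'] = new trie_node();
            temp = temp->link[c-'a'];
        }
        if(temp->isEndofString) {
            meanings[temp->idOfData] = meaning;
            return;
        }
        temp->isEndofString = true;
        temp->idOfData = (int)meanings.size();
        meanings.push_back(meaning);
    }

    // Returns the stored meaning, or an empty string if the word is absent.
    string lookup(const string& word) const
    {
        trie_node* temp = root;
        for(char c : word) {
            temp = temp->link[c-'a'];
            if(temp == nullptr)
                return "";
        }
        if(!temp->isEndofString)
            return "";
        return meanings[temp->idOfData];
    }
};
```

Keeping the meanings outside the nodes means that a node only pays for one extra integer, while the potentially long text lives in a single array.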
The image below shows a typical trie structure for the dictionary.
![enter image description here](https://github.com/KingsGambitLab/Lecture_Notes/blob/master/articles/Akash%20Articles/md/Images/Trie/13.jpg)