mirror of
https://github.com/dholerobin/Lecture_Notes.git
synced 2025-03-15 21:59:56 +00:00
Update Trie.md
This commit is contained in:
parent
99b27b0733
commit
e5fe2c4725
@ -1,4 +1,5 @@
|
||||
Do you know how autocompletion provided by different softwares like IDEs, Search Engines, command-line interpreters, text editors, etc works?
|
||||
|
||||
Do you know how the "auto-completion" feature provided by software such as IDEs, search engines, command-line interpreters, and text editors works?
|
||||
|
||||

|
||||
|
||||
@ -9,19 +10,19 @@ The basic data structure behind all these scenes is **Trie**.
|
||||
Spell checkers can also be designed using **Trie**.
|
||||
|
||||
# Trie
|
||||
String processing is widely used across real world applications, for example data analytics, search engines, bioinformatics, plagiarism detection, etc.
|
||||
String processing is widely used in real-world applications such as data analytics, search engines, bioinformatics, and plagiarism detection.
|
||||
|
||||
Trie is very useful and special kind of data structure for string processing.
|
||||
Trie is a very useful and special kind of data structure for string processing.
|
||||
|
||||
## Introduction
|
||||
|
||||
Trie is basically a tree of nodes, where specification of a node can be given as below:
|
||||
A trie is a tree of nodes, where each node is specified as below:
|
||||
|
||||
Each node has,
|
||||
1. An array of datatype node and of size of alphabet.
|
||||
2. A boolean value(We will see why it is needed).
|
||||
1. An array of links (pointers to `node`) whose size is the size of the alphabet (see the note below).
|
||||
2. A boolean variable.
|
||||
|
||||
We will see usages of these two variables soon.
|
||||
**We will see usages of these two variables soon.**
|
||||
|
||||
```cpp
|
||||
struct trie_node
|
||||
@ -39,11 +40,11 @@ struct trie_node
|
||||
};
|
||||
```
|
||||
|
||||
**Note:** For easy understanding purpose, we are assuming that all strings contain lowercase alphabet letters that is alphabet size is $26$. **We can convert characters to a number by using `c-'a'`, `c` is a lowercase character.**
|
||||
**Note:** For ease of understanding, we assume that all strings contain only lowercase letters, i.e. `alphabet_size` is $26$. **We can convert a character to an index using `c - 'a'`, where `c` is a lowercase character.**
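As a minimal sketch of the node described above, assuming the lowercase-only alphabet from the note: the field names `link` and `isEndofString` follow the code used later in these notes, while `char_to_index` is a hypothetical helper added only for illustration.

```cpp
// Assumption: only lowercase letters 'a'..'z' are stored.
const int alphabet_size = 26;

struct trie_node
{
    // link[k] points to the child for the k-th letter, nullptr if absent
    trie_node* link[alphabet_size] = {};
    // true if a stored string ends at this node
    bool isEndofString = false;
};

// Hypothetical helper (not from the notes): 'a' -> 0, ..., 'z' -> 25
int char_to_index(char c)
{
    return c - 'a';
}
```

A freshly created node therefore has all $26$ links empty and does not mark the end of any string.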
|
||||
|
||||

|
||||
|
||||
Now, we have seen how trie node looks like. Let's see how we are going to store strings in a trie using these kind of nodes.
|
||||
Now we have seen what a trie node looks like. Let's see how we are going to store strings in a trie using this kind of node.
|
||||
|
||||
## How to insert a string in a trie?
|
||||
|
||||
@ -51,18 +52,18 @@ Look at the image below, which represents a string "act" stored in a trie. Obser
|
||||
|
||||

|
||||
|
||||
**Note: Empty places in the array have null values (`nullptr` in C++).**
|
||||
|
||||
What did you observe?
|
||||
|
||||
Observations:
|
||||
1. **Other root node, each node in trie represents a single character.**
|
||||
1. **Other than the root node, each node in the trie represents a single character.**
|
||||
2. **We set isEndofString to true in the node at which the string ends.**
|
||||
|
||||
Therefore, for the sake of simplicity, we are going to represent the nodes of the trie as below.
|
||||
|
||||

|
||||
|
||||
**Note: Empty places in array have null values.**
|
||||
|
||||
The trie containing the string "act" is therefore represented as below.
|
||||
|
||||
.jpg)
|
||||
@ -73,13 +74,13 @@ Now, observe the trie below, which contains two strings "act" and "ace".
|
||||
|
||||
.jpg)
|
||||
|
||||
Note that the node representing character `c` in the above trie, in magnified sense would look as below:
|
||||
Note that the node representing the character `c` in the above trie would, when magnified, look as below:
|
||||
|
||||

|
||||
|
||||
What did you observe?
|
||||
|
||||
Common prefix of `"ace"` and `"act"` is `"ac"` and therefore we are having same nodes until we traverse `"ac"` and then we create a new node for character `e`.
|
||||
The common prefix of `"ace"` and `"act"` is `"ac"`, so both strings share the same nodes while we traverse `"ac"`, and only then do we create a new node for the character `e`.
|
||||
|
||||
Therefore, we do not create a new node until we need one, and **a trie is a very efficient way to store a large list of strings sharing common prefixes.** It is also known as a **prefix tree**.
|
||||
|
||||
@ -87,11 +88,11 @@ Now, observe the trie below, which contains three strings `"act"`, `"ace"` and `
|
||||
|
||||
.jpg)
|
||||
|
||||
Let's see proper algorithm to insert a string in a trie.
|
||||
Let's see a proper algorithm to insert a string in a trie.
|
||||
|
||||
1. Starting from the root, if there is already a node representing corresponding character of a string, then simply traverse.
|
||||
2. Otherwise, create a new node representing corresponding character.
|
||||
3. At the end of string, set `isEndofString` to true in the last ending node.
|
||||
1. Starting from the root, if there is already a node representing the corresponding character of a string, then simply traverse.
|
||||
2. Otherwise, create a new node representing the corresponding character.
|
||||
3. At the end of the string, set `isEndofString` to true in the last ending node.
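The three steps above can be sketched as a complete function. This is a minimal sketch, assuming lowercase-only input and the node layout from the introduction. Unlike the signature in these notes, it takes the string as `const std::string&`, and it never frees the nodes it allocates.

```cpp
#include <string>

const int alphabet_size = 26;   // assumption: lowercase letters only

struct trie_node
{
    trie_node* link[alphabet_size] = {};
    bool isEndofString = false;
};

void insert(trie_node* root, const std::string& s)
{
    trie_node* temp = root;
    for (char c : s) {
        int index = c - 'a';
        // Step 2: no node for this character yet, so create one
        if (temp->link[index] == nullptr)
            temp->link[index] = new trie_node();
        // Step 1: follow the (existing or new) link
        temp = temp->link[index];
    }
    // Step 3: mark the node where the string ends
    temp->isEndofString = true;
}
```

Inserting `"act"` and then `"ace"` creates four nodes below the root: `a`, `c`, `t`, and `e`. The nodes for the shared prefix `"ac"` are created only once.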
|
||||
|
||||
```cpp
|
||||
void insert(trie_node* root, string s)
|
||||
@ -124,13 +125,13 @@ Observe the trie given below and try to search whether `"on"` is present or not.
|
||||
|
||||
.jpg)
|
||||
|
||||
If you don't have `isEndofString` variable, then you will not be able to correctly check whether `on` is present or not. Because it is prefix of `once`.
|
||||
Without the `isEndofString` variable, you cannot correctly check whether `on` is present, because `on` is also a prefix of `once`.
|
||||
|
||||
**Algorithm**:
|
||||
|
||||
1. Starting from the root, try to traverse corresponding character of the string. If a link is present, then go ahead.
|
||||
1. Starting from the root, follow the link for each character of the string in turn. If the link is present, then go ahead.
|
||||
2. Otherwise, the given string is not present in the trie.
|
||||
3. If you are successfully able to traverse according to the string, then check whether the query string is really present or not via `isEndofString` variable of a last node.
|
||||
3. If you are able to traverse all the characters of the string, then check whether the query string is present via the `isEndofString` variable of the last node.
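The following minimal sketch pairs the search steps with an `insert()` so that the `on` / `once` case can be tried out. It assumes lowercase-only input and takes strings as `const std::string&`, which differs slightly from the signatures used in these notes.

```cpp
#include <string>

const int alphabet_size = 26;   // assumption: lowercase letters only

struct trie_node
{
    trie_node* link[alphabet_size] = {};
    bool isEndofString = false;
};

void insert(trie_node* root, const std::string& s)
{
    trie_node* temp = root;
    for (char c : s) {
        int index = c - 'a';
        if (temp->link[index] == nullptr)
            temp->link[index] = new trie_node();
        temp = temp->link[index];
    }
    temp->isEndofString = true;
}

bool search(const trie_node* root, const std::string& s)
{
    const trie_node* temp = root;
    for (char c : s) {
        temp = temp->link[c - 'a'];
        // Step 2: the link is missing, so the string is not present
        if (temp == nullptr)
            return false;
    }
    // Step 3: all characters matched, but a string must also end here
    return temp->isEndofString;
}
```

If only `"once"` has been inserted, `search(root, "on")` follows two existing links and still returns `false`, because the node for `n` does not have `isEndofString` set.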
|
||||
|
||||
```cpp
|
||||
bool search(trie_node* root, string s)
|
||||
@ -146,6 +147,32 @@ bool search(trie_node* root, string s)
|
||||
return temp->isEndofString;
|
||||
}
|
||||
```
|
||||
Can you write a recursive version of the above function?
|
||||
|
||||
**Recursive version:**
|
||||
```cpp
|
||||
// @param: root -> root of the trie
|
||||
// @param: s -> the string we are searching for
|
||||
// @param: i -> index of s currently reached via recursive traversal
|
||||
bool Rec_search(trie_node* root, string& s, int i = 0)
|
||||
{
|
||||
// No link present
|
||||
// so string is not present
|
||||
if(root == nullptr)
|
||||
return false;
|
||||
if(i == s.size()) {
|
||||
// present
|
||||
if(root->isEndofString)
|
||||
return true;
|
||||
else
|
||||
return false;
|
||||
}
|
||||
// Recursively traverse using links
|
||||
return Rec_search(root->link[s[i]-'a'], s, i+1);
|
||||
}
|
||||
```
|
||||
|
||||
**Time Complexity:** $O(N)$, where $N$ is the length of the string we are searching for.
|
||||
|
||||
## Delete
|
||||
|
||||
@ -157,14 +184,19 @@ Things to take care about while you are deleting a string from the trie,
|
||||
1. It should not affect any other string present in the trie.
|
||||
2. Therefore, we are only going to delete **the nodes which are present only due to the presence of the given string**. And no other string is passing through them.
|
||||
|
||||
We are going to use recursive procedure. If the string is not present, then we will return `false` and `true` otherwise.
|
||||
We are going to use a recursive procedure. If the string is not present, then we will return `false`, and `true` otherwise. **The recursive delete procedure is a modified version of the recursive search procedure**, so make sure you understand that one first.
|
||||
|
||||
1. We are traversing trie via the given string recursively.
|
||||
2. While traversing, if we find that no link is present(`nullptr`) for the current character, then string is not present in the trie and return `false`.
|
||||
3. If we are successfully able to traverse the string(`i==s.size())`, then finally check `isEndofString` of the last node. If the string is really present, then return `true`. Otherwise return `false`.
|
||||
4. Now, while backtracking stage of recursion, delete nodes if it is no longer needed after deletion of the given string.
|
||||
Can you figure it out on your own?
|
||||
|
||||
**Procedure:**
|
||||
|
||||
1. We are traversing the trie recursively, the same way as in `Rec_search()` procedure.
|
||||
2. While traversing, if we find that no link is present (`root == nullptr`) for the current character, then the string is not present in the trie, so return `false`.
|
||||
3. If we are able to traverse the whole string until `i == s.size()`, then finally check `isEndofString` of the last node. If the string is present (`isEndofString == true`), then set it to `false` and return `true`. Otherwise, the string is not present, so return `false`.
|
||||
4. During the backtracking stage of the recursion, delete each node that is no longer needed after the deletion of the given string.
|
||||
|
||||
Now, go through the code below with very intuitive comments.
|
||||
|
||||
Now, Go through the code below, very intuitive comments are written.
|
||||
```cpp
|
||||
// Checks whether any link is present
|
||||
bool isEmptyNode(trie_node* node)
|
||||
@ -175,23 +207,17 @@ bool isEmptyNode(trie_node* node)
|
||||
return true;
|
||||
}
|
||||
|
||||
// Returns true, if the string is successfully deleted
|
||||
// Returns true if the string is successfully deleted
|
||||
// And if the string is not present in the trie then returns false.
|
||||
// @param: root -> root of the trie
|
||||
// @param: s -> string we are deleting
|
||||
// @param: i -> index of @s currently reached via recursive traversal
|
||||
bool deleteString(trie_node* root, string& s, int i = 0)
|
||||
{
|
||||
// Means string is not present
|
||||
if(root == nullptr)
|
||||
return false;
|
||||
|
||||
// Successfully traversed the whole string
|
||||
if(i == s.size()) {
|
||||
|
||||
// Check whether the string is really present
|
||||
// by checking `isEndofString` variable of the last node
|
||||
// present
|
||||
if(root->isEndofString) {
|
||||
// delete it
|
||||
root->isEndofString = false;
|
||||
return true;
|
||||
}
|
||||
@ -203,10 +229,9 @@ bool deleteString(trie_node* root, string& s, int i = 0)
|
||||
|
||||
// String is present
|
||||
if(ans) {
|
||||
|
||||
// Check whether any other string
|
||||
// passes through this node
|
||||
// Not passing, then delete this node
|
||||
// passes through this link node
|
||||
// If not passing, then delete it
|
||||
if(isEmptyNode(root->link[s[i]-'a'])) {
|
||||
|
||||
// Deallocate used memory
|
||||
@ -221,15 +246,17 @@ bool deleteString(trie_node* root, string& s, int i = 0)
|
||||
}
|
||||
```
|
||||
|
||||
**Time Complexity:** $O(N)$, where $N$ is the length of the string we are deleting.
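The delete procedure can be sketched end to end as below. This is a minimal sketch under two assumptions that are not confirmed by the notes: strings are lowercase only, and `isEmptyNode()` also treats a node that ends another string as non-empty, so that deleting `"once"` does not remove the node that ends `"on"`. A compact `insert()` is included only to build a trie to try it on.

```cpp
#include <cstddef>
#include <string>

const int alphabet_size = 26;   // assumption: lowercase letters only

struct trie_node
{
    trie_node* link[alphabet_size] = {};
    bool isEndofString = false;
};

void insert(trie_node* root, const std::string& s)
{
    trie_node* temp = root;
    for (char c : s) {
        int index = c - 'a';
        if (temp->link[index] == nullptr)
            temp->link[index] = new trie_node();
        temp = temp->link[index];
    }
    temp->isEndofString = true;
}

// True if no string ends at this node and no string passes through it
bool isEmptyNode(const trie_node* node)
{
    if (node->isEndofString)
        return false;
    for (int k = 0; k < alphabet_size; k++)
        if (node->link[k] != nullptr)
            return false;
    return true;
}

// Returns true if s was present and has been deleted, false otherwise
bool deleteString(trie_node* root, const std::string& s, std::size_t i = 0)
{
    if (root == nullptr)
        return false;                  // no link: string is not present
    if (i == s.size()) {
        if (!root->isEndofString)
            return false;              // only a prefix of another string
        root->isEndofString = false;   // unmark the end of the string
        return true;
    }
    int index = s[i] - 'a';
    bool ans = deleteString(root->link[index], s, i + 1);
    // Backtracking: free the child if nothing else needs it
    if (ans && isEmptyNode(root->link[index])) {
        delete root->link[index];
        root->link[index] = nullptr;
    }
    return ans;
}
```

With `"on"` and `"once"` stored, deleting `"once"` frees the nodes for `e` and `c` but keeps the node for `n`, since `"on"` still ends there.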
|
||||
|
||||
## Trie as an array
|
||||
|
||||
Availability of dynamic arrays allow use to create Trie without using pointers.
|
||||
The availability of dynamic arrays allows us to create Trie without using pointers.
|
||||
|
||||
Now, we are going to store trie as a dynamic array of `TrieNodes`. In this implementation, we are going to use an array of integers instead of pointers in `TrieNode` and as a link, we are going to store index of a node rather than address of a node in the former case.
|
||||
Now, we are going to store the trie as a dynamic array of `TrieNode`s. In this implementation, each `TrieNode` holds an array of integers instead of pointers, and each link stores the index of a node rather than the address of a node, as it did in the pointer-based version.
|
||||
|
||||

|
||||
|
||||
See the below implementation of trie as an array, which is quite similar and intuitive as previous implementation.
|
||||
See the implementation of a trie as an array below, which is similar to the previous implementation and just as intuitive.
|
||||
|
||||
```cpp
|
||||
struct TrieNode
|
||||
@ -270,13 +297,13 @@ bool search(vector<TrieNode>& trie, string s)
|
||||
return trie[temp].isEndofString;
|
||||
}
|
||||
```
|
||||
But it has a downside that you can not delete strings present in the trie. Why?
|
||||
But it has a downside: in general, you cannot delete strings present in the trie. Why?
|
||||
|
||||
Try deleting a single node, you will realize that indexes of each subsequent node will change and moreover deleting in an array has a very bad performance.
|
||||
Try deleting a single node (other than the last one). You will realize that the indexes of all subsequent nodes change, and deleting from the middle of an array also performs badly.
|
||||
|
||||
It is easy implemention, but with single downside. Therefore, use as per the requirement.
|
||||
It is an easy implementation, but with this single downside. Therefore, use it as per the requirements.
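The index-based layout described in this section can be sketched as below. This is a minimal sketch, assuming that the root always lives at index `0` and that `-1` marks an empty link. Both conventions are choices made for this example and may differ from the implementation in these notes.

```cpp
#include <string>
#include <vector>

const int alphabet_size = 26;   // assumption: lowercase letters only

struct TrieNode
{
    int link[alphabet_size];    // index of the child node, -1 if absent
    bool isEndofString;

    TrieNode() : isEndofString(false)
    {
        for (int k = 0; k < alphabet_size; k++)
            link[k] = -1;
    }
};

// The caller creates the trie with a single root node: vector<TrieNode> trie(1);
void insert(std::vector<TrieNode>& trie, const std::string& s)
{
    int temp = 0;               // start at the root
    for (char c : s) {
        int index = c - 'a';
        if (trie[temp].link[index] == -1) {
            // The new node will be appended at position trie.size()
            trie[temp].link[index] = static_cast<int>(trie.size());
            trie.push_back(TrieNode());
        }
        temp = trie[temp].link[index];
    }
    trie[temp].isEndofString = true;
}

bool search(const std::vector<TrieNode>& trie, const std::string& s)
{
    int temp = 0;
    for (char c : s) {
        temp = trie[temp].link[c - 'a'];
        if (temp == -1)
            return false;
    }
    return trie[temp].isEndofString;
}
```

Because links are indexes rather than addresses, they stay valid when the vector reallocates during `push_back()`. Stored pointers into the vector would not.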
|
||||
|
||||
## Count total number of words present in a Trie
|
||||
## Count the total number of words present in a Trie
|
||||
|
||||
How will you find the number of words(strings) present in the trie below?
|
||||
|
||||
@ -284,7 +311,7 @@ How will you find the number of words(strings) present in the trie below?
|
||||
|
||||
Ultimately, this means finding the total number of nodes whose `isEndofString` is `true`, which can easily be done using a recursive traversal of all the nodes present in the trie.
|
||||
|
||||
The basic idea of recursive procedure is as follow:
|
||||
The basic idea of the recursive procedure is as follows:
|
||||
|
||||
Start from the $\text{root}$ node and go through all $26$ positions of the `link` array. For each not-null link, recursively call `countWords()` considering that linked node as a $\text{root}$. And therefore formula will be as below:
|
||||
|
||||
@ -305,14 +332,14 @@ int countWords(trie_node* root)
|
||||
return total;
|
||||
}
|
||||
```
|
||||
**Time complexity:** $O(\text{Number of nodes present in the trie})$, as we are visiting each and every node.
|
||||
**Time complexity:** $O(\text{Number of nodes present in the trie})$, as we are visiting each and every node. <br>
|
||||
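**Space complexity:** $O(1)$ extra space, apart from the recursion stack, whose depth is at most the length of the longest word.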
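A minimal sketch of the counting traversal, assuming lowercase-only strings and the pointer-based node from earlier. The `insert()` here is included only to build a trie to count.

```cpp
#include <string>

const int alphabet_size = 26;   // assumption: lowercase letters only

struct trie_node
{
    trie_node* link[alphabet_size] = {};
    bool isEndofString = false;
};

void insert(trie_node* root, const std::string& s)
{
    trie_node* temp = root;
    for (char c : s) {
        int index = c - 'a';
        if (temp->link[index] == nullptr)
            temp->link[index] = new trie_node();
        temp = temp->link[index];
    }
    temp->isEndofString = true;
}

int countWords(const trie_node* root)
{
    if (root == nullptr)
        return 0;
    // Count this node if a word ends here
    int total = root->isEndofString ? 1 : 0;
    // Add the words found in each non-null subtree
    for (int k = 0; k < alphabet_size; k++)
        total += countWords(root->link[k]);
    return total;
}
```

Inserting the same word twice does not change the count, because the second insertion only sets an `isEndofString` flag that is already `true`.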
|
||||
|
||||
## Print all words stored in Trie
|
||||
|
||||
It is similar to finding total number of words but instead of adding $1$ for each `isEndofString`'s true value, we are going to store the word representing that particular end.
|
||||
It is similar to finding the total number of words, but instead of adding $1$ for each node whose `isEndofString` is `true`, we are going to store the word that ends at that node.
|
||||
|
||||
The code is similar as finding total number of words.
|
||||
The code is similar to finding the total number of words.
|
||||
|
||||
```cpp
|
||||
void printAllWords(trie_node* root, vector<string>& ans, string s="")
|
||||
@ -331,24 +358,24 @@ void printAllWords(trie_node* root, vector<string>& ans, string s="")
|
||||
ans.push_back(s);
|
||||
}
|
||||
```
|
||||
**Time complexity:** $O(\text{Number of nodes present in the trie})$, as we are visiting each and every node.
|
||||
**Time complexity:** $O(\text{Number of nodes present in the trie})$, as we are visiting each and every node. <br>
|
||||
**Space Complexity:** $O(\text{Total length of all words present in the trie})$
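A minimal sketch of collecting all words, assuming lowercase-only strings. In this sketch a word is recorded before its children are visited, so the result comes out in lexicographic order. The ordering in the implementation from these notes may differ.

```cpp
#include <string>
#include <vector>

const int alphabet_size = 26;   // assumption: lowercase letters only

struct trie_node
{
    trie_node* link[alphabet_size] = {};
    bool isEndofString = false;
};

void insert(trie_node* root, const std::string& s)
{
    trie_node* temp = root;
    for (char c : s) {
        int index = c - 'a';
        if (temp->link[index] == nullptr)
            temp->link[index] = new trie_node();
        temp = temp->link[index];
    }
    temp->isEndofString = true;
}

// s holds the characters on the path from the root to the current node
void printAllWords(const trie_node* root, std::vector<std::string>& ans,
                   std::string s = "")
{
    if (root == nullptr)
        return;
    if (root->isEndofString)
        ans.push_back(s);       // a word ends at this node
    for (int k = 0; k < alphabet_size; k++) {
        if (root->link[k] == nullptr)
            continue;
        s.push_back(static_cast<char>('a' + k));   // go down one letter
        printAllWords(root->link[k], ans, s);
        s.pop_back();                              // undo before next letter
    }
}
```

The index-to-character conversion `'a' + k` is the inverse of the `c - 'a'` conversion used for insertion.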
|
||||
|
||||
## Auto-suggestion features
|
||||
|
||||
How will you design autocompletion feature using Trie?
|
||||
How will you design the autocompletion feature using Trie?
|
||||
|
||||
For example, we have stored C++ keywords in a trie. Now, when you type `"n"` it should show all keywords starting from `"n"`. For simplicity only keywords starting from `"n"` are shown in the trie below,
|
||||
For example, we have stored C++ keywords in a trie. Now, when you type `"n"`, it should show all keywords starting with `"n"`. For simplicity, only keywords starting with `"n"` are shown in the trie below,
|
||||
|
||||
.jpg)
|
||||
|
||||
How will you print all keywords starting from `"n"`? OR how will you print all keywords having `"n"` as prefix?
|
||||
How will you print all keywords starting with `"n"`, or in other words, all keywords having `"n"` as a prefix?
|
||||
|
||||
Simply use `printAllWords()` on node `n`, and problem is solved!
|
||||
Simply use `printAllWords()` on node `n`, and the problem is solved!
|
||||
|
||||
Common procedure is as below:
|
||||
The general procedure is as below:
|
||||
|
||||
1. Traverse nodes in trie according to the given uncomplete string `s`. If we are successfully able to traverse `s`, then there are keywords having prefix of `s`. Otherwise, there will be nothing to suggest.
|
||||
1. Traverse the nodes in the trie according to the given incomplete string `s`. If we are able to traverse all of `s`, then there may be keywords having `s` as a prefix. Otherwise, there will be nothing to suggest.
|
||||
|
||||
2. Now, use `printAllWords()` considering the last node(after traversal of trie according to `s`) as a root.
|
||||
|
||||
@ -376,40 +403,40 @@ void autocomplete(trie_node* root, string s)
|
||||
```
|
||||
|
||||
|
||||
**Time complexity:** $O(\text{Length of S + Total length of all suggestions excluding common prefix(S) from all})$, where `s` is the string you want suggestions for.
|
||||
**Time complexity:** $O(\text{Length of S + Total length of all suggestions excluding common prefix(S) from all})$, where $S$ is the string you want suggestions for. <br>
|
||||
**Space complexity:** $O(\text{Total length of all possible suggestions})$
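The two-step procedure above can be sketched as below. This is a minimal sketch, assuming lowercase-only keywords. Unlike the `autocomplete()` in these notes, this version returns the suggestions as a vector instead of printing them, and `collectWords` is a hypothetical helper name that plays the role of `printAllWords()`.

```cpp
#include <string>
#include <vector>

const int alphabet_size = 26;   // assumption: lowercase letters only

struct trie_node
{
    trie_node* link[alphabet_size] = {};
    bool isEndofString = false;
};

void insert(trie_node* root, const std::string& s)
{
    trie_node* temp = root;
    for (char c : s) {
        int index = c - 'a';
        if (temp->link[index] == nullptr)
            temp->link[index] = new trie_node();
        temp = temp->link[index];
    }
    temp->isEndofString = true;
}

// Hypothetical helper: collects every word below node; s is the path so far
void collectWords(const trie_node* node, std::string& s,
                  std::vector<std::string>& ans)
{
    if (node->isEndofString)
        ans.push_back(s);
    for (int k = 0; k < alphabet_size; k++) {
        if (node->link[k] == nullptr)
            continue;
        s.push_back(static_cast<char>('a' + k));
        collectWords(node->link[k], s, ans);
        s.pop_back();
    }
}

std::vector<std::string> autocomplete(const trie_node* root, std::string s)
{
    std::vector<std::string> ans;
    const trie_node* temp = root;
    // Step 1: walk down the trie along the typed prefix
    for (char c : s) {
        temp = temp->link[c - 'a'];
        if (temp == nullptr)
            return ans;         // nothing to suggest
    }
    // Step 2: every word below this node has s as a prefix
    collectWords(temp, s, ans);
    return ans;
}
```

Starting the collection with `s` already holding the prefix means that each suggestion comes out as a full keyword rather than only its remaining suffix.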
|
||||
|
||||
It is widely used feature, as discussed at the start of the article.
|
||||
It is a widely used feature, as discussed at the start of the article.
|
||||
|
||||
There is also something called **"Ternary Search Tree"**. When each node in the trie has most of its links used(having many similar prefixe words), trie is substantially more space efficient and time efficient than ternary search tree.
|
||||
There is also something called a **"Ternary Search Tree"**. When each node in the trie has most of its links used (that is, when many words share prefixes), a trie is substantially more space-efficient and time-efficient than a ternary search tree.
|
||||
|
||||
But, If each node stores few links, then ternary search tree is much more space efficient, because we are using $26$ pointers in each node of trie and many of them may be unused.
|
||||
But if each node stores only a few links, then a ternary search tree is much more space-efficient, because we are using $26$ pointers in each node of the trie and many of them may be unused.
|
||||
|
||||
Therefore, use as per the requirements.
|
||||
|
||||
## Dictionary using Trie
|
||||
|
||||
What are common features of an english dictionary?
|
||||
What are the common features of an English dictionary?
|
||||
|
||||
1. Efficient Lookup of words
|
||||
2. As dictionary is very large, Less memory usages
|
||||
2. Low memory usage, as the dictionary is very large
|
||||
|
||||
Hashtable can be used to implement dictionary. After precomputation of hash for each word in $O(M)$, where $M$ is total length of all words in the dictionary, we can have efficient lookups if we design a very efficient hashtable.
|
||||
A hash table can be used to implement a dictionary. After precomputing the hash of every word in $O(M)$, where $M$ is the total length of all words in the dictionary, we can have efficient lookups if we design a very efficient hash table.
|
||||
|
||||
But as dictionary is very large there will be collisions between two or more words. But still you can design hash table to have efficient look-ups.
|
||||
But as the dictionary is very large there will be collisions between two or more words. Still, you can design a hash table to have efficient look-ups.
|
||||
|
||||
But space usages is very high, as we simply store each words. But what if we design it using a trie?
|
||||
But the space usage is very high, as we simply store each word in full. But what if we design it using a trie?
|
||||
|
||||
As in a dictionary we have many common-prefix words, trie will save substantial amount of memory consumption. Trie supports look-up in $O(word length)$, which is higher than a very efficient hash table.
|
||||
As a dictionary has many words with common prefixes, a trie will save a substantial amount of memory. A trie supports look-up in $O(\text{word length})$ time, which can be slower than a very efficient hash table.
|
||||
|
||||
Other advantages of trie is as below:
|
||||
Other advantages of the trie are as below:
|
||||
1. Auto-complete feature
|
||||
2. It also supports ordered traversal of words with a given prefix
|
||||
3. No need for complex hash functions
|
||||
|
||||
So, if you want some of the above features then using trie is good for you. Also, we don't have to deal with collisions.
|
||||
|
||||
Note that in dictionary along with a word, we have explanations or meanings of that word. That can be handled by seperately maintaining an array which stores all those extra stuffs. Then store one integer in the `TrieNode` structure to store the index of the corresponding data in the array.
|
||||
Note that in a dictionary, along with each word, we have explanations or meanings of that word. That can be handled by separately maintaining an array that stores all that extra data. Then store one integer in the `trie_node` structure to hold the index of the corresponding data in the array.
|
||||
|
||||
```cpp
|
||||
struct trie_node
|
||||
@ -423,7 +450,6 @@ struct trie_node
|
||||
};
|
||||
```
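The idea of keeping the meanings in a separate array might be sketched as below. This is a minimal sketch under stated assumptions: the field name `meaningIndex`, the `-1` sentinel, and the `lookup()` function are hypothetical names chosen for this example, and words are lowercase only.

```cpp
#include <string>
#include <vector>

const int alphabet_size = 26;   // assumption: lowercase letters only

struct trie_node
{
    trie_node* link[alphabet_size] = {};
    bool isEndofString = false;
    int meaningIndex = -1;      // hypothetical field: index into meanings
};

void insert(trie_node* root, std::vector<std::string>& meanings,
            const std::string& word, const std::string& meaning)
{
    trie_node* temp = root;
    for (char c : word) {
        int index = c - 'a';
        if (temp->link[index] == nullptr)
            temp->link[index] = new trie_node();
        temp = temp->link[index];
    }
    if (temp->isEndofString) {
        // Word already stored: overwrite its meaning in place
        meanings[temp->meaningIndex] = meaning;
    } else {
        temp->isEndofString = true;
        temp->meaningIndex = static_cast<int>(meanings.size());
        meanings.push_back(meaning);
    }
}

// Returns the index of the word's meaning, or -1 if the word is absent
int lookup(const trie_node* root, const std::string& word)
{
    const trie_node* temp = root;
    for (char c : word) {
        temp = temp->link[c - 'a'];
        if (temp == nullptr)
            return -1;
    }
    return temp->isEndofString ? temp->meaningIndex : -1;
}
```

Storing an index rather than the text itself keeps every trie node the same small size, however long the explanation of a word is.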
|
||||
|
||||
Below image shows a typical trie structure for dictionary.
|
||||
The image below shows a typical trie structure for a dictionary.
|
||||
|
||||
|
||||
.jpg)
|
||||

|
||||
|