All the data we store on a computer is either numbers, character strings, or more complex forms built from them.
We store data and retrieve it when needed. Finding the required data in a very large pool becomes expensive in terms of time and computational power.
Usually, data can be viewed as a key-value pair. The key is what we search by, and the value holds the information associated with that key.
The key, the part we search by, can be an integer or a string. For example, an English dictionary has string keys, while students' records accessed by registration number have integer keys.
A basic way to find the required data is to compare keys. Integer keys are therefore preferred: any two integers can be compared in $O(1)$, whereas comparing long strings performs badly.
Let's say the function $\text{hash}(s)$ returns an integer hash of the provided string $s$. Generally, we want to map strings to numbers in a fixed range $[0,m)$, so that **string comparison turns into an integer comparison ($O(1)$)**.
Our goal is only met if the hash function returns $\text{hash}(s) \neq \text{hash}(t)$ for different strings $s \neq t$. Otherwise we may end up treating two different strings as equal just because their hashes are the same.
When the hash function returns the same hash for different strings, it is called a **"collision"**. We will discuss collisions soon, but let's first see how a hash is calculated (a sketch implementation follows the list below). A commonly used polynomial hash is $\text{hash}(s) = (s_0 + s_1*p + s_2*p^2 + \ldots + s_{n-1}*p^{n-1}) \mod m$, where:
1. $s_i$ is the hash of the character at index $i$ of $s$. For example, for a string containing only lowercase characters we can take `s[i]-'a'+1`, i.e. `'a'->1`, `'b'->2`, `'c'->3`, ....
Note that if we take the hash of `'a'` as $0$, then the strings `"a"`, `"aa"`, `"aaa"`, `"aaaa"`, ... all get the same hash $0$, which creates problems.
2. $p$ is a prime number chosen to be greater than the number of distinct characters we expect in a string: $31$ when only lowercase characters occur, and $53$ when both lowercase and uppercase characters occur.
3. $m$ is also a prime number, chosen so that we can handle the intermediate values without integer overflow (using `long long`). Generally, it is taken to be $10^9+7$ or $10^9+9$.
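Putting these pieces together, here is a minimal sketch of such a hash function for lowercase strings (the name `compute_hash` and the exact constants are illustrative choices):

```cpp
#include <string>
using namespace std;

const long long m = 1e9 + 9;   // large prime modulus
const long long p = 31;        // prime > alphabet size (lowercase letters)

// Polynomial hash of s: (s[0] + s[1]*p + s[2]*p^2 + ...) mod m
long long compute_hash(const string &s) {
    long long hash_value = 0;
    long long p_pow = 1;
    for (char c : s) {
        // map 'a'->1, 'b'->2, ... so that "a", "aa", ... do not all hash to 0
        hash_value = (hash_value + (c - 'a' + 1) * p_pow) % m;
        p_pow = (p_pow * p) % m;
    }
    return hash_value;
}
```

Once both hashes are computed, `compute_hash(a) == compute_hash(b)` is an $O(1)$ comparison.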
But as the number of strings grows, say we are evaluating the hashes of $10^6$ different strings and comparing them with each other, the probability of at least one collision becomes almost $1$, so we may end up with wrong results.
**Usually, though, a single hash works fine in a competitive programming environment.**
But how do we decrease the collision probability?
One easy way: compute two different hashes for each string (by using two different values of $p$ and/or $m$) and **compare the pairs** instead. This brings the collision probability down from around $1$ to around $10^{-6}$.
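A rough sketch of this double-hashing idea, with arbitrary base/modulus choices and the illustrative name `double_hash`:

```cpp
#include <string>
#include <utility>
using namespace std;

// Two independent polynomial hashes of s; strings are considered equal
// only if BOTH hashes match, which makes an accidental collision far less likely.
pair<long long, long long> double_hash(const string &s) {
    const long long m1 = 1e9 + 7, p1 = 31;
    const long long m2 = 1e9 + 9, p2 = 53;
    long long h1 = 0, h2 = 0, pw1 = 1, pw2 = 1;
    for (char c : s) {
        long long v = c - 'a' + 1;
        h1 = (h1 + v * pw1) % m1;
        h2 = (h2 + v * pw2) % m2;
        pw1 = (pw1 * p1) % m1;
        pw2 = (pw2 * p2) % m2;
    }
    return {h1, h2};
}
```

Two strings are then treated as equal only when `double_hash(a) == double_hash(b)`.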
Suppose we want to compare two substrings: we can compute their hashes and compare them easily. But what if we are given a large number of substring-comparison queries? Can we use the prefix-sum method here as well?
Yes. With simple prefix sums, we can find any subarray sum in $O(1)$ after precomputing the prefix sums. Note that the hash of a string is also a sum, of terms of the form $s_i*p^i$, so we can precompute prefix hashes the same way: let $h[i]$ be the hash of the prefix consisting of the first $i$ characters.
Given the string "$s_0s_1s_2s_3s_4s_5$", suppose we want to compare the two substrings "$s_0s_1s_2$" and "$s_3s_4s_5$". The plain hashing formula gives $((s_0+s_1*p+s_2*p^2) \mod m)$ for "$s_0s_1s_2$" and $((s_3+s_4*p+s_5*p^2) \mod m)$ for "$s_3s_4s_5$".
For the substring "$s_0s_1s_2$", $h[3]$ already holds the value $(s_0+s_1*p+s_2*p^2) \mod m$, but for "$s_3s_4s_5$" (exactly as with prefix sums) $h[6]-h[3]$ gives $(s_3*p^3+s_4*p^4+s_5*p^5) \mod m$, whereas the value we actually want is $(s_3+s_4*p+s_5*p^2) \mod m$. There are two ways to make the two comparable:
1. Multiply $((s_0+s_1*p+s_2*p^2) \mod m)$ by $p^3$ to get $((s_0*p^3+s_1*p^4+s_2*p^5) \mod m)$
2. Divide $((s_3*p^3+s_4*p^4+s_5*p^5) \mod m)$ by $p^3$ to get $((s_3+s_4*p+s_5*p^2) \mod m)$; in modular arithmetic this means multiplying by the modular inverse of $p^3$.
Either of these methods reduces the comparison to the usual hash comparison we want: $((s_0+s_1*p+s_2*p^2) \mod m)$ == $((s_3+s_4*p+s_5*p^2) \mod m)$.
Say you want to compare the substrings "$s_1s_2s_3s_4$" and "$s_7s_8s_9s_{10}$". Then you should compare `((h[5] - h[1] + m) % m * p_pow[7-1]) % m` with `(h[11] - h[7] + m) % m`: the hash of the earlier substring is multiplied by $p^{7-1}=p^6$ so that both are expressed with the same powers of $p$.
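Here is a sketch of the whole scheme: precompute the prefix hashes `h[]` and the powers `p_pow[]`, then compare any two same-length substrings in $O(1)$ by multiplying the earlier one's hash by the appropriate power of $p$ (the function names are illustrative):

```cpp
#include <string>
#include <vector>
using namespace std;

const long long m = 1e9 + 9;
const long long p = 31;

vector<long long> h, p_pow;

// h[i] = hash of prefix s[0..i-1]; p_pow[i] = p^i mod m
void build_prefix_hashes(const string &s) {
    int n = s.size();
    h.assign(n + 1, 0);
    p_pow.assign(n + 1, 1);
    for (int i = 0; i < n; i++) {
        p_pow[i + 1] = (p_pow[i] * p) % m;
        h[i + 1] = (h[i] + (s[i] - 'a' + 1) * p_pow[i]) % m;
    }
}

// Compare substrings s[i..i+len-1] and s[j..j+len-1] (assuming i <= j):
// multiply the earlier substring's hash by p^(j-i) so both sit at the same powers.
bool same_substring(int i, int j, int len) {
    long long hash_i = (h[i + len] - h[i] + m) % m;
    long long hash_j = (h[j + len] - h[j] + m) % m;
    return hash_i * p_pow[j - i] % m == hash_j;
}
```

After one call to `build_prefix_hashes(s)`, each `same_substring` query runs in $O(1)$.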
The Rabin-Karp algorithm finds all occurrences of a pattern (or substring) `p` in a larger string `s` using the string-hashing method. For example, with `p = "ab"` and `s = "abbab"`, the indices $0$ and $3$ are where `p` occurs as a substring.
We compare the hash of each substring of `s` having the same length as `p`, one by one, with the hash of `p`. Whenever the hashes match, we have found an occurrence of `p`.
Calculating the hash of the pattern takes $O(|p|)$, and comparing the hash of every length-$|p|$ substring of `s` with it takes $O(|s|)$ using precomputed prefix hashes, so the total is $O(|s|+|p|)$.
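A self-contained sketch of Rabin-Karp along these lines, assuming lowercase input (the name `rabin_karp` and variable names are illustrative):

```cpp
#include <algorithm>
#include <string>
#include <vector>
using namespace std;

// Returns the starting indices of all occurrences of pattern t in text s,
// using polynomial hashing; runs in O(|s| + |t|).
vector<int> rabin_karp(const string &t, const string &s) {
    const long long m = 1e9 + 9, p = 31;
    int S = s.size(), T = t.size();

    vector<long long> p_pow(max(S, T) + 1, 1);
    for (int i = 1; i < (int)p_pow.size(); i++)
        p_pow[i] = (p_pow[i - 1] * p) % m;

    vector<long long> h(S + 1, 0);              // prefix hashes of s
    for (int i = 0; i < S; i++)
        h[i + 1] = (h[i] + (s[i] - 'a' + 1) * p_pow[i]) % m;

    long long h_t = 0;                          // hash of the pattern
    for (int i = 0; i < T; i++)
        h_t = (h_t + (t[i] - 'a' + 1) * p_pow[i]) % m;

    vector<int> occurrences;
    for (int i = 0; i + T <= S; i++) {
        long long cur = (h[i + T] - h[i] + m) % m;
        // multiply the pattern's hash by p^i to align it with the window at i
        if (cur == h_t * p_pow[i] % m)
            occurrences.push_back(i);
    }
    return occurrences;
}
```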
Hashing also lets us count the number of distinct substrings of a string. The idea is to compute the hash of every substring of a given length and count the number of unique hashes, which is easily done with a data structure like `set`, and then repeat this for every possible substring length, i.e. $\text{length} \in [1,n]$.
**Time Complexity:** $O(n^2\log(n))$: there are $O(n^2)$ substrings in total, the hash of each is found in $O(1)$ from the prefix hashes, and inserting a hash into the set takes $O(\log(n))$.
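A sketch of this counting approach (the name `count_distinct_substrings` is illustrative): for each length it collects, in a `set`, the hashes of all substrings of that length, normalized to the same power of $p$ so they are comparable.

```cpp
#include <set>
#include <string>
#include <vector>
using namespace std;

// Counts the distinct substrings of s in O(n^2 log n),
// ignoring the (unlikely) possibility of hash collisions.
long long count_distinct_substrings(const string &s) {
    const long long m = 1e9 + 9, p = 31;
    int n = s.size();

    vector<long long> p_pow(n + 1, 1);
    for (int i = 1; i <= n; i++)
        p_pow[i] = (p_pow[i - 1] * p) % m;

    vector<long long> h(n + 1, 0);              // prefix hashes of s
    for (int i = 0; i < n; i++)
        h[i + 1] = (h[i] + (s[i] - 'a' + 1) * p_pow[i]) % m;

    long long count = 0;
    for (int len = 1; len <= n; len++) {
        set<long long> seen;
        for (int i = 0; i + len <= n; i++) {
            long long cur = (h[i + len] - h[i] + m) % m;
            // normalize every substring of this length to the same power of p
            cur = cur * p_pow[n - i] % m;
            seen.insert(cur);
        }
        count += seen.size();
    }
    return count;
}
```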