The KMP (Knuth-Morris-Pratt) algorithm is a widely used **string-matching algorithm** that finds every position at which a pattern occurs inside a larger string. For example, with `p = "ab"` and `s = "abbbabab"`, KMP returns `[0,4,6]`, because `"ab"` occurs in $s$ three times, starting at indices 0, 4 and 6. It relies on the values of the **prefix function** to do so.
The prefix function of a string $s$ of length $n$ is an array $\Pi$ of length $n$ such that the $i^{th}$ element holds the length of the longest proper prefix of $s[0,i]$ that is also a suffix of $s[0,i]$. A proper prefix is one that is not the whole substring, so $\Pi[0] = 0$.
The basic algorithm for finding this value at index $i$ is to compare, one by one, every proper prefix of the substring $s[0,i]$ with the suffix of the same length.
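The brute-force idea above can be sketched as follows. This is only an illustration: the function name `prefix_function_naive` is made up for this example, and the sketch tries border lengths from longest to shortest so it can stop at the first match.

```python
def prefix_function_naive(s: str) -> list[int]:
    """Cubic-time prefix function: compare every prefix with the suffix of equal length."""
    n = len(s)
    pi = [0] * n
    for i in range(n):
        # Candidate lengths of a proper prefix of s[0..i], longest first.
        for k in range(i, 0, -1):
            # Prefix of length k versus the suffix of length k ending at index i.
            if s[:k] == s[i - k + 1 : i + 1]:
                pi[i] = k
                break
    return pi


print(prefix_function_naive("abcabcd"))  # [0, 0, 0, 1, 2, 3, 0]
```

There are $O(N^2)$ (index, length) pairs and each comparison costs $O(N)$, so this version runs in $O(N^3)$.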
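From one index to the next, the value of the prefix function can increase by at most 1, stay the same, or decrease by some amount. That is, $\Pi[i] \le \Pi[i-1] + 1$. Why? If the prefix of length $\Pi[i]$ matches the suffix of $s[0,i]$, then dropping the last character from both leaves a prefix of length $\Pi[i]-1$ that matches a suffix of $s[0,i-1]$, so $\Pi[i]-1 \le \Pi[i-1]$.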
If we compute the prefix function using the fact above, we perform only $O(N)$ string comparisons: across all indices the value increases by at most $N$ in total, so it can also decrease by at most $N$ in total. Since one string comparison takes $O(N)$, the total complexity is now $O(N \cdot N) = O(N^2)$.
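A minimal sketch of this first optimization, assuming the same string-slicing comparison as the brute-force approach. The name `prefix_function_quadratic` is hypothetical; the only change is that the search for $\Pi[i]$ starts at $\Pi[i-1]+1$ instead of at $i$.

```python
def prefix_function_quadratic(s: str) -> list[int]:
    """Prefix function using pi[i] <= pi[i-1] + 1 to bound the candidate lengths."""
    n = len(s)
    pi = [0] * n
    for i in range(1, n):
        # pi[i] cannot exceed pi[i-1] + 1, so start there and go downwards.
        for k in range(pi[i - 1] + 1, 0, -1):
            if s[:k] == s[i - k + 1 : i + 1]:
                pi[i] = k
                break
    return pi


print(prefix_function_quadratic("aabaaab"))  # [0, 1, 0, 1, 2, 2, 3]
```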
In step 1 we saw that the value of the prefix function increases by at most $N$ steps overall, yet the string comparisons still left us with $O(N^2)$ complexity. Here we eliminate string comparisons completely.
Now suppose we are computing $\Pi[i]$ for string $s$. From the value of $\Pi[i-1]$ we know that the prefix of $s[0,i-1]$ of length $\Pi[i-1]$ equals the suffix of $s[0,i-1]$ of the same length, i.e. $s[0,\Pi[i-1]-1] = s[i-\Pi[i-1],i-1]$.
If $s[\Pi[i-1]] = s[i]$, the match extends by one character and $\Pi[i] = \Pi[i-1]+1$. If instead $s[\Pi[i-1]] \ne s[i]$, we look for the next **longest prefix** (say of length $k < \Pi[i-1]$) that satisfies $s[0,k-1]=s[i-k,i-1]$, so that $s[k]=s[i]$ may turn out to be true.
If $s[k] = s[i]$, then $\Pi[i]=k+1$, because $s[0,k]$ is then the **longest prefix** that matches a suffix of $s[0,i]$, which is exactly the definition of $\Pi[i]$.
Therefore, we are searching for the **longest proper prefix** of length less than $\Pi[i-1]$ that matches a suffix of $s[0,\Pi[i-1]-1]$ (which equals $s[i-\Pi[i-1],i-1]$, **from point 1 above**). This is the definition of the prefix function at index $\Pi[i-1]-1$, i.e. $k = \Pi[\Pi[i-1]-1]$.
If $s[k]=s[i]$, we stop and assign $\Pi[i]=k+1$. Otherwise we continue the same way with $k = \Pi[k-1]$, that is $k = \Pi[\Pi[\Pi[i-1] - 1] - 1]$, then $k = \Pi[\Pi[\Pi[\Pi[i-1] - 1] - 1]-1]$, ... until $k = 0$. Then, if $s[i]=s[0]$, we assign $\Pi[i]=1$, and $\Pi[i]=0$ otherwise.
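The steps above can be sketched as the following linear-time routine. The function name `prefix_function` and the variable `k` (the length of the border currently being tried) are choices made for this example.

```python
def prefix_function(s: str) -> list[int]:
    """Linear-time prefix function with no substring comparisons."""
    n = len(s)
    pi = [0] * n
    for i in range(1, n):
        # Start from the border of s[0..i-1] and try to extend it with s[i].
        k = pi[i - 1]
        # On a mismatch, fall back to the next shorter border: k = pi[k - 1].
        while k > 0 and s[k] != s[i]:
            k = pi[k - 1]
        # Either a border was extended, or k == 0 and only s[0] is left to compare.
        if s[k] == s[i]:
            k += 1
        pi[i] = k
    return pi


print(prefix_function("aabaaab"))  # [0, 1, 0, 1, 2, 2, 3]
```

Each iteration raises `k` by at most 1 and every pass through the `while` loop lowers it, so the total work over all indices is $O(N)$.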
**To make sure that no value of the prefix function exceeds the length of $p$, we place a separator character that never appears in string $s$, such as `'#'`, between $p$ and $s$ and compute the prefix function of `p + '#' + s`**.
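A sketch of the resulting search, assuming that `'#'` does not occur in $s$. The names `kmp_search` and `sep` are illustrative. A value equal to $|p|$ at position $i$ of the combined string means an occurrence of $p$ ends there, which corresponds to start index $i - 2|p|$ in $s$.

```python
def prefix_function(s: str) -> list[int]:
    """Linear-time prefix function."""
    pi = [0] * len(s)
    for i in range(1, len(s)):
        k = pi[i - 1]
        while k > 0 and s[k] != s[i]:
            k = pi[k - 1]
        if s[k] == s[i]:
            k += 1
        pi[i] = k
    return pi


def kmp_search(p: str, s: str, sep: str = "#") -> list[int]:
    """Return the start indices of all occurrences of p in s.

    Assumption: sep is a single character that does not occur in s.
    """
    m = len(p)
    if m == 0:
        return []  # Convention chosen for this sketch: an empty pattern has no matches.
    pi = prefix_function(p + sep + s)
    # Positions m+1 .. end of the combined string belong to s.
    return [i - 2 * m for i in range(m + 1, len(pi)) if pi[i] == m]


print(kmp_search("ab", "abbbabab"))  # [0, 4, 6]
```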
The period of a string $s$ is the smallest length $k$ such that $s$ can be written as a concatenation of one or more copies of a substring $t$ of length $k$.
If you compare all blocks from the start and the end, it turns out that all blocks of size $k$ are equal. This means that $k$ is the period of $s$, because the same block of size $k$ repeats throughout $s$.
**Time Complexity:** $O(N)$, where $N$ is the length of $s$.
### Compressing a string
Now that we know how to find the period of a string, we can compress the string to a single block of size $k$ that repeats over and over in $s$.
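A minimal sketch of the compression, assuming the standard candidate $k = n - \Pi[n-1]$: if $k$ divides $n$, the block $s[0,k-1]$ repeats to form $s$; otherwise no shorter block exists and the string is its own compressed form. The function names are hypothetical.

```python
def prefix_function(s: str) -> list[int]:
    """Linear-time prefix function."""
    pi = [0] * len(s)
    for i in range(1, len(s)):
        k = pi[i - 1]
        while k > 0 and s[k] != s[i]:
            k = pi[k - 1]
        if s[k] == s[i]:
            k += 1
        pi[i] = k
    return pi


def compress(s: str) -> str:
    """Return the shortest block t such that s is t repeated one or more times."""
    n = len(s)
    if n == 0:
        return s
    # Candidate period: total length minus the longest border of the whole string.
    k = n - prefix_function(s)[-1]
    if n % k != 0:
        k = n  # The candidate does not tile s, so s cannot be compressed.
    return s[:k]


print(compress("abababab"))  # ab
print(compress("abcab"))     # abcab
```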
**Brief idea:** Take an empty string $t$, add the characters of $s$ to it one by one, and after each addition use the prefix function to count how many new substrings the added character creates.
Let's say we have already added some characters from $s$ to $t$, and $k$ is the current number of distinct substrings. Now we add a character $c$ to $t$: $t = t+c$.
Note that appending a character to any string $t$ creates as many new substrings as the length of the new string ($t=t+c$), namely all of its suffixes, although some of them may already occur in $t$. **For example, appending `'d'` to `"abc"` creates 4 new substrings: `"d"`, `"cd"`, `"bcd"`, `"abcd"`.**
By reversing $t$, our task boils down to counting how many prefixes do not appear anywhere else in $t$, which can be done by computing the prefix function of the reversed $t$.
After computing the prefix function of the reversed $t$, we take its maximum value $\Pi_{max} = \max_i \Pi[i]$. This is the length of the longest prefix that already occurs elsewhere in $t$ as a substring, and it implies that all shorter prefixes also occur as substrings of $t$.
Therefore, we subtract the number of already-present substrings, $\Pi_{max}$, from the total number of new substrings, $|t|$, so appending $c$ adds $|t| - \Pi_{max}$ distinct substrings.
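The counting procedure can be sketched as below. The function name `count_distinct_substrings` is made up for this example. Since the prefix function is recomputed for every prefix of $s$, this sketch runs in $O(N^2)$ overall.

```python
def prefix_function(s: str) -> list[int]:
    """Linear-time prefix function."""
    pi = [0] * len(s)
    for i in range(1, len(s)):
        k = pi[i - 1]
        while k > 0 and s[k] != s[i]:
            k = pi[k - 1]
        if s[k] == s[i]:
            k += 1
        pi[i] = k
    return pi


def count_distinct_substrings(s: str) -> int:
    """Count the distinct non-empty substrings of s by growing t one character at a time."""
    total = 0
    t = ""
    for c in s:
        t += c
        # Suffixes of t become prefixes of reversed t; the largest prefix-function
        # value is the longest such prefix that already occurs elsewhere in t.
        pi_max = max(prefix_function(t[::-1]))
        total += len(t) - pi_max
    return total


print(count_distinct_substrings("abab"))  # 7
```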