Name : Rabiya Babar

Id : 15009065158

Teacher’s Name : Tayaba

anjum

Course name : Analysis of algorithm

Title : Rabin-karp

algorithm

Date : 3rd January 2018

1. Abstract

There are many string

matching algorithms. The two main categories are, Exact string matching and

approximate string matching. Rabin-Karp is an exact string matching technique that

uses hashing for searching of string. It is an average case analysis of string

matching algorithm. Hashing is also known as finger print technique. Rabin Karp

converts these finger prints in constant time.

This technique sum ASCII values for each and every letter and take mod

of this letter by some prime number. We say it a good hash function if it looks

at every word in

the string. After the Computations

done with function of hash computation, If the answer appeared to be same then

we need to do normal string comparison. If both the strings appeared same, this results in complete search. Next,

calculate the hash of next variable, continue the process until the actual

string is found. Or we reach to the end of text string.

Keywords— hashing, ASCII, hash function.

2.

Introduction

Rabin-Karp

algorithm is the most often used string matching algorithm. To find a hash code pattern ‘pt’ from a given string ‘s’. first of all, it divides

the hash code pattern with a prime number ‘pri’ that is defined before to

compute the remainder of pattern ‘pt’. Then it takes input ‘l’. here ‘l’ is the

length of pattern ‘ptr’ from the string ‘str’ with the first shift ‘sh’ to

compute remainder of ‘l’ chracters from the string ‘s’. If the remainder of the

pattern ‘pt’ and the remainder of the string ‘str’ are equal than we have comparison

between the string and pattern both. Except, In other cases it is not needed to

have a comparison. This is because if remainders are not equal to each other

than these numbers will not be equal in any case. This process is repeated for

every next characters from string for all possible shifts. Rabin-karp algorithm

practical application is detecting plagiarism. When source material is given,

the algorithm can instantaneously search through a given string. Ignoring the

minor details like punctuations. When good hashing is done, it results in good

performance for the encountered data. Similarly if hashing is done poorly then

the whole algorithm takes the worst case time O(mn).

Figure

Text t

R

A

B

I

y

A

i

y

A

Pattern pt

Fig.1

overview of string matching

Rabin

karp string matching problem is the problem of finding all possible shifts with

which it occurs in the described pattern. Figure 1 represent this description.

3. Literature review

Rabin

karp is the increment version to naïve approach by implementing a most powerful

implementation called hash function. Before processing takes place it computes

hash value of pattern ‘ptr’ (with length ‘l’) along with hash value for each

sub string in pattern ptr. Next it compares numerical values instead of

comparing the symbols. According to naïve approach, If any similarity is found

than it compares the pattern with the substring. Or it shifts to next substring

of ‘str’ with pattern ‘ptr’ by using hash values. It shows that performance of

Rabin karp depends on computation of hash values. The string used in the

algorithm can be treated as array of characters. These characters can be

converted to integers or hash values depending upon what type of translating

technique is being used. Hence the string can be used as array of integers.

Let

L be a character sequence as a L digit number in base B. following equations

such as equation

1 can be used to find the substring where ti is the integer at ith position.

Similarly equation 2 is used to compute a hash integer of the sub character where

P is the large prime number.

Ai=

ti B^ (L–1) + ti+1 B^(L–2) + … + ti+L–1 R^ 0

EQ.1

K(Ai)=

ti B (L–1) + ti+1 B(L–2) + … + ti+L–1 B^ 0 (mod Q)

EQ.2

Moreover, after

computing Ai we can compute Ai+1 for the next substring ti+1 ..i+L in

constant time.

xi+1 = ( xi – ti*BL–1 )

B + t i + B

Calculation of the

hash value following equation

can be given.

A(xi+1)= ( xi – t

i*BL–1 ) B + t i + B (mod Q)

EQ.4

4. Methodology

Let P be the pattern

that has length L and string Str has length n. following are the ways that find

sub string in the string are:

1. Hashing

calculations of P to get A(p)O(L).

2. Keep

iterating through the length L of string and substrings of S, hashing integers

of those substrings and comparison with A(p) O(nB).

3. If

the hash answer of substring does not matches the given string A(P). compute

the string comparison once more with the sub string and pattern. stop immediately

if the pattern match and continue if they do not match.

pseudocode

a ? inputstring

b ? searchstring

l ? length(a)

m ? length(b)

hSrchStr 😕 hash(b)

for i = 0 ? l ? m

do

hStr ? hash(ai ? m)

if ai ? m = b

then

return “F OUND”

else

return “NOT F OUND” end if

end for

Calculating the hash

S ? inputstring

Sum1 ? 0

y ? length(S)

x ? S

for i = 0 to y

do

sum1 = sum1 + Si

end for

return sum1 % 3

Figure

Fig.2 Calculation

of hash value

Time complexity

Table

Algorithm

Pre-processing

Searching

Execution

time

Rabin Karp

O(m)

O(mn)

O(mn)

Table.1

Time complexity of Rabin karp.

Table

1 shows the time complexity of

the rabin karp.

Working

hash function working describes

a string as a base b number. Where b is a constant that represents the total

number of characters. calculate the hash value of next character by shifting to

upcoming character. The best practice is

to choose the prime number as large as possible which does not result in

overflow. Hence the time period of hash function is much bigger than time

complexity of normal flow algorithm.

Implementation

Next step to naïve

approach is to move a level above to match a string by using the comparisons that

are being done in each iteration and making use of it in next iteration.

1. Firstly

define the string of numbers only. Let S={0,…10}. After which pattern and

string are defined. Then these numbers are converted to characters. So now

these are set of characters.

2. Let

pt defines the decimal value of pattern pt1,…l. l represents the length of

the pattern.

3. A

major problem that can occur is that the decimal values can cause over flow if

they are large enough. Hence, modulus is computed by a large prime number.

4. Rabin

karp worst case time complexity is O(m-(n-m+1)). And best and average case time

complexity is O (n+m) that makes it better approach of string matching.

5.

Analysis and comparison with other algo’s solving

same problem.

Comparisons

1.

Knuth

Morris Pratt Algorithms

Knuth morris pratt

algorithm is a string searching algorithm that takes linear time for search. A

table of prefixes is build for the string that is to be searched before the

string matching phase starts. The string matching starts with the left most

character in the pattern. prefix table is build only for mismatch strings.

Complexity

It compares string from

left to right. Before processing its time complexity is O(m) that assumes to be

longest length time complexity of the string of chracters. Time complexity of search

mode is O(n+m). overall Total comparisons performed by knuth morris pratt

algorithm is 2n-1 which includes the searching phase and the delay.

2.

Boyer

Moore Horspool Algorithm

Boyer Moore algorithm

performs searching from right to left. It begins the comparison from right to

left. If string mismatches, two pre computed functions are designed to shift

the string window to the right. These sliding functions are called good suffix

shift and bad character shift. This algorithm is the simplified form of

boyer-moore algorithm. The string that is to be searched is preprocessed to

build a table that is designed for bad match occurrence. It contain the length

that shifts. It has values for every character in the string.

Val=len-ind-1

EQ.5 By using shift

valuesof

table2. Eq.5 is used to calculate the values.

Table

Letter

t

e

A

M

Value

8

6

1

3

Table.2 bad

match table

Figure

Text t

R

a

B

I

Y

A

i

y

A

Pattern pt

Comparing from right

to left

Fig.3

overview of BMH algorithm

3.

Brute force

Algorithm

Brute force algorithm compares one character at a

time. It starts comparing from left to right until a mismatch is found. This algorithm

has no separate preprocessing phase. It matches first character of pattern with

first character of string. If it does not match then it move towards the next

character of the string and then compares second character of string with first

character of pattern.if match is found than it compares second character of

pattern with the next character of the string.

Figure

Text t

R

a

B

I

Y

a

B

i

R

a

b

i

Y

a

b

I

R

a

b

i

Y

a

b

i

Fig.4 Brute force example

Analysis

Time complexity

Table

Algorithm

preprocessing

Execution

time

Accuracy

Knuth Morris Pratt

O(M)

O(M+N)

65%

Brute Force

None

O(MN)

66.7%

Boyer Moore

O(M+N)

O(MN)

75%

Rabin Karp

O(N)

O(MN)

70%

Table.3 Analysis

by calculations.

Old algorithm

studies

The time complexity of exact string pattern matching

can be improved if the most efficient algorithm is used. String matching

algorithms were studied with proteins and DNA of biological sequences. After

the analysis it was concluded that KMP algorithm is easier to implement because

it does not back track the given sequence and consume extra space. Rabin karp

algorithm is used to check plagiarism that also requires additional space.

Brute force does not require additional space or preprocessing. It is slow and

does not produce efficient result. BM algorithm is very fast for large

sequences. It discard lots of unnecessary comparisons. Rasool

et.al (2012) had done a research which shows that KMP is more

efficient than Brute force algorithm.

6.

Major finding

·

Rabin Karp algorithm is independent of length of

pattern

·

It runs in real time and require a constant number

of storage location.

·

They are easy and simple to understand and

implement.

·

This algorithm can be used to readily generalize to

high length pattern matching problems.

·

It is also used for multiple pattern matching

problem.

·

One of the application of Rabin karp is detecting

plagiarism.

·

It is easy to implement with good hashing function

affectivity.

·

This algorithm is not faster than brute force, but

its time complexity is O(n+m).

Disadvantages

·

There are many other algorithms that are fatser than

O(n+m).

·

It requires additional space like brute force and is

practically slow.

·

It is not so faster than brute force

7.

Conclusion

In this paper,

string searching algorithm especially Rabin karp is in main focus. Rabin karp

is a great and effective algorithm. It has many advantages one of the best

advantage is that it is used in multiple pattern matching. This makes it a

perfect algorithm for large strings. Finding the correct text In today’s online

scenario is most important thing in minimum time. String matching algorithms

are used for it. Work on software and hardware levels is being done to make

pattern searching faster. Approximately best algorithm is determined by applying

algorithms in different applications. The best algorithm thus have less time

complexity and reduced computation time. Algorithms may not be best optimal

algorithms but better than the general algorithms that are tested in various

practical applications. From the study it is concluded that Rabin karp requires

extra space as it is used to detect plagiarism but brute force do not require

preprocessing of the string. The only problem is it is very slow. It sometimes

produces effective results. Additionally boyer moore algorithm is relatively

faster for large strings as it avoids extra comparisons.

8.

References

1 Ahmad

Fadel,” Towards Faster String Matching”.

2 S.

Viswanadha Raju, Joseph, McAlerney,”abstract”.

3

Miachel O.Rabin, “Computer Algorithms- String Pattern Matching”.

4

Nuser M, Hussain I,” Journal of Computer Science Issues,”.

5 Knuth

D.E, Singla N” International Journal of Soft Computing and Engineering”.