Categories
Computer Science

String Hashing

String hashing is a technique for converting a string into a hash number. Why do we need that? Sometimes we need to compare large strings, but instead comparing the string we can compare the generated hash.

Imagine that you had to compare the next two string:

const A = "imagine that this is a string of 200 characters"

const B = "imagine that this is a string of 200 characterz"

Both strings have 200 characters. A brute force way would be to compare each character and see if they match. Something like:

const A = "imagine that this is a string of 200 characters"

const B = "imagine that this is a string of 200 characterz"

function equal(A, B) {
  let i;
  for(i = 0; i < A.length; i++){
    if (A[i] !== B[i]) {
      return false;
    }
  }
  return true;
}

console.log(equal(A,B));

This isn’t optimal with big strings because of it’s O(min(A, B)) performance.

Of course we could make this O(n) by adding a condition that compares A size and B size. Something like:

function equal(A, B) {
  if (A.lenght !== B.length) return false;
  let i;
  for(i = 0; i < A.length; i++){
    if (A[i] !== B[i]) {
      return false;
    }
  }
  return true;
}

As I said, worst scenario would be O(n), but imagine that we had to compare a really large string.

String Hashing is a technique where we can convert our strings into an Integer which will act as a hash number. Therefore, we will be comparing two hashes hash(A) == hash(B) this will be considered O(1). That’s the best option we can have for comparing two strings.

String Hash Formula

Where p and m are prime numbers, and s[0], s[1], s[2]... are values for each character, in this case the char code.

p: It is a prime number roughly equal to number of different characters used. For example if we are going to calculate the hash for a word that includes only lower case characters of the English alphabet, 31 would be a good number. However, if we will include upper case characters, then 53 is an appropriate option.

m: The bigger this number is, the less chance we have to collide two random strings. this variable should be also a prime number. 10 ^9 + 9 is a common choice.

Lets use this data:

p = 31
m = 1e9 + 9
word = 'apple'

We want to know what’s the hash code for the word apple, so if we add our data to our formula, we would have something like this:

then

then we get the char code:

"a" = 97
"p" = 112
"p" = 112
"l" = 108
"e" = 101

and replace this in the formula:

then we reduce the formula:

Finally

So our final hash for the word apple would be 96604250

I’ll leave you the implementation on JavaScript

function hash(word) {
    var p = 31;
    var m = 1e9 + 9;
    var hash_value = 0;
    for(var i = 0; i < word.length; i++) {
        var letter = word[i];
        var charCode = letter.charCodeAt();
        hash_value = hash_value + (charCode * Math.pow(p, i))
    }
    return hash_value % m;
}

console.log(hash("apple"));

ping me on my twitter @jorgechavz and send me your own implementation.