Friday, April 15, 2016

MurmurHash in Transact-SQL

UPDATE: it's now a Gist.

UPDATE: now in Pascal too.

UPDATE: now for MySQL too.

Sometimes, one needs an intermediate strength hash. Something less collision prone than CRC32 or Adler-32, but smaller and less costly than MDx or SHAxxx. My favorite for a while has been MurmurHash64 - it produces a 64-bit value, just the right size to fit into a scalar variable in a C program, or a bigint database field.

There are several flavors of MurmurHash; I'm partial to the one called MurmurHash64B. MSSQL is capable of storing 64-bit variables, but it doesn't support unsigned integer types, and treats arithmetic overflow as an exception. So the B flavor, which was originally geared towards 32-bit machines, is a better fit than A. One can use signed bigint to emulate the arithmetics of unsigned int.

The official home of MurmurHash is at Github. Over there, one can find an implementation in C.

This code produces a signed 64-bit value - half the hash values are negative. The C implementation produces an unsigned one, always positive or zero. Keep that in mind when building your interoperation logic.

I'll just leave it here. An implementation of MurmurHash64B in Microsoft SQL Server's Transact-SQL. Tested with MSSQL 2008 and 2014.

create function [dbo].[MurmurHash64B]
(
    @s varbinary(max),
    @seed bigint = 0
)
returns bigint
as
begin
    declare @len int = datalength(@s), @i int = 1
    declare @m bigint = 0x5bd1e995
    declare @h1 bigint = @len ^ (@seed & 0xffffffff)
    declare @h2 bigint = 0
    declare @k1 bigint, @k2 bigint

    while @len >= 8
    begin
        set @k1 = cast(convert(binary(4), reverse(substring(@s, @i, 4))) as int)
        set @k1 = (@k1 * @m) & 0xffffffff
        set @k1 ^= @k1 / 0x1000000
        set @k1 = (@k1 * @m) & 0xffffffff
        set @h1 = (@h1 * @m) & 0xffffffff
        set @h1 ^= @k1;

        set @k2 = cast(convert(binary(4), reverse(substring(@s, @i+4, 4))) as int)
        set @k2 = (@k2 * @m) & 0xffffffff
        set @k2 ^= @k2 / 0x1000000
        set @k2 = (@k2 * @m) & 0xffffffff
        set @h2 = (@h2 * @m) & 0xffffffff
        set @h2 ^= @k2
        set @len -= 8
        set @i += 8
    end

    if @len >= 4
    begin
        set @k1 = cast(convert(binary(4), reverse(substring(@s, @i, 4))) as int)
        set @k1 = (@k1 * @m) & 0xffffffff
        set @k1 ^= @k1 / 0x1000000
        set @k1 = (@k1 * @m) & 0xffffffff
        set @h1 = (@h1 * @m) & 0xffffffff
        set @h1 ^= @k1
        set @len -= 4
        set @i += 4
    end

    if @len >= 1
    begin
        if @len >= 2 
        begin
            if @len >= 3
                set @h2 ^= ascii(substring(@s, @i+2, 1)) * 0x10000
            set @h2 ^= ascii(substring(@s, @i+1, 1)) * 0x100
        end
        set @h2 ^= ascii(substring(@s, @i, 1))
        set @h2 = (@h2 * @m) & 0xffffffff
    end

    set @h1 = ((@h1 ^ (@h2 / 0x40000)) * @m) & 0xffffffff
    set @h2 = ((@h2 ^ (@h1 / 0x400000)) * @m) & 0xffffffff
    set @h1 = ((@h1 ^ (@h2 / 0x20000)) * @m) & 0xffffffff
    set @h2 = ((@h2 ^ (@h1 / 0x80000)) * @m) & 0xffffffff

    declare @h bigint
    if (@h1 & 0x80000000) <> 0
    begin
        set @h1 &= 0x7fffffff
        set @h = (@h1 * 0x100000000) | 0x8000000000000000
    end
    else
        set @h = @h1 * 0x100000000

    set @h |= @h2

    return @h
end

4 comments:

  1. Thank you for the above code. But can you provide the C# implementation which yield the same result in SQL as well as C#.

    Thank you in advance.

    ReplyDelete
  2. I believe there's a NuGet package for that.

    ReplyDelete
  3. Anyone found MurmurHash3 in tsql?

    ReplyDelete
  4. Can you try and adapt what I have? The MurmurHash3_x86_32 flavor should be amenable to the same techniques. Uint32 is represented by bigint, whatever may cause overflow of the lower 32 bits needs to be masked with 0xFFFFFFFF.

    Rotation will look clumsy, but oh well.

    ReplyDelete