Thursday, April 21, 2016

UTF-8 in Transact-SQL: no CLR necessary

UPDATE: it's now a Gist.

While we're on the subject of Transact-SQL utilities, here's one for converting a Unicode string (as NVARCHAR) to a UTF-8 binary data block. It handles characters outside the Basic Multilingual Plane, which NVARCHAR stores as UTF-16 surrogate pairs and which plain UCS-2 cannot represent. To check, try it on a string with an emoji, for example N'😂'. That's code point U+1F602 (face with tears of joy), whose UTF-8 encoding is F0 9F 98 82.

create function [dbo].[ToUTF8](@s nvarchar(max))
returns varbinary(max)
as
begin
    declare @i int = 1, @n int = datalength(@s)/2, @r varbinary(max) = 0x, @c int, @d varbinary(4)
    while @i <= @n
    begin
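        -- Read one UTF-16 code unit
        set @c = unicode(substring(@s, @i, 1))
        -- High surrogate: combine it with the low surrogate that follows
        if (@c & 0xFC00) = 0xD800
        begin
            set @i += 1
            -- Add (not OR) 0x10000: the shifted high bits can already occupy bit 16, e.g. for U+20000
            set @c = ((@c & 0x3FF) * 0x400) + 0x10000 + (unicode(substring(@s, @i, 1)) & 0x3FF)
        end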

        -- 1 byte: 0xxxxxxx
        if @c < 0x80
            set @d = cast(@c as binary(1))
        -- 2 bytes: 110xxxxx 10xxxxxx
        if @c >= 0x80 and @c < 0x800
            set @d = cast(((@c * 4) & 0xFF00) | (@c & 0x3F) | 0xC080 as binary(2))
        -- 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        if @c >= 0x800 and @c < 0x10000
            set @d = cast(((@c * 0x10) & 0xFF0000) | ((@c * 4) & 0x3F00) | (@c & 0x3F) | 0xE08080 as binary(3))
        -- 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        if @c >= 0x10000
            set @d = cast(((@c * 0x40) & 0xFF000000) | ((@c * 0x10) & 0x3F0000) | ((@c * 4) & 0x3F00) | (@c & 0x3F) | 0xF0808080 as binary(4))
            
        set @r += @d
        set @i += 1
    end
    return @r
end
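
A quick way to exercise the function once it exists. This is only a sketch: it assumes the function was created as dbo.ToUTF8 in the current database, and that the database's default collation is not a supplementary-character (_SC) one, so that substring and unicode work on individual UTF-16 code units as the loop expects. The byte values in the comments are the standard UTF-8 encodings of those code points, one for each branch of the function.

```sql
-- Assumes dbo.ToUTF8 has been created in the current database.
select
    dbo.ToUTF8(N'A')  as one_byte,    -- U+0041,  UTF-8: 0x41
    dbo.ToUTF8(N'ß')  as two_bytes,   -- U+00DF,  UTF-8: 0xC39F
    dbo.ToUTF8(N'€')  as three_bytes, -- U+20AC,  UTF-8: 0xE282AC
    dbo.ToUTF8(N'😂') as four_bytes;  -- U+1F602, UTF-8: 0xF09F9882

-- The surrogate-pair arithmetic on its own, for the emoji above:
-- high surrogate 0xD83D, low surrogate 0xDE02.
declare @hi int = 0xD83D, @lo int = 0xDE02;
select ((@hi & 0x3FF) * 0x400) + 0x10000 + (@lo & 0x3FF) as codepoint; -- 128514, i.e. 0x1F602
```

If the four_bytes column comes back as anything other than the four bytes shown, the collation assumption above is the first thing to check.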
