Thursday, April 21, 2016

UTF-8 in Transact-SQL: no CLR necessary

While we're on the subject of Transact-SQL utilities, here's one for converting a Unicode string (as NVARCHAR) to a UTF-8 binary data block. It works correctly with non-UCS2 strings encoded as UTF-16.

create function [dbo].[ToUTF8](@s nvarchar(max))
returns varbinary(max)
    declare @i int = 1, @n int = len(@s), @r varbinary(max) = 0x, @c int, @d varbinary(4)
    while @i <= @n
        set @c = unicode(substring(@s, @i, 1))
        if (@c & 0xfc00) = 0xd800
            set @i += 1
            set @c = ((@c & 0x3ff) * 0x400) | 0x10000 | (unicode(substring(@s, @i, 1)) & 0x3ff)

        if @c < 0x80
            set @d = cast(@c as binary(1))
        if @c >= 0x80 and @c < 0x800 
            set @d = cast(((@c * 4) & 0xFF00) | (@c & 0x3f) | 0xc080 as binary(2))
        if @c >= 0x800 and @c < 0x10000
            set @d = cast(((@c * 0x10) & 0xFF0000) | ((@c * 4) & 0xFF00) | (@c & 0x3f) | 0xe08080 as binary(3))
        if @c >= 0x10000
            set @d = cast(((@c * 0x40) & 0xFF000000) | ((@c * 0x10) & 0xFF0000) | ((@c * 4) & 0xFF00) | (@c & 0x3f) | 0xf0808080 as binary(4))
        set @r += @d
        set @i += 1
    return @r

Saturday, April 16, 2016

Your software WILL crash.

When reporting on unhandled .NET exceptions, don't use the builtin Exception.StackTrace. Instead, construct an instance of System.Diagnostics.StackTrace with the exception object as the parameter, and generate your own trace. The difference is, with the latter, you can get offset to the crash point within the method.

The StackTrace contains an array of StackFrame objects. Each of those has the following methods:

  • GetILOffset() - offset to the crash/call within method's bytecode (MSIL) body
  • GetNativeOffset() - the same in JIT-compiled, native form
  • GetMethod() - returns System.Reflection.MethodBase, which can be used to reconstruct a human readable method prototype.
The IL offset can be resolved to sources using ILDASM.

The native offset can be resolved to sources using the debugger, as long as you're debugging precisely the same DLL as the one the crash happened on. Go to the Disassembly window.

Generating the method name in the same format as the builtin StackTrace provides is rather straightforward:

static public string MethodNameWithParameters(MethodBase mb)
    return mb.DeclaringType.FullName + "." +
        mb.Name + "(" +
        string.Join(", ",
                pi => pi.ToString())) + ")";

Friday, April 15, 2016

Object Linking and Embedding Database

Recently, I've been messing with a database API I've never used directly before, even though I've been using it all the time. I'm talking about OLE DB, the interface behind ActiveX Data Objects (ADO).

The job is rather simple. Native C++ application, MS SQL Server database, performance is a priority, so no managed code. Massive inserts into the database, so I want to use table-valued parameters (TVPs), the most friendly option for those. The ADO interface doesn't support TVPs, while OLE DB does. So OLE DB it is.

Turns out, while the OLE DB API per se is not badly documented, the body of samples and discussion on the 'Net is rather poor, so I thought I'd share some pitfalls I've spent time in along the way.

When you're calling a stored procedure that returns both recordsets and output parameters, the latter only appear in the bound variables when the recordsets have been scrolled through and released. In fact, it's the Release() call on the recordset that triggers output parameter filling.

When you want to provide an empty table as a TVP, bind both parameter value and status via the DBBINDING structure, and provide the value DBSTATUS_S_DEFAULT in the bound status variable, as opposed to DBSTATUS_S_ISNULL. If you provide the latter, the stored procedure call will fail, and the value of the status variable on completion will be DBSTATUS_E_BADSTATUS.

A useful pattern is generating an accessor handle for a procedure, and calling it multiple times with the same accessor. When doing so, watch for parameter status values changing after the ICommand::Execute() call. In my case, a variable was too big to fit into a parameter, and the status was reset to DBSTATUS_S_TRUNCATED. There was no error on the procedure call - the Execute() method returned S_OK. However, the next call to Execute() did fail, since DBSTATUS_S_TRUNCATED is not a valid status value upon input.

MurmurHash in Transact-SQL

I'll just leave it here. An implementation of MurmurHash64B in Microsoft SQL Server's Transact-SQL. Tested with MSSQL 2008 and 2014.

create function [dbo].[MurmurHash64B]
    @s varbinary(max),
    @seed bigint = 0
returns bigint
    declare @len int = datalength(@s), @i int = 1
    declare @m bigint = 0x5bd1e995
    declare @h1 bigint = @len ^ (@seed & 0xffffffff)
    declare @h2 bigint = 0
    declare @k1 bigint, @k2 bigint

    while @len >= 8
        set @k1 = cast(convert(binary(4), reverse(substring(@s, @i, 4))) as int)
        set @k1 = (@k1 * @m) & 0xffffffff
        set @k1 ^= @k1 / 0x1000000
        set @k1 = (@k1 * @m) & 0xffffffff
        set @h1 = (@h1 * @m) & 0xffffffff
        set @h1 ^= @k1;

        set @k2 = cast(convert(binary(4), reverse(substring(@s, @i+4, 4))) as int)
        set @k2 = (@k2 * @m) & 0xffffffff
        set @k2 ^= @k2 / 0x1000000
        set @k2 = (@k2 * @m) & 0xffffffff
        set @h2 = (@h2 * @m) & 0xffffffff
        set @h2 ^= @k2
        set @len -= 8
        set @i += 8

    if @len >= 4
        set @k1 = cast(convert(binary(4), reverse(substring(@s, @i, 4))) as int)
        set @k1 = (@k1 * @m) & 0xffffffff
        set @k1 ^= @k1 / 0x1000000
        set @k1 = (@k1 * @m) & 0xffffffff
        set @h1 = (@h1 * @m) & 0xffffffff
        set @h1 ^= @k1
        set @len -= 4
        set @i += 4

    if @len >= 1
        if @len >= 2 
            if @len >= 3
                set @h2 ^= ascii(substring(@s, @i+2, 1)) * 0x10000
            set @h2 ^= ascii(substring(@s, @i+1, 1)) * 0x100
        set @h2 ^= ascii(substring(@s, @i, 1))
        set @h2 = (@h2 * @m) & 0xffffffff

    set @h1 = ((@h1 ^ (@h2 / 0x40000)) * @m) & 0xffffffff
    set @h2 = ((@h2 ^ (@h1 / 0x400000)) * @m) & 0xffffffff
    set @h1 = ((@h1 ^ (@h2 / 0x20000)) * @m) & 0xffffffff
    set @h2 = ((@h2 ^ (@h1 / 0x80000)) * @m) & 0xffffffff

    declare @h bigint
    if (@h1 & 0x80000000) <> 0
        set @h1 &= 0x7fffffff
        set @h = (@h1 * 0x100000000) | 0x8000000000000000
        set @h = @h1 * 0x100000000

    set @h |= @h2

    return @h

MSSQL is capable of storing 64-bit variables, but it doesn't support unsigned integer types, and treats arithmetic overflow as an exception. So the B flavor, which was originally geared towards 32-bit machines, is a better fit than A.

The official home of MurmurHash is at Github.

Sunday, August 23, 2015

Pointer to a pointer to a pointer

I never thought I'd ever use the datatype void**** in a real project. Yet I did. That's the datatype of a pointer to a smart pointer to a COM object, which contains a pointer to the virtual function table, which is an array of function pointers.

Are you lost yet? Let's recap. The virtual function table is an array of pointers, function pointers to be precise. The pointer to it, which by convention constitutes the first data element in a COM object, is a pointer to an array of pointers, type void**. The interface pointer to a COM object points at the object itself, which is to say, points at the first data element, so it's void***. A smart pointer object holds the interface pointer as its first and only data element (no virtual functions there), so a pointer to a smart pointer object is a pointer to its first data element, so it's void****.

Friday, August 14, 2015

Lack of a typelib is not a security fault

Amazing discovery of the day: Microsoft Word pops up a security prompt for a perfectly good macro if a reference to a typelib can't be resolved.

Here's the scenario. I've built a managed DLL with a COM-visible component to be called from a MS Word macro. When managed DLLs are built in the usual way, they don't have a type library resource in them. Dynamic type discovery for created Automation objects is supported by the framework, but the regular typelib-in-file discovery would fail - for DLLs, it's looking for a custom resource of type "TYPELIB" with ID 1. That's why you can't open a managed DLL in OLEView and see the types.

You can, however, export a .TLB file from a managed DLL either with the "tlbexp" utility, or with "regasm". Both come with the .NET framework. Google it up.

Enter MS Word and Visual Basic for Applications aka VBA, it's tried and true macro language. In order to reference a component library during macro development, one needs the typelib registered; the "References" window of the VBA editor presents a list of typelibs. So on the dev machine, I have the typelib registered anyway.

The runtime environment is not the same as the dev environment, one would think. On the runtime machine, do you need the typelib registered, too? Trick question. In theory, no. Once CLSIDs/IIDs of all relevant objects are stored in the project, and enums are resolved to integers, it should be possible to call methods without a typelib. In fact, JavaScript has no problem with creating and invoking a component without a registered typelib. But not VBA.

When invoked on a machine where the managed coclass is registered, but the respective typelib is not, you get the following message:

"The macro cannot be found or has been disabled because of your Macro security settings."

It's really neither. It's a dependency to the macro that can't be found.

The remedy is obvious: use tlbexp to generate a typelib, ship it along with the DLL, register when installing. A TLB file can't register itself, so some manual registry fiddling would be in order.

Tuesday, July 7, 2015

Abusing operator overloading for expression building

Sharepoint lists have a built-in SOAP API. In order to query a list, one has to build a query in XML, along the following lines:

<And><Eq><FieldRef Name="ID"/><Value>100</Value></Eq><NotNull><FieldRef Name="Name"/></NotNull></And>

Those strings are hard to follow, and building them is error prone. To the rescue comes operator overloading in C#. The idea is: field names are wrapped in class instances, comparison between those and numbers/strings produces a condition object, AND/OR between those produces compound condition objects, on the final stage you serialize that chain into an XML string.

The only caveat is, C# doesn't allow for && and || overloading, so we have to overload the bitwize operators & and | instead. The precedence is still what one would expect.

With that in place, the expression above can be written like that:

CAMLField f_ID = F("ID"), f_Name = F("Name");
(f_ID == 100 & f_Name.NotNull).ToString();