Monday, April 28, 2014

Hash of a hash of a hash

Problem: there are two MySQL databases on two different websites that should be mostly identical, except for maybe a few tables. I would like to figure out quickly which tables don't match so that I can run a manual sync.

Solution: hash every field of every record, combine those into a hash per record, then hash the record hashes together into a single hash per table. Then compare the per-table hashes by hand.

In PHP, here's how it goes:

<table>
<?php
function WrapField($f)
{
    return "MD5(IFNULL(`$f`,''))";
}

//Retrieve the list of table names
$TableList = DB_GetScalarArray('show tables');

//Allow 10 MB of group_concat() output; you might need more
DB_ExecuteSQL('SET SESSION group_concat_max_len = 10000000;');

foreach($TableList as $Table)
{
    if(DB_GetScalar("select count(*) from `$Table`") > 0) //Skip empty tables
    {
        //Retrieve the field names
        $OneRecord = DB_GetRecord("select * from `$Table` limit 1");
        $Fields = array_keys($OneRecord);

        //Compose the expression for all field hashes
        $SelectSet = implode(',', array_map("WrapField", $Fields));

        //Now the real work: hash of record hashes
        $SQL =
            "select MD5(GROUP_CONCAT(".
                "MD5(CONCAT($SelectSet)) ".
            "ORDER BY `".$Fields[0]."`)) FROM `$Table`";
        $Hash = DB_GetScalar($SQL);
        echo '<tr><th>'.$Table.'</th><td>'.$Hash.'</td></tr>';
    }
}
?>
</table>


The database helper functions DB_GetScalar(), DB_GetRecord(), DB_GetScalarArray() and DB_ExecuteSQL() are thin wrappers on top of the MySQL client library (Mysqli in my case). What they do should be clear enough from the names; DB_GetRecord() returns an associative array with field names as keys.
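For the record, here is one possible shape for those wrappers. This is an illustrative sketch only: it assumes the connection lives in a global mysqli object $DB, and it omits error handling.

<?php
//Illustrative sketches of the helpers, assuming a global mysqli connection in $DB
function DB_ExecuteSQL($SQL)
{
    global $DB;
    return $DB->query($SQL);
}

function DB_GetScalar($SQL) //First column of the first row, or null
{
    global $DB;
    $Row = $DB->query($SQL)->fetch_row();
    return $Row ? $Row[0] : null;
}

function DB_GetRecord($SQL) //First row as an associative array
{
    global $DB;
    return $DB->query($SQL)->fetch_assoc();
}

function DB_GetScalarArray($SQL) //First column of every row
{
    global $DB;
    $Result = array();
    $RS = $DB->query($SQL);
    while($Row = $RS->fetch_row())
        $Result[] = $Row[0];
    return $Result;
}
?>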

The triple hashing was a necessity, I'm afraid. First, simply concatenating the field values will error out if any two of the fields in a table happen to have different collations. One can probably work around that by casting to binary, but hashing each individual field works just as well.
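For reference, the binary-cast workaround would amount to a variant of WrapField() along these lines (an untested sketch; WrapFieldBinary is a made-up name):

<?php
//Hypothetical alternative to WrapField(): sidestep collation mismatches
//by casting each field to binary instead of hashing it individually
function WrapFieldBinary($f)
{
    return "CAST(IFNULL(`$f`,'') AS BINARY)";
}
?>

Note that per-field hashing has a side benefit the cast lacks: it fixes the field boundaries, so adjacent values like ('ab','c') and ('a','bc') can't concatenate to the same string.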

Hashing the entire record was done to make the input of group_concat() smaller. If the table is wide, the concatenation of all field hashes would be pretty wide, too, so the total length of the group_concat() argument would be MD5Size*NumRecords*NumFields. If the table is long as well, there'd be a real risk of running into the group_concat() length limit. Producing a per-record hash brings that length down to MD5Size*NumRecords instead.
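To put hypothetical numbers on it: with 32-character MD5 digests, a table of 50 fields and 100,000 records would push roughly 32*50*100,000 = 160 MB through group_concat() without the per-record hash, but only 32*100,000 = 3.2 MB with it, comfortably under the 10 MB limit set above.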

One assumption that this code makes is that the first field in the table is the primary key, or at least a somewhat identifying field. If it's not, there might be false negatives caused by mismatched record ordering. A better script would analyse MySQL metadata to retrieve the proper primary key, as sketched below; but then a fallback for keyless tables would be needed anyway.
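Here is what that metadata lookup could look like, assuming the helper conventions above (GetOrderByFields is a made-up name; the query against information_schema.KEY_COLUMN_USAGE is standard MySQL):

<?php
//Hypothetical: fetch the primary key columns of a table from the metadata,
//falling back to the full field list if the table has no primary key
function GetOrderByFields($Table, $AllFields)
{
    $KeyFields = DB_GetScalarArray(
        "select COLUMN_NAME from information_schema.KEY_COLUMN_USAGE ".
        "where TABLE_SCHEMA = DATABASE() and TABLE_NAME = '$Table' ".
        "and CONSTRAINT_NAME = 'PRIMARY' order by ORDINAL_POSITION");
    return count($KeyFields) > 0 ? $KeyFields : $AllFields;
}
?>

The ORDER BY clause in the main query would then be built from all of these fields, e.g. '`'.implode('`,`', GetOrderByFields($Table, $Fields)).'`', instead of just $Fields[0].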
