So I have an innocent desire to run a report off the content of some files in source control. The question is, how are they stored? There are four tables of interest:
- tbl_Version
- tbl_FileReference
- tbl_FileMetadata
- tbl_Content
The first table, tbl_Version, has the file name and path. Since I'm only interested in latest versions of files, I filter it by VersionTo=0x7FFFFFFF and DeletionId = 0.
It connects to the home project via the ItemDataspaceId field. In order to get to file content, you have to join it to tbl_FileReference via a triple PartitionId-DataspaceId-FileID.
I'm honestly not sure when does PartitionID come into play. In my on-prem TFS instance, it's uniformly 1. Maybe it is only used in VSTS, or for setups with multiple database servers. What is important, though, the indices on those tables all include PartitionId as the first field, so whenever you join them, you get a major performance boost from joining by PartitionId too.
There's a field ResourceId in tbl_FileReference, tbl_FileMetadata, and tbl_Content. It's a GUID, and one would expect that it identifies a file uniquely, but the indices on the first two are on (PartitionId, ResourceId).
The file contents are in tbl_Content. The OffsetFrom/OffsetTo fields suggest it stores file chunks, but in my experience, files up to 200 KB (uncompressed) are stored as one chunk. So I never had to write any chunk combination logic.
LATE CORRECTION: I've seen chunks in Tfs_Configuration.tbl_Content. Technically it's not a TFVC related table.
There's another piece of logic to it, which I had to support. The field Content in tbl_Content can be compressed, or it can be stored as a patch (delta) relative to another record in tbl_Content. You can tell by looking at the file's record in tbl_FileMetadata:
- ContentType = 2 - it's a delta, use RelatedResourceId to track the original
- ContentType <> 2 and CompressionType = 1 - it's compressed
- Otherwise, it's just the file data as VARBINARY
The compression is straightforward GZip, either SQL Server's DECOMPRESS function or .NET's GZipStream class will take care of that.
The delta logic is trickier. The format of deltas is PA30, AKA Intra Package Delta, AKA IPD. It's a proprietary binary patch format, used in CAB files among other places. It's not documented (to the best of my knowledge), but there's a public implementation by Microsoft, which ships with Windows as msdelta.dll.
The problem is that there's no way to call msdelta.dll directly from Transact-SQL. This is where realtime reporting from the TFS databases is next to impossible (except maybe with CLR functions), and one might have to spend some time and build an ETL package. Applying the delta to the base is rather straightforward in C#, with a bit of P/Invoke. I've created a gist that wraps the relevant function. The only comment to that code is that the value of DELTA_INPUT.Editable should be true, and the Flags parameter of ApplyDeltaB should be AllowPa19, which is 1. That's the values that TFS' own file retrieval logic passes.
The original to a delta might be a delta, too. So some chain chasing would be necessary.
Another table that might be of interest is tbl_Changeset. It connects to tbl_Version via ChangeSetId, which corresponds to VersionTo in the latter. There's no DataspaceId in tbl_Changeset, because a changeset can span multiple projects.
Hi
ReplyDeleteFor TFVC the tbl_Version table has informaciĆ³n the files and route (ChildItem, ParentPath, etc).
Do you have information for Git ??
Sorry, no.
Delete