Saturday, September 20, 2014

Processing SQL Server FILESTREAM Data, Part 1

I recently finished a utility for a client that was a perfect situation to gain some experience with a technology that I hadn't used before, SQL Server's FILESTREAM capability. This post and subsequent entries will discuss my travails with this technology, but let's set up a little backstory first. (Cue wavy flashback effect)

Of Telephone Books And Happiness

In the early 2000's I co-founded a startup that offered IT services to the Yellow Pages advertising industry. The reasons why and how I ended up in the Yellow Pages industry form a strange and wondrous tale full of action and danger that is best left for another post, or over a lot of drinks. However, I thoroughly enjoyed being an entrepreneur despite the hours, effort, and challenges. And one of the challenges I had to overcome had to do with pages - lots and lots of pages.
As part of our services we offered what are known as electronic tear sheets, i.e. electronic copies of the page on which an actual advertisement was placed. So we had to carry all the pages from every book supplied by every Yellow Page publisher. Some of these were provided as individual PDF files and some were not provided at all. For the latter we took the physical book, sliced off the binding, and scanned each individual page which was then OCR'ed for headings and indexed into a SQL Server database. In either form, with so many publishers and pages, we ended up with millions of individual page files.
As I noted in the previous paragraph each of these page files was indexed in a series of database tables but we needed access to the page image file without the overhead of having to retrieve and store said data into a SQL Server BLOB. Therefore, the page image files were stored on an NTFS file system on fast RAID storage. And everything worked very well, except for one thing - when the files are stored on the file system and not in SQL Server there is no relational integrity between the two data stores. Delete a row from the index table and you have an orphan file. Delete a file from a folder and you have an orphan index record. Maintaining as much integrity as we were able was a constant work-in-progress, with nightly pruning processes, validation routines, and reports. Very ugly but we made it work.
In the release of SQL Server 2008 Microsoft included support for FILESTREAM BLOBs, that is binary large objects that were stored on the file system instead of within a SQL Server MDF file. The BLOB data is part of a row in a database table but essentially it becomes a reference to an individual file on the file system. The big advantage is that SQL Server maintains relational integrity between the table row and the data file. This wonder arrived too late for me, since the startup folded in 2011, but I recently discovered I could make use of it on a project for a current client.

Email Logging For Fun And Profit

My client has myriad nightly processes and constantly running services that send notification emails to relevant parties. Their mail server, however, is outsourced and there have been occasions where the processes were unable to send the notification emails because the server or Internet access was unavailable. So they were looking for a solution that would ensure delivery of their email notifications. My first inclination was to use MSMQ since it's tailor made for guaranteed message delivery. But after further discussion with my client I discovered they additionally wanted to be able to log the messages for proof of delivery and frequency reporting so I started to lean towards a more database-centric solution. I've done this before - most email message information can be stored in a single table row.
Unless there are attachments.
A single email message can have zero to many file attachments, a traditional one to many cardinality. I toyed with the idea of storing the files in a BLOB column but based on my prior experience I wasn't thrilled about the idea. This StackOverflow discussion has some great points on both sides of the debate - I'll let you draw your own conclusions. So I started to devise a file storage solution like the one I created for my Yellow Pages startup, until I remembered the SQL Server feature that handles exactly this situation. Clearly Microsoft has run across this situation themselves and felt that a comprehensive solution was needed. So I rolled up my sleeves and started playing with the unfamiliar technology - a pursuit that's always fun but also frustrating. This was no exception.

No comments:

Post a Comment