Wednesday, September 24, 2014

Processing SQL Server FILESTREAM Data, Part 3 - Creating Tables

In Parts 1 and 2 of this series I discussed my experience with the SQL Server FILESTREAM technology, specifically the background of the decision and setup of the SQL Server. In this installment I discuss the tables created and how I specified the FILESTREAM BLOB column.

Setting The Table

So after some struggle I had SQL Server ready to handle FILESTREAMS. What I needed now were the requisite tables to store the data. This is achieved by adding a column to a table and indicating that BLOB data will live there in a file that is stored on a FILESTREAM filegroup. Here are the tables I used for my email and attachments log:

Most of the columns in the EmailMessages table are fairly self-explanatory. The TransmitStatusId column is a reference into a simple lookup table with an integer ID and description that indicates what state the message is in, e.g. Queued, Transmitted, Failed, etc. As you can see in the EmailAttachments table there are two columns that are somewhat out of the ordinary, the AttachmentFileId and FileData columns. But I'll explain each column so you can understand my approach to this design.
  • EmailAttachmentId - Monotonically increasing surrogate value to be used as a primary key. I prefer these to a GUID when a natural key is not handy but if you want to have a religious war about it there are plenty of places where the battle rages. Feel free to take it there;
  • EmailMessageId - Parent key to the EmailMessages table;
  • AttachmentFileId - This is a unique GUID identifier for the row, as signified by the ROWGUIDCOL indicator, necessary for the FILESTREAM feature to uniquely identify the data;
  • SequenceNum - Indicates the listing sequence of the attachment, for later reporting purposes;
  • Filename - Saves the original file name, since FILESTREAM will create generated file names, and I will want to recreate the file names later when I'm actually transmitting the file via SMTP;
  • FileData - The binary column where the file data is stored, although the data is read and written on the operating system file storage not the SQL Server data file.
  • timestamp - Yes, I still use timestamp files for concurrency. I'm an old-school kind of guy.
The last part of the CREATE TABLE statement for the EmailAttachments table is where you specify the filegroup on which the FILESTREAM data will be stored. This references the filegroup we created in Processing SQL Server FILESTREAM Data, Part 2 - The Setup. And with that, we're finally ready to start coding!
Next up - Processing SQL Server FILESTREAM Data, Part 4 - Readin' and Writin'

Monday, September 22, 2014

Processing SQL Server FILESTREAM Data, Part 2 - The Setup

In Part 1 of this topic I discussed the reasoning behind the decision to use Microsoft's FILESTREAM technology for a recent client project. In this installment I discuss the setup portion of this on the SQL Server side. I'll spare you much of the swing-and-a-miss frustration while attempting to understand how the parts work, but I'll try to pinpoint the traps that I located the hard way.

Stream of Consciousness

The first step is to insure that SQL Server's FILESTREAM technology is enabled for the instance in which you're working. This isn't too difficult to configure but there is a portion of it that might be confusing.
In SQL Server Configuration Manager you will be presented with a list of SQL Server services that have been installed. Double click the SQL Server (MSSQLSERVER) service to see its configuration. The third tab in that dialog is the FILESTREAM configuration (see Image 1). The selections on this page require some explanation:
  1. The "Enable FILESTREAM for Transact-SQL Access" seems pretty simple. This option is necessary for any FILESTREAM access. But what's subtle here is what it omits, which is the next portion;
  2. The "Enable FILESTREAM for file I/O streaming access" is the portion that will allow you as a developer to read and write FILESTREAM data as if it were any other .Net Stream. I recommend enabling this since it allows some nifty capabilities that will be seen in the code for a subsequent post;
  3. The "Windows share name" was another option that seemed obvious but was more subtle. This essentially creates a pseudo-share, like any other network share, that contains files that can be read and written. But it won't show up in Windows Explorer. It's only accessible via the SqlFileStream .Net Framework class;
  4. The final option, "Allow remote clients to have streaming access to FILESTREAM data" is still a bit of a mystery to me. Why would you enable the access without allowing remote clients to stream to it? Is it likely that only local clients would use it? It doesn't seem so to me but perhaps I'm mistaken.

Image 1 - FILESTREAM Configuration

Instance Kharma

Next we need to ensure that our database instance is enabled to utilize FILESTREAM capabilities. This can be done from SQL Server Management Studio. Right click on the database instance and choose Properties from the resulting menu. The Advanced configuration selection in that dialog has a dropdown list for FILESTREAM support right at the very top (see Image 2). It's uncertain to me whether this step is necessary or not because I didn't necessarily do this in the prescribed order but it seemed to me that it needed to be done. I chose the "Full access enabled" option in order to employ the remote streaming access that will be shown in a subsequent post.

Image 2 - FILESTREAM Instance Configuration

Filegroup Therapy

Since FILESTREAM BLOB data is stored on the file system it can't live inside the PRIMARY filegroup for a database. So we need to create a new filegroup and file to contain this data. This is done pretty simply with a few SQL statements, or so it would seem.
First the filegroup.

This is very simple and straightforward. It creates a logical filegroup that specifies that the files contained within will be where FILESTREAM BLOB data is stored.

Pernicious Permissions

Now that I had a filegroup I needed to add files to it. This is where things went a little sideways.
The SQL code to add a file to a filegroup is not terribly complicated.

Upon execution of this piece of code I was presented with an the following noxious error
Operating system error 0x80070005(Access is denied.) occurred while creating or opening file 'E:\Program Files\Microsoft SQL Server\MSSQL.1\MSSQL\Data\FilestreamExampleFiles'. Diagnose and correct the operating system error, and retry the operation.
As I investigated this issue I began to understand what was happening. SQL Server was attempting to create a folder on disk with the name I specified in the ALTER DATABASE command, which is where it would store the files that would comprise the BLOB data. But there was clearly a permissions issue creating the folder.
Well, I'm a developer not an IT technician but I know enough to solve this issue. But I was unable to do so in a satisfactory way. The SQL Server service was running under the NetworkService account, which seemed appropriate for the situation. That account had full control to the entire SQL Server folder tree and everything beneath it. But no matter what I did the problem persisted. I finally changed the service account to LocalSystem and the problem disappeared but I'm uncomfortable with that answer. If I set the permissions for the NetworkService user why was it unable to write to a local disk resource?
Up Next - Processing SQL Server FILESTREAM Data, Part 3 - Creating Tables

Saturday, September 20, 2014

Processing SQL Server FILESTREAM Data, Part 1

I recently finished a utility for a client that was a perfect situation to gain some experience with a technology that I hadn't used before, SQL Server's FILESTREAM capability. This post and subsequent entries will discuss my travails with this technology, but let's set up a little backstory first. (Cue wavy flashback effect)

Of Telephone Books And Happiness

In the early 2000's I co-founded a startup that offered IT services to the Yellow Pages advertising industry. The reasons why and how I ended up in the Yellow Pages industry form a strange and wondrous tale full of action and danger that is best left for another post, or over a lot of drinks. However, I thoroughly enjoyed being an entrepreneur despite the hours, effort, and challenges. And one of the challenges I had to overcome had to do with pages - lots and lots of pages.
As part of our services we offered what are known as electronic tear sheets, i.e. electronic copies of the page on which an actual advertisement was placed. So we had to carry all the pages from every book supplied by every Yellow Page publisher. Some of these were provided as individual PDF files and some were not provided at all. For the latter we took the physical book, sliced off the binding, and scanned each individual page which was then OCR'ed for headings and indexed into a SQL Server database. In either form, with so many publishers and pages, we ended up with millions of individual page files.
As I noted in the previous paragraph each of these page files was indexed in a series of database tables but we needed access to the page image file without the overhead of having to retrieve and store said data into a SQL Server BLOB. Therefore, the page image files were stored on an NTFS file system on fast RAID storage. And everything worked very well, except for one thing - when the files are stored on the file system and not in SQL Server there is no relational integrity between the two data stores. Delete a row from the index table and you have an orphan file. Delete a file from a folder and you have an orphan index record. Maintaining as much integrity as we were able was a constant work-in-progress, with nightly pruning processes, validation routines, and reports. Very ugly but we made it work.
In the release of SQL Server 2008 Microsoft included support for FILESTREAM BLOBs, that is binary large objects that were stored on the file system instead of within a SQL Server MDF file. The BLOB data is part of a row in a database table but essentially it becomes a reference to an individual file on the file system. The big advantage is that SQL Server maintains relational integrity between the table row and the data file. This wonder arrived too late for me, since the startup folded in 2011, but I recently discovered I could make use of it on a project for a current client.

Email Logging For Fun And Profit

My client has myriad nightly processes and constantly running services that send notification emails to relevant parties. Their mail server, however, is outsourced and there have been occasions where the processes were unable to send the notification emails because the server or Internet access was unavailable. So they were looking for a solution that would ensure delivery of their email notifications. My first inclination was to use MSMQ since it's tailor made for guaranteed message delivery. But after further discussion with my client I discovered they additionally wanted to be able to log the messages for proof of delivery and frequency reporting so I started to lean towards a more database-centric solution. I've done this before - most email message information can be stored in a single table row.
Unless there are attachments.
A single email message can have zero to many file attachments, a traditional one to many cardinality. I toyed with the idea of storing the files in a BLOB column but based on my prior experience I wasn't thrilled about the idea. This StackOverflow discussion has some great points on both sides of the debate - I'll let you draw your own conclusions. So I started to devise a file storage solution like the one I created for my Yellow Pages startup, until I remembered the SQL Server feature that handles exactly this situation. Clearly Microsoft has run across this situation themselves and felt that a comprehensive solution was needed. So I rolled up my sleeves and started playing with the unfamiliar technology - a pursuit that's always fun but also frustrating. This was no exception.