Monday, October 13, 2014

Processing SQL Server FILESTREAM Data, Part 4 - Readin' and Writin'


In the prior installments in this series I covered some background, FILESTREAM setup, and the file and table creation for this project. In this final installment we'll finally see some C# code that I used to read and write the FILESTREAM data.

The Three "R"s

I was always confused by the irony that only one of the legendary Three "R"s actually starts with an "R". Yet another indictment of American education? But I digress.
Before we work on the code that reads FILESTREAM data, let's write some first. To start, we'll need a couple of structures to store information returned from various database operations.
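Something along these lines will do; the class and property names are arbitrary and only need to match what the queries shown later return:

    // Returned when inserting an attachment row: the FILESTREAM path and the
    // transaction context needed to construct a SqlFileStream for writing.
    public class FileStreamInfo
    {
        public string FilePath { get; set; }
        public byte[] TransactionContext { get; set; }
    }

    // Returned when reading attachments back out: the original file name plus
    // the same FILESTREAM path and transaction context, this time for reading.
    public class AttachmentFileResult
    {
        public string Filename { get; set; }
        public string FilePath { get; set; }
        public byte[] TransactionContext { get; set; }
    }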

Then we can create a routine that mimics an SMTP send but instead stores the email information in the database tables we created in "Processing SQL Server FILESTREAM Data, Part 3 - Creating Tables". Pardon the formatting, which keeps the overlong lines within the blog template.
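Here is a condensed sketch of that routine. It assumes the table and column names from Part 3 (the non-FILESTREAM columns of EmailMessages shown here are illustrative), uses Dapper for the INSERTs, and relies on the FileStreamInfo class above; error handling and the status lookup values are simplified.

    using System.Collections.Generic;
    using System.Data.SqlClient;
    using System.Data.SqlTypes;
    using System.IO;
    using System.Linq;
    using Dapper;

    public static class EmailQueue
    {
        // Mimics an SMTP send: instead of transmitting, the message and its
        // attachments are written to the logging tables.
        public static void QueueEmail(string connectionString, string fromAddress, string toAddress,
            string subject, string body, IEnumerable<string> attachmentPaths)
        {
            using (var connection = new SqlConnection(connectionString))
            {
                connection.Open();

                using (var transaction = connection.BeginTransaction())
                {
                    // Insert the master row; OUTPUT returns the new identity value
                    // without a second round trip.
                    const string insertMessageSql = @"
                        INSERT INTO dbo.EmailMessages
                            (FromAddress, ToAddress, Subject, Body, TransmitStatusId)
                        OUTPUT INSERTED.EmailMessageId
                        VALUES (@FromAddress, @ToAddress, @Subject, @Body, @TransmitStatusId);";

                    int emailMessageId = connection.Query<int>(insertMessageSql,
                        new { FromAddress = fromAddress, ToAddress = toAddress, Subject = subject,
                              Body = body, TransmitStatusId = 1 /* Queued */ },
                        transaction).Single();

                    int sequenceNum = 1;
                    foreach (string attachmentPath in attachmentPaths)
                    {
                        // Step 1: insert the attachment metadata. The trailing 0x00 is what
                        // forces SQL Server to create a file (see the discussion below).
                        const string insertAttachmentSql = @"
                            INSERT INTO dbo.EmailAttachments
                                (EmailMessageId, AttachmentFileId, SequenceNum, Filename, FileData)
                            OUTPUT INSERTED.EmailAttachmentId
                            VALUES (@EmailMessageId, NEWID(), @SequenceNum, @Filename, 0x00);";

                        int emailAttachmentId = connection.Query<int>(insertAttachmentSql,
                            new { EmailMessageId = emailMessageId, SequenceNum = sequenceNum,
                                  Filename = Path.GetFileName(attachmentPath) },
                            transaction).Single();

                        // Retrieve the FILESTREAM path and the transaction context...
                        const string fileInfoSql = @"
                            SELECT FileData.PathName() AS FilePath,
                                   GET_FILESTREAM_TRANSACTION_CONTEXT() AS TransactionContext
                            FROM dbo.EmailAttachments
                            WHERE EmailAttachmentId = @EmailAttachmentId;";

                        FileStreamInfo fileInfo = connection.Query<FileStreamInfo>(fileInfoSql,
                            new { EmailAttachmentId = emailAttachmentId }, transaction).Single();

                        // Step 2: ...then open the FILESTREAM and stream the attachment into it.
                        using (FileStream source = File.OpenRead(attachmentPath))
                        using (var target = new SqlFileStream(fileInfo.FilePath,
                            fileInfo.TransactionContext, FileAccess.Write))
                        {
                            source.CopyTo(target);
                        }

                        sequenceNum++;
                    }

                    transaction.Commit();
                }
            }
        }
    }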

A couple of notes about the code shown above:
  • The code uses Marc Gravell and Sam Saffron's superb micro-ORM Dapper, which I highly recommend. While religious wars rage over the use of micro-ORMs vs. heavyweight ORMs, I far prefer Dapper to other approaches;
  • The INSERT statements use the SQL Server OUTPUT clause to return ID information about the inserted rows, which is a more efficient method than sending a subsequent SELECT query for the information;
  • Once the streams have been opened, the .Net 4.0 CopyTo method will do a nice job of copying the bytes. If you're on an earlier version of the framework an equivalent method is easy to write; see Jon Skeet's sample implementation here.

Once the email message has been inserted into the master table and we have its ID we can then attempt to insert the attachments into their corresponding detail table. This is done in two steps:
  1. Insert the metadata about the attachment to the EmailAttachments table. Once this is complete you can retrieve a file name and context ID for streaming attachment data to the FILESTREAM;
  2. Open the FILESTREAM using provided framework methods for doing so. Write the attachment data to the FILESTREAM;

Seems simple, but there is a subtlety. The INSERT statement to add the metadata must add at least one byte of data to the file using Transact-SQL. That is indicated by the null byte ("0x00") that is the last value of the statement. If you don't supply this, instead supplying NULL or, as I initially attempted, default, SQL Server will not create a file since you haven't given it any data. Consequently the SQL Server PathName() function will return NULL and the call to open the SqlFileStream will fail unceremoniously.
There are two ways I could have submitted the attachment data to SQL Server: as the last value of the INSERT statement to the EmailAttachments table, or via streaming as I did in the example. I chose the latter so that, in the case of very large attachments, I could stream the file in chunks rather than reading the entire file into memory to submit via the INSERT statement. This is less resource-intensive under the heavy load I expect for this utility.
I then created a separate Windows service to read the messages, attempt to send them via SMTP, log successes and failures, and queue failed messages for a certain number of retries. The heart of the portion that reads the attachments looks quite similar to the write operation:
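Here is a sketch of that read operation, using the AttachmentFileResult class shown earlier; the method shape and the use of MailMessage are illustrative.

    using System.Data.SqlClient;
    using System.Data.SqlTypes;
    using System.IO;
    using System.Net.Mail;
    using Dapper;

    public static class AttachmentReader
    {
        // Reads each attachment for a queued message out of the FILESTREAM and
        // adds it to the outgoing MailMessage.
        public static void AddAttachments(string connectionString, int emailMessageId, MailMessage mailMessage)
        {
            using (var connection = new SqlConnection(connectionString))
            {
                connection.Open();

                // A transaction is required; without one GET_FILESTREAM_TRANSACTION_CONTEXT()
                // returns NULL and the SqlFileStream constructor will fail.
                using (var transaction = connection.BeginTransaction())
                {
                    const string sql = @"
                        SELECT Filename,
                               FileData.PathName() AS FilePath,
                               GET_FILESTREAM_TRANSACTION_CONTEXT() AS TransactionContext
                        FROM dbo.EmailAttachments
                        WHERE EmailMessageId = @EmailMessageId
                        ORDER BY SequenceNum;";

                    var attachments = connection.Query<AttachmentFileResult>(sql,
                        new { EmailMessageId = emailMessageId }, transaction);

                    foreach (AttachmentFileResult attachment in attachments)
                    {
                        // Copy the FILESTREAM contents into memory so the attachment
                        // stream remains available after the transaction completes.
                        var buffer = new MemoryStream();

                        using (var source = new SqlFileStream(attachment.FilePath,
                            attachment.TransactionContext, FileAccess.Read))
                        {
                            source.CopyTo(buffer);
                        }

                        buffer.Position = 0;
                        mailMessage.Attachments.Add(new Attachment(buffer, attachment.Filename));
                    }

                    transaction.Commit();
                }
            }
        }
    }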

Some notes about the code shown above:
  • I created a result class, shown earlier in this post, for retaining the file path and transaction context returned from the query;
  • Note that you must create a transaction for the SELECT in order for the GET_FILESTREAM_TRANSACTION_CONTEXT method to return a context that can be used in the SqlFileStream constructor;
  • Once again I have used the CopyTo method to move the bytes between the streams.

Summary

That finishes the heart of the SQL Server FILESTREAM operations for the utility I was constructing. The real trick of it was the initial configuration and understanding the process. Hopefully this series of articles will help someone past the problems I encountered. Good luck and good coding!

Wednesday, September 24, 2014

Processing SQL Server FILESTREAM Data, Part 3 - Creating Tables

In Parts 1 and 2 of this series I discussed my experience with the SQL Server FILESTREAM technology, specifically the background of the decision and setup of the SQL Server. In this installment I discuss the tables created and how I specified the FILESTREAM BLOB column.

Setting The Table

So after some struggle I had SQL Server ready to handle FILESTREAMS. What I needed now were the requisite tables to store the data. This is achieved by adding a column to a table and indicating that BLOB data will live there in a file that is stored on a FILESTREAM filegroup. Here are the tables I used for my email and attachments log:
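The listings below are a simplified sketch of those tables. The EmailAttachments columns follow the descriptions given after the listing; the non-FILESTREAM columns of EmailMessages, and the FilestreamExampleGroup filegroup name, are placeholders for the real ones.

    CREATE TABLE dbo.EmailMessages
    (
        EmailMessageId    int IDENTITY(1,1) NOT NULL PRIMARY KEY,
        FromAddress       nvarchar(256)  NOT NULL,
        ToAddress         nvarchar(256)  NOT NULL,
        Subject           nvarchar(500)  NULL,
        Body              nvarchar(max)  NULL,
        TransmitStatusId  int            NOT NULL,  -- lookup: Queued, Transmitted, Failed, etc.
        [timestamp]       timestamp      NOT NULL
    );

    CREATE TABLE dbo.EmailAttachments
    (
        EmailAttachmentId  int IDENTITY(1,1) NOT NULL PRIMARY KEY,
        EmailMessageId     int NOT NULL REFERENCES dbo.EmailMessages (EmailMessageId),
        -- ROWGUIDCOL with a UNIQUE constraint is required for FILESTREAM.
        AttachmentFileId   uniqueidentifier ROWGUIDCOL NOT NULL UNIQUE DEFAULT NEWID(),
        SequenceNum        int            NOT NULL,
        Filename           nvarchar(260)  NOT NULL,
        -- The FILESTREAM attribute is what pushes the BLOB out to the file system.
        FileData           varbinary(max) FILESTREAM NULL,
        [timestamp]        timestamp      NOT NULL
    )
    FILESTREAM_ON FilestreamExampleGroup;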

Most of the columns in the EmailMessages table are fairly self-explanatory. The TransmitStatusId column is a reference into a simple lookup table with an integer ID and description that indicates what state the message is in, e.g. Queued, Transmitted, Failed, etc. As you can see in the EmailAttachments table there are two columns that are somewhat out of the ordinary, the AttachmentFileId and FileData columns. But I'll explain each column so you can understand my approach to this design.
  • EmailAttachmentId - Monotonically increasing surrogate value to be used as a primary key. I prefer these to a GUID when a natural key is not handy but if you want to have a religious war about it there are plenty of places where the battle rages. Feel free to take it there;
  • EmailMessageId - Parent key to the EmailMessages table;
  • AttachmentFileId - This is a unique GUID identifier for the row, as signified by the ROWGUIDCOL indicator, necessary for the FILESTREAM feature to uniquely identify the data;
  • SequenceNum - Indicates the listing sequence of the attachment, for later reporting purposes;
  • Filename - Saves the original file name, since FILESTREAM will create generated file names, and I will want to recreate the file names later when I'm actually transmitting the file via SMTP;
  • FileData - The binary column where the file data is stored, although the data is read and written on the operating system file storage not the SQL Server data file.
  • timestamp - Yes, I still use timestamp columns for concurrency. I'm an old-school kind of guy.
The last part of the CREATE TABLE statement for the EmailAttachments table is where you specify the filegroup on which the FILESTREAM data will be stored. This references the filegroup we created in Processing SQL Server FILESTREAM Data, Part 2 - The Setup. And with that, we're finally ready to start coding!
Next up - Processing SQL Server FILESTREAM Data, Part 4 - Readin' and Writin'

Monday, September 22, 2014

Processing SQL Server FILESTREAM Data, Part 2 - The Setup

In Part 1 of this topic I discussed the reasoning behind the decision to use Microsoft's FILESTREAM technology for a recent client project. In this installment I discuss the setup portion of this on the SQL Server side. I'll spare you much of the swing-and-a-miss frustration while attempting to understand how the parts work, but I'll try to pinpoint the traps that I located the hard way.

Stream of Consciousness

The first step is to ensure that SQL Server's FILESTREAM technology is enabled for the instance in which you're working. This isn't too difficult to configure but there is a portion of it that might be confusing.
In SQL Server Configuration Manager you will be presented with a list of SQL Server services that have been installed. Double click the SQL Server (MSSQLSERVER) service to see its configuration. The third tab in that dialog is the FILESTREAM configuration (see Image 1). The selections on this page require some explanation:
  1. The "Enable FILESTREAM for Transact-SQL Access" seems pretty simple. This option is necessary for any FILESTREAM access. But what's subtle here is what it omits, which is the next portion;
  2. The "Enable FILESTREAM for file I/O streaming access" is the portion that will allow you as a developer to read and write FILESTREAM data as if it were any other .Net Stream. I recommend enabling this since it allows some nifty capabilities that will be seen in the code for a subsequent post;
  3. The "Windows share name" was another option that seemed obvious but was more subtle. This essentially creates a pseudo-share, like any other network share, that contains files that can be read and written. But it won't show up in Windows Explorer. It's only accessible via the SqlFileStream .Net Framework class;
  4. The final option, "Allow remote clients to have streaming access to FILESTREAM data" is still a bit of a mystery to me. Why would you enable the access without allowing remote clients to stream to it? Is it likely that only local clients would use it? It doesn't seem so to me but perhaps I'm mistaken.

Image 1 - FILESTREAM Configuration

Instance Karma

Next we need to ensure that our database instance is enabled to utilize FILESTREAM capabilities. This can be done from SQL Server Management Studio. Right-click on the instance and choose Properties from the resulting menu. The Advanced page in that dialog has a dropdown list for FILESTREAM support right at the very top (see Image 2). I'm not certain whether this step is strictly necessary, since I didn't do things in the prescribed order, but it seemed to be needed. I chose the "Full access enabled" option in order to employ the remote streaming access that will be shown in a subsequent post.

Image 2 - FILESTREAM Instance Configuration

Filegroup Therapy

Since FILESTREAM BLOB data is stored on the file system it can't live inside the PRIMARY filegroup for a database. So we need to create a new filegroup and file to contain this data. This is done pretty simply with a few SQL statements, or so it would seem.
First the filegroup.
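It amounts to a single statement along these lines (the database and filegroup names are placeholders):

    ALTER DATABASE FilestreamExample
        ADD FILEGROUP FilestreamExampleGroup CONTAINS FILESTREAM;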

This is very simple and straightforward. It creates a logical filegroup that specifies that the files contained within will be where FILESTREAM BLOB data is stored.

Pernicious Permissions

Now that I had a filegroup I needed to add files to it. This is where things went a little sideways.
The SQL code to add a file to a filegroup is not terribly complicated.
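It looks roughly like this, using the same placeholder names and the folder path that appears in the error message below; note that for FILESTREAM the FILENAME is a directory, and its parent must already exist:

    ALTER DATABASE FilestreamExample
        ADD FILE
        (
            NAME = 'FilestreamExampleFiles',
            FILENAME = 'E:\Program Files\Microsoft SQL Server\MSSQL.1\MSSQL\Data\FilestreamExampleFiles'
        )
        TO FILEGROUP FilestreamExampleGroup;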

Upon execution of this piece of code I was presented with the following noxious error:
Operating system error 0x80070005(Access is denied.) occurred while creating or opening file 'E:\Program Files\Microsoft SQL Server\MSSQL.1\MSSQL\Data\FilestreamExampleFiles'. Diagnose and correct the operating system error, and retry the operation.
As I investigated this issue I began to understand what was happening. SQL Server was attempting to create a folder on disk with the name I specified in the ALTER DATABASE command, which is where it would store the files that would comprise the BLOB data. But there was clearly a permissions issue creating the folder.
Well, I'm a developer, not an IT technician, but I know enough to solve this kind of issue. Yet I was unable to do so in a satisfactory way. The SQL Server service was running under the NetworkService account, which seemed appropriate for the situation. That account had full control over the entire SQL Server folder tree and everything beneath it. But no matter what I did the problem persisted. I finally changed the service account to LocalSystem and the problem disappeared, but I'm uncomfortable with that answer. If I set the permissions for the NetworkService user, why was it unable to write to a local disk resource?
Up Next - Processing SQL Server FILESTREAM Data, Part 3 - Creating Tables

Saturday, September 20, 2014

Processing SQL Server FILESTREAM Data, Part 1

I recently finished a utility for a client that was a perfect situation to gain some experience with a technology that I hadn't used before, SQL Server's FILESTREAM capability. This post and subsequent entries will discuss my travails with this technology, but let's set up a little backstory first. (Cue wavy flashback effect)

Of Telephone Books And Happiness

In the early 2000s I co-founded a startup that offered IT services to the Yellow Pages advertising industry. The reasons why and how I ended up in the Yellow Pages industry form a strange and wondrous tale full of action and danger that is best left for another post, or over a lot of drinks. However, I thoroughly enjoyed being an entrepreneur despite the hours, effort, and challenges. And one of the challenges I had to overcome had to do with pages - lots and lots of pages.
As part of our services we offered what are known as electronic tear sheets, i.e. electronic copies of the page on which an actual advertisement was placed. So we had to carry all the pages from every book supplied by every Yellow Page publisher. Some of these were provided as individual PDF files and some were not provided at all. For the latter we took the physical book, sliced off the binding, and scanned each individual page which was then OCR'ed for headings and indexed into a SQL Server database. In either form, with so many publishers and pages, we ended up with millions of individual page files.
As I noted in the previous paragraph each of these page files was indexed in a series of database tables but we needed access to the page image file without the overhead of having to retrieve and store said data into a SQL Server BLOB. Therefore, the page image files were stored on an NTFS file system on fast RAID storage. And everything worked very well, except for one thing - when the files are stored on the file system and not in SQL Server there is no relational integrity between the two data stores. Delete a row from the index table and you have an orphan file. Delete a file from a folder and you have an orphan index record. Maintaining as much integrity as we were able was a constant work-in-progress, with nightly pruning processes, validation routines, and reports. Very ugly but we made it work.
In the release of SQL Server 2008 Microsoft included support for FILESTREAM BLOBs, that is, binary large objects stored on the file system instead of within a SQL Server MDF file. The BLOB data is part of a row in a database table, but essentially it becomes a reference to an individual file on the file system. The big advantage is that SQL Server maintains relational integrity between the table row and the data file. This wonder arrived too late for me, since the startup folded in 2011, but I recently discovered I could make use of it on a project for a current client.

Email Logging For Fun And Profit

My client has myriad nightly processes and constantly running services that send notification emails to relevant parties. Their mail server, however, is outsourced, and there have been occasions where the processes were unable to send the notification emails because the server or Internet access was unavailable. So they were looking for a solution that would ensure delivery of their email notifications. My first inclination was to use MSMQ since it's tailor-made for guaranteed message delivery. But after further discussion with my client I discovered they additionally wanted to be able to log the messages for proof of delivery and frequency reporting, so I started to lean toward a more database-centric solution. I've done this before - most email message information can be stored in a single table row.
Unless there are attachments.
A single email message can have zero to many file attachments, a traditional one-to-many cardinality. I toyed with the idea of storing the files in a BLOB column but based on my prior experience I wasn't thrilled about the idea. This StackOverflow discussion has some great points on both sides of the debate - I'll let you draw your own conclusions. So I started to devise a file storage solution like the one I created for my Yellow Pages startup, until I remembered the SQL Server feature that handles exactly this situation. Clearly Microsoft had run across this situation themselves and felt that a comprehensive solution was needed. So I rolled up my sleeves and started playing with the unfamiliar technology - a pursuit that's always fun but also frustrating. This was no exception.

Monday, June 16, 2014

A Recipe For Password Security

Several months ago I helped architect a password security scheme for a client. During that process I learned quite a bit about how to encrypt passwords in a secure fashion.

Encryption vs. Hashing

Most developers have heard the term "encryption", which means that data is encoded in such a way that it is not human-readable. But in the context of password security the word “encryption” implies that the encoding can be decoded, that is, it’s a “two-way” encryption. While it may be advantageous to decode a user’s password, especially in situations where they have forgotten it, it opens up a security hole. Simply put, if someone attacking your security implementation can guess the algorithm and parameters used to encrypt passwords, they can then decrypt all the passwords in your system! At this point you have the equivalent of passwords stored in your system in plaintext – not an excellent approach.

A much more secure method for storing encrypted passwords is to use a cryptographically secure hash [1]. A “hash” is an algorithm that will take a block of data and from that information generate a value such that if any of the data is changed the hashed value will change as well. The block of data is generally called a “message” and the hashed value is called a “digest”. What is valuable about cryptographic hashes with regard to password security is that they are “one-way”; that is, once the password has been hashed it cannot be decrypted back to its original plaintext form. This eliminates the security vulnerability that exists with two-way encryption.

By now I’m sure some of you have thought, “Great, if I have this hashed value how do I validate that against the plaintext password typed in by the user?” The answer is, you don’t. When the user types in their password you hash the value that they entered using the same hash algorithm. You then compare that hashed value with the hashed password stored in your system. If they match then the user is authenticated.
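To make that concrete, here is a tiny sketch of the compare-the-digests flow. SHA-256 is used purely for brevity; as discussed below, a deliberately slow algorithm is a better choice for password storage.

    using System.Linq;
    using System.Security.Cryptography;
    using System.Text;

    public static class PasswordCheck
    {
        // Hashes the password the user just typed and compares that digest to the
        // digest stored at registration time; the plaintext is never stored.
        public static bool IsMatch(string enteredPassword, byte[] storedDigest)
        {
            using (var sha256 = SHA256.Create())
            {
                byte[] enteredDigest = sha256.ComputeHash(Encoding.UTF8.GetBytes(enteredPassword));
                return enteredDigest.SequenceEqual(storedDigest);
            }
        }
    }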

Adding Some Salt

So we now have a process for storing passwords in our system in a secure form that cannot be decrypted, thus closing the door that allows attackers access to all the passwords stored in the system. But determined attackers are not so easily thwarted. They will use a rainbow of methods to gain access to your systems, which segues (in a ham-handed fashion) into the next topic, rainbow tables.

Since they can no longer decrypt your passwords attackers will try the next best thing. They’ll take a large list of common words and passwords and hash them using some of the well-known standard algorithms. They’ll then compare this list of hashed words to your password list. Any matches will immediately indicate a successful password search. Given users’ penchant for commonly used passwords the chances are good that the attacker will end up with quite a few successes.

The generally accepted defense against this attack is to use a “password salt” [2]. A salt value is just a randomly generated value that is added to the user’s password before hashing. The salt value is then stored with the user’s hashed password so that the authentication method can use it to hash a password entered by the user.
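A sketch of generating a per-user salt and folding it into the hash (again, SHA-256 and the helper names are for illustration only):

    using System.Linq;
    using System.Security.Cryptography;
    using System.Text;

    public static class SaltedHash
    {
        // Generates a random salt; store it alongside the user's hashed password.
        public static byte[] CreateSalt(int sizeInBytes)
        {
            var salt = new byte[sizeInBytes];
            using (var rng = RandomNumberGenerator.Create())
            {
                rng.GetBytes(salt);
            }
            return salt;
        }

        // Prepends the salt to the password bytes before hashing, so identical
        // passwords hashed with different salts produce different digests.
        public static byte[] Hash(string password, byte[] salt)
        {
            byte[] salted = salt.Concat(Encoding.UTF8.GetBytes(password)).ToArray();
            using (var sha256 = SHA256.Create())
            {
                return sha256.ComputeHash(salted);
            }
        }
    }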

Now I’m sure some of you are wondering how this prevents rainbow table attacks if the salt value is easily accessible. What the salt value does is require the attacker to regenerate all the values in their rainbow table using the specified salt value. Even if they have a match it will only work for the one user for which that particular salt value was used. While it doesn’t prevent a successful attack it certainly limits it to one success and makes it very slow and cumbersome for the attacker to make additional attempts on other passwords.

Needs Some Pepper

So how can we make it even more difficult for the determined attacker? Well, we can add a “secret salt” value not stored in the database to the password before we hash it. This value would be well known to the system so that it can reproduce it as necessary for authentication but would not be stored in the database. This type of value is commonly known as a “pepper” value. The fact this it is not published or stored makes it even more difficult for an attacker to guess what the plaintext value was before hashing. Unless they have access to the source code for generating the pepper value they may never be able to generate a successful rainbow table.

Simmer Slowly

So it seems like we’ve covered all the bases. But we can’t forget about Moore’s Law [3]. As CPUs and GPUs get faster and faster it becomes easier to generate multiple rainbow tables so that an attacker can take many guesses at an encrypted password list. What’s a poor, security-minded developer to do?

Well, how about we purposely slow them down?

There are several well-known cryptographic hash algorithms [4], such as the Message Digest derivatives (MD2, MD4, MD5) and the Secure Hash Algorithms from the National Security Agency (SHA-1, SHA-256), but many of these were designed to work quickly. In some cases, like MD5, the algorithm is considered “cryptographically broken” [5]. What we really need is a hash algorithm that can be adapted so that it is slow enough to discourage the generation of multiple rainbow tables but fast enough to hash a password quickly after a user types it in for authentication.

Enter bcrypt [6]. Bcrypt is a hashing function based on the well-regarded Blowfish encryption algorithm that includes an iteration count to make it process more slowly. Even if the attacker knows that bcrypt is the algorithm in use, a properly selected iteration count renders the generation of rainbow tables very expensive. Furthermore, the iteration count is stored in the hashed result value so it’s forward compatible; that is, as computing power continues to increase the iteration count can be increased and applied to existing password hashes so that the generation of rainbow tables continues to be expensive.
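Here is a sketch of how that looks in code, assuming the open-source BCrypt.Net package (any bcrypt implementation exposes something similar). The work factor parameter is the iteration knob described above.

    public static class PasswordHasher
    {
        // The work factor (2^12 iterations here) can be raised as hardware gets faster.
        public static string HashPassword(string password)
        {
            return BCrypt.Net.BCrypt.HashPassword(password, 12);
        }

        // The salt and work factor are parsed back out of the stored hash string,
        // so verification needs nothing but the password and the stored value.
        public static bool Verify(string password, string storedHash)
        {
            return BCrypt.Net.BCrypt.Verify(password, storedHash);
        }
    }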

A Spicy Meatball

So by using a combination of the right spices (salt and pepper) and the proper cook time (iterations) we can end up with an excellently prepared plate of hash. It’s not perfect - no security approach ever is - but we can certainly make our systems less vulnerable to the point where an attacker will look for victims that are less well-protected. And that’s all we can really hope for, that they look somewhere else.

Additional References

Coding Horror: You're Probably Storing Passwords Incorrectly
1. http://en.wikipedia.org/wiki/Cryptographic_hash_function
2. http://en.wikipedia.org/wiki/Salt_(cryptography)
3. http://en.wikipedia.org/wiki/Moores_law
4. http://en.wikipedia.org/wiki/Cryptographic_hash_function#Cryptographic_hash_algorithms
5. http://en.wikipedia.org/wiki/MD5
6. http://en.wikipedia.org/wiki/Bcrypt

Sunday, June 15, 2014

Throwing a Great Block

Last year I was working on a cloud-hosted Windows service for a client that contained an application-specific logging implementation. The existing architecture had log entries posted at various process points, i.e. file discovery, pickup, dropoff, and download. The log code would post a message to Microsoft Message Queuing (MSMQ) and a separate database writer service would dequeue those messages and post them to a series of tables in SQL Server.

Lagging The Play

While this setup worked perfectly well it had one minor issue - the queueing of a log message to MSMQ happened synchronously. That means that while the service was attempting to post a log message to the queue, all other file processing was temporarily suspended. Since posting a log message to MSMQ means you're performing an inter-process communication, there will be a noticeable lag imposed on the calling thread. Add to that the possibility that the MSMQ service could be located on another server and you've now imposed network lag time on the calling process as well. That's potentially alotta-lag! In the worst possible case, if MSMQ cannot be reached for some reason, file processing could be suspended for a very long time. For a platform that expects to be able to process thousands of messages a day this was clearly not going to work as a long-term solution. However, the client wanted to retain the use of MSMQ as a persistent message forwarding mechanism so that if the writer service was unavailable the log messages would not end up getting lost.

Block For Me

It seemed clear that what was needed was some way for the service to save log messages internally for near-term posting to MSMQ in a way that would minimally impact file processing. What came to mind initially was to have an internal Queue object on which the service could store log messages that could be dequeued and posted to MSMQ by another thread. It's a classic Producer-Consumer pattern [1]. While this threading implementation is not of surpassing difficulty, it has some subtleties that make it non-trivial. First, all access to the Queue object has to be thread-safe. Second, the MSMQ posting thread needs to enter a low-CPU-load no-operation loop while it's waiting for a log message to be queued. Wouldn't it be nice if there was something built into the .Net Framework to do all this?

Well, sometimes Microsoft gets it right. In the .Net Framework 4 release Microsoft added something called a Blocking Collection [2] that does exactly what we needed. It allows for thread-safe Producer-Consumer patterns that do not consume CPU resources when there is nothing on the queue.

Here's an example of how to implement it in a simple console application.

First, we'll need a message class. In the service for the client the log information message was more complex, but this should give you the general idea.
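Something like this minimal version will do; the ToString format is chosen to match the console output shown at the end of this post.

    public class LogMessage
    {
        public int Id { get; set; }
        public string MessageText { get; set; }

        public LogMessage(int id, string messageText)
        {
            Id = id;
            MessageText = messageText;
        }

        public override string ToString()
        {
            return string.Format("Message with ID {0} and value {1}.", Id, MessageText);
        }
    }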

The real "meat" of the operation is in the class that encapsulates the blocking collection. Here's the first portion of the class definition.
You'll notice that the class implements the IDisposable interface. This is so that the thread that dequeues the messages from the blocking collection can clean up after itself. This will be seen in another section of the code for this class.

You'll also notice that when the BlockingCollection is defined we specify the class of objects that will be placed on the collection. However, when we instantiate the collection we signify that it should use a ConcurrentQueue object as the backing data store for the blocking collection. This ensures that the items placed in the collection will be handled in a thread-safe manner on a first-in, first-out (FIFO) basis.

The finalizer method merely calls our Dispose method with a parameter indicating that this was called from the class' destructor, a common pattern for IDisposable implementations [3]. The Dispose methods will be shown in their entirety later in this post.

The AddLog method is very simple; it invokes the blocking collection's Add method to enqueue the message in a thread-safe manner. The DequeueMessageThread method appears to be an endless loop that keeps attempting to dequeue a message, which you might expect to cause a CPU spike from the tight looping. But here's where the magic of the blocking collection comes into play. The Take method of the blocking collection will enter into a low-CPU wait state if nothing is found on the queue, blocking the loop from proceeding. As soon as a message is enqueued the Take method will return from the wait state and the loop will proceed. Note that the Take method will also return immediately if the blocking collection has been closed down, indicating completion, hence the IsCompleted check right after the call.
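Here is roughly what those two members might look like; they belong inside the LogMessageQueue class sketched above.

    // Producer side: enqueue a message in a thread-safe manner.
    public void AddLog(LogMessage message)
    {
        _messages.Add(message);
        Console.WriteLine("Enqueueing: {0}", message);
    }

    // Consumer side: runs on its own thread, draining the collection.
    private void DequeueMessageThread()
    {
        try
        {
            while (true)
            {
                // Take blocks in a low-CPU wait state until an item is available.
                LogMessage message = _messages.Take();
                Console.WriteLine("Dequeueing: {0}", message);
                // In the real service this is where the message is posted to MSMQ.

                // Stop once the collection has been marked complete and drained.
                if (_messages.IsCompleted)
                {
                    break;
                }
            }
        }
        catch (InvalidOperationException)
        {
            // Signaled if the collection is marked complete while Take is waiting.
        }
        catch (ThreadAbortException)
        {
            // Signaled if Dispose timed out waiting for this thread and aborted it.
        }

        Console.WriteLine("Dequeue thread complete.");
    }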

The exception handler in the method captures two specific exceptions:
  1. The InvalidOperationException will be signaled if the blocking collection is stopped. We'll see this in the Dispose method;
  2. The ThreadAbortException will be signaled if the thread had to be killed because the Dispose method timed out waiting for the thread to finish.
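And here is a sketch of the Dispose pair, again members of the class above; the five-second timeout is an arbitrary choice for illustration.

    // Public Dispose: satisfies IDisposable; called by the using statement.
    public void Dispose()
    {
        Dispose(true);
    }

    private void Dispose(bool disposing)
    {
        Console.WriteLine("Shutting down queue. Waiting for dequeue thread completion.");

        // Disallow further additions so the dequeue thread can drain and exit.
        _messages.CompleteAdding();

        // Wait for the dequeue thread to finish; forcibly destroy it if it doesn't.
        if (!_dequeueThread.Join(TimeSpan.FromSeconds(5)))
        {
            _dequeueThread.Abort();
        }

        if (disposing)
        {
            // Called from the public Dispose rather than the destructor, so the
            // garbage collector no longer needs to run the finalizer.
            GC.SuppressFinalize(this);
        }
    }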
In this code snippet the first Dispose method is our public interface that satisfies the requirement for IDisposable implementation. It simply calls our private Dispose method that takes a parameter indicating whether it was called from the class destructor method.

The second, private Dispose method is where some housekeeping for the blocking collection and dequeue thread happens. First we call the blocking collection's CompleteAdding method. This disallows any further additions to the queue, minimizing the chance that the dequeue thread will never end because messages continue to be added. We then attempt to wait for the thread to complete by calling the thread's Join method, specifying a timeout value. If the thread has not completed within the specified timeout we forcibly destroy it and exit. Finally, if we were called from the public Dispose method rather than the destructor, we tell the garbage collector to suppress finalization for the object, since cleanup has already been done.

To utilize a producer-consumer queue like this one is quite simple:
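A minimal console driver along these lines will exercise it; the message text format matches the output below.

    using System;

    public static class Program
    {
        public static void Main()
        {
            using (var queue = new LogMessageQueue())
            {
                for (int i = 1; i <= 100; i++)
                {
                    queue.AddLog(new LogMessage(i, string.Format("Message text # {0}", i)));
                }
            }   // Dispose drains the queue and shuts down the dequeue thread.
        }
    }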
The using statement ensures that the queue's Dispose method is invoked upon completion, thereby stopping the dequeuing thread. When executed in a loop like this one, which enqueues 100 messages, the tail end of the output looks like this:

Enqueueing: Message with ID 92 and value Message text # 92.
Enqueueing: Message with ID 93 and value Message text # 93.
Enqueueing: Message with ID 94 and value Message text # 94.
Dequeueing: Message with ID 88 and value Message text # 88.
Dequeueing: Message with ID 89 and value Message text # 89.
Dequeueing: Message with ID 90 and value Message text # 90.
Dequeueing: Message with ID 91 and value Message text # 91.
Enqueueing: Message with ID 95 and value Message text # 95.
Enqueueing: Message with ID 96 and value Message text # 96.
Enqueueing: Message with ID 97 and value Message text # 97.
Enqueueing: Message with ID 98 and value Message text # 98.
Dequeueing: Message with ID 92 and value Message text # 92.
Dequeueing: Message with ID 93 and value Message text # 93.
Dequeueing: Message with ID 94 and value Message text # 94.
Dequeueing: Message with ID 95 and value Message text # 95.
Enqueueing: Message with ID 99 and value Message text # 99.
Enqueueing: Message with ID 100 and value Message text # 100.
Dequeueing: Message with ID 96 and value Message text # 96.
Dequeueing: Message with ID 97 and value Message text # 97.
Dequeueing: Message with ID 98 and value Message text # 98.
Dequeueing: Message with ID 99 and value Message text # 99.
Dequeueing: Message with ID 100 and value Message text # 100.
Shutting down queue. Waiting for dequeue thread completion.
Dequeue thread complete.

As you can see the dequeue process slightly lags the enqueue process, as you would expect for processes running in separate threads. The messages are interspersed as the threads compete for the shared resource.

Finishing It Off

So what we've demonstrated is a way to implement a producer-consumer pattern without writing a lot of thread management code. While this pattern isn't applicable to every situation, it certainly has its uses. Any time you need to queue up items for processing but don't want to slow down the primary process, give this pattern a try.


1. http://en.wikipedia.org/wiki/Producer-consumer_problem
2. http://msdn.microsoft.com/en-us/library/dd267312.aspx
3. http://stackoverflow.com/a/538238/49954