Storage-based compression and de-duplication overview

Managing storage is always a challenge, so anything to simplify it is worth a look. Rick Vanover shares notes on storage-based compression and de-duplication.

At the recent Gestalt IT Field Day, attendees visited Silicon Valley companies to see their technologies in use.
One of the stops during the event was Ocarina Networks. Ocarina specializes in online storage optimization to reduce disk consumption. The main point of the visit was to obtain a clearer understanding of compression and de-duplication for data management.
For compression, there are two standard techniques. The first is dictionary-based compression, implemented by mainstream products such as ZIP. Dictionary algorithms don't help much with rich content, such as multimedia, because that data lacks the repetitive patterns they rely on.
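The difference is easy to see with Python's zlib module, which implements DEFLATE, the same dictionary-based algorithm ZIP uses. Repetitive text shrinks dramatically, while random bytes (a stand-in for already-encoded multimedia) barely shrink at all:

```python
import os
import zlib

# Repetitive text is full of repeated byte sequences, exactly what a
# dictionary-based compressor exploits.
text = b"backup log entry: status OK\n" * 1000
print(len(zlib.compress(text)) / len(text))   # a tiny fraction of 1.0

# Random bytes stand in for rich multimedia content: there are almost no
# repeated patterns for the dictionary to find, so little is saved.
noise = os.urandom(28000)
print(len(zlib.compress(noise)) / len(noise))  # close to (or above) 1.0
```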
The second is statistical compression, which faster processors have made practical. A statistical approach builds a predictive model of the data's content, which is especially relevant for predicting pixel values in images.
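A minimal sketch of the predictive idea: guess that each pixel equals its left neighbor and keep only the residual (the prediction error). Where the prediction is right, the residual is zero, and a stream of mostly zeros compresses far better than the raw values. (The left-neighbor predictor here is a deliberately simple illustration, not any vendor's actual model.)

```python
import zlib

# A smooth "scanline" of pixel values: neighboring pixels are correlated.
pixels = bytes((i // 4) % 256 for i in range(4096))

# Predictive step: assume each pixel equals its left neighbor and store
# only the residual. Correct predictions become zeros, which the back-end
# compressor handles far better than the raw values.
residuals = bytes([pixels[0]] + [(pixels[i] - pixels[i - 1]) % 256
                                 for i in range(1, len(pixels))])

print(len(zlib.compress(pixels)))     # raw pixel values
print(len(zlib.compress(residuals)))  # residuals compress smaller
```

The transform is lossless: summing the residuals back up reproduces the original pixels exactly.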
Compressors can harness powerful processors to run complex algorithms tuned to different data types. There are countless compressors available for various data sets; in Ocarina's case, more than 120 compressors for different file types are used, and the right compressor for an application's data is chosen to obtain the most efficiency.
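In the spirit of matching the algorithm to the data, a selection layer might look something like the sketch below. The mapping and function names are purely illustrative assumptions, not Ocarina's actual selection logic; it simply dispatches to different standard-library compressors by file extension:

```python
import bz2
import lzma
import zlib

# Hypothetical mapping of file type to compressor (illustrative only).
COMPRESSORS = {
    ".log": zlib.compress,   # fast; fine for repetitive text
    ".csv": bz2.compress,    # stronger on tabular text
    ".bin": lzma.compress,   # heavyweight; best ratio
}

def compress_for(filename: str, data: bytes) -> bytes:
    """Pick a compressor for the file's type, defaulting to zlib."""
    ext = filename[filename.rfind("."):]
    return COMPRESSORS.get(ext, zlib.compress)(data)
```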
De-duplication gains efficiency by not storing multiple copies of the same content; there are a few ways this can be realized. One method is whole-file single instancing, which looks for files with exactly the same content, even under different file names. While quite simple, this scenario is not that frequent in practice.
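Whole-file single instancing can be sketched by hashing each file's bytes: identical content produces identical hashes regardless of the file name. (Using SHA-256 here is an assumption for illustration, not a statement about any product's implementation.)

```python
import hashlib

def find_duplicates(files: dict) -> dict:
    """Map content hash -> file names sharing that exact content."""
    by_hash = {}
    for name, data in files.items():
        digest = hashlib.sha256(data).hexdigest()
        by_hash.setdefault(digest, []).append(name)
    # Only groups with more than one name are duplicates.
    return {h: names for h, names in by_hash.items() if len(names) > 1}

files = {
    "report.doc": b"quarterly numbers",
    "report-copy.doc": b"quarterly numbers",  # same bytes, different name
    "notes.txt": b"meeting notes",
}
print(find_duplicates(files))  # one group: the two report files
```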
De-duplication can also work across multiple files, looking for sections that are the same within different files. Each file is represented as a series of chunks; when a chunk appears in other files, storage is saved by keeping only one copy. An example is a Word document containing a graphic object such as a logo: the de-duplication algorithm stores a single instance of that chunk and references it from every file that contains it. Chunks are typically identified either at fixed-size boundaries or with a sliding window.
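The logo example can be sketched with fixed-size chunking: each file is split into fixed-size blocks, each unique block is stored once, and every file keeps a "recipe" of block hashes that reconstructs it. The tiny chunk size and sample data are illustrative assumptions; real systems use chunks in the kilobyte range.

```python
import hashlib

CHUNK_SIZE = 8  # tiny for illustration; real systems use far larger chunks

def store(data: bytes, chunk_store: dict) -> list:
    """Split data into fixed-size chunks, storing each unique chunk once.
    Returns the recipe (list of chunk hashes) that rebuilds the data."""
    recipe = []
    for i in range(0, len(data), CHUNK_SIZE):
        chunk = data[i:i + CHUNK_SIZE]
        digest = hashlib.sha256(chunk).hexdigest()
        chunk_store.setdefault(digest, chunk)  # only new chunks are stored
        recipe.append(digest)
    return recipe

chunks = {}
logo = b"COMPANY-LOGO-GIF"  # shared object embedded in both documents
doc1 = logo + b"First quarterly report."
doc2 = logo + b"Second annual summary.."
r1 = store(doc1, chunks)
r2 = store(doc2, chunks)

# The shared logo chunks are stored once, so physical bytes < logical bytes.
stored = sum(len(c) for c in chunks.values())
print(stored, len(doc1) + len(doc2))
```

Reassembling a file is just concatenating its recipe's chunks, e.g. `b"".join(chunks[h] for h in r1)`.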
Considerations for daily use of compression and de-duplication
While it is beneficial to realize de-duplication and compression savings, there are some considerations for day-to-day usage. One example is a file that has been compressed and de-duplicated on disk and is then emailed: once it leaves the de-duplicated storage, it is rehydrated to its full, uncompressed size.
The other consideration is the decompression engine for the compressed data. Compression carries overhead, and there can be an incredible amount of math involved; for complex compressors, there may be CPU latency to decompress the data. It truly depends: decompression can be immeasurable for certain compressors but noticeable for larger applications.