Wednesday, May 21, 2014

DataWarehousing with Redshift: Benefits of a columnar data storage

This a short article stating the benefits of a columnar storage architecture.
In traditional database systems, records are stored into disk blocks by row. Each row is stored one after other. Each row sits in a block. Within those blocks each column sits one after another.
Following diagram illustrates this row-based storage architecture:

This kind of data storage architecture has been created keeping in mind that often a complete record is queried. This is a good fit for scenarios where in CRUD operations are in picture, and data is sought often for a complete row. This architecture is good for OLTP type of applications. 

However, for data analytics and data warehousing use-cases this type of storage architecture isn't a good fit. Amazon's Redshift takes up this issue and have come up with a columnar storage which helps in optimizing data warehousing related queries. In this architecture values for each column are stored sequentially into disk blocks.

Following diagram illustrates this column-based storage architecture:

Row-wise storage disadvantages:

  • If block size is smaller than the size of a record, storage for an entire record may take more than one block.
  • If block size is larger than the size of a record, storage for an entire record may take less than one block, resulting in an inefficient use of disk space.

So, if you can see, storing records in row-wise fashion isn't storage space optimized.

Column-wise storage advantages:

  • Each data block holds column field values for as many as three times as many records as row-based storage.
  • Requires a third of the I/O operations compared to row-wise storage. In practice, using tables with very large numbers of columns and very large row counts, storage efficiency is even greater.
  • Since each block holds the same type of data, it can use a compression scheme selected specifically for the column data type, further reducing disk space and I/O.
Hence, by using a columnar storage architecture Redshift has reduced the disk space being used to store same data resulting in reduced number of seeks it would be required to read the stored data in order to process an user query, ultimately increasing the query performance on large datasets.

Redshift achieves high performance over queries on very huge amount of data. Along with the columnar architecture there are other features Redshift has added in order to achieve this performance, they are:

  • Massively parallel processing
  • Data compression
  • Query optimization
  • Compiled code

I will take these up in detail in one of future posts. Try out Redshift if you need to analyze your data using some of the popular BI tools like Tableau and Jaspersoft. If your dataset size is huge and you are struggling to get good query performace, then I would certainly suggest you to try out Redshift once. Read more about it here.

Disclaimer: Many details including diagrams were taken from official AWS documentations.

Any feedback, good or bad is most welcome.


Email *

Message *