Tuesday, December 22, 2020

Azure Cosmos DB Analytics Storage Engine

 



     Azure Cosmos DB has a new storage engine specialized for analyzing data in databases. All available APIs do all operations in Transactional Storage, data is saved in in row store format in Transactional Storage. When you try to retrieve data by using SQL API, you look for rows which have the data you are looking for. Rows are written to disk row by row. When a row is found on disk, computer does not need to look for the rest of the file since row is written to disk row by row. This works great when you search by any type of id, date, or any indexed property. It is fast, cheap, and efficient. It is like fishing small fishes, you do not need special equipment to do this.



     Problem arises when you try to read data for reports or try to analyze data. When you analyze data, you select good amount of data and filter might be looking for the whole year's data or all data of a specific location/unit. It takes longer to find this data because data is written row by row and rows might not be saved in order you like to retrieve. In this case, disk and CPU works hard to find data you are looking for. Transactional Storage is not a good place to analyze data because it is expensive, slow and it can cause locks if you use relational databases. It is like fishing a big fish without right equipment.


     Analytics storage uses Column Store format to save your data. This means data is written to disk column by column rather than row by row. This makes all aggregation function run fast because disk does not need to work hard to find data row by row anymore. Cosmos DB takes responsibility to move data from Transaction Store to Analytical Store too. You do not need to write any ETL packages to accomplish this. That means you do not need to figure out which data needs to update, which data should be deleted. Azure Cosmos DB figures all data for you, syncs the data between these two storage engines. This gives us the isolation we have been looking for between transactional and analytical environments. Data written to transactional storage will be available in Analytical Storage less than 5 minutes. In my experience, it really depends on the size of the database, if you have a smaller database usually data becomes available in Analytical Storage in less than a minute.



     In Transactional Storage, there is a setting named TTL. You can cache data or control retention very easily by using Time-To-Live feature. This feature exists in Analytical Storage too. You can have a different retention policy for analytical storage. For example, you might want to keep data for 2 years in Transactional Storage for your web applications and keep data 5 years in Analytical Storage to analyze it by using Time-To-Live function. 

   Analytical Storage Engine is still in in-preview mode, so Cosmos DB team does not recommend it for production use yet. You can turn it on for new containers. I will cover how to set up the Analytical Storage and retrieve data from it in my next post.

No comments:

Post a Comment