Databases, Indexing ft. Designing Data-Intensive Applications — Part 2

Darshan kadu
4 min read · Jul 25, 2020

Databases… the end

Welcome to the second part of Databases, Indexing ft. Designing Data-Intensive Applications. For better understanding, please read the first part.

Online Transaction Processing (OLTP): In the previous part we discussed patterns for fast reads in applications like comment sections, gaming, financial transactions, etc. These workloads are interactive, and their access pattern is called online transaction processing (OLTP).

Online Analytical Processing (OLAP): Use cases like business analytics, where one needs to scan the entire dataset to read only a few columns and compute statistics (sum, avg), belong to the access pattern called OLAP.

Companies used to run analytical queries on the same SQL databases as their transactions, but in the 1990s they started using data warehousing.

Data Warehousing :

OLTP systems need to be highly available and should not be burdened with business-analytics queries, as those might impact customer requests. Hence, for analysis purposes a separate read-only database is maintained; it is updated by periodic dumps or a stream of updates, with its structure modified appropriately so that it suits analytical queries. The indexing structures from the previous post work well for OLTP but not for OLAP.

Stars and Snowflakes: Schemas for Analytics

  • The center table of the schema is the “fact table”.
  • Each row of the fact table represents an event that occurred.
  • Columns in the fact table can be foreign keys to other tables called dimension tables (e.g. customer id); they can also be plain values like selling price.
  • The name star schema comes from the fact that the fact table is related to many dimension tables, looking like a star in the table-relationship diagram.
  • The name snowflake schema applies when dimension tables are further normalised into sub-dimension tables, giving the relationship diagram more of a snowflake-like structure.
  • Fact tables are wide and can have 100s of columns.
  • Fact tables can have trillions of rows, as they can represent all the sales, transactions, etc., whereas dimension tables are smaller and may have millions of rows.
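The star schema above can be sketched with plain Python dicts. The table and column names here (fact_sales, dim_customer, dim_product) are illustrative, not from the post:

```python
# Dimension tables: small, keyed by id.
dim_customer = {1: {"name": "Alice", "country": "IN"},
                2: {"name": "Bob", "country": "US"}}
dim_product = {10: {"name": "Laptop"}, 11: {"name": "Phone"}}

# Fact table: one row per event (a sale); wide in practice.
# Foreign keys (customer_id, product_id) point into dimension tables;
# selling_price is a plain value column.
fact_sales = [
    {"customer_id": 1, "product_id": 10, "selling_price": 900},
    {"customer_id": 2, "product_id": 11, "selling_price": 400},
    {"customer_id": 1, "product_id": 11, "selling_price": 410},
]

# Resolving a fact row's foreign keys ("joining" to the dimensions):
row = fact_sales[0]
customer = dim_customer[row["customer_id"]]
product = dim_product[row["product_id"]]
print(customer["name"], product["name"], row["selling_price"])
# Alice Laptop 900
```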

Column-Oriented Storage:

  • A typical analytical query reads only a few of the 100s of columns, across trillions of rows.
  • OLTP-style indexing over 100s of columns wouldn’t be the best practice for such use cases, as it would load every column of each row into memory and then filter the needed columns out.
  • To overcome this inefficiency, the values of each column are stored together, in the same order as the rows. If each column is stored in a separate file, only the files for the queried columns need to be read.
  • So if, say, the 5th row matches the query, the 5th value from each desired column file is returned.
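A minimal sketch of that column-oriented layout, with illustrative column names:

```python
# The same fact rows, first row-wise, then column-wise.
rows = [
    {"country_id": 1, "product_id": 10, "price": 900},
    {"country_id": 2, "product_id": 11, "price": 400},
    {"country_id": 1, "product_id": 11, "price": 410},
]

# Column-oriented layout: one sequence per column; the i-th entry of
# every column belongs to the i-th row.
columns = {name: [r[name] for r in rows] for name in rows[0]}

# A query touching only two columns: SUM(price) WHERE country_id = 1.
# Only the country_id and price "files" need to be read.
total = sum(price
            for country, price in zip(columns["country_id"], columns["price"])
            if country == 1)
print(total)  # 1310
```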

Column Compression:

  • Loading all the required columns’ data can be further optimised if the column values are compressed.
  • A typical column contains a limited number of distinct values n (countries: ~200, products: ~100k), which is small compared to the number of rows.
  • If n is small, like 200, we can use one bit per row to indicate whether the column holds a given value, creating a bitmap per value. Hence if there are 5 rows and the column’s values are (1, 2, 2, 3, 3), then the bitmap for 1 is (1, 0, 0, 0, 0).
  • Each distinct column value gets its own bitmap.
  • If n is very large, like 100k, the bitmaps will contain long runs of 0s and 1s; to avoid this redundancy we can use run-length encoding. E.g. for (1, 2, 2, 3, 3), the bitmap for 1 encodes as (1 one, 4 zeros), and the bitmap for 2 as (1 zero, 2 ones, 2 zeros).
  • Demo query 1: where countryId = 1 or countryId = 4 — the bitmaps for country ids 1 and 4 are loaded, a bitwise OR is computed, and rows with a set bit are used for output generation.
  • Demo query 2: where countryId = 1 and productId = 4 — the bitmap for countryId 1 and the bitmap for productId 4 are loaded, a bitwise AND is computed, and rows with a set bit are used for output generation. (An AND of two values of the same column would always be empty, so the second condition must be on a different column.)
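The bitmap and run-length ideas above, sketched on the post’s example column (1, 2, 2, 3, 3); run-length pairs here are (bit, run length):

```python
column = [1, 2, 2, 3, 3]

# One bitmap per distinct value: bit i is 1 iff row i holds that value.
bitmaps = {v: [1 if x == v else 0 for x in column] for v in set(column)}
assert bitmaps[1] == [1, 0, 0, 0, 0]
assert bitmaps[2] == [0, 1, 1, 0, 0]

def run_length_encode(bits):
    """Collapse runs of equal bits into (bit, run_length) pairs."""
    runs = []
    for b in bits:
        if runs and runs[-1][0] == b:
            runs[-1][1] += 1
        else:
            runs.append([b, 1])
    return [(b, n) for b, n in runs]

print(run_length_encode(bitmaps[2]))  # [(0, 1), (1, 2), (0, 2)]

# WHERE value = 1 OR value = 3: bitwise OR of the two bitmaps.
matching = [a | b for a, b in zip(bitmaps[1], bitmaps[3])]
print(matching)  # [1, 0, 0, 1, 1]
```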

Sort Order in Column Storage:

  • If the values in a column were sorted, querying would be easier, but each column can’t be sorted independently: we need to maintain the structure in which the nth entry of every column belongs to the same row.
  • So, according to the most common query use case, we sort the rows by one particular column; rows that share the same value in this first sort column are then sorted by the next column used for querying, and so on.
  • Sorting also helps compression: sorted data has runs of repeated values, so we can use run-length encoding. This encoding helps the 1st sorted column the most, less so the 2nd.
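One way to sketch that sort-while-staying-aligned step: compute a single row ordering from the sort keys and apply it to every column. Column names are illustrative:

```python
country = [3, 1, 3, 1, 2]
product = [7, 9, 5, 8, 6]
price   = [10, 20, 30, 40, 50]

# One row order, sorted by (country, then product), applied to every
# column so the i-th entry of each column still refers to the same row.
order = sorted(range(len(country)), key=lambda i: (country[i], product[i]))
country = [country[i] for i in order]
product = [product[i] for i in order]
price   = [price[i] for i in order]

print(country)  # [1, 1, 2, 3, 3] -- long runs: compresses well with RLE
print(product)  # [8, 9, 6, 5, 7] -- second sort key: shorter runs
print(price)    # [40, 20, 50, 30, 10]
```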

Several Different Sort Orders:

  • There can be use cases where we need different columns’ sort orders.
  • Since the data store is replicated on different machines anyway, each replica can store the data sorted in a different column order.
  • In OLTP we don’t do this: data is sorted only by the primary index column, while secondary indexes just contain pointers to the data.

Aggregation: Data Cubes and Materialized Views:

  • Materialised View: Since the purpose of OLAP is to analyse data, we can pre-compute common aggregates over the raw data and answer queries from these pre-computed results. A data cube is a special case: a grid of aggregates grouped along multiple dimensions.
  • This makes writes slower, but in OLAP we don’t worry much about write latency.
  • The disadvantage is that one gets less flexibility than querying over the raw data.
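A minimal sketch of the materialized-view trade-off described above; the names (record_sale, sales_by_country) are illustrative:

```python
fact_sales = []        # raw event data
sales_by_country = {}  # "materialized view": country id -> total sales

def record_sale(country_id, price):
    """Writes get slightly slower: the view is updated on every insert."""
    fact_sales.append((country_id, price))
    sales_by_country[country_id] = sales_by_country.get(country_id, 0) + price

record_sale(1, 900)
record_sale(2, 400)
record_sale(1, 410)

# Reads the view covers no longer scan the raw data...
print(sales_by_country[1])  # 1310

# ...but queries it doesn't cover (e.g. only sales above 500) must
# fall back to scanning fact_sales -- the flexibility cost.
print(sum(p for c, p in fact_sales if p > 500))  # 900
```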

That’s it.

If you have any feedback, please feel free to comment or reply to the email.

If you like this and want to read more, please share this and subscribe to the Newsletter.

Find me @darshan_kadu
