The scale at which we operate at callstats grew 30x during the lockdown, which meant that the callstats infrastructure required an immense amount of input/output (I/O) and processing power at the storage level. Our costs grew astronomically as we expanded our storage capacity, and this would have been untenable without passing that storage cost on to you, the customer. Additionally, we wanted to make sure that the data-load performance of 90% of conference queries was unaffected by the new design.
Over the past few months, we started working on v3 of our infrastructure, and as part of that revamped architecture, we decided to make use of hot, warm, and cold storage. In this blogpost, we will not cover cold storage, which is used purely for data archival.
The old design is as follows. The WebRTC endpoints (end user, selective forwarding unit, media control unit) stream events to Kafka, where they are consumed by the data-processing applications. For example, stats and events are encrypted, obfuscated, and aggregated at a per-conference level and stored in MongoDB. All the aggregated data shown in the UI is served from MongoDB.
We decided to use a hot-warm architecture to save storage cost, trading off some latency. In the new architecture, conferences are stored simultaneously in hot and warm storage as soon as the data ingestion pipeline encrypts the metadata (see our previous blog on data privacy).
This means that any data that was received or accessed recently is served from hot storage, while all other data within the data retention period is imported from warm into hot storage and then served from hot storage. We decided to keep the hot storage as-is, while the warm data can be stored in anything that is less expensive and easy to access; Amazon S3 was our choice.
Presently, the infrastructure classifies data as hot or warm based on the age of the raw data and the recency of access. As stated before, we picked the age threshold N_hot (the duration in hot storage) based on the time ranges of queries made by our customers, such that the change would not affect 90% of queries.
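The classification rule can be sketched roughly as follows. This is a minimal illustration, not our production code: the function name `storage_tier` and the two-week `N_HOT` value are hypothetical, standing in for the threshold we actually derived from customer query patterns.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical value of N_hot; the real duration was chosen so that
# 90% of customer queries fall inside the hot window.
N_HOT = timedelta(days=14)

def storage_tier(received_at, last_accessed_at, now):
    """Classify a conference as 'hot' or 'warm' by the age of its raw
    data and the recency of access, as described above."""
    if now - received_at <= N_HOT or now - last_accessed_at <= N_HOT:
        return "hot"
    return "warm"
```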
Pipeline: How does it work?
Hot storage holds all raw data that was recently accessed or recently received (within the N_hot time duration), while warm storage holds everything up to the retention period. When the customer wants to access live calls or conference data that is still within the hot storage period (N_hot), the UI is served directly from hot storage with minimal latency, as the data is readily available. The only delay in this case is the data-access latency, which depends on your distance from the data storage location (for example, whether the data is stored in the US or EU) and the size of the conference (its participant count and duration).
When the customer requests a conference that is no longer in hot storage, the data related to that conference is imported from warm to hot storage (the data-import latency) and subsequently served from hot storage to the UI. In this case, you’ll notice longer delays because of both the data-import latency and the data-access latency.
It should be noted that each time a particular conference is accessed from hot storage, its delete timer is extended. This means that frequently accessed conferences are cached for longer, so subsequent queries to view them are not penalized.
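The read path described above can be sketched as follows. This is a simplified model, assuming plain in-memory dicts stand in for MongoDB (hot) and S3 (warm); the names `get_conference`, `hot`, `warm`, and the `N_HOT` value are all illustrative rather than taken from our actual services.

```python
from datetime import datetime, timedelta, timezone

N_HOT = timedelta(days=14)            # hypothetical hot-storage window
hot = {}                              # stand-in for MongoDB: id -> {"data", "expires_at"}
warm = {"conf-42": {"stats": "..."}}  # stand-in for S3: holds everything up to retention

def get_conference(conference_id, now):
    """Serve a conference from hot storage, importing from warm storage
    on a miss and extending the delete timer on every access."""
    entry = hot.get(conference_id)
    if entry is None:
        # Miss: import from warm to hot (this is the data-import latency
        # the customer notices on older conferences).
        entry = {"data": warm[conference_id]}
        hot[conference_id] = entry
    # Every access pushes the delete timer forward, so frequently viewed
    # conferences stay cached and later queries are not penalized.
    entry["expires_at"] = now + N_HOT
    return entry["data"]
```

A second request for the same conference within the window is then a pure hot-storage hit, paying only the data-access latency.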
To put this architecture in place, we built an application that routinely cleans up data that the dashboard has not requested in the last N_hot period. When the UI requests data, we first check whether it is available in MongoDB; if not, it is imported from S3. In addition to the cost savings, data compliance and privacy require that our storage be kept clean of old data.
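The cleanup application can be sketched along these lines, again using an in-memory dict as a stand-in for MongoDB; `clean_hot_storage` is a hypothetical name, and the real service runs on a schedule against the actual database.

```python
from datetime import datetime, timedelta, timezone

def clean_hot_storage(hot, now):
    """Evict conferences whose delete timer has lapsed, i.e. that the
    dashboard has not requested within the last N_hot period.

    `hot` maps conference ids to entries carrying an 'expires_at'
    timestamp; evicted ids are returned. The data itself remains in
    warm storage (S3) until the retention period ends."""
    expired = [cid for cid, entry in hot.items()
               if entry["expires_at"] <= now]
    for cid in expired:
        del hot[cid]
    return expired
```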
Given the long history of analytics, it is easy to fall back into old provisioning habits, where all data sits in traditional databases and retrieval takes ages, especially when it comes to cloud storage. Storage can have huge implications for your cloud bill. At callstats, we overcame this problem with a hot-warm-cold architecture.
We rolled out this update recently. If you experience any issues, please let us know via the help icon within the dashboard.
This blogpost was contributed by Gowtham, Eljas, and Dejan.