Polyglot persistence refers to using different types of databases within a single application. This trend is partly driven by microservices, which dictate isolating the data store of each service. Another factor behind its widespread adoption is the proliferation of new database options such as graph, document, and key-value stores, each best suited to specific use cases.
Microservices in particular nicely encapsulate the persistence mechanism, enabling teams to choose any of these options, for instance a time series database for handling sensor data in IoT use cases. The additional complexity of managing multiple data stores is compensated by improvements in scalability and, more importantly, higher performance due to the more efficient internal data structures these databases offer. Overall, polyglot persistence is an idea that promotes using multiple databases, each appropriate for a specific use case, within a single application domain.
Some of the key benefits of using polyglot persistence in an application are:
Flexibility - Development teams can choose data stores based on the use case; for example, a graph database like Neo4j or AWS Neptune can be selected to model entity relationships. With this approach, the application layer need not build abstractions on top of a relational database to represent complex relationships through combinations of joins.
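To make the contrast concrete, here is a minimal sketch of a graph traversal expressed in Cypher, where the `Person` and `KNOWS` labels and the connection details are illustrative assumptions, not taken from the article. A multi-hop relationship like this would need one self-join per hop in SQL.

```python
# Sketch: querying entity relationships with Cypher instead of SQL joins.
# The Person/KNOWS labels and connection details below are assumptions.

def friends_of_friends_query() -> str:
    """Cypher that walks two KNOWS hops; the relational equivalent
    would need a self-join of the friendship table per hop."""
    return (
        "MATCH (p:Person {name: $name})-[:KNOWS*2]->(fof:Person) "
        "RETURN DISTINCT fof.name"
    )

query = friends_of_friends_query()

# With the official neo4j Python driver this would run roughly as:
# from neo4j import GraphDatabase
# driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "pass"))
# with driver.session() as session:
#     rows = session.run(query, name="alice")
print(query)
```

The variable-length pattern `[:KNOWS*2]` is what the application layer would otherwise have to simulate with join abstractions.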
Decentralization - In the absence of polyglot persistence, applications tend to centralize the database access layer around a single database instance, which leads to complicated issues of connection management and query tuning. Polyglot persistence inherently promotes isolation of infrastructure resources and loose coupling.
Scalability - It's hard to horizontally scale relational databases using cloud-specific infrastructure primitives. AWS, Azure, and other public clouds support multiple categories of purpose-built data stores such as AWS DynamoDB, Azure Time Series Insights, and Azure Cosmos DB. Managed services like these make it easy to achieve scaling and uptime guarantees, provided the data model is amenable to change.
Supportability - Relational databases typically need upfront capacity estimation so that enough resources can be provisioned beforehand. One of the tenets of cloud-native architecture is “stop guessing capacity”, and new-age databases are designed with such principles in mind. Serverless flavours of AWS managed services like DynamoDB or Timestream support on-demand scaling, which is cost effective and simplifies operational concerns.
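As a sketch of the “stop guessing capacity” tenet, the snippet below builds the arguments for creating a DynamoDB table in on-demand billing mode; the table and attribute names are assumptions, and the actual AWS call is left as a comment.

```python
# Sketch: provisioning a DynamoDB table in on-demand mode so no read/write
# capacity has to be estimated upfront. Table/attribute names are assumptions.

def on_demand_table_spec(table_name: str) -> dict:
    """Build create_table arguments; PAY_PER_REQUEST is DynamoDB's
    on-demand billing mode (no provisioned capacity units)."""
    return {
        "TableName": table_name,
        "AttributeDefinitions": [
            {"AttributeName": "device_id", "AttributeType": "S"},
            {"AttributeName": "ts", "AttributeType": "N"},
        ],
        "KeySchema": [
            {"AttributeName": "device_id", "KeyType": "HASH"},
            {"AttributeName": "ts", "KeyType": "RANGE"},
        ],
        "BillingMode": "PAY_PER_REQUEST",  # on-demand: scales with traffic
    }

spec = on_demand_table_spec("sensor_readings")
# boto3.client("dynamodb").create_table(**spec)  # actual AWS call
```

Switching the same table back to provisioned capacity is just a change of `BillingMode`, which is why the operational concern largely disappears.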
Some Patterns of Usage
Change Data Capture (CDC) is a commonly used pattern to replicate data across data stores. Debezium is a notable example: it offers pre-built connectors for MySQL and PostgreSQL and simplifies connecting relational databases to multiple applications, saving the hassle of writing specialized code to connect to and parse the write-ahead logs (WAL) of different databases.
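A Debezium connector is typically registered by POSTing a JSON config to the Kafka Connect REST API. The sketch below assumes hypothetical hostnames, credentials, and table names, and the exact config keys can vary between Debezium versions.

```python
# Sketch: registering a Debezium MySQL connector via Kafka Connect's REST
# API. Host, credentials and table names are placeholder assumptions.
import json

def mysql_cdc_connector(name: str) -> dict:
    return {
        "name": name,
        "config": {
            "connector.class": "io.debezium.connector.mysql.MySqlConnector",
            "database.hostname": "mysql.internal",   # assumed host
            "database.port": "3306",
            "database.user": "debezium",
            "database.password": "******",
            "database.server.id": "184054",
            "topic.prefix": "inventory",              # Kafka topic namespace
            "table.include.list": "inventory.orders", # tables to capture
        },
    }

payload = json.dumps(mysql_cdc_connector("orders-cdc"))
# requests.post("http://connect:8083/connectors", data=payload,
#               headers={"Content-Type": "application/json"})
```

Once registered, the connector tails the binlog and publishes each row change as an event, with no WAL-parsing code in the application.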
The data pipeline is another pattern observed in the polyglot ecosystem, where multiple databases and applications are connected by event streams. Messaging infrastructure such as Apache Kafka topics serves as the glue that facilitates data movement across multiple systems.
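A minimal sketch of that glue, assuming a hypothetical topic name and message shape; the kafka-python producer call is commented out since it needs a running broker.

```python
# Sketch: JSON-over-Kafka as pipeline glue between systems. Topic name
# and message shape are assumptions.
import json

def encode_event(device_id: str, reading: float) -> bytes:
    """Serialize an event so any downstream consumer can decode it
    without knowing which system produced it."""
    return json.dumps({"device_id": device_id, "reading": reading}).encode("utf-8")

event = encode_event("sensor-7", 21.4)
# from kafka import KafkaProducer
# producer = KafkaProducer(bootstrap_servers="kafka:9092")
# producer.send("sensor-events", event)  # one topic, many consumers
```

Because the topic decouples producer from consumers, a time series store, a warehouse loader, and an alerting service can all subscribe to the same stream independently.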
Eventual consistency is a typical pattern, as read and write requests are served by different databases. Replication across databases takes from a few milliseconds upward, depending on the number of intermediate applications. The compute path is also separated by the frequency of data arrival; in IoT parlance, a hot path handles data items sent by sensors while a cold path handles feeds meant for analytics.
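The hot/cold split can be sketched as a simple routing rule; the 60-second freshness threshold and path names below are illustrative assumptions.

```python
# Sketch: routing by data arrival recency. The 60-second window and the
# "hot"/"cold" labels are illustrative assumptions.
import time

HOT_WINDOW_SECONDS = 60  # recent sensor readings go to the hot path

def route(event_ts: float, now: float) -> str:
    """Fresh readings feed real-time processing (hot path); older feeds
    go to the analytics (cold) path."""
    return "hot" if now - event_ts <= HOT_WINDOW_SECONDS else "cold"

now = time.time()
assert route(now - 5, now) == "hot"       # live sensor reading
assert route(now - 3600, now) == "cold"   # hour-old analytics feed
```

In practice the hot path might land in a time series store while the cold path drains into a data lake, which is where eventual consistency between the two becomes visible.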
A relational database represents a centralized pattern that handles most of the business rules, whereas the polyglot ecosystem promotes distributed systems, as exemplified by the tools referenced here such as Debezium and Kafka.
Applicability for IoT
Most IoT platforms can benefit from polyglot persistence as described below:
Sensor data can be modelled as JSON in relational or NoSQL databases, but doing so adds entity conversion to the application layer. With time series databases like KairosDB or InfluxDB, the ingestion layer can push sensor data to the data store with minimal or no conversion. They also natively support querying data by timestamp, such as fetching the data for the last 10 minutes.
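As a sketch of that minimal-conversion path, the snippet below formats a reading in InfluxDB's line protocol and defines a Flux query over the last 10 minutes; the measurement, tag, and bucket names are assumptions.

```python
# Sketch: InfluxDB ingestion with no entity conversion, plus a
# timestamp-based query. Measurement/tag/bucket names are assumptions.

def to_line_protocol(device_id: str, temperature: float, ts_ns: int) -> str:
    """InfluxDB line protocol: measurement,tag=... field=... timestamp(ns)."""
    return f"temperature,device={device_id} value={temperature} {ts_ns}"

# Flux query for the last 10 minutes of data in the assumed bucket.
LAST_10_MIN = 'from(bucket: "sensors") |> range(start: -10m)'

line = to_line_protocol("sensor-7", 21.4, 1700000000000000000)
# With the influxdb-client package this would roughly be:
# client = influxdb_client.InfluxDBClient(url=..., token=..., org=...)
# client.write_api().write(bucket="sensors", record=line)
# client.query_api().query(LAST_10_MIN)
```

The sensor payload maps directly onto tags and fields, so no intermediate entity classes are needed between ingestion and storage.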
The ingestion layer is usually handled by AWS Kinesis or Azure Event Hubs, which act like a mini data store for streaming analytics. Kinesis also supports a SQL-like language to aggregate data received in a tumbling window of the last few seconds or minutes. Kafka offers ksqlDB, an event streaming database, to execute stateless and stateful operations.
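A tumbling-window aggregation in ksqlDB looks roughly like the statement below; the stream and column names are assumptions, and the statement would be submitted to ksqlDB's REST `/ksql` endpoint.

```python
# Sketch: a ksqlDB statement aggregating a stream over a one-minute
# tumbling window. Stream and column names are assumptions.

TUMBLING_AVG = """
CREATE TABLE avg_temperature AS
  SELECT device_id, AVG(reading) AS avg_reading
  FROM sensor_stream
  WINDOW TUMBLING (SIZE 1 MINUTE)
  GROUP BY device_id;
"""

# Submitted over ksqlDB's REST API, roughly as:
# requests.post("http://ksqldb:8088/ksql",
#               json={"ksql": TUMBLING_AVG, "streamsProperties": {}})
print(TUMBLING_AVG.strip())
```

This is a stateful operation: ksqlDB materializes the per-device averages as a continuously updated table rather than a one-off query result.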
Device upgrade processes entail large files, which can be handled by blob stores like AWS S3 or Azure Blob Storage. Blob store APIs also simplify attaching metadata such as the software update version. Relational databases can complement this metadata management when there is a multitude of device types.
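A sketch of attaching version metadata to a firmware upload, assuming hypothetical bucket, key, and metadata names; the actual S3 call is shown as a comment.

```python
# Sketch: uploading a firmware image with its version attached as S3
# object metadata. Bucket, key and metadata keys are assumptions.

def firmware_put_request(bucket: str, key: str, version: str) -> dict:
    """Arguments for S3 put_object; the Metadata dict is stored with the
    object and returned on later head_object/get_object calls."""
    return {
        "Bucket": bucket,
        "Key": key,
        "Body": b"firmware bytes go here",
        "Metadata": {"software-version": version, "device-type": "thermostat"},
    }

req = firmware_put_request("device-firmware", "thermo/v2.1.bin", "2.1.0")
# boto3.client("s3").put_object(**req)  # actual AWS upload
```

Devices can then check the `software-version` metadata with a cheap HEAD request before deciding to download the full image.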
Business applications can process sensor data and park it in a data warehouse or data lake for further processing. Snowflake or Redshift can serve as the data warehouse powering visualization tools like Tableau or MicroStrategy. AWS and Azure offer standard templates to build data lakes on top of affordable object-based storage (e.g. S3) that can also serve machine learning use cases.
Polyglot persistence is quite effective in today’s age of data-intensive applications. It simplifies meeting uptime guarantees, achieving high performance, and retaining the flexibility to scale. Having said that, organizations need to define checklists mapping use cases to database options to ensure standardization across teams. These checklists can act as guardrails for teams choosing, say, a graph database from AWS Neptune and Neo4j, or a time series option from AWS Timestream, Azure Time Series Insights, or InfluxDB.