One place for hosting & domains

      Understanding

      Understanding Managed Databases


      Introduction

      Secure, reliable data storage is a must for nearly every modern application. However, the infrastructure needed for a self-managed, on-premises database can be prohibitively expensive for many teams. Similarly, employees who have the skills and experience needed to maintain a production database effectively can be difficult to come by.

      The spread of cloud computing services has lowered the barriers to entry associated with provisioning a database, but many developers still lack the time or expertise needed to manage and tune a database to suit their needs. For this reason, many businesses are turning to managed database services to help them build and scale their databases in line with their growth.

      In this conceptual article, we will go over what managed databases are and how they can be beneficial to many organizations. We will also cover some practical considerations one should make before building their next application on top of a managed database solution.

      Managed Databases in a Nutshell

      A managed database is a cloud computing service in which the end user pays a cloud service provider for access to a database. Unlike a typical database, users don’t have to set up or maintain a managed database on their own; rather, it’s the provider’s responsibility to oversee the database’s infrastructure. This allows the user to focus on building their application instead of spending time configuring their database and keeping it up to date.

      The process of provisioning a managed database varies by provider, but in general it’s similar to that of any other cloud-based service. After registering an account and logging in to the dashboard, the user reviews the available database options — such as the database engine and cluster size — and then chooses the setup that’s right for them. After you provision the managed database, you can connect to it through a GUI or client and can then begin loading data and and integrating the database with your application.

      Managed data solutions simplify the process of provisioning and maintaining a database. Instead of running commands from a terminal to install and set one up, you can deploy a production-ready database with just a few clicks in your browser. By simplifying and automating database management, cloud providers make it easier for anyone, even novice database users, to build data-driven applications and websites. This was the result of a decades-long trend towards simplifying, automating, and abstracting various database management tasks, which was itself a response to pain points long felt by database administrators.

      Pain Points of On-Premises and Self-Managed Databases

      Prior to the rise of the cloud computing model, any organization in need of a data center had to supply all the time, space, and resources that went into setting one up. Once their database was up and running, they also had to maintain the hardware, keep its software updated, hire a team to manage the database, and train their employees on how to use it.

      As cloud computing services grew in popularity in the 2000s, it became easier and more affordable to provision server infrastructure, since the hardware and the space required for it no longer had to be owned or managed by those using it. Likewise, setting up a database entirely within the cloud became far less difficult; a business or developer would just have to requisition a server, install and configure their chosen database management system, and begin storing data.

      While cloud computing did make the process of setting up a traditional database easier, it didn’t address all of its problems. For instance, in the cloud it can still be difficult to pinpoint the ideal size of a database’s infrastructure footprint before it begins collecting data. This is important because cloud consumers are charged based on the resources they consume, and they risk paying for more than what they require if the server they provision is larger than necessary. Additionally, as with traditional on-premises databases, managing one’s database in the cloud can be a costly endeavor. Depending on your needs, you may still need to hire an experienced database administrator or spend a significant amount of time and money training your existing staff to manage your database effectively.

      Many of these issues are compounded for smaller organizations and independent developers. While a large business can usually afford to hire employees with a deep knowledge of databases, smaller teams usually have fewer resources available, leaving them with only their existing institutional knowledge. This makes tasks like replication, migrations, and backups all the more difficult and time consuming, as they can require a great deal of on-the-job learning as well as trial and error.

      Managed databases help to resolve these pain points with a host of benefits to businesses and developers. Let’s walk through some of these benefits and how they can impact development teams.

      Benefits of Managed Databases

      Managed database services can help to reduce many of the headaches associated with provisioning and managing a database. For one thing, developers build applications on top of managed database services to drastically speed up the process of provisioning a database server. With a self-managed solution, you must obtain a server (either on-premises or in the cloud), connect to it from a client or terminal, configure and secure it, and then install and set up the database management software before you can begin storing data. With a managed database, you only have to decide on the initial size of the database server, configure any additional provider-specific options, and you’ll have a new database ready to integrate with your app or website. This can usually be done in just a few minutes through the provider’s user interface.

      Another appeal of managed databases is automation. Self-managed databases can consume a large amount of an organization’s resources because its employees have to perform every administrative task — from scaling to performing updates, running migrations, and creating backups — manually. With a managed database, however, these and other tasks are done either automatically or on-demand, which markedly reduces the risk of human error.

      This relates to the fact that managed database services help to streamline the process of database scaling. Scaling a self-managed database can be very time- and resource-intensive. Whether you choose sharding, replication, load balancing, or something else as your scaling strategy, if you manage the infrastructure yourself then you’re responsible for ensuring that no data is lost in the process and that the application will continue to work properly. If you integrate your application with a managed database service, however, you can scale the database cluster on demand. Rather than having to work out the optimal server size or CPU usage beforehand, you can quickly provision more resources on-the-fly. This helps you avoid using unnecessary resources, meaning you also won’t pay for what you don’t need.

      Managed solutions tend to have built-in high-availability. In the context of cloud computing, a service is said to be highly available if it is stable and likely to run without failure for long periods of time. Most reputable cloud providers’ products come with a service level agreement (SLA), a commitment between the provider and its customers that guarantees the availability and reliability of their services. A typical SLA will specify how much downtime the customer should expect, and many also define the compensation for customers if these service levels are not met. This provides assurance for the customer that their database won’t crash and, if it does, they can at least expect some kind of reparation from the provider.

      In general, managed databases simplify the tasks associated with provisioning and maintaining a database. Depending on the provider, you or your team will still likely need some level of experience working with databases in order to provision a database and interact with it as you build and scale your application. Ultimately, though, the database-specific experience needed to administer a managed database will be much less than with self-managed solution.

      Of course, managed databases aren’t able to solve every problem, and may prove to be a less-than-ideal choice for some. Next, we’ll go over a few of the potential drawbacks one should consider before provisioning a managed database.

      Practical Considerations

      A managed database service can ease the stress of deploying and maintaining a database, but there are still a few things to keep in mind before committing to one. Recall that a principal draw of managed databases is that they abstract away most of the more tedious aspects of database administration. To this end, a managed database provider aims to deliver a rudimentary database that will satisfy the most common use cases. Accordingly, their database offerings won’t feature tons of customization options or the unique features included in more specialized database software. Because of this, you won’t have as much freedom to tailor your database and you’ll be limited to what the cloud provider has to offer.

      A managed database is almost always more expensive than a self-managed one. This makes sense, since you’re paying for the cloud provider to support you in managing the database, but it can be a cause for concern for teams with limited resources. Moreover, pricing for managed databases is usually based on how much storage and RAM the database uses, how many reads it handles, and how many backups of the database the user creates. Likewise, any application using a managed database service that handle large amounts of data or traffic will be more expensive than if it were to use a self-managed cloud database.

      One should also reflect on the impact switching to a managed database will have on their internal workflows and whether or not they’ll be able to adjust to those changes. Every provider differs, and depending on their SLA they may shoulder responsibility for only some administration tasks, which would be problematic for developers looking for a full-service solution. On the other hand, some providers could have a prohibitively restrictive SLA or make the customer entirely dependent on the provider in question, a situation known as vendor lock-in.

      Lastly, and perhaps most importantly, one should carefully consider whether or not any managed database service they’re considering using will meet their security needs. All databases, including on-premises databases, are prone to certain security threats, like SQL injection attacks or data leaks. However, the security dynamic is far different for databases hosted in the cloud. Managed database users can’t control the physical location of their data or who has access to it, nor can they ensure compliance with specific security standards. This can be especially problematic if your client has heightened security needs.

      To illustrate, imagine that you’re hired by a bank to build an application where its clients can access financial records and make payments. The bank may stipulate that the app must have data at rest encryption and appropriately scoped user permissions, and that it must be compliant with certain regulatory standards like PCI DSS. Not all managed database providers adhere to the same regulatory standards or maintain the same security practices, and they’re unlikely to adopt new standards or practices for just one of their customers. For this reason, it’s critical that you ensure any managed database provider you rely on for such an application is able to meet your security needs as well as the needs of your clients.

      Conclusion

      Managed databases have many features that appeal to a wide variety of businesses and developers, but a managed database may not solve every problem or suit everyone’s needs. Some may find that a managed database’s limited feature set and configuration options, increased cost, and reduced flexibility outweigh any of its potential advantages. However, compelling benefits like ease of use, scalability, automated backups and upgrades, and high availability have led to increased adoption of managed database solutions in a variety of industries.

      If you’re interested in learning more about DigitalOcean Managed Databases, we encourage you to check out our Managed Databases product documentation.



      Source link

      Understanding Database Sharding


      Introduction

      Any application or website that sees significant growth will eventually need to scale in order to accommodate increases in traffic. For data-driven applications and websites, it’s critical that scaling is done in a way that ensures the security and integrity of their data. It can be difficult to predict how popular a website or application will become or how long it will maintain that popularity, which is why some organizations choose a database architecture that allows them to scale their databases dynamically.

      In this conceptual article, we will discuss one such database architecture: sharded databases. Sharding has been receiving lots of attention in recent years, but many don’t have a clear understanding of what it is or the scenarios in which it might make sense to shard a database. We will go over what sharding is, some of its main benefits and drawbacks, and also a few common sharding methods.

      What is Sharding?

      Sharding is a database architecture pattern related to horizontal partitioning — the practice of separating one table’s rows into multiple different tables, known as partitions. Each partition has the same schema and columns, but also entirely different rows. Likewise, the data held in each is unique and independent of the data held in other partitions.

      It can be helpful to think of horizontal partitioning in terms of how it relates to vertical partitioning. In a vertically-partitioned table, entire columns are separated out and put into new, distinct tables. The data held within one vertical partition is independent from the data in all the others, and each holds both distinct rows and columns. The following diagram illustrates how a table could be partitioned both horizontally and vertically:

      Example tables showing horizontal and vertical partitioning

      Sharding involves breaking up one’s data into two or more smaller chunks, called logical shards. The logical shards are then distributed across separate database nodes, referred to as physical shards, which can hold multiple logical shards. Despite this, the data held within all the shards collectively represent an entire logical dataset.

      Database shards exemplify a shared-nothing architecture. This means that the shards are autonomous; they don’t share any of the same data or computing resources. In some cases, though, it may make sense to replicate certain tables into each shard to serve as reference tables. For example, let’s say there’s a database for an application that depends on fixed conversion rates for weight measurements. By replicating a table containing the necessary conversion rate data into each shard, it would help to ensure that all of the data required for queries is held in every shard.

      Oftentimes, sharding is implemented at the application level, meaning that the application includes code that defines which shard to transmit reads and writes to. However, some database management systems have sharding capabilities built in, allowing you to implement sharding directly at the database level.

      Given this general overview of sharding, let’s go over some of the positives and negatives associated with this database architecture.

      Benefits of Sharding

      The main appeal of sharding a database is that it can help to facilitate horizontal scaling, also known as scaling out. Horizontal scaling is the practice of adding more machines to an existing stack in order to spread out the load and allow for more traffic and faster processing. This is often contrasted with vertical scaling, otherwise known as scaling up, which involves upgrading the hardware of an existing server, usually by adding more RAM or CPU.

      It’s relatively simple to have a relational database running on a single machine and scale it up as necessary by upgrading its computing resources. Ultimately, though, any non-distributed database will be limited in terms of storage and compute power, so having the freedom to scale horizontally makes your setup far more flexible.

      Another reason why some might choose a sharded database architecture is to speed up query response times. When you submit a query on a database that hasn’t been sharded, it may have to search every row in the table you’re querying before it can find the result set you’re looking for. For an application with a large, monolithic database, queries can become prohibitively slow. By sharding one table into multiple, though, queries have to go over fewer rows and their result sets are returned much more quickly.

      Sharding can also help to make an application more reliable by mitigating the impact of outages. If your application or website relies on an unsharded database, an outage has the potential to make the entire application unavailable. With a sharded database, though, an outage is likely to affect only a single shard. Even though this might make some parts of the application or website unavailable to some users, the overall impact would still be less than if the entire database crashed.

      Drawbacks of Sharding

      While sharding a database can make scaling easier and improve performance, it can also impose certain limitations. Here, we’ll discuss some of these and why they might be reasons to avoid sharding altogether.

      The first difficulty that people encounter with sharding is the sheer complexity of properly implementing a sharded database architecture. If done incorrectly, there’s a significant risk that the sharding process can lead to lost data or corrupted tables. Even when done correctly, though, sharding is likely to have a major impact on your team’s workflows. Rather than accessing and managing one’s data from a single entry point, users must manage data across multiple shard locations, which could potentially be disruptive to some teams.

      One problem that users sometimes encounter after having sharded a database is that the shards eventually become unbalanced. By way of example, let’s say you have a database with two separate shards, one for customers whose last names begin with letters A through M and another for those whose names begin with the letters N through Z. However, your application serves an inordinate amount of people whose last names start with the letter G. Accordingly, the A-M shard gradually accrues more data than the N-Z one, causing the application to slow down and stall out for a significant portion of your users. The A-M shard has become what is known as a database hotspot. In this case, any benefits of sharding the database are canceled out by the slowdowns and crashes. The database would likely need to be repaired and resharded to allow for a more even data distribution.

      Another major drawback is that once a database has been sharded, it can be very difficult to return it to its unsharded architecture. Any backups of the database made before it was sharded won’t include data written since the partitioning. Consequently, rebuilding the original unsharded architecture would require merging the new partitioned data with the old backups or, alternatively, transforming the partitioned DB back into a single DB, both of which would be costly and time consuming endeavors.

      A final disadvantage to consider is that sharding isn’t natively supported by every database engine. For instance, PostgreSQL does not include automatic sharding as a feature, although it is possible to manually shard a PostgreSQL database. There are a number of Postgres forks that do include automatic sharding, but these often trail behind the latest PostgreSQL release and lack certain other features. Some specialized database technologies — like MySQL Cluster or certain database-as-a-service products like MongoDB Atlas — do include auto-sharding as a feature, but vanilla versions of these database management systems do not. Because of this, sharding often requires a “roll your own” approach. This means that documentation for sharding or tips for troubleshooting problems are often difficult to find.

      These are, of course, only some general issues to consider before sharding. There may be many more potential drawbacks to sharding a database depending on its use case.

      Now that we’ve covered a few of sharding’s drawbacks and benefits, we will go over a few different architectures for sharded databases.

      Sharding Architectures

      Once you’ve decided to shard your database, the next thing you need to figure out is how you’ll go about doing so. When running queries or distributing incoming data to sharded tables or databases, it’s crucial that it goes to the correct shard. Otherwise, it could result in lost data or painfully slow queries. In this section, we’ll go over a few common sharding architectures, each of which uses a slightly different process to distribute data across shards.

      Key Based Sharding

      Key based sharding, also known as hash based sharding, involves using a value taken from newly written data — such as a customer’s ID number, a client application’s IP address, a ZIP code, etc. — and plugging it into a hash function to determine which shard the data should go to. A hash function is a function that takes as input a piece of data (for example, a customer email) and outputs a discrete value, known as a hash value. In the case of sharding, the hash value is a shard ID used to determine which shard the incoming data will be stored on. Altogether, the process looks like this:

      Key based sharding example diagram

      To ensure that entries are placed in the correct shards and in a consistent manner, the values entered into the hash function should all come from the same column. This column is known as a shard key. In simple terms, shard keys are similar to primary keys in that both are columns which are used to establish a unique identifier for individual rows. Broadly speaking, a shard key should be static, meaning it shouldn’t contain values that might change over time. Otherwise, it would increase the amount of work that goes into update operations, and could slow down performance.

      While key based sharding is a fairly common sharding architecture, it can make things tricky when trying to dynamically add or remove additional servers to a database. As you add servers, each one will need a corresponding hash value and many of your existing entries, if not all of them, will need to be remapped to their new, correct hash value and then migrated to the appropriate server. As you begin rebalancing the data, neither the new nor the old hashing functions will be valid. Consequently, your server won’t be able to write any new data during the migration and your application could be subject to downtime.

      The main appeal of this strategy is that it can be used to evenly distribute data so as to prevent hotspots. Also, because it distributes data algorithmically, there’s no need to maintain a map of where all the data is located, as is necessary with other strategies like range or directory based sharding.

      Range Based Sharding

      Range based sharding involves sharding data based on ranges of a given value. To illustrate, let’s say you have a database that stores information about all the products within a retailer’s catalog. You could create a few different shards and divvy up each products’ information based on which price range they fall into, like this:

      Range based sharding example diagram

      The main benefit of range based sharding is that it’s relatively simple to implement. Every shard holds a different set of data but they all have an identical schema as one another, as well as the original database. The application code just reads which range the data falls into and writes it to the corresponding shard.

      On the other hand, range based sharding doesn’t protect data from being unevenly distributed, leading to the aforementioned database hotspots. Looking at the example diagram, even if each shard holds an equal amount of data the odds are that specific products will receive more attention than others. Their respective shards will, in turn, receive a disproportionate number of reads.

      Directory Based Sharding

      To implement directory based sharding, one must create and maintain a lookup table that uses a shard key to keep track of which shard holds which data. In a nutshell, a lookup table is a table that holds a static set of information about where specific data can be found. The following diagram shows a simplistic example of directory based sharding:

      Directory based sharding example diagram

      Here, the Delivery Zone column is defined as a shard key. Data from the shard key is written to the lookup table along with whatever shard each respective row should be written to. This is similar to range based sharding, but instead of determining which range the shard key’s data falls into, each key is tied to its own specific shard. Directory based sharding is a good choice over range based sharding in cases where the shard key has a low cardinality and it doesn’t make sense for a shard to store a range of keys. Note that it’s also distinct from key based sharding in that it doesn’t process the shard key through a hash function; it just checks the key against a lookup table to see where the data needs to be written.

      The main appeal of directory based sharding is its flexibility. Range based sharding architectures limit you to specifying ranges of values, while key based ones limit you to using a fixed hash function which, as mentioned previously, can be exceedingly difficult to change later on. Directory based sharding, on the other hand, allows you to use whatever system or algorithm you want to assign data entries to shards, and it’s relatively easy dynamically add shards using this approach.

      While directory based sharding is the most flexible of the sharding methods discussed here, the need to connect to the lookup table before every query or write can have a detrimental impact on an application’s performance. Furthermore, the lookup table can become a single point of failure: if it becomes corrupted or otherwise fails, it can impact one’s ability to write new data or access their existing data.

      Should I Shard?

      Whether or not one should implement a sharded database architecture is almost always a matter of debate. Some see sharding as an inevitable outcome for databases that reach a certain size, while others see it as a headache that should be avoided unless it’s absolutely necessary, due to the operational complexity that sharding adds.

      Because of this added complexity, sharding is usually only performed when dealing with very large amounts of data. Here are some common scenarios where it may be beneficial to shard a database:

      • The amount of application data grows to exceed the storage capacity of a single database node.
      • The volume of writes or reads to the database surpasses what a single node or its read replicas can handle, resulting in slowed response times or timeouts.
      • The network bandwidth required by the application outpaces the bandwidth available to a single database node and any read replicas, resulting in slowed response times or timeouts.

      Before sharding, you should exhaust all other options for optimizing your database. Some optimizations you might want to consider include:

      • Setting up a remote database. If you’re working with a monolithic application in which all of its components reside on the same server, you can improve your database’s performance by moving it over to its own machine. This doesn’t add as much complexity as sharding since the database’s tables remain intact. However, it still allows you to vertically scale your database apart from the rest of your infrastructure.
      • Implementing caching. If your application’s read performance is what’s causing you trouble, caching is one strategy that can help to improve it. Caching involves temporarily storing data that has already been requested in memory, allowing you to access it much more quickly later on.
      • Creating one or more read replicas. Another strategy that can help to improve read performance, this involves copying the data from one database server (the primary server) over to one or more secondary servers. Following this, every new write goes to the primary before being copied over to the secondaries, while reads are made exclusively to the secondary servers. Distributing reads and writes like this keeps any one machine from taking on too much of the load, helping to prevent slowdowns and crashes. Note that creating read replicas involves more computing resources and thus costs more money, which could be a significant constraint for some.
      • Upgrading to a larger server. In most cases, scaling up one’s database server to a machine with more resources requires less effort than sharding. As with creating read replicas, an upgraded server with more resources will likely cost more money. Accordingly, you should only go through with resizing if it truly ends up being your best option.

      Bear in mind that if your application or website grows past a certain point, none of these strategies will be enough to improve performance on their own. In such cases, sharding may indeed be the best option for you.

      Conclusion

      Sharding can be a great solution for those looking to scale their database horizontally. However, it also adds a great deal of complexity and creates more potential failure points for your application. Sharding may be necessary for some, but the time and resources needed to create and maintain a sharded architecture could outweigh the benefits for others.

      By reading this conceptual article, you should have a clearer understanding of the pros and cons of sharding. Moving forward, you can use this insight to make a more informed decision about whether or not a sharded database architecture is right for your application.



      Source link