One place for hosting & domains


      How To Migrate a MySQL Database to PostgreSQL Using pgLoader


      PostgreSQL, also known as “Postgres,” is an open-source relational database management system (RDBMS). It has seen a drastic growth in popularity in recent years, with many developers and companies migrating their data to Postgres from other database solutions.

      The prospect of migrating a database can be intimidating, especially when migrating from one database management system to another. pgLoader is an open-source database migration tool that aims to simplify the process of migrating to PostgreSQL. It supports migrations from several file types and RBDMSs — including MySQL and SQLite — to PostgreSQL.

      This tutorial provides instructions on how to install pgLoader and use it to migrate a remote MySQL database to PostgreSQL over an SSL connection. Near the end of the tutorial, we will also briefly touch on a few different migration scenarios where pgLoader may be useful.


      To complete this tutorial, you’ll need the following:

      • Access to two servers, each running Ubuntu 18.04. Both servers should have a firewall and a non-root user with sudo privileges configured. To set these up, you can follow our Initial Server Setup guide for Ubuntu 18.04.
      • MySQL installed on one of the servers. To set this up, follow Steps 1, 2, and 3 of our guide on How To Install MySQL on Ubuntu 18.04. Please note that in order to complete all the prerequisite tutorials linked here, you will need to configure your root MySQL user to authenticate with a password, as described in Step 3 of the MySQL installation guide.
      • PostgreSQL installed on the other server. To set this up, complete Step 1 of our guide How To Install and Use PostgreSQL on Ubuntu 18.04.
      • Your MySQL server should also be configured to accept encrypted connections. To set this up, complete every step of our tutorial on How To Configure SSL/TLS for MySQL on Ubuntu 18.04, including the optional Step 6. As you follow this guide, be sure to use your PostgreSQL server as the MySQL client machine, as you will need to be able to connect to your MySQL server from your Postgres machine in order to migrate the data with pgLoader.

      Please note that throughout this guide, the server on which you installed MySQL will be referred to as the “MySQL server” and any commands that should be run on this machine will be shown with a blue background, like this:

      Similarly, this guide will refer to the other server as the "PostgreSQL" or "Postgres" server and any commands that must be run on that machine will be shown with a red background:

      Please keep these in mind as you follow this tutorial so as to avoid any confusion.

      Step 1 — (Optional) Creating a Sample Database and Table in MySQL

      This step describes the process of creating a test database and populating it with dummy data. We encourage you to practice using pgLoader with this test case, but if you already have a database you want to migrate, you can move on to the next step.

      Start by opening up the MySQL prompt on your MySQL server:

      After entering your root MySQL user's password, you will see the MySQL prompt.

      From there, create a new database by running the following command. You can name your database whatever you'd like, but in this guide we will name it source_db:

      • CREATE DATABASE source_db;

      Then switch to this database with the USE command:


      Database changed

      Within this database, use the following command to create a sample table. Here, we will name this table sample_table but feel free to give it another name:

      • CREATE TABLE sample_table (
      • employee_id INT PRIMARY KEY,
      • first_name VARCHAR(50),
      • last_name VARCHAR(50),
      • start_date DATE,
      • salary VARCHAR(50)
      • );

      Then populate this table with some sample employee data using the following command:

      • INSERT INTO sample_table (employee_id, first_name, last_name, start_date, salary)
      • VALUES (1, 'Elizabeth', 'Cotten', '2007-11-11', '$105433.18'),
      • (2, 'Yanka', 'Dyagileva', '2017-10-30', '$107540.67'),
      • (3, 'Lee', 'Dorsey', '2013-06-04', '$118024.04'),
      • (4, 'Kasey', 'Chambers', '2010-08-18', '$116456.98'),
      • (5, 'Bram', 'Tchaikovsky', '2018-09-16', '$61989.50');

      Following this, you can close the MySQL prompt:

      Now that you have a sample database loaded with dummy data, you can move on to the next step in which you will install pgLoader on your PostgreSQL server.

      Step 2 — Installing pgLoader

      pgLoader is a program that can load data into a PostgreSQL database from a variety of different sources. It uses PostgreSQL's COPY command to copy data from a source database or file — such as a comma-separated values (CSV) file — into a target PostgreSQL database.

      pgLoader is available from the default Ubuntu APT repositories and you can install it using the apt command. However, in this guide we will take advantage of pgLoader's useSSL option, a feature that allows for migrations from MySQL over an SSL connection. This feature is only available in the latest version of pgLoader which, as of this writing, can only be installed using the source code from its GitHub repository.

      Before installing pgLoader, you will need to install its dependencies. If you haven't done so recently, update your Postgres server's package index:

      Then install the following packages:

      • sbcl: A Common Lisp compiler
      • unzip: A de-archiver for .zip files
      • libsqlite3-dev: A collection of development files for SQLite 3
      • gawk: Short for "GNU awk", a pattern scanning and processing language
      • curl: A command line tool for transferring data from a URL
      • make: A utility for managing package compilation
      • freetds-dev: A client library for MS SQL and Sybase databases
      • libzip-dev: A library for reading, creating, and modifying zip archives

      Use the following command to install these dependencies:

      • sudo apt install sbcl unzip libsqlite3-dev gawk curl make freetds-dev libzip-dev

      When prompted, confirm that you want to install these packages by pressing ENTER.

      Next, navigate to the pgLoader GitHub project's Releases page and find the latest release. For this guide, we will use the latest release at the time of this writing: version 3.6.1. Scroll down to its Assets menu and copy the link for the tar.gz file labeled Source code. Then paste the link into the following wget command. This will download the tarball to your server:

      • wget

      Extract the tarball:

      This will create a number of new directories and files on your server. Navigate into the new pgLoader parent directory:

      Then use the make utility to compile the pgloader binary:

      This command will take some time to build the pgloader binary.

      Move the binary file into the /usr/local/bin directory, the location where Ubuntu looks for executable files:

      • sudo mv ./build/bin/pgloader /usr/local/bin/

      You can test that pgLoader was installed correctly by checking its version, like so:


      pgloader version "3.6.1" compiled with SBCL 1.4.5.debian

      pgLoader is now installed, but before you can begin your migration you'll need to make some configuration changes to both your PostgreSQL and MySQL instances. We'll focus on the PostgreSQL server first.

      Step 3 — Creating a PostgreSQL Role and Database

      The pgloader command works by copying source data, either from a file or directly from a database, and inserting it into a PostgreSQL database. For this reason, you must either run pgLoader as a Linux user who has access to your Postgres database or you must specify a PostgreSQL role with the appropriate permissions in your load command.

      PostgreSQL manages database access through the use of roles. Depending on how the role is configured, it can be thought of as either a database user or a group of database users. In most RDBMSs, you create a user with the CREATE USER SQL command. Postgres, however, comes installed with a handy script called createuser. This script serves as a wrapper for the CREATE USER SQL command that you can run directly from the command line.

      Note: In PostgreSQL, you authenticate as a database user using the Identification Protocol, or ident, authentication method by default, rather than with a password. This involves PostgreSQL taking the client's Ubuntu username and using it as the allowed Postgres database username. This allows for greater security in many cases, but it can also cause issues in instances where you'd like an outside program to connect to one of your databases.

      pgLoader can load data into a Postgres database through a role that authenticates with the ident method as long as that role shares the same name as the Linux user profile issuing the pgloader command. However, to keep this process as clear as possible, this tutorial describes setting up a different PostgreSQL role that authenticates with a password rather than with the ident method.

      Run the following command on your Postgres server to create a new role. Note the -P flag, which tells createuser to prompt you to enter a password for the new role:

      • sudo -u postgres createuser --interactive -P

      You may first be prompted for your sudo password. The script will then prompt you to enter a name for the new role. In this guide, we'll call this role pgloader_pg:


      Enter name of role to add: pgloader_pg

      Following that, createuser will prompt you to enter and confirm a password for this role. Be sure to take note of this password, as you'll need it to perform the migration in Step 5:


      Enter password for new role: Enter it again:

      Lastly, the script will ask you if the new role should be classified as a superuser. In PostgreSQL, connecting to the database with a superuser role allows you to circumvent all of the database's permissions checks, except for the right to log in. Because of this, the superuser privilege should not be used lightly, and the PostgreSQL documentation recommends that you do most of your database work as a non-superuser role. However, because pgLoader needs broad privileges to access and load data into tables, you can safely grant this new role superuser privileges. Do so by typing y and then pressing ENTER:


      . . . Shall the new role be a superuser? (y/n) y

      PostgreSQL comes with another useful script that allows you to create a database from the command line. Since pgLoader also needs a target database into which it can load the source data, run the following command to create one. We'll name this database new_db but feel free to modify that if you like:

      • sudo -u postgres createdb new_db

      If there aren't any errors, this command will complete without any output.

      Now that you have a dedicated PostgreSQL user and an empty database into which you can load your MySQL data, there are just a few more changes you'll need to make before performing a migration. You'll need to create a dedicated MySQL user with access to your source database and add your client-side certificates to Ubuntu's trusted certificate store.

      Step 4 — Creating a Dedicated User in MySQL and Managing Certificates

      Protecting data from snoopers is one of the most important parts of any database administrator's job. Migrating data from one machine to another opens up an opportunity for malicious actors to sniff the packets traveling over the network connection if it isn't encrypted. In this step, you will create a dedicated MySQL user which pgLoader will use to perform the migration over an SSL connection.

      Begin by opening up your MySQL prompt:

      From the MySQL prompt, use the following CREATE USER command to create a new MySQL user. We will name this user pgloader_my. Because this user will only access MySQL from your PostgreSQL server, be sure to replace your_postgres_server_ip with the public IP address of your PostgreSQL server. Additionally, replace password with a secure password or passphrase:

      • CREATE USER 'pgloader_my'@'your_postgres_server_ip' IDENTIFIED BY 'password' REQUIRE SSL;

      Note the REQUIRE SSL clause at the end of this command. This will restrict the pgloader_my user to only access the database through a secure SSL connection.

      Next, grant the pgloader_my user access to the target database and all of its tables. Here, we'll specify the database we created in the optional Step 1, but if you have your own database you'd like to migrate, use its name in place of source_db:

      • GRANT ALL ON source_db.* TO 'pgloader_my'@'your_postgresql_server_ip';

      Then run the FLUSH PRIVILEGES command to reload the grant tables, enabling the privilege changes:

      After this, you can close the MySQL prompt:

      Now go back to your Postgres server terminal and attempt to log in to the MySQL server as the new pgloader_my user. If you followed the prerequisite guide on configuring SSL/TLS for MySQL then you will already have mysql-client installed on your PostgreSQL server and you should be able to connect with the following command:

      • mysql -u pgloader_my -p -h your_mysql_server_ip

      If the command is successful, you will see the MySQL prompt:

      After confirming that your pgloader_my user can successfully connect, go ahead and close the prompt:

      At this point, you have a dedicated MySQL user that can access the source database from your Postgres machine. However, if you were to try to migrate your MySQL database using SSL the attempt would fail.

      The reason for this is that pgLoader isn't able to read MySQL's configuration files, and thus doesn't know where to look for the CA certificate or client certificate that you copied to your PostgreSQL server in the prerequisite SSL/TLS configuration guide. Rather than ignoring SSL requirements, though, pgLoader requires the use of trusted certificates in cases where SSL is needed to connect to MySQL. Accordingly, you can resolve this issue by adding the ca.pem and client-cert.pem files to Ubuntu's trusted certificate store.

      To do this, copy over the ca.pem and client-cert.pem files to the /usr/local/share/ca-certificates/ directory. Note that you must also rename these files so they have the .crt file extension. If you don't rename them, your system will not be able to recognize that you've added these new certificates:

      • sudo cp ~/client-ssl/ca.pem /usr/local/share/ca-certificates/ca.pem.crt
      • sudo cp ~/client-ssl/client-cert.pem /usr/local/share/ca-certificates/client-cert.pem.crt

      Following this, run the update-ca-certificates command. This program looks for certificates within /usr/local/share/ca-certificates, adds any new ones to the /etc/ssl/certs/ directory, and generates a list of trusted SSL certificates — ca-certificates.crt — based on the contents of the /etc/ssl/certs/ directory:

      • sudo update-ca-certificates


      Updating certificates in /etc/ssl/certs... 2 added, 0 removed; done. Running hooks in /etc/ca-certificates/update.d... done.

      With that, you're all set to migrate your MySQL database to PostgreSQL.

      Step 5 — Migrating the Data

      Now that you've configured remote access from your PostgreSQL server to your MySQL server, you're ready to begin the migration.

      Note: It's important to back up your database before taking any action that could impact the integrity of your data. However, this isn't necessary when performing a migration with pgLoader, since it doesn't delete or transform data; it only copies it.

      That said, if you're feeling cautious and would like to back up your data before migrating it, you can do so with the mysqldump utility. See the official MySQL documentation for details.

      pgLoader allows users to migrate an entire database with a single command. For a migration from a MySQL database to a PostgreSQL database on a separate server, the command would have the following syntax:

      • pgloader mysql://mysql_username:password@mysql_server_ip_/source_database_name?option_1=value&option_n=value postgresql://postgresql_role_name:password@postgresql_server_ip/target_database_name?option_1=value&option_n=value

      This includes the pgloader command and two connection strings, the first for the source database and the second for the target database. Both of these connection strings begin by declaring what type of DBMS the connection string points to, followed by the username and password that have access to the database (separated by a colon), the host address of the server where the database is installed, the name of the database pgLoader should target, and various options that affect pgLoader's behavior.

      Using the parameters defined earlier in this tutorial, you can migrate your MySQL database using a command with the following structure. Be sure to replace any highlighted values to align with your own setup:

      • pgloader mysql://pgloader_my:mysql_password@mysql_server_ip/source_db?useSSL=true postgresql://pgloader_pg:postgresql_password@localhost/new_db

      Note that this command includes the useSSL option in the MySQL connection string. By setting this option to true, pgLoader will connect to MySQL over SSL. This is necessary, as you've configured your MySQL server to only accept secure connections.

      If this command is successful, you will see an output table describing how the migration went:


      table name errors rows bytes total time ----------------------- --------- --------- --------- -------------- fetch meta data 0 2 0.111s Create Schemas 0 0 0.001s Create SQL Types 0 0 0.005s Create tables 0 2 0.017s Set Table OIDs 0 1 0.010s ----------------------- --------- --------- --------- -------------- source_db.sample_table 0 5 0.2 kB 0.048s ----------------------- --------- --------- --------- -------------- COPY Threads Completion 0 4 0.052s Index Build Completion 0 1 0.011s Create Indexes 0 1 0.006s Reset Sequences 0 0 0.014s Primary Keys 0 1 0.001s Create Foreign Keys 0 0 0.000s Create Triggers 0 0 0.000s Install Comments 0 0 0.000s ----------------------- --------- --------- --------- -------------- Total import time ✓ 5 0.2 kB 0.084s

      To check that the data was migrated correctly, open up the PostgreSQL prompt:

      From there, connect to the database into which you loaded the data:

      Then run the following query to test whether the migrated data is stored in your PostgreSQL database:

      • SELECT * FROM source_db.sample_table;

      Note: Notice the FROM clause in this query specifying the sample_table held within the source_db schema:

      • . . . FROM source_db.sample_table;

      This is called a qualified name. You could go further and specify the fully qualified name by including the database's name as well as those of the schema and table:

      • . . . FROM new_db.source_db.sample_table;

      When you run queries in a PostgreSQL database, you don't need to be this specific if the table is held within the default public schema. The reason you must do so here is that when pgLoader loads data into Postgres, it creates and targets a new schema named after the original database — in this case, source_db. This is pgLoader's default behavior for MySQL to PostgreSQL migrations. However, you can use a load file to instruct pgLoader to change the table's schema topubliconce it's done loading data. See the next step for an example of how to do this.

      If the data was indeed loaded correctly, you will see the following table in the query's output:


      employee_id | first_name | last_name | start_date | salary -------------+------------+-------------+------------+------------ 1 | Elizabeth | Cotten | 2007-11-11 | $105433.18 2 | Yanka | Dyagileva | 2017-10-30 | $107540.67 3 | Lee | Dorsey | 2013-06-04 | $118024.04 4 | Kasey | Chambers | 2010-08-18 | $116456.98 5 | Bram | Tchaikovsky | 2018-09-16 | $61989.50 (5 rows)

      To close the Postgres prompt, run the following command:

      Now that we've gone over how to migrate a MySQL database over a network and load it into a PostgreSQL database, we will go over a few other common migration scenarios in which pgLoader can be useful.

      Step 6 — Exploring Other Migration Options

      pgLoader is a highly flexible tool that can be useful in a wide variety of situations. Here, we'll take a quick look at a few other ways you can use pgLoader to migrate a MySQL database to PostgreSQL.

      Migrating with a pgLoader Load File

      In the context of pgLoader, a load file, or command file, is a file that tells pgLoader how to perform a migration. This file can include commands and options that affect pgLoader's behavior, giving you much finer control over how your data is loaded into PostgreSQL and allowing you to perform complex migrations.

      pgLoader's documentation provides comprehensive instructions on how to use and extend these files to support a number of migration types, so here we will work through a comparatively rudimentary example. We will perform the same migration we ran in Step 5, but will also include an ALTER SCHEMA command to change the new_db database's schema from source_db to public.

      To begin, create a new load file on the Postgres server using your preferred text editor:

      Then add the following content, making sure to update the highlighted values to align with your own configuration:


           FROM      mysql://pgloader_my:mysql_password@mysql_server_ip/source_db?useSSL=true
           INTO pgsql://pgloader_pg:postgresql_password@localhost/new_db
       WITH include drop, create tables
      ALTER SCHEMA 'source_db' RENAME TO 'public'

      Here is what each of these clauses do:

      • LOAD DATABASE: This line instructs pgLoader to load data from a separate database, rather than a file or data archive.
      • FROM: This clause specifies the source database. In this case, it points to the connection string for the MySQL database we created in Step 1.
      • INTO: Likewise, this line specifies the PostgreSQL database in to which pgLoader should load the data.
      • WITH: This clause allows you to define specific behaviors for pgLoader. You can find the full list of WITH options that are compatible with MySQL migrations here. In this example we only include two options:
        • include drop: When this option is used, pgLoader will drop any tables in the target PostgreSQL database that also appear in the source MySQL database. If you use this option when migrating data to an existing PostgreSQL database, you should back up the entire database to avoid losing any data.
        • create tables: This option tells pgLoader to create new tables in the target PostgreSQL database based on the metadata held in the MySQL database. If the opposite option, create no tables, is used, then the target tables must already exist in the target Postgres database prior to the migration.
      • ALTER SCHEMA: Following the WITH clause, you can add specific SQL commands like this to instruct pgLoader to perform additional actions. Here, we instruct pgLoader to change the new Postgres database's schema from source_db to public, but only after it has created the schema. Note that you can also nest such commands within other clauses — such as BEFORE LOAD DO — to instruct pgLoader to execute those commands at specific points in the migration process.

      This is a demonstrative example of what you can include in a load file to modify pgLoader's behavior. The complete list of clauses that one can add to a load file and what they do can be found in the official pgLoader documentation.

      Save and close the load file after you've finished adding this content. To use it, include the name of the file as an argument to the pgloader command:

      • pgloader pgload_test.load

      To test that the migration was successful, open up the Postgres prompt:

      Then connect to the database:

      And run the following query:

      • SELECT * FROM sample_table;


      employee_id | first_name | last_name | start_date | salary -------------+------------+-------------+------------+------------ 1 | Elizabeth | Cotten | 2007-11-11 | $105433.18 2 | Yanka | Dyagileva | 2017-10-30 | $107540.67 3 | Lee | Dorsey | 2013-06-04 | $118024.04 4 | Kasey | Chambers | 2010-08-18 | $116456.98 5 | Bram | Tchaikovsky | 2018-09-16 | $61989.50 (5 rows)

      This output confirms that pgLoader migrated the data successfully, and also that the ALTER SCHEMA command we added to the load file worked as expected, since we didn't need to specify the source_db schema in the query to view the data.

      Note that if you plan to use a load file to migrate data held on one database to another located on a separate machine, you will still need to adjust any relevant networking and firewall rules in order for the migration to be successful.

      Migrating a MySQL Database to PostgreSQL Locally

      You can use pgLoader to migrate a MySQL database to a PostgreSQL database housed on the same machine. All you need is to run the migration command from a Linux user profile with access to the root MySQL user:

      • pgloader mysql://root@localhost/source_db pgsql://sammy:postgresql_password@localhost/target_db

      Performing a local migration like this means you don't have to make any changes to MySQL's default networking configuration or your system's firewall rules.

      Migrating from a CSV file

      You can also load a PostgreSQL database with data from a CSV file.

      Assuming you have a CSV file of data named load.csv, the command to load it into a Postgres database might look like this:

      • pgloader load.csv pgsql://sammy:password@localhost/target_db

      Because the CSV format is not fully standardized, there's a chance that you will run into issues when loading data directly from a CSV file in this manner. Fortunately, you can correct for irregularities by including various options with pgLoader's command line options or by specifying them in a load file. See the pgLoader documentation on the subject for more details.

      Migrating to a Managed PostgreSQL Database

      It's also possible to perform a migration from a self-managed database to a managed PostgreSQL database. To illustrate how this kind of migration could look, we will use the MySQL server and a DigitalOcean Managed PostgreSQL Database. We'll also use the sample database we created in Step 1, but if you skipped that step and have your own database you'd like to migrate, you can point to that one instead.

      Note: For instructions on how to set up a DigitalOcean Managed Database, please refer to our Managed Database Quickstart guide.

      For this migration, we won't need pgLoader’s useSSL option since it only works with remote MySQL databases and we will run this migration from a local MySQL database. However, we will use the sslmode=require option when we load and connect to the DigitalOcean Managed PostgreSQL database, which will ensure your data stays protected.

      Because we're not using the useSSL this time around, you can use apt to install pgLoader along with the postgresql-client package, which will allow you to access the Managed PostgreSQL Database from your MySQL server:

      • sudo apt install pgloader postgresql-client

      Following that, you can run the pgloader command to migrate the database. To do this, you'll need the connection string for the Managed Database.

      For DigitalOcean Managed Databases, you can copy the connection string from the Cloud Control Panel. First, click Databases in the left-hand sidebar menu and select the database to which you want to migrate the data. Then scroll down to the Connection Details section. Click on the drop down menu and select Connection string. Then, click the Copy button to copy the string to your clipboard and paste it into the following migration command, replacing the example PostgreSQL connection string shown here. This will migrate your MySQL database into the defaultdb PostgreSQL database as the doadmin PostgreSQL role:

      • pgloader mysql://root:password@localhost/source_db postgres://doadmin:password@db_host/defaultdb?sslmode=require

      Following this, you can use the same connection string as an argument to psql to connect to the managed PostgreSQL database and confirm that the migration was successful:

      • psql postgres://doadmin:password@db_host/defaultdb?sslmode=require

      Then, run the following query to check that pgLoader correctly migrated the data:

      • SELECT * FROM source_db.sample_table;


      employee_id | first_name | last_name | start_date | salary -------------+------------+-------------+------------+------------ 1 | Elizabeth | Cotten | 2007-11-11 | $105433.18 2 | Yanka | Dyagileva | 2017-10-30 | $107540.67 3 | Lee | Dorsey | 2013-06-04 | $118024.04 4 | Kasey | Chambers | 2010-08-18 | $116456.98 5 | Bram | Tchaikovsky | 2018-09-16 | $61989.50 (5 rows)

      This confirms that pgLoader successfully migrated your MySQL database to your managed PostgreSQL instance.


      pgLoader is a flexible tool that can perform a database migration in a single command. With a few configuration tweaks, it can migrate an entire database from one physical machine to another using a secure SSL/TLS connection. Our hope is that by following this tutorial, you will have gained a clearer understanding of pgLoader's capabilities and potential use cases.

      After migrating your data over to PostgreSQL, you may find the following tutorials to be of interest:

      Source link

      How To Audit a PostgreSQL Database with InSpec on Ubuntu 18.04

      The author selected the Free and Open Source Fund to receive a donation as part of the Write for DOnations program.


      InSpec is an open-source, automated testing framework for testing and auditing your system to ensure the compliance of integration, security, and other policy requirements. Developers can test the actual state of their infrastructure and applications against a target state using InSpec code.

      To specify the policy requirements you’re testing for, InSpec includes audit controls. Traditionally, developers manually enforce policy requirements and often do this right before deploying changes to production. With InSpec however, developers can continuously evaluate compliance at every stage of product development, which aids in solving issues earlier in the process of development. The InSpec DSL (Domain Specific Language) built on RSpec, a DSL testing tool written in Ruby, specifies the syntax used to write the audit controls.

      InSpec also includes a collection of resources to assist in configuring specific parts of your system and to simplify making audit controls. There is a feature to write your own custom resources when you need to define a specific solution that isn’t available. Universal matchers allow you to compare resource values to expectations across all InSpec tests.

      In this tutorial, you’ll install InSpec on a server running Ubuntu 18.04. You will start by writing a test that verifies the operating system family of the server, then you’ll create a PostgreSQL audit profile from the ground up. This audit profile starts by checking that you have PostgreSQL installed on the server and that its services are running. Then you’ll add tests to check that the PostgreSQL service is running with the correct port, address, protocol, and user. Next you’ll test specific PostgreSQL configuration parameters, and finally, you’ll audit client authentication configuration.


      Before following this tutorial, you will need the following:

      Step 1 — Preparing the Environment

      In this step, you’ll download and unpack the latest stable version of InSpec into your home directory. InSpec provides installable binaries on their downloads page.

      Navigate to your home directory:

      Now download the binary with curl:

      • curl -LO<^>_amd64.deb

      Next, use the sha256sum command to generate a checksum of the downloaded file. This is to verify the integrity and authenticity of the downloaded file.

      • sha256sum inspec_3.7.11-1_amd64.deb

      Checksums for each binary are listed on the InSpec downloads page, so visit the downloads page to compare with your output from this command.


      e665948f9c0441e8648b08f8d3c8d34a86f9e994609877a7e4853c012dbc7523 inspec_3.7.11-1_amd64.deb

      If the checksums are different, delete the downloaded file and repeat the download process.

      Next, you'll install the downloaded binary. For this, you'll use the dpkg command that you can use for package management, and which comes with all Debian-based systems, such as Ubuntu, by default. The -i flag prompts the dpkg command to install the package files.

      • sudo dpkg -i inspec_3.7.11-1_amd64.deb

      If there are no errors, it means that you've installed InSpec successfully. To verify the installation, enter the following command:

      You'll receive output showing the version of InSpec you just installed:



      If you don't see a version number displayed, run over step 1 again.

      After this, you can delete inspec_3.7.11-1_amd64.deb since you don't need it anymore as you've installed the package:

      • rm inspec_3.7.11-1_amd64.deb

      You've successfully installed InSpec on your server. In the next step, you will write a test to verify the operating system family of your server.

      Step 2 — Completing Your First InSpec Test

      In this step, you'll complete your first InSpec test, which will be testing that your operating system family is debian.

      You will use the os resource, which is a built-in InSpec audit resource to test the platform on which the system is running. You'll also use the eq matcher. The eq matcher is a universal matcher that tests for the exact equality of two values.

      An InSpec test consists of a describe block, which contains one or more it and its statements each of which validates one of the resource's features. Each statement describes an expectation of a specific condition of the system as assertions. Two keywords that you can include to make an assertion are should and should_not, which assert that the condition should be true and false respectively.

      Create a file called os_family.rb to hold your test and open it with your text editor:

      Add the following to your file:


      describe do
        it {should eq 'debian'}

      This test ensures that the operating system family of the target system is debian. Other possible values are windows, unix, bsd, and so on. You can find a complete list in the os resource documentation. Save and exit the file.

      Next, run your test with the following command:

      The test will pass, and you'll receive output resembling the following:


      Profile: tests from os_family.rb (tests from os_family.rb) Version: (not specified) Target: local:// debian ✔ should eq "debian" Test Summary: 1 successful, 0 failures, 0 skipped

      In your output, the Profile contains the name of the profile that just executed. Since this test is not included in a profile, InSpec generates a default profile name from the test's file name tests from os_family.rb. (You'll work with InSpec profiles in the next section where you will start building your PostgreSQL InSpec profile.) Here InSpec presents the Version as not specified, because you can only specify versions in profiles.

      The Target field specifies the target system that the test is executed on, which can be local or a remote system via ssh. In this case, you've executed your test on the local system so the target shows local://.

      Usefully, the output also displays the executed test with a checkmark symbol (✔) to the left indicating a successful test. The output will show a cross symbol (✘) if the test fails.

      Finally, the test summary gives overall details about how many tests were successful, failed, and skipped. In this instance, you had a single successful test.

      Now you'll see what the output looks like for a failed test. Open os_family.rb:

      In the test you created earlier in this step, you'll now change the expected value of the operating system family from debian to windows. Your file contents after this will be the following:


      describe do
        it {should eq 'windows'}

      Save and exit the file.

      Next, run the updated test with the following command:

      You will get output similar to the following:


      Profile: tests from (tests from Version: (not specified) Target: local:// debian (✘) should eq "windows" expected: "windows" got: "debian" (compared using ==) Test Summary: 0 successful, 1 failure, 0 skipped

      As expected, the test failed. The output indicates that your expected (windows) and actual (debian) values do not match for the property. The (compared using ==) output indicates that the eq matcher performed a string comparison between the two values to come up with this result.

      In this step, you've written a successful test that verifies the operating system family of the server. You've also created a failed test in order to see what the InSpec output for a failed test looks like. In the next step, you will start building the audit profile to test your PostgreSQL installation.

      Step 3 — Auditing Your PostgreSQL Installation

      Now, you will audit your PostgreSQL installation. You'll start by checking that you have PostgreSQL installed and its service is running correctly. Finally, you'll audit the PostgreSQL system port and process. For your PostgreSQL audit, you will create various InSpec controls, all within an InSpec profile named PostgreSQL.

      An InSpec control is a high-level grouping of related tests. Within a control, you can have multiple describe blocks, as well as metadata to describe your tests such as impact level, title, description, and tags. InSpec profiles organize controls to support dependency management and code reuse, which both help manage test complexity. They are also useful for packaging and sharing tests with the public via the Chef Supermarket. You can use profiles to define custom resources that you would implement as regular Ruby classes.

      To create an InSpec profile, you will use the init command. Enter this command to create the PostgreSQL profile:

      • inspec init profile PostgreSQL

      This creates the profile in a new directory with the same name as your profile, in this case PostgreSQL. Now, move into the new directory:

      The directory structure will look like this:

      ├── controls
      │   └── example.rb
      ├── inspec.yml
      ├── libraries

      The controls/example.rb file contains a sample control that tests to see if the /tmp folder exists on the target system. This is present only as a sample and you will replace it with your own test.

      Your first test will be to ensure that you have the package postgresql-10 installed on your system and that you have the postgresql service installed, enabled, and running.

      Rename the controls/example.rb file to controls/postgresql.rb:

      • mv controls/example.rb controls/postgresql.rb

      Next, open the file with your text editor:

      • nano controls/postgresql.rb

      Replace the content of the file with the following:


      control '1-audit_installation' do
        impact 1.0
        title 'Audit PostgreSQL Installation'
        desc 'Postgres should be installed and running'
        describe package('postgresql-10') do
          it {should be_installed}
          its('version') {should cmp >= '10'}
        describe service('postgresql@10-main') do
          it {should be_enabled}
          it {should be_installed}
          it {should be_running}

      In the preceding code block, you begin by defining the control with its name and metadata.

      In the first describe block, you use the package resource and pass in the PostgreSQL package name postgresql-10 as a resource argument. The package resource provides the matcher be_installed to test that the named package is installed on the system. It returns true if you have the package installed, and false otherwise. Next, you used the its statement to validate that the version of the installed PostgreSQL package is at least 10. You are using cmp instead of eq because package version strings usually contain other attributes apart from the numerical version. eq returns true only if there is an exact match while cmp is less-restrictive.

      In the second describe block, you use the service resource and pass in the PostgreSQL 10 service name postgresql@10-main as a resource argument. The service resource provides the matchers be_enabled, be_installed, and be_running and they return true if you have the named service installed, enabled, and running on the target system respectively.

      Save and exit your file.

      Next, you will run your profile. Make sure you're in the ~/PostgreSQL directory before running the following command:

      Since you completed the PostgreSQL prerequisite tutorial, your test will pass. Your output will look similar to the following:


      Profile: InSpec Profile (PostgreSQL) Version: 0.1.0 Target: local:// ✔ 1-audit_installation: Audit PostgreSQL Installation ✔ System Package postgresql-10 should be installed ✔ System Package postgresql-10 version should cmp >= "10" ✔ Service postgresql@10-main should be enabled ✔ Service postgresql@10-main should be installed ✔ Service postgresql@10-main should be running Profile Summary: 1 successful control, 0 control failures, 0 controls skipped Test Summary: 5 successful, 0 failures, 0 skipped

      The output indicates that your control was successful. A control is successful if, and only if, all the tests in it are successful. The output also confirms that all your tests were successful.

      Now that you've verified that the correct version of PostgreSQL is installed and the service is fine, you will create a new control that ensures that PostgreSQL is listening on the correct port, address, and protocol.

      For this test, you will also use attributes. An InSpec attribute is used to parameterize a profile to enable easy re-use in different environments or target systems. You'll define the PORT attribute.

      Open the inspec.yml file in your text editor:

      You'll append the port attribute to the end of the file. Add the following at the end of your file:


        - name: port
          type: string
          default: '5432'

      In the preceding code block, you added the port attribute and set it to a default value of 5432 because that is the port PostgreSQL listens on by default.

      Save and exit the file. Then run inspec check to verify the profile is still valid since you just edited inspec.yml:

      If there are no errors, you can proceed. Otherwise, open the inspec.yml file and ensure that the attribute is present at the end of the file.

      Now you'll create the control that checks that the PostgreSQL process is running and configured with the correct user. Open controls/postgresql.rb in your text editor:

      • nano controls/postgresql.rb

      Append the following control to the end of your current tests file controls/postgresql.rb:


      PORT = attribute('port')
      control '2-audit_address_port' do
        impact 1.0
        title 'Audit Process and Port'
        desc 'Postgres port should be listening and the process should be running'
        describe port(PORT) do
          it {should be_listening}
          its('addresses') {should include ''}
          its('protocols') {should cmp 'tcp'}
        describe processes('postgres') do
          it {should exist}
          its('users') {should include 'postgres'}
        describe user('postgres') do
          it {should exist}

      Here you begin by declaring a PORT variable to hold the value of the port profile attribute. Then you declare the control and its metadata.

      In the first describe block, you include the port resource to test basic port properties. The port resource provides the matchers be_listening, addresses, and protocols. You use the be_listening matcher to test that the named port is listening on the target system. It returns true if the port 5432 is listening and returns false otherwise. The addresses matcher tests if the specified address is associated with the port. In this case, PostgreSQL will be listening on the local address,
      The protocols matcher tests the Internet protocol the port is listening for, which can be icmp, tcp/tcp6, or udp/udp6. PostgreSQL will be listening for tcp connections.

      In the second describe block, you include the processes resource. You use the processes resource to test properties for programs that are running on the system. First, you verify that the postgres process exists on the system, then you use the users matcher to test that the postgres user owns the postgres process.

      In the third describe block, you have the user resource. You include the user resource to test user properties for a user such as whether the user exists or not, the group the user belongs to, and so on. Using this resource, you test that the postgres user exists on the system. Save and exit controls/postgresql.rb.

      Next, run your profile with the following command:

      The tests will pass, and your output will resemble the following:


      Profile: InSpec Profile (PostgreSQL) Version: 0.1.0 Target: local:// ✔ 1-audit_installation: Audit PostgreSQL Installation ✔ System Package postgresql-10 should be installed ✔ System Package postgresql-10 version should cmp >= "10" ✔ Service postgresql@10-main should be enabled ✔ Service postgresql@10-main should be installed ✔ Service postgresql@10-main should be running ✔ 2-audit_address_port: Audit Process and Port ✔ Port 5432 should be listening ✔ Port 5432 addresses should include "" ✔ Port 5432 protocols should cmp == "tcp" ✔ Processes postgres should exist ✔ Processes postgres users should include "postgres" ✔ User postgres should exist Profile Summary: 2 successful controls, 0 control failures, 0 controls skipped Test Summary: 11 successful, 0 failures, 0 skipped

      The output indicates that both of your controls and all of your tests were successful.

      In this section, you have created your first InSpec profile and control and used them to organize your tests. You've used several InSpec resources to ensure that you have the correct version of PostgreSQL installed, the PostgreSQL service enabled and running correctly, and that the PostgreSQL user exists on the system. With this set up you're ready to audit your configuration.

      Step 4 — Auditing Your PostgreSQL Configuration

      In this step, you'll audit some PostgreSQL configuration values, which will give you a foundation for working with these configuration files, allowing you to audit any PostgreSQL configuration parameters as desired.

      Now that you have tests auditing the PostgreSQL installation, you'll audit your PostgreSQL configuration itself. PostgreSQL has several configuration parameters that you can use to tune it as desired, and these are stored in the configuration file located by default at /etc/postgresql/10/main/postgresql.conf. You could have different requirements regarding PostgreSQL configuration for your various deployments such as logging, password encryption, SSL, and replication strategies — these requirements you specify in the configuration file.

      You will be using the postgres_conf resource that tests for specific, named configuration options against expected values in the contents of the PostgreSQL configuration file.

      This test will assume some non-default PostgreSQL configuration values that you'll set manually.

      Open the PostgreSQL configuration file in your favorite text editor:

      • sudo nano /etc/postgresql/10/main/postgresql.conf

      Set the following configuration values. If the option already exists in the file but is commented out, uncomment it by removing the #, and set the value as provided:


      password_encryption = scram-sha-256
      logging_collector = on
      log_connections = on
      log_disconnections = on
      log_duration = on

      The configuration values you have set:

      • Ensure that saved passwords are always encrypted with the scram-sha-256 algorithm.
      • Enable the logging collector, which is a background process that captures log messages from the standard error (stderr) and redirects them to a log file.
      • Enable logging of connection attempts to the PostgreSQL server as well as successful connections.
      • Enable logging of session terminations.
      • Enable logging of the duration of every completed statement.

      Save and exit the configuration file. Then restart the PostgreSQL service:

      • sudo service postgresql@10-main restart

      You'll test for only a few configuration options, but you can test any PostgreSQL configuration option with the postgres_conf resource.

      You will pass in your PostgreSQL configuration directory, which is at /etc/postgresql/10/main, using a new profile attribute, postgres_conf_dir. This configuration directory is not the same across all operating systems and platforms, so by passing it in as a profile attribute, you'll be making this profile easier to reuse in different environments.

      Open your inspec.yml file:

      Add this new attribute to the attributes section of inspec.yml:


        - name: postgres_conf_dir
          type: string
          default: '/etc/postgresql/10/main'

      Save and exit your file. Then run the following command to verify the InSpec profile is still valid because you just edited the inspec.yml:

      If there are no errors, you can proceed. Otherwise, open the inspec.yml file and ensure that the above lines are present at the end of the file.

      Now you will create the control that audits the configuration values you are enforcing. Append the following control to the end of the tests file controls/postgresql.rb:


      POSTGRES_CONF_DIR = attribute('postgres_conf_dir')
      POSTGRES_CONF_PATH = File.join(POSTGRES_CONF_DIR, 'postgresql.conf')
      control '3-postgresql' do
        impact 1.0
        title 'Audit PostgreSQL Configuration'
        desc 'Audits specific configuration options'
        describe postgres_conf(POSTGRES_CONF_PATH) do
          its('port') {should eq PORT}
          its('password_encryption') {should eq 'scram-sha-256'}
          its('ssl') {should eq 'on'}
          its('logging_collector') {should eq 'on'}
          its('log_connections') {should eq 'on'}
          its('log_disconnections') {should eq 'on'}
          its('log_duration') {should eq 'on'}

      Here you define two variables:

      • POSTGRES_CONF_DIR holds the postgres_conf_dir attribute as defined in the profile configuration.
      • POSTGRES_CONF_PATH holds the absolute path of the configuration file by concatenating the configuration file name with the configuration directory using File.join.

      Next, you define the control with its name and metadata. Then you use the postgres_conf resource together with the eq matcher to ensure your required values for the configuration options are correct. Save and exit controls/postgresql.rb.

      Next, you will run the test with the following command:

      The tests will pass, and your outputs will resemble the following:


      Profile: InSpec Profile (PostgreSQL) Version: 0.1.0 Target: local:// ✔ 1-audit_installation: Audit PostgreSQL Installation ✔ System Package postgresql-10 should be installed ✔ System Package postgresql-10 version should cmp >= "10" ✔ Service postgresql@10-main should be enabled ✔ Service postgresql@10-main should be installed ✔ Service postgresql@10-main should be running ✔ 2-audit_address_port: Audit Process and Port ✔ Port 5432 should be listening ✔ Port 5432 addresses should include "" ✔ Port 5432 protocols should cmp == "tcp" ✔ Processes postgres should exist ✔ Processes postgres users should include "postgres" ✔ User postgres should exist ✔ 3-postgresql: Audit PostgreSQL Configuration ✔ PostgreSQL Configuration port should eq "5432" ✔ PostgreSQL Configuration password_encryption should eq "scram-sha-256" ✔ PostgreSQL Configuration ssl should eq "on" ✔ PostgreSQL Configuration logging_collector should eq "on" ✔ PostgreSQL Configuration log_connections should eq "on" ✔ PostgreSQL Configuration log_disconnections should eq "on" ✔ PostgreSQL Configuration log_duration should eq "on" Profile Summary: 3 successful controls, 0 control failures, 0 controls skipped Test Summary: 18 successful, 0 failures, 0 skipped

      The output indicates that your three controls and all your tests were successful without any skipped tests or controls.

      In this step, you've added a new InSpec control that tests specific PostgreSQL configuration values from the configuration file using the postgres_conf resource. You audited a few values in this section, but you can use it to test any configuration option from the configuration file.

      Step 5 — Auditing PostgreSQL Client Authentication

      Now that you've written some tests for your PostgreSQL configuration, you'll write some tests for client authentication. This is important for installations that need to ensure specific authentication methods for different kinds of users; for example, to ensure clients connecting to PostgreSQL locally always need to authenticate with a password, or to reject connections from a specific IP address or IP address range, and so on.

      An important configuration for PostgreSQL installations where security is a concern is to only allow encrypted password authentications. PostgreSQL 10 supports two password encryption methods for client authentication: md5 and scram-sha-256. This test will require password encryption for all clients so this means that the METHOD field for all clients in the client configuration file must be set to either md5 or scram-sha-256. For these tests, you will use scram-sha-256 since it is more secure than md5.

      By default, local clients have their peer authentication method in the pg_hba.conf file. For the test, you need to change these to scram-sha-256. Open the /etc/postgresql/10/main/pg_hba.conf file:

      • sudo nano /etc/postgresql/10/main/pg_hba.conf

      The top of the file contains comments. Scroll down and look for uncommented lines where the authentication type is local, and change the authentication method from peer to scram-sha-256. For example, change:


      local   all             postgres                                peer



      local   all             postgres                                scram-sha-256

      At the end, your pg_hba.conf configuration will resemble the following:


      local   all             postgres                                scram-sha-256
      # TYPE  DATABASE        USER            ADDRESS                 METHOD
      # "local" is for Unix domain socket connections only
      local   all             all                                     scram-sha-256
      # IPv4 local connections:
      host    all             all               scram-sha-256
      # IPv6 local connections:
      host    all             all             ::1/128                 scram-sha-256
      # Allow replication connections from localhost, by a user with the
      # replication privilege.
      local   replication     all                                     scram-sha-256
      host    replication     all               scram-sha-256
      host    replication     all             ::1/128                 scram-sha-256

      Save and exit the configuration file. Then restart the PostgreSQL service:

      • sudo service postgresql@10-main restart

      For this test, you'll use the postgres_hba_conf resource. This resource is used to test the client authentication data defined in the pg_hba.conf file. You'll pass in the path of your pg_hba.conf file as a parameter to this resource.

      Your control will consist of two describe blocks that check the auth_method fields for both local and host clients respectively to ensure that they are both equal to scram-sha-256. Open controls/postgresql.rb in your text editor:

      • nano controls/postgresql.rb

      Append the following control to the end of the test file controls/postgresql.rb:


      POSTGRES_HBA_CONF_FILE = File.join(POSTGRES_CONF_DIR, 'pg_hba.conf')
      control '4-postgres_hba' do
        impact 1.0
        title 'Require SCRAM-SHA-256 for ALL users, peers in pg_hba.conf'
        desc 'Require SCRAM-SHA-256 for ALL users, peers in pg_hba.conf. Do not allow untrusted authentication methods.'
        describe postgres_hba_conf(POSTGRES_HBA_CONF_FILE).where { type == 'local' } do
          its('auth_method') { should all eq 'scram-sha-256' }
        describe postgres_hba_conf(POSTGRES_HBA_CONF_FILE).where { type == 'host' } do
          its('auth_method') { should all eq 'scram-sha-256' }

      In this code block, you define a new variable POSTGRES_HBA_CONF_FILE to store the absolute location of your pg_hba.conf file. File.join is a Ruby method to concatenate two file path segments with /. You use it here to join the POSTGRES_CONF_DIR variable, declared in the previous section, with the PostgreSQL configuration file pg_hba.conf. This will produce an absolute file path of the pg_hba.conf file and store it in the POSTGRES_HBA_CONF_FILE variable.

      After that, you declare and configure the control and its metadata. The first describe block checks that all configuration entries where the client type is local also have scram-sha-256 as their authentication methods. The second describe block does the same for cases where the client type is host. Save and exit controls/postgresql.rb.

      You'll execute this control as the postgres user because Read access to the PostgreSQL HBA configuration is granted only to Owner and Group, which is the postgres user. Execute the profile by running:

      • sudo -u postgres inspec exec .

      Your output will resemble the following:


      Profile: InSpec Profile (PostgreSQL) Version: 0.1.0 Target: local:// ✔ 1-audit_installation: Audit PostgreSQL Installation ✔ System Package postgresql-10 should be installed ✔ System Package postgresql-10 version should cmp >= "10" ✔ Service postgresql@10-main should be enabled ✔ Service postgresql@10-main should be installed ✔ Service postgresql@10-main should be running ✔ 2-audit_address_port: Audit Process and Port ✔ Port 5432 should be listening ✔ Port 5432 addresses should include "" ✔ Port 5432 protocols should cmp == "tcp" ✔ Processes postgres should exist ✔ Processes postgres users should include "postgres" ✔ User postgres should exist ✔ 3-postgresql: Audit PostgreSQL Configuration ✔ PostgreSQL Configuration port should eq "5432" ✔ PostgreSQL Configuration password_encryption should eq "scram-sha-256" ✔ PostgreSQL Configuration ssl should eq "on" ✔ PostgreSQL Configuration logging_collector should eq "on" ✔ PostgreSQL Configuration log_connections should eq "on" ✔ PostgreSQL Configuration log_disconnections should eq "on" ✔ PostgreSQL Configuration log_duration should eq "on" ✔ 4-postgres_hba: Require SCRAM-SHA-256 for ALL users, peers in pg_hba.conf ✔ Postgres Hba Config /etc/postgresql/10/main/pg_hba.conf with type == "local" auth_method should all eq "scram-sha-256" ✔ Postgres Hba Config /etc/postgresql/10/main/pg_hba.conf with type == "host" auth_method should all eq "scram-sha-256" Profile Summary: 4 successful controls, 0 control failures, 0 controls skipped Test Summary: 20 successful, 0 failures, 0 skipped

      This output indicates that the new control you added, together with all of the previous controls, are successful. It also indicates that all the tests in your profile are successful.

      In this step, you have added a control to your profile that successfully audited your PostgreSQL client authentication configuration to ensure that all clients are authenticated via scram-sha-256 using the postgres_hba_conf resource.


      You've set up InSpec and successfully audited a PostgreSQL 10 installation. In the process, you've used a selection of InSpec tools, such as: the InSpec DSL, matchers, resources, profiles, attributes, and the CLI. From here, you can incorporate other resources that InSpec provides in the Resources section of their documentation. InSpec also provides a mechanism for defining custom resources for your specific needs. These custom resources are written as a regular Ruby class.

      You can also explore the Compliance Profiles section of the Chef supermarket that contains publicly shared InSpec profiles that you can execute directly or extend in your own profiles. You can also share your own profiles with the general public in the Chef Supermarket.

      You can go further by exploring other tools in the Chef universe such as Chef and Habitat. InSpec is integrated with Habitat and this provides the ability to ship your compliance controls together with your Habitat-packaged applications and continuously run them. You can explore official and community InSpec tutorials on the tutorials page. For more advanced InSpec references, check the official InSpec documentation.

      Source link

      Understanding Database Sharding


      Any application or website that sees significant growth will eventually need to scale in order to accommodate increases in traffic. For data-driven applications and websites, it’s critical that scaling is done in a way that ensures the security and integrity of their data. It can be difficult to predict how popular a website or application will become or how long it will maintain that popularity, which is why some organizations choose a database architecture that allows them to scale their databases dynamically.

      In this conceptual article, we will discuss one such database architecture: sharded databases. Sharding has been receiving lots of attention in recent years, but many don’t have a clear understanding of what it is or the scenarios in which it might make sense to shard a database. We will go over what sharding is, some of its main benefits and drawbacks, and also a few common sharding methods.

      What is Sharding?

      Sharding is a database architecture pattern related to horizontal partitioning — the practice of separating one table’s rows into multiple different tables, known as partitions. Each partition has the same schema and columns, but also entirely different rows. Likewise, the data held in each is unique and independent of the data held in other partitions.

      It can be helpful to think of horizontal partitioning in terms of how it relates to vertical partitioning. In a vertically-partitioned table, entire columns are separated out and put into new, distinct tables. The data held within one vertical partition is independent from the data in all the others, and each holds both distinct rows and columns. The following diagram illustrates how a table could be partitioned both horizontally and vertically:

      Example tables showing horizontal and vertical partitioning

      Sharding involves breaking up one’s data into two or more smaller chunks, called logical shards. The logical shards are then distributed across separate database nodes, referred to as physical shards, which can hold multiple logical shards. Despite this, the data held within all the shards collectively represent an entire logical dataset.

      Database shards exemplify a shared-nothing architecture. This means that the shards are autonomous; they don’t share any of the same data or computing resources. In some cases, though, it may make sense to replicate certain tables into each shard to serve as reference tables. For example, let’s say there’s a database for an application that depends on fixed conversion rates for weight measurements. By replicating a table containing the necessary conversion rate data into each shard, it would help to ensure that all of the data required for queries is held in every shard.

      Oftentimes, sharding is implemented at the application level, meaning that the application includes code that defines which shard to transmit reads and writes to. However, some database management systems have sharding capabilities built in, allowing you to implement sharding directly at the database level.

      Given this general overview of sharding, let’s go over some of the positives and negatives associated with this database architecture.

      Benefits of Sharding

      The main appeal of sharding a database is that it can help to facilitate horizontal scaling, also known as scaling out. Horizontal scaling is the practice of adding more machines to an existing stack in order to spread out the load and allow for more traffic and faster processing. This is often contrasted with vertical scaling, otherwise known as scaling up, which involves upgrading the hardware of an existing server, usually by adding more RAM or CPU.

      It’s relatively simple to have a relational database running on a single machine and scale it up as necessary by upgrading its computing resources. Ultimately, though, any non-distributed database will be limited in terms of storage and compute power, so having the freedom to scale horizontally makes your setup far more flexible.

      Another reason why some might choose a sharded database architecture is to speed up query response times. When you submit a query on a database that hasn’t been sharded, it may have to search every row in the table you’re querying before it can find the result set you’re looking for. For an application with a large, monolithic database, queries can become prohibitively slow. By sharding one table into multiple, though, queries have to go over fewer rows and their result sets are returned much more quickly.

      Sharding can also help to make an application more reliable by mitigating the impact of outages. If your application or website relies on an unsharded database, an outage has the potential to make the entire application unavailable. With a sharded database, though, an outage is likely to affect only a single shard. Even though this might make some parts of the application or website unavailable to some users, the overall impact would still be less than if the entire database crashed.

      Drawbacks of Sharding

      While sharding a database can make scaling easier and improve performance, it can also impose certain limitations. Here, we’ll discuss some of these and why they might be reasons to avoid sharding altogether.

      The first difficulty that people encounter with sharding is the sheer complexity of properly implementing a sharded database architecture. If done incorrectly, there’s a significant risk that the sharding process can lead to lost data or corrupted tables. Even when done correctly, though, sharding is likely to have a major impact on your team’s workflows. Rather than accessing and managing one’s data from a single entry point, users must manage data across multiple shard locations, which could potentially be disruptive to some teams.

      One problem that users sometimes encounter after having sharded a database is that the shards eventually become unbalanced. By way of example, let’s say you have a database with two separate shards, one for customers whose last names begin with letters A through M and another for those whose names begin with the letters N through Z. However, your application serves an inordinate amount of people whose last names start with the letter G. Accordingly, the A-M shard gradually accrues more data than the N-Z one, causing the application to slow down and stall out for a significant portion of your users. The A-M shard has become what is known as a database hotspot. In this case, any benefits of sharding the database are canceled out by the slowdowns and crashes. The database would likely need to be repaired and resharded to allow for a more even data distribution.

      Another major drawback is that once a database has been sharded, it can be very difficult to return it to its unsharded architecture. Any backups of the database made before it was sharded won’t include data written since the partitioning. Consequently, rebuilding the original unsharded architecture would require merging the new partitioned data with the old backups or, alternatively, transforming the partitioned DB back into a single DB, both of which would be costly and time consuming endeavors.

      A final disadvantage to consider is that sharding isn’t natively supported by every database engine. For instance, PostgreSQL does not include automatic sharding as a feature, although it is possible to manually shard a PostgreSQL database. There are a number of Postgres forks that do include automatic sharding, but these often trail behind the latest PostgreSQL release and lack certain other features. Some specialized database technologies — like MySQL Cluster or certain database-as-a-service products like MongoDB Atlas — do include auto-sharding as a feature, but vanilla versions of these database management systems do not. Because of this, sharding often requires a “roll your own” approach. This means that documentation for sharding or tips for troubleshooting problems are often difficult to find.

      These are, of course, only some general issues to consider before sharding. There may be many more potential drawbacks to sharding a database depending on its use case.

      Now that we’ve covered a few of sharding’s drawbacks and benefits, we will go over a few different architectures for sharded databases.

      Sharding Architectures

      Once you’ve decided to shard your database, the next thing you need to figure out is how you’ll go about doing so. When running queries or distributing incoming data to sharded tables or databases, it’s crucial that it goes to the correct shard. Otherwise, it could result in lost data or painfully slow queries. In this section, we’ll go over a few common sharding architectures, each of which uses a slightly different process to distribute data across shards.

      Key Based Sharding

      Key based sharding, also known as hash based sharding, involves using a value taken from newly written data — such as a customer’s ID number, a client application’s IP address, a ZIP code, etc. — and plugging it into a hash function to determine which shard the data should go to. A hash function is a function that takes as input a piece of data (for example, a customer email) and outputs a discrete value, known as a hash value. In the case of sharding, the hash value is a shard ID used to determine which shard the incoming data will be stored on. Altogether, the process looks like this:

      Key based sharding example diagram

      To ensure that entries are placed in the correct shards and in a consistent manner, the values entered into the hash function should all come from the same column. This column is known as a shard key. In simple terms, shard keys are similar to primary keys in that both are columns which are used to establish a unique identifier for individual rows. Broadly speaking, a shard key should be static, meaning it shouldn’t contain values that might change over time. Otherwise, it would increase the amount of work that goes into update operations, and could slow down performance.

      While key based sharding is a fairly common sharding architecture, it can make things tricky when trying to dynamically add or remove additional servers to a database. As you add servers, each one will need a corresponding hash value and many of your existing entries, if not all of them, will need to be remapped to their new, correct hash value and then migrated to the appropriate server. As you begin rebalancing the data, neither the new nor the old hashing functions will be valid. Consequently, your server won’t be able to write any new data during the migration and your application could be subject to downtime.

      The main appeal of this strategy is that it can be used to evenly distribute data so as to prevent hotspots. Also, because it distributes data algorithmically, there’s no need to maintain a map of where all the data is located, as is necessary with other strategies like range or directory based sharding.

      Range Based Sharding

      Range based sharding involves sharding data based on ranges of a given value. To illustrate, let’s say you have a database that stores information about all the products within a retailer’s catalog. You could create a few different shards and divvy up each products’ information based on which price range they fall into, like this:

      Range based sharding example diagram

      The main benefit of range based sharding is that it’s relatively simple to implement. Every shard holds a different set of data but they all have an identical schema as one another, as well as the original database. The application code just reads which range the data falls into and writes it to the corresponding shard.

      On the other hand, range based sharding doesn’t protect data from being unevenly distributed, leading to the aforementioned database hotspots. Looking at the example diagram, even if each shard holds an equal amount of data the odds are that specific products will receive more attention than others. Their respective shards will, in turn, receive a disproportionate number of reads.

      Directory Based Sharding

      To implement directory based sharding, one must create and maintain a lookup table that uses a shard key to keep track of which shard holds which data. In a nutshell, a lookup table is a table that holds a static set of information about where specific data can be found. The following diagram shows a simplistic example of directory based sharding:

      Directory based sharding example diagram

      Here, the Delivery Zone column is defined as a shard key. Data from the shard key is written to the lookup table along with whatever shard each respective row should be written to. This is similar to range based sharding, but instead of determining which range the shard key’s data falls into, each key is tied to its own specific shard. Directory based sharding is a good choice over range based sharding in cases where the shard key has a low cardinality and it doesn’t make sense for a shard to store a range of keys. Note that it’s also distinct from key based sharding in that it doesn’t process the shard key through a hash function; it just checks the key against a lookup table to see where the data needs to be written.

      The main appeal of directory based sharding is its flexibility. Range based sharding architectures limit you to specifying ranges of values, while key based ones limit you to using a fixed hash function which, as mentioned previously, can be exceedingly difficult to change later on. Directory based sharding, on the other hand, allows you to use whatever system or algorithm you want to assign data entries to shards, and it’s relatively easy dynamically add shards using this approach.

      While directory based sharding is the most flexible of the sharding methods discussed here, the need to connect to the lookup table before every query or write can have a detrimental impact on an application’s performance. Furthermore, the lookup table can become a single point of failure: if it becomes corrupted or otherwise fails, it can impact one’s ability to write new data or access their existing data.

      Should I Shard?

      Whether or not one should implement a sharded database architecture is almost always a matter of debate. Some see sharding as an inevitable outcome for databases that reach a certain size, while others see it as a headache that should be avoided unless it’s absolutely necessary, due to the operational complexity that sharding adds.

      Because of this added complexity, sharding is usually only performed when dealing with very large amounts of data. Here are some common scenarios where it may be beneficial to shard a database:

      • The amount of application data grows to exceed the storage capacity of a single database node.
      • The volume of writes or reads to the database surpasses what a single node or its read replicas can handle, resulting in slowed response times or timeouts.
      • The network bandwidth required by the application outpaces the bandwidth available to a single database node and any read replicas, resulting in slowed response times or timeouts.

      Before sharding, you should exhaust all other options for optimizing your database. Some optimizations you might want to consider include:

      • Setting up a remote database. If you’re working with a monolithic application in which all of its components reside on the same server, you can improve your database’s performance by moving it over to its own machine. This doesn’t add as much complexity as sharding since the database’s tables remain intact. However, it still allows you to vertically scale your database apart from the rest of your infrastructure.
      • Implementing caching. If your application’s read performance is what’s causing you trouble, caching is one strategy that can help to improve it. Caching involves temporarily storing data that has already been requested in memory, allowing you to access it much more quickly later on.
      • Creating one or more read replicas. Another strategy that can help to improve read performance, this involves copying the data from one database server (the primary server) over to one or more secondary servers. Following this, every new write goes to the primary before being copied over to the secondaries, while reads are made exclusively to the secondary servers. Distributing reads and writes like this keeps any one machine from taking on too much of the load, helping to prevent slowdowns and crashes. Note that creating read replicas involves more computing resources and thus costs more money, which could be a significant constraint for some.
      • Upgrading to a larger server. In most cases, scaling up one’s database server to a machine with more resources requires less effort than sharding. As with creating read replicas, an upgraded server with more resources will likely cost more money. Accordingly, you should only go through with resizing if it truly ends up being your best option.

      Bear in mind that if your application or website grows past a certain point, none of these strategies will be enough to improve performance on their own. In such cases, sharding may indeed be the best option for you.


      Sharding can be a great solution for those looking to scale their database horizontally. However, it also adds a great deal of complexity and creates more potential failure points for your application. Sharding may be necessary for some, but the time and resources needed to create and maintain a sharded architecture could outweigh the benefits for others.

      By reading this conceptual article, you should have a clearer understanding of the pros and cons of sharding. Moving forward, you can use this insight to make a more informed decision about whether or not a sharded database architecture is right for your application.

      Source link