
      How To Migrate a MySQL Database to PostgreSQL Using pgLoader


      Introduction

PostgreSQL, also known as “Postgres,” is an open-source relational database management system (RDBMS). It has grown rapidly in popularity in recent years, with many developers and companies migrating their data to Postgres from other database solutions.

The prospect of migrating a database can be intimidating, especially when migrating from one database management system to another. pgLoader is an open-source database migration tool that aims to simplify the process of migrating to PostgreSQL. It supports migrations from several file types and RDBMSs — including MySQL and SQLite — to PostgreSQL.

      This tutorial provides instructions on how to install pgLoader and use it to migrate a remote MySQL database to PostgreSQL over an SSL connection. Near the end of the tutorial, we will also briefly touch on a few different migration scenarios where pgLoader may be useful.

      Prerequisites

      To complete this tutorial, you’ll need the following:

      • Access to two servers, each running Ubuntu 18.04. Both servers should have a firewall and a non-root user with sudo privileges configured. To set these up, you can follow our Initial Server Setup guide for Ubuntu 18.04.
      • MySQL installed on one of the servers. To set this up, follow Steps 1, 2, and 3 of our guide on How To Install MySQL on Ubuntu 18.04. Please note that in order to complete all the prerequisite tutorials linked here, you will need to configure your root MySQL user to authenticate with a password, as described in Step 3 of the MySQL installation guide.
      • PostgreSQL installed on the other server. To set this up, complete Step 1 of our guide How To Install and Use PostgreSQL on Ubuntu 18.04.
      • Your MySQL server should also be configured to accept encrypted connections. To set this up, complete every step of our tutorial on How To Configure SSL/TLS for MySQL on Ubuntu 18.04, including the optional Step 6. As you follow this guide, be sure to use your PostgreSQL server as the MySQL client machine, as you will need to be able to connect to your MySQL server from your Postgres machine in order to migrate the data with pgLoader.

Please note that throughout this guide, the server on which you installed MySQL will be referred to as the “MySQL server,” and any commands that should be run on this machine will be shown with a blue background.

Similarly, this guide will refer to the other server as the "PostgreSQL" or "Postgres" server, and any commands that must be run on that machine will be shown with a red background.

      Please keep these in mind as you follow this tutorial so as to avoid any confusion.

      Step 1 — (Optional) Creating a Sample Database and Table in MySQL

      This step describes the process of creating a test database and populating it with dummy data. We encourage you to practice using pgLoader with this test case, but if you already have a database you want to migrate, you can move on to the next step.

Start by opening up the MySQL prompt on your MySQL server:

• mysql -u root -p

      After entering your root MySQL user's password, you will see the MySQL prompt.

      From there, create a new database by running the following command. You can name your database whatever you'd like, but in this guide we will name it source_db:

      • CREATE DATABASE source_db;

Then switch to this database with the USE command:

• USE source_db;

      Output

      Database changed

      Within this database, use the following command to create a sample table. Here, we will name this table sample_table but feel free to give it another name:

      • CREATE TABLE sample_table (
      • employee_id INT PRIMARY KEY,
      • first_name VARCHAR(50),
      • last_name VARCHAR(50),
      • start_date DATE,
      • salary VARCHAR(50)
      • );

      Then populate this table with some sample employee data using the following command:

      • INSERT INTO sample_table (employee_id, first_name, last_name, start_date, salary)
      • VALUES (1, 'Elizabeth', 'Cotten', '2007-11-11', '$105433.18'),
      • (2, 'Yanka', 'Dyagileva', '2017-10-30', '$107540.67'),
      • (3, 'Lee', 'Dorsey', '2013-06-04', '$118024.04'),
      • (4, 'Kasey', 'Chambers', '2010-08-18', '$116456.98'),
      • (5, 'Bram', 'Tchaikovsky', '2018-09-16', '$61989.50');

Following this, you can close the MySQL prompt:

• exit

      Now that you have a sample database loaded with dummy data, you can move on to the next step in which you will install pgLoader on your PostgreSQL server.

      Step 2 — Installing pgLoader

      pgLoader is a program that can load data into a PostgreSQL database from a variety of different sources. It uses PostgreSQL's COPY command to copy data from a source database or file — such as a comma-separated values (CSV) file — into a target PostgreSQL database.
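
To make this concrete, here is a hedged sketch of the kind of COPY statement involved; the table and column names come from the optional sample data in Step 1, and the exact statements pgLoader generates may differ:

-- Bulk-load CSV-formatted rows into an existing table.
-- pgLoader streams data through COPY rather than issuing row-by-row INSERTs.
COPY sample_table (employee_id, first_name, last_name, start_date, salary)
FROM STDIN WITH (FORMAT csv);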

      pgLoader is available from the default Ubuntu APT repositories and you can install it using the apt command. However, in this guide we will take advantage of pgLoader's useSSL option, a feature that allows for migrations from MySQL over an SSL connection. This feature is only available in the latest version of pgLoader which, as of this writing, can only be installed using the source code from its GitHub repository.

Before installing pgLoader, you will need to install its dependencies. If you haven't done so recently, update your Postgres server's package index:

• sudo apt update

      Then install the following packages:

      • sbcl: A Common Lisp compiler
      • unzip: A de-archiver for .zip files
      • libsqlite3-dev: A collection of development files for SQLite 3
      • gawk: Short for "GNU awk", a pattern scanning and processing language
      • curl: A command line tool for transferring data from a URL
      • make: A utility for managing package compilation
      • freetds-dev: A client library for MS SQL and Sybase databases
      • libzip-dev: A library for reading, creating, and modifying zip archives

      Use the following command to install these dependencies:

      • sudo apt install sbcl unzip libsqlite3-dev gawk curl make freetds-dev libzip-dev

      When prompted, confirm that you want to install these packages by pressing ENTER.

      Next, navigate to the pgLoader GitHub project's Releases page and find the latest release. For this guide, we will use the latest release at the time of this writing: version 3.6.1. Scroll down to its Assets menu and copy the link for the tar.gz file labeled Source code. Then paste the link into the following wget command. This will download the tarball to your server:

      • wget https://github.com/dimitri/pgloader/archive/v3.6.1.tar.gz

Extract the tarball:

• tar xvf v3.6.1.tar.gz

This will create a number of new directories and files on your server. Navigate into the new pgLoader parent directory:

• cd pgloader-3.6.1/

Then use the make utility to compile the pgloader binary:

• make pgloader

      This command will take some time to build the pgloader binary.

      Move the binary file into the /usr/local/bin directory, the location where Ubuntu looks for executable files:

      • sudo mv ./build/bin/pgloader /usr/local/bin/

You can test that pgLoader was installed correctly by checking its version, like so:

• pgloader --version

      Output

      pgloader version "3.6.1" compiled with SBCL 1.4.5.debian

      pgLoader is now installed, but before you can begin your migration you'll need to make some configuration changes to both your PostgreSQL and MySQL instances. We'll focus on the PostgreSQL server first.

      Step 3 — Creating a PostgreSQL Role and Database

      The pgloader command works by copying source data, either from a file or directly from a database, and inserting it into a PostgreSQL database. For this reason, you must either run pgLoader as a Linux user who has access to your Postgres database or you must specify a PostgreSQL role with the appropriate permissions in your load command.

      PostgreSQL manages database access through the use of roles. Depending on how the role is configured, it can be thought of as either a database user or a group of database users. In most RDBMSs, you create a user with the CREATE USER SQL command. Postgres, however, comes installed with a handy script called createuser. This script serves as a wrapper for the CREATE USER SQL command that you can run directly from the command line.
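
For reference, the createuser invocation you'll run below is roughly equivalent to issuing SQL like the following from a psql session (a sketch; the role name and password are placeholders for the values you'll choose in a moment):

-- Approximately what createuser executes, given the answers you'll provide below:
CREATE ROLE pgloader_pg WITH LOGIN SUPERUSER PASSWORD 'password';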

      Note: In PostgreSQL, you authenticate as a database user using the Identification Protocol, or ident, authentication method by default, rather than with a password. This involves PostgreSQL taking the client's Ubuntu username and using it as the allowed Postgres database username. This allows for greater security in many cases, but it can also cause issues in instances where you'd like an outside program to connect to one of your databases.

      pgLoader can load data into a Postgres database through a role that authenticates with the ident method as long as that role shares the same name as the Linux user profile issuing the pgloader command. However, to keep this process as clear as possible, this tutorial describes setting up a different PostgreSQL role that authenticates with a password rather than with the ident method.

      Run the following command on your Postgres server to create a new role. Note the -P flag, which tells createuser to prompt you to enter a password for the new role:

      • sudo -u postgres createuser --interactive -P

      You may first be prompted for your sudo password. The script will then prompt you to enter a name for the new role. In this guide, we'll call this role pgloader_pg:

      Output

      Enter name of role to add: pgloader_pg

      Following that, createuser will prompt you to enter and confirm a password for this role. Be sure to take note of this password, as you'll need it to perform the migration in Step 5:

      Output

Enter password for new role:
Enter it again:

      Lastly, the script will ask you if the new role should be classified as a superuser. In PostgreSQL, connecting to the database with a superuser role allows you to circumvent all of the database's permissions checks, except for the right to log in. Because of this, the superuser privilege should not be used lightly, and the PostgreSQL documentation recommends that you do most of your database work as a non-superuser role. However, because pgLoader needs broad privileges to access and load data into tables, you can safely grant this new role superuser privileges. Do so by typing y and then pressing ENTER:

      Output

      . . . Shall the new role be a superuser? (y/n) y
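
If you'd like to confirm the new role's attributes afterwards, one option (a hedged example, not a required step in this tutorial) is to query the pg_roles catalog from a psql session:

-- List the new role and whether it has the superuser attribute:
SELECT rolname, rolsuper FROM pg_roles WHERE rolname = 'pgloader_pg';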

      PostgreSQL comes with another useful script that allows you to create a database from the command line. Since pgLoader also needs a target database into which it can load the source data, run the following command to create one. We'll name this database new_db but feel free to modify that if you like:

      • sudo -u postgres createdb new_db

      If there aren't any errors, this command will complete without any output.

      Now that you have a dedicated PostgreSQL user and an empty database into which you can load your MySQL data, there are just a few more changes you'll need to make before performing a migration. You'll need to create a dedicated MySQL user with access to your source database and add your client-side certificates to Ubuntu's trusted certificate store.

      Step 4 — Creating a Dedicated User in MySQL and Managing Certificates

      Protecting data from snoopers is one of the most important parts of any database administrator's job. Migrating data from one machine to another opens up an opportunity for malicious actors to sniff the packets traveling over the network connection if it isn't encrypted. In this step, you will create a dedicated MySQL user which pgLoader will use to perform the migration over an SSL connection.

Begin by opening up your MySQL prompt:

• mysql -u root -p

      From the MySQL prompt, use the following CREATE USER command to create a new MySQL user. We will name this user pgloader_my. Because this user will only access MySQL from your PostgreSQL server, be sure to replace your_postgres_server_ip with the public IP address of your PostgreSQL server. Additionally, replace password with a secure password or passphrase:

      • CREATE USER 'pgloader_my'@'your_postgres_server_ip' IDENTIFIED BY 'password' REQUIRE SSL;

      Note the REQUIRE SSL clause at the end of this command. This will restrict the pgloader_my user to only access the database through a secure SSL connection.

      Next, grant the pgloader_my user access to the target database and all of its tables. Here, we'll specify the database we created in the optional Step 1, but if you have your own database you'd like to migrate, use its name in place of source_db:

• GRANT ALL ON source_db.* TO 'pgloader_my'@'your_postgres_server_ip';

Then run the FLUSH PRIVILEGES command to reload the grant tables, enabling the privilege changes:

• FLUSH PRIVILEGES;

After this, you can close the MySQL prompt:

• exit

      Now go back to your Postgres server terminal and attempt to log in to the MySQL server as the new pgloader_my user. If you followed the prerequisite guide on configuring SSL/TLS for MySQL then you will already have mysql-client installed on your PostgreSQL server and you should be able to connect with the following command:

      • mysql -u pgloader_my -p -h your_mysql_server_ip

If the command is successful, you will see the MySQL prompt.

After confirming that your pgloader_my user can successfully connect, go ahead and close the prompt:

• exit

At this point, you have a dedicated MySQL user that can access the source database from your Postgres machine. However, if you were to try to migrate your MySQL database using SSL, the attempt would fail.

      The reason for this is that pgLoader isn't able to read MySQL's configuration files, and thus doesn't know where to look for the CA certificate or client certificate that you copied to your PostgreSQL server in the prerequisite SSL/TLS configuration guide. Rather than ignoring SSL requirements, though, pgLoader requires the use of trusted certificates in cases where SSL is needed to connect to MySQL. Accordingly, you can resolve this issue by adding the ca.pem and client-cert.pem files to Ubuntu's trusted certificate store.

      To do this, copy over the ca.pem and client-cert.pem files to the /usr/local/share/ca-certificates/ directory. Note that you must also rename these files so they have the .crt file extension. If you don't rename them, your system will not be able to recognize that you've added these new certificates:

      • sudo cp ~/client-ssl/ca.pem /usr/local/share/ca-certificates/ca.pem.crt
      • sudo cp ~/client-ssl/client-cert.pem /usr/local/share/ca-certificates/client-cert.pem.crt

      Following this, run the update-ca-certificates command. This program looks for certificates within /usr/local/share/ca-certificates, adds any new ones to the /etc/ssl/certs/ directory, and generates a list of trusted SSL certificates — ca-certificates.crt — based on the contents of the /etc/ssl/certs/ directory:

      • sudo update-ca-certificates

      Output

Updating certificates in /etc/ssl/certs...
2 added, 0 removed; done.
Running hooks in /etc/ca-certificates/update.d...
done.

      With that, you're all set to migrate your MySQL database to PostgreSQL.

      Step 5 — Migrating the Data

      Now that you've configured remote access from your PostgreSQL server to your MySQL server, you're ready to begin the migration.

      Note: It's important to back up your database before taking any action that could impact the integrity of your data. However, this isn't necessary when performing a migration with pgLoader, since it doesn't delete or transform data; it only copies it.

      That said, if you're feeling cautious and would like to back up your data before migrating it, you can do so with the mysqldump utility. See the official MySQL documentation for details.
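
For example, a minimal mysqldump invocation for the sample database from Step 1 might look like this (the output file name is arbitrary):

• mysqldump -u root -p source_db > source_db_backup.sql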

      pgLoader allows users to migrate an entire database with a single command. For a migration from a MySQL database to a PostgreSQL database on a separate server, the command would have the following syntax:

• pgloader mysql://mysql_username:password@mysql_server_ip/source_database_name?option_1=value&option_n=value postgresql://postgresql_role_name:password@postgresql_server_ip/target_database_name?option_1=value&option_n=value

      This includes the pgloader command and two connection strings, the first for the source database and the second for the target database. Both of these connection strings begin by declaring what type of DBMS the connection string points to, followed by the username and password that have access to the database (separated by a colon), the host address of the server where the database is installed, the name of the database pgLoader should target, and various options that affect pgLoader's behavior.

      Using the parameters defined earlier in this tutorial, you can migrate your MySQL database using a command with the following structure. Be sure to replace any highlighted values to align with your own setup:

      • pgloader mysql://pgloader_my:mysql_password@mysql_server_ip/source_db?useSSL=true postgresql://pgloader_pg:postgresql_password@localhost/new_db

      Note that this command includes the useSSL option in the MySQL connection string. By setting this option to true, pgLoader will connect to MySQL over SSL. This is necessary, as you've configured your MySQL server to only accept secure connections.

      If this command is successful, you will see an output table describing how the migration went:

      Output

             table name     errors       rows      bytes      total time
-----------------------  ---------  ---------  ---------  --------------
        fetch meta data          0          2                     0.111s
         Create Schemas          0          0                     0.001s
       Create SQL Types          0          0                     0.005s
          Create tables          0          2                     0.017s
         Set Table OIDs          0          1                     0.010s
-----------------------  ---------  ---------  ---------  --------------
 source_db.sample_table          0          5     0.2 kB          0.048s
-----------------------  ---------  ---------  ---------  --------------
COPY Threads Completion          0          4                     0.052s
 Index Build Completion          0          1                     0.011s
         Create Indexes          0          1                     0.006s
        Reset Sequences          0          0                     0.014s
           Primary Keys          0          1                     0.001s
    Create Foreign Keys          0          0                     0.000s
        Create Triggers          0          0                     0.000s
       Install Comments          0          0                     0.000s
-----------------------  ---------  ---------  ---------  --------------
      Total import time          ✓          5     0.2 kB          0.084s

To check that the data was migrated correctly, open up the PostgreSQL prompt:

• sudo -u postgres psql

From there, connect to the database into which you loaded the data:

• \c new_db

      Then run the following query to test whether the migrated data is stored in your PostgreSQL database:

      • SELECT * FROM source_db.sample_table;

      Note: Notice the FROM clause in this query specifying the sample_table held within the source_db schema:

      • . . . FROM source_db.sample_table;

      This is called a qualified name. You could go further and specify the fully qualified name by including the database's name as well as those of the schema and table:

      • . . . FROM new_db.source_db.sample_table;

When you run queries in a PostgreSQL database, you don't need to be this specific if the table is held within the default public schema. The reason you must do so here is that when pgLoader loads data into Postgres, it creates and targets a new schema named after the original database — in this case, source_db. This is pgLoader's default behavior for MySQL to PostgreSQL migrations. However, you can use a load file to instruct pgLoader to change the table's schema to public once it's done loading data. See the next step for an example of how to do this.

      If the data was indeed loaded correctly, you will see the following table in the query's output:

      Output

 employee_id | first_name |  last_name  | start_date |   salary
-------------+------------+-------------+------------+------------
           1 | Elizabeth  | Cotten      | 2007-11-11 | $105433.18
           2 | Yanka      | Dyagileva   | 2017-10-30 | $107540.67
           3 | Lee        | Dorsey      | 2013-06-04 | $118024.04
           4 | Kasey      | Chambers    | 2010-08-18 | $116456.98
           5 | Bram       | Tchaikovsky | 2018-09-16 | $61989.50
(5 rows)

To close the Postgres prompt, run the following command:

• \q

      Now that we've gone over how to migrate a MySQL database over a network and load it into a PostgreSQL database, we will go over a few other common migration scenarios in which pgLoader can be useful.

      Step 6 — Exploring Other Migration Options

      pgLoader is a highly flexible tool that can be useful in a wide variety of situations. Here, we'll take a quick look at a few other ways you can use pgLoader to migrate a MySQL database to PostgreSQL.

      Migrating with a pgLoader Load File

      In the context of pgLoader, a load file, or command file, is a file that tells pgLoader how to perform a migration. This file can include commands and options that affect pgLoader's behavior, giving you much finer control over how your data is loaded into PostgreSQL and allowing you to perform complex migrations.

      pgLoader's documentation provides comprehensive instructions on how to use and extend these files to support a number of migration types, so here we will work through a comparatively rudimentary example. We will perform the same migration we ran in Step 5, but will also include an ALTER SCHEMA command to change the new_db database's schema from source_db to public.

To begin, create a new load file on the Postgres server using your preferred text editor:

• nano pgload_test.load

      Then add the following content, making sure to update the highlighted values to align with your own configuration:

      pgload_test.load

      LOAD DATABASE
           FROM      mysql://pgloader_my:mysql_password@mysql_server_ip/source_db?useSSL=true
           INTO pgsql://pgloader_pg:postgresql_password@localhost/new_db
      
       WITH include drop, create tables
      
      ALTER SCHEMA 'source_db' RENAME TO 'public'
      ;
      

Here is what each of these clauses does:

      • LOAD DATABASE: This line instructs pgLoader to load data from a separate database, rather than a file or data archive.
      • FROM: This clause specifies the source database. In this case, it points to the connection string for the MySQL database we created in Step 1.
• INTO: Likewise, this line specifies the PostgreSQL database into which pgLoader should load the data.
      • WITH: This clause allows you to define specific behaviors for pgLoader. You can find the full list of WITH options that are compatible with MySQL migrations here. In this example we only include two options:
        • include drop: When this option is used, pgLoader will drop any tables in the target PostgreSQL database that also appear in the source MySQL database. If you use this option when migrating data to an existing PostgreSQL database, you should back up the entire database to avoid losing any data.
        • create tables: This option tells pgLoader to create new tables in the target PostgreSQL database based on the metadata held in the MySQL database. If the opposite option, create no tables, is used, then the target tables must already exist in the target Postgres database prior to the migration.
• ALTER SCHEMA: Following the WITH clause, you can add specific SQL commands like this to instruct pgLoader to perform additional actions. Here, we instruct pgLoader to change the new Postgres database's schema from source_db to public, but only after it has created the schema. Note that you can also nest such commands within other clauses — such as BEFORE LOAD DO — to instruct pgLoader to execute those commands at specific points in the migration process. A short sketch of this pattern follows this list.
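
As a hedged sketch of that last pattern, a BEFORE LOAD DO clause runs arbitrary SQL against the target database before any data is copied; the staging schema name here is hypothetical:

LOAD DATABASE
     FROM      mysql://pgloader_my:mysql_password@mysql_server_ip/source_db?useSSL=true
     INTO pgsql://pgloader_pg:postgresql_password@localhost/new_db

 BEFORE LOAD DO
 $$ CREATE SCHEMA IF NOT EXISTS staging; $$;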

      This is a demonstrative example of what you can include in a load file to modify pgLoader's behavior. The complete list of clauses that one can add to a load file and what they do can be found in the official pgLoader documentation.

      Save and close the load file after you've finished adding this content. To use it, include the name of the file as an argument to the pgloader command:

      • pgloader pgload_test.load

To test that the migration was successful, open up the Postgres prompt:

• sudo -u postgres psql

Then connect to the database:

• \c new_db

      And run the following query:

      • SELECT * FROM sample_table;

      Output

 employee_id | first_name |  last_name  | start_date |   salary
-------------+------------+-------------+------------+------------
           1 | Elizabeth  | Cotten      | 2007-11-11 | $105433.18
           2 | Yanka      | Dyagileva   | 2017-10-30 | $107540.67
           3 | Lee        | Dorsey      | 2013-06-04 | $118024.04
           4 | Kasey      | Chambers    | 2010-08-18 | $116456.98
           5 | Bram       | Tchaikovsky | 2018-09-16 | $61989.50
(5 rows)

      This output confirms that pgLoader migrated the data successfully, and also that the ALTER SCHEMA command we added to the load file worked as expected, since we didn't need to specify the source_db schema in the query to view the data.

      Note that if you plan to use a load file to migrate data held on one database to another located on a separate machine, you will still need to adjust any relevant networking and firewall rules in order for the migration to be successful.

      Migrating a MySQL Database to PostgreSQL Locally

You can use pgLoader to migrate a MySQL database to a PostgreSQL database housed on the same machine. All you need to do is run the migration command from a Linux user profile with access to the root MySQL user:

      • pgloader mysql://root@localhost/source_db pgsql://sammy:postgresql_password@localhost/target_db

      Performing a local migration like this means you don't have to make any changes to MySQL's default networking configuration or your system's firewall rules.

      Migrating from a CSV file

      You can also load a PostgreSQL database with data from a CSV file.

      Assuming you have a CSV file of data named load.csv, the command to load it into a Postgres database might look like this:

      • pgloader load.csv pgsql://sammy:password@localhost/target_db

Because the CSV format is not fully standardized, there's a chance that you will run into issues when loading data directly from a CSV file in this manner. Fortunately, you can correct for irregularities by passing additional options on the pgLoader command line or by specifying them in a load file, as in the sketch below. See the pgLoader documentation on the subject for more details.
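
As a hedged illustration, a load file for a CSV import might look something like the following. The column list, the tablename option, and the field options are assumptions based on the sample data from Step 1, not values required by pgLoader:

LOAD CSV
     FROM 'load.csv' (employee_id, first_name, last_name, start_date, salary)
     INTO pgsql://sammy:password@localhost/target_db?tablename=sample_table
     WITH skip header = 1,
          fields optionally enclosed by '"',
          fields terminated by ','
;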

      Migrating to a Managed PostgreSQL Database

It's also possible to perform a migration from a self-managed database to a managed PostgreSQL database. To illustrate how this kind of migration could look, we will use the MySQL server from this guide and a DigitalOcean Managed PostgreSQL Database. We'll also use the sample database we created in Step 1, but if you skipped that step and have your own database you'd like to migrate, you can point to that one instead.

      Note: For instructions on how to set up a DigitalOcean Managed Database, please refer to our Managed Database Quickstart guide.

      For this migration, we won't need pgLoader’s useSSL option since it only works with remote MySQL databases and we will run this migration from a local MySQL database. However, we will use the sslmode=require option when we load and connect to the DigitalOcean Managed PostgreSQL database, which will ensure your data stays protected.

Because we're not using the useSSL option this time around, you can use apt to install pgLoader along with the postgresql-client package, which will allow you to access the Managed PostgreSQL Database from your MySQL server:

      • sudo apt install pgloader postgresql-client

      Following that, you can run the pgloader command to migrate the database. To do this, you'll need the connection string for the Managed Database.

For DigitalOcean Managed Databases, you can copy the connection string from the Cloud Control Panel. First, click Databases in the left-hand sidebar menu and select the database to which you want to migrate the data. Then scroll down to the Connection Details section. Click on the drop-down menu and select Connection string. Then, click the Copy button to copy the string to your clipboard and paste it into the following migration command, replacing the example PostgreSQL connection string shown here. This will migrate your MySQL database into the defaultdb PostgreSQL database as the doadmin PostgreSQL role:

      • pgloader mysql://root:password@localhost/source_db postgres://doadmin:password@db_host/defaultdb?sslmode=require

Following this, you can use the same connection string as an argument to psql to connect to the managed PostgreSQL database and confirm that the migration was successful:

      • psql postgres://doadmin:password@db_host/defaultdb?sslmode=require

      Then, run the following query to check that pgLoader correctly migrated the data:

      • SELECT * FROM source_db.sample_table;

      Output

 employee_id | first_name |  last_name  | start_date |   salary
-------------+------------+-------------+------------+------------
           1 | Elizabeth  | Cotten      | 2007-11-11 | $105433.18
           2 | Yanka      | Dyagileva   | 2017-10-30 | $107540.67
           3 | Lee        | Dorsey      | 2013-06-04 | $118024.04
           4 | Kasey      | Chambers    | 2010-08-18 | $116456.98
           5 | Bram       | Tchaikovsky | 2018-09-16 | $61989.50
(5 rows)

      This confirms that pgLoader successfully migrated your MySQL database to your managed PostgreSQL instance.

      Conclusion

      pgLoader is a flexible tool that can perform a database migration in a single command. With a few configuration tweaks, it can migrate an entire database from one physical machine to another using a secure SSL/TLS connection. Our hope is that by following this tutorial, you will have gained a clearer understanding of pgLoader's capabilities and potential use cases.

      After migrating your data over to PostgreSQL, you may find the following tutorials to be of interest:




      How To Audit a PostgreSQL Database with InSpec on Ubuntu 18.04


      The author selected the Free and Open Source Fund to receive a donation as part of the Write for DOnations program.

      Introduction

      InSpec is an open-source, automated testing framework for testing and auditing your system to ensure the compliance of integration, security, and other policy requirements. Developers can test the actual state of their infrastructure and applications against a target state using InSpec code.

To specify the policy requirements you're testing for, InSpec includes audit controls. Traditionally, developers manually enforce policy requirements, often right before deploying changes to production. With InSpec, however, developers can continuously evaluate compliance at every stage of product development, which aids in solving issues earlier in the development process. The InSpec DSL (Domain Specific Language), built on RSpec, a DSL testing tool written in Ruby, specifies the syntax used to write the audit controls.

      InSpec also includes a collection of resources to assist in configuring specific parts of your system and to simplify making audit controls. There is a feature to write your own custom resources when you need to define a specific solution that isn’t available. Universal matchers allow you to compare resource values to expectations across all InSpec tests.

      In this tutorial, you’ll install InSpec on a server running Ubuntu 18.04. You will start by writing a test that verifies the operating system family of the server, then you’ll create a PostgreSQL audit profile from the ground up. This audit profile starts by checking that you have PostgreSQL installed on the server and that its services are running. Then you’ll add tests to check that the PostgreSQL service is running with the correct port, address, protocol, and user. Next you’ll test specific PostgreSQL configuration parameters, and finally, you’ll audit client authentication configuration.

      Prerequisites

Before following this tutorial, you will need the following:

• One Ubuntu 18.04 server with a firewall and a non-root user with sudo privileges, which you can set up by following our Initial Server Setup guide for Ubuntu 18.04.
• PostgreSQL 10 installed on the server. To set this up, follow our guide on How To Install and Use PostgreSQL on Ubuntu 18.04.

      Step 1 — Preparing the Environment

      In this step, you’ll download and unpack the latest stable version of InSpec into your home directory. InSpec provides installable binaries on their downloads page.

Navigate to your home directory:

• cd ~

      Now download the binary with curl:

• curl -LO https://packages.chef.io/files/stable/inspec/3.7.11/ubuntu/18.04/inspec_3.7.11-1_amd64.deb

      Next, use the sha256sum command to generate a checksum of the downloaded file. This is to verify the integrity and authenticity of the downloaded file.

      • sha256sum inspec_3.7.11-1_amd64.deb

      Checksums for each binary are listed on the InSpec downloads page, so visit the downloads page to compare with your output from this command.

      Output

      e665948f9c0441e8648b08f8d3c8d34a86f9e994609877a7e4853c012dbc7523 inspec_3.7.11-1_amd64.deb

      If the checksums are different, delete the downloaded file and repeat the download process.

Next, you'll install the downloaded binary using the dpkg command, a package manager that comes with all Debian-based systems, such as Ubuntu, by default. The -i flag prompts the dpkg command to install the package file.

      • sudo dpkg -i inspec_3.7.11-1_amd64.deb

If there are no errors, it means that you've installed InSpec successfully. To verify the installation, enter the following command:

• inspec version

      You'll receive output showing the version of InSpec you just installed:

      Output

      3.7.11

If you don't see a version number displayed, repeat the download and installation process in this step.

After this, you can delete inspec_3.7.11-1_amd64.deb, since you no longer need it now that you've installed the package:

      • rm inspec_3.7.11-1_amd64.deb

      You've successfully installed InSpec on your server. In the next step, you will write a test to verify the operating system family of your server.

      Step 2 — Completing Your First InSpec Test

      In this step, you'll complete your first InSpec test, which will be testing that your operating system family is debian.

      You will use the os resource, which is a built-in InSpec audit resource to test the platform on which the system is running. You'll also use the eq matcher. The eq matcher is a universal matcher that tests for the exact equality of two values.

An InSpec test consists of a describe block, which contains one or more it and its statements, each of which validates one of the resource's features. Each statement describes an expectation about a specific condition of the system as an assertion. Two keywords that you can include to make an assertion are should and should_not, which assert that the condition should be true and false respectively.

Create a file called os_family.rb to hold your test and open it with your text editor:

• nano os_family.rb

      Add the following to your file:

      os_family.rb

      describe os.family do
        it {should eq 'debian'}
      end
      

      This test ensures that the operating system family of the target system is debian. Other possible values are windows, unix, bsd, and so on. You can find a complete list in the os resource documentation. Save and exit the file.

Next, run your test with the following command:

• inspec exec os_family.rb

      The test will pass, and you'll receive output resembling the following:

      Output

Profile: tests from os_family.rb (tests from os_family.rb)
Version: (not specified)
Target:  local://

  debian
     ✔  should eq "debian"

Test Summary: 1 successful, 0 failures, 0 skipped

In your output, the Profile contains the name of the profile that just executed. Since this test is not included in a profile, InSpec generates a default profile name from the test's file name: tests from os_family.rb. (You'll work with InSpec profiles in the next section, where you will start building your PostgreSQL InSpec profile.) Here InSpec presents the Version as not specified, because you can only specify versions in profiles.

      The Target field specifies the target system that the test is executed on, which can be local or a remote system via ssh. In this case, you've executed your test on the local system so the target shows local://.

      Usefully, the output also displays the executed test with a checkmark symbol (✔) to the left indicating a successful test. The output will show a cross symbol (✘) if the test fails.

      Finally, the test summary gives overall details about how many tests were successful, failed, and skipped. In this instance, you had a single successful test.

Now you'll see what the output looks like for a failed test. Open os_family.rb:

• nano os_family.rb

      In the test you created earlier in this step, you'll now change the expected value of the operating system family from debian to windows. Your file contents after this will be the following:

      os_family.rb

      describe os.family do
        it {should eq 'windows'}
      end
      

      Save and exit the file.

Next, run the updated test with the following command:

• inspec exec os_family.rb

      You will get output similar to the following:

      Output

Profile: tests from os_family.rb (tests from os_family.rb)
Version: (not specified)
Target:  local://

  debian
     ✘  should eq "windows"

     expected: "windows"
          got: "debian"

     (compared using ==)

Test Summary: 0 successful, 1 failure, 0 skipped

      As expected, the test failed. The output indicates that your expected (windows) and actual (debian) values do not match for the os.family property. The (compared using ==) output indicates that the eq matcher performed a string comparison between the two values to come up with this result.

      In this step, you've written a successful test that verifies the operating system family of the server. You've also created a failed test in order to see what the InSpec output for a failed test looks like. In the next step, you will start building the audit profile to test your PostgreSQL installation.

      Step 3 — Auditing Your PostgreSQL Installation

Now, you will audit your PostgreSQL installation. You'll start by checking that you have PostgreSQL installed and that its service is running correctly. Then you'll audit the PostgreSQL system port and process. For your PostgreSQL audit, you will create various InSpec controls, all within an InSpec profile named PostgreSQL.

      An InSpec control is a high-level grouping of related tests. Within a control, you can have multiple describe blocks, as well as metadata to describe your tests such as impact level, title, description, and tags. InSpec profiles organize controls to support dependency management and code reuse, which both help manage test complexity. They are also useful for packaging and sharing tests with the public via the Chef Supermarket. You can use profiles to define custom resources that you would implement as regular Ruby classes.

      To create an InSpec profile, you will use the init command. Enter this command to create the PostgreSQL profile:

      • inspec init profile PostgreSQL

This creates the profile in a new directory with the same name as your profile, in this case PostgreSQL. Now, move into the new directory:

• cd PostgreSQL/

      The directory structure will look like this:

      PostgreSQL/
      ├── controls
      │   └── example.rb
      ├── inspec.yml
      ├── libraries
      └── README.md
      

      The controls/example.rb file contains a sample control that tests to see if the /tmp folder exists on the target system. This is present only as a sample and you will replace it with your own test.

      Your first test will be to ensure that you have the package postgresql-10 installed on your system and that you have the postgresql service installed, enabled, and running.

      Rename the controls/example.rb file to controls/postgresql.rb:

      • mv controls/example.rb controls/postgresql.rb

      Next, open the file with your text editor:

      • nano controls/postgresql.rb

      Replace the content of the file with the following:

      controls/postgresql.rb

      control '1-audit_installation' do
        impact 1.0
        title 'Audit PostgreSQL Installation'
        desc 'Postgres should be installed and running'
      
        describe package('postgresql-10') do
          it {should be_installed}
          its('version') {should cmp >= '10'}
        end
      
        describe service('postgresql@10-main') do
          it {should be_enabled}
          it {should be_installed}
          it {should be_running}
        end
      end
      

      In the preceding code block, you begin by defining the control with its name and metadata.

In the first describe block, you use the package resource and pass in the PostgreSQL package name postgresql-10 as a resource argument. The package resource provides the matcher be_installed to test that the named package is installed on the system. It returns true if you have the package installed, and false otherwise. Next, you use the its statement to validate that the version of the installed PostgreSQL package is at least 10. You are using cmp instead of eq because package version strings usually contain other attributes apart from the numerical version. eq returns true only if there is an exact match, while cmp is less restrictive.

      In the second describe block, you use the service resource and pass in the PostgreSQL 10 service name postgresql@10-main as a resource argument. The service resource provides the matchers be_enabled, be_installed, and be_running and they return true if you have the named service installed, enabled, and running on the target system respectively.

      Save and exit your file.

Next, you will run your profile. Make sure you're in the ~/PostgreSQL directory before running the following command:

• inspec exec .

      Since you completed the PostgreSQL prerequisite tutorial, your test will pass. Your output will look similar to the following:

      Output

Profile: InSpec Profile (PostgreSQL)
Version: 0.1.0
Target:  local://

  ✔  1-audit_installation: Audit PostgreSQL Installation
     ✔  System Package postgresql-10 should be installed
     ✔  System Package postgresql-10 version should cmp >= "10"
     ✔  Service postgresql@10-main should be enabled
     ✔  Service postgresql@10-main should be installed
     ✔  Service postgresql@10-main should be running

Profile Summary: 1 successful control, 0 control failures, 0 controls skipped
Test Summary: 5 successful, 0 failures, 0 skipped

      The output indicates that your control was successful. A control is successful if, and only if, all the tests in it are successful. The output also confirms that all your tests were successful.

      Now that you've verified that the correct version of PostgreSQL is installed and the service is fine, you will create a new control that ensures that PostgreSQL is listening on the correct port, address, and protocol.

For this test, you will also use attributes. An InSpec attribute is used to parameterize a profile to enable easy reuse in different environments or target systems. Here, you'll define a port attribute.

Open the inspec.yml file in your text editor:

• nano inspec.yml

      You'll append the port attribute to the end of the file. Add the following at the end of your file:

      inspec.yml

      ...
      attributes:
        - name: port
          type: string
          default: '5432'
      

      In the preceding code block, you added the port attribute and set it to a default value of 5432 because that is the port PostgreSQL listens on by default.

Save and exit the file. Then run inspec check to verify the profile is still valid, since you just edited inspec.yml:

• inspec check .

      If there are no errors, you can proceed. Otherwise, open the inspec.yml file and ensure that the attribute is present at the end of the file.

      Now you'll create the control that checks that the PostgreSQL process is running and configured with the correct user. Open controls/postgresql.rb in your text editor:

      • nano controls/postgresql.rb

      Append the following control to the end of your current tests file controls/postgresql.rb:

      controls/postgresql.rb

      ...
      PORT = attribute('port')
      
      control '2-audit_address_port' do
        impact 1.0
        title 'Audit Process and Port'
        desc 'Postgres port should be listening and the process should be running'
      
        describe port(PORT) do
          it {should be_listening}
          its('addresses') {should include '127.0.0.1'}
          its('protocols') {should cmp 'tcp'}
        end
      
        describe processes('postgres') do
          it {should exist}
          its('users') {should include 'postgres'}
        end
      
        describe user('postgres') do
          it {should exist}
        end
      end
      

      Here you begin by declaring a PORT variable to hold the value of the port profile attribute. Then you declare the control and its metadata.

In the first describe block, you include the port resource to test basic port properties. The port resource provides the matchers be_listening, addresses, and protocols. You use the be_listening matcher to test that the named port is listening on the target system. It returns true if the port 5432 is listening and returns false otherwise. The addresses matcher tests if the specified address is associated with the port. In this case, PostgreSQL will be listening on the local address, 127.0.0.1. The protocols matcher tests the Internet protocol the port is listening for, which can be icmp, tcp/tcp6, or udp/udp6. PostgreSQL will be listening for tcp connections.

      In the second describe block, you include the processes resource. You use the processes resource to test properties for programs that are running on the system. First, you verify that the postgres process exists on the system, then you use the users matcher to test that the postgres user owns the postgres process.

      In the third describe block, you have the user resource. You include the user resource to test user properties for a user such as whether the user exists or not, the group the user belongs to, and so on. Using this resource, you test that the postgres user exists on the system. Save and exit controls/postgresql.rb.

Next, run your profile with the following command:

• inspec exec .

      The tests will pass, and your output will resemble the following:

      Output

Profile: InSpec Profile (PostgreSQL)
Version: 0.1.0
Target:  local://

  ✔  1-audit_installation: Audit PostgreSQL Installation
     ✔  System Package postgresql-10 should be installed
     ✔  System Package postgresql-10 version should cmp >= "10"
     ✔  Service postgresql@10-main should be enabled
     ✔  Service postgresql@10-main should be installed
     ✔  Service postgresql@10-main should be running
  ✔  2-audit_address_port: Audit Process and Port
     ✔  Port 5432 should be listening
     ✔  Port 5432 addresses should include "127.0.0.1"
     ✔  Port 5432 protocols should cmp == "tcp"
     ✔  Processes postgres should exist
     ✔  Processes postgres users should include "postgres"
     ✔  User postgres should exist

Profile Summary: 2 successful controls, 0 control failures, 0 controls skipped
Test Summary: 11 successful, 0 failures, 0 skipped

      The output indicates that both of your controls and all of your tests were successful.

In this section, you have created your first InSpec profile and control and used them to organize your tests. You've used several InSpec resources to ensure that you have the correct version of PostgreSQL installed, the PostgreSQL service enabled and running correctly, and that the postgres user exists on the system. With this setup, you're ready to audit your configuration.

      Step 4 — Auditing Your PostgreSQL Configuration

      In this step, you'll audit some PostgreSQL configuration values, which will give you a foundation for working with these configuration files, allowing you to audit any PostgreSQL configuration parameters as desired.

Now that you have tests auditing the PostgreSQL installation, you'll audit your PostgreSQL configuration itself. PostgreSQL has several configuration parameters that you can use to tune it as desired, and these are stored in the configuration file located by default at /etc/postgresql/10/main/postgresql.conf. You could have different requirements regarding PostgreSQL configuration for your various deployments, such as logging, password encryption, SSL, and replication strategies; you specify these requirements in the configuration file.

You will use the postgres_conf resource, which tests specific, named configuration options against expected values in the contents of the PostgreSQL configuration file.

      This test will assume some non-default PostgreSQL configuration values that you'll set manually.

      Open the PostgreSQL configuration file in your favorite text editor:

      • sudo nano /etc/postgresql/10/main/postgresql.conf

      Set the following configuration values. If the option already exists in the file but is commented out, uncomment it by removing the #, and set the value as provided:

      /etc/postgresql/10/main/postgresql.conf

      password_encryption = scram-sha-256
      logging_collector = on
      log_connections = on
      log_disconnections = on
      log_duration = on
      

The configuration values you have set do the following:

      • Ensure that saved passwords are always encrypted with the scram-sha-256 algorithm.
      • Enable the logging collector, which is a background process that captures log messages from the standard error (stderr) and redirects them to a log file.
      • Enable logging of connection attempts to the PostgreSQL server as well as successful connections.
      • Enable logging of session terminations.
      • Enable logging of the duration of every completed statement.

      Save and exit the configuration file. Then restart the PostgreSQL service:

      • sudo service postgresql@10-main restart

      You'll test for only a few configuration options, but you can test any PostgreSQL configuration option with the postgres_conf resource.

      You will pass in your PostgreSQL configuration directory, which is at /etc/postgresql/10/main, using a new profile attribute, postgres_conf_dir. This configuration directory is not the same across all operating systems and platforms, so by passing it in as a profile attribute, you'll be making this profile easier to reuse in different environments.

Open your inspec.yml file:

• nano inspec.yml

      Add this new attribute to the attributes section of inspec.yml:

      inspec.yml

      ...
        - name: postgres_conf_dir
          type: string
          default: '/etc/postgresql/10/main'
      

Save and exit your file. Then, since you just edited inspec.yml, run the following command to verify that the InSpec profile is still valid:

• inspec check .

      If there are no errors, you can proceed. Otherwise, open the inspec.yml file and ensure that the above lines are present at the end of the file.

      Now you will create the control that audits the configuration values you are enforcing. Append the following control to the end of the tests file controls/postgresql.rb:

      controls/postgresql.rb

      ...
      POSTGRES_CONF_DIR = attribute('postgres_conf_dir')
      POSTGRES_CONF_PATH = File.join(POSTGRES_CONF_DIR, 'postgresql.conf')
      
      control '3-postgresql' do
        impact 1.0
        title 'Audit PostgreSQL Configuration'
        desc 'Audits specific configuration options'
      
        describe postgres_conf(POSTGRES_CONF_PATH) do
          its('port') {should eq PORT}
          its('password_encryption') {should eq 'scram-sha-256'}
          its('ssl') {should eq 'on'}
          its('logging_collector') {should eq 'on'}
          its('log_connections') {should eq 'on'}
          its('log_disconnections') {should eq 'on'}
          its('log_duration') {should eq 'on'}
        end
      end
      

      Here you define two variables:

      • POSTGRES_CONF_DIR holds the postgres_conf_dir attribute as defined in the profile configuration.
      • POSTGRES_CONF_PATH holds the absolute path of the configuration file by concatenating the configuration file name with the configuration directory using File.join.

      Next, you define the control with its name and metadata. Then you use the postgres_conf resource together with the eq matcher to ensure your required values for the configuration options are correct. Save and exit controls/postgresql.rb.

Next, you will run the test with the following command:

• inspec exec .

The tests will pass, and your output will resemble the following:

      Output

Profile: InSpec Profile (PostgreSQL)
Version: 0.1.0
Target:  local://

  ✔  1-audit_installation: Audit PostgreSQL Installation
     ✔  System Package postgresql-10 should be installed
     ✔  System Package postgresql-10 version should cmp >= "10"
     ✔  Service postgresql@10-main should be enabled
     ✔  Service postgresql@10-main should be installed
     ✔  Service postgresql@10-main should be running
  ✔  2-audit_address_port: Audit Process and Port
     ✔  Port 5432 should be listening
     ✔  Port 5432 addresses should include "127.0.0.1"
     ✔  Port 5432 protocols should cmp == "tcp"
     ✔  Processes postgres should exist
     ✔  Processes postgres users should include "postgres"
     ✔  User postgres should exist
  ✔  3-postgresql: Audit PostgreSQL Configuration
     ✔  PostgreSQL Configuration port should eq "5432"
     ✔  PostgreSQL Configuration password_encryption should eq "scram-sha-256"
     ✔  PostgreSQL Configuration ssl should eq "on"
     ✔  PostgreSQL Configuration logging_collector should eq "on"
     ✔  PostgreSQL Configuration log_connections should eq "on"
     ✔  PostgreSQL Configuration log_disconnections should eq "on"
     ✔  PostgreSQL Configuration log_duration should eq "on"

Profile Summary: 3 successful controls, 0 control failures, 0 controls skipped
Test Summary: 18 successful, 0 failures, 0 skipped

      The output indicates that your three controls and all your tests were successful without any skipped tests or controls.

      In this step, you've added a new InSpec control that tests specific PostgreSQL configuration values from the configuration file using the postgres_conf resource. You audited a few values in this section, but you can use it to test any configuration option from the configuration file.

      Step 5 — Auditing PostgreSQL Client Authentication

      Now that you've written some tests for your PostgreSQL configuration, you'll write some tests for client authentication. This is important for installations that need to ensure specific authentication methods for different kinds of users; for example, to ensure clients connecting to PostgreSQL locally always need to authenticate with a password, or to reject connections from a specific IP address or IP address range, and so on.

An important configuration for PostgreSQL installations where security is a concern is to only allow encrypted password authentication. PostgreSQL 10 supports two password encryption methods for client authentication: md5 and scram-sha-256. This test will require password encryption for all clients, which means that the METHOD field for every client entry in the client configuration file must be set to either md5 or scram-sha-256. For these tests, you will use scram-sha-256, since it is more secure than md5.

By default, local clients use the peer authentication method in the pg_hba.conf file. For the test, you need to change these entries to scram-sha-256. Open the /etc/postgresql/10/main/pg_hba.conf file:

      • sudo nano /etc/postgresql/10/main/pg_hba.conf

      The top of the file contains comments. Scroll down and look for uncommented lines where the authentication type is local, and change the authentication method from peer to scram-sha-256. For example, change:

      /etc/postgresql/10/main/pg_hba.conf

      ...
      local   all             postgres                                peer
      ...
      

      to:

      /etc/postgresql/10/main/pg_hba.conf

      ...
      local   all             postgres                                scram-sha-256
      ...
      

      At the end, your pg_hba.conf configuration will resemble the following:

      /etc/postgresql/10/main/pg_hba.conf

      ...
      local   all             postgres                                scram-sha-256
      
      # TYPE  DATABASE        USER            ADDRESS                 METHOD
      
      # "local" is for Unix domain socket connections only
      local   all             all                                     scram-sha-256
      # IPv4 local connections:
      host    all             all             127.0.0.1/32            scram-sha-256
      # IPv6 local connections:
      host    all             all             ::1/128                 scram-sha-256
      # Allow replication connections from localhost, by a user with the
      # replication privilege.
      local   replication     all                                     scram-sha-256
      host    replication     all             127.0.0.1/32            scram-sha-256
      host    replication     all             ::1/128                 scram-sha-256
      ...
      

      Save and exit the configuration file. Then restart the PostgreSQL service:

      • sudo service postgresql@10-main restart
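
      To confirm the change took effect, you can try opening a psql session as the postgres user. Instead of being logged straight in via peer authentication, you should now be prompted for the password you set for the postgres role (a quick sanity check; enter the password, then type \q to exit):

      • sudo -u postgres psql

      Output

      Password for user postgres: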

      For this test, you'll use the postgres_hba_conf resource. This resource is used to test the client authentication data defined in the pg_hba.conf file. You'll pass in the path of your pg_hba.conf file as a parameter to this resource.

      Your control will consist of two describe blocks that check the auth_method fields for both local and host clients respectively to ensure that they are both equal to scram-sha-256. Open controls/postgresql.rb in your text editor:

      • nano controls/postgresql.rb

      Append the following control to the end of the test file controls/postgresql.rb:

      controls/postgresql.rb

      POSTGRES_HBA_CONF_FILE = File.join(POSTGRES_CONF_DIR, 'pg_hba.conf')
      
      control '4-postgres_hba' do
        impact 1.0
        title 'Require SCRAM-SHA-256 for ALL users, peers in pg_hba.conf'
        desc 'Require SCRAM-SHA-256 for ALL users, peers in pg_hba.conf. Do not allow untrusted authentication methods.'
      
        describe postgres_hba_conf(POSTGRES_HBA_CONF_FILE).where { type == 'local' } do
          its('auth_method') { should all eq 'scram-sha-256' }
        end
      
        describe postgres_hba_conf(POSTGRES_HBA_CONF_FILE).where { type == 'host' } do
          its('auth_method') { should all eq 'scram-sha-256' }
        end
      end
      

      In this code block, you define a new variable POSTGRES_HBA_CONF_FILE to store the absolute location of your pg_hba.conf file. File.join is a Ruby method that concatenates two file path segments with /. Here it joins the POSTGRES_CONF_DIR variable, declared in the previous section, with the file name pg_hba.conf, producing the absolute path to the pg_hba.conf file, which is stored in the POSTGRES_HBA_CONF_FILE variable.
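
      For example, assuming POSTGRES_CONF_DIR holds '/etc/postgresql/10/main' (the path shown in the test output later in this step), the result is:

      File.join('/etc/postgresql/10/main', 'pg_hba.conf')
      # => "/etc/postgresql/10/main/pg_hba.conf"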

      After that, you declare and configure the control and its metadata. The first describe block checks that all configuration entries where the client type is local also have scram-sha-256 as their authentication methods. The second describe block does the same for cases where the client type is host. Save and exit controls/postgresql.rb.

      You'll execute this control as the postgres user, because read access to the pg_hba.conf file is granted only to its owner and group, the postgres user. Execute the profile by running:

      • sudo -u postgres inspec exec .

      Your output will resemble the following:

      Output

      Profile: InSpec Profile (PostgreSQL)
      Version: 0.1.0
      Target:  local://

        ✔  1-audit_installation: Audit PostgreSQL Installation
           ✔  System Package postgresql-10 should be installed
           ✔  System Package postgresql-10 version should cmp >= "10"
           ✔  Service postgresql@10-main should be enabled
           ✔  Service postgresql@10-main should be installed
           ✔  Service postgresql@10-main should be running
        ✔  2-audit_address_port: Audit Process and Port
           ✔  Port 5432 should be listening
           ✔  Port 5432 addresses should include "127.0.0.1"
           ✔  Port 5432 protocols should cmp == "tcp"
           ✔  Processes postgres should exist
           ✔  Processes postgres users should include "postgres"
           ✔  User postgres should exist
        ✔  3-postgresql: Audit PostgreSQL Configuration
           ✔  PostgreSQL Configuration port should eq "5432"
           ✔  PostgreSQL Configuration password_encryption should eq "scram-sha-256"
           ✔  PostgreSQL Configuration ssl should eq "on"
           ✔  PostgreSQL Configuration logging_collector should eq "on"
           ✔  PostgreSQL Configuration log_connections should eq "on"
           ✔  PostgreSQL Configuration log_disconnections should eq "on"
           ✔  PostgreSQL Configuration log_duration should eq "on"
        ✔  4-postgres_hba: Require SCRAM-SHA-256 for ALL users, peers in pg_hba.conf
           ✔  Postgres Hba Config /etc/postgresql/10/main/pg_hba.conf with type == "local" auth_method should all eq "scram-sha-256"
           ✔  Postgres Hba Config /etc/postgresql/10/main/pg_hba.conf with type == "host" auth_method should all eq "scram-sha-256"

      Profile Summary: 4 successful controls, 0 control failures, 0 controls skipped
      Test Summary: 20 successful, 0 failures, 0 skipped

      This output indicates that the new control passed, together with all of the previous controls and every test in your profile.

      In this step, you added a control that uses the postgres_hba_conf resource to audit your PostgreSQL client authentication configuration, ensuring that all clients authenticate via scram-sha-256.

      Conclusion

      You've set up InSpec and successfully audited a PostgreSQL 10 installation. In the process, you've used a selection of InSpec tools, such as the InSpec DSL, matchers, resources, profiles, attributes, and the CLI. From here, you can incorporate other resources that InSpec provides in the Resources section of their documentation. InSpec also provides a mechanism for defining custom resources for your specific needs; these custom resources are written as regular Ruby classes.

      You can also explore the Compliance Profiles section of the Chef Supermarket, which contains publicly shared InSpec profiles that you can execute directly or extend in your own profiles. You can share your own profiles with the general public in the Chef Supermarket as well.

      You can go further by exploring other tools in the Chef universe such as Chef and Habitat. InSpec is integrated with Habitat and this provides the ability to ship your compliance controls together with your Habitat-packaged applications and continuously run them. You can explore official and community InSpec tutorials on the tutorials page. For more advanced InSpec references, check the official InSpec documentation.




      Managed Databases Connection Pools and PostgreSQL Benchmarking Using pgbench


      Introduction

      DigitalOcean Managed Databases allows you to scale your PostgreSQL database using several methods. One such method is a built-in connection pooler that allows you to efficiently handle large numbers of client connections and reduce the CPU and memory footprint of these open connections. By using a connection pool and sharing a fixed set of recyclable connections, you can handle significantly more concurrent client connections, and squeeze extra performance out of your PostgreSQL database.

      In this tutorial we’ll use pgbench, PostgreSQL’s built-in benchmarking tool, to run load tests on a DigitalOcean Managed PostgreSQL Database. We’ll dive into connection pools, describe how they work, and show how to create one using the Cloud Control Panel. Finally, using results from the pgbench tests, we’ll demonstrate how using a connection pool can be an inexpensive method of increasing database throughput.

      Prerequisites

      To complete this tutorial, you’ll need:

      • A DigitalOcean Managed PostgreSQL Database cluster. To learn how to provision and configure a DigitalOcean PostgreSQL cluster, consult the Managed Database product documentation.
      • A client machine with PostgreSQL installed. By default, your PostgreSQL installation will contain the pgbench benchmarking utility and the psql client, both of which we’ll use in this guide. Consult How To Install and Use PostgreSQL on Ubuntu 18.04 to learn how to install PostgreSQL. If you’re not running Ubuntu on your client machine, you can use the version finder to find the appropriate tutorial.

      Once you have a DigitalOcean PostgreSQL cluster up and running and a client machine with pgbench installed, you’re ready to begin with this guide.

      Step 1 — Creating and Initializing the benchmark Database

      Before we create a connection pool for our database, we’ll first create the benchmark database on our PostgreSQL cluster and populate it with some dummy data on which pgbench will run its tests. The pgbench utility repeatedly runs a series of five SQL commands (consisting of SELECT, UPDATE, and INSERT queries) in a transaction, using multiple threads and clients, and calculates a useful performance metric called Transactions per Second (TPS). TPS is a measure of database throughput, counting the number of atomic transactions processed by the database in one second. To learn more about the specific commands executed by pgbench, consult What is the “Transaction” Actually Performed in pgbench? from the official pgbench documentation.
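
      As a sketch of what each transaction looks like, the built-in TPC-B-like script issues the following statements, where :aid, :bid, :tid, and :delta are random values pgbench picks for each transaction (see the pgbench documentation for the authoritative definition):

      BEGIN;
      UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;
      SELECT abalance FROM pgbench_accounts WHERE aid = :aid;
      UPDATE pgbench_tellers SET tbalance = tbalance + :delta WHERE tid = :tid;
      UPDATE pgbench_branches SET bbalance = bbalance + :delta WHERE bid = :bid;
      INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);
      END;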

      Let’s begin by connecting to our PostgreSQL cluster and creating the benchmark database.

      First, retrieve your cluster’s Connection Details by navigating to Databases and locating your PostgreSQL cluster. Click into your cluster. You should see a cluster overview page containing the following Connection Details box:

      PostgreSQL Cluster Connection Details

      From this, we can parse the following config variables:

      • Admin user: doadmin
      • Admin password: your_password
      • Cluster endpoint: dbaas-test-do-user-3587522-0.db.ondigitalocean.com
      • Connection port: 25060
      • Database to connect to: defaultdb
      • SSL Mode: require (use an SSL-encrypted connection for increased security)

      Take note of these parameters, as you’ll need them when using both the psql client and pgbench tool.

      Click on the dropdown above this box and select Connection String. We’ll copy this string and pass it in to psql to connect to this PostgreSQL node.

      Connect to your cluster using psql and the connection string you just copied:

      • psql postgresql://doadmin:your_password@your_cluster_endpoint:25060/defaultdb?sslmode=require

      You should see the following PostgreSQL client prompt, indicating that you’ve connected to your PostgreSQL cluster successfully:

      Output

      psql (10.6 (Ubuntu 10.6-0ubuntu0.18.04.1))
      SSL connection (protocol: TLSv1.2, cipher: ECDHE-RSA-AES256-GCM-SHA384, bits: 256, compression: off)
      Type "help" for help.

      defaultdb=>

      From here, create the benchmark database:

      • CREATE DATABASE benchmark;

      You should see the following output:

      Output

      CREATE DATABASE

      Now, disconnect from the cluster:

      • \q

      Before we run the pgbench tests, we need to populate this benchmark database with some tables and dummy data required to run the tests.

      To do this, we’ll run pgbench with the following flags:

      • -h: The PostgreSQL cluster endpoint
      • -p: The PostgreSQL cluster connection port
      • -U: The database username
      • -i: Indicates that we'd like to initialize the benchmark database with benchmarking tables and their dummy data.
      • -s: Set a scale factor of 150, which will multiply table sizes by 150. The default scale factor of 1 results in tables of the following sizes:

        table                   # of rows
        ---------------------------------
        pgbench_branches        1
        pgbench_tellers         10
        pgbench_accounts        100000
        pgbench_history         0
        

        Using a scale factor of 150, the pgbench_accounts table will contain 15,000,000 rows.

        Note: To avoid excessive blocked transactions, be sure to set the scale factor to a value at least as large as the number of concurrent clients you intend to test with. In this tutorial we'll test with 150 clients at most, so we set -s to 150 here. To learn more, consult these recommended practices from the official pgbench documentation.

      Run the complete pgbench command:

      • pgbench -h your_cluster_endpoint -p 25060 -U doadmin -i -s 150 benchmark

      After running this command, you will be prompted to enter the password for the database user you specified. Enter the password, and hit ENTER.

      You should see the following output:

      Output

      dropping old tables...
      NOTICE: table "pgbench_accounts" does not exist, skipping
      NOTICE: table "pgbench_branches" does not exist, skipping
      NOTICE: table "pgbench_history" does not exist, skipping
      NOTICE: table "pgbench_tellers" does not exist, skipping
      creating tables...
      generating data...
      100000 of 15000000 tuples (0%) done (elapsed 0.19 s, remaining 27.93 s)
      200000 of 15000000 tuples (1%) done (elapsed 0.85 s, remaining 62.62 s)
      300000 of 15000000 tuples (2%) done (elapsed 1.21 s, remaining 59.23 s)
      400000 of 15000000 tuples (2%) done (elapsed 1.63 s, remaining 59.44 s)
      500000 of 15000000 tuples (3%) done (elapsed 2.05 s, remaining 59.51 s)
      . . .
      14700000 of 15000000 tuples (98%) done (elapsed 70.87 s, remaining 1.45 s)
      14800000 of 15000000 tuples (98%) done (elapsed 71.39 s, remaining 0.96 s)
      14900000 of 15000000 tuples (99%) done (elapsed 71.91 s, remaining 0.48 s)
      15000000 of 15000000 tuples (100%) done (elapsed 72.42 s, remaining 0.00 s)
      vacuuming...
      creating primary keys...
      done.

      At this point, we've created a benchmarking database, populated with the tables and data required to run the pgbench tests. We can now move on to running a baseline test which we'll use to compare performance before and after connection pooling is enabled.
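
      If you'd like to double-check the initialization before benchmarking, you can count the rows in the largest table; with a scale factor of 150 it should report 15000000 (substitute your own credentials and endpoint):

      • psql postgresql://doadmin:your_password@your_cluster_endpoint:25060/benchmark?sslmode=require -c "SELECT count(*) FROM pgbench_accounts;"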

      Step 2 — Running a Baseline pgbench Test

      Before we run our first benchmark, it's worth diving into what we're trying to optimize with connection pools.

      Typically, when a client connects to a PostgreSQL database, the main PostgreSQL OS process forks a child process to handle the new connection. With only a few connections, this rarely presents an issue. However, as clients and connections scale, the CPU and memory overhead of creating and maintaining these connections adds up, especially if the application in question does not use database connections efficiently. In addition, the max_connections PostgreSQL setting caps the number of client connections allowed, causing additional connections to be refused or dropped.

      A connection pool keeps open a fixed number of database connections, the pool size, which it then uses to distribute and execute client requests. This means that you can accommodate far more simultaneous connections, efficiently deal with idle or stagnant clients, as well as queue up client requests during traffic spikes instead of rejecting them. By recycling connections, you can more efficiently use your machine's resources in an environment where there is a heavy connection volume, and squeeze extra performance out of your database.

      A connection pool can be implemented either on the application side or as middleware between the database and your application. The Managed Databases connection pooler is built on top of pgBouncer, a lightweight, open-source middleware connection pooler for PostgreSQL. Its interface is available via the Cloud Control Panel UI.

      Navigate to Databases in the Control Panel, and then click into your PostgreSQL cluster. From here, click into Connection Pools. Then, click on Create a Connection Pool. You should see the following configuration window:

      Connection Pools Config Window

      Here, you can configure the following fields:

      • Pool Name: A unique name for your connection pool
      • Database: The database for which you'd like to pool connections
      • User: The PostgreSQL user the connection pool will authenticate as
      • Mode: One of Session, Transaction, or Statement. This option controls how long the pool assigns a backend connection to a client.
        • Session: The client holds on to the connection until it explicitly disconnects.
        • Transaction: The client obtains the connection until it completes a transaction, after which the connection is returned to the pool.
        • Statement: The pool aggressively recycles connections after each client statement. In statement mode, multi-statement transactions are not allowed. To learn more, consult the Connection Pools product documentation.
      • Pool Size: The number of connections the connection pool will keep open between itself and the database.

      Before we create a connection pool, we'll run a baseline test against which we can compare database performance once connection pooling is enabled.

      In this tutorial, we'll use a 4 GB RAM, 2 vCPU, 80 GB Disk, primary node only Managed Database setup. You can scale the benchmark test parameters in this section according to your PostgreSQL cluster specs.

      DigitalOcean Managed Database clusters have the PostgreSQL max_connections parameter preset to 25 connections per 1 GB RAM. A 4 GB RAM PostgreSQL node therefore has max_connections set to 100. In addition, for all clusters, 3 connections are reserved for maintenance. So for this 4 GB RAM PostgreSQL cluster, 97 connections are available for connection pooling.
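
      You can verify this preset yourself from psql; on a 4 GB node, the following should report 100 (substitute your own credentials and endpoint):

      • psql postgresql://doadmin:your_password@your_cluster_endpoint:25060/defaultdb?sslmode=require -c "SHOW max_connections;"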

      With this in mind, let's run our first baseline pgbench test.

      Log in to your client machine. We’ll run pgbench, specifying the database endpoint, port and user as usual. In addition, we’ll provide the following flags:

      • -c: The number of concurrent clients or database sessions to simulate. We set this to 50 so as to simulate a number of concurrent connections smaller than the max_connections parameter for our PostgreSQL cluster.
      • -j: The number of worker threads pgbench will use to run the benchmark. If you're using a multi-CPU machine, you can tune this upwards to distribute clients across threads. On a two-core machine, we set this to 2.
      • -P: Display progress and metrics every 60 seconds.
      • -T: Run the benchmark for 600 seconds (10 minutes). To produce consistent, reproducible results, it's important that you run the benchmark for several minutes, or through one checkpoint cycle.

      We’ll also specify that we'd like to run the benchmark against the benchmark database we created and populated earlier.

      Run the following complete pgbench command:

      • pgbench -h your_db_endpoint -p 25060 -U doadmin -c 50 -j 2 -P 60 -T 600 benchmark

      Hit ENTER and then type in the password for the doadmin user to begin running the test. You should see output similar to the following (results will depend on the specs of your PostgreSQL cluster):

      Output

      starting vacuum...end.
      progress: 60.0 s, 157.4 tps, lat 282.988 ms stddev 40.261
      progress: 120.0 s, 176.2 tps, lat 283.726 ms stddev 38.722
      progress: 180.0 s, 167.4 tps, lat 298.663 ms stddev 238.124
      progress: 240.0 s, 178.9 tps, lat 279.564 ms stddev 43.619
      progress: 300.0 s, 178.5 tps, lat 280.016 ms stddev 43.235
      progress: 360.0 s, 178.8 tps, lat 279.737 ms stddev 43.307
      progress: 420.0 s, 179.3 tps, lat 278.837 ms stddev 43.783
      progress: 480.0 s, 178.5 tps, lat 280.203 ms stddev 43.921
      progress: 540.0 s, 180.0 tps, lat 277.816 ms stddev 43.742
      progress: 600.0 s, 178.5 tps, lat 280.044 ms stddev 43.705
      transaction type: <builtin: TPC-B (sort of)>
      scaling factor: 150
      query mode: simple
      number of clients: 50
      number of threads: 2
      duration: 600 s
      number of transactions actually processed: 105256
      latency average = 282.039 ms
      latency stddev = 84.244 ms
      tps = 175.329321 (including connections establishing)
      tps = 175.404174 (excluding connections establishing)

      Here, we observed that over a 10 minute run with 50 concurrent sessions, we processed 105,256 transactions at a throughput of roughly 175 transactions per second (105,256 transactions ÷ 600 seconds ≈ 175 TPS).

      Now, let's run the same test, this time using 150 concurrent clients, a value that is higher than max_connections for this database, to synthetically simulate a mass influx of client connections:

      • pgbench -h your_db_endpoint -p 25060 -U doadmin -c 150 -j 2 -P 60 -T 600 benchmark

      You should see output similar to the following:

      Output

      starting vacuum...end.
      connection to database "pgbench" failed:
      FATAL: remaining connection slots are reserved for non-replication superuser connections
      progress: 60.0 s, 182.6 tps, lat 280.069 ms stddev 42.009
      progress: 120.0 s, 253.8 tps, lat 295.612 ms stddev 237.448
      progress: 180.0 s, 271.3 tps, lat 276.411 ms stddev 40.643
      progress: 240.0 s, 273.0 tps, lat 274.653 ms stddev 40.942
      progress: 300.0 s, 272.8 tps, lat 274.977 ms stddev 41.660
      progress: 360.0 s, 250.0 tps, lat 300.033 ms stddev 282.712
      progress: 420.0 s, 272.1 tps, lat 275.614 ms stddev 42.901
      progress: 480.0 s, 261.1 tps, lat 287.226 ms stddev 112.499
      progress: 540.0 s, 272.5 tps, lat 275.309 ms stddev 41.740
      progress: 600.0 s, 271.2 tps, lat 276.585 ms stddev 41.221
      transaction type: <builtin: TPC-B (sort of)>
      scaling factor: 150
      query mode: simple
      number of clients: 150
      number of threads: 2
      duration: 600 s
      number of transactions actually processed: 154892
      latency average = 281.421 ms
      latency stddev = 125.929 ms
      tps = 257.994941 (including connections establishing)
      tps = 258.049251 (excluding connections establishing)

      Note the FATAL error, indicating that pgbench hit the 100-connection limit set by max_connections, resulting in a refused connection. The test was still able to complete, with a TPS of roughly 257.

      At this point we can investigate how a connection pool could potentially improve our database's throughput.

      Step 3 — Creating and Testing a Connection Pool

      In this step we'll create a connection pool and rerun the previous pgbench test to see if we can improve our database's throughput.

      In general, the max_connections setting and connection pool parameters are tuned in tandem to max out the database's load. However, because max_connections is abstracted away from the user in DigitalOcean Managed Databases, our main levers here are the connection pool Mode and Size settings.

      To begin, let's create a connection pool in Transaction mode that keeps open all the available backend connections.

      Navigate to Databases in the Control Panel, and then click into your PostgreSQL cluster. From here, click into Connection Pools. Then, click on Create a Connection Pool.

      In the configuration window that appears, fill in the following values:

      Connection Pool Configuration Values

      Here we name our connection pool test-pool and use it with the benchmark database. Our database user is doadmin, and we set the connection pool to Transaction mode. Recall from earlier that a managed database cluster with 4 GB of RAM has 97 available database connections, so configure the pool to keep 97 database connections open.

      When you're done, hit Create Pool.

      You should now see this pool in the Control Panel:

      Connection Pool in Control Panel

      Grab its URI by clicking Connection Details. It should look something like the following:

      postgres://doadmin:password@pool_endpoint:pool_port/test-pool?sslmode=require
      

      You should notice a different port here, and potentially a different endpoint and database name, corresponding to the pool name test-pool.
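
      Before rerunning the benchmark, it's worth a quick sanity check that you can reach the database through the pool; note that in the pooled URI the "database" segment is the pool name, not benchmark. A minimal check, substituting your own credentials, endpoint, and port:

      • psql postgres://doadmin:your_password@pool_endpoint:pool_port/test-pool?sslmode=require -c "SELECT 1;"

      If this returns a single row containing 1, the pool is routing connections correctly.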

      Now that we've created the test-pool connection pool, we can rerun the pgbench test we ran above.

      Rerun pgbench

      From your client machine, run the following pgbench command (with 150 concurrent clients), making sure to substitute the highlighted values with those in your connection pool URI:

      • pgbench -h pool_endpoint -p pool_port -U doadmin -c 150 -j 2 -P 60 -T 600 test-pool

      Here we once again use 150 concurrent clients, run the test across 2 threads, print progress every 60 seconds, and run the test for 600 seconds. We set the database name to test-pool, the name of the connection pool.

      Once the test completes, you should see output similar to the following (note that these results will vary depending on the specs of your database node):

      Output

      starting vacuum...end.
      progress: 60.0 s, 240.0 tps, lat 425.251 ms stddev 59.773
      progress: 120.0 s, 350.0 tps, lat 428.647 ms stddev 57.084
      progress: 180.0 s, 340.3 tps, lat 440.680 ms stddev 313.631
      progress: 240.0 s, 364.9 tps, lat 411.083 ms stddev 61.106
      progress: 300.0 s, 366.5 tps, lat 409.367 ms stddev 60.165
      progress: 360.0 s, 362.5 tps, lat 413.750 ms stddev 59.005
      progress: 420.0 s, 359.5 tps, lat 417.292 ms stddev 60.395
      progress: 480.0 s, 363.8 tps, lat 412.130 ms stddev 60.361
      progress: 540.0 s, 351.6 tps, lat 426.661 ms stddev 62.960
      progress: 600.0 s, 344.5 tps, lat 435.516 ms stddev 65.182
      transaction type: <builtin: TPC-B (sort of)>
      scaling factor: 150
      query mode: simple
      number of clients: 150
      number of threads: 2
      duration: 600 s
      number of transactions actually processed: 206768
      latency average = 421.719 ms
      latency stddev = 114.676 ms
      tps = 344.240797 (including connections establishing)
      tps = 344.385646 (excluding connections establishing)

      Notice here that we were able to increase our database's throughput from 257 TPS to 344 TPS with 150 concurrent connections (an increase of 33%), and did not run up against the max_connections limit we previously hit without a connection pool. By placing a connection pool in front of the database, we can avoid dropped connections and significantly increase database throughput in an environment with a large number of simultaneous connections.

      If you run this same test, but with a -c value of 50 (specifying a smaller number of clients), the gains from using a connection pool become much less evident:

      Output

      starting vacuum...end.
      progress: 60.0 s, 154.0 tps, lat 290.592 ms stddev 35.530
      progress: 120.0 s, 162.7 tps, lat 307.168 ms stddev 241.003
      progress: 180.0 s, 172.0 tps, lat 290.678 ms stddev 36.225
      progress: 240.0 s, 172.4 tps, lat 290.169 ms stddev 37.603
      progress: 300.0 s, 177.8 tps, lat 281.214 ms stddev 35.365
      progress: 360.0 s, 177.7 tps, lat 281.402 ms stddev 35.227
      progress: 420.0 s, 174.5 tps, lat 286.404 ms stddev 34.797
      progress: 480.0 s, 176.1 tps, lat 284.107 ms stddev 36.540
      progress: 540.0 s, 173.1 tps, lat 288.771 ms stddev 38.059
      progress: 600.0 s, 174.5 tps, lat 286.508 ms stddev 59.941
      transaction type: <builtin: TPC-B (sort of)>
      scaling factor: 150
      query mode: simple
      number of clients: 50
      number of threads: 2
      duration: 600 s
      number of transactions actually processed: 102938
      latency average = 288.509 ms
      latency stddev = 83.503 ms
      tps = 171.482966 (including connections establishing)
      tps = 171.553434 (excluding connections establishing)

      Here we see that the connection pool did not increase throughput; it actually dipped slightly, from 175 TPS to 171 TPS.

      Although in this guide we use pgbench with its built-in benchmark data set, the best test for determining whether or not to use a connection pool is a benchmark load that accurately represents production load on your database, against production data. Creating custom benchmarking scripts and data is beyond the scope of this guide, but to learn more, consult the official pgbench documentation.

      Note: The pool size setting is highly workload-specific. In this guide, we configured the connection pool to use all the available backend database connections. This was because throughout our benchmark, the database rarely reached full utilization (you can monitor database load from the Metrics tab in the Cloud Control Panel). Depending on your database's load, this may not be the optimal setting. If you notice that your database is constantly fully saturated, shrinking the connection pool may increase throughput and improve performance by queuing additional requests instead of trying to execute them all at the same time on an already loaded server.

      Conclusion

      DigitalOcean Managed Databases connection pooling is a powerful feature that can help you quickly squeeze extra performance out of your database. Along with other techniques like replication, caching, and sharding, connection pooling can help you scale your database layer to process an even greater volume of requests.

      In this guide we focused on a simplistic and synthetic testing scenario using PostgreSQL's built-in pgbench benchmarking tool and its default benchmark test. In any production scenario, you should run benchmarks against actual production data while simulating production load. This will allow you to tune your database for your particular usage pattern.

      Along with pgbench, other tools exist to benchmark and load your database. One such tool developed by Percona is sysbench-tpcc. Another is Apache's JMeter, which can load test databases as well as web applications.

      To learn more about DigitalOcean Managed Databases, consult the Managed Databases product documentation. To learn more about sharding, another useful scaling technique, consult Understanding Database Sharding.
