
      How To Audit a PostgreSQL Database with InSpec on Ubuntu 18.04


      The author selected the Free and Open Source Fund to receive a donation as part of the Write for DOnations program.

      Introduction

      InSpec is an open-source, automated testing framework for testing and auditing your system to ensure the compliance of integration, security, and other policy requirements. Developers can test the actual state of their infrastructure and applications against a target state using InSpec code.

To specify the policy requirements you’re testing for, InSpec includes audit controls. Traditionally, developers manually enforce policy requirements, often right before deploying changes to production. With InSpec, however, developers can continuously evaluate compliance at every stage of product development, which helps solve issues earlier in the development process. The InSpec DSL (Domain Specific Language), built on RSpec, a DSL testing tool written in Ruby, specifies the syntax used to write the audit controls.

InSpec also includes a collection of resources to assist in configuring specific parts of your system and to simplify writing audit controls. You can also write your own custom resources when you need to define a specific solution that isn’t available. Universal matchers allow you to compare resource values to expectations across all InSpec tests.

      In this tutorial, you’ll install InSpec on a server running Ubuntu 18.04. You will start by writing a test that verifies the operating system family of the server, then you’ll create a PostgreSQL audit profile from the ground up. This audit profile starts by checking that you have PostgreSQL installed on the server and that its services are running. Then you’ll add tests to check that the PostgreSQL service is running with the correct port, address, protocol, and user. Next you’ll test specific PostgreSQL configuration parameters, and finally, you’ll audit client authentication configuration.

      Prerequisites

      Before following this tutorial, you will need the following:

      Step 1 — Preparing the Environment

In this step, you’ll download and unpack the latest stable version of InSpec into your home directory. InSpec provides installable binaries on its downloads page.

      Navigate to your home directory:
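
• cd ~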

      Now download the binary with curl:

• curl -LO https://packages.chef.io/files/stable/inspec/3.7.11/ubuntu/18.04/inspec_3.7.11-1_amd64.deb

      Next, use the sha256sum command to generate a checksum of the downloaded file. This is to verify the integrity and authenticity of the downloaded file.

      • sha256sum inspec_3.7.11-1_amd64.deb

      Checksums for each binary are listed on the InSpec downloads page, so visit the downloads page to compare with your output from this command.

      Output

      e665948f9c0441e8648b08f8d3c8d34a86f9e994609877a7e4853c012dbc7523 inspec_3.7.11-1_amd64.deb

      If the checksums are different, delete the downloaded file and repeat the download process.

Next, you'll install the downloaded binary. For this, you'll use dpkg, the package manager that comes with all Debian-based systems, such as Ubuntu, by default. The -i flag prompts the dpkg command to install the package files.

      • sudo dpkg -i inspec_3.7.11-1_amd64.deb

      If there are no errors, it means that you've installed InSpec successfully. To verify the installation, enter the following command:
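
• inspec version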

      You'll receive output showing the version of InSpec you just installed:

      Output

      3.7.11

If you don't see a version number displayed, repeat Step 1.

After this, you can delete inspec_3.7.11-1_amd64.deb, since you no longer need it now that the package is installed:

      • rm inspec_3.7.11-1_amd64.deb

      You've successfully installed InSpec on your server. In the next step, you will write a test to verify the operating system family of your server.

      Step 2 — Completing Your First InSpec Test

      In this step, you'll complete your first InSpec test, which will be testing that your operating system family is debian.

      You will use the os resource, which is a built-in InSpec audit resource to test the platform on which the system is running. You'll also use the eq matcher. The eq matcher is a universal matcher that tests for the exact equality of two values.

An InSpec test consists of a describe block, which contains one or more it and its statements, each of which validates one of the resource's features. Each statement describes an expectation about a specific condition of the system as an assertion. Two keywords that you can include to make an assertion are should and should_not, which assert that the condition should be true or false respectively.
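
For instance, a minimal assertion using should_not (a hypothetical snippet, separate from the test you'll write next) looks like this:

describe os.family do
  it {should_not eq 'windows'}
end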

      Create a file called os_family.rb to hold your test and open it with your text editor:
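
• nano os_family.rb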

      Add the following to your file:

      os_family.rb

      describe os.family do
        it {should eq 'debian'}
      end
      

      This test ensures that the operating system family of the target system is debian. Other possible values are windows, unix, bsd, and so on. You can find a complete list in the os resource documentation. Save and exit the file.

      Next, run your test with the following command:
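
• inspec exec os_family.rb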

      The test will pass, and you'll receive output resembling the following:

      Output

Profile: tests from os_family.rb (tests from os_family.rb)
Version: (not specified)
Target:  local://

  debian
     ✔  should eq "debian"

Test Summary: 1 successful, 0 failures, 0 skipped

In your output, the Profile contains the name of the profile that just executed. Since this test is not included in a profile, InSpec generates a default profile name from the test's file name: tests from os_family.rb. (You'll work with InSpec profiles in the next section, where you will start building your PostgreSQL InSpec profile.) Here InSpec presents the Version as not specified, because you can only specify versions in profiles.

The Target field specifies the target system that the test is executed on, which can be local or a remote system via SSH. In this case, you've executed your test on the local system, so the target shows local://.

      Usefully, the output also displays the executed test with a checkmark symbol (✔) to the left indicating a successful test. The output will show a cross symbol (✘) if the test fails.

      Finally, the test summary gives overall details about how many tests were successful, failed, and skipped. In this instance, you had a single successful test.

      Now you'll see what the output looks like for a failed test. Open os_family.rb:
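
• nano os_family.rb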

      In the test you created earlier in this step, you'll now change the expected value of the operating system family from debian to windows. Your file contents after this will be the following:

      os_family.rb

      describe os.family do
        it {should eq 'windows'}
      end
      

      Save and exit the file.

      Next, run the updated test with the following command:
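
• inspec exec os_family.rb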

      You will get output similar to the following:

      Output

Profile: tests from os_family.rb (tests from os_family.rb)
Version: (not specified)
Target:  local://

  debian
     ✘  should eq "windows"

     expected: "windows"
          got: "debian"

     (compared using ==)

Test Summary: 0 successful, 1 failure, 0 skipped

      As expected, the test failed. The output indicates that your expected (windows) and actual (debian) values do not match for the os.family property. The (compared using ==) output indicates that the eq matcher performed a string comparison between the two values to come up with this result.

      In this step, you've written a successful test that verifies the operating system family of the server. You've also created a failed test in order to see what the InSpec output for a failed test looks like. In the next step, you will start building the audit profile to test your PostgreSQL installation.

      Step 3 — Auditing Your PostgreSQL Installation

Now, you will audit your PostgreSQL installation. You'll start by checking that you have PostgreSQL installed and that its service is running correctly. Then you'll audit the PostgreSQL system port and process. For your PostgreSQL audit, you will create various InSpec controls, all within an InSpec profile named PostgreSQL.

      An InSpec control is a high-level grouping of related tests. Within a control, you can have multiple describe blocks, as well as metadata to describe your tests such as impact level, title, description, and tags. InSpec profiles organize controls to support dependency management and code reuse, which both help manage test complexity. They are also useful for packaging and sharing tests with the public via the Chef Supermarket. You can use profiles to define custom resources that you would implement as regular Ruby classes.

      To create an InSpec profile, you will use the init command. Enter this command to create the PostgreSQL profile:

      • inspec init profile PostgreSQL

      This creates the profile in a new directory with the same name as your profile, in this case PostgreSQL. Now, move into the new directory:
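
• cd PostgreSQL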

      The directory structure will look like this:

      PostgreSQL/
      ├── controls
      │   └── example.rb
      ├── inspec.yml
      ├── libraries
      └── README.md
      

      The controls/example.rb file contains a sample control that tests to see if the /tmp folder exists on the target system. This is present only as a sample and you will replace it with your own test.

      Your first test will be to ensure that you have the package postgresql-10 installed on your system and that you have the postgresql service installed, enabled, and running.

      Rename the controls/example.rb file to controls/postgresql.rb:

      • mv controls/example.rb controls/postgresql.rb

      Next, open the file with your text editor:

      • nano controls/postgresql.rb

      Replace the content of the file with the following:

      controls/postgresql.rb

      control '1-audit_installation' do
        impact 1.0
        title 'Audit PostgreSQL Installation'
        desc 'Postgres should be installed and running'
      
        describe package('postgresql-10') do
          it {should be_installed}
          its('version') {should cmp >= '10'}
        end
      
        describe service('postgresql@10-main') do
          it {should be_enabled}
          it {should be_installed}
          it {should be_running}
        end
      end
      

      In the preceding code block, you begin by defining the control with its name and metadata.

      In the first describe block, you use the package resource and pass in the PostgreSQL package name postgresql-10 as a resource argument. The package resource provides the matcher be_installed to test that the named package is installed on the system. It returns true if you have the package installed, and false otherwise. Next, you used the its statement to validate that the version of the installed PostgreSQL package is at least 10. You are using cmp instead of eq because package version strings usually contain other attributes apart from the numerical version. eq returns true only if there is an exact match while cmp is less-restrictive.

In the second describe block, you use the service resource and pass in the PostgreSQL 10 service name postgresql@10-main as a resource argument. The service resource provides the matchers be_enabled, be_installed, and be_running, which return true if the named service is enabled, installed, and running on the target system, respectively.

      Save and exit your file.

      Next, you will run your profile. Make sure you're in the ~/PostgreSQL directory before running the following command:
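
• inspec exec .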

      Since you completed the PostgreSQL prerequisite tutorial, your test will pass. Your output will look similar to the following:

      Output

Profile: InSpec Profile (PostgreSQL)
Version: 0.1.0
Target:  local://

  ✔  1-audit_installation: Audit PostgreSQL Installation
     ✔  System Package postgresql-10 should be installed
     ✔  System Package postgresql-10 version should cmp >= "10"
     ✔  Service postgresql@10-main should be enabled
     ✔  Service postgresql@10-main should be installed
     ✔  Service postgresql@10-main should be running

Profile Summary: 1 successful control, 0 control failures, 0 controls skipped
Test Summary: 5 successful, 0 failures, 0 skipped

      The output indicates that your control was successful. A control is successful if, and only if, all the tests in it are successful. The output also confirms that all your tests were successful.

Now that you've verified that the correct version of PostgreSQL is installed and that its service is running properly, you will create a new control that ensures PostgreSQL is listening on the correct port, address, and protocol.

For this test, you will also use attributes. An InSpec attribute is used to parameterize a profile so it can easily be reused in different environments or target systems. You'll define the port attribute.

      Open the inspec.yml file in your text editor:
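
• nano inspec.yml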

Append the port attribute to the end of the file by adding the following:

      inspec.yml

      ...
      attributes:
        - name: port
          type: string
          default: '5432'
      

      In the preceding code block, you added the port attribute and set it to a default value of 5432 because that is the port PostgreSQL listens on by default.

      Save and exit the file. Then run inspec check to verify the profile is still valid since you just edited inspec.yml:
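
• inspec check .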

      If there are no errors, you can proceed. Otherwise, open the inspec.yml file and ensure that the attribute is present at the end of the file.

      Now you'll create the control that checks that the PostgreSQL process is running and configured with the correct user. Open controls/postgresql.rb in your text editor:

      • nano controls/postgresql.rb

      Append the following control to the end of your current tests file controls/postgresql.rb:

      controls/postgresql.rb

      ...
      PORT = attribute('port')
      
      control '2-audit_address_port' do
        impact 1.0
        title 'Audit Process and Port'
        desc 'Postgres port should be listening and the process should be running'
      
        describe port(PORT) do
          it {should be_listening}
          its('addresses') {should include '127.0.0.1'}
          its('protocols') {should cmp 'tcp'}
        end
      
        describe processes('postgres') do
          it {should exist}
          its('users') {should include 'postgres'}
        end
      
        describe user('postgres') do
          it {should exist}
        end
      end
      

      Here you begin by declaring a PORT variable to hold the value of the port profile attribute. Then you declare the control and its metadata.

      In the first describe block, you include the port resource to test basic port properties. The port resource provides the matchers be_listening, addresses, and protocols. You use the be_listening matcher to test that the named port is listening on the target system. It returns true if the port 5432 is listening and returns false otherwise. The addresses matcher tests if the specified address is associated with the port. In this case, PostgreSQL will be listening on the local address, 127.0.0.1.
      The protocols matcher tests the Internet protocol the port is listening for, which can be icmp, tcp/tcp6, or udp/udp6. PostgreSQL will be listening for tcp connections.

      In the second describe block, you include the processes resource. You use the processes resource to test properties for programs that are running on the system. First, you verify that the postgres process exists on the system, then you use the users matcher to test that the postgres user owns the postgres process.

In the third describe block, you use the user resource to test user properties, such as whether the user exists, which groups the user belongs to, and so on. With this resource, you test that the postgres user exists on the system. Save and exit controls/postgresql.rb.

      Next, run your profile with the following command:
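
• inspec exec .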

      The tests will pass, and your output will resemble the following:

      Output

Profile: InSpec Profile (PostgreSQL)
Version: 0.1.0
Target:  local://

  ✔  1-audit_installation: Audit PostgreSQL Installation
     ✔  System Package postgresql-10 should be installed
     ✔  System Package postgresql-10 version should cmp >= "10"
     ✔  Service postgresql@10-main should be enabled
     ✔  Service postgresql@10-main should be installed
     ✔  Service postgresql@10-main should be running
  ✔  2-audit_address_port: Audit Process and Port
     ✔  Port 5432 should be listening
     ✔  Port 5432 addresses should include "127.0.0.1"
     ✔  Port 5432 protocols should cmp == "tcp"
     ✔  Processes postgres should exist
     ✔  Processes postgres users should include "postgres"
     ✔  User postgres should exist

Profile Summary: 2 successful controls, 0 control failures, 0 controls skipped
Test Summary: 11 successful, 0 failures, 0 skipped

      The output indicates that both of your controls and all of your tests were successful.

In this section, you have created your first InSpec profile and control and used them to organize your tests. You've used several InSpec resources to ensure that you have the correct version of PostgreSQL installed, that the PostgreSQL service is enabled and running correctly, and that the PostgreSQL user exists on the system. With this setup, you're ready to audit your configuration.

      Step 4 — Auditing Your PostgreSQL Configuration

      In this step, you'll audit some PostgreSQL configuration values, which will give you a foundation for working with these configuration files, allowing you to audit any PostgreSQL configuration parameters as desired.

Now that you have tests auditing the PostgreSQL installation, you'll audit your PostgreSQL configuration itself. PostgreSQL has several configuration parameters that you can use to tune it as desired, and these are stored in the configuration file located by default at /etc/postgresql/10/main/postgresql.conf. Your various deployments could have different requirements regarding PostgreSQL configuration, such as logging, password encryption, SSL, and replication strategies; you specify these requirements in the configuration file.

You will use the postgres_conf resource, which tests specific, named configuration options against expected values in the contents of the PostgreSQL configuration file.

      This test will assume some non-default PostgreSQL configuration values that you'll set manually.

      Open the PostgreSQL configuration file in your favorite text editor:

      • sudo nano /etc/postgresql/10/main/postgresql.conf

      Set the following configuration values. If the option already exists in the file but is commented out, uncomment it by removing the #, and set the value as provided:

      /etc/postgresql/10/main/postgresql.conf

      password_encryption = scram-sha-256
      logging_collector = on
      log_connections = on
      log_disconnections = on
      log_duration = on
      

      The configuration values you have set:

      • Ensure that saved passwords are always encrypted with the scram-sha-256 algorithm.
      • Enable the logging collector, which is a background process that captures log messages from the standard error (stderr) and redirects them to a log file.
      • Enable logging of connection attempts to the PostgreSQL server as well as successful connections.
      • Enable logging of session terminations.
      • Enable logging of the duration of every completed statement.

      Save and exit the configuration file. Then restart the PostgreSQL service:

      • sudo service postgresql@10-main restart

      You'll test for only a few configuration options, but you can test any PostgreSQL configuration option with the postgres_conf resource.

      You will pass in your PostgreSQL configuration directory, which is at /etc/postgresql/10/main, using a new profile attribute, postgres_conf_dir. This configuration directory is not the same across all operating systems and platforms, so by passing it in as a profile attribute, you'll be making this profile easier to reuse in different environments.

      Open your inspec.yml file:
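
• nano inspec.yml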

      Add this new attribute to the attributes section of inspec.yml:

      inspec.yml

      ...
        - name: postgres_conf_dir
          type: string
          default: '/etc/postgresql/10/main'
      

      Save and exit your file. Then run the following command to verify the InSpec profile is still valid because you just edited the inspec.yml:
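
• inspec check .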

      If there are no errors, you can proceed. Otherwise, open the inspec.yml file and ensure that the above lines are present at the end of the file.

      Now you will create the control that audits the configuration values you are enforcing. Append the following control to the end of the tests file controls/postgresql.rb:

      controls/postgresql.rb

      ...
      POSTGRES_CONF_DIR = attribute('postgres_conf_dir')
      POSTGRES_CONF_PATH = File.join(POSTGRES_CONF_DIR, 'postgresql.conf')
      
      control '3-postgresql' do
        impact 1.0
        title 'Audit PostgreSQL Configuration'
        desc 'Audits specific configuration options'
      
        describe postgres_conf(POSTGRES_CONF_PATH) do
          its('port') {should eq PORT}
          its('password_encryption') {should eq 'scram-sha-256'}
          its('ssl') {should eq 'on'}
          its('logging_collector') {should eq 'on'}
          its('log_connections') {should eq 'on'}
          its('log_disconnections') {should eq 'on'}
          its('log_duration') {should eq 'on'}
        end
      end
      

      Here you define two variables:

      • POSTGRES_CONF_DIR holds the postgres_conf_dir attribute as defined in the profile configuration.
      • POSTGRES_CONF_PATH holds the absolute path of the configuration file by concatenating the configuration file name with the configuration directory using File.join.

      Next, you define the control with its name and metadata. Then you use the postgres_conf resource together with the eq matcher to ensure your required values for the configuration options are correct. Save and exit controls/postgresql.rb.

      Next, you will run the test with the following command:
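
• inspec exec .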

The tests will pass, and your output will resemble the following:

      Output

Profile: InSpec Profile (PostgreSQL)
Version: 0.1.0
Target:  local://

  ✔  1-audit_installation: Audit PostgreSQL Installation
     ✔  System Package postgresql-10 should be installed
     ✔  System Package postgresql-10 version should cmp >= "10"
     ✔  Service postgresql@10-main should be enabled
     ✔  Service postgresql@10-main should be installed
     ✔  Service postgresql@10-main should be running
  ✔  2-audit_address_port: Audit Process and Port
     ✔  Port 5432 should be listening
     ✔  Port 5432 addresses should include "127.0.0.1"
     ✔  Port 5432 protocols should cmp == "tcp"
     ✔  Processes postgres should exist
     ✔  Processes postgres users should include "postgres"
     ✔  User postgres should exist
  ✔  3-postgresql: Audit PostgreSQL Configuration
     ✔  PostgreSQL Configuration port should eq "5432"
     ✔  PostgreSQL Configuration password_encryption should eq "scram-sha-256"
     ✔  PostgreSQL Configuration ssl should eq "on"
     ✔  PostgreSQL Configuration logging_collector should eq "on"
     ✔  PostgreSQL Configuration log_connections should eq "on"
     ✔  PostgreSQL Configuration log_disconnections should eq "on"
     ✔  PostgreSQL Configuration log_duration should eq "on"

Profile Summary: 3 successful controls, 0 control failures, 0 controls skipped
Test Summary: 18 successful, 0 failures, 0 skipped

      The output indicates that your three controls and all your tests were successful without any skipped tests or controls.

In this step, you've added a new InSpec control that tests specific PostgreSQL configuration values from the configuration file using the postgres_conf resource. You audited a few values in this section, but you can use this resource to test any configuration option from the configuration file.

      Step 5 — Auditing PostgreSQL Client Authentication

Now that you've written some tests for your PostgreSQL configuration, you'll write some tests for client authentication. This is important for installations that need to enforce specific authentication methods for different kinds of users: for example, ensuring that clients connecting to PostgreSQL locally always authenticate with a password, or rejecting connections from a specific IP address or IP address range.

An important configuration for PostgreSQL installations where security is a concern is to allow only encrypted password authentication. PostgreSQL 10 supports two password encryption methods for client authentication: md5 and scram-sha-256. This test will require password encryption for all clients, meaning that the METHOD field for every client in the client configuration file must be set to either md5 or scram-sha-256. For these tests, you will use scram-sha-256, since it is more secure than md5.

By default, local clients use the peer authentication method in the pg_hba.conf file. For the test, you need to change these entries to scram-sha-256. Open the /etc/postgresql/10/main/pg_hba.conf file:

      • sudo nano /etc/postgresql/10/main/pg_hba.conf

      The top of the file contains comments. Scroll down and look for uncommented lines where the authentication type is local, and change the authentication method from peer to scram-sha-256. For example, change:

      /etc/postgresql/10/main/pg_hba.conf

      ...
      local   all             postgres                                peer
      ...
      

      to:

      /etc/postgresql/10/main/pg_hba.conf

      ...
      local   all             postgres                                scram-sha-256
      ...
      

Once you've made these changes, your pg_hba.conf configuration will resemble the following:

      /etc/postgresql/10/main/pg_hba.conf

      ...
      local   all             postgres                                scram-sha-256
      
      # TYPE  DATABASE        USER            ADDRESS                 METHOD
      
      # "local" is for Unix domain socket connections only
      local   all             all                                     scram-sha-256
      # IPv4 local connections:
      host    all             all             127.0.0.1/32            scram-sha-256
      # IPv6 local connections:
      host    all             all             ::1/128                 scram-sha-256
      # Allow replication connections from localhost, by a user with the
      # replication privilege.
      local   replication     all                                     scram-sha-256
      host    replication     all             127.0.0.1/32            scram-sha-256
      host    replication     all             ::1/128                 scram-sha-256
      ...
      

      Save and exit the configuration file. Then restart the PostgreSQL service:

      • sudo service postgresql@10-main restart

      For this test, you'll use the postgres_hba_conf resource. This resource is used to test the client authentication data defined in the pg_hba.conf file. You'll pass in the path of your pg_hba.conf file as a parameter to this resource.

      Your control will consist of two describe blocks that check the auth_method fields for both local and host clients respectively to ensure that they are both equal to scram-sha-256. Open controls/postgresql.rb in your text editor:

      • nano controls/postgresql.rb

      Append the following control to the end of the test file controls/postgresql.rb:

      controls/postgresql.rb

      POSTGRES_HBA_CONF_FILE = File.join(POSTGRES_CONF_DIR, 'pg_hba.conf')
      
      control '4-postgres_hba' do
        impact 1.0
        title 'Require SCRAM-SHA-256 for ALL users, peers in pg_hba.conf'
        desc 'Require SCRAM-SHA-256 for ALL users, peers in pg_hba.conf. Do not allow untrusted authentication methods.'
      
        describe postgres_hba_conf(POSTGRES_HBA_CONF_FILE).where { type == 'local' } do
          its('auth_method') { should all eq 'scram-sha-256' }
        end
      
        describe postgres_hba_conf(POSTGRES_HBA_CONF_FILE).where { type == 'host' } do
          its('auth_method') { should all eq 'scram-sha-256' }
        end
      end
      

      In this code block, you define a new variable POSTGRES_HBA_CONF_FILE to store the absolute location of your pg_hba.conf file. File.join is a Ruby method to concatenate two file path segments with /. You use it here to join the POSTGRES_CONF_DIR variable, declared in the previous section, with the PostgreSQL configuration file pg_hba.conf. This will produce an absolute file path of the pg_hba.conf file and store it in the POSTGRES_HBA_CONF_FILE variable.
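
As a quick illustration of File.join (plain Ruby, which you can try in irb):

File.join('/etc/postgresql/10/main', 'pg_hba.conf')
# => "/etc/postgresql/10/main/pg_hba.conf"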

      After that, you declare and configure the control and its metadata. The first describe block checks that all configuration entries where the client type is local also have scram-sha-256 as their authentication methods. The second describe block does the same for cases where the client type is host. Save and exit controls/postgresql.rb.

You'll execute this control as the postgres user, because read access to the PostgreSQL HBA configuration file is granted only to its owner and group, which correspond to the postgres user. Execute the profile by running:

      • sudo -u postgres inspec exec .

      Your output will resemble the following:

      Output

Profile: InSpec Profile (PostgreSQL)
Version: 0.1.0
Target:  local://

  ✔  1-audit_installation: Audit PostgreSQL Installation
     ✔  System Package postgresql-10 should be installed
     ✔  System Package postgresql-10 version should cmp >= "10"
     ✔  Service postgresql@10-main should be enabled
     ✔  Service postgresql@10-main should be installed
     ✔  Service postgresql@10-main should be running
  ✔  2-audit_address_port: Audit Process and Port
     ✔  Port 5432 should be listening
     ✔  Port 5432 addresses should include "127.0.0.1"
     ✔  Port 5432 protocols should cmp == "tcp"
     ✔  Processes postgres should exist
     ✔  Processes postgres users should include "postgres"
     ✔  User postgres should exist
  ✔  3-postgresql: Audit PostgreSQL Configuration
     ✔  PostgreSQL Configuration port should eq "5432"
     ✔  PostgreSQL Configuration password_encryption should eq "scram-sha-256"
     ✔  PostgreSQL Configuration ssl should eq "on"
     ✔  PostgreSQL Configuration logging_collector should eq "on"
     ✔  PostgreSQL Configuration log_connections should eq "on"
     ✔  PostgreSQL Configuration log_disconnections should eq "on"
     ✔  PostgreSQL Configuration log_duration should eq "on"
  ✔  4-postgres_hba: Require SCRAM-SHA-256 for ALL users, peers in pg_hba.conf
     ✔  Postgres Hba Config /etc/postgresql/10/main/pg_hba.conf with type == "local" auth_method should all eq "scram-sha-256"
     ✔  Postgres Hba Config /etc/postgresql/10/main/pg_hba.conf with type == "host" auth_method should all eq "scram-sha-256"

Profile Summary: 4 successful controls, 0 control failures, 0 controls skipped
Test Summary: 20 successful, 0 failures, 0 skipped

This output indicates that the new control you added and all of the previous controls are successful. It also indicates that all the tests in your profile passed.

      In this step, you have added a control to your profile that successfully audited your PostgreSQL client authentication configuration to ensure that all clients are authenticated via scram-sha-256 using the postgres_hba_conf resource.

      Conclusion

You've set up InSpec and successfully audited a PostgreSQL 10 installation. In the process, you've used a selection of InSpec tools, such as the InSpec DSL, matchers, resources, profiles, attributes, and the CLI. From here, you can incorporate other resources that InSpec provides in the Resources section of their documentation. InSpec also provides a mechanism for defining custom resources for your specific needs; these custom resources are written as regular Ruby classes.

You can also explore the Compliance Profiles section of the Chef Supermarket, which contains publicly shared InSpec profiles that you can execute directly or extend in your own profiles. You can likewise share your own profiles with the general public in the Chef Supermarket.

      You can go further by exploring other tools in the Chef universe such as Chef and Habitat. InSpec is integrated with Habitat and this provides the ability to ship your compliance controls together with your Habitat-packaged applications and continuously run them. You can explore official and community InSpec tutorials on the tutorials page. For more advanced InSpec references, check the official InSpec documentation.




      Managed Databases Connection Pools and PostgreSQL Benchmarking Using pgbench


      Introduction

      DigitalOcean Managed Databases allows you to scale your PostgreSQL database using several methods. One such method is a built-in connection pooler that allows you to efficiently handle large numbers of client connections and reduce the CPU and memory footprint of these open connections. By using a connection pool and sharing a fixed set of recyclable connections, you can handle significantly more concurrent client connections, and squeeze extra performance out of your PostgreSQL database.

In this tutorial we’ll use pgbench, PostgreSQL’s built-in benchmarking tool, to run load tests on a DigitalOcean Managed PostgreSQL Database. We’ll dive into connection pools, describe how they work, and show how to create one using the Cloud Control Panel. Finally, using results from the pgbench tests, we’ll demonstrate how using a connection pool can be an inexpensive method of increasing database throughput.

      Prerequisites

      To complete this tutorial, you’ll need:

      • A DigitalOcean Managed PostgreSQL Database cluster. To learn how to provision and configure a DigitalOcean PostgreSQL cluster, consult the Managed Database product documentation.
• A client machine with PostgreSQL installed. By default, your PostgreSQL installation will contain the pgbench benchmarking utility and the psql client, both of which we’ll use in this guide. Consult How To Install and Use PostgreSQL on Ubuntu 18.04 to learn how to install PostgreSQL. If you’re not running Ubuntu on your client machine, you can use the version finder to find the appropriate tutorial.

      Once you have a DigitalOcean PostgreSQL cluster up and running and a client machine with pgbench installed, you’re ready to begin with this guide.

Step 1 — Creating and Initializing the benchmark Database

      Before we create a connection pool for our database, we’ll first create the benchmark database on our PostgreSQL cluster and populate it with some dummy data on which pgbench will run its tests. The pgbench utility repeatedly runs a series of five SQL commands (consisting of SELECT, UPDATE, and INSERT queries) in a transaction, using multiple threads and clients, and calculates a useful performance metric called Transactions per Second (TPS). TPS is a measure of database throughput, counting the number of atomic transactions processed by the database in one second. To learn more about the specific commands executed by pgbench, consult What is the “Transaction” Actually Performed in pgbench? from the official pgbench documentation.
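
For reference, the transaction that pgbench repeats looks roughly like the following (adapted from the pgbench documentation; :aid, :bid, :tid, and :delta are variables that pgbench randomizes for each transaction):

BEGIN;
UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;
SELECT abalance FROM pgbench_accounts WHERE aid = :aid;
UPDATE pgbench_tellers SET tbalance = tbalance + :delta WHERE tid = :tid;
UPDATE pgbench_branches SET bbalance = bbalance + :delta WHERE bid = :bid;
INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);
END;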

      Let’s begin by connecting to our PostgreSQL cluster and creating the benchmark database.

      First, retrieve your cluster’s Connection Details by navigating to Databases and locating your PostgreSQL cluster. Click into your cluster. You should see a cluster overview page containing the following Connection Details box:

      PostgreSQL Cluster Connection Details

      From this, we can parse the following config variables:

      • Admin user: doadmin
      • Admin password: your_password
      • Cluster endpoint: dbaas-test-do-user-3587522-0.db.ondigitalocean.com
      • Connection port: 25060
      • Database to connect to: defaultdb
      • SSL Mode: require (use an SSL-encrypted connection for increased security)

      Take note of these parameters, as you’ll need them when using both the psql client and pgbench tool.

      Click on the dropdown above this box and select Connection String. We’ll copy this string and pass it in to psql to connect to this PostgreSQL node.

      Connect to your cluster using psql and the connection string you just copied:

      • psql postgresql://doadmin:your_password@your_cluster_endpoint:25060/defaultdb?sslmode=require

      You should see the following PostgreSQL client prompt, indicating that you’ve connected to your PostgreSQL cluster successfully:

      Output

psql (10.6 (Ubuntu 10.6-0ubuntu0.18.04.1))
SSL connection (protocol: TLSv1.2, cipher: ECDHE-RSA-AES256-GCM-SHA384, bits: 256, compression: off)
Type "help" for help.

defaultdb=>

      From here, create the benchmark database:

      • CREATE DATABASE benchmark;

      You should see the following output:

      Output

      CREATE DATABASE

      Now, disconnect from the cluster:
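
• \q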

      Before we run the pgbench tests, we need to populate this benchmark database with some tables and dummy data required to run the tests.

      To do this, we’ll run pgbench with the following flags:

      • -h: The PostgreSQL cluster endpoint
      • -p: The PostgreSQL cluster connection port
      • -U: The database username
      • -i: Indicates that we'd like to initialize the benchmark database with benchmarking tables and their dummy data.
• -s: Set a scale factor of 150, which will multiply table sizes by 150. The default scale factor of 1 results in tables of the following sizes:

        table                   # of rows
        ---------------------------------
        pgbench_branches        1
        pgbench_tellers         10
        pgbench_accounts        100000
        pgbench_history         0
        

        Using a scale factor of 150, the pgbench_accounts table will contain 15,000,000 rows.

        Note: To avoid excessive blocked transactions, be sure to set the scale factor to a value at least as large as the number of concurrent clients you intend to test with. In this tutorial we'll test with 150 clients at most, so we set -s to 150 here. To learn more, consult these recommended practices from the official pgbench documentation.

      Run the complete pgbench command:

      • pgbench -h your_cluster_endpoint -p 25060 -U doadmin -i -s 150 benchmark

      After running this command, you will be prompted to enter the password for the database user you specified. Enter the password, and hit ENTER.

      You should see the following output:

      Output

dropping old tables...
NOTICE: table "pgbench_accounts" does not exist, skipping
NOTICE: table "pgbench_branches" does not exist, skipping
NOTICE: table "pgbench_history" does not exist, skipping
NOTICE: table "pgbench_tellers" does not exist, skipping
creating tables...
generating data...
100000 of 15000000 tuples (0%) done (elapsed 0.19 s, remaining 27.93 s)
200000 of 15000000 tuples (1%) done (elapsed 0.85 s, remaining 62.62 s)
300000 of 15000000 tuples (2%) done (elapsed 1.21 s, remaining 59.23 s)
400000 of 15000000 tuples (2%) done (elapsed 1.63 s, remaining 59.44 s)
500000 of 15000000 tuples (3%) done (elapsed 2.05 s, remaining 59.51 s)
. . .
14700000 of 15000000 tuples (98%) done (elapsed 70.87 s, remaining 1.45 s)
14800000 of 15000000 tuples (98%) done (elapsed 71.39 s, remaining 0.96 s)
14900000 of 15000000 tuples (99%) done (elapsed 71.91 s, remaining 0.48 s)
15000000 of 15000000 tuples (100%) done (elapsed 72.42 s, remaining 0.00 s)
vacuuming...
creating primary keys...
done.

      At this point, we've created a benchmarking database, populated with the tables and data required to run the pgbench tests. We can now move on to running a baseline test which we'll use to compare performance before and after connection pooling is enabled.

      Step 2 — Running a Baseline pgbench Test

      Before we run our first benchmark, it's worth diving into what we're trying to optimize with connection pools.

      Typically when a client connects to a PostgreSQL database, the main PostgreSQL OS process forks itself into a child process corresponding to this new connection. When there are only a few connections, this rarely presents an issue. However, as clients and connections scale, the CPU and memory overhead of creating and maintaining these connections begins to add up, especially if the application in question does not efficiently use database connections. In addition, the max_connections PostgreSQL setting may limit the number of client connections allowed, resulting in additional connections being refused or dropped.

      A connection pool keeps open a fixed number of database connections, the pool size, which it then uses to distribute and execute client requests. This means that you can accommodate far more simultaneous connections, efficiently deal with idle or stagnant clients, as well as queue up client requests during traffic spikes instead of rejecting them. By recycling connections, you can more efficiently use your machine's resources in an environment where there is a heavy connection volume, and squeeze extra performance out of your database.

      A connection pool can be implemented either on the application side or as middleware between the database and your application. The Managed Databases connection pooler is built on top of pgBouncer, a lightweight, open-source middleware connection pooler for PostgreSQL. Its interface is available via the Cloud Control Panel UI.

      Navigate to Databases in the Control Panel, and then click into your PostgreSQL cluster. From here, click into Connection Pools. Then, click on Create a Connection Pool. You should see the following configuration window:

      Connection Pools Config Window

      Here, you can configure the following fields:

      • Pool Name: A unique name for your connection pool
      • Database: The database for which you'd like to pool connections
      • User: The PostgreSQL user the connection pool will authenticate as
      • Mode: One of Session, Transaction, or Statement. This option controls how long the pool assigns a backend connection to a client.
        • Session: The client holds on to the connection until it explicitly disconnects.
        • Transaction: The client obtains the connection until it completes a transaction, after which the connection is returned to the pool.
        • Statement: The pool aggressively recycles connections after each client statement. In statement mode, multi-statement transactions are not allowed. To learn more, consult the Connection Pools product documentation.
      • Pool Size: The number of connections the connection pool will keep open between itself and the database.

Before we create a connection pool, we'll run a baseline test against which we can compare database performance once connection pooling is enabled.

      In this tutorial, we'll use a 4 GB RAM, 2 vCPU, 80 GB Disk, primary node only Managed Database setup. You can scale the benchmark test parameters in this section according to your PostgreSQL cluster specs.

      DigitalOcean Managed Database clusters have the PostgreSQL max_connections parameter preset to 25 connections per 1 GB RAM. A 4 GB RAM PostgreSQL node therefore has max_connections set to 100. In addition, for all clusters, 3 connections are reserved for maintenance. So for this 4 GB RAM PostgreSQL cluster, 97 connections are available for connection pooling.

      With this in mind, let's run our first baseline pgbench test.

      Log in to your client machine. We’ll run pgbench, specifying the database endpoint, port and user as usual. In addition, we’ll provide the following flags:

      • -c: The number of concurrent clients or database sessions to simulate. We set this to 50 so as to simulate a number of concurrent connections smaller than the max_connections parameter for our PostgreSQL cluster.
      • -j: The number of worker threads pgbench will use to run the benchmark. If you're using a multi-CPU machine, you can tune this upwards to distribute clients across threads. On a two-core machine, we set this to 2.
      • -P: Display progress and metrics every 60 seconds.
      • -T: Run the benchmark for 600 seconds (10 minutes). To produce consistent, reproducible results, it's important that you run the benchmark for several minutes, or through one checkpoint cycle.

      We’ll also specify that we'd like to run the benchmark against the benchmark database we created and populated earlier.

      Run the following complete pgbench command:

      • pgbench -h your_db_endpoint -p 25060 -U doadmin -c 50 -j 2 -P 60 -T 600 benchmark

      Hit ENTER and then type in the password for the doadmin user to begin running the test. You should see output similar to the following (results will depend on the specs of your PostgreSQL cluster):

      Output

starting vacuum...end.
progress: 60.0 s, 157.4 tps, lat 282.988 ms stddev 40.261
progress: 120.0 s, 176.2 tps, lat 283.726 ms stddev 38.722
progress: 180.0 s, 167.4 tps, lat 298.663 ms stddev 238.124
progress: 240.0 s, 178.9 tps, lat 279.564 ms stddev 43.619
progress: 300.0 s, 178.5 tps, lat 280.016 ms stddev 43.235
progress: 360.0 s, 178.8 tps, lat 279.737 ms stddev 43.307
progress: 420.0 s, 179.3 tps, lat 278.837 ms stddev 43.783
progress: 480.0 s, 178.5 tps, lat 280.203 ms stddev 43.921
progress: 540.0 s, 180.0 tps, lat 277.816 ms stddev 43.742
progress: 600.0 s, 178.5 tps, lat 280.044 ms stddev 43.705
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 150
query mode: simple
number of clients: 50
number of threads: 2
duration: 600 s
number of transactions actually processed: 105256
latency average = 282.039 ms
latency stddev = 84.244 ms
tps = 175.329321 (including connections establishing)
tps = 175.404174 (excluding connections establishing)

      Here, we observed that over a 10 minute run with 50 concurrent sessions, we processed 105,256 transactions with a throughput of roughly 175 transactions per second.

      Now, let's run the same test, this time using 150 concurrent clients, a value that is higher than max_connections for this database, to synthetically simulate a mass influx of client connections:

      • pgbench -h your_db_endpoint -p 25060 -U doadmin -c 150 -j 2 -P 60 -T 600 benchmark

      You should see output similar to the following:

      Output

starting vacuum...end.
connection to database "pgbench" failed:
FATAL: remaining connection slots are reserved for non-replication superuser connections
progress: 60.0 s, 182.6 tps, lat 280.069 ms stddev 42.009
progress: 120.0 s, 253.8 tps, lat 295.612 ms stddev 237.448
progress: 180.0 s, 271.3 tps, lat 276.411 ms stddev 40.643
progress: 240.0 s, 273.0 tps, lat 274.653 ms stddev 40.942
progress: 300.0 s, 272.8 tps, lat 274.977 ms stddev 41.660
progress: 360.0 s, 250.0 tps, lat 300.033 ms stddev 282.712
progress: 420.0 s, 272.1 tps, lat 275.614 ms stddev 42.901
progress: 480.0 s, 261.1 tps, lat 287.226 ms stddev 112.499
progress: 540.0 s, 272.5 tps, lat 275.309 ms stddev 41.740
progress: 600.0 s, 271.2 tps, lat 276.585 ms stddev 41.221
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 150
query mode: simple
number of clients: 150
number of threads: 2
duration: 600 s
number of transactions actually processed: 154892
latency average = 281.421 ms
latency stddev = 125.929 ms
tps = 257.994941 (including connections establishing)
tps = 258.049251 (excluding connections establishing)

      Note the FATAL error, indicating that pgbench hit the 100 connection limit threshold set by max_connections, resulting in a refused connection. The test was still able to complete, with a TPS of roughly 257.

      At this point we can investigate how a connection pool could potentially improve our database's throughput.

      Step 3 — Creating and Testing a Connection Pool

      In this step we'll create a connection pool and rerun the previous pgbench test to see if we can improve our database's throughput.

      In general, the max_connections setting and connection pool parameters are tuned in tandem to max out the database's load. However, because max_connections is abstracted away from the user in DigitalOcean Managed Databases, our main levers here are the connection pool Mode and Size settings.

      To begin, let's create a connection pool in Transaction mode that keeps open all the available backend connections.

      Navigate to Databases in the Control Panel, and then click into your PostgreSQL cluster. From here, click into Connection Pools. Then, click on Create a Connection Pool.

      In the configuration window that appears, fill in the following values:

      Connection Pool Configuration Values

Here we name our connection pool test-pool, and use it with the benchmark database. Our database user is doadmin and we set the connection pool to Transaction mode. Recall from earlier that for a managed database cluster with 4 GB of RAM, there are 97 available database connections. Accordingly, configure the pool to keep open 97 database connections.

      When you're done, hit Create Pool.

      You should now see this pool in the Control Panel:

      Connection Pool in Control Panel

Grab its URI by clicking Connection Details. It should look something like the following:

      postgres://doadmin:password@pool_endpoint:pool_port/test-pool?sslmode=require
      

      You should notice a different port here, and potentially a different endpoint and database name, corresponding to the pool name test-pool.

      Now that we've created the test-pool connection pool, we can rerun the pgbench test we ran above.

      Rerun pgbench

      From your client machine, run the following pgbench command (with 150 concurrent clients), making sure to substitute the highlighted values with those in your connection pool URI:

      • pgbench -h pool_endpoint -p pool_port -U doadmin -c 150 -j 2 -P 60 -T 600 test-pool

      Here we once again use 150 concurrent clients, run the test across 2 threads, print progress every 60 seconds, and run the test for 600 seconds. We set the database name to test-pool, the name of the connection pool.

      Once the test completes, you should see output similar to the following (note that these results will vary depending on the specs of your database node):

      Output

starting vacuum...end.
progress: 60.0 s, 240.0 tps, lat 425.251 ms stddev 59.773
progress: 120.0 s, 350.0 tps, lat 428.647 ms stddev 57.084
progress: 180.0 s, 340.3 tps, lat 440.680 ms stddev 313.631
progress: 240.0 s, 364.9 tps, lat 411.083 ms stddev 61.106
progress: 300.0 s, 366.5 tps, lat 409.367 ms stddev 60.165
progress: 360.0 s, 362.5 tps, lat 413.750 ms stddev 59.005
progress: 420.0 s, 359.5 tps, lat 417.292 ms stddev 60.395
progress: 480.0 s, 363.8 tps, lat 412.130 ms stddev 60.361
progress: 540.0 s, 351.6 tps, lat 426.661 ms stddev 62.960
progress: 600.0 s, 344.5 tps, lat 435.516 ms stddev 65.182
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 150
query mode: simple
number of clients: 150
number of threads: 2
duration: 600 s
number of transactions actually processed: 206768
latency average = 421.719 ms
latency stddev = 114.676 ms
tps = 344.240797 (including connections establishing)
tps = 344.385646 (excluding connections establishing)

      Notice here that we were able to increase our database's throughput from 257 TPS to 344 TPS with 150 concurrent connections (an increase of 33%), and did not run up against the max_connections limit we previously hit without a connection pool. By placing a connection pool in front of the database, we can avoid dropped connections and significantly increase database throughput in an environment with a large number of simultaneous connections.

      If you run this same test, but with a -c value of 50 (specifying a smaller number of clients), the gains from using a connection pool become much less evident:

      Output

starting vacuum...end.
progress: 60.0 s, 154.0 tps, lat 290.592 ms stddev 35.530
progress: 120.0 s, 162.7 tps, lat 307.168 ms stddev 241.003
progress: 180.0 s, 172.0 tps, lat 290.678 ms stddev 36.225
progress: 240.0 s, 172.4 tps, lat 290.169 ms stddev 37.603
progress: 300.0 s, 177.8 tps, lat 281.214 ms stddev 35.365
progress: 360.0 s, 177.7 tps, lat 281.402 ms stddev 35.227
progress: 420.0 s, 174.5 tps, lat 286.404 ms stddev 34.797
progress: 480.0 s, 176.1 tps, lat 284.107 ms stddev 36.540
progress: 540.0 s, 173.1 tps, lat 288.771 ms stddev 38.059
progress: 600.0 s, 174.5 tps, lat 286.508 ms stddev 59.941
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 150
query mode: simple
number of clients: 50
number of threads: 2
duration: 600 s
number of transactions actually processed: 102938
latency average = 288.509 ms
latency stddev = 83.503 ms
tps = 171.482966 (including connections establishing)
tps = 171.553434 (excluding connections establishing)

      Here we see that we were not able to increase throughput by using a connection pool. Our throughput went down to 171 TPS from 175 TPS.

      Although in this guide we use pgbench with its built-in benchmark data set, the best test for determining whether or not to use a connection pool is a benchmark load that accurately represents production load on your database, against production data. Creating custom benchmarking scripts and data is beyond the scope of this guide, but to learn more, consult the official pgbench documentation.

      Note: The pool size setting is highly workload-specific. In this guide, we configured the connection pool to use all the available backend database connections. This was because throughout our benchmark, the database rarely reached full utilization (you can monitor database load from the Metrics tab in the Cloud Control Panel). Depending on your database's load, this may not be the optimal setting. If you notice that your database is constantly fully saturated, shrinking the connection pool may increase throughput and improve performance by queuing additional requests instead of trying to execute them all at the same time on an already loaded server.

      Conclusion

      DigitalOcean Managed Databases connection pooling is a powerful feature that can help you quickly squeeze extra performance out of your database. Along with other techniques like replication, caching, and sharding, connection pooling can help you scale your database layer to process an even greater volume of requests.

      In this guide we focused on a simplistic and synthetic testing scenario using PostgreSQL's built-in pgbench benchmarking tool and its default benchmark test. In any production scenario, you should run benchmarks against actual production data while simulating production load. This will allow you to tune your database for your particular usage pattern.

      Along with pgbench, other tools exist to benchmark and load your database. One such tool developed by Percona is sysbench-tpcc. Another is Apache's JMeter, which can load test databases as well as web applications.

      To learn more about DigitalOcean Managed Databases, consult the Managed Databases product documentation. To learn more about sharding, another useful scaling technique, consult Understanding Database Sharding.


      An Introduction to Queries in PostgreSQL


      Introduction

Databases are a key component of many websites and applications, and are at the core of how data is stored and exchanged across the internet. One of the most important aspects of database management is the practice of retrieving data from a database, whether it’s on an ad hoc basis or part of a process that’s been coded into an application. There are several ways to retrieve information from a database, but one of the most commonly used methods is to submit queries through the command line.

      In relational database management systems, a query is any command used to retrieve data from a table. In Structured Query Language (SQL), queries are almost always made using the SELECT statement.
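
As a preview, a basic SELECT query takes the following general form (the table and column names here are placeholders):

SELECT column_one, column_two FROM table_name WHERE column_one = 'value';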

In this guide, we will discuss the basic syntax of SQL queries as well as some of the more commonly employed functions and operators. We will also practice making SQL queries using some sample data in a PostgreSQL database.

      PostgreSQL, often shortened to “Postgres,” is a relational database management system with an object-oriented approach, meaning that information can be represented as objects or classes in PostgreSQL schemas. PostgreSQL aligns closely with standard SQL, although it also includes some features not found in other relational database systems.

      Prerequisites

      In general, the commands and concepts presented in this guide can be used on any Linux-based operating system running any SQL database software. However, it was written specifically with an Ubuntu 18.04 server running PostgreSQL in mind. To set this up, you will need the following:

      With this setup in place, we can begin the tutorial.

      Creating a Sample Database

Before we can begin making queries in SQL, we will first create a database and a couple of tables, then populate these tables with some sample data. This will allow you to gain some hands-on experience when you begin making queries later on.

      For the sample database we’ll use throughout this guide, imagine the following scenario:

      You and several of your friends all celebrate your birthdays with one another. On each occasion, the members of the group head to the local bowling alley, participate in a friendly tournament, and then everyone heads to your place where you prepare the birthday-person’s favorite meal.

      Now that this tradition has been going on for a while, you’ve decided to begin tracking the records from these tournaments. Also, to make planning dinners easier, you decide to create a record of your friends’ birthdays and their favorite entrees, sides, and desserts. Rather than keep this information in a physical ledger, you decide to exercise your database skills by recording it in a PostgreSQL database.

To begin, open up a PostgreSQL prompt as your postgres superuser:

• sudo -u postgres psql

Note: If you followed all the steps of the prerequisite tutorial on Installing PostgreSQL on Ubuntu 18.04, you may have configured a new role for your PostgreSQL installation. In this case, you can connect to the Postgres prompt with the following command, substituting sammy with your own username:

• sudo -u sammy psql

      Next, create the database by running:

      • CREATE DATABASE birthdays;

Then select this database by typing:

• \c birthdays

      Next, create two tables within this database. We'll use the first table to track your friends' records at the bowling alley. The following command will create a table called tourneys with columns for the name of each of your friends, the number of tournaments they've won (wins), their all-time best score, and what size bowling shoe they wear (size):

      • CREATE TABLE tourneys (
      • name varchar(30),
      • wins real,
      • best real,
      • size real
      • );

Once you run the CREATE TABLE command defining these columns, you’ll receive the following output:

      Output

      CREATE TABLE

      Populate the tourneys table with some sample data:

      • INSERT INTO tourneys (name, wins, best, size)
      • VALUES ('Dolly', '7', '245', '8.5'),
      • ('Etta', '4', '283', '9'),
      • ('Irma', '9', '266', '7'),
      • ('Barbara', '2', '197', '7.5'),
      • ('Gladys', '13', '273', '8');

      You’ll receive the following output:

      Output

      INSERT 0 5

      Following this, create another table within the same database which we'll use to store information about your friends' favorite birthday meals. The following command creates a table named dinners with columns for the name of each of your friends, their birthdate, their favorite entree, their preferred side dish, and their favorite dessert:

      • CREATE TABLE dinners (
      • name varchar(30),
      • birthdate date,
      • entree varchar(30),
      • side varchar(30),
      • dessert varchar(30)
      • );

      Similarly for this table, you’ll receive feedback verifying that the table was created:

      Output

      CREATE TABLE

      Populate this table with some sample data as well:

      • INSERT INTO dinners (name, birthdate, entree, side, dessert)
      • VALUES ('Dolly', '1946-01-19', 'steak', 'salad', 'cake'),
      • ('Etta', '1938-01-25', 'chicken', 'fries', 'ice cream'),
      • ('Irma', '1941-02-18', 'tofu', 'fries', 'cake'),
      • ('Barbara', '1948-12-25', 'tofu', 'salad', 'ice cream'),
      • ('Gladys', '1944-05-28', 'steak', 'fries', 'ice cream');

      Output

      INSERT 0 5

      Once that command completes successfully, you're done setting up your database. Next, we'll go over the basic command structure of SELECT queries.

      Understanding SELECT Statements

      As mentioned in the introduction, SQL queries almost always begin with the SELECT statement. SELECT is used in queries to specify which columns from a table should be returned in the result-set. Queries also almost always include FROM, which is used to specify which table the statement will query.

      Generally, SQL queries follow this syntax:

      • SELECT column_to_select FROM table_to_select WHERE certain_conditions_apply;

      By way of example, the following statement will return the entire name column from the dinners table:

      • SELECT name FROM dinners;

      Output

name
---------
Dolly
Etta
Irma
Barbara
Gladys
(5 rows)

      You can select multiple columns from the same table by separating their names with a comma, like this:

      • SELECT name, birthdate FROM dinners;

      Output

name    | birthdate
---------+------------
Dolly   | 1946-01-19
Etta    | 1938-01-25
Irma    | 1941-02-18
Barbara | 1948-12-25
Gladys  | 1944-05-28
(5 rows)

Instead of naming a specific column or set of columns, you can follow the SELECT keyword with an asterisk (*), which serves as a placeholder representing all the columns in a table. The following command returns every column from the tourneys table:

• SELECT * FROM tourneys;

      Output

name    | wins | best | size
---------+------+------+------
Dolly   |    7 |  245 |  8.5
Etta    |    4 |  283 |    9
Irma    |    9 |  266 |    7
Barbara |    2 |  197 |  7.5
Gladys  |   13 |  273 |    8
(5 rows)

      WHERE is used in queries to filter records that meet a specified condition, and any rows that do not meet that condition are eliminated from the result. A WHERE clause typically follows this syntax:

      • . . . WHERE column_name comparison_operator value

      The comparison operator in a WHERE clause defines how the specified column should be compared against the value. Here are some common SQL comparison operators:

Operator      What it does
=             tests for equality
!=            tests for inequality
<             tests for less-than
>             tests for greater-than
<=            tests for less-than or equal-to
>=            tests for greater-than or equal-to
BETWEEN       tests whether a value lies within a given range
IN            tests whether a row's value is contained in a set of specified values
EXISTS        tests whether rows exist, given the specified conditions
LIKE          tests whether a value matches a specified string
IS NULL       tests for NULL values
IS NOT NULL   tests for all values other than NULL

      For example, if you wanted to find Irma's shoe size, you could use the following query:

      • SELECT size FROM tourneys WHERE name = 'Irma';

      Output

size
------
7
(1 row)
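The other comparison operators from the previous table follow the same pattern. For instance, BETWEEN, which is inclusive at both ends, can test whether a value falls within a range, and IN accepts a parenthesized list of candidate values. The following query isn't part of the original scenario, but illustrates BETWEEN against the sample data by finding every friend whose shoe size falls between 8 and 9:

• SELECT name, size FROM tourneys WHERE size BETWEEN 8 AND 9;

Output

name   | size
--------+------
Dolly  |  8.5
Etta   |    9
Gladys |    8
(3 rows)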

      SQL allows the use of wildcard characters, and these are especially handy when used in WHERE clauses. Percentage signs (%) represent zero or more unknown characters, and underscores (_) represent a single unknown character. These are useful if you're trying to find a specific entry in a table, but aren't sure of what that entry is exactly. To illustrate, let's say that you've forgotten the favorite entree of a few of your friends, but you're certain this particular entree starts with a "t." You could find its name by running the following query:

      • SELECT entree FROM dinners WHERE entree LIKE 't%';

      Output

entree
-------
tofu
tofu
(2 rows)

      Based on the output above, we see that the entree we have forgotten is tofu.
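The underscore wildcard works similarly, but matches exactly one unknown character. If, for example, you remembered every letter of a friend's name except the first, you could recover it with a query like this:

• SELECT name FROM dinners WHERE name LIKE '_tta';

Output

name
------
Etta
(1 row)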

      There may be times when you're working with databases that have columns or tables with relatively long or difficult-to-read names. In these cases, you can make these names more readable by creating an alias with the AS keyword. Aliases created with AS are temporary, and only exist for the duration of the query for which they're created:

      • SELECT name AS n, birthdate AS b, dessert AS d FROM dinners;

      Output

n       | b          | d
---------+------------+-----------
Dolly   | 1946-01-19 | cake
Etta    | 1938-01-25 | ice cream
Irma    | 1941-02-18 | cake
Barbara | 1948-12-25 | ice cream
Gladys  | 1944-05-28 | ice cream
(5 rows)

      Here, we have told SQL to display the name column as n, the birthdate column as b, and the dessert column as d.

      The examples we've gone through up to this point include some of the more frequently-used keywords and clauses in SQL queries. These are useful for basic queries, but they aren't helpful if you're trying to perform a calculation or derive a scalar value (a single value, as opposed to a set of multiple different values) based on your data. This is where aggregate functions come into play.

      Aggregate Functions

      Oftentimes, when working with data, you don't necessarily want to see the data itself. Rather, you want information about the data. The SQL syntax includes a number of functions that allow you to interpret or run calculations on your data just by issuing a SELECT query. These are known as aggregate functions.

      The COUNT function counts and returns the number of rows that match a certain criteria. For example, if you'd like to know how many of your friends prefer tofu for their birthday entree, you could issue this query:

      • SELECT COUNT(entree) FROM dinners WHERE entree = 'tofu';

      Output

count
-------
2
(1 row)

      The AVG function returns the average (mean) value of a column. Using our example table, you could find the average best score amongst your friends with this query:

      • SELECT AVG(best) FROM tourneys;

      Output

avg
-------
252.8
(1 row)

      SUM is used to find the total sum of a given column. For instance, if you'd like to see how many games you and your friends have bowled over the years, you could run this query:

      • SELECT SUM(wins) FROM tourneys;

      Output

sum
-----
35
(1 row)

      Note that the AVG and SUM functions will only work correctly when used with numeric data. If you try to use them on non-numerical data, it will result in either an error or just 0, depending on which RDBMS you're using:

      • SELECT SUM(entree) FROM dinners;

      Output

ERROR:  function sum(character varying) does not exist
LINE 1: select sum(entree) from dinners;
               ^
HINT:  No function matches the given name and argument types. You might need to add explicit type casts.

      MIN is used to find the smallest value within a specified column. You could use this query to see what the worst overall bowling record is so far (in terms of number of wins):

      • SELECT MIN(wins) FROM tourneys;

      Output

min
-----
2
(1 row)

      Similarly, MAX is used to find the largest numeric value in a given column. The following query will show the best overall bowling record:

      • SELECT MAX(wins) FROM tourneys;

      Output

max
-----
13
(1 row)

      Unlike SUM and AVG, the MIN and MAX functions can be used for both numeric and alphabetic data types. When run on a column containing string values, the MIN function will show the first value alphabetically:

      • SELECT MIN(name) FROM dinners;

      Output

min
---------
Barbara
(1 row)

      Likewise, when run on a column containing string values, the MAX function will show the last value alphabetically:

      • SELECT MAX(name) FROM dinners;

      Output

max
------
Irma
(1 row)

      Aggregate functions have many uses beyond what was described in this section. They're particularly useful when used with the GROUP BY clause, which is covered in the next section along with several other query clauses that affect how result-sets are sorted.

      Manipulating Query Outputs

      In addition to the FROM and WHERE clauses, there are several other clauses which are used to manipulate the results of a SELECT query. In this section, we will explain and provide examples for some of the more commonly-used query clauses.

      One of the most frequently-used query clauses, aside from FROM and WHERE, is the GROUP BY clause. It's typically used when you're performing an aggregate function on one column, but in relation to matching values in another.

      For example, let's say you wanted to know how many of your friends prefer each of the three entrees you make. You could find this info with the following query:

      • SELECT COUNT(name), entree FROM dinners GROUP BY entree;

      Output

count | entree
-------+---------
    1 | chicken
    2 | steak
    2 | tofu
(3 rows)

      The ORDER BY clause is used to sort query results. By default, numeric values are sorted in ascending order, and text values are sorted in alphabetical order. To illustrate, the following query lists the name and birthdate columns, but sorts the results by birthdate:

      • SELECT name, birthdate FROM dinners ORDER BY birthdate;

      Output

name    | birthdate
---------+------------
Etta    | 1938-01-25
Irma    | 1941-02-18
Gladys  | 1944-05-28
Dolly   | 1946-01-19
Barbara | 1948-12-25
(5 rows)

      Notice that the default behavior of ORDER BY is to sort the result-set in ascending order. To reverse this and have the result-set sorted in descending order, close the query with DESC:

      • SELECT name, birthdate FROM dinners ORDER BY birthdate DESC;

      Output

name    | birthdate
---------+------------
Barbara | 1948-12-25
Dolly   | 1946-01-19
Gladys  | 1944-05-28
Irma    | 1941-02-18
Etta    | 1938-01-25
(5 rows)
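These clauses can also be combined with the aggregate functions and aliases covered earlier. As an illustrative sketch (not part of the original scenario), the following query counts how many friends prefer each side dish and sorts the sides from most to least popular:

• SELECT COUNT(name) AS favorites, side FROM dinners GROUP BY side ORDER BY favorites DESC;

Output

favorites | side
-----------+-------
        3 | fries
        2 | salad
(2 rows)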

      As mentioned previously, the WHERE clause is used to filter results based on specific conditions. However, if you use the WHERE clause with an aggregate function, it will return an error, as is the case with the following attempt to find which sides are the favorite of at least three of your friends:

      • SELECT COUNT(name), side FROM dinners WHERE COUNT(name) >= 3;

      Output

ERROR:  aggregate functions are not allowed in WHERE
LINE 1: SELECT COUNT(name), side FROM dinners WHERE COUNT(name) >= 3...

The HAVING clause was added to SQL to provide functionality similar to that of the WHERE clause while also being compatible with aggregate functions. It's helpful to think of the difference between these two clauses as being that WHERE applies to individual records, while HAVING applies to groups of records. Accordingly, whenever you use a HAVING clause, a GROUP BY clause will typically be present as well.

      The following example is another attempt to find which side dishes are the favorite of at least three of your friends, although this one will return a result without error:

      • SELECT COUNT(name), side FROM dinners GROUP BY side HAVING COUNT(name) >= 3;

      Output

count | side
-------+-------
    3 | fries
(1 row)

      Aggregate functions are useful for summarizing the results of a particular column in a given table. However, there are many cases where it's necessary to query the contents of more than one table. We'll go over a few ways you can do this in the next section.

      Querying Multiple Tables

      More often than not, a database contains multiple tables, each holding different sets of data. SQL provides a few different ways to run a single query on multiple tables.

The JOIN clause can be used to combine rows from two or more tables in a query result. It does this by finding a related column between the tables and sorting the results appropriately in the output.

      SELECT statements that include a JOIN clause generally follow this syntax:

      • SELECT table1.column1, table2.column2
      • FROM table1
      • JOIN table2 ON table1.related_column=table2.related_column;

      Note that because JOIN clauses compare the contents of more than one table, the previous example specifies which table to select each column from by preceding the name of the column with the name of the table and a period. You can specify which table a column should be selected from like this for any query, although it's not necessary when selecting from a single table, as we've done in the previous sections. Let's walk through an example using our sample data.

      Imagine that you wanted to buy each of your friends a pair of bowling shoes as a birthday gift. Because the information about your friends' birthdates and shoe sizes are held in separate tables, you could query both tables separately then compare the results from each. With a JOIN clause, though, you can find all the information you want with a single query:

      • SELECT tourneys.name, tourneys.size, dinners.birthdate
      • FROM tourneys
      • JOIN dinners ON tourneys.name=dinners.name;

      Output

name    | size | birthdate
---------+------+------------
Dolly   |  8.5 | 1946-01-19
Etta    |    9 | 1938-01-25
Irma    |    7 | 1941-02-18
Barbara |  7.5 | 1948-12-25
Gladys  |    8 | 1944-05-28
(5 rows)
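Because fully qualified column names can get verbose, you can also alias the table names themselves with AS, just as you aliased columns earlier. The following rewrite is equivalent to the previous query and returns the same result-set:

• SELECT t.name, t.size, d.birthdate
• FROM tourneys AS t
• JOIN dinners AS d ON t.name = d.name;

You'll see this shorthand again in the correlated subquery example later in this guide.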

The JOIN clause used in this example, without any other arguments, is an inner JOIN clause. This means that it selects all the records that have matching values in both tables and includes them in the result-set, while any records that aren't matched are excluded. To illustrate this idea, let's add a new row to each table that doesn't have a corresponding entry in the other:

      • INSERT INTO tourneys (name, wins, best, size)
      • VALUES ('Bettye', '0', '193', '9');
      • INSERT INTO dinners (name, birthdate, entree, side, dessert)
      • VALUES ('Lesley', '1946-05-02', 'steak', 'salad', 'ice cream');

      Then, re-run the previous SELECT statement with the JOIN clause:

      • SELECT tourneys.name, tourneys.size, dinners.birthdate
      • FROM tourneys
      • JOIN dinners ON tourneys.name=dinners.name;

      Output

name    | size | birthdate
---------+------+------------
Dolly   |  8.5 | 1946-01-19
Etta    |    9 | 1938-01-25
Irma    |    7 | 1941-02-18
Barbara |  7.5 | 1948-12-25
Gladys  |    8 | 1944-05-28
(5 rows)

      Notice that, because the tourneys table has no entry for Lesley and the dinners table has no entry for Bettye, those records are absent from this output.

      It is possible, though, to return all the records from one of the tables using an outer JOIN clause. Outer JOIN clauses are written as either LEFT JOIN, RIGHT JOIN, or FULL JOIN.

      A LEFT JOIN clause returns all the records from the “left” table and only the matching records from the right table. In the context of outer joins, the left table is the one referenced by the FROM clause, and the right table is any other table referenced after the JOIN statement.

      Run the previous query again, but this time use a LEFT JOIN clause:

      • SELECT tourneys.name, tourneys.size, dinners.birthdate
      • FROM tourneys
      • LEFT JOIN dinners ON tourneys.name=dinners.name;

      This command will return every record from the left table (in this case, tourneys) even if it doesn't have a corresponding record in the right table. Any time there isn't a matching record from the right table, it's returned as a blank value or NULL, depending on your RDBMS:

      Output

name    | size | birthdate
---------+------+------------
Dolly   |  8.5 | 1946-01-19
Etta    |    9 | 1938-01-25
Irma    |    7 | 1941-02-18
Barbara |  7.5 | 1948-12-25
Gladys  |    8 | 1944-05-28
Bettye  |    9 |
(6 rows)

      Now run the query again, this time with a RIGHT JOIN clause:

      • SELECT tourneys.name, tourneys.size, dinners.birthdate
      • FROM tourneys
      • RIGHT JOIN dinners ON tourneys.name=dinners.name;

      This will return all the records from the right table (dinners). Because Lesley's birthdate is recorded in the right table, but there is no corresponding row for her in the left table, the name and size columns will return as blank values in that row:

      Output

name    | size | birthdate
---------+------+------------
Dolly   |  8.5 | 1946-01-19
Etta    |    9 | 1938-01-25
Irma    |    7 | 1941-02-18
Barbara |  7.5 | 1948-12-25
Gladys  |    8 | 1944-05-28
        |      | 1946-05-02
(6 rows)

      Note that left and right joins can be written as LEFT OUTER JOIN or RIGHT OUTER JOIN, although the OUTER part of the clause is implied. Likewise, specifying INNER JOIN will produce the same result as just writing JOIN.

      There is a fourth join clause called FULL JOIN available for some RDBMS distributions, including PostgreSQL. A FULL JOIN will return all the records from each table, including any null values:

      • SELECT tourneys.name, tourneys.size, dinners.birthdate
      • FROM tourneys
      • FULL JOIN dinners ON tourneys.name=dinners.name;

      Output

name    | size | birthdate
---------+------+------------
Dolly   |  8.5 | 1946-01-19
Etta    |    9 | 1938-01-25
Irma    |    7 | 1941-02-18
Barbara |  7.5 | 1948-12-25
Gladys  |    8 | 1944-05-28
Bettye  |    9 |
        |      | 1946-05-02
(7 rows)

      Note: As of this writing, the FULL JOIN clause is not supported by either MySQL or MariaDB.

      As an alternative to using FULL JOIN to query all the records from multiple tables, you can use the UNION clause.

      The UNION operator works slightly differently than a JOIN clause: instead of printing results from multiple tables as unique columns using a single SELECT statement, UNION combines the results of two SELECT statements into a single column.

      To illustrate, run the following query:

      • SELECT name FROM tourneys UNION SELECT name FROM dinners;

      This query will remove any duplicate entries, which is the default behavior of the UNION operator:

      Output

name
---------
Irma
Etta
Bettye
Gladys
Barbara
Lesley
Dolly
(7 rows)

      To return all entries (including duplicates) use the UNION ALL operator:

      • SELECT name FROM tourneys UNION ALL SELECT name FROM dinners;

      Output

name
---------
Dolly
Etta
Irma
Barbara
Gladys
Bettye
Dolly
Etta
Irma
Barbara
Gladys
Lesley
(12 rows)

The names and number of columns in the result-set reflect those queried by the first SELECT statement. Note that when using UNION to query multiple columns from more than one table, each SELECT statement must query the same number of columns, the respective columns must have compatible data types, and the columns in each SELECT statement must be in the same order. The following example shows what might result if you use a UNION clause on two SELECT statements that query a different number of columns:

      • SELECT name FROM dinners UNION SELECT name, wins FROM tourneys;

      Output

ERROR:  each UNION query must have the same number of columns
LINE 1: SELECT name FROM dinners UNION SELECT name, wins FROM tourne...

      Another way to query multiple tables is through the use of subqueries. Subqueries (also known as inner or nested queries) are queries enclosed within another query. These are useful in cases where you're trying to filter the results of a query against the result of a separate aggregate function.

      To illustrate this idea, say you want to know which of your friends have won more matches than Barbara. Rather than querying how many matches Barbara has won then running another query to see who has won more games than that, you can calculate both with a single query:

      • SELECT name, wins FROM tourneys
      • WHERE wins > (
      • SELECT wins FROM tourneys WHERE name = 'Barbara'
      • );

      Output

name   | wins
--------+------
Dolly  |    7
Etta   |    4
Irma   |    9
Gladys |   13
(4 rows)

The subquery in this statement was run only once; it simply needed to find the value of the wins column in the row where the name column reads Barbara, and the data returned by the subquery and outer query are independent of one another. There are cases, though, where the outer query must first read every row in a table and compare those values against the data returned by the subquery in order to return the desired data. In this case, the subquery is referred to as a correlated subquery.

      The following statement is an example of a correlated subquery. This query seeks to find which of your friends have won more games than is the average for those with the same shoe size:

      • SELECT name, size FROM tourneys AS t
      • WHERE wins > (
      • SELECT AVG(wins) FROM tourneys WHERE size = t.size
      • );

      In order for the query to complete, it must first collect the name and size columns from the outer query. Then, it compares each row from that result set against the results of the inner query, which determines the average number of wins for individuals with identical shoe sizes. Because you only have two friends that have the same shoe size, there can only be one row in the result-set:

      Output

name | size
------+------
Etta |    9
(1 row)

      As mentioned earlier, subqueries can be used to query results from multiple tables. To illustrate this with one final example, say you wanted to throw a surprise dinner for the group's all-time best bowler. You could find which of your friends has the best bowling record and return their favorite meal with the following query:

      • SELECT name, entree, side, dessert
      • FROM dinners
      • WHERE name = (SELECT name FROM tourneys
      • WHERE wins = (SELECT MAX(wins) FROM tourneys));

      Output

name   | entree | side  | dessert
--------+--------+-------+-----------
Gladys | steak  | fries | ice cream
(1 row)

      Notice that this statement not only includes a subquery, but also contains a subquery within that subquery.

      Conclusion

      Issuing queries is one of the most commonly-performed tasks within the realm of database management. There are a number of database administration tools, such as phpMyAdmin or pgAdmin, that allow you to perform queries and visualize the results, but issuing SELECT statements from the command line is still a widely-practiced workflow that can also provide you with greater control.

If you're new to working with SQL, we encourage you to use our SQL Cheat Sheet as a reference and to review the official PostgreSQL documentation. Additionally, if you'd like to learn more about SQL and relational databases, the following tutorials may be of interest to you:


