One place for hosting & domains

      Queries

      How To Optimize MySQL Queries with ProxySQL Caching on Ubuntu 16.04


      The author selected the Free Software Foundation to receive a donation as part of the Write for DOnations program.

      Introduction

      ProxySQL is a SQL-aware proxy server that can be positioned between your application and your database. It offers many features, such as load-balancing between multiple MySQL servers and serving as a caching layer for queries. This tutorial will focus on ProxySQL’s caching feature, and how it can optimize queries for your MySQL database.

      MySQL caching occurs when the result of a query is stored so that, when that query is repeated, the result can be returned without needing to sort through the database. This can significantly increase the speed of common queries. But in many caching methods, developers must modify the code of their application, which could introduce a bug into the codebase. To avoid this error-prone practice, ProxySQL allows you to set up transparent caching.

      In transparent caching, only database administrators need to change the ProxySQL configuration to enable caching for the most common queries, and these changes can be done through the ProxySQL admin interface. All the developer needs to do is connect to the protocol-aware proxy, and the proxy will decide if the query can be served from the cache without hitting the back-end server.

      In this tutorial, you will use ProxySQL to set up transparent caching for a MySQL server on Ubuntu 16.04. You will then test its performance using mysqlslap with and without caching to demonstrate the effect of caching and how much time it can save when executing many similar queries.

      Prerequisites

      Before you begin this guide you’ll need the following:

      Step 1 — Installing and Setting Up the MySQL Server

      First, you will install MySQL server and configure it to be used by ProxySQL as a back-end server for serving client queries.

      On Ubuntu 16.04, mysql-server can be installed using this command:

      • sudo apt-get install mysql-server

      Press Y to confirm the installation.

      You will then be prompted for your MySQL root user password. Enter a strong password and save it for later use.

      Now that you have your MySQL server ready, you will configure it for ProxySQL to work correctly. You need to add a monitor user for ProxySQL to monitor the MySQL server, since ProxySQL listens to the back-end server via the SQL protocol, rather than using a TCP connection or HTTP GET requests to make sure that the backend is running. monitor will use a dummy SQL connection to determine if the server is alive or not.

      First, log in to the MySQL shell:

      -uroot logs you in using the MySQL root user, and -p prompts for the root user’s password. This root user is different from your server’s root user, and the password is the one you entered when installing the mysql-server package.

      Enter the root password and press ENTER.

      Now you will create two users, one named monitor for ProxySQL and another that you will use to execute client queries and grant them the right privileges. This tutorial will name this user sammy.

      Create the monitor user:

      • CREATE USER 'monitor'@'%' IDENTIFIED BY 'monitor_password';

      The CREATE USER query is used to create a new user that can connect from specific IPs. Using % denotes that the user can connect from any IP address. IDENTIFIED BY sets the password for the new user; enter whatever password you like, but make sure to remember it for later use.

      With the user monitor created, next make the sammy user:

      • CREATE USER 'sammy'@'%' IDENTIFIED BY 'sammy_password';

      Next, grant privileges to your new users. Run the following command to configure monitor:

      • GRANT SELECT ON sys.* TO 'monitor'@'%';

      The GRANT query is used to give privileges to users. Here you granted only SELECT on all tables in the sys database to the monitor user; it only needs this privilege to listen to the back-end server.

      Now grant all privileges to all databases to the user sammy:

      • GRANT ALL PRIVILEGES on *.* TO 'sammy'@'%';

      This will allow sammy to make the necessary queries to test your database later.

      Apply the privilege changes by running the following:

      Finally, exit the mysql shell:

      You’ve now installed mysql-server and created a user to be used by ProxySQL to monitor your MySQL server, and another one to execute client queries. Next you will install and configure ProxySQL.

      Step 2 — Installing and Configuring ProxySQL Server

      Now you can install ProxySQL server, which will be used as a caching layer for your queries. A caching layer exists as a stop between your application servers and database back-end servers; it is used to connect to the database and to save the results of some queries in its memory for fast access later.

      The ProxySQL releases Github page offers installation files for common Linux distributions. For this tutorial, you will use wget to download the ProxySQL version 2.0.4 Debian installation file:

      • wget https://github.com/sysown/proxysql/releases/download/v2.0.4/proxysql_2.0.4-ubuntu16_amd64.deb

      Next, install the package using dpkg:

      • sudo dpkg -i proxysql_2.0.4-ubuntu16_amd64.deb

      Once it is installed, start ProxySQL with this command:

      • sudo systemctl start proxysql

      You can check if ProxySQL started correctly with this command:

      • sudo systemctl status proxysql

      You will get an output similar to this:

      Output

      root@ubuntu-s-1vcpu-2gb-sgp1-01:~# systemctl status proxysql ● proxysql.service - LSB: High Performance Advanced Proxy for MySQL Loaded: loaded (/etc/init.d/proxysql; bad; vendor preset: enabled) Active: active (exited) since Wed 2019-06-12 21:32:50 UTC; 6 months 7 days ago Docs: man:systemd-sysv-generator(8) Tasks: 0 Memory: 0B CPU: 0

      Now it is time to connect your ProxySQL server to the MySQL server. For this purpose, use the ProxySQL admin SQL interface, which by default listens to port 6032 on localhost and has admin as its username and password.

      Connect to the interface by running the following:

      • mysql -uadmin -p -h 127.0.0.1 -P6032

      Enter admin when prompted for the password.

      -uadmin sets the username as admin, and the -h flag specifies the host as localhost. The port is 6032, specified using the -P flag.

      Here you had to specify the host and port explicitly because, by default, the MySQL client connects using a local sockets file and port 3306.

      Now that you are logged into the mysql shell as admin, configure the monitor user so that ProxySQL can use it. First, use standard SQL queries to set the values of two global variables:

      • UPDATE global_variables SET variable_value='monitor' WHERE variable_name='mysql-monitor_username';
      • UPDATE global_variables SET variable_value='monitor_password' WHERE variable_name='mysql-monitor_password';

      The variable mysql-monitor_username specifies the MySQL username that will be used to check if the back-end server is alive or not. The variable mysql-monitor_password points to the password that will be used when connecting to the back-end server. Use the password you created for the monitor username.

      Every time you create a change in the ProxySQL admin interface, you need to use the right LOAD command to apply changes to the running ProxySQL instance. You changed MySQL global variables, so load them to RUNTIME to apply changes:

      • LOAD MYSQL VARIABLES TO RUNTIME;

      Next, SAVE the changes to the on-disk database to persist changes between restarts. ProxySQL uses its own SQLite local database to store its own tables and variables:

      • SAVE MYSQL VARIABLES TO DISK;

      Now, you will tell ProxySQL about the back-end server. The table mysql_servers holds information about each back-end server where ProxySQL can connect and execute queries, so add a new record using a standard SQL INSERT statement with the following values for hostgroup_id, hostname, and port:

      • INSERT INTO mysql_servers(hostgroup_id, hostname, port) VALUES (1, '127.0.0.1', 3306);

      To apply the changes, run LOAD and SAVE again:

      • LOAD MYSQL SERVERS TO RUNTIME;
      • SAVE MYSQL SERVERS TO DISK;

      Finally, you will tell ProxySQL which user will connect to the back-end server; set sammy as the user, and replace sammy_password with the password you created earlier:

      • INSERT INTO mysql_users(username, password, default_hostgroup) VALUES ('sammy', 'sammy_password', 1);

      The table mysql_users holds information about users used to connect to the back-end servers; you specified the username, password, and default_hostgroup.

      LOAD and SAVE the changes:

      • LOAD MYSQL USERS TO RUNTIME;
      • SAVE MYSQL USERS TO DISK;

      Then exit the mysql shell:

      To test that you can connect to your back-end server using ProxySQL, execute the following test query:

      • mysql -usammy -h127.0.0.1 -p -P6033 -e "SELECT @@HOSTNAME as hostname"

      In this command, you used the -e flag to execute a query and close the connection. The query prints the hostname of the back-end server.

      Note: ProxySQL uses port 6033 by default for listening to incoming connections.

      The output will look like this, with your_hostname replaced by your hostname:

      Output

      +----------------------------+ | hostname | +----------------------------+ | your_hostname | +----------------------------+

      To learn more about ProxySQL configuration, see Step 3 of How To Use ProxySQL as a Load Balancer for MySQL on Ubuntu 16.04.

      So far, you configured ProxySQL to use your MySQL server as a backend and connected to the backend using ProxySQL. Now, you are ready to use mysqlslap to benchmark the query performance without caching.

      Step 3 — Testing Using mysqlslap Without Caching

      In this step, you will download a test database so you can execute queries against it with mysqlslap to test the latency without caching, setting a benchmark for the speed of your queries. You will also explore how ProxySQL keeps records of queries in the stats_mysql_query_digest table.

      mysqlslap is a load emulation client that is used as a load testing tool for MySQL. It can test a MySQL server with auto-generated queries or with some custom queries executed on a database. It comes installed with the MySQL client package, so you do not need to install it; instead, you will download a database for testing purposes only, on which you can use mysqlslap.

      In this tutorial, you will use a sample employee database. You will be using this employee database because it features a large data set that can illustrate differences in query optimization. The database has six tables, but the data it contains has more than 300,000 employee records. This will help you emulate a large-scale production workload.

      To download the database, first clone the Github repository using this command:

      • git clone https://github.com/datacharmer/test_db.git

      Then enter the test_db directory and load the database into the MySQL server using these commands:

      • cd test_db
      • mysql -uroot -p < employees.sql

      This command uses shell redirection to read the SQL queries in employees.sql file and execute them on the MySQL server to create the database structure.

      You will see output like this:

      Output

      INFO CREATING DATABASE STRUCTURE INFO storage engine: InnoDB INFO LOADING departments INFO LOADING employees INFO LOADING dept_emp INFO LOADING dept_manager INFO LOADING titles INFO LOADING salaries data_load_time_diff 00:00:32

      Once the database is loaded into your MySQL server, test that mysqlslap is working with the following query:

      • mysqlslap -usammy -p -P6033 -h127.0.0.1 --auto-generate-sql --verbose

      mysqlslap has similar flags to the mysql client; here are the ones used in this command:

      • -u specifies the user used to connect to the server.
      • -p prompts for the user’s password.
      • -P connects using the specified port.
      • -h connects to the specified host.
      • --auto-generate-sql lets MySQL perform load testing using its own generated queries.
      • --verbose makes the output show more information.

      You will get output similar to the following:

      Output

      Benchmark Average number of seconds to run all queries: 0.015 seconds Minimum number of seconds to run all queries: 0.015 seconds Maximum number of seconds to run all queries: 0.015 seconds Number of clients running queries: 1 Average number of queries per client: 0

      In this output, you can see the average, minimum, and maximum number of seconds spent to execute all queries. This gives you an indication about the amount of time needed to execute the queries by a number of clients. In this output, only one client was used to execute queries.

      Next, find out what queries mysqlslap executed in the last command by looking at ProxySQL’s stats_mysql_query_digest. This will give us information like the digest of the queries, which is a normalized form of the SQL statement that can be referenced later to enable caching.

      Enter the ProxySQL admin interface with this command:

      • mysql -uadmin -p -h 127.0.0.1 -P6032

      Then execute this query to find information in the stats_mysql_query_digest table:

      • SELECT count_star,sum_time,hostgroup,digest,digest_text FROM stats_mysql_query_digest ORDER BY sum_time DESC;

      You will see output similar to the following:

      +------------+----------+-----------+--------------------+----------------------------------+
      | count_star | sum_time | hostgroup | digest             | digest_text                      |
      +------------+----------+-----------+--------------------+----------------------------------+
      | 1          | 598      | 1         | 0xF8F780C47A8D1D82 | SELECT @@HOSTNAME as hostname    |
      | 1          | 0        | 1         | 0x226CD90D52A2BA0B | select @@version_comment limit ? |
      +------------+----------+-----------+--------------------+----------------------------------+
      2 rows in set (0.01 sec)
      

      The previous query selects data from the stats_mysql_query_digest table, which contains information about all executed queries in ProxySQL. Here you have five columns selected:

      • count_star: The number of times this query was executed.
      • sum_time: Total time in milliseconds that this query took to execute.
      • hostgroup: The hostgroup used to execute the query.
      • digest: A digest of the executed query.
      • digest_text: The actual query. In this tutorial’s example, the second query is parameterized using ? marks in place of variable parameters. select @@version_comment limit 1 and select @@version_comment limit 2, therefore, are grouped together as the same query with the same digest.

      Now that you know how to check query data in the stats_mysql_query_digest table, exit the mysql shell:

      The database you downloaded contains some tables with demo data. You will now test queries on the dept_emp table by selecting any records whose from_date is greater than 2000-04-20 and recording the average execution time.

      Use this command to run the test:

      • mysqlslap -usammy -P6033 -p -h127.0.0.1 --concurrency=100 --iterations=20 --create-schema=employees --query="SELECT * from dept_emp WHERE from_date>'2000-04-20'" --verbose

      Here you are using some new flags:

      • --concurrency=100: This sets the number of users to simulate, in this case 100.
      • --iterations=20: This causes the test to run 20 times and calculate results from all of them.
      • --create-schema=employees: Here you selected the employees database.
      • --query="SELECT * from dept_emp WHERE from_date>'2000-04-20'": Here you specified the query executed in the test.

      The test will take a few minutes. After it is done, you will get results similar to the following:

      Output

      Benchmark Average number of seconds to run all queries: 18.117 seconds Minimum number of seconds to run all queries: 8.726 seconds Maximum number of seconds to run all queries: 22.697 seconds Number of clients running queries: 100 Average number of queries per client: 1

      Your numbers could be a little different. Keep these numbers somewhere in order to compare them with the results from after you enable caching.

      After testing ProxySQL without caching, it is time to run the same test again, but this time with caching enabled.

      Step 4 — Testing Using mysqlslap With Caching

      In this step, caching will help us to decrease latency when executing similar queries. Here, you will identify the queries executed, take their digests from ProxySQL’s stats_mysql_query_digest table, and use them to enable caching. Then, you will test again to check the difference.

      To enable caching, you need to know the digests of the queries that will be cached. Log in to the ProxySQL admin interface using this command:

      • mysql -uadmin -p -h127.0.0.1 -P6032

      Then execute this query again to get a list of queries executed and their digests:

      • SELECT count_star,sum_time,hostgroup,digest,digest_text FROM stats_mysql_query_digest ORDER BY sum_time DESC;

      You will get a result similar to this:

      Output

      +------------+-------------+-----------+--------------------+------------------------------------------+ | count_star | sum_time | hostgroup | digest | digest_text | +------------+-------------+-----------+--------------------+------------------------------------------+ | 2000 | 33727110501 | 1 | 0xC5DDECD7E966A6C4 | SELECT * from dept_emp WHERE from_date>? | | 1 | 601 | 1 | 0xF8F780C47A8D1D82 | SELECT @@HOSTNAME as hostname | | 1 | 0 | 1 | 0x226CD90D52A2BA0B | select @@version_comment limit ? | +------------+-------------+-----------+--------------------+------------------------------------------+ 3 rows in set (0.00 sec)

      Look at the first row. It is about a query that was executed 2000 times. This is the benchmarked query executed previously. Take its digest and save it to be used in adding a query rule for caching.

      The next few queries will add a new query rule to ProxySQL that will match the digest of the previous query and put a cache_ttl value for it. cache_ttl is the number of milliseconds that the result will be cached in memory:

      • INSERT INTO mysql_query_rules(active, digest, cache_ttl, apply) VALUES(1,'0xC5DDECD7E966A6C4',2000,1);

      In this command you are adding a new record to the mysql_query_rules table; this table holds all the rules applied before executing a query. In this example, you are adding a value for the cache_ttl column that will cause the matched query by the given digest to be cached for a number of milliseconds specified in this column. You put 1 in the apply column to make sure that the rule is applied to queries.

      LOAD and SAVE these changes, then exit the mysql shell:

      • LOAD MYSQL QUERY RULES TO RUNTIME;
      • SAVE MYSQL QUERY RULES TO DISK;
      • exit;

      Now that caching is enabled, re-run the test again to check the result:

      • mysqlslap -usammy -P6033 -p -h127.0.0.1 --concurrency=100 --iterations=20 --create-schema=employees --query="SELECT * from dept_emp WHERE from_date>'2000-04-20'" --verbose

      This will give output similar to the following:

      Output

      Benchmark Average number of seconds to run all queries: 7.020 seconds Minimum number of seconds to run all queries: 0.274 seconds Maximum number of seconds to run all queries: 23.014 seconds Number of clients running queries: 100 Average number of queries per client: 1

      Here you can see the big difference in average execution time: it dropped from 18.117 seconds to 7.020.

      Conclusion

      In this article, you set up transparent caching with ProxySQL to cache database query results. You also tested the query speed with and without caching to see the difference that caching can make.

      You’ve used one level of caching in this tutorial. You could also try, web caching, which sits in front of a web server and caches the responses to similar requests, sending the response back to the client without hitting the back-end servers. This is very similar to ProxySQL caching but at a different level. To learn more about web caching, check out our Web Caching Basics: Terminology, HTTP Headers, and Caching Strategies primer.

      MySQL server also has its own query cache; you can learn more about it in our How To Optimize MySQL with Query Cache on Ubuntu 18.04 tutorial.



      Source link

      How To Troubleshoot MySQL Queries



      Part of the Series:
      How To Troubleshoot Issues in MySQL

      This guide is intended to serve as a troubleshooting resource and starting point as you diagnose your MySQL setup. We’ll go over some of the issues that many MySQL users encounter and provide guidance for troubleshooting specific problems. We will also include links to DigitalOcean tutorials and the official MySQL documentation that may be useful in certain cases.

      Sometimes users run into problems once they begin issuing queries on their data. In some database systems, including MySQL, query statements in must end in a semicolon (;) for the query to complete, as in the following example:

      If you fail to include a semicolon at the end of your query, the prompt will continue on a new line until you complete the query by entering a semicolon and pressing ENTER.

      Some users may find that their queries are exceedingly slow. One way to find which query statement is the cause of a slowdown is to enable and view MySQL's slow query log. To do this, open your mysqld.cnf file, which is used to configure options for the MySQL server. This file is typically stored within the /etc/mysql/mysql.conf.d/ directory:

      • sudo nano /etc/mysql/mysql.conf.d/mysqld.cnf

      Scroll through the file until you see the following lines:

      /etc/mysql/mysql.conf.d/mysqld.cnf

      . . .
      #slow_query_log         = 1
      #slow_query_log_file    = /var/log/mysql/mysql-slow.log
      #long_query_time = 2
      #log-queries-not-using-indexes
      . . .
      

      These commented-out directives provide MySQL's default configuration options for the slow query log. Specifically, here's what each of them do:

      • slow-query-log: Setting this to 1 enables the slow query log.
      • slow-query-log-file: This defines the file where MySQL will log any slow queries. In this case, it points to the /var/log/mysql-slow.log file.
      • long_query_time: By setting this directive to 2, it configures MySQL to log any queries that take longer than 2 seconds to complete.
      • log_queries_not_using_indexes: This tells MySQL to also log any queries that run without indexes to the /var/log/mysql-slow.log file. This setting isn't required for the slow query log to function, but it can be helpful for spotting inefficient queries.

      Uncomment each of these lines by removing the leading pound signs (#). The section will now look like this:

      /etc/mysql/mysql.conf.d/mysqld.cnf

      . . .
      slow_query_log = 1
      slow_query_log_file = /var/log/mysql-slow.log
      long_query_time = 2
      log_queries_not_using_indexes
      . . .
      

      Note: If you're running MySQL 8+, these commented lines will not be in the mysqld.cnf file by default. In this case, add the following lines to the bottom of the file:

      /etc/mysql/mysql.conf.d/mysqld.cnf

      . . .
      slow_query_log = 1
      slow_query_log_file = /var/log/mysql-slow.log
      long_query_time = 2
      log_queries_not_using_indexes
      

      After enabling the slow query log, save and close the file. Then restart the MySQL service:

      • sudo systemctl restart mysql

      With these settings in place, you can find problematic query statements by viewing the slow query log. You can do so with less, like this:

      • sudo less /var/log/mysql_slow.log

      Once you've singled out the queries causing the slowdown, you may find our guide on How To Optimize Queries and Tables in MySQL and MariaDB on a VPS to be helpful with optimizing them.

      Additionally, MySQL includes the EXPLAIN statement, which provides information about how MySQL executes queries. This page from the official MySQL documentation provides insight on how to use EXPLAIN to highlight inefficient queries.

      For help with understanding basic query structures, see our Introduction to MySQL Queries.



      Source link

      An Introduction to Queries in PostgreSQL


      Introduction

      Databases are a key component of many websites and applications, and are at the core of how data is stored and exchanged across the internet. One of the most important aspects of database management is the practice of retrieving data from a database, whether it’s on an ad hoc basis or part of a process that’s been coded into an application. There are several ways to retrieve information from a database, but one of the most commonly-used methods is performed through submitting queries through the command line.

      In relational database management systems, a query is any command used to retrieve data from a table. In Structured Query Language (SQL), queries are almost always made using the SELECT statement.

      In this guide, we will discuss the basic syntax of SQL queries as well as some of the more commonly-employed functions and operators. We will also practice making SQL queries using some sample data in a PostgreSQL database.

      PostgreSQL, often shortened to “Postgres,” is a relational database management system with an object-oriented approach, meaning that information can be represented as objects or classes in PostgreSQL schemas. PostgreSQL aligns closely with standard SQL, although it also includes some features not found in other relational database systems.

      Prerequisites

      In general, the commands and concepts presented in this guide can be used on any Linux-based operating system running any SQL database software. However, it was written specifically with an Ubuntu 18.04 server running PostgreSQL in mind. To set this up, you will need the following:

      With this setup in place, we can begin the tutorial.

      Creating a Sample Database

      Before we can begin making queries in SQL, we will first create a database and a couple tables, then populate these tables with some sample data. This will allow you to gain some hands-on experience when you begin making queries later on.

      For the sample database we’ll use throughout this guide, imagine the following scenario:

      You and several of your friends all celebrate your birthdays with one another. On each occasion, the members of the group head to the local bowling alley, participate in a friendly tournament, and then everyone heads to your place where you prepare the birthday-person’s favorite meal.

      Now that this tradition has been going on for a while, you’ve decided to begin tracking the records from these tournaments. Also, to make planning dinners easier, you decide to create a record of your friends’ birthdays and their favorite entrees, sides, and desserts. Rather than keep this information in a physical ledger, you decide to exercise your database skills by recording it in a PostgreSQL database.

      To begin, open up a PostgreSQL prompt as your postgres superuser:

      Note: If you followed all the steps of the prerequisite tutorial on Installing PostgreSQL on Ubuntu 18.04, you may have configured a new role for your PostgreSQL installation. In this case, you can connect to the Postgres prompt with the following command, substituting sammy with your own username:

      Next, create the database by running:

      • CREATE DATABASE birthdays;

      Then select this database by typing:

      Next, create two tables within this database. We'll use the first table to track your friends' records at the bowling alley. The following command will create a table called tourneys with columns for the name of each of your friends, the number of tournaments they've won (wins), their all-time best score, and what size bowling shoe they wear (size):

      • CREATE TABLE tourneys (
      • name varchar(30),
      • wins real,
      • best real,
      • size real
      • );

      Once you run the CREATE TABLE command and populate it with column headings, you’ll receive the following output:

      Output

      CREATE TABLE

      Populate the tourneys table with some sample data:

      • INSERT INTO tourneys (name, wins, best, size)
      • VALUES ('Dolly', '7', '245', '8.5'),
      • ('Etta', '4', '283', '9'),
      • ('Irma', '9', '266', '7'),
      • ('Barbara', '2', '197', '7.5'),
      • ('Gladys', '13', '273', '8');

      You’ll receive the following output:

      Output

      INSERT 0 5

      Following this, create another table within the same database which we'll use to store information about your friends' favorite birthday meals. The following command creates a table named dinners with columns for the name of each of your friends, their birthdate, their favorite entree, their preferred side dish, and their favorite dessert:

      • CREATE TABLE dinners (
      • name varchar(30),
      • birthdate date,
      • entree varchar(30),
      • side varchar(30),
      • dessert varchar(30)
      • );

      Similarly for this table, you’ll receive feedback verifying that the table was created:

      Output

      CREATE TABLE

      Populate this table with some sample data as well:

      • INSERT INTO dinners (name, birthdate, entree, side, dessert)
      • VALUES ('Dolly', '1946-01-19', 'steak', 'salad', 'cake'),
      • ('Etta', '1938-01-25', 'chicken', 'fries', 'ice cream'),
      • ('Irma', '1941-02-18', 'tofu', 'fries', 'cake'),
      • ('Barbara', '1948-12-25', 'tofu', 'salad', 'ice cream'),
      • ('Gladys', '1944-05-28', 'steak', 'fries', 'ice cream');

      Output

      INSERT 0 5

      Once that command completes successfully, you're done setting up your database. Next, we'll go over the basic command structure of SELECT queries.

      Understanding SELECT Statements

      As mentioned in the introduction, SQL queries almost always begin with the SELECT statement. SELECT is used in queries to specify which columns from a table should be returned in the result-set. Queries also almost always include FROM, which is used to specify which table the statement will query.

      Generally, SQL queries follow this syntax:

      • SELECT column_to_select FROM table_to_select WHERE certain_conditions_apply;

      By way of example, the following statement will return the entire name column from the dinners table:

      • SELECT name FROM dinners;

      Output

      name --------- Dolly Etta Irma Barbara Gladys (5 rows)

      You can select multiple columns from the same table by separating their names with a comma, like this:

      • SELECT name, birthdate FROM dinners;

      Output

      name | birthdate ---------+------------ Dolly | 1946-01-19 Etta | 1938-01-25 Irma | 1941-02-18 Barbara | 1948-12-25 Gladys | 1944-05-28 (5 rows)

      Instead of naming a specific column or set of columns, you can follow the SELECT operator with an asterisk (*) which serves as a placeholder representing all the columns in a table. The following command returns every column from the tourneys table:

      Output

      name | wins | best | size ---------+------+------+------ Dolly | 7 | 245 | 8.5 Etta | 4 | 283 | 9 Irma | 9 | 266 | 7 Barbara | 2 | 197 | 7.5 Gladys | 13 | 273 | 8 (5 rows)

      WHERE is used in queries to filter records that meet a specified condition, and any rows that do not meet that condition are eliminated from the result. A WHERE clause typically follows this syntax:

      • . . . WHERE column_name comparison_operator value

      The comparison operator in a WHERE clause defines how the specified column should be compared against the value. Here are some common SQL comparison operators:

      Operator What it does
      = tests for equality
      != tests for inequality
      < tests for less-than
      > tests for greater-than
      <= tests for less-than or equal-to
      >= tests for greater-than or equal-to
      BETWEEN tests whether a value lies within a given range
      IN tests whether a row's value is contained in a set of specified values
      EXISTS tests whether rows exist, given the specified conditions
      LIKE tests whether a value matches a specified string
      IS NULL tests for NULL values
      IS NOT NULL tests for all values other than NULL

      For example, if you wanted to find Irma's shoe size, you could use the following query:

      • SELECT size FROM tourneys WHERE name = 'Irma';

      Output

      size ------ 7 (1 row)

      SQL allows the use of wildcard characters, and these are especially handy when used in WHERE clauses. Percentage signs (%) represent zero or more unknown characters, and underscores (_) represent a single unknown character. These are useful if you're trying to find a specific entry in a table, but aren't sure of what that entry is exactly. To illustrate, let's say that you've forgotten the favorite entree of a few of your friends, but you're certain this particular entree starts with a "t." You could find its name by running the following query:

      • SELECT entree FROM dinners WHERE entree LIKE 't%';

      Output

      entree ------- tofu tofu (2 rows)

      Based on the output above, we see that the entree we have forgotten is tofu.

      There may be times when you're working with databases that have columns or tables with relatively long or difficult-to-read names. In these cases, you can make these names more readable by creating an alias with the AS keyword. Aliases created with AS are temporary, and only exist for the duration of the query for which they're created:

      • SELECT name AS n, birthdate AS b, dessert AS d FROM dinners;

      Output

      n | b | d ---------+------------+----------- Dolly | 1946-01-19 | cake Etta | 1938-01-25 | ice cream Irma | 1941-02-18 | cake Barbara | 1948-12-25 | ice cream Gladys | 1944-05-28 | ice cream (5 rows)

      Here, we have told SQL to display the name column as n, the birthdate column as b, and the dessert column as d.

      The examples we've gone through up to this point include some of the more frequently-used keywords and clauses in SQL queries. These are useful for basic queries, but they aren't helpful if you're trying to perform a calculation or derive a scalar value (a single value, as opposed to a set of multiple different values) based on your data. This is where aggregate functions come into play.

      Aggregate Functions

      Oftentimes, when working with data, you don't necessarily want to see the data itself. Rather, you want information about the data. The SQL syntax includes a number of functions that allow you to interpret or run calculations on your data just by issuing a SELECT query. These are known as aggregate functions.

      The COUNT function counts and returns the number of rows that match a certain criteria. For example, if you'd like to know how many of your friends prefer tofu for their birthday entree, you could issue this query:

      • SELECT COUNT(entree) FROM dinners WHERE entree = 'tofu';

      Output

      count ------- 2 (1 row)

      The AVG function returns the average (mean) value of a column. Using our example table, you could find the average best score amongst your friends with this query:

      • SELECT AVG(best) FROM tourneys;

      Output

      avg ------- 252.8 (1 row)

      SUM is used to find the total sum of a given column. For instance, if you'd like to see how many games you and your friends have bowled over the years, you could run this query:

      • SELECT SUM(wins) FROM tourneys;

      Output

      sum ----- 35 (1 row)

      Note that the AVG and SUM functions will only work correctly when used with numeric data. If you try to use them on non-numerical data, it will result in either an error or just 0, depending on which RDBMS you're using:

      • SELECT SUM(entree) FROM dinners;

      Output

      ERROR: function sum(character varying) does not exist LINE 1: select sum(entree) from dinners; ^ HINT: No function matches the given name and argument types. You might need to add explicit type casts.

      MIN is used to find the smallest value within a specified column. You could use this query to see what the worst overall bowling record is so far (in terms of number of wins):

      • SELECT MIN(wins) FROM tourneys;

      Output

      min ----- 2 (1 row)

      Similarly, MAX is used to find the largest numeric value in a given column. The following query will show the best overall bowling record:

      • SELECT MAX(wins) FROM tourneys;

      Output

      max ----- 13 (1 row)

      Unlike SUM and AVG, the MIN and MAX functions can be used for both numeric and alphabetic data types. When run on a column containing string values, the MIN function will show the first value alphabetically:

      • SELECT MIN(name) FROM dinners;

      Output

      min --------- Barbara (1 row)

      Likewise, when run on a column containing string values, the MAX function will show the last value alphabetically:

      • SELECT MAX(name) FROM dinners;

      Output

      max ------ Irma (1 row)

      Aggregate functions have many uses beyond what was described in this section. They're particularly useful when used with the GROUP BY clause, which is covered in the next section along with several other query clauses that affect how result-sets are sorted.

      Manipulating Query Outputs

      In addition to the FROM and WHERE clauses, there are several other clauses which are used to manipulate the results of a SELECT query. In this section, we will explain and provide examples for some of the more commonly-used query clauses.

      One of the most frequently-used query clauses, aside from FROM and WHERE, is the GROUP BY clause. It's typically used when you're performing an aggregate function on one column, but in relation to matching values in another.

      For example, let's say you wanted to know how many of your friends prefer each of the three entrees you make. You could find this info with the following query:

      • SELECT COUNT(name), entree FROM dinners GROUP BY entree;

      Output

      count | entree -------+--------- 1 | chicken 2 | steak 2 | tofu (3 rows)

      The ORDER BY clause is used to sort query results. By default, numeric values are sorted in ascending order, and text values are sorted in alphabetical order. To illustrate, the following query lists the name and birthdate columns, but sorts the results by birthdate:

      • SELECT name, birthdate FROM dinners ORDER BY birthdate;

      Output

      name | birthdate ---------+------------ Etta | 1938-01-25 Irma | 1941-02-18 Gladys | 1944-05-28 Dolly | 1946-01-19 Barbara | 1948-12-25 (5 rows)

      Notice that the default behavior of ORDER BY is to sort the result-set in ascending order. To reverse this and have the result-set sorted in descending order, close the query with DESC:

      • SELECT name, birthdate FROM dinners ORDER BY birthdate DESC;

      Output

      name | birthdate ---------+------------ Barbara | 1948-12-25 Dolly | 1946-01-19 Gladys | 1944-05-28 Irma | 1941-02-18 Etta | 1938-01-25 (5 rows)

      As mentioned previously, the WHERE clause is used to filter results based on specific conditions. However, if you use the WHERE clause with an aggregate function, it will return an error, as is the case with the following attempt to find which sides are the favorite of at least three of your friends:

      • SELECT COUNT(name), side FROM dinners WHERE COUNT(name) >= 3;

      Output

      ERROR: aggregate functions are not allowed in WHERE LINE 1: SELECT COUNT(name), side FROM dinners WHERE COUNT(name) >= 3...

      The HAVING clause was added to SQL to provide functionality similar to that of the WHERE clause while also being compatible with aggregate functions. It's helpful to think of the difference between these two clauses as being that WHERE applies to individual records, while HAVING applies to group records. To this end, any time you issue a HAVING clause, the GROUP BY clause must also be present.

      The following example is another attempt to find which side dishes are the favorite of at least three of your friends, although this one will return a result without error:

      • SELECT COUNT(name), side FROM dinners GROUP BY side HAVING COUNT(name) >= 3;

      Output

      count | side -------+------- 3 | fries (1 row)

      Aggregate functions are useful for summarizing the results of a particular column in a given table. However, there are many cases where it's necessary to query the contents of more than one table. We'll go over a few ways you can do this in the next section.

      Querying Multiple Tables

      More often than not, a database contains multiple tables, each holding different sets of data. SQL provides a few different ways to run a single query on multiple tables.

      The JOIN clause can be used to combine rows from two or more tables in a query result. It does this by finding a related column between the tables and sorts the results appropriately in the output.

      SELECT statements that include a JOIN clause generally follow this syntax:

      • SELECT table1.column1, table2.column2
      • FROM table1
      • JOIN table2 ON table1.related_column=table2.related_column;

      Note that because JOIN clauses compare the contents of more than one table, the previous example specifies which table to select each column from by preceding the name of the column with the name of the table and a period. You can specify which table a column should be selected from like this for any query, although it's not necessary when selecting from a single table, as we've done in the previous sections. Let's walk through an example using our sample data.

      Imagine that you wanted to buy each of your friends a pair of bowling shoes as a birthday gift. Because the information about your friends' birthdates and shoe sizes are held in separate tables, you could query both tables separately then compare the results from each. With a JOIN clause, though, you can find all the information you want with a single query:

      • SELECT tourneys.name, tourneys.size, dinners.birthdate
      • FROM tourneys
      • JOIN dinners ON tourneys.name=dinners.name;

      Output

      name | size | birthdate ---------+------+------------ Dolly | 8.5 | 1946-01-19 Etta | 9 | 1938-01-25 Irma | 7 | 1941-02-18 Barbara | 7.5 | 1948-12-25 Gladys | 8 | 1944-05-28 (5 rows)

      The JOIN clause used in this example, without any other arguments, is an inner JOIN clause. This means that it selects all the records that have matching values in both tables and prints them to the results set, while any records that aren't matched are excluded. To illustrate this idea, let's add a new row to each table that doesn't have a corresponding entry in the other:

      • INSERT INTO tourneys (name, wins, best, size)
      • VALUES ('Bettye', '0', '193', '9');
      • INSERT INTO dinners (name, birthdate, entree, side, dessert)
      • VALUES ('Lesley', '1946-05-02', 'steak', 'salad', 'ice cream');

      Then, re-run the previous SELECT statement with the JOIN clause:

      • SELECT tourneys.name, tourneys.size, dinners.birthdate
      • FROM tourneys
      • JOIN dinners ON tourneys.name=dinners.name;

      Output

      name | size | birthdate ---------+------+------------ Dolly | 8.5 | 1946-01-19 Etta | 9 | 1938-01-25 Irma | 7 | 1941-02-18 Barbara | 7.5 | 1948-12-25 Gladys | 8 | 1944-05-28 (5 rows)

      Notice that, because the tourneys table has no entry for Lesley and the dinners table has no entry for Bettye, those records are absent from this output.

      It is possible, though, to return all the records from one of the tables using an outer JOIN clause. Outer JOIN clauses are written as either LEFT JOIN, RIGHT JOIN, or FULL JOIN.

      A LEFT JOIN clause returns all the records from the “left” table and only the matching records from the right table. In the context of outer joins, the left table is the one referenced by the FROM clause, and the right table is any other table referenced after the JOIN statement.

      Run the previous query again, but this time use a LEFT JOIN clause:

      • SELECT tourneys.name, tourneys.size, dinners.birthdate
      • FROM tourneys
      • LEFT JOIN dinners ON tourneys.name=dinners.name;

      This command will return every record from the left table (in this case, tourneys) even if it doesn't have a corresponding record in the right table. Any time there isn't a matching record from the right table, it's returned as a blank value or NULL, depending on your RDBMS:

      Output

      name | size | birthdate ---------+------+------------ Dolly | 8.5 | 1946-01-19 Etta | 9 | 1938-01-25 Irma | 7 | 1941-02-18 Barbara | 7.5 | 1948-12-25 Gladys | 8 | 1944-05-28 Bettye | 9 | (6 rows)

      Now run the query again, this time with a RIGHT JOIN clause:

      • SELECT tourneys.name, tourneys.size, dinners.birthdate
      • FROM tourneys
      • RIGHT JOIN dinners ON tourneys.name=dinners.name;

      This will return all the records from the right table (dinners). Because Lesley's birthdate is recorded in the right table, but there is no corresponding row for her in the left table, the name and size columns will return as blank values in that row:

      Output

      name | size | birthdate ---------+------+------------ Dolly | 8.5 | 1946-01-19 Etta | 9 | 1938-01-25 Irma | 7 | 1941-02-18 Barbara | 7.5 | 1948-12-25 Gladys | 8 | 1944-05-28 | | 1946-05-02 (6 rows)

      Note that left and right joins can be written as LEFT OUTER JOIN or RIGHT OUTER JOIN, although the OUTER part of the clause is implied. Likewise, specifying INNER JOIN will produce the same result as just writing JOIN.

      There is a fourth join clause called FULL JOIN available for some RDBMS distributions, including PostgreSQL. A FULL JOIN will return all the records from each table, including any null values:

      • SELECT tourneys.name, tourneys.size, dinners.birthdate
      • FROM tourneys
      • FULL JOIN dinners ON tourneys.name=dinners.name;

      Output

      name | size | birthdate ---------+------+------------ Dolly | 8.5 | 1946-01-19 Etta | 9 | 1938-01-25 Irma | 7 | 1941-02-18 Barbara | 7.5 | 1948-12-25 Gladys | 8 | 1944-05-28 Bettye | 9 | | | 1946-05-02 (7 rows)

      Note: As of this writing, the FULL JOIN clause is not supported by either MySQL or MariaDB.

      As an alternative to using FULL JOIN to query all the records from multiple tables, you can use the UNION clause.

      The UNION operator works slightly differently than a JOIN clause: instead of printing results from multiple tables as unique columns using a single SELECT statement, UNION combines the results of two SELECT statements into a single column.

      To illustrate, run the following query:

      • SELECT name FROM tourneys UNION SELECT name FROM dinners;

      This query will remove any duplicate entries, which is the default behavior of the UNION operator:

      Output

      name --------- Irma Etta Bettye Gladys Barbara Lesley Dolly (7 rows)

      To return all entries (including duplicates) use the UNION ALL operator:

      • SELECT name FROM tourneys UNION ALL SELECT name FROM dinners;

      Output

      name --------- Dolly Etta Irma Barbara Gladys Bettye Dolly Etta Irma Barbara Gladys Lesley (12 rows)

      The names and number of the columns in the results table reflect the name and number of columns queried by the first SELECT statement. Note that when using UNION to query multiple columns from more than one table, each SELECT statement must query the same number of columns, the respective columns must have similar data types, and the columns in each SELECT statement must be in the same order. The following example shows what might result if you use a UNION clause on two SELECT statements that query a different number of columns:

      • SELECT name FROM dinners UNION SELECT name, wins FROM tourneys;

      Output

      ERROR: each UNION query must have the same number of columns LINE 1: SELECT name FROM dinners UNION SELECT name, wins FROM tourne...

      Another way to query multiple tables is through the use of subqueries. Subqueries (also known as inner or nested queries) are queries enclosed within another query. These are useful in cases where you're trying to filter the results of a query against the result of a separate aggregate function.

      To illustrate this idea, say you want to know which of your friends have won more matches than Barbara. Rather than querying how many matches Barbara has won then running another query to see who has won more games than that, you can calculate both with a single query:

      • SELECT name, wins FROM tourneys
      • WHERE wins > (
      • SELECT wins FROM tourneys WHERE name = 'Barbara'
      • );

      Output

      name | wins --------+------ Dolly | 7 Etta | 4 Irma | 9 Gladys | 13 (4 rows)

      The subquery in this statement was run only once; it only needed to find the value from the wins column in the same row as Barbara in the name column, and the data returned by the subquery and outer query are independent of one another. There are cases, though, where the outer query must first read every row in a table and compare those values against the data returned by the subquery in order to return the desired data. In this case, the subquery is referred to as a correlated subquery.

      The following statement is an example of a correlated subquery. This query seeks to find which of your friends have won more games than is the average for those with the same shoe size:

      • SELECT name, size FROM tourneys AS t
      • WHERE wins > (
      • SELECT AVG(wins) FROM tourneys WHERE size = t.size
      • );

      In order for the query to complete, it must first collect the name and size columns from the outer query. Then, it compares each row from that result set against the results of the inner query, which determines the average number of wins for individuals with identical shoe sizes. Because you only have two friends that have the same shoe size, there can only be one row in the result-set:

      Output

      name | size ------+------ Etta | 9 (1 row)

      As mentioned earlier, subqueries can be used to query results from multiple tables. To illustrate this with one final example, say you wanted to throw a surprise dinner for the group's all-time best bowler. You could find which of your friends has the best bowling record and return their favorite meal with the following query:

      • SELECT name, entree, side, dessert
      • FROM dinners
      • WHERE name = (SELECT name FROM tourneys
      • WHERE wins = (SELECT MAX(wins) FROM tourneys));

      Output

      name | entree | side | dessert --------+--------+-------+----------- Gladys | steak | fries | ice cream (1 row)

      Notice that this statement not only includes a subquery, but also contains a subquery within that subquery.

      Conclusion

      Issuing queries is one of the most commonly-performed tasks within the realm of database management. There are a number of database administration tools, such as phpMyAdmin or pgAdmin, that allow you to perform queries and visualize the results, but issuing SELECT statements from the command line is still a widely-practiced workflow that can also provide you with greater control.

      If you're new to working with SQL, we encourage you to use our SQL Cheat Sheet as a reference and to review the official PostgreSQL documenation. Additionally, if you'd like to learn more about SQL and relational databases, the following tutorials may be of interest to you:



      Source link