      Managed Databases Connection Pools and PostgreSQL Benchmarking Using pgbench


      Introduction

      DigitalOcean Managed Databases allows you to scale your PostgreSQL database using several methods. One such method is a built-in connection pooler that allows you to efficiently handle large numbers of client connections and reduce the CPU and memory footprint of these open connections. By using a connection pool and sharing a fixed set of recyclable connections, you can handle significantly more concurrent client connections, and squeeze extra performance out of your PostgreSQL database.

      In this tutorial we’ll use pgbench, PostgreSQL’s built-in benchmarking tool, to run load tests on a DigitalOcean Managed PostgreSQL Database. We’ll dive into connection pools, describe how they work, and show how to create one using the Cloud Control Panel. Finally, using results from the pgbench tests, we’ll demonstrate how using a connection pool can be an inexpensive method of increasing database throughput.

      Prerequisites

      To complete this tutorial, you’ll need:

      • A DigitalOcean Managed PostgreSQL Database cluster. To learn how to provision and configure a DigitalOcean PostgreSQL cluster, consult the Managed Database product documentation.
      • A client machine with PostgreSQL installed. By default, your PostgreSQL installation will contain the pgbench benchmarking utility and the psql client, both of which we’ll use in this guide. Consult How To Install and Use PostgreSQL on Ubuntu 18.04 to learn how to install PostgreSQL. If you’re not running Ubuntu on your client machine, you can use the version finder to find the appropriate tutorial.

      Once you have a DigitalOcean PostgreSQL cluster up and running and a client machine with pgbench installed, you’re ready to begin with this guide.

      Step 1 — Creating and Initializing benchmark Database

      Before we create a connection pool for our database, we’ll first create the benchmark database on our PostgreSQL cluster and populate it with some dummy data on which pgbench will run its tests. The pgbench utility repeatedly runs a series of five SQL commands (consisting of SELECT, UPDATE, and INSERT queries) in a transaction, using multiple threads and clients, and calculates a useful performance metric called Transactions per Second (TPS). TPS is a measure of database throughput, counting the number of atomic transactions processed by the database in one second. To learn more about the specific commands executed by pgbench, consult What is the “Transaction” Actually Performed in pgbench? from the official pgbench documentation.

      Let’s begin by connecting to our PostgreSQL cluster and creating the benchmark database.

      First, retrieve your cluster’s Connection Details by navigating to Databases and locating your PostgreSQL cluster. Click into your cluster. You should see a cluster overview page containing the following Connection Details box:

      [Image: PostgreSQL Cluster Connection Details]

      From this, we can parse the following config variables:

      • Admin user: doadmin
      • Admin password: your_password
      • Cluster endpoint: dbaas-test-do-user-3587522-0.db.ondigitalocean.com
      • Connection port: 25060
      • Database to connect to: defaultdb
      • SSL Mode: require (use an SSL-encrypted connection for increased security)

      Take note of these parameters, as you’ll need them when using both the psql client and pgbench tool.
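
      These parameters combine into a single connection URI of the kind psql accepts. The Python sketch below assembles one and percent-encodes the credentials, which matters if your password contains characters such as @ or /. The function name build_conn_uri and the sample password are illustrative assumptions, not part of any DigitalOcean tooling.

```python
from urllib.parse import quote


def build_conn_uri(user, password, host, port, dbname, sslmode="require"):
    """Assemble a libpq-style connection URI, percent-encoding the credentials."""
    # safe="" forces reserved characters such as "@", ":" and "/" to be encoded.
    userinfo = f"{quote(user, safe='')}:{quote(password, safe='')}"
    return f"postgresql://{userinfo}@{host}:{port}/{quote(dbname, safe='')}?sslmode={sslmode}"


if __name__ == "__main__":
    # Placeholder values from this guide; "p@ss/word" is a made-up password.
    print(build_conn_uri("doadmin", "p@ss/word", "your_cluster_endpoint", 25060, "defaultdb"))
```

      In practice you can simply copy the connection string from the Control Panel; the sketch only shows how the individual fields map onto it.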

      Click on the dropdown above this box and select Connection String. We’ll copy this string and pass it to psql to connect to this PostgreSQL node.

      Connect to your cluster using psql and the connection string you just copied:

      • psql "postgresql://doadmin:your_password@your_cluster_endpoint:25060/defaultdb?sslmode=require"

      You should see the following PostgreSQL client prompt, indicating that you’ve connected to your PostgreSQL cluster successfully:

      Output

      psql (10.6 (Ubuntu 10.6-0ubuntu0.18.04.1))
      SSL connection (protocol: TLSv1.2, cipher: ECDHE-RSA-AES256-GCM-SHA384, bits: 256, compression: off)
      Type "help" for help.

      defaultdb=>

      From here, create the benchmark database:

      • CREATE DATABASE benchmark;

      You should see the following output:

      Output

      CREATE DATABASE

      Now, disconnect from the cluster:

      • \q

      Before we run the pgbench tests, we need to populate this benchmark database with some tables and dummy data required to run the tests.

      To do this, we’ll run pgbench with the following flags:

      • -h: The PostgreSQL cluster endpoint
      • -p: The PostgreSQL cluster connection port
      • -U: The database username
      • -i: Indicates that we'd like to initialize the benchmark database with benchmarking tables and their dummy data.
      • -s: Set a scale factor of 150, which will multiply table sizes by 150. The default scale factor of 1 results in tables of the following sizes:

        table                   # of rows
        ---------------------------------
        pgbench_branches        1
        pgbench_tellers         10
        pgbench_accounts        100000
        pgbench_history         0
        

        Using a scale factor of 150, the pgbench_accounts table will contain 15,000,000 rows.

        Note: To avoid excessive blocked transactions, be sure to set the scale factor to a value at least as large as the number of concurrent clients you intend to test with. In this tutorial we'll test with 150 clients at most, so we set -s to 150 here. To learn more, consult these recommended practices from the official pgbench documentation.
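
      The row counts scale linearly with -s, so the table sizes for any scale factor can be worked out ahead of time. A minimal sketch, using the default sizes from the table above (the helper names are our own):

```python
# Rows per table at scale factor 1, taken from the table above.
BASE_ROWS = {
    "pgbench_branches": 1,
    "pgbench_tellers": 10,
    "pgbench_accounts": 100_000,
    "pgbench_history": 0,
}


def rows_at_scale(scale):
    """Return the expected row count of each pgbench table for a scale factor."""
    return {table: rows * scale for table, rows in BASE_ROWS.items()}


def min_scale_for_clients(max_clients):
    """Smallest scale factor that follows the note above (scale >= clients)."""
    return max(1, max_clients)


if __name__ == "__main__":
    scale = min_scale_for_clients(150)
    for table, rows in rows_at_scale(scale).items():
        print(f"{table:<20}{rows:>12,}")
```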

      Run the complete pgbench command:

      • pgbench -h your_cluster_endpoint -p 25060 -U doadmin -i -s 150 benchmark

      After running this command, you will be prompted to enter the password for the database user you specified. Enter the password, and hit ENTER.

      You should see the following output:

      Output

      dropping old tables...
      NOTICE: table "pgbench_accounts" does not exist, skipping
      NOTICE: table "pgbench_branches" does not exist, skipping
      NOTICE: table "pgbench_history" does not exist, skipping
      NOTICE: table "pgbench_tellers" does not exist, skipping
      creating tables...
      generating data...
      100000 of 15000000 tuples (0%) done (elapsed 0.19 s, remaining 27.93 s)
      200000 of 15000000 tuples (1%) done (elapsed 0.85 s, remaining 62.62 s)
      300000 of 15000000 tuples (2%) done (elapsed 1.21 s, remaining 59.23 s)
      400000 of 15000000 tuples (2%) done (elapsed 1.63 s, remaining 59.44 s)
      500000 of 15000000 tuples (3%) done (elapsed 2.05 s, remaining 59.51 s)
      . . .
      14700000 of 15000000 tuples (98%) done (elapsed 70.87 s, remaining 1.45 s)
      14800000 of 15000000 tuples (98%) done (elapsed 71.39 s, remaining 0.96 s)
      14900000 of 15000000 tuples (99%) done (elapsed 71.91 s, remaining 0.48 s)
      15000000 of 15000000 tuples (100%) done (elapsed 72.42 s, remaining 0.00 s)
      vacuuming...
      creating primary keys...
      done.

      At this point, we've created a benchmarking database, populated with the tables and data required to run the pgbench tests. We can now move on to running a baseline test which we'll use to compare performance before and after connection pooling is enabled.

      Step 2 — Running a Baseline pgbench Test

      Before we run our first benchmark, it's worth diving into what we're trying to optimize with connection pools.

      Typically when a client connects to a PostgreSQL database, the main PostgreSQL OS process forks itself into a child process corresponding to this new connection. When there are only a few connections, this rarely presents an issue. However, as clients and connections scale, the CPU and memory overhead of creating and maintaining these connections begins to add up, especially if the application in question does not efficiently use database connections. In addition, the max_connections PostgreSQL setting may limit the number of client connections allowed, resulting in additional connections being refused or dropped.

      A connection pool keeps open a fixed number of database connections, the pool size, which it then uses to distribute and execute client requests. This means that you can accommodate far more simultaneous connections, efficiently deal with idle or stagnant clients, as well as queue up client requests during traffic spikes instead of rejecting them. By recycling connections, you can more efficiently use your machine's resources in an environment where there is a heavy connection volume, and squeeze extra performance out of your database.

      A connection pool can be implemented either on the application side or as middleware between the database and your application. The Managed Databases connection pooler is built on top of PgBouncer, a lightweight, open-source middleware connection pooler for PostgreSQL. Its interface is available via the Cloud Control Panel UI.

      Navigate to Databases in the Control Panel, and then click into your PostgreSQL cluster. From here, click into Connection Pools. Then, click on Create a Connection Pool. You should see the following configuration window:

      [Image: Connection Pools Config Window]

      Here, you can configure the following fields:

      • Pool Name: A unique name for your connection pool
      • Database: The database for which you'd like to pool connections
      • User: The PostgreSQL user the connection pool will authenticate as
      • Mode: One of Session, Transaction, or Statement. This option controls how long the pool assigns a backend connection to a client.
        • Session: The client holds on to the connection until it explicitly disconnects.
        • Transaction: The client obtains the connection until it completes a transaction, after which the connection is returned to the pool.
        • Statement: The pool aggressively recycles connections after each client statement. In statement mode, multi-statement transactions are not allowed. To learn more, consult the Connection Pools product documentation.
      • Pool Size: The number of connections the connection pool will keep open between itself and the database.
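
      To make the idea of a fixed pool size concrete, here is a toy in-process model in Python. It is only an analogy for Transaction mode, assuming each client runs a single short transaction, and it says nothing about how PgBouncer is actually implemented. ToyPool and simulate are hypothetical names.

```python
import queue
import threading
import time


class ToyPool:
    """A fixed set of reusable 'connections' handed out one at a time."""

    def __init__(self, size):
        self._idle = queue.Queue()
        for conn_id in range(size):
            self._idle.put(conn_id)
        self._lock = threading.Lock()
        self.in_use = 0
        self.peak_in_use = 0

    def run_transaction(self, work):
        conn = self._idle.get()  # blocks (queues the client) until a connection is free
        with self._lock:
            self.in_use += 1
            self.peak_in_use = max(self.peak_in_use, self.in_use)
        try:
            return work(conn)
        finally:
            with self._lock:
                self.in_use -= 1
            self._idle.put(conn)  # hand the connection back once the transaction ends


def simulate(clients, pool_size):
    """Run one short transaction per client; return (completed, peak backend use)."""
    pool = ToyPool(pool_size)
    done = []

    def client():
        pool.run_transaction(lambda conn: time.sleep(0.01))
        done.append(1)

    threads = [threading.Thread(target=client) for _ in range(clients)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return len(done), pool.peak_in_use


if __name__ == "__main__":
    completed, peak = simulate(clients=20, pool_size=5)
    print(f"completed={completed} peak_backend_connections={peak}")
```

      Every client finishes its work, yet the number of backend connections in use never exceeds the pool size: extra clients wait in the queue instead of being refused.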

      Before we create a connection pool, we'll run a baseline test to which we can compare database performance with connection pooling.

      In this tutorial, we'll use a 4 GB RAM, 2 vCPU, 80 GB Disk, primary node only Managed Database setup. You can scale the benchmark test parameters in this section according to your PostgreSQL cluster specs.

      DigitalOcean Managed Database clusters have the PostgreSQL max_connections parameter preset to 25 connections per 1 GB RAM. A 4 GB RAM PostgreSQL node therefore has max_connections set to 100. In addition, for all clusters, 3 connections are reserved for maintenance. So for this 4 GB RAM PostgreSQL cluster, 97 connections are available for connection pooling.
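
      The same arithmetic applies to other node sizes. A small sketch, assuming the 25-connections-per-GB preset and the 3 reserved connections described above still hold for your cluster (the function name is our own):

```python
CONNECTIONS_PER_GB = 25        # preset described above
RESERVED_FOR_MAINTENANCE = 3   # reserved on every cluster


def connection_budget(ram_gb):
    """Return (max_connections, connections available for pooling)."""
    max_connections = CONNECTIONS_PER_GB * ram_gb
    return max_connections, max_connections - RESERVED_FOR_MAINTENANCE


if __name__ == "__main__":
    for ram_gb in (1, 2, 4, 8):
        print(ram_gb, connection_budget(ram_gb))
```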

      With this in mind, let's run our first baseline pgbench test.

      Log in to your client machine. We’ll run pgbench, specifying the database endpoint, port and user as usual. In addition, we’ll provide the following flags:

      • -c: The number of concurrent clients or database sessions to simulate. We set this to 50 so as to simulate a number of concurrent connections smaller than the max_connections parameter for our PostgreSQL cluster.
      • -j: The number of worker threads pgbench will use to run the benchmark. If you're using a multi-CPU machine, you can tune this upwards to distribute clients across threads. On a two-core machine, we set this to 2.
      • -P: Display progress and metrics every 60 seconds.
      • -T: Run the benchmark for 600 seconds (10 minutes). To produce consistent, reproducible results, it's important that you run the benchmark for several minutes, or through one checkpoint cycle.

      We’ll also specify that we'd like to run the benchmark against the benchmark database we created and populated earlier.

      Run the following complete pgbench command:

      • pgbench -h your_db_endpoint -p 25060 -U doadmin -c 50 -j 2 -P 60 -T 600 benchmark

      Hit ENTER and then type in the password for the doadmin user to begin running the test. You should see output similar to the following (results will depend on the specs of your PostgreSQL cluster):

      Output

      starting vacuum...end.
      progress: 60.0 s, 157.4 tps, lat 282.988 ms stddev 40.261
      progress: 120.0 s, 176.2 tps, lat 283.726 ms stddev 38.722
      progress: 180.0 s, 167.4 tps, lat 298.663 ms stddev 238.124
      progress: 240.0 s, 178.9 tps, lat 279.564 ms stddev 43.619
      progress: 300.0 s, 178.5 tps, lat 280.016 ms stddev 43.235
      progress: 360.0 s, 178.8 tps, lat 279.737 ms stddev 43.307
      progress: 420.0 s, 179.3 tps, lat 278.837 ms stddev 43.783
      progress: 480.0 s, 178.5 tps, lat 280.203 ms stddev 43.921
      progress: 540.0 s, 180.0 tps, lat 277.816 ms stddev 43.742
      progress: 600.0 s, 178.5 tps, lat 280.044 ms stddev 43.705
      transaction type: <builtin: TPC-B (sort of)>
      scaling factor: 150
      query mode: simple
      number of clients: 50
      number of threads: 2
      duration: 600 s
      number of transactions actually processed: 105256
      latency average = 282.039 ms
      latency stddev = 84.244 ms
      tps = 175.329321 (including connections establishing)
      tps = 175.404174 (excluding connections establishing)

      Here, we observed that over a 10 minute run with 50 concurrent sessions, we processed 105,256 transactions with a throughput of roughly 175 transactions per second.
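
      If you plan to compare several runs, it can help to extract these numbers programmatically. The Python sketch below parses the summary lines as they appear in the output above; other pgbench versions may word these lines differently, so treat the patterns as assumptions to adjust. It also checks that transactions divided by duration lands close to the reported TPS.

```python
import re

SUMMARY = """\
number of clients: 50
duration: 600 s
number of transactions actually processed: 105256
latency average = 282.039 ms
tps = 175.329321 (including connections establishing)
tps = 175.404174 (excluding connections establishing)
"""


def parse_summary(text):
    """Pull the headline numbers out of a pgbench summary block."""
    def grab(pattern, cast):
        return cast(re.search(pattern, text).group(1))

    return {
        "clients": grab(r"number of clients: (\d+)", int),
        "duration_s": grab(r"duration: (\d+) s", int),
        "transactions": grab(r"transactions actually processed: (\d+)", int),
        "latency_ms": grab(r"latency average = ([\d.]+) ms", float),
        "tps": grab(r"tps = ([\d.]+) \(including", float),
    }


if __name__ == "__main__":
    stats = parse_summary(SUMMARY)
    print(stats)
    # Throughput recomputed from the raw counts, for comparison with stats["tps"].
    print(round(stats["transactions"] / stats["duration_s"], 1))
```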

      Now, let's run the same test, this time using 150 concurrent clients, a value that is higher than max_connections for this database, to synthetically simulate a mass influx of client connections:

      • pgbench -h your_db_endpoint -p 25060 -U doadmin -c 150 -j 2 -P 60 -T 600 benchmark

      You should see output similar to the following:

      Output

      starting vacuum...end.
      connection to database "pgbench" failed: FATAL: remaining connection slots are reserved for non-replication superuser connections
      progress: 60.0 s, 182.6 tps, lat 280.069 ms stddev 42.009
      progress: 120.0 s, 253.8 tps, lat 295.612 ms stddev 237.448
      progress: 180.0 s, 271.3 tps, lat 276.411 ms stddev 40.643
      progress: 240.0 s, 273.0 tps, lat 274.653 ms stddev 40.942
      progress: 300.0 s, 272.8 tps, lat 274.977 ms stddev 41.660
      progress: 360.0 s, 250.0 tps, lat 300.033 ms stddev 282.712
      progress: 420.0 s, 272.1 tps, lat 275.614 ms stddev 42.901
      progress: 480.0 s, 261.1 tps, lat 287.226 ms stddev 112.499
      progress: 540.0 s, 272.5 tps, lat 275.309 ms stddev 41.740
      progress: 600.0 s, 271.2 tps, lat 276.585 ms stddev 41.221
      transaction type: <builtin: TPC-B (sort of)>
      scaling factor: 150
      query mode: simple
      number of clients: 150
      number of threads: 2
      duration: 600 s
      number of transactions actually processed: 154892
      latency average = 281.421 ms
      latency stddev = 125.929 ms
      tps = 257.994941 (including connections establishing)
      tps = 258.049251 (excluding connections establishing)

      Note the FATAL error, indicating that pgbench hit the 100 connection limit threshold set by max_connections, resulting in a refused connection. The test was still able to complete, with a TPS of roughly 257.
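
      One rough way to sanity-check these results is Little's law, which estimates the average number of clients in flight as throughput multiplied by average latency. This is our own interpretation of the numbers, not something pgbench reports. For the two runs above it gives about 49 and about 73 clients respectively, which would be consistent with a share of the 150 requested clients never obtaining a connection.

```python
def active_clients(tps, latency_ms):
    """Little's law: average concurrency = throughput x average latency."""
    return tps * latency_ms / 1000.0


# (tps, average latency in ms) copied from the two baseline runs above.
RUNS = {
    "50 clients, no pool": (175.329321, 282.039),
    "150 clients, no pool": (257.994941, 281.421),
}

if __name__ == "__main__":
    for label, (tps, latency_ms) in RUNS.items():
        print(f"{label}: ~{active_clients(tps, latency_ms):.1f} clients in flight")
```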

      At this point we can investigate how a connection pool could potentially improve our database's throughput.

      Step 3 — Creating and Testing a Connection Pool

      In this step we'll create a connection pool and rerun the previous pgbench test to see if we can improve our database's throughput.

      In general, the max_connections setting and connection pool parameters are tuned in tandem to max out the database's load. However, because max_connections is abstracted away from the user in DigitalOcean Managed Databases, our main levers here are the connection pool Mode and Size settings.

      To begin, let's create a connection pool in Transaction mode that keeps open all the available backend connections.

      Navigate to Databases in the Control Panel, and then click into your PostgreSQL cluster. From here, click into Connection Pools. Then, click on Create a Connection Pool.

      In the configuration window that appears, fill in the following values:

      [Image: Connection Pool Configuration Values]

      Here we name our connection pool test-pool, and use it with the benchmark database. Our database user is doadmin and we set the connection pool to Transaction mode. Recall from earlier that for a managed database cluster with 4 GB of RAM, there are 97 available database connections. Accordingly, configure the pool to keep open 97 database connections.

      When you're done, hit Create Pool.

      You should now see this pool in the Control Panel:

      [Image: Connection Pool in Control Panel]

      Grab its URI by clicking Connection Details. It should look something like the following:

      postgres://doadmin:password@pool_endpoint:pool_port/test-pool?sslmode=require
      

      You should notice a different port here, and potentially a different endpoint and database name, corresponding to the pool name test-pool.

      Now that we've created the test-pool connection pool, we can rerun the pgbench test we ran above.

      Rerun pgbench

      From your client machine, run the following pgbench command (with 150 concurrent clients), making sure to substitute the placeholder values pool_endpoint and pool_port with those in your connection pool URI:

      • pgbench -h pool_endpoint -p pool_port -U doadmin -c 150 -j 2 -P 60 -T 600 test-pool

      Here we once again use 150 concurrent clients, run the test across 2 threads, print progress every 60 seconds, and run the test for 600 seconds. We set the database name to test-pool, the name of the connection pool.
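
      Since every value in this command comes from the pool URI, the mapping can be expressed as a small helper. The sketch below is illustrative only: pgbench_args is a hypothetical function, and the host and port in the example URI are made up. The password is deliberately left out of the argument list; pgbench will prompt for it as before.

```python
import shlex
from urllib.parse import unquote, urlsplit


def pgbench_args(uri, clients, threads, progress_s, duration_s):
    """Translate a connection URI into the pgbench arguments used in this guide."""
    parts = urlsplit(uri)
    return [
        "pgbench",
        "-h", parts.hostname,
        "-p", str(parts.port),
        "-U", unquote(parts.username),
        "-c", str(clients),
        "-j", str(threads),
        "-P", str(progress_s),
        "-T", str(duration_s),
        parts.path.lstrip("/"),  # database name, or the pool name when using a pool
    ]


if __name__ == "__main__":
    # Hypothetical endpoint and port; use the values from your own pool URI.
    uri = "postgres://doadmin:password@pool.example.com:25061/test-pool?sslmode=require"
    print(shlex.join(pgbench_args(uri, clients=150, threads=2, progress_s=60, duration_s=600)))
```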

      Once the test completes, you should see output similar to the following (note that these results will vary depending on the specs of your database node):

      Output

      starting vacuum...end.
      progress: 60.0 s, 240.0 tps, lat 425.251 ms stddev 59.773
      progress: 120.0 s, 350.0 tps, lat 428.647 ms stddev 57.084
      progress: 180.0 s, 340.3 tps, lat 440.680 ms stddev 313.631
      progress: 240.0 s, 364.9 tps, lat 411.083 ms stddev 61.106
      progress: 300.0 s, 366.5 tps, lat 409.367 ms stddev 60.165
      progress: 360.0 s, 362.5 tps, lat 413.750 ms stddev 59.005
      progress: 420.0 s, 359.5 tps, lat 417.292 ms stddev 60.395
      progress: 480.0 s, 363.8 tps, lat 412.130 ms stddev 60.361
      progress: 540.0 s, 351.6 tps, lat 426.661 ms stddev 62.960
      progress: 600.0 s, 344.5 tps, lat 435.516 ms stddev 65.182
      transaction type: <builtin: TPC-B (sort of)>
      scaling factor: 150
      query mode: simple
      number of clients: 150
      number of threads: 2
      duration: 600 s
      number of transactions actually processed: 206768
      latency average = 421.719 ms
      latency stddev = 114.676 ms
      tps = 344.240797 (including connections establishing)
      tps = 344.385646 (excluding connections establishing)

      Notice here that we were able to increase our database's throughput from 257 TPS to 344 TPS with 150 concurrent connections (an increase of 33%), and did not run up against the max_connections limit we previously hit without a connection pool. By placing a connection pool in front of the database, we can avoid dropped connections and significantly increase database throughput in an environment with a large number of simultaneous connections.

      If you run this same test, but with a -c value of 50 (specifying a smaller number of clients), the gains from using a connection pool become much less evident:

      Output

      starting vacuum...end.
      progress: 60.0 s, 154.0 tps, lat 290.592 ms stddev 35.530
      progress: 120.0 s, 162.7 tps, lat 307.168 ms stddev 241.003
      progress: 180.0 s, 172.0 tps, lat 290.678 ms stddev 36.225
      progress: 240.0 s, 172.4 tps, lat 290.169 ms stddev 37.603
      progress: 300.0 s, 177.8 tps, lat 281.214 ms stddev 35.365
      progress: 360.0 s, 177.7 tps, lat 281.402 ms stddev 35.227
      progress: 420.0 s, 174.5 tps, lat 286.404 ms stddev 34.797
      progress: 480.0 s, 176.1 tps, lat 284.107 ms stddev 36.540
      progress: 540.0 s, 173.1 tps, lat 288.771 ms stddev 38.059
      progress: 600.0 s, 174.5 tps, lat 286.508 ms stddev 59.941
      transaction type: <builtin: TPC-B (sort of)>
      scaling factor: 150
      query mode: simple
      number of clients: 50
      number of threads: 2
      duration: 600 s
      number of transactions actually processed: 102938
      latency average = 288.509 ms
      latency stddev = 83.503 ms
      tps = 171.482966 (including connections establishing)
      tps = 171.553434 (excluding connections establishing)

      Here we see that we were not able to increase throughput by using a connection pool. Our throughput went down to 171 TPS from 175 TPS.
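
      The relative changes quoted in this step follow directly from the reported TPS figures. A short worked calculation, using the "including connections establishing" values from the four runs:

```python
def percent_change(before, after):
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


if __name__ == "__main__":
    # 150 clients: direct connections vs. through the pool.
    print(f"{percent_change(257.994941, 344.240797):+.1f}%")
    # 50 clients: direct connections vs. through the pool.
    print(f"{percent_change(175.329321, 171.482966):+.1f}%")
```

      With 150 clients the pool improves throughput by roughly a third, while with 50 clients it costs about two percent.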

      Although in this guide we use pgbench with its built-in benchmark data set, the best test for determining whether or not to use a connection pool is a benchmark load that accurately represents production load on your database, against production data. Creating custom benchmarking scripts and data is beyond the scope of this guide, but to learn more, consult the official pgbench documentation.

      Note: The pool size setting is highly workload-specific. In this guide, we configured the connection pool to use all the available backend database connections. This was because throughout our benchmark, the database rarely reached full utilization (you can monitor database load from the Metrics tab in the Cloud Control Panel). Depending on your database's load, this may not be the optimal setting. If you notice that your database is constantly fully saturated, shrinking the connection pool may increase throughput and improve performance by queuing additional requests instead of trying to execute them all at the same time on an already loaded server.

      Conclusion

      DigitalOcean Managed Databases connection pooling is a powerful feature that can help you quickly squeeze extra performance out of your database. Along with other techniques like replication, caching, and sharding, connection pooling can help you scale your database layer to process an even greater volume of requests.

      In this guide we focused on a simplistic and synthetic testing scenario using PostgreSQL's built-in pgbench benchmarking tool and its default benchmark test. In any production scenario, you should run benchmarks against actual production data while simulating production load. This will allow you to tune your database for your particular usage pattern.

      Along with pgbench, other tools exist to benchmark and load your database. One such tool developed by Percona is sysbench-tpcc. Another is Apache's JMeter, which can load test databases as well as web applications.

      To learn more about DigitalOcean Managed Databases, consult the Managed Databases product documentation. To learn more about sharding, another useful scaling technique, consult Understanding Database Sharding.

      Troubleshooting Basic Connection Issues


      Updated by Linode Written by Linode

      This guide presents troubleshooting strategies for Linodes that are unresponsive to any network access. One reason that a Linode may be unresponsive is if you recently performed a distribution upgrade or other broad software updates to your Linode, as those changes can lead to unexpected problems for your core system components.

      Similarly, your server may be unresponsive after maintenance was applied by Linode to your server’s host (frequently, this is correlated with software/distribution upgrades performed on your deployment prior to the host’s maintenance). This guide is designed as a useful resource for either of these scenarios.

      If you can ping your Linode, but you cannot access SSH or other services, this guide will not assist with troubleshooting those services. Instead, refer to the Troubleshooting SSH or Troubleshooting Web Servers, Databases, and Other Services guides.

      Where to go for help outside this guide

      This guide explains how to use different troubleshooting commands on your Linode. These commands can produce diagnostic information and logs that may expose the root of your connection issues. For some specific examples of diagnostic information, this guide also explains the corresponding cause of the issue and presents solutions for it.

      If the information and logs you gather do not match a solution outlined here, consider searching the Linode Community Site for posts that match your system’s symptoms. Or, post a new question in the Community Site and include your commands’ output.

      Linode is not responsible for the configuration or installation of software on your Linode. Refer to Linode’s Scope of Support for a description of which issues Linode Support can help with.

      Before You Begin

      There are a few core troubleshooting tools you should familiarize yourself with that are used when diagnosing connection problems.

      The Linode Shell (Lish)

      Lish is a shell that provides access to your Linode’s serial console. Lish does not establish a network connection to your Linode, so you can use it when your networking is down or SSH is inaccessible. Much of your troubleshooting for basic connection issues will be performed from the Lish console.

      To learn about Lish in more detail, and for instructions on how to connect to your Linode via Lish, review the Using the Linode Shell (Lish) guide. In particular, using your web browser is a fast and simple way to access Lish.

      MTR

      When your network traffic leaves your computer to your Linode, it travels through a series of routers that are administered by your internet service provider, by Linode’s transit providers, and by the various organizations that form the Internet’s backbone. It is possible to analyze the route that your traffic takes for possible service interruptions using a tool called MTR.

      MTR is similar to the traceroute tool, in that it will trace and display your traffic’s route. MTR also runs several iterations of its tracing algorithm, which means that it can report statistics like average packet loss and latency over the period that the MTR test runs.

      Review the installation instructions in Linode’s Diagnosing Network Issues with MTR guide and install MTR on your computer.

      Is your Linode Running?

      Log in to the Linode Manager and inspect the Linode’s dashboard. If the Linode is powered off, turn it on.

      Inspect the Lish Console

      If the Linode is listed as running in the Manager, or after you boot it from the Manager, open the Lish console and look for a login prompt. If a login prompt exists, try logging in with your root user credentials (or any other Linux user credentials that you previously created on the server).

      Note

      The root user is available in Lish even if root user login is disabled in your SSH configuration.

      1. If you can log in at the Lish console, move on to the diagnose network connection issues section of this guide.

        If you see a login prompt, but you have forgotten the credentials for your Linode, follow the instructions for resetting your root password and then attempt to log in at the Lish console again.

      2. If you do not see a login prompt, your Linode may have issues with booting.

      Troubleshoot Booting Issues

      If your Linode isn’t booting normally, you will not be able to rely on the Lish console to troubleshoot your deployment directly. To continue, you will first need to reboot your Linode into Rescue Mode, which is a special recovery environment that Linode provides.

      When you boot into Rescue Mode, you are booting your Linode into the Finnix recovery Linux distribution. This Finnix image includes a working network configuration, and you will be able to mount your Linode’s disks from this environment, which means that you will be able to access your files.

      1. Review the Rescue and Rebuild guide for instructions and boot into Rescue Mode. If your Linode does not reboot into Rescue Mode successfully, please contact Linode Support.

      2. Connect to Rescue Mode via the Lish console as you would normally. You will not be required to enter a username or password to start using the Lish console while in Rescue Mode.

      Perform a File System Check

      If your Linode can’t boot, then it may have experienced filesystem corruption.

      1. Review the Rescue and Rebuild guide for instructions on running a filesystem check.

        Caution

        Never run a filesystem check on a disk that is mounted.

      2. If your filesystem check reports errors that cannot be fixed, you may need to rebuild your Linode.

      3. If the filesystem check reports errors that it has fixed, try rebooting your Linode under your normal configuration profile. After you reboot, you may find that your connection issues are resolved. If you still cannot connect as normal, restart the troubleshooting process from the beginning of this guide.

      4. If the filesystem check does not report any errors, there may be another reason for your booting issues. Continue to inspecting your system and kernel logs.

      Inspect System and Kernel Logs

      In addition to being able to mount your Linode’s disks, you can also change root (sometimes abbreviated as chroot) within Rescue Mode. Chrooting will make Rescue Mode’s working environment emulate your normal Linux distribution. This means your files and logs will appear where you normally expect them, and you will be able to work with tools like your standard package manager and other system utilities.

      To proceed, review the Rescue and Rebuild guide’s instructions on changing root. Once you have chrooted, you can then investigate your Linode’s logs for messages that may describe the cause of your booting issues.

      In systemd Linux distributions (like Debian 8+, Ubuntu 16.04+, CentOS 7+, and recent releases of Arch), you can run the journalctl command to view system and kernel logs. In these and other distributions, you may also find system log messages in the following files:

      • /var/log/messages

      • /var/log/syslog

      • /var/log/kern.log

      • /var/log/dmesg

      You can use the less command to review the contents of these files (e.g. less /var/log/syslog). Try pasting your log messages into a search engine or searching in the Linode Community Site to see if anyone else has run into similar issues. If you don’t find any results, you can try asking about your issues in a new post on the Linode Community Site. If it becomes difficult to find a solution, you may need to rebuild your Linode.
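
      If the logs are long, a small script can surface likely culprits before you read them in full. The following Python sketch scans whichever of the files above exist and are readable; the keyword list is a hypothetical starting point, not a definitive set of failure messages.

```python
import os
import re

LOG_FILES = ["/var/log/messages", "/var/log/syslog", "/var/log/kern.log", "/var/log/dmesg"]
# Hypothetical keyword list; adjust it to the symptoms you are chasing.
PATTERN = re.compile(r"error|fail|panic|corrupt", re.IGNORECASE)


def suspicious_lines(path, limit=20):
    """Return up to `limit` of the most recent lines in `path` that match PATTERN."""
    with open(path, errors="replace") as handle:
        hits = [line.rstrip("\n") for line in handle if PATTERN.search(line)]
    return hits[-limit:]


if __name__ == "__main__":
    for path in LOG_FILES:
        # Not every distribution writes every file, and some need root to read.
        if not (os.path.isfile(path) and os.access(path, os.R_OK)):
            continue
        print(f"== {path}")
        for line in suspicious_lines(path):
            print(line)
```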

      Quick Tip for Ubuntu and Debian Systems

      After you have chrooted inside Rescue Mode, the following command may help with issues related to your package manager’s configuration:

      dpkg --configure -a
      

      After running this command, try rebooting your Linode into your normal configuration profile. If your issues persist, you may need to investigate and research your system logs further, or consider rebuilding your Linode.

      Diagnose Network Connection Issues

      If you can boot your Linode normally and access the Lish console, you can continue investigating network issues. Networking issues may have two causes:

      • There may be a network routing problem between you and your Linode, or:

      • If the traffic is properly routed, your Linode’s network configuration may be malfunctioning.

      Check for Network Route Problems

      To diagnose routing problems, run and analyze an MTR report from your computer to your Linode. For instructions on how to use MTR, review Linode’s MTR guide. It is useful to run your MTR report for 100 cycles in order to get a good sample size (note that running a report with this many cycles will take more time to complete). This recommended command includes other helpful options:

      mtr -rwbzc 100 -i 0.2 <Linode's IP address>
      

      Once you have generated this report, compare it with the following example scenarios.

      Note

      If you are located in China, and the output of your MTR report shows high packet loss or an improperly configured router, then your IP address may have been blacklisted by the GFW (Great Firewall of China). Linode is not able to change your IP address if it has been blacklisted by the GFW. If you have this issue, review this community post for troubleshooting help.
      • High Packet Loss

        root@localhost:~# mtr --report www.google.com
        HOST: localhost                   Loss%   Snt   Last   Avg  Best  Wrst StDev
        1. 63.247.74.43                   0.0%    10    0.3   0.6   0.3   1.2   0.3
        2. 63.247.64.157                  0.0%    10    0.4   1.0   0.4   6.1   1.8
        3. 209.51.130.213                60.0%    10    0.8   2.7   0.8  19.0   5.7
        4. aix.pr1.atl.google.com        60.0%    10    6.7   6.8   6.7   6.9   0.1
        5. 72.14.233.56                  50.0%    10    7.2   8.3   7.1  16.4   2.9
        6. 209.85.254.247                40.0%    10   39.1  39.4  39.1  39.7   0.2
        7. 64.233.174.46                 40.0%    10   39.6  40.4  39.4  46.9   2.3
        8. gw-in-f147.1e100.net          40.0%    10   39.6  40.5  39.5  46.7   2.2
        

        This example report shows high persistent packet loss starting mid-way through the route at hop 3, which indicates an issue with the router at hop 3. If your report looks like this, open a support ticket with your MTR results for further troubleshooting assistance.

        Note

        If your route only shows packet loss at certain routers, and not through to the end of the route, then it is likely that those routers are purposefully limiting ICMP responses. This is generally not a problem for your connection. Linode’s MTR guide provides more context for packet loss issues.

      • Improperly Configured Router

        root@localhost:~# mtr --report www.google.com
        HOST: localhost                   Loss%   Snt   Last   Avg  Best  Wrst StDev
        1. 63.247.74.43                  0.0%    10    0.3   0.6   0.3   1.2   0.3
        2. 63.247.64.157                 0.0%    10    0.4   1.0   0.4   6.1   1.8
        3. 209.51.130.213                0.0%    10    0.8   2.7   0.8  19.0   5.7
        4. aix.pr1.atl.google.com        0.0%    10    6.7   6.8   6.7   6.9   0.1
        5. ???                           0.0%    10    0.0   0.0   0.0   0.0   0.0
        6. ???                           0.0%    10    0.0   0.0   0.0   0.0   0.0
        7. ???                           0.0%    10    0.0   0.0   0.0   0.0   0.0
        8. ???                           0.0%    10    0.0   0.0   0.0   0.0   0.0
        9. ???                           0.0%    10    0.0   0.0   0.0   0.0   0.0
        10. ???                          0.0%    10    0.0   0.0   0.0   0.0   0.0
        

        If your report shows question marks instead of the hostnames (or IP addresses) of the routers, and if these question marks persist to the end of the route, then the report indicates an improperly configured router. If your report looks like this, open a support ticket with your MTR results for further troubleshooting assistance.

        Note

        If your route only shows question marks for certain routers, and not through to the end of the route, then it is likely that those routers are purposefully blocking ICMP responses. This is generally not a problem for your connection. Linode’s MTR guide provides more information about router configuration issues.
      • Destination Host Networking Improperly Configured

        root@localhost:~# mtr --report www.google.com
        HOST: localhost                   Loss%   Snt   Last   Avg  Best  Wrst StDev
        1. 63.247.74.43                  0.0%    10    0.3   0.6   0.3   1.2   0.3
        2. 63.247.64.157                 0.0%    10    0.4   1.0   0.4   6.1   1.8
        3. 209.51.130.213                0.0%    10    0.8   2.7   0.8  19.0   5.7
        4. aix.pr1.atl.google.com        0.0%    10    6.7   6.8   6.7   6.9   0.1
        5. 72.14.233.56                  0.0%    10    7.2   8.3   7.1  16.4   2.9
        6. 209.85.254.247                0.0%    10   39.1  39.4  39.1  39.7   0.2
        7. 64.233.174.46                 0.0%    10   39.6  40.4  39.4  46.9   2.3
        8. gw-in-f147.1e100.net        100.0%    10    0.0   0.0   0.0   0.0   0.0
        

        If your report shows no packet loss or low packet loss (or non-persistent packet loss isolated to certain routers) until the end of the route, and 100% loss at your Linode, then the report indicates that your Linode’s network interface is not configured correctly. If your report looks like this, move down to confirming network configuration issues from Rescue Mode.

      Note

      If your report does not look like any of the previous examples, read through the MTR guide for other potential scenarios.
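
      The comparison against these scenarios can be roughed out in a script. The sketch below reads an MTR report on standard input and prints a one-line guess at which scenario it resembles. The function name classify_mtr and the result strings are assumptions made for this example, and the heuristic only looks at loss percentages and unknown hops, so treat its output as a hint rather than a diagnosis.

```shell
#!/bin/sh
# classify_mtr: read an mtr --report on stdin and print a rough scenario label.
# Assumes the last seven columns are Loss% Snt Last Avg Best Wrst StDev.
classify_mtr() {
    awk '
        # Hop lines start with a hop number such as "3." or "3.|--".
        $1 ~ /^[0-9]+\./ && NF >= 8 {
            n++
            loss[n] = $(NF-6) + 0            # numeric part of the Loss% column
            unknown[n] = (index($0, "???") > 0)
        }
        END {
            if (n == 0)         { print "no hops found"; exit }
            if (unknown[n])     { print "unknown hops to end of route"; exit }
            if (loss[n] == 100) { print "destination unreachable"; exit }
            if (loss[n] > 0)    { print "loss persists to destination"; exit }
            for (i = 1; i < n; i++)
                if (loss[i] > 0) { print "intermediate loss only"; exit }
            print "no loss"
        }
    '
}

# Example:
# mtr -rwc 100 <Linode IP address> | classify_mtr
```

      In this sketch, "loss persists to destination" corresponds to the high packet loss example, "unknown hops to end of route" to the improperly configured router example, and "destination unreachable" to the destination host example. "intermediate loss only" matches the case where routers are limiting ICMP responses.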

      Confirm Network Configuration Issues from Rescue Mode

      If your MTR indicates a configuration issue within your Linode, you can confirm the problem by using Rescue Mode:

      1. Reboot your Linode into Rescue Mode.

      2. Run another MTR report from your computer to your Linode’s IP address.

      3. As noted earlier, Rescue Mode boots with a working network configuration. If your new MTR report does not show the same packet loss that it did before, this result confirms that your deployment’s network configuration needs to be fixed. Continue to troubleshooting network configuration issues.

      4. If your new MTR report still shows the same packet loss at your Linode, this result indicates issues outside of your configuration. Open a support ticket with your MTR results for further troubleshooting assistance.

      Open a Support Ticket with your MTR Results

      Before opening a support ticket, you should also generate a reverse MTR report. A reverse MTR report is run from your Linode and targets the public IP address of your local network, whether that is your home LAN or public WiFi. To run an MTR from your Linode, log in to your Lish console. To find your local network's public IP address, visit a website like https://www.whatismyip.com/.

      Once you have generated your original MTR and your reverse MTR, open a Linode support ticket, and include your reports and a description of the troubleshooting you’ve performed so far. Linode Support will try to help further diagnose the routing issue.

      Troubleshoot Network Configuration Issues

      If you have determined that your network configuration is the cause of the problem, review the following troubleshooting suggestions. If you make any changes in an attempt to fix the issue, you can test those changes with these steps:

      1. Run another MTR report (or ping the Linode) from your computer to your Linode’s IP.

      2. If the report shows no packet loss but you still can’t access SSH or other services, this result indicates that your network connection is up again, but the other services are still down. Move on to troubleshooting SSH or troubleshooting other services.

      3. If the report still shows the same packet loss, review the remaining troubleshooting suggestions in this section.

      If the recommendations in this section do not resolve your issue, try pasting your diagnostic commands’ output into a search engine or searching for your output in the Linode Community Site to see if anyone else has run into similar issues. If you don’t find any results, you can try asking about your issues in a new post on the Linode Community Site. If it becomes difficult to find a solution, you may need to rebuild your Linode.

      Try Enabling Network Helper

      A quick fix may be to enable Linode’s Network Helper tool. Network Helper will attempt to generate the appropriate static networking configuration for your Linux distribution. After you enable Network Helper, reboot your Linode for the changes to take effect. If Network Helper was already enabled, continue to the remaining troubleshooting suggestions in this section.

      Did You Upgrade to Ubuntu 18.04+ From an Earlier Version?

      If you performed an inline upgrade from an earlier version of Ubuntu to Ubuntu 18.04+, you may need to enable the systemd-networkd service:

      sudo systemctl enable systemd-networkd
      

      Afterwards, reboot your Linode.

      Run Diagnostic Commands

      To learn more about your network configuration, gather output from the diagnostic commands appropriate for your distribution:

      Network diagnostic commands

      • Debian 7, Ubuntu 14.04

        sudo service networking status
        cat /etc/network/interfaces
        ip a
        ip r
        sudo ifdown eth0 && sudo ifup eth0
        
      • Debian 8 and 9, Ubuntu 16.04

        sudo systemctl status networking.service -l
        sudo journalctl -u networking --no-pager | tail -20
        cat /etc/network/interfaces
        ip a
        ip r
        sudo ifdown eth0 && sudo ifup eth0
        
      • Ubuntu 18.04

        sudo networkctl status
        sudo systemctl status systemd-networkd -l
        sudo journalctl -u systemd-networkd --no-pager | tail -20
        cat /etc/systemd/network/05-eth0.network
        ip a
        ip r
        sudo netplan apply
        
      • Arch, CoreOS

        sudo systemctl status systemd-networkd -l
        sudo journalctl -u systemd-networkd --no-pager | tail -20
        cat /etc/systemd/network/05-eth0.network
        ip a
        ip r
        
      • CentOS 6

        sudo service network status
        cat /etc/sysconfig/network-scripts/ifcfg-eth0
        ip a
        ip r
        sudo ifdown eth0 && sudo ifup eth0
        
      • CentOS 7, Fedora

        sudo systemctl status NetworkManager -l
        sudo journalctl -u NetworkManager --no-pager | tail -20
        sudo nmcli
        cat /etc/sysconfig/network-scripts/ifcfg-eth0
        ip a
        ip r
        sudo ifdown eth0 && sudo ifup eth0
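
      If you plan to share this output in a community post or support ticket, it can help to capture it in a single file. The following sketch runs each command you pass to it and appends the output under a heading. The function name collect_net_diag and the output file name are assumptions for illustration; the commands themselves should come from the list for your distribution, and state-changing commands such as ifdown and ifup are best left out.

```shell
#!/bin/sh
# collect_net_diag OUTFILE "command 1" "command 2" ...
# Runs each command and saves its output (stdout and stderr) to OUTFILE.
collect_net_diag() {
    out=$1
    shift
    : > "$out"                          # start with an empty file
    for cmd in "$@"; do
        tool=${cmd%% *}                 # first word of the command line
        {
            echo "### $cmd"
            if command -v "$tool" >/dev/null 2>&1; then
                # Intentionally unquoted so the command line is split into words.
                $cmd 2>&1 || echo "(exit status $?)"
            else
                echo "(not installed)"
            fi
            echo
        } >> "$out"
    done
}

# Example for Ubuntu 18.04 (run as root so that sudo is not needed):
# collect_net_diag ~/net-diag.txt "networkctl status" "ip a" "ip r" \
#     "cat /etc/systemd/network/05-eth0.network"
```

      Commands that are not installed are recorded as such instead of stopping the run, so one call can cover tools that only exist on some distributions.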
        

      Inspect Error Messages

      Your commands’ output may show error messages, including generic errors like Failed to start Raise network interfaces. There may also be more specific errors that appear. Two common errors that can appear are related to Sendmail and iptables:

      Sendmail

      If you find a message similar to the following, it is likely that a broken Sendmail update is at fault:

        
      /etc/network/if-up.d/sendmail: 44: .: Can't open /usr/share/sendmail/dynamic
      run-parts: /etc/network/if-up.d/sendmail exited with return code 2
      
      

      The Sendmail issue can usually be resolved by running the following commands and restarting your Linode:

      sudo mv /etc/network/if-up.d/sendmail ~
      sudo ifdown -a && sudo ifup -a
      

      Note

      Read more about the Sendmail bug here.

      iptables

      Malformed rules in your iptables ruleset can sometimes cause issues for your network scripts. An error similar to the following can appear in your logs if this is the case:

        
      Apr 06 01:03:17 xlauncher ifup[6359]: run-parts: failed to exec /etc/network/if-
      Apr 06 01:03:17 xlauncher ifup[6359]: run-parts: /etc/network/if-up.d/iptables e
      
      

      Run the following command and restart your Linode to resolve this issue:

      sudo mv /etc/network/if-up.d/iptables ~
      

      Please note that your firewall will be down at this point, so you will need to re-enable it manually. Review the Control Network Traffic with iptables guide for help with managing iptables.
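
      Both of the errors above are reported by run-parts and name a script under /etc/network. As a minimal sketch, the filter below reads log output on standard input and lists each hook script that run-parts mentions by its full path. The function name failing_hooks is an assumption for this example, and the pattern only matches complete paths, so truncated log lines may be missed.

```shell
#!/bin/sh
# failing_hooks: list network hook scripts named in run-parts messages on stdin.
failing_hooks() {
    # Keep only run-parts messages, then extract any complete hook path.
    grep 'run-parts:' \
        | grep -oE '/etc/network/if-[a-z-]+\.d/[A-Za-z0-9_.-]+' \
        | sort -u
}

# Example (Debian 8 and 9, Ubuntu 16.04):
# sudo journalctl -u networking --no-pager | failing_hooks
```

      Any path printed here is a candidate for the same treatment as the Sendmail and iptables scripts: move it out of the hook directory, then bring your interfaces back up.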

      Was your Interface Renamed?

      In your commands’ output, you might notice that your eth0 interface is missing and replaced with another name (for example, ens3 or enp0s3). This behavior can be caused by systemd’s Predictable Network Interface Names feature.

      1. Disable the use of Predictable Network Interface Names with these commands:

        sudo ln -s /dev/null /etc/systemd/network/99-default.link
        sudo ln -s /dev/null /etc/udev/rules.d/80-net-setup-link.rules
        
      2. Reboot your Linode for the changes to take effect.
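
      To confirm whether the interface was renamed, before or after the reboot, you can look at the interface names the kernel exposes under /sys/class/net. The function below is a small sketch of that check; the name check_eth0 and the optional directory argument (used here only to make the function easy to test) are assumptions for illustration.

```shell
#!/bin/sh
# check_eth0 [DIR]: report whether an eth0 interface exists.
# DIR defaults to /sys/class/net, where Linux lists one entry per interface.
check_eth0() {
    netdir=${1:-/sys/class/net}
    if [ -e "$netdir/eth0" ]; then
        echo "eth0 present"
    else
        echo "eth0 missing; interfaces found:"
        ls "$netdir"
    fi
}

# Example:
# check_eth0
```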

      Review Firewall Rules

      If your interface is up but your networking is still down, your firewall (which is likely implemented by the iptables software) may be blocking all connections, including basic ping requests. To review your current firewall ruleset, run:

      sudo iptables -L # displays IPv4 rules
      sudo ip6tables -L # displays IPv6 rules
      

      Note

      Your deployment may be running FirewallD or UFW, which are frontend software packages used to more easily manage your iptables rules. Run these commands to find out if you are running either package:

      sudo ufw status
      sudo firewall-cmd --state
      

      Review How to Configure a Firewall with UFW and Introduction to FirewallD on CentOS to learn how to manage and inspect your firewall rules with those packages.

      Firewall rulesets can vary widely. Review our Control Network Traffic with iptables guide to analyze your rules and determine if they are blocking connections.
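
      One quick thing to look for is a built-in chain whose default policy drops traffic. The sketch below reads the output of iptables -S (which prints policies as lines such as -P INPUT DROP) and lists the chains with a DROP policy. The function name drop_policy_chains is an assumption for this example, and it does not inspect individual rules, which can also block connections.

```shell
#!/bin/sh
# drop_policy_chains: print chains whose default policy is DROP.
# Expects the output of "iptables -S" or "ip6tables -S" on stdin.
drop_policy_chains() {
    awk '$1 == "-P" && $3 == "DROP" { print $2 }'
}

# Example:
# sudo iptables -S | drop_policy_chains
```

      If INPUT appears in the output, incoming connections are rejected unless a rule explicitly accepts them, which is worth checking against the services you expect to reach.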

      Disable Firewall Rules

      In addition to analyzing your firewall ruleset, you can also temporarily disable your firewall to test if it is interfering with your connections. Leaving your firewall disabled increases your security risk, so we recommend re-enabling it afterwards with a modified ruleset that will accept your connections. Review Control Network Traffic with iptables for help with this subject.

      1. Create a temporary backup of your current iptables:

        sudo iptables-save > ~/iptables.txt
        
      2. Set the INPUT, FORWARD and OUTPUT packet policies as ACCEPT:

        sudo iptables -P INPUT ACCEPT
        sudo iptables -P FORWARD ACCEPT
        sudo iptables -P OUTPUT ACCEPT
        
      3. Flush the nat table, which is consulted when a packet that creates a new connection is encountered:

        sudo iptables -t nat -F
        
      4. Flush the mangle table that is used for specialized packet alteration:

        sudo iptables -t mangle -F
        
      5. Flush all the chains in the filter table (the default table when -t is not given):

        sudo iptables -F
        
      6. Delete every user-defined (non-built-in) chain in the filter table:

        sudo iptables -X
        
      7. Repeat these steps with the ip6tables command to flush your IPv6 rules. Be sure to assign a different name to the IPv6 rules file (e.g. ~/ip6tables.txt).
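
      Because the IPv4 and IPv6 steps are identical apart from the command name, they can be wrapped in one function. The sketch below covers steps 2 through 6; the function name open_firewall and its optional dry-run argument are assumptions for illustration. Take the backup from step 1 separately before running it.

```shell
#!/bin/sh
# open_firewall TOOL [RUNNER]
#   TOOL   is iptables or ip6tables.
#   RUNNER is optional; pass "echo" to print the commands instead of running them.
open_firewall() {
    tool=$1
    run=${2:-}
    # Step 2: accept everything by default on the built-in chains.
    for chain in INPUT FORWARD OUTPUT; do
        $run sudo "$tool" -P "$chain" ACCEPT
    done
    $run sudo "$tool" -t nat -F        # step 3: flush the nat table
    $run sudo "$tool" -t mangle -F     # step 4: flush the mangle table
    $run sudo "$tool" -F               # step 5: flush the filter table
    $run sudo "$tool" -X               # step 6: delete user-defined chains
}

# Dry run for IPv6 (prints the seven commands without changing anything):
# open_firewall ip6tables echo
```

      When you have finished testing, the rules saved in step 1 can be loaded again with iptables-restore < ~/iptables.txt (and ip6tables-restore for the IPv6 file).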

      Next Steps

      If you are able to restore basic networking, but you still can’t access SSH or other services, refer to the Troubleshooting SSH or Troubleshooting Web Servers, Databases, and Other Services guides.

      If your connection issues were the result of maintenance performed by Linode, review the Reboot Survival Guide for methods to prepare a Linode for any future maintenance.


      This guide is published under a CC BY-ND 4.0 license.


