One place for hosting & domains


      How to Use Ansible to Install and Set Up Apache on Ubuntu 18.04


      Server automation now plays an essential role in systems administration, due to the disposable nature of modern application environments. Configuration management tools such as Ansible are typically used to streamline the process of automating server setup by establishing standard procedures for new servers while also reducing human error associated with manual setups.

      Ansible offers a simple architecture that doesn’t require special software to be installed on nodes. It also provides a robust set of features and built-in modules which facilitate writing automation scripts.

      This guide explains how to use Ansible to automate the steps contained in our guide on How To Install the Apache Web Server on Ubuntu 18.04. The Apache HTTP server is the most widely-used web server in the world. It provides many powerful features including dynamically loadable modules, robust media support, and extensive integration with other popular software.


      In order to execute the automated setup provided by the playbook we’re discussing in this guide, you’ll need:

      Before proceeding, you first need to make sure your Ansible control node is able to connect and execute commands on your Ansible host(s). For a connection test, please check step 3 of How to Install and Configure Ansible on Ubuntu 18.04.

      What Does this Playbook Do?

      This Ansible playbook provides an alternative to manually running through the procedure outlined in our guide on How To Install the Apache Web Server on Ubuntu 18.04.

      Running this playbook will perform the following actions on your Ansible hosts:

      1. Install aptitude, which is preferred by Ansible as an alternative to the apt package manager.
      2. Install Apache.
      3. Create a custom document root folder for the new Apache VirtualHost and set up a test page.
      4. Enable the new Apache VirtualHost.
      5. Disable the default Apache website when the variable disable_default is set to true.
      6. Set up UFW to allow HTTP traffic on the configured port (80 by default).

      Once the playbook has finished running, you will have a web server running on your target node, based on the options you defined within your configuration variables.

      How to Use this Playbook

      The first thing we need to do is obtain the Apache playbook and its dependencies from the do-community/ansible-playbooks repository. We need to clone this repository to a local folder inside the Ansible Control Node.

      In case you have cloned this repository before while following a different guide, access your existing ansible-playbooks copy and run a git pull command to make sure you have updated contents:

      • cd ~/ansible-playbooks
      • git pull

      If this is your first time using the do-community/ansible-playbooks repository, you should start by cloning the repository to your home folder with:

      • cd ~
      • git clone
      • cd ansible-playbooks

      The files we’re interested in are located inside the apache_ubuntu1804 folder, which has the following structure:

      ├── files
      │   ├── apache.conf.j2
      │   └── index.html.j2
      ├── vars
      │   └── default.yml
      ├── playbook.yml

      Here is what each of these files are:

      • files/apache.conf.j2: Template file for setting up the Apache Virtual Host.
      • files/index.html.j2: Template file for setting up a test page on the web server’s root directory.
      • vars/default.yml: Variable file for customizing playbook settings.
      • playbook.yml: The playbook file, containing the tasks to be executed on the remote server(s).
      • A text file containing information about this playbook.

      We’ll edit the playbook’s variable file to customize a few options. Access the apache_ubuntu1804 directory and open the vars/default.yml file using your command line editor of choice:

      • cd apache_ubuntu1804
      • nano vars/default.yml

      This file contains a few variables that require your attention:


      app_user: "sammy"
      http_host: "your_domain"
      http_conf: "your_domain.conf"
      http_port: "80"
      disable_default: true

      The following list contains a brief explanation of each of these variables and how you might want to change them:

      • app_user: A remote non-root user on the Ansible host that will be set as the owner of the application files.
      • http_host: Your domain name.
      • http_conf: The name of the configuration file that will be created within Apache.
      • http_port: HTTP port for this virtual host, where 80 is the default.
      • disable_default: Whether or not to disable the default website that comes with Apache.

      Once you’re done updating the variables inside vars/default.yml, save and close this file. If you used nano, do so by pressing CTRL + X, Y, then ENTER.

      You’re now ready to run this playbook on one or more servers. Most playbooks are configured to be executed on every server in your inventory, by default. We can use the -l flag to make sure that only a subset of servers, or a single server, is affected by the playbook. We can also use the -u flag to specify which user on the remote server we’re using to connect and execute the playbook commands on the remote hosts.

      To execute the playbook only on server1, connecting as root, you can use the following command:

      • ansible-playbook playbook.yml -l server1 -u root

      You will get output similar to this:


      PLAY [all] ***************************************************************************************************************************** TASK [Gathering Facts] ***************************************************************************************************************** ok: [server1] TASK [Install prerequisites] *********************************************************************************************************** ok: [server1] => (item=aptitude) TASK [Install Apache] ****************************************************************************************************************** changed: [server1] TASK [Create document root] ************************************************************************************************************ changed: [server1] TASK [Copy index test page] ************************************************************************************************************ changed: [server1] TASK [Set up Apache virtualhost] ******************************************************************************************************* changed: [server1] TASK [Enable new site] ***************************************************************************************************************** changed: [server1] TASK [Disable default Apache site] ***************************************************************************************************** changed: [server1] TASK [UFW - Allow HTTP on port 80] ***************************************************************************************************** changed: [server1] RUNNING HANDLER [Reload Apache] ******************************************************************************************************** changed: [server1] PLAY RECAP ***************************************************************************************************************************** server1 : ok=10 changed=8 unreachable=0 failed=0 skipped=0 rescued=0 ignored=0

      Note: For more information on how to run Ansible playbooks, check our Ansible Cheat Sheet Guide.

      When the playbook is finished running, go to your web browser and access the host or IP address of the server, as configured in the playbook variables:


      You will see a page like this:

      it works page

      That means the automation was fully executed on your server, and Apache is now ready to serve static HTML pages and assets placed in the document root directory that you’ve set up within the playbook configuration variables.

      The Playbook Contents

      You can find the Apache server setup featured in this tutorial in the apache_ubuntu1804 folder inside the DigitalOcean Community Playbooks repository. To copy or download the script contents directly, click the Raw button towards the top of each script.

      The full contents of the playbook as well as its associated files are also included here for your convenience.


      The default.yml variable file contains values that will be used within the playbook tasks, such as the HTTP port and domain name to configure within your Apache VirtualHost.


      app_user: "sammy"
      http_host: "your_domain"
      http_conf: "your_domain.conf"
      http_port: "80"
      disable_default: true


      The apache.conf.j2 file is a Jinja 2 template file that configures a new Apache VirtualHost. The variables used within this template are defined in the vars/default.yml variable file.


      <VirtualHost *:{{ http_port }}>
         ServerAdmin webmaster@localhost
         ServerName {{ http_host }}
         ServerAlias www.{{ http_host }}
         DocumentRoot /var/www/{{ http_host }}
         ErrorLog ${APACHE_LOG_DIR}/error.log
         CustomLog ${APACHE_LOG_DIR}/access.log combined


      The index.html.j2 file is another Jinja template, used to set up a test HTML page in the document root of the newly configured Apache server.


             <title>Welcome to {{ http_host }} !</title>
             <h1>Success! The {{ http_host }} virtual host is working!</h1>


      The playbook.yml file is where all tasks from this setup are defined. It starts by defining the group of servers that should be the target of this setup (all), after which it uses become: true to define that tasks should be executed with privilege escalation (sudo) by default. Then, it includes the vars/default.yml variable file to load configuration options.


      - hosts: all
        become: true
          - vars/default.yml
          - name: Install prerequisites
            apt: name={{ item }} update_cache=yes state=latest force_apt_get=yes
            loop: [ 'aptitude' ]
          - name: Install Apache
            apt: name=apache2 update_cache=yes state=latest
          - name: Create document root
              path: "/var/www/{{ http_host }}"
              state: directory
              owner: "{{ app_user }}"
              mode: '0755'
          - name: Copy index test page
              src: "files/index.html.j2"
              dest: "/var/www/{{ http_host }}/index.html"
          - name: Set up Apache virtuahHost
              src: "files/apache.conf.j2"
              dest: "/etc/apache2/sites-available/{{ http_conf }}"
          - name: Enable new site
            shell: /usr/sbin/a2ensite {{ http_conf }}
            notify: Reload Apache
          - name: Disable default Apache site
            shell: /usr/sbin/a2dissite 000-default.conf
            when: disable_default
            notify: Reload Apache
          - name: "UFW - Allow HTTP on port {{ http_port }}"
              rule: allow
              port: "{{ http_port }}"
              proto: tcp
          - name: Reload Apache
              name: apache2
              state: reloaded
          - name: Restart Apache
              name: apache2
              state: restarted

      Feel free to modify these files to best suit your individual needs within your own workflow.


      In this guide, we used Ansible to automate the process of installing and configuring Apache on Ubuntu 18.04.

      If you’d like to include other tasks in this playbook to further customize your server setup, please refer to our introductory Ansible guide Configuration Management 101: Writing Ansible Playbooks.

      Source link

      How To Install and Use ClickHouse on Debian 10

      The author selected the Free and Open Source Fund to receive a donation as part of the Write for DOnations program.


      ClickHouse is an open-source, column-oriented analytics database created by Yandex for OLAP and big data use cases. ClickHouse’s support for real-time query processing makes it suitable for applications that require sub-second analytical results. ClickHouse’s query language is a dialect of SQL that enables powerful declarative querying capabilities while offering familiarity and a smaller learning curve for the end user.

      Column-oriented databases store records in blocks grouped by columns instead of rows. By not loading data for columns absent in the query, column-oriented databases spend less time reading data while completing queries. As a result, these databases can compute and return results much faster than traditional row-based systems for certain workloads, such as OLAP.

      Online Analytics Processing (OLAP) systems allow for organizing large amounts of data and performing complex queries. They are capable of managing petabytes of data and returning query results quickly. In this way, OLAP is useful for work in areas like data science and business analytics.

      In this tutorial, you’ll install the ClickHouse database server and client on your machine. You’ll use the DBMS for typical tasks and optionally enable remote access from another server so that you’ll be able to connect to the database from another machine. Then you’ll test ClickHouse by modeling and querying example website-visit data.


      • One Debian 10 with a sudo enabled non-root user and firewall setup. You can follow the initial server setup tutorial to create the user and set up the firewall.
      • (Optional) A secondary Debian 10 with a sudo enabled non-root user and firewall setup. You can follow the initial server setup tutorial.

      Step 1 — Installing ClickHouse

      In this section, you will install the ClickHouse server and client programs using apt.

      First, SSH into your server by running:

      dirmngr is a server for managing certificates and keys. It is required for adding and verifying remote repository keys, install it by running:

      Yandex maintains an APT repository that has the latest version of ClickHouse. Add the repository’s GPG key so that you’ll be able to securely download validated ClickHouse packages:

      • sudo apt-key adv --keyserver --recv E0C56BD4

      You will see output similar to the following:


      Executing: /tmp/apt-key-gpghome.JkkcKnBAFY/ --keyserver --recv E0C56BD4 gpg: key C8F1E19FE0C56BD4: public key "ClickHouse Repository Key <>" imported gpg: Total number processed: 1 gpg: imported: 1

      The output confirms it has successfully verified and added the key.

      Add the repository to your APT repositories list by executing:

      • echo "deb main/" | sudo tee /etc/apt/sources.list.d/clickhouse.list

      Here you’ve piped the output of echo to sudo tee so that this output can print to a root-owned file.

      Now, run apt update to update your packages:

      The clickhouse-server and clickhouse-client packages will now be available for installation.

      As of ClickHouse version 19.13.3, certain OpenSSL 1.1.1 configurations such as MinProtocol and CipherVersion are not read correctly. In order to workaround this incompatibility, modify the OpenSSL config file and comment out the ssl_conf = ssl_sect line in /etc/ssl/openssl.cnf.

      Edit the configuration file by executing:

      • sudo nano /etc/ssl/openssl.cnf

      Then comment out the line containing ssl_conf = ssl_sect, so it looks like the following file:


      tsa_name        = yes   # Must the TSA name be included in the reply?
                      # (optional, default: no)
      ess_cert_id_chain   = no    # Must the ESS cert id chain be included?
                      # (optional, default: no)
      ess_cert_id_alg     = sha1  # algorithm to compute certificate
                      # identifier (optional, default: sha1)
      #ssl_conf = ssl_sect

      Now that the OpenSSL config has been patched, you’re ready to install the ClickHouse server and client packages. Install them with:

      • sudo apt install clickhouse-server clickhouse-client

      During the installation, you will be asked to set a password for the default ClickHouse user.

      You’ve installed the ClickHouse server and client successfully. You’re now ready to start the database service and ensure that it’s running correctly.

      Step 2 — Starting the Service

      The clickhouse-server package that you installed in the previous section creates a systemd service, which performs actions such as starting, stopping, and restarting the database server. systemd is an init system for Linux to initialize and manage services. In this section you’ll start the service and verify that it is running successfully.

      Start the clickhouse-server service by running:

      • sudo service clickhouse-server start

      The previous command will not display any output. To verify that the service is running successfully, execute:

      • sudo service clickhouse-server status

      You’ll see output similar to the following:


      ● clickhouse-server.service - ClickHouse Server (analytic DBMS for big data) Loaded: loaded (/etc/systemd/system/clickhouse-server.service; enabled; vendor preset: enabled) Active: active (running) since Sat 2018-12-22 07:23:20 UTC; 1h 9min ago Main PID: 27101 (ClickHouse-serv) Tasks: 34 (limit: 1152) CGroup: /system.slice/ClickHouse-server.service └─27101 /usr/bin/ClickHouse-server --config=/etc/ClickHouse-server/config.xml

      The output notes that the server is running.

      You have successfully started the ClickHouse server and will now be able to use the clickhouse-client CLI program to connect to the server.

      Step 3 — Creating Databases and Tables

      In ClickHouse, you can create and delete databases by executing SQL statements directly in the interactive database prompt. Statements consist of commands following a particular syntax that tell the database server to perform a requested operation along with any data required. You create databases by using the CREATE DATABASE table_name syntax. To create a database, first start a client session by running the following command:

      • clickhouse-client --password

      You will be asked to enter the password you had set during the installation—enter it to successfully start the client session.

      The previous command will log you in to the client prompt where you can run ClickHouse SQL statements to perform actions such as:

      • Creating, updating, and deleting databases, tables, indexes, partitions, and views.

      • Executing queries to retrieve data that is optionally filtered and grouped using various conditions.

      In this step, with the ClickHouse client ready for inserting data, you’re going to create a database and table. For the purposes of this tutorial, you’ll create a database named test, and inside that you’ll create a table named visits that tracks website-visit durations.

      Now that you’re inside the ClickHouse command prompt, create your test database by executing:

      You’ll see the following output that shows that you have created the database:


      CREATE DATABASE test Ok. 0 rows in set. Elapsed: 0.003 sec.

      A ClickHouse table is similar to tables in other relational databases; it holds a collection of related data in a structured format. You can specify columns along with their types, add rows of data, and execute different kinds of queries on tables.

      The syntax for creating tables in ClickHouse follows this example structure:

      CREATE TABLE table_name
          column_name1 column_type [options],
          column_name2 column_type [options],
      ) ENGINE = engine

      The table_name and column_name values can be any valid ASCII identifiers. ClickHouse supports a wide range of column types; some of the most popular are:

      • UInt64: used for storing integer values in the range 0 to 18446744073709551615.

      • Float64: used for storing floating point numbers such as 2039.23, 10.5, etc.

      • String: used for storing variable length characters. It does not require a max-length attribute since it can store arbitrary lengths.

      • Date: used for storing dates that follow the YYYY-MM-DD format.

      • DateTime: used for storing dates coupled with time and follows the YYYY-MM-DD HH:MM:SS format.

      After the column definitions, you specify the engine used for the table. In ClickHouse, Engines determine the physical structure of the underlying data, the table’s querying capabilities, its concurrent access modes, and support for indexes. Different engine types are suitable for different application requirements. The most commonly used and widely applicable engine type is MergeTree.

      Now that you have an overview of table creation, you’ll create a table. Start by confirming the database you’ll be modifying:

      You will see the following output showing that you have switched to the test database from the default database:


      USE test Ok. 0 rows in set. Elapsed: 0.001 sec.

      The remainder of this guide will assume that you are executing statements within this database’s context.

      Create your visits table by running this command:

      • CREATE TABLE visits (
      • id UInt64,
      • duration Float64,
      • url String,
      • created DateTime
      • ) ENGINE = MergeTree()
      • PRIMARY KEY id
      • ORDER BY id;

      Here’s a breakdown of what the command does. You create a table named visits that has four columns:

      • id: The primary key column. Similarly to other RDBMS systems, a primary key column in ClickHouse uniquely identifies a row; each row should have a unique value for this column.

      • duration: A float column used to store the duration of each visit in seconds. float columns can store decimal values such as 12.50.

      • url: A string column that stores the URL visited, such as

      • created: A date and time column that tracks when the visit occurred.

      After the column definitions, you specify MergeTree as the storage engine for the table. The MergeTree family of engines is recommended for production databases due to its optimized support for large real-time inserts, overall robustness, and query support. Additionally, MergeTree engines support sorting of rows by primary key, partitioning of rows, and replicating and sampling data.

      If you intend to use ClickHouse for archiving data that is not queried often or for storing temporary data, you can use the Log family of engines to optimize for that use-case.

      After the column definitions, you’ll define other table-level options. The PRIMARY KEY clause sets id as the primary key column and the ORDER BY clause will store values sorted by the id column. A primary key uniquely identifies a row and is used for efficiently accessing a single row and efficient colocation of rows.

      On executing the create statement, you will see the following output:


      CREATE TABLE visits ( id UInt64, duration Float64, url String, created DateTime ) ENGINE = MergeTree() PRIMARY KEY id ORDER BY id Ok. 0 rows in set. Elapsed: 0.010 sec.

      In this section, you’ve created a database and a table to track website-visit data. In the next step, you’ll insert data into the table, update existing data, and delete that data.

      Step 4 — Inserting, Updating, and Deleting Data and Columns

      In this step, you’ll use your visits table to insert, update, and delete data. The following command is an example of the syntax for inserting rows into a ClickHouse table:

      INSERT INTO table_name VALUES (column_1_value, column_2_value, ....);

      Now, insert a few rows of example website-visit data into your visits table by running each of the following statements:

      • INSERT INTO visits VALUES (1, 10.5, '', '2019-01-01 00:01:01');
      • INSERT INTO visits VALUES (2, 40.2, '', '2019-01-03 10:01:01');
      • INSERT INTO visits VALUES (3, 13, '', '2019-01-03 12:01:01');
      • INSERT INTO visits VALUES (4, 2, '', '2019-01-04 02:01:01');

      You’ll see the following output repeated for each insert statement.


      INSERT INTO visits VALUES Ok. 1 rows in set. Elapsed: 0.004 sec.

      The output for each row shows that you’ve inserted it successfully into the visits table.

      Now you’ll add an additional column to the visits table. When adding or deleting columns from existing tables, ClickHouse supports the ALTER syntax.

      For example, the basic syntax for adding a column to a table is as follows:

      ALTER TABLE table_name ADD COLUMN column_name column_type;

      Add a column named location that will store the location of the visits to a website by running the following statement:

      • ALTER TABLE visits ADD COLUMN location String;

      You’ll see output similar to the following:


      ALTER TABLE visits ADD COLUMN location String Ok. 0 rows in set. Elapsed: 0.014 sec.

      The output shows that you have added the location column successfully.

      As of version 19.13.3, ClickHouse doesn’t support updating and deleting individual rows of data due to implementation constraints. ClickHouse has support for bulk updates and deletes, however, and has a distinct SQL syntax for these operations to highlight their non-standard usage.

      The following syntax is an example for bulk updating rows:

      ALTER TABLE table_name UPDATE  column_1 = value_1, column_2 = value_2 ...  WHERE  filter_conditions;

      You’ll run the following statement to update the url column of all rows that have a duration of less than 15. Enter it into the database prompt to execute:

      • ALTER TABLE visits UPDATE url = '' WHERE duration < 15;

      The output of the bulk update statement will be as follows:


      ALTER TABLE visits UPDATE url = '' WHERE duration < 15 Ok. 0 rows in set. Elapsed: 0.003 sec.

      The output shows that your update query completed successfully. The 0 rows in set in the output denotes that the query did not return any rows; this will be the case for any update and delete queries.

      The example syntax for bulk deleting rows is similar to updating rows and has the following structure:

      ALTER TABLE table_name DELETE WHERE filter_conditions;

      To test deleting data, run the following statement to remove all rows that have a duration of less than 5:

      • ALTER TABLE visits DELETE WHERE duration < 5;

      The output of the bulk delete statement will be similar to:


      ALTER TABLE visits DELETE WHERE duration < 5 Ok. 0 rows in set. Elapsed: 0.003 sec.

      The output confirms that you have deleted the rows with a duration of less than five seconds.

      To delete columns from your table, the syntax would follow this example structure:

      ALTER TABLE table_name DROP COLUMN column_name;

      Delete the location column you added previously by running the following:

      • ALTER TABLE visits DROP COLUMN location;

      The DROP COLUMN output confirming that you have deleted the column will be as follows:


      ALTER TABLE visits DROP COLUMN location String Ok. 0 rows in set. Elapsed: 0.010 sec.

      Now that you’ve successfully inserted, updated, and deleted rows and columns in your visits table, you’ll move on to query data in the next step.

      Step 5 — Querying Data

      ClickHouse’s query language is a custom dialect of SQL with extensions and functions suited for analytics workloads. In this step, you’ll run selection and aggregation queries to retrieve data and results from your visits table.

      Selection queries allow you to retrieve rows and columns of data filtered by conditions that you specify, along with options such as the number of rows to return. You can select rows and columns of data using the SELECT syntax. The basic syntax for SELECT queries is:

      SELECT func_1(column_1), func_2(column_2) FROM table_name WHERE filter_conditions row_options;

      Execute the following statement to retrieve url and duration values for rows where the url is

      • SELECT url, duration FROM visits WHERE url = '' LIMIT 2;

      You will see the following output:


      SELECT url, duration FROM visits WHERE url = '' LIMIT 2 ┌─url─────────────────┬─duration─┐ │ │ 10.5 │ └─────────────────────┴──────────┘ ┌─url─────────────────┬─duration─┐ │ │ 13 │ └─────────────────────┴──────────┘ 2 rows in set. Elapsed: 0.013 sec.

      The output has returned two rows that match the conditions you specified. Now that you’ve selected values, you can move on to executing aggregation queries.

      Aggregation queries are queries that operate on a set of values and return single output values. In analytics databases, these queries are run frequently and are well optimized by the database. Some aggregate functions supported by ClickHouse are:

      • count: returns the count of rows matching the conditions specified.

      • sum: returns the sum of selected column values.

      • avg: returns the average of selected column values.

      Some ClickHouse-specific aggregate functions include:

      • uniq: returns an approximate number of distinct rows matched.

      • topK: returns an array of the most frequent values of a specific column using an approximation algorithm.

      To demonstrate the execution of aggregation queries, you’ll calculate the total duration of visits by running the sum query:

      • SELECT SUM(duration) FROM visits;

      You will see output similar to the following:


      SELECT SUM(duration) FROM visits ┌─SUM(duration)─┐ │ 63.7 │ └───────────────┘ 1 rows in set. Elapsed: 0.010 sec.

      Now, calculate the top two URLs by executing:

      • SELECT topK(2)(url) FROM visits;

      You will see output similar to the following:


      SELECT topK(2)(url) FROM visits ┌─topK(2)(url)──────────────────────────────────┐ │ ['',''] │ └───────────────────────────────────────────────┘ 1 rows in set. Elapsed: 0.010 sec.

      Now that you have successfully queried your visits table, you’ll delete tables and databases in the next step.

      Step 6 — Deleting Tables and Databases

      In this section, you’ll delete your visits table and test database.

      The syntax for deleting tables follows this example:

      DROP TABLE table_name;

      To delete the visits table, run the following statement:

      You will see the following output declaring that you’ve deleted the table successfully:


      DROP TABLE visits Ok. 0 rows in set. Elapsed: 0.005 sec.

      You can delete databases using the DROP database table_name syntax. To delete the test database, execute the following statement:

      The resulting output shows that you’ve deleted the database successfully.


      DROP DATABASE test Ok. 0 rows in set. Elapsed: 0.003 sec.

      You’ve deleted tables and databases in this step. Now that you’ve created, updated, and deleted databases, tables, and data in your ClickHouse instance, you’ll enable remote access to your database server in the next section.

      Step 7 — Setting Up Firewall Rules (Optional)

      If you intend to only use ClickHouse locally with applications running on the same server, or do not have a firewall enabled on your server, you don’t need to complete this section. If instead, you’ll be connecting to the ClickHouse database server remotely, you should follow this step.

      Currently your server has a firewall enabled that disables your public IP address accessing all ports. You’ll complete the following two steps to allow remote access:

      • Modify ClickHouse’s configuration and allow it to listen on all interfaces.

      • Add a firewall rule allowing incoming connections to port 8123, which is the HTTP port that the ClickHouse server runs.

      If you are inside the database prompt, exit it by typing CTRL+D.

      Edit the configuration file by executing:

      • sudo nano /etc/clickhouse-server/config.xml

      Then uncomment the line containing <!-- <listen_host></listen_host> -->, like the following file:


          <!-- Listen specified host. use :: (wildcard IPv6 address), if you want to accept connections both with IPv4 and IPv6 from everywhere. -->
          <!-- <listen_host>::</listen_host> -->
          <!-- Same for hosts with disabled ipv6: -->
          <!-- Default values - try listen localhost on ipv4 and ipv6: -->

      Save the file and exit. For the new configuration to apply restart the service by running:

      • sudo service clickhouse-server restart

      You won’t see any output from this command. ClickHouse’s server listens on port 8123 for HTTP connections and port 9000 for connections from clickhouse-client. Allow access to both ports for your second server’s IP address with the following command:

      • sudo ufw allow from second_server_ip/32 to any port 8123
      • sudo ufw allow from second_server_ip/32 to any port 9000

      You will see the following output for both commands that shows that you’ve enabled access to both ports:


      Rule added

      ClickHouse will now be accessible from the IP that you added. Feel free to add additional IPs such as your local machine’s address if required.

      To verify that you can connect to the ClickHouse server from the remote machine, first follow the steps in Step 1 of this tutorial on the second server and ensure that you have the clickhouse-client installed on it.

      Now that you have logged in to the second server, start a client session by executing:

      • clickhouse-client --host your_server_ip --password

      You will see the following output that shows that you have connected successfully to the server:


      ClickHouse client version (official build). Password for user (default): Connecting to your_server_ip:9000 as user default. Connected to ClickHouse server version 19.13.3 revision 54425. hostname 🙂

      In this step, you’ve enabled remote access to your ClickHouse database server by adjusting your firewall rules.


      You have successfully set up a ClickHouse database instance on your server and created a database and table, added data, performed queries, and deleted the database. Within ClickHouse’s documentation you can read about their benchmarks against other open-source and commercial analytics databases and general reference documents.

      Further features ClickHouse offers include distributed query processing across multiple servers to improve performance and protect against data loss by storing data over different shards.

      Source link

      How To Install and Use TimescaleDB on Ubuntu 18.04

      The author selected the Computer History Museum to receive a donation as part of the Write for DOnations program.


      Many applications, such as monitoring systems and data collection systems, accumulate data for further analysis. These analyses often look at the way a piece of data or a system changes over time. In these instances, data is represented as a time series, with every data point accompanied by a timestamp. An example would look like this:

      2019-11-01 09:00:00    server.cpu.1    0.9
      2019-11-01 09:00:00    server.cpu.15   0.8
      2019-11-01 09:01:00    server.cpu.1    0.9
      2019-11-01 09:01:00    server.cpu.15   0.8

      Managing time series data has become an essential skill with the rise of the Internet of Things (IoT) and Industrial Internet of Things. There are more and more devices that collect various time series information: fitness trackers, smart watches, home weather stations, and various sensors, to name a few. These devices collect a lot of information, and all this data must be stored somewhere.

      Classic relational databases are most often used to store data, but they don’t always fit when it comes to the huge data volumes of time series. When you need to process a large amount of time series data, relational databases can be too slow. Because of this, specially optimized databases, called NoSQL databases, have been created to avoid the problems of relational databases.

      TimescaleDB is an open-source database optimized for storing time series data. It is implemented as an extension of PostgreSQL and combines the ease-of-use of relational databases and the speed of NoSQL databases. As a result, it allows you to use PostgreSQL for both storing business data and time series data in one place.

      By following this tutorial, you’ll set up TimescaleDB on Ubuntu 18.04, configure it, and learn how to work with it. You’ll create time series databases and make simple queries. Finally, you’ll see how to get rid of unnecessary data.


      To follow this tutorial, you will need:

      Step 1 — Installing TimescaleDB

      TimescaleDB is not available in Ubuntu’s default package repositories, so in this step you will install it from the TimescaleDB Personal Packages Archive (PPA).

      First, add Timescale’s APT repository:

      • sudo add-apt-repository ppa:timescale/timescaledb-ppa

      Confirm this action by hitting the ENTER key.

      Next, refresh your APT cache to update your package lists:

      You can now proceed with the installation. This tutorial uses PostgreSQL version 10; if you are using a different version of PostgreSQL (11 or 9.6, for example), replace the value in the following command and run it:

      • sudo apt install timescaledb-postgresql-10

      TimescaleDB is now installed and ready to be used. Next, you will turn it on and adjust some of the settings associated with it in the PostgreSQL configuration file to optimize the database.

      Step 2 — Configuring TimescaleDB

      The TimescaleDB module works fine with the default PostgreSQL configuration settings, but to improve performance and make better use of processor, memory, and disk resources, the developers of TimescaleDB suggest configuring some individual parameters. This can be done automatically with the timescaledb-tune tool or by manually editing your server’s postgresql.conf file.

      In this tutorial, you will use the timescaledb-tune tool, which will read the postgresql.conf file and interactively suggest making changes.

      Run the following command to start the configuration wizard:

      First, you will be asked to confirm the path to the PostgreSQL configuration file:


      Using postgresql.conf at this path: /etc/postgresql/10/main/postgresql.conf Is this correct? [(y)es/(n)o]:

      The utility automatically detects the path to the configuration file, so confirm this by entering y:


      ... Is this correct? [(y)es/(n)o]: y Writing backup to: /tmp/timescaledb_tune.backup201911181111

      Next, you will be prompted to change the shared_preload_libraries variable to preload the TimescaleDB module upon starting the PostgreSQL server:


      shared_preload_libraries needs to be updated Current: #shared_preload_libraries = '' Recommended: shared_preload_libraries = 'timescaledb' Is this okay? [(y)es/(n)o]:

      shared_preload_libraries accepts a comma separated list of modules as a value, designating which modules PostgreSQL should load before starting the database server. Making this change will add the timescaledb module to that list.

      Note: If a library specified by shared_preload_libraries is not found, the database server will fail to start. Keep this in mind when debugging applications that make use of shared_preload_libraries. For more information on this, see this PostgresqlCO.NF article on shared_preload_libraries.

      Enable the TimescaleDB module by typing y at this prompt and pressing ENTER:


      ... Is this okay? [(y)es/(n)o]: y success: shared_preload_libraries will be updated

      Based on the characteristics of your server and the PostgreSQL version, the script will then offer to tune your settings. Press y to start the tuning process:


      Tune memory/parallelism/WAL and other settings? [(y)es/(n)o]: y Recommendations based on 7.79 GB of available memory and 4 CPUs for PostgreSQL 10 Memory settings recommendations Current: shared_buffers = 128MB #effective_cache_size = 4GB #maintenance_work_mem = 64MB #work_mem = 4MB Recommended: shared_buffers = 1994MB effective_cache_size = 5982MB maintenance_work_mem = 1021001kB work_mem = 5105kB Is this okay? [(y)es/(s)kip/(q)uit]:

      timescaledb-tune will automatically detect the servers’s available memory and calculate recommended values for a number of settings. shared_buffers, for example, determines the amount of memory allocated for caching data. By default this setting is relatively low to account for a wider range of platforms, so timescaledb-tune has suggested increasing the value from 128MB to 1994MB, taking better advantage of resources by making more room to store cached information like repeated queries. The work_mem variable has been increased as well to allow for more complicated sorts.

      If you would like to learn more about the process of tuning memory settings for PostgreSQL, see the Tuning Your PostgreSQL Server article on the PostgreSQL wiki.

      Enter y to accept the values:


      ... Is this okay? [(y)es/(s)kip/(q)uit]: y success: memory settings will be updated

      At this point, if your server has multiple CPUs, you will find the recommendations for parallelism settings. These settings determine how multiple CPUs can make simultaneous queries in parallel to scan databases and return the requested data quicker.

      Those with multiple CPUs will encounter recommendations like this:


      Parallelism settings recommendations Current: missing: timescaledb.max_background_workers #max_worker_processes = 8 #max_parallel_workers_per_gather = 2 #max_parallel_workers = 8 Recommended: timescaledb.max_background_workers = 8 max_worker_processes = 13 max_parallel_workers_per_gather = 1 max_parallel_workers = 2 Is this okay? [(y)es/(s)kip/(q)uit]:

      These settings regulate the number of workers, which process requests and background tasks. You can learn more about these settings from the TimescaleDB and PostgreSQL documentation.

      Type y then ENTER to accept these settings:


      ... Is this okay? [(y)es/(s)kip/(q)uit]: y success: parallelism settings will be updated

      Next, you will find recommendations for Write Ahead Log (WAL):


      WAL settings recommendations Current: #wal_buffers = -1 #min_wal_size = 80MB #max_wal_size = 1GB Recommended: wal_buffers = 16MB min_wal_size = 4GB max_wal_size = 8GB Is this okay? [(y)es/(s)kip/(q)uit]:

      WAL is a logging method in which PostgreSQL logs changes to data files before the changes are made to the database. By prioritizing an up-to-date record of data changes, WAL ensures that you can reconstruct your database in the event of a crash. In this way, it preserves data integrity. However, the default settings can cause inefficient input/output (I/O) operations that slow down write performance. To fix this, type and enter y:


      ... Is this okay? [(y)es/(s)kip/(q)uit]: y success: WAL settings will be updated

      You’ll now find some miscellaneous recommendations:


      Miscellaneous settings recommendations Current: #default_statistics_target = 100 #random_page_cost = 4.0 #checkpoint_completion_target = 0.5 #max_locks_per_transaction = 64 #autovacuum_max_workers = 3 #autovacuum_naptime = 1min #effective_io_concurrency = 1 Recommended: default_statistics_target = 500 random_page_cost = 1.1 checkpoint_completion_target = 0.9 max_locks_per_transaction = 64 autovacuum_max_workers = 10 autovacuum_naptime = 10 effective_io_concurrency = 200 Is this okay? [(y)es/(s)kip/(q)uit]:

      All of these different parameters are aimed at increasing performance. For example, SSDs can process many concurrent requests, so the best value for the effective_io_concurrency might be in the hundreds. You can find more info about these options in the PostgreSQL documentation.

      Press y then ENTER to continue.


      ... Is this okay? [(y)es/(s)kip/(q)uit]: y success: miscellaneous settings will be updated Saving changes to: /etc/postgresql/10/main/postgresql.conf

      As a result, you will get a ready-made configuration file at /etc/postgresql/10/main/postgresql.conf.

      Note: If you are automating the installation, you could also run the initial command with the --quiet and --yes flags, which will automatically apply all the recommendations and will make changes to the postgresql.conf configuration file:

      • sudo timescaledb-tune --quiet --yes

      In order for the configuration changes to take effect, you must restart the PostgreSQL service:

      • sudo systemctl restart postgresql.service

      Now the database is running with optimal parameters and is ready to work with the time series data. In the next steps, you’ll try out working with this data: creating new databases and hypertables and performing operations.

      Step 3 — Creating a New Database and Hypertable

      With your TimescaleDB setup optimized, you are ready to work with time series data. TimescaleDB is implemented as an extension of PostgreSQL, so operations with time series data are not much different from relational data operations. At the same time, the database allows you to freely combine data from time series and relational tables in the future.

      To demonstrate this, you will use PostgreSQL commands to create a database, then enable the TimescaleDB extension to create a hypertable, which is a higher-level abstraction of many individual tables. Hypertables are the main structures you will work with in TimescaleDB.

      Log into your PostgreSQL database:

      Now create a new database and connect to it. This tutorial will name the database timeseries:

      • CREATE DATABASE timeseries;
      • c timeseries

      You can find additional information about working with the PostgreSQL database in our How To Create, Remove & Manage Tables in PostgreSQL on a Cloud Server tutorial.

      Finally, enable the TimescaleDB extension:


      You will see the following output:


      WARNING: WELCOME TO _____ _ _ ____________ |_ _(_) | | | _ ___ | | _ _ __ ___ ___ ___ ___ __ _| | ___| | | | |_/ / | | | | _ ` _ / _ / __|/ __/ _` | |/ _ | | | ___ | | | | | | | | | __/__ (_| (_| | | __/ |/ /| |_/ / |_| |_|_| |_| |_|___||___/_____,_|_|___|___/ ____/ Running version 1.5.1 For more information on TimescaleDB, please visit the following links: 1. Getting started: 2. API reference documentation: 3. How TimescaleDB is designed: Note: TimescaleDB collects anonymous reports to better understand and assist our users. For more information and how to disable, please see our docs CREATE EXTENSION

      As mentioned earlier, the primary points of interaction with your time series data are hypertables, which consist of many individual tables holding data, called chunks.

      To create a hypertable, start with a regular SQL table and then convert it into a hypertable via the function create_hypertable.

      Make a table that will store data for tracking temperature and humidity across a collection of devices over time:

      • CREATE TABLE conditions (
      • device_id TEXT,
      • temperature NUMERIC,
      • humidity NUMERIC
      • );

      This command creates a table called conditions with four columns. The first column will store the timestamp, which includes the time zone and cannot be empty. Next, you will use the time column to transform your table into a hypertable that is partitioned by time:

      • SELECT create_hypertable('conditions', 'time');

      This command calls the create_hypertable() function, which creates a TimescaleDB hypertable from a PostgreSQL table, replacing the latter.

      You will receive the following output:


      create_hypertable ------------------------- (1,public,conditions,t) (1 row)

      In this step, you created a new hypertable to store time series data. Now you can populate it with data by writing to the hypertable, then run through the process of deleting it.

      Step 4 — Writing and Deleting Data

      In this step, you will insert data using standard SQL commands and import large sets of data from external sources. This will show you the relational database aspects of TimescaleDB.

      First, try out the basic commands. You can insert data into the hypertable using the standard INSERT SQL command. Insert some sample temperature and humidity data for the theoretical device weather-pro-000000 using the following command:

      • INSERT INTO conditions(time, device_id, temperature, humidity)
      • VALUES (NOW(), 'weather-pro-000000', 84.1, 84.1);

      You’ll see the following output:


      INSERT 0 1

      You can also insert multiple rows of data at once. Try the following:

      • INSERT INTO conditions
      • VALUES
      • (NOW(), 'weather-pro-000002', 71.0, 51.0),
      • (NOW(), 'weather-pro-000003', 70.5, 50.5),
      • (NOW(), 'weather-pro-000004', 70.0, 50.2);

      You will receive the following:


      INSERT 0 3

      You can also specify that the INSERT command will return some or all of the inserted data using the RETURNING statement:

      • INSERT INTO conditions
      • VALUES (NOW(), 'weather-pro-000002', 70.1, 50.1) RETURNING *;

      You will see the following output:


      time | device_id | temperature | humidity -------------------------------+--------------------+-------------+---------- 2019-09-15 14:14:01.576651+00 | weather-pro-000002 | 70.1 | 50.1 (1 row)

      If you want to delete data from the hypertable, use the standard DELETE SQL command. Run the following to delete whatever data has a temperature higher than 80 or a humidity higher than 50:

      • DELETE FROM conditions WHERE temperature > 80;
      • DELETE FROM conditions WHERE humidity > 50;

      After the delete operation, use the VACUUM command to reclaim space still used by data that has been deleted.

      You can find more info about VACUUM command in the PostgreSQL documentation.

      These commands are fine for small-scale data entry, but since time series data often generates huge datasets from multiple devices simultaneously, it’s essential also to know how to insert hundreds or thousands of rows at a time. If you have prepared data from external sources in a structured form, for example in csv format, this task can be accomplished quickly.

      To test this out, you will use a sample dataset that represents temperature and humidity data from a variety of locations. This is an official TimescaleDB dataset, made to test out their database. You can check out more info about sample datasets in the TimescaleDB documentation.

      Let’s see how you can import data from the weather_small sample dataset into your database. First, quit Postgresql:

      Then download the dataset and extract it:

      • wget
      • tar -xvzf weather_small.tar.gz

      Next, import the temperature and humidity data into your database:

      • sudo -u postgres psql -d timeseries -c "COPY conditions FROM weather_small_conditions.csv CSV"

      This connects to the timeseries database and executes the COPY command that copies the data from the chosen file into the conditions hypertable. It will run for a few seconds.

      When the data has been entered into your table, you will receive the following output:


      COPY 1000000

      In this step, you added data to the hypertable manually and in batches. Next, continue on to performing queries.

      Step 5 — Querying Data

      Now that your table contains data, you can perform various queries to analyze it.

      To get started, log in to the database:

      • sudo -u postgres psql -d timeseries

      As mentioned before, to work with hypertables you can use standard SQL commands. For example, to show the last 10 entries from the conditions hypertable, run the following command:

      • SELECT * FROM conditions LIMIT 10;

      You will see the following output:


      time | device_id | temperature | humidity ------------------------+--------------------+--------------------+---------- 2016-11-15 12:00:00+00 | weather-pro-000000 | 39.9 | 49.9 2016-11-15 12:00:00+00 | weather-pro-000001 | 32.4 | 49.8 2016-11-15 12:00:00+00 | weather-pro-000002 | 39.800000000000004 | 50.2 2016-11-15 12:00:00+00 | weather-pro-000003 | 36.800000000000004 | 49.8 2016-11-15 12:00:00+00 | weather-pro-000004 | 71.8 | 50.1 2016-11-15 12:00:00+00 | weather-pro-000005 | 71.8 | 49.9 2016-11-15 12:00:00+00 | weather-pro-000006 | 37 | 49.8 2016-11-15 12:00:00+00 | weather-pro-000007 | 72 | 50 2016-11-15 12:00:00+00 | weather-pro-000008 | 31.3 | 50 2016-11-15 12:00:00+00 | weather-pro-000009 | 84.4 | 87.8 (10 rows)

      This command lets you see what data is in the database. Since the database contains a million records, you used LIMIT 10 to limit the output to 10 entries.

      To see the most recent entries, sort the data array by time in descending order:

      • SELECT * FROM conditions ORDER BY time DESC LIMIT 20;

      This will output the top 20 most recent entries.

      You can also add a filter. For example, to see entries from the weather-pro-000000 device, run the following:

      • SELECT * FROM conditions WHERE device_id = 'weather-pro-000000' ORDER BY time DESC LIMIT 10;

      In this case, you will see the 10 most recent temperature and humidity datapoints recorded by the weather-pro-000000 device.

      In addition to standard SQL commands, TimescaleDB also provides a number of special functions that are useful for time series data analysis. For example, to find the median of the temperature values, you can use the following query with the percentile_cont function:

      • SELECT percentile_cont(0.5)
      • WITHIN GROUP (ORDER BY temperature)
      • FROM conditions
      • WHERE device_id = 'weather-pro-000000';

      You will see the following output:


      percentile_cont ----------------- 40.5 (1 row)

      In this way, you’ll see the median temperature for the entire observation period where the weather-pro-00000 sensor is located.

      To show the latest values from each of the sensors, you can use the last function:

      • select device_id, last(temperature, time)
      • FROM conditions
      • GROUP BY device_id;

      In the output you will see a list of all the sensors and relevant latest values.

      To get the first values use the first function.

      The following example is more complex. It will show the hourly average, minimum, and maximum temperatures for the chosen sensor within the last 24 hours:

      • SELECT time_bucket('1 hour', time) "hour",
      • trunc(avg(temperature), 2) avg_temp,
      • trunc(min(temperature), 2) min_temp,
      • trunc(max(temperature), 2) max_temp
      • FROM conditions
      • WHERE device_id = 'weather-pro-000000'
      • GROUP BY "hour" ORDER BY "hour" DESC LIMIT 24;

      Here you used the time_bucket function, which acts as a more powerful version of the PostgreSQL date_trunc function. As a result, you will see which periods of the day the temperature rises or decreases:


      hour | avg_temp | min_temp | max_temp ------------------------+----------+----------+---------- 2016-11-16 21:00:00+00 | 42.00 | 42.00 | 42.00 2016-11-16 20:00:00+00 | 41.92 | 41.69 | 42.00 2016-11-16 19:00:00+00 | 41.07 | 40.59 | 41.59 2016-11-16 18:00:00+00 | 40.11 | 39.79 | 40.59 2016-11-16 17:00:00+00 | 39.46 | 38.99 | 39.79 2016-11-16 16:00:00+00 | 38.54 | 38.19 | 38.99 2016-11-16 15:00:00+00 | 37.56 | 37.09 | 38.09 2016-11-16 14:00:00+00 | 36.62 | 36.39 | 37.09 2016-11-16 13:00:00+00 | 35.59 | 34.79 | 36.29 2016-11-16 12:00:00+00 | 34.59 | 34.19 | 34.79 2016-11-16 11:00:00+00 | 33.94 | 33.49 | 34.19 2016-11-16 10:00:00+00 | 33.27 | 32.79 | 33.39 2016-11-16 09:00:00+00 | 33.37 | 32.69 | 34.09 2016-11-16 08:00:00+00 | 34.94 | 34.19 | 35.49 2016-11-16 07:00:00+00 | 36.12 | 35.49 | 36.69 2016-11-16 06:00:00+00 | 37.02 | 36.69 | 37.49 2016-11-16 05:00:00+00 | 38.05 | 37.49 | 38.39 2016-11-16 04:00:00+00 | 38.71 | 38.39 | 39.19 2016-11-16 03:00:00+00 | 39.72 | 39.19 | 40.19 2016-11-16 02:00:00+00 | 40.67 | 40.29 | 40.99 2016-11-16 01:00:00+00 | 41.63 | 40.99 | 42.00 2016-11-16 00:00:00+00 | 42.00 | 42.00 | 42.00 2016-11-15 23:00:00+00 | 42.00 | 42.00 | 42.00 2016-11-15 22:00:00+00 | 42.00 | 42.00 | 42.00 (24 rows)

      You can find more useful functions in the TimescaleDB documentation.

      Now you know how to handle your data. Next, you will go through how to delete unnecessary data and how to compress data.

      Step 6 — Configuring Data Compression and Deletion

      As data accumulates, it will take up more and more space on your hard drive. To save space, the latest version of TimescaleDB provides a data compression feature. This feature doesn’t require tweaking any file system settings, and can be used to quickly make your database more efficient. For more information on how this compression works, take a look at this Compression article from TimescaleDB.

      First, enable the compression of your hypertable:

      • ALTER TABLE conditions SET (
      • timescaledb.compress,
      • timescaledb.compress_segmentby = 'device_id'
      • );

      You will receive the following data:


      NOTICE: adding index _compressed_hypertable_2_device_id__ts_meta_sequence_num_idx ON _timescaledb_internal._compressed_hypertable_2 USING BTREE(device_id, _ts_meta_sequence_num) ALTER TABLE

      Note: You can also set up TimescaleDB to compress data over the specified time period. For example, you could run:

      • SELECT add_compress_chunks_policy('conditions', INTERVAL '7 days');

      In this example, the data will be automatically compressed after a week.

      You can see the statistics on the compressed data with the command:

      • SELECT *
      • FROM timescaledb_information.compressed_chunk_stats;

      You will then see a list of chunks with their statuses: compression status and how much space is taken up by uncompressed and compressed data in bytes.

      If you don’t have the need to store data for a long period of time, you can delete out-of-date data to free up even more space. There is a special drop_chunks function for this. It allows you to delete chunks with data older than the specified time:

      • SELECT drop_chunks(interval '24 hours', 'conditions');

      This query will drop all chunks from the hypertable conditions that only include data older than a day ago.

      You will receive the following output:


      drop_chunks ---------------------------------------- _timescaledb_internal._hyper_1_2_chunk (1 row)

      To automatically delete old data, you can configure a cron task. See our tutorial to learn more about how to use cron to automate various system tasks.

      Exit from the database:

      Next, edit your crontab with the following command, which should be run from the shell:

      Now add the following line to the end of the file:


      0 1 * * * /usr/bin/psql -h localhost -p 5432 -U postgres -d postgres -c "SELECT drop_chunks(interval '24 hours', 'conditions');" >/dev/null 2>&1

      This job will delete obsolete data that is older than one day at 1:00 AM every day.


      You’ve now set up TimescaleDB on your Ubuntu 18.04 server. You also tried out creating hypertables, inserting data into it, querying the data, compressing, and deleting unnecessary records. With these examples, you’ll be able to take advantage of TimescaleDB’s key benefits over traditional relational database management systems for storing time-series data, including:

      • Higher data ingest rates
      • Quicker query performance
      • Time-oriented features

      Now that you know how to store time series data, you could use the data to create graphs. TimescaleDB is compatible with visualization tools that work with PostgreSQL, like Grafana. You can use our How To Install and Secure Grafana on Ubuntu 18.04 tutorial to learn more about this popular visualization tool.

      Source link