

      How To Build a Data Processing Pipeline Using Luigi in Python on Ubuntu 20.04

      The author selected the Free and Open Source Fund to receive a donation as part of the Write for DOnations program.


      Luigi is a Python package that manages long-running batch processing, which is the automated running of data processing jobs on batches of items. Luigi allows you to define a data processing job as a set of dependent tasks. For example, task B depends on the output of task A. And task D depends on the output of task B and task C. Luigi automatically works out what tasks it needs to run to complete a requested job.
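
      The dependency example above can be sketched in plain Python. This is not Luigi's scheduler; it is a minimal illustration, assuming hypothetical task names A to D and a hypothetical helper named resolve_order, of how declared dependencies determine which tasks must run, and in what order, to produce a requested task.

```python
# Hypothetical dependency map mirroring the example: B needs A; D needs B and C.
DEPENDENCIES = {
    "A": [],
    "B": ["A"],
    "C": [],
    "D": ["B", "C"],
}


def resolve_order(task, dependencies, done=None):
    """Return the tasks needed for `task`, with every dependency listed first."""
    if done is None:
        done = []
    # Resolve everything upstream before adding the task itself.
    for upstream in dependencies[task]:
        resolve_order(upstream, dependencies, done)
    if task not in done:
        done.append(task)
    return done


print(resolve_order("D", DEPENDENCIES))  # ['A', 'B', 'C', 'D']
print(resolve_order("B", DEPENDENCIES))  # ['A', 'B']
```

      Requesting D pulls in all four tasks, while requesting B only needs A and B. The sketch does not detect circular dependencies.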

      Overall, Luigi provides a framework to develop and manage data processing pipelines. It was originally developed by Spotify, which uses it to plumb together collections of tasks that fetch and process data from a variety of sources. Within Luigi, developers at Spotify built functionality to help with their batch processing needs, including handling of failures, the ability to automatically resolve dependencies between tasks, and visualization of task processing. Spotify uses Luigi to support batch processing jobs, including providing music recommendations to users, populating internal dashboards, and calculating lists of top songs.

      In this tutorial, you will build a data processing pipeline to analyze the most common words from the most popular books on Project Gutenberg. To do this, you will build a pipeline using the Luigi package. You will use Luigi tasks, targets, dependencies, and parameters to build your pipeline.


      To complete this tutorial, you will need the following:

      Step 1 — Installing Luigi

      In this step, you will create a clean sandbox environment for your Luigi installation.

      First, create a project directory. For this tutorial, name it luigi-demo:

      • mkdir luigi-demo

      Navigate into the newly created luigi-demo directory:

      • cd luigi-demo

      Create a new virtual environment luigi-venv:

      • python3 -m venv luigi-venv

      And activate the newly created virtual environment:

      • . luigi-venv/bin/activate

      You will find (luigi-venv) prepended to your terminal prompt to indicate which virtual environment is active:


      (luigi-venv) username@hostname:~/luigi-demo$
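
      Besides the prompt, you can confirm from Python itself that the virtual environment is active. The following is a small sketch, assuming an environment created with python3 -m venv as above; the helper name in_virtualenv is hypothetical.

```python
import sys


def in_virtualenv():
    """Return True when the interpreter runs inside a venv-style environment."""
    # In an environment created by `python3 -m venv`, sys.prefix points at the
    # environment while sys.base_prefix points at the base installation.
    return sys.prefix != sys.base_prefix


print("virtual environment active:", in_virtualenv())
print("interpreter prefix:", sys.prefix)
```

      When the environment is active, the printed prefix should end with the luigi-venv directory.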

      For this tutorial, you will need three libraries: luigi, beautifulsoup4, and requests. The requests library streamlines making HTTP requests; you will use it to download the Project Gutenberg book lists and the books to analyze. The beautifulsoup4 library provides functions to parse data from web pages; you will use it to parse out a list of the most popular books on the Project Gutenberg site.

      Run the following command to install these libraries using pip:

      • pip install wheel luigi beautifulsoup4 requests

      You will get a response confirming the installation of the latest versions of the libraries and all of their dependencies:


      Successfully installed beautifulsoup4-4.9.1 certifi-2020.6.20 chardet-3.0.4 docutils-0.16 idna-2.10 lockfile-0.12.2 luigi-3.0.1 python-daemon-2.2.4 python-dateutil-2.8.1 requests-2.24.0 six-1.15.0 soupsieve-2.0.1 tornado-5.1.1 urllib3-1.25.10

      You’ve installed the dependencies for your project. Now, you’ll move on to building your first Luigi task.

      Step 2 — Creating a Luigi Task

      In this step, you will create a “Hello World” Luigi task to demonstrate how tasks work.

      A Luigi task is where the execution of your pipeline and the definition of each task’s input and output dependencies take place. Tasks are the building blocks that you will create your pipeline from. You define them in a class, which contains:

      • A run() method that holds the logic for executing the task.
      • An output() method that returns the artifacts generated by the task. The run() method populates these artifacts.
      • An optional requires() method that returns any additional tasks in your pipeline that must complete before the current task can run. The run() method reads their outputs through self.input().

      Create a new file named hello-world.py in your luigi-demo directory.

      Now add the following code to your file:

      import luigi


      class HelloLuigi(luigi.Task):
          def output(self):
              # The local file this task produces
              return luigi.LocalTarget('hello-luigi.txt')

          def run(self):
              # Write the greeting into the output target
              with self.output().open("w") as outfile:
                  outfile.write("Hello Luigi!")

      You define that HelloLuigi() is a Luigi task by making it a subclass of luigi.Task.

      The output() method defines one or more Target outputs that your task produces. In the case of this example, you define a luigi.LocalTarget, which is a local file.

      Note: Luigi allows you to connect to a variety of common data sources including AWS S3 buckets, MongoDB databases, and SQL databases. You can find a complete list of supported data sources in the Luigi docs.

      The run() method contains the code you want to execute for your pipeline stage. In this example, you open the output() target file in write mode with self.output().open("w") as outfile: and write "Hello Luigi!" to it with outfile.write("Hello Luigi!").

      To execute the task you created, run the following command:

      • python -m luigi --module hello-world HelloLuigi --local-scheduler

      Here, you run the task using python -m instead of executing the luigi command directly; this is because Luigi can only execute code that is within the current PYTHONPATH. You can alternatively add PYTHONPATH='.' to the front of your Luigi command, like so:

      • PYTHONPATH='.' luigi --module hello-world HelloLuigi --local-scheduler
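
      The effect of the import path can be reproduced with the standard library alone. This sketch does not call Luigi; it assumes a throwaway directory and a stand-in hello-world.py that only defines a hypothetical GREETING constant, and shows that a module becomes importable once its directory is on sys.path.

```python
import importlib
import importlib.util
import os
import sys
import tempfile

# Create a throwaway directory containing a module named like the tutorial's file.
project_dir = tempfile.mkdtemp()
with open(os.path.join(project_dir, "hello-world.py"), "w") as module_file:
    module_file.write("GREETING = 'Hello Luigi!'\n")

# The directory is not on the import path yet, so the module cannot be found.
found_before = importlib.util.find_spec("hello-world") is not None

# Adding the directory to sys.path is what PYTHONPATH='.' (or python -m, which
# puts the current directory on the path) achieves for the real project.
sys.path.insert(0, project_dir)
importlib.invalidate_caches()
found_after = importlib.util.find_spec("hello-world") is not None

module = importlib.import_module("hello-world")
print(found_before, found_after, module.GREETING)
```

      The hyphen in the module name is not valid in a regular import statement, which is why the sketch loads the module by name through importlib.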

      With the --module hello-world HelloLuigi flag, you tell Luigi which Python module and Luigi task to execute.

      The --local-scheduler flag tells Luigi to not connect to a Luigi scheduler and, instead, execute this task locally. (We explain the Luigi scheduler in Step 4.) Running tasks using the local-scheduler flag is only recommended for development work.

      Luigi will output a summary of the executed tasks:


      ===== Luigi Execution Summary =====

      Scheduled 1 tasks of which:
      * 1 ran successfully:
          - 1 HelloLuigi()

      This progress looks :) because there were no failed tasks or missing dependencies

      ===== Luigi Execution Summary =====

      And it will create a new file hello-luigi.txt with content:


      Hello Luigi!

      You have created a Luigi task that generates a file and then executed it using the Luigi local-scheduler. Now, you’ll create a task that can extract a list of books from a web page.

      Step 3 — Creating a Task to Extract a List of Books

      In this step, you will create a Luigi task and define a run() method for the task to download a list of the most popular books on Project Gutenberg. You’ll define an output() method to store links to these books in a file. You will run these using the Luigi local scheduler.

      Create a new directory data inside of your luigi-demo directory. This will be where you will store the files defined in the output() methods of your tasks. You need to create the directories before running your tasks—Python throws exceptions when you try to write a file to a directory that does not exist yet:

      • mkdir data
      • mkdir data/counts
      • mkdir data/downloads
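
      The failure mode described above can be seen with Python's built-in open(). This is a minimal sketch that works in a temporary directory rather than your project; the file name example.txt is hypothetical.

```python
import os
import tempfile

# Work inside a temporary directory so the sketch does not touch real files.
base_dir = tempfile.mkdtemp()
missing_path = os.path.join(base_dir, "data", "counts", "example.txt")

# Writing into a directory that does not exist raises FileNotFoundError.
try:
    with open(missing_path, "w") as outfile:
        outfile.write("never written")
    write_failed = False
except FileNotFoundError:
    write_failed = True

# Creating the directories first (the Python equivalent of mkdir -p) fixes it.
for sub_dir in ("data/counts", "data/downloads"):
    os.makedirs(os.path.join(base_dir, sub_dir), exist_ok=True)

with open(missing_path, "w") as outfile:
    outfile.write("written after the directory exists")

print(write_failed, os.path.exists(missing_path))
```

      Passing exist_ok=True makes the directory creation safe to repeat, so it can run before every pipeline execution.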

      Create a new file named word-frequency.py.

      Insert the following code, which is a Luigi task to extract a list of links to the top most-read books on Project Gutenberg:

      import requests
      import luigi
      from bs4 import BeautifulSoup


      class GetTopBooks(luigi.Task):
          """Get list of the most popular books from Project Gutenberg"""

          def output(self):
              return luigi.LocalTarget("data/books_list.txt")

          def run(self):
              resp = requests.get("")
              soup = BeautifulSoup(resp.content, "html.parser")
              pageHeader = soup.find_all("h2", string="Top 100 EBooks yesterday")[0]
              listTop = pageHeader.find_next_sibling("ol")
              with self.output().open("w") as f:
                  # Select every <a> that is a direct child of an <li> in the list
                  for result in listTop.select("li>a"):
                      if "/ebooks/" in result["href"]:
                          # Write each book link on its own line
                          f.write("{}\n".format(result["href"]))

      You define an output() target of file "data/books_list.txt" to store the list of books.

      In the run() method, you:

      • use the requests library to download the HTML contents of the Project Gutenberg top books page.
      • use the BeautifulSoup library to parse the contents of the page. The BeautifulSoup library allows us to scrape information out of web pages. To find out more about using the BeautifulSoup library, read the How To Scrape Web Pages with Beautiful Soup and Python 3 tutorial.
      • open the output file defined in the output() method.
      • iterate over the HTML structure to get all of the links in the Top 100 EBooks yesterday list. For this page, this means locating all links <a> that are within a list item <li>. For each of those links, if its href contains /ebooks/, you can assume it is a book and write that link to your output() file.
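
      The link filtering step can be tried offline without any network access. The sketch below swaps BeautifulSoup for the standard library's html.parser so that it has no dependencies; the SAMPLE_HTML snippet and the BookLinkParser class are hypothetical stand-ins, not the real page or the tutorial's code.

```python
from html.parser import HTMLParser

# Hypothetical, heavily trimmed stand-in for the top books page.
SAMPLE_HTML = """
<h2>Top 100 EBooks yesterday</h2>
<ol>
  <li><a href="/ebooks/1342">Pride and Prejudice</a></li>
  <li><a href="/ebooks/84">Frankenstein</a></li>
  <li><a href="/about/">About</a></li>
</ol>
"""


class BookLinkParser(HTMLParser):
    """Collect href values of <a> tags nested in <li> that point at /ebooks/."""

    def __init__(self):
        super().__init__()
        self.li_depth = 0
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self.li_depth += 1
        elif tag == "a" and self.li_depth > 0:
            href = dict(attrs).get("href") or ""
            if "/ebooks/" in href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "li" and self.li_depth > 0:
            self.li_depth -= 1


parser = BookLinkParser()
parser.feed(SAMPLE_HTML)
print(parser.links)  # ['/ebooks/1342', '/ebooks/84']
```

      As a simplification, this parser accepts any <a> nested inside an <li>, whereas the li>a selector in the task matches direct children only.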

      Screenshot of the Project Gutenberg top books web page with the top ebooks links highlighted

      Save and exit the file once you’re done.

      Execute this new task using the following command:

      • python -m luigi --module word-frequency GetTopBooks --local-scheduler

      Luigi will output a summary of the executed tasks:


      ===== Luigi Execution Summary =====

      Scheduled 1 tasks of which:
      * 1 ran successfully:
          - 1 GetTopBooks()

      This progress looks :) because there were no failed tasks or missing dependencies

      ===== Luigi Execution Summary =====

      In the data directory, Luigi will create a new file (data/books_list.txt). Run the following command to output the contents of the file:

      • cat data/books_list.txt

      This file contains a list of URLs extracted from the Project Gutenberg top books list:

      Output ...

      You’ve created a task that can extract a list of books from a web page. In the next step, you’ll set up a central Luigi scheduler.

      Step 4 — Running the Luigi Scheduler

      Now, you’ll launch the Luigi scheduler to execute and visualize your tasks. You will take the task developed in Step 3 and run it using the Luigi scheduler.

      So far, you have been running Luigi using the --local-scheduler flag to run your jobs locally without allocating work to a central scheduler. This is useful for development, but for production usage it is recommended to use the Luigi scheduler. The Luigi scheduler provides:

      • A central point to execute your tasks.
      • Visualization of the execution of your tasks.

      To access the Luigi scheduler interface, you need to enable access to port 8082. To do this, run the following command:

      To run the scheduler, execute the following command:

      • sudo sh -c ". luigi-venv/bin/activate ;luigid --background --port 8082"

      Note: We have re-run the virtualenv activate script as root, before launching the Luigi scheduler as a background task. This is because when running sudo the virtualenv environment variables and aliases are not carried over.

      If you do not want to run as root, you can run the Luigi scheduler as a background process for the current user. This command runs the Luigi scheduler in the background and hides messages from the scheduler background task. You can find out more about managing background processes in the terminal at How To Use Bash’s Job Control to Manage Foreground and Background Processes:

      • luigid --port 8082 > /dev/null 2> /dev/null &

      Open a browser to access the Luigi interface. This will either be at http://your_server_ip:8082, or if you have set up a domain for your server http://your_domain:8082. This will open the Luigi user interface.

      Luigi default user interface

      By default, Luigi tasks run using the Luigi scheduler. To run one of your previous tasks using the Luigi scheduler, omit the --local-scheduler argument from the command. Re-run the task from Step 3 using the following command:

      • python -m luigi --module word-frequency GetTopBooks

      Refresh the Luigi scheduler user interface. You will find the GetTopBooks task added to the run list and its execution status.

      Luigi User Interface after running the GetTopBooks Task

      You will continue to refer back to this user interface to monitor the progress of your pipeline.

      Note: If you’d like to secure your Luigi scheduler through HTTPS, you can serve it through Nginx. To set up an Nginx server using HTTPS follow: How To Secure Nginx with Let’s Encrypt on Ubuntu 20.04. See Github - Luigi - Pull Request 2785 for suggestions on a suitable Nginx configuration to connect the Luigi server to Nginx.

      You’ve launched the Luigi Scheduler and used it to visualize your executed tasks. Next, you will create a task to download the list of books that the GetTopBooks() task outputs.

      Step 5 — Downloading the Books

      In this step you will create a Luigi task to download a specified book. You will define a dependency between this newly created task and the task created in Step 3.

      First, open your word-frequency.py file.

      Add an additional class following your GetTopBooks() task to the file with the following code:

      . . .
      class DownloadBooks(luigi.Task):
          """Download a specified list of books"""
          FileID = luigi.IntParameter()

          REPLACE_LIST = """.,"';_[]:*-"""

          def requires(self):
              return GetTopBooks()

          def output(self):
              return luigi.LocalTarget("data/downloads/{}.txt".format(self.FileID))

          def run(self):
              with self.input().open("r") as i:
                  # Pick the URL on the line selected by the FileID parameter
                  URL = i.read().splitlines()[self.FileID]

                  with self.output().open("w") as outfile:
                      book_downloads = requests.get(URL)
                      book_text = book_downloads.text

                      # Replace special characters with spaces, then lowercase
                      for char in self.REPLACE_LIST:
                          book_text = book_text.replace(char, " ")
                      book_text = book_text.lower()

                      # Write the filtered text to the output target
                      outfile.write(book_text)

      In this task you introduce a Parameter; in this case, an integer parameter. Luigi parameters are inputs to your tasks that affect the execution of the pipeline. Here you introduce a parameter FileID to specify a line in your list of URLs to fetch.

      You have added an additional method to your Luigi task, def requires(); in this method you define the Luigi task that you need the output of before you can execute this task. You require the output of the GetTopBooks() task you defined in Step 3.

      In the output() method, you define your target. You use the FileID parameter to create a name for the file created by this step. In this case, you format data/downloads/{FileID}.txt.

      In the run() method, you:

      • open the list of books generated in the GetTopBooks() task.
      • get the URL from the line specified by parameter FileID.
      • use the requests library to download the contents of the book from the URL.
      • filter out the special characters listed in REPLACE_LIST, such as : , . and ;, so they don’t get included in your word analysis.
      • convert the text to lowercase so you can compare words with different cases.
      • write the filtered output to the file specified in the output() method.
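
      The filtering and lowercasing steps can be tested on their own, without downloading a book. This is a small sketch that reuses the REPLACE_LIST value from the task; the clean_text helper and the sample sentence are hypothetical.

```python
# Same character list the DownloadBooks task uses.
REPLACE_LIST = """.,"';_[]:*-"""


def clean_text(text, replace_list=REPLACE_LIST):
    """Replace each listed character with a space, then lowercase the text."""
    for char in replace_list:
        text = text.replace(char, " ")
    return text.lower()


sample = 'It is a truth: "universally" acknowledged, well-known.'
print(clean_text(sample))
print(clean_text(sample).split())
```

      Replacing characters with a space rather than deleting them keeps hyphenated or underscore-joined words from being merged into one token. Characters that are not in the list are left in place.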

      Save and exit your file.

      Run the new DownloadBooks() task using this command:

      • python -m luigi --module word-frequency DownloadBooks --FileID 2

      In this command, you set the FileID parameter using the --FileID argument.

      Note: Be careful when defining a parameter with an _ in the name. To reference it in Luigi, you need to replace the _ with a -. For example, a File_ID parameter would be referenced as --File-ID when calling a task from the terminal.

      You will receive the following output:


      ===== Luigi Execution Summary =====

      Scheduled 2 tasks of which:
      * 1 complete ones were encountered:
          - 1 GetTopBooks()
      * 1 ran successfully:
          - 1 DownloadBooks(FileID=2)

      This progress looks :) because there were no failed tasks or missing dependencies

      ===== Luigi Execution Summary =====

      Note from the output that Luigi has detected that you have already generated the output of GetTopBooks() and skipped running that task. This functionality allows you to minimize the number of tasks you have to execute as you can re-use successful output from previous runs.
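
      The skipping behavior rests on a simple idea: by default, Luigi treats a task as complete when its output targets already exist. The sketch below imitates that idea in plain Python, assuming a hypothetical run_if_missing helper and a temporary file; it is not Luigi's implementation.

```python
import os
import tempfile


def run_if_missing(output_path, produce):
    """Run `produce` only when `output_path` does not exist yet.

    Returns True when the work ran and False when it was skipped.
    """
    if os.path.exists(output_path):
        return False
    with open(output_path, "w") as outfile:
        outfile.write(produce())
    return True


output_path = os.path.join(tempfile.mkdtemp(), "books_list.txt")

first_run = run_if_missing(output_path, lambda: "first result\n")
second_run = run_if_missing(output_path, lambda: "second result\n")

print(first_run, second_run)  # True False
with open(output_path) as infile:
    print(infile.read(), end="")  # first result
```

      A practical consequence is that stale output is never refreshed automatically. To force a task such as GetTopBooks() to run again, delete its output file first.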

      You have created a task that uses the output of another task and downloads a set of books to analyze. In the next step, you will create a task to count the most common words in a downloaded book.

      Step 6 — Counting Words and Summarizing Results

      In this step, you will create a Luigi task to count the frequency of words in each of the books downloaded in Step 5. This will be your first task that executes in parallel.

      First open your file again:

      Add the following imports to the top of word-frequency.py:

      from collections import Counter
      import pickle

      Add the following task to word-frequency.py, after your DownloadBooks() task. This task takes the output of the previous DownloadBooks() task for a specified book, and returns the most common words in that book:

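      class CountWords(luigi.Task):
          """Count the frequency of the most common words from a file"""
          FileID = luigi.IntParameter()

          def requires(self):
              return DownloadBooks(FileID=self.FileID)

          def output(self):
              # Assumption: this file name pattern is illustrative.
              # format=luigi.format.Nop allows the binary output pickle writes.
              return luigi.LocalTarget(
                  "data/counts/count_{}.pickle".format(self.FileID),
                  format=luigi.format.Nop
              )

          def run(self):
              with self.input().open("r") as i:
                  # Count every whitespace-separated word in the book
                  word_count = Counter(i.read().split())
                  with self.output().open("w") as outfile:
                      pickle.dump(word_count, outfile)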

      When you define requires(), you pass the FileID parameter on to the required task. When you specify that a task depends on another task, you also specify the parameters that the required task must be executed with.

      In the run() method you:

      • open the file generated by the DownloadBooks() task.
      • use the built-in Counter object in the collections library. This provides an easy way to analyze the most common words in a book.
      • use the pickle library to store the output of the Python Counter object, so you can re-use that object in a later task. pickle is a library that you use to convert Python objects into a byte stream, which you can store and restore into a later Python session. You have to set the format property of the luigi.LocalTarget to allow it to write the binary output the pickle library generates.
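
      The counting and pickling steps can be exercised outside Luigi. This is a minimal standard-library sketch, assuming a short sample sentence and a temporary file named count_example.pickle, both hypothetical.

```python
import os
import pickle
import tempfile
from collections import Counter

text = "the cat and the hat and the bat"

# Counter tallies how often each whitespace-separated word occurs.
word_count = Counter(text.split())
print(word_count.most_common(2))  # [('the', 3), ('and', 2)]

# pickle produces bytes, so the file must be opened in binary mode ("wb"/"rb").
count_path = os.path.join(tempfile.mkdtemp(), "count_example.pickle")
with open(count_path, "wb") as outfile:
    pickle.dump(word_count, outfile)

with open(count_path, "rb") as infile:
    restored = pickle.load(infile)

print(restored == word_count)  # True
```

      The binary mode requirement here is the plain-Python counterpart of setting the format property on the luigi.LocalTarget.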

      Save and exit your file.

      Run the new CountWords() task using this command:

      • python -m luigi --module word-frequency CountWords --FileID 2

      Open the CountWords task graph view in the Luigi scheduler user interface.

      Showing how to view a graph from the Luigi user interface

      Deselect the Hide Done option, and deselect Upstream Dependencies. You will find the flow of execution from the tasks you have created.

      Visualizing the execution of the CountWords task

      You have created a task to count the most common words in a downloaded book and visualized the dependencies between those tasks. Next, you will define parameters that you can use to customize the execution of your tasks.

      Step 7 — Defining Configuration Parameters

      In this step, you will add configuration parameters to the pipeline. These will allow you to customize how many books to analyze and the number of words to include in the results.

      When you want to set parameters that are shared among tasks, you can create a Config() class. Other pipeline stages can reference the parameters defined in the Config() class; these are set by the pipeline when executing a job.

      Add the following Config() class to the end of word-frequency.py. This will define two new parameters in your pipeline for the number of books to analyze and the number of most frequent words to include in the summary:

      class GlobalParams(luigi.Config):
          NumberBooks = luigi.IntParameter(default=10)
          NumberTopWords = luigi.IntParameter(default=500)

      Add the following class to word-frequency.py. This class aggregates the results from all of the CountWords() tasks to create a summary of the most frequent words:

      class TopWords(luigi.Task):
          """Aggregate the count results from the different files"""

          def requires(self):
              # One CountWords task per book, up to the configured number
              requiredInputs = []
              for i in range(GlobalParams().NumberBooks):
                  requiredInputs.append(CountWords(FileID=i))
              return requiredInputs

          def output(self):
              return luigi.LocalTarget("data/summary.txt")

          def run(self):
              total_count = Counter()
              for input in self.input():
                  with input.open("rb") as infile:
                      nextCounter = pickle.load(infile)
                      total_count += nextCounter

              with self.output().open("w") as f:
                  for item in total_count.most_common(GlobalParams().NumberTopWords):
                      f.write("{0: <15}{1}\n".format(*item))

      In the requires() method, you can provide a list where you want a task to use the output of multiple dependent tasks. You use the GlobalParams().NumberBooks parameter to set the number of books you need word counts from.

      In the output() method, you define a data/summary.txt output file that will be the final output of your pipeline.

      In the run() method you:

      • create a Counter() object to store the total count.
      • open each file written by a CountWords() task and “unpickle” it (convert it from a file back to a Python object).
      • add the loaded count to the total count.
      • write the most common words to the target output file.
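
      The aggregation and formatting steps can be checked with a few in-memory counters. This sketch assumes two small hypothetical counts in place of the unpickled CountWords() output and uses the same format string as the task.

```python
from collections import Counter

# Hypothetical per-book counts standing in for the unpickled CountWords output.
book_counts = [
    Counter({"the": 5, "and": 3, "whale": 2}),
    Counter({"the": 4, "and": 1, "garden": 2}),
]

total_count = Counter()
for next_counter in book_counts:
    total_count += next_counter  # adds counts key by key

# "{0: <15}" left-aligns the word in a 15-character column, padded with spaces.
lines = ["{0: <15}{1}".format(*item) for item in total_count.most_common(2)]
print("\n".join(lines))
```

      Words longer than 15 characters are not truncated; they simply push the count further right on that line.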

      Run the pipeline with the following command:

      • python -m luigi --module word-frequency TopWords --GlobalParams-NumberBooks 15 --GlobalParams-NumberTopWords 750

      Luigi will execute the remaining tasks needed to generate the summary of the top words:


      ===== Luigi Execution Summary =====

      Scheduled 31 tasks of which:
      * 2 complete ones were encountered:
          - 1 CountWords(FileID=2)
          - 1 GetTopBooks()
      * 29 ran successfully:
          - 14 CountWords(FileID=0,1,10,11,12,13,14,3,4,5,6,7,8,9)
          - 14 DownloadBooks(FileID=0,1,10,11,12,13,14,3,4,5,6,7,8,9)
          - 1 TopWords()

      This progress looks :) because there were no failed tasks or missing dependencies

      ===== Luigi Execution Summary =====

      You can visualize the execution of the pipeline from the Luigi scheduler. Select the GetTopBooks task in the task list and press the View Graph button.

      Showing how to view a graph from the Luigi user interface

      Deselect the Hide Done and Upstream Dependencies options.

      Visualizing the execution of the TopWords Task

      It will show the flow of processing that is happening in Luigi.

      Open the data/summary.txt file:

      You will find the calculated most common words:


      the            64593
      and            41650
      of             31896
      to             31368
      a              25265
      i              23449
      in             19496
      it             16282
      that           15907
      he             14974
      ...

      In this step, you have defined and used parameters to customize the execution of your tasks. You have generated a summary of the most common words for a set of books.

      Find all the code for this tutorial in this repository.


      This tutorial has introduced you to using the Luigi data processing pipeline and its major features including tasks, parameters, configuration parameters, and the Luigi scheduler.

      Luigi supports connecting to a large number of common data sources out of the box. You can also scale it to run large, complex data pipelines. This provides a powerful framework to start solving your data processing challenges.

      For more tutorials, check out our Data Analysis topic page and Python topic page.


      How To Set Up a Continuous Deployment Pipeline with GitLab CI/CD on Ubuntu 18.04

      The author selected the Free and Open Source Fund to receive a donation as part of the Write for DOnations program.


      GitLab is an open source collaboration platform that provides powerful features beyond hosting a code repository. You can track issues, host packages and registries, maintain Wikis, set up continuous integration (CI) and continuous deployment (CD) pipelines, and more.

      In this tutorial you’ll build a continuous deployment pipeline with GitLab. You will configure the pipeline to build a Docker image, push it to the GitLab container registry, and deploy it to your server using SSH. The pipeline will run for each commit pushed to the repository.

      You will deploy a small, static web page, but the focus of this tutorial is configuring the CD pipeline. The static web page is only for demonstration purposes; you can apply the same pipeline configuration using other Docker images for the deployment as well.

      When you have finished this tutorial, you can visit http://your_server_IP in a browser for the results of the automatic deployment.


      To complete this tutorial, you will need:

      Step 1 — Creating the GitLab Repository

      Let’s start by creating a GitLab project and adding an HTML file to it. You will later copy the HTML file into an Nginx Docker image, which in turn you’ll deploy to the server.

      Log in to your GitLab instance and click New project.

      The new project button in GitLab

      1. Give it a proper Project name.
      2. Optionally add a Project description.
      3. Make sure to set the Visibility Level to Private or Public depending on your requirements.
      4. Finally, click Create project.

      The new project form in GitLab

      You will be redirected to the Project’s overview page.

      Let’s create the HTML file. On your Project’s overview page, click New file.

      The new file button on the project overview page

      Set the File name to index.html and add the following HTML to the file body:


      <h1>My Personal Website</h1>

      Click Commit changes at the bottom of the page to create the file.

      This HTML will produce a blank page with one headline showing My Personal Website when opened in a browser.

      Dockerfiles are recipes used by Docker to build Docker images. Let’s create a Dockerfile to copy the HTML file into an Nginx image.

      Go back to the Project’s overview page, click the + button and select the New file option.

      New file option in the project's overview page listed in the plus button

      Set the File name to Dockerfile and add these instructions to the file body:


      FROM nginx:1.18
      COPY index.html /usr/share/nginx/html

      The FROM instruction specifies the image to inherit from—in this case the nginx:1.18 image. 1.18 is the image tag representing the Nginx version. The nginx:latest tag references the latest Nginx release, but that could break your application in the future, which is why fixed versions are recommended.

      The COPY instruction copies the index.html file to /usr/share/nginx/html in the Docker image. This is the directory where Nginx stores static HTML content.

      Click Commit changes at the bottom of the page to create the file.

      In the next step, you’ll configure a GitLab runner to keep control of who gets to execute the deployment job.

      Step 2 — Registering a GitLab Runner

      In order to keep track of the environments that will have contact with the SSH private key, you’ll register your server as a GitLab runner.

      In your deployment pipeline you want to log in to your server using SSH. To achieve this, you’ll store the SSH private key in a GitLab CI/CD variable (Step 5). The SSH private key is a very sensitive piece of data, because it is the entry ticket to your server. Usually, the private key never leaves the system it was generated on. In the usual case, you would generate an SSH key on your host machine, then authorize it on the server (that is, copy the public key to the server) in order to log in manually and perform the deployment routine.

      Here the situation changes slightly: You want to grant an autonomous authority (GitLab CI/CD) access to your server to automate the deployment routine. Therefore the private key needs to leave the system it was generated on and be given in trust to GitLab and other involved parties. You never want your private key to enter an environment that is not either controlled or trusted by you.

      Besides GitLab, the GitLab runner is yet another system that your private key will enter. For each pipeline, GitLab uses runners to perform the heavy work, that is, execute the jobs you have specified in the CI/CD configuration. That means the deployment job will ultimately be executed on a GitLab runner, hence the private key will be copied to the runner such that it can log in to the server using SSH.

      If you use unknown GitLab Runners (for example, shared runners) to execute the deployment job, then you’d be unaware of the systems getting in contact with the private key. Even though GitLab runners clean up all data after job execution, you can avoid sending the private key to unknown systems by registering your own server as a GitLab runner. The private key will then be copied to the server controlled by you.

      Start by logging in to your server:

      In order to install the gitlab-runner service, you’ll add the official GitLab repository. Download and inspect the install script:

      • curl -L >
      • less

      Once you are satisfied with the safety of the script, run the installer:

      It may not be obvious, but you have to enter your non-root user’s password to proceed. When you execute the previous command, the output will be like:


      [sudo] password for sammy:
        % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                       Dload  Upload   Total   Spent    Left  Speed
      100  5945  100  5945    0     0   8742      0 --:--:-- --:--:-- --:--:--  8729

      When the curl command finishes, you will receive the following message:


      The repository is setup! You can now install packages.

      Next install the gitlab-runner service:

      • sudo apt install gitlab-runner

      Verify the installation by checking the service status:

      • systemctl status gitlab-runner

      You will have active (running) in the output:


      ● gitlab-runner.service - GitLab Runner
         Loaded: loaded (/etc/systemd/system/gitlab-runner.service; enabled; vendor preset: enabled)
         Active: active (running) since Mon 2020-06-01 09:01:49 UTC; 4s ago
       Main PID: 16653 (gitlab-runner)
          Tasks: 6 (limit: 1152)
         CGroup: /system.slice/gitlab-runner.service
                 └─16653 /usr/lib/gitlab-runner/gitlab-runner run --working-directory /home/gitlab-runner --config /etc/gitla

      To register the runner, you need to get the project token and the GitLab URL:

      1. In your GitLab project, navigate to Settings > CI/CD > Runners.
      2. In the Set up a specific Runner manually section, you’ll find the registration token and the GitLab URL. Copy both to a text editor; you’ll need them for the next command. They will be referred to as and project_token.

      The runners section in the ci/cd settings with the copy token button

      Back in your terminal, register the runner for your project:

      • sudo gitlab-runner register -n --url --registration-token project_token --executor docker --description "Deployment Runner" --docker-image "docker:stable" --tag-list deployment --docker-privileged

      The command options can be interpreted as follows:

      • -n executes the register command non-interactively (we specify all parameters as command options).
      • --url is the GitLab URL you copied from the runners page in GitLab.
      • --registration-token is the token you copied from the runners page in GitLab.
      • --executor is the executor type. docker executes each CI/CD job in a Docker container (see GitLab’s documentation on executors).
      • --description is the runner’s description, which will show up in GitLab.
      • --docker-image is the default Docker image to use in CI/CD jobs, if not explicitly specified.
      • --tag-list is a list of tags assigned to the runner. Tags can be used in a pipeline configuration to select specific runners for a CI/CD job. The deployment tag will allow you to refer to this specific runner to execute the deployment job.
      • --docker-privileged executes the Docker container created for each CI/CD job in privileged mode. A privileged container has access to all devices on the host machine and has nearly the same access to the host as processes running outside containers (see Docker’s documentation about runtime privilege and Linux capabilities). Privileged mode is needed here so you can use Docker-in-Docker (dind) to build a Docker image in your CI/CD pipeline. It is good practice to give a container only the privileges it needs; in this case, Docker-in-Docker requires privileged mode. Be aware that you registered the runner for this specific project only, where you are in control of the commands being executed in the privileged container.
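
      Put together, the registration command could look like the sketch below. GITLAB_URL and PROJECT_TOKEN are placeholder variables introduced here for illustration, and https://gitlab.example.com is a made-up address; substitute the URL and token you copied from the Runners page. The function only wraps the command so both values are set in one place.

```shell
# Placeholders (assumptions for illustration): replace with the values
# copied from Settings > CI/CD > Runners.
GITLAB_URL="https://gitlab.example.com"
PROJECT_TOKEN="project_token"

# Registers a project-specific runner with the options described above.
register_runner() {
  sudo gitlab-runner register -n \
    --url "$GITLAB_URL" \
    --registration-token "$PROJECT_TOKEN" \
    --executor docker \
    --description "Deployment Runner" \
    --docker-image "docker:stable" \
    --tag-list deployment \
    --docker-privileged
}

# On the server, run: register_runner
```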

      After executing the gitlab-runner register command, you will receive the following output:


      Runner registered successfully. Feel free to start it, but if it's running already the config should be automatically reloaded!

      Verify the registration process by going to Settings > CI/CD > Runners in GitLab, where the registered runner will show up.

      The registered runner in the runners section of the ci/cd settings

      In the next step you’ll create a deployment user.

      Step 3 — Creating a Deployment User

      You are going to create a non-sudo user dedicated to the deployment task, so that its privileges are limited and the deployment takes place in an isolated user space. You will later configure the CI/CD pipeline to log in to the server as that user.

      On your server, create a new user:

      You’ll be guided through the user creation process. Enter a strong password and optionally any further user information you want to specify. Finally confirm the user creation with Y.

      Add the user to the Docker group:

      • sudo usermod -aG docker deployer

      This permits deployer to execute the docker command, which is required to perform the deployment.
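
      The two user-management steps can be sketched as follows. The adduser call is an assumption based on the interactive prompts described above (the user name deployer is taken from the usermod command), and in_group is a hypothetical helper added here to confirm that the group change took effect.

```shell
# Assumed commands for this step; run them on the server.
create_deployer() {
  sudo adduser deployer              # interactive: asks for a password and user details
  sudo usermod -aG docker deployer   # -a appends, -G names the supplementary group
}

# Hypothetical helper: succeeds if user $1 belongs to group $2.
in_group() {
  id -nG "$1" | tr ' ' '\n' | grep -qx "$2"
}

# Usage on the server:
#   create_deployer && in_group deployer docker && echo "deployer can use docker"
```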

      In the next step you’ll create an SSH key to be able to log in to the server as deployer.

      Step 4 — Setting Up an SSH Key

      You are going to create an SSH key for the deployment user. GitLab CI/CD will later use the key to log in to the server and perform the deployment routine.

      Let’s start by switching to the newly created deployer user for whom you’ll generate the SSH key:

      You’ll be prompted for the deployer password to complete the user switch.

      Next, generate a 4096-bit SSH key. It is important to answer the questions of the ssh-keygen command correctly:

      1. First question: answer it with ENTER, which stores the key in the default location (the rest of this tutorial assumes the key is stored in the default location).
      2. Second question: sets a passphrase to protect the SSH private key (the key used for authentication). If you specify a passphrase, you’ll have to enter it each time the private key is used. In general, a passphrase adds another security layer to SSH keys, which is good practice: somebody in possession of the private key would also need the passphrase to use it. For the purposes of this tutorial, however, the passphrase must be empty, because the CI/CD pipeline runs non-interactively and cannot enter a passphrase.

      To summarize, run the following command and confirm both questions with ENTER to create a 4096-bit SSH key and store it in the default location with an empty passphrase:
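
      A minimal sketch of the key generation, assuming an RSA key stored under the default file name id_rsa. The -N and -f flags pre-answer both questions; this non-interactive variant and the generate_deploy_key helper are choices made for this sketch, and running plain ssh-keygen -b 4096 and pressing ENTER twice should lead to the same result.

```shell
# Generates a 4096-bit RSA key pair with an empty passphrase.
#   -N ""  sets an empty passphrase (needed for non-interactive pipeline use)
#   -f     sets the private key path; the public key is written to <path>.pub
#   -q     suppresses the progress output
generate_deploy_key() {
  key_dir="${1:-$HOME/.ssh}"          # defaults to the standard location
  mkdir -p "$key_dir" && chmod 700 "$key_dir"
  ssh-keygen -t rsa -b 4096 -N "" -f "$key_dir/id_rsa" -q
}

# As the deployer user: generate_deploy_key
```

      If a key already exists at that path, ssh-keygen asks before overwriting it.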

      To authorize the SSH key for the deployer user, you need to append the public key to the authorized_keys file:

      • cat ~/.ssh/ >> ~/.ssh/authorized_keys

      ~ is short for the user’s home directory in Linux. The cat program prints the contents of a file; here you use the >> operator to redirect the output of cat and append it to the authorized_keys file.
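
      The append step can be sketched as below, assuming the key pair was stored under the default name id_rsa, so that the public key is ~/.ssh/id_rsa.pub. The authorize_key helper and its duplicate check are additions for illustration; they keep the file from growing if the step is repeated.

```shell
# Appends a public key to an authorized_keys file unless that exact
# line is already present.
authorize_key() {
  pub_key="$1"       # e.g. ~/.ssh/id_rsa.pub (assumed default name)
  auth_file="$2"     # e.g. ~/.ssh/authorized_keys
  touch "$auth_file"
  if ! grep -qxF -f "$pub_key" "$auth_file"; then
    cat "$pub_key" >> "$auth_file"   # >> appends; a single > would overwrite the file
  fi
  chmod 600 "$auth_file"             # keep the file private to its owner
}

# As deployer: authorize_key ~/.ssh/id_rsa.pub ~/.ssh/authorized_keys
```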

      In this step you have created an SSH key pair for the CI/CD pipeline to log in and deploy the application. Next you’ll store the private key in GitLab to make it accessible during the pipeline process.

      Step 5 — Storing the Private Key in a GitLab CI/CD Variable

      You are going to store the SSH private key in a GitLab CI/CD file variable, so that the pipeline can make use of the key to log in to the server.

      When GitLab creates a CI/CD pipeline, it will send all variables to the corresponding runner and the variables will be set as environment variables for the duration of the job. In particular, the values of file variables are stored in a file and the environment variable will contain the path to this file.

      While you’re in the variables section, you’ll also add a variable for the server IP and the server user, which will inform the pipeline about the destination server and user to log in.

      Start by showing the SSH private key:

      Copy the output to your clipboard using CTRL+C. Make sure to copy everything including the BEGIN and END line:


      -----BEGIN RSA PRIVATE KEY-----
      -----END RSA PRIVATE KEY-----
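
      Because a partially copied key is an easy mistake to make, a quick check of the first and last lines may help before pasting. This sketch assumes the private key is stored at ~/.ssh/id_rsa (the default name); looks_like_private_key is a hypothetical helper, not part of any tool.

```shell
# Succeeds if the file starts with a BEGIN line and ends with an END
# line of a PEM-style private key.
looks_like_private_key() {
  head -n 1 "$1" | grep -q '^-----BEGIN .*PRIVATE KEY-----$' &&
  tail -n 1 "$1" | grep -q '^-----END .*PRIVATE KEY-----$'
}

# As deployer:
#   cat ~/.ssh/id_rsa                                   # prints the key to copy
#   looks_like_private_key ~/.ssh/id_rsa && echo "key looks complete"
```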

      Now navigate to Settings > CI/CD > Variables in your GitLab project and click Add Variable. Fill out the form as follows:

      • Key: ID_RSA
      • Value: Paste your SSH private key from your clipboard with CTRL+V
      • Type: File
      • Environment Scope: All (default)
      • Protect variable: Checked
      • Mask variable: Unchecked

      Note: The variable can’t be masked because it does not meet the regular expression requirements (see GitLab’s documentation about masked variables). However, the private key will never appear in the console log, which makes masking it unnecessary.

      A file containing the private key will be created on the runner for each CI/CD job and its path will be stored in the $ID_RSA environment variable.
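
      The difference between the variable’s value and the file it points to can be illustrated locally. The sketch below imitates what the runner does for a File-type variable; the temporary file and the shortened key body are stand-ins, not real key material.

```shell
# Simulate a File-type CI/CD variable: the content lives in a file,
# and the environment variable holds the path to that file.
ID_RSA=$(mktemp)
printf '%s\n' '-----BEGIN RSA PRIVATE KEY-----' '...' '-----END RSA PRIVATE KEY-----' > "$ID_RSA"

echo "ID_RSA is a path: $ID_RSA"
echo "First line of the file it points to: $(head -n 1 "$ID_RSA")"
```

      This is why the pipeline can later pass $ID_RSA directly to ssh -i, which expects a file path rather than the key text.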

      Create another variable with your server IP. Click Add Variable and fill out the form as follows:

      • Key: SERVER_IP
      • Value: your_server_IP
      • Type: Variable
      • Environment scope: All (default)
      • Protect variable: Checked
      • Mask variable: Checked

      Finally, create a variable with the login user. Click Add Variable and fill out the form as follows:

      • Key: SERVER_USER
      • Value: deployer
      • Type: Variable
      • Environment scope: All (default)
      • Protect variable: Checked
      • Mask variable: Checked

      You have now stored the private key in a GitLab CI/CD variable, which makes the key available during pipeline execution. In the next step, you’re moving on to configuring the CI/CD pipeline.

      Step 6 — Configuring the .gitlab-ci.yml File

      You are going to configure the GitLab CI/CD pipeline. The pipeline will build a Docker image and push it to the container registry. GitLab provides a container registry for each project. You can explore the container registry by going to Packages & Registries > Container Registry in your GitLab project (read more in GitLab’s container registry documentation). The final step in your pipeline is to log in to your server, pull the latest Docker image, remove the old container, and start a new container.

      Now you’re going to create the .gitlab-ci.yml file that contains the pipeline configuration. In GitLab, go to the Project overview page, click the + button and select New file. Then set the File name to .gitlab-ci.yml.

      (Alternatively you can clone the repository and make all following changes to .gitlab-ci.yml on your local machine, then commit and push to the remote repository.)

      To begin, add the following:


      stages:
        - publish
        - deploy

      Each job is assigned to a stage. Jobs assigned to the same stage run in parallel (if there are enough runners available). Stages will be executed in the order they were specified. Here, the publish stage will go first and the deploy stage second. Successive stages only start when the previous stage finished successfully (that is, all jobs have passed). Stage names can be chosen arbitrarily.

      When you want to combine this CD configuration with your existing CI pipeline, which tests and builds the app, you may want to add the publish and deploy stages after your existing stages, such that the deployment only takes place if the tests passed.

      Following this, add this to your .gitlab-ci.yml file:


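      . . .

      variables:
        TAG_LATEST: $CI_REGISTRY_IMAGE/$CI_COMMIT_REF_NAME:latest
        TAG_COMMIT: $CI_REGISTRY_IMAGE/$CI_COMMIT_REF_NAME:$CI_COMMIT_SHORT_SHA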

      The variables section defines environment variables that will be available in the context of a job’s script section. These variables behave like ordinary Linux environment variables; that is, you can reference them in the script by prefixing them with a dollar sign, such as $TAG_LATEST. GitLab creates some predefined variables for each job that provide context-specific information, such as the branch name or the commit hash the job is working on (read more about predefined variables). Here you compose two environment variables, TAG_LATEST and TAG_COMMIT, out of the following predefined variables:

      • CI_REGISTRY_IMAGE: Represents the URL of the container registry tied to the specific project. This URL depends on the GitLab instance, but since GitLab provides this variable, you do not need to know the exact URL.
      • CI_COMMIT_REF_NAME: The branch or tag name for which the project is built.
      • CI_COMMIT_SHORT_SHA: The first eight characters of the commit revision for which the project is built.

      Both of the variables are composed of predefined variables and will be used to tag the Docker image.

      TAG_LATEST will add the latest tag to the image. This is a common strategy to provide a tag that always represents the latest release. For each deployment, the latest image will be overwritten in the container registry with the newly built Docker image.

      TAG_COMMIT, on the other hand, uses the first eight characters of the commit SHA being deployed as the image tag, thereby creating a unique Docker image for each commit. You will be able to trace the history of Docker images down to the granularity of Git commits. This is a common technique when doing continuous deployments, because it allows you to quickly deploy an older version of the code in case of a defective deployment.
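
      To make the composition concrete, the sketch below builds both tags from sample values. The registry host, namespace, branch, and short SHA are invented for illustration; in a real job GitLab supplies them.

```shell
# Sample values standing in for GitLab's predefined variables (assumptions).
CI_REGISTRY_IMAGE="registry.example.com/sammy/my-app"
CI_COMMIT_REF_NAME="master"
CI_COMMIT_SHORT_SHA="1a2b3c4d"

# Composition as described in the text: <image base name>:<tag>
TAG_LATEST="$CI_REGISTRY_IMAGE/$CI_COMMIT_REF_NAME:latest"
TAG_COMMIT="$CI_REGISTRY_IMAGE/$CI_COMMIT_REF_NAME:$CI_COMMIT_SHORT_SHA"

echo "$TAG_LATEST"   # registry.example.com/sammy/my-app/master:latest
echo "$TAG_COMMIT"   # registry.example.com/sammy/my-app/master:1a2b3c4d
```

      Every commit yields a new TAG_COMMIT value, while TAG_LATEST stays the same and is simply moved to the newest image.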

      As you’ll explore in the coming steps, the process of rolling back a deployment to an older Git revision can be done directly in GitLab.

      $CI_REGISTRY_IMAGE/$CI_COMMIT_REF_NAME specifies the Docker image base name. According to GitLab’s documentation, a Docker image name has to follow this scheme:

      image name scheme

      <registry URL>/<namespace>/<project>/<image>

      $CI_REGISTRY_IMAGE represents the <registry URL>/<namespace>/<project> part and is mandatory because it is the project’s registry root. $CI_COMMIT_REF_NAME is optional but useful to host Docker images for different branches. In this tutorial you will only work with one branch, but it is good to build an extendable structure. In general, there are three levels of image repository names supported by GitLab:

      repository name levels

      For your TAG_COMMIT variable you used the second option, where image will be replaced with the branch name.

      Next, add the following to your .gitlab-ci.yml file:


      . . .

      publish:
        image: docker:latest
        stage: publish
        services:
          - docker:dind
        script:
          - docker build -t $TAG_COMMIT -t $TAG_LATEST .
          - docker login -u gitlab-ci-token -p $CI_BUILD_TOKEN $CI_REGISTRY
          - docker push $TAG_COMMIT
          - docker push $TAG_LATEST

      The publish section is the first job in your CI/CD configuration. Let’s break it down:

      • image is the Docker image to use for this job. The GitLab runner will create a Docker container for each job and execute the script within this container. The docker:latest image ensures that the docker command will be available.
      • stage assigns the job to the publish stage.
      • services specifies Docker-in-Docker—the dind service. This is the reason why you registered the GitLab runner in privileged mode.

      The script section of the publish job specifies the shell commands to execute for this job. The working directory is set to the repository root when these commands are executed.

      • docker build ...: Builds the Docker image based on the Dockerfile and tags it with both the commit tag and the latest tag defined in the variables section.
      • docker login ...: Logs Docker in to the project’s container registry. You use the predefined variable $CI_BUILD_TOKEN as an authentication token. GitLab generates the token, which stays valid for the job’s lifetime.
      • docker push ...: Pushes both image tags to the container registry.

      Following this, add the deploy job to your .gitlab-ci.yml:


      . . .

      deploy:
        image: alpine:latest
        stage: deploy
        tags:
          - deployment
        script:
          - chmod og= $ID_RSA
          - apk update && apk add openssh-client
          - ssh -i $ID_RSA -o StrictHostKeyChecking=no $SERVER_USER@$SERVER_IP "docker login -u gitlab-ci-token -p $CI_BUILD_TOKEN $CI_REGISTRY"
          - ssh -i $ID_RSA -o StrictHostKeyChecking=no $SERVER_USER@$SERVER_IP "docker pull $TAG_COMMIT"
          - ssh -i $ID_RSA -o StrictHostKeyChecking=no $SERVER_USER@$SERVER_IP "docker container rm -f my-app || true"
          - ssh -i $ID_RSA -o StrictHostKeyChecking=no $SERVER_USER@$SERVER_IP "docker run -d -p 80:80 --name my-app $TAG_COMMIT"

      Alpine is a lightweight Linux distribution and is sufficient as a Docker image here. You assign the job to the deploy stage. The deployment tag ensures that the job will be executed on runners that are tagged deployment, such as the runner you configured in Step 2.

      The script section of the deploy job starts with two setup commands:

      • chmod og= $ID_RSA: Revokes all permissions for group and others from the private key, such that only the owner can use it. This is required; otherwise SSH refuses to work with the private key.
      • apk update && apk add openssh-client: Updates Alpine’s package index (apk) and installs the openssh-client, which provides the ssh command.
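
      The effect of chmod og= can be observed on any file. The sketch below uses a temporary file as a stand-in for the key file whose path the runner stores in $ID_RSA.

```shell
# Stand-in for the key file the runner creates (path held in $ID_RSA).
ID_RSA=$(mktemp)

chmod 644 "$ID_RSA"                # start with read access for group and others
ls -l "$ID_RSA" | cut -c1-10       # -rw-r--r--

chmod og= "$ID_RSA"                # clear all permission bits for group and others
ls -l "$ID_RSA" | cut -c1-10       # -rw-------
```

      The owner’s permissions are left untouched; only the group and other bits are cleared.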

      Four consecutive ssh commands follow. The pattern for each is:

      ssh connect pattern for all deployment commands

      ssh -i $ID_RSA -o StrictHostKeyChecking=no $SERVER_USER@$SERVER_IP "command"

      In each ssh statement you are executing command on the remote server. To do so, you authenticate with your private key.

      The options are as follows:

      • -i stands for identity file and $ID_RSA is the GitLab variable containing the path to the private key file.
      • -o StrictHostKeyChecking=no bypasses the prompt asking whether you trust the remote host. This prompt cannot be answered in a non-interactive context such as the pipeline.
      • $SERVER_USER and $SERVER_IP are the GitLab variables you created in Step 5. They specify the remote host and login user for the SSH connection.
      • command will be executed on the remote host.
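
      The connection pattern can also be wrapped in a small shell function, sketched below. The remote helper is hypothetical and only illustrates how the options fit together; the pipeline in this tutorial spells the ssh command out on every line.

```shell
# Hypothetical helper wrapping the connection pattern described above.
# Expects ID_RSA, SERVER_USER and SERVER_IP to be set, as they are in a job.
remote() {
  ssh -i "$ID_RSA" -o StrictHostKeyChecking=no "$SERVER_USER@$SERVER_IP" "$1"
}

# Usage inside a job:
#   remote "docker pull $TAG_COMMIT"
```

      Because the command is written in double quotes, variables such as $TAG_COMMIT are expanded by the job’s shell before the command is sent to the server, so the server does not need to know them.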

      The deployment ultimately takes place by executing these four commands on your server:

      1. docker login ...: Logs Docker in to the container registry.
      2. docker pull ...: Pulls the image tagged for the current commit ($TAG_COMMIT) from the container registry.
      3. docker container rm ...: Deletes the existing container if it exists. || true makes sure that the exit code is always successful, even if there was no container running by the name my-app. This guarantees a delete if exists routine without breaking the pipeline when the container does not exist (for example, for the first deployment).
      4. docker run ...: Starts a new container from the image that was just pulled. The container will be named my-app. Port 80 on the host will be bound to port 80 of the container (the order is -p host:container). -d starts the container in detached mode; otherwise the pipeline would be stuck waiting for the command to terminate.
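
      The effect of || true on the exit code can be reproduced without Docker. In this sketch, remove_missing_container is a made-up function that always fails, standing in for a removal command that finds no container.

```shell
# Stand-in for a command that fails, such as removing a container
# that does not exist (hypothetical function for illustration).
remove_missing_container() {
  return 1
}

plain_status=0
remove_missing_container || plain_status=$?

guarded_status=0
{ remove_missing_container || true; } || guarded_status=$?

echo "without || true: exit code $plain_status"     # 1
echo "with || true: exit code $guarded_status"      # 0
```

      A job stops at the first script line that returns a non-zero exit code, which is why the guarded form keeps the first deployment from failing.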

      Note: It may seem odd to use SSH to run these commands on your server, considering that the GitLab runner executing the commands is on the exact same server. Yet it is required, because the runner executes the commands in a Docker container; without SSH, you would deploy inside the container instead of on the server. One could argue that instead of using Docker as the runner executor, you could use the shell executor to run the commands on the host itself. But that would constrain your pipeline, because the runner would have to be the same server as the one you want to deploy to. This is not a sustainable and extensible solution, because one day you may want to migrate the application to a different server or use a different runner server. In any case it makes sense to use SSH to execute the deployment commands, whether for technical or migration-related reasons.

      Let’s move on by adding this to the deploy job in your .gitlab-ci.yml:


      . . .
      deploy:
      . . .
        environment:
          name: production
          url: http://your_server_IP
        only:
          - master

      GitLab environments allow you to control the deployments within GitLab. You can examine the environments in your GitLab project by going to Operations > Environments. If the pipeline has not finished yet, no environment will be available, as no deployment has taken place so far.

      When a pipeline job defines an environment section, GitLab will create a deployment for the given environment (here production) each time the job successfully finishes. This allows you to trace all the deployments created by GitLab CI/CD. For each deployment you can see the related commit and the branch it was created for.

      There is also a re-deployment button that allows you to roll back to an older version of the software. Clicking the View deployment button opens the URL that was specified in the environment section.

      The only section defines the names of branches and tags for which the job will run. By default, GitLab will start a pipeline for each push to the repository and run all jobs (provided that the .gitlab-ci.yml file exists). The only section is one way to restrict job execution to certain branches or tags. Here you want to execute the deployment job for the master branch only. To define more complex rules on whether a job should run or not, have a look at the rules syntax.

      Note: In October 2020, GitHub changed its naming convention for the default branch from master to main. Other providers such as GitLab and the developer community in general are starting to follow this approach. The term master branch is used in this tutorial to denote the default branch, which may have a different name in your project.

      Your complete .gitlab-ci.yml file will look like the following:


      stages:
        - publish
        - deploy

      variables:
        TAG_LATEST: $CI_REGISTRY_IMAGE/$CI_COMMIT_REF_NAME:latest
        TAG_COMMIT: $CI_REGISTRY_IMAGE/$CI_COMMIT_REF_NAME:$CI_COMMIT_SHORT_SHA

      publish:
        image: docker:latest
        stage: publish
        services:
          - docker:dind
        script:
          - docker build -t $TAG_COMMIT -t $TAG_LATEST .
          - docker login -u gitlab-ci-token -p $CI_BUILD_TOKEN $CI_REGISTRY
          - docker push $TAG_COMMIT
          - docker push $TAG_LATEST

      deploy:
        image: alpine:latest
        stage: deploy
        tags:
          - deployment
        script:
          - chmod og= $ID_RSA
          - apk update && apk add openssh-client
          - ssh -i $ID_RSA -o StrictHostKeyChecking=no $SERVER_USER@$SERVER_IP "docker login -u gitlab-ci-token -p $CI_BUILD_TOKEN $CI_REGISTRY"
          - ssh -i $ID_RSA -o StrictHostKeyChecking=no $SERVER_USER@$SERVER_IP "docker pull $TAG_COMMIT"
          - ssh -i $ID_RSA -o StrictHostKeyChecking=no $SERVER_USER@$SERVER_IP "docker container rm -f my-app || true"
          - ssh -i $ID_RSA -o StrictHostKeyChecking=no $SERVER_USER@$SERVER_IP "docker run -d -p 80:80 --name my-app $TAG_COMMIT"
        environment:
          name: production
          url: http://your_server_IP
        only:
          - master

      Finally, click Commit changes at the bottom of the page in GitLab to create the .gitlab-ci.yml file. Alternatively, if you have cloned the Git repository locally, commit and push the file to the remote.

      You’ve created a GitLab CI/CD configuration for building a Docker image and deploying it to your server. In the next step you will validate the deployment.

      Step 7 — Validating the Deployment

      Now you’ll validate the deployment in various places of GitLab as well as on your server and in a browser.

      When a .gitlab-ci.yml file is pushed to the repository, GitLab will automatically detect it and start a CI/CD pipeline. At the time you created the .gitlab-ci.yml file, GitLab started the first pipeline.

      Go to CI/CD > Pipelines in your GitLab project to see the pipeline’s status. If the jobs are still running or pending, wait until they are complete. You will see a Passed pipeline with two green checkmarks, denoting that the publish and deploy jobs ran successfully.

      The pipeline overview page showing a passed pipeline

      Let’s examine the pipeline. Click the passed button in the Status column to open the pipeline’s overview page. You will get an overview of general information such as:

      • Execution duration of the whole pipeline.
      • For which commit and branch the pipeline was executed.
      • Related merge requests. If there is an open merge request for the branch in question, it will show up here.
      • All jobs executed in this pipeline as well as their status.

      Next click the deploy button to open the result page of the deploy job.

      The result page of the deploy job

      On the job result page you can see the shell output of the job’s script. This is the place to look when debugging a failed pipeline. In the right sidebar you’ll find the deployment tag you added to this job, along with the information that it was executed on your Deployment Runner.

      If you scroll to the top of the page, you will find the This job is deployed to production message. GitLab recognizes that a deployment took place because of the job’s environment section. Click the production link to move over to the production environment.

      The production environment in GitLab

      You will see an overview of all production deployments. There has been only a single deployment so far. For each deployment there is a re-deploy button on the far right. A re-deployment will repeat the deploy job of that particular pipeline.

      Whether a re-deployment works as intended depends on the pipeline configuration, because it does nothing more than repeat the deploy job under the same circumstances. Since you have configured the pipeline to deploy a Docker image using the commit SHA as a tag, a re-deployment will work for your pipeline.

      Note: Your GitLab container registry may have an expiration policy. The expiration policy regularly removes older images and tags from the container registry. As a consequence, a deployment that is older than the expiration policy would fail to re-deploy, because the Docker image for this commit will have been removed from the registry. You can manage the expiration policy in Settings > CI/CD > Container Registry tag expiration policy. The expiration interval is usually set to something high, like 90 days. If you do try to deploy an image that has been removed from the registry due to the expiration policy, you can solve the problem by re-running the publish job of that particular pipeline as well, which will re-create the image for the given commit and push it to the registry.

      Next click the View deployment button, which will open http://your_server_IP in a browser and you should see the My Personal Website headline.

      Finally, check the deployed container on your server. Head over to your terminal and log in again if you have already disconnected (this works for both users, sammy and deployer):

      Now list the running containers:

      This will list the my-app container:


      CONTAINER ID   IMAGE   COMMAND                  CREATED       STATUS       PORTS     NAMES
      5b64df4b37f8           "nginx -g 'daemon of…"   4 hours ago   Up 4 hours   >80/tcp   my-app
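
      Both checks in this step can also be scripted. The sketch below assumes docker container ls as the listing command and uses curl for the browser check; container_running and page_shows are hypothetical helper names.

```shell
# Succeeds if a running container with exactly this name is listed.
container_running() {
  docker container ls --filter "name=$1" --format '{{.Names}}' | grep -qx "$1"
}

# Succeeds if the page served at URL $1 contains the text $2.
page_shows() {
  curl -fsS "$1" | grep -qF "$2"
}

# On the server:  container_running my-app && echo "my-app is up"
# From anywhere:  page_shows "http://your_server_IP" "My Personal Website"
```

      The name filter matches substrings, so the extra grep -qx keeps a container such as my-app-2 from counting as a match.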

      Read the How To Install and Use Docker on Ubuntu 18.04 guide to learn more about managing Docker containers.

      You have now validated the deployment. In the next step, you will go through the process of rolling back a deployment.

      Step 8 — Rolling Back a Deployment

      Next you’ll update the web page, which will create a new deployment and then re-deploy the previous deployment using GitLab environments. This covers the use case of a deployment rollback in case of a defective deployment.

      Start by making a little change in the index.html file:

      1. In GitLab, go to the Project overview and open the index.html file.
      2. Click the Edit button to open the online editor.
      3. Change the file content to the following:


      <h1>My Enhanced Personal Website</h1>

      Save the changes by clicking Commit changes at the bottom of the page.

      A new pipeline will be created to deploy the changes. In GitLab, go to CI/CD > Pipelines. When the pipeline has completed, you can open http://your_server_IP in a browser for the updated web page now showing My Enhanced Personal Website instead of My Personal Website.

      When you move over to Operations > Environments > production you will see the newly created deployment. Now click the re-deploy button of the initial, older deployment:

      A list of the deployments of the production environment in GitLab with emphasis on the re-deploy button of the first deployment

      Confirm the popup by clicking the Rollback button.

      The deploy job of that older pipeline will be restarted and you will be redirected to the job’s overview page. Wait for the job to finish, then open http://your_server_IP in a browser, where you’ll see the initial headline My Personal Website showing up again.
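
      For comparison, the manual equivalent of such a rollback on the server is roughly the sequence below. This is a sketch of the same docker commands the deploy job runs, not GitLab’s own mechanism; redeploy is a hypothetical helper, and it assumes the server can still pull from the registry (for example after a docker login with valid credentials).

```shell
# Hypothetical helper: redeploys the image built for an earlier commit,
# using the same docker commands as the deploy job.
redeploy() {
  image="$1"    # the TAG_COMMIT value of the older pipeline
  docker pull "$image" &&
  { docker container rm -f my-app || true; } &&
  docker run -d -p 80:80 --name my-app "$image"
}

# On the server:
#   redeploy "<registry URL>/<namespace>/<project>/<image>:<short SHA>"
```

      Because every commit has its own image tag, rolling back only requires knowing the short SHA of the commit to return to.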

      Let’s summarize what you have achieved throughout this tutorial.


      In this tutorial, you have configured a continuous deployment pipeline with GitLab CI/CD. You created a small web project consisting of an HTML file and a Dockerfile. Then you configured the .gitlab-ci.yml pipeline configuration to:

      1. Build the Docker image.
      2. Push the Docker image to the container registry.
      3. Log in to the server, pull the latest image, stop the current container, and start a new one.

      GitLab will now deploy the web page to your server for each push to the repository.

      Furthermore you have verified a deployment in GitLab and on your server. You have also created a second deployment and rolled back to the first deployment using GitLab environments, which demonstrates how you deal with defective deployments.

      At this point you have automated the whole deployment chain. You can now share code changes more frequently with the world and your customers. As a result, development cycles are likely to become shorter, as less time is required to gather feedback and publish code changes.

      As a next step, you could make your service accessible by a domain name and secure the communication with HTTPS, for which How To Use Traefik as a Reverse Proxy for Docker Containers is a good follow-up.


      Create Your First CI/CD Pipeline on Kubernetes With Jenkins

      How to Join

      This Tech Talk is free and open to everyone. Register below to get a link to join the live event.

      Format Date RSVP
      Presentation and Q&A September 8, 2020, 11:00–12:00 p.m. ET

      If you can’t join us live, the video recording will be published here as soon as it’s available.

      About the Talk

      Setting up a Kubernetes cluster is easy, but what do you do after that? Setting up a CI/CD pipeline is one of the core concepts of DevOps. This talk will help you set up that first pipeline via Jenkins on top of a Kubernetes cluster.

      What You’ll Learn

      • Why CI/CD pipelines are important
      • How to use Jenkins Pipeline with Kubernetes

      This Talk is Designed For

      Developers and system administrators that are new to Kubernetes.


      Basic knowledge of Jenkins and Kubernetes.

      About the Presenter

      Peeyush Gupta is currently a Senior Developer Advocate at DigitalOcean. He loves developing cloud platforms, helping developers migrate legacy applications to the cloud, and serving communities through speaking at meetups and contributing to the Kubernetes Contributor Experience Group.

      To join the live Tech Talk, register here for the session of your choice.
