
      How To Store and Retrieve Data in MariaDB Using Python on Ubuntu 18.04


      The author selected the Tech Education Fund to receive a donation as part of the Write for DOnations program.

      Introduction

MariaDB is an open source, community-developed fork of the popular MySQL relational database management system (DBMS), with a SQL interface for accessing and managing data. It is highly reliable and easy to administer, which are essential qualities in a DBMS that serves modern applications. With Python’s growing popularity in technologies like artificial intelligence and machine learning, MariaDB makes a good option for a database server for Python.

      In this tutorial, you will connect a Python application to a database server using the MySQL connector. This module allows you to make queries on the database server from within your application. You’ll set up MariaDB for a Python environment on Ubuntu 18.04 and write a Python script that connects to and executes queries on MariaDB.

      Prerequisites

      Before you begin this guide, you will need the following:

      Step 1 — Preparing and Installing

      In this step, you’ll create a database and a table in MariaDB.

      First, open your terminal and enter the MariaDB shell from the terminal with the following command:
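
      • sudo mysql -u root -p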

      Once you’re in the MariaDB shell, your terminal prompt will change. In this tutorial, you’ll write Python to connect to an example employee database named workplace and a table named employees.

      Start by creating the workplace database:

      • CREATE DATABASE workplace;

      Next, tell MariaDB to use workplace as your current database:
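
      • USE workplace;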

      You will receive the following output, which means that every query you run after this will take effect in the workplace database:

      Output

      Database changed

      Next, create the employees table:

      • CREATE TABLE employees (first_name CHAR(35), last_name CHAR(35));

In the table schema, the columns first_name and last_name are specified as character strings (CHAR) with a maximum length of 35.

      Following this, exit the MariaDB shell:
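
      • quit;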

      Back in the terminal, export your MariaDB authorization credentials as environment variables:

      • export username="username"
      • export password="password"

      This technique allows you to avoid adding credentials in plain text within your script.

      You’ve set up your environment for the project. Next, you’ll begin writing your script and connect to your database.

      Step 2 — Connecting to Your Database

In this step, you will install the MySQL Connector and set up the database connection.

      In your terminal, run the following command to install the Connector:

      • pip3 install mysql-connector-python

      pip is the standard package manager for Python. mysql-connector-python is the database connector Python module.

Once you’ve successfully installed the connector, create and open a new Python file:
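
      • nano database.py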

      In the opened file, import the os module and the mysql.connector module using the import keyword:

      database.py

      import os
      import mysql.connector as database
      

      The as keyword here means that mysql.connector will be referenced as database in the rest of the code.

      Next, initialize the authorization credentials you exported as Python variables:

      database.py

      . . .
      username = os.environ.get("username")
      password = os.environ.get("password")
      

      Follow up and establish a database connection using the connect() method provided by database. The method takes a series of named arguments specifying your client credentials:

      database.py

      . . .
      connection = database.connect(
          user=username,
          password=password,
          host="localhost",
          database="workplace")
      

You declare a variable named connection that holds the call to the database.connect() method. Inside the method, you assign values to the user, password, host, and database arguments. For user and password, you reference your MariaDB authorization credentials. For host, use "localhost" when the database runs on the same system as your script.

      Lastly, call the cursor() method on the connection to obtain the database cursor:

      database.py

      . . .
      cursor = connection.cursor()
      

      A cursor is a database object that retrieves and also updates data, one row at a time, from a set of data.

      Leave your file open for the next step.

      Now you can connect to MariaDB with your credentials; next, you will add entries to your database using your script.

      Step 3 — Adding Data

      Using the execute() method on the database cursor, you will add entries to your database in this step.

Define a function add_data() to accept the first and last names of an employee as arguments. Inside the function, create a try/except block. Add the following code after your cursor object:

      database.py

      . . .
      def add_data(first_name, last_name):
          try:
              statement = "INSERT INTO employees (first_name,last_name) VALUES (%s, %s)"
              data = (first_name, last_name)
              cursor.execute(statement, data)
              connection.commit()
              print("Successfully added entry to database")
          except database.Error as e:
              print(f"Error adding entry to database: {e}")
      

      You use the try and except block to catch and handle exceptions (events or errors) that disrupt the normal flow of program execution.

      Under the try block, you declare statement as a variable holding your INSERT SQL statement. The statement tells MariaDB to add to the columns first_name and last_name.

The statement accepts data as parameters, which reduces the chance of SQL injection: with a prepared statement, parameter values are passed to the database separately from the SQL itself and are never executed as SQL, so an attacker cannot inject commands through them.

      Next you declare data as a tuple with the arguments received from the add_data function. Proceed to run the execute() method on your cursor object by passing the SQL statement and the data. After calling the execute() method, you call the commit() method on the connection to permanently save the inserted data.

      Finally, you print out a success message if this succeeds.

In the except block, which only executes when there’s an exception, you declare database.Error as e. This variable holds information about the exception that occurred when the script broke. You then print an error message, formatted with e using an f-string, to end the block.

      After adding data to the database, you’ll next want to retrieve it. The next step will take you through the process of retrieving data.

      Step 4 — Retrieving Data

      In this step, you will write a SQL query within your Python code to retrieve data from your database.

      Using the same execute() method on the database cursor, you can retrieve a database entry.

Define a function get_data() to accept the last name of an employee as an argument. Inside the function, you will call the execute() method with a SELECT SQL query to locate the matching row:

      database.py

      . . .
      def get_data(last_name):
          try:
            statement = "SELECT first_name, last_name FROM employees WHERE last_name=%s"
            data = (last_name,)
            cursor.execute(statement, data)
            for (first_name, last_name) in cursor:
              print(f"Successfully retrieved {first_name}, {last_name}")
          except database.Error as e:
            print(f"Error retrieving entry from database: {e}")
      

      Under the try block, you declare statement as a variable holding your SELECT SQL statement. The statement tells MariaDB to retrieve the columns first_name and last_name from the employees table when a specific last name is matched.

      Again, you use parameters to reduce the chances of SQL injection.

Similarly to the last function, you declare data as a tuple with last_name followed by a comma, which makes it a single-element tuple. Proceed to run the execute() method on the cursor object by passing the SQL statement and the data. Using a for loop, you iterate through the returned rows in the cursor and print out any successful matches.

      In the except block, which only executes when there is an exception, declare database.Error as e. This variable will hold information about the type of exception that occurs. You then proceed to print out an error message formatted with e to end the block.

      In the final step, you will execute your script by calling the defined functions.

      Step 5 — Running Your Script

      In this step, you will write the final piece of code to make your script executable and run it from your terminal.

      Complete your script by calling add_data() and get_data() with sample data (strings) to verify that your code is working as expected.

      If you would like to add multiple entries, you can call add_data() with further sample names of your choice.
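
      For example, with placeholder names of your own choosing:

      add_data("Ama", "Mensah")
      add_data("Kwame", "Asante")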

Once you finish working with the database, make sure that you close the connection with connection.close() to avoid wasting resources:

      database.py

      import os
      import mysql.connector as database
      
      username = os.environ.get("username")
      password = os.environ.get("password")
      
      connection = database.connect(
          user=username,
          password=password,
          host="localhost",
          database="workplace")
      
      cursor = connection.cursor()
      
      def add_data(first_name, last_name):
          try:
              statement = "INSERT INTO employees (first_name,last_name) VALUES (%s, %s)"
              data = (first_name, last_name)
              cursor.execute(statement, data)
              connection.commit()
              print("Successfully added entry to database")
          except database.Error as e:
              print(f"Error adding entry to database: {e}")
      
      def get_data(last_name):
          try:
            statement = "SELECT first_name, last_name FROM employees WHERE last_name=%s"
            data = (last_name,)
            cursor.execute(statement, data)
            for (first_name, last_name) in cursor:
              print(f"Successfully retrieved {first_name}, {last_name}")
          except database.Error as e:
            print(f"Error retrieving entry from database: {e}")
      
      add_data("Kofi", "Doe")
      get_data("Doe")
      
      connection.close()
      

      Make sure you have indented your code correctly to avoid errors.

In the same directory where you created the database.py file, run your script with:
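
      • python3 database.py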

      You will receive the following output:

      Output

      Successfully added entry to database
      Successfully retrieved Kofi, Doe

      Finally, return to MariaDB to confirm you have successfully added your entries.

      Open up the MariaDB prompt from your terminal:
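
      • sudo mysql -u root -p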

      Next, tell MariaDB to switch to and use the workplace database:
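
      • USE workplace;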

      After you get the success message Database changed, proceed to query for all entries in the employees table:
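
      • SELECT * FROM employees;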

Your output will be similar to the following:

      Output

      +------------+-----------+
      | first_name | last_name |
      +------------+-----------+
      | Kofi       | Doe       |
      +------------+-----------+
      1 row in set (0.00 sec)

      Putting it all together, you’ve written a script that saves and retrieves information from a MariaDB database.

      You started by importing the necessary libraries. You used mysql-connector to connect to the database and os to retrieve authorization credentials from the environment. On the database connection, you retrieved the cursor to carry out queries and structured your code into add_data and get_data functions. With your functions, you inserted data into and retrieved data from the database.

      If you wish to implement deletion, you can build a similar function with the necessary declarations, statements, and calls.
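
      As a minimal sketch, such a function could mirror add_data() and get_data() with a parameterized DELETE statement followed by a commit (the delete_data() name here is illustrative, not part of the script above):

      def delete_data(last_name):
          try:
              # Parameterized DELETE, matching the pattern used in add_data()
              statement = "DELETE FROM employees WHERE last_name=%s"
              data = (last_name,)
              cursor.execute(statement, data)
              # Commit so the deletion is permanently saved
              connection.commit()
              print("Successfully deleted entry from database")
          except database.Error as e:
              print(f"Error deleting entry from database: {e}")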

      Conclusion

      You have successfully set up a database connection to MariaDB using a Python script on Ubuntu 18.04. From here, you could use similar code in any of your Python projects in which you need to store data in a database. This guide may also be helpful for other relational databases that were developed out of MySQL.

      For more on how to accomplish your projects with Python, check out other community tutorials on Python.




      How To Use Vue.js Environment Modes with a Node.js Mock Data Layer


      The author selected Open Sourcing Mental Illness to receive a donation as part of the Write for DOnations program.

      Introduction

      When it comes to software development, there are two ends of the stack. A stack is a collection of technologies used for your software to function. Vue.js, the progressive user interface framework, is part of the frontend, the part of the stack that a user directly interacts with. This frontend is also referred to as the client and encompasses everything that is rendered in the user’s browser. Technologies such as HTML, JavaScript, and CSS are all rendered in the client. In contrast, the backend commonly interacts with data or servers through technologies like Java, Kotlin, or .NET.

The application is the data itself, and the interface (frontend) is a way to display data meaningfully to the user for them to interact with. In the beginning phase of software development, you don’t need a backend to get started. In some cases, the backend hasn’t even been created yet. In a case such as this, you can create your own local data to build your interface. Using Node environments and variables, you can toggle different datasets per environment or toggle between local data and “live” data via a network call. Having a mock data layer is useful if you do not have data yet, because it provides data to test your frontend before the backend is ready.

      By the end of this tutorial, you will have created several Node environments and toggled these datasets with Node environment variables. To illustrate these concepts, you will create a number of Vue components to visualize this data across environments.

      Prerequisites

      To complete this tutorial, you will need:

      Step 1 — Creating Environments with Modes

      Modes are an important concept when it comes to Vue CLI projects. A mode is an environment type, or a set of variables that gets loaded during a build. These variables are stored in .env files in the root directory of your project. As part of the default vue-cli-service plugin, you immediately have access to three modes. These are:

      • development: used when vue-cli-service serve is executed.
      • test: used when vue-cli-service test:unit is executed.
      • production: used when vue-cli-service build and vue-cli-service test:e2e are executed.

Perhaps the most used mode is development mode. This is the mode that Vue.js developers use when working on their application on their local machine. This mode starts a local Node server with hot-module reloading (instant browser refreshing) enabled. The test mode, on the other hand, is a mode to run your unit tests in. Unit tests are JavaScript functions that test application methods, events, and in some cases, user interaction. The last default mode is production. This compresses all of your code and optimizes it for performance so it can be hosted on a production server.

      The project that is generated for you from Vue CLI has these commands pre-mapped to npm run serve, npm run test:unit, and npm run build.

      Each environment is associated with its own .env file, where you can put custom Node environment key/value pairs that your application can reference. You will not have these files after generating a project from the CLI, but you can add these using one command in your terminal.

      You will now generate a development environment file, which you will use later in the tutorial. Open your terminal and enter the following in the root directory of your project:
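
      • touch .env.development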

      Open the newly created file in your text editor of choice. In this file, you’ll want to explicitly define the environment type. This is a key/value pair that could be anything you want. However, it’s considered best practice to define the environment type that corresponds with the name of the .env file.

      You will be using this NODE_ENV later in the tutorial by loading different data sets depending on the environment or mode selected on build. Add the following line:

      .env.development

      NODE_ENV="development"
      

      Save and exit the file.

The key/value pairs in this file will only affect your program when the application is in development mode. It’s important to note that Git will track these files by default, so if they contain sensitive values, add them to your .gitignore file or append the .local extension to the file name (.env.development.local), which the generated project’s .gitignore excludes.

      You are not limited to the standard environments that Vue.js provides. You may have several other environments that are specific to your workflow. Next, you’ll create a custom environment for a staging server.

      Start by creating the .env file for the staging environment. Open your terminal of choice and in the root directory run the following:
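
      • touch .env.staging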

In this file, create a key/value pair that defines the NODE_ENV of this project. You can open this file in your text editor of choice, or modify it from the terminal with an editor such as nano or Vim.

      You will be using this NODE_ENV later in the tutorial by loading different data sets depending on the environment or mode selected on build.

      Add the following to the file:

      .env.staging

      NODE_ENV="staging"
      

Save and exit the file. In nano, you can save and exit with CTRL+X, then Y, then ENTER.

      In order to use this environment, you can register a new script in your package.json file. Open this file now.
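
      • nano package.json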

      In the "scripts" section, add the following highlighted line:

      package.json

      {
        ...
        "scripts": {
          ...
          "staging": "vue-cli-service serve --mode staging",
        },
        ...
      }
      

      Save this file, then exit the editor.

      You’ve just created a new script that can be executed with npm run staging. The flag --mode lets Vue CLI know to use the .env.staging (or .env.staging.local) file when starting the local server.

      In this step, you created custom NODE_ENV variables for two Vue.js modes: development and staging. These modes will come in handy in the following steps when you create custom datasets for each of these modes. By running the project in one mode or the other, you can load different data sets by reading these files.

      Step 2 — Creating a Mock Data Layer

      As stated in the Introduction, you can start developing your user interface without a backend by using a mock data layer. A mock data layer is static data that is stored locally for your application to reference. When working with Vue, these data files are JavaScript objects and arrays. Where these are stored is your personal preference. For the sake of this tutorial, mock data files will be stored in a directory named data.

      In this tutorial, you’re building a “main/detail” airport browser application. In the “main” view, the user will see a number of airport codes and locations.

      In your terminal, in the root project directory, create a new directory using the mkdir command:
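
      • mkdir data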

      Now create a .js file named airports.staging.mock.js using the touch command. Naming these files is personal preference, but it’s generally a good idea to differentiate this mock data from essential files in your app. For the sake of this tutorial, mock files will follow this naming convention: name.environment.mock.js.

      Create the file with the following command:

      • touch data/airports.staging.mock.js

      In your editor of choice, open this newly created JavaScript file and add the following array of objects:

      data/airports.staging.mock.js

      const airports = [
          {
              name: 'Cincinnati/Northern Kentucky International Airport',
              abbreviation: 'CVG',
              city: 'Hebron',
              state: 'KY'
          },
          {
              name: 'Seattle-Tacoma International Airport',
              abbreviation: 'SEA',
              city: 'Seattle',
              state: 'WA'
          },
          {
              name: 'Minneapolis-Saint Paul International Airport',
              abbreviation: 'MSP',
              city: 'Bloomington',
              state: 'MN'
          }
      ]
      
      export default airports
      

      In this code block, you are creating objects that represent airports in the United States and providing their name, abbreviation, and the city and state in which they are located. You then export the array to make it available to other parts of your program. This will act as your “staging” data.

Next, create a dataset for another environment, “development”, the default environment when running npm run serve. To follow the naming convention, create a new file in your terminal with the touch command and name it airports.development.mock.js:

      • touch data/airports.development.mock.js

      In your editor of choice, open this newly created JavaScript file and add the following array of objects:

      data/airports.development.mock.js

      const airports = [
          {
              name: 'Louis Armstrong New Orleans International Airport',
              abbreviation: 'MSY',
              city: 'New Orleans',
              state: 'LA'
          },
          {
              name: 'Denver International Airport',
              abbreviation: 'DEN',
              city: 'Denver',
              state: 'CO'
          },
          {
              name: 'Philadelphia International Airport',
              abbreviation: 'PHL',
              city: 'Philadelphia',
              state: 'PA'
          }
      ]
      
      export default airports
      

This will act as your “development” data when you run npm run serve.

      Now that you’ve created the mock data for your environments, in the next step, you are going to iterate or loop through that data with the v-for directive and start building out the user interface. This will give you a visual representation of the change when using the different modes.

      Step 3 — Iterating Through Mock Data in App.vue

      Now that you have your mock data, you can test out how useful environments are by iterating through this data in the App.vue component in the src directory.

      First, open up App.vue in your editor of choice.

      Once it is open, delete all of the HTML inside the template tags and remove the import statement in the script section. Also, delete the HelloWorld component in the export object. Some general styles have also been provided to make the data easier to read.

      Add the following highlighted lines to your App.vue file:

      src/App.vue

      <template>
      
      </template>
      
      <script>
      export default {
        name: 'App',
      }
      </script>
      
      <style>
      #app {
        font-family: Avenir, Helvetica, Arial, sans-serif;
        -webkit-font-smoothing: antialiased;
        -moz-osx-font-smoothing: grayscale;
        text-align: center;
        color: #2c3e50;
        margin-top: 60px;
      }
      
      .wrapper {
        display: grid;
        grid-template-columns: 1fr 1fr 1fr;
        grid-column-gap: 1rem;
        max-width: 960px;
        margin: 0 auto;
      }
      
      .airport {
        border: 3px solid;
        border-radius: .5rem;
        padding: 1rem;
      }
      
      .airport p:first-child {
        font-weight: bold;
        font-size: 2.5rem;
        margin: 1rem 0;
      }
      
      .airport p:last-child {
        font-style: italic;
        font-size: .8rem;
      }
      </style>
      

In this case, CSS Grid is used to lay out the airport code cards in a grid of three columns. Notice how this grid is set up in the .wrapper class. The .airport class styles the card that contains each airport code, name, and location.

      Next, import the development mock data that was created earlier. Since this is vanilla JavaScript, you can import it via an import statement. You will also need to register this data with the data property so the Vue template has access to this data.

      Add the following highlighted lines:

      src/App.vue

      ...
      <script>
      import airports from '../data/airports.development.mock'
      
      export default {
        name: 'App',
        data() {
          return {
            airports
          }
        }
      }
      </script>
      ...
      

      Now that the mock data has been imported, you can start using it to build your interface. In the case of this application, iterate through this data with the v-for directive and display it in your template:

      src/App.vue

      <template>
        <div class="wrapper">
          <div v-for="airport in airports" :key="airport.abbreviation" class="airport">
            <p>{{ airport.abbreviation }}</p>
            <p>{{ airport.name }}</p>
            <p>{{ airport.city }}, {{ airport.state }}</p>
          </div>
        </div>
      </template>
      ...
      

      v-for in this case is used to render the list of airports.

      Save and close the file.

      In your terminal, start the development server by running the command:
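
      • npm run serve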

Vue CLI will provide you a local address, generally localhost:8080. Visit that address in your browser of choice. You will find the data from airports.development.mock.js rendered in your browser:

      Styled cards containing airport data from the development dataset.

      At this point, you created a static mock data layer and loaded it when you executed npm run serve. However, you will notice that if you stop the server (CTRL+C) and start the staging environment, you will have the same result in your browser. In this next step, you are going to tell your app to load a set of data for each environment. To achieve this, you can use a Vue computed property.

      Step 4 — Loading Environment-Specific Data with Computed Properties

In Vue, computed properties are component functions that must return a value. These functions cannot accept arguments, and their results are cached by Vue. Computed properties are very useful when you need to perform logic and assign the return value to a property. In this respect, computed properties act similarly to data properties as far as the template is concerned.

      In this step, you will use computed properties to use different datasets for the staging and the development environment.

      Start by opening src/App.vue and importing both sets of data:

      src/App.vue

      ...
      <script>
      import DevData from '../data/airports.development.mock'
      import StagingData from '../data/airports.staging.mock'
      
      export default {
        name: 'App',
        ...
      }
      </script>
      ...
      

      If you still have one of the environments running, your data will disappear. That is because you removed the data property that connected your JavaScript mock data to the template.

Next, create a computed property named airports. The function name is important here because Vue takes the return value of the function and assigns it to airports for the template to consume. In this computed property, you’ll need to write a bit of logic: an if/else statement that evaluates the Node environment name. To get the value of the Node environment in Vue, you can access process.env.NODE_ENV. When you created your environment files, you assigned NODE_ENV to development and staging respectively.

      src/App.vue

      ...
      <script>
      import DevData from '../data/airports.development.mock'
      import StagingData from '../data/airports.staging.mock'
      
      export default {
        name: 'App',
        computed: {
            airports() {
              if (process.env.NODE_ENV === 'development') return DevData
              else return StagingData
            }
        }
      }
      </script>
      ...
      

      Now you are loading each set of data per its respective environment.

      In your terminal, start the local development environment with npm run serve.

      The data will be identical to what it was before.

      Now, start the staging environment by first stopping the server and then executing npm run staging in your terminal window.

      When you visit localhost:8080 in your browser, you will find a different set of data.

      Styled cards containing airport data from the staging dataset.

      Conclusion

      In this tutorial, you worked with different Vue.js environment modes and added environment scripts to your package.json file. You also created mock data for a number of environments and iterated through the data using the v-for directive.

By using this approach to make a temporary backend, you can focus on the development of your interface and the frontend of your application. You are also not limited to the default environments provided by Vue, or to any particular number of environments. It isn’t uncommon to have .env files for four or more environments: development, staging, user acceptance testing (UAT), and production.

      For more information on Vue.js and Vue CLI 3, it’s recommended to read through their documentation. For more tutorials on Vue, check out the Vue Topic Page.




      How To Build a Data Processing Pipeline Using Luigi in Python on Ubuntu 20.04


      The author selected the Free and Open Source Fund to receive a donation as part of the Write for DOnations program.

      Introduction

      Luigi is a Python package that manages long-running batch processing, which is the automated running of data processing jobs on batches of items. Luigi allows you to define a data processing job as a set of dependent tasks. For example, task B depends on the output of task A. And task D depends on the output of task B and task C. Luigi automatically works out what tasks it needs to run to complete a requested job.
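
      As a minimal sketch of this idea (TaskA and TaskB here are hypothetical names, not tasks from this tutorial’s pipeline), a task declares what it depends on by returning that task from a requires() method, and Luigi runs the dependency first whenever its output does not already exist:

      import luigi

      class TaskA(luigi.Task):
          def output(self):
              return luigi.LocalTarget("a.txt")

          def run(self):
              with self.output().open("w") as outfile:
                  outfile.write("output of task A")

      class TaskB(luigi.Task):
          def requires(self):
              # TaskB depends on the output of TaskA
              return TaskA()

          def output(self):
              return luigi.LocalTarget("b.txt")

          def run(self):
              # self.input() refers to the output() target of the required TaskA
              with self.input().open("r") as infile, self.output().open("w") as outfile:
                  outfile.write(infile.read().upper())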

Overall, Luigi provides a framework to develop and manage data processing pipelines. It was originally developed by Spotify, who use it to plumb together collections of tasks that need to fetch and process data from a variety of sources. Within Luigi, developers at Spotify built functionality to help with their batch processing needs, including handling of failures, the ability to automatically resolve dependencies between tasks, and visualization of task processing. Spotify uses Luigi to support batch processing jobs, including providing music recommendations to users, populating internal dashboards, and calculating lists of top songs.

In this tutorial, you will build a data processing pipeline to analyze the most common words from the most popular books on Project Gutenberg. To do this, you will build a pipeline using the Luigi package. You will use Luigi tasks, targets, dependencies, and parameters to build your pipeline.

      Prerequisites

      To complete this tutorial, you will need the following:

      Step 1 — Installing Luigi

      In this step, you will create a clean sandbox environment for your Luigi installation.

First, create a project directory. For this tutorial, name it luigi-demo:
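
      • mkdir luigi-demo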

      Navigate into the newly created luigi-demo directory:
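
      • cd luigi-demo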

      Create a new virtual environment luigi-venv:

      • python3 -m venv luigi-venv

      And activate the newly created virtual environment:

      • . luigi-venv/bin/activate

      You will find (luigi-venv) appended to the front of your terminal prompt to indicate which virtual environment is active:

      Output

      (luigi-venv) username@hostname:~/luigi-demo$

      For this tutorial, you will need three libraries: luigi, beautifulsoup4, and requests. The requests library streamlines making HTTP requests; you will use it to download the Project Gutenberg book lists and the books to analyze. The beautifulsoup4 library provides functions to parse data from web pages; you will use it to parse out a list of the most popular books on the Project Gutenberg site.

      Run the following command to install these libraries using pip:

      • pip install wheel luigi beautifulsoup4 requests

      You will get a response confirming the installation of the latest versions of the libraries and all of their dependencies:

      Output

      Successfully installed beautifulsoup4-4.9.1 certifi-2020.6.20 chardet-3.0.4 docutils-0.16 idna-2.10 lockfile-0.12.2 luigi-3.0.1 python-daemon-2.2.4 python-dateutil-2.8.1 requests-2.24.0 six-1.15.0 soupsieve-2.0.1 tornado-5.1.1 urllib3-1.25.10

      You’ve installed the dependencies for your project. Now, you’ll move on to building your first Luigi task.

      Step 2 — Creating a Luigi Task

      In this step, you will create a “Hello World” Luigi task to demonstrate how they work.

      A Luigi task is where the execution of your pipeline and the definition of each task’s input and output dependencies take place. Tasks are the building blocks that you will create your pipeline from. You define them in a class, which contains:

      • A run() method that holds the logic for executing the task.
      • An output() method that returns the artifacts generated by the task. The run() method populates these artifacts.
• An optional requires() method that returns any additional tasks in your pipeline that must run before the current task. The run() method can then consume their output through the input() method.

      Create a new file hello-world.py:
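
      • nano hello-world.py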

      Now add the following code to your file:

      hello-world.py

      import luigi
      
      class HelloLuigi(luigi.Task):
      
          def output(self):
              return luigi.LocalTarget('hello-luigi.txt')
      
          def run(self):
              with self.output().open("w") as outfile:
                  outfile.write("Hello Luigi!")
      
      

You define that HelloLuigi() is a Luigi task by subclassing luigi.Task.

      The output() method defines one or more Target outputs that your task produces. In the case of this example, you define a luigi.LocalTarget, which is a local file.

      Note: Luigi allows you to connect to a variety of common data sources including AWS S3 buckets, MongoDB databases, and SQL databases. You can find a complete list of supported data sources in the Luigi docs.

The run() method contains the code you want to execute for your pipeline stage. For this example, you open the output() target file in write mode with self.output().open("w") as outfile and write "Hello Luigi!" to it with outfile.write("Hello Luigi!").

      To execute the task you created, run the following command:

      • python -m luigi --module hello-world HelloLuigi --local-scheduler

      Here, you run the task using python -m instead of executing the luigi command directly; this is because Luigi can only execute code that is within the current PYTHONPATH. You can alternatively add PYTHONPATH='.' to the front of your Luigi command, like so:

      • PYTHONPATH='.' luigi --module hello-world HelloLuigi --local-scheduler

      With the --module hello-world HelloLuigi flag, you tell Luigi which Python module and Luigi task to execute.

      The --local-scheduler flag tells Luigi to not connect to a Luigi scheduler and, instead, execute this task locally. (We explain the Luigi scheduler in Step 4.) Running tasks using the local-scheduler flag is only recommended for development work.

      Luigi will output a summary of the executed tasks:

      Output

      ===== Luigi Execution Summary =====
      
      Scheduled 1 tasks of which:
      * 1 ran successfully:
          - 1 HelloLuigi()
      
      This progress looks :) because there were no failed tasks or missing dependencies
      
      ===== Luigi Execution Summary =====

      And it will create a new file hello-luigi.txt with content:

      hello-luigi.txt

      Hello Luigi!
      

      You have created a Luigi task that generates a file and then executed it using the Luigi local-scheduler. Now, you’ll create a task that can extract a list of books from a web page.

Step 3 — Downloading the Book List

      In this step, you will create a Luigi task and define a run() method for the task to download a list of the most popular books on Project Gutenberg. You’ll define an output() method to store links to these books in a file. You will run these using the Luigi local scheduler.

      Create a new directory data inside of your luigi-demo directory. This will be where you will store the files defined in the output() methods of your tasks. You need to create the directories before running your tasks—Python throws exceptions when you try to write a file to a directory that does not exist yet:

      • mkdir data
      • mkdir data/counts
      • mkdir data/downloads

      Create a new file word-frequency.py:
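
      • nano word-frequency.py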

      Insert the following code, which is a Luigi task to extract a list of links to the top most-read books on Project Gutenberg:

      word-frequency.py

      import requests
      import luigi
      from bs4 import BeautifulSoup
      
      
      class GetTopBooks(luigi.Task):
          """
          Get list of the most popular books from Project Gutenberg
          """
      
          def output(self):
              return luigi.LocalTarget("data/books_list.txt")
      
          def run(self):
              resp = requests.get("http://www.gutenberg.org/browse/scores/top")
      
              soup = BeautifulSoup(resp.content, "html.parser")
      
              pageHeader = soup.find_all("h2", string="Top 100 EBooks yesterday")[0]
              listTop = pageHeader.find_next_sibling("ol")
      
              with self.output().open("w") as f:
                  for result in listTop.select("li>a"):
                      if "/ebooks/" in result["href"]:
                          f.write("http://www.gutenberg.org{link}.txt.utf-8\n"
                              .format(
                                  link=result["href"]
                              )
                          )
      

      You define an output() target of file "data/books_list.txt" to store the list of books.

      In the run() method, you:

      • use the requests library to download the HTML contents of the Project Gutenberg top books page.
• use the BeautifulSoup library to parse the contents of the page. The BeautifulSoup library allows you to scrape information out of web pages. To find out more about using the BeautifulSoup library, read the How To Scrape Web Pages with Beautiful Soup and Python 3 tutorial.
      • open the output file defined in the output() method.
• iterate over the HTML structure to get all of the links in the Top 100 EBooks yesterday list. For this page, this means locating all links <a> that are within a list item <li>. For each of those links, if the href contains /ebooks/, you can assume it points at a book and write that link to your output() file.

      Screenshot of the Project Gutenberg top books web page with the top ebooks links highlighted

      Save and exit the file once you’re done.

      Execute this new task using the following command:

      • python -m luigi --module word-frequency GetTopBooks --local-scheduler

      Luigi will output a summary of the executed tasks:

      Output

      ===== Luigi Execution Summary =====
      
      Scheduled 1 tasks of which:
      * 1 ran successfully:
          - 1 GetTopBooks()
      
      This progress looks :) because there were no failed tasks or missing dependencies
      
      ===== Luigi Execution Summary =====

      In the data directory, Luigi will create a new file (data/books_list.txt). Run the following command to output the contents of the file:
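
      • cat data/books_list.txt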

      This file contains a list of URLs extracted from the Project Gutenberg top projects list:

      Output

      http://www.gutenberg.org/ebooks/1342.txt.utf-8
      http://www.gutenberg.org/ebooks/11.txt.utf-8
      http://www.gutenberg.org/ebooks/2701.txt.utf-8
      http://www.gutenberg.org/ebooks/1661.txt.utf-8
      http://www.gutenberg.org/ebooks/16328.txt.utf-8
      http://www.gutenberg.org/ebooks/45858.txt.utf-8
      http://www.gutenberg.org/ebooks/98.txt.utf-8
      http://www.gutenberg.org/ebooks/84.txt.utf-8
      http://www.gutenberg.org/ebooks/5200.txt.utf-8
      http://www.gutenberg.org/ebooks/51461.txt.utf-8
      ...

      You’ve created a task that can extract a list of books from a web page. In the next step, you’ll set up a central Luigi scheduler.

      Step 4 — Running the Luigi Scheduler

      Now, you’ll launch the Luigi scheduler to execute and visualize your tasks. You will take the task developed in Step 3 and run it using the Luigi scheduler.

So far, you have been running Luigi using the --local-scheduler flag to run your jobs locally without allocating work to a central scheduler. This is useful for development, but for production usage it is recommended to use the Luigi scheduler. The Luigi scheduler provides:

      • A central point to execute your tasks.
      • Visualization of the execution of your tasks.

      To access the Luigi scheduler interface, you need to enable access to port 8082. To do this, run the following command:
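
      • sudo ufw allow 8082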

      To run the scheduler execute the following command:

      • sudo sh -c ". luigi-venv/bin/activate ;luigid --background --port 8082"

      Note: We have re-run the virtualenv activate script as root, before launching the Luigi scheduler as a background task. This is because when running sudo the virtualenv environment variables and aliases are not carried over.

      If you do not want to run as root, you can run the Luigi scheduler as a background process for the current user. This command runs the Luigi scheduler in the background and hides messages from the scheduler background task. You can find out more about managing background processes in the terminal at How To Use Bash’s Job Control to Manage Foreground and Background Processes:

      • luigid --port 8082 > /dev/null 2> /dev/null &

      Open a browser to access the Luigi interface. This will either be at http://your_server_ip:8082, or if you have set up a domain for your server http://your_domain:8082. This will open the Luigi user interface.

      Luigi default user interface

      By default, Luigi tasks run using the Luigi scheduler. To run one of your previous tasks using the Luigi scheduler omit the --local-scheduler argument from the command. Re-run the task from Step 3 using the following command:

      • python -m luigi --module word-frequency GetTopBooks

      Refresh the Luigi scheduler user interface. You will find the GetTopBooks task added to the run list and its execution status.

      Luigi User Interface after running the GetTopBooks Task

      You will continue to refer back to this user interface to monitor the progress of your pipeline.

      Note: If you’d like to secure your Luigi scheduler through HTTPS, you can serve it through Nginx. To set up an Nginx server using HTTPS follow: How To Secure Nginx with Let’s Encrypt on Ubuntu 20.04. See Github - Luigi - Pull Request 2785 for suggestions on a suitable Nginx configuration to connect the Luigi server to Nginx.

      You’ve launched the Luigi Scheduler and used it to visualize your executed tasks. Next, you will create a task to download the list of books that the GetTopBooks() task outputs.

      Step 5 — Downloading the Books

      In this step you will create a Luigi task to download a specified book. You will define a dependency between this newly created task and the task created in Step 3.

      First open your file:
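
      • nano word-frequency.py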

      Add an additional class following your GetTopBooks() task to the word-frequency.py file with the following code:

      word-frequency.py

      . . .
      class DownloadBooks(luigi.Task):
          """
          Download a specified list of books
          """
          FileID = luigi.IntParameter()
      
          REPLACE_LIST = """.,"';_[]:*-"""
      
          def requires(self):
              return GetTopBooks()
      
          def output(self):
              return luigi.LocalTarget("data/downloads/{}.txt".format(self.FileID))
      
          def run(self):
              with self.input().open("r") as i:
                  URL = i.read().splitlines()[self.FileID]
      
                  with self.output().open("w") as outfile:
                      book_downloads = requests.get(URL)
                      book_text = book_downloads.text
      
                      for char in self.REPLACE_LIST:
                          book_text = book_text.replace(char, " ")
      
                      book_text = book_text.lower()
                      outfile.write(book_text)
      

      In this task you introduce a Parameter; in this case, an integer parameter. Luigi parameters are inputs to your tasks that affect the execution of the pipeline. Here you introduce a parameter FileID to specify a line in your list of URLs to fetch.

      You have added an additional method to your Luigi task, def requires(); in this method you define the Luigi task that you need the output of before you can execute this task. You require the output of the GetTopBooks() task you defined in Step 3.

      In the output() method, you define your target. You use the FileID parameter to create a name for the file created by this step. In this case, you format data/downloads/{FileID}.txt.

      In the run() method, you:

      • open the list of books generated in the GetTopBooks() task.
      • get the URL from the line specified by parameter FileID.
      • use the requests library to download the contents of the book from the URL.
• filter out special characters inside the book (those listed in REPLACE_LIST, such as punctuation), so they don’t get included in your word analysis.
      • convert the text to lowercase so you can compare words with different cases.
      • write the filtered output to the file specified in the output() method.

      Save and exit your file.

      Run the new DownloadBooks() task using this command:

      • python -m luigi --module word-frequency DownloadBooks --FileID 2

      In this command, you set the FileID parameter using the --FileID argument.

      Note: Be careful when defining a parameter with an _ in the name. To reference them in Luigi you need to substitute the _ for a -. For example, a File_ID parameter would be referenced as --File-ID when calling a task from the terminal.

      You will receive the following output:

      Output

      ===== Luigi Execution Summary =====
      
      Scheduled 2 tasks of which:
      * 1 complete ones were encountered:
          - 1 GetTopBooks()
      * 1 ran successfully:
          - 1 DownloadBooks(FileID=2)
      
      This progress looks :) because there were no failed tasks or missing dependencies
      
      ===== Luigi Execution Summary =====

      Note from the output that Luigi has detected that you have already generated the output of GetTopBooks() and skipped running that task. This functionality allows you to minimize the number of tasks you have to execute as you can re-use successful output from previous runs.

      You have created a task that uses the output of another task and downloads a set of books to analyze. In the next step, you will create a task to count the most common words in a downloaded book.

      Step 6 — Counting Words and Summarizing Results

      In this step, you will create a Luigi task to count the frequency of words in each of the books downloaded in Step 5. This will be your first task that executes in parallel.

      First open your file again:
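
      • nano word-frequency.py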

      Add the following imports to the top of word-frequency.py:

      word-frequency.py

      from collections import Counter
      import pickle
      

      Add the following task to word-frequency.py, after your DownloadBooks() task. This task takes the output of the previous DownloadBooks() task for a specified book, and returns the most common words in that book:

      word-frequency.py

      class CountWords(luigi.Task):
          """
          Count the frequency of the most common words from a file
          """
      
          FileID = luigi.IntParameter()
      
          def requires(self):
              return DownloadBooks(FileID=self.FileID)
      
          def output(self):
              return luigi.LocalTarget(
                  "data/counts/count_{}.pickle".format(self.FileID),
                  format=luigi.format.Nop
              )
      
          def run(self):
              with self.input().open("r") as i:
                  word_count = Counter(i.read().split())
      
                  with self.output().open("w") as outfile:
                      pickle.dump(word_count, outfile)
      

      When you define requires() you pass the FileID parameter to the next task. When you specify that a task depends on another task, you specify the parameters you need the dependent task to be executed with.

      In the run() method you:

      • open the file generated by the DownloadBooks() task.
      • use the built-in Counter object in the collections library. This provides an easy way to analyze the most common words in a book.
      • use the pickle library to store the output of the Python Counter object, so you can re-use that object in a later task. pickle is a library that you use to convert Python objects into a byte stream, which you can store and restore into a later Python session. You have to set the format property of the luigi.LocalTarget to allow it to write the binary output the pickle library generates.

      Save and exit your file.

      Run the new CountWords() task using this command:

      • python -m luigi --module word-frequency CountWords --FileID 2

      Open the CountWords task graph view in the Luigi scheduler user interface.

      Showing how to view a graph from the Luigi user interface

      Deselect the Hide Done option, and deselect Upstream Dependencies. You will find the flow of execution from the tasks you have created.

      Visualizing the execution of the CountWords task

      You have created a task to count the most common words in a downloaded book and visualized the dependencies between those tasks. Next, you will define parameters that you can use to customize the execution of your tasks.

      Step 7 — Defining Configuration Parameters

      In this step, you will add configuration parameters to the pipeline. These will allow you to customize how many books to analyze and the number of words to include in the results.

      When you want to set parameters that are shared among tasks, you can create a Config() class. Other pipeline stages can reference the parameters defined in the Config() class; these are set by the pipeline when executing a job.

      Add the following Config() class to the end of word-frequency.py. This will define two new parameters in your pipeline for the number of books to analyze and the number of most frequent words to include in the summary:

      word-frequency.py

      class GlobalParams(luigi.Config):
          NumberBooks = luigi.IntParameter(default=10)
          NumberTopWords = luigi.IntParameter(default=500)
      

Add the following class to word-frequency.py. This class aggregates the results from all of the CountWords() tasks to create a summary of the most frequent words:

      word-frequency.py

      class TopWords(luigi.Task):
          """
          Aggregate the count results from the different files
          """
      
          def requires(self):
              requiredInputs = []
              for i in range(GlobalParams().NumberBooks):
                  requiredInputs.append(CountWords(FileID=i))
              return requiredInputs
      
          def output(self):
              return luigi.LocalTarget("data/summary.txt")
      
          def run(self):
              total_count = Counter()
              for input in self.input():
                  with input.open("rb") as infile:
                      nextCounter = pickle.load(infile)
                      total_count += nextCounter
      
              with self.output().open("w") as f:
                  for item in total_count.most_common(GlobalParams().NumberTopWords):
                  f.write("{0: <15}{1}\n".format(*item))
      
      

      In the requires() method, you can provide a list where you want a task to use the output of multiple dependent tasks. You use the GlobalParams().NumberBooks parameter to set the number of books you need word counts from.

      In the output() method, you define a data/summary.txt output file that will be the final output of your pipeline.

      In the run() method you:

      • create a Counter() object to store the total count.
• for each count produced by a CountWords() task, open the file and “unpickle” it (convert it from a byte stream back into a Python Counter object).
• add the loaded count to the total count.
• write the most common words to the target output file.

      Run the pipeline with the following command:

      • python -m luigi --module word-frequency TopWords --GlobalParams-NumberBooks 15 --GlobalParams-NumberTopWords 750

      Luigi will execute the remaining tasks needed to generate the summary of the top words:

      Output

      ===== Luigi Execution Summary =====
      
      Scheduled 31 tasks of which:
      * 2 complete ones were encountered:
          - 1 CountWords(FileID=2)
          - 1 GetTopBooks()
      * 29 ran successfully:
          - 14 CountWords(FileID=0,1,10,11,12,13,14,3,4,5,6,7,8,9)
          - 14 DownloadBooks(FileID=0,1,10,11,12,13,14,3,4,5,6,7,8,9)
          - 1 TopWords()
      
      This progress looks :) because there were no failed tasks or missing dependencies
      
      ===== Luigi Execution Summary =====

      You can visualize the execution of the pipeline from the Luigi scheduler. Select the GetTopBooks task in the task list and press the View Graph button.

      Showing how to view a graph from the Luigi user interface

      Deselect the Hide Done and Upstream Dependencies options.

      Visualizing the execution of the TopWords Task

      It will show the flow of processing that is happening in Luigi.

      Open the data/summary.txt file:
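
      • cat data/summary.txt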

      You will find the calculated most common words:

      Output

      the            64593
      and            41650
      of             31896
      to             31368
      a              25265
      i              23449
      in             19496
      it             16282
      that           15907
      he             14974
      ...

      In this step, you have defined and used parameters to customize the execution of your tasks. You have generated a summary of the most common words for a set of books.

      Find all the code for this tutorial in this repository.

      Conclusion

      This tutorial has introduced you to using the Luigi data processing pipeline and its major features including tasks, parameters, configuration parameters, and the Luigi scheduler.

Luigi supports connecting to a large number of common data sources out of the box. You can also scale it to run large, complex data pipelines. This provides a powerful framework to start solving your data processing challenges.

      For more tutorials, check out our Data Analysis topic page and Python topic page.


