      How To Use the PDO PHP Extension to Perform MySQL Transactions in PHP on Ubuntu 18.04


      The author selected Open Sourcing Mental Illness to receive a donation as part of the Write for DOnations program.

      Introduction

      A MySQL transaction is a group of logically related SQL commands that are executed in the database as a single unit. Transactions are used to enforce ACID (Atomicity, Consistency, Isolation, and Durability) compliance in an application. This is a set of standards that govern the reliability of processing operations in a database.

      Atomicity ensures the success of related transactions or a complete failure if an error occurs. Consistency guarantees the validity of the data submitted to the database according to defined business logic. Isolation is the correct execution of concurrent transactions ensuring the effects of different clients connecting to a database do not affect each other. Durability ensures that logically related transactions remain in the database permanently.

      SQL statements issued via a transaction should either succeed or fail altogether. If any of the queries fails, MySQL rolls back the changes and they are never committed to the database.

      A good example to understand how MySQL transactions work is an e-commerce website. When a customer makes an order, the application inserts records into several tables, such as: orders and orders_products, depending on the business logic. Multi-table records related to a single order must be atomically sent to the database as a single logical unit.

      Another use-case is in a bank application. When a client is transferring money, a couple of transactions are sent to the database. The sender’s account is debited and the receiver’s party account is credited. The two transactions must be committed simultaneously. If one of them fails, the database will revert to its original state and no changes should be saved to disk.
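
      In plain SQL, that transfer maps roughly to the following pattern (an illustrative sketch only, assuming a hypothetical accounts table with account_id and balance columns):

      START TRANSACTION;
      -- debit the sender
      UPDATE accounts SET balance = balance - 100.00 WHERE account_id = 1;
      -- credit the receiver
      UPDATE accounts SET balance = balance + 100.00 WHERE account_id = 2;
      -- make both changes permanent; if either UPDATE failed, issue ROLLBACK instead
      COMMIT;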

      In this tutorial, you will use the PDO PHP Extension, which provides an interface for working with databases in PHP, to perform MySQL transactions on an Ubuntu 18.04 server.

      Prerequisites

      Before you begin, you will need the following:

      Step 1 — Creating a Sample Database and Tables

      You’ll first create a sample database and add some tables before you start working with MySQL transactions. First, log in to your MySQL server as root:

      • sudo mysql -u root -p

      When prompted, enter your MySQL root password and hit ENTER to proceed. Then, create a database; for the purposes of this tutorial, we’ll call the database sample_store:

      • CREATE DATABASE sample_store;

      You will see the following output:

      Output

      Query OK, 1 row affected (0.00 sec)

      Create a user called sample_user for your database. Remember to replace PASSWORD with a strong value:

      • CREATE USER 'sample_user'@'localhost' IDENTIFIED BY 'PASSWORD';

      Issue full privileges for your user to the sample_store database:

      • GRANT ALL PRIVILEGES ON sample_store.* TO 'sample_user'@'localhost';

      Finally, reload the MySQL privileges:

      • FLUSH PRIVILEGES;

      You’ll see the following output once you’ve created your user:

      Output

      Query OK, 0 rows affected (0.01 sec) . . .

      With the database and user in place, you can now create several tables for demonstrating how MySQL transactions work.

      Log out from the MySQL server:

      • EXIT;

      Once the system logs you out, you will see the following output:

      Output

      Bye.

      Then, log in with the credentials of the sample_user you just created:

      • sudo mysql -u sample_user -p

      Enter the password for the sample_user and hit ENTER to proceed.

      Switch to sample_store to make it the currently selected database:

      • USE sample_store;

      You’ll see the following output once it is selected:

      Output

      Database changed

      Next, create a products table:

      • CREATE TABLE products (product_id BIGINT PRIMARY KEY AUTO_INCREMENT, product_name VARCHAR(50), price DOUBLE) ENGINE = InnoDB;

      This command creates a products table with a field named product_id. You use a BIGINT data type that can accommodate a large value of up to 2^63-1. You use this same field as a PRIMARY KEY to uniquely identify products. The AUTO_INCREMENT keyword instructs MySQL to generate the next numeric value as new products are inserted.

      The product_name field is of type VARCHAR that can hold up to a maximum of 50 letters or numbers. For the product price, you use a DOUBLE data type to cater for floating point formats in prices with decimal numbers.

      Lastly, you use InnoDB as the ENGINE because it fully supports MySQL transactions, as opposed to other storage engines such as MyISAM.

      Once you’ve created your products table, you’ll get the following output:

      Output

      Query OK, 0 rows affected (0.02 sec)
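
      If you want to confirm that the table really was created with the InnoDB engine, you can query information_schema (an optional check, not required for the rest of the tutorial):

      • SELECT TABLE_NAME, ENGINE FROM information_schema.TABLES WHERE TABLE_SCHEMA = 'sample_store';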

      Next, add some items to the products table by running the following commands:

      • INSERT INTO products(product_name, price) VALUES ('WINTER COAT','25.50');
      • INSERT INTO products(product_name, price) VALUES ('EMBROIDERED SHIRT','13.90');
      • INSERT INTO products(product_name, price) VALUES ('FASHION SHOES','45.30');
      • INSERT INTO products(product_name, price) VALUES ('PROXIMA TROUSER','39.95');

      You’ll see output similar to the following after each INSERT operation:

      Output

      Query OK, 1 row affected (0.02 sec) . . .

      Then, verify that the data was added to the products table:

      • SELECT * FROM products;

      You will see a list of the four products that you have inserted:

      Output

      +------------+-------------------+-------+
      | product_id | product_name      | price |
      +------------+-------------------+-------+
      |          1 | WINTER COAT       |  25.5 |
      |          2 | EMBROIDERED SHIRT |  13.9 |
      |          3 | FASHION SHOES     |  45.3 |
      |          4 | PROXIMA TROUSER   | 39.95 |
      +------------+-------------------+-------+
      4 rows in set (0.01 sec)

      Next, you’ll create a customers table for holding basic information about customers:

      • CREATE TABLE customers (customer_id BIGINT PRIMARY KEY AUTO_INCREMENT, customer_name VARCHAR(50) ) ENGINE = InnoDB;

      As in the products table, you use the BIGINT data type for customer_id, which allows the table to support up to 2^63-1 customer records. The AUTO_INCREMENT keyword increments the value of the column each time you insert a new customer.

      Since the customer_name column accepts alphanumeric values, you use VARCHAR data type with a limit of 50 characters. Again, you use the InnoDB storage ENGINE to support transactions.

      After running the previous command to create the customers table, you will see the following output:

      Output

      Query OK, 0 rows affected (0.02 sec)

      You’ll add three sample customers to the table. Run the following commands:

      • INSERT INTO customers(customer_name) VALUES ('JOHN DOE');
      • INSERT INTO customers(customer_name) VALUES ('ROE MARY');
      • INSERT INTO customers(customer_name) VALUES ('DOE JANE');

      Once the customers have been added, you will see an output similar to the following:

      Output

      Query OK, 1 row affected (0.02 sec) . . .

      Then, verify the data in the customers table:

      • SELECT * FROM customers;

      You’ll see a list of the three customers:

      Output

      +-------------+---------------+
      | customer_id | customer_name |
      +-------------+---------------+
      |           1 | JOHN DOE      |
      |           2 | ROE MARY      |
      |           3 | DOE JANE      |
      +-------------+---------------+
      3 rows in set (0.00 sec)

      Next, you’ll create an orders table for recording orders placed by different customers. To create the orders table, execute the following command:

      • CREATE TABLE orders (order_id BIGINT AUTO_INCREMENT PRIMARY KEY, order_date DATETIME, customer_id BIGINT, order_total DOUBLE) ENGINE = InnoDB;

      You use the column order_id as the PRIMARY KEY. The BIGINT data type allows you to accommodate up to 2^63-1 orders and will auto-increment after each order insertion. The order_date field will hold the actual date and time the order is placed and hence, you use the DATETIME data type. The customer_id relates to the customers table that you created previously.

      You will see the following output:

      Output

      Query OK, 0 rows affected (0.02 sec)
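
      The tutorial relies on application logic to keep customer_id values consistent. If you would rather have the database enforce the relationship to the customers table, you could add a foreign key constraint (an optional variation; the constraint name fk_orders_customer is arbitrary):

      • ALTER TABLE orders ADD CONSTRAINT fk_orders_customer FOREIGN KEY (customer_id) REFERENCES customers(customer_id);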

      Since a single customer’s order may contain multiple items, you need to create an orders_products table to hold this information.

      To create the orders_products table, run the following command:

      • CREATE TABLE orders_products (ref_id BIGINT PRIMARY KEY AUTO_INCREMENT, order_id BIGINT, product_id BIGINT, price DOUBLE, quantity BIGINT) ENGINE = InnoDB;

      You use the ref_id as the PRIMARY KEY and this will auto-increment after each record insertion. The order_id and product_id relate to the orders and the products tables respectively. The price column is of data type DOUBLE in order to accommodate floating values.

      The InnoDB storage engine must match that of the tables created previously, since a single customer’s order will modify multiple tables simultaneously within a transaction.

      Your output will confirm the table’s creation:

      Output

      Query OK, 0 rows affected (0.02 sec)

      You won’t be adding any data to the orders and orders_products tables for now but you’ll do this later using a PHP script that implements MySQL transactions.

      Log out from the MySQL server:

      • EXIT;

      Your database schema is now complete and you’ve populated it with some records. You’ll now create a PHP class for handling database connections and MySQL transactions.

      Step 2 — Designing a PHP Class to Handle MySQL Transactions

      In this step, you will create a PHP class that will use PDO (PHP Data Objects) to handle MySQL transactions. The class will connect to your MySQL database and insert data atomically to the database.

      Save the class file in the root directory of your Apache web server. To do this, create a DBTransaction.php file using your text editor:

      • sudo nano /var/www/html/DBTransaction.php

      Then, add the following code to the file. Replace PASSWORD with the value you created in Step 1:

      /var/www/html/DBTransaction.php

      <?php
      
      class DBTransaction
      {
          protected $pdo;
          public $last_insert_id;
      
          public function __construct()
          {
              define('DB_NAME', 'sample_store');
              define('DB_USER', 'sample_user');
              define('DB_PASSWORD', 'PASSWORD');
              define('DB_HOST', 'localhost');
      
              $this->pdo = new PDO("mysql:host=" . DB_HOST . ";dbname=" . DB_NAME, DB_USER, DB_PASSWORD);
              $this->pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
              $this->pdo->setAttribute(PDO::ATTR_EMULATE_PREPARES, false);
          }
      

      At the beginning of the DBTransaction class, the constructor defines the constants (DB_HOST, DB_NAME, DB_USER, and DB_PASSWORD) that PDO uses to initialize a connection to the database you created in Step 1.

      Note: Since we are demonstrating MySQL transactions on a small scale here, we have declared the database constants in the DBTransaction class. In a large production project, you would normally create a separate configuration file and load the database constants from that file using a PHP require_once statement.

      Next, you set two attributes for the PDO class:

      • PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION: This attribute instructs PDO to throw an exception whenever an error is encountered. Such errors can be logged for debugging, as the short sketch after this list shows.
      • PDO::ATTR_EMULATE_PREPARES, false: This option disables emulation of prepared statements and lets the MySQL database engine prepare the statements itself.
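
      To illustrate what PDO::ERRMODE_EXCEPTION buys you, here is a minimal sketch of catching and logging a failing query (it assumes a $pdo connection configured as above; the table name is deliberately invalid):

      try {
          $pdo->query("SELECT * FROM non_existent_table"); // throws a PDOException
      } catch (PDOException $e) {
          error_log($e->getMessage()); // log the error for debugging instead of failing silently
      }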

      Now add the following code to your file to create the methods for your class:

      /var/www/html/DBTransaction.php

      . . .
          public function startTransaction()
          {
              $this->pdo->beginTransaction();
          }
      
          public function insertTransaction($sql, $data)
          {
              $stmt = $this->pdo->prepare($sql);
              $stmt->execute($data);
              $this->last_insert_id = $this->pdo->lastInsertId();
          }
      
          public function submitTransaction()
          {
              try {
                  $this->pdo->commit();
              } catch(PDOException $e) {
                  $this->pdo->rollBack();
                  return false;
              }
      
                return true;
          }
      }
      

      Save and close the file by pressing CTRL + X, Y, then ENTER.

      To work with MySQL transactions, you create three main methods in the DBTransaction class: startTransaction, insertTransaction, and submitTransaction.

      • startTransaction: This method instructs PDO to start a transaction and turns auto-commit off until a commit command is issued.

      • insertTransaction: This method takes two arguments. The $sql variable holds the SQL statement to be executed, while the $data variable is an array of the data to bind to the prepared statement’s placeholders.

      • submitTransaction: This method commits the changes to the database permanently by issuing a commit() command. If a PDOException is raised while committing, it calls the rollBack() method to revert the database to its original state and returns false.

      Your DBTransaction class initializes a transaction, prepares the different SQL commands to be executed, and commits the changes to the database atomically if there are no issues; otherwise, the transaction is rolled back. In addition, the class lets you retrieve the ID of the record you just inserted by accessing the public property last_insert_id.

      The DBTransaction class is now ready to be called and used by any PHP code, which you’ll create next.

      Step 3 — Creating a PHP Script to Use the DBTransaction Class

      You’ll create a PHP script that will implement the DBTransaction class and send a group of SQL commands to the MySQL database. You’ll mimic the workflow of a customer’s order in an online shopping cart.

      These SQL queries will affect the orders and the orders_products tables. Your DBTransaction class should only allow changes to the database if all of the queries are executed without any errors. Otherwise, you’ll get an error back and any attempted changes will roll back.

      You are creating a single order for the customer ROE MARY, identified by customer_id 2 in the customers table. The customer’s order has three different items with differing quantities from the products table. Your PHP script takes the customer’s order data and submits it to the DBTransaction class.

      Create the orders.php file:

      • sudo nano /var/www/html/orders.php

      Then, add the following code to the file:

      /var/www/html/orders.php

      <?php
      
      require("DBTransaction.php");
      
      $db_host = "database_host";
      $db_name = "database_name";
      $db_user = "database_user";
      $db_password = "PASSWORD";
      
      $customer_id = 2;
      
      $products[] = [
        'product_id' => 1,
        'price' => 25.50,
        'quantity' => 1
      ];
      
      $products[] = [
        'product_id' => 2,
        'price' => 13.90,
        'quantity' => 3
      ];
      
      $products[] = [
        'product_id' => 3,
        'price' => 45.30,
        'quantity' => 2
      ];
      
      $transaction = new DBTransaction();
      $transaction->startTransaction();
      

      You’ve created a PHP script that initializes an instance of the DBTransaction class that you created in Step 2.

      In this script, you include the DBTransaction.php file, prepare a multi-dimensional array of all the products the customer is ordering from the store, and then initialize the DBTransaction class. You also invoke the startTransaction() method to start a transaction.

      Next, add the following code to finish your orders.php script:

      /var/www/html/orders.php

      . . .
      $order_query = "insert into orders (order_id, customer_id, order_date, order_total) values(:order_id, :customer_id, :order_date, :order_total)";
      $product_query = "insert into orders_products (order_id, product_id, price, quantity) values(:order_id, :product_id, :price, :quantity)";
      
      $transaction->insertTransaction($order_query, [
        'customer_id' => $customer_id,
        'order_date' => "2020-01-11",
        'order_total' => 157.8
      ]);
      
      $order_id = $transaction->last_insert_id;
      
      foreach ($products as $product) {
        $transaction->insertTransaction($product_query, [
          'order_id' => $order_id,
          'product_id' => $product['product_id'],
          'price' => $product['price'],
          'quantity' => $product['quantity']
        ]);
      }
      
      $result = $transaction->submitTransaction();
      
      if ($result) {
          echo "Records successfully submitted";
      } else {
          echo "There was an error.";
      }
      
      

      Save and close the file by pressing CTRL + X, Y, then ENTER.

      You prepare the INSERT statement for the orders table and pass it to the insertTransaction method. After this, you retrieve the value of the public property last_insert_id from the DBTransaction class and use it as the $order_id.

      Once you have an $order_id, you use the unique ID to insert the customer’s order items to the orders_products table.

      Finally, you call the submitTransaction method to commit the entire customer’s order details to the database if there are no problems. Otherwise, submitTransaction will roll back the attempted changes.

      Now you’ll run the orders.php script in your browser. Visit the following URL, replacing your-server-IP with the public IP address of your server:

      http://your-server-IP/orders.php

      You will see confirmation that the records were successfully submitted:

      PHP Output from MySQL Transactions Class

      Your PHP script is working as expected and the order together with the associated order products were submitted to the database atomically.

      You’ve run the orders.php file in a browser window. The script invoked the DBTransaction class, which in turn submitted the order details to the database. In the next step, you will verify that the records were saved to the related database tables.

      Step 4 — Confirming the Entries in Your Database

      In this step, you’ll check if the transaction initiated from the browser window for the customer’s order was posted to the database tables as expected.

      To do this, log in to your MySQL database again:

      • sudo mysql -u sample_user -p

      Enter the password for the sample_user and hit ENTER to continue.

      Switch to the sample_store database:

      • USE sample_store;

      Ensure the database is changed before proceeding by confirming the following output:

      Output

      Database changed

      Then, issue the following command to retrieve records from the orders table:

      • SELECT * FROM orders;

      This will display the following output detailing the customer’s order:

      Output

      +----------+---------------------+-------------+-------------+
      | order_id | order_date          | customer_id | order_total |
      +----------+---------------------+-------------+-------------+
      |        1 | 2020-01-11 00:00:00 |           2 |       157.8 |
      +----------+---------------------+-------------+-------------+
      1 row in set (0.00 sec)

      Next, retrieve the records from the orders_products table:

      • SELECT * FROM orders_products;

      You’ll see output similar to the following with a list of products from the customer’s order:

      Output

      +--------+----------+------------+-------+----------+
      | ref_id | order_id | product_id | price | quantity |
      +--------+----------+------------+-------+----------+
      |      1 |        1 |          1 |  25.5 |        1 |
      |      2 |        1 |          2 |  13.9 |        3 |
      |      3 |        1 |          3 |  45.3 |        2 |
      +--------+----------+------------+-------+----------+
      3 rows in set (0.00 sec)

      The output confirms that the transaction was saved to the database and your helper DBTransaction class is working as expected.

      Conclusion

      In this guide, you used the PHP PDO extension to work with MySQL transactions. Although this is not a comprehensive guide to designing e-commerce software, it has provided an example of using MySQL transactions in your applications.

      To learn more about the MySQL ACID model, consider visiting the InnoDB and the ACID Model guide from the official MySQL website. Visit our MySQL content page for more related tutorials, articles, and Q&A.




      How To Perform Sentiment Analysis in Python 3 Using the Natural Language Toolkit (NLTK)


      The author selected the Open Internet/Free Speech fund to receive a donation as part of the Write for DOnations program.

      Introduction

      A large amount of data that is generated today is unstructured, which requires processing to generate insights. Some examples of unstructured data are news articles, posts on social media, and search history. The process of analyzing natural language and making sense out of it falls under the field of Natural Language Processing (NLP). Sentiment analysis is a common NLP task, which involves classifying texts or parts of texts into a pre-defined sentiment. You will use the Natural Language Toolkit (NLTK), a commonly used NLP library in Python, to analyze textual data.

      In this tutorial, you will prepare a dataset of sample tweets from the NLTK package for NLP with different data cleaning methods. Once the dataset is ready for processing, you will train a model on pre-classified tweets and use the model to classify the sample tweets into negative and positive sentiments.

      This article assumes that you are familiar with the basics of Python (see our How To Code in Python 3 series), primarily the use of data structures, classes, and methods. The tutorial assumes that you have no background in NLP or nltk, although some familiarity with them is an added advantage.

      Prerequisites

      Step 1 — Installing NLTK and Downloading the Data

      You will use the NLTK package in Python for all NLP tasks in this tutorial. In this step you will install NLTK and download the sample tweets that you will use to train and test your model.

      First, install the NLTK package with the pip package manager:

      • pip3 install nltk

      This tutorial will use sample tweets that are part of the NLTK package. First, start a Python interactive session by running the following command:

      • python3

      Then, import the nltk module in the Python interpreter:

      • import nltk

      Download the sample tweets from the NLTK package:

      • nltk.download('twitter_samples')

      Running this command from the Python interpreter downloads and stores the tweets locally. Once the samples are downloaded, they are available for your use.
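
      If you want to confirm what was downloaded, you can list the files in the corpus from the same interactive session (an optional check):

      • from nltk.corpus import twitter_samples
      • print(twitter_samples.fileids())

      This prints the names of the three JSON files that you will work with in the next step.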

      You will use the negative and positive tweets to train your model on sentiment analysis later in the tutorial. The tweets with no sentiments will be used to test your model.

      If you would like to use your own dataset, you can gather tweets from a specific time period, user, or hashtag by using the Twitter API.

      Now that you’ve imported NLTK and downloaded the sample tweets, exit the interactive session by entering exit(). You are ready to import the tweets and begin processing the data.

      Step 2 — Tokenizing the Data

      Language in its original form cannot be accurately processed by a machine, so you need to process the language to make it easier for the machine to understand. The first part of making sense of the data is through a process called tokenization, or splitting strings into smaller parts called tokens.

      A token is a sequence of characters in text that serves as a unit. Based on how you create the tokens, they may consist of words, emoticons, hashtags, links, or even individual characters. A basic way of breaking language into tokens is by splitting the text based on whitespace and punctuation.
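
      As a quick illustration of the naive approach (shown only for contrast; the NLTK tokenizer used below handles punctuation, handles, and emoticons more carefully):

      text = "Top engaged members, thank you!"
      print(text.split())
      # ['Top', 'engaged', 'members,', 'thank', 'you!'] -- punctuation stays attached to the words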

      To get started, create a new .py file to hold your script. This tutorial will use nlp_test.py:

      • nano nlp_test.py

      In this file, you will first import the twitter_samples so you can work with that data:

      nlp_test.py

      from nltk.corpus import twitter_samples
      

      This will import three datasets from NLTK that contain various tweets to train and test the model:

      • negative_tweets.json: 5000 tweets with negative sentiments
      • positive_tweets.json: 5000 tweets with positive sentiments
      • tweets.20150430-223406.json: 20000 tweets with no sentiments

      Next, create variables for positive_tweets, negative_tweets, and text:

      nlp_test.py

      from nltk.corpus import twitter_samples
      
      positive_tweets = twitter_samples.strings('positive_tweets.json')
      negative_tweets = twitter_samples.strings('negative_tweets.json')
      text = twitter_samples.strings('tweets.20150430-223406.json')
      

      The strings() method of twitter_samples returns all of the tweets within a dataset as a list of strings. Setting the different tweet collections as variables will make processing and testing easier.
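
      If you would like a quick look at the data at this point, you could temporarily add a couple of print statements to the file (remove them again afterwards):

      print(len(positive_tweets))   # 5000 positive tweets
      print(positive_tweets[0])     # the raw text of the first positive tweet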

      Before using a tokenizer in NLTK, you need to download an additional resource, punkt. The punkt module is a pre-trained model that helps you tokenize words and sentences. For instance, this model knows that a name may contain a period (like “S. Daityari”) and the presence of this period in a sentence does not necessarily end it. First, start a Python interactive session:

      • python3

      Run the following commands in the session to download the punkt resource:

      • import nltk
      • nltk.download('punkt')

      Once the download is complete, you are ready to use NLTK’s tokenizers. NLTK provides a default tokenizer for tweets with the .tokenized() method. Add a line to create an object that tokenizes the positive_tweets.json dataset:

      nlp_test.py

      from nltk.corpus import twitter_samples
      
      positive_tweets = twitter_samples.strings('positive_tweets.json')
      negative_tweets = twitter_samples.strings('negative_tweets.json')
      text = twitter_samples.strings('tweets.20150430-223406.json')
      tweet_tokens = twitter_samples.tokenized('positive_tweets.json')
      

      If you’d like to test the script to see the .tokenized() method in action, add the following print statement to your nlp_test.py script. This will print the tokens of the first tweet from the positive_tweets.json dataset:

      nlp_test.py

      from nltk.corpus import twitter_samples
      
      positive_tweets = twitter_samples.strings('positive_tweets.json')
      negative_tweets = twitter_samples.strings('negative_tweets.json')
      text = twitter_samples.strings('tweets.20150430-223406.json')
      tweet_tokens = twitter_samples.tokenized('positive_tweets.json')
      
      print(tweet_tokens[0])
      

      Save and close the file, and run the script:

      • python3 nlp_test.py

      The process of tokenization takes some time because it’s not a simple split on white space. After a few moments of processing, you’ll see the following:

      Output

      ['#FollowFriday', '@France_Inte', '@PKuchly57', '@Milipol_Paris', 'for', 'being', 'top', 'engaged', 'members', 'in', 'my', 'community', 'this', 'week', ':)']

      Here, the .tokenized() method preserves special characters such as @ and _. These characters will be removed through regular expressions later in this tutorial.

      Now that you’ve seen how the .tokenized() method works, make sure to comment out or remove the last line that prints the tokenized tweet from the script by adding a # to the start of the line:

      nlp_test.py

      from nltk.corpus import twitter_samples
      
      positive_tweets = twitter_samples.strings('positive_tweets.json')
      negative_tweets = twitter_samples.strings('negative_tweets.json')
      text = twitter_samples.strings('tweets.20150430-223406.json')
      tweet_tokens = twitter_samples.tokenized('positive_tweets.json')
      
      #print(tweet_tokens[0])
      

      Your script is now configured to tokenize data. In the next step you will update the script to normalize the data.

      Step 3 — Normalizing the Data

      Words have different forms—for instance, “ran”, “runs”, and “running” are various forms of the same verb, “run”. Depending on the requirement of your analysis, all of these versions may need to be converted to the same form, “run”. Normalization in NLP is the process of converting a word to its canonical form.

      Normalization helps group together words with the same meaning but different forms. Without normalization, “ran”, “runs”, and “running” would be treated as different words, even though you may want them to be treated as the same word. In this section, you explore stemming and lemmatization, which are two popular techniques of normalization.

      Stemming is a process of removing affixes from a word. Stemming, working with only simple verb forms, is a heuristic process that removes the ends of words.

      In this tutorial you will use the process of lemmatization, which normalizes a word with the context of vocabulary and morphological analysis of words in text. The lemmatization algorithm analyzes the structure of the word and its context to convert it to a normalized form. Therefore, it comes at a cost of speed. A comparison of stemming and lemmatization ultimately comes down to a trade off between speed and accuracy.

      Before you proceed to use lemmatization, download the necessary resources. First, start a Python interactive session:

      • python3

      Run the following commands in the session to download the resources:

      • import nltk
      • nltk.download('wordnet')
      • nltk.download('averaged_perceptron_tagger')

      wordnet is a lexical database for the English language that helps the script determine the base word. You need the averaged_perceptron_tagger resource to determine the context of a word in a sentence.
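
      With wordnet downloaded, you can see the practical difference between the two techniques in the interactive session (an optional aside, not part of nlp_test.py):

      • from nltk.stem import PorterStemmer
      • from nltk.stem.wordnet import WordNetLemmatizer
      • print(PorterStemmer().stem('running'), PorterStemmer().stem('ran'))
      • print(WordNetLemmatizer().lemmatize('running', 'v'), WordNetLemmatizer().lemmatize('ran', 'v'))

      The stemmer reduces running to run but leaves ran untouched, whereas the lemmatizer maps both forms to run because it knows the verb’s morphology.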

      Once downloaded, you are almost ready to use the lemmatizer. Before running a lemmatizer, you need to determine the context for each word in your text. This is achieved by a tagging algorithm, which assesses the relative position of a word in a sentence. In a Python interactive session, import the pos_tag function and provide a list of tokens as an argument to get the tags. Try this out in Python:

      • from nltk.tag import pos_tag
      • from nltk.corpus import twitter_samples
      • tweet_tokens = twitter_samples.tokenized('positive_tweets.json')
      • print(pos_tag(tweet_tokens[0]))

      Here is the output of the pos_tag function.

      Output

      [('#FollowFriday', 'JJ'), ('@France_Inte', 'NNP'), ('@PKuchly57', 'NNP'), ('@Milipol_Paris', 'NNP'), ('for', 'IN'), ('being', 'VBG'), ('top', 'JJ'), ('engaged', 'VBN'), ('members', 'NNS'), ('in', 'IN'), ('my', 'PRP$'), ('community', 'NN'), ('this', 'DT'), ('week', 'NN'), (':)', 'NN')]

      From the list of tags, here is the list of the most common items and their meaning:

      • NNP: Noun, proper, singular
      • NN: Noun, common, singular or mass
      • IN: Preposition or conjunction, subordinating
      • VBG: Verb, gerund or present participle
      • VBN: Verb, past participle

      A full list of these tags is available in the Penn Treebank tagset documentation.

      In general, if a tag starts with NN, the word is a noun and if it starts with VB, the word is a verb. After reviewing the tags, exit the Python session by entering exit().

      To incorporate this into a function that normalizes a sentence, you should first generate the tags for each token in the text, and then lemmatize each word using the tag.

      Update the nlp_test.py file with the following function that lemmatizes a sentence:

      nlp_test.py

      ...
      
      from nltk.tag import pos_tag
      from nltk.stem.wordnet import WordNetLemmatizer
      
      def lemmatize_sentence(tokens):
          lemmatizer = WordNetLemmatizer()
          lemmatized_sentence = []
          for word, tag in pos_tag(tokens):
              if tag.startswith('NN'):
                  pos = 'n'
              elif tag.startswith('VB'):
                  pos = 'v'
              else:
                  pos = 'a'
              lemmatized_sentence.append(lemmatizer.lemmatize(word, pos))
          return lemmatized_sentence
      
      print(lemmatize_sentence(tweet_tokens[0]))
      

      This code imports the WordNetLemmatizer class and initializes it to a variable, lemmatizer.

      The function lemmatize_sentence first gets the position tag of each token of a tweet. Within the if statement, if the tag starts with NN, the token is assigned as a noun. Similarly, if the tag starts with VB, the token is assigned as a verb.

      Save and close the file, and run the script:

      • python3 nlp_test.py

      Here is the output:

      Output

      ['#FollowFriday', '@France_Inte', '@PKuchly57', '@Milipol_Paris', 'for', 'be', 'top', 'engage', 'member', 'in', 'my', 'community', 'this', 'week', ':)']

      You will notice that the verb being changes to its root form, be, and the noun members changes to member. Before you proceed, comment out the last line that prints the sample tweet from the script.

      Now that you have successfully created a function to normalize words, you are ready to move on to remove noise.

      Step 4 — Removing Noise from the Data

      In this step, you will remove noise from the dataset. Noise is any part of the text that does not add meaning or information to data.

      Noise is specific to each project, so what constitutes noise in one project may not be in a different project. For instance, the most common words in a language are called stop words. Some examples of stop words are “is”, “the”, and “a”. They are generally irrelevant when processing language, unless a specific use case warrants their inclusion.

      In this tutorial, you will use regular expressions in Python to search for and remove these items:

      • Hyperlinks - All hyperlinks in Twitter are converted to the URL shortener t.co. Therefore, keeping them in the text processing would not add any value to the analysis.
      • Twitter handles in replies - These Twitter usernames are preceded by a @ symbol, which does not convey any meaning.
      • Punctuation and special characters - While these often provide context to textual data, this context is often difficult to process. For simplicity, you will remove all punctuation and special characters from tweets.

      To remove hyperlinks, you need to first search for a substring that matches a URL starting with http:// or https://, followed by letters, numbers, or special characters. Once a pattern is matched, the .sub() method replaces it with an empty string.

      Since we will normalize word forms within the remove_noise() function, you can comment out the lemmatize_sentence() function from the script.

      Add the following code to your nlp_test.py file to remove noise from the dataset:

      nlp_test.py

      ...
      
      import re, string
      
      def remove_noise(tweet_tokens, stop_words = ()):
      
          cleaned_tokens = []
      
          for token, tag in pos_tag(tweet_tokens):
              token = re.sub('http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+#]|[!*(),]|'
                             '(?:%[0-9a-fA-F][0-9a-fA-F]))+','', token)
              token = re.sub("(@[A-Za-z0-9_]+)","", token)
      
              if tag.startswith("NN"):
                  pos = 'n'
              elif tag.startswith('VB'):
                  pos = 'v'
              else:
                  pos = 'a'
      
              lemmatizer = WordNetLemmatizer()
              token = lemmatizer.lemmatize(token, pos)
      
              if len(token) > 0 and token not in string.punctuation and token.lower() not in stop_words:
                  cleaned_tokens.append(token.lower())
          return cleaned_tokens
      

      This code creates a remove_noise() function that removes noise and incorporates the normalization and lemmatization mentioned in the previous section. The code takes two arguments: the tweet tokens and the tuple of stop words.

      The code then uses a loop to remove the noise from the dataset. To remove hyperlinks, the code first searches for a substring that matches a URL starting with http:// or https://, followed by letters, numbers, or special characters. Once a pattern is matched, the .sub() method replaces it with an empty string, or ''.

      Similarly, to remove @ mentions, the code substitutes the relevant part of text using regular expressions. The code uses the re library to search @ symbols, followed by numbers, letters, or _, and replaces them with an empty string.

      Finally, you can remove punctuation using the string library.
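
      As a standalone illustration of what the two substitutions do to individual tokens (not part of nlp_test.py):

      import re

      url_pattern = 'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+#]|[!*(),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'
      print(re.sub(url_pattern, '', 'https://t.co/bI8k8tb9ht'))   # prints an empty string
      print(re.sub('(@[A-Za-z0-9_]+)', '', '@PKuchly57'))          # prints an empty string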

      In addition to this, you will also remove stop words using a built-in set of stop words in NLTK, which needs to be downloaded separately.

      Execute the following command from a Python interactive session to download this resource:

      • nltk.download('stopwords')

      Once the resource is downloaded, exit the interactive session.

      You can use the .words() method to get a list of stop words in English. To test the function, let us run it on our sample tweet. Add the following lines to the end of the nlp_test.py file:

      nlp_test.py

      ...
      from nltk.corpus import stopwords
      stop_words = stopwords.words('english')
      
      print(remove_noise(tweet_tokens[0], stop_words))
      

      After saving and closing the file, run the script again to receive output similar to the following:

      Output

      ['#followfriday', 'top', 'engage', 'member', 'community', 'week', ':)']

      Notice that the function removes all @ mentions and stop words, and converts the words to lowercase.

      Before proceeding to the modeling exercise in the next step, use the remove_noise() function to clean the positive and negative tweets. Comment out the line to print the output of remove_noise() on the sample tweet and add the following to the nlp_test.py script:

      nlp_test.py

      ...
      from nltk.corpus import stopwords
      stop_words = stopwords.words('english')
      
      #print(remove_noise(tweet_tokens[0], stop_words))
      
      positive_tweet_tokens = twitter_samples.tokenized('positive_tweets.json')
      negative_tweet_tokens = twitter_samples.tokenized('negative_tweets.json')
      
      positive_cleaned_tokens_list = []
      negative_cleaned_tokens_list = []
      
      for tokens in positive_tweet_tokens:
          positive_cleaned_tokens_list.append(remove_noise(tokens, stop_words))
      
      for tokens in negative_tweet_tokens:
          negative_cleaned_tokens_list.append(remove_noise(tokens, stop_words))
      

      Now that you’ve added the code to clean the sample tweets, you may want to compare the original tokens to the cleaned tokens for a sample tweet. If you’d like to test this, add the following code to the file to compare both versions of the 500th tweet in the list:

      nlp_test.py

      ...
      print(positive_tweet_tokens[500])
      print(positive_cleaned_tokens_list[500])
      

      Save and close the file and run the script. From the output you will see that the punctuation and links have been removed, and the words have been converted to lowercase.

      Output

      ['Dang', 'that', 'is', 'some', 'rad', '@AbzuGame', '#fanart', '!', ':D', 'https://t.co/bI8k8tb9ht']
      ['dang', 'rad', '#fanart', ':d']

      There are certain issues that might arise during the preprocessing of text. For instance, words without spaces (“iLoveYou”) will be treated as one, and it can be difficult to separate such words. Furthermore, “Hi”, “Hii”, and “Hiiiii” will be treated differently by the script unless you write something specific to tackle the issue. It’s common to fine-tune the noise removal process for your specific data.

      Now that you’ve seen the remove_noise() function in action, be sure to comment out or remove the last two lines from the script so you can add more to it:

      nlp_test.py

      ...
      #print(positive_tweet_tokens[500])
      #print(positive_cleaned_tokens_list[500])
      

      In this step you removed noise from the data to make the analysis more effective. In the next step you will analyze the data to find the most common words in your sample dataset.

      Step 5 — Determining Word Density

      The most basic form of analysis on textual data is to extract word frequencies. A single tweet is too small a sample to determine the distribution of words, so you will analyze word frequency across all positive tweets.

      The following snippet defines a generator function, named get_all_words, that takes a list of cleaned tweet tokens as an argument and yields every word from all of the tweets. Add the following code to your nlp_test.py file:

      nlp_test.py

      ...
      
      def get_all_words(cleaned_tokens_list):
          for tokens in cleaned_tokens_list:
              for token in tokens:
                  yield token
      
      all_pos_words = get_all_words(positive_cleaned_tokens_list)
      

      Now that you have compiled all words in the sample of tweets, you can find out which are the most common words using the FreqDist class of NLTK. Add the following code to the nlp_test.py file:

      nlp_test.py

      from nltk import FreqDist
      
      freq_dist_pos = FreqDist(all_pos_words)
      print(freq_dist_pos.most_common(10))
      

      The .most_common() method lists the words which occur most frequently in the data. Save and close the file after making these changes.

      When you run the file now, you will find the most common terms in the data:

      Output

      [(':)', 3691), (':-)', 701), (':d', 658), ('thanks', 388), ('follow', 357), ('love', 333), ('...', 290), ('good', 283), ('get', 263), ('thank', 253)]

      From this data, you can see that emoticon entities form some of the most common parts of positive tweets. Before proceeding to the next step, make sure you comment out the last line of the script that prints the top ten tokens.
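
      If you are curious, you can run the same frequency check on the negative tweets by building a second generator (an optional aside; comment these lines out as well before moving on):

      all_neg_words = get_all_words(negative_cleaned_tokens_list)
      freq_dist_neg = FreqDist(all_neg_words)
      print(freq_dist_neg.most_common(10))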

      To summarize, you extracted the tweets from nltk, then tokenized, normalized, and cleaned them up for use in the model. Finally, you also looked at the frequencies of tokens in the data and checked the frequencies of the top ten tokens.

      In the next step you will prepare data for sentiment analysis.

      Step 6 — Preparing Data for the Model

      Sentiment analysis is a process of identifying an attitude of the author on a topic that is being written about. You will create a training dataset to train a model. This is a supervised machine learning process, which requires you to associate each item in the dataset with a “sentiment” for training. In this tutorial, your model will use the “positive” and “negative” sentiments.

      Sentiment analysis can be used to categorize text into a variety of sentiments. For simplicity and availability of the training dataset, this tutorial helps you train your model in only two categories, positive and negative.

      A model is a description of a system using rules and equations. It may be as simple as an equation which predicts the weight of a person, given their height. The sentiment analysis model that you will build will associate tweets with a positive or a negative sentiment. You will need to split your dataset into two parts. The purpose of the first part is to build the model, whereas the next part tests the performance of the model.

      In the data preparation step, you will prepare the data for sentiment analysis by converting tokens to the dictionary form and then split the data for training and testing purposes.

      Converting Tokens to a Dictionary

      First, you will prepare the data to be fed into the model. You will use the Naive Bayes classifier in NLTK to perform the modeling exercise. Notice that the model requires not just a list of words in a tweet, but a Python dictionary with words as keys and True as values. The following code defines a generator function to change the format of the cleaned data.

      Add the following code to convert the tweets from a list of cleaned tokens to dictionaries with keys as the tokens and True as values. The corresponding dictionaries are stored in positive_tokens_for_model and negative_tokens_for_model.

      nlp_test.py

      ...
      def get_tweets_for_model(cleaned_tokens_list):
          for tweet_tokens in cleaned_tokens_list:
              yield dict([token, True] for token in tweet_tokens)
      
      positive_tokens_for_model = get_tweets_for_model(positive_cleaned_tokens_list)
      negative_tokens_for_model = get_tweets_for_model(negative_cleaned_tokens_list)
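
      To make the expected format concrete, here is what the conversion does to a single cleaned tweet (a standalone illustration using a made-up token list):

      example_tokens = ['#followfriday', 'top', 'engage']
      print(dict([token, True] for token in example_tokens))
      # {'#followfriday': True, 'top': True, 'engage': True}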
      

      Splitting the Dataset for Training and Testing the Model

      Next, you need to prepare the data for training the NaiveBayesClassifier class. Add the following code to the file to prepare the data:

      nlp_test.py

      ...
      import random
      
      positive_dataset = [(tweet_dict, "Positive")
                           for tweet_dict in positive_tokens_for_model]
      
      negative_dataset = [(tweet_dict, "Negative")
                           for tweet_dict in negative_tokens_for_model]
      
      dataset = positive_dataset + negative_dataset
      
      random.shuffle(dataset)
      
      train_data = dataset[:7000]
      test_data = dataset[7000:]
      

      This code attaches a Positive or Negative label to each tweet. It then creates a dataset by joining the positive and negative tweets.

      By default, the data contains all positive tweets followed by all negative tweets in sequence. When training the model, you should provide a sample of your data that does not contain any bias. To avoid bias, you’ve added code to randomly arrange the data using the .shuffle() method of random.

      Finally, the code splits the shuffled data into a ratio of 70:30 for training and testing, respectively. Since the number of tweets is 10000, you can use the first 7000 tweets from the shuffled dataset for training the model and the final 3000 for testing the model.
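
      If you want to double-check the split, you could temporarily print the sizes of the three lists:

      print(len(dataset), len(train_data), len(test_data))   # 10000 7000 3000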

      In this step, you converted the cleaned tokens to a dictionary form, randomly shuffled the dataset, and split it into training and testing data.

      Step 7 — Building and Testing the Model

      Finally, you can use the NaiveBayesClassifier class to build the model. Use the .train() method to train the model and the .accuracy() method to test the model on the testing data.

      nlp_test.py

      ...
      from nltk import classify
      from nltk import NaiveBayesClassifier
      classifier = NaiveBayesClassifier.train(train_data)
      
      print("Accuracy is:", classify.accuracy(classifier, test_data))
      
      print(classifier.show_most_informative_features(10))
      

      Save, close, and execute the file after adding the code. The output of the code will be as follows:

      Output

      Accuracy is: 0.9956666666666667
      Most Informative Features
                            :( = True           Negati : Positi =   2085.6 : 1.0
                            :) = True           Positi : Negati =    986.0 : 1.0
                       welcome = True           Positi : Negati =     37.2 : 1.0
                        arrive = True           Positi : Negati =     31.3 : 1.0
                           sad = True           Negati : Positi =     25.9 : 1.0
                      follower = True           Positi : Negati =     21.1 : 1.0
                           bam = True           Positi : Negati =     20.7 : 1.0
                          glad = True           Positi : Negati =     18.1 : 1.0
                           x15 = True           Negati : Positi =     15.9 : 1.0
                     community = True           Positi : Negati =     14.1 : 1.0

      Accuracy is defined as the percentage of tweets in the testing dataset for which the model correctly predicted the sentiment. An accuracy of over 99% on the test set is pretty good.

      In the table that shows the most informative features, every row in the output shows the ratio of occurrence of a token in positive and negative tagged tweets in the training dataset. The first row in the data signifies that in all tweets containing the token :(, the ratio of negative to positive tweets was 2085.6 to 1. Interestingly, it seems that there was one token with :( in the positive dataset. You can see that the top two discriminating items in the text are the emoticons. Further, words such as sad lead to negative sentiments, whereas welcome and glad are associated with positive sentiments.

      Next, you can check how the model performs on random tweets from Twitter. Add this code to the file:

      nlp_test.py

      ...
      from nltk.tokenize import word_tokenize
      
      custom_tweet = "I ordered just once from TerribleCo, they screwed up, never used the app again."
      
      custom_tokens = remove_noise(word_tokenize(custom_tweet))
      
      print(classifier.classify(dict([token, True] for token in custom_tokens)))
      

      This code will allow you to test custom tweets by updating the string associated with the custom_tweet variable. Save and close the file after making these changes.

      Run the script to analyze the custom text. Here is the output for the custom text in the example:

      Output

      'Negative'

      You can also check if it characterizes positive tweets correctly:

      nlp_test.py

      ...
      custom_tweet = 'Congrats #SportStar on your 7th best goal from last season winning goal of the year :) #Baller #Topbin #oneofmanyworldies'
      

      Here is the output:

      Output

      'Positive'

      Now that you’ve tested both positive and negative sentiments, update the variable to test a more complex sentiment like sarcasm.

      nlp_test.py

      ...
      custom_tweet = 'Thank you for sending my baggage to CityX and flying me to CityY at the same time. Brilliant service. #thanksGenericAirline'
      

      Here is the output:

      Output

      'Positive'

      The model classified this example as positive. This is because the training data wasn’t comprehensive enough to classify sarcastic tweets as negative. If you want your model to predict sarcasm, you would need to provide a sufficient amount of training data to train it accordingly.

      In this step you built and tested the model. You also explored some of its limitations, such as not detecting sarcasm in particular examples. Your completed code still has artifacts leftover from following the tutorial, so the next step will guide you through aligning the code to Python’s best practices.

      Step 8 — Cleaning Up the Code (Optional)

      Though you have completed the tutorial, it is recommended to reorganize the code in the nlp_test.py file to follow best programming practices. Per best practice, your code should meet these criteria:

      • All imports should be at the top of the file. Imports from the same library should be grouped together in a single statement.
      • All functions should be defined after the imports.
      • All the statements in the file should be housed under an if __name__ == "__main__": condition. This ensures that the statements are not executed if you are importing the functions of the file in another file.

      We will also remove the code that was commented out while following the tutorial, along with the lemmatize_sentence function, as the lemmatization is now handled by the remove_noise function.

      Here is the cleaned version of nlp_test.py:

      from nltk.stem.wordnet import WordNetLemmatizer
      from nltk.corpus import twitter_samples, stopwords
      from nltk.tag import pos_tag
      from nltk.tokenize import word_tokenize
      from nltk import FreqDist, classify, NaiveBayesClassifier
      
      import re, string, random
      
      def remove_noise(tweet_tokens, stop_words = ()):
      
          cleaned_tokens = []
      
          for token, tag in pos_tag(tweet_tokens):
              token = re.sub('http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+#]|[!*(),]|'
                             '(?:%[0-9a-fA-F][0-9a-fA-F]))+','', token)
              token = re.sub("(@[A-Za-z0-9_]+)","", token)
      
              if tag.startswith("NN"):
                  pos = 'n'
              elif tag.startswith('VB'):
                  pos = 'v'
              else:
                  pos = 'a'
      
              lemmatizer = WordNetLemmatizer()
              token = lemmatizer.lemmatize(token, pos)
      
              if len(token) > 0 and token not in string.punctuation and token.lower() not in stop_words:
                  cleaned_tokens.append(token.lower())
          return cleaned_tokens
      
      def get_all_words(cleaned_tokens_list):
          for tokens in cleaned_tokens_list:
              for token in tokens:
                  yield token
      
      def get_tweets_for_model(cleaned_tokens_list):
          for tweet_tokens in cleaned_tokens_list:
              yield dict([token, True] for token in tweet_tokens)
      
      if __name__ == "__main__":
      
          positive_tweets = twitter_samples.strings('positive_tweets.json')
          negative_tweets = twitter_samples.strings('negative_tweets.json')
          text = twitter_samples.strings('tweets.20150430-223406.json')
          tweet_tokens = twitter_samples.tokenized('positive_tweets.json')[0]
      
          stop_words = stopwords.words('english')
      
          positive_tweet_tokens = twitter_samples.tokenized('positive_tweets.json')
          negative_tweet_tokens = twitter_samples.tokenized('negative_tweets.json')
      
          positive_cleaned_tokens_list = []
          negative_cleaned_tokens_list = []
      
          for tokens in positive_tweet_tokens:
              positive_cleaned_tokens_list.append(remove_noise(tokens, stop_words))
      
          for tokens in negative_tweet_tokens:
              negative_cleaned_tokens_list.append(remove_noise(tokens, stop_words))
      
          all_pos_words = get_all_words(positive_cleaned_tokens_list)
      
          freq_dist_pos = FreqDist(all_pos_words)
          print(freq_dist_pos.most_common(10))
      
          positive_tokens_for_model = get_tweets_for_model(positive_cleaned_tokens_list)
          negative_tokens_for_model = get_tweets_for_model(negative_cleaned_tokens_list)
      
          positive_dataset = [(tweet_dict, "Positive")
                               for tweet_dict in positive_tokens_for_model]
      
          negative_dataset = [(tweet_dict, "Negative")
                               for tweet_dict in negative_tokens_for_model]
      
          dataset = positive_dataset + negative_dataset
      
          random.shuffle(dataset)
      
          train_data = dataset[:7000]
          test_data = dataset[7000:]
      
          classifier = NaiveBayesClassifier.train(train_data)
      
          print("Accuracy is:", classify.accuracy(classifier, test_data))
      
          print(classifier.show_most_informative_features(10))
      
          custom_tweet = "I ordered just once from TerribleCo, they screwed up, never used the app again."
      
          custom_tokens = remove_noise(word_tokenize(custom_tweet))
      
          print(custom_tweet, classifier.classify(dict([token, True] for token in custom_tokens)))
      

      Conclusion

      This tutorial introduced you to a basic sentiment analysis model using the nltk library in Python 3. First, you performed pre-processing on tweets by tokenizing a tweet, normalizing the words, and removing noise. Next, you visualized frequently occurring items in the data. Finally, you built a model to associate tweets to a particular sentiment.

      A supervised learning model is only as good as its training data. To further strengthen the model, you could consider adding more categories like excitement and anger. In this tutorial, you have only scratched the surface by building a rudimentary model. Here’s a detailed guide on various considerations to take into account while performing sentiment analysis.




