
      How To Perform Sentiment Analysis in Python 3 Using the Natural Language Toolkit (NLTK)


      The author selected the Open Internet/Free Speech fund to receive a donation as part of the Write for DOnations program.

      Introduction

      A large amount of data that is generated today is unstructured, which requires processing to generate insights. Some examples of unstructured data are news articles, posts on social media, and search history. The process of analyzing natural language and making sense out of it falls under the field of Natural Language Processing (NLP). Sentiment analysis is a common NLP task, which involves classifying texts or parts of texts into a pre-defined sentiment. You will use the Natural Language Toolkit (NLTK), a commonly used NLP library in Python, to analyze textual data.

In this tutorial, you will prepare a dataset of sample tweets from the NLTK package for NLP with different data cleaning methods. Once the dataset is ready for processing, you will train a model on pre-classified tweets and use the model to classify the sample tweets into negative and positive sentiments.

This article assumes that you are familiar with the basics of Python (see our How To Code in Python 3 series), primarily the use of data structures, classes, and methods. The tutorial assumes that you have no background in NLP or nltk, although some familiarity with them is an added advantage.

      Prerequisites

      Step 1 — Installing NLTK and Downloading the Data

      You will use the NLTK package in Python for all NLP tasks in this tutorial. In this step you will install NLTK and download the sample tweets that you will use to train and test your model.

      First, install the NLTK package with the pip package manager:
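• pip3 install nltk

Depending on how pip is set up on your system, the command may be pip instead of pip3, and you can pin a specific version if you want to match a particular NLTK release.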

      This tutorial will use sample tweets that are part of the NLTK package. First, start a Python interactive session by running the following command:
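• python3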

Then, import the nltk module in the Python interpreter:
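• import nltk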

      Download the sample tweets from the NLTK package:

      • nltk.download('twitter_samples')

      Running this command from the Python interpreter downloads and stores the tweets locally. Once the samples are downloaded, they are available for your use.

      You will use the negative and positive tweets to train your model on sentiment analysis later in the tutorial. The tweets with no sentiments will be used to test your model.

      If you would like to use your own dataset, you can gather tweets from a specific time period, user, or hashtag by using the Twitter API.

Now that you’ve imported NLTK and downloaded the sample tweets, exit the interactive session by entering exit(). You are ready to import the tweets and begin processing the data.

      Step 2 — Tokenizing the Data

      Language in its original form cannot be accurately processed by a machine, so you need to process the language to make it easier for the machine to understand. The first part of making sense of the data is through a process called tokenization, or splitting strings into smaller parts called tokens.

      A token is a sequence of characters in text that serves as a unit. Based on how you create the tokens, they may consist of words, emoticons, hashtags, links, or even individual characters. A basic way of breaking language into tokens is by splitting the text based on whitespace and punctuation.
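For instance, Python's built-in str.split() gives a rough, whitespace-only tokenization; this quick illustration (separate from the tutorial script) shows why punctuation-aware tokenizers are useful:

# Whitespace splitting keeps punctuation attached to the adjacent words.
print("NLTK makes NLP easy, fun, and fast!".split())
# ['NLTK', 'makes', 'NLP', 'easy,', 'fun,', 'and', 'fast!']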

      To get started, create a new .py file to hold your script. This tutorial will use nlp_test.py:
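• nano nlp_test.py

You can use any text editor you prefer; nano is shown here as an example.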

      In this file, you will first import the twitter_samples so you can work with that data:

      nlp_test.py

      from nltk.corpus import twitter_samples
      

      This will import three datasets from NLTK that contain various tweets to train and test the model:

      • negative_tweets.json: 5000 tweets with negative sentiments
      • positive_tweets.json: 5000 tweets with positive sentiments
      • tweets.20150430-223406.json: 20000 tweets with no sentiments

      Next, create variables for positive_tweets, negative_tweets, and text:

      nlp_test.py

      from nltk.corpus import twitter_samples
      
      positive_tweets = twitter_samples.strings('positive_tweets.json')
      negative_tweets = twitter_samples.strings('negative_tweets.json')
      text = twitter_samples.strings('tweets.20150430-223406.json')
      

The strings() method of twitter_samples returns all of the tweets within a dataset as strings. Setting the different tweet collections as a variable will make processing and testing easier.

      Before using a tokenizer in NLTK, you need to download an additional resource, punkt. The punkt module is a pre-trained model that helps you tokenize words and sentences. For instance, this model knows that a name may contain a period (like “S. Daityari”) and the presence of this period in a sentence does not necessarily end it. First, start a Python interactive session:

      Run the following commands in the session to download the punkt resource:

      • import nltk
      • nltk.download('punkt')

      Once the download is complete, you are ready to use NLTK’s tokenizers. NLTK provides a default tokenizer for tweets with the .tokenized() method. Add a line to create an object that tokenizes the positive_tweets.json dataset:

      nlp_test.py

      from nltk.corpus import twitter_samples
      
      positive_tweets = twitter_samples.strings('positive_tweets.json')
      negative_tweets = twitter_samples.strings('negative_tweets.json')
      text = twitter_samples.strings('tweets.20150430-223406.json')
      tweet_tokens = twitter_samples.tokenized('positive_tweets.json')
      

If you’d like to test the script to see the .tokenized() method in action, add the following content to your nlp_test.py script. This will print the tokens of the first tweet in the positive_tweets.json dataset:

      nlp_test.py

      from nltk.corpus import twitter_samples
      
      positive_tweets = twitter_samples.strings('positive_tweets.json')
      negative_tweets = twitter_samples.strings('negative_tweets.json')
      text = twitter_samples.strings('tweets.20150430-223406.json')
tweet_tokens = twitter_samples.tokenized('positive_tweets.json')
      
      print(tweet_tokens[0])
      

      Save and close the file, and run the script:
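• python3 nlp_test.py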

      The process of tokenization takes some time because it’s not a simple split on white space. After a few moments of processing, you’ll see the following:

      Output

      ['#FollowFriday', '@France_Inte', '@PKuchly57', '@Milipol_Paris', 'for', 'being', 'top', 'engaged', 'members', 'in', 'my', 'community', 'this', 'week', ':)']

      Here, the .tokenized() method returns special characters such as @ and _. These characters will be removed through regular expressions later in this tutorial.

Now that you’ve seen how the .tokenized() method works, comment out or remove the last line that prints the tokenized tweet by adding a # to the start of the line:

      nlp_test.py

      from nltk.corpus import twitter_samples
      
      positive_tweets = twitter_samples.strings('positive_tweets.json')
      negative_tweets = twitter_samples.strings('negative_tweets.json')
      text = twitter_samples.strings('tweets.20150430-223406.json')
tweet_tokens = twitter_samples.tokenized('positive_tweets.json')
      
      #print(tweet_tokens[0])
      

      Your script is now configured to tokenize data. In the next step you will update the script to normalize the data.

      Step 3 — Normalizing the Data

      Words have different forms—for instance, “ran”, “runs”, and “running” are various forms of the same verb, “run”. Depending on the requirement of your analysis, all of these versions may need to be converted to the same form, “run”. Normalization in NLP is the process of converting a word to its canonical form.

      Normalization helps group together words with the same meaning but different forms. Without normalization, “ran”, “runs”, and “running” would be treated as different words, even though you may want them to be treated as the same word. In this section, you explore stemming and lemmatization, which are two popular techniques of normalization.

Stemming is a process of removing affixes from a word. It is a heuristic approach that simply chops off word endings without considering how the word is used, which makes it fast but sometimes imprecise.
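As a quick illustration (separate from nlp_test.py), NLTK's Porter stemmer shows this behavior:

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# The stemmer strips common suffixes without consulting a vocabulary, so
# "runs" and "running" become "run", while "ran" is left unchanged.
print([stemmer.stem(word) for word in ["ran", "runs", "running"]])
# ['ran', 'run', 'run']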

      In this tutorial you will use the process of lemmatization, which normalizes a word with the context of vocabulary and morphological analysis of words in text. The lemmatization algorithm analyzes the structure of the word and its context to convert it to a normalized form. Therefore, it comes at a cost of speed. A comparison of stemming and lemmatization ultimately comes down to a trade off between speed and accuracy.

Before you proceed to use lemmatization, you need to download the necessary resources. Open a Python interactive session as you did before.

      Run the following commands in the session to download the resources:

      • import nltk
      • nltk.download('wordnet')
      • nltk.download('averaged_perceptron_tagger')

      wordnet is a lexical database for the English language that helps the script determine the base word. You need the averaged_perceptron_tagger resource to determine the context of a word in a sentence.

Once downloaded, you are almost ready to use the lemmatizer. Before running a lemmatizer, you need to determine the context for each word in your text. This is achieved by a tagging algorithm, which assesses the relative position of a word in a sentence. In a Python interactive session, import the pos_tag function and provide a list of tokens as an argument to get the tags:

      • from nltk.tag import pos_tag
      • from nltk.corpus import twitter_samples
      • tweet_tokens = twitter_samples.tokenized('positive_tweets.json')
      • print(pos_tag(tweet_tokens[0]))

      Here is the output of the pos_tag function.

      Output

      [('#FollowFriday', 'JJ'), ('@France_Inte', 'NNP'), ('@PKuchly57', 'NNP'), ('@Milipol_Paris', 'NNP'), ('for', 'IN'), ('being', 'VBG'), ('top', 'JJ'), ('engaged', 'VBN'), ('members', 'NNS'), ('in', 'IN'), ('my', 'PRP$'), ('community', 'NN'), ('this', 'DT'), ('week', 'NN'), (':)', 'NN')]

      From the list of tags, here is the list of the most common items and their meaning:

      • NNP: Noun, proper, singular
      • NN: Noun, common, singular or mass
      • IN: Preposition or conjunction, subordinating
      • VBG: Verb, gerund or present participle
      • VBN: Verb, past participle

These tags come from the Penn Treebank tagset, which documents the full list.

In general, if a tag starts with NN, the word is a noun and if it starts with VB, the word is a verb. After reviewing the tags, exit the Python session by entering exit().

      To incorporate this into a function that normalizes a sentence, you should first generate the tags for each token in the text, and then lemmatize each word using the tag.

      Update the nlp_test.py file with the following function that lemmatizes a sentence:

      nlp_test.py

      ...
      
      from nltk.tag import pos_tag
      from nltk.stem.wordnet import WordNetLemmatizer
      
      def lemmatize_sentence(tokens):
          lemmatizer = WordNetLemmatizer()
          lemmatized_sentence = []
          for word, tag in pos_tag(tokens):
              if tag.startswith('NN'):
                  pos = 'n'
              elif tag.startswith('VB'):
                  pos = 'v'
              else:
                  pos = 'a'
              lemmatized_sentence.append(lemmatizer.lemmatize(word, pos))
          return lemmatized_sentence
      
      print(lemmatize_sentence(tweet_tokens[0]))
      

      This code imports the WordNetLemmatizer class and initializes it to a variable, lemmatizer.

      The function lemmatize_sentence first gets the position tag of each token of a tweet. Within the if statement, if the tag starts with NN, the token is assigned as a noun. Similarly, if the tag starts with VB, the token is assigned as a verb.

      Save and close the file, and run the script:

      Here is the output:

      Output

      ['#FollowFriday', '@France_Inte', '@PKuchly57', '@Milipol_Paris', 'for', 'be', 'top', 'engage', 'member', 'in', 'my', 'community', 'this', 'week', ':)']

      You will notice that the verb being changes to its root form, be, and the noun members changes to member. Before you proceed, comment out the last line that prints the sample tweet from the script.

      Now that you have successfully created a function to normalize words, you are ready to move on to remove noise.

      Step 4 — Removing Noise from the Data

      In this step, you will remove noise from the dataset. Noise is any part of the text that does not add meaning or information to data.

      Noise is specific to each project, so what constitutes noise in one project may not be in a different project. For instance, the most common words in a language are called stop words. Some examples of stop words are “is”, “the”, and “a”. They are generally irrelevant when processing language, unless a specific use case warrants their inclusion.

      In this tutorial, you will use regular expressions in Python to search for and remove these items:

      • Hyperlinks - All hyperlinks in Twitter are converted to the URL shortener t.co. Therefore, keeping them in the text processing would not add any value to the analysis.
• Twitter handles in replies - These Twitter usernames are preceded by an @ symbol, which does not convey any meaning.
      • Punctuation and special characters - While these often provide context to textual data, this context is often difficult to process. For simplicity, you will remove all punctuation and special characters from tweets.

      To remove hyperlinks, you need to first search for a substring that matches a URL starting with http:// or https://, followed by letters, numbers, or special characters. Once a pattern is matched, the .sub() method replaces it with an empty string.
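As a minimal sketch of this idea (using a simplified pattern, not the exact expression in the function below), re.sub() strips a shortened link like this:

import re

# Simplified URL pattern: "http", an optional "s", "://", then any run of
# non-whitespace characters; matches are replaced with an empty string.
sample = "Dang that is some rad #fanart :D https://t.co/bI8k8tb9ht"
print(re.sub(r'https?://\S+', '', sample))
# prints the tweet with the link removed:
# Dang that is some rad #fanart :D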

      Since we will normalize word forms within the remove_noise() function, you can comment out the lemmatize_sentence() function from the script.

      Add the following code to your nlp_test.py file to remove noise from the dataset:

      nlp_test.py

      ...
      
      import re, string
      
      def remove_noise(tweet_tokens, stop_words = ()):
      
          cleaned_tokens = []
      
          for token, tag in pos_tag(tweet_tokens):
              token = re.sub('http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+#]|[!*(),]|'
                             '(?:%[0-9a-fA-F][0-9a-fA-F]))+','', token)
              token = re.sub("(@[A-Za-z0-9_]+)","", token)
      
              if tag.startswith("NN"):
                  pos = 'n'
              elif tag.startswith('VB'):
                  pos = 'v'
              else:
                  pos = 'a'
      
              lemmatizer = WordNetLemmatizer()
              token = lemmatizer.lemmatize(token, pos)
      
              if len(token) > 0 and token not in string.punctuation and token.lower() not in stop_words:
                  cleaned_tokens.append(token.lower())
          return cleaned_tokens
      

      This code creates a remove_noise() function that removes noise and incorporates the normalization and lemmatization mentioned in the previous section. The code takes two arguments: the tweet tokens and the tuple of stop words.

      The code then uses a loop to remove the noise from the dataset. To remove hyperlinks, the code first searches for a substring that matches a URL starting with http:// or https://, followed by letters, numbers, or special characters. Once a pattern is matched, the .sub() method replaces it with an empty string, or ''.

Similarly, to remove @ mentions, the code substitutes the relevant part of text using regular expressions. The code uses the re library to search for @ symbols, followed by numbers, letters, or _, and replaces them with an empty string.

Finally, you can remove punctuation using the string library.

      In addition to this, you will also remove stop words using a built-in set of stop words in NLTK, which needs to be downloaded separately.

      Execute the following command from a Python interactive session to download this resource:

      • nltk.download('stopwords')

      Once the resource is downloaded, exit the interactive session.

      You can use the .words() method to get a list of stop words in English. To test the function, let us run it on our sample tweet. Add the following lines to the end of the nlp_test.py file:

      nlp_test.py

      ...
      from nltk.corpus import stopwords
      stop_words = stopwords.words('english')
      
      print(remove_noise(tweet_tokens[0], stop_words))
      

      After saving and closing the file, run the script again to receive output similar to the following:

      Output

      ['#followfriday', 'top', 'engage', 'member', 'community', 'week', ':)']

Notice that the function removes all @ mentions and stop words, and converts the words to lowercase.

      Before proceeding to the modeling exercise in the next step, use the remove_noise() function to clean the positive and negative tweets. Comment out the line to print the output of remove_noise() on the sample tweet and add the following to the nlp_test.py script:

      nlp_test.py

      ...
      from nltk.corpus import stopwords
      stop_words = stopwords.words('english')
      
      #print(remove_noise(tweet_tokens[0], stop_words))
      
      positive_tweet_tokens = twitter_samples.tokenized('positive_tweets.json')
      negative_tweet_tokens = twitter_samples.tokenized('negative_tweets.json')
      
      positive_cleaned_tokens_list = []
      negative_cleaned_tokens_list = []
      
      for tokens in positive_tweet_tokens:
          positive_cleaned_tokens_list.append(remove_noise(tokens, stop_words))
      
      for tokens in negative_tweet_tokens:
          negative_cleaned_tokens_list.append(remove_noise(tokens, stop_words))
      

      Now that you’ve added the code to clean the sample tweets, you may want to compare the original tokens to the cleaned tokens for a sample tweet. If you’d like to test this, add the following code to the file to compare both versions of the 500th tweet in the list:

      nlp_test.py

      ...
      print(positive_tweet_tokens[500])
      print(positive_cleaned_tokens_list[500])
      

      Save and close the file and run the script. From the output you will see that the punctuation and links have been removed, and the words have been converted to lowercase.

      Output

['Dang', 'that', 'is', 'some', 'rad', '@AbzuGame', '#fanart', '!', ':D', 'https://t.co/bI8k8tb9ht']
['dang', 'rad', '#fanart', ':d']

There are certain issues that might arise during the preprocessing of text. For instance, words without spaces (“iLoveYou”) will be treated as one word, and it can be difficult to separate such words. Furthermore, “Hi”, “Hii”, and “Hiiiii” will be treated differently by the script unless you write something specific to tackle the issue. It’s common to fine-tune the noise removal process for your specific data.

      Now that you’ve seen the remove_noise() function in action, be sure to comment out or remove the last two lines from the script so you can add more to it:

      nlp_test.py

      ...
      #print(positive_tweet_tokens[500])
      #print(positive_cleaned_tokens_list[500])
      

      In this step you removed noise from the data to make the analysis more effective. In the next step you will analyze the data to find the most common words in your sample dataset.

      Step 5 — Determining Word Density

The most basic form of analysis on textual data is to look at word frequency. A single tweet is too small an entity to find out the distribution of words, so the frequency analysis is performed on all positive tweets.

The following snippet defines a generator function, named get_all_words, that takes a list of cleaned token lists as an argument and yields every word across all of the tweet tokens. Add the following code to your nlp_test.py file:

      nlp_test.py

      ...
      
      def get_all_words(cleaned_tokens_list):
          for tokens in cleaned_tokens_list:
              for token in tokens:
                  yield token
      
      all_pos_words = get_all_words(positive_cleaned_tokens_list)
      

Now that you have compiled all words in the sample of tweets, you can find out which are the most common words using the FreqDist class of NLTK. Add the following code to the nlp_test.py file:

      nlp_test.py

      from nltk import FreqDist
      
      freq_dist_pos = FreqDist(all_pos_words)
      print(freq_dist_pos.most_common(10))
      

      The .most_common() method lists the words which occur most frequently in the data. Save and close the file after making these changes.

      When you run the file now, you will find the most common terms in the data:

      Output

      [(':)', 3691), (':-)', 701), (':d', 658), ('thanks', 388), ('follow', 357), ('love', 333), ('...', 290), ('good', 283), ('get', 263), ('thank', 253)]

      From this data, you can see that emoticon entities form some of the most common parts of positive tweets. Before proceeding to the next step, make sure you comment out the last line of the script that prints the top ten tokens.

To summarize, you extracted the tweets from nltk, then tokenized, normalized, and cleaned them up for use in the model. Finally, you looked at the frequencies of tokens in the data and checked the ten most common ones.

      In the next step you will prepare data for sentiment analysis.

      Step 6 — Preparing Data for the Model

Sentiment analysis is the process of identifying the author’s attitude toward the topic being written about. You will create a training dataset to train a model. This is a supervised machine learning process, which requires you to associate each item in the dataset with a “sentiment” for training. In this tutorial, your model will use the “positive” and “negative” sentiments.

      Sentiment analysis can be used to categorize text into a variety of sentiments. For simplicity and availability of the training dataset, this tutorial helps you train your model in only two categories, positive and negative.

A model is a description of a system using rules and equations. It may be as simple as an equation that predicts the weight of a person given their height. The sentiment analysis model that you will build will associate tweets with a positive or a negative sentiment. You will need to split your dataset into two parts: the first part builds the model, while the second part tests its performance.

      In the data preparation step, you will prepare the data for sentiment analysis by converting tokens to the dictionary form and then split the data for training and testing purposes.

      Converting Tokens to a Dictionary

First, you will prepare the data to be fed into the model. You will use the Naive Bayes classifier in NLTK to perform the modeling exercise. Notice that the model requires not just a list of words in a tweet, but a Python dictionary with words as keys and True as values. The following code defines a generator function to change the format of the cleaned data.

      Add the following code to convert the tweets from a list of cleaned tokens to dictionaries with keys as the tokens and True as values. The corresponding dictionaries are stored in positive_tokens_for_model and negative_tokens_for_model.

      nlp_test.py

      ...
      def get_tweets_for_model(cleaned_tokens_list):
          for tweet_tokens in cleaned_tokens_list:
              yield dict([token, True] for token in tweet_tokens)
      
      positive_tokens_for_model = get_tweets_for_model(positive_cleaned_tokens_list)
      negative_tokens_for_model = get_tweets_for_model(negative_cleaned_tokens_list)
      

      Splitting the Dataset for Training and Testing the Model

      Next, you need to prepare the data for training the NaiveBayesClassifier class. Add the following code to the file to prepare the data:

      nlp_test.py

      ...
      import random
      
      positive_dataset = [(tweet_dict, "Positive")
                           for tweet_dict in positive_tokens_for_model]
      
      negative_dataset = [(tweet_dict, "Negative")
                           for tweet_dict in negative_tokens_for_model]
      
      dataset = positive_dataset + negative_dataset
      
      random.shuffle(dataset)
      
      train_data = dataset[:7000]
      test_data = dataset[7000:]
      

      This code attaches a Positive or Negative label to each tweet. It then creates a dataset by joining the positive and negative tweets.

      By default, the data contains all positive tweets followed by all negative tweets in sequence. When training the model, you should provide a sample of your data that does not contain any bias. To avoid bias, you’ve added code to randomly arrange the data using the .shuffle() method of random.

      Finally, the code splits the shuffled data into a ratio of 70:30 for training and testing, respectively. Since the number of tweets is 10000, you can use the first 7000 tweets from the shuffled dataset for training the model and the final 3000 for testing the model.

      In this step, you converted the cleaned tokens to a dictionary form, randomly shuffled the dataset, and split it into training and testing data.

      Step 7 — Building and Testing the Model

      Finally, you can use the NaiveBayesClassifier class to build the model. Use the .train() method to train the model and the .accuracy() method to test the model on the testing data.

      nlp_test.py

      ...
      from nltk import classify
      from nltk import NaiveBayesClassifier
      classifier = NaiveBayesClassifier.train(train_data)
      
      print("Accuracy is:", classify.accuracy(classifier, test_data))
      
      print(classifier.show_most_informative_features(10))
      

      Save, close, and execute the file after adding the code. The output of the code will be as follows:

      Output

Accuracy is: 0.9956666666666667

Most Informative Features
                      :( = True           Negati : Positi = 2085.6 : 1.0
                      :) = True           Positi : Negati =  986.0 : 1.0
                 welcome = True           Positi : Negati =   37.2 : 1.0
                  arrive = True           Positi : Negati =   31.3 : 1.0
                     sad = True           Negati : Positi =   25.9 : 1.0
                follower = True           Positi : Negati =   21.1 : 1.0
                     bam = True           Positi : Negati =   20.7 : 1.0
                    glad = True           Positi : Negati =   18.1 : 1.0
                     x15 = True           Negati : Positi =   15.9 : 1.0
               community = True           Positi : Negati =   14.1 : 1.0

      Accuracy is defined as the percentage of tweets in the testing dataset for which the model was correctly able to predict the sentiment. A 99.5% accuracy on the test set is pretty good.

In the table that shows the most informative features, every row in the output shows the ratio of occurrence of a token in positive and negative tagged tweets in the training dataset. The first row in the data signifies that in all tweets containing the token :(, the ratio of negative to positive tweets was 2085.6 to 1. Interestingly, it seems that there was one token with :( in the positive dataset. You can see that the top two discriminating items in the text are the emoticons. Further, words such as sad lead to negative sentiments, whereas welcome and glad are associated with positive sentiments.

      Next, you can check how the model performs on random tweets from Twitter. Add this code to the file:

      nlp_test.py

      ...
      from nltk.tokenize import word_tokenize
      
      custom_tweet = "I ordered just once from TerribleCo, they screwed up, never used the app again."
      
      custom_tokens = remove_noise(word_tokenize(custom_tweet))
      
      print(classifier.classify(dict([token, True] for token in custom_tokens)))
      

      This code will allow you to test custom tweets by updating the string associated with the custom_tweet variable. Save and close the file after making these changes.

      Run the script to analyze the custom text. Here is the output for the custom text in the example:

      Output

      'Negative'

      You can also check if it characterizes positive tweets correctly:

      nlp_test.py

      ...
      custom_tweet = 'Congrats #SportStar on your 7th best goal from last season winning goal of the year 🙂 #Baller #Topbin #oneofmanyworldies'
      

      Here is the output:

      Output

      'Positive'

      Now that you’ve tested both positive and negative sentiments, update the variable to test a more complex sentiment like sarcasm.

      nlp_test.py

      ...
      custom_tweet = 'Thank you for sending my baggage to CityX and flying me to CityY at the same time. Brilliant service. #thanksGenericAirline'
      

      Here is the output:

      Output

      'Positive'

The model classified this example as positive. This is because the training data wasn’t comprehensive enough to classify sarcastic tweets as negative. If you want your model to predict sarcasm, you would need to provide a sufficient amount of training data to train it accordingly.

In this step you built and tested the model. You also explored some of its limitations, such as not detecting sarcasm in particular examples. Your completed code still has artifacts left over from following the tutorial, so the next step will guide you through aligning the code with Python’s best practices.

      Step 8 — Cleaning Up the Code (Optional)

Though you have completed the tutorial, it is recommended to reorganize the code in the nlp_test.py file to follow best programming practices. Per best practice, your code should meet these criteria:

      • All imports should be at the top of the file. Imports from the same library should be grouped together in a single statement.
      • All functions should be defined after the imports.
      • All the statements in the file should be housed under an if __name__ == "__main__": condition. This ensures that the statements are not executed if you are importing the functions of the file in another file.

We will also remove the code that was commented out while following the tutorial, along with the lemmatize_sentence function, as the lemmatization is now handled by the remove_noise function.

      Here is the cleaned version of nlp_test.py:

      from nltk.stem.wordnet import WordNetLemmatizer
      from nltk.corpus import twitter_samples, stopwords
      from nltk.tag import pos_tag
      from nltk.tokenize import word_tokenize
      from nltk import FreqDist, classify, NaiveBayesClassifier
      
      import re, string, random
      
      def remove_noise(tweet_tokens, stop_words = ()):
      
          cleaned_tokens = []
      
          for token, tag in pos_tag(tweet_tokens):
              token = re.sub('http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+#]|[!*(),]|'
                             '(?:%[0-9a-fA-F][0-9a-fA-F]))+','', token)
              token = re.sub("(@[A-Za-z0-9_]+)","", token)
      
              if tag.startswith("NN"):
                  pos = 'n'
              elif tag.startswith('VB'):
                  pos = 'v'
              else:
                  pos = 'a'
      
              lemmatizer = WordNetLemmatizer()
              token = lemmatizer.lemmatize(token, pos)
      
              if len(token) > 0 and token not in string.punctuation and token.lower() not in stop_words:
                  cleaned_tokens.append(token.lower())
          return cleaned_tokens
      
      def get_all_words(cleaned_tokens_list):
          for tokens in cleaned_tokens_list:
              for token in tokens:
                  yield token
      
      def get_tweets_for_model(cleaned_tokens_list):
          for tweet_tokens in cleaned_tokens_list:
              yield dict([token, True] for token in tweet_tokens)
      
      if __name__ == "__main__":
      
          positive_tweets = twitter_samples.strings('positive_tweets.json')
          negative_tweets = twitter_samples.strings('negative_tweets.json')
          text = twitter_samples.strings('tweets.20150430-223406.json')
          tweet_tokens = twitter_samples.tokenized('positive_tweets.json')[0]
      
          stop_words = stopwords.words('english')
      
          positive_tweet_tokens = twitter_samples.tokenized('positive_tweets.json')
          negative_tweet_tokens = twitter_samples.tokenized('negative_tweets.json')
      
          positive_cleaned_tokens_list = []
          negative_cleaned_tokens_list = []
      
          for tokens in positive_tweet_tokens:
              positive_cleaned_tokens_list.append(remove_noise(tokens, stop_words))
      
          for tokens in negative_tweet_tokens:
              negative_cleaned_tokens_list.append(remove_noise(tokens, stop_words))
      
          all_pos_words = get_all_words(positive_cleaned_tokens_list)
      
          freq_dist_pos = FreqDist(all_pos_words)
          print(freq_dist_pos.most_common(10))
      
          positive_tokens_for_model = get_tweets_for_model(positive_cleaned_tokens_list)
          negative_tokens_for_model = get_tweets_for_model(negative_cleaned_tokens_list)
      
          positive_dataset = [(tweet_dict, "Positive")
                               for tweet_dict in positive_tokens_for_model]
      
          negative_dataset = [(tweet_dict, "Negative")
                               for tweet_dict in negative_tokens_for_model]
      
          dataset = positive_dataset + negative_dataset
      
          random.shuffle(dataset)
      
          train_data = dataset[:7000]
          test_data = dataset[7000:]
      
          classifier = NaiveBayesClassifier.train(train_data)
      
          print("Accuracy is:", classify.accuracy(classifier, test_data))
      
          print(classifier.show_most_informative_features(10))
      
          custom_tweet = "I ordered just once from TerribleCo, they screwed up, never used the app again."
      
          custom_tokens = remove_noise(word_tokenize(custom_tweet))
      
          print(custom_tweet, classifier.classify(dict([token, True] for token in custom_tokens)))
      

      Conclusion

      This tutorial introduced you to a basic sentiment analysis model using the nltk library in Python 3. First, you performed pre-processing on tweets by tokenizing a tweet, normalizing the words, and removing noise. Next, you visualized frequently occurring items in the data. Finally, you built a model to associate tweets to a particular sentiment.

A supervised learning model is only as good as its training data. To further strengthen the model, you could consider adding more categories, such as excitement and anger. In this tutorial, you have only scratched the surface by building a rudimentary model. Here’s a detailed guide on various considerations to take into account when performing sentiment analysis.






      How To Set Up a Jupyter Notebook with Python 3 on Debian 10


      Introduction

      Jupyter Notebook offers a command shell for interactive computing as a web application so that you can share and communicate with code. The tool can be used with several languages, including Python, Julia, R, Haskell, and Ruby. It is often used for working with data, statistical modeling, and machine learning.

      This tutorial will walk you through setting up Jupyter Notebook to run from a Debian 10 server, as well as teach you how to connect to and use the Notebook. Jupyter Notebooks (or just “Notebooks”) are documents produced by the Jupyter Notebook app which contain both computer code and rich text elements (paragraph, equations, figures, links, etc.) which aid in presenting and sharing reproducible research.

      By the end of this guide, you will be able to run Python 3 code using Jupyter Notebook running on a remote Debian 10 server.

      Prerequisites

      In order to complete this guide, you should have a fresh Debian 10 server instance with a basic firewall and a non-root user with sudo privileges configured. You can learn how to set this up by running through our Initial Server Setup with Debian 10 guide.

      Step 1 — Install Pip and Python Headers

      To begin the process, we’ll download and install all of the items we need from the Debian repositories. We will use the Python package manager pip to install additional components a bit later.

      We first need to update the local apt package index and then download and install the packages:
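• sudo apt update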

      Next, install pip and the Python header files, which are used by some of Jupyter’s dependencies:

      • sudo apt install python3-pip python3-dev

      Debian 10 (“Buster”) comes preinstalled with Python 3.7.

      We can now move on to setting up a Python virtual environment into which we’ll install Jupyter.

      Step 2 — Create a Python Virtual Environment for Jupyter

      Now that we have Python 3, its header files, and pip ready to go, we can create a Python virtual environment for easier management. We will install Jupyter into this virtual environment.

      To do this, we first need access to the virtualenv command. We can install this with pip.

      Upgrade pip and install the package by typing:

      • sudo -H pip3 install --upgrade pip
      • sudo -H pip3 install virtualenv

      With virtualenv installed, we can start forming our environment. Create and move into a directory where we can keep our project files:

      • mkdir ~/myprojectdir
      • cd ~/myprojectdir

      Within the project directory, create a Python virtual environment by typing:
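• virtualenv myprojectenv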

      This will create a directory called myprojectenv within your myprojectdir directory. Inside, it will install a local version of Python and a local version of pip. We can use this to install and configure an isolated Python environment for Jupyter.

      Before we install Jupyter, we need to activate the virtual environment. You can do that by typing:

      • source myprojectenv/bin/activate

      Your prompt should change to indicate that you are now operating within a Python virtual environment. It will look something like this: (myprojectenv)user@host:~/myprojectdir$.

      You’re now ready to install Jupyter into this virtual environment.

      Step 3 — Install Jupyter

      With your virtual environment active, install Jupyter with the local instance of pip:
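• pip install jupyter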

      Note: When the virtual environment is activated (when your prompt has (myprojectenv) preceding it), use pip instead of pip3, even if you are using Python 3. The virtual environment’s copy of the tool is always named pip, regardless of the Python version.

      At this point, you’ve successfully installed all the software needed to run Jupyter. We can now start the Notebook server.

      Step 4 — Run Jupyter Notebook

      You now have everything you need to run Jupyter Notebook! To run it, execute the following command:
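• jupyter notebook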

      A log of the activities of the Jupyter Notebook will be printed to the terminal. When you run Jupyter Notebook, it runs on a specific port number. The first Notebook you run will usually use port 8888. To check the specific port number Jupyter Notebook is running on, refer to the output of the command used to start it:

      Output

[I 21:23:21.198 NotebookApp] Writing notebook server cookie secret to /run/user/1001/jupyter/notebook_cookie_secret
[I 21:23:21.361 NotebookApp] Serving notebooks from local directory: /home/sammy/myprojectdir
[I 21:23:21.361 NotebookApp] The Jupyter Notebook is running at:
[I 21:23:21.361 NotebookApp] http://localhost:8888/?token=1fefa6ab49a498a3f37c959404f7baf16b9a2eda3eaa6d72
[I 21:23:21.361 NotebookApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
[W 21:23:21.361 NotebookApp] No web browser found: could not locate runnable browser.
[C 21:23:21.361 NotebookApp] Copy/paste this URL into your browser when you connect for the first time, to login with a token:
    http://localhost:8888/?token=1fefa6ab49a498a3f37c959404f7baf16b9a2eda3eaa6d72

      If you are running Jupyter Notebook on a local Debian computer (not on a Droplet), you can simply navigate to the displayed URL to connect to Jupyter Notebook. If you are running Jupyter Notebook on a Droplet, you will need to connect to the server using SSH tunneling as outlined in the next section.

      At this point, you can keep the SSH connection open and keep Jupyter Notebook running or can exit the app and re-run it once you set up SSH tunneling. Let’s keep it simple and stop the Jupyter Notebook process. We will run it again once we have SSH tunneling working. To stop the Jupyter Notebook process, press CTRL+C, type Y, and hit ENTER to confirm. The following will be displayed:

      Output

[C 21:28:28.512 NotebookApp] Shutdown confirmed
[I 21:28:28.512 NotebookApp] Shutting down 0 kernels

      We’ll now set up an SSH tunnel so that we can access the Notebook.

      Step 5 — Connect to the Server Using SSH Tunneling

      In this section we will learn how to connect to the Jupyter Notebook web interface using SSH tunneling. Since Jupyter Notebook will run on a specific port on the server (such as :8888, :8889 etc.), SSH tunneling enables you to connect to the server’s port securely.

      The next two subsections describe how to create an SSH tunnel from 1) a Mac or Linux and 2) Windows. Please refer to the subsection for your local computer.

      SSH Tunneling with a Mac or Linux

      If you are using a Mac or Linux, the steps for creating an SSH tunnel are similar to using SSH to log in to your remote server, except that there are additional parameters in the ssh command. This subsection will outline the additional parameters needed in the ssh command to tunnel successfully.

      SSH tunneling can be done by running the following SSH command in a new local terminal window:

      • ssh -L 8888:localhost:8888 your_server_username@your_server_ip

      The ssh command opens an SSH connection, but -L specifies that the given port on the local (client) host is to be forwarded to the given host and port on the remote side (server). This means that whatever is running on the second port number (e.g. 8888) on the server will appear on the first port number (e.g. 8888) on your local computer.

      Optionally change port 8888 to one of your choosing to avoid using a port already in use by another process.

your_server_username is your username (e.g. sammy) on the server which you created and your_server_ip is the IP address of your server.

      For example, for the username sammy and the server address 203.0.113.0, the command would be:

      • ssh -L 8888:localhost:8888 sammy@203.0.113.0

      If no error shows up after running the ssh -L command, you can move into your programming environment and run Jupyter Notebook:
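• jupyter notebook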

      You’ll receive output with a URL. From a web browser on your local machine, open the Jupyter Notebook web interface with the URL that starts with http://localhost:8888. Ensure that the token number is included, or enter the token number string when prompted at http://localhost:8888.

      SSH Tunneling with Windows and Putty

      If you are using Windows, you can create an SSH tunnel using Putty.

      First, enter the server URL or IP address as the hostname as shown:

      Set Hostname for SSH Tunnel

      Next, click SSH on the bottom of the left pane to expand the menu, and then click Tunnels. Enter the local port number to use to access Jupyter on your local machine. Choose 8000 or greater to avoid ports used by other services, and set the destination as localhost:8888 where :8888 is the number of the port that Jupyter Notebook is running on.

      Now click the Add button, and the ports should appear in the Forwarded ports list:

      Forwarded ports list

      Finally, click the Open button to connect to the server via SSH and tunnel the desired ports. Navigate to http://localhost:8000 (or whatever port you chose) in a web browser to connect to Jupyter Notebook running on the server. Ensure that the token number is included, or enter the token number string when prompted at http://localhost:8000.

      Step 6 — Using Jupyter Notebook

      This section goes over the basics of using Jupyter Notebook. If you don’t currently have Jupyter Notebook running, start it with the jupyter notebook command.

      You should now be connected to it using a web browser. Jupyter Notebook is a very powerful tool with many features. This section will outline a few of the basic features to get you started using the Notebook. Jupyter Notebook will show all of the files and folders in the directory it is run from, so when you’re working on a project make sure to start it from the project directory.

      To create a new Notebook file, select New > Python 3 from the top right pull-down menu:

      Create a new Python 3 notebook

      This will open a Notebook. We can now run Python code in the cell or change the cell to markdown. For example, change the first cell to accept Markdown by clicking Cell > Cell Type > Markdown from the top navigation bar. We can now write notes using Markdown and even include equations written in LaTeX by putting them between the $$ symbols. For example, type the following into the cell after changing it to markdown:

      # First Equation
      
      Let us now implement the following equation:
      $$ y = x^2$$
      
      where $x = 2$
      

      To turn the markdown into rich text, press CTRL+ENTER, and the following should be the results:

      results of markdown

      You can use the markdown cells to make notes and document your code. Let’s implement that equation and print the result. Click on the top cell, then press ALT+ENTER to add a cell below it. Enter the following code in the new cell.

      x = 2
      y = x**2
      print(y)
      

      To run the code, press CTRL+ENTER. You’ll receive the following results:

      first equation results

      You now have the ability to import modules and use the Notebook as you would with any other Python development environment!

      Conclusion

      At this point, you should be able to write reproducible Python code and notes in Markdown using Jupyter Notebook. To get a quick tour of Jupyter Notebook from within the interface, select Help > User Interface Tour from the top navigation menu to learn more.

      From here, you can begin a data analysis and visualization project by reading Data Analysis and Visualization with pandas and Jupyter Notebook in Python 3.




      How To Install Python 3 and Set Up a Programming Environment on Debian 10


      Introduction

      Python is a flexible and versatile programming language suitable for many use cases, including scripting, automation, data analysis, machine learning, and back-end development. First published in 1991 with a name inspired by the British comedy group Monty Python, the development team wanted to make Python a language that was fun to use. Quick to set up with immediate feedback on errors, Python is a useful language to learn for beginners and experienced developers alike. Python 3 is the most current version of the language and is considered to be the future of Python.

      This tutorial will get your Debian 10 server set up with a Python 3 programming environment. Programming on a server has many advantages and supports collaboration across development projects.

      Prerequisites

      In order to complete this tutorial, you should have a non-root user with sudo privileges on a Debian 10 server. To learn how to achieve this setup, follow our Debian 10 initial server setup guide.

      If you’re not already familiar with a terminal environment, you may find the article “An Introduction to the Linux Terminal” useful for becoming better oriented with the terminal.

      With your server and user set up, you are ready to begin.

      Step 1 — Setting Up Python 3

      Debian Linux ships with both Python 3 and Python 2 pre-installed. To make sure that our versions are up-to-date, let’s update and upgrade the system with the apt command to work with the Advanced Packaging Tool:

      • sudo apt update
      • sudo apt -y upgrade

The -y flag confirms that we agree to install all of the listed items.

      Once the process is complete, we can check the version of Python 3 that is installed in the system by typing:
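• python3 -V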

      You’ll receive output in the terminal window that will let you know the version number. While this number may vary, the output will be similar to this:

      Output

      Python 3.7.3

      To manage software packages for Python, let’s install pip, a tool that will install and manage programming packages we may want to use in our development projects. You can learn more about modules or packages that you can install with pip by reading “How To Import Modules in Python 3.”

      • sudo apt install -y python3-pip

      Python packages can be installed by typing:

      • pip3 install package_name

      Here, package_name can refer to any Python package or library, such as Django for web development or NumPy for scientific computing. So if you would like to install NumPy, you can do so with the command pip3 install numpy.

      There are a few more packages and development tools to install to ensure that we have a robust set-up for our programming environment:

      • sudo apt install build-essential libssl-dev libffi-dev python3-dev

      Once Python is set up, and pip and other tools are installed, we can set up a virtual environment for our development projects.

      Step 2 — Setting Up a Virtual Environment

      Virtual environments enable you to have an isolated space on your server for Python projects, ensuring that each of your projects can have its own set of dependencies that won’t disrupt any of your other projects.

      Setting up a programming environment provides us with greater control over our Python projects and over how different versions of packages are handled. This is especially important when working with third-party packages.

      You can set up as many Python programming environments as you want. Each environment is basically a directory or folder on your server that has a few scripts in it to make it act as an environment.

      While there are a few ways to achieve a programming environment in Python, we’ll be using the venv module here, which is part of the standard Python 3 library. Let’s install venv by typing:

      • sudo apt install -y python3-venv

      With this installed, we are ready to create environments. Let’s either choose which directory we would like to put our Python programming environments in, or create a new directory with mkdir, as in:

      • mkdir environments
      • cd environments

      Once you are in the directory where you would like the environments to live, you can create an environment by running the following command:
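• python3 -m venv my_env

Here, my_env is the environment name that the rest of this tutorial assumes; you can choose a different name if you prefer.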

Essentially, venv sets up a new directory that contains a few items which we can view with the ls command:
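• ls my_env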

      Output

      bin include lib lib64 pyvenv.cfg share

Together, these files work to make sure that your projects are isolated from the broader context of your local machine, so that system files and project files don’t mix. This is good practice for version control and to ensure that each of your projects has access to the particular packages that it needs. Python Wheels, a built-package format for Python that can speed up your software production by reducing the number of times you need to compile, will be in the share directory.

      To use this environment, you need to activate it, which you can achieve by typing the following command that calls the activate script:

      • source my_env/bin/activate

      Your command prompt will now be prefixed with the name of your environment, in this case it is called my_env. Depending on what version of Debian Linux you are running, your prefix may appear somewhat differently, but the name of your environment in parentheses should be the first thing you see on your line:

      This prefix lets us know that the environment my_env is currently active, meaning that when we create programs here they will use only this particular environment’s settings and packages.

      Note: Within the virtual environment, you can use the command python instead of python3, and pip instead of pip3 if you would prefer. If you use Python 3 on your machine outside of an environment, you will need to use the python3 and pip3 commands exclusively.

      After following these steps, your virtual environment is ready to use.

      Step 3 — Creating a “Hello, World” Program

      Now that we have our virtual environment set up, let’s create a traditional “Hello, World!” program. This will let us test our environment and provides us with the opportunity to become more familiar with Python if we aren’t already.

      To do this, we’ll open up a command-line text editor such as nano and create a new file:
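• nano hello.py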

      Once the text file opens up in the terminal window we’ll type out our program:

      print("Hello, World!")
      

      Exit nano by typing the CTRL and X keys, and when prompted to save the file press y.

      Once you exit out of nano and return to your shell, let’s run the program:
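• python hello.py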

      The hello.py program that you just created should cause your terminal to produce the following output:

      Output

      Hello, World!

      To leave the environment, simply type the command deactivate and you will return to your original directory.

      Conclusion

      Congratulations! At this point you have a Python 3 programming environment set up on your Debian 10 Linux server and you can now begin a coding project!

      If you are using a local machine rather than a server, refer to the tutorial that is relevant to your operating system in our “How To Install and Set Up a Local Programming Environment for Python 3” series.

      With your server ready for software development, you can continue to learn more about coding in Python by reading our free How To Code in Python 3 eBook, or consulting our Programming Project tutorials.

      Download our free Python eBook!

      How To Code in Python eBook in EPUB format

      How To Code in Python eBook in PDF format


