

      How To Perform Sentiment Analysis in Python 3 Using the Natural Language Toolkit (NLTK)


      The author selected the Open Internet/Free Speech fund to receive a donation as part of the Write for DOnations program.

      Introduction

      A large amount of data that is generated today is unstructured, which requires processing to generate insights. Some examples of unstructured data are news articles, posts on social media, and search history. The process of analyzing natural language and making sense out of it falls under the field of Natural Language Processing (NLP). Sentiment analysis is a common NLP task, which involves classifying texts or parts of texts into a pre-defined sentiment. You will use the Natural Language Toolkit (NLTK), a commonly used NLP library in Python, to analyze textual data.

      In this tutorial, you will prepare a dataset of sample tweets from the NLTK package for NLP with different data cleaning methods. Once the dataset is ready for processing, you will train a model on pre-classified tweets and use the model to classify the sample tweets into negative and positive sentiments.

      This article assumes that you are familiar with the basics of Python (see our How To Code in Python 3 series), primarily the use of data structures, classes, and methods. The tutorial assumes that you have no background in NLP or nltk, although some familiarity with them is an added advantage.

      Prerequisites

      Step 1 — Installing NLTK and Downloading the Data

      You will use the NLTK package in Python for all NLP tasks in this tutorial. In this step you will install NLTK and download the sample tweets that you will use to train and test your model.

      First, install the NLTK package with the pip package manager:
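
      For example, assuming pip is set up for your Python 3 installation (the exact version pin, if any, is up to you):

      • pip install nltk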

      This tutorial will use sample tweets that are part of the NLTK package. First, start a Python interactive session by running the following command:
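
      • python3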

      Then, import the nltk module in the Python interpreter:
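
      • import nltk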

      Download the sample tweets from the NLTK package:

      • nltk.download('twitter_samples')

      Running this command from the Python interpreter downloads and stores the tweets locally. Once the samples are downloaded, they are available for your use.

      You will use the negative and positive tweets to train your model on sentiment analysis later in the tutorial. The tweets with no sentiments will be used to test your model.

      If you would like to use your own dataset, you can gather tweets from a specific time period, user, or hashtag by using the Twitter API.

      Now that you’ve imported NLTK and downloaded the sample tweets, exit the interactive session by entering exit(). You are ready to import the tweets and begin processing the data.

      Step 2 — Tokenizing the Data

      Language in its original form cannot be accurately processed by a machine, so you need to process the language to make it easier for the machine to understand. The first part of making sense of the data is through a process called tokenization, or splitting strings into smaller parts called tokens.

      A token is a sequence of characters in text that serves as a unit. Based on how you create the tokens, they may consist of words, emoticons, hashtags, links, or even individual characters. A basic way of breaking language into tokens is by splitting the text based on whitespace and punctuation.
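
      For example, a plain whitespace split in Python keeps punctuation attached to the neighboring words, which is one reason this tutorial relies on NLTK's tokenizer instead. A minimal illustration (not part of nlp_test.py):

      # A naive whitespace split keeps punctuation glued to tokens.
      text = "NLTK makes tokenization easy, doesn't it?"
      print(text.split())
      # ['NLTK', 'makes', 'tokenization', 'easy,', "doesn't", 'it?']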

      To get started, create a new .py file to hold your script. This tutorial will use nlp_test.py:

      In this file, you will first import the twitter_samples so you can work with that data:

      nlp_test.py

      from nltk.corpus import twitter_samples
      

      This will import three datasets from NLTK that contain various tweets to train and test the model:

      • negative_tweets.json: 5000 tweets with negative sentiments
      • positive_tweets.json: 5000 tweets with positive sentiments
      • tweets.20150430-223406.json: 20000 tweets with no sentiments

      Next, create variables for positive_tweets, negative_tweets, and text:

      nlp_test.py

      from nltk.corpus import twitter_samples
      
      positive_tweets = twitter_samples.strings('positive_tweets.json')
      negative_tweets = twitter_samples.strings('negative_tweets.json')
      text = twitter_samples.strings('tweets.20150430-223406.json')
      

      The strings() method of twitter_samples returns all of the tweets within a dataset as a list of strings. Setting the different tweet collections as variables will make processing and testing easier.
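
      If you want to verify what these variables hold, you could temporarily add a couple of print statements to the file (and remove them afterwards). Each variable is a list of tweet strings, and the positive and negative lists contain 5000 entries each:

      print(len(positive_tweets))  # 5000
      print(positive_tweets[0])    # the raw text of the first positive tweet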

      Before using a tokenizer in NLTK, you need to download an additional resource, punkt. The punkt module is a pre-trained model that helps you tokenize words and sentences. For instance, this model knows that a name may contain a period (like “S. Daityari”) and the presence of this period in a sentence does not necessarily end it. First, start a Python interactive session:

      Run the following commands in the session to download the punkt resource:

      • import nltk
      • nltk.download('punkt')

      Once the download is complete, you are ready to use NLTK’s tokenizers. NLTK provides a default tokenizer for tweets with the .tokenized() method. Add a line to create an object that tokenizes the positive_tweets.json dataset:

      nlp_test.py

      from nltk.corpus import twitter_samples
      
      positive_tweets = twitter_samples.strings('positive_tweets.json')
      negative_tweets = twitter_samples.strings('negative_tweets.json')
      text = twitter_samples.strings('tweets.20150430-223406.json')
      tweet_tokens = twitter_samples.tokenized('positive_tweets.json')
      

      If you’d like to test the script to see the .tokenized() method in action, add the following print statement to your nlp_test.py script. This will print the tokens of a single tweet from the positive_tweets.json dataset:

      nlp_test.py

      from nltk.corpus import twitter_samples
      
      positive_tweets = twitter_samples.strings('positive_tweets.json')
      negative_tweets = twitter_samples.strings('negative_tweets.json')
      text = twitter_samples.strings('tweets.20150430-223406.json')
      tweet_tokens = twitter_samples.tokenized('positive_tweets.json')
      
      print(tweet_tokens[0])
      

      Save and close the file, and run the script:
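
      • python3 nlp_test.py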

      The process of tokenization takes some time because it’s not a simple split on white space. After a few moments of processing, you’ll see the following:

      Output

      ['#FollowFriday', '@France_Inte', '@PKuchly57', '@Milipol_Paris', 'for', 'being', 'top', 'engaged', 'members', 'in', 'my', 'community', 'this', 'week', ':)']

      Here, the .tokenized() method preserves special characters such as @ and _. These characters will be removed through regular expressions later in this tutorial.

      Now that you’ve seen how the .tokenized() method works, make sure to comment out or remove the last line, which prints the tokenized tweet, by adding a # to the start of the line:

      nlp_test.py

      from nltk.corpus import twitter_samples
      
      positive_tweets = twitter_samples.strings('positive_tweets.json')
      negative_tweets = twitter_samples.strings('negative_tweets.json')
      text = twitter_samples.strings('tweets.20150430-223406.json')
      tweet_tokens = twitter_samples.tokenized('positive_tweets.json')
      
      #print(tweet_tokens[0])
      

      Your script is now configured to tokenize data. In the next step you will update the script to normalize the data.

      Step 3 — Normalizing the Data

      Words have different forms—for instance, “ran”, “runs”, and “running” are various forms of the same verb, “run”. Depending on the requirement of your analysis, all of these versions may need to be converted to the same form, “run”. Normalization in NLP is the process of converting a word to its canonical form.

      Normalization helps group together words with the same meaning but different forms. Without normalization, “ran”, “runs”, and “running” would be treated as different words, even though you may want them to be treated as the same word. In this section, you explore stemming and lemmatization, which are two popular techniques of normalization.

      Stemming is a process of removing affixes from a word. It is a heuristic process that simply chops off the ends of words, without considering the context in which a word is used.

      In this tutorial you will use the process of lemmatization, which normalizes a word with the context of vocabulary and morphological analysis of words in text. The lemmatization algorithm analyzes the structure of the word and its context to convert it to a normalized form. Therefore, it comes at a cost of speed. A comparison of stemming and lemmatization ultimately comes down to a trade off between speed and accuracy.

      Before you proceed to use lemmatization, you will need to download the necessary resources. First, start a new Python interactive session:

      Run the following commands in the session to download the resources:

      • import nltk
      • nltk.download('wordnet')
      • nltk.download('averaged_perceptron_tagger')

      wordnet is a lexical database for the English language that helps the script determine the base word. You need the averaged_perceptron_tagger resource to determine the context of a word in a sentence.
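
      With these resources in place, you can see the difference between stemming and lemmatization described earlier in a quick interactive session (this snippet is only for exploration and is not part of nlp_test.py; newer NLTK releases may additionally ask you to run nltk.download('omw-1.4')):

      • from nltk.stem import PorterStemmer, WordNetLemmatizer
      • stemmer = PorterStemmer()
      • lemmatizer = WordNetLemmatizer()
      • print(stemmer.stem("running"), stemmer.stem("ran"))
      • print(lemmatizer.lemmatize("running", pos="v"), lemmatizer.lemmatize("ran", pos="v"))

      The stemmer mechanically chops the ending off “running” and leaves “ran” untouched, while the lemmatizer maps both forms back to the verb “run”.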

      Once downloaded, you are almost ready to use the lemmatizer. Before running a lemmatizer, you need to determine the context for each word in your text. This is achieved by a tagging algorithm, which assesses the relative position of a word in a sentence. In a Python session, import the pos_tag function and provide a list of tokens as an argument to get the tags. Let us try this out in Python:

      • from nltk.tag import pos_tag
      • from nltk.corpus import twitter_samples
      • tweet_tokens = twitter_samples.tokenized('positive_tweets.json')
      • print(pos_tag(tweet_tokens[0]))

      Here is the output of the pos_tag function.

      Output

      [('#FollowFriday', 'JJ'), ('@France_Inte', 'NNP'), ('@PKuchly57', 'NNP'), ('@Milipol_Paris', 'NNP'), ('for', 'IN'), ('being', 'VBG'), ('top', 'JJ'), ('engaged', 'VBN'), ('members', 'NNS'), ('in', 'IN'), ('my', 'PRP$'), ('community', 'NN'), ('this', 'DT'), ('week', 'NN'), (':)', 'NN')]

      From the list of tags, here is the list of the most common items and their meaning:

      • NNP: Noun, proper, singular
      • NN: Noun, common, singular or mass
      • IN: Preposition or conjunction, subordinating
      • VBG: Verb, gerund or present participle
      • VBN: Verb, past participle

      A full list of these tags and their meanings is available in the Penn Treebank tagset documentation.

      In general, if a tag starts with NN, the word is a noun, and if it starts with VB, the word is a verb. After reviewing the tags, exit the Python session by entering exit().

      To incorporate this into a function that normalizes a sentence, you should first generate the tags for each token in the text, and then lemmatize each word using the tag.

      Update the nlp_test.py file with the following function that lemmatizes a sentence:

      nlp_test.py

      ...
      
      from nltk.tag import pos_tag
      from nltk.stem.wordnet import WordNetLemmatizer
      
      def lemmatize_sentence(tokens):
          lemmatizer = WordNetLemmatizer()
          lemmatized_sentence = []
          for word, tag in pos_tag(tokens):
              if tag.startswith('NN'):
                  pos = 'n'
              elif tag.startswith('VB'):
                  pos = 'v'
              else:
                  pos = 'a'
              lemmatized_sentence.append(lemmatizer.lemmatize(word, pos))
          return lemmatized_sentence
      
      print(lemmatize_sentence(tweet_tokens[0]))
      

      This code imports the WordNetLemmatizer class and initializes it to a variable, lemmatizer.

      The function lemmatize_sentence first gets the position tag of each token of a tweet. Within the if statement, if the tag starts with NN, the token is assigned as a noun. Similarly, if the tag starts with VB, the token is assigned as a verb. Any other tag falls through to the else branch and is lemmatized as an adjective ('a').

      Save and close the file, and run the script:

      Here is the output:

      Output

      ['#FollowFriday', '@France_Inte', '@PKuchly57', '@Milipol_Paris', 'for', 'be', 'top', 'engage', 'member', 'in', 'my', 'community', 'this', 'week', ':)']

      You will notice that the verb being changes to its root form, be, and the noun members changes to member. Before you proceed, comment out the last line that prints the sample tweet from the script.

      Now that you have successfully created a function to normalize words, you are ready to move on to removing noise.

      Step 4 — Removing Noise from the Data

      In this step, you will remove noise from the dataset. Noise is any part of the text that does not add meaning or information to data.

      Noise is specific to each project, so what constitutes noise in one project may not be in a different project. For instance, the most common words in a language are called stop words. Some examples of stop words are “is”, “the”, and “a”. They are generally irrelevant when processing language, unless a specific use case warrants their inclusion.

      In this tutorial, you will use regular expressions in Python to search for and remove these items:

      • Hyperlinks - All hyperlinks in Twitter are converted to the URL shortener t.co. Therefore, keeping them in the text processing would not add any value to the analysis.
      • Twitter handles in replies - These Twitter usernames are preceded by an @ symbol, which does not convey any meaning.
      • Punctuation and special characters - While these often provide context to textual data, this context is often difficult to process. For simplicity, you will remove all punctuation and special characters from tweets.

      To remove hyperlinks, you need to first search for a substring that matches a URL starting with http:// or https://, followed by letters, numbers, or special characters. Once a pattern is matched, the .sub() method replaces it with an empty string.
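
      As a simplified illustration of this idea (the full pattern used by the tutorial appears in the code below), re.sub() replaces a token that looks like a link with an empty string:

      import re

      token = "https://t.co/bI8k8tb9ht"
      # The whole token matches the link pattern, so only an empty string remains.
      print(re.sub(r"https?://\S+", "", token))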

      Since we will normalize word forms within the remove_noise() function, you can comment out the lemmatize_sentence() function from the script.

      Add the following code to your nlp_test.py file to remove noise from the dataset:

      nlp_test.py

      ...
      
      import re, string
      
      def remove_noise(tweet_tokens, stop_words = ()):
      
          cleaned_tokens = []
      
          for token, tag in pos_tag(tweet_tokens):
              token = re.sub('http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+#]|[!*(),]|'
                             '(?:%[0-9a-fA-F][0-9a-fA-F]))+','', token)
              token = re.sub("(@[A-Za-z0-9_]+)","", token)
      
              if tag.startswith("NN"):
                  pos = 'n'
              elif tag.startswith('VB'):
                  pos = 'v'
              else:
                  pos = 'a'
      
              lemmatizer = WordNetLemmatizer()
              token = lemmatizer.lemmatize(token, pos)
      
              if len(token) > 0 and token not in string.punctuation and token.lower() not in stop_words:
                  cleaned_tokens.append(token.lower())
          return cleaned_tokens
      

      This code creates a remove_noise() function that removes noise and incorporates the normalization and lemmatization mentioned in the previous section. The code takes two arguments: the tweet tokens and the tuple of stop words.

      The code then uses a loop to remove the noise from the dataset. To remove hyperlinks, the code first searches for a substring that matches a URL starting with http:// or https://, followed by letters, numbers, or special characters. Once a pattern is matched, the .sub() method replaces it with an empty string, or ''.

      Similarly, to remove @ mentions, the code substitutes the relevant part of text using regular expressions. The code uses the re library to search for @ symbols followed by numbers, letters, or _, and replaces them with an empty string.

      Finally, you can remove punctuation using the library string.

      In addition to this, you will also remove stop words using a built-in set of stop words in NLTK, which needs to be downloaded separately.

      Execute the following command from a Python interactive session to download this resource:

      • import nltk
      • nltk.download('stopwords')

      Once the resource is downloaded, exit the interactive session.

      You can use the .words() method to get a list of stop words in English. To test the function, let us run it on our sample tweet. Add the following lines to the end of the nlp_test.py file:

      nlp_test.py

      ...
      from nltk.corpus import stopwords
      stop_words = stopwords.words('english')
      
      print(remove_noise(tweet_tokens[0], stop_words))
      

      After saving and closing the file, run the script again to receive output similar to the following:

      Output

      ['#followfriday', 'top', 'engage', 'member', 'community', 'week', ':)']

      Notice that the function removes all @ mentions and stop words, and converts the words to lowercase.

      Before proceeding to the modeling exercise in the next step, use the remove_noise() function to clean the positive and negative tweets. Comment out the line to print the output of remove_noise() on the sample tweet and add the following to the nlp_test.py script:

      nlp_test.py

      ...
      from nltk.corpus import stopwords
      stop_words = stopwords.words('english')
      
      #print(remove_noise(tweet_tokens[0], stop_words))
      
      positive_tweet_tokens = twitter_samples.tokenized('positive_tweets.json')
      negative_tweet_tokens = twitter_samples.tokenized('negative_tweets.json')
      
      positive_cleaned_tokens_list = []
      negative_cleaned_tokens_list = []
      
      for tokens in positive_tweet_tokens:
          positive_cleaned_tokens_list.append(remove_noise(tokens, stop_words))
      
      for tokens in negative_tweet_tokens:
          negative_cleaned_tokens_list.append(remove_noise(tokens, stop_words))
      

      Now that you’ve added the code to clean the sample tweets, you may want to compare the original tokens to the cleaned tokens for a sample tweet. If you’d like to test this, add the following code to the file to compare both versions of the 500th tweet in the list:

      nlp_test.py

      ...
      print(positive_tweet_tokens[500])
      print(positive_cleaned_tokens_list[500])
      

      Save and close the file and run the script. From the output you will see that the punctuation and links have been removed, and the words have been converted to lowercase.

      Output

      ['Dang', 'that', 'is', 'some', 'rad', '@AbzuGame', '#fanart', '!', ':D', 'https://t.co/bI8k8tb9ht']
      ['dang', 'rad', '#fanart', ':d']

      There are certain issues that might arise during the preprocessing of text. For instance, words without spaces (“iLoveYou”) will be treated as one and it can be difficult to separate such words. Furthermore, “Hi”, “Hii”, and “Hiiiii” will be treated differently by the script unless you write something specific to tackle the issue. It’s common to fine tune the noise removal process for your specific data.

      Now that you’ve seen the remove_noise() function in action, be sure to comment out or remove the last two lines from the script so you can add more to it:

      nlp_test.py

      ...
      #print(positive_tweet_tokens[500])
      #print(positive_cleaned_tokens_list[500])
      

      In this step you removed noise from the data to make the analysis more effective. In the next step you will analyze the data to find the most common words in your sample dataset.

      Step 5 — Determining Word Density

      The most basic form of analysis on textual data is to compute word frequencies. A single tweet is too small an entity to derive a meaningful distribution of words from, so the frequency analysis will be done on all positive tweets.

      The following snippet defines a generator function, named get_all_words, that takes a list of cleaned token lists as an argument and yields every word from every tweet. Add the following code to your nlp_test.py file:

      nlp_test.py

      ...
      
      def get_all_words(cleaned_tokens_list):
          for tokens in cleaned_tokens_list:
              for token in tokens:
                  yield token
      
      all_pos_words = get_all_words(positive_cleaned_tokens_list)
      

      Now that you have compiled all words in the sample of tweets, you can find out which are the most common words using the FreqDist class of NLTK. Add the following code to the nlp_test.py file:

      nlp_test.py

      from nltk import FreqDist
      
      freq_dist_pos = FreqDist(all_pos_words)
      print(freq_dist_pos.most_common(10))
      

      The .most_common() method lists the words which occur most frequently in the data. Save and close the file after making these changes.

      When you run the file now, you will find the most common terms in the data:

      Output

      [(':)', 3691), (':-)', 701), (':d', 658), ('thanks', 388), ('follow', 357), ('love', 333), ('...', 290), ('good', 283), ('get', 263), ('thank', 253)]

      From this data, you can see that emoticon entities form some of the most common parts of positive tweets. Before proceeding to the next step, make sure you comment out the last line of the script that prints the top ten tokens.

      To summarize, you extracted the tweets from nltk, then tokenized, normalized, and cleaned up the tweets for use in the model. Finally, you also looked at the frequencies of tokens in the data and checked the frequencies of the top ten tokens.

      In the next step you will prepare data for sentiment analysis.

      Step 6 — Preparing Data for the Model

      Sentiment analysis is a process of identifying the attitude of the author toward the topic being written about. You will create a training dataset to train a model. It is a supervised machine learning process, which requires you to associate each item in the training dataset with a “sentiment”. In this tutorial, your model will use the “positive” and “negative” sentiments.

      Sentiment analysis can be used to categorize text into a variety of sentiments. For simplicity and availability of the training dataset, this tutorial helps you train your model in only two categories, positive and negative.

      A model is a description of a system using rules and equations. It may be as simple as an equation that predicts the weight of a person given their height. The sentiment analysis model that you will build associates tweets with a positive or a negative sentiment. You will need to split your dataset into two parts: the first part is used to build the model, whereas the second part tests the performance of the model.

      In the data preparation step, you will prepare the data for sentiment analysis by converting tokens to dictionary form and then splitting the data for training and testing purposes.

      Converting Tokens to a Dictionary

      First, you will prepare the data to be fed into the model. You will use the Naive Bayes classifier in NLTK to perform the modeling exercise. Notice that the model requires not just a list of words in a tweet, but a Python dictionary with words as keys and True as values. The following code defines a generator function that changes the format of the cleaned data.

      Add the following code to convert the tweets from a list of cleaned tokens to dictionaries with keys as the tokens and True as values. The corresponding dictionaries are stored in positive_tokens_for_model and negative_tokens_for_model.

      nlp_test.py

      ...
      def get_tweets_for_model(cleaned_tokens_list):
          for tweet_tokens in cleaned_tokens_list:
              yield dict([token, True] for token in tweet_tokens)
      
      positive_tokens_for_model = get_tweets_for_model(positive_cleaned_tokens_list)
      negative_tokens_for_model = get_tweets_for_model(negative_cleaned_tokens_list)
      
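      If you are curious about the resulting format, you could temporarily print one converted tweet. Because get_tweets_for_model() returns a generator, doing so consumes its first item, so remove the line again before training:

      print(next(positive_tokens_for_model))
      # A cleaned tweet such as the one from Step 4 should come out roughly as:
      # {'#followfriday': True, 'top': True, 'engage': True, 'member': True, 'community': True, 'week': True, ':)': True}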

      Splitting the Dataset for Training and Testing the Model

      Next, you need to prepare the data for training the NaiveBayesClassifier class. Add the following code to the file to prepare the data:

      nlp_test.py

      ...
      import random
      
      positive_dataset = [(tweet_dict, "Positive")
                           for tweet_dict in positive_tokens_for_model]
      
      negative_dataset = [(tweet_dict, "Negative")
                           for tweet_dict in negative_tokens_for_model]
      
      dataset = positive_dataset + negative_dataset
      
      random.shuffle(dataset)
      
      train_data = dataset[:7000]
      test_data = dataset[7000:]
      

      This code attaches a Positive or Negative label to each tweet. It then creates a dataset by joining the positive and negative tweets.

      By default, the data contains all positive tweets followed by all negative tweets in sequence. When training the model, you should provide a sample of your data that does not contain any bias. To avoid bias, you’ve added code to randomly arrange the data using the .shuffle() method of random.

      Finally, the code splits the shuffled data into a ratio of 70:30 for training and testing, respectively. Since the number of tweets is 10000, you can use the first 7000 tweets from the shuffled dataset for training the model and the final 3000 for testing the model.

      In this step, you converted the cleaned tokens to a dictionary form, randomly shuffled the dataset, and split it into training and testing data.

      Step 7 — Building and Testing the Model

      Finally, you can use the NaiveBayesClassifier class to build the model. Use the .train() method to train the model and the .accuracy() method to test the model on the testing data.

      nlp_test.py

      ...
      from nltk import classify
      from nltk import NaiveBayesClassifier
      classifier = NaiveBayesClassifier.train(train_data)
      
      print("Accuracy is:", classify.accuracy(classifier, test_data))
      
      print(classifier.show_most_informative_features(10))
      

      Save, close, and execute the file after adding the code. The output of the code will be as follows:

      Output

      Accuracy is: 0.9956666666666667
      Most Informative Features
                          :( = True           Negati : Positi = 2085.6 : 1.0
                          :) = True           Positi : Negati =  986.0 : 1.0
                     welcome = True           Positi : Negati =   37.2 : 1.0
                      arrive = True           Positi : Negati =   31.3 : 1.0
                         sad = True           Negati : Positi =   25.9 : 1.0
                    follower = True           Positi : Negati =   21.1 : 1.0
                         bam = True           Positi : Negati =   20.7 : 1.0
                        glad = True           Positi : Negati =   18.1 : 1.0
                         x15 = True           Negati : Positi =   15.9 : 1.0
                   community = True           Positi : Negati =   14.1 : 1.0

      Accuracy is defined as the percentage of tweets in the testing dataset for which the model was correctly able to predict the sentiment. A 99.5% accuracy on the test set is pretty good.

      In the table that shows the most informative features, every row in the output shows the ratio of occurrence of a token in positive and negative tagged tweets in the training dataset. The first row in the data signifies that in all tweets containing the token :(, the ratio of negative to positive tweets was 2085.6 to 1. Interestingly, it seems that there was one token with :( in the positive dataset. You can see that the top two discriminating items in the text are the emoticons. Further, words such as sad lead to negative sentiments, whereas welcome and glad are associated with positive sentiments.

      Next, you can check how the model performs on random tweets from Twitter. Add this code to the file:

      nlp_test.py

      ...
      from nltk.tokenize import word_tokenize
      
      custom_tweet = "I ordered just once from TerribleCo, they screwed up, never used the app again."
      
      custom_tokens = remove_noise(word_tokenize(custom_tweet))
      
      print(classifier.classify(dict([token, True] for token in custom_tokens)))
      

      This code will allow you to test custom tweets by updating the string associated with the custom_tweet variable. Save and close the file after making these changes.

      Run the script to analyze the custom text. Here is the output for the custom text in the example:

      Output

      'Negative'

      You can also check if it characterizes positive tweets correctly:

      nlp_test.py

      ...
      custom_tweet = 'Congrats #SportStar on your 7th best goal from last season winning goal of the year 🙂 #Baller #Topbin #oneofmanyworldies'
      

      Here is the output:

      Output

      'Positive'

      Now that you’ve tested both positive and negative sentiments, update the variable to test a more complex sentiment like sarcasm.

      nlp_test.py

      ...
      custom_tweet = 'Thank you for sending my baggage to CityX and flying me to CityY at the same time. Brilliant service. #thanksGenericAirline'
      

      Here is the output:

      Output

      'Positive'

      The model classified this example as positive. This is because the training data wasn’t comprehensive enough to classify sarcastic tweets as negative. If you want your model to predict sarcasm, you would need to provide a sufficient amount of training data to train it accordingly.

      In this step you built and tested the model. You also explored some of its limitations, such as not detecting sarcasm in particular examples. Your completed code still has artifacts leftover from following the tutorial, so the next step will guide you through aligning the code to Python’s best practices.

      Step 8 — Cleaning Up the Code (Optional)

      Though you have completed the tutorial, it is recommended that you reorganize the code in the nlp_test.py file to follow best programming practices. Per best practice, your code should meet these criteria:

      • All imports should be at the top of the file. Imports from the same library should be grouped together in a single statement.
      • All functions should be defined after the imports.
      • All the statements in the file should be housed under an if __name__ == "__main__": condition. This ensures that the statements are not executed if you are importing the functions of the file in another file.

      You will also remove the code that was commented out while following the tutorial, along with the lemmatize_sentence function, since the lemmatization is now handled by the new remove_noise function.

      Here is the cleaned version of nlp_test.py:

      from nltk.stem.wordnet import WordNetLemmatizer
      from nltk.corpus import twitter_samples, stopwords
      from nltk.tag import pos_tag
      from nltk.tokenize import word_tokenize
      from nltk import FreqDist, classify, NaiveBayesClassifier
      
      import re, string, random
      
      def remove_noise(tweet_tokens, stop_words = ()):
      
          cleaned_tokens = []
      
          for token, tag in pos_tag(tweet_tokens):
              token = re.sub('http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+#]|[!*(),]|'
                             '(?:%[0-9a-fA-F][0-9a-fA-F]))+','', token)
              token = re.sub("(@[A-Za-z0-9_]+)","", token)
      
              if tag.startswith("NN"):
                  pos = 'n'
              elif tag.startswith('VB'):
                  pos = 'v'
              else:
                  pos = 'a'
      
              lemmatizer = WordNetLemmatizer()
              token = lemmatizer.lemmatize(token, pos)
      
              if len(token) > 0 and token not in string.punctuation and token.lower() not in stop_words:
                  cleaned_tokens.append(token.lower())
          return cleaned_tokens
      
      def get_all_words(cleaned_tokens_list):
          for tokens in cleaned_tokens_list:
              for token in tokens:
                  yield token
      
      def get_tweets_for_model(cleaned_tokens_list):
          for tweet_tokens in cleaned_tokens_list:
              yield dict([token, True] for token in tweet_tokens)
      
      if __name__ == "__main__":
      
          positive_tweets = twitter_samples.strings('positive_tweets.json')
          negative_tweets = twitter_samples.strings('negative_tweets.json')
          text = twitter_samples.strings('tweets.20150430-223406.json')
          tweet_tokens = twitter_samples.tokenized('positive_tweets.json')
      
          stop_words = stopwords.words('english')
      
          positive_tweet_tokens = twitter_samples.tokenized('positive_tweets.json')
          negative_tweet_tokens = twitter_samples.tokenized('negative_tweets.json')
      
          positive_cleaned_tokens_list = []
          negative_cleaned_tokens_list = []
      
          for tokens in positive_tweet_tokens:
              positive_cleaned_tokens_list.append(remove_noise(tokens, stop_words))
      
          for tokens in negative_tweet_tokens:
              negative_cleaned_tokens_list.append(remove_noise(tokens, stop_words))
      
          all_pos_words = get_all_words(positive_cleaned_tokens_list)
      
          freq_dist_pos = FreqDist(all_pos_words)
          print(freq_dist_pos.most_common(10))
      
          positive_tokens_for_model = get_tweets_for_model(positive_cleaned_tokens_list)
          negative_tokens_for_model = get_tweets_for_model(negative_cleaned_tokens_list)
      
          positive_dataset = [(tweet_dict, "Positive")
                               for tweet_dict in positive_tokens_for_model]
      
          negative_dataset = [(tweet_dict, "Negative")
                               for tweet_dict in negative_tokens_for_model]
      
          dataset = positive_dataset + negative_dataset
      
          random.shuffle(dataset)
      
          train_data = dataset[:7000]
          test_data = dataset[7000:]
      
          classifier = NaiveBayesClassifier.train(train_data)
      
          print("Accuracy is:", classify.accuracy(classifier, test_data))
      
          print(classifier.show_most_informative_features(10))
      
          custom_tweet = "I ordered just once from TerribleCo, they screwed up, never used the app again."
      
          custom_tokens = remove_noise(word_tokenize(custom_tweet))
      
          print(custom_tweet, classifier.classify(dict([token, True] for token in custom_tokens)))
      

      Conclusion

      This tutorial introduced you to a basic sentiment analysis model using the nltk library in Python 3. First, you performed pre-processing on tweets by tokenizing a tweet, normalizing the words, and removing noise. Next, you visualized frequently occurring items in the data. Finally, you built a model to associate tweets to a particular sentiment.

      A supervised learning model is only as good as its training data. To further strengthen the model, you could consider adding more categories like excitement and anger. In this tutorial, you have only scratched the surface by building a rudimentary model. Here’s a detailed guide on various considerations that one must take care of while performing sentiment analysis.






      Learn the AWK Programming Language


      Updated by Linode. Contributed by Mihalis Tsoukalos.

      What is AWK?

      AWK is a Turing-complete pattern matching programming language. The name AWK is derived from the family names of its three authors: Alfred Aho, Peter Weinberger and Brian Kernighan. AWK is often associated with sed, which is a UNIX command line tool. However, sed is more appropriate for one line UNIX shell commands and is typically used only for text processing.

      AWK is great for data reporting, analysis, and extraction and supports arrays, associative arrays, functions, variables, loops, and regular expressions. Current Linux systems use improved versions of the original AWK utility. The main enhancement to these AWK variants is support for a larger set of built-in functions and variables. The most widely used variants of AWK are: Gawk, Mawk, and Nawk.

      Note

      This guide uses the Gawk version of AWK.

      There are many practical uses of AWK. For example, you can use AWK and the history command to find your top 10 most frequently issued commands:

      history | awk '{CMD[$2]++;count++;} END {for (a in CMD)print CMD[a] " "CMD[a]/count*100 " % " a;} ' | grep -v "./" | column -c3 -s " " -t | sort -rn | head -n10
      

      This guide assumes familiarity with programming language concepts and is meant to provide an overview of some basic elements of the AWK programming language.

      AWK Basics

      In this section you will learn basics of the AWK programming language, including:

      • How to execute AWK from the command line with one-off commands and by storing AWK code in files.
      • Creating and using variables, arrays, and functions.
      • Special patterns, like BEGIN and END.

      Note

      A pattern in AWK controls the execution of rules and a rule is executed when its pattern is a match for the current input record.

      Run an AWK Program

      A program in AWK can be written via the command line or by executing a file containing the program. If you want to reuse your code, it is better to store it in a file. AWK reads input from standard input or from files specified as command line arguments. Input is divided into individual records and fields. By default, new lines are parsed as a record and whitespace is parsed as a field. After a record is read, it is split into fields. AWK does not alter the original input.
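
      For a quick illustration of this default splitting, you can run a one-off command; NF holds the number of fields in the current record and $2 refers to the second field:

      echo "alpha beta gamma" | awk '{ print NF, $2 }'

      This prints 3 beta, since the single record contains three whitespace-separated fields.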

      The next two sections will walk you through creating a Hello World! program that you will run, both as a one-off program on the command line, and as reusable code saved in a file.

      Hello World! – Command Line

      When an AWK program contains the BEGIN pattern without another special pattern, AWK will not expect any further command line input and exit. Typically, when an AWK program is executed on the command line, without the BEGIN special pattern, AWK will continue to expect input until you exit by typing CTRL-D. The example Hello World! program below will print and immediately exit.

      1. Execute the command as follows:

        awk 'BEGIN { print "Hello World!" }'
        

        The output will be as follows:

          
        Hello World!
        
        

      Hello World! – Input File

      In this section, you will create an AWK program in an input file and then run it from the command line.

      1. Create a file called helloworld.awk with the following content:

        helloworld.awk
        
        BEGIN { print "Hello World!" }
            
      2. On the command line, run the helloworld.awk program. The -f option tells AWK to expect a source file as the program to run.

        awk -f helloworld.awk
        
      3. The output will be as follows:

          
        Hello World!
            
        
      4. You can also run AWK programs as executable scripts. Open helloworld.awk, add an interpreter (shebang) line to the top of the file, and save it as helloworld, without the .awk extension.

        helloworld
        
        #!/usr/bin/awk -f
        
        BEGIN { print "Hello World!" }
            

        The #!/usr/bin/awk -f line tells the shell to execute the file using the awk interpreter.

      5. Add execute permissions to helloworld:

        chmod +x helloworld
        
      6. Execute the helloworld program:

        ./helloworld
        
      7. The output will resemble the following:

          
        Hello World!
            
        

      Variables in AWK

      AWK supports built-in and user defined variables. Built-in variables are native to AWK, whereas user defined variables are ones you define.

      Built-in Variables

      AWK has many built-in variables that are automatically initialized. Some of the most important ones are the following:

      • NF: Holds the number of fields in the current input record. Each record can have a different number of fields.
      • FS: Defines the input field separator. The default value is a single whitespace, which also matches any sequence of spaces and tabs. Additionally, any number of leading or trailing whitespaces and tabs are ignored. If the value of FS is set to the null string, then each character in the current line becomes a separate field.
      • FILENAME: Stores the filename of the current input file. You cannot use FILENAME inside a BEGIN block, because there are no input files being processed.
      • NR: Keeps track of the total number of records that have been read so far.
      • FNR: Stores the total number of records that have been read from the current input file.
      • IGNORECASE: Tells AWK whether or not to ignore case in all of its comparisons and regular expressions. If IGNORECASE stores a non-zero, non-null value, then AWK will ignore case.
      • ARGC: Holds the number of command line arguments.
      • ARGV: Stores the actual command line arguments of an AWK program.

      User Defined Variables

      User defined variables can store numeric or string values. AWK dynamically assigns variables a type based on the variable’s initial value. User defined variables are, by default, initialized to the empty string. If you convert a variable from a string to a number, the default value is zero. You can convert a string to a number and vice versa as long as the string can be converted to a valid number. Keep in mind that AWK is not a type-safe programming language, which can sometimes lead to bugs.

      • You can set a variable via the command line using the -v option. This command will initialize the variable count and print its value:

        awk -v count=8 'BEGIN { print count }'
        
      • To initialize variables within an input file, you can use the form myvariable = "myvar" for strings and myvariable = 10 for numeric values. Create a file named count.awk and add the following content:

        count.awk
        
        BEGIN {
            count = 10
            print count
        }
            

        To run this file, switch back to the command line and execute the following command:

        awk -f count.awk
        

        Your output should display:

          
              10
            
        

      Special Patterns

      AWK uses patterns to control how a rule should be executed against an input record. The two main categories of patterns in AWK are regular expressions and expressions. Regular expressions use a special format to target specific sets of strings, while expressions encompass various ways to target patterns in AWK, like comparison expressions that may utilize regular expressions. Special patterns in AWK include reserved keywords that perform special actions within your AWK programs. The sections below discuss the special patterns BEGIN, END, BEGINFILE, and ENDFILE.

      BEGIN and END

      BEGIN and END are executed only once: before receiving any input and after processing all input, respectively. In this way, they can be used to perform startup and cleanup actions in your AWK programs.

      Although it is not required to use BEGIN and END at the beginning and end of your AWK programs, it is considered good practice to do so. Additionally, you can include multiple BEGIN and END blocks in one program.

      If an AWK program uses only BEGIN rules without any other code, the program terminates without reading any of the specified input. However, if an AWK program contains only END rules without any additional code, all the specified input is read. This is necessary in case the END rule references the FNR and NR variables.
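
      For example, because an END-only program still reads all of its input, it can be used as a compact line counter, while a BEGIN-only program prints and exits without reading the file at all (using /etc/passwd here as an arbitrary input file):

      awk 'END { print NR }' /etc/passwd
      awk 'BEGIN { print "Read nothing" }' /etc/passwd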

      BEGINFILE and ENDFILE

      Note

      BEGINFILE and ENDFILE only work with gawk.

      Two other patterns with special functionality are BEGINFILE and ENDFILE. BEGINFILE is executed before AWK reads the first record from a file, whereas ENDFILE is executed after AWK is done with the last record of a file.

      ENDFILE is convenient for recovering from I/O errors during processing. The AWK program can pass control to ENDFILE, and instead of stopping abnormally it sets the ERRNO variable to describe the error that occurred. AWK clears the ERRNO variable before it starts processing the next file. Similarly, the nextfile statement – when used inside BEGINFILE – allows gawk to move to the next data file instead of exiting with a fatal error and without executing the ENDFILE block.

      1. As an example, create a file named beginfile.awk:

        beginfile.awk
        
        BEGIN {
            numberOfFiles = 0
        }
        
        BEGINFILE {
            print "New file", FILENAME
        
            # Check if there is an error while trying to read the file
            if (ERRNO) {
                print "Cannot read", FILENAME, "– processing next file!"
                nextfile
            }
        }
        
        ENDFILE {
            numberOfFiles++
        }
        
        END {
            print "Total number of files processed: ", numberOfFiles
        }
        • This program showcases the usage of BEGIN, END, BEGINFILE, and ENDFILE by printing the total number of files read as well as the filename of each file.

        • If there is a problem while reading a file, the code will report it.

        • Printing the filename is done with the help of the FILENAME variable.

      2. Execute the beginfile.awk program with the following command, passing it several filenames as input (some of which do not exist):

        gawk -f beginfile.awk hw.awk beginfile.awk givenLine.awk doesNotExist
        
      3. The output will be similar to the following example. The program does not stop abnormally when it does not find an input file and provides a useful error message.

          
        New file hw.awk
        Cannot read hw.awk – processing next file!
        New file beginfile.awk
        New file givenLine.awk
        Cannot read givenLine.awk – processing next file!
        New file doesNotExist
        Cannot read doesNotExist – processing next file!
        Total number of files processed:  1
            
        

      Looping in AWK

      AWK supports for, do-while, and while loops that behave similarly to control flow statements in other programming languages. Loops execute code contained within a code block as many times as specified in the control flow statement. To illustrate loops in AWK, a working example is provided below.

      1. Create and save a file named loops.awk:

        loops.awk
        
        BEGIN {
            for (i = 0; i < ARGC; i++)
                printf "ARGV[%d] = %sn", i, ARGV[i]
        
            k = 0
            while ( k < ARGC ) {
                printf "ARGV[%d] = %sn", k, ARGV[k]
                k++
            }
        
            m = 0
            do {
                printf "ARGV[%d] = %sn", m, ARGV[m]
                m++
            } while ( m < ARGC )
        }
        
        END {
            for (i = 0; i < 10; i++)
                printf "%d ", i
            printf "n"
        }
        • The program uses the value of the ARGC built-in variable to control how many times to loop through each separate block of code. The result will vary depending on how many command line arguments you pass to AWK when executing the program.
        • The for loop after the END special pattern will print numbers from 0 – 9.
      2. Execute the loops.awk input program with the following command:

        echo "" | awk -f loops.awk
        
      3. The output will be similar to the following:

          
        ARGV[0] = awk
        ARGV[0] = awk
        ARGV[0] = awk
        0 1 2 3 4 5 6 7 8 9
        
        

      Arrays

      AWK does not require array indices to be consecutive integers. Instead, strings and numbers may be used. This is because AWK uses string keys internally to represent an array’s indices, and so arrays in AWK are more like associative arrays that store a collection of pairs. Unlike other programming languages, you do not need to declare an array and its size before using it, and new pairs can be added at any time. The file below serves to illustrate the behavior of arrays in AWK.

      1. Create the file arrays.awk:

        arrays.awk
        
        BEGIN {
            a[0] = 1;
            a[1] = 2;
            a[2] = 3;
            a[3] = 4;
        
            for (i in a)
                print "Index:", i, "with value:", a[i];
                print "Adding two elements and deleting a[0]";
        
                a["One"] = "One_value";
                a["Two"] = "Two_value";
                delete a[0];
        
            for (i in a)
                print "Index:", i, "with value:", a[i];
        
            if (a["1"] == a[1])
                printf "a[1] = a["1"] = %sn", a["1"];
        }
        • The program creates the a[] array and initializes it with four separate numeric values.
        • The for block will loop through the array and print the current index and value.
        • It then adds two new elements to array a[] that use string indices instead of numbers.
        • It demonstrates how to delete an element from an array by deleting the a[0] element.
        • Finally, the if statement evaluates if a["1"] and a[1] are equivalent. Since AWK stores all array elements as string keys, both indices point to the same array element and the code in the if statement executes.
      2. Run the program with the following command:

        awk -f arrays.awk
        
      3. The output will look similar to the following:

          
        Index: 0 with value: 1
        Index: 1 with value: 2
        Index: 2 with value: 3
        Index: 3 with value: 4
        Adding two elements and deleting a[0]
        Index: Two with value: Two_value
        Index: One with value: One_value
        Index: 1 with value: 2
        Index: 2 with value: 3
        Index: 3 with value: 4
        a[1] = a["1"] = 2
          
        

        Note

        The order of the array indices may be out of order. This is because arrays in AWK are associative and not assigned in blocks of contiguous memory.

      Functions

      Like most programming languages, AWK supports user-defined functions and ships with several useful built-in functions. This section will provide examples demonstrating how to use both types of functions.

      Predefined Functions

      AWK’s built-in functions provide mechanisms for string manipulation, numeric operations, and I/O functions to work with files and shell commands. The example below utilizes the built-in numeric functions rand() and int() to show how to call built-in functions.

      1. Create and save a file named rand.awk:

        rand.awk
        
        BEGIN {
            while (i < 20) {
                n = int(rand()*10);
                print "value of n:", n;
                i++;
            }
        }
        • The rand.awk program uses the rand() function to generate a random number and stores it in the n variable. By default, rand() returns a random number between 0 and 1, so the program multiplies the returned value by 10 to produce values between 0 and 10.
        • AWK’s int() function truncates the result of the rand() function to an integer by discarding its fractional part.
      2. Execute the rand.awk program with the following command:

        awk -f rand.awk
        
      3. The output will resemble the following:

          
        value of n: 2
        value of n: 2
        value of n: 8
        value of n: 1
        value of n: 5
        value of n: 1
        value of n: 8
        value of n: 1
        ...
        
        

      User Defined Functions

      The AWK programming language allows you to define your own functions and call them throughout an AWK program file. A function definition must include a name and can include a parameter list. Function names can only contain a sequence of letters, digits, and underscores. The function name cannot begin with a digit. In the example below, you will declare a function definition and utilize it within the AWK program.

      1. Create and save the myFunction.awk file:

        myFunction.awk
        
        function isnum(x) { return(x==x+0) }
        
        function sumToN(n) {
            sum = 0
            if (n < 0) { n = -n }
            if ( isnum(n) ) {
                for (j = 1; j <= n; j++)
                    sum = sum + j
            } else { return -1 }
            return sum
        }
        {
            for (i=1; i<=NF; i++)
                print $i, "t:", sumToN($i)
        }
        • The user defined function sumToN() takes a single parameter n and uses a for loop to add the integers from 1 to n, storing the result in the sum variable.
        • The main rule reads the input, passes each field as a parameter to the sumToN() function, and prints the calculated sum.
      2. Execute myFunction.awk with the following command:

        echo "10 12" | awk -f myFunction.awk
        
      3. Your output will resemble the example below. If you use a different set of numbers, your output will differ from the example.

          
        10 : 55
        12 : 78
        
        

      Practical Examples

      This section of the guide provides a variety of practical examples to further demonstrate the AWK programming language. You can try out each example on your own Linux machine or expand on the examples for your own specific needs.

      Printing

      Printing a Given Line from a File

      1. To use AWK to print a given line from a text file, create and save the givenLine.awk file:

        givenLine.awk
        
        {
            if (NR == line)
                print $0;
        }
        • This program will print out the record that corresponds to the value passed to the line variable. The program will require input either from the command line or from a file.
        • You should pass the value of the line variable to the AWK program as a command line argument using the -v option.
      2. Execute the givenLine.awk program as follows to print out the first line of the myFunction.awk program written in the previous section. (You could similarly pass it any text file.)

        awk -v line=1 -f givenLine.awk myFunction.awk
        

        The output will resemble the following:

          
        function isnum(x) { return(x==x+0) }
        
        
      3. Execute givenLine.awk again, passing line 4:

        awk -v line=4 -f givenLine.awk myFunction.awk
        
      4. This time the output is as follows:

          
            sum = 0
        
        

      Printing Two Given Fields from a File

      In this example, the AWK program will print the values of the first and third fields of any text file.

      1. Create and save the file field1and3.awk:

        field1and3.awk
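
        # Print the first and third fields of each input record.
        {
            print $1, $3
        }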
      2. Create and save the file words.txt:

        words.txt
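
        (Any whitespace-separated words will do; a single line such as the following is consistent with the output shown in this section and with the record counts in the next one.)

        one two three four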
      3. Execute field1and3.awk passing words.txt as input:

        awk -f field1and3.awk words.txt
        
      4. The output will print only the first and third words (fields) contained in the file:

          
        one three
        
        

        Note

        You can also execute the contents of field1and3.awk on the command line and pass words.txt as input:

        awk '{print $1, $3}' words.txt
        

      Counting

      Counting Lines

      The following example AWK program will count the number of lines that are found in the given text file(s).

      FNR stores the total number of records that have been read from the current input file.

      1. Create and save the countLines.awk file:

        countLines.awk
        
        {
            if (FNR==1)
                print "Processing:", FILENAME;
        }
        
        END {
            print "Read", NR, "records in total";
        }
        • The use of FNR makes sure that the filename of each processed file will be printed only once.
        • END makes sure that the results will be printed just before AWK finishes executing countLines.awk.
      2. Create and save the data.txt file. This file will be passed to AWK as input for processing.

        data.txt
        
        one
        two
        three
        4
        
        6
        seven not eight
      3. Execute countLines.awk with the following command, passing data.txt as input:

        awk -f countLines.awk data.txt
        
      4. The output will resemble the following:

          
        Processing: data.txt
        Read 7 records in total
        
        
      5. Execute countLines.awk with multiple files for processing. You can use words.txt from the previous exercise.

        awk -f countLines.awk data.txt words.txt
        
      6. You should see a similar output:

          
        Processing: data.txt
        Processing: words.txt
        Read 8 records in total
        
        

      Counting Lines with a Specific Pattern

      The following AWK code uses the variable n to count the number of lines that contain the string three:

      awk '/three/ { n++ }; END { print n+0 }'
      
      • The code above tells AWK to execute n++ each time there is a match to the /three/ regular expression.

      • When the processing is done, the code in END is executed. Adding zero to n forces AWK to treat it as a number, so if no line matched (and n was never set) the program prints 0 instead of an empty string.

      1. Create a file named dataFile.txt to pass to AWK as input for processing:

        dataFile.txt
        
            one
            two
            three
            four
            three
            two
            one
      2. Execute the example code and pass dataFile.txt as input:

        awk '/three/ { n++ }; END { print n+0 }' dataFile.txt
        
      3. The output will look as follows:

          
        2
        
        

      Counting Characters

      In this example, the countChars.awk file calculates the number of characters found in an input file.

      1. Create and save the file countChars.awk:

        countChars.awk
        
        BEGIN {
            n = 0;
        }
        
        {
            if (FNR==1)
                print "Processing:", FILENAME;
        
            n = n + length($0) + 1;
        }
        
        END {
            print "Read", n, "characters in total";
        }
        • This program makes use of the built-in string function length(), which returns the number of characters in a string. In this program, the string is the entire current record, referenced with $0.
        • The + 1 added to the result of length() accounts for the newline character that terminates each line.
      2. Execute countChars.awk by running the following command and pass it the countLines.awk file from the previous exercise.

        awk -f countChars.awk countLines.awk
        
      3. The output will look similar to the following:

          
        Processing: countLines.awk
        Read 110 characters in total
        
        
      4. Execute countChars.awk with multiple files to process as follows:

        awk -f countChars.awk countLines.awk field1and3.awk
        
          
        Processing: countLines.awk
        Processing: field1and3.awk
        Read 132 characters in total
        
        

      Calculating Word Frequencies

      This example demonstrates some of the advanced capabilities of AWK. The file wordFreq.awk reads a text file and counts how many times each word appears in the text file using associative arrays.

      1. Create and save the file wordFreq.awk:

        wordFreq.awk
        
        {
            for (i= 1; i<=NF; i++ ) {
                $i = tolower($i)
                freq[$i]++
            }
        }
        
        END {
            for (word in freq)
                print word, ":", freq[word]
        }
        • wordFreq.awk uses a for loop to iterate over the fields of each input record and increments the freq[] array entry keyed by each word.
        • The tolower() built-in string function ensures the program does not count the same word multiple times because of differences in case, e.g., seven and Seven are not counted as different words.
        • Before the program exits, the END block prints out each word and its frequency within the input file.
      2. Create and save the file wordFreq.txt to use as an input file.

        wordFreq.txt
        
        one two
        three one four seven Seven
        One Two TWO
        
        one three five
      3. Execute the wordFreq.awk program and pass wordFreq.txt as input:

        awk -f wordFreq.awk wordFreq.txt | sort -k3rn
        

        Note

        The sort -k3rn command sorts the output of wordFreq.awk on its third field (the count), numerically and in reverse (descending) order.

      4. The output will resemble the following:

          
        one : 4
        two : 3
        seven : 2
        three : 2
        five : 1
        four : 1
        
        

      Updating Docker Images

      1. Use the following series of piped commands to update all Docker images found on your local machine to their latest version:

        docker images | grep -v REPOSITORY | awk '{print $1}' | xargs -L1 docker pull
        
        • In this example, AWK is just one piece of the pipeline. It extracts the first field (the repository name) from the output of docker images, grep -v REPOSITORY removes the header line, and xargs -L1 docker pull pulls each image in turn.

      Finding

      Finding the Top-10 Commands of your Command History

      1. Use the following shell command to find your top 10 most used commands by piping the output of history to AWK as input:

        history | awk '{CMD[$2]++;count++;} END {for (a in CMD)print CMD[a] " "CMD[a]/count*100 " % " a;} ' | grep -v "./" | column -c3 -s " " -t | sort -rn | head -n10
        
        • First, the command executes the history command to be used as AWK’s input.

        • This output is processed by an awk command that counts how many times each command appears in the history by examining the second field of each record, which holds the command name. These counts are stored in the CMD[] associative array.

        • At the same time, the total number of commands that have been processed is stored in the count variable.

        • The frequency of each command is calculated with the CMD[a]/count*100 statement and printed on the screen along with the command name.

        • The formatting and the sorting of the output is handled by the grep, column, sort, and head command line utilities.

      2. Your output should resemble the following:

          
        2318  18.4775    %  git
        1224  9.75688    %  ll
        1176  9.37425    %  go
        646   5.14946    %  docker
        584   4.65524    %  cat
        564   4.49582    %  brew
        427   3.40375    %  lenses-cli
        421   3.35592    %  cd
        413   3.29215    %  vi
        378   3.01315    %  rm
        
        

      Finding the Number of Records that Appear More than Once

      This program’s logic utilizes the behavior of AWK associative arrays. The associative array’s keys are the entire lines of the passed input. This means that if a line appears more than once, it will be found in the associative array and will have a value that is different from the default, which is 0.
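
      The following one-liner is a minimal sketch of that behavior: an associative array element that has never been assigned compares equal to 0.

      awk 'BEGIN { if (freq["never-set"] == 0) print "unset elements compare equal to 0" }'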

      1. Create and save the file nDuplicates.awk:

        nDuplicates.awk
        
        BEGIN {
            total = 0;
        }
        
        {
            i = tolower($0);
            if (freq[i] == 1) {
                total = total + 2;
            } else if (freq[i] > 1) {
                total++;
            }
            freq[i]++;
        }
        
        END {
            print "Found", total, "lines with duplicate records.";
        }
      2. Execute the nDuplicates.awk file and pass the file to itself as input:

        awk -f nDuplicates.awk nDuplicates.awk
        
      3. The output will look similar to the following:

          
        Found 5 lines with duplicate records.
            
        
      4. Execute the command again, passing the file twice to itself:

        awk -f nDuplicates.awk nDuplicates.awk nDuplicates.awk
        
      5. The output will look similar to the following:

          
        Found 34 lines with duplicate records.
        
        


      This guide is published under a CC BY-ND 4.0 license.




      Introduction to HashiCorp Configuration Language (HCL)


      Written and updated by Linode

      HCL is a configuration language authored by HashiCorp. HCL is used with HashiCorp’s cloud infrastructure automation tools, like Terraform. The language was created with the goal of being both human and machine friendly. It is JSON compatible, which means it is interoperable with other systems outside of the Terraform product line.

      This guide provides an introduction to HCL syntax and some commonly used HCL terminology.

      HCL Syntax Overview

      HashiCorp’s configuration syntax is easy to read and write. It was created to have a more clearly visible and defined structure when compared with other well known configuration languages, like YAML.

      ~/terraform/main.tf
      
      # Linode provider block. Installs Linode plugin.
      provider "linode" {
          token = "${var.token}"
      }
      
      variable "region" {
        description = "This is the location where the Linode instance is deployed."
      }
      
      /* A multi
         line comment. */
      resource "linode_instance" "example_linode" {
          image = "linode/ubuntu18.04"
          label = "example-linode"
          region = "${var.region}"
          type = "g6-standard-1"
          authorized_keys = [ "my-key" ]
          root_pass = "example-password"
      }
          


      Key Elements of HCL

      • HCL syntax is composed of stanzas or blocks that define a variety of configurations available to Terraform. Provider plugins expand on the available base Terraform configurations.

      • Stanzas or blocks consist of key = value pairs. Terraform accepts values of type string, number, boolean, map, and list.

      • Single line comments start with #, while multi-line comments use an opening /* and a closing */.

      • Interpolation syntax can be used to reference values stored outside of a configuration block, like in an input variable, or from a Terraform module’s output.

        An interpolated variable reference is constructed with the "${var.region}" syntax. This example references a variable named region, which is prefixed by var.. The opening ${ and closing } indicate the start of interpolation syntax.

      • You can include multi-line strings by using an opening <<EOF, followed by a closing EOF on its own line (see the sketch after this list).

      • Strings are wrapped in double quotes.

      • Lists of primitive types (string, number, and boolean) are wrapped in square brackets: ["Andy", "Leslie", "Nate", "Angel", "Chris"].

      • Maps use curly braces {} and colons :, as follows: { "password" : "my_password", "db_name" : "wordpress" }.
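
      The following fragment is a minimal sketch (the variable names are hypothetical) that combines a heredoc multi-line string, a list, and a map in one file:

      variable "startup_script" {
        description = "A multi-line string supplied with heredoc syntax."
        default     = <<EOF
      #!/bin/bash
      echo "Hello from a multi-line string"
      EOF
      }

      variable "allowed_users" {
        type    = "list"
        default = ["Andy", "Leslie", "Nate"]
      }

      variable "db_settings" {
        type    = "map"
        default = {
          "password" = "my_password"
          "db_name"  = "wordpress"
        }
      }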

      See Terraform’s Configuration Syntax documentation for more details.

      Providers

      In Terraform, a provider is used to interact with an Infrastructure as a Service (IaaS) or Platform as a Service (PaaS) API, like the Linode APIv4. The provider determines which resources are exposed and available to create, read, update, and delete. A credentials set or token is usually required to interface with your service account. For example, the Linode Terraform provider requires your Linode API access token. A list of all official Terraform providers is available from HashiCorp.

      Configuring Linode as your provider requires that you include a block which specifies Linode as the provider and sets your Linode API token in one of your .tf files:

      ~/terraform/terraform.tf
      
      provider "linode" {
          token = "my-token"
      }

      Once your provider is declared, you can begin configuring resources available from the provider.

      Note

      Providers are packaged as plugins for Terraform. Whenever declaring a new provider in your Terraform configuration files, the terraform init command should be run. This command will complete several initialization steps that are necessary before you can apply your Terraform configuration, including downloading the plugins for any providers you’ve specified.
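
      For example, a typical sequence after adding or changing a provider block might look like the following (a minimal sketch; run the commands from the directory that contains your .tf files):

      cd ~/terraform
      terraform init    # downloads the plugins for the providers you have declared
      terraform plan    # previews the changes Terraform would make
      terraform apply   # applies the configuration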

      Resources

      A Terraform resource is any component of your infrastructure that can be managed by your provider. Resources available with the Linode provider range from a Linode instance, to a block storage volume, to a DNS record. Terraform’s Linode Provider documentation contains a full listing of all supported resources.

      Resources are declared with a resource block in a .tf configuration file. This example block deploys a 2GB Linode instance located in the US East data center from an Ubuntu 18.04 image. Values are also provided for the Linode’s label, public SSH key, and root password:

      ~/terraform/main.tf
      
      resource "linode_instance" "WordPress" {
          image = "linode/ubuntu18.04"
          label = "WPServer"
          region = "us-east"
          type = "g6-standard-1"
          authorized_keys = [ "example-key" ]
          root_pass = "example-root-pass"
      }

      HCL-specific meta-parameters are available to all resources and are independent of the provider you use. Meta-parameters allow you to do things like customize the lifecycle behavior of the resource, define the number of resources to create, or protect certain resources from being destroyed. See Terraform’s Resource Configuration documentation for more information on meta-parameters.
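
      As an illustration, the hedged sketch below (all values are arbitrary) uses the count meta-parameter to create three identical Linode instances and a lifecycle block to protect them from accidental destruction:

      resource "linode_instance" "web" {
          count     = 3
          image     = "linode/ubuntu18.04"
          label     = "web-${count.index}"
          region    = "us-east"
          type      = "g6-standard-1"
          root_pass = "example-root-pass"

          lifecycle {
              prevent_destroy = true
          }
      }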

      Modules

      A module is an encapsulated set of Terraform configurations used to organize the creation of resources in reusable configurations.

      The Terraform Module Registry is a repository of community modules that can help you get started creating resources for various providers. You can also create your own modules to better organize your Terraform configurations and make them available for reuse. Once you have created your modules, you can distribute them via a remote version control repository, like GitHub.

      Using Modules

      A module block instructs Terraform to create an instance of a module. This block instantiates any resources defined within that module.

      The only universally required configuration for all module blocks is the source parameter which indicates the location of the module’s source code. All other required configurations will vary from module to module. If you are using a local module you can use a relative path as the source value. The source path for a Terraform Module Registry module will be available on the module’s registry page.

      This example creates an instance of a module named linode-module-example and provides a relative path as the location of the module’s source code:

      ~/terraform/main.tf
      
      module "linode-module-example" {
          source = "/modules/linode-module-example"
      }
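
      A module published to the Terraform Module Registry is referenced by its registry path instead. The following sketch uses a hypothetical namespace, module name, and version:

      module "registry-module-example" {
          source  = "example-namespace/example-module/linode"
          version = "1.0.0"
      }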

      Authoring modules involves defining resource requirements and parameterizing configurations using input variables, variable files, and outputs. To learn how to write your own Terraform modules, see Create a Terraform Module.

      Input Variables

      You can define input variables to serve as Terraform configuration parameters. By convention, input variables are normally defined within a file named variables.tf. Terraform will load all files ending in .tf, so you can also define variables in files with other names.

      • Terraform accepts variables of type string, number, boolean, map, and list. If a variable type is not explicitly defined, Terraform will default to type = "string".

      • It is good practice to provide a meaningful description for all your input variables.

      • If a variable does not contain a default value, or if you would like to override a variable’s default value, you must provide a value as an environment variable or within a variable values file.

      Variable Declaration Example

      ~/terraform/variables.tf
      
      variable "token" {
        description = "This is your Linode APIv4 Token."
      }
      
      variable "region" {
          description: "This is the location where the Linode instance is deployed."
          default = "us-east"
      }

      Two input variables, named token and region, are defined. The region variable defines a default value. Both variables default to type = "string", since a type is not explicitly declared.
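
      For parameters that are not strings, declare the type explicitly. The sketch below (the variable names are hypothetical) defines a list variable and a map variable:

      variable "authorized_keys" {
        description = "A list of public SSH keys."
        type        = "list"
        default     = []
      }

      variable "instance_labels" {
        description = "A map of instance roles to labels."
        type        = "map"
        default     = {
          "web" = "WPServer"
          "db"  = "DBServer"
        }
      }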

      Supplying Variable Values

      Variable values can be specified in .tfvars files. These files use the same syntax as Terraform configuration files:

      ~/terraform/terraform.tfvars
      
      token = "my-token"
      region = "us-west"

      Terraform will automatically load values from filenames which match terraform.tfvars or *.auto.tfvars. If you store values in a file with another name, you need to specify that file with the -var-file option when running terraform apply. The -var-file option can be invoked multiple times:

      terraform apply \
      -var-file="variable-values-1.tfvars" \
      -var-file="variable-values-2.tfvars"
      

      Values can also be specified in environment variables when running terraform apply. The name of the variable should be prefixed with TF_VAR_:

      TF_VAR_token=my-token-value TF_VAR_region=us-west terraform apply
      

      Note

      Environment variables can only assign values to variables of type = "string".

      Referencing Variables

      You can call existing input variables within your configuration file using Terraform’s interpolation syntax. Observe the value of the region parameter:

      ~/terraform/main.tf
      
      resource "linode_instance" "WordPress" {
          image = "linode/ubuntu18.04"
          label = "WPServer"
          region = "${var.region}"
          type = "g6-standard-1"
          authorized_keys = [ "example-key" ]
          root_pass = "example-root-pass"
      }

      Note

      If a variable value is not provided in any of the ways discussed above, and the variable is called in a resource configuration, Terraform will prompt you for the value when you run terraform apply.

      For more information on variables, see Terraform’s Input Variables documentation.

      Interpolation

      HCL supports the interpolation of values. Interpolations are wrapped in an opening ${ and a closing }. Input variable names are prefixed with var.:

      ~/terraform/terraform.tf
      
      provider "linode" {
          token = "${var.token}"
      }

      Interpolation syntax is powerful and includes the ability to reference attributes of other resources, call built-in functions, and use conditionals and templates.
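
      For example, the exported attributes of other resources can be referenced with the same syntax. The sketch below assumes the linode_instance.web resource used elsewhere in this guide and exposes its IP address as an output:

      output "web_ip_address" {
          value = "${linode_instance.web.ip_address}"
      }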

      This resource’s configuration uses a conditional to provide a value for the tags parameter:

      ~/terraform/terraform.tf
      
      resource "linode_instance" "web" {
          tags = ["${var.env == "production" ? var.prod_subnet : var.dev_subnet}"]
      }

      If the env variable has the value production, then the prod_subnet variable is used. If not, then the dev_subnet variable is used.

      Functions

      Terraform has built-in computational functions that perform a variety of operations, including reading files, concatenating lists, encrypting or creating a checksum of an object, and searching and replacing.
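
      The hedged sketch below (the local and variable names are hypothetical) uses the built-in concat() and replace() functions inside interpolation syntax:

      locals {
          all_regions  = "${concat(var.primary_regions, var.backup_regions)}"
          dashed_label = "${replace("example label", " ", "-")}"
      }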

      ~/terraform/terraform.tf
      
      resource "linode_sshkey" "main_key" {
          label = "foo"
          ssh_key = "${chomp(file("~/.ssh/id_rsa.pub"))}"
      }

      In this example, ssh_key = "${chomp(file("~/.ssh/id_rsa.pub"))}" uses Terraform’s built-in function file() to provide a local file path to the public SSH key’s location. The chomp() function removes trailing new lines from the SSH key. Observe that the nested functions are wrapped in opening ${ and closing } to indicate that the value should be interpolated.

      Note

      Running terraform console creates an environment where you can test interpolation functions. For example:

      terraform console
      
        
      > list("newark", "atlanta", "dallas")
      [
        "newark",
        "atlanta",
        "dallas",
      ]
      >
      
      

      Terraform’s official documentation includes a complete list of supported built-in functions.

      Templates

      Templates can be used to store large strings of data. The template provider exposes data sources that other Terraform resources or outputs can consume. The data source can be a file or an inline template.

      The data source can use Terraform’s standard interpolation syntax for variables. The template is then rendered with variable values that you supply in the data block.

      This example template resource substitutes in the value from ${linode_instance.web.ip_address} anywhere ${web_ip} appears inside the template file ips.json:

      
      data "template_file" "web" {
          template = "${file("${path.module}/ips.json")}"
      
          vars {
              web_ip = "${linode_instance.web.ip_address}"
          }
      }

      You could then define an output variable to view the rendered template when you later run terraform apply:

      
      output "ip" {
        value = "${data.template_file.web.rendered}"
      }

      Terraform’s official documentation has a list of all available components of interpolation syntax.

      Next Steps

      Now that you are familiar with HCL, you can begin creating your own Linode instance with Terraform by following the Use Terraform to Provision Linode Environments guide.


      This guide is published under a CC BY-ND 4.0 license.


