
      September 2019

      How To Manage Sets in Redis


      Introduction

      Redis is an open-source, in-memory key-value data store. Sets in Redis are collections of strings stored at a given key. When held in a set, an individual record value is referred to as a member. Unlike lists, sets are unordered and do not allow repeated values.

      This tutorial explains how to create sets, retrieve and remove members, and compare the members of different sets.

      How To Use This Guide
      This guide is written as a cheat sheet with self-contained examples. We encourage you to jump to any section that is relevant to the task you’re trying to complete.

      The commands and outputs shown in this guide were tested on an Ubuntu 18.04 server running Redis version 4.0.9. To obtain a similar setup, you can follow Step 1 of our guide on How To Install and Secure Redis on Ubuntu 18.04. We will demonstrate how these commands behave by running them with redis-cli, the Redis command line interface. Note that if you’re using a different Redis interface — Redli, for example — the exact outputs of certain commands may differ.

      Alternatively, you could provision a managed Redis database instance to test these commands, but note that depending on the level of control allowed by your database provider, some commands in this guide may not work as described. To provision a DigitalOcean Managed Database, follow our Managed Databases product documentation. Then, you must either install Redli or set up a TLS tunnel in order to connect to the Managed Database over TLS.

      Creating Sets

      The sadd command allows you to create a set and add one or more members to it. The following example will create a set at a key named key_horror with the members "Frankenstein" and "Godzilla":

      • sadd key_horror "Frankenstein" "Godzilla"

      If successful, sadd will return an integer showing how many members it added to the set:

      Output

      (integer) 2

If you try to add set members to a key that already holds a non-set value, sadd will return an error. The first command in this block creates a list named key_action with one element, "Shaft". The next command tries to add a set member, "Shane", to the list, but this produces an error because of the clashing data types:

      • rpush key_action "Shaft"
      • sadd key_action "Shane"

      Output

      (error) WRONGTYPE Operation against a key holding the wrong kind of value

      Note that sets don’t allow more than one occurrence of the same member:

      • sadd key_comedy "It's" "A" "Mad" "Mad" "Mad" "Mad" "Mad" "World"

      Output

      (integer) 4

Even though this sadd command specifies eight members, it discards the four duplicate "Mad" members, resulting in a set size of 4.

      Retrieving Members from Sets

      In this section, we’ll go over a number of Redis commands that return information about the members held in a set. To practice the commands outlined here, run the following command, which will create a set with six members at a key called key_stooges:

      • sadd key_stooges "Moe" "Larry" "Curly" "Shemp" "Joe" "Curly Joe"

To return every member from a set, run the smembers command followed by the key you want to inspect:

• smembers key_stooges

      Output

      1) "Curly" 2) "Moe" 3) "Larry" 4) "Shemp" 5) "Curly Joe" 6) "Joe"

      To check if a specific value is a member of a set, use the sismember command:

      • sismember key_stooges "Harpo"

      If the element "Harpo" is a member of the key_stooges set, sismember will return 1. Otherwise, it will return 0:

      Output

      (integer) 0

To see how many members are in a given set (in other words, to find the cardinality of a given set), run scard followed by the key:

• scard key_stooges

      Output

      (integer) 6

To return a random element from a set, run srandmember followed by the key:

• srandmember key_stooges

      Output

      "Larry"

      To return multiple random, distinct elements from a set, you can follow the srandmember command with the number of elements you want to retrieve:

      • srandmember key_stooges 3

      Output

      1) "Larry" 2) "Moe" 3) "Curly Joe"

      If you pass a negative number to srandmember, the command is allowed to return the same element multiple times:

      • srandmember key_stooges -3

      Output

      1) "Shemp" 2) "Curly Joe" 3) "Curly Joe"

      The random element function used in srandmember is not perfectly random, although its performance improves in larger data sets. See the command’s official documentation for more details.

      Removing Members from Sets

      Redis comes with three commands used to remove members from a set: spop, srem, and smove.

      spop randomly selects a specified number of members from a set and returns them, similar to srandmember, but then deletes them from the set. It accepts the name of the key containing a set and the number of members to remove from the set as arguments. If you don’t specify a number, spop will default to returning and removing a single value.

The following example command will remove and return two randomly-selected elements from the key_stooges set created in the previous section:

• spop key_stooges 2

      Output

      1) "Shemp" 2) "Larry"

      srem allows you to remove one or more specific members from a set, rather than random ones:

      • srem key_stooges "Joe" "Curly Joe"

      Instead of returning the members removed from the set, srem returns an integer showing how many members were removed:

      Output

      (integer) 2

      Use smove to move a member from one set to another. This command accepts as arguments the source set, the destination set, and the member to move, in that order. Note that smove only allows you to move one member at a time:

      • smove key_stooges key_jambands "Moe"

      If the command moves the member successfully, it will return (integer) 1:

      Output

      (integer) 1

      If smove fails, it will instead return (integer) 0. Note that if the destination key does not already exist, smove will create it before moving the member into it.

      Comparing Sets

      Redis also provides a number of commands that find the differences and similarities between sets. To demonstrate how these work, this section will reference three sets named presidents, kings, and beatles. If you’d like to try out the commands in this section yourself, create these sets and populate them using the following sadd commands:

      • sadd presidents "George" "John" "Thomas" "James"
      • sadd kings "Edward" "Henry" "John" "James" "George"
      • sadd beatles "John" "George" "Paul" "Ringo"

      sinter compares different sets and returns the set intersection, or values that appear in every set:

      • sinter presidents kings beatles

      Output

      1) "John" 2) "George"

      sinterstore performs a similar function, but instead of returning the intersecting members it creates a new set at the specified destination containing these intersecting members. Note that if the destination already exists, sinterstore will overwrite its contents:

      • sinterstore new_set presidents kings beatles
      • smembers new_set

      Output

      1) "John" 2) "George"

sdiff returns the set difference: the members of the first specified set that do not appear in any of the following sets:

      • sdiff presidents kings beatles

      Output

      1) "Thomas"

In other words, sdiff looks at each member in the first given set and then compares those to the members of each successive set. Any member of the first set that also appears in any of the following sets is removed, and sdiff returns the remaining members. Think of it as subtracting the members of the subsequent sets from the first set.
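Because sdiff keeps only members of the first set you pass it, the order of the arguments matters. As a quick check using the example sets above, listing kings first returns the members of kings that appear in neither presidents nor beatles (the numbering of the returned members may differ on your machine, since sets are unordered):

• sdiff kings presidents beatles

Output

1) "Edward"
2) "Henry"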

      sdiffstore performs a function similar to sdiff, but instead of returning the set difference it creates a new set at a given destination, containing the set difference:

      • sdiffstore new_set beatles kings presidents
      • smembers new_set

      Output

      1) "Paul" 2) "Ringo"

      Like sinterstore, sdiffstore will overwrite the destination key if it already exists.

      sunion returns the set union, or a set containing every member of every set you specify:

      • sunion presidents kings beatles

      Output

      1) "Thomas" 2) "George" 3) "Paul" 4) "Henry" 5) "James" 6) "Edward" 7) "John" 8) "Ringo"

      sunion treats the results like a new set in that it only allows one occurrence of any given member.

      sunionstore performs a similar function, but creates a new set containing the set union at a given destination instead of just returning the results:

      • sunionstore new_set presidents kings beatles

      Output

      (integer) 8

      As with sinterstore and sdiffstore, sunionstore will overwrite the destination key if it already exists.

      Conclusion

      This guide details a number of commands used to create and manage sets in Redis. If there are other related commands, arguments, or procedures you’d like to see outlined in this guide, please ask or make suggestions in the comments below.

      For more information on Redis commands, see our tutorial series on How to Manage a Redis Database.




      How To Perform Sentiment Analysis in Python 3 Using the Natural Language Toolkit (NLTK)


      The author selected the Open Internet/Free Speech fund to receive a donation as part of the Write for DOnations program.

      Introduction

      A large amount of data that is generated today is unstructured, which requires processing to generate insights. Some examples of unstructured data are news articles, posts on social media, and search history. The process of analyzing natural language and making sense out of it falls under the field of Natural Language Processing (NLP). Sentiment analysis is a common NLP task, which involves classifying texts or parts of texts into a pre-defined sentiment. You will use the Natural Language Toolkit (NLTK), a commonly used NLP library in Python, to analyze textual data.

In this tutorial, you will prepare a dataset of sample tweets from the NLTK package for NLP with different data cleaning methods. Once the dataset is ready for processing, you will train a model on pre-classified tweets and use the model to classify the sample tweets into negative and positive sentiments.

This article assumes that you are familiar with the basics of Python (see our How To Code in Python 3 series), primarily the use of data structures, classes, and methods. The tutorial assumes that you have no background in NLP or nltk, although some familiarity with them is an added advantage.

      Prerequisites

      Step 1 — Installing NLTK and Downloading the Data

      You will use the NLTK package in Python for all NLP tasks in this tutorial. In this step you will install NLTK and download the sample tweets that you will use to train and test your model.

First, install the NLTK package with the pip package manager:

• pip install nltk

This tutorial will use sample tweets that are part of the NLTK package. Next, start a Python interactive session by running the following command:

• python3

Then, import the nltk module in the Python interpreter:

• import nltk

      Download the sample tweets from the NLTK package:

      • nltk.download('twitter_samples')

      Running this command from the Python interpreter downloads and stores the tweets locally. Once the samples are downloaded, they are available for your use.

      You will use the negative and positive tweets to train your model on sentiment analysis later in the tutorial. The tweets with no sentiments will be used to test your model.

      If you would like to use your own dataset, you can gather tweets from a specific time period, user, or hashtag by using the Twitter API.

Now that you’ve imported NLTK and downloaded the sample tweets, exit the interactive session by entering exit(). You are ready to import the tweets and begin processing the data.

      Step 2 — Tokenizing the Data

      Language in its original form cannot be accurately processed by a machine, so you need to process the language to make it easier for the machine to understand. The first part of making sense of the data is through a process called tokenization, or splitting strings into smaller parts called tokens.

      A token is a sequence of characters in text that serves as a unit. Based on how you create the tokens, they may consist of words, emoticons, hashtags, links, or even individual characters. A basic way of breaking language into tokens is by splitting the text based on whitespace and punctuation.
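As a rough illustration of that naive approach (this is only a sketch, not the tokenizer NLTK uses later in this tutorial), a regular expression can split a sentence into runs of word characters, treating whitespace and punctuation as boundaries:

import re

sentence = "NLTK makes tokenization easy, doesn't it?"

# Naive tokenization: keep runs of word characters and apostrophes,
# discarding whitespace and punctuation.
tokens = re.findall(r"[\w']+", sentence)

print(tokens)
# ['NLTK', 'makes', 'tokenization', 'easy', "doesn't", 'it']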

To get started, create a new .py file to hold your script. This tutorial will use nlp_test.py, which you can create with your editor of choice, for example:

• nano nlp_test.py

      In this file, you will first import the twitter_samples so you can work with that data:

      nlp_test.py

      from nltk.corpus import twitter_samples
      

      This will import three datasets from NLTK that contain various tweets to train and test the model:

      • negative_tweets.json: 5000 tweets with negative sentiments
      • positive_tweets.json: 5000 tweets with positive sentiments
      • tweets.20150430-223406.json: 20000 tweets with no sentiments

      Next, create variables for positive_tweets, negative_tweets, and text:

      nlp_test.py

      from nltk.corpus import twitter_samples
      
      positive_tweets = twitter_samples.strings('positive_tweets.json')
      negative_tweets = twitter_samples.strings('negative_tweets.json')
      text = twitter_samples.strings('tweets.20150430-223406.json')
      

The strings() method of twitter_samples returns all of the tweets within a dataset as strings. Setting the different tweet collections as variables will make processing and testing easier.

Before using a tokenizer in NLTK, you need to download an additional resource, punkt. The punkt module is a pre-trained model that helps you tokenize words and sentences. For instance, this model knows that a name may contain a period (like “S. Daityari”) and the presence of this period in a sentence does not necessarily end it. First, start a Python interactive session:

• python3

      Run the following commands in the session to download the punkt resource:

      • import nltk
      • nltk.download('punkt')

      Once the download is complete, you are ready to use NLTK’s tokenizers. NLTK provides a default tokenizer for tweets with the .tokenized() method. Add a line to create an object that tokenizes the positive_tweets.json dataset:

      nlp_test.py

      from nltk.corpus import twitter_samples
      
      positive_tweets = twitter_samples.strings('positive_tweets.json')
      negative_tweets = twitter_samples.strings('negative_tweets.json')
      text = twitter_samples.strings('tweets.20150430-223406.json')
      tweet_tokens = twitter_samples.tokenized('positive_tweets.json')
      

If you’d like to test the script to see the .tokenized() method in action, add the following content to your nlp_test.py script. This will print the tokens of a single tweet from the positive_tweets.json dataset:

      nlp_test.py

      from nltk.corpus import twitter_samples
      
      positive_tweets = twitter_samples.strings('positive_tweets.json')
      negative_tweets = twitter_samples.strings('negative_tweets.json')
      text = twitter_samples.strings('tweets.20150430-223406.json')
tweet_tokens = twitter_samples.tokenized('positive_tweets.json')
      
      print(tweet_tokens[0])
      

Save and close the file, and run the script:

• python3 nlp_test.py

      The process of tokenization takes some time because it’s not a simple split on white space. After a few moments of processing, you’ll see the following:

      Output

      ['#FollowFriday', '@France_Inte', '@PKuchly57', '@Milipol_Paris', 'for', 'being', 'top', 'engaged', 'members', 'in', 'my', 'community', 'this', 'week', ':)']

Here, the .tokenized() method retains special characters such as @ and _. These characters will be removed through regular expressions later in this tutorial.

Now that you’ve seen how the .tokenized() method works, comment out or remove the last line of the script, which prints the tokenized tweet, by adding a # to the start of the line:

      nlp_test.py

      from nltk.corpus import twitter_samples
      
      positive_tweets = twitter_samples.strings('positive_tweets.json')
      negative_tweets = twitter_samples.strings('negative_tweets.json')
      text = twitter_samples.strings('tweets.20150430-223406.json')
tweet_tokens = twitter_samples.tokenized('positive_tweets.json')
      
      #print(tweet_tokens[0])
      

      Your script is now configured to tokenize data. In the next step you will update the script to normalize the data.

      Step 3 — Normalizing the Data

      Words have different forms—for instance, “ran”, “runs”, and “running” are various forms of the same verb, “run”. Depending on the requirement of your analysis, all of these versions may need to be converted to the same form, “run”. Normalization in NLP is the process of converting a word to its canonical form.

      Normalization helps group together words with the same meaning but different forms. Without normalization, “ran”, “runs”, and “running” would be treated as different words, even though you may want them to be treated as the same word. In this section, you explore stemming and lemmatization, which are two popular techniques of normalization.

      Stemming is a process of removing affixes from a word. Stemming, working with only simple verb forms, is a heuristic process that removes the ends of words.

In this tutorial you will use the process of lemmatization, which normalizes a word based on its vocabulary entry and a morphological analysis of the words in the text. The lemmatization algorithm analyzes the structure of the word and its context to convert it to a normalized form, so it comes at a cost of speed. A comparison of stemming and lemmatization ultimately comes down to a trade-off between speed and accuracy.
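If you would like to see the difference between the two approaches yourself, here is a minimal sketch comparing NLTK’s Porter stemmer with the WordNet lemmatizer. It assumes the wordnet resource downloaded in the next set of commands is already available:

from nltk.stem import PorterStemmer
from nltk.stem.wordnet import WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# The stemmer clips word endings heuristically, so irregular forms are left alone.
print(stemmer.stem("running"), stemmer.stem("ran"), stemmer.stem("better"))
# run ran better

# The lemmatizer maps each word to its dictionary form, guided by a part-of-speech hint.
print(lemmatizer.lemmatize("running", pos="v"),
      lemmatizer.lemmatize("ran", pos="v"),
      lemmatizer.lemmatize("better", pos="a"))
# run run good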

Before you proceed to use lemmatization, download the necessary resources by entering the following into a Python interactive session:

• python3

      Run the following commands in the session to download the resources:

      • import nltk
      • nltk.download('wordnet')
      • nltk.download('averaged_perceptron_tagger')

      wordnet is a lexical database for the English language that helps the script determine the base word. You need the averaged_perceptron_tagger resource to determine the context of a word in a sentence.

Once downloaded, you are almost ready to use the lemmatizer. Before running a lemmatizer, you need to determine the context for each word in your text. This is achieved by a tagging algorithm, which assesses the relative position of a word in a sentence. In a Python session, import the pos_tag function and provide a list of tokens as an argument to get the tags. Let us try this out in Python:

      • from nltk.tag import pos_tag
      • from nltk.corpus import twitter_samples
      • tweet_tokens = twitter_samples.tokenized('positive_tweets.json')
      • print(pos_tag(tweet_tokens[0]))

      Here is the output of the pos_tag function.

      Output

      [('#FollowFriday', 'JJ'), ('@France_Inte', 'NNP'), ('@PKuchly57', 'NNP'), ('@Milipol_Paris', 'NNP'), ('for', 'IN'), ('being', 'VBG'), ('top', 'JJ'), ('engaged', 'VBN'), ('members', 'NNS'), ('in', 'IN'), ('my', 'PRP$'), ('community', 'NN'), ('this', 'DT'), ('week', 'NN'), (':)', 'NN')]

      From the list of tags, here is the list of the most common items and their meaning:

      • NNP: Noun, proper, singular
      • NN: Noun, common, singular or mass
      • IN: Preposition or conjunction, subordinating
      • VBG: Verb, gerund or present participle
      • VBN: Verb, past participle

These tags come from the Penn Treebank tag set, which defines the full list of tags and their meanings.

In general, if a tag starts with NN, the word is a noun, and if it starts with VB, the word is a verb. After reviewing the tags, exit the Python session by entering exit().

      To incorporate this into a function that normalizes a sentence, you should first generate the tags for each token in the text, and then lemmatize each word using the tag.

      Update the nlp_test.py file with the following function that lemmatizes a sentence:

      nlp_test.py

      ...
      
      from nltk.tag import pos_tag
      from nltk.stem.wordnet import WordNetLemmatizer
      
      def lemmatize_sentence(tokens):
          lemmatizer = WordNetLemmatizer()
          lemmatized_sentence = []
          for word, tag in pos_tag(tokens):
              if tag.startswith('NN'):
                  pos = 'n'
              elif tag.startswith('VB'):
                  pos = 'v'
              else:
                  pos = 'a'
              lemmatized_sentence.append(lemmatizer.lemmatize(word, pos))
          return lemmatized_sentence
      
      print(lemmatize_sentence(tweet_tokens[0]))
      

      This code imports the WordNetLemmatizer class and initializes it to a variable, lemmatizer.

The function lemmatize_sentence first gets the position tag of each token of a tweet. Within the if statement, if the tag starts with NN, the token is assigned as a noun. Similarly, if the tag starts with VB, the token is assigned as a verb. Any other tag falls through to 'a', which the lemmatizer treats as an adjective.

Save and close the file, and run the script:

• python3 nlp_test.py

      Here is the output:

      Output

      ['#FollowFriday', '@France_Inte', '@PKuchly57', '@Milipol_Paris', 'for', 'be', 'top', 'engage', 'member', 'in', 'my', 'community', 'this', 'week', ':)']

      You will notice that the verb being changes to its root form, be, and the noun members changes to member. Before you proceed, comment out the last line that prints the sample tweet from the script.

      Now that you have successfully created a function to normalize words, you are ready to move on to remove noise.

      Step 4 — Removing Noise from the Data

      In this step, you will remove noise from the dataset. Noise is any part of the text that does not add meaning or information to data.

      Noise is specific to each project, so what constitutes noise in one project may not be in a different project. For instance, the most common words in a language are called stop words. Some examples of stop words are “is”, “the”, and “a”. They are generally irrelevant when processing language, unless a specific use case warrants their inclusion.

      In this tutorial, you will use regular expressions in Python to search for and remove these items:

      • Hyperlinks - All hyperlinks in Twitter are converted to the URL shortener t.co. Therefore, keeping them in the text processing would not add any value to the analysis.
      • Twitter handles in replies - These Twitter usernames are preceded by a @ symbol, which does not convey any meaning.
      • Punctuation and special characters - While these often provide context to textual data, this context is often difficult to process. For simplicity, you will remove all punctuation and special characters from tweets.

      To remove hyperlinks, you need to first search for a substring that matches a URL starting with http:// or https://, followed by letters, numbers, or special characters. Once a pattern is matched, the .sub() method replaces it with an empty string.
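Here is a small, standalone illustration of that substitution. The tweet text and URL are made up, and the pattern below is deliberately simplified compared to the longer expression used in the remove_noise() function later in this step:

import re

sample = "Check out my new blog post https://t.co/abc123 :)"

# Replace anything that looks like an http:// or https:// link with an empty string.
cleaned = re.sub(r"https?://\S+", "", sample)

print(cleaned)
# Check out my new blog post  :)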

      Since we will normalize word forms within the remove_noise() function, you can comment out the lemmatize_sentence() function from the script.

      Add the following code to your nlp_test.py file to remove noise from the dataset:

      nlp_test.py

      ...
      
      import re, string
      
      def remove_noise(tweet_tokens, stop_words = ()):
      
          cleaned_tokens = []
      
          for token, tag in pos_tag(tweet_tokens):
              token = re.sub('http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+#]|[!*(),]|'
                             '(?:%[0-9a-fA-F][0-9a-fA-F]))+','', token)
              token = re.sub("(@[A-Za-z0-9_]+)","", token)
      
              if tag.startswith("NN"):
                  pos = 'n'
              elif tag.startswith('VB'):
                  pos = 'v'
              else:
                  pos = 'a'
      
              lemmatizer = WordNetLemmatizer()
              token = lemmatizer.lemmatize(token, pos)
      
              if len(token) > 0 and token not in string.punctuation and token.lower() not in stop_words:
                  cleaned_tokens.append(token.lower())
          return cleaned_tokens
      

      This code creates a remove_noise() function that removes noise and incorporates the normalization and lemmatization mentioned in the previous section. The code takes two arguments: the tweet tokens and the tuple of stop words.

      The code then uses a loop to remove the noise from the dataset. To remove hyperlinks, the code first searches for a substring that matches a URL starting with http:// or https://, followed by letters, numbers, or special characters. Once a pattern is matched, the .sub() method replaces it with an empty string, or ''.

      Similarly, to remove @ mentions, the code substitutes the relevant part of text using regular expressions. The code uses the re library to search @ symbols, followed by numbers, letters, or _, and replaces them with an empty string.

      Finally, you can remove punctuation using the library string.

      In addition to this, you will also remove stop words using a built-in set of stop words in NLTK, which needs to be downloaded separately.

Start a Python interactive session and execute the following commands to download this resource:

• import nltk
• nltk.download('stopwords')

      Once the resource is downloaded, exit the interactive session.

      You can use the .words() method to get a list of stop words in English. To test the function, let us run it on our sample tweet. Add the following lines to the end of the nlp_test.py file:

      nlp_test.py

      ...
      from nltk.corpus import stopwords
      stop_words = stopwords.words('english')
      
      print(remove_noise(tweet_tokens[0], stop_words))
      

      After saving and closing the file, run the script again to receive output similar to the following:

      Output

      ['#followfriday', 'top', 'engage', 'member', 'community', 'week', ':)']

Notice that the function removes all @ mentions and stop words, and converts the words to lowercase.

      Before proceeding to the modeling exercise in the next step, use the remove_noise() function to clean the positive and negative tweets. Comment out the line to print the output of remove_noise() on the sample tweet and add the following to the nlp_test.py script:

      nlp_test.py

      ...
      from nltk.corpus import stopwords
      stop_words = stopwords.words('english')
      
      #print(remove_noise(tweet_tokens[0], stop_words))
      
      positive_tweet_tokens = twitter_samples.tokenized('positive_tweets.json')
      negative_tweet_tokens = twitter_samples.tokenized('negative_tweets.json')
      
      positive_cleaned_tokens_list = []
      negative_cleaned_tokens_list = []
      
      for tokens in positive_tweet_tokens:
          positive_cleaned_tokens_list.append(remove_noise(tokens, stop_words))
      
      for tokens in negative_tweet_tokens:
          negative_cleaned_tokens_list.append(remove_noise(tokens, stop_words))
      

      Now that you’ve added the code to clean the sample tweets, you may want to compare the original tokens to the cleaned tokens for a sample tweet. If you’d like to test this, add the following code to the file to compare both versions of the 500th tweet in the list:

      nlp_test.py

      ...
      print(positive_tweet_tokens[500])
      print(positive_cleaned_tokens_list[500])
      

      Save and close the file and run the script. From the output you will see that the punctuation and links have been removed, and the words have been converted to lowercase.

      Output

['Dang', 'that', 'is', 'some', 'rad', '@AbzuGame', '#fanart', '!', ':D', 'https://t.co/bI8k8tb9ht']
['dang', 'rad', '#fanart', ':d']

There are certain issues that might arise during the preprocessing of text. For instance, words without spaces (“iLoveYou”) will be treated as one token, and it can be difficult to separate such words. Furthermore, “Hi”, “Hii”, and “Hiiiii” will be treated differently by the script unless you write something specific to tackle the issue. It’s common to fine-tune the noise removal process for your specific data.

      Now that you’ve seen the remove_noise() function in action, be sure to comment out or remove the last two lines from the script so you can add more to it:

      nlp_test.py

      ...
      #print(positive_tweet_tokens[500])
      #print(positive_cleaned_tokens_list[500])
      

      In this step you removed noise from the data to make the analysis more effective. In the next step you will analyze the data to find the most common words in your sample dataset.

      Step 5 — Determining Word Density

The most basic form of analysis on textual data is to look at word frequency. A single tweet is too small a sample to reveal the distribution of words, so the frequency analysis will be performed on all of the positive tweets.

The following snippet defines a generator function, named get_all_words, that takes a list of cleaned token lists as an argument and yields each word from all of the tweet tokens in turn. Add the following code to your nlp_test.py file:

      nlp_test.py

      ...
      
      def get_all_words(cleaned_tokens_list):
          for tokens in cleaned_tokens_list:
              for token in tokens:
                  yield token
      
      all_pos_words = get_all_words(positive_cleaned_tokens_list)
      

Now that you have compiled all of the words in the sample of tweets, you can find out which are the most common using the FreqDist class of NLTK. Add the following code to the nlp_test.py file:

      nlp_test.py

      from nltk import FreqDist
      
      freq_dist_pos = FreqDist(all_pos_words)
      print(freq_dist_pos.most_common(10))
      

      The .most_common() method lists the words which occur most frequently in the data. Save and close the file after making these changes.

      When you run the file now, you will find the most common terms in the data:

      Output

      [(':)', 3691), (':-)', 701), (':d', 658), ('thanks', 388), ('follow', 357), ('love', 333), ('...', 290), ('good', 283), ('get', 263), ('thank', 253)]

      From this data, you can see that emoticon entities form some of the most common parts of positive tweets. Before proceeding to the next step, make sure you comment out the last line of the script that prints the top ten tokens.

To summarize, you extracted the tweets from nltk, then tokenized, normalized, and cleaned them up for use in the model. Finally, you looked at the frequencies of tokens in the data and checked the frequencies of the top ten tokens.

      In the next step you will prepare data for sentiment analysis.

      Step 6 — Preparing Data for the Model

Sentiment analysis is the process of identifying the author’s attitude toward the topic being written about. You will create a training dataset to train a model. This is a supervised machine learning process, which requires you to associate each item in the dataset with a “sentiment” for training. In this tutorial, your model will use the “positive” and “negative” sentiments.

      Sentiment analysis can be used to categorize text into a variety of sentiments. For simplicity and availability of the training dataset, this tutorial helps you train your model in only two categories, positive and negative.

A model is a description of a system using rules and equations. It may be as simple as an equation that predicts the weight of a person given their height. The sentiment analysis model that you will build will associate tweets with a positive or a negative sentiment. You will need to split your dataset into two parts. The purpose of the first part is to build the model, whereas the second part tests the performance of the model.

In the data preparation step, you will prepare the data for sentiment analysis by converting tokens to dictionary form and then splitting the data for training and testing purposes.

      Converting Tokens to a Dictionary

First, you will prepare the data to be fed into the model. You will use the Naive Bayes classifier in NLTK to perform the modeling exercise. Notice that the model requires not just a list of words in a tweet, but a Python dictionary with words as keys and True as values. The following code defines a generator function to change the format of the cleaned data.

      Add the following code to convert the tweets from a list of cleaned tokens to dictionaries with keys as the tokens and True as values. The corresponding dictionaries are stored in positive_tokens_for_model and negative_tokens_for_model.

      nlp_test.py

      ...
      def get_tweets_for_model(cleaned_tokens_list):
          for tweet_tokens in cleaned_tokens_list:
              yield dict([token, True] for token in tweet_tokens)
      
      positive_tokens_for_model = get_tweets_for_model(positive_cleaned_tokens_list)
      negative_tokens_for_model = get_tweets_for_model(negative_cleaned_tokens_list)
      

      Splitting the Dataset for Training and Testing the Model

      Next, you need to prepare the data for training the NaiveBayesClassifier class. Add the following code to the file to prepare the data:

      nlp_test.py

      ...
      import random
      
      positive_dataset = [(tweet_dict, "Positive")
                           for tweet_dict in positive_tokens_for_model]
      
      negative_dataset = [(tweet_dict, "Negative")
                           for tweet_dict in negative_tokens_for_model]
      
      dataset = positive_dataset + negative_dataset
      
      random.shuffle(dataset)
      
      train_data = dataset[:7000]
      test_data = dataset[7000:]
      

      This code attaches a Positive or Negative label to each tweet. It then creates a dataset by joining the positive and negative tweets.

      By default, the data contains all positive tweets followed by all negative tweets in sequence. When training the model, you should provide a sample of your data that does not contain any bias. To avoid bias, you’ve added code to randomly arrange the data using the .shuffle() method of random.

      Finally, the code splits the shuffled data into a ratio of 70:30 for training and testing, respectively. Since the number of tweets is 10000, you can use the first 7000 tweets from the shuffled dataset for training the model and the final 3000 for testing the model.

      In this step, you converted the cleaned tokens to a dictionary form, randomly shuffled the dataset, and split it into training and testing data.

      Step 7 — Building and Testing the Model

      Finally, you can use the NaiveBayesClassifier class to build the model. Use the .train() method to train the model and the .accuracy() method to test the model on the testing data.

      nlp_test.py

      ...
      from nltk import classify
      from nltk import NaiveBayesClassifier
      classifier = NaiveBayesClassifier.train(train_data)
      
      print("Accuracy is:", classify.accuracy(classifier, test_data))
      
      print(classifier.show_most_informative_features(10))
      

      Save, close, and execute the file after adding the code. The output of the code will be as follows:

      Output

Accuracy is: 0.9956666666666667
Most Informative Features
                      :( = True           Negati : Positi =   2085.6 : 1.0
                      :) = True           Positi : Negati =    986.0 : 1.0
                 welcome = True           Positi : Negati =     37.2 : 1.0
                  arrive = True           Positi : Negati =     31.3 : 1.0
                     sad = True           Negati : Positi =     25.9 : 1.0
                follower = True           Positi : Negati =     21.1 : 1.0
                     bam = True           Positi : Negati =     20.7 : 1.0
                    glad = True           Positi : Negati =     18.1 : 1.0
                     x15 = True           Negati : Positi =     15.9 : 1.0
               community = True           Positi : Negati =     14.1 : 1.0

      Accuracy is defined as the percentage of tweets in the testing dataset for which the model was correctly able to predict the sentiment. A 99.5% accuracy on the test set is pretty good.
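If you would like to verify that number yourself, the accuracy can also be computed by hand from the classifier’s predictions. This optional snippet assumes the classifier and test_data variables defined in the script above:

# Count how many held-out tweets the classifier labels correctly.
correct = sum(1 for features, label in test_data
              if classifier.classify(features) == label)

print(correct / len(test_data))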

In the table that shows the most informative features, every row in the output shows the ratio of occurrence of a token in positive and negative tagged tweets in the training dataset. The first row in the data signifies that in all tweets containing the token :(, the ratio of negative to positive tweets was 2085.6 to 1. Interestingly, it seems that there was one token with :( in the positive datasets. You can see that the top two discriminating items in the text are the emoticons. Further, words such as sad lead to negative sentiments, whereas welcome and glad are associated with positive sentiments.

      Next, you can check how the model performs on random tweets from Twitter. Add this code to the file:

      nlp_test.py

      ...
      from nltk.tokenize import word_tokenize
      
      custom_tweet = "I ordered just once from TerribleCo, they screwed up, never used the app again."
      
      custom_tokens = remove_noise(word_tokenize(custom_tweet))
      
      print(classifier.classify(dict([token, True] for token in custom_tokens)))
      

      This code will allow you to test custom tweets by updating the string associated with the custom_tweet variable. Save and close the file after making these changes.

      Run the script to analyze the custom text. Here is the output for the custom text in the example:

      Output

      'Negative'

      You can also check if it characterizes positive tweets correctly:

      nlp_test.py

      ...
      custom_tweet = 'Congrats #SportStar on your 7th best goal from last season winning goal of the year :) #Baller #Topbin #oneofmanyworldies'
      

      Here is the output:

      Output

      'Positive'

      Now that you’ve tested both positive and negative sentiments, update the variable to test a more complex sentiment like sarcasm.

      nlp_test.py

      ...
      custom_tweet = 'Thank you for sending my baggage to CityX and flying me to CityY at the same time. Brilliant service. #thanksGenericAirline'
      

      Here is the output:

      Output

      'Positive'

The model classified this example as positive. This is because the training data wasn’t comprehensive enough to classify sarcastic tweets as negative. If you want your model to predict sarcasm, you would need to provide a sufficient amount of training data to train it accordingly.

      In this step you built and tested the model. You also explored some of its limitations, such as not detecting sarcasm in particular examples. Your completed code still has artifacts leftover from following the tutorial, so the next step will guide you through aligning the code to Python’s best practices.

      Step 8 — Cleaning Up the Code (Optional)

Though you have completed the tutorial, it is recommended to reorganize the code in the nlp_test.py file to follow best programming practices. Per best practice, your code should meet these criteria:

      • All imports should be at the top of the file. Imports from the same library should be grouped together in a single statement.
      • All functions should be defined after the imports.
      • All the statements in the file should be housed under an if __name__ == "__main__": condition. This ensures that the statements are not executed if you are importing the functions of the file in another file.

We will also remove the code that was commented out while following the tutorial, along with the lemmatize_sentence function, since the lemmatization is now handled by the new remove_noise function.

      Here is the cleaned version of nlp_test.py:

      from nltk.stem.wordnet import WordNetLemmatizer
      from nltk.corpus import twitter_samples, stopwords
      from nltk.tag import pos_tag
      from nltk.tokenize import word_tokenize
      from nltk import FreqDist, classify, NaiveBayesClassifier
      
      import re, string, random
      
      def remove_noise(tweet_tokens, stop_words = ()):
      
          cleaned_tokens = []
      
          for token, tag in pos_tag(tweet_tokens):
              token = re.sub('http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+#]|[!*(),]|'
                             '(?:%[0-9a-fA-F][0-9a-fA-F]))+','', token)
              token = re.sub("(@[A-Za-z0-9_]+)","", token)
      
              if tag.startswith("NN"):
                  pos = 'n'
              elif tag.startswith('VB'):
                  pos = 'v'
              else:
                  pos = 'a'
      
              lemmatizer = WordNetLemmatizer()
              token = lemmatizer.lemmatize(token, pos)
      
              if len(token) > 0 and token not in string.punctuation and token.lower() not in stop_words:
                  cleaned_tokens.append(token.lower())
          return cleaned_tokens
      
      def get_all_words(cleaned_tokens_list):
          for tokens in cleaned_tokens_list:
              for token in tokens:
                  yield token
      
      def get_tweets_for_model(cleaned_tokens_list):
          for tweet_tokens in cleaned_tokens_list:
              yield dict([token, True] for token in tweet_tokens)
      
      if __name__ == "__main__":
      
          positive_tweets = twitter_samples.strings('positive_tweets.json')
          negative_tweets = twitter_samples.strings('negative_tweets.json')
          text = twitter_samples.strings('tweets.20150430-223406.json')
          tweet_tokens = twitter_samples.tokenized('positive_tweets.json')[0]
      
          stop_words = stopwords.words('english')
      
          positive_tweet_tokens = twitter_samples.tokenized('positive_tweets.json')
          negative_tweet_tokens = twitter_samples.tokenized('negative_tweets.json')
      
          positive_cleaned_tokens_list = []
          negative_cleaned_tokens_list = []
      
          for tokens in positive_tweet_tokens:
              positive_cleaned_tokens_list.append(remove_noise(tokens, stop_words))
      
          for tokens in negative_tweet_tokens:
              negative_cleaned_tokens_list.append(remove_noise(tokens, stop_words))
      
          all_pos_words = get_all_words(positive_cleaned_tokens_list)
      
          freq_dist_pos = FreqDist(all_pos_words)
          print(freq_dist_pos.most_common(10))
      
          positive_tokens_for_model = get_tweets_for_model(positive_cleaned_tokens_list)
          negative_tokens_for_model = get_tweets_for_model(negative_cleaned_tokens_list)
      
          positive_dataset = [(tweet_dict, "Positive")
                               for tweet_dict in positive_tokens_for_model]
      
          negative_dataset = [(tweet_dict, "Negative")
                               for tweet_dict in negative_tokens_for_model]
      
          dataset = positive_dataset + negative_dataset
      
          random.shuffle(dataset)
      
          train_data = dataset[:7000]
          test_data = dataset[7000:]
      
          classifier = NaiveBayesClassifier.train(train_data)
      
          print("Accuracy is:", classify.accuracy(classifier, test_data))
      
          print(classifier.show_most_informative_features(10))
      
          custom_tweet = "I ordered just once from TerribleCo, they screwed up, never used the app again."
      
          custom_tokens = remove_noise(word_tokenize(custom_tweet))
      
          print(custom_tweet, classifier.classify(dict([token, True] for token in custom_tokens)))
      

      Conclusion

      This tutorial introduced you to a basic sentiment analysis model using the nltk library in Python 3. First, you performed pre-processing on tweets by tokenizing a tweet, normalizing the words, and removing noise. Next, you visualized frequently occurring items in the data. Finally, you built a model to associate tweets to a particular sentiment.

A supervised learning model is only as good as its training data. To further strengthen the model, you could consider adding more categories like excitement and anger. In this tutorial, you have only scratched the surface by building a rudimentary model. Here’s a detailed guide on various considerations to account for while performing sentiment analysis.






      How to Start a Photography Blog (In 4 Steps)


      Photography is a popular and useful hobby, especially with the variety and convenience of advanced camera options we have now. Whether you’re into dark rooms and film or high-end digital lenses, turning your photography hobby into a business might be on your radar. Figuring out how best to get your work online, however, can be a full-time job.

      That’s where WordPress and the time-saving functionality of website builders come in. When you combine the content management options of WordPress with drag-and-drop site design capability, it’s easy to turn your big ideas into a professional photography site.

      In this article, we’ll cover four steps for creating a photography website with WordPress. We’ll also discuss why this platform is the best option and share the best website builder tools for WordPress to help you attain your dream photography blogging site.

      Take that lens cap off your camera, friend, and let’s get started!

      Why You Should Consider a WordPress Website for Your Photography Blog

      When it comes to Content Management Systems (CMSs), we make no bones about it — WordPress is the best. You don’t have to take our word for it, though. WordPress owns 50–60% of the global CMS market. Additionally, it’s the first choice for 14.7% of the top 100 sites on the web.

      Outside of those numbers, WordPress’s practical, open-source platform is another reason we suggest it for a photography blog. A nearly endless array of custom themes and plugins are available to help you eliminate distractions and create a truly unique website for your photography.

      One more plus? WordPress software is free. That means that even as a brand new blogger, you can afford a self-hosted website.

      How to Start a Photography Blog With WordPress (In 4 Steps)

      One of the first steps in designing a photography website is to determine your own style or niche. Whatever your blogging focus might be, knowing this ahead of time will help you design your site and target your specific audience. Take a few minutes to set some goals for your site and then write them down.

      Once you have your blogging goals, the following four steps should help to guide you through setting up and designing a site with WordPress.

      Step 1: Choose Your Domain Name and Web Host

      Picking out a domain name can be a fun but frustrating process. One of the best ways to stand out as a blogger is with a unique and brand-oriented name. You might find, however, that many of the names you want are already taken.

      Fortunately, while choosing a .com is still popular, there are quite a few new Top-Level Domains (TLDs) available that might be just right for your photography site if the domain you wanted is unavailable.

      Searching for a domain name.

      As for selecting a hosting provider, this step might seem overwhelming at first. However, there are a few things to keep in mind as you shop that should help. For instance, if you plan on setting up an e-commerce page as part of your blogging strategy, you’ll want to look into what each host provides for that use case.

      Regardless of your goals as a blogger, other significant features to look out for include:

      • Storage. If you plan on using the same host for your website and your photos, you’ll want to investigate the amount of storage that’s available. There may be several options, or even additional storage available as an add-on to handle your larger, high-quality images.
      • Software. You’ll also want to consider whether you’ll need a one-click solution to get started with WordPress. This is an excellent option for anyone who is not hiring a developer and isn’t a programming expert.
      • Support. The last thing you want is for your clients to run into downtime while trying to view your photos. Make sure your web host has 24/7 support, and read up on its site backup and restoration options in case something happens.
      • Extras. Some hosts come with extra features you might want to consider. These can include premium themes or plugins, staging sites, or site builders.

      No matter what type of hosting you ultimately decide you need, here at DreamHost we offer a wide range of WordPress plans.

      WordPress hosting at DreamHost.


      Step 2: Install a Dedicated Photography Theme

      Installing a theme enables you to customize the look of your WordPress site. What’s more, it’s as easy as uploading a file or clicking a button. There are a lot of photography themes out there, however, so deciding which one is best for you might be the hardest part.

      If you’re using DreamHost as your WordPress hosting service, you’ll have access to WP Website Builder. As a photographer, this means you can drag-and-drop your site elements in a front-end view of your website. You can choose from photography-specific custom templates and view your changes live as you make them.

      Getting started is easy. You simply need to select “WP Website Builder” as an option when purchasing your DreamHost plan.

      Adding WP Website Builder to DreamHost.

      Once you complete your purchase with the website builder selected, WordPress and premium plugins built by our friends at BoldGrid will be installed. The Page and Post Builder and Inspirations will appear in your menu options once you visit your WordPress dashboard.

      Once you’re logged into WordPress for the first time, you’ll be immediately taken to a setup page. When you’re ready, select Let’s Get Started!.

      BoldGrid Inspirations.

      Next, you’ll be able to choose from a menu of theme categories. Inspirations includes 20 stunning photography-friendly themes.

      Theme choices.

      Once you select the theme you want, you’ll be guided through choosing some custom content options. You can use preset page layouts and create menus. You’ll also be able to test your theme’s responsiveness to mobile devices.

      Content choice settings.

      You might notice additional content on your WordPress dashboard now as well. There are some tutorial videos, for example, in case you need extra support along the way. Plus, if you want to spice things up later and change your theme, the Inspirations menu will lead you through that process.

      Step 3: Select Plugins to Enhance Your Site

      Now that you’ve selected a theme, you might want to look at some plugins as well. WordPress plugins are add-on packages of code that can enhance and extend the functionality of the platform. You’ll want to familiarize yourself with the best way to manage them, in order to make sure you’re keeping your site safe and secure.

      Photography blogs and websites have some unique needs, such as the ability to display and watermark high-quality images. You may also need to create image galleries with password protection or tie your e-commerce options to a file download manager. All of these tasks can be tackled with plugins.

      One tool to check out is Photography Management.

      The Photography Management plugin.

      This plugin is a client management solution for photographers who need to provide images and galleries to their customers. It can help you create a login portal for clients and provides notifications when your clients complete an action.

      Another reliable photography plugin is Envira Gallery.

      The Envira Gallery plugin.

      The Envira feature list is extensive. It includes options for watermarking your images, which may be an important part of your security strategy. You can also set up a storefront, create video galleries, and import content from Instagram. Combining this kind of tool with our website builder makes displaying your work dynamically online a cinch.


      Step 4: Create Compelling Content

When it comes to Search Engine Optimization (SEO), there is more to think about than just keywords. What other pages say about you is another element that is vital for securing better page rankings.

Gaining backlinks and having your pages shared on social media are both effective ways to build page rank and attract clients. One way to garner more backlinks is to create compelling content. This could come in the form of tutorials, downloads, infographics, videos, or podcasts.

      A beautiful example of these options can be seen on the Julia & Gil photography site.

      Julia and Gil photography site.

      Adding a blog to your page is also a great way to build a following and establish yourself as a trusted name in the industry.

      How to Promote Your Photography Business

      Now that your photography has a home on the web, you might be wondering how to get more eyeballs on your work. Self-promotion can be a challenge at times, but with WordPress and your professional theme, you have plenty to showcase!

      There are a few ways to approach promoting your new site, including:

      • Social Media. Sharing your work on social media can reap significant benefits. One way to get into the habit is to stay on a posting schedule, so interested viewers know they can regularly expect new content. Here’s how we recommend promoting your blog on social.
      • Testimonials. Research shows that 82% of consumers seek recommendations from family and friends before making a purchase. This makes customer testimonials a powerful tool on your website.
      • Call to Action. If your goal is to gain clients or fill up your email subscriber list, you might want to study up on the art of writing a good Call to Action (CTA). This will clearly guide your site’s visitors towards the action you want them to take.
      • Portfolios. Creating a portfolio can give you a portion of your site that is specifically geared toward promoting your skills. While your entire website might function as an advertisement, a portfolio allows you to pick and choose your best work to highlight.

However you decide to promote your new website, keeping your site information up-to-date and accurate is crucial when it comes to improving SEO and gaining a following.

      Blogging Photographers

      Whether it’s nature, weddings, family portraits, or street photography, you can personally display your images with a professional photography theme and WP Website Builder. WordPress’ niche photography plugins can also help you add unique elements to set your site apart.

      Here at DreamHost, we want you to be focused on the next shot — not whether your site might crash. Our complete WordPress hosting solutions come with easy built-in solutions for backing up your website and maintaining top-notch performance. Additionally, WordPress setup is fast and easy, so you can get up and running and share your amazing images with the world!


