

How To Install Python 3 and Set Up a Programming Environment on Ubuntu 18.04 [Quickstart]


      Introduction

      Python is a flexible and versatile programming language, with strengths in scripting, automation, data analysis, machine learning, and back-end development.

      This tutorial will walk you through installing Python and setting up a programming environment on an Ubuntu 18.04 server. For a more detailed version of this tutorial, with better explanations of each step, refer to How To Install Python 3 and Set Up a Programming Environment on an Ubuntu 18.04 Server.

      Step 1: Update and Upgrade

      Logged in to your Ubuntu 18.04 server as a non-root user with sudo privileges, first update and upgrade your system to ensure that the version of Python 3 that shipped with it is up to date:

      • sudo apt update
      • sudo apt -y upgrade

Confirm the installation if prompted to do so.

      Step 2: Check the Version of Python

      Check which version of Python 3 is installed by typing the following:
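On Ubuntu 18.04 this is typically done with:

      • python3 -V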

You will receive output similar to the following, depending on when you last updated your system:

      Output

      Python 3.6.7

Step 3: Install pip

      To manage software packages for Python, install pip, a tool that will install and manage libraries or modules to use in your projects:

      • sudo apt install -y python3-pip

Python packages can be installed by typing:

      • pip3 install package_name

Here, package_name can refer to any Python package or library, such as Django for web development or NumPy for scientific computing. So if you would like to install NumPy, you can do so with the command pip3 install numpy.

Step 4: Install Additional Tools

      There are a few more packages and development tools to install to ensure that we have a robust setup for our programming environment:

      • sudo apt install build-essential libssl-dev libffi-dev python3-dev

Step 5: Install venv

      Virtual environments enable you to have an isolated space on your server for Python projects. We'll use venv, part of the standard Python 3 library, which we can install by typing:

      • sudo apt install -y python3-venv

Step 6: Create a Virtual Environment

      You can create a new environment with the venv module. Here, we'll call our new environment my_env, but you can call yours whatever you like.
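With the python3-venv package installed above, a new environment named my_env can typically be created with:

      • python3 -m venv my_env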

Step 7: Activate a Virtual Environment

      Activate the environment using the command below, where my_env is the name of your programming environment:

      • source my_env/bin/activate

Your command prompt will now be prefixed with the name of your environment.

Step 8: Test the Virtual Environment

      Open the Python interpreter:
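Inside the virtual environment this is simply (see the note below):

      • python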

Note that within the Python 3 virtual environment, you can use the command python instead of python3, and pip instead of pip3.

      You will know you are in the interpreter when you receive the following output:

      Python 3.6.5 (default, Apr  1 2018, 05:46:30) 
      [GCC 7.3.0] on linux
      Type "help", "copyright", "credits" or "license" for more information.
      >>> 
      

Now, use the print() function to create the traditional “Hello, World” program:
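At the interpreter prompt, this is the single call:

      >>> print("Hello, World!")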

      Output

      Hello, World!

Step 9: Deactivate a Virtual Environment

      Quit the Python interpreter:
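From the interpreter prompt, this is typically done with quit() (or by pressing CTRL + D):

      >>> quit()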

Then, leave the virtual environment:
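For a venv environment this is done with:

      • deactivate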

Further Reading

      Here are links to more detailed tutorials that are related to this guide:




Python 2 vs Python 3: Practical Considerations


      Introduction

      Python is an extremely readable and versatile programming language. Named after the British comedy group Monty Python, keeping the language fun to use has been an important core goal of the Python development team. Quick to set up, written in a relatively straightforward style, and with immediate feedback on errors, Python is a great choice for those new to programming.

      Because Python is a multiparadigm language, meaning it supports several programming styles including scripting and object-oriented programming, it is well suited for general-purpose use. In industry, it is increasingly used by organizations such as United Space Alliance (NASA's main shuttle support contractor) and Industrial Light & Magic (the visual effects and animation studio of Lucasfilm). A foundation in Python offers enormous potential for those who go on to learn other programming languages.

      Python was originally written by Guido van Rossum of the Netherlands, who remains active in the community today. Development of the language began in the late 1980s, and it was first released in 1991. As a successor to the ABC programming language, Python's first iteration already included exception handling, functions, and classes with inheritance. In 1994, an important Usenet discussion forum called comp.lang.python was formed; Python's user base has flourished since, paving the way for Python to become one of the most popular open-source programming languages.

      General Overview

      Before looking into the potential uses of Python 2 and Python 3, and the key syntactic differences between them, let's go over some background on the major recent releases of Python.

      Python 2

Python 2, published in late 2000, signaled a more transparent and inclusive language development process than earlier versions of Python through its implementation of PEP (Python Enhancement Proposal), a technical specification that either provides information to Python community members or describes a new feature of the language.

      Additionally, Python 2 included many more programmatic features, among them a cycle-detecting garbage collector to automate memory management, increased Unicode support to standardize characters, and list comprehensions to create lists based on existing lists. As Python 2 continued to develop, more features were added, including the unification of Python's types and classes into one hierarchy in version 2.2.

      Python 3

Regarded as the future of the language, Python 3 is the version of Python that is currently in development. A major overhaul, it was released in late 2008 to address and amend intrinsic design flaws of previous versions of the language. The focus of Python 3's development was to clean up the codebase and remove redundancy, making it clear that there is only one way to perform a given task.

      Major modifications in Python 3.0 included changing the print statement into a built-in function, improving the way integers are divided, and adding more Unicode support.

      At first, adoption of Python 3 was slow because the language was not backwards compatible with Python 2, requiring people to decide which version to use. Additionally, many third-party libraries initially supported only Python 2, but as the Python 3 development team reiterated that support for Python 2 would come to an end, more libraries were ported to Python 3. The increased adoption of Python 3 can be seen in the number of third-party packages that provide Python 3 support: at the time the English version of this tutorial was finalized, 339 of the 360 most popular Python packages supported Python 3.

      Python 2.7

Following the 2008 release of Python 3.0, Python 2.7 was published on July 3, 2010 and planned as the last of the 2.x releases. Python 2.7 exists to make it easier for Python 2.x users to port features over to Python 3 by providing some measure of compatibility between the two. This compatibility support includes enhanced modules in version 2.7, such as unittest for automated testing, argparse for parsing command-line options, and more convenient classes in collections.

      Because of Python 2.7's unique position as a version in between Python 2 and the early iterations of Python 3.0, it is compatible with many robust libraries and has persisted as an extremely popular choice among programmers. When we talk about Python 2 today, we are typically referring to Python 2.7, as it is the most frequently used version.

      However, Python 2.7 is considered a legacy language whose support will end: its continued development, which today consists mostly of bug fixes, will cease completely in 2020.

Key Differences

      While Python 2.7 and Python 3 share many similar capabilities, they should not be thought of as entirely interchangeable. Though you can write good code and useful programs in either version, it is worth understanding that there are considerable differences in code syntax and handling.

      Below are a few examples, but keep in mind that you will likely encounter more syntactical differences as you continue to learn Python.

      Print

      In Python 2, print is treated as a statement rather than a function, which is a typical source of confusion, since many other operations in Python require arguments inside parentheses in order to execute. If you want to print Sammy the Shark is my favorite sea creature to the console in Python 2, you can do so with the following print statement:

      print "Sammy the Shark is my favorite sea creature"
      

With Python 3, print() is now explicitly treated as a function, so to print the same string above you can do so simply and easily using the syntax of a function:

      print("Sammy the Shark is my favorite sea creature")
      

This change made Python's syntax more consistent and also made it easier to switch between different print functions. Conveniently, the print() syntax is also backwards compatible with Python 2.7, so your Python 3 print() functions can run in either version.

Integer Division

      In Python 2, any number that you type without decimals is treated as an integer. While at first glance this seems like an easy way to handle programming types, when you divide integers you sometimes expect to get an answer with decimal places, known as a float, as in:

      5 / 2 = 2.5
      

However, in Python 2 integers are strongly typed and will not be converted to floats with decimal places, even in cases where that would make intuitive sense.

When the two numbers on either side of the division symbol / are integers, Python 2 performs floor division, so that for the quotient x the number returned is the largest integer less than or equal to x. This means that when you write 5 / 2 to divide the two numbers, Python 2.7 returns the largest integer less than or equal to 2.5, in this case 2:

      a = 5 / 2
      print a
      

      Output

      2

To override this behavior, you can add decimal places, as in 5.0 / 2.0, to get the expected answer of 2.5.

In Python 3, integer division behaves more intuitively, as in:

      a = 5 / 2
      print(a)
      

      Output

      2.5

You can still use 5.0 / 2.0 to return 2.5, but if you want to do floor division, you should use the Python 3 syntax of //, like this:

      b = 5 // 2
      print(b)
      

      Output

      2

This modification in Python 3 made integer division more intuitive, and the // floor-division syntax is also available in Python 2.7.

Unicode Support

      When programming languages handle the string type, that is, a sequence of characters, they can do so in a few different ways so that computers can convert numbers to letters and other symbols.

      Python 2 uses the ASCII alphabet by default, so when you type "Hello, Sammy!", Python 2 handles the string as ASCII. Limited to a couple of hundred characters at best in its various extended forms, ASCII is not a very flexible method for encoding characters, especially non-English characters.

      To use the more versatile and robust Unicode character encoding, which supports over 128,000 characters across contemporary and historic scripts and symbol sets, you have to type u"Hello, Sammy!", with the u prefix standing for Unicode.

      Python 3 uses Unicode by default, which saves programmers extra development time and lets you easily type and display many more characters directly in your programs. Because Unicode supports greater linguistic character diversity as well as the display of emojis, using it as the default character encoding ensures that mobile devices around the world are readily supported in your development projects.

If you would like your Python 3 code to be backwards compatible with Python 2, though, you can keep the u before your string.
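As a quick sketch of the difference, here is how each interpreter treats the same string literal (output shown for a typical Python 2.7 and Python 3 session):

      # Python 2: plain literals are ASCII byte strings; the u prefix creates a Unicode string
      >>> type("Hello, Sammy!")
      <type 'str'>
      >>> type(u"Hello, Sammy!")
      <type 'unicode'>

      # Python 3: string literals are Unicode by default, so the u prefix is accepted but redundant
      >>> type("Hello, Sammy!")
      <class 'str'>
      >>> type(u"Hello, Sammy!")
      <class 'str'>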

Continued Development

      The biggest difference between Python 3 and Python 2 is not syntax: it is the fact that Python 2.7 will lose continued support in 2020, while Python 3 will keep being developed, with more features and ongoing bug fixes.

Recent developments in Python 3 include formatted string literals, simpler customization of class creation, and cleaner syntax for handling matrix multiplication.
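For instance, here is a minimal sketch of two of these features (the matrix example assumes NumPy is installed):

      import numpy as np

      name = "Sammy"
      print(f"Hello, {name}!")   # formatted string literal (f-string), available since Python 3.6

      a = np.array([[1, 2], [3, 4]])
      print(a @ a)               # the @ matrix multiplication operator, available since Python 3.5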

The continued development of Python 3 means that programmers can trust that issues in the language will be fixed in a timely manner, that features will be added over time, and that programs will become more efficient.

Additional Points to Consider

      As a new programmer starting out with Python, or an experienced programmer who is new to the language, you will want to consider what you are hoping to achieve by learning Python.

      If you just want to learn, without a specific project in mind, the key consideration is that Python 3 will continue to be supported and developed, while Python 2.7 will not.

      If, however, you are planning to join an existing project, you should look at which version of Python the team uses, how the different versions interact with the legacy codebase, whether the packages the project uses are supported in each version, and the implementation details of the project.

      If you are starting a new project, you should research which packages you plan to use and which versions of Python they are compatible with. As noted above, although earlier versions of Python 3 were less compatible with libraries built for Python 2, many of those libraries have since been ported to Python 3 or have committed to porting within the next four years.

Conclusion

      Python is a versatile and well-documented programming language, and whether you choose to work with Python 2 or Python 3, you will be able to build exciting software projects.

      Though there are several key differences, it is not too difficult to move from Python 3 to Python 2 with a few tweaks, and you will often find that Python 2.7 can easily run Python 3 code, especially when you are just starting out. You can learn more about how to port Python 2 code to Python 3 by reading this tutorial.

      It is important to keep in mind that as more developer and community attention focuses on Python 3, the language will become more refined and better aligned with the evolving needs of programmers, while support for Python 2.7 will gradually diminish.




      How To Perform Sentiment Analysis in Python 3 Using the Natural Language Toolkit (NLTK)


      The author selected the Open Internet/Free Speech fund to receive a donation as part of the Write for DOnations program.

      Introduction

      A large amount of data that is generated today is unstructured, which requires processing to generate insights. Some examples of unstructured data are news articles, posts on social media, and search history. The process of analyzing natural language and making sense out of it falls under the field of Natural Language Processing (NLP). Sentiment analysis is a common NLP task, which involves classifying texts or parts of texts into a pre-defined sentiment. You will use the Natural Language Toolkit (NLTK), a commonly used NLP library in Python, to analyze textual data.

In this tutorial, you will prepare a dataset of sample tweets from the NLTK package for NLP with different data cleaning methods. Once the dataset is ready for processing, you will train a model on pre-classified tweets and use the model to classify the sample tweets into negative and positive sentiments.

This article assumes that you are familiar with the basics of Python (see our How To Code in Python 3 series), primarily the use of data structures, classes, and methods. The tutorial assumes that you have no background in NLP or nltk, although some knowledge of them is an added advantage.

      Prerequisites

      Step 1 — Installing NLTK and Downloading the Data

      You will use the NLTK package in Python for all NLP tasks in this tutorial. In this step you will install NLTK and download the sample tweets that you will use to train and test your model.

      First, install the NLTK package with the pip package manager:
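Assuming pip for Python 3 is available, this is typically:

      • pip install nltk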

      This tutorial will use sample tweets that are part of the NLTK package. First, start a Python interactive session by running the following command:
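On most systems this is:

      • python3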

Then, import the nltk module in the Python interpreter:
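That is, enter:

      • import nltk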

      Download the sample tweets from the NLTK package:

      • nltk.download('twitter_samples')

      Running this command from the Python interpreter downloads and stores the tweets locally. Once the samples are downloaded, they are available for your use.

      You will use the negative and positive tweets to train your model on sentiment analysis later in the tutorial. The tweets with no sentiments will be used to test your model.

      If you would like to use your own dataset, you can gather tweets from a specific time period, user, or hashtag by using the Twitter API.

      Now that you’ve imported NLTK and downloaded the sample tweets, exit the interactive session by entering in exit(). You are ready to import the tweets and begin processing the data.

      Step 2 — Tokenizing the Data

      Language in its original form cannot be accurately processed by a machine, so you need to process the language to make it easier for the machine to understand. The first part of making sense of the data is through a process called tokenization, or splitting strings into smaller parts called tokens.

      A token is a sequence of characters in text that serves as a unit. Based on how you create the tokens, they may consist of words, emoticons, hashtags, links, or even individual characters. A basic way of breaking language into tokens is by splitting the text based on whitespace and punctuation.

      To get started, create a new .py file to hold your script. This tutorial will use nlp_test.py:
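Any text editor works; for example, assuming nano is installed:

      • nano nlp_test.py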

      In this file, you will first import the twitter_samples so you can work with that data:

      nlp_test.py

      from nltk.corpus import twitter_samples
      

      This will import three datasets from NLTK that contain various tweets to train and test the model:

      • negative_tweets.json: 5000 tweets with negative sentiments
      • positive_tweets.json: 5000 tweets with positive sentiments
      • tweets.20150430-223406.json: 20000 tweets with no sentiments

      Next, create variables for positive_tweets, negative_tweets, and text:

      nlp_test.py

      from nltk.corpus import twitter_samples
      
      positive_tweets = twitter_samples.strings('positive_tweets.json')
      negative_tweets = twitter_samples.strings('negative_tweets.json')
      text = twitter_samples.strings('tweets.20150430-223406.json')
      

      The strings() method of twitter_samples will print all of the tweets within a dataset as strings. Setting the different tweet collections as a variable will make processing and testing easier.

      Before using a tokenizer in NLTK, you need to download an additional resource, punkt. The punkt module is a pre-trained model that helps you tokenize words and sentences. For instance, this model knows that a name may contain a period (like “S. Daityari”) and the presence of this period in a sentence does not necessarily end it. First, start a Python interactive session:
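As before, this is typically:

      • python3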

      Run the following commands in the session to download the punkt resource:

      • import nltk
      • nltk.download('punkt')

      Once the download is complete, you are ready to use NLTK’s tokenizers. NLTK provides a default tokenizer for tweets with the .tokenized() method. Add a line to create an object that tokenizes the positive_tweets.json dataset:

      nlp_test.py

      from nltk.corpus import twitter_samples
      
      positive_tweets = twitter_samples.strings('positive_tweets.json')
      negative_tweets = twitter_samples.strings('negative_tweets.json')
      text = twitter_samples.strings('tweets.20150430-223406.json')
      tweet_tokens = twitter_samples.tokenized('positive_tweets.json')
      

If you’d like to test the script to see the .tokenized() method in action, add the print line shown at the end of the following block to your nlp_test.py script. This will print the tokens of a single tweet from the positive_tweets.json dataset:

      nlp_test.py

      from nltk.corpus import twitter_samples
      
      positive_tweets = twitter_samples.strings('positive_tweets.json')
      negative_tweets = twitter_samples.strings('negative_tweets.json')
      text = twitter_samples.strings('tweets.20150430-223406.json')
      tweet_tokens = twitter_samples.tokenized('positive_tweets.json')
      
      print(tweet_tokens[0])
      

      Save and close the file, and run the script:
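Assuming the script is in your current directory, this is typically:

      • python3 nlp_test.py

      The same command applies wherever this tutorial asks you to run the script.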

      The process of tokenization takes some time because it’s not a simple split on white space. After a few moments of processing, you’ll see the following:

      Output

      ['#FollowFriday', '@France_Inte', '@PKuchly57', '@Milipol_Paris', 'for', 'being', 'top', 'engaged', 'members', 'in', 'my', 'community', 'this', 'week', ':)']

      Here, the .tokenized() method returns special characters such as @ and _. These characters will be removed through regular expressions later in this tutorial.

Now that you’ve seen how the .tokenized() method works, comment out or remove the last line that prints the tokenized tweet by adding a # to the start of the line:

      nlp_test.py

      from nltk.corpus import twitter_samples
      
      positive_tweets = twitter_samples.strings('positive_tweets.json')
      negative_tweets = twitter_samples.strings('negative_tweets.json')
      text = twitter_samples.strings('tweets.20150430-223406.json')
      tweet_tokens = twitter_samples.tokenized('positive_tweets.json')
      
      #print(tweet_tokens[0])
      

      Your script is now configured to tokenize data. In the next step you will update the script to normalize the data.

      Step 3 — Normalizing the Data

      Words have different forms—for instance, “ran”, “runs”, and “running” are various forms of the same verb, “run”. Depending on the requirement of your analysis, all of these versions may need to be converted to the same form, “run”. Normalization in NLP is the process of converting a word to its canonical form.

      Normalization helps group together words with the same meaning but different forms. Without normalization, “ran”, “runs”, and “running” would be treated as different words, even though you may want them to be treated as the same word. In this section, you explore stemming and lemmatization, which are two popular techniques of normalization.

      Stemming is a process of removing affixes from a word. Stemming, working with only simple verb forms, is a heuristic process that removes the ends of words.

      In this tutorial you will use the process of lemmatization, which normalizes a word with the context of vocabulary and morphological analysis of words in text. The lemmatization algorithm analyzes the structure of the word and its context to convert it to a normalized form. Therefore, it comes at a cost of speed. A comparison of stemming and lemmatization ultimately comes down to a trade off between speed and accuracy.

Before you proceed to use lemmatization, download the necessary resources by entering the following into a Python interactive session:

      Run the following commands in the session to download the resources:

      • import nltk
      • nltk.download('wordnet')
      • nltk.download('averaged_perceptron_tagger')

      wordnet is a lexical database for the English language that helps the script determine the base word. You need the averaged_perceptron_tagger resource to determine the context of a word in a sentence.

Once downloaded, you are almost ready to use the lemmatizer. Before running a lemmatizer, you need to determine the context for each word in your text. This is achieved by a tagging algorithm, which assesses the relative position of a word in a sentence. In a Python session, import the pos_tag function and provide a list of tokens as an argument to get the tags. Let’s try this out in Python:

      • from nltk.tag import pos_tag
      • from nltk.corpus import twitter_samples
      • tweet_tokens = twitter_samples.tokenized('positive_tweets.json')
      • print(pos_tag(tweet_tokens[0]))

      Here is the output of the pos_tag function.

      Output

      [('#FollowFriday', 'JJ'), ('@France_Inte', 'NNP'), ('@PKuchly57', 'NNP'), ('@Milipol_Paris', 'NNP'), ('for', 'IN'), ('being', 'VBG'), ('top', 'JJ'), ('engaged', 'VBN'), ('members', 'NNS'), ('in', 'IN'), ('my', 'PRP$'), ('community', 'NN'), ('this', 'DT'), ('week', 'NN'), (':)', 'NN')]

      From the list of tags, here is the list of the most common items and their meaning:

      • NNP: Noun, proper, singular
      • NN: Noun, common, singular or mass
      • IN: Preposition or conjunction, subordinating
      • VBG: Verb, gerund or present participle
      • VBN: Verb, past participle

      Here is a full list of the dataset.

In general, if a tag starts with NN, the word is a noun, and if it starts with VB, the word is a verb. After reviewing the tags, exit the Python session by entering exit().

      To incorporate this into a function that normalizes a sentence, you should first generate the tags for each token in the text, and then lemmatize each word using the tag.

      Update the nlp_test.py file with the following function that lemmatizes a sentence:

      nlp_test.py

      ...
      
      from nltk.tag import pos_tag
      from nltk.stem.wordnet import WordNetLemmatizer
      
      def lemmatize_sentence(tokens):
          lemmatizer = WordNetLemmatizer()
          lemmatized_sentence = []
          for word, tag in pos_tag(tokens):
              if tag.startswith('NN'):
                  pos = 'n'
              elif tag.startswith('VB'):
                  pos = 'v'
              else:
                  pos = 'a'
              lemmatized_sentence.append(lemmatizer.lemmatize(word, pos))
          return lemmatized_sentence
      
      print(lemmatize_sentence(tweet_tokens[0]))
      

      This code imports the WordNetLemmatizer class and initializes it to a variable, lemmatizer.

      The function lemmatize_sentence first gets the position tag of each token of a tweet. Within the if statement, if the tag starts with NN, the token is assigned as a noun. Similarly, if the tag starts with VB, the token is assigned as a verb.

      Save and close the file, and run the script:

      Here is the output:

      Output

      ['#FollowFriday', '@France_Inte', '@PKuchly57', '@Milipol_Paris', 'for', 'be', 'top', 'engage', 'member', 'in', 'my', 'community', 'this', 'week', ':)']

      You will notice that the verb being changes to its root form, be, and the noun members changes to member. Before you proceed, comment out the last line that prints the sample tweet from the script.

      Now that you have successfully created a function to normalize words, you are ready to move on to remove noise.

      Step 4 — Removing Noise from the Data

      In this step, you will remove noise from the dataset. Noise is any part of the text that does not add meaning or information to data.

      Noise is specific to each project, so what constitutes noise in one project may not be in a different project. For instance, the most common words in a language are called stop words. Some examples of stop words are “is”, “the”, and “a”. They are generally irrelevant when processing language, unless a specific use case warrants their inclusion.

      In this tutorial, you will use regular expressions in Python to search for and remove these items:

      • Hyperlinks - All hyperlinks in Twitter are converted to the URL shortener t.co. Therefore, keeping them in the text processing would not add any value to the analysis.
      • Twitter handles in replies - These Twitter usernames are preceded by a @ symbol, which does not convey any meaning.
      • Punctuation and special characters - While these often provide context to textual data, this context is often difficult to process. For simplicity, you will remove all punctuation and special characters from tweets.

      To remove hyperlinks, you need to first search for a substring that matches a URL starting with http:// or https://, followed by letters, numbers, or special characters. Once a pattern is matched, the .sub() method replaces it with an empty string.

      Since we will normalize word forms within the remove_noise() function, you can comment out the lemmatize_sentence() function from the script.

      Add the following code to your nlp_test.py file to remove noise from the dataset:

      nlp_test.py

      ...
      
      import re, string
      
      def remove_noise(tweet_tokens, stop_words = ()):
      
          cleaned_tokens = []
      
          for token, tag in pos_tag(tweet_tokens):
              token = re.sub('http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+#]|[!*(),]|'
                             '(?:%[0-9a-fA-F][0-9a-fA-F]))+','', token)
              token = re.sub("(@[A-Za-z0-9_]+)","", token)
      
              if tag.startswith("NN"):
                  pos = 'n'
              elif tag.startswith('VB'):
                  pos = 'v'
              else:
                  pos = 'a'
      
              lemmatizer = WordNetLemmatizer()
              token = lemmatizer.lemmatize(token, pos)
      
              if len(token) > 0 and token not in string.punctuation and token.lower() not in stop_words:
                  cleaned_tokens.append(token.lower())
          return cleaned_tokens
      

      This code creates a remove_noise() function that removes noise and incorporates the normalization and lemmatization mentioned in the previous section. The code takes two arguments: the tweet tokens and the tuple of stop words.

      The code then uses a loop to remove the noise from the dataset. To remove hyperlinks, the code first searches for a substring that matches a URL starting with http:// or https://, followed by letters, numbers, or special characters. Once a pattern is matched, the .sub() method replaces it with an empty string, or ''.

      Similarly, to remove @ mentions, the code substitutes the relevant part of text using regular expressions. The code uses the re library to search @ symbols, followed by numbers, letters, or _, and replaces them with an empty string.

      Finally, you can remove punctuation using the library string.

      In addition to this, you will also remove stop words using a built-in set of stop words in NLTK, which needs to be downloaded separately.

      Execute the following command from a Python interactive session to download this resource:

      • nltk.download('stopwords')

      Once the resource is downloaded, exit the interactive session.

      You can use the .words() method to get a list of stop words in English. To test the function, let us run it on our sample tweet. Add the following lines to the end of the nlp_test.py file:

      nlp_test.py

      ...
      from nltk.corpus import stopwords
      stop_words = stopwords.words('english')
      
      print(remove_noise(tweet_tokens[0], stop_words))
      

      After saving and closing the file, run the script again to receive output similar to the following:

      Output

      ['#followfriday', 'top', 'engage', 'member', 'community', 'week', ':)']

Notice that the function removes all @ mentions and stop words, and converts the words to lowercase.

      Before proceeding to the modeling exercise in the next step, use the remove_noise() function to clean the positive and negative tweets. Comment out the line to print the output of remove_noise() on the sample tweet and add the following to the nlp_test.py script:

      nlp_test.py

      ...
      from nltk.corpus import stopwords
      stop_words = stopwords.words('english')
      
      #print(remove_noise(tweet_tokens[0], stop_words))
      
      positive_tweet_tokens = twitter_samples.tokenized('positive_tweets.json')
      negative_tweet_tokens = twitter_samples.tokenized('negative_tweets.json')
      
      positive_cleaned_tokens_list = []
      negative_cleaned_tokens_list = []
      
      for tokens in positive_tweet_tokens:
          positive_cleaned_tokens_list.append(remove_noise(tokens, stop_words))
      
      for tokens in negative_tweet_tokens:
          negative_cleaned_tokens_list.append(remove_noise(tokens, stop_words))
      

      Now that you’ve added the code to clean the sample tweets, you may want to compare the original tokens to the cleaned tokens for a sample tweet. If you’d like to test this, add the following code to the file to compare both versions of the 500th tweet in the list:

      nlp_test.py

      ...
      print(positive_tweet_tokens[500])
      print(positive_cleaned_tokens_list[500])
      

      Save and close the file and run the script. From the output you will see that the punctuation and links have been removed, and the words have been converted to lowercase.

      Output

['Dang', 'that', 'is', 'some', 'rad', '@AbzuGame', '#fanart', '!', ':D', 'https://t.co/bI8k8tb9ht']
      ['dang', 'rad', '#fanart', ':d']

      There are certain issues that might arise during the preprocessing of text. For instance, words without spaces (“iLoveYou”) will be treated as one and it can be difficult to separate such words. Furthermore, “Hi”, “Hii”, and “Hiiiii” will be treated differently by the script unless you write something specific to tackle the issue. It’s common to fine tune the noise removal process for your specific data.

      Now that you’ve seen the remove_noise() function in action, be sure to comment out or remove the last two lines from the script so you can add more to it:

      nlp_test.py

      ...
      #print(positive_tweet_tokens[500])
      #print(positive_cleaned_tokens_list[500])
      

      In this step you removed noise from the data to make the analysis more effective. In the next step you will analyze the data to find the most common words in your sample dataset.

      Step 5 — Determining Word Density

The most basic form of analysis on textual data is to look at word frequency. A single tweet is too small an entity to find out the distribution of words, so the analysis of word frequency will be done over all positive tweets.

      The following snippet defines a generator function, named get_all_words, that takes a list of tweets as an argument and yields the words from all of the tweet tokens. Add the following code to your nlp_test.py file:

      nlp_test.py

      ...
      
      def get_all_words(cleaned_tokens_list):
          for tokens in cleaned_tokens_list:
              for token in tokens:
                  yield token
      
      all_pos_words = get_all_words(positive_cleaned_tokens_list)
      

Now that you have compiled all words in the sample of tweets, you can find out which are the most common words using the FreqDist class of NLTK. Add the following code to the nlp_test.py file:

      nlp_test.py

      from nltk import FreqDist
      
      freq_dist_pos = FreqDist(all_pos_words)
      print(freq_dist_pos.most_common(10))
      

      The .most_common() method lists the words which occur most frequently in the data. Save and close the file after making these changes.

      When you run the file now, you will find the most common terms in the data:

      Output

      [(':)', 3691), (':-)', 701), (':d', 658), ('thanks', 388), ('follow', 357), ('love', 333), ('...', 290), ('good', 283), ('get', 263), ('thank', 253)]

      From this data, you can see that emoticon entities form some of the most common parts of positive tweets. Before proceeding to the next step, make sure you comment out the last line of the script that prints the top ten tokens.

To summarize, you extracted the tweets from nltk, then tokenized, normalized, and cleaned up the tweets for use in the model. Finally, you also looked at the frequencies of tokens in the data and checked the frequencies of the top ten tokens.

      In the next step you will prepare data for sentiment analysis.

      Step 6 — Preparing Data for the Model

Sentiment analysis is a process of identifying an attitude of the author on a topic that is being written about. You will create a training dataset to train a model. This is a supervised machine learning process, which requires you to associate each item in the dataset with a “sentiment” for training. In this tutorial, your model will use the “positive” and “negative” sentiments.

      Sentiment analysis can be used to categorize text into a variety of sentiments. For simplicity and availability of the training dataset, this tutorial helps you train your model in only two categories, positive and negative.

A model is a description of a system using rules and equations. It may be as simple as an equation that predicts the weight of a person given their height. The sentiment analysis model that you will build will associate tweets with a positive or a negative sentiment. You will need to split your dataset into two parts: the first part is used to build the model, while the second part tests the performance of the model.

      In the data preparation step, you will prepare the data for sentiment analysis by converting tokens to the dictionary form and then split the data for training and testing purposes.

      Converting Tokens to a Dictionary

First, you will prepare the data to be fed into the model. You will use the Naive Bayes classifier in NLTK to perform the modeling exercise. Notice that the model requires not just a list of words in a tweet, but a Python dictionary with words as keys and True as values. The following code defines a generator function to change the format of the cleaned data.

      Add the following code to convert the tweets from a list of cleaned tokens to dictionaries with keys as the tokens and True as values. The corresponding dictionaries are stored in positive_tokens_for_model and negative_tokens_for_model.

      nlp_test.py

      ...
      def get_tweets_for_model(cleaned_tokens_list):
          for tweet_tokens in cleaned_tokens_list:
              yield dict([token, True] for token in tweet_tokens)
      
      positive_tokens_for_model = get_tweets_for_model(positive_cleaned_tokens_list)
      negative_tokens_for_model = get_tweets_for_model(negative_cleaned_tokens_list)
      

      Splitting the Dataset for Training and Testing the Model

      Next, you need to prepare the data for training the NaiveBayesClassifier class. Add the following code to the file to prepare the data:

      nlp_test.py

      ...
      import random
      
      positive_dataset = [(tweet_dict, "Positive")
                           for tweet_dict in positive_tokens_for_model]
      
      negative_dataset = [(tweet_dict, "Negative")
                           for tweet_dict in negative_tokens_for_model]
      
      dataset = positive_dataset + negative_dataset
      
      random.shuffle(dataset)
      
      train_data = dataset[:7000]
      test_data = dataset[7000:]
      

      This code attaches a Positive or Negative label to each tweet. It then creates a dataset by joining the positive and negative tweets.

      By default, the data contains all positive tweets followed by all negative tweets in sequence. When training the model, you should provide a sample of your data that does not contain any bias. To avoid bias, you’ve added code to randomly arrange the data using the .shuffle() method of random.

      Finally, the code splits the shuffled data into a ratio of 70:30 for training and testing, respectively. Since the number of tweets is 10000, you can use the first 7000 tweets from the shuffled dataset for training the model and the final 3000 for testing the model.

      In this step, you converted the cleaned tokens to a dictionary form, randomly shuffled the dataset, and split it into training and testing data.

      Step 7 — Building and Testing the Model

      Finally, you can use the NaiveBayesClassifier class to build the model. Use the .train() method to train the model and the .accuracy() method to test the model on the testing data.

      nlp_test.py

      ...
      from nltk import classify
      from nltk import NaiveBayesClassifier
      classifier = NaiveBayesClassifier.train(train_data)
      
      print("Accuracy is:", classify.accuracy(classifier, test_data))
      
      print(classifier.show_most_informative_features(10))
      

      Save, close, and execute the file after adding the code. The output of the code will be as follows:

      Output

Accuracy is: 0.9956666666666667
      Most Informative Features
                            :( = True           Negati : Positi =   2085.6 : 1.0
                            :) = True           Positi : Negati =    986.0 : 1.0
                       welcome = True           Positi : Negati =     37.2 : 1.0
                        arrive = True           Positi : Negati =     31.3 : 1.0
                           sad = True           Negati : Positi =     25.9 : 1.0
                      follower = True           Positi : Negati =     21.1 : 1.0
                           bam = True           Positi : Negati =     20.7 : 1.0
                          glad = True           Positi : Negati =     18.1 : 1.0
                           x15 = True           Negati : Positi =     15.9 : 1.0
                     community = True           Positi : Negati =     14.1 : 1.0

      Accuracy is defined as the percentage of tweets in the testing dataset for which the model was correctly able to predict the sentiment. A 99.5% accuracy on the test set is pretty good.

In the table that shows the most informative features, every row in the output shows the ratio of occurrence of a token in positive and negative tagged tweets in the training dataset. The first row in the data signifies that in all tweets containing the token :(, the ratio of negative to positive tweets was 2085.6 to 1. Interestingly, it seems that there was one token with :( in the positive datasets. You can see that the top two discriminating items in the text are the emoticons. Further, words such as sad lead to negative sentiments, whereas welcome and glad are associated with positive sentiments.

      Next, you can check how the model performs on random tweets from Twitter. Add this code to the file:

      nlp_test.py

      ...
      from nltk.tokenize import word_tokenize
      
      custom_tweet = "I ordered just once from TerribleCo, they screwed up, never used the app again."
      
      custom_tokens = remove_noise(word_tokenize(custom_tweet))
      
      print(classifier.classify(dict([token, True] for token in custom_tokens)))
      

      This code will allow you to test custom tweets by updating the string associated with the custom_tweet variable. Save and close the file after making these changes.

      Run the script to analyze the custom text. Here is the output for the custom text in the example:

      Output

      'Negative'

      You can also check if it characterizes positive tweets correctly:

      nlp_test.py

      ...
custom_tweet = 'Congrats #SportStar on your 7th best goal from last season winning goal of the year :) #Baller #Topbin #oneofmanyworldies'
      

      Here is the output:

      Output

      'Positive'

      Now that you’ve tested both positive and negative sentiments, update the variable to test a more complex sentiment like sarcasm.

      nlp_test.py

      ...
      custom_tweet = 'Thank you for sending my baggage to CityX and flying me to CityY at the same time. Brilliant service. #thanksGenericAirline'
      

      Here is the output:

      Output

      'Positive'

The model classified this example as positive. This is because the training data wasn’t comprehensive enough to classify sarcastic tweets as negative. If you want your model to predict sarcasm, you would need to provide a sufficient amount of training data to train it accordingly.

      In this step you built and tested the model. You also explored some of its limitations, such as not detecting sarcasm in particular examples. Your completed code still has artifacts leftover from following the tutorial, so the next step will guide you through aligning the code to Python’s best practices.

      Step 8 — Cleaning Up the Code (Optional)

Though you have completed the tutorial, it is recommended that you reorganize the code in the nlp_test.py file to follow best programming practices. Per best practice, your code should meet these criteria:

      • All imports should be at the top of the file. Imports from the same library should be grouped together in a single statement.
      • All functions should be defined after the imports.
      • All the statements in the file should be housed under an if __name__ == "__main__": condition. This ensures that the statements are not executed if you are importing the functions of the file in another file.

      We will also remove the code that was commented out by following the tutorial, along with the lemmatize_sentence function, as the lemmatization is completed by the new remove_noise function.

      Here is the cleaned version of nlp_test.py:

      from nltk.stem.wordnet import WordNetLemmatizer
      from nltk.corpus import twitter_samples, stopwords
      from nltk.tag import pos_tag
      from nltk.tokenize import word_tokenize
      from nltk import FreqDist, classify, NaiveBayesClassifier
      
      import re, string, random
      
      def remove_noise(tweet_tokens, stop_words = ()):
      
          cleaned_tokens = []
      
          for token, tag in pos_tag(tweet_tokens):
              token = re.sub('http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+#]|[!*(),]|'
                             '(?:%[0-9a-fA-F][0-9a-fA-F]))+','', token)
              token = re.sub("(@[A-Za-z0-9_]+)","", token)
      
              if tag.startswith("NN"):
                  pos = 'n'
              elif tag.startswith('VB'):
                  pos = 'v'
              else:
                  pos = 'a'
      
              lemmatizer = WordNetLemmatizer()
              token = lemmatizer.lemmatize(token, pos)
      
              if len(token) > 0 and token not in string.punctuation and token.lower() not in stop_words:
                  cleaned_tokens.append(token.lower())
          return cleaned_tokens
      
      def get_all_words(cleaned_tokens_list):
          for tokens in cleaned_tokens_list:
              for token in tokens:
                  yield token
      
      def get_tweets_for_model(cleaned_tokens_list):
          for tweet_tokens in cleaned_tokens_list:
              yield dict([token, True] for token in tweet_tokens)
      
      if __name__ == "__main__":
      
          positive_tweets = twitter_samples.strings('positive_tweets.json')
          negative_tweets = twitter_samples.strings('negative_tweets.json')
          text = twitter_samples.strings('tweets.20150430-223406.json')
          tweet_tokens = twitter_samples.tokenized('positive_tweets.json')[0]
      
          stop_words = stopwords.words('english')
      
          positive_tweet_tokens = twitter_samples.tokenized('positive_tweets.json')
          negative_tweet_tokens = twitter_samples.tokenized('negative_tweets.json')
      
          positive_cleaned_tokens_list = []
          negative_cleaned_tokens_list = []
      
          for tokens in positive_tweet_tokens:
              positive_cleaned_tokens_list.append(remove_noise(tokens, stop_words))
      
          for tokens in negative_tweet_tokens:
              negative_cleaned_tokens_list.append(remove_noise(tokens, stop_words))
      
          all_pos_words = get_all_words(positive_cleaned_tokens_list)
      
          freq_dist_pos = FreqDist(all_pos_words)
          print(freq_dist_pos.most_common(10))
      
          positive_tokens_for_model = get_tweets_for_model(positive_cleaned_tokens_list)
          negative_tokens_for_model = get_tweets_for_model(negative_cleaned_tokens_list)
      
          positive_dataset = [(tweet_dict, "Positive")
                               for tweet_dict in positive_tokens_for_model]
      
          negative_dataset = [(tweet_dict, "Negative")
                               for tweet_dict in negative_tokens_for_model]
      
          dataset = positive_dataset + negative_dataset
      
          random.shuffle(dataset)
      
          train_data = dataset[:7000]
          test_data = dataset[7000:]
      
          classifier = NaiveBayesClassifier.train(train_data)
      
          print("Accuracy is:", classify.accuracy(classifier, test_data))
      
          print(classifier.show_most_informative_features(10))
      
          custom_tweet = "I ordered just once from TerribleCo, they screwed up, never used the app again."
      
          custom_tokens = remove_noise(word_tokenize(custom_tweet))
      
          print(custom_tweet, classifier.classify(dict([token, True] for token in custom_tokens)))
      

      Conclusion

This tutorial introduced you to a basic sentiment analysis model using the nltk library in Python 3. First, you performed pre-processing on tweets by tokenizing a tweet, normalizing the words, and removing noise. Next, you visualized frequently occurring items in the data. Finally, you built a model to associate tweets with a particular sentiment.

      A supervised learning model is only as good as its training data. To further strengthen the model, you could consider adding more categories like excitement and anger. In this tutorial, you have only scratched the surface by building a rudimentary model. Here’s a detailed guide on various considerations that one must take care of while performing sentiment analysis.




