      How to Create a Content Marketing Strategy


      Content marketing is one of the primary means of getting your brand noticed online. However, without a well-developed marketing strategy, you may struggle when deciding where to begin, see your conversions sink, or launch an unsuccessful campaign.

      The good news? We’ve got you covered when it comes to creating your content marketing plan. There are 10 easy steps you can follow to not only get yourself started on the right foot but also set yourself and your material up for success.

      In this article, we’ll give you an in-depth look at what content marketing is. Then we’ll outline 10 steps you can follow when formulating your own content marketing plan:

      1. Define Your Marketing Goals
      2. Identify Your Target Audience
      3. Run an Audit
      4. Choose a CMS
      5. Brainstorm Ideas
      6. Determine Your Content Niche
      7. Map Out Publication Roles
      8. Build a Content Calendar
      9. Create Value-Add Content
      10. Measure Your Results

      Ready to create the ideal content marketing strategy for your site? Let’s dive right in!

      An Introduction to Content Marketing

      The concept of content marketing is pretty simple. You create material — think blog posts, social media posts, videos, infographics, white papers, case studies and beyond — which provides real value to your audience. This work then acts as a means of marketing your business.

      According to the Content Marketing Institute, the key to doing this effectively is producing great content. To accomplish that, you must provide people with something that they genuinely need, that is unique, and that engages your target audience.

      Of course, before you can start developing content, you’ll need to begin with a solid strategy. This means following a few steps:

      • Creating business goals for your content marketing
      • Finding your audience
      • Knowing what will make your content unique
      • Picking a formula that works for you to create content
      • Deciding where you’ll publish the results and which channels you’ll use
      • Managing the content creation and publication process
      • Determining how you’ll track key performance metrics to measure success

      This all requires a good deal of planning, but that’s the origin story of most marketing techniques. In case you’re not convinced yet, however, let’s take a look at why your business needs a content marketing strategy.

      Why Your Business Needs a Content Marketing Strategy

      The benefits of marketing are fairly self-explanatory, but what about content marketing in particular? It’s a relatively new focus, and you may not see why making high-quality content is worth the time and effort.

      First of all, whether you’re a small or large business, it makes sense to have a website. It’s a fantastic way to find customers and raise awareness of your brand. What’s more, your website needs plenty of inbound traffic to be as effective as possible.

      Content marketing can help drive people towards your website and into your sales funnel. Plus, producing informative and quality content to feature on your site and elsewhere can increase awareness of your brand and build trust by cementing you as an expert in your field.

      Even better, you can use content marketing to establish (and grow!) relationships with your customers. Once you know who your ideal audience is, you can hone in and focus on content that benefits them. For example, if you sell stationery and office supplies, you can curate articles about office life or write tips for professionals who work from home.

      Plus, you don’t have to be an already-established mega-company to benefit from this type of marketing. Have a vegan bakery? Write about subjects vegans care about and branch out into articles about clean living. Run a dog grooming business? Produce blog posts about pet care, how to train dogs, and so on.

      When it comes down to it, most businesses can use content marketing to great effect. You just have to find the right angle, and that’s where creating a top-notch content marketing strategy comes into play.

      How to Create a Strong Content Marketing Strategy (In 10 Steps)

      First and foremost, don’t get overwhelmed by the number of steps ahead. Each one is crucial to set yourself and your business up for success, but all of them are approachable no matter what your marketing background (or lack thereof) might be. Let’s walk through the process of getting started with content marketing, one step at a time.

      Step 1: Define Your Marketing Goals

      You may have done your fair share of work on coming up with a marketing plan in the past. If so, then you might know that your first step should be to sit down and decide on your goals. After all, you have to know the “why” behind what you’re doing to see success.

      Without purpose, you may find yourself creating content that lacks coherence or doesn’t provide value to your target audience. Alternatively, you may not be able to come up with a fixed schedule that ensures new content is being pushed out regularly.

      To start making goals for your new content marketing campaign, you can ask yourself a few questions.

      • Why are you engaging in content marketing?
      • What are you going to offer to your audience or customers?
      • How will your content improve their experiences?
      • What do you want to gain from the content you’ll create?
      • How will you measure your marketing efforts?

      You may want to consider writing down your answers and bringing in other perspectives from within your company or even outside of it. These questions can help map out your focus and connect it back to the overall vision for your company. Plus, having clear goals makes it much easier to know when you’re achieving them.

      Step 2: Conduct Market Research to Identify Your Target Audience

      As you create your marketing plan, figuring out who your audience is can be just as vital as deciding on your overall goals. If you don’t know who is most likely to engage with your products or services, creating content that helps to drive conversions will likely be a challenge.

      To start your market research, it helps to first determine the demographics of your target audience. Your buyer personas should include characteristics such as your audience’s typical age range, gender, family status, education level, hobbies, interests, etc.

      Once you know the “who” you’ll be focusing on, you can then hone in on the “why” and create a “target customer profile” or “buyer persona.” In other words, you need to figure out what the needs of your target persona are and what may convince them to try your products or services.

      One valuable starting place is to reach out to past customers. You can ask them why they were interested in your business, and what “pain points” it helped to address for them. You can even ask about what makes them feel frustrated in your particular industry, and if they have any specific feedback for you.

      You can take this information and use it to determine what people in your audience are looking for and who might be searching for your business in particular. This can be an excellent blueprint to use later on when you’re coming up with content ideas.

      Step 3: Run an Audit to Determine Your Most Popular Type of Content

      Next up, it’s time to run a content audit. This involves taking a close look at the content you’ve created and shared in the past and determining what pieces have been the most popular and successful.

      This isn’t a quick process, but it’s a necessary one. Once you know what has worked well in the past, you can build on that success. You can also compare past missteps with what worked and figure out how to correct them. Otherwise, you may end up repeating the mistakes that made past content less useful.

      There’s no need to be overwhelmed, however. Completing a content audit really only requires four major steps:

      • Create a spreadsheet of all your past content (or at least a large portion of it).
      • Decide what kind of data you’ll focus on when evaluating that content (was it functional, readable, relevant, etc.).
      • Gather and record that data for each piece of content in your spreadsheet.
      • Analyze the information as a whole in order to create an action plan for future content.

      Often, the part that takes the longest is gathering all of the data in one place. However, once you have everything at hand, you can make direct comparisons, see where you encouraged high conversions and lots of click-throughs, and identify areas where you can grow. This is your best chance for setting future articles, blogs, and other material up for success.
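The four audit steps above can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed method: the column names (`title`, `type`, `views`, `conversions`), the sample rows, and the `summarize_by_type` helper are all assumptions made for the example, and the chosen metric (conversions divided by views) is just one of many you might track.

```python
import csv
import io
from collections import defaultdict

# Hypothetical audit export; in practice this would be your own
# spreadsheet saved as CSV. All rows here are made-up sample data.
AUDIT_CSV = """title,type,views,conversions
Vegan cupcake guide,blog,1200,36
Milk substitute how-to,video,3000,45
Bakery case study,case study,400,20
Clean living tips,blog,800,8
"""

def summarize_by_type(csv_text):
    """Return {content_type: conversion_rate}, aggregated over all rows."""
    views = defaultdict(int)
    conversions = defaultdict(int)
    for row in csv.DictReader(io.StringIO(csv_text)):
        views[row["type"]] += int(row["views"])
        conversions[row["type"]] += int(row["conversions"])
    # Skip types with zero views to avoid dividing by zero.
    return {t: conversions[t] / views[t] for t in views if views[t] > 0}

rates = summarize_by_type(AUDIT_CSV)

# List content types from highest to lowest conversion rate.
for content_type, rate in sorted(rates.items(), key=lambda item: item[1], reverse=True):
    print(f"{content_type}: {rate:.1%}")
```

Ranking by content type rather than by individual piece keeps the action plan focused on which formats to invest in next.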

      Step 4: Choose a Content Management System (CMS)

      If you already have a website that you’re happy with, you can skip to the next step. If not, keep in mind that your business’ site will play a pivotal role in your content marketing strategy, so it’s critical that you get a high-quality, branded website up and running now.

      The first thing you’ll need to do is select a Content Management System (CMS). This is the software that will enable you to create and display content on your website. Fortunately, most of the big CMS names are free to use and relatively easy to navigate. They also come with plugins and themes to make content creation easier and assist you in designing your site.

      Some examples of CMSs you can try include:

      • WordPress. One of the most adaptable platforms, especially if you want to host blog posts or articles and still have a storefront
      • Joomla!. A popular choice that’s fairly approachable for beginners
      • Drupal. A more advanced system for those who have a bit of website-building under their belts
      • Magento. A solid option if you want to have an online store, as it supports e-commerce websites

      Every CMS has its strengths and weaknesses, but each one makes website creation more attainable to those with limited programming knowledge. In fact, with the right CMS, you no longer need to be a computer expert (or even know how to code) to build yourself a successful website. Plus, this will enable you to fully own all of your content.

      After choosing your CMS (we recommend WordPress!), you’ll need to choose a domain name and seek out a quality hosting provider. With those elements in place, getting your site up and running is a piece of cake.

      Step 5: Brainstorm Ideas to Guide Your Future Path

      At this stage, you will likely have a rough idea of where you’ve been successful in the past and where your content might have needed more work. Now’s the time to brainstorm!

      Based on all the information you’ve gathered, especially during your content audit, you’ll want to come up with some general ideas of where you’d like to go in the future. Of course, any practical strategy should point you towards attaining the goals you set in the first step.

      When brainstorming, you may want to focus on coming up with keywords, particularly long-tail keywords, to give your content a competitive edge. If you understand which keywords are being used by your competition and by potential customers, you can use them to ensure that your content is visible in search engines.

      It’s also useful to understand the different types of search queries, so you can better optimize your content for them. For example, there are:

      • Informational search queries
      • Navigational search queries
      • Transactional search queries

      Depending on what your business’ niche is, you may rely more heavily on one or two of these searches than the others. For example, referring back to our earlier example of a fictional vegan bakery, we might focus on both transactional and informational search queries (“Where can I find a vegan cupcake?” and “Best ways to make your own vegan milk substitute”).

      Understanding these queries and which ones your audience prefers can help you with your next stage of planning. If you know what your audience is looking for, you can create content that meets those needs.

      Step 6: Determine Which Types of Content You Want to Create

      When it comes to the material you’re going to produce, you have a lot of options to choose from. To name only a few, you can try blog posts, informative articles, e-books, case studies, templates, infographics, videos, how-tos, podcasts, online courses, and various forms of social media.

      All those choices can be overwhelming. However, each avenue has its own unique benefits.

      For example, blog posts offer a way to grow your audience and attract new clients. E-books can be a means of generating profit time and again, case studies can demonstrate the proven successes of your company, customer spotlights can create social proof, and infographics are easy for visitors to consume and share.

      Yet, of all the mediums you could hone in on, video still reigns supreme online. Videos are the most popular way for most people to pass time on the internet. Fortunately, you can depend on websites such as YouTube to host your content (and you can even turn a profit from it if you like).

      Video: https://www.youtube.com/watch?v=UURvkk215lg&t=11s

      Using those pre-existing platforms can keep your website from being bogged down with heavy media files. Best of all, you can still feature those videos on your website, simply by embedding YouTube videos on your pages to save precious space.

      Once you know what kinds of content you’d like to focus on, you’ll be ready to move to the next step. Remember that variety is key, but you don’t want to overextend yourself. So you may want to choose two or three types to pursue at the beginning.

      Step 7: Map Out Publication and Management Roles

      No human is an island, and no content-creation team is complete without publication and management roles. Once you know what you’re going to create, it’s time to determine who will be responsible for which parts of the process.

      Unless you’re working alone, you’ll likely have to discuss with your team to decide who’s going to do what, including publishing and managing. To be productive, each role will need to be clearly defined. What will each role entail? Who will be accountable for responsibilities such as meeting deadlines, idea generation, editing, and more?

      When you have those basic roles sorted out, you’ll know who is in charge of the decision-making process and who is in charge of the execution. However, these positions don’t end with the content itself. You’ll also need to look at your website and decide who will do what there too.

      For instance, if you have a WordPress site, you may also plot out what you’ll allow various users to do. As the website owner, you’ll likely distribute tasks (such as writing and editing posts, controlling plugins, and managing other users) so you can keep your site orderly.

      To divvy up these duties, you can create different roles. WordPress’ basic user roles include:

      • Super Admin — Manages multiple websites on one network.
      • Administrator — Manages one site, and can do everything from deleting pages to creating posts and adding plugins.
      • Editor — Can create posts, edit pages, and moderate comments, but cannot touch the site’s infrastructure.
      • Author — Can upload files and write, edit, publish, and delete their own posts, but has less authority than an editor.
      • Contributor — Can write and manage their own posts, but cannot publish them.
      • Subscriber — Can simply read content and manage their user profile.
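To make the hierarchy of roles above concrete, here is a small Python model of it. This is only an illustrative sketch, not WordPress code: the capability names are modeled on WordPress capabilities such as `publish_posts`, but the sets below are a deliberately simplified subset, and the `can` helper is a hypothetical name introduced for this example.

```python
# Simplified, illustrative subset of capabilities per role.
# Each role builds on the one below it; this is not a complete list.
SUBSCRIBER = {"read"}
CONTRIBUTOR = SUBSCRIBER | {"edit_posts"}
AUTHOR = CONTRIBUTOR | {"publish_posts", "upload_files"}
EDITOR = AUTHOR | {"edit_others_posts", "edit_pages", "moderate_comments"}
ADMINISTRATOR = EDITOR | {"activate_plugins", "manage_options", "create_users"}

ROLE_CAPABILITIES = {
    "subscriber": SUBSCRIBER,
    "contributor": CONTRIBUTOR,
    "author": AUTHOR,
    "editor": EDITOR,
    "administrator": ADMINISTRATOR,
}

def can(role, capability):
    """Return True if the given role includes the capability (hypothetical helper)."""
    return capability in ROLE_CAPABILITIES.get(role, set())

print(can("contributor", "publish_posts"))  # contributors write but do not publish
print(can("author", "publish_posts"))
print(can("editor", "activate_plugins"))    # site infrastructure is admin-only
```

Writing the roles out this way is a quick check that every task in your workflow (drafting, publishing, moderating, managing plugins) has at least one role that covers it.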

      If you want to give your team some further guidance, there are additional tools you can use to assist with workflow management, such as:

      • Oasis Workflow, which enables you to create easy-to-use templates for assigning, reviewing, and publishing content.
      • CoSchedule, a global calendar that lets everyone view the status of each project and who’s responsible for what.
      • User Role Editor, which lets you not only assign roles but also add and block specific tasks within those roles.

      Having clear roles established from the get-go can make the whole process of content marketing smoother. You won’t have to make decisions on the fly, and people will already know what is expected of them.

      Step 8: Create a Content Calendar to Maintain Your Schedule

      The day-to-day work of managing and organizing content can become hectic and quickly overwhelming. With a content calendar, you can map out your content production and delivery, and then track each piece’s progress over days, weeks, or even months. This type of editorial calendar can help you streamline and coordinate your content marketing strategy.

      That level of coordination can be particularly advantageous for ensuring there’s a consistent voice and identity that transcends the different types of content you’re distributing. These might include blog posts, social media updates on Facebook and Twitter, and other off-site content. After all, with the overview your content calendar provides, your team will know exactly what everyone is doing.

      With that in mind, your choice of platform is up to you. For instance, you could use Microsoft Excel, Google Calendar, or Google Sheets. You could also opt for a WordPress plugin to manage your content calendar, such as Editorial Calendar or PublishPress Content Calendar and Notifications.

      Once you’ve made your decision, your next step is populating the calendar with data. That will likely include dates and topic ideas. However, it might also incorporate suggested titles for articles, relevant SEO data (such as target keywords), and any helpful notes that can benefit your team’s content creation.

      Calendars can also be used to schedule content updates and conduct audits, so you can identify older posts that are no longer encouraging conversions and click-throughs. You can even maintain individual calendars for each user or team.

      Finally, you should color-code your editorial calendar. This can be as simple as blue for blog posts, red for editorials, and green for proposed ideas. This way, no one gets confused, and your calendar is easy to understand at a glance.
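If you keep your calendar in a spreadsheet or a small script rather than a plugin, the fields described in this step might be modeled as below. Treat this as a sketch under stated assumptions: the `CalendarEntry` class, its field names, the `upcoming` helper, and the sample entries are all hypothetical, and the colors simply reuse the blue/red/green scheme suggested above.

```python
from dataclasses import dataclass, field
from datetime import date

# Color scheme from the example above: blue posts, red editorials, green ideas.
COLORS = {"blog post": "blue", "editorial": "red", "idea": "green"}

@dataclass
class CalendarEntry:
    publish_on: date
    title: str
    kind: str                                      # e.g. "blog post", "editorial", "idea"
    owner: str                                     # who is responsible for the piece
    keywords: list = field(default_factory=list)   # target SEO keywords
    notes: str = ""

    @property
    def color(self):
        # Fall back to gray for content types without an assigned color.
        return COLORS.get(self.kind, "gray")

def upcoming(entries, today):
    """Return entries scheduled on or after `today`, earliest first."""
    return sorted((e for e in entries if e.publish_on >= today),
                  key=lambda e: e.publish_on)

entries = [
    CalendarEntry(date(2024, 3, 18), "Vegan frosting basics", "blog post", "Sam",
                  ["vegan frosting recipe"]),
    CalendarEntry(date(2024, 3, 4), "Why we bake vegan", "editorial", "Alex"),
    CalendarEntry(date(2024, 4, 1), "Seasonal cupcake roundup", "idea", "Sam"),
]

for entry in upcoming(entries, today=date(2024, 3, 10)):
    print(entry.publish_on.isoformat(), entry.color, entry.title)
```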

      Step 9: Create Content That Provides Visitors With Valuable Information

      Long gone are the days when you could simply hammer out a blog post chock full of keywords and hope to find quick SEO success. In today’s world, you’re going to have to invest time and effort into each post and other pieces of content.

      That means juggling all of your new posts, repurposing or reusing old content, curating content from other sources, making use of user-generated content, and even atomization. If you haven’t heard of atomization, it involves taking well-written work and implementing it in multiple ways.

      Fortunately, there is a recipe of sorts to creating successful blog posts. This includes ingredients such as dedicating a significant amount of time to each post (on average, four hours) and adhering to your mission statement with every piece.

      You may also find it valuable to create a schedule and stick to it, thoroughly edit your work, and maintain credibility through following certain best practices. Those include proper sourcing for facts and data, following reputable citation standards, and even integrating testimonials.

      Doing these things, and sticking to who you are as a company, can assist in improving brand awareness. Other considerations to look out for when blogging include focusing on quality rather than quantity, using a web host that can keep up with your needs, and dedicating as much (if not more) time to promotion as you do to creation.

      Step 10: Measure Your Results to Improve Your Content

      Keeping track of your successes and failures can help you quickly course correct when it’s most necessary. This may help prevent you from continuing down a path of content and revenue stagnation.

      To guide your efforts in this area, you might want to look out for a few signposts when measuring your content’s performance. These include bounce rates, conversions, overall time spent on your site, and subscriber numbers.

      Fortunately, there are plenty of tools that can enable you to measure these metrics, such as Google Analytics for tracking your bounce rate. With it, you can also monitor other statistics, such as return rates, where your visitors are coming from, and more. Google Analytics is also free to use, which is an added bonus.

      However, there are many other web analytics tools you can try as well. Some, like Google’s platform, are free. Others, such as Crazy Egg, are more comprehensive and come with a price tag attached.

      It might also be a good idea to track Key Performance Indicators (KPIs). Doing so will help you answer some very pertinent questions, such as:

      • Do you have more visitors now than you did a year ago?
      • Are they staying longer on your site?
      • Have your search engine rankings improved?
      • Has there been a sales revenue increase, if applicable?
      • Have you experienced social media traffic growth?
      • Has your email (or your newsletter subscriber) list grown?

      Once you’ve analyzed your successes and shortfalls, you can then reinvest in what worked well and alter what did not. As with many marketing strategies, that’s what really can help growth take off.
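Most of the KPI questions above come down to comparing one period with another. A minimal sketch of that comparison follows; the metric names and all figures are made-up sample data, and `percent_change` is a hypothetical helper rather than part of any analytics tool.

```python
def percent_change(previous, current):
    """Return the change from previous to current as a percentage.

    Returns None when there is no baseline to compare against.
    """
    if previous == 0:
        return None
    return (current - previous) / previous * 100

# Hypothetical figures exported from your analytics tool of choice.
last_year = {"visitors": 12000, "avg_session_seconds": 95, "subscribers": 800}
this_year = {"visitors": 15000, "avg_session_seconds": 110, "subscribers": 760}

report = {name: percent_change(last_year[name], this_year[name]) for name in last_year}

for name, change in report.items():
    trend = "up" if change > 0 else "down"
    print(f"{name}: {trend} {abs(change):.1f}%")
```

A metric that moves against the others (here, the subscriber count falling while traffic grows) is usually the first place to look when deciding what to alter.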

      Digital Content Strategy Made Easy

      As you can see, it takes work to develop your content marketing plan. However, the time you invest upfront can pay off through increased conversions and lowered bounce rates.

      Are you ready to get started? By having a professional WordPress website, you can start your content marketing off on the right foot!



      How To Scrape Web Pages and Post Content to Twitter with Python 3


      The author selected The Computer History Museum to receive a donation as part of the Write for DOnations program.

      Introduction

      Twitter bots are a powerful way of managing your social media as well as extracting information from the microblogging network. By leveraging Twitter’s versatile APIs, a bot can do a lot of things: tweet, retweet, “favorite-tweet”, follow people with certain interests, reply automatically, and so on. Even though people can, and do, abuse their bot’s power, leading to a negative experience for other users, research shows that people view Twitter bots as a credible source of information. For example, a bot can keep your followers engaged with content even when you’re not online. Some bots even provide critical and helpful information, like @EarthquakesSF. The applications for bots are limitless. As of 2019, it is estimated that bots account for about 24% of all tweets on Twitter.

      In this tutorial, you’ll build a Twitter bot using a Twitter API library for Python. You’ll use API keys from your Twitter account to authorize your bot and make it capable of scraping content from two websites. Furthermore, you’ll program your bot to alternately tweet content from these two websites at set time intervals. Note that you’ll use Python 3 in this tutorial.

      Prerequisites

      You will need the following to complete this tutorial:

      Note: You’ll be setting up a developer account with Twitter, which involves an application review by Twitter before you can access the API keys you require for this bot. Step 1 walks through the specific details for completing the application.

      Step 1 — Setting Up Your Developer Account and Accessing Your Twitter API Keys

      Before you begin coding your bot, you’ll need the API keys for Twitter to recognize the requests of your bot. In this step, you’ll set up your Twitter Developer Account and access your API keys for your Twitter bot.

      To get your API keys, head over to developer.twitter.com and register your bot application with Twitter by clicking on Apply in the top right section of the page.

      Now click on Apply for a developer account.

      Next, click on Continue to associate your Twitter username with your bot application that you’ll be building in this tutorial.

      On the next page, for the purposes of this tutorial, you’ll choose the I am requesting access for my own personal use option since you’ll be building a bot for your own personal education use.

      After choosing your Account Name and Country, move on to the next section. For What use case(s) are you interested in?, pick the Publish and curate Tweets and Student project / Learning to code options. These categories are the best representation of why you’re completing this tutorial.

      Then provide a description of the bot you’re trying to build. Twitter requires this to protect against bot abuse; it introduced this vetting in 2018. For this tutorial, you’ll be scraping tech-focused content from The New Stack and The Coursera Blog.

      When deciding what to enter into the description box, model your answer on the following lines for the purposes of this tutorial:

      I’m following a tutorial to build a Twitter bot that will scrape content from websites like thenewstack.io (The New Stack) and blog.coursera.org (Coursera’s Blog) and tweet quotes from them. The scraped content will be aggregated and will be tweeted in a round-robin fashion via Python generator functions.

      Finally, choose no for Will your product, service, or analysis make Twitter content or derived information available to a government entity?

      Next, accept Twitter’s terms and conditions, click on Submit application, and then verify your email address. Twitter will send a verification email to you after your submission of this form.

      Once you verify your email, you’ll get an Application under review page with a feedback form for the application process.

      You will also receive another email from Twitter regarding the review:

      The timeline for Twitter’s application review process can vary significantly, but Twitter often confirms approval within a few minutes. If the review takes longer, that is not unusual, and you should receive confirmation within a day or two. Once you receive confirmation, Twitter has authorized you to generate your keys. You can access these under the Keys and tokens tab after clicking the details button of your app on developer.twitter.com/apps.

      Finally, go to the Permissions tab on your app’s page and set the Access Permission option to Read and Write, since you want to write tweet content too. Usually, you would use the read-only mode for research purposes like analyzing trends, data-mining, and so on. The final option allows users to integrate chatbots into their existing apps, since chatbots require access to direct messages.

      You now have access to Twitter’s powerful API, which will be a crucial part of your bot application. Next, you’ll set up your environment and begin building your bot.

      Step 2 — Building the Essentials

      In this step, you’ll write code to authenticate your bot with Twitter using the API keys, and make the first programmatic tweet via your Twitter handle. This will serve as a good milestone in your path towards the goal of building a Twitter bot that scrapes content from The New Stack and the Coursera Blog and tweets them periodically.

      First, you’ll set up a project folder and a specific programming environment for your project.

      Create your project folder:

      • mkdir bird

      Move into your project folder:

      • cd bird

      Then create a new Python virtual environment for your project:

      • python3 -m venv bird-env

      Then activate your environment using the following command:

      • source bird-env/bin/activate

      This will attach a (bird-env) prefix to the prompt in your terminal window.

      Now move to your text editor and create a file called credentials.py, which will store your Twitter API keys:

      Add the following content, replacing the highlighted code with your keys from Twitter:

      bird/credentials.py

      
      ACCESS_TOKEN='your-access-token'
      ACCESS_SECRET='your-access-secret'
      CONSUMER_KEY='your-consumer-key'
      CONSUMER_SECRET='your-consumer-secret'
      

      Now, you'll install the main API library for sending requests to Twitter. For this project, you'll require the following libraries: nltk, requests, twitter, lxml, random, and time. random and time are part of Python's standard library, so you don't need to separately install these libraries. To install the remaining libraries, you'll use pip, a package manager for Python.

      Open your terminal, ensure you're in the project folder, and run the following command:

      • pip3 install lxml nltk requests twitter

      The libraries serve the following purposes:

      • lxml and requests: You will use them for web scraping.
      • twitter: This is the library for making API calls to Twitter's servers.
      • nltk: (natural language toolkit) You will use this to split paragraphs of blogs into sentences.
      • random: You will use this to randomly select parts of an entire scraped blog post.
      • time: You will use this to make your bot sleep periodically after certain actions.
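To illustrate how random and time fit the bot's goal of alternating between two sources at set intervals, here is a minimal, standard-library-only sketch. It is not the tutorial's bot code: `pick_sentence`, `alternate`, the 140-character cut-off, the source names, and the sample sentences are all hypothetical and chosen only for the example.

```python
import itertools
import random
import time

def pick_sentence(sentences, max_length=140):
    """Randomly pick one sentence short enough to post, or None if none fit."""
    candidates = [s for s in sentences if len(s) <= max_length]
    return random.choice(candidates) if candidates else None

def alternate(sources, pause_seconds=0):
    """Yield (source_name, sentence) pairs, cycling through the sources in turn."""
    for name in itertools.cycle(sources):
        yield name, pick_sentence(sources[name])
        # Pause before moving on to the next source.
        time.sleep(pause_seconds)

# Made-up stand-ins for sentences scraped from two websites.
SOURCES = {
    "site_a": ["Short sentence from site A.", "Another one from site A."],
    "site_b": ["Short sentence from site B."],
}

# Take only the first four items; the generator itself never ends.
for name, text in itertools.islice(alternate(SOURCES), 4):
    print(name, "->", text)
```

A generator suits this job because it remembers whose turn is next between pauses, so the calling loop stays a plain `for` statement.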

      Once you have installed the libraries, you're all set to begin programming. Now, you'll import your credentials into the main script that will run the bot. Alongside credentials.py, from your text editor create a file in the bird project directory, and name it bot.py.

      In practice, you would spread the functionality of your bot across multiple files as it grows more and more sophisticated. However, in this tutorial, you'll put all of your code in a single script, bot.py, for demonstration purposes.

      First you'll test your API keys by authorizing your bot. Begin by adding the following snippet to bot.py:

      bird/bot.py

      import random
      import time
      
      from lxml.html import fromstring
      import nltk
      nltk.download('punkt')
      import requests
      from twitter import OAuth, Twitter
      
      import credentials
      

      Here, you import the required libraries; and in a couple of instances you import the necessary functions from the libraries. You will use the fromstring function later in the code to convert the string source of a scraped webpage to a tree structure that makes it easier to extract relevant information from the page. OAuth will help you in constructing an authentication object from your keys, and Twitter will build the main API object for all further communication with Twitter's servers.
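As a quick illustration of what fromstring does, the sketch below parses a small HTML string into a tree and queries it with XPath. The HTML snippet, the class name `title`, and the XPath expressions are made up for this example and do not correspond to any real website's markup.

```python
from lxml.html import fromstring

# Hypothetical HTML standing in for the source of a scraped page.
SAMPLE_HTML = """
<html><body>
  <h2 class="title"><a href="/post-1">First post</a></h2>
  <h2 class="title"><a href="/post-2">Second post</a></h2>
</body></html>
"""

# Convert the string into an element tree.
tree = fromstring(SAMPLE_HTML)

# XPath queries return plain lists of matching text nodes or attribute values.
titles = tree.xpath('//h2[@class="title"]/a/text()')
links = tree.xpath('//h2[@class="title"]/a/@href')

print(titles)
print(links)
```

In a real scraper, the string passed to fromstring would come from the body of an HTTP response instead of a literal.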

      Now extend bot.py with the following lines:

      bird/bot.py

      ...
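      # Load the pre-trained Punkt model for splitting English text into sentences.
      tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
      
      # Build the authentication object from the keys stored in credentials.py.
      oauth = OAuth(
          credentials.ACCESS_TOKEN,
          credentials.ACCESS_SECRET,
          credentials.CONSUMER_KEY,
          credentials.CONSUMER_SECRET
      )
      # Create the API object used for all further requests to Twitter.
      t = Twitter(auth=oauth)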
      

      nltk.download('punkt') downloads a dataset necessary for parsing paragraphs and tokenizing (splitting) them into smaller components. tokenizer is the object you'll use later in the code for splitting paragraphs written in English.

      oauth is the authentication object constructed by passing your API keys to the imported OAuth class. You authenticate your bot via the line t = Twitter(auth=oauth). CONSUMER_KEY and CONSUMER_SECRET identify your application, while ACCESS_TOKEN and ACCESS_SECRET identify the handle (account) via which the application interacts with Twitter. You'll use this t object to communicate your requests to Twitter.
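
      Before constructing the OAuth object, it can help to confirm that credentials.py defines all four values. The following is a minimal sketch of such a check; the missing_credentials() helper and the fake_credentials stand-in object are hypothetical names used for illustration and are not part of the tutorial's bot.py.

```python
from types import SimpleNamespace

# The four names that bot.py reads from credentials.py.
REQUIRED = ('ACCESS_TOKEN', 'ACCESS_SECRET', 'CONSUMER_KEY', 'CONSUMER_SECRET')


def missing_credentials(creds):
    """Return the names in REQUIRED that are absent or empty on creds."""
    return [name for name in REQUIRED if not getattr(creds, name, None)]


# Stand-in for the imported credentials module (hypothetical values).
fake_credentials = SimpleNamespace(
    ACCESS_TOKEN='token',
    ACCESS_SECRET='secret',
    CONSUMER_KEY='key',
    CONSUMER_SECRET='',  # left empty on purpose
)

print(missing_credentials(fake_credentials))
```

      In a real script you would pass the imported credentials module itself to the helper and stop early if the returned list is not empty.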

      Now save this file and run it in your terminal using the following command:

      Your output will look similar to the following, which means your authorization was successful:

      Output

      [nltk_data] Downloading package punkt to /Users/binaryboy/nltk_data...
      [nltk_data]   Package punkt is already up-to-date!

      If you do receive an error, verify your saved API keys with those in your Twitter developer account and try again. Also ensure that the required libraries are installed correctly. If not, use pip3 again to install them.

      Now you can try tweeting something programmatically. Type the same command on the terminal with the -i flag to open the Python interpreter after the execution of your script:

      Next, type the following to send a tweet via your account:

      • t.statuses.update(status="Just setting up my Twttr bot")

      Now open your Twitter timeline in a browser, and you'll see a tweet at the top of your timeline containing the content you posted.

      First Programmatic Tweet

      Close the interpreter by typing quit() or CTRL + D.

      Your bot now has the fundamental capability to tweet. To develop your bot to tweet useful content, you'll incorporate web scraping in the next step.

      Step 3 — Scraping Websites for Your Tweet Content

      To introduce some more interesting content to your timeline, you'll scrape content from the New Stack and the Coursera Blog, and then post this content to Twitter in the form of tweets. Generally, to scrape the appropriate data from your target websites, you have to experiment with their HTML structure. Each tweet coming from the bot you'll build in this tutorial will have a link to a blog post from the chosen websites, along with a random quote from that blog. You'll implement this procedure within a function specific to scraping content from Coursera, so you'll name it scrape_coursera().

      First open bot.py:

      Add the scrape_coursera() function to the end of your file:

      bird/bot.py

      ...
      t = Twitter(auth=oauth)
      
      
      def scrape_coursera():
      

      To scrape information from the blog, you'll first request the relevant webpage from Coursera's servers. For that you will use the get() function from the requests library. get() takes in a URL and fetches the corresponding webpage. So, you'll pass blog.coursera.org as an argument to get(). But you also need to provide a header in your GET request, which will ensure Coursera's servers recognize you as a genuine client. Add the following highlighted lines to your scrape_coursera() function to provide a header:

      bird/bot.py

      def scrape_coursera():
          HEADERS = {
              'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5)'
                            ' AppleWebKit/537.36 (KHTML, like Gecko) Safari/537.36'
              }
      

      This header will contain information pertaining to a defined web browser running on a specific operating system. As long as this information (usually referred to as User-Agent) corresponds to real web browsers and operating systems, it doesn't matter whether the header information aligns with the actual web browser and operating system on your computer. Therefore this header will work fine for all systems.
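
      To see what the header looks like once it is attached to a request, here is a small sketch that builds a request object without sending anything over the network. It uses the standard library's urllib.request as a stand-in for the requests library used in this tutorial, so the header handling shown is urllib's, not requests'.

```python
from urllib.request import Request

HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5)'
                  ' AppleWebKit/537.36 (KHTML, like Gecko) Safari/537.36'
}

# Building a Request object does not contact the server; it only
# assembles what would be sent.
req = Request('https://blog.coursera.org', headers=HEADERS)

# urllib stores header names in capitalized form ("User-agent").
user_agent = req.get_header('User-agent')
print(user_agent)
```

      Note that the two adjacent string literals in HEADERS are joined by Python into a single value, so the server receives one continuous User-Agent string.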

      Once you have defined the headers, add the following highlighted lines to make a GET request to Coursera by specifying the URL of the blog webpage:

      bird/bot.py

      ...
      def scrape_coursera():
          HEADERS = {
              'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5)'
                            ' AppleWebKit/537.36 (KHTML, like Gecko) Safari/537.36'
              }
          r = requests.get('https://blog.coursera.org', headers=HEADERS)
          tree = fromstring(r.content)
      

      This will fetch the webpage to your machine and save the response in the variable r. You can access the HTML source code of the webpage using the content attribute of r. Therefore, the value of r.content is the same as what you see when you inspect the webpage in your browser by right-clicking on the page and choosing the Inspect Element option.

      Here you've also added the fromstring function. You can pass the webpage's source code to the fromstring function imported from the lxml library to construct the tree structure of the webpage. This tree structure will allow you to conveniently access different parts of the webpage. HTML source code has a particular tree-like structure; every element is enclosed in the <html> tag and nested thereafter.

      Now, open https://blog.coursera.org in a browser and inspect its HTML source using the browser's developer tools. Right click on the page and choose the Inspect Element option. You'll see a window appear at the bottom of the browser, showing part of the page's HTML source code.

      browser-inspect

      Next, right click on the thumbnail of any visible blog post and then inspect it. The HTML source will highlight the relevant HTML lines where that blog thumbnail is defined. You'll notice that all blog posts on this page are defined within a <div> tag with a class of "recent":

      blog-div

      Thus, in your code, you'll select all such blog post div elements via their XPath, which is a convenient way of addressing elements of a web page.

      To do so, extend your function in bot.py as follows:

      bird/bot.py

      ...
      def scrape_coursera():
          HEADERS = {
              'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5)'
                            ' AppleWebKit/537.36 (KHTML, like Gecko) Safari/537.36'
                          }
          r = requests.get('https://blog.coursera.org', headers=HEADERS)
          tree = fromstring(r.content)
          links = tree.xpath('//div[@class="recent"]//div[@class="title"]/a/@href')
          print(links)
      
      scrape_coursera()
      

      Here, the XPath (the string passed to tree.xpath()) communicates that you want div elements from the entire web page source, of class "recent". The // corresponds to searching the whole webpage, div tells the function to extract only the div elements, and [@class="recent"] asks it to only extract those div elements that have the values of their class attribute as "recent".

      However, you don't need these elements themselves; you only need the links they point to, so that you can access the individual blog posts to scrape their content. Therefore, you extract all the links using the values of the href attributes of the anchor tags that are within the previous div tags of the blog posts.
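
      The selection logic described above can be tried offline on a small, made-up page. The sketch below uses the standard library's xml.etree.ElementTree as a stand-in for lxml; it supports only a subset of XPath and requires well-formed markup, so it illustrates the path expression rather than replacing lxml in bot.py. The example.com URLs and the markup are invented for the demonstration.

```python
import xml.etree.ElementTree as ET

# A tiny, well-formed stand-in for the blog index page (made-up markup).
PAGE = """
<html><body>
  <div class="recent">
    <div class="title"><a href="https://example.com/post-1">Post 1</a></div>
    <div class="title"><a href="https://example.com/post-2">Post 2</a></div>
  </div>
  <div class="footer">
    <div class="title"><a href="https://example.com/about">About</a></div>
  </div>
</body></html>
"""

root = ET.fromstring(PAGE)

# ElementTree cannot select attributes with /@href, so the anchors are
# selected first and their href values are read afterwards.
anchors = root.findall(".//div[@class='recent']//div[@class='title']/a")
links = [a.get('href') for a in anchors]
print(links)
```

      Only the two links inside the div of class "recent" are collected; the link in the footer div is skipped because it does not match the first predicate.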

      To test your program so far, you call the scrape_coursera() function at the end of bot.py.

      Save and exit bot.py.

      Now run bot.py with the following command:

      In your output, you'll see a list of URLs like the following:

      Output

      ['https://blog.coursera.org/career-stories-from-inside-coursera/', 'https://blog.coursera.org/unlock-the-power-of-data-with-python-university-of-michigan-offers-new-programming-specializations-on-coursera/', ...]

      After you verify the output, you can remove the last two highlighted lines from bot.py script:

      bird/bot.py

      ...
      def scrape_coursera():
          ...
          tree = fromstring(r.content)
          links = tree.xpath('//div[@class="recent"]//div[@class="title"]/a/@href')
          # print(links)  (removed)
      
      # scrape_coursera()  (removed)
      

      Now extend the function in bot.py with the following highlighted line to extract the content from a blog post:

      bird/bot.py

      ...
      def scrape_coursera():
          ...
          links = tree.xpath('//div[@class="recent"]//div[@class="title"]/a/@href')
          for link in links:
              r = requests.get(link, headers=HEADERS)
              blog_tree = fromstring(r.content)
      

      You iterate over each link, fetch the corresponding blog post, extract a random sentence from the post, and then tweet this sentence as a quote, along with the corresponding URL. Extracting a random sentence involves three parts:

      1. Grabbing all the paragraphs in the blog post as a list.
      2. Selecting a paragraph at random from the list of paragraphs.
      3. Selecting a sentence at random from this paragraph.

      You'll execute these steps for each blog post. For fetching one, you make a GET request for its link.

      Now that you have access to the content of a blog, you will introduce the code that executes these three steps to extract the content you want from it. Add the following extension to your scraping function that executes the three steps:

      bird/bot.py

      ...
      def scrape_coursera():
          ...
          for link in links:
              r = requests.get(link, headers=HEADERS)
              blog_tree = fromstring(r.content)
              paras = blog_tree.xpath('//div[@class="entry-content"]/p')
              paras_text = [para.text_content() for para in paras if para.text_content()]
              para = random.choice(paras_text)
              para_tokenized = tokenizer.tokenize(para)
              for _ in range(10):
                  text = random.choice(para_tokenized)
                  if text and 60 < len(text) < 210:
                      break
      

      If you inspect the blog post by opening the first link, you'll notice that all the paragraphs belong to the div tag having entry-content as its class. Therefore, you extract all paragraphs as a list with paras = blog_tree.xpath('//div[@class="entry-content"]/p').

      Div Enclosing Paragraphs

      The list elements aren't literal paragraphs; they are Element objects. To extract the text out of these objects, you use the text_content() method. This line follows Python's list comprehension design pattern, which defines a collection using a loop that is usually written out in a single line. In bot.py, you extract the text for each paragraph element object and store it in a list if the text is not empty. To randomly choose a paragraph from this list of paragraphs, you incorporate the random module.

      Finally, you have to select a sentence at random from this paragraph, which is stored in the variable para. For this task, you first break the paragraph into sentences. One approach is to use Python's split() method. However, this can be difficult because a period or other delimiter does not always mark the end of a sentence. Therefore, to simplify your splitting tasks, you leverage natural language processing through the nltk library. The tokenizer object you defined earlier in the tutorial will be useful for this purpose.
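
      To illustrate why a plain split() falls short, consider the following sketch. The sample paragraph is made up for the demonstration.

```python
# Made-up paragraph containing an abbreviation.
para = "Dr. Smith wrote the post in 2019. It covers Python."

# Naive approach: treat every ". " as the end of a sentence.
naive_sentences = para.split('. ')
print(naive_sentences)
# → ['Dr', 'Smith wrote the post in 2019', 'It covers Python.']
```

      The abbreviation "Dr." is wrongly treated as a sentence boundary, and the periods are stripped from all but the last piece. A trained sentence tokenizer is designed to handle such cases.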

      Now that you have a list of sentences, you call random.choice() to extract a random sentence. You want this sentence to be a quote for a tweet, so it can't exceed 280 characters. However, for aesthetic reasons, you'll select a sentence that is neither too long nor too short. You designate that your tweet sentence should have a length between 60 and 210 characters. The sentence random.choice() picks might not satisfy this criterion. To identify the right sentence, your script will make ten attempts, checking for the criterion each time. Once the randomly picked sentence satisfies your criterion, you can break out of the loop.

      Although the probability is quite low, it is possible that none of the sentences meet this size condition within ten attempts. In this case, you'll ignore the corresponding blog post and move on to the next one.

      Now that you have a sentence to quote, you can tweet it with the corresponding link. You can do this by yielding a string that contains the randomly picked-up sentence as well as the corresponding blog link. The code that calls this scrape_coursera() function will then post the yielded string to Twitter via Twitter's API.

      Extend your function as follows:

      bird/bot.py

      ...
      def scrape_coursera():
          ...
          for link in links:
              ...
              para_tokenized = tokenizer.tokenize(para)
              for _ in range(10):
                  text = random.choice(para_tokenized)
                  if text and 60 < len(text) < 210:
                      break
              else:
                  yield None
                  continue  # skip this post; no suitable sentence was found
              yield '"%s" %s' % (text, link)
      

      The script only executes the else clause when the preceding for loop finishes without hitting break. Thus, it only happens when the loop is not able to find a sentence that fits your size condition. In that case, you simply yield None so that the code that calls this function is able to determine that there is nothing to tweet. The caller can then ask the function for its next item and get the content for the next blog link. But if the loop does break, it means the function has found an appropriate sentence; the script will not execute the else clause, and the function will yield a string composed of the sentence as well as the blog link, separated by a single space.
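
      The for/else behavior described above can be seen in isolation in the following sketch. The first_fitting() function is a hypothetical helper written for this illustration; it checks candidates in order rather than at random so that the result is predictable.

```python
def first_fitting(candidates, low=60, high=210):
    """Return the first candidate whose length is between low and high."""
    for text in candidates:
        if low < len(text) < high:
            break
    else:
        # Reached only when the loop finished without hitting break.
        return None
    return text


print(first_fitting(['too short', 'x' * 100]) == 'x' * 100)
print(first_fitting(['too short', 'also short']))
```

      The first call finds a fitting candidate and breaks, so the else clause is skipped. In the second call, the loop runs to completion and the else clause returns None.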

      The implementation of the scrape_coursera() function is almost complete. If you want to make a similar function to scrape another website, you will have to repeat some of the code you've written for scraping Coursera's blog. To avoid rewriting and duplicating parts of the code and to ensure your bot's script follows the DRY principle (Don't Repeat Yourself), you'll identify and abstract out parts of the code that you will use again and again for any scraper function written later.

      Regardless of the website the function is scraping, you'll have to randomly pick up a paragraph and then choose a random sentence from this chosen paragraph — you can extract out these functionalities in separate functions. Then you can simply call these functions from your scraper functions and achieve the desired result. You can also define HEADERS outside the scrape_coursera() function so that all of the scraper functions can use it. Therefore, in the code that follows, the HEADERS definition should precede that of the scraper function, so that eventually you're able to use it for other scrapers:

      bird/bot.py

      ...
      HEADERS = {
          'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5)'
                        ' AppleWebKit/537.36 (KHTML, like Gecko) Safari/537.36'
          }
      
      
      def scrape_coursera():
          r = requests.get('https://blog.coursera.org', headers=HEADERS)
          ...
      

      Now you can define the extract_paratext() function for extracting a random paragraph from a list of paragraph objects. The list of paragraph objects is passed to the function as the paras argument, and the function returns the chosen paragraph's tokenized form (a list of sentences) that you'll use later for sentence extraction:

      bird/bot.py

      ...
      HEADERS = {
              'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5)'
                            ' AppleWebKit/537.36 (KHTML, like Gecko) Safari/537.36'
              }
      
      def extract_paratext(paras):
          """Extracts text from <p> elements and returns a clean, tokenized random
          paragraph."""
      
          paras = [para.text_content() for para in paras if para.text_content()]
          para = random.choice(paras)
          return tokenizer.tokenize(para)
      
      
      def scrape_coursera():
          r = requests.get('https://blog.coursera.org', headers=HEADERS)
          ...
      

      Next, you will define a function that extracts a random sentence of suitable length (between 60 and 210 characters) from the tokenized paragraph it receives as an argument, which you can name para. If no such sentence is found within ten attempts, the function returns None instead. Add the following highlighted code to define the extract_text() function:

      bird/bot.py

      ...
      
      def extract_paratext(paras):
          ...
          return tokenizer.tokenize(para)
      
      
      def extract_text(para):
          """Returns a sufficiently-large random text from a tokenized paragraph,
          if such text exists. Otherwise, returns None."""
      
          for _ in range(10):
              text = random.choice(para)
              if text and 60 < len(text) < 210:
                  return text
      
          return None
      
      
      def scrape_coursera():
          r = requests.get('https://blog.coursera.org', headers=HEADERS)
          ...
      

      Once you have defined these new helper functions, you can redefine the scrape_coursera() function to look as follows:

      bird/bot.py

      ...
      def extract_text(para):
          for _ in range(10):
              text = random.choice(para)
          ...
      
      
      def scrape_coursera():
          """Scrapes content from the Coursera blog."""
      
          url = 'https://blog.coursera.org'
          r = requests.get(url, headers=HEADERS)
          tree = fromstring(r.content)
          links = tree.xpath('//div[@class="recent"]//div[@class="title"]/a/@href')
      
          for link in links:
              r = requests.get(link, headers=HEADERS)
              blog_tree = fromstring(r.content)
              paras = blog_tree.xpath('//div[@class="entry-content"]/p')
              para = extract_paratext(paras)
              text = extract_text(para)
              if not text:
                  continue
      
              yield '"%s" %s' % (text, link)
      

      Save and exit bot.py.

      Here you're using yield instead of return so that, while iterating over the links, the scraper function gives you the tweet strings one by one. This means that when you make the first next() call on the scraper sc, defined as sc = scrape_coursera(), you will get the tweet string corresponding to the first link in the list of links computed within the scraper function. If the links variable within scrape_coursera() holds a list that looks like ["https://blog.coursera.org/career-stories-from-inside-coursera/", "https://blog.coursera.org/unlock-the-power-of-data-with-python-university-of-michigan-offers-new-programming-specializations-on-coursera/", ...], running the following code in the interpreter will give you strings similar to string_1 and string_2 as displayed below. The quoted sentences are picked at random, so yours may differ.

      Instantiate the scraper and call it sc:

      >>> sc = scrape_coursera()
      

      It is now a generator; it generates or scrapes relevant content from Coursera, one at a time. You can access the scraped content one-by-one by calling next() over sc sequentially:

      >>> string_1 = next(sc)
      >>> string_2 = next(sc)
      

      Now you can print the strings you've defined to display the scraped content:

      >>> print(string_1)
      "Take the first step toward your career goals by building new skills." https://blog.coursera.org/career-stories-from-inside-coursera/
      >>>
      >>> print(string_2)
      "You can learn how to use the power of Python for data analysis with a series of courses covering fundamental theory and project-based learning." https://blog.coursera.org/unlock-the-power-of-data-with-python-university-of-michigan-offers-new-programming-specializations-on-coursera/
      >>>
      

      If you use return instead, you will not be able to obtain the strings one-by-one and in a sequence. If you simply replace the yield with return in scrape_coursera(), you'll always get the string corresponding to the first blog post, instead of getting the first one in the first call, second one in the second call, and so on. You can modify the function to simply return a list of all the strings corresponding to all the links, but that is more memory intensive. Also, this kind of program could potentially make a lot of requests to Coursera's servers within a short span of time if you want the entire list quickly. This could result in your bot getting temporarily banned from accessing a website. Therefore, yield is the best fit for a wide variety of scraping jobs, where you only need information scraped one-at-a-time.
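
      The on-demand behavior of yield can be observed without touching the network. In the sketch below, fake_scraper() and the fetched list are hypothetical stand-ins: appending to fetched takes the place of the requests.get() call, so you can see exactly how much work has been done at each point.

```python
fetched = []  # records which links have been "requested" so far


def fake_scraper(links):
    """Yield one tweet string per link, doing the work only on demand."""
    for link in links:
        fetched.append(link)  # stands in for requests.get(link)
        yield '"a quote" %s' % link


sc = fake_scraper(['link-1', 'link-2', 'link-3'])
print(fetched)   # nothing has run yet
first = next(sc)
print(first)
print(fetched)   # only the first link has been touched
```

      Calling fake_scraper() alone does no work; each next() call processes exactly one link. A version that returned a full list would have had to process all three links up front.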

      Step 4 — Scraping Additional Content

      In this step, you'll build a scraper for thenewstack.io. The process is similar to what you've completed in the previous step, so this will be a quick overview.

      Open the website in your browser and inspect the page source. You'll find here that all blog sections are div elements of class normalstory-box.

      HTML Source Inspection of The New Stack website

      Now you'll make a new scraper function named scrape_thenewstack() and make a GET request to thenewstack.io from within it. Next, extract the links to the blogs from these elements and then iterate over each link. Add the following code to achieve this:

      bird/bot.py

      ...
      def scrape_coursera():
          ...
          yield '"%s" %s' % (text, link)
      
      
      def scrape_thenewstack():
          """Scrapes news from thenewstack.io"""
      
          r = requests.get('https://thenewstack.io', verify=False)
      
          tree = fromstring(r.content)
          links = tree.xpath('//div[@class="normalstory-box"]/header/h2/a/@href')
          for link in links:
              ...
      

      You use the verify=False flag because websites can sometimes have expired security certificates and it's OK to access them if no sensitive data is involved, as is the case here. The verify=False flag tells the requests.get method to not verify the certificates and continue fetching data as usual. Otherwise, the method throws an error about expired security certificates.

      You can now extract the paragraphs of the blog corresponding to each link, and use the extract_paratext() function you built in the previous step to pull out a random paragraph from the list of available paragraphs. Finally, extract a random sentence from this paragraph using the extract_text() function, and then yield it with the corresponding blog link. Add the following highlighted code to your file to accomplish these tasks:

      bird/bot.py

      ...
      def scrape_thenewstack():
          ...
          links = tree.xpath('//div[@class="normalstory-box"]/header/h2/a/@href')
      
          for link in links:
              r = requests.get(link, verify=False)
              tree = fromstring(r.content)
              paras = tree.xpath('//div[@class="post-content"]/p')
              para = extract_paratext(paras)
              text = extract_text(para)  
              if not text:
                  continue
      
              yield '"%s" %s' % (text, link)
      

      You now have an idea of what a scraping process generally encompasses. You can now build your own, custom scrapers that can, for example, scrape the images in blog posts instead of random quotes. For that, you can look for the relevant <img> tags. Once you have the right path for tags, which serve as their identifiers, you can access the information within tags using the names of corresponding attributes. For example, in the case of scraping images, you can access the links of images using their src attributes.
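
      As a rough illustration of the image-scraping idea, the following sketch collects src attributes from a made-up HTML fragment. It uses the standard library's html.parser instead of lxml so that it runs without extra dependencies; the ImageCollector class and the sample markup are hypothetical. With lxml, an XPath ending in /@src, analogous to the /@href paths used earlier, would likely serve the same purpose.

```python
from html.parser import HTMLParser


class ImageCollector(HTMLParser):
    """Collects the src attribute of every <img> tag it encounters."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == 'img':
            src = dict(attrs).get('src')
            if src:
                self.sources.append(src)


# Made-up markup standing in for a scraped blog post.
POST = ('<div class="post-content"><p>Hi</p>'
        '<img src="/a.png"><img alt="no src"></div>')

collector = ImageCollector()
collector.feed(POST)
print(collector.sources)
```

      The second image has no src attribute, so it is skipped rather than added as an empty entry.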

      At this point, you've built two scraper functions for scraping content from two different websites, and you've also built two helper functions to reuse functionalities that are common across the two scrapers. Now that your bot knows how to tweet and what to tweet, you'll write the code to tweet the scraped content.

      Step 5 — Tweeting the Scraped Content

      In this step, you'll extend the bot to scrape content from the two websites and tweet it via your Twitter account. More precisely, you want it to tweet content from the two websites alternately, and at regular intervals of ten minutes, for an indefinite period of time. Thus, you will use an infinite while loop to implement the desired functionality. You'll do this as part of a main() function, which will implement the core high-level process that you'll want your bot to follow:

      bird/bot.py

      ...
      def scrape_thenewstack():
          ...
          yield '"%s" %s' % (text, link)
      
      
      def main():
          """Encompasses the main loop of the bot."""
          print('---Bot started---\n')
          news_funcs = ['scrape_coursera', 'scrape_thenewstack']
          news_iterators = []  
          for func in news_funcs:
              news_iterators.append(globals()[func]())
          while True:
              for i, iterator in enumerate(news_iterators):
                  try:
                      tweet = next(iterator)
                      t.statuses.update(status=tweet)
                      print(tweet, end='\n\n')
                      time.sleep(600)
                  except StopIteration:
                      news_iterators[i] = globals()[news_funcs[i]]()
      

      You first create a list of the names of the scraping functions you defined earlier, and call it news_funcs. Then you create an empty list named news_iterators that will hold the iterators produced by those functions. You populate it by going through each name in news_funcs, looking up the corresponding function with Python's built-in globals() function, calling it, and appending the resulting iterator to news_iterators. globals() returns a dictionary that maps the names defined at the top level of your script to the objects they refer to.

      An iterator is what you get when you call a scraper function: for example, if you write coursera_iterator = scrape_coursera(), then coursera_iterator will be an iterator on which you can invoke next() calls. Each next() call will return a string containing a quote and its corresponding link, exactly as defined in the scrape_coursera() function's yield statement. Each next() call resumes the for loop in the scrape_coursera() function and runs it until the next yield statement. Thus, you can make at most as many next() calls as there are blog links in the scrape_coursera() function. Once the links are exhausted, a StopIteration exception will be raised.

      Once both the iterators populate the news_iterators list, the main while loop starts. Within it, you have a for loop that goes through each iterator and tries to obtain the content to be tweeted. After obtaining the content, your bot tweets it and then sleeps for ten minutes. If the iterator has no more content to offer, a StopIteration exception is raised, upon which you refresh that iterator by re-instantiating it, to check for the availability of newer content on the source website. Then you move on to the next iterator, if available. Otherwise, if execution reaches the end of the iterators list, you restart from the beginning and tweet the next available content. This makes your bot tweet content alternately from the two scrapers for as long as you want.
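
      The refresh-on-exhaustion logic can be tried with stand-in scrapers. The sketch below is a variation on the approach described above: it stores the function objects themselves in a list rather than looking up their names via globals(), and it runs three passes instead of an infinite loop. The scrape_a() and scrape_b() functions and the posted list are hypothetical; appending to posted takes the place of tweeting and sleeping.

```python
def scrape_a():
    """Stand-in scraper that has two items per pass."""
    yield 'a1'
    yield 'a2'


def scrape_b():
    """Stand-in scraper that has one item per pass."""
    yield 'b1'


news_funcs = [scrape_a, scrape_b]
news_iterators = [func() for func in news_funcs]

posted = []
for _ in range(3):  # three passes instead of "while True"
    for i, iterator in enumerate(news_iterators):
        try:
            posted.append(next(iterator))  # stands in for tweeting
        except StopIteration:
            # Exhausted: start the scraper over for the next pass.
            news_iterators[i] = news_funcs[i]()

print(posted)
```

      On the second pass, scrape_b() is exhausted and replaced with a fresh iterator, which then supplies 'b1' again on the third pass. Nothing is posted for a scraper on the pass in which it is refreshed.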

      All that remains now is to make a call to the main() function. You do this when the script is called directly by the Python interpreter:

      bird/bot.py

      ...
      def main():
          print('---Bot started---\n')
          news_funcs = ['scrape_coursera', 'scrape_thenewstack']
          ...
      
      if __name__ == "__main__":  
          main()
      

      The following is a completed version of the bot.py script. You can also view the script on this GitHub repository.

      bird/bot.py

      
      """Main bot script - bot.py
      For the DigitalOcean Tutorial.
      """
      
      
      import random
      import time
      
      
      from lxml.html import fromstring
      import nltk  
      nltk.download('punkt')
      import requests  
      
      from twitter import OAuth, Twitter
      
      
      import credentials
      
      tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
      
      oauth = OAuth(
              credentials.ACCESS_TOKEN,
              credentials.ACCESS_SECRET,
              credentials.CONSUMER_KEY,
              credentials.CONSUMER_SECRET
          )
      t = Twitter(auth=oauth)
      
      HEADERS = {
              'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5)'
                            ' AppleWebKit/537.36 (KHTML, like Gecko) Safari/537.36'
              }
      
      
      def extract_paratext(paras):
          """Extracts text from <p> elements and returns a clean, tokenized random
          paragraph."""
      
          paras = [para.text_content() for para in paras if para.text_content()]
          para = random.choice(paras)
          return tokenizer.tokenize(para)
      
      
      def extract_text(para):
          """Returns a sufficiently-large random text from a tokenized paragraph,
          if such text exists. Otherwise, returns None."""
      
          for _ in range(10):
              text = random.choice(para)
              if text and 60 < len(text) < 210:
                  return text
      
          return None
      
      
      def scrape_coursera():
          """Scrapes content from the Coursera blog."""
          url = 'https://blog.coursera.org'
          r = requests.get(url, headers=HEADERS)
          tree = fromstring(r.content)
          links = tree.xpath('//div[@class="recent"]//div[@class="title"]/a/@href')
      
          for link in links:
              r = requests.get(link, headers=HEADERS)
              blog_tree = fromstring(r.content)
              paras = blog_tree.xpath('//div[@class="entry-content"]/p')
              para = extract_paratext(paras)  
              text = extract_text(para)  
              if not text:
                  continue
      
              yield '"%s" %s' % (text, link)  
      
      
      def scrape_thenewstack():
          """Scrapes news from thenewstack.io"""
      
          r = requests.get('https://thenewstack.io', verify=False)
      
          tree = fromstring(r.content)
          links = tree.xpath('//div[@class="normalstory-box"]/header/h2/a/@href')
      
          for link in links:
              r = requests.get(link, verify=False)
              tree = fromstring(r.content)
              paras = tree.xpath('//div[@class="post-content"]/p')
              para = extract_paratext(paras)
              text = extract_text(para)  
              if not text:
                  continue
      
              yield '"%s" %s' % (text, link)
      
      
      def main():
          """Encompasses the main loop of the bot."""
          print('---Bot started---\n')
          news_funcs = ['scrape_coursera', 'scrape_thenewstack']
          news_iterators = []  
          for func in news_funcs:
              news_iterators.append(globals()[func]())
          while True:
              for i, iterator in enumerate(news_iterators):
                  try:
                      tweet = next(iterator)
                      t.statuses.update(status=tweet)
                      print(tweet, end='\n\n')
                      time.sleep(600)
                  except StopIteration:
                      news_iterators[i] = globals()[news_funcs[i]]()
      
      
      if __name__ == "__main__":  
          main()
      
      

      Save and exit bot.py.

      Run bot.py. You will receive output showing the content that your bot has scraped, in a format similar to the following:

      Output

      [nltk_data] Downloading package punkt to /Users/binaryboy/nltk_data...
      [nltk_data] Package punkt is already up-to-date!
      ---Bot started---
      "Take the first step toward your career goals by building new skills." https://blog.coursera.org/career-stories-from-inside-coursera/
      "Other speakers include Priyanka Sharma, director of cloud native alliances at GitLab and Dan Kohn, executive director of the Cloud Native Computing Foundation." https://thenewstack.io/cloud-native-live-twistlocks-virtual-conference/
      "You can learn how to use the power of Python for data analysis with a series of courses covering fundamental theory and project-based learning." https://blog.coursera.org/unlock-the-power-of-data-with-python-university-of-michigan-offers-new-programming-specializations-on-coursera/
      "“Real-user monitoring is really about trying to understand the underlying reasons, so you know, ‘who do I actually want to fly with?" https://thenewstack.io/how-raygun-co-founder-and-ceo-spun-gold-out-of-monitoring-agony/

      After a sample run of your bot, you'll see a full timeline of programmatic tweets posted by your bot on your Twitter page. It will look something like the following:

      Programmatic Tweets posted

      As you can see, the bot is tweeting the scraped blog links with random quotes from each blog as highlights. The timeline is now an information feed with tweets alternating between blog quotes from Coursera and thenewstack.io. You've built a bot that aggregates content from the web and posts it on Twitter. You can now broaden the scope of this bot by adding more scrapers for different websites; the bot will tweet content from all of the scrapers in round-robin fashion, at your chosen time interval.
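
      The round-robin behavior can be tried out without any network access. The following is a minimal sketch, not part of bot.py: fake_coursera and fake_newstack are hypothetical stand-ins for real scrapers, and round_robin is an illustrative helper that restarts an exhausted generator in the same way main() does.

```python
def fake_coursera():
    """Hypothetical stand-in for scrape_coursera()."""
    yield 'coursera-1'
    yield 'coursera-2'


def fake_newstack():
    """Hypothetical stand-in for scrape_thenewstack()."""
    yield 'newstack-1'


def round_robin(scrapers, limit):
    """Collect up to `limit` items, alternating between scrapers.

    An exhausted scraper is restarted by calling its function again,
    which mirrors the StopIteration handling in main().
    """
    iterators = [scraper() for scraper in scrapers]
    items = []
    while len(items) < limit:
        for i, iterator in enumerate(iterators):
            if len(items) == limit:
                break
            try:
                items.append(next(iterator))
            except StopIteration:
                # Start the scraper over; it contributes again on the next pass.
                iterators[i] = scrapers[i]()
    return items


feed = round_robin([fake_coursera, fake_newstack], limit=5)
print(feed)
```

      Note that a scraper which never yields anything would be restarted on every pass, so a real bot may want to pause or skip such a source.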

      Conclusion

      In this tutorial, you built a basic Twitter bot with Python and scraped some content from the web for your bot to tweet. There are many bot ideas to try, and you can also implement your own ideas for a bot's utility. You can combine the functionality offered by Twitter's API to create something more complex. For a more sophisticated Twitter bot, check out chirps, a Twitter bot framework that uses advanced concepts like multithreading to make the bot do multiple things simultaneously. There are also some fun-idea bots, like misheardly. There are no limits on the creativity you can use while building Twitter bots, but finding the right API endpoints for your bot's implementation is essential.

      Finally, bot etiquette (or "botiquette") is important to keep in mind when building your next bot. For example, if your bot incorporates retweeting, pass every tweet's text through a filter to detect abusive language before retweeting it. You can implement such features using regular expressions and natural language processing. Also, while looking for sources to scrape, use your judgment and avoid ones that spread misinformation. To read more about botiquette, you can visit this blog post by Joe Mayo on the topic.
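
      As a rough illustration of the regular-expression approach, here is a minimal sketch. The terms in BLOCKED_TERMS are hypothetical placeholders and is_safe_to_retweet is an illustrative name; a real bot would need a maintained word list and would likely combine this with more robust language analysis.

```python
import re

# Hypothetical placeholder terms; substitute a maintained word list.
BLOCKED_TERMS = ['badword', 'worseword']

# \b matches at word boundaries, so only whole words are flagged.
BLOCKED_RE = re.compile(
    r'\b(' + '|'.join(re.escape(term) for term in BLOCKED_TERMS) + r')\b',
    re.IGNORECASE,
)


def is_safe_to_retweet(text):
    """Return True when the text contains none of the blocked terms."""
    return BLOCKED_RE.search(text) is None


print(is_safe_to_retweet('A perfectly nice tweet'))
print(is_safe_to_retweet('This has a BadWord in it'))
```

      A check like this would run just before the bot retweets anything, skipping any tweet for which it returns False.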






      Using a CDN to Speed Up Static Content Delivery


      Introduction

      Modern websites and applications must often deliver a significant amount of static content to end users. This content includes images, stylesheets, JavaScript, and video. As these static assets grow in number and size, bandwidth usage swells and page load times increase, deteriorating the browsing experience for your users and reducing your servers’ available capacity.

      To dramatically reduce page load times, improve performance, and reduce your bandwidth and infrastructure costs, you can implement a CDN, or content delivery network, to cache these assets across a set of geographically distributed servers.

      In this tutorial, we’ll provide a high-level overview of CDNs and how they work, as well as the benefits they can provide for your web applications.

      What is a CDN?

      A content delivery network is a geographically distributed group of servers optimized to deliver static content to end users. This static content can be almost any sort of data, but CDNs are most commonly used to deliver web pages and their related files, streaming video and audio, and large software packages.

      Diagram of content delivery without a CDN

      A CDN consists of multiple points of presence (PoPs) in various locations, each consisting of several edge servers that cache assets from your origin, or host server. When a user visits your website and requests static assets like images or JavaScript files, their requests are routed by the CDN to the nearest edge server, from which the content is served. If the edge server does not have the assets cached or the cached assets have expired, the CDN will fetch and cache the latest version from either another nearby CDN edge server or your origin servers. If the CDN edge does have a cache entry for your assets (which occurs the majority of the time if your website receives a moderate amount of traffic), it will return the cached copy to the end user.

      Content Delivery Network (CDN) diagram

      This allows geographically dispersed users to minimize the number of hops needed to receive static content, fetching the content directly from a nearby edge’s cache. The result is significantly decreased latencies and packet loss, faster page load times, and drastically reduced load on your origin infrastructure.

      CDN providers often offer additional features such as DDoS mitigation and rate-limiting, user analytics, and optimizations for streaming or mobile use cases at additional cost.

      How Does a CDN Work?

      When a user visits your website, they first receive a response from a DNS server containing the IP address of your host web server. Their browser then requests the web page content, which often consists of a variety of static files, such as HTML pages, CSS stylesheets, JavaScript code, and images.

      Once you roll out a CDN and offload these static assets onto CDN servers, either by “pushing” them out manually or having the CDN “pull” the assets automatically (both mechanisms are covered in the next section), you then instruct your web server to rewrite links to static content such that these links now point to files hosted by the CDN. If you’re using a CMS such as WordPress, this link rewriting can be implemented using a third-party plugin like CDN Enabler.
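
      To make the link rewriting step concrete, here is a minimal sketch of what such a rewrite does to a page's HTML. It is not how any particular plugin works: the hostname cdn.example.com, the extension list, and the function name rewrite_static_links are all assumptions for illustration, and a regular expression is used only for brevity where a production setup would use an HTML parser or the web server's own rewrite features.

```python
import re

# Hypothetical CDN hostname; your provider assigns the real one.
CDN_HOST = 'https://cdn.example.com'

# File extensions treated as static assets in this sketch.
STATIC_EXTENSIONS = ('css', 'js', 'png', 'jpg', 'gif', 'svg')

# Matches src="/path/file.ext" or href="/path/file.ext" for root-relative
# paths. The (?!/) lookahead skips protocol-relative URLs such as //host/x.js.
ASSET_RE = re.compile(
    r'(?P<attr>src|href)="(?P<path>/(?!/)[^"]+\.(?:%s))"'
    % '|'.join(STATIC_EXTENSIONS)
)


def rewrite_static_links(html):
    """Point root-relative static asset URLs at the CDN host."""
    return ASSET_RE.sub(
        lambda m: '%s="%s%s"' % (m.group('attr'), CDN_HOST, m.group('path')),
        html,
    )


page = '<link href="/css/site.css"><img src="/img/logo.png"><a href="/about">About</a>'
print(rewrite_static_links(page))
```

      The stylesheet and image URLs now point at the CDN, while the ordinary page link to /about is left alone and continues to be served by the origin.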

      Many CDNs provide support for custom domains, allowing you to create a CNAME record under your domain pointing to a CDN endpoint. Once the CDN receives a user request at this endpoint (located at the edge, much closer to the user than your backend servers), it then routes the request to the Point of Presence (PoP) located closest to the user. This PoP often consists of one or more CDN edge servers colocated at an Internet Exchange Point (IXP), essentially a data center that Internet Service Providers (ISPs) use to interconnect their networks. The CDN’s internal load balancer then routes the request to an edge server located at this PoP, which then serves the content to the user.

      Caching mechanisms vary across CDN providers, but generally they work as follows:

      1. When the CDN receives a first request for a static asset, such as a PNG image, it does not have the asset cached and must fetch a copy of the asset from either a nearby CDN edge server or the origin server itself. This is known as a cache “miss,” and can usually be detected by inspecting the HTTP response headers, which will contain X-Cache: MISS. This initial request will be slower than future requests, which are served from the edge cache once the asset has been stored there.
      2. Future requests for this asset (cache “hits”), routed to this edge location, will now be served from cache, until expiry (usually set through HTTP headers). These responses will be significantly faster than the initial request, dramatically reducing latencies for users and offloading web traffic onto the CDN network. You can verify that the response was served from a CDN cache by inspecting the HTTP response header, which should now contain X-Cache: HIT.
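
      The miss-then-hit sequence above can be modeled in a few lines. The following is a toy sketch of a single edge server, assuming a fixed expiry time per asset: EdgeCache is a hypothetical class, a dictionary stands in for the origin server, and the clock is injectable so that expiry can be simulated without waiting.

```python
import time


class EdgeCache:
    """Toy model of one CDN edge server with a fixed TTL per asset."""

    def __init__(self, origin, ttl_seconds, clock=time.monotonic):
        self.origin = origin            # dict standing in for the origin server
        self.ttl_seconds = ttl_seconds
        self.clock = clock              # injectable so expiry can be simulated
        self.store = {}                 # path -> (body, expires_at)

    def get(self, path):
        """Return (body, x_cache) where x_cache is 'HIT' or 'MISS'."""
        now = self.clock()
        cached = self.store.get(path)
        if cached is not None and cached[1] > now:
            return cached[0], 'HIT'
        # Not cached, or expired: fetch from the origin and cache the copy.
        body = self.origin[path]
        self.store[path] = (body, now + self.ttl_seconds)
        return body, 'MISS'


fake_time = [0.0]
edge = EdgeCache({'/logo.png': b'PNG'}, ttl_seconds=60, clock=lambda: fake_time[0])

statuses = [edge.get('/logo.png')[1], edge.get('/logo.png')[1]]
fake_time[0] = 61.0                     # advance past the TTL
statuses.append(edge.get('/logo.png')[1])
print(statuses)  # ['MISS', 'HIT', 'MISS']
```

      The first request misses and populates the cache, the second is a hit, and the third misses again because the cached copy has expired. Real CDNs are considerably more involved, but the request flow follows the same pattern.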

      To learn more about how a specific CDN works and has been implemented, consult your CDN provider’s documentation.

      In the next section, we’ll introduce the two common ways a CDN can cache your content: push zones and pull zones.

      Push vs. Pull Zones

      Most CDN providers offer two ways of caching your data: pull zones and push zones.

      Pull Zones involve entering your origin server’s address, and letting the CDN automatically fetch and cache all the static resources available on your site. Pull zones are commonly used to deliver frequently updated, small to medium sized web assets like HTML, CSS, and JavaScript files. After providing the CDN with your origin server’s address, the next step is usually rewriting links to static assets such that they now point to the URL provided by the CDN. From that point onwards, the CDN will handle your users’ incoming asset requests and serve content from its geographically distributed caches and your origin as appropriate.

      To use a Push Zone, you upload your data to a designated bucket or storage location, which the CDN then pushes out to caches on its distributed fleet of edge servers. Push zones are typically used for larger, infrequently changing files, like archives, software packages, PDFs, video, and audio files.
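
      The difference between the two zone types comes down to who initiates the copy. Here is a minimal sketch under simplifying assumptions: PullZone and PushZone are hypothetical classes, dictionaries stand in for the origin server and the storage bucket, and expiry is ignored.

```python
class PullZone:
    """Fetches an asset from the origin the first time it is requested."""

    def __init__(self, origin):
        self.origin = origin    # dict standing in for the origin server
        self.cache = {}

    def get(self, path):
        if path not in self.cache:
            # Lazy fetch: the CDN pulls the asset on the first request.
            self.cache[path] = self.origin[path]
        return self.cache[path]


class PushZone:
    """Serves only what has been uploaded ahead of time."""

    def __init__(self):
        self.storage = {}       # dict standing in for the storage bucket

    def upload(self, path, body):
        self.storage[path] = body

    def get(self, path):
        return self.storage.get(path)   # None if never uploaded


pull = PullZone({'/site.css': 'body {}'})
print(pull.get('/site.css'))        # fetched from the origin on demand

push = PushZone()
print(push.get('/intro.mp4'))       # nothing uploaded yet
push.upload('/intro.mp4', 'video-bytes')
print(push.get('/intro.mp4'))
```

      With a pull zone, you only tell the CDN where the origin is; with a push zone, nothing is available until you upload it.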

      Benefits of Using a CDN

      Almost any site can reap the benefits provided by rolling out a CDN, but generally the core reasons for implementing one are to offload bandwidth from your origin servers onto the CDN servers, and to reduce latency for geographically distributed users.

      We’ll go through these and several of the other major advantages afforded by using a CDN below.

      Origin Offload

      If you’re nearing bandwidth capacity on your servers, offloading static assets like images, videos, CSS and JavaScript files will drastically reduce your servers’ bandwidth usage. Content delivery networks are designed and optimized for serving static content, and client requests for this content will be routed to and served by edge CDN servers. This has the added benefit of reducing load on your origin servers, as they then serve this data at a much lower frequency.
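
      A quick back-of-the-envelope estimate shows how the offload adds up. The function below is a sketch under simple assumptions: origin_bandwidth_gb is a hypothetical helper, the input figures are illustrative rather than measured, and only cache misses for static assets are assumed to reach the origin.

```python
def origin_bandwidth_gb(total_gb, static_share, cache_hit_ratio):
    """Estimate monthly origin egress after moving static assets to a CDN.

    total_gb: total monthly egress before the CDN
    static_share: fraction of that traffic that is static content (0 to 1)
    cache_hit_ratio: fraction of static requests served from the edge (0 to 1)
    """
    static_gb = total_gb * static_share
    dynamic_gb = total_gb - static_gb
    # Only cache misses for static assets still reach the origin.
    return dynamic_gb + static_gb * (1 - cache_hit_ratio)


# Illustrative inputs only: 1000 GB/month, 80% static, 90% edge hit ratio.
estimate = origin_bandwidth_gb(1000, 0.8, 0.9)
print(round(estimate, 1), 'GB still served by the origin')
```

      With these example inputs, the origin serves roughly 280 GB instead of 1000 GB: the 200 GB of dynamic traffic plus the 10% of static traffic that misses the edge cache.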

      Lower Latency for Improved User Experience

      If your user base is geographically dispersed, and a non-trivial portion of your traffic comes from a distant geographical area, a CDN can decrease latency by caching static assets on edge servers closer to your users. By reducing the distance between your users and static content, you can more quickly deliver content to your users and improve their experience by boosting page load speeds.

      These benefits are compounded for websites serving primarily bandwidth-intensive video content, where high latencies and slow loading times more directly impact user experience and content engagement.

      Manage Traffic Spikes and Avoid Downtime

      CDNs allow you to handle large traffic spikes and bursts by load balancing requests across a large, distributed network of edge servers. By offloading and caching static content on a delivery network, you can accommodate a larger number of simultaneous users with your existing infrastructure.

      For websites using a single origin server, these large traffic spikes can often overwhelm the system, causing unplanned outages and downtime. Shifting traffic onto highly available and redundant CDN infrastructure, designed to handle variable levels of web traffic, can increase the availability of your assets and content.

      Reduce Costs

      As serving static content usually makes up the majority of your bandwidth usage, offloading these assets onto a content delivery network can drastically reduce your monthly infrastructure spend. In addition to reducing bandwidth costs, a CDN can decrease server costs by reducing load on the origin servers, enabling your existing infrastructure to scale. Finally, some CDN providers offer fixed-price monthly billing, allowing you to transform your variable monthly bandwidth usage into a stable, predictable recurring spend.

      Increase Security

      Another common use case for CDNs is DDoS attack mitigation. Many CDN providers include features to monitor and filter requests to edge servers. These services analyze web traffic for suspicious patterns, blocking malicious attack traffic while continuing to allow reputable user traffic through. CDN providers usually offer a variety of DDoS mitigation services, from common attack protection at the infrastructure level (OSI layers 3 and 4), to more advanced mitigation services and rate limiting.

      In addition, most CDNs let you configure full SSL, so that you can encrypt traffic between the CDN and the end user, as well as traffic between the CDN and your origin servers, using either CDN-provided or custom SSL certificates.

      Choosing the Best Solution

      If your bottleneck is CPU load on the origin server, and not bandwidth, a CDN may not be the most appropriate solution. In this case, local caching using popular caches such as NGINX or Varnish may significantly reduce load by serving assets from system memory.

      Before rolling out a CDN, additional optimization steps, like minifying and compressing JavaScript and CSS files and enabling HTTP response compression on your web server, can also have a significant impact on page load times and bandwidth usage.
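
      To get a feel for what compression saves, you can gzip a text asset with Python's standard library. This is a minimal sketch: the stylesheet is a deliberately repetitive sample, so the ratio here is better than you should expect from typical real-world CSS.

```python
import gzip

# Deliberately repetitive sample stylesheet (illustrative only).
css = ('.button { color: #333; padding: 4px 8px; margin: 0; }\n' * 200).encode('utf-8')

compressed = gzip.compress(css)

print(len(css), 'bytes uncompressed')
print(len(compressed), 'bytes gzipped')

# Decompressing restores the original bytes exactly; browsers do this
# automatically for responses sent with Content-Encoding: gzip.
restored = gzip.decompress(compressed)
print(restored == css)
```

      In practice you would enable compression in your web server's configuration rather than compressing files by hand, but the size difference is the same idea.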

      A helpful tool to measure your page load speed and improve it is Google’s PageSpeed Insights. Another helpful tool that provides a waterfall breakdown of request and response times as well as suggested optimizations is Pingdom.

      Conclusion

      A content delivery network can be a quick and effective solution for improving the scalability and availability of your websites. By caching static assets on a geographically distributed network of optimized servers, you can greatly reduce page load times and latencies for end users. In addition, CDNs allow you to significantly reduce your bandwidth usage by absorbing user requests and responding from cache at the edge, thus lowering your bandwidth and infrastructure costs.

      With plugins and third-party support for major frameworks like WordPress, Drupal, Django, and Ruby on Rails, as well as additional features like DDoS mitigation, full SSL, user monitoring, and asset compression, CDNs can be an impactful tool for securing and optimizing high-traffic websites.


