
      How to Fix the “Too Many Redirects” Error in WordPress (13 Methods)


      URL redirection is necessary when pages have changed their addresses permanently or temporarily. However, sometimes your website can get stuck in a redirection loop. If this happens, you may face the “too many redirects” error that prevents you from accessing web pages.

      Fortunately, you can use several methods to fix this redirection issue. The problem usually lies within your website, browser, server, or third-party plugins or programs. By taking the time to diagnose the cause of the error, you can solve it relatively quickly.

      In this article, we’ll look at common causes of the “too many redirects” error in WordPress and how to fix them. We’ll also explain how to prevent the problem from happening again in the future. Let’s get started!

      What Causes the “Too Many Redirects” Error in WordPress

      The “too many redirects” error happens when your WordPress website gets stuck in a redirection loop. For example, one URL may redirect you to a second URL, which in turn redirects you back to the first. If this process continues, your browser will give up, trigger the error, and fail to load the site.
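To make the idea concrete, here is a minimal sketch in JavaScript of how a loop can be spotted in a set of redirect rules. The redirects map and its paths are hypothetical and stand in for whatever rules your site has. Real browsers do not track visited URLs this way; they stop after a fixed number of hops.

```javascript
// Hypothetical redirect rules: each key redirects to its value.
const redirects = new Map([
  ['/old-page', '/new-page'],
  ['/new-page', '/old-page'], // misconfigured: points back to the start
  ['/blog', '/articles'],
]);

// Follow redirects from `start` and report whether they loop.
function findRedirectLoop(rules, start) {
  const visited = [start];
  let current = start;
  while (rules.has(current)) {
    current = rules.get(current);
    if (visited.includes(current)) {
      // We have been here before, so the chain can never finish.
      return { loop: true, chain: [...visited, current] };
    }
    visited.push(current);
  }
  return { loop: false, chain: visited };
}

console.log(findRedirectLoop(redirects, '/old-page'));
console.log(findRedirectLoop(redirects, '/blog'));
```

The first call reports a loop because /old-page and /new-page point at each other, while the second ends normally at /articles.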

      This error looks different depending on the browser you use. For example, in Google Chrome, it usually displays as “ERR_TOO_MANY_REDIRECTS” or “This webpage has a redirect loop.”

      The “too many redirects” error in Google Chrome. 

      If you use Mozilla Firefox, the error usually reads as “The page isn’t redirecting properly.” Alternatively, it displays as “This page isn’t working right now” in Microsoft Edge. Finally, Safari users may encounter “Safari Can’t Open the Page.”

      Unlike some other common WordPress errors, the “too many redirects” issue doesn’t usually solve itself. As such, you’ll need to troubleshoot the origins of the problem to fix it.

      How to Fix the “Too Many Redirects” Error in WordPress (13 Methods)

      Various factors can cause the “too many redirects” error in WordPress. Therefore, you may need to try a few different methods to solve it. Let’s take a look at the most common solutions.

      1. Force the Page to Refresh

      The first solution is a very simple one. You can force your browser to refresh and retrieve a new version of the page. This method overrides any stored data and displays the latest information available for the WordPress website.

      You might like to try this method first because it’s quick and won’t interfere with any other strategies. You’ll also know straight away if it has fixed the problem or not.

      You can use the following keyboard shortcuts to force a refresh in your browser:

      • Google Chrome (Windows): Ctrl + F5
      • Google Chrome (Mac): Command + Shift + R
      • Safari: Command + Option + R
      • Firefox (Windows): Ctrl + F5
      • Firefox (Mac): Command + Shift + R
      • Microsoft Edge: Ctrl + F5

      That’s all you need to do. However, if this simple method doesn’t work, you can continue through the troubleshooting guide.

      2. Delete Cookies on the Site

      Cookies are small blocks of data that enable websites to remember information about your visit. Then, the sites use that data to customize your experiences.

      For example, an e-commerce platform might send you on-site recommendations based on your previous purchases and searches. This way, you’ll be able to save time when looking for related products.

      However, cookies can sometimes hold onto incorrect data. In turn, this can cause the “too many redirects” error. As such, you can try deleting cookies from the WordPress site.

      In Google Chrome, click the three-dot menu icon at the top right of the browser window. Then, click on Settings.

      How to access the Settings in Google Chrome.

      Scroll down to Privacy and security and select Cookies and other site data.

      Finding cookies and other site data in Google Chrome.

      Move down the page and select See all cookies and site data. This will open a list of all the cookies that different sites hold with your data.

      A list of the cookies in a Google Chrome browser.

      Scroll down to find the site that is throwing the “too many redirects” error. Then, click on the trashcan icon next to its corresponding cookie to delete it.

      There is a slightly different method if you’re using Safari, Microsoft Edge, or Firefox. Once you’re done, try refreshing the WordPress site to see if the error is fixed.

      3. Clear Your WordPress Site or Server Cache

      Caching stores information about your site so that it can load faster the next time you access it. However, your cache may be holding outdated data and causing a redirection error. Therefore, you can try clearing out the stored information to see if it fixes the problem.

      If you can access your WordPress site, you can try clearing the cache with a dedicated caching plugin. For example, you could use WP Super Cache.

      The WP Super Cache plugin. 

      However, the redirection error will likely prevent you from getting to your dashboard. Therefore, you might need to try clearing your server cache.

      If you’re a DreamPress customer and have a shell account, you’ll need to log into your server over Secure Shell (SSH). Then, you can enter the following command to purge your cache:

      curl -X PURGE "http://yourwebsite.com/.*" ; wp cache flush

      Alternatively, you can use the following command if you don’t use the WP Super Cache plugin:

      wp varnish purge --wildcard

      Once you’ve cleared out the cache, try reloading your site. If that doesn’t work, you may need to try another method.

      4. Clear Your Browser Cache

      Your browser also stores cached information about the websites you visit, including your own. If your browser is holding onto outdated data, you may need to clear it out to fix the redirection error in WordPress.

      If you’re working with Chrome, you can head back to Settings and scroll down to Privacy and security. Here, click on Clear browsing data.

      Clear browsing data in Google Chrome.

      This will bring up a new window that enables you to choose the data you want to delete. Select each item by checking the box next to it and then click on Clear data.

      Clearing data in Google Chrome.

      You’ll need to use slightly different methods if you’re working with a different browser. When you’re done, try reloading your site to see if the “too many redirects” error has gone.

      5. Determine the Cause of the Redirect Loop

      If the earlier methods didn’t solve the redirection error, you might like to try to diagnose the problem. Otherwise, you might spend a lot of effort on more time-consuming strategies that may not fix the error.

      There are a couple of different methods that can determine the cause of redirect loops. Firstly, you can enter your site’s URL into the Redirect Checker tool.

      The Redirect Checker tool from httpstatus. 

      This free online application enables you to enter multiple URLs and check their statuses. You can also specify the user agent, such as your browser, search engine bots, and mobile devices.

      Once you enter your URL, you’ll be able to see any status or error codes associated with your site at the bottom of the page.

      Status codes associated with the DreamHost URL. 

      Alternatively, some browser add-ons can show you the nature of redirects on different sites. For example, the Redirect Path Chrome extension flags redirect error messages in real-time.

      The Redirect Path Chrome extension.

      However, these tools might not always tell you why your redirect error is happening. If this is the case, you can continue with the other strategies in this troubleshooting guide.
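If you prefer to inspect the redirects yourself, the same idea can be sketched in a few lines of JavaScript. This is a minimal sketch, assuming Node.js 18 or later for the built-in fetch. The function name traceRedirects, the maxHops limit, and the example.com address are illustrative and not part of any tool mentioned above.

```javascript
// Follow redirects one hop at a time and record each step.
// `fetchImpl` is injectable so the function can be tried without a network;
// by default it uses the global fetch available in Node.js 18+.
async function traceRedirects(startUrl, { maxHops = 10, fetchImpl = fetch } = {}) {
  const hops = [];
  let url = startUrl;
  for (let i = 0; i < maxHops; i++) {
    // redirect: 'manual' stops fetch from following the redirect for us.
    // Some servers answer HEAD differently; switch to GET if results look off.
    const res = await fetchImpl(url, { method: 'HEAD', redirect: 'manual' });
    const location = res.headers.get('location');
    hops.push({ url, status: res.status, location });
    if (res.status < 300 || res.status >= 400 || !location) {
      return { hops, loop: false }; // not a redirect, so the chain ends here
    }
    // Location may be relative, so resolve it against the current URL.
    const next = new URL(location, url).href;
    if (hops.some((hop) => hop.url === next)) {
      return { hops, loop: true }; // we are being sent back to a URL we already tried
    }
    url = next;
  }
  return { hops, loop: false, gaveUp: true };
}

// Example usage with a placeholder domain:
// traceRedirects('https://example.com/').then((result) => console.log(result));
```

Each entry in hops shows the URL requested, the status code, and the Location header, which is usually enough to see which two addresses are pointing at each other.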

      6. Temporarily Disable Your WordPress Plugins

      WordPress plugins are helpful tools that can introduce new functionalities to your website. However, these add-ons can also cause many issues, such as the “too many redirects” error.

      Anyone can develop and share WordPress plugins. As such, you may accidentally download one that contains faulty code. These add-ons also have frequent updates. If you haven’t updated your plugins, they may also be causing problems on your site.

      You may like to try this method if you recently added new plugins to your WordPress site. If so, you’ll probably have a good idea of the one causing the problem. Even if you don’t suspect a particular plugin, you can use the following steps to address the issue.

      If you can’t access your WordPress dashboard, you’ll need to reach the plugin files via an SSH File Transfer Protocol (SFTP) application such as WinSCP.

      The WinSCP SFTP client.

      Once you’ve connected the SFTP client to your site, you’ll need to find the folder that holds your plugins. You’ll usually find it under wp-content > plugins. Here, you’ll see a series of folders with the names of your installed plugins.

      Plugin folders for WordPress sites.

      Rename the plugins folder to “plugins-off”. This will deactivate all of your plugins. You should now be able to access your WordPress dashboard.

      Next, rename your plugins folder to its original title. Then go through the process of reactivating each add-on from your WordPress dashboard to see which one throws the “too many redirects” error.

      If you find a problem plugin, you’ll need to keep it deactivated. You’ll also need to find an alternative option for your website.

      7. Check Your WordPress Site Settings

      Sometimes an error in your WordPress site settings can cause redirect loops. For example, your website might be pointing to the wrong domain name for your site files. This more commonly happens if you’ve recently migrated your website.

      You can check your site settings in your WordPress dashboard. If you can access it, log in and head to Settings > General. You’ll then see two fields for WordPress Address (URL) and Site Address (URL).

      Accessing URL settings in WordPress.

      These two addresses should be identical unless you want WordPress to have its own directory. If the URLs don’t match, and they should, you can change the settings manually. You’ll need to edit your site’s wp-config.php file.

      Access your website using SFTP as you did previously. Then, locate and open the wp-config.php file in a text editor.

      Next, you’re going to paste the following code into the file:

      define( 'WP_HOME', 'http://example.com' );
      
      define( 'WP_SITEURL', 'http://example.com' );

      Replace the example URLs with the correct ones and save the file. Then reload your website and see if this solved the problem.
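Mismatches between the two addresses are often small, such as http versus https or a missing www prefix. As a minimal sketch, the JavaScript below compares two values and lists the parts that differ. The function name compareSiteUrls and the sample addresses are hypothetical. A difference in the path can be intentional if WordPress lives in its own directory.

```javascript
// Compare two WordPress URL settings and list the parts that differ.
function compareSiteUrls(home, siteUrl) {
  const a = new URL(home);
  const b = new URL(siteUrl);
  const differences = [];
  if (a.protocol !== b.protocol) {
    differences.push(`scheme: ${a.protocol} vs ${b.protocol}`);
  }
  if (a.hostname !== b.hostname) {
    differences.push(`host: ${a.hostname} vs ${b.hostname}`);
  }
  // Ignore a trailing slash so "/blog" and "/blog/" count as the same path.
  const pathA = a.pathname.replace(/\/+$/, '');
  const pathB = b.pathname.replace(/\/+$/, '');
  if (pathA !== pathB) {
    differences.push(`path: ${pathA || '/'} vs ${pathB || '/'}`);
  }
  return differences;
}

console.log(compareSiteUrls('http://example.com', 'https://www.example.com'));
```

For the sample values, the function reports both a scheme difference and a host difference, which are the two mismatches most likely to cause a loop after a migration.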

      8. Check Your SSL Certificate

      If you’ve recently migrated your site to HTTPS, there are various steps you need to complete. Unfortunately, if you miss some of them or misconfigure some settings, you could trigger the “too many redirects” error in WordPress.

      For example, if you didn’t install your Secure Sockets Layer (SSL) certificate correctly, it could be causing problems. If you force HTTPS without installing a certificate at all, your site is likely to get stuck in a redirect loop.

      However, there might also be some minor issues with your SSL certificate installation. For example, you might have incorrectly installed the intermediate certificates that work together with your main one.

      You can check if your SSL certificate is correctly installed using a tool such as the Qualys SSL Server Test.

      The SSL Server Test from Qualys.

      This application scans your domain to find any associated SSL issues. This process can take a few minutes, but it will alert you to any problems with your certificate installation.

      9. Update Your Hard-Coded Links

      If you’ve just switched from HTTP to HTTPS, you’ll need to update any links that are hard-coded with the old HTTP address. Otherwise, these URLs will keep pointing to the old HTTP versions of your pages and trigger extra redirects.

      Many users utilize plugins that can change these links automatically. For example, you could use Better Find and Replace.

      The Better Find and Replace plugin.

      However, it can be risky to use an add-on. If your chosen plugin has any issues with its code or updates, it can misconfigure your redirects and trigger the “too many redirects” error.

      As such, we recommend that you manually update your hard-coded links. You can do this with the search and replace method in WordPress.

      We have a complete guide on how to change your WordPress URLs. If you’re a DreamHost customer, you can also reach out to our technical support team for assistance.

      10. Check for HTTPS Redirects on Your Server

      HTTPS redirect server rules can also cause the “too many redirects” error in WordPress. These settings may have been misconfigured when you migrated your site.

      For example, the settings may not be correctly redirecting your links to HTTPS. As such, you’ll need to amend them.

      If your host uses an Apache server, you’ll need to edit your .htaccess file. Locate it within your SFTP client and open the file in a text editor. Then, you can enter the following code:

      RewriteEngine On
      
      RewriteCond %{HTTPS} off
      
      RewriteRule ^(.*)$ https://%{HTTP_HOST}%{REQUEST_URI} [L,R=301]

      This code will cause all HTTP requests to redirect to HTTPS automatically. Save the .htaccess file and try to reload your WordPress site. If it still triggers the redirect error, you’ll need to try another solution.

      Alternatively, you can adjust your HTTPS redirects on Nginx servers. If you’re not sure which server type your host uses, you might like to double-check with the company first.

      In Nginx, you’ll need to adjust the server’s config file. Locate it with your SFTP client as usual and open it in a text editor. Then insert the following code to set up your redirects, replacing domain.com with your own domain:

      server {
          listen 80;
          server_name domain.com www.domain.com;
          return 301 https://domain.com$request_uri;
      }

      Save the file, reload Nginx so that the change takes effect, and then reload your WordPress site. If it doesn’t fix the problem, keep moving through this troubleshooting guide.

      11. Check Your Third-Party Service Settings

      If you use a third-party service such as a Content Delivery Network (CDN), its settings may cause the “too many redirects” error. For example, Cloudflare is a popular CDN that can improve your website’s performance and security.

      The Cloudflare Content Delivery Network (CDN).

      Cloudflare can trigger the “too many redirects” error if you have the Flexible SSL setting enabled and an SSL certificate from another source (such as your hosting provider).

      In this scenario, your hosting server is already redirecting requests from HTTP to HTTPS. However, with the Flexible SSL setting, Cloudflare sends every request to your server over HTTP. Your server answers each one with another redirect to HTTPS, so the request never completes and a redirection loop forms.

      As such, we don’t recommend using the Flexible SSL setting if you have an SSL certificate from a third-party source. Instead, change your Cloudflare Crypto settings and choose either Full or Full (strict). Doing so allows Cloudflare to reach your server over HTTPS.
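The loop is easier to see in a toy model. The JavaScript below is only a sketch under simplifying assumptions: the origin server always redirects plain HTTP to HTTPS, "flexible" mode always contacts the origin over HTTP, and "full" mode contacts it with the same scheme the visitor used. The function names and example.com are illustrative and do not reflect Cloudflare's actual implementation.

```javascript
// Toy origin server with a "force HTTPS" rule.
function originResponse(schemeSeenByOrigin, path) {
  if (schemeSeenByOrigin === 'http') {
    return { status: 301, location: `https://example.com${path}` };
  }
  return { status: 200 };
}

// Simulate a visitor who starts on plain HTTP and follows redirects.
function simulateVisit(mode, maxRedirects = 5) {
  let visitorScheme = 'http';
  let redirects = 0;
  while (redirects <= maxRedirects) {
    // Assumption: "flexible" always uses HTTP to the origin,
    // while "full" mirrors the scheme the visitor used.
    const schemeSeenByOrigin = mode === 'flexible' ? 'http' : visitorScheme;
    const res = originResponse(schemeSeenByOrigin, '/');
    if (res.status === 200) {
      return { mode, redirects, result: 'page loaded' };
    }
    visitorScheme = 'https'; // the visitor follows the redirect to HTTPS
    redirects += 1;
  }
  return { mode, redirects, result: 'too many redirects' };
}

console.log(simulateVisit('flexible'));
console.log(simulateVisit('full'));
```

In the flexible case the origin never sees an HTTPS request, so it keeps redirecting until the visitor gives up. In the full case the page loads after a single redirect.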

      Additionally, you may like to enable the Always Use HTTPS rule in Cloudflare. This forces your site to send all requests in HTTPS. Therefore, it avoids causing a redirect loop and triggering the WordPress error.

      Finally, you might like to double-check that you’ve correctly configured your redirects in Cloudflare. For example, you’ll want to ensure that your domain doesn’t redirect to itself. Otherwise, it can trigger a redirect error.

      12. Check Redirects on Your Server

      We already covered how to check for HTTPS redirects on your server. However, other redirects can trigger an error when loading your WordPress website.

      For example, you might have a 301 redirect misconfigured. It might be pointing to the original link, triggering a redirect loop that prevents your site from loading. You can usually find redirects such as this one by checking your config files.

      If your host uses an Apache server, you may have issues with your .htaccess file. We recommend creating a new one with default settings.

      First, you’ll need to access your site via SFTP. Find the .htaccess file and save a copy of it in case you make a mistake. You can do this by renaming it to something like “.htaccess_old”.

      Next, you’ll need to make a new .htaccess file. Put the following code into it to establish default settings:

      # BEGIN WordPress
      <IfModule mod_rewrite.c>
      RewriteEngine On
      RewriteRule .* - [E=HTTP_AUTHORIZATION:%{HTTP:Authorization}]
      RewriteBase /
      RewriteRule ^index\.php$ - [L]
      RewriteCond %{REQUEST_FILENAME} !-f
      RewriteCond %{REQUEST_FILENAME} !-d
      RewriteRule . /index.php [L]
      </IfModule>
      # END WordPress

      Save the file and try reloading your WordPress site. If this process worked, you can delete the old .htaccess file and keep working with the new one.

      However, if your host uses an Nginx server, you’ll need to follow a slightly different process. This server type uses a variety of different config files, depending on the hosting provider. We recommend reaching out to your host to see which one applies to your situation.

      13. Contact Your Web Hosting Provider

      If you’ve tried all of these methods and you can’t fix the “too many redirects” error, it might be time to get some help. You might be missing a crucial step, or there could be a deeper issue with your WordPress site.

      By contacting your web hosting provider, you can get fast assistance with the error. For example, DreamHost customers can contact our technical support team.

      The DreamHost technical support landing page.

      You’ll need to log in to your account. You may also need to provide some information, such as your domain name and customer details.

      How to Prevent the “Too Many Redirects” Error in the Future (3 Methods)

      If you want to prevent the “too many redirects” error, there are a few steps you can take within your browser and site. Let’s take a look at a few different methods.

      1. Keep Your Plugins and WordPress Files Up to Date

      Outdated or faulty plugins are some of the leading causes of the “too many redirects” error. We already covered how you can deactivate any add-ons that may be triggering the issue. However, you can also take preventative steps with your current plugins and theme files.

      For example, you should update your plugins and WordPress theme frequently. You can tell if the software has a new release because you’ll see an alert in your WordPress dashboard. You can also navigate to Plugins > Installed Plugins.

      Updating plugins in WordPress.

      You can update any plugin by clicking on Update now or Enable auto-updates. However, if you prefer to do the process manually, we recommend checking this page on a regular basis. Doing so will enable you to stay on top of any new releases and bug fixes.

      Additionally, you can report any faulty plugins if they cause the “too many redirects” error. Find the corresponding plugin support forum and document your issue to see if there is a known solution. Moreover, this action could prompt the plugin developers to fix the problem.

      2. Clear Out Your Cache and Stored Cookies Regularly

      Earlier in the guide, we explained how to clear out your cache and your saved cookies. These methods prevent your browser or WordPress site from trying to access outdated data.

      It’s likely that you won’t need to use these methods as most browsers are smart enough to remove outdated cookies and cache items. However, you can streamline the process by using a WordPress plugin to clear your site’s cache. An add-on such as this one can make sure that the most current version of your site is always available to your users.

      For example, if you’re using WP Super Cache, you can set up automatic processes. In your WordPress dashboard, navigate to Settings > WP Super Cache.

      Configuring settings in WP Super Cache. 

      If you want to remove cached files manually, you can click on Delete Cache. You can also navigate to the Advanced tab and scroll down to Expiry Time & Garbage Collection. Here, you can control how long cached files remain active on your site.

      Configuring the WP Super Cache settings.

      Here you can choose a custom cache timeout duration in seconds. Alternatively, you can select a custom time and interval to scan your site for outdated cache files. You can even elect to receive emails when this process happens.

      You likely won’t be able to access the plugin if you’re already receiving the “too many redirects” error. However, using this add-on can be a sound preventative measure.

      3. Use a Checklist or Company for Website Migrations

      Many of the causes for redirect errors in WordPress arise from migrations from HTTP to HTTPS. If you’re not familiar with migrating a site, you may miss some of the essential processes needed to make your website redirect and function correctly.

      Therefore, we recommend using a dedicated migration service to take care of the process. Professionals have experience with every aspect of migrating a site. As such, they’re less likely to make mistakes.

      If you prefer to do the migration yourself, you might like to use a checklist during the process:

      1. Prepare for the migration. First, you’ll need to make a copy of your site as a backup. You’ll also need to block access to your new site until you can check it for errors and migrate all your content.
      2. Create a URL mapping. You’ll need to create a redirect map for all your site’s URLs. Then, you’ll need to update them and create sitemaps so that you can transition the links easily.
      3. Create backups. Before starting the migration, you’ll probably want to back up all your individual content. Otherwise, you could lose it if something goes wrong during the process.
      4. Update your DNS settings. You’ll need to change your domain settings so that the URL points to your new address. Usually, your new host can take care of this for you.
      5. Set up your redirects. This step is crucial because misconfiguring your redirects can trigger the “too many redirects” error. Make sure you test each link to see that it works.
      6. Send your URLs to Google Search Console. You’ll need to verify your new site and send sitemaps with your new URLs indexed. This process is essential for Search Engine Optimization (SEO).
      7. Update your links. If other websites link to your site, you might like to ask them to update those URLs. Additionally, you should ensure that any ad campaigns contain the correct links for your new website address.
      8. Check for problems. Finally, you might like to run a site audit. This process can test all your links and identify any issues.
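For the URL mapping and redirect steps in the checklist, it can help to check the map before it goes live. The JavaScript below is a minimal sketch, assuming the redirect map is available as old-to-new pairs. It rewrites each entry to point straight at its final destination and throws if it finds a loop. The function name flattenRedirectMap and the sample paths are hypothetical.

```javascript
// Rewrite a redirect map so every old URL points straight at its final
// destination. Throws if the map contains a loop.
function flattenRedirectMap(map) {
  const flat = new Map();
  for (const [source, firstTarget] of map) {
    const seen = new Set([source]);
    let target = firstTarget;
    // Keep following the chain while the target is itself redirected.
    while (map.has(target)) {
      if (seen.has(target)) {
        throw new Error(`Redirect loop involving ${target}`);
      }
      seen.add(target);
      target = map.get(target);
    }
    flat.set(source, target);
  }
  return flat;
}

// Hypothetical map with a two-step chain: /about-us -> /about -> /company
const redirectMap = new Map([
  ['/about-us', '/about'],
  ['/about', '/company'],
  ['/news', '/blog'],
]);

console.log(flattenRedirectMap(redirectMap));
```

Flattening chains this way keeps every redirect to a single hop, and the thrown error surfaces a misconfigured pair before visitors run into it.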

      If you’re migrating to a different server, the process might be slightly different. It pays to do your research before the migration to avoid any errors.

      No More Redirect Loop Error

      The “too many redirects” error can happen in WordPress when the site gets stuck in a redirection loop. Although the problem can be frustrating, you should be able to solve it pretty quickly.

      You can usually fix the error by clearing out your cache or cookies. Additionally, there may be solvable issues with your server, third-party platforms, or plugins. Finally, if you still can’t troubleshoot the redirection error, your hosting provider may be able to help you out.

      Are you looking for a WordPress hosting provider that can help you with redirection issues and other common errors? Check out our DreamHost packages today! We provide personalized technical support to assist you with any WordPress problems.

      How To Build a Media Processing API in Node.js With Express and FFmpeg.wasm


      The author selected the Electronic Frontier Foundation to receive a donation as part of the Write for DOnations program.

      Introduction

      Handling media assets is becoming a common requirement of modern back-end services. Using dedicated, cloud-based solutions may help when you’re dealing with massive scale or performing expensive operations, such as video transcoding. However, the extra cost and added complexity may be hard to justify when all you need is to extract a thumbnail from a video or check that user-generated content is in the correct format. Particularly at a smaller scale, it makes sense to add media processing capability directly to your Node.js API.

      In this guide, you will build a media API in Node.js with Express and ffmpeg.wasm — a WebAssembly port of the popular media processing tool. You’ll build an endpoint that extracts a thumbnail from a video as an example. You can use the same techniques to add other features supported by FFmpeg to your API.

      When you’re finished, you will have a good grasp of handling binary data in Express and processing it with ffmpeg.wasm. You’ll also handle requests made to your API that cannot be processed in parallel.

      Prerequisites

      To complete this tutorial, you will need:

      This tutorial was verified with Node v16.11.0, npm v7.15.1, express v4.17.1, and ffmpeg.wasm v0.10.1.

      Step 1 — Setting Up the Project and Creating a Basic Express Server

      In this step, you will create a project directory, initialize a Node.js project, install ffmpeg.wasm, and set up a basic Express server.

      Start by opening the terminal and creating a new directory for the project:

      Navigate to the new directory:

      Use npm init to create a new package.json file. The -y parameter indicates that you’re happy with the default settings for the project.

      Finally, use npm install to install the packages required to build the API. The --save flag indicates that you wish to save those as dependencies in the package.json file.

      npm install --save @ffmpeg/ffmpeg @ffmpeg/core express cors multer p-queue

      Now that you have installed ffmpeg, you’ll set up a web server that responds to requests using Express.

      First, open a new file called server.mjs with nano or your editor of choice:

      The code in this file will register the cors middleware which will permit requests made from websites with a different origin. At the top of the file, import the express and cors dependencies:

      server.mjs

      import express from 'express';
      import cors from 'cors';
      

      Then, create an Express app and start the server on port 3000 by adding the following code below the import statements:

      server.mjs

      ...
      const app = express();
      const port = 3000;
      
      app.use(cors());
      
      app.listen(port, () => {
          console.log(`[info] ffmpeg-api listening at http://localhost:${port}`)
      });
      

      You can start the server by running the following command:

      You’ll see the following output:

      Output

      [info] ffmpeg-api listening at http://localhost:3000

      When you try loading http://localhost:3000 in your browser, you’ll see Cannot GET /. This is Express telling you that it is listening for requests but has no route defined for / yet.

      With your Express server now set up, you’ll create a client to upload the video and make requests to your Express server.

      Step 2 — Creating a Client and Testing the Server

      In this section, you’ll create a web page that will let you select a file and upload it to the API for processing.

      Start by opening a new file called client.html:

      In your client.html file, create a file input and a Create Thumbnail button. Below, add an empty <div> element to display errors and an image that will show the thumbnail that the API sends back. At the very end of the <body> tag, load a script called client.js. Your final HTML template should look as follows:

      client.html

      <!DOCTYPE html>
      <html lang="en">
      <head>
          <meta charset="UTF-8">
          <title>Create a Thumbnail from a Video</title>
          <style>
              #thumbnail {
                  max-width: 100%;
              }
          </style>
      </head>
      <body>
          <div>
              <input id="file-input" type="file" />
              <button id="submit">Create Thumbnail</button>
              <div id="error"></div>
              <img id="thumbnail" />
          </div>
          <script src="client.js"></script>
      </body>
      </html>
      

      Note that each element has a unique id. You’ll need them when referring to the elements from the client.js script. The styling on the #thumbnail element is there to ensure that the image fits on the screen when it loads.

      Save the client.html file and open client.js:

      In your client.js file, start by defining variables that store references to your HTML elements you created:

      client.js

      const fileInput = document.querySelector('#file-input');
      const submitButton = document.querySelector('#submit');
      const thumbnailPreview = document.querySelector('#thumbnail');
      const errorDiv = document.querySelector('#error');
      

      Then, attach a click event listener to the submitButton variable to check whether you’ve selected a file:

      client.js

      ...
      submitButton.addEventListener('click', async () => {
          const { files } = fileInput;
      });
      

      Next, create a function showError() that will output an error message when a file is not selected. Add the showError() function above your event listener:

      client.js

      const fileInput = document.querySelector('#file-input');
      const submitButton = document.querySelector('#submit');
      const thumbnailPreview = document.querySelector('#thumbnail');
      const errorDiv = document.querySelector('#error');
      
      function showError(msg) {
          errorDiv.innerText = `ERROR: ${msg}`;
      }
      
      submitButton.addEventListener('click', async () => {
      ...
      

      Now, you will build a function createThumbnail() that will make a request to the API, send the video, and receive a thumbnail in response. At the top of your client.js file, define a new constant with the URL to a /thumbnail endpoint:

      const API_ENDPOINT = 'http://localhost:3000/thumbnail';
      
      const fileInput = document.querySelector('#file-input');
      const submitButton = document.querySelector('#submit');
      const thumbnailPreview = document.querySelector('#thumbnail');
      const errorDiv = document.querySelector('#error');
      ...
      

      You will define and use the /thumbnail endpoint in your Express server.

      Next, add the createThumbnail() function below your showError() function:

      client.js

      ...
      function showError(msg) {
          errorDiv.innerText = `ERROR: ${msg}`;
      }
      
      async function createThumbnail(video) {
      
      }
      ...
      

      Web APIs frequently use JSON to transfer structured data to and from the client. To include a video in a JSON payload, you would have to encode it in base64, which would increase its size by about a third. You can avoid this by using multipart requests instead. Multipart requests allow you to transfer structured data, including binary files, over HTTP without that overhead. You can build one using the FormData() constructor function.
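
      To make the base64 overhead concrete, here is a small sketch you can run in Node.js. The helper names and the 30 MiB file size are made up for illustration. It relies only on the fact that base64 turns every 3 input bytes into 4 output characters.

```javascript
// Length of the base64 encoding of `byteLength` bytes:
// every 3 input bytes become 4 characters, padded to a multiple of 4.
function base64Length(byteLength) {
    return 4 * Math.ceil(byteLength / 3);
}

// Relative size increase as a fraction (0.33 means 33% larger).
function base64Overhead(byteLength) {
    return base64Length(byteLength) / byteLength - 1;
}

const videoBytes = 30 * 1024 * 1024; // hypothetical 30 MiB upload
console.log(`${videoBytes} bytes need ${base64Length(videoBytes)} base64 characters`);
console.log(`overhead: ${(base64Overhead(videoBytes) * 100).toFixed(1)}%`);

// Cross-check the formula against Node's own encoder on a small buffer.
const encoded = Buffer.alloc(1000).toString('base64');
console.log(encoded.length === base64Length(1000));
```

      A multipart request sends the same bytes as they are, so the upload stays close to the size of the original file.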

      Inside the createThumbnail() function, create an instance of FormData and append the video file to the object. Then make a POST request to the API endpoint using the Fetch API with the FormData() instance as the body. Interpret the response as a binary file (or blob) and convert it to a data URL so that you can assign it to the <img> tag you created earlier.

      Here’s the full implementation of createThumbnail():

      client.js

      ...
      async function createThumbnail(video) {
          const payload = new FormData();
          payload.append('video', video);
      
          const res = await fetch(API_ENDPOINT, {
              method: 'POST',
              body: payload
          });
      
          if (!res.ok) {
              throw new Error('Creating thumbnail failed');
          }
      
          const thumbnailBlob = await res.blob();
          const thumbnail = await blobToDataURL(thumbnailBlob);
      
          return thumbnail;
      }
      ...
      

      You’ll notice createThumbnail() has the function blobToDataURL() in its body. This is a helper function that will convert a blob to a data URL.

      Above your createThumbnail() function, create the function blobToDataURL() that returns a promise:

      client.js

      ...
      async function blobToDataURL(blob) {
          return new Promise((resolve, reject) => {
              const reader = new FileReader();
              reader.onload = () => resolve(reader.result);
              reader.onerror = () => reject(reader.error);
              reader.onabort = () => reject(new Error("Read aborted"));
              reader.readAsDataURL(blob);
          });
      }
      ...
      

      blobToDataURL() uses FileReader to read the contents of the binary file and format it as a data URL.
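
      If you have not worked with data URLs before, the following Node.js sketch shows the shape of the string that FileReader produces. The toDataURL() helper is hypothetical and is not part of client.js. It uses Node's Buffer instead of FileReader so that you can run it outside the browser.

```javascript
// Builds a data URL of the form: data:<mime type>;base64,<payload>
function toDataURL(bytes, mimeType) {
    const payload = Buffer.from(bytes).toString('base64');
    return `data:${mimeType};base64,${payload}`;
}

// The first eight bytes of every PNG file (the PNG signature).
const pngSignature = Uint8Array.from([0x89, 0x50, 0x4e, 0x47, 0x0d, 0x0a, 0x1a, 0x0a]);

// Prints a string that starts with "data:image/png;base64,".
console.log(toDataURL(pngSignature, 'image/png'));
```

      Because the whole image is embedded in the string, you can assign it directly to the src attribute of an <img> element without a second request.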

      With the createThumbnail() and showError() functions now defined, you can use them to finish implementing the event listener:

      client.js

      ...
      submitButton.addEventListener('click', async () => {
          const { files } = fileInput;
      
          if (files.length > 0) {
              const file = files[0];
              try {
                  const thumbnail = await createThumbnail(file);
                  thumbnailPreview.src = thumbnail;
              } catch(error) {
                  showError(error);
              }
          } else {
              showError('Please select a file');
          }
      });
      

      When a user clicks on the button, the event listener will pass the file to the createThumbnail() function. If successful, it will assign the thumbnail to the <img> element you created earlier. In case the user doesn’t select a file or the request fails, it will call the showError() function to display an error.

      At this point, your client.js file will look like the following:

      client.js

      const API_ENDPOINT = 'http://localhost:3000/thumbnail';
      
      const fileInput = document.querySelector('#file-input');
      const submitButton = document.querySelector('#submit');
      const thumbnailPreview = document.querySelector('#thumbnail');
      const errorDiv = document.querySelector('#error');
      
      function showError(msg) {
          errorDiv.innerText = `ERROR: ${msg}`;
      }
      
      async function blobToDataURL(blob) {
          return new Promise((resolve, reject) => {
              const reader = new FileReader();
              reader.onload = () => resolve(reader.result);
              reader.onerror = () => reject(reader.error);
              reader.onabort = () => reject(new Error("Read aborted"));
              reader.readAsDataURL(blob);
          });
      }
      
      async function createThumbnail(video) {
          const payload = new FormData();
          payload.append('video', video);
      
          const res = await fetch(API_ENDPOINT, {
              method: 'POST',
              body: payload
          });
      
          if (!res.ok) {
              throw new Error('Creating thumbnail failed');
          }
      
          const thumbnailBlob = await res.blob();
          const thumbnail = await blobToDataURL(thumbnailBlob);
      
          return thumbnail;
      }
      
      submitButton.addEventListener('click', async () => {
          const { files } = fileInput;
      
          if (files.length > 0) {
              const file = files[0];
      
              try {
                  const thumbnail = await createThumbnail(file);
                  thumbnailPreview.src = thumbnail;
              } catch(error) {
                  showError(error);
              }
          } else {
              showError('Please select a file');
          }
      });
      

      Start the server again by running:

      With your client now set up, uploading the video file here will result in receiving an error message. This is because the /thumbnail endpoint is not built yet. In the next step, you’ll create the /thumbnail endpoint in Express to accept the video file and create the thumbnail.

      Step 3 — Setting Up an Endpoint to Accept Binary Data

      In this step, you will set up a POST request for the /thumbnail endpoint and use middleware to accept multipart requests.

      Open server.mjs in an editor:

      Then, import multer at the top of the file:

      server.mjs

      import express from 'express';
      import cors from 'cors';
      import multer from 'multer';
      ...
      

      Multer is a middleware that processes incoming multipart/form-data requests before passing them to your endpoint handler. It extracts fields and files from the body and makes them available on the request object in Express: text fields on req.body, and files on req.file or req.files. You can configure where to store the uploaded files and set limits on file size and format.

      After importing it, initialize the multer middleware with the following options:

      server.mjs

      ...
      const app = express();
      const port = 3000;
      
      const upload = multer({
          storage: multer.memoryStorage(),
          limits: { fileSize: 100 * 1024 * 1024 }
      });
      
      app.use(cors());
      ...
      

      The storage option lets you choose where to store the incoming files. Calling multer.memoryStorage() will initialize a storage engine that keeps files in Buffer objects in memory as opposed to writing them to disk. The limits option lets you define various limits on what files will be accepted. Set the fileSize limit to 100MB or a different number that matches your needs and the amount of memory available on your server. This will prevent your API from crashing when the input file is too big.
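
      When an upload exceeds the fileSize limit, multer passes an error with the code LIMIT_FILE_SIZE to Express. Without a dedicated handler, the client typically just receives a generic error response. The sketch below shows one way to turn that into a clearer 413 response. The handleUploadError() name and the message text are assumptions for illustration and are not part of this tutorial's server.mjs.

```javascript
// Express error-handling middleware (note the four arguments).
// Assumes oversized uploads arrive as an error whose code is 'LIMIT_FILE_SIZE'.
function handleUploadError(err, req, res, next) {
    if (err && err.code === 'LIMIT_FILE_SIZE') {
        // 413 Payload Too Large tells the client why the upload was rejected.
        res.status(413).send('File too large');
        return;
    }
    // Anything else is passed on to the next error handler.
    next(err);
}

// In server.mjs, you would register it after your routes:
// app.use(handleUploadError);
```

      Keeping the check on err.code means the handler does not swallow unrelated errors, which still reach Express's default error handling.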

      Note: Due to the limitations of WebAssembly, ffmpeg.wasm cannot handle input files over 2GB in size.

      Next, set up the POST /thumbnail endpoint itself:

      server.mjs

      ...
      app.use(cors());
      
      app.post('/thumbnail', upload.single('video'), async (req, res) => {
          const videoData = req.file.buffer;
      
          res.sendStatus(200);
      });
      
      app.listen(port, () => {
          console.log(`[info] ffmpeg-api listening at http://localhost:${port}`)
      });
      

      The upload.single('video') call will set up a middleware for that endpoint only that will parse the body of a multipart request that includes a single file. The first parameter is the field name. It must match the one you gave to FormData when creating the request in client.js. In this case, it’s video. multer will then attach the parsed file to the req parameter. The content of the file will be under req.file.buffer.

      At this point, the endpoint doesn’t do anything with the data it receives. It acknowledges the request by sending an empty 200 response. In the next step, you’ll replace that with the code that extracts a thumbnail from the video data received.

      Step 4 — Processing Media With ffmpeg.wasm

      In this step, you’ll use ffmpeg.wasm to extract a thumbnail from the video file received by the POST /thumbnail endpoint.

      ffmpeg.wasm is a pure WebAssembly and JavaScript port of FFmpeg. Its main goal is to allow running FFmpeg directly in the browser. However, because Node.js is built on top of V8 — Chrome’s JavaScript engine — you can use the library on the server too.

      The benefit of using a native port of FFmpeg over a wrapper built on top of the ffmpeg command is that if you’re planning to deploy your app with Docker, you don’t have to build a custom image that includes both FFmpeg and Node.js. This will save you time and reduce the maintenance burden of your service.

      Add the following import to the top of server.mjs:

      server.mjs

      import express from 'express';
      import cors from 'cors';
      import multer from 'multer';
      import { createFFmpeg } from '@ffmpeg/ffmpeg';
      ...
      

      Then, create an instance of ffmpeg.wasm and start loading the core:

      server.mjs

      ...
      import { createFFmpeg } from '@ffmpeg/ffmpeg';
      
      const ffmpegInstance = createFFmpeg({ log: true });
      let ffmpegLoadingPromise = ffmpegInstance.load();
      
      const app = express();
      ...
      

      The ffmpegInstance variable holds a reference to the library. Calling ffmpegInstance.load() starts loading the core into memory asynchronously and returns a promise. Store the promise in the ffmpegLoadingPromise variable so that you can check whether the core has loaded.

      Next, define the following helper function that will use ffmpegLoadingPromise to wait for the core to load in case the first request arrives before it’s ready:

      server.mjs

      ...
      let ffmpegLoadingPromise = ffmpegInstance.load();
      
      async function getFFmpeg() {
          if (ffmpegLoadingPromise) {
              await ffmpegLoadingPromise;
              ffmpegLoadingPromise = undefined;
          }
      
          return ffmpegInstance;
      }
      
      const app = express();
      ...
      

      The getFFmpeg() function returns a reference to the library stored in the ffmpegInstance variable. Before returning it, it checks whether the library has finished loading. If not, it will wait until ffmpegLoadingPromise resolves. In case the first request to your POST /thumbnail endpoint arrives before ffmpegInstance is ready to use, your API will wait and resolve it when it can rather than rejecting it.
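
      To see this waiting behavior in isolation, you can run the following sketch in Node.js. It replaces ffmpegInstance with a stand-in object (fakeInstance, a made-up name) whose load() resolves after a short delay, and then calls the helper twice at the same time.

```javascript
// Stand-in for ffmpegInstance: load() resolves after a short delay.
let loadCalls = 0;
const fakeInstance = {
    loaded: false,
    load() {
        loadCalls += 1;
        return new Promise((resolve) => {
            setTimeout(() => {
                this.loaded = true;
                resolve();
            }, 50);
        });
    }
};

// Start loading once, at startup, and remember the promise.
let loadingPromise = fakeInstance.load();

async function getInstance() {
    if (loadingPromise) {
        await loadingPromise;       // wait only while loading is in progress
        loadingPromise = undefined; // later calls skip the wait entirely
    }
    return fakeInstance;
}

// Two "requests" arriving before loading has finished.
async function demo() {
    const [a, b] = await Promise.all([getInstance(), getInstance()]);
    return { sameObject: a === b, loaded: a.loaded, loadCalls };
}

demo().then((result) => console.log(result));
```

      Both callers receive the same, fully loaded object, and load() runs only once, no matter how many requests arrive during startup.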

      Now, implement the POST /thumbnail endpoint handler. Replace res.sendStatus(200); at the end of the function with a call to getFFmpeg() to get a reference to ffmpeg.wasm when it’s ready:

      server.mjs

      ...
      app.post('/thumbnail', upload.single('video'), async (req, res) => {
          const videoData = req.file.buffer;
      
          const ffmpeg = await getFFmpeg();
      });
      ...
      

      ffmpeg.wasm works on top of an in-memory file system. You can read and write to it using ffmpeg.FS. When running FFmpeg operations, you will pass virtual file names to the ffmpeg.run function as an argument the same way as you would when working with the CLI tool. Any output files created by FFmpeg will be written to the file system for you to retrieve.

      In this case, the input file is a video. The output file will be a single PNG image. Define the following variables:

      server.mjs

      ...
          const ffmpeg = await getFFmpeg();
      
          const inputFileName = `input-video`;
          const outputFileName = `output-image.png`;
          let outputData = null;
      });
      ...
      

      The file names will be used on the virtual file system. outputData is where you’ll store the thumbnail when it’s ready.

      Call ffmpeg.FS() to write the video data to the in-memory file system:

      server.mjs

      ...
          let outputData = null;
      
          ffmpeg.FS('writeFile', inputFileName, videoData);
      });
      ...
      

      Then, run the FFmpeg operation:

      server.mjs

      ...
          ffmpeg.FS('writeFile', inputFileName, videoData);
      
          await ffmpeg.run(
              '-ss', '00:00:01.000',
              '-i', inputFileName,
              '-frames:v', '1',
              outputFileName
          );
      });
      ...
      

      The -i parameter specifies the input file. -ss seeks to the specified time (in this case, 1 second from the beginning of the video). -frames:v limits the number of frames that will be written to the output (a single frame in this scenario). outputFileName at the end indicates where FFmpeg will write the output.
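
      If you later want to take the thumbnail at a different position, it can help to build the argument list from a number of seconds. The following is a minimal sketch. The formatTimestamp() and thumbnailArgs() helpers are hypothetical and are not used elsewhere in this tutorial.

```javascript
// Formats a number of seconds as HH:MM:SS.mmm, the timestamp form used with -ss above.
function formatTimestamp(seconds) {
    const totalMs = Math.round(seconds * 1000);
    const ms = totalMs % 1000;
    const totalSeconds = Math.floor(totalMs / 1000);
    const s = totalSeconds % 60;
    const m = Math.floor(totalSeconds / 60) % 60;
    const h = Math.floor(totalSeconds / 3600);
    const pad = (n, width) => String(n).padStart(width, '0');
    return `${pad(h, 2)}:${pad(m, 2)}:${pad(s, 2)}.${pad(ms, 3)}`;
}

// Builds the same argument list as the ffmpeg.run() call above,
// with the seek position supplied as a number of seconds.
function thumbnailArgs(inputFileName, outputFileName, seconds) {
    return [
        '-ss', formatTimestamp(seconds),
        '-i', inputFileName,
        '-frames:v', '1',
        outputFileName
    ];
}

console.log(thumbnailArgs('input-video', 'output-image.png', 1));
```

      Inside the endpoint handler, you could then spread the array into the call: await ffmpeg.run(...thumbnailArgs(inputFileName, outputFileName, 1)).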

      After FFmpeg exits, use ffmpeg.FS() to read the data from the file system and delete both the input and output files to free up memory:

      server.mjs

      ...
          await ffmpeg.run(
              '-ss', '00:00:01.000',
              '-i', inputFileName,
              '-frames:v', '1',
              outputFileName
          );
      
          outputData = ffmpeg.FS('readFile', outputFileName);
          ffmpeg.FS('unlink', inputFileName);
          ffmpeg.FS('unlink', outputFileName);
      });
      ...
      

      Finally, dispatch the output data in the body of the response:

      server.mjs

      ...
          ffmpeg.FS('unlink', outputFileName);
      
          res.writeHead(200, {
              'Content-Type': 'image/png',
              'Content-Disposition': `attachment;filename=${outputFileName}`,
              'Content-Length': outputData.length
          });
          res.end(Buffer.from(outputData, 'binary'));
      });
      ...
      

      Calling res.writeHead() dispatches the response head. The second parameter includes custom HTTP headers with information about the data in the body of the response that will follow. The res.end() function sends the data from its first argument as the body of the response and finalizes the response. The outputData variable is a raw array of bytes as returned by ffmpeg.FS(). Passing it to Buffer.from() initializes a Buffer to ensure the binary data will be handled correctly by res.end().

      At this point, your POST /thumbnail endpoint implementation should look like this:

      server.mjs

      ...
      app.post('/thumbnail', upload.single('video'), async (req, res) => {
          const videoData = req.file.buffer;
      
          const ffmpeg = await getFFmpeg();
      
          const inputFileName = `input-video`;
          const outputFileName = `output-image.png`;
          let outputData = null;
      
          ffmpeg.FS('writeFile', inputFileName, videoData);
      
          await ffmpeg.run(
              '-ss', '00:00:01.000',
              '-i', inputFileName,
              '-frames:v', '1',
              outputFileName
          );
      
          outputData = ffmpeg.FS('readFile', outputFileName);
          ffmpeg.FS('unlink', inputFileName);
          ffmpeg.FS('unlink', outputFileName);
      
          res.writeHead(200, {
              'Content-Type': 'image/png',
              'Content-Disposition': `attachment;filename=${outputFileName}`,
              'Content-Length': outputData.length
          });
          res.end(Buffer.from(outputData, 'binary'));
      });
      ...
      

      Aside from the 100MB file limit for uploads, there’s no input validation or error handling. When ffmpeg.wasm fails to process a file, reading the output from the virtual file system will fail and prevent the response from being sent. For the purposes of this tutorial, wrap the implementation of the endpoint in a try-catch block to handle that scenario:

      server.mjs

      ...
      app.post('/thumbnail', upload.single('video'), async (req, res) => {
          try {
              const videoData = req.file.buffer;
      
              const ffmpeg = await getFFmpeg();
      
              const inputFileName = `input-video`;
              const outputFileName = `output-image.png`;
              let outputData = null;
      
              ffmpeg.FS('writeFile', inputFileName, videoData);
      
              await ffmpeg.run(
                  '-ss', '00:00:01.000',
                  '-i', inputFileName,
                  '-frames:v', '1',
                  outputFileName
              );
      
              outputData = ffmpeg.FS('readFile', outputFileName);
              ffmpeg.FS('unlink', inputFileName);
              ffmpeg.FS('unlink', outputFileName);
      
              res.writeHead(200, {
                  'Content-Type': 'image/png',
                  'Content-Disposition': `attachment;filename=${outputFileName}`,
                  'Content-Length': outputData.length
              });
              res.end(Buffer.from(outputData, 'binary'));
          } catch(error) {
              console.error(error);
              res.sendStatus(500);
          }
      });
      ...
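
      The try-catch block turns every failure into a 500 response, including requests that arrive without a file. As a sketch of what basic input validation could look like, the hypothetical validateUpload() helper below checks the object that multer stores in req.file. The status codes and messages are assumptions. Note that the mimetype value is supplied by the client, so treat it as a hint rather than a guarantee.

```javascript
// Returns null when the upload looks usable, or a { status, message } object otherwise.
// `file` is assumed to have the shape multer gives req.file (buffer, mimetype).
function validateUpload(file) {
    if (!file || !file.buffer || file.buffer.length === 0) {
        return { status: 400, message: 'No video file was uploaded' };
    }
    if (typeof file.mimetype !== 'string' || !file.mimetype.startsWith('video/')) {
        return { status: 415, message: 'Only video uploads are supported' };
    }
    return null;
}

// Possible use at the top of the endpoint handler:
// const problem = validateUpload(req.file);
// if (problem) {
//     return res.status(problem.status).send(problem.message);
// }

console.log(validateUpload(undefined));
console.log(validateUpload({ buffer: Buffer.from([1, 2, 3]), mimetype: 'video/mp4' }));
```

      Rejecting these requests early also keeps them from reaching ffmpeg.wasm at all.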
      

      Additionally, ffmpeg.wasm cannot handle two requests in parallel. You can verify this yourself by launching the server:

      • node --experimental-wasm-threads server.mjs

      Note the flag required for ffmpeg.wasm to work. The library depends on WebAssembly threads and bulk memory operations. These have been in V8/Chrome since 2019. However, as of Node.js v16.11.0, WebAssembly threads remain behind a flag in case there might be changes before the proposal is finalised. Bulk memory operations also require a flag in older versions of Node. If you’re running Node.js 15 or lower, add --experimental-wasm-bulk-memory as well.

      The output of the command will look like this:

      Output

      [info] use ffmpeg.wasm v0.10.1
      [info] load ffmpeg-core
      [info] loading ffmpeg-core
      [info] fetch ffmpeg.wasm-core script from @ffmpeg/core
      [info] ffmpeg-api listening at http://localhost:3000
      [info] ffmpeg-core loaded

      Open client.html in a web browser and select a video file. When you click the Create Thumbnail button, you should see the thumbnail appear on the page. Behind the scenes, the site uploads the video to the API, which processes it and responds with the image. However, when you click the button repeatedly in quick succession, the API will handle the first request. The subsequent requests will fail:

      Output

      Error: ffmpeg.wasm can only run one command at a time
          at Object.run (.../ffmpeg-api/node_modules/@ffmpeg/ffmpeg/src/createFFmpeg.js:126:13)
          at file://.../ffmpeg-api/server.mjs:54:26
          at runMicrotasks (<anonymous>)
          at processTicksAndRejections (internal/process/task_queues.js:95:5)

      In the next section, you’ll learn how to deal with concurrent requests.

      Step 5 — Handling Concurrent Requests

      Since ffmpeg.wasm can only execute a single operation at a time, you’ll need a way of serializing requests that come in and processing them one at a time. In this scenario, a promise queue is a perfect solution. Instead of starting to process each request right away, it will be queued up and processed when all the requests that arrived before it have been handled.

      Open server.mjs in your preferred editor:

      Import p-queue at the top of server.mjs:

      server.mjs

      import express from 'express';
      import cors from 'cors';
      import multer from 'multer';
      import { createFFmpeg } from '@ffmpeg/ffmpeg';
      import PQueue from 'p-queue';
      ...
      

      Then, create a new queue near the top of the server.mjs file, below the ffmpegLoadingPromise variable:

      server.mjs

      ...
      const ffmpegInstance = createFFmpeg({ log: true });
      let ffmpegLoadingPromise = ffmpegInstance.load();
      
      const requestQueue = new PQueue({ concurrency: 1 });
      ...
      

      In the POST /thumbnail endpoint handler, wrap the calls to ffmpeg in a function that will be queued up:

      server.mjs

      ...
      app.post('/thumbnail', upload.single('video'), async (req, res) => {
          try {
              const videoData = req.file.buffer;
      
              const ffmpeg = await getFFmpeg();
      
              const inputFileName = `input-video`;
              const outputFileName = `thumbnail.png`;
              let outputData = null;
      
              await requestQueue.add(async () => {
                  ffmpeg.FS('writeFile', inputFileName, videoData);
      
                  await ffmpeg.run(
                      '-ss', '00:00:01.000',
                      '-i', inputFileName,
                      '-frames:v', '1',
                      outputFileName
                  );
      
                  outputData = ffmpeg.FS('readFile', outputFileName);
                  ffmpeg.FS('unlink', inputFileName);
                  ffmpeg.FS('unlink', outputFileName);
              });
      
              res.writeHead(200, {
                  'Content-Type': 'image/png',
                  'Content-Disposition': `attachment;filename=${outputFileName}`,
                  'Content-Length': outputData.length
              });
              res.end(Buffer.from(outputData, 'binary'));
          } catch(error) {
              console.error(error);
              res.sendStatus(500);
          }
      });
      ...
      

      Every time a new request comes in, it will only start processing when there’s nothing else queued up in front of it. Note that the final sending of the response can happen asynchronously. Once the ffmpeg.wasm operation finishes running, another request can start processing while the response goes out.
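
      To build some intuition for what a queue with a concurrency of 1 does, here is a minimal serial queue based on promise chaining. It is only an illustration of the idea and not how p-queue is implemented. All names in it are made up.

```javascript
// Minimal serial queue: each task starts only after the previous one has settled.
function createSerialQueue() {
    let tail = Promise.resolve();
    return {
        add(task) {
            const result = tail.then(() => task());
            // Keep the chain going even if a task rejects.
            tail = result.catch(() => {});
            return result;
        }
    };
}

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function demoQueue() {
    const queue = createSerialQueue();
    const events = [];
    const job = (name, ms) => async () => {
        events.push(`start ${name}`);
        await sleep(ms);
        events.push(`end ${name}`);
        return name;
    };

    // The slower job is added first; the faster one still waits for it.
    const results = await Promise.all([
        queue.add(job('a', 30)),
        queue.add(job('b', 5))
    ]);
    return { events, results };
}

demoQueue().then(({ events }) => console.log(events.join(', ')));
// prints "start a, end a, start b, end b"
```

      Job b does not start until job a has finished, even though it was added right away and would complete sooner on its own. This is the same guarantee that requestQueue gives each ffmpeg.wasm operation.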

      To test that everything works as expected, start up the server again:

      • node --experimental-wasm-threads server.mjs

      Open the client.html file in your browser and try uploading a file.

      A screenshot of client.html with a thumbnail loaded

      With the queue in place, the API will now respond every time. The requests will be handled sequentially in the order in which they arrive.

      Conclusion

      In this article, you built a Node.js service that extracts a thumbnail from a video using ffmpeg.wasm. You learned how to upload binary data from the browser to your Express API using multipart requests and how to process media with FFmpeg in Node.js without relying on external tools or having to write data to disk.

      FFmpeg is an incredibly versatile tool. You can use the knowledge from this tutorial to take advantage of any features that FFmpeg supports and use them in your project. For example, to generate a three-second GIF, change the ffmpeg.run call in the POST /thumbnail endpoint to the following:

      server.mjs

      ...
      await ffmpeg.run(
          '-y',
          '-t', '3',
          '-i', inputFileName,
          '-filter_complex', 'fps=5,scale=720:-1:flags=lanczos[x];[x]split[x1][x2];[x1]palettegen[p];[x2][p]paletteuse',
          '-f', 'gif',
          outputFileName
      );
      ...
      

      The library accepts the same parameters as the original ffmpeg CLI tool. You can use the official documentation to find a solution for your use case and test it quickly in the terminal.

      Thanks to ffmpeg.wasm being self-contained, you can dockerize this service using the stock Node.js base images and scale your service up by keeping multiple nodes behind a load balancer. Follow the tutorial How To Build a Node.js Application with Docker to learn more.

      If your use case requires performing more expensive operations, such as transcoding large videos, make sure that you run your service on machines with enough memory to store them. Due to current limitations in WebAssembly, the maximum input file size cannot exceed 2GB, although this might change in the future.

      Additionally, ffmpeg.wasm cannot take advantage of some x86 assembly optimizations from the original FFmpeg codebase. That means some operations can take a long time to finish. If that’s the case, consider whether this is the right solution for your use case. Alternatively, make requests to your API asynchronous. Instead of waiting for the operation to finish, queue it up and respond with a unique ID. Create another endpoint that the clients can query to find out whether the processing ended and the output file is ready. Learn more about the asynchronous request-reply pattern for REST APIs and how to implement it.
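
      As a rough sketch of the asynchronous request-reply idea, the in-memory job store below hands out an ID immediately and lets clients check on the job later. Everything in it is hypothetical, including the createJobStore() name, the status values, and the counter-based IDs. A real service would likely need persistent storage and IDs that cannot be guessed.

```javascript
// In-memory job store: submit() returns an ID right away,
// and the work finishes in the background.
function createJobStore() {
    const jobs = new Map();
    let nextId = 1;

    return {
        submit(work) {
            const id = String(nextId++);
            const job = { status: 'pending', result: null, error: null };
            jobs.set(id, job);

            Promise.resolve()
                .then(work)
                .then((result) => {
                    job.status = 'done';
                    job.result = result;
                })
                .catch((error) => {
                    job.status = 'failed';
                    job.error = String(error);
                });

            return id;
        },
        // Returns the job record, or null for an unknown ID.
        get(id) {
            return jobs.get(id) || null;
        }
    };
}

const store = createJobStore();
const jobId = store.submit(async () => 'thumbnail bytes would go here');
console.log(jobId, store.get(jobId).status); // the job is still pending at this point
```

      In an Express app, the upload endpoint could respond with 202 Accepted and the ID, while a second endpoint (for example, a hypothetical GET /thumbnail/:id) would look the job up with get() and return the result once its status is done.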




      How To Test Your Data With Great Expectations


      The author selected the Diversity in Tech Fund to receive a donation as part of the Write for DOnations program.

      Introduction

      In this tutorial, you will set up a local deployment of Great Expectations, an open source data validation and documentation library written in Python. Data validation is crucial to ensuring that the data you process in your pipelines is correct and free of any data quality issues that might occur due to errors such as incorrect inputs or transformation bugs. Great Expectations allows you to establish assertions about your data called Expectations, and validate any data using those Expectations.

      When you’re finished, you’ll be able to connect Great Expectations to your data, create a suite of Expectations, validate a batch of data using those Expectations, and generate a data quality report with the results of your validation.

      Prerequisites

      To complete this tutorial, you will need:

      Step 1 — Installing Great Expectations and Initializing a Great Expectations Project

      In this step, you will install the Great Expectations package in your local Python environment, download the sample data you’ll use in this tutorial, and initialize a Great Expectations project.

      To begin, open a terminal and make sure to activate your virtual Python environment. Install the Great Expectations Python package and command-line tool (CLI) with the following command:

      • pip install great_expectations==0.13.35

      Note: This tutorial was developed for Great Expectations version 0.13.35 and may not be applicable to other versions.

      To get access to the example data, run the following git command to clone the repository, then change into it as your working directory:

      • git clone https://github.com/do-community/great_expectations_tutorial
      • cd great_expectations_tutorial

      The repository only contains one folder called data, which contains two example CSV files with data that you will use in this tutorial. Take a look at the contents of the data directory:

      You’ll see the following output:

      Output

      yellow_tripdata_sample_2019-01.csv yellow_tripdata_sample_2019-02.csv

      Great Expectations works with many different types of data, such as connections to relational databases, Spark dataframes, and various file formats. For the purpose of this tutorial, you will use these CSV files containing a small set of taxi ride data to get started.

      Finally, initialize your directory as a Great Expectations project by running the following command. Make sure to use the --v3-api flag, as this will switch you to using the most recent API of the package:

      • great_expectations --v3-api init

      When asked OK to proceed? [Y/n]:, press ENTER to proceed.

      This will create a folder called great_expectations, which contains the basic configuration for your Great Expectations project, also called the Data Context. You can inspect the contents of the folder:

      You will see the first level of files and subdirectories that were created inside the great_expectations folder:

      Output

      checkpoints great_expectations.yml plugins expectations notebooks uncommitted

      The folders store all the relevant content for your Great Expectations setup. The great_expectations.yml file contains all important configuration information. Feel free to explore the folders and configuration file a little more before moving on to the next step in the tutorial.

      In the next step, you will add a Datasource to point Great Expectations at your data.

      Step 2 — Adding a Datasource

      In this step, you will configure a Datasource in Great Expectations, which allows you to automatically create data assertions called Expectations as well as validate data with the tool.

      While in your project directory, run the following command:

      • great_expectations --v3-api datasource new

      You will see the following output. Enter the options shown when prompted to configure a file-based Datasource for the data directory:

      Output

      What data would you like Great Expectations to connect to?
          1. Files on a filesystem (for processing with Pandas or Spark)
          2. Relational database (SQL)
      : 1

      What are you processing your files with?
          1. Pandas
          2. PySpark
      : 1

      Enter the path of the root directory where the data files are stored. If files are on local disk enter a path relative to your current working directory or an absolute path.
      : data

      After confirming the directory path with ENTER, Great Expectations will open a Jupyter notebook in your web browser, which allows you to complete the configuration of the Datasource and store it to your Data Context. The following screenshot shows the first few cells of the notebook.

      Screenshot of a Jupyter notebook

      The notebook contains several pre-populated cells of Python code to configure your Datasource. You can modify the settings for the Datasource, such as the name, if you like. However, for the purpose of this tutorial, you’ll leave everything as-is and execute all cells using the Cell > Run All menu option. If run successfully, the last cell output will look as follows:

      Output

      [{'data_connectors': {'default_inferred_data_connector_name': {'module_name': 'great_expectations.datasource.data_connector', 'base_directory': '../data', 'class_name': 'InferredAssetFilesystemDataConnector', 'default_regex': {'group_names': ['data_asset_name'], 'pattern': '(.*)'}}, 'default_runtime_data_connector_name': {'module_name': 'great_expectations.datasource.data_connector', 'class_name': 'RuntimeDataConnector', 'batch_identifiers': ['default_identifier_name']}}, 'module_name': 'great_expectations.datasource', 'class_name': 'Datasource', 'execution_engine': {'module_name': 'great_expectations.execution_engine', 'class_name': 'PandasExecutionEngine'}, 'name': 'my_datasource'}]

      This shows that you have added a new Datasource called my_datasource to your Data Context. Feel free to read through the instructions in the notebook to learn more about the different configuration options before moving on to the next step.

      Warning: Before moving forward, close the browser tab with the notebook, return to your terminal, and press CTRL+C to shut down the running notebook server.

      You have now successfully set up a Datasource that points at the data directory, which will allow you to access the CSV files in the directory through Great Expectations. In the next step, you will use one of these CSV files in your Datasource to automatically generate Expectations with a profiler.

      Step 3 — Creating an Expectation Suite With an Automated Profiler

      In this step of the tutorial, you will use the built-in Profiler to create a set of Expectations based on some existing data. For this purpose, let’s take a closer look at the sample data that you downloaded:

      • The files yellow_tripdata_sample_2019-01.csv and yellow_tripdata_sample_2019-02.csv contain taxi ride data from January and February 2019, respectively.
      • This tutorial assumes that you know the January data is correct, and that you want to ensure that any subsequent data files match the January data in terms of the number of rows, the columns, and the distributions of certain column values.

      For this purpose, you will create Expectations (data assertions) based on certain properties of the January data and then, in a later step, use those Expectations to validate the February data. Let’s get started by creating an Expectation Suite, which is a set of Expectations that are grouped together:

      • great_expectations --v3-api suite new

      By selecting the options shown in the output below, you specify that you would like to use a profiler to generate Expectations automatically, using the yellow_tripdata_sample_2019-01.csv data file as an input. Enter the name my_suite as the Expectation Suite name when prompted and press ENTER at the end when asked Would you like to proceed? [Y/n]:

      Output

      Using v3 (Batch Request) API

      How would you like to create your Expectation Suite?
          1. Manually, without interacting with a sample batch of data (default)
          2. Interactively, with a sample batch of data
          3. Automatically, using a profiler
      : 3

      A batch of data is required to edit the suite - let's help you to specify it.

      Which data asset (accessible by data connector "my_datasource_example_data_connector") would you like to use?
          1. yellow_tripdata_sample_2019-01.csv
          2. yellow_tripdata_sample_2019-02.csv
      : 1

      Name the new Expectation Suite [yellow_tripdata_sample_2019-01.csv.warning]: my_suite

      When you run this notebook, Great Expectations will store these expectations in a new Expectation Suite "my_suite" here:

        <path_to_project>/great_expectations_tutorial/great_expectations/expectations/my_suite.json

      Would you like to proceed? [Y/n]: <press ENTER>

      This will open another Jupyter notebook that lets you complete the configuration of your Expectation Suite. The notebook contains a fair amount of code to configure the built-in profiler, which looks at the CSV file you selected and creates certain types of Expectations for each column in the file based on what it finds in the data.

      Scroll down to the second code cell in the notebook, which contains a list of ignored_columns. By default, the profiler will ignore all columns, so let’s comment out some of them to make sure the profiler creates Expectations for them. Modify the code so it looks like this:

      ignored_columns = [
      #     "vendor_id"
      # ,    "pickup_datetime"
      # ,    "dropoff_datetime"
      # ,    "passenger_count"
          "trip_distance"
      ,    "rate_code_id"
      ,    "store_and_fwd_flag"
      ,    "pickup_location_id"
      ,    "dropoff_location_id"
      ,    "payment_type"
      ,    "fare_amount"
      ,    "extra"
      ,    "mta_tax"
      ,    "tip_amount"
      ,    "tolls_amount"
      ,    "improvement_surcharge"
      ,    "total_amount"
      ,    "congestion_surcharge"
      ,]
      

      Make sure to remove the comma before "trip_distance". By commenting out the columns vendor_id, pickup_datetime, dropoff_datetime, and passenger_count, you are telling the profiler to generate Expectations for those columns. The profiler will also generate table-level Expectations, such as the number and names of columns in your data, and the number of rows. Once again, execute all cells in the notebook by using the Cell > Run All menu option.
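
      The effect of the edited list can be checked with a few lines of plain Python. This is only a sketch of the selection logic, not the profiler's code: the names ALL_COLUMNS and profiled_columns are introduced here for illustration, and the full column list is taken from the notebook's default ignored_columns, which names every column.

```python
# Every column named in the notebook's default ignored_columns list.
ALL_COLUMNS = [
    "vendor_id", "pickup_datetime", "dropoff_datetime", "passenger_count",
    "trip_distance", "rate_code_id", "store_and_fwd_flag",
    "pickup_location_id", "dropoff_location_id", "payment_type",
    "fare_amount", "extra", "mta_tax", "tip_amount", "tolls_amount",
    "improvement_surcharge", "total_amount", "congestion_surcharge",
]

# The list as it looks after commenting out the first four entries.
ignored_columns = [
    "trip_distance", "rate_code_id", "store_and_fwd_flag",
    "pickup_location_id", "dropoff_location_id", "payment_type",
    "fare_amount", "extra", "mta_tax", "tip_amount", "tolls_amount",
    "improvement_surcharge", "total_amount", "congestion_surcharge",
]

# Columns that are not ignored are the ones left for profiling.
profiled_columns = [col for col in ALL_COLUMNS if col not in ignored_columns]
print(profiled_columns)
```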

      When executing all cells in this notebook, two things happen:

      1. The code creates an Expectation Suite using the automated profiler and the yellow_tripdata_sample_2019-01.csv file you told it to use.
      2. The last cell in the notebook is also configured to run validation and open a new browser window with Data Docs, which is a data quality report.

      In the next step, you will take a closer look at the Data Docs that were opened in the new browser window.

      Step 4 — Exploring Data Docs

      In this step of the tutorial, you will inspect the Data Docs that Great Expectations generated and learn how to interpret the different pieces of information. Go to the browser window that just opened and take a look at the page, shown in the screenshot below.

      Screenshot of Data Docs

      At the top of the page, you will see a box titled Overview, which contains some information about the validation you just ran using your newly created Expectation Suite my_suite. It will tell you Status: Succeeded and show some basic statistics about how many Expectations were run. If you scroll further down, you will see a section titled Table-Level Expectations. It contains two rows of Expectations, showing the Status, Expectation, and Observed Value for each row. Below the table Expectations, you will see the column-level Expectations for each of the columns you commented out in the notebook.

      Let’s focus on one specific Expectation: the passenger_count column has an Expectation stating “values must belong to this set: 1 2 3 4 5 6”, which is marked with a green checkmark and has an Observed Value of “0% unexpected”. This tells you that the profiler looked at the values in the passenger_count column in the January CSV file and detected only the values 1 through 6, meaning that all taxi rides had between 1 and 6 passengers. Great Expectations then created an Expectation for this fact. The last cell in the notebook then triggered validation of the January CSV file and found no unexpected values. This result is trivially true, since the same data that was used to create the Expectation was also used for validation.
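
      The following sketch shows, in plain Python, why validating the profiled data against its own Expectation cannot fail. It is a simplified stand-in for the profiler, not its actual logic: the functions profile_value_set and unexpected_percent are hypothetical, and the sample values are made up rather than read from the January CSV file.

```python
def profile_value_set(values):
    """Derive a 'values must belong to this set' rule from observed data."""
    return set(values)


def unexpected_percent(values, allowed):
    """Return the percentage of values that fall outside the allowed set."""
    unexpected = [value for value in values if value not in allowed]
    return 100.0 * len(unexpected) / len(values)


# Made-up stand-in for the January passenger_count column.
january_counts = [1, 2, 1, 3, 6, 5, 4, 1, 2]

allowed = profile_value_set(january_counts)
pct = unexpected_percent(january_counts, allowed)
print(sorted(allowed), pct)
```

      Since the allowed set is built from the very values being checked, every value is a member of it, and the unexpected percentage is always zero.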

      In this step, you reviewed the Data Docs and observed the passenger_count column for its Expectation. In the next step, you’ll see how you can validate a different batch of data.

      Step 5 — Creating a Checkpoint and Running Validation

      In the final step of this tutorial, you will create a new Checkpoint, which bundles an Expectation Suite and a batch of data to execute validation of that data. After creating the Checkpoint, you will then run it to validate the February taxi data CSV file and see whether the file passed the Expectations you previously created. To begin, return to your terminal and stop the Jupyter notebook by pressing CTRL+C if it is still running. The following command will start the workflow to create a new Checkpoint called my_checkpoint:

      • great_expectations --v3-api checkpoint new my_checkpoint

      This will open a Jupyter notebook with some pre-populated code to configure the Checkpoint. The second code cell in the notebook will have a random data_asset_name pre-populated from your existing Datasource, which will be one of the two CSV files in the data directory you’ve seen earlier. Ensure that the data_asset_name is yellow_tripdata_sample_2019-02.csv and modify the code if needed to use the correct filename.

      my_checkpoint_name = "my_checkpoint" # This was populated from your CLI command.
      
      yaml_config = f"""
      name: {my_checkpoint_name}
      config_version: 1.0
      class_name: SimpleCheckpoint
      run_name_template: "%Y%m%d-%H%M%S-my-run-name-template"
      validations:
        - batch_request:
            datasource_name: my_datasource
            data_connector_name: default_inferred_data_connector_name
            data_asset_name: yellow_tripdata_sample_2019-02.csv
            data_connector_query:
              index: -1
          expectation_suite_name: my_suite
      """
      print(yaml_config)
      

      This configuration snippet configures a new Checkpoint, which reads the data asset yellow_tripdata_sample_2019-02.csv, i.e., your February CSV file, and validates it using the Expectation Suite my_suite. Confirm that you modified the code correctly, then execute all cells in the notebook. This will save the new Checkpoint to your Data Context.
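
      The run_name_template value in this configuration uses date format codes. As a hedged illustration, the sketch below assumes the template is expanded with strftime-style formatting at run time; the helper render_run_name and the fixed timestamp are hypothetical and only show what the resulting run name could look like.

```python
from datetime import datetime

# Template copied from the Checkpoint configuration above.
RUN_NAME_TEMPLATE = "%Y%m%d-%H%M%S-my-run-name-template"


def render_run_name(template, run_time):
    """Expand strftime-style codes in the template (assumed behavior)."""
    return run_time.strftime(template)


# A fixed, made-up timestamp so the result is reproducible.
run_name = render_run_name(RUN_NAME_TEMPLATE, datetime(2019, 2, 1, 9, 30, 0))
print(run_name)
```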

      Finally, in order to run this new Checkpoint and validate the February data, scroll down to the last cell in the notebook. Uncomment the code in the cell to look as follows:

      context.run_checkpoint(checkpoint_name=my_checkpoint_name)
      context.open_data_docs()
      

      Select the cell and run it using the Cell > Run Cells menu option or the SHIFT+ENTER keyboard shortcut. This will open Data Docs in a new browser tab.

      On the Validation Results overview page, click on the topmost run to navigate to the Validation Result details page. The Validation Result details page will look very similar to the page you saw in the previous step, but it will now show that the Expectation Suite failed when validating the new CSV file. Scroll through the page to see which Expectations have a red X next to them, marking them as failed.

      Find the Expectation on the passenger_count column you looked at in the previous step: “values must belong to this set: 1 2 3 4 5 6”. You will notice that it now shows up as failed and reports 1579 unexpected values found, or ≈15.79% of 10000 total rows. The row also displays a sample of the unexpected values that were found in the column, namely the value 0. This means that the February taxi ride data introduced the unexpected value 0 in the passenger_count column, which seems like a potential data bug. By running the Checkpoint, you validated the new data with your Expectation Suite and detected this issue.
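
      The figures in the failed row follow directly from the counts: 1579 unexpected values out of 10000 rows is 15.79%. The sketch below reproduces that arithmetic on synthetic data. It does not read the February CSV file; the function summarize_unexpected is hypothetical, and the batch is constructed only to match the totals reported in Data Docs.

```python
from collections import Counter

# Allowed set from the passenger_count Expectation.
ALLOWED = {1, 2, 3, 4, 5, 6}


def summarize_unexpected(values, allowed):
    """Count values outside the allowed set and report them as a summary."""
    unexpected = Counter(value for value in values if value not in allowed)
    count = sum(unexpected.values())
    return {
        "unexpected_count": count,
        "unexpected_percent": 100.0 * count / len(values),
        "unexpected_values": sorted(unexpected),
    }


# Synthetic batch: 1579 zeros plus 8421 valid values, 10000 rows in total.
february_counts = [0] * 1579 + [1] * 8421

summary = summarize_unexpected(february_counts, ALLOWED)
print(summary)
```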

      Note that each time you execute the run_checkpoint method in the last notebook cell, you kick off another validation run. In a production data pipeline environment, you would call the run_checkpoint method outside of a notebook whenever you’re processing a new batch of data to ensure that the new data passes all validations.
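
      As a rough sketch of what such a pipeline step could look like, the code below gates further processing on the outcome of a Checkpoint run. To keep the example runnable without a Great Expectations project, it uses a hypothetical StubContext in place of a real Data Context, and it assumes that the object returned by run_checkpoint exposes a top-level "success" flag. The function names are illustrative only.

```python
class StubContext:
    """Stand-in for a Data Context so the sketch runs without a project.

    Hypothetical; it only mimics the one method used below.
    """

    def __init__(self, success):
        self._success = success

    def run_checkpoint(self, checkpoint_name):
        # Assumption: the real result exposes a top-level "success" flag.
        return {"checkpoint_name": checkpoint_name, "success": self._success}


def batch_passes_validation(context, checkpoint_name="my_checkpoint"):
    """Run the named Checkpoint and report whether validation succeeded."""
    result = context.run_checkpoint(checkpoint_name=checkpoint_name)
    return bool(result["success"])


def main(context):
    """Return a process-style exit code: 0 on success, 1 on failure."""
    if not batch_passes_validation(context):
        print("Validation failed; stopping the pipeline.")
        return 1
    print("Validation passed; continuing.")
    return 0


exit_code = main(StubContext(success=False))
```

      In a real pipeline you would pass in the Data Context from your project instead of the stub and hand the returned code to your scheduler or to sys.exit.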

      Conclusion

      In this article, you created a first local deployment of the Great Expectations framework for data validation. You initialized a Great Expectations Data Context, created a new file-based Datasource, and automatically generated an Expectation Suite using the built-in profiler. You then created a Checkpoint to run validation against a new batch of data, and inspected the Data Docs to view the validation results.

      This tutorial only taught you the basics of Great Expectations. The package contains more options for configuring Datasources to connect to other types of data, for example relational databases. It also comes with a powerful mechanism to automatically recognize new batches of data based on pattern-matching in the tablename or filename, which allows you to only configure a Checkpoint once to validate any future data inputs. You can learn more about Great Expectations in the official documentation.


