

      How To Scrape a Website Using Node.js and Puppeteer


      The author selected the Free and Open Source Fund to receive a donation as part of the Write for DOnations program.

      Introduction

      Web scraping is the process of automating data collection from the web. The process typically deploys a “crawler” that automatically surfs the web and scrapes data from selected pages. There are many reasons why you might want to scrape data. Primarily, it makes data collection much faster by eliminating the manual data-gathering process. Scraping is also a solution when data collection is desired or needed but the website does not provide an API.

      In this tutorial, you will build a web scraping application using Node.js and Puppeteer. Your app will grow in complexity as you progress. First, you will code your app to open Chromium and load a special website designed as a web-scraping sandbox: books.toscrape.com. In the next two steps, you will scrape all the books on a single page of books.toscrape and then all the books across multiple pages. In the remaining steps, you will filter your scraping by book category and then save your data as a JSON file.

      Warning: The ethics and legality of web scraping are very complex and constantly evolving. They also differ based on your location, the location of the data, and the website in question. This tutorial scrapes a special website, books.toscrape.com, which was specifically designed to test scraper applications. Scraping any other domain falls outside the scope of this tutorial.

      Prerequisites

      Step 1 — Setting Up the Web Scraper

      With Node.js installed, you can begin setting up your web scraper. First, you will create a project root directory and then install the required dependencies. This tutorial requires just one dependency, and you will install it using Node.js's default package manager, npm. npm comes preinstalled with Node.js, so you don't need to install it separately.

      Create a folder for this project and then move into it:

      • mkdir book-scraper
      • cd book-scraper

      You will run all subsequent commands from this directory.

      We need to install one package using npm (node package manager). First, initialize npm in order to create a package.json file, which will manage your project's dependencies and metadata.

      Initialize npm for your project:
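      Run npm's init command from the project root (shown here without any flags, so that it walks you through the prompts):

      • npm init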

      npm will present a sequence of prompts. You can press ENTER for every prompt, or you can add personalized descriptions. Make sure to press ENTER and leave the default values in place when prompted for entry point: and test command:. Alternatively, you can pass the y flag to npm (npm init -y) and it will submit all the default values for you.

      Your output will look something like this:

      Output

      {
        "name": "sammy_scraper",
        "version": "1.0.0",
        "description": "a web scraper",
        "main": "index.js",
        "scripts": {
          "test": "echo \"Error: no test specified\" && exit 1"
        },
        "keywords": [],
        "author": "sammy the shark",
        "license": "ISC"
      }

      Is this OK? (yes) yes

      Type yes and press ENTER. npm will save this output as your package.json file.

      Now, use npm to install Puppeteer:

      • npm install --save puppeteer

      This command installs both Puppeteer and a version of Chromium that the Puppeteer team knows will work with its API.

      On Linux machines, Puppeteer might require some additional dependencies.

      If you are using Ubuntu 18.04, check the ‘Debian Dependencies’ dropdown inside the ‘Chrome headless doesn’t launch on UNIX’ section of Puppeteer's troubleshooting docs. You can use the following command to help find any missing dependencies:
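      One way to do this (a sketch that assumes the bundled Chromium was downloaded under node_modules/puppeteer/.local-chromium/, a path that varies by Puppeteer version) is to run ldd against the Chromium binary and look for libraries reported as not found:

      • ldd node_modules/puppeteer/.local-chromium/*/chrome-linux/chrome | grep not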

      With npm, Puppeteer, and the additional dependencies installed, your package.json file requires one last configuration before you can start coding. In this tutorial you will launch your app from the command line with npm run start. You must add some information about this start script to package.json; specifically, you must add one line under the scripts directive for your start command.

      Open the file in your preferred text editor:
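      For example, with nano:

      • nano package.json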

      Find the scripts: section and add the following configurations. Remember to place a comma at the end of the test script line, or your file will not parse correctly.

      package.json

      {
        . . .
        "scripts": {
          "test": "echo \"Error: no test specified\" && exit 1",
          "start": "node index.js"
        },
        . . .
        "dependencies": {
          "puppeteer": "^5.2.1"
        }
      }

      You will also notice that puppeteer now appears under dependencies near the end of the file. Your package.json file will not require any more revisions. Save your changes and close the editor.

      You are now ready to start coding your scraper. In the next step, you will set up a browser instance and test basic functionality of your scraper.

      Step 2 — Setting Up the Browser Instance

      When you open a traditional browser, you can do things like click buttons, navigate with your mouse, type, open the dev tools, and more. A headless browser like Chromium lets you do those same things, but programmatically and without a user interface. In this step, you will set up your scraper's browser instance. When you launch your application, it will automatically open Chromium and navigate to books.toscrape.com. These initial actions will form the basis of your program.

      Your scraper will require four .js files: browser.js, index.js, pageController.js, and pageScraper.js. In this step, you will create all four files and then update them continually as your program grows in sophistication. Start with browser.js; this file will contain the script that starts your browser.

      From your project's root directory, create and open browser.js in a text editor:
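      For example, using nano:

      • nano browser.js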

      First, you will require Puppeteer and then create an async function called startBrowser(). This function will launch the browser and return an instance of it. Add the following code:

      ./book-scraper/browser.js

      const puppeteer = require('puppeteer');
      
      async function startBrowser(){
          let browser;
          try {
              console.log("Opening the browser......");
              browser = await puppeteer.launch({
                  headless: false,
                  args: ["--disable-setuid-sandbox"],
                  'ignoreHTTPSErrors': true
              });
          } catch (err) {
              console.log("Could not create a browser instance => : ", err);
          }
          return browser;
      }
      
      module.exports = {
          startBrowser
      };
      

      Puppeteer has a .launch() method that launches an instance of a browser. This method returns a Promise, so you have to make sure the Promise resolves by using a .then block or await.

      You are using await to make sure the Promise resolves, wrapping this instance in a try-catch block of code, and then returning an instance of the browser.

      Notice that the .launch() method takes a JSON parameter with several values:

      • headless: false means the browser will run with an Interface so you can watch your script execute, while true means the browser will run in headless mode. Note, however, that if you want to deploy your scraper to the cloud, set headless back to true. Most virtual machines are headless and do not include a user interface, so the browser can only run in headless mode. Puppeteer also includes a headful mode, but that should be used exclusively for testing purposes.
      • ignoreHTTPSErrors: true allows you to visit websites that aren't hosted over a secure HTTPS protocol and to ignore any HTTPS-related errors.

      Save and close the file.

      Now, create your second .js file, index.js:
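      Again with nano:

      • nano index.js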

      Here you will require browser.js and pageController.js. You will then call the startBrowser() function and pass the created browser instance to our page controller, which will direct its actions. Add the following code:

      ./book-scraper/index.js

      const browserObject = require('./browser');
      const scraperController = require('./pageController');
      
      //Start the browser and create a browser instance
      let browserInstance = browserObject.startBrowser();
      
      // Pass the browser instance to the scraper controller
      scraperController(browserInstance)
      

      Save and close the file.

      Create your third .js file, pageController.js:
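      For example:

      • nano pageController.js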

      pageController.js controls your scraping process. It uses the browser instance to control the pageScraper.js file, which is where all the scraping scripts execute. Eventually, you will use it to specify what category of books you want to scrape. For now, however, you only want to make sure that you can open Chromium and navigate to a web page:

      ./book-scraper/pageController.js

      const pageScraper = require('./pageScraper');
      async function scrapeAll(browserInstance){
          let browser;
          try{
              browser = await browserInstance;
              await pageScraper.scraper(browser); 
      
          }
          catch(err){
              console.log("Could not resolve the browser instance => ", err);
          }
      }
      
      module.exports = (browserInstance) => scrapeAll(browserInstance)
      

      This code exports a function that takes in the browser instance and passes it to a function called scrapeAll(). This function, in turn, passes the instance to pageScraper.scraper() as an argument, which uses it to scrape pages.

      Save and close the file.

      Finally, create your last .js file, pageScraper.js:
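      For example:

      • nano pageScraper.js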

      Here you will create an object literal with a url property and a scraper() method. The url is the URL of the web page that you want to scrape, while the scraper() method contains the code that will perform your actual scraping, although at this stage it merely navigates to a URL. Add the following code:

      ./book-scraper/pageScraper.js

      const scraperObject = {
          url: 'http://books.toscrape.com',
          async scraper(browser){
              let page = await browser.newPage();
              console.log(`Navigating to ${this.url}...`);
              await page.goto(this.url);
      
          }
      }
      
      module.exports = scraperObject;
      

      Puppeteer has a newPage() method that creates a new page instance in the browser, and these page instances can do quite a few things. In your scraper() method, you created a page instance and then used the page.goto() method to navigate to the books.toscrape.com homepage.

      Save and close the file.

      Your program's file structure is now complete. The first level of your project's directory tree will look like this:

      Output

      .
      ├── browser.js
      ├── index.js
      ├── node_modules
      ├── package-lock.json
      ├── package.json
      ├── pageController.js
      └── pageScraper.js

      Now, run the command npm run start and watch your scraper application run:
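      The start script you added to package.json runs node index.js:

      • npm run start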

      It will automatically open a Chromium browser instance, open a new page in the browser, and navigate to books.toscrape.com.

      In this step, you created a Puppeteer application that opened Chromium and loaded the homepage of a dummy online bookstore, books.toscrape.com. In the next step, you will scrape the data for every book on that homepage.

      Step 3 — Scraping Data From a Single Page

      Before adding more functionality to your scraper application, open your preferred web browser and manually navigate to the books to scrape homepage. Browse the site and get a sense of how the data is structured.

      Image of the books to scrape website

      You will find a category section on the left and the books displayed on the right. When you click on a book, the browser navigates to a new URL that displays relevant information about that particular book.

      In this step, you will replicate this behavior, but in code; you will automate the process of navigating the website and consuming its data.

      First, if you inspect the source code for the homepage using the Dev Tools inside your browser, you will notice that the page lists each book's data under a section tag. Inside the section tag every book is under a list (li) tag, and it is here that you find the link to the book's dedicated page, its price, and its in-stock availability.

      The source code of books.toscrape viewed with Dev Tools

      You will scrape these book URLs, filtering for books that are in stock. You will then navigate to each individual book's page and scrape that book's data.

      Reopen your pageScraper.js file:

      Add the following highlighted content, nesting additional await calls after await page.goto(this.url):

      ./book-scraper/pageScraper.js

      
      const scraperObject = {
          url: 'http://books.toscrape.com',
          async scraper(browser){
              let page = await browser.newPage();
              console.log(`Navigating to ${this.url}...`);
              // Navigate to the selected page
              await page.goto(this.url);
              // Wait for the required DOM to be rendered
              await page.waitForSelector('.page_inner');
              // Get the link to all the required books
              let urls = await page.$$eval('section ol > li', links => {
                  // Make sure the book to be scraped is in stock
                  links = links.filter(link => link.querySelector('.instock.availability > i').textContent !== "In stock")
                  // Extract the links from the data
                  links = links.map(el => el.querySelector('h3 > a').href)
                  return links;
              });
              console.log(urls);
          }
      }
      
      module.exports = scraperObject;
      
      

      In this block of code, you called the page.waitForSelector() method. This waited for the div that contains all the book-related information to be rendered in the DOM, and then you called the page.$$eval() method. This method gets the URL elements with the selector section ol li (make sure you always return only a string or a number from the page.$eval() and page.$$eval() methods).

      Every book has one of two statuses: a book is either In Stock or Out of stock. You only want to scrape books that are In Stock. Because page.$$eval() returns an array of all the matching elements, you filtered this array to make sure that you were working only with books in stock. You did this by searching for and evaluating the .instock.availability class. You then mapped the href property of the book links and returned it from the method.

      Save and close the file.

      Run your application again:

      The browser will open, navigate to the web page, and then close once the task is complete. Now check your console; it will contain all the scraped URLs:

      Output

      > sammy_scraper@1.0.0 start /Users/sammy/book-scraper
      > node index.js

      Opening the browser......
      Navigating to http://books.toscrape.com...
      [
        'http://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html',
        'http://books.toscrape.com/catalogue/tipping-the-velvet_999/index.html',
        'http://books.toscrape.com/catalogue/soumission_998/index.html',
        'http://books.toscrape.com/catalogue/sharp-objects_997/index.html',
        'http://books.toscrape.com/catalogue/sapiens-a-brief-history-of-humankind_996/index.html',
        'http://books.toscrape.com/catalogue/the-requiem-red_995/index.html',
        'http://books.toscrape.com/catalogue/the-dirty-little-secrets-of-getting-your-dream-job_994/index.html',
        'http://books.toscrape.com/catalogue/the-coming-woman-a-novel-based-on-the-life-of-the-infamous-feminist-victoria-woodhull_993/index.html',
        'http://books.toscrape.com/catalogue/the-boys-in-the-boat-nine-americans-and-their-epic-quest-for-gold-at-the-1936-berlin-olympics_992/index.html',
        'http://books.toscrape.com/catalogue/the-black-maria_991/index.html',
        'http://books.toscrape.com/catalogue/starving-hearts-triangular-trade-trilogy-1_990/index.html',
        'http://books.toscrape.com/catalogue/shakespeares-sonnets_989/index.html',
        'http://books.toscrape.com/catalogue/set-me-free_988/index.html',
        'http://books.toscrape.com/catalogue/scott-pilgrims-precious-little-life-scott-pilgrim-1_987/index.html',
        'http://books.toscrape.com/catalogue/rip-it-up-and-start-again_986/index.html',
        'http://books.toscrape.com/catalogue/our-band-could-be-your-life-scenes-from-the-american-indie-underground-1981-1991_985/index.html',
        'http://books.toscrape.com/catalogue/olio_984/index.html',
        'http://books.toscrape.com/catalogue/mesaerion-the-best-science-fiction-stories-1800-1849_983/index.html',
        'http://books.toscrape.com/catalogue/libertarianism-for-beginners_982/index.html',
        'http://books.toscrape.com/catalogue/its-only-the-himalayas_981/index.html'
      ]

      This is a great start, but you want to scrape all the relevant data for a particular book and not just its URL. You will now use these URLs to open each page and scrape the book's title, author, price, availability, UPC, description, and image URL.

      Open pageScraper.js again:

      Add the following code, which will loop through each scraped link, open a new page instance, and then retrieve the relevant data:

      ./book-scraper/pageScraper.js

      const scraperObject = {
          url: 'http://books.toscrape.com',
          async scraper(browser){
              let page = await browser.newPage();
              console.log(`Navigating to ${this.url}...`);
              // Navigate to the selected page
              await page.goto(this.url);
              // Wait for the required DOM to be rendered
              await page.waitForSelector('.page_inner');
              // Get the link to all the required books
              let urls = await page.$$eval('section ol > li', links => {
                  // Make sure the book to be scraped is in stock
                  links = links.filter(link => link.querySelector('.instock.availability > i').textContent !== "In stock")
                  // Extract the links from the data
                  links = links.map(el => el.querySelector('h3 > a').href)
                  return links;
              });
      
      
              // Loop through each of those links, open a new page instance and get the relevant data from them
              let pagePromise = (link) => new Promise(async(resolve, reject) => {
                  let dataObj = {};
                  let newPage = await browser.newPage();
                  await newPage.goto(link);
                  dataObj['bookTitle'] = await newPage.$eval('.product_main > h1', text => text.textContent);
                  dataObj['bookPrice'] = await newPage.$eval('.price_color', text => text.textContent);
                  dataObj['noAvailable'] = await newPage.$eval('.instock.availability', text => {
                      // Strip new line and tab spaces
                      text = text.textContent.replace(/(\r\n\t|\n|\r|\t)/gm, "");
                      // Get the number of stock available
                      let regexp = /^.*\((.*)\).*$/i;
                      let stockAvailable = regexp.exec(text)[1].split(' ')[0];
                      return stockAvailable;
                  });
                  dataObj['imageUrl'] = await newPage.$eval('#product_gallery img', img => img.src);
                  dataObj['bookDescription'] = await newPage.$eval('#product_description', div => div.nextSibling.nextSibling.textContent);
                  dataObj['upc'] = await newPage.$eval('.table.table-striped > tbody > tr > td', table => table.textContent);
                  resolve(dataObj);
                  await newPage.close();
              });
      
              for(link in urls){
                  let currentPageData = await pagePromise(urls[link]);
                  // scrapedData.push(currentPageData);
                  console.log(currentPageData);
              }
      
          }
      }
      
      module.exports = scraperObject;
      

      You have an array of all the URLs. You want to loop through this array, open the URL in a new page, scrape the data on that page, close that page, and open a new page for the next URL in the array. Notice that you wrapped this code in a Promise. This is because you want to be able to wait for each action in your loop to complete. Therefore, each Promise opens a new URL and won't resolve until the program has scraped all the data on that URL and the page instance has closed.

      Warning: Note that you awaited the Promise using a for-in loop. Any other loop will suffice, but avoid iterating over your URL arrays using an array-iteration method like forEach, or any other method that uses a callback function. This is because the callback function has to go through the callback queue and the event loop first, and so multiple page instances will open all at once. This will put a much bigger strain on your memory.

      Take a closer look at your pagePromise function. Your scraper created a new page for each URL, and then you used the page.$eval() function to target the selectors for the relevant details you wanted to scrape on the new page. Some of the texts contain whitespace, newline characters, and other non-alphanumeric characters, which you stripped out using a regular expression. You then appended the value of every piece of data scraped on this page to an object and resolved that object.

      Save and close the file.

      Run the script again:

      The browser opens the homepage, then opens each book page and logs the data scraped from each of those pages. This output will print to your console:

      Output

      Opening the browser......
      Navigating to http://books.toscrape.com...
      {
        bookTitle: 'A Light in the Attic',
        bookPrice: '£51.77',
        noAvailable: '22',
        imageUrl: 'http://books.toscrape.com/media/cache/fe/72/fe72f0532301ec28892ae79a629a293c.jpg',
        bookDescription: "It's hard to imagine a world without A Light in the Attic. [...]",
        upc: 'a897fe39b1053632'
      }
      {
        bookTitle: 'Tipping the Velvet',
        bookPrice: '£53.74',
        noAvailable: '20',
        imageUrl: 'http://books.toscrape.com/media/cache/08/e9/08e94f3731d7d6b760dfbfbc02ca5c62.jpg',
        bookDescription: `"Erotic and absorbing...Written with starling power."--"The New York Times Book Review " Nan King, an oyster girl, is captivated by the music hall phenomenon Kitty Butler [...]`,
        upc: '90fa61229261140a'
      }
      {
        bookTitle: 'Soumission',
        bookPrice: '£50.10',
        noAvailable: '20',
        imageUrl: 'http://books.toscrape.com/media/cache/ee/cf/eecfe998905e455df12064dba399c075.jpg',
        bookDescription: 'Dans une France assez proche de la nôtre, [...]',
        upc: '6957f44c3847a760'
      }
      ...

      In this step, you scraped relevant data for every book on the homepage of books.toscrape.com, but you could add much more functionality. The book listings, for example, are paginated; how do you get books from those other pages? Also, on the left-hand side of the website you saw book categories; what if you don't want every book, but only those of a particular genre? You will now add these features.

      Step 4 — Scraping Data From Multiple Pages

      Pages on books.toscrape.com that are paginated have a next button beneath their content, while pages that are not paginated do not.

      You will use the presence of this button to determine whether or not the page is paginated. Since the data on every page has the same structure and the same markup, you will not need to write a scraper for every possible page. Rather, you will use the practice of recursion.

      First, you need to change the structure of your code a bit to accommodate recursively navigating to multiple pages.

      Reopen pageScraper.js:

      You will add a new function called scrapeCurrentPage() to your scraper() method. This function will contain all the code that scrapes data from a particular page and then clicks the next button if it exists. Add the following highlighted content:

      ./book-scraper/pageScraper.js scraper()

      const scraperObject = {
          url: 'http://books.toscrape.com',
          async scraper(browser){
              let page = await browser.newPage();
              console.log(`Navigating to ${this.url}...`);
              // Navigate to the selected page
              await page.goto(this.url);
              let scrapedData = [];
              // Wait for the required DOM to be rendered
              async function scrapeCurrentPage(){
                  await page.waitForSelector('.page_inner');
                  // Get the link to all the required books
                  let urls = await page.$$eval('section ol > li', links => {
                      // Make sure the book to be scraped is in stock
                      links = links.filter(link => link.querySelector('.instock.availability > i').textContent !== "In stock")
                      // Extract the links from the data
                      links = links.map(el => el.querySelector('h3 > a').href)
                      return links;
                  });
                  // Loop through each of those links, open a new page instance and get the relevant data from them
                  let pagePromise = (link) => new Promise(async(resolve, reject) => {
                      let dataObj = {};
                      let newPage = await browser.newPage();
                      await newPage.goto(link);
                      dataObj['bookTitle'] = await newPage.$eval('.product_main > h1', text => text.textContent);
                      dataObj['bookPrice'] = await newPage.$eval('.price_color', text => text.textContent);
                      dataObj['noAvailable'] = await newPage.$eval('.instock.availability', text => {
                          // Strip new line and tab spaces
                          text = text.textContent.replace(/(\r\n\t|\n|\r|\t)/gm, "");
                          // Get the number of stock available
                          let regexp = /^.*\((.*)\).*$/i;
                          let stockAvailable = regexp.exec(text)[1].split(' ')[0];
                          return stockAvailable;
                      });
                      dataObj['imageUrl'] = await newPage.$eval('#product_gallery img', img => img.src);
                      dataObj['bookDescription'] = await newPage.$eval('#product_description', div => div.nextSibling.nextSibling.textContent);
                      dataObj['upc'] = await newPage.$eval('.table.table-striped > tbody > tr > td', table => table.textContent);
                      resolve(dataObj);
                      await newPage.close();
                  });
      
                  for(link in urls){
                      let currentPageData = await pagePromise(urls[link]);
                      scrapedData.push(currentPageData);
                      // console.log(currentPageData);
                  }
                  // When all the data on this page is done, click the next button and start the scraping of the next page
                  // You are going to check if this button exist first, so you know if there really is a next page.
                  let nextButtonExist = false;
                  try{
                      const nextButton = await page.$eval('.next > a', a => a.textContent);
                      nextButtonExist = true;
                  }
                  catch(err){
                      nextButtonExist = false;
                  }
                  if(nextButtonExist){
                      await page.click('.next > a');   
                      return scrapeCurrentPage(); // Call this function recursively
                  }
                  await page.close();
                  return scrapedData;
              }
              let data = await scrapeCurrentPage();
              console.log(data);
              return data;
          }
      }
      
      module.exports = scraperObject;
      
      

      You initially set the nextButtonExist variable to false and then check whether the button exists. If the next button exists, you set nextButtonExist to true and go on to click the next button, and then call this function recursively.

      If nextButtonExist is false, it returns the scrapedData array as usual.

      Save and close the file.

      Run your script again:

      It may take a while to complete; your application, after all, is now scraping data from over 800 books. Feel free to close the browser or press CTRL + C to cancel the process.

      You have now maximized your scraper's capabilities, but you've created a new problem in the process. Now the problem is not too little data, but too much data. In the next step, you will fine-tune your application to filter your scraping by book category.

      Step 5 — Scraping Data by Category

      To scrape data by category, you will need to modify both your pageScraper.js file and your pageController.js file.

      Open pageController.js in a text editor:

      nano pageController.js
      

      Call the scraper so that it only scrapes travel books. Add the following code:

      ./book-scraper/pageController.js

      const pageScraper = require('./pageScraper');
      async function scrapeAll(browserInstance){
          let browser;
          try{
              browser = await browserInstance;
              let scrapedData = {};
              // Call the scraper for different set of books to be scraped
              scrapedData['Travel'] = await pageScraper.scraper(browser, 'Travel');
              await browser.close();
              console.log(scrapedData)
          }
          catch(err){
              console.log("Could not resolve the browser instance => ", err);
          }
      }
      module.exports = (browserInstance) => scrapeAll(browserInstance)
      

      You are now passing two parameters into your pageScraper.scraper() method, with the second parameter being the category of books you want to scrape, which in this example is Travel. But your pageScraper.js file does not yet recognize this parameter. You will need to adjust this file as well.

      Save and close the file.

      Open pageScraper.js:

      Add the following code, which will add your category parameter, navigate to that category page, and then begin scraping through the paginated results:

      ./book-scraper/pageScraper.js

      const scraperObject = {
          url: 'http://books.toscrape.com',
          async scraper(browser, category){
              let page = await browser.newPage();
              console.log(`Navigating to ${this.url}...`);
              // Navigate to the selected page
              await page.goto(this.url);
              // Select the category of book to be displayed
              let selectedCategory = await page.$$eval('.side_categories > ul > li > ul > li > a', (links, _category) => {
      
                  // Search for the element that has the matching text
                  links = links.map(a => a.textContent.replace(/(\r\n\t|\n|\r|\t|^\s|\s$|\B\s|\s\B)/gm, "") === _category ? a : null);
                  let link = links.filter(tx => tx !== null)[0];
                  return link.href;
              }, category);
              // Navigate to the selected category
              await page.goto(selectedCategory);
              let scrapedData = [];
              // Wait for the required DOM to be rendered
              async function scrapeCurrentPage(){
                  await page.waitForSelector('.page_inner');
                  // Get the link to all the required books
                  let urls = await page.$$eval('section ol > li', links => {
                      // Make sure the book to be scraped is in stock
                      links = links.filter(link => link.querySelector('.instock.availability > i').textContent !== "In stock")
                      // Extract the links from the data
                      links = links.map(el => el.querySelector('h3 > a').href)
                      return links;
                  });
                  // Loop through each of those links, open a new page instance and get the relevant data from them
                  let pagePromise = (link) => new Promise(async(resolve, reject) => {
                      let dataObj = {};
                      let newPage = await browser.newPage();
                      await newPage.goto(link);
                      dataObj['bookTitle'] = await newPage.$eval('.product_main > h1', text => text.textContent);
                      dataObj['bookPrice'] = await newPage.$eval('.price_color', text => text.textContent);
                      dataObj['noAvailable'] = await newPage.$eval('.instock.availability', text => {
                          // Strip new line and tab spaces
                          text = text.textContent.replace(/(\r\n\t|\n|\r|\t)/gm, "");
                          // Get the number of stock available
                          let regexp = /^.*\((.*)\).*$/i;
                          let stockAvailable = regexp.exec(text)[1].split(' ')[0];
                          return stockAvailable;
                      });
                      dataObj['imageUrl'] = await newPage.$eval('#product_gallery img', img => img.src);
                      dataObj['bookDescription'] = await newPage.$eval('#product_description', div => div.nextSibling.nextSibling.textContent);
                      dataObj['upc'] = await newPage.$eval('.table.table-striped > tbody > tr > td', table => table.textContent);
                      resolve(dataObj);
                      await newPage.close();
                  });
      
                  for(link in urls){
                      let currentPageData = await pagePromise(urls[link]);
                      scrapedData.push(currentPageData);
                      // console.log(currentPageData);
                  }
                  // When all the data on this page is done, click the next button and start the scraping of the next page
                  // You are going to check if this button exist first, so you know if there really is a next page.
                  let nextButtonExist = false;
                  try{
                      const nextButton = await page.$eval('.next > a', a => a.textContent);
                      nextButtonExist = true;
                  }
                  catch(err){
                      nextButtonExist = false;
                  }
                  if(nextButtonExist){
                      await page.click('.next > a');   
                      return scrapeCurrentPage(); // Call this function recursively
                  }
                  await page.close();
                  return scrapedData;
              }
              let data = await scrapeCurrentPage();
              console.log(data);
              return data;
          }
      }
      
      module.exports = scraperObject;
      

      This block of code uses the category you passed in to get the URL where the books of that category reside.

      page.$$eval() can take in arguments: you pass the argument in as the third parameter to the $$eval() method and define it as an extra parameter in the callback, like so:

      example page.$$eval() function

      page.$$eval('selector', function(elem, args){
          // .......
      }, args)
      

      This is what you did in your code: you passed in the category of books you wanted to scrape, checked all the categories to see which one matched, and then returned the URL of that category.

      This URL is then used to navigate to the page that displays the category of books you want to scrape, using the page.goto(selectedCategory) method.

      Save and close the file.

      Run your application again. You will notice that it navigates to the Travel category, recursively opens the books in that category page by page, and logs the results:

      In this step, you first scraped data across multiple pages and then scraped data across multiple pages from one particular category. In the final step, you will modify your script to scrape data from multiple categories and then save the scraped data to a stringified JSON file.

      Step 6 — Scraping Data From Multiple Categories and Saving the Data as JSON

      In this final step, you will make your script scrape data from as many categories as you want and then change the form of your output. Rather than logging the results, you will save them in a structured file called data.json.

      You can quickly add more categories to scrape; doing so requires only one additional line per genre.

      Open pageController.js:

      Adjust your code to include additional categories. The example below adds HistoricalFiction and Mystery to our existing Travel category:

      ./book-scraper/pageController.js

      const pageScraper = require('./pageScraper');
      async function scrapeAll(browserInstance){
          let browser;
          try{
              browser = await browserInstance;
              let scrapedData = {};
              // Call the scraper for different set of books to be scraped
              scrapedData['Travel'] = await pageScraper.scraper(browser, 'Travel');
              scrapedData['HistoricalFiction'] = await pageScraper.scraper(browser, 'Historical Fiction');
              scrapedData['Mystery'] = await pageScraper.scraper(browser, 'Mystery');
              await browser.close();
              console.log(scrapedData)
          }
          catch(err){
              console.log("Could not resolve the browser instance => ", err);
          }
      }
      
      module.exports = (browserInstance) => scrapeAll(browserInstance)
      

      Save and close the file.

      Run the script again and watch it scrape data from all three categories:

      With your scraper fully functional, your final step involves saving your data in a format that is more useful. You will now store the scraped data in a JSON file using the fs module in Node.js.

      First, reopen pageController.js:

      Add the following highlighted content:

      ./book-scraper/pageController.js

      const pageScraper = require('./pageScraper');
      const fs = require('fs');
      async function scrapeAll(browserInstance){
          let browser;
          try{
              browser = await browserInstance;
              let scrapedData = {};
              // Call the scraper for different set of books to be scraped
              scrapedData['Travel'] = await pageScraper.scraper(browser, 'Travel');
              scrapedData['HistoricalFiction'] = await pageScraper.scraper(browser, 'Historical Fiction');
              scrapedData['Mystery'] = await pageScraper.scraper(browser, 'Mystery');
              await browser.close();
              fs.writeFile("data.json", JSON.stringify(scrapedData), 'utf8', function(err) {
                  if(err) {
                      return console.log(err);
                  }
                  console.log("The data has been scraped and saved successfully! View it at './data.json'");
              });
          }
          catch(err){
              console.log("Could not resolve the browser instance => ", err);
          }
      }
      
      module.exports = (browserInstance) => scrapeAll(browserInstance)
      

      First, you are requiring the fs module of Node.js in pageController.js. This ensures you can save your data as a JSON file. Then you are adding code so that when the scraping is complete and the browser has closed, the program will create a new file called data.json. Note that the contents of data.json are stringified JSON. Therefore, when reading the contents of data.json, always parse it as JSON before reusing the data.
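      As a minimal sketch of reading the file back (this assumes a separate script, run after the scraper has written data.json to the project root):

      const fs = require('fs');

      // data.json contains stringified JSON, so parse it back into an object
      // before working with the scraped data.
      const savedData = JSON.parse(fs.readFileSync('data.json', 'utf8'));

      // For example, list the scraped categories and count the Travel books.
      console.log(Object.keys(savedData));
      console.log(savedData['Travel'].length);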

      Save and close the file.

      You have now built a web scraping application that scrapes books across multiple categories and then stores your scraped data in a JSON file. As your application grows in complexity, you may want to store this scraped data in a database or serve it over an API. How this data is consumed is up to you.

      Conclusion

      In this tutorial, you built a web crawler that scraped data across multiple pages recursively and then saved it to a JSON file. In short, you learned a new way to automate gathering data from websites.

      Puppeteer has quite a lot of features that were not within the scope of this tutorial. To learn more, check out Using Puppeteer for Easy Control Over Headless Chrome. You can also visit Puppeteer's official documentation.



      Source link

      Скрейпинг веб-сайта с помощью Node.js и Puppeteer


      Автор выбрал фонд Free and Open Source Fund для получения пожертвования в рамках программы Write for DOnations.

      Введение

      Веб-скрейпинг — это процесс автоматизации сбора данных из сети. В ходе данного процесса обычно используется «поисковый робот», который выполняет автоматический серфинг по сети и собирает данные с выбранных страниц. Существует много причин, по которым вам может потребоваться скрейпинг. Его главное достоинство состоит в том, что он позволяет выполнять сбор данных значительно быстрее, устраняя необходимость в ручном сборе данных. Скрейпинг также является отличным решением, когда собрать данные желательно или необходимо, но веб-сайт не предоставляет API для выполнения этой задачи.

      В этом руководстве вы создадите приложение для веб-скрейпинга с помощью Node.js и Puppeteer. Ваше приложение будет усложняться по мере вашего прогресса. Сначала вы запрограммируете ваше приложение на открытие Chromium и загрузку специального сайта, который вы будете использовать для практики веб-скрейпинга: books.toscrape.com. В следующих двух шагах вы выполните скрейпинг сначала всех книг на отдельной странице books.toscrape, а затем всех книг на нескольких страницах. В ходе остальных шагов вы сможете отфильтровать результаты по категориям книг, а затем сохраните ваши данные в виде файла JSON.

      Предупреждение: этичность и законность веб-скрейпинга являются очень сложной темой, которая постоянно подвергается изменениям. Ситуация зависит от вашего местонахождения, расположения данных и рассматриваемого веб-сайта. В этом руководстве мы будем выполнять скрейпинг специального сайта books.toscrape.com, который предназначен непосредственно для тестирования приложений для скрейпинга. Скрейпинг любого другого домена выходит за рамки темы данного руководства.

      Предварительные требования

      Шаг 1 — Настройка веб-скрейпера

      После установки Node.js вы можете начать настройку вашего веб-скрейпера. Сначала вам нужно будет создать корневой каталог проекта, а затем установить необходимые зависимости. Данное руководство требует только одной зависимости, и вы установите эту зависимость с помощью npm, стандартного диспетчера пакетов Node.js. npm предоставляется вместе с Node.js, поэтому вам не придется устанавливать его отдельно.

      Создайте папку для данного проекта, а затем перейдите в эту папку:

      • mkdir book-scraper
      • cd book-scraper

      Вы будете запускать все последующие команды из этого каталога.

      Нам нужно установить один пакет с помощью npm (node package manager). Сначала инициализируйте npm для создания файла packages.json, который будет управлять зависимостями вашего проекта и метаданными.

      Инициализация npm для вашего проекта:

      npm отобразит последовательность запросов. Вы можете нажать ENTER в ответ на каждый запрос или добавить персонализированные описания. Нажмите ENTER и оставьте значения по умолчанию при запросе значений для точки входа: и тестовой команды:. В качестве альтернативы вы можете передать флаг y для npmnpm init -y— в результате чего npm добавит все значения по умолчанию.

      Полученный вами вывод будет выглядеть примерно следующим образом:

      Output

      { "name": "sammy_scraper", "version": "1.0.0", "description": "a web scraper", "main": "index.js", "scripts": { "test": "echo "Error: no test specified" && exit 1" }, "keywords": [], "author": "sammy the shark", "license": "ISC" } Is this OK? (yes) yes

      Введите yes и нажмите ENTER. После этого npm сохранит этот результат в виде вашего файла package.json.

      Теперь вы можете воспользоваться npm для установки Puppeteer:

      • npm install --save puppeteer

      Эта команда устанавливает Puppeteer и версию Chromium, которая, как известно команде Puppeteer, будет корректно работать с их API.

      На компьютерах с Linux для работы Puppeteer может потребоваться установка дополнительных зависимостей.

      Если вы используете Ubuntu 18.04, ознакомьтесь с данными в выпадающем списке «Зависимости Debian» в разделе «Chrome Headless не запускается в UNIX» документации Puppeteer по устранению ошибок. Вы можете воспользоваться следующей командой для поиска любых недостающих зависимостей:

      После установки npm, Puppeteer и любых дополнительных зависимостей ваш файл package.json потребует одной последней настройки, прежде чем вы сможете начать писать код. В этом руководстве вы будете запускать ваше приложение из командной строки с помощью команды npm run start. Вы должны добавить определенную информацию об этом скрипте start в package.json. В частности, вы должны добавить одну строку под директивой scripts для вашей команды start.

      Откройте в файл в предпочитаемом вами текстовом редакторе:

      Найдите раздел scripts: и добавьте следующие конфигурации. Не забудьте поместить запятую в конце строки test скрипта, иначе ваш файл не будет интерпретироваться корректно.

      Output

      { . . . "scripts": { "test": "echo "Error: no test specified" && exit 1", "start": "node index.js" }, . . . "dependencies": { "puppeteer": "^5.2.1" } }

      Также вы можете заметить, что puppeteer сейчас появляется под разделом dependencies в конце файла. Ваш файл package.json больше не потребует изменений. Сохраните изменения и закройте редактор.

      Теперь вы можете перейти к программированию вашего скрейпера. В следующем шаге вы настроите экземпляр браузера и протестируете базовый функционал вашего скрейпера.

      Шаг 2 — Настройка экземпляра браузера

      Когда вы открываете традиционный браузер, то можете выполнять такие действия, как нажатие кнопок, навигация с помощью мыши, печать, открытие инструментов разработчик и многое другое. Браузер без графического интерфейса, например, Chromium, позволяет вам выполнять эти же вещи, но уже программным путем без использования пользовательского интерфейса. В этом шаге вы настроите экземпляр браузера для вашего скрейпера. Когда вы запустите ваше приложение, оно автоматически откроет Chromium и перейдет на сайт books.toscrape.com. Эти первоначальные действия будут служить основой вашей программы.

      Вашему веб-скрейперу потребуется четыре файла .js: browser.js, index.js, pageController.js и pageScraper.js. В этом шаге вы создадите все четыре файла, а затем постепенно будете обновлять их по мере того, как ваша программа будет усложняться. Начнем с browser.js; этот файл будет содержать скрипт, который запускает ваш браузер.

      В корневом каталоге вашего проекта создайте и откройте файл browser.js в текстовом редакторе:

      Во-первых, необходимо подключить Puppeteer с помощью require, а затем создать асинхронную функцию с именем startBrowser(). Эта функция будет запускать браузер и возвращать его экземпляр. Добавьте следующий код:

      ./book-scraper/browser.js

      const puppeteer = require('puppeteer');
      
      async function startBrowser(){
          let browser;
          try {
              console.log("Opening the browser......");
              browser = await puppeteer.launch({
                  headless: false,
                  args: ["--disable-setuid-sandbox"],
                  'ignoreHTTPSErrors': true
              });
          } catch (err) {
              console.log("Could not create a browser instance => : ", err);
          }
          return browser;
      }
      
      module.exports = {
          startBrowser
      };
      

      Puppeteer имеет метод launch(), который запускает экземпляр браузера. Этот метод возвращает промис, поэтому вам нужно гарантировать, что промис исполняется, воспользовавшись для этого блоком .then или await.

      Вы будете использовать await для гарантии исполнения промиса, обернув этот экземпляр в блок try-catch, а затем вернув экземпляр браузера.

      Обратите внимание, что метод .launch() принимает в качестве параметра JSON с несколькими значениями:

      • headless – false означает, что браузер будет запускаться с интерфейсом, чтобы вы могли наблюдать за выполнением вашего скрипта, а значение true для данного параметра означает, что браузер будет запускаться в режиме без графического интерфейс. Обратите внимание, что, если вы хотите развернуть ваш скрейпер в облаке, задайте значение true для параметра headless. Большинство виртуальных машин не имеют пользовательского интерфейса, поэтому они могут запускать браузер только в режиме без графического интерфейса. Puppeteer также включает режим headful, но его следует использовать исключительно для тестирования.
      • ignoreHTTPSerrors – true позволяет вам посещать веб-сайты, доступ к которым осуществляется не через защищенный протокол HTTPS, и игнорировать любые ошибки HTTPS.

      Сохраните и закройте файл.

      Теперь создайте ваш второй файл .jsindex.js:

      Здесь вы подключаете файлы browser.js и pageController.js с помощью require. Затем вы вызовете функцию startBrowser() и передадите созданный экземпляр браузера в контроллер страницы, который будет управлять ее действиями. Добавьте следующий код:

      ./book-scraper/index.js

      const browserObject = require('./browser');
      const scraperController = require('./pageController');
      
      //Start the browser and create a browser instance
      let browserInstance = browserObject.startBrowser();
      
      // Pass the browser instance to the scraper controller
      scraperController(browserInstance)
      

      Сохраните и закройте файл.

      Создайте ваш третий файл .jspageController.js:

      pageController.js контролирует процесс скрейпинга. Он использует экземпляр браузера для управления файлом pageScraper.js, где выполняются все скрипты скрейпинга. В конечном итоге вы будете использовать его для указания категории, скрейпинг которой вы хотите выполнить. Однако сейчас вам нужно только убедиться, что вы можете открыть Chromium и перейти на веб-страницу:

      ./book-scraper/pageController.js

      const pageScraper = require('./pageScraper');
      async function scrapeAll(browserInstance){
          let browser;
          try{
              browser = await browserInstance;
              await pageScraper.scraper(browser); 
      
          }
          catch(err){
              console.log("Could not resolve the browser instance => ", err);
          }
      }
      
      module.exports = (browserInstance) => scrapeAll(browserInstance)
      

      Этот код экспортирует функцию, которая принимает экземпляр браузера и передает его в функцию scrapeAll(). Эта функция, в свою очередь, передает этот экземпляр в pageScraper.scraper() в качестве аргумента, который использует его при скрейпинге страниц.

      Сохраните и закройте файл.

      В заключение создайте ваш последний файл .jspageScraper.js:

      Здесь вы создаете литерал со свойством url и методом scraper(). url — это URL-адрес веб-страницы, скрейпинг которой вы хотите выполнить, а метод scraper() содержит код, который будет непосредственно выполнять скрейпинг, хотя на этом этапе он будет просто переходить по указанному URL-адресу. Добавьте следующий код:

      ./book-scraper/pageScraper.js

      const scraperObject = {
          url: 'http://books.toscrape.com',
          async scraper(browser){
              let page = await browser.newPage();
              console.log(`Navigating to ${this.url}...`);
              await page.goto(this.url);
      
          }
      }
      
      module.exports = scraperObject;
      

      Puppeteer имеет метод newPage(), который создает новый экземпляр страницы в браузере, а эти экземпляры страниц могут выполнять несколько действий. В методе scraper() вы создали экземпляр страницы, а затем использовали метод page.goto() для перехода на домашнюю страницу books.toscrape.com.

      Сохраните и закройте файл.

      Теперь файловая структура вашей программы готова. Первый уровень дерева каталогов вашего проекта будет выглядеть следующим образом:

      Output

      . ├── browser.js ├── index.js ├── node_modules ├── package-lock.json ├── package.json ├── pageController.js └── pageScraper.js

      Теперь запустите команду npm run start и следите за выполнением вашего приложения для скрейпинга:

      Приложение автоматически загрузит экземпляр браузера Chromium, откроет новую страницу в браузере и перейдет на адрес books.toscrape.com.

      В этом шаге вы создали приложение Puppeteer, которое открывает Chromium и загружает домашнюю страницу шаблона книжного онлайн-магазина—books.toscrape.com. В следующем шаге вы будете выполнять скрейпинг данных для каждой книги на этой домашней странице.

      Шаг 3 — Скрейпинг данных с одной страницы

      Перед добавлением дополнительных функций в ваше приложение для скрейпинга, откройте предпочитаемый веб-браузер и вручную перейдите на домашнюю страницу с книгами для скрейпинга. Просмотрите сайт и получите представление о структуре данных.

      Изображение веб-сайта с книгами для скрейпинга

      Слева вы найдете раздел категорий, а справа располагаются книги. При нажатии на книгу браузер переходит по новому URL-адресу, который отображает соответствующую информацию об этой конкретной книге.

      В этом шаге вы будете воспроизводить данное поведение, но уже с помощью кода, т.е. вы автоматизируете процесс навигации по веб-сайту и получения данных.

      Во-первых, если вы просмотрите исходный код домашней страницы с помощью инструментов разработчика в браузере, то сможете заметить, что страница содержит данные каждой книги внутри тега section. Внутри тега section каждая книга находится внутри тега list (li), и именно здесь вы найдете ссылку на отдельную страницу книги, цену и информацию о наличии.

      Просмотр исходного кода books.toscrape с помощью инструментов для разработчика

      Вы будете выполнять скрейпинг URL-адресов книг, фильтровать книги, имеющиеся в наличии, переходить на отдельную страницу каждой книги и потом выполнять уже скрейпинг данных этой книги.

      Повторно откройте ваш файл pageScraper.js:

      Добавьте следующие выделенные строки. Вы поместите еще один блок await внутри блока await page.goto(this.url);:

      ./book-scraper/pageScraper.js

      
      const scraperObject = {
          url: 'http://books.toscrape.com',
          async scraper(browser){
              let page = await browser.newPage();
              console.log(`Navigating to ${this.url}...`);
              // Navigate to the selected page
              await page.goto(this.url);
              // Wait for the required DOM to be rendered
              await page.waitForSelector('.page_inner');
              // Get the link to all the required books
              let urls = await page.$$eval('section ol > li', links => {
                  // Make sure the book to be scraped is in stock
                  links = links.filter(link => link.querySelector('.instock.availability > i').textContent !== "In stock")
                  // Extract the links from the data
                  links = links.map(el => el.querySelector('h3 > a').href)
                  return links;
              });
              console.log(urls);
          }
      }
      
      module.exports = scraperObject;
      
      

      Inside this block of code, you call the page.waitForSelector() method. This waits until the div containing all of the book information has been rendered in the DOM, and then you call the page.$$eval() method. This method gets the URL element with the selector section ol li (make sure the page.$eval() and page.$$eval() methods only ever return a string or a number).

      Every book has one of two statuses: In Stock or Out of stock. You only want to scrape the books that are In Stock. Because page.$$eval() returns an array of all matching elements, you filtered this array to make sure you were only working with books in stock. You did this by searching for and evaluating the .instock.availability class. You then mapped out the href property of the book links and returned it from the method.

      Save and close the file.

      Run your application again:

      The application will open the browser, navigate to the web page, and then close it once the task is complete. Now check your console; it will contain all the scraped URLs:

      Output

      > [email protected] start /Users/sammy/book-scraper > node index.js Opening the browser...... Navigating to http://books.toscrape.com... [ 'http://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html', 'http://books.toscrape.com/catalogue/tipping-the-velvet_999/index.html', 'http://books.toscrape.com/catalogue/soumission_998/index.html', 'http://books.toscrape.com/catalogue/sharp-objects_997/index.html', 'http://books.toscrape.com/catalogue/sapiens-a-brief-history-of-humankind_996/index.html', 'http://books.toscrape.com/catalogue/the-requiem-red_995/index.html', 'http://books.toscrape.com/catalogue/the-dirty-little-secrets-of-getting-your-dream-job_994/index.html', 'http://books.toscrape.com/catalogue/the-coming-woman-a-novel-based-on-the-life-of-the-infamous-feminist-victoria-woodhull_993/index.html', 'http://books.toscrape.com/catalogue/the-boys-in-the-boat-nine-americans-and-their-epic-quest-for-gold-at-the-1936-berlin-olympics_992/index.html', 'http://books.toscrape.com/catalogue/the-black-maria_991/index.html', 'http://books.toscrape.com/catalogue/starving-hearts-triangular-trade-trilogy-1_990/index.html', 'http://books.toscrape.com/catalogue/shakespeares-sonnets_989/index.html', 'http://books.toscrape.com/catalogue/set-me-free_988/index.html', 'http://books.toscrape.com/catalogue/scott-pilgrims-precious-little-life-scott-pilgrim-1_987/index.html', 'http://books.toscrape.com/catalogue/rip-it-up-and-start-again_986/index.html', 'http://books.toscrape.com/catalogue/our-band-could-be-your-life-scenes-from-the-american-indie-underground-1981-1991_985/index.html', 'http://books.toscrape.com/catalogue/olio_984/index.html', 'http://books.toscrape.com/catalogue/mesaerion-the-best-science-fiction-stories-1800-1849_983/index.html', 'http://books.toscrape.com/catalogue/libertarianism-for-beginners_982/index.html', 'http://books.toscrape.com/catalogue/its-only-the-himalayas_981/index.html' ]

      This is a great start, but you want to scrape all the relevant data for each book, not just its URL. You will now use these URLs to open each page and scrape the book's title, author, price, availability, UPC, description, and image URL.

      Reopen pageScraper.js:

      Add the following code, which will loop through every scraped link, open a new page instance, and retrieve the relevant data:

      ./book-scraper/pageScraper.js

      const scraperObject = {
          url: 'http://books.toscrape.com',
          async scraper(browser){
              let page = await browser.newPage();
              console.log(`Navigating to ${this.url}...`);
              // Navigate to the selected page
              await page.goto(this.url);
              // Wait for the required DOM to be rendered
              await page.waitForSelector('.page_inner');
              // Get the link to all the required books
              let urls = await page.$$eval('section ol > li', links => {
                  // Make sure the book to be scraped is in stock
                  links = links.filter(link => link.querySelector('.instock.availability > i').textContent !== "In stock")
                  // Extract the links from the data
                  links = links.map(el => el.querySelector('h3 > a').href)
                  return links;
              });
      
      
              // Loop through each of those links, open a new page instance and get the relevant data from them
              let pagePromise = (link) => new Promise(async(resolve, reject) => {
                  let dataObj = {};
                  let newPage = await browser.newPage();
                  await newPage.goto(link);
                  dataObj['bookTitle'] = await newPage.$eval('.product_main > h1', text => text.textContent);
                  dataObj['bookPrice'] = await newPage.$eval('.price_color', text => text.textContent);
                  dataObj['noAvailable'] = await newPage.$eval('.instock.availability', text => {
                      // Strip new line and tab spaces
                      text = text.textContent.replace(/(\r\n\t|\n|\r|\t)/gm, "");
                      // Get the number of stock available
                      let regexp = /^.*\((.*)\).*$/i;
                      let stockAvailable = regexp.exec(text)[1].split(' ')[0];
                      return stockAvailable;
                  });
                  dataObj['imageUrl'] = await newPage.$eval('#product_gallery img', img => img.src);
                  dataObj['bookDescription'] = await newPage.$eval('#product_description', div => div.nextSibling.nextSibling.textContent);
                  dataObj['upc'] = await newPage.$eval('.table.table-striped > tbody > tr > td', table => table.textContent);
                  resolve(dataObj);
                  await newPage.close();
              });
      
              for(link in urls){
                  let currentPageData = await pagePromise(urls[link]);
                  // scrapedData.push(currentPageData);
                  console.log(currentPageData);
              }
      
          }
      }
      
      module.exports = scraperObject;
      

      You now have an array of all the URLs. You want to loop through this array, open each URL in a new page, scrape the data on that page, close that page, and open a new page for the next URL in the array. Notice that you wrapped this code in a Promise. This is because you want to be able to wait for each action in your loop to complete. Each Promise therefore opens a new URL and will not resolve until the program has scraped all the data for that URL, after which the page instance closes.

      Warning: note that you awaited the Promise with a for-in loop. Any other loop would do, but avoid iterating over your URL arrays with an array-iteration method like forEach, or any other method that uses a callback function. This is because the callback function has to go through the callback queue and event loop first, so multiple page instances open almost simultaneously, placing a much greater strain on your memory.
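
      To make the difference concrete, here is a simplified sketch that contrasts the two approaches. It assumes the urls array and the pagePromise function from the code above; a for-of loop behaves the same way as the tutorial's for-in loop for this purpose:

      example of sequential vs. callback-based iteration

      // Anti-pattern (for illustration only): forEach does not await its async
      // callback, so every book page opens at roughly the same time.
      urls.forEach(async (url) => {
          let currentPageData = await pagePromise(url);
          console.log(currentPageData);
      });

      // Sequential pattern used in this tutorial: each iteration waits for the
      // previous page to be scraped and closed before opening the next one.
      for (let url of urls) {
          let currentPageData = await pagePromise(url);
          console.log(currentPageData);
      }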

      Take a closer look at your pagePromise function. Your scraper first created a new page for each URL, and then it used page.$eval() to target the selectors for the relevant details you wanted to scrape from the new page. Some of the text contains whitespace, tabs, newlines, and other non-alphanumeric characters, which you stripped out using a regular expression. You then appended the value of every piece of data scraped from the page to an object and resolved that object.

      Save and close the file.

      Run the script again:

      The browser opens the homepage, then visits each book page and logs the scraped data from each of those pages. The following output will print to your console:

      Output

      Opening the browser...... Navigating to http://books.toscrape.com... { bookTitle: 'A Light in the Attic', bookPrice: '£51.77', noAvailable: '22', imageUrl: 'http://books.toscrape.com/media/cache/fe/72/fe72f0532301ec28892ae79a629a293c.jpg', bookDescription: "It's hard to imagine a world without A Light in the Attic. [...]', upc: 'a897fe39b1053632' } { bookTitle: 'Tipping the Velvet', bookPrice: '£53.74', noAvailable: '20', imageUrl: 'http://books.toscrape.com/media/cache/08/e9/08e94f3731d7d6b760dfbfbc02ca5c62.jpg', bookDescription: `"Erotic and absorbing...Written with starling power."--"The New York Times Book Review " Nan King, an oyster girl, is captivated by the music hall phenomenon Kitty Butler [...]`, upc: '90fa61229261140a' } { bookTitle: 'Soumission', bookPrice: '£50.10', noAvailable: '20', imageUrl: 'http://books.toscrape.com/media/cache/ee/cf/eecfe998905e455df12064dba399c075.jpg', bookDescription: 'Dans une France assez proche de la nôtre, [...]', upc: '6957f44c3847a760' } ...

      In this step, you scraped the relevant data for every book on the books.toscrape.com homepage, but you could add a lot more functionality. For example, the lists of books are paginated; how do you get the books on those next pages? Also, on the left side of the website you saw categories of books; what if you do not want all the books, but only those of a particular genre? You will now add both features.

      Step 4 — Scraping Data From Multiple Pages

      Pages on books.toscrape.com that are paginated have a next button beneath their main content, while pages that are not paginated do not.

      You will use the presence of this button to determine whether a page is paginated or not. Since the data on every page has the same structure and the same markup, you will not write a scraper for every possible page. Rather, you will use recursion.

      First, you need to change the structure of your code a bit to accommodate navigating to multiple pages recursively.

      Reopen pageScraper.js:

      You will add a new function called scrapeCurrentPage() to your scraper() method. This function will contain all the code that scrapes data from a particular page and then clicks the next button if it exists. Add the following highlighted code:

      ./book-scraper/pageScraper.js scraper()

      const scraperObject = {
          url: 'http://books.toscrape.com',
          async scraper(browser){
              let page = await browser.newPage();
              console.log(`Navigating to ${this.url}...`);
              // Navigate to the selected page
              await page.goto(this.url);
              let scrapedData = [];
              // Wait for the required DOM to be rendered
              async function scrapeCurrentPage(){
                  await page.waitForSelector('.page_inner');
                  // Get the link to all the required books
                  let urls = await page.$$eval('section ol > li', links => {
                      // Make sure the book to be scraped is in stock
                      links = links.filter(link => link.querySelector('.instock.availability > i').textContent !== "In stock")
                      // Extract the links from the data
                      links = links.map(el => el.querySelector('h3 > a').href)
                      return links;
                  });
                  // Loop through each of those links, open a new page instance and get the relevant data from them
                  let pagePromise = (link) => new Promise(async(resolve, reject) => {
                      let dataObj = {};
                      let newPage = await browser.newPage();
                      await newPage.goto(link);
                      dataObj['bookTitle'] = await newPage.$eval('.product_main > h1', text => text.textContent);
                      dataObj['bookPrice'] = await newPage.$eval('.price_color', text => text.textContent);
                      dataObj['noAvailable'] = await newPage.$eval('.instock.availability', text => {
                          // Strip new line and tab spaces
                          text = text.textContent.replace(/(\r\n\t|\n|\r|\t)/gm, "");
                          // Get the number of stock available
                          let regexp = /^.*\((.*)\).*$/i;
                          let stockAvailable = regexp.exec(text)[1].split(' ')[0];
                          return stockAvailable;
                      });
                      dataObj['imageUrl'] = await newPage.$eval('#product_gallery img', img => img.src);
                      dataObj['bookDescription'] = await newPage.$eval('#product_description', div => div.nextSibling.nextSibling.textContent);
                      dataObj['upc'] = await newPage.$eval('.table.table-striped > tbody > tr > td', table => table.textContent);
                      resolve(dataObj);
                      await newPage.close();
                  });
      
                  for(link in urls){
                      let currentPageData = await pagePromise(urls[link]);
                      scrapedData.push(currentPageData);
                      // console.log(currentPageData);
                  }
                  // When all the data on this page is done, click the next button and start the scraping of the next page
                  // You are going to check if this button exist first, so you know if there really is a next page.
                  let nextButtonExist = false;
                  try{
                      const nextButton = await page.$eval('.next > a', a => a.textContent);
                      nextButtonExist = true;
                  }
                  catch(err){
                      nextButtonExist = false;
                  }
                  if(nextButtonExist){
                      await page.click('.next > a');   
                      return scrapeCurrentPage(); // Call this function recursively
                  }
                  await page.close();
                  return scrapedData;
              }
              let data = await scrapeCurrentPage();
              console.log(data);
              return data;
          }
      }
      
      module.exports = scraperObject;
      
      

      Initially, you set the nextButtonExist variable to false and then check whether the button is present. If the next button exists, you set nextButtonExist to true, proceed to click the next button, and then call this function recursively.

      If nextButtonExist is false, the function returns the scrapedData array as is.

      Save and close the file.

      Run your script again:

      This could take a while to complete; your application, after all, is now scraping data for more than 800 books. Feel free to close the browser or press CTRL + C to cancel the process.

      You have now maximized your scraper's capabilities, but you have created a new problem in the process: the issue is no longer too little data, but too much data. In the next step, you will fine-tune your application to filter your scraping by book category.

      Step 5 — Scraping Data by Category

      To scrape data by category, you will need to modify both your pageScraper.js file and your pageController.js file.

      Open pageController.js in your text editor:

      nano pageController.js
      

      Call the scraper so that it only scrapes travel books. Add the following code:

      ./book-scraper/pageController.js

      const pageScraper = require('./pageScraper');
      async function scrapeAll(browserInstance){
          let browser;
          try{
              browser = await browserInstance;
              let scrapedData = {};
              // Call the scraper for different set of books to be scraped
              scrapedData['Travel'] = await pageScraper.scraper(browser, 'Travel');
              await browser.close();
              console.log(scrapedData)
          }
          catch(err){
              console.log("Could not resolve the browser instance => ", err);
          }
      }
      module.exports = (browserInstance) => scrapeAll(browserInstance)
      

      You now pass two parameters into your pageScraper.scraper() method, with the second parameter being the category of books that you want to scrape, which in this example is Travel. But your pageScraper.js file does not recognize this parameter yet. You will need to adjust that file as well.

      Save and close the file.

      Open pageScraper.js:

      Add the following code, which will add your category parameter, navigate to that category page, and then begin scraping through the paginated results:

      ./book-scraper/pageScraper.js

      const scraperObject = {
          url: 'http://books.toscrape.com',
          async scraper(browser, category){
              let page = await browser.newPage();
              console.log(`Navigating to ${this.url}...`);
              // Navigate to the selected page
              await page.goto(this.url);
              // Select the category of book to be displayed
              let selectedCategory = await page.$$eval('.side_categories > ul > li > ul > li > a', (links, _category) => {
      
                  // Search for the element that has the matching text
                  links = links.map(a => a.textContent.replace(/(\r\n\t|\n|\r|\t|^\s|\s$|\B\s|\s\B)/gm, "") === _category ? a : null);
                  let link = links.filter(tx => tx !== null)[0];
                  return link.href;
              }, category);
              // Navigate to the selected category
              await page.goto(selectedCategory);
              let scrapedData = [];
              // Wait for the required DOM to be rendered
              async function scrapeCurrentPage(){
                  await page.waitForSelector('.page_inner');
                  // Get the link to all the required books
                  let urls = await page.$$eval('section ol > li', links => {
                      // Make sure the book to be scraped is in stock
                      links = links.filter(link => link.querySelector('.instock.availability > i').textContent !== "In stock")
                      // Extract the links from the data
                      links = links.map(el => el.querySelector('h3 > a').href)
                      return links;
                  });
                  // Loop through each of those links, open a new page instance and get the relevant data from them
                  let pagePromise = (link) => new Promise(async(resolve, reject) => {
                      let dataObj = {};
                      let newPage = await browser.newPage();
                      await newPage.goto(link);
                      dataObj['bookTitle'] = await newPage.$eval('.product_main > h1', text => text.textContent);
                      dataObj['bookPrice'] = await newPage.$eval('.price_color', text => text.textContent);
                      dataObj['noAvailable'] = await newPage.$eval('.instock.availability', text => {
                          // Strip new line and tab spaces
                          text = text.textContent.replace(/(\r\n\t|\n|\r|\t)/gm, "");
                          // Get the number of stock available
                          let regexp = /^.*\((.*)\).*$/i;
                          let stockAvailable = regexp.exec(text)[1].split(' ')[0];
                          return stockAvailable;
                      });
                      dataObj['imageUrl'] = await newPage.$eval('#product_gallery img', img => img.src);
                      dataObj['bookDescription'] = await newPage.$eval('#product_description', div => div.nextSibling.nextSibling.textContent);
                      dataObj['upc'] = await newPage.$eval('.table.table-striped > tbody > tr > td', table => table.textContent);
                      resolve(dataObj);
                      await newPage.close();
                  });
      
                  for(link in urls){
                      let currentPageData = await pagePromise(urls[link]);
                      scrapedData.push(currentPageData);
                      // console.log(currentPageData);
                  }
                  // When all the data on this page is done, click the next button and start the scraping of the next page
                  // You are going to check if this button exist first, so you know if there really is a next page.
                  let nextButtonExist = false;
                  try{
                      const nextButton = await page.$eval('.next > a', a => a.textContent);
                      nextButtonExist = true;
                  }
                  catch(err){
                      nextButtonExist = false;
                  }
                  if(nextButtonExist){
                      await page.click('.next > a');   
                      return scrapeCurrentPage(); // Call this function recursively
                  }
                  await page.close();
                  return scrapedData;
              }
              let data = await scrapeCurrentPage();
              console.log(data);
              return data;
          }
      }
      
      module.exports = scraperObject;
      

      This block of code uses the category you passed into the scraper() method to retrieve the URL where the books of that category live.

      page.$$eval() can accept arguments: you pass the argument as the third parameter to the $$eval() method and then declare it as a parameter in the callback, like so:

      example page.$$eval() function

      page.$$eval('selector', function(elem, args){
          // .......
      }, args)
      

      This is what you did in your code: you passed in the category of books you want to scrape, mapped over all the categories to find the one that matches, and then returned that category's URL.

      You then used that URL to navigate to the page displaying the category of books you want to scrape via the page.goto(selectedCategory) method.

      Save and close the file.

      Run your application again. You will notice that it navigates to the Travel category, recursively opens the books in that category page after page, and logs the results:

      In this step, you scraped data across multiple pages and then scraped data across multiple pages from one particular category. In the final step, you will modify your script to scrape data from multiple categories and then save the scraped data to a stringified JSON file.

      Step 6 — Scraping Data from Multiple Categories and Saving it as JSON

      In this final step, you will make your script scrape data from as many categories as you want and then change the form of your output. Rather than logging the results, you will save them in a structured file called data.json.

      You can quickly add more categories to scrape; doing so requires only one additional line per genre.

      Open pageController.js:

      Modify your code to include additional categories. The example below adds the HistoricalFiction and Mystery categories to the existing Travel category:

      ./book-scraper/pageController.js

      const pageScraper = require('./pageScraper');
      async function scrapeAll(browserInstance){
          let browser;
          try{
              browser = await browserInstance;
              let scrapedData = {};
              // Call the scraper for different set of books to be scraped
              scrapedData['Travel'] = await pageScraper.scraper(browser, 'Travel');
              scrapedData['HistoricalFiction'] = await pageScraper.scraper(browser, 'Historical Fiction');
              scrapedData['Mystery'] = await pageScraper.scraper(browser, 'Mystery');
              await browser.close();
              console.log(scrapedData)
          }
          catch(err){
              console.log("Could not resolve the browser instance => ", err);
          }
      }
      
      module.exports = (browserInstance) => scrapeAll(browserInstance)
      

      Save and close the file.

      Run the script again and watch it scrape data for all three categories:

      With the scraper fully functional, your final step involves saving your data in a more useful format. You will now store it in a JSON file using the fs module in Node.js.

      Open pageController.js:

      Add the following highlighted code:

      ./book-scraper/pageController.js

      const pageScraper = require('./pageScraper');
      const fs = require('fs');
      async function scrapeAll(browserInstance){
          let browser;
          try{
              browser = await browserInstance;
              let scrapedData = {};
              // Call the scraper for different set of books to be scraped
              scrapedData['Travel'] = await pageScraper.scraper(browser, 'Travel');
              scrapedData['HistoricalFiction'] = await pageScraper.scraper(browser, 'Historical Fiction');
              scrapedData['Mystery'] = await pageScraper.scraper(browser, 'Mystery');
              await browser.close();
              fs.writeFile("data.json", JSON.stringify(scrapedData), 'utf8', function(err) {
                  if(err) {
                      return console.log(err);
                  }
                  console.log("The data has been scraped and saved successfully! View it at './data.json'");
              });
          }
          catch(err){
              console.log("Could not resolve the browser instance => ", err);
          }
      }
      
      module.exports = (browserInstance) => scrapeAll(browserInstance)
      

      First, you require Node's fs module in pageController.js. This ensures that you can save your data as a JSON file. Then you add code so that, once the scraping is complete and the browser has closed, the program creates a new file called data.json. Note that the contents of data.json are stringified JSON. Therefore, whenever you read the contents of data.json, always parse it as JSON before reusing the data.

      Save and close the file.

      You have now built a web scraping application that scrapes books across multiple categories and then stores the scraped data in a JSON file. As your application grows in complexity, you may want to store this scraped data in a database or serve it over an API. How this data is consumed is up to you.
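
      For instance, any Node.js script that later consumes data.json just needs to read and parse the file. The following is a minimal sketch; readData.js and the variable names are hypothetical and not part of this tutorial:

      example of reading data.json back

      // readData.js - illustration only
      const fs = require('fs');

      // data.json holds stringified JSON, so parse it before using the data
      const scrapedData = JSON.parse(fs.readFileSync('./data.json', 'utf8'));

      // scrapedData is now a plain object keyed by category
      console.log(Object.keys(scrapedData));     // e.g. [ 'Travel', 'HistoricalFiction', 'Mystery' ]
      console.log(scrapedData['Travel'].length); // number of travel books scraped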

      Conclusion

      In this tutorial, you built a web crawler that scraped data across multiple pages recursively and then saved it to a JSON file. In short, you learned a new way to automate gathering data from websites.

      Puppeteer has many more features that fall outside the scope of this tutorial. To learn more, check out Using Puppeteer for Easy Control Over Headless Chrome. You can also visit Puppeteer's official documentation.




      How To Build a Concurrent Web Scraper with Puppeteer, Node.js, Docker, and Kubernetes

      Introduction

      Web scraping, also known as web crawling, uses bots to extract, parse, and download content and data from websites.

      You can scrape data from a few dozen web pages using a single machine, but if you have to retrieve data from hundreds or even thousands of web pages, you might want to consider distributing the workload.

      In this tutorial you will use Puppeteer to scrape books.toscrape, a fictional bookstore that functions as a safe place for beginners to learn web scraping and for developers to validate their scraping technologies. At the time of writing this, there are 1000 books on books.toscrape and therefore 1000 web pages that you could scrape. However, in this tutorial, you will only scrape the first 400. To scrape all these web pages in a short amount of time, you will build and deploy a scalable app containing the Express web framework and the Puppeteer browser controller to a Kubernetes cluster. To interact with your scraper, you will then build an app containing axios, a promise-based HTTP client, and lowdb, a small JSON database for Node.js.

      When you complete this tutorial, you will have a scalable scraper capable of simultaneously extracting data from multiple pages. With the default settings and a three-node cluster, for instance, it will take less than 2 minutes to scrape 400 pages on books.toscrape. After scaling your cluster, it will take about 30 seconds.

      Warning: The ethics and legality of web scraping are very complex and continually evolving. They also differ based on your location, the data’s location, and the website in question. This tutorial scrapes a special website, books.toscrape.com, explicitly designed to test scraper applications. Scraping any other domain falls outside the scope of this tutorial.

      Prerequisites

      To follow this tutorial, you will need a machine with:

      Step 1 — Analyzing the Target Website

      Before writing any code, navigate to books.toscrape in a web browser. Examine how data is structured and why concurrent scraping is an optimal solution.

      books.toscrape homepage header

      Note that there are 1,000 books on this website, but each page only displays 20 books.

      Scroll to the bottom of the page.

      books.toscrape homepage footer

      The content on this website is paginated, and there are 50 total pages. Because each page shows 20 books and you only want to scrape the first 400 books, you will only retrieve the title, price, rating, and URL for every book displayed on the first 20 pages.

      The whole process should take less than 1 minute.

      Open your browser’s dev tools and inspect the first book on the page. You will see the following content:

      books.toscrape homepage with dev tools

      Every book is inside the <section> tag, and each book is listed under its own <li> tag. Inside each <li> tag there is an <article> tag with a class attribute equal to product_pod. This is the element that we want to scrape.
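
      You can verify this structure quickly from your browser's dev tools console. The following one-liner is only an illustration (the selector is an assumption based on the markup described above); it counts the product cards on the current page and should log 20:

      example dev tools console check

      // Run in the dev tools console on the books.toscrape.com homepage
      document.querySelectorAll('section li article.product_pod').length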

      After getting the metadata for every book on the first 20 pages and storing it, you will have a local database containing 400 books. However, since more detailed information about each book exists on its own page, you will need to navigate 400 additional pages using the URL inside each book’s metadata. You will then retrieve the missing book details that you want and add this data to your local database. The missing data that you are going to retrieve are the description, the UPC (Universal Product Code), the number of reviews, and the book’s availability. Going through 400 pages using a single machine can take more than 7 minutes, and this is why you will need Kubernetes to divide the work across multiple machines.

      Now click the link for the first book on the homepage, which will open that book’s details page. Open your browser’s dev tools again and inspect the page.

      books.toscrape book page with dev tools

      The missing information that you want to extract is, again, inside an <article> tag with a class attribute equal to product_page.

      To interact with our scraper in the cluster, you will need to create a client application capable of sending HTTP requests to our Kubernetes cluster. You will first code the server side and then the client side of this project.

      In this section, you have reviewed what information your scraper will retrieve and why you need to deploy this scraper to a Kubernetes cluster. In the next section, you will create the directories for the client and server applications.

      Step 2 — Creating the Project Root Directory

      In this step, you will create your project’s directory structure. Then you will initialize a Node.js project for your client and server applications.

      Open a terminal window and create a new directory called concurrent-webscraper:

      • mkdir concurrent-webscraper

       

      Navigate into the directory:

      • cd ./concurrent-webscraper

       

      Now create three subdirectories named server, client, and k8s:

      Navigate into the server directory:

      Create a new Node.js project. Running npm’s init command will create a package.json file, which will help you manage your dependencies and metadata.

      Run the initialization command:

      To accept the default values, press ENTER for all the prompts; alternatively, you can personalize your responses. You can read more about npm’s initialization settings in Step One of our tutorial, How To Use Node.js Modules with npm and package.json.

      Open the package.json file and edit it:

      You need to modify the main property, add some information to the scripts directive, and then create a dependencies directive.

      Replace the contents inside the file with the highlighted code:

      ./server/package.json

      {
        "name": "server",
        "version": "1.0.0",
        "description": "",
        "main": "server.js",
        "scripts": {
          "start": "node server.js"
        },
        "keywords": [],
        "author": "",
        "license": "ISC",
        "dependencies": {
        "body-parser": "^1.19.0",
        "express": "^4.17.1",
        "puppeteer": "^3.0.0"
        }
      }
      

      Here you changed the main and scripts properties, and you also edited the dependencies property. Because the server application will run inside a Docker container, you do not need to run the npm install command, which usually follows initialization and automatically adds each dependency to package.json.

      Save and close the file.

      Navigate to your client directory:

      Create another Node.js project:

      Follow the same procedure to accept the default settings or customize your responses.

      Open the package.json file and edit it:

      Replace the contents inside the file with the highlighted code:

      ./client/package.json

      {
        "name": "client",
        "version": "1.0.0",
        "description": "",
        "main": "main.js",
        "scripts": {
          "start": "node main.js"
        },
        "author": "",
        "license": "ISC"
      }
      

      Here you changed the main and scripts properties.

      This time, use npm to install the necessary dependencies:

      • npm install axios lowdb --save

       

      In this block of code, you have installed axios and lowdb. axios is a promise based HTTP client for the browser and Node.js. You will use this module to send asynchronous HTTP requests to REST endpoints in our scraper to interact with it; lowdb is a small JSON database for Node.js and the browser, which you will use to store your scraped data.
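
      As a quick, standalone illustration of how these two modules fit together, consider the sketch below. It is not part of the tutorial’s client code; the file name sketch.js, the books.json database file, and the endpoint URL are placeholders:

      example axios and lowdb usage

      // sketch.js - illustration only, not part of this tutorial's client application
      const axios = require('axios');
      const low = require('lowdb');
      const FileSync = require('lowdb/adapters/FileSync');

      // lowdb persists a plain JSON file on disk
      const db = low(new FileSync('books.json'));
      db.defaults({ books: [] }).write();

      async function fetchAndStore() {
          // axios returns a promise; the parsed JSON body is available on response.data
          const response = await axios.get('http://localhost:8080/books');
          db.get('books').push(...response.data).write();
      }

      fetchAndStore();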

      In this step, you created a project directory and initialized a Node.js project for your application server that will contain the scraper; you then did the same for your client application that will interact with the application server. You also created a directory for your Kubernetes configuration files. In the next step, you will start building the application server.

      Step 3 — Building the First Scraper File

      In this step and Step 4, you are going to create the scraper on the server side. This application will consist of two files: puppeteerManager.js and server.js. The puppeteerManager.js file will create and manage browser sessions, and the server.js file will receive requests to scrape one or multiple web pages. In turn, these requests will call a method inside puppeteerManager.js that will scrape a given web page and return the scraped data. In this step, you will create the puppeteerManager.js file. In Step 4, you will create the server.js file.

      First, return to the server directory and create a file called puppeteerManager.js.

      Navigate to the server folder:

      Create and open the puppeteerManager.js file using your preferred text editor:

      Your puppeteerManager.js file will contain a class called PuppeteerManager, and this class will create and manage a Puppeteer browser instance. You will first create this class and then add a constructor to it.

      Add the following code to your puppeteerManager.js file:

      puppeteerManager.js

      class PuppeteerManager {
          constructor(args) {
              this.url = args.url
              this.existingCommands = args.commands
              this.nrOfPages = args.nrOfPages
              this.allBooks = [];
              this.booksDetails = {}
          }
      }
      module.exports = { PuppeteerManager }
      

      In this first block of code, you have created the PuppeteerManager class and added a constructor to it.
      The constructor expects to receive an object containing the following properties:

      • url: This property will hold a string, which will be the address of the page that you want to scrape.
      • commands: This property will hold an array, which provides instructions for the browser. For example, it will direct the browser to click a button or parse a specific DOM element. Each command has the following properties: description, locatorCss, and type. description tells you what the command does, locatorCss finds the appropriate element in the DOM, and type chooses the specific action.
      • nrOfPages: This property will hold an integer, which your application will use to determine how many times commands should repeat. books.toscrape.com, for instance, only shows 20 books per page, so to get all 400 books on all 20 pages, you will use this property to repeat the existing commands 20 times.

      In this code block, you also assigned the received object properties to the constructor variables url, existingCommands, and nrOfPages. You then created two additional variables: allBooks and booksDetails. You will use the variable allBooks to store the metadata for all retrieved books and the variable booksDetails to store the missing book details for a given, individual book.
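
      To make the shape of this object concrete, here is a minimal, hypothetical instantiation. The exact command objects used by the finished application are defined elsewhere in the project, so treat these values as placeholders:

      example PuppeteerManager instantiation

      const { PuppeteerManager } = require('./puppeteerManager')

      // Hypothetical arguments: two commands repeated across 20 listing pages
      const manager = new PuppeteerManager({
          url: 'http://books.toscrape.com/',
          nrOfPages: 20,
          commands: [
              { description: 'Parse all books on the page', locatorCss: 'section ol.row', type: 'getItems' },
              { description: 'Go to the next page', locatorCss: '.next > a', type: 'click' }
          ]
      })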

      You are now ready to add a few methods to the PuppeteerManager class. This class will have the following methods: runPuppeteer(), executeCommand(), sleep(), getAllBooks(), and getBooksDetails(). Because these methods form the core of your scraper application, it is worth examining them one by one.

      Coding the runPuppeteer() Method

      The first method inside the PuppeteerManager class is runPuppeteer(). This will require the Puppeteer module and launch your browser instance.

      At the bottom of the PuppeteerManager class, add the following code:

      puppeteerManager.js

      . . .
          async runPuppeteer() {
              const puppeteer = require('puppeteer')
              let commands = []
              if (this.nrOfPages > 1) {
                  for (let i = 0; i < this.nrOfPages; i++) {
                      if (i < this.nrOfPages - 1) {
                          commands.push(...this.existingCommands)
                      } else {
                          commands.push(this.existingCommands[0])
                      }
                  }
              } else {
                  commands = this.existingCommands
              }
              console.log('commands length', commands.length)
          }
      

      In this block of code, you created the runPuppeteer() method. First, you required the puppeteer module and then created a variable called commands that starts as an empty array. Using conditional logic, you stated that if the number of pages to scrape is greater than one, the code should loop through nrOfPages and add the existingCommands for each page to the commands array. However, when it reaches the last page, it only adds the first command from existingCommands, because the final command in that array clicks the next page button and there is no further page to visit.

      The next step is to create a browser instance.

      At the bottom of the runPuppeteer() method that you just created, add the following code:

      puppeteerManager.js

      . . .
          async runPuppeteer() {
              . . .
      
              const browser = await puppeteer.launch({
                  headless: true,
                  args: [
                      "--no-sandbox",
                      "--disable-gpu",
                  ]
              });
              let page = await browser.newPage()
      
              . . .
          }
      

      In this block of code, you created a browser instance using the built-in puppeteer.launch() method. You are designating that the instance run in headless mode. This is the default option and necessary for this project because you are running the application on Kubernetes. The next two arguments are standard when creating a browser without a graphical user interface. Lastly, you created a new page object using Puppeteer’s browser.newPage() method. The .launch() method returns a Promise, which requires the await keyword.

      You are now ready to add some behavior to your new page object, including how it will navigate a URL.

      At the bottom of the runPuppeteer() method, add the following code:

      puppeteerManager.js

      . . .
          async runPuppeteer() {
              . . .
      
              await page.setRequestInterception(true);
              page.on('request', (request) => {
                  if (['image'].indexOf(request.resourceType()) !== -1) {
                      request.abort();
                  } else {
                      request.continue();
                  }
              });
      
              await page.on('console', msg => {
                  for (let i = 0; i < msg._args.length; ++i) {
                      msg._args[i].jsonValue().then(result => {
                          console.log(result);
                      })
                  }
              });
      
              await page.goto(this.url);
      
              . . .
          }
      

      In this block of code, the page object intercepts all requests using Puppeteer’s page.setRequestInterception() method, and if the request is to load an image, it prevents the image from loading, thus decreasing the time needed to load a web page. Then the page object intercepts any attempt to display a message in the browser context using Puppeteer’s page.on('console') event. The page then navigates to a given url using the page.goto() method.

      Now add some more behaviors to your page object that will control how it finds elements in the DOM and runs commands on them.

      At the bottom of the runPuppeteer() method add the following code:

      puppeteerManager.js

      . . .
          async runPuppeteer() {
              . . .
      
              let timeout = 6000
              let commandIndex = 0
              while (commandIndex < commands.length) {
                  try {
                      console.log(`command ${(commandIndex + 1)}/${commands.length}`)
                      let frames = page.frames()
                      await frames[0].waitForSelector(commands[commandIndex].locatorCss, { timeout: timeout })
                      await this.executeCommand(frames[0], commands[commandIndex])
                      await this.sleep(1000)
                  } catch (error) {
                      console.log(error)
                      break
                  }
                  commandIndex++
              }
              console.log('done')
              await browser.close()
          }
      

      In this block of code, you created two variables, timeout and commandIndex. The first variable will limit the amount of time that the code will wait for an element on a web page, and the second variable controls how you will loop through the commands array.

      Inside the while loop, the code goes through every command in the commands array. First, you are creating an array of all frames attached to the page using the page.frames() method. It searches for a DOM element in a frame object of a page using the frame.waitForSelector() method and the locatorCss property. If an element is found, it calls the executeCommand() method and passes the frame and the command object as parameters. After the executeCommand returns, it calls the sleep() method, which makes the code wait 1 second before executing the next command. Finally, when there are no more commands, the browser instance closes.
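
      The sleep() helper referenced above is one of the methods listed earlier but has not been defined yet at this point. A minimal, promise-based version could look like the sketch below; this is an assumption about its shape, shown only so the loop above is easier to follow:

      example sleep() method

          // Resolves after the given number of milliseconds, so callers can await a pause
          sleep(ms) {
              return new Promise(resolve => setTimeout(resolve, ms))
          }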

      This completes your runPuppeteer() method. At this point, your puppeteerManager.js file should look like this:

      puppeteerManager.js

      class PuppeteerManager {
          constructor(args) {
              this.url = args.url
              this.existingCommands = args.commands
              this.nrOfPages = args.nrOfPages
              this.allBooks = [];
              this.booksDetails = {}
          }
      
          async runPuppeteer() {
              const puppeteer = require('puppeteer')
              let commands = []
              if (this.nrOfPages > 1) {
                  for (let i = 0; i < this.nrOfPages; i++) {
                      if (i < this.nrOfPages - 1) {
                          commands.push(...this.existingCommands)
                      } else {
                          commands.push(this.existingCommands[0])
                      }
                  }
              } else {
                  commands = this.existingCommands
              }
              console.log('commands length', commands.length)
      
              const browser = await puppeteer.launch({
                  headless: true,
                  args: [
                      "--no-sandbox",
                      "--disable-gpu",
                  ]
              });
      
              let page = await browser.newPage()
              await page.setRequestInterception(true);
              page.on('request', (request) => {
                  if (['image'].indexOf(request.resourceType()) !== -1) {
                      request.abort();
                  } else {
                      request.continue();
                  }
              });
      
              await page.on('console', msg => {
                  for (let i = 0; i < msg._args.length; ++i) {
                      msg._args[i].jsonValue().then(result => {
                          console.log(result);
                      })
      
                  }
              });
      
              await page.goto(this.url);
      
              let timeout = 6000
              let commandIndex = 0
              while (commandIndex < commands.length) {
                  try {
      
                      console.log(`command ${(commandIndex + 1)}/${commands.length}`)
                      let frames = page.frames()
                      await frames[0].waitForSelector(commands[commandIndex].locatorCss, { timeout: timeout })
                      await this.executeCommand(frames[0], commands[commandIndex])
                      await this.sleep(1000)
                  } catch (error) {
                      console.log(error)
                      break
                  }
                  commandIndex++
              }
              console.log('done')
              await browser.close();
          }
      }
      

      Now you are ready to code the second method for puppeteerManager.js: executeCommand().

      Coding the executeCommand() Method

      After creating the runPuppeteer() method, it is now time to create the executeCommand() method. This method is responsible for deciding what actions Puppeteer should perform, like clicking a button or parsing one or multiple DOM elements.

      At the bottom of the PuppeteerManager class add the following code:

      puppeteerManager.js

      . . .
          async executeCommand(frame, command) {
              await console.log(command.type, command.locatorCss)
              switch (command.type) {
                  case "click":
                      break;
                  case "getItems":
                      break;
                  case "getItemDetails":
                      break;
              }
          }
      

      In this code block, you created the executeCommand() method. This method expects two arguments, a frame object that will contain page elements and a command object that will contain commands. This method consists of a switch statement with the following cases: click, getItems, and getItemDetails.

      Define the click case.

      Replace break; underneath case "click": with the following code:

      puppeteerManager.js

          async executeCommand(frame, command) {
              . . .
                  case "click":
                      try {
                          await frame.$eval(command.locatorCss, element => element.click());
                          return true
                      } catch (error) {
                          console.log("error", error)
                          return false
                      }
              . . .        
          }
      

      Your code will trigger the click case when command.type equals click. This block of code is responsible for clicking the next button to move through the paginated list of books.

      Now program the next case statement.

      Replace break; underneath case "getItems": with the following code:

      puppeteerManager.js

          async executeCommand(frame, command) {
              . . .
                  case "getItems":
                      try {
                          let books = await frame.evaluate((command) => {
                              function wordToNumber(word) {
                                  let number = 0
                                  let words = ["zero","one","two","three","four","five"]
                                  for(let n = 0; n < words.length; n++){
                                      if(word == words[n]){
                                          number = n
                                          break
                                      }
                                  }
                                  return number
                              }
      
                              try {
                                  let parsedItems = [];
                                  let items = document.querySelectorAll(command.locatorCss);
                                  items.forEach((item) => {
                                      let link = 'http://books.toscrape.com/catalogue/' + item.querySelector('div.image_container a').getAttribute('href').replace('catalogue/', '')
                                      let starRating = item.querySelector('p.star-rating').getAttribute('class').replace('star-rating ', '').toLowerCase().trim()
                                      let title = item.querySelector('h3 a').getAttribute('title')
                                      let price = item.querySelector('p.price_color').innerText.replace('£', '').trim()
                                      let book = {
                                          title: title,
                                          price: parseInt(price),
                                          rating: wordToNumber(starRating),
                                          url: link
                                      }
                                      parsedItems.push(book)
                                  })
                                  return parsedItems;
                              } catch (error) {
                                  console.log(error)
                              }
                          }, command).then(result => {
                              this.allBooks.push.apply(this.allBooks, result)
                              console.log('allBooks length ', this.allBooks.length)
                          })
                          return true
                      } catch (error) {
                          console.log("error", error)
                          return false
                      }
              . . .
          }
      

      The getItems case will trigger when command.type is equal to getItems. You are using the frame.evaluate() method to switch the browser context and then create a function called wordToNumber(). This function will convert the starRating of a book from a string to an integer. The code will then use the document.querySelectorAll() method to parse and match the DOM and retrieve the metadata of the books displayed in the given frame of a web page. Once the metadata is retrieved, the code will add it to the allBooks array.

      Now you can define the final case statement.

      Replace break; underneath case "getItemDetails" with the following code:

      puppeteerManager.js

          async executeCommand(frame, command) {
              . . .
                  case "getItemDetails":
                      try {
                          this.booksDetails = JSON.parse(JSON.stringify(await frame.evaluate((command) => {
                              try {
                                  let item = document.querySelector(command.locatorCss);
                                  let description = item.querySelector('.product_page > p:nth-child(3)').innerText.trim()
                                  let upc = item.querySelector('.table > tbody:nth-child(1) > tr:nth-child(1) > td:nth-child(2)')
                                      .innerText.trim()
                                  let nrOfReviews = item.querySelector('.table > tbody:nth-child(1) > tr:nth-child(7) > td:nth-child(2)')
                                      .innerText.trim()
                                  let availability = item.querySelector('.table > tbody:nth-child(1) > tr:nth-child(6) > td:nth-child(2)')
                                      .innerText.replace('In stock (', '').replace(' available)', '')
                                  let details = {
                                      description: description,
                                      upc: upc,
                                      nrOfReviews: parseInt(nrOfReviews),
                                      availability: parseInt(availability)
                                  }
                                  return details;
                              } catch (error) {
                                  console.log(error)
                                  return error
                              }
      
                          }, command)))
                          console.log(this.booksDetails)
                          return true
                      } catch (error) {
                          console.log("error", error)
                          return false
                      }
          }
      

      The getItemDetails case will trigger when command.type is equal to getItemDetails. You used the frame.evaluate() and .querySelector() methods again to switch the browser context and to parse the DOM. But this time, you retrieved the missing details for each book in a given frame of a web page. You then assigned these missing details to the booksDetails object.

      This completes your executeCommand() method. Your puppeteerManager.js file will now look like this:

      puppeteerManager.js

      class PuppeteerManager {
          constructor(args) {
              this.url = args.url
              this.existingCommands = args.commands
              this.nrOfPages = args.nrOfPages
              this.allBooks = [];
              this.booksDetails = {}
          }
      
          async runPuppeteer() {
              const puppeteer = require('puppeteer')
              let commands = []
              if (this.nrOfPages > 1) {
                  for (let i = 0; i < this.nrOfPages; i++) {
                      if (i < this.nrOfPages - 1) {
                          commands.push(...this.existingCommands)
                      } else {
                          commands.push(this.existingCommands[0])
                      }
                  }
              } else {
                  commands = this.existingCommands
              }
              console.log('commands length', commands.length)
      
              const browser = await puppeteer.launch({
                  headless: true,
                  args: [
                      "--no-sandbox",
                      "--disable-gpu",
                  ]
              });
      
              let page = await browser.newPage()
              await page.setRequestInterception(true);
              page.on('request', (request) => {
                  if (['image'].indexOf(request.resourceType()) !== -1) {
                      request.abort();
                  } else {
                      request.continue();
                  }
              });
      
              await page.on('console', msg => {
                  for (let i = 0; i < msg._args.length; ++i) {
                      msg._args[i].jsonValue().then(result => {
                          console.log(result);
                      })
      
                  }
              });
      
              await page.goto(this.url);
      
              let timeout = 6000
              let commandIndex = 0
              while (commandIndex < commands.length) {
                  try {
      
                      console.log(`command ${(commandIndex + 1)}/${commands.length}`)
                      let frames = page.frames()
                      await frames[0].waitForSelector(commands[commandIndex].locatorCss, { timeout: timeout })
                      await this.executeCommand(frames[0], commands[commandIndex])
                      await this.sleep(1000)
                  } catch (error) {
                      console.log(error)
                      break
                  }
                  commandIndex++
              }
              console.log('done')
              await browser.close();
          }
      
          async executeCommand(frame, command) {
              await console.log(command.type, command.locatorCss)
              switch (command.type) {
                  case "click":
                      try {
                          await frame.$eval(command.locatorCss, element => element.click());
                          return true
                      } catch (error) {
                          console.log("error", error)
                          return false
                      }
                  case "getItems":
                      try {
                          let books = await frame.evaluate((command) => {
                              function wordToNumber(word) {
                                  let number = 0
                                  let words = ["zero","one","two","three","four","five"]
                                    for(let n=0;n<words.length;n++){
                                      if(word == words[n]){
                                          number = n
                                          break
                                      }
                                  }  
                                  return number
                              }
                              try {
                                  let parsedItems = [];
                                  let items = document.querySelectorAll(command.locatorCss);
      
                                  items.forEach((item) => {
                                      let link = 'http://books.toscrape.com/catalogue/' + item.querySelector('div.image_container a').getAttribute('href').replace('catalogue/', '')
                                      let starRating = item.querySelector('p.star-rating').getAttribute('class').replace('star-rating ', '').toLowerCase().trim()
                                      let title = item.querySelector('h3 a').getAttribute('title')
                                      let price = item.querySelector('p.price_color').innerText.replace('£', '').trim()
                                      let book = {
                                          title: title,
                                          price: parseFloat(price),
                                          rating: wordToNumber(starRating),
                                          url: link
                                      }
                                      parsedItems.push(book)
                                  })
                                  return parsedItems;
                              } catch (error) {
                                  console.log(error)
                              }
                          }, command).then(result => {
                              this.allBooks.push.apply(this.allBooks, result)
                              console.log('allBooks length ', this.allBooks.length)
                          })
                          return true
                      } catch (error) {
                          console.log("error", error)
                          return false
                      }
                  case "getItemDetails":
                      try {
                          this.booksDetails = JSON.parse(JSON.stringify(await frame.evaluate((command) => {
                              try {
                                  let item = document.querySelector(command.locatorCss);
                                  let description = item.querySelector('.product_page > p:nth-child(3)').innerText.trim()
                                  let upc = item.querySelector('.table > tbody:nth-child(1) > tr:nth-child(1) > td:nth-child(2)')
                                      .innerText.trim()
                                  let nrOfReviews = item.querySelector('.table > tbody:nth-child(1) > tr:nth-child(7) > td:nth-child(2)')
                                      .innerText.trim()
                                  let availability = item.querySelector('.table > tbody:nth-child(1) > tr:nth-child(6) > td:nth-child(2)')
                                      .innerText.replace('In stock (', '').replace(' available)', '')
                                  let details = {
                                      description: description,
                                      upc: upc,
                                      nrOfReviews: parseInt(nrOfReviews),
                                      availability: parseInt(availability)
                                  }
                                  return details;
                              } catch (error) {
                                  console.log(error)
                                  return error
                              }
      
                          }, command))) 
                          console.log(this.booksDetails)
                          return true
                      } catch (error) {
                          console.log("error", error)
                          return false
                      }
              }
          }
      }
      

      You are now ready to create the third method for your PuppeteerManager class: sleep().

      Coding the sleep() Method

      With the executeCommand() method created, your next step is to create the sleep() method. This method will make your code wait a specific amount of time before executing the next line of code. This is essential for reducing the crawl rate. Without this precaution, the scraper could, for example, click a button on page A and then search for an element on page B before page B even loads.

      At the bottom of the PuppeteerManager class add the following code:

      puppeteerManager.js

      . . .
          sleep(ms) {
              return new Promise(resolve => setTimeout(resolve, ms))
          }
      

      You are passing an integer to the sleep() method. This integer is the amount of time in milliseconds that the code should wait.

      Now code the final two methods inside the PuppeteerManager class: getAllBooks() and getBooksDetails().

      Coding the getAllBooks() and getBooksDetails() Methods

      After creating the sleep() method, create the getAllBooks() method. A function inside the server.js file will call this function. getAllBooks() is responsible for calling runPuppeteer(), getting the books displayed on a number of given pages, and then returning the retrieved books to the function that called it in the server.js file.

      At the bottom of the PuppeteerManager class add the following code:

      puppeteerManager.js

      . . .
          async getAllBooks() {
              await this.runPuppeteer()
              return this.allBooks
          }
      

      Note that because getAllBooks() is declared async, it implicitly returns a Promise.

      Now you can create the final method: getBooksDetails(). Like getAllBooks(), a function inside server.js will call this function. getBooksDetails(), however, is responsible for retrieving the missing details for each book. It will also return these details to the function that called it in the server.js file.

      At the bottom of the PuppeteerManager class add the following code:

      puppeteerManager.js

      . . .
          async getBooksDetails() {
              await this.runPuppeteer()
              return this.booksDetails
          }
      

      You have now finished coding your puppeteerManager.js file.

      After adding the five methods described in this section, your completed file will look like this:

      puppeteerManager.js

      class PuppeteerManager {
          constructor(args) {
              this.url = args.url
              this.existingCommands = args.commands
              this.nrOfPages = args.nrOfPages
              this.allBooks = [];
              this.booksDetails = {}
          }
      
          async runPuppeteer() {
              const puppeteer = require('puppeteer')
              let commands = []
              if (this.nrOfPages > 1) {
                  for (let i = 0; i < this.nrOfPages; i++) {
                      if (i < this.nrOfPages - 1) {
                          commands.push(...this.existingCommands)
                      } else {
                          commands.push(this.existingCommands[0])
                      }
                  }
              } else {
                  commands = this.existingCommands
              }
              console.log('commands length', commands.length)
      
              const browser = await puppeteer.launch({
                  headless: true,
                  args: [
                      "--no-sandbox",
                      "--disable-gpu",
                  ]
              });
      
              let page = await browser.newPage()
              await page.setRequestInterception(true);
              page.on('request', (request) => {
                  if (['image'].indexOf(request.resourceType()) !== -1) {
                      request.abort();
                  } else {
                      request.continue();
                  }
              });
      
              await page.on('console', msg => {
                  for (let i = 0; i < msg._args.length; ++i) {
                      msg._args[i].jsonValue().then(result => {
                          console.log(result);
                      })
      
                  }
              });
      
              await page.goto(this.url);
      
              let timeout = 6000
              let commandIndex = 0
              while (commandIndex < commands.length) {
                  try {
      
                      console.log(`command ${(commandIndex + 1)}/${commands.length}`)
                      let frames = page.frames()
                      await frames[0].waitForSelector(commands[commandIndex].locatorCss, { timeout: timeout })
                      await this.executeCommand(frames[0], commands[commandIndex])
                      await this.sleep(1000)
                  } catch (error) {
                      console.log(error)
                      break
                  }
                  commandIndex++
              }
              console.log('done')
              await browser.close();
          }
      
          async executeCommand(frame, command) {
              await console.log(command.type, command.locatorCss)
              switch (command.type) {
                  case "click":
                      try {
                          await frame.$eval(command.locatorCss, element => element.click());
                          return true
                      } catch (error) {
                          console.log("error", error)
                          return false
                      }
                  case "getItems":
                      try {
                          let books = await frame.evaluate((command) => {
                              function wordToNumber(word) {
                                  let number = 0
                                  let words = ["zero","one","two","three","four","five"]
                                    for(let n=0;n<words.length;n++){
                                      if(word == words[n]){
                                          number = n
                                          break
                                      }
                                  }  
                                  return number
                              }
      
                              try {
                                  let parsedItems = [];
                                  let items = document.querySelectorAll(command.locatorCss);
      
                                  items.forEach((item) => {
                                      let link = 'http://books.toscrape.com/catalogue/' + item.querySelector('div.image_container a').getAttribute('href').replace('catalogue/', '')
                                      let starRating = item.querySelector('p.star-rating').getAttribute('class').replace('star-rating ', '').toLowerCase().trim()
                                      let title = item.querySelector('h3 a').getAttribute('title')
                                      let price = item.querySelector('p.price_color').innerText.replace('£', '').trim()
                                      let book = {
                                          title: title,
                                          price: parseFloat(price),
                                          rating: wordToNumber(starRating),
                                          url: link
                                      }
                                      parsedItems.push(book)
                                  })
                                  return parsedItems;
                              } catch (error) {
                                  console.log(error)
                              }
                          }, command).then(result => {
                              this.allBooks.push.apply(this.allBooks, result)
                              console.log('allBooks length ', this.allBooks.length)
                          })
                          return true
                      } catch (error) {
                          console.log("error", error)
                          return false
                      }
                  case "getItemDetails":
                      try {
                          this.booksDetails = JSON.parse(JSON.stringify(await frame.evaluate((command) => {
                              try {
                                  let item = document.querySelector(command.locatorCss);
                                  let description = item.querySelector('.product_page > p:nth-child(3)').innerText.trim()
                                  let upc = item.querySelector('.table > tbody:nth-child(1) > tr:nth-child(1) > td:nth-child(2)')
                                      .innerText.trim()
                                  let nrOfReviews = item.querySelector('.table > tbody:nth-child(1) > tr:nth-child(7) > td:nth-child(2)')
                                      .innerText.trim()
                                  let availability = item.querySelector('.table > tbody:nth-child(1) > tr:nth-child(6) > td:nth-child(2)')
                                      .innerText.replace('In stock (', '').replace(' available)', '')
                                  let details = {
                                      description: description,
                                      upc: upc,
                                      nrOfReviews: parseInt(nrOfReviews),
                                      availability: parseInt(availability)
                                  }
                                  return details;
                              } catch (error) {
                                  console.log(error)
                                  return error
                              }
      
                          }, command))) 
                          console.log(this.booksDetails)
                          return true
                      } catch (error) {
                          console.log("error", error)
                          return false
                      }
              }
          }
      
          sleep(ms) {
              return new Promise(resolve => setTimeout(resolve, ms))
          }
      
          async getAllBooks() {
              await this.runPuppeteer()
              return this.allBooks
          }
      
          async getBooksDetails() {
              await this.runPuppeteer()
              return this.booksDetails
          }
      }
      
      module.exports = { PuppeteerManager }
      

      In this step you used the module Puppeteer to create the puppeteerManager.js file. This file forms the core of your scraper. In the next section you will create the server.js file.

      Step 4 — Building the Second Scraper File

      In this step, you will create the server.js file — the second half of your application server. This file will receive requests that specify what data to scrape and then return the scraped data to the client.

      Create the server.js file and open it:

      Add the following code:

      server.js

      const express = require('express');
      const bodyParser = require('body-parser')
      const os = require('os');
      
      const PORT = 5000;
      const app = express();
      let timeout = 1500000
      
      app.use(bodyParser.urlencoded({ extended: true }))
      app.use(bodyParser.json())
      
      let browsers = 0
      let maxNumberOfBrowsers = 5
      

      In this code block, you required the express and body-parser modules. These modules are necessary to create an application server capable of handling HTTP requests. The express module will create the application server, and the body-parser module will parse incoming request bodies before your route handlers run, exposing the parsed contents on req.body. You then required the os module, which will retrieve the name of the machine running your application. After that, you specified a port for the application and created the variables browsers and maxNumberOfBrowsers. These variables will help manage the number of browser instances that the server can create. In this case, the application is limited to creating five browser instances, which means that the scraper will be able to retrieve data from five pages simultaneously.

      Our web server will have the following routes: /, /api/books, and /api/booksDetails.

      At the bottom of your server.js file define the / route with the following code:

      server.js

      . . .
      
      app.get('/', (req, res) => {
        console.log(os.hostname())
        let response = {
          msg: 'hello world',
          hostname: os.hostname().toString()
        }
        res.send(response);
      });
      

      You will use the / route to check if your application server is running. A GET request sent to this route will return an object containing two properties: msg, which will only say “hello world,” and hostname, which will identify the machine where an instance of the application server is running.

      Now define the /api/books route.

      At the bottom of your server.js file, add the following code:

      server.js

      . . .
      
      app.post('/api/books', async (req, res) => {
        req.setTimeout(timeout);
        try {
          let data = req.body
          console.log(req.body.url)
          while (browsers == maxNumberOfBrowsers) {
            await sleep(1000)
          }
          await getBooksHandler(data).then(result => {
            let response = {
              msg: 'retrieved books ',
              hostname: os.hostname(),
              books: result
            }
            console.log('done')
            res.send(response)
          })
        } catch (error) {
          res.send({ error: error.toString() })
        }
      });
      

      The /api/books route will ask the scraper to retrieve the book-related metadata on a given web page. A POST request to this route will first check whether the number of running browsers equals maxNumberOfBrowsers; if it doesn't, it will call the getBooksHandler() method. This method will create a new instance of the PuppeteerManager class and retrieve the books' metadata. Once the metadata is retrieved, the route returns it in the response body to the client. The response object will contain a string, msg, that reads retrieved books, an array, books, that contains the metadata, and another string, hostname, that will return the name of the machine/container/pod where the application is running.
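      For reference, the request body for this route has the same shape as the one you will send with curl in a later step:

      {
        "url": "http://books.toscrape.com/index.html",
        "nrOfPages": 1,
        "commands": [
          { "description": "get items metadata", "locatorCss": ".product_pod", "type": "getItems" },
          { "description": "go to next page", "locatorCss": ".next > a:nth-child(1)", "type": "click" }
        ]
      }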

      We have one last route to define: /api/booksDetails.

      Add the following code to the bottom of your server.js file:

      server.js

      . . .
      
      app.post('/api/booksDetails', async (req, res) => {
        req.setTimeout(timeout);
        try {
          let data = req.body
          console.log(req.body.url)
          while (browsers == maxNumberOfBrowsers) {
            await sleep(1000)
          }
          await getBookDetailsHandler(data).then(result => {
            let response = {
              msg: 'retrieved book details',
              hostname: os.hostname(),
              url: req.body.url,
              booksDetails: result
            }
            console.log('done', response)
            res.send(response)
          })
        } catch (error) {
          res.send({ error: error.toString() })
        }
      });
      

      Sending a POST request to the /api/booksDetails route will ask the scraper to retrieve the missing information for a given book. The application server will check if the number of browsers running is equal to the maxNumberOfBrowsers. If it is, it will call the sleep() method and wait 1 second before checking again, and if it isn’t equal, it will call the method getBookDetailsHandler(). Like the getBooksHandler() method, this method will create a new instance of the PuppeteerManager class and retrieve the missing information.

      The program will then return the retrieved data in the response body to the client. The response object will contain a string, msg, saying retrieved book details, a string, hostname, that will return the name of the machine running the application, and another string, url, containing the product page's URL. It will also contain an object, booksDetails, containing all the missing information for the book.

      Your web server will also have the following functions: getBooksHandler(), getBookDetailsHandler(), and sleep().

      Start with the getBooksHandler() function.

      At the bottom of your server.js file, add the following code:

      server.js

      . . .
      
      async function getBooksHandler(arg) {
        let pMng = require('./puppeteerManager')
        let puppeteerMng = new pMng.PuppeteerManager(arg)
        browsers += 1
        try {
          let books = await puppeteerMng.getAllBooks().then(result => {
            return result
          })
          browsers -= 1
          return books
        } catch (error) {
          browsers -= 1
          console.log(error)
        }
      }
      

      The getBooksHandler() function will create a new instance of the PuppeteerManager class. It will increase the number of browsers running by one, pass the object containing the necessary information to retrieve the books, and then call the getAllBooks() method. After the data is retrieved, it decreases the number of browsers running by one and then returns the newly retrieved data to the /api/books route.

      Now add the following code to define the getBookDetailsHandler() function:

      server.js

      . . .
      
      async function getBookDetailsHandler(arg) {
        let pMng = require('./puppeteerManager')
        let puppeteerMng = new pMng.PuppeteerManager(arg)
        browsers += 1
        try {
          let booksDetails = await puppeteerMng.getBooksDetails().then(result => {
            return result
          })
          browsers -= 1
          return booksDetails
        } catch (error) {
          browsers -= 1
          console.log(error)
        }
      }
      

      The getBookDetailsHandler() function will create a new instance of the PuppeteerManager class. It functions just like the getBooksHandler() function except it handles the missing metadata for each book and returns it to the /api/booksDetails route.

      At the bottom of your server.js file add the following code to define the sleep() function:

      server.js

        function sleep(ms) {
          console.log(' running maximum number of browsers')
          return new Promise(resolve => setTimeout(resolve, ms))
        }
      

      The sleep() function makes the code wait for a specific amount of time when the number of browsers is equal to maxNumberOfBrowsers. You pass an integer to this function, representing the amount of time in milliseconds that the code should wait before checking again whether browsers equals maxNumberOfBrowsers.

      Your file is now complete.

      After creating all the necessary routes and functions, the server.js file will look like this:

      server.js

      const express = require('express');
      const bodyParser = require('body-parser')
      const os = require('os');
      
      const PORT = 5000;
      const app = express();
      let timeout = 1500000
      
      app.use(bodyParser.urlencoded({ extended: true }))
      app.use(bodyParser.json())
      
      let browsers = 0
      let maxNumberOfBrowsers = 5
      
      app.get('/', (req, res) => {
        console.log(os.hostname())
        let response = {
          msg: 'hello world',
          hostname: os.hostname().toString()
        }
        res.send(response);
      });
      
      app.post('/api/books', async (req, res) => {
        req.setTimeout(timeout);
        try {
          let data = req.body
          console.log(req.body.url)
          while (browsers == maxNumberOfBrowsers) {
            await sleep(1000)
          }
          await getBooksHandler(data).then(result => {
            let response = {
              msg: 'retrieved books ',
              hostname: os.hostname(),
              books: result
            }
            console.log('done')
            res.send(response)
          })
        } catch (error) {
          res.send({ error: error.toString() })
        }
      });
      
      
      app.post('/api/booksDetails', async (req, res) => {
        req.setTimeout(timeout);
        try {
          let data = req.body
          console.log(req.body.url)
          while (browsers == maxNumberOfBrowsers) {
            await sleep(1000)
          }
          await getBookDetailsHandler(data).then(result => {
            let response = {
              msg: 'retrieved book details',
              hostname: os.hostname(),
              url: req.body.url,
              booksDetails: result
            }
            console.log('done', response)
            res.send(response)
          })
        } catch (error) {
          res.send({ error: error.toString() })
        }
      });
      
      async function getBooksHandler(arg) {
        let pMng = require('./puppeteerManager')
        let puppeteerMng = new pMng.PuppeteerManager(arg)
        browsers += 1
        try {
          let books = await puppeteerMng.getAllBooks().then(result => {
            return result
          })
          browsers -= 1
          return books
        } catch (error) {
          browsers -= 1
          console.log(error)
        }
      }
      
      async function getBookDetailsHandler(arg) {
        let pMng = require('./puppeteerManager')
        let puppeteerMng = new pMng.PuppeteerManager(arg)
        browsers += 1
        try {
          let booksDetails = await puppeteerMng.getBooksDetails().then(result => {
            return result
          })
          browsers -= 1
          return booksDetails
        } catch (error) {
          browsers -= 1
          console.log(error)
        }
      }
      
      function sleep(ms) {
        console.log(' running maximum number of browsers')
        return new Promise(resolve => setTimeout(resolve, ms))
      }
      
      app.listen(PORT);
      console.log(`Running on port: ${PORT}`);
      

      In this step, you finished creating the application server. In the next step, you will create an image for the application server and then deploy it to your Kubernetes cluster.

      Step 5 — Building the Docker Image

      In this step, you will create a Docker image containing your scraper application. In Step 6 you will deploy that image to a Kubernetes cluster.

      To create a Docker image of your application, you will need to create a Dockerfile and then build the container.

      Make sure you are still in the ./server folder.

      Now create the Dockerfile and open it:

      Write the following code inside Dockerfile:

      Dockerfile

      FROM node:10
      
      RUN apt-get update
      
      RUN apt-get install -yyq ca-certificates
      
      RUN apt-get install -yyq libappindicator1 libasound2 libatk1.0-0 libc6 libcairo2 libcups2 libdbus-1-3 libexpat1 libfontconfig1 libgcc1 libgconf-2-4 libgdk-pixbuf2.0-0 libglib2.0-0 libgtk-3-0 libnspr4 libnss3 libpango-1.0-0 libpangocairo-1.0-0 libstdc++6 libx11-6 libx11-xcb1 libxcb1 libxcomposite1 libxcursor1 libxdamage1 libxext6 libxfixes3 libxi6 libxrandr2 libxrender1 libxss1 libxtst6
      
      RUN apt-get install -yyq gconf-service lsb-release wget xdg-utils
      
      RUN apt-get install -yyq fonts-liberation
      
      WORKDIR /usr/src/app
      
      COPY package*.json ./
      
      RUN npm install
      
      COPY . .
      
      EXPOSE 5000
      CMD [ "node", "server.js" ]
      

      Most of the code in this block consists of standard Dockerfile instructions. You built the image from a node:10 image. Next, you used the RUN command to install the necessary packages to run Puppeteer in a Docker container, and then you created the app directory. You copied your scraper's package.json file to the app directory and installed the dependencies specified inside the package.json file. Lastly, you bundled the app source, exposed the app on port 5000, and selected server.js as the entry file.

      Now create a .dockerignore file and open it. This will keep sensitive and unnecessary files out of your Docker image.

      Create the file using your preferred text editor:

      Add the following content to the file:

      ./server/.dockerignore

      node_modules
      npm-debug.log
      

      After creating the Dockerfile and the .dockerignore file, you can build the Docker image of the application and push it to a repository in your Docker Hub account. Before pushing the image, check that you are signed in to your Docker Hub account.

      Sign in to Docker Hub:

      • docker login --username=your_username --password=your_password

       

      Build the image:

      • docker build -t your_username/concurrent-scraper .

       

      Now it’s time to test the scraper. In this test, you will send a request to each route.

      First, start the app:

      • docker run -p 5000:5000 -d your_username/concurrent-scraper

       

      Now use curl to send a GET request to the / route:

      • curl http://localhost:5000/

       

      By sending a GET request to the / route, you should receive a response containing a msg saying hello world and a hostname. This hostname is the id of your Docker container. You should see an output similar to this, but with your machine’s unique ID:

      Output

      {"msg":"hello world","hostname":"0c52d53f97d3"}

      Now send a POST request to the /api/books route to get the metadata of all the books displayed on one web page:

      • curl --header "Content-Type: application/json" --request POST --data '{"url": "http://books.toscrape.com/index.html" , "nrOfPages":1 , "commands":[{"description": "get items metadata", "locatorCss": ".product_pod","type": "getItems"},{"description": "go to next page","locatorCss": ".next > a:nth-child(1)","type": "click"}]}' http://localhost:5000/api/books

       

      By sending a POST request to the /api/books route you will receive a response containing a msg saying retrieved books, a hostname similar to the one in the previous request, and a books array containing all 20 books displayed on the first page of the books.toscrape website. You should see an output like this, but with your machine’s unique ID:

      Output

      {"msg":"retrieved books ","hostname":"0c52d53f97d3","books":[{"title":"A Light in the Attic","price":null,"rating":0,"url":"http://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html"},{"title":"Tipping the Velvet","price":null,"rating":0,"url":"http://books.toscrape.com/catalogue/tipping-the-velvet_999/index.html"}, [ . . . ] }]}

      Now send a POST request to the /api/booksDetails route to get the missing information for a random book:

      • curl --header "Content-Type: application/json" --request POST --data '{"url": "http://books.toscrape.com/catalogue/slow-states-of-collapse-poems_960/index.html" , "nrOfPages":1 , "commands":[{"description": "get item details", "locatorCss": "article.product_page","type": "getItemDetails"}]}' http://localhost:5000/api/booksDetails

       

      By sending a POST request to the /api/booksDetails route you will receive a response containing a msg saying retrieved book details, a booksDetails object containing the missing details of this book, a url containing the address of the product’s page, as well as a hostname like the one in the previous requests. You will see an output like this:

      Output

      {"msg":"retrieved book details","hostname":"0c52d53f97d3","url":"http://books.toscrape.com/catalogue/slow-states-of-collapse-poems_960/index.html","booksDetails":{"description":"The eagerly anticipated debut from one of Canada's most exciting new poets In her debut collection, Ashley-Elizabeth Best explores the cultivation of resilience during uncertain and often trying times [...]","upc":"b4fd5943413e089a","nrOfReviews":0,"availability":17}}

      If your curl commands don't return the correct responses, make sure that the code in the files puppeteerManager.js and server.js matches the final code blocks in the previous two steps. Also, make sure that the Docker container is running and that it didn't crash. You can do that by running the Docker image without the -d option (this option makes the container run in detached mode) and then sending an HTTP request to one of the routes.

      If you still encounter errors when trying to run the Docker image, try stopping all running containers and running the scraper image without the -d option.

      First stop all containers:

      • docker stop $(docker ps -a -q)

       

      Then run the Docker command without the -d flag:

      • docker run -p 5000:5000 your_username/concurrent-scraper

       

      If you don't encounter any errors, clear the terminal window:

      Now that you have successfully tested the image, you can send it to your repository. Push the image to a repository in your Docker Hub account:

      • docker push your_username/concurrent-scraper:latest

       

      With your scraper application now available as an image on Docker Hub, you are ready to deploy to Kubernetes. This will be your next step.

      Step 6 — Deploying the Scraper to Kubernetes

      With your scraper image built and pushed to your repository, you are now ready for deployment.

      First, use kubectl to create a new namespace called concurrent-scraper-context:

      • kubectl create namespace concurrent-scraper-context

       

      Set concurrent-scraper-context as the default context:

      • kubectl config set-context --current --namespace=concurrent-scraper-context

       

      To create your application’s deployment, you will need to create a file called app-deployment.yaml, but first, you must navigate to the k8s directory inside your project. This is where you will store all your Kubernetes files.

      Go to the k8s directory inside your project:

      Create the app-deployment.yaml file and open it:

      Write the following code inside app-deployment.yaml. Make sure to replace your_DockerHub_username with your unique username:

      ./k8s/app-deployment.yaml

      apiVersion: apps/v1
      kind: Deployment
      metadata:
        name: scraper
        labels:
          app: scraper
      spec:
        replicas: 5
        selector:
          matchLabels:
            app: scraper
        template:
          metadata:
            labels:
              app: scraper
          spec:
            containers:
            - name: concurrent-scraper
              image: your_DockerHub_username/concurrent-scraper
              ports:
              - containerPort: 5000
      

      Most of the code in the preceding block is standard for a Kubernetes deployment file. First, you set the name of your app deployment to scraper, then you set the number of pods to 5, and then you set the name of your container to concurrent-scraper. After that, you specified the image that you want to use to build your app as your_DockerHub_username/concurrent-scraper, replacing your_DockerHub_username with your actual Docker Hub username. Lastly, you specified that your container listens on port 5000.

      After creating the deployment file, you are ready to deploy the app to the cluster.

      Deploy the app:

      • kubectl apply -f app-deployment.yaml

       

      You can monitor the status of your deployment by running the following command:

      • kubectl get deployment -w

       

      After running the command, you will see an output like this:

      Output

      NAME      READY   UP-TO-DATE   AVAILABLE   AGE
      scraper   0/5     5            0           7s
      scraper   1/5     5            1           23s
      scraper   2/5     5            2           25s
      scraper   3/5     5            3           25s
      scraper   4/5     5            4           33s
      scraper   5/5     5            5           33s

      It will take a few seconds for all the pods to start running, but once they are, you will have five instances of your scraper running. Each instance can scrape five pages at a time, so you will be able to scrape 25 pages simultaneously, thus reducing the time needed to scrape all 400 pages.
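      If you want to inspect the individual pods behind the deployment, you can also list them. Because you set concurrent-scraper-context as the current namespace earlier, no extra flag is needed:

      • kubectl get pods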

      To access your app from outside the cluster, you will need to create a service. This service will be a load balancer, and it will require a file called load-balancer.yaml.

      Create the load-balancer.yaml file and open it:

      Write the following code inside load-balancer.yaml:

      load-balancer.yaml

      apiVersion: v1
      kind: Service
      metadata:
        name: load-balancer
        labels:
          app: scraper
      spec:
        type: LoadBalancer
        ports:
        - port: 80
          targetPort: 5000
          protocol: TCP
        selector:
          app: scraper
      

      Most of the code in the preceding block is standard for a service file. First, you set the name of your service to load-balancer. You specified the service type, and then you made the service accessible on port 80. Lastly, you specified that this service targets the pods labeled app: scraper.

      Now that you have created your load-balancer.yaml file, deploy the service to the cluster.

      Deploy the service:

      • kubectl apply -f load-balancer.yaml

       

      Run the following command to monitor the status of your service:

      After running this command, you will see an output like this, but it will take a few seconds for the external IP to appear:

      Output

      NAME            TYPE           CLUSTER-IP     EXTERNAL-IP     PORT(S)        AGE
      load-balancer   LoadBalancer   10.245.91.92   <pending>       80:30802/TCP   10s
      load-balancer   LoadBalancer   10.245.91.92   161.35.252.69   80:30802/TCP   69s

      Your service’s EXTERNAL-IP and CLUSTER-IP will differ from the ones above. Make a note of your EXTERNAL-IP. You will use it in the next section.
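      If you need to print the external IP again later, one way to do it is with kubectl's jsonpath output, for example:

      • kubectl get service load-balancer -o jsonpath='{.status.loadBalancer.ingress[0].ip}'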

      In this step, you deployed the scraper application to your Kubernetes cluster. In the next step, you will create a client application to interact with your newly deployed application.

      Step 7 — Creating the Client Application

      In this step, you will build your client application, which will require the following three files: main.js, lowdbHelper.js, and books.json. The main.js file is the main file of your client application. It sends requests to your application server and then saves the retrieved data using a method that you will create inside the lowdbHelper.js file. The lowdbHelper.js file saves data to a local file and retrieves data from it. The books.json file is the local file where you will save all your scraped data.

      First return to your client directory:

      Because they are smaller than main.js, you will create the lowdbHelper.js and books.json files first.

      Create and open a file called lowdbHelper.js:

      Add the following code to the lowdbHelper.js file:

      lowdbHelper.js

      const lowdb = require('lowdb')
      const FileSync = require('lowdb/adapters/FileSync')
      const adapter = new FileSync('books.json')
      

      In this code block, you have required the module lowdb and then required the adapter FileSync, which you need to save and read data. You then direct the program to store data in a JSON file called books.json.

      Add the following code to the bottom of the lowdbHelper.js file:

      lowdbHelper.js

      . . .
      class LowDbHelper {
          constructor() {
              this.db = lowdb(adapter);
          }
      
          getData() {
              try {
                  let data = this.db.getState().books
                  return data
              } catch (error) {
                  console.log('error', error)
              }
          }
      
          saveData(arg) {
              try {
                  this.db.set('books', arg).write()
                  console.log('data saved successfully!!!')
              } catch (error) {
                  console.log('error', error)
              }
          }
      }
      
      module.exports = { LowDbHelper }
      

      Here you have created a class called LowDbHelper. This class contains the following two methods: getData() and saveData(). The first will retrieve the books saved inside the books.json file, and the second will save your books to the same file.
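      As a quick sanity check, here is a minimal usage sketch of the class, assuming it runs from the client directory where books.json lives:

      const { LowDbHelper } = require('./lowdbHelper')

      const helper = new LowDbHelper()

      // overwrite the books property in books.json with a single illustrative entry
      helper.saveData([{ title: 'A Light in the Attic' }])

      // read the books property back from books.json
      console.log(helper.getData())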

      Your completed lowdbHelper.js will look like this:

      lowdbHelper.js

      const lowdb = require('lowdb')
      const FileSync = require('lowdb/adapters/FileSync')
      const adapter = new FileSync('books.json')
      
      class LowDbHelper {
          constructor() {
              this.db = lowdb(adapter);
          }
      
          getData() {
              try {
                  let data = this.db.getState().books
                  return data
              } catch (error) {
                  console.log('error', error)
              }
          }
      
          saveData(arg) {
              try {
                  this.db.set('books', arg).write()
                  console.log('data saved successfully!!!')
              } catch (error) {
                  console.log('error', error)
              }
          }
      
      }
      
      module.exports = { LowDbHelper }
      

      Now that you have created the lowdbHelper.js file, it’s time to create the books.json file.

      Create the books.json file and open it:

      Add the following code:

      books.json

      {
          "books": []
      }
      

      The books.json file consists of an object with a property called books. The initial value of this property is an empty array. Later, when you retrieve the books, this is where your program will save them.

      Now that you have created the lowdbHelper.js and the books.json files, you will create the main.js file.

      Create main.js and open it:

      Add the following code to main.js:

      main.js

      let axios = require('axios')
      let ldb = require('./lowdbHelper.js').LowDbHelper
      let ldbHelper = new ldb()
      let allBooks = ldbHelper.getData()
      
      let server = "http://your_load_balancer_external_ip_address"
      let podsWorkDone = []
      let booksDetails = []
      let errors = []
      

      In this chunk of code, you have required the lowdbHelper.js file and a module called axios. You will use axios to send HTTP requests to your scraper, and the lowdbHelper.js file to save retrieved books. The allBooks variable will store all books saved in the books.json file; before you retrieve any book, this variable will hold an empty array. The server variable will store the EXTERNAL-IP of the load balancer that you created in the previous section. Make sure to replace this placeholder with your unique IP. The podsWorkDone variable will track the number of pages that each instance of your scraper has handled, the booksDetails variable will store the details retrieved for individual books, and the errors variable will track any errors that may occur when trying to retrieve the books.

      Now we need to build some functions for each part of the scraper process.

      Add the next code block to the bottom of the main.js file:

      main.js

      . . .
      function main() {
        let execute = process.argv[2] ? process.argv[2] : 0
        execute = parseInt(execute)
        switch (execute) {
          case 0:
            getBooks()
            break;
          case 1:
            getBooksDetails()
            break;
        }
      }
      

      You are now creating a function called main(), which consists of a switch statement that will call either the getBooks() or getBooksDetails() function based on a command-line argument passed to the script.
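      For example, assuming you run the client directly with Node, the argument after the file name selects which function executes:

      • node main.js 0

      • node main.js 1

      The first command calls getBooks(), the second calls getBooksDetails(), and omitting the argument defaults to 0.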

      Replace the break; beneath getBooks() with the following code:

      main.js

      . . .
      function getBooks() {
        console.log('getting books')
        let data = {
          url: 'http://books.toscrape.com/index.html',
          nrOfPages: 20,
          commands: [
            {
              description: 'get items metadata',
              locatorCss: '.product_pod',
              type: "getItems"
            },
            {
              description: 'go to next page',
              locatorCss: '.next > a:nth-child(1)',
              type: "Click"
            }
          ],
        }
        let begin = Date.now();
        axios.post(`${server}/api/books`, data).then(result => {
          let end = Date.now();
          let timeSpent = (end - begin) / 1000 + "secs";
          console.log(`took ${timeSpent} to retrieve ${result.data.books.length} books`)
          ldbHelper.saveData(result.data.books)
        })
      }
      

      Here you have created a function called getBooks(). This code assigns the object containing the necessary information to scrape all 20 pages to a variable called data. The first command in the commands array of this object retrieves all 20 books displayed on a page, and the second command clicks the next button on a page, thus making the browser navigate to the next page. This means that the first command will repeat 20 times, and the second 19 times. A POST request sent using axios to the /api/books route will send this object to your application server, and the scraper will then retrieve the basic metadata for every book displayed on the first 20 pages of the books.toscrape website. It then saves the retrieved data using the LowDbHelper class inside the lowdbHelper.js file.

      Now code the second function, which will handle the more specific book data on individual pages.

      Replace the break; beneath getBooksDetails() with the following code:

      main.js

      . . .
      
      function getBooksDetails() {
        let begin = Date.now()
        for (let j = 0; j < allBooks.length; j++) {
          let data = {
            url: allBooks[j].url,
            nrOfPages: 1,
            commands: [
              {
                description: 'get item details',
                locatorCss: 'article.product_page',
                type: "getItemDetails"
              }
            ]
          }
          sendRequest(data, function (result) {
            parseResult(result, begin)
          })
        }
      }
      

      The getBooksDetails() function will go through the allBooks array, which holds all the books, and create, for each book in the array, an object containing the information needed to scrape its page. After creating this object, it will pass it to the sendRequest() function along with a callback; when the request completes, the callback passes the result to the parseResult() function.

      Add the following code to the bottom of the main.js file:

      main.js

      . . .
      
      async function sendRequest(payload, cb) {
        let book = payload
        try {
          await axios.post(`${server}/api/booksDetails`, book).then(response => {
            if (Object.keys(response.data).includes('error')) {
              let res = {
                url: book.url,
                error: response.data.error
              }
              cb(res)
            } else {
              cb(response.data)
            }
          })
        } catch (error) {
          console.log(error)
          let res = {
            url: book.url,
            error: error
          }
          cb({ res })
        }
      }
      

      Now you are creating a function called sendRequest(). You will use this function to send all 400 requests to your application server containing your scraper. The code assigns the object containing the necessary information to scrape a page to a variable called book. You then send this object in a POST request to the /api/booksDetails route on your application server. The response is passed to the callback supplied by getBooksDetails(), which hands it to parseResult().

      Now create the parseResult() function.

      Add the following code to the bottom of the main.js file:

      main.js

      . . .
      
      function parseResult(result, begin){
        try {
          let end = Date.now()
          let timeSpent = (end - begin) / 1000 + "secs ";
          if (!Object.keys(result).includes("error")) {
            let wasSuccessful = Object.keys(result.booksDetails).length > 0 ? true : false
            if (wasSuccessful) {
              let podID = result.hostname
              let podsIDs = podsWorkDone.length > 0 ? podsWorkDone.map(pod => { return Object.keys(pod)[0]}) : []
              if (!podsIDs.includes(podID)) {
                let podWork = {}
                podWork[podID] = 1
                podsWorkDone.push(podWork)
              } else {
                for (let pwd = 0; pwd < podsWorkDone.length; pwd++) {
                  if (Object.keys(podsWorkDone[pwd]).includes(podID)) {
                    podsWorkDone[pwd][podID] += 1
                    break
                  }
                }
              }
              booksDetails.push(result)
            } else {
              errors.push(result)
            }
          } else {
            errors.push(result)
          }
          console.log('podsWorkDone', podsWorkDone, ', retrieved ' + booksDetails.length + " books, ",
            "took " + timeSpent + ", ", "used " + podsWorkDone.length + " pods", " errors: " + errors.length)
          saveBookDetails()
        } catch (error) {
          console.log(error)
        }
      }
      

      parseResult() receives the result of the sendRequest() function, containing the missing book details. It then parses the result, retrieves the hostname of the pod that handled the request, and assigns it to the podID variable. It checks whether this podID is already part of the podsWorkDone array; if it isn't, it adds the podID to the podsWorkDone array and sets that pod's work count to 1, and if it is, it increments that pod's work count by 1. The code then adds the result to the booksDetails array, outputs the overall progress of the getBooksDetails() function, and calls the saveBookDetails() function.
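      For reference, podsWorkDone ends up holding one object per pod, keyed by the pod's hostname and counting the requests it handled. The hostnames below are purely illustrative:

      [
        { "scraper-6b9d5f7c4d-abcde": 12 },
        { "scraper-6b9d5f7c4d-fghij": 9 }
      ]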

      Now add the following code to build the saveBookDetails() function:

      main.js

      . . .
      
      function saveBookDetails() {
        let books = ldbHelper.getData()
        for (let b = 0; b < books.length; b++) {
          for (let d = 0; d < booksDetails.length; d++) {
            let item = booksDetails[d]
            if (books[b].url === item.url) {
              books[b].booksDetails = item.booksDetails
              break
            }
          }
        }
        ldbHelper.saveData(books)
      }
      
      main()
      

      saveBookDetails() gets all the books stored in the books.json file using the LowDbHelper class and assigns them to a variable called books. It then loops through the books and booksDetails arrays looking for elements in both arrays with the same url property. When it finds a match, it takes the booksDetails property of the element in the booksDetails array and assigns it to the matching element in the books array. It then overwrites the contents of the books.json file with the contents of the books array. Finally, the file calls the main() function so that running the script actually executes one of the two scraping functions.
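      After both scraping passes have finished, each entry in books.json will have roughly the following shape; the values shown here are only illustrative:

      {
        "title": "A Light in the Attic",
        "price": 51.77,
        "rating": 3,
        "url": "http://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html",
        "booksDetails": {
          "description": "A short product description ...",
          "upc": "1234567890abcdef",
          "nrOfReviews": 0,
          "availability": 22
        }
      }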

      Your completed main.js file will look like this:

      main.js

      let axios = require('axios')
      let ldb = require('./lowdbHelper.js').LowDbHelper
      let ldbHelper = new ldb()
      let allBooks = ldbHelper.getData()
      
      let server = "http://your_load_balancer_external_ip_address"
      let podsWorkDone = []
      let booksDetails = []
      let errors = []
      
      function main() {
        let execute = process.argv[2] ? process.argv[2] : 0
        execute = parseInt(execute)
        switch (execute) {
          case 0:
            getBooks()
            break;
          case 1:
            getBooksDetails()
            break;
        }
      }
      
      function getBooks() {
        console.log('getting books')
        let data = {
          url: 'http://books.toscrape.com/index.html',
          nrOfPages: 20,
          commands: [
            {
              description: 'get items metadata',
              locatorCss: '.product_pod',
              type: "getItems"
            },
            {
              description: 'go to next page',
              locatorCss: '.next > a:nth-child(1)',
              type: "Click"
            }
          ],
        }
        let begin = Date.now();
        axios.post(`${server}/api/books`, data).then(result => {
          let end = Date.now();
          let timeSpent = (end - begin) / 1000 + "secs";
          console.log(`took ${timeSpent} to retrieve ${result.data.books.length} books`)
          ldbHelper.saveData(result.data.books)
        })
      }
      
      function getBooksDetails() {
        let begin = Date.now()
        for (let j = 0; j < allBooks.length; j++) {
          let data = {
            url: allBooks[j].url,
            nrOfPages: 1,
            commands: [
              {
                description: 'get item details',
                locatorCss: 'article.product_page',
                type: "getItemDetails"
              }
            ]
          }
          sendRequest(data, function (result) {
            parseResult(result, begin)
          })
        }
      }
      
      async function sendRequest(payload, cb) {
        let book = payload
        try {
          await axios.post(`${server}/api/booksDetails`, book).then(response => {
            if (Object.keys(response.data).includes('error')) {
              let res = {
                url: book.url,
                error: response.data.error
              }
              cb(res)
            } else {
              cb(response.data)
            }
          })
        } catch (error) {
          console.log(error)
          let res = {
            url: book.url,
            error: error
          }
          cb(res)
        }
      }
      
      function parseResult(result, begin){
        try {
          let end = Date.now()
          let timeSpent = (end - begin) / 1000 + "secs ";
          if (!Object.keys(result).includes("error")) {
            let wasSuccessful = Object.keys(result.booksDetails).length > 0 ? true : false
            if (wasSuccessful) {
              let podID = result.hostname
              let podsIDs = podsWorkDone.length > 0 ? podsWorkDone.map(pod => { return Object.keys(pod)[0]}) : []
              if (!podsIDs.includes(podID)) {
                let podWork = {}
                podWork[podID] = 1
                podsWorkDone.push(podWork)
              } else {
                for (let pwd = 0; pwd < podsWorkDone.length; pwd++) {
                  if (Object.keys(podsWorkDone[pwd]).includes(podID)) {
                    podsWorkDone[pwd][podID] += 1
                    break
                  }
                }
              }
              booksDetails.push(result)
            } else {
              errors.push(result)
            }
          } else {
            errors.push(result)
          }
          console.log('podsWorkDone', podsWorkDone, ', retrieved ' + booksDetails.length + " books, ",
            "took " + timeSpent + ", ", "used " + podsWorkDone.length + " pods,", " errors: " + errors.length)
          saveBookDetails()
        } catch (error) {
          console.log(error)
        }
      }
      
      function saveBookDetails() {
        let books = ldbHelper.getData()
        for (let b = 0; b < books.length; b++) {
          for (let d = 0; d < booksDetails.length; d++) {
            let item = booksDetails[d]
            if (books[b].url === item.url) {
              books[b].booksDetails = item.booksDetails
              break
            }
          }
        }
        ldbHelper.saveData(books)
      }
      
      main()
      

      You have now created the client application and are ready to interact with the scraper in your Kubernetes cluster. In the next step, you will use this client application and the application server to scrape all 400 books.

      Step 8 — Scraping the Website

      Now that you have created the client application and the server-side scraper application, it’s time to scrape the books.toscrape.com website. You will first retrieve the metadata for all 400 books. Then you will retrieve the missing details for every single book on its own page and monitor how many requests each pod has handled in real time.

      In the ./client directory, run the following command. This will retrieve the basic metadata for all 400 books and save it to your books.json file:
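
      Assuming the start script in your client’s package.json runs node main.js, invoking it with no argument falls through to case 0 in main() and calls getBooks():

      • npm start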

      You will receive the following output:

      Output

      getting books took 40.323secs to retrieve 400 books

      Retrieving the metadata for the books displayed on all 20 pages took 40.323 seconds, although this value may differ depending on your internet speed.

      Now you want to retrieve the missing details for every book stored in the books.json file while also monitoring the number of requests that each pod handles.

      Run npm start again to retrieve the details, this time passing 1 as an argument so that main() calls getBooksDetails():
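
      Assuming npm forwards the positional argument to your start script (if it doesn’t, npm start -- 1 or node main.js 1 achieve the same thing):

      • npm start 1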

      You will receive an output like this but with different pod IDs:

      Output

      . . . podsWorkDone [ { 'scraper-59cd578ff6-z8zdd': 69 }, { 'scraper-59cd578ff6-528gv': 96 }, { 'scraper-59cd578ff6-zjwfg': 94 }, { 'scraper-59cd578ff6-nk6fr': 80 }, { 'scraper-59cd578ff6-h2n8r': 61 } ] , retrieved 400 books, took 56.875secs , used 5 pods, errors: 0

      Retrieving the missing details for all 400 books using Kubernetes took less than 60 seconds. Each pod containing the scraper scraped at least 60 pages. This represents a massive performance increase over using one machine.

      Now double the number of pods in your Kubernetes cluster to accelerate the retrieval even more:

      • kubectl scale deployment scraper --replicas=10

       

      It will take a few moments before the pods are available, so wait at least 10 seconds before running the next command.
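
      If you prefer to confirm readiness instead of waiting a fixed amount of time, you can watch the rollout; this uses the standard kubectl rollout command and assumes the deployment is named scraper, as above:

      • kubectl rollout status deployment/scraper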

      Rerun the details command (npm start with the 1 argument) to get the missing details:
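
      As before, assuming the argument is forwarded to the start script:

      • npm start 1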

      You will receive an output similar to the following but with different pod IDs:

      Output

      . . . podsWorkDone [ { 'scraper-59cd578ff6-z8zdd': 38 }, { 'scraper-59cd578ff6-6jlvz': 47 }, { 'scraper-59cd578ff6-g2mxk': 36 }, { 'scraper-59cd578ff6-528gv': 41 }, { 'scraper-59cd578ff6-bj687': 36 }, { 'scraper-59cd578ff6-zjwfg': 47 }, { 'scraper-59cd578ff6-nl6bk': 34 }, { 'scraper-59cd578ff6-nk6fr': 33 }, { 'scraper-59cd578ff6-h2n8r': 38 }, { 'scraper-59cd578ff6-5bw2n': 50 } ] , retrieved 400 books, took 34.925secs , used 10 pods, errors: 0

      After doubling the number of pods, the time needed to scrape all 400 pages was reduced by almost half; it took less than 35 seconds to retrieve all the missing details.

      In this section, you sent 400 requests to the application server deployed in your Kubernetes cluster and scraped 400 individual URLs in a short amount of time. You also increased the number of pods in your cluster to improve performance even more.

      Conclusion

      In this guide, you used Puppeteer, Docker, and Kubernetes to build a concurrent web scraper capable of rapidly scraping 400 web pages. To interact with the scraper, you built a Node.js app that uses axios to send multiple HTTP requests to the server containing the scraper.

      Puppeteer includes many additional features. If you want to learn more, check out Puppeteer’s official documentation. To learn more about Node.js, check out our tutorial series on how to code in Node.js.
