Web Scraping Using Node.js



Thanks to Node.js, JavaScript is a great language to use for a web scraper: not only is Node fast, but you'll likely end up using a lot of the same methods you're used to from querying the DOM in the browser. Node.js has a large library of packages that simplify different tasks. For web scraping we will use two packages called request and cheerio. The request package is used to download web pages, while cheerio generates a DOM tree and provides a subset of the jQuery function set to manipulate it.

There are multiple tools for scraping and retrieving content from websites in multiple languages. In Java you can use Jaunt, for C# you can use this approach, and in Python there is a popular library called Scrapy. Today, I'm going to show you a simple way to crawl a website using server-side JavaScript (Node.js) with the help of Request and Cheerio.

Request is a library whose main purpose is to create HTTP requests and retrieve web content easily. It works on top of Node's built-in http module and simplifies making calls and handling responses.
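
For example, here is a minimal sketch of downloading a page's HTML with Request (the URL is just a placeholder for illustration):

[code lang="js"]// Minimal example: download a page's HTML with request
const request = require('request');

// example.com is a placeholder URL for illustration
request.get('https://example.com', function(error, response, body){
  if (error) return console.error(error);
  console.log(response.statusCode);     // e.g. 200
  console.log(body.substring(0, 100));  // first 100 characters of the HTML
});[/code]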

Cheerio is the equivalent of jQuery for Node.js. It implements the core functions of jQuery. Remember that in Node, unlike in client-side JavaScript, there isn't a DOM. Using cheerio we can create a DOM and manipulate it just as we would in client-side JavaScript using jQuery.
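
As a quick sketch, you can load an HTML string into cheerio and query it just like you would with jQuery in the browser:

[code lang="js"]// Minimal example: parse an HTML string and query it with cheerio
const cheerio = require('cheerio');

const $ = cheerio.load('<h2 class="title">Hello world</h2>');
console.log($('h2.title').text()); // -> Hello world
$('h2.title').text('Hello there!'); // manipulate the DOM, jQuery-style
console.log($.html());              // serialized HTML with the change applied[/code]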

So, the setting for web scraping is quite simple:

  1. We will use request to retrieve html/xml documents from the web
  2. We will parse html content to create a DOM using cheerio
  3. Using jQuery-like functions we will be able to manipulate and extract content from the retrieved webpage.

Our example project will consist of two files: the Node package.json and app.js, which contains the code to scrape the web:

[code lang="js" title="package.json"]{
  "name": "youtubeScrapingExample",
  "version": "0.0.1",
  "scripts": {
    "start": "node app.js"
  },
  "dependencies": {
    "cheerio": "latest",
    "request": "latest"
  }
}
[/code]

[code lang="js" title="app.js"]// Import request and cheerio libraries
const request = require('request');
const cheerio = require('cheerio');

// Set an example youtube url
let videoUrl = 'https://www.youtube.com/watch?v=7fYKMCCPh28';

// HTTP GET of the youtube website using request
request.get(videoUrl, function(error, response, html){
  // Abort if the request failed
  if (error) return console.error(error);
  // Use cheerio to parse and create the jQuery-like DOM based on the retrieved html string
  let $ = cheerio.load(html);
  // Find the element node which contains the title and retrieve its text
  let title = $('span#eow-title').text();
  // Output the result
  console.log('The title of the video %s is %s', videoUrl, title);
  // Output: The title of the video https://www.youtube.com/watch?v=7fYKMCCPh28 is The Earth: 4K Extended Edition
});[/code]

In this example we are scraping the title of a YouTube video. The video title is the content of the node span#eow-title, which you can find by inspecting the page with the developer tools.

If you open the developer console on a YouTube video page and type the following code, you will get the title of the current video:

[code lang="js"]$('span#eow-title').innerText[/code]

In this case we have to take into account that the content of the website is publicly accessible with a simple HTTP GET call. One of the most powerful capabilities of request is that you can define complex requests, using any of the methods of the HTTP protocol. You can also provide your browser's HTTP Archive (HAR) [in Google Chrome, you can export the HAR from the developer tools] if the website you need to scrape is too complex to access with a simple HTTP call.
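
As a rough sketch (the URL, header values, and form fields below are hypothetical, purely for illustration), a more complex request can be described with an options object:

[code lang="js"]// Sketch: a POST request with custom headers and form data
const request = require('request');

request({
  url: 'https://example.com/login',           // hypothetical endpoint
  method: 'POST',                             // any HTTP method is supported
  headers: { 'User-Agent': 'Mozilla/5.0' },
  form: { user: 'demo', password: 'secret' }, // hypothetical form fields
  followAllRedirects: true
}, function(error, response, body){
  if (error) return console.error(error);
  console.log(response.statusCode);
});[/code]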

It is quite a simple example, but I hope this mini-tutorial will help you scrape the entire web!

If you've ever visited a website and thought the information was useful but the data wasn't available through an API, well, I have some good news for you. You can scrape that website's data using Node.js!

Web scraping refers to the collection of data on a website without relying on their API or any other service. If you can visit a website on your browser then you can visit that website through code.

All websites are built from HTML, CSS, and Javascript. If you open up the developer tools on a website, then you’ll see the HTML code of that website.

So to scrape the data from a website using web scraping methods, you’re getting HTML data from that website, and then extracting the content you want from the HTML.

This article will guide you through an introduction to web scraping using Javascript and Node.js. You’ll create a Node.js script that visits HackerNews and saves the post titles and links to a CSV file.

Important concepts

Before we get started, you should know about a few concepts relevant to web scraping. These concepts are DOM elements, query selectors, and the developer tools/inspector.

DOM Elements

DOM Elements are the building blocks of HTML code. They make up the content of an HTML page and include headings, paragraphs, images, and many other elements. When scraping websites, you find web content by searching for the DOM elements it's defined within.

Query Selectors

Query selectors are methods available in the browser and Javascript that allow you to select DOM elements. After you select them, you can read the data or manipulate them such as changing the text or CSS properties. When scraping the web, you use query selectors to select the DOM elements you want to read from.
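
For example, running something like this in the browser console selects elements using CSS selectors:

-- CODE language-js --// Select a single element and read its text
const firstHeading = document.querySelector('h1');
console.log(firstHeading.innerText);

// Select every element matching a selector and loop over them
const links = document.querySelectorAll('a');
links.forEach((link) => console.log(link.href));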

Developer Tools

Chrome, Firefox, and other browsers have tools built into their browser that allow developers to have an easier time working with websites. You can find the DOM elements of the content you want using the developer tools and then select it with code.

Different tools/libraries you can use for web scraping

There are many different tools and libraries you can use to scrape the web using Javascript and Node.js. Here is a list of some of the most popular tools.

Cheerio

Cheerio is a library that allows you to use jQuery-like syntax on the server. Cheerio is often paired with a library like request or request-promise to fetch HTML and then query it with jQuery-style selectors on the server.

Nightmare.js

Nightmare.js is a high-level browser automation library that can be used to interact with a website before you scrape its data. For example, you may want to fill in a form and submit it before scraping the page. Nightmare.js allows you to do this with an easy-to-use API.

Puppeteer

Puppeteer is a Node.js library that can run headless Chrome to do automation tasks. Puppeteer can do things such as:

  • Generate screenshots and PDFs of pages.
  • Automate form submission, UI testing, keyboard input, etc.
  • Test Chrome Extensions.
  • And more.
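
For instance, a minimal sketch (assuming Puppeteer has been installed, e.g. with yarn add puppeteer) that opens a page headlessly and saves a screenshot might look like this:

-- CODE language-js --// Minimal Puppeteer sketch: open a page and take a screenshot
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://example.com'); // placeholder URL
  await page.screenshot({ path: 'example.png' });
  await browser.close();
})();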

Axios

Axios is a popular library for making requests over the web. This library is robust and has many features, and making a simple request to get a website's HTML content is straightforward with it. It's often used in combination with a library like Cheerio for scraping the web.

Tutorial

In this tutorial, we’ll be scraping the front-page of HackerNews to get the post titles and links and save them to a CSV file.

Prerequisites

  • Node.js installed on your computer.
  • Basic understanding of Javascript and Node.js.

1. Project setup

To start, we’ll need to setup a Node.js project. In your terminal, change directories into an empty directory and type:

yarn init -y

Or

npm init -y

to initialize a new Node.js project. The -y flag skips all the questions that a new project asks you.

We’ll need to install two dependencies for this project: Cheerio and Axios.

In your terminal, type:

yarn add cheerio axios

That will install the packages in your project.

Now let’s get something printing on the screen.

Create a new file called scraper.js in your project directory and add the following code to the file:

-- CODE language-js --console.log('Hello world!');

Next, in your terminal run the command:

node scraper

And you should see the text Hello world! in your terminal.

2. See what DOM elements we need using the developer tools

Now that our project is set-up, we can visit HackerNews and inspect the code to see which DOM elements we need to target.

Visit HackerNews and right-click on the page and press “Inspect” to open the developer tools.

That'll open up the developer tools panel.

Since we want the title and URL, we can search for their DOM elements by pressing Control + Shift + C to select an element. When you hover over an element on the website after pressing Control + Shift + C, that element will be highlighted and you can see information about it.

If you click the highlighted element then it will open up in the developer tools.

This anchor tag has all the data we need: it contains the title and the href of the link. It also has a class of storylink, so what we need to do is select all the elements with a class of storylink in our code and then extract the data we want.

3. Use Cheerio and Axios to get HTML data from HackerNews

Now it’s time to start using Cheerio and Axios to scrape HackerNews.

Delete the hello world console log and add the packages to your script at the top of your file.

-- CODE language-js --const cheerio = require('cheerio');
const axios = require('axios');

Next, we want to call axios using their get method to make a request to the HackerNews website to get the HTML data.


That code looks like this:

-- CODE language-js --axios.get('https://news.ycombinator.com/').then((response) => {
  console.log(response.data);
});

If you run your script now, then you should see a large string of HTML code.

Here is where Cheerio comes into play.

We want to load this HTML code into a Cheerio variable and with that variable, we’ll be able to run jQuery like methods on the HTML code.

That code looks like:

-- CODE language-js --axios.get('https://news.ycombinator.com/').then((response) => {
  let $ = cheerio.load(response.data);
});

The $ is the variable that contains the parsed HTML code ready for use.

Since we know that the .storylink class is where our data lies, we can find all of the elements that have a .storylink class using the $ variable. That looks like:

-- CODE language-js --axios.get('https://news.ycombinator.com/').then((response) => {
  let $ = cheerio.load(response.data);
  console.log($('.storylink'));
});

If you run your code now, you’ll see a large object that is a Cheerio object. Next, we will run methods on this Cheerio object to get the data we want.

4. Get the title and link using Cheerio

Since there are many DOM elements containing the class storylink, we want to loop over them and work with each individual one.

Cheerio makes this simple with an each method. This looks like:

-- CODE language-js --axios.get('https://news.ycombinator.com/').then((response) => {
  let $ = cheerio.load(response.data);
  $('.storylink').each((i, e) => {
    console.log(i);
    console.log(e);
  });
});

i is the index of the array, and e is the element object.

What this does is loop over all the elements containing the storylink class and within the loop, we can work with each individual element.

Since we want the title and URL, we can access them using text and attr methods provided by Cheerio. That looks like:

-- CODE language-js --axios.get('https://news.ycombinator.com/').then((response) => {
  let $ = cheerio.load(response.data);
  $('.storylink').each((i, e) => {
    let title = $(e).text();
    let link = $(e).attr('href');
    console.log(title);
    console.log(link);
  });
});

If you run your code now, you should see a large list of post titles and their URLs!

Next, we’ll save this data in a CSV file.

5. Save the title and link into a CSV file.

Creating CSV files in Node.js is easy. We just need to import a module called fs into our code and run some methods. fs is available with Node so we don’t have to install any new packages.

At the top of your code add the fs module and create a write stream.

-- CODE language-js --const fs = require('fs');
const writeStream = fs.createWriteStream('hackernews.csv');

What this does is it creates a file called hackernews.csv and prepares your code to write to it.

Next, we want to create some headers for the CSV file. This looks like:

-- CODE language-js --writeStream.write(`Title,Link\n`);

What we're doing here is just writing a single line with the string Title,Link followed by a newline character (\n).

This prepares the CSV with headings.

What’s left is to write a line to the CSV file for every title and link. That looks like:
-- CODE language-js --axios.get('https://news.ycombinator.com/').then((response) => {
  let $ = cheerio.load(response.data);
  $('.storylink').each((i, e) => {
    let title = $(e).text();
    let link = $(e).attr('href');
    writeStream.write(`${title}, ${link}\n`);
  });
});

What we're doing is writing a new line to the file that contains the title and link in the appropriate places, followed by a newline character so the next entry starts on its own line.

The string in use is called a template literal, and it's an easy way to embed variables in strings with nicer syntax.
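
As a quick illustration of a template literal:

-- CODE language-js --// Template literals embed variables in a string with ${ }
const title = 'Example post';     // placeholder values
const link = 'https://example.com';
console.log(`${title}, ${link}`); // -> Example post, https://example.com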

If you run your code now, you should see a CSV file created in your directory with the title and link of all the posts from HackerNews.

Your final code should look like this:
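
-- CODE language-js --// scraper.js: assembled from the snippets above
const fs = require('fs');
const cheerio = require('cheerio');
const axios = require('axios');

// Create the CSV file and write the headings
const writeStream = fs.createWriteStream('hackernews.csv');
writeStream.write(`Title,Link\n`);

// Fetch the HackerNews front page, then write one CSV line per post
axios.get('https://news.ycombinator.com/').then((response) => {
  let $ = cheerio.load(response.data);
  $('.storylink').each((i, e) => {
    let title = $(e).text();
    let link = $(e).attr('href');
    writeStream.write(`${title}, ${link}\n`);
  });
});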

Searching DuckDuckGo with Nightmare.js

In this tutorial, we'll be going over how to search DuckDuckGo with Nightmare.js and get the URLs of the first five results.

Nightmare.js is a browser automation library that uses Electron to mimic browser-like behavior. Using Nightmare, you're able to automate actions like clicking, filling in forms, going to another page, and everything you can do in a browser manually.

To do this, you use methods provided by Nightmare such as `goto`, `type`, `click`, `wait`, and many others that represent actions you would do with a mouse and keyboard.

Let's get started.

Prerequisites

- Node.js installed on your computer.
- Basic understanding of Javascript and Node.js.
- Basic understanding of the DOM.

1. Project setup

If you've initialized a Node project as outlined in the previous tutorial, you can simply create a new file in the same directory called `nightmare.js`.

If you haven't created a new Node project, follow Step 1 in the previous tutorial to see how to create a new Node.js project.

Next, we'll add the nightmare.js package. In your terminal, type:

yarn add nightmare

Next, add a console.log message in `nightmare.js` to get started.


Your `nightmare.js` file should look like:

-- CODE language-js --console.log('Hello from nightmare!');

If you run `node nightmare` in your terminal, you should see:

Hello from nightmare!

2. See what DOM elements we need using the developer tools

Next, let's visit [DuckDuckGo.com](https://duckduckgo.com/) and inspect the website to see which DOM elements we need to target.

Visit DuckDuckGo and open up the developer tools by right-clicking on the form and selecting `Inspect`.

And from the developer tools, we can see that the ID of the input form is `search_form_input_homepage`. Now we know to target this ID in our code.

Next, we need to click the search button to complete the action of entering a search term and then searching for it.

Right-click the search icon on the right side of the search input and click `Inspect`.

From the developer tools, we can see that the ID of the search button is `search_button_homepage`. This is the next element we need to target in our Nightmare script.

3. Search for a term in DuckDuckGo using Nightmare.js

Now we have our elements and we can start our Nightmare script.

In your nightmare.js file, delete the console.log message and add the following code:

-- CODE language-js --const Nightmare = require('nightmare');
const nightmare = Nightmare({ show: true });

nightmare
  .goto('https://duckduckgo.com')
  .type('#search_form_input_homepage', 'web scraping')
  .click('#search_button_homepage')
  .then();

What we're doing here is first importing the Nightmare module, and then creating the nightmare object to work with.

The nightmare object takes in some options that you can see more of [here](https://github.com/segmentio/nightmare#nightmareoptions). The option we care about is `show: true` because this shows the electron instance and the actions being taken. You can hide this electron instance by setting `show` to `false`.


Next, we're telling the nightmare instance to take some actions. The actions are described using the methods `goto`, `type`, `click`, and `then`. They describe what we want nightmare to do.

First, we want it to go to the duckduckgo URL. Then, we want it to select the search form element and type 'web scraping'. Then, we want it to click the search button element. Then, we're calling `then` because this is what makes the instance run.


If you run this script, you should see Nightmare create an electron instance, go to duckduckgo.com, and then search for web scraping.

4. Get the URLs of the search results

The next step in this action is to get the URLs of the search results.

As you saw in the last step, Nightmare allows us to go to another page after taking an action like searching in a form, and then we can scrape the next page.

If you go to the browser and right-click a link in the search results page of DuckDuckGo, you'll see the element we need to target.


The class of the URL result we want is `result__url js-result-extras-url`.

To get DOM element data in Nightmare, we want to write our code in their `evaluate` method and return the data we want.


Update your script to look like this:

-- CODE language-js --
nightmare
  .goto('https://duckduckgo.com')
  .type('#search_form_input_homepage', 'web scraping')
  .click('#search_button_homepage')
  .wait(3000)
  .evaluate(() => {
    const results = document.getElementsByClassName(
      'result__url js-result-extras-url'
    );
    return results;
  })
  .end()
  .then(console.log)
  .catch((error) => {
    console.error('Search failed:', error);
  });

What we added here is a `wait`, `evaluate`, `end`, `catch`, and a console.log to the `then`.

The `wait` is so we wait a few seconds after searching so we don't scrape a page that didn't load.

Then `evaluate` is where we write our scraping code. Here, we're getting all the elements with a class of `result__url js-result-extras-url` and returning the results which will be used in the `then` call.

Then `end` is so the electron instance closes.

Then `then` is where we get the results that were returned from `evaluate` and we can work with it like any other Javascript code.

Then `catch` is where we catch errors and log them.

If you run this code, you should see an object logged.

-- CODE language-js --{
'0': { jQuery1102006895228087119576: 151 },
'1': { jQuery1102006895228087119576: 163 },
'2': { jQuery1102006895228087119576: 202 },
'3': { jQuery1102006895228087119576: 207 },
'4': { jQuery1102006895228087119576: 212 },
'5': { jQuery1102006895228087119576: 217 },
'6': { jQuery1102006895228087119576: 222 },
'7': { jQuery1102006895228087119576: 227 },
'8': { jQuery1102006895228087119576: 232 },
'9': { jQuery1102006895228087119576: 237 },
'10': { jQuery1102006895228087119576: 242 },
'11': { jQuery1102006895228087119576: 247 },
'12': { jQuery1102006895228087119576: 188 }
}

This is the object returned from the evaluate method. These are all the elements selected by `document.getElementsByClassName('result__url js-result-extras-url');`.

We don't want to use this object, we want the URLs of the first 5 results.

To get the URL or href of one of these objects, we simply have to select it using `[]` and read its `href` property.

Update your code to look like this:

-- CODE language-js --nightmare
  .goto('https://duckduckgo.com')
  .type('#search_form_input_homepage', 'web scraping')
  .click('#search_button_homepage')
  .wait(3000)
  .evaluate(() => {
    const results = document.getElementsByClassName(
      'result__url js-result-extras-url'
    );
    const urls = [];
    urls.push(results[2].href);
    urls.push(results[3].href);
    urls.push(results[4].href);
    urls.push(results[5].href);
    urls.push(results[6].href);
    return urls;
  })
  .end()
  .then(console.log)
  .catch((error) => {
    console.error('Search failed:', error);
  });

Since the first two elements are URLs of ads, we can skip them and go to elements 2-6.

What we're doing here is creating an array called `urls` and pushing 5 hrefs to them. We select an element in the array using `[]` and call the existing href attribute on it. Then we return the URLs to be used in the `then` method.
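
As an aside, the same thing can be written a bit more compactly inside `evaluate` using Array.from and slice (this is just an alternative sketch, not what the tutorial code above does):

-- CODE language-js --// Alternative: skip the first two (ad) results and map the next five to hrefs
const results = document.getElementsByClassName(
  'result__url js-result-extras-url'
);
return Array.from(results).slice(2, 7).map((result) => result.href);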

If you run your code now, you should see this log:

-- CODE language-js --[
'https://en.wikipedia.org/wiki/Web_scraping',
'https://www.guru99.com/web-scraping-tools.html',
'https://www.edureka.co/blog/web-scraping-with-python/',
'https://www.webharvy.com/articles/what-is-web-scraping.html',
'https://realpython.com/tutorials/web-scraping/',
];

And this is how you get the first five URLs of a search in DuckDuckGo using Nightmare.js.

Your final code should look like this:
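
-- CODE language-js --// nightmare.js: assembled from the snippets above
const Nightmare = require('nightmare');
const nightmare = Nightmare({ show: true });

nightmare
  .goto('https://duckduckgo.com')
  .type('#search_form_input_homepage', 'web scraping')
  .click('#search_button_homepage')
  .wait(3000)
  .evaluate(() => {
    const results = document.getElementsByClassName(
      'result__url js-result-extras-url'
    );
    const urls = [];
    urls.push(results[2].href);
    urls.push(results[3].href);
    urls.push(results[4].href);
    urls.push(results[5].href);
    urls.push(results[6].href);
    return urls;
  })
  .end()
  .then(console.log)
  .catch((error) => {
    console.error('Search failed:', error);
  });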

What we covered

  • Introduction to web scraping with Node.js
  • Important concepts for web scraping
  • Popular web scraping libraries in Node.js
  • A tutorial about how to scrape the HackerNews front page and save data to a CSV file
  • A tutorial about how to get the search results on DuckDuckGo using Nightmare.js
