Web Scraping In Django



Web scraping is about downloading structured data from the web, selecting some of that data, and passing along what you selected to another process. In this tutorial, we'll put that idea to work inside a Django project.

We can build a news aggregator web app by scraping news websites and serving the scraped news via Django on the web or in any app.

In this article, I will explain step by step how to implement everything. Let's start by understanding what a news aggregator is and why we should build one.

What is a news aggregator?

A news aggregator is a system that takes news from several sources and puts it all together. Good examples of news aggregators are JioNews and Google News.

Why build a news aggregator?

There are hundreds of news websites covering several broad topics, only a few of which interest us. A news aggregator saves a lot of time, and with some modifications and filtering we can fine-tune it to show only the news we care about.

A news aggregator can be a useful tool for getting information in a short time.

Plan

We'll build our news aggregator in three parts:

  1. We'll study the HTML source of the news sites and build a scraper for each
  2. Then, we'll set up our Django server
  3. Finally, we'll integrate everything together

So, let's start with the first step.

Building the website scraper

Before we start building the scraper, let's get the required packages. You can install them from the command prompt with the commands below.
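Assuming we use requests and BeautifulSoup for the scrapers, the install looks like this:

```bash
pip install requests beautifulsoup4
```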

This will install the required packages.

We are going to use Times of India and Hindustan Times as our news sources. We'll get content from these two websites and integrate it into our news aggregator.

Let's start with Times of India. We'll take news from the Briefs section of their site. Inspecting the page, we can see that each news heading comes in an h2 tag.

So we'll grab this tag. Here is how our scraper will look.
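A minimal sketch, assuming the Briefs page lives at timesofindia.indiatimes.com/briefs and that we fetch it with requests and parse it with BeautifulSoup:

```python
import requests
from bs4 import BeautifulSoup

# Fetch the Times of India Briefs page (URL assumed).
toi_r = requests.get('https://timesofindia.indiatimes.com/briefs')
toi_soup = BeautifulSoup(toi_r.content, 'html.parser')

# The headings sit in h2 tags, so collect their text.
toi_news = [h2.get_text(strip=True) for h2 in toi_soup.find_all('h2')]

print(toi_news)
```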

This will get all the news headings from Times of India.

Now, let's move to Hindustan Times. We'll scrape the India section of their website. Here we can see that the news comes in a div with the headingfour class.

Let's write a scraper for this div.
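A sketch along the same lines; the section URL is an assumption:

```python
import requests
from bs4 import BeautifulSoup

# Fetch the Hindustan Times India section (URL assumed).
ht_r = requests.get('https://www.hindustantimes.com/india-news/')
ht_soup = BeautifulSoup(ht_r.content, 'html.parser')

# Headlines live in divs with the headingfour class, per the inspection above.
ht_news = [div.get_text(strip=True)
           for div in ht_soup.find_all('div', class_='headingfour')]

print(ht_news)
```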

Now we have the news that we want to display in our web app, so we can start building it.

Building the Django web app

To build a web app with Django, we need to install Django on our system. You can install it with the following command.
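```bash
pip install django
```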

After installing Django, we can start building our web app. I'll call my app HackersFriend News Aggregator; you can name yours whatever you like, it doesn't matter. We'll create the project with this command.
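```bash
django-admin startproject HackersFriend_NewsAggregator
```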

After that your directory structure should look like this.
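This is the standard layout django-admin generates:

```
HackersFriend_NewsAggregator/
    manage.py
    HackersFriend_NewsAggregator/
        __init__.py
        settings.py
        urls.py
        asgi.py
        wsgi.py
```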

Once we have the manage.py file, we'll create the app our web app will live in. Django has a convention of keeping everything in separate apps inside a project; a project can have multiple apps.

So move into the project folder and create the app. This is the command to create it. I am calling the app news; you can pick a name of your choice.
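```bash
python manage.py startapp news
```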

After that your directory should look like this.
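The new news app sits alongside the project package:

```
HackersFriend_NewsAggregator/
    manage.py
    HackersFriend_NewsAggregator/
        ...
    news/
        __init__.py
        admin.py
        apps.py
        migrations/
        models.py
        tests.py
        views.py
```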

Now, we'll add this news app to INSTALLED_APPS in the settings.py file, so that Django takes the app into consideration. Here is how INSTALLED_APPS should look after adding the news app:
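```python
INSTALLED_APPS = [
    'django.contrib.admin',
    'django.contrib.auth',
    'django.contrib.contenttypes',
    'django.contrib.sessions',
    'django.contrib.messages',
    'django.contrib.staticfiles',
    'news',  # our news app
]
```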

Now, let's create a template for the home page.

Go to the news directory, create a directory named templates inside it, create a news directory inside the templates directory, and then create an index.html file inside that directory (news/templates/news/index.html).

We'll use Bootstrap 4, so include its CSS and JS links in index.html. We are also going to pass two variables, toi_news and ht_news, from our views.py file to this template, holding the Times of India and Hindustan Times news respectively; we'll loop through them and print the news. Here is how your index.html file should look.
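A minimal sketch of the template; the Bootstrap CDN links here are assumptions:

```html
<!DOCTYPE html>
<html>
<head>
    <title>HackersFriend News Aggregator</title>
    <!-- Bootstrap 4 CSS (CDN link assumed) -->
    <link rel="stylesheet" href="https://stackpath.bootstrapcdn.com/bootstrap/4.5.2/css/bootstrap.min.css">
</head>
<body>
    <div class="container">
        <div class="row">
            <div class="col-md-6">
                <h3>Times of India</h3>
                {% for news in toi_news %}
                    <p>{{ news }}</p>
                {% endfor %}
            </div>
            <div class="col-md-6">
                <h3>Hindustan Times</h3>
                {% for news in ht_news %}
                    <p>{{ news }}</p>
                {% endfor %}
            </div>
        </div>
    </div>
    <!-- Bootstrap 4 JS and its dependency (CDN links assumed) -->
    <script src="https://code.jquery.com/jquery-3.5.1.slim.min.js"></script>
    <script src="https://stackpath.bootstrapcdn.com/bootstrap/4.5.2/js/bootstrap.min.js"></script>
</body>
</html>
```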


Now, we can fill in the views.py file.

Inside the views.py file, we will put the news scrapers for both news sites.


Here is how our views.py file looks.
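A sketch of the view, combining the two scrapers from earlier (the URLs are still assumptions):

```python
import requests
from bs4 import BeautifulSoup
from django.shortcuts import render


def index(request):
    # Scrape the Times of India Briefs page (URL assumed).
    toi_r = requests.get('https://timesofindia.indiatimes.com/briefs')
    toi_soup = BeautifulSoup(toi_r.content, 'html.parser')
    toi_news = [h2.get_text(strip=True) for h2 in toi_soup.find_all('h2')]

    # Scrape the Hindustan Times India section (URL assumed).
    ht_r = requests.get('https://www.hindustantimes.com/india-news/')
    ht_soup = BeautifulSoup(ht_r.content, 'html.parser')
    ht_news = [div.get_text(strip=True)
               for div in ht_soup.find_all('div', class_='headingfour')]

    # Pass both lists to the template we created above.
    return render(request, 'news/index.html',
                  {'toi_news': toi_news, 'ht_news': ht_news})
```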

Once we are done with the template and view, we can add the view to our urls.py file to serve it.

Move to the HackersFriend_NewsAggregator directory, open the urls.py file, import the news views there, and add the view to a URL pattern.

Here is how urls.py looks after the addition.
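A sketch, assuming the view function is named index as above:

```python
from django.contrib import admin
from django.urls import path

from news import views

urlpatterns = [
    path('admin/', admin.site.urls),
    path('', views.index, name='index'),  # home page serves the news view
]
```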

After that, we are done. You can now run your web app from the command window with this command.
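```bash
python manage.py runserver
```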

After that, open 127.0.0.1:8000 in your browser and you should see the news aggregator app's home page.

It's certainly not the most beautiful news app on the internet, but you get the idea of how to build a news aggregator.

You can add a lot of features on top of it, like showing news on a certain topic, aggregating from more websites, and so on.

Here is the GitHub repo with all the code: https://github.com/hackers-friend/HackersFriend-NewsAggregator

For a recent MinnPost project, we wanted to scrape court dockets, so I figured I’d break out a python script in the wonderful ScraperWiki. One of my favorite features is that you can schedule a scraper to run automatically. One of my least favorite features is that the limit on automatic scrapers is once per day. We needed something to run every half hour.

Enter Django

It seems that every news hacker is using Django for something these days, and why not? It’s fast, flexible and a major headache to deploy (I’ll expand on this in a later post).

To build the scraper, I wrote a python script that used requests and lxml, invoked by a cron call to a Django command.

Here’s the site we want to scrape (and a great example of how “open” government really isn’t): Minneapolis court dockets

models.py

The models.py file is very simple, containing only the fields we want to scrape.
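A minimal sketch of the model; the field names and max lengths are assumptions based on the description further down:

```python
from django.db import models


class Case(models.Model):
    # Lengths assumed; the post mentions slicing court and description
    # to fit the database setup.
    court = models.CharField(max_length=100)
    time = models.IntegerField()  # stored as seconds, per the conversion below
    description = models.CharField(max_length=200)
```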

This should be self-explanatory if you’re at all familiar with Django; if not, I highly recommend the official tutorial.

The scraping script

Using requests and lxml, scraping in python is downright enjoyable. Look how easy it is to grab a site’s source and convert it into a useful object:
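Something like this, with a placeholder for the dockets URL:

```python
import requests
from lxml import html

# Placeholder; use the Minneapolis court dockets URL linked above.
DOCKETS_URL = 'http://example.com/court-dockets'

response = requests.get(DOCKETS_URL)
tree = html.fromstring(response.text)  # a queryable lxml document
```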

Boom.

Take a look at the source code of the court dockets site, and you’ll see how fun it is to scrape most government sites. The information we want to get is nested in four tables (!), all without ids or classes.

Luckily, one of these tables has an attribute that we can immediately jump to. Here’s the code I’m using to get the contents we want:
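A sketch, assuming the table can be reached through an XPath on that attribute and that each row holds court, time, and description cells:

```python
# Jump straight to the table via its distinguishing attribute
# (the attribute used here is a stand-in for the real one).
table = tree.xpath('//table[@summary]')[0]

for row in table.xpath('.//tr'):
    cells = row.xpath('.//td')
    if len(cells) < 3:
        continue
    court = cells[0].text_content().strip()
    time_text = cells[1].text_content().strip()
    description = cells[2].text_content().strip()
```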

The text_content() function takes what’s inside an html element sans html tags, and strip() removes whitespace.

The magic - Django commands

Django commands are scripts that can do whatever you like, easily invoked through the command line:
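```bash
python manage.py scrapedockets
```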

This is a great way to keep everything inside a Django project, and the scripts are easily accessible. These files are stored inside your app -> management -> commands (my full path is minnpost/dockets/management/commands/scrapedockets.py). If you don't have these folders already, create them, but don't forget to add __init__.py files. I turned my scraping code into a command called scrapedockets.py - full code below.

Django commands require a class Command that extends BaseCommand and has a function handle(). This is called when the command is invoked.

I wrote an (admittedly) bad function to convert the times into seconds to store them in the database. I believe I went against a general rule, which is to store times in GMT, but I don't completely understand how Django uses the timezone settings. Help?

Anyway, I end up with variables for each piece of information I want to store. I check if a Case already exists with the same information, and if it doesn’t, I create it and save it to the database. I used python’s slice operator to make sure the court and description aren’t too long (according to the database setup I created in models.py).
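Here is a sketch of the full command based on that description; the URL, the table XPath, the time parsing, and the field lengths are all assumptions:

```python
# dockets/management/commands/scrapedockets.py
import requests
from lxml import html
from django.core.management.base import BaseCommand

from dockets.models import Case

DOCKETS_URL = 'http://example.com/court-dockets'  # placeholder


def to_seconds(time_text):
    # Crude conversion of a clock time like '9:30 AM' into seconds
    # since midnight.
    clock, meridian = time_text.split()
    hours, minutes = (int(n) for n in clock.split(':'))
    if meridian.upper() == 'PM' and hours != 12:
        hours += 12
    if meridian.upper() == 'AM' and hours == 12:
        hours = 0
    return hours * 3600 + minutes * 60


class Command(BaseCommand):
    help = 'Scrape the court dockets site and store new cases.'

    def handle(self, *args, **options):
        tree = html.fromstring(requests.get(DOCKETS_URL).text)
        table = tree.xpath('//table[@summary]')[0]  # stand-in XPath
        for row in table.xpath('.//tr'):
            cells = row.xpath('.//td')
            if len(cells) < 3:
                continue
            # Slice so court and description fit the max_lengths in models.py.
            court = cells[0].text_content().strip()[:100]
            seconds = to_seconds(cells[1].text_content().strip())
            description = cells[2].text_content().strip()[:200]
            # Only create the Case if an identical one doesn't already exist.
            if not Case.objects.filter(court=court, time=seconds,
                                       description=description).exists():
                Case.objects.create(court=court, time=seconds,
                                    description=description)
```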

The magic, pt. 2 - Cron

To make this worthwhile, we need it to run on its own every half hour. Unix systems make this simple, with a daemon called Cron. If you’re using Ubuntu, here’s a nice guide (other distros will be very similar). Cron schedules scripts to be run at different intervals, and its uses are virtually limitless.

I created a script, scrapedockets.sh, which simply calls the Django command we just walked through.
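A sketch, with placeholder paths:

```bash
#!/bin/bash
# scrapedockets.sh - run the Django scraper command
cd /path/to/minnpost
python manage.py scrapedockets
```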


Don’t forget to make it executable:

I used a crontab on the default user to call the scrapedockets.sh script every half hour. Edit your crontab using the command:
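```bash
crontab -e
```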


Each line is something you want cron to do. Here’s what mine looks like:
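Roughly like this, with placeholder paths:

```
*/30 * * * * /path/to/scrapedockets.sh >> /path/to/scrapedockets.log 2>&1
```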

Cron will run the script scrapedockets.sh every 30 minutes (any minute value evenly divisible by 30) and log output to scrapedockets.log. I encourage you to look at a guide to see what the structure is.

If everything is set up, your Django database should start filling up with information. Build some views, and show the world what you’ve found.


If you know a better way, please share!


I’m far from an expert, so if you see something fishy here, leave a comment or tweet at @kevinschaul.