Regression Model using Web Scraping Algorithm

For the second project in our Data Science Bootcamp, our task was to build a regression model on data collected by web scraping. At the time, collecting data from websites with code was a totally foreign idea to me, but the web is one of the most logical and easily accessible data sources. After trying a couple of different ways to scrape data from the web, I got used to the workflow.

In this story, I will try to explain how I scraped the movies and their related data from the-numbers.com. After that, I will briefly explain how I built my regression model to predict the total gross of high-budget movies.

Getting Started

The first question to ask before getting started with any Python application is ‘Which libraries do I need?’

There are a few distinct libraries to consider for web scraping, including:

  • Beautiful Soup
  • Requests
  • Selenium
  • Scrapy

In my example, I’ve used BeautifulSoup. Using pip, the Python package manager, you can install it with the command pip install beautifulsoup4.

Inspect the Web Page

You first need to inspect the web page to know which elements to target in your Python code.

To gather data from the-numbers.com, you can inspect the page by right-clicking the element of interest and selecting Inspect. This brings up the HTML code, where we can see the element that contains each field.

Since the data is stored in a table, scraping it takes just a few lines of code. If you want to familiarize yourself with scraping websites, this is a good example and a good place to start, but bear in mind that it won’t always be so easy!

All 100 results are stored in <tr> elements, one per row, and all of them are available on one page. This will not always be the case: when results stretch over several pages, you may need to adjust the number of results shown on a page, or loop over all the pages to collect all the details.

The Budget Table web page displays a table containing 100 results. When inspecting the page, it is easy to see a pattern in the HTML: the results are contained in rows within the table.

Parsing the related page using BeautifulSoup

The first step is to import the libraries your web scraper will use. We have already discussed BeautifulSoup above, which lets us work with the HTML. Next we import urllib and requests, which send the request to the webpage link and fetch the response. The others are the standard numpy, pandas, and re libraries.

The next step is to identify the URL that is being scraped. This webpage presents all of its results on one page, as discussed in the previous section, so the full URL is used exactly as it appears in the address bar.

We then connect to the webpage and use BeautifulSoup to parse the HTML, storing the resulting object in the ‘soup’ variable.
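As a minimal sketch of these steps, something like the following would work. The URL, the User-Agent header, and the helper names parse_html and fetch_soup are my assumptions for illustration, since the original address and code are not shown here. The demonstration at the bottom parses a tiny hand-written page so that it runs without a network connection.

```python
import requests
from bs4 import BeautifulSoup

# Assumed address of the budget table; the original URL is not shown in the text.
BUDGET_URL = "https://www.the-numbers.com/movie/budgets/all"


def parse_html(html):
    """Parse an HTML string with Python's built-in parser."""
    return BeautifulSoup(html, "html.parser")


def fetch_soup(url, timeout=10):
    """Download a page and return it as a BeautifulSoup object (hypothetical helper)."""
    # A browser-like User-Agent is an assumption; some sites reject the default one.
    headers = {"User-Agent": "Mozilla/5.0"}
    response = requests.get(url, headers=headers, timeout=timeout)
    response.raise_for_status()  # fail loudly on 4xx/5xx responses
    return parse_html(response.text)


# Offline demonstration on a hand-written page (no request is sent).
soup = parse_html("<html><head><title>Budgets</title></head><body></body></html>")
print(soup.title.string)
```

On a live run, soup = fetch_soup(BUDGET_URL) would replace the offline demonstration.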

Searching for elements in the link

As all of the results are contained within a table, we can search the soup object for the table using the find method. We can then find each row within the table using the find_all method.
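Here is a small sketch of find and find_all in action. The HTML is a simplified, made-up stand-in for the real table, so the column layout is an assumption.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup standing in for the real budget table.
SAMPLE_HTML = """
<table>
  <tr><th>Rank</th><th>Movie</th><th>Production Budget</th></tr>
  <tr><td>1</td><td><a href="/movie/example-one">Example One</a></td><td>$300,000,000</td></tr>
  <tr><td>2</td><td><a href="/movie/example-two">Example Two</a></td><td>$250,000,000</td></tr>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

table = soup.find("table")    # the first <table> in the document
rows = table.find_all("tr")   # every row, including the header row

# Keep only the rows that hold data cells (<td>), which drops the header.
data_rows = [row for row in rows if row.find("td")]

for row in data_rows:
    cells = [cell.get_text(strip=True) for cell in row.find_all("td")]
    print(cells)
```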

First, I defined two functions: one that scrapes the actors and directors, and one that scrapes the genres and ratings.

Then I created empty lists to collect each feature of the movies.

After looping over every movie to collect its information, I got the final dataframe.

Each page lists 100 movies, and each movie has its own page with its actor, director, gross, rating, and genre.

Since 100 movies would not be enough for my model, I ran this algorithm 10 times and ended up with 1,000 movies and their information.
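The list-and-loop pattern could look roughly like this. It is a sketch on made-up markup: the column order, the column names, and the sample movies are assumptions, and the per-movie details (actor, director, genre, rating) are left out because the structure of those pages is not shown here.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical markup; the real table may have different columns.
SAMPLE_HTML = """
<table>
  <tr><th>Rank</th><th>Movie</th><th>Budget</th><th>Gross</th></tr>
  <tr><td>1</td><td><a href="/movie/example-one">Example One</a></td>
      <td>$300,000,000</td><td>$1,200,000,000</td></tr>
  <tr><td>2</td><td><a href="/movie/example-two">Example Two</a></td>
      <td>$250,000,000</td><td>$900,000,000</td></tr>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# One empty list per feature; each loop iteration appends one movie.
titles, links, budgets, grosses = [], [], [], []

for row in soup.find("table").find_all("tr"):
    cells = row.find_all("td")
    if len(cells) < 4:  # the header row has <th> cells only, so skip it
        continue
    anchor = cells[1].find("a")
    titles.append(anchor.get_text(strip=True))
    links.append(anchor["href"])  # relative link to the movie's own page
    budgets.append(cells[2].get_text(strip=True))
    grosses.append(cells[3].get_text(strip=True))

movies = pd.DataFrame(
    {"title": titles, "link": links, "budget": budgets, "total_gross": grosses}
)
print(movies)
```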
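One way to repeat the scrape over 10 pages is to generate the page addresses first and then loop over them. In this sketch the base URL and the offset-style paging (1, 101, 201, ...) are assumptions about how the site numbers its pages, and page_urls is a hypothetical helper name.

```python
# Assumed base address; the original URL is not shown in the text.
BASE_URL = "https://www.the-numbers.com/movie/budgets/all"


def page_urls(n_pages=10, page_size=100):
    """Build one URL per results page, assuming offset-based paging."""
    urls = []
    for page in range(n_pages):
        offset = page * page_size + 1  # 1, 101, 201, ...
        urls.append(f"{BASE_URL}/{offset}")
    return urls


urls = page_urls()
for url in urls:
    print(url)
```

Each URL would then go through the same fetch-and-parse steps, ideally with a short pause between requests to be polite to the server.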

Data Cleaning

My data contained a lot of stray punctuation that had to be fixed: many <,>, <->, <.>, and < > characters. I cleaned these up with a few lines of code.

Adding Data Features

I thought my features would not be enough to get good accuracy, so I added the average actor gross and the average director gross. The result was my final dataframe.
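A cleaning step of this kind could be sketched as below. The sample values and the helper name to_number are made up for illustration. Note that stripping every non-digit character is only safe for whole-dollar amounts, because it would also remove a decimal point.

```python
import pandas as pd

# Made-up raw values showing the kinds of punctuation described above.
raw = pd.DataFrame(
    {
        "title": ["Example One", "Example Two"],
        "budget": ["$300,000,000", "$250.000.000"],
        "total_gross": ["$1,200,000,000", "-"],
    }
)


def to_number(series):
    """Strip every non-digit character, then convert to a numeric dtype."""
    digits = series.str.replace(r"[^\d]", "", regex=True)
    # Entries that held no digits at all (such as a lone "-") become NaN.
    return pd.to_numeric(digits, errors="coerce")


clean = raw.copy()
for column in ["budget", "total_gross"]:
    clean[column] = to_number(clean[column])

print(clean)
```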
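One simple way to build such features is a groupby with transform, which gives every movie the mean gross of its actor or director. The data and column names below are made up, and this is only a sketch of the idea rather than the exact calculation I used.

```python
import pandas as pd

# Made-up movies; one lead actor and one director per movie is an assumption.
movies = pd.DataFrame(
    {
        "title": ["A", "B", "C", "D"],
        "actor": ["Actor X", "Actor X", "Actor Y", "Actor Y"],
        "director": ["Dir P", "Dir Q", "Dir P", "Dir Q"],
        "total_gross": [100.0, 300.0, 200.0, 600.0],
    }
)

# transform("mean") returns one value per row, aligned with the original index.
movies["avg_actor_gross"] = movies.groupby("actor")["total_gross"].transform("mean")
movies["avg_director_gross"] = movies.groupby("director")["total_gross"].transform("mean")

print(movies)
```

Keep in mind that averages computed from the target column over the whole dataset let information from the test rows leak into the features, so it is safer to compute them on the training split only.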

One-hot encoding

To find possible relations with budget and gross, I added rating and genre using one-hot encoding.

Using a heatmap, I looked at all the possible relations between the features.

In my final dataset, I split the data into training, validation, and test sets.

When I investigated the total gross for all the movies, it looked like most of them fell between $0 and $500,000,000. This was expected, because I scraped the 1,000 highest-budget films of all time. You can see the distribution below.

In my model, I was going to predict the total gross of a movie, so I investigated all the correlations with total gross.

In the model, I used Linear, Ridge, Lasso, and Polynomial Regression and tried to find the best-fitting model for my data. I found that Ridge Regression was a little bit better.

After choosing Ridge as my final model, I also tried cross-validation to see whether it would change the final decision.

In the end, my score increased but my choice of model did not change: Ridge was the best-fitting model for my data.
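With pandas, one-hot encoding takes a single call to get_dummies. The rows below are made up for illustration.

```python
import pandas as pd

# Made-up rows; the real data has many more ratings and genres.
movies = pd.DataFrame(
    {
        "budget": [300, 250, 200],
        "rating": ["PG-13", "R", "PG-13"],
        "genre": ["Action", "Drama", "Action"],
    }
)

# Each category becomes its own 0/1 column named <column>_<category>.
encoded = pd.get_dummies(movies, columns=["rating", "genre"], dtype=int)
print(encoded)
```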
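A three-way split can be sketched with two calls to scikit-learn's train_test_split. The 60/20/20 proportions and the random data are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Random stand-in data: 1000 movies with 5 features each.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = rng.normal(size=1000)

# First hold out 20% as the test set.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Then take 25% of the remaining 80% as validation (20% of the total).
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=42
)

print(len(X_train), len(X_val), len(X_test))
```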
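The correlations with the target can be read off the correlation matrix directly. This sketch uses synthetic data in which the gross is built from the budget on purpose, so the numbers say nothing about real movies.

```python
import numpy as np
import pandas as pd

# Synthetic data: total_gross is constructed to depend on budget.
rng = np.random.default_rng(0)
budget = rng.uniform(50, 300, size=200)
df = pd.DataFrame(
    {
        "budget": budget,
        "total_gross": 3 * budget + rng.normal(0, 20, size=200),
        "noise": rng.normal(size=200),
    }
)

corr_matrix = df.corr()  # pairwise Pearson correlations

# Correlation of every feature with the target, strongest first.
target_corr = (
    corr_matrix["total_gross"].drop("total_gross").sort_values(ascending=False)
)
print(target_corr)
```

The same corr_matrix can be passed to a plotting function such as seaborn's heatmap to see every pairwise relation at once.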
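A comparison like this could be sketched as follows with scikit-learn. The data is synthetic and the hyperparameters (alpha values, polynomial degree) are assumptions, so the ranking it prints says nothing about the movie data.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic regression problem standing in for the movie features.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
y = X @ np.array([3.0, -2.0, 0.5, 0.0]) + rng.normal(scale=0.5, size=500)

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Scaling first matters for Ridge and Lasso, whose penalties depend on feature scale.
models = {
    "linear": make_pipeline(StandardScaler(), LinearRegression()),
    "ridge": make_pipeline(StandardScaler(), Ridge(alpha=1.0)),
    "lasso": make_pipeline(StandardScaler(), Lasso(alpha=0.1)),
    "polynomial": make_pipeline(
        StandardScaler(), PolynomialFeatures(degree=2), LinearRegression()
    ),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_val, y_val)  # R^2 on the validation set

for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```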
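Cross-validation for the Ridge model could look like the sketch below. The five folds, the alpha value, and the synthetic data are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression problem standing in for the movie features.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
y = X @ np.array([3.0, -2.0, 0.5, 0.0]) + rng.normal(scale=0.5, size=500)

model = make_pipeline(StandardScaler(), Ridge(alpha=1.0))

# Shuffled 5-fold split: every row is used for validation exactly once.
cv = KFold(n_splits=5, shuffle=True, random_state=42)
cv_scores = cross_val_score(model, X, y, cv=cv, scoring="r2")

print(cv_scores)
print(cv_scores.mean())
```

Averaging over the folds gives a steadier estimate than a single validation split, which makes it a fairer basis for comparing models.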

Summary

This short example on web scraping and building a regression model has outlined:

  • Connecting to a webpage
  • Parsing HTML using BeautifulSoup
  • Looping through the soup object to find elements
  • Performing simple data cleaning
  • Fitting Linear, Ridge, Lasso, and Polynomial Regression models
  • Cross-validation

Thank you for reading!