Learn Web Scraping with Python from Scratch Full Course Transcripts

 


1. Web Scraping Course Overview

Welcome to the course Web Scraping with Python: Getting Started.


So what is web scraping?


Well let's first see how things work without web scraping.


When you want to get data from the Web you open a web page on your browser.


You search for the data you need.


And finally you copy and paste text into another file on your machine.


Everything is good so far.


However, when the web page structure is so complicated that it is difficult to extract specific pieces of data, or when you need to open many pages to extract data from each of them, the manual process can become boring and time-wasting.


And that is when automated web scraping can make the process more efficient and effective.

In this course, you will learn: the Python web scraping libraries you need for the course and how to install them, how to extract URLs from one web page, how to extract other text data pieces from one web page, how to crawl multiple web pages and extract data from each of them, how to handle navigation links and move to next pages, and how to save your scraped data into a CSV file.


And finally a quick overview about other popular Web scraping frameworks.


So let's start.



2. Installing BeautifulSoup &
Requests

To start this web scraping tutorial, the first thing to do is to install the two libraries:


Beautiful Soup and requests.


If you use Linux or Mac, open Terminal; if you use Windows, open Command Prompt. Then, to install the required libraries, type the following commands in your terminal or command prompt and press Enter: pip install beautifulsoup4 and pip install requests.


In Linux or Mac, you may need to start with sudo to avoid permission errors.

To make sure that things work fine, open a new Python file, write the following, and run it:

from bs4 import BeautifulSoup
import requests

If you don't get any errors, then everything is OK.
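As a runnable sketch of that check, with a couple of convenience prints added (they are not part of the lesson, just confirmation):

```python
# Verify that both libraries installed correctly.
# If either import fails, re-run the pip commands above.
from bs4 import BeautifulSoup
import requests

print("bs4 imported:", BeautifulSoup is not None)
print("requests imported:", requests is not None)
```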

3. URL Extraction

Now, after installing the required libraries, let's learn how to extract URLs from web pages.


So we have five variables.


Needless to say variable names can be anything else.


We care more about the code workflow.


First, you have to specify the link of the web page you want to scrape. Modify the url variable, and do not forget the quotes; it is a string variable.

Then you need to get the web page by creating a response object: response = requests.get(url).

At this point, if you print response to see its value, the output will be like this: <Response [200]>. 200 means:


Great.


Your Internet connection works.


Your URL is correct, and you are allowed to access this page.


It is just like being able to see the web page in your browser.

Next, you need to extract the source code of the web page: data = response.text.


It is like using copy and paste to get the source code of the page into memory, except you are using a variable instead. You can print data to see what you have in it.


To make it easier to navigate the data structure of the web page, you have to pass the source code to Beautiful Soup to create a BeautifulSoup object for it: soup = BeautifulSoup(data, "html.parser").


If you print soup and data you might not notice a huge difference between the results of both variables.


However, you need this step to allow Beautiful Soup to parse HTML tags with the help of the html.parser.


So now you can extract specific HTML tags, such as the <a> tags of links, into a list, so you can loop on them later.

To extract all the <a> tags into a list, use the find_all method: tags = soup.find_all("a").

Finally, you can extract the links from the href attribute in the <a> tags by looping on the tags list. In simple terms, you can clean the links from their code: for tag in tags: print(tag.get("href")).

Extracting links is something you will be doing all the time in web scraping and crawling tasks.


Why? Because you need to start from one page, such as a book list, and then open subpages, like the description page of each book, to scrape data from them.

Now, here is the code of this lesson.


It extracts all the URLs from one web page.
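Putting the whole lesson together, a minimal sketch might look like this. To keep it self-contained, the parsing step is shown on a tiny hard-coded HTML snippet (an illustration, not a real page); with a live page you would use html = requests.get(url).text instead:

```python
from bs4 import BeautifulSoup

def extract_links(html):
    """Return the href of every <a> tag in the given HTML source."""
    soup = BeautifulSoup(html, "html.parser")
    tags = soup.find_all("a")            # all the <a> tags, as a list
    return [tag.get("href") for tag in tags]

# With a live page: html = requests.get(url).text
sample = '<a href="/book1">Book 1</a> <a href="/book2">Book 2</a>'
print(extract_links(sample))             # ['/book1', '/book2']
```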


Please try to run the code yourself.


You can download the code under the references section.

4. Web Scraping Craigslist - Titles

In this tutorial you will learn how to use beautiful soup for web scraping Craigslist.


So let's assume we want to scrape the titles of jobs available in Boston from craigslist.


In this video we will work on one page only.


We will simply search the website and get the URL: https://boston.craigslist.org/search/sss.


If you have not already, please watch the two previous videos first.

In the current video, the only new part is:

titles = soup.find_all("a", class_="result-title")
for title in titles:
    print(title.text)

Now it will find all the <a> tags whose class name is result-title. How can you know the class name?


You can find it by opening the URL in your browser, moving the cursor over a job title, right-clicking, and selecting Inspect. You can now see the HTML code, like this.


You can try the same for job addresses.


You have to extract all the <span> tags whose class name is result-hood with this code:

addresses = soup.find_all("span", class_="result-hood")
for address in addresses:
    print(address.text)
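As a runnable sketch of both snippets, here is the same logic applied to a hand-made fragment that mimics Craigslist's result markup at the time of recording (the class names are the site's and may change; a live run would fetch requests.get(url).text instead):

```python
from bs4 import BeautifulSoup

# A tiny stand-in for the search-results page.
sample = """
<p class="result-info">
  <a class="result-title hdrlnk" href="/job/1">Office Manager</a>
  <span class="result-hood"> (Boston)</span>
</p>
"""
soup = BeautifulSoup(sample, "html.parser")

titles = soup.find_all("a", class_="result-title")
for title in titles:
    print(title.text)        # Office Manager

addresses = soup.find_all("span", class_="result-hood")
for address in addresses:
    print(address.text)      # (Boston), with the surrounding spaces
```

Note that class_="result-title" matches the tag even though its full class attribute is "result-title hdrlnk": Beautiful Soup matches against each individual class.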


Try to run the code on your machine and let me know if you have any questions.


You can download the code under the references section.

5. Web Scraping Craigslist - Job
Details Wrapper

In the previous video you have extracted titles and addresses from a Craigslist job list page.


However, you extracted those pieces of information separately, which can cause mismatches if the address is missing from a job, for example. Instead, you should rather extract all the details of a job at once.


As you can see here, each job is in a higher-level paragraph tag whose class name is result-info.


It is called a wrapper containing all the pieces of information about a job.


So what you should do is extract all the <p> tags with the class result-info, and then loop on them to extract data from the secondary tags.


Let's write it in code.


jobs = soup.find_all("p", class_="result-info")

And then, for job in jobs:, extract each job detail:

title = job.find("a", class_="result-title").text
location = job.find("span", class_="result-hood").text

and so on. As you see, you have to use find, not find_all, because you only need the first occurrence of the tag, and to extract the text from the <a> tag, use .text at the end.


However, if one tag is missing, you will get an error message when you run the code: 'NoneType' object has no attribute 'text'. So you have to add an if statement. You can update your code to be:

location_tag = job.find("span", class_="result-hood")
location = location_tag.text if location_tag else "N/A"


This means: if a location tag is found, extract the text.


Otherwise, just add the text "N/A".

As the location is between brackets, you can use slicing from 1 to -1 to remove the brackets, and also strip to remove the extra spaces, or just slice from 2 to -1.
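For example, assuming the extracted text looks like " (Boston)" with a leading space (an assumption about the exact whitespace):

```python
raw = " (Boston)"            # the result-hood text, brackets included
print(raw.strip()[1:-1])     # Boston  (strip the spaces, then drop the brackets)
print(raw[2:-1])             # Boston  (or slice straight past the space and bracket)
```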


Extract the date in the same way, but be careful that its tag is called <time>.

Finally, print all the job details you extracted in the way you prefer.
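The whole loop described in this lesson can be sketched as follows. To stay self-contained, it runs on a hand-made fragment shaped like Craigslist's listing markup at recording time (the sample jobs are made up; a live run would fetch the page with requests as in the previous lessons):

```python
from bs4 import BeautifulSoup

sample = """
<p class="result-info">
  <time class="result-date">2021-06-01 12:00</time>
  <a class="result-title hdrlnk" href="/job/1">Office Manager</a>
  <span class="result-hood"> (Boston)</span>
</p>
<p class="result-info">
  <time class="result-date">2021-06-02 09:30</time>
  <a class="result-title hdrlnk" href="/job/2">Data Entry Clerk</a>
</p>
"""
soup = BeautifulSoup(sample, "html.parser")

jobs = soup.find_all("p", class_="result-info")
for job in jobs:
    title = job.find("a", class_="result-title").text
    link = job.find("a", class_="result-title").get("href")
    date = job.find("time").text

    location_tag = job.find("span", class_="result-hood")
    # Guard against a missing tag, then slice off spaces and brackets.
    location = location_tag.text.strip()[1:-1] if location_tag else "N/A"

    print(title, "|", location, "|", date, "|", link)
```

The second sample job deliberately has no result-hood span, so it prints "N/A" for the location instead of crashing.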


Let's run the code to see the output. As you can see, each job is printed with its details: title, location, date, and link.


Now replay the video.


Write the code line by line yourself and run it on your machine.


Let me know if you have any questions.


You can download the code under the references section.

6. Web Scraping Craigslist - Job
Description Page

In this video you will learn how to open the page of each job and extract its full description.


We are still in the for loop we created in the previous video, and the last thing we extracted was the job link, with the code line: link = job.find("a", class_="result-title").get("href").


So, do you remember what you did to parse the main page?


Now you should do the same to parse each job description page.

First, you connect to the page: job_response = requests.get(link). Then you get the source code: job_data = job_response.text, and pass it to Beautiful Soup to parse: job_soup = BeautifulSoup(job_data, "html.parser").


Now you extract the job description, and it is in a <section> tag with the id postingbody: job_description = job_soup.find("section", id="postingbody").text. Here you use the find method, not find_all, because there is only one such section tag, and as usual you use .text to extract the text without tags.

You can also extract some job attributes, like compensation and employment type, from the <p> tag whose class name is attrgroup, using this code line: job_attributes = job_soup.find("p", class_="attrgroup").text. However, sometimes this tag is not found, so you have to add a conditional statement:

job_attributes_tag = job_soup.find("p", class_="attrgroup")
job_attributes = job_attributes_tag.text if job_attributes_tag else "N/A"


So the first line specifies the tag, and then the second line says: extract the text from the tag if the tag is found, otherwise just add the string "N/A". Note that we are still inside the for loop.


Finally print the job details including job description and job attributes.
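Sketching this lesson's additions on a hand-made description page (postingbody and attrgroup are Craigslist's markup at recording time; in the real loop, job_data would come from requests.get(link).text):

```python
from bs4 import BeautifulSoup

# Stand-in for job_response.text fetched from the job's link.
job_data = """
<section id="postingbody">We need an office manager...</section>
<p class="attrgroup">employment type: full-time</p>
"""
job_soup = BeautifulSoup(job_data, "html.parser")

# Only one such section exists, so find (not find_all) is enough.
job_description = job_soup.find("section", id="postingbody").text

# attrgroup is sometimes missing, hence the conditional.
job_attributes_tag = job_soup.find("p", class_="attrgroup")
job_attributes = job_attributes_tag.text if job_attributes_tag else "N/A"

print(job_description)   # We need an office manager...
print(job_attributes)    # employment type: full-time
```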


As you can see when you run the code, there are now more details about the job and its description.


Try to run the code on your machine and let me know if you have questions.


You can download the code under the references section.

7. Crawling & Scraping Next Pages

In this video, you will learn how to move to the next page and extract its data. As you can see here, there are next and previous links, and when you click next, this opens a new page including more job listings.


So let's see how to extract the jobs from the next pages.


You can add a while True statement just before response = requests.get(url).

So the initial url variable will remain the same, but later we will change it to be the next page's URL.

In the previous video, we had a for loop to iterate over the jobs on the same page.


Make sure you are outside the for loop, and then extract the next-page tag, which is in this case an <a> tag whose title attribute value is "next page": url_tag = soup.find("a", title="next page").

Then you should have an if statement to exclude the case when you are on the last page, and hence there is no next-page URL.


Note that on the last page the <a> tag is there, but it is empty. Note also that the href here includes only a relative URL, that is to say, without the domain name:

if url_tag.get("href"):
    url = "https://boston.craigslist.org" + url_tag.get("href")
else:
    break

So the code says: if the <a> tag's href is not empty, then the url variable gets the absolute URL by concatenating the domain name to the value of href; otherwise, exit the while loop using else: break.

You may also want to count the number of jobs if you like, so you can see the final code on the screen.
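The crawling skeleton might be sketched like this, with the next-page handling factored into a small helper so it can be demoed on a sample fragment (the helper name is mine, not the course's; the domain matches the Boston URL used here):

```python
from bs4 import BeautifulSoup

DOMAIN = "https://boston.craigslist.org"

def next_page_url(soup):
    """Return the absolute URL of the next page, or None on the last page."""
    url_tag = soup.find("a", title="next page")
    if url_tag and url_tag.get("href"):          # empty href means last page
        return DOMAIN + url_tag.get("href")      # href is relative, add domain
    return None

# The crawl loop from the lesson, using the helper:
# url = DOMAIN + "/search/sss"
# while True:
#     soup = BeautifulSoup(requests.get(url).text, "html.parser")
#     ...                                        # previous lessons' for loop
#     url = next_page_url(soup)
#     if url is None:
#         break

sample = '<a title="next page" href="/search/sss?s=120">next</a>'
print(next_page_url(BeautifulSoup(sample, "html.parser")))
# https://boston.craigslist.org/search/sss?s=120
```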


Let's run it to see the output.

Before moving to the next video, make sure you run the code on your machine, and let me know if you have questions.


You can download the code under the references section.

8. Saving Output to CSV File

In this video, you will learn how to save your output to a CSV file.


For this, we will use the pandas library. If you prefer, you can use the csv module instead, but in this tutorial we will use pandas, which will be very useful for you to learn about.


So if you have not already installed pandas, you can install it by typing in this command in your terminal or CMD: pip install pandas.

Then in your code, add this line: import pandas as pd.


Now, before the while loop, create an empty dictionary, say npo_jobs, to save your data.


You can also have job_no = 0, which you will use later to count jobs if you want.


Then, inside the for loop, add job_no += 1, which will give you a new number for each job.


So now you can update the npo_jobs dictionary. As a quick reminder, this is how you update a dictionary:
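A quick, generic reminder (the keys and values here are made up):

```python
npo_jobs = {}                                       # start with an empty dictionary
npo_jobs[1] = ["Office Manager", "Boston"]          # assigning a new key adds an entry
npo_jobs[2] = ["Data Entry Clerk", "Cambridge"]
npo_jobs[1] = ["Senior Office Manager", "Boston"]   # an existing key is overwritten
print(npo_jobs)
```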


So, back to our Craigslist code, let's update the npo_jobs dictionary, and we will add this at the end of the for loop.


You need a dictionary key for each job, which can be the current job_no, or the current job title, or whatever you prefer.


And of course you need a value for each key and in this case it must be a list of job details.


Yes it must be a list and you will see why in a moment.


So the list value will include the job details such as title location date link job attributes and job


description.


By the end of the while loop, your npo_jobs dictionary will include all the jobs in the form of keys and values, and you can print the npo_jobs dictionary if you want to check it.


Now it is time to use the pandas library you have imported. Outside both the for loop and the while loop, add this code line:


npo_jobs_df = pd.DataFrame.from_dict(npo_jobs, orient="index", columns=["Job Title", "Location", "Date", "Link", "Job Attributes", "Job Description"])

This line converts the npo_jobs dictionary into a DataFrame.


What is a DataFrame?


A DataFrame is another Python data structure, which is specific to pandas.


You can see what it looks like by printing the first 10 rows using the head method. Using DataFrames can make handling data much easier, especially if more processing is needed.


So use pd.DataFrame.from_dict and specify the name of the dictionary you want to convert to a pandas DataFrame.


Here you have orient="index", which means each key and its value form a row.


Also you have the columns attribute which is just the table header that is to say the name of each column.


Let's now convert the pandas DataFrame into a CSV file, using the to_csv method and specifying the file name as you prefer.


So here.


If everything works fine, after running your code, you should find a CSV file created in the same location as your Python file. And this npo_jobs.csv will include all the jobs you extracted and their details.


So that's it for this tutorial.


I hope you enjoyed it.


If you have questions, please feel free to send them to the dedicated Q&A section.

9. Scrapy vs. Other Python Web
Scraping Libraries

Hey there.


So today we are going to learn about Scrapy: what Scrapy is overall, Scrapy versus other Python web scraping tools, why you should use it and when it makes sense to use some other tools, and the pros and cons of Scrapy.


And that would be it.


So let's begin.

Scrapy, overall, is a web crawling framework written in Python. One of its main advantages is that it's built on top of Twisted, an asynchronous networking framework, which in other words means that it's really efficient: it is an asynchronous framework.


To illustrate why this is a great feature, for those of you that don't know what an asynchronous scraping framework means, I'll use a layman's example.


So imagine you had to call a hundred different people by phone. Normally, you would do it by sitting down, dialing the first number, and then patiently waiting for a response on the other end. In an asynchronous world, you can pretty much dial the first 20 or 50 phone numbers and then only process each call once the person on the other end picks up the phone, hopefully.


Now it makes sense.

Scrapy supports Python 2.7 and Python 3.3+, so depending on your version of Python, you are pretty much good to go. An important thing to note: Python 2.6 support was dropped starting at Scrapy 0.20, so just bear that in mind, and Python 3 support was added in Scrapy 1.1.

Scrapy in some ways is similar to Django, so those of you that use or have previously used Django will definitely benefit.


Now let's talk more about the other Python-based scraping tools. Bear in mind that these are all specialized libraries with very focused functionality; they are not really a complete web scraping solution like Scrapy is. The first two, urllib2 and then requests, are modules for reading or opening web pages. The other two are Beautiful Soup and then lxml; these are for the fun part of the scraping jobs, really, for extracting data points from those pages that you load with urllib2 and requests.


But let's get back to urllib2. urllib2's biggest advantage is that it's included in the Python standard library, so it comes bundled, and as long as you have Python installed, you are good to go. In the past, it was more popular, but since then another tool replaced it, and that tool, believe it or not, is called requests.

The docs, or documentation, are superb for requests; I think it's even the most popular module for Python, period. If you haven't already read them, once again, the docs are just amazing, so just give them a read. Requests, unfortunately, doesn't come preinstalled with Python, so you will have to install it. I personally use it for quick and dirty scraping jobs, and both urllib2 and requests are supported with Python 2 and 3.

The next tool is called Beautiful Soup.


Once again, it's used for extracting data points from the pages that are loaded. Beautiful Soup is quite robust, and it handles malformed markup nicely. In other words, if you have a page that is not getting validated as proper HTML, but you know for a fact that it is specifically an HTML page, then you should give scraping data from it with Beautiful Soup a try. Actually, its name comes from the expression "tag soup", which is used to describe really invalid markup.


Beautiful Soup creates a parse tree that can be used to extract data from HTML.


The official docs are comprehensive, easy to read, and full of examples, so, just like with requests, they are really beginner friendly. And just like the other tools for scraping, Beautiful Soup also works with Python 2 and Python 3.


And now let's see lxml. lxml is similar to Beautiful Soup, so once again it's used for scraping data. It's the most feature-rich Python library for processing both XML and HTML, and it's also really fast and memory efficient. A fun fact is that Scrapy selectors are built over lxml, and, for example, Beautiful Soup also supports it as a parser. Just like with requests, I personally use lxml in pair with requests, of course, for the previously mentioned quick and dirty jobs.
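As a tiny illustration of that pairing (the class name and XPath here are just for the example; with a live page you would pass requests.get(url).text to fromstring):

```python
from lxml import html

def result_titles(page_html):
    """Extract link texts with an XPath query, the lxml way."""
    tree = html.fromstring(page_html)
    return tree.xpath('//a[contains(@class, "result-title")]/text()')

sample = '<a class="result-title hdrlnk">Data Analyst</a>'
print(result_titles(sample))     # ['Data Analyst']
```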


Bear in mind that the official documentation is not that beginner friendly, to be honest. So if you haven't already used a similar tool in the past, use examples from blogs or other sites; they will probably make a bit more sense than the official docs.

The last tool for scraping is called Selenium. To preface this: Selenium is first and foremost a tool for writing automated tests for web applications.


It's used for web scraping mainly because it's beginner friendly, and if a site uses JavaScript, so if a site is heavy on JavaScript, which more and more sites are, Selenium is a good option, because once again it's easy to extract the data if you're a beginner, or if the JavaScript interactions are very complex, with a bunch of GET and POST requests.

I use it sometimes solely, or in pair with Scrapy, and most of the time when I'm using it with Scrapy, I try to render, once again, JavaScript-heavy pages and then use Scrapy selectors to grab the HTML that Selenium produces.

The currently supported Python versions for Selenium are 2.7 and 3.5+. Overall, Selenium support is really extensive, and it provides bindings for languages such as Java, C#, Ruby, Python of course, and then JavaScript. The Selenium official docs are once again great and easy to grasp, and you can probably give them a read even if you are a complete beginner; in two hours you will pretty much figure it all out.


Bear in mind that, from my testing, for example, scraping a thousand pages from Wikipedia was 20 times faster, believe it or not, in Scrapy than in Selenium. Also, on top of that, Scrapy consumed a lot less memory, and CPU usage was a lot lower with Scrapy than with Selenium.


So, back to the main pros when using Scrapy: of course, it is first and foremost asynchronous. If you are building something robust and want to make it as efficient as possible, with lots of flexibility and a bunch of functions, then you should definitely use it.


One example case where using some of the other above-mentioned tools makes sense is if you had a project where you need to load one page, or something like that, of your favorite, let's say, restaurant, and check if they are having your favorite dish on the menu. For this type of case, you should not use Scrapy, because, to be honest, it would maybe be overkill.

Some of the drawbacks of Scrapy are that, since it's really a full-fledged framework, it's not that beginner friendly, and the learning curve is a little steeper than with some other tools.


Also, installing Scrapy is a tricky process, especially on Windows.


But bear in mind that you have a lot of resources online for this, which pretty much means that you have, I'm not even kidding, probably a thousand blog posts about installing Scrapy on your specific operating system.

And that's it for this video. So thanks for watching, and I'll see you in the very next video, where I will discuss installing Scrapy.

10. Scrapy Framework Tutorial:
Scraping Craigslist

In this Scrapy tutorial, you will learn how to write a Craigslist crawler to scrape Craigslist's "Architecture & Engineering" jobs in New York and store the data in a CSV file.

https://python.gotrained.com/scrapy-tutorial-web-scraping-craigslist/


11. Bonus: Advanced Web Scraping
Courses

Use the coupon code WEB-SCRAPING-101 to get any of our courses.


Web Scraping Courses



Reviewed by Admin on June 30, 2021.