Python Part 15 - Scraping websites

So welcome to this Wise Owl tutorial on scraping websites using Python. Here's what you'll learn during the tutorial. We'll begin by looking at some problems you may encounter; it might seem a gloomy way to start, but you have to be realistic about what you will and won't be able to achieve, and we'll also look at different scraping tools in this section. We're then going to look at our example HTML for the case study we're going to do during the tutorial, and then we're going to look at the Document Object Model: how an HTML page is constructed into different sections.

We'll look at HTML tags and attributes, so tags are things like img for image, p for paragraph, div for a division or section. Then we're going to look at element IDs and class names, and how cascading style sheets use them to impose formatting. We'll then go on to look at getting HTML from a website using the requests module within Python, and then we're going to look at getting HTML from a file, for which you don't need the requests module, and that will allow us to do our case study.

And then we'll look at actually scraping a website using the Beautiful Soup module, with its wonderful name. We'll look at the elements of Beautiful Soup: we'll look at chaining elements together; we'll look at how you can get navigable strings, which is basically readable text from a website; we'll look at how you can navigate the Document Object Model by its relatives, so things like parents and children and siblings and so on; and then we'll look at how you can find elements by their tag name or their ID or their class, or whatever it may be.

And finally we'll look at doing the same things using jQuery-style or CSS-style selectors, so there's lots to cover. At the top right of your screen about now a link should appear, and you can click on that to see the many files and exercises to do with this tutorial. If it doesn't appear, you can click on the same link on the YouTube page for this tutorial. But that's enough of me; I'm going to vanish now and Sven will take you through the rest of the tutorial.

So let's get started. Before we look at how you can scrape a website in Python, let's look at some problems that you're going to encounter. It may seem a bit gloomy starting like this, but you have to be realistic in life. I can think of seven, believe it or not. Firstly, you might have legal and ethical reasons why you shouldn't scrape a website. Secondly, you need to be able to understand the HTML, which won't always be easy.

Thirdly, you need to be able to cope with pages or websites which are written very much with JavaScript in mind, so you can't necessarily see the underlying HTML. Fourthly, you need to get around password protection so that you can scrape a web page, and likewise you need to get around CAPTCHAs, which are designed to filter out robots (like you, possibly). You need to be able to make sure that you're looking at the correct version of a web page: if you and I both go to a web page, who's to say we're even looking at the same thing? And finally, you need to decide which web scraper you're going to use.

So let's look at each of those in turn, beginning with legal and ethical issues. I've taken here, more or less at random, the Monopoly game on the McDonald's website at the time of speaking. You could, if you like, scrape their website, and if you could get to the underlying code you could publish it on your own site. Apart from the fact that I don't think you'll be able to get to the underlying code, there are legal and ethical problems with this: legal, because I think McDonald's would send you a cease and desist letter within a couple of hours; unethical, because you're stealing someone else's property. We get lots of people browsing our website and taking our videos, and it always hurts; we try to follow it up where possible. Please don't do it, it's just not a nice thing to do, so make sure you're only taking publicly available information.

The second issue is: are you going to understand the HTML? This is the Premier League website; I took it, again more or less at random, and what you see on screen definitely doesn't easily translate into the HTML, so that's going to be a big thing to get over. The third thing you're going to have to get around is client script, and this is a big problem. There's the BBC home page today, and there's what it looks like in HTML. There's absolutely no way I could scrape it, because everything is hidden away with calls to JavaScript functions and things like that, so it will be almost impossible to scrape this website unless I use a specialist tool like Selenium; more on that in a second.

The fourth problem you may have is password-protected sites. I've chosen the American Express site, but I could have chosen anything really; you're not going to be able to get at the underlying HTML unless you get through the security barrier. It can be done, but it's another problem. A fifth problem you may have is CAPTCHAs, which are designed to filter out robots (like yourself, possibly). There are a couple of images there which I'm sure everybody watching this video has seen before on different websites, so that's another obstacle in your way. A sixth problem is: which version are you seeing? When I go to a web page, am I looking at the same thing as you? Here are four reasons why that might not be the case. We might be using different platforms: what you see on your mobile may be very different to what I see on my laptop. I might have turned JavaScript off, in which case I'll see something very different. The website might be using something called A/B testing.

People like Amazon do this all the time: they serve up different versions of the website and gauge from the reaction of their users which one to keep, so it's quite possible you can go to a web page twice in a row and see different things. And fourthly, the cookies I've got on my computer will determine what I see, so again we may not necessarily be seeing the same thing as each other.

So given all that, which web scraper should you choose? Well, the three main choices which seem to come up are Beautiful Soup, which is the one I'm going to cover in this tutorial; Scrapy, which seems another good choice; and Selenium, which is I think more difficult to learn but also more powerful. From all the research I've done, I think Beautiful Soup seems a very good choice: it's nice and easy, it's very well documented and it's very powerful. But if you're going to get round some of the problems I've described, like robots, CAPTCHAs, password protection and so on, it may be that you want to get your web scraper to behave more like a human being, and for that Selenium is probably the best choice. What I would recommend doing is starting with Beautiful Soup, mastering the basics of scraping a website, and then maybe moving on to something more powerful.

So, in order to make sure that everyone watching this tutorial experiences exactly the same thing, rather than relying on a public website I've created a file called wyndham.htm. It's one of the files attached to this

tutorial, and if you go to that, you might be able to see that it lists out books (films, sorry, books rather) by one of my favorite authors, John Wyndham. It references a logo file (a .png image). It would be nice to be able to right-click on this and browse it, but unfortunately you can't do that in the default version of Visual Studio Code, so what I'm going to do is go to this icon in the activity bar and install a way to open files in a browser. If I type "browser" in the search bar, you can see the open in browser extension is the first one which comes up, with 4.5 million downloads; that's got to be reliable, surely. If I install that, it takes only a second or two on my machine at the moment, possibly because it's not the first time I've done it. Then if I go back to my list of files and go to my wyndham file, I can right-click anywhere I like on that and choose to open it in either my default browser or any other one, and that will show what it looks like. You can see I've already had a go at doing this before; what it does is list out six books by John Wyndham, and that's what we're going to use for our scraping examples.

So if you want to be able to scrape a website or page, you're going to need to understand the Document Object Model in HTML, so let's have a look at that. If you right-click on a page like the one I've just told you about, the wyndham.htm file, and choose to open it in the browser, this is what the HTML renders as

in your browser. What that means is the browser takes the underlying HyperText Markup Language (that's what HTML stands for), interprets it and presents it on screen like this. If you want to see the underlying source code for any web page, you can right-click and choose to view the page source, and you can usually press Ctrl+U to do the same thing. What that will do is open up another file listing out the HTML. You can do this for any web page. Any web page that you look at will normally begin with an html instruction.

Following that is the header section; the header section is the bit at the top which tells you about how the page is going to work, but it doesn't actually contain anything you'll ever be able to see on screen, so when you're scraping a web page you can usually ignore this. Following that is the body section; the body section is the heart and soul of the web page, and it goes on for pretty much the rest of the page, so that's the bit we'll actually be scraping.

And right at the end is an instruction which says that's the end of all the HTML. So that's one way of looking at it, but if we close that down and go back to our web page, you can also, in most browsers, press the F12 key. I'm going to do that now, and what that will do is open up something called developer tools. If you click on the Elements tab here, then you can normally see how the Document Object Model is structured.
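To make the nesting concrete, here's a minimal sketch using Python's built-in html.parser module. The markup below is a simplified stand-in I've made up for the wyndham.htm structure, not the real file; the sketch prints each opening tag indented by its depth in the Document Object Model, much like the Elements tab does:

```python
from html.parser import HTMLParser

# A simplified, made-up stand-in for the structure of wyndham.htm
html_doc = ("<html><head><title>John Wyndham</title></head>"
            "<body><div><table>"
            "<tr><th>Title</th></tr>"
            "<tr><td>The Chrysalids</td></tr>"
            "</table></div></body></html>")

class TreePrinter(HTMLParser):
    """Record each opening tag, indented to show the DOM hierarchy."""
    def __init__(self):
        super().__init__()
        self.depth = 0
        self.lines = []
    def handle_starttag(self, tag, attrs):
        self.lines.append("  " * self.depth + tag)
        self.depth += 1
    def handle_endtag(self, tag):
        self.depth -= 1

printer = TreePrinter()
printer.feed(html_doc)
print("\n".join(printer.lines))
```

Running it shows html at the top, head and body one level in, and the table rows and cells nested further down — the same hierarchical picture the developer tools give you.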

That's it in Edge, and I just want to show you that it does work in Chrome too: if I do the same thing in Chrome and press the F12 key, I've got pretty much the same schematic here. So let's close that down and go back to Edge. What you can do in this is see the main sections: the header section, the body section, and then all the div sections. If I look at any div section, which corresponds to a block in the HTML file, it might have another div section within it, or a paragraph, for example. And if I go down to this div section, you can see it contains a table, and the table contains a table body, and the body contains table rows, and each table row contains either table headers or table data, and they may contain links as well. So there's this hierarchical structure of how the tags are linked together, and it's tags and attributes I want to look at in the next part of this tutorial.

As well as understanding the Document Object Model, you need an appreciation of tags and attributes to be able to scrape

a web page, so let's look at those in turn. Our example contains at the top a picture of a Wise Owl logo and John Wyndham's name, and that's generated by this HTML, so let's look at it in more detail. If you look at the beginning, there's a div tag, and a div tag means there's a block of text (I guess it stands for division, maybe). Whenever you have a tag which begins something, it has to have an end point, and the end point is exactly the same, with the exception that you have a forward slash in front of the tag name.

To give another example, there's an a tag signifying the beginning of a hyperlink (that's where the image is clickable), and there's the end a tag there. So that's what tags are, but you'll notice each tag also has attributes, so let's have a look at those, taking this time the example of the image tag. Now, the image tag violates the rule I've just explained to you, because it doesn't have a closing part; some tags are like that, unfortunately (they're called self-closing, I think). This image tag has four attributes: style, src, alt and width, and they're listed at the bottom there. The style says how the image appears, the src says where the picture is coming from, the alt says what will appear in a tooltip when you let your mouse linger over it, and the width says how wide the image will appear. If you look at all the other tags in that HTML, they've all got different attributes. So having looked at that, let's now have a look at classes and IDs in HTML to

complete the picture. So the final piece of our HTML jigsaw is looking at tags, classes and IDs, to see how things are styled with something called CSS, or cascading style sheets. What I've done is taken a copy of this file (to wyndham temp), because I'm going to make changes to it which I don't really want to keep, and I'm going to browse this. What I want to do is explain why three bits of formatting appear.

The first one is why the words John Wyndham appear in a larger font; the second one is why this box appears with a thin line around it; and the third one is why each of these links appears in blue. It turns out the reason for each of those is different. So, going back to the HTML, let's start with the font of the title. The title is here; let's just highlight it: h1 John Wyndham.

h1 is a tag which means it's the most important heading you can have; headings run from h1 up to h6. So why is that appearing in a different font? There's no formatting applied to that tag, but if you go up to the top of the page, you can see there's a section on styles. Now, these are called cascading styles, and they're normally contained in a separate file called a cascading style sheet (that's what the CSS stands for), and it contains a set of instructions which we're going to have a look at. The first instruction it contains is that anything where the h1 tag appears will be formatted with a 20 pixel font size, and that's why the font's appearing bigger. Just in case you don't believe me, let's make a bit of a change to it: let's add a text-decoration property set to underline, which should automatically underline the text. I've just saved that (that's why the unsaved-changes icon there has disappeared), and if I go back to my browser and refresh it by pressing F5, you can see the words John Wyndham are now underlined.

So the idea behind CSS is to separate the content of your page from how it's formatted, and it's near-universal good practice to do it like this. Now, if you're picky, you may notice that I haven't always followed these rules: for example, I've got an image there where I've actually put some styling within it, and I've got a div tag where I've styled that too. But by and large that's bad practice, and on a web page you should separate out your content and your formatting.

So let's look at the second example of how this is done. What we're going to do is have a look at the box around this list of books. This is coming from this instruction; let's just highlight it again: that is the box. So why has it got a border around it? There's nothing within that div tag saying that should be the case. Well, again, the answer is in our styles at the top. If I go back up to the top, you can see that there is this instruction with a hash in front of it. What that hash means is: if the browser finds an element which has an id attribute of table-box, then it should format it with a solid border and make it 600 pixels wide. Again, I think you probably do believe me by now, but I'm still going to illustrate it: let's change the border to 5 pixels. This is not going to look very nice, but if I then go back to the web page and refresh it, you can see I get a much thicker border. That is coming (if we go back down, just to remind you) from the id attribute there.

Now, it's possible I could apply the same id attribute to more than one tag. Firstly, that would be terrible HTML practice, because the whole idea behind an id is that it should be unique on the page; and secondly, browsers may treat it differently. Most browsers will pick out the first element with that id and

ignore the rest, but you can't be sure, so remember that an id should be unique on the page. So that's tags and IDs; the last bit of the picture is classes. You'll notice that every single link appears in blue, and the reason for that is that every single link has got this attribute assigned to it: class="link". You use a class when you don't just want a single tag to appear in a certain way; you want anything belonging to that group to do so. As you've probably come to expect by now, if we go up to the top and have a look at our styles, we'll find the corresponding style for that; in fact, there it is. What this is saying is that any tag which has got the link class set for it will automatically appear in blue. So I could, if I liked, change this: let's change it to red, and when I go back to my browser and press F5 to refresh the page, you can see the color of all my links changes.

So classes are a great way to set general changes for all the elements of a particular type, and those are your three building blocks: tags, IDs and classes. The entire World Wide Web (I would say, slightly exaggerating) is built upon this principle of having a separate style sheet, normally, which you don't usually get to see, containing instructions for how to format the elements on your page, and then the elements themselves, which either

invoke an id, or invoke a class, or are formatted just by virtue of being a particular tag. So that's how HTML works.

You can get HTML either from a website address or from a file. In this part of the tutorial we'll look at getting it from a website address, or URL (uniform resource locator, as they're actually called), and the one we're going to use is called pythonscraping.com, which I think is available for anybody to use; I think that's the idea behind it. To do this, we go into Visual Studio Code and open up the terminal window, and then let's firstly find out if we have the requests module installed, which is a module you can use to go to a website and get the underlying HTML. I'm going to type pip list to list my modules, and you can see requests isn't there, so the first thing I need to do is install it. I'll type pip install requests, and when I press return it should install that module.

So that's good news. I can now go to a file I've created for the requests module, and in this I should now be able to import my requests module. That's good, because it means I can go to a website, so let's do exactly that. To do that, you can create a variable to hold the response; it's normally called either r or response, and I'm going to go for the longer, more descriptive response. I put in the name of my module and use the get method, and then in quotation marks I put in the website address I want to go to; I think I've got it listed there. That should do it. Now, the next thing I need to do is test the status returned, so to do that I can test the response's status_code property, and if you let your mouse linger over it you'll see it's an integer.

So what I can do is firstly test for what's called a 404 error, which means that the web page wasn't actually found, and if that's the case I'll just print "not found". How about if it was found and everything went perfectly? To test for that, I can check for the 200 code, the holy grail of going to a web page: it means everything worked perfectly. In that case, what I'm going to do is create a variable called html_returned and set it to hold the text returned, so that will be the entire contents of the page, and then, just to prove this has worked, I'll print it out as well. Lots of other possible status codes exist; I'll cover those by printing a message saying "status code", with the status code in a placeholder, and the format function will substitute in the status code received.

So if I run that, it should give me success, I hope, and after a short time while it's going to the website, there's the HTML returned from it; that's all good. If I now change the address, completely in fact, to a web page which doesn't exist, then when I run this I should get a 404 error, and it's gone to that segment of the if statement. And if I return back to what I had before, but put (maybe) a silent q in the middle of the domain, that domain doesn't actually exist, and this time I'll get an error when I run it, so I need to build error trapping around this to check it's actually working. So that's how you can get text from a website. What we'll now do is look at how you can get text from a file, and then we'll go on to actually do some scraping.
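Before moving on, the URL-fetching steps just described can be put together like this. It's a sketch rather than the tutorial's exact code — the describe_status helper and its messages are my own approximation — and it wraps the call in a try/except so that a non-existent domain raises a catchable error rather than crashing:

```python
import requests

def describe_status(code):
    """Map an HTTP status code to a message, mirroring the if statement above."""
    if code == 404:
        return "Not found"
    elif code == 200:
        return "Success"
    else:
        return f"Status code {code} received"

try:
    response = requests.get("https://pythonscraping.com")
    print(describe_status(response.status_code))
    if response.status_code == 200:
        html_returned = response.text   # the entire HTML of the page
except requests.exceptions.RequestException as err:
    # Raised when, say, the domain doesn't exist or there's no connection
    print(f"Request failed: {err}")
```

The except branch is the error trapping mentioned above: a misspelled domain raises an exception rather than returning a status code at all.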

So what I want to do now is look at how you can get HTML from a file like this one, wyndham.htm, and this is actually much easier. I've created a program called html from file.py, and we're going to get HTML from our file. To do that I can just use a standard open statement: I can say with open, put a cheeky little r in there (to make the path a raw string), paste in the contents of the clipboard, which gives me the location of my file, and I'll call that wyndham_file.

Incidentally, it seems that you can't use the requests module to browse to a file on your hard disk; you can only use it to go to a website URL. What I can now do is store the contents, and to do that I'll create a variable called html_text and set it by calling the read method on wyndham_file, which returns the entire contents of the file. So now all I need to do is check that works, so I can print out the html_text.
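As a self-contained sketch of those steps (it creates a small stand-in file in a temporary folder, since the real wyndham.htm may not be on your machine; the sample filename and markup are mine):

```python
import tempfile
from pathlib import Path

# Create a small stand-in for wyndham.htm so the sketch runs anywhere
sample = Path(tempfile.gettempdir()) / "wyndham_sample.htm"
sample.write_text("<html><body><h1>John Wyndham</h1></body></html>", encoding="utf-8")

# The standard open statement; a raw string (r"...") is handy for Windows paths
with open(sample, "r", encoding="utf-8") as wyndham_file:
    html_text = wyndham_file.read()   # the entire contents of the file

print(html_text)
```

Incidentally, the rather strange character you may see at the start of a file is usually a byte-order mark; opening the file with encoding="utf-8-sig" strips it.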

And if I try running this program, you'll see it gives me the contents of my file; subject to one rather strange character at the beginning, it's all worked perfectly. Now, finally, I think we're ready to do some scraping.

So we're finally ready to install Beautiful Soup. I've included a file called usefulwebsite.txt with this tutorial, and if you click on the link in that to go to the Beautiful Soup help page, you can see the documentation on it. It's beautifully written and it uses a lovely, simple example; along with this tutorial, I really think that's the only thing you'll need to get started, so I recommend it thoroughly. I need to firstly find out whether I've got Beautiful Soup installed, so I'll go to a terminal window and list out my modules with pip list. Beautiful Soup isn't there (I didn't think it would be), so let's add it. I'll do pip install... and here you need to be very careful: pip install beautifulsoup will install an old version. You need to do pip install beautifulsoup4 to get the latest version. So now if I run that, it will install Beautiful Soup and also something called soupsieve too, which I presume is a dependent module. That's great; I can now use it.

So what I can do is import that module. I've created a file called basic scraping.py, and within this what I can do is say from the module (which is actually called bs4, not beautifulsoup) I'm going to import the thing I need, which is BeautifulSoup. (I googled what BeautifulStoneSoup is, and as far as I can see it seems to be an older version, but it's very difficult to find any information on it.) So now I've got Beautiful Soup available to me, and I can use it to make sense of the HTML text I've got from my file. To do this, I'll create a variable called soup (that seems to be a very common name to call it), and I'll take my BeautifulSoup function and apply it.
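As a preview of where this is heading, here's a sketch of creating and using a soup object. The HTML below is a cut-down, made-up stand-in for the wyndham page — the h1, id and class values echo the examples from the CSS section, but the snippet itself isn't the real file:

```python
from bs4 import BeautifulSoup

# A made-up, cut-down stand-in for wyndham.htm
html_text = """
<html><body>
  <h1>John Wyndham</h1>
  <div id="table-box">
    <a class="link" href="chrysalids.htm">The Chrysalids</a>
    <a class="link" href="triffids.htm">The Day of the Triffids</a>
  </div>
</body></html>"""

# Apply the BeautifulSoup function to the HTML text to get a soup object
soup = BeautifulSoup(html_text, "html.parser")

print(soup.h1.text)                             # the page title text
print(len(soup.select("#table-box")))           # CSS id selector, as in the styles earlier
print([a.text for a in soup.select(".link")])   # CSS class selector
```

Notice that the select method takes the same #id and .class selectors we met in the CSS section, which is one reason understanding the styling matters for scraping.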

Transcribed from: https://www.youtube.com/watch?v=ZuHhwY8XA7M