How to scrape websites with Python and BeautifulSoup

There is more information on the Internet than any human can absorb in a lifetime. What you need is not access to that information, but a scalable way to collect, organize, and analyze it.

You need web scraping.

Web scraping automatically extracts data and presents it in a format you can easily make sense of. In this tutorial, we’ll focus on its applications in the financial market, but web scraping can be used in a wide variety of situations.

If you’re an avid investor, getting closing prices every day can be a pain, especially when the information you need is spread across several webpages. We’ll make data extraction easier by building a web scraper to retrieve stock indices automatically from the Internet.

Getting Started

We are going to use Python as our scraping language, together with a simple and powerful library, BeautifulSoup.

  • For Mac users, Python is pre-installed in OS X. Open up Terminal and type python --version . You should see your Python version is 2.7.x.
  • For Windows users, please install Python through the official website.

Next we need to get the BeautifulSoup library using pip , a package management tool for Python.

In the terminal, type pip install beautifulsoup4 .

Note: If the above command fails to execute, try adding sudo in front of it.

The Basics

Before we jump into the code, let’s understand the basics of HTML and some rules of scraping.

If you already understand HTML tags, feel free to skip this part.

The basic syntax of an HTML webpage is as follows. Every <tag> serves a block inside the webpage:

1. <!DOCTYPE html> : HTML documents must begin with a type declaration.

2. The HTML document is contained between <html> and </html> .

3. The meta and script declaration of the HTML document is between <head> and </head> .

4. The visible part of the HTML document is between <body> and </body> tags.

5. Title headings are defined with the <h1> through <h6> tags.

6. Paragraphs are defined with the <p> tag.

Other useful tags include <a> for hyperlinks, <table> for tables, <tr> for table rows, and <td> for table cells.
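
To see how these tags nest, here is a minimal sketch that feeds a tiny hand-written page to Python's standard-library html.parser and prints each opening tag indented by its depth. The sample page and the names TagOutline and outline_tags are assumptions made for illustration, and the code assumes Python 3.

```python
from html.parser import HTMLParser

# A tiny hand-written page, used only for illustration.
SAMPLE_PAGE = """<!DOCTYPE html>
<html>
  <head>
    <title>First Scraping</title>
  </head>
  <body>
    <h1>First Scraping</h1>
    <p>Hello World</p>
  </body>
</html>"""


class TagOutline(HTMLParser):
    """Record every opening tag together with its nesting depth.

    This simple version assumes every tag is explicitly closed, which is
    true for SAMPLE_PAGE but not for void tags such as <br> or <img>.
    """

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.outline = []

    def handle_starttag(self, tag, attrs):
        self.outline.append((self.depth, tag))
        self.depth += 1

    def handle_endtag(self, tag):
        self.depth -= 1


def outline_tags(html):
    """Return a list of (depth, tag_name) pairs in document order."""
    parser = TagOutline()
    parser.feed(html)
    return parser.outline


for depth, tag in outline_tags(SAMPLE_PAGE):
    print("  " * depth + "<" + tag + ">")
```

The printed outline shows <head> and <body> sitting one level inside <html>, with the heading and paragraph one level inside <body>.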

Also, HTML tags sometimes come with id or class attributes. The id attribute specifies a unique id for an HTML tag and the value must be unique within the HTML document. The class attribute is used to define equal styles for HTML tags with the same class. We can make use of these ids and classes to help us locate the data we want.
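
As a rough illustration of the difference, the sketch below looks up one tag by id and several tags by class with BeautifulSoup. The HTML snippet and its id and class values are made up for this example, and the code assumes Python 3 with the beautifulsoup4 package installed.

```python
from bs4 import BeautifulSoup

# Hand-written HTML for illustration; the id and class values are made up.
html = """
<div id="summary">
  <p class="figure">2,400.10</p>
  <p class="figure">21,000.55</p>
  <p class="note">Delayed by 15 minutes</p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# An id is unique within the document, so find() returns the one match.
summary = soup.find(id="summary")

# A class can be shared, so find_all() returns every tag that carries it.
figures = [p.text for p in soup.find_all("p", class_="figure")]

print(summary.name)  # div
print(figures)       # ['2,400.10', '21,000.55']
```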

For more information on HTML tags, id and class, please refer to W3Schools Tutorials.

Scraping Rules

  1. You should check a website’s Terms and Conditions before you scrape it. Be careful to read the statements about legal use of data. Usually, the data you scrape should not be used for commercial purposes.
  2. Do not request data from the website too aggressively with your program (also known as spamming), as this may break the website. Make sure your program behaves in a reasonable manner (i.e. acts like a human). One request for one webpage per second is good practice.
  3. The layout of a website may change from time to time, so make sure to revisit the webpage and rewrite your code as needed.
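
One way to respect the one-request-per-second guideline is to pause between requests. The sketch below is only an illustration under assumptions: fetch_politely and its parameters are hypothetical names, and the function that actually downloads a page is passed in rather than fixed.

```python
import time


def fetch_politely(urls, fetch, delay_seconds=1.0, sleep=time.sleep):
    """Call fetch(url) for each URL, pausing between consecutive requests.

    `fetch` is any function that takes a URL and returns the page; it is
    passed in so the pacing logic stays independent of the HTTP library.
    `sleep` defaults to time.sleep and can be swapped out when testing.
    """
    pages = []
    for position, url in enumerate(urls):
        if position > 0:
            sleep(delay_seconds)  # wait before every request except the first
        pages.append(fetch(url))
    return pages
```

With three URLs and the default delay, the loop sleeps twice, so the whole run takes a little over two seconds plus download time.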

Inspecting the Page

Let’s take one page from the Bloomberg Quote website as an example.

As someone following the stock market, we would like to get the index name (S&P 500) and its price from this page. First, right-click and open your browser’s inspector to inspect the webpage.

Try hovering your cursor on the price and you should be able to see a blue box surrounding it. If you click it, the related HTML will be selected in the browser console.

From the result, we can see that the price is inside a few levels of HTML tags, which is <div class="basic-quote"> → <div class="price-container up"> → <div class="price"> .

Similarly, if you hover and click the name “S&P 500 Index”, it is inside <div class="basic-quote"> and <h1 class="name"> .

Now we know the unique location of our data with the help of class tags.
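
Before relying on a class name, it can help to confirm that it really matches only one tag. The sketch below does that on a hand-written stand-in for the relevant part of the page: the class names are the ones seen in the inspector, while the surrounding markup and the values are assumptions. It assumes Python 3 with beautifulsoup4 installed.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for part of the quote page. The class names come
# from the inspector output; the markup around them and the values are made up.
html = """
<div class="basic-quote">
  <h1 class="name">S&amp;P 500 Index</h1>
  <div class="price-container up">
    <div class="price">2,400.10</div>
  </div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Count how many tags carry each class; 1 means the class pins down one tag.
price_matches = soup.find_all("div", class_="price")
name_matches = soup.find_all("h1", class_="name")
print(len(price_matches), len(name_matches))

# The same tag can also be reached by spelling out the nesting as a CSS path.
nested = soup.select_one("div.basic-quote > div.price-container > div.price")
print(nested.text)
```

Note that class_="price" does not match the price-container div, because BeautifulSoup compares against each whole class name rather than a prefix.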

Jump into the Code

Now that we know where our data is, we can start coding our web scraper. Open your text editor now!

First, we need to import all the libraries that we are going to use.

Next, declare a variable for the url of the page.

Then, make use of the Python urllib2 to get the HTML page of the url declared.

Finally, parse the page into BeautifulSoup format so we can use BeautifulSoup to work on it.
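
The four steps above can be sketched as follows. This is an illustration under assumptions, not the article's original listing: it targets Python 3, where the functionality of urllib2 lives in urllib.request, the value of quote_page is a placeholder, and fetch_page and make_soup are hypothetical helper names.

```python
from urllib.request import urlopen

from bs4 import BeautifulSoup

# Placeholder address: substitute the URL of the quote page you inspected.
quote_page = "https://example.com/quote-page"


def fetch_page(url):
    """Download the raw HTML of `url` (needs network access)."""
    with urlopen(url, timeout=10) as response:
        return response.read()


def make_soup(page):
    """Parse raw HTML into a BeautifulSoup object using the built-in parser."""
    return BeautifulSoup(page, "html.parser")


# With a real URL in quote_page, the two steps chain together like this:
# soup = make_soup(fetch_page(quote_page))
```

Keeping the download and the parsing in separate functions makes it possible to try the parsing on a saved copy of the page without hitting the website again.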

Now we have a variable, soup , containing the HTML of the page. Here’s where we can start coding the part that extracts the data.

Remember the unique layers of our data? BeautifulSoup can help us get into these layers and extract the content with find() . In this case, since the HTML class name is unique on this page, we can simply query <h1 class="name"> .

After we have the tag, we can get the data by getting its text .

Similarly, we can get the price too.
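
A minimal sketch of both lookups is shown below. So that it runs without a network connection, soup is built from a hand-written stand-in for the page. The tag and class names are the ones found with the inspector, the values are made up, and the code assumes Python 3 with beautifulsoup4 installed.

```python
from bs4 import BeautifulSoup

# Stand-in for the downloaded page; only the class names match the real one.
page = """
<div class="basic-quote">
  <h1 class="name">
    S&amp;P 500 Index
  </h1>
  <div class="price-container up">
    <div class="price"> 2,400.10 </div>
  </div>
</div>
"""

soup = BeautifulSoup(page, "html.parser")

# find() returns the first tag that matches the name and attributes given.
name_box = soup.find("h1", attrs={"class": "name"})
name = name_box.text.strip()  # strip() drops leading and trailing whitespace

# The price is located the same way, using its own tag and class.
price_box = soup.find("div", attrs={"class": "price"})
price = price_box.text.strip()

print(name)
print(price)
```

If the class is missing from the page, find() returns None, so a real scraper would want to check for that before reading .text .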

When you run the program, you should be able to see that it prints out the current price of the S&P 500 Index.

Export to Excel CSV

Now that we have the data, it is time to save it. The Excel Comma Separated Format is a nice choice. It can be opened in Excel so you can see the data and process it easily.

But first, we have to import the Python csv module and the datetime module to get the record date. Add these lines to your code in the import section.

At the bottom of your code, add the code for writing data to a csv file.
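
The saving step might look like the sketch below. It assumes Python 3, and append_quote with its parameters is a hypothetical helper name; the file name index.csv is the one used in this tutorial.

```python
import csv
from datetime import datetime


def append_quote(csv_path, name, price, when=None):
    """Append one (name, price, timestamp) row to the CSV file at csv_path."""
    if when is None:
        when = datetime.now()
    # Mode "a" appends, so rows from earlier runs are kept; newline="" lets
    # the csv module control line endings (the Python 3 idiom).
    with open(csv_path, "a", newline="") as csv_file:
        writer = csv.writer(csv_file)
        writer.writerow([name, price, when.isoformat(sep=" ")])


# In the scraper, name and price come from the extraction step:
# append_quote("index.csv", name, price)
```

Because the file is opened in append mode, each run adds one more line instead of overwriting the previous ones.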

Now if you run your program, you should be able to export an index.csv file, which you can then open with Excel, where you should see a line of data.

So if you run this program every day, you will be able to easily get the S&P 500 Index price without rummaging through the website!

Going Further (Advanced uses)

So scraping one index is not enough for you, right? We can try to extract multiple indices at the same time.

First, modify the quote_page into an array of URLs.

Then we change the data extraction code into a for loop, which will process the URLs one by one and store all the data into a variable data in tuples.
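
A sketch of these two changes is shown below. The URLs are placeholders, extract_quote and scrape_all are hypothetical helper names, and the download function is passed in as fetch so the loop can be tried without a network connection. It assumes Python 3 with beautifulsoup4 installed.

```python
from bs4 import BeautifulSoup

# Placeholder URLs; replace them with the quote pages you want to track.
quote_page = [
    "https://example.com/quote-page-1",
    "https://example.com/quote-page-2",
]


def extract_quote(page):
    """Return a (name, price) tuple from the HTML of one quote page."""
    soup = BeautifulSoup(page, "html.parser")
    name = soup.find("h1", attrs={"class": "name"}).text.strip()
    price = soup.find("div", attrs={"class": "price"}).text.strip()
    return (name, price)


def scrape_all(urls, fetch):
    """Fetch and parse each URL in turn, collecting one tuple per page."""
    data = []
    for url in urls:
        data.append(extract_quote(fetch(url)))
    return data
```

Here fetch can be any function that takes a URL and returns its HTML, for example one built on urllib.request.urlopen .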

Also, modify the saving section to save data row by row.
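
Saving row by row could be sketched like this, assuming data is a list of (name, price) tuples as described above. The helper name append_quotes is hypothetical, and the code assumes Python 3.

```python
import csv
from datetime import datetime


def append_quotes(csv_path, data, when=None):
    """Append one CSV row per (name, price) tuple in `data`."""
    if when is None:
        when = datetime.now()
    with open(csv_path, "a", newline="") as csv_file:
        writer = csv.writer(csv_file)
        # Every row written in this run shares the same timestamp.
        for name, price in data:
            writer.writerow([name, price, when.isoformat(sep=" ")])
```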

Rerun the program and you should be able to extract two indices at the same time!

Advanced Scraping Techniques

BeautifulSoup is simple and great for small-scale web scraping. But if you are interested in scraping data at a larger scale, you should consider using these other alternatives:

  1. Scrapy, a powerful Python scraping framework
  2. Try to integrate your code with some public APIs. The efficiency of data retrieval is much higher than scraping webpages. For example, take a look at Facebook Graph API, which can help you get hidden data which is not shown on Facebook webpages.
  3. Consider using a database backend like MySQL to store your data when it gets too large.
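
To give a feel for the database option, here is a small sketch that stores scraped rows in a table. It uses SQLite as a stand-in for MySQL, only because the sqlite3 module ships with Python and needs no server. The table name, column names and the store_quotes helper are assumptions made for illustration.

```python
import sqlite3


def store_quotes(connection, rows):
    """Insert (name, price, recorded_at) rows, creating the table if needed."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS quotes "
        "(name TEXT, price TEXT, recorded_at TEXT)"
    )
    # Placeholders (?) keep the scraped values out of the SQL string itself.
    connection.executemany("INSERT INTO quotes VALUES (?, ?, ?)", rows)
    connection.commit()


# An in-memory database keeps the example self-contained; pass a file name
# to sqlite3.connect() instead to keep the data between runs.
connection = sqlite3.connect(":memory:")
store_quotes(connection, [("S&P 500 Index", "2,400.10", "2017-06-01 16:00:00")])
print(connection.execute("SELECT COUNT(*) FROM quotes").fetchone()[0])
```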

Adopt the DRY Method

DRY stands for “Don’t Repeat Yourself”: try to automate your everyday tasks like this person. Some other fun projects to consider might be keeping track of your Facebook friends’ active time (with their consent of course), or grabbing a list of topics in a forum and trying out natural language processing (which is a hot topic for Artificial Intelligence right now)!

If you have any questions, please feel free to leave a comment below.

This article was originally published on Altitude Labs’ blog and was written by our software engineer, Leonard Mok. Altitude Labs is a software agency that specializes in personalized, mobile-first React apps.
