Question: What is source code?
Answer: Source code is the original text of a web page, written in HTML and related web languages such as CSS and JavaScript. It contains all the information on the page. You can view the source code of any web page by right-clicking the page and selecting "View Page Source" in a browser.
Why do you need to scrape from source code?
When the data you need is shown as non-text content, such as a star rating, you may not be able to extract it with "Extract text of the element", because the number is not visible on the page (only the stars are). You can still capture this information from the source code, that is, the HTML. In other situations, the data you need may come out mixed with unrelated text when it is extracted directly as text; in that case, you can try scraping it from the HTML instead.
Octoparse supports extracting data from source code directly. In this tutorial, we will show you how to extract from inner HTML and outer HTML.
1) Extract data from inner HTML
HTML is the standard markup language for creating web pages. When we extract the inner HTML of an element on the page, we will get the HTML markup contained within the element. So, for the information shown in the form of a picture or icon, we can capture its inner HTML first, then further extract the target data from the extracted code by using data reformat tools.
Take the star-rating of a restaurant on Yelp.com as an example.
- Click the "star-rating"
- Select "Extract inner HTML of the selected element"
Switch to Workflow Mode by toggling the Workflow switch. The extracted inner HTML has been added to "Data field":
<img class="offscreen" src="https://s3-media2.fl.yelpcdn.com/assets/srv0/yelp_design_web/9b34e39ccbeb/assets/img/stars/stars.png" alt="4.0 star rating" height="303" width="84">
Notice that the number value of the star rating (4.0) is included in the extracted code, even though it was not directly visible on the web page. Now that we have the code, we can pinpoint "4.0" in it by reformatting the data with a Regular Expression (learn more about reformatting HTML in Part 3).
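To see why the rating is recoverable from the inner HTML, here is a minimal Python sketch (outside Octoparse, standard library only) that reads the alt attribute of the extracted <img> tag. The names AltCollector and extract_alts are made up for this illustration and are not part of Octoparse.

```python
from html.parser import HTMLParser

# Inner HTML as extracted in the example above.
INNER_HTML = (
    '<img class="offscreen" '
    'src="https://s3-media2.fl.yelpcdn.com/assets/srv0/yelp_design_web/'
    '9b34e39ccbeb/assets/img/stars/stars.png" '
    'alt="4.0 star rating" height="303" width="84">'
)


class AltCollector(HTMLParser):
    """Hypothetical helper: collects the alt text of every <img> tag."""

    def __init__(self):
        super().__init__()
        self.alts = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs.
        if tag == "img":
            alt = dict(attrs).get("alt")
            if alt is not None:
                self.alts.append(alt)


def extract_alts(fragment):
    """Return the alt text of each <img> found in an HTML fragment."""
    parser = AltCollector()
    parser.feed(fragment)
    parser.close()
    return parser.alts


print(extract_alts(INNER_HTML))  # ['4.0 star rating']
```

The value is still "4.0 star rating" at this point; isolating the number itself is the job of the reformat step.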
2) Extract data from outer HTML
Outer HTML is an element property that includes the opening and the closing tags as well as the content. So, capturing the outer HTML can technically provide more information than inner HTML. If the information needed cannot be found in the inner HTML, it is still possible to locate it in the outer HTML.
The steps to extract outer HTML are similar to those for inner HTML:
- Click the data needed
- Select "Extract outer HTML of the selected element" from "Action Tips"
The outer HTML of the star rating is as follows:
<div style="background-color: rgb(229, 245, 233); outline: 1px solid rgb(0, 162, 59);" class="i-stars i-stars--large-4-rating-very-large" title="4.0 star rating">
<img class="offscreen" src="https://s3-media2.fl.yelpcdn.com/assets/srv0/yelp_design_web/9b34e39ccbeb/assets/img/stars/stars.png" alt="4.0 star rating" height="303" width="84">
</div>
As you can see, the inner HTML (the <img> tag) is part of the outer HTML. Once extracted, the target data (4.0) can be captured with the Regular Expression tool in a similar way (see Part 3).
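The relationship between the two can be sketched in a few lines of Python. This is only an illustration: split_outer_html is a hypothetical helper, and it assumes the opening tag contains no ">" inside its attribute values, which holds for the example above but not for arbitrary HTML.

```python
# Outer HTML of the star rating, as shown above.
OUTER_HTML = (
    '<div style="background-color: rgb(229, 245, 233); '
    'outline: 1px solid rgb(0, 162, 59);" '
    'class="i-stars i-stars--large-4-rating-very-large" '
    'title="4.0 star rating">\n'
    '<img class="offscreen" '
    'src="https://s3-media2.fl.yelpcdn.com/assets/srv0/yelp_design_web/'
    '9b34e39ccbeb/assets/img/stars/stars.png" '
    'alt="4.0 star rating" height="303" width="84">\n'
    '</div>'
)


def split_outer_html(outer):
    """Split one element's outer HTML into (opening tag, inner HTML, closing tag).

    Simplifying assumption: no '>' appears inside the opening tag's attributes.
    """
    open_end = outer.index(">") + 1    # just past the end of the opening tag
    close_start = outer.rindex("</")   # start of the closing tag
    opening = outer[:open_end]
    inner = outer[open_end:close_start].strip()
    closing = outer[close_start:]
    return opening, inner, closing


opening, inner, closing = split_outer_html(OUTER_HTML)
print(inner.startswith("<img"))               # True: the inner HTML is the <img> tag
print('title="4.0 star rating"' in opening)   # True: only the outer HTML carries this
```

Here the rating appears twice in the outer HTML: in the title attribute of the <div> and in the alt attribute of the <img>. Only the second is present in the inner HTML.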
1. How do I extract the full HTML of a web page?
Extracting the full HTML gives you all the information on a web page.
Now you've captured the full HTML of the page!
2. Why is there no "Extract inner HTML ..." or "Extract outer HTML..." available on "Action Tips"?
The options provided on "Action Tips" vary according to the data you select.
Try expanding the selection by clicking the expansion icon at the bottom of "Action Tips".
3) Reformat data with RegEx tools
Data reformat tools help you process and clean the extracted data. There are 8 built-in data reformat tools in Octoparse. For the purpose of this tutorial, we'll cover two HTML-related reformat tools.
To access the data reformat tools:
- Select the data field to reformat
- Click the icon to customize the field
- Click "Refine extracted data"
- Click "Add step"
1. HTML Transcoding
Once you have the inner/outer HTML code extracted, you can convert the HTML entities in it into plain characters using "HTML transcoding". For example, it transcodes "&gt;" into ">" and "&nbsp;" into a space.
- Select "HTML transcoding"
- Click "Evaluate" and confirm the output
- Click "OK" to save the settings
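For readers who want to see what this kind of transcoding does, here is a rough Python equivalent using the standard html module. It is a sketch of the general technique, not Octoparse's implementation; transcode is a name chosen for this example. Note that Python decodes "&nbsp;" to a non-breaking space (U+00A0), so the sketch replaces it with an ordinary space to match the behavior described above.

```python
import html


def transcode(text):
    """Decode HTML entities, then turn non-breaking spaces into ordinary spaces."""
    decoded = html.unescape(text)          # "&gt;" becomes ">", "&nbsp;" becomes U+00A0
    return decoded.replace("\u00a0", " ")  # normalize U+00A0 to a plain space


print(transcode("4.0&nbsp;&gt;&nbsp;3.5"))  # 4.0 > 3.5
```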
2. Match with Regular Expression
- Select "Match with Regular Expression"
- Click "Try RegEx Tool"
- Enter the match criteria: start with alt=" (including the quotation mark), end with star rating
- Click "Generate", then "Match". You will see that the number value of the star rating (4.0) is matched.
- Click "Apply"
- Click "OK" to save the settings
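The "start with / end with" criteria above correspond to a non-greedy regular expression match between two literal strings. The expression that Octoparse generates is not shown here, so the following Python sketch is an assumption about the general technique; match_between is a hypothetical helper.

```python
import re


def match_between(text, start, end):
    """Return the text between the first `start` and the next `end`, or None."""
    # re.escape treats both delimiters as literal text; (.*?) is non-greedy,
    # so the match stops at the first occurrence of `end`.
    pattern = re.escape(start) + r"(.*?)" + re.escape(end)
    found = re.search(pattern, text, flags=re.DOTALL)
    return found.group(1).strip() if found else None


fragment = '<img class="offscreen" alt="4.0 star rating" height="303" width="84">'
print(match_between(fragment, 'alt="', "star rating"))  # 4.0
```

The result is stripped because the captured text includes the space that sits between "4.0" and "star rating".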
If you are interested in learning about the other data reformat tools, see this tutorial.
Article in Spanish: Extraer datos del código fuente
You can also read web scraping articles on the official website.