In this tutorial, we are going to show you how to scrape information from Yahoo Finance.
To follow along, you may want to use this URL in the tutorial:
We will scrape data such as the Symbol and Name from the cryptocurrency chart with Octoparse.
This tutorial will also cover:
- Dealing with AJAX for pagination
- Paginating correctly by modifying the loop mode and XPath in Octoparse
Here are the main steps in this tutorial:
- "Go To Web Page" - to open the targeted web page
- Create a pagination loop - to scrape all the results from multiple pages
- Create a "Loop Item" - to loop through and extract the elements in each row
- Extract data - to select the data for extraction
- Start extraction - to run the task and get data
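Taken together, these steps amount to a nested loop: an outer loop over pages and an inner loop over table rows. The following is a minimal Python sketch of that control flow only. The `PAGES` data and the `scrape_all` function are hypothetical stand-ins invented for illustration; they are not part of Octoparse, and the values are placeholders rather than real quotes.

```python
# Hypothetical stand-in for the paginated table: each page is a list of
# (symbol, name) rows. Placeholder values, not real market data.
PAGES = [
    [("AAA-USD", "Coin A"), ("BBB-USD", "Coin B")],
    [("CCC-USD", "Coin C")],
]


def scrape_all(pages):
    """Collect every row from every page, mirroring the task's nested loops."""
    results = []
    if not pages:
        return results
    page_index = 0                                  # "Go To Web Page": start on page 1
    while True:
        for symbol, name in pages[page_index]:      # "Loop Item": one pass per row
            results.append({"Symbol": symbol, "Name": name})  # "Extract data"
        if page_index + 1 >= len(pages):            # no enabled "Next" button left
            break
        page_index += 1                             # pagination loop: click "Next"
    return results


rows = scrape_all(PAGES)
print(len(rows))
```

The outer loop ends when there is no further page, which corresponds to the "Next" button no longer being clickable.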
1. "Go To Web Page" - to open the targeted web page
- Click "+ Task" to start a new task with Advanced Mode
Advanced Mode is a highly flexible and powerful web scraping mode. For people who want to scrape from websites with complex structures, like Walmart.com, we strongly recommend Advanced Mode to start your data extraction project.
- Paste the URL into the "Extraction URL" box and click "Save URL" to move on
2. Create a pagination loop - to scrape all the results from multiple pages
- Click the "Next" button to create a pagination loop
- Click "Loop click next page" on the "Action Tips" panel
- Uncheck "Retry when page remains unchanged"
- Set "AJAX Timeout" to 10s and "Wait before Execution" to 3s
Octoparse will skip the rest of the wait as soon as it finds the target element, regardless of the execution waiting time. This can speed up the entire extraction process.
- Click "OK" to save the step
- Copy the XPath expression "//button[not(@disabled)]//*[@data-reactid][text()='Next']" and paste it into the box for "Wait until the element is found" and "Single element", and set the execution waiting time according to your local network conditions (30 seconds as an example)
We need to modify the XPath in order to locate the pagination button correctly: the "not(@disabled)" condition matches the "Next" button only while it is still clickable. We also paste this XPath expression so that Octoparse will not execute the pagination step until this element has been found on the web page.
- Click "OK" to save
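To see what the pagination XPath selects, here is a small Python sketch that applies the same three conditions by hand. It uses only the standard library, whose XPath subset does not support `not()` or `text()`, so each predicate is written out as plain Python. The markup in `MIDDLE_PAGE` and `LAST_PAGE` is a hypothetical simplification rather than the real page source, and `find_enabled_next` is an illustrative helper, not an Octoparse function.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified pagination markup (assumed for illustration).
MIDDLE_PAGE = """
<div>
  <button><span data-reactid="9">Prev</span></button>
  <button><span data-reactid="12">Next</span></button>
</div>
"""
LAST_PAGE = """
<div>
  <button><span data-reactid="9">Prev</span></button>
  <button disabled="disabled"><span data-reactid="12">Next</span></button>
</div>
"""


def find_enabled_next(markup):
    """Hand-written approximation of
    //button[not(@disabled)]//*[@data-reactid][text()='Next']."""
    root = ET.fromstring(markup.strip())
    for button in root.iter("button"):
        if "disabled" in button.attrib:                       # not(@disabled)
            continue
        for node in button.iterfind(".//*[@data-reactid]"):   # //*[@data-reactid]
            if node.text == "Next":                           # text()='Next'
                return node
    return None


print(find_enabled_next(MIDDLE_PAGE) is not None)
print(find_enabled_next(LAST_PAGE) is None)
```

On the last page the button carries a `disabled` attribute, so nothing matches and the pagination loop has no element left to click.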
3. Create a "Loop Item" - to loop through and extract the elements in each row
- Click "Go To Web Page" to go back to the first page
When extracting data across multiple pages, you should always start building your task on the first page.
- Click the name of the first item in the table
- Click the "Expand" icon on the "Action Tips" panel
Octoparse will automatically select the item. The selected item will be highlighted in green while other items with the same structure will be highlighted in red.
The data is presented as a table, so we want to extract it by rows rather than by columns. Expand the selection by clicking the "Expand" icon.
- Click "Select All sub-element" and then click "Select All" to create a loop list
Octoparse will detect all the sub-elements with similar structures.
Since we just want to get Name and Symbol, we need to remove all unwanted fields.
- Copy the automatically generated XPath expression "//TR[contains(@class,'simpTblRow Bgc($extraLightBlue):h BdB Bdbc($finLightGrayAlt) Bdbc($tableBorderBlue):h H(32px)')]" and paste it into the box for "Wait until the element is found", and set the execution waiting time according to your local network conditions (30 seconds as an example).
We copy and paste this XPath expression so that Octoparse will not execute the extraction step until these elements have been found on the web page.
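The "wait until the element is found" behavior can be pictured as a polling loop with a deadline: the step proceeds as soon as the element appears and gives up only once the full waiting time has elapsed. The Python sketch below shows that general pattern under stated assumptions. `wait_until_found` and `rows_loaded` are hypothetical names, and this is not a description of how Octoparse implements the wait internally.

```python
import time


def wait_until_found(is_found, timeout_s, poll_interval_s=0.1):
    """Poll is_found() until it returns True or timeout_s seconds elapse."""
    deadline = time.monotonic() + timeout_s
    while True:
        if is_found():
            return True           # element present: continue immediately
        if time.monotonic() >= deadline:
            return False          # full waiting time used up without a match
        time.sleep(poll_interval_s)


# Simulated page: the table rows "appear" on the third check.
checks = {"count": 0}


def rows_loaded():
    checks["count"] += 1
    return checks["count"] >= 3


found = wait_until_found(rows_loaded, timeout_s=30, poll_interval_s=0.01)
print(found, checks["count"])
```

Although the timeout here is 30 seconds, the call returns after the third check, which is why a generous waiting time costs little when the page loads quickly.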
4. Extract data - to select the data for extraction
After you click "Extract data in the loop", Octoparse will extract all selected elements in the same row.
- Edit each field name by selecting one from the predefined list of names or by entering your own
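For readers who want to see row-wise extraction outside the Octoparse interface, the sketch below performs the equivalent steps in Python: it keeps only rows whose class contains a marker, then reads the first two cells of each row as named fields. The `TABLE` markup is a hypothetical simplification with placeholder values. The shortened marker `simpTblRow` and the assumption that Symbol and Name are the first two columns are also illustrative choices, not taken from the real page.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified table markup with placeholder values.
TABLE = """
<table>
  <tr class="header"><td>Symbol</td><td>Name</td><td>Price</td></tr>
  <tr class="simpTblRow H(32px)"><td>AAA-USD</td><td>Coin A USD</td><td>1.00</td></tr>
  <tr class="simpTblRow H(32px)"><td>BBB-USD</td><td>Coin B USD</td><td>2.00</td></tr>
</table>
"""


def extract_rows(markup, row_marker="simpTblRow"):
    """Return one {"Symbol", "Name"} record per matching table row."""
    root = ET.fromstring(markup.strip())
    records = []
    for row in root.iter("tr"):
        # Same idea as contains(@class, '...') in the loop XPath.
        if row_marker not in row.get("class", ""):
            continue
        cells = ["".join(td.itertext()).strip() for td in row.findall("td")]
        # Unwanted fields (here, the price column) are simply not copied.
        records.append({"Symbol": cells[0], "Name": cells[1]})
    return records


records = extract_rows(TABLE)
for record in records:
    print(record["Symbol"], record["Name"])
```

Each record corresponds to one pass of the "Loop Item", and the dictionary keys play the role of the field names edited in this step.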
5. Save and start extraction - to run the task and get data
- Click “Start Extraction” on the upper left side
- Select “Local Extraction” to run the task on your computer, or select “Cloud Extraction” to run the task in the Cloud (for premium users only)
Here is the sample output: