In this tutorial, we will show you how to scrape company information from bbb.org. To follow along, you may want to use this URL:
Using Octoparse, we will scrape data such as the title, phone number, and address of each business in the search results list.
Here are the main steps in this tutorial [Download demo task file from here]:
- "Go To Web Page" - to open the targeted web page
- "Enter Text" - to enter a keyword to search for
- Create a pagination loop - to scrape multiple listing pages
- Create a "Loop Item" - to scrape all the items on each page
- Extract data - to scrape certain elements on each page
- Save and start extraction - to run the task and get data
1. "Go To Web Page" - to open the targeted web page
- Click "+ Task" to start a new task with Advanced Mode
- Paste the URL into the "Website" box
- Click "Save URL" to move on
2. "Enter Text" - to enter keywords to search for
- Click "Search box" on the left
- Click "Enter text" on the "Action Tips" panel
- Enter the keyword to search for
- Click "OK"
- Repeat the steps above for the "Search box" on the right
- Click the "Search" button
- Click "Click button" on the "Action Tips" panel
3. Create a pagination loop - to scrape multiple listing pages
- Scroll down and click the "Next" button on the web page
- Click "Loop click next page" on the "Action Tips" panel
- Uncheck "Retry when page remains unchanged"
- Set "AJAX Load" to 5s
- Click "Save" to move on
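To make the logic of this loop concrete, here is a minimal Python sketch of what a "click Next until there is no Next button" loop does. It is not Octoparse's implementation: the `PAGES` dictionary and the `scrape_all_pages` function are hypothetical stand-ins for the website and the task, and the `max_pages` guard is an assumption added so the loop can never run forever.

```python
# Hypothetical in-memory "site": each page holds some items and the key
# of the next page, or None when there is no "Next" button.
PAGES = {
    "page1": {"items": ["A", "B"], "next": "page2"},
    "page2": {"items": ["C", "D"], "next": "page3"},
    "page3": {"items": ["E"], "next": None},
}


def scrape_all_pages(pages, start, max_pages=100):
    """Collect items page by page, following "next" until it runs out."""
    collected = []
    current = start
    visited = 0
    while current is not None and visited < max_pages:
        page = pages[current]
        collected.extend(page["items"])  # scrape the current page
        current = page["next"]           # "click" the Next button
        visited += 1
    return collected


print(scrape_all_pages(PAGES, "page1"))  # ['A', 'B', 'C', 'D', 'E']
```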
The AJAX timeout can also serve as a page timeout for a click action. For example, when a page keeps loading long after the data you need has appeared, you can set an AJAX timeout so that Octoparse moves on to the next action once the set time is reached.
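The general idea of a bounded wait can be sketched in Python as follows. This is only an illustration of the concept, not how Octoparse works internally; `wait_for_content`, `is_ready`, and `poll_interval_s` are hypothetical names chosen for this example.

```python
import time


def wait_for_content(is_ready, timeout_s, poll_interval_s=0.05):
    """Poll is_ready() until it returns True or timeout_s has elapsed.

    Returns True if the content became ready in time, False otherwise.
    Either way the caller can move on to the next action.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if is_ready():
            return True
        time.sleep(poll_interval_s)
    return is_ready()  # one last check once the deadline has passed


# A page whose data is already there: we move on immediately.
print(wait_for_content(lambda: True, timeout_s=1.0))   # True

# A page that never finishes: we give up after about 0.2 seconds.
print(wait_for_content(lambda: False, timeout_s=0.2))  # False
```

A shorter timeout makes the task faster but risks moving on before the data has loaded, so the value is a trade-off you may need to tune per website.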
If you want to learn more about AJAX, here are some related links:
4. Create a "Loop Item" - to scrape all the items on each page
- Click any business title on the page
- Click "Select all" on the "Action Tips" panel
- Click "Extract link text"
If no data appears in the field, the auto-generated XPath of the "Loop item" and the data field is probably not matching the items, so we need to correct it.
- Click "Loop item"
- Change the XPath into "//div[@class='MuiPaper-root MuiPaper-elevation1 MuiCard-root styles__ResultItem-sc-7wrkzl-0 fbHYdT MuiPaper-rounded']"
- Click "OK" to save
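An XPath of the form `//div[@class='...']` selects only the `div` elements whose `class` attribute equals the given string exactly. The short Python sketch below shows that behavior using the standard library's `xml.etree.ElementTree`, which supports this kind of attribute predicate. The sample markup is hypothetical and heavily simplified; only the class string is taken from the step above.

```python
import xml.etree.ElementTree as ET

# Class string copied from the "Loop item" XPath in this step.
CARD_CLASS = (
    "MuiPaper-root MuiPaper-elevation1 MuiCard-root "
    "styles__ResultItem-sc-7wrkzl-0 fbHYdT MuiPaper-rounded"
)

# Hypothetical, simplified page: two result cards and one unrelated block.
SAMPLE_PAGE = f"""
<html><body>
  <div class="{CARD_CLASS}"><h3>Example Business A</h3></div>
  <div class="MuiPaper-root sidebar"><h3>Not a result</h3></div>
  <div class="{CARD_CLASS}"><h3>Example Business B</h3></div>
</body></html>
"""


def find_result_cards(page_xml):
    """Return every div whose class attribute equals CARD_CLASS exactly."""
    root = ET.fromstring(page_xml.strip())
    return root.findall(f".//div[@class='{CARD_CLASS}']")


titles = [card.findtext("h3") for card in find_result_cards(SAMPLE_PAGE)]
print(titles)  # ['Example Business A', 'Example Business B']
```

Because the match is exact, this kind of XPath stops working if the website changes any part of the class string, in which case the XPath has to be updated.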
Modifying the XPath in Octoparse is important whenever the auto-generated XPath cannot locate the items precisely. Here are some related tutorials you might need:
5. Extract data - to scrape certain elements on each page
- Click the data field you need
- Click "Extract text from selected elements" on the "Action Tips" panel to extract the text of that element
- Repeat the steps above until you have selected every field you need
- Rename the data fields as needed
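Conceptually, this step looks up each data field relative to the current loop item and stores it under the name you gave it. Here is a minimal Python sketch of that idea. The markup, the class names (`result`, `title`, `phone`, `address`), and the sample values are all made up for illustration and do not reflect the real structure of bbb.org.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: the second item has no phone number.
SAMPLE_PAGE = """
<div>
  <div class="result">
    <h3 class="title">Example Business A</h3>
    <p class="phone">(555) 010-0001</p>
    <p class="address">1 Example St, Springfield</p>
  </div>
  <div class="result">
    <h3 class="title">Example Business B</h3>
    <p class="address">2 Example Ave, Springfield</p>
  </div>
</div>
"""

# Field name -> XPath relative to one loop item.
FIELDS = {
    "title": ".//h3[@class='title']",
    "phone": ".//p[@class='phone']",
    "address": ".//p[@class='address']",
}


def extract_rows(page_xml):
    """Build one row per loop item; missing fields become empty strings."""
    root = ET.fromstring(page_xml.strip())
    rows = []
    for item in root.findall(".//div[@class='result']"):
        row = {}
        for name, path in FIELDS.items():
            text = item.findtext(path)
            row[name] = text.strip() if text else ""
        rows.append(row)
    return rows


rows = extract_rows(SAMPLE_PAGE)
for row in rows:
    print(row)
```

Looking up fields relative to each item, rather than across the whole page, keeps the rows aligned even when an item lacks one of the fields.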
6. Save and start extraction - to run the task and get data
- Click "Start Extraction" in the upper left corner
- Select "Local Extraction" to run the task on your computer, or select "Cloud Extraction" to run the task in the Cloud (for premium users only)
Here is the sample output.