In the previous lesson, we learned how to capture simple text from a webpage. Now let's move on to a more advanced scraping technique: capturing a list of items.
Scraping a list is quick and easy with Octoparse. Octoparse can auto-detect the items in a list, which makes scraping a list much more straightforward. Let's see how it is done with an example.
We'll use the Octoparse blog page, https://www.octoparse.com/blog, for this example.
1. Build a list
First, create a new task for the target webpage: https://www.octoparse.com/blog. The webpage will be loaded in Octoparse. Let's inspect the page structure first.
There is a list of post blocks on this particular webpage, with each post block containing information like post title, posted date, abstract, and a tag. Suppose we want to capture the title, posted date, tag as well as the abstract from each post block.
To help Octoparse identify the list of post blocks, select any two items (post blocks) from the list.
- Click the first post block (make sure you select the whole block).
- Click the second post block (any block further down the list will do).
Scroll down the page and notice how Octoparse has selected all the post blocks as well as all the sub-elements (title, posted date, tag, abstract).
1. For the list to be built correctly, the two selections must be identical in structure, i.e. the highlighted content should have the same "look".
You can always expand the selection area by clicking the tags (e.g. DIV, A, LI) at the bottom of "Action Tips".
2. If some items in the list are still missing after the first two clicks, keep clicking more items from the same list until all the desired items are selected and highlighted in green.
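To see why two structurally identical selections are enough to describe a whole list, here is a minimal Python sketch. It is not Octoparse's actual algorithm, which this lesson does not describe; the function name `generalize` and the example element paths are hypothetical. It takes the paths of two clicked elements and drops the position index wherever they differ, leaving a pattern that matches every item at that level.

```python
import re


def generalize(path_a: str, path_b: str) -> str:
    """Build a path pattern that matches both selections.

    Illustrative assumption only: each selection is described by a simple
    XPath-like path such as /html/body/div[2]/ul/li[1].
    """
    steps_a = path_a.strip("/").split("/")
    steps_b = path_b.strip("/").split("/")
    if len(steps_a) != len(steps_b):
        # The two selections do not have the same "look".
        raise ValueError("selections differ in structure")

    pattern = []
    for a, b in zip(steps_a, steps_b):
        tag_a = re.sub(r"\[\d+\]$", "", a)
        tag_b = re.sub(r"\[\d+\]$", "", b)
        if tag_a != tag_b:
            raise ValueError("selections differ in structure")
        # Keep identical steps as they are; drop the index where they differ.
        pattern.append(a if a == b else tag_a)
    return "/" + "/".join(pattern)


if __name__ == "__main__":
    print(generalize("/html/body/div[2]/ul/li[1]", "/html/body/div[2]/ul/li[3]"))
```

With these two hypothetical paths the result is `/html/body/div[2]/ul/li`, a pattern that covers every list item. If one click lands on a whole post block and the other on a link inside a block, the paths have different shapes and no common pattern can be built, which is why both selections need to cover the same kind of element.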
2. Select the sub-elements
After we have successfully built the list of all post blocks, the next step is to extract the specific sub-elements from each individual post block. There are two ways to approach this.
2.1 Capture all the sub-elements found
Octoparse automatically selects all the sub-elements found within each post block. To have all the sub-elements extracted, simply follow the instructions provided on Action Tips and confirm the selection by clicking on "Select all sub-elements".
Then, go on to select all the elements in the list. Click "Select all".
Capture all the selected data by clicking on "Extract data in the loop".
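Conceptually, "Extract data in the loop" visits each post block in turn and pulls one field per sub-element. The sketch below imitates that idea in plain Python on a small, made-up, well-formed snippet. The markup, class names, and the `extract_all` helper are assumptions for illustration; they are not the real blog page or Octoparse's internals.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample markup standing in for the list of post blocks.
SAMPLE = """
<div class="list">
  <div class="post"><h2>First post</h2><span class="date">2020-01-01</span><span class="tag">News</span><p>Abstract one.</p></div>
  <div class="post"><h2>Second post</h2><span class="date">2020-02-01</span><span class="tag">Tips</span><p>Abstract two.</p></div>
</div>
"""


def extract_all(markup: str) -> list[list[str]]:
    """Return one row per post block, with one field per sub-element."""
    root = ET.fromstring(markup.strip())
    rows = []
    # Loop over every post block in the list.
    for block in root.findall(".//div[@class='post']"):
        # Take the text of each direct child, in document order.
        rows.append([(child.text or "").strip() for child in block])
    return rows


if __name__ == "__main__":
    for row in extract_all(SAMPLE):
        print(row)
```

Each row holds every sub-element found in a block, so unwanted fields have to be removed afterwards.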
2.2 Capture only the sub-elements you want
Instead of capturing all the sub-elements found, you can handpick only the sub-elements you need.
Once you finish building the list, do not click "Select all sub-elements" this time. Instead, capture the selected text by clicking "Extract text of the selected element". This way, Octoparse will extract the whole post block at once.
Switch to the workflow mode and note that a loop item has been added to the workflow automatically.
At the same time, all the text data within the post block is extracted into a single field.
Next, go on to select the specific sub-elements you need from the selected post block (one that's highlighted in red). Follow the instructions provided on "Action Tips" and select "Extract text of the selected element" when done.
Octoparse will automatically fetch the same sub-elements from every post block in the list.
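The handpicked approach can be pictured as a small mapping from field names to locations inside a block, applied to every block in the loop. The following is a rough sketch under the same assumptions as before: the sample markup, the field names, and the `extract_selected` helper are hypothetical and only illustrate the idea, not how Octoparse implements it.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample markup standing in for the list of post blocks.
SAMPLE = """
<div class="list">
  <div class="post"><h2>First post</h2><span class="date">2020-01-01</span><span class="tag">News</span><p>Abstract one.</p></div>
  <div class="post"><h2>Second post</h2><span class="date">2020-02-01</span><span class="tag">Tips</span><p>Abstract two.</p></div>
</div>
"""


def extract_selected(markup: str, fields: dict[str, str]) -> list[dict[str, str]]:
    """Return one dict per post block containing only the chosen fields.

    `fields` maps an output column name to a path relative to the block.
    """
    root = ET.fromstring(markup.strip())
    rows = []
    for block in root.findall(".//div[@class='post']"):
        row = {}
        for name, path in fields.items():
            node = block.find(path)
            # A sub-element missing from a block yields an empty string.
            row[name] = (node.text or "").strip() if node is not None else ""
        rows.append(row)
    return rows


if __name__ == "__main__":
    wanted = {"title": "h2", "date": "span[@class='date']"}
    for row in extract_selected(SAMPLE, wanted):
        print(row)
```

Picking the sub-elements once on a single block and reusing the same relative paths on every other block mirrors how the selection made on the highlighted post block is applied to the rest of the list.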
3. Fetch the data
That's it! We were able to get all the data we need from the webpage within a few clicks. Run the task to get the data now.