I used the new version and auto-detect to capture data from this URL https://www.indeed.com.pe/Empleos-en-Lima


1 comment

  • Scarlett

    For future reference: for the scraper to paginate properly, the Pagination Loop must be positioned correctly. Octoparse executes the steps of a workflow top-down and inside-out, so any scraping action that has to run before moving to the second page must be placed inside the Pagination Loop.

    In this case, the task is built to load the webpage, loop through the list to extract data on the first page, turn the page, then repeat the same extraction on the second page, and so on. Because the scraping steps must run before the page is turned, the List Loop should be built inside the Pagination Loop.
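
    The nesting described above can be sketched in plain Python. This is only an illustration of the execution order, not Octoparse's internals: the PAGES data and the function name are made up for the example.

```python
# Hypothetical stand-in for a paginated site: each inner list is one results page.
PAGES = [
    ["job A", "job B"],
    ["job C", "job D"],
    ["job E"],
]


def scrape_with_pagination(pages):
    """Run the List Loop inside the Pagination Loop."""
    rows = []
    page_index = 0
    while True:  # Pagination Loop
        # List Loop: extract every item on the current page BEFORE turning the page.
        for item in pages[page_index]:
            rows.append((page_index + 1, item))
        # Stop when there is no next page (no "next" button to click).
        if page_index + 1 >= len(pages):
            break
        page_index += 1  # "click" the next-page button
    return rows


result = scrape_with_pagination(PAGES)
for page_number, item in result:
    print(page_number, item)
```

    If the inner loop were placed after the outer one instead, only the last page reached would be extracted, which is why the List Loop sits inside.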

    For pages with a Load More button, the workflow needs to be rearranged, as introduced in this tutorial: Dealing with Infinitive Scrolling/Load More. If the extraction loop has been built inside the pagination loop, drag it out manually, since the first loop should finish before the second one runs.
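
    A rough Python sketch of why the order changes for Load More pages. The LoadMorePage class is a toy model invented for this example (it assumes each click appends items to the same page); it is not how Octoparse is implemented.

```python
class LoadMorePage:
    """Hypothetical page that reveals one more batch per 'Load More' click."""

    def __init__(self, batches):
        self._batches = batches
        self._loaded = 1  # the first batch is visible on page load

    def has_load_more(self):
        return self._loaded < len(self._batches)

    def click_load_more(self):
        self._loaded += 1

    def visible_items(self):
        # Everything loaded so far stays on the page.
        return [item for batch in self._batches[: self._loaded] for item in batch]


def scrape_sequential(page):
    """First loop: click until nothing is left. Second loop: extract once."""
    while page.has_load_more():
        page.click_load_more()
    return list(page.visible_items())


def scrape_nested(page):
    """Extraction inside the click loop: earlier items are read again each time."""
    rows = list(page.visible_items())
    while page.has_load_more():
        page.click_load_more()
        rows.extend(page.visible_items())
    return rows


batches = [["a", "b"], ["c"], ["d"]]
sequential = scrape_sequential(LoadMorePage(batches))
nested = scrape_nested(LoadMorePage(batches))
print(len(sequential), len(nested))
```

    In this toy model the nested version collects the earlier items again after every click, while the sequential version reads each item once.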

    Here we need to revise the XPath of the pagination loop to:
    //span[@class="np"][contains(text(),'Siguiente')]
    so that it can loop through all the pages.
    I also revised the XPath of the loop item and the data fields; you can click each step in the workflow to check the details. I'm not sure what "keywords" refers to here, so I wasn't able to revise that field. If you need help with it, please let me know where this data field is located on the webpage. To learn more about XPath, this tutorial will be very helpful: What is XPath and how to use it in Octoparse
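
    To see what the revised pagination XPath selects, here is a small Python sketch. The HTML snippet is a made-up approximation, not the real markup of the page. Python's standard-library ElementTree supports only a subset of XPath and has no contains(), so the text condition is emulated with an ordinary substring check.

```python
import xml.etree.ElementTree as ET

# Hypothetical pagination markup, assumed for illustration only.
SNIPPET = """
<div class="pagination">
  <a href="/jobs?start=0"><span class="np">Anterior</span></a>
  <a href="/jobs?start=0"><span class="pn">1</span></a>
  <a href="/jobs?start=10"><span class="pn">2</span></a>
  <a href="/jobs?start=20"><span class="np">Siguiente</span></a>
</div>
"""

root = ET.fromstring(SNIPPET)

# Step 1: //span[@class="np"] matches both the previous and the next button.
candidates = root.findall(".//span[@class='np']")

# Step 2: emulate [contains(text(),'Siguiente')] to keep only the next button.
next_buttons = [span for span in candidates if "Siguiente" in (span.text or "")]

print([span.text for span in candidates])
print([span.text for span in next_buttons])
```

    In this assumed markup the class alone is not enough, because the previous-page button shares it; the text condition is what narrows the match to the next-page button.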
