Hi, greetings. I wrote a crawler with two sets of pagination, but when I run it, it keeps extracting duplicates of the first page. I have attached the tasks here. What am I doing wrong?

Comments


  • Fergus

    1. Task PressRush_URL is used for collecting the URLs of the detail pages.
    We cannot scrape the URLs directly from the page because the website does not embed them in the HTML.

    The URLs need to be built by combining two fields, the page URL and the ID:
    https://www.pressrush.com/search/?q=personal%20information&author=5588297&sort=recent


    After the URLs are collected, you can export the data and then use Excel to combine the two fields (a scripted version of both Excel steps is sketched at the end of this point).

    Another modification in the PressRush_URL task is that I use the search result page URL directly:
    https://www.pressrush.com/search/?q=personal%20information
    Entering a search keyword and clicking the search button only works for a single keyword.
    You can also use Excel to generate the search result page URLs, since the keyword appears in the URL as the parameter "q=XXXX".
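
    If you would rather script the two Excel steps above (generating the search URLs and combining the exported fields), here is a minimal Python sketch. The CSV filename and the column names "page_url" and "id" are assumptions about your export, and the combined-URL pattern is inferred from the single example above, so adjust them to match your data:

        # Minimal sketch; filenames and column names are assumptions.
        import csv
        from urllib.parse import quote

        # Generate search result page URLs from a keyword list, instead of
        # typing each keyword and clicking the search button.
        keywords = ["personal information", "data breach"]  # hypothetical
        search_urls = [
            "https://www.pressrush.com/search/?q=" + quote(kw)
            for kw in keywords
        ]

        # Combine the two exported fields into the final URLs, following
        # the pattern of the example above (page URL + ID).
        combined_urls = []
        with open("pressrush_urls.csv", newline="", encoding="utf-8") as f:
            for row in csv.DictReader(f):
                combined_urls.append(
                    row["page_url"] + "&author=" + row["id"] + "&sort=recent"
                )

        print("\n".join(search_urls + combined_urls))

    Running it prints every generated URL, which you can then feed into the second task.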


    2. Task PressRush_Detail is used to collect the detail page info.
    You need to input the URLs scraped from the first task into the loop of this task (a sketch of the equivalent loop follows below).
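
    The tool runs this loop inside the task itself, but if you want to sanity-check the URL list outside the tool, here is a minimal Python sketch of the same loop. The filename "detail_urls.txt" is hypothetical, "requests" is a third-party library, and since the pages sit behind a login (see point 3), a real script would first have to authenticate the session:

        # Minimal sketch, assuming the URLs from the first task were saved
        # one per line in "detail_urls.txt" (hypothetical filename). The
        # pages require login (see point 3), so a real script would have
        # to authenticate this session before fetching.
        import requests

        with open("detail_urls.txt", encoding="utf-8") as f:
            urls = [line.strip() for line in f if line.strip()]

        session = requests.Session()  # one session so cookies persist
        for url in urls:
            resp = session.get(url, timeout=30)
            resp.raise_for_status()
            # ... extract the detail fields from resp.text here ...
            print(url, "->", len(resp.text), "bytes")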



    3. Both tasks have login steps. Please remember to re-enter the password in the tasks, as the saved password has been deleted.
