You are browsing a tutorial guide for the latest Octoparse version. If you are running an older version of Octoparse, we strongly recommend upgrading, because the latest version is faster, easier to use, and more robust. Download and upgrade here if you haven't already done so!

If you scrape a list of URLs, you may want to capture the original input URL as a field alongside your target data, so you can match the results against your list and see whether any URLs haven't been scraped.

However, a URL may change after the page opens (e.g., some URL parameters might change) or be redirected to a completely different URL. The Original input URL feature, added in Octoparse 8.5, solves this problem. Let's see how to use it.

What's the original URL Octoparse adds as a field?

With this feature, Octoparse adds the URL you originally entered to start the task.

  • Single URL. If you start the task with a single URL, you will get the URL that you entered in the Go to Web Page action.

mceclip5.png
  • URL list in a loop item. If you are extracting data from a URL list, the Original Input URL field gives you, for each row, the URL you entered in Loop URLs that the row was scraped from.

mceclip2.png

How to add the original URL?

Let's take this link as an example: https://www.yachtall.com/en/fwd/go-to-builder?id=75&js=1

Open this link in your browser and you will notice that the URL is redirected to another one: https://en.azimutyachts.com/

mceclip4.png

STEP 1. Input your URL(s) in Octoparse to start a task

start_a_task.jpg

STEP 2. Go to the Data Preview section and select Original input URL from Add Custom Field

mceclip0.png

You will see a new field named Original_URL. Its value is the input URL https://www.yachtall.com/en/fwd/go-to-builder?id=75&js=1, not the redirected URL https://en.azimutyachts.com/

mceclip2.png
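Once the Original_URL field is in your data, checking for URLs that were never scraped can be done outside Octoparse. The following is a minimal sketch, assuming the task results were exported as a CSV with a column named Original_URL; the function name `find_unscraped`, the sample row, and the second input URL are hypothetical and only serve as illustration.

```python
import csv
import io


def find_unscraped(input_urls, export_file, column="Original_URL"):
    """Return the input URLs that never appear in the export's URL column."""
    reader = csv.DictReader(export_file)
    # Collect every non-empty value of the Original_URL column.
    scraped = {row[column].strip() for row in reader if row.get(column)}
    # Keep the input order so the result can be reused as a new URL list.
    return [url for url in input_urls if url.strip() not in scraped]


# Stand-in for an exported CSV (made-up sample data). A real export would be
# opened with open("your_export.csv", newline="", encoding="utf-8").
sample_export = io.StringIO(
    "Title,Original_URL\n"
    "Sample title,https://www.yachtall.com/en/fwd/go-to-builder?id=75&js=1\n"
)

input_urls = [
    "https://www.yachtall.com/en/fwd/go-to-builder?id=75&js=1",
    "https://www.yachtall.com/en/fwd/go-to-builder?id=76&js=1",  # hypothetical
]

missing = find_unscraped(input_urls, sample_export)
print(missing)
```

Because the comparison uses the original input URL rather than the page URL, a redirected page still counts as scraped.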

Tip: You can also scrape the URL after the redirect, that is, get https://en.azimutyachts.com/ instead of https://www.yachtall.com/en/fwd/go-to-builder?id=75&js=1. See the tutorial Scrape page-level data (metadata, page URL, page title, source code).
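If you capture both the original input URL and the page URL after redirecting, you can list which inputs were redirected. This is a rough sketch, assuming a CSV export with both fields; the column name Page_URL and the function name `redirected_rows` are assumptions made for this example, not names confirmed by Octoparse.

```python
import csv
import io


def redirected_rows(export_file, original_col="Original_URL", final_col="Page_URL"):
    """Yield (original, final) URL pairs for rows where the two URLs differ."""
    for row in csv.DictReader(export_file):
        original = (row.get(original_col) or "").strip()
        final = (row.get(final_col) or "").strip()
        # Skip rows missing either value; report only real differences.
        if original and final and original != final:
            yield original, final


# Made-up export content using the example URLs from this tutorial.
sample_export = io.StringIO(
    "Original_URL,Page_URL\n"
    "https://www.yachtall.com/en/fwd/go-to-builder?id=75&js=1,https://en.azimutyachts.com/\n"
    "https://en.azimutyachts.com/,https://en.azimutyachts.com/\n"
)

redirects = list(redirected_rows(sample_export))
for original, final in redirects:
    print(f"{original} -> {final}")
```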
