Consistency between extracted data and my list of URLs?
Hi there, this is a frequent issue I have.
Most of the time I have a list of URLs in an Excel sheet that I want to extract data from.
I import this list into Octoparse and start a task to extract from this list of URLs, and the data is parsed OK.
The problem comes when I extract the data. I would expect to be able to copy and paste the extracted data next to the column containing the original list of URLs, with each row holding the data for the correct URL.
Alas, what I end up with is data that is not sorted in the same order as my list of URLs, so I have to manually rearrange and sort it to match them up.
Is there a way I can extract data in sequential order following the order of my list of URLs?
Thank you
-
Hi konasim,
Thank you for reaching out.
You can set up the task to extract the current page URL, so that each row of data carries the URL it came from:
Scrape page-level data (meta data, page URL, page title, source code)
As for getting the extracted data in the same order as your list: with local extraction, the data will be extracted in the same order as the URL list. With cloud extraction, the task is broken down into multiple subtasks that can run on multiple servers simultaneously, so the order of the URLs will most likely not be preserved. If we slow down the scraping process in the cloud, there is still a chance of getting the data in the same order, but we would need to do several test runs to check that beforehand.
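If the page URL is extracted with each row, the export can also be put back into the order of the original list after the run, whatever order the cloud returned it in. Below is a minimal sketch of that idea in Python, not an Octoparse feature. The function name `reorder_rows` and the column name `Page_URL` are assumptions for illustration; use whatever field name the page URL has in your own export. It also assumes one extracted row per URL.

```python
def reorder_rows(urls, rows, url_key="Page_URL"):
    """Return the rows arranged to match the order of `urls`.

    `urls` is the original list of URLs (e.g. the Excel column).
    `rows` is the exported data as a list of dicts, one per extracted row.
    `url_key` is the (assumed) name of the field holding the page URL.
    A URL with no matching row yields None, so gaps stay visible.
    """
    by_url = {}
    for row in rows:
        # Keep the first row seen for each URL; strip stray whitespace.
        by_url.setdefault(row[url_key].strip(), row)
    return [by_url.get(url.strip()) for url in urls]


# Hypothetical example data standing in for the URL list and the export.
urls = [
    "https://example.com/a",
    "https://example.com/b",
    "https://example.com/c",
]
rows = [
    {"Page_URL": "https://example.com/c", "Title": "C"},
    {"Page_URL": "https://example.com/a", "Title": "A"},
]

for url, row in zip(urls, reorder_rows(urls, rows)):
    print(url, row["Title"] if row else "(no data)")
```

The result has one entry per URL in the original list, so it can be pasted side by side with the URL column. Matching is by exact URL text, so a page that redirects to a different address would show up as having no data.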
Best regards,