Go to Web Page
When you have a target website to scrape, you first need a specific URL to start from. The Go to Web Page action in Octoparse simulates opening a specific URL. Where possible, use the direct URL of the page that contains the data you want to extract rather than a general website domain link.
Once you have a specific web page URL or a list of web page URLs, let's get started!
- Open Single Web Page
- Open Multiple URLs in the Loop
- Settings on "Go to Web Page"
- Web Page Not Loading
1. Open Single Web Page
If you have a single web page URL to open (for example, this search result page URL from eBay: https://www.ebay.com/sch/i.html?_from=R40&_trksid=p2380057.m570.l1313.TR12.TRC2.A0.H0.Xcharger.TRS0&_nkw=charger&_sacat=0), there are 3 places where you can enter it.
- Home Page of Octoparse
You can enter the URL directly into the search bar and then click "Start".
- Side Navigation Menu
You can find the "+ New" button on the side navigation menu. Click it and then select the 1st option - Advanced Mode.
The "New Task" setup page will then open. Enter the URL manually in the Website box and click "Save" to start.
A Go to Web Page action will be generated automatically in the workflow.
- Task Workflow
During task setup, if you want to add a new page URL to the workflow, move your mouse over the workflow and a "+" icon will appear between the steps.
Click the "+" icon to open a drop-down menu with several options. Click the load more button to find the "Open Page" option, and click it.
A new "Go to Web Page" step will be generated. Double-click the step to open its action settings and enter the URL of the new page. Click "OK" to save the settings.
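To illustrate what a direct URL carries compared with a bare domain link, here is a minimal Python sketch that splits the eBay example URL from this section into its parts. It runs outside Octoparse and uses only the standard library; the function name `describe_url` is hypothetical, and reading `_nkw` as the search keyword is an assumption based on its value in the example.

```python
from urllib.parse import urlsplit, parse_qs

# Example search result URL quoted earlier in this section.
EXAMPLE_URL = (
    "https://www.ebay.com/sch/i.html?_from=R40"
    "&_trksid=p2380057.m570.l1313.TR12.TRC2.A0.H0.Xcharger.TRS0"
    "&_nkw=charger&_sacat=0"
)


def describe_url(url):
    """Return the host, path, and query parameters of a URL."""
    parts = urlsplit(url)
    # parse_qs maps each parameter name to a list of values; keep the first.
    params = {name: values[0] for name, values in parse_qs(parts.query).items()}
    return {"host": parts.netloc, "path": parts.path, "params": params}


info = describe_url(EXAMPLE_URL)
print(info["host"])            # www.ebay.com
print(info["path"])            # /sch/i.html
print(info["params"]["_nkw"])  # charger
```

The path and query string are what point the browser at the search result page itself; the domain alone would only open the site's home page.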
2. Open Multiple URLs in the Loop
If you have multiple web page URLs that share a similar page structure, there is no need to build a separate task for each one. You can input all the URLs at once.
The places where you can enter multiple URLs are the same as those for a single web page.
- Home Page of Octoparse
If you have a list of URLs, you can copy them (e.g. from an Excel file) and paste them directly into the search bar. Valid URLs will be detected, and you can then click the "Start" button to get started.
A "Loop URLs" box which includes all URLs you input will be generated. You can double-click the "Loop URLs" box to check or modify URLs in the loop item.
- Side Navigation Menu and 4 URL Input Methods
You can find the "+ New" button on the side navigation menu. Click it and then select the 1st option - Advanced Mode.
The "New Task" setup page will then open. There are 4 ways to input URLs, and you can choose the one that fits your situation. Check the details in this tutorial: Batch URL input.
If you want to enter URLs manually, remember to enter one URL per line. You can also copy a list of URLs directly from an Excel sheet.
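If your list comes from a messy source, it can help to tidy it into the one-URL-per-line form before pasting. The following is a minimal Python sketch that runs outside Octoparse; the function name `clean_url_list`, the sample URLs, and the filtering rules (http/https only, duplicates dropped) are assumptions for illustration, not Octoparse's own validation.

```python
from urllib.parse import urlsplit


def clean_url_list(raw_text):
    """Return unique http(s) URLs from raw_text, one per line, in original order."""
    seen = set()
    urls = []
    for line in raw_text.splitlines():
        candidate = line.strip()
        if not candidate:
            continue  # skip blank lines
        try:
            parts = urlsplit(candidate)
        except ValueError:
            continue  # skip lines that cannot be parsed as a URL at all
        if parts.scheme not in ("http", "https") or not parts.netloc:
            continue  # skip anything that is not an absolute web URL
        if candidate in seen:
            continue  # skip duplicates
        seen.add(candidate)
        urls.append(candidate)
    return urls


# Hypothetical pasted text with stray spaces, junk, and a duplicate.
pasted = """
https://www.ebay.com/sch/i.html?_nkw=charger
  https://www.ebay.com/sch/i.html?_nkw=cable
not a url
https://www.ebay.com/sch/i.html?_nkw=charger
"""
print("\n".join(clean_url_list(pasted)))
```

The printed result has one URL per line and can be pasted into the URL input box as is.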
- Task Workflow
If you want to add a list of URLs in the workflow, you can click the "+" icon to add the necessary steps.
First, add a "Loop" item from the drop-down menu. A Loop Item is then added to the workflow. Double-click it to open its settings.
Under the "Loop Item", select List of URLs as the loop mode and input the URLs.
After you save the settings, a "Loop Item" containing a "Go to Web Page" step will be generated.
3. Settings on "Go to Web Page"
When you double-click "Go to Web Page" in the workflow, you can adjust the settings for this step based on the website's behavior and your Internet connection.
- General settings and "Before page render"
You can adjust the "Timeout" if the web page takes a long time to load. You can also change the web page URL in the URL bar.
"Load URLs in the loop" should be ticked only when you need to open URLs from the loop list.
Under the section "Before page render", you can set a wait time and a cookie for this step. "Wait before action" controls the interval before each URL is opened. The cookie setting is frequently used when the web page requires a login to access.
- "After loading page"
The most frequently used setting here is scroll down. Enable it if the page loads its content only as you scroll down.
First, choose the scroll way: "to the bottom of the page" or "for one screen".
Then, set "Repeats" (how many times to scroll down) and "Wait time" (the interval between scrolls, which gives new data time to load after each scroll).
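Because these waits add up across a list of URLs, it can be useful to estimate their total before running a task. The sketch below is plain Python arithmetic, not an Octoparse feature; the function name is hypothetical, and it assumes the waits simply run one after another (one "Wait before action" per URL, then one "Wait time" after each scroll). It ignores actual page load and extraction time, so treat the result as a lower bound.

```python
def estimated_wait_seconds(url_count, wait_before_action, repeats, scroll_wait):
    """Lower-bound estimate of the time spent purely in configured waits."""
    if min(url_count, repeats) < 0 or min(wait_before_action, scroll_wait) < 0:
        raise ValueError("all arguments must be non-negative")
    # Per page: one wait before opening, then one wait after each scroll.
    per_page = wait_before_action + repeats * scroll_wait
    return url_count * per_page


# 50 URLs, 2 s wait before each, 5 scrolls with 1.5 s between them.
print(estimated_wait_seconds(50, 2, 5, 1.5))  # 475.0
```

In this example the waits alone account for just under 8 minutes, which is worth knowing before raising "Repeats" or "Wait time" on a long URL list.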
- "Retry"
You can use the "Retry" settings to reload the page if the current page does or doesn't contain the text or element you specify.
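The retry condition can be pictured as a small loop. The following Python sketch is a conceptual model only, not Octoparse's implementation; `load_with_retry`, `fetch`, `marker`, `retry_when_present`, and `max_retries` are all hypothetical names, and the sample page texts are made up.

```python
def load_with_retry(fetch, marker, retry_when_present=False, max_retries=3):
    """Call fetch() until the retry condition no longer holds or retries run out.

    Returns (page_text, attempts). With retry_when_present=False the page is
    reloaded while `marker` is missing; with True, while `marker` is present.
    """
    attempts = 0
    while True:
        page_text = fetch()
        attempts += 1
        found = marker in page_text
        needs_retry = found if retry_when_present else not found
        # Stop when the page looks right, or after the initial load
        # plus max_retries reloads.
        if not needs_retry or attempts > max_retries:
            return page_text, attempts


# Hypothetical page that only shows its results on the third load.
responses = iter(["", "Loading...", "<div>Results: 12 items</div>"])
text, attempts = load_with_retry(lambda: next(responses), marker="Results")
print(attempts)  # 3
```

Setting `retry_when_present=True` models the opposite case, for example reloading while an error message is still shown on the page.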
4. Web Page Not Loading
Sometimes a web page doesn't load properly in Octoparse's built-in browser, and you see only a blank page.
In this case, click the settings icon in the top-left corner to modify the task settings.
Go to "Browser Ver." under "Run Settings" to switch to another browser.
There are many options under "Browser Ver.". Choose one and click "Save" to go back to the previous page.
Then, click the "Reload webpage" icon to refresh the page and check whether it now loads properly.
If you have questions, you are welcome to submit a request here. Our support team will get back to you later.
Article in Spanish: Ir a la página web
You can also read web scraping articles on the official website
Author: Vanny
Editor: Yina