In this video, I am going to show you how to scrape video information from YouTube, including the titles, views, numbers of likes and dislikes, and the YouTuber’s name. Before we get started, open https://www.youtube.com/ in a web browser and type the keyword you need into the search bar. In this case, we are looking for videos about “web scraping”. After the page finishes loading, copy this URL (show the URL). This is the URL we will use in the demonstration.
Note: To learn more about AJAX, click https://youtu.be/MuOC1yCKai0
To learn more about XPath click https://youtu.be/kZwD6szlvas
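If you would rather build the search URL than copy it from the browser, here is a minimal sketch. It assumes YouTube search pages use the /results path with a search_query parameter, which is what the address bar shows after a search. The helper name build_search_url is made up for this example.

```python
from urllib.parse import urlencode

# Assumption: search result pages live under /results and take the
# keyword in a "search_query" parameter, as seen in the address bar.
BASE_URL = "https://www.youtube.com/results"


def build_search_url(keyword: str) -> str:
    """Return the search results URL for a keyword (hypothetical helper)."""
    # urlencode escapes spaces and special characters in the keyword.
    return f"{BASE_URL}?{urlencode({'search_query': keyword})}"


print(build_search_url("web scraping"))
# prints: https://www.youtube.com/results?search_query=web+scraping
```

Either way, the result is the URL you enter when building the task in Step One.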
Step One: Enter the URLs of the websites you would like to scrape
- Build a new task by clicking “Advanced Mode”, and enter the URL we just copied.
- Then click “Save URL” in the left corner. This will bring you to the YouTube search results page in Octoparse’s built-in browser.
Step Two: Create a pagination loop
- Wait for the web page to finish loading in Octoparse’s built-in browser. You can see that the interface is divided into three parts: the workflow box on the left, the settings area on the right, and the interactive view of the website at the bottom.
- Let’s switch off the workflow and take a closer look at the webpage structure. YouTube uses infinite scrolling, which means that instead of splitting results into pages, the website automatically appends the next batch of content to the current page. This is a typical AJAX technique. We will get into this later. So let’s turn the workflow back on to continue.
- Since YouTube uses infinite scrolling, we don’t need to create pagination. Instead, we tell Octoparse to keep scrolling down until the desired amount of content has loaded.
- To do this, go to the settings area. Check the “Scroll Down” option and set the scroll times. Depending on how many YouTube videos you would like to scrape, you may change the scroll times accordingly. Let’s say there are around 25 videos before any new content is loaded, and 25 new videos are added each time you scroll down to the bottom. To get 100 videos, you need to scroll 3 times in this case.
- Now set the interval time, which depends on the website’s loading speed. YouTube is fairly fast, but the videos may slow down loading, so I set it to 2 seconds.
- Click “Ok” to save the steps.
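The scroll count above is simple arithmetic, and it can be sketched as a small calculation. The figures (25 videos at first, 25 more per scroll, a 2-second interval) are the rough numbers from this demonstration, not fixed properties of YouTube, and the function names are made up for this example.

```python
import math


def scrolls_needed(target: int, initial: int = 25, per_scroll: int = 25) -> int:
    """Return how many scrolls are needed to load at least `target` videos."""
    if target <= initial:
        return 0  # the first screen already shows enough videos
    # Each scroll adds `per_scroll` videos; round up to cover the remainder.
    return math.ceil((target - initial) / per_scroll)


def total_wait_seconds(scrolls: int, interval: float = 2.0) -> float:
    """Return the total time spent waiting between scrolls."""
    return scrolls * interval


scrolls = scrolls_needed(100)
print(scrolls)                      # prints: 3
print(total_wait_seconds(scrolls))  # prints: 6.0
```

If the target is not a clean multiple, the count rounds up, so 110 videos would take 4 scrolls under the same assumptions.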
Step Three: Create a “Loop Item”
- To click through each detail page and get the detailed information, we need to create a “Loop Item”.
- To create a loop item, select an element, in this case the video name in the listing. Click the first title, and you will notice that other similar elements have been found and highlighted in red. We want to create a loop item with all listings, so follow the guide and choose “Select All”.
- Now, all results are highlighted in green, which means they have been successfully selected.
- Then click “Loop click the selected link” to create a “Loop Item”. Octoparse will click through each video for detailed information.
- We still need to go back to the settings area and adjust the settings.
- Because YouTube uses AJAX, select the “Ajax Load” option and set the wait time accordingly.
- Uncheck “Auto-Retry”. “Ajax Load” and “Auto-Retry” are mutually exclusive, so make sure you set this correctly.
- Click “Save” to save the steps.
- Now we need to go back and check whether everything has been set as expected. If the web page in the built-in browser responds accordingly as we click through each step, the settings are correct. As you may notice, when I click “Loop Item”, only 18 listings are shown, whereas the page actually contains more than 18. This is due to an incorrect XPath. Therefore, we need to fix the XPath so that Octoparse can locate every element on the webpage and avoid incomplete extraction. To do this, go to the settings area. I have already prepared the correct XPath for the purpose of the demonstration. Copy this XPath and paste the expression into the variable list.
- I have also attached the tutorial on how to write an XPath down below. Not all webpages are written with exactly the same structure, and the robot will skip an element if the scraper can’t locate it. You can ignore this step if the webpage is well organized.
- Click “Save” to save the step.
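To see why an XPath can return fewer items than the page contains, here is a small sketch using Python’s built-in ElementTree, which supports a limited subset of XPath. The markup, the video-title id, and both expressions are made up for illustration. They are not YouTube’s real structure and not the XPath used in this demonstration.

```python
import xml.etree.ElementTree as ET

# Simplified, made-up markup; real YouTube pages are far more complex.
PAGE = """
<div id="results">
  <section>
    <div class="video"><h3><a id="video-title">First video</a></h3></div>
    <div class="video"><h3><a id="video-title">Second video</a></h3></div>
  </section>
  <section>
    <div class="video"><h3><a id="video-title">Third video</a></h3></div>
  </section>
</div>
"""

root = ET.fromstring(PAGE.strip())

# Too specific: tied to the first section, so later items are missed.
narrow = root.findall("./section[1]/div/h3/a")

# More general: match every title link wherever it sits in the tree.
broad = root.findall(".//a[@id='video-title']")

print(len(narrow), len(broad))  # prints: 2 3
print([a.text for a in broad])
```

The narrow expression mirrors the situation above, where the loop finds only part of the listings. A more general expression picks up the rest.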
Step Four: Data extraction
- To extract the data, click an element, in this case the video title, and select “extract the text of the selected element” from the Action Tip.
- You can preview and edit the extracted data in the data fields.
- Repeat the above steps to extract the rest of the data you need.
- Click “Save” to save the steps.
- Now go back to the settings area. Set “wait before execution” to 2 seconds. This is because Octoparse can scrape faster than the webpage actually loads. Octoparse finishes scraping the current page and is ready to move on to the next one while the webpage is still loading the rest of its information. Without the wait, this lag causes Octoparse to scrape the whole page again until the current webpage is ready to paginate.
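The idea behind “wait before execution” can be sketched in a few lines: pause, check whether the page is ready, and only then extract. This is a general illustration of the technique, not how Octoparse implements it. The names wait_until_ready and is_ready are hypothetical.

```python
import time
from typing import Callable


def wait_until_ready(is_ready: Callable[[], bool],
                     wait_seconds: float = 2.0,
                     max_attempts: int = 5) -> bool:
    """Pause before each check so the page has time to finish loading."""
    for _ in range(max_attempts):
        time.sleep(wait_seconds)  # the "wait before execution" pause
        if is_ready():
            return True  # safe to start extracting
    return False  # the page never became ready; give up


# Simulated page that reports ready on the second check.
checks = {"count": 0}


def fake_page_ready() -> bool:
    checks["count"] += 1
    return checks["count"] >= 2


print(wait_until_ready(fake_page_ready, wait_seconds=0.01))  # prints: True
```

Waiting first and extracting second is what keeps the scraper from reading a half-loaded page.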
Step Five: Run the task and get data
- After setting up the rules, we can run the task by clicking “start extraction”.
- Then select “Local extraction” to run the task. You can switch views to check the scraping status on the website and the data that has been extracted into the table.