An updated tutorial for the latest version, 8.1, is available here. Check it out!
In this tutorial, we will show you how to customize data conglomeration in Octoparse to merge different rows of data into a single row.
Suppose you need to extract posts from a blog. In some cases, you cannot select the entire post as one element, so each paragraph is extracted as a separate row. What you want, however, is the whole post in a single row.
To merge these rows into one row of data, we suggest using the conglomerate feature in Octoparse while configuring the extraction.
Here we use the blog content from https://philipyancey.com/a-view-from-abroad as an example to show how the conglomerate feature merges the extracted data.
1) Select the desired data to extract
1. Select one paragraph on the page and click "Select all" to create a "Loop Item" that extracts each paragraph of the post.
2. Select "Extract text of the selected elements"
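Outside Octoparse, the effect of this step (one extracted row per paragraph) can be approximated with a few lines of Python. This is only a rough sketch using the standard library: the class name ParagraphCollector, the helper extract_paragraph_rows, and the sample HTML are made up for illustration, and nothing here reflects how Octoparse works internally.

```python
from html.parser import HTMLParser


class ParagraphCollector(HTMLParser):
    """Collects the text of each <p> element as a separate row (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.rows = []              # one dict per paragraph, like one row per loop item
        self._in_paragraph = False  # True while the parser is inside a <p> element
        self._buffer = []           # text fragments of the current paragraph

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._in_paragraph = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_paragraph:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "p" and self._in_paragraph:
            text = "".join(self._buffer).strip()
            if text:  # skip empty paragraphs
                self.rows.append({"Text": text})
            self._in_paragraph = False


def extract_paragraph_rows(html):
    """Return a list of rows, each holding the text of one paragraph."""
    collector = ParagraphCollector()
    collector.feed(html)
    collector.close()
    return collector.rows


# Made-up sample markup standing in for a blog post.
SAMPLE_HTML = (
    "<div><p>First paragraph.</p>"
    "<p>Second <em>paragraph</em>.</p>"
    "<p>Third paragraph.</p></div>"
)

rows = extract_paragraph_rows(SAMPLE_HTML)
for row in rows:
    print(row)
```

Each paragraph ends up in its own row under the "Text" field, which is the situation the next section addresses.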
2) Customize data conglomeration to combine the data extracted
1. Click the "Extract Data" action, then select the data field you want to customize
2. Click the customize icon to customize the data field
3. Select "Customize data conglomeration"
4. Select "Conglomerate data captured for the same data field into a single row."
Now the paragraphs captured in the "Text" field will be merged into a single row when the task runs.
Let's run the task and export the result to Excel for a better view.
You can see that the paragraphs captured in the "Text" field are now combined into a single row as one big chunk.
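Conceptually, conglomeration takes the values captured for one data field across many rows and joins them into a single row. The sketch below illustrates that idea in Python. The function name conglomerate, the newline separator, and the sample rows are assumptions chosen for the example; the tutorial does not state which separator, if any, Octoparse inserts between merged items.

```python
def conglomerate(rows, field, separator="\n"):
    """Merge the values of `field` from all rows into one single row.

    `separator` is an assumption for this sketch; adjust it to taste.
    """
    values = [row[field] for row in rows if row.get(field)]
    return [{field: separator.join(values)}]


# Made-up rows, one per extracted paragraph.
paragraph_rows = [
    {"Text": "First paragraph."},
    {"Text": "Second paragraph."},
    {"Text": "Third paragraph."},
]

merged = conglomerate(paragraph_rows, "Text")
print(len(merged))        # number of rows after merging
print(merged[0]["Text"])  # the whole post as one chunk
```

The result is a list with exactly one row, mirroring what the exported file shows for the "Text" field.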
1. Data conglomeration is especially useful for extracting articles from the web.
You can extract the article as one whole chunk, without other elements such as blank lines, comments, and images.
2. When the data are conglomerated into one chunk, you can use the Data reformat tools to add a prefix or suffix, such as "|" or "\", so that each item is easier to tell apart.
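To see why a prefix or suffix helps, here is a small Python sketch that wraps each item in a marker before the items are joined into one chunk. The function name merge_with_affixes and the sample items are hypothetical, and applying the marker to each item before merging is an assumption about the intended effect, not a description of how the Data reformat tools work.

```python
def merge_with_affixes(items, prefix="", suffix=""):
    """Wrap every item in prefix/suffix, then join the items into one chunk."""
    return "".join(f"{prefix}{item}{suffix}" for item in items)


# Made-up items standing in for extracted paragraphs.
items = ["First paragraph.", "Second paragraph.", "Third paragraph."]

chunk = merge_with_affixes(items, suffix="|")
print(chunk)

# The marker makes it possible to recover the individual items later.
recovered = [part for part in chunk.split("|") if part]
print(recovered)
```

Pick a marker that does not occur in the text itself, otherwise the items cannot be separated cleanly afterwards.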