Extract all links from a list of local HTML files.



  • tatgift

    Hi Steve

    I'm also new to Octoparse, but I've seen a few help guides that should assist. The first deals with creating a loop of pages to crawl from a list of URLs; the second covers extracting URLs from text strings using regular expressions.

    Take a look at these and see if they help.

    1. https://www.octoparse.com/tutorial-7/extract-data-from-a-list-of-urls   
    Note: I'm not sure if it's as simple as creating a list of file paths from the directory storing your HTML files and pasting it into the start page dialogue, but it's worth a go. I assume you will need to define a field that loads the complete HTML as a single field: select any element from the page and keep expanding the tags until it selects the lot.

    2. https://helpcenter.octoparse.com/hc/en-us/articles/900004068246-Regular-expression-tool-Version-8-
    Note: once you have the complete HTML loaded as a single field, you can select the clean text function and match with regex using the regex builder, setting the start value to http or www or whatever string is appropriate, and similarly the end value to .com, .uk, .jpg or whatever is required. Generate the regex and apply it. Then evaluate it with 'match all' set. This should extract all URLs that match the string.
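
    If the Octoparse route turns out to be awkward for local files, the same two steps (build a list of the HTML files in a folder, then regex-match every URL in each file) can be sketched in plain Python. This is only a rough sketch under assumptions: the function names are made up for illustration, the pattern is a simple "starts with http or www, runs until whitespace or a quote" match rather than the start/end-value regex the Octoparse builder would generate, and it reads the files directly instead of going through Octoparse.

```python
import re
from pathlib import Path

# Assumed, deliberately simple pattern: begins with http://, https:// or www.
# and continues until whitespace, a quote, or an angle bracket.
# It is not a complete URL grammar.
URL_PATTERN = re.compile(r"""(?:https?://|www\.)[^\s"'<>]+""")

HTML_SUFFIXES = {".html", ".htm"}


def list_html_files(directory):
    """Return sorted file:// URIs for the HTML files directly inside directory."""
    folder = Path(directory).resolve()
    return sorted(
        p.as_uri()
        for p in folder.iterdir()
        if p.is_file() and p.suffix.lower() in HTML_SUFFIXES
    )


def extract_urls(html_text):
    """Return every URL-like match in order of appearance, duplicates dropped."""
    return list(dict.fromkeys(URL_PATTERN.findall(html_text)))


def extract_from_directory(directory):
    """Map each HTML file name in directory to the URLs found in that file."""
    results = {}
    for path in sorted(Path(directory).iterdir()):
        if path.is_file() and path.suffix.lower() in HTML_SUFFIXES:
            # errors="replace" keeps going if a file is not valid UTF-8
            text = path.read_text(encoding="utf-8", errors="replace")
            results[path.name] = extract_urls(text)
    return results
```

    The output of list_html_files is the kind of path list that could be tried in the start page dialogue mentioned in note 1, though I have not confirmed that Octoparse accepts file:// paths there. extract_from_directory skips Octoparse entirely and gives one list of links per file.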

    This may not be the best approach, but it's how I would tackle it with my recently acquired knowledge.

    Btw: these forums don't seem to be terribly active, but hopefully you get some other replies if the suggested approach doesn't work.


