Automation Action – Web Spider | ThinkAutomation
Automation Action: Web Spider
Crawls a web site and returns a list of all URLs found.
Crawls (spiders) a URL and returns a list of the URLs found. The list can be returned either as text with one URL per line, or as CSV or JSON containing each URL, Title, Description, Keywords and Last Modified Date.
The Web Spider Action only crawls the specified URL. It does not crawl outbound links.
Specify the URL to spider.
Specify any Avoid Patterns (separated by semicolons). These are wildcard patterns that prevent matching URLs from being spidered. For example, if “*/assets/*” is added, any URL containing “/assets/” is not spidered. The “*” character matches zero or more of any character.
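The wildcard matching described above can be sketched in Python. This is an illustration only, not ThinkAutomation's implementation: the function names are hypothetical, and case-sensitive matching is an assumption, since the page does not say how case is handled.

```python
import re

def parse_avoid_patterns(setting):
    """Split a semicolon-separated Avoid Patterns string, dropping empty entries."""
    return [p.strip() for p in setting.split(";") if p.strip()]

def pattern_to_regex(pattern):
    """Translate a wildcard pattern where '*' matches zero or more characters."""
    literal_parts = [re.escape(part) for part in pattern.split("*")]
    return re.compile(".*".join(literal_parts), re.DOTALL)

def is_avoided(url, patterns):
    """Return True if the whole URL matches any of the wildcard patterns."""
    return any(pattern_to_regex(p).fullmatch(url) for p in patterns)

patterns = parse_avoid_patterns("*/assets/*; */tags/*")
print(is_avoided("https://www.testsite.com/assets/logo.png", patterns))
print(is_avoided("https://www.testsite.com/page2.htm", patterns))
```

With these sample patterns the first URL is avoided and the second is not. Because “*” can match zero characters, a pattern such as “*/assets/*” also matches a URL that ends in “/assets/”.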
Optionally specify a date (or a %variable% containing a date) in the Only Modified Since entry. If a date is specified, only URLs with a Last-Modified header date later than this date will be returned.
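As a rough illustration of the Only Modified Since comparison, the sketch below parses a Last-Modified header in the standard HTTP date format and checks whether it is later than a cutoff. The function name is hypothetical, and treating dates as UTC is an assumption. The page does not say how URLs without a Last-Modified header are handled, so that case is left out.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

def modified_since(last_modified_header, cutoff):
    """Return True if the Last-Modified header date is later than the cutoff."""
    modified = parsedate_to_datetime(last_modified_header)
    if modified.tzinfo is None:
        # Assumption: treat a date without a timezone as UTC.
        modified = modified.replace(tzinfo=timezone.utc)
    return modified > cutoff

cutoff = datetime(2025, 1, 1, tzinfo=timezone.utc)
print(modified_since("Wed, 26 Feb 2025 15:12:12 GMT", cutoff))  # True
print(modified_since("Sun, 01 Dec 2024 00:00:00 GMT", cutoff))  # False
```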
Set the Maximum URLs that you want to spider for the site.
Enable the Chop Querystrings option to remove the ?query portion from any URLs. This helps to avoid spidering auto-generated content.
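The effect of removing the ?query portion can be sketched with Python's standard URL parser. The function name is hypothetical and the sample URLs are made up for illustration. Collapsing duplicates afterwards is an assumption about why the option helps, not documented behavior.

```python
from urllib.parse import urlsplit

def chop_querystring(url):
    """Return the URL with its ?query portion removed."""
    return urlsplit(url)._replace(query="").geturl()

found = [
    "https://www.testsite.com/page2.htm?sort=asc",
    "https://www.testsite.com/page2.htm?sort=desc&page=3",
    "https://www.testsite.com/",
]

# dict.fromkeys keeps the first occurrence of each URL and preserves order.
unique = list(dict.fromkeys(chop_querystring(u) for u in found))
print(unique)
# ['https://www.testsite.com/page2.htm', 'https://www.testsite.com/']
```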
The Web Spider Action will check any robots.txt file. It will not download pages denied by robots.txt.
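To preview which pages a robots.txt file would deny, a check along these lines can be run with Python's standard library. The robots.txt content and the "ExampleSpider" user agent are made up for illustration. The user agent that the Web Spider Action actually sends is not stated on this page.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only.
robots_txt = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# "ExampleSpider" is a placeholder user agent name.
print(parser.can_fetch("ExampleSpider", "https://www.testsite.com/private/report.htm"))  # False
print(parser.can_fetch("ExampleSpider", "https://www.testsite.com/page2.htm"))  # True
```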
The Return As option can be set to:
URLs one per line
For example:
https://www.testsite.com/
https://www.testsite.com/page2.htm
CSV Containing URL, Title, Description, Keywords, Modified Date
For example:
URL,Title,Description,Keywords,LastModDate
https://www.testsite.com/,Title 1,Test Description 1,"keyword1,keyword2",2025-02-26 15:12:12
https://www.testsite.com/page2.htm,Title 2,Test Description 2,"keyword1,keyword2",2025-02-26 15:12:12
JSON Array Containing URL, Title, Description, Keywords, Modified Date
For example:
[
  {
    "URL": "https://www.testsite.com/",
    "Title": "Title 1",
    "Description": "Test Description 1",
    "Keywords": "keyword1,keyword2",
    "LastModDate": "2025-02-26T15:12:12"
  },
  {
    "URL": "https://www.testsite.com/page2",
    "Title": "Title 2",
    "Description": "Test Description 2",
    "Keywords": "keyword1,keyword2",
    "LastModDate": "2025-02-26T15:12:12"
  }
]
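If the CSV or JSON result is passed to an external script, it could be read along these lines. This is a minimal sketch that assumes the output has exactly the shape of the examples above. The sample strings are abbreviated copies of those examples, and the helper name is hypothetical.

```python
import csv
import io
import json
from datetime import datetime

csv_result = (
    "URL,Title,Description,Keywords,LastModDate\n"
    'https://www.testsite.com/,Title 1,Test Description 1,"keyword1,keyword2",2025-02-26 15:12:12\n'
)
json_result = (
    '[{"URL": "https://www.testsite.com/", "Title": "Title 1", '
    '"Description": "Test Description 1", "Keywords": "keyword1,keyword2", '
    '"LastModDate": "2025-02-26T15:12:12"}]'
)

def normalise(row):
    """Split the keyword string and parse the date of one result row."""
    return {
        "url": row["URL"],
        "title": row["Title"],
        "keywords": [k.strip() for k in row["Keywords"].split(",") if k.strip()],
        # fromisoformat accepts both the space and the "T" date/time separator.
        "modified": datetime.fromisoformat(row["LastModDate"]),
    }

# The csv module handles the quoted Keywords field that contains commas.
csv_rows = [normalise(r) for r in csv.DictReader(io.StringIO(csv_result))]
json_rows = [normalise(r) for r in json.loads(json_result)]

print(csv_rows == json_rows)  # True
print(csv_rows[0]["keywords"])  # ['keyword1', 'keyword2']
```

Note that the two formats write the date differently in the examples: the CSV uses a space between date and time, while the JSON uses "T".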
Select the variable to receive the results from the Assign To list.
You can also assign a list of outbound links found across all URLs spidered. Select the variable to receive outbound links from the Assign Outbound Links to list. Outbound links are returned as a text string with one link per line.
This Action is useful when you need to load content for an entire site, for example when loading a site into a Knowledge Store or Vector Database for use with AI. You could first spider the site and then use the For..Each.. Line In action to loop through the results, adding each page's content to a Knowledge Store/Vector Database Collection and using the page title as the article title.
Note: This action may take several minutes for large sites.