Automation Action – Web Spider | ThinkAutomation

Automation Action: Web Spider

Crawls (spiders) a website and returns a list of the URLs found. The list can be returned as text with one URL per line, or as CSV or JSON containing the URL, Title, Description, Keywords and Last Modified Date of each page.

The Web Spider Action only crawls pages on the site at the specified URL. It does not follow outbound links to other sites.

Specify the URL to spider.

Specify any Avoid Patterns (separated by semicolons). These are wildcard patterns that prevent matching URLs from being spidered. For example, if “*/assets/*” is added, then any URL containing “/assets/” is not spidered. The “*” character matches zero or more of any character.
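The wildcard matching described above can be sketched in Python. This is only an illustration of the rule, not ThinkAutomation's implementation: the function names are hypothetical, and it assumes a pattern must match the whole URL and that matching is case-sensitive.

```python
import re

def compile_avoid_patterns(patterns: str) -> list[re.Pattern]:
    """Turn a semicolon-separated wildcard list into compiled regexes."""
    compiled = []
    for raw in patterns.split(";"):
        raw = raw.strip()
        if not raw:
            continue
        # Escape literal text, then let "*" match zero or more of any character.
        regex = ".*".join(re.escape(part) for part in raw.split("*"))
        compiled.append(re.compile(regex, re.DOTALL))
    return compiled

def is_avoided(url: str, compiled: list[re.Pattern]) -> bool:
    """Return True when the whole URL matches any avoid pattern (assumed rule)."""
    return any(p.fullmatch(url) for p in compiled)

avoid = compile_avoid_patterns("*/assets/*; *.pdf")
print(is_avoided("https://www.testsite.com/assets/logo.png", avoid))  # True
print(is_avoided("https://www.testsite.com/page2.htm", avoid))        # False
```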

Optionally specify a date (or a %variable% containing a date) in the Only Modified Since entry. If a date is specified, only URLs with a Last-Modified header date later than this date are returned.
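As a rough illustration of the Only Modified Since comparison, the sketch below parses an HTTP Last-Modified header value and compares it with a cutoff date. The function name is hypothetical, and treating a header without a time zone as UTC is an assumption, not documented behaviour of the action.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

def modified_since(last_modified_header: str, cutoff: datetime) -> bool:
    """Return True when the Last-Modified header date is later than the cutoff."""
    modified = parsedate_to_datetime(last_modified_header)
    if modified.tzinfo is None:
        # Assumption: a header without a zone is interpreted as UTC.
        modified = modified.replace(tzinfo=timezone.utc)
    return modified > cutoff

cutoff = datetime(2025, 2, 1, tzinfo=timezone.utc)
print(modified_since("Wed, 26 Feb 2025 15:12:12 GMT", cutoff))  # True
print(modified_since("Wed, 01 Jan 2025 00:00:00 GMT", cutoff))  # False
```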

Set the Maximum URLs that you want to spider for the site.

Enable the Chop Querystrings option to remove the ?query portion from any URLs. This helps avoid spidering auto-generated content.
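The effect of chopping querystrings can be sketched as follows. The function name is hypothetical, and keeping any #fragment is an assumption, since the text only mentions the ?query portion.

```python
from urllib.parse import urlsplit, urlunsplit

def chop_querystring(url: str) -> str:
    """Return the URL with its ?query portion removed."""
    parts = urlsplit(url)
    # Rebuild the URL with an empty query component.
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", parts.fragment))

urls = [
    "https://www.testsite.com/page2.htm?sort=asc&page=1",
    "https://www.testsite.com/page2.htm?sort=asc&page=2",
]
# Both auto-generated variants collapse to a single URL.
print({chop_querystring(u) for u in urls})  # {'https://www.testsite.com/page2.htm'}
```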

The Web Spider Action checks for a robots.txt file and does not download pages denied by it.
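For readers unfamiliar with robots.txt, this sketch shows the kind of check involved, using Python's standard urllib.robotparser. The robots.txt content and the "ExampleSpider" user agent are made up for illustration; the user agent that the Web Spider Action actually sends is not stated here.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content for the example site.
robots_txt = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# "ExampleSpider" is a placeholder user agent name.
print(parser.can_fetch("ExampleSpider", "https://www.testsite.com/page2.htm"))      # True
print(parser.can_fetch("ExampleSpider", "https://www.testsite.com/private/a.htm"))  # False
```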

The Return As option can be set to:

URLs one per line

For example:

https://www.testsite.com/
https://www.testsite.com/page2.htm

CSV Containing URL, Title, Description, Keywords, Modified Date

For example:

URL,Title,Description,Keywords,LastModDate
https://www.testsite.com/,Title1,Test Description 1,"keyword1,keyword2",2025-02-26 15:12:12
https://www.testsite.com/page2.htm,Title 2,Test Description 2,"keyword1,keyword2",2025-02-26 15:12:12

JSON Array Containing URL, Title, Description, Keywords, Modified Date

For example:

[
  {
    "URL": "https://www.testsite.com/",
    "Title": "Title 1",
    "Description": "Test Description 1",
    "Keywords": "keyword1,keyword2",
    "LastModDate": "2025-02-26T15:12:12"
  },
  {
    "URL": "https://www.testsite.com/page2",
    "Title": "Title 2",
    "Description": "Test Description 2",
    "Keywords": "keyword1,keyword2",
    "LastModDate": "2025-02-26T15:12:12"
  }
]
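If the CSV or JSON result is passed to an external script, it could be read along these lines. This is a minimal sketch using the sample values above; the helper names are hypothetical, and the column and key names are taken from the examples rather than from a formal schema.

```python
import csv
import io
import json
from datetime import datetime

# Sample results copied from the examples above (one record each).
csv_result = """URL,Title,Description,Keywords,LastModDate
https://www.testsite.com/,Title1,Test Description 1,"keyword1,keyword2",2025-02-26 15:12:12
"""

json_result = """[
  {"URL": "https://www.testsite.com/", "Title": "Title 1",
   "Description": "Test Description 1", "Keywords": "keyword1,keyword2",
   "LastModDate": "2025-02-26T15:12:12"}
]"""

def parse_csv_result(text: str) -> list[dict]:
    """Read the CSV form; quoted fields keep their embedded commas."""
    return list(csv.DictReader(io.StringIO(text)))

def parse_json_result(text: str) -> list[dict]:
    """Read the JSON array form into a list of dictionaries."""
    return json.loads(text)

rows = parse_csv_result(csv_result)
pages = parse_json_result(json_result)

print(rows[0]["Keywords"].split(","))                   # ['keyword1', 'keyword2']
print(datetime.fromisoformat(pages[0]["LastModDate"]))  # 2025-02-26 15:12:12
```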

Select the variable to receive the results from the Assign To list.

You can also assign a list of outbound links found across all URLs spidered. Select the variable to receive outbound links from the Assign Outbound Links to list. Outbound links are returned as a text string with one link per line.
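As one possible use of the outbound links text, the sketch below counts how many links point at each external host. The sample links and the function name are hypothetical; only the one-link-per-line format comes from the description above.

```python
from collections import Counter
from urllib.parse import urlsplit

# Hypothetical sample of the outbound links variable (one link per line).
outbound_links = """https://www.example.com/docs
https://www.example.com/pricing
https://www.example.org/
"""

def count_outbound_hosts(links_text: str) -> Counter:
    """Count outbound links per host name, skipping blank lines."""
    hosts = Counter()
    for line in links_text.splitlines():
        line = line.strip()
        if line:
            hosts[urlsplit(line).hostname or ""] += 1
    return hosts

print(count_outbound_hosts(outbound_links).most_common())
# [('www.example.com', 2), ('www.example.org', 1)]
```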

This Action is useful when you need to load the content of an entire site, for example when adding a site to a Knowledge Store or Vector Database for use with AI. You could first spider the site and then use the For..Each.. Line In action to loop through the returned URLs, adding each page's content to a Knowledge Store or Vector Database collection and using the page title as the article title.

For example:

Spider Site For AI Use Automation

// Add site to vector database
SpiderURL = "https://www.optimagpt.com"
Markdown =
Title =
Content =
PageURL =
URLList =
LastDate =
// Get the last date we spidered
LastDate = Embedded Value Store Get In "SpiderDates" For Key %SpiderURL%
// Get a list of page changes since the last run
URLList = Web Spider URL %SpiderURL% Only Modified Since %LastDate%
For Each Line In %URLList% [Assign To: PageURL]
    Content = HTTP Get From %PageURL% [Assign Title To: Title]
    If %Content% Is Not Blank Then
        // Convert the page content to Markdown
        Markdown = Text Operation Convert: HTML To Markdown %Content% Drop Tags "header,nav,footer,form" (Suppress Links) (Suppress Images)
        // Add/update the Markdown in the vector database.
        Embedded Vector Database Update In "OptimaGPT" Key %Title% = %Markdown%
    End If
Next Loop
Embedded Value Store Set In "SpiderDates" Key %SpiderURL% = %DateTimeUtc%
Return %URLList%

Note: This action may take several minutes for large sites.
This is one action from over 180 actions included with ThinkAutomation. The ThinkAutomation business process automation (BPA) solution is designed to automate on-premises and cloud-based business processes that are triggered from incoming messages. Automate messages received by email, database updates, webhooks, web forms, web chat, SMS messages, Twitter, Teams messages, documents, local files and other message sources. Create any number of workflow automations using the drag-and-drop low-code designer. Simple fixed pricing with unlimited message processing reduces overall costs compared to hosted automation solutions.

You can also extend ThinkAutomation by creating your own custom automation actions using the built-in designer and C#/VB.NET code editor.