Automation Action – Convert Document To Text | ThinkAutomation

Automation Action: Convert Document To Text

Convert PDF, Word, Open Document, Excel, RichText, Markdown or HTML documents or attachments to plain text, or extract PDF form data.

The Convert Document To Text automation action parses and extracts text from Word, PDF, Open Document, Excel, RichText, Markdown and HTML attachments or local document files. Each document is converted to plain text and assigned to a variable, which you can then use further in your automation workflow. This action can also extract PDF form data.

Select a Document To Convert – this can be any local file or a %variable% replacement. You can specify multiple documents if required, separated by commas (any file paths that contain commas must be enclosed in quotes).
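The quoting rule above is the same one standard CSV parsers apply when splitting a line. As a rough illustration only, here is how such a list could be split in Python. This is not ThinkAutomation's own parser; `split_document_list`, the sample paths and the `%Attachment%` variable name are hypothetical.

```python
import csv

def split_document_list(value: str) -> list[str]:
    """Split a comma-separated list of paths, honouring double quotes."""
    # skipinitialspace lets a quoted path follow ", " rather than only ","
    reader = csv.reader([value], skipinitialspace=True)
    return [path.strip() for path in next(reader) if path.strip()]

# The second path contains a comma, so it is enclosed in quotes.
paths = split_document_list(
    r'C:\docs\a.pdf, "C:\docs\report, final.docx", %Attachment%'
)
for path in paths:
    print(path)
```

Without the quotes, the second path would be split into two entries at its embedded comma.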

Enable the Include Incoming Attachments option to convert attached documents matching the Matching Mask. Enter *.* to convert all supported attachments.

Select the variable to receive the plain text from the Assign To list.

The document(s) will be converted to plain text. Excel files will be converted to CSV text. Markdown documents will be converted to HTML first and then the HTML converted to plain text.

If multiple files are converted within the same action then the extracted text from each file will be appended to the returned text.

PDF Extract Form Data

Enable the PDF Extract Form Data option to extract only form data from PDF files. If enabled then form data only will be extracted in the following format:

{form field name}: {value}
{form field name}: {value}
...

Enable the Return PDF Form Data As Json option to return the form data as JSON text instead.
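If you pass the extracted form text to your own code, it can be turned into a lookup of field names to values. The following Python sketch assumes each field appears on its own line as `name: value` and that field names contain no colon. `parse_form_text` and the sample data are hypothetical.

```python
import json

def parse_form_text(text: str) -> dict[str, str]:
    """Parse 'name: value' lines into a dictionary (assumes one field per line)."""
    fields = {}
    for line in text.splitlines():
        if ":" not in line:
            continue  # skip blank or unexpected lines
        # Split on the first colon only, so values may themselves contain colons.
        name, _, value = line.partition(":")
        fields[name.strip()] = value.strip()
    return fields

sample = "FirstName: Ada\nLastName: Lovelace\nAgree: Yes"
fields = parse_form_text(sample)
print(json.dumps(fields, indent=2))
```

The `json.dumps` call only displays the parsed dictionary. The JSON produced by the Return PDF Form Data As Json option may be shaped differently, so check the Test output before relying on a particular structure.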

PDF Use OCR For Image Only Documents

When converting PDF documents, if the PDF document contains only images (for example, a scanned document), OCR will be used to extract the text. This option only applies if the PDF document contains images only and no actual text. It requires Tesseract to be installed.

PDF Text Extract Mode

When converting PDF documents to text, you can choose one of the following modes:

  • Keep Positioning Method 1: Some positioning will be retained.
  • Keep Positioning Method 2: Same as Keep Positioning Method 1 but using a different extraction method. This may provide a more accurate plain text representation of the PDF document in some cases.
  • Keep Reading Order Method 1: The text will be extracted in reading order, with no positioning indentation.
  • Keep Reading Order Method 2: Same as Keep Reading Order Method 1 but using a different extraction method.
  • Extract To CSV: Extracts each text element to CSV text with columns: Page, Bounds (left,top,right,bottom), Text, Font, Size, Weight, RGB. The CSV will contain a row for each text element.
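The Extract To CSV output can be read with any CSV parser. The Python sketch below picks out text elements at or above a given font size, for example to find headings. It assumes the CSV starts with a header row using the column names listed above and that Bounds is a single quoted field. `large_text_elements` and the sample rows are invented for illustration, so compare them with your own Test output first.

```python
import csv
import io

def large_text_elements(csv_text: str, min_size: float) -> list[tuple[int, str]]:
    """Return (page, text) for elements whose font size is at least min_size."""
    rows = csv.DictReader(io.StringIO(csv_text))
    result = []
    for row in rows:
        if float(row["Size"]) >= min_size:
            result.append((int(row["Page"]), row["Text"]))
    return result

# Hypothetical sample; real output may format values differently.
sample = (
    'Page,Bounds,Text,Font,Size,Weight,RGB\n'
    '1,"72,700,300,720",Invoice 1001,Helvetica,18,Bold,"0,0,0"\n'
    '1,"72,650,200,662",Total due,Helvetica,10,Normal,"0,0,0"\n'
)
print(large_text_elements(sample, 12))
```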

Enable the Remove Repeated Blank Lines option if you need repeated blank lines removed from the text. This option is useful when the PDF document contains differing amounts of blank space that your extraction rules do not need.
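To show the kind of clean-up this option performs, here is a small Python sketch that collapses each run of blank lines to a single blank line. It illustrates the effect and is not the product's implementation. Keeping one blank line per run is an assumption, and `collapse_blank_lines` is a hypothetical name.

```python
def collapse_blank_lines(text: str) -> str:
    """Collapse each run of blank (or whitespace-only) lines to one line."""
    kept = []
    previous_blank = False
    for line in text.splitlines():
        blank = not line.strip()
        if blank and previous_blank:
            continue  # drop the second and later blank lines in a run
        kept.append(line)
        previous_blank = blank
    return "\n".join(kept)

print(collapse_blank_lines("Name: Ada\n\n\n\nTotal: 42"))
```

Normalising blank space this way makes line-based extraction rules less sensitive to layout differences between otherwise similar documents.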

You can then use the text in other actions – or use Extract Field actions to parse & extract data from the text.

To test the text extraction, select or enter a document and click the Test button. The results will be displayed. Click the Copy button to copy the extracted text to the clipboard. You can then paste it into the Extract Field Helper Message if you need to extract data from the text.

This is one of over 180 actions included with ThinkAutomation. The ThinkAutomation business process automation (BPA) solution is designed to automate on-premises and cloud-based business processes that are triggered by incoming messages. Automate messages received by email, database updates, webhooks, web forms, web chat, SMS messages, Twitter, Teams messages, documents, local files and other message sources. Create any number of workflow automations using the drag-and-drop low-code designer. Simple fixed pricing with unlimited message processing reduces overall costs compared to hosted automation solutions.

You can also extend ThinkAutomation by creating your own custom automation actions using the built-in designer and C#/VB.NET code editor.
