Altair SmartWorks Analytics

 

Importing Data

Data can be imported into a Workflow Canvas by using the Import Text and Import Database nodes.

Up to 400 columns can be imported into SmartWorks Analytics without special handling. Text files containing up to 1,000 columns can also be imported, provided that the Columns to Include option in the Import Text node is deselected and only the specific columns desired are reselected prior to saving and running the node.

Prerequisites

  • An Execution Profile with an active session linked to the workflow

  • A database added as an internal connection to the Execution Profile linked to the Workflow (if importing data from a database)

Importing Data Using the Import Text Node

Steps

  1. Open the Nodes tab in the Palette and then expand the Connect section.

  2. Drag and drop the Import Text node from the Palette to the Workflow Canvas.

  3. Configure the node by opening the Node Viewer. You can double-click on the node or use the Open option provided in the node menu.

  4. In the Configuration tab, provide the following details.

    Output Properties

    The name of the output table. This name is autogenerated but can be edited.

    Import Settings

    • Connection Profile – Use the dropdown list provided to select a connection profile (S3, Library)

    • File Location - Enter the path of the CSV file. Alternatively, click the Browse button to launch the File Browser. Use this browser to select the CSV file from a location/folder based on your chosen Connection Profile and then click Open.


      NOTES:

      • For Apache Spark - You can enter the folder location only (e.g., Test/Folder1) or specify both a folder and a file location (e.g., Test/Folder1/ImportFile.csv). Files can be read by entering only the folder location if the folder was created by the Apache Spark engine (e.g., when you export a file to a folder via the Apache Spark engine).

      • For Pandas - You must enter both the folder and file location.

    • Sampling Type  – Use the drop-down list provided to select how records should be imported to your workflow.

      • First N Rows - Returns every row of the data source, from the first row to row N

      • First N% of Rows - Returns the specified N percentage of rows, starting from the first row of the data source

      • Random N Rows - Returns a random sample of N rows. A random sampling method is applied to independently determine which rows are included in the sample, so the result is an approximation of N. Note that, in Apache Spark, the exact number N of rows is not displayed in the Object node; instead, rows approximating N (i.e., +n/-n) are displayed (see the PySpark sketch after this table).

      • Random N% of Rows - Returns a random sample of the specified N percentage of rows. A random sampling method is applied to independently determine the N% of rows to be included in the sample, so the result is an approximation of N%. Note that, in Apache Spark, the exact N% of rows is not displayed in the Object node; instead, rows approximating N% (i.e., +n/-n) are displayed.

      • Skip First N Rows - Skips the first N rows of the data source and returns all remaining rows

      • All Rows - Returns all rows of the data source

    • Sampling Size - Provide a number (N) to specify how many records should be imported to your workflow (depending on the Sampling Type selected). By default, the number of rows imported is 10,000.

    • Escape Character – Specify an escape character if necessary

    • Delimiter – Use the dropdown list provided to select a column delimiter; by default, the comma (,) is selected

    • Text Qualifying Character – Use the dropdown list provided to select a text qualifying character; by default, the quotation mark (“) is selected  

    • Text Encoding (for Python) – Use the dropdown list provided to select a text encoding type; by default, UTF-8 is selected. The following encoding types are supported:

      • ASCII

      • UTF-8

      • UTF-16LE

      • UTF-16BE

      • LATIN

      More information on encoding types can be found here.

    • Column Name in First Row – Tick the box provided if the first row of the file contains the column names; this setting is checked by default

    • Contains Multiline (for Apache Spark) - Tick the box provided if any of the columns in the data source contains multiline data

    Columns to Include

    Select columns to include in the table from the input data. All columns of the text file, along with the corresponding data types, are displayed in the grid. Users can select/deselect which columns should be brought into the Workflow Canvas. By default, all columns are selected.


    The sample raw data and data preview sections are displayed. The preview table is rendered according to the specifications selected above.
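    For orientation, the Apache Spark behaviors noted above (reading a folder path and the approximate row counts from random sampling) can be illustrated directly in PySpark. This is a hedged sketch only, assuming an active SparkSession named spark and the example path Test/Folder1 from the notes above; the authoritative code for the node is always shown in its Code tab.

      # Illustrative PySpark sketch; the node generates its own code (see the Code tab).
      # Assumes an active SparkSession named `spark` and the example path Test/Folder1.
      df = (
          spark.read
          .option("header", True)      # Column Name in First Row
          .option("delimiter", ",")    # Delimiter
          .option("quote", '"')        # Text Qualifying Character
          .option("multiLine", True)   # Contains Multiline
          .csv("Test/Folder1")         # a folder path suffices if Spark wrote the folder
      )

      # Random N Rows: Spark samples by fraction, which is why the returned row
      # count only approximates N (i.e., +n/-n) rather than matching it exactly.
      n = 10000
      fraction = min(1.0, n / df.count())
      sample_df = df.sample(withReplacement=False, fraction=fraction)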

     

  5. To check the code that will be executed for your specified Import Text configuration, click on the Code tab of the Import Text Node Viewer after saving the configuration. You can also use this tab to refine the code further.
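     As a rough, non-authoritative illustration of how the Import Settings map onto Pandas parameters, the generated code for a Pandas session typically revolves around a pandas.read_csv call along these lines (the path, column names, and escape character below are placeholder assumptions):

       import pandas as pd

       # Hedged sketch of the Import Settings as pandas.read_csv parameters;
       # the node's Code tab shows the actual generated code.
       df = pd.read_csv(
           "Test/Folder1/ImportFile.csv",  # Pandas requires folder and file name
           sep=",",                        # Delimiter
           quotechar='"',                  # Text Qualifying Character
           escapechar="\\",                # Escape Character (example value)
           encoding="utf-8",               # Text Encoding (for Python)
           header=0,                       # Column Name in First Row
           nrows=10000,                    # Sampling Size (First N Rows)
           usecols=["col_a", "col_b"],     # Columns to Include (hypothetical names)
       )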


  6. To complete the Import Text node configuration, press Save. To cancel your changes and return to the Workflow Canvas, press Discard or simply close the Import Text Node Viewer. To execute the import operation, click the Run button.

Importing Data Using the Import Database Node

Steps

  1. Open the Nodes tab in the Palette and then expand the Connect section.

  2. Drag and drop the Import From Database node from the Palette to the Workflow Canvas.

  3. Open the Node Viewer. You can double-click on the node or use the Open option provided in the node menu.

  4. In the Configuration tab, specify the following details to configure the Import From Database node.


    Output Properties

    Specify a name for the output table.

    Import Settings

    • Connection Profile - Select a connection profile from the list of existing database profiles.

    • Table Type - Select Table to import a database table directly, or SQL to apply a SQL query that joins and transforms tables before importing data into the workflow.

    • Table Name - Applicable only if Table is selected as the Table Type. Click on the Browse button to launch a table browser that you can use to locate the database table you wish to add to your workflow.

      You can select a specific schema from the Select Schema list or search for a specific table name to narrow your search.

    • SQL Query - Applicable only if SQL is selected as the Table Type. Enter the corresponding SQL query into the text box provided.

    • Number of Rows - Applicable only if Table is selected as the Table Type. Specify the number of rows to be imported from the database. If this field is left blank, all rows are imported from the data source. By default, the number of rows imported is 1,000.

    Columns to Include

    Select the columns to include from the input data source. The column grid displays all columns of the input data source along with their data types, and the preview table displays records based on the selected columns.


    A snapshot of your data displays on the right-hand side of the viewer.

     

  5. Click Save to save the configuration settings.

  6. To check the code that will be executed for your specified Import From Database configuration, save your specifications and then click on the Code tab of the Node Viewer. You can also use this tab to refine the code further.
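     For orientation, a database import in a Pandas session generally corresponds to a pandas.read_sql_query-style call. The following is a hedged sketch, assuming a SQLAlchemy engine and hypothetical connection details and table names; the node's Code tab shows the actual generated code.

       import pandas as pd
       from sqlalchemy import create_engine

       # Hedged sketch only; the connection string and table names are hypothetical.
       engine = create_engine("postgresql://user:password@host:5432/dbname")

       # Table as the Table Type: import a table directly, limiting rows as in
       # the Number of Rows setting (1,000 by default).
       df = pd.read_sql_query("SELECT * FROM my_schema.my_table LIMIT 1000", engine)

       # SQL as the Table Type: join and transform tables before importing.
       query = """
           SELECT o.order_id, o.amount, c.customer_name
           FROM orders o
           JOIN customers c ON c.customer_id = o.customer_id
       """
       df = pd.read_sql_query(query, engine)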


  7. Complete the Import From Database configuration by pressing Save. To cancel your changes and return to the Workflow Canvas, press Discard or simply close the Node Viewer. To execute the import operation, click the Run button.

Working with Column Formats

The following table describes how different date/time formats are recognized when importing text files into SmartWorks Analytics using the corresponding engine.

Date/Time Format         Column Type (Pandas)    Column Type (Apache Spark)

MM/DD/YYYY               object                  string
DD/MM/YYYY               object                  string
YYYY/MM/DD               object                  string
MM-DD-YYYY               object                  string
DD-MM-YYYY               object                  string
YYYY-MM-DD               object                  timestamp
MM/DD/YYYY HH:MM:SS      object                  string
DD/MM/YYYY HH:MM:SS      object                  string
YYYY/MM/DD HH:MM:SS      object                  string
YYYY-MM-DD HH:MM:SS      object                  timestamp
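To see the Pandas half of this table in action, you can read a small sample and inspect the resulting dtypes. A minimal sketch with inline, made-up data:

  import pandas as pd
  from io import StringIO

  # Both date columns are read as the generic object dtype, matching the table above.
  csv_data = StringIO("created,updated\n2023-01-15,01/15/2023\n2023-02-20,02/20/2023")
  df = pd.read_csv(csv_data)
  print(df.dtypes)   # created: object, updated: object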

 

The following table describes how different numeric formats are recognized when importing text files into SmartWorks Analytics using the corresponding engine.

Numeric Format           Column Type (Pandas)    Column Type (Apache Spark)

General                  int64                   integer
Financial                float64                 double
Currency                 object                  string
Percent                  object                  string
Scientific notation      float64                 double

 

You can change the data type of a column via Column Changes > Change Type.
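In a Pandas session, the Change Type option corresponds to standard type conversions. A hedged sketch with a hypothetical DataFrame:

  import pandas as pd

  # Hypothetical data; Change Type performs the equivalent conversion in the application.
  df = pd.DataFrame({"order_date": ["2023-01-15", "2023-02-20"],
                     "amount": ["1,200.50", "980.00"]})

  df["order_date"] = pd.to_datetime(df["order_date"])             # object -> datetime64[ns]
  df["amount"] = df["amount"].str.replace(",", "").astype(float)  # object -> float64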