This section will explain how to access your environment.
The lab instructors will provide everyone with a set of URLs at the beginning of the lab.
Accessing the URL will open a Skytap environment that is password protected.
Typing in the password provided by the lab instructor will give you access to 4 virtual machines. Turn the VMs on by clicking the "Play" button.
Once the VMs are turned on, click on the tile to access the virtual machine.
Use the toolbar at the top to make the window fullscreen. The "Ctrl-Alt-Del" button can be used to log into Windows environments.
NOTE: The Skytap environment will automatically turn off the VMs after two hours of inactivity. You can simply start them again to resume your work.
| VM | Username | Password |
| --- | --- | --- |
| Windows VM | Administrator | inf0Xerver |
| Linux VM | root | inf0Xerver |
| Information Server | isadmin | inf0Xerver |
Welcome to our workshop! In this workshop we'll be using IBM Information Server to design data processing jobs with DataStage and create rules to analyze data. Along the way, you will:
- Get acquainted with Information Server
- Run a few DataStage jobs
- Learn the ins and outs of data rules
- Have fun!
IBM InfoSphere Information Server is a market-leading data integration platform which includes a family of products that enable you to understand, cleanse, monitor, transform and deliver data, and to collaborate to bridge the gap between business and IT. InfoSphere Information Server provides massively parallel processing (MPP) capabilities to deliver a highly scalable and flexible integration platform that handles all data volumes, big and small.
DataStage Flow Designer enables users to create, edit, load, and run DataStage jobs which can be used to perform integration of data from various sources in order to glean meaningful and valuable information.
The purpose of this lab is to familiarize ourselves with the DataStage Flow Designer and the Operations Console.
In this lab, you will learn the following:
- How to create a job in DataStage.
- How to read tabular data from a file into DataStage.
- How to perform transformations (such as filtering out records and removing duplicate values) on a table.
- How to write tabular data from DataStage into a file.
- How to run jobs.
- How to view logs for jobs.
- How to schedule jobs.
This lab consists of the following steps:
Before we start the lab, let's switch to the **iis-client** VM and launch **Firefox**.

Click on **Classic Launchpad** in the Bookmarks tab. The first time you try this out, you might see a certificate error. To get past it, click on **Advanced...** and then click **Accept the Risk and Continue**.

Click on **DataStage Flow Designer**.

Login with the credentials `isadmin`/`inf0Xerver`.

This brings up the **DataStage Flow Designer**. Click **OK**.

On the IBM DataStage Flow Designer, click on the **Projects** tab and click **+ Create**. In the modal that opens up, type in a name for the project and click **Create**.

The project takes a few minutes to be created and, once ready, it will be visible on the **Projects** tab.

Click on the tile for your newly created project. In the modal that opens up, verify that the name of your project is provided as the **Project Name** and click **OK** to switch the project.
We will be using the `raviga-products.csv` file, which contains tabular data having 2 columns:
On the DataStage Flow Designer, click on the **Table Definitions** tab and then click **+ Create**. In the modal that opens up, provide the definition of the table contained in the input file. Go to the **General** tab and provide the Data source type as `File`, the Data source name as `raviga` and the Table/File name as `ravigaproducts`. If you wish to, you can also provide an optional Short description and Long description.

Next, go to the **Columns** tab and click **+ Add**. A new entry is added in the table. Edit the entry and update the Column name to `ProductID`. Mark the column as Key. Update the SQL type to `Varchar` and the Length to `9`.

Click on **+ Add** to add another entry in the table. Edit this entry and update the Column name to `Zip`, the SQL type to `Integer`, Extended to `""` (blank) and the Length to `6`. Mark the column as Nullable. Click **OK**. In the modal that pops up, click **Save**.
Click on the **Jobs** tab and then click **+ Create**. Click **Parallel Job**.

A new tab with the name `Job_1*` opens up where you can now start designing the parallel job.
Before designing the job, take a few minutes to look at the various buttons and menus available on the screen.
The **Show/Hide Palette** button can be used to show or hide the palette on the left of the screen, which contains the connectors and stages that can be used while designing the job.

The **Save**, **Compile** and **Run** icons are used to save, compile and run the job respectively.

The **Compare** icon is used to compare this job with another job in the Project.

The **View** menu has the following options:
- View OSH code - available once the job has been successfully compiled.
- View log - available once the job has been run.
- Properties - based on what is selected in the job canvas before clicking on **Properties**, either the properties of the job or of a connector/stage/link are displayed.

The **Schedule** icon is used to set a schedule to run the job.

The **Settings** menu has a number of options, such as:
- Apply horizontal layout - arranges all the connectors and stages in the canvas horizontally, with data flowing from left to right.
- The ability to view/hide annotations, arrows, link names, and the type of the connector/stage.
- Smart palette - applies smart logic based on usage patterns to reorder the items available in the palette. If disabled, the items in the palette are displayed in alphabetical order.
- Smart stage suggestions - applies smart logic based on usage patterns to suggest the next stage that you might want to add to the job.
Drag a File connector to the canvas. The **Table Definition Asset Browser** opens up. Select `raviga\ravigaproducts` and click **Next**.

Ensure that both the columns `ProductID` and `Zip` are selected and click **Add to Job**.
The File connector should now be visible on the canvas. Drag and drop a Filter stage on the canvas. Provide the output of the File connector as the input to the Filter stage. To do this, click on the little blue dot on the right side of the File connector and drag the mouse pointer to the Filter stage.
NOTE: For another method to connect the File connector to the Filter stage, click on the File connector to select it and then drag and drop the Filter stage. The Filter stage will automatically be connected to the File connector.
Double click on the File connector to open the Connector's Properties page. Provide the File name as `/IBMdemos/raviga-products.csv`, which is where the source file is located. Further down, provide the File format as `Comma-separated value (CSV)`.

Scroll further down and update the Null value to `""`. Click **OK**.

Double click the Filter stage to open the Stage's Properties page. Provide the value for the Where Clause as `Zip IS NOT NULL`. This will filter out all the records in the table where the value of Zip is NULL. Click **OK**.
We will use the Peek stage to look at the intermediate state of the table. Drag and drop a Peek stage to the canvas and provide the output of the Filter stage as the input to the Peek stage.
Next, we will add a Remove Duplicates stage and provide the output of the Peek stage as the input to the Remove Duplicates stage.
Double click on the Remove Duplicates stage to open the Stage's Properties page. Under KEYS THAT DEFINE DUPLICATES, select `Zip`. Set the value of Duplicate to retain as `First`. This ensures that if there are any duplicate Zip values, the job will only retain the first record out of the records where this Zip value occurs. Click **OK**.

Drag and drop a File connector to the canvas. In the modal that opens up, check the **Add connector as target** checkbox and click **Add to Job**.

Provide the output of the Remove Duplicates stage as the input to the File connector. Double click the File connector to open the stage page, and provide the File name as `output.csv`. Specify `Comma-separated value (CSV)` as the File format. Click **OK**.
Click the **Save** icon to save the job. If you wish to, you can provide a different name for the job in the modal that pops up. Click **Save**. Once the job is saved, click on the **Compile** icon to compile it. If compilation is successful, you should see a green check mark and the message `Compiled successfully` displayed on the screen. The OSH Code will also be available for viewing.

If the compilation was not successful, you should see a red error icon and the message `Compiler failed with ### errors.` where ### represents the number of errors. Clicking on **View errors** brings up a modal that lists all the errors that were found. The stage/connector/link that has the error is also highlighted in the canvas and you can hover over it to see the error.

Click the **Run** icon to run the job. In the modal that opens up, click **Run**.

Once the run completes, you should see the status of the run on the screen. The status will also indicate if there were any warnings or errors. Click on **View log** to get more details about the run.
Everything that happens as part of the job run is available in the log. This includes warnings, errors and informational messages such as the results of the Peek stage. Scroll through the log to reach the Peek results. By default, the Peek stage only logs the first 10 records of each partition (the data was split into two partitions in this case). Note that neither Peek log contains any records where the Zip value is NULL.
At the bottom of the log, click on **Open Operations Console**. Alternatively, you can open the **Classic Launchpad** in Firefox and click on **Operations Console**.

The Operations Console can be used to view logs for previous runs of the job. Switch to the **Projects** tab. In the left-hand navigation pane, expand **Lab1** (the transformation project), followed by **Jobs**. Click on **Job_1** (the job that was just saved/compiled/run). The **Job Runs** table at the bottom of the screen will be updated to reflect all the times that the job was run. Click on any job run in the list to select it and click **View Details**.
Go to the **Log Messages** tab and check the **Full Messages** checkbox to see the complete log for the job run. Clicking on an entry in the job log will display the log message details at the bottom.
DataStage Flow Designer provides the ability to run jobs as per a schedule.
Back in the DataStage Flow Designer where we were designing the job, click on the **Schedule** icon. In the modal that opens up, provide the frequency for running the job - whether you want to run it every **Day**, **Week** or **Month**. Next to it, provide the time of day when the job should run. Click **Schedule**.

Alternatively, you can go up one level to the **Jobs** tab and hover over your job's tile to see the kebab icon (⋮). Click on the icon and select **Schedule** to open up the scheduling modal where you can set up a schedule for your job.
NOTE: You can only schedule your job once it has been successfully compiled.
The output file will be saved on the server. Switch to the server VM by clicking the first icon on the **Environment VMs panel** and selecting **iis-server**. Login as the `root` user with the password `inf0Xerver`.

`cd` to the location where you had stored the file. If you provided a path starting at "/", then it will be stored at that location on the server. Since we only provided `output.csv` as the file path in the File connector, the file will be available in the Transformation project's folder.
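For example, a minimal sketch assuming the default Information Server install path and a project named `Lab1` (both are assumptions; adjust them to your environment):

```sh
# The engine's default project directory is an assumption about this install
cd /opt/IBM/InformationServer/Server/Projects/Lab1
cat output.csv   # the filtered, de-duplicated records
```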
CONGRATULATIONS!! You have completed this lab!
You can use the REST protocols to incorporate IBM DataStage Flow Designer features in your custom application. You use the REST protocol to perform actions such as compiling jobs, running jobs, and getting job status.
The following REST calls are supported:
- XML
- JSON
In this lab we'll learn how to call a JSON endpoint and pipe those results to a file using DataStage.
This lab consists of the following steps:
Before we start the lab, let's switch to the **iis-client** VM and launch **Firefox**.

Launch the desktop client by going to the start menu and searching for **DataStage Designer**.

Start a new **Parallel Job** project and create a job that looks like the image below. Remember to wire the elements together. It should have:
- 1 x Hierarchical Data
- 1 x Peek
- 1 x Sequential File
Double click on the **Hierarchical Data** node and click on **Edit assembly**.
The "Assembly Editor" will launch. This allows for fine tuning the properties of certain jobs.
Click on **Palette** and add a **REST** step and a **JSON Parser** step.

From the **REST** step, in the **General** tab, specify the following URL:

From the **REST** step, in the **Response** tab, specify the content type to be `text/plain`:

From the **REST** step, in the **Mappings** tab, click on the **Map Automatically** button:
Before proceeding to the **JSON Parser** step, click on the **Libraries** button on the top left and choose the **New Library** button.

Click on **Import New Resource** and browse to the newly downloaded file.

DataStage will automatically use this file as a JSON schema.
From the **JSON Parser** step, in the **Configuration** tab, choose the **String set** option and pick the value from the drop-down menu; there is only one result.

From the **JSON Parser** step, in the **Document Root** tab, specify the document root to be the new library from the previous step, likely called `employees`.
From the **Output** step, in the **Mappings** tab, click on the **Map Automatically** button. Remove the image profile mapping and any others you do not want to use.

Click **OK** to exit out of the **Assembly Editor**.
Before running the job we need to modify the **Sequential File** stage by double clicking on its icon in the designer. Specify an output file, say `output.csv`.

Also update the columns by adding the `id`, `employee_name`, `employee_salary`, and `employee_age` fields.
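If you'd like to sanity-check the endpoint's payload before running the job, a quick hedged peek from any shell works; `<endpoint-url>` below is a placeholder for the URL you entered in the General tab:

```sh
# <endpoint-url> is a placeholder; substitute the URL from the General tab
curl -s "<endpoint-url>" | head -c 300   # eyeball the JSON and its field names
```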
Finally, compile and run the job.
The output file will be saved on the server. Switch to the server VM by clicking the first icon on the **Environment VMs panel** and selecting **iis-server**. Login as the `root` user with the password `inf0Xerver`.

Change your directory using `cd` to the location where you had stored the file. If you provided a path starting at "/", then it will be stored at that location on the server. Since we only provided `output.csv` as the file path in the File connector, the file will be available in the Transformation project's folder (see the sketch below).

Finally, output your results using the `cat` command.
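As in the previous lab, a minimal sketch assuming the default install path (the project name is a placeholder):

```sh
cd /opt/IBM/InformationServer/Server/Projects/<your-project>   # assumed default path
cat output.csv   # records fetched from the REST endpoint
```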
CONGRATULATIONS!! You have completed this lab!
IBM DataStage Flow Designer allows you to read data from a mainframe. More specifically, you can specify inputs in your DataStage job to be in EBCDIC format and import COBOL copybooks as table definitions.
This lab consists of the following steps:
The example binary data should be downloaded to the server. Switch to the server and run:
Before we start the lab, let's switch to the **iis-client** VM and launch **Firefox**.

Launch the desktop client by going to the start menu and searching for **DataStage Designer**.

Start a new **Parallel Job** project and create a job that looks like the image below. Remember to wire the elements together. It should have:
- 1 x Complex Flat File
- 1 x Peek
- 1 x Sequential File

In the toolbar click on **Import** > **Table Definitions** > **COBOL File Definitions**.

Specify the downloaded copybook file and click **Import**.
You have just imported your copybook definitions!
The first step is to double click on the Complex Flat File node, go to the **File options** tab, and specify the example binary file as the input. Critically, we must specify **Fixed block** as the record type.

Go to the **Record options** tab and choose the **Binary** data format and **EBCDIC** as the character set.

Go to the **Records** tab and click the **Load** button; this will give us the option to specify a copybook.

Choose the copybook that was imported in the previous step.

Select the **>>** icon to use all fields from the copybook.

The **Records** tab should now show the various column names from the copybook.
Double clicking on the **Peek** node allows us to map output from the **Complex Flat File**. Click on the **Output** section and choose the **Columns** tab.

Enter the following new columns:
- `ID`
- `SHORT_NAME`
- `CLIENTID`
- `COMPANY_ID_NUM`
- `COMPANY_ID_STR`
- `REGISTRATION_NUM`
- `NUMBER_OF_ACCTS`
- `ACCOUNT_NUMBER`
- `ACCOUNT_TYPE_X`

Still in the **Output** section, now click on the **Mapping** tab and choose to **Auto-Match**.
Double clicking on the **Sequential File** node brings up a single option. Specify an output filename, for example `mainframe.csv`.
Compile and run the job using the usual icons from the toolbar.
After running the job you can view the output from the Designer tool by clicking on the Sequential File node and clicking the **View Data** button. Click **OK** on the next dialog.
You'll be able to see one row of data with an ID, SHORT_NAME and a few other fields.
The output file is also written to the server. Switch to the server VM by clicking the first icon on the **Environment VMs panel** and selecting **iis-server**. Login as the `root` user with the password `inf0Xerver`.

Change your directory using `cd` to the location where you had stored the file.

Finally, output your results using the `cat` command.
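Again, a hedged sketch assuming the default install path (the project name is a placeholder):

```sh
cd /opt/IBM/InformationServer/Server/Projects/<your-project>   # assumed default path
cat mainframe.csv   # the decoded EBCDIC records
```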
CONGRATULATIONS!! You have completed this lab!
DataStage Flow Designer enables users to create, edit, load, and run DataStage jobs which can be used to perform integration of data from various sources in order to glean meaningful and valuable information.
The purpose of this lab is to design a DataStage job to satisfy the following problem statement:
As a data engineer, you have been asked by the Line of Business that you support to produce a new data file that contains all employees whose total compensation is more than $50,000. The file must also contain the Department Name that the employee works in, and the mean average salary of all employees in that department who earn more than $50,000. In addition, the file must be sorted in descending order, based on the mean average salary amount. Finally, the application that will consume this file expects the full name of the employee to be in one field, formatted as "first, middle initial, last".
In this lab, you will learn:
- How to create a job in DataStage.
- How to load data from Db2 Warehouse into DataStage.
- How to perform transformations such as modifying tables, joining tables, aggregating table data and sorting table data.
- How to write tabular data from DataStage into a file.
- How to run jobs.
- How to view logs for jobs.
This lab consists of the following steps:
Before we start the lab, let's switch to the **iis-client** VM and launch **Firefox**.

Click on **Classic Launchpad** in the Bookmarks tab. The first time you try this out, you might see a certificate error. To get past it, click on **Advanced...** and then click **Accept the Risk and Continue**.

Click on **DataStage Flow Designer**.

Login with the credentials `isadmin`/`inf0Xerver`.

This brings up the **DataStage Flow Designer**. Click **OK**.

On the IBM DataStage Flow Designer, click on the **Projects** tab and click **+ Create**. In the modal that opens up, type in a name for the project and click **Create**.

The project takes a few minutes to be created and, once ready, it will be visible on the **Projects** tab.

Click on the tile for your newly created project. In the modal that opens up, verify that the name of your project is provided as the **Project Name** and click **OK** to switch the project.
The input tables - **EMP** (containing employee data) and **DEPT** (containing department data) - are already loaded in Db2 Warehouse. Let's add a Db2 Warehouse instance as a **Connection** in DataStage.

Click on the **Connections** tab and then click **+ Create** to add a new connection.

Provide the following connection details and click **OK**. Click **Save** on the new modal that pops up.

A tile for the new connection will now be displayed in the **Connections** tab.
Click on the **Jobs** tab and then click **+ Create**. Click **Parallel Job**.

A new tab with the name `Job_1*` opens up where you can now start designing the parallel job.
The first step is to load the input tables **DEPT** and **EMP** into DataStage.

The **DEPT** table contains the following columns:

The **EMP** table contains the following columns:
First, drag a Connection connector to the canvas. In the modal that opens up, select the **DB2WH** connection that was created earlier and click **Next**.

On the next screen, select the **BLUADMIN** schema and click **Next**.

On the next screen, select the **DEPT** table and click **Next**.

On the next screen, click **Add to Job**.

Drag another Connection connector to the canvas and repeat the steps given above but this time, select the **EMP** table instead. Once you complete the steps, you should see the two Connection connectors on the canvas.
Use a Transformer stage to perform the following two modifications:
- Update the output of the **EMP** table by replacing any NULL `MIDINIT` values with `" "`. This is needed for a future step where we will combine the FIRSTNME, MIDINIT and LASTNAME columns to create the FULLNAME of the employee.
- Currently, the **EMP** table uses the `WORKDEPT` column to identify the department number whereas the **DEPT** table uses the `DEPTNO` column. Modify the output of the **EMP** table by changing the name of the `WORKDEPT` column to `DEPTNO`. This is needed for a future step where we will Join the two tables.
Drag and drop a Transformer stage next to the Connection connector for the **EMP** table. Provide the output of the **EMP** table Connection connector as the input to the Transformer stage. For this, click on the little blue dot on the right side of the Connection connector and drag the mouse pointer to the Transformer stage.

NOTE: For another method to connect the Connection connector to the Transformer stage, click on the Connection connector to select it, then drag and drop the Transformer stage. The Transformer stage will automatically be connected to the Connection connector.
Drag and drop a Join stage to the canvas and provide the output of the Transformer stage as the input to this Join stage.
Double click on the Transformer stage to open up the stage page. In the **Properties** tab, click **+ Add**. A new entry is added in the table below. Use this to define a "Stage variable" (a local variable that is available only within this stage) named `MIDINITIAL`, which will represent the middle initial of the employee with any NULL values replaced with the empty string. Double click on `{derivation}` to open up the Derivation Builder.

Begin building the derivation rule for `MIDINITIAL` by finding the `NullToValue` function in the table. Clicking on the entry in the table will insert it in the "Derivation" at the top. You can also use the search bar to look for the function. Replace the `%input_column_input_column%` with the `<Link>.MIDINIT` input variable, which can also be found in the table, and the `%value%` with `" "`. `<Link>` represents the identifier of the input link for the Transformer. Click **OK** to go back to the Stage page.
Update the Name of the stage variable to MIDINITIAL, the SQL type to Char, the Precision to 1 and the Scale to 0.
Now go to the **Outputs** tab and in the table find the entry for the `MIDINIT` column. Double click on the derivation value for this entry (it will open up the Derivation Builder) and change the derivation value to use the newly created `MIDINITIAL` stage variable instead. Next, look for the entry for the `WORKDEPT` column. Double click on the `WORKDEPT` value under the **Column name** column and replace the text with `DEPTNO`. Click **OK**.

Both tables now have a column called `DEPTNO`, which can be used to join them. Provide the output of the **DEPT** table Connection connector as the second input to the Join stage. Double clicking the Join stage brings up the stage page, where you can verify that `DEPTNO` is being used as the **JOIN KEY** and the **Join Type** is **Inner**.
Next, add a Transformer stage to the canvas and provide the output of the Join stage as the input to the Transformer stage.
Double click the Transformer stage to open the stage page. We will add 2 stage variables to this Transformer stage. In the **Properties** tab, click **+ Add**. A new entry is added in the table below. Use this to define a stage variable named `TOTALCOMP`, which will represent the total compensation (the sum of bonus, commission and salary) of the employee. Double click on `{derivation}` to open up the Derivation Builder.

Build the derivation rule for TOTALCOMP by finding the `BONUS`, `COMM` and `SALARY` input columns in the table. Clicking on the entries in the table will insert them in the "Derivation" at the top. Type in `+` signs between them in order to complete the derivation. You can also use the search bar to look for the fields. When inserted in the "Derivation", the column names will be prepended with the identifier of the input link. Click **OK** to go back to the Stage page.
Update the Name of the stage variable to TOTALCOMP, the SQL type to Decimal, the Precision to 9 and the Scale to 2.
Repeat the process above to add another stage variable, `FULLNAME`, which will represent the complete name of the employee. Provide the Derivation as `CompactWhiteSpace(<Link>.FIRSTNME:" ":<Link>.MIDINIT:" ":<Link>.LASTNAME)`, the Name of the stage variable as FULLNAME, the SQL type as Varchar, the Precision as 36 and the Scale as 0. Click **OK** to save the changes and return to the canvas.

NOTE: `<Link>` needs to be replaced with the identifier of the input link. CompactWhiteSpace is a function that will compact any continuous white space into a single white space. `:` is the operator used for concatenation. See the sketch below for an example.
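To make this concrete, here is an illustrative sketch of how the derivation behaves, assuming a hypothetical input link named `Link_2` and made-up sample values:

```
CompactWhiteSpace(Link_2.FIRSTNME : " " : Link_2.MIDINIT : " " : Link_2.LASTNAME)

-- With FIRSTNME = "CHRISTINE", MIDINIT = "I", LASTNAME = "HAAS" (sample values):
--   "CHRISTINE I HAAS"
-- With a NULL MIDINIT already replaced by " " via the MIDINITIAL stage variable,
-- the run of spaces is compacted to one:
--   "SEAN" : " " : " " : " " : "O'CONNELL"  ->  "SEAN O'CONNELL"
```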
Next, add a Join stage and an Aggregator stage to the canvas. Connect the Transformer stage to both these stages such that the output of the Transformer stage is provided as the input to both these stages.
Since the output links have now been added, we can provide the 2 stage variables `TOTALCOMP` and `FULLNAME` as outputs of the Transformer stage. Once that is done, these values will be available as inputs in the subsequent stages.

Double click on the Transformer stage to open the stage page. Go to the **Outputs** tab. Click on **+ Add** twice to add 2 new entries in the table. Update one entry with the Derivation value as TOTALCOMP, the Column name as TOTALCOMP, the SQL type as Decimal, the Precision as 9, the Scale as 2, and set Nullable to true. Update the second entry with the Derivation value as FULLNAME, the Column name as FULLNAME, the SQL type as Varchar, the Precision as 36 and the Scale as 0.

We also need to add a Constraint here, which will ensure that only the records with `TOTALCOMP` more than 50000 are sent in the output. Click on the empty space under **Constraint** to open up the Derivation Builder. Specify the derivation as `TOTALCOMP>50000`.

Switch to the second output link by clicking on the `Link_<number>` under Output name and repeat the above steps to add the 2 stage variables to the output and to add the constraint. Click **OK** to save the changes and return to the canvas.
Provide the output of the Aggregator stage as the input to the Join stage. Double click on the Aggregator stage to open the stage page. Select `DEPTNAME` as the grouping key.

Scroll down to the Aggregations and select the Aggregation Type as `Calculation`, the Column as `TOTALCOMP`, the Calculation Type as `Mean Value`, and click **Add Calculation +**.

Go to the **Outputs** tab and verify that you can see 2 output columns - `MEAN_TOTALCOMP` and `DEPTNAME`. Click **OK** to save the changes and return to the canvas.
Drag and drop a Sort stage on the canvas and provide the output of the Join stage as the input of the Sort stage.
Double click on the Join stage to open the stage page. On the **Properties** tab, verify that the Join Key is `DEPTNAME` and the Join Type is `Inner`.

Go to the **Outputs** tab and verify that you can see `FULLNAME` and `MEAN_TOTALCOMP` in the output column list.

Double click on the Sort stage to open up the stage page. Specify the Sorting Key as `MEAN_TOTALCOMP` and the Sort Order as `Descending`. Click **OK**.
Drag and drop a File connector to the canvas. In the modal that opens up, check the **Add connector as target** checkbox and click **Add to Job**.

Provide the output of the Sort stage as the input to the File connector. Double click the File connector to open the stage page, and provide the name of the output File as `output.csv`. Specify `Comma-separated value (CSV)` as the File format. Enable **First row is header** and provide the Null value as `""`. Click **OK**.

Click the **Save** icon to save the job. If you wish to, you can provide a different name for the job in the modal that pops up. Click **Save**. Once the job is saved, click on the **Compile** icon to compile it. If compilation is successful, you should see a green check mark and the message `Compiled successfully` displayed on the screen.

Click the **Run** icon to run the job. In the modal that opens up, click **Run**.
The output file will be saved on the server. Switch to the server VM by clicking the first icon on the **Environment VMs panel** and selecting **iis-server**. Login as the `root` user with the password `inf0Xerver`.

`cd` to the location where you had stored the file. If you provided a path starting at "/", then it will be stored at that location on the server. Since we only provided `output.csv` as the file path in the File connector, the file will be available in the Transformation project's folder.
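As in the first lab, a minimal hedged sketch assuming the default Information Server install path (the project name is a placeholder):

```sh
cd /opt/IBM/InformationServer/Server/Projects/<your-project>   # assumed default path
cat output.csv   # full names with mean total compensation, sorted descending
```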
CONGRATULATIONS!! You have completed this lab!
Information Server enables users to better understand their data. It uses pre-built and custom rules that will apply meaning and quality measurements, which are available for users of the data and interested parties.
DataStage Flow Designer enables users to create, edit, load, and run DataStage jobs which can be used to perform integration of data from various sources in order to glean meaningful and valuable information.
This lab will give you hands-on experience using Information Server's DataStage Flow Designer and rule capabilities. In this lab, you will learn the following:
- How to join two data sets using DataStage Flow Designer to create a single data set
- How to create a data rule that runs on columns from each data set
- How to apply the rules to the joined data set
- How to view data that does not conform to the rules
This lab consists of 2 sections. In the first section, we will see how to associate records within 2 data sets to create a single combined data set using DataStage Flow Designer. In the second section, we will apply a data rule on the combined data set and analyze the data that does not conform to the rule.
Note: The merged data set is already available in DB2WH. If you wish to, you may skip section 1 and directly move on to section 2 where you can import this merged data set.
This lab consists of the following steps:
Before we start the lab, let's switch to the **iis-client** VM and launch **Firefox**.

Click on **Classic Launchpad** in the Bookmarks tab. The first time you try this out, you might see a certificate error. To get past it, click on **Advanced...** and then click **Accept the Risk and Continue**.

Click on **DataStage Flow Designer**.

Login with the credentials `isadmin`/`inf0Xerver`.

This brings up the **DataStage Flow Designer**. Click **OK**.

On the IBM DataStage Flow Designer, click on the **Projects** tab and click **+ Create**. In the modal that opens up, type in a name for the project and click **Create**.

The project takes a few minutes to be created and, once ready, it will be visible on the **Projects** tab.

Click on the tile for your newly created project. In the modal that opens up, verify that the name of your project is provided as the **Project Name** and click **OK** to switch the project.
The input tables - **EMP** (containing employee data) and **DEPT** (containing department data) - are already loaded in Db2 Warehouse. Let's add a Db2 Warehouse instance as a **Connection** in DataStage.

Click on the **Connections** tab and then click **+ Create** to add a new connection.

Provide the following connection details and click **OK**. Click **Save** on the new modal that pops up.

A tile for the new connection will now be displayed in the **Connections** tab.
Click on the **Jobs** tab and then click **+ Create**. Click **Parallel Job**.

A new tab with the name `Job_1*` opens up where you can now start designing the parallel job.
The first step is to load the input tables **DEPT** and **EMP** into DataStage. The `WORKDEPT` column of the **EMP** table is the same as the `DEPTNO` column of the **DEPT** table.
First, drag a Connection connector to the canvas. In the modal that opens up, select the **DB2WH** connection that was created earlier and click **Next**.

On the next screen, select the **BLUADMIN** schema and click **Next**.

On the next screen, select the **DEPT** table and click **Next**.

On the next screen, click **Add to Job**.

Drag another Connection connector to the canvas and repeat the steps given above but this time, select the **EMP** table instead. Once you complete the steps, you should see the two Connection connectors on the canvas.
The **EMP** table uses the `WORKDEPT` column to identify the department number whereas the **DEPT** table uses the `DEPTNO` column. Use a Transformer stage to modify the output of the **EMP** table by changing the name of the `WORKDEPT` column to `DEPTNO`. This is needed for a future step where we will Join the two tables.
Drag and drop a Transformer stage next to the Connection connector for the **EMP** table. Provide the output of the **EMP** table Connection connector as the input to the Transformer stage. For this, click on the little blue dot on the right side of the Connection connector and drag the mouse pointer to the Transformer stage.

NOTE: For another method to connect the Connection connector to the Transformer stage, click on the Connection connector to select it, then drag and drop the Transformer stage. The Transformer stage will automatically be connected to the Connection connector.
Drag and drop a Join stage to the canvas and provide the output of the Transformer stage as the input to this Join stage.
Double click on the Transformer stage to open up the stage page. Go to the **Outputs** tab and in the table find the entry for the `WORKDEPT` column. Double click on the `WORKDEPT` value under the **Column name** column and replace the text with `DEPTNO`. Click **OK**.
Both tables now have a column called `DEPTNO`, which can be used to join them. Provide the output of the **DEPT** table Connection connector as the second input to the Join stage. Double clicking the Join stage brings up the stage page, where you can verify that `DEPTNO` is being used as the **JOIN KEY** and the **Join Type** is **Inner**. Click **OK**.

Drag and drop a Connection connector to the canvas. In the modal that pops up, check the box for **Add selected connection as target** and click **Add to Job**. Provide the output of the Join stage as the input to this connector.
Double click on the Connection connector to open up the Properties page on the right. Verify that the URL, username and password are already populated.
Scroll down and, under the **Usage** section, provide the Table name as `<user>_DEPTEMP`, where `<user>` is your name, and update the Table action to `Replace`. Click **OK**.

Click the **Save** icon to save the job. If you wish to, you can provide a different name for the job in the modal that pops up. Click **Save**. Once the job is saved, click on the **Compile** icon to compile it. If compilation is successful, you should see a green check mark and the message `Compiled successfully` displayed on the screen.

Click the **Run** icon to run the job. In the modal that opens up, click **Run**.

After a successful run, the results will be stored within the DB2WH connection in the **BLUADMIN** schema. Because we specified the Table action as `Replace` in the Connection connector that represents the output, each subsequent run of the job will delete all existing records in the table and replace them with the new output.

To view the results of the job, double click on the Connection connector that represents the output. This will open up the Properties page on the right. Click on **View Data** to bring up a modal that shows the contents of the `<user>_DEPTEMP` table in which the output was stored.
This marks the end of Section 1 of this lab.
Switch to the **iis-client** VM and launch **Firefox**.

Click on the **Launchpad** bookmark. When the Information Server launchpad shows up, click on the **Information Governance Catalog New** tile.

Log in with the username `isadmin` and password `inf0Xerver`.
The overview page will appear.
Click on the **Connections** tab and click **+ Create connections**.

On the next screen, provide the following details and click **Test connection** to test the connection. Once successful, click **Save connection**.

Now locate the tile for the connection that was just added, click on the kebab icon (⋮) on the tile and click **Discover**.

Click on **Browse**.

Expand **db2**. Expand **BLUADMIN** and select the `<user>_DEPTEMP` table, where `<user>` is your name. This is the table that was created in section 1 above. If you skipped section 1, then you can select the `SANDHYA_DEPTEMP` table. Click **OK**.

Under **Discovery options**, select **Analyze data quality**. This will automatically check the boxes for **Analyze columns** and **Assign terms**.

Scroll to the bottom and select the **Host** as `IIS-SERVER.IBM.DEMO` and the **Workspace** as `UGDefaultWorkspace`. Click **Discover**.
The discovery process will take some time. Once the assets are discovered, the analysis process will begin. Once that completes, you will be able to see what percentage of records were successfully analyzed.
Now let us go and have a look at the data. Go to the **Quality** tab and click on the tile for `UGDefaultWorkspace`.

The workspace overview will load. Take a few moments to browse the graphics on the page and click on the **Data sets** link to view the data in this exercise.

Before we create new rules, let's look at the data set that will be used in this lab. Click on the tile for `<user>_DEPTEMP`.

Click on the **Columns** tab to view findings from the analyzer. It found many things when the data was imported, like maximum and minimum values, distinct values, format, and uniqueness.
We're now ready to create our first data rule!
From the `UGDefaultWorkspace` workspace, click on the **Data rules** tab.

Expand **Published Rules** > **08 Validity and Completeness** > **Valid Value Combination**. Find the rule **IfFieldaIsXThenFieldbGtQty**. Click on the **...** overflow menu on the right and select **Manage in workspace**.

The rule should now be available under **All**. We will now edit the rule. If you wish to rename the rule, you will first need to **Copy** the rule and you will be provided with the option to rename the rule. Click on the **...** overflow menu of the rule and select **Edit**.

Switch to the **Rule logic** tab and update the rule to say `IF DEPTNAME = 'OPERATIONS' THEN SALARY > 36000`.
Next, switch to the **Rule testing** tab. Here you need to bind the variables in the rule logic to specific columns in the data source. Select the `salary` variable in the left table and select the `SALARY` column under Available data sources. It should be under **IIS-SERVER.IBM.DEMO** > **db2** > **BLUADMIN** > **<user>_DEPTEMP**. Click on **+ Bind**. The value `<user>_DEPTEMP.SALARY` will now be shown under Implemented bindings for the `salary` variable.

Uncheck the `salary` variable, check the `deptname` variable and bind it with the `DEPTNAME` column under Available data sources. As in the case of the `SALARY` column, it should be under **IIS-SERVER.IBM.DEMO** > **db2** > **BLUADMIN** > **<user>_DEPTEMP**. Click on **+ Bind**. The value `<user>_DEPTEMP.DEPTNAME` will now be shown under Implemented bindings for the `deptname` variable.
Scroll down and click on **Test** to test out the rule. You will see a message at the top of the screen that says the test results can be viewed once ready. When the message disappears, go to the **Rule test results** tab to view the test results.

You can see that of the 42 total rows, 39 met the rule, and 3 did not. Click on **Did not meet rule conditions** to view the 3 rows. You can see that these rows have `DEPTNAME = OPERATIONS` but have `SALARY < 36000`, and therefore they did not match the rule conditions. Click **Save** to save the rule.
When you are brought back to the **Data rules** tab, you'll notice that the new rule has an error. We need to publish the rule. To do so, navigate to the right to show the **⋯** menu. Choose the **Publish** option from the menu.

In the modal that pops up, click **Publish** to confirm the publish action.
Go back to the `UGDefaultWorkspace` workspace and click on the **Data sets** link.

Click on the `<user>_DEPTEMP` data set.

We can now apply the newly created rule by switching to the **Rules (0)** tab and clicking the **+ Add rule** button.

Choose the **IfFieldaIsXThenFieldbGtQty** rule under **All** and click **Next**.
As before, we need to bind the rule variables to specific columns in our data set. Select the `salary` variable in the left table and select the `SALARY` column under Available data sources. Click on the **+ Bind** button. Once bound, select the `deptname` variable and bind it with the `DEPTNAME` column under Available data sources. Once both variables are bound with the right columns, click **Next**.

This time, we don't need to test the rule. Simply click **Save**.

You should now see the rule in the data set view. Click on the **Analyze** button to restart the analysis with the new rule.

In the modal that pops up, click **Analyze** to confirm analysis.

The analysis will take a few minutes. You may need to refresh your browser a few times. You will see the state go from `Running` to `Successful` when analysis is complete.
Once analysis has completed successfully, go to the **Data quality** tab. You'll see that the new rule has six findings - three against DEPTNAME and three against SALARY. Click on either DEPTNAME or SALARY to view the exceptions.

Scrolling to the right, you'll see that the three entries shown have `DEPTNAME = OPERATIONS` but have `SALARY` less than `36000`, which goes against our rule.
CONGRATULATIONS!! You have completed this lab!
For more info check out the documentation.
Download the file to the desktop.
You may see an error message: `Message bundle error Can't find resource for bundle com.ibm.e2.Bundle_E2_engine_msgs, key OperatorException.operatorTransitionFailedAtRuntime`. To resolve this, go to the **iis-server** machine, change to the temp directory (`cd /tmp`), and delete the XML files (`rm -rf XML*`). Then re-compile and re-run the job. For more info check out .
To keep things simple in this lab, we're going to speak generally about mainframes and COBOL. Files from a mainframe are usually saved as binary files and sFTPed to some server where DataStage can access them. The binary files alone are not enough for DataStage to read the contents. A COBOL copybook is required to translate the data from binary to ASCII. Both files are available here: .
The copybook can be downloaded to the client machine by going to the link provided and saving the file to the desktop. The copybook we're using looks like this:
NOTE: You can use the Peek stage to check intermediate results in the job as demonstrated in .
| Column Name | Data Type |
| ----------- | --------- |
| ProductID | varchar(9) |
| Zip | Integer(6) |
| Column Name | Data Type | Nullable |
| ----------- | --------- | -------- |
| DEPTNO | VARCHAR(3) | N |
| DEPTNAME | VARCHAR(36) | N |
| MGRNO | CHAR(6) | Y |
| ADMRDEPT | CHAR(3) | N |
| LOCATION | CHAR(16) | Y |
| Column Name | Data Type | Nullable |
| ----------- | --------- | -------- |
| EMPNO | CHAR(6) | N |
| FIRSTNME | VARCHAR(12) | N |
| MIDINIT | CHAR(1) | Y |
| LASTNAME | VARCHAR(15) | N |
| WORKDEPT | VARCHAR(3) | Y |
| PHONENO | CHAR(4) | Y |
| HIREDATE | DATE(4) | Y |
| JOB | CHAR(8) | Y |
| EDLEVEL | SMALLINT | N |
| SEX | CHAR(1) | Y |
| BIRTHDATE | DATE(4) | Y |
| SALARY | DECIMAL(9,2) | Y |
| BONUS | DECIMAL(9,2) | Y |
| COMM | DECIMAL(9,2) | Y |
You can integrate your Git server with IBM DataStage. This integration allows you to publish jobs and related artifacts to different Git branches and load other versions of a job from Git onto the IBM DataStage Flow Designer canvas.
Using IBM DataStage Flow Designer with Git provides the following benefits:
- Helps with Continuous Integration/Continuous Delivery (CI/CD) pipeline automation. Your assets are easily available in Git and you can move them from the development (DEV) branch to quality assurance (QA) to production.
- Helps with auditing requirements. You can simply tell who changed what.
- You can map a version of a job in the XMETA repository to a version in Git.
- You can work on multiple versions of a job by creating temporary branches.
- You can easily roll back to a prior version of a job.
The following Git repositories are supported:
- Bitbucket
- GitHub
- GitLab
- Microsoft Team Foundation Server
For more info check out the Managing source control with Git documentation.
In this lab we'll learn how to integrate DataStage with BitBucket Cloud, a free-to-register BitBucket platform available at https://bitbucket.org/.
This lab consists of the following steps:
Navigate to https://bitbucket.org/, click on the "Get it free" button, and create an account.
Create a new repo; we'll call it `datastage-jobs`. Ensure the option to create a `README.md` file is selected.
Your new repo should look like this:
Generate an "App Password" by going to your "Personal Settings". Remember to give the app password the read/write permission for repositories.
Before we start the lab, let's switch to the **iis-client** VM and launch **Firefox**.

Click on **Classic Launchpad** in the Bookmarks tab. The first time you try this out, you might see a certificate error. To get past it, click on **Advanced...** and then click **Accept the Risk and Continue**.

Click on **DataStage Flow Designer**.

Login with the credentials `isadmin`/`inf0Xerver`.

This brings up the **DataStage Flow Designer**. Click **OK**.
From the IBM DataStage Flow Designer, click on the username bubble on the top right. Choose the **Setup** option. We'll be configuring the **Server** and **User** options.

For the **Server**, click on the **Git** tab and input the following:
- Git repo: Your new repo, e.g. `https://bitbucket.org/stevemart/datastage-jobs.git`
- Clone repo path: A new folder name, e.g. `bitbucket/stevemart/datastage-jobs`
- Vendor: BitBucket

For the **User** configuration, input the following:
- Email address: Your BitBucket email
- Username: Your BitBucket username
- Personal access token: Your BitBucket app password from the previous step
Load up any existing job or create a new one. Once loaded, click on the **Export** button that looks like an upload arrow.
The entire DataStage project will be saved and exported to your configured BitBucket repository. Enter a descriptive commit message.
There will be a success message if everything was exported correctly.
Navigate back to your BitBucket repository to see the exported files.
CONGRATULATIONS!! You have completed this lab!
Information Server enables users to better understand their data. It uses pre-built and custom rules that will apply meaning and quality measurements, which are available for users of the data and interested parties.
This lab will give you hands-on experience using Information Server's rule capabilities. In this lab, you will learn the following:
- How to create a data rule
- How to apply the rule to a data set
- How to view data that does not conform to the rule
This section consists of the following steps:
Launch Firefox and click on the **Launchpad** bookmark. When the Information Server launchpad shows up, click on the **Information Governance Catalog New** tile.

Log in with the username `isadmin` and password `inf0Xerver`.
The overview page will appear.
Click on the **Connections** tab and ensure the connection `JK_ORD` appears and the host is `IIS-SERVER.IBM.DEMO`. This connection is pre-established for convenience and was imported to the **JK Orders** workspace. Let's take a look at it.

To find the **JK Orders** workspace, go to the **Quality** tab and click on its tile.

The workspace overview will load. Take a few moments to browse the graphics on the page and click on the **Data sets** link to view the data in this exercise.

Before we create new rules, let's look at the two data sets that will be used in this example. Take a look at the **Sales** data set first.

Click on the **Columns** tab to view findings from the analyzer. It found many things when the data was imported, like maximum and minimum values, distinct values, format, and uniqueness.

Repeat the same for the **CUSTOMERS** data set.
We're now ready to create our first data rule!
From the **JK Orders** workspace, click on the **Data rules** tab.

Expand the **Published Rules** section, then the **01 Personal Identity** section, and then the **Age** section to reveal the **AdultInRangeNumeric** rule.

Navigate to the right to show the **⋯** menu. Choose the **Manage in workspace** option from the menu.

The rule should now appear under the **All** section. Navigate to the right to show the **⋯** menu. Choose the **Edit** option from the menu.

Now we're going to edit the rule so that it flags ages that are under 18 or over 120. To do that, edit the text boxes to the right that show `minage` and `maxage` and update them with `18` and `120` respectively. The formula at the bottom should read:
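The built-in rule supplies the exact operators, so treat the following as an illustrative sketch of the expected logic rather than the literal text:

```
age >= 18 AND age <= 120
```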
Save the rule.
When you are brought back to the **Data rules** tab, you'll notice that the new rule has an error. We need to publish the rule. To do so, navigate to the right to show the **⋯** menu. Choose the **Publish** option from the menu.
Confirm the publish action.
Go back to the **JK Orders** workspace and click on the **Data sets** link.

Click on the **SALES** data set.

We can now apply the rule by clicking the **+ Add rule** button.

Choose the **AdultInRangeNumeric** rule and click **Next**.

We now have to bind the rule to a specific column in our data set. Click on the `age` variable name on the left (this is the rule variable), then click on the `AGE` column name on the right (this is our data source). Click on the **Bind** button and, once bound, you can click **Next**.

We don't need to test the rule this time; we can just click on **Save**.

You should now see the rule in the data set view. Click on the **Analyze** button to restart the analysis with the new rule.
Confirm to start the analysis.
You may need to refresh your browser a few times for this step. It will take a few minutes. You'll see the state go from `Running` to `Successful` when done.

Once complete, go to the **Data quality** tab. You'll see that the new rule has one finding. Click on the finding to view the exception.

Scrolling to the right, you'll see that there is one entry whose age is 17, which is less than the required minimum of 18 for our rule!
CONGRATULATIONS!! You have completed this lab!
This section has hints and tips for instructors:
- Only IBMers can provision an instance; you'll need to log in with your w3ID.
- You're limited to one request at a time.
- You can schedule how long you want to keep the instance live.
- You'll be emailed a link and a password. Distribute these to the lab attendees.
- There's no cost to running these instances.
- Instances will automatically suspend after 2 hours of inactivity. Just turn them back on to continue.
Access the **iis-server** VM.
Log in with the credentials above. Run the following commands.
Change to the Information Server install path:
Edit the `DSODBConfig.cfg` file. Set the `DSODBON` value to `1`. It should be the first value at the top of the file. Save your changes and exit your editor.
If you're not familiar with `vi`, you can use `sed`:
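For example, a one-liner along these lines should work, assuming you are in the directory containing the file and the value currently reads `DSODBON=0`:

```sh
# Flip DSODBON from 0 to 1 in place, keeping a backup of the original file
sed -i.bak 's/^DSODBON=0/DSODBON=1/' DSODBConfig.cfg
```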
Restart your console:
Access the **iis-server** VM.
Log in with the credentials above. Run the following commands.
Change to the right folder:
Download the data:
NOTE: the command uses `-O` (the capital letter O, not zero).
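If you need to re-run the download by hand, it is a plain `curl` fetch; `<data-file-url>` is a placeholder for the URL provided with the lab:

```sh
curl -O "<data-file-url>"   # -O saves the file under its remote name
```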
Verify the content
Use the link provided to provision instances of the Skytap environment as needed.