# Data Visualization with Data Refinery

Let's take a quick detour to the *Data Refinery* tool. Data Refinery can quickly filter and mutate data, create quick visualizations, and perform other data cleansing tasks from an easy-to-use user interface.

This section is broken up into the following steps:

1. [Load the data](/cloudpakfordata-credit-risk-workshop/credit-risk-workshop/data-visualization-and-refinery.md#1-load-data)
2. [Refine the data](/cloudpakfordata-credit-risk-workshop/credit-risk-workshop/data-visualization-and-refinery.md#2-refine-data)
3. [Profile the data](/cloudpakfordata-credit-risk-workshop/credit-risk-workshop/data-visualization-and-refinery.md#3-profile-data)
4. [Visualize the data](/cloudpakfordata-credit-risk-workshop/credit-risk-workshop/data-visualization-and-refinery.md#4-visualize-data)

> *Note: The lab instructions below assume you have a project already and have data you will refine. If not, follow the instructions in the pre-work and import data to project sections to create a project and assign data to your project.*

## 1. Load Data

* Go to the (☰) navigation menu and, under the *Projects* section, click on *`All Projects`*.

![(☰) Menu -> Projects](/files/-MSeC04BTKhxPF3_Mq-v)

* Click the project name you created in the pre-work section.
* From the `Project` home, under the `Assets` tab, ensure the `Data assets` section is expanded or click on the arrow to toggle it and open up the list of data assets.
* Click the merged data asset *`XXXAPPLICANTFINANCIALPERSONALLOANDATA`* to open it (the name of the file may vary; `XXX` may be your initials or the initials of the person who granted you data access). If this is the first time you are opening the data asset, you will be asked to unlock the connection with your personal credentials. Click the `Use your Cloud Pak for Data credentials to authenticate to the data source` checkbox and then click the *`Connect`* button.

![Authenticate connection](/files/-MWe_-ZWTt-kGNP3Co7e)

* Once the preview of the data asset opens, click on the *`Refine`* button on the top right of the table.

![Launch the action button](/files/-MWe_-ZY7dts03O4cMNb)

* Data Refinery will launch and open to the `Data` tab. It will also display the `Information` panel with details of the Data Refinery flow and where the output of the flow will be placed. Go ahead and click the `X` to the right of the `Information` panel to close it.

![Data Refinery view](/files/-MSeC0JusohC1bc-erPW)

## 2. Refine Data

We'll start out in the `Data` tab where we wrangle, shape and refine our data. As you refine your data, IBM Data Refinery keeps track of the steps in your data flow. You can modify them and even select a step to return to a particular moment in your data’s transformation.

### Create Transformation Flow

* With Data Refinery, we can transform our data by directly entering operations in [R syntax](https://cran.r-project.org/manuals.html) or interactively by selecting operations from the menu. For example, start typing `filter` on the Command line and observe that the list of operations displayed will get updated. Click on the filter operation.

![Command line filter](/files/-MSeC0JvvOqHLMfcOmLp)

* The syntax for the `filter` operation will be displayed in the Command line. Clicking on the operation name within the Command line will give hints on the syntax and how to use the command. For instance, to filter for customers who have paid credits up to date, build the expression shown below. To enact the filter, you would click `Apply`.

```r
filter(`CreditHistory` == 'credits_paid_to_date')
```
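
To try the same expression outside Data Refinery, it can be run in a plain R session. The sketch below assumes the `dplyr` package is installed and that the command-line `filter` follows `dplyr`-style semantics; the `loans` data frame and its `CustomerID` values are made up for illustration, and only the `CreditHistory` column and its values come from the lab data.

```r
library(dplyr)

# Hypothetical sample rows standing in for the lab data set.
loans <- data.frame(
  CustomerID = c("a1", "a2", "a3", "a4"),
  CreditHistory = c("credits_paid_to_date", "prior_payments_delayed",
                    "credits_paid_to_date", "no_credits"),
  stringsAsFactors = FALSE
)

# In Data Refinery the data set is implicit; here it is piped in explicitly.
paid_to_date <- loans %>%
  filter(CreditHistory == "credits_paid_to_date")

print(paid_to_date)
```

Only the rows whose `CreditHistory` equals `credits_paid_to_date` remain in `paid_to_date`.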

* We can remove this custom filter by clicking on the trash icon on the `Custom code` step of our data workflow.

![Clear custom filter](/files/-MWe_-ZaaSdJOAx7j7LH)

* We will use the UI to explore and transform the data. Click the `+Operation` button.

![Choose Operation button](/files/-MSeC0JwQ7YrBlFVuJMr)

* Let's use the `Filter` operation to check some values. Click on `Filter` in the left panel.

![Filter Operation](/files/-MSeC0JxC8PU5Bs8nZsP)

* We want to make sure that there are no empty values in the `StreetAddress` column. Select the `StreetAddress` column from the `Column` drop down list, select *`Is empty`* from the `Operator` drop down list, and then click the `Apply` button.

![Filter is empty](/files/-MSeC0JysRoQYm7xGVRK)
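
As a rough R counterpart to the `Is empty` filter, the sketch below keeps only rows whose `StreetAddress` is missing. It assumes `dplyr` is installed and treats both `NA` and the empty string as "empty", which is an assumption about how the operator behaves; the sample rows are made up.

```r
library(dplyr)

# Hypothetical sample rows; one address is NA and one is an empty string.
loans <- data.frame(
  CustomerID = c("a1", "a2", "a3"),
  StreetAddress = c("12 Main St", NA, ""),
  stringsAsFactors = FALSE
)

# Assumption: "empty" means either NA or a zero-length string.
empty_address <- loans %>%
  filter(is.na(StreetAddress) | StreetAddress == "")

# Number of rows that would be shown by the filter.
empty_count <- nrow(empty_address)
print(empty_count)
```

If `empty_count` is zero, the sampled rows have no empty values in that column.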

> *Note: If there are records where the selected column is empty, they will be displayed after clicking the apply button. If there are no records for this filter, it means that the rows being sampled do not have any empty values for the selected column.*

* Now, click on the counter-clockwise "back" arrow to remove the filter. Alternatively, we can remove the filter by clicking the trash icon for the Filter step in the `Steps` panel on the right.

![Click back arrow](/files/-MSeC0JzGbS1NQ6plBAF)

* We can remove any records with empty values. Click the `+Operation` button again and this time select the `Remove empty rows` operation. Select the `StreetAddress` column, then click the `Next` button and finally the `Apply` button.

![Remove empty rows](/files/-MSeC0K-0YyBOxcQiKbd)

* Let's say we've decided that there are columns that we don't want to leave in our dataset (maybe because they might not be useful features in our Machine Learning model, or because we don't want to make those data attributes accessible to others, or for any other reason). We'll remove the `FirstName`, `LastName`, `Email`, `StreetAddress`, `City`, `State`, and `PostalCode` columns.
* For each column to be removed: Click the `+Operation` button, then select the `Remove` operation. Click the `Change column selection` option.

![Remove Column](/files/-MSeC0K00x3GTMGXxbBu)

* In the `Select column` drop down, choose one of the columns to remove (e.g. `FirstName`). Click the `Next` button and then the `Apply` button. The column will be removed. Repeat for each of the above columns.
* At this point, you have a data transformation flow with 8 steps: one `Remove empty rows` step and seven column removals. As noted earlier, Data Refinery keeps track of each step, and we can undo (or redo) an action using the circular arrows. To see the steps in the data flow that you have performed, click the `Steps` button. The operations that you have performed on the data will be shown.

![Flow](/files/-MSeC0K1JtbYKLHFqAFK)

* You can modify these steps in real time and save for future use.
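
The eight steps built through the UI can be approximated as a single R pipeline. This is a minimal sketch, assuming `dplyr` is installed; the sample rows, the `LoanAmount` values, and the `Risk` labels are made up, and the code Data Refinery actually generates for each step may differ.

```r
library(dplyr)

# Hypothetical sample rows with the columns named in the lab.
loans <- data.frame(
  FirstName = c("Ann", "Bo", "Cy"),
  LastName = c("Lee", "Diaz", "Okoye"),
  Email = c("ann@example.com", "bo@example.com", "cy@example.com"),
  StreetAddress = c("12 Main St", NA, "9 Oak Ave"),
  City = c("Austin", "Boise", "Dover"),
  State = c("TX", "ID", "DE"),
  PostalCode = c("73301", "83702", "19901"),
  LoanAmount = c(250, 4000, 1200),
  Risk = c("No Risk", "Risk", "No Risk"),
  stringsAsFactors = FALSE
)

refined <- loans %>%
  # Step 1: drop rows where StreetAddress is missing.
  filter(!is.na(StreetAddress)) %>%
  # Steps 2-8: remove the seven columns, one per UI step in the lab.
  select(-FirstName, -LastName, -Email, -StreetAddress,
         -City, -State, -PostalCode)

print(names(refined))
```

Writing the flow this way makes the order of operations explicit: the empty-row check has to run before `StreetAddress` is dropped.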

### Schedule Jobs

Data Refinery allows you to run jobs at scheduled times, and save the output. In this way, you can regularly refine new data as it is updated.

* Click on the "jobs" icon and then select the `Save and create job` option from the menu.

![Click jobs icon](/files/-MSeC0K2G_eqt2QjWXTZ)

* Give the job a name and optional description, then click the *`Next`* button.

![jobs name](/files/-MWe_-Zmsae2mj-CAQ2s)

* The job will configure a default input and output data asset, as well as the runtime environment. Click the *`Next`* button.

![jobs name](/files/-MWe_-ZqU06ZTsdVmUBW)

* We can set the job to run on a schedule. For now, leave the schedule off and click the *`Next`* button.

![jobs name](/files/-MWe_-ZreSYjJ_xCUgzB)

* Click the *`Create and Run`* button to save and run this job.

![Create and Run Refinery job](/files/-MSeC0K3V5xKDurDkz3h)

* This refinery flow will be saved to your project in the `Data Refinery flows` section of the project overview page. From that section you could revisit the flow to edit the steps or even see any execution jobs you have run. For now, we will move on to exploring our data.
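
Conceptually, a job reads an input data asset, applies the saved steps, and writes an output asset. The base R sketch below illustrates that idea only; it is not how Data Refinery implements jobs. The functions `refine_flow` and `run_job`, the temporary CSV files, and the sample rows are all hypothetical.

```r
# Hypothetical flow: the lab's steps expressed as a function of a data frame.
refine_flow <- function(df) {
  removed <- c("FirstName", "LastName", "Email", "StreetAddress",
               "City", "State", "PostalCode")
  # Drop rows with a missing or empty StreetAddress, then drop the columns.
  keep <- !is.na(df$StreetAddress) & df$StreetAddress != ""
  df[keep, setdiff(names(df), removed), drop = FALSE]
}

# Hypothetical job: read the input, apply the flow, write the output.
run_job <- function(input_path, output_path) {
  raw <- read.csv(input_path, stringsAsFactors = FALSE)
  refined <- refine_flow(raw)
  write.csv(refined, output_path, row.names = FALSE)
  invisible(nrow(refined))
}

# Demo with temporary files standing in for the input and output assets.
input_path <- tempfile(fileext = ".csv")
output_path <- tempfile(fileext = ".csv")
write.csv(
  data.frame(FirstName = c("Ann", "Bo"),
             StreetAddress = c("12 Main St", NA),
             LoanAmount = c(250, 4000)),
  input_path, row.names = FALSE
)

rows_written <- run_job(input_path, output_path)
print(rows_written)
```

Because the flow is a function of its input, rerunning the job on refreshed data applies the same steps, which is what makes scheduling useful.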

## 3. Profile Data

* Back on the top level of the Data Refinery view, click on the `Profile` tab to bring up a view of several statistics and histograms for the attributes in your data.

![Data Refinery Profile tab](/files/-MSeC0K7-8c9mmHPp6QW)

* Once the data profile loads, you can get insight into the data from the views and statistics:
  * The median age of the applicants is 36, with the bulk under 49.
  * About as many people had credits\_paid\_to\_date as prior\_payments\_delayed. Few had no\_credits.
  * The median was 3 years for duration at current residence. Range was 1-6 years.
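
The same kinds of statistics can be computed in base R. The sketch below uses a tiny made-up data frame, so its numbers only mirror the shape of the observations above; the column names `Age` and `CurrentResidenceDuration` are assumptions about the data set.

```r
# Hypothetical sample rows; column names are assumed, values are made up.
loans <- data.frame(
  Age = c(23, 31, 36, 42, 55),
  CreditHistory = c("credits_paid_to_date", "prior_payments_delayed",
                    "credits_paid_to_date", "prior_payments_delayed",
                    "no_credits"),
  CurrentResidenceDuration = c(1, 2, 3, 4, 6),
  stringsAsFactors = FALSE
)

# Median of a numeric column.
median_age <- median(loans$Age)

# Frequency of each category, like the bars in a profile histogram.
history_counts <- table(loans$CreditHistory)

# Minimum and maximum of a numeric column.
residence_range <- range(loans$CurrentResidenceDuration)

print(median_age)
print(history_counts)
print(residence_range)

# summary() gives min, quartiles, median, mean, and max in one call.
print(summary(loans$Age))
```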

## 4. Visualize Data

Let's do some visual exploration of our data using charts and graphs. Note that this is an exploratory phase and we're looking for insights in our data. We can accomplish this in Data Refinery interactively without coding.

* Choose the `Visualizations` tab to bring up the page where you can select columns that you want to visualize. Select `LoanAmount` from the "Columns to visualize" drop down list as the first column and click `Add another column` to add another column. Next add `LoanDuration` and click the *`Visualize data`* button. The system will pick a suggested plot for you based on your data and show more suggested plot types at the top.

![Select columns](/files/-MMJHcrMT__QnoL2sgu-)

* Remember that we are most interested in knowing how these features impact whether a loan is at risk. So, let's add `Risk` as a color on top of our current scatter plot. That should help us see whether there's something of interest here. From the left panel, click the `Color Map` drop down and select `Risk`. Also, to see the full data, drag the right side of the data selector at the bottom all the way to the right, so that all the data is shown inside your plot.

![Amount v Duration Scatter](/files/-MMJHcrN2JLCB0cRRALa)
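
For readers who want to reproduce a similar chart in code, here is a base R sketch of a scatter plot colored by `Risk`, followed by a numeric comparison of the two groups. The sample rows, the `Risk` labels, and the color choices are made up for illustration.

```r
# Hypothetical sample rows; values are made up.
loans <- data.frame(
  LoanAmount = c(250, 1200, 4000, 7500, 9000, 600),
  LoanDuration = c(6, 12, 24, 36, 48, 10),
  Risk = c("No Risk", "No Risk", "Risk", "Risk", "Risk", "No Risk"),
  stringsAsFactors = FALSE
)

# Map each Risk label to a color (arbitrary choices).
risk_colors <- c("No Risk" = "navy", "Risk" = "skyblue")

# Draw to a null device so the script also runs without a display;
# remove the pdf(NULL) and dev.off() lines to see the chart interactively.
pdf(NULL)
plot(loans$LoanAmount, loans$LoanDuration,
     col = risk_colors[loans$Risk], pch = 19,
     xlab = "LoanAmount", ylab = "LoanDuration")
legend("topleft", legend = names(risk_colors), col = risk_colors, pch = 19)
invisible(dev.off())

# Numeric counterpart: mean amount and duration per Risk group.
mean_by_risk <- aggregate(cbind(LoanAmount, LoanDuration) ~ Risk,
                          data = loans, FUN = mean)
print(mean_by_risk)
```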

* We notice that there are more blue (risk) points towards the top right of this plot than towards the bottom left. This is a good start, as it suggests there is probably a relationship between the riskiness of a loan and its duration and amount: the higher the amount and the longer the duration, the riskier the loan appears to be. Let's dig further into how the loan duration could play into the riskiness of a loan.

> *Note: The colors used in your visualization may be different. Be sure to look at chart legend for clarification*

* Let's plot a histogram of the `LoanDuration` to see if we can notice anything. First, select `Histogram` from the `Chart Type`.
* On the left, select `LoanDuration` for the 'X-axis', select `Risk` in the 'Split By' section, check the `Stacked` option, uncheck the `Show kde curve` toggle, uncheck the `Show distribution curve` toggle. You should see a chart that looks like the following image.

![Visualize loan duration](/files/-MWe_-_-R0Wg79H-JJCj)
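
A stacked histogram split by `Risk` is essentially a cross-tabulation of duration bins against risk labels. The base R sketch below shows that idea on made-up rows; the bin edges are arbitrary assumptions and will not match the bins Data Refinery chooses.

```r
# Hypothetical sample rows; values are made up.
loans <- data.frame(
  LoanDuration = c(6, 10, 12, 18, 24, 30, 36, 48),
  Risk = c("No Risk", "No Risk", "No Risk", "No Risk",
           "Risk", "Risk", "Risk", "Risk"),
  stringsAsFactors = FALSE
)

# Assign each loan to a duration bin (arbitrary bin edges).
bins <- cut(loans$LoanDuration, breaks = c(0, 12, 24, 36, 48))

# Count loans per Risk label within each bin.
counts <- table(Risk = loans$Risk, Duration = bins)
print(counts)

# barplot() stacks the rows of a matrix by default (beside = FALSE).
# Draw to a null device so the script also runs without a display.
pdf(NULL)
barplot(counts, legend.text = TRUE,
        xlab = "LoanDuration bin", ylab = "Number of loans")
invisible(dev.off())
```

Reading down each column of `counts` gives the height of each colored segment in the corresponding stacked bar.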

* It looks like the longer the duration, the larger the blue bars (risky loan count) become and the smaller the dark blue bars (non-risky loan count) become. That indicates that loans with a longer duration are, in general, more likely to be risky. However, we need more information.
* We next explore if there is some insight in terms of the riskiness of a loan based on its duration when broken down by the loan purpose. To do so, let's create a Heat Map plot.
* At the top of the page, in the `Chart Type` section, open the arrows on the right, select `Heat Map`.

![Switch to heat map](/files/-MMJHcrP1UZq_wj0YA-n)

* Next, select `Risk` in the column section and `LoanPurpose` for the `Row` section. Additionally, to see the effects of the loan duration, select `Mean` in the summary section, and select `LoanDuration` in the `Value` section.

![Loan purpose heat map](/files/-MMJHcrQCY_gRmm2_55x)
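
The heat map's cells are the mean `LoanDuration` for each combination of `LoanPurpose` and `Risk`. The base R sketch below computes such a table from made-up rows; the category labels `car_new` and `other` and the `Risk` labels are assumed for illustration.

```r
# Hypothetical sample rows; labels and values are made up.
loans <- data.frame(
  LoanPurpose = c("car_new", "car_new", "car_new", "car_new",
                  "other", "other"),
  Risk = c("No Risk", "No Risk", "Risk", "Risk", "No Risk", "Risk"),
  LoanDuration = c(8, 12, 14, 16, 20, 20),
  stringsAsFactors = FALSE
)

# Rows are loan purposes, columns are risk labels, cells are mean durations.
mean_duration <- tapply(loans$LoanDuration,
                        list(loans$LoanPurpose, loans$Risk),
                        mean)
print(mean_duration)

# Difference between the two columns for each purpose: a larger gap
# corresponds to a bigger color difference across that heat map row.
gap <- mean_duration[, "Risk"] - mean_duration[, "No Risk"]
print(gap)
```

A gap near zero for a row suggests duration tells us little about risk for that purpose, while a large gap suggests it matters.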

* You can now see that the least risky loans are those taken out for purchasing a new car, and they are on average 10 years long. To the left of that cell, we see that loans taken out for the same purpose that average around 15 years in term length seem to be more risky. So one could conclude that, for this purpose, the longer the loan term, the more likely the loan is to be risky. In contrast, both risky and non-risky loans in the other category seem to have the same average term length, so one could conclude that there is little, if any, relationship between loan length and riskiness for loans of type other.
* In general, for each row, the bigger the color difference between the right and left columns, the more likely it is that loan duration plays a role in the riskiness of that loan category.
* Now let's look into customizing our plot. Under the Actions panel, notice that you can perform tasks such as `Start over`, `Download chart details`, `Download chart image`, or set `Global visualization preferences` *(Note: Hover over the icons to see the names)*. Click on the drop down arrow next to `Action`. Then click on the `Global visualization preferences` option from the menu.

![Visualize theme preferences](/files/-MWe_-_4xjf9JozY__Wx)

* In `Global visualization preferences`, we can change settings for `Titles`, `Tools`, `Theme`, and `Notifications`. Click on the `Theme` tab and update the color scheme to `Dark`. Then click the `Apply` button. The colors for all of our charts will now reflect this choice. Play around with the various themes and find one that you like.

![Visualize set theme and choose preferences](/files/-MSeC0K8V9cahF4LAAY5)

## Conclusion

We've seen some of the capabilities of Data Refinery. We saw how we can transform data using R code, as well as using various operations on the columns such as changing the data type, removing empty rows, or deleting the column altogether. We next saw that all the steps in our Data Flow are recorded, so we can remove steps, repeat them, or edit an individual step. We were able to quickly profile the data to see histograms and statistics for each column. Finally, we created more in-depth visualizations (a scatter plot, a histogram, and a heat map) to explore the relationship between the riskiness of a loan and its duration and purpose.

