# Data Visualization with Data Refinery

Let's take a quick detour to the *Data Refinery* tool. Data Refinery lets you quickly filter and mutate data, create visualizations, and perform other data cleansing tasks from an easy-to-use interface.

This section is broken up into the following steps:

1. [Load the *BILLING* data table into Data Refinery](#1-load-the-billing-data-table-into-data-refinery)
2. [Refine your data](#2-refine-your-data)
3. [Use Data Flow steps to keep track of your work](#3-use-data-flow-steps-to-keep-track-of-your-work)
4. [Profile the data](#4-profile-the-data)
5. [Visualize with charts and graphs](#5-visualize-with-charts-and-graphs)

## 1. Load the *BILLING* data table into Data Refinery

From the *Project* home, under the *Assets* tab, click the *Data assets* arrow to expand the list of data assets. Check the box next to *USERxxxx.BILLING* (where `USERxxxx` is your username or the username of the person who granted you data access), click the three dots to the right, and then click *Refine*:

![Launch the BILLING table](https://2515897395-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-M0YTFb_WdaBDnhlXs-q%2Fsync%2Fbe06241240ec34269da14cdd98b890c778cf025f.png?generation=1595273050956312\&alt=media)

Data Refinery should launch and open the data like the image below:

![Data Refinery view of the BILLING table](https://2515897395-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-M0YTFb_WdaBDnhlXs-q%2Fsync%2Fc44a160da7d2fc724d07d515ac9116e912407c3c.png?generation=1595273056848596\&alt=media)

Click the `X` by the *Details* button to close it.

## 2. Refine your data

We'll start out in the *Data* tab.

### Transform your sample data set by entering R code in the command line or selecting operations from the menu

For example, type *filter* on the Command line and observe that autocomplete will give hints on the syntax and how to use the command:

![Command line filter](https://2515897395-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-M0YTFb_WdaBDnhlXs-q%2Fsync%2Fbb100495ece68091e0866e62eb9d7a98d029006f.png?generation=1595273055408960\&alt=media)

When you have completed a command, click `Apply` to apply the operation to your data set.

Click the `Operation +` button:

![Choose Operation button](https://2515897395-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-M0YTFb_WdaBDnhlXs-q%2Fsync%2Fadbf8805a1a12eb16ae0580e022ead3322b8365c.png?generation=1595273042227113\&alt=media)

We want to make sure that there are no empty values. There may be some in the *TotalCharges* column, so let's check. Click on `Filter`, choose the *TotalCharges* column from the drop-down, select the operator *Is empty*, and then click `Apply`:

![Filter is empty](https://2515897395-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-M0YTFb_WdaBDnhlXs-q%2Fsync%2Fa49a15ceb1344e68f1efa18b7a07119df7d46da0.png?generation=1595273041656901\&alt=media)

We can see that there is only 1 row with an empty value for *TotalCharges*:

![Filter is empty results](https://2515897395-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-M0YTFb_WdaBDnhlXs-q%2Fsync%2Fc6a21fb6c23bff9bf875c6a203d79d2bf87aa887.png?generation=1595273052274785\&alt=media)
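For readers who want to see the logic behind the *Is empty* filter, here is a minimal Python sketch of the same idea. It runs outside Data Refinery and is not how the tool is implemented. The sample rows and the helper names `is_empty` and `filter_empty` are hypothetical, and treating `None` and blank strings as "empty" is an assumption.

```python
# Hypothetical sample rows; the real BILLING table has many more rows and columns.
rows = [
    {"customerID": "0001-AAAA", "MonthlyCharges": "29.85", "TotalCharges": "29.85"},
    {"customerID": "0002-BBBB", "MonthlyCharges": "56.95", "TotalCharges": ""},
    {"customerID": "0003-CCCC", "MonthlyCharges": "53.85", "TotalCharges": "108.15"},
]


def is_empty(value):
    """Treat None and blank or whitespace-only strings as empty (an assumption)."""
    return value is None or str(value).strip() == ""


def filter_empty(rows, column):
    """Keep only the rows whose value in `column` is empty."""
    return [row for row in rows if is_empty(row.get(column))]


empty_rows = filter_empty(rows, "TotalCharges")
print(len(empty_rows), "row(s) with an empty TotalCharges")
```

Like the filter in the UI, this only narrows the view of the data. The original `rows` list is left unchanged.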

With only one affected row, it should be safe to drop it from the data set, so let's do that.

First, remove the filter that you just added. You can delete it from the *Steps* section or click the undo arrow at the top of the page.

Next, choose the Operation *Remove empty rows*, select the *TotalCharges* column, click `Next` and then click `Apply`:

![Remove empty rows](https://2515897395-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-M0YTFb_WdaBDnhlXs-q%2Fsync%2Fedfaea073dd472f761d096bc8be463f86ad31aa0.png?generation=1595273052707194\&alt=media)

Finally, we can remove the *customerID* column, since it won't be useful for training a machine learning model in the next exercise. Choose the *Remove* operation, then choose `Change column selection`. Under `Select column` pick *customerID*, click `Next` and then click `Apply`:

![Remove CustomerID column](https://2515897395-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-M0YTFb_WdaBDnhlXs-q%2Fsync%2F4c436fe4213cd22f2152bff2410e02849930072c.png?generation=1595273014145189\&alt=media)
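The two cleanup operations above, removing empty rows and removing a column, can be sketched in plain Python as follows. This is an illustration of the logic only, not Data Refinery's implementation. The sample rows and the function names `remove_empty_rows` and `remove_column` are hypothetical.

```python
# Hypothetical sample rows standing in for the BILLING table.
rows = [
    {"customerID": "0001-AAAA", "MonthlyCharges": "29.85", "TotalCharges": "29.85"},
    {"customerID": "0002-BBBB", "MonthlyCharges": "56.95", "TotalCharges": ""},
    {"customerID": "0003-CCCC", "MonthlyCharges": "53.85", "TotalCharges": "108.15"},
]


def remove_empty_rows(rows, column):
    """Drop every row whose value in `column` is None or blank."""
    return [
        row for row in rows
        if row.get(column) is not None and str(row[column]).strip() != ""
    ]


def remove_column(rows, column):
    """Return copies of the rows without `column`."""
    return [{k: v for k, v in row.items() if k != column} for row in rows]


# Apply the operations in the same order as in the walkthrough.
cleaned = remove_column(remove_empty_rows(rows, "TotalCharges"), "customerID")
print(cleaned)
```

Both helpers return new lists instead of modifying their input, which mirrors how each operation in the UI produces a new view of the data while the source table stays intact.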

## 3. Use Data Flow steps to keep track of your work

What if you need to show a non-technical person the steps you took? What if you apply an operation you didn't intend?

Within Data Refinery, we keep track of the steps and we can even undo (or redo) an action using the circular arrows:

![Undo recent action](https://2515897395-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-M0YTFb_WdaBDnhlXs-q%2Fsync%2Fe592f5cea2adc27bb37d078243ae4cf1feac7782.png?generation=1595273053626337\&alt=media)

As you refine your data, IBM Data Refinery keeps track of the steps in your data flow. You can modify them and even select a step to return to a particular moment in your data’s transformation.

To see the steps in the data flow that you have performed, click the *Steps* button. The operations that you have performed on the data will be shown:

![Data Flow steps](https://2515897395-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-M0YTFb_WdaBDnhlXs-q%2Fsync%2F04d697c4a2049608babf3be2f7c24b92c4769037.png?generation=1595273043514735\&alt=media)

You can modify these steps in real time and save for future use.
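One way to picture a data flow is as an ordered list of named steps that is replayed against the original data, with undo and redo moving steps off and back onto that list. The sketch below illustrates that idea in Python. The `DataFlow` class and its methods are hypothetical and are not Data Refinery's internal design.

```python
class DataFlow:
    """Records named steps and replays them against the original rows."""

    def __init__(self, source_rows):
        self.source_rows = source_rows
        self.steps = []   # applied steps, in order: (name, function)
        self.undone = []  # steps removed by undo, available for redo

    def apply(self, name, func):
        self.steps.append((name, func))
        self.undone.clear()  # a new step invalidates the redo history

    def undo(self):
        if self.steps:
            self.undone.append(self.steps.pop())

    def redo(self):
        if self.undone:
            self.steps.append(self.undone.pop())

    def result(self):
        """Replay every recorded step, in order, on the source rows."""
        rows = self.source_rows
        for _name, func in self.steps:
            rows = func(rows)
        return rows

    def describe(self):
        return [name for name, _func in self.steps]


# Hypothetical sample rows.
rows = [
    {"customerID": "0001-AAAA", "TotalCharges": "29.85"},
    {"customerID": "0002-BBBB", "TotalCharges": ""},
]

flow = DataFlow(rows)
flow.apply(
    "Remove empty rows: TotalCharges",
    lambda rs: [r for r in rs if r["TotalCharges"].strip()],
)
flow.apply(
    "Remove column: customerID",
    lambda rs: [{k: v for k, v in r.items() if k != "customerID"} for r in rs],
)

flow.undo()  # take back the column removal
print(flow.describe())
print(flow.result())
```

Because the result is always recomputed from the source rows, undoing a step never loses data, and the list of step names doubles as a readable record of what was done.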

## 4. Profile the data

Clicking on the *Profile* tab will bring up a quick view of several histograms about the data.

![Data Refinery Profile tab](https://2515897395-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-M0YTFb_WdaBDnhlXs-q%2Fsync%2Fac5c452a301d80e02d040175ebf22b5a0b624c28.png?generation=1595273049078989\&alt=media)

You can get insight into the data from the histograms:

* About twice as many customers are on a month-to-month contract as on either a one-year or a two-year contract.
* More customers choose paperless billing, but around 40% still prefer a paper bill mailed to them.
* You can see the distribution of *MonthlyCharges* and *TotalCharges*.
* From the Churn column, you can see that a significant number of customers will cancel their service.
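The histograms on the *Profile* tab boil down to two kinds of counts: frequencies per category for columns such as *Contract* and *Churn*, and counts per numeric bin for columns such as *MonthlyCharges*. A minimal Python sketch of both follows. The sample rows, the category labels, the bin width, and the function names are all hypothetical, and Data Refinery may choose its bins differently.

```python
from collections import Counter

# Hypothetical sample rows; labels and values are made up for illustration.
rows = [
    {"Contract": "Month-to-month", "MonthlyCharges": 29.85, "Churn": "No"},
    {"Contract": "One year", "MonthlyCharges": 56.95, "Churn": "No"},
    {"Contract": "Month-to-month", "MonthlyCharges": 53.85, "Churn": "Yes"},
    {"Contract": "Two year", "MonthlyCharges": 42.30, "Churn": "No"},
    {"Contract": "Month-to-month", "MonthlyCharges": 70.70, "Churn": "Yes"},
]


def value_counts(rows, column):
    """Count how often each distinct value appears in a categorical column."""
    return Counter(row[column] for row in rows)


def histogram(rows, column, bin_width):
    """Count numeric values per bin; each key is the bin's lower edge."""
    bins = Counter()
    for row in rows:
        lower_edge = int(row[column] // bin_width) * bin_width
        bins[lower_edge] += 1
    return dict(sorted(bins.items()))


contract_counts = value_counts(rows, "Contract")
charge_bins = histogram(rows, "MonthlyCharges", bin_width=20)

# Print a simple text bar chart for the categorical column.
for label, count in contract_counts.most_common():
    print(f"{label:<15} {'#' * count}")
print(charge_bins)
```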

## 5. Visualize with charts and graphs

Choose the *Visualizations* tab to bring up an option to choose which columns to visualize. Under *Columns to Visualize* choose *TotalCharges* and click `Visualize data`:

![Visualize TotalCharges column](https://2515897395-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-M0YTFb_WdaBDnhlXs-q%2Fsync%2F26bb744361850e5a8ab68b20862e1aa649849510.png?generation=1595273060680690\&alt=media)

We first see the data in a histogram by default. You can choose other chart types. We'll pick `Scatter plot` next by clicking on it:

![Visualize TotalCharges histogram](https://2515897395-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-M0YTFb_WdaBDnhlXs-q%2Fsync%2F22a26cccef2e052931e814278c045b853a5521bf.png?generation=1595273037450681\&alt=media)

In the scatter plot, choose *TotalCharges* for the x-axis, *MonthlyCharges* for the y-axis, and *Churn* for the *Color map*. Drag the bottom *TotalCharges* filter to show all the data:

![set x- and y- axes and Color map](https://2515897395-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-M0YTFb_WdaBDnhlXs-q%2Fsync%2F06dda65ef07ba243faa3596abfbe4d5f0a2280b8.png?generation=1595273059716753\&alt=media)
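Conceptually, the *Color map* setting splits the points into one series per distinct *Churn* value and draws each series in its own color. The Python sketch below shows only that grouping step, without drawing anything. The sample rows and the function name `scatter_series` are hypothetical.

```python
from collections import defaultdict

# Hypothetical sample rows; values are made up for illustration.
rows = [
    {"TotalCharges": 29.85, "MonthlyCharges": 29.85, "Churn": "No"},
    {"TotalCharges": 1889.50, "MonthlyCharges": 56.95, "Churn": "No"},
    {"TotalCharges": 108.15, "MonthlyCharges": 53.85, "Churn": "Yes"},
    {"TotalCharges": 151.65, "MonthlyCharges": 70.70, "Churn": "Yes"},
]


def scatter_series(rows, x, y, color_by):
    """Group (x, y) points by the value of the `color_by` column."""
    series = defaultdict(list)
    for row in rows:
        series[row[color_by]].append((row[x], row[y]))
    return dict(series)


series = scatter_series(rows, "TotalCharges", "MonthlyCharges", "Churn")
for label, points in series.items():
    print(label, points)
```

Each key in `series` corresponds to one legend entry in the chart, and a plotting library could draw each list of points with a different color.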

Scroll down and give the scatter plot a title and sub-title if you wish. Under the `Actions` panel, notice that you can perform tasks such as *Start over*, *Download chart details*, *Download chart image*, or set *Global visualization preferences* (*Note: Hover over the icons to see the names*). Click on the "gear" icon in the `Actions` panel:

![Visualize set titles and choose preferences](https://2515897395-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-M0YTFb_WdaBDnhlXs-q%2Fsync%2Fd4f17f9d87820bbe44a71e75c6dde929dedc1ff1.png?generation=1595273003707931\&alt=media)

The *Global visualization preferences* dialog has settings for *Titles*, *Tools*, *Theme*, and *Notification*. Click on the `Theme` tab and update the color scheme to *Vivid*. Then click the `Apply` button:

![Visualize set vivid](https://2515897395-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-M0YTFb_WdaBDnhlXs-q%2Fsync%2F3aeb86893432b1cffbc7d138470ccd28c5c01122.png?generation=1595273034932901\&alt=media)

Now the colors for all of our charts will reflect this:

![Visualize show vivid](https://2515897395-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-M0YTFb_WdaBDnhlXs-q%2Fsync%2Fc0f25e3c29b0017d7e36af138363a98619c12573.png?generation=1595272970505147\&alt=media)

## Conclusion

We've seen a small sampling of the power of Data Refinery on IBM Cloud Pak for Data. We saw how to transform data by entering R code at the command line or by applying operations to the columns, such as filtering the data, removing empty rows, or deleting a column altogether. We then saw that all the steps in our Data Flow are recorded, so we can remove steps, repeat them, or edit an individual step. We quickly profiled the data to see histograms and statistics for each column. Finally, we created more in-depth visualizations, including a scatter plot of *TotalCharges* against *MonthlyCharges* with the *Churn* results highlighted in color.
