Data cleansing

It is recommended to get familiar with Transformation scripts basics before proceeding with this tutorial.

Goal of this use cases is to clean the values in imported data file.

Merging values

You can download data for this use case here.

Use contains function to transform values into the same value for all the rows. First you should define all the wrong appearances and then create condition for all of them. This function will update all the columns which contain selected string.

if(contains(value(2), 'wasser')) {
 return "Water"
}
if(contains(value(2), 'Warter')) {
 return "Water"
}
if(contains(value(2), 'water')) {
 return "Water"
}
return value(2)

This function will be used for all the values which contain selected string. If you want to update only specific values, you can also use conditions, like following:

if (value(1) == 'UK') {
return "United Kingdom"
}
return value(1)

Removing unwanted values

Sometimes a column should contain only specified values but thanks to some error in source system it also contains some unwanted values. In this example there is a column Gender which you should contain only values "Male" and "Female". Everything else should be replaced by "Unkown". You can use this script to get rid of them.

if (value() == "Male" | value()== "Female") {
return value()
}
else return "Unknown"

Similar way it is also possible to apply transformation scripts on indicators. For example you can get rid of values outside of range. In this example, we want to display all values greater than 1 as 0:

if (value()>"1") {
  return 0}
else return value()

String cleaning (normalization)

In some cases the strings contain unwanted characters, such as leading spaces, dashes etc. You can remove them by using various combination of function for character replacing and stripping, such as:

basetext = replace(stripStart(strip(strip(value(8)),'-'),'"'), ' &amp;', '&')
return basetext

Returning certain number of words

Sometimes the user wants to use only a certain maximum number of words. This can be achieved by combination of functions countMatches (to find number of words in original string), ordinalIndexOf (to find position of space after last required word) and left (to return required part of string).

basetext = replace(stripStart(strip(strip(value(8)),'-'),'"'), ' &amp;', '&')
numOfSpaces = countMatches(basetext, ' ')
if (numOfSpaces>=3) { 
	subs= ordinalIndexOf(basetext, " ", 3) 
	return left(basetext, subs)
} else {
	return basetext}

Adding leading zeros to match certain length

This script can be used in cases when some identification number needs to have specific length but in source data the length varies. In this example we want to have ID with nine digits.

code=value(1)
zeroString='000000000'
leftPadding = 9-code.length()
return substring(zeroString,0, leftPadding) + code

Returning longer part of code

Let's say that source data contains a column with alphanumeric code but the column also contains additional some additional informaion, for example a check digit. The check digit is separated from the code by dash but order of these two parts is arbitrary. It can looks something like this:

Code
55DF4FG-Z
DFGFDG-T
5-VFDGDF
D4WV3-S
5-D96V5D

In some cases the user would like to work only with the code itself. To do that, he needs to get the longer part:

code=''
xArray=split(value(1), '-')
if (xArray[0].length()>xArray[1].length())
{code= xArray[0]}
else code=xArray[1]
  
return code

Page tree