This week we were given a function called tukey_multiple() that contained a deliberate bug within it. The goal of this week was to debug this function using R’s debug function.
Which, can I say, learning that R has a line-by-line debugger has been slightly infuriating, but also very cool. For context: the first coding language I ever learned was Python, using the Thonny IDE. One of my favorite features of Thonny was its line-by-line debugger. When I was learning to code, specifically when we started getting into function creation and recursion, that debugger was my best friend and my greatest asset. As my coding education moved on, I never found another IDE with a debugger like Thonny’s…
Until now, and I am so mad it took me until now to realize it.
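For anyone who has not tried it yet, here is a minimal sketch of switching the debugger on and off. The add_one() function is a made-up example of mine, not part of the assignment. The actual call is left commented out because it opens an interactive browser prompt.

```r
# Hypothetical toy function, used only to show the debug workflow
add_one <- function(v) {
  result <- v + 1
  result
}

# Flag the function so the next call steps through it line by line
debug(add_one)
flagged <- isdebugged(add_one)

# In an interactive session, this call would open the browser prompt:
#   n = run the next line, c = continue to the end, Q = quit the debugger
# add_one(1:3)

# Remove the flag when finished
undebug(add_one)
cleared <- !isdebugged(add_one)
```

debugonce() works the same way but removes the flag by itself after one call, which saves the undebug() step.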
That being said, this week’s debugging was touch and go for a while, as I struggled to figure out exactly what was going on with this function and what it was trying to accomplish. The first bug I found was in this line:
{
outliers[,j] <- outliers[,j] && tukey.outlier(x[,j])
}
The original function used &&, which is meant for single TRUE/FALSE values. Older versions of R silently compared only the first element of each vector, and R 4.3.0 and later throw an error when either side is longer than one. I changed it to the vectorized & operator so that every element in the column is compared and outliers are identified row by row.
The next bug was also in this line of code: the tukey.outlier() call. My research was inconclusive on whether this was an actual function (maybe from an older version of R or a package) or a custom function that was never defined. Regardless, rather than remove it entirely (and risk the whole function falling apart), I decided to write a quick dummy function that does something similar to what I think this call was trying to achieve (identify outliers):
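A small sketch of the difference, using two made-up logical vectors. The && call is wrapped in tryCatch() because what it does with vectors longer than one depends on the R version, so I am not showing an expected result for it.

```r
# Two example logical vectors (made up for illustration)
a <- c(TRUE, FALSE, TRUE)
b <- c(TRUE, TRUE, FALSE)

# & compares element by element and returns a vector of the same length
elementwise <- a & b
print(elementwise)

# && expects single values; with longer vectors the outcome depends on the
# R version (first element only in older R, an error in R 4.3.0 and later)
scalar_attempt <- tryCatch(
  a && b,
  error = function(e) "error: && needs length-one inputs"
)
print(scalar_attempt)
```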
tukey.outlier <- function(x) {
  # Dummy function for logic testing: treat a value as an outlier
  # if it is more than 2 standard deviations above or below the mean
  upper <- mean(x) + 2 * sd(x)
  lower <- mean(x) - 2 * sd(x)
  return(x < lower | x > upper)
}
This dummy function returns a boolean vector that indicates whether each value in the input is an outlier. Specifically, it flags values that fall more than two standard deviations above or below the mean. This is not the outlier rule Tukey actually used, which flags values more than 1.5 times the interquartile range (IQR) below the first quartile or above the third quartile, but it is good enough for testing the logic of the function we are actually debugging, tukey_multiple().
The last bug I found before I was able to run this function with no errors was in what input it allowed. The function expected only numeric columns, and it also would not accept a single column. To deal with this, I added a check so that the main part of the function only runs the outlier test on numeric columns, and sets the result to FALSE for any character or factor column.
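For comparison, here is a rough sketch of what an IQR-based check could look like. The name tukey_outlier_iqr and the k argument are my own placeholders, and this is not necessarily what the original tukey.outlier() did.

```r
# Hypothetical IQR-based outlier check (names and defaults are assumptions)
tukey_outlier_iqr <- function(x, k = 1.5) {
  # First and third quartiles, ignoring missing values
  q <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  iqr <- q[2] - q[1]
  # Flag anything beyond k * IQR outside the quartiles
  x < (q[1] - k * iqr) | x > (q[2] + k * iqr)
}

# Made-up example data with one obviously extreme value at the end
values <- c(10, 12, 11, 13, 12, 11, 100)
flags <- tukey_outlier_iqr(values)
print(flags)  # only the last value (100) is flagged
```

Because it returns a logical vector the same length as its input, it could be swapped in for the dummy function without changing anything else.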
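for (j in 1:ncol(x)) {
  if (is.numeric(x[[j]])) {
    # Numeric column: run the outlier test on every value
    outliers[, j] <- outliers[, j] & tukey.outlier(x[[j]])
  } else {
    # Character or factor column: never counts as an outlier
    outliers[, j] <- FALSE
  }
}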
Overall, the function now looks across the columns of each row, and if every column in that row is an outlier, it flags the whole row as an outlier.
The dataset I tested this on had a mix of numeric and factor columns. Because every non-numeric column is set to FALSE and a row is only flagged when all of its columns are TRUE, every one of my rows came back FALSE. So at the moment I do not see a major use for this function. My next step would be to adapt it to handle factor columns and other data types better than defaulting to FALSE, or to count how many of the columns reported TRUE for each row, or something else to make this a bit more useful in the real world.
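Putting the pieces together, the whole thing can be sketched roughly like this. Only the loop body comes from this post. The TRUE matrix at the start, the row-wise all() at the end, the name tukey_multiple_sketch, and the test data are assumptions of mine for illustration.

```r
# Stand-in outlier test from this post (2-sd rule, not Tukey's IQR rule)
tukey.outlier <- function(x) {
  upper <- mean(x) + 2 * sd(x)
  lower <- mean(x) - 2 * sd(x)
  x < lower | x > upper
}

# Assumed overall shape of the function; only the loop body is from the post
tukey_multiple_sketch <- function(x) {
  # Start with every cell marked TRUE, then narrow it down column by column
  outliers <- array(TRUE, dim = dim(x))
  for (j in seq_len(ncol(x))) {
    if (is.numeric(x[[j]])) {
      outliers[, j] <- outliers[, j] & tukey.outlier(x[[j]])
    } else {
      outliers[, j] <- FALSE
    }
  }
  # A row is flagged only if every column in it is TRUE
  apply(outliers, 1, all)
}

# Made-up data: the last row is extreme in both numeric columns
numeric_only <- data.frame(
  a = c(1, 2, 3, 2, 1, 2, 50),
  b = c(5, 6, 5, 6, 5, 6, 90)
)
numeric_result <- tukey_multiple_sketch(numeric_only)

# Adding a factor column forces every row to FALSE
mixed <- cbind(numeric_only, g = factor(c("x", "y", "x", "y", "x", "y", "x")))
mixed_result <- tukey_multiple_sketch(mixed)

print(numeric_result)
print(mixed_result)
```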
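One possible direction, sketched very roughly: skip the non-numeric columns and count how many numeric columns flag each row. The function name count_outlier_columns, its arguments, and the test data are all made up, and the sketch assumes at least one numeric column and more than one row.

```r
# Same stand-in 2-sd outlier test used earlier in this post
tukey.outlier <- function(x) {
  upper <- mean(x) + 2 * sd(x)
  lower <- mean(x) - 2 * sd(x)
  x < lower | x > upper
}

# Hypothetical helper: number of numeric columns in which each row is flagged
count_outlier_columns <- function(x, outlier_test = tukey.outlier) {
  # Keep only the numeric columns instead of forcing the others to FALSE
  numeric_cols <- x[vapply(x, is.numeric, logical(1))]
  # One logical column of flags per numeric column
  flags <- vapply(numeric_cols, outlier_test, logical(nrow(x)))
  # TRUE counts as 1, so the row sum is the number of flagged columns
  rowSums(flags)
}

# Made-up mixed data: two numeric columns and one factor column
mixed <- data.frame(
  a = c(1, 2, 3, 2, 1, 2, 50),
  b = c(5, 6, 5, 6, 5, 6, 90),
  g = factor(c("x", "y", "x", "y", "x", "y", "x"))
)
counts <- count_outlier_columns(mixed)
print(counts)
```

A count like this would let a row stand out even when the data has factor columns, and a threshold could be applied afterwards.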
However, currently, the function does run without errors, so I am going to stop here.
As always, link to my Github: here