Title: | Preprocessor for Data Modeling |
---|---|
Description: | Includes binning categorical variables into fewer categories based on a t-test, converting categorical variables into continuous features using the mean of the response variable for the respective categories, and understanding the relationship between the response variable and predictor variables using data transformations. |
Authors: | Navin Loganathan [aut], Mohan Manivannan [aut], Santhosh Sasanapuri [aut, cre], LatentView Analytics [ctb] |
Maintainer: | Santhosh Sasanapuri <[email protected]> |
License: | GPL (>= 2) |
Version: | 1.1 |
Built: | 2025-02-13 04:12:30 UTC |
Source: | https://github.com/cran/corkscrew |
Package: | corkscrew |
Type: | Package |
Version: | 1.1 |
Date: | 2015-10-30 |
Depends: | R (>= 3.0.1), ggplot2, gplots, RColorBrewer, igraph, stats, grDevices |
License: | GPL (version 2 or newer) |
Navin Loganathan, Mohan Manivannan, Santhosh Sasanapuri, LatentView Analytics
# using transformation
data(airquality)
transformation(names(airquality)[2:4], "Ozone", airquality)

# using ctoc
data(ChickWeight)
# Converting the "Chick" variable into factor from ord.factor for demonstration purposes.
ChickWeight$Chick <- as.factor(as.numeric(ChickWeight$Chick))
# Returns a dataframe with two added columns for "Chick" and "Diet"
head(ctoc(y = "weight", x = c("Chick", "Diet"), data = ChickWeight, min.obs = 12))

# using tbin
train = as.data.frame(cbind(runif(1000, 10, 1000), sample(1:40, 1000, TRUE)))
colnames(train) = c("response", "state")
train$state = as.factor(train$state)
train.output = tbin(dv = "response", idv = c("state"), train, 25, TRUE)
Applies the categorical-to-continuous conversion calculated on one dataframe to another dataframe.
apply.ctoc(y, x, data, newdata, min.obs)
y | Response variable (categorical or continuous). |
x | Categorical predictor variables in the dataframe that are to be converted into continuous features. |
data | Name of the dataframe from which the values of the categories are calculated. |
newdata | Name of the dataframe to which the values of the categories are applied. |
min.obs | Minimum number of observations a category must have to be converted into a continuous feature. All categories with fewer than min.obs observations are grouped into a separate category. |
This function applies only to categorical variables. min.obs is counted on the observations in "data".
Returns a dataframe with converted features without replacing the original ones.
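The idea of learning category means on one dataframe and applying them to another can be sketched in base R. This is only an illustration under stated assumptions, not the package's implementation: the helper name `apply_mean_encode`, the pooled label "other", and the choice to return NA for levels never seen in training are all hypothetical. The "_cont" suffix follows the column name mentioned in the example for this function.

```r
# Hypothetical helper: learn category means on `data`, apply them to `newdata`.
apply_mean_encode <- function(data, newdata, y, x, min.obs) {
  cats <- as.character(data[[x]])
  counts <- table(cats)
  # Categories with fewer than min.obs rows in `data` are pooled together.
  rare <- names(counts)[counts < min.obs]
  cats[cats %in% rare] <- "other"          # "other" is an assumed label
  means <- tapply(data[[y]], cats, mean)   # learned on `data` only

  new_cats <- as.character(newdata[[x]])
  new_cats[new_cats %in% rare] <- "other"
  # Levels never seen in `data` have no learned mean and become NA here.
  newdata[[paste0(x, "_cont")]] <- as.numeric(means[new_cats])
  newdata
}

data(ChickWeight)
cw <- as.data.frame(ChickWeight)
set.seed(2)
idx <- sample(nrow(cw), size = 289)
train <- cw[idx, ]
test <- cw[-idx, colnames(cw) != "weight"]
scored <- apply_mean_encode(train, test, y = "weight", x = "Diet", min.obs = 60)
head(scored)
```

Because the means are computed from `train` alone, the response column is not needed in `test`.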
Santhosh Sasanapuri
See also: ctoc, tbin, apply.tbin.
data(ChickWeight)
set.seed(2)
sample_ex <- sample(nrow(ChickWeight), size = 289, replace = FALSE, prob = NULL)
train <- ChickWeight[sample_ex, ]
test <- ChickWeight[-sample_ex, colnames(ChickWeight) != "weight"]
# Returns the test dataframe with an added column "Diet_cont" by extrapolating it from train
head(apply.ctoc(y = "weight", "Diet", data = train, newdata = test, min.obs = 60))
Extrapolates the binning of categorical variables to a new dataset.
apply.tbin(idv, train.output, test)
idv | Predictor variables in the dataframe which are categorical and need to be binned. |
train.output | The output object of the tbin function. |
test | A new data set on which binning has to be extrapolated. |
This function bins the new dataset based on the output object of the tbin function.
Returns a dataframe containing the new dataset with the binned variables from the tbin output appended.
An error is thrown if the new dataset contains levels that are not present in the training dataset.
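The behaviour described above, reusing a level-to-bin mapping and rejecting unseen levels, can be sketched in base R as follows. The helper `apply_bins`, its arguments, and the error message are hypothetical and are not part of corkscrew; the toy training data simply assumes a binned "_cat" column already exists.

```r
# Hypothetical helper: reuse a level-to-bin mapping learned on training data.
apply_bins <- function(train_levels, train_bins, test, idv) {
  lookup <- unique(data.frame(level = as.character(train_levels),
                              bin = as.character(train_bins),
                              stringsAsFactors = FALSE))
  # Refuse to guess a bin for levels that were never seen in training.
  new_levels <- setdiff(as.character(test[[idv]]), lookup$level)
  if (length(new_levels) > 0) {
    stop("New levels not seen in training: ", paste(new_levels, collapse = ", "))
  }
  matched <- match(as.character(test[[idv]]), lookup$level)
  test[[paste0(idv, "_cat")]] <- factor(lookup$bin[matched])
  test
}

# Toy training data in which the binned column is assumed to exist already.
train <- data.frame(state = factor(c("1", "2", "3", "4")),
                    state_cat = factor(c("A", "A", "B", "B")))
test <- data.frame(state = factor(c("2", "3", "3")))
out <- apply_bins(train$state, train$state_cat, test, "state")
out
```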
Mohan Manivannan
See also: tbin, ctoc, apply.ctoc.
train = as.data.frame(cbind(runif(1000, 10, 1000), sample(1:40, 1000, TRUE)))
colnames(train) = c("response", "state")
train$state = as.factor(train$state)
train.output = tbin(dv = "response", idv = c("state"), train, 25, TRUE)
# extrapolating the tbin function to a new dataset using apply.tbin
test = as.data.frame(sample(1:40, 100, TRUE))
colnames(test) = c("state")
test$state = as.factor(test$state)
test.output = apply.tbin(idv = c("state"), train.output, test)
Converts categorical variables into continuous features using the mean of the response variable for the respective categories without using the index record.
ctoc(y, x, data, min.obs)
y | Response variable (categorical or continuous). |
x | Categorical predictor variables in the dataframe that are to be converted into continuous features. |
data | Name of the dataframe. |
min.obs | Minimum number of observations a category must have to be converted into a continuous feature. All categories with fewer than min.obs observations are grouped into a separate category. |
This function is only for categorical variables.
Returns a dataframe with converted features without replacing the original ones.
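A minimal base R sketch of the conversion for a single variable is shown below. It is an illustration only: the helper `mean_encode` and the pooled label "other" are hypothetical, the "_cont" suffix is assumed from the apply.ctoc example, and the sketch uses a plain per-category mean, so it does not reproduce the "without using the index record" behaviour mentioned in the description.

```r
# Hypothetical helper: mean-encode one categorical column (not corkscrew's code).
mean_encode <- function(data, y, x, min.obs) {
  cats <- as.character(data[[x]])
  counts <- table(cats)
  # Pool categories with fewer than min.obs rows into one catch-all level.
  rare <- names(counts)[counts < min.obs]
  cats[cats %in% rare] <- "other"   # "other" is an assumed label
  # Mean of the response within each (possibly pooled) category.
  means <- tapply(data[[y]], cats, mean)
  # Append the encoded column; the original column is left untouched.
  data[[paste0(x, "_cont")]] <- as.numeric(means[cats])
  data
}

data(ChickWeight)
encoded <- mean_encode(as.data.frame(ChickWeight), y = "weight", x = "Diet", min.obs = 12)
head(encoded)
```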
Santhosh Sasanapuri
data(ChickWeight)
# Converting the "Chick" variable into factor from ord.factor for demonstration purposes.
ChickWeight$Chick <- as.factor(as.numeric(ChickWeight$Chick))
# Returns a dataframe with two added columns for "Chick" and "Diet"
head(ctoc(y = "weight", x = c("Chick", "Diet"), data = ChickWeight, min.obs = 12))
Bins the different levels of a categorical variable based on the similarity of the response.
tbin(dv, idv, train, min.obs, plot.bin = c(TRUE, FALSE), method = c(1, 2, 3))
dv | The response variable; it must be continuous. |
idv | Predictor variables in the dataframe which are categorical and need to be binned. |
train | Name of the dataframe. |
min.obs | All levels with fewer than min.obs observations are binned into one category. Defaults to 30. |
plot.bin | If TRUE, a heat map comparing the binned levels and the original levels is plotted. Defaults to FALSE. |
method | Community detection method used to bin the levels: 1 = fast greedy, 2 = walktrap, 3 = edge betweenness. Choose the method that suits your needs. Defaults to 1. |
The levels of a categorical variable are compared with each other, and levels with the same response level are binned together as one category. Before the comparison, levels with fewer observations than the minimum observation cut-off are binned into a single small category. Every pair of levels is compared using either a parametric or a non-parametric test, depending on the normality of the response data. A pairwise comparison matrix is created for the levels of each categorical variable and is then processed to form a graph. The levels to be combined are identified with a community detection algorithm; a community is a collection of levels whose response levels are statistically the same. The binned variables in a new data set are created by extrapolating the values from the original (training) data set.
The tbin output contains the newly created binned categorical variables appended to the original (training) data set. The names of the new variables are the names of the corresponding original variables suffixed with "_cat".
The minimum observation cut-off is set to 30 so that the parametric and non-parametric tests are applicable.
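The pairwise comparison and grouping steps described above can be sketched in base R. This is a simplified illustration, not the package's algorithm: it always uses a Welch t-test (no normality check), assumes a 0.05 significance level, and replaces community detection with plain connected components of the similarity graph. The toy data and all variable names are hypothetical.

```r
set.seed(1)
# Toy data: levels a and b are drawn around 10, c and d around 20.
train <- data.frame(
  state = factor(rep(c("a", "b", "c", "d"), each = 50)),
  response = c(rnorm(50, 10), rnorm(50, 10), rnorm(50, 20), rnorm(50, 20))
)

# Pairwise comparison matrix: TRUE where two levels are NOT significantly different.
lv <- levels(train$state)
similar <- matrix(FALSE, length(lv), length(lv), dimnames = list(lv, lv))
for (i in seq_along(lv)) {
  for (j in seq_along(lv)) {
    if (i < j) {
      p <- t.test(train$response[train$state == lv[i]],
                  train$response[train$state == lv[j]])$p.value
      similar[i, j] <- similar[j, i] <- p > 0.05   # assumed threshold
    }
  }
}

# Stand-in for community detection: propagate the smallest label
# across edges until every connected component shares one label.
bin <- setNames(seq_along(lv), lv)
repeat {
  changed <- FALSE
  for (i in seq_along(lv)) {
    nb <- c(i, which(similar[i, ]))
    m <- min(bin[nb])
    if (any(bin[nb] != m)) {
      bin[nb] <- m
      changed <- TRUE
    }
  }
  if (!changed) break
}

# Append the binned variable using the "_cat" suffix.
train$state_cat <- factor(bin[as.character(train$state)])
table(train$state, train$state_cat)
```

Connected components merge any levels linked by a chain of non-significant comparisons, which is coarser than the community detection methods offered by the method argument.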
Mohan Manivannan
train = as.data.frame(cbind(runif(1000, 10, 1000), sample(1:40, 1000, TRUE)))
colnames(train) = c("response", "state")
train$state = as.factor(train$state)
train.output = tbin(dv = "response", idv = c("state"), train, 25, TRUE)
transformation is used to study the relationship between the response variable and each predictor variable. The relationships studied are linear, power, log and arctangent.
transformation(x, y, data)
x | Predictor variables in the dataframe that are to be transformed. |
y | Response variable in the dataframe. |
data | Name of the dataframe. |
Applies only when both the response variable and the predictor variables are continuous.
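One way to picture this kind of study is to compare how strongly each transformed predictor correlates with the response. The base R sketch below does that for a single predictor. It is an assumption-laden illustration rather than the package's method: the exact functional forms (for example, squaring for "power"), the use of absolute Pearson correlation as the criterion, and the names `candidates` and `cors` are all hypothetical, and no "influence" measure is computed.

```r
data(airquality)
# Keep only rows where both the response and the predictor are observed.
aq <- airquality[complete.cases(airquality[, c("Ozone", "Temp")]), ]

# Candidate transformations of the predictor (the exact forms are assumptions).
candidates <- list(
  linear = function(x) x,
  power = function(x) x^2,
  log = function(x) log(x),
  arctangent = function(x) atan(x)
)

# Absolute Pearson correlation of each transformed predictor with the response.
cors <- sapply(candidates, function(f) abs(cor(f(aq$Temp), aq$Ozone)))
cors

# Name of the transformation with the strongest correlation.
names(which.max(cors))
```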
Returns a list:
[[1]] | Summary of the predictor variables' best transformation, correlation and influence studied against the y variable |
[[2]] | Summary of the predictor variables' correlation and influence for all the transformations studied against the y variable |
Navin Loganathan
data(airquality)
transformation(names(airquality)[2:4], "Ozone", airquality)