Chi-squared Test of Independence in R Programming
Chi-squared Test of Independence
It is a non-parametric test used to determine whether there is a significant association between two categorical variables. The observed frequencies of one variable are compared with the frequencies of the other variable.
The assumptions of the chi-square test are:
1. The data in the cells should be frequencies or counts of cases.
2. The levels of the variables are mutually exclusive.
3. Each subject may contribute data to one and only one cell.
4. The groups must be independent; there is no interdependency between the groups being compared.
5. The variables should be categorical, or the data can be converted into categorical form.
6. The sample data are displayed in a contingency table, and the expected frequency count for each cell of the table is at least 5.
Expected frequencies:
The expected frequency is calculated for each cell in the contingency table as:
E = (nr × nc) / n
where
E – the expected value for the cell,
nr – the total number of sample observations for row level r,
nc – the total number of sample observations for column level c,
n – the total sample size.
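As a quick illustration (using a made-up 2×2 table, not the housetasks data introduced later), the expected counts can be computed in R from the row and column totals:

```r
# Made-up 2x2 contingency table of counts
obs <- matrix(c(30, 10,
                20, 40), nrow = 2, byrow = TRUE)

# Expected count for each cell: E = (row total * column total) / grand total
expected <- outer(rowSums(obs), colSums(obs)) / sum(obs)
print(expected)
```

Here the row totals are 40 and 60 and both column totals are 50, so the expected counts are 20, 20, 30 and 30.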
Test statistic:
The test statistic of the chi-square test is:
χ² = Σ (O − E)² / E
where
O – the observed value,
E – the expected value,
χ² – the chi-square statistic,
Σ – the summation over all cells of the table.
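On the same made-up 2×2 table as above, the statistic is simply the sum of (O − E)²/E over all cells:

```r
# Made-up 2x2 contingency table of counts
obs <- matrix(c(30, 10,
                20, 40), nrow = 2, byrow = TRUE)
expected <- outer(rowSums(obs), colSums(obs)) / sum(obs)

# Chi-square statistic: sum of squared deviations scaled by expected counts
chi_sq <- sum((obs - expected)^2 / expected)
print(chi_sq)  # 16.66667
```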
Null hypothesis: assumes that there is no association between the two variables.
Alternative hypothesis: assumes that there is an association between the two variables.
If the p-value is greater than 0.05, we fail to reject the null hypothesis; if the p-value is less than 0.05, we reject the null hypothesis in favor of the alternative.
Degrees of freedom:
For a contingency table, the degrees of freedom are the number of cell counts that are free to vary once the row and column totals are fixed:
df = (Number of rows − 1) × (Number of columns − 1)
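Putting the pieces together on the made-up 2×2 table, the manually computed statistic, degrees of freedom, and p-value agree with chisq.test(); correct = FALSE disables Yates' continuity correction, which chisq.test() applies by default on 2×2 tables:

```r
obs <- matrix(c(30, 10,
                20, 40), nrow = 2, byrow = TRUE)
expected <- outer(rowSums(obs), colSums(obs)) / sum(obs)

chi_sq <- sum((obs - expected)^2 / expected)
df <- (nrow(obs) - 1) * (ncol(obs) - 1)                 # (2 - 1) * (2 - 1) = 1
p_value <- pchisq(chi_sq, df = df, lower.tail = FALSE)  # upper-tail probability

# correct = FALSE disables Yates' continuity correction on 2x2 tables
result <- chisq.test(obs, correct = FALSE)
all.equal(chi_sq, unname(result$statistic))  # TRUE
all.equal(p_value, result$p.value)           # TRUE
```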
We will use the housetasks data set from STHDA. We import the data set from an online link:
file_path <- "http://www.sthda.com/sthda/RDoc/data/housetasks.txt"
We read the file with the read.delim() function, using the first column as row names.
housetasks <- read.delim(file_path, row.names = 1)
We install the "gplots" package for visualization and then load it with the following code:
install.packages("gplots")
library(gplots)
We want to store the data set in a table format. To convert the data set into a table, we first convert it into a matrix with the as.matrix() function and then convert the matrix into a table with the as.table() function.
dt <- as.table(as.matrix(housetasks))
We transpose the table with t(dt) so that the rows of dt appear along the x-axis of the plot.
We use balloonplot() to plot the data as dots: the larger a cell's value, the bigger its dot. We set label = FALSE so the cell values are not printed on the plot, and show.margins = FALSE so the row and column totals are not printed.
balloonplot(t(dt), main = "housetasks", xlab = "", ylab = "", label = FALSE, show.margins = FALSE)
The mosaicplot() function comes from the "graphics" package, which is part of base R and loaded by default, so no installation is needed:
library(graphics)
We use mosaicplot() to plot the housework tasks associated with Husband and Wife.
The argument shade = TRUE colors the graph according to the Pearson residuals.
The argument las = 2 produces vertical labels.
mosaicplot(dt, shade = TRUE, las = 2, main = "housetasks")
From this plot, we can see that the housetasks Laundry, Main_meal, Dinner and Breakfast (blue color) are mainly done by the wife.
The chi-square test can be done as:
chisq <- chisq.test(housetasks)
Printing chisq shows X-squared = 1944.5, meaning the chi-square statistic is 1944.5 with 36 degrees of freedom; the p-value is less than 2.2e-16, so the two variables are significantly associated.
We can see the observed frequencies with the following code:
chisq$observed
We can see the expected frequencies with the following code:
chisq$expected
The Pearson residuals can be used to check the model fit at each observation for generalized linear models. The Pearson residual for a cell in a two-way table is:
r = (O − E) / √E
We can calculate the residuals with the following code:
chisq$residuals
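We can verify on the made-up 2×2 table from the earlier examples (not the housetasks data) that the residuals stored in the test object match the formula above:

```r
obs <- matrix(c(30, 10,
                20, 40), nrow = 2, byrow = TRUE)
res <- chisq.test(obs, correct = FALSE)

# Pearson residual for each cell: (O - E) / sqrt(E)
manual_residuals <- (res$observed - res$expected) / sqrt(res$expected)
all.equal(manual_residuals, res$residuals)  # TRUE
```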
The chi-square statistic is the sum of the contributions from each of the individual cells.
If an individual contribution is high, it is either because the expected value is low or the difference between the observed and the expected is reasonably high. If the independent variable has more than two values, you might like to consider whether the distinction between a specific value and all the others would be significant.
We can see the chi-square value with the following code:
chisq$statistic
We can also find the contribution of each cell to the total chi-square statistic.
Expressed as a percentage, it is the squared residual of the cell divided by the chi-square statistic:
contrib <- 100*chisq$residuals^2/chisq$statistic
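As an optional follow-up (not part of the steps above), these contributions can be visualized with the corrplot package; this is a sketch assuming corrplot is installed, and it uses a made-up table in place of the housetasks data:

```r
# install.packages("corrplot")  # assumed installed; uncomment if needed
library(corrplot)

# Made-up 2x2 table standing in for the housetasks data
obs <- matrix(c(30, 10,
                20, 40), nrow = 2, byrow = TRUE,
              dimnames = list(c("rowA", "rowB"), c("colA", "colB")))
chisq <- chisq.test(obs, correct = FALSE)

# Percentage contribution of each cell to the total chi-square statistic
contrib <- 100 * chisq$residuals^2 / chisq$statistic

# is.corr = FALSE tells corrplot() the matrix is not a correlation matrix
corrplot(contrib, is.corr = FALSE)
```

Cells that contribute most to the statistic appear as the largest circles in the plot.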