biostats.fisher_exact_test

biostats.fisher_exact_test(data, variable_1, variable_2, kind='count')

Test whether there is an association between two categorical variables.

Parameters:
data : pandas.DataFrame

The input data. Must contain at least two categorical columns.

variable_1 : str

The first categorical variable. Maximum 10 groups.

variable_2 : str

The second categorical variable. Swapping the two variables does not change the result of the Fisher exact test. Maximum 10 groups.

kind : str

How to summarize the contingency table. Defaults to “count”.

  • “count” : Count the frequencies of occurrence.

  • “vertical” : Calculate proportions vertically, so that the sum of each column equals 1.

  • “horizontal” : Calculate proportions horizontally, so that the sum of each row equals 1.

  • “overall” : Calculate overall proportions, so that the sum of the whole table equals 1.
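The four summaries can be illustrated with plain pandas. This is a minimal sketch, not the biostats implementation: it assumes the summaries correspond to the `normalize` options of `pandas.crosstab`, and the small data frame is made up for illustration.

```python
import pandas as pd

# Hypothetical toy data, not the biostats example dataset.
df = pd.DataFrame({
    "Frequency": ["Daily"] * 4 + ["Monthly"] * 4,
    "Result": ["Damaged", "Undamaged", "Undamaged", "Undamaged",
               "Damaged", "Damaged", "Damaged", "Undamaged"],
})

# "count": raw frequencies of each (Frequency, Result) combination.
counts = pd.crosstab(df["Frequency"], df["Result"])

# "horizontal": each row sums to 1.
horizontal = pd.crosstab(df["Frequency"], df["Result"], normalize="index")

# "vertical": each column sums to 1.
vertical = pd.crosstab(df["Frequency"], df["Result"], normalize="columns")

# "overall": the whole table sums to 1.
overall = pd.crosstab(df["Frequency"], df["Result"], normalize="all")

print(counts)
print(horizontal)
```

In this sketch only the presentation of the table changes between the four options; the underlying counts are the same in every case.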

Returns:
summary : pandas.DataFrame

The contingency table of the two categorical variables.

result : pandas.DataFrame

The p-value of the test.

See also

chi_square_test

A large-sample approximation to the Fisher exact test.

binomial_test

Test the difference between the observed and expected proportion of a variable.

Notes

Warning

The Fisher exact test calculates an exact p-value by enumerating every possible contingency table with the observed margins, so it can be very slow on large data sets. For larger data, chi_square_test() is recommended.
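To show what the enumeration involves, here is a minimal sketch for the 2x2 case only, written with the standard library. The function name `fisher_exact_2x2` is hypothetical and this is not how biostats computes its result; tables with more rows or columns need a multi-dimensional enumeration, which is where the cost grows quickly.

```python
from math import comb


def fisher_exact_2x2(a, b, c, d):
    """Two-sided exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2 = a + b, c + d
    col1 = a + c
    n = row1 + row2

    def prob(x):
        # Hypergeometric probability that the top-left cell equals x,
        # with all row and column totals held fixed.
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    p_observed = prob(a)
    lowest = max(0, col1 - row2)   # smallest feasible top-left cell
    highest = min(row1, col1)      # largest feasible top-left cell

    # Sum the probabilities of all tables that are no more likely than
    # the observed one; the small factor guards against rounding error.
    p_value = 0.0
    for x in range(lowest, highest + 1):
        p = prob(x)
        if p <= p_observed * (1 + 1e-7):
            p_value += p
    return p_value


p = fisher_exact_2x2(3, 1, 1, 3)
print(round(p, 4))  # 0.4857
```

The loop visits every feasible table once, so the work grows with the sample size even in the 2x2 case. Passing the transposed table gives the same p-value, which matches the remark that swapping the two variables does not change the result.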

Examples

>>> import biostats as bs
>>> data = bs.dataset("fisher_exact_test.csv")
>>> data
    Frequency     Result
0     Monthly  Undamaged
1     Monthly    Damaged
2     Monthly    Damaged
3     Monthly    Damaged
4     Monthly  Undamaged
..        ...        ...
95    Monthly  Undamaged
96     Weekly  Undamaged
97    Monthly    Damaged
98  Quarterly  Undamaged
99    Monthly  Undamaged

We want to test whether there is an association between Frequency and Result.

>>> summary, result = bs.fisher_exact_test(data=data, variable_1="Frequency", variable_2="Result", kind="horizontal")
>>> summary
           Damaged  Undamaged
Daily         0.04       0.96
Monthly       0.56       0.44
Quarterly     0.44       0.56
Weekly        0.20       0.80

Each row gives the proportions of Damaged and Undamaged within one Frequency group.

>>> result
        p-value     
Model  0.000123  ***

The p-value is less than 0.001, so there is a significant association between Frequency and Result. That is, the proportion of Damaged differs among the four Frequency groups.