biostats.factor_analysis

biostats.factor_analysis(data, x, factors, analyze=None)

Find the underlying factors of a set of variables.

Parameters:
data : pandas.DataFrame

The input data. Must contain at least two numeric columns.

x : list

The list of numeric variables to be analyzed.

factors : int

The number of factors.

analyze : dict, optional

The observation to be scored on the extracted factors, given as a dictionary mapping each variable name in x to a value.

Returns:
summary : pandas.DataFrame

The uniqueness of each variable.

result : pandas.DataFrame

The loadings of each variable, sum of squared loadings, proportion of variance, and cumulative proportion of variance of each factor.

analysis : pandas.DataFrame

The factor scores of the data to be analyzed.

See also

principal_component_analysis

Find the linear combinations of a set of variables that capture the most variation in the data.

linear_discriminant_analysis

Find the linear combination of a set of variables to distinguish between groups.

Examples

>>> import biostats as bs
>>> data = bs.dataset("factor_analysis.csv")
>>> data
     Oil  Density  Crispy  Fracture  Hardness
0   16.5     2955      10        23        97
1   17.7     2660      14         9       139
2   16.2     2870      12        17       143
3   16.7     2920      10        31        95
4   16.3     2975      11        26       143
5   19.1     2790      13        16       189
6   18.4     2750      13        17       114
7   17.5     2770      10        26        63
8   15.7     2955      11        23       123
9   16.4     2945      11        24       132
10  18.0     2830      12        15       121
11  17.4     2835      12        18       172
12  18.4     2860      14        11       170
13  13.9     2965      12        19       169
14  15.8     2930       9        26        65
15  16.4     2770      15        16       183
16  18.9     2650      14        20       114
17  17.3     2890      12        17       142
18  16.7     2695      13        13       111
19  19.1     2755      14        10       140
20  13.7     3000      10        27       177
21  14.7     2980      10        20       133
22  18.1     2780      13        14       150
23  17.2     2705       8        27       113
24  18.7     2825      13        20       166
25  18.1     2875      12        15       150
26  16.6     2945      10        25       100
27  17.1     2920      10        25       123
28  17.4     2845      13        19       129
29  19.4     2645      12        18        68
30  15.9     3080      10        23       106
31  17.1     2825      10        28       131
32  15.5     3125       7        33        92
33  17.7     2780      13        22       141
34  15.9     2900      12        21       192
35  21.2     2570      14        13       105
36  19.5     2635      13        22       101
37  20.5     2725      14        16       145
38  17.0     2865      11        22       100
39  16.7     2975      10        26       105
40  16.8     2980      10        24       144
41  16.8     2870      12        20       123
42  16.3     2920      11        22       136
43  16.2     3100       8        27       140
44  18.1     2910      12        21       120
45  16.6     2865      11        25       120
46  16.4     2995      12        20       165
47  15.1     2925      10        29       118
48  21.1     2700      13        16       116
49  16.3     2845      10        26        75

We want to find the underlying factors of the five variables.

>>> summary, result, analysis = bs.factor_analysis(data=data, x=["Oil", "Density", "Crispy", "Fracture", "Hardness"], factors=2, 
...     analyze={"Oil":17.2, "Density":2830, "Crispy":12, "Fracture":19, "Hardness":121})
>>> summary
                 Oil   Density   Crispy  Fracture  Hardness
Uniqueness  0.322983  0.169086  0.04781  0.251765  0.398991

The uniqueness of each variable (the proportion of its variability that cannot be explained by the factors) is given.
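As a rough cross-check, uniqueness can be recomputed from the factor loadings. The sketch below assumes the usual relation for standardized variables, uniqueness = 1 - communality, where the communality is the sum of a variable's squared loadings. The numbers are copied by hand from the `result` table of this example. The dictionary names are illustrative and not part of the biostats API.

```python
# Loadings (Factor 1, Factor 2) copied from the `result` table in this example.
loadings = {
    "Oil":      (-0.822497, -0.022736),
    "Density":  (0.911124, 0.027689),
    "Crispy":   (-0.747793, 0.626893),
    "Fracture": (0.653877, -0.566286),
    "Hardness": (0.095274, 0.769371),
}

# Communality: share of a variable's variance explained by the factors,
# i.e. the sum of its squared loadings. Uniqueness is what remains.
# Assumption: variables are standardized, so total variance is 1.
uniqueness = {
    name: 1.0 - sum(value ** 2 for value in row)
    for name, row in loadings.items()
}

for name, value in uniqueness.items():
    print(f"{name:<9}{value:.4f}")
```

Up to rounding of the printed loadings, these values should agree with the `summary` row above.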

>>> result
                 Factor 1  Factor 2
Oil             -0.822497 -0.022736
Density          0.911124  0.027689
Crispy          -0.747793  0.626893
Fracture         0.653877 -0.566286
Hardness         0.095274  0.769371
                    NaN       NaN
SS Loadings      2.502476  1.306891
Proportion Var.  0.500495  0.261378
Cumulative Var.  0.500495  0.761873

The loadings (contribution of each original variable to the factor), SS Loadings (sum of squared loadings), Proportion Var. (proportion of variance explained by each factor), and Cumulative Var. (cumulative proportion of variance) are reported.
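The three summary rows can be reproduced from the loadings with a few lines of arithmetic. This is a minimal sketch, assuming that the proportion of variance is the sum of squared loadings divided by the number of variables (which holds for standardized variables). The loadings are copied by hand from the table above, and the variable names are illustrative, not part of the biostats API.

```python
from itertools import accumulate

# Rows: Oil, Density, Crispy, Fracture, Hardness. Columns: Factor 1, Factor 2.
loadings = [
    (-0.822497, -0.022736),
    (0.911124, 0.027689),
    (-0.747793, 0.626893),
    (0.653877, -0.566286),
    (0.095274, 0.769371),
]

n_variables = len(loadings)
n_factors = len(loadings[0])

# SS Loadings: column-wise sum of squared loadings.
ss_loadings = [
    sum(row[j] ** 2 for row in loadings) for j in range(n_factors)
]

# Proportion Var.: each factor's share of the total variance
# (assumption: each standardized variable contributes variance 1).
proportion_var = [ss / n_variables for ss in ss_loadings]

# Cumulative Var.: running total of the proportions.
cumulative_var = list(accumulate(proportion_var))

print(ss_loadings, proportion_var, cumulative_var)
```

Up to rounding of the printed loadings, the results should match the last three rows of `result`.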

>>> analysis
          Factor 1  Factor 2
Analysis -0.251185  0.090308

The factor scores of the observation passed as analyze are calculated.