biostats.linear_discriminant_analysis

biostats.linear_discriminant_analysis(data, x, y, predict=None)

Find the linear combinations of a set of variables that best distinguish between groups.

Parameters:

data (pandas.DataFrame): The input data. Must contain at least one numeric column.
x (list): The names of the numeric variables to be analyzed.
y (str): The name of the categorical variable that specifies the groups to be distinguished. Maximum 20 groups.
predict (dict): The observation to be classified, given as a mapping from each variable in x to its value. Optional.

Returns:

summary (pandas.DataFrame): The mean values of each variable in each group.
result (pandas.DataFrame): The coefficients and intercepts of the linear combinations, as well as the proportions of separation achieved by each dimension.
prediction (pandas.DataFrame): The probabilities and results of the data to be predicted.

See also

factor_analysis: Find the underlying factors of a set of variables.
principal_component_analysis: Find the linear combination of a set of variables to manifest the variation of data.

Examples

>>> import biostats as bs
>>> data = bs.dataset("linear_discriminant_analysis.csv")
>>> data
     sepal_length  sepal_width  petal_length  petal_width    species
0             5.1          3.5           1.4          0.2     setosa
1             4.9          3.0           1.4          0.2     setosa
2             4.7          3.2           1.3          0.2     setosa
3             4.6          3.1           1.5          0.2     setosa
4             5.0          3.6           1.4          0.2     setosa
..            ...          ...           ...          ...        ...
145           6.7          3.0           5.2          2.3  virginica
146           6.3          2.5           5.0          1.9  virginica
147           6.5          3.0           5.2          2.0  virginica
148           6.2          3.4           5.4          2.3  virginica
149           5.9          3.0           5.1          1.8  virginica

We want to find the linear combinations of the four variables that best distinguish between the three species.

>>> summary, result, prediction = bs.linear_discriminant_analysis(data=data, x=["sepal_length", "sepal_width", "petal_length", "petal_width"], y="species",
...     predict={"sepal_length": 5.7, "sepal_width": 2.7, "petal_length": 4.0, "petal_width": 1.4})
>>> summary
            sepal_length  sepal_width  petal_length  petal_width
setosa             5.006        3.428         1.462        0.246
versicolor         5.936        2.770         4.260        1.326
virginica          6.588        2.974         5.552        2.026

The summary table gives the mean of each variable within each group.
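To make the idea of a per-group mean concrete, here is a minimal sketch in plain Python. It uses only the ten sepal_length values printed in the data preview (the first five setosa rows and the last five virginica rows), so its results differ from the full-data means in the summary table. With pandas, a group-by followed by a mean would be the usual way to do the same thing. This is not necessarily how biostats computes it internally.

```python
from collections import defaultdict
from statistics import fmean

# (sepal_length, species) for the ten rows shown in the data preview only.
rows = [
    (5.1, "setosa"), (4.9, "setosa"), (4.7, "setosa"),
    (4.6, "setosa"), (5.0, "setosa"),
    (6.7, "virginica"), (6.3, "virginica"), (6.5, "virginica"),
    (6.2, "virginica"), (5.9, "virginica"),
]

# Collect the values belonging to each group.
groups = defaultdict(list)
for value, species in rows:
    groups[species].append(value)

# Mean of the previewed values per group (a subset, not the full dataset).
subset_means = {species: fmean(values) for species, values in groups.items()}
for species, mean in subset_means.items():
    print(f"{species}: {mean:.2f}")
```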

>>> result
             sepal_length  sepal_width  petal_length  petal_width  Intercept  Proportion
Dimension 1      0.829378     1.534473     -2.201212    -2.810460   2.105106    0.991213
Dimension 2      0.024102     2.164521     -0.931921     2.839188  -6.661473    0.008787

The result table gives the coefficients and intercept that define each new dimension, along with the proportion of separation achieved by each dimension. Here, Dimension 1 accounts for about 99.1% of the separation.
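As an illustration of how such coefficients might be applied, the sketch below projects the example observation onto the two dimensions. It assumes each score is the intercept plus the sum of coefficient times raw variable value. The documentation does not state whether biostats applies any centering or scaling, so treat that formula as an assumption. The helper name discriminant_scores is hypothetical and not part of biostats.

```python
# Coefficients and intercepts copied from the `result` table.
coefficients = {
    "Dimension 1": {"sepal_length": 0.829378, "sepal_width": 1.534473,
                    "petal_length": -2.201212, "petal_width": -2.810460},
    "Dimension 2": {"sepal_length": 0.024102, "sepal_width": 2.164521,
                    "petal_length": -0.931921, "petal_width": 2.839188},
}
intercepts = {"Dimension 1": 2.105106, "Dimension 2": -6.661473}

# The observation passed as `predict` in the example call.
observation = {"sepal_length": 5.7, "sepal_width": 2.7,
               "petal_length": 4.0, "petal_width": 1.4}


def discriminant_scores(obs):
    """Hypothetical helper. Assumed form: intercept + sum(coefficient * value)."""
    return {
        dim: intercepts[dim] + sum(coefs[name] * obs[name] for name in coefs)
        for dim, coefs in coefficients.items()
    }


scores = discriminant_scores(observation)
for dim, score in scores.items():
    print(f"{dim}: {score:.4f}")
```

Under this assumption, each observation is reduced from four measurements to two scores, and most of the group separation lies along the first one.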

>>> prediction
               P(setosa)  P(versicolor)  P(virginica)      Result
Prediction  7.206674e-20       0.999792      0.000208  versicolor

The observation is predicted to belong to versicolor, with a probability of about 0.9998.
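The relationship between the probability columns and the Result column can be checked with a short sketch. It assumes that Result is simply the group with the largest probability, which matches the output shown but is not stated explicitly in the documentation. The values are copied from the prediction table.

```python
# Probabilities copied from the `prediction` table.
probabilities = {
    "setosa": 7.206674e-20,
    "versicolor": 0.999792,
    "virginica": 0.000208,
}

# Assumption: the reported Result is the group with the highest probability.
predicted = max(probabilities, key=probabilities.get)

# The probabilities should sum to 1, up to the rounding in the printed table.
total = sum(probabilities.values())

print(predicted)
print(f"{total:.6f}")
```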