biostats.principal_component_analysis

biostats.principal_component_analysis(data, x, transform=None)

Find the linear combinations of a set of variables that capture the most variation in the data.

Parameters:

data (pandas.DataFrame): The input data. Must contain at least one numeric column.
x (list): The list of numeric variables to be analyzed.
transform (dict): A new observation to be transformed, given as a mapping from each variable name in x to a value. Optional.

Returns:

summary (pandas.DataFrame): The counts, mean values, standard deviations, and variances of each variable.
result (pandas.DataFrame): The coefficients and intercepts of the linear combinations, as well as the proportions of variation explained by each dimension.
transformation (pandas.DataFrame): The coordinates of the observation given in transform along the new dimensions.

See also

factor_analysis: Find the underlying factors of a set of variables.
linear_discriminant_analysis: Find the linear combination of a set of variables to distinguish between groups.

Examples

>>> import biostats as bs
>>> data = bs.dataset("principal_component_analysis.csv")
>>> data
    Murder  Assault  UrbanPop  Rape
0     13.2      236        58  21.2
1     10.0      263        48  44.5
2      8.1      294        80  31.0
3      8.8      190        50  19.5
4      9.0      276        91  40.6
5      7.9      204        78  38.7
6      3.3      110        77  11.1
7      5.9      238        72  15.8
8     15.4      335        80  31.9
9     17.4      211        60  25.8
10     5.3       46        83  20.2
11     2.6      120        54  14.2
12    10.4      249        83  24.0
13     7.2      113        65  21.0
14     2.2       56        57  11.3
15     6.0      115        66  18.0
16     9.7      109        52  16.3
17    15.4      249        66  22.2
18     2.1       83        51   7.8
19    11.3      300        67  27.8
20     4.4      149        85  16.3
21    12.1      255        74  35.1
22     2.7       72        66  14.9
23    16.1      259        44  17.1
24     9.0      178        70  28.2
25     6.0      109        53  16.4
26     4.3      102        62  16.5
27    12.2      252        81  46.0
28     2.1       57        56   9.5
29     7.4      159        89  18.8
30    11.4      285        70  32.1
31    11.1      254        86  26.1
32    13.0      337        45  16.1
33     0.8       45        44   7.3
34     7.3      120        75  21.4
35     6.6      151        68  20.0
36     4.9      159        67  29.3
37     6.3      106        72  14.9
38     3.4      174        87   8.3
39    14.4      279        48  22.5
40     3.8       86        45  12.8
41    13.2      188        59  26.9
42    12.7      201        80  25.5
43     3.2      120        80  22.9
44     2.2       48        32  11.2
45     8.5      156        63  20.7
46     4.0      145        73  26.2
47     5.7       81        39   9.3
48     2.6       53        66  10.8
49     6.8      161        60  15.6

We want to find the linear combinations of the four variables that capture the most variation in the data.

>>> summary, result, transformation = bs.principal_component_analysis(data=data, x=["Murder", "Assault", "UrbanPop", "Rape"], 
...     transform={"Murder":10.2, "Assault":211, "UrbanPop":67, "Rape":32.3})
>>> summary
          Count     Mean  Std. Deviation     Variance
Murder       50    7.788        4.355510    18.970465
Assault      50  170.760       83.337661  6945.165714
UrbanPop     50   65.540       14.474763   209.518776
Rape         50   21.232        9.366385    87.729159

The summary reports the count, mean, standard deviation, and variance of each of the four variables.
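The variances in the summary differ by orders of magnitude, which matters when reading a principal component analysis. The sketch below is plain Python, not part of the biostats API. It copies the variances from the summary table and computes each variable's share of the total variance. Whether biostats standardizes the variables before the analysis is not stated here; if it does not, the variable with the largest share can be expected to dominate the first dimension.

```python
# Variances copied by hand from the `summary` table above.
variances = {
    "Murder": 18.970465,
    "Assault": 6945.165714,
    "UrbanPop": 209.518776,
    "Rape": 87.729159,
}

total_variance = sum(variances.values())

# Share of the total variance contributed by each original variable.
shares = {name: value / total_variance for name, value in variances.items()}

for name, share in shares.items():
    print(f"{name:<9}{share:.4f}")
```

Assault alone accounts for roughly 96% of the total variance in these data.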

>>> result
               Murder   Assault  UrbanPop      Rape   Intercept  Proportion
Dimension 1  0.041704  0.995221  0.046336  0.075156 -174.901326    0.965534
Dimension 2  0.044822  0.058760 -0.976857 -0.200718   57.901952    0.027817
Dimension 3  0.079891 -0.067570 -0.200546  0.974081    3.378144    0.005800
Dimension 4  0.994922 -0.038938  0.058169 -0.072325   -3.376148    0.000849

The coefficients and intercepts that define each new dimension are given, along with the proportion of variation that each dimension explains.
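One way to read the Intercept column is as the term that centers each variable on its mean before the coefficients are applied. The following sketch checks that reading against the printed numbers only. It is an inference from the tables above, not a description of the biostats implementation, and the variable names are illustrative.

```python
# Means from the `summary` table, in the order Murder, Assault, UrbanPop, Rape.
means = [7.788, 170.760, 65.540, 21.232]

# Coefficients and intercepts from the `result` table, one row per dimension.
coefficients = [
    [0.041704, 0.995221, 0.046336, 0.075156],
    [0.044822, 0.058760, -0.976857, -0.200718],
    [0.079891, -0.067570, -0.200546, 0.974081],
    [0.994922, -0.038938, 0.058169, -0.072325],
]
reported_intercepts = [-174.901326, 57.901952, 3.378144, -3.376148]

# If scores are computed on mean-centered variables, each intercept
# should equal minus the dot product of the coefficients and the means.
derived_intercepts = [
    -sum(c * m for c, m in zip(row, means)) for row in coefficients
]

for number, (derived, reported) in enumerate(
    zip(derived_intercepts, reported_intercepts), start=1
):
    print(f"Dimension {number}: derived {derived:.4f}, reported {reported:.4f}")
```

The derived and reported intercepts agree up to the rounding of the printed values, which is consistent with each dimension being a linear combination of the mean-centered, unscaled variables.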

>>> transformation
                Dimension 1  Dimension 2  Dimension 3  Dimension 4
Transformation    41.047766    -1.175146     7.962017     0.117308

The coordinates of the observation passed in transform are given along the four new dimensions.
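These coordinates can be reproduced by hand from the result table: each one is the sum of the observation's values weighted by that dimension's coefficients, plus the intercept. The sketch below does this in plain Python with the numbers copied from the output above. The helper names are illustrative and not part of biostats.

```python
variables = ["Murder", "Assault", "UrbanPop", "Rape"]

# Coefficients and intercepts from the `result` table, one row per dimension.
coefficients = [
    [0.041704, 0.995221, 0.046336, 0.075156],
    [0.044822, 0.058760, -0.976857, -0.200718],
    [0.079891, -0.067570, -0.200546, 0.974081],
    [0.994922, -0.038938, 0.058169, -0.072325],
]
intercepts = [-174.901326, 57.901952, 3.378144, -3.376148]

# The observation passed as `transform` in the example call.
observation = {"Murder": 10.2, "Assault": 211, "UrbanPop": 67, "Rape": 32.3}
values = [observation[name] for name in variables]

# Coordinate on each dimension = coefficients . values + intercept.
coordinates = [
    sum(c * v for c, v in zip(row, values)) + intercept
    for row, intercept in zip(coefficients, intercepts)
]

for number, coordinate in enumerate(coordinates, start=1):
    print(f"Dimension {number}: {coordinate:.4f}")
```

The values match the transformation table up to the rounding of the printed coefficients.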