Package smile.feature.imputation
Interface SVDImputer
public interface SVDImputer
Missing value imputation with singular value decomposition. Given the SVD
A = U Σ V^T, we use the most significant eigenvectors (the rows of
V^T with the largest singular values) to linearly estimate missing values.
Although it has been shown that a few significant eigenvectors are
sufficient to describe the data with small errors, the exact fraction of
eigenvectors best for estimation needs to be determined empirically. Once
the k most significant eigenvectors from V^T are selected, we estimate a
missing value j in row i by first regressing this row against the k
eigenvectors and then using the regression coefficients to reconstruct j
from a linear combination of the k eigenvectors. The j-th value of row i and
the j-th values of the k eigenvectors are not used in determining these
regression coefficients.
Note that SVD can only be performed on complete matrices; therefore we
initially fill all missing values in matrix A by other methods, obtaining
A'. We then use an expectation maximization method to arrive at the final
estimate, as follows. Each value missing in A is re-estimated from A' using
the above algorithm, and the procedure is repeated on the newly obtained
matrix until the total change in the matrix falls below an empirically
determined threshold (say 0.01) or maxIter iterations have been run.
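
The regression step described above can be sketched in isolation. The
following is a minimal illustration, not the library's implementation: the
class SvdRegressionSketch and its methods are hypothetical names, the k
basis vectors are passed in as the rows of vt rather than computed by an SVD,
and the row is assumed to have a single missing entry at index j (all other
entries already filled).

```java
// Hypothetical sketch of the per-value regression step; not part of smile.
final class SvdRegressionSketch {
    /**
     * Estimates row[j] from k basis vectors (the rows of vt, each as long as row).
     * The least-squares coefficients are fitted on every column except j,
     * so row[j] itself (typically NaN) is never read.
     */
    static double estimate(double[] row, double[][] vt, int j) {
        int k = vt.length;
        // Augmented normal equations [B^T B | B^T y], skipping column j.
        double[][] a = new double[k][k + 1];
        for (int col = 0; col < row.length; col++) {
            if (col == j) continue;
            for (int p = 0; p < k; p++) {
                for (int q = 0; q < k; q++) a[p][q] += vt[p][col] * vt[q][col];
                a[p][k] += vt[p][col] * row[col];
            }
        }
        double[] c = solve(a);
        // Reconstruct the missing value as a linear combination of the j-th entries.
        double value = 0.0;
        for (int p = 0; p < k; p++) value += c[p] * vt[p][j];
        return value;
    }

    /** Solves an n x (n+1) augmented system by Gaussian elimination with partial pivoting. */
    static double[] solve(double[][] a) {
        int n = a.length;
        for (int i = 0; i < n; i++) {
            int pivot = i;
            for (int r = i + 1; r < n; r++) {
                if (Math.abs(a[r][i]) > Math.abs(a[pivot][i])) pivot = r;
            }
            double[] tmp = a[i]; a[i] = a[pivot]; a[pivot] = tmp;
            if (Math.abs(a[i][i]) < 1e-12) throw new ArithmeticException("singular system");
            for (int r = i + 1; r < n; r++) {
                double f = a[r][i] / a[i][i];
                for (int col = i; col <= n; col++) a[r][col] -= f * a[i][col];
            }
        }
        double[] x = new double[n];
        for (int i = n - 1; i >= 0; i--) {
            double s = a[i][n];
            for (int col = i + 1; col < n; col++) s -= a[i][col] * x[col];
            x[i] = s / a[i][i];
        }
        return x;
    }

    public static void main(String[] args) {
        // Toy basis vectors chosen for illustration; a real run would take them from V^T.
        double[][] vt = {{1, 0, 1}, {0, 1, 1}};
        double[] row = {2, 3, Double.NaN};
        System.out.println(estimate(row, vt, 2));
    }
}
```

In the full procedure this estimate would be computed for every missing
entry, the SVD recomputed on the updated matrix, and the loop repeated until
the change falls below the threshold.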
Method Summary
Modifier and Type    Method                                         Description
static double[][]    impute(double[][] data, int k, int maxIter)    Impute missing values in the dataset.
Method Details
impute
static double[][] impute(double[][] data, int k, int maxIter)
Impute missing values in the dataset.
Parameters:
data - a data set with missing values (represented as Double.NaN).
k - the number of eigenvectors used for imputation.
maxIter - the maximum number of iterations.
Returns:
the imputed data.
Throws:
IllegalArgumentException - when the whole row or column is missing.
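
Because impute rejects input in which a whole row or column is missing, a
caller may want to detect that condition up front. The following is a minimal
sketch of such a pre-check, assuming a rectangular array; MissingPatternCheck
and requireImputable are hypothetical names and not part of the library, and
the library's own check may differ in detail.

```java
// Hypothetical pre-check mirroring the documented failure condition; not part of smile.
final class MissingPatternCheck {
    /** Throws IllegalArgumentException if any row or column consists entirely of NaN. */
    static void requireImputable(double[][] data) {
        int cols = data.length == 0 ? 0 : data[0].length;
        boolean[] columnHasValue = new boolean[cols];
        for (int i = 0; i < data.length; i++) {
            boolean rowHasValue = false;
            for (int j = 0; j < cols; j++) {
                if (!Double.isNaN(data[i][j])) {
                    rowHasValue = true;
                    columnHasValue[j] = true;
                }
            }
            if (!rowHasValue) {
                throw new IllegalArgumentException("row " + i + " is entirely missing");
            }
        }
        for (int j = 0; j < cols; j++) {
            if (!columnHasValue[j]) {
                throw new IllegalArgumentException("column " + j + " is entirely missing");
            }
        }
    }
}
```

A caller could run MissingPatternCheck.requireImputable(data) before
SVDImputer.impute(data, k, maxIter) and drop or handle the offending rows or
columns separately.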