DPMeans
DP-Means clustering.
DP-Means is an extension of the K-Means algorithm inspired by the Dirichlet process mixture model. It allows the number of clusters to be learned from the data instead of being fixed beforehand.
Parameters
- n_clusters : int, default=8
Initial number of clusters to form, and therefore the number of centroids generated at initialization. Further clusters may be added during fitting.
- init : {'k-means++', 'random'}, callable or array-like of shape (n_clusters, n_features), default='k-means++'
Method for initialization. Same as KMeans initialization.
- max_iter : int, default=300
Maximum number of iterations of the DP-means algorithm for a single run.
- tol : float, default=1e-4
Relative tolerance with regard to the Frobenius norm of the difference between the cluster centers of two consecutive iterations, used to declare convergence.
- verbose : int, default=0
Verbosity mode.
- random_state : int, RandomState instance or None, default=None
Determines random number generation for centroid initialization.
- copy_x : bool, default=True
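When pre-computing distances, it is more numerically accurate to center the data first. As in KMeans, if copy_x is True (the default), the original data is not modified. If False, the data is centered in place and restored before returning, which may introduce small numerical differences.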
- delta : float, default=1.0
Controls the trade-off between the number of clusters and the data-fitting term. Higher values of delta produce fewer clusters; lower values produce more.
- max_clusters : int or None, default=None
The maximum number of clusters that can be formed. Useful for bounding runtime when delta may have been set too low.
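As a rough illustration of the trade-off that delta controls, the sketch below evaluates a penalized objective of the kind used in the DP-Means literature: the sum of squared distances to the nearest center plus delta times the number of clusters. Whether pdc_dp_means uses exactly this form and scaling internally is an assumption, and `dp_means_cost` is a hypothetical helper, not part of the package.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length sequences."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def dp_means_cost(X, centers, delta):
    """Hypothetical helper: data-fitting term plus a per-cluster penalty.

    Assumes the objective  sum_i min_c ||x_i - mu_c||^2 + delta * k,
    where k is the number of centers.
    """
    fit_term = sum(min(squared_distance(x, c) for c in centers) for x in X)
    return fit_term + delta * len(centers)


X = [[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]]
one_center = [[5.5, 2.0]]                  # a single cluster at the global mean
two_centers = [[1.0, 2.0], [10.0, 2.0]]    # one cluster per visible group

for delta in (1.0, 200.0):
    print(delta,
          dp_means_cost(X, one_center, delta),
          dp_means_cost(X, two_centers, delta))
# delta=1.0   -> 138.5 vs 18.0   (two clusters are cheaper)
# delta=200.0 -> 337.5 vs 416.0  (one cluster is cheaper)
```

Under this assumed objective, a small delta makes an extra cluster worth its penalty, while a large delta favors merging the two groups.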
Attributes
- cluster_centers_ : ndarray of shape (n_clusters, n_features)
Coordinates of cluster centers.
- labels_ : ndarray of shape (n_samples,)
Label of each point.
- inertia_ : float
Sum of squared distances of samples to their closest cluster center, weighted by the sample weights if provided.
- n_iter_ : int
Number of iterations run.
- n_features_in_ : int
Number of features seen during fit.
- feature_names_in_ : ndarray of shape (n_features_in_,)
Names of features seen during fit. Defined only when X has feature names that are all strings.
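To make the definition of inertia_ concrete, here is a small pure-Python sketch that computes a weighted sum of squared distances to the closest center. `weighted_inertia` is a hypothetical helper written for illustration only; it mirrors the description above rather than the package's internal code.

```python
def weighted_inertia(X, centers, sample_weight=None):
    """Sum over samples of weight * squared distance to the closest center."""
    if sample_weight is None:
        # Equal weights when none are provided.
        sample_weight = [1.0] * len(X)
    total = 0.0
    for x, w in zip(X, sample_weight):
        closest = min(
            sum((xi - ci) ** 2 for xi, ci in zip(x, c)) for c in centers
        )
        total += w * closest
    return total


X = [[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]]
centers = [[1.0, 2.0], [10.0, 2.0]]

print(weighted_inertia(X, centers))                      # 16.0
print(weighted_inertia(X, centers, [1, 2, 1, 1, 1, 1]))  # 20.0
```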
See Also
KMeans : The base algorithm for DP-Means. Fixed number of clusters.
Notes
The DP-Means algorithm extends K-Means by treating the number of clusters as a variable to be learned. A new cluster is formed whenever a data point is “far enough” from all existing clusters. “Far enough” is determined by the delta parameter, which effectively controls the number of clusters formed.
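The rule described above can be sketched as a short serial loop. This is a minimal illustration under stated assumptions, not the pdc_dp_means implementation: it assumes that "far enough" means a squared Euclidean distance greater than delta, it starts from a single center at the data mean instead of n_clusters initial centers, and it ignores max_clusters, tol, and sample weights. `dp_means_sketch` is a hypothetical name.

```python
def dp_means_sketch(X, delta, max_iter=100):
    """Illustrative serial DP-Means loop (not the package's implementation).

    Assumes a point opens a new cluster when its squared Euclidean distance
    to every existing center exceeds delta.
    """
    n_features = len(X[0])
    # Start from a single center at the mean of all points.
    centers = [[sum(x[j] for x in X) / len(X) for j in range(n_features)]]
    labels = [0] * len(X)

    for _ in range(max_iter):
        # Assignment step, possibly creating new clusters on the fly.
        new_labels = []
        for x in X:
            dists = [sum((xj - cj) ** 2 for xj, cj in zip(x, c)) for c in centers]
            best = min(range(len(centers)), key=dists.__getitem__)
            if dists[best] > delta:
                # "Far enough" from every center: the point seeds a new cluster.
                centers.append([float(v) for v in x])
                best = len(centers) - 1
            new_labels.append(best)

        # Drop clusters that lost all members and renumber the labels.
        used = sorted(set(new_labels))
        remap = {old: new for new, old in enumerate(used)}
        new_labels = [remap[lab] for lab in new_labels]

        # Update step: each center becomes the mean of its members.
        centers = []
        for k in range(len(used)):
            members = [x for x, lab in zip(X, new_labels) if lab == k]
            centers.append([sum(m[j] for m in members) / len(members)
                            for j in range(n_features)])

        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels

    return centers, labels


X = [[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]]
centers, labels = dp_means_sketch(X, delta=10.0)
print(labels)   # [0, 0, 0, 1, 1, 1]
print(centers)  # [[1.0, 2.0], [10.0, 2.0]]
```

With a much larger threshold, for example delta=200.0, no point is far enough from the initial center and the sketch returns a single cluster.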
Examples
>>> from pdc_dp_means import DPMeans
>>> import numpy as np
>>> X = np.array([[1, 2], [1, 4], [1, 0],
...               [10, 2], [10, 4], [10, 0]])
>>> dpmeans = DPMeans(n_clusters=2, delta=1.0, random_state=0).fit(X)
>>> dpmeans.labels_
array([1, 1, 1, 0, 0, 0], dtype=int32)
>>> dpmeans.predict([[0, 0], [12, 3]])
array([1, 0], dtype=int32)
>>> dpmeans.cluster_centers_
array([[10.,  2.],
       [ 1.,  2.]])
API
- class pdc_dp_means.DPMeans(n_clusters=8, *, init='k-means++', n_init=10, max_iter=300, tol=0.0001, verbose=0, random_state=None, copy_x=True, delta=1.0, max_clusters=None)
Bases: KMeans
- fit(X, y=None, sample_weight=None)
Compute DP-Means clustering.
Parameters
- X : {array-like, sparse matrix} of shape (n_samples, n_features)
Training instances to cluster. Note that the data is converted to C ordering, which causes a memory copy if the given data is not C-contiguous. If a sparse matrix is passed, a copy is made if it is not in CSR format.
- y : Ignored
Not used, present here for API consistency by convention.
- sample_weight : array-like of shape (n_samples,), default=None
The weights for each observation in X. If None, all observations are assigned equal weight.
New in version 0.20.
Returns
- self : object
Fitted estimator.