DPMeans

DP-Means clustering.

DP-Means is an extension of the K-Means algorithm inspired by the Dirichlet Process Mixture model. It allows the number of clusters to be learned from the data instead of being fixed beforehand.

Parameters

n_clusters (int, default=8): The initial number of clusters to form, which is also the number of centroids generated at initialization.
init ({'k-means++', 'random'}, callable or array-like of shape (n_clusters, n_features), default='k-means++'): Method for initialization. Same as KMeans initialization.
max_iter (int, default=300): Maximum number of iterations of the DP-Means algorithm for a single run.
tol (float, default=1e-4): Relative tolerance with regard to the Frobenius norm of the difference in the cluster centers of two consecutive iterations, used to declare convergence.
verbose (int, default=0): Verbosity mode.
random_state (int, RandomState instance or None, default=None): Determines random number generation for centroid initialization.
copy_x (bool, default=True): When pre-computing distances it is more numerically accurate to center the data first. As in KMeans, if copy_x is True the original data is not modified; if False, the data is centered in place and restored before returning.
delta (float, default=1.0): Controls the balance between the number of clusters and the data-fitting term. Higher values of delta yield fewer clusters; lower values yield more.
max_clusters (int or None, default=None): The maximum number of clusters that can be formed. Useful for bounding runtime when delta may have been set too low.
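
The tol criterion can be illustrated with a small NumPy sketch. It assumes the convention used by scikit-learn's KMeans, where tol is scaled by the mean per-feature variance of X and compared against the squared Frobenius norm of the center shift. The helper has_converged is hypothetical and not part of pdc_dp_means, and it only applies when both iterations have the same number of centers.

```python
import numpy as np

def has_converged(old_centers, new_centers, X, tol=1e-4):
    """Hypothetical helper: relative-tolerance convergence test on center shift."""
    old_centers = np.asarray(old_centers, dtype=float)
    new_centers = np.asarray(new_centers, dtype=float)
    # Assumption: make tol relative to the data scale (mean per-feature variance).
    scaled_tol = np.mean(np.var(np.asarray(X, dtype=float), axis=0)) * tol
    # Squared Frobenius norm of the difference between consecutive center matrices.
    shift = np.sum((new_centers - old_centers) ** 2)
    return bool(shift <= scaled_tol)

X = np.array([[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]])
centers = np.array([[1.0, 2.0], [10.0, 2.0]])
unchanged = has_converged(centers, centers, X)      # centers did not move
moved = has_converged(centers, centers + 0.5, X)    # centers moved noticeably
```

Scaling by the data variance keeps the stopping rule independent of the units in which the features are measured.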

Attributes

cluster_centers_ (ndarray of shape (n_clusters, n_features)): Coordinates of cluster centers.
labels_ (ndarray of shape (n_samples,)): Label of each point.
inertia_ (float): Sum of squared distances of samples to their closest cluster center, weighted by the sample weights if provided.
n_iter_ (int): Number of iterations run.
n_features_in_ (int): Number of features seen during fit.
feature_names_in_ (ndarray of shape (n_features_in_,)): Names of features seen during fit. Defined only when X has feature names that are all strings.
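
As a rough illustration of how labels_ and inertia_ relate to cluster_centers_, the following NumPy sketch assigns each sample to its nearest center and sums the (optionally weighted) squared distances. The function assign_and_inertia is a hypothetical helper written for this page, not part of pdc_dp_means.

```python
import numpy as np

def assign_and_inertia(X, centers, sample_weight=None):
    """Hypothetical helper: nearest-center labels and weighted inertia."""
    X = np.asarray(X, dtype=float)
    centers = np.asarray(centers, dtype=float)
    if sample_weight is None:
        sample_weight = np.ones(len(X))
    # Squared Euclidean distances, shape (n_samples, n_clusters).
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    labels = d2.argmin(axis=1)
    # Weighted sum of squared distances to the closest center.
    inertia = float((np.asarray(sample_weight) * d2.min(axis=1)).sum())
    return labels, inertia

X = np.array([[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]])
centers = np.array([[10.0, 2.0], [1.0, 2.0]])
labels, inertia = assign_and_inertia(X, centers)
weighted_labels, weighted_inertia = assign_and_inertia(
    X, centers, sample_weight=[1, 1, 1, 2, 2, 2]
)
```

With these six points and two centers, the unweighted inertia is 16.0; doubling the weights of the last three points raises it to 24.0.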

Notes

The DP-Means algorithm extends K-Means by treating the number of clusters as a variable to be learned. A new cluster is formed whenever a data point is “far enough” from all existing clusters. “Far enough” is determined by the delta parameter, which effectively controls the number of clusters formed.
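
The "far enough" rule can be sketched in a few lines of NumPy. This is a toy serial DP-means loop, not the pdc_dp_means implementation: it assumes a new cluster is opened when the squared Euclidean distance to the nearest center exceeds delta, and the library's exact thresholding and scaling of delta may differ. The name dp_means_sketch is hypothetical.

```python
import numpy as np

def dp_means_sketch(X, delta, max_iter=100):
    """Toy serial DP-means (illustrative only, not the library's algorithm)."""
    X = np.asarray(X, dtype=float)
    centers = X.mean(axis=0, keepdims=True)  # start from a single global cluster
    labels = np.zeros(len(X), dtype=int)
    for _ in range(max_iter):
        changed = False
        for i, x in enumerate(X):
            d2 = ((centers - x) ** 2).sum(axis=1)
            k = int(d2.argmin())
            # Assumption: "far enough" means squared distance greater than delta.
            if d2[k] > delta:
                centers = np.vstack([centers, x])  # open a new cluster at x
                k = len(centers) - 1
            if labels[i] != k:
                labels[i] = k
                changed = True
        # Update step: drop empty clusters and recompute the remaining means.
        used = np.unique(labels)
        centers = np.array([X[labels == k].mean(axis=0) for k in used])
        labels = np.searchsorted(used, labels)  # relabel to 0..n_clusters-1
        if not changed:
            break
    return centers, labels

X = np.array([[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]])
centers_small, labels_small = dp_means_sketch(X, delta=10.0)
centers_large, labels_large = dp_means_sketch(X, delta=100.0)
```

Under these assumptions, delta=10.0 separates the two groups of points, while delta=100.0 is large enough that every point stays in the single initial cluster.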

Examples

>>> from pdc_dp_means import DPMeans
>>> import numpy as np
>>> X = np.array([[1, 2], [1, 4], [1, 0],
...               [10, 2], [10, 4], [10, 0]])
>>> dpmeans = DPMeans(n_clusters=2, delta=1.0, random_state=0).fit(X)
>>> dpmeans.labels_
array([1, 1, 1, 0, 0, 0], dtype=int32)
>>> dpmeans.predict([[0, 0], [12, 3]])
array([1, 0], dtype=int32)
>>> dpmeans.cluster_centers_
array([[10.,  2.],
       [ 1.,  2.]])

API

class pdc_dp_means.DPMeans(n_clusters=8, *, init='k-means++', n_init=10, max_iter=300, tol=0.0001, verbose=0, random_state=None, copy_x=True, delta=1.0, max_clusters=None)

Bases: KMeans

fit(X, y=None, sample_weight=None)

Compute DP-Means clustering.

Parameters

X ({array-like, sparse matrix} of shape (n_samples, n_features)): Training instances to cluster. Note that the data will be converted to C ordering, which causes a memory copy if the given data is not C-contiguous. If a sparse matrix is passed, a copy is made if it is not in CSR format.
y (Ignored): Not used, present here for API consistency by convention.
sample_weight (array-like of shape (n_samples,), default=None): The weights for each observation in X. If None, all observations are assigned equal weight.

New in version 0.20.

Returns

self (object): Fitted estimator.
