DPMeans

DP-Means clustering.

DP-Means is an extension of the K-Means algorithm inspired by the Dirichlet Process Mixture model. It allows the number of clusters to be learned from the data instead of being fixed beforehand.

Parameters

n_clusters : int, default=8

The initial number of clusters to form as well as the number of centroids to generate.

init : {'k-means++', 'random'}, callable or array-like of shape (n_clusters, n_features), default='k-means++'

Method for initialization. Same as KMeans initialization.

max_iter : int, default=300

Maximum number of iterations of the DP-means algorithm for a single run.

tol : float, default=1e-4

Relative tolerance with regard to the Frobenius norm of the difference in the cluster centers of two consecutive iterations, used to declare convergence.

verbose : int, default=0

Verbosity mode.

random_state : int, RandomState instance or None, default=None

Determines random number generation for centroid initialization.

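copy_x : bool, default=True

When pre-computing distances it is more numerically accurate to center the data first. As in KMeans, if copy_x is True (the default), the original data is not modified. If False, the original data is modified and put back before the function returns, but small numerical differences may be introduced by subtracting and then adding the data mean.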

delta : float, default=1.0

Controls the balance between the number of clusters and the data-fitting term. Higher values of delta produce fewer clusters; lower values produce more.

max_clusters : int or None, default=None

The maximum number of clusters that can be formed. Useful for bounding runtime when delta may have been set too low.
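
The balance that delta controls can be illustrated with the penalized objective used in the DP-Means literature: the data-fitting term (sum of squared distances to the nearest center) plus delta times the number of clusters. The sketch below assumes that form. The helper names are hypothetical, the code is kept dependency-free for illustration, and it is not the pdc_dp_means implementation.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two points given as sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def dp_means_cost(points, centers, delta):
    """Assumed objective: data-fitting term plus a penalty of delta per cluster."""
    fit_term = sum(min(squared_distance(p, c) for c in centers) for p in points)
    return fit_term + delta * len(centers)


points = [(1, 2), (1, 4), (1, 0), (10, 2), (10, 4), (10, 0)]
one_center = [(5.5, 2.0)]                # a single cluster at the global mean
two_centers = [(1.0, 2.0), (10.0, 2.0)]  # one cluster per visible group

for delta in (1.0, 200.0):
    cost_one = dp_means_cost(points, one_center, delta)
    cost_two = dp_means_cost(points, two_centers, delta)
    preferred = "two clusters" if cost_two < cost_one else "one cluster"
    print(f"delta={delta}: one={cost_one}, two={cost_two}, cheaper: {preferred}")
```

Under this assumed objective, the small delta makes the two-cluster solution cheaper, while the large delta makes the extra cluster cost more than the fit it buys.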

Attributes

cluster_centers_ : ndarray of shape (n_clusters, n_features)

Coordinates of cluster centers.

labels_ : ndarray of shape (n_samples,)

Label of each point.

inertia_ : float

Sum of squared distances of samples to their closest cluster center, weighted by the sample weights if provided.

n_iter_ : int

Number of iterations run.

n_features_in_ : int

Number of features seen during fit.

feature_names_in_ : ndarray of shape (n_features_in_,)

Names of features seen during fit. Defined only when X has feature names that are all strings.

See Also

KMeans : The base algorithm for DP-Means. Fixed number of clusters.

Notes

The DP-Means algorithm extends K-Means by treating the number of clusters as a variable to be learned. A new cluster is formed whenever a data point is “far enough” from all existing clusters. “Far enough” is determined by the delta parameter, which effectively controls the number of clusters formed.
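
The rule described above can be sketched as a small assignment/update loop. This is an illustrative, dependency-free version and not the pdc_dp_means implementation. It assumes that "far enough" means the squared Euclidean distance to the nearest center exceeds delta, and that the loop starts from a single center at the global mean rather than from n_clusters initial centers. The function names are hypothetical.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two points given as sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def dp_means_sketch(points, delta, max_iter=100, max_clusters=None):
    """Illustrative DP-Means loop (assumed rule: squared distance > delta)."""
    centers = [mean_point(points)]  # assumption: start from one global center
    labels = [0] * len(points)
    for _ in range(max_iter):
        previous_labels = list(labels)
        # Assignment step: nearest center, or a new cluster if the point is too far.
        for i, p in enumerate(points):
            distances = [squared_distance(p, c) for c in centers]
            nearest = min(range(len(centers)), key=distances.__getitem__)
            can_grow = max_clusters is None or len(centers) < max_clusters
            if distances[nearest] > delta and can_grow:
                centers.append(tuple(p))  # the far point seeds a new cluster
                nearest = len(centers) - 1
            labels[i] = nearest
        # Update step: drop clusters that lost all members, then recompute means.
        used = sorted(set(labels))
        remap = {old: new for new, old in enumerate(used)}
        labels = [remap[lab] for lab in labels]
        centers = [
            mean_point([p for p, lab in zip(points, labels) if lab == k])
            for k in range(len(used))
        ]
        if labels == previous_labels:
            break  # assignments are stable, so the centers will not move
    return centers, labels


points = [(1, 2), (1, 4), (1, 0), (10, 2), (10, 4), (10, 0)]
centers, labels = dp_means_sketch(points, delta=20.0)
print(len(centers), labels)
```

In this sketch a larger delta (for example 200.0 on the same data) leaves every point within the threshold of the initial center, so no new cluster is created.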

Examples

>>> from pdc_dp_means import DPMeans
>>> import numpy as np
>>> X = np.array([[1, 2], [1, 4], [1, 0],
...               [10, 2], [10, 4], [10, 0]])
>>> dpmeans = DPMeans(n_clusters=2, delta=1.0, random_state=0).fit(X)
>>> dpmeans.labels_
array([1, 1, 1, 0, 0, 0], dtype=int32)
>>> dpmeans.predict([[0, 0], [12, 3]])
array([1, 0], dtype=int32)
>>> dpmeans.cluster_centers_
array([[10.,  2.],
       [ 1.,  2.]])

API

class pdc_dp_means.DPMeans(n_clusters=8, *, init='k-means++', n_init=10, max_iter=300, tol=0.0001, verbose=0, random_state=None, copy_x=True, delta=1.0, max_clusters=None)

Bases: KMeans

fit(X, y=None, sample_weight=None)

Compute DP-Means clustering.

Parameters

X : {array-like, sparse matrix} of shape (n_samples, n_features)

Training instances to cluster. Note that the data will be converted to C ordering, which causes a memory copy if the given data is not C-contiguous. If a sparse matrix is passed, a copy is made if it is not in CSR format.

y : Ignored

Not used, present here for API consistency by convention.

sample_weight : array-like of shape (n_samples,), default=None

The weights for each observation in X. If None, all observations are assigned equal weight.

New in version 0.20.

Returns

self : object

Fitted estimator.
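
To make the role of sample_weight concrete, the sketch below shows how per-observation weights would shift a cluster center and the weighted sum of squared distances. It assumes the weights enter as in a weighted mean, as they do in KMeans; whether DPMeans applies them in exactly this way is an assumption. The helper names and the example weights are hypothetical.

```python
def weighted_centroid(points, weights):
    """Weighted mean of the points, computed one coordinate at a time."""
    total = sum(weights)
    return tuple(
        sum(w * x for w, x in zip(weights, coords)) / total
        for coords in zip(*points)
    )


def weighted_inertia(points, weights, center):
    """Weighted sum of squared distances from each point to a single center."""
    return sum(
        w * sum((x - c) ** 2 for x, c in zip(p, center))
        for p, w in zip(points, weights)
    )


cluster = [(1, 2), (1, 4), (1, 0)]
uniform = [1.0, 1.0, 1.0]  # equal weights, as described for sample_weight=None
skewed = [1.0, 3.0, 1.0]   # hypothetical weights favouring the second point

print(weighted_centroid(cluster, uniform))  # (1.0, 2.0)
print(weighted_centroid(cluster, skewed))   # (1.0, 2.8)
print(weighted_inertia(cluster, uniform, weighted_centroid(cluster, uniform)))
```

With the skewed weights the center is pulled toward the heavily weighted point, which is the effect one would expect sample_weight to have on the fitted centers.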