摘要

We describe a density-based hierarchical group finding algorithm capable of identifying structures and substructures of any shape and density in multidimensional data sets where each dimension can be a numeric attribute with arbitrary measurement scale. This has applications in a wide variety of fields from finding structures in galaxy redshift surveys, to identifying halos and subhalos in N-body simulations and group finding in Local Group chemodynamical data sets. In general, clustering schemes require an a priori definition of a metric (a non-negative function that gives the distance between two points in a space) and the quality of clustering depends upon this choice. The general practice is to use a constant global metric which is optimal only if the clusters in the data are self-similar. For complex data configurations even the most finely tuned constant global metric turns out to be suboptimal. Moreover, the correct choice of metric also becomes increasingly important as the number of dimensions increase. To address these problems, we present an entropy-based binary space partitioning algorithm which uses a locally adaptive metric for each data point. The metric is employed to calculate the density at each point and a list of its nearest neighbors, and this information is then used to form a hierarchy of groups. Finally, the ratio of maximum to minimum density of points in a group is used to estimate the significance of the groups. Setting a threshold on this significance can effectively screen out groups arising due to Poisson noise and helps organize the groups into meaningful clusters. For a data set of N points, the algorithm requires only O(N) space and O(N(logN)(3)) time which makes it ideally suitable for analyzing large data sets. As an example, we apply the algorithm to identify structures in a simulated stellar halo using the full six-dimensional phase space coordinates.

  • 出版日期2009-9