Transforming data into embeddings with a deep learning model yields a meaningful representation of the original data. HDBSCAN (Hierarchical Density-Based Spatial Clustering of Applications with Noise) is a clustering algorithm that groups data points by analyzing their density, and it relies on computing distances between points in the embedding space. It is particularly useful for discovering clusters of varying shapes and sizes.

## Preparing the Data

```
pip install "pymilvus[model]"
pip install hdbscan
pip install plotly
pip install umap-learn
```

## Extracting Embeddings

```python
from openai import OpenAI

openai_client = OpenAI(api_key="b5f67c0f6b67", base_url="http://14.103.139.62:8000/v1")


def emb_text(text):
    # Request a single embedding from the OpenAI-compatible endpoint.
    return (
        openai_client.embeddings.create(
            input=text, model="/data/llm/Qwen/Qwen3-Embeding-0.6B"
        )
        .data[0]
        .embedding
    )


test_embedding = emb_text("This is a test")
embedding_dim = len(test_embedding)
print(embedding_dim)
print(test_embedding[:10])
```

```
1024
[-0.0101318359375, -0.0299072265625, -0.0115966796875, -0.060546875, -0.0023651123046875, 0.016845703125, -0.03955078125, 0.0311279296875, -0.024169921875, 0.0]
```

```python
import pandas as pd
from pymilvus import FieldSchema, Collection, connections, CollectionSchema, DataType

df = pd.read_csv("news_data_dedup.csv")
docs = [
    f"{title}\n{description}" for title, description in zip(df.title, df.description)
]
for item in docs[:10]:
    print(item + "\n")
```

```
Harvey Weinstein's 2020 rape conviction overturned
Victims group describes the New York appeal court's decision to retry Hollywood mogul as "profoundly unjust".

Police and activists clash on Atlanta campus amid Gaza protests
Meanwhile, hundreds of students march in Washington DC, and congresswoman Ilhan Omar joins protesters at a New York campus.

Haiti PM resigns as transitional council sworn in
The council will try to restore order and form a new government in the nation gripped by gang violence.

Europe risks dying and faces big decisions - Macron
The French president delivers a stark warning for Europe to act fast to survive in a changing world.

Prosecutors ask for halt to case against Spain PM's wife
Pedro Sánchez is deciding whether to resign after a case against his wife by an anti-corruption group.

WATCH: Would you pay a tourist fee to enter Venice?
From Thursday visitors making a trip to the famous city at peak times will be charged a trial entrance fee.

Supreme Court divided on whether Trump has immunity
The justices discussed immunity, coups, pardons, Operation Mongoose - and the future of democracy.

More than 150 killed as heavy rains pound Tanzania
The prime minister warns that El Niño-triggered heavy rains are likely to continue into May.

Playboy model's interview 'aggravated' Trump, court hears
A tabloid publisher testifies the then-president didn't understand why he let Karen McDougal speak to media.

Paris's Moulin Rouge loses windmill sails overnight
The cause of the sails' collapse from the roof of the world famous cabaret club is not yet clear.
```
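The next cell embeds the documents one request at a time. If the endpoint supports batched input (the standard OpenAI embeddings API accepts a list of strings, though this particular proxy may not), a sketch like the following can cut request overhead. The helper name `emb_batch` and the batch size of 32 are arbitrary choices, not part of the original walkthrough:

```python
def emb_batch(texts, batch_size=32):
    # Embed texts in batches: an OpenAI-style embeddings endpoint accepts
    # a list of inputs and returns one embedding per input, in order.
    out = []
    for i in range(0, len(texts), batch_size):
        resp = openai_client.embeddings.create(
            input=texts[i : i + batch_size],
            model="/data/llm/Qwen/Qwen3-Embeding-0.6B",
        )
        out.extend(d.embedding for d in resp.data)
    return out
```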
```python
from tqdm import tqdm

data = []
for line in tqdm(docs, desc="Creating embeddings"):
    data.append(emb_text(line))
```

```
Creating embeddings: 100%|██████████| 870/870 [00:18<00:00, 46.00it/s]
```

```python
connections.connect(uri="milvus_hdbscan.db")
```

```python
fields = [
    FieldSchema(
        name="id", dtype=DataType.INT64, is_primary=True, auto_id=True
    ),  # Primary ID field
    FieldSchema(
        name="embedding", dtype=DataType.FLOAT_VECTOR, dim=1024
    ),  # Float vector field (embedding)
    FieldSchema(
        name="text", dtype=DataType.VARCHAR, max_length=65535
    ),  # Raw text field
]

schema = CollectionSchema(fields=fields, description="Embedding collection")
collection = Collection(name="news_data", schema=schema)

for doc, embedding in zip(docs, data):
    collection.insert({"text": doc, "embedding": embedding})

index_params = {"index_type": "FLAT", "metric_type": "L2", "params": {}}

collection.create_index(field_name="embedding", index_params=index_params)

collection.flush()
```

## Building the Distance Matrix for HDBSCAN

HDBSCAN needs to compute distances between points in order to cluster them, which can be computationally expensive. Since distant points have little influence on cluster assignments, we can improve efficiency by computing only the top-k nearest neighbors. In this example we use a FLAT index, but for large-scale datasets Milvus supports more advanced index types to speed up the search.

First, we get an iterator over the Milvus collection we created earlier.

```python
import hdbscan
import numpy as np
import pandas as pd
import plotly.express as px
from umap import UMAP
from pymilvus import Collection

collection = Collection(name="news_data")
collection.load()

iterator = collection.query_iterator(
    batch_size=10, expr="id > 0", output_fields=["id", "embedding"]
)

search_params = {
    "metric_type": "L2",
    "params": {"nprobe": 10},
}  # L2 is Euclidean distance

ids = []
dist = {}

embeddings = []
```

We iterate over all embeddings in the Milvus collection. For each embedding, we search for its top-k neighbors in the same collection and collect their IDs and distances. We then build a dictionary that maps the original IDs to contiguous indices in the distance matrix. After that, we create a distance matrix with every element initialized to infinity and fill in the entries we retrieved, so distances between far-apart points are effectively ignored. Finally, we cluster the points with the HDBSCAN library on the distance matrix we built, setting the metric to "precomputed" to indicate that the input is a distance matrix rather than raw embeddings.

```python
while True:
    batch = iterator.next()
    if len(batch) == 0:
        break

    # Collect the IDs in this batch
    batch_ids = [data["id"] for data in batch]
    ids.extend(batch_ids)

    # Collect the embeddings
    query_vectors = [data["embedding"] for data in batch]
    embeddings.extend(query_vectors)

    results = collection.search(
        data=query_vectors,
        limit=50,
        anns_field="embedding",
        param=search_params,
        output_fields=["id"],
    )
    for i, batch_id in enumerate(batch_ids):
        dist[batch_id] = []
        for result in results[i]:
            dist[batch_id].append((result.id, result.distance))

ids2index = {}

for id in dist:
    ids2index[id] = len(ids2index)

dist_metric = np.full((len(ids), len(ids)), np.inf, dtype=np.float64)

for id in dist:
    for result in dist[id]:
        dist_metric[ids2index[id]][ids2index[result[0]]] = result[1]

h = hdbscan.HDBSCAN(min_samples=3, min_cluster_size=3, metric="precomputed")
hdb = h.fit(dist_metric)

print(hdb)
```

```
HDBSCAN(metric='precomputed', min_cluster_size=3, min_samples=3)
/home/sichen/anaconda3/envs/jupyterlab/lib/python3.12/site-packages/hdbscan/hdbscan_.py:142: UserWarning: The minimum spanning tree contains edge weights with value infinity. Potentially, you are missing too many distances in the initial distance matrix for the given neighborhood size.
```

The warning is expected here: only the top-50 neighbor distances were filled in, so all other entries in the matrix remain infinite.
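Before visualizing, it can help to sanity-check the labels HDBSCAN assigned. A minimal sketch (not part of the original walkthrough) that summarizes `hdb.labels_`, where `-1` marks noise points:

```python
labels = hdb.labels_  # one label per point; -1 marks noise
n_clusters = int(labels.max()) + 1
n_noise = int(np.sum(labels == -1))
print(f"{n_clusters} clusters, {n_noise} noise points out of {len(labels)}")

# Cluster sizes, largest first
sizes = np.bincount(labels[labels >= 0])
for c in np.argsort(sizes)[::-1][:10]:
    print(f"cluster {c}: {sizes[c]} points")
```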
## Visualizing Clusters with UMAP

We have clustered the data with HDBSCAN and obtained a label for each data point. With some visualization techniques, however, we can get the full picture of the clusters for intuitive analysis. Here we use UMAP, an efficient dimensionality-reduction method that preserves the structure of high-dimensional data while projecting it into a lower-dimensional space for visualization or further analysis. With it, we can visualize the original high-dimensional data in two or three dimensions and see the clusters clearly.

We iterate over the data points again to fetch the ID and text of the original data, then use plotly to plot the points together with this metadata, using different colors for different clusters.

```python
import plotly.io as pio

pio.renderers.default = "notebook"

umap = UMAP(n_components=2, random_state=42, n_neighbors=80, min_dist=0.1)

df_umap = (
    pd.DataFrame(umap.fit_transform(np.array(embeddings)), columns=["x", "y"])
    .assign(cluster=lambda df: hdb.labels_.astype(str))
    .query('cluster != "-1"')
    .sort_values(by="cluster")
)

iterator = collection.query_iterator(
    batch_size=10, expr="id > 0", output_fields=["id", "text"]
)

ids = []
texts = []

while True:
    batch = iterator.next()
    if len(batch) == 0:
        break
    batch_ids = [data["id"] for data in batch]
    batch_texts = [data["text"] for data in batch]
    ids.extend(batch_ids)
    texts.extend(batch_texts)

show_texts = [texts[i] for i in df_umap.index]

df_umap["hover_text"] = show_texts
fig = px.scatter(
    df_umap, x="x", y="y", color="cluster", hover_data={"hover_text": True}
)
fig.show()
```


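Beyond the interactive scatter plot, printing a few headlines per cluster is a quick way to check that the groupings are semantically coherent. A small sketch using the `df_umap` frame built above (assuming, as in this dataset, that each stored text begins with the headline on its first line):

```python
# Print up to three sample headlines per cluster to eyeball coherence.
for cluster_id, group in df_umap.groupby("cluster"):
    print(f"--- cluster {cluster_id} ({len(group)} docs) ---")
    for text in group["hover_text"].head(3):
        print(text.split("\n")[0])  # first line is the headline
```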