LLM Reasoning relevance: 7/10

Bridging Academia and Industry: A Comprehensive Benchmark for Attributed Graph Clustering

Yunhui Liu, Pengyu Qiu, Yu Xing, Yongchao Liu, Peng Du, Chuntao Hong, Jiajun Zheng, Tao Zheng, Tieke He
arXiv: 2602.08519v1 Published: 2026-02-09 Updated: 2026-02-09

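AI Summary

Introduces PyAGC, a comprehensive benchmark for attributed graph clustering that aims to bridge the gap between academic research and industrial application.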

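Key Contributions

  • Builds PyAGC, an attributed graph clustering benchmark of 12 datasets (2.7K to 111M nodes) that includes low-homophily industrial graphs.
  • Unifies existing attributed graph clustering methods under a modular Encode-Cluster-Optimize framework.
  • Provides memory-efficient mini-batch implementations of attributed graph clustering algorithms.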

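Methodology

The authors propose a modular Encode-Cluster-Optimize framework, provide mini-batch implementations of a range of existing algorithms, and evaluate them on the PyAGC benchmark.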
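To make the three stages concrete, here is a minimal NumPy sketch of what an Encode-Cluster-Optimize pipeline could look like. It does not use the PyAGC API; the function names `encode`, `cluster`, and `optimize` are hypothetical, and the specific choices (parameter-free feature smoothing as the encoder, nearest-centroid assignment, mini-batch k-means updates) are assumptions made for illustration, not the methods the paper benchmarks.

```python
import numpy as np

def encode(adj, feats, hops=2):
    """Encode: smooth node attributes over the graph (assumed, parameter-free encoder)."""
    a_hat = adj + np.eye(adj.shape[0])              # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(a_hat.sum(axis=1))   # degrees are >= 1 after self-loops
    a_norm = a_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    z = feats
    for _ in range(hops):
        z = a_norm @ z                              # one propagation step per hop
    return z

def cluster(z, centroids):
    """Cluster: assign each embedding to its nearest centroid."""
    dists = ((z[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return dists.argmin(axis=1)

def optimize(z, centroids, batch_size=2, epochs=10, seed=0):
    """Optimize: mini-batch k-means, so only one batch is processed per update."""
    rng = np.random.default_rng(seed)
    centroids = centroids.copy()
    counts = np.ones(len(centroids))                # initial centroid counts as one sample
    for _ in range(epochs):
        order = rng.permutation(len(z))
        for start in range(0, len(z), batch_size):
            batch = z[order[start:start + batch_size]]
            for x, k in zip(batch, cluster(batch, centroids)):
                counts[k] += 1
                centroids[k] += (x - centroids[k]) / counts[k]  # running mean
    return centroids

# Toy graph: two triangles (nodes 0-2 and 3-5) joined by the edge 2-3.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
adj = np.zeros((6, 6))
for i, j in edges:
    adj[i, j] = adj[j, i] = 1.0
feats = np.array([[1.0, 0.0]] * 3 + [[0.0, 1.0]] * 3)

z = encode(adj, feats)
centroids = optimize(z, centroids=z[[0, 5]])        # seed centroids from two nodes
labels = cluster(z, centroids)
print(labels)
```

Keeping the stages as separate functions is the point of the sketch: the encoder, the assignment rule, or the update loop can each be swapped without touching the other two.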

Original Abstract

Attributed Graph Clustering (AGC) is a fundamental unsupervised task that integrates structural topology and node attributes to uncover latent patterns in graph-structured data. Despite its significance in industrial applications such as fraud detection and user segmentation, a significant chasm persists between academic research and real-world deployment. Current evaluation protocols suffer from the small-scale, high-homophily citation datasets, non-scalable full-batch training paradigms, and a reliance on supervised metrics that fail to reflect performance in label-scarce environments. To bridge these gaps, we present PyAGC, a comprehensive, production-ready benchmark and library designed to stress-test AGC methods across diverse scales and structural properties. We unify existing methodologies into a modular Encode-Cluster-Optimize framework and, for the first time, provide memory-efficient, mini-batch implementations for a wide array of state-of-the-art AGC algorithms. Our benchmark curates 12 diverse datasets, ranging from 2.7K to 111M nodes, specifically incorporating industrial graphs with complex tabular features and low homophily. Furthermore, we advocate for a holistic evaluation protocol that mandates unsupervised structural metrics and efficiency profiling alongside traditional supervised metrics. Battle-tested in high-stakes industrial workflows at Ant Group, this benchmark offers the community a robust, reproducible, and scalable platform to advance AGC research towards realistic deployment. The code and resources are publicly available via GitHub (https://github.com/Cloudy1225/PyAGC), PyPI (https://pypi.org/project/pyagc), and Documentation (https://pyagc.readthedocs.io).
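The abstract does not name the unsupervised structural metrics it has in mind. As a rough illustration, the sketch below computes modularity and conductance, two common label-free measures of clustering quality, plus edge homophily, which does need ground-truth labels and describes the dataset rather than the clustering. The function names are hypothetical and the dense-matrix formulation is an assumption suited only to small graphs.

```python
import numpy as np

def modularity(adj, labels):
    """Newman modularity of a partition of an undirected, unweighted graph."""
    two_m = adj.sum()                               # each edge is counted twice
    deg = adj.sum(axis=1)
    same = labels[:, None] == labels[None, :]
    return float(((adj - np.outer(deg, deg) / two_m) * same).sum() / two_m)

def conductance(adj, labels):
    """Mean over clusters of cut(S, rest) / min(vol(S), vol(rest)); lower is better."""
    deg = adj.sum(axis=1)
    scores = []
    for c in np.unique(labels):
        inside = labels == c
        cut = adj[inside][:, ~inside].sum()
        vol = min(deg[inside].sum(), deg[~inside].sum())
        scores.append(cut / vol if vol > 0 else 0.0)
    return float(np.mean(scores))

def edge_homophily(adj, true_labels):
    """Fraction of edges whose endpoints share a ground-truth label."""
    same = true_labels[:, None] == true_labels[None, :]
    return float((adj * same).sum() / adj.sum())

# Toy graph: two triangles (nodes 0-2 and 3-5) joined by the edge 2-3.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
adj = np.zeros((6, 6))
for i, j in edges:
    adj[i, j] = adj[j, i] = 1.0
pred = np.array([0, 0, 0, 1, 1, 1])

q = modularity(adj, pred)
phi = conductance(adj, pred)
h = edge_homophily(adj, pred)
print(q, phi, h)
```

On this toy graph the partition into the two triangles cuts a single edge out of seven, which is why conductance is low and homophily is high; a low-homophily industrial graph would show the opposite pattern for label-aligned partitions.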

Tags

Graph clustering, Attributed graphs, Benchmarking, Machine learning, Industrial applications

arXiv Categories

cs.LG