04-按年龄和经度对州长进行聚类
6.3 按年龄和经度对州长进行聚类
美国每一个州都有一位州长。2017年6月,这些州长的年龄从42岁到79岁。如果从东到西以经度来考量每个州,也许可以找到经度相近且州长年龄相仿的州聚类簇。图6-2是全部50位州长的散点图。x轴是州的经度,y轴是州长的年龄。
图6-2中是否包含明显的聚类簇?此图的坐标轴没有归一化。图中的数据仍是原始数据。如果聚类簇总是那么明显,就不需要用到聚类算法了。
下面试着用k均值聚类算法运行一下上述数据集。首先,单个数据点需要有一种表现形式。具体代码如代码清单6-12所示。
代码清单6-12 governors.py
from __future__ import annotations
from typing import List
from data_point import DataPoint
from kmeans import Kmeans
class Governor(DataPoint):
def __init__(self, longitude: float, age: float, state: str) -> None:
super().__init__([longitude, age])
self.longitude = longitude
self.age = age
self.state = state
def __repr__(self) -> str:
return f"{self.state}: (longitude: {self.longitude}, age: {self.age})"
Governor
带有两个已命名并存储的维度: longitude
和 age
。除为实现美观打印而重写了 __repr__()
之外, Governor
没有对其超类 DataPoint
的处理机制做出其他改动。手工录入以下数据很不合理,因此还是请查看本书附带的源代码库吧。具体代码如代码清单6-13所示。
代码清单6-13 governors.py(续)
if __name__ == "__main__":
governors: List[Governor] = [Governor(-86.79113, 72, "Alabama"), Governor(-152.
404419, 66, "Alaska"), Governor(-111.431221, 53, "Arizona"), Governor(-92.373123,
66, "Arkansas"), Governor(-119.681564, 79, "California"), Governor(-105.311104,
65, "Colorado"), Governor(-72.755371, 61, "Connecticut"), Governor(-75.507141,
61, "Delaware"), Governor(-81.686783, 64, "Florida"), Governor(-83.643074, 74,
"Georgia"), Governor(-157.498337, 60, "Hawaii"), Governor(-114.478828, 75, "Idaho"),
Governor(-88.986137, 60, "Illinois"), Governor(-86.258278, 49, "Indiana"), Governor
(-93.210526, 57, "Iowa"), Governor(-96.726486, 60, "Kansas"), Governor(-84.670067,
50, "Kentucky"), Governor(-91.867805, 50, "Louisiana"), Governor(-69.381927, 68,
"Maine"), Governor(-76.802101, 61, "Maryland"), Governor(-71.530106, 60,
"Massachusetts"), Governor(-84.536095, 58, "Michigan"), Governor(-93.900192,
70, "Minnesota"), Governor(-89.678696, 62, "Mississippi"), Governor(-92.288368,
43, "Missouri"), Governor(-110.454353, 51, "Montana"), Governor(-98.268082, 52,
"Nebraska"), Governor(-117.055374, 53, "Nevada"), Governor(-71.563896, 42,
"New Hampshire"), Governor(-74.521011, 54, "New Jersey"), Governor(-106.248482,
57, "New Mexico"), Governor(-74.948051, 59, "New York"), Governor(-79.806419,
60, "North Carolina"), Governor(-99.784012, 60, "North Dakota"), Governor(-82.764915,
65, "Ohio"), Governor(-96.928917, 62, "Oklahoma"), Governor(-122.070938, 56,
"Oregon"), Governor(-77.209755, 68, "Pennsylvania"), Governor(-71.51178, 46,
"Rhode Island"), Governor(-80.945007, 70, "South Carolina"), Governor(-99.438828,
64, "South Dakota"), Governor(-86.692345, 58, "Tennessee"), Governor(-97.563461,
59, "Texas"), Governor(-111.862434, 70, "Utah"), Governor(-72.710686, 58, "Vermont"),
Governor(-78.169968, 60, "Virginia"),Governor(-121.490494, 66, "Washington"),
Governor(-80.954453, 66, "West Virginia"),Governor(-89.616508, 49, "Wisconsin"),
Governor(-107.30249, 55, "Wyoming")]
将 k
设为 2
,运行k均值聚类算法。具体代码如代码清单6-14所示。
代码清单6-14 governors.py(续)
kmeans: KMeans[Governor] = KMeans(2, governors)
gov_clusters: List[KMeans.Cluster] = kmeans.run()
for index, cluster in enumerate(gov_clusters):
print(f"Cluster {index}: {cluster.points}\n")
因为是以随机形心开始运行的,所以每次运行 KMeans
都可能返回不同的聚类簇。这里需要进行一些人工分析才能确定聚类簇是否真正相关。以下是确实存在有意义聚类簇的情况下的运行结果:
Converged after 5 iterations
Cluster 0: [Alabama: (longitude: -86.79113, age: 72), Arizona: (longitude: -111.431221,
age: 53), Arkansas: (longitude: -92.373123, age: 66), Colorado: (longitude:
-105.311104, age: 65), Connecticut: (longitude: -72.755371, age: 61), Delaware:
(longitude: -75.507141, age: 61), Florida: (longitude: -81.686783, age: 64),
Georgia: (longitude: -83.643074, age: 74), Illinois: (longitude: -88.986137,
age: 60), Indiana: (longitude: -86.258278, age: 49), Iowa: (longitude: -93.210526,
age: 57), Kansas: (longitude: -96.726486, age: 60), Kentucky: (longitude:
-84.670067, age: 50), Louisiana: (longitude: -91.867805, age: 50), Maine:
(longitude: -69.381927, age: 68), Maryland: (longitude: -76.802101, age: 61),
Massachusetts: (longitude: -71.530106, age: 60), Michigan: (longitude: -84.536095,
age: 58), Minnesota: (longitude: -93.900192, age: 70), Mississippi: (longitude:
-89.678696, age: 62), Missouri: (longitude: -92.288368, age: 43), Montana:
(longitude: -110.454353, age: 51), Nebraska: (longitude: -98.268082, age: 52),
Nevada: (longitude: -117.055374, age: 53), New Hampshire: (longitude: -71.563896,
age: 42), New Jersey: (longitude: -74.521011, age: 54), New Mexico: (longitude:
-106.248482, age: 57), New York: (longitude: -74.948051, age: 59), North Carolina:
(longitude: -79.806419, age: 60), North Dakota: (longitude: -99.784012, age:
60), Ohio: (longitude: -82.764915, age: 65), Oklahoma: (longitude: -96.928917,
age: 62), Pennsylvania: (longitude: -77.209755, age: 68), Rhode Island:(longitude:
-71.51178, age: 46), South Carolina: (longitude: -80.945007, age: 70), South Dakota:
(longitude: -99.438828, age: 64), Tennessee: (longitude: -86.692345, age: 58),
Texas: (longitude: -97.563461, age:59), Vermont: (longitude: -72.710686, age:
58), Virginia: (longitude: -78.169968, age: 60), West Virginia: (longitude:
-80.954453, age: 66), Wisconsin: (longitude: -89.616508, age: 49), Wyoming:
(longitude: 107.30249, age: 55)]
Cluster 1: [Alaska: (longitude: -152.404419, age: 66), California: (longitude:
-119.681564, age: 79), Hawaii: (longitude: -157.498337, age: 60), Idaho: (longitude:
-114.478828, age: 75), Oregon: (longitude: -122.070938, age: 56), Utah: (longitude:
-111.862434, age: 70), Washington: (longitude: -121.490494, age: 66)]
聚类簇1代表最西部的各州,在地理上均彼此相邻(如果将Alaska和Hawaii视作与太平洋沿岸各州相邻)。这些州的州长年龄相对较大,于是就形成了一个有意义的聚类簇。难道太平洋沿岸的人们都喜欢年长的州长吗?除相关之外,无法从这些聚类簇中得出任何其他结论。图6-3演示了这一结果。方块表示聚类簇1,圆点表示聚类簇0。
提示 如果形心是随机初始化的,则每次k均值聚类的结果会有所不同,这一点怎么强调都不为过。对于任何数据集,请确保多次运行k均值聚类算法。