当前位置:嗨网首页>书籍在线阅读

04-按年龄和经度对州长进行聚类

  
选择背景色: 黄橙 洋红 淡粉 水蓝 草绿 白色 选择字体: 宋体 黑体 微软雅黑 楷体 选择字体大小: 恢复默认

6.3 按年龄和经度对州长进行聚类

美国每一个州都有一位州长。2017年6月,这些州长的年龄从42岁到79岁。如果从东到西以经度来考量每个州,也许可以找到经度相近且州长年龄相仿的州聚类簇。图6-2是全部50位州长的散点图。x轴是州的经度,y轴是州长的年龄。

37.png

图6-2 按州的经度和州长年龄绘制的2017年6月州长散点图

图6-2中是否包含明显的聚类簇?此图的坐标轴没有归一化。图中的数据仍是原始数据。如果聚类簇总是那么明显,就不需要用到聚类算法了。

下面试着用k均值聚类算法运行一下上述数据集。首先,单个数据点需要有一种表现形式。具体代码如代码清单6-12所示。

代码清单6-12 governors.py

from __future__ import annotations
from typing import List
from data_point import DataPoint
from kmeans import Kmeans
class Governor(DataPoint):
    def __init__(self, longitude: float, age: float, state: str) -> None:
        super().__init__([longitude, age])
        self.longitude = longitude
        self.age = age
        self.state = state
    def __repr__(self) -> str:
        return f"{self.state}: (longitude: {self.longitude}, age: {self.age})"

Governor 带有两个已命名并存储的维度: longitudeage 。除为实现美观打印而重写了 __repr__() 之外, Governor 没有对其超类 DataPoint 的处理机制做出其他改动。手工录入以下数据很不合理,因此还是请查看本书附带的源代码库吧。具体代码如代码清单6-13所示。

代码清单6-13 governors.py(续)

if __name__ == "__main__":
   governors: List[Governor] = [Governor(-86.79113, 72, "Alabama"), Governor(-152.
    404419, 66, "Alaska"), Governor(-111.431221, 53, "Arizona"), Governor(-92.373123,
    66, "Arkansas"), Governor(-119.681564, 79, "California"), Governor(-105.311104, 
    65, "Colorado"), Governor(-72.755371, 61, "Connecticut"), Governor(-75.507141,
    61, "Delaware"), Governor(-81.686783, 64, "Florida"), Governor(-83.643074, 74,
    "Georgia"), Governor(-157.498337, 60, "Hawaii"), Governor(-114.478828, 75, "Idaho"), 
    Governor(-88.986137, 60, "Illinois"), Governor(-86.258278, 49, "Indiana"), Governor
    (-93.210526, 57, "Iowa"), Governor(-96.726486, 60, "Kansas"), Governor(-84.670067, 
    50, "Kentucky"), Governor(-91.867805, 50, "Louisiana"), Governor(-69.381927, 68, 
    "Maine"), Governor(-76.802101, 61, "Maryland"), Governor(-71.530106, 60,
    "Massachusetts"), Governor(-84.536095, 58, "Michigan"), Governor(-93.900192, 
    70, "Minnesota"), Governor(-89.678696, 62, "Mississippi"), Governor(-92.288368, 
    43, "Missouri"), Governor(-110.454353, 51, "Montana"), Governor(-98.268082, 52, 
    "Nebraska"), Governor(-117.055374, 53, "Nevada"), Governor(-71.563896, 42, 
    "New Hampshire"), Governor(-74.521011, 54, "New Jersey"), Governor(-106.248482, 
    57, "New Mexico"), Governor(-74.948051, 59, "New York"), Governor(-79.806419, 
    60, "North Carolina"), Governor(-99.784012, 60, "North Dakota"), Governor(-82.764915,
    65, "Ohio"), Governor(-96.928917, 62, "Oklahoma"), Governor(-122.070938, 56, 
    "Oregon"), Governor(-77.209755, 68, "Pennsylvania"), Governor(-71.51178, 46, 
    "Rhode Island"), Governor(-80.945007, 70, "South Carolina"), Governor(-99.438828,
    64, "South Dakota"), Governor(-86.692345, 58, "Tennessee"), Governor(-97.563461, 
    59, "Texas"), Governor(-111.862434, 70, "Utah"), Governor(-72.710686, 58, "Vermont"),
    Governor(-78.169968, 60, "Virginia"),Governor(-121.490494, 66, "Washington"), 
    Governor(-80.954453, 66, "West Virginia"),Governor(-89.616508, 49, "Wisconsin"),
    Governor(-107.30249, 55, "Wyoming")]

k 设为 2 ,运行k均值聚类算法。具体代码如代码清单6-14所示。

代码清单6-14 governors.py(续)

kmeans: KMeans[Governor] = KMeans(2, governors)
gov_clusters: List[KMeans.Cluster] = kmeans.run()
for index, cluster in enumerate(gov_clusters):
    print(f"Cluster {index}: {cluster.points}\n")

因为是以随机形心开始运行的,所以每次运行 KMeans 都可能返回不同的聚类簇。这里需要进行一些人工分析才能确定聚类簇是否真正相关。以下是确实存在有意义聚类簇的情况下的运行结果:

Converged after 5 iterations
Cluster 0: [Alabama: (longitude: -86.79113, age: 72), Arizona: (longitude: -111.431221,
     age: 53), Arkansas: (longitude: -92.373123, age: 66), Colorado: (longitude: 
     -105.311104, age: 65), Connecticut: (longitude: -72.755371, age: 61), Delaware: 
     (longitude: -75.507141, age: 61), Florida: (longitude: -81.686783, age: 64), 
     Georgia: (longitude: -83.643074, age: 74), Illinois: (longitude: -88.986137, 
     age: 60), Indiana: (longitude: -86.258278, age: 49), Iowa: (longitude: -93.210526,
     age: 57), Kansas: (longitude: -96.726486, age: 60), Kentucky: (longitude: 
     -84.670067, age: 50), Louisiana: (longitude: -91.867805, age: 50), Maine: 
     (longitude: -69.381927, age: 68), Maryland: (longitude: -76.802101, age: 61),
     Massachusetts: (longitude: -71.530106, age: 60), Michigan: (longitude: -84.536095,
     age: 58), Minnesota: (longitude: -93.900192, age: 70), Mississippi: (longitude:
     -89.678696, age: 62), Missouri: (longitude: -92.288368, age: 43), Montana: 
     (longitude: -110.454353, age: 51), Nebraska: (longitude: -98.268082, age: 52),
     Nevada: (longitude: -117.055374, age: 53), New Hampshire: (longitude: -71.563896, 
     age: 42), New Jersey: (longitude: -74.521011, age: 54), New Mexico: (longitude: 
     -106.248482, age: 57), New York: (longitude: -74.948051, age: 59), North Carolina:
     (longitude: -79.806419, age: 60), North Dakota: (longitude: -99.784012, age:
     60), Ohio: (longitude: -82.764915, age: 65), Oklahoma: (longitude: -96.928917, 
     age: 62), Pennsylvania: (longitude: -77.209755, age: 68), Rhode Island:(longitude:
     -71.51178, age: 46), South Carolina: (longitude: -80.945007, age: 70), South Dakota:
     (longitude: -99.438828, age: 64), Tennessee: (longitude: -86.692345, age: 58), 
     Texas: (longitude: -97.563461, age:59), Vermont: (longitude: -72.710686, age:
     58), Virginia: (longitude: -78.169968, age: 60), West Virginia: (longitude: 
     -80.954453, age: 66), Wisconsin: (longitude: -89.616508, age: 49), Wyoming: 
     (longitude: 107.30249, age: 55)]
Cluster 1: [Alaska: (longitude: -152.404419, age: 66), California: (longitude: 
     -119.681564, age: 79), Hawaii: (longitude: -157.498337, age: 60), Idaho: (longitude: 
     -114.478828, age: 75), Oregon: (longitude: -122.070938, age: 56), Utah: (longitude:
     -111.862434, age: 70), Washington: (longitude: -121.490494, age: 66)]

聚类簇1代表最西部的各州,在地理上均彼此相邻(如果将Alaska和Hawaii视作与太平洋沿岸各州相邻)。这些州的州长年龄相对较大,于是就形成了一个有意义的聚类簇。难道太平洋沿岸的人们都喜欢年长的州长吗?除相关之外,无法从这些聚类簇中得出任何其他结论。图6-3演示了这一结果。方块表示聚类簇1,圆点表示聚类簇0。

38.png

图6-3 聚类簇0的数据点由圆点标识,聚类簇1的数据点由方块标识

提示 如果形心是随机初始化的,则每次k均值聚类的结果会有所不同,这一点怎么强调都不为过。对于任何数据集,请确保多次运行k均值聚类算法。