Algorithm Series: The Math Behind the CDN

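For the overall delivery portion of the streaming equation, CDNs use a refined mix of content caching and content replication; optimized network paths, including ingress, egress, and midgress data transfer; and strategic server placement, either at the core (such as the origin server) or at the edge (often referred to as caching content at a point of presence). Underlying the key CDN components are a number of fundamental algorithms used to balance strategic core and edge architecture demands.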

This article, the first in our new Algorithm Series, dives into the math behind the magic of streaming media delivery to highlight the significant mathematical concepts (and even a few equations) that power the infrastructure used to deliver live and on-demand streams on a global scale.

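If you read my November/December 2019 Streams of Thought column, you'll remember that this article series came about from a discussion at IBC in Amsterdam among me and two others, including one of several Ph.D. streaming solutions architects. In that conversation, the two of the three of us who hold math degrees, Michelle Fore and Yuriy Reznik, head of research and a fellow at Brightcove, began diving deep into the math at the intersection of media player performance and multi-codec manifest files.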

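The conversation led to an initial idea to cover four key areas: delivery performance (CDN), player performance (OTT or OVP bitrate and rendition ladder optimization), live event scaling (including authentication and other potential bottlenecks), and digital rights management. That idea was fleshed out in a subsequent interview I did with Reznik at Streaming Media West 2019 in Los Angeles.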

Along the way, I was introduced to people in the industry with whom I've had either no or limited interaction over my past 2 decades in the space, but whose contributions to those four areas are essential to the road map of not just how we got to where we are today, but also where media delivery goes in the future.

CDN Math

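What exactly is CDN math? The most common content delivery math, at least from the perspective of paying customers, is the billing algorithm. These are by no means new, having long existed in the world of telecom-based data networks (think dial-up, ISDN, and even landline long-distance service).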

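Beyond these and other customer-centric algorithms like 95/5 (aka 95th percentile), a key area that all CDNs measure and optimize is caching server utilization, including ways to keep any single server from being pushed past its capacity, which is often referred to as "swamping" the server.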
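For readers who haven't seen 95/5 billing worked out, here is a minimal Python sketch of the commonly described approach: sort the bandwidth samples, set aside the top 5%, and bill at the highest sample that remains. The function name, the nearest-rank rounding, and the sample values are assumptions for illustration, not any particular CDN's billing rules.

```python
def percentile_95(samples_mbps):
    """Return the billable rate under a 95/5 scheme (nearest-rank method)."""
    if not samples_mbps:
        raise ValueError("need at least one bandwidth sample")
    ordered = sorted(samples_mbps)
    # Integer form of ceil(0.95 * n), which avoids floating-point rounding.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]

# Hypothetical billing period reduced to 20 samples for readability:
# steady 10 Mbps traffic with one short burst to 500 Mbps.
samples = [10] * 19 + [500]
billable = percentile_95(samples)
print(billable)  # 10: the single burst falls in the discarded top 5%
```

The design point is that short bursts are forgiven, while sustained high usage is not.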

Proper Caching Server Utilization (aka Consistent Hashing)

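There's a lot of math that goes into modern content delivery, but one of the fundamental algorithms for web acceleration and streaming can be found in a patent whose roots go way back to 1997. U.S. Patent No. 8,458,259, titled "Method and apparatus for distributing requests among a plurality of resources," is itself a continuation of several prior patents dating back to a March 13, 1998, patent application for what became U.S. Patent No. 6,430,618 in 2002. The original patent was granted to the Massachusetts Institute of Technology and is based on research presented by members of its Laboratory for Computer Science at the 29th Annual ACM Symposium on Theory of Computing (STOC97) in May 1997. Their paper is titled "Consistent Hashing and Random Trees: Distributed Caching Protocols for Relieving Hot Spots on the World Wide Web" and can be found on pages 654-663 of the symposium proceedings.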

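Several of the paper's authors (David Karger, Eric Lehman, Tom Leighton, Matthew Levine, Daniel Lewin, and Rina Panigrahy) are now well known in content delivery circles. Take Leighton, who retains roles at both MIT's Laboratory for Computer Science (now known as the Computer Science & Artificial Intelligence Laboratory) and MIT's Department of Mathematics: He co-founded Akamai the following year with the late Lewin.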

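So what is this idea of "consistent hashing" that the paper's authors developed and presented? The money quote from the patent spells out the problem: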

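Two causes of delay are heavy communication loads on a part of a network and a large number of requests loading a particular server. When part of the network becomes congested with too much traffic, communication over that part of the network becomes unreliable and slow. Similarly, when too many requests are directed at a single server, the server becomes overloaded, or 'swamped.'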

To address both the network congestion and the swamping of the original server, commonly referred to as the origin server, the patent proposes the concept of caching servers:

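Caching can reduce network traffic if the cached copy is close, in the network topology sense, to the requester because fewer network links and resources are used to retrieve the information. Caching can also relieve an overloaded server because some of the requests that would normally be routed to the original site can be served by a cache server, and so the number of requests made of the original site is reduced.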

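But how many caching servers are needed, and in how many locations? More importantly, from a CDN design standpoint, are there other issues that could cause traffic jams even when there are plenty of caching servers? As it turns out, simply throwing caching servers at the problem is not an effective solution. That is the problem consistent hashing sets out to solve.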

To understand the basis for consistent hashing, we first have to understand hashing. To do that, we also need to understand modular math.

Modular Math

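Modulo is a mathematical operation that determines what whole number (any natural number, plus 0) remains after division takes place. If you remember high school math, the closest relative is long division with a remainder; synthetic division, which is closely related to Horner's method, applies the same quotient-and-remainder idea to polynomials.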

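Still don't remember how to do it? Well, here's how modulo works. For example, 7 modulo 2 (often written in shorthand as 7 mod 2) has an answer that is either zero or some natural number (any positive whole number above zero), and that answer is always smaller than the divisor. While we would normally write 7/2 as 3.5, using a decimal point to represent the leftover half, in modular math the answer is 1 (essentially 2 times 3 is 6, with 1 left over).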

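Because the result of a modulo operation is a whole number, in some cases the result will be no remainder at all. For example, if the formula is 6 mod 2, the answer is 0.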
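The two examples above can be checked in a few lines of Python, where `%` is the modulo operator and the built-in `divmod` returns the whole-number quotient and the remainder together. The variable names are just for illustration.

```python
# Ordinary division keeps the fractional part.
print(7 / 2)                      # 3.5

# divmod returns (whole-number quotient, remainder) in one call.
quotient, remainder = divmod(7, 2)
print(quotient, remainder)        # 3 1

# The modulo operator returns only the remainder.
print(7 % 2)                      # 1
print(6 % 2)                      # 0, since 2 divides 6 evenly
```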

The importance of modulo to hashing is that the remainders help determine which server a given piece of data might be assigned to and retrieved from. More on that in a moment.
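As a rough sketch of that idea, the Python below turns a content identifier into a large integer and uses the remainder to pick one of a fixed number of caching servers. The function name `server_for`, the use of MD5 as the hash, the server count, and the sample paths are all assumptions made for illustration; they are not taken from the patent or from any CDN's implementation.

```python
import hashlib

def server_for(key: str, num_servers: int) -> int:
    """Map a content key to a server index in the range 0..num_servers-1."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    # Interpret the 16-byte digest as one large integer, then keep the remainder.
    return int.from_bytes(digest, "big") % num_servers

# Hypothetical content paths spread across four caching servers.
for path in ["/video/intro.mp4", "/video/keynote.mp4", "/img/logo.png"]:
    print(path, "-> server", server_for(path, 4))
```

Because the same key always produces the same remainder, a request for a given piece of content can be sent to the same server that stored it.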

Hashing It Out

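The simplest definition of hash is to chop or divide, but for our purposes, hashing is "a function that maps one piece of data, typically describing some kind of object and often of arbitrary size, to another piece of data, typically an integer." Put slightly differently, according to the Wolfram MathWorld website, "A hash function (H) projects a value from a set with many (or even an infinite number of) members to a value from a set with a fixed number of (fewer) members." In other words, it is a way to represent an unlimited number of values with a set of a more limited number of values. In practice, for content, this also allows variable-length content to be represented by fixed-length representations.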
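To make the variable-length-in, fixed-length-out property concrete, here is a small Python sketch using SHA-256 from the standard library. SHA-256 is my choice for the illustration only, and `fixed_length_hash` is a hypothetical helper name.

```python
import hashlib

def fixed_length_hash(content: str) -> str:
    """Return a hex digest whose length does not depend on the input length."""
    return hashlib.sha256(content.encode("utf-8")).hexdigest()

inputs = ["a", "a slightly longer string", "x" * 10_000]
for content in inputs:
    digest = fixed_length_hash(content)
    # Input lengths vary widely; the digest is always 64 hex characters.
    print(len(content), len(digest))
```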

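We'll use the Social Security number as a form of hashing variable data to a fixed-length number. With the dashes removed, a Social Security number is a unique, fixed-length value: a nine-digit natural number (a positive integer greater than zero). Ignoring the initial constraints of the Social Security numbering scheme (3-digit Area Number, 2-digit Group Number, 4-digit Serial Number) and assuming that Social Security numbers start at 100,000,000 so that all nine slots are always filled, the total number of possible fixed 9-digit values would be 900,000,000 (every value from 100,000,000 through 999,999,999). While the number is a fixed length, the name attached to each Social Security number is a variable-length piece of content. It might be Mary J. Blige, Franklin Delano Roosevelt, or even John Fitzgerald Kennedy.

In our example, the fact that the name itself uses both variable-length and multiple-character types (a "varchar" in database parlance) means that a global name search would require more computing power than a search across the entirety of the 900,000,000 integer permutations in our Social Security number example.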

A final benefit of hashing is its positive effect on more efficient search and storage in modern relational databases. Most databases use a key structure (a value that is unique within any single field in a database table) as the primary key on which to not just search content but also to join content from multiple tables (the relationship) into a single query of content across multiple database tables.

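What we've described in our hashing example, a combination of a fixed-length value (the Social Security number) and a variable-length value (any given name), is, at its most basic, a traditional database structure called a key-value store, in which the key (the hash) is associated with a value (the string of content, such as any given name). In cases where the hash key is unique, it can act as the primary key in any given database table.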
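A key-value store of this kind can be sketched with a plain Python dictionary. The nine-digit keys below are made-up placeholders, not real Social Security numbers, and `lookup` is a hypothetical helper.

```python
# Fixed-length integer keys mapped to variable-length name strings.
# The keys are invented placeholders for illustration only.
records = {
    100000001: "Mary J. Blige",
    100000002: "Franklin Delano Roosevelt",
    100000003: "John Fitzgerald Kennedy",
}

def lookup(key: int):
    """Return the name stored under a key, or None if the key is unknown."""
    return records.get(key)

print(lookup(100000002))   # Franklin Delano Roosevelt
print(lookup(999999999))   # None
```

Looking up a name by its key is a single dictionary access, whereas finding a key by its name would mean scanning every stored value.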

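Databases also do really well at indexing content, which is essentially a road map to a set of key-value stores. The index effectively reverse-engineers the array of content stored in the database, so that a search runs against the index rather than the entire database. A hash table plays the role of an index for a set of hashes and significantly reduces search times for the string of content associated with a particular hash.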

The Problem With Hashing

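The primary potential problem with hashing is what's called a collision, which means that two variable-length values share the same defining parameters (e.g., two people named John Reginald Smith Jr. who were both born in the same area of the U.S. in the same year) that may result in the same fixed-length value representing both (in our case the Social Security number).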
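Collisions are easy to provoke with a weak hash. The Python sketch below uses a deliberately poor toy function (sum the character codes, then take the remainder) so that two different strings land on the same key. `toy_hash` and the sample strings are assumptions for illustration only.

```python
def toy_hash(text: str, buckets: int = 1000) -> int:
    """Deliberately weak hash: add up the character codes, keep the remainder."""
    return sum(ord(ch) for ch in text) % buckets

a, b = "listen", "silent"
# Same letters in a different order give the same sum, hence the same key.
print(toy_hash(a) == toy_hash(b))   # True

# Storing both strings under their hash puts them in the same slot.
store = {}
for text in (a, b):
    store.setdefault(toy_hash(text), []).append(text)
print(store)
```

Keeping a list per key, as the last few lines do, is one simple way to stop the second value from silently overwriting the first.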

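A basic hashing approach that generates a hashing collision, due to the fact that two strings of content with the same parameters share the same hash key value, can lead to unintentional swamping. In practice, it can also lead to one server holding too much content while other servers in the cluster hold too little.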

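In addition, hashing collisions can add significant delay to a browser receiving content. In the worst-case scenario, content cannot be served at all or the wrong content could be sent from the wrong caching server to the wrong media player at the wrong time. In our Social Security example, that would be like the Social Security Administration sending a check in the mail to the wrong John Reginald Smith Jr., who would then be legally free to cash it.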

One way to address potential collision errors in a CDN architecture is to add more servers, replicating the same combination of hash table and key-value store to multiple servers. This approach works if the content itself has a limited collision likelihood; however, if one server in a group of servers fails, the hash table becomes outdated, since some of the content of the failed server would need to be remapped to all the other servers. The end result would be a massive hit on the origin server whenever any single caching server, even one in a cluster, fails.
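To get a feel for how disruptive a single failure can be under simple remainder-based assignment, the Python sketch below maps a batch of hypothetical content keys to five servers, then to four, and counts how many keys change servers. The helper name, the MD5 hash, the server counts, and the key names are all assumptions for illustration, and the model simply renumbers the surviving servers, which is a simplification.

```python
import hashlib

def server_for(key: str, num_servers: int) -> int:
    """Assign a key to a server by taking its hash modulo the server count."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % num_servers

# Hypothetical catalog of 10,000 media segments.
keys = [f"/video/segment-{i}.ts" for i in range(10_000)]

before = {k: server_for(k, 5) for k in keys}   # five healthy servers
after = {k: server_for(k, 4) for k in keys}    # one server has failed

moved = sum(1 for k in keys if before[k] != after[k])
print(f"{moved / len(keys):.0%} of keys changed servers")
```

With a well-mixed hash, a key keeps its server only when its hash leaves the same remainder for both 5 and 4, which happens about one time in five, so roughly 80% of keys move. Every moved key is a cache miss that may fall through to the origin server.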
