Hadoop at Yahoo! Sets New Gray Sort Record[zz]

去年曾经整理了所有Terasort的记录,最快的是Yahoo 2009年做的TB排序,62秒钟(1460节点,主频2.5Ghz,8G,1Gbe,4HD)。今年7月份Yahoo就搞了一个100TB的测试,72分钟(2100节点,主频2.3Ghz,双路6核,64G,10Gbe,12HD)。如果是线性的,则1TB的数据扩大到100TB,需要103分钟,当然,排序算法并非线性的。今年Yahoo达到这么高的性能,虽然和Hadoop优化有关系,但是和采用高性能服务器的关系更大。
Hadoop at Yahoo! Sets New Gray Sort Record – The Yellow Elephant is Getting Faster

By Thomas Graves – Wed, Jul 3, 2013 5:27 AM EDT

hadoop-elephantWe are proud to announce we used Apache Hadoop to set a new Gray sort record for the Jim Gray’s Sort benchmark. We nearly doubled the rate of the previous Gray sort entry by sorting at a rate of 1.42 Terabytes per minute. The previous record was 0.725 Terabytes per minute.

Jim Gray’s sort benchmark consists of a set of many related benchmarks, each with their own rules. All of the sort benchmarks measure the time to sort different numbers of 100 byte records. The first 10 bytes of each record is the key and the rest is the value. The Gray sort is to measure the sort rate achieved while sorting at least 100 terabytes of data. The Minute sort is the amount of data that can be sorted in less than a minute. There are two different benchmark categories. The Daytona category requires the sort code to be general purpose sort. The Indy category needs to only sort 100-byte records with 10-byte keys. We used Hadoop Terasort with slightly different configurations in both categories.

There were some new rules this year. The biggest rule change is that your sort is required to sort both skewed and non-skewed data for the Daytona benchmark category. The skewed data has to be sorted in no more than twice the elapsed time of the non-skewed data. Other changes include: the input and output data must persist in the case of a single node failure and none of the data can be compressed (input, intermediate, or output)


Our only official entry was to Gray sort, but we also unofficially (we didn’t get the submission in by the deadline – learned that lesson!) broke the previous Minute Sort record. The full report can be found on the sort benchmark page under the Gray sort results.


Software, Hardware, and Operating System

The version of Hadoop used was Hadoop 0.23.7. Hadoop 0.23.7 is an early branch of the Hadoop 2.X line that Yahoo! has used to stabilize YARN. It is available for download at hadoop.apache.org.

The hardware and operating system details are:

  • Approximately 2100 nodes for GraySort and 2200 nodes for MinuteSort
  • System: Dell R720xd, 2 x Xeon E5-2630 2.30GHz, 62.3GB / 64GB 1333MHz DDR3, 12 x 3TB SATA
  • Processors: 2 x Xeon E5-2630 2.30GHz, 7.2GT QPI (HT enabled, 12 cores, 24 threads) – Sandy Bridge-EP C2, 64-bit, 6-core, 32nm, L3: 15MB
  • OS: RHEL Server 6.3, Linux 2.6.32-279.19.1.el6.YAHOO.20130104.x86_64 x86_64, 64-bit
  • Network: eth0 (bnx2x): 10Gb/s <full-duplex>
  • 40 nodes/rack 160Gbps rack to spine. 2.5:1 subscription.
  • Oracle JDK 1.7 (u17) – 64 bit


Thomas Graves is a software developer at Yahoo! and a Hadoop PMC member at the Apache Software Foundation.

Nathan Roberts is a Hadoop architect at Yahoo!.

Balaji Narayanan and Rajiv Chittajallu are on the Grid Operations team at Yahoo!.


  1. Terasort 的记录
  2. Hadoop at Yahoo! Sets New Gray Sort Record – The Yellow Elephant is Getting Faster




电子邮件地址不会被公开。 必填项已用*标注