What is Toluu?
Toluu is a free service for sharing the feeds you read and discovering new ones.
Get Invite

Hadoop and Distributed Computing at Yahoo!

News and information about Hadoop and related distributed computing work going on at Yahoo!


Pig - The Road to an Efficient High-level language for HadoopOctober 28 2008
Pig started as a research project within Yahoo! in the summer of 2006. The original prototype quickly became very popular with users. It was clear that a higher level language than raw map-reduce was needed to quickly rollout prototypes as well as to build production quality applications. Early adopters within Yahoo! have reported substantial increases in productivity when they migrated from raw map-reduce to Pig. In the summer of 2007 a team was put together to make the project into a product. Working within an open source community was perceived as one of the important early goals of the project. Pig has been part of the open source community for over a year, joining Apache Incubator in September of 2007. During this time Pig has developed a community of users and developers, and added two new committers. It also gained wide popularity within Yahoo! with 30% of all Hadoop jobs using Pig - which amounts to thousands per day! A lot of great technical work went into the project which helped with the adoption and popularity of the system. The early work included the addition of streaming operator, parameter substitution, error handling, and some performance improvements like using binary comparators and combiner. More recently the entire system, from the parser down, has been rebuilt making the code much cleaner, extensible, and efficient. A types system was also added further improving performance and allowing for early error dete
Hadoop User Group MeetingOctober 16 2008
In response to a number of requests from folks outside the Bay Area to have us record and post the Hadoop User Group presentations, here are the talks from the October meeting which was held this week at the Yahoo! Mission College campus. We had Jun Rao from IBM Almaden Research talk about “Exploiting database join techniques for analytics with Hadoop”. This was followed by an update on Jaql by Kevin Beyer from IBM, who informed us that Jaql is now available as Open Source. The last talk was a lively discussion with Sriram Rao from Quantcast about his “Experiences moving a Petabyte Data Center”. Bay Area Hadoop User Group meetings are usually held on the third Wednesday of each month at Yahoo! Mission College in Santa Clara. Ajay Anand Yahoo! Grid Computing
Hadoop Camp at ApacheConOctober 9 2008
Following up on the interest in the Hadoop Summit which we held a few months ago, we got together with the ApacheCon folks to arrange a Hadoop Camp at their conference this year. Hadoop Camp will be held on November 6th and 7th in New Orleans as part of ApacheCon this year. Along the lines of the summit, we have speakers from some of the leading companies developing on and using Hadoop, including Facebook, Amazon, IBM, Hewlett-Packard, Sun, Powerset, and Yahoo! in what is possibly the largest gathering of Hadoop committers, developers and users outside of the Bay Area. In addition to the Camp, there is a Hadoop tutorial on Monday, November 3rd, and we are also looking into coordinating a Hadoop “hack” contest that would run through the week at ApacheCon. We are looking forward to a strong turnout! Ajay Anand Yahoo!
Scaling Hadoop to 4000 nodes at Yahoo!September 30 2008
We recently ran Hadoop on what we believe is the single largest Hadoop installation, ever: • 4000 nodes • 2 quad core Xeons @ 2.5ghz per node • 4x1TB SATA disks per node • 8G RAM per node • 1 gigabit ethernet on each node • 40 nodes per rack • 4 gigabit ethernet uplinks from each rack to the core (unfortunately a misconfiguration, we usually do 8 uplinks) • Red Hat Enterprise Linux AS release 4 (Nahant Update 5) • Sun Java JDK 1.6.0_05-b13 • So that's well over 30,000 cores with nearly 16PB of raw disk! The exercise was primarily an effort to see how Hadoop works at this scale and gauge areas for improvements as we continue to push the envelope. We ran Hadoop trunk (post Hadoop 0.18.0) for these experiments. Scaling has been a constant theme for Hadoop: we, at Yahoo!, ran a modestly sized Hadoop cluster of 20 nodes in early 2006; currently Yahoo! has several clusters around the 2000 node mark. HDFS The scaling issues have always been the main focus in designing any HDFS feature. Despite these efforts, attempts to scale the cluster up in the past sometimes resulted in some unpredictable effects. One of the most memorable examples was the cascading crash described in HADOOP-572, when failure of just a handful of data-nodes made the whole cluster completely dysfunctional in a matter of minutes. This time the testing went smoothly and we observed quite decent file system performance. We did
Hadoop 0.18 HighlightsSeptember 25 2008
Apache Hadoop 0.18 was released on 8/22. This is the largest Hadoop release to date in terms of the number of patches committed (266). It also has the largest percentage of patches (20%) from contributors outside of Yahoo!. This is a great indicator of both the growth of the Hadoop community and their increasing involvement in the projects progress. The size of the release resulted in a very large number of blocking bugs in the code base. Unfortunately, this created a big delay between the feature freeze on 6/4 and the final release. Hadoop 0.18 has many improvements in the areas of performance, scalability and reliability in addition to new features. Some of the performance improvements contributed to Hadoop’s first place in the terabyte sort benchmark. Hadoop 0.18 runs the grid mix benchmark in ~45% of the time taken by Hadoop 0.15. Lots of cool new stuff in this release, some of which is briefly described below.

HDFS


• Namespace auto-recovery The HDFS Namenode can store the filesystem image and journal in multiple locations. Upon startup it automatically consults all configured locations of its state and reads the most up to date image and journal. If all of the Namenodes copies of data are unavailable state can be (mostly) recovered from the secondary Namenode using the ‘¬-import