{"id":32,"date":"2011-12-31T01:30:55","date_gmt":"2011-12-31T01:30:55","guid":{"rendered":"http:\/\/oren.lederman.name\/?p=32"},"modified":"2012-01-07T09:00:23","modified_gmt":"2012-01-07T09:00:23","slug":"analytic-functions-in-hadoop-hive-part-2","status":"publish","type":"post","link":"http:\/\/oren.lederman.name\/?p=32","title":{"rendered":"Analytic functions in Hadoop Hive part 2"},"content":{"rendered":"<p>&nbsp;<\/p>\n<p>On this post I will explain how to write \u00a0a row_number() function that works with partitions. I assume that you have read Part 1 and loaded the sample data set. We will be using a small sample of original data set:<\/p>\n<p>[crayon lang=&#8221;sql&#8221; url=&#8221;http:\/\/oren.lederman.name\/wp-content\/uploads\/2011\/12\/post2_query0.txt&#8221;][\/crayon]<\/p>\n<p>For the\u00a0purpose\u00a0of this post, we will implement a UDF imitates the\u00a0behavior\u00a0of the following SQL statement:<\/p>\n<p>[crayon lang=&#8221;sql&#8221; url=&#8221;http:\/\/oren.lederman.name\/wp-content\/uploads\/2011\/12\/post2_query1.txt&#8221;][\/crayon]<br \/>\nReturning:<br \/>\n[csv2table source=&#8221;http:\/\/oren.lederman.name\/wp-content\/uploads\/2011\/12\/post2_query1.csv&#8221;]<\/p>\n<p>If we did not have to deal with partitions, all we had to do is keep a counter and increase it on each call to evaluate().\u00a0You can see such an example in \u00a0Hive&#8217;s contrib code (<a href=\"http:\/\/svn.apache.org\/repos\/asf\/hive\/trunk\/contrib\/src\/java\/org\/apache\/hadoop\/hive\/contrib\/udf\/UDFRowSequence.java\">UDFRowSequence<\/a>). Note that this code assumes a single instance of UDFRowSequence (or else the same row number is assigned to more than one record).\u00a0However, using analytic functions with partitions is one of the most powerful features of analytic functions.<\/p>\n<p>The trick for writing an analytic function that supports partitioning is to <strong>sort<\/strong>\u00a0the data before calling the UDF. By sorting the records by the partitioning key (<em>station_no<\/em> in our example) we can assure that all the records belonging to the same partition arrives\u00a0sequentially so we can reset the counter every time the partitioning key changes.\u00a0By using a\u00a0secondary\u00a0sort based on the order by keys (<em>sample_date<\/em>\u00a0in our example) we assure that the records within each partition arrives in the order in which we want to number the records.<\/p>\n<p>This can be done using a SORT BY clause. SORT BY controls the order in which records arrive to a Reducer (read <a href=\"https:\/\/cwiki.apache.org\/confluence\/display\/Hive\/LanguageManual+SortBy\">The difference between Sort By and Order By<\/a> for further information), which is exactly what we need here.<br \/>\n[crayon lang=&#8221;sql&#8221; url=&#8221;http:\/\/oren.lederman.name\/wp-content\/uploads\/2011\/12\/post2_query2.txt&#8221;][\/crayon]<\/p>\n<p>You probably noticed the DISTRIBUTE BY clause in the last example. Since Hive can choose to use more\u00a0\u00a0than one reducer (and if you process a large data set, this is what you want), we must make sure that all records of the same partition key reaches the same Reducer.<\/p>\n<p>Now that the data is sorted and distributed properly, the UDF code itself is quite simple. Note how evaluate() \u00a0saves the list of partition keys, and resets the counter every time the partition changes.<\/p>\n<p>[crayon url=&#8221;http:\/\/oren.lederman.name\/wp-content\/uploads\/2011\/12\/PurswayRowNumberUDF.txt&#8221;][\/crayon]<\/p>\n<p>Disclosure &#8211; the code in this example is based on the code of <a href=\"http:\/\/mail-archives.apache.org\/mod_mbox\/hive-user\/201105.mbox\/%3C2B602E4D-7A75-4F59-85E8-35C3B1365534@gmail.com%3E\">TiaraUDFRank<\/a>. I decided to use StringBuilder and rely on Text.toString() instead of arg.getBytes() as there seem to be a\u00a0bizarre\u00a0memory leak in the original code that returned wrong values.<\/p>\n<p>Related links<\/p>\n<ul>\n<li><a href=\"https:\/\/cwiki.apache.org\/confluence\/display\/Hive\/LanguageManual+SortBy\">SORT BY, ORDER BY and DISTRIBUTE BY explained<\/a><\/li>\n<li><a href=\"https:\/\/cwiki.apache.org\/Hive\/languagemanual-ddl.html\">Hive DDL<\/a><\/li>\n<\/ul>\n","protected":false},"excerpt":{"rendered":"<p>&nbsp; On this post I will explain how to write \u00a0a row_number() function that works with partitions. I assume that you have read Part 1 and loaded the sample data set. We will be using a small sample of original data set: [crayon lang=&#8221;sql&#8221; url=&#8221;http:\/\/oren.lederman.name\/wp-content\/uploads\/2011\/12\/post2_query0.txt&#8221;][\/crayon] For the\u00a0purpose\u00a0of this post, we will implement a UDF imitates [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_monsterinsights_skip_tracking":false,"_monsterinsights_sitenote_active":false,"_monsterinsights_sitenote_note":"","_monsterinsights_sitenote_category":0},"categories":[4],"tags":[6,7,5],"_links":{"self":[{"href":"http:\/\/oren.lederman.name\/index.php?rest_route=\/wp\/v2\/posts\/32"}],"collection":[{"href":"http:\/\/oren.lederman.name\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"http:\/\/oren.lederman.name\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"http:\/\/oren.lederman.name\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"http:\/\/oren.lederman.name\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=32"}],"version-history":[{"count":10,"href":"http:\/\/oren.lederman.name\/index.php?rest_route=\/wp\/v2\/posts\/32\/revisions"}],"predecessor-version":[{"id":153,"href":"http:\/\/oren.lederman.name\/index.php?rest_route=\/wp\/v2\/posts\/32\/revisions\/153"}],"wp:attachment":[{"href":"http:\/\/oren.lederman.name\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=32"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"http:\/\/oren.lederman.name\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=32"},{"taxonomy":"post_tag","embeddable":true,"href":"http:\/\/oren.lederman.name\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=32"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}