{"id":21,"date":"2011-12-24T12:22:56","date_gmt":"2011-12-24T12:22:56","guid":{"rendered":"http:\/\/oren.lederman.name\/?p=21"},"modified":"2012-01-07T09:00:32","modified_gmt":"2012-01-07T09:00:32","slug":"analytic-functions-in-hive-part-1","status":"publish","type":"post","link":"http:\/\/oren.lederman.name\/?p=21","title":{"rendered":"Analytic functions in Hadoop Hive part 1"},"content":{"rendered":"<p><a title=\"Pursway\" href=\"http:\/\/www.pursway.com\">Pursway<\/a>, the company I work for, has been using MPP (Massively Parallel Processing) databases for the last few years. Several months ago I started evaluating Hive as a replacement for a\u00a0commercial MPP database that we were using at the time.\u00a0One of the first things I noticed was the lack of analytic functions, which are extremely important for the type of analysis we do at Pursway.\u00a0In this\u00a0series\u00a0of posts I will explain how some, though not all, analytic functions can be written as Hive UDFs.<\/p>\n<p>I assume that you have a basic understanding of Hive UDFs. If you do not, please refer to the related links at the end of this post.<\/p>\n<p>We will also need a data set to work with. For privacy reasons I cannot use the data I usually work with, so we will be using publicly available data from <a title=\"Tel-O-Fun\" href=\"https:\/\/www.tel-o-fun.co.il\/en\/\">Tel-O-Fun<\/a> &#8211; an automated bicycle rental service operated by\u00a0the municipality of Tel-Aviv.<\/p>\n<p>You can find the sample data set here &#8211;\u00a0<a href=\"http:\/\/oren.lederman.name\/wp-content\/uploads\/2011\/12\/tel_o_fun.txt.gz\">Tel-O-Fun sample data<\/a>. 
It is a\u00a0simple pipe-separated file with four fields:<\/p>\n<ul>\n<li>sample_date &#8211; the date on which the sample was taken<\/li>\n<li>station_number<\/li>\n<li>available_bikes<\/li>\n<li>available_docking_poles<\/li>\n<\/ul>\n<div>Download it, decompress it to \/home\/hadoop\/tel_o_fun.txt, and load it into Hive:<\/div>\n<pre lang=\"sql\">CREATE TABLE tel_o_fun (\r\n\tsample_date string\r\n\t,station_no int\r\n\t,available_bikes smallint\r\n\t,available_docking_poles smallint\r\n) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' STORED AS TEXTFILE;\r\n\r\nLOAD DATA LOCAL INPATH '\/home\/hadoop\/tel_o_fun.txt' INTO TABLE tel_o_fun;<\/pre>\n<p>In the next post I will explain the basics of Hive &#8220;Analytic&#8221; functions, and how to write a row_number() function.<\/p>\n<p>Related links:<\/p>\n<ul>\n<li><a href=\"http:\/\/www.slideshare.net\/ragho\/hive-user-meeting-august-2009-facebook\">Writing Hive UDFs<\/a>. Focus on\u00a0slides 74-87<\/li>\n<li><a href=\"http:\/\/dev.bizo.com\/2009\/06\/custom-udfs-and-hive.html\">Simple example<\/a> of how to write Hive UDFs<\/li>\n<li>Tel-O-Fun <a href=\"https:\/\/www.tel-o-fun.co.il\/en\/\">official web site<\/a><\/li>\n<\/ul>\n","protected":false},"excerpt":{"rendered":"<p>Pursway, the company I work for, has been using MPP (Massively Parallel Processing) databases for the last few years. 
Several months ago I started evaluating Hive as a replacement for a\u00a0commercial MPP database that we were using at the time.\u00a0One of the first things I noticed was the lack of analytic functions, which are extremely important [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_monsterinsights_skip_tracking":false,"_monsterinsights_sitenote_active":false,"_monsterinsights_sitenote_note":"","_monsterinsights_sitenote_category":0},"categories":[4],"tags":[6,7,5],"_links":{"self":[{"href":"http:\/\/oren.lederman.name\/index.php?rest_route=\/wp\/v2\/posts\/21"}],"collection":[{"href":"http:\/\/oren.lederman.name\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"http:\/\/oren.lederman.name\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"http:\/\/oren.lederman.name\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"http:\/\/oren.lederman.name\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=21"}],"version-history":[{"count":10,"href":"http:\/\/oren.lederman.name\/index.php?rest_route=\/wp\/v2\/posts\/21\/revisions"}],"predecessor-version":[{"id":154,"href":"http:\/\/oren.lederman.name\/index.php?rest_route=\/wp\/v2\/posts\/21\/revisions\/154"}],"wp:attachment":[{"href":"http:\/\/oren.lederman.name\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=21"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"http:\/\/oren.lederman.name\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=21"},{"taxonomy":"post_tag","embeddable":true,"href":"http:\/\/oren.lederman.name\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=21"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}