Hive SQL: Deduplicating Symmetric Pairs Such as (a, b) and (b, a)

2024-09-07 22:12:57

Source: reposted

Contributed by: a reader

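Yesterday a developer came to us DBAs and asked us to write a Hive SQL query.

The requirement:

There is a table t with two main columns: the airport name, airport, and the airport's position, distant (the request described it as longitude/latitude; the mock table below stores it as a single integer). The goal is to find every pair of airports whose positions differ by less than 100.

The logic of the query itself is not hard. The difficult part is removing the duplicate pairs.

I mocked up the table in MySQL, since Hive's syntax is much the same as standard SQL for this purpose, and inserted three rows, where a, b and c stand for three airport names. The structure is as follows: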

　　mysql> show create table t\G
　　*************************** 1. row ***************************
　　       Table: t
　　Create Table: CREATE TABLE `t` (
　　 `airport` varchar(10) DEFAULT NULL,
　　 `distant` int(11) DEFAULT NULL
　　) ENGINE=InnoDB DEFAULT CHARSET=utf8
　　1 row in set (0.00 sec)

　　mysql> select * from t;
　　+---------+---------+
　　| airport | distant |
　　+---------+---------+
　　| a       |     130 |
　　| b       |     140 |
　　| c       |     150 |
　　+---------+---------+
　　3 rows in set (0.00 sec)
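
Use != to filter out each airport's comparison with itself, and use abs() to take the absolute value of the difference, which gives the pairs of airports whose positions differ by less than 100: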

　　mysql> select t1.airport, t2.airport from t t1,t t2 where t1.airport != t2.airport and abs(t1.distant-t2.distant) < 100;
　　+---------+---------+
　　| airport | airport |
　　+---------+---------+
　　| b       | a       |
　　| c       | a       |
　　| a       | b       |
　　| c       | b       |
　　| a       | c       |
　　| b       | c       |
　　+---------+---------+
　　6 rows in set (0.00 sec)
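
To make the duplication concrete, here is a minimal sketch that replays the same self-join outside MySQL. It assumes Python's built-in sqlite3 module as a stand-in for the MySQL mock table (an assumption made only so the example is runnable; it is not the Hive environment from the request), and it counts how many distinct unordered pairs hide behind the six rows.

```python
import sqlite3

# In-memory SQLite database standing in for the MySQL mock table (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (airport TEXT, distant INTEGER)")
conn.executemany("INSERT INTO t VALUES (?, ?)", [("a", 130), ("b", 140), ("c", 150)])

# Same self-join as in the article: drop self-pairs, keep pairs closer than 100.
rows = conn.execute(
    """
    SELECT t1.airport, t2.airport
    FROM t t1, t t2
    WHERE t1.airport != t2.airport
      AND abs(t1.distant - t2.distant) < 100
    """
).fetchall()

# Collapse each ordered pair into an unordered one to count the distinct pairs.
unordered_pairs = {frozenset(pair) for pair in rows}

print(len(rows))             # 6 ordered pairs
print(len(unordered_pairs))  # 3 distinct unordered pairs
```

Every unordered pair shows up exactly twice, once in each order, which is why half of the rows are redundant.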
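
But here is the problem: (b,a) and (a,b), (c,a) and (a,c), (c,b) and (b,c) count as duplicates for our purposes. One row from each pair is enough to tell us which two airports are involved. So how do we get rid of the duplicates?

At first glance, neither DISTINCT nor GROUP BY seems to help. In the end I consulted a seasoned SQL expert, who pointed me to the hex() function, which converts a string into its hexadecimal representation. Hive has an equivalent function. It works like this: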

　　mysql> select t1.airport,hex(t1.airport), t2.airport,hex(t2.airport) from t t1,t t2 where t1.airport != t2.airport and abs(t1.distant-t2.distant) < 100;
　　+---------+-----------------+---------+-----------------+
　　| airport | hex(t1.airport) | airport | hex(t2.airport) |
　　+---------+-----------------+---------+-----------------+
　　| b       | 62              | a       | 61              |
　　| c       | 63              | a       | 61              |
　　| a       | 61              | b       | 62              |
　　| c       | 63              | b       | 62              |
　　| a       | 61              | c       | 63              |
　　| b       | 62              | c       | 63              |
　　+---------+-----------------+---------+-----------------+
　　6 rows in set (0.00 sec)
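
Now we can remove the duplicates by comparing the hex values of airport 1 and airport 2 and keeping only the rows where the first is smaller: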

　　mysql> select t1.airport, t2.airport from t t1,t t2 where t1.airport != t2.airport and hex(t1.airport) < hex(t2.airport) and abs(t1.distant-t2.distant) < 100;
　　+---------+---------+
　　| airport | airport |
　　+---------+---------+
　　| a       | b       |
　　| a       | c       |
　　| b       | c       |
　　+---------+---------+
　　3 rows in set (0.00 sec)
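
The ordering trick can be checked with a small sketch. It again assumes Python's built-in sqlite3 as a stand-in for MySQL, and the helper name close_pairs is hypothetical. It runs the hex() comparison and, for contrast, a plain comparison on the airport name, which also imposes a single order on each pair for this data.

```python
import sqlite3

# In-memory SQLite database standing in for the MySQL mock table (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (airport TEXT, distant INTEGER)")
conn.executemany("INSERT INTO t VALUES (?, ?)", [("a", 130), ("b", 140), ("c", 150)])


def close_pairs(order_condition):
    """Return pairs closer than 100, kept only when order_condition holds."""
    sql = f"""
        SELECT t1.airport, t2.airport
        FROM t t1, t t2
        WHERE {order_condition}
          AND abs(t1.distant - t2.distant) < 100
        ORDER BY t1.airport, t2.airport
    """
    return conn.execute(sql).fetchall()


by_hex = close_pairs("hex(t1.airport) < hex(t2.airport)")
by_name = close_pairs("t1.airport < t2.airport")

print(by_hex)             # [('a', 'b'), ('a', 'c'), ('b', 'c')]
print(by_name == by_hex)  # True
```

Both conditions agree here. One possible reason to prefer hex() in MySQL is that a plain string comparison follows the column's collation, and under a case-insensitive collation names differing only in case compare as equal, whereas hex() compares the underlying bytes. Whether that matters depends on the real airport names.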
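
Finally, a small optimization: the strict < comparison already excludes rows where both sides are the same airport, so the != condition is redundant and can be dropped. The result is: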

　　mysql> select t1.airport, t2.airport from t t1,t t2 where hex(t1.airport) < hex(t2.airport) and abs(t1.distant-t2.distant) < 100;
　　+---------+---------+
　　| airport | airport |
　　+---------+---------+
　　| a       | b       |
　　| a       | c       |
　　| b       | c       |
　　+---------+---------+
　　3 rows in set (0.00 sec)
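
In the sample data every pair is within 100 of every other, so the abs() filter never actually excludes anything. The sketch below, which assumes Python's built-in sqlite3 as a stand-in for MySQL and adds a hypothetical fourth airport d at position 400, shows the filter doing its job and confirms that the query returns the same rows with or without the != condition.

```python
import sqlite3

# In-memory SQLite database standing in for the MySQL mock table (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (airport TEXT, distant INTEGER)")
# 'd' is a hypothetical fourth airport, far from the others, added for illustration.
conn.executemany(
    "INSERT INTO t VALUES (?, ?)",
    [("a", 130), ("b", 140), ("c", 150), ("d", 400)],
)

query = """
    SELECT t1.airport, t2.airport
    FROM t t1, t t2
    WHERE {extra} hex(t1.airport) < hex(t2.airport)
      AND abs(t1.distant - t2.distant) < 100
    ORDER BY t1.airport, t2.airport
"""

# Run the query once with the explicit self-pair filter and once without it.
with_neq = conn.execute(query.format(extra="t1.airport != t2.airport AND")).fetchall()
without_neq = conn.execute(query.format(extra="")).fetchall()

print(with_neq == without_neq)  # True
print(without_neq)              # [('a', 'b'), ('a', 'c'), ('b', 'c')]
```

Airport d never appears because it is at least 250 away from every other airport.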

(Editor: 武林网)
