[TIL] 在 Hive 中把带分区的文本格式的表转换成 ORC 格式

Tuesday, August 9, 2016

在我们 Data Pipeline 中有一个步骤我们需要对带分区的文本格式的表转换成 ORC 格式并进行
SNAPPY 压缩,放到 airflow 中 T+1 处理.

比如我们有一张 access_log_txt 外部表

CREATE EXTERNAL TABLE access_log_txt (
time string,app_id string,app_version string, ...more fields)
PARTITIONED BY (dt string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;

有一张 access_log_orc 的表

CREATE TABLE access_log_orc (
time string,app_id string,app_version string, ...more fields)
PARTITIONED BY (dt string)
STORED AS ORC tblproperties ("orc.compress" = "SNAPPY");

如果数据表未分区,直接 insert into xxx select * from yyy

insert into access_log_orc select * from access_log_txt where foo=bar;

但是有分区的时候, 分区字段会包含在 select * from yyy 中,
造成和目标表字段数不一致的情况

hive> insert into access_log_orc PARTITION(dt='2016-08-09') select * from
access_log_txt where dt='2016-08-09';
FAILED: SemanticException [Error 10044]: Line 1:12 Cannot insert into target
table because column number/types are different ''2016-08-09'': Table
insclause-0 has 62 columns, but query has 63 columns.

指定 select 字段列表的话,字段太多太累了,找到一种把分区字段从结果集排除的方法

hive > set hive.support.quoted.identifiers=none;
hive > insert into access_log_orc PARTITION(dt='2016-08-09') select `(dt)?+.+` from
access_log_txt where dt='2016-08-09';

This entry was tagged Hadoop, Hive and TIL

comments powered by Disqus

© 2009-2013 lxneng.com. All rights reserved. Powered by Pyramid

go to Top