Impala + parquet file


I’ve created the parquet file, and then I’m trying to import it into an Impala table.

The table I created is as follows:

CREATE EXTERNAL TABLE `user_daily` (
`user_id` BIGINT COMMENT 'User ID',
`master_id` BIGINT,
`walletAgency` BOOLEAN,
`zone_id` BIGINT COMMENT 'Zone ID',
`day` STRING COMMENT 'The stats are aggregated for single days',
`clicks` BIGINT COMMENT 'The number of clicks',
`impressions` BIGINT COMMENT 'The number of impressions',
`avg_position` BIGINT COMMENT 'The average position * 100',
`money` BIGINT COMMENT 'The cost of the clicks, in hellers',
`web_id` BIGINT COMMENT 'Web ID',
`discarded_clicks` BIGINT COMMENT 'Number of discarded clicks from column "clicks"',
`impression_money` BIGINT COMMENT 'The cost of the impressions, in hellers'
)
PARTITIONED BY (
 year BIGINT,
 month BIGINT
)
STORED AS PARQUET
LOCATION '/warehouse/impala/contextstat.db/user_daily/';

Then I copy in a file with this schema:

parquet-tools schema user_daily/year\=2016/month\=8/part-r-00001-fd77e1cd-c824-4ebd-9328-0aca5a168d11.snappy.parquet 
message spark_schema {
  optional int32 user_id;
  optional int32 web_id (INT_16);
  optional int32 zone_id;
  required int32 master_id;
  required boolean walletagency;
  optional int64 impressions;
  optional int64 clicks;
  optional int64 money;
  optional int64 avg_position;
  optional double impression_money;
  required binary day (UTF8);
}

Then when I try to view the entries with

SELECT * FROM user_daily;

I see

File 'hdfs://.../warehouse/impala/contextstat.db/user_daily/year=2016/month=8/part-r-00000-fd77e1cd-c824-4ebd-9328-0aca5a168d11.snappy.parquet' 
has an incompatible Parquet schema for column 'contextstat.user_daily.user_id'. 
Column type: BIGINT, Parquet schema:
optional int32 user_id [i:0 d:1 r:0]

Do you know how to solve this problem? I thought BIGINT was the same as int32. Should I change the table schema, or regenerate the parquet file?

Solution

BIGINT is int64, which is why Impala complains. But you don't have to work out the matching types yourself; Impala can do that for you. Simply use the CREATE TABLE LIKE PARQUET variant:

The variation CREATE TABLE … LIKE PARQUET ‘hdfs_path_of_parquet_file’ lets you skip the column definitions of the CREATE TABLE statement. The column names and data types are automatically configured based on the organization of the specified Parquet data file, which must already reside in HDFS.
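A minimal sketch of that approach, assuming the parquet file shown earlier is still at the same HDFS path (adjust the path and partition values to your cluster):

```sql
-- Derive the column names and types directly from an existing Parquet file,
-- keeping the same partitioning scheme and table location as before.
CREATE EXTERNAL TABLE user_daily
  LIKE PARQUET '/warehouse/impala/contextstat.db/user_daily/year=2016/month=8/part-r-00001-fd77e1cd-c824-4ebd-9328-0aca5a168d11.snappy.parquet'
  PARTITIONED BY (year BIGINT, month BIGINT)
  STORED AS PARQUET
  LOCATION '/warehouse/impala/contextstat.db/user_daily/';

-- Partitions already sitting in HDFS still have to be registered
-- before their data becomes visible:
ALTER TABLE user_daily ADD PARTITION (year=2016, month=8);
REFRESH user_daily;
```

After this, `SELECT * FROM user_daily;` should no longer raise the schema-incompatibility error, because every column type is taken directly from the file itself.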
