hive partitions sample
Personally I have used partitions and bucketing for hive few times.
But since Hive supports these two features I should note them here.
Here is the partitions sample below.
hive> create table transactions_int (
> transdate date,
> transid int,
> custid int,
> fname string,
> lname string,
> item string,
> qty int,
> price float) partitioned by (store string);
OK
hive> INSERT INTO transactions_int PARTITION (store="new york") values ("2016-01-25",101,109,"MATTHEW","SMITH","SHOES",1,112.9);
hive> show partitions transactions_int;
OK
store=new york
Time taken: 0.276 seconds, Fetched: 1 row(s)
hive> select * from transactions_int;
OK
2016-01-25 101 109 MATTHEW SMITH SHOES 1 112.9 new york
Time taken: 0.424 seconds, Fetched: 1 row(s)
When you check files from hdfs command you will see the storage space is separated by partition keys.
$ hdfs dfs -ls /user/hive/warehouse/transactions_int
Found 1 items
drwxr-xr-x - pyh supergroup 0 2022-05-12 10:07 /user/hive/warehouse/transactions_int/store=new york
Partitoning considerations:
- Pick a column for partition key with low to medium Number of Distinct Values (NDVs).
- Avoid partitions that are less than 1 GB (bigger is better).
- Tune Hiveserver2 and Hive Metastore memory for large number of partitions.
- When you use multiple columns for partition key, it will create a nested tree of subdirectories for each combination of partition key columns. You should avoid deep nesting as it can cause too many partitions and hence create very small files.
- When insert data using Hive streaming, if multiple sessions write data to same partitions, it can lead to locking.
- You can modify the schema of a partitioned table; however, once the structure is changed, you cannot modify the data in existing partitions.
- If you are inserting data to multiple partitions in parallel, you should set hive. optimize.sort.dynamic.partition only to True.