apache kudu

APACHE KUDU ASIM JALIS GALVANIZE

Upload: asim-jalis

Post on 21-Apr-2017

Category:

Data & Analytics


TRANSCRIPT

Page 1: Apache kudu

APACHE KUDU
ASIM JALIS

GALVANIZE

Page 2: Apache kudu

INTRO

Page 3: Apache kudu

ASIM JALIS
Galvanize/Zipfian, Data Engineering
Cloudera, Microsoft, Salesforce
MS in Computer Science from University of Virginia

Page 4: Apache kudu

WHAT IS GALVANIZE’S DATA ENGINEERING IMMERSIVE?

Immersive Peer Learning Environment
Master High-Demand Skills and Technologies
Heart of San Francisco in SOMA

Page 5: Apache kudu

YOU GET TO . . .
Play with Terabytes of Data
Spark, Hadoop, Hive, Kafka, Storm, HBase
Data Science at Scale
Level UP your Career

Page 6: Apache kudu
Page 7: Apache kudu

FOR MORE INFORMATION
Check out http://[email protected]

Page 8: Apache kudu

TALK OVERVIEW

Page 9: Apache kudu

WHAT IS THIS TALK ABOUT?
What is Kudu?
How can I use it to simplify my Big Data architecture?
How can I use it as an alternative to HBase?
How does it work?
Demo

Page 10: Apache kudu

HOW MANY PEOPLE HERE ARE FAMILIAR WITH HDFS?

Page 11: Apache kudu

HOW MANY PEOPLE HERE ARE FAMILIAR WITH HBASE?

Page 12: Apache kudu

HOW MANY PEOPLE HERE ARE FAMILIAR WITH KUDU?

Page 13: Apache kudu

WHY KUDU

Page 14: Apache kudu

WHAT IS A KUDU?
Kudus are a kind of antelope
Found in eastern and southern Africa

Page 15: Apache kudu

WHAT PROBLEM DOES KUDU SOLVE?

Page 16: Apache kudu

WHAT IS LATENCY?
How long it takes to start
How long it takes to set up

Page 17: Apache kudu

WHICH HAS HIGHER LATENCY: AIRPLANES OR CARS?

Page 18: Apache kudu

WHICH HAS HIGHER LATENCY: AIRPLANES OR CARS?

Airplanes have high latency
Cars have lower latency
Winner: Cars

Page 19: Apache kudu

WHAT IS THROUGHPUT?
Operations per second/minute/hour
How much you get done in unit time

Page 20: Apache kudu

WHICH HAS HIGHER THROUGHPUT: AIRPLANES OR CARS?

Page 21: Apache kudu

WHICH HAS HIGHER THROUGHPUT: AIRPLANES OR CARS?

Airplanes have high throughput
Cars have lower throughput
Winner: Airplanes

Page 22: Apache kudu

WHICH IS BETTER?

               Low throughput   High throughput
Low latency    Cars             Jet-Packs
High latency   Horses           Planes

It depends, usually there is a tradeoff
Going to Oakland: Drive
Going to Florida: Fly
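The tradeoff can be made concrete with a toy cost model: total time is a fixed startup latency plus the work divided by throughput. This is only an illustrative sketch. The two "systems" and all of their numbers are invented for the example and do not describe any real product.

```python
def total_time(latency_s, throughput_per_s, n_items):
    """Time to finish n_items: fixed startup cost plus per-item work."""
    return latency_s + n_items / throughput_per_s


# Hypothetical systems; the numbers are made up for illustration only.
LOW_LATENCY = {"latency_s": 0.01, "throughput_per_s": 1_000}        # the "car"
HIGH_THROUGHPUT = {"latency_s": 5.0, "throughput_per_s": 100_000}   # the "plane"


def faster(n_items):
    """Name the system that finishes n_items first under this toy model."""
    a = total_time(n_items=n_items, **LOW_LATENCY)
    b = total_time(n_items=n_items, **HIGH_THROUGHPUT)
    return "low-latency" if a < b else "high-throughput"
```

For a handful of items the low-latency system wins. For millions of items the startup cost is amortized and the high-throughput system wins.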

Page 23: Apache kudu

WHY DOES HBASE EXIST?
HDFS immutable, append-only
HBase mutable, random access read/write
Fast random access
Low latency (good)

Page 24: Apache kudu

WHY DOES PARQUET EXIST?
Main idea: columnar storage
Fast scan, fast batch processing
High throughput (good)
High latency (bad)

Page 25: Apache kudu

WHY DOES KUDU EXIST?

               Low throughput   High throughput
Low latency    HBase            Kudu
High latency   JSON on HDFS     Parquet on HDFS

Page 26: Apache kudu

WHY KUDU?
Kudu is the child of Parquet and HBase
HBase with columnar storage
Mutable Parquet with fast random access for read/write
Parquet-like HBase
HBase-like Parquet

Page 27: Apache kudu

WHAT IS THE USE CASE FOR KUDU?

Use HBase-like features to store data as it arrives in real-time
Use Parquet-like features to run analytic workloads on real-time data
In queries easily combine long-term historical data and short-term real-time data

Page 28: Apache kudu

WHAT IS THE BIG IDEA OF KUDU?
HBase uses in-memory store for low-latency reads/writes
Parquet on HDFS uses columnar layout for high-throughput scans
Kudu idea: in-memory store + columnar layout
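To see why a columnar layout helps scans, the sketch below pivots a few row-wise records into one list per column. It is a minimal illustration in plain Python, not Kudu's storage format. The sample rows reuse values from the demo table later in the talk, and the helper name to_columns is hypothetical.

```python
# Row-wise records, as they might arrive in an in-memory store.
rows = [
    {"state": "WA", "id": 101, "amount": 300.0},
    {"state": "OR", "id": 104, "amount": 450.0},
    {"state": "CA", "id": 102, "amount": 200.0},
]


def to_columns(rows):
    """Pivot a list of row dicts into one list of values per column."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


columns = to_columns(rows)

# An aggregate over one column reads only that column's values,
# instead of walking every field of every row.
total_amount = sum(columns["amount"])
```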

Page 29: Apache kudu

KUDU MOTIVATION

Goal                         Competing With
Fast columnar scans          Parquet/HDFS
Low-latency random updates   HBase
Consistent performance       -

Page 30: Apache kudu

KUDU, HBASE, HDFS

Page 31: Apache kudu

WHAT LANGUAGE IS KUDU WRITTEN IN?

C++ for performance
Java and Python wrappers

Page 32: Apache kudu

HOW CAN I INTERACT WITH KUDU?

Impala is Kudu’s de facto shell
C++, Java, or Python client
MapReduce
Spark (beta)
MapReduce and Spark read-only access (currently)

Page 33: Apache kudu

BENCHMARKS

Page 34: Apache kudu

KUDU VS PARQUET ON HDFS
TPC-H: Business-oriented queries/updates
Latency in ms: lower is better

Page 35: Apache kudu

KUDU VS HBASE
Yahoo! Cloud Serving Benchmark (YCSB)
Evaluates key-value and cloud serving stores
Random access workload
Throughput: higher is better

Page 36: Apache kudu

KUDU VS PHOENIX VS PARQUET

SQL analytic workload
TPC-H LINEITEM table only
Phoenix best-of-breed SQL on HBase

Page 37: Apache kudu

LAMBDA ARCHITECTURE

Page 38: Apache kudu

KUDU USE CASE: LAMBDA ARCHITECTURE

Page 39: Apache kudu

LAMBDA ARCHITECTURE
Real-time data processed by Speed Layer
Historical data processed by Batch Layer
Speed Layer results saved in HBase
New and old data joined in Spark Streaming

Page 40: Apache kudu

ONLINE RETAILER USE CASE
List popular items at top of front page
What is best selling item:
  this hour
  today
  this week
Keep inventory numbers updated

Page 41: Apache kudu

KUDU
Using Kudu
Speed layer queries can include analytics

Page 42: Apache kudu

PRE-KUDU ARCH

Page 43: Apache kudu

POST-KUDU ARCH

Page 44: Apache kudu

KUDU INTERNALS

Page 45: Apache kudu

KUDU ARCHITECTURE

Page 46: Apache kudu

KUDU MASTER
Fault tolerance
Failover to backup masters
Raft used for electing new leaders
Only leader serves client requests

Page 47: Apache kudu

KUDU MASTER ROLES

Catalog Manager
  Metadata: schema, replication
  Master Tablet stores metadata

Cluster Coordinator
  Who is alive: redistribute data on death

Tablet directory
  Which tablet is where

Page 48: Apache kudu

TABLETS
Tablets are similar to HBase Regions
Tables statically partitioned into tablets
Tablets can be replicated to 3 or 5
Replication uses Raft’s Leader/Follower pattern for consensus
Data stored on local file system, not HDFS

Page 49: Apache kudu

TABLET REPLICATION
Writes must be done on leader tablet
Reads can be on leader or followers
Write log is replicated
Log replication driven by leader
Uses Raft consensus
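The leader-driven log replication described on this slide can be sketched as follows. This is a deliberately simplified model, not Raft itself: it has no terms, elections, or log repair, and the class names Replica and Leader are hypothetical. It shows only that the leader pushes each entry to its followers and treats a write as committed when a majority of replicas acknowledge it.

```python
class Replica:
    """One copy of a tablet's write log."""

    def __init__(self, name, up=True):
        self.name = name
        self.up = up
        self.log = []

    def append(self, entry):
        """Append an entry to the log; return True if acknowledged."""
        if not self.up:
            return False
        self.log.append(entry)
        return True


class Leader(Replica):
    """The replica that accepts writes and drives replication."""

    def __init__(self, name, followers):
        super().__init__(name)
        self.followers = followers

    def write(self, entry):
        """Replicate an entry; report success only on majority acknowledgement."""
        acks = 1 if self.append(entry) else 0
        acks += sum(1 for follower in self.followers if follower.append(entry))
        cluster_size = len(self.followers) + 1
        return acks > cluster_size // 2
```

With three replicas, a write still succeeds when one follower is down, and fails when both followers are down.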

Page 50: Apache kudu

THICK CLIENT
Client caches tablet metadata
Uses metadata to figure out leader to write to
Failure on write to deposed leader
Write failure forces metadata refresh
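The cache-then-refresh behavior can be sketched with a small simulation. Everything here (FakeMaster, FakeCluster, ThickClient, StaleLeaderError) is a hypothetical stand-in written for this example and is not the Kudu client API. The point is only the control flow: use the cached leader, and on a failed write drop the cache entry, ask the master again, and retry.

```python
class StaleLeaderError(Exception):
    """Raised when a write is sent to a server that is no longer leader."""


class FakeMaster:
    """Hypothetical stand-in for the master's tablet directory."""

    def __init__(self, leader):
        self.leader = leader
        self.lookups = 0

    def locate_leader(self, tablet_id):
        self.lookups += 1
        return self.leader


class FakeCluster:
    """Hypothetical tablet servers; only the current leader accepts writes."""

    def __init__(self, master):
        self.master = master
        self.rows = []

    def write(self, server, row):
        if server != self.master.leader:
            raise StaleLeaderError(server)
        self.rows.append(row)


class ThickClient:
    def __init__(self, master, cluster):
        self.master = master
        self.cluster = cluster
        self.cache = {}  # tablet_id -> cached leader

    def _leader_for(self, tablet_id):
        if tablet_id not in self.cache:
            self.cache[tablet_id] = self.master.locate_leader(tablet_id)
        return self.cache[tablet_id]

    def write(self, tablet_id, row):
        try:
            self.cluster.write(self._leader_for(tablet_id), row)
        except StaleLeaderError:
            # The cached leader was deposed: refresh metadata and retry once.
            self.cache.pop(tablet_id, None)
            self.cluster.write(self._leader_for(tablet_id), row)
```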

Page 51: Apache kudu

TABLET INTERNALS

Component        Description
MemRowSet        In-memory writes, stored row-wise
DiskRowSet       Disk-based data store, columnar, 32 KB
DeltaMemStores   In-memory update store
DeltaFile        Disk-based update store

Page 52: Apache kudu

MEM ROW SET
Writes for new keys are written in MemRowSets.
Then flushed out to DiskRowSets.
Kudu uses Bloom filters to determine if a key is in DiskRowSet.
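A rough sketch of this write path: inserts land in an in-memory map, a flush turns that map into an immutable sorted set with its own Bloom filter, and lookups consult the filter before searching a flushed set. The Tablet and BloomFilter classes, the filter size, and the hash scheme are all assumptions made for illustration. Kudu's actual structures are columnar and considerably more involved.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, key):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        return all((self.bits >> pos) & 1 for pos in self._positions(key))


class Tablet:
    def __init__(self):
        self.mem_row_set = {}    # key -> row, holds new inserts
        self.disk_row_sets = []  # list of (bloom_filter, rows) pairs

    def insert(self, key, row):
        self.mem_row_set[key] = row

    def flush(self):
        """Move the in-memory rows into a new sorted, read-only row set."""
        bloom = BloomFilter()
        for key in self.mem_row_set:
            bloom.add(key)
        rows = dict(sorted(self.mem_row_set.items()))
        self.disk_row_sets.append((bloom, rows))
        self.mem_row_set = {}

    def get(self, key):
        if key in self.mem_row_set:
            return self.mem_row_set[key]
        for bloom, rows in self.disk_row_sets:
            # Skip row sets whose filter says the key is definitely absent.
            if bloom.might_contain(key) and key in rows:
                return rows[key]
        return None
```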

Page 53: Apache kudu

DELTA MEM STORES
Updates are written to DeltaMemStores.
Then flushed to DeltaFiles.
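One way to picture the delta stores: the base data is never rewritten in place. Updates are recorded as deltas, first in memory and later in flushed files, and a read applies them on top of the base row. The RowSet class below is a hypothetical sketch of that idea, not Kudu's implementation.

```python
class RowSet:
    def __init__(self, base_rows):
        self.base = dict(base_rows)  # key -> row dict, never modified
        self.delta_mem_store = []    # pending (key, column, new_value) updates
        self.delta_files = []        # flushed lists of updates, oldest first

    def update(self, key, column, value):
        """Record an update as a delta instead of rewriting the base row."""
        self.delta_mem_store.append((key, column, value))

    def flush_deltas(self):
        """Move in-memory deltas into a new flushed delta list."""
        if self.delta_mem_store:
            self.delta_files.append(self.delta_mem_store)
            self.delta_mem_store = []

    def read(self, key):
        """Return the base row with all deltas applied, oldest to newest."""
        row = dict(self.base[key])
        for deltas in self.delta_files + [self.delta_mem_store]:
            for delta_key, column, value in deltas:
                if delta_key == key:
                    row[column] = value
        return row
```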

Page 54: Apache kudu

TABLET COUNT
Tablet count based on partitions
Partitioning function maps row to tablet
Partitioning defined at table creation time
Partitions cannot be redefined dynamically

Page 55: Apache kudu

PARTITIONING

Hash Partitioning
  Subset of the primary key columns and number of buckets
  For example: DISTRIBUTE BY HASH(col1,col2) INTO 16 BUCKETS

Range Partitioning
  Ordered subset of primary key columns
  Maps tuples into binary strings by concatenating values of specified columns using order-preserving encoding.
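Both partitioning schemes map a row's key to a tablet. The sketch below illustrates the idea in plain Python under stated assumptions: CRC32 is used as a stand-in hash (it is not claimed to be the hash Kudu uses), range lookup is a binary search over sorted split points, and the sign-bit-flip integer encoding is one common order-preserving scheme shown for illustration. All function names are hypothetical.

```python
import bisect
import struct
import zlib


def hash_bucket(key_values, num_buckets):
    """Map a tuple of key column values to a bucket (CRC32 as stand-in hash)."""
    encoded = "\x00".join(str(v) for v in key_values).encode()
    return zlib.crc32(encoded) % num_buckets


def range_partition(key, split_rows):
    """Return the index of the range holding key, given the split points."""
    return bisect.bisect_right(sorted(split_rows), key)


def encode_int32_ordered(value):
    """Big-endian bytes with the sign bit flipped.

    Comparing the resulting byte strings gives the same order as
    comparing the original signed 32-bit integers.
    """
    return struct.pack(">I", value + 2**31)
```

With split points 'CA', 'OR', and 'WA', range_partition yields four ranges: keys below 'CA', from 'CA' up to 'OR', from 'OR' up to 'WA', and 'WA' and above.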

Page 56: Apache kudu

DOES KUDU REQUIRE HADOOP?
It does not depend on HDFS.
Has no required dependency on Hadoop, HDFS, MapReduce, Spark.
However, in practice accessed most easily through Impala.

Page 57: Apache kudu

DO KUDU TABLETSERVERS SHARE DISK SPACE WITH HDFS?
Kudu TabletServers and HDFS DataNodes can run on the same machines.
Kudu’s InputFormat enables data locality.
Data locality: MapReduce and Spark tasks likely to run on machines containing data.

Page 58: Apache kudu

KUDU SCHEMA

Page 59: Apache kudu

WHAT DATA TYPES DOES KUDU SUPPORT?

Boolean
8-bit signed integer
16-bit signed integer
32-bit signed integer
64-bit signed integer
Timestamp
32-bit floating-point
64-bit floating-point
String
Binary

Page 60: Apache kudu

HOW LARGE CAN VALUES BE IN KUDU?

Values in the 10s of KB and above are not recommended
Poor performance
Stability issues in current release
Not intended for big blobs or images

Page 61: Apache kudu

ENCODING TYPES

Column Type          Encoding
integer, timestamp   plain, bitshuffle, run length
float                plain, bitshuffle
bool                 plain, dictionary, run length
string, binary       plain, prefix, dictionary

Page 62: Apache kudu

WHAT IS PLAIN ENCODING?
Data in its natural format.
E.g. int32 stored as 32-bit little-endian integers.
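As a small illustration of the int32 case, Python's struct module can pack values as 4-byte little-endian signed integers. The function names are hypothetical. This shows the byte layout only and says nothing about how Kudu arranges blocks on disk.

```python
import struct


def plain_encode_int32(values):
    """Pack each value as a 4-byte little-endian signed integer."""
    return struct.pack(f"<{len(values)}i", *values)


def plain_decode_int32(data):
    """Unpack a byte string of 4-byte little-endian signed integers."""
    return list(struct.unpack(f"<{len(data) // 4}i", data))
```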

Page 63: Apache kudu

WHAT IS BITSHUFFLE ENCODING?
Bitwise columnar encoding.
MSB stored first, then second-MSB, etc.
Result LZ4 compressed.
Works well when values repeat or change by small amounts.

11010010     1111 1111
11010011 --> 0000 1111
11010100     0000 0011
11010101     1100 0101
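The bit rearrangement can be sketched as a transpose: collect the most significant bit of every value, then the next bit, and so on. The function below is a simplified illustration on 8-bit values with a hypothetical name. It omits the LZ4 step and the block-oriented layout of a real bitshuffle implementation.

```python
def bit_transpose(values, width=8):
    """Group bits by position: all MSBs first, then all second-MSBs, etc."""
    groups = []
    for bit in range(width - 1, -1, -1):
        groups.append("".join(str((v >> bit) & 1) for v in values))
    return groups


# The four byte values from the slide.
values = [0b11010010, 0b11010011, 0b11010100, 0b11010101]
groups = bit_transpose(values)
```

Because neighboring values differ only in their low bits, the leading groups are runs of identical bits, which a general-purpose compressor handles well.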

Page 64: Apache kudu

WHAT IS RUN LENGTH ENCODING?

Repeated values (runs) are stored as value and count.
Works well for denormalized tables with consecutive repeated values when sorted by primary key.

54.231.184.7
54.231.184.7
54.231.184.7 --> 54.231.184.7 | 5
54.231.184.7
54.231.184.7
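A minimal run-length encoder and decoder, using itertools.groupby to find consecutive runs. This is a generic sketch of the technique with hypothetical function names, not Kudu's on-disk format.

```python
from itertools import groupby


def rle_encode(values):
    """Collapse consecutive repeats into (value, count) pairs."""
    return [(value, len(list(run))) for value, run in groupby(values)]


def rle_decode(pairs):
    """Expand (value, count) pairs back into the original sequence."""
    return [value for value, count in pairs for _ in range(count)]
```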

Page 65: Apache kudu

WHAT IS DICTIONARY ENCODING?
Dictionary of unique values.
Value encoded as index in dictionary.
Works well if column has small set of unique values.
If there are too many values Kudu falls back to plain encoding.
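The encode-with-fallback behavior might look like the sketch below. The function name and the max_dictionary_size threshold are hypothetical parameters chosen for the example. The slide does not state what limit Kudu applies.

```python
def dictionary_encode(values, max_dictionary_size=256):
    """Return ("dictionary", dictionary, indexes), or ("plain", values)
    when the column has more distinct values than the dictionary allows."""
    dictionary = {}  # value -> index, in first-seen order
    indexes = []
    for value in values:
        if value not in dictionary:
            if len(dictionary) >= max_dictionary_size:
                # Too many distinct values: fall back to plain encoding.
                return ("plain", list(values))
            dictionary[value] = len(dictionary)
        indexes.append(dictionary[value])
    return ("dictionary", list(dictionary), indexes)
```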

Page 66: Apache kudu

WHAT IS PREFIX ENCODING?
Common prefixes compressed in consecutive column values.
Works well when values share common prefixes.
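One simple form of prefix encoding stores, for each value, how many leading characters it shares with the previous value plus the remaining suffix. The sketch below illustrates that idea with hypothetical function names. The exact scheme Kudu uses is not described on the slide.

```python
def prefix_encode(values):
    """Encode each value as (shared_prefix_length, suffix) vs. the previous one."""
    encoded = []
    previous = ""
    for value in values:
        shared = 0
        limit = min(len(previous), len(value))
        while shared < limit and previous[shared] == value[shared]:
            shared += 1
        encoded.append((shared, value[shared:]))
        previous = value
    return encoded


def prefix_decode(encoded):
    """Rebuild the original values from (shared_prefix_length, suffix) pairs."""
    values = []
    previous = ""
    for shared, suffix in encoded:
        value = previous[:shared] + suffix
        values.append(value)
        previous = value
    return values
```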

Page 67: Apache kudu

COLUMN COMPRESSION
Per-column compression using LZ4, Snappy, or ZLib.
By default, columns stored uncompressed.

Page 68: Apache kudu

WHAT IMPACT WILL COMPRESSION HAVE ON SCAN PERFORMANCE AND ON SPACE?
It will reduce storage space.
It will reduce scan performance.
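The space side of this tradeoff is easy to see with Python's built-in zlib module. The column contents below are made up for the example. The extra decompress call stands for the additional CPU work a scan has to do before it can read compressed values.

```python
import zlib

# A hypothetical, highly repetitive column serialized as bytes.
column = ("54.231.184.7\n" * 1000).encode()

compressed = zlib.compress(column)

# A scan has to undo the compression before it can read the values.
restored = zlib.decompress(compressed)
```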

Page 69: Apache kudu

DEMO

Page 70: Apache kudu

ADMIN UI VM PORT SETUP
Virtual Box > Settings > Network > Adapter 2 > Port Forwarding > +

Enter:
  Name: Rule 1
  Protocol: TCP
  Host IP:
  Host Port: 8051
  Guest IP:
  Guest Port: 8051

Open browser:
http://quickstart.cloudera:8051

Page 71: Apache kudu

CONNECT TO VM
ssh [email protected]

Page 72: Apache kudu

START IMPALA SHELL
impala-shell

Page 73: Apache kudu

CREATE A TABLE

CREATE TABLE sales (
  state STRING,
  id INTEGER,
  sale_date STRING,
  store INTEGER,
  product INTEGER,
  amount DOUBLE
)
-- Range-partition on state; the split rows define the tablet boundaries.
DISTRIBUTE BY RANGE(state) SPLIT ROWS(('CA'),('WA'),('OR'))
TBLPROPERTIES(
  'storage_handler' = 'com.cloudera.kudu.hive.KuduStorageHandler',
  'kudu.table_name' = 'sales',
  'kudu.master_addresses' = 'quickstart.cloudera:7051',
  -- The primary key is (state, id).
  'kudu.key_columns' = 'state,id'
);

Page 74: Apache kudu

INSERT SOME DATA INTO IT

INSERT INTO sales (state, id, sale_date, store, product, amount)
VALUES
  ('WA',101,'2014-11-13',100,331,300.00),
  ('OR',104,'2014-11-18',700,329,450.00),
  ('CA',102,'2014-11-15',203,321,200.00),
  ('CA',106,'2014-11-19',202,331,330.00),
  ('WA',103,'2014-11-17',101,373,750.00),
  ('CA',105,'2014-11-19',202,321,200.00);

Page 75: Apache kudu

TRY SQL

SELECT COUNT(*) FROM sales;
SELECT STATE, COUNT(*) FROM sales GROUP BY STATE;
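Without a Kudu VM at hand, the expected results of these two queries can be checked against the same six rows using Python's built-in sqlite3 module. SQLite is only a stand-in here: it has no tablets or partitioning, and its column types differ from Impala's, so this verifies the query logic and nothing about Kudu itself.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (state TEXT, id INTEGER, sale_date TEXT, "
    "store INTEGER, product INTEGER, amount REAL, PRIMARY KEY (state, id))"
)

# The same six rows as the INSERT statement in the demo.
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?, ?, ?)",
    [
        ("WA", 101, "2014-11-13", 100, 331, 300.00),
        ("OR", 104, "2014-11-18", 700, 329, 450.00),
        ("CA", 102, "2014-11-15", 203, 321, 200.00),
        ("CA", 106, "2014-11-19", 202, 331, 330.00),
        ("WA", 103, "2014-11-17", 101, 373, 750.00),
        ("CA", 105, "2014-11-19", 202, 321, 200.00),
    ],
)

total = conn.execute("SELECT COUNT(*) FROM sales").fetchone()[0]
by_state = dict(conn.execute("SELECT state, COUNT(*) FROM sales GROUP BY state"))
```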

Page 76: Apache kudu

VIEW TABLE IN ADMIN UI
Go to http://quickstart.cloudera:8051/

Page 77: Apache kudu

REFERENCES

Page 78: Apache kudu

READINGS

Kudu Whitepaper
http://getkudu.io/kudu.pdf

Quickstart VM
http://getkudu.io/docs/quickstart.html

Schema
http://getkudu.io/docs/schema_design.html

Kudu Impala Guide
http://www.cloudera.com/documentation/betas/kudu/0-5-0/topics/kudu_impala.html

Page 79: Apache kudu

GALVANIZE DATA ENGINEERING

Page 80: Apache kudu

QUESTIONS