Kudu Forrester Webinar

Upload: cloudera-inc

Post on 19-Mar-2017

TRANSCRIPT

Page 1: Kudu Forrester Webinar

© Cloudera, Inc. All rights reserved.

Apache Kudu Webinar Series
Understanding and Unlocking the Value of Real-Time Data

Ryan Lippert | Cloudera
Michele Goetz | Forrester (Special Guest)

Page 2: Kudu Forrester Webinar

Kudu Webinar Series

Part 1: Lambda Architectures – Simplified by Apache Kudu
A look into the potential trouble involved with a lambda architecture, and how Apache Kudu can dramatically simplify real-time analytics.

Part 2: Extending the Capabilities of Operational and Analytical Databases
An examination of how Apache Kudu expands the set of use cases that Cloudera’s Operational and Analytical databases can handle.

Part 3: Data-in-Motion: Unlock the Value of Real-Time Data
Forrester will discuss their research into real-time data pipelines and analytics, and Cloudera will discuss how to make it a reality.

Part 4: Technical Deep-Dive into Apache Kudu (New!)
An in-depth examination of the technical architecture and design of Apache Kudu, straight from a PMC member.

Page 3: Kudu Forrester Webinar

Updateable Analytic Storage
Simple real-time analytics and updates with Apache Kudu

Kudu: Storage for fast analytics on fast data
• Simplified architecture for building real-time analytic applications
• Designed for next-generation hardware for faster analytic performance across frameworks
• Native Hadoop storage engine

Flexibility for the right tools for the right use case in one platform
• Only analytic database for big data with Kudu + Impala
• Simple real-time applications with Kudu + Spark

Use cases
• Time series data
• Machine data analytics
• Online reporting

[Diagram: platform stack]
• INTEGRATE: Structured (Sqoop); Unstructured (Kafka, Flume)
• PROCESS, ANALYZE, SERVE: Batch (Spark, Hive, Pig, MapReduce); Stream (Spark); SQL (Impala); Search (Solr); Other (Kite)
• UNIFIED SERVICES: Resource Management (YARN); Security (Sentry, RecordService)
• STORE: Filesystem (HDFS); Relational (Kudu); NoSQL (HBase); Other (Object Store)

Page 4: Kudu Forrester Webinar

Ingest data of any type or volume

Process data as it arrives

Serve data to users and applications

Real-Time Data

Page 5: Kudu Forrester Webinar

Agenda

Drivers for agile, real-time data platforms
What key use cases are driving businesses toward real-time platforms?

Data on adoption trends for real-time technologies
What is Forrester seeing in the market for real-time technologies?

Deploying a real-time OSS architecture to grow your business
How can you build a scalable, cost-effective platform to grow your business?

Page 6: Kudu Forrester Webinar

© 2017 FORRESTER. REPRODUCTION PROHIBITED.

Michele Goetz
Special Guest Speaker
Principal Analyst Serving Enterprise Architecture Professionals

Page 8: Kudu Forrester Webinar

Superior CX depends on data and insights

Page 9: Kudu Forrester Webinar

Fraud and risk management requires real-time data

Page 10: Kudu Forrester Webinar

IoT heat map shows where data matters most, now

Page 11: Kudu Forrester Webinar

Data bottlenecks are catalysts for transition

Page 12: Kudu Forrester Webinar

Create a road map for a real-time, agile data platform

Page 14: Kudu Forrester Webinar

Leaders are focused on the technologies that allow data and insights to be consumed across the organization

What are your firm's plans for the following data driven initiatives?
(Values: Expanding/Implemented | Planning to implement within the next 12 months)

• Creating an organizational center of excellence for business intelligence: 51% | 22%
• Combine content management and data management programs into a unified information management program: 51% | 22%
• Changing our processes to promote data stewardship and sharing: 51% | 22%
• Investing in platforms to and share out data content: 51% | 22%
• Creating a business led data stewardship or governance program: 51% | 22%
• Changing management incentives to promote data sharing: 49% | 24%
• Implementing analytics insights in software systems to aid customers or support employee decisions: 52% | 22%
• Investing more in business friendly, self-service visualization and analytics: 52% | 23%
• Engaging external services providers or strategic business consultants for data and analytics or insights services: 54% | 22%
• Providing data preparation tools for self-service data management: 54% | 23%
• Investing in distributed real time insight delivery technology: 58% | 22%

Base: 3005 global data and analytics decision-makers. Source: Business Technographics® Global Data & Analytics Survey, 2016

Page 15: Kudu Forrester Webinar

54% of data and analytics technology decision-makers increased their budgets for data and analytics from 2015 to 2016

Which of the following describes your [TDM=”IT budget data and analytics technology or services”; BDM=”business budget for data and analytics technology or services”] from 2015 to 2016?

• Decrease by 5% to 10%: 4%
• Don’t know: 5%
• Decrease by 1-4%: 6%
• Increase by more than 10%: 6%
• Increase by 5% to 10%: 22%
• Increase by 1-4%: 26%
• Stay about the same: 30%

Base: 325 global data and analytics technology decision-makers. “Don’t know” not shown.
Source: Business Technographics® Global Data & Analytics Survey, 2016

Page 16: Kudu Forrester Webinar

Companies of all sizes are spending millions for data & analytics

Please estimate, in millions, how much your data and analytics budget is for 2016? (Note: Number is in US Dollars)
(Values: SMB (20-999 employees)* | Enterprise (1,000 or more employees))

• Less than $1 million: 55% | 32%
• $1 million to under $10 million: 22% | 30%
• $10 million to under $100 million: 9% | 13%
• $100 million to under $500 million: 1% | 4%
• $500 million to under $1 billion: 1% | 2%
• $1 billion to under $5 billion: 0% | 2%
• $5 billion or more: 0% | 1%

Note: Don’t know excluded. Base: 765*, 1,288 global data and analytics decision makers
Source: Business Technographics® Global Data & Analytics Survey, 2016

Page 17: Kudu Forrester Webinar

Among the DM technologies Forrester tracks, interest for stream processing tools has grown the most YoY

What are your firm's plans to use the following data management technologies?

% with commitment (expanding, implemented, or planning to implement in the next 12 months)
• Stream processing tools: 59% (2015) → 64% (2016), +5 p.p.
• Inverted index database: 61% (2015) → 64% (2016), +3 p.p.
• Distributed NoSQL databases: 63% (2015) → 61% (2016), -2 p.p.
• Hadoop: 63% (2015) → 62% (2016), -1 p.p.
• Associative index databases: 60% (2015) → 58% (2016), -2 p.p.
• RDF, triple store: 59% (2015) → 56% (2016), -3 p.p.

% with interest, but no immediate plans (plotted below the axis)
• Stream processing tools: 20% (2015), 13% (2016)
• Inverted index database: 19% (2015), 13% (2016)
• Distributed NoSQL databases: 19% (2015), 16% (2016)
• Hadoop: 20% (2015), 14% (2016)
• Associative index databases: 19% (2015), 14% (2016)
• RDF, triple store: 19% (2015), 13% (2016)

Base: 2094 and *1805 global data and analytics technology decision-makers.
Source: Business Technographics® Global Data & Analytics Survey, 2016

Page 18: Kudu Forrester Webinar

Streaming analytics high in the list of big data plans

Which of the following are included in your plans for big data?

• NoSQL other than Hadoop: 16%
• A MPP (massively parallel processing) data warehouse: 18%
• Semantic technologies (ontology building, search, auto curation, graph, etc.): 22%
• Hadoop (including HBase or Accumulo): 23%
• Data anonymization or de-identification: 23%
• Creating or building out a data lake: 26%
• Marketing or digital data management platforms and service providers that brand their offerings as big data: 26%
• Packaged analytics technologies that brand themselves as big data: 27%
• Unstructured data mining / analytics: 28%
• Distributed in memory databases, grids, analytics tools: 30%
• Streaming analytics / computing: 33%
• Large scale predictive modeling, data mining or other advanced analytics: 36%
• Public cloud big data services: 40%

Base: Total: 2094
Source: Business Technographics® Global Data & Analytics Survey, 2016

Page 20: Kudu Forrester Webinar

Trend Towards Real-Time Data Platforms is Clear

Drivers for Real-Time Platforms
• Enhancing customer experiences
• Risk Management
• Advancement of IoT and broader instrumentation

Adoption is Accelerating
• Top data-driven initiative by investment: distributed delivery of real-time data
• DM technology with highest momentum: stream processing
• Top big data plans: streaming analytics is top 3
• Broad, large investments: 90% of decision makers are either continuing or increasing their investments in data and analytics; millions/billions being spent

Page 21: Kudu Forrester Webinar

The Underlying Driver
What drives a use case to real-time?

Use cases, from “Real Time” to “Some Latency Acceptable”:
• High Frequency Trading
• APT Detection
• Fraud Detection
• Predictive Maintenance
• Next Best Offer
• Inventory Management
• Shipping/Logistic Systems
• CRM Systems
• Employee Management
• Strategic Planning

Real-time data management use cases are defined by a common set of characteristics.
• Narrow time window in which to make a decision (automated or manual)
• Opportunity for the data points to change the decision path
• Decreasing value of data over time

Not all use cases have a pressing need for real-time data.
• Broader strategic decisions, for example, do not require real-time data input
• Over time, decreases in HW costs and increases in availability of real-time systems will lead most use cases to be conducted in real-time
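
Two of the characteristics on this slide, the narrow decision window and the decreasing value of data over time, can be made concrete with a minimal Python sketch. The exponential half-life model and both function names are assumptions chosen for illustration; the slide does not specify how value decays.

```python
def remaining_value(initial_value, age_seconds, half_life_seconds):
    """Value left in a data point after it has aged.

    Assumes (hypothetically) that value halves every `half_life_seconds`.
    """
    return initial_value * 0.5 ** (age_seconds / half_life_seconds)


def fits_decision_window(pipeline_latency_seconds, decision_window_seconds):
    """True if ingest + process + serve finishes before the decision is due."""
    return pipeline_latency_seconds <= decision_window_seconds


# A fraud check that must answer within 1 second of the transaction.
print(fits_decision_window(0.2, 1.0))      # streaming pipeline
print(fits_decision_window(3600.0, 1.0))   # hourly batch job
print(remaining_value(100.0, 10.0, 10.0))  # one half-life later
```

Under this model, a use case sits toward the real-time end of the list when its decision window is short and its half-life is small.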

Page 22: Kudu Forrester Webinar

Moving to Real-Time and Leveraging Analytics
What do we have to gain?

Traditional Architectures: “Monitoring System”
Sensors are automatically monitored and programmed to deliver warnings when readings fall outside of an “optimal zone”.
Basic models developed over small subsets of data.

Real-Time Analytic Capabilities: “Predictive System”
Ingestion and processing of all sensor data into an unlimited data store with analytic capabilities enables machine learning, which can provide automated optimization and predictive maintenance.

“Only 1 percent of data from an oil rig with 30,000 sensors is examined. The data that are used today are mostly for anomaly detection and control, not optimization and prediction, which provide the greatest value.”
- McKinsey & Company
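
The difference between the two systems can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a description of any product: the zone bounds, the function names, and the use of a least-squares trend line as a stand-in for "machine learning" are all hypothetical.

```python
from statistics import mean

OPTIMAL_ZONE = (20.0, 80.0)  # hypothetical lower/upper bounds for a sensor


def monitoring_alerts(readings, zone=OPTIMAL_ZONE):
    """Monitoring system: indices of readings already outside the zone."""
    low, high = zone
    return [i for i, r in enumerate(readings) if r < low or r > high]


def predicted_breach(readings, zone=OPTIMAL_ZONE, horizon=3):
    """Predictive stand-in: fit a least-squares line to all readings and
    check whether the value projected `horizon` steps ahead leaves the zone."""
    n = len(readings)
    if n < 2:
        return False
    xs = range(n)
    x_bar, y_bar = mean(xs), mean(readings)
    slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, readings))
             / sum((x - x_bar) ** 2 for x in xs))
    projected = y_bar + slope * ((n - 1 + horizon) - x_bar)
    low, high = zone
    return projected < low or projected > high


readings = [50.0, 55.0, 60.0, 65.0, 70.0, 75.0]
print(monitoring_alerts(readings))  # nothing has left the zone yet
print(predicted_breach(readings))   # but the trend is heading out of it
```

The monitoring check stays silent until a reading is already out of range, while the trend check uses the full history to flag the problem before it happens.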

Page 24: Kudu Forrester Webinar

Ingestion at Cloudera
• Apache Sqoop for data from relational databases
• Apache Flume for logs, event based data
• Apache Kafka for fast, scalable, and fault-tolerant messaging

Partners, such as StreamSets, provide rich visualization tools

Ingestion in Real-Time
Stream Ingestion is a Must for Many Use Cases

Ingestion isn’t just about internal business data anymore.
• Traditional ingestion was internally focused, and often a matter of moving data from one silo or system to another
• Today, businesses aim to take in data from a variety of external sources, IoT sensors, and machine-generated (user/network) data

Your data journey can’t start until the data arrives.
• Each step of the ingest/process/serve data pipeline must occur at real-time speed if decisions are to be made in time to affect the course of business

Visualization helps practitioners understand their data.
• Complex tasks can be made less complex via graphical representations; data ingestion is no different

Page 25: Kudu Forrester Webinar

Stream Processing at Cloudera

Spark Streaming, the leading open-source framework for real-time use cases, is deployed in Cloudera’s real-time architectures.

Cloudera has the broadest base of Hadoop-adjacent experience with Spark and integrating it with Apache components.

Processing in Real-Time
Unlocking Value at Speed

For some use cases, batch just isn’t enough.
• Batch processing can lead to bottlenecks and delays in data transformations that cause missed opportunities.

Apache Spark is gaining momentum for a reason.
• Leveraging Apache Spark for stream processing enables real-time use cases with sub-second latency and best-in-class APIs.

Spark has a best-in-class ecosystem.
• Machine learning (via MLlib) is seamlessly integrated into Spark.
• Broadest set of vendors and contributors working on Spark among available processing engines, leading to rapid innovation.

Page 26: Kudu Forrester Webinar

Data Serving at Cloudera

Apache Kudu provides batch analysis and real-time serving within the same storage layer

Apache HBase yields the best read/write performance

Cloudera Search enables SQL-like faceted search in natural language

Apache Kafka can be used to serve data to applications and users

Serving in Real-Time
Inject Data into Real-Time Decisions

You need options that suit your use case.
• Platform proliferation hurts IT departments as skillsets are divided; fewer platforms with broad capabilities help.

Apache Kudu changes the game for open source software.
• Combining real-time serving with analytic scans through a relational database required a complex lambda architecture until Kudu
• Together, simplification and affordability should drive more use cases to real-time automated processes, in turn driving increased revenue, decreased risk, and better service for companies deploying Kudu
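
As a rough illustration of what serving updates and analytic scans from a single storage layer looks like, here is a minimal sketch. It uses Python's built-in `sqlite3` module purely as a stand-in: it is not Kudu and does not show any Kudu API. The table `sensor_latest`, its columns, and the helper functions are hypothetical.

```python
import sqlite3

# In-memory relational table keyed by sensor_id (stand-in for one storage layer).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sensor_latest ("
    "sensor_id TEXT PRIMARY KEY, reading REAL, updated_at INTEGER)"
)


def upsert(sensor_id, reading, updated_at):
    """Real-time path: insert a new row or overwrite the existing one by key."""
    conn.execute(
        "INSERT OR REPLACE INTO sensor_latest VALUES (?, ?, ?)",
        (sensor_id, reading, updated_at),
    )


def average_reading():
    """Analytic path: scan the same table the updates were written to."""
    return conn.execute("SELECT AVG(reading) FROM sensor_latest").fetchone()[0]


upsert("a", 10.0, 1)
upsert("b", 20.0, 1)
upsert("a", 30.0, 2)      # update in place; no separate batch layer to reconcile
print(average_reading())  # the scan sees the update immediately
```

In a lambda architecture the update path and the scan path would typically live in two systems with a merge step between them; here both functions read and write one table.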

Page 27: Kudu Forrester Webinar

Apache Kudu: Filling the Analytic Gap

[Diagram: storage engines positioned by Pace of Data and Pace of Analysis; axis labels: Unchanging, Append-Only, Fast Changing / Frequent Updates, Real-Time]
• HDFS: Fast Scans, Analytics and Processing of Stored Data; Arbitrary Storage (Active Archive)
• HBase: Fast On-Line Updates & Data Serving
• Kudu: Fast Analytics (on fast-changing or frequently-updated data)

Kudu fills the Gap
Modern analytic applications often require complex data flow & difficult integration work to move data between HBase & HDFS

Page 28: Kudu Forrester Webinar

Real-Time Data Analysis at Work
Customer 360 “Next Best Offer 2.0”

[Diagram components: Data Sources, Kafka, Spark Streaming, Kudu, Spark MLlib, Spark, Application. Diagram labels: Customer Interaction, Individual Session, Full Model/Learning]

• Data Request Sent For Stream Processing
• Data Cleaned/Ordered/Processed, Then Delivered to Kudu for Modelling
• User’s navigation returns the results they are looking for, in addition to offers and suggestions hyper-customized for them.

Illustrative, models will likely have >2 dimensions
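
The flow on this slide can be walked through with a small, self-contained Python sketch. Every component is a plain-Python stand-in: a list plays the role of the Kafka topic, `clean_and_order` the stream processing step, a dictionary of counters the Kudu table, and a most-viewed-category rule the model. All names, the event fields, and the scoring rule are hypothetical; no Kafka, Spark, or Kudu API is shown.

```python
from collections import Counter, defaultdict

# Stand-in for the serving table: user -> view counts per category.
store = defaultdict(Counter)


def clean_and_order(raw_events):
    """Stream-processing stand-in: drop incomplete events, order by timestamp."""
    valid = [e for e in raw_events if e.get("user") and e.get("category")]
    return sorted(valid, key=lambda e: e["ts"])


def deliver(events):
    """Write processed events to the store so they are immediately queryable."""
    for event in events:
        store[event["user"]][event["category"]] += 1


def next_best_offer(user, catalog):
    """Model stand-in: offer from the user's most-viewed category, if any."""
    history = store.get(user)
    if not history:
        return None
    top_category, _count = history.most_common(1)[0]
    return catalog.get(top_category)


raw = [  # stand-in for messages arriving on a topic
    {"user": "u1", "category": "shoes", "ts": 3},
    {"user": "u1", "category": "hats", "ts": 1},
    {"user": "u1", "category": "shoes", "ts": 2},
    {"user": None, "category": "shoes", "ts": 4},  # incomplete, dropped
]
catalog = {"shoes": "10% off sneakers", "hats": "2-for-1 caps"}

deliver(clean_and_order(raw))
print(next_best_offer("u1", catalog))
```

A real model would use many more dimensions than a single category count, as the note on the slide points out.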

Page 29: Kudu Forrester Webinar

Machine Learning
Kudu opens the door to machine learning

Kudu provides the ability to leverage real-time updates and analytic scans together - critical for many machine learning applications.

Source: GHOSTS IN THE MACHINE: Artificial intelligence, risks and regulation in financial markets

Page 30: Kudu Forrester Webinar

The Time for Real-Time Data and Analytics is Now.

And the platform for it is Cloudera Enterprise.
