VESIT-CLOUDERA

Industry Intensive Course

Course Description: 
The Big Data landscape is continuously evolving as new technologies emerge and existing technologies mature. This is a comprehensive course covering Spark and key elements of the Hadoop Ecosystem used in developing end-to-end applications for processing Big Data efficiently. Students who complete this course will understand key Spark and Hadoop concepts, and they will learn to apply Spark and Hadoop tools in developing applications for solving the types of problems faced by enterprises and research institutions today.

Prerequisites: 
This course is designed for developers and engineers who have programming experience. Apache Spark examples and homework labs are presented in Scala and Python, so the ability to program in one of those languages is required. Basic familiarity with the Linux command line is assumed. Basic knowledge of SQL is helpful; prior knowledge of Hadoop is not required.

Course Objectives

During this course, the learner will study:

  • How the Hadoop Ecosystem fits in with the data processing lifecycle
  • How data is distributed, stored and processed in a Hadoop cluster
  • How to use Sqoop and Flume to ingest data
  • How to process distributed data with Spark
  • How to model structured data as tables in Impala and Hive
  • How to choose a data storage format for your data usage patterns
  • Best practices for data storage

Course Outcomes:

After completing the course, the learner will be able to:

  • Understand the components of Hadoop and the Hadoop Ecosystem
  • Access and process data on a distributed file system
  • Manage job execution in a Hadoop environment
  • Ingest data using Sqoop and Flume
  • Analyze Big Data using Hive and Impala
  • Develop Big Data applications using Spark and the Hadoop Ecosystem
Course Structure

Course Outline

Sr. No. Contents Session Hrs
1

Module 1 : Introduction

  • About This Course
  • About Cloudera
Session 1 2 Hrs
2

Module 2: Introduction to Hadoop and the Hadoop Ecosystem

  • Hadoop
  • Data Storage and Ingest
  • Data Processing
  • Data Analysis and Exploration
  • Other Ecosystem Tools
  • Introduction to the Homework Labs
  • Homework Labs: Setup and General Notes
Session 2 2 Hrs
3

Hadoop Architecture and HDFS

  • Hadoop
  • Data Storage and Ingest
  • Data Processing
  • Data Analysis and Exploration
  • Other Ecosystem Tools
  • Introduction to the Homework Labs
  • Homework Labs: Setup and General Notes
Session 3 2 Hrs
4

Module 3: Importing and Modeling Structured Data: Importing Relational Data with Apache Sqoop

  • Sqoop Overview
  • Basic Imports and Exports
  • Limiting Results
  • Improving Sqoop’s Performance
  • Sqoop 2
  • Homework Labs: Import Data from MySQL Using Sqoop
Session 4 2 Hrs
5

Introduction to Impala and Hive

  • Introduction to Impala and Hive
  • Why Use Impala and Hive?
  • Querying Data With Impala and Hive
  • Comparing Hive and Impala to Traditional Databases
Session 5 2 Hrs
6

Modeling and Managing Data with Impala and Hive

  • Data Storage Overview
  • Creating Databases and Tables
  • Loading Data into Tables
  • HCatalog
  • Impala Metadata Caching
  • Homework Labs: Create and Populate Tables in Impala or Hive
Session 6 2 Hrs
7

Data Formats

  • File Formats
  • Avro Schemas
  • Avro Schema Evolution
  • Using Avro with Impala, Hive and Sqoop
  • Using Parquet with Impala, Hive and Sqoop
  • Compression
  • Homework Labs: Select a Format for a Data File
Session 7 2 Hrs
8

Data File Partitioning

  • Partitioning Overview
  • Partitioning in Impala and Hive
  • Conclusion
  • Homework Labs: Partition Data in Impala or Hive
Session 8 2 Hrs
9

Module 4: Ingesting Streaming Data

  • What is Apache Flume?
  • Basic Flume Architecture
  • Flume Sources
  • Flume Sinks
  • Flume Channels
  • Flume Configuration
  • Homework Labs: Collect Web Server Logs with Flume
Session 9 2 Hrs
10

Module 5: Distributed Data Processing with Spark: Spark Basics

  • What is Apache Spark?
  • Using the Spark Shell
  • RDDs (Resilient Distributed Datasets)
  • Functional Programming in Spark
  • Homework Labs:
  • View the Spark Documentation
  • Explore RDDs Using the Spark Shell
  • Use RDDs to Transform a Dataset
Session 10 2 Hrs
11

Working with RDDs in Spark

  • Creating RDDs
  • Other General RDD Operations
  • Homework Labs: Process Data Files with Spark
Session 11 2 Hrs
12

Aggregating Data with Pair RDDs

  • Key-Value Pair RDDs
  • Map-Reduce
  • Other Pair RDD Operations
  • Homework Labs: Use Pair RDDs to Join Two Datasets
Session 12 2 Hrs
13

Writing and Deploying Spark Applications

  • Spark Applications vs. Spark Shell
  • Creating the SparkContext
  • Building a Spark Application (Scala and Java)
  • Running a Spark Application
  • The Spark Application Web UI
  • Homework Labs:
  • Write and Run a Spark Application
  • Configuring Spark Properties
  • Logging
  • Homework Labs: Configure a Spark Application
Session 13 2 Hrs
14

Parallel Processing in Spark

  • Review: Spark on a Cluster
  • RDD Partitions
  • Partitioning of File-based RDDs
  • HDFS and Data Locality
  • Executing Parallel Operations
  • Stages and Tasks
  • Homework Labs: View Jobs and Stages in the Spark Application UI
Session 14 2 Hrs
15

Spark RDD Persistence

  • RDD Lineage
  • RDD Persistence Overview
  • Distributed Persistence
  • Homework Labs: Persist an RDD
Session 15 2 Hrs
16

Common Patterns in Spark Data Processing

  • Common Spark Use Cases
  • Iterative Algorithms in Spark
  • Graph Processing and Analysis
  • Machine Learning
  • Example: k-means
  • Homework Labs: Iterative Processing in Spark
  • Optional Homework Lab: Partition Data Files Using Spark
Session 16 2 Hrs
17

Spark SQL and DataFrames

  • Spark SQL and the SQL Context
  • Creating DataFrames
  • Transforming and Querying DataFrames
  • Saving DataFrames
  • DataFrames and RDDs
  • Comparing Spark SQL, Impala and Hive-on-Spark
  • Homework Labs: Use Spark SQL for ETL
Session 17 2 Hrs
18

Conclusion and Project Discussion

Session 18 2 Hrs

Evaluation:   
Students registered for the Cloudera course are evaluated based on the following parameters:
Evaluation Metrics: 

  1. Project (40%)
    • Leaderboard Rank (on a competition website like Kaggle, Data
    • Presentation of Solution
    • Report (soft copy)
  2. End-of-course Exam (40%) - 50 Marks
  3. Attendance (20%)
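
As a rough illustration of how the three weights above combine, the following Python sketch computes a final percentage. The function name `final_score` and the assumption that the project and attendance are each scored as a percentage are hypothetical; this page gives only the weights and the 50-mark exam total.

```python
def final_score(project_pct, exam_marks, attendance_pct, exam_max=50):
    """Combine the three components using the 40/40/20 weights listed above.

    project_pct, attendance_pct: percentages in [0, 100]. This scoring scale
        is an assumption; the page does not say how these are marked.
    exam_marks: raw exam score out of exam_max (50 marks per the page).
    Returns the weighted final score as a percentage.
    """
    if not 0 <= exam_marks <= exam_max:
        raise ValueError("exam_marks must be between 0 and exam_max")
    for name, value in (("project_pct", project_pct),
                        ("attendance_pct", attendance_pct)):
        if not 0 <= value <= 100:
            raise ValueError(f"{name} must be between 0 and 100")

    # Convert the raw exam marks to a percentage before weighting.
    exam_pct = 100.0 * exam_marks / exam_max

    # Weights: project 40%, exam 40%, attendance 20%.
    return (40 * project_pct + 40 * exam_pct + 20 * attendance_pct) / 100.0


if __name__ == "__main__":
    # Hypothetical student: 80% on the project, 40/50 in the exam, 90% attendance.
    print(final_score(80, 40, 90))
```

Under these assumptions, a student with 80% on the project, 40 of 50 exam marks, and 90% attendance would receive 32 + 32 + 18 = 82%.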

Text Books Recommended 

  • Learning Spark, by Karau, Konwinski, Wendell, and Zaharia

Optional 

  • Hadoop: The Definitive Guide (third edition), by Tom White
  • Using Flume, by Hari Shreedharan
  • Hadoop Operations, by Eric Sammer
  • Programming Hive, by Capriolo, Wampler, and Rutherglen
  • Advanced Analytics with Spark, by Ryza, Laserson, Owen, and Wills

Tools:  
Cloudera supplies two fully configured Hadoop VMs (virtual machines), which include datasets for homework labs. The professor VM contains solutions to homework assignments as well as a set of supporting examples. The student VM does not contain solutions or examples; we leave it to the professor to decide whether to share these items.

Course Faculty

Resource Persons

  1. Dr. Mrs. M. Vijayalakshmi, Vice Principal
  2. Dr. Mrs. Sujata Khedkar, Associate Professor, Department of Computer Engineering
  3. Mrs. Asha Bharambe, Associate Professor, Department of Information Technology
  4. Mrs. Sangeeta Oswal, Assistant Professor, Department of Master of Computer Applications
  5. Mrs. Jayshree Hajgude, Assistant Professor, Department of Information Technology
