VESIT-CLOUDERA

Industry Intensive Course

Course Description:
The Big Data landscape is continuously evolving as new technologies emerge and existing technologies mature. This is a comprehensive course covering Spark and key elements of the Hadoop Ecosystem used in developing end-to-end applications for processing Big Data efficiently. Students who complete this course will understand key Spark and Hadoop concepts, and they will learn to apply Spark and Hadoop tools in developing applications that solve the types of problems faced by enterprises and research institutions today.

Prerequisites:
This course is designed for developers and engineers who have programming experience. Apache Spark examples and homework labs are presented in Scala and Python; therefore, the ability to program in one of those languages is required. Basic familiarity with the Linux command line is assumed. Basic knowledge of SQL is helpful; prior knowledge of Hadoop is not required.

Course Objectives

During this course, the learner will study:

How the Hadoop Ecosystem fits in with the data processing lifecycle
How data is distributed, stored and processed in a Hadoop cluster
How to use Sqoop and Flume to ingest data
How to process distributed data with Spark
How to model structured data as tables in Impala and Hive
How to choose a data storage format for your data usage patterns
Best practices for data storage

Course Outcomes:

After completing the course, the learner will be able to:

Understand components of Hadoop and Hadoop Ecosystem.
Access and Process Data on Distributed File System
Manage Job Execution in Hadoop Environment
Ingest data using Sqoop and Flume
Analyze Big Data using Hive and Impala
Develop Big Data applications using Spark and the Hadoop Ecosystem

Course Structure

Course Outline

Sr. No.	Contents	Session	Hrs
1	Module 1: Introduction. About This Course; About Cloudera	Session 1	2 Hrs
2	Module 2: Introduction to Hadoop and the Hadoop Ecosystem. Hadoop Data Storage and Ingest; Data Processing; Data Analysis and Exploration; Other Ecosystem Tools; Introduction to the Homework Labs; Homework Labs: Setup and General Notes	Session 2	2 Hrs
3	Hadoop Architecture and HDFS. Hadoop Data Storage and Ingest; Data Processing; Data Analysis and Exploration; Other Ecosystem Tools; Introduction to the Homework Labs; Homework Labs: Setup and General Notes	Session 3	2 Hrs
4	Importing and Modeling Structured Data. Importing Relational Data with Apache Sqoop: Sqoop Overview; Basic Imports and Exports; Limiting Results; Improving Sqoop’s Performance; Sqoop 2; Homework Labs: Import Data from MySQL Using Sqoop	Session 4	2 Hrs
5	Introduction to Impala and Hive. Introduction to Impala and Hive; Why Use Impala and Hive?; Querying Data With Impala and Hive; Comparing Hive and Impala to Traditional Databases	Session 5	2 Hrs
6	Modeling and Managing Data with Impala and Hive. Data Storage Overview; Creating Databases and Tables; Loading Data into Tables; HCatalog; Impala Metadata Caching; Homework Labs: Create and Populate Tables in Impala or Hive	Session 6	2 Hrs
7	Data Formats. File Formats; Avro Schemas; Avro Schema Evolution; Using Avro with Impala, Hive and Sqoop; Using Parquet with Impala, Hive and Sqoop; Compression; Homework Labs: Select a Format for a Data File	Session 7	2 Hrs
8	Data File Partitioning. Partitioning Overview; Partitioning in Impala and Hive; Conclusion; Homework Labs: Partition Data in Impala or Hive	Session 8	2 Hrs
9	Module 4: Ingesting Streaming Data. What is Apache Flume?; Basic Flume Architecture; Flume Sources; Flume Sinks; Flume Channels; Flume Configuration; Homework Labs: Collect Web Server Logs with Flume	Session 9	2 Hrs
10	Module 5: Distributed Data Processing with Spark. Spark Basics: What is Apache Spark?; Using the Spark Shell; RDDs (Resilient Distributed Datasets); Functional Programming in Spark; Homework Labs: View the Spark Documentation; Explore RDDs Using the Spark Shell; Use RDDs to Transform a Dataset	Session 10	2 Hrs
11	Working with RDDs in Spark. Creating RDDs; Other General RDD Operations; Homework Labs: Process Data Files with Spark	Session 11	2 Hrs
12	Aggregating Data with Pair RDDs. Key-Value Pair RDDs; Map-Reduce; Other Pair RDD Operations; Homework Labs: Use Pair RDDs to Join Two Datasets	Session 12	2 Hrs
13	Writing and Deploying Spark Applications. Spark Applications vs. Spark Shell; Creating the SparkContext; Building a Spark Application (Scala and Java); Running a Spark Application; The Spark Application Web UI; Homework Labs: Write and Run a Spark Application; Configuring Spark Properties; Logging; Homework Labs: Configure a Spark Application	Session 13	2 Hrs
14	Parallel Processing in Spark. Review: Spark on a Cluster; RDD Partitions; Partitioning of File-based RDDs; HDFS and Data Locality; Executing Parallel Operations; Stages and Tasks; Homework Labs: View Jobs and Stages in the Spark Application UI	Session 14	2 Hrs
15	Spark RDD Persistence. RDD Lineage; RDD Persistence Overview; Distributed Persistence; Homework Labs: Persist an RDD	Session 15	2 Hrs
16	Common Patterns in Spark Data Processing. Common Spark Use Cases; Iterative Algorithms in Spark; Graph Processing and Analysis; Machine Learning; Example: k-means; Homework Labs: Iterative Processing in Spark; Optional Homework Lab: Partition Data Files Using Spark	Session 16	2 Hrs
17	Spark SQL and DataFrames. Spark SQL and the SQL Context; Creating DataFrames; Transforming and Querying DataFrames; Saving DataFrames; DataFrames and RDDs; Comparing Spark SQL, Impala and Hive-on-Spark; Homework Labs: Use Spark SQL for ETL	Session 17	2 Hrs
18	Conclusion and Project Discussion	Session 18	2 Hrs

Evaluation:
Students registered for the Cloudera course are evaluated on the following parameters:
Evaluation Metrics:

Project (40%)
- Leaderboard rank (on a competition website like Kaggle, Data
- Presentation of Solution
- Report (soft copy)
End-of-course exam (40%) - 50 marks
Attendance (20%)
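
The weighting above can be illustrated with a short calculation. This is only a sketch: the weights come from the list above, but the function name, the idea of expressing project and attendance as percentages, and the reading of "50 Marks" as the exam's maximum are assumptions, not the institute's published grading procedure.

```python
# Weights taken from the evaluation metrics listed above.
WEIGHTS = {"project": 0.40, "exam": 0.40, "attendance": 0.20}

# Assumption: "50 Marks" is the maximum score of the end-of-course exam.
EXAM_MAX_MARKS = 50


def final_score(project_pct, exam_marks, attendance_pct):
    """Return a weighted score out of 100 (hypothetical helper).

    Assumption: project and attendance are each given as a percentage
    (0-100); the source does not say how these components are scored.
    """
    # Convert raw exam marks to a percentage before weighting.
    exam_pct = 100.0 * exam_marks / EXAM_MAX_MARKS
    return (WEIGHTS["project"] * project_pct
            + WEIGHTS["exam"] * exam_pct
            + WEIGHTS["attendance"] * attendance_pct)


if __name__ == "__main__":
    # A student with 80% on the project, 40/50 in the exam, 90% attendance.
    print(round(final_score(80, 40, 90), 2))
```

Under these assumptions, each component contributes in proportion to its weight, so a full-marks exam alone can contribute at most 40 points to the total.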

Text Books Recommended

Learning Spark, by Karau, Konwinski, Wendell, and Zaharia

Optional

Hadoop: The Definitive Guide (third edition), by Tom White
Using Flume, by Hari Shreedharan
Hadoop Operations, by Eric Sammer
Programming Hive, by Capriolo, Wampler, and Rutherglen
Advanced Analytics with Spark, by Ryza, Laserson, Owen, and Wills

Tools:
Cloudera supplies two fully configured Hadoop VMs (virtual machines), which include datasets for homework labs. The professor VM contains solutions to homework assignments as well as a set of supporting examples. The student VM does not contain solutions or examples; we leave it to the professor to decide whether to share these items.

Course Faculty

Resource Persons

Dr. Mrs. M. Vijayalakshmi, Vice Principal
Dr. Mrs. Sujata Khedkar, Associate Professor, Department of Computer Engineering
Mrs. Asha Bharambe, Associate Professor, Department of Information Technology
Mrs. Sangeeta Oswal, Assistant Professor, Department of Master of Computer Applications
Mrs. Jayshree Hajgude, Assistant Professor, Department of Information Technology

Excellence

Learning

Values

Leadership

Discipline

Creativity

Inspiration
