VESIT-CLOUDERA
Course Description:
The Big Data landscape is continuously evolving as new technologies emerge and existing technologies mature. This is a comprehensive course covering Spark and key elements of the Hadoop Ecosystem used in developing end-to-end applications for processing Big Data efficiently. Students who complete this course will understand key Spark and Hadoop concepts, and they will learn to apply Spark and Hadoop tools in developing applications for solving the types of problems faced by enterprises and research institutions today.
Prerequisites:
This course is designed for developers and engineers who have programming experience. Apache Spark examples and homework labs are presented in Scala and Python, so the ability to program in one of those languages is required. Basic familiarity with the Linux command line is assumed. Basic knowledge of SQL is helpful; prior knowledge of Hadoop is not required.
Course Objectives:
During this course, the learner will study:
- How the Hadoop Ecosystem fits in with the data processing lifecycle
- How data is distributed, stored and processed in a Hadoop cluster
- How to use Sqoop and Flume to ingest data
- How to process distributed data with Spark
- How to model structured data as tables in Impala and Hive
- How to choose a data storage format for your data usage patterns
- Best practices for data storage
Course Outcomes:
After completing the course, the learner will be able to:
- Understand the components of Hadoop and the Hadoop Ecosystem
- Access and process data on a distributed file system
- Manage job execution in a Hadoop environment
- Ingest data using Sqoop and Flume
- Analyze Big Data using Hive and Impala
- Develop Big Data applications using Spark and the Hadoop Ecosystem
Course Outline
| Sr. No. | Contents | Session | Hrs |
|---|---|---|---|
| 1 | Module 1: Introduction | Session 1 | 2 Hrs |
| 2 | Module 2: Introduction to Hadoop and the Hadoop Ecosystem | Session 2 | 2 Hrs |
| 3 | Hadoop Architecture and HDFS | Session 3 | 2 Hrs |
| 4 | Importing and Modeling Structured Data, Importing Relational Data with Apache Sqoop | Session 4 | 2 Hrs |
| 5 | Introduction to Impala and Hive | Session 5 | 2 Hrs |
| 6 | Modeling and Managing Data with Impala and Hive | Session 6 | 2 Hrs |
| 7 | Data Formats | Session 7 | 2 Hrs |
| 8 | Data File Partitioning | Session 8 | 2 Hrs |
| 9 | Module 4: Ingesting Streaming Data | Session 9 | 2 Hrs |
| 10 | Module 5: Distributed Data Processing with Spark, Spark Basics | Session 10 | 2 Hrs |
| 11 | Working with RDDs in Spark | Session 11 | 2 Hrs |
| 12 | Aggregating Data with Pair RDDs | Session 12 | 2 Hrs |
| 13 | Writing and Deploying Spark Applications | Session 13 | 2 Hrs |
| 14 | Parallel Processing in Spark | Session 14 | 2 Hrs |
| 15 | Spark RDD Persistence | Session 15 | 2 Hrs |
| 16 | Common Patterns in Spark Data Processing | Session 16 | 2 Hrs |
| 17 | Spark SQL and DataFrames | Session 17 | 2 Hrs |
| 18 | Conclusion and Project Discussion | Session 18 | 2 Hrs |
Evaluation:
Students registered for the Cloudera course are evaluated on the following parameters:
Evaluation Metrics:
- Project (40%)
  - Leaderboard Rank (in the competition website like Kaggle, Data
  - Presentation of Solution
  - Report (soft copy)
- End-of-course exam (40%), 50 marks
- Attendance (20%)
Text Books Recommended
- Learning Spark, by Karau, Konwinski, Wendell, and Zaharia
Optional
- Hadoop: The Definitive Guide (third edition), by Tom White
- Using Flume, by Hari Shreedharan
- Hadoop Operations, by Eric Sammer
- Programming Hive, by Capriolo, Wampler, and Rutherglen
- Advanced Analytics with Spark, by Ryza, Laserson, Owen, and Wills
Tools:
Cloudera supplies two fully configured Hadoop VMs (virtual machines), which include datasets for homework labs. The professor VM contains solutions to homework assignments as well as a set of supporting examples. The student VM does not contain solutions or examples; we leave it to the professor to decide whether to share these items.
Resource Persons
- Dr. Mrs. M. Vijayalakshmi, Vice Principal
- Dr. Mrs. Sujata Khedkar, Associate Professor, Department of Computer Engineering
- Mrs. Asha Bharambe, Associate Professor, Department of Information Technology
- Mrs. Sangeeta Oswal, Assistant Professor, Department of Master of Computer Applications
- Mrs. Jayshree Hajgude, Assistant Professor, Department of Information Technology