PG Diploma in Big Data Analytics & Machine Learning
- Explain how a data warehouse combined with good business intelligence can increase a company’s bottom line
- Describe the components of a data warehouse
- Describe different forms of business intelligence that can be gleaned from a data warehouse and how that intelligence can be applied toward business decision-making
- Develop dimensional models from which key data for critical decision-making can be extracted
- Sketch out the process for extracting data from disparate databases and data sources, and then transforming the data for effective integration into a data warehouse
- Load extracted and transformed data into the data warehouse
- Understanding the evolution of business intelligence
- Benefits of business intelligence
- Business intelligence lifecycle.
- Different Sources of Data.
- Need for Data Management.
- BI for Reporting and Querying
- Knowledge management and master data management (MDM)
OLAP (Online Analytical Processing)
- Features and functions of OLAP
- Data Drilling.
- Data design and dimensional modeling.
- ETL (Extract, Transform & Load).
- Dimension, Facts.
- Types of Schema – Snowflake & Star Schema.
- Design and architecture.
- Hardware and Software selection for BI.
- DW/BI Metrics
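As a quick illustration of the dimensional modeling topics above, here is a minimal sketch of a star schema in plain Python: one fact table of sales joined to dimension tables by surrogate keys, then aggregated by a dimension attribute. All table and column names are hypothetical.

```python
# Minimal star-schema sketch: fact rows reference dimension rows
# through surrogate keys; BI queries join and aggregate across them.
dim_product = {1: {"name": "Widget", "category": "Tools"},
               2: {"name": "Gadget", "category": "Toys"}}
dim_date = {101: {"year": 2024, "quarter": "Q1"}}

fact_sales = [
    {"product_key": 1, "date_key": 101, "amount": 50.0},
    {"product_key": 2, "date_key": 101, "amount": 30.0},
    {"product_key": 1, "date_key": 101, "amount": 20.0},
]

def sales_by_category(facts, products):
    """Aggregate fact rows by a dimension attribute (category)."""
    totals = {}
    for row in facts:
        category = products[row["product_key"]]["category"]
        totals[category] = totals.get(category, 0.0) + row["amount"]
    return totals

print(sales_by_category(fact_sales, dim_product))
# {'Tools': 70.0, 'Toys': 30.0}
```

In a snowflake schema, `category` would itself be normalized out into its own table keyed from `dim_product`, at the cost of an extra join.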
- Introduction to Informatica PowerCenter
- Components & Architecture
- Informatica PowerCenter Client Tools
- Designer, Workflow Manager, Monitor
- Types of Source & Targets
- Working with Source & Targets.
- Mapping, Mapplet, Transformation.
- Object Navigator
- Querying Tools
- Types of Transformations.
- Port Configurations.
- Source Qualifier, Expression, Sorter.
- Aggregator, Filter, Router Transformation.
- Joiner, Rank, Sequence Generator.
- Stored Procedure, Union
- Lookups, Types of Lookups.
- Lookup Cache- Types
- Normalizer, Update Strategy.
- SQL, Transaction Control, Java Transformation.
- Web Services Consumer.
- Variables & Parameter
- Parameter Files
- Versioning, Concurrent Workflows
- Debugger, Using Wizards
- Slowly Changing Dimensions: Type 1 & Type 2 Design.
- Monitoring, Code Migration.
- Deployment Groups
- Query, Labels
- Pushdown Optimization.
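The Type 1 and Type 2 designs covered above can be sketched in plain Python, independent of Informatica itself: Type 1 overwrites the attribute in place, while Type 2 expires the current row and appends a new current version so history is preserved. Column names below are hypothetical.

```python
from datetime import date

def scd_type1(dim_rows, key, new_city):
    """Type 1: overwrite in place; no history kept."""
    for row in dim_rows:
        if row["customer_id"] == key:
            row["city"] = new_city
    return dim_rows

def scd_type2(dim_rows, key, new_city, change_date):
    """Type 2: expire the current row, append a new current version."""
    for row in dim_rows:
        if row["customer_id"] == key and row["current"]:
            row["current"] = False
            row["end_date"] = change_date
            dim_rows.append({"customer_id": key, "city": new_city,
                             "start_date": change_date, "end_date": None,
                             "current": True})
            break
    return dim_rows

dim = [{"customer_id": 7, "city": "Pune",
        "start_date": date(2020, 1, 1), "end_date": None, "current": True}]
scd_type2(dim, 7, "Mumbai", date(2024, 6, 1))
print(len(dim))  # 2 rows: expired Pune version + current Mumbai version
```

In PowerCenter, this logic is typically built with a Lookup plus an Update Strategy transformation rather than hand-written code.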
- Types of data and their significance
- Need for Big Data Analytics.
- Why Big Data with Hadoop?
- History of Hadoop.
- Node, Rack, Cluster.
- Architecture of Hadoop.
- Characteristics of Namenode.
- Significance of JobTracker and Tasktrackers.
- Heartbeat co-ordination with JobTracker.
- Secondary Namenode usage and workaround.
- Hadoop releases and their significance.
- Workaround with datanodes.
- YARN architecture.
- Significance of scalability of operation.
- Use cases where not to use Hadoop.
- Use cases where Hadoop Is used.
- Facebook, Twitter, Snapdeal, Flipkart.
- Hadoop Classes, What is MapReduceBase?
- Mapper Class and its Methods
- What is Partitioner and types
- Hadoop specific data types
- Working on unstructured data analytics
- What is an iterator and its usage techniques
- Types of mappers and reducers
- What is output collector and its significance
- Workaround with Joining of datasets
- Complications with MapReduce
- MapReduce anatomy
- Anagram example, Teragen Example, Treasury Example
- Word Count Example
- Working with multiple mappers
- Working with weather data on multiple datanodes in a Fully distributed architecture
- Use Cases where MapReduce anatomy fails
- Interview questions based on JAVA MapReduce.
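The MapReduce anatomy listed above (mapper emits key/value pairs, the framework shuffles and sorts by key, the reducer aggregates per key) can be sketched in plain Python without a Hadoop cluster, using the classic word-count example:

```python
from itertools import groupby
from operator import itemgetter

def mapper(line):
    """Map phase: emit (word, 1) for every word in the input line."""
    for word in line.lower().split():
        yield (word, 1)

def reducer(word, counts):
    """Reduce phase: sum all counts seen for one key."""
    yield (word, sum(counts))

def run_job(lines):
    # "Shuffle/sort": gather all intermediate pairs, sorted by key,
    # so each reducer sees one key's values grouped together.
    intermediate = sorted(pair for line in lines for pair in mapper(line))
    results = {}
    for word, group in groupby(intermediate, key=itemgetter(0)):
        for k, v in reducer(word, (count for _, count in group)):
            results[k] = v
    return results

print(run_job(["the quick brown fox", "the lazy dog"]))
# {'brown': 1, 'dog': 1, 'fox': 1, 'lazy': 1, 'quick': 1, 'the': 2}
```

The Java version distributes exactly these three phases across TaskTrackers (MR1) or containers (YARN); only the shuffle here is simulated by an in-memory sort.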
- Introduction to Pig Latin
- History and evolution of Pig Latin
- Why Pig is used only with Big Data
- Pig architecture and overview of Compiler and Execution Engine.
- Pig releases and their significance, with bug fixes.
- Pig Specific Data types
- Complex Data types
- Bags, Tuples, Fields
- Pig Specific Methods.
- Comparison between Yahoo Pig & Facebook Hive.
- Working with Grunt Shell.
- Grunt commands (17 in total)
- Pig data input techniques for flat files (comma-separated, tab-delimited, and fixed-width); working with a schemaless approach
- How to attach schema to a file/table in pig.
- Schema referencing for similar tables and files.
- Working with delimiters
- Working with Binary Storage and Text Loader.
- Big Data operations and the read/write analogy.
- Filtering Datasets
- Filtering rows with specific condition
- Filtering rows with multiple conditions
- Filtering rows with string based conditions
- Sorting Datasets
- Sorting rows with specific column or columns
- Multilevel Sort
- Analogy of a sort operation
- Grouping datasets and Co-grouping data
- Joining Datasets
- Types of Joins supported by Pig Latin
- Aggregate operations like average, sum, min, max, count
- Flatten operator
- Creating a UDF (user-defined function) using Java
- Calling UDF from pig script
- Data validation scripts.
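The grouping and aggregation operations above correspond to Pig's `GROUP ... BY` followed by `FOREACH ... GENERATE` with an aggregate. A plain-Python sketch of what those two statements do (field names are hypothetical):

```python
from collections import defaultdict

# Pig-style grouping: collect tuples into "bags" keyed by a field,
# then apply an aggregate (here SUM) per bag.
records = [("hyd", 200), ("pune", 150), ("hyd", 100), ("pune", 50)]

def group_by(tuples, key_index=0):
    bags = defaultdict(list)          # key -> bag of tuples
    for t in tuples:
        bags[t[key_index]].append(t)
    return bags

def foreach_sum(bags, value_index=1):
    """Equivalent of FOREACH grouped GENERATE group, SUM(field)."""
    return {k: sum(t[value_index] for t in bag) for k, bag in bags.items()}

print(foreach_sum(group_by(records)))   # {'hyd': 300, 'pune': 200}
```

In Grunt the same pipeline would be two statements: a `GROUP` producing (group, bag) tuples, then a `FOREACH` projecting the aggregate; Pig compiles both into MapReduce jobs.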
- Installation and Configuration
- Interacting with HDFS using HIVE
- MapReduce programs through HIVE
- HIVE Commands
- Loading, Filtering, Grouping
- Data types, Operators
- Joins, Groups
- Sample programs in HIVE
- Alter and Delete in Hive.
- Partition in Hive.
- Joins in Hive. Unions in hive.
- Industry specific configuration of hive parameters.
- Authentication & Authorization.
- Statistics with Hive.
- Archiving in Hive.
- Hands-on exercise
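The partitioning topic above is about partition pruning: Hive stores each partition in its own directory keyed by the partition column, so a query filtering on that column scans only the matching directories. A minimal sketch of the idea (table and column names hypothetical):

```python
# Each key below stands in for a partition directory such as
# .../sales/month=2024-01/ in a Hive warehouse.
partitions = {
    "2024-01": [("order1", 500), ("order2", 300)],
    "2024-02": [("order3", 700)],
}

def query_with_pruning(parts, wanted_month):
    """Filter on the partition column: only one partition is scanned."""
    scanned = parts.get(wanted_month, [])
    return sum(amount for _, amount in scanned), len(scanned)

total, rows_scanned = query_with_pruning(partitions, "2024-02")
print(total, rows_scanned)  # 700 1 -- one row scanned instead of three
```

Without partitioning, the same query would have to scan every row of the table before filtering.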
- Hbase Architectural point of view
- Region servers and their implementation
- Client API’s and their features
- How messaging system works
- Columns and column families
- Configuring hbase-site.xml
- Available Client
- Loading Hbase with semi-structured data
- Internal data storage in Hbase
- Hbase Architecture
- Creating table with column families
- MapReduce Integration.
- Hbase: Advanced Usage, Schema Design
- Load data from pig to Hbase
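The "internal data storage" and "columns and column families" topics above can be sketched as nested maps: each row key holds column families, each family holds qualifiers, and each cell keeps timestamped versions with the newest returned by default. Row keys and family names below are hypothetical.

```python
# Sketch of HBase's storage model:
# row_key -> {family: {qualifier: [(timestamp, value), ...]}}
table = {}

def put(table, row_key, family, qualifier, value, ts):
    cell = (table.setdefault(row_key, {})
                 .setdefault(family, {})
                 .setdefault(qualifier, []))
    cell.append((ts, value))
    cell.sort(reverse=True)           # newest version first

def get(table, row_key, family, qualifier):
    """Return the newest version, as HBase does by default."""
    versions = table[row_key][family][qualifier]
    return versions[0][1]

put(table, "user#42", "info", "city", "Pune", ts=1)
put(table, "user#42", "info", "city", "Mumbai", ts=2)
print(get(table, "user#42", "info", "city"))  # Mumbai
```

The real store adds region splitting, write-ahead logs, and on-disk HFiles per column family, but the lookup path (row key, then family, then qualifier, then version) is the same.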
- Sqoop architecture
- Data Import and export in SQOOP.
- Deploying quorum and configuration throughout the cluster.
- Introduction to YARN and MR2 daemons.
- Active and Standby Namenodes
- Resource Manager and Application Master
- Node Manager
- Container Objects and Container
- Namenode Federation
- Cloudera Manager and Impala
- Load balancing in cluster with Namenode federation
- Architectural differences between Hadoop 1.0 and 2.0
- Introduction to Flume data integration
- Flume installation on single node and multinode cluster
- Flume architecture and various components
- Data sources types and variants
- Data target types and variants
- Deploying an agent onto a single node cluster.
- Problems associated with flume
- Interview questions based on flume
Introduction to data science
- What is data science?
- Introduction to Analytics life cycle?
- Different types of analysis
R Programming Basics
- Why R
- Introduction to R and CRAN
- Nuts and Bolts of R language
- Advanced Features in R
- ETL in Data Science world
- Concepts of Tidy Data
- Reading Tweets
- Working with dates
- Exploratory Data Analysis
- Plotting system like Base & ggplot
- Research Presentation
- Literate Programming
- R-markdown and R-pubs
- Publish document on Github
- Probability and expected values
- Various Frequency Distributions
- Confidence Intervals
- Hypothesis testing
- Regression definition
- Residual variance
- Automatic feature selection
Machine learning techniques
- Supervised and Un-supervised learning methods
- Classification and clustering
- Time series forecasting
- Model Ensemble
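The model ensemble topic above can be illustrated with the simplest scheme, majority voting: several (hypothetical, rule-based) classifiers each predict a label and the ensemble returns the most common prediction.

```python
from collections import Counter

# Three toy classifiers standing in for trained models.
def model_a(x): return "spam" if "win" in x else "ham"
def model_b(x): return "spam" if "free" in x else "ham"
def model_c(x): return "ham"

def ensemble_predict(models, x):
    """Majority vote across the individual model predictions."""
    votes = Counter(m(x) for m in models)
    return votes.most_common(1)[0][0]

models = [model_a, model_b, model_c]
print(ensemble_predict(models, "win a free phone"))  # spam (2 of 3 votes)
print(ensemble_predict(models, "meeting at noon"))   # ham  (3 of 3 votes)
```

Bagging and boosting refine this idea by also controlling how each member model is trained.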
Natural Language Processing (NLP)
- Basic building block of Python
- Performing a Classification
Machine Learning with Python
- Python with Scikit-learn package
- Implementing regression, Decision Trees and Clustering Python
- Apache Spark
- Spark Transformation & Action
Machine Learning with PySpark
- Introduction to PySpark
- Applying Machine Learning to Big Data
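The transformation-versus-action distinction listed under Spark can be sketched in plain Python, without a cluster: transformations (`map`, `filter`) only build a lazy pipeline, and nothing executes until an action (`collect`) forces evaluation. The `MiniRDD` class below is a hypothetical stand-in, not the PySpark API.

```python
class MiniRDD:
    """Toy lazy dataset illustrating Spark's transformation/action split."""
    def __init__(self, data_fn):
        self._data_fn = data_fn               # lazy source of elements

    def map(self, f):                         # transformation: lazy
        return MiniRDD(lambda: (f(x) for x in self._data_fn()))

    def filter(self, pred):                   # transformation: lazy
        return MiniRDD(lambda: (x for x in self._data_fn() if pred(x)))

    def collect(self):                        # action: triggers evaluation
        return list(self._data_fn())

rdd = MiniRDD(lambda: iter([1, 2, 3, 4, 5]))
result = rdd.map(lambda x: x * 10).filter(lambda x: x > 20).collect()
print(result)  # [30, 40, 50]
```

Real Spark adds partitioning, fault tolerance via lineage, and distributed execution on top of this same lazy-pipeline idea.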
Introduction to Course
- Overview of course
- Types of Analysis – Descriptive, Predictive, Prescriptive
R programming Basics
- Introduction to R and CRAN
- Introduction to interface, CLI, Data types
- Vectors, Lists, Factors, Matrices, Data-Frames
- File IO (Flat files, Excel), subsetting
- Control Statements
- Creating function
- Raw and Tidy Data – Nature of Data
- Base: plot(), hist(),boxplot(),barplot(),par()
- Summary Measures: Central Tendency, Dispersion, Chebyshev’s Theorem
- Probability: Addition, Multiplicative, Independence, Definition of pmf, pdf
- Expected Values
- Pearson’s Correlation Coefficient, simple LR, and least squares
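Pearson's correlation and simple least-squares regression above can both be computed from first principles, without R packages. A sketch on a toy dataset where y is exactly 2x:

```python
import math

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 6.0, 8.0, 10.0]   # exactly y = 2x, so r should be 1

def pearson_r(xs, ys):
    """r = covariance / (sd_x * sd_y), using sums of deviations."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sx = math.sqrt(sum((a - mx) ** 2 for a in xs))
    sy = math.sqrt(sum((b - my) ** 2 for b in ys))
    return cov / (sx * sy)

def least_squares(xs, ys):
    """Return (slope, intercept) minimising squared residuals."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((a - mx) * (b - my) for a, b in zip(xs, ys))
             / sum((a - mx) ** 2 for a in xs))
    return slope, my - slope * mx

print(pearson_r(x, y))      # 1.0 (up to floating-point rounding)
print(least_squares(x, y))  # (2.0, 0.0)
```

In R the same results come from `cor(x, y)` and `lm(y ~ x)`.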
Machine Learning Techniques
- Types of ML algorithms, Prediction
- Types of Errors, Sensitivity, Specificity, Receiver Operating Characteristic (ROC); the caret package
- Bayes Theorem, Naïve Bayes, KNN
- Explanation of Classification trees, regression trees; packages: rpart, party
- Clustering – K-Means, Hierarchical, Dendrograms
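The K-Means clustering step above can be sketched in a few lines for the one-dimensional, k = 2 case: assign each point to the nearest centroid, move each centroid to the mean of its cluster, and repeat until the centroids stop moving.

```python
def kmeans_1d(points, centroids, max_iter=100):
    """Toy 1-D k-means; `centroids` is the list of initial guesses."""
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid's cluster.
        clusters = {c: [] for c in centroids}
        for p in points:
            nearest = min(centroids, key=lambda c: abs(p - c))
            clusters[nearest].append(p)
        # Update step: move each centroid to its cluster mean.
        new_centroids = [sum(pts) / len(pts) if pts else c
                         for c, pts in clusters.items()]
        if new_centroids == centroids:
            break                      # converged: assignments are stable
        centroids = new_centroids
    return sorted(centroids)

data = [1.0, 1.5, 0.5, 10.0, 10.5, 9.5]
print(kmeans_1d(data, [0.0, 5.0]))  # [1.0, 10.0]
```

In R the equivalent is `kmeans(data, centers = 2)`; hierarchical clustering and dendrograms instead build a full merge tree rather than fixing k in advance.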