Submitted by sneha on Sun, 06/04/2017 - 18:59
Apache Spark is an open-source cluster-computing framework. Apart from Spark Core, we will also learn about its components such as Spark SQL, Spark Streaming, MLlib and GraphX. Though Spark can be used with Hadoop, it can also be used without it, so for most notes, knowledge of Hadoop is not required.
Submitted by heartin on Fri, 06/02/2017 - 07:35
Before going deep into Amazon Elasticsearch Service, we will start by learning how to work with the Elasticsearch distribution available from elastic.co on your local machine. Updated notes for cloud are available here.
Submitted by heartin on Sat, 02/04/2017 - 06:55
A relational database management system (RDBMS) is a database management system (DBMS) that is based on the relational model. Structured Query Language (SQL) is a special-purpose programming language designed for managing data held in an RDBMS.
Submitted by heartin on Wed, 02/01/2017 - 23:23
Get started learning about Hadoop and its ecosystem components through simple theory and hands-on exercises.
Submitted by heartin on Wed, 02/01/2017 - 21:27
Here I will include notes on Big Data and Data Science concepts in general. There will be separate books on specific technologies like Hadoop.
Submitted by heartin on Mon, 01/30/2017 - 03:35
Apache ZooKeeper is a software project of the Apache Software Foundation, providing an open source distributed configuration service, synchronization service, and naming registry for large distributed systems.
Submitted by heartin on Sun, 01/29/2017 - 02:30
Apache Kafka is an open source publish-subscribe based distributed messaging system. From the architecture perspective, Kafka is closer to traditional messaging systems such as ActiveMQ or RabbitMQ. However, from a Big Data and Hadoop perspective, Kafka can be compared with Scribe or Flume, as it is useful for processing activity stream data.
Submitted by heartin on Fri, 12/30/2016 - 23:19
Data warehousing and data mining were popular keywords before the rise of Big Data technologies. As we study all kinds of data and data technologies, we will also have a book for data warehousing and data mining.
Submitted by heartin on Sun, 11/15/2015 - 17:54
We will discuss the basics of virtualization of computer platforms here. In computing, virtualization refers to the act of creating a virtual (rather than actual) version of something, including virtual computer hardware platforms, operating systems, storage devices, and computer network resources.