Before going into the details of Availability v/s Partition tolerance, let's recap what C, A, and P stand for in the CAP theorem.
Consistency – all clients of a data store get responses to requests that ‘make sense’. For example, if Client A writes 1 and then 2 to location X, Client B cannot read 2 followed by 1. In other words, all clients see the most recent copy of the data.
Availability – all operations on a data store eventually return successfully. That is, the data store remains available for read/write operations.
Partition tolerance – the system continues to work correctly even if the network stops delivering messages between two sets of servers.
Availability v/s Partition tolerance
Let’s consider the case of a single resource and three nodes interested in that resource when a network partition occurs, according to the following diagram –
Quorum is said to be achieved when at least (N+1)/2 nodes agree, i.e. a majority is achieved (for N = 3, that is 2 nodes).
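As a minimal sketch of this majority rule (the class and method names below are illustrative, not from any particular database):

```java
public class QuorumCheck {
    // For a cluster of n nodes, a quorum is the smallest majority: floor(n/2) + 1,
    // which equals (n + 1) / 2 in integer arithmetic for odd n.
    static int quorumSize(int n) {
        return n / 2 + 1;
    }

    static boolean hasQuorum(int reachableNodes, int totalNodes) {
        return reachableNodes >= quorumSize(totalNodes);
    }

    public static void main(String[] args) {
        // In the three-node example above, the partition splits the nodes 2 vs 1:
        System.out.println(hasQuorum(2, 3)); // the two-node side keeps quorum
        System.out.println(hasQuorum(1, 3)); // the isolated node does not
    }
}
```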
In an available but not partition-tolerant system, Y would be allowed to process its request because it can reach X to obtain the lock. Z’s request would be blocked because X is unreachable.
In a partition-tolerant but not available system, Z would be allowed to process its request because it is part of the quorum group (X’s lock will be broken). Y’s request would be blocked because it is not part of the quorum group.
In a system that is both available and partition-tolerant, both requests would be allowed to progress. Y would return the current data as possibly modified by X, while Z would return possibly stale data. Stale data could even mean no data, in cases where no replica is available among the quorum nodes. Consistency is obviously sacrificed in this case.
Around a decade ago, Hadoop became the de facto standard for batch processing unstructured data in the form of MapReduce jobs. Over the years, people developed layers of other services/tools on top of it: Oozie for workflow management, HBase to support structured data, Hive to query HDFS data, and so on. Hadoop was groundbreaking at its introduction, but by today’s standards it’s actually pretty slow and inefficient. It has several shortcomings:
Everything gets written to disk, including all the interim steps.
In many cases we need a chain of jobs to perform an analysis, which makes the point above even worse.
Writing MapReduce code is cumbersome, because the API is rudimentary, hard to test, and easy to screw up. Tools like Pig and Hive make this easier, but they require separate configuration (another tedious job).
It requires lots of code to perform even the simplest of tasks, so the amount of boilerplate is huge.
It doesn’t do anything out of the box. There’s a good bit of configuration and far too many processes to run just to get a simple single-node installation working.
Spark offers a powerful alternative to Hadoop and all these add-on services in the form of Spark Core, Spark SQL, MLlib, Spark Streaming, and GraphX.
Spark moves data around with an abstraction called Resilient Distributed Datasets (RDDs), which are pulled into memory from data stores such as HDFS or NoSQL databases like Cassandra. RDDs allow for easy parallel processing of data because of their distributed storage. Spark can run multiple steps in memory, which is much more efficient than dumping intermediate steps to a distributed file system as Hadoop does.
Another noticeable difference Spark has made to the development life cycle is ease of programming. It offers a simple programming API with powerful idioms for common data processing tasks that require less coding effort than Hadoop. Even for the most basic processing tasks, Hadoop requires several Java classes and repetitive boilerplate code to carry out each step. In Spark, developers simply chain functions to filter, sort, and transform the data, abstracting away many of the low-level details and allowing the underlying infrastructure to optimize behind the scenes. Spark’s abstractions reduce code size as compared to Hadoop, resulting in shorter development times and more maintainable codebases.
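As a stdlib-only sketch of that chaining style (the class, method, and sample data below are illustrative, not Spark code; Java 8 streams use the same filter/sort/transform idiom on local collections that the RDD API offers on distributed data):

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class ChainingExample {
    // Chain filter -> sort -> transform in a single pipeline,
    // mirroring the style of a Spark RDD transformation chain.
    static List<String> topWords(List<String> words) {
        return words.stream()
                .filter(w -> w.length() > 3)   // keep only longer words
                .sorted()                       // sort alphabetically
                .map(String::toUpperCase)       // transform each element
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(topWords(Arrays.asList("spark", "is", "fast", "and", "easy")));
        // -> [EASY, FAST, SPARK]
    }
}
```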
Spark itself is written in the Scala language, and applications for Spark are often written in Scala as well. Scala’s functional programming style is ideal for invoking Spark’s data processing idioms to build sophisticated processing flows from basic building blocks, hiding the complexity of processing huge data sets.
However, only a few developers know Scala; Java skills are much more widely available in the industry. Fortunately, Java applications can access Spark seamlessly. But this is still not ideal, as Java is not a functional programming language, and invoking Spark’s programming model without functional programming constructs requires lots of boilerplate code: not as much as Hadoop, but still too much meaningless code that reduces readability and maintainability.
Fortunately, Java 8 supports the functional style more cleanly and directly with a new addition to the language: “lambdas” which concisely capture units of functionality that are then executed by the Spark engine as needed. This closes most of the gap between Java and Scala for developing applications on Spark.
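To see concretely what lambdas buy us, compare a pre-Java-8 anonymous inner class with its lambda equivalent (a generic illustration using java.util.function, not code from Spark or the article’s repository):

```java
import java.util.function.Function;

public class LambdaVsAnonymous {
    // Pre-Java-8: an anonymous inner class just to express one transformation.
    static final Function<String, Integer> LEN_OLD = new Function<String, Integer>() {
        @Override
        public Integer apply(String s) {
            return s.length();
        }
    };

    // Java 8: the same unit of functionality captured concisely as a lambda.
    static final Function<String, Integer> LEN_NEW = s -> s.length();

    public static void main(String[] args) {
        System.out.println(LEN_OLD.apply("spark")); // 5
        System.out.println(LEN_NEW.apply("spark")); // 5
    }
}
```

It is exactly this kind of small, self-contained unit of functionality that the Spark engine ships to worker nodes and executes as needed.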
Spark and Cassandra
Spark can be used to run distributed jobs that write raw data to Cassandra and generate materialized views, which are cached in memory. These materialized views can then be queried by subsequent jobs. Spark can also run distributed jobs that read data from Cassandra, aggregate it, and store the aggregated data back in Cassandra for subsequent jobs.
Spark is fast enough that these jobs, running on materialized views or aggregated data in Cassandra, can be used for interactive queries.
Building a Spark-Cassandra application using Java 8
Here we are dealing with some dummy trade data. In any investment bank there is a concept of books/ledgers for booking trades. So consider that we have some trades with unique trade identifiers (utid) under two books with bookIds 234234 and 334235. We receive different sensitivities (risktypeids) against these utids, each with a certain amount.
Now we want to store this data in Cassandra so that later on we can query it based upon (businessdate, bookid, utid, risktypeid) to get data at the trade level. But we also need data at the book level, i.e. aggregated data that can be queried based upon (businessdate, bookid, risktypeid). At the book level, the above data will look something like below –
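As a rough, stdlib-only sketch of that book-level roll-up (in the real application this aggregation runs as a distributed Spark job against Cassandra; the Trade class, field names, and sample amounts here are illustrative):

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class BookLevelAggregation {
    // Illustrative trade record; in the real application these rows live in Cassandra.
    static class Trade {
        final long bookId; final String utid; final int riskTypeId; final double amount;
        Trade(long bookId, String utid, int riskTypeId, double amount) {
            this.bookId = bookId; this.utid = utid;
            this.riskTypeId = riskTypeId; this.amount = amount;
        }
    }

    // Roll trade-level amounts up to (bookId, riskTypeId): the book-level view.
    static Map<String, Double> toBookLevel(List<Trade> trades) {
        return trades.stream().collect(Collectors.groupingBy(
                t -> t.bookId + "/" + t.riskTypeId,
                Collectors.summingDouble(t -> t.amount)));
    }

    public static void main(String[] args) {
        List<Trade> trades = Arrays.asList(
                new Trade(234234L, "utid1", 1, 10.0),
                new Trade(234234L, "utid2", 1, 5.0),
                new Trade(334235L, "utid3", 1, 7.0));
        // Amounts are summed per book and risk type, dropping the utid dimension.
        System.out.println(toBookLevel(trades));
    }
}
```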
You may refer to the complete code at the following GitHub repository:
All the dependencies, such as the Spark jars and the Spark-Cassandra connector jars, have already been included in the pom, so we just need to run the above code in our Eclipse IDE. The only thing we need to take care of is that the Cassandra server is up and running. To start a single-node Cassandra cluster, you may download the binaries from planetcassandra.org.
To start the Cassandra node in the foreground, run the command below from the bin directory of the downloaded/installed Cassandra binaries –
cassandra -f -p "<path to PID file>"
Once your Cassandra node is up and running, start cqlsh in another command-prompt window. Using cqlsh you can query your data in Cassandra directly.
Once you are done with the setup, you can run the Java program SparkCassandraApp in your Eclipse IDE to insert data into Cassandra, and then you can also query the data from cqlsh.
An aggregation pipeline is a series of transformations applied to documents to perform some aggregation tasks and output a cursor or a collection. There can be any number of transformation stages, where the output of the first is fed into the second, the second into the third, and so on.
Below are the basic pipeline operators which we will use to perform some aggregation tasks over the data we created earlier.
$match
This is similar to SQL’s WHERE clause: it filters the data that is passed on to the next stage. For example, if we want to perform some aggregation only over data that belongs to “Urban” areas, the $match operator can be used to select just that data.
$unwind
This is used to expand a document that contains data in the form of arrays. When the $unwind operator is applied to an array field, it generates a new document for each element of that array.
$group
After flattening our data we can now easily group it using $group, which is similar to SQL’s GROUP BY clause. For example, we can group our data based upon the usage_type of each building.
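These three stages map naturally onto Java 8 stream operations: filter for $match, flatMap for $unwind, and groupingBy for $group. The following stdlib-only analogy (the Building class and its fields are illustrative stand-ins for the dataset, not the article’s actual schema) shows the pipeline shape:

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class PipelineAnalogy {
    // Illustrative document: a building with an area type, a usage type,
    // and an array of meter readings.
    static class Building {
        final String areaType; final String usageType; final List<Integer> readings;
        Building(String areaType, String usageType, List<Integer> readings) {
            this.areaType = areaType; this.usageType = usageType; this.readings = readings;
        }
    }

    static Map<String, Long> pipeline(List<Building> docs) {
        return docs.stream()
                .filter(b -> b.areaType.equals("Urban"))   // $match: keep Urban only
                .flatMap(b -> b.readings.stream()           // $unwind: one record per array element
                        .map(r -> b.usageType))
                .collect(Collectors.groupingBy(             // $group: count per usage_type
                        usage -> usage, Collectors.counting()));
    }

    public static void main(String[] args) {
        List<Building> docs = Arrays.asList(
                new Building("Urban", "Residential", Arrays.asList(1, 2)),
                new Building("Urban", "Commercial", Arrays.asList(3)),
                new Building("Rural", "Residential", Arrays.asList(4)));
        // Readings counted per usage_type across Urban buildings only.
        System.out.println(pipeline(docs));
    }
}
```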