Storm Persistence and Real-Time Analytics

A short while back I mentioned some of the nice work being done on Storm.  Brian Bulkowski, founder and CTO at Aerospike, has given two very nice presentations that highlight some of Storm’s advantages.  In them he goes over some of the architecture associated with Storm.  He points out that it is read/write optimized for flash and provides high performance.  Used in conjunction with Aerospike’s flash-optimized in-memory database, the combination provides a real performance boost.  He points to an actual customer analysis where 500,000 transactions/second were required.  Aerospike’s server requirements were dramatically lower (14 servers versus 186).  You can read more here:

storm105
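
To put those figures in perspective, here is a quick back-of-the-envelope calculation.  It is only a sketch: it assumes the 14-server figure is Aerospike’s, the 186-server figure is the alternative, and the load is spread evenly across servers, none of which is spelled out in detail above.

```python
# Back-of-the-envelope: average load each server must absorb.
# Assumptions (not from the talk): even load distribution, and
# 14 is the Aerospike server count while 186 is the alternative.
REQUIRED_TPS = 500_000


def per_server_tps(total_tps: int, servers: int) -> float:
    """Average transactions/second handled by each server."""
    return total_tps / servers


aerospike_servers = 14
alternative_servers = 186

aerospike_load = per_server_tps(REQUIRED_TPS, aerospike_servers)
alternative_load = per_server_tps(REQUIRED_TPS, alternative_servers)
server_ratio = alternative_servers / aerospike_servers

print(f"{aerospike_load:,.0f} TPS per server with {aerospike_servers} servers")
print(f"{alternative_load:,.0f} TPS per server with {alternative_servers} servers")
print(f"{server_ratio:.1f}x fewer servers")
```

Under those assumptions each of the 14 servers carries roughly thirteen times the load of each server in the larger cluster.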

A more up-to-date version of the talk was given recently, covering real-time analytics with Storm, Aerospike and Hadoop.  In this talk he covers Storm in more depth, explaining spouts (connections to data sources), bolts (the analysis and data-manipulation steps), Nimbus (a control entity that allows rebalancing in real time) and much more.  He also provides a nice example (Trending Words) and describes running and coding in Storm.  You can see more here:

The audio isn’t perfect, though.  Here is more on Storm; select to go to the page:

Incidentally, in one of the presentations he mentions using the Micron P320h to deliver exceptionally high performance.
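
For readers who have not seen the spout/bolt model, here is a minimal sketch of the idea behind a Trending Words example.  It is plain Python, not the Storm API: the function names (`sentence_spout`, `split_bolt`, `count_bolt`) and the sample feed are made up for illustration, and a real topology would run these stages in parallel across a cluster.

```python
from collections import Counter
from typing import Iterable, Iterator, List, Tuple


def sentence_spout(sentences: Iterable[str]) -> Iterator[str]:
    """Spout stand-in: emits one item (here, a sentence) at a time."""
    for sentence in sentences:
        yield sentence


def split_bolt(stream: Iterable[str]) -> Iterator[str]:
    """Bolt stand-in: splits each sentence into lowercase words."""
    for sentence in stream:
        for word in sentence.lower().split():
            yield word


def count_bolt(words: Iterable[str]) -> Counter:
    """Bolt stand-in: keeps a running count per word."""
    counts: Counter = Counter()
    for word in words:
        counts[word] += 1
    return counts


def trending(counts: Counter, top_n: int) -> List[Tuple[str, int]]:
    """Return the top_n most frequent words with their counts."""
    return counts.most_common(top_n)


# Hypothetical input feed; in Storm this would be a live data source.
feed = ["storm is fast", "storm scales", "flash is fast"]
top_words = trending(count_bolt(split_bolt(sentence_spout(feed))), top_n=3)
print(top_words)
```

The point is the shape of the pipeline: a source feeding a chain of small processing steps, each of which can be scaled independently.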

 

Recommended Use-Case : Snapdeal, India’s Largest Online Marketplace, Using Aerospike

Here is a nice use-case write-up on Snapdeal, India’s largest online marketplace, which uses the Aerospike NoSQL database to power its services.  Snapdeal looked at MongoDB, Couchbase, Redis, Terracotta BigMemory Max, Amazon’s DynamoDB and Aerospike.  They chose Aerospike for a number of reasons; the rationale and results are laid out in the write-up.  Among them: the in-memory Aerospike database maintained sub-millisecond latency on Amazon’s EC2 while managing 100 million objects, delivered predictable low latency with 95-99% of transactions completing within 10 milliseconds, required little maintenance, offered full replication across EC2 servers, and more.
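
A claim like “95-99% of transactions within 10 milliseconds” is a statement about the latency distribution rather than the average.  The sketch below shows one way to check such a claim against measured latencies; the sample numbers are invented for illustration and are not Snapdeal’s data.

```python
from typing import Sequence


def fraction_within(latencies_ms: Sequence[float], threshold_ms: float) -> float:
    """Share of requests that completed within threshold_ms."""
    if not latencies_ms:
        raise ValueError("need at least one latency sample")
    within = sum(1 for latency in latencies_ms if latency <= threshold_ms)
    return within / len(latencies_ms)


# Made-up sample of request latencies in milliseconds.
sample_latencies_ms = [0.6, 0.8, 0.7, 0.9, 1.2, 0.5, 3.0, 0.8, 12.5, 0.7]

share_within_10_ms = fraction_within(sample_latencies_ms, 10.0)
share_within_1_ms = fraction_within(sample_latencies_ms, 1.0)

print(f"{share_within_10_ms:.0%} within 10 ms")
print(f"{share_within_1_ms:.0%} within 1 ms")
```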

snapdeal

Interesting read.  If you want to learn more about the architecture and choices made by Snapdeal, they are presenting on this topic on Wednesday, March 12th.

aerospikesnt02

 

 

Architecture : SSD-Based Solutions Show Advantages In the NoSQL DB Tier (Video)

Today we look at the NoSQL database tier.  Some of this is taken from notes from a work-in-progress, An Introduction to Using High-Performance Flash-Based Storage in Constructing High Volume Transaction Architectures – A Manager’s Guide to Selecting Flash Storage.  This is not a complete look at Big Data, but rather a partial look at some of the things Aerospike, one of the more interesting NoSQL databases, is doing.

Aerospike and the NoSQL Database Tier.  As an alternative or in addition to the relational database tier, there is a NoSQL database tier.  With the arrival in recent years of Big Data architectures, elements of a new architecture for dealing with both structured and unstructured data have arrived, and with them some databases, like Aerospike, offer an extremely high-performance solution in transaction-oriented environments.  This is quite a bit different from typical Hadoop implementations, as one of Aerospike’s real differentiators is that it was built as an in-memory database.  Traditionally, this tier has run on a number of spinning disks.  However, in the past few years, especially with the need for real-time information, there has been a move to SSDs and PCIe-based flash cards.  Aerospike’s NoSQL database provides a means to get those high-performance results.  It is built to run in memory or in flash, on relatively low-cost clustered hardware with lots of memory, flash storage, or both.  It supports ACID properties and, as a NoSQL database, is built around a key-value store.  What follows is a partial glimpse into such an architecture: if we look at an example in this tier, you can see various transactions occurring within applications, with Aerospike interacting with them.  It should be noted that in the app tier, applications use Aerospike’s Smart Client to communicate with the Aerospike cluster.
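
The idea of a client that knows where each key lives in a clustered key-value store can be sketched as follows.  This is a toy model and not the Aerospike client API: the class name `ToySmartClient`, the node names, the hash function and the partition count are all illustrative assumptions, and each node’s storage is just an in-process dictionary.

```python
import hashlib
from typing import Any, Dict, List, Optional

N_PARTITIONS = 4096  # illustrative partition count


class ToySmartClient:
    """Routes each key straight to the node that owns its partition."""

    def __init__(self, nodes: List[str]) -> None:
        self.nodes = list(nodes)
        # Partition map: partition id -> owning node (round-robin here).
        self.partition_map = {
            p: self.nodes[p % len(self.nodes)] for p in range(N_PARTITIONS)
        }
        # Stand-in for per-node storage; a real cluster holds this remotely.
        self.storage: Dict[str, Dict[str, Any]] = {n: {} for n in self.nodes}

    def partition_of(self, key: str) -> int:
        """Hash the key and map the digest onto a partition id."""
        digest = hashlib.sha1(key.encode("utf-8")).digest()
        return int.from_bytes(digest[:2], "little") % N_PARTITIONS

    def node_for(self, key: str) -> str:
        return self.partition_map[self.partition_of(key)]

    def put(self, key: str, value: Any) -> None:
        self.storage[self.node_for(key)][key] = value

    def get(self, key: str) -> Optional[Any]:
        return self.storage[self.node_for(key)].get(key)


client = ToySmartClient(["node-a", "node-b", "node-c"])
client.put("user:42", {"cart_items": 3})
print(client.node_for("user:42"), client.get("user:42"))
```

Because the client computes the owning node itself, a read or write takes a single hop to the right server rather than going through a proxy or coordinator.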

nosqlarch

Of course, the producing/consuming sources may vary dramatically: applications, web services, Hadoop clusters, mobile devices, weblogs, marketing data repositories and many more.  Aerospike is a best-of-breed NoSQL database.  You can see an example of a typical deployment (from the Aerospike presentation below):

aerospike100

And some of the Aerospike server deployments :

aerospike101

Aerospike supports ACID properties and a high-performance, clustered architecture.

aero_ssd

Of course, there are other databases, such as MongoDB, Cassandra and HBase, to name a few.  You may choose to use a NoSQL database over a relational database; it depends wholly on what you are doing.  The NoSQL database tier’s storage on these servers can use SSDs, flash PCIe cards and flash arrays.  Traditionally this tier has adopted a “shared-nothing” philosophy using traditional spinning disks, SSDs or flash PCIe cards.  Until recently, flash arrays seemed not only like overkill but also like a move against the grain of the “shared-nothing” philosophy.  SSDs and cards like Micron’s P320h offer excellent performance and a price/performance advantage over arrays.  As flash prices drop, flash arrays are becoming a consideration in this tier, and there are a number of recent deployments leveraging flash arrays for the NoSQL DB tier.  Recently, Aerospike tested Micron’s P320h (SLC SSD) PCIe card.  It “blew away the competition” according to the people doing the testing.  You can read more here:

aero_ssd

More information on the P320h :

tomshwp320h

It should be noted that there are two versions of this from Micron.  Micron also offers a 2.5″ flash PCIe form factor which is hot-swappable.  You can read more here:

digimicron

It should be noted that competitors are not standing still: Virident, Fusion-io and others have come out with, and are coming out with, new cards that are worth looking at.

To understand what Aerospike is doing it is worth watching this video :

Aerospike_video

If you want to learn more, it is worth visiting Aerospike’s BrightTalk site

brighttalk_aerospike

 


gotostorage

Go to more posts on storage and flash storage at http://digitalcld.com/cld/category/storage.


 

Recommended Viewing : The Netflix Cloud and Cassandra

Netflix is doing some amazing things.  If you have the service, you may know they are dependent on Amazon Web Services, but their cloud practices transcend that dependency.  Adrian Cockcroft has delivered some really excellent talks explaining how they do what they do.

There is also a nice talk on how they moved to Cassandra to do a lot of the heavy lifting.

He also gave another very nice presentation at the Cassandra conference C*2012 about running Cassandra on AWS.

Interestingly, in the two Cassandra talks he discusses the use of SSDs to improve Cassandra performance.  He talks about moving from 2 drives (1.7 TB) to 2 SSD volumes (2 TB) and shows results from a hard disk versus SSD comparison.  Netflix offers a number of Cassandra-related projects as open source, such as Priam (for Cassandra automation), Astyanax (a client front-end into Cassandra) and more (like Aegisthus, Zeno, Chaos Monkey, Zuul, Pytheas, etc.).  Note that AppDynamics is used throughout these presentations.  One other project I’m aware of is STAASH, a non-JVM way of getting to the recipes in Astyanax.  You can follow all of this on the Netflix technical blog.

 

Also a post that may be of interest : Some Thoughts on Why We Want To Run Databases on Flash

 

Recommended Viewing : Hazelcast Intro, Management Center and an Example

If you have been following Hazelcast, you know it is one of the most interesting startups.  If you haven’t, and you are working on big data applications or looking at in-memory grid solutions, you should look at Hazelcast.

haz03

Hazelcast has a Community Edition, an Enterprise Edition and a Management Center.  Hazelcast is an open source clustering and highly scalable data distribution platform. Hazelcast allows you to easily share and partition your application data across your cluster. Hazelcast is a peer-to-peer solution (there is no master node, every node is a peer) so there is no single point of failure.

 

Hazelcast Enterprise Edition (EE) is an extension to Community Edition. It contains extra features such as Elastic Memory and Security. Elastic memory helps businesses on storing large amounts of data with high throughput. The off-heap technology used in enterprise version, resolves the performance problems experienced handling terabytes of data. With Enterprise Edition, big data will not be a big challenge.

– from the Hazelcast web site.
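
The “no master node, no single point of failure” idea can be illustrated with a toy partitioned map in which every entry has a primary owner and a backup on a different member.  This is a plain-Python sketch, not Hazelcast’s API or its actual partitioning scheme; the class name `ToyGrid`, the member names and the partition count are assumptions made for the example.

```python
import zlib
from typing import Any, Dict, Iterable, List, Optional, Tuple


class ToyGrid:
    """Peer-to-peer style map: each key has a primary and a backup owner."""

    def __init__(self, members: List[str], partitions: int = 8) -> None:
        if len(members) < 2:
            raise ValueError("need at least two members to hold a backup")
        self.members = list(members)
        self.partitions = partitions
        self.data: Dict[str, Dict[str, Any]] = {m: {} for m in self.members}

    def owners(self, key: str) -> Tuple[str, str]:
        """Return (primary, backup); they are always different members."""
        pid = zlib.crc32(key.encode("utf-8")) % self.partitions
        primary = self.members[pid % len(self.members)]
        backup = self.members[(pid + 1) % len(self.members)]
        return primary, backup

    def put(self, key: str, value: Any) -> None:
        # Write to both owners so that one copy survives a member failure.
        for member in self.owners(key):
            self.data[member][key] = value

    def get(self, key: str, failed: Iterable[str] = ()) -> Optional[Any]:
        """Read from the first owner that is not in the failed set."""
        down = set(failed)
        for member in self.owners(key):
            if member not in down and key in self.data[member]:
                return self.data[member][key]
        return None


grid = ToyGrid(["member-1", "member-2", "member-3"])
grid.put("session:abc", "logged-in")
primary, backup = grid.owners("session:abc")
# Simulate losing the primary owner: the backup still serves the entry.
print(grid.get("session:abc", failed=[primary]))
```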

JVMs running Hazelcast will dynamically cluster.  Miko Matsumura has two nice videos on Hazelcast in-memory grid technologies.  In the first one he gives a quick intro to Hazelcast:

In the second video he shows how to set up the Hazelcast Management Center and walks you through it:

Finally, there is a nice blog post on how to get started with Hazelcast that walks you through a simple example.

haz01

There is also a nice presentation (PDF) from Team High Calibre at San Jose State University:

haz02

 

Aerospike Benchmark : Micron PCIe P320h & P420hm Flash Cards “blow away the competition”

If you are using Aerospike’s database (which is a really nice NoSQL DB), you should be evaluating Micron’s P320h and P420hm PCIe cards.  Actually, I’ll go further: if you are looking at PCIe cards in general, you should be evaluating Micron.  If you don’t know what Aerospike is, it is a NoSQL database optimized for flash, and it is in wide use by a class of customers that need extreme performance and low latency.

aerospike_db

Brian Bulkowski, CTO at Aerospike, was effusive in writing about Aerospike’s test results in The SSD Journal.  Specifically, he states in his review of the tests: “Micron PCIe devices blow away the competition, and our customers who use them are very pleased.”  This is high praise coming from a company that has tested a large number of flash devices and even some flash arrays.  In the tests, Aerospike ran the Aerospike Certification Tool (ACT) and the results were impressive.

aerospike_ACT

The numbers are nothing short of stunning.  The Micron P320h (SLC) was able to “sustain 99.8% of requests under 1 millisecond while at the same time satisfying 150,000 read operations per second and 225 megabytes per second of writes hour after hour.”  How has Micron been able to do it?  You can read more in the review.

aerospike_micron
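
To get a feel for what the quoted numbers mean in practice, here is a small worked calculation.  It is a sketch under two assumptions that the quote does not state: that the 99.8% figure applies to the 150,000 reads per second, and that megabytes are decimal (1 GB = 1,000 MB).

```python
# Figures taken from the quoted ACT result.
READS_PER_SECOND = 150_000
WRITE_MB_PER_SECOND = 225
FRACTION_UNDER_1_MS = 0.998

# Assumption: the 99.8% applies to the read operations.
slow_reads_per_second = round(READS_PER_SECOND * (1 - FRACTION_UNDER_1_MS))

# Sustained write volume over one hour, in MB and (decimal) GB.
mb_written_per_hour = WRITE_MB_PER_SECOND * 3600
gb_written_per_hour = mb_written_per_hour / 1000

print(f"{slow_reads_per_second} reads/second take 1 ms or longer")
print(f"{gb_written_per_hour:.0f} GB written per hour")
```

In other words, only a few hundred reads each second miss the 1 millisecond mark while the card absorbs several hundred gigabytes of writes every hour.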

Recommended Reading : Enabling Database Replication in the Cloud Using Jelastic

Replication is a key technology for any database server.  Without it, downtime or significant data loss can occur, which can incur large revenue losses.  By replicating data from a master to one or more standby servers, one can greatly reduce the risk of data loss.  Here are three interesting articles.  The first shows how to set up replication with MariaDB in the Jelastic cloud:

jelastic00

The second shows how to enable PostgreSQL replication in a Jelastic cloud:

jelastic01

The third shows how to enable MongoDB replication:

jelastic02
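
The master-to-standby idea can be sketched in a few lines.  This is a toy model only: the `Master` and `Standby` classes are hypothetical, replication here is synchronous and in-process, and it is not how MariaDB, PostgreSQL or MongoDB implement replication internally.

```python
from typing import Any, Dict, List, Tuple

LogEntry = Tuple[str, Any]


class Standby:
    """Applies the master's change log to keep its own copy of the data."""

    def __init__(self) -> None:
        self.data: Dict[str, Any] = {}
        self.applied = 0  # number of log entries applied so far

    def apply(self, log: List[LogEntry]) -> None:
        # Replay only the entries this standby has not yet seen.
        for key, value in log[self.applied:]:
            self.data[key] = value
        self.applied = len(log)


class Master:
    """Accepts writes, records them in a log and ships the log to standbys."""

    def __init__(self, standbys: List[Standby]) -> None:
        self.data: Dict[str, Any] = {}
        self.log: List[LogEntry] = []
        self.standbys = list(standbys)

    def write(self, key: str, value: Any) -> None:
        self.data[key] = value
        self.log.append((key, value))
        for standby in self.standbys:
            standby.apply(self.log)


standby = Standby()
master = Master([standby])
master.write("order:1", "paid")
master.write("order:2", "shipped")
master.write("order:1", "refunded")

# If the master is lost, the standby still holds the same data.
print(standby.data)
```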

Hadoop in the Cloud

Increasingly, Big Data processing is moving to the cloud, with Amazon, Joyent and Microsoft to name three providers.

Amazon offers a way of attacking ‘big data’ from its cloud.  Amazon offers “Amazon Elastic MapReduce (Amazon EMR) which is a web service that enables businesses, researchers, data analysts, and developers to easily and cost-effectively process vast amounts of data. It utilizes a hosted Hadoop framework running on the web-scale infrastructure of Amazon Elastic Compute Cloud (Amazon EC2) and Amazon Simple Storage Service (Amazon S3).”  To learn more:

aws02
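
For anyone new to the MapReduce model that Hadoop implements, here is a minimal single-process sketch of its three phases using word counting.  It is plain Python for illustration and does not use Hadoop or the Amazon EMR API; the function names and the sample documents are made up.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def map_phase(document: str) -> List[Tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in a document."""
    return [(word, 1) for word in document.lower().split()]


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all emitted values by key."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups: Dict[str, List[int]]) -> Dict[str, int]:
    """Reduce: combine the values for each key into a single result."""
    return {key: sum(values) for key, values in groups.items()}


documents = ["big data in the cloud", "big data"]
mapped = [pair for doc in documents for pair in map_phase(doc)]
word_counts = reduce_phase(shuffle(mapped))
print(word_counts)
```

A hosted service runs the map and reduce steps on many machines at once; the structure of the job stays the same.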

Meanwhile, Joyent has created Manta, a ZFS-based object store.  Manta offers strongly consistent writes with highly available reads, with no object size limits and per-object replication policies.  This service can be coupled to MapReduce frameworks.  This coupling happens locally, unlike Amazon’s approach.  You can learn more about this coupling here:

joyent02

Microsoft has also released HDInsight, which is Azure’s Hadoop-based service.

hdinsight

Ericsson’s Geoff Hollingsworth provided an example of their use of Manta :