Friday, February 8, 2013

Massively Parallel Processing – Different Perspectives, One Objective


While exploring the latest data warehousing technologies and concepts, I found that Amazon has also jumped into the Massively Parallel Processing (MPP) battle. That prompted me to write a brief note on some of the vendors offering such solutions. Data has been growing at a fast pace, and older database management systems need to upgrade their technologies; this huge amount of data must be processed quickly to extract the value hidden within it. Massively parallel processing (MPP) is an architecture that allows this new class of warehouse to split large data analytics jobs into smaller, more manageable chunks, which are then scattered across multiple processors. Put simply, MPP is a type of computing that uses many separate CPUs running in parallel to execute a single program; systems with hundreds or thousands of such processors are described as massively parallel. Let's take a brief look at what some of the vendors are offering:
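To make the scatter/gather idea concrete, here is a minimal Python sketch, purely illustrative: the data, the worker count, and the partial-sum job are all invented for the example. It splits one aggregation into chunks, hands each chunk to a separate process, and combines the partial results, which is the essence of what an MPP warehouse does at much larger scale:

```python
from multiprocessing import Pool

def partial_sum(chunk):
    # Each "node" works on its own slice of the data independently.
    return sum(chunk)

if __name__ == "__main__":
    data = list(range(1000000))  # stand-in for a large fact-table column
    workers = 4                  # stand-in for MPP nodes

    # Scatter: split the job into smaller, more manageable chunks.
    size = len(data) // workers
    chunks = [data[i * size:(i + 1) * size] for i in range(workers)]
    chunks[-1].extend(data[workers * size:])  # remainder goes to the last chunk

    # Each chunk is processed by a separate process in parallel.
    with Pool(workers) as pool:
        partials = pool.map(partial_sum, chunks)

    # Gather: combine the partial results into the final answer.
    print(sum(partials))  # same result as sum(data), computed in parallel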
Amazon
Amazon Redshift promises dramatically faster query performance when analyzing datasets ranging from hundreds of gigabytes to a petabyte or more, using the same SQL-based business intelligence tools customers use today. Redshift has a massively parallel processing (MPP) architecture, which lets it distribute and parallelize queries across multiple low-cost nodes. The nodes themselves are designed specifically for data warehousing workloads: they contain large amounts of locally attached storage on multiple spindles and are connected by a minimally oversubscribed 10 Gigabit Ethernet network. Redshift runs the ParAccel PADB high-performance columnar, compressed DBMS, scaling to 100 8XL nodes, or 1.6 PB of compressed data. XL nodes have 2 virtual cores and 15 GB of memory, while 8XL nodes have 16 virtual cores and 120 GB of memory and operate on 10 Gigabit Ethernet. In Amazon's case, though, it is chiefly the cost that has drawn attention. Using the AWS Management Console, customers can launch a Redshift cluster, starting at gigabytes and scaling to more than a petabyte, for less than $1,000 per terabyte per year. That is cheap in data warehousing terms compared with the roughly $25,000 per terabyte per year that companies are used to paying for on-premises deployments. But cost can never be the only criterion, because alongside the benefits the offering may turn out badly for you: the data sits outside the corporate firewall and, in some ways, outside your control; there are bandwidth and security costs; and migrating to Redshift could also pull your applications into other parts of the AWS ecosystem. These are just my assumptions based on what I have gathered from various articles; let's wait for practical hands-on experience or a technical review for more clarity.
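Since Redshift speaks standard SQL over the PostgreSQL wire protocol, "using the same SQL-based tools" can be as plain as the following hedged Python sketch with the psycopg2 driver. The cluster endpoint, database name, credentials, and sales table are all placeholders, not a real deployment:

```python
import psycopg2

# All connection details below are placeholders, not a real cluster.
conn = psycopg2.connect(
    host="examplecluster.abc123xyz789.us-east-1.redshift.amazonaws.com",
    port=5439,               # Redshift's default port
    dbname="dev",
    user="masteruser",
    password="secret",
)

cur = conn.cursor()
# The query is distributed and parallelized across the cluster's
# nodes by Redshift itself; the client just sees plain SQL.
cur.execute("SELECT COUNT(*) FROM sales;")
print(cur.fetchone()[0])

cur.close()
conn.close()
```

The point is that nothing on the client side changes: any BI tool that can talk to PostgreSQL can, in principle, talk to Redshift.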
Microsoft
Microsoft has been working on an MPP data warehousing solution for a long time. It introduced an MPP architecture with SQL Server 2008 in the form of an appliance, the Microsoft Parallel Data Warehouse (PDW). The latest refresh brings a new data processing engine including PolyBase, a technology that can handle both relational and non-relational data. PolyBase is expected to run against Microsoft's version of Hadoop and could be something of a revolution in the data warehousing market. The newly announced in-memory computing engine, Hekaton, also looks set to make Microsoft a strong competitor in the DW market. Microsoft's all-time hit Office applications, especially Excel, integrate easily with these products, giving end users a familiar, easy-to-use interface for analytics. Since these solutions are not yet fully on the market, we will have to wait until they go live to get hands-on experience.
Greenplum
Like these vendors, Greenplum also offers MPP in its data warehousing solution, with the additional capability of automatic parallelization of data loading and queries. It uses a technology known as Scatter/Gather Streaming, which achieves loading speeds of around 10 terabytes per hour, per rack, with linear scalability. Data is automatically partitioned across all nodes of the system, and queries are planned and executed using all nodes working together in a highly coordinated fashion, as the sketch below illustrates.
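As a rough illustration of how this kind of distribution works, here is a small Python sketch; the rows, the customer_id distribution key, and the four-segment setup are invented for the example. Each row is hashed to exactly one node, and a parallel aggregate is then the combination of per-node results:

```python
# Hypothetical rows keyed by customer_id; all values are made up.
rows = [
    {"customer_id": 101, "amount": 25.0},
    {"customer_id": 202, "amount": 40.0},
    {"customer_id": 303, "amount": 15.0},
    {"customer_id": 404, "amount": 60.0},
]

segments = 4  # stand-in for parallel segment nodes
buckets = {i: [] for i in range(segments)}

# Scatter: each row lands on one segment, chosen by hashing its
# distribution key, so loading and querying spread across nodes.
for row in rows:
    buckets[hash(row["customer_id"]) % segments].append(row)

# Gather: each segment computes its local total, then the partial
# results are combined, mirroring a parallel aggregate.
print(sum(sum(r["amount"] for r in bucket)
          for bucket in buckets.values()))  # -> 140.0
```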
IBM Netezza
IBM, with its new Netezza appliance, has also hit the market aiming to revolutionize the DW space. Netezza's distinctive combination of Field Programmable Gate Arrays (FPGAs) and multi-core CPUs is claimed to deliver better-than-expected performance. The hardware operates concurrently on the data stream in a pipelined fashion, maximizing utilization and extracting the utmost throughput from each MPP node, delivering linear scalability to more than a thousand processing streams executing in parallel while offering a very economical total cost of ownership.
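To give a flavor of pipelined stream processing, here is a toy Python sketch; it only mimics the idea with generators, while Netezza's actual stages run in FPGA and CPU hardware. The stage names and records are invented for the example:

```python
# A toy pipeline: each stage lazily consumes the previous stage's
# output, so records flow through all stages concurrently instead of
# the whole data set being materialized between steps.

def scan(records):
    for r in records:          # stage 1: read the raw stream
        yield r

def restrict(stream):
    for r in stream:           # stage 2: filter rows early in the pipeline
        if r["amount"] > 20:
            yield r

def project(stream):
    for r in stream:           # stage 3: keep only the needed column
        yield r["amount"]

records = [{"amount": a} for a in (10, 30, 50, 5, 25)]
print(sum(project(restrict(scan(records)))))  # -> 105
```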
This blog is based entirely on my understanding of various articles and news items. The information here is just my personal opinion and does not reflect anyone else's view. The list goes on and on for the many data warehousing vendors competing in the market; I will explore other vendors and share my views on them later.
 