Wednesday, February 13, 2013

Oracle Database 12c – Almost ready to compete in DB market


At the Oracle OpenWorld conference held in October 2012, Oracle announced its new database platform, Oracle Database 12c. Oracle has worked hard to develop a competitive database solution, and recently there has been news that it will be launched within days. I am not sure whether Oracle will actually release it that soon, since none of its press releases mention a launch date, so it may just be a rumor.
While exploring Oracle 12c further, I found that it includes various advanced features that should stand up well in this competitive database market. I will try to cover as many of them as I am aware of in this post, and will continue writing about new Oracle 12c features in future posts. The information here is just my understanding gathered from various articles, news items and press releases.
The Oracle 12c feature that has generated the most hype in the market is the pluggable database. It is essentially a new concept that splits the database into two distinct entities: the container database, which holds all the functionality and metadata required to run the database, and the user's database, which is independent of the container. As the name suggests, it lets organizations unplug a database from one container database and plug it into another, which will be very helpful for migrations and in high-availability scenarios. More detailed information about pluggable databases can be found here.
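To make the plug/unplug idea concrete, here is a minimal, hypothetical sketch in Python using the cx_Oracle driver. The connection details, PDB name and manifest path are my own placeholders, and since 12c had not shipped at the time of writing, the SQL should be read as an illustration of the unplug/plug flow described in Oracle's previews rather than as confirmed final syntax.

```python
import cx_Oracle  # assumes the cx_Oracle driver and access to a 12c container database

# Placeholder credentials and names -- not taken from any Oracle documentation.
conn = cx_Oracle.connect("sys", "password", "cdb-host/CDB1", mode=cx_Oracle.SYSDBA)
cur = conn.cursor()

# Unplug: close the pluggable database and write its metadata to an XML manifest.
cur.execute("ALTER PLUGGABLE DATABASE sales_pdb CLOSE IMMEDIATE")
cur.execute("ALTER PLUGGABLE DATABASE sales_pdb UNPLUG INTO '/u01/manifests/sales_pdb.xml'")

# Plug in: a container database consumes the manifest and adopts the PDB.
# In practice these two statements would run against the *target* container's root.
cur.execute("CREATE PLUGGABLE DATABASE sales_pdb USING '/u01/manifests/sales_pdb.xml' NOCOPY")
cur.execute("ALTER PLUGGABLE DATABASE sales_pdb OPEN")
```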
Another feature is the multithreaded database, which may have significant performance and scalability implications. It allows multiple threads to exist within a single process and execute independently while sharing the process's state and resources. This results in higher performance, because context switching between threads of the same process is faster than context switching between separate processes.
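This is not Oracle's internal implementation, just a tiny Python illustration of the general point: threads live inside one process and share its memory directly, so they can see each other's work without any inter-process plumbing or copying.

```python
import threading

# All threads run inside one process and append to the same list object directly --
# no pipes, sockets or serialization, as separate processes would need.
shared_results = []
lock = threading.Lock()

def worker(name):
    with lock:  # coordinate access to the shared, in-process state
        shared_results.append(f"{name} saw {len(shared_results)} earlier results")

threads = [threading.Thread(target=worker, args=(f"thread-{i}",)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(shared_results)
```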
Next on the list are duplicate indexes. In older versions of Oracle Database, trying to create an index on the same columns, in the same order, as an existing index raised an error, but with Oracle 12c one can keep two different types of index on the same data. Other enhancements I have heard about relate to security, MapReduce inside the database, 32K VARCHAR2 support, and more.
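As a hedged sketch of what that might look like (the table, index names and connection string are made up), the second statement below would fail on older releases but, as I understand the 12c previews, can coexist with the first when the indexes differ in type and only one is visible to the optimizer:

```python
import cx_Oracle  # hypothetical connection; all names below are placeholders

conn = cx_Oracle.connect("app_user", "app_password", "db-host/PDB1")
cur = conn.cursor()

# A regular B-tree index on (last_name, first_name).
cur.execute("CREATE INDEX emp_name_ix ON employees (last_name, first_name)")

# Before 12c, a second index on the same column list raised ORA-01408
# ("such column list already indexed"); here it is a different index type
# and is created INVISIBLE, so only one of the two is used at a time.
cur.execute("CREATE BITMAP INDEX emp_name_bix ON employees (last_name, first_name) INVISIBLE")
```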
Another recent improvement to Oracle as a platform is the Oracle Private Cloud, which gives users the advantages of a public-cloud service while running on Exadata and/or Exalogic servers that reside in the customer's data center. The Oracle Database Cloud is sold on a monthly subscription, on a per-user-per-month or per-environment-per-month basis.
Oracle Application Express (Oracle APEX), a fully supported rapid web application development tool for the Oracle database, is available at no cost with all editions of the database. It allows users to develop and deploy professional applications that are both fast and secure using only a web browser.


Friday, February 8, 2013

NoSQL Database + MongoDB + Windows Azure

Next-generation databases will focus largely on handling unstructured data, i.e. data that cannot be stored in relational databases. For storing such data there is another database technology known as NoSQL, which covers databases that are non-relational, distributed, open source, and horizontally scalable.
NoSQL is an open-source technology which, according to Techopedia, can be generalized and defined as:


NoSQL is a type of database that does not adhere to the widely used relational database management model. In other words, NoSQL databases are not primarily built on tables, and unlike a RDBMS, they do not use SQL to manipulate data - hence the name. NoSQL was created as a support for SQL, not as its replacement. It is based on a model that is less stringent and does not essentially follow a fixed schema. It also may not stick to the ACID properties, and there is no concept like JOIN, unlike in most of the RDBMSs.

NoSQL databases are cheaper, more flexible, and require less management, which is why they are becoming a prominent alternative model for many database administrators. NoSQL's popularity is growing, and it is likely to remain an increasingly important tool and skill for DBAs. Technically speaking, NoSQL can be divided into four subcategories: key-value stores, document stores, wide column stores, and graph databases.
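To show roughly what those four shapes look like without tying the example to any particular product, here are some toy Python structures, one per subcategory (all names and values invented):

```python
# Key-value store: an opaque value addressed only by its key.
kv_store = {"user:42": b'{"name": "Asha", "plan": "pro"}'}

# Document store: the value itself is a structured, schema-less document.
doc_store = {"users": [{"_id": 42, "name": "Asha", "tags": ["pro", "beta"]}]}

# Wide column store: each row keeps its own sparse set of column families/columns.
wide_column_store = {"user:42": {"profile:name": "Asha", "billing:plan": "pro"}}

# Graph database: nodes plus explicit, typed relationships between them.
graph_db = {"nodes": {42: "Asha", 43: "Ravi"}, "edges": [(42, "FOLLOWS", 43)]}
```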
NoSQL technology brings various benefits: it is lightweight and low-friction, friendly to web developers, supported across most platforms, and usable across devices. Alongside these benefits there are drawbacks as well, such as largely manual scaling, lack of full ACID support, and access that is procedural rather than declarative.
On the Windows Azure platform there are multiple NoSQL options, such as Azure Table Storage, XML columns, OData, and running NoSQL database products such as MongoDB on Azure worker roles, VM roles, and Azure Drive.
MongoDB, one such NoSQL database, is a scalable, high-performance, open-source system that stores data as BSON (binary JSON) documents with dynamic schemas, instead of storing data in tables as a "classical" relational database does. MongoDB queries are themselves expressed as JSON-style objects, so JavaScript (or a driver in another language) takes the place of traditional SQL-based CRUD operations.
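As a small illustration (a sketch using the pymongo driver against a local server; the database, collection and field names are my own), documents and queries are both just dictionaries:

```python
from pymongo import MongoClient  # assumes a MongoDB server reachable on localhost

client = MongoClient("mongodb://localhost:27017")
db = client.blogdemo      # databases and collections are created lazily on first use
posts = db.posts

# Documents are plain dictionaries (stored as BSON); there is no fixed schema,
# so different documents can carry different fields.
posts.insert_one({"title": "NoSQL on Azure", "tags": ["mongodb", "azure"]})
posts.insert_one({"title": "MPP roundup", "author": "me", "comments": 3})

# Queries are expressed as documents too, rather than as SQL strings.
for doc in posts.find({"tags": "mongodb"}):
    print(doc["title"])
```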
Talking specifically about MongoDB on the Microsoft platform, there is an option to run MongoDB in a Windows Azure configuration: the MongoDB servers run as Windows Azure worker roles and use Windows Azure storage containers for data storage.
While exploring MongoDB on the Microsoft platform, I found all the relevant information consolidated in an article on MSDN. For complete details about "MongoDB on Windows Azure for .NET Developers", please refer to the MSDN site here.

While researching MongoDB, I also found a great company that provides MongoDB training; 10gen itself also offers free online courses through its education portal.

Massively Parallel Processing – Different perspective, One objective


While exploring the latest data warehousing technologies and concepts, I found that Amazon has also jumped into the massively parallel processing (MPP) battle, so I thought I would write a brief overview of what some of the vendors are offering. Data has been growing at a fast pace, and older database management systems need to upgrade their technology: this huge amount of data has to be processed quickly to extract the value hidden in it. MPP is an architecture that allows this new class of warehouse to split large data analytics jobs into smaller, more manageable chunks, which are then scattered across multiple processors. Put simply, MPP is a style of computing that uses multiple separate CPUs running in parallel to execute a single program; systems with hundreds or thousands of such processors are known as massively parallel. The toy sketch below shrinks the scatter/gather idea down to a single machine; after that, let's take a brief look at what some of the vendors are offering:
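This sketch uses Python's multiprocessing module, with local worker processes standing in for warehouse nodes; the data, chunking scheme and aggregate are all invented for illustration.

```python
from multiprocessing import Pool

def partial_sum(chunk):
    # Each worker ("node") processes only the slice of data scattered to it.
    return sum(chunk)

if __name__ == "__main__":
    data = list(range(1_000_000))   # stand-in for a large table
    n_workers = 4

    # Scatter: split the job into smaller chunks, one per worker.
    chunks = [data[i::n_workers] for i in range(n_workers)]

    with Pool(n_workers) as pool:
        partials = pool.map(partial_sum, chunks)

    # Gather: combine the partial results into the final answer.
    print(sum(partials))
```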
Amazon: Amazon Redshift enables customers to obtain dramatically better query performance when analyzing datasets ranging in size from hundreds of gigabytes to a petabyte or more, using the same SQL-based business intelligence tools they use today. Redshift has a massively parallel processing architecture, which lets it distribute and parallelize queries across multiple low-cost nodes. The nodes themselves are designed specifically for data warehousing workloads: they contain large amounts of locally attached storage on multiple spindles and are connected by a minimally oversubscribed 10 Gigabit Ethernet network. Redshift runs the ParAccel PADB high-performance columnar, compressed DBMS, scaling to 100 8XL nodes, or 1.6 PB of compressed data. XL nodes have 2 virtual cores and 15 GB of memory, while 8XL nodes have 16 virtual cores and 120 GB of memory and operate on 10 Gigabit Ethernet. In Amazon's case, though, it is mainly the cost that has played the major role: using the AWS Management Console, customers can launch a Redshift cluster, starting with gigabytes and scaling to more than a petabyte, for less than $1,000 per terabyte per year. That counts as cheap in data warehousing terms compared to the roughly $25,000 per terabyte per year that companies are used to paying for on-premises deployments. Cost can never be the only consideration, though, because alongside the benefits the offering has downsides: the data sits outside the corporate firewall and in some ways beyond your control, there are bandwidth and security costs, and migrating to Redshift could also pull your applications into other parts of the AWS ecosystem. These are just my assumptions from what I have gathered from different articles; let's wait for practical hands-on experience or technical reviews for more clarity about the reality.
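One practical note: Redshift speaks the PostgreSQL wire protocol, so a launched cluster can be queried from standard SQL tools or drivers. The sketch below uses the psycopg2 driver with a made-up cluster endpoint, credentials and table:

```python
import psycopg2  # standard PostgreSQL driver; Redshift listens on port 5439 by default

# Endpoint, credentials and the "sales" table are placeholders for an
# already-launched cluster -- none of this comes from the article above.
conn = psycopg2.connect(
    host="examplecluster.abc123.us-east-1.redshift.amazonaws.com",
    port=5439,
    dbname="dev",
    user="admin",
    password="secret",
)
cur = conn.cursor()
cur.execute("SELECT region, SUM(revenue) FROM sales GROUP BY region ORDER BY 2 DESC")
for region, revenue in cur.fetchall():
    print(region, revenue)
```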
Microsoft: Microsoft has been working on an MPP data warehousing solution for a long time. It introduced an MPP architecture with SQL Server 2008 in the form of an appliance, the Microsoft Parallel Data Warehouse (PDW). Recently we have seen the latest refresh with a new data processing engine that includes PolyBase, a technology that can handle both relational data and non-relational data stores. PolyBase is expected to work with Microsoft's version of Hadoop and could be something of a revolution in the data warehousing market. The newly introduced in-memory computing engine, Hekaton, also looks set to make Microsoft a strong competitor in the DW market. Microsoft's all-time hit, Office, and especially Excel, can be integrated with these products to give end users a familiar, easy-to-use interface for analytics. Since these solutions are not fully on the market yet, we will have to wait until they go live to get hands-on experience.
Greenplum: Like these vendors, Greenplum also offers MPP in its data warehousing solution, with the additional capability of automatically parallelizing data loading and queries. It uses a technology known as Scatter/Gather Streaming, which achieves loading speeds of around 10 terabytes per hour per rack, with linear scalability. The data is fully partitioned across all nodes of the system, and queries are scheduled and executed with all nodes working together in a highly coordinated fashion.
IBM Netezza: IBM, with its Netezza appliance, has also hit the market with the aim of revolutionizing data warehousing. Netezza's combination of field-programmable gate arrays (FPGAs) and multi-core CPUs is claimed to deliver better-than-expected performance. It operates on the data stream concurrently, in a pipelined fashion, maximizing utilization and extracting the utmost throughput from each MPP node, delivering linear scalability to more than a thousand processing streams executing in parallel while offering a very economical total cost of ownership.
This post is based entirely on my understanding of various articles and news items; the information here is just my personal opinion and does not reflect anyone else's view. The list goes on and on for most of the data warehousing vendors competing in the market; I will explore other vendors and share my views on them later.
 

Tuesday, February 5, 2013

Big Data – Beginning of New Era of Analytics in 2013

In today's digital world, data is growing at a very fast pace, which pushes organizations to focus on data and business intelligence. There has been a drastic increase in technologies that can rapidly evaluate the massive volumes and varieties of data flowing from devices, sensors, mobiles, the web, and so on. Recent improvements in storage, networking and computing technologies enable organizations to economically and efficiently harness this fast-moving, varied data and turn it into a powerful source of business improvement. Apart from data and business intelligence, organizations are also making data quality, better analytics, governance and data management some of their top priorities.
More data is now generated by devices than computer networks are capable of transporting. According to IDC's 2011 Digital Universe Study, commissioned by EMC, the amount of information created and replicated that year would surpass 1.8 zettabytes (1.8 trillion gigabytes), having grown by a factor of nine in just five years. I read somewhere that Facebook alone has more than 800 million active users and more than 900 million objects, such as pages, groups, events and community pages, that people interact with; its users spend over 700 billion minutes per month on the site, each creating on average 90 pieces of content and together sharing 30 billion pieces of content each month.
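As a quick back-of-the-envelope check of what "a factor of nine in five years" implies, the compound annual growth rate works out to roughly 55 percent:

```python
growth_factor = 9          # total growth over the period cited by the study
years = 5
annual = growth_factor ** (1 / years)
print(f"Implied compound annual growth: {annual:.2f}x (~{(annual - 1) * 100:.0f}% per year)")
# -> Implied compound annual growth: 1.55x (~55% per year)
```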
This endless growth in all kinds of data, most of it unstructured, is what we call Big Data. It can also be defined as data that is growing exponentially and is too large, too fast-moving, or too unstructured to analyze using relational database techniques.
Wikipedia defines Big Data as “A term applied to data sets whose size is beyond the ability of commonly used software tools to capture, manage, and process the data within a tolerable elapsed time. Big data sizes are a constantly moving target currently ranging from a few dozen terabytes to many petabytes of data in a single data set.”
Organizations want to process these petabytes of data in the minimum amount of time in order to speed up and simplify decision-making. They want to run analytics and BI over structured and unstructured Big Data to extract valuable intelligence and insights from this mass of information. Big Data analytics gives organizations a next-generation, forward-looking view, enabling them to imagine and act on future prospects. With this analysis, organizations can understand not just what is happening in the business and why, but also what the other possibilities are.
There are many vendors in the Big Data market, such as IBM, Microsoft, Sybase, EMC, and more. Some have implemented Big Data on dedicated, proprietary hardware, some have opted for in-memory solutions, and some have switched from a row-oriented to a column-oriented style of storing data. Microsoft HDInsight seems to me the most viable of them all. According to a Forrester Consulting Total Economic Impact study on the potential benefits of upgrading to SQL Server 2012, a return on investment of up to 189 percent with a 12-month payback period can be realized.
An infographic published by Microsoft also suggests that SQL Server 2012 will help customers tame Big Data. Microsoft is offering new storage options and tools that help with Big Data analysis, including an in-memory, column-oriented store that improves analytics performance and makes its self-service business intelligence more accessible to end users.
In my view, Big Data is shaping up to be a revolution in data analytics, and it would not be wrong to call it the beginning of a new era of analytics in 2013. Hadoop, an open-source technology, is also one of the answers to Big Data; I will cover it in detail in an upcoming post.
Have any of you worked with Big Data? Please share your experience or learnings, and let me know your views on improvements or additions.