What is big data?

There is no hard and fast rule about exactly how large a database needs to be for the data inside it to be considered "big." Instead, what typically defines big data is the need for new techniques and tools to process it. To work with big data, you need programs that span multiple physical and/or virtual machines, working together in concert, to process all of the data in a reasonable span of time.

Getting programs on multiple machines to work together in an efficient way so that each program knows which components of the data to process, and then being able to put the results from all the machines together to make sense of a large pool of data, takes special programming techniques. Since it is typically much faster for programs to access data stored locally instead of over a network, the distribution of data across a cluster and how those machines are networked together are also important considerations when thinking about big data problems.
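
To make that divide-and-combine pattern concrete, here is a toy sketch in plain Python. It uses the standard multiprocessing module on a single machine rather than a real cluster, and the data and chunk size are invented for illustration, but the shape of the work is the same: split the data, process the pieces independently, then merge the partial results.

```python
# Toy illustration of split -> process in parallel -> combine.
# A real big data system would distribute chunks across many machines;
# here the "workers" are just local processes.
from multiprocessing import Pool

def process_chunk(chunk):
    """Pretend analysis: count how many records in this chunk exceed 100."""
    return sum(1 for value in chunk if value > 100)

if __name__ == "__main__":
    data = list(range(1_000_000))          # stand-in for a huge dataset
    chunk_size = 100_000
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]

    with Pool(processes=4) as pool:        # four local "workers"
        partial_results = pool.map(process_chunk, chunks)

    total = sum(partial_results)           # combine the partial answers
    print(total)
```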

The uses of big data are almost as varied as they are large. Prominent examples you're probably already familiar with include social media networks analyzing their members' data to learn more about them and connect them with content and advertising relevant to their interests, and search engines looking at the relationship between queries and results to give better answers to users' questions.

How is big data analyzed?

One of the best-known methods for turning raw data into useful information is MapReduce. MapReduce is a method for taking a large data set and performing computations on it across multiple computers, in parallel. The name refers both to a programming model and, often, to the specific implementations of that model.

In essence, MapReduce consists of two parts. The Map function does sorting and filtering, taking data and placing it inside of categories so that it can be analyzed. The Reduce function provides a summary of this data by combining it all together. While largely credited to research that took place at Google, MapReduce is now a generic term and refers to a general model used by many technologies.
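
As a rough sketch of those two phases, here is a word count in plain Python, with the documents invented for illustration. The map step emits (word, 1) pairs, a shuffle step groups the pairs by word, and the reduce step sums each group; real MapReduce frameworks follow the same pattern but spread the pairs across many machines.

```python
# Minimal, single-machine sketch of the MapReduce pattern: map, shuffle, reduce.
from collections import defaultdict

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in the document."""
    for word in document.split():
        yield (word.lower(), 1)

def shuffle_phase(pairs):
    """Shuffle: group values by key so each word's counts end up together."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Reduce: combine the grouped values into a single summary per key."""
    return (key, sum(values))

documents = ["big data needs big tools", "open source tools for big data"]

mapped = [pair for doc in documents for pair in map_phase(doc)]
grouped = shuffle_phase(mapped)
counts = [reduce_phase(word, ones) for word, ones in grouped.items()]
print(sorted(counts))
```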

What tools are used to analyze big data?

Perhaps the most influential and established tool for analyzing big data is known as Apache Hadoop. Apache Hadoop is a framework for storing and processing data at a large scale, and it is completely open source. Hadoop can run on commodity hardware, making it easy to use with an existing data center, or even to conduct analysis in the cloud. Hadoop is broken into four main parts:

  • The Hadoop Distributed File System (HDFS), which is a distributed file system designed for very high aggregate bandwidth;
  • YARN, a platform for managing Hadoop's resources and scheduling programs that will run on the Hadoop infrastructure;
  • MapReduce, as described above, a model for doing big data processing;
  • And a common set of libraries for other modules to use.

Another major Apache framework for big data processing is Spark, which can run on top of Hadoop or on its own cluster. The main selling point of Spark is that it stores much of the data for processing in memory, as opposed to on disk, which for certain kinds of analysis can be much faster.
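
As a hedged illustration, here is roughly what a word count looks like in PySpark, Spark's Python API. The file path is a placeholder, and whether the job reads from HDFS and runs on YARN or locally depends on how Spark is deployed; the point to notice is that the intermediate result can be cached in memory for fast follow-up queries.

```python
# A word-count sketch using PySpark (Spark's Python API).
# "hdfs:///data/sample.txt" is a placeholder path; any text file works.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("wordcount-sketch").getOrCreate()
sc = spark.sparkContext

lines = sc.textFile("hdfs:///data/sample.txt")        # read the input, e.g. from HDFS
counts = (lines.flatMap(lambda line: line.split())    # map: split lines into words
               .map(lambda word: (word, 1))           # map: emit (word, 1) pairs
               .reduceByKey(lambda a, b: a + b))      # reduce: sum the counts per word

counts.cache()            # keep the result in memory for further, faster queries
print(counts.take(10))    # peek at the first ten (word, count) pairs
spark.stop()
```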

Other big data tools

Of course, these aren't the only big data tools out there. There are countless open source solutions for working with big data, many of them specialized for providing optimal features and performance for a specific niche or for specific hardware configurations.

  • Apache Beam is "a unified model for defining both batch and streaming data-parallel processing pipelines." It allows developers to write code that works across multiple processing engines.
  • Apache Hive is a data warehouse built on Hadoop. A top-level Apache project, it "facilitates reading, writing, and managing large datasets … using SQL."
  • Apache Impala is an SQL query engine that runs on Hadoop. It's incubating within Apache and is touted for improving SQL query performance while offering a familiar interface.
  • Apache Kafka allows users to publish and subscribe to real-time data feeds. It aims to bring the reliability of other messaging systems to streaming data; a minimal producer/consumer sketch follows this list.
  • Apache Lucene is a full-text indexing and search software library that can be used for recommendation engines. It's also the basis for many other search projects, including Solr and Elasticsearch.
  • Apache Pig is a platform for analyzing large datasets that runs on Hadoop. Yahoo, which developed it to do MapReduce jobs on large datasets, contributed it to the ASF in 2007.
  • Apache Solr is an enterprise search platform built upon Lucene.
  • Apache Zeppelin is an incubating project that enables interactive data analytics with SQL and other programming languages.
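
To make one of these concrete, here is a minimal sketch of Kafka's publish-and-subscribe model using the third-party kafka-python client. The broker address ("localhost:9092") and topic name ("clicks") are placeholders, and a production setup would add serialization, error handling, and consumer groups.

```python
# Minimal publish/subscribe sketch against a local Kafka broker,
# using the third-party kafka-python package (pip install kafka-python).
from kafka import KafkaConsumer, KafkaProducer

# Publish a single event to the "clicks" topic.
producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("clicks", b'{"user": 42, "page": "/pricing"}')
producer.flush()

# Subscribe to the same topic and read events as they arrive.
consumer = KafkaConsumer("clicks",
                         bootstrap_servers="localhost:9092",
                         auto_offset_reset="earliest")
for message in consumer:
    print(message.value)
    break  # stop after one message; a real consumer would keep listening
```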

Other open source big data tools you may want to investigate include:

  • Elasticsearch is another enterprise search engine based on Lucene. It's part of the Elastic stack (formerly known as the ELK stack for its components: Elasticsearch, Kibana, and Logstash) that generates insights from structured and unstructured data.
  • Cruise Control was developed by LinkedIn to run Apache Kafka clusters at large scale.
  • TensorFlow is a software library for machine learning that has grown rapidly since Google open sourced it in late 2015. It's been praised for democratizing machine learning because of its ease of use.

As big data continues to grow in size and importance, the list of open source tools for working with it will certainly continue to grow as well.