Big Data and Apache Hadoop – Part 1
- Big data is data that exceeds the processing capacity of conventional database systems. The data is too big, moves too fast, or doesn't fit the strictures of your database architectures.
- Big data is a collection of data from traditional and digital sources inside and outside your company that represents a source for ongoing discovery and analysis.
- Unstructured data comes from information that is not organized or easily interpreted by traditional databases or data models, and typically, it's text-heavy. Metadata, Twitter tweets, and other social media posts are good examples of unstructured data.
- Multi-structured data refers to a variety of data formats and types and can be derived from interactions between people and machines, such as web applications or social networks. A great example is web log data, which includes a combination of text and visual images along with structured data like form or transactional information. As digital disruption transforms communication and interaction channels—and as marketers enhance the customer experience across devices, web properties, face-to-face interactions and social platforms—multi-structured data will continue to evolve.
- The majority of big data solutions are now provided in three forms: software-only, as an appliance or cloud-based.
- The three Vs. of volume, velocity and variety are commonly used to characterize different aspects of big data
- Volume:
- to process large amounts of information
- It calls for scalable storage, and a distributed approach to querying.
- Hadoop is a platform for distributing computing problems across a number of servers.
- Hadoop is the core platform for structuring Big Data, and solves the problem of making it useful for analytics purposes.
- "MapReduce" approach pioneered by Google.
- Typical Hadoop usage pattern involves three stages
- loading data into HDFS
- MapReduce operations
- retrieving results from HDFS
- Velocity:
- the increasing rate at which data flows into an organization
- fast-moving data tends to be either "streaming data," or "complex event processing."
- NoSQL:
- It's this need for speed, particularly on the web, that has driven the development of key-value stores and columnar databases, optimized for the fast retrieval of pre-computed information.
- Variety:
- The source data is diverse, and doesn't fall into neat relational structures
- Semi-structured NoSQL databases meet this need for flexibility: they provide enough structure to organize data, but do not require the exact schema of the data before storing it.
- A software framework for sorting, processing, and analyzing "big data".
- Distributed
- Scalable
- Fault-tolerant
- Open Source
- Apache Hadoop is a fast-growing big data framework.
Advantages:
- Problems with Traditional Large-Scale Systems
- Processor-bound and lots of complex processing with bigger computers ( changed with distributed computers)
- Programming complexity
- Keeping data and processes in sync
- Finite Bandwidth
- Partial Failures
- Distributed Systems: The Data Bottleneck
- Traditionally, data is stored in a central location
- Data is copied to processors at runtime
- Fine for limited amount of data
Comments
Post a Comment