Big Data

Big Data: A Brief Overview, Challenges, and Issues
COMP851: Big Data Analytics, Assignment 1
Abenezer Girma
[email protected]
NC A&T State University, September 7, 2018
What is Big Data
Big data is a term for the large volume of structured, semi-structured, and unstructured data that can
be mined for information, opening a new path to innovation and insight in different sectors.
Technological progress toward digitalization and the wide availability of electronic devices
(cellphones, computers, etc.), IoT devices, sensors, and websites, combined with an ever-increasing
number of social media users, generate a tremendous amount and variety of data. This availability of
Big Data creates new opportunities for all kinds of industries to gain insight and build competitive
advantage by exploiting information in the collected data. However, the huge influx of data and its
nature, especially for unstructured data, make mining information more difficult than it seems from
the outside. Thus, several industries and academic institutions have started to develop their ability
to gather and mine data following an organized and scientific approach, which is usually called data
science. Many well-known companies, including technology companies (Google, Facebook, LinkedIn, IBM,
Intel, etc.), banking and credit card companies, manufacturing companies, and mobile phone companies,
are taking the lead in data mining. In the coming years, the use of Big Data, and of Artificial
Intelligence in general, will affect every individual’s day-to-day activities in one way or another.
• Three attributes stand out as defining characteristics of Big Data, usually called the 3Vs:
Volume – As the name implies, one of the main characteristics of Big Data is the tremendous
amount of data generated continuously, on a scale never seen before. Structured Big Data can
have thousands of features and millions of rows, and unstructured data such as images and video
can easily reach terabytes in size [1]. This data explosion keeps growing at an increasing rate.
Variety – Big data comes in a wide variety of types, which can be broadly divided into structured
and unstructured data. Within these two categories the data format can range over text (email,
tweets, sentiment), images (RGB, X-ray, ultrasound), video, sound, numerical data, and so on. Thus
we often have no control over the data type or data structure we are dealing with.
Velocity – New data is created very fast, and the speed keeps increasing. In 2017 Facebook was
storing roughly 250 billion images, and its users currently upload more than 900 million photos
a day, so last year’s figure of 250 billion will look small within a few months [2].
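The velocity figures above can be checked with a line of arithmetic. The sketch below takes the two quoted numbers as given and assumes, purely for illustration, a constant upload rate and no deletions; the function name is hypothetical.

```python
# Figures quoted in the text above (taken as given, not re-verified here).
STORED_IMAGES = 250_000_000_000   # images stored in 2017
UPLOADS_PER_DAY = 900_000_000     # photos uploaded per day


def days_to_add(stored: int, per_day: int) -> float:
    """Days of uploads needed to add `stored` more items at a constant rate (an assumption)."""
    return stored / per_day


days = days_to_add(STORED_IMAGES, UPLOADS_PER_DAY)
print(f"{days:.0f} days (about {days / 30:.1f} months)")
```

Under these assumptions, the stored total would double in well under a year.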
Data Structures
Variety, one of the 3Vs of big data, refers to the structure of data. It shows up as a growing number
of data types that come in multiple forms, both structured and unstructured. In contrast to traditional
data analysis, most Big Data is unstructured or semi-structured in nature, which requires different
approaches and tools to process and analyze [4]. Data structures can be categorized into four types:
• Structured data:- This category contains data that have a defined data type, format, and structure.
Every piece of information included is known ahead of time, comes in a specified format, and occurs
in a specified order. This makes it easy to work with.
• Unstructured data:- These are data sources that we have little or no control over and that have no
inherent structure, such as text documents, PDFs, images, and video. For example, a picture has a
format of individual pixels arranged in rows, but how those pixels fit together to create the picture
seen by an observer varies substantially from case to case.
• Semi-structured data:- This has a logical flow and format that can be understood, but the format
is not user-friendly. Reading semi-structured data for analysis isn’t as simple as specifying a fixed
file format. It requires complex rules that dynamically determine how to proceed after reading each
piece of information [4]. Textual data files with a discernible pattern that enables parsing are a
good example.
• Quasi-structured data:- Textual data with unpredictable formats that can be formatted with effort,
tools, and time (for instance, web click-stream data that may contain inconsistencies in data values
and formats) [1].
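The difference between these categories shows up as soon as one tries to read the data. The following is a minimal sketch using only the Python standard library; all sample records, field names, and the regular expression are invented for illustration and are not taken from the cited sources.

```python
import csv
import io
import json
import re

# Structured: every row has the same fields, in the same order.
structured = "id,name,age\n1,Ana,34\n2,Ben,29\n"
rows = list(csv.DictReader(io.StringIO(structured)))

# Semi-structured: parseable, but each record decides which fields it carries,
# so the reader must inspect every record before knowing what to do next.
semi = '[{"id": 1, "name": "Ana", "tags": ["a", "b"]}, {"id": 2, "address": {"city": "Oslo"}}]'
records = json.loads(semi)
field_sets = [sorted(record.keys()) for record in records]

# Quasi-structured: click-stream lines whose layout is inconsistent (hypothetical samples).
clicks = [
    "2018-09-07 10:01:02 GET /home user=17",
    "07/09/2018 10:01 /cart?user=17",
    "GET /checkout",
]
# A URL path is assumed to start with "/" right after whitespace or at line start.
PATH = re.compile(r"(?:^|\s)(/[\w/]*)")


def extract_path(line):
    """Return the first URL path found in a click-stream line, or None."""
    match = PATH.search(line)
    return match.group(1) if match else None


paths = [extract_path(line) for line in clicks]

print(rows[0]["name"], field_sets, paths)
```

The structured input needs no rules beyond the header, the semi-structured input needs per-record inspection, and the quasi-structured input needs a hand-written pattern that may still miss cases.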
Big Data Analysis
Data analytics is concerned with the extraction of actionable knowledge and insights from big data.
Big data analytics can take four forms [5]:
• Descriptive Analytics:- This describes the past and present in a way that is easy for many people
to understand. The gathered data are organized, and visualization tools such as charts, maps, and
graphs are used to show what the data imply.
• Predictive Analytics:- After learning from the collected data, a trained model is used to tell what
is expected to happen in the near future. The tools used for prediction include time series analysis
with statistical methods and machine learning algorithms.
• Exploratory or Discovery Analytics:- This focuses on discovering hidden patterns and information
in data collected from various sources. A company discovering its customers’ behavior through
sentiment analysis of collected feedback is a good example.
• Prescriptive Analytics:- This identifies, based on the data gathered, opportunities to optimize
solutions to existing problems. In other words, the analysis tells us what to do to achieve a goal.
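The contrast between the first two forms can be made concrete with a small sketch. The monthly sales figures below are hypothetical, and the straight-line fit is only one simple choice of predictive model, used here as an assumption.

```python
from statistics import mean

# Hypothetical monthly sales figures, invented for illustration.
sales = [100.0, 110.0, 125.0, 130.0, 150.0, 160.0]
months = list(range(len(sales)))

# Descriptive: summarize what has already happened.
summary = {"mean": mean(sales), "min": min(sales), "max": max(sales)}


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


# Predictive: learn the trend from past data and extrapolate one month ahead.
slope, intercept = fit_line(months, sales)
forecast = slope * len(sales) + intercept

print(summary)
print(f"trend: {slope:.2f} per month, next month: {forecast:.1f}")
```

Descriptive output only restates the collected data, while the forecast is a statement about a month that has not been observed yet.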
Data Science
Our daily activities involve a lot of digital equipment that generates large amounts of data, and
industries are also moving from traditional working processes to digitalization, which makes them a
big source of data. This data explosion is an excellent resource for making better decisions,
understanding the world, improving production performance, building competitive businesses, providing
better health care, and many other applications. However, it is challenging to take the collected raw
data, process it, and turn it into useful information, insights, and action. Data science is the key
discipline that uses different approaches and tools to turn the flood of raw data into useful
information. Put simply, data science is the art of wrangling data to predict our future behavior,
uncover patterns to help prioritize or provide actionable information, or otherwise draw meaning from
these vast, untapped data resources [6]. Generally, there are three recurring activities that data
scientists carry out:
• Converting a business problem into an analytics problem. This requires good observation and
problem-solving skills to diagnose business problems and determine which kinds of analytical solution
can be applied to solve them.
• After the analytical approach is selected, designing, testing, and finally implementing a model
to mine the required information from the available data.
• Using the developed model to extract useful information from the data in order to solve the
business problem and increase productivity and efficiency. This stage also involves people outside
data science, so visualization techniques are applied to draw insights and communicate them effectively.
Data Analytics Life-cycle
Based on many long and challenging deployments across many areas, trial and error, and consulting
exchanges with customers from many fields, vendors have coined a life cycle for data analytics:
1. Identify/Formulate Problem: It all starts with understanding and defining the problem and
setting objectives from a business perspective. The defined business problem is then converted into
a data mining problem definition.
2. Data Preparation & Exploration: Data collection, cleaning, preprocessing, wrangling, and
visualization are performed in this phase to understand the data, increase its quality, and gain
insight into it. Generally, it covers the steps required to turn the collected raw data into the
final data set that can be fed to the model in the next step.
3. Build Model: After exploring the data, you have the information needed to develop a mathematical
model that encodes the relationships in the data. These models are useful for understanding the
system under study, and they are used for three main purposes. The first is to predict data values
(regression models), the second is to classify new data products (classification models), and the
third is to find patterns in the data in an unsupervised manner (clustering) [8].
4. Model Validation: The training data set is used for building the model, and the validation data
set is used for validating it. The model is evaluated by comparing its output with the validation
data set. This process is used not only to evaluate the accuracy of the model but also to compare it
with other existing models.
5. Deploy Model and Monitor: This is the final step of the analysis process, which aims to present
the results, that is, the conclusions of the analysis. In a business environment, deployment
translates the analysis into a benefit for the client who commissioned it. In technical or scientific
environments, it is translated into design solutions or scientific publications. That is, deployment
basically consists of putting into practice the results obtained from the data analysis [8]. The
deployment phase also includes writing a report for the customer who requested the analysis. In the
documentation supplied by the analyst, each of these four topics is generally discussed in detail:
analysis results, decision deployment, risk analysis, and measuring the business impact [8].

Figure 1: Data analytics lifecycle [9]
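Steps 2 to 4 of the life cycle can be sketched in a few lines. The example below is only an illustration under stated assumptions: the raw records are synthetic, the model is a simple least-squares line (a regression model in the sense of step 3), and the "other existing model" used for comparison is a hypothetical baseline that always predicts the training mean.

```python
import random
from statistics import mean

# Synthetic raw records (x, y), invented for illustration; None marks a missing value.
random.seed(0)
raw = [(x, 3.0 * x + 2.0 + random.uniform(-1, 1)) for x in range(40)]
raw[5] = (5, None)

# Step 2, data preparation: drop incomplete records, then shuffle and split.
clean = [(x, y) for x, y in raw if y is not None]
random.shuffle(clean)
cut = int(0.75 * len(clean))
train, valid = clean[:cut], clean[cut:]


def fit_line(pairs):
    """Step 3, build model: least-squares line y = slope * x + intercept."""
    xs, ys = zip(*pairs)
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in pairs)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


def mean_abs_error(predict, pairs):
    """Step 4, validation: average absolute gap between prediction and held-out value."""
    return mean(abs(predict(x) - y) for x, y in pairs)


slope, intercept = fit_line(train)          # fitted on the training set only
train_mean = mean(y for _, y in train)      # baseline model for comparison

model_error = mean_abs_error(lambda x: slope * x + intercept, valid)
baseline_error = mean_abs_error(lambda x: train_mean, valid)

print(f"model MAE: {model_error:.2f}, baseline MAE: {baseline_error:.2f}")
```

Keeping the validation records out of `fit_line` is the point of the split: the error is measured on data the model has not seen, and the same measure can be applied to any competing model.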
Big Data Benefits
Big Data has changed the way we do business, management, and research. Data-intensive science,
especially data-intensive computing, is emerging to provide the tools we need to handle Big Data
problems [10].
• Big Data in marketing: Advances in data storage and mining technologies make it possible to
preserve increasing amounts of data generated directly or indirectly by users and analyze it to yield
valuable new insights. For example, companies can study consumer purchasing trends to better
target marketing. In addition, near-real-time data from mobile phones could provide detailed
characteristics about shoppers that help reveal their complex decision-making processes as they walk
through malls [11].
• Big Data in social study: Big data can expose people’s hidden behavioral patterns and even shed
light on their intentions and how they interact with others and their environment. This information
is useful to government agencies as well as private companies to support decision making in areas
ranging from law enforcement to social services to homeland security [12].
• Big Data in medicine: In the scientific domain, secondary uses of patient data could lead to the
discovery of cures for a wide range of devastating diseases and the prevention of others [12].
Big Data Challenges
While big data can yield extremely useful information, it also presents new challenges with respect to
how much data to store, how much this will cost, whether the data will be secure, and how long it must
be maintained [12]. Big data also presents new ethical challenges when companies track their employees
and customers using digital footprints. Such monitoring might be useful for the companies but is not
in the best interest of the people being monitored.
Examples of Big Data Analytics
The following two examples show big data analytics in two different applications.
1. Predicting whether a wait-listed train ticket will be confirmed [5]: India runs around 8500
trains each day, and there are around 7000 railway stations, of which about 300 are major stations.
Twenty million passengers travel on a given day. The total number of reserved seats/berths issued
every day is around 250,000, and reservations can be made 60 days in advance. The allocation of
reserved seats is very complicated. There are quotas for VIPs, ladies, emergency travel, handicapped
persons, etc. There are also quotas for starting and intermediate stations. A passenger with a
wait-listed ticket would like to know the probability of getting the ticket confirmed. Predicting
this is very complicated, as it depends on several factors such as weekends, festivals, night trains,
and starting or intermediate stations. A company in the travel business gathered data on 10 million
reservations on various trains over a period of 2 years. The data covered wait-listed tickets
and those which got confirmed. Using the data and complex analytics that took into account all
the constraints noted above, the company was able to predict with 90 to 95% accuracy whether a
wait-listed ticket with a given waiting list number, in a given class of travel, on a specified train,
from a given starting station, on a given date would lead to a confirmed reservation. The
algorithm required machine learning based on past big data collection.
2. Using Data Analytics to Detect Anomalous States in Vehicles [13]: With recent advancements
in the automotive world and the introduction of autonomous vehicles, automotive cybersecurity
has become a primary issue for every automaker. To detect and protect against malicious attacks,
intrusion detection systems (IDS) are commonly used. These systems identify attacks by comparing
normal behavior with abnormalities. A two-stage IDS based on deep learning and rule-based systems
is used to detect malicious attacks and ensure CAN (Controller Area Network) security in real time
on the in-vehicle CAN bus. In the first stage, a rule-based component checks for injected CAN
messages, and in the second stage, a DNN-based component is used. Experimental results show that
this two-stage CAN IDS performs well, with a high detection rate, low false positives, and a small
processing delay.
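The two-stage structure described in the second example can be sketched as a short pipeline. This is not the cited authors' system: the frame fields, the table of known IDs and periods, the rules, and the threshold are all hypothetical, and the second stage is a simple timing-deviation score standing in for the DNN purely to show where a learned model would sit.

```python
from dataclasses import dataclass


@dataclass
class CanFrame:
    can_id: int
    data: bytes
    gap_ms: float  # time since the previous frame with the same ID


# Hypothetical bus description: expected IDs and their nominal period in ms.
KNOWN_IDS = {0x100: 10.0, 0x200: 20.0}


def rule_stage(frame):
    """Stage 1: cheap rule checks. Returns a reason if a rule is violated, else None."""
    if frame.can_id not in KNOWN_IDS:
        return "unknown CAN ID"
    if len(frame.data) > 8:  # classical CAN carries at most 8 data bytes
        return "payload longer than 8 bytes"
    return None


def learned_stage(frame):
    """Stage 2 placeholder standing in for the DNN: timing deviation scaled to [0, 1]."""
    period = KNOWN_IDS[frame.can_id]
    return min(1.0, abs(frame.gap_ms - period) / period)


def classify(frame, threshold=0.5):
    """Run stage 1 first; only frames that pass the rules reach stage 2."""
    reason = rule_stage(frame)
    if reason is not None:
        return f"attack ({reason})"
    if learned_stage(frame) > threshold:
        return "attack (anomaly score)"
    return "normal"


print(classify(CanFrame(0x100, b"\x01\x02", 10.5)))
print(classify(CanFrame(0x300, b"", 10.0)))
print(classify(CanFrame(0x100, b"\x00", 1.0)))
```

Putting the rules first means the more expensive second stage only sees frames that already look plausible, which is one way such a design can keep processing delay small.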
References
[1] Introduction to big data analytics and data science.
[2] Rich Miller. https://datacenterfrontier.com/inside-facebooks-blu-ray-cold-storage-data-center/.
[3] Philip Russom et al. Big data analytics. TDWI best practices report, fourth quarter, 19(4):1–34, 2011.
[4] Bill Franks. Taming the big data tidal wave: Finding opportunities in huge data streams with advanced
analytics, volume 49. John Wiley & Sons, 2012.
[5] V Rajaraman. Big data analytics. Resonance, 21(8):695–716, 2016.
[6] Lillian Pierson. Data science for dummies. John Wiley & Sons, 2015.
[7] Top Five Skills Data Scientist need. https://www.forbes.com/sites/quora/2017/06/15/what-are-the-top-five-skills-data-scientists-need/4d1ea4e47c0c. 2017.
[8] Fabio Nelli. Python Data Analytics: Data Analysis and Science using pandas, matplotlib and the Python
Programming Language. Apress, 2015.
[9] Feras A Batarseh and Eyad Abdel Latif. Assessing the quality of service using big data analytics: With application to healthcare. Big Data Research, 4:13–24, 2016.
[10] CL Philip Chen and Chun-Yang Zhang. Data-intensive applications, challenges, techniques and technologies: A survey on big data. Information Sciences, 275:314–347, 2014.
[11] Katina Michael and Roger Clarke. Location and tracking of mobile devices: Überveillance stalks the streets. Computer Law & Security Review, 29(3):216–228, 2013.
[12] Katina Michael and Keith W Miller. Big data: New opportunities and new challenges guest editors’ introduction. Computer, 46(6):22–24, 2013.
[13] Linxi Zhang, Lyndon Shi, and Nevrus Kaja. A two-stage deep learning approach for CAN intrusion detection.