Big Data, Data Analysis and Data Mining
About ArtIStE and Big Data, Data Analysis and Data Mining
What is Big Data?
“Data whose scale, diversity and complexity require new architectures, techniques, algorithms and analytics to manage it and extract value and hidden knowledge from it”
Generation
1) Passive recording
▪Typically structured data
▪Bank trading transactions, shopping records, government sector archives
2) Active generation
▪Semistructured or unstructured data
▪User-generated content, e.g., social networks
3) Automatic production
▪Location-aware, context-dependent, highly mobile data
▪Sensor-based Internet-enabled devices
Acquisition
1) Collection
▪Pull-based, e.g., web crawler
▪Push-based, e.g., video surveillance, clickstream
2) Transmission
▪Transfer to data center over high-capacity links
3) Preprocessing
▪Integration, cleaning, redundancy elimination
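The three preprocessing steps above can be sketched in a few lines. This is only an illustrative sketch: the record fields ("name", "city"), the two sample sources, and the helper name preprocess are assumptions made for the example and do not come from the original material.

```python
def preprocess(sources):
    """Integrate several record lists, clean their fields, and drop duplicates."""
    seen = set()
    result = []
    for records in sources:                      # integration: merge all sources
        for rec in records:
            # cleaning: normalize whitespace and letter case
            name = rec.get("name", "").strip().lower()
            city = rec.get("city", "").strip().title()
            if not name:                         # cleaning: skip records with no name
                continue
            key = (name, city)
            if key in seen:                      # redundancy elimination
                continue
            seen.add(key)
            result.append({"name": name, "city": city})
    return result


# Hypothetical sample data from two sources
crm = [{"name": " Alice ", "city": "turin"}, {"name": "Bob", "city": "Milan"}]
web = [{"name": "alice", "city": "Turin "}, {"name": "", "city": "Rome"}]
cleaned = preprocess([crm, web])
```

Here the two spellings of the same person collapse into one record and the record without a name is discarded; real pipelines typically need fuzzier matching than exact key equality.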
Storage
1) Storage infrastructure
▪Storage technology, e.g., HDD, SSD
▪Networking architecture, e.g., DAS, NAS, SAN
2) Data management
▪File systems (HDFS), key-value stores (Memcached), column-oriented databases (Cassandra), document databases (MongoDB)
3) Programming models
▪MapReduce, stream processing, graph processing
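To make the MapReduce programming model more concrete, the following single-process sketch counts words with explicit map, shuffle, and reduce phases. It only imitates the model in plain Python; it is not the Hadoop API, and the function names (map_phase, shuffle, reduce_phase, word_count) are hypothetical.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values of one key into a single result."""
    return key, sum(values)


def word_count(documents):
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


counts = word_count(["big data", "data mining", "big data analysis"])
```

In a real framework the map and reduce calls run in parallel on different machines and the shuffle moves data across the network; the sketch keeps only the data flow.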
Analysis
1) Objectives
▪Descriptive analytics, predictive analytics, prescriptive analytics
2) Methods
▪Statistical analysis, data mining, text mining, network and graph data mining
▪Clustering, classification and regression, association analysis
3) Diverse domains call for customized techniques
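As one small instance of the association analysis listed under Methods, the sketch below computes the support of an itemset and the confidence of a rule over a handful of shopping baskets. The basket contents and the function names are assumptions chosen for illustration only.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimated probability of consequent given antecedent."""
    base = support(transactions, antecedent)
    joint = support(transactions, set(antecedent) | set(consequent))
    return joint / base if base else 0.0


# Hypothetical shopping baskets
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

s = support(baskets, {"bread", "milk"})         # 2 of 4 baskets
c = confidence(baskets, {"bread"}, {"milk"})    # 2 of the 3 bread baskets
```

Algorithms such as Apriori build on these two measures to search for all rules above chosen support and confidence thresholds.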
Big Data Challenges:
Technology and infrastructure
- New architectures, programming paradigms and techniques are needed
Data management and analysis
- New emphasis on “data”
- Data science