University of Washington - Data Manipulation at Scale: Systems and Algorithms

  • Offered by Coursera
  • Public/Government Institute

Data Manipulation at Scale: Systems and Algorithms at Coursera
Overview

Duration: 20 hours
Total fee: Free
Mode of learning: Online
Credential: Certificate

Highlights

  • Shareable certificate: Earn a certificate upon completion.
  • 100% online: Start instantly and learn on your own schedule.
  • Course 1 of 4 in the Data Science at Scale Specialization
  • Flexible deadlines: Reset deadlines in accordance with your schedule.
  • Approx. 20 hours to complete
  • English; subtitles: Arabic, French, Portuguese (European), Italian, Vietnamese, German, Russian, English, Spanish
Course details

More about this course
  • Data analysis has replaced data acquisition as the bottleneck to evidence-based decision making: we are drowning in data. Extracting knowledge from large, heterogeneous, and noisy datasets requires not only powerful computing resources but also the programming abstractions to use them effectively. The abstractions that emerged in the last decade blend ideas from parallel databases, distributed systems, and programming languages to create a new class of scalable data analytics platforms that form the foundation for data science at realistic scales.
  • In this course, you will learn the landscape of relevant systems, the principles on which they rely, their tradeoffs, and how to evaluate their utility against your requirements. You will learn how practical systems were derived from the frontier of research in computer science and what systems are coming on the horizon. Cloud computing, SQL and NoSQL databases, MapReduce and the ecosystem it spawned, Spark and its contemporaries, and specialized systems for graphs and arrays will be covered.
  • You will also learn the history and context of data science, the skills, challenges, and methodologies the term implies, and how to structure a data science project.
  • Learning goals. At the end of this course, you will be able to:
  • 1. Describe common patterns, challenges, and approaches associated with data science projects, and what makes them different from projects in related fields.
  • 2. Identify and use the programming models associated with scalable data manipulation, including relational algebra, MapReduce, and other data flow models.
  • 3. Use database technology adapted for large-scale analytics, including the concepts driving parallel databases, parallel query processing, and in-database analytics.
  • 4. Evaluate key-value stores and NoSQL systems, describe their tradeoffs with comparable systems, the details of important examples in the space, and future trends.
  • 5. "Think" in MapReduce to effectively write algorithms for systems including Hadoop and Spark, and write programs in Spark. You will understand their limitations, design details, their relationship to databases, and their associated ecosystem of algorithms, extensions, and languages.
  • 6. Describe the landscape of specialized Big Data systems for graphs, arrays, and streams
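
As a rough illustration of what "thinking in MapReduce" (goal 5) can look like, the sketch below simulates the map, shuffle, and reduce steps for a word count in plain Python on a single machine. It is not taken from the course; the function names and the sample documents are assumptions made for illustration, and real systems such as Hadoop or Spark distribute these steps across many machines.

```python
from collections import defaultdict


def map_phase(doc_id, text):
    """Map: emit a (word, 1) pair for every word in one document."""
    for word in text.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as a framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(word, counts):
    """Reduce: collapse the list of values for one key into one result."""
    return word, sum(counts)


def word_count(documents):
    """Run map, shuffle, and reduce in sequence over a dict of documents."""
    pairs = (
        pair
        for doc_id, text in documents.items()
        for pair in map_phase(doc_id, text)
    )
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


# Hypothetical sample input, chosen only to make the sketch runnable.
docs = {"d1": "big data big systems", "d2": "data systems at scale"}
counts = word_count(docs)
print(counts)
```

The point of the split is that the programmer writes only the map and reduce functions; the grouping by key in between is the part a framework handles.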
Curriculum

Data Science Context and Concepts

Appetite Whetting: Politics

Appetite Whetting: Extreme Weather

Appetite Whetting: Digital Humanities

Appetite Whetting: Bibliometrics

Appetite Whetting: Food, Music, Public Health

Appetite Whetting: Public Health cont'd, Earthquakes, Legal

Characterizing Data Science

Characterizing Data Science, cont'd

Distinguishing Data Science from Related Topics

Four Dimensions of Data Science

Tools vs. Abstractions

Desktop Scale vs. Cloud Scale

Hackers vs. Analysts

Structs vs. Stats

Structs vs. Stats cont'd

A Fourth Paradigm of Science

Data-Intensive Science Examples

Big Data and the 3 Vs

Big Data Definitions

Big Data Sources

Course Logistics

Twitter Assignment: Getting Started

Supplementary: Three-Course Reading List

Supplementary: Resources for Learning Python

Supplementary: Class Virtual Machine

Supplementary: Github Instructions

Relational Databases and the Relational Algebra

Data Models, Terminology

From Data Models to Databases

Pre-Relational Databases

Motivating Relational Databases

Relational Databases: Key Ideas

Algebraic Optimization Overview

Relational Algebra Overview

Relational Algebra Operators: Union, Difference, Selection

Relational Algebra Operators: Projection, Cross Product

Relational Algebra Operators: Cross Product cont'd, Join

Relational Algebra Operators: Outer Join

Relational Algebra Operators: Theta-Join

From SQL to RA

Thinking in RA: Logical Query Plans

Practical SQL: Binning Timeseries

Practical SQL: Genomic Intervals

User-Defined Functions

Support for User-Defined Functions

Optimization: Physical Query Plans

Optimization: Choosing Physical Plans

Declarative Languages

Declarative Languages: More Examples

Views: Logical Data Independence

Indexes

MapReduce and Parallel Dataflow Programming

What Does Scalable Mean?

A Sketch of Algorithmic Complexity

A Sketch of Data-Parallel Algorithms

"Pleasingly Parallel" Algorithms

More General Distributed Algorithms

MapReduce Abstraction

MapReduce Data Model

Map and Reduce Functions

MapReduce Simple Example

MapReduce Simple Example cont'd

MapReduce Example: Word Length Histogram

MapReduce Examples: Inverted Index, Join

Relational Join: Map Phase

Relational Join: Reduce Phase

Simple Social Network Analysis: Counting Friends

Matrix Multiply Overview

Matrix Multiply Illustrated

Shared Nothing Computing

MapReduce Implementation

MapReduce Phases

A Design Space for Large-Scale Data Systems

Parallel and Distributed Query Processing

Teradata Example, MR Extensions

RDBMS vs. MapReduce: Features

RDBMS vs. Hadoop: Grep

RDBMS vs. Hadoop: Select, Aggregate, Join

NoSQL: Systems and Concepts

NoSQL Context and Roadmap

NoSQL Roundup

Relaxing Consistency Guarantees

Two-Phase Commit and Consensus Protocols

Eventual Consistency

CAP Theorem

Types of NoSQL Systems

ACID, Major Impact Systems

Memcached: Consistent Hashing

Consistent Hashing, cont'd

DynamoDB: Vector Clocks

Vector Clocks, cont'd

CouchDB Overview

CouchDB Views

BigTable Overview

BigTable Implementation

HBase, Megastore

Spanner

Spanner cont'd, Google Systems

MapReduce-based Systems

Bringing Back Joins

NoSQL Rebuttal

Almost SQL: Pig

Pig Architecture and Performance

Data Model

Load, Filter, Group

Group, Distinct, Foreach, Flatten

CoGroup, Join

Join Algorithms

Skew

Other Commands

Evaluation Walkthrough

Review

Context

Spark Examples

RDDs, Benefits

Graph Overview

Structural Analysis

Degree Histograms, Structure of the Web

Connectivity and Centrality

PageRank

PageRank in more Detail

Traversal Tasks: Spanning Trees and Circuits

Traversal Tasks: Maximum Flow

Pattern Matching

Querying Edge Tables

Relational Algebra and Datalog for Graphs

Querying Hybrid Graph/Relational Data

Graph Query Example: NSA

Graph Query Example: Recursion

Evaluation of Recursive Programs

Recursive Queries in MapReduce

The End-Game Problem

Representation: Edge Table, Adjacency List

Representation: Adjacency Matrix

PageRank in MapReduce

PageRank in Pregel
