Notes on video lecture:
Scalability Basics
what does scalability mean?
operationally
in the past, "scale up"
needs to work even if the data doesn't fit in main memory on one machine
need to bring in data, work on it, then bring in more data, work on it, etc.
what was important was the size of the memory footprint
this was "out of core" processing of large datasets
but this began to not be enough, since you couldn't bring data on and off fast enough to satisfy the load
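A minimal sketch of this "out of core" pattern, assuming the data lives in a hypothetical large file data.bin and only one fixed-size chunk is ever held in memory:

# Out-of-core processing: stream a file through memory one chunk at a time,
# so the working set stays bounded no matter how large the file is.
# (File name and chunk size are illustrative assumptions.)

CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB per chunk

def process_chunk(chunk: bytes) -> int:
    # stand-in for real work; here we just count bytes equal to 'G'
    return chunk.count(b"G")

def process_file(path: str) -> int:
    total = 0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(CHUNK_SIZE)     # bring in data...
            if not chunk:
                break
            total += process_chunk(chunk)  # ...work on it, then repeat
    return total

if __name__ == "__main__":
    print(process_file("data.bin"))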
now, "scale out"
you can increase the speed at which you can process the data by making use of 1000s of cheap computers
algorithmically
in the past
an algorithm is scalable if, when you have n data items, you must do no more than n^m operations, where m is 1 or 2 or some other low number; at m = 4 and beyond it becomes difficult for large datasets
this was tractable, and it became the definition of scalable
now:
n^m/k where k is the number of computers you can apply to the problem
soon, "streaming data"
"if you have n data items, you should do no more that n * log(n) operations
as data grows, you may only get one pass at it
whenever you hear "log" you should think "trees", so this enables you to look at each item and put it in some tree data structure
example: Large Synoptic Survey Telescope (30TB / night)
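A rough sketch of that one-pass, n * log(n) idea; Python has no built-in balanced tree, so a binary heap stands in for the tree structure (each insert costs O(log n)):

# One pass over a data stream: each item is inserted into a tree-like
# structure (a binary heap here), at O(log n) per insert, so the whole
# pass is O(n log n).
import heapq

def one_pass_order(stream):
    heap = []
    for item in stream:              # you only get one look at each item...
        heapq.heappush(heap, item)   # ...and each insert costs O(log n)
    # items can now be drained in sorted order, again O(n log n) total
    return [heapq.heappop(heap) for _ in range(len(heap))]

print(one_pass_order([42, 7, 19, 3, 88]))  # [3, 7, 19, 42, 88]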
example 1: "Needle in Haystack"
find matching DNA sequences, i.e. those that equal "GATTACGATATTA"
linear search
move through each, compare, until you get a match
40 records, 40 comparisons
N records, N comparisons
algorithmic complexity is order N
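A minimal linear-search sketch over a tiny, made-up list of sequences:

# Linear search: compare every record until a match is found.
# Worst case: N records => N comparisons, i.e. O(N).
def linear_search(sequences, target):
    for i, seq in enumerate(sequences):
        if seq == target:        # one comparison per record
            return i
    return -1

sequences = ["ATTAGC", "GATTACGATATTA", "CCGTAA"]  # tiny illustrative dataset
print(linear_search(sequences, "GATTACGATATTA"))   # 1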
sort the sequences
sort them first, then start in the middle, and move in the direction of the one you are looking for
after the first check, you remove the need to check half the data
40 records, 4 comparisons
but we had to sort the data ahead of time
algorithmic complexity is log(N) comparisons
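A sketch of the sorted approach using Python's bisect module for the halving search; the up-front sort is the preprocessing cost mentioned above:

# Binary search on pre-sorted data: each comparison discards half of the
# remaining records, so a lookup costs O(log N) comparisons.
import bisect

sequences = ["ATTAGC", "GATTACGATATTA", "CCGTAA", "TTAGGC"]
sequences.sort()                 # one-time preprocessing cost

def sorted_search(sorted_seqs, target):
    i = bisect.bisect_left(sorted_seqs, target)  # start in the middle, halve each step
    return i if i < len(sorted_seqs) and sorted_seqs[i] == target else -1

print(sorted_search(sequences, "GATTACGATATTA"))  # index of the match, or -1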
this type of approach to search is baked into databases, i.e. "needle in haystack" problems
e.g. CREATE INDEX seq_idx ON sequence
this is old-style scalability, i.e. it fits in memory
builds a platform for building and reusing indexes
"there is a lot of work that you are getting for free just by turning your problem into an SQL statement"
example 2: "Read Trimming"
given set of DNA sequences, trim a suffix of each and create a new dataset
this is a standard preprocessing operation
linearly
loop through and process each
there is no index that is going to help us here, you have to at least touch every record
how can we do better?
the processing of each record is independent of the processing of the others
therefore, break dataset into a number of groups and give groups to different processors which process simultaneously
complexity is N/k, where
N is the number of items to operate on
k is the number of workers
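A rough sketch of the scale-out version with Python's multiprocessing pool; the trim length and the pool size k are arbitrary choices for the example, and each worker handles roughly N/k records:

# Read trimming in parallel: every record is independent, so split the
# dataset across k workers and each one does about N/k of the work.
from multiprocessing import Pool

TRIM = 3  # number of bases to drop from the end of each read (arbitrary)

def trim_read(seq: str) -> str:
    return seq[:-TRIM] if len(seq) > TRIM else seq

if __name__ == "__main__":
    reads = ["GATTACGATATTA", "ATTAGCCGTA", "CCGTAACCGG", "TTAGGCATTA"]
    with Pool(processes=2) as pool:           # k = 2 workers here
        trimmed = pool.map(trim_read, reads)  # N reads split across the pool
    print(trimmed)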

Vocabulary:

tractable, adj. of a decision problem, algorithmically solvable fast enough to be practically relevant, typically in polynomial time  "This algorithm proved to be tractable and became the definition of scalable."