Notes on video lecture:
Scalability Basics
what does scalability mean?
operationally
in the past, "scale up"
needs to work even if the data doesn't fit in main memory on one machine
need to bring in data, work on it, then bring in more data, work on it, etc.
what was important was the size of the memory footprint
this was "out of core" processing of large datasets
but this began to not be enough, since you couldn't bring data on and off fast enough to satisfy the load
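A minimal sketch of this "out of core" pattern, assuming the data lives in a hypothetical large file data.bin and only one fixed-size chunk is ever held in memory:

# Out-of-core processing: stream a file through memory one chunk at a time,
# so the working set stays bounded no matter how large the file is.
# (File name and chunk size are illustrative assumptions.)

CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB per chunk

def process_chunk(chunk: bytes) -> int:
    # stand-in for real work; here we just count bytes equal to 'G'
    return chunk.count(b"G")

def process_file(path: str) -> int:
    total = 0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(CHUNK_SIZE)     # bring in data...
            if not chunk:
                break
            total += process_chunk(chunk)  # ...work on it, then repeat
    return total

if __name__ == "__main__":
    print(process_file("data.bin"))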
now, "scale out"
you can increase the speed at which you can process the data by making use of 1000s of cheap computers
algorithmically
in the past
an algorithm is scalable if, when you have n data items, you must do no more than n^m operations, where m is 1 or 2 or some other low number; at m = 4 and beyond it becomes difficult for large datasets
this was tractable, and it became the definition of scalable
now:
n^m/k where k is the number of computers you can apply to the problem
soon, "streaming data"
"if you have n data items, you should do no more that n * log(n) operations
as data grows, you may only get one pass at it
whenever you hear "log" you should think "trees", so this enables you to look at each item and put it in some tree data structure
example: Large Synoptic Survey Telescope (30TB / night)
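A rough sketch of that one-pass, n * log(n) idea; Python has no built-in balanced tree, so a binary heap stands in for the tree structure (each insert costs O(log n)):

# One pass over a data stream: each item is inserted into a tree-like
# structure (a binary heap here), at O(log n) per insert, so the whole
# pass is O(n log n).
import heapq

def one_pass_order(stream):
    heap = []
    for item in stream:              # you only get one look at each item...
        heapq.heappush(heap, item)   # ...and each insert costs O(log n)
    # items can now be drained in sorted order, again O(n log n) total
    return [heapq.heappop(heap) for _ in range(len(heap))]

print(one_pass_order([42, 7, 19, 3, 88]))  # [3, 7, 19, 42, 88]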
example 1: "Needle in Haystack"
find matching DNA sequences, i.e. those that equal "GATTACGATATTA"
linear search
move through each, compare, until you get a match
40 records, 40 comparisons
N records, N comparisons
algorithmic complexity is order N
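A minimal linear-search sketch over a tiny, made-up list of sequences:

# Linear search: compare every record until a match is found.
# Worst case: N records => N comparisons, i.e. O(N).
def linear_search(sequences, target):
    for i, seq in enumerate(sequences):
        if seq == target:        # one comparison per record
            return i
    return -1

sequences = ["ATTAGC", "GATTACGATATTA", "CCGTAA"]  # tiny illustrative dataset
print(linear_search(sequences, "GATTACGATATTA"))   # 1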
sort the sequences
sort them first, then start in the middle, and move in the direction of the one you are looking for
after the first check, you remove the need to check half the data
40 records, 4 comparisons
but we had to sort the data ahead of time
algorithmic complexity is log(N) comparisons
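A sketch of the sorted approach using Python's bisect module for the halving search; the up-front sort is the preprocessing cost mentioned above:

# Binary search on pre-sorted data: each comparison discards half of the
# remaining records, so a lookup costs O(log N) comparisons.
import bisect

sequences = ["ATTAGC", "GATTACGATATTA", "CCGTAA", "TTAGGC"]
sequences.sort()                 # one-time preprocessing cost

def sorted_search(sorted_seqs, target):
    i = bisect.bisect_left(sorted_seqs, target)  # start in the middle, halve each step
    return i if i < len(sorted_seqs) and sorted_seqs[i] == target else -1

print(sorted_search(sequences, "GATTACGATATTA"))  # index of the match, or -1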
this type of approach to search is baked into databases, i.e. "needle in haystack" problems
e.g. CREATE INDEX seq_idx ON sequence
this is old-style scalability, i.e. it fits in memory
builds a platform for building and reusing indexes
"there is a lot of work that you are getting for free just by turning your problem into an SQL statement"
example 2: "Read Trimming"
given set of DNA sequences, trim a suffix of each and create a new dataset
this is a standard preprocessing operation
linearly
loop through and process each
there is no index that is going to help us here, you have to at least touch every record
how can we do better?
the processing of each record is independent of the processing of the others
therefore, break dataset into a number of groups and give groups to different processors which process simultaneously
complexity is N/k, where
N is the number of items to operate on
k is the number of workers
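A rough sketch of the scale-out version with Python's multiprocessing pool; the trim length and the pool size k are arbitrary choices for the example, and each worker handles roughly N/k records:

# Read trimming in parallel: every record is independent, so split the
# dataset across k workers and each one does about N/k of the work.
from multiprocessing import Pool

TRIM = 3  # number of bases to drop from the end of each read (arbitrary)

def trim_read(seq: str) -> str:
    return seq[:-TRIM] if len(seq) > TRIM else seq

if __name__ == "__main__":
    reads = ["GATTACGATATTA", "ATTAGCCGTA", "CCGTAACCGG", "TTAGGCATTA"]
    with Pool(processes=2) as pool:           # k = 2 workers here
        trimmed = pool.map(trim_read, reads)  # N reads split across the pool
    print(trimmed)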

Vocabulary:

tractable, adj. of a decision problem, algorithmically solvable fast enough to be practically relevant, typically in polynomial time  "This algorithm proved to be tractable and became the definition of scalable."