Презентация Dense Linear Algebra: History and Structure, Parallel Matrix Multiplication онлайн
На нашем сайте вы можете скачать и просмотреть онлайн доклад-презентацию на тему Dense Linear Algebra: History and Structure, Parallel Matrix Multiplication абсолютно бесплатно. Урок-презентация на эту тему содержит всего 105 слайдов. Все материалы созданы в программе PowerPoint и имеют формат ppt или же pptx. Материалы и темы для презентаций взяты из открытых источников и загружены их авторами, за качество и достоверность информации в них администрация сайта не отвечает, все права принадлежат их создателям. Если вы нашли то, что искали, отблагодарите авторов - поделитесь ссылкой в социальных сетях, а наш сайт добавьте в закладки.
Презентации » Математика » Dense Linear Algebra: History and Structure, Parallel Matrix Multiplication
Оцените!
Оцените презентацию от 1 до 5 баллов!
- Тип файла:ppt / pptx (powerpoint)
- Всего слайдов:105 слайдов
- Для класса:1,2,3,4,5,6,7,8,9,10,11
- Размер файла:2.01 MB
- Просмотров:86
- Скачиваний:0
- Автор:неизвестен
Слайды и текст к этой презентации:
№5 слайд
![What is dense linear algebra?](/documents_6/f52151e43a7ccc88c4d673f344686f88/img4.jpg)
Содержание слайда: What is dense linear algebra?
Not just matmul!
Linear Systems: Ax=b
Least Squares: choose x to minimize ||Ax-b||2
Overdetermined or underdetermined
Unconstrained, constrained, weighted
Eigenvalues and vectors of Symmetric Matrices
Standard (Ax = λx), Generalized (Ax=λBx)
Eigenvalues and vectors of Unsymmetric matrices
Eigenvalues, Schur form, eigenvectors, invariant subspaces
Standard, Generalized
Singular Values and vectors (SVD)
Standard, Generalized
Different matrix structures
Real, complex; Symmetric, Hermitian, positive definite; dense, triangular, banded …
Level of detail
Simple Driver
Expert Drivers with error bounds, extra-precision, other options
Lower level routines (“apply certain kind of orthogonal transformation”, matmul…)
№6 слайд
![A brief history of Dense](/documents_6/f52151e43a7ccc88c4d673f344686f88/img5.jpg)
Содержание слайда: A brief history of (Dense) Linear Algebra software (1/7)
Libraries like EISPACK (for eigenvalue problems)
Then the BLAS (1) were invented (1973-1977)
Standard library of 15 operations (mostly) on vectors
“AXPY” ( y = α·x + y ), dot product, scale (x = α·x ), etc
Up to 4 versions of each (S/D/C/Z), 46 routines, 3300 LOC
Goals
Common “pattern” to ease programming, readability, self-documentation
Robustness, via careful coding (avoiding over/underflow)
Portability + Efficiency via machine specific implementations
Why BLAS 1 ? They do O(n1) ops on O(n1) data
Used in libraries like LINPACK (for linear systems)
Source of the name “LINPACK Benchmark” (not the code!)
№8 слайд
![A brief history of Dense](/documents_6/f52151e43a7ccc88c4d673f344686f88/img7.jpg)
Содержание слайда: A brief history of (Dense) Linear Algebra software (2/7)
But the BLAS-1 weren’t enough
Consider AXPY ( y = α·x + y ): 2n flops on 3n read/writes
Computational intensity = (2n)/(3n) = 2/3
Too low to run near peak speed (read/write dominates)
Hard to vectorize (“SIMD’ize”) on supercomputers of the day (1980s)
So the BLAS-2 were invented (1984-1986)
Standard library of 25 operations (mostly) on matrix/vector pairs
“GEMV”: y = α·A·x + β·x, “GER”: A = A + α·x·yT, x = T-1·x
Up to 4 versions of each (S/D/C/Z), 66 routines, 18K LOC
Why BLAS 2 ? They do O(n2) ops on O(n2) data
So computational intensity still just ~(2n2)/(n2) = 2
OK for vector machines, but not for machine with caches
№9 слайд
![A brief history of Dense](/documents_6/f52151e43a7ccc88c4d673f344686f88/img8.jpg)
Содержание слайда: A brief history of (Dense) Linear Algebra software (3/7)
The next step: BLAS-3 (1987-1988)
Standard library of 9 operations (mostly) on matrix/matrix pairs
“GEMM”: C = α·A·B + β·C, C = α·A·AT + β·C, B = T-1·B
Up to 4 versions of each (S/D/C/Z), 30 routines, 10K LOC
Why BLAS 3 ? They do O(n3) ops on O(n2) data
So computational intensity (2n3)/(4n2) = n/2 – big at last!
Good for machines with caches, other mem. hierarchy levels
How much BLAS1/2/3 code so far (all at www.netlib.org/blas)
Source: 142 routines, 31K LOC, Testing: 28K LOC
Reference (unoptimized) implementation only
Ex: 3 nested loops for GEMM
Lots more optimized code (eg Homework 1)
Motivates “automatic tuning” of the BLAS
Part of standard math libraries (eg AMD AMCL, Intel MKL)
№11 слайд
![A brief history of Dense](/documents_6/f52151e43a7ccc88c4d673f344686f88/img10.jpg)
Содержание слайда: A brief history of (Dense) Linear Algebra software (4/7)
LAPACK – “Linear Algebra PACKage” - uses BLAS-3 (1989 – now)
Ex: Obvious way to express Gaussian Elimination (GE) is adding multiples of one row to other rows – BLAS-1
How do we reorganize GE to use BLAS-3 ? (details later)
Contents of LAPACK (summary)
Algorithms we can turn into (nearly) 100% BLAS 3
Linear Systems: solve Ax=b for x
Least Squares: choose x to minimize ||Ax-b||2
Algorithms that are only 50% BLAS 3
Eigenproblems: Find and x where Ax = x
Singular Value Decomposition (SVD)
Generalized problems (eg Ax = Bx)
Error bounds for everything
Lots of variants depending on A’s structure (banded, A=AT, etc)
How much code? (Release 3.3, Nov 2010) (www.netlib.org/lapack)
Source: 1586 routines, 500K LOC, Testing: 363K LOC
Ongoing development (at UCB and elsewhere) (class projects!)
№12 слайд
![A brief history of Dense](/documents_6/f52151e43a7ccc88c4d673f344686f88/img11.jpg)
Содержание слайда: A brief history of (Dense) Linear Algebra software (5/7)
Is LAPACK parallel?
Only if the BLAS are parallel (possible in shared memory)
ScaLAPACK – “Scalable LAPACK” (1995 – now)
For distributed memory – uses MPI
More complex data structures, algorithms than LAPACK
Only (small) subset of LAPACK’s functionality available
Details later (class projects!)
All at www.netlib.org/scalapack
№13 слайд
![Success Stories for Sca](/documents_6/f52151e43a7ccc88c4d673f344686f88/img12.jpg)
Содержание слайда: Success Stories for Sca/LAPACK (6/7)
Widely used
Adopted by Mathworks, Cray, Fujitsu, HP, IBM, IMSL, Intel, NAG, NEC, SGI, …
5.5M webhits/year @ Netlib (incl. CLAPACK, LAPACK95)
New Science discovered through the solution of dense matrix systems
Nature article on the flat universe used ScaLAPACK
Other articles in Physics Review B that also use it
1998 Gordon Bell Prize
www.nersc.gov/news/reports/newNERSCresults050703.pdf
№16 слайд
![Review Nave Sequential MatMul](/documents_6/f52151e43a7ccc88c4d673f344686f88/img15.jpg)
Содержание слайда: Review: Naïve Sequential MatMul: C = C + A*B
for i = 1 to n
{read row i of A into fast memory, n2 reads}
for j = 1 to n
{read C(i,j) into fast memory, n2 reads}
{read column j of B into fast memory, n3 reads}
for k = 1 to n
C(i,j) = C(i,j) + A(i,k) * B(k,j)
{write C(i,j) back to slow memory, n2 writes}
№19 слайд
![Communication Lower Bounds](/documents_6/f52151e43a7ccc88c4d673f344686f88/img18.jpg)
Содержание слайда: Communication Lower Bounds: Prior Work on Matmul
Assume n3 algorithm (i.e. not Strassen-like)
Sequential case, with fast memory of size M
Lower bound on #words moved to/from slow memory = (n3 / M1/2 ) [Hong, Kung, 81]
Attained using blocked or cache-oblivious algorithms
№20 слайд
![New lower bound for all](/documents_6/f52151e43a7ccc88c4d673f344686f88/img19.jpg)
Содержание слайда: New lower bound for all “direct” linear algebra
Holds for
BLAS, LU, QR, eig, SVD, tensor contractions, …
Some whole programs (sequences of these operations, no matter how they are interleaved, eg computing Ak)
Dense and sparse matrices (where #flops << n3 )
Sequential and parallel algorithms
Some graph-theoretic algorithms (eg Floyd-Warshall)
№21 слайд
![Can we attain these lower](/documents_6/f52151e43a7ccc88c4d673f344686f88/img20.jpg)
Содержание слайда: Can we attain these lower bounds?
Do conventional dense algorithms as implemented in LAPACK and ScaLAPACK attain these bounds?
Mostly not
If not, are there other algorithms that do?
Yes
Goals for algorithms:
Minimize #words = (#flops/ M1/2 )
Minimize #messages = (#flops/ M3/2 )
Need new data structures
Minimize for multiple memory hierarchy levels
Cache-oblivious algorithms would be simplest
Fewest flops when matrix fits in fastest memory
Cache-oblivious algorithms don’t always attain this
Attainable for nearly all dense linear algebra
Just a few prototype implementations so far (class projects!)
Only a few sparse algorithms so far (eg Cholesky)
№22 слайд
![A brief future look at Dense](/documents_6/f52151e43a7ccc88c4d673f344686f88/img21.jpg)
Содержание слайда: A brief future look at (Dense) Linear Algebra software (7/7)
PLASMA and MAGMA (now)
Planned extensions to Multicore/GPU/Heterogeneous
Can one software infrastructure accommodate all algorithms and platforms of current (future) interest?
How much code generation and tuning can we automate?
Details later (Class projects!)
Other related projects
BLAST Forum (www.netlib.org/blas/blast-forum)
Attempt to extend BLAS to other languages, add some new functions, sparse matrices, extra-precision, interval arithmetic
Only partly successful (extra-precise BLAS used in latest LAPACK)
FLAME (www.cs.utexas.edu/users/flame/)
Formal Linear Algebra Method Environment
Attempt to automate code generation across multiple platforms
№25 слайд
![For all linear algebra](/documents_6/f52151e43a7ccc88c4d673f344686f88/img24.jpg)
Содержание слайда: For all linear algebra problems:
Ex: LAPACK Table of Contents
Linear Systems
Least Squares
Overdetermined, underdetermined
Unconstrained, constrained, weighted
Eigenvalues and vectors of Symmetric Matrices
Standard (Ax = λx), Generalized (Ax=λBx)
Eigenvalues and vectors of Unsymmetric matrices
Eigenvalues, Schur form, eigenvectors, invariant subspaces
Standard, Generalized
Singular Values and vectors (SVD)
Standard, Generalized
Level of detail
Simple Driver
Expert Drivers with error bounds, extra-precision, other options
Lower level routines (“apply certain kind of orthogonal transformation”)
№26 слайд
![For all matrix problem](/documents_6/f52151e43a7ccc88c4d673f344686f88/img25.jpg)
Содержание слайда: For all matrix/problem structures:
Ex: LAPACK Table of Contents
BD – bidiagonal
GB – general banded
GE – general
GG – general , pair
GT – tridiagonal
HB – Hermitian banded
HE – Hermitian
HG – upper Hessenberg, pair
HP – Hermitian, packed
HS – upper Hessenberg
OR – (real) orthogonal
OP – (real) orthogonal, packed
PB – positive definite, banded
PO – positive definite
PP – positive definite, packed
PT – positive definite, tridiagonal
№27 слайд
![For all matrix problem](/documents_6/f52151e43a7ccc88c4d673f344686f88/img26.jpg)
Содержание слайда: For all matrix/problem structures:
Ex: LAPACK Table of Contents
BD – bidiagonal
GB – general banded
GE – general
GG – general , pair
GT – tridiagonal
HB – Hermitian banded
HE – Hermitian
HG – upper Hessenberg, pair
HP – Hermitian, packed
HS – upper Hessenberg
OR – (real) orthogonal
OP – (real) orthogonal, packed
PB – positive definite, banded
PO – positive definite
PP – positive definite, packed
PT – positive definite, tridiagonal
№28 слайд
![For all matrix problem](/documents_6/f52151e43a7ccc88c4d673f344686f88/img27.jpg)
Содержание слайда: For all matrix/problem structures:
Ex: LAPACK Table of Contents
BD – bidiagonal
GB – general banded
GE – general
GG – general, pair
GT – tridiagonal
HB – Hermitian banded
HE – Hermitian
HG – upper Hessenberg, pair
HP – Hermitian, packed
HS – upper Hessenberg
OR – (real) orthogonal
OP – (real) orthogonal, packed
PB – positive definite, banded
PO – positive definite
PP – positive definite, packed
PT – positive definite, tridiagonal
№29 слайд
![For all matrix problem](/documents_6/f52151e43a7ccc88c4d673f344686f88/img28.jpg)
Содержание слайда: For all matrix/problem structures:
Ex: LAPACK Table of Contents
BD – bidiagonal
GB – general banded
GE – general
GG – general, pair
GT – tridiagonal
HB – Hermitian banded
HE – Hermitian
HG – upper Hessenberg, pair
HP – Hermitian, packed
HS – upper Hessenberg
OR – (real) orthogonal
OP – (real) orthogonal, packed
PB – positive definite, banded
PO – positive definite
PP – positive definite, packed
PT – positive definite, tridiagonal
№30 слайд
![For all matrix problem](/documents_6/f52151e43a7ccc88c4d673f344686f88/img29.jpg)
Содержание слайда: For all matrix/problem structures:
Ex: LAPACK Table of Contents
BD – bidiagonal
GB – general banded
GE – general
GG – general, pair
GT – tridiagonal
HB – Hermitian banded
HE – Hermitian
HG – upper Hessenberg, pair
HP – Hermitian, packed
HS – upper Hessenberg
OR – (real) orthogonal
OP – (real) orthogonal, packed
PB – positive definite, banded
PO – positive definite
PP – positive definite, packed
PT – positive definite, tridiagonal
№31 слайд
![For all matrix problem](/documents_6/f52151e43a7ccc88c4d673f344686f88/img30.jpg)
Содержание слайда: For all matrix/problem structures:
Ex: LAPACK Table of Contents
BD – bidiagonal
GB – general banded
GE – general
GG – general, pair
GT – tridiagonal
HB – Hermitian banded
HE – Hermitian
HG – upper Hessenberg, pair
HP – Hermitian, packed
HS – upper Hessenberg
OR – (real) orthogonal
OP – (real) orthogonal, packed
PB – positive definite, banded
PO – positive definite
PP – positive definite, packed
PT – positive definite, tridiagonal
№32 слайд
![For all matrix problem](/documents_6/f52151e43a7ccc88c4d673f344686f88/img31.jpg)
Содержание слайда: For all matrix/problem structures:
Ex: LAPACK Table of Contents
BD – bidiagonal
GB – general banded
GE – general
GG – general , pair
GT – tridiagonal
HB – Hermitian banded
HE – Hermitian
HG – upper Hessenberg, pair
HP – Hermitian, packed
HS – upper Hessenberg
OR – (real) orthogonal
OP – (real) orthogonal, packed
PB – positive definite, banded
PO – positive definite
PP – positive definite, packed
PT – positive definite, tridiagonal
№33 слайд
![For all matrix problem](/documents_6/f52151e43a7ccc88c4d673f344686f88/img32.jpg)
Содержание слайда: For all matrix/problem structures:
Ex: LAPACK Table of Contents
BD – bidiagonal
GB – general banded
GE – general
GG – general, pair
GT – tridiagonal
HB – Hermitian banded
HE – Hermitian
HG – upper Hessenberg, pair
HP – Hermitian, packed
HS – upper Hessenberg
OR – (real) orthogonal
OP – (real) orthogonal, packed
PB – positive definite, banded
PO – positive definite
PP – positive definite, packed
PT – positive definite, tridiagonal
№37 слайд
![Parallel Matrix-Vector](/documents_6/f52151e43a7ccc88c4d673f344686f88/img36.jpg)
Содержание слайда: Parallel Matrix-Vector Product
Compute y = y + A*x, where A is a dense matrix
Layout:
1D row blocked
A(i) refers to the n by n/p block row that processor i owns,
x(i) and y(i) similarly refer to segments of x,y owned by i
Algorithm:
Foreach processor i
Broadcast x(i)
Compute y(i) = A(i)*x
Algorithm uses the formula
y(i) = y(i) + A(i)*x = y(i) + j A(i,j)*x(j)
№39 слайд
![Parallel Matrix Multiply](/documents_6/f52151e43a7ccc88c4d673f344686f88/img38.jpg)
Содержание слайда: Parallel Matrix Multiply
Computing C=C+A*B
Using basic algorithm: 2*n3 Flops
Variables are:
Data layout
Topology of machine
Scheduling communication
Use of performance models for algorithm design
Message Time = “latency” + #words * time-per-word
= + n*
Efficiency (in any model):
serial time / (p * parallel time)
perfect (linear) speedup efficiency = 1
№40 слайд
![Matrix Multiply with D Column](/documents_6/f52151e43a7ccc88c4d673f344686f88/img39.jpg)
Содержание слайда: Matrix Multiply with 1D Column Layout
Assume matrices are n x n and n is divisible by p
A(i) refers to the n by n/p block column that processor i owns (similiarly for B(i) and C(i))
B(i,j) is the n/p by n/p sublock of B(i)
in rows j*n/p through (j+1)*n/p - 1
Algorithm uses the formula
C(i) = C(i) + A*B(i) = C(i) + j A(j)*B(j,i)
№41 слайд
![Matrix Multiply D Layout on](/documents_6/f52151e43a7ccc88c4d673f344686f88/img40.jpg)
Содержание слайда: Matrix Multiply: 1D Layout on Bus or Ring
Algorithm uses the formula
C(i) = C(i) + A*B(i) = C(i) + j A(j)*B(j,i)
First consider a bus-connected machine without broadcast: only one pair of processors can communicate at a time (ethernet)
Second consider a machine with processors on a ring: all processors may communicate with nearest neighbors simultaneously
№42 слайд
![MatMul D layout on Bus](/documents_6/f52151e43a7ccc88c4d673f344686f88/img41.jpg)
Содержание слайда: MatMul: 1D layout on Bus without Broadcast
Naïve algorithm:
C(myproc) = C(myproc) + A(myproc)*B(myproc,myproc)
for i = 0 to p-1
for j = 0 to p-1 except i
if (myproc == i) send A(i) to processor j
if (myproc == j)
receive A(i) from processor i
C(myproc) = C(myproc) + A(i)*B(i,myproc)
barrier
Cost of inner loop:
computation: 2*n*(n/p)2 = 2*n3/p2
communication: + *n2 /p
№43 слайд
![Nave MatMul continued Cost of](/documents_6/f52151e43a7ccc88c4d673f344686f88/img42.jpg)
Содержание слайда: Naïve MatMul (continued)
Cost of inner loop:
computation: 2*n*(n/p)2 = 2*n3/p2
communication: + *n2 /p … approximately
Only 1 pair of processors (i and j) are active on any iteration,
and of those, only i is doing computation
=> the algorithm is almost entirely serial
Running time:
= (p*(p-1) + 1)*computation + p*(p-1)*communication
2*n3 + p2* + p*n2*
This is worse than the serial time and grows with p.
№45 слайд
![Matmul for D layout on a](/documents_6/f52151e43a7ccc88c4d673f344686f88/img44.jpg)
Содержание слайда: Matmul for 1D layout on a Processor Ring
Time of inner loop = 2*( + *n2/p) + 2*n*(n/p)2
Total Time = 2*n* (n/p)2 + (p-1) * Time of inner loop
2*n3/p + 2*p* + 2**n2
(Nearly) Optimal for 1D layout on Ring or Bus, even with Broadcast:
Perfect speedup for arithmetic
A(myproc) must move to each other processor, costs at least
(p-1)*cost of sending n*(n/p) words
Parallel Efficiency = 2*n3 / (p * Total Time)
= 1/(1 + * p2/(2*n3) + * p/(2*n) )
= 1/ (1 + O(p/n))
Grows to 1 as n/p increases (or and shrink)
But far from communication lower bound
№47 слайд
![Cannon s Algorithm C i,j C](/documents_6/f52151e43a7ccc88c4d673f344686f88/img46.jpg)
Содержание слайда: Cannon’s Algorithm
… C(i,j) = C(i,j) + A(i,k)*B(k,j)
… assume s = sqrt(p) is an integer
forall i=0 to s-1 … “skew” A
left-circular-shift row i of A by i
… so that A(i,j) overwritten by A(i,(j+i)mod s)
forall i=0 to s-1 … “skew” B
up-circular-shift column i of B by i
… so that B(i,j) overwritten by B((i+j)mod s), j)
for k=0 to s-1 … sequential
forall i=0 to s-1 and j=0 to s-1 … all processors in parallel
C(i,j) = C(i,j) + A(i,j)*B(i,j)
left-circular-shift each row of A by 1
up-circular-shift each column of B by 1
№51 слайд
![Cost of Cannon s Algorithm](/documents_6/f52151e43a7ccc88c4d673f344686f88/img50.jpg)
Содержание слайда: Cost of Cannon’s Algorithm
forall i=0 to s-1 … recall s = sqrt(p)
left-circular-shift row i of A by i … cost ≤ s*( + *n2/p)
forall i=0 to s-1
up-circular-shift column i of B by i … cost ≤ s*( + *n2/p)
for k=0 to s-1
forall i=0 to s-1 and j=0 to s-1
C(i,j) = C(i,j) + A(i,j)*B(i,j) … cost = 2*(n/s)3 = 2*n3/p3/2
left-circular-shift each row of A by 1 … cost = + *n2/p
up-circular-shift each column of B by 1 … cost = + *n2/p
№52 слайд
![Cannon s Algorithm is optimal](/documents_6/f52151e43a7ccc88c4d673f344686f88/img51.jpg)
Содержание слайда: Cannon’s Algorithm is “optimal”
Optimal means
Considering only O(n3) matmul algs (not Strassen)
Considering only O(n2/p) storage per processor
Use communication lower bound #words = (#flops/M1/2)
Sequential: a processor doing n3 flops from matmul with a fast memory of size M must do (n3 / M1/2 ) references to slow memory
Parallel: a processor doing f =#flops from matmul with a local memory of size M must do (f / M1/2 ) references to remote memory
Local memory = O(n2/p) , f n3/p for at least one proc.
So f / M1/2 = ((n3/p)/ (n2/p)1/2 ) = ( n2/p1/2 )
#Messages = ( #words sent / max message size) = ( (n2/p1/2)/(n2/p)) = (p1/2)
№53 слайд
![Pros and Cons of Cannon So](/documents_6/f52151e43a7ccc88c4d673f344686f88/img52.jpg)
Содержание слайда: Pros and Cons of Cannon
So what if it’s “optimal”, is it fast?
Local computation one call to (optimized) matrix-multiply
Hard to generalize for
p not a perfect square
A and B not square
Dimensions of A, B not perfectly divisible by s=sqrt(p)
A and B not “aligned” in the way they are stored on processors
block-cyclic layouts
Needs extra memory for copies of local matrices
№54 слайд
![SUMMA Algorithm SUMMA](/documents_6/f52151e43a7ccc88c4d673f344686f88/img53.jpg)
Содержание слайда: SUMMA Algorithm
SUMMA = Scalable Universal Matrix Multiply
Slightly less efficient, but simpler and easier to generalize
Presentation from van de Geijn and Watts
www.netlib.org/lapack/lawns/lawn96.ps
Similar ideas appeared many times
Used in practice in PBLAS = Parallel BLAS
www.netlib.org/lapack/lawns/lawn100.ps
№60 слайд
![Summary of Parallel Matrix](/documents_6/f52151e43a7ccc88c4d673f344686f88/img59.jpg)
Содержание слайда: Summary of Parallel Matrix Multiplication so far
1D Layout
Bus without broadcast - slower than serial
Nearest neighbor communication on a ring (or bus with broadcast): Efficiency = 1/(1 + O(p/n))
2D Layout – one copy of all matrices (O(n2/p) per processor)
Cannon
Efficiency = 1/(1+O(sqrt(p) /n+* sqrt(p) /n)) – optimal!
Hard to generalize for general p, n, block cyclic, alignment
SUMMA
Efficiency = 1/(1 + O(log p * p / (b*n2) + log p * sqrt(p) /n))
Very General
b small => less memory, lower efficiency
b large => more memory, high efficiency
Used in practice (PBLAS)
№61 слайд
![Summary of Parallel Matrix](/documents_6/f52151e43a7ccc88c4d673f344686f88/img60.jpg)
Содержание слайда: Summary of Parallel Matrix Multiplication so far
1D Layout
Bus without broadcast - slower than serial
Nearest neighbor communication on a ring (or bus with broadcast): Efficiency = 1/(1 + O(p/n))
2D Layout – one copy of all matrices (O(n2/p) per processor)
Cannon
Efficiency = 1/(1+O(sqrt(p) /n+* sqrt(p) /n)) – optimal!
Hard to generalize for general p, n, block cyclic, alignment
SUMMA
Efficiency = 1/(1 + O(log p * p / (b*n2) + log p * sqrt(p) /n))
Very General
b small => less memory, lower efficiency
b large => more memory, high efficiency
Used in practice (PBLAS)
№68 слайд
![Recursive Layouts For both](/documents_6/f52151e43a7ccc88c4d673f344686f88/img67.jpg)
Содержание слайда: Recursive Layouts
For both cache hierarchies and parallelism, recursive layouts may be useful
Z-Morton, U-Morton, and X-Morton Layout
Also Hilbert layout and others
What about the user’s view?
Fortunately, many problems can be solved on a permutation
Never need to actually change the user’s layout
№71 слайд
![Recursive Factorizations Just](/documents_6/f52151e43a7ccc88c4d673f344686f88/img70.jpg)
Содержание слайда: Recursive Factorizations
Just as accurate as conventional method
Same number of operations
Automatic variable blocking
Level 1 and 3 BLAS only !
Extreme clarity and simplicity of expression
Highly efficient
The recursive formulation is just a rearrangement of the point-wise LINPACK algorithm
The standard error analysis applies (assuming the matrix operations are computed the “conventional” way).
№85 слайд
![Work-Depth Model of](/documents_6/f52151e43a7ccc88c4d673f344686f88/img84.jpg)
Содержание слайда: Work-Depth Model of Parallelism
The work depth model:
The simplest model is used
For algorithm design, independent of a machine
The work, W, is the total number of operations
The depth, D, is the longest chain of dependencies
The parallelism, P, is defined as W/D
Specific examples include:
circuit model, each input defines a graph with ops at nodes
vector model, each step is an operation on a vector of elements
language model, where set of operations defined by language
№86 слайд
![Latency Bandwidth Model](/documents_6/f52151e43a7ccc88c4d673f344686f88/img85.jpg)
Содержание слайда: Latency Bandwidth Model
Network of fixed number P of processors
fully connected
each with local memory
Latency ()
accounts for varying performance with number of messages
gap (g) in logP model may be more accurate cost if messages are pipelined
Inverse bandwidth ()
accounts for performance varying with volume of data
Efficiency (in any model):
serial time / (p * parallel time)
perfect (linear) speedup efficiency = 1
№89 слайд
![Motivation Basic Linear](/documents_6/f52151e43a7ccc88c4d673f344686f88/img88.jpg)
Содержание слайда: Motivation (1)
3 Basic Linear Algebra Problems
Linear Equations: Solve Ax=b for x
Least Squares: Find x that minimizes ||r||2 ri2 where r=Ax-b
Statistics: Fitting data with simple functions
3a. Eigenvalues: Find and x where Ax = x
Vibration analysis, e.g., earthquakes, circuits
3b. Singular Value Decomposition: ATAx=2x
Data fitting, Information retrieval
Lots of variations depending on structure of A
A symmetric, positive definite, banded, …
№90 слайд
![Motivation Why dense A, as](/documents_6/f52151e43a7ccc88c4d673f344686f88/img89.jpg)
Содержание слайда: Motivation (2)
Why dense A, as opposed to sparse A?
Many large matrices are sparse, but …
Dense algorithms easier to understand
Some applications yields large dense matrices
LINPACK Benchmark (www.top500.org)
“How fast is your computer?” = “How fast can you solve dense Ax=b?”
Large sparse matrix algorithms often yield smaller (but still large) dense problems
Do ParLab Apps most use small dense matrices?
№91 слайд
![Algorithms for D D Poisson](/documents_6/f52151e43a7ccc88c4d673f344686f88/img90.jpg)
Содержание слайда: Algorithms for 2D (3D) Poisson Equation (N = n2 (n3) vars)
Algorithm Serial PRAM Memory #Procs
Dense LU N3 N N2 N2
Band LU N2 (N7/3) N N3/2 (N5/3) N (N4/3)
Jacobi N2 (N5/3) N (N2/3) N N
Explicit Inv. N2 log N N2 N2
Conj.Gradients N3/2 (N4/3) N1/2(1/3) *log N N N
Red/Black SOR N3/2 (N4/3) N1/2 (N1/3) N N
Sparse LU N3/2 (N2) N1/2 N*log N (N4/3) N
FFT N*log N log N N N
Multigrid N log2 N N N
Lower bound N log N N
PRAM is an idealized parallel model with zero cost communication
Reference: James Demmel, Applied Numerical Linear Algebra, SIAM, 1997
(Note: corrected complexities for 3D case from last lecture!).
№92 слайд
![Lessons and Questions](/documents_6/f52151e43a7ccc88c4d673f344686f88/img91.jpg)
Содержание слайда: Lessons and Questions (1)
Structure of the problem matters
Cost of solution can vary dramatically (n3 to n)
Many other examples
Some structure can be figured out automatically
“A\b” can figure out symmetry, some sparsity
Some structures known only to (smart) user
If performance not critical, user may be happy to settle for A\b
How much of this goes into the motifs?
How much should we try to help user choose?
Tuning, but at algorithmic choice level (SALSA)
Motifs overlap
Dense, sparse, (un)structured grids, spectral
№93 слайд
![Organizing Linear Algebra By](/documents_6/f52151e43a7ccc88c4d673f344686f88/img92.jpg)
Содержание слайда: Organizing Linear Algebra (1)
By Operations
Low level (eg mat-mul: BLAS)
Standard level (eg solve Ax=b, Ax=λx: Sca/LAPACK)
Applications level (eg systems & control: SLICOT)
By Performance/accuracy tradeoffs
“Direct methods” with guarantees vs “iterative methods” that may work faster and accurately enough
By Structure
Storage
Dense
columnwise, rowwise, 2D block cyclic, recursive space-filling curves
Banded, sparse (many flavors), black-box, …
Mathematical
Symmetries, positive definiteness, conditioning, …
As diverse as the world being modeled
№95 слайд
![For all linear algebra](/documents_6/f52151e43a7ccc88c4d673f344686f88/img94.jpg)
Содержание слайда: For all linear algebra problems:
Ex: LAPACK Table of Contents
Linear Systems
Least Squares
Overdetermined, underdetermined
Unconstrained, constrained, weighted
Eigenvalues and vectors of Symmetric Matrices
Standard (Ax = λx), Generallzed (Ax=λxB)
Eigenvalues and vectors of Unsymmetric matrices
Eigenvalues, Schur form, eigenvectors, invariant subspaces
Standard, Generalized
Singular Values and vectors (SVD)
Standard, Generalized
Level of detail
Simple Driver
Expert Drivers with error bounds, extra-precision, other options
Lower level routines (“apply certain kind of orthogonal transformation”)
№96 слайд
![For all matrix problem](/documents_6/f52151e43a7ccc88c4d673f344686f88/img95.jpg)
Содержание слайда: For all matrix/problem structures:
Ex: LAPACK Table of Contents
BD – bidiagonal
GB – general banded
GE – general
GG – general , pair
GT – tridiagonal
HB – Hermitian banded
HE – Hermitian
HG – upper Hessenberg, pair
HP – Hermitian, packed
HS – upper Hessenberg
OR – (real) orthogonal
OP – (real) orthogonal, packed
PB – positive definite, banded
PO – positive definite
PP – positive definite, packed
PT – positive definite, tridiagonal
№97 слайд
![For all matrix problem](/documents_6/f52151e43a7ccc88c4d673f344686f88/img96.jpg)
Содержание слайда: For all matrix/problem structures:
Ex: LAPACK Table of Contents
BD – bidiagonal
GB – general banded
GE – general
GG – general , pair
GT – tridiagonal
HB – Hermitian banded
HE – Hermitian
HG – upper Hessenberg, pair
HP – Hermitian, packed
HS – upper Hessenberg
OR – (real) orthogonal
OP – (real) orthogonal, packed
PB – positive definite, banded
PO – positive definite
PP – positive definite, packed
PT – positive definite, tridiagonal
№98 слайд
![For all matrix problem](/documents_6/f52151e43a7ccc88c4d673f344686f88/img97.jpg)
Содержание слайда: For all matrix/problem structures:
Ex: LAPACK Table of Contents
BD – bidiagonal
GB – general banded
GE – general
GG – general, pair
GT – tridiagonal
HB – Hermitian banded
HE – Hermitian
HG – upper Hessenberg, pair
HP – Hermitian, packed
HS – upper Hessenberg
OR – (real) orthogonal
OP – (real) orthogonal, packed
PB – positive definite, banded
PO – positive definite
PP – positive definite, packed
PT – positive definite, tridiagonal
№99 слайд
![For all matrix problem](/documents_6/f52151e43a7ccc88c4d673f344686f88/img98.jpg)
Содержание слайда: For all matrix/problem structures:
Ex: LAPACK Table of Contents
BD – bidiagonal
GB – general banded
GE – general
GG – general, pair
GT – tridiagonal
HB – Hermitian banded
HE – Hermitian
HG – upper Hessenberg, pair
HP – Hermitian, packed
HS – upper Hessenberg
OR – (real) orthogonal
OP – (real) orthogonal, packed
PB – positive definite, banded
PO – positive definite
PP – positive definite, packed
PT – positive definite, tridiagonal
№100 слайд
![For all matrix problem](/documents_6/f52151e43a7ccc88c4d673f344686f88/img99.jpg)
Содержание слайда: For all matrix/problem structures:
Ex: LAPACK Table of Contents
BD – bidiagonal
GB – general banded
GE – general
GG – general, pair
GT – tridiagonal
HB – Hermitian banded
HE – Hermitian
HG – upper Hessenberg, pair
HP – Hermitian, packed
HS – upper Hessenberg
OR – (real) orthogonal
OP – (real) orthogonal, packed
PB – positive definite, banded
PO – positive definite
PP – positive definite, packed
PT – positive definite, tridiagonal
№101 слайд
![For all matrix problem](/documents_6/f52151e43a7ccc88c4d673f344686f88/img100.jpg)
Содержание слайда: For all matrix/problem structures:
Ex: LAPACK Table of Contents
BD – bidiagonal
GB – general banded
GE – general
GG – general , pair
GT – tridiagonal
HB – Hermitian banded
HE – Hermitian
HG – upper Hessenberg, pair
HP – Hermitian, packed
HS – upper Hessenberg
OR – (real) orthogonal
OP – (real) orthogonal, packed
PB – positive definite, banded
PO – positive definite
PP – positive definite, packed
PT – positive definite, tridiagonal
№102 слайд
![For all matrix problem](/documents_6/f52151e43a7ccc88c4d673f344686f88/img101.jpg)
Содержание слайда: For all matrix/problem structures:
Ex: LAPACK Table of Contents
BD – bidiagonal
GB – general banded
GE – general
GG – general, pair
GT – tridiagonal
HB – Hermitian banded
HE – Hermitian
HG – upper Hessenberg, pair
HP – Hermitian, packed
HS – upper Hessenberg
OR – (real) orthogonal
OP – (real) orthogonal, packed
PB – positive definite, banded
PO – positive definite
PP – positive definite, packed
PT – positive definite, tridiagonal
Скачать все slide презентации Dense Linear Algebra: History and Structure, Parallel Matrix Multiplication одним архивом:
Похожие презентации
-
Matrix Equations and Systems of Linear Equations
-
A block version of Gmres, Bicg, Bicgstab for linear systems with multiple right-hand sides
-
Straight lines and Linear Equations
-
Boolean algebra. Logic operations. Formula and their conversion
-
Principal of geometry and Some Applications of Crystal Structure in Materials
-
Conditional probability and the multiplication rule
-
Логика высказываний и булевы алгебры (Boolean Algebra and Logic)
-
Cеребренникова Ирина Павловна учитель математики МОУ «Вырицкая СОШ 1» Стаж работы 25 лет Вторая категория E-mail iserebro47yandex. ru
-
Автор: Галдин В. А. МБОУ ЛСОШ 3 п. Локоть Брасовского р-на Электронный адрес: galdin. vasyandex. ru
-
Автор: Галдин В. А. Учитель математики и физики МБОУ ЛСОШ 3 п. Локоть Брасовского р-на Электронная поста: galdin. vasyandex. ru