Schedule
MVP6: 4.110 at Kroghstræde 3.
MVP11: 4.112 at Kroghstræde 3.
The schedule is preliminary and subject to change.
Since the exercises count towards your assignments, no solutions will be
provided until you have all finished them.
Lecture 1:
Introduction to Parallel Computing,
Parallel Programming Platforms.
Abstract:
In this lecture, I present the course, its goals, the textbook,
etc. Then I start the course with an introduction to
parallelism, motivating it and presenting its scope. I also
begin the chapter on parallel programming platforms, which serves
mostly as an introduction to parallelism from the hardware
perspective.
Reading:
Pointers and memory,
Chapter 1, Sections 2.1, 2.2, and 2.3.
Material:
C introduction,
introduction slides.
Exercises:
During this first exercise session, you are expected to refresh
your skills in C by looking at the C introduction slides on your
own and by doing the following exercises:
- Write a function that, given two matrices A and B of
dimension nxn, computes the result of the matrix
multiplication AxB and writes the result into a matrix C
(a minimal sketch is given after this list).
To test your function, implement Fibonacci numbers using the
simple (linear) algorithm and the matrix-based (logarithmic)
algorithm; see the
slides
for the algorithms and some implementations.
- Given an array of length n, implement different
functions to invert it. Benchmark your implementations and explain
the results.
- [Optional: by benchmarking read/write access times on arrays (this is a
hint), find out the size of your machine's cache.]
This is the first half of assignment 1.
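For reference, here is a minimal sketch of the naive multiplication itself,
assuming square matrices stored row-major in flat double arrays; the Fibonacci
test, the benchmarking, and the cache exercise are left entirely to you.

    #include <stdlib.h>

    /* Naive n x n matrix multiplication: C = A * B.
     * Matrices are stored row-major in flat arrays of length n*n. */
    void matmul(const double *A, const double *B, double *C, size_t n)
    {
        for (size_t i = 0; i < n; i++) {
            for (size_t j = 0; j < n; j++) {
                double sum = 0.0;
                for (size_t k = 0; k < n; k++)
                    sum += A[i*n + k] * B[k*n + j];
                C[i*n + j] = sum;
            }
        }
    }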
Lecture 2:
Introduction to the Case-Study,
The PRAM Model,
Physical Organization of Parallel Platforms.
Abstract:
I present the case-study that will be used later for a series of
assignments. You can download it here.
Then, I introduce the architecture of an ideal parallel computer,
known as the PRAM model. I use extra material from Sven Skyum to
present Brent's scheduling principle and go through examples of
sorting and merging. Time permitting, I finish with the
interconnection networks of parallel computers.
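As a reminder (the notes give the exact statement; the notation here may
differ slightly), a standard formulation of Brent's scheduling principle is:
if a computation takes time t using a total of q operations on a PRAM with
unboundedly many processors, then it can be executed on p processors in time

    T_p \;\le\; t + \frac{q - t}{p} \;\le\; \frac{q}{p} + t .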
Reading:
Skyum's lecture notes and Section 2.4.
Material:
Model-checker case-study,
Introduction to Parallel Algorithms by
Sven Skyum (BRICS Lecture Series) -- note the typo on p. 6: read "O(n/lg n)
processors" instead of "O(n lg n) processors" --,
PRAM/Organization slides.
Exercises:
Exercises 2.1 (design only, implementation is optional), 2.2, and
2.3. This completes assignment 1.
Assignment 1: Exercises of lecture 1 and
lecture 2. All exercises are compulsory unless stated otherwise.
Deadline March 7th.
Lecture 3:
Physical Organization of Parallel Platforms cont.,
Programming Shared Address Space Platforms
Abstract:
I finish the part on physical organization of parallel platforms
and start (in parallel) pthread programming. From now on we'll
have two threads in the course running in parallel, one on theory,
one on practice.
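As a first taste of the practical thread, here is a minimal pthread sketch
(create, join, and a mutex-protected counter). It only illustrates the API
and is not part of any assignment.

    #include <pthread.h>
    #include <stdio.h>

    #define NTHREADS 4

    static long counter = 0;
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    static void *worker(void *arg)
    {
        (void)arg;
        for (int i = 0; i < 100000; i++) {
            pthread_mutex_lock(&lock);     /* critical section */
            counter++;
            pthread_mutex_unlock(&lock);
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t t[NTHREADS];
        for (int i = 0; i < NTHREADS; i++)
            pthread_create(&t[i], NULL, worker, NULL);
        for (int i = 0; i < NTHREADS; i++)
            pthread_join(t[i], NULL);
        printf("counter = %ld\n", counter);   /* expect 4 * 100000 */
        return 0;
    }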
Reading:
Sections 2.4, 7.1, 7.2, 7.3, 7.4, and 7.5.
Material:
Pthread slides.
Exercises:
Exercise 2.10.
Questions on the case-study (see previous lecture):
- Work out the data-flow diagram of the model-checker.
- What are the data dependencies?
- Which computations can be done in parallel?
- What are the critical sections?
- Find two designs to parallelize the model-checker that are applicable to
the shared-address-space paradigm.
Lecture 4:
POSIX Threads cont.,
Hints for Assignment 3.
Abstract:
I finish Pthreads and give you help on how to parallelize the
model-checker. The slides are not provided here; you are expected to
find your own solution.
Reading: Chapter 7.
Material:
Slides from the previous lecture.
Exercises:
Exercise 2.17.
Assignment 2: Exercises of lecture 3 and
lecture 4. Please do the exercise on the case-study as soon as
possible! Note: The model-checker
has been updated to be more modular.
Deadline March 16th.
Lecture 5:
Communication Costs in Parallel Machines,
Process-Processor Mapping,
Preliminaries to Parallel Algorithm Design.
Abstract:
I discuss different communication costs (message passing and
shared address space), the impact of process-processor mapping, and
start on parallel algorithm design (techniques for decomposing
apparently sequential algorithms into parallel ones).
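As a preview of the communication cost model from the book (section 2.5),
the time to send a message of m words over l links is typically modelled as

    t_{comm} = t_s + l\,t_h + m\,t_w \;\approx\; t_s + m\,t_w ,

where t_s is the startup time, t_h the per-hop time (often negligible), and
t_w the per-word transfer time.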
Reading:
Sections 2.5, 2.6, 2.7, and 3.1.
Material:
Communication costs slides,
process-processor mapping slides,
algorithm design slides.
Assignment 3:
The model-checker has been updated with
a pthread utility support library to help you for this assignment.
- How does detection of termination work in the support library?
- Why is it not an issue for getting traces when a goal state
is found?
- Parallelize the model-checker with pthreads, following one of
your designs. Starting with one shared waiting queue and one shared
state-set is advised (a sketch of that starting point is given after
this assignment), but it is not an acceptable final solution for this
assignment.
Note: There is a
modelchecker.help.c file that you are
expected to use for the exercise.
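To get going on the shared-queue starting point mentioned above, here is a
minimal sketch of a mutex/condition-protected shared waiting queue. The names
are illustrative and are not taken from the support library; termination
detection is deliberately not handled.

    #include <pthread.h>

    /* Illustrative shared waiting queue of states (not the support library API). */
    typedef struct node { void *state; struct node *next; } node_t;

    typedef struct {
        node_t *head;
        pthread_mutex_t lock;
        pthread_cond_t  nonempty;
    } queue_t;

    void queue_init(queue_t *q)
    {
        q->head = NULL;
        pthread_mutex_init(&q->lock, NULL);
        pthread_cond_init(&q->nonempty, NULL);
    }

    void queue_push(queue_t *q, node_t *n)
    {
        pthread_mutex_lock(&q->lock);
        n->next = q->head;
        q->head = n;
        pthread_cond_signal(&q->nonempty);
        pthread_mutex_unlock(&q->lock);
    }

    /* Blocks until a state is available. */
    node_t *queue_pop(queue_t *q)
    {
        pthread_mutex_lock(&q->lock);
        while (q->head == NULL)
            pthread_cond_wait(&q->nonempty, &q->lock);
        node_t *n = q->head;
        q->head = n->next;
        pthread_mutex_unlock(&q->lock);
        return n;
    }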
Lecture 6:
Decomposition Techniques and Tasks.
Mapping Techniques for Load Balancing,
Methods for Containing Interaction Overheads,
Parallel Algorithm Models.
Abstract:
I continue on parallel algorithm design, covering the characteristics
of tasks and their interactions as well as mapping techniques. The point
is to use these techniques to minimize interaction overheads.
Reading:
Chapter 3.
Material:
Algorithm design slides (from last
time) and new slides (for today).
Exercises: Continue on assignment 3.
Lecture 7:
Help for Assignment 3,
Programming Using the Message Passing Paradigm.
Abstract:
I present the API of pthread_utils.h for those who do not yet have an
overview of it, and you have the opportunity to use part of the
lecture to ask questions and get any issues concerning the
assignment clarified. I then start on MPI.
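As a first MPI example (independent of the assignment), a minimal program
with a blocking point-to-point exchange might look like this:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        if (size >= 2) {
            int value;
            if (rank == 0) {
                value = 42;
                MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
            } else if (rank == 1) {
                MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                printf("rank 1 received %d\n", value);
            }
        }
        MPI_Finalize();
        return 0;
    }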
Reading:
Sections 6.1, 6.2, 6.3, 6.4, and 6.5.
Material:
Help slides and
MPI slides.
Exercises: Continue on assignment 3.
Deadline March 28th.
Lecture 8:
MPI cont., MPI in our Model-checker.
Abstract:
I finish MPI and I discuss problems and solutions related to
parallelizing our model-checker using MPI. This is to prepare
assignment 5.
Reading:
Chapter 6 and section 11.4.4.
Material: previous MPI
slides and
the model-checker slides.
Exercises:
3.2, 3.7, and 3.21. [Optional: 2.24, 3.11, 3.12, 3.13]
Solutions:
solutions
for the optional exercises.
Lecture 9:
Basic Communication Operations.
Abstract:
I start chapter 4. It's about how to implement collective
communication operations that we saw during the lectures on
MPI. There are interesting algorithms for that.
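One such algorithm, a one-to-all broadcast from rank 0 by recursive doubling,
can be sketched over point-to-point calls as below; in each round the set of
processes holding the message doubles. This is purely illustrative --
MPI_Bcast already does this (and better) for you.

    #include <mpi.h>

    /* One-to-all broadcast from rank 0 by recursive doubling. */
    void my_bcast_from_0(int *buf, int count, MPI_Comm comm)
    {
        int rank, p;
        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &p);

        for (int mask = 1; mask < p; mask <<= 1) {
            if (rank < mask) {                    /* I already have the data */
                int dest = rank + mask;
                if (dest < p)
                    MPI_Send(buf, count, MPI_INT, dest, 0, comm);
            } else if (rank < 2 * mask) {         /* my turn to receive */
                MPI_Recv(buf, count, MPI_INT, rank - mask, 0, comm,
                         MPI_STATUS_IGNORE);
            }
        }
    }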
Reading:
Sections 4.1, 4.2, 4.3, and 4.4.
Material:
slides.
Exercises:
4.1 and 4.5.
Assignment 4: Exercises of lectures 8 and 9.
Deadline April 18th (2 weeks extra
because some of you are "traveling").
Lecture 10:
Basic Communication Operations cont.
Abstract:
I finish chapter 4.
Reading:
Sections 4.5, 4.6, and 4.7.
Material:
slides from last time.
You may be interested in the tutorial page
of LAM/MPI.
Assignment 5:
This assignment is split over 3 exercise sessions (which do not
necessarily map to the following 3 parts of the exercises).
There is very little coding to do but, on the other
hand, you need to read and analyse quite a lot of code. The
model-checker has been updated
for MPI. Questions:
- The model-checker explores states, receives states, sends
states, detects termination, receives stop or shutdown
notifications, and gets traces in a distributed manner.
Most of it is hidden in the MPI support library and works
automatically through a simple interface.
- Which kinds of messages and protocols are implemented?
- How are blocking and non-blocking communications used? (A small
polling sketch is given after this assignment.)
- How do all these protocols work together?
- During the lectures, I presented a termination
algorithm. A distributed leader-election protocol is also
implemented to avoid races when one or several processes want to
stop because they have found an interesting state and want to
initiate a trace. Of course, only one trace should be printed.
A special protocol is also implemented to print traces.
- How is the termination algorithm implemented? Give details on
where the parts of the algorithm are implemented.
- Give the leader election algorithm used here.
- The traces are printed in a distributed manner. How does this work?
- Finally, you want to make this work with MPI.
- In the comments of the MPI support library you will discover
that the local queue provided here is supposed to be used in a
particular way and to fit together with incoming messages from
other processes. This differs from the slides presented during the
lectures. Update the design from a single process's point of view and
use the local queue.
- Copy modelchecker.mpi-help.c to modelchecker.mpi.c and
complete the code.
- You will see that this MPI version of the modelchecker has
terrible performance. Why? Suggest some improvements to address one
of the major issues.
The debug versions of the model-checker are available on
my account on the system in MVP/.
Note: Progress is not implemented at this
time. You should not add it.
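Relating to the question above on blocking versus non-blocking
communications, a typical polling pattern (not necessarily the one used in
the support library) interleaves local work with non-blocking checks for
pending messages via MPI_Iprobe. A toy, self-contained illustration:

    #include <mpi.h>
    #include <stdio.h>

    /* Run with at least 2 processes: rank 1 polls while "working". */
    int main(int argc, char **argv)
    {
        int rank;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            int state = 7;
            MPI_Send(&state, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            int flag = 0;
            MPI_Status status;
            while (!flag) {
                /* ... do some local work (expand states) ... */
                MPI_Iprobe(0, 0, MPI_COMM_WORLD, &flag, &status);
            }
            int state;
            MPI_Recv(&state, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("rank 1 eventually received state %d\n", state);
        }
        MPI_Finalize();
        return 0;
    }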
Lecture 11:
Analytical Modeling of Parallel Programs.
Abstract:
I treat chapter 5. This chapter looks scary but it's not! It's
actually very interesting.
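As a head start on the notation (which may differ slightly from the book's),
the central quantities are the speedup S, the efficiency E, and the total
overhead T_o, defined from the sequential time T_S, the parallel time T_P,
and the number of processes p as

    S = \frac{T_S}{T_P}, \qquad E = \frac{S}{p}, \qquad T_o = p\,T_P - T_S .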
Reading:
Chapter 5.
Material:
slides.
Exercises:
Continue exercise from lecture 10. [Optional exercise 4.3].
Solution: exercise
4.3.
Lecture 12:
Sorting.
Abstract:
Sorry for swapping dense matrix algorithms and sorting at the last
minute, but I just realized that SW6 will need the lecture on
matrices. I will discuss different issues in sorting on parallel
computers and present some well-known (sequential) sorting algorithms
and their parallel versions.
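As one example of such an algorithm, here is a minimal sequential sketch of
odd-even transposition sort; in the parallel version each process holds a
block of elements and the compare-exchange steps become compare-split steps
between neighbouring processes.

    #include <stddef.h>

    static void swap_int(int *a, int *b) { int t = *a; *a = *b; *b = t; }

    /* Sequential odd-even transposition sort of a[0..n-1]:
     * n phases alternating between even- and odd-indexed pairs. */
    void odd_even_sort(int *a, size_t n)
    {
        for (size_t phase = 0; phase < n; phase++) {
            size_t start = (phase % 2 == 0) ? 0 : 1;
            for (size_t i = start; i + 1 < n; i += 2)
                if (a[i] > a[i + 1])
                    swap_int(&a[i], &a[i + 1]);
        }
    }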
Reading: Chapter 9.
Material:
slides.
Exercises:
Finish exercise from lecture 10.
Deadline for assignment 5: 27th April.
[Optional exercises 5.1 and 5.2].
Solutions: exercises
5.1 and 5.2.
Lecture 13:
Clusters & GRID
Abstract:
Henrik Thostrup will give this lecture.
Reading:
Paper
about GRID - get the basic concepts out of it.
Material: slides
part 1 and
part 2.
Exercises (Assignment 6, part 1):
Study the scalability of your implementations, in particular:
- run experiments (pthread + MPI) with different sizes of a
given problem (multiple readers, single writer is easily
configurable) or different numbers of threads/processes,
- explain your results,
- come up with a reasonable model (in terms of cost) for your
MPI implementation, which should not be scalable as it is -- the
improvement that you are supposed to suggest (previous assignment)
should improve the cost accordingly.
Lecture 14:
Dense Matrix Algorithms.
Abstract:
I go through dense matrix algorithms for matrix*vector and
matrix*matrix multiplications. I give the idea of parallel
versions of Gauss' elimination algorithm.
Reading:
Chapter 8.
Material:
slides.
Exercises (Cluster/GRID - Assignment 6, part 2):
- General
- The idea of a huge virtual supercomputer was rejected in the
lecture. Explain what problems could be encountered in such a
project.
- Describe problems that must be dealt with when sharing resources with
people outside an organization.
- Job Execution
- What are the advantages and disadvantages of centralized and
decentralized brokering?
Decentralized brokering: clients submit directly to resources.
Centralized brokering: clients submit to a central broker, which
decides where jobs are executed.
- Make a list of the main attributes that are needed to check if a job can
execute on a cluster.
- Information System
- What happens if the infrastructure for resource discovery crashes /
becomes unavailable?
- What are the implications of an information system delivering
information that does not accurately describe the state of a
resource?
- Security
- Explain why delegation is used in grids.
- What are the problems of reusing existing security infrastructures in
grids?
Deadline 11th of May for assignment 6.
Assignment 7 (SW6 only):
In this directory you will find
- gaussian_solver.c a matrix inversion program based on
Gauss' elimination algorithm,
- check_id.c a matrix multiplication program that checks
that A*B == Identity,
- test_allreduce.c an MPI program to test
MPI_Allreduce,
- matrices a directory that contains 3 matrices.
Your task is to make a Pthread *or* MPI version of
gaussian_solver.c, where you have to parallelize the solver
itself and its built-in identity checker. The standalone
check_id.c is provided for testing. To do so, use a row
partitioning with a cyclic distribution of rows across processes
to ensure some minimal load balancing (a small illustration of this
distribution is given at the end of this assignment). Since the solver
implements the Gaussian method with pivoting, it is more
difficult to implement the pipelined algorithm as described in the
book, so do not do it. Getting it right with pivoting is enough.
You may want to add a second
inv_index array to the solver
to get the inverse mapping of
index.
The MPI version is recommended, and you are encouraged to work in
groups. You may generate your own test matrices of different
sizes. Test your program and run experiments.
Deadline 31st of May.
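For illustration only (this is not a solution to the assignment), the cyclic
row distribution mentioned above assigns row i to process i mod p. A tiny
standalone program printing the ownership pattern, with hypothetical values
of n and p:

    #include <stdio.h>

    /* Cyclic (round-robin) row distribution: row i belongs to process i % p. */
    int main(void)
    {
        int n = 10;   /* number of rows (hypothetical) */
        int p = 4;    /* number of processes/threads (hypothetical) */
        for (int me = 0; me < p; me++) {
            printf("process %d owns rows:", me);
            for (int i = me; i < n; i += p)   /* i = me, me+p, me+2p, ... */
                printf(" %d", i);
            printf("\n");
        }
        return 0;
    }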
Lecture 15:
Graph Algorithms.
Abstract:
I cover some parallel algorithms for dense graphs.
Reading:
Sections 10.1 to 10.6.
Material:
slides.
Exercises:
Finish any exercises you have not yet completed.