{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Chapter 6: High Performance Computing\n",
"Module 5 of Phys212/Biol212 class\n",
"### by Ilya Nemenman, 2016-2019"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Defining High Performance Computing\n",
"\n",
"Note to self: need to expand the introduction\n",
"Nowadays news are full with articles about new technological breakthroughs in artificial intelligence: self-driving cars, automatic translation, and even computer-generated art. What makes all of this possible, among other things, is a dramatic increase in our computational capabilities in the last decade. Interestingly, the speed of computer processors has not changed that much over this time -- an individual processor now is only a bit faster than is used to be ten years ago. What has changed is that we figured out how to put more processors into a single computer (and sometimes dramatically more -- literally thousands of specialized processors in some video cards) and how to make all of these processors to work collectively on solving large computational problems we want to solve. In this module, we discuss the basics of such parallel computing and will figure out how to implement it on your laptops.\n",
"\n",
"What is high performance computing? To answer this, we need to define a series of terms. \n",
" - **Processor** , or Central Processing Unit (**CPU**): The *brain* of the computer, the part in which most computing operations are controlled and executed. \n",
" - **Serial Processing**: Execution of a single program step by step on one CPU.\n",
" - **Sequential Processing**: Execution of multiple programs on a single CPU in a sequence.\n",
" - **Concurrent Processing**: Execution of multiple programs at the same time. \n",
" - **Parallel Processing**: Execution of multiple programs (or parts of the same one) on multiple CPUs. \n",
" - **High Performance Computing**: It is basically another name for solving computational problems concurrently on multiple CPUs.\n",
" \n",
"It gets a bit more complicated. One can have serial and sequential processing (completing one task before starting the other on a single CPU). One can also have serial and concurrent processing (alternating between two or more tasks on the same single CPU). Finally, one can have triw parallel and concurrent processing, which means executing multiple tasks simultaneously at the same instant of time on multiple CPUs.\n",
"\n",
"### Organization of Parallel Computing\n",
"What are the different ways of organizing parallel processing? We can develop the intuition by considering how you can arrange for multiple people in a group to do a certain complex task that consists of many different subtasks. For example, let's suppose that a group of people are arriving to a supermarket to buy a long list of supplies for a party. How should the long shopping list be split among the group? There are many different ways of doing this.\n",
" - The simplest option is not to delegate different parts of the shopping list to different people, but to have just one person to complete all of the steps of the task. This is equivalent to sequential processing.\n",
" - The second option is to have the group captain assign a list of subtasks to each of the people in the group, and let them all focus on each of their assigned tasks. This incurs communication cost. First, the partitioning of the full list of tasks into pieces must be done, and then these sublists must be communicated to the people. When they complete their individual tasks, the results of the completion must be communicated to the leader. Further, there are additional inefficiencies. While the task is being partitioned, the people will be waiting for their assignments, doing nothing. Then some of them will finish their subtasks before the others, and will again wait, doing nothing.\n",
" - The arrangement above may possibly be improved, where whenever a person finishes his/her task, s/he starts helping the others who are still doing theirs. But this is easier said then done. Whom should they help? To know this, each person must be constantly communicating his/her progress, either broadcasting it to everyone in the group, or at least sending it to the captain, so that either everyone knows who needs help, or the captain can tell everyone. Further, when a helper arrives, the task that is running behind now needs to be partitioned again, which will take additional time and resources.\n",
" - To avoid some of these problems, one may want to duplicate the tasks originally, but prioritize them, so that every person first works on his/her task, and then switches on other pre-assignees tasks, constantly reporting results. You can fill in the gaps of what kind of problems this solution carries with itself.\n",
"\n",
">### Your turn\n",
"Can you suggest other arrangements of how the tasks can be divided over multiple individuals?\n",
"\n",
"Note that, in all of these (and yours) suggestion, the more complicated the arrangement is, the more communication needs to be done. And communication takes time -- so that, at a certain point, communication cost outweighs the savings one will generate from a complex task-partitioning scheme.\n",
"\n",
"With these different arrangements, it makes sense to have a bit finer characterization of concurrent / parallel processing. \n",
" - **Parallel processing** per se: Collection of connected, tightly coupled, strongly communicating processors concurrently solving a problem.\n",
" - **Distributed processing**: Loosely coupled processors, perhaps great distances from each other, working concurrently on problems with minimal communication among them.\n",
"\n",
"True parallel processing is further distinguished by how memory of the computers is organized\n",
" - **Shared memory**: Parallel processors access the same memory during concurrent processing, requiring minimal communication beyond the shared memory.\n",
" - **Distributed memory**: Each processor has it's own memory to operate with, and synchronization of tasks is achieved through communication.\n",
"\n",
"Crucially, a program written for sequential processing won't execute on many processors. It needs to be **parallelized** or **scaled** to many processors. This usually involves careful consideration of how concurrent processing will be organized, how the tasks will be partitioned, how multiple data streams and execution streams will be synchronized, etc. A useful metric of how well the code is parallelized is the **speedup**, the ratio of the time it takes to execute it on $n$ processors compared to execution on $1$. It's essentially impossible to achieve the speedup of larger than $n$, and it is usually less than $n$ due to communication costs. \n",
"\n",
"### Types of parallel algorithms\n",
"Note to self: need pictures of different parallelization schemes.\n",
"Some problems are easier to parallelize than others, and different approaches are required for parallelization. These include:\n",
" - **Embarrassingly parallel algorithms**: The task can be divided into subtasks that have virtually no communication.\n",
"Typically problems such as generating statistics (e.g., producing many DLA clusters of Module 4) are embarrassingly parallelizable. For example, you could have run multiple Jupyter notebooks at the same time, and each would produce a single cluster, whose lengths can then be averaged.\n",
" - **Data partitioning algorithms**: For example, when calculating a larger sum, the data set can be partitioned into parts, each part can be summed independently, and then the sums will be added. Here a *root* process partitions the data into subsets, and then collects the results. The root has a privileged state -- it distributes the tasks, but doesn't execute them itself.\n",
" - **Divide and conquer algorithms**: Work is done without a root. Here each processor gets a task and, if it can't execute it in a requested time, it subdivides the task into two, leaves one half for itself, and recruits another processor for the remaining half. Both of the processors then continue doing the same, until each of the recruited processors has a manageable task. When the tasks are completed, each processor reports its results to the processor who originally gave it the task. \n",
" - **Coarsening algorithms**: Here the processors solve the problem on multiple scales. Note to self: more to be added.\n",
"\n",
"For all of these algorithms, it is crucial to think how the data must be organized to speed up communication."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Problems to be solved with parallel processing\n",
"We already saw a few such problems in Module 4. Essentially every problem solved using agent-based simulations can be parallelized using embarrassingly parallelizeable approach. Embarrassing parallelization won't work, however, for approaches using ceullar automata. The problems we discussed last time (DLA, annihilation) can be solved by data partitioning, where each one of the processors analyzes a sub-part of the lattice.\n",
"\n",
"There is a special type of cellular automata, for which parallel processing is usually used (and, in fact, was developed for originally). These are called *partial differential equations*. Consider a situation where one is trying to predict weather, or, for simplicity, just the temperature $T$ over some extended spatial range, which is what weather forcasters routinely do. So the variable $T$ that we care about depends on the spatial coordinates in addition to time, $T=T(x,y,z,t)$. This is in contrast to all previous examples we considered, where the dynamical variables were few, and they depended just on time, such as the coordinate and the angular velocity of the pendulum, $\\theta(t), \\omega(t)$, or the concentration of nutrients and the number of bacteria, $\\rho(t),n(t)$. Variable that are space and time dependent, like the temperature, are known as *fields*. On a digital computer, we usually approximate the continuous coordinates $x,y,z$ as a grid, and so $T(x,y,z,t)$ becomes a set of variables $T(x_i,y_j,z_k,t)$, with the dimensionality of the set being $N=\\frac{L_x}{\\Delta x}\\frac{L_y}{\\Delta y}\\frac{L_z}{\\Delta z}\\sim \\left(\\frac{L_x}{\\Delta x}\\right)^d$, where $L_x$ is the linear span of the system, $\\Delta x$ is the lattice spacing (and similarly for $y$ and $z$), and $d=3$ is the dimension. Even for reazonably small $\\frac{L_x}{\\Delta x}\\sim 100$, the number of dynamical variables we need to store to describe the dynamics of $T$ is $N\\sim 10^6$. That is, the spatially extended dynamical system is equivalent to a dynamical system with very many (potentially millions) dynamical variables. As you have observed earlier, solving ODEs for even a few dynamical variables can take a substantial amount of time. It is also clear that this time complexity is going to scale as, at least, $O(N)$, and potentially even worse if many variables interact with each other. Thus the need to use all of the available computational resources (and hence parallelizing the code) for such spatially extended simulations is obvious. \n",
"\n",
">### Track 2 \n",
"Notice that the probability distribution of having a certain number of molecules in the Chemical Master Equation (CME), which we studied in Track 2 in the previous module, depends on that number as well as on time, $P=P(n,t)$. So while the CME does not describe a function $P(n,t)$ for a continuous $n$, it becomes very similar to the discretized dynamics of continuous fields, such as $T$, when both are represented on a digital computer. The dimensionality of the dynamics of the CME thus scales as $N\\sim \\prod N_{{\\rm max},i}\\sim N_{\\rm max}^d$, where $N_{{\\rm max},i}$ is the maximum allowable number of molecules of a chemical species $i$ involved in the process, and $d$ is the dimensionality, or the number of these species. Thus if the number of chemical species involved in the dynamics is more than a few (and it can be arbitrary!), the CME becomes a system of astronomically many coupled ODEs, and parallelizing the code becomes even more important than for fields in real spatial dimensions. Note to self: maybe stick with CME instead of diffusion? We will ontroduce one of those here -- the heat diffusion equation, but the Chemical Master Equation from Track 2 in Module 4 is a related example. \n",
"\n",
"### Newton's law of cooling and Fourier's law of heat conductance\n",
"Let's start with a simple example. Suppose you have two bodies in contact with temperatures $T_1$ and $T_2$. How do their temperatures depend on time? To develop a mathematical model of this process, we ... to be continued...\n",
"\n",
"\n",
"### Partial Differential Equations (PDEs): The Diffusion Equation\n",
"The heat diffusion equation -- one of many partial differential equations (differential equations involving partial derivatives).\n",
" - Fick's first law of diffusion $J=-D\\frac{d\\phi}{dx}$ or $\\vec{J}=-D\\nabla \\phi$ (also known as Newton's law of cooling if $\\phi=T$, or the Fourier law of heat conduction).\n",
" - Fick's second law of diffusion $\\partial_t \\phi = -D\\nabla^2 \\phi$.\n",
" - Finite difference form of the heat diffusion equation; need boundary conditions to complete the solution.\n",
" - Boundary conditions (implemented with extending the grid matrix):\n",
" - Absorbing\n",
" - Reflecting\n",
" - Periodic\n",
"\n",
"### Solving the Diffusion Equation using Python\n",
"Talk about instabilities-- TBC."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"# Initialization block\n",
"import numpy as np\n",
"import time\n",
"import matplotlib.pyplot as plt\n",
"\n",
"# The function below implements the single step of the diffusion equation\n",
"def diffusion(D, dt, dx, u):\n",
" \"\"\" compute one time step of diffusion\n",
" D: diffusion constant\n",
" dt: step size in time\n",
" dx: step size in space\n",
" u: concentration on lattice points of diffusing quantity, assumed (L+2)x(L+2) because of the\n",
" boundary conditions\n",
" return: u after one time step \n",
"\n",
" note that because of the boundary conditions, \n",
" the assumed lattice has a boundary belt all around it, so that the lattice is (L+1)x(L+2)\"\"\"\n",
" \n",
" tmp = np.empty((u.shape[0] - 2, u.shape[1] - 2)) \n",
" \n",
" for x in np.arange(1, u.shape[0] - 1):\n",
" for y in np.arange(1, u.shape[1] - 1):\n",
" tmp[x - 1, y - 1] = u[x, y] + dt * D * (u[x - 1, y] + u[x, y - 1] - 4 * u[x, y] + \n",
" u[x + 1, y] + u[x, y + 1]) / dx**2\n",
" \n",
" # we return the lattice LxL\n",
" return tmp"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAO4AAAD7CAYAAABt9agKAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDMuMC4yLCBodHRwOi8vbWF0cGxvdGxpYi5vcmcvOIA7rQAACaFJREFUeJzt22msXWUZR/H1QK1oCzSAAzMx1gGNGmMMGg0YcICIJE4RERFwjhqNOASl0khAjYkfHIgRRW2NsxJUTCQYSKrFMRicYhRbwRYRQaFQleHxw97Fw80d21vLX9YvOcntPvu87z67Z5093La6G0lZdtvVGyBp4QxXCmS4UiDDlQIZrhTIcKVA96twq+rMqrpgsdedx1hdVY+c4bnLq+rVizHPjqqqV1XVuok/b6mqRyzS2Pfsz6o6bNwnSxZp7EPGbd19McZLEBvu+CG7uqpur6rrq+r8qlox22u6+9zunlckC1n3vqiqjqqq63ZkjO5e3t3XLMY8i7k/q2pDVR0zMfafxm29azHGTxAZblW9Hfgg8A5gb+AI4FDg0qpaOsNrFuXbXQvnvt8JujvqAewFbAFeOmX5cuAG4LTxz2cDXwPWArcArx6XrZ14zSuBjcDfgLOADcAxE69fO/58GNDAKcCfgBuB90yM81RgPfB3YDPwMWDpxPMNPHKG93M58H7gB8CtwPeA/SaePwL44Tj2L4CjJp47FfjN+LprgNeNy5cBW4G7x321BThgmrn3BS4e98+Px+1YN912A8cBvx7n+jNwxkzzzLXvJ/bna4FN4z57+8S8nwXOmfjzUcB1489rxvm2jvO9c2K8JeM6B4zv6ybg98BrJsY6G/gK8PnxvfwKeMqu/lwv9JF4xH06sAfwjcmF3b0F+C7w7InFJzB8gFYAX5hcv6oOBz4BnATsz3DkPnCOuZ8BPBo4GlhVVY8dl98FvA3YD3ja+PwbF/CeXs4Q4UOBpQxRUFUHAt8BzgH2GZd/vaoeMr7uBuD5DF9mpwIfqaond/dtwLHAph5OIZd396Zp5v048M/x/Z82PmbyaYYvhj2BxwPfn2OeGff9hGcBK4HnAO+ePP2dSXefzPDlefw434emWe2LwHUMAb8YOLeqjp54/gXAl8Ztu5jhizZKYrj7ATd2953TPLd5fH6b9d19UXff3d1bp6z7YuBb3b2uu/8NrGL41p7N6u7e2t2/YDj6PRGgu3/W3Vd2953dvQH4JHDkAt7Thd39u3EbvwI8aVz+CuCS7r5kfA+XAj9lOPrR3d/p7j/04AqGo/Uz5zPheCPnRcCq7r6tu38JfG6Wl9wBHF5Ve3X3zd398zmmmG3fb7N6nPtq4ELgxPls+2yq6mCGL9h3dfc/u/sq4ALg5InV1o379C6GI/gTd3Te/7XEcG8E9pvhumn/8fltrp1lnAMmn+/u2xlOmWdz/cTPtzOcnlNVj6qqb483yW4BzuXeXyBzmXZchuv2l1TV37c9GD6U+4/zHltVV1bVTeNzxy1g3ocAS7j3Pto4y/ovGsffWFVXVNXT5hh/tn0/3TobGf5OdtQBwE3dfeuUsSfPpqbu7z3SrsMTw10P/At44eTCqlrGcNp22cTi2Y6gm4GDJl7/IIZrvu1xPvBbYGV37wWcCdR2jjXpWmBNd6+YeCzr7g9U1QOBrwMfBh7W3SuASybmnevs4a/AncDBE8sOmWnl7v5Jd5/AcDp/EcOZwWzzzOe/nU2de9tp9m3Agyeee/gCxt4E7FNVe04Z+8/z2J4YceF29z+A1cBHq+p5VfWAqjoM+CrDdc2aeQ71NeD4qnr6eCd6Ndsf254MN2G2VNVjgDds5zhTrR238blVtXtV7TH++uUghmvhBzIGWFXHMlwrbvMXYN+q2nu6gcfTxG8AZ1fVg8dr/lOmW7eqllbVSVW1d3ffMb7Xbb96mXWeOZw1zv04hmv0L4/LrwKOq6p9qurhwFunvO4vwLS/X+7uaxlu5p037q8nAKcz83V2pLhwAcYbEmcyHG1uAX7EcHQ6urv/Nc8xfgW8meEmxWaGO4w3MBzNF+oMhhtMtwKf4r8fwB0yfghPYHivf2V4j+8AdhtPBd/CcOS7eZz/4onX/pbhJs0142n2dKehb2I4Lb+e4U7uhbNszsnAhvFS4PUM19/znWcmVzDc9b0M+HB3f29cvobhHsIGhuv2qfvzPOC943xnTDPuiQx3mjcB3wTeN94f+L9R7X+kB6CqljP8ymVld/9xV2+PNJvII+5iqarjx1O1ZQxH76sZvuWl+7T7dbgMp6GbxsdK4GXtKYgCeKosBbq/H3GlSIYrBVrQvxapKs+rpZ2su+f89wQecaVAhisFMlwpkOFKgQxXCmS4UiDDlQIZrhTIcKVAhisFMlwpkOFKgQxXCmS4UiDDlQIZrhTIcKVAhisFMlwpkOFKgQxXCmS4UiDDlQIZrhTIcKVAhisFMlwpkOFKgQxXCmS4UiDDlQIZrhTIcKVAhisFMlwpkOFKgQxXCmS4UiDDlQIZrhTIcKVAhisFMlwpkOFKgQxXCmS4UiDDlQIZrhTIcKVAhisFMlwpkOFKgZbs6g24L/jM6Ufe8/Npn75iF26JND8ecaVAhisFqu6e/8pV819Z0nbp7pprHY+4UiDDlQIZrhTIcKVAhisFMlwpkOFKgQxXCmS4UiDDlQIZrhTIcKVAhisFMlwpkOFKgQxXCmS4UiDDlQIZrhTIcKVAhisFMlwpkOFKgQxXCmS4UiDDlQIZrhTIcKVAhisFMlwpkOFKgQxXCmS4UiDDlQIZrhTIcKVAhisFMlwpkOFKgQxXCmS4UiDDlQIZrhTIcKVAhisFMlwpkOFKgQxXCmS4UiDDlQIZrhTIcP/P9fpV9PpVu3oztMgMVwpU3T3/lavmv7Kk7dLdNdc6HnGlQIYrBTJcKZDhSoEMVwpkuFIgw5UCGa4UyHClQIYrBTJcKZDhSoEMVwpkuFIgw5UCGa4UyHClQIYrBTJcKZDhSoEMVwpkuFIgw5UCGa4UyHClQIYrBTJcKZDhSoEMVwpkuFIgw5UCGa4UyHClQIYrBTJcKZDhSoEMVwpkuFIgw5UCGa4UyHClQIYrBTJcKZDhSoEMVwpkuFIgw5UCGa4UyHClQIYrBTJcKZDhSoEMVwpkuFIgw5UCGa4UyHClQIYrBTJcKZDhSoEMVwpkuFIgw5UCGa4UyHClQIYrBTJcKZDhSoEMVwpkuFIgw5UCGa4UyHClQIYrBTJcKZDhSoEMVwpkuFIgw5UCGa4UyHClQIYrBTJcKZDhSoEMVwpkuFIgw5UCGa4UyHClQIYrBTJcKZDhSoEMVwpkuFIgw5UCGa4UyHClQIYrBTJcKZDhSoEMVwpkuFIgw5UCGa4UyHClQIYrBVqywPVvBDbujA2RBMCh81mpuntnb4ikReapshTIcKVAhisFMlwpkOFKgQxXCmS4UiDDlQIZrhToP2VF4NppvreoAAAAAElFTkSuQmCC\n",
"text/plain": [
"