Commit 9ed9ce5a authored by Valentin Bruch

forgot the documentation in previous commit...

parent 48511f75
\newcommand\cuname{rtrg\_cublas}
\author{Valentin Bruch}
\date{May 21, 2021}
\title{Extended matrix multiplication with \texttt{\name}}
\begin{document}
\end{figure}
\section{Usage}
The available functions are:
\begin{description}
\item[\texttt{multiply\_extended}]
Multiply two matrices and correct truncation effects.
\item[\texttt{invert\_extended}]
Invert a matrix and correct truncation effects.
\item[\texttt{extend\_matrix}]
Extend a matrix by linear extrapolation along the diagonals.
\end{description}
This package is written for numpy arrays of 128-bit complex numbers. Given two such arrays \texttt{a} and \texttt{b} of shape \texttt{(n,n)} ($\mathtt{n}=2N+1$) where $K<\mathtt{cutoff}$ is (approximately) known, you can calculate the matrix product with truncation mitigation as follows:
\begin{lstlisting}
c = rtrg_c.multiply_extended(b.T, a.T, cutoff).T # fast
c = rtrg_c.multiply_extended(a, b, cutoff) # slow
\end{lstlisting}
The transpositions in the first (fast) version seem unnecessary, but they can significantly speed up the calculation. The reason is that taking the transpose converts a C-ordered array into an F-ordered array, and \texttt{\name} is optimized for F-ordered arrays.
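The ordering can be inspected via numpy's array flags; a quick check (plain numpy, nothing specific to \texttt{\name}):
\begin{lstlisting}
import numpy as np
a = np.zeros((5, 5), dtype=np.complex128)  # numpy allocates C-ordered by default
a.flags['C_CONTIGUOUS']    # True
a.T.flags['F_CONTIGUOUS']  # True: .T reinterprets the same buffer as F-ordered, no copy
\end{lstlisting}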
In general input matrices must be numpy arrays of 128-bit complex numbers with shape \texttt{(m,n,...)} where \texttt{...} stands for any number of extra dimensions. Note that since you should take the transpose before and after calling \texttt{\name} functions, you should work with arrays of shape \texttt{(...,n,m)} in python. The extrapolation requires \texttt{m>cutoff+2} and \texttt{n>cutoff+2}. When calculating the product of two matrices, the extra dimensions must match exactly. No broadcasting is done, in contrast to many other python functions.
The function \texttt{invert\_extended} requires square matrices (\texttt{n=m}).
Multiplication and plain matrix extension also work with non-square matrices.
However, multiplication which makes use of symmetries (see \texttt{help(\name.multiply\_extended)}) is only meaningful for nearly-square matrices ($|\mathtt{n}-\mathtt{m}|\leq 1$).
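Assuming \texttt{invert\_extended} and \texttt{extend\_matrix} accept the matrix and the cutoff in the same way as \texttt{multiply\_extended} (check \texttt{help(\name.invert\_extended)} for the exact signatures), the same transpose convention applies:
\begin{lstlisting}
ainv = rtrg_c.invert_extended(a.T, cutoff).T  # square matrices only
aext = rtrg_c.extend_matrix(a.T, cutoff).T    # also works for non-square matrices
\end{lstlisting}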
Matrix multiplication can be boosted by using symmetries which originate from hermiticity preservation in Floquet theory in Liouville space.
For matrices fulfilling%
\footnote{This package was written by a physicist, so ${}^*$ denotes (only) complex conjugation.}
$A_{-n,-m}=s_a A_{n,m}^*$ and $B_{-n,-m}=s_b B_{n,m}^*$ with $s_a,s_b=\pm1$ the matrix product $AB$ can be calculated using
\begin{lstlisting}
c = rtrg_c.multiply_extended(b.T, a.T, cutoff, s).T c = rtrg_c.multiply_extended(b.T, a.T, cutoff, s).T
\end{lstlisting}
where $\mathrm{s}=s_a s_b=\pm1$ is an integer defining the symmetry.
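The symmetry relates each entry to the one mirrored through the matrix center: with Floquet indices $-N,\dots,N$ stored at array positions $0,\dots,2N$, the index $(-n,-m)$ is reached by reversing both axes. A plain-numpy sketch (not part of \texttt{\name}) for checking whether a matrix fulfills the relation for a given sign:
\begin{lstlisting}
import numpy as np

def has_floquet_symmetry(a, s):
    # check A[-n,-m] == s * conj(A[n,m]): reversing both axes
    # maps the Floquet index (n, m) to (-n, -m)
    return np.allclose(a[::-1, ::-1], s * a.conj())
\end{lstlisting}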
\texttt{\name} is based on numpy and uses (C)BLAS and LAPACK.
It can be built like any simple python extension with \texttt{python setup.py build\_ext --inplace}.
But you should have a look at \texttt{setup.py} and the explanation of the configuration at the beginning of \texttt{\name.c}.
For best performance you should try out different values of \texttt{TRIANGULAR\_OPTIMIZE\_THRESHOLD}.
The CUBLAS version \texttt{\cuname} can be built with \texttt{make -f \cuname.makefile}.
Tuning parameters for the CUBLAS version can and should be adapted in \texttt{\cuname.h}.
\subsection{Optimization}
\begin{itemize}
\item
The options listed below can be useful for large matrices, but should always be tested with benchmarks.
They will probably only slow down the calculation when working with small matrices.
\item
Enabling more parallelization in \texttt{\name.c} might hamper the more efficient parallelization of BLAS and therefore worsen the performance.
\item
Parallelization of the matrix multiplication through (C)BLAS is not affected by the parallelization settings in \texttt{\name.c}.
\item
Most of the settings listed here are only relevant for the matrix multiplication.
\item
Only performance settings are listed here. Settings required to adapt to your BLAS or LAPACK version and settings concerning the extrapolation scheme or the data type are not listed.
\item
Only settings for \texttt{\name} are listed here. The configuration for \texttt{\cuname} differs and is explained in \texttt{\cuname.h}.
\end{itemize}
\begin{description}
\item[\texttt{TRIANGULAR\_OPTIMIZE\_THRESHOLD}] (int)
The multiplication of triangular matrices is split into the multiplication of multiple smaller matrices by a divide-and-conquer algorithm until this threshold is reached.
Higher values should be used if many threads and a fast BLAS version are available. Lower values may be reasonable when using a single thread. (One way of setting these macros is sketched after this list.)
\item[\texttt{PARALLEL}] (flag)
Do multiple matrix multiplications at the same time if possible.
\item[\texttt{PARALLEL\_EXTRAPOLATION}] (flag)
Use an OMP parallelized loop for matrix extrapolation. This might improve performance for large matrices when many threads are available.
Currently this parallelization is only implemented for the extrapolation functions used for matrix multiplication. The functions \texttt{extend\_matrix} and \texttt{invert\_extended} do not benefit from this.
\item[\texttt{PARALLEL\_EXTRA\_DIMS}] (flag)
When working with arrays of $n$ dimensions, the functions in \texttt{\name} will only iterate over the first $n-2$ dimensions, always performing the same operation. If this flag is defined, this iteration will be OMP parallelized. Note that this may require more copying of data in memory for symmetry-improved multiplication of non-square matrices.
\item[\texttt{PARALLELIZE\_CORRECTION\_THREADS\_THRESHOLD}] (int)
\emph{Only relevant if the \texttt{PARALLEL} flag is set.}
The truncation correction is maximally parallelized, correcting the top left and the bottom right of the matrix product in parallel, only if at least this number of threads is available.
\item[\texttt{DEBUG}] (flag)
Obviously this should only be used for debugging.
\end{description}
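How these macros are set depends on your build setup: you can edit the definitions at the beginning of \texttt{\name.c}, or pass them as preprocessor defines from \texttt{setup.py}. The following is a hypothetical excerpt of such a \texttt{setup.py} (the macro value is an example, not a recommendation; run your own benchmarks):
\begin{lstlisting}
from setuptools import setup, Extension

setup(ext_modules=[Extension(
    'rtrg_c',
    sources=['rtrg_c.c'],
    define_macros=[
        ('TRIANGULAR_OPTIMIZE_THRESHOLD', '64'),  # example value, benchmark it
        ('PARALLEL', None),  # None defines the macro without a value
    ],
)])
\end{lstlisting}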
\subsection{Other options}
\begin{description}
\item[\texttt{CBLAS}]
Use CBLAS instead of directly calling BLAS functions.
\item[\texttt{LAPACK\_C}] (flag)
Include the LAPACK\_C header instead of just linking to LAPACK. This probably should not make any difference.
\item[\texttt{extrapolate(i, a, b, c)}]
Simple linear extrapolation based on the last 3 elements.
Given the mapping $0\mapsto\mathtt{a}, -1\mapsto\mathtt{b}, -2\mapsto\mathtt{c}$ estimate the value at $\mathtt{i}$.
This can be replaced, e.g., by a constant extrapolation or an unbiased linear extrapolation (one plausible version is sketched after this list).
\item[\texttt{complex\_type}]
Data type of all input arrays. Only this data type is accepted. The (C)BLAS and LAPACK functions must be adapted to this data type.
\item[\texttt{NPY\_COMPLEX\_TYPE}]
Numpy's name for the data type defined in \texttt{complex\_type}.
\item[\texttt{gemm}, \texttt{trmm}]
BLAS or CBLAS function names.
\item[\texttt{getrf}, \texttt{getri}]
LAPACK function names.
\end{description}
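As an illustration of the \texttt{extrapolate} mapping, here is one plausible Python transcription (the actual macro lives in \texttt{\name.c} and may differ); this variant uses the least-squares line through the three points:
\begin{lstlisting}
def extrapolate(i, a, b, c):
    # given f(0)=a, f(-1)=b, f(-2)=c, estimate f(i)
    slope = (a - c) / 2                  # symmetric difference quotient
    intercept = (a + b + c) / 3 + slope  # least-squares line evaluated at 0
    return intercept + slope * i
\end{lstlisting}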
\section{Implementation}
All functions of \texttt{\name} internally work with F-contiguous arrays. Input arrays which are not F-contiguous are converted to F-contiguous copies, which is quite slow. Return values of these functions are always F-contiguous except if an input array is returned unchanged.
The implementation of \texttt{\name.extend\_matrix} is straightforward.
\texttt{\name.invert\_extended} first extrapolates the matrix in the same way as \texttt{\name.extend\_matrix}, inverts it using the LU decomposition from LAPACK, and truncates the inverted matrix to the original size. An optional argument allows one to truncate the extended matrix by a certain number of rows and columns prior to the inversion, which speeds up the calculation at the cost of a very small error (typically smaller than the error introduced by the extrapolation).
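Conceptually (ignoring the optional truncation argument and assuming the \texttt{extend\_matrix} signature sketched above, with a symmetric extension on both sides), the result should be close to this numpy pipeline:
\begin{lstlisting}
import numpy as np

ext = rtrg_c.extend_matrix(a.T, cutoff).T  # extrapolate (assumed signature)
inv_ext = np.linalg.inv(ext)               # LU-based inversion (LAPACK inside numpy)
pad = (ext.shape[0] - a.shape[0]) // 2     # assumed: symmetric extension
ainv = inv_ext[pad:-pad, pad:-pad]         # truncate back to the original size
\end{lstlisting}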
The most complicated and optimized function is the matrix multiplication. Here only those parts which are required for the matrix product are extrapolated.
The correction of the truncation effects from the extrapolated matrix elements makes use of fast multiplication of triangular matrices in BLAS.
The problem of multiplying a lower triangular matrix with an upper triangular matrix (or vice versa) is subdivided because BLAS has no explicit function for this. The reduction of the problem to smaller matrices is configured at compile time by \texttt{TRIANGULAR\_OPTIMIZE\_THRESHOLD}, which should be adapted to your needs.
Note that this relatively large number of calls to BLAS functions (at least 3 per matrix-matrix multiplication without symmetries) can be slow when, e.g., the matrices are small but the BLAS functions run on a GPU.
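The following numpy sketch shows the divide-and-conquer idea (it mirrors the structure, not the C implementation): all off-diagonal block products map to BLAS routines (\texttt{trmm} or \texttt{gemm}), and only the diagonal blocks recurse:
\begin{lstlisting}
import numpy as np

THRESHOLD = 64  # stand-in for TRIANGULAR_OPTIMIZE_THRESHOLD

def lower_times_upper(L, U):
    # product of lower triangular L and upper triangular U;
    # BLAS has trmm (triangular times general) but no routine for this
    n = L.shape[0]
    if n <= THRESHOLD:
        return L @ U  # small enough: one plain product
    k = n // 2
    L11, L21, L22 = L[:k, :k], L[k:, :k], L[k:, k:]
    U11, U12, U22 = U[:k, :k], U[:k, k:], U[k:, k:]
    C = np.empty_like(L)
    C[:k, :k] = lower_times_upper(L11, U11)  # recurse: lower x upper
    C[:k, k:] = L11 @ U12                    # trmm-like: triangular x general
    C[k:, :k] = L21 @ U11                    # trmm-like: general x triangular
    C[k:, k:] = L21 @ U12 + lower_times_upper(L22, U22)  # gemm + recursion
    return C
\end{lstlisting}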
In the CUBLAS version some calculations are sent to the GPU if the matrix sizes exceed a threshold defined in \texttt{\cuname.h}. This includes the multiplication of the non-extrapolated matrices and the multiplication of upper and lower triangular matrices as required for the correction of the truncation effects.
In the symmetry-boosted matrix multiplication the problem of plain matrix multiplication without truncation mitigation is reduced to the multiplication of a general matrix and a triangular matrix, which is implemented in BLAS, plus $O(N^2)$ operations.
Since multiplication by a triangular matrix is done in-place in BLAS, this calculation can be sped up by allowing for overwriting the first argument of \texttt{\name.multiply\_extended}.
\emph{(Here is probably some potential for further improvement of the CUBLAS version, which does not do \texttt{ztrmm} in-place by default.)}
By relying on BLAS for all slow operations this library also hands the problem of parallelization to BLAS.
For input arrays with more than 2 dimensions, the iteration over extra dimensions can be parallelized using OpenMP by defining \texttt{PARALLEL\_EXTRA\_DIMS} in \texttt{\name.c}. But this option should only be used after verifying that it improves the performance. As this inhibits internal parallelization in BLAS, this option may significantly lower the performance depending on the input array.
I don't know what happens when putting OpenMP parallelization (\texttt{PARALLEL\_EXTRA\_DIMS}) on top of CUBLAS. This is probably best avoided.
\end{document}