Multi-core Hash Joins
Main-memory hash join implementations for multi-core CPUs
Main-Memory Hash Joins On Multi-Core CPUs: Tuning to the Underlying Hardware

Introduction

This package provides implementations of the main-memory hash join algorithms described and studied in our ICDE 2013 paper. Namely, the implemented algorithms are the following with the abbreviated names:

  • NPO: No Partitioning Join Optimized (Hardware-oblivious algo. in paper)
  • PRO: Parallel Radix Join Optimized (Hardware-conscious algo. in paper)
  • PRH: Parallel Radix Join Histogram-based
  • PRHO: Parallel Radix Join Histogram-based Optimized
  • RJ: Radix Join (single-threaded)
  • NPO_st: No Partitioning Join Optimized (single-threaded)

Compilation

The package includes implementations of the algorithms and also the driver code to run and repeat the experimental studies described in the paper.

The code has been written using standard GNU tools and uses autotools for configuration. Thus, compilation should be as simple as:

       $ ./configure
       $ make

Besides the usual ./configure options, compilation can be customized with the following options:

   --enable-debug         enable debug messages on commandline  [default=no]
   --enable-key8B         use 8B keys and values making tuples 16B  [default=no]
   --enable-perfcounters  enable performance monitoring with Intel PCM  [no]
   --enable-paddedbucket  enable padding of buckets to cache line size in NPO [no]
   --enable-timing        enable execution timing  [default=yes]
   --enable-syncstats     enable synchronization timing stats  [default=no]
   --enable-skewhandling  enable fine-granular task decomposition based skew handling in radix [default=no]

Additionally, the code can be configured to enable further optimizations discussed in the Technical Report version of the paper:

   --enable-prefetch-npj   enable prefetching in no partitioning join [default=no]
   --enable-swwc-part      enable software write-combining optimization in
                           partitioning (Experimental, not tested extensively) ? [default=no]

Our code makes use of the Intel Performance Counter Monitor tool which was slightly modified to be integrated in to our implementation. The original code can be downloaded from:

http://software.intel.com/en-us/articles/intel-performance-counter-monitor/

We are providing the copy that we used for our experimental study under lib/intel-pcm-1.7 directory which comes with its own Makefile. Its build process is actually separate from the autotools build process but with the –enable-perfcounters option, make command from the top level directory also builds the shared library ‘libperf.so’ that we link to our code. After compiling with –enable-perfcounters, in order to run the executable add ‘lib/intel-pcm-1.7/lib’ to your LD_LIBRARY_PATH. In addition, the code must be run with root privileges to acces model specific registers, probably after issuing the following command: modprobe msr. Also note that with –enable-perfcounters the code is compiled with g++ since Intel code is written in C++.

We have successfully compiled and run our code on different Linux variants; the experiments in the paper were performed on Debian and Ubuntu Linux systems.

Usage and Invocation

The mchashjoins binary understands the following command line options:

      Join algorithm selection, algorithms : RJ, PRO, PRH, PRHO, NPO, NPO_st
         -a --algo=<name>    Run the hash join algorithm named <name> [PRO]
 
      Other join configuration options, with default values in [] :
         -n --nthreads=<N>  Number of threads to use <N> [2]
         -r --r-size=<R>    Number of tuples in build relation R <R> [128000000]
         -s --s-size=<S>    Number of tuples in probe relation S <S> [128000000]
         -x --r-seed=<x>    Seed value for generating relation R <x> [12345]    
         -y --s-seed=<y>    Seed value for generating relation S <y> [54321]    
         -z --skew=<z>      Zipf skew parameter for probe relation S <z> [0.0]  
         --non-unique       Use non-unique (duplicated) keys in input relations 
         --full-range       Spread keys in relns. in full 32-bit integer range
         --basic-numa       Numa-localize relations to threads (Experimental)

      Performance profiling options, when compiled with --enable-perfcounters.
         -p --perfconf=<P>  Intel PCM config file with upto 4 counters [none]  
         -o --perfout=<O>   Output file to print performance counters [stdout]
 
      Basic user options
          -h --help         Show this message
          --verbose         Be more verbose -- show misc extra info 
          --version         Show version

The above command line options can be used to instantiate a certain configuration to run various joins and print out the resulting statistics. Following the same methodology of the related work, our joins never materialize their results as this would be a common cost for all joins. We only count the number of matching tuples and report this. In order to materialize results, one needs to copy results to a result buffer in the corresponding locations of the source code.

Configuration Parameters

Logical to Pyhsical CPU Mapping

If running on a machine with multiple CPU sockets and/or SMT feature enabled, then it is necessary to identify correct mappings of CPUs on which threads will execute. For instance one of our experiment machines, Intel Xeon L5520 had 2 sockets and each socket had 4 cores and 8 threads. In order to only utilize the first socket, we had to use the following configuration for mapping threads 1 to 8 to correct CPUs:

cpu-mapping.txt
8 0 1 2 3 8 9 10 11

This file is must be created in the executable directory and used by default if exists in the directory. It basically says that we will use 8 CPUs listed and threads spawned 1 to 8 will map to the given list in order. For instance thread 5 will run CPU 8. This file must be changed according to the system at hand. If it is absent, threads will be assigned round-robin. This CPU mapping utility is also integrated into the Wisconsin implementation (found in ‘wisconsin-src’) and same settings are also valid there.

Performance Monitoring

For performance monitoring a config file can be provided on the command line with –perfconf which specifies which hardware counters to monitor. For detailed list of hardware counters consult to "Intel 64 and IA-32 Architectures Software Developer’s Manual" Appendix A. For an example configuration file used in the experiments, see ‘pcm.cfg’ file. Lastly, an output file name with –perfout on commandline can be specified to print out profiling results, otherwise it defaults to stdout.

System and Implementation Parameters

The join implementations need to know about the system at hand to a certain degree. For instance CACHE_LINE_SIZE is required by both of the implementations. In case of no partitioning join, other implementation parameters such as bucket size or whether to pre-allocate for overflowing buckets are parametrized and can be modified in ‘npj_params.h’.

On the other hand, radix joins are more sensitive to system parameters and the optimal setting of parameters should be found from machine to machine to get the same results as presented in the paper. System parameters needed are CACHE_LINE_SIZE, L1_CACHE_SIZE and L1_ASSOCIATIVITY. Other implementation parameters specific to radix join are also important such as NUM_RADIX_BITS which determines number of created partitions and NUM_PASSES which determines number of partitioning passes. Our implementations support between 1 and 2 passes and they can be configured using these parameters to find the ideal performance on a given machine.

Generating Data Sets of Our Experiments

Here we briefly describe how to generate data sets used in our experiments with the command line parameters above.

Workload B

In this data set, the inner relation R and outer relation S have 128*10^6 tuples each. The tuples are 8 bytes long, consisting of 4-byte (or 32-bit) integers and a 4-byte payload. As for the data distribution, if not explicitly specified, we use relations with randomly shuffled unique keys ranging from 1 to 128*10^6. To generate this data set, append the following parameters to the executable mchashjoins:

      $ ./mchashjoins [other options] --r-size=128000000 --s-size=128000000 
Note
Configure must have run without –enable-key8B.

Workload A

This data set reflects the case where the join is performed between the primary key of the inner relation R and the foreign key of the outer relation S. The size of R is fixed at 16*2^20 and size of S is fixed at 256*2^20. The ratio of the inner relation to the outer relation is 1:16. In this data set, tuples are represented as (key, payload) pairs of 8 bytes each, summing up to 16 bytes per tuple. To generate this data set do the following:

     $ ./configure --enable-key8B
     $ make
     $ ./mchashjoins [other options] --r-size=16777216 --s-size=268435456 

Introducing Skew in Data Sets

Skew can be introduced to the relation S as in our experiments by appending the following parameter to the command line, which is basically a Zipf distribution skewness parameter:

     $ ./mchashjoins [other options] --skew=1.05

Wisconsin Implementation

A slightly modified version of the original implementation provided by Blanas et al. from University of Wisconsin is provided under ‘wisconsin-src’ directory. The changes we made are documented in the header of the README file. These implementations provide the algorithms mentioned as ‘non-optimized no partitioning join’ and ‘non-optimized radix join’ in our paper. The original source code can be downloaded from http://pages.cs.wisc.edu/∼sblanas/files/multijoin.tar.bz2 .

Author
Cagri Balkesen cagri.nosp@m..bal.nosp@m.kesen.nosp@m.@inf.nosp@m..ethz.nosp@m..ch

(c) 2012, ETH Zurich, Systems Group