4.  Crypto API in kernel

The Linux crypto API was initially developed to support the cryptographic part of IPsec in the kernel. Meanwhile, other subsystems of the kernel that require cryptographic support, like the wireless stack, are using this API. The crypto API requires that every algorithm registers itself at run time. Typically algorithms (like DES, AES, . . . ) and block mode templates (like ECB, CBC, . . . ) can be registered. Users that require cryptographic functionality do not use ciphers and templates directly; instead they request a “block cipher”. The crypto API uses the cipher and the template, stacks them together, and produces the requested block cipher. The advantage is that the block modes need not be implemented in every algorithm separately but can be shared. A few drivers, like the s390–AES driver, implement the block cipher directly because it is handled entirely in hardware.

3.3 SPU variant

4.1.  Implementing in User space

Developing and testing the VMX version of AES (chapter 3.2) is realized in user space. The benefit is that bugs do not lead to a kernel panic (a system crash) but to a simple crash (segmentation fault). Additionally, all tools for debugging like gdb can be used. The resulting algorithm must not only be tested for theoretical working but must also be tested against test vectors that are specified in [Publication 01]. The usage of test vectors during the development of the algorithm is also very helpful because it provides interim results after every step within the algorithm. To prevent conflicts with the later integration of the algorithm into the kernel, the API that the algorithm is providing, as well as the environment which the algorithm is using, should be same as in the kernel. The only way to satisfy all these requirements is to port the crypto subsystem from the kernel into the user space.

The crypto subsystem in the kernel is fully independent from the remaining kernel. As mentioned earlier, every algorithm registers itself in the initialization phase. The crypto user then requests a specific block cipher, for instance “ecb(aes)” which represents the AES cipher in ECB mode. The result of this request is a struct crypto_blkcipher which is required for all further crypto requests. It contains function pointers to the functions provided by the specific algorithm (i.e. encrypt, decrypt and set key), its private data (i.e. information about the key), and various API related informations that are only used internally by the API. The struct is not accessed directly by the crypto user, but the API provides wrapper functions. Figure 4.1 shows a work flow for the setkey operation.


PIC

Figure 4.1: Work flow for a set key function


crypto_blkcipher_setkey() is used by the crypto user to set the key. It is basically a wrapper method that calls the appropriate function referenced in the struct. The next function that is passed is setkey() within the block cipher module which acts as a glue function between the user and the algorithm. It makes sure that the memory address of the key has the alignment that was specified by the algorithm to work properly. In the case that the memory is not aligned as required it allocates additional memory and copies the key. cipher->setkey invokes the final set key function as implemented by the algorithm. The workflow of encrypt and decrypt functions is similar. As can be seen in this example, the crypto API (including the algorithms) does not use much of the kernel infrastructure. Memory allocation is used for struct crypto_blkcipher and within the glue functions to ensure alignment. The algorithms themselves usually do not need any special functions since they only “modify” the memory.

The easiest way to port the crypto API into user space is to replace all the header files required by the subsystem and to emulate the kernel. This was already done for the netfilter subsystem by Jeremy Kerr and Rusty Russel in their project “Netfilter simulation environment” [Jeremy Kerr 07]. This “new kernel environment” implements only the infrastructure that is required by netfilter, not the whole kernel function set. The simulator must be adjusted to meet the requirements of the crypto subsystem. This requires adding support for memory on a page basis, which is widely used. The resulting simulator registers the algorithm and a template and invokes the tcrypt module which tests the block cipher against test vectors.

4.2.  Implementing a cipher

A cipher is the basic of an cryptographic encryption algorithm that can be registered. The user has to fill the struct crypto_alg, and the struct cipher_alg which is embedded in the first one.

 
1struct crypto_alg { 
2     struct list_head cra_list; 
3     struct list_head cra_users; 
4 
5     u32 cra_flags; 
6     unsigned int cra_blocksize; 
7     unsigned int cra_ctxsize; 
8     unsigned int cra_alignmask; 
9 
10     int cra_priority; 
11     atomic_t cra_refcnt; 
12 
13     char cra_name[CRYPTO_MAX_ALG_NAME]; 
14     char cra_driver_name[CRYPTO_MAX_ALG_NAME]; 
15 
16     const struct crypto_type *cra_type; 
17 
18     union { 
19           struct ablkcipher_alg ablkcipher; 
20           struct blkcipher_alg blkcipher; 
21           struct cipher_alg cipher; 
22           struct digest_alg digest; 
23           struct hash_alg hash; 
24           struct compress_alg compress; 
25     } cra_u; 
26 
27     int (*cra_init)(struct crypto_tfm *tfm); 
28     void (*cra_exit)(struct crypto_tfm *tfm); 
29     void (*cra_destroy)(struct crypto_alg *alg); 
30 
31     struct module *cra_module; 
32}; 
33 
34struct cipher_alg { 
35     unsigned int cia_min_keysize; 
36     unsigned int cia_max_keysize; 
37     int (*cia_setkey)(struct crypto_tfm *tfm, const u8 *key, 
38                unsigned int keylen); 
39     void (*cia_encrypt)(struct crypto_tfm *tfm, u8 *dst, const u8 *src); 
40     void (*cia_decrypt)(struct crypto_tfm *tfm, u8 *dst, const u8 *src); 
41};
Listing 4.1: struct crypto_alg and struct cipher_alg

Here is a brief description of the most important fields from listing 4.1:

cra_flags This field determines the type of the crypto algorithm. In the case of
CRYPTO_ALG_TYPE_CIPHER it treats the algorithm as an cipher and the member cipher within cra_u is accessed, in the case of CRYPTO_ALG_TYPE_DIGEST the digest member is accessed.

cra_blocksize The block size which the algorithm expects.

cra_ctxsize The size of the private context which is used to identify the state.

cra_alignmask The alignmask that is required by the algorithm. Some algorithms require a special alignment of their private context data structure or the data that they have to process.

cra_priority The priority of the algorithm. Usually generic implementations of an algorithm have a priority of 100, assembly optimized 200 and hardware support 300. The API takes the algorithm with the highest priority if there is more than one available.

cra_name The name of the algorithm (which is used for lookup).

cra_driver_name The name of the driver. It is used for information purposes (in /proc/crypto) and to ensure that the same driver does not register the same algorithm more than once.

cra_u This union contains all possible types of crypto (cipher, hash, blkcipher, . . . ), which contain specific information about the algorithm, like encrypt, decrypt, update hash or the size of an IV.

cia_min_keysize The minimum size of key supported.

cia_max_keysize The maximum size of key supported.

cia_setkey A function pointer to the “setkey” function. If the algorithm has to transform the supplied key into a different one, it must be saved in the private context.

cia_encrypt A function pointer to the “encrypt” function. This algorithm processes chunks of data that are cra_blocksize bytes long with the key which was supplied earlier.

cia_decrypt A function pointer to the “decrypt” function. This algorithm processes chunks of data that are cra_blocksize bytes long with the key which was supplied earlier. In order to implement a cipher the user has to set all these fields and supply the three functions. The cipher must not use any global variables because they may exist for multiple users at the same time. In order to save a state between each invocation of a function the cipher must use its own struct, the size of which is determined by the cra_ctxsize field. The private context is allocated automatically by the crypto API and can be retrieved from the struct struct crypto_tfm with the helper functions crypto_tfm_ctx() or crypto_tfm_ctx_aligned(). Once a function is chosen by the user it has to be used consistently. The difference between the two is whether or not the private struct starts at an aligned address in memory as specified by the cra_alignmask member. Some algorithms, mainly ciphers that offload the actual work to hardware, rely on proper alignment. The only downside of the “aligned” private context is that the crypto API has to lookup the specified alignment mask and calculate the correct address.

4.2.1.  Benchmarking the VMX version of AES

The crypto API offers a module for testing every algorithm that is registered against official test vectors to check for correct behavior of the implementation. This module also has an option to benchmark the selected algorithm in different categories. Here are snippets of an output generated for the AES cipher:

testing speed of ecb(aes) encryption  
test 0 (128 bit key, 16 byte blocks): 1 operation in 616 cycles (16 bytes)  
test 1 (128 bit key, 64 byte blocks): 1 operation in 1402 cycles (64 bytes)  
test 2 (128 bit key, 256 byte blocks): 1 operation in 4712 cycles (256 bytes)  
 
testing speed of ecb(aes) decryption  
test 0 (128 bit key, 16 byte blocks): 1 operation in 650 cycles (16 bytes)  
test 1 (128 bit key, 64 byte blocks): 1 operation in 1396 cycles (64 bytes)  
 
testing speed of cbc(aes) encryption  
test 0 (128 bit key, 16 byte blocks): 1 operation in 597 cycles (16 bytes)  
test 1 (128 bit key, 64 byte blocks): 1 operation in 1460 cycles (64 bytes)  
test 2 (128 bit key, 256 byte blocks): 1 operation in 4894 cycles (256 bytes)

The module performs the test for various sizes of the key with different numbers of blocks and different block modes and with the encrypting and decrypting functions of the algorithm. The performance results are shown in cycles. Especially for encryption of small blocks, the result is not very precise. According to the output, the encryption of 16 bytes in CBC is faster than with ECB (less cycles). Since CBC performs an additional XOR operation over the ECB mode, this seems not to be correct. For larger block sizes, 256 bytes for instance, the result resembles the predicted values better. Repeated usage of the testing module shows more consistent results on PC-style hardware, where the difference between multiple runs is only a few cycles. The consistent results have been accomplished by disabling IRQs. The kernel can not be interrupted during the measurement and there is almost no source of interference. There is a difference on SMT CPUs. Usually when both threads are busy the performance of a single thread is less than when just one is active. The performance that is measured can only be used for comparison between two implementations, and not as a specification of what can be achieved in a real applications, because the testing environment is not real; it is just the optimal one. The results of the testing module vary a lot on the Playstation 3. The reason might be the SMT processor or the hypervisor which runs belows Linux and controls the hardware. The hypervisor cannot be removed or switched off for the test. Therefore another benchmarking module must be used which returns repeatable results which do not differ from each other very much. The problem with interruptions occurs even more in user space: the user space application may be interrupted by the kernel not only due to upcoming hardware events but also due to scheduling policy or the fact that the memory page is currently not accessible (swapped out to disk or not available in the MMU). This problem can be solved in three steps:

A long term benchmark has been developed (listing ??) to perform a repeatable benchmark. The module allocates memory and repeatedly invokes the encryption or decryption function with different key sizes. By default the program allocates 16 KiB of memory and processes it continuous until the total amount that has been is processed reaches 156 MiB (10000 loops 16 KiB). The reason why it only processes 16 KiB of memory is that the crypto API expects physically contiguous memory. 16 KiB are 4 contiguous pages with 4 KiB each. It is unlikely that more than this can be allocated on a system that has been running for a while, especially on system that has only 256 MiB of main memory such as the Playstation 3. This measurement is repeated ten times to give an average result. The time that is needed for a single pass (156 MiB) is measured in jiffies1 and converted into MiB/sec as the unit for comparison. The conversion is performed on integer variables because floating point is not available in kernel mode. The results that are used for comparison and figures are in KiB/sec and collected in appendix ??.

The performance of the AES cipher from chapter 3.1 can be seen in figure 4.2


PIC

Figure 4.2: Cipher implementation of the VMX AES algorithm


The profiling snapshots of the program resulted in a lot of hits at the beginning or before the actual AES program and almost the same amount of hits within the AES program. The first samples were tracked down to the instruction:

 enable_kernel_altivec();

This kernel function enables the VMX unit in kernel mode because like float registers, VMX registers are not available in kernel mode. Enabling the VMX unit consists of:

This last point is needed because of a technique that is referred as “lazy binding of the VMX unit”. Instead of saving the VMX registers (32 registers16 bytes= 512 bytes) on every context switch, the registers are only saved and restored if the application is using them.

The benchmarking module is started by modprobe, which should not perform any VMX operations, and then continues until termination. Therefore, the enable_kernel_altivec() function should only enable the VMX unit and not save any VMX register because they were not used before. According to the profiling results, this seems to be an expensive operation. In order not to perform this operation very frequently, it may be moved into the exception handler. The first VMX operation in kernel will raise an exception which enables the VMX unit instead of raising a general exception.


PIC

Figure 4.3: Cipher implementation of the VMX AES algorithm in manual and exception mode


Figure 4.3 shows the results of this modification, where “manual” represents the old behavior and “exception” the modified one. The improvement in the case of encryption in ECB mode with 128 bit keys is 50% (from 26560 KiB/sec to 40363 KiB/sec).

After this tuning, the profiling snapshots were almost evenly distributed around the AES code. The code uses more VMX registers than are available in the hardware. This forces the compiler to overwrite the contents of registers on reloading them from memory. This can be also seen in figure 4.4.


PIC

Figure 4.4: Encryption speed according to the blocksize of the VMX AES algorithm, cipher implementation


The same benchmarking program encrypts the same amount of data. In the case of smaller chunk sizes (block size) the encryption process is repeated more often in order to encrypt 156 MiB in total. At the beginning of the graph (16 bytes) the encryption speed is very low; only about 7.9 MiB/sec for 128 bit keys, ECB block mode and with exception handling. At a chunk size of 512 bytes it reaches a processing speed of 37.9 MiB/sec which seems to be the optimal chunk size. The chunk size of 16 KiB reaches a processing speed of 39.4 MiB/sec which is 4% faster.

4.3.  Implementing a synchronous block cipher

The benefit of a block cipher over a normal cipher is that the block cipher can process the whole chunk of data instead of only 16 bytes. The disadvantage is that the developer must also implement every block mode that should be supported separately. In order to provide a block cipher cra_flags in struct struct crypto_alg (listing 4.1) must be set to
CRYPTO_ALG_TYPE_BLKCIPHER and cra_type to crypto_blkcipher_type. The crypto API then considers the blkcipher field in cra_u to be of type struct blkcipher_alg; see listing 4.2.

 
1struct blkcipher_alg { 
2     int (*setkey)(struct crypto_tfm *tfm, const u8 *key, 
3                unsigned int keylen); 
4     int (*encrypt)(struct blkcipher_desc *desc, 
5                struct scatterlist *dst, struct scatterlist *src, 
6                unsigned int nbytes); 
7     int (*decrypt)(struct blkcipher_desc *desc, 
8                struct scatterlist *dst, struct scatterlist *src, 
9                unsigned int nbytes); 
10 
11     unsigned int min_keysize; 
12     unsigned int max_keysize; 
13     unsigned int ivsize; 
14};
Listing 4.2: struct blkcipher_alg

The remaining elements in struct crypto_alg do not differ much, except for the block cipher name which is provided in cra_name. It must now contain the block mode and cipher name, for instance “ecb(aes)” or “cbc(aes)”. All elements in struct blkcipher_alg have the same meaning as the corresponding fields in struct cipher_alg. ivsize is new and describes the size of the IV that must be supplied.

An obvious difference is the different function signature of the encrypt and decrypt function. The block cipher does not operate on pure memory address but on struct scatterlist and contains the total length of the block that has to be processed. The crypto API provides wrapper functions which convert the information provided by struct scatterlist into physical memory addresses. The results of the implementation of the ECB block cipher can be seen in figure 4.5.


PIC

Figure 4.5: AES VMX block cipher and cipher implementation and generic code


The generic algorithm is the generic AES algorithm available in the kernel (crypto/aes.c in v2.6.21). According to the results, the generic implementation performs still better than the self–developed VMX version of AES. It is very interesting that in cipher mode the use of the exception handler accelerated the execution, while in block mode it is slowing down the execution (encryption 128 bit keys) or the performance remains almost the same (encryption 192 bit keys). The performance of the algorithm could be improved if the code used less VMX registers. The compiler could than leave some constant values (like the lookup table in SubBytes()) in registers and reuse them while processing the second 16 byte block instead of reloading them from memory. This optimization would probably be meaningless in the SPU version because the SPU provides 128 registers.

Figure 4.6 shows the performance of AES with VMX in block cipher and in CBC block mode.


PIC

Figure 4.6: AES VMX block cipher in CBC mode, and the generic code


The additional XOR operation which is required by CBC carries almost no weight in the VMX variant of AES.

To summarize, the VMX version performs better than the generic code in encryption mode, but worse in decryption mode.

4.4.  The asynchronous interface

The initial interface of cryptographic API was synchronous. That means that every encryption or decryption request by the crypto user was satisfied immediately. As of kernel v2.6.22, Herbert Xu introduced an asynchronous interface for block cipher: the ablkcipher.


PIC

Figure 4.7: Work flow for the synchronous and asynchronous block cipher interface


The different interface requires a different program logic. Figure 4.7 shows the two work flows in comparison. The major difference is that the crypto user does not continue working on the memory block that it wanted to process. Instead, it may enqueue further requests or wait for completion of its request. The advantage of asynchronous processing is the exploitation of dedicated hardware devices and multi processor systems.

An example based on network processing: An interrupt is triggered and the network driver turns over the newly arrived network packet to the network stack for further processing. Once the packet is interpreted as an encrypted packet, it is passed to the crypto API for decryption. During the interpretation another packet may be received. The crypto API hands the packets over to the hardware. The CPU may enqueue further requests or work on other jobs while the hardware is processing the cryptographic requests. Once the hardware has decrypted a packet it signalizes this, via the crypto API, to the network stack which can resume working on it. It is not impossible to support hardware devices without the asynchronous API but this would require a busy loop until the device finishes its work. This is a pure waste of resources. In addition the asynchronous API enables batch processing. This even has advantages on systems without hardware support. On a multiprocessor system one CPU can fetch packets and the other can decrypt them. Without the asynchronous processing the other CPU would remain idle (except that the interrupts are generated on both CPUs). Batch processing also has advantages on single processor systems. Since the CPU can now process several requests in batch the chances are higher that the data needed by the algorithm will remain in the L1 or L2 cache.

4.5.  Implementing an asynchronous block cipher

In order to provide such a driver the user has to set cra_flags to CRYPTO_ALG_TYPE_BLKCIPHER | CRYPTO_ALG_ASYNC and cra_type to crypto_ablkcipher_type. The crypto API now uses the ablkcipher member of cra_u which is of type struct ablkcipher_alg and shown in listing 4.3.

 
1struct ablkcipher_alg { 
2     int (*setkey)(struct crypto_ablkcipher *tfm, const u8 *key, 
3                unsigned int keylen); 
4     int (*encrypt)(struct ablkcipher_request *req); 
5     int (*decrypt)(struct ablkcipher_request *req); 
6 
7     struct crypto_queue *queue; 
8 
9     unsigned int min_keysize; 
10     unsigned int max_keysize; 
11     unsigned int ivsize; 
12};
Listing 4.3: struct ablkcipher_alg

Once again the function signatures of the encrypt and decrypt functions have changed. They now expect a parameter of type struct ablkcipher_request which contains the source and destination addresses, as a struct scatterlist, the initialization vector, and a pointer to a “completion” function which has to be called once the request has been completed. The queue in line 7 represents a software queue for incoming requests. This queue is required for both software and hardware support. Dedicated hardware with crypto support usually has its own queue where the cryptographic requests are expected. Since the requests are enqueued directly on the hardware, the queue size is limited. Once the queue is full further requests cannot be enqueued and therefore proceed. The enqueue function cannot sleep and try again later (once hardware has finished some requests) because the enqueue function may be called from a softirq context. The requests cannot be dropped (and signalled to the caller that they were dropped) because not every caller is able to redo the request. The software queue ensures that all requests can be processed under all circumstances. However, the queue has a soft limit for the number of requests that may be queued up simultaneously. Once that limit is reached all further requests will be dropped unless they are marked as non–dropable. In both cases the crypto API returns a special code so that the caller notices this. The reason is that the caller has to slow down its enqueuing rate because the hardware is probably overloaded.

This soft limit applies also to drivers which realize the cryptographic operations in software. The software has an unlimited queue, but the kernel may enqueue more requests than the CPU can process in time. This can potentially consume all available memory resources, which should be avoided.