ChaCha20 Encryption Performance on DEC PDP-11

PDP-11 Front Panel

Image by Don DeBold, CC BY 2.0, some rights reserved.

Introduction

For fun, I’m implementating the ChaCha20 cipher in PDP-11 assembly. To maximize its compatibility, my primary reference target is a plain PDP-11/40 without the Extended Instruction Set (EIS) option nor speedups available in later machines. Backward compatibility is also provided for PDP-11/20 to cover 100% of the installations.

The natural question is: is our 50-year-old computer fast enough to run ChaCha20, the state-of-the-art crypto on today’s Internet?

Currently, the implementation is incomplete - I have no prior experience on programming the PDP-11 and I plan to learn everything on-the-fly. Still, I cannot wait any longer to answer this question, so I did a performance analysis. Here’s my findings.

Background

ChaCha20 is a 256-bit stream cipher for symmetric encryption. It’s designed to have both high performance and high security. It can be implemented efficiently in pure software. By avoiding secret-dependent memory accesses and conditional branches in its construction, it’s immune to many forms of timing side-channel attacks that software implementations of other algorithms (e.g. AES) are plagued with.

In recent years, it has been widely accepted by the applied cryptography community. It has been standardized and incorporated in many software projects and Internet protocols, including the Linux kernel, OpenBSD, OpenSSH, TLSv1.3, and Tor.

ChaCha20 Quarter Round

In ChaCha20, the basic unit of operation is the Chacha20 quarter round function. It takes 4 unsigned 32-bit integer a, b, c, d as inputs and applies a series of transformations in 4 steps. In each step, 3-different 32-bit unsigned integers are added and XORed with one another, then the result is rotated (circular shifted) to the left.

The quarter round function is shown below.

a += b; d ^= a; d <<<= 16;
c += d; b ^= c; b <<<= 12;
a += b; d ^= a; d <<<= 8;
c += d; b ^= c; b <<<= 7;

As you see, in step 1 and step 3, it uses the integers a, b, d; in step 2 and 4, it uses b, c, d. Rotation offsets are different in each step. Nevertheless, the same operations are always repeated: ADD, XOR and ROTATE. This construction is known as an ARX cipher in cryptography.

A full ChaCha20 round consists of 4 quarter rounds, hence the name “quarter round” (not to be confused with the 4 substeps in a single quarter round). They operate on 16 unsigned 32-bit integers as a 4x4 matrix, not just 4 (I’ll explain it later).

Register Allocation

Registers on the PDP-11 are listed below.

Register Description
r0 - r5 16-bit General Purpose
sp (r6) 16-bit Stack Pointer
pc (r7) 16-bit Program Counter

Six 16-bit general purpose registers are available, allowing us to hold three 32-bit integers in registers. Each step in the ChaCha20 quarter round happens to operate on three 32-bit integers, we found the number of registers is just enough - but we do have to load and store one register per step. ChaCha20 is really designed for machines with four 32-bit registers or more, not three.

The stack pointer r6, can be used as a scratchpad register if we really need to. However, it should be saved to a known memory location and be restored later, for some systems, one should also enable and disable interrupts appropriately. It’ll be explained later.

It’s worth mentioning that register r7, in princple, works as a general-purpose register. It responds to all PDP-11’s addressing mode, but we can’t use it unless we want to trash our own program.

Wait, 32-bit integers in PDP-endian?

No. There are many legends about the infamous PDP-endian in the computer folklore. In PDP-endian, a 32-bit integer is stored as two 16-bit integers in big-endian, but the 32-bit intergers themselves are stored in little-endian in memory.

For example, a 64-bit bignum looks like this:

+---------------------------+---------------------------+
|        32-bit low         |        32-bit high        |
|-------------+-------------|-------------+-------------|
| 16-bit high | 16-bit low  | 16-bit high | 16-bit low  |
|_____________|_____________|_____________|_____________|

Having said that, the PDP-endian is completely irrelevant to what we’re doing here. One must realize that the PDP-11 in its basic form is a pure 16-bit machine in little-endian.

First, The PDP-endian is mainly used by the Floating-Point Unit for double-precision floating point arithmetic - it’s not even supported on a PDP-11/40. Second, the Unix C complier always stores 32-bit integers in PDP-endian for consistency, but we don’t need to exchange any binary data with other programs. The internal state of a cipher is never something you exchange with others, well, at least not intentionally.

The take-home message is: if you don’t need to use floats or C, there’s no need to use PDP-endian. Regular little-endian integers are adequate.

Implementation

Let’s try implementing the quarter round in PDP-11 assembly.

To avoid repetition without loss of generality, if the operations of the remaining 3 steps are identical and their execution time can be directly calculated, I’ll only explicitly written down the code for the first step. If needed, you can find the full code at the end of the article.

All assembly code is written in Unix as(1) syntax. An immediate constant is prefixed with $, as in $0777. Dereferencing memory from an address is denoted as *addr, dereferencing memory from an register is denoted as (rX). DEC MACRO-11 uses a different syntax, they are written as #addr, @0777, and (rX).

Numbers that start with a zero (e.g. 0777) are in octets - it was still the preferred number system by programmers at that time, because serious computers (until IBM S/360) all used 36-bit integers, which could be cleanly represented as four 3-octet groups: 777 777 777 777. PDP-11 is a 16-bit machine, but it was still before hexadecimal became ubiquitous. The PDP-11 machine code was designed in octets, octets were also the default number system in Unix assembler and debugger, and of course, the Unix filesystem permission.

Initialization

Before any computation can begin, ChaCha20’s 512-bit internal state must be initialized with a 256-bit encryption key, a block counter, and a nonce. Because this is trivial (just write some numbers to memory) and my implementation is still incomplete, I’ll omit this part.

Setup

To begin a quarter round, we need to first load the number a and d into the registers, so we’ll have a number that can be the can ADDed to or XORed against.

PDP-11 is a 16-bit machine, it means 2 MOV operations for each integer.

mov	*$chacha_a + 0, r0	// 1.74 + 1.46 (+ 0.15 x 2)
mov	*$chacha_a + 2, r1	// 1.74 + 1.46 (+ 0.15 x 2)
mov	*$chacha_d + 0, r4	// 1.74 + 1.46 (+ 0.15 x 2)
mov	*$chacha_d + 2, r5	// 1.74 + 1.46 (+ 0.15 x 2)

Prefix $ indicates the number is an immediate (literal) constant, and * tells the CPU to get the content at this memory address. So it says, “dereference the memory at constant address chacha_a”. Alternatively, instead of *$addr, you can also simply write addr.

From the PDP-11/40 manual, I found a MOV from memory takes 3.2 µs (1.74 µs of source time and 1.46 µs of fetch and execution time, see Appendix 2). A Memory Management Unit (required for UNIX and other timesharing operating systems) also adds a 0.15 µs overhead to each memory cycle, so we optionally adds 0.6 µs.

But we can do better. Didn’t I mention that *$addr can be shortened to addr before? It has an additional benefit: it triggers the Unix assembler as to use PC-relative addressing mode. Instead of generating hardcoded absolute memory addresses, all addresses would be expressed as an offset to PC. This is known as Position-Independent Code.

On the PDP-11, it runs faster. The PIC version shown here (addressing mode 6) is 0.28 µs faster than the non-PIC version (addressing mode 3).

mov	chacha_a + 0, r0	// 1.46 + 1.46 (+ 0.15 x 2)
mov	chacha_a + 2, r1	// 1.46 + 1.46 (+ 0.15 x 2)
mov	chacha_d + 0, r4	// 1.46 + 1.46 (+ 0.15 x 2)
mov	chacha_d + 2, r5	// 1.46 + 1.46 (+ 0.15 x 2)

This operation is only necessary in step 1, because at this point we don’t have anything in registers yet. Later, in step 2, 3, and 4, there is already one integer from the previous calculation in our register, allowing us to skip it. Therefore, I call this step the setup overhead.

Setup takes 11.68 µs (+ 1.2 µs) in total.

Load and Add

Then, a 32-bit carryless addition is performed. We load b and do an addition. Again, because the PDP-11 is a 16-bit machine, it means we need both a carryless addition (ADD) and an addition with carry (ADC).

mov	chacha_b + 0, r2	// 1.46 + 1.46	(+ 0.15 x 2)
mov	chacha_b + 2, r3	// 1.46 + 1.46	(+ 0.15 x 2)
add	r2, r0                  // 0.99		(+ 0.15)
adc	r1			// 0.99		(+ 0.15)
add	r3, r1			// 0.99		(+ 0.15)

On PDP-11, adding from registers to registers is one of the most basic operations, it’s fast and only takes 0.99 µs (+ 0.15 µs) - one of the fastest thing you can do on a PDP-11. A register-to-register transfer (0.9 µs + 0) is the fastest, but only 0.09 µs (+ 0.15 µs) faster.

In total, loading b and adding take 8.81 µs (+ 1.05 µs).

Store and XOR

Next, the resultant a is XORed with the integer d, we simply perform an XOR here.

Because we have 6 registers to hold 3 integers, but ChaCha20 operates on 4. It forces us to store one integer back to RAM in each step within the quarter round. Storing it prior to the XOR is the best time - it frees up two spare registers, r0 and r1, as our scratchpad registers.

For example, in step 1, the integer a is not used in the next subsequent rotation and can be written back.

// store a
mov	r0, chacha_a + 0	// 2.84		(+ 0.15 x 3)
mov	r1, chacha_a + 2	// 2.84		(+ 0.15 x 3)

// d ^= a
xor	r0, r4			// 0.99		(+ 0.15)
xor	r1, r5			// 0.99		(+ 0.15)

The XOR insturction has the same timing, 0.99 µs (+ 0.15 µs), just like other arithmetic insturctions.

In total, a store and a XOR take 7.66 µs (+ 1.2 µs).

Rotate to the Left

ChaCha20 requires a 16-bit, 12-bit, 8-bit and 7-bit rotation in each quarter round. This is the most painful operation on the PDP-11. For a basic PDP-11/40 machine, only 1 bit can be shifted at a time. Writing this part takes a lot of head scratching. I found some encouragement in the Jargon File.

SHIFT TO THE LEFT!
SHIFT TO THE RIGHT!
POP UP, PUSH DOWN,
BYTE! BYTE! BYTE!
– Programmer’s Cheer, Jargon File

Basic Rotation

To rotate an integer, We have to do a left shift on the lower 16-bit word, a rotation (shift with carry) at the higher 16-bit, add the overflowed carry bit back to the lower 16-bit. And we need to do it for 16 times!

// repeat 16, 12, 8, 7 times
asl	r0	// 0.99	(+ 0.15)
rol	r1	// 0.99	(+ 0.15)
adc	r0	// 0.99	(+ 0.15)

In our quarter round, step 1 runs like a snail and takes 47.52 µs (+ 7.2 µs) in total. Later, in step 2, 3 and 4, the rotation offsets are smaller - 12, 8 and 7 bits, they are still slow - they take 35.64 µs (+ 5.4 µs), 23.75 µs (+ 3.6 µs), and 20.79 µs (+ 3.15 µs).

But all is not lost, the rotation offset in ChaCha20 is selected on purpose to make our life easier.

Optimizing 16-bit Rotation

Rotating 16-bit to the left is equivalent to swapping the lower and higher 16-bit words. Even an explicit swap is not necessary. We can simply swap the order of operation while adding, XORing and storing register r0 and r1 from now on. Here, we can use the best technology in existence to do it: No Code – Write nothing; deploy nowhere.

// No Code	// 0.00

Optimizing 12-bit Rotation

Rotating 12-bit to the left is equivalent to a word swap and a 4-bit shift to the right. The swapping is done implicitly during loading and storing, so there are only 4 bits to shift.

The adjustment for a rightshift is a bit tricky (no pun intended): when the high 16-bit word is right-shifted, its leftmost bit is filled by a zero, this is incorrect. We need to set or clear the leftmost bit of the high word according to the rightmost bit of the low word. We can’t call ROR - ROR only set the bit after shifting it and creates an additional unwanted shift, it has to be adjusted manually. In a leftshift, we simply use ADC to toggle bit 0. In a rightshift, however, there is no single instruction to toggle bit 15 using the carry flag.

Usually, the conditional branch instructions BCS (branch if carry set) and BCC (branch if carry clear) can be used.

// initialize
mov	$0100000, r0	// 0.84 + 1.46	(+ 0.15 x 2)

asr	r3		// 1.25		(+ 0.15)
ror	r2		// 1.25		(+ 0.15)

// adjust MSB (or skip if carry is cleared)
// Don't use. This is insecure, non-constant time code,
// See commentary in the blog post.
bcc	skip		// 1.76 (+ 0.15) / 1.40 (+ 0.15)
add	r0, r3		// 0.99 (+ 0.15)

skip:
// more code...

It takes 2.75 µs (+ 0.3 µs) to adjust the MSB per shift. If no adjustment is needed, it takes only 1.40 µs (+ 0.15 µs). But it makes the execution of the cipher depend on the secret internal state, which is derived from our secret key - it opens up an entire category of side-channel attacks, a big no in cryptography.

My solution: If we can’t use ROR to adjust the most significant bit after the shift, at least we can pre-set the carry flag manually before the shift.

To do this, first, we copy the lower word to a scratchpad register (to avoid destroying the original word) and shift it by 1 bit. After this shift, the carry flag is set according to the rightmost bit. Later, we rotate the high word - now rightmost bit is transferred to the leftmost bit. Finally, rotate the low word.

mov	r2, r0	// 0.90

// repeat 4 times
asr	r0	// 1.25	(+ 0.15)
ror	r3	// 1.25	(+ 0.15)
ror	r2	// 1.25	(+ 0.15)

As the rotation doesn’t cross a word boundary, we can reuse the scratchpad register repeatedly. Now we have a branchless 3-instruction rotation, the vulnerability is eliminated, and it also runs faster than a conditional branch.

Curiously, shifting to the right has an additional 0.25 µs performance penalty, my uneducated guess is that the hardware prefers leftshifting, when shifting to the right, the microcode has to execute one more instruction or so (yes, almost all PDP-11 models are microcoded).

Update: I’ve read the list of microcode and found no micro-instruction for shifting or rotating bits, but found some mentions about registers and multiplexer signals, thus, this may be related to ALU’s logic details. More investigation needed.

This step takes 15.90 µs (+ 1.80 µs) in total.

Optimizing 8-bit Rotation

On PDP-11, almost all 16-bit instructions have a 8-bit version, suffixed with b (e.g. MOV and MOVB). This allows ASCII strings and byte arrays (or shall I say octet arrays?) to be handled directly and is regarded as a major strength of the platform, unlike 18-bit or 36-bit computers of the same era which were best suited for 6-bit characters.

There are some pitfalls to watch, though. First, PDP-11’s 16-bit registers cannot be used as two 8-bit registers pair. We can work on the low 8-bit only, there is no direct access to the high 8-bit word. Instead, SWAB is provided to quickly exchange the upper and lower part in a register - it is this single instruction that enables efficient byte processing. Also, pay attention to MOVB - it performs a sign extension to the register. It means the high 8-bit of the register is overwritten by 11111111 or 00000000 depending on the sign of the loaded byte, hence, a C statement like int16_t x = (int8_t) y works automatically, at the expense of killing our bitpacked data. Beware not to trash the register! Other byte-instructions are safe to use, they all ignore the high 8-bit word.

Now we’re ready for an implemantion. Basically, to transfer a low word safely without destroying the high word, we can clear the low 8-bit word first via CLRB, then apply the “bit set” (logical or) instruction BISB to safely perform the transfer. Combining this sequence with SWAB would rotate the word by 8-bit.

It can be difficult to follow, so example values are given in the comment.

// rotate 8-bit to the left		r4: 011 022
// example values are given		r5: 033 044
swab	r5	// 0.99 (+ 0.15)	r5: 044 033
mov	r5, r0	// 0.90			r0: 044 033

clrb	r5	// 0.99 (+ 0.15)	r5: 044 000
swab	r4	// 0.99 (+ 0.15)	r4: 022 011
bisb	r4, r5	// 0.99 (+ 0.15)	r5: 044 011

clrb	r4	// 0.99 (+ 0.15)	r4: 022 000
bisb	r0, r4	// 0.99 (+ 0.15)	r4: 022 033

I cannot say it’s the fastest implementation with full confidence, but I strongly suspect it to be the case. Note: a slower version appeared in a previous version of this article.

This step takes 6.84 µs (+ 0.90 µs) in total.

Optimizing 7-bit Rotation

A 7-bit rotation is done by using the previous 8-bit rotation and an addtional 1-bit shift to the right.

// 8-bit ROTL...	// 6.84	(+ 0.90)
// 1-bit ROTR...	// 4.65	(+ 0.45)

This step takes 11.49 µs (+ 1.35 µs) in total.

The Futility of EIS

One may wonder: if the machine has the Extended Instruction Set option, can we go faster? At first glance, EIS provides convenient and efficient multi-bit shifts and rotations.

But a more careful examination shows it’s not the case - EIS only provides 32-bit shift or 16-bit rotation, but not our much-needed 32-bit rotation! To synthesize a 32-bit rotation, we must copy the 32-bit operand to 2 spare registers, shift the first two registers by n bits to the left, and the scratchpad registers by (32 - n) bits to the right. The same technique is used in the C programming language and some RISC machines, both lacked an operator for arbitrary integer rotation.

mov	r2, r0		// 0.90
mov	r3, r1		// 0.90

// shift r0r1 to the left (positive offset)
// warning: big-endian shift, MSB is r0
ashc	N, r0		// 0.28 + 3.26 + 0.30 x N

// shift r2r3 to the right (negative offset)
// warning: big-endian shift, MSB is r2
ashc	-(32 - N), r2	// 0.28 + 3.26 + 0.30 x (32 - N)

// merge the result
bis	r0, r2		// 0.99 (+ 0.15)
bis	r1, r3		// 0.99 (+ 0.15)

As a result, in all cases, 32 bits are always shifted, and a lot of time is wasted: A 32-bit rotation always takes 20.46 µs (+ 0.30 µs). It offers zero advantages.

Finalizing

After all four steps in a quarter round has completed, all remaining registers have to be written back to memory.

mov	r3, chacha_b + 0	// 2.84	(+ 0.15 x 3)
mov	r2, chacha_b + 2	// 2.84	(+ 0.15 x 3)
mov	r5, chacha_d + 0	// 2.84	(+ 0.15 x 3)
mov	r4, chacha_d + 2	// 2.84	(+ 0.15 x 3)

Some registers, including b and d have been implicitly swapped, we reverse the high 16-bit and low 16-bit word.

The cost of finalizing a quarter round is 11.36 µs (+ 1.80 µs).

Hard Mode: Software XOR

We’ve successfully implemented the ChaCha20 quarter round function on the PDP-11/40, but sorry, our princess is on another machine.

There’s one last issue to be solved: XOR is only available on PDP-11/40 and better machine. On the low-end PDP-11/05, /10, /15, and /20, this instruction doesn’t exist. It would be nice if we can provide a software fallback option for the owners of these computers using basic Boolean operations, such as AND or OR.

From any introductory book on electronics, we can learn that in combinational logic, we have

a XOR b = ((NOT a) AND b) OR (a AND (NOT b))

On the PDP-11, there’s no AND. The closest instruction we get is the “bit clear” instruction, BIC. If the source register has an “1” bit, the bit at the same location in the destination register is cleared to “0”. Its logical expression is (NOT a) AND b. We also the “bit set” instruction, BIS. It transfers all the “1” bits from the source to the destination register, this is just a logical OR operation.

First attempt

On the PDP-11, due to the existence of BIC, the textbook formula is optimal and only requires 3 instructions. However, BIC destroys the destination register.

One solution is reloading the same integer from RAM, but accessing memory is very expensive and must be avoided when possible. Thus, the only remaining option is saving it to a scratchpad register. At this point, we are running out of regular registers, forcing us to seize the stack pointer, r6.

mov	r0, chacha_a + 0
mov	r1, chacha_a + 2

mov	r0, r6
bic	r4, r0
bic	r6, r4
bis	r0, r4

mov	r1, r6
bic	r5, r1
bic	r6, r5
bis	r1, r5

Disabling and Enabling Interrupts

In the previous software XOR code, we’ve sacrificed our stack pointer r6 and used it as our scratchpad register. It’s not always safe to do this.

When we use a timesharing operating system (like UNIX v6) on a PDP-11/40, the optional Memory Management Unit hardware would be purchased and installed on the PDP-11. The MMU provides two independent stack pointers for the kernel and userspace and switches them automatically. No need to worry, just be sure to save and restore r6 before and after the entire encryption routine is finished.

However, if our program is a freestanding one that runs on the bare metal, or when we use a non-timesharing operating system. In this case, r6 is dedicated to handle interrupts. In particular, when we’re using the PDP-11/20, there is no MMU support, not even an optional one!

If an IRQ occurs after r6 is overwritten, we’ll have random data corruption that leads to a total system crash. In this case, we must disable and enable interrupts appropriately.

// save stack pointer
mov	r6, sp_backup

// set CPU priority level to 7 (ignore all interrupts)
bis	$340, 0177776

// full quarter round

// restore stack pointer
mov	sp_backup, r6

// set CPU priority level to 0 (receive all interrupts)
bic	$340, 0177776

The interrupt status is controlled by the Processor Status Word, mapped to memory address 0177776. There is no separate instruction to enable or disable interrupts (you can access all CPU flags and use them like any data in memory). CPU priority is selected by bit 5, 6 and 7. The highest priority is mode 7, it’s reserved for critical tasks and masks all interrupts. When restoring it, I arbitrarily restored priority level to 0, it may be unsuitable for your application if you make use of multiple priority levels, make sure to adjust it appropriately.

Interrupts should not be postponed for too long. In our code, they’re disabled for the duration of a single quarter round, creating a latency of several hundreds of microseconds.

Second Attempt

Although the previous solution works, it leaves much to be desired. First of all it affects the response speed of interrupts. It may be unacceptable in some systems, or at least not ideal. It also prevents us from using the stack pointer inside the encryption routine - we don’t need it for now, but it may otherwise be useful when it’s integrated into an application.

In the first attempt, reloading the data from RAM was ruled out due to performance concerns. But now we have realized, memory accesses per se is not the problem - saving and restoring the stack pointer and interrupts itself requires 4 memory accesses. The question is whether we can keep the number of memory accesses no more than 4, without relying on the stack pointer.

The answer is affirmative: by careful instruction scheduling, it’s possible.

Step 1

In the first Setup step in a quarter round, we always load d to register r4 and r5 at the very beginning of the program. For a PDP-11/40, it’s a reasonable choice to organize the code. It allows us to calculate the one-off setup overhead as a separate item.

The situation is different on the PDP-11/20, it wastes a valuable register that’s needed for software XOR calculation - clearly, when we’re processing r4, the spare r5 can be used as a scratchpad register.

mov	r0, chacha_a + 0
mov	r1, chacha_a + 2

mov     r0, r5
mov     chacha_d + 0, r4
bic     r4, r0
bic     r5, r4
bis     r0, r4

mov     r1, r0
mov     chacha_d + 2, r5
bic     r5, r1
bic     r0, r5
bis     r1, r5

When processing the low word using r0 and r4, our scratchpad register is r5. Once r4 has been calculated, we can used the free r0 as a scratchpad for r1 and r5.

Finally, don’t forget to delete these two lines from the section Setup. Here, I use a .if macro to allow the generation of either PDP-11/40 or PDP-11/20 code by a global flag (PDP1120 - 1 is just a way to say not PDP1120 due to the lack of a Boolean NOT operator in as).

    mov	chacha_a + 0, r0	// 1.46 + 1.46 (+ 0.15 x 2)
    mov	chacha_a + 2, r1	// 1.46 + 1.46 (+ 0.15 x 2)
.if PDP1120 - 1
    mov	chacha_d + 0, r4	// 1.46 + 1.46 (+ 0.15 x 2)
    mov	chacha_d + 2, r5	// 1.46 + 1.46 (+ 0.15 x 2)
.endif

Although we’ve made two memory accesses, but they’re simply moved from the pre-existing Setup section to the XOR section, so far the overhead is effectively zero.

Step 2 & Step 3

In step 2 and step 3, no spare register is available as the scratchpad register, forcing us to reload the data from memory. Only step 2 is shown, for step 3, change chacha_c to chacha_a.

mov	r0, chacha_c + 0
mov	r1, chacha_c + 2

bic	r2, r0
bic     chacha_c + 0, r2
bis     r0, r2

bic     r3, r1
bic     chacha_c + 2, r3
bis     r1, r3

As a quintessential CISC machine, the PDP-11 instruction set is not limited to load-and-store, but allows direct operations between registers and memory.

In total, there are 4 memory accesses: two in step 2, and two in step 3.

Step 4

In the final step, two spare registers for d are available because they’ve already finished their jobs. One register r4 is already enough.

mov     r5, chacha_d + 0
mov     r4, chacha_d + 2

mov     r0, r4
bic     r3, r0
bic     r4, r3
bis     r0, r3

mov     r1, r4
bic     r2, r1
bic     r4, r2
bis     r1, r2

Again, delete these two lines in the section Finalizing.

    mov	r3, chacha_b + 0	// 2.84	(+ 0.15 x 3)
    mov	r2, chacha_b + 2	// 2.84	(+ 0.15 x 3)
.if PDP1120 - 1
    mov	r5, chacha_d + 0	// 2.84	(+ 0.15 x 3)
    mov	r4, chacha_d + 2	// 2.84	(+ 0.15 x 3)
.endif

And we’re done. The use of the stack pointer has been eliminated without negative impact on performance.

ChaCha20 Internal State

The ChaCha20 internal state is represented by 16 unsigned 32-bit integers as a 4x4 matrix. To calculate a complete iteration of ChaCha20, we repeatedly apply the ChaCha20 Quarter Round function. A full ChaCha20 “round” contains 4 quarter rounds.

First, vertically. This is known as an “odd round”.

QR(0, 4,  8, 12)
QR(1, 5,  9, 13)
QR(2, 6, 10, 14)
QR(3, 7, 11, 15)

ChaCha20 Odd Round

Then, diagonally. This is an “even round”.

QR(0, 5, 10, 15)
QR(1, 6, 11, 12)
QR(2, 7,  8, 13)
QR(3, 4,  9, 14)

ChaCha20 Even Round

Because ChaCha20 has 20 rounds, the odd round and the even around are applied 10 times for each.

Total Cost of 20 Full Rounds

So let’s add the timing of these quarter rounds together…

Setup	11.68 µs
MMU	 1.20 µs
            (x 4 rounds)
Add	 8.81 µs
MMU	 1.05 µs
            (x 4 steps x 4 rounds)
Xor	 7.66 µs
MMU	 1.20 µs
            (x 4 steps x 4 rounds)
Rotate	15.90 µs + 6.84 µs + 11.49 µs
MMU	 1.80 µs + 0.90 µs +  1.35 µs
            (x 4 rounds)
Fin.	11.36 µs
MMU	 1.80 µs
            (x 4 rounds)
------------------------------------------------
Total	492.60 µs
(MMU	+ 64.20 µs)

To complete a 512-bit block of ChaCha20 encryption, 20 full rounds are needed (thus the name, again). It takes…

Baseline:	492.60 µs x 20 = 9852.0 µs
MMU overhead:	 64.20 µs x 20 = 1284.0 µs

Apply Encryption

Mixing

At this point, a full block of ChaCha20 state has been generated. But we haven’t actually started encrypting our data yet. In fact, it’s totally unsafe to use the ChaCha20 output in its current form - an attacker can easily invert the Quarter Round function and read the original input. To obtain the intended security property, the original input block should be backed up, and later mixed into the output block via 32-bit carryless addition.

Again, the PDP-11, as a quintessential and highly orthogonal CISC architecture, is not limited to load-and-store. Register-memory, and even memory-memory operations are supported by almost all instructions on all registers. It allows programmers to write really dense and tight code in assembly - which made perfect sense before highly pipelined processors.

// load pointers
mov	$chacha_state, r0	// 0.84 + 1.46 (+ 0.15)
mov	$chacha_orig, r1	// 0.84 + 1.46 (+ 0.15)

// backup the ChaCha20 inputs
// repeat 32 times...
mov	(r0)+, (r1)+		// 0.84 + 2.42 (+ 0.15 x 3)

// perform 80 quarter rounds

// load pointers
mov	$chacha_state, r0	// 0.84 + 1.46 (+ 0.15)
mov	$chacha_orig, r1	// 0.84 + 1.46 (+ 0.15)

// mix ChaCha20 input and output
// repeat 32 times
add	(r0)+, (r1)+		// 0.84 + 0.84 + 1.76	(+ 0.15 x 4)
adc	(r1)			// 0.78 + 1.77		(+ 0.15 x 3)
add	(r0)+, (r1)+		// 0.84 + 0.84 + 1.76	(+ 0.15 x 4)

The line mov (r0)+, (r1)+ is a memcpy implemented in a single instruction, showing the peak performance of the PDP-11 ISA. A C programmer should be able to recognize it immediately: it’s the origin of *dest++ = *src++ in C.

For convenience, the initial backup is shown and calculated as part of the mixing process. After the mixing step, the output of ChaCha20 becomes highly nonlinear and irreversible, secure for cryptographic purposes.

In total, backing up and mixing take 415.28 µs (+ 67.80 µs).

XORing

ChaCha20 is a stream cipher, what we have now is not the ciphertext, but 512-bit of cryptographically-secure pseudorandom data. To obtain the actual ciphertext, we simply XOR the random data with our plaintext. It’s actually very similar to the One Time Pad, except that we use cryptographically-secure pseudorandomness as a substitute of true randomness. In other words, the heart of every stream cipher is a good Cryptographically Secure Pseudorandom Number Generator (CSPRNG).

Because the calling convention or the full ChaCha20 state handling code is not implemented at this point, I’ll write down only a generic XOR routine.

// load plaintext pointer
mov	$plaintext, r0		// 0.84 + 1.46	(+ 0.15)

// repeat 32 times...
mov	(r0)+, r1		// 0.84 + 1.46	(+ 0.15)
xor	r1, chacha_state+N	// 1.46 + 1.76	(+ 0.15 x 4)

The XOR instruction suffers a limitation, it’s the only instruction on the PDP-11 that only supports a register as its source address, other addressing modes such as auto-increment, are not supported - it’s probably related to the fact that it was originally an extension to the basic instruction set, implemented as a module on the CPU.

The execution time is around 179 µs (+ 24.15 µs). Beware that this routine destroys our ChaCha20 internal state, a more practical routine can be slower.

Discussion

Performance

In conclusion, it takes 10446.28 µs to encrypt 64 bytes of data in our implementation, or 163.23 µs per byte (5.98 KiB/s). The MMU adds an slight overhead, making it 11822.23 µs, or 184.73 µs per byte (5.28 KiB/s).

Note: Our reference machine, PDP-11/40, was a mid-range system, on the high-end system like the PDP-11/45, performance is approximately 15 KiB/s. See Appendix 1.

A register transfer on PDP-11/40 takes 0.90 µs, which is the fastest operation. We use this time as the cycle time of the machine for performance estimation. Hence, ChaCha20 achieved 182 cycles/bytes (non-MMU) and 206 cycles/bytes (MMU) on our system.

Metric Result (w/o MMU) w/ MMU
cycles per byte 182 206
time per byte 163.23 µs 184.73 µs
throughput 5.98 KiB/s 5.28 KiB/s

The “cycles per byte” performance is actually impressive. According to a paper NaCl on 8-bit AVR Microcontrollers by Michael Hutter and Peter Schwabe, an 8-bit AVR microcontroller can run Salsa20 (an earlier design before the ChaCha20) at 216 cycles per bytes. Daniel J. Bernstein also cited the performance reported by developers of the Arduino Cryptography Library, for ChaCha20 it is 238 cycles per bytes.

Hence, I’m confident to say that, at the same clock frequency, ChaCha20’s efficiency on the PDP-11 architecture is similar to 8-bit AVR microcontrollers, like those used by an Arduino. Consider the fact that the AVR is a modern (late 90s) RISC machine with all single-cycle instructions, meanwhile the PDP-11 is a computer architecture from half a century ago, this is not bad at all. It’s also not bad for ChaCha20, a symmetric cipher of today, it clearly achieved its “high performance” objective (Of course, an Arduino board runs at 16 MHz, a PDP-11/40 runs only at 1.11 MHz, but decades later, LSI-11 eventually achieved 15.2 MHz as well).

On the flip side, time per byte is slow, throughput is slow: 5 KiB/s is less than 0.5% of the Unibus bandwidth - because a PDP-11/40’s 0.90 µs cycle time is only 1.11 MHz. But to put things in perspective, a PDP-11/40 had a maximum of ~248 KiB of addressable memory, which can be encrypted in ~1 minute. Encrypting an entire RK05 hard disk drive (~2.5 MB) would only take ~10 minutes, the performance is actually acceptable for sending secret messages and well worth it if you have confidential information to protect.

When a phone line with a Bell 202 modem runs at 1200 baud, 5 KiB/s will not be the bottleneck.

Of course, Public-Key Cryptography is the slowest part in such applications, not the symmetric cipher. According to my quick analysis, the PDP-11 can only perform 0.1 key exchanges per second, even using an elliptic curve. However, public-key cryptography is only used for infrequent key negotiation. For a point-to-point link, it’s only used once per connection. Thus, it won’t be the bottleneck in spite of its slowness. A client-server application may be more problematic, though. Also, public-key cryptography hasn’t been discovered by Diffie and Hellman yet in 1970 (GCHQ did already, but it was classified). Implementating Curve25519 is beyond the scope of this project.

Accuracy of my Estimation

I’m neither famillar with PDP-11 nor high-speed crypto, I’m sure additional optimizations are possible to make the program a bit faster. On the other hand, I doubt a 2x speedup is possible.

Since program overheads (such as function calls, or initializating ChaCha20 with key and ciphertext) are not included in my calculation, I’m almost sure my result is still an overoptimistic overestimation. The actual performance will be slower.

Overall, I think my estimation is believable within +/- 50%. Corrections and optimizations are welcomed, indeed.

Optimizing for Size

Until now, the priority was to optimize for speed at all cost. But in order to make our encryption routine be practical, program size must be considered. Our quarter round implementation takes 198 bytes. In full ChaCha20, there are 80 quarter rounds in total, which eats 15.47 KiB of RAM, this is clearly unacceptable.

If we replace all hardcoded memory addresses with pointers, there is a slight performance degradation due to switching from relative addressing to relative deferred addressing mode, each load and store will be ~1 µs slower. The advantage is that same 206-byte quarter round can always be reused.

I prefer the third solution: We don’t need to unroll 80 quarter rounds, only 8 is enough. In ChaCha20, there are only 8 unique quarter rounds that operates on different memory locations: 4 odds and 4 even rounds. It reduces our program size to 1.55 KiB with almost no performance overhead.

Too Much Crypto

It’s worth mentioning a paper, Too Much Crypto, by Jean-Philippe Aumasson (a serious expert cryptographer who designed BLAKE2 and organized the Password Hashing Competition).

It showed that, in the hope that a cipher can survive an unexpected major breakthough in cryptanalysis, most symmetric ciphers are overdesigned with many more rounds than what’s really necessary. The author argued that a high number of rounds is often irrationally chosen, even when the current studies showed great impracticality to attack even a reduced-round cipher.

The author advocated an objective and fact-based analysis, and claimed that based on the current cryptanalysis, ChaCha8 is secure, and anything higher is only for show without practical significance (i.e. security theater).

It’s worth noting that Bernstein, the designer of ChaCha20 himself, is aware of J. P. Aumasson’s observation. In a talk on high-speed symmetric ciphers, Bernstein acknowledged that a tradeoff exists between confidence and performance.

If you agree with the paper and are willing to take higher (perhaps purely imagined) future risk for better performance, With ChaCha12, you can get 9.60 KiB/s (non-MMU) or 8.48 KiB/s (MMU). ChaCha8 is even faster, 13.78 KiB/s (non-MMU) or 12.15 KiB/s (MMU).

Cipher Speed (w/o MMU) w/ MMU
ChaCha20 5.98 KiB/s 5.28 KiB/s
ChaCha12 9.60 KiB/s 8.48 KiB/s
ChaCha8 13.78 KiB/s 12.15 KiB/s

Side-Channel Attack

ChaCha20 is known for its inherent resistence to side-channel attacks due to the careful avoidance of secret-dependent branching or memory accesses - it naturally runs in constant time. Still, constant time does not imply constant electromagnetic radiation!

Beware that a PDP-11 is electrically noisy and emits more radiation than any modern PCs. It was an excellent enginnering accomplishment of its era, but it was designed before the modern Electromagnetic Compatibility (EMC) standards or multilayer circuit boards with multiple dedicated ground planes were developed and used in the industry (in fact, Howard Johnson’s famed book High Speed Digital Design, page 200, specifically mentioned PDP-8 as an example for how not to design boards to meet EMC standards). It’s possible that an attacker is able to extract memory data by capturing RF leakage from the Unibus traffic. Electromechanical teletypes are also major offenders. Such a setup is extremely vulnerable to RF side-channel attacks. The large size of the installation also makes it impractical to simply shield the equipment itself. If physical attacks is in your threat model, you’ll need two fully shielded rooms at each end with filters on all power and data lines, which is a multi-million dollar project. Fortunately, most research labs probably don’t need them.

Conclusion

A PDP-11 has approximately ~5 KiB/s of ChaCha20 encryption performance, using two PDP-11s is a good way to establish at encrypted point-to-point communication link over a teletype or phone line between two research labs for confidental information. As minicomputers like PDP-11s are already used in many labs, a high degree of security can be achieved at reasonable cost.

However, if additional R&D budgets are available, it’s strongly recommended to implement the ChaCha20 cipher in hardware, as a peripheral on the Unibus.

Appendix 0: See Also

A full read is required for all PDP-11 users. Appendix C Instruction Timing and Appendix E Summary of PDP-11 Instructions should always be referred to when optimizing a program.

Required for start programming in Unix as (or cc). The Unix as assembler is a barebone assembler without extre features like macros. Even in 2.11BSD, its syntax and functionality is essential unchanged since Dennis Ritchie’s early version. Some aspects can be quite confusing for first-time user (e.g. expr vs *$expr, only recognize sp, not r6, etc). A quick browse through the manual answers the questions.

Not required, but sometimes extremely useful (“Use the source, Luke!").

Initially, I got lost in the Processor Handbook and missed its description on interrupts and confused. Reading Lions' book, Section 9.1 Hardware Interrupts and Section 9.5 Interrupt Priorities with references to the Unix source code helped to clarify my doubts.

Appendix 1: Upgrade Path

If you are a would-be owner of a PDP-11, I strongly recommend you to buy the more expensive PDP-11/45, already available one year before the PDP-11/40 (11/40 was the mid-range model, 11/45 is a high-end model specifically for multiuser timesharing). Newer successors from the high-end line, including 11/50, 11/55, and 11/70, are great, too.

Turbocharged with Schottky TTL circuits and semiconductor memory (unlike the PDP-11/40 that only uses core memory), it allows a 3x speedup. In fact, a PDP-11/45 class machine became the standard for UNIX v7 and later, the underpowered PDP-11/40 was struggling to run them.

Dilbert: Buy a Real Computer

On the PDP-11/45, if the fastest bipolar memory option is installed, on one hand, you’ll get a 3x speedup, loading and storing will only take 1 µs. Register transfers, basic arithmetic and logical operations only take 0.3 µs. On the other hand, it’s guaranteed to drain your research budgets… Fortunately, PDP-11/45 is flexible enough to mix different types of memory, buying a few kilobytes of bipolar RAM and using it as a buffer is a more economical option.

Here’s the ChaCha20 timing recalculated for a PDP-11/45 with the fastest bipolar memory installed.

Setup	4.20 µs
MMU	1.08 µs
            (x 4 rounds)
Add	3.00 µs
MMU	0.81 µs
            (x 4 steps x 4 rounds)
Xor	2.70 µs
MMU	0.72 µs
            (x 4 steps x 4 rounds)
Rotate	3.90 µs + 2.10 µs + 3.30 µs
MMU	1.17 µs + 0.63 µs + 0.99 µs
            (x 4 rounds)
Fin.	4.20 µs
MMU	1.08 µs
            (x 4 rounds)
---------------------
Total	162.00 µs
(MMU	+ 44.28 µs)

To complete a 512-bit block of ChaCha20 encryption, it takes…

162.00 µs x 20 = 3240.00 µs

And the final 64-byte mixing and XOR takes 247.80 µs (+ 64.08 µs). In total, it’s 54.50 µs per byte (17.91 KiB/s). The speedup is almost exactly 3x (299%) without MMU.

Unfortunately, PDP-11/45’s MMU is only 2x faster (0.09 µs per cycle), and some PDP-11/45 instructions take more memory cycles to complete, especially the MOV instructions. On PDP-11/40, a register-to-register transfer takes 0 cycles, but on PDP-11/40 it takes at least 1. Hence, encryption performance with a MMU is only 14.08 KiB/s, or 266%.

Naturally, ChaCha12 and ChaCha8 have much higher throughput.

Cipher Speed (w/o MMU) w/ MMU
ChaCha20 17.91 KiB/s 14.08 KiB/s
ChaCha12 28.51 KiB/s 22.42 KiB/s
ChaCha8 40.48 KiB/s 31.85 KiB/s

Appendix 2: Instruction Timing

On PDP-11/40, the execution time of an instruction is calculated by adding a source time, a destination time, and an execution and fetch time together. They are given in microseconds. All the three times can be different depending on the addressing mode (see Appendix 3) and the installed memory option of your paticular machine.

T = t_src + t_dest + t_ef

The MOV instruction is special - only t_src and t_ef are needed, so we assume t_dest is always 0. After looking at the manual, I found that a MOV instruction from memory to register takes 1.74 µs of source time and 1.46 µs of EF time.

EIS instructions are also special, the timing is calculated just like MOV, only t_src and t_ef are used, but they are given in a different table. Also, their execution time depends on the bitshift distance, shifting more bits causes a longer EF time.

Some shortcuts to remember when calculating the timing:

  1. If the source and destination are both registers, t_src and t_dest are 0, only t_ef counts.

  2. When the data is in a register, a basic transfer, arithmetic, or logic operation takes ~1 µs.

  3. A load or store takes ~3 µs.

Appendix 3: Addressing Modes

For those who’s following the manual, loading from memory to register is called the “absolute” addressing mode, or mode 3. Actually, the “auto-increment deferred” mode is also mode 3 - They are the same mode. In other words, the following two instructions are the same.

// load the content of address XXXX to r0
// In C, it's "int r0 = *(int *) XXXX"
mov *$XXXX, r0

// get the content of an address stored in register R, use
// the content as a new address, then load its content to r0,
// finally, increments the register R (imagine iterating an
// int ** array of int * pointers).
// In C, it's "int r0 = **R++".
mov *(R)+, r0

They look like two very dissimilar operations: One dereferences a hardcoded constant pointer, another dereferences a pointer to a list of pointers, then auto-increments the register. Why on Earth are they the same operation?

Because on PDP-11, the program counter, PC, can be used like any other registers, and “dereferencing a hardcoded constant pointer” is simply seen as a special case of “dereferencing the register twice and increment the register” - the register just happens to be PC.

  PC is here
      |
      V
mov (XXXX), r0

When an instruction that dereferences a constant pointer is executed, the program pointer is pointed to the beginning of the instruction itself, at where our hardcoded address is located! If we dereferencing PC twice (first to get the hardcoded memory address itself, then to dereference that address), we can get the content at memory address XXXX and load to the register, finally, we increments PC, so we start running the next instruction.

Similarly, both PC-relative addressing mode (for Position-Independent Code) and index mode (base + offset, for obtaining elements from an array) are the same mode: mode 6. PC-relative mode is simply the index mode using PC as the base address.

The PDP-11 instruction set demostrates a lot of ingenuity from the engineers at DEC, doesn’t it?

Appendix 4: Similarities to Later Microprocessors

The PDP-11 instruction set has some similarities to later microprocessors. For example, instruction mnemonics: JSR, RTS, ASL, ROR, BCS, BCC, are all perfectly understandable to MOS 6502, Motorola 6800, or 68000 developer, no explanation is needed. If you are one of them, you probably can understand this article with ease. This is not a coincidence - many designers of the later microprocessors were personally famillar with the PDP-11, or had studied it. This can be confirmed in some interviews.

Jeff LaVell, an engineering director on Motorota’s 6800 project said:

We also, that was the time of the PDP-11, and it was the great hi-technology computer at that time, so a little mini-thing with a lot of switches and so that was a neat architecture and we liked that a lot. So we started with that architecture and built some variations on it, and the original product definition of the 6800 family was a 16-bit machine. And then we talked to Tom and he said, “Go away.” And then we evolved into the real world of what we could build.

Bill Mensch, an engineer on Motorota’s 6800 (and later MOS6502) project offered a different account:

The 6800 microprocessor was thrown together by a lot of contributors, so therefore the design group didn’t really have much of the say about what went into it. And if a sales guy came back and said, hey, we need this instruction, so they added another instruction, whether–whatever. So it was kind of a random thing, but it was built on the specification of the PDP-8, the 6800. The 6502 came later and was modeled on the PDP-11, better architecture.

Chuck Peddle, the chief enginner on MOS6502 commented:

I took a short-term assignment to design a typesetting system for a friend of mine who had thought that typesetting was the new future. And so we put a typesetting system on a PDP-11 and started doing the local town Cave Creek newspapers and everything. So the PDP-11 was an extremely good eye-opener for me. Because it showed you that memory registers and memory addressing was important.

Although the architectures of 8-bit processors are quite different from the PDP-11 (the PDP-11 has a highly orthogonal ISA with many general-purpose registers, the 8-bit microprocessors mainly use accumulator-based designs with a lot more limitations, VLSI was still in its infancy), but at least we can see an indirect influence.

Appendix 5: Full Code

The ChaCha20 quarter round implementation I’ve just described can be downloaded here. The code snippet can be compiled directly on 2.11BSD. The implementation is incomplete - only a single quarter round is written (ChaCha20 requires 80 quarter rounds, this code should be converted to a macro that accepts four memory addresses chacha_a, chacha_b, chacha_c and chacha_d as its arguments). Also, there is no input or output. The result can only be seen in a debugger by setting a breakpoint at the end and inspecting the memory. For more information, see the comments in code.

// Copyright 2021 Tom Li <tomli@tomli.me>
//
// Permission to use, copy, modify, and/or distribute this software for any
// purpose with or without fee is hereby granted, provided that the above
// copyright notice and this permission notice appear in all copies.
//
// THE SOFTWARE IS PROVIDED "AS IS" AND THE AUTHOR DISCLAIMS ALL WARRANTIES
// WITH REGARD TO THIS SOFTWARE INCLUDING ALL IMPLIED WARRANTIES OF
// MERCHANTABILITY AND FITNESS. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR
// ANY SPECIAL, DIRECT, INDIRECT, OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES
// WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN
// ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF
// OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.
//
// ChaCha20 Quarter Round implementation for DEC PDP-11.
//
// This is a proof-of-concept, only a single quarter round is written.
// ChaCha20 requires 80 quarter rounds, this code should be converted
// to a macro that accepts four memory addresses chacha_a, chacha_b, chacha_c
// and chacha_d as its arguments. Also, there is no input or output.
// The result can only be seen in a debugger by setting a breakpoint at
// the end and inspecting the memory.
//
// Note: Before using this program, see the commentary on my blog:
// https://niconiconi.neocities.org/posts/chacha20-performance-on-pdp-11/
//
// Version history:
//
// * v0.02 (Mar 2020):
// - Performance optimization.
// - Added Software XOR simulation for compatibility on PDP-11/20. All
//   PDP-11 machines are now supported.
//
// * v0.01 (Feb 2020):
// - First release.
// - Only support PDP-11/40 and better machines.
//
// FAQ:
//
// Q: The code a lot of repetition, why didn't you use a real assembler
// like MACRO-11 or GNU as?
//
// A: I should. But on my emulator (2.11BSD), it has Unix as and MACRO-11,
// and on my main system, it only has GNU as. This is a short project and
// I didn't have time to install a cross-development setup on my main system,
// or install GNU as in 2.11BSD, so Unix as became the only choice, although
// it has no macro capabilities. It also ensures everyone who runs Unix on
// the PDP-11 can assemble it. If I still have time, I'll probably rewrite
// it in GNU as.
//
// To assemble it on 2.11BSD, run:
// $ cc chacha20.s -o chacha20
//
// Then use the debugger:
// $ adb ./chacha20
//
// Select the memory address "end":
// adb> end
//
// Set a breakpoint at this address:
// adb> :b
//
// Run the program:
// adb> :r
// ./chacha20: running
// breakpoint      end:            cflg    0
//
// Inspect memory:
// adb> chacha_a, 2 ?x
// _etext+02:      #92f4   #ea2a
// adb> chacha_b, 2 ?x
// _etext+06:      #f8ce   #cb1c
// adb> chacha_c, 2 ?x
// _etext+012:     #472e   #4581
// adb> chacha_d, 2 ?x
// _etext+016:     #c4bb   #5881
//
// The output should match the test vector in RFC 8439, section 2.1.1.
//
// Input:
//
//    a = 0x11111111
//    b = 0x01020304
//    c = 0x9b8d6f43
//    d = 0x01234567
//
// After running a Quarter Round on these four numbers, we get these:
//
//    a = 0xea2a92f4
//    b = 0xcb1cf8ce
//    c = 0x4581472e
//    d = 0x5881c4bb

.data

// By default, the XOR instruction is used for performance. It's only
// supported on PDP-11/40 and better machines. Set this macro to 1 to
// use software XOR simulation for PDP-11/20 class machines without
// the XOR instruction.
PDP1120 = 0

chacha_a:
.byte 21, 21, 21, 21		// 0x11111111
chacha_b:
.byte 4, 3, 2, 1		// 0x01020304
chacha_c:
.byte 103, 157, 215, 233	// 0x4581472e
chacha_d:
.byte 147, 105, 43, 1		// 0x5881c4bb

.text

// All timings are for PDP-11/40.

.globl _main
_main:
    mov     chacha_a + 0, r0        // 1.46 + 1.46	(+ 0.15 x 2)
    mov     chacha_a + 2, r1        // 1.46 + 1.46	(+ 0.15 x 2)
.if PDP1120 - 1
    mov     chacha_d + 0, r4        // 1.46 + 1.46	(+ 0.15 x 2)
    mov     chacha_d + 2, r5        // 1.46 + 1.46	(+ 0.15 x 2)
.endif

step1:
    // a += b
    mov     chacha_b + 0, r2        // 1.46 + 1.46	(+ 0.15 x 2)
    mov     chacha_b + 2, r3        // 1.46 + 1.46	(+ 0.15 x 2)
    add     r2, r0                  // 0.99		(+ 0.15)
    adc     r1                      // 0.99		(+ 0.15)
    add     r3, r1                  // 0.99		(+ 0.15)

    // store a
    mov     r0, chacha_a + 0        // 2.84		(+ 0.15 x 3)
    mov     r1, chacha_a + 2        // 2.84		(+ 0.15 x 3)

    // d ^= a
.if PDP1120 - 1
    xor     r0, r4                  // 0.99		(+ 0.15)
    xor     r1, r5                  // 0.99		(+ 0.15)
.endif
.if PDP1120
    mov     r0, r5
    mov     chacha_d + 0, r4
    bic     r4, r0
    bic     r5, r4
    bis     r0, r4

    mov     r1, r0
    mov     chacha_d + 2, r5
    bic     r5, r1
    bic     r0, r5
    bis     r1, r5
.endif

step2:
    // c += (d <<< 16)
    mov     chacha_c + 0, r0        // 1.46 + 1.46	(+ 0.15 x 2)
    mov     chacha_c + 2, r1        // 1.46 + 1.46	(+ 0.15 x 2)
    add     r5, r0                  // 0.99		(+ 0.15)
    adc     r1                      // 0.99		(+ 0.15)
    add     r4, r1                  // 0.99		(+ 0.15)

    // store c
    mov     r0, chacha_c + 0        // 2.84		(+ 0.15 x 3)
    mov     r1, chacha_c + 2        // 2.84		(+ 0.15 x 3)

    // b ^= c
.if PDP1120 - 1
    xor     r0, r2                  // 0.99		(+ 0.15)
    xor     r1, r3                  // 0.99		(+ 0.15)
.endif
.if PDP1120
    bic     r2, r0
    bic     chacha_c + 0, r2
    bis     r0, r2

    bic     r3, r1
    bic     chacha_c + 2, r3
    bis     r1, r3
.endif

    // b >>>= 4
    mov     r2, r0		// 0.90
    asr     r0			// 1.25		(+ 0.15)
    ror     r3			// 1.25		(+ 0.15)
    ror     r2      		// 1.25		(+ 0.15)
    asr     r0      		// 1.25		(+ 0.15)
    ror     r3      		// 1.25		(+ 0.15)
    ror     r2      		// 1.25		(+ 0.15)
    asr     r0      		// 1.25		(+ 0.15)
    ror     r3      		// 1.25		(+ 0.15)
    ror     r2      		// 1.25		(+ 0.15)
    asr     r0      		// 1.25		(+ 0.15)
    ror     r3      		// 1.25		(+ 0.15)
    ror     r2			// 1.25		(+ 0.15)

step3:
    // a += (b <<< 16)
    mov     chacha_a + 0, r0        // 1.46 + 1.46	(+ 0.15 x 2)
    mov     chacha_a + 2, r1        // 1.46 + 1.46	(+ 0.15 x 2)
    add     r3, r0                  // 0.99		(+ 0.15)
    adc     r1                      // 0.99		(+ 0.15)
    add     r2, r1                  // 0.99		(+ 0.15)

    // store a
    mov     r0, chacha_a + 0        // 1.46 + 1.46	(+ 0.15 x 3)
    mov     r1, chacha_a + 2        // 1.46 + 1.46	(+ 0.15 x 3)

    // d ^= a
.if PDP1120 - 1
    xor     r0, r5                  // 0.99		(+ 0.15)
    xor     r1, r4                  // 0.99		(+ 0.15)
.endif
.if PDP1120
    bic     r5, r0
    bic     chacha_a + 0, r5
    bis     r0, r5

    bic     r4, r1
    bic     chacha_a + 2, r4
    bis     r1, r4
.endif

    // d <<<= 8
    // rotate 8-bit to the left                        r4: 011 022
    // example values are given                        r5: 033 044
    swab    r5                      // 0.99 (+ 0.15)   r5: 044 033
    mov     r5, r0                  // 0.90            r0: 044 033

    clrb    r5                      // 0.99 (+ 0.15)   r5: 044 000
    swab    r4                      // 0.99 (+ 0.15)   r4: 022 011
    bisb    r4, r5                  // 0.99 (+ 0.15)   r5: 044 011

    clrb    r4                      // 0.99 (+ 0.15)   r4: 022 000
    bisb    r0, r4                  // 0.99 (+ 0.15)   r4: 022 033

step4:
    // c += (d <<< 16)
    mov     chacha_c + 0, r0        // 1.46 + 1.46	(+ 0.15 x 2)
    mov     chacha_c + 2, r1        // 1.46 + 1.46	(+ 0.15 x 2)
    add     r5, r0                  // 0.99		(+ 0.15)
    adc     r1                      // 0.99		(+ 0.15)
    add     r4, r1                  // 0.99		(+ 0.15)

    // store c
    mov     r0, chacha_c + 0        // 2.84		(+ 0.15 x 3)
    mov     r1, chacha_c + 2        // 2.84		(+ 0.15 x 3)

    // (b <<< 16) ^= c
.if PDP1120 - 1
    xor     r0, r3                  // 0.99		(+ 0.15)
    xor     r1, r2                  // 0.99		(+ 0.15)
.endif
.if PDP1120
    mov     r5, chacha_d + 0
    mov     r4, chacha_d + 2

    mov     r0, r4
    bic     r3, r0
    bic     r4, r3
    bis     r0, r3

    mov     r1, r4
    bic     r2, r1
    bic     r4, r2
    bis     r1, r2
.endif

    // (b <<< 16) <<<= 8
    // rotate 8-bit to the left                        r3: 011 022
    // example values are given                        r2: 033 044
    swab    r2                      // 0.99 (+ 0.15)   r2: 044 033
    mov     r2, r0                  // 0.90            r0: 044 033

    clrb    r2                      // 0.99 (+ 0.15)   r2: 044 000
    swab    r3                      // 0.99 (+ 0.15)   r3: 022 011
    bisb    r3, r2                  // 0.99 (+ 0.15)   r2: 044 011

    clrb    r3                      // 0.99 (+ 0.15)   r3: 022 000
    bisb    r0, r3                  // 0.99 (+ 0.15)   r3: 022 033

    // b >>>= 1
    mov     r3, r0                  // 0.90
    ror     r0                      // 1.25		(+ 0.15)
    ror     r2                      // 1.25		(+ 0.15)
    ror     r3                      // 1.25		(+ 0.15)

    // store b, d
    mov     r3, chacha_b + 0	// 2.84		(+ 0.15 x 3)
    mov     r2, chacha_b + 2	// 2.84		(+ 0.15 x 3)
.if PDP1120 - 1
    mov     r5, chacha_d + 0	// 2.84		(+ 0.15 x 3)
    mov     r4, chacha_d + 2	// 2.84		(+ 0.15 x 3)
.endif

end:
    // use debugger to set a breakpoint here
    nop

    // call exit() system call, return 0.
    // XXX: This is 2.11BSD-specific code.
    mov	$0, -(sp)
    tst	-(sp)
    sys	1