ChaCha20 Encryption Performance on DEC PDP-11
Image by Don DeBold, CC BY 2.0, some rights reserved.
Introduction
For fun, I’m implementating the ChaCha20 cipher in PDP-11 assembly. To maximize its compatibility, my primary reference target is a plain PDP-11/40 without the Extended Instruction Set (EIS) option nor speedups available in later machines. Backward compatibility is also provided for PDP-11/20 to cover 100% of the installations.
The natural question is: is our 50-year-old computer fast enough to run ChaCha20, the state-of-the-art crypto on today’s Internet?
Currently, the implementation is incomplete - I have no prior experience on programming the PDP-11 and I plan to learn everything on-the-fly. Still, I cannot wait any longer to answer this question, so I did a performance analysis. Here’s my findings.
Background
ChaCha20 is a 256-bit stream cipher for symmetric encryption. It’s designed to have both high performance and high security. It can be implemented efficiently in pure software. By avoiding secret-dependent memory accesses and conditional branches in its construction, it’s immune to many forms of timing side-channel attacks that software implementations of other algorithms (e.g. AES) are plagued with.
In recent years, it has been widely accepted by the applied cryptography community. It has been standardized and incorporated in many software projects and Internet protocols, including the Linux kernel, OpenBSD, OpenSSH, TLSv1.3, and Tor.
ChaCha20 Quarter Round
In ChaCha20, the basic unit of operation is the Chacha20 quarter round function. It takes 4 unsigned 32-bit integer a, b, c, d as inputs and applies a series of transformations in 4 steps. In each step, 3-different 32-bit unsigned integers are added and XORed with one another, then the result is rotated (circular shifted) to the left.
The quarter round function is shown below.
a += b; d ^= a; d <<<= 16;
c += d; b ^= c; b <<<= 12;
a += b; d ^= a; d <<<= 8;
c += d; b ^= c; b <<<= 7;
As you see, in step 1 and step 3, it uses the integers a, b, d; in step 2 and 4, it uses b, c, d. Rotation offsets are different in each step. Nevertheless, the same operations are always repeated: ADD, XOR and ROTATE. This construction is known as an ARX cipher in cryptography.
A full ChaCha20 round consists of 4 quarter rounds, hence the name “quarter round” (not to be confused with the 4 substeps in a single quarter round). They operate on 16 unsigned 32-bit integers as a 4x4 matrix, not just 4 (I’ll explain it later).
Register Allocation
Registers on the PDP-11 are listed below.
Register | Description |
---|---|
r0 - r5 | 16-bit General Purpose |
sp (r6) | 16-bit Stack Pointer |
pc (r7) | 16-bit Program Counter |
Six 16-bit general purpose registers are available, allowing us to hold three 32-bit integers in registers. Each step in the ChaCha20 quarter round happens to operate on three 32-bit integers, we found the number of registers is just enough - but we do have to load and store one register per step. ChaCha20 is really designed for machines with four 32-bit registers or more, not three.
The stack pointer r6
, can be used as a scratchpad register if
we really need to. However, it should be saved to a known memory
location and be restored later, for some systems, one should also
enable and disable interrupts appropriately. It’ll be explained
later.
It’s worth mentioning that register r7
, in princple, works as a
general-purpose register. It responds to all PDP-11’s addressing
mode, but we can’t use it unless we want to trash our own program.
Wait, 32-bit integers in PDP-endian?
No. There are many legends about the infamous PDP-endian in the computer folklore. In PDP-endian, a 32-bit integer is stored as two 16-bit integers in big-endian, but the 32-bit intergers themselves are stored in little-endian in memory.
For example, a 64-bit bignum looks like this:
+---------------------------+---------------------------+
| 32-bit low | 32-bit high |
|-------------+-------------|-------------+-------------|
| 16-bit high | 16-bit low | 16-bit high | 16-bit low |
|_____________|_____________|_____________|_____________|
Having said that, the PDP-endian is completely irrelevant to what we’re doing here. One must realize that the PDP-11 in its basic form is a pure 16-bit machine in little-endian.
First, The PDP-endian is mainly used by the Floating-Point Unit for double-precision floating point arithmetic - it’s not even supported on a PDP-11/40. Second, the Unix C complier always stores 32-bit integers in PDP-endian for consistency, but we don’t need to exchange any binary data with other programs. The internal state of a cipher is never something you exchange with others, well, at least not intentionally.
The take-home message is: if you don’t need to use floats or C, there’s no need to use PDP-endian. Regular little-endian integers are adequate.
Implementation
Let’s try implementing the quarter round in PDP-11 assembly.
To avoid repetition without loss of generality, if the operations of the remaining 3 steps are identical and their execution time can be directly calculated, I’ll only explicitly written down the code for the first step. If needed, you can find the full code at the end of the article.
All assembly code is written in Unix as(1)
syntax. An immediate constant
is prefixed with $
, as in $0777
. Dereferencing memory from an address
is denoted as *addr
, dereferencing memory from an register is denoted as
(rX)
. DEC MACRO-11 uses a different syntax, they are written as #addr
,
@0777
, and (rX)
.
Numbers that start with a zero (e.g. 0777
) are in octets - it was still
the preferred number system by programmers at that time, because serious
computers (until IBM S/360) all used 36-bit integers, which could be cleanly
represented as four 3-octet groups: 777 777 777 777
. PDP-11 is a 16-bit
machine, but it was still before hexadecimal became ubiquitous. The PDP-11
machine code was designed in octets, octets were also the default number
system in Unix assembler and debugger, and of course, the Unix filesystem
permission.
Initialization
Before any computation can begin, ChaCha20’s 512-bit internal state must be initialized with a 256-bit encryption key, a block counter, and a nonce. Because this is trivial (just write some numbers to memory) and my implementation is still incomplete, I’ll omit this part.
Setup
To begin a quarter round, we need to first load the number a and d into the registers, so we’ll have a number that can be the can ADDed to or XORed against.
PDP-11 is a 16-bit machine, it means 2 MOV operations for each integer.
mov *$chacha_a + 0, r0 // 1.74 + 1.46 (+ 0.15 x 2)
mov *$chacha_a + 2, r1 // 1.74 + 1.46 (+ 0.15 x 2)
mov *$chacha_d + 0, r4 // 1.74 + 1.46 (+ 0.15 x 2)
mov *$chacha_d + 2, r5 // 1.74 + 1.46 (+ 0.15 x 2)
Prefix $
indicates the number is an immediate (literal) constant, and
*
tells the CPU to get the content at this memory address. So it says,
“dereference the memory at constant address chacha_a
”. Alternatively,
instead of *$addr
, you can also simply write addr
.
From the PDP-11/40 manual, I found a MOV from memory takes 3.2 µs (1.74 µs of source time and 1.46 µs of fetch and execution time, see Appendix 2). A Memory Management Unit (required for UNIX and other timesharing operating systems) also adds a 0.15 µs overhead to each memory cycle, so we optionally adds 0.6 µs.
But we can do better. Didn’t I mention that *$addr
can be shortened to addr
before? It has an additional benefit: it triggers the Unix assembler as
to use
PC-relative addressing mode. Instead of generating hardcoded absolute memory
addresses, all addresses would be expressed as an offset to PC. This is known
as Position-Independent Code.
On the PDP-11, it runs faster. The PIC version shown here (addressing mode 6) is 0.28 µs faster than the non-PIC version (addressing mode 3).
mov chacha_a + 0, r0 // 1.46 + 1.46 (+ 0.15 x 2)
mov chacha_a + 2, r1 // 1.46 + 1.46 (+ 0.15 x 2)
mov chacha_d + 0, r4 // 1.46 + 1.46 (+ 0.15 x 2)
mov chacha_d + 2, r5 // 1.46 + 1.46 (+ 0.15 x 2)
This operation is only necessary in step 1, because at this point we don’t have anything in registers yet. Later, in step 2, 3, and 4, there is already one integer from the previous calculation in our register, allowing us to skip it. Therefore, I call this step the setup overhead.
Setup takes 11.68 µs (+ 1.2 µs) in total.
Load and Add
Then, a 32-bit carryless addition is performed. We load b and do an addition. Again, because the PDP-11 is a 16-bit machine, it means we need both a carryless addition (ADD) and an addition with carry (ADC).
mov chacha_b + 0, r2 // 1.46 + 1.46 (+ 0.15 x 2)
mov chacha_b + 2, r3 // 1.46 + 1.46 (+ 0.15 x 2)
add r2, r0 // 0.99 (+ 0.15)
adc r1 // 0.99 (+ 0.15)
add r3, r1 // 0.99 (+ 0.15)
On PDP-11, adding from registers to registers is one of the most basic operations, it’s fast and only takes 0.99 µs (+ 0.15 µs) - one of the fastest thing you can do on a PDP-11. A register-to-register transfer (0.9 µs + 0) is the fastest, but only 0.09 µs (+ 0.15 µs) faster.
In total, loading b and adding take 8.81 µs (+ 1.05 µs).
Store and XOR
Next, the resultant a is XORed with the integer d, we simply perform an XOR here.
Because we have 6 registers to hold 3 integers, but ChaCha20 operates on 4.
It forces us to store one integer back to RAM in each step within the
quarter round. Storing it prior to the XOR is the best time - it frees up
two spare registers, r0
and r1
, as our scratchpad registers.
For example, in step 1, the integer a is not used in the next subsequent rotation and can be written back.
// store a
mov r0, chacha_a + 0 // 2.84 (+ 0.15 x 3)
mov r1, chacha_a + 2 // 2.84 (+ 0.15 x 3)
// d ^= a
xor r0, r4 // 0.99 (+ 0.15)
xor r1, r5 // 0.99 (+ 0.15)
The XOR insturction has the same timing, 0.99 µs (+ 0.15 µs), just like other arithmetic insturctions.
In total, a store and a XOR take 7.66 µs (+ 1.2 µs).
Rotate to the Left
ChaCha20 requires a 16-bit, 12-bit, 8-bit and 7-bit rotation in each quarter round. This is the most painful operation on the PDP-11. For a basic PDP-11/40 machine, only 1 bit can be shifted at a time. Writing this part takes a lot of head scratching. I found some encouragement in the Jargon File.
SHIFT TO THE LEFT!
SHIFT TO THE RIGHT!
POP UP, PUSH DOWN,
BYTE! BYTE! BYTE!
– Programmer’s Cheer, Jargon File
Basic Rotation
To rotate an integer, We have to do a left shift on the lower 16-bit word, a rotation (shift with carry) at the higher 16-bit, add the overflowed carry bit back to the lower 16-bit. And we need to do it for 16 times!
// repeat 16, 12, 8, 7 times
asl r0 // 0.99 (+ 0.15)
rol r1 // 0.99 (+ 0.15)
adc r0 // 0.99 (+ 0.15)
In our quarter round, step 1 runs like a snail and takes 47.52 µs (+ 7.2 µs) in total. Later, in step 2, 3 and 4, the rotation offsets are smaller - 12, 8 and 7 bits, they are still slow - they take 35.64 µs (+ 5.4 µs), 23.75 µs (+ 3.6 µs), and 20.79 µs (+ 3.15 µs).
But all is not lost, the rotation offset in ChaCha20 is selected on purpose to make our life easier.
Optimizing 16-bit Rotation
Rotating 16-bit to the left is equivalent to
swapping the lower and higher 16-bit words. Even an explicit swap
is not necessary. We can simply swap the order of operation while
adding, XORing and storing register r0
and r1
from now on. Here,
we can use the best technology in existence to do it:
No Code – Write nothing;
deploy nowhere.
// No Code // 0.00
Optimizing 12-bit Rotation
Rotating 12-bit to the left is equivalent to a word swap and a 4-bit shift to the right. The swapping is done implicitly during loading and storing, so there are only 4 bits to shift.
The adjustment for a rightshift is a bit tricky (no pun intended): when the high 16-bit word is right-shifted, its leftmost bit is filled by a zero, this is incorrect. We need to set or clear the leftmost bit of the high word according to the rightmost bit of the low word. We can’t call ROR - ROR only set the bit after shifting it and creates an additional unwanted shift, it has to be adjusted manually. In a leftshift, we simply use ADC to toggle bit 0. In a rightshift, however, there is no single instruction to toggle bit 15 using the carry flag.
Usually, the conditional branch instructions BCS (branch if carry set) and BCC (branch if carry clear) can be used.
// initialize
mov $0100000, r0 // 0.84 + 1.46 (+ 0.15 x 2)
asr r3 // 1.25 (+ 0.15)
ror r2 // 1.25 (+ 0.15)
// adjust MSB (or skip if carry is cleared)
// Don't use. This is insecure, non-constant time code,
// See commentary in the blog post.
bcc skip // 1.76 (+ 0.15) / 1.40 (+ 0.15)
add r0, r3 // 0.99 (+ 0.15)
skip:
// more code...
It takes 2.75 µs (+ 0.3 µs) to adjust the MSB per shift. If no adjustment is needed, it takes only 1.40 µs (+ 0.15 µs). But it makes the execution of the cipher depend on the secret internal state, which is derived from our secret key - it opens up an entire category of side-channel attacks, a big no in cryptography.
My solution: If we can’t use ROR to adjust the most significant bit after the shift, at least we can pre-set the carry flag manually before the shift.
To do this, first, we copy the lower word to a scratchpad register (to avoid destroying the original word) and shift it by 1 bit. After this shift, the carry flag is set according to the rightmost bit. Later, we rotate the high word - now rightmost bit is transferred to the leftmost bit. Finally, rotate the low word.
mov r2, r0 // 0.90
// repeat 4 times
asr r0 // 1.25 (+ 0.15)
ror r3 // 1.25 (+ 0.15)
ror r2 // 1.25 (+ 0.15)
As the rotation doesn’t cross a word boundary, we can reuse the scratchpad register repeatedly. Now we have a branchless 3-instruction rotation, the vulnerability is eliminated, and it also runs faster than a conditional branch.
Curiously, shifting to the right has an additional 0.25 µs performance penalty, my uneducated guess is that the hardware prefers leftshifting, when shifting to the right, the microcode has to execute one more instruction or so (yes, almost all PDP-11 models are microcoded).
Update: I’ve read the list of microcode and found no micro-instruction for shifting or rotating bits, but found some mentions about registers and multiplexer signals, thus, this may be related to ALU’s logic details. More investigation needed.
This step takes 15.90 µs (+ 1.80 µs) in total.
Optimizing 8-bit Rotation
On PDP-11, almost all 16-bit instructions have a 8-bit version, suffixed with b (e.g. MOV and MOVB). This allows ASCII strings and byte arrays (or shall I say octet arrays?) to be handled directly and is regarded as a major strength of the platform, unlike 18-bit or 36-bit computers of the same era which were best suited for 6-bit characters.
There are some pitfalls to watch, though. First, PDP-11’s 16-bit registers
cannot be used as two 8-bit registers pair. We can work on the low 8-bit only,
there is no direct access to the high 8-bit word. Instead, SWAB is provided to
quickly exchange the upper
and lower part in a register - it is this single instruction that enables
efficient byte processing. Also, pay attention to MOVB - it performs a
sign extension to the register. It means the high 8-bit of the register
is overwritten by 11111111
or 00000000
depending on the sign of the loaded
byte, hence, a C statement like int16_t x = (int8_t) y
works automatically,
at the expense of killing our bitpacked data. Beware not to trash the register!
Other byte-instructions are safe to use, they all ignore the high 8-bit word.
Now we’re ready for an implemantion. Basically, to transfer a low word safely without destroying the high word, we can clear the low 8-bit word first via CLRB, then apply the “bit set” (logical or) instruction BISB to safely perform the transfer. Combining this sequence with SWAB would rotate the word by 8-bit.
It can be difficult to follow, so example values are given in the comment.
// rotate 8-bit to the left r4: 011 022
// example values are given r5: 033 044
swab r5 // 0.99 (+ 0.15) r5: 044 033
mov r5, r0 // 0.90 r0: 044 033
clrb r5 // 0.99 (+ 0.15) r5: 044 000
swab r4 // 0.99 (+ 0.15) r4: 022 011
bisb r4, r5 // 0.99 (+ 0.15) r5: 044 011
clrb r4 // 0.99 (+ 0.15) r4: 022 000
bisb r0, r4 // 0.99 (+ 0.15) r4: 022 033
I cannot say it’s the fastest implementation with full confidence, but I strongly suspect it to be the case. Note: a slower version appeared in a previous version of this article.
This step takes 6.84 µs (+ 0.90 µs) in total.
Optimizing 7-bit Rotation
A 7-bit rotation is done by using the previous 8-bit rotation and an addtional 1-bit shift to the right.
// 8-bit ROTL... // 6.84 (+ 0.90)
// 1-bit ROTR... // 4.65 (+ 0.45)
This step takes 11.49 µs (+ 1.35 µs) in total.
The Futility of EIS
One may wonder: if the machine has the Extended Instruction Set option, can we go faster? At first glance, EIS provides convenient and efficient multi-bit shifts and rotations.
But a more careful examination shows it’s not the case - EIS only provides 32-bit shift or 16-bit rotation, but not our much-needed 32-bit rotation! To synthesize a 32-bit rotation, we must copy the 32-bit operand to 2 spare registers, shift the first two registers by n bits to the left, and the scratchpad registers by (32 - n) bits to the right. The same technique is used in the C programming language and some RISC machines, both lacked an operator for arbitrary integer rotation.
mov r2, r0 // 0.90
mov r3, r1 // 0.90
// shift r0r1 to the left (positive offset)
// warning: big-endian shift, MSB is r0
ashc N, r0 // 0.28 + 3.26 + 0.30 x N
// shift r2r3 to the right (negative offset)
// warning: big-endian shift, MSB is r2
ashc -(32 - N), r2 // 0.28 + 3.26 + 0.30 x (32 - N)
// merge the result
bis r0, r2 // 0.99 (+ 0.15)
bis r1, r3 // 0.99 (+ 0.15)
As a result, in all cases, 32 bits are always shifted, and a lot of time is wasted: A 32-bit rotation always takes 20.46 µs (+ 0.30 µs). It offers zero advantages.
Finalizing
After all four steps in a quarter round has completed, all remaining registers have to be written back to memory.
mov r3, chacha_b + 0 // 2.84 (+ 0.15 x 3)
mov r2, chacha_b + 2 // 2.84 (+ 0.15 x 3)
mov r5, chacha_d + 0 // 2.84 (+ 0.15 x 3)
mov r4, chacha_d + 2 // 2.84 (+ 0.15 x 3)
Some registers, including b and d have been implicitly swapped, we reverse the high 16-bit and low 16-bit word.
The cost of finalizing a quarter round is 11.36 µs (+ 1.80 µs).
Hard Mode: Software XOR
We’ve successfully implemented the ChaCha20 quarter round function on the PDP-11/40, but sorry, our princess is on another machine.
There’s one last issue to be solved: XOR is only available on PDP-11/40 and better machine. On the low-end PDP-11/05, /10, /15, and /20, this instruction doesn’t exist. It would be nice if we can provide a software fallback option for the owners of these computers using basic Boolean operations, such as AND or OR.
From any introductory book on electronics, we can learn that in combinational logic, we have
a XOR b = ((NOT a) AND b) OR (a AND (NOT b))
On the PDP-11, there’s no AND. The closest instruction we get is
the “bit clear” instruction, BIC. If the source register has an “1” bit,
the bit at the same location in the destination register is cleared to “0”.
Its logical expression is (NOT a) AND b
. We also the “bit set” instruction,
BIS. It transfers all the “1” bits from the source to the destination
register, this is just a logical OR operation.
First attempt
On the PDP-11, due to the existence of BIC, the textbook formula is optimal and only requires 3 instructions. However, BIC destroys the destination register.
One solution is reloading the same integer from RAM, but accessing memory is very
expensive and must be avoided when possible. Thus, the only remaining option is
saving it to a scratchpad register. At this point, we are running out of regular
registers, forcing us to seize the stack pointer, r6
.
mov r0, chacha_a + 0
mov r1, chacha_a + 2
mov r0, r6
bic r4, r0
bic r6, r4
bis r0, r4
mov r1, r6
bic r5, r1
bic r6, r5
bis r1, r5
Disabling and Enabling Interrupts
In the previous software XOR code, we’ve sacrificed our stack pointer r6
and used it as our scratchpad register. It’s not always safe to do this.
When we use a timesharing operating system (like UNIX v6) on a PDP-11/40,
the optional Memory Management Unit hardware would be purchased and
installed on the PDP-11. The MMU provides two independent stack pointers
for the kernel and userspace and switches them automatically. No need to
worry, just be sure to save and restore r6
before and after the entire
encryption routine is finished.
However, if our program is a freestanding one that runs on the bare metal,
or when we use a non-timesharing operating system. In this case, r6
is
dedicated to handle interrupts. In particular, when we’re using the PDP-11/20,
there is no MMU support, not even an optional one!
If an IRQ occurs after r6
is overwritten, we’ll have random data corruption
that leads to a total system crash. In this case, we must disable and enable
interrupts appropriately.
// save stack pointer
mov r6, sp_backup
// set CPU priority level to 7 (ignore all interrupts)
bis $340, 0177776
// full quarter round
// restore stack pointer
mov sp_backup, r6
// set CPU priority level to 0 (receive all interrupts)
bic $340, 0177776
The interrupt status is controlled by the Processor Status Word, mapped to
memory address 0177776
. There is no separate instruction to enable or
disable interrupts (you can access all CPU flags and use them like any data
in memory). CPU priority is selected by bit 5, 6 and 7. The highest priority
is mode 7, it’s reserved for critical tasks and masks all interrupts. When
restoring it, I arbitrarily restored priority level to 0, it may be unsuitable
for your application if you make use of multiple priority levels, make sure to
adjust it appropriately.
Interrupts should not be postponed for too long. In our code, they’re disabled for the duration of a single quarter round, creating a latency of several hundreds of microseconds.
Second Attempt
Although the previous solution works, it leaves much to be desired. First of all it affects the response speed of interrupts. It may be unacceptable in some systems, or at least not ideal. It also prevents us from using the stack pointer inside the encryption routine - we don’t need it for now, but it may otherwise be useful when it’s integrated into an application.
In the first attempt, reloading the data from RAM was ruled out due to performance concerns. But now we have realized, memory accesses per se is not the problem - saving and restoring the stack pointer and interrupts itself requires 4 memory accesses. The question is whether we can keep the number of memory accesses no more than 4, without relying on the stack pointer.
The answer is affirmative: by careful instruction scheduling, it’s possible.
Step 1
In the first Setup step in a quarter round, we always load d to register r4
and r5
at the very beginning of the program. For a PDP-11/40, it’s a reasonable
choice to organize the code. It allows us to calculate the one-off setup overhead
as a separate item.
The situation is different on the PDP-11/20, it wastes a valuable register that’s
needed for software XOR calculation - clearly, when we’re processing r4
, the spare
r5
can be used as a scratchpad register.
mov r0, chacha_a + 0
mov r1, chacha_a + 2
mov r0, r5
mov chacha_d + 0, r4
bic r4, r0
bic r5, r4
bis r0, r4
mov r1, r0
mov chacha_d + 2, r5
bic r5, r1
bic r0, r5
bis r1, r5
When processing the low word using r0
and r4
, our scratchpad register is
r5
. Once r4
has been calculated, we can used the free r0
as a scratchpad
for r1
and r5
.
Finally, don’t forget to delete these two lines from the section Setup. Here,
I use a .if
macro to allow the generation of either PDP-11/40 or PDP-11/20
code by a global flag (PDP1120 - 1
is just a way to say not PDP1120
due to
the lack of a Boolean NOT operator in as
).
mov chacha_a + 0, r0 // 1.46 + 1.46 (+ 0.15 x 2)
mov chacha_a + 2, r1 // 1.46 + 1.46 (+ 0.15 x 2)
.if PDP1120 - 1
mov chacha_d + 0, r4 // 1.46 + 1.46 (+ 0.15 x 2)
mov chacha_d + 2, r5 // 1.46 + 1.46 (+ 0.15 x 2)
.endif
Although we’ve made two memory accesses, but they’re simply moved from the pre-existing Setup section to the XOR section, so far the overhead is effectively zero.
Step 2 & Step 3
In step 2 and step 3, no spare register is available as the scratchpad register,
forcing us to reload the data from memory. Only step 2 is shown, for step 3,
change chacha_c
to chacha_a
.
mov r0, chacha_c + 0
mov r1, chacha_c + 2
bic r2, r0
bic chacha_c + 0, r2
bis r0, r2
bic r3, r1
bic chacha_c + 2, r3
bis r1, r3
As a quintessential CISC machine, the PDP-11 instruction set is not limited to load-and-store, but allows direct operations between registers and memory.
In total, there are 4 memory accesses: two in step 2, and two in step 3.
Step 4
In the final step, two spare registers for d are available because
they’ve already finished their jobs. One register r4
is already
enough.
mov r5, chacha_d + 0
mov r4, chacha_d + 2
mov r0, r4
bic r3, r0
bic r4, r3
bis r0, r3
mov r1, r4
bic r2, r1
bic r4, r2
bis r1, r2
Again, delete these two lines in the section Finalizing.
mov r3, chacha_b + 0 // 2.84 (+ 0.15 x 3)
mov r2, chacha_b + 2 // 2.84 (+ 0.15 x 3)
.if PDP1120 - 1
mov r5, chacha_d + 0 // 2.84 (+ 0.15 x 3)
mov r4, chacha_d + 2 // 2.84 (+ 0.15 x 3)
.endif
And we’re done. The use of the stack pointer has been eliminated without negative impact on performance.
ChaCha20 Internal State
The ChaCha20 internal state is represented by 16 unsigned 32-bit integers as a 4x4 matrix. To calculate a complete iteration of ChaCha20, we repeatedly apply the ChaCha20 Quarter Round function. A full ChaCha20 “round” contains 4 quarter rounds.
First, vertically. This is known as an “odd round”.
QR(0, 4, 8, 12)
QR(1, 5, 9, 13)
QR(2, 6, 10, 14)
QR(3, 7, 11, 15)
Then, diagonally. This is an “even round”.
QR(0, 5, 10, 15)
QR(1, 6, 11, 12)
QR(2, 7, 8, 13)
QR(3, 4, 9, 14)
Because ChaCha20 has 20 rounds, the odd round and the even around are applied 10 times for each.
Total Cost of 20 Full Rounds
So let’s add the timing of these quarter rounds together…
Setup 11.68 µs
MMU 1.20 µs
(x 4 rounds)
Add 8.81 µs
MMU 1.05 µs
(x 4 steps x 4 rounds)
Xor 7.66 µs
MMU 1.20 µs
(x 4 steps x 4 rounds)
Rotate 15.90 µs + 6.84 µs + 11.49 µs
MMU 1.80 µs + 0.90 µs + 1.35 µs
(x 4 rounds)
Fin. 11.36 µs
MMU 1.80 µs
(x 4 rounds)
------------------------------------------------
Total 492.60 µs
(MMU + 64.20 µs)
To complete a 512-bit block of ChaCha20 encryption, 20 full rounds are needed (thus the name, again). It takes…
Baseline: 492.60 µs x 20 = 9852.0 µs
MMU overhead: 64.20 µs x 20 = 1284.0 µs
Apply Encryption
Mixing
At this point, a full block of ChaCha20 state has been generated. But we haven’t actually started encrypting our data yet. In fact, it’s totally unsafe to use the ChaCha20 output in its current form - an attacker can easily invert the Quarter Round function and read the original input. To obtain the intended security property, the original input block should be backed up, and later mixed into the output block via 32-bit carryless addition.
Again, the PDP-11, as a quintessential and highly orthogonal CISC architecture, is not limited to load-and-store. Register-memory, and even memory-memory operations are supported by almost all instructions on all registers. It allows programmers to write really dense and tight code in assembly - which made perfect sense before highly pipelined processors.
// load pointers
mov $chacha_state, r0 // 0.84 + 1.46 (+ 0.15)
mov $chacha_orig, r1 // 0.84 + 1.46 (+ 0.15)
// backup the ChaCha20 inputs
// repeat 32 times...
mov (r0)+, (r1)+ // 0.84 + 2.42 (+ 0.15 x 3)
// perform 80 quarter rounds
// load pointers
mov $chacha_state, r0 // 0.84 + 1.46 (+ 0.15)
mov $chacha_orig, r1 // 0.84 + 1.46 (+ 0.15)
// mix ChaCha20 input and output
// repeat 32 times
add (r0)+, (r1)+ // 0.84 + 0.84 + 1.76 (+ 0.15 x 4)
adc (r1) // 0.78 + 1.77 (+ 0.15 x 3)
add (r0)+, (r1)+ // 0.84 + 0.84 + 1.76 (+ 0.15 x 4)
The line mov (r0)+, (r1)+
is a memcpy
implemented in a single instruction,
showing the peak performance of the PDP-11 ISA. A C programmer should be able to
recognize it immediately: it’s the origin of *dest++ = *src++
in C.
For convenience, the initial backup is shown and calculated as part of the mixing process. After the mixing step, the output of ChaCha20 becomes highly nonlinear and irreversible, secure for cryptographic purposes.
In total, backing up and mixing take 415.28 µs (+ 67.80 µs).
XORing
ChaCha20 is a stream cipher, what we have now is not the ciphertext, but 512-bit of cryptographically-secure pseudorandom data. To obtain the actual ciphertext, we simply XOR the random data with our plaintext. It’s actually very similar to the One Time Pad, except that we use cryptographically-secure pseudorandomness as a substitute of true randomness. In other words, the heart of every stream cipher is a good Cryptographically Secure Pseudorandom Number Generator (CSPRNG).
Because the calling convention or the full ChaCha20 state handling code is not implemented at this point, I’ll write down only a generic XOR routine.
// load plaintext pointer
mov $plaintext, r0 // 0.84 + 1.46 (+ 0.15)
// repeat 32 times...
mov (r0)+, r1 // 0.84 + 1.46 (+ 0.15)
xor r1, chacha_state+N // 1.46 + 1.76 (+ 0.15 x 4)
The XOR instruction suffers a limitation, it’s the only instruction on the PDP-11 that only supports a register as its source address, other addressing modes such as auto-increment, are not supported - it’s probably related to the fact that it was originally an extension to the basic instruction set, implemented as a module on the CPU.
The execution time is around 179 µs (+ 24.15 µs). Beware that this routine destroys our ChaCha20 internal state, a more practical routine can be slower.
Discussion
Performance
In conclusion, it takes 10446.28 µs to encrypt 64 bytes of data in our implementation, or 163.23 µs per byte (5.98 KiB/s). The MMU adds an slight overhead, making it 11822.23 µs, or 184.73 µs per byte (5.28 KiB/s).
Note: Our reference machine, PDP-11/40, was a mid-range system, on the high-end system like the PDP-11/45, performance is approximately 15 KiB/s. See Appendix 1.
A register transfer on PDP-11/40 takes 0.90 µs, which is the fastest operation. We use this time as the cycle time of the machine for performance estimation. Hence, ChaCha20 achieved 182 cycles/bytes (non-MMU) and 206 cycles/bytes (MMU) on our system.
Metric | Result (w/o MMU) | w/ MMU |
---|---|---|
cycles per byte | 182 | 206 |
time per byte | 163.23 µs | 184.73 µs |
throughput | 5.98 KiB/s | 5.28 KiB/s |
The “cycles per byte” performance is actually impressive. According to a paper NaCl on 8-bit AVR Microcontrollers by Michael Hutter and Peter Schwabe, an 8-bit AVR microcontroller can run Salsa20 (an earlier design before the ChaCha20) at 216 cycles per bytes. Daniel J. Bernstein also cited the performance reported by developers of the Arduino Cryptography Library, for ChaCha20 it is 238 cycles per bytes.
Hence, I’m confident to say that, at the same clock frequency, ChaCha20’s efficiency on the PDP-11 architecture is similar to 8-bit AVR microcontrollers, like those used by an Arduino. Consider the fact that the AVR is a modern (late 90s) RISC machine with all single-cycle instructions, meanwhile the PDP-11 is a computer architecture from half a century ago, this is not bad at all. It’s also not bad for ChaCha20, a symmetric cipher of today, it clearly achieved its “high performance” objective (Of course, an Arduino board runs at 16 MHz, a PDP-11/40 runs only at 1.11 MHz, but decades later, LSI-11 eventually achieved 15.2 MHz as well).
On the flip side, time per byte is slow, throughput is slow: 5 KiB/s is less than 0.5% of the Unibus bandwidth - because a PDP-11/40’s 0.90 µs cycle time is only 1.11 MHz. But to put things in perspective, a PDP-11/40 had a maximum of ~248 KiB of addressable memory, which can be encrypted in ~1 minute. Encrypting an entire RK05 hard disk drive (~2.5 MB) would only take ~10 minutes, the performance is actually acceptable for sending secret messages and well worth it if you have confidential information to protect.
When a phone line with a Bell 202 modem runs at 1200 baud, 5 KiB/s will not be the bottleneck.
Of course, Public-Key Cryptography is the slowest part in such applications, not the symmetric cipher. According to my quick analysis, the PDP-11 can only perform 0.1 key exchanges per second, even using an elliptic curve. However, public-key cryptography is only used for infrequent key negotiation. For a point-to-point link, it’s only used once per connection. Thus, it won’t be the bottleneck in spite of its slowness. A client-server application may be more problematic, though. Also, public-key cryptography hasn’t been discovered by Diffie and Hellman yet in 1970 (GCHQ did already, but it was classified). Implementating Curve25519 is beyond the scope of this project.
Accuracy of my Estimation
I’m neither famillar with PDP-11 nor high-speed crypto, I’m sure additional optimizations are possible to make the program a bit faster. On the other hand, I doubt a 2x speedup is possible.
Since program overheads (such as function calls, or initializating ChaCha20 with key and ciphertext) are not included in my calculation, I’m almost sure my result is still an overoptimistic overestimation. The actual performance will be slower.
Overall, I think my estimation is believable within +/- 50%. Corrections and optimizations are welcomed, indeed.
Optimizing for Size
Until now, the priority was to optimize for speed at all cost. But in order to make our encryption routine be practical, program size must be considered. Our quarter round implementation takes 198 bytes. In full ChaCha20, there are 80 quarter rounds in total, which eats 15.47 KiB of RAM, this is clearly unacceptable.
If we replace all hardcoded memory addresses with pointers, there is a slight performance degradation due to switching from relative addressing to relative deferred addressing mode, each load and store will be ~1 µs slower. The advantage is that same 206-byte quarter round can always be reused.
I prefer the third solution: We don’t need to unroll 80 quarter rounds, only 8 is enough. In ChaCha20, there are only 8 unique quarter rounds that operates on different memory locations: 4 odds and 4 even rounds. It reduces our program size to 1.55 KiB with almost no performance overhead.
Too Much Crypto
It’s worth mentioning a paper, Too Much Crypto, by Jean-Philippe Aumasson (a serious expert cryptographer who designed BLAKE2 and organized the Password Hashing Competition).
It showed that, in the hope that a cipher can survive an unexpected major breakthough in cryptanalysis, most symmetric ciphers are overdesigned with many more rounds than what’s really necessary. The author argued that a high number of rounds is often irrationally chosen, even when the current studies showed great impracticality to attack even a reduced-round cipher.
The author advocated an objective and fact-based analysis, and claimed that based on the current cryptanalysis, ChaCha8 is secure, and anything higher is only for show without practical significance (i.e. security theater).
It’s worth noting that Bernstein, the designer of ChaCha20 himself, is aware of J. P. Aumasson’s observation. In a talk on high-speed symmetric ciphers, Bernstein acknowledged that a tradeoff exists between confidence and performance.
If you agree with the paper and are willing to take higher (perhaps purely imagined) future risk for better performance, With ChaCha12, you can get 9.60 KiB/s (non-MMU) or 8.48 KiB/s (MMU). ChaCha8 is even faster, 13.78 KiB/s (non-MMU) or 12.15 KiB/s (MMU).
Cipher | Speed (w/o MMU) | w/ MMU |
---|---|---|
ChaCha20 | 5.98 KiB/s | 5.28 KiB/s |
ChaCha12 | 9.60 KiB/s | 8.48 KiB/s |
ChaCha8 | 13.78 KiB/s | 12.15 KiB/s |
Side-Channel Attack
ChaCha20 is known for its inherent resistence to side-channel attacks due to the careful avoidance of secret-dependent branching or memory accesses - it naturally runs in constant time. Still, constant time does not imply constant electromagnetic radiation!
Beware that a PDP-11 is electrically noisy and emits more radiation than any modern PCs. It was an excellent enginnering accomplishment of its era, but it was designed before the modern Electromagnetic Compatibility (EMC) standards or multilayer circuit boards with multiple dedicated ground planes were developed and used in the industry (in fact, Howard Johnson’s famed book High Speed Digital Design, page 200, specifically mentioned PDP-8 as an example for how not to design boards to meet EMC standards). It’s possible that an attacker is able to extract memory data by capturing RF leakage from the Unibus traffic. Electromechanical teletypes are also major offenders. Such a setup is extremely vulnerable to RF side-channel attacks. The large size of the installation also makes it impractical to simply shield the equipment itself. If physical attacks is in your threat model, you’ll need two fully shielded rooms at each end with filters on all power and data lines, which is a multi-million dollar project. Fortunately, most research labs probably don’t need them.
Conclusion
A PDP-11 has approximately ~5 KiB/s of ChaCha20 encryption performance, using two PDP-11s is a good way to establish at encrypted point-to-point communication link over a teletype or phone line between two research labs for confidental information. As minicomputers like PDP-11s are already used in many labs, a high degree of security can be achieved at reasonable cost.
However, if additional R&D budgets are available, it’s strongly recommended to implement the ChaCha20 cipher in hardware, as a peripheral on the Unibus.
Appendix 0: See Also
A full read is required for all PDP-11 users. Appendix C Instruction Timing and Appendix E Summary of PDP-11 Instructions should always be referred to when optimizing a program.
- UNIX Assembler Reference Manual by Dennis M. Ritchie.
Required for start programming in Unix as
(or cc
). The Unix as
assembler
is a barebone assembler without extre features like macros. Even in 2.11BSD, its
syntax and functionality is essential unchanged since Dennis Ritchie’s early
version. Some aspects can be quite confusing for first-time user (e.g. expr
vs *$expr
, only recognize sp
, not r6
, etc). A quick browse through the
manual answers the questions.
- Lions’ Commentary on UNIX 6th Edition, with Source Code by John Lions
Not required, but sometimes extremely useful (“Use the source, Luke!”).
Initially, I got lost in the Processor Handbook and missed its description on interrupts and confused. Reading Lions’ book, Section 9.1 Hardware Interrupts and Section 9.5 Interrupt Priorities with references to the Unix source code helped to clarify my doubts.
Appendix 1: Upgrade Path
If you are a would-be owner of a PDP-11, I strongly recommend you to buy the more expensive PDP-11/45, already available one year before the PDP-11/40 (11/40 was the mid-range model, 11/45 is a high-end model specifically for multiuser timesharing). Newer successors from the high-end line, including 11/50, 11/55, and 11/70, are great, too.
Turbocharged with Schottky TTL circuits and semiconductor memory (unlike the PDP-11/40 that only uses core memory), it allows a 3x speedup. In fact, a PDP-11/45 class machine became the standard for UNIX v7 and later, the underpowered PDP-11/40 was struggling to run them.
On the PDP-11/45, if the fastest bipolar memory option is installed, on one hand, you’ll get a 3x speedup, loading and storing will only take 1 µs. Register transfers, basic arithmetic and logical operations only take 0.3 µs. On the other hand, it’s guaranteed to drain your research budgets… Fortunately, PDP-11/45 is flexible enough to mix different types of memory, buying a few kilobytes of bipolar RAM and using it as a buffer is a more economical option.
Here’s the ChaCha20 timing recalculated for a PDP-11/45 with the fastest bipolar memory installed.
Setup 4.20 µs
MMU 1.08 µs
(x 4 rounds)
Add 3.00 µs
MMU 0.81 µs
(x 4 steps x 4 rounds)
Xor 2.70 µs
MMU 0.72 µs
(x 4 steps x 4 rounds)
Rotate 3.90 µs + 2.10 µs + 3.30 µs
MMU 1.17 µs + 0.63 µs + 0.99 µs
(x 4 rounds)
Fin. 4.20 µs
MMU 1.08 µs
(x 4 rounds)
---------------------
Total 162.00 µs
(MMU + 44.28 µs)
To complete a 512-bit block of ChaCha20 encryption, it takes…
162.00 µs x 20 = 3240.00 µs
And the final 64-byte mixing and XOR takes 247.80 µs (+ 64.08 µs). In total, it’s 54.50 µs per byte (17.91 KiB/s). The speedup is almost exactly 3x (299%) without MMU.
Unfortunately, PDP-11/45’s MMU is only 2x faster (0.09 µs per cycle), and some PDP-11/45 instructions take more memory cycles to complete, especially the MOV instructions. On PDP-11/40, a register-to-register transfer takes 0 cycles, but on PDP-11/40 it takes at least 1. Hence, encryption performance with a MMU is only 14.08 KiB/s, or 266%.
Naturally, ChaCha12 and ChaCha8 have much higher throughput.
Cipher | Speed (w/o MMU) | w/ MMU |
---|---|---|
ChaCha20 | 17.91 KiB/s | 14.08 KiB/s |
ChaCha12 | 28.51 KiB/s | 22.42 KiB/s |
ChaCha8 | 40.48 KiB/s | 31.85 KiB/s |
Appendix 2: Instruction Timing
On PDP-11/40, the execution time of an instruction is calculated by adding a source time, a destination time, and an execution and fetch time together. They are given in microseconds. All the three times can be different depending on the addressing mode (see Appendix 3) and the installed memory option of your paticular machine.
T = t_src + t_dest + t_ef
The MOV instruction is special - only t_src and t_ef are needed, so we assume t_dest is always 0. After looking at the manual, I found that a MOV instruction from memory to register takes 1.74 µs of source time and 1.46 µs of EF time.
EIS instructions are also special, the timing is calculated just like MOV, only t_src and t_ef are used, but they are given in a different table. Also, their execution time depends on the bitshift distance, shifting more bits causes a longer EF time.
Some shortcuts to remember when calculating the timing:
-
If the source and destination are both registers, t_src and t_dest are 0, only t_ef counts.
-
When the data is in a register, a basic transfer, arithmetic, or logic operation takes ~1 µs.
-
A load or store takes ~3 µs.
Appendix 3: Addressing Modes
For those who’s following the manual, loading from memory to register is called the “absolute” addressing mode, or mode 3. Actually, the “auto-increment deferred” mode is also mode 3 - They are the same mode. In other words, the following two instructions are the same.
// load the content of address XXXX to r0
// In C, it's "int r0 = *(int *) XXXX"
mov *$XXXX, r0
// get the content of an address stored in register R, use
// the content as a new address, then load its content to r0,
// finally, increments the register R (imagine iterating an
// int ** array of int * pointers).
// In C, it's "int r0 = **R++".
mov *(R)+, r0
They look like two very dissimilar operations: One dereferences a hardcoded constant pointer, another dereferences a pointer to a list of pointers, then auto-increments the register. Why on Earth are they the same operation?
Because on PDP-11, the program counter, PC, can be used like any other registers, and “dereferencing a hardcoded constant pointer” is simply seen as a special case of “dereferencing the register twice and increment the register” - the register just happens to be PC.
PC is here
|
V
mov (XXXX), r0
When an instruction that dereferences a constant pointer is executed, the program pointer is pointed to the beginning of the instruction itself, at where our hardcoded address is located! If we dereferencing PC twice (first to get the hardcoded memory address itself, then to dereference that address), we can get the content at memory address XXXX and load to the register, finally, we increments PC, so we start running the next instruction.
Similarly, both PC-relative addressing mode (for Position-Independent Code) and index mode (base + offset, for obtaining elements from an array) are the same mode: mode 6. PC-relative mode is simply the index mode using PC as the base address.
The PDP-11 instruction set demostrates a lot of ingenuity from the engineers at DEC, doesn’t it?
Appendix 4: Similarities to Later Microprocessors
The PDP-11 instruction set has some similarities to later microprocessors. For example, instruction mnemonics: JSR, RTS, ASL, ROR, BCS, BCC, are all perfectly understandable to MOS 6502, Motorola 6800, or 68000 developer, no explanation is needed. If you are one of them, you probably can understand this article with ease. This is not a coincidence - many designers of the later microprocessors were personally famillar with the PDP-11, or had studied it. This can be confirmed in some interviews.
Jeff LaVell, an engineering director on Motorota’s 6800 project said:
We also, that was the time of the PDP-11, and it was the great hi-technology computer at that time, so a little mini-thing with a lot of switches and so that was a neat architecture and we liked that a lot. So we started with that architecture and built some variations on it, and the original product definition of the 6800 family was a 16-bit machine. And then we talked to Tom and he said, “Go away.”
And then we evolved into the real world of what we could build.
Bill Mensch, an engineer on Motorota’s 6800 (and later MOS6502) project offered a different account:
The 6800 microprocessor was thrown together by a lot of contributors, so therefore the design group didn’t really have much of the say about what went into it. And if a sales guy came back and said, hey, we need this instruction, so they added another instruction, whether–whatever. So it was kind of a random thing, but it was built on the specification of the PDP-8, the 6800. The 6502 came later and was modeled on the PDP-11, better architecture.
Chuck Peddle, the chief enginner on MOS6502 commented:
I took a short-term assignment to design a typesetting system for a friend of mine who had thought that typesetting was the new future. And so we put a typesetting system on a PDP-11 and started doing the local town Cave Creek newspapers and everything. So the PDP-11 was an extremely good eye-opener for me. Because it showed you that memory registers and memory addressing was important.
Although the architectures of 8-bit processors are quite different from the PDP-11 (the PDP-11 has a highly orthogonal ISA with many general-purpose registers, the 8-bit microprocessors mainly use accumulator-based designs with a lot more limitations, VLSI was still in its infancy), but at least we can see an indirect influence.
Appendix 5: Full Code
The ChaCha20 quarter round implementation I’ve just described can be downloaded
here. The code snippet can be compiled directly on 2.11BSD. The implementation
is incomplete - only a single quarter round is written (ChaCha20 requires 80
quarter rounds, this code should be converted to a macro that accepts four memory
addresses chacha_a
, chacha_b
, chacha_c
and chacha_d
as its arguments).
Also, there is no input or output. The result can only be seen in a debugger by
setting a breakpoint at the end and inspecting the memory. For more information,
see the comments in code.
// Copyright 2021 Tom Li <tomli@tomli.me>
//
// Permission to use, copy, modify, and/or distribute this software for any
// purpose with or without fee is hereby granted, provided that the above
// copyright notice and this permission notice appear in all copies.
//
// THE SOFTWARE IS PROVIDED "AS IS" AND THE AUTHOR DISCLAIMS ALL WARRANTIES
// WITH REGARD TO THIS SOFTWARE INCLUDING ALL IMPLIED WARRANTIES OF
// MERCHANTABILITY AND FITNESS. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR
// ANY SPECIAL, DIRECT, INDIRECT, OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES
// WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN
// ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF
// OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.
//
// ChaCha20 Quarter Round implementation for DEC PDP-11.
//
// This is a proof-of-concept, only a single quarter round is written.
// ChaCha20 requires 80 quarter rounds, this code should be converted
// to a macro that accepts four memory addresses chacha_a, chacha_b, chacha_c
// and chacha_d as its arguments. Also, there is no input or output.
// The result can only be seen in a debugger by setting a breakpoint at
// the end and inspecting the memory.
//
// Note: Before using this program, see the commentary on my blog:
// https://niconiconi.neocities.org/posts/chacha20-performance-on-pdp-11/
//
// Version history:
//
// * v0.02 (Mar 2020):
// - Performance optimization.
// - Added Software XOR simulation for compatibility on PDP-11/20. All
// PDP-11 machines are now supported.
//
// * v0.01 (Feb 2020):
// - First release.
// - Only support PDP-11/40 and better machines.
//
// FAQ:
//
// Q: The code a lot of repetition, why didn't you use a real assembler
// like MACRO-11 or GNU as?
//
// A: I should. But on my emulator (2.11BSD), it has Unix as and MACRO-11,
// and on my main system, it only has GNU as. This is a short project and
// I didn't have time to install a cross-development setup on my main system,
// or install GNU as in 2.11BSD, so Unix as became the only choice, although
// it has no macro capabilities. It also ensures everyone who runs Unix on
// the PDP-11 can assemble it. If I still have time, I'll probably rewrite
// it in GNU as.
//
// To assemble it on 2.11BSD, run:
// $ cc chacha20.s -o chacha20
//
// Then use the debugger:
// $ adb ./chacha20
//
// Select the memory address "end":
// adb> end
//
// Set a breakpoint at this address:
// adb> :b
//
// Run the program:
// adb> :r
// ./chacha20: running
// breakpoint end: cflg 0
//
// Inspect memory:
// adb> chacha_a, 2 ?x
// _etext+02: #92f4 #ea2a
// adb> chacha_b, 2 ?x
// _etext+06: #f8ce #cb1c
// adb> chacha_c, 2 ?x
// _etext+012: #472e #4581
// adb> chacha_d, 2 ?x
// _etext+016: #c4bb #5881
//
// The output should match the test vector in RFC 8439, section 2.1.1.
//
// Input:
//
// a = 0x11111111
// b = 0x01020304
// c = 0x9b8d6f43
// d = 0x01234567
//
// After running a Quarter Round on these four numbers, we get these:
//
// a = 0xea2a92f4
// b = 0xcb1cf8ce
// c = 0x4581472e
// d = 0x5881c4bb
.data
// By default, the XOR instruction is used for performance. It's only
// supported on PDP-11/40 and better machines. Set this macro to 1 to
// use software XOR simulation for PDP-11/20 class machines without
// the XOR instruction.
PDP1120 = 0
chacha_a:
.byte 21, 21, 21, 21 // 0x11111111
chacha_b:
.byte 4, 3, 2, 1 // 0x01020304
chacha_c:
.byte 103, 157, 215, 233 // 0x4581472e
chacha_d:
.byte 147, 105, 43, 1 // 0x5881c4bb
.text
// All timings are for PDP-11/40.
.globl _main
_main:
mov chacha_a + 0, r0 // 1.46 + 1.46 (+ 0.15 x 2)
mov chacha_a + 2, r1 // 1.46 + 1.46 (+ 0.15 x 2)
.if PDP1120 - 1
mov chacha_d + 0, r4 // 1.46 + 1.46 (+ 0.15 x 2)
mov chacha_d + 2, r5 // 1.46 + 1.46 (+ 0.15 x 2)
.endif
step1:
// a += b
mov chacha_b + 0, r2 // 1.46 + 1.46 (+ 0.15 x 2)
mov chacha_b + 2, r3 // 1.46 + 1.46 (+ 0.15 x 2)
add r2, r0 // 0.99 (+ 0.15)
adc r1 // 0.99 (+ 0.15)
add r3, r1 // 0.99 (+ 0.15)
// store a
mov r0, chacha_a + 0 // 2.84 (+ 0.15 x 3)
mov r1, chacha_a + 2 // 2.84 (+ 0.15 x 3)
// d ^= a
.if PDP1120 - 1
xor r0, r4 // 0.99 (+ 0.15)
xor r1, r5 // 0.99 (+ 0.15)
.endif
.if PDP1120
mov r0, r5
mov chacha_d + 0, r4
bic r4, r0
bic r5, r4
bis r0, r4
mov r1, r0
mov chacha_d + 2, r5
bic r5, r1
bic r0, r5
bis r1, r5
.endif
step2:
// c += (d <<< 16)
mov chacha_c + 0, r0 // 1.46 + 1.46 (+ 0.15 x 2)
mov chacha_c + 2, r1 // 1.46 + 1.46 (+ 0.15 x 2)
add r5, r0 // 0.99 (+ 0.15)
adc r1 // 0.99 (+ 0.15)
add r4, r1 // 0.99 (+ 0.15)
// store c
mov r0, chacha_c + 0 // 2.84 (+ 0.15 x 3)
mov r1, chacha_c + 2 // 2.84 (+ 0.15 x 3)
// b ^= c
.if PDP1120 - 1
xor r0, r2 // 0.99 (+ 0.15)
xor r1, r3 // 0.99 (+ 0.15)
.endif
.if PDP1120
bic r2, r0
bic chacha_c + 0, r2
bis r0, r2
bic r3, r1
bic chacha_c + 2, r3
bis r1, r3
.endif
// b >>>= 4
mov r2, r0 // 0.90
asr r0 // 1.25 (+ 0.15)
ror r3 // 1.25 (+ 0.15)
ror r2 // 1.25 (+ 0.15)
asr r0 // 1.25 (+ 0.15)
ror r3 // 1.25 (+ 0.15)
ror r2 // 1.25 (+ 0.15)
asr r0 // 1.25 (+ 0.15)
ror r3 // 1.25 (+ 0.15)
ror r2 // 1.25 (+ 0.15)
asr r0 // 1.25 (+ 0.15)
ror r3 // 1.25 (+ 0.15)
ror r2 // 1.25 (+ 0.15)
step3:
// a += (b <<< 16)
mov chacha_a + 0, r0 // 1.46 + 1.46 (+ 0.15 x 2)
mov chacha_a + 2, r1 // 1.46 + 1.46 (+ 0.15 x 2)
add r3, r0 // 0.99 (+ 0.15)
adc r1 // 0.99 (+ 0.15)
add r2, r1 // 0.99 (+ 0.15)
// store a
mov r0, chacha_a + 0 // 1.46 + 1.46 (+ 0.15 x 3)
mov r1, chacha_a + 2 // 1.46 + 1.46 (+ 0.15 x 3)
// d ^= a
.if PDP1120 - 1
xor r0, r5 // 0.99 (+ 0.15)
xor r1, r4 // 0.99 (+ 0.15)
.endif
.if PDP1120
bic r5, r0
bic chacha_a + 0, r5
bis r0, r5
bic r4, r1
bic chacha_a + 2, r4
bis r1, r4
.endif
// d <<<= 8
// rotate 8-bit to the left r4: 011 022
// example values are given r5: 033 044
swab r5 // 0.99 (+ 0.15) r5: 044 033
mov r5, r0 // 0.90 r0: 044 033
clrb r5 // 0.99 (+ 0.15) r5: 044 000
swab r4 // 0.99 (+ 0.15) r4: 022 011
bisb r4, r5 // 0.99 (+ 0.15) r5: 044 011
clrb r4 // 0.99 (+ 0.15) r4: 022 000
bisb r0, r4 // 0.99 (+ 0.15) r4: 022 033
step4:
// c += (d <<< 16)
mov chacha_c + 0, r0 // 1.46 + 1.46 (+ 0.15 x 2)
mov chacha_c + 2, r1 // 1.46 + 1.46 (+ 0.15 x 2)
add r5, r0 // 0.99 (+ 0.15)
adc r1 // 0.99 (+ 0.15)
add r4, r1 // 0.99 (+ 0.15)
// store c
mov r0, chacha_c + 0 // 2.84 (+ 0.15 x 3)
mov r1, chacha_c + 2 // 2.84 (+ 0.15 x 3)
// (b <<< 16) ^= c
.if PDP1120 - 1
xor r0, r3 // 0.99 (+ 0.15)
xor r1, r2 // 0.99 (+ 0.15)
.endif
.if PDP1120
mov r5, chacha_d + 0
mov r4, chacha_d + 2
mov r0, r4
bic r3, r0
bic r4, r3
bis r0, r3
mov r1, r4
bic r2, r1
bic r4, r2
bis r1, r2
.endif
// (b <<< 16) <<<= 8
// rotate 8-bit to the left r3: 011 022
// example values are given r2: 033 044
swab r2 // 0.99 (+ 0.15) r2: 044 033
mov r2, r0 // 0.90 r0: 044 033
clrb r2 // 0.99 (+ 0.15) r2: 044 000
swab r3 // 0.99 (+ 0.15) r3: 022 011
bisb r3, r2 // 0.99 (+ 0.15) r2: 044 011
clrb r3 // 0.99 (+ 0.15) r3: 022 000
bisb r0, r3 // 0.99 (+ 0.15) r3: 022 033
// b >>>= 1
mov r3, r0 // 0.90
ror r0 // 1.25 (+ 0.15)
ror r2 // 1.25 (+ 0.15)
ror r3 // 1.25 (+ 0.15)
// store b, d
mov r3, chacha_b + 0 // 2.84 (+ 0.15 x 3)
mov r2, chacha_b + 2 // 2.84 (+ 0.15 x 3)
.if PDP1120 - 1
mov r5, chacha_d + 0 // 2.84 (+ 0.15 x 3)
mov r4, chacha_d + 2 // 2.84 (+ 0.15 x 3)
.endif
end:
// use debugger to set a breakpoint here
nop
// call exit() system call, return 0.
// XXX: This is 2.11BSD-specific code.
mov $0, -(sp)
tst -(sp)
sys 1