Most modern cryptocurrency wallets implement Bitcoin Improvement Proposal (BIP) 39. At a high level, BIP 39 defines a formula for 1) the generation of a mnemonic sentence (also referred to as mnemonic words, seed phrase, recovery phrase, etc.), then 2) the generation of a seed from that mnemonic sentence. That seed is used to produce your private and public keys, but those details will be covered in the next post in this series.
This post will step you through the nitty-gritty bits and bytes of generating a mnemonic sentence, and from those words, a 512-bit seed. Python code snippets will be used to demonstrate the concepts along the way. Disclaimer: this code is written purely for educational purposes; use responsibly, etc.
Note: If you prefer to skip straight to the code, or want to run it as you read, the implementation is available within a Jupyter notebook here.
Why BIP 39? Because this mnemonic sentence:
indoor dish desk flag debris potato excuse depart ticket judge file exit
is much easier to recognize and relay than this hexadecimal seed:
The formula is deterministic, meaning that the same mnemonic words will always produce the same 512-bit seed. If your device gets stolen or a wallet vendor goes out of business, you can fully restore multiple wallets with just those random 12–24 words. Social recovery of wallets is made easier too, for example, by sharing three words with each of four trusted friends.
Ready for some code? Let’s step through the formula, starting with the generation of the mnemonic sentence. At a high level, we’re looking to start with a random number, slice it up into the number of words we want in our mnemonic sentence, then convert each chunk of data into an English word.
The first thing we’ll need is that random number, also referred to as entropy. The BIP 39 spec states that this entropy can only come in a few sizes: multiples of 32 bits, between 128 and 256. The larger the entropy, the more mnemonic words generated, and the greater the security of your wallets.
For simplicity’s sake, we’ll choose 128 bits of entropy, from which we can expect to derive 12 mnemonic words. For reference, each 32 bits beyond 128 adds three more mnemonic words to the sentence, up to a maximum of 24 words from a 256-bit random number.
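The relationship between entropy size and word count can be tabulated with a few lines of arithmetic. This is a minimal sketch: the helper name mnemonic_word_count is made up for illustration and is not part of any library. It relies on two facts explained later in this post, namely that the checksum is one bit per 32 bits of entropy and that each word encodes 11 bits.

```python
def mnemonic_word_count(entropy_bit_size: int) -> int:
    """Return the number of mnemonic words for an entropy size (hypothetical helper)."""
    if entropy_bit_size not in (128, 160, 192, 224, 256):
        raise ValueError("entropy must be 128, 160, 192, 224, or 256 bits")
    checksum_length = entropy_bit_size // 32           # one checksum bit per 32 entropy bits
    return (entropy_bit_size + checksum_length) // 11  # each word encodes 11 bits


for size in (128, 160, 192, 224, 256):
    print(size, "bits of entropy ->", mnemonic_word_count(size), "words")
# 128 -> 12, 160 -> 15, 192 -> 18, 224 -> 21, 256 -> 24
```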
os.urandom can be used to generate a number of random bytes, and the bitarray package provides a convenient way to convert those bytes into bits. We’ll need both representations later.
Note: if you follow along at home, you will see different results than are displayed in these examples. It’s random, after all.
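import os
from bitarray import bitarray
# valid_entropy_bit_sizes = [128, 160, 192, 224, 256]
entropy_bit_size = 128
entropy_bytes = os.urandom(entropy_bit_size // 8)
print(entropy_bytes)
# b'Q\x83\xe1\xf4\xf1j\xac5\x16\x04<\x0bm`\xcf\x0c'
entropy_bits = bitarray()
entropy_bits.frombytes(entropy_bytes)  # load the same bytes as individual bits
print(entropy_bits)
# bitarray('0101000110000011...')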
Random number achieved! entropy_bits and entropy_bytes are two representations of the same number.
We’re expecting 12 mnemonic words in the end, so we’re going to want to chop up our data into 12 groups. 128 bits is not evenly divisible by 12, though. The BIP 39 formula accounts for this by adding a checksum to the end of the entropy.
The size of the checksum is dependent on the size of the entropy. To find the checksum length, divide the entropy size (e.g. 128) by 32:
checksum_length = entropy_bit_size // 32
So, we know that the checksum will be four bits in length. Which four bits? The first four of the SHA-256 hash of the entropy:
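from hashlib import sha256
hash_bytes = sha256(entropy_bytes).digest()
print(hash_bytes)
# b'\xef\x88\xad\x02\x16\x7f\xa6y\xde\xa6T...'
hash_bits = bitarray()
hash_bits.frombytes(hash_bytes)
print(hash_bits)
# bitarray('111011111000100010...')
checksum = hash_bits[:checksum_length]  # keep only the leading checksum bits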
The first 4 bits in this case are 1110. This checksum gets appended to the end of entropy_bits, bringing the total bits to 132, a number evenly divisible into 12 groups of 11 bits.
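As a rough sketch of the append step, the version below uses plain strings of '0' and '1' characters in place of bitarray so that it runs with only the standard library. With bitarray, the equivalent would be entropy_bits.extend(checksum). The function names here are illustrative assumptions, and the sample bytes are the ones printed earlier in this post.

```python
from hashlib import sha256


def bytes_to_bit_string(data: bytes) -> str:
    """Render bytes as a string of '0'/'1' characters, keeping leading zeros."""
    return "".join(f"{byte:08b}" for byte in data)


def entropy_with_checksum(entropy_bytes: bytes) -> str:
    """Return the entropy bits followed by the leading bits of their SHA-256 hash."""
    entropy_bits = bytes_to_bit_string(entropy_bytes)
    checksum_length = len(entropy_bits) // 32
    hash_bits = bytes_to_bit_string(sha256(entropy_bytes).digest())
    return entropy_bits + hash_bits[:checksum_length]


sample_entropy = b'Q\x83\xe1\xf4\xf1j\xac5\x16\x04<\x0bm`\xcf\x0c'
bits = entropy_with_checksum(sample_entropy)
print(len(bits))       # 132: 128 entropy bits plus 4 checksum bits
print(len(bits) % 11)  # 0: the bits split evenly into 11-bit groups
```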
11 bits is the “magic number” chosen in the BIP 39 spec. Regardless of entropy size, the entropy + checksum needs to be evenly divided into groups of 11 bits. The following Python one-liner does just that:
grouped_bits = tuple(entropy_bits[i * 11: (i + 1) * 11] for i in range(len(entropy_bits) // 11))
print(grouped_bits)
# (bitarray('01010001100'), bitarray('00011111000'), ...)
print(len(grouped_bits))
# 12
The next step is to convert each 11-bit group into an integer. The bitarray package provides a convenient helper function, ba2int, for converting bit arrays to integers. The resulting integers range from zero to 2047 (ba2int(bitarray('11111111111')) == 2047).
from bitarray.util import ba2int
indices = tuple(ba2int(ba) for ba in grouped_bits)
print(indices)
# (652, 248, 1001, 1814, 1366, 212, 704, 1084, 91, 856, 414, 206)
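The same grouping and conversion can be done without bitarray, which may help if the helper functions feel opaque. In this sketch (the function name bits_to_indices is hypothetical), the bits are a plain string of '0' and '1' characters, and Python's built-in int(..., 2) plays the role of ba2int.

```python
def bits_to_indices(bit_string: str, group_size: int = 11) -> tuple:
    """Split a bit string into fixed-size groups and read each group as an integer."""
    if len(bit_string) % group_size != 0:
        raise ValueError("bit string length must be a multiple of the group size")
    return tuple(
        int(bit_string[start:start + group_size], 2)
        for start in range(0, len(bit_string), group_size)
    )


# The first two 11-bit groups from the example above.
print(bits_to_indices("01010001100" + "00011111000"))  # (652, 248)
# Eleven 1-bits give the largest possible index.
print(bits_to_indices("1" * 11))  # (2047,)
```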
At this point, we have twelve integers, each representing a word in a word list. Word lists come in several languages, but each has 2048 words.
Note: if you’re implementing your own wallet, you’re free to make up your own word list, but wallets produced with your word list will not be interoperable/recoverable with other BIP 39-compliant wallet providers.
For this example, we’ll assume the English word list is already loaded into memory. Simply look up the English word at each index to reveal your mnemonic:
english_word_list = ['abandon', 'ability', ..., 'zone', 'zoo']  # abbreviated; the full list has 2048 words
mnemonic_words = tuple(english_word_list[i] for i in indices)
print(mnemonic_words)
# ('face', 'business', 'large', 'tissue', 'print', 'box', 'fix', 'maple', 'arena', 'help', 'critic', 'border')
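One way to get the word list into memory is to read it from a text file with one word per line. The sketch below assumes such a file exists locally; the file name english.txt and both function names are assumptions for illustration, so adjust them to wherever your copy of the list lives. The length check guards against a truncated or mismatched file, since every index from 0 to 2047 must map to a word.

```python
from pathlib import Path


def load_word_list(path: str) -> list:
    """Read a whitespace-separated word list and confirm it holds 2048 entries."""
    words = Path(path).read_text(encoding="utf-8").split()
    if len(words) != 2048:
        raise ValueError(f"expected 2048 words, found {len(words)}")
    return words


def indices_to_words(indices, word_list) -> tuple:
    """Map each integer index to the word at that position in the list."""
    return tuple(word_list[i] for i in indices)


# Usage, assuming english.txt holds the English BIP 39 word list:
# english_word_list = load_word_list("english.txt")
# mnemonic_words = indices_to_words(indices, english_word_list)
```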
Mnemonic words generated! 💥
Mind you, the words are only useful when they produce a seed, which can derive private and public keys. So, let’s find that seed and wrap this up.
The 512-bit seed is produced by a password-based key derivation function, specifically PBKDF2. The inputs to this function are the pseudorandom function (HMAC-SHA512), a password (our mnemonic sentence), a salt, and the number of iterations the hash function will run (2048).
The only argument we haven’t covered yet is the salt. This is an opportunity to add an additional level of security to your wallets. To produce the salt, the string “mnemonic” is concatenated with an optional passphrase of your choosing. If you don’t supply one, the passphrase will default to an empty string.
passphrase = "you-make-this-up"
salt = "mnemonic" + passphrase
That’s everything we need to derive the seed. In Python-land, hashlib’s pbkdf2_hmac function is the one we’re looking for. Note that the mnemonic sentence needs to be in string format, with the words separated by spaces. Then, both the mnemonic and the salt need to be converted to bytes.
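import hashlib
mnemonic_string = ' '.join(mnemonic_words)
# 'face business large ... help critic border'
seed = hashlib.pbkdf2_hmac("sha512", mnemonic_string.encode("utf-8"), salt.encode("utf-8"), 2048)
print(seed.hex())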
✨ Voila! ✨
The seed is returned as 64 bytes (512 bits), but the hexadecimal format is how you would commonly see it represented. If you coded along at home, a quick way to check your work is to plug the mnemonic sentence you generated into a hosted BIP 39 converter and see if the resulting seed matches yours. Want the code? Here’s that Jupyter notebook link again.
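For a check that does not involve pasting a mnemonic into a website, the derivation can be wrapped in a small function and run locally. This is a sketch rather than a hardened implementation: the function name mnemonic_to_seed is an assumption, and the NFKD normalization step comes from the BIP 39 spec rather than from the snippets in this post (it makes no difference for plain ASCII input).

```python
import hashlib
import unicodedata


def mnemonic_to_seed(mnemonic: str, passphrase: str = "") -> bytes:
    """Derive the 512-bit seed from a mnemonic sentence and an optional passphrase."""
    password = unicodedata.normalize("NFKD", mnemonic).encode("utf-8")
    salt = unicodedata.normalize("NFKD", "mnemonic" + passphrase).encode("utf-8")
    # PBKDF2 with HMAC-SHA512 and 2048 iterations; the output defaults to 64 bytes.
    return hashlib.pbkdf2_hmac("sha512", password, salt, 2048)


words = "face business large tissue print box fix maple arena help critic border"
seed = mnemonic_to_seed(words, "you-make-this-up")
print(len(seed))                        # 64 bytes, i.e. 512 bits
print(seed.hex())
print(mnemonic_to_seed(words) == seed)  # False: no passphrase gives a different seed
```

Running the function twice with the same inputs returns the same bytes, which is the deterministic property that makes recovery from the words possible.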
Disclaimer, again: code provided is for educational purposes. It’s *not* a good idea to store assets in a wallet after you’ve plugged its seed into random websites.
Coming up next: a walkthrough of BIP 32, illustrating how to convert that 512-bit seed into multiple private and public keys for various use cases.
Update: published! The next post is available here: Ethereum 201: HD Wallets.