Simple hardening of the Python interpreter
For many companies, protecting source code from reverse engineering ranges from very important to vital. Languages that run on a virtual machine expose their bytecode to simple decompilation techniques that turn the op-codes back into human-readable code. One solution is to obfuscate the source code, but with Python this is really hard: renaming functions, class methods and attributes can break code that accesses them through string literals (e.g. getattr(obj, "attribute") can fail if the obfuscator renames the attribute).
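To make the getattr problem concrete, here is a minimal sketch. The class names and the renamed attribute a are hypothetical; they only stand in for what a renaming obfuscator might produce.

```python
class Account(object):
    """Original class: the attribute is called 'balance'."""

    def __init__(self):
        self.balance = 100


class RenamedAccount(object):
    """Hypothetical obfuscator output: 'balance' was renamed to 'a'."""

    def __init__(self):
        self.a = 100


def read_balance(obj):
    # The attribute name lives in a string literal, which a renaming
    # obfuscator cannot safely rewrite.
    return getattr(obj, "balance")


print(read_balance(Account()))
try:
    read_balance(RenamedAccount())
except AttributeError as exc:
    print("lookup failed: %s" % exc)
```

The first call works, while the second raises AttributeError because the literal "balance" no longer matches any attribute.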
In this post I’ll talk about how to obfuscate the bytecode generated by the interpreter.
Source code to bytecode
The Python interpreter doesn’t run code straight from its plain-text, human-readable representation; it parses it and produces a bytecode representation containing the op-codes to be executed by the virtual machine. For example:
>>> code = compile("print 'hello world'", "<string>", "exec")
>>> code.co_code
'd\x00\x00GHd\x01\x00S'
# More readable bytecode
>>> ':'.join(x.encode('hex') for x in code.co_code)
'64:00:00:47:48:64:01:00:53'
The print 'hello world' statement is compiled into the bytestream accessible through the code.co_code attribute, which can be disassembled using the dis module:
>>> import dis
>>> dis.dis(code)
  1           0 LOAD_CONST               0 ('hello world')
              3 PRINT_ITEM
              4 PRINT_NEWLINE
              5 LOAD_CONST               1 (None)
              8 RETURN_VALUE
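To show how the hex dump relates to the disassembly, here is a minimal sketch of a decoder for this one bytestream. It is not the dis implementation: the table contains only the four op-codes used in the example (with their stock Python 2.7 values), and the function name decode is my own.

```python
# Stock Python 2.7 values for the four op-codes used in the example.
OPNAMES_27 = {
    71: "PRINT_ITEM",
    72: "PRINT_NEWLINE",
    83: "RETURN_VALUE",
    100: "LOAD_CONST",
}
# In Python 2.7, op-codes >= 90 are followed by a 2-byte little-endian argument.
HAVE_ARGUMENT = 90


def decode(hex_string, opnames=OPNAMES_27):
    """Return a list of (offset, name, argument) tuples."""
    raw = [int(byte, 16) for byte in hex_string.split(":")]
    result = []
    i = 0
    while i < len(raw):
        op = raw[i]
        name = opnames.get(op, "<unknown %d>" % op)
        if op >= HAVE_ARGUMENT:
            arg = raw[i + 1] | (raw[i + 2] << 8)
            result.append((i, name, arg))
            i += 3
        else:
            result.append((i, name, None))
            i += 1
    return result


for offset, name, arg in decode("64:00:00:47:48:64:01:00:53"):
    print("%d %s %s" % (offset, name, "" if arg is None else arg))
```

The offsets and names it prints match the dis output above; the arguments 0 and 1 are indexes into the code object's constants.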
When a .py module is imported, the Python interpreter parses the content of the file and writes the generated bytecode to a file with the same name as the original but with the .pyc extension (note that on Python 3.x the generated bytecode is placed in the __pycache__ folder).
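The same bytecode file can also be produced explicitly with the standard py_compile module. The sketch below assumes Python 3, where py_compile.compile() returns the path of the file it wrote; the file name hello.py and the helper compile_demo are made up for the illustration.

```python
import os
import py_compile
import tempfile


def compile_demo():
    """Compile a throwaway module and return the path of the bytecode file."""
    workdir = tempfile.mkdtemp()
    source = os.path.join(workdir, "hello.py")
    with open(source, "w") as handle:
        handle.write("print('hello world')\n")
    # On Python 3, py_compile.compile() returns the path of the file it wrote.
    return py_compile.compile(source)


pyc_path = compile_demo()
print(pyc_path)
```

With default settings the printed path should point inside a __pycache__ folder next to the source file.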
As explained in the documentation of the dis module:
Bytecode is an implementation detail of the CPython interpreter! No guarantees are made that bytecode will not be added, removed, or changed between versions of Python. Use of this module should not be considered to work across Python VMs or Python releases.
So the bytecode generated by an interpreter is only guaranteed to work with that particular interpreter: different releases of the same interpreter can generate different bytecode. The generated bytecode does not, however, depend on the underlying hardware architecture.
This means we can create our own version of the interpreter that uses different op-codes for the same instructions, making its bytecode incompatible with any other Python interpreter and protecting the code from being disassembled with standard tools.
Building a custom interpreter
In order to build your custom Python interpreter you need to download the source code first:
apt-get source python2.7
This will download the source code of the Python 2.7 interpreter and automatically unpack it; you will find a python2.7-2.7.x folder (where x is the patch number) in the current directory. Now navigate into python2.7-2.7.x/Include and open the header file opcode.h. This file contains the definitions of all the op-codes used by the interpreter; changing its content makes the interpreter generate different op-codes for the same instructions.
As a test we’ll change the op-codes of the VM’s PRINT_* instructions from the original values:
#define PRINT_EXPR 70
#define PRINT_ITEM 71
#define PRINT_NEWLINE 72
#define PRINT_ITEM_TO 73
#define PRINT_NEWLINE_TO 74
to these new values:
#define PRINT_EXPR 74
#define PRINT_ITEM 70
#define PRINT_NEWLINE 71
#define PRINT_ITEM_TO 72
#define PRINT_NEWLINE_TO 73
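Before rebuilding, it is worth checking that the edited values are still unique, since two instructions sharing one op-code would break the interpreter. The following is a minimal sketch of such a check, run only on the five defines above; the helper name find_collisions is my own.

```python
import re
from collections import defaultdict

# The edited fragment of opcode.h shown above.
NEW_DEFINES = """
#define PRINT_EXPR 74
#define PRINT_ITEM 70
#define PRINT_NEWLINE 71
#define PRINT_ITEM_TO 72
#define PRINT_NEWLINE_TO 73
"""


def find_collisions(header_text):
    """Return {value: [names]} for every value assigned to more than one name."""
    by_value = defaultdict(list)
    pattern = r"^#define\s+(\w+)\s+(\d+)"
    for name, value in re.findall(pattern, header_text, re.M):
        by_value[int(value)].append(name)
    return {v: names for v, names in by_value.items() if len(names) > 1}


print(find_collisions(NEW_DEFINES))
```

An empty dictionary means the new values are a clean permutation of the old ones. Note that running this on the whole header would also flag defines that are not op-codes, such as HAVE_ARGUMENT, so those would have to be filtered out.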
Time to install the compiler and all the packages needed to compile the interpreter’s source code:
sudo apt-get install build-essential
sudo apt-get build-dep python2.7
and proceed to the build phase:
./configure
make
When done, launch the freshly built Python interpreter and display the bytecode of our simple statement again:
>>> code = compile("print 'hello world'", "<string>", "exec")
>>> ':'.join(x.encode('hex') for x in code.co_code)
'64:00:00:46:47:64:01:00:53'
As you can see, the bytecode generated by our interpreter is different from the one generated by the standard interpreter: the 4th and 5th bytes changed from 47:48 to 46:47, in accordance with our modification of the opcode.h file.
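The difference between the two dumps can also be checked mechanically. This is a small sketch that compares the two hex strings from this post byte by byte; the helper name changed_bytes is my own.

```python
def changed_bytes(before, after):
    """Compare two colon-separated hex dumps; return (offset, old, new) triples."""
    old = [int(byte, 16) for byte in before.split(":")]
    new = [int(byte, 16) for byte in after.split(":")]
    if len(old) != len(new):
        raise ValueError("dumps differ in length")
    return [(i, a, b) for i, (a, b) in enumerate(zip(old, new)) if a != b]


stock = "64:00:00:47:48:64:01:00:53"   # standard interpreter
custom = "64:00:00:46:47:64:01:00:53"  # interpreter built with the edited opcode.h
for offset, old, new in changed_bytes(stock, custom):
    print("offset %d: %d -> %d" % (offset, old, new))
# offset 3: 71 -> 70
# offset 4: 72 -> 71
```

In decimal, the two changes are exactly the new values given to PRINT_ITEM (71 to 70) and PRINT_NEWLINE (72 to 71).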
Conclusion
Building a custom version of the Python interpreter with scrambled op-codes to protect the source code from being disassembled is a pretty fast and cheap solution; a single change in the header file and the generated bytecode will be incompatible with any other Python interpreter.
However, this is only the first step; scrambling the op-codes is not enough to protect your code:
- An attacker can use our custom interpreter to load the bytecode and disassemble it with the dis module, so that module must be removed
- An attacker can also load their own .py file, compile it into a .pyc/.pyo and reverse engineer the new op-code map, so any way of translating source files into bytecode should be disabled
- Scrambling the op-codes is not just a matter of assigning new values: if you take a look at the comments in the opcode.h file, some values are strictly tied to other op-codes, or the new value must satisfy particular criteria (e.g. look at the CALL_FUNCTION_* op-codes for an example of such tight criteria)
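As an illustration of the last point, here is a sketch of a check for the CALL_FUNCTION_* op-codes. It assumes the rule stated in the comment of the stock 2.7 header as I read it: the three CALL_FUNCTION_VAR, CALL_FUNCTION_KW and CALL_FUNCTION_VAR_KW op-codes must be contiguous and satisfy (CALL_FUNCTION_VAR - CALL_FUNCTION) & 3 == 1. Verify this against the opcode.h in your own source tree before relying on it; the helper name is my own.

```python
def call_function_layout_ok(opcodes):
    """Check the assumed CALL_FUNCTION_* layout rule on a {name: value} dict."""
    base = opcodes["CALL_FUNCTION"]
    var = opcodes["CALL_FUNCTION_VAR"]
    kw = opcodes["CALL_FUNCTION_KW"]
    var_kw = opcodes["CALL_FUNCTION_VAR_KW"]
    # VAR, KW and VAR_KW must follow each other in this order.
    contiguous = (kw == var + 1) and (var_kw == var + 2)
    return contiguous and ((var - base) & 3) == 1


# Stock Python 2.7 values.
STOCK = {
    "CALL_FUNCTION": 131,
    "CALL_FUNCTION_VAR": 140,
    "CALL_FUNCTION_KW": 141,
    "CALL_FUNCTION_VAR_KW": 142,
}
# A careless scramble that simply swaps two of the values.
SWAPPED = dict(STOCK, CALL_FUNCTION_VAR=141, CALL_FUNCTION_KW=140)

print(call_function_layout_ok(STOCK))
print(call_function_layout_ok(SWAPPED))
```

The stock layout passes, while the swapped one fails, which is why a purely random shuffle of the values is not safe.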
So this is only the tip of the iceberg: if your code really must be protected, creating a custom hardened Python interpreter can be a pretty time-consuming job, and it requires a good understanding of the internals of the interpreter itself.