Simple hardening of the Python interpreter

For many companies, protecting source code from reverse engineering ranges from very important to vital. Languages that run on a virtual machine expose their bytecode to simple decompilation techniques that turn the op-codes back into human-readable code. One solution is to obfuscate the source code, but with Python this is really hard: renaming functions, class methods and attributes can break code that accesses them through string literals (e.g. getattr(obj, "attribute") fails if the obfuscator renames the attribute).
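To make the getattr problem concrete, here is a minimal sketch. The class names and the renamed attribute a1 are hypothetical: they only stand in for what a renaming obfuscator might produce.

```python
class Account(object):
    """Original class: the attribute is called 'balance'."""
    def __init__(self):
        self.balance = 100


class ObfuscatedAccount(object):
    """Same class after a hypothetical obfuscator renamed 'balance' to 'a1'."""
    def __init__(self):
        self.a1 = 100


def read_balance(obj):
    # The attribute name lives in a string literal, which a renaming
    # obfuscator cannot safely rewrite.
    return getattr(obj, "balance")


original_value = read_balance(Account())

try:
    read_balance(ObfuscatedAccount())
    lookup_failed = False
except AttributeError:
    # The renamed class no longer has an attribute called 'balance'.
    lookup_failed = True

print(original_value, lookup_failed)
```

The lookup works on the original class and raises AttributeError on the renamed one, which is exactly the kind of breakage that only shows up at run time.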

In this post I’ll talk about how to obfuscate the bytecode generated by the interpreter.

Source code to bytecode

The Python interpreter doesn’t run code straight from its plain-text, human-readable form: it parses it and produces a bytecode representation containing the op-codes to be executed by the virtual machine. For example:

>>> code = compile("print 'hello world'", "<string>", "exec")
>>> code.co_code
'd\x00\x00GHd\x01\x00S'
# More readable bytecode
>>> ':'.join(x.encode('hex') for x in code.co_code)
'64:00:00:47:48:64:01:00:53'

The print 'hello world' statement is compiled into the bytestream accessible through the code.co_code attribute, which can be disassembled using the dis module:

>>> import dis
>>> dis.dis(code)
  1           0 LOAD_CONST               0 ('hello world')
              3 PRINT_ITEM
              4 PRINT_NEWLINE
              5 LOAD_CONST               1 (None)
              8 RETURN_VALUE

When a .py module is imported, the Python interpreter parses the file and writes the bytecode to a file with the same name as the original but with the .pyc extension (note that on Python 3.x the generated bytecode goes into the __pycache__ folder).

As explained in the dis package’s documentation:

Bytecode is an implementation detail of the CPython interpreter! No guarantees are made that bytecode will not be added, removed, or changed between versions of Python. Use of this module should not be considered to work across Python VMs or Python releases.

So the bytecode generated by an interpreter works only for that particular interpreter: different releases of the same interpreter can generate different bytecode, while the generated bytecode does not depend on the underlying hardware architecture.

This means we can create our own version of the interpreter with different op-codes for the same instructions, rendering the bytecode incompatible with any other Python interpreter and protecting the code from disassembly.
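The tie between a .pyc file and the interpreter release that produced it is visible in the file itself: CPython writes a release-specific magic number in the first bytes. The sketch below assumes a Python 3 interpreter (unlike the 2.7 sessions in this post); the pyc_magic helper and the hello.py file name are made up for illustration.

```python
import importlib.util
import os
import py_compile
import tempfile


def pyc_magic(source_text):
    """Compile source_text to a .pyc in a temp folder and return its first 4 bytes."""
    workdir = tempfile.mkdtemp()
    src = os.path.join(workdir, "hello.py")  # hypothetical file name
    with open(src, "w") as fh:
        fh.write(source_text)
    # Write the bytecode next to the source instead of into __pycache__.
    pyc = py_compile.compile(src, cfile=src + "c", doraise=True)
    with open(pyc, "rb") as fh:
        return fh.read(4)


magic = pyc_magic("print('hello world')\n")

# The header matches the magic number of the interpreter running this script.
print(magic == importlib.util.MAGIC_NUMBER)
```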

Building a custom interpreter

In order to build your custom Python interpreter you need to download the source code first:

apt-get source python2.7

This will download the source code of the Python 2.7 interpreter and automatically unpack it; you will find a python2.7-2.7.x folder (where x is the patch number) in the current directory. Now navigate into python2.7-2.7.x/Include and open the header file opcode.h. This file defines all the op-codes used by the interpreter; changing its content makes the interpreter generate different op-codes for the same instructions.

As a test we’ll change the op-codes of the PRINT_* VM instructions from the original values:

#define PRINT_EXPR      70
#define PRINT_ITEM      71
#define PRINT_NEWLINE   72
#define PRINT_ITEM_TO   73
#define PRINT_NEWLINE_TO 74

to these new values:

#define PRINT_EXPR      74
#define PRINT_ITEM      70
#define PRINT_NEWLINE   71
#define PRINT_ITEM_TO   72
#define PRINT_NEWLINE_TO 73

Time to install the compiler and all the packages needed to compile the interpreter’s source code:

sudo apt-get install build-essential
sudo apt-get build-dep python2.7

and proceed to the build phase, from the root of the unpacked source folder:

./configure
make

When the build is done, launch the freshly built Python interpreter (./python in the source folder) and display the bytecode of our simple statement again:

>>> code = compile("print 'hello world'", "<string>", "exec")
>>> ':'.join(x.encode('hex') for x in code.co_code)
'64:00:00:46:47:64:01:00:53'

As you can see, the bytecode generated by our interpreter differs from the one generated by the standard interpreter: the 4th and 5th bytes changed from 47:48 to 46:47, in line with our modification of the opcode.h file.
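The same translation can be reproduced without rebuilding anything, by walking the standard bytestream and swapping op-codes. This is only a sketch of the idea: remap_opcodes and to_hex are names I made up, and it relies on the Python 2.7 layout, where op-codes greater than or equal to HAVE_ARGUMENT (90 in opcode.h) are followed by two argument bytes.

```python
HAVE_ARGUMENT = 90  # CPython 2.7: op-codes >= 90 carry a 2-byte argument

# Old value -> new value, as edited in opcode.h above.
PRINT_REMAP = {70: 74, 71: 70, 72: 71, 73: 72, 74: 73}


def remap_opcodes(bytecode, mapping):
    """Return a copy of a Python 2.7 bytestream with its op-codes translated."""
    out = bytearray(bytecode)
    i = 0
    while i < len(out):
        op = out[i]
        out[i] = mapping.get(op, op)
        # Skip the argument bytes so they are never mistaken for op-codes.
        i += 3 if op >= HAVE_ARGUMENT else 1
    return out


def to_hex(data):
    """Format bytes the same way as the interpreter sessions in this post."""
    return ":".join("%02x" % b for b in bytearray(data))


standard = bytearray.fromhex("640000474864010053")
custom = remap_opcodes(standard, PRINT_REMAP)
print(to_hex(custom))  # 64:00:00:46:47:64:01:00:53
```

Skipping the argument bytes matters: a constant index that happens to equal 71 must not be rewritten as if it were PRINT_ITEM.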

Conclusion

Building a custom version of the Python interpreter with scrambled op-codes to protect the source code from being disassembled is a pretty fast and cheap solution; a single change in the header file and the generated bytecode will be incompatible with any other Python interpreter.

However, this is only the first step; scrambling the op-codes is not enough to protect your code:

  • An attacker can use our custom interpreter to load the bytecode and disassemble it with the dis module, so that module must be removed

  • An attacker can also compile their own .py file into a .pyc/.pyo with our interpreter and reverse engineer the new op-code map, so every way of translating source files into bytecode should be disabled

  • Scrambling the op-codes is not just a matter of assigning new values: as the comments in opcode.h show, some values are strictly tied to other op-codes, and some new values must satisfy particular criteria (e.g. see the CALL_FUNCTION_* op-codes for an example of such tight criteria)
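To give an idea of how little effort the second point takes, the sketch below recovers part of the op-code map by comparing the two bytestreams of print 'hello world' shown earlier. The recover_opcode_map name is hypothetical, and the code assumes the scramble kept every op-code on the same side of HAVE_ARGUMENT (90 in Python 2.7), so that both streams have the same instruction layout.

```python
HAVE_ARGUMENT = 90  # CPython 2.7: op-codes >= 90 are followed by 2 argument bytes


def recover_opcode_map(standard, custom):
    """Map standard op-code -> custom op-code from two bytestreams
    compiled from the same source."""
    standard, custom = bytearray(standard), bytearray(custom)
    if len(standard) != len(custom):
        raise ValueError("streams differ in length; layouts are not comparable")
    mapping = {}
    i = 0
    while i < len(standard):
        op = standard[i]
        if custom[i] != op:
            mapping[op] = custom[i]
        # Step over the argument bytes of instructions that have them.
        i += 3 if op >= HAVE_ARGUMENT else 1
    return mapping


# The same statement compiled by the standard and by the custom interpreter.
standard = bytearray.fromhex("640000474864010053")
custom = bytearray.fromhex("640000464764010053")

recovered = recover_opcode_map(standard, custom)
print(recovered)  # {71: 70, 72: 71}
```

One statement only reveals the op-codes it uses, but an attacker who can compile arbitrary source just needs enough samples to cover the whole instruction set.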

So this is only the tip of the iceberg: if your code really needs to be protected, creating a custom hardened Python interpreter can be a pretty time-consuming job that requires a good understanding of the interpreter’s internals.