(just wrote this up in a few minutes, so there might be typos and stuff)
(the reader is expected to have at least basic knowledge of python)
(and programming in general)
(if not, read fullern's introduction)
(if you have a question about it (or feedback), comment here or join the repl.it discord and DM elias#7990)
This document is an explanation of how Python handles and manipulates data. It is an attempt to clarify some common misconceptions about Python. Important vocabulary and facts will be bolded.
First, forget what you know about Python objects (but not syntax). I will be explaining most of the useful features here.
Expressions in Python are the main way of manipulating data. They calculate (evaluate) a value based on the current state of the program. There are a few types of expressions:
- Literals: Literals always evaluate to the same thing. For example, the literal `5` always evaluates to the number 5.
- Operations: Operations combine values. For example, the `+` operation adds numbers (or concatenates strings). An expression with an operation looks like `<some expression> <operation (+, -, *, etc.)> <other expression>` (without the angle brackets). Operations follow an extended version of PEMDAS (Python's operator precedence rules).
- Function calls: These are described in the Functions section.
- Ternaries: Ternary expressions look like `<expression a> if <condition expression> else <expression b>`. First `<condition expression>` is evaluated, and if it is truthy (acts like True; see Truthiness section for more), `<expression a>` is evaluated. If it is not truthy (falsy), `<expression b>` is evaluated instead.
- Comprehensions: See the Comprehension section.
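To make the expression types above concrete, here is a small sketch. The variable names are made up for illustration; only the literal, operation, and ternary forms come from the list above.

```python
# A literal: always evaluates to the same value.
five = 5

# Operations: * binds tighter than +, as in PEMDAS.
total = 1 + 2 * 3           # 1 + 6
greeting = "py" + "thon"    # + concatenates strings

# A ternary: the condition is evaluated first, then only one branch.
label = "big" if total > 5 else "small"

print(five, total, greeting, label)  # 5 7 python big
```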
Python code is composed of lines and blocks. Lines in Python can have a few different forms (more than the ones shown here, but the others are used less):

    # <expression> is evaluated, and then the value is discarded
    <expression>

    # <expression> is evaluated, and assigned to variable as described in Scoping.
    variable = <expression>

    # <expression> is evaluated, and (assuming it's a list, tuple, etc.; see Iterables)
    # var1 is set to its zeroth element and var2 is set to its first
    var1, var2 = <expression>

    # <expression> is evaluated, and (var1, var2) is set to its zeroth element
    # and var3 is set to its first
    (var1, var2), var3 = <expression>

    # other patterns similar to those above also work, like (var1, var2), (var3, var4)
    # i use <lvalue> to represent any of those patterns

    if <expression>:      # <expression> is evaluated
        # if <expression> was truthy, this block is run and the code
        # skips to <more lines, not indented>
        <more lines of code, indented by some amount>
    elif <expression2>:   # <expression2> is evaluated (only if <expression> was falsy). elifs are optional
        # if <expression2> was truthy, this block is run and the code skips
        <more indented code>
    ...                   # the pattern continues
    else:
        # if none of <expression>, <expression2>, ... was truthy, this block is run
        <last indented code>
    <more lines, not indented>

    while <expression>:   # evaluate <expression>. if it is true,
        <indented code>   # run <indented code> and go back to the start

    # for every element in <expression> (probably a list, tuple, or iterable),
    # set <lvalue> to that element and run <indented code>
    for <lvalue> in <expression>:
        <indented code>

    def name(value):      # described in Functions
        <indented code>

    class name():         # described in User-Defined Classes
        <indented code>
`def` and `class` can also be used in indented code.
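The unpacking patterns can be tried directly. This is a minimal sketch with made-up values; the names `pair`, `nested`, and `totals` are only for illustration.

```python
pair = (1, 2)
var1, var2 = pair            # var1 gets element 0, var2 gets element 1

nested = ((1, 2), 3)
(a, b), c = nested           # nested patterns unpack nested data

# The same patterns work as the <lvalue> of a for loop.
totals = []
for (x, y), z in [((1, 2), 3), ((4, 5), 6)]:
    totals.append(x + y + z)

print(var1, var2, a, b, c, totals)  # 1 2 1 2 3 [6, 15]
```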
When a program starts running, all variables are in a scope (collection of variables) called the global scope. Other scopes can be created and manipulated as described in Functions and User-Defined Classes.
Objects & Types:
All data in Python is an object. Numbers are objects, strings are objects, and functions are objects. Statements like `while` are not objects, since they are code, not data. Every object in Python has a type, a.k.a. class. The type of a number is `int` or `float`. The type of a string is `str`. The type of an object can be determined with `type(obj)`. These types are also objects. Their type is called `type`. `type`'s type is itself. Types (but not non-type objects) also have superclasses. These superclasses are also types, and specify additional data about the type (more info in Attributes). If a type has multiple superclasses, an algorithm (C3 linearization, which produces the method resolution order) is used to determine priorities for them in certain situations.
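These claims are easy to check in an interpreter. The sketch below uses two throwaway classes, `Base` and `Child`, which are assumptions for the example and not part of the original text.

```python
print(type(5))       # <class 'int'>
print(type(5.0))     # <class 'float'>
print(type("five"))  # <class 'str'>

# Types are objects too, and their type is `type`.
print(type(int))             # <class 'type'>
print(type(type) is type)    # True

class Base:
    pass

class Child(Base):
    pass

# The priority order of a type's superclasses is stored in __mro__.
order = Child.__mro__
print([cls.__name__ for cls in order])  # ['Child', 'Base', 'object']
```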
Objects also have attributes. These attributes are named. Their names start with a character a-z, A-Z, or _. The characters after the first can be a-z, A-Z, _, or a digit. For example, an attribute could be called `__add__`, but not `123AbC` (the first character cannot be a digit). These attributes can be accessed with `the_object.attribute_name`. When an attribute is accessed, if it is not found on the object itself, the object's class (a.k.a. type) is checked. If the attribute is not present there, the superclasses of the type are checked next, in order by priority. If none of the superclasses have the attribute, an exception is raised (see Exceptions).
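The lookup order can be seen with a small sketch. The classes `Animal` and `Dog` and their attributes are hypothetical, chosen only to show where each attribute is found.

```python
class Animal:
    legs = 4            # stored on the superclass

class Dog(Animal):
    sound = "woof"      # stored on the class

rex = Dog()
rex.name = "Rex"        # stored on the object itself

print(rex.name)   # found on the object
print(rex.sound)  # not on the object, found on its class
print(rex.legs)   # not on the class, found on a superclass

try:
    rex.wings     # found nowhere, so an exception is raised
except AttributeError as error:
    missing = str(error)
    print("lookup failed:", missing)
```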
Attributes also have values, which are the values returned when an attribute is accessed. These values can be any object. There are also descriptors, which allow code to be run when an attribute is accessed. They are described in more detail in the Descriptors section.
Some attribute names are special. For example, an attribute named `__add__` specifies how objects are added together with `+`. This will be described in more detail in the User-Defined Classes section.
Functions are snippets of Python code, and are also objects. They can be called with `the_function_object(argument_1, argument_2, ...)`. When a function is called, the code that called it is paused, the passed arguments are placed in variables, and the function's code is run. When the function executes a `return` statement, the function's code stops, and the returned value is sent back to the calling code. If the function reaches its last line without executing a `return`, it returns `None`, an object representing no data.
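As a quick check of the return rules, here is a sketch with two made-up functions, one that never executes `return` and one that returns early.

```python
def no_return(x):
    x + 1                 # evaluated and discarded; no return statement runs

def early_return(x):
    if x < 0:
        return "negative"     # the function's code stops here
    return "non-negative"

print(no_return(41))      # None
print(early_return(-1))   # negative
print(early_return(1))    # non-negative
```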
    def a_function(number):
        # when a_function(value) is called, the calling code is paused
        # then a new scope is created,
        # and a variable called number is created with the argument in it
        # the function's code is then run in the new scope
        result = number + 1  # result is created in this function's scope
        return result  # the function's code stops here, and result is sent to the calling code

    function_result = a_function(41)  # calling a_function, where number is set to 41
    # function_result is set to the value returned from a_function
    print(function_result)  # print is a function too! it returns None
When run, this code takes the following path:

    def a_function(number): # all that code above

This line (and the code in the `def`) creates a function object, then puts it in a variable called `a_function`.

    function_result = a_function(41)

This line calls `a_function` with the argument `41`. We'll come back to this line once the function returns.

The function is called, so a new scope is created. Here, the argument `41` corresponds to the variable `number`, so the variable `number` in the new scope is set to `41`.

    result = number + 1

This line calculates `41` (the value of the `number` variable) `+ 1`, which equals `42`. The `result` variable is then set to `42`.

    return result

This line ends the function, returning the value of `result` (`42`). You don't need to return a variable, so this function could have just returned `number + 1` instead.
This also goes back to the scope that existed before the function was called (the global scope).

    function_result = a_function(41)

Now that the function has returned, `function_result` is set to the value it returned, which is `42`. Note that this variable is created in the global scope, not in the function's scope.

    print(function_result)

This outputs the value of `function_result` (`42`).
Generators, Iterators, and Iterables
Generators are one of my favorite features in Python. They are easiest to understand with an example, so here is one:
    def a_generator_func():
        print("about to yield first value")
        yield "first value"
        print("just yielded first value")
        print("about to yield second value")
        yield "second_value"
        print("done yielding, there is no third value")

    a_generator = a_generator_func()
    print("about to get first value")
    print("the first value is", repr(next(a_generator)))
    print("about to get second value")
    print("the second value is", repr(next(a_generator)))
    print("about to get third value")
    print("the third value is", repr(next(a_generator)))  # never prints: next() raises first

This code will output

    about to get first value
    about to yield first value
    the first value is 'first value'
    about to get second value
    just yielded first value
    about to yield second value
    the second value is 'second_value'
    about to get third value
    done yielding, there is no third value
    Traceback (most recent call last):
      File "python", line 17, in <module>
    StopIteration
`a_generator_func` is a function that, instead of pausing the main program when called, returns an object (a generator) representing the current state of its code. `next` can be repeatedly called on that object to run it until it `yield`s a value.
When a generator runs out of things to yield, a `StopIteration` exception (TODO: section on exceptions) is raised. These can be caught with:

    try:
        next_val = next(a_generator)
    except StopIteration:
        print("the generator ended")
Generators can also be used with `for`:

    for elem in a_generator_func():
        print("elem:", elem)

This would output

    about to yield first value
    elem: first value
    just yielded first value
    about to yield second value
    elem: second_value
    done yielding, there is no third value

, but would not raise `StopIteration` (`for` automatically handles this).
Generators are a special class of iterators, which are objects that can have `next` called on them, and can be used with `for`. Lists and tuples are not iterators, but iterables, which can be converted to iterators by calling `iter` on them.
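The difference between an iterable and an iterator can be seen with a plain list. This is a small sketch; the names `numbers` and `it` are made up.

```python
numbers = [10, 20]              # a list is an iterable, not an iterator
print(hasattr(numbers, "__next__"))  # False: next(numbers) would fail

it = iter(numbers)              # iter() converts it to an iterator
first = next(it)
second = next(it)
print(first, second)            # 10 20

try:
    next(it)                    # nothing left, so StopIteration is raised
    ended = False
except StopIteration:
    ended = True
print("ended:", ended)          # ended: True

# Calling iter() on an iterator just gives back the same object.
print(iter(it) is it)           # True
```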
There are many online tutorials about creating classes, most of which are better than I can write. So before you read this, go look at one of those tutorials (but keep in mind that everything in play is an object).
So, now that you have read one of those tutorials (go read one if you haven't), consider the following example class:
    class Point():
        def __init__(self, x_pos, y_pos):
            self.x_pos = x_pos
            self.y_pos = y_pos

        def print(not_self):
            print(not_self.x_pos, not_self.y_pos)

        def __add__(tomayto, tomahto):
            return Point(tomayto.x_pos + tomahto.x_pos, tomayto.y_pos + tomahto.y_pos)

    a = Point(1, 2)
    b = Point(3, 4)
    a.print()
    b.print()
    (a + b).print()
This code creates a `Point` class, and 2 instances of it (`a` and `b`). It then calls `print` on `a`, `b`, and `a + b`. The only peculiarity you might notice about this is that it doesn't always use `self`, but different argument names. Despite what most tutorials would have you believe, there is nothing special or magic about `self`. You can replace it with any other name and your code will work fine. If your tutorial didn't cover `__add__` and similar functions, they basically define what happens when `a + b` is executed. More information can be found in the Python data model documentation.
The only "magic" thing left in this code is how the `not_self` and `tomayto` arguments actually get added; `a.print()` never specifies a value for `not_self`. This is covered in the next section, Descriptors.
Descriptors are a way to make code run when an attribute is accessed. They are objects with `__get__` and `__set__` special methods. When a descriptor is found as an attribute of a class (but not when it is stored directly on an instance), its `__get__` method is called with the instance and the class the descriptor was found on. When attributes are set, a similar thing is done with `__set__`. This is how functions automatically add a `self` argument. They are descriptors, so when one is accessed through an instance, it returns a "bound" version of the function that has the instance (`self`) already added as an argument. A shorthand way to create descriptors is:
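    class Test():
        @property
        def x(self):            # the getter receives the instance
            print("getting x")

        @x.setter
        def x(self, value):     # the setter receives the instance and the new value
            print("setting x")

    a = Test()
    a.x      # prints "getting x"
    a.x = 1  # prints "setting x"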
The `@property` and `@x.setter` lines are decorators, described in the next section.
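For comparison, here is a hand-written descriptor without `@property`. The `Shout`, `Speaker`, and `Greeter` classes are hypothetical names invented for this sketch. It also calls a function's `__get__` by hand to show how a bound method is produced.

```python
class Shout:
    """A made-up descriptor that upper-cases whatever was stored."""

    def __get__(self, instance, owner):
        if instance is None:      # accessed on the class itself
            return self
        return instance._word.upper()

    def __set__(self, instance, value):
        instance._word = value

class Speaker:
    word = Shout()                # the descriptor lives on the class

s = Speaker()
s.word = "hello"                  # runs Shout.__set__
print(s.word)                     # runs Shout.__get__ -> HELLO

class Greeter:
    def greet(self):
        return "hi"

g = Greeter()
# Functions are descriptors: __get__ returns a bound method with self filled in.
bound = Greeter.__dict__["greet"].__get__(g, Greeter)
print(bound())                    # hi
```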
Consider the following code:
    def function_making_function(old_function):
        print("making a new function from", old_function)
        def new_function():
            print("calling the old function")
            old_function()
            print("done calling the old function")
        print("made the new function")
        return new_function
This code is relatively simple, except for the fact that a function is defined inside another function. This gives the inner function access to all the variables in the outer function.
    def some_function():
        print("in some_function")

    print("starting to make a new function")
    new_function = function_making_function(some_function)  # not calling it here, just moving it around (remember, functions are objects!)
    print("done making the new function")
    print("calling the new function")
    new_function()
    print("done")

This outputs (the address after `at` varies from run to run):

    starting to make a new function
    making a new function from <function some_function at 0x...>
    made the new function
    done making the new function
    calling the new function
    calling the old function
    in some_function
    done calling the old function
    done
    print("starting to make a new function")
    @function_making_function
    def some_function():
        print("in some_function")
    print("done making the new function")
    print("calling the new function")
    some_function()
    print("done")
does the exact same thing as the other code above (except that the new function ends up in `some_function` instead of `new_function`). `@function_making_function` is just shorthand for `some_function = function_making_function(some_function)` (but it must be before the `def`). This is called a decorator. Multiple decorators can be used on the same function, and are applied bottom-to-top. Decorators do not have to return functions; they can also return other objects (such as descriptors in the case of `@property`).
`@property_name.setter` works by having a `setter()` method on the descriptor objects `@property` returns. This `setter()` method changes the behavior of the property's `__set__` method (but not the value as an instance attribute; otherwise the function-binding descriptor wouldn't work).
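The bottom-to-top order matters when decorators do not commute. Below is a minimal sketch with two made-up decorators, `add_ok` and `shout`.

```python
def add_ok(func):
    def wrapper():
        return func() + " ok"
    return wrapper

def shout(func):
    def wrapper():
        return func().upper()
    return wrapper

@add_ok      # applied second
@shout       # applied first (closest to the def)
def hello():
    return "hello"

# The stacked form above is shorthand for this:
def plain_hello():
    return "hello"
manual = add_ok(shout(plain_hello))

print(hello())   # HELLO ok
print(manual())  # HELLO ok
```

If the two decorators were swapped, the result would be `HELLO OK` instead, because `shout` would then run on the string that already ends in " ok".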
Those are most of the features I regularly use in Python. There are a few simplifications and some things are left out, but for the most part this should be accurate. However, I left out a lot of information about the standard library (built-in functions and modules). Information about them can be found in the official Python documentation for built-in functions and for the standard library.
This section details how a certain Python interpreter (CPython) actually runs Python code. Most things discussed here don't apply to other interpreters, although some concepts might be the same.
When Python code is run, it is lexed, parsed, converted to bytecode, and then run.
Lexing, or tokenizing, is the process of breaking code up into groups of symbols so it is easier to process later. A simple example of lexing (not with Python code) could look like:
    1 + 2*(3 +4- 5)
    ->
    NUM 1  OP +  NUM 2  OP *  OPEN_PAREN  NUM 3  OP +  NUM 4  OP -  NUM 5  CLOSE_PAREN

Note that this takes care of insignificant whitespace and other things that might complicate the process later on.
TODO: explain python lex
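Until that explanation is written, the standard library's `tokenize` module gives a rough view of what Python's own lexer produces. This sketch only approximates the simplified notation above: Python reports parentheses as `OP` tokens, and the filter drops the bookkeeping tokens (`NEWLINE`, `ENDMARKER`).

```python
import io
import tokenize

source = "1 + 2*(3 +4- 5)"

# generate_tokens expects a readline-style callable that returns lines of text.
tokens = tokenize.generate_tokens(io.StringIO(source).readline)

simple = [
    (tokenize.tok_name[tok.type], tok.string)
    for tok in tokens
    if tok.type in (tokenize.NUMBER, tokenize.OP)
]

for kind, text in simple:
    print(kind, text)
```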
Parsing converts tokenized code into a tree structure that is easier for code to process. For example, using the arithmetic expression from earlier:
    NUM 1 OP + ... NUM 5 CLOSE_PAREN
    ->
    ADD(
        1,
        MUL(
            2,
            SUB(
                ADD(3, 4),
                5
            )
        )
    )
This takes care of things like parentheses that don't really affect how code actually runs.
TODO: explain python parse
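In the meantime, the standard library's `ast` module can show the tree Python builds for the same expression. This is only a sketch: the node names (`BinOp`, `Add`, `Mult`, ...) are Python's own and differ from the simplified `ADD(...)` notation above, and the exact `ast.dump` text varies between Python versions.

```python
import ast

# mode="eval" parses a single expression instead of a whole module.
tree = ast.parse("1 + 2*(3 +4- 5)", mode="eval")
top = tree.body

print(ast.dump(top))              # the whole tree as text
print(type(top.op).__name__)      # Add: the outermost operation
print(type(top.right.op).__name__)  # Mult: its right-hand operand

# The tree can be compiled and evaluated directly.
value = eval(compile(tree, "<example>", "eval"))
print(value)                      # 5
```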
Bytecode: What is it?
Bytecode is an intermediate language, which means that code is translated to it before being run. For example, again with the arithmetic expression:
    ADD(1, MUL(...))
    ->
    PUSH 1
    PUSH 2
    PUSH 3
    PUSH 4
    ADD     // 3 + 4
    PUSH 5
    SUB     // 3 + 4 - 5
    MUL     // 2 * (3 + 4 - 5)
    ADD     // 1 + 2 * (3 + 4 - 5)

That code is actually not the bytecode, just a simplified version. Bytecode would actually be represented as a sequence of bytes, like in the following example (with bytes as hexadecimal):

    PUSH 1: 00 01 // the stack is currently 1
    PUSH 2: 00 02 // the stack is currently 1 2
    PUSH 3: 00 03 // the stack is currently 1 2 3
    PUSH 4: 00 04 // the stack is currently 1 2 3 4
    ADD:    01    // the stack is currently 1 2 7
    PUSH 5: 00 05 // the stack is currently 1 2 7 5
    SUB:    02    // the stack is currently 1 2 2
    MUL:    03    // the stack is currently 1 4
    ADD:    01    // the stack is currently 5
Here, the opcode for `PUSH` is `00` (with 1 parameter), `ADD` is `01`, `SUB` is `02`, and `MUL` is `03`.
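A toy interpreter for this made-up byte format fits in a few lines. This is a sketch of the simplified example only, not of CPython: it assumes the opcodes 00 = PUSH (one parameter byte), 01 = ADD, 02 = SUB, 03 = MUL, and the function name `run` is invented here.

```python
PUSH, ADD, SUB, MUL = 0x00, 0x01, 0x02, 0x03

def run(bytecode):
    stack = []
    pc = 0                                 # index of the next byte to read
    while pc < len(bytecode):
        opcode = bytecode[pc]
        if opcode == PUSH:
            stack.append(bytecode[pc + 1])  # the parameter follows the opcode
            pc += 2
            continue
        right = stack.pop()                # top of the stack
        left = stack.pop()                 # element below it
        if opcode == ADD:
            stack.append(left + right)
        elif opcode == SUB:
            stack.append(left - right)
        elif opcode == MUL:
            stack.append(left * right)
        else:
            raise ValueError(f"unknown opcode {opcode:#04x}")
        pc += 1
    return stack.pop()

# Bytes for 1 + 2 * (3 + 4 - 5)
program = bytes.fromhex("00 01 00 02 00 03 00 04 01 00 05 02 03 01")
print(run(program))  # 5
```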
The bytecode for Python is complicated and has many operations (detailed in the documentation for the `dis` module), so I will not fully document it here. It is stack-based, which means most of its instructions operate on a data stack, similar to the arithmetic example above. This stack contains Python objects. There is also a set of slots, which are used to store variables, constants, names, etc.
Here are some common instructions (names as used by CPython 3.10 and earlier; later versions rename or merge several of them):

    POP_TOP: remove the top element of the stack (TOS)
    BINARY_ADD: pop the TOS and the element below the TOS (TOS1), add them, and push the result back onto the stack
    BINARY_SUBTRACT: see above (-)
    BINARY_MULTIPLY: see above (*)
    BINARY_FLOOR_DIVIDE: see above (//)
    BINARY_TRUE_DIVIDE: see above (/)
    STORE_FAST(slot_num): pop the TOS and store it to a slot
    LOAD_FAST(slot_num): read a slot and push its value to the stack
    LOAD_CONST(slot_num): load a constant and push it
    CALL_FUNCTION(arg_count): pop <arg_count> arguments and the function object below them, call the function with those arguments, and push the result
    RETURN_VALUE: return the TOS from a function
    JUMP_ABSOLUTE(where): jump to the <where>th bytecode instruction
    POP_JUMP_IF_TRUE(where): pop the TOS, if true, jump to <where>
    POP_JUMP_IF_FALSE(where): pop the TOS, if false, jump to <where>
    GET_ITER: pop the TOS, convert it to an iterator, and push it
    FOR_ITER(delta): use the iterator in TOS (w/o popping) and get the next element. If there is no next element, jump ahead by <delta> instructions
    UNPACK_SEQUENCE(unpack_len): pop an iterable as the TOS and push its elements (there should be <unpack_len>) to the stack
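The real instructions for a piece of code can be inspected with the standard library's `dis` module. This sketch uses a throwaway function named `add`. The exact instruction names printed depend on the CPython version (for example, newer versions show `BINARY_OP` where older ones show `BINARY_ADD`), so no expected output is given.

```python
import dis

def add(a, b):
    return a + b

# Print a human-readable listing of the function's bytecode.
dis.dis(add)

# The same instructions are available as objects.
opnames = [instruction.opname for instruction in dis.get_instructions(add)]
print(opnames)
```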
Bytecode: How is it made?
In Python, bytecode for arithmetic expressions is generated similarly to the simplified arithmetic example above. For example,

    a + b:       <bytecode for a> <bytecode for b> BINARY_ADD
    a - b:       <bytecode for a> <bytecode for b> BINARY_SUBTRACT
    a * b:       <a> <b> BINARY_MULTIPLY
    a / b:       ...
    a // b:      ...
    a(b, c, d):  <a> <b> <c> <d> CALL_FUNCTION(3)
    ...
Bytecode for an `<expression>` line looks like

    <bytecode for <expression>>
    POP_TOP

Bytecode for a `<variable> = <expression>` line in the global scope looks like

    <bytecode for <expression>>
    STORE_NAME(<variable>)

Bytecode for a `<variable> = <expression>` line in the local scope looks like

    <bytecode for <expression>>
    STORE_FAST(<variable slot>)

Bytecode for a `<var1>, <var2>, <var3> = <expression>` line looks like

    <bytecode for <expression>>
    UNPACK_SEQUENCE(3)
    <assign to var1, global or local>
    <assign to var2>
    <assign to var3>

Bytecode for a `(<var1>, <var2>), <var3> = <expression>` line looks like

    <bytecode for <expression>>
    UNPACK_SEQUENCE(2)
    UNPACK_SEQUENCE(2)
    <assign to var1>
    <assign to var2>
    <assign to var3>
Bytecode for an `if <expr>: <code>` line looks like

    <bytecode for <expr>>
    POP_JUMP_IF_FALSE(END)
    <bytecode for <code>>
    LABEL(END)

(this will sometimes be optimized)
Bytecode for an if-elif-else statement looks similar, but more complicated.
Bytecode for a `while <expr>: <code>` line looks like

    LABEL(START)
    <bytecode for <expr>>
    POP_JUMP_IF_FALSE(END)
    <bytecode for <code>>
    JUMP_ABSOLUTE(START)
    LABEL(END)

Bytecode for a `for <lvalue> in <expr>: <code>` line looks like

    <bytecode for <expr>>
    GET_ITER
    LABEL(START)
    FOR_ITER(END)
    <assign to <lvalue>>
    <bytecode for <code>>
    JUMP_ABSOLUTE(START)
    LABEL(END)
Python applies some rudimentary optimizations, like converting
JUMP_ABSOLUTEs to more efficient
Very nice =)
It would be cool to see an advance tutorial that goes into bytecode, and sheds the skin of Python a little more. A great tutorial, especially for those who are new to Python and want to find a reference to work toward.
That's actually a really good tutorial. Especially the part about lexing, parsing and bytecode is very interesting