Python Advanced Concepts Explanation (for beginners)
pyelias (682)

(just wrote this up in a few minutes, so there might be typos and stuff)
(the reader is expected to have at least basic knowledge of python)
(and programming in general)
(if not, read fullern's introduction)

(if you have a question about it (or feedback), comment here or join the repl.it discord and DM elias#7990)

This document is an explanation of how Python handles and manipulates data. It is an attempt to clarify some common misconceptions about Python. Important vocabulary and facts will be bolded.

First, forget what you know about Python objects (but not syntax). I will be explaining most of the useful features here.

Expressions:

Expressions in Python are the main way of manipulating data. They calculate (evaluate) a value based on the current state of the program. There are a few types of expressions:

  • Literals: Literals always evaluate to the same thing. For example, the literal 5 always evaluates to the number 5.
  • Operations: Operations combine values. For example, the + operation adds numbers (or concatenates strings). An expression with an operation looks like <some expression> <operation (+, -, *, etc.)> <other expression> (without the angle brackets). Operations follow an extended version of PEMDAS, described here.
  • Function calls: These are described in the Functions section.
  • Ternaries: Ternary expressions look like <expression a> if <condition expression> else <expression b>. First <condition expression> is evaluated, and if it is truthy (acts like True; see Truthiness section for more), <expression a> is evaluated. If it is not truthy (falsy), <expression b> is evaluated instead.
  • Comprehensions: See the Comprehension section
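As a rough illustration of the expression kinds listed above, here is a short sketch with one line per kind. All of the variable names are made up for this example.

```python
# Each right-hand side below is one kind of expression.
# All variable names here are invented for illustration.

literal = 5                               # literal: always evaluates to 5
total = 2 + 3 * 4                         # operations: * is applied before +
greeting = "py" + "thon"                  # + also concatenates strings
length = len(greeting)                    # function call
size = "big" if total > 10 else "small"   # ternary: the condition is evaluated first
squares = [n * n for n in range(4)]       # comprehension

print(literal, total, greeting, length, size, squares)
# 5 14 python 6 big [0, 1, 4, 9]
```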

Code structure:

Python code is composed of lines and blocks. Lines in Python can take a few different forms (there are more than the ones shown here, but the others are used less often):

<expression>                                    # <expression> is evaluated, and then the value is discarded


variable = <expression>                         # <expression> is evaluated, and assigned to variable as described in Scoping.
var1, var2 = <expression>                       # <expression> is evaluated, and (assuming it's a list, tuple, etc.; see Iterables) var1 is set to its zeroth element and
                                                # var2 is set to its first
                                                
(var1, var2), var3 = <expression>               # <expression> is evaluated, and (var1, var2) is set to its zeroth element and var3 is set to its first

# other patterns similar to those above also work, like (var1, var2), (var3, var4)
# i use <lvalue> to represent any of those patterns


if <expression>:                                # <expression> is evaluated
  <more lines of code, indented by some amount> # if <expression> was truthy, <more lines of code, indented> is run and the code skips to <more lines, not indented>
  
elif <expression2>:                             # <expression2> is evaluated (only if <expression> was falsy). elifs are optional, and don't have to be included
  <more indented code>                          # if <expression2> was truthy, <more indented code> is run and the code skips.
...                                             # the pattern continues

else:
  <last indented code>                          # if none of <expression>, <expression2>, ... was truthy, <last indented code> is run.
<more lines, not indented>

while <expression>:                             # evaluate <expression>. if it is truthy,
  <indented code>                               # run <indented code> and go back to the start. otherwise, skip past the loop

for <lvalue> in <expression>:                   # for every element in <expression> (probably a list, tuple, or other iterable), set <lvalue> to that element and
  <indented code>                               # run <indented code>

def name(value):                                # described in Functions
  <indented code>

class name():                                   # described in User-Defined Classes
  <indented code>

if, while, for, def, and class can also be used in indented code.
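Here is a small runnable sketch of the line forms above, using real values in place of the <placeholders>. The data and variable names are hypothetical and chosen only to exercise each form.

```python
# Hypothetical data, chosen only to exercise the line forms above.
pair = (1, 2)
nested = ((1, 2), 3)

var1, var2 = pair                 # var1 = 1, var2 = 2
(a, b), c = nested                # a = 1, b = 2, c = 3

if var1 > var2:
  order = "descending"
elif var1 == var2:
  order = "equal"
else:
  order = "ascending"             # runs because neither condition was truthy

countdown = 3
while countdown:                  # 0 is falsy, so the loop ends when countdown reaches 0
  countdown = countdown - 1

sums = []
for (x, y), z in [nested, ((4, 5), 6)]:   # an <lvalue> pattern used in a for loop
  sums.append(x + y + z)

print(order, countdown, sums)     # ascending 0 [6, 15]
```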

Scoping:

When a program starts running, all variables are in a scope (collection of variables) called the global scope. Other scopes can be created and manipulated as described in Functions and User-Defined Classes.

Objects & Types:

Data in Python is an object. Numbers are objects, strings are objects, and functions are objects. Statements like if, for, and while are not objects, since they are code, not data. Every object in Python has a type, a.k.a. class. The type of a number is int or float. The type of a string is str. The type of an object can be determined with type(obj). These types are also objects. Their type is called type. type's type is itself. Types (but not non-type objects) also have superclasses. These superclasses are also types, and specify additional data about the type (more info in Attributes). If a type has multiple superclasses, this algorithm is used to determine priorities for them in certain situations.
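These claims can be checked interactively. The sketch below only uses built-ins. The __mro__ attribute shown at the end is how a type exposes its superclasses in priority order.

```python
# Inspecting types with type(). The example values are arbitrary.
print(type(5))          # <class 'int'>
print(type(5.0))        # <class 'float'>
print(type("five"))     # <class 'str'>
print(type(len))        # functions are objects with a type too

print(type(int))        # <class 'type'>: types are objects as well
print(type(type))       # <class 'type'>: type's type is itself

# Superclasses of a type, in priority order (bool is a subclass of int).
print(bool.__mro__)     # (<class 'bool'>, <class 'int'>, <class 'object'>)
```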

Attributes:

Objects also have attributes. These attributes are named. Their names start with a character a-z, A-Z, or _. The characters after the first can be a-z, A-Z, _, or a digit. For example, an attribute could be called an_attribute, AbC123, or __add__, but not 123AbC (the first character cannot be a digit). These attributes can be accessed with the_object.attribute_name. When an attribute is accessed, the object itself is checked first. If the attribute is not found there, the object's class (a.k.a. type) is checked. If the attribute is not present there, the superclasses of the type are checked next, in order by priority. If none of the superclasses have the attribute, an exception is raised (see Exceptions).
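A minimal sketch of that lookup order follows. The classes Base and Child and all attribute names are hypothetical.

```python
# Base, Child and the attribute names are hypothetical.
class Base():
  from_base = "found on the superclass"

class Child(Base):
  from_class = "found on the class"

obj = Child()
obj.from_instance = "found on the object itself"

print(obj.from_instance)   # 1. the object itself
print(obj.from_class)      # 2. the object's class
print(obj.from_base)       # 3. the superclasses, by priority

try:
  obj.missing              # 4. not found anywhere, so an exception is raised
except AttributeError as error:
  lookup_error = error
  print("lookup failed:", error)
```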

Attributes also have values, which are the values returned when an attribute is accessed. These values can be any object. There are also descriptors, which allow code to be run when an attribute is accessed. They are described in more detail in the Descriptors section.

Some attribute names are special. For example, an attribute named __add__ specifies how objects are added together with +. This will be described in more detail in the User-Defined Classes section.

Functions:

Functions are snippets of Python code, and are also objects. They can be called with the_function_object(argument_1, argument_2, ...). When a function is called, the code that called it is paused, the passed arguments are placed in variables, and the function's code is run. When the function executes a return statement, the function's code stops, and the returned value is sent back to the calling code. If the function reaches the last line without returning, it returns None, an object representing no data.

For example,

def a_function(number):
  # when a_function(value) is called, the calling code is paused
  # then a new scope is created,
  # and a variable called number is created with the argument in it
  # the function's code is then run in the new scope
  
  result = number + 1 # result is created in this function's scope
  return result # the function's code stops here, and result is sent to the calling code.

function_result = a_function(41) # calling a_function, where number is set to 41
                                # function_result is set to the value returned from a_function

print(function_result)          # print is a function too! it returns None

When run, this code takes the following path:

1:

def a_function(number):
  # all that code above

This line (and the code in the def) creates a function object, then puts it in a variable called a_function.

2:

function_result = a_function(41)

This line calls a_function with the argument 41. We'll come back to this line once the function returns.

3:

def a_function(number):

The function is called, so a new scope is created. Here, the argument 41 corresponds to the variable number, so the variable number in the new scope is set to 41.

4:

result = number + 1

This line calculates 41 (the value of the number variable) + 1, which equals 42. The result variable is then set to 42.

5:

return result

This line ends the function, returning the value of result (42). You don't need to return a variable, so this function could have just returned number + 1 instead.
This also goes back to the scope that existed before the function was called (the global scope).

6:

function_result = a_function(41)

Now that the function has returned, function_result is set to the value it returned, which is 42. Note that this variable is created in the global scope, not in a_function's scope.

7:

print(function_result)

This outputs the value of function_result, or 42.
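Two points from this section are easy to check directly: a function that reaches its last line returns None, and variables created inside a function do not appear in the global scope. The function names below are hypothetical.

```python
def no_return(number):
  doubled = number * 2     # doubled exists only in this call's scope
  # no return statement: reaching the end returns None

def add_one(number):
  return number + 1

nothing = no_return(21)
print(nothing)             # None

also_add_one = add_one     # no parentheses: the function object gets a second name, it is not called
print(also_add_one(41))    # 42

try:
  doubled                  # the function's variables are not in the global scope
except NameError:
  doubled_is_global = False
  print("doubled is not defined out here")
```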

Generators, Iterators, and Iterables

Generators are one of my favorite features in Python. They are easiest to understand with an example, so here is one:

def a_generator_func():
  print("about to yield first value")
  yield "first value"
  print("just yielded first value")
  
  print("about to yield second value")
  yield "second_value"
  print("done yielding, there is no third value")

a_generator = a_generator_func()

print("about to get first value")
print("the first value is", repr(next(a_generator)))
print("about to get second value")
print("the second value is", repr(next(a_generator)))
print("about to get third value")
print("the third value is", repr(next(a_generator)))

This code will output

about to get first value
about to yield first value
the first value is 'first value'
about to get second value
just yielded first value
about to yield second value
the second value is 'second_value'
about to get third value
done yielding, there is no third value
Traceback (most recent call last):
  File "python", line 17, in <module>
StopIteration

a_generator_func is a generator function. Calling it does not run its code right away. Instead, it returns a generator object representing the paused state of the function. Calling next on that object runs the code until it yields a value, then pauses it again.
When a generator runs out of things to yield, a StopIteration exception (TODO: section on exceptions) is raised. This can be caught with:

try:
  next_val = next(a_generator)
except StopIteration:
  print("the generator ended")

Generators can also be used with for:

for elem in a_generator_func():
  print("elem:", elem)

This would output

about to yield first value
elem: first value
just yielded first value
about to yield second value
elem: second_value
done yielding, there is no third value

Unlike calling next by hand, this does not raise StopIteration (for handles it automatically).
Generators are a special kind of iterator. Iterators are objects that can have next called on them and can be used with for. Lists and tuples are not iterators but iterables, which can be converted to iterators by calling iter on them (for does this automatically).
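The difference between an iterable and an iterator can be seen with a plain list. The sketch below also spells out roughly what a for loop does with iter and next. The variable names are made up for the example.

```python
letters = ["a", "b"]          # a list is an iterable, not an iterator

try:
  next(letters)               # next() only works on iterators
except TypeError:
  print("a list cannot be passed to next() directly")

letters_iter = iter(letters)  # iter() produces an iterator over the list
first = next(letters_iter)
second = next(letters_iter)
print(first, second)          # a b

try:
  next(letters_iter)          # the iterator has run out of elements
except StopIteration:
  exhausted = True
  print("no more elements")

# Roughly what "for elem in letters" does behind the scenes:
collected = []
loop_iter = iter(letters)
while True:
  try:
    elem = next(loop_iter)
  except StopIteration:
    break
  collected.append(elem)
```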

User-Defined Classes

There are many online tutorials about creating classes, most of which are better than what I can write. So before you read this, go look at one of those tutorials (but keep in mind that everything in Python is an object).

So, now that you have read one of those tutorials (go read one if you haven't), consider the following example class:

class Point():
  def __init__(self, x_pos, y_pos):
    self.x_pos = x_pos
    self.y_pos = y_pos
  
  def print(not_self):
    print(not_self.x_pos, not_self.y_pos)
   
  def __add__(tomayto, tomahto):
    return Point(tomayto.x_pos + tomahto.x_pos, tomayto.y_pos + tomahto.y_pos)

a = Point(1, 2)
b = Point(3, 4)
a.print()
b.print()
(a + b).print()

This code creates a Point class, and 2 instances of it (a and b). It then calls .print on both instances and on a + b. The only peculiarity you might notice is that it doesn't use self, but different argument names. Despite what most tutorials would have you believe, there is nothing special or magic about self. You can replace it with any other name and your code will work fine. If your tutorial didn't cover __add__ and similar functions, they basically define what happens when a + b is executed. More information can be found here.
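To make the role of __add__ concrete, the sketch below writes the same addition twice: once with + and once as an explicit call to __add__ on the class. Vec is a hypothetical stand-in for the Point class above.

```python
# Vec is a hypothetical stand-in for the Point class above.
class Vec():
  def __init__(this, x_pos, y_pos):      # "this" instead of self works too
    this.x_pos = x_pos
    this.y_pos = y_pos

  def __add__(left, right):
    return Vec(left.x_pos + right.x_pos, left.y_pos + right.y_pos)

a = Vec(1, 2)
b = Vec(3, 4)

with_operator = a + b                  # uses Vec.__add__ behind the scenes
with_method = Vec.__add__(a, b)        # the same call, written out explicitly

print(with_operator.x_pos, with_operator.y_pos)   # 4 6
print(with_method.x_pos, with_method.y_pos)       # 4 6
```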

The only "magic" thing left in this code is how the self/not_self/tomayto arguments actually get added; a.print() never specifies a value for not_self. This is covered in the next section.

Descriptors

Descriptors are a way to make code run when an attribute is accessed. They are objects with __get__ or __set__ special methods. When a descriptor is found as an attribute of a class (not stored directly on an instance) during attribute access, its __get__ method is called with the instance and the class the descriptor was found on, and the result is used as the attribute's value. When attributes are set, a similar thing is done with __set__. This is how functions automatically add a self argument: functions are descriptors, so accessing one through an instance returns a "bound" version of the function that has the instance (self) already added as the first argument. A shorthand way to create descriptors is:

class Test():
  @property
  def x(self):
    print("getting x")

  @x.setter
  def x(self, value):
    print("setting x")

a = Test()
a.x
a.x = 1

The @property and @x.setter lines are decorators, described in the next section.
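For comparison, here is a sketch of a descriptor written by hand, without @property, followed by a demonstration that plain functions have a __get__ method that produces the bound version. The names Logged, Example, and _x are hypothetical.

```python
# Logged, Example and _x are hypothetical names for this sketch.
class Logged():
  def __get__(self, instance, owner):
    # instance is the object the attribute was read from (None if read from the class)
    # owner is the class the descriptor was found on
    print("getting x")
    return instance._x

  def __set__(self, instance, value):
    print("setting x")
    instance._x = value

class Example():
  x = Logged()             # the descriptor lives on the class, not on instances

  def method(self):
    return "called with " + type(self).__name__

e = Example()
e.x = 1                    # runs Logged.__set__
value = e.x                # runs Logged.__get__

# Functions are descriptors too: their __get__ returns a bound method.
unbound = Example.__dict__["method"]       # the plain function object
bound = unbound.__get__(e, Example)        # what e.method does automatically
print(bound())                             # called with Example
```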

Decorators

Consider the following code:

def function_making_function(old_function):
  print("making a new function from", old_function)
  def new_function():
    print("calling the old function")
    old_function()
    print("done calling the old function")
  print("made the new function")
  return new_function

This code is relatively simple, except for the fact that a function is defined inside another function. This gives the inner function access to all the variables in the outer function.

Now, running

def some_function():
  print("in some_function")

print("starting to make a new function")
new_function = function_making_function(some_function) # not calling it here, just moving it around (remember, functions are objects!)
print("done making the new function")

print("calling the new function")
new_function()
print("done")

will output:

starting to make a new function
making a new function from <function some_function at 0xhesyfyua>
made the new function
done making the new function
calling the new function
calling the old function
in some_function
done calling the old function
done

The code

print("starting to make a new function")
@function_making_function
def some_function():
  print("in some_function")
print("done making the new function")

print("calling the new function")
some_function()
print("done")

does the exact same thing as the other code above. @function_making_function is just shorthand for some_function = function_making_function(some_function) (but it must be before the def). This is called a decorator. Multiple decorators can be used on the same function, and are applied bottom-to-top. Decorators do not have to return functions; they can also return other objects (such as descriptors in the case of @property).
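The bottom-to-top order matters when the decorators do different things. In this sketch, shout and add_world are hypothetical decorators chosen so that the order is visible in the result.

```python
# shout and add_world are hypothetical decorators for this sketch.
def shout(old_function):
  def new_function():
    return old_function().upper()
  return new_function

def add_world(old_function):
  def new_function():
    return old_function() + " world"
  return new_function

@shout        # applied second
@add_world    # applied first (closest to the def)
def greet():
  return "hello"

# The same thing without decorator syntax:
def plain_greet():
  return "hello"
plain_greet = shout(add_world(plain_greet))

print(greet())         # HELLO WORLD
print(plain_greet())   # HELLO WORLD
```

If the order were reversed, " world" would be added after upper-casing and the result would be "HELLO world" instead.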

@property_name.setter works because the descriptor object that @property returns has a setter() method. Calling setter() with a function returns a new property whose __set__ method calls that function (it does not store the value as an instance attribute, since then the function binding descriptor wouldn't work).
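The setter() step can be tried without decorator syntax by building the property objects directly. The functions get_x and set_x are hypothetical. fget and fset are the attributes where a property keeps its getter and setter functions.

```python
# get_x and set_x are hypothetical functions for this sketch.
def get_x(self):
  return "got x"

def set_x(self, value):
  print("setting x to", value)

getter_only = property(get_x)             # what @property produces
with_setter = getter_only.setter(set_x)   # what @x.setter produces

print(getter_only.fset)                   # None: no setter yet
print(with_setter.fset is set_x)          # True
print(with_setter.fget is get_x)          # True: the getter is carried over
print(with_setter is getter_only)         # False: setter() returns a new property
```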

Those are most of the features I regularly use in Python. There are a few simplifications and some things are left out, but for the most part this should be accurate. However, I left out a lot of information about the standard library (built-in functions and modules). Information about them can be found here and here.

Internals

This section details how a certain Python interpreter (CPython) actually runs Python code. Most things discussed here don't apply to other interpreters, although some concepts might be the same.

Processing order

When Python code is run, it is lexed, parsed, converted to bytecode, and then run.

Lexing

Lexing, or tokenizing, is the process of breaking code up into groups of symbols so it is easier to process later. A simple example of lexing (not with Python code) could look like:

1 + 2*(3 +4- 5)

->

NUM 1
OP +
NUM 2
OP *
OPEN_PAREN
NUM 3
OP +
NUM 4
OP -
NUM 5
CLOSE_PAREN

Note that this takes care of unused whitespace and other things that might complicate the process later on.
TODO: explain python lex
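Python's standard library includes a tokenize module that can be used to look at tokens for real Python source. As a sketch, here is the arithmetic example from above run through it. The token names it prints (NUMBER, OP, and so on) differ from the simplified names used in this section, and this module is a convenient window into tokenizing rather than a description of CPython's internal lexer.

```python
import io
import tokenize

source = "1 + 2*(3 +4- 5)\n"

tokens = []
for tok in tokenize.generate_tokens(io.StringIO(source).readline):
  # tok.type is a number; tokenize.tok_name maps it to a readable name
  tokens.append((tokenize.tok_name[tok.type], tok.string))

for name, text in tokens:
  print(name, repr(text))
```

Note that the uneven spacing in the source does not show up in the token list.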

Parsing

Parsing converts tokenized code into a tree structure that is easier for code to process. For example, using the arithmetic expression from earlier:

NUM 1
OP +
...
NUM 5
CLOSE_PAREN
->
ADD(
  1,
  MUL(
    2,
    SUB(
      ADD(3,4),
      5
    )
  )
)

This takes care of things like parentheses that don't really affect how code actually runs.
TODO: explain python parse
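The standard library's ast module exposes the tree that Python builds for a piece of source code. As a sketch, parsing the same arithmetic expression gives a tree with the same shape as the ADD/MUL/SUB example above, although the node names are different (BinOp nodes with Add, Mult, and Sub operators). The variable names are made up for this example.

```python
import ast

tree = ast.parse("1 + 2*(3 +4- 5)", mode="eval")
print(ast.dump(tree.body))      # the whole tree as text

top = tree.body                 # the outermost operation: 1 + (...)
product = top.right             # 2 * (...)
difference = product.right      # (3 + 4) - 5
inner_sum = difference.left     # 3 + 4

print(type(top.op).__name__)         # Add
print(type(product.op).__name__)     # Mult
print(type(difference.op).__name__)  # Sub
print(type(inner_sum.op).__name__)   # Add
```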

Bytecode: What is it?

Bytecode is an intermediate language, which means that code is translated to it before being run. For example, again with the arithmetic expression:

ADD(1, MUL(...))

->

PUSH 1
PUSH 2
PUSH 3
PUSH 4
ADD // 3 + 4
PUSH 5
SUB // 3 + 4 - 5
MUL // 2 * (3 + 4 - 5)
ADD // 1 + 2 * (3 + 4 - 5)

That code is not the actual bytecode, just a simplified, readable version. Real bytecode is represented as a sequence of bytes, as in the following example (bytes shown in hexadecimal, with the stack contents after each instruction):

PUSH 1: 00 01 // the stack is currently 1
PUSH 2: 00 02 // the stack is currently 1 2
PUSH 3: 00 03 // the stack is currently 1 2 3
PUSH 4: 00 04 // the stack is currently 1 2 3 4
ADD:    01    // the stack is currently 1 2 7
PUSH 5: 00 05 // the stack is currently 1 2 7 5
SUB:    02    // the stack is currently 1 2 2
MUL:    03    // the stack is currently 1 4
ADD:    01    // the stack is currently 5

Here, the opcode for PUSH is 00 (with 1 parameter), ADD is 01, etc.
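To show how a stack-based bytecode like this gets executed, here is a minimal sketch of an interpreter for the toy instruction set (not for real Python bytecode). The opcode numbers follow the toy example (PUSH is 00, ADD is 01, SUB is 02, MUL is 03). The run function and the program bytes are written for this illustration and encode 1 + 2 * (3 + 4 - 5).

```python
# Toy opcodes from the simplified example, not real Python bytecode.
PUSH, ADD, SUB, MUL = 0x00, 0x01, 0x02, 0x03

def run(bytecode):
  stack = []
  position = 0
  while position < len(bytecode):
    opcode = bytecode[position]
    if opcode == PUSH:
      stack.append(bytecode[position + 1])   # the parameter is the next byte
      position += 2
      continue
    if opcode not in (ADD, SUB, MUL):
      raise ValueError("unknown opcode")
    right = stack.pop()                      # top of the stack
    left = stack.pop()                       # element below it
    if opcode == ADD:
      stack.append(left + right)
    elif opcode == SUB:
      stack.append(left - right)
    else:
      stack.append(left * right)
    position += 1
  return stack

# 1 + 2 * (3 + 4 - 5)
program = bytes([0x00, 1, 0x00, 2, 0x00, 3, 0x00, 4, 0x01, 0x00, 5, 0x02, 0x03, 0x01])
print(run(program))   # [5]
```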
The bytecode for Python is complicated and has many operations (detailed here), so I will not fully document it here. It is stack-based, which means most of its instructions operate on a data stack, similar to the arithmetic example above. This stack contains Python objects. There is also a set of slots, which are used to store variables, constants, names, etc.

Here are some common instructions:

POP_TOP: remove the top element of the stack (TOS)

BINARY_ADD: pop the TOS and the element below the TOS (TOS1), add them, and push the result back onto the stack
BINARY_SUBTRACT: see above (-)
BINARY_MULTIPLY: see above (*)
BINARY_FLOOR_DIVIDE: see above (//)
BINARY_TRUE_DIVIDE: see above (/)

STORE_FAST(slot_num): pop the TOS and store to a slot
LOAD_FAST(slot_num): read a slot and push to the stack

LOAD_CONST(slot_num): load a constant and push it

CALL_FUNCTION(arg_count): pop <arg_count> arguments and the function below them, call the function with those arguments, and push the result
RETURN_VALUE: return from a function

JUMP_ABSOLUTE(where): jump to the <where>th bytecode instruction
POP_JUMP_IF_TRUE(where): pop the TOS, if true, jump to <where>
POP_JUMP_IF_FALSE: ...

GET_ITER: pop the TOS, convert to an iterator, and push it
FOR_ITER(delta): use the iterator in TOS (w/o popping) to get the next element and push it. If there is no next element, jump ahead by <delta> instructions

UNPACK_SEQUENCE(unpack_len): pop an iterable from the TOS and push its elements (there should be <unpack_len>) to the stack
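The standard library's dis module lists the instructions CPython actually generated for a function. The sketch below uses a hypothetical add function. The exact instruction names depend on the CPython version. For example, newer versions use a general BINARY_OP instruction where this section lists BINARY_ADD, so the output may not match the names above exactly.

```python
import dis

# add is a hypothetical function to disassemble.
def add(a, b):
  return a + b

names = [instruction.opname for instruction in dis.get_instructions(add)]
print(names)   # instruction names vary between CPython versions

dis.dis(add)   # the full human-readable listing
```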

Bytecode: How is it made?

In Python, bytecode for arithmetic expressions is implemented similarly to the example with the arithmetic expressions. For example,

a + b: <bytecode for a> <bytecode for b> BINARY_ADD
a - b: <bytecode for a> <bytecode for b> BINARY_SUBTRACT
a * b: <a> <b> BINARY_MULTIPLY
a / b: ...
a // b: ...
a(b, c, d): <a> <b> <c> <d> CALL_FUNCTION(3)
...

Bytecode for an <expression> line looks like <bytecode for <expression>> POP_TOP

Bytecode for a <variable> = <expression> line in the global scope looks like <bytecode for <expression>> STORE_NAME(<variable>)

Bytecode for a <variable> = <expression> line in the local scope looks like <bytecode for <expression>> STORE_FAST(<variable slot>)
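The difference between the global and local forms of assignment can be observed with dis. This is a sketch with made-up names. Some CPython versions combine adjacent instructions, so the check below only looks for STORE_FAST as part of an instruction name.

```python
import dis

# A module-level (global scope) assignment, compiled from a string.
module_code = compile("x = 1", "<example>", "exec")

def local_assignment():
  x = 1                    # the same assignment, in a function's local scope

global_ops = [i.opname for i in dis.get_instructions(module_code)]
local_ops = [i.opname for i in dis.get_instructions(local_assignment)]

print("STORE_NAME" in global_ops)                        # True
print(any("STORE_FAST" in name for name in local_ops))   # True
```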

Bytecode for a <var1>, <var2>, <var3> = <expression> line looks like <bytecode for <expression>> UNPACK_SEQUENCE(3) <assign to var1, global or local> <assign to var2> <assign to var3>

Bytecode for a (<var1>, <var2>), <var3> = <expression> line looks like <bytecode for <expression>> UNPACK_SEQUENCE(2) UNPACK_SEQUENCE(2) <assign to var1> <assign to var2> <assign to var3>

Bytecode for an if <expr>: <code> line looks like <bytecode for <expr>> POP_JUMP_IF_FALSE(END) <bytecode for <code>> LABEL(END) (this will sometimes be optimized)

Bytecode for an if-elif-else statement looks similar, but more complicated

Bytecode for a while <expr>: <code> line looks like LABEL(START) <bytecode for <expr>> POP_JUMP_IF_FALSE(END) <bytecode for <code>> JUMP_ABSOLUTE(START) LABEL(END)

Bytecode for a for <lvalue> in <expr>: <code> line looks like <bytecode for <expr>> GET_ITER LABEL(START) FOR_ITER(END) <assign to <lvalue>> <bytecode for <code>> JUMP_ABSOLUTE(START) LABEL(END)
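The for loop pattern can be checked the same way. This sketch disassembles a hypothetical total function and looks for the two iterator instructions. The jump instructions around them have different names in different CPython versions, so they are not checked here.

```python
import dis

# total is a hypothetical function containing a for loop.
def total(values):
  result = 0
  for value in values:
    result = result + value
  return result

loop_ops = [i.opname for i in dis.get_instructions(total)]

print("GET_ITER" in loop_ops)   # True: the iterable is converted to an iterator once
print("FOR_ITER" in loop_ops)   # True: each pass asks the iterator for the next element
print(total([1, 2, 3]))         # 6
```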

TODO: def, class

Python applies some rudimentary optimizations, like converting JUMP_ABSOLUTEs to more efficient JUMP instructions.
