1.6. Extremely Brief Intro to Jupyter and Python#

1.6.1. Getting Started in Jupyter#

Here is a link to the jupyter-intro.ipynb file used in this section:

https://raw.githubusercontent.com/jmshea/Foundations-of-Data-Science-with-Python/main/01-intro/jupyter-intro.ipynb

If your browser displays the notebook as text, you will need to tell it to save it as a file. You can usually do this by right-clicking or control-clicking in the browser window and choosing to save the page as a file. For instance, in Safari 14, choose the “Save Page As…” menu item. Be sure to name your file with a .ipynb ending.

Hint: If your file was saved to your default Downloads folder, be sure to move it to an appropriate folder in your data-science folder to keep things organized!

1.6.1.1. Intro to Markdown in Jupyter#

The example notebook jupyter-intro.ipynb demonstrates the main features of Markdown that are used in the book. This section provides a more detailed description than could be included in the book itself:

1. Headings

The first cell (starting with “Intro to Markdown”) illustrates how to use Markdown to format headings. Headings are entered by prefacing text with one or more # (usually pronounced “hash”) symbols followed by a space. If you know any HTML, then # Title is equivalent to <h1>Title</h1>, ## Subtitle is equivalent to <h2>Subtitle</h2>, etc. Note that you must put a space between the hashes and the text for it to be recognized as a heading.

2. Text and paragraphs

The second cell shows paragraphs of text. Regular text can be entered as normal. Blank lines between text indicate the start of a new paragraph; press Enter twice at the end of a paragraph to leave such a blank line. Other line breaks do not create a new paragraph and can be used to avoid over-long lines of text or to separate sentences so they can be more easily reordered in the future.

3. Emphasis

The third cell (labeled “Emphasis”) illustrates how to make text italicized or bold. Double click on the cell (to view the source text).

  • Text can be italicized by surrounding it with single asterisks, like *italicize text*: italicize text.

  • Similarly, text can be bolded by surrounding it with double asterisks, like **bold text**: bold text.

  • Text can be both italicized and bolded by surrounding it with triple asterisks, like ***bold and italics***: bold and italics.

4. Bulleted and Numbered Lists

The fourth cell illustrates how to make bulleted and numbered lists.

Bulleted lists can be created by starting a line with an asterisk and a space, followed by the text of the bulleted item. Bulleted sub-lists can be created by tab-indenting a bulleted list under a bulleted item. For example,

* Item 1
    * Item 1.1
    * Item 1.2
    
* Item 2

formats as

  • Item 1

    • Item 1.1

    • Item 1.2

  • Item 2

Numbered lists can be created by starting a line with a number, a period, and a space, followed by the item to be numbered. Note that the first number you use in a sequence of numbered items will determine the starting point for numbering, but later values are ignored. It is a good habit to use 1 for every numbered item because then you can rearrange the items via cut-and-paste without worrying about upsetting the overall list numbering. Numbered sub-lists can be created by tab-indenting a numbered list under a numbered item. For example,

1. Item 1
    1. Item 1.1
    1. Item 1.2
3. Item 2

formats as

  1. Item 1

    1. Item 1.1

    2. Item 1.2

  2. Item 2

5. Links and images

The fifth cell gives an example of a link and an image.

Links are easily created by putting the link text in square brackets, followed by the link URL in parentheses, like:

[Example link](http://google.com)

which is rendered as

Example link

Images are entered in a very similar way to links. Just put an exclamation point (!) before the square brackets. The text in the square brackets will be used as the “alt text”, which is used by screen readers and is displayed if the image cannot be loaded. Images can also be inserted by dragging them from your file manager into a cell. After the Markdown code is inserted, you can edit the alt text to something more meaningful.

6. Mathematics

The sixth cell (labeled “Mathematics”) illustrates formatted mathematics. Markdown in Jupyter supports sophisticated formatting of mathematics using LaTeX (pronounced lay tek) notation. For inline equations (those that will appear in-line with text), include the LaTeX notation between dollar signs. For example, $\sin^2 x$ renders as \(\sin^2 x\). Longer equations should be displayed on lines separate from the text and can be created by enclosing them between pairs of dollar signs. For example,

$$
\int_{0}^{\infty} e^{-x}~dx = 1
$$

is rendered as

\[ \int_{0}^{\infty} e^{-x}~dx = 1 \]

7. Other Markdown formatting

Markdown has many other features that we will not cover, such as horizontal rules, block quotes, syntax highlighting, and tables. A good reference for Markdown syntax is the Markdown Guide (https://www.markdownguide.org/extended-syntax/).

1.6.1.2. Jupyter Magics#

Code cells can also contain special instructions intended for JupyterLab itself, rather than the Python kernel. These are called magics. For instance, to output your current directory, you can use the %pwd magic:

%pwd

'~/data-science/chapter1'

You can use the %cd magic to change your directory (recall that ~ is a shortcut for your home user directory):

%cd ~
/Users/jshea

Changing directories is often useful to switch to a directory where data is stored.

You can get a list of magics and other information about Jupyter using the %quickref magic. The output of %quickref is long, so I am only including a screenshot of the top of its output.

%quickref

Top of output for %quickref

1.6.2. Getting Started in Python#

Python is an interpreted language, which means that when any Code cell in a Jupyter notebook is evaluated, the Python code will be executed. Any output or error messages will appear in a new output portion of the cell that will appear just after the input portion of the cell (that contains the Python code).

At the bottom of the jupyter-intro.ipynb notebook, there is an empty cell where you can start entering Python code. If there is not already an empty cell there, click on the last cell and press Alt-Enter.

1.6.2.1. Hello World!#

Let’s start with the canonical example that appears in almost every programming book, which is to write “Hello World!” to the screen/output.

To print output in Python, use the print() function, which is a built-in function in Python 3. Here a function is used to denote a named set of Python instructions that can be called on demand. Python functions are called by using the function name with parentheses after it. Functions can accept arguments, which are variables and values that the function can act on. The arguments for a function are put inside the parentheses, with commas separating different items.

To output text using print(), put it as an argument inside single or double quotation marks. I will generally use single quotation marks because they do not require using the shift key to type.

print('Hello World!')
Hello World!

Python’s print() function knows how to handle most standard data types. Unlike printf in C, explicit formatting instructions do not have to be given – just pass the thing to be printed as an argument to print:

print(32)
32
print([1, 4, 9, 16])
[1, 4, 9, 16]

To print multiple items, just pass all the items as arguments to the print() function, but separate the items by commas:

print('Hello', 'world!', 10 + 5, [1, 4, 9, 16])
Hello world! 15 [1, 4, 9, 16]

Code cells may contain multiple Python statements. All statements in a cell will be run sequentially when the cell is run. The results of every print statement will be shown after the input part of the cell:

print('Hello ')
print('World!')
Hello 
World!

Some Python statements return results, and these will appear in a special output part of the cell:

5 + 10
15

However, if a cell contains multiple statements, the output will be the result of the last statement:

5 + 10
5 * 10
50

The last statement in a cell may produce no output. Then there is no output from the cell. Note that printed items are not outputs:

5 + 10
print('hello')
hello

There are several different ways to get the output of multiple statements. One easy way is just to print all the results you want to see:

print(5 + 10)
print(5 * 10)
15
50

Sometimes you want to run a command that produces an output but not have the output appear after the code cell. The most typical case is plotting with Matplotlib: many of its commands modify the plot but also return a value, which appears as the output of the code cell. To suppress this output, append a semicolon (;) to the last statement in the cell:

print(3 + 5)
5 + 7;
8

I will introduce some more powerful features of the print() function after introducing other important features of the Python language.

1.6.2.2. Comments#

Comments are text in your Python code that is ignored by the Python interpreter. Comments are usually used to document what you intend for your code to be doing, which is not always what it is actually doing! This documentation is both for others who may need to read your code and for your future self, who may need help remembering what you meant for a block of code to do.

Anything that follows a # (hash) symbol is a comment:

# It is important to use comments to document your thinking on big assignments

There is not really a multi-line comment in Python (like /* */ in C). One way to make a multi-line comment is to make a multi-line string that is not assigned to any variable. Multi-line strings are delimited by triple-ticks (''') or triple-quotes ("""). Here is an example with triple-quotes:

""" This is a
multi-line comment,
okay?"""
print('The multi-line string above will not produce any output')
print('because it is not the last command in this cell.')
The multi-line string above will not produce any output
because it is not the last command in this cell.

When making custom functions, a multi-line string directly after the function definition serves as the documentation string (docstring) for that function.

1.6.2.3. Python Variables and Types#

Variables are named entities that store values for future reference. In Python, variable names consist of alphanumeric characters (a-z, A-Z, 0-9) and underscores, but they cannot start with a number. Variable names are case-sensitive. Detailed descriptions of naming conventions for Python are given in PEP 8 – Style Guide for Python Code. Generally, we will follow these conventions:

  • Names should be descriptive, except when used as a simple index.

  • Variable and function names should be lowercase, with underscores separating different words.

  • Variables should not duplicate the names of modules or Python functions/methods.

Variables are created by assigning a value to a valid variable name:

x = 10

Python uses implicit typing, which means that you do not have to define a variable’s type (i.e., what type of data it holds). The type is determined by the interpreter at the time of assignment. You can determine a variable’s type by passing the variable as the argument of the type() function:

type(x)
int

The type of a variable can change whenever new data is assigned to it:

x = 10.5
type(x)
float
x = 'Hello World!'
type(x)
str

When a variable is passed as an argument to the print() function, the value of the variable is printed:

print(x)
Hello World!

Python knows how to perform many different operations on different data types and will usually do the right thing based on the type of the variable:

a = 3
b = 4
a + b
7
print(a * b)
12

We often want to combine some fixed text and some variable output. To do this, we will use f-strings, which were added to Python in version 3.6. An f-string is a special string that is created by prefixing the delimiter with the letter f. Any part of an f-string contained within curly braces {} will be evaluated before the string is used. Thus, if we wanted to write more about the product of a and b, we could do it like this:

print(f'The product of a and b is {a*b}')
The product of a and b is 12

We can have multiple substitutions in a single f-string:

print(f'The product of {a} and {b} is {a*b}')
The product of 3 and 4 is 12

Python’s f-strings also support formatting the output, and I will use a variety of different formatting techniques in this book to make the output look nice. The details of these approaches are beyond the scope of this book, but I will introduce one that occurs frequently. For a floating point variable, suppose you want to print only X digits after the decimal point, where X is a placeholder for some non-negative integer. Then, after the variable (but still inside the curly braces), put : .Xf. The space after the colon is optional; when present, it reserves a leading space for the sign of positive numbers. Here is an example where I set the variable s equal to 1/6 and print the variable to 3 decimal places:

s = 1/6
print(f's = {s : .3f}')
s =  0.167
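
As a small sketch of how the number of digits and the optional space change the result, the following cell formats the same value in three ways. The variable names two_places, five_places, and with_space are made up for this illustration:

```python
s = 1 / 6

two_places = f'{s:.2f}'    # two digits after the decimal point
five_places = f'{s:.5f}'   # five digits after the decimal point
with_space = f'{s: .3f}'   # the space reserves room for a sign

print(two_places)
print(five_places)
print(with_space)
```

Note that the value is rounded, not truncated, to the requested number of digits.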

Warning

Exponentiation is performed using two asterisks between the number or variable to be operated on and the exponent. The caret operator (^) is reserved for bitwise exclusive-or (XOR).

2 ** 3
8
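
To see why the warning matters, here is a minimal comparison of the two operators on the same inputs. The variable names power and xor are chosen only for this illustration:

```python
power = 2 ** 3   # exponentiation: 2 * 2 * 2
xor = 2 ^ 3      # bitwise XOR of binary 10 and binary 11, which is binary 01

print(power)
print(xor)
```

Because 2 ^ 3 runs without any error, using the caret by mistake can silently produce wrong results.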

We will often need to add to an existing value, and Python provides a shortcut notation for this (similar to C). The notation += indicates to add to the variable on the left-hand side of the expression. For instance, here is the usual way we write incrementing a counter:

counter = 0
counter += 1
print(counter)
1
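
The same shortcut exists for other arithmetic operators. The following sketch, which uses a made-up variable named total, applies three of them in sequence:

```python
total = 10
total -= 3   # same as total = total - 3, giving 7
total *= 2   # same as total = total * 2, giving 14
total /= 4   # same as total = total / 4, giving 3.5

print(total)
```

Note that the final value is a float because the / operator always produces a float in Python 3.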

Most operators in Python try to do the most logical operation for their input types. Therefore, if we try to add strings with +, the operator actually performs string concatenation:

a = 'Hello '
b = 'World'
print(a + b)
Hello World
a = 3
b = 4.1
print(a + b, type(a + b))
7.1 <class 'float'>

However, Python cannot read minds, and its rules may create unexpected results:

a = '3'
b = '4'
print(a + b)
print(int(a) + int(b))
34
7

Note that in the first case, the variables are strings, so the + operator performs string concatenation. This can be a subtle problem when we load data from files because items that look like numbers may actually be loaded as strings.
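
As a hedged illustration of that situation, suppose two entries have been read from a file as strings. The names raw_a and raw_b and their values are hypothetical. The built-in float() function converts a string that contains a decimal number:

```python
# Hypothetical values standing in for entries read from a text file
raw_a = '3'
raw_b = '4.5'

joined = raw_a + raw_b                # strings: concatenation
total = float(raw_a) + float(raw_b)   # numbers: addition

print(joined)
print(total)
```

Here int(raw_b) would raise a ValueError because '4.5' is not a valid integer string, so float() is the safer choice when values may contain a decimal point.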

The multiplication operator performs repetition on a string:

c = 5
print(a * c)
print(int(a) * c)
33333
15

1.6.2.4. Basic Data Types#

Python has many data types built-in, and we have already seen a few of these: int, float, and str (string). In this book, we will also use a few other of Python’s built-in types.

Boolean values are stored in bool variables that take on either True or False. Note that exact capitalization is required for these values.

a = True
print(a)
True

We can check if two variables have the same values using ==:

b = 2
c = 3

print(b == c)
False

The result of a comparison (using ==) can be stored in a variable (using =):

d = b == c
type(d)
bool

Not equals is written as !=. Other comparisons are written in the usual ways (<, >, <=, >=).

print(b != c, b <= c)
True True

Python has a variety of different sequence types for containing an ordered collection of values. Values can be retrieved by index from a sequence-type variable by putting the index number in square brackets after the variable name.

Warning

Like C, C++, and JavaScript, indexing in Python starts at 0. This means that in a sequence of \(n\) items, the first item is at index 0, and the last item is at index \(n-1\). MATLAB and many textbooks use indexing starting at 1.

The list type is a mutable container of values. Mutable means the values in the list can be changed. Lists are delimited by square brackets, and list items are separated by commas.

my_list = ['dogs', 'cats', 3, 7.0]
print(my_list)
['dogs', 'cats', 3, 7.0]
print(my_list[1])
cats

Because lists are mutable, values can be updated:

my_list[0] = 'puppies'
print(my_list)
['puppies', 'cats', 3, 7.0]
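
As a small sketch of zero-based indexing, the following cell uses the built-in len() function, which has not been introduced above, to get the number of items in a list. The list name letters is made up for this illustration:

```python
letters = ['a', 'b', 'c', 'd']

n = len(letters)        # number of items in the list: 4
first = letters[0]      # the first item is at index 0
last = letters[n - 1]   # the last item is at index n-1, here 3

print(n, first, last)
```

Asking for letters[n] would raise an IndexError because index n is one past the end of the list.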

A tuple is an immutable sequence type. The values in a tuple cannot be changed. Tuples are often used to contain multiple values returned from a function. Tuples are delimited by parentheses: (). To create a tuple containing only one value, include a comma after the value. As with lists, tuples can contain a variety of values:

tuple1 = (1, 4, 'nine')
tuple2 = (16,)
print(tuple1[2])
nine

Trying to change a value in an immutable type results in an error:

tuple1[0] = 'one'
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[43], line 1
----> 1 tuple1[0] = 'one'

TypeError: 'tuple' object does not support item assignment

The range type is an immutable sequence of numbers that is usually used in looping (especially for loops). A range object can be created using the range() function in the following ways:

  • range(stop) creates a sequence of stop values starting at 0. Thus, the values are 0, 1, 2, \(\ldots,\) stop-1. Note that the Python convention is that the stop value is not included in the range.

  • range(start, stop) creates a sequence of values that starts at start and ends at stop-1.

  • range(start, stop, step) creates a sequence of values that starts at start, increments by step, and ends with the last value before stop is reached (for a positive step).

You can iterate over ranges using a for...in statement, which is discussed in the next section.
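
One quick way to inspect the values in a range is to convert it to a list with the built-in list() function. This is a sketch for checking your understanding rather than something used in the text above, and the variable names are illustrative:

```python
first_five = list(range(5))         # stop only
two_to_five = list(range(2, 6))     # start and stop
by_threes = list(range(1, 10, 3))   # start, stop, and step

print(first_five)
print(two_to_five)
print(by_threes)
```

The last range contains 1, 4, and 7: the next value would be 10, which is excluded because it is not less than the stop value.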

Python also contains one mapping type:

  • dict is a dictionary object that provides a map between key-value pairs. As with lists and tuples, values can be of any data type. Keys can be of many data types (such as numbers, strings, and tuples), but they cannot be mutable types such as lists, and they must be unique. Dictionaries are delimited by curly braces: { }. Each entry in a dictionary is written in the form key:value, and different key-value pairs are separated by commas. The value for a particular key can be retrieved by putting the key in square brackets after the dictionary variable’s name.

squares = {1: 1, 2: 4, 3: 9, 4: 16}
print(squares[3])
9
misc = {'cats': 0, 'dogs': 1, 3: 'test'}
print(misc['cats'])
print(misc[3])
0
test
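
Dictionaries are mutable, so the same square-bracket notation can be used to add or update entries. The following is a minimal sketch with a made-up dictionary named counts. It also uses the in operator to check whether a key is present:

```python
counts = {'cats': 0, 'dogs': 1}

counts['birds'] = 5   # a new key adds a new entry
counts['cats'] = 2    # an existing key has its value replaced

has_fish = 'fish' in counts   # checks the keys, not the values

print(counts)
print(has_fish)
```

Trying to retrieve counts['fish'] would raise a KeyError, so checking with in first can be useful when a key may be missing.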

Python variables are much more powerful than variables in many languages because they are actually objects. Python is an object-oriented programming (OOP) language. To make the content of this book more accessible to people with a wide range of programming experience, this book does not generally use an object-oriented approach. However, you will need to know some fundamentals about objects and classes:

  • Objects are special data types that have methods associated with them to work on those objects. Methods are similar to functions, except they are specialized to the objects to which they belong. Methods are called by giving the variable/object name, adding a period, specifying the method name, and then adding parentheses, with any arguments provided in parentheses.

  • A class is a template for an object that defines how an object stores information and defines its methods.

For instance, since list is a mutable type, a list can be sorted in place. The list object defines a sorting method to achieve this:

new_list = [5, 7, 1, 3, 13]
new_list.sort()
print(new_list)
[1, 3, 5, 7, 13]

For dictionaries, we often need to retrieve the keys or values. We can do this using methods provided by the dictionary type:

misc = {'cats': 0, 'dogs': 1, 3: 'test'}
print(misc.keys())
dict_keys(['cats', 'dogs', 3])
print(misc.values())
dict_values([0, 1, 'test'])

The keys and values methods return special objects of type dict_keys and dict_values, respectively, but for our purposes, we can treat these like lists. An example is shown in Section 1.6.2.7 (Loops and Conditionals).
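
One caveat is that these special objects do not support indexing with square brackets. If you need a true list, a simple approach is to pass the object to the built-in list() function, as in this sketch. The names key_list and value_list are illustrative:

```python
misc = {'cats': 0, 'dogs': 1, 3: 'test'}

key_list = list(misc.keys())       # a regular list of the keys
value_list = list(misc.values())   # a regular list of the values

print(key_list[0])
print(value_list[2])
```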

1.6.2.5. Copying Variables#

Warning

You have to be careful when copying variables in Python, or you may get some unexpected results.

Suppose you have a variable c that contains a list, and you want to copy it to a variable d:

c = [0, 1]
d = c
print(d)
[0, 1]

Everything works as expected. Now, let’s change the value in position 1 in d and print out both c and d:

d[1] = 2
print('d=', d)
print('c=', c)
d= [0, 2]
c= [0, 2]

Changing the value of d[1] also changed the value of c[1]! This is because when c is made, c is a variable that points to the list [0, 1]. When we set d=c, Python sets the variable d to point to the same list [0,1].

To make a copy of the list c, we can use the list’s copy() method to create a new list with the same contents as c:

e = c.copy()
e[1] = 3
print('e=', e)
print('c=', c)
e= [0, 3]
c= [0, 2]

We can check if two variables point to the same data using Python’s is operator:

c is d
True

If two variables point to the same data, then they will have the same values:

c == d
True

However, the converse is not true. If we make a new copy of c, it will point to a different list object, but the contents will be the same:

f = c.copy()
print(f is c, f == c)
False True
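
The copy() method is not the only way to get an independent list. As a sketch of two alternatives that you may see in other people's code, the list() function and a slice of the whole list each build a new list. The variable names g and h are made up for this example:

```python
c = [0, 2]

g = list(c)   # builds a new list from the items of c
h = c[:]      # a slice of the whole list is also a new list

g[0] = 10
h[1] = 20

print(c, g, h)
print(g is c, h is c)
```

Changing g or h leaves c untouched, which is the same behavior as the copy() example above.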

1.6.2.6. Indentation and Line Breaks in Python#

Line breaks in Python are usually used to distinguish different Python statements, and we will often use this convention in this book. Exceptions to this include:

  • When a statement includes arguments in parentheses, line breaks can be used within the parentheses to improve readability. The statement will not terminate until the closing parenthesis, which should be followed by a line break.

  • As previously introduced, strings that include line breaks can be created using triple quotes.

  • If a line ends in a backslash \, it will be interpreted as continuing onto the next line.

  • Multiple statements can be put onto a single line by separating them by semicolons (;), but this is not encouraged and will not be used in this book.

print('Hello', 'Amelia!')
Hello Amelia!
a = 2 + 3
print(a)
5
print(
    '''Goodbye, 
Amelia!'''
)
Goodbye, 
Amelia!
# Here is an example of using a backslash to break up a line into smaller parts:
A = 1 + 2 + 3 \
  + 4 + 5 + 6
A
21
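
The backslash example above can also be written using parentheses, which many style guides (including PEP 8) prefer for long lines. This is a minimal sketch, with the variable name total chosen only for illustration:

```python
# Parentheses let an expression span several lines without backslashes
total = (1 + 2 + 3
         + 4 + 5 + 6)

# The same applies to the arguments of a function call
print('The total is',
      total)
```

The extra indentation inside the parentheses is for readability only; Python ignores it until the closing parenthesis.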

Programming languages usually have some convention to indicate which statements belong together. For instance, if a statement starts a loop (such as for or while), there must be a way to indicate which of the following statements should be executed in each iteration of the loop and which should be executed after the loop finishes. In many languages, such as C, C++, Java, and JavaScript, code blocks are surrounded by curly braces: { }.

In Python, indentation is used to delimit code blocks. The languages mentioned above use indentation only as a convention that conveys meaning to humans working on code. Python uses indentation to define code blocks and convey meaning to the Python interpreter.

Either tabs or spaces can be used to denote code blocks, but the PEP-8 standard is to use 4 spaces per indent level. Jupyter inserts 4 spaces when Tab is pressed in a code block.[1]

1.6.2.7. Loops and Conditionals#

1.6.2.7.1. for...in Statements#

In data science, we often need to either iterate over data or carry out iterations of a simulation. For both purposes, we will usually rely on Python’s for...in statement. This is a type of compound statement. Compound statements consist of a header and a suite (Python’s terminology) or body. The header always ends in a colon, and the suite is one or more statements that are run consecutively according to conditions in the header. For our purposes, we will always create the suite as a set of statements that are indented one more level than the corresponding header, with each statement on a new line.

The for...in compound statement takes a variable after the for keyword. This variable will hold the current iteration value. After the in keyword there needs to be an iterable object, which is any object that can be iterated over. The typical one we will use is the range object that we have just introduced:

for i in range(4):
    print(i)
0
1
2
3
for j in range(2, 4):
    print(j)
2
3
for k in range(2, 10, 2):
    print(k)
2
4
6
8

Another example of an iterable object is a list:

fruit = ['apples', 'bananas', 'cherries']
for kind in fruit:
    print(kind)
apples
bananas
cherries

When iterating over objects like a list, it is often helpful to keep track of the iteration index. One easy way to do this is to use the enumerate() function that returns a tuple of the current iteration index and the item:

for i, kind in enumerate(fruit):
    print(i)
    print(kind)
    print()
0
apples

1
bananas

2
cherries

In the previous example, I purposefully split the printing into different lines to show a multi-statement suite.

We can nest ‘for’ statements, which means that one for statement is inside of another for statement. For each iteration of the outer loop, the inner loop will run through all its iterations:

for i in range(3):
    for j in range(2):
        print(i, j)
    print()
0 0
0 1

1 0
1 1

2 0
2 1

1.6.2.7.2. if Statements#

The if statement is used to run code conditionally. It will often be the case that if statements are used inside loops or other conditional statements. For instance, to print out only the even numbers from 1 to 10, we can use an if inside a for statement:

for i in range(1, 11):
    if i % 2 == 0:
        print(i)
2
4
6
8
10

More generally, an if statement may also have elif and else clauses. Elif is short for “else if”, and these headers act like if headers but will be evaluated only if the above if or elif headers did not have their conditions satisfied. There can only be one else clause, and its suite will be executed if the if and elif clauses did not have their conditions satisfied. Note that the elif and else headers must be at the same indentation level as the corresponding if header, and these headers must also end with a colon (:). The statements that belong to the suites for these headers go on separate lines below the header and are indented to one greater level.

As a simple example, suppose we want to identify the even numbers from 1 to 10, but if a number is NOT even, we wish to determine if it is divisible by 3. Otherwise, we just want to print an asterisk:

for i in range(1, 11):
    if i % 2 == 0:
        print(f'{i} is even')
    elif i % 3 == 0:
        print(f'{i} is not even but divides by 3')
    else:
        print('*')
*
2 is even
3 is not even but divides by 3
4 is even
*
6 is even
*
8 is even
9 is not even but divides by 3
10 is even

Note how indentation changes meaning in the following two examples, which differ only by one Tab:

a = 2
b = 3
if a == 2:
    print('a=2')
if b == 2:
    print('b=2')
    print(a, b)
a=2
a = 2
b = 3
if a == 2:
    print('a=2')
if b == 2:
    print('b=2')
print(a, b)
a=2
2 3

We will generally iterate over dictionaries by iterating over their keys:

squares = {1: 1, 2: 4, 3: 9, 4: 16}

for key in squares.keys():
    print(f'{key}**2 = {squares[key]}')
1**2 = 1
2**2 = 4
3**2 = 9
4**2 = 16
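
An alternative that avoids looking up each value by its key is the dictionary's items() method, which yields each key together with its value. The following is a sketch of that approach rather than the convention used in this book. The running total is included only to show that the values are ordinary numbers:

```python
squares = {1: 1, 2: 4, 3: 9, 4: 16}

total = 0
for key, value in squares.items():
    # key and value are unpacked from each (key, value) pair
    print(f'{key}**2 = {value}')
    total += value

print(total)
```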

1.6.2.7.3. while Statements#

A while loop combines looping and a conditional statement for determining whether looping should continue. It operates similarly to while loops in other programming languages. The loop will continue as long as the condition specified in the while statement is satisfied:

i = 1
while i < 10:
    if i % 2 == 0:
        print(i)
    i = i + 1
2
4
6
8
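
A while loop is most useful when the number of iterations is not known in advance. As an illustrative sketch, with made-up variable names and a made-up threshold of 1000, the following cell doubles a value until it exceeds the threshold and counts how many doublings were needed:

```python
value = 1
steps = 0

# Keep doubling until value is larger than 1000
while value <= 1000:
    value = value * 2
    steps += 1

print(value, steps)
```

Be sure that something inside the suite eventually makes the condition false. If the line that doubles value were omitted, the loop would never end.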

1.6.2.8. Functions#

We have already used built-in functions, like print() and type(). In this book, we will also often create new functions. As with the built-in functions, user-defined functions can take arguments and can return values. Functions are declared using the Python keyword def followed by the function name and then parentheses. Inside the parentheses, list any variables that will receive arguments, separated by commas. Here is a simple example:

def say_hello(name):
    print(f'Hello, {name}!')

The function name and parameter list are called the function signature. User-defined functions are called in the same way as built-in functions:

say_hello('Charlotte')
Hello, Charlotte!

Functions can also return values by placing them after the Python return keyword. If multiple values are to be returned, they should be separated by commas. If more than one value is returned, it will be returned as a tuple:

def square_and_cube(x):
    return x ** 2, x ** 3
square_and_cube(4)
(16, 64)

When storing returned values from a function into multiple variables, you do not have to explicitly use the parentheses around the tuple of variables. You can just separate the variables by commas, as shown below:

four2, four3 = square_and_cube(4)
print(four2, four3)
16 64

As mentioned in the section on comments, you can provide a docstring for a function as a string that directly follows the function definition. For example, let’s define a function that returns the squared error between its inputs:

def squared_error(x, y):
    """
    Returns the squared error (the squared difference) of the arguments
    """

    return (x - y) ** 2
squared_error(3, 2)
1
squared_error(2, 3)
1

Python allows you to specify default values for function arguments that will be used if the user does not pass a value for the argument. When defining a function, specify the default value for an argument by writing the parameter name, followed by an equal sign, followed by the default value in the function signature.

Let’s make a new version of the squared_error function that sets the default value of y to 0:

def squared_error2(x, y=0):
    """
    Returns the squared error between the two arguments, with a default value of 0
    for the second argument
    """

    return (x - y) ** 2
squared_error2(2)
4

If the user passes a value, the default value is not used:

squared_error2(2, 3)
1

Note that we can use the names of the parameters (instead of parameter order) to pass values to those parameters. This is very commonly used in functions that have lots of parameters that are optional.

squared_error2(y=2, x=3)
1

Parameters can be passed using a mix of order and parameter names. However, any parameters passed by order must come before those passed by name.
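
A minimal sketch of mixing the two styles is shown here. It repeats the definition of squared_error2 so that the cell is self-contained, and the variable name result is chosen only for illustration:

```python
def squared_error2(x, y=0):
    """
    Returns the squared error between the two arguments, with a default value of 0
    for the second argument
    """

    return (x - y) ** 2


# x is passed by order, and y is passed by name
result = squared_error2(5, y=2)
print(result)

# Reversing the styles, as in squared_error2(x=5, 2), is a SyntaxError
# because an argument passed by order cannot follow one passed by name.
```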

We will see in the next section how to get help on a function.

1.6.2.9. Getting Help and Completion#

Python has built-in help for almost every function and object. This help can be retrieved in several ways. For instance, consider the built-in sum function. Here are two ways to get help in Jupyter:

help(sum)
Help on built-in function sum in module builtins:

sum(iterable, /, start=0)
    Return the sum of a 'start' value (default: 0) plus an iterable of numbers
    
    When the iterable is empty, return the start value.
    This function is intended specifically for use with numeric values and may
    reject non-numeric types.
?sum
Signature: sum(iterable, /, start=0)
Docstring:
Return the sum of a 'start' value (default: 0) plus an iterable of numbers

When the iterable is empty, return the start value.
This function is intended specifically for use with numeric values and may
reject non-numeric types.
Type:      builtin_function_or_method

For user-defined functions, the help will display the docstring you wrote. The following assumes that you have defined the function squared_error from Section 1.6.2.8 (Functions):

?squared_error
Signature: squared_error(x, y)
Docstring: Returns the squared error (the squared difference) of the arguments
File:      /var/folders/gz/d_8lq2wn23x2lmhfh63cfl3400010v/T/ipykernel_93458/1368254984.py
Type:      function

If a function’s Python source code is available, it can be retrieved in Jupyter using ??:

??squared_error
Signature: squared_error(x, y)
Source:   
def squared_error(x, y):
    """
    Returns the squared error (the squared difference) of the arguments
    """

    return (x - y) ** 2
File:      /var/folders/gz/d_8lq2wn23x2lmhfh63cfl3400010v/T/ipykernel_93458/1368254984.py
Type:      function

Now, let’s look at the help for a variable.

squares = {1: 1, 2: 4, 3: 9, 4: 16}
?squares
Type:        dict
String form: {1: 1, 2: 4, 3: 9, 4: 16}
Length:      4
Docstring:  
dict() -> new empty dictionary
dict(mapping) -> new dictionary initialized from a mapping object's
    (key, value) pairs
dict(iterable) -> new dictionary initialized as if via:
    d = {}
    for k, v in iterable:
        d[k] = v
dict(**kwargs) -> new dictionary initialized with the name=value pairs
    in the keyword argument list.  For example:  dict(one=1, two=2)

You should also try help(squares), but I have omitted that because it provides help for every method of the dict object, which results in a lot of output.

You can also try help() with no argument to get an interactive help session.
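The ? and ?? shortcuts are IPython features, so they work in Jupyter but not in a plain Python script. As a minimal sketch of how to retrieve similar information with standard Python, the following uses the inspect module from the standard library. The function squared_error is redefined here only so that the example is self-contained:

```python
import inspect


def squared_error(x, y):
    """
    Returns the squared error (the squared difference) of the arguments
    """
    return (x - y) ** 2


# The call signature, similar to the "Signature:" line shown by ?
sig = str(inspect.signature(squared_error))
print(sig)  # (x, y)

# The docstring with surrounding whitespace cleaned up
doc = inspect.getdoc(squared_error)
print(doc)
```

The raw docstring is also available directly as squared_error.__doc__, but inspect.getdoc() removes the leading and trailing blank lines for you.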

Jupyter also provides many features to help you during programming. Assuming you have run the command defining squares above, try the following in a new Jupyter notebook cell:

  1. Type sum(. When you type the open parenthesis, Jupyter should automatically insert a pair of parentheses.

  2. Press shift-Tab. You should see the call signature and docstring for the sum function in a pop-over box. You can press the Esc key to close the pop-over box.

  3. Type sq and press Tab. You should see a list of variables and functions that begin with sq. Use the cursor keys to scroll to squares and press Enter to insert it without having to type the full name of the squares dictionary.

  4. Let’s sum the values in the squares dictionary. Type a period and then press Tab again to see a list of methods for a dict object. Select values using the keyboard or mouse.

  5. Don’t forget that we need parentheses to call the values method. Press ( and a pair of parentheses should appear.

  6. Press shift-Enter to run the cell.

sum(squares.values())
30
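The list of methods that Tab completion shows can also be produced in code. The sketch below uses the built-in dir() function for this and then shows why the steps above call the values method; the variable name public_names is my own choice for illustration:

```python
squares = {1: 1, 2: 4, 3: 9, 4: 16}

# Names that do not start with an underscore are the public methods
public_names = [name for name in dir(squares) if not name.startswith("_")]
print(public_names)

# Iterating over a dict yields its keys, so sum(squares) adds the keys
print(sum(squares))           # 10
print(sum(squares.values()))  # 30
```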

1.6.2.10. Python Modules and Namespaces#

Python has many useful modules that extend Python’s basic functionality. Some of these are included with the base Python distribution, and many others are included in the Anaconda distribution. Many more can be installed over the Internet.

To use a module, you must import it into your Python working environment. The most basic way to do this is to type import followed by the name of the module to be imported:

import numpy

Here we have imported NumPy (usually pronounced “Numb Pie”), one of the most important Python modules for working with numerical functions and arrays. When a module is imported, its functions and classes will be available in Python, but they are imported into their own namespace. To access something in a different namespace, type the name of the namespace, followed by a period, followed by the name of the thing you are trying to access.

For instance, the value of \(\pi\) is a constant object named pi in NumPy. Now that we have imported NumPy, we can access that value:

print(numpy.pi)
3.141592653589793

NumPy has many typical mathematical functions, which we can call using the numpy namespace:

numpy.sin(numpy.pi / 4)
0.7071067811865475
numpy.sqrt(2) / 2
0.7071067811865476

We can control the namespace into which the contents of a module are imported. Because many modules, like NumPy, are so commonly used, the community has standardized on namespaces that are shorter than the full module name. To import a module under a different namespace, type import, followed by the module name, followed by the as keyword, followed by the desired namespace.

For NumPy, the data science community typically uses np, so the import statement is as follows:

import numpy as np
print(np.pi)
3.141592653589793

Warning

It is possible to import the contents of a module into the global namespace, which means that the namespace does not have to be specified before each function, class, or object. However, this practice is strongly discouraged because it often results in conflicts. For instance, both Matplotlib (a plotting module) and SymPy (a symbolic algebra module) have a plot function. If you were to import both matplotlib and sympy into the global namespace, you could not be sure which plot you were calling, unless you kept track of which module was imported last.
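As a small illustration of this kind of conflict, the sketch below imports a single function named sqrt from two different modules. The choice of the math and numpy modules is only for illustration; the same thing happens with any two modules that share a name:

```python
import math
import numpy as np

from math import sqrt
first_sqrt = sqrt        # remember what the name sqrt refers to right now

from numpy import sqrt   # silently replaces the earlier meaning of sqrt
second_sqrt = sqrt

print(first_sqrt is math.sqrt)   # True
print(second_sqrt is np.sqrt)    # True
print(sqrt is math.sqrt)         # False: the last import wins
```

Python gives no warning when the second import replaces the first, which is why keeping each module in its own namespace is the safer habit.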

Importing into namespaces is such good practice that we will not give an example of how to import an entire module into the global namespace. However, on occasion, it may be helpful to import just a single function from a module, and in this case, it is reasonable to import it into the global namespace if we can be confident that there will not be any collisions. An example follows:

from scipy.special import factorial
factorial(10)
3628800.0

Why is factorial returning a float? Let’s check the docstring:

?factorial
Signature: factorial(n, exact=False)
Docstring:
The factorial of a number or array of numbers.

The factorial of non-negative integer `n` is the product of all
positive integers less than or equal to `n`::

    n! = n * (n - 1) * (n - 2) * ... * 1

Parameters
----------
n : int or array_like of ints
    Input values.  If ``n < 0``, the return value is 0.
exact : bool, optional
    If True, calculate the answer exactly using long integer arithmetic.
    If False, result is approximated in floating point rapidly using the
    `gamma` function.
    Default is False.

Returns
-------
nf : float or int or ndarray
    Factorial of `n`, as integer or float depending on `exact`.

Notes
-----
For arrays with ``exact=True``, the factorial is computed only once, for
the largest input, with each other result computed in the process.
The output dtype is increased to ``int64`` or ``object`` if necessary.

With ``exact=False`` the factorial is approximated using the gamma
function:

.. math:: n! = \Gamma(n+1)

Examples
--------
>>> import numpy as np
>>> from scipy.special import factorial
>>> arr = np.array([3, 4, 5])
>>> factorial(arr, exact=False)
array([   6.,   24.,  120.])
>>> factorial(arr, exact=True)
array([  6,  24, 120])
>>> factorial(5, exact=True)
120
File:      /Applications/anaconda3/lib/python3.9/site-packages/scipy/special/_basic.py
Type:      function

By inspecting the docstring, we can see that if we want the exact value, we need to set the parameter exact to True. This is typically done using the parameter name because anyone reading the function call will understand what the value of True is being used for:

factorial(10, exact=True)
3628800
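For 10 the float and integer answers agree, but for larger arguments the difference starts to matter. The following sketch (the argument 25 is my own choice for illustration) compares the two modes against math.factorial from the standard library, which always returns an exact integer:

```python
import math
from scipy.special import factorial

approx = factorial(25)              # floating-point approximation
exact = factorial(25, exact=True)   # exact integer arithmetic

print(approx)
print(exact)

# The exact result matches the standard library
print(exact == math.factorial(25))  # True

# 25! has more significant bits than a float can hold,
# so the float result cannot equal the exact integer
print(int(approx) == exact)         # False
```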

1.6.2.10.1. NumPy Arrays#

NumPy provides a numpy.ndarray container for holding one-dimensional or multi-dimensional collections of numbers. Most engineers will have some familiarity with vectors or matrices. For our purposes, we will use the following definitions of these mathematical objects:

DEFINITION

vector#

A one-dimensional, ordered list of numbers that has an accompanying notion of magnitude of a vector and distance between two vectors.

Vectors are usually shown enclosed in square brackets. In mathematical notation, we will use bold, lowercase letters to denote a vector. In mathematics, a vector may be either a column vector or a row vector. If not otherwise specified, vector will refer to a column vector; for instance, here is an example of defining a vector of the first five counting numbers:

\[\begin{split} \mathbf{x} = \begin{bmatrix} 1 \\ 2 \\ 3 \\ 4 \\ 5 \end{bmatrix}. \end{split}\]

NumPy vectors do not have any notion of direction (column or row) but are usually displayed as a row and are interpreted as a row by some NumPy operations. To make a NumPy vector, call np.array() with the vector elements enclosed in square brackets and separated by commas:

x = np.array([1, 2, 3, 4, 5])
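To see what it means for a NumPy vector to have no direction, we can inspect the ndim and shape attributes of the array. This is a minimal sketch; note that transposing a one-dimensional array leaves its shape unchanged:

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5])

print(x.ndim)     # 1
print(x.shape)    # (5,)

# There is no separate row or column form of a one-dimensional array
print(x.T.shape)  # (5,)
```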

Arrays extend the concept of a vector to multiple dimensions:

DEFINITION

array#

A multi-dimensional table of numbers that supports a standard set of operations, including multiplication of matrices.

DEFINITION

matrix#

An alternate term for a two-dimensional array.

Note that an array can have one dimension, in which case it is a vector; in such cases, I will always refer to it as a vector. I will use array for higher dimensions, even though we will generally only consider two-dimensional arrays. Mathematically, two-dimensional arrays are shown as a table of numbers, organized into rows and columns, and enclosed in large square brackets. For example, the rows of the array below contain consecutive powers (1, 2, and 3) of the first five counting numbers:

\[\begin{split} \begin{bmatrix} 1 &2 &3 &4 &5 \\ 1 & 4 & 9 & 16 & 25 \\ 1 & 8 & 27 & 64 & 125 \\ \end{bmatrix} \end{split}\]

The majority of our work on vectors, arrays, and their mathematics (linear algebra) will be deferred to Chapter 12 and Chapter 14. However, arrays are very helpful for doing basic data manipulation and storing numerical values for simulations, so we review some basics here.

In Python, arrays can be created using NumPy’s np.array() function. Two-dimensional arrays can be created by passing a list whose contents are equal-length lists of numbers. Each of the interior lists of numbers represents one row of the array. This will be most clear with an example:

B = np.array(
  [[1, 2, 3, 4],
   [8, 7, 6, 5]]
)

B
array([[1, 2, 3, 4],
       [8, 7, 6, 5]])

To make the definition of B clearer, I have put each row of the array on a separate line in the Python code. Because Python is waiting for closing parentheses and square brackets, these lines will be processed as a single statement. I strongly encourage you to use this approach to make your array definitions clearer.
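As a quick check that each interior list became one row, the sketch below inspects the shape of B and retrieves a few entries. Indexing has not been introduced for arrays yet, so treat the indexing lines as a preview: indices start at 0, and the row index comes first.

```python
import numpy as np

B = np.array(
    [[1, 2, 3, 4],
     [8, 7, 6, 5]]
)

print(B.shape)   # (2, 4): 2 rows and 4 columns
print(B.ndim)    # 2

print(B[0])      # first row: [1 2 3 4]
print(B[1, 0])   # second row, first column: 8
print(B[:, 1])   # second column: [2 7]
```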

We will also be using methods that return NumPy arrays of random values.

NumPy arrays offer many advantages over lists. Two primary ones are:

  1. It is easy to perform numerical operations on the elements of NumPy arrays.

  2. NumPy arrays provide a variety of built-in methods that we will find useful.

To illustrate these, consider the array \(B\) defined above and a Python list \(L\) containing the same numbers as the first row of \(B\):

L = [1, 2, 3, 4]

The NumPy array makes it easy to multiply all the elements in the array by a value:

B * 5
array([[ 5, 10, 15, 20],
       [40, 35, 30, 25]])

Compare the result with the effect of the multiplication operator on a list:

L * 5
[1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4]
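To multiply each element of a plain list by 5, we have to visit the elements ourselves. A minimal sketch using a list comprehension, alongside the NumPy equivalent for comparison (the names scaled_list and scaled_array are only for illustration):

```python
import numpy as np

L = [1, 2, 3, 4]

# A list needs an explicit loop (here, a list comprehension)
scaled_list = [5 * value for value in L]
print(scaled_list)    # [5, 10, 15, 20]

# Converting to an array lets * act on every element
scaled_array = np.array(L) * 5
print(scaled_array)   # [ 5 10 15 20]
```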

An example of a built-in method is sum():

B.sum()
36

The Python list object does not have a built-in sum() method, but Python does offer a general sum function:

L.sum()
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Cell In[111], line 1
----> 1 L.sum()

AttributeError: 'list' object has no attribute 'sum'
sum(L)
10

We will introduce other methods of the array type and other NumPy functions that work on and/or return arrays as we introduce more data-science techniques.
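As a brief preview, here is a sketch of a few such methods applied to the array B from above. The axis argument of sum() chooses the direction of the sum: axis=0 adds down each column, and axis=1 adds across each row.

```python
import numpy as np

B = np.array(
    [[1, 2, 3, 4],
     [8, 7, 6, 5]]
)

print(B.sum())        # 36, the sum of all elements
print(B.sum(axis=0))  # column sums: [9 9 9 9]
print(B.sum(axis=1))  # row sums: [10 26]
print(B.max())        # 8
print(B.mean())       # 4.5
```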

1.6.2.11. Numerical Errors and Rounding#

Python is not able to store all numbers exactly internally, and mathematical operations on numbers can also produce numerical errors. Let’s illustrate this with a simple example:

3*0.1
0.30000000000000004

Based on this, you might conclude that the internal representation of 0.1 is actually a little larger than 0.1, and answers involving sums or multiples of 0.1 might sometimes come out a little too large. However, the truth is much more subtle, as shown by this example:

total = 0  # named total because sum would shadow the built-in sum function
for i in range(8):
    total += 0.1
    print(total)
0.1
0.2
0.30000000000000004
0.4
0.5
0.6
0.7
0.7999999999999999
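One way to see what is going on is to display the exact values that Python stores. The sketch below uses the decimal module from the standard library for this; passing a float to Decimal shows the stored binary value exactly, while passing a string gives the true decimal value.

```python
from decimal import Decimal

# The exact value stored for the float 0.1
print(Decimal(0.1))

# The stored value is slightly larger than the true 0.1 ...
print(Decimal(0.1) > Decimal("0.1"))    # True

# ... yet adding it eight times gives a result slightly below 0.8,
# because each addition is itself rounded to the nearest float
total = 0
for i in range(8):
    total += 0.1
print(Decimal(total) < Decimal("0.8"))  # True
```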

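These discrepancies are floating-point rounding errors: most decimal fractions, such as 0.1, cannot be represented exactly in binary, and each arithmetic operation rounds its result to the nearest representable value. We will occasionally round the outputs of functions to reveal the intended values. We can use the NumPy function np.round(), which takes two arguments: 1) the number or array to be rounded, and 2) the number of decimal places to keep: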

total = 0  # named total because sum would shadow the built-in sum function
for i in range(8):
    total += 0.1
    print(np.round(total, 10))
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8

I will generally round to 10 decimal places, as that is usually enough to remove computational error while preserving any digits that are not caused by computational error.
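Rounding is mainly useful for display. When the goal is to compare two floating-point values, a common alternative is to test whether they are close rather than equal. This is a minimal sketch using math.isclose and np.isclose with their default tolerances:

```python
import math
import numpy as np

total = 0
for i in range(8):
    total += 0.1

# Exact comparison fails because of the accumulated rounding error
print(total == 0.8)                # False

# Tolerance-based comparisons succeed
print(math.isclose(total, 0.8))    # True
print(np.isclose(total, 0.8))      # True

# Rounding first also recovers the intended value
print(np.round(total, 10) == 0.8)  # True
```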

1.6.2.12. Writing Big Numbers in Python#

We will be building simulations in Python that require looping thousands or millions of times. Thus, we will often be writing a range that has an argument with many zeros. The range function will not take a float value, and numbers written in scientific notation (like 1e6) will be treated as floats. This results in using integers that are very hard to read, like 10000000. We can’t use commas in the numbers because that would create a tuple:

10, 000, 000
(10, 0, 0)

Fortunately, Python provides a simple way to make large numbers like these more readable. Instead of using commas as a delimiter between every third digit, use underscores (_). Doing this makes it much easier to interpret large numbers, like ten million:

10_000_000
10000000
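The points above can be sketched in a few lines. The example shows that a number in scientific notation is a float, that range rejects it, and that underscores can also be used when printing a large integer by putting _ (or a comma) in an f-string format specification:

```python
n = 10_000_000

print(n == 10**7)            # True
print(type(1e7).__name__)    # float

# range requires an integer argument
try:
    range(1e7)
except TypeError as err:
    print("TypeError:", err)

# Converting the float to an int works, but is harder to read
print(len(range(int(1e7))))  # 10000000

# Digit separators can also be added when displaying a number
print(f"{n:_}")              # 10_000_000
print(f"{n:,}")              # 10,000,000
```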

1.6.2.13. Summary and Other Resources#

Do not worry too much about absorbing all these details about Python now. This book contains many examples to get you started, and you can refer back to this section for reference. Some additional features of the Python programming language will be introduced as needed.

For users who want to learn more about Python, the following resources are recommended:

1.6.3. Review#

Self-Assessment:

The following questions can be used to check your understanding of the material covered in this section: