Writing your own libraries (& Command-Line Programs)

Overview:

Teaching: 15 min
Exercises: 45 min

Questions

How can I write Python programs that will work like Unix command-line tools?

Objectives

Use the values of command-line arguments in a program.
Handle flags and files separately in a command-line program.
Read data from standard input in a program so that it can be used in a pipeline.

Interactive tools such as Jupyter notebooks and IPython are great for prototyping code and exploring data, but sooner or later, if we want to re-use our code or demonstrate reproducible workflows, we will want to use our program in a pipeline or run it in a shell script to process thousands of data files. In order to do that, we need to make our programs work like other Unix command-line tools. For example, we may want a program that reads a dataset and prints the average inflammation per patient.

You have several ways to choose from to run this episode. If you are comfortable using Linux editors and the terminal, you are welcome to do so. Otherwise you can create a file directly in notebooks: from the menu page where you create a new notebook, select Text File instead of Python3.6. Once the new file has opened, click on Untitled.txt and change its name as you are instructed. notebooks allows you to edit the file in a number of different modes that replicate more advanced editors, which you should explore if you want to use notebooks for regular development.

Switching to Shell Commands

In this lesson we are switching from typing commands that comprise our program in a Python interpreter to typing commands in a program file or script. We will then run or import the program from within a Jupyter notebook. When you see %run in a code cell, this is an IPython magic command, which loads and runs a script through the Python interpreter.

As you might now expect, our first task will be to produce a program that sends a greeting to the world.

Let's start by creating a new Text File. Rename it hello_world.py and enter:

print("Hello world!")

Create a new notebook in the same folder as your program and run it with

%run hello_world.py

Verify that this gives the output you would expect.

Passing arguments from the command line

Often we will want to pass information into the program from the command line. Fortunately, Python's standard library includes a module, sys, that allows us to do this. Copy your hello_world.py program to a new file hello.py and edit the file as follows:

import sys

print("Hello",sys.argv)

If we run our new program with the argument James, we should see the following output:

%run hello.py James
Hello ['./hello.py', 'James']

The name sys.argv comes from "argument vector": it holds the command-line arguments as a list of strings. The first item is the name of the program, and the remaining items are the arguments we supplied. We don't want to say hello to the name of the program, and generally we will want to ignore this item, so let's modify our program to consider only the rest of the list:

import sys

names = sys.argv[1:]

for name in names:
    print("Hello",name)

Make sure that you understand what we have done here, and why. Discuss with your neighbours to make sure everyone is following.

We can now re-run our new program with the same command as before:

%run hello.py James
Hello James

Because we have generalised the program to operate on all arguments passed to it, we can also run

%run hello.py Alan Bob Carl Dave
Hello Alan
Hello Bob
Hello Carl
Hello Dave

so we already have a way to generalise the script to perform the same task on a number of arguments.

What would the following do?

This exercise requires a working knowledge of bash/Linux.

%run hello.py ../*

Remember that in bash, * is a wildcard that matches any sequence of characters, of any length.
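
If you want to see what a wildcard pattern expands to before it reaches your program, you can reproduce the expansion in Python. The sketch below uses the standard-library glob module; the helper name expand_pattern is made up for illustration, and the sorting is added only to make the order predictable.

```python
import glob


def expand_pattern(pattern):
    """Return the sorted list of paths that match a shell-style wildcard pattern."""
    # glob.glob applies shell-style matching: * matches any run of characters
    # within one path component (names starting with a dot are skipped).
    return sorted(glob.glob(pattern))


# Each matching path would arrive in sys.argv as a separate argument,
# so the program would greet every entry in the parent folder.
for path in expand_pattern("../*"):
    print("Hello", path)
```

The program never sees the * itself: the pattern is replaced by the list of matching paths before main() starts.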

Next we will make some small changes to our program to encapsulate the main part of our program in its own function, and then tell Python that this is what it should run when the program is executed:

import sys

def main():
    '''
    We can also add a docstring to remind our future selves that this program:

    Takes a list of arguments and says hello to each of them.
    '''

    names = sys.argv[1:]

    for name in names:
        print("Hello",name)

if __name__ == '__main__':
    main()

Run your program with the same arguments as before to check that you have not changed its behaviour. Note that we can also add a 'docstring' to our main function to explain what it does.

Running versus Importing

If the program behaves in the same way, why have we changed it? The reason is that running a Python script in bash is very similar to importing that file in Python. The biggest difference is that we don’t expect anything to happen when we import a file, whereas when running a script, we expect to see some output printed to the console.

In order for a Python script to work as expected when imported or when run as a script, we typically put the part of the script that produces output in the following if statement:

if __name__ == '__main__':
    main()  # Or whatever function produces output

When you import a Python file, __name__ is a special variable which holds the name of that file (e.g., when importing readings.py, __name__ is 'readings'). However, when running a script in bash, __name__ is always set to '__main__' in that script so that you can determine if the file is being imported or run as a script.
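
One way to convince yourself of this is to make a file record the value of __name__ that it sees, and then both import it and run it. This is only an illustrative sketch: the throwaway file readings.py, the variable seen_name and the two helper functions are invented for the demonstration, and runpy is used here to stand in for running the file as a script.

```python
import importlib.util
import runpy
import tempfile
from pathlib import Path

# Contents of a throwaway file that simply records its own __name__.
SOURCE = "seen_name = __name__\n"


def name_when_imported(path):
    """Import the file at path as a module and return the __name__ it saw."""
    spec = importlib.util.spec_from_file_location(Path(path).stem, str(path))
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module.seen_name


def name_when_run(path):
    """Execute the file the way a script is executed and return the __name__ it saw."""
    namespace = runpy.run_path(str(path), run_name="__main__")
    return namespace["seen_name"]


with tempfile.TemporaryDirectory() as folder:
    script = Path(folder) / "readings.py"
    script.write_text(SOURCE)
    print("imported:", name_when_imported(script))
    print("run:", name_when_run(script))
```

The same file reports 'readings' when imported and '__main__' when run, which is exactly the difference the if statement tests for.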

By adopting the practice of encapsulating the script part of our code in a main function, we make sure that we can safely import our code into other programs and reuse the fantastic functions we write.

The Right Way to Do It

If we want to create programs that take complex parameters or multiple filenames, we shouldn't handle sys.argv directly. Instead, we should use Python's argparse library, which handles common cases in a systematic way, and also makes it easy for us to provide sensible error messages for our users. We will not cover this module in this lesson, but you can work through Tshepang Lekhonkhobe's Argparse tutorial, which is part of Python's official documentation.
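
To give a flavour of what argparse looks like, here is a minimal sketch of the greeting program rewritten with it. The function names and the --greeting option are hypothetical additions for illustration and are not part of the lesson's hello.py.

```python
import argparse


def parse_args(argv):
    """Parse a list of command-line arguments (without the program name)."""
    parser = argparse.ArgumentParser(description="Say hello to each name given.")
    # One or more positional arguments, collected into a list.
    parser.add_argument("names", nargs="+", help="names to greet")
    # An optional flag; --greeting is an invented option for this sketch.
    parser.add_argument("--greeting", default="Hello", help="word to greet with")
    return parser.parse_args(argv)


def greetings(argv):
    """Return one greeting line per name."""
    args = parse_args(argv)
    return [f"{args.greeting} {name}" for name in args.names]


# In a script you would pass sys.argv[1:]; a fixed list is used here instead.
for line in greetings(["--greeting", "Hi", "Alan", "Bob"]):
    print(line)
```

With this approach the flag and the names are separated for us, and running the program with no names produces a usage message instead of silently doing nothing.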

The Software Carpentry material that this episode is based on makes use of the 'code' files you downloaded at the beginning of the episode. If you wish to explore these files further, you are encouraged to do so, as they also explore some of the functionality of the numpy library. Note that they take a slightly different approach to running their programs from the one we have looked at.

We will instead explore an example that builds on what we did in the preceding episode more directly.

First of all you will need to unzip the files in the data folder. If you are comfortable using the terminal, feel free to launch a terminal and extract the files. Alternatively, if you have not already done so, you can create a new notebook in the data folder and unzip the files by running the following in a code cell.

!unzip RS50001/python-novice-inflammation-data.zip
!unzip RS50001/python-novice-inflammation-code.zip

As with the % magic commands, the ! prefix is notebook syntax rather than Python: it runs a standard bash command.

Now open the data folder that these commands should create; ask a demonstrator if nothing happens. You will need to move the programs you create, and a notebook to run them, into this folder.

Let's say we want to find the mean inflammation of each of the patients in the inflammation data we read in during the previous lesson.

First copy our template hello.py to inflammation_mean.py. Open inflammation_mean.py and edit it as follows:

import sys

def main():
    '''
    We can also add a docstring to remind our future selves that this program:

    Takes a list of files, and finds and prints the mean of each line of data.
    '''

    filenames = sys.argv[1:]

    for filename in filenames:
        data = read_csv_to_floats(filename)
        count=0
        for line in data:
            count += 1
            print("File: ", filename, "patient: ", count, "average inflammation", mean(line))

if __name__ == '__main__':
    main()

Now we need to add the mean(sample) function that we considered in episode 7. Add it between read_csv_to_floats() (from the preceding episode) and main():

def mean(sample):
    '''
    Takes a list of numbers, sample

    and returns the mean.
    '''
    sample_sum = 0
    for value in sample:
        sample_sum += value

    sample_mean = sample_sum / len(sample)
    return sample_mean
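
The script also relies on read_csv_to_floats() from the preceding episode, which is not reproduced in this lesson. If you do not have it to hand, the following is a minimal sketch of what such a function might look like, assuming that it takes a filename and returns one list of floats per line of a comma-separated file. Your own version from the earlier episode may differ in detail.

```python
def read_csv_to_floats(filename):
    '''
    Takes the name of a comma-separated file and returns
    a list of rows, each row being a list of floats.

    (A sketch only: the behaviour is assumed from how main() uses it.)
    '''
    data = []
    with open(filename) as csv_file:
        for line in csv_file:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            data.append([float(value) for value in line.split(",")])
    return data
```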

Now run your program with:

%run inflammation_mean.py inflammation-01.csv

Your output should look something like the following (the file name is printed exactly as you typed it, so the path prefix may differ):

File:  ../inflammation-01.csv patient:  1 average inflammation 5.45
File:  ../inflammation-01.csv patient:  2 average inflammation 5.425
File:  ../inflammation-01.csv patient:  3 average inflammation 6.1
File:  ../inflammation-01.csv patient:  4 average inflammation 5.9
File:  ../inflammation-01.csv patient:  5 average inflammation 5.55
File:  ../inflammation-01.csv patient:  6 average inflammation 6.225
File:  ../inflammation-01.csv patient:  7 average inflammation 5.975
File:  ../inflammation-01.csv patient:  8 average inflammation 6.65
File:  ../inflammation-01.csv patient:  9 average inflammation 6.625
File:  ../inflammation-01.csv patient:  10 average inflammation 6.525
File:  ../inflammation-01.csv patient:  11 average inflammation 6.775
File:  ../inflammation-01.csv patient:  12 average inflammation 5.8
File:  ../inflammation-01.csv patient:  13 average inflammation 6.225
File:  ../inflammation-01.csv patient:  14 average inflammation 5.75
File:  ../inflammation-01.csv patient:  15 average inflammation 5.225
File:  ../inflammation-01.csv patient:  16 average inflammation 6.3
File:  ../inflammation-01.csv patient:  17 average inflammation 6.55
File:  ../inflammation-01.csv patient:  18 average inflammation 5.7
File:  ../inflammation-01.csv patient:  19 average inflammation 5.85
File:  ../inflammation-01.csv patient:  20 average inflammation 6.55
File:  ../inflammation-01.csv patient:  21 average inflammation 5.775
File:  ../inflammation-01.csv patient:  22 average inflammation 5.825
File:  ../inflammation-01.csv patient:  23 average inflammation 6.175
File:  ../inflammation-01.csv patient:  24 average inflammation 6.1
File:  ../inflammation-01.csv patient:  25 average inflammation 5.8
File:  ../inflammation-01.csv patient:  26 average inflammation 6.425
File:  ../inflammation-01.csv patient:  27 average inflammation 6.05
File:  ../inflammation-01.csv patient:  28 average inflammation 6.025
File:  ../inflammation-01.csv patient:  29 average inflammation 6.175
File:  ../inflammation-01.csv patient:  30 average inflammation 6.55
File:  ../inflammation-01.csv patient:  31 average inflammation 6.175
File:  ../inflammation-01.csv patient:  32 average inflammation 6.35
File:  ../inflammation-01.csv patient:  33 average inflammation 6.725
File:  ../inflammation-01.csv patient:  34 average inflammation 6.125
File:  ../inflammation-01.csv patient:  35 average inflammation 7.075
File:  ../inflammation-01.csv patient:  36 average inflammation 5.725
File:  ../inflammation-01.csv patient:  37 average inflammation 5.925
File:  ../inflammation-01.csv patient:  38 average inflammation 6.15
File:  ../inflammation-01.csv patient:  39 average inflammation 6.075
File:  ../inflammation-01.csv patient:  40 average inflammation 5.75
File:  ../inflammation-01.csv patient:  41 average inflammation 5.975
File:  ../inflammation-01.csv patient:  42 average inflammation 5.725
File:  ../inflammation-01.csv patient:  43 average inflammation 6.3
File:  ../inflammation-01.csv patient:  44 average inflammation 5.9
File:  ../inflammation-01.csv patient:  45 average inflammation 6.75
File:  ../inflammation-01.csv patient:  46 average inflammation 5.925
File:  ../inflammation-01.csv patient:  47 average inflammation 7.225
File:  ../inflammation-01.csv patient:  48 average inflammation 6.15
File:  ../inflammation-01.csv patient:  49 average inflammation 5.95
File:  ../inflammation-01.csv patient:  50 average inflammation 6.275
File:  ../inflammation-01.csv patient:  51 average inflammation 5.7
File:  ../inflammation-01.csv patient:  52 average inflammation 6.1
File:  ../inflammation-01.csv patient:  53 average inflammation 6.825
File:  ../inflammation-01.csv patient:  54 average inflammation 5.975
File:  ../inflammation-01.csv patient:  55 average inflammation 6.725
File:  ../inflammation-01.csv patient:  56 average inflammation 5.7
File:  ../inflammation-01.csv patient:  57 average inflammation 6.25
File:  ../inflammation-01.csv patient:  58 average inflammation 6.4
File:  ../inflammation-01.csv patient:  59 average inflammation 7.05
File:  ../inflammation-01.csv patient:  60 average inflammation 5.9

We can also run our program with all the inflammation data:

%run inflammation_mean.py inflammation-*.csv

We may also want to output our data to a file. In order to do this modify your main function as follows:

def main():
    '''
    We can also add a docstring to remind our future selves that this program:

    Takes a list of files, and finds the mean of each line of data and writes it to a file.
    '''

    filenames = sys.argv[1:]

    output_filename = "my_data.txt"

    output_file = open(output_filename, 'w')

    for filename in filenames:
        data = read_csv_to_floats(filename)
        count=0
        for line in data:
            count += 1
            output_file.write("File: " + filename + " patient: " + str(count) + " average inflammation: " + str(mean(line)) + "\n")
    output_file.close()

Note that, as with reading from files, we have to open and close the file. Also, the function file.write() can only take a single str as its parameter, so the write line is a little different from our earlier print statement: we join the pieces into one string ourselves, including the spaces between them. We also have to add an explicit new line at the end of each line, which is the reason for the "\n".
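
As an aside, many Python programmers prefer to let a with statement close the file for them and to build the single string with an f-string. The sketch below shows that alternative style; the function write_means and its parameters are hypothetical and not part of the lesson's script, and the mean is computed with the built-in sum to keep the example self-contained.

```python
def write_means(rows, source_name, output_filename):
    '''
    Writes the mean of each row in rows to output_filename,
    one line per patient.
    '''
    # The with statement closes the file for us, even if an error occurs.
    with open(output_filename, 'w') as output_file:
        # enumerate(..., start=1) replaces the hand-written counter.
        for count, row in enumerate(rows, start=1):
            row_mean = sum(row) / len(row)
            output_file.write(
                f"File: {source_name} patient: {count} average inflammation: {row_mean}\n"
            )
```

Calling write_means(data, filename, "my_data.txt") for one file would produce the same kind of lines as the loop in main().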

Run your program and cat the file my_data.txt to verify that it has worked as intended.

Exercise: Import your code

Verify that you can also import your library and access the functions it defines. Remember that, as with undefined variables, if your function is not found then the library has not been read in correctly. Repeat the 'analysis' in the main function by explicitly assigning a value to filename and calling your read_csv_to_floats and mean functions.

Arithmetic on the Command Line

Write a Python program that does addition and subtraction:

%run arith.py add 1 2
3
%run arith.py subtract 3 4
-1

Solution
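
One possible solution is sketched below; it is not the only way to do it. It assumes the arguments are whole numbers, and the function name calculate and the usage message are choices made for this sketch.

```python
import sys


def calculate(operation, first, second):
    """Return first + second or first - second, depending on operation."""
    if operation == "add":
        return first + second
    if operation == "subtract":
        return first - second
    raise ValueError("unknown operation: " + operation)


def main(arguments):
    """Expects three arguments: an operation followed by two integers."""
    if len(arguments) != 3:
        print("usage: arith.py add|subtract NUMBER NUMBER")
        return
    operation, first, second = arguments
    # Command-line arguments always arrive as strings, so convert them first.
    print(calculate(operation, int(first), int(second)))


if __name__ == '__main__':
    main(sys.argv[1:])
```

Keeping the arithmetic in its own function means it can be imported and reused without touching sys.argv.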

Counting Lines

By modifying inflammation_mean.py or otherwise, write a program called line_count.py that counts the number of lines in each file passed as an argument and, at the end, reports the total number of lines in all the files added together. For those familiar with bash, it works like the Unix wc command. Your program should behave as follows:

If no filenames are given, it tells the user to provide one or more filenames.
If one or more filenames are given, it reports the number of lines in each, followed by the total number of lines.
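
If you get stuck, here is a sketch of one way to structure line_count.py. The function names, the wording of the message and the output layout are assumptions made for this sketch; try the exercise yourself before reading it.

```python
import sys


def count_lines(filename):
    """Return the number of lines in the named file."""
    with open(filename) as text_file:
        return sum(1 for _ in text_file)


def report(filenames):
    """Return the report as a list of strings, one per output line."""
    if not filenames:
        return ["Please provide one or more filenames."]
    lines = []
    total = 0
    for filename in filenames:
        count = count_lines(filename)
        total += count
        lines.append(f"{count} {filename}")
    lines.append(f"{total} total")
    return lines


if __name__ == '__main__':
    for line in report(sys.argv[1:]):
        print(line)
```

Returning the report as a list, rather than printing inside the loop, makes the counting logic easy to check when the file is imported.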