Now Code

Conditional Parsing

Overview

  • Teaching: 0 min
  • Exercises: 90 min

Questions

  • How do I parse a data file conditionally?

Objectives

  • Increasingly, codes use standard data files to simplify the process of interacting with them
  • However, this will not always be the case, and processing data files can take an unreasonable amount of time
  • Learning to write reusable, reliable parsers greatly reduces the pain
  • Python also helps to greatly ease the pain of parsing

Getting Started

This exercise requires you to clone the repository from: github.com/arc-bath/parsing. Make sure that the repository is not cloned into a directory or sub-directory of an existing git repository.

% git clone https://github.com/arc-bath/parsing.git

Once you have the repository change into the directory and run the tests in test_ts_parser.py

% cd parsing/src
% pytest test_ts_parser.py

You should see a lot of output from pytest because most of the tests fail. The final line should contain a summary:

======================================== 8 failed, 1 passed in 0.37 seconds =========================================

The aim of this exercise is to modify the function in ts_parser.py so that it passes all these tests. Let's begin by reducing the output produced by pytest so we can see more clearly what is happening:

% pytest --tb=short test_ts_parser.py

You should now see output that looks like:

================================================ test session starts ================================================
platform linux -- Python 3.6.3, pytest-3.2.1, py-1.4.34, pluggy-0.4.0
rootdir: /home/rjg20/training/arc-training/now-code-repos/parsing/src, inifile:
collected 9 items

test_ts_parser.py .FFFFFFFF

===================================================== FAILURES ======================================================
_______________________________________________ test_read_ts_coords2 ________________________________________________
test_ts_parser.py:32: in test_read_ts_coords2
    assert len(item) == 38
E   assert 0 == 38
E    +  where 0 = len([])
________________________________________________ test_read_structs_a ________________________________________________
test_ts_parser.py:64: in test_read_structs_a
    assert len(item) == 1
E   assert 0 == 1
E    +  where 0 = len([])
________________________________________________ test_read_structs_b ________________________________________________
test_ts_parser.py:81: in test_read_structs_b
    assert len(item) == 2
E   assert 0 == 2
E    +  where 0 = len([])
________________________________________________ test_read_structs_c ________________________________________________
test_ts_parser.py:99: in test_read_structs_c
    assert len(item) == 3
E   assert 0 == 3
E    +  where 0 = len([])
________________________________________________ test_read_structs_d ________________________________________________
test_ts_parser.py:119: in test_read_structs_d
    assert len(item) == 3
E   assert 0 == 3
E    +  where 0 = len([])
________________________________________________ test_read_structs_e ________________________________________________
test_ts_parser.py:142: in test_read_structs_e
    assert len(item) == 3
E   assert 0 == 3
E    +  where 0 = len([])
________________________________________________ test_read_structs_f ________________________________________________
test_ts_parser.py:171: in test_read_structs_f
    raise Exception('Expected empty line error not raised')
E   Exception: Expected empty line error not raised
________________________________________________ test_read_structs_g ________________________________________________
test_ts_parser.py:188: in test_read_structs_g
    raise Exception('Expected file termination error not raised')
E   Exception: Expected file termination error not raised
======================================== 8 failed, 1 passed in 0.37 seconds =========================================

The Problem

Writing code to process data files can take an inordinate amount of time. We will examine how to conditionally process a file of structures. While this example comes from simulation, it illustrates a much more general problem when dealing with data: how to read and write it. Even with standard formats easing the process of reading and writing, transforming data to meet the needs of our analysis will often be necessary.

You will read in structures (data sets) comprising element labels and coordinates in the format:

<Element label> <x_coordinate> <y_coordinate> <z_coordinate>

e.g.

A 0.0 0.0 0.0

where the data are separated by spaces. Each structure consists of an unknown number of elements and is terminated by a line reading

** [x_coordinate] [y_coordinate] [z_coordinate]

The coordinates are optional. The end of the set of structures is signified by the line:

## [x_coordinate] [y_coordinate] [z_coordinate]

again the coordinates are optional. While it may seem unnecessary, this line is important since it signifies that the previous step in our analysis completed successfully, i.e. the previous program didn't just stop midway through calculating the next structure.
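For illustration, a hypothetical file containing two structures (all labels and values here are invented) might look like:

```
A 0.0 0.0 0.0
B 1.0 2.0 3.0
**
C 0.5 0.5 0.5
##
```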

You will modify the code ts_parser.py so that it reads in the structures, according to the above syntax, and processes the resulting data into lists of lists:

  1. The element labels should be read into <elements>, raising an Exception('Empty line in file') if the line is empty
  2. The coordinates should be read into x, y and z and converted to floats, or float('nan') if not possible or not present
  3. Once read in you should then check that the file terminated correctly, raising an Exception('File termination Error') if it didn't
  4. Otherwise return your data in the form of a list:
return [ [[ elements ]],
         [[ x ]], 
         [[ y ]],
         [[ z ]] ]

You will then write a second function that processes the data to:

  1. Count how many Structures are in the data set
  2. Count the number of elements per structure
  3. Record the number of invalid structures and the indices of them.
  4. Return the result as a list with the format:
    return [ num_structs,
             [ elements_per_struct ], 
             [ invalid_structs, [ list_of_invalid_structs ]] ]
    

A series of tests will help you to identify when your functions are performing correctly.

1: Read in data, convert type and return as lists (of lists)

The first challenge is to read the data from files. A prototype function for reading in the structure file is given in ts_parser.py and a series of tests in test_ts_parser.py. These will check different components of your function and can help you, if required, to correct your code. As suggested in the introduction, the use of pseudocode or a flowchart will help to ensure the appropriate flow of your function.

Once you are reading in the file correctly, recall how we considered processing each line of a file in turn using:

with open(filename) as file:
    ...

you may find the following string method useful:

words_in_line = line.split()

which splits the string line, on whitespace by default, and returns a list of the 'words' in the string.
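For example (the line here is made up for illustration):

```python
line = 'A 0.0 0.0 0.0\n'
words_in_line = line.split()   # splits on any whitespace and drops the newline
print(words_in_line)           # ['A', '0.0', '0.0', '0.0']

# A blank line splits to an empty list, which is handy for the empty-line check
print(''.split())              # []
```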

The tests can be run with the command:

% pytest --tb=short test_ts_parser.py

Once you are correctly reading in the expected structure your function should pass the first 5 tests.

2: Introduce exceptions and handle missing data

You now need to put checks into your function to address the following possible issues:

  1. The element labels should be read into <elements>, raising an Exception('Empty line in file') if the line is empty.
  2. The coordinates should be read into x, y and z and converted to floats, or float('nan') if not present.
  3. Once read in you should then check that the file terminated correctly, raising an Exception('File termination Error') if it didn't.

You may find the following construct useful for one of the tasks:

while condition:
    body

The while construct combines a loop and a conditional, with the body being executed repeatedly while the condition remains True.
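As a toy illustration (the data below are invented), a while loop can consume lines until a terminator is seen:

```python
lines = iter(['A 0.0 0.0 0.0', 'B 1.0 1.0 1.0', '**'])

words = next(lines).split()
labels = []
while words[0] != '**':        # keep reading until the terminator line
    labels.append(words[0])
    words = next(lines).split()

print(labels)                  # ['A', 'B']
```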

The remaining 4 tests in test_ts_parser.py should pass once you are handling exceptions and missing data as required.
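If you are stuck, one possible control flow is sketched below. The function name, signature and internal variable names are assumptions for illustration; adapt them to the real prototype in ts_parser.py.

```python
def read_structs(filename):
    # Sketch only: assumes the prototype takes a filename and returns
    # the [elements, x, y, z] list described above.
    elements, x, y, z = [], [], [], []
    cur_e, cur_x, cur_y, cur_z = [], [], [], []
    terminated = False
    with open(filename) as file:
        for line in file:
            words = line.split()
            if not words:                      # a blank line is an error
                raise Exception('Empty line in file')
            if words[0] == '##':               # end of the whole data set
                terminated = True
                break
            if words[0] == '**':               # end of the current structure
                elements.append(cur_e)
                x.append(cur_x)
                y.append(cur_y)
                z.append(cur_z)
                cur_e, cur_x, cur_y, cur_z = [], [], [], []
                continue
            cur_e.append(words[0])
            for i, dest in enumerate((cur_x, cur_y, cur_z)):
                try:
                    dest.append(float(words[i + 1]))
                except (IndexError, ValueError):   # missing or unparseable
                    dest.append(float('nan'))
    if not terminated:
        raise Exception('File termination Error')
    return [elements, x, y, z]
```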

What else could break the function?

What other conditions could break the function? Think about how you could test for these and correct them.

3: Process data to count structures and number of elements

You will then write a second function proc_structs(struct), which will be passed the list you returned from your parsing function, to process the data to:

  1. Count how many Structures are in the data set
  2. Check that all elements contain the correct number of coordinates
  3. Return the result as a list with the format:
    return [ num_structs,
             [ elements_per_struct ],
             [ invalid_structs, [ list_of_invalid_structs ] ] ]
    

In this exercise you do not need to populate the lists related to invalid_structs.
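A minimal sketch of this step might look like the following (the argument layout is assumed to match the parser's return format above; the invalid-structure bookkeeping is left empty, as this exercise allows):

```python
def proc_structs(structs):
    # structs is the [elements, x, y, z] list returned by the parser
    elements = structs[0]
    num_structs = len(elements)                       # one entry per structure
    elements_per_struct = [len(s) for s in elements]  # labels per structure
    invalid_structs = 0                               # populated in exercise 4
    list_of_invalid_structs = []
    return [num_structs,
            elements_per_struct,
            [invalid_structs, list_of_invalid_structs]]
```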

The tests can be run with the command:

% pytest --tb=short test_ts_proc.py

Your processing function should pass the first 6 tests once it is working correctly, and may pass more depending on your implementation.

4: Identify erroneous structures

Finally, modify your processing function so that it checks whether each structure is valid, i.e. contains valid atom positions, and includes the number of invalid structures and their indices as a list in the return statement.

N.B. It may surprise you to find out the result of

assert float('nan') == float('nan')

Copy this into an IPython session and try to work out why it gives the result it does.

Fortunately, numpy provides the function numpy.isnan(value) (or np.isnan(value), depending on how you import it), which explicitly checks whether a value/variable is not a number.
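This behaviour is specified by IEEE 754 floating-point arithmetic: NaN compares unequal to every value, including itself, so an equality test can never detect it. A dedicated check is needed; the standard library's math.isnan works as well as numpy's:

```python
import math
import numpy as np

nan = float('nan')
print(nan == nan)        # False: NaN is unequal to everything, itself included
print(math.isnan(nan))   # True (standard library)
print(np.isnan(nan))     # True (numpy)
```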

How could the functions (and tests!) be improved?

Thinking about what we have previously said about functions and unit tests, what problems are there with the code we have created? How, if at all, could these be improved?

Key Points

  • Parsing and processing files to check that they are consistent with an expected format is a common task.
  • You can mitigate this by using standard formats, but it remains a useful exercise combining conditional processing and error handling.
  • You will see that even with a relatively simple file format we pass a lot of data around.
  • Appropriate use of modules, libraries and classes can help you structure your code to reduce the need for this and improve the clarity of your programs.
  • In many programming languages, nan != nan.