Now Code

Processing Text

Overview

  • Teaching: 0 min
  • Exercises: 90 min

Questions

  • How can I access files from the web?
  • How can I mine these files for data?

Objectives

  • Be able to access pages/files on websites
  • Process these file to split them for analysis
  • Perform analyses on the data contained

Getting Started

This exercise requires you to clone the repository from: github.com/arc-bath/text_proc. Make sure that the repository is not cloned into a directory or sub-directory of an existing git repository.

% git clone https://github.com/arc-bath/text_proc.git

Once you have the repository change into the directory and run the tests in test_openpage.py

% cd text_proc/src
% pytest test_openpage.py

You should see the following output from pytest:

================================================ test session starts ================================================
platform linux -- Python 3.6.3, pytest-3.2.1, py-1.4.34, pluggy-0.4.0
rootdir: /home/rjg20/training/arc-training/now-code-repos/text_proc/src, inifile:
collected 1 item

test_open_page.py F

===================================================== FAILURES ======================================================
____________________________________________________ test_rjg20 _____________________________________________________

    def test_rjg20():
        link = "https://people.bath.ac.uk/rjg20/index.html"
        filename = "..-index.html"

        with open(filename) as file:
            expect = file.read().splitlines()

>       link_str = wt.open_page(link)
E       AttributeError: module 'web_text' has no attribute 'open_page'

test_open_page.py:10: AttributeError
============================================= 1 failed in 0.02 seconds ==============================================

Note there is only one test in this file.

In this exercise you will learn to open files from the web and set up analyses of the text they contain. First of all let's introduce the library we will use in this exercise. Open an ipython session and enter the following code:

import urllib.request
link = "http://www.bath.ac.uk/homepage"
file = urllib.request.urlopen(link)
page = file.read().decode()
print(page)

What do you think the following has done? Unless you are familiar with html the output will seem quite odd. However if you open the webpage http://www.bath.ac.uk/homepage and 'view page source' you should see that these are one and the same.

We will not learning about html or how to process files in this exercise but thought this useful to illustrate that urllib can be used to access some webpages as well as static files.

1: Write an open_page function

Create a new file web_text.py and write a function open_page that opens a specified web address and returns the result as a string. Once you have done this experiment with the function using different pages and see what happens. Also check how the page/file is returned and think about whether this is the most useful format.

2: Write function to count occurences of words

Now we will write functions to count the number of occurences of specific words in a given string. First however add a new function to web_text.py a function count_words that when passed a stringreturns to total number of 'words' it contains. In this context by word we mean a set of characters separated from another set by space or newlines.

Once your function is working correctly you should pass the first 3 tests in test_count.py, remember you can run pytest --tb=short <test_file>.py to reduce the output.

Next we want to do something more interesting, count the number of occurrences of a particular word, using a function count_occs. This will take two parameters, a string, the text you want to search, and a list of strings, containing the different words you want to search for. The function should return the number of occurrences as a list, in the same order as the list provided by the user.

Hint: <my_string>.count('word'), count the number of occurences of 'word' in <my_string>.

The tests use text in the link "http://www.gutenberg.org/cache/epub/2264/pg2264.txt", understanding what the tests are doing, by working through them, and knowing what the file is may help to illustrate what the aims of the code developed in the exercise might be. Explore using your library to explore the contents of the file as well as completing the exercises.

Once you have count_occs implemented correctly the remainder of the tests in test_count.py should pass.

3: Split and strip string

Write a function split_strip_text() which takes two strings, the text you wish to process and the key word that will be used to split the string. By inspecting the test for this exercise test_split.py, make sure you understand what the tests are using your functions to do.

Once your function is working it should pass the first test. Now use your functions to open the text in the above link in an interactive session and use your function to split the on the key word used in the test 'Actus'. Inspect the result of your function to understand why we might want to strip out the first, or other substrings that your function returns.

Modify your function so that it now takes two variables start and stop and give both 0 by default. If neither is passed your function should work as before returning all the substrings as before. If either (or both) is provided by the user these should be used as the start and stop indices of the list of substrings returned by your function. As with normal Python behaviour when slicing lists, your user should understand that if provided the start index will be the first returned, and stop the first not returned and this should be reflected in your documentation.

Once you have correctly implemented the strip, or slice functionality your function should pass the remainder of the tests in test_split.py.

Bearing in mind the properties of lists and slicing them, what should you also consider putting into your function?

Sandpit

Now that you have passed developed your functions, see if you can combine them to obtain lists of how many lines the key characters given in a list, e.g. ['Macbeth', 'Macduff', 'Lady Macbeth'] have in each Act of the play, write this code into a new function. Now in an interactive session of Notebook, import your library and plot the data to identify how the character's roles develop. Run the analysis on other books/plays of your choice from the Gutenberg Project.

Key Points

  • urrlib.request can be used to access pages from the internet
  • Although many libraries exist for processing text it is vital that you understand what they are doing
  • You have readily produced a library that can perform analysis of text files.