This exercise requires you to clone the repository from: github.com/arc-bath/text_proc. Make sure that the repository is not cloned into a directory or sub-directory of an existing git repository.
% git clone https://github.com/arc-bath/text_proc.git
Once you have cloned the repository, change into its src directory and run the tests in test_open_page.py:
% cd text_proc/src
% pytest test_open_page.py
You should see the following output from pytest:
================================================ test session starts ================================================
platform linux -- Python 3.6.3, pytest-3.2.1, py-1.4.34, pluggy-0.4.0
rootdir: /home/rjg20/training/arc-training/now-code-repos/text_proc/src, inifile:
collected 1 item
test_open_page.py F
===================================================== FAILURES ======================================================
____________________________________________________ test_rjg20 _____________________________________________________
    def test_rjg20():
        link = "https://people.bath.ac.uk/rjg20/index.html"
        filename = "..-index.html"
        with open(filename) as file:
            expect = file.read().splitlines()
>       link_str = wt.open_page(link)
E       AttributeError: module 'web_text' has no attribute 'open_page'

test_open_page.py:10: AttributeError
============================================= 1 failed in 0.02 seconds ==============================================
Note that there is only one test in this file, and it fails because the web_text module does not yet define open_page.
In this exercise you will learn to open files from the web and set up analyses of the text they contain. First, let's introduce the library we will use. Open an IPython session and enter the following code:
import urllib.request
link = "http://www.bath.ac.uk/homepage"
file = urllib.request.urlopen(link)
page = file.read().decode()
print(page)
What do you think this code has done? Unless you are familiar with HTML, the output will seem quite odd. However, if you open the webpage http://www.bath.ac.uk/homepage in a browser and choose 'view page source', you should see that the two are the same.
We will not be learning about HTML or how to process such files in this exercise, but it is useful to illustrate that urllib can be used to access some webpages as well as static files.
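The session above can be wrapped into a small reusable function. The following is only a minimal sketch: the name fetch_text and its default_encoding parameter are hypothetical and are not part of the text_proc repository. It assumes the server either declares a character set in its Content-Type header or sends UTF-8.

```python
import urllib.request


def fetch_text(link, default_encoding="utf-8"):
    """Return the body of `link` as a string.

    Hypothetical helper for illustration; not part of text_proc.
    """
    # The with-statement closes the connection even if reading fails.
    with urllib.request.urlopen(link) as response:
        raw = response.read()  # bytes, exactly as sent by the server
        # Use the character set declared in the Content-Type header, if any;
        # otherwise fall back to the assumed default.
        encoding = response.headers.get_content_charset() or default_encoding
    return raw.decode(encoding)
```

Calling fetch_text("http://www.bath.ac.uk/homepage") should return the same text that the session above printed. The one difference is that the encoding is taken from the response headers where available, rather than relying on the default used by decode().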