Text Analysis in Python with HTRC and NLTK

If you are not planning to use any data from HathiTrust's HTRC collection and are interested in doing text analysis with Python on a CSV file, skip the HTRC and demo sections.

In this guide, our primary aim is to unravel the complexity of extracting text from the 'description' column in HTRC data sets. This column often contains rich, descriptive language that depicts various locations, authors, and other essential details. By extracting and examining this data, we can gain fascinating insights into the use of language across diverse contexts, thereby aiding in literary analysis, sociolinguistic studies, and computational modeling. This guide will walk you through the process of using the HTRC library for this purpose, with step-by-step examples and detailed explanations, making it accessible even to beginners in Python or textual data analysis.

Whether you are a researcher looking for an efficient way to conduct text mining, a data scientist eager to venture into the world of computational linguistics, or a Python enthusiast interested in exploring the vast possibilities of text manipulation, this guide is your key to unlocking the potential of the HTRC Python library. Let's dive in and begin our journey into the realm of language and text analysis with HTRC.

Official HTRC docs: if you are already an experienced programmer, this guide is optional. Referencing the docs might be more helpful.

The original Python Exploration setup uses Jupyter as the IDE (Integrated Development Environment, the editor where you write and run code), which is what the official HTRC library docs recommend. Here, however, we will walk you through the setup process in Visual Studio Code (VS Code), a flexible and widely used code editor that supports a multitude of programming languages and offers an array of useful extensions. VS Code is particularly well suited to larger projects and collaborative programming, and it integrates well with version control systems such as Git. The HTRC library, along with a few other dependencies, will be installed using pip, the Python package installer.

  • Install Visual Studio Code: Download and install VS Code from the VS Code Download Page.

  • Python Extension: Install the Python extension in VS Code from the Extensions marketplace.

  • Install Python: If not already installed, download and install the latest Python version from the official website.

  • Install HTRC Python Library: Open the terminal in VS Code (View -> Terminal) and execute: pip install htrc. The Feature Reader module (htrc_features) used later in this guide is distributed separately, so you will likely also need: pip install htrc-feature-reader

  • Install Additional Dependencies: Execute the following in the VS Code terminal: pip install pandas matplotlib nltk

Congrats! You just finished setting up your environment. If you encountered any issues during the setup, please contact Le Lyu at lyule@bc.edu for further help.
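
To double-check that everything installed correctly, you can run a short import check. This is only a minimal sketch; it assumes the packages from the steps above (including htrc-feature-reader, which provides the htrc_features module used below) installed without errors.

# Quick sanity check that the libraries used in this guide can be imported
import pandas
import nltk
import matplotlib
from htrc_features import FeatureReader  # provided by the htrc-feature-reader package

print("pandas:", pandas.__version__)
print("nltk:", nltk.__version__)
print("Imports succeeded -- your environment is ready.")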

If you are a complete beginner, try creating the simplest possible Python program first. Create a file named helloWorld.py in VS Code, type print("hello world!") in the editor, and run the program by clicking Run in the menu (or the start icon in the upper right corner). If you see hello world! (or whatever you printed) in the terminal, your VS Code and Python are set up correctly.
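
For reference, the whole file is just one line:

# helloWorld.py -- prints a greeting to the terminal
print("hello world!")

You can also run it from the integrated terminal with python helloWorld.py.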

Let's begin playing around with HTRC! 

The official docs provide a template folder containing data for us to play around with; you may either try it or go straight to the API documentation. Download the introductory template here, then unzip the folder and open it in VS Code.

Create a new Python file, then begin testing the HTRC library.

Here is the code I used. Copy and paste it to see the result for yourself!

# Import the necessary libraries
from htrc_features import FeatureReader
import os
import matplotlib.pyplot as plt
import pandas as pd

# Define the paths to the sample files
paths = [os.path.join('data', 'sample-file1.json.bz2'),
         os.path.join('data', 'sample-file2.json.bz2')]

# Create a FeatureReader object to read the volumes
fr = FeatureReader(paths)

# Loop through each volume and print its title
for vol in fr.volumes():
    print(vol.title)

# Get the first volume in the FeatureReader object
vol = fr.first()

# Print the title of the first volume
print("first volume title: ", vol.title)

# Print the id of the first volume
print("id of vol: ", vol.id)

# Get the number of tokens per page for the first volume
tokens = vol.tokens_per_page()

# Print the first few rows of the token count DataFrame
print(tokens.head())

# Plot the token count data
tokens.plot()

# Show the plot
plt.show()

To truly understand what's going on in this code, refer to this section of the docs.
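
If you would like to poke at the result a little more before moving on, tokens_per_page() gives you an ordinary pandas object, so the usual pandas methods apply. Here is a small sketch building on the tokens variable from the script above (whether it comes back as a Series or a one-column DataFrame may depend on the library version):

# tokens comes from the script above; squeeze() gives a plain Series either way
counts = tokens.squeeze()

# Show the pages with the most tokens
print(counts.sort_values(ascending=False).head())

# Save the per-page counts for later use
counts.to_csv('tokens_per_page.csv')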

By now, you have had a taste of both Python and the HTRC library. Now let's see how these tools can be applied in the World's Traveler Project. How can we extract information from the dataset?

We will be using one of the datasets from "Women Who Ruled". Our objective is to pull the Description column out of the dataset and examine the kinds of language used to describe different locations (or authors).

(Path: https://github.com/BCDigSchol/digitalCertificate/blob/main/womenwhoruled/data/data.csv)

First, this is a CSV file, so we need to convert it into a format that HTRC can recognize. You might try converting the CSV into a DataFrame and then casting it into the Volume object used by HTRC, in order to apply the APIs the library provides. For example, you might try this:


import pandas as pd
from htrc_features import Volume

# Read a CSV file into a pandas DataFrame
df = pd.read_csv('women_who_ruled.csv')

# Convert the DataFrame to a dictionary
data = df.to_dict(orient='list')

# Create a Volume object from the dictionary
vol = Volume(data)

All good so far. HOWEVER, as soon as you try to apply the same HTRC library methods that we used above, such as tokens_per_page(), you will inevitably encounter an error. This is because the HTRC library is designed to work with HathiTrust's own datasets, in which each volume is identified by its own ID. Unless the dataset comes from an HTRC source, we have to consider other ways to work with it.
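
Before reaching for another library, it can be helpful to inspect the CSV directly with pandas and confirm which columns you actually have. A quick sketch (the file name matches the one used above):

import pandas as pd

# Load the dataset and look at its structure
df = pd.read_csv('women_who_ruled.csv')

# List the available columns and preview the 'Description' text we care about
print(df.columns.tolist())
print(df['Description'].head())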

Enter NLTK (the Natural Language Toolkit).

The Natural Language Toolkit (NLTK) is a popular Python library for text analysis. Let's begin by looking at one example:


import pandas as pd
import nltk
from nltk.tokenize import word_tokenize

nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_ne_chunker')
nltk.download('words')

# Load your data
df = pd.read_csv('women_who_ruled.csv')

# 'Description' is the column with the travel journals
text_data = df['Description']

# Tokenize the words in each journal entry
tokenized = text_data.apply(word_tokenize)

# NLTK's POS tagger can classify words as places (NNP or "proper noun") among other things
tagged_sentences = tokenized.apply(nltk.pos_tag)

# Named Entity Recognition (NER)
# We'll use the ne_chunk function to identify named entities in each tagged sentence


def extract_locations(tagged_sentence):
    locations = []
    for chunk in nltk.ne_chunk(tagged_sentence):
        if isinstance(chunk, nltk.tree.Tree) and chunk.label() == 'GPE':
            locations.append(' '.join(c[0] for c in chunk))
    return locations


# Apply our function to each tagged sentence to extract locations
df['Extracted_locations'] = tagged_sentences.apply(extract_locations)

# If you want to save the results to a new CSV
df.to_csv('output_file.csv', index=False)

Put this code in VS Code, provide a CSV with a 'Description' column, run the program, and you should get an 'output_file.csv' file with an 'Extracted_locations' column. That's pretty much what this program does. But let's break it down even further.

1. Import required libraries and download NLTK datasets: the program starts by importing the necessary libraries and downloading the NLTK data packages (the punkt tokenizer, the POS tagger, and the named-entity chunker models). `pandas` is used for data handling; `nltk` and its `word_tokenize` function are used for tokenization and POS (Part-of-Speech) tagging.

2. Load Data: The program then loads a dataset named 'women_who_ruled.csv' into a DataFrame using `pandas`. The DataFrame, `df`, represents a structured, tabular dataset. The 'Description' column of this dataset, which presumably contains travel journals, is stored in `text_data`.

3. Tokenization: Each journal entry in `text_data` is then tokenized using NLTK's `word_tokenize` function. Tokenization is the process of breaking a text passage down into smaller pieces such as words or terms; each entry becomes a list of these tokens (see the short worked example after this list).

4. POS tagging: The program then applies NLTK's POS (Part-of-Speech) tagger to each tokenized sentence. This function labels each word in a sentence with a grammatical category (like noun, verb, adjective, etc.)

5. Named Entity Recognition (NER): Then, the script applies NER to identify named entities in each tagged sentence. Named entities are typically noun phrases that refer to specific types of individuals or entities, such as persons, organizations, or locations. Here, it's being used to identify locations (GPE or Geo-Political Entities).

6. Save Results: Finally, the script saves the DataFrame, now including the newly added 'Extracted_locations' column, into a new CSV file 'output_file.csv'.
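
To make steps 3 through 5 concrete, here is a small worked example on a single made-up sentence (it assumes the NLTK downloads from the script above have already been run; the exact tags and entities you get can vary with the models):

import nltk
from nltk.tokenize import word_tokenize

# A made-up sentence standing in for one 'Description' entry
sentence = "She traveled from Cairo to Istanbul in 1892."

# Step 3: tokenization -> ['She', 'traveled', 'from', 'Cairo', 'to', 'Istanbul', 'in', '1892', '.']
tokens = word_tokenize(sentence)

# Step 4: POS tagging -> pairs like ('traveled', 'VBD'), ('Cairo', 'NNP'), ...
tagged = nltk.pos_tag(tokens)

# Step 5: NER -- keep only chunks labeled 'GPE' (geo-political entities)
tree = nltk.ne_chunk(tagged)
locations = [' '.join(word for word, tag in chunk)
             for chunk in tree
             if isinstance(chunk, nltk.tree.Tree) and chunk.label() == 'GPE']

print(locations)  # something like ['Cairo', 'Istanbul'], depending on the model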

In summary, this program reads in a dataset containing descriptions, tokenizes and tags the words in the descriptions, identifies locations from the tagged words, adds these locations to the original dataset, and saves the updated dataset as a new CSV file.

You can either try this program on other WT&D datasets or explore the NLTK API documentation further; it will be a valuable asset for text/data analysis.
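
As one possible next step, you could count how often each extracted location appears across the descriptions, which begins to show the geographic focus of the collection. Here is a small sketch building on the output_file.csv produced above (the column name is assumed to match that script; note that the lists come back from the CSV as strings, so they need to be parsed):

import pandas as pd
import ast
from collections import Counter

# Read back the file written by the script above
df = pd.read_csv('output_file.csv')

# Each cell holds a string like "['Cairo', 'Istanbul']"; turn it back into a list
location_lists = df['Extracted_locations'].apply(ast.literal_eval)

# Count every extracted location across all rows
counts = Counter(loc for locs in location_lists for loc in locs)
print(counts.most_common(10))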