What EXACTLY Python does

Given my impassioned proselytization of digital history, it’s not surprising that I received an email from a colleague asking a reasonable, yet hard-to-answer question: “what EXACTLY does Python do?” His name’s not important, but let’s call him “Bob.”

So if you’ve bothered reading any of my recent posts, and just wish I’d start from the beginning (or shut up, already), here’s my take on Python and computer programming, geared towards computer-knowledgeable people who haven’t programmed before. This is the perspective of someone who’s been a digitally-savvy humanities-type for thirty years, but has only recently dived into learning a computer programming language from scratch. Someone, keep in mind, who’s never gotten the official version from a computer science course, but who does teach a digital history course. So it’s therefore a focus on all the small and medium-sized things you can have your computer do for individual historical research – the digital historian’s low-hanging fruit.

What EXACTLY does a computer program do?

Python is a “high-level” general-purpose programming language. Which means, it uses a syntax that is ‘relatively’ readable, unlike the machine code used by the Matrix, and it can do ALL sorts of things with just about any kind of data you can imagine. But, unlike the Matrix, you can’t really bend or break the rules. At least I can’t.

[Image: green binary code, Matrix-style. Caption: NOT Python code]

So asking what ‘exactly’ a programming language like Python does doesn’t emphasize enough the fact that it does just about anything you could want, ASSUMING you have some digital information, AND assuming you want to do something with/to that data that can be translated into a series of discrete steps (an algorithm). In other words, there’s a lot of gold in them thar’ hills. But it’s only worth prospecting if you have (or can get) a lot of digital data, and need to repeat the same kinds of manipulations and analysis over and over: for each character/word/paragraph/document, I want to do X, Y, Z; if it says ‘blah’ I want to… (I just introduced you to for-loops and if-statements, in case you didn’t notice.) As long as the above two assumptions are met (nice to meet you, Boolean value = True), you’re golden.
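
To make that “for each word… if it says ‘blah’” pattern concrete, here is a minimal sketch. The sample sentence and the word being counted are invented purely for illustration.

```python
# A made-up sentence standing in for "some digital information".
text = "the siege began and the siege ended before the relief army arrived"

# Step 1: split the string into a list of words.
words = text.split()

# Step 2: for each word (a for-loop), check a condition (an if-statement).
siege_count = 0
for word in words:
    if word == "siege":
        siege_count += 1

print(f"{len(words)} words, of which {siege_count} are 'siege'")
```

Swap in a different string, or a whole file’s worth of text, and the same two steps still apply.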

Imagine that you were asked “what exactly does Microsoft Excel do?” Instrumentally, you’d probably reply that it allows you to type in numbers, edit the numbers with features like copy and paste, format the numbers in a variety of ways, perform mathematical calculations on the numbers, and print out or email the results. Or, you could answer the question by giving specific examples of projects you could do: Excel allows me to balance my budget; it lets me keep track of student grades, or my fantasy league stats. Good answers all, but just the tip of the iceberg. You could create an entire Excel spreadsheet that doesn’t have a single number in it. You could use Excel to lay out a particularly complicated table that Word’s Table feature chokes on, or that Word generally makes a pain in the arse to edit. You could, like I did twenty years ago, use Excel as a go-between when transferring Word notes into a note-taking Access database. (See! that Python post does have some utility after all.) You could even use Excel to rename a whole folder full of files based off another list, e.g. rename all those EEBO-TCP files you downloaded, which have names like A10352.txt. In a sense, that literally has almost nothing to do with spreadsheets – you delete the Excel file when you’re done, because it was just a means to an end. In other words, what “exactly” Excel does is limited most broadly by the features built into the application, but, more practically, it depends on the vision and technical expertise of the person using it.

Same with a computer programming language. But the Python canvas starts out with nothing at all on it, not even a grid of empty cells. Intimidating, but its tabula rasa lets Python deal with all sorts of different types of information, and allows you to manipulate them in an almost infinite number of ways, because you can keep building layer upon layer of functions on top of each other, drawing in data and inputs from all sorts of places, and sending them wherever you want. So, if you can think of a specific series of commands you invoke using some packaged software like Word or Excel or your web browser, you can (probably) recreate that workflow in Python. But it’s not just about replication; I see three main advantages to coding over a packaged piece of software:

  1. Coding allows you to totally control the inputs, the manipulations, and the outputs. Don’t like the way a program formats the results? Change it. Wish the summary feature would also include your favorite statistic? Include it. Hate how the program always does this one annoying thing? Fix it.
  2. Performing the task will be automated with Python, so you could run it thousands of times, in a few seconds (or more, depending on the complexity of the task), without getting carpal tunnel syndrome. Some packaged programs allow you to install or use macros to ‘record’ these repeated actions, and programming is like that, but so much more, because you have so much more control, and a much larger toolbox to draw upon. The result is a quantitative increase in speed, but that quantitative increase is really a qualitative advance – you can do things that you’d never do otherwise, because the computer does most of it for you.
  3. You don’t need to purchase, learn and maintain the dozen different programs that would be needed to do (most of) the data analysis you can perform with Python. Nor do you need to worry about your specialized programs disappearing into history, as happens shockingly often unless your name is Microsoft or Apple or Adobe. Nor do you need to worry about radical changes from version to version. If you’ve experienced several generations of Microsoft Office or Blackboard or, heaven help you, mobile apps, you know what I mean. Python will definitely require you to keep an eye on changing Python library versions, and possible incompatibilities that might result. But at least with Python, you can create virtual environments so that you can run different versions of the same libraries on the same machine. And there are even web environments that allow you to set all of that up for the user.

Python, in other words, allows you to create workflows that automate a lot of the processing and analysis steps you’d have to do manually, possibly in many different programs. Input data, manipulate it in any number of ways, and then output it.

But it definitely comes with a learning curve, I won’t lie. A learning curve that is currently a lot harder for humanists, in my humble opinion, because we are not the primary users, and therefore we don’t have many pre-made, history-specific libraries with functions already designed. As a result, I’ve spent most of my time learning the fundamentals; the eye candy of maps and charts is much better served by dedicated libraries that require less programming knowledge and more domain expertise. My experience over the past several months has reinforced three important programming principles that I’m sure Programming 101 courses emphasize:

  1. Think through the nature of your data at each point in your code.
  2. Think through each logical step in your algorithm.
  3. Pay constant attention to the numerous syntax rules.

Ignoring any of the above will lead to errors: syntax errors before the code even runs, runtime errors while it runs, or, worst of all, wrong results with no error message at all. Fortunately, you can google most error messages and probably figure out the problem, since it will likely be a violation of one of the above principles. Are you trying to use string methods on a list item? Are you expecting your iterator to be an iterable? Do you really want to put the print function inside that loop? Are you blindly copying an online example without considering how your case is different? And, as I keep telling my students, attention to detail matters, whether it’s a missing close quote or a mismatched variable name. But the computer will usually tell you if you make a mistake, you learn these things over time (and build up working code that you can copy or use as functions), and practice makes perfect.
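
The first of those questions, using a string method on a list, is easy to reproduce. This small sketch (the names in the list are just sample data) triggers the mistake on purpose, catches the error message, and then shows one way to fix it.

```python
words = ["Marlborough", "Vauban", "Coehoorn"]

# Mistake: .lower() is a string method, but `words` is a list.
try:
    words.lower()
except AttributeError as error:
    message = str(error)
    print("Python complained:", message)

# Fix: apply the string method to each item in the list instead.
lowered = [word.lower() for word in words]
print(lowered)
```

The message Python prints names both the type of object you actually have and the method you tried to call on it, which is usually enough of a clue to find the problem.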

What EXACTLY does Python do for Scholars?

For scholars, a programming language like Python can help manipulate and analyze whatever text, numbers, image, video and sound information we have in digital form. Fortunately, it’s not as minimalist as all that, thanks to a few thousand free libraries (or modules) that you import into the basic Python package. The libraries do most of the hard work behind the scenes, i.e. most of the coding has already been done by the libraries’ authors, so you just plug and play with some standard commands, your specific data, the specific variables you create, and the specific commands combined in the specific order you choose. Figure out how to get your information into the code (i.e. the computer’s) memory in a specific format based on its structure (is it a string? a list? a dictionary?…), manipulate the resulting data (maybe you convert the string to a list, after replacing certain features and adding others), pass it on to the next block of code that does something else (now that it’s a list, you can loop through each item and count those that start with the letter ‘q’), do more things to the list and the count (maybe if the count exceeds a certain threshold, you send that word to another list), then pass any of those results on to the next block of code, until you end up with what you want. To give a simple real-world example: maybe you’ve started with a long string of text (like this paragraph here), you tokenize it into a list of individual words (deciding how you want to deal with punctuation and contractions), then count the words according to the letter they start with, then plot a histogram showing the frequency of each letter.
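
That last example can be sketched using only Python’s standard library. This is one possible version, not the only way to do it: the sample sentence is invented, the tokenizing rule is a deliberately simple choice, and the “histogram” is just rows of # characters.

```python
import re
from collections import Counter

# A short sample sentence; any longer string of text would work the same way.
text = "Python can help manipulate and analyze whatever text we have in digital form."

# Tokenize: lowercase the text and keep runs of letters and apostrophes.
# This treats "don't" as one word and drops all other punctuation.
tokens = re.findall(r"[a-z']+", text.lower())

# Count the words according to the letter they start with.
first_letters = Counter(token[0] for token in tokens)

# Print a crude text histogram, one row per letter, in alphabetical order.
for letter, count in sorted(first_letters.items()):
    print(f"{letter} {'#' * count}")
```

For a proper chart, you could hand the same counts to a plotting library such as matplotlib instead of printing them.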

Python is also attractive to scholars because it’s free – insert joke about poor professor here. Its costlessness and open source ethos have encouraged hundreds of people to create free general libraries focusing on particular types of data and particular types of analysis, along with specialized domain libraries for astronomers, for geographers, for audiologists, for linguists, for stock market analysts… There is also a massive number of tutorials available online. Every year there are a dozen Py conferences held all over the world, and several hundred of the presentations are available on YouTube, including numerous 3-hour tutorials for beginners. You can also check out the Programming Historian website, which has numerous examples in Python. There are numerous cautions with programming (in our case, use Python 3+, not 2.7…), but there are lots of resources that discuss those. Plenty of ways to get started, in other words.

A final benefit of particular importance for humanities-types is Python’s ability to convert words into numbers, usually behind the scenes, and highlight patterns using various statistical properties of text. Such powerful text functions allow businesses to data mine tweets and online content; business demand seems to have juiced computer science research, leading to lots of advanced natural language processing (NLP) features, on top of those driven by the (older) linguistic and literary interests of academics. So, if you have a lot of digitized text or images and you want to clean/analyze them beyond just reading each document one by one, or manually cycling through search results, one… hit… at… a… time…, then Python is worth a look.

What EXACTLY does Python do for this Historian?

So here’s a list of the python projects I’ve been working on, and those I will be working on in the future. A few are completed, a few have draft code, a few have some ideas sketched out with snippets of code, and a couple are still in the fantasy phase. Many use the standard functions of off-the-shelf libraries, while others require a bit more custom coding. But they all should be viable projects – time will tell.

  • Semi-automate a book index: Find all (okay, maybe most) of the proper nouns in a PDF document, along with which PDF page each occurred on, then combine them together into a back-of-the-book index format. If you don’t want to pay $1000 to have your book professionally indexed, you could use Word’s or Adobe’s indexing feature, which requires you to go through every sentence and identify which terms will need to be indexed. Or, you can get 85% of that with Python’s NLP (natural language processing) libraries, or you can import in a list of people/places/events and it will find those for you. As with all programs, things will get complicated the more edge and corner cases you try to address: Do I need to include a “he” on the next page with the full name on the previous page? Do I just combine together all consecutive pages into 34-39, or do I need to judge the importance of the headword to each page’s discussion? Tough questions, but this code will, at the least, give you a basis from which to tweak. And, judging from recent indexes in books published by highly-reputable academic presses, nobody cares about the ‘art’ of indexing anymore: some are literally headwords with a giant undifferentiated list of several dozen page numbers separated only by commas; some don’t even provide any kinds of topics, only proper nouns. Of course, the index may well die as more works are consumed digitally…
  • Web scraping: Automate downloading content from a website, either the text on pages, entries in a list, images of battle paintings from Wikipedia, or linked files… Maybe automate the SPARQL query on historical battles I posted about awhile back. Or download a bunch of letters from a site that puts each letter on its own separate page. Maybe automate scraping publication abstracts from a website based off records in Zotero. (with a library like beautifulsoup)
  • Web form entry: I’d like to create code that would automate copying bib info from Zotero (author, title, date, pages, etc.) and then paste it into our library’s online ILL form in the respective webform fields, which, of course, aren’t in the same order as the Zotero field order. That means a bunch of cutting and pasting for every request.
  • Look up associated information on an entity (person, place, organization) with Linked Open Data, e.g. find the date of birth for person X mentioned in your text document via the web. (rdflib)
  • [Added by request – get it?] APIs: Numerous institutional websites let you access their online data more directly through an API (Application Programming Interface), rather than using brute-force scraping to harvest information from their pages’ HTML. An API allows more sophisticated downloading of information. You can search a site for ‘API’ to see if it offers one. (requests)
  • Convert information into data: My two previous posts on AHA department enrollments and parsing long notes illustrate how this can be done with Python. Code that converts information into data is particularly important low-hanging fruit for historians, since we lack a lot of already-digitized datasets – this kind of code allows us to create them with our own data.
  • Clean dirty OCR text: Correct OCR errors, and generally make a document more readable by humans and computers. A good, detailed description is Ryan Cordell’s Q i-jtb the Raven: Taking Dirty OCR Seriously. This requires a lot of hands-on work with the code, which I’ve been doing of late. E.g. find every occurrence of ‘Out-work’ and convert it to ‘outwork’ – so we can count them all the same way. Find every misOCRed ‘Mariborough’ and convert it to ‘Marlborough’ – there are big lists of common errors available to make this search-and-edit process a bit more precise. But since you’ll never guess every word that might be hyphenated due to line endings, finding every hyphenated word (like ‘be- siege’) and converting it back (to ‘besiege’) is easy enough, with regular expressions. You can even create a list of all the changes your code makes (i.e. create a dictionary with each ‘mistake’ and what it was changed to), if you want to audit the process. More difficult are the Questions of Capitalizations (case), especially when our 17-18C Authors liked to capitalize lots of Nouns, yet modern NLP uses Capitalization as one of its Clues for identifying Proper Nouns (Named Entity Recognition). Ideally you’d have this code as a series of functions so that you can run the various corrections across an entire folder of documents, based on what that source needs. Then, you could use more code to check for any other errors or documents that require special handling.
    I’d argue that this is currently the most critical area for digital history, the bottleneck, in fact. So few historians have their own sources in clean full text, yet it’s also a very idiosyncratic thing to program, based on historically-variant word usage and widely-varying source genre vocabulary, as well as the sometimes-random errors derived from OCRing irregularly set type from a few hundred years ago. (It’d be great if OCR accuracy rates were 100%, but that ideal would seem to require having high-quality scans of the originals, which your average scholar does not have, and will likely never acquire, because we don’t actually own the originals). As a result, cleaning historical OCRed text is the area with the least amount of pre-made code available – big projects like EEBO and ECCO paid cheap foreign labor to type theirs by hand. We should also note that lots of social scientists, for example, talk about ‘preprocessing’ text, but by that they mean standardizing the spelling of text that’s been born digital, making everything lowercase, stripping out punctuation, etc. Historians need a lot of pre-preprocessing first because we are dealing with imperfectly OCRed text. And if we want to retain a cleaned text copy of the original, and not just atomize the text string into a list of word tokens, then it’s even more complicated. Suggestions welcome! Ted Underwood has provided some useful ideas in various venues dealing with big data (10,000s of texts), but his cleaning code on GitHub is a bit above my skill level.
  • Quantitative analysis: Load your spreadsheet into the pandas library and run stats, plot charts… (pandas, matplotlib, seaborn, bokeh for interactivity)
  • Make interactive visualizations, create interactive websites with your data, and so on. (bokeh…)
  • Create a visual timeline, drawn from data extracted from a document. Possibly interactive. Haven’t explored this yet, but I will.
  • Mapping: Quickly make maps, including small multiples, in whatever projection, with whichever features you want to display. Look up coordinates (aka geocode) and calculate spatial/topological relationships… It’s good to see that there are several international project teams working on historical gazetteers, and a number of groups have georeferenced some early modern maps as well. (geopandas, cartopy…)
  • Network analysis: create a network ‘graph’ (diagram) of entities and relationships between those entities (nodes and edges), and measure the topological properties of the network, who are the hubs, the spokes, the nodes… (networkx)
  • Relational database interactions with SQLite and MySQL. (sqlite3 for SQLite; MySQL needs a separate connector library.) I believe you can do the same with graph databases (triples of subject-predicate-object), but I haven’t really looked at those.
  • Zotero: Update and manipulate your Zotero records. I can already read Zotero data into Python and hand it off to other libraries for analysis, as well as go in the other direction, e.g. mass update fields back in Zotero. But I’d like to create code that will take a PDF page (or two) of a book’s table of contents and enter a separate record for each chapter in the book into Zotero, automatically adding in all the other book info. (pyzotero)
  • Dates: Automatically calculate duration between two dates, convert between OS and NS and other historical calendars, look up “last Tuesday’s” date when mentioned in a letter written on July 7, 1700 – you could probably automatically insert that date into the source document if desired. Maybe even do a quick calculation to see how few days you have left on your sabbatical… (calendar, datetime, dateutil, convertdate, dateparser, arrow)
  • Textual analysis: This is a biggie for historians. Create a corpus of texts in your area; create a list of people, places and events to use for extraction; cluster works (or segments) together by the topic they discuss; see how often different authors/texts use particular words/phrases; identify which words tend to be collocated with which other words; keywords-in-context; sentiment analysis; etc. Did I mention fuzzy searching, finding words that are spelled similarly? Or finding words that are used in the same context as a given word? Maybe you want to analyze your own prose: which words/phrasings/grammatical structures do you overuse? (NLTK, spaCy, textacy, gensim, word embeddings…)
  • Bibliometrics and Historiographical Analysis: From a secondary source, extract all the people, publications, places and time periods/dates mentioned, and graph/map them, before comparing them with other authors. Or analyze the sources cited in the bibliography – authors and affiliations, years of publication, languages, etc. The sciences have a lot of this already because they mostly publish journals and they’re in databases like Web of Science. This also ties into network analysis, especially if you want to look at citation networks.
  • Analyze words/phrases from the 16m-book HathiTrust collection. There’s a website for that, but you can also download the data, or subsets at least.
  • Genealogy: parse genealogical data and analyze. Would be interesting for royal lineages, and some work has already been done on that.
  • Sound analysis. Haven’t played with these, but some people are into reconstructing soundscapes and the like.
  • Image classification and analysis: Group together all the portraits of person X, etc. Haven’t played with these, though you have similar classification algorithms in Facebook, etc.
  • Lots of full-fledged programs are also python scriptable. E.g. both ArcGIS and QGIS have python interfaces, which means you can automate many of the boring tasks you need to perform when making more sophisticated maps.
  • Clean up your computer files – batch rename, copy, delete, convert, etc., with much more flexibility than Mac OS X’s Rename function.
  • Automate lots of administrative school work. Create a syllabus class schedule that lists the day of week and date for each meeting during the semester, removing any holidays or other days off for that specific semester. I’ll be department chair next year, and there are lots of stats and reports on enrollment, assessment… that I’d like to automate: collecting the data from databases/surveys and then analyze them, without me having to manually repeat the entire process every time. A computer science colleague will have a student working on a course scheduler next semester – given a department’s faculty requests, the available timeslots, and a few dozen university and departmental scheduling requirements, come up with a schedule that meets all those criteria, or at least the most important ones. Now, we have to do this by hand (with Excel, but still), and it’s a real pain.
  • Machine learning/AI: Python is also one of the main languages used for this new burgeoning field. For historians, that might mean classifying documents and topics, but I haven’t looked into it enough to think about how it could be used. Some of the above-mentioned libraries might well be superseded by machine learning libraries in the future, where things like neural nets figure out their own algorithms without rules being specified by the programmer. I think we’re already seeing a little bit of that with NLP.
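
To give a flavor of the OCR-cleaning item in the list above, here is a minimal sketch that rejoins words split by a line-ending hyphen, applies a table of known corrections, and keeps a log of every change for auditing. The function names and the two-entry correction table are hypothetical, made up for this example; a real project would load a much longer list of common errors.

```python
import re

# Hypothetical correction table (illustration only).
CORRECTIONS = {"Mariborough": "Marlborough", "Out-work": "outwork"}

def rejoin_hyphenated(text, log):
    """Rejoin words split by a hyphen plus whitespace: 'be- siege' -> 'besiege'."""
    def join(match):
        fixed = match.group(1) + match.group(2)
        log.append((match.group(0), fixed))  # record what was changed
        return fixed
    return re.sub(r"\b([A-Za-z]+)-\s+([a-z]+)\b", join, text)

def apply_corrections(text, log):
    """Replace each known mistake, logging one entry per occurrence."""
    for wrong, right in CORRECTIONS.items():
        count = text.count(wrong)
        if count:
            text = text.replace(wrong, right)
            log.extend([(wrong, right)] * count)
    return text

def clean(text):
    """Run the cleaning steps in order and return the text plus the audit log."""
    log = []
    text = rejoin_hyphenated(text, log)
    text = apply_corrections(text, log)
    return text, log

dirty = "Mariborough resolved to be- siege the Out-work."
cleaned, changes = clean(dirty)
print(cleaned)
for before, after in changes:
    print(f"{before!r} -> {after!r}")
```

Note that the hyphen rule is blunt: it would also glue together a genuine compound like ‘well- known’ that happened to break at a line ending, which is exactly why the audit log is worth keeping.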

And those are just a few of the things you can do with Python! So whatever data-related project you can think of, there’s probably a way to do it in Python. It’s not just automating the things that you find yourself doing on the computer over and over and over and over again. Just as important, what are the research questions that you want to ask, especially those that would require a lot of drudgery like counting and sorting and revising thousands of documents? Any software package that will answer that particular question for you will have its own learning curve, and there probably aren’t many people whom you could hire to do it for you, so you will probably be on your own. Whatever your question, there’s likely a way to combine the various Python tools together in a way that gets you the desired output.
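
Even one of the humbler items above, the syllabus class schedule, takes only a few lines with the standard datetime module. In this sketch the function name, the semester dates and the day off are all invented for illustration.

```python
from datetime import date, timedelta

def class_meetings(first_day, last_day, weekdays, days_off):
    """Return every date from first_day to last_day that falls on one of
    `weekdays` (Monday=0 ... Sunday=6) and is not in `days_off`."""
    meetings = []
    day = first_day
    while day <= last_day:
        if day.weekday() in weekdays and day not in days_off:
            meetings.append(day)
        day += timedelta(days=1)
    return meetings

# Invented three-week stretch: a Tuesday/Thursday class with one day off.
schedule = class_meetings(
    first_day=date(2019, 1, 14),
    last_day=date(2019, 2, 1),
    weekdays={1, 3},
    days_off={date(2019, 1, 24)},
)
for meeting in schedule:
    print(meeting.strftime("%a %d %b %Y"))
```

Next semester you change the dates and the days off, rerun it, and paste the result into the syllabus.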

But that’s not all. Using code also means:

  1. You can take whatever output and turn it into the input for another bit of code, and so on and so on. It is practically infinitely extensible.
  2. You can rerun your code but change a parameter, to see the difference it makes. ‘What-if’ exploration is super simple, and you can easily change a parameter anywhere in the workflow and continue the rest of your code with the new results.
  3. When you’re all done with your code, you can run it on another data set or text, or a whole folder full. And then you can compare the results.
  4. When you notice an intriguing pattern in one of your sources, you can quickly add another bit of code to explore it. Then you can look for that pattern in your other documents.
  5. You will also have a record of your process and method: which data you used for which analysis, how you cleaned the data, which settings and parameters you used, the order in which you performed your various steps, and so on. I’m guessing that more than a few historians would be unable to repeat, much less explain, how exactly they got the results they did. How faithfully, for example, do we record our computer-based research workflow? Some grant agencies are beginning to require recipients submit their data and workflow, along with their results. “Replicability” could even come to mean something in History.
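
The first two points, chaining outputs into inputs and rerunning with a changed parameter, might look something like this in practice. It is only a sketch: the function names, the sample sentence and the parameters (a minimum word length and the number of top words to report) are assumptions made up for the example.

```python
from collections import Counter

def tokenize(text):
    """Split a string into lowercase words."""
    return text.lower().split()

def drop_short(tokens, min_length):
    """Keep only the words with at least min_length letters."""
    return [t for t in tokens if len(t) >= min_length]

def top_words(tokens, n):
    """Return the n most frequent words with their counts."""
    return Counter(tokens).most_common(n)

def analyze(text, min_length=4, n=3):
    """Chain the steps: each function's output is the next one's input."""
    return top_words(drop_short(tokenize(text), min_length), n)

source = "the garrison held the town and the garrison held the citadel"

print(analyze(source))                # default parameters
print(analyze(source, min_length=1))  # 'what if' we keep the short words too?
```

Because the whole workflow lives in one function call, the script itself doubles as a record of which settings produced which result.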

This historian’s “killer app” for Python is a program that reads in a (primary or secondary) source from a text file, and then the code provides statistics on the words and phrases used, identifies rare terms that are unusually common in that document (compared to some corpus), extracts all the proper nouns mentioned, provides a statistical overview of their frequency (overall, and by section of book…), looks up information on the people (say, their nationality, age, etc.), then looks up the coordinates of mentioned places and maps them according to some criteria (by person who mentions the place, by where in the text it is mentioned, by what other things are mentioned around that place…). One output of all this could be tables or graphs of the entities in the text, word visualizations and the like. Another output could be automatically-created maps – not just maps of any of the above entities, but small-multiple maps that would locate a variable (say, siege duration) across four different theaters, and then another set of small-multiple maps that would similarly map the same variable by year instead. Might as well have it make a heat map while you’re at it. Several groups have already created web versions of some of these features (voyant-tools among them). But with your own code, you also end up with all these results in the code itself, which can be further analyzed with yet more code. Then, your code runs itself on a bunch of other documents, and includes comparisons between documents – which texts talk more about place X? This works for teaching as well as research. Imagine if you had a class where you assigned a source, had the students analyze it, and then put an interactive visualization of the document up on the screen to explore. This really wouldn’t be that hard – I already have almost all of the bits, and it’s just a question of chaining them all together. 
It will take a while to make sure the objects, logic and syntax are all copacetic, but hopefully it’ll be done in time for classes next fall.
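
One of those bits, finding terms that are unusually common in a document compared to some corpus, can be roughed out as a ratio of relative frequencies. To be clear, this is a crude stand-in of my own devising for the illustration, not the method of any particular library; serious work would more likely use something like TF-IDF or a log-likelihood measure. The function name and both sample texts are invented.

```python
from collections import Counter

def distinctive_terms(document, corpus, n=3):
    """Rank words by how much more frequent they are in `document`
    than in `corpus` (a simple ratio of relative frequencies)."""
    doc_counts = Counter(document.lower().split())
    corpus_counts = Counter(corpus.lower().split())
    doc_total = sum(doc_counts.values())
    corpus_total = sum(corpus_counts.values())

    scores = {}
    for word, count in doc_counts.items():
        doc_rate = count / doc_total
        # Add 1 to the corpus count so unseen words do not divide by zero.
        corpus_rate = (corpus_counts[word] + 1) / (corpus_total + 1)
        scores[word] = doc_rate / corpus_rate
    return sorted(scores, key=scores.get, reverse=True)[:n]

document = "the sappers dug the trench and the sappers dug the parallel"
corpus = "the army marched and the army camped and the army fought the enemy"

print(distinctive_terms(document, corpus, n=2))
```

Common words like ‘the’ score low because they are frequent everywhere, while words that appear often in the document but rarely in the corpus float to the top.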

If you’re not sure about diving into Python, I’d suggest you start by getting as many of your sources in digital form as possible. Scan, OCR, type. Then get yourself a decent text editor like Notepad++ or TextWrangler/BBEdit and start learning regular expressions.

But the more historians we get writing Python code, the more history-specific code we can build off of. So let’s get started.



3 responses to “What EXACTLY Python does”

  1. davidunderdown95 says :

    I’d add access APIs to that list. Here’s a blog post I did on using The National Archives’ Discovery API for accessing and analysing catalogue data (this followed on from a couple of posts by a colleague explaining how to do this with manual download and manipulation of data in Excel) https://blog.nationalarchives.gov.uk/blog/using-the-discovery-api/

    • jostwald says :

      Thanks for the comment, and for the links.
      You are correct; I guess APIs were assumed in a few of the points (webscraping and linked open data), but it should be made more explicit.

      For anyone looking for ideas from the more visionary ‘early adopters’ among digital humanists, I’ve learned the most from the second-wave blogs and websites of Ted Underwood, Ben Schmidt, Ryan Cordell, Cameron Blevins, and Miriam Posner. And of course George Mason’s Roy Rosenzweig Center for History and New Media…

  2. jostwald says :

    As another example of a free online resource being eliminated (3rd advantage of Python mentioned in my post): Google will be “turning down” Google Fusion Tables at the end of next year. To be replaced with bigger and better tools, they say. Lesson? The only way to truly control your workflow and your data is to go with an open-source product, like Python. We historians, with our long time frames and reliance on archival conservation/preservation efforts, should really be prioritizing that in our own research.
