Creating Metadata by Hand: Musings on the Limits of Automation in Archives

This post was written by Alice Griffin, who has worked in La MaMa’s Archives since November as the Metadata/Digitization Assistant. She’s leaving La MaMa at the end of July to pursue a Master’s degree at the Pratt Institute’s School of Information. We asked her to offer some reflections about her time at La MaMa. (We will miss her terribly and wish her all the best in her next adventure.)

“But… a computer could just do your job.” The first time I heard this remark, it made me pause, seriously question the future of my career, and turn to my professional mentors for reassurance. Now, after seven months in this position, I feel confident that my job is not so easily automated away.

In the La MaMa Archives/Ellen Stewart Private Collection, I am the metadata/digitization assistant. My job is to add digital media to corresponding catalog records on the fantastically vast La MaMa Archives Digital Collections site, created by several catalogers and Project Manager Rachel Mattson over the past three years. As a result of this project, researchers all over the world can now see photographs and programs that were previously only minimally described. This digitization work requires a scanner, some metadata know-how, creativity, patience, and lots of time. The La MaMa Archives does have a lovely professional scanner, my metadata knowledge continues to grow, and I have a considerable amount of patience. However, time is running short as the grant I was hired on comes to an end. I have added hundreds of digital objects to the digital collections since November 2016, but it feels as though my job has just started.


Alice Griffin, a human, at her desk in the La MaMa Archives.

So, why can’t a computer just do my job? A computer already helps me with many aspects of this task. The scanner I use to digitize photographs, programs, flyers, postcards, and other objects is connected directly to my computer, and once I choose the settings and file name, there’s not much more to do except click “scan.” Once I have created my preservation (TIFF) files and access (JPEG) files in Photoshop, a simple drag and drop initiates a Secure File Transfer Protocol (SFTP) transfer through Cyberduck, storing them on the La MaMa Archives server or uploading them to our digital collections site through a CollectiveAccess-powered backend. I also manually add metadata: a paragraph describing the material at hand, links to Library of Congress Name Authorities and Subject Headings, information about the storage location and preservation needs of the object, and other bits that make the record as complete as possible. But in the era of self-driving cars, why do we need a human to do this work? Even though I don’t think anyone would accuse a surgeon of obsolescence because of the rise of robotics in the operating room, I think this is a fair question, and I would like to attempt a response.
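For the technically curious, here is a minimal sketch, in Python, of what the derivative-and-upload step could look like if it were scripted. This is an illustration of the workflow described above, not our actual tooling: in practice I use Photoshop and Cyberduck, and every name below (file paths, hostname, credentials) is hypothetical.

    # Illustrative sketch only; the paths, host, and credentials are hypothetical.
    from pathlib import Path

    import paramiko        # SFTP client library
    from PIL import Image  # image processing (Pillow)

    def make_access_copy(tiff_path: Path, jpeg_dir: Path) -> Path:
        """Derive a JPEG access file from a TIFF preservation master."""
        jpeg_path = jpeg_dir / (tiff_path.stem + ".jpg")
        with Image.open(tiff_path) as master:
            # JPEG has no alpha channel, so convert to RGB before saving.
            master.convert("RGB").save(jpeg_path, "JPEG", quality=90)
        return jpeg_path

    def upload_via_sftp(local_path: Path, remote_dir: str) -> None:
        """Push a file to the archives server over SFTP (the step Cyberduck handles for me)."""
        transport = paramiko.Transport(("archives.example.org", 22))  # hypothetical host
        try:
            transport.connect(username="metadata_assistant", password="...")  # placeholder
            sftp = paramiko.SFTPClient.from_transport(transport)
            sftp.put(str(local_path), remote_dir + "/" + local_path.name)
            sftp.close()
        finally:
            transport.close()

    access_file = make_access_copy(Path("scans/OBJ.1985.0307_a001.tif"), Path("access"))
    upload_via_sftp(access_file, "/digital_collections/access")

Even a script like this, of course, automates only the file handling; it cannot decide what the file should be named, whether the scan is good enough, or what the metadata should say.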

Simply put, a computer does not yet exist that automates every aspect of my workflow; human labor and expertise are always involved. The labs page of the Stanford Libraries website lists the equipment used for digitization projects and the rate of digitization for each. The robotic book scanner can scan four times as many pages per hour as someone operating the manual book scanner. So why even continue to pay student workers to do that manual work? Because the Stanford Libraries’ robotic book scanners are not safe for fragile bound materials, and careful human hands are therefore necessary. Of course, book scanners are being engineered to have that gentle touch. In her article “The Hidden Faces of Automation,” Lily Irani mentions a “patented machine” engineered to turn the pages of rare books for digitization. But even this kind of machine was not fully automated; it “housed a worker who flipped the pages in time to a rhythm-regulated soundtrack” (34).

In 2006, the System for the Automated Migration of Media Assets (SAMMA), a system of robotics, hardware, and software, began being sold as a way for institutions to transfer media from obsolete formats to digital files in a streamlined and cost-effective manner. Three factors make SAMMA unusable for my project. First, I am not working with or digitizing La MaMa’s audiovisual materials (for some information about La MaMa’s awesome audiovisual materials, see Rachel Mattson’s blog post here). Second, SAMMA does not create metadata about the content of the materials, such as who or what is depicted. And third, SAMMA’s cost-effectiveness is relative; the costs for a community archives, such as La MaMa, to use tools like SAMMA or the book scanners mentioned above would be prohibitive.

While robotics, hardware, and software are useful, human skill and precision are still always involved. Before even beginning to scan, I must decide whether each object is appropriate for digitization – are there privacy or rights concerns? And if there are duplicates of an object, I must choose the best copy to digitize. When scanning begins, it is not just a matter of sticking a stack of papers into the automatic feed on a photocopier, or placing a book or videotape into a robotic scanner; the materials must be handled carefully, one page/photograph/poster at a time, so that they do not tear or crinkle. Additionally, in order to fully describe an object I am digitizing, I must fill in several fields that physically characterize it: How big is the object? How many duplicates are there? Is it color or black and white? We want these originals to last, because while digital files generally allow for easier access, they do not necessarily stand the test of time. Original photographic prints, negatives, and papers cannot just go in the trash once you have a digital surrogate.
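To make those fields concrete, here is a loose sketch of how the physical description of an object might be modeled in code. The field names are my own shorthand for this post, not CollectiveAccess’s actual schema.

    # A sketch of the physical-description fields discussed above. The field
    # names are my own shorthand, not CollectiveAccess's actual schema.
    from dataclasses import dataclass

    @dataclass
    class PhysicalDescription:
        height_cm: float           # height of the object
        width_cm: float            # width of the object
        duplicate_count: int       # how many copies exist in the folder
        is_color: bool             # True for color, False for black and white
        condition_notes: str = ""  # tears, crinkles, other preservation concerns

    record = PhysicalDescription(
        height_cm=25.4,            # hypothetical measurements
        width_cm=20.3,
        duplicate_count=5,
        is_color=False,
        condition_notes="Handle one print at a time; edges are fragile.",
    )
    print(record)

Filling in a structure like this is trivial for a computer; knowing what values to put in it – measuring, counting, and assessing the object in hand – is the human part.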


Object record for production photographs from the 1985 production of The Cotton Club Gala [OBJ.1985.0307] as viewed on La MaMa’s digital collections website.

Adding metadata also requires a human mind. The description field, in particular, requires some creativity because, as a cataloger, I have to think about how different people will use the catalog. How will La MaMa archives staff search the catalog versus the La MaMa marketing or development staff? How does an academic researcher use the catalog versus an artist who has performed at La MaMa before? A human cataloger can take advantage of these nuances of use to create a more robust, user-oriented catalog in a way that a rigid computer program simply can’t. To give an example, I asked myself these questions while cataloging photographs from the 1985 production of “Cotton Club Gala,” directed by Ellen Stewart with music by Aaron Bell and choreography by Larl Becham. The description field is a beautiful thing because it allows you to tell the researcher about the object in full sentences: what production it’s from, who is depicted, anything of note about the object, or even whether you’re unsure of the date. So, in the case of the Cotton Club Gala photographs, I made sure to address all of these users in the description:

This folder contains eight photographic prints, five of which are duplicates, from “Cotton Club Gala,” directed by Ellen Stewart and produced at La MaMa in 1985. This folder also includes a typewritten letter on Vogue letterhead from David DeNicolo, assistant to Amy Gross, to La MaMa archivist Doris Pettijohn, thanking her for letting them look at the photographs.

Valois Mickens is depicted in the third image.

The description is neither long nor complicated, but it provides information in a readable format. There is information about prints and duplicates for archives staff; it identifies the production as directed by Ellen Stewart, which flags it as potentially important for marketing use; for an academic researcher, the whole description, including the letter from Vogue, is useful because it gives context for the objects; an artist searching the catalog might also appreciate the whole description, or might find the information about who worked on the production and who is depicted more interesting. The description field is different for every object record, and it therefore requires flexibility, creativity, and brevity to produce a paragraph that contextualizes the object without overwhelming the user.

The La MaMa Archives holds many one-of-a-kind materials; for some productions, the programs, photographs, or posters here may be the only remaining evidence that they took place. In this way, the La MaMa catalog does not just hold information gleaned from other sources; it is a producer of information itself. When a researcher or an archives staff member notices a mistake in the catalog, we usually need to consult our own material to solve the problem – a Google search will not help us. For example, when digitizing photographic prints for the 1965 and 1967 productions of The Sand Castle, written by Lanford Wilson and directed by Marshall W. Mason, I noticed that the performers depicted in the photographs didn’t match the production dates handwritten on the backs of the prints. The La MaMa catalog was the only source I could turn to in order to resolve the confusion: I cross-referenced the performers listed in the programs against who was depicted in each image and compared the sets and costumes of the two productions. In this way, the La MaMa catalog functions as both a repository and a generator of the history of Off-Off-Broadway.
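The logic of that cross-check is simple enough to sketch in a few lines of code – what no computer could have done is recognize the faces, read the programs, and know to be suspicious in the first place. (The names below are placeholders, not the actual cast lists.)

    # Hypothetical reconstruction of the cast-list cross-check described above.
    cast_1965 = {"Performer A", "Performer B", "Performer C"}
    cast_1967 = {"Performer C", "Performer D", "Performer E"}
    depicted = {"Performer D", "Performer E"}  # performers identified in the photo

    # Performers who appear in only one production's program are the telling evidence.
    only_1965 = depicted & (cast_1965 - cast_1967)
    only_1967 = depicted & (cast_1967 - cast_1965)

    if only_1967 and not only_1965:
        print("Photograph likely documents the 1967 production.")
    elif only_1965 and not only_1967:
        print("Photograph likely documents the 1965 production.")
    else:
        print("Inconclusive; compare sets and costumes as well.")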


Production photograph from the 1967 production of The Sand Castle [OBJ.1965.0216]. (This item was originally cataloged, in error, as documenting the 1965 production.)

While my position may appear to be a solitary one, it requires person-to-person interaction at a level no computer can match. I am in regular contact with James D. Gossage, a photographer who documented many of La MaMa’s early shows. His own files and memories have corrected and enriched the catalog, and in March 2017 Gossage donated programs, a poster, and some photographs that the Archives did not have before. He gave us the rights to three of the photographs [OBJ.1967.0349], which depict Tom Eyen, a playwright and director of many La MaMa shows who is probably best known for writing Dreamgirls. These are beautiful portraits with dramatic light and shadow, and the La MaMa Archives is excited to have them. It’s possible that Gossage felt comfortable passing these prints into our care because, despite some errors in the catalog, he could see the work that we put into describing these materials to the best of our knowledge and ability. The humanity (and therefore error) present in the La MaMa Digital Collections website reflects the deep humanity of the artists and productions that the photographs, programs, correspondence, and posters document.


Portrait of Tom Eyen by James D. Gossage, circa 1967. [OBJ.1967.0349]

No, my position cannot simply be automated away, but I’m sure I will continue to field questions about its relevance. And while not receiving proper recognition for my work is mostly an inconvenience or a blow to my ego, it does reveal a widespread misunderstanding, or even misrecognition, of the mechanisms behind automation and behind making information available on the Internet. I am glad to see that there is growing scholarship on how obscuring the connection between human beings and automation deeply affects individuals and communities, both economically and emotionally. There is too much to delve into in this blog post, but I would like to suggest some further reading. First, Safiya Umoja Noble’s article “Google Search: Hyper-visibility as a Means of Rendering Black Women and Girls Invisible” examines how Google search results are not separate from human influence but are in fact embedded in racist and sexist stereotypes that benefit advertisers – an aspect of Google that is mostly ignored or glossed over. Noble reminds us that “the results that surface on the web in commercial spaces like Google are not neutral processes—they are linked to human experiences, decision-making, and culture.” Another article that reveals the human influence behind a process commonly thought of as automated is Sarah Roberts’ “Commercial Content Moderation: Digital Laborers’ Dirty Work.” Roberts exposes the human labor behind the moderation of user-generated content and shows how these workers shape the content they screen even as that content takes a toll on their well-being.

The third article I want to recommend here is Lily Irani’s short piece “The Hidden Faces of Automation.” In it, Irani explains how the “data janitors” behind “cultural data work,” such as “transcribing small audio clips, putting unstructured text into structured database fields, and ‘content-moderating’…user-generated content” (37), are so easily and consistently undervalued and underpaid. Irani then asks two very important questions that I would like to highlight here: “What would computer science look like if it did not see human-algorithmic partnerships as an embarrassment, but rather as an ethical project where the humans were as, or even more, important than the algorithms? What would it look like if artificial intelligence and human-computer interaction put the human care and feeding of computing at the center rather than hiding it in the shadows?” Irani raises a remarkable point in these questions. Even though technology fields are booming, computers continue to be limited by the limitations of humans: limitations of technical knowledge, sure, but also limitations of empathy for human workers. Perhaps technologists need to embrace this level of social responsibility in their work. It is not a failure to admit we still need to do things by hand; rather, this honesty sheds light on a previously concealed issue.

Suggestions for further reading:

Safiya Umoja Noble, “Google Search: Hyper-visibility as a Means of Rendering Black Women and Girls Invisible”

Sarah Roberts, “Commercial Content Moderation: Digital Laborers’ Dirty Work”

Lily Irani, “The Hidden Faces of Automation”

Endangered Data and the Arts

Last month, from April 17–22, 2017, archivists, librarians, records managers, educators, and researchers marked the first-ever Endangered Data Week (EDW). Designed to highlight and provoke discussion about threats to the public availability of federal and local government datasets, the week featured a wide range of events – Twitter chats, data rescue harvests, data storytelling, data-scraping workshops, letter-writing meet-ups, and panel discussions. Over the course of six days, approximately 17 universities and 8 professional organizations convened more than 50 events. As the organizer of a new Digital Library Federation (DLF) working group on Government Records Transparency and Accountability, I helped to organize the project and worked to convene a webinar on the Freedom of Information Act as part of the week’s events.

EDW was originally the brainchild of Michigan State University’s Brandon Locke and was sponsored by the DLF in partnership with DataRefuge, the Mozilla Science Lab, the Council on Library and Information Resources, and the National Digital Stewardship Alliance. “There is good reason for concern about the ongoing availability and collection of data by US government agencies,” Locke wrote in a recent post in Perspectives (the online newsmagazine of the American Historical Association). Not only has the new presidential administration signaled its opposition to open data and data-collecting initiatives (“most notably those concerning climate change”), but Congress has also recently taken steps to restrict public records access. For instance, federal legislation has been introduced that would prohibit recipients of federal funds from creating, using, or providing access to geospatial databases that track “racial disparities or disparities in access to affordable housing” – language that, as Locke notes, could “hinder researchers’ efforts to ‘analyze changes in neighborhood demographics, urban development, policing, and the impact of redlining and other discriminatory housing policies.’”

You might be wondering why an archivist who spends her days working in a performing arts archives is so invested in questions of government transparency, the Freedom of Information Act, and endangered data. I can think of a dozen ways to explain the source of my interest – but here I’d like to talk about just one of them: public records and data are very important to artists, arts organizations, arts journalists, arts funders, and arts scholars.

On one hand, arts organizations routinely rely on public data and records to inform their practice and to justify the importance of their work; public data informs arts administrators’ work in audience development, fundraising, public relations, infrastructure-building, and advocacy. To take a very hard-boiled example: government-collected data is routinely used to “quantify the broad ‘impact’ of the arts and culture sector in financial and programmatic terms” (as the cultural think-tank Createquity recently put it). In other words, by documenting the ways in which arts programs drive local economies, contribute to youth development, and lead to lower crime rates, arts advocates give government agencies a bread-and-butter rationale for spending public money on arts programs. The Center for an Urban Future’s 2015 report Creative New York, for example, relied on public data to document its finding that New York City’s economic engine is powered by artists and the creative sector. That finding has, in turn, been used to advocate for increased public spending on the arts in New York City. Funding for small arts organizations often depends on this kind of advocacy.

Funding for my home institution, La MaMa Experimental Theatre Club, has been shaped over the years by these sorts of data-driven advocacy efforts – as well as by data collection efforts designed to streamline government services. In the 1970s, for instance, La MaMa received part of its funding through the Comprehensive Employment Training Act (CETA). Established in 1973 by the Nixon administration (yes, that Nixon administration), CETA was a block grant project established in response to public data indicating that funding for “job training” and “workforce development” was fragmented and duplicative, and thus inefficient. Individual states could decide how to spend their CETA funds; and New York State decided to give a portion of that money to arts organizations. With CETA funding, La MaMa incubated several ensembles that were responsible for staging more than 35 events (plays and concerts) between 1978 and 1980.


Program for “3rd CETA Chamber Concert” (1978) (From La MaMa’s digital collections.)

Although these days La MaMa is more likely to get funding from private foundations or state agencies than from federal job training initiatives, our ability to fund our programming continues to depend on the availability of a wide range of public data.

For instance, like many other non-profits, we rely on data from sources such as 990-PFs – tax documents that private foundations must file with the Internal Revenue Service, and which contain the names of foundations’ officers and grantees – in our fundraising and cultivation efforts. Although the data found in 990-PFs is not government-created, it is made public by government mandate. It serves as a critical resource for a wide range of arts organizations and their allies, who use it to conduct prospect research, to understand the broader funding landscape, and to find new potential donors. It also underpins fiscal transparency, oversight, and public conversations about tax policy, private philanthropy, and funding for the arts. This kind of transparency enables us as a city and a nation to ask questions like: Who is giving to the arts? How has that changed over time? Why? (And so on.)

Of course, public records and data also serve as essential tools for scholars seeking to write about the arts in social and historical context. Scholars of the history of modern dance, or of the rise of video art, or of the role of the arts in the life of American cities (among other topics) all rely heavily on government-created records in their work.[1] Examples of the creative uses to which arts-engaged scholars have put public records abound, but for the sake of brevity, consider just one: Robin D.G. Kelley’s masterful biography of Thelonious Monk. In his effort to portray the life and work of this perennially misunderstood, incandescent musician, Kelley makes powerful use of land and property deeds; birth, death, and marriage records; court testimony; Selective Service records; the Census; Monk’s FBI file; the annual report of the New York City Department of Corrections; and an array of other documents. Indeed, the public record becomes a rich source of evidence for the biography’s most important thematic frame: that Monk’s life and work reflected – and remixed – the idea of freedom in African-American history and culture. “Thelonious Monk’s music is essentially about freedom,” Kelley argues. In one early section of the book, Kelley does a deep dive into the public record to trace Monk’s family’s experiences with enslavement and liberty in the US over the course of a century. After locating Monk’s great-grandfather John Jack, born in 1797, through a combination of Census records (including the 1860 Schedule of “Slave Inhabitants of Sampson County”) and property records (including a deed of gift that transferred ownership of Monk’s great-aunt Chaney from one slaveholder to another), Kelley learns from the Census of 1870 that Monk’s grandfather Hinton Cole, born into slavery, learned to read and write shortly after emancipation. Throughout, Kelley demonstrates that if Monk’s music was “essentially about freedom,” that was no accident: he had “inherited…a deeply felt understanding” of the topic “from those who came before him.” This foregrounding sets the stage for the rest of Kelley’s account of the pianist’s life and work.[2]

Finally, open data and records also comprise important source material for working artists. The public record served as an important basis, for instance, for last year’s hit Broadway musical Hamilton. (Creator Lin-Manuel Miranda has often discussed the historical and archival material upon which he based the show.) But creative engagement with government documents is hardly new, and the list of artists who have used public documents and data in their work is very, very long. In his landmark Shapolsky et al. Manhattan Real Estate Holdings, a Real-Time Social System, as of May 1, 1971, for instance, Hans Haacke used public records to chronicle “the fraudulent activities of one of New York City’s largest slumlords over the course of two decades.” Visual artists Mariam Ghani and Chitra Ganesh also used public records in their Index of the Disappeared project, which considered the “difficult histories of immigrant, ‘Other’ and dissenting communities in the U.S.” after 9/11. And in the 1980s, the activist art collective Gran Fury deployed government data in the silkscreened posters they wheat-pasted across New York City. A poster they created in 1988, for instance, featured an image of a baby doll and text that read: “One in every sixty one babies in New York City is born with AIDS or born HIV antibody positive. So why is the media telling us that heterosexuals aren’t at risk? Because these babies are black. These babies are Hispanic.” In addition to functioning as complex aesthetic works in their own right, each of these projects contributed to wide-ranging public conversations about urgent social issues.


Poster by Gran Fury. (Screen-grab from ICP)

For good reasons, this year’s Endangered Data Week focused on the importance of government data for environmental scientists, social scientists, and humanities researchers. Such scholars and their publics have a great deal to lose when government agencies can’t or don’t collect data about weather patterns, housing discrimination, and other matters. But artists and their audiences also rely heavily on publicly accessible government data. It is hard to know for sure all the ways in which the data that arts-engaged individuals and groups rely upon are threatened. And we must always consider the ways in which public data collection might feed broader government surveillance of civilians. But government data initiatives contribute to the well-being of a cross-section of people – including artists. And if we want to ensure that creative practice can endure – and can continue to inform public conversations about history, politics, and contemporary life – we need to fight for the continued existence of a robust culture of data transparency and accountability.


[1] See, e.g., Naima Prevots, Dance for Export: Cultural Diplomacy and the Cold War (Wesleyan University Press, 1999); Kathy High, Sherry Miller-Hocking, and Mona Jimenez, eds., The Emergence of Video Processing Tools (University of Chicago Press, 2014); and Hillary Miller, Drop Dead: Performance in Crisis, 1970s New York (Northwestern University Press, 2016).

[2] Robin D.G. Kelley, Thelonious Monk: The Life and Times of an American Original (Free Press, 2009), pp. 2-14 and 463-467.