Lev Manovich, July 21, 2011
keywords: search, Google, knowledge discovery, digital library, database, classification, folksonomy, information retrieval, HCI, interface, information visualization, digital humanities, cultural analytics, visual analytics, software studies, Manovich
Early 21st century humanities and media studies researchers have access to unprecedented amounts of media – more than they can possibly study, let alone simply watch or even search. (For examples of large media collections, see the list of repositories made available to the participants of the Digging Into Data 2011 Competition, www.diggingintodata.org.)
The basic method of humanities and media studies, which worked fine when the number of media objects was small – see all images or video, notice patterns, and interpret them – no longer works. For example, how do you study 167,000 images on the Art Now Flickr gallery, 236,000 professional design portfolios on coroflot.com (both numbers as of 7/2011), or 176,000 Farm Security Administration/Office of War Information photographs taken between 1935 and 1944 and digitized by the Library of Congress (http://www.loc.gov/pictures/)?
Given the size of typical contemporary digital media collections, simply seeing what’s inside them is impossible.
Although it may appear that the reasons for this are the limitations of human vision and human information processing, I think that it is actually the fault of current interface designs and web technology. Standard interfaces for massive digital media collections such as list, gallery, grid, and slide show do not allow us to see the contents of a whole collection. These interfaces usually display only a few items at a time (regardless of whether you are in a browsing mode or in a search mode). This access method does not allow us to understand the “shape” of the overall collection and notice interesting patterns.
The popular media access technologies of the 19th and 20th centuries, such as slide lanterns, film projectors, microfilm readers, Moviola and Steenbeck, record players, audio and video tape recorders, VCRs, and DVD players, were designed to access a single media item at a time at a limited range of speeds. This went hand in hand with the media distribution mechanisms: record and video stores, libraries, television and radio would all make only a few items available at a time. For instance, you could not watch more than a few TV channels at the same time, or borrow more than a few videotapes from a library.
At the same time, hierarchical classification systems used in library catalogs made it difficult to browse a collection or navigate it in orders not supported by catalogs. When you walked from shelf to shelf, you were typically following a classification based on subjects, with books organized by author names inside each category.
Together, these distribution and classification systems encouraged 20th century media researchers to decide beforehand what media items to see, hear, or read. A researcher usually started with some subject in mind – films by a particular author, works by a particular photographer, or categories such as “1950s experimental American films” and “early 20th century Paris postcards.” It was impossible to imagine navigating through all films ever made or all postcards ever printed. (One of the first media projects which organizes its narrative around navigation of a media archive is Jean-Luc Godard’s "Histoire(s) du cinéma," which draws samples from hundreds of films.) The popular social science method for working with larger media sets in an objective manner – content analysis, i.e. tagging of semantics in a media collection by several people using a predefined vocabulary of terms – also requires that a researcher decide beforehand what information would be relevant to tag.
Unfortunately, the current standard in media access – computer search – does not take us out of this paradigm. A search interface is a blank frame waiting for you to type something. Before you click on the search button, you have to decide what keywords and phrases to search for. So while search brings a dramatic increase in speed of access, it assumes that you already know something about the collection worth exploring further.
We need techniques for efficient browsing of content and discovery of patterns in massive media collections. Consider this definition of “browse”: “To scan, to casually look through in order to find items of interest, especially without knowledge of what to look for beforehand” (“Browse”, Wiktionary). Consider also one of the meanings of the word “exploration”: “to travel somewhere in search of discovery” (“Exploration”, Wiktionary). How can we discover interesting things in massive media collections? I.e., how can we browse through them efficiently and effectively, without knowing in advance what we want to find?
Anja Wiesinger wrote an interesting response to this post:
Some notes on the history of search engines and media collection interfaces - for article
http://en.wikipedia.org/wiki/Microfilm "Using the daguerreotype process, John Benjamin Dancer was one of the first to produce micro-photographs, in 1839. He achieved a reduction ratio of 160:1".
"In 1896, Canadian engineer Reginald A. Fessenden suggested microforms were a compact solution to engineers' unwieldy but frequently consulted materials. He proposed that up to 150,000,000 words could be made to fit in a square inch, and that a one foot cube could contain 1.5 million volumes"
"The year 1938 also saw another major event in the history of microfilm when University Microfilms International (UMI) was established by Eugene Power."
Emanuel Goldberg "introduced his “Statistical Machine,” a document search engine that used photoelectric cells and pattern recognition to search the metadata on rolls of microfilmed documents (US patent 1,838,389, 29 December 1931). This technology was used in a variant form in 1938 by Vannevar Bush in his “microfilm rapid selector,” his “comparator” (for cryptanalysis), and was the technological basis for the imaginary Memex in Bush’s influential 1945 essay “As we may think.”"
"1950: The term "information retrieval" appears to have been coined by Calvin Mooers."