The Promises and the Challenges of Big Social Data

Exploring one million manga pages on the 287 megapixel HIPerSpace


text author: Lev Manovich

article version: 1
posted March 31, 2011

[This is the first part of a longer article – the second part will be posted in the next few days]


The emergence of social media in the mid-2000s created a radically new opportunity to study social and cultural processes and dynamics. For the first time, we can follow the imagination, opinions, ideas, and feelings of hundreds of millions of people. We can see the images and the videos they create and comment on, eavesdrop on the conversations they are engaged in, read their blog posts and tweets, navigate their maps, listen to their tracklists, and follow their trajectories in physical space.

In the 20th century, the study of the social and the cultural relied on two types of data: “surface data” about many (sociology, economics, political science) and “deep data” about a few (psychology, psychoanalysis, anthropology, ethnography, art history; methods such as “thick description” and “close reading”). For example, a sociologist worked with census data that covered most of the country’s citizens; however, this data was collected only every 10 years, and it represented each individual only on a “macro” level, leaving out her/his opinions, feelings, tastes, moods, and motivations. In contrast, a psychologist was engaged with a single patient for years, tracking and interpreting exactly the kind of data which the census did not capture.

Between these two methodologies of “surface data” and “deep data” were statistics and the concept of sampling. By carefully choosing her sample, a researcher could expand certain types of data about the few into knowledge about the many. For example, starting in the 1950s, the Nielsen Company collected TV viewing data in a sample of American homes (via diaries and special devices connected to TV sets in 25,000 homes), and then used this sample data to tell TV networks their ratings for the whole country for a particular show (i.e. the percentage of the population which watched this show). But the use of samples to learn about larger populations had many limitations.

For instance, in the example of Nielsen’s TV ratings, the small sample used did not tell us anything about the actual hour by hour, day to day patterns of TV viewing of every individual or every family outside of this sample. Maybe certain people watched only news the whole day; others only tuned in to concerts; others had the TV on but never paid attention to it; still others happened to prefer the shows which got very low ratings from the sample group; and so on. The sample data could not tell us any of this. It was also possible that a particular TV program would get a zero share because nobody in the sample audience happened to watch it – and in fact, this happened more than once.
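
To make the zero-share case concrete, here is a minimal sketch. It assumes simple random sampling (an idealization of how a ratings panel is chosen) and uses made-up viewing rates; only the panel size of 25,000 homes comes from the text above, and the function name is hypothetical. If a fraction p of all households watches a program, the chance that none of the n sampled households does is (1 - p)^n.

```python
def prob_zero_viewers(viewing_rate: float, sample_size: int) -> float:
    """Chance that no sampled household watches, assuming independent random draws."""
    return (1.0 - viewing_rate) ** sample_size


SAMPLE_SIZE = 25_000  # panel size mentioned in the text

# Hypothetical viewing rates, chosen only for illustration.
for rate in (1e-3, 1e-4, 1e-5):
    p_zero = prob_zero_viewers(rate, SAMPLE_SIZE)
    print(f"viewing rate {rate:.5f}: P(zero viewers in sample) = {p_zero:.3f}")
```

Under these assumptions, a program watched by one household in 100,000 would more often than not be missed by the panel entirely, even though in a hypothetical country of 100 million households that is still a thousand homes.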

Think of what happens when you take a low-res image and make it many times bigger. For example, let’s say you start with a 10x10 pixel image (100 pixels in total) and resize it to 1000x1000 (one million pixels in total). You don’t get any new details – only larger pixels. This is exactly what happens when you use a small sample to predict the behavior of a much larger population. A “pixel” which originally represented one person comes to represent 10,000 people who are all assumed to behave in exactly the same way.
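
The image analogy can be checked directly. The sketch below is only an illustration: it uses plain Python lists rather than an image library, fills the small image with random grey values, and enlarges it by simple pixel repetition (nearest-neighbor resizing). The function name is hypothetical. The enlarged image contains exactly the same set of values as the small one, and each original pixel now covers a 100x100 block.

```python
import random


def upscale_nearest(image, factor):
    """Enlarge a 2D grid by repeating each pixel `factor` times in both directions."""
    enlarged = []
    for row in image:
        wide_row = [value for value in row for _ in range(factor)]
        # Repeat the widened row `factor` times (copies, so rows stay independent).
        enlarged.extend(list(wide_row) for _ in range(factor))
    return enlarged


random.seed(0)
small = [[random.randint(0, 255) for _ in range(10)] for _ in range(10)]  # 10x10
large = upscale_nearest(small, 100)                                       # 1000x1000

small_values = {v for row in small for v in row}
large_values = {v for row in large for v in row}
pixels_per_original = (len(large) * len(large[0])) // (len(small) * len(small[0]))

print("size of enlarged image:", len(large), "x", len(large[0]))
print("new values introduced by resizing:", len(large_values - small_values))
print("pixels standing in for each original pixel:", pixels_per_original)
```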

The rise of social media along with the progress in computational tools that can process massive amounts of data makes possible a fundamentally new approach for the study of human beings and society. We no longer have to choose between data size and data depth. We can study exact trajectories formed by billions of cultural expressions, experiences, texts, and links. The detailed knowledge and insights that before could be reached only about a few can now be reached about many – very, very many.

In 2007, Bruno Latour summarized these developments as follows: “The precise forces that mould our subjectivities and the precise characters that furnish our imaginations are all open to inquiries by the social sciences. It is as if the inner workings of private worlds have been pried open because their inputs and outputs have become thoroughly traceable.” (Bruno Latour, “Beware, your imagination leaves digital traces”, Times Higher Education Literary Supplement, April 6, 2007.)

Two years earlier, in 2005, Nathan Eagle at MIT Media Lab was already thinking along similar lines. He and Alex Pentland put up a web site “reality mining” (reality.media.mit.edu) and wrote about how the new possibilities of capturing details of people’s daily behavior and communication via mobile phones can create “Sociology in the 21st century.” To put this idea into practice, they distributed Nokia phones with special software to 100 MIT students who then used these phones for 9 months – which generated approximately 60 years of “continuous data on daily human behavior”.

Finally, think of Google search. Google’s algorithms analyze text on all web pages they can find, plus “PDF, Word documents, Excel spreadsheets, Flash SWF, plain text files, and so on,” and, since 2009, Facebook and Twitter content (en.wikipedia.org/wiki/Google_Search). Currently Google does not offer any product that would allow a user to analyze patterns in all this data the way Google Trends does with search queries and Google’s Ngram Viewer does with digitized books – but it is certainly technologically conceivable. Imagine being able to study the collective intellectual space of the whole planet, seeing how ideas emerge and diffuse, burst and die, how they get linked together, and so on – across the data set estimated to contain at least 14.55 billion pages (as of March 31, 2011; see worldwidewebsize.com).

Does all this sound exciting? It certainly does. What may be wrong with these arguments? Quite a few things.


[The second part of the article will be posted here within the next few days.]


-----------------
I am grateful to UCSD faculty member James Fowler for an inspiring conversation a few years ago about the depth/surface questions. See his pioneering social science research at jhfowler.ucsd.edu.

Image analysis and visualization techniques for digital humanities | instructor: Lev Manovich | course at UCSD, spring 2011

One million manga pages
Researchers from the Software Studies Initiative exploring the one million manga pages data set on the HIPerSpace supervisualization system at Calit2.

Keywords: digital humanities, Calit2, UCSD, NEH, HIPerSpace, visualization, cultural analytics, softwarestudies.com, software studies, Lev Manovich

-------------------------------------
VIS145B.
Spring 2011 / Visual Arts Department / UCSD
time: Wednesday, 3:00-5:50pm
instructor: Lev Manovich
office hours: Tuesday 2-3pm @Cafe Roma, or by appointment
email: manovich@ucsd.edu
lab: softwarestudies.com

Readings:
All readings for this class will be available online at no charge.

Software:
The class will use free publicly available software as well as software tools developed by Software Studies Initiative.


Course description:  

“The next big idea in language, history and the arts? Data.”
New York Times, November 16, 2010.


Cultural Analytics is the use of computational methods for the analysis of patterns in visual and interactive media.

Our core methodology combines digital image analysis and next generation visualization technologies being developed at Calit2 and UCSD. We also developed an alternative methodology to explore large visual data sets directly, without any quantitative analysis.

More information about cultural analytics
Examples of cultural analytics projects

In the first part of the class the students learn cultural analytics techniques and software tools. In the second part all students work collaboratively on a project to create dynamic animated visualizations of large visual data sets.

The data may include visual art, graphic design, photography, fashion/street styles, feature films, animation, motion graphics, user-generated video, gameplay video recordings, web design, product design, maps, sound, and texts. We will also have access to the state-of-the-art supervisualization system at Calit2 (HIPerSpace) to explore large data sets.




Class schedule:

------------------------------------
1 / 3.30.2011 / Introduction

resources:
Digging Into Data competition description
Patricia Cohen. In 500 Billion Words, New Window on Culture. From New York Times Humanities 2.0 series.
infosthetics.com
visual complexity



------------------------------------
2 / 4.6.2011 / History and theory of visualization / introduction to ImageJ software / "direct visualization" techniques
montage, slice, sampling

theory readings:
Manovich, Lev. What is Visualization? 2010. Visit all projects referred to in the article.

resource:
history of cartography, statistical graphics, and data visualization


assignments:
1. Install ImageJ on your computer.
2. Work through the ImageJ basics tutorial using image(s) of your choice.
3. Read the ImageJ documentation: Basic Concepts, Macro Language.


------------------------------------
3 / 4.13.2011 / Digital image processing using ImageJ
history and uses of image processing;
greyscale, saturation, hue, and number-of-shapes measurements with ImageJ built-in commands and scripts

readings:
1. Wikipedia article on Image Processing (also look at all links under "Typical Operations").
2. Image Processing with ImageJ (PDF)

------------------------------------
4 / 4.20.2011 / Digital image processing using ImageJ - continued
ImageJ measurements on regions; video analysis (exporting and importing video frames; measurements with image; frame differences; shot detection)

readings:
1. Fernanda B. Viégas and Martin Wattenberg: Artistic Data Visualization. 2006. Visit the websites for all projects described in the article.
2. Tara Zepel. 2008 U.S. Presidential Campaign Ads (read the blog post; the longer article is optional.)
3. The N^3 Report.



------------------------------------
5 / 4.27.2011 / shot detection with shotdetect; data analysis and visualization with manyeyes and Mondrian

readings:
1. Descriptive statistics.
2. Selected chapters from Computation of Style (1982). file: The_Computation_of_Style.pdf

resources:
statistical functions in Google docs
Google spreadsheets documentation


------------------------------------
6 / 5.4.2011 / mediavis with ImagePlot; discussion of final project proposals

homework:
Each group should prepare a proposal for the final project.

The final project should present analysis and visualizations of interesting patterns in a relatively large cultural data set. You can use any data set (still images, video, text, 3D, geo, etc.). The only requirement is that you have to analyze the actual content of the data and not only metadata.

Practically, this means that you have to use computational techniques to calculate stats and/or extract some features from your data. Of course, you can also use any metadata available and/or add metadata of your own via manual annotation.
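
As a rough illustration of what "extracting features" from content can mean, here is a minimal sketch (not one of the class tools). It assumes an image is already available as a list of (r, g, b) pixels with channels in 0..255, and it uses only the Python standard library; the function name and the tiny sample image are hypothetical.

```python
import colorsys
from statistics import mean, pstdev


def image_features(pixels):
    """Summarize an image given as (r, g, b) tuples with channels in 0..255."""
    hsv = [colorsys.rgb_to_hsv(r / 255, g / 255, b / 255) for r, g, b in pixels]
    saturation = [s for _, s, _ in hsv]
    brightness = [v for _, _, v in hsv]  # HSV "value", used here as brightness
    return {
        "mean_brightness": mean(brightness),
        "brightness_stdev": pstdev(brightness),
        "mean_saturation": mean(saturation),
    }


# Hypothetical 2x2 image, flattened: red, black, white, mid-grey.
sample_pixels = [(255, 0, 0), (0, 0, 0), (255, 255, 255), (128, 128, 128)]
features = image_features(sample_pixels)
for name, value in features.items():
    print(f"{name}: {value:.3f}")
```

Computing numbers like these for every image in a collection gives one row of features per image, which can then be explored with any of the visualization tools listed below.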

Visualizations: you can use any software and/or write your own; the final visualizations have to be both meaningful and visually striking.

The proposal should contain the following parts (short texts):
- the data source: method for collecting the data and a time estimate of how long it will take
- research questions: what questions you want to investigate
- relevance: why would this project be interesting to others?

The proposal also needs to include a small pilot study (download a small sample of your data, then analyze and visualize it to see if your hypotheses hold up; if you don't get interesting results, you need to revise your idea or choose a different data set).

The proposal can be created in any format (Powerpoint, web page, blog post, etc) as long as it contains the required text parts and the visuals for the pilot study.

Note that if you don't have a solid proposal and a convincing pilot study, you will have to redo it.


Software resources:
descriptive stats online software

data exploration and visualization software:
Mondrian
Tableau desktop (Windows)
manyeyes


Examples of student projects:
The N^3 Report
Sharedegg

digital humanities ++ | Manovich's course at UCSD, spring 2011

One million manga pages


complete course syllabus: digital humanities++


------------------------------------------------------
course description:

“How does the notion of scale affect humanities and social science research?
Now that scholars have access to huge repositories of digitized data—far more than they could read in a lifetime—what does that mean for research?”
The description of the joint NEH/NSF Digging into Data competition (2009) organized by the Office of Digital Humanities at the National Endowment for the Humanities (the U.S. federal agency which funds humanities research).

“The next big idea in language, history and the arts? Data.”
New York Times, November 16, 2010.


Over the last few years, digital humanities - the use of computational tools for cultural analysis - has been rapidly expanding, with a growing number of grants, panels and presentations at conferences, and media coverage. (For example, the New York Times is running a series of articles about digital humanities, with 5 articles already in print since November 2010.) However, most of the projects so far have focused on text and spatial data (literature and history departments). With a few exceptions, other fields including art history, visual culture, film and media studies, musicology, and new media have yet to start using computational methods. Even in the social sciences, the disciplines which deal with culture (media studies, cultural sociology, anthropology) and which employ quantitative methods have not yet discovered the full possibilities of "cultural computation." In short: the opportunities are wide open, and it is an exciting time to enter the field.

This graduate seminar explores the concepts, methods, and tools of computational cultural analysis, with a particular focus on the analysis of visual and interactive media. (This is also the focus of our lab's cultural analytics research).

We will discuss cultural, social and technical developments which gave us "large cultural data" (digitization by cultural institutions, social media) and which placed "information" and "data" at the center of contemporary social and economic life (the concepts of information society, network society, software society).

We will critically examine the fundamental paradigms developed by modern and contemporary societies to analyze patterns in data - statistics, visualization, data mining. This will help us to employ computational tools more reflexively. At the same time, the practical work with these tools will help us to better understand how they are used in society at large - the modes of thinking and inquiry they enable, their strengths and weaknesses, the often unexamined assumptions behind their use.
(This approach can be called reflexive digital humanities.)

We will discuss theoretical issues raised by computational cultural analysis: selecting data (single artifacts vs. sample vs. complete population); meanings vs. patterns; analyzing artifacts vs. analyzing cultural processes; methodologies for analyzing interactive media; combining established humanities methods with computational methods.

The course assumes that while computational methods can be used in the service of existing humanities questions and approaches, they also have radical potential to challenge existing concepts and research paradigms, and lead to new types of questions. To engage this potential, we have to start by considering "contemporary techniques of control, communication, representation, simulation, analysis, decision-making, memory, vision, writing, and interaction" enabled by software in society at large. (Manovich, Introduction to Software Takes Command.) Projecting these techniques onto the problem of cultural analysis will tell us what digital humanities can be.

The seminar combines readings, discussion, exercises to learn tools and techniques, and collaborative work in groups to conduct original digital humanities projects. Students will be able to use any of the data sets already assembled by Software Studies Initiative (see examples) as well as the unique supervisualization HIPerSpace system.

Manovich lectures at Emory University and Georgia Institute of Technology, March 28-29.

March 28: Emory University
time: 7pm.
location: Goizueta Business School, Boynton Auditorium.

March 29: Georgia Institute of Technology.
time: 4pm.
location: Technology Square Research Building (TSRB) Auditorium

article about cultural analytics research in NEH Humanities magazine

Keywords: visualization, cultural analytics, softwarestudies.com, software studies, digital humanities, Calit2, UCSD, NEH, NERSC, HIPerSpace, Cultural Analytics, Lev Manovich, Jeremy Douglass


James Willford. Graphing Culture. Humanities. March/April 22, number 2.


Exploring one million manga pages on the 287 megapixel HIPerSpace

New article on computational analysis of one million manga pages

Keywords: manga, comics, Naruto, visualization, cultural analytics, softwarestudies.com, software studies, digital humanities, Calit2, UCSD, NEH, NERSC, HIPerSpace, Cultural Analytics, Lev Manovich, Jeremy Douglass, William Huber


Article: Jeremy Douglass, William Huber, Lev Manovich. Understanding scanlation: how to read one million fan-translated manga pages. Forthcoming in Image and Narrative, Spring 2011.



One Piece manga - 10461 scanlation pages.

Recently...