As a fun project, I’m working on the social network of actors in Bollywood. In this network, the nodes represent Bollywood actors and actresses, and an edge connects two actors who starred in a movie together. Hollywood actor networks have been studied for a long time. It would be interesting to compare the Bollywood and Hollywood actor networks and see what the comparison reveals about Bollywood.
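To make the network definition concrete, here is a minimal Python sketch of how co-star edges could be derived from cast lists. The movie and actor names are made-up placeholders rather than IMDB data, and `build_costar_network` is a hypothetical helper name, not something from an existing library.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical sample data: movie title -> cast list (placeholders, not IMDB data).
casts = {
    "Movie A": ["Actor 1", "Actor 2", "Actor 3"],
    "Movie B": ["Actor 2", "Actor 3"],
    "Movie C": ["Actor 4"],
}


def build_costar_network(casts):
    """Return {(actor_a, actor_b): shared_movie_count}, with actor_a < actor_b."""
    edges = defaultdict(int)
    for cast in casts.values():
        # Every unordered pair of distinct actors in one movie shares an edge.
        for a, b in combinations(sorted(set(cast)), 2):
            edges[(a, b)] += 1
    return dict(edges)


costar_edges = build_costar_network(casts)
for (a, b), shared in sorted(costar_edges.items()):
    print(a, "--", b, "shared movies:", shared)
```

Keeping the shared-movie count as an edge weight is a design choice of this sketch; dropping it gives the plain unweighted network described above. An actor whose movies have no other cast members, like "Actor 4" here, gets no edges at all.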
IMDB is the obvious choice for this data. I was skeptical about whether it covered foreign film industries, but after getting the database, I think it is very comprehensive. IMDB offers Unix and Windows programs to download and maintain the IMDB files locally. I found them a pain to install; I could not get them installed on my Ubuntu box. JMDB is a Java-based program that imports the IMDB files into a MySQL database. It was easy to use, although it takes a few hours to import the whole database.
Some numbers from the dataset:
- Number of Hindi movies: 8,119
- Number of actors: 15,954 (Male: 10,816, Female: 5,138)
- Number of actors who have acted in more than one movie: 5,365 (Male: 3,516, Female: 1,849)
The database includes both movies and TV. This may be a problem, since TV shows have larger casts and different characteristics than movies.
I generated the network for the year 2007 only, since the whole network is too big for my computer. I need to figure out methods to visualize such large networks. In the graph, the red nodes are movies and the green nodes are actors.
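A graph with movie nodes and actor nodes is a bipartite graph. As a rough sketch of how a single-year slice might be assembled, assuming the data has already been reduced to (movie, year, actor) records, something like the following would do. The records and the `bipartite_for_year` helper are hypothetical and do not reflect the actual JMDB schema.

```python
# Hypothetical records: (movie title, release year, actor name).
records = [
    ("Movie A", 2007, "Actor 1"),
    ("Movie A", 2007, "Actor 2"),
    ("Movie B", 2007, "Actor 2"),
    ("Movie C", 2006, "Actor 3"),
]


def bipartite_for_year(records, year):
    """Return (node_color, edges) for the movie-actor graph of one year."""
    node_color = {}
    edges = []
    for movie, movie_year, actor in records:
        if movie_year != year:
            continue
        # Tag each node with its kind so a movie and an actor
        # that happen to share a name cannot collide.
        movie_node = ("movie", movie)
        actor_node = ("actor", actor)
        node_color[movie_node] = "red"    # movies are drawn red
        node_color[actor_node] = "green"  # actors are drawn green
        edges.append((movie_node, actor_node))
    return node_color, edges


node_color, edges = bipartite_for_year(records, 2007)
print(len(node_color), "nodes,", len(edges), "edges")
```

The resulting node colors and edge list can then be handed to whatever graph drawing tool is in use.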
I will keep working on this network, trying to gain insights. I’m interested in identifying ‘star children’, who I hypothesize get a shortcut into the industry. I also want to find the cliques or clans in the network. Watch this space.
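One possible way to look for cliques is to list the maximal cliques of the co-star graph, meaning groups of actors who have all worked with each other and that cannot be extended any further. The sketch below uses the basic Bron-Kerbosch algorithm without pivoting; this is my assumption about what "cliques" could mean here, not a settled method, and the toy adjacency data is made up.

```python
# Hypothetical co-star graph: actor -> set of actors they have worked with.
adjacency = {
    "A": {"B", "C"},
    "B": {"A", "C"},
    "C": {"A", "B", "D"},
    "D": {"C"},
}


def maximal_cliques(adjacency):
    """Yield every maximal clique as a set (basic Bron-Kerbosch, no pivoting)."""

    def expand(clique, candidates, excluded):
        if not candidates and not excluded:
            yield clique  # nothing left to add, so the clique is maximal
            return
        for v in list(candidates):
            neighbours = adjacency[v]
            yield from expand(
                clique | {v}, candidates & neighbours, excluded & neighbours
            )
            # v is fully explored: move it from candidates to excluded.
            candidates = candidates - {v}
            excluded = excluded | {v}

    yield from expand(set(), set(adjacency), set())


cliques = sorted(sorted(c) for c in maximal_cliques(adjacency))
print(cliques)
```

Enumerating maximal cliques can take exponential time in the worst case, so on a network of thousands of actors it may need pivoting or a restriction to a subgraph first.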

