NLP + Movie Genre

This was my favorite project as a Master’s student.


  • Discover differences between movie genres using embeddings of words in movie scripts


  • One-month project


  • Extracted movie scripts, already classified by genre, from IMSDb
  • Cleaned and preprocessed the text, then trained neural networks to obtain word embeddings for each movie genre
  • Studied word representations in vector space [1][2]
  • Compared word embeddings between movie genres
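
As a rough illustration of the cleaning step, here is a minimal sketch in base R. The excerpt in `raw` and the helper `clean_script` are hypothetical, and the rules (lowercase, keep only letters and apostrophes, split on spaces) are assumptions rather than the exact preprocessing used in the project.

```r
# Hypothetical raw script excerpt; real input would be a full script.
raw <- "INT. KITCHEN - NIGHT\nSam opens the door.\n\nSAM (CONT'D)\nHey! Who's there?"

# Hypothetical helper: turn raw script text into a vector of tokens.
clean_script <- function(text) {
  text <- tolower(text)
  text <- gsub("[^a-z' ]+", " ", text)   # keep letters, apostrophes, spaces
  tokens <- strsplit(text, " +")[[1]]    # split on runs of spaces
  tokens[nchar(tokens) > 0]              # drop empty strings
}

tokens <- clean_script(raw)
print(tokens)
```

Apostrophes are kept so that contractions and screenplay markers such as cont'd survive as single tokens; a stricter cleaner might filter such markers out instead.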


  • R


  • Documented the findings and the extraction/cleaning process

Project report:

NLP + Movie Genre notebook

Please clone this GitHub repository to replicate the results!


  • These are the 100 most common nouns used across all movie genres.
##   [1] "door"       "right"      "man"        "room"       "face"      
##   [6] "time"       "head"       "hand"       "way"        "something" 
##  [11] "car"        "front"      "side"       "phone"      "window"    
##  [16] "moment"     "mr"         "floor"      "table"      "thing"     
##  [21] "house"      "nothing"    "night"      "wall"       "camera"    
##  [26] "life"       "hey"        "gun"        "bed"        "move"      
##  [31] "place"      "water"      "work"       "street"     "day"       
##  [36] "love"       "sound"      "guy"        "anything"   "woman"     
##  [41] "voice"      "girl"       "body"       "end"        "shot"      
##  [46] "home"       "arm"        "standing"   "ground"     "fire"      
##  [51] "god"        "glass"      "boy"        "air"        "name"      
##  [56] "money"      "everything" "sir"        "lot"        "blood"     
##  [61] "smile"      "father"     "hair"       "sam"        "mouth"     
##  [66] "world"      "cont'd"     "mother"     "hell"       "desk"      
##  [71] "someone"    "line"       "crowd"      "kind"       "corner"    
##  [76] "shoulder"   "police"     "fuck"       "show"       "chair"     
##  [81] "everyone"   "office"     "building"   "silence"    "screen"    
##  [86] "mrs"        "seat"       "mind"       "music"      "bag"       
##  [91] "eye"        "course"     "baby"       "son"        "road"      
##  [96] "friend"     "book"       "fine"       "half"       "o"
  • It is possible to compare how a word is used in different movie genres. For example, look at the third row in the table below: in romance movies, the word "life" is most closely related to career, marriage, happiness, dreams, imagination, talent, wellbeing, ...; in scifi movies, it is closer to power, brain, memory, existence, abilities, memories, birthright, ....

girl
    action: blonde, kid
    animation: princess, nun, witch, bird, ev’rything, friend, puppy, maiden

car
    action: sedan, bike, mercedes, motorcycle, driver, bmw, humvee, u_haul, minivan, gto, hummer, claudio, train, bus, oncoming_lane, peugeot, cruiser
    horror: range_rover, ambulance, station_wagon, volvo, driveway, mini_van, buick, porsche, pickup_truck, ferrari, trans_am

life
    romance: career, marriage, happiness, dreams, imagination, talent, wellbeing, our_lives, 51_vote, friendship, suitemates, greatness, visitation_rights, estimation, formative_years, dream, suspected_communist
    scifi: power, brain, memory, existence, abilities, memories, birthright, life, libido, programming, strength, theories, moral_outrage, mission, meaningless_compared, adapt_itself

gun
    animation: toon_38, stun, blaster, boot, ashes_charizard, hammer, taser, crossbow, laser_cannon, plasma_weapon, bat, glock_pistol, pitcher’s_glove, disk_disk, dorian_pummels, cannon, epaulet_arms, crouching_position, sword, laser_pistol, ar_180, porthos_snatches, circular_obsidian, signal_flare, arrow, desert_eagle, ee_vaaaaaah, rush_stows
    mystery: 9mm, holster, waistband, service_revolver, baretta, glock, knife, lenny’s_glock, 380, 45_automatic, silenced_pistol, beretta, casull, briefcase, its_holster, satchel, longdale, longdale’s, baton

god
    action: g_d, jesus, jesus_christ, sings_drunkenly, dear, nearly_swallowed, my_goodness, holy, nonononono, a’mighty, godl, trespass_sweetly, jeesus, swear, universal_destruction, nazareth, urg’d, merd, ohhhhh, 1st_posse, dishonour, o_oul, yeahhhh, cattily, signal’s_fading
    animation: sakes_stepek, freedonia, mister_disney, muse, gawd, underminer, allah, hogwash, thee, mister_flintstone, almighty, praise_allah, whatever_pleases, oh, am_fortune’s, guv, friar_tuck, indeedy, sirree_bob, mercy, bumstead_contracts, suppertime, josh_baskin, youuuuu, my_gosh, heaven

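
A comparison like the one above can be sketched as a nearest-neighbour lookup in each genre's embedding matrix. The sketch below assumes cosine similarity as the closeness measure, which the source does not confirm; the helper `nearest_words` and the tiny three-dimensional vectors are made up purely for illustration.

```r
# Toy embeddings: rows are words, columns are embedding dimensions.
# All values are invented; real vectors would come from a model trained
# on the scripts of a single genre.
romance <- rbind(
  life     = c(0.9, 0.1, 0.0),
  career   = c(0.8, 0.2, 0.1),
  marriage = c(0.7, 0.3, 0.0),
  brain    = c(0.1, 0.9, 0.2),
  power    = c(0.0, 0.8, 0.5)
)
scifi <- rbind(
  life     = c(0.1, 0.8, 0.5),
  career   = c(0.9, 0.1, 0.0),
  marriage = c(0.8, 0.0, 0.2),
  brain    = c(0.2, 0.9, 0.2),
  power    = c(0.1, 0.8, 0.4)
)

# Hypothetical helper: the n words closest to `word` by cosine similarity.
nearest_words <- function(emb, word, n = 2) {
  stopifnot(word %in% rownames(emb))
  unit <- emb / sqrt(rowSums(emb^2))   # scale every row to unit length
  sims <- drop(unit %*% unit[word, ])  # cosine similarity to `word`
  sims <- sims[names(sims) != word]    # exclude the query word itself
  names(sort(sims, decreasing = TRUE))[seq_len(min(n, length(sims)))]
}

print(nearest_words(romance, "life"))  # "career" "marriage"
print(nearest_words(scifi, "life"))    # "power" "brain"
```

The lookup runs separately inside each genre's matrix, so it never compares raw vectors across genres; independently trained embedding spaces are generally not aligned with one another.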
  • Multidimensional scaling (MDS) can be combined with the word embeddings from the scripts to create a 2-dimensional map of distances between movie genres.
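
The mapping step can be sketched with base R's `cmdscale`, which performs classical multidimensional scaling. How the project derived genre-to-genre distances from the embeddings is not stated here, so the distance matrix `d` below is filled with made-up numbers and only illustrates the final projection.

```r
genres <- c("action", "animation", "horror", "romance", "scifi")

# Hypothetical symmetric distance matrix between genres (invented values).
d <- matrix(
  c(0.0, 0.6, 0.4, 0.8, 0.3,
    0.6, 0.0, 0.7, 0.5, 0.6,
    0.4, 0.7, 0.0, 0.9, 0.5,
    0.8, 0.5, 0.9, 0.0, 0.7,
    0.3, 0.6, 0.5, 0.7, 0.0),
  nrow = 5, byrow = TRUE, dimnames = list(genres, genres)
)

# Classical MDS: find 2-D coordinates whose Euclidean distances
# approximate the entries of d.
coords <- cmdscale(as.dist(d), k = 2)
colnames(coords) <- c("dim1", "dim2")

print(coords)
```

Passing `coords` to `plot()` and then `text()` with `rownames(coords)` as labels would draw the map, with genres that are close in `d` landing near each other.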

[1] [2] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient Estimation of Word Representations in Vector Space. In Proceedings of Workshop at ICLR, 2013

Michelle Audirac
Data Scientist