This was my favorite project as a Master’s student.
Goal
- Discover differences between movie genres using embeddings of words in movie scripts
Duration
- 1 month project
Activities
- Extracted movie scripts, classified by genre, from IMSDb
- Cleaned and preprocessed text data to train neural networks and obtain word embeddings for each movie genre
- Studied word representations in vector space [1][2]
- Compared word embeddings between movie genres
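As an illustration of the cleaning step listed above, here is a minimal base R sketch. It is not the project's actual pipeline: the function name `clean_script` and the specific rules (lowercasing, keeping only letters and apostrophes, splitting on whitespace) are assumptions chosen for the example.

```r
# Hypothetical helper: turn raw script text into a vector of lowercase tokens.
clean_script <- function(text) {
  text <- tolower(text)
  # Replace anything that is not a letter, apostrophe or space with a space.
  text <- gsub("[^a-z' ]+", " ", text)
  tokens <- strsplit(text, "\\s+")[[1]]
  # Drop empty strings that can appear when the text starts with a separator.
  tokens[nzchar(tokens)]
}

tokens <- clean_script("INT. KITCHEN - NIGHT\nSam opens the door.")
print(tokens)
```

A real pipeline would likely also handle script-specific markers and multi-word phrases, which this sketch leaves out.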
Toolbox
- R
Outcome
- Documented the findings and the extraction/cleaning process
Project report:
Please clone this GitHub repository to replicate the results!
Findings:
- These are the 100 most common nouns used across all movie genres.
top100nouns$word
## [1] "door" "right" "man" "room" "face"
## [6] "time" "head" "hand" "way" "something"
## [11] "car" "front" "side" "phone" "window"
## [16] "moment" "mr" "floor" "table" "thing"
## [21] "house" "nothing" "night" "wall" "camera"
## [26] "life" "hey" "gun" "bed" "move"
## [31] "place" "water" "work" "street" "day"
## [36] "love" "sound" "guy" "anything" "woman"
## [41] "voice" "girl" "body" "end" "shot"
## [46] "home" "arm" "standing" "ground" "fire"
## [51] "god" "glass" "boy" "air" "name"
## [56] "money" "everything" "sir" "lot" "blood"
## [61] "smile" "father" "hair" "sam" "mouth"
## [66] "world" "cont'd" "mother" "hell" "desk"
## [71] "someone" "line" "crowd" "kind" "corner"
## [76] "shoulder" "police" "fuck" "show" "chair"
## [81] "everyone" "office" "building" "silence" "screen"
## [86] "mrs" "seat" "mind" "music" "bag"
## [91] "eye" "course" "baby" "son" "road"
## [96] "friend" "book" "fine" "half" "o"
- It is possible to compare how a word is used in different movie genres. For example, in the third row of the table below, the word `life` in romance movies is most closely related to the words career, marriage, happiness, dreams, imagination, talent, wellbeing, ...; whereas in scifi movies `life` is closer to power, brain, memory, existence, abilities, memories, birthright, ...
word | genre1 | genre2 | in1not2 | in2not1 |
---|---|---|---|---|
girl | action | animation | blonde, kid | princess, nun, witch, bird, ev’rything, friend, puppy, maiden |
car | action | horror | sedan, bike, mercedes, motorcycle, driver, bmw, humvee, u_haul, minivan, gto, hummer, claudio, train, bus, oncoming_lane, peugeot, cruiser | range_rover, ambulance, station_wagon, volvo, driveway, mini_van, buick, porsche, pickup_truck, ferrari, trans_am |
life | romance | scifi | career, marriage, happiness, dreams, imagination, talent, wellbeing, our_lives, 51_vote, friendship, suitemates, greatness, visitation_rights, estimation, formative_years, dream, suspected_communist | power, brain, memory, existence, abilities, memories, birthright, life, libido, programming, strength, theories, moral_outrage, mission, meaningless_compared, adapt_itself |
gun | animation | mystery | toon_38, stun, blaster, boot, ashes_charizard, hammer, taser, crossbow, laser_cannon, plasma_weapon, bat, glock_pistol, pitcher’s_glove, disk_disk, dorian_pummels, cannon, epaulet_arms, crouching_position, sword, laser_pistol, ar_180, porthos_snatches, circular_obsidian, signal_flare, arrow, desert_eagle, ee_vaaaaaah, rush_stows | 9mm, holster, waistband, service_revolver, baretta, glock, knife, lenny’s_glock, 380, 45_automatic, silenced_pistol, beretta, casull, briefcase, its_holster, satchel, longdale, longdale’s, baton |
god | action | animation | g_d, jesus, jesus_christ, sings_drunkenly, dear, nearly_swallowed, my_goodness, holy, nonononono, a’mighty, godl, trespass_sweetly, jeesus, swear, universal_destruction, nazareth, urg’d, merd, ohhhhh, 1st_posse, dishonour, o_oul, yeahhhh, cattily, signal’s_fading | sakes_stepek, freedonia, mister_disney, muse, gawd, underminer, allah, hogwash, thee, mister_flintstone, almighty, praise_allah, whatever_pleases, oh, am_fortune’s, guv, friar_tuck, indeedy, sirree_bob, mercy, bumstead_contracts, suppertime, josh_baskin, youuuuu, my_gosh, heaven |
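The column names `in1not2` and `in2not1` suggest set differences between the nearest neighbours of a word in each genre. The sketch below shows one way such columns could be computed, assuming cosine similarity and a fixed number of neighbours. The embeddings are made-up toy numbers, and `cosine_neighbors` is a hypothetical helper, not the code used in the project.

```r
# Toy embeddings (invented numbers, NOT the project's data); rows are words.
romance <- rbind(life = c(1, 0), dream = c(0.95, 0.05), career = c(0.9, 0.1),
                 marriage = c(0.8, 0.3), power = c(0, 1), memory = c(0.1, 0.9))
scifi <- rbind(life = c(0, 1), dream = c(0.05, 0.95), career = c(1, 0),
               marriage = c(0.9, 0.2), power = c(0.1, 0.9), memory = c(0.2, 0.8))

# Hypothetical helper: the n words with the highest cosine similarity to `word`.
cosine_neighbors <- function(emb, word, n = 3) {
  # Scale each row to unit length so dot products equal cosine similarities.
  unit <- emb / sqrt(rowSums(emb^2))
  sims <- drop(unit %*% unit[word, ])
  sims <- sims[names(sims) != word]
  names(sort(sims, decreasing = TRUE))[seq_len(min(n, length(sims)))]
}

n1 <- cosine_neighbors(romance, "life")
n2 <- cosine_neighbors(scifi, "life")

in1not2 <- setdiff(n1, n2)  # neighbours only in genre 1 (romance)
in2not1 <- setdiff(n2, n1)  # neighbours only in genre 2 (scifi)
print(in1not2)
print(in2not1)
```

In this toy data, "dream" is close to `life` in both genres, so it drops out of both columns and only the genre-specific neighbours remain.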
- Multidimensional scaling (MDS) and word embeddings from scripts can be combined to create a 2-dimensional map of distances between movie genres.
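A rough sketch of how such a map could be built with base R's `cmdscale()` follows. How the project actually measured distance between genres is not stated here, so this example assumes a Jaccard distance between the nearest-neighbour sets of a single word; the neighbour sets themselves are hypothetical.

```r
# Hypothetical nearest-neighbour sets for the word "life" in four genres.
neighbors <- list(
  romance = c("career", "marriage", "happiness", "dreams"),
  drama   = c("career", "marriage", "memory", "dreams"),
  scifi   = c("power", "brain", "memory", "existence"),
  horror  = c("power", "brain", "memory", "dreams")
)

# Assumed distance: 1 minus the share of neighbours two genres have in common.
jaccard_dist <- function(a, b) {
  1 - length(intersect(a, b)) / length(union(a, b))
}

genres <- names(neighbors)
d <- sapply(genres, function(g1) {
  sapply(genres, function(g2) jaccard_dist(neighbors[[g1]], neighbors[[g2]]))
})

# Classical multidimensional scaling to two dimensions.
coords <- cmdscale(as.dist(d), k = 2)
print(round(coords, 2))
```

Each row of `coords` gives a genre's position on the map, for example via `plot(coords)` followed by `text(coords, labels = rownames(coords))`. In practice the distance would presumably be aggregated over many words rather than one.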
[1] https://code.google.com/archive/p/word2vec/
[2] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient Estimation of Word Representations in Vector Space. In Proceedings of Workshop at ICLR, 2013.