Tuesday, October 19, 2010

Noun-Noun Compounds

Here’s my latest mini-project.

Noun-noun compounds are a relatively rare syntactic occurrence where two nouns are paired together to blend or attribute separate concepts.  Some examples include “island bungalow”, “race car”, and “stock market”.  Interested yet?  When you think about it, it may seem like the first noun is actually an adjective, but it’s really not.  Technically, adjectives need a form different from their noun counterparts.  Nouns and verbs in English often share a form: “run for the door” and “I’ll go for a run”.  Adjectives, however, have distinct forms--forms that are often generalised and over-extended to create fake words.  Cramming our examples into an adjective-noun structure we would get “island-esque bungalow”, “race-ish car”, and “stocky market”.  None of these sounds all that great, and it’s not just the made-up-ness of the words.  These examples are overly constrained by the interface between lexical and syntactic boundaries.  An adjective is not what we want, despite it being syntactically a little more common.  So!  People make up noun-noun compounds instead of breaking lexical rules.  I think Ray Jackendoff takes this as evidence that lexical constraints apply in parallel to syntactic constraints, not in sequence.

Linguistically: We can explode the syntax of noun-noun compounds like so: [n1 + n2] => “the [n1] type of [n2]” or inverted: “an [n2] of the [n1] type”.  Poetically, these options might be neat...but that’s about it.  They are a more explicit, boring form of concept combination.  In fact, they’re so explicit that some of the metaphorical shading present in the concise noun-noun form is lost.  Replacing the fluidity of metaphor we get boring old category assumptions, such as Q: what type of bungalow? A: the island type.  This supposes there is a discrete set of types-of-bungalows, one of which is island.  Dumb!  Noun-nouns are better.
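For the curious, the two expansions above can be sketched in a few lines of Perl.  This is only an illustration: explode_compound is a made-up helper name, and the a/an choice is a crude guess based on the first letter of the second noun.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of get_NNs.pl): expand a noun-noun
# compound [n1 + n2] into the two explicit paraphrases.
sub explode_compound {
    my ($n1, $n2) = @_;
    # Crude article choice: "an" before a vowel letter, "a" otherwise.
    my $article = ($n2 =~ /^[aeiou]/i) ? "an" : "a";
    return ("the $n1 type of $n2", "$article $n2 of the $n1 type");
}

for my $pair (["island", "bungalow"], ["race", "car"], ["stock", "market"]) {
    my ($plain, $inverted) = explode_compound(@$pair);
    print "$pair->[0] $pair->[1]: $plain / $inverted\n";
}
```

Reading the output aloud makes the point nicely: every expansion is grammatical, and every one is duller than the compound it came from.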

So, why do we care?  Psychologically, noun-noun compounds are interesting because they are a case of analogy and concept blending.  But beyond that, they are a wonderfully small and easy example of conceptual genesis.  Now there’s a buzz-word for ya.  But check it out.  New concepts are created every day.  Often words are used to characterize, describe, or understand concepts.  But sometimes words enable the creation of new concepts.  Terms, as distinct from concepts, are particularly interesting when they are new and isolated.  A good example is the term “moral hazard” in recent financial and political talk.  The term refers to a situation where someone takes on more risk than they otherwise would because they are shielded from its full cost.  So we have this two-word compound used to characterise a circumstance that is recently relevant to the financial and political arenas.  The term is introduced without much explanation, but it catches on regardless.  Metaphor aside, this is an example of conceptual genesis, enabled and largely inspired by the creative use of language.

What we want to do is see what noun-noun compounds say about language and people.

Step 0) Make sure they’re interesting.  See above.
Step 1) Find them.  See below.
Step 2) Figure out what they mean.  ...

Step 1:

Below is a script for finding noun-noun compounds in arbitrary text.  It uses TreeTagger (free but separate) to tag a text file with part-of-speech tags (noun, verb, etc.).  It then finds the compounds and their lemmas, collapses them into frequency counts, and gives you a spreadsheet (a CSV file).
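To make the core idea concrete before the full script, here is a minimal sketch of the pairing step on its own.  It assumes the tagger's output has already been read into memory as tab-separated "word, tag, lemma" lines; the sample lines are hand-written rather than real TreeTagger output, and count_noun_pairs is a made-up name.

```perl
use strict;
use warnings;

# Hypothetical helper: takes tagged lines ("word<TAB>tag<TAB>lemma")
# and returns a hash of "lemma1 lemma2" => count for adjacent nouns.
sub count_noun_pairs {
    my @lines = @_;
    my %count;
    my ($prev_tag, $prev_lemma) = ("", "");
    for my $line (@lines) {
        my ($word, $tag, $lemma) = split /\t/, $line;
        next unless defined $lemma;      # skip malformed lines
        $tag = substr($tag, 0, 2);       # NN and NNS both collapse to "NN"
        if ($tag eq 'NN' && $prev_tag eq 'NN') {
            $count{"$prev_lemma $lemma"}++;
        }
        ($prev_tag, $prev_lemma) = ($tag, $lemma);
    }
    return %count;
}

# Hand-written sample, not real tagger output.
my @tagged = (
    "The\tDT\tthe", "stock\tNN\tstock", "markets\tNNS\tmarket",
    "fell\tVVD\tfall", "as\tIN\tas", "stock\tNN\tstock",
    "market\tNN\tmarket", "fears\tNNS\tfear", "grew\tVVD\tgrow",
);
my %count = count_noun_pairs(@tagged);
print "$count{$_}\t$_\n" for sort keys %count;
```

Note that a run of three nouns ("stock market fears") yields two overlapping pairs.  Unlike the full script, this sketch has no exclusion list and keys on lemmas only.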

It needs a Unix environment and TreeTagger.  TreeTagger takes the majority of time, but run on 10,829,875 words it’s not bad:

real    3m32.609s
user    5m24.708s
sys     0m20.861s

Step 2:

Preliminarily: I’ve looked at the results from a corpus of financial texts and get this! 91 of the top 100 noun-noun compounds are distinctly financial in nature.  Compare this with the 45 / 100 for the top bare nouns.  This makes me think that noun-noun compounds might be even MORE interesting...
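If you want to pull a top-N list like this out of the script's CSV, something along these lines would do it.  It's a sketch under a couple of assumptions: top_compounds is a made-up helper, the rows below are invented for illustration (they are not from the financial corpus), and summing counts across word forms of the same lemma pair is my choice here, not necessarily how the numbers above were produced.  Deciding which compounds are "distinctly financial" is still a job for a human.

```perl
use strict;
use warnings;

# Hypothetical helper: given CSV rows in the form
# "count,word_1,lemma_1,word_2,lemma_2", return the $n most frequent
# compounds as "lemma_1 lemma_2" strings, summing over word forms.
sub top_compounds {
    my ($n, @rows) = @_;
    my %total;
    for my $row (@rows) {
        my ($count, $w1, $l1, $w2, $l2) = split /,/, $row;
        # Skips the header row and anything malformed.
        next unless defined $l2 && $count =~ /^\d+$/;
        $total{"$l1 $l2"} += $count;
    }
    # Highest count first; ties broken alphabetically.
    my @ranked = sort { $total{$b} <=> $total{$a} || $a cmp $b } keys %total;
    $n = @ranked if $n > @ranked;
    return @ranked[0 .. $n - 1];
}

# Invented rows for illustration only.
my @rows = (
    "count,word_1,lemma_1,word_2,lemma_2",
    "3,interest,interest,rates,rate",
    "2,interest,interest,rate,rate",
    "4,stock,stock,market,market",
    "1,island,island,bungalow,bungalow",
);
print "$_\n" for top_compounds(2, @rows);
```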

More later!

The Script: get_NNs.pl

#! /usr/bin/perl -W
use strict;
$| = 1; 

## A script to find noun-noun compounds in a text file.
## Uses TreeTagger to lemmatise and POS-tag the corpus.
## Then we count and sort the instances into a CSV.
##   A. Gerow, 2010-10-14: gerowa@tcd.ie

unless (@ARGV >= 3 && -r $ARGV[0] && $ARGV[1] && -x $ARGV[2]) {
    print "Usage: ./get_NNs.pl [input file] [output file] [path to tree-tagger command]\n";
    exit 1;
}

if (-e "./temp") {
    print "Error: Please remove ./temp before running.\n\n";
    exit 1;
}

sub INT_handler {
    unlink("./temp") if (-e "./temp");
    print "\nCaught SIG_INT, exiting\n\n";
    exit 1;
}

$SIG{'INT'} = 'INT_handler';

my $INFILE  = $ARGV[0]; # input file
my $OUTFILE = $ARGV[1]; # output file
my $TT_PATH = $ARGV[2]; # path to language-specific tree-tagger binary

my $prev_type  = "";
my $prev_word  = "";
my $prev_lemma = "";
my @results;

# Words to exclude (added ad hoc.)
my @exclude = qw'zero one two three four five six seven eight nine ten eleven twelve
                 the an a of per cent for in ? and his her has from its their to these 
                 this that were out were new after whose began before them last with
                 sent rose take first second third fourth fifth sixth seventh eighth
                 more garner over also formal into down up strong all hit far day week
                 year decade highest lowest sure hard other recent said our abouta abouthe';

# First cat the corpus through tr to delete some characters before TreeTagger gets it.
# ($ and @ are escaped so Perl passes them to tr instead of interpolating them.)
print "Tagging...";
qx+cat $INFILE | tr -d '=\`"<>/\@#?\$%^&(){}[]' | $TT_PATH > ./temp 2> /dev/null +;
print "done\nSearching for noun-noun compounds...";

# For every word as tagged by TreeTagger:
open(IN, "<./temp") or die "Cannot open ./temp: $!\n";
while (<IN>) {
    my ($word, $type, $lemma) = split(" ");

    # Lower-case, remove '?' and spaces.
    $word = lc($word);
    $word =~ s/\s*//g;    
    $type = substr($type, 0, 2); # Includes NNS, NNX, ..., NN*

    # Skip if 1) the word is in the exclude list
    #         2) the word is contracted 'is' (ie. 's)
    #         3) the word contains digits
    #         4) the word is shorter than 3 letters
    if (grep($_ eq lc($word), @exclude) || lc($word) eq "'s" ||
        lc($word) =~ m/[0-9]/ || length($word) < 3) {
        # Reset the state so an excluded word breaks any compound.
        $prev_word = "";
        $prev_type = "";
        $prev_lemma = "";
        next;
    }
    elsif ($type eq 'NN' && $type eq $prev_type) { 
        push(@results, "$prev_word,$prev_lemma,$word,$lemma");
    }
    # Remember this word for comparison with the next one.
    $prev_word = $word;
    $prev_type = $type;
    $prev_lemma = $lemma;
}
close(IN);
print "done\nWriting CSV...";

# Count and collapse like instances:
my %count;
map { $count{$_}++ } @results;

# Write to file and sort it, ascending by number of occurrences:
open(OUT, ">temp") or die "Cannot write ./temp: $!\n";
print OUT "$count{$_},$_\n" for keys(%count);
close(OUT); # flush to disk before the external sort reads the file
qx/echo 'count,word_1,lemma_1,word_2,lemma_2' > $OUTFILE/; # CSV header
qx/sort -n temp >> $OUTFILE && rm temp/; # UNIX numerical sort

print "done\n\n";