SNA: Visualising an email box with R

Are statistics sexy? Visualising social networks certainly is! I wrote a little function, which makes producing beautiful plots depicting a mailbox with R an extremely easy task. I find visualisations of ‘social graphs’ particularly appealing. They look like flowers.

I had to use a few Python functions, which can be executed within R using the rJython library. The function connects to an IMAP server and reads the “To:” and “From:” header fields of the stored emails. It should not be difficult to adapt this script to work with POP3 too. I am really impressed by what R can do (with a little bit of help from Python). Can anyone suggest a more elegant way to do the same thing without executing Python?

As rJython depends on rJava, I had to install the Java Development Kit to launch it.

Warning: For me this function worked very well and did not do any harm to my mailbox. However, I am not an expert in IMAP, so if you run it, you do so at your own risk.

Here is the function:

mailSoc <- function(login,
                    pass,
                    serv = "imap.gmail.com", #specify IMAP server
                    ntore = 50, #ignore if addressed to more than
                    todow = -1, #how many to download
                    begin = -1){  #from which to start
 
  #load rJython and Python libraries
  require(rJython)  
  rJython <- rJython(modules = "imaplib")
  rJython$exec("import imaplib")
 
  #connect to server
  rJython$exec(paste("mymail = imaplib.IMAP4_SSL('",
                     serv, "')", sep = ""))
  rJython$exec(paste("mymail.login(\'",
                     login, "\',\'",
                     pass, "\')", sep = ""))
 
  #get number of available messages
  rJython$exec("sel = mymail.select()")
  rJython$exec("number = sel[1]")
  nofmsg <- .jstrVal(rJython$get("number"))
  nofmsg <- as.numeric(unlist(strsplit(nofmsg, "'"))[2])
 
  #if 'begin' not specified begin from the newest
  if(begin == -1)
  {
    begin <- nofmsg
  }
 
  #if 'todow' not specified download all
  if(todow == -1)
  {
    end <- 1
  }
  else
  {
    #fetch exactly 'todow' messages, but never go below message 1
    end <- max(1, begin - todow + 1)
  }
 
  #give a little bit of information
  todownload <- begin - end + 1
  print(paste("Found", nofmsg, "emails"))
  print(paste("I will download", todownload, "messages."))
  print("It can take a while")
 
  data <- data.frame()
 
  #fetching emails
  for (i in begin:end) {
    nr <- as.character(i)
 
    #get sender
    rJython$exec(paste("typ, fro = mymail.fetch(\'", nr, "\', \'(BODY[HEADER.FIELDS (from)])\')", sep = ""))
    rJython$exec("fro = fro[0][1]")
    from <- .jstrVal(rJython$get("fro"))
    from <- unlist(strsplit(from, "[<>\r\n, \"]"))
    from <- sub("from: ", "", from, ignore.case = TRUE)
    from <- grep("@", from, value = TRUE)
 
    #get addressees
    rJython$exec(paste("typ, to = mymail.fetch(\'", nr, "\', \'(BODY[HEADER.FIELDS (to)])\')", sep = ""))
    rJython$exec("to = to[0][1]")
    to <- .jstrVal(rJython$get("to"))
    to <- unlist(strsplit(to, "[<>\r\n, \"]"))
    to <- sub("to: ", "", to, ignore.case = TRUE)
    to  <- grep("@", to, value = TRUE)
 
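    #if there is a sender and a reasonable number of addresses, add to data frame
    #(a blank "From:" field would otherwise make data.frame() fail)
    if(length(from) > 0 && length(to) > 0 && length(to) <= ntore){
      vec <- rep(from, length(to))
      data <- rbind(data, data.frame(vec, to))
    }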
 
    #give some information about progress
    if((i - begin) %% 100 == 0)
    {
      print(paste((i - begin)*(-1), "/", todownload,
                  " Downloading...", sep = ""))
    }
  }
  names(data) <- c("from", "to")
  data$from <- tolower(data$from)
  data$to <- tolower(data$to)
 
  #log out, which also closes the connection
  rJython$exec("mymail.logout()")
  return(data)
}
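
The header-parsing step inside the loop can be tried on its own, without a mail server. Below is a minimal sketch in plain R; `extract_addresses` is a hypothetical helper name (it is not part of the function above) and the sample headers are made up. It only mirrors the split-and-grep approach used in `mailSoc` and is not a full address parser.

```r
# Hypothetical helper: pull e-mail addresses out of one raw header field,
# using the same split-and-grep approach as the loop in mailSoc().
extract_addresses <- function(header) {
  # split on angle brackets, line breaks, commas, spaces and double quotes
  parts <- unlist(strsplit(header, "[<>\r\n, \"]"))
  # keep only the pieces that contain "@", in lower case
  tolower(grep("@", parts, value = TRUE))
}

# made-up sample headers
to_field   <- "To: \"Ann Lee\" <Ann@example.org>, bob@example.com\r\n\r\n"
from_blank <- "From: \r\n\r\n"

extract_addresses(to_field)    # two addresses
extract_addresses(from_blank)  # character(0): a blank field yields nothing
```

The blank case matters: a zero-length result is exactly what has to be checked for before a row is added to the data frame.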

Now we can run, for example:

#download 200 most recent emails from gmail account
maild <- mailSoc("login", "password", serv = "imap.gmail.com",
                ntore = 40, todow = 200)
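
How `begin` and `todow` could be turned into message numbers is sketched below. `message_range` is a hypothetical helper, and stopping at message 1 is my assumption about the desired behaviour when `todow` is larger than the number of messages in the mailbox.

```r
# Hypothetical helper: the message numbers a call would fetch, newest first.
message_range <- function(nofmsg, todow = -1, begin = -1) {
  if (nofmsg < 1) return(integer(0))   # empty mailbox: nothing to fetch
  if (begin == -1) begin <- nofmsg     # start from the newest message
  if (todow == -1) {
    end <- 1                           # go all the way back
  } else {
    end <- max(1, begin - todow + 1)   # never ask for message 0 or below
  }
  begin:end
}

length(message_range(16881, todow = 200))  # 200 messages
message_range(3, todow = 10)               # 3 2 1
```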

To make a plot, load the network library:

library(network)
mailnet <- network(maild)
plot(mailnet)

This is the result:

[Figure: Social network analysis: visualisation of mailbox with R]
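
The data frame returned by `mailSoc` is simply an edge list with one row per sender/recipient pair, so it can also be inspected with base R before plotting. The sketch below uses a made-up data frame (`maild_toy`, with invented addresses) of the same shape and counts how often each pair occurs.

```r
# made-up edge list in the same shape as the data frame returned by mailSoc()
maild_toy <- data.frame(
  from = c("ann@example.org", "ann@example.org",
           "bob@example.com", "ann@example.org"),
  to   = c("me@example.net", "me@example.net",
           "me@example.net", "bob@example.com"),
  stringsAsFactors = FALSE
)

# add a counter column and sum it for every from -> to pair
edges <- aggregate(n ~ from + to, data = cbind(maild_toy, n = 1), FUN = sum)

# most frequent pairs first
edges[order(-edges$n), ]
```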

R provides many other social network analysis tools, such as the igraph library. For instance, it can be used to make an interactive ‘plot’:

library(igraph)
h <- graph.data.frame(maild, directed = FALSE)
tkplot(h, vertex.label = V(h)$name,
       layout=layout.fruchterman.reingold)

I would like to learn more about SNA, and I would also like to try out Gephi, which can produce visualisations even more attractive than those made in R. I will write about my first impressions soon.

UPDATE: I tested it only with gmail. If anybody tries it with other email servers please let me know about the results.
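
One thing to watch when trying other accounts: the login and password are pasted straight between single quotes in a Python statement, so a single quote or a backslash in either of them will break that statement. A minimal sketch of an escaping step follows; `py_quote` is a hypothetical helper, not something rJython provides, and it only handles backslashes and single quotes.

```r
# Hypothetical helper: escape backslashes and single quotes so that a value
# can be pasted between single quotes in a Python statement.
py_quote <- function(x) {
  x <- gsub("\\", "\\\\", x, fixed = TRUE)  # backslash -> two backslashes
  x <- gsub("'", "\\'", x, fixed = TRUE)    # quote -> backslash + quote
  paste("'", x, "'", sep = "")
}

# made-up password containing a backslash and a single quote
stmt <- paste("mymail.login(", py_quote("user"), ", ",
              py_quote("pa\\ss'word"), ")", sep = "")
cat(stmt, "\n")
```

The resulting string could then be passed to `rJython$exec()` in place of the hand-built login statement.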



16 Responses to SNA: Visualising an email box with R

  1. Jeremy Miles says:

    I think you have a typo at the end, should be plot(mailnet). not plot(maild).

  2. knb says:

    plot(maild) => plot(mailnet) : This line is still wrong in the version of the blog-post on the r-bloggers web site.

  3. Mats Rauhala says:

    Probably a stupid question, but what exactly are you visualising here?

  4. First of all I think that’s a great idea, but I’m confused about the dot in the bottom left corner!? No from nor a to? ;)
    Could you please keep us informed whether the article at r-bloggers automagically pulls an update concerning your typo?

    Btw. the shocking pink background of marked words makes me insane :-P

    • expansed says:

      This is an email that somebody sent to himself and just put me in the “CC:” field. I know that it is strange but it happens from time to time. ;)

  5. Bhima says:

    I am trying to learn a little bit about R, so I tried to do this on my Mac using the R.app. After stumbling around for a while, I think I have all the dependent packages installed and I copy & pasted the main function into a file. Then I used the source(“path.to.mailsoc”) command in the interactive terminal. Then I pasted the “maild <- mailSoc… [snip]" with my correct credentials. After thinking for a while R fails with the error.

    “Error in .jcall(“RJavaTools”, “Ljava/lang/Object;”, “invokeMethod”, cl, :
    Traceback (most recent call last):
    File “”, line 1, in
    File “/Library/Frameworks/R.framework/Versions/2.13/Resources/library/rJython/jython.jar/Lib/imaplib.py”, line 437, in fetch
    File “/Library/Frameworks/R.framework/Versions/2.13/Resources/library/rJython/jython.jar/Lib/imaplib.py”, line 1055, in _simple_command
    File “/Library/Frameworks/R.framework/Versions/2.13/Resources/library/rJython/jython.jar/Lib/imaplib.py”, line 892, in _command_complete
    imaplib.error: FETCH command error: BAD ['Could not parse command']
    >
    > library(network)
    > mailnet <- network(maild)
    > plot(maild)
    Error in plot(maild) : object ‘maild’ not found

    I guess I need to declare maild first? how?
    Thanks for any help!!

  6. Mayme says:

    hello there, really good blog, and a decent understand! definitely one for my book marks.

  7. Agustin says:

    I can confirm that your beautiful function works on my institution's server (postal.uv.es) and on aim.com (imap.aim.com).
    Thanks

  8. Bhima says:

    Now that I have this working somewhat, I thought I would point out some limitations that I think I have found.

    1: I think that domains which use google services and gmail fail to authenticate. In these cases the username for Google’s imap server is the complete email address, including the domain (User.Name@domain.com). While for usernames for email addresses using the gmail.com domain, “@gmail.com” is omitted. I only had access to 1 account like this, so I am not 100% sure.

    2: Passwords with symbols can cause errors. The specific case I saw was a password which included “\”, which I have been using for months with no other problems.

    3: It only fetches messages with the label “inbox”, messages with other labels are ignored. This causes the mailboxes of fastidious organizers to be pretty much ignored.

    4: If the value of available messages (with the label “inbox”) is less than the value of the specified sample size (todow) the process fails with the error:


    Error in .jcall(“RJavaTools”, “Ljava/lang/Object;”, “invokeMethod”, cl, :
    Traceback (most recent call last):
    File “”, line 1, in
    File “/Library/Frameworks/R.framework/Versions/2.13/Resources/library/rJython/jython.jar/Lib/imaplib.py”, line 437, in fetch
    File “/Library/Frameworks/R.framework/Versions/2.13/Resources/library/rJython/jython.jar/Lib/imaplib.py”, line 1055, in _simple_command
    File “/Library/Frameworks/R.framework/Versions/2.13/Resources/library/rJython/jython.jar/Lib/imaplib.py”, line 892, in _command_complete
    imaplib.error: FETCH command error: BAD ['Could not parse command']

    5: If the value of the specified sample size (todow) is greater than about 425, the process fails with the error:


    [1] “Found 16881 emails”
    [1] “I will download 500 messages.”
    [1] “It can take a while”
    [1] “0/500 Downloading…”
    [1] “100/500 Downloading…”
    [1] “200/500 Downloading…”
    [1] “300/500 Downloading…”
    [1] “400/500 Downloading…”
    Error in data.frame(vec, to) :
    arguments imply differing number of rows: 0, 1

    Cheers!

  9. chengjun says:

    When I ask it to download 2000+ emails, R issues a warning. It seems this code cannot deal efficiently with larger data sets.

  10. Kieran says:

    Tweaked it a little. This allows selection of a folder, and also returns the date stamp on each message (for other interesting analyses, e.g. what time of day / day of week do you get most email?)


    mailSoc <- function(login,
                        pass,
                        serv = "imap.gmail.com", #specify IMAP server
                        #ntore = 50, #ignore if addressed to more than
                        todow = -1, #how many to download
                        begin = -1, #from which to start
                        folder = ''){ #folder to download (default:inbox)

      #load rJython and Python libraries
      require(rJython)
      rJython <- rJython(modules = "imaplib")
      rJython$exec("import imaplib")

      #connect to server
      rJython$exec(paste("mymail = imaplib.IMAP4_SSL('",
                         serv, "')", sep = ""))
      rJython$exec(paste("mymail.login(\'",
                         login, "\',\'",
                         pass, "\')", sep = ""))

      #get number of available messages
      rJython$exec(paste("sel = mymail.select(\"", folder, "\")", sep = ""))
      rJython$exec("number = sel[1]")
      nofmsg <- .jstrVal(rJython$get("number"))
      nofmsg <- as.numeric(unlist(strsplit(nofmsg, "'"))[2])

      #if 'begin' not specified begin from the newest
      if(begin == -1)
      {
        begin <- nofmsg
      }

      #if 'todow' not specified download all
      if(todow == -1)
      {
        end <- 1
      }
      else
      {
        #fetch exactly 'todow' messages, but never go below message 1
        end <- max(1, begin - todow + 1)
      }

      #give a little bit of information
      todownload <- begin - end + 1
      print(paste("Found", nofmsg, "emails"))
      print(paste("I will download", todownload, "messages."))
      print("It can take a while")

      data <- data.frame()

      #fetching emails
      for (i in begin:end) {
        nr <- as.character(i)

        #get sender
        rJython$exec(paste("typ, fro = mymail.fetch(\'", nr, "\', \'(BODY[HEADER.FIELDS (from)])\')", sep = ""))
        rJython$exec("fro = fro[0][1]")
        from <- .jstrVal(rJython$get("fro"))
        from <- unlist(strsplit(from, "[<>\r\n, \"]"))
        from <- sub("from: ", "", from, ignore.case = TRUE)
        from <- grep("@", from, value = TRUE)

        #get addressees
        rJython$exec(paste("typ, to = mymail.fetch(\'", nr, "\', \'(BODY[HEADER.FIELDS (to)])\')", sep = ""))
        rJython$exec("to = to[0][1]")
        to <- .jstrVal(rJython$get("to"))
        to <- unlist(strsplit(to, "[<>\r\n, \"]"))
        to <- sub("to: ", "", to, ignore.case = TRUE)
        to <- grep("@", to, value = TRUE)

        #get dates:
        rJython$exec(paste("typ, date = mymail.fetch(\'", nr, "\', \'(BODY[HEADER.FIELDS (date)])\')", sep = ""))
        rJython$exec("date = date[0][1]")
        date <- .jstrVal(rJython$get("date"))
        date <- strptime(date, format = "Date: %a, %d %b %Y %H:%M:%S %z")

        #add to data frame; a blank "To:" or "From:" field becomes 'NA'
        #so that data.frame() does not fail on a zero-length column
        if(length(to) == 0)
          to <- 'NA'
        if(length(from) == 0)
          from <- 'NA'
        data <- rbind(data, data.frame(from, to, date))

        #give some information about progress
        print(i)
        if((i - begin) %% 100 == 0)
        {
          print(paste((i - begin)*(-1), "/", todownload,
                      " Downloading...", sep = ""))
        }
      }
      names(data) <- c("from", "to", "date")
      data$from <- tolower(data$from)
      data$to <- tolower(data$to)

      #close connection
      rJython$exec("mymail.shutdown()")
      return(data)
    }

  11. Kieran says:

    Oh, and your "arguments imply differing number of rows: 0, 1" error is due to a blank to or from field, which causes data.frame to choke. I’ve added a fix for that.
