Parse Email in Java

My last project, about 80% of which is written in Java, needed a new module: a parser for every message passing through the mail server. The reasons behind the module are rather peculiar, but I would like to share some of the details.

The setup:

A Postfix mail server with Dovecot as the delivery service, running on CentOS. And, of course, a JVM.

Message structure

What an email is, what it consists of, its approximate structure, its headers and MIME types are all described in plain language on Wikipedia.
More interesting is the file name of a message on the server. Here is an example name of a freshly delivered message (not yet read or requested by a client):

1348142977.M852516P31269.mail.example.com,S=3309,W=3371


The name consists of comma-separated fields. For a newly delivered message they record when and where the message was delivered, and its size.

  • Two message sizes are given: the usual size, denoted by "S", and the Vsize, denoted by "W", which is RFC822.SIZE. (The question "What is RFC822.SIZE?" is answered in a separate discussion.)
  • The time is given in Unix format, in seconds.
  • In the same field as the time, separated by dots, there may be "P", the process ID, and "M", a microsecond counter added to keep the name unique (other attributes are possible; see the notes below).
  • The server named is the final one, the one on which the message is stored, not a relay server through which the message may have been forwarded.

Of all this, only the delivery time (the first ten digits) was useful to me. However, this time often differs from the time in the message header, so I used the time in the name only to filter messages in the directory.
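
The naming scheme described above can be unpacked with plain string handling. The following is a minimal sketch, assuming names of exactly the shape shown in the example (a dot-separated first field followed by numeric KEY=VALUE fields). The class MaildirName and its members are hypothetical names introduced here for illustration; they are not part of any library.

```java
import java.time.Instant;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: splits a message file name into the fields described above.
class MaildirName {
    final long deliveredEpochSeconds;   // the leading Unix time, in seconds
    final String host;                  // the server that stores the message
    final Map<String, Long> sizes = new LinkedHashMap<>();  // e.g. S and W

    MaildirName(String fileName) {
        // Client flags, if present, start at ':' and are not needed here.
        int colon = fileName.indexOf(':');
        String base = colon >= 0 ? fileName.substring(0, colon) : fileName;

        String[] fields = base.split(",");
        // First field: <time>.<unique part such as M...P...>.<host>
        String[] unique = fields[0].split("\\.", 3);
        deliveredEpochSeconds = Long.parseLong(unique[0]);
        host = unique.length == 3 ? unique[2] : "";

        // Remaining fields are assumed to be KEY=<number>, e.g. S=3309.
        for (int i = 1; i < fields.length; i++) {
            int eq = fields[i].indexOf('=');
            if (eq > 0) {
                sizes.put(fields[i].substring(0, eq),
                          Long.parseLong(fields[i].substring(eq + 1)));
            }
        }
    }

    // Cheap pre-filter by delivery time, without opening the file.
    boolean deliveredAtOrAfter(Instant threshold) {
        return deliveredEpochSeconds >= threshold.getEpochSecond();
    }

    public static void main(String[] args) {
        MaildirName name = new MaildirName(
                "1348142977.M852516P31269.mail.example.com,S=3309,W=3371");
        System.out.println(Instant.ofEpochSecond(name.deliveredEpochSeconds)
                + " " + name.host + " " + name.sizes);
    }
}
```

A directory listing can then be narrowed down with deliveredAtOrAfter before any message is actually parsed, which is the only use the time in the name was put to here.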

Additional / client flags

The mail client (hereafter simply "the client") can add its own flags to the message name. The start of the client flags is marked by the ":" character.

As soon as the client requests new messages from the server, each requested message is moved to the "read" directory (cur in Maildir) and an info marker (one of two) is appended to its name, separated by a comma from the flags that follow:
  • “1”: according to the documentation, the flags that follow carry experimental meaning.
  • “2”: this is what I saw in practice in 100% of cases. It means that each character after the comma is a separate flag.

Even though the message is already in the “read” folder on the server, the user will still see it as new, because clients look at the flags, not at the location of the message.
That is, the message only appears as read once the user opens it (or performs some other action on it) and the “S” (Seen) flag is added to its name. Other actions on the message, as one would expect, add flags of their own; see the notes.

Example:
A new message arrives on the server for our mailbox. Its name will look something like this:

1348142977.M852516P31269.mail.example.com,S=3309,W=3371

In the background we have, God forbid, Outlook running. It requests the list of new messages and tells the server to move them to the “read” directory, adding the marker:

1348142977.M852516P31269.mail.example.com,S=3309,W=3371:2,

Next, we open Outlook and click on the new message, and the S flag is added:

1348142977.M852516P31269.mail.example.com,S=3309,W=3371:2,S

Then we reply to it and delete it, which adds R (Replied) and T (Trashed):

1348142977.M852516P31269.mail.example.com,S=3309,W=3371:2,SRT

As we can see, flags are listed without separators.

Notes: some clients let you configure whether messages are moved to the “read” folder. Clients also sometimes add flags that are not specified in the documentation for their own needs; I did not pay much attention to those.
More information on the flags: cr.yp.to/proto/maildir.html
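
The client flags can be read straight from the file name. Below is a minimal sketch that handles only the ":2," marker and the six single-letter flags listed in the Maildir documentation linked above; anything else (the experimental ":1," form, client-specific flags) is simply ignored. MaildirFlags and its Flag enum are hypothetical names introduced for this illustration.

```java
import java.util.EnumSet;
import java.util.Set;

// Hypothetical helper: extracts the standard flags that follow ":2," in a file name.
class MaildirFlags {
    enum Flag {
        PASSED('P'), REPLIED('R'), SEEN('S'), TRASHED('T'), DRAFT('D'), FLAGGED('F');

        final char code;
        Flag(char code) { this.code = code; }
    }

    static Set<Flag> parse(String fileName) {
        Set<Flag> flags = EnumSet.noneOf(Flag.class);
        int marker = fileName.lastIndexOf(":2,");
        if (marker < 0) {
            return flags;  // new message, or the experimental ":1," form
        }
        // Every character after the comma is a separate flag, with no separators.
        for (char c : fileName.substring(marker + 3).toCharArray()) {
            for (Flag flag : Flag.values()) {
                if (flag.code == c) {
                    flags.add(flag);
                }
            }
        }
        return flags;
    }

    public static void main(String[] args) {
        System.out.println(parse(
                "1348142977.M852516P31269.mail.example.com,S=3309,W=3371:2,SRT"));
    }
}
```

Checking parse(name).contains(MaildirFlags.Flag.SEEN) is then enough to tell whether the user has actually opened a message, regardless of which directory it sits in.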

And a little Java

To work with messages I used javax.mail. It kindly provides the abstract class javax.mail.Message, although in this case I limited myself to its MIME implementation, javax.mail.internet.MimeMessage.
The module runs on the mail server itself, so we access the message files locally (checks and exception handling are omitted from the code):

// properties are left at their defaults in this example
Session session = Session.getDefaultInstance(System.getProperties());
FileInputStream fis = new FileInputStream(pathToMessage);
MimeMessage mimeMessage = new MimeMessage(session, fis);

Now we can read the message headers, which are expected to be in ASCII. If a header is not found, the getter returns null. For instance:

String messageSubject = mimeMessage.getSubject();
String messageId = mimeMessage.getMessageID();

To get the list of recipients there is the getRecipients method, which takes a Message.RecipientType as its argument. It returns an array of Address objects, or null if the message has no recipients of that type. For example, to list the "To" recipients:

for(Address recipient : mimeMessage.getRecipients(Message.RecipientType.TO)){
    System.out.println(recipient.toString());
}

To find out the sender(s) of the message there is the getFrom method. It also returns an array of Address objects. The method reads the “From” header; if that is absent, it reads the “Sender” header; if there is no “Sender” either, it returns null.

for(Address sender : mimeMessage.getFrom()){
    System.out.println(sender.toString());
}

Next, we analyze the body of the message (in most cases we need the text and the attachments). The body can be composite (a MIME multipart message) or contain a single block in text/plain format. If the body consists only of an attachment (without text), it is still marked as a multipart message. According to the MIME specification (RFC 2045), the format of the message body (and of each of its parts) is given in the Content-Type header.

// The message content consists of several parts
// ("multipart/*" also matches multipart/alternative, not only multipart/mixed)
if (mimeMessage.isMimeType("multipart/*")) {
    // getContent() returns the content of the message body, or of one of its parts.
    // The return type is Object, so we cast it to Multipart
    Multipart multipart = (Multipart) mimeMessage.getContent();
    // Iterate over all parts of the composite body
    for (int i = 0; i < multipart.getCount(); i++) {
        BodyPart part = multipart.getBodyPart(i);
        // HTML messages carry two parts, "text/plain" and "text/html"
        // (the former for clients that cannot display HTML), so if the markup does not matter to us:
        if (part.isMimeType("text/plain")) {
            System.out.println(part.getContent().toString());
        }
        // Check whether the part is an attachment
        else if (Part.ATTACHMENT.equalsIgnoreCase(part.getDisposition())) {
            // The check for file name collisions is omitted. The name may be encoded, so we decode it
            String fileName = MimeUtility.decodeText(part.getFileName());
            // Get the InputStream
            InputStream is = part.getInputStream();
            // From here we can write the file out, or do whatever else is required of us
            // ....
        }
    }
}
// The message consists of a single block with the message text
else if (mimeMessage.isMimeType("text/plain")) {
    System.out.println(mimeMessage.getContent().toString());
}


That, in fact, is all. I hope the material proves helpful.
There is also a useful javax.mail FAQ at oracle.com.

UPD: As the first comment points out, the body parts of a message can be nested inside one another. Two ways of walking through them are posted in the comments as well.
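
The usual way to deal with nesting is recursion: descend into every multipart part and handle the leaves. The sketch below shows only the shape of that traversal and is not one of the two solutions from the comments. To keep it runnable without the mail library, MimeNode is a hypothetical stand-in for javax.mail.Part. With the real API the branch test would be part.isMimeType("multipart/*"), and the children would come from casting part.getContent() to Multipart.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class NestedParts {
    // Hypothetical stand-in for javax.mail.Part: a MIME type plus text or child parts.
    static class MimeNode {
        final String mimeType;
        final String text;              // null for multipart nodes
        final List<MimeNode> children;  // empty for leaf nodes

        private MimeNode(String mimeType, String text, List<MimeNode> children) {
            this.mimeType = mimeType;
            this.text = text;
            this.children = children;
        }

        static MimeNode leaf(String mimeType, String text) {
            return new MimeNode(mimeType, text, new ArrayList<>());
        }

        static MimeNode multipart(String mimeType, MimeNode... children) {
            return new MimeNode(mimeType, null, Arrays.asList(children));
        }

        boolean isMultipart() {
            return mimeType.startsWith("multipart/");
        }
    }

    // Depth-first walk: recurse into multipart nodes, collect text/plain leaves.
    static void collectPlainText(MimeNode node, List<String> out) {
        if (node.isMultipart()) {
            for (MimeNode child : node.children) {
                collectPlainText(child, out);
            }
        } else if (node.mimeType.equals("text/plain")) {
            out.add(node.text);
        }
    }

    public static void main(String[] args) {
        // A message with an attachment: the text sits one level down.
        MimeNode message = MimeNode.multipart("multipart/mixed",
                MimeNode.multipart("multipart/alternative",
                        MimeNode.leaf("text/plain", "hello"),
                        MimeNode.leaf("text/html", "<p>hello</p>")),
                MimeNode.leaf("application/pdf", null));
        List<String> texts = new ArrayList<>();
        collectPlainText(message, texts);
        System.out.println(texts);
    }
}
```

A flat loop over the top-level parts would miss the text in this example, because the text/plain part sits inside the nested multipart/alternative block.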
