
Twitter bot based on Markov chains and phrases from TV shows

I was browsing forums for the kinds of questions Python programmers get asked at interviews and came across one wonderful item. Quoting it loosely: "They asked me to write a nonsense generator based on an n-th order Markov chain." "But I don't have such a generator yet!" cried my inner voice. "Quick, open Sublime and write one!" it insisted. Well, I had to obey.
And here I will tell you how I made it.
I decided right away that the generator would publish all its thoughts on Twitter and on its own website. For the main technologies I chose Flask and PostgreSQL; they talk to each other through SQLAlchemy.
Structure.
So, here are the models:
import datetime

from app import db  # the Flask-SQLAlchemy instance (assumed module layout)

class Srt(db.Model):
    id = db.Column(db.Integer, primary_key=True)
    set_of_words = db.Column(db.Text())   # unique words of one episode
    list_of_words = db.Column(db.Text())  # all words, in original order

class UpperWords(db.Model):
    word = db.Column(db.String(40), index=True, primary_key=True, unique=True)

    def __repr__(self):
        return self.word

class Phrases(db.Model):
    id = db.Column(db.Integer, primary_key=True)
    created = db.Column(db.DateTime, default=datetime.datetime.now)
    phrase = db.Column(db.String(140), index=True)

    def __repr__(self):
        return str(self.phrase)
For the source texts I decided to take subtitles from popular TV shows. The Srt class stores the ordered list of all words from the processed subtitles of one episode, together with the unique set of the same words (without repetition). This makes it easier for the bot to search for a phrase in specific subtitles: it first checks whether the set of words of the phrase is a subset of the episode's set of words, and only then checks whether they occur there in the right order.
For the first word of a phrase, a random word starting with a capital letter is picked from the text. The UpperWords model is there to store such words; they are written into it without repetition.
And the Phrases class stores the already generated tweets.
The structure is dead simple.
The .srt subtitle parser is split out into a separate module, add_srt.py. There is nothing extraordinary in it, but if you are interested, all the source code is on GitHub.
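The gist of what the parser produces can be sketched like this (a rough sketch; the helper name add_episode and the assumption that the text has already been stripped of timestamps and markup are mine, not the actual code from add_srt.py):

def add_episode(text):
    # text: the cleaned subtitle text of one episode
    words = text.split()
    db.session.add(models.Srt(
        list_of_words=' '.join(words),      # all words, in original order
        set_of_words=' '.join(set(words)),  # unique words, for the fast subset check
    ))
    # remember capitalized words as possible sentence starters
    for w in set(words):
        if w[0].isupper() and not models.UpperWords.query.get(w):
            db.session.add(models.UpperWords(word=w))
    db.session.commit()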
Generator.
First you need to pick the first word for the tweet. As stated earlier, it can be any word from the UpperWords model. Its selection is implemented in this function:
def add_word(word_list, n):
    if not word_list:
        # first word of the chain: a random record from UpperWords (PostgreSQL)
        word = db.session.query(models.UpperWords).order_by(func.random()).first().word
    elif len(word_list) <= n:
        # the chain is still shorter than the order: match on its full length
        word = get_word(word_list, len(word_list))
    else:
        word = get_word(word_list, n)
    if word:
        word_list.append(word)
        return True
    else:
        return False
The choice of this word is implemented directly by the line:
word = db.session.query(models.UpperWords).order_by(func.random()).first().word
If you use MySQL, you need func.rand() instead of func.random(). That is the only difference in this implementation; everything else works completely identically.
If a first word already exists, the function looks at the current length of the chain and, depending on it, decides how many trailing words of our list (a chain of order n) need to be matched in the text to pick the next word.
The next word is obtained in the get_word function:
def get_word(word_list, n):
    queries = models.Srt.query.all()
    query_list = list()
    # keep only the episodes whose vocabulary contains every word of the chain
    for query in queries:
        if set(word_list) <= set(query.set_of_words.split()):
            query_list.append(query.list_of_words.split())
    if query_list:
        text = list()
        for lst in query_list:
            text.extend(lst)
        # positions of the words that follow each occurrence of the last n words
        indexies = [i + n for i, j in enumerate(text[:-n])
                    if text[i:i + n] == word_list[len(word_list) - n:]]
        if indexies:  # guard: the exact n-gram may not occur in the combined text
            return text[random.choice(indexies)]
    return False
First of all, the script runs through all the loaded subtitles and checks whether our set of words is a subset of a given episode's set of words. Then the word lists of the matching subtitles are merged into a single list, every occurrence of the whole chain is located in it, and the positions of the words that follow those occurrences are collected. It all ends with a blind (random) choice of one of those words. Everything is as in life.
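A toy illustration of that index search (made-up data, chain order n = 2):

text = ['My', 'cat', 'and', 'my', 'dog', 'like', 'my', 'dog', '.']
word_list = ['Cat', 'and', 'my']  # only the last n words of the chain are compared
n = 2
indexies = [i + n for i, j in enumerate(text[:-n])
            if text[i:i + n] == word_list[-n:]]
# ['and', 'my'] occurs at position 2, so indexies == [4]
# and the candidate next word is text[4] == 'dog'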
This is how words get added to the list. The tweet itself is assembled in the following function:
def get_twit():
    word_list = list()
    n = N  # order of the Markov chain, a project-level constant
    while len(' '.join(word_list)) < 140:
        if not add_word(word_list, n):
            break
        if len(' '.join(word_list)) > 140:
            word_list.pop()
            break
    # trim from the end until the tweet closes with sentence-final punctuation
    while word_list and word_list[-1][-1] not in '.?!':
        word_list.pop()
    return ' '.join(word_list)
Everything is very simple: the tweet must not exceed 140 characters and must end with a sentence-final punctuation mark. That's all; the generator has done its job.
Display on the site.
Display on the site is handled by the views.py module.
@app.route('/')
def index():
    return render_template("main/index.html")
It just renders a template; all tweets will be pulled into it with JS.
@app.route('/page')
def page():
    page = int(request.args.get('page'))
    diff = int(request.args.get('difference'))
    limit = 20
    phrases = models.Phrases.query.order_by(-models.Phrases.id).all()
    pages = math.ceil(len(phrases) / float(limit))
    count = len(phrases)
    phrases = phrases[page * limit + diff:(page + 1) * limit + diff]
    return json.dumps({'phrases': phrases, 'pages': pages, 'count': count},
                      cls=controllers.AlchemyEncoder)
Returns the tweets for a specific page; this is needed for infinite scrolling. Everything is pretty ordinary. diff is the number of tweets added after the page was loaded, and the selection of tweets for a page has to be shifted by that amount: for example, with limit = 20, page = 1 and diff = 3, the slice phrases[23:43] returns the same twenty tweets the reader would have seen without the update.
And the update itself:
@app.route('/update')
def update():
    last_count = int(request.args.get('count'))
    phrases = models.Phrases.query.order_by(-models.Phrases.id).all()
    count = len(phrases)
    if count > last_count:
        phrases = phrases[:count - last_count]
        return json.dumps({'phrases': phrases, 'count': count},
                          cls=controllers.AlchemyEncoder)
    else:
        return json.dumps({'count': count})
On the client side it is called every n seconds and pulls in the newly added tweets in real time. That is how the tweet display works. (If anyone is curious, take a look at the AlchemyEncoder class in controllers.py; it is used to serialize the tweets received from SQLAlchemy.)
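The idea behind such an encoder is simply to dump a model's columns to a dict. A minimal sketch (my version, not necessarily the one in controllers.py) could look like this:

import json

class AlchemyEncoder(json.JSONEncoder):
    def default(self, obj):
        # serialize SQLAlchemy model instances as {column name: value}
        if hasattr(obj, '__table__'):
            return {c.name: str(getattr(obj, c.name))
                    for c in obj.__table__.columns}
        return json.JSONEncoder.default(self, obj)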
Adding tweets to the database and posting on Twitter.
To post to Twitter I used tweepy. A very handy library; it works right out of the box. Here is what that looks like:
def twit():
    phrase = get_twit()
    # save the new phrase to the database...
    twited = models.Phrases(phrase=phrase)
    db.session.add(twited)
    db.session.commit()
    # ...and post it to Twitter
    auth = tweepy.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
    auth.set_access_token(ACCESS_TOKEN, ACCESS_TOKEN_SECRET)
    api = tweepy.API(auth)
    api.update_status(status=phrase)
I put a call to this function into cron.py at the root of the project and, as you might guess, it is run by cron. Every half hour a new tweet is added to the database and posted to Twitter.
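For reference, a crontab entry for such a schedule could look something like this (the path and interpreter are hypothetical; adjust them to your own setup):

*/30 * * * * python /path/to/project/cron.py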

It all worked!
Finally.
At the moment I have loaded all the subtitles for the series "Friends" and "The Big Bang Theory". The order of the Markov chain is set to two for now (as the subtitle base grows, the order will go up). You can see how it works on Twitter, and all the source code is available on GitHub. I deliberately do not post a link to the site itself; whoever really needs it will certainly find it.
Thank you all for your attention. See you soon!