How to find love or adventure with crate.io and kibana

    One can argue about the effectiveness, quality and efficiency of dating sites, and find 101 reasons why it is better to meet people in a club / bar / _add_your_own_options_ / park. What drew laughter ten to fifteen years ago is now mainstream. So why not try one more way of searching and talking on the Internet, with a later move to meeting in real life...



    This is a geek's take on search technology; a screencast of the application is under the cut. At the end of the article there is a link to an archive with a runnable application under Apache License v2.0 and a small example data set.


    Sounds encouraging, doesn't it?! The reality is somewhat more complicated: armies of bots and fake accounts, workers of the oldest profession, attempts by dating services to squeeze out maximum money for minimum result, and even thieves in search of prey. More interesting now? Not everything is so sad, and with the right approach the game is worth the candle!

    The promised application screencast:


    Let us look at the software side of the search. We divide the task into two parts, as with drawing an owl:

    • The first part: we draw an oval. For us this means finding, collecting and structuring data for later search. Any programming language with an HTTP client library and regular expressions or DOM / XPath support will do. This part was not a problem for me, as a developer with solid experience in integrating IT systems and as the developer of a distributed web crawler for the Visuvi search startup. If you find this topic interesting, vote for it as the subject of the next article.
    • The second part: we draw the rest of the owl. This means storing the data in a data store, indexing it, and writing a frontend for searching and viewing it.
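
    The "oval" step can be illustrated with a small sketch. It is not the crawler used for this article: the markup (a definition list with dt/dd pairs) and the names ProfileParser, parse_profile and SAMPLE are assumptions made up for the example. It only shows the idea of turning a profile page into a flat dictionary with the Python standard library.

```python
from html.parser import HTMLParser


class ProfileParser(HTMLParser):
    """Collects <dt>label</dt><dd>value</dd> pairs (hypothetical markup)."""

    def __init__(self):
        super().__init__()
        self.fields = {}
        self._tag = None    # tag currently being read: "dt", "dd" or None
        self._label = None  # last label seen inside a <dt>

    def handle_starttag(self, tag, attrs):
        if tag in ("dt", "dd"):
            self._tag = tag

    def handle_endtag(self, tag):
        if tag in ("dt", "dd"):
            self._tag = None

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if self._tag == "dt":
            self._label = text
        elif self._tag == "dd" and self._label is not None:
            self.fields[self._label] = text


def parse_profile(html):
    """Return a dict of label -> value extracted from one profile page."""
    parser = ProfileParser()
    parser.feed(html)
    return parser.fields


# Made-up sample page; a real site will need its own selectors.
SAMPLE = "<dl><dt>age</dt><dd>25</dd><dt>height</dt><dd>170</dd></dl>"
print(parse_profile(SAMPLE))
```

    Real pages are rarely this tidy, so in practice an XPath-capable library or a browser driver may be the more robust choice.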


    Crate.io comes to our aid: it is a set of plugins for storing binary data in the file system and executing distributed SQL queries using the capabilities already available in the Elasticsearch search server. In a nutshell, it is a shared-nothing NoSQL store at the base with the SQL parser and planner from Facebook Presto on top. It is a distributed solution from the world of big data, which for now we will use as a single process on one computer.

    Why crate.io? We need to store the photos somewhere, we need Elasticsearch at the same time, and SQL may come in handy for statistics and reports in the future. Let me reassure you: this time we can do without enterprise, Hibernate and JPA :). As you will see, working with crate is no harder than working with a relational database.

    Kibana is an HTML5 application that lets you visualize data from Elasticsearch, work with time series, filter data, and save search parameters as dashboards.

    How can this help in the search?! Minimum programming and maximum result.
    You can work with crate.io from Python, Ruby, PHP, or Java (a JDBC type 4 driver). For me, though, it was more convenient to enable the Elasticsearch REST API, which for some reason is hidden in crate, and to work through it.

    In the config/crate.yml file, add the parameters:
    es.api.enabled: true
    udc.enabled: false

    The second parameter disables the crate.io usage reports sent over UDP to the project server. I also deleted the binaries of the sigar monitoring library right away, so as not to confuse your antivirus.

    In this form the “box” works well both through the Elasticsearch REST API and with Spring Data Elasticsearch.

    To start the server, a Java JRE version 7 or later is required.
    I start the project with bin/crate (on Windows, use the bin\crate.bat file).

    Using the crash command-line utility or the web console at
    http://localhost:4200/_plugin/crate-admin/#/console

    create a repository for photos called images.

    bin/crash -c "create blob table images clustered into 7 shards with (number_of_replicas=0)"
    +-----------------------+-----------+---------+-----------+---------+
    | server_url            | node_name | version | connected | message |
    +-----------------------+-----------+---------+-----------+---------+
    | http://127.0.0.1:4200 | Brigade   | 0.45.3  | TRUE      | OK      |
    +-----------------------+-----------+---------+-----------+---------+
    CONNECT OK
    CREATE OK (1.104 sec)
    


    Elasticsearch does not require us to define a data format up front. With that approach the devil is in the details; it is more a topic for discussion in the comments to the article. I will nevertheless specify the data types explicitly using the Mapping API so that there are no problems with searching and displaying in Kibana.

    Data types
    {
      "info": {
        "mappings": {
          "default": {
            "properties": {
              "accommodation": {
                "type": "string",
                "index": "not_analyzed"
              },
              "age": {
                "type": "long"
              },
              "build": {
                "type": "string",
                "index": "not_analyzed"
              },
              "drinkingHabits": {
                "type": "string",
                "index": "not_analyzed"
              },
              "education": {
                "type": "string",
                "index": "not_analyzed"
              },
              "ethnicity": {
                "type": "string",
                "index": "not_analyzed"
              },
              "first": {
                "type": "date",
                "format": "basic_date_time"
              },
              "height": {
                "type": "long"
              },
              "images": {
                "type": "string"
              },
              "info": {
                "properties": {
                  "": {
                    "type": "string"
                  },
                  "Вес": {
                    "type": "string"
                  },
                  "Внешность": {
                    "type": "string"
                  },
                  "Дети": {
                    "type": "string"
                  },
                  "Знание языков": {
                    "type": "string"
                  },
                  "Кого я хочу найти": {
                    "type": "string"
                  },
                  "Материальное положение": {
                    "type": "string"
                  },
                  "Образование": {
                    "type": "string"
                  },
                  "Ориентация": {
                    "type": "string"
                  },
                  "Отношение к алкоголю": {
                    "type": "string"
                  },
                  "Отношение к курению": {
                    "type": "string"
                  },
                  "Отношения": {
                    "type": "string"
                  },
                  "Познакомлюсь": {
                    "type": "string"
                  },
                  "Проживание": {
                    "type": "string"
                  },
                  "Рост": {
                    "type": "string"
                  },
                  "Телосложение": {
                    "type": "string"
                  }
                }
              },
              "kids": {
                "type": "string",
                "index": "not_analyzed"
              },
              "last": {
                "type": "date",
                "format": "basic_date_time"
              },
              "login": {
                "type": "string"
              },
              "mainImage": {
                "type": "string",
                "index": "not_analyzed"
              },
              "message": {
                "type": "string"
              },
              "readableLogin": {
                "type": "boolean"
              },
              "realName": {
                "type": "string"
              },
              "relationship": {
                "type": "string",
                "index": "not_analyzed"
              },
              "replyRate": {
                "type": "long"
              },
              "searchingFor": {
                "type": "string"
              },
              "self": {
                "properties": {
                  "В друзьях я больше всего ценю": {
                    "type": "string"
                  },
                  "В женщинах я особенно ценю": {
                    "type": "string"
                  },
                  "В жизни я ставлю перед собой цель": {
                    "type": "string"
                  },
                  "В мужчинах я особенно ценю": {
                    "type": "string"
                  },
                  "Есть ли у меня домашние животные": {
                    "type": "string"
                  },
                  "Из всех известных людей я хотела бы быть": {
                    "type": "string"
                  },
                  "Как долго я смогу прожить без общения": {
                    "type": "string"
                  },
                  "Место, где я бы хотела жить": {
                    "type": "string"
                  },
                  "Мое любимое блюдо": {
                    "type": "string"
                  },
                  "Мое образование": {
                    "type": "string"
                  },
                  "Мое свободное время я хотела бы провести так": {
                    "type": "string"
                  },
                  "Мои любимые литературные герои": {
                    "type": "string"
                  },
                  "Мои любимые музыкальные исполнители": {
                    "type": "string"
                  },
                  "Мои любимые писатели": {
                    "type": "string"
                  },
                  "Мои любимые фильмы": {
                    "type": "string"
                  },
                  "Мои любимые художники": {
                    "type": "string"
                  },
                  "Мой девиз": {
                    "type": "string"
                  },
                  "Мой любимый город": {
                    "type": "string"
                  },
                  "Наивысшее счастье для меня": {
                    "type": "string"
                  },
                  "Самое поразительное открытие для меня": {
                    "type": "string"
                  },
                  "Самой привлекательной чертой своего характера я считаю": {
                    "type": "string"
                  },
                  "Самый ценный совет, который я получила в жизни": {
                    "type": "string"
                  },
                  "Хотела бы я иметь детей": {
                    "type": "string"
                  },
                  "Я больше всего горжусь этим достижением": {
                    "type": "string"
                  },
                  "Я мечтаю о работе": {
                    "type": "string"
                  }
                }
              },
              "smoker": {
                "type": "string",
                "index": "not_analyzed"
              },
              "updated": {
                "type": "date",
                "format": "basic_date_time"
              },
              "viewed": {
                "type": "long"
              },
              "weight": {
                "type": "long"
              }
            }
          }
        }
      }
    }
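
    A mapping this long is tedious to type by hand. Below is a minimal sketch of generating the same kind of body from field lists. The helper name build_mapping and the chosen field subsets are mine, and I am assuming an Elasticsearch 1.x-style Mapping API, where a body of the form {"default": {"properties": ...}} is sent with PUT to /info/_mapping/default once the index exists.

```python
import json


def build_mapping(keywords, longs, dates, type_name="default"):
    """Build a mapping body with explicit types for the listed fields."""
    properties = {}
    for name in keywords:
        # Exact-match fields: kept as a single term, not split into words.
        properties[name] = {"type": "string", "index": "not_analyzed"}
    for name in longs:
        properties[name] = {"type": "long"}
    for name in dates:
        properties[name] = {"type": "date", "format": "basic_date_time"}
    return {type_name: {"properties": properties}}


# Only a subset of the fields from the listing, for brevity.
mapping = build_mapping(
    keywords=["smoker", "kids", "relationship"],
    longs=["age", "height", "weight"],
    dates=["first", "last", "updated"],
)
print(json.dumps(mapping, indent=2, sort_keys=True))
```

    Fields that are not listed (for example message or realName) would keep the default analyzed string type, which is what full-text search wants anyway.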
    



    We run a script that downloads HTML pages from the sites, parses the HTML, extracts the data we need, and saves it using the REST API or the Elasticsearch Java client.
    Be sure to load the JSON with index type = "default" so that you can execute SQL queries.
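
    As an illustration of the saving step, here is a sketch that prepares an index request with the type set to "default". It assumes the Elasticsearch 1.x REST convention PUT /{index}/{type}/{id} on the same port as crate; the function name build_index_request and the sample profile are made up for the example.

```python
import json
import urllib.parse
import urllib.request


def build_index_request(doc_id, doc, base_url="http://localhost:4200",
                        index="info", doc_type="default"):
    """Prepare a PUT request that stores one JSON document under doc_id."""
    url = "{}/{}/{}/{}".format(
        base_url, index, doc_type, urllib.parse.quote(str(doc_id), safe=""))
    return urllib.request.Request(
        url,
        # ensure_ascii=False keeps Cyrillic keys and values readable in UTF-8.
        data=json.dumps(doc, ensure_ascii=False).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


# Made-up sample document, not taken from the real data set.
profile = {"login": "example_user", "age": 25, "height": 170}
request = build_index_request("example_user", profile)
print(request.get_method(), request.full_url)

# To actually send it to a running server:
# urllib.request.urlopen(request)
```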



    An example of one of the JSON documents.



    cr> select count(*) from info;
    +----------+
    | count(*) |
    +----------+
    |      291 |
    +----------+
    SELECT 1 row in set (0.030 sec)
    


    What is the average age in the data from the example?

    cr> select avg(age) from info;
    +---------------+
    | avg(age)      |
    +---------------+
    | 24.7275862069 |
    +---------------+
    SELECT 1 row in set (0.038 sec)
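
    The same statements can also be sent from a script rather than from crash. The sketch below assumes crate's HTTP endpoint /_sql, which accepts a JSON body with a "stmt" key; the helper names build_sql_request and run_sql are mine, and the exact response layout should be checked against your crate version.

```python
import json
import urllib.request

CRATE_URL = "http://localhost:4200"  # assumption: a single local node


def build_sql_request(stmt, base_url=CRATE_URL):
    """Prepare a POST request carrying one SQL statement."""
    body = json.dumps({"stmt": stmt}).encode("utf-8")
    return urllib.request.Request(
        base_url + "/_sql",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_sql(stmt, base_url=CRATE_URL):
    """Send the statement and return the decoded JSON response."""
    with urllib.request.urlopen(build_sql_request(stmt, base_url)) as resp:
        return json.loads(resp.read().decode("utf-8"))


request = build_sql_request("select avg(age) from info")
print(request.get_method(), request.full_url)

# With a running server:
# print(run_sql("select avg(age) from info"))
```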
    


    The same script downloads the images, computes the SHA-1 digest of each one, and does an HTTP PUT for each photo to crate.io:
    "http://127.0.0.1:4200/_blobs/images/"+fileDigest

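    A minimal sketch of the upload step, assuming the blob table is called images and the server listens on 127.0.0.1:4200 as above. The function names blob_url and build_blob_put are hypothetical; the key point is that the last URL segment is the hex SHA-1 digest of the file content.

```python
import hashlib
import urllib.request


def blob_url(data, base_url="http://127.0.0.1:4200", table="images"):
    """Return the blob URL for the given bytes: .../_blobs/<table>/<sha1>."""
    digest = hashlib.sha1(data).hexdigest()
    return "{}/_blobs/{}/{}".format(base_url, table, digest)


def build_blob_put(data, base_url="http://127.0.0.1:4200", table="images"):
    """Prepare the HTTP PUT that uploads the bytes under their digest."""
    return urllib.request.Request(
        blob_url(data, base_url, table), data=data, method="PUT")


# Any bytes will do for a demonstration; real code would read the photo file.
request = build_blob_put(b"abc")
print(request.get_method(), request.full_url)

# With a running server:
# urllib.request.urlopen(request)
```

    Storing only the digest in the images and mainImage fields of a document is then enough to rebuild the photo URL on the frontend.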

    We can verify that entries appeared in blob.images:

    cr> select count(*) from blob.images;
    +----------+
    | count(*) |
    +----------+
    |     2813 |
    +----------+
    SELECT 1 row in set (0.029 sec)
    


    Great, the data is in the database!

    I download the Kibana archive and unpack it into the plugins/kibana/_site directory. On restart, the server will pick up the frontend as a site plugin.

    In plugins/kibana/_site/config.js we specify the address of the Elasticsearch REST API:

    elasticsearch: "http://" + window.location.host,


    All the changes to Kibana are minor, really just hacks. The proper way would be to write your own configurable component.

    This fragment of the AngularJS template displays a rating selector for the _id field in the main table, and a photo when the mainImage field is visible.

    plugins/kibana/_site/app/panels/table/module.html

    Code: display the photo in the table and vote on the rating
    {{t}}
                                



    To display all the images of a single record while viewing that record:

    Code: display all photos



    For the voting script we use jQuery, which is already included in Kibana.

    plugins/kibana/_site/index.html

    Code: update the rating in the JSON document with a request to the server
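            // Send a partial update for one document via the Elasticsearch Update API.
            function postESUpdate(index, type, id, rate) {
                $.ajax({
                    type: "POST",
                    url: "http://" + window.location.host + "/" + index + "/" + type + "/" + id + "/_update",
                    // Serialize the body instead of concatenating JSON by hand.
                    data: JSON.stringify({doc: {rate: Number(rate)}})
                }).done(function () {
                    // Success is silent; for debugging use alert("success");
                }).fail(function () {
                    alert("error");
                });
            }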
    


    This is a call to the Elasticsearch Update API that updates the rate field of the document.

    That is the end of the programming. From here on it is only the web interface!
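
    The same partial update can be prepared outside the browser, for example to set ratings in bulk from a script. This is a sketch under the same assumptions as the jQuery function (POST to /{index}/{type}/{id}/_update with a "doc" body); the name build_update_request and the sample id are made up.

```python
import json
import urllib.request


def build_update_request(index, doc_type, doc_id, rate,
                         base_url="http://localhost:4200"):
    """Prepare a partial update that sets only the rate field."""
    url = "{}/{}/{}/{}/_update".format(base_url, index, doc_type, doc_id)
    body = json.dumps({"doc": {"rate": rate}}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_update_request("info", "default", "example_user", 5)
print(request.get_method(), request.full_url)
```

    Because only the doc fragment is sent, the other fields of the stored document stay untouched.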



    Briefly, about creating filters: you have already seen this in the screencast at the beginning of the article.
    It also shows how to select a time sub-range on the histogram or with the timepicker. All your filters and settings can be saved as a dashboard in Kibana and loaded by name when needed.

    Outside the scope of this article are regular-expression searches, service security, monitoring and administration of crate.io, and SQL queries through JDBC or clients for your programming language.

    I repeat that JVM 7 or newer is required to run the project.

    You can download the application with example data from Dropbox (234MB tar.gz), unpack it, and run it on *nix with the command:
    bin/crate
    or on Windows:
    bin\crate.bat

    Open the finished dashboard in the browser:
    http://localhost:4200/_plugin/kibana/#/dashboard/elasticsearch/When%20first%20photo%20was%20uploaded
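
    The dashboard name in that address is simply percent-encoded. A small sketch of building such links for your own saved dashboards; the helper name dashboard_url is mine, and the path layout is taken from the address above.

```python
from urllib.parse import quote


def dashboard_url(name, base_url="http://localhost:4200"):
    """Build the Kibana link for a dashboard saved under the given name."""
    return "{}/_plugin/kibana/#/dashboard/elasticsearch/{}".format(
        base_url, quote(name, safe=""))


print(dashboard_url("When first photo was uploaded"))
```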


    Good luck with crate.io/kibana and with real-life acquaintances!!!

    P.S. Dropbox has decided not to serve the archive today (11/27/2014). Please tell me in the comments which public file hosting will let me upload a 234MB file without restrictions on the number of downloads.


    Based on the results of your vote, I wrote the article “What should we parse a site. Webdriver API Basics”.


    What topic should I write the next article on?

    • 59.2% (144 votes): Writing a simple web crawler, extracting information from web pages
    • 37.4% (91 votes): A distributed web crawler and managing its tasks in a cluster
    • 31.6% (77 votes): A more detailed story about elasticsearch / crate.io, distributed systems, and plugin development for elasticsearch
    • 26.7% (65 votes): To hell with dating sites, give us articles about enterprise and Java!!!
