sashaeve December 26, 2009 at 18:04

Embedding semantic data in HTML

I would also like to join the discussion of the semantic web that was started here and here.

I spent some time researching the principles and development trends of the semantic web, and I want to share the main results and thoughts.

Why do we need it?

The answer is very simple: we need to separate the grain from the chaff, i.e. “information” from “information noise”.

How this could qualitatively change the web:

if you enter a query containing the name of a topic or a news item into a search engine, you will notice that 80% of the results are the same text, merely “embedded” in the graphical interface of different resources
focus on the information itself rather than on banners, link lists, friends of friends, etc.
more accurate search by taking into account only relevant content
your own suggestion?

What do we have at the moment?

While the necessity and advantages of the “semantic web” are more or less clear, the implementation options raise some concerns.

At the moment, we are operating with concepts such as URIs (Uniform Resource Identifiers), ontologies that are described by languages such as RDF and OWL, etc.

To be honest, my attempts to get to grips with these languages and the ways they are used have failed: they are difficult to understand, ambiguous, and need further refinement. My search for working, understandable tools was not successful either. In my view, this is the main obstacle to the development of this area.

We also have microformats, which, it seems to me, have advanced further conceptually, but unfortunately not far enough.

Of what I have come across, OpenCalais deserves attention: it allows you to extract some semantic information from texts and web resources. The service determines which category of knowledge (technology, education, politics, etc.) a given text belongs to, extracts terms, and provides some other similar information. Despite the apparent elegance of all this, it is too early to use this service seriously.

Manual labor or automation?

The second obstacle is that semantic data has to be entered by hand, which raises the questions of who will do this and who will pay for it.

My opinion is this: automation can help, but you cannot rely on it completely, for the simple reason that understanding and the logical connections between concepts are a matter of subjective assessment, which at this stage of development cannot be formalized.

Problem statement and solutions

So, when creating a website, we draw a unique design, adapt it to all known search engines and web standards, polish it for different browsers, select accountants and promotion specialists and pay them money, and most importantly, everyone considers this normal. So why can't the semantic component be part of this process?

From the point of view of the author of the site, it makes no sense to engage in semantics, because:

it requires additional labor costs (this is not so bad)
it requires learning new standards and languages (the same RDF and OWL)
search engines support semantics weakly or not at all

The first point is mostly a matter of money and is often quite solvable, and the third depends on the search leaders, so let us try to do something about the second point.

Semantic Data Integration

After analyzing (with a little imagination) methods of integrating semantic data that are complex and not very feasible, I settled on a simple and obvious way: integration in the form of tags and (or) CSS notation.

Example:

Mathcad is desktop software for performing and documenting engineering and scientific calculations.

In order for the code to be valid, add the scheme:

<!DOCTYPE html PUBLIC
"-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"
[
 ...
]>

In CSS notation:

Mathcad is desktop software for performing and documenting engineering and scientific calculations.

.mySemanticClass {
keywords: mathcad;
contentType: content;
category: math;
}

In our HTML file, we include the semantic file in the same way as an ordinary CSS file.
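To show how little machinery the CSS notation needs, here is a minimal sketch of how an engine might read such a semantic file. It assumes the file contains only simple class rules like the one above; the function name parse_semantic_css and the regular expression are hypothetical and not part of any existing tool.

```python
import re

# Hypothetical pattern: ".className { declarations }" with no nested braces.
RULE_RE = re.compile(r"\.([A-Za-z_][\w-]*)\s*\{([^}]*)\}")


def parse_semantic_css(text):
    """Return {class_name: {attribute: value}} for each rule found in text."""
    rules = {}
    for class_name, body in RULE_RE.findall(text):
        attrs = {}
        for declaration in body.split(";"):
            if ":" not in declaration:
                continue  # skip empty fragments left after the last ";"
            name, value = declaration.split(":", 1)
            attrs[name.strip()] = value.strip()
        rules[class_name] = attrs
    return rules


# The semantic rule from the example above.
semantic_file = """
.mySemanticClass {
keywords: mathcad;
contentType: content;
category: math;
}
"""

rules = parse_semantic_css(semantic_file)
print(rules["mySemanticClass"])
# prints {'keywords': 'mathcad', 'contentType': 'content', 'category': 'math'}
```

A real engine would probably want a proper CSS tokenizer, but for flat rules like these a regular expression is enough to illustrate the idea.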

Attributes and Categories

In my work, I identified the main attributes that I would like to have right away. Here is the list:

contentType defines the content type (top, bottom, advertisement, content, links, references, bibliographic, related, image, video, etc.);
keywords defines the relevant keywords or phrases for the block content;
synonyms defines related terms and synonyms (e.g. "Obama" and "president");
category defines the content category (Business_Finance, Entertainment_Culture, Environment, Health_Medical_Pharma, Hospitality_Recreation, Law_Crime, Politics, Sports, Technology_Internet, Weather, Other) [6];
importance defines the content importance and can be a float value from 0 to 1;
ref defines an additional reference related to the block content;
parent is the identifier of a parent block, indicating that the block must be considered together with its parent;
author defines copyright and can be used for citations, proverbs, programming code;
progLang defines a programming language.
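The attribute list can be sketched as a small data structure. This is only an illustration: the class name SemanticBlock, the Python field names, and the default values are my assumptions; the only rule taken from the list itself is that importance lies between 0 and 1.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class SemanticBlock:
    """Hypothetical container for the semantic attributes of one block."""
    content_type: str = "content"            # contentType
    keywords: List[str] = field(default_factory=list)
    synonyms: List[str] = field(default_factory=list)
    category: str = "Other"
    importance: float = 0.5                  # assumed default, not from the article
    ref: Optional[str] = None
    parent: Optional[str] = None             # identifier of the parent block
    author: Optional[str] = None
    prog_lang: Optional[str] = None          # progLang

    def __post_init__(self):
        # The list above restricts importance to a float from 0 to 1.
        if not 0.0 <= self.importance <= 1.0:
            raise ValueError("importance must be between 0 and 1")


block = SemanticBlock(
    keywords=["mathcad"],
    category="Technology_Internet",
    importance=0.8,
)
print(block.category, block.importance)
```

The category is deliberately left unvalidated here, since the example earlier uses a value (math) that is not in the category list.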

Consider the advantages of this approach:

you can integrate semantic data right at the stage of creating HTML
this can be done by either a layout designer or a programmer who is familiar with CSS (and there are many more of them than RDF experts)

Well yes, dreams, dreams ...

But the future is here!

This approach has one obvious drawback: it needs the support of the search giants, who would have to use it when indexing pages. But the idea can already be implemented in CMSs and blog engines: add the corresponding code and some extra input fields to the engine, and use this information in your own logic for searching and filtering data.
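As a rough sketch of what such "own logic" in a CMS could look like, the function below keeps only blocks marked as content whose keywords match the query, and orders them by importance. The function name search_blocks, the dictionary layout, and the sample data are all hypothetical.

```python
def search_blocks(blocks, query):
    """Return content blocks whose keywords match query, most important first."""
    query = query.lower()
    matches = [
        block for block in blocks
        if block.get("contentType") == "content"
        and query in [k.lower() for k in block.get("keywords", [])]
    ]
    # Blocks without an explicit importance are treated as least important.
    return sorted(matches, key=lambda b: b.get("importance", 0.0), reverse=True)


# Made-up blocks, as a CMS might store them after the author fills in the extra fields.
blocks = [
    {"id": "banner", "contentType": "advertisement", "keywords": ["mathcad"], "importance": 1.0},
    {"id": "details", "contentType": "content", "keywords": ["mathcad"], "importance": 0.4},
    {"id": "intro", "contentType": "content", "keywords": ["Mathcad", "calculations"], "importance": 0.9},
    {"id": "news", "contentType": "content", "keywords": ["weather"], "importance": 0.7},
]

print([block["id"] for block in search_blocks(blocks, "Mathcad")])
# prints ['intro', 'details']
```

The advertisement block is dropped even though its keyword matches, which is exactly the separation of information from information noise discussed at the start.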

P.S. As someone aptly noted, nobody gets beaten for an idea. So it would be interesting to discuss this option for the development of the semantic web. Thanks for your attention!
