Jan Scholtyssek » Big Data – Knowledge retrieval

Letting computers find and process information on their own to derive new knowledge is one of the big research topics of these years. You know search engines, and we are now used to finding the right information just by describing what we are looking for. Writing those algorithms is highly complex. Take vision: we humans are amazingly good at recognizing patterns and understanding the world, even when it changes. When there is a dot on a screen we can hardly avoid following it with our eyes; they do it on their own. For a computer even that task can be quite complicated and consume considerable resources, and the same goes for language. We don’t care whether someone says “Can I get the milk?” or “Can I get some milk?”, and when someone asks “Where is the milk?” in someone’s home we know that the answer is probably a place in the house (on the table, in the refrigerator…), while the answer is different in a supermarket. This is hard to teach a computer. Humans easily adapt to changing environments; computers don’t, because they don’t know the world.

For some time I have been dreaming of a database in which we teach computers the hierarchy of things. For example, we teach the machine that there is something called a bottle. A bottle is a container, and a container has a volume, which in turn is measured in units of m^3. The bottle inherits its properties from these concepts, so we know that a bottle has a volume, but on top of this a bottle has an opening (which maybe should belong to the “container” class instead).
The system could be taught by a community. People go to a website and help to describe the world, and by letting different people answer the same categorization question, some consistency should emerge. They might for example answer questions like:
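A minimal sketch of how such a hierarchy could look in Python. The class `Concept` and its methods are made up for illustration; this is one possible way to model inherited properties, not a finished design.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Concept:
    """A node in the hierarchy (hypothetical name, for illustration only)."""
    name: str
    parent: Optional["Concept"] = None
    own_properties: dict = field(default_factory=dict)

    def properties(self) -> dict:
        """Merge inherited properties with this concept's own properties."""
        inherited = self.parent.properties() if self.parent else {}
        return {**inherited, **self.own_properties}

    def is_a(self, other: "Concept") -> bool:
        """True if `other` is this concept or one of its ancestors."""
        node = self
        while node is not None:
            if node is other:
                return True
            node = node.parent
        return False


# A container has a volume measured in m^3; a bottle adds an opening.
container = Concept("container", own_properties={"volume": "m^3"})
bottle = Concept("bottle", parent=container, own_properties={"opening": True})

print(bottle.properties())     # {'volume': 'm^3', 'opening': True}
print(bottle.is_a(container))  # True
```

Moving the opening up to the container would only mean moving one entry from `bottle` to `container`; the bottle would still report it.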

Does a dog have pages? (NO!)
Does a bottle have hair? (NO!)
Does a coin have a material? (YES!)
Does a pillow have a volume?

The last question is especially interesting. We saw that a container has a volume, and if having a volume is the only property of a container, then we could say that a pillow is a container. The same goes for a dog (it has a volume). This is where it gets fuzzy, but a dog could in principle be a container (and it even has an opening). When many people answer NO to the question of whether a dog is a container, the system could try to figure out by itself what the objects rejected from the container class have in common, as these objects get assigned more properties. This could for example be the property “living creature”.
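Here is a rough sketch of that idea, assuming a simple majority vote per object and plain sets of properties. The votes, the property sets and the function names are all invented for the example.

```python
from collections import Counter

# Hypothetical community answers to "Is X a container?"
votes = {
    "bottle": ["yes", "yes", "yes"],
    "cupboard": ["yes", "yes", "no"],
    "dog": ["no", "no", "yes"],
    "cat": ["no", "no", "no"],
}

# Hypothetical properties already assigned to each object.
properties = {
    "bottle": {"volume", "opening"},
    "cupboard": {"volume", "opening"},
    "dog": {"volume", "opening", "living creature"},
    "cat": {"volume", "opening", "living creature"},
}


def majority(answers):
    """Most frequent answer; ties are not handled specially here."""
    return Counter(answers).most_common(1)[0][0]


def distinguishing_properties(votes, properties):
    """Properties shared by all rejected objects and seen in no accepted one."""
    accepted = [obj for obj, ans in votes.items() if majority(ans) == "yes"]
    rejected = [obj for obj, ans in votes.items() if majority(ans) == "no"]
    if not rejected:
        return set()
    shared_by_rejected = set.intersection(*(properties[o] for o in rejected))
    seen_in_accepted = set().union(*(properties[o] for o in accepted))
    return shared_by_rejected - seen_in_accepted


print(distinguishing_properties(votes, properties))  # {'living creature'}
```

With real community data the answers would be much noisier, so a strict intersection like this is probably too rigid; it only shows the direction.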

How to build this hierarchy is still a vague idea. More important is the question of what this database could be used for. The dream is to have a system which can answer questions that have not been answered before. For example, when asked “Where can I hide my money?” it should come up with all kinds of containers, but also make the relation between “hide” and “store” and hence “money storage” -> “bank”. To name some possible containers, the database has to use information about the specific person and know what items this person possesses. This is actually not that hard, since there is so much information that we already share on the internet. It is enough that the person has written to a friend that he has bought a new cupboard for the system to know that this person owns a cupboard, which is a container and hence a hiding place (a hiding place is a container which is not transparent). This is Big Data: a lot of unstructured information in many different forms. This is what I’m specializing in: analyzing unstructured data and extracting knowledge from it.
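The rule “a hiding place is a container which is not transparent” can be written down as a tiny sketch. The lookup tables and the list of owned items below are invented, and extracting ownership from messages is left out completely.

```python
# Hypothetical knowledge: which class each item belongs to.
IS_A = {
    "cupboard": "container",
    "glass jar": "container",
    "bicycle": "vehicle",
}

# Hypothetical knowledge: which items are transparent.
TRANSPARENT = {"glass jar"}


def hiding_places(owned_items):
    """Owned items that are containers and not transparent."""
    return [
        item
        for item in owned_items
        if IS_A.get(item) == "container" and item not in TRANSPARENT
    ]


# Items we assume the system already knows this person owns.
owned = ["cupboard", "glass jar", "bicycle"]
print(hiding_places(owned))  # ['cupboard']
```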

How do we make the link between “hide” and “store”? It could be done in the same way as for the rest of the objects which have to be classified. Both “hide” and “store” are “actions”, but “hide” is a special way to “store” something. And a bank could have “actions” attached to it. There is a problem though: how do we make sure that there are no rings, that is, circular definitions like “water” -> “liquid” -> “water like” -> “water”? Obviously “liquid” has better definitions than “water like”, but when a community builds the hierarchy there is a chance of such misclassifications.
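One way to catch such rings (cycles) is a depth-first search over the “is defined as” links. This is a sketch, assuming the hierarchy is stored as a plain dictionary from each term to the terms it points to; the function name `find_cycle` is made up.

```python
def find_cycle(edges):
    """Return one cycle as a list of nodes, or None if there is none."""
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on current path, finished
    state = {}
    path = []

    def visit(node):
        state[node] = GREY
        path.append(node)
        for nxt in edges.get(node, []):
            if state.get(nxt, WHITE) == GREY:
                # nxt is already on the current path: we closed a ring.
                return path[path.index(nxt):] + [nxt]
            if state.get(nxt, WHITE) == WHITE:
                found = visit(nxt)
                if found:
                    return found
        path.pop()
        state[node] = BLACK
        return None

    for node in list(edges):
        if state.get(node, WHITE) == WHITE:
            found = visit(node)
            if found:
                return found
    return None


definitions = {
    "water": ["liquid"],
    "liquid": ["water like"],
    "water like": ["water"],
}
print(find_cycle(definitions))  # ['water', 'liquid', 'water like', 'water']
```

The website could run a check like this before accepting a new link and ask the contributor to reconsider when a ring would appear. The recursion is fine for a sketch, but a very deep hierarchy would need an iterative version.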

An inspiration for what is possible is this guy, Watson from IBM:
Jeopardy! The IBM Challenge – Episode 1
Jeopardy! The IBM Challenge – Episode 2
Jeopardy! The IBM Challenge – Episode 3
