Linked Data 101
1. Data structure: Triples
Triples are almost-human-readable statements, made up of [subject, predicate, object]. When you put lots of triples together and there is overlap in subjects and objects, they form a graph; subjects and objects are nodes (circles), and predicates are arcs (lines).
1.1. A subject is what the statement is about. The thing a statement is about is represented by a unique identifier: a URI (see 2).
1.2. A predicate is a property of the subject, essentially an adjective.
1.3. An object is the value of the property. This can be just a plain string or number (a literal), or a class, or a URI for another thing.
1.4. So to turn the sentence "my name is Amy" into triple form, you see the subject is you, a human; the predicate (property) is "name" and the object (the value) is the string "Amy":
<some-identifier-for#amy> name "Amy" .
2. Identifiers: URIs
URIs are unique identifiers, necessary so we can talk about lots of things, and distinguish them from each other.
2.1. They are globally unique (supposedly) across the whole web. So if someone at a University wants to publish data about Scotland and someone in Government wants to publish data about Scotland they can use the same URI, and know that they are both referring to the same Scotland (and they can say different things about Scotland and their data can be collated). In reality this is a bit more challenging because people don't like to cooperate, but don't worry about that.
2.2. URIs look like URLs. They are based on http because they're supposed to be resolvable with a web browser. ie. You're supposed to be able to type them in and get something back to tell you about what it is the URI is for. Because you often can't get the thing itself (because people, places and objects can't travel over http...) we can add a bit to the URL that doesn't get sent to the server: a fragment, eg.
#something - on the end. This just denotes that we're trying to talk about a thing not a webpage about a thing (the webpage about a thing is addressed by the URL without the #fragment). An alternative to fragments is content negotiation which is fun, but don't worry about right now.
2.3. Anyone can create a URI. It can be anything you like, but it helps if it will make sense to other people. It also helps if you have control over the domain name, so that you can decide what shows up when someone types it into a browser. We write URIs in
<triangle brackets> to make them easier to recognise. So with that in mind, we can change
<http://homepages.inf.ed.ac.uk/s1158216/#amy>, and we can say:
<http://homepages.inf.ed.ac.uk/s1158216/#amy> name "Amy".
2.3.a) If you don't own a domain name and don't want to let that stand in your way, you can make one up.
http://example.com/ is always good, or
Vocabularies or ontologies are a set of words and phrases that someone has decided to put together for the purposes of describing a particular topic. They usually consist of
3.1. Classes are used to say that
<some-thing> is of type
<http://http://homepages.inf.ed.ac.uk/s1158216/#amy> type <Person>. (
Person is the class). Classes start with an uppercase letter.
3.2. Properties are just the describing words. In the previous example
type is the property, and before that
name was the property. Properties are camelCased.
3.3. Anyone can publish a vocabulary, and you can use vocabularies that anyone else has published. There are literally thousands out there on the web, some better than others..
3.3.a) There are well-known, stable vocabularies. Examples include: RDF and RDFS (which provide the basics for describing anything), FOAF (for describing people and their friends), Dublin Core (for describing publications, mostly).
3.4 Vocabulary terms can have loose or restrictive definitions; you can define rules in OWL for how terms are used.
3.5. You create a new vocabulary term (whether a class or a property) by deciding what you want it to say and publishing it on a webpage, with a description, and a URI that (theoretically) won't change. If you can't/won't publish your term at a URI, you can just do the deciding part. You could pretend you've published it at
http://vocab.inf.ed.ac.uk/sws/my-vocab-term, and then go ahead and use it. They don't need to dereference to be useable (that's just good practice).
3.6. It's always best to see if a vocabulary term exists before you make your own. So for the previous examples, I know that in the RDF ontology there's a
type property, and in the FOAF ontology there's a
Person class and a
<http://homepages.inf.ed.ac.uk/s1158216/#amy> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <xmlns.com/foaf/0.1/Person> . <http://homepages.inf.ed.ac.uk/s1158216/#amy> <http://xmlns.com/foaf/0.1/name> "Amy" .
3.6.a) Because URIs are long and annoying to type out, we use prefixes to make things easier to read (prefixes can be anything you like, too):
@prefix foaf: http://xmlns.com/foaf/0.1/ . @prefix rdf: http://www.w3.org/1999/02/22-rdf-syntax-ns# . @prefix id: http://http://homepages.inf.ed.ac.uk/s1158216/# . <id:amy> rdf:type <foaf:Person> . <id:amy> foaf:name "Amy" .
So when it comes to creating an RDF dataset:
- Figure out what it is you want to talk about. The person/place/object/concept/etc.
- Find out if someone has already said something about the same thing. If so, what unique identifier - URI - did they use? (Hint: dbpedia has everything on Wikipedia, and their URIs are most commonly used). If not, just make your own, with your own URI base (eg. http://vocab.inf.ed.ac.uk/sws/).
- Figure out what property of the thing you want to describe. Name/colour/size/latitude/life expectancy...
- Figure out if someone has tried to describe this property already. If so, they have a vocabulary term for it, which will look like http://some-website.com/vocab/some-term. Use it! Literally just by copying the address out. If not... make your own. Same as 2.
- Figure out what the value of the property of the thing is. Bob/blue/20m/-5.6/60 years...
- See 2. Same principle, but for many things there's not much point in having a URI for them. Use an existing URI or create a new one if you think it's necessary. If it's just a string or a number (or a date/time or whatever) then use a literal which is just the value in "quotes".
- Parse your dataset - using the tool or programming language or library of your choice - to extract the subject (1.), the predicate (3.) and the object (5.) and store them in a file in triple form.
- Found a typo? Want to fix a mistake? Pull request!
- Something doesn't make sense at all or needs more explaining, or questions about SWS coursework? Email Amy.Guy@ed.ac.uk.