Google I/O 2011: Full Text Search

Uploaded by GoogleDevelopers on 11.05.2011


BO MAJEWSKI: Before I dive into the API, let me quickly
introduce ourselves.
My name is Bo Majewski.
I've been with Google for five and a half years.
I started on the Geocoding API, joined the Wave team, so
you didn't see me for three years.
And then right now I'm working on App Engine.
GED ELLIS: Hi, I'm Ged Ellis.
And I also worked on Wave for a couple of years, working on
search and notifications.
And recently-- the last six months or so--
I've been working on search in App Engine.
All right.

BO MAJEWSKI: So just to gauge the audience, how many of you
have used anything like Lucene Solr, Whoosh?
OK, so you're very much familiar with the reasons for
doing this.
Any time you want to have something like a bunch of
documents, and you want to allow users to find the right
documents by entering just a few key terms, you would use a
text search API.
If you have a feed, usually it's filled with information
about people eating breakfast and going to sleep.
And from time to time you want something more relevant, so
you would search it as well.
Gmail changed the approach to email by focusing on search
rather than on foldering and cataloging your email.
So again, this pretty much uses the search API, text
search API.
And finally, every time you go to Amazon, eBay, where you
have hundreds of thousands of products, you use text search
API to narrow your searches.
So about three years ago, Issue 217 was filed, which has
since somehow mysteriously drifted to number two.
And the person who filed it-- noted that, since we are
Google, we pretty much should provide the text search API.
So we actually agree--
it just took us a while to get to it.

What Ged and I are going to talk about today: first, we're
going to present what we call the core search API.
This is a lower-level approach, for those who do not use
Datastore directly.
The second part of our talk is going to be integrating this
with Datastore.
How do you use it with Datastore?
And finally we're going to mention our plans for a REST
API, which allows you to use the search service from
outside App Engine.

As you might be aware, what is available today, you have
essentially two possibilities.
You can either rely on the Datastore.
And Datastore allows you to store texts and strings.
And you can use the equality/inequality.
And in case of lists, you can test for membership.
Or you can do anything you can do over HTTP.
So you either do a self-hosted solution, where you put up
your instance of Lucene and just query it, or you go to a
third party.
However, this is not the most convenient thing.

What Ged is going to show you is what is
available right now.
We actually have it working.
And there is no more appropriate way of showing it
than showing you with a '90s style application, the guest
book you can put on your webpage.

So, yeah.
We have some guest book application built on top of
search API.
There's three sections, I guess, to this demo page.
There's a little comment box here, where you can type in
comments, a search box where you can search for comments,
and then a list of search results, which is the comments
which match the current search.
So the current search is all comments.
And then you've got some other fields here, like the author
who wrote it and the date published.
So let's go ahead and make a comment.
So "I'll be back." So we're making comments that are
basically just quotes from movies.
And we'll hashtag it with #terminator.
And I'm going to shout out to Bo.
So that gets added to the index and is returned in the
current search, which is all comments.
So you've got the comment there.
I said, who wrote it.
So it's anonymous.
And published today.
And you can see that I'm currently not signed in--
no one is signed in-- so the comment shows as anonymous.
So let's do some searches.
So I could search for simple terms, like fish.
So that'll find all the comments that mention fish.

You can see that it supports CJK, so I can search for just
the character fish.
I can do more complicated searches like combinations of
terms. Like "be back." And you'll see it'll match the
comment I wrote before.
It also matches "to be or not to be"-- the best quote.
But if I want to actually match this, I
can do a phrase query.
So that's just showing some of the sorts of queries we can do
that this supports.
So it'll only find "be back."
You can restrict your searches.
So I can search for things that I wrote.
So you can see that, in fact, it searches for Ged in the
author field.
So you can restrict to, obviously, anonymous, so
things that have been anonymously written.
You could also search for other things, like tags.
So if I want to search for gladiator--

oops, that's actually in the wrong syntax--
so tag gladiator.

And so that finds all the comments with the
gladiator hashtag.
Similarly, if I want to find out shout outs to Bo, that
will find all the shout outs.
So in terms of some of the more complicated searches, so
you can search for combinations using logical
operators like and, or, and not.
So here we're searching for anonymous comments that
contain the word "back" or "machete." But I can also do
things like restrict it to not Macbeth--
tag with #macbeth.
So it shows some of the standard things you'd probably
find in a search site.
So, some of the other things you can do.
Search for published ranges.

So I could find "give me all comments that are being
written up to and including the 6th of May." And if I want
to do exclusive ranges, use the curly brace syntax.
So that would exclude the 6th of May.
And you can do things like delete documents.
So I delete that comment, and now it's no
longer in the index.
So that's showing, basically, documents being indexed
immediately and matching searches.
There's also pagination.
So I can just paginate through the results.

So let's have a look at some of the code behind that and
the concepts.
So there's two processes in search.
There's the indexing process.
So, for example, we're taking user data--
in our case, the user data was the comment, the author, so
the person logged in, the date and things like
tags and shout outs.
So there's some of the fields that are added to documents.
So the document would be this comment document.
That document is then given to the indexer, which will
tokenize it and then build the inverted file in the index.
In the search phase, you're taking user-- or application
input in our case-- that we were taking exactly the query
that was typed in.
That's given to a query builder, which will build a
more formal query and give it to the indexer back end.
And execute it against the index.
And return a search response, which
contains a set of documents.
And then, those documents use a result builder and
construct, in our case, the three little comments shown at
the bottom of the page.

So the architecture.
So I'm going to discuss the left hand
side of this diagram.
Bo is going to talk about the Datastore side in the second
half of the talk.
So with the application, you construct the document.
Then you ask the indexer to index that document.
The indexer then will write that document to Megastore,
which will then tokenize it and build the indexes.

An application will send a query to the indexer.
The indexer will then issue that query to Megastore, read
the set of results, the entities, and convert them
into a search response.
And then the search response is sent back up to the
application.
So you have full control over the
document schema and content.
There is no up-front declaration of the schema in
the YAML files.
It's actually done as you add fields to documents.
So document content can be arbitrarily transformed.
So in our case, we were taking those comments, and we were
also extracting out the tags and the shout outs.
But it could be a product document, and you could unify
with that pricing data shipping and offers and
reviews, et cetera, so we could build a
full product page.
So, a typical sequence.
Create a document.
Add the documents to an index.
And then use a query to match against that index and return
documents or snippets.

So for those who--
this should be reasonably familiar, who've
used Lucene, et cetera.
So the document is a collection of fields plus a
document ID.
Document ID has to be unique.
You may have language specifications, so you can add
an indication on how it should be tokenized.
An order ID should also be supplied.
The order ID is used to sort the documents by default.
And you saw with the demonstration that those
documents were sorted from most recent to oldest. And
that's using the default ordering, which is the number
of seconds since January 2011.
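As a sketch of that default ordering, assuming the epoch is January 1, 2011 (the talk only says "January 2011", so the exact epoch is an assumption):

```python
import datetime

# Hypothetical reconstruction of the default order ID: seconds since the
# start of 2011. Fresher documents get larger IDs, so they come first
# when results are returned in descending order of order ID.
EPOCH = datetime.datetime(2011, 1, 1)

def default_order_id(now):
    return int((now - EPOCH).total_seconds())
```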
So fields are simply a named and typed value.
The name doesn't have to be unique, so you can add
multiple values.
So you can have as many tags as you want, for example.
You can specify the language so that you can override if
you want for that particular field.
Or you can specify at this level.
And as mentioned before, the schema is built as you
introduce new field names and types.

Some of the field values we support.
Plain text.
HTML, which goes through an HTML parser.
Date/time, up to day-level granularity, so you
don't have milliseconds.
And Atom, so you can have fields that you don't want
tokenized at all and treat as a single value.

So let's look at how you can create a document.
Here's the Python code.

So, we have a document ID.
Set the document ID.
In this case, you'll need some way of generating a unique
document ID.
And then you basically just add a set of fields--
let's say there was text, HTML, and dates and Atoms. We'd
probably have two more here, which were the tag and
shout-out fields.
But essentially you name each one.
You type it.
And you give it some value.
So, the author, which is the user nickname.
The comment, which is the actual text from the text box.
And the date field, which is the date today.
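Reconstructing that shape as a runnable sketch, with stand-in Field and Document classes; the real class names and signatures are not shown in the talk, so everything here is illustrative:

```python
import datetime
from collections import namedtuple

# Stand-ins for the API's document model: a field is a named, typed
# value, and a document is a unique ID plus a collection of fields.
Field = namedtuple("Field", ["name", "type", "value"])

class Document:
    def __init__(self, doc_id, fields):
        self.doc_id = doc_id          # must be unique per document
        self.fields = list(fields)

doc = Document(
    doc_id="comment-0001",
    fields=[
        Field("author", "text", "Ged"),            # the user nickname
        Field("comment", "html", "I'll be back"),  # the text box contents
        Field("date", "date", datetime.date.today()),
    ],
)

# Field names need not be unique, so a document can carry many tags:
doc.fields.append(Field("tag", "atom", "terminator"))
doc.fields.append(Field("tag", "atom", "movies"))
```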

The Java is pretty similar except we're using builders.
So just get a builder.
Set the document ID.
And then add a field.
And then for each field, you have a builder.
So, you have a field builder.
You set the name.
And you set the type of value it is.
If it's text, HTML, or date.

So as I said, the index is fully dynamic.
So as you add those new field names with the types, and you
index that document, that will introduce that to the schema.
So that'll extend the schema.
We support versions.
So you can have an existing version of an index.
You don't need to do this, but you can.
And you could build in a background process.
Rebuild the index.
And then hot-swap that index out.

There's two levels of consistency supported.
This is global or per document.
So global is like an inbox style.
And basically if you're putting a label on a document,
for example like your email or something, and you put it on
that document.
If you search for that, you expect that
document to be retrieved.
So all writes are committed before that
search is executed.
And then there's per document, which suits a feed style:
it's not that important if content you aren't aware of
turns up a few seconds later.

This is an issue around Megastore, and you see the
same thing around Datastore.
It's like selecting entities.
So global basically can support around one QPS of
writes.
But per document can support a throughput of thousands
of QPS of writes.

So the sort of operations we support are index and re-index
the document.
So you can search with a simple query string or a
complex search request. So you can override all the features
like sorting, scoring, et cetera.

Delete document.
So you can delete by document ID, or you can delete by
issuing a search query.
So all the documents that match the
query will be deleted.
You can iterate through all documents in an index.
You can get the metadata.
So the metadata would consist of the index name, consistency
level, the schema.
Or you can activate a version of an
index so you can hot-swap.

So how do you index documents?
Essentially, you access the index by the index name.
If that index is not already in the cache or has not been
created, you then create the index using the metadata.
So you're specifying the index name and
the consistency level.

Once you have the index, you can index each document one
at a time, or you can do a bulk index of all your documents.

So the Java is pretty much the same except you're using the
factory to get the index to create it.
And then you're using a builder to
construct the metadata.
Otherwise you just, again, can iterate through the documents
and index them.
Or you can do a bulk index.

So as shown in the demonstration, we had some of
the queries you could support.
So, simple terms like hello.
Phrases, so where you care about the order.
And whether they're adjacent, logical searches, using and,
or, and not.
Or you can restrict search to particular fields.
Like find all the comments written by Bob in this case.
So you can do closed intervals.
So find all comments published up to and
including the 1st of May.
Or you can do exclusive ranges.
Open intervals.
So find all the products priced between $100 and $200,
but not including $100 and $200.
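The semantics of those query forms can be illustrated with a naive in-memory evaluator; the real query language and its parser are, of course, much richer:

```python
import datetime

# Two toy documents to query against.
docs = {
    "c1": {"author": "bob", "comment": "hello world",
           "published": datetime.date(2011, 4, 30), "price": 150},
    "c2": {"author": "ged", "comment": "hello there",
           "published": datetime.date(2011, 5, 6), "price": 250},
}

def match_term(doc, term):                    # simple term: hello
    return term in doc["comment"].split()

def match_field(doc, field, value):           # field restriction: author:bob
    return doc.get(field) == value

def match_range(doc, field, lo, hi, inclusive=True):
    if inclusive:                             # closed interval [lo, hi]
        return lo <= doc[field] <= hi
    return lo < doc[field] < hi               # open interval {lo, hi}

hits = [d for d, doc in docs.items() if match_term(doc, "hello")]
by_bob = [d for d, doc in docs.items() if match_field(doc, "author", "bob")]
in_may = [d for d, doc in docs.items()
          if match_range(doc, "published",
                         datetime.date(2011, 5, 1), datetime.date(2011, 5, 6))]
```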
So for search, again, you get the index.
You search.
You could either construct this query from some source,
or you could just enter the string.
In our case, we would have just taken the string the user
entered, and then we would get a search response.
The search response contains things like the number of
documents matched, the actual count in returned documents,
the documents--
and you essentially iterate through that response to get
the results and get each document and their scores.

The Java is pretty similar.
Again, you get the index.
You get a search response when you issue the
query to the index.
And then you iterate and get back search results, which
contain, amongst other things, the document.

Some of the other features that we didn't show in the
demonstration were things like snippeting.
So, in fact, this technology was sitting
behind Wave as well.

Here we have Sydney karaoke as the query issued.
And snippet is basically giving you the context of
where the best matches in the text are.
And so, for example, here it found karaoke and Sydney
together in a short segment of text out of this Wave. And by
default it would give you back snippets up to 160 characters,
and at most three snippets.
So you can build some small representation of the document
that you wanted showing results.
So to do snippeting, essentially you issue a search
request. You have the query.
And then you just indicate which fields
that you want to snippet.

So some of the other features are paging and cursors.
So for paging, it's typical where you'd give an index of
the results on the different pages.
And people randomly access those results.
So if I click on page five, for example, then I have to
give an offset.
So if there's 20 results per page, then I'll want to start
at result offset 100.
And the limit will be 20.
So there's a bit of a performance hit with this
because it has to compute that set of results up to the page
five plus the next page.
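The offset arithmetic can be sketched as follows, assuming zero-based page numbers (which is what makes page five land at offset 100 with 20 results per page):

```python
# Offset-based paging: jumping to a page means skipping page * per_page
# ranked results first, which is why the server must compute everything
# up to that point.
def page_slice(results, page, per_page=20):
    offset = page * per_page          # page 5 -> offset 100, as in the talk
    return results[offset:offset + per_page]
```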

Alternatively, you can use cursors, which
is much more efficient.
Which is the next page or more mechanism that you see on a
lot of websites.

And cursors can be per result or per search request. So you
can put links on each result, or you can put it
on the whole page.
So the first time you issue the search, you'll specify.
In this case, I'll want a cursor per
page, so a single cursor.
And I'll return 20 results.
When they click on the next link or the more button or
whatever, they will then issue this request, which says, OK,
this has the cursor here.
And I want--
again, I'll want another cursor for another next page
plus 20 results from that cursor.
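A toy version of that cursor mechanism, using an opaque base64 token to mark where the last page ended; real cursors are server-generated and certainly not simple encoded offsets:

```python
import base64

# The client never inspects the cursor -- it just hands it back on the
# next request, and the search resumes from where the last page ended.
def encode_cursor(position):
    return base64.urlsafe_b64encode(str(position).encode()).decode()

def decode_cursor(cursor):
    return int(base64.urlsafe_b64decode(cursor.encode()).decode())

def fetch_page(results, limit=20, cursor=None):
    start = decode_cursor(cursor) if cursor else 0
    page = results[start:start + limit]
    end = start + len(page)
    next_cursor = encode_cursor(end) if end < len(results) else None
    return page, next_cursor
```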

And then it's over to Bo.
BO MAJEWSKI: All right.

So once you get all the results you probably want to
show in the most relevant order to the
people who are searching.
As Ged mentioned, by default we are sorting by order ID.
This is either the default sort, which comes pretty much
for free because we store the documents in that order, or
it is the tie-breaking sort used even if you specify
alternative sorting orders.
So what you see here is the documents ordered by time.
And I can change it.
At the moment, this is a very simple application, so I can
change it to author.
If author is descending, I come on top.
And then on the subsequent pages Ged and
anonymous will appear.
If you order by author ascending, then anonymous
bubbles to the top.
And anonymous-- we made a lot of comments as anonymous, so
probably around page three Ged will appear.
But you can also do some more complicated sorting orders.
The other way of--
I'm going to talk about it in a second--
the other way of ordering documents is by scoring them.

Typically, when you search on a website, you get documents
that are most relevant.
So entertained.
Let's search for entertained.
So this is ordered by relevance, but we can--
by relevance and time.
We can also have a simpler scorer, which is a hit count,
which essentially will count only the number of hits.
And in this case, they are inverted because simply the
second phrase had more occurrences of "are you
entertained" than the first one.
So let me give you some more details on that.

As I mentioned, the default order is order ID.
If you don't assign anything to it, the number of seconds
is assigned, so consequently--
because by default we return documents in descending order
of order ID-- you get the freshest documents first, the
oldest last.
You can actually override this default if you know how to
map some field to a magical integer.
You are free to do that.
And this will provide high performance and pretty much 0
cost for your application.
Alternatively, in a search request, you can say, well, I
would like some different order.
And you just provide sorting specification.
You can sort on multiple fields.
In this case, we are sorting by author.
We also provide a default value.
That means if we don't have the author in a given
document, use this value, which causes the document
either to bubble to the top or to the bottom.
Once we sort them by authors, the
secondary field is comment.
So all documents with the same author of course will go into
the same batch.
And then the secondary sort is applied.

You can use some of the more complicated expressions.
So we just don't limit you to the individual field.
You can use regular operators of course depending on what
sort of fields you have. On numeric fields, you can use
multiplication, addition.
You can use conditional operators.
If something happened, then assign this sort of score.
Otherwise assign this.
And you can use a bunch of functions like maximum,
minimum, logarithm, et cetera.
Some examples here.
If you want to sort by price plus tax plus shipping to give
people the total cost of purchasing the
item, you can do that.
So you would include in a sorting specification
expression, which is price plus tax plus shipping at
distance from a location.
You can also construct artificial fields.
This is probably not for sorting, but you can actually
create artificial fields by concatenating a
bunch of other fields.
Along with the sorting, you also have
access to custom scoring.
You specify the custom score.
And in this case, the expression is
maximum number of tags.
Simply count the number of fields that you have in a
document with a given name.
So it counts the number of fields named tag and limits
those to five.
So then, all documents with five, six and seven tags are
considered the same.
Everything else is sorted.
If there are no tags in a given document, the default
value is used, which is 0 in this case.
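The capped tag-count score described here can be sketched as follows; the talk does not show the exact expression syntax, so treat this function as illustrative semantics only:

```python
# Score a document by the number of fields named "tag", capped at five
# (so five, six, and seven tags all score the same), with a default of
# zero when there are no tags at all.
def tag_score(doc_fields, cap=5, default=0):
    count = sum(1 for name, _ in doc_fields if name == "tag")
    if count == 0:
        return default
    return min(count, cap)
```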
With scoring, however, there comes a caveat.
We have to score a lot of documents, because we want to
make sure that we return you the highest scoring ones.
So if you are requesting about 20 documents, probably
limit the number you want scored to about 100.
So just use caution.
Don't go crazy on us.
You can also use the default scorers.
So we provide two default scorers, match
scorer and hit count.
Hit count is mostly for debugging.
This is very simplistic.
Match scorer is somewhat more sophisticated.
It uses the term frequency, TF, times the inverse
document frequency, IDF.
Essentially, term frequency counts how many times the
term occurs in the document.
And inverse document frequency is inversely proportional
to how many documents contain that term.
So if you have the word "the", which is in every document
in your repository, the IDF will be very low.
And consequently it will be scored low.
Or if you have some terms that occur multiple times on only a
few documents, then that score will be high.
So this is a default one and right out of the box.
Hit count essentially is just TF.
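The match scorer's TF times IDF can be written out in a standard textbook form; the production scorer is certainly more elaborate:

```python
import math

# Classic TF*IDF: a term scores high when it occurs often in this
# document (TF) but rarely across the repository (IDF). A word like
# "the" that appears in every document gets an IDF of log(1) == 0.
def tf_idf(term, doc_tokens, all_docs):
    tf = doc_tokens.count(term)
    containing = sum(1 for d in all_docs if term in d)
    if containing == 0:
        return 0.0
    idf = math.log(len(all_docs) / containing)
    return tf * idf
```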

Snippeting, Ged already mentioned.
We provide default snippeting, but you also can control it.
So use the snippet function rather than taking the
defaults-- essentially, the default snippet is itself an
expression that we return.
But here you control it with a function.
And you say, snippet to return is based
on field name comment.
This time match just Sydney.
Don't match karaoke.
And limit 240 characters.
The last parameter one means you want consecutive snippets.
So don't take three pieces of a text here.
Just take a single piece of a text.
So you can take two pieces, three pieces of a text, et
cetera, per document.
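A naive version of such a snippet function, returning one consecutive window of at most a given number of characters around the first hit; the real snippeter picks the best-matching windows, not just the first:

```python
# Find the query term in the text and grow a window of words around it
# until the character limit is reached; fall back to the start of the
# text when there is no hit.
def snippet(text, term, max_chars=240):
    words = text.split()
    lowered = [w.lower().strip(".,!?") for w in words]
    if term.lower() not in lowered:
        return text[:max_chars]
    hit = lowered.index(term.lower())
    lo = hi = hit
    while True:
        grown = False
        if lo > 0 and len(" ".join(words[lo - 1:hi + 1])) <= max_chars:
            lo -= 1
            grown = True
        if hi < len(words) - 1 and len(" ".join(words[lo:hi + 2])) <= max_chars:
            hi += 1
            grown = True
        if not grown:
            break
    return " ".join(words[lo:hi + 1])
```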
The original Issue 217 actually
was filed about Datastore.
So you might be wondering what are we doing
with all these documents?
And the documents is essentially something that
gives you full control.
If you have data stored somewhere other than Datastore,
you probably want to use documents.
If you have concerns about privacy or revealing secret
information or whatever, use your documents and create
something that's searchable and points back to the actual
data that you want the user to find.
However, if you store in Datastore, this is
the section for you.

So this is the second part of the architecture.
Rather than operating on documents,
you operate on entities.
So now, rather than doing index document calls, you
simply do entity put.
And there are two possible paths.
One is it goes to the Datastore and, under the hood,
generates a document for you, stores that document, and in
that document it remembers which entity it
was generated from.
Then it has an appropriate way of matching that document
with the entity when you execute a search.
The other path uses Task Queue, which will allow you to
actually do some extra modifications.
So we generate these documents for you, without your
intervention, but we also give you the freedom to
modify them later.
Let me give you a more detailed description.
By default, we are not going to change the behavior of
Datastore entities.
So if you have a Datastore entity, it has no full-text
search capabilities until you annotate the properties of
that entity with search type.
And in this case, I have the greeting, which is also this
guest book application entity.
And the author is annotated with a search type Atom.
Meaning we'll not modify it.
We'll just take it as is without further tokenization.
Comment will take it, but will treat this as HTML.
And in this case, date is not being indexed at all.
If you wanted an application like we had, you probably
would have to annotate the date as well with a
search type.
And now we have these two modes.
One is what we call fully automatic mode.
So essentially now you execute a put.
And what happens is, in the
background, we generate a document.
We store it in Megastore and remember which
entity it came from.
In order to preserve the same parameters, latency
parameters, as you had on Datastore, you can
do it in two ways.
You can either tell us just to put the entity.
And in that case we'll just both write the entity and
write the document on the same transaction.
So either both are committed or nothing is committed.
Or you can set the search index mode to
asynchronous write.
What that means is, we will generate the document, but
we'll return as soon as the entity has
been stored in Datastore.
In the background, we are going to start
processing the document.
And we'll guarantee the eventual consistency.
So eventually that document will be committed to a
full-text search index.
But it does not mean that the moment your put call
returns, you will be able to search using full-text search.
Depending on whether you want the same performance or
immediate full-text searchability, you can use
one or the other.
There is a semi-automatic mode, which allows you to
intervene.
In the most drastic approach, you say, generate the
document, but don't do anything with it.
Instead give it back to my handler.
So you specify search index mode, none--
which means we don't write it, we just generate it--
and you specify the handler on which we'll call you with a
serialized document.
What that allows you to do is further modify the document.
The application that Ged was showing you, for example, was
analyzing the comment field and was extracting the
hashtags and was extracting shout outs.
If you want to do that, this is probably the mode in which
you would operate.
There is also a semi-automatic mode, which is a combination
of fully automatic and no-write mode.
In which case you simply allow us to write the document, but
also give us a handler.
What you do with it is up to you.
You can either monitor, log, whatever.
The searching is done by providing one extra method,
which is matches, which takes a query.
And essentially the way it works is it goes to the
document index, finds all the matching documents, and
performs the intersection with other parts of the query.
So it gives you only things that match the full-text
search plus other limitations you gave on the query.

The final part is the REST API.
We pretty much are going to follow the somewhat standard
convention of noun, selector, noun, selector.
All nouns are plural.
Selectors are IDs of either indexes or documents.
At the moment, we only have used three methods, get, post,
and delete.
Now, I have to give a caveat.
This may change.
This is very early stages of this implementation.
You use, say, /indexes, and you give us a GET method.
We give you metadata about all your indexes.
The first selector is the index name.
So on the get, you get the metadata about that index.
On post, you create that index.
The parameters of the index, like consistency and version,
all that sort of stuff, you would specify after the
question mark.
On delete, you delete all documents from the index and
delete the index itself.
Second noun is docs.
So if you execute an HTTP GET with /indexes/A/docs,
you'll get all documents from index A.
Post method does nothing.
And delete deletes all docs, but it retains your index.
You can actually limit which documents you will delete by
specifying further query after the question mark.
So then, in that case, it means delete from A all
documents that match given query.
And the final selector is the document ID.
So again, /indexes/A/docs/X.
On GET, you select the document X.
On POST, you create the document, and you specify the
fields as parameters after the question mark.
And on delete, you delete that specific document.
Versions are supported by prefixing all the URLs with
appropriate V1, V2, et cetera.
As you might guess, at the moment, it's like V.01.
Well, V1.
Pagination is supported by offset and limit, and format
by appending the format name to the noun.
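The noun/selector scheme can be summarized as a small dispatch table; since the speakers caution that the routes may change, this is purely a sketch of the mapping as described:

```python
# Dispatch the described /indexes[/NAME[/docs[/DOC_ID]]] paths to the
# operations named in the talk; unknown paths and methods fall through.
def route(method, path):
    parts = [p for p in path.split("/") if p]
    if not parts or parts[0] != "indexes":
        return "404"
    if len(parts) == 1:                         # /indexes
        return {"GET": "list all indexes"}.get(method, "405")
    if len(parts) == 2:                         # /indexes/A
        return {"GET": "index metadata",
                "POST": "create index",
                "DELETE": "delete index and its docs"}.get(method, "405")
    if len(parts) == 3 and parts[2] == "docs":  # /indexes/A/docs
        return {"GET": "list documents",
                "DELETE": "delete matching docs"}.get(method, "405")
    if len(parts) == 4 and parts[2] == "docs":  # /indexes/A/docs/X
        return {"GET": "get document",
                "POST": "create document",
                "DELETE": "delete document"}.get(method, "405")
    return "404"
```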

In terms of a road map, I have to be very careful, because
Mick, our Product Manager, is watching carefully what I'm
going to say here.
At the moment, we are just telling you what we are
working on.
It actually works, despite Mick's piercing eyes.
It works.
But we are going to invite you to give us feedback.
I mean, what is missing?
What would you change?
What would you like to see?
A lot of you voted on Issue 217.
We hope that a lot of these people who are interested in
full-text search can give us the feedback before we
completely nail the API.
At the moment it's fairly stable, but we still invite
the feedback.
The first thing that we're going to
release is the core API.
So this is the part that Ged was talking about, where you
will have to generate your own documents.
And you will have the freedom of how you do it, but the
additional programming expenses of having to do it.

Subsequently, the Datastore API is going to be released,
in which case you will then have these fully automatic and
semi-automatic methods available.
And further down the line is the REST API.
OK, to sum it up, you were given an overview of what is
coming in the full-text search.
Stand-alone text search coming first. Full integration with
Datastore and the REST API coming next.
And I guess we have 20 minutes.
So if there are any questions, we'll just open the floor to
the questions.
And please use the mics on the other
side, so they get recorded.

SPEAKER 1: All right.
So I have an app that has many different customers in the
same Datastore.
All of my entities are keyed on the customer ID field.
And I search on that in all my queries.
How do I do the same thing with full-text search?
Do I use the thing to make multiple
indexes, one per customer?
Do I also add the same kind of field to all of my full-text
search variables?
What's the right way to go about this?
Will there be a multi-tenant space in the name space in the
API added onto this?
How does that work?
BO MAJEWSKI: So the recommendation would be to
indeed have index per customer because you have the freedom.
I mean, we don't limit how many indexes you have. At the
moment there are some rudimentary limits put in
place, so you cannot create millions of them in one day
because then we think there's something wrong with the
application.
But in order to fully isolate them, I would recommend using
customer name or some sort of thing that identifies the
customer as the index name.
The reason for that is--
and I didn't mention it in the talk--
we also plan to allow you, if the customer says so, make the
given index accessible to other applications.
So you can search it through other applications.
And then it would be easier just to give that permission
per index rather than having to juggle all these things.
So as far as we are concerned, you can have thousands or tens
of thousands of indexes.
SPEAKER 1: But that will require using the lowest level of
the search index.
I can't just go with my data objects as they are.
At the moment--
BO MAJEWSKI: That's a fair comment.
And we can think about it, how we would do it.
But there are some limits.
I mean, at the moment, we--
actually on Datastore integration, we want to return
you all entities of a given type and so.
SPEAKER 1: No, that's fine.
It'll take time.
It's pretty easy to go that way.
SPEAKER 2: A couple of things.
Do you have the ability to mark fields, certain fields,
as more important in terms of the scoring?
GED ELLIS: There is--
Field match boosts.
So you can put--
yeah, yeah, there's boosts for field matching.
SPEAKER 2: And then, fuzzy matching?
For misspellings?
So we do stemming, synonyms, and diacriticals.
But this is done at the query time.
So if you want to match something--
prefix matching, or fuzzy matching?
SPEAKER 2: Um, yes.
So I'll say yes to the first half.
Essentially, our approach to stemming, synonyms, and
diacriticals is that it's better to do it at query
time, because these things change.
We actually live in Australia.
And there was like Kevin '07 was the slogan, right?
And if you have Kevin--
if you enter two terms, Kevin '07, then we in 2007 would
have realized that you're talking about a particular
guy, Kevin Rudd.
Probably in 2004 it was not relevant.
So you don't want to recompute your index based on
what is popular in a given year.
So the additional processing is done during the query time.
However, at this time, we don't have anything that
allows you to say, give me anything that starts
with a given prefix.
That is one of the actually more requested text search API
features, but we haven't quite nailed it at this point.
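The query-time expansion described above can be sketched like this. The synonym table and function name are illustrative; the point is that only the table changes over time, never the indexed documents.

```python
# Synonym table applied at query time; contents are illustrative.
SYNONYMS = {
    "kevin 07": ["kevin rudd"],  # slogan -> person, current circa 2007
    "car": ["automobile"],
}

def expand_query(query):
    """Return the original query plus current synonym rewrites.
    Documents are never re-indexed; only SYNONYMS changes."""
    q = query.lower()
    variants = [q]
    for phrase, alternatives in SYNONYMS.items():
        if phrase in q:
            for alt in alternatives:
                variants.append(q.replace(phrase, alt))
    return variants

print(expand_query("kevin 07 speech"))
# ['kevin 07 speech', 'kevin rudd speech']
```

When the slogan goes stale, deleting one table entry changes future queries without touching a single stored document.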
SPEAKER 3: So this question is about the integrated text
search API.
So if you have an entity in Java or even Python, you have
this-- what appears to be search by document editor.
It is an indication of how we, in Python, use
the different search.
How do you search for normal APIs [? in concrete ?]
I mean, expand the models of data [? stored? ?]
BO MAJEWSKI: At the moment, we just allow you to do the
annotations on fields, right?
But in what--
expand the models?
SPEAKER 3: I know that web-based, you use cache

BO MAJEWSKI: To be honest, that part has not been finalized.
So if Max is anywhere in the audience, he can probably
guess the answer.
But at the moment--
SPEAKER 4: So this API, what's the objective?
Is it going to be exposed to us in
the browser so that [UNINTELLIGIBLE], or can it
only be used within the context of my application?
And if so, can I get only the pointer, which is the ID of
the entity, and then follow the link to the entity it
came from?
What is the response of the REST API?
I have two parts.
The first part is whether I can use it from the browser.
And the second part is, do I get full entity data?
Or do I get only the pointer?
BO MAJEWSKI: I can tell you all of this at the moment.
So just the document.
You get the document.
And the main purpose of REST API is to allow you to use
what hopefully is the goodness of our text search API, even
if you don't have App Engine application.
So how--
I'm not sure if returning entities makes sense there,
because it's meant for an application that is not
necessarily an App Engine application.
SPEAKER 4: You mean the browser, or inside an
application where App Engine is configured?
BO MAJEWSKI: No, I mean if you made the request from a
browser, and we gave you the full entity, and you didn't
have an App Engine application.
I know what--
SPEAKER 4: You will have authentication problems?
BO MAJEWSKI: I mean, authentication is going to follow the
same principle that Task Queue was following.
So, yeah.
That's an administrative API.

SPEAKER 5: You talked about pagination and cursors.
And I'm wondering if I'm using a per document consistent
index, and I do a search, are those-- the cursors and
pagination-- is that based on a snapshot of what the index
looks like when I did my search?
Or if things are being added to that index, is it going to
bump my search results around?
GED ELLIS: The pagination will be recomputed, so--
sorry, yeah.
The pagination will be recomputed.
But for the cursor, it's a good question on that.
I think it's actually-- we'll take it from where
it was at that time.
BO MAJEWSKI: Well, yes.
I mean, if I understand you correctly--
if you change the document, you will not impact it because
the document ID stays the same.
GED ELLIS: No, but if you added more results.
SPEAKER 5: No, if I'm adding more results, will it--
GED ELLIS: So for the first case, when it's using pages,
it's going to recompute the whole thing anyway.
But for the cursor, I'm not sure what happens actually.
BO MAJEWSKI: We should have tried it.
SPEAKER 5: Can I send you an email and get back to you?
BO MAJEWSKI: Yeah, sure.
SPEAKER 6: So I heard about Python and Java hook-up.

BO MAJEWSKI: We didn't talk to the Go people.
GED ELLIS: They sit next to us, so.
SPEAKER 6: That's pretty bad.
BO MAJEWSKI: Well, yes.
We'll politely ask them to do that.
I mean, our back ends are agnostic.
We just use RPCs.
And we just return our protocol buffers.
So hopefully they can create the appropriate
Go object from it.
Essentially what you see is there is an RPC and in front
of it sits a thin layer of Java API or Python API that
reconstructs the Python object or the Java object.
So I don't anticipate it's going to be a major hassle to
get a Go version going.
SPEAKER 7: Another question about data cursor.
I noticed in your examples, they were all
next page, next page.
Can you go to previous page?
You know, like using the data cursor?
BO MAJEWSKI: You could, but you would have to store it
yourself, right?
So we always give you the next one.
If you store it yourself, you can actually do this.
So essentially, for us, the way we store documents in
Megastore, the cursor tells us the row from
which to start scanning.
That's all it is.
So if you have it, we can start scanning
from an earlier row.
But in HTML applications, it's kind of difficult because then
you have to--
if you're on page five, you have to maintain cursors from
the first four pages.
So this is why we-- actually, the application uses cursors
rather than using paging because we found out that we
had to maintain them and all that sort of stuff.
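The cursor-as-start-row idea can be sketched as follows; the list of rows and the cursor format are toy stand-ins for what Megastore actually does.

```python
# A cursor is just "the row from which to start scanning".
def search_page(rows, page_size, cursor=None):
    """Scan sorted rows starting at `cursor`; return the page
    plus the cursor for the next page (None when exhausted)."""
    start = 0 if cursor is None else rows.index(cursor)
    page = rows[start:start + page_size]
    nxt = start + page_size
    next_cursor = rows[nxt] if nxt < len(rows) else None
    return page, next_cursor

rows = ["doc01", "doc02", "doc03", "doc04", "doc05"]
page1, cur1 = search_page(rows, 2)        # ['doc01', 'doc02'], next cursor 'doc03'
page2, cur2 = search_page(rows, 2, cur1)  # ['doc03', 'doc04'], next cursor 'doc05'
# Going *back* to page1 requires having stored the earlier cursor
# yourself; the service only ever hands out the next one.
```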
SPEAKER 7: So is there a general solution for that?
Like everyone has to do it themselves.
And you're saying you have to maintain all of it.
But what if somebody--
you have a thing that says, page one,
page two, page three.
And they just want to jump to page six.
And I haven't gone there before, so.
BO MAJEWSKI: Yeah, so in that case--
GED ELLIS: In that case, you're not using cursors.
You're just using offsets so you don't need to record that.
You just give the page number and the
actual size of the page.
You can randomly access pages that way.
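Offset-based random page access, as Ged describes it, can be sketched as below; names are illustrative.

```python
# Page number plus page size is enough to jump to any page.
def fetch_page(results, page_number, page_size):
    """Random access by offset; no stored cursors required."""
    offset = (page_number - 1) * page_size
    return results[offset:offset + page_size]

results = ["r%d" % i for i in range(1, 61)]  # 60 hits
print(fetch_page(results, 6, 10))  # page six: r51 through r60
```

The trade-off is that the backend still has to skip everything before the offset, so cursors stay cheaper when you only ever page forward.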

SPEAKER 8: On the Datastore API, is there anything for
including reference properties in a document?
BO MAJEWSKI: Not at the moment,
but again valid remark.
We probably could just do it by keys.
So you could then extract the documents by keys.
At the moment, we didn't implement it.
So, no.
SPEAKER 9: [? Could that be ?]
used for a document [INAUDIBLE]

SPEAKER 8: Like if you have a user reference property, you
could-- most people might want to search by user?
BO MAJEWSKI: That would be somewhat complicated.
We would have to do multiple traversals of the index.
But by all means, send us feedback.
And if we find that the raster magicians can do it, why not?

SPEAKER 10: I have two questions.
The first question is related to the index space.
How do you configure it, and can you monitor it?

BO MAJEWSKI: You will have in your admin console the ability
to look at all indexes that you have. So there would be a
list of all indexes your application has.
And then once you drill into an index, you will have the
metadata and--
GED ELLIS: Things like sizing and--
BO MAJEWSKI: Size of an index.
I mean, at the moment, it's not real-time updated.
But it's probably going to be like six hours lagging basic
statistics about the index.
Something like this.
SPEAKER 10: I have one other question probably related to
the size question.
Are there any [UNINTELLIGIBLE]
on sharding and aggregation?

BO MAJEWSKI: I'm not sure.
So we actually shard it internally.
I don't think this is part of an application.
So once your index grows large, you
get spread over multiple--
what we call them-- tablets.
So we don't want you to do it explicitly at the moment.
It is done under the hood.
GED ELLIS: And by aggregation, you mean things like in the
case of a user index, like an inbox, you want to be able to
search all inboxes, over all inboxes or
something like that?
Is that what you mean?
SPEAKER 10: Inboxes-- like when you shard, you're
probably going to search across the shards and aggregate.
BO MAJEWSKI: It's already handled.

SPEAKER 11: The atom type, is that just a short word for a
token of this type?
BO MAJEWSKI: I have these weird preferences for having
words of the same length.
So the first three that we did were text, date, and then atom.
They all had four characters.

SPEAKER 12: If you have a field that can change after
the document is indexed, do you have to
re-index the whole document?
Say, I left the user tag in the document.
GED ELLIS: It just can't read it.

SPEAKER 13: I might have just missed this, but when is it
available, and when will the documentation be updated?
BO MAJEWSKI: Where's the person who--
GED ELLIS: He's hiding down the back there.
BO MAJEWSKI: If you see like a red mark on my chest, whatever
I'm answering, that means I'm answering wrong because--
SPEAKER 13: Is that Dan in the front row?
BO MAJEWSKI: No, he's actually in back because he's got a
really long rifle.

So apparently we are supposed to say we are
working very hard.
It actually works.
What you saw was no smoke and mirrors.
But there is a few production issues that we
need to push through.
SPEAKER 13: Will you go pre-release on the local

BO MAJEWSKI: Again, I have to judge his
eyebrows from a distance.
It's all going to be announced.
And Mickey's going to kill me if I--
you know, just one wrong word, I'm dead.
I know it's not a satisfactory answer, but this guy is scary.

SPEAKER 14: I just want to say thank you.
I love it.
Thanks for doing it.
BO MAJEWSKI: Thank you.

SPEAKER 15: I want to ask about scoring.
How about a proximity search?
For example, I have two documents, ABC or ACB.
If I am searching for A and B, the score for both documents
is the same, but can I get ABC first?

GED ELLIS: That's a good question.
I'm not sure if that scorer does that as well, but is
proximity in the backend?
BO MAJEWSKI: I don't know.
That's my answer.
That's a very lame answer, but kind of
honest for the moment.
So yeah.
Good feedback.
I mean, if you can either add it to Issue 217 or
something like this.
I mean, we definitely--
GED ELLIS: We're aware of proximity search, but whether
it's actually supported in that scorer, I'm not sure.

SPEAKER 16: For a different kind of proximity description
for this session, you mentioned geosearch.
How does that work?
BO MAJEWSKI: So you record a point so you can create a
field, which you specify that this is a location.
And you need to specify the longitude, latitude for us.
And then there is a function, which is a distance.
And so you can search by anything that has distance
less than 10 kilometers.
And give it a higher boost. So I don't know how you'd use it,
but you can use all these functions.
Like, if it's greater than 10 kilometers, give it one.
Otherwise, give it five.
Or you can just sort by a distance from a
given point as well.
But you can use it as a restrict in your search as
well as sort by the distance.
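The straight-line ("assume you can fly") restrict-and-sort can be sketched with the haversine great-circle formula. The documents, field names, and coordinates below are illustrative, not the actual backend implementation.

```python
import math

def distance_km(lat1, lon1, lat2, lon2):
    """Great-circle (haversine) distance in kilometers."""
    r = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

# Illustrative documents with a location field (rough Sydney spots).
docs = [
    {"name": "cafe", "lat": -33.87, "lon": 151.21},
    {"name": "airport", "lat": -33.95, "lon": 151.18},
    {"name": "zoo", "lat": -33.84, "lon": 151.24},
]
here = (-33.868, 151.207)  # the query point

# Restrict: distance less than 10 km; then sort nearest-first.
nearby = [d for d in docs if distance_km(here[0], here[1], d["lat"], d["lon"]) < 10]
nearby.sort(key=lambda d: distance_km(here[0], here[1], d["lat"], d["lon"]))
print([d["name"] for d in nearby])  # ['cafe', 'zoo', 'airport']
```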
SPEAKER 16: To prioritize?
To sort?
BO MAJEWSKI: But when you search, we use this stupid thing,
which is, we assume that you can fly.
SPEAKER 16: Yeah.
All right.
Thank you.
GED ELLIS: Thank you.
BO MAJEWSKI: All right.
I think that's a wrap.
Thank you very much.