Tokyo Cabinet 2 : Loading and querying point data

After setting up Tokyo Cabinet and Ruby, it's time to use them. As in my post about MongoDB, I'm going to load 500,000 POIs into a database and query them with a bounding box. I will use the table database from Tokyo Cabinet because it offers the richest query facilities. With a table database you can query numbers with exact-match and range conditions, and for strings you can do exact matching, forward (prefix) matching, regular expression matching, and more.

To load the data, I need to read my shapefile of POIs with Ruby and write the attributes to a new database. First we create the database with the following code.

require 'tokyocabinet'
include TokyoCabinet

# create the object
tdb = TDB::new

# open the database for writing, creating it if it does not exist
if !tdb.open("poi_db.tct", TDB::OWRITER | TDB::OCREAT)
  STDERR.printf("open error: %s\n", tdb.errmsg(tdb.ecode))
end

To read the features in my shapefile I am going to use the Ruby bindings for GDAL/OGR. Because I installed Tokyo Cabinet on GISVM, I already had FWTools installed, but I still needed to install the Ruby bindings for it. I did this with the following command.

sudo apt-get install libgdal-ruby

Now we are going to read a shapefile with 500,000 point features and write the records to the database. First we open the shapefile and get the layer. Then we loop over the features, create a new record for each one, and fill it with the x and y coordinates and any other fields that aren't empty. The values need to be converted to strings, otherwise the record can't be saved. Finally we put the record in the database.

require 'gdal/ogr'

# open my shapefile
dataset = Gdal::Ogr.open("poi_500000.shp")
layer = dataset.get_layer(0)

feature_defn = layer.get_layer_defn

layer.get_feature_count.times do |i|
  record = Hash.new # create a new record
  feature = layer.get_feature(i)
  geom = feature.get_geometry_ref
  record['x'] = geom.get_x(0).to_s
  record['y'] = geom.get_y(0).to_s
  pkey = tdb.genuid # default primary key
  # the field index is named j so it does not clash with the feature index i
  feature_defn.get_field_count.times do |j|
    field_defn = feature_defn.get_field_defn(j)
    fieldname = field_defn.get_name_ref
    value = feature.get_field_as_string(j)
    if !value.nil? && value != ""
      if fieldname == "ID"
        pkey = value # use the ID attribute as the primary key
      else
        record[fieldname] = value.to_s
      end
    end
  end
  # store the record in Tokyo Cabinet
  tdb.put(pkey, record)
end
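
The per-feature logic in the loop above can be isolated into a small helper, which makes it easier to try out without a shapefile. This is only a sketch: `build_record` and the sample attribute names are hypothetical, and it assumes the attributes have already been read into a plain Ruby hash.

```ruby
# Hypothetical helper (not part of Tokyo Cabinet or GDAL): turn one feature's
# coordinates and attributes into a primary key and a table record.
# Table records may only contain strings, so every value is converted.
def build_record(x, y, attributes, id_field = "ID")
  record = { 'x' => x.to_s, 'y' => y.to_s }
  pkey = nil
  attributes.each do |name, value|
    next if value.nil? || value.to_s == ""  # skip empty fields
    if name == id_field
      pkey = value.to_s                      # the ID becomes the primary key
    else
      record[name] = value.to_s
    end
  end
  [pkey, record]
end

# Sample attributes are made up for illustration.
pkey, record = build_record(4.7, 50.8, { "ID" => 42, "NAME" => "Cafe", "TYPE" => "" })
puts pkey
puts record.inspect
```

When `pkey` comes back as `nil`, the caller would fall back to `tdb.genuid`, as the loop above does.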

To add indexes on the x and y fields, we call the following code. This creates two supplementary files called poi_db.tct.idx.x.dec and poi_db.tct.idx.y.dec.

# add index on x and y
tdb.setindex('x', TDB::ITDECIMAL)
tdb.setindex('y', TDB::ITDECIMAL)
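
Because every column is stored as a string, it matters that the index and the query conditions interpret x and y as numbers. A quick plain-Ruby illustration of the difference between the two orderings (the sample values are made up, and no Tokyo Cabinet call is involved):

```ruby
# Coordinates as they would be stored in the table database: as strings.
xs = ["10.5", "9.5", "4.75"]

# Lexical (string) ordering compares character by character.
lexical = xs.sort
# Numeric ordering converts each string to a number first.
numeric = xs.sort_by { |s| s.to_f }

puts lexical.inspect
puts numeric.inspect
```

With string ordering "10.5" sorts before "4.75", which is why a decimal index and numeric conditions are the right choice for coordinates.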

To query the POIs in the database, I wrote a function that takes a bounding box, and then I benchmarked it. I used the same bounding box as in my previous posts about MongoDB, Rtree, Pythonnet and PostGIS.

# query POIs by bounding box
def query(tdb, minx, maxx, miny, maxy)
 qry = TDBQRY::new(tdb)
 qry.addcond("x", TDBQRY::QCNUMGE, minx.to_s())
 qry.addcond("x", TDBQRY::QCNUMLE, maxx.to_s())
 qry.addcond("y", TDBQRY::QCNUMGE, miny.to_s())
 qry.addcond("y", TDBQRY::QCNUMLE, maxy.to_s())
 qry.setorder("x", TDBQRY::QONUMASC)

 res = qry.search
 puts res.length # number of results found
 return res
end

require 'benchmark'
puts Benchmark.measure { query(tdb, 4.5, 5.0, 50.5, 51.0) }
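
The four conditions in `query` describe an inclusive bounding box, with the results sorted by x. A plain-Ruby equivalent over an in-memory hash can serve as a reference when checking results on a small sample. This is a sketch: `bbox_keys` and the sample points are hypothetical, and it does not use Tokyo Cabinet at all.

```ruby
# Hypothetical reference implementation: return the keys of all records
# whose x and y fall inside the bounding box (bounds inclusive),
# ordered by ascending x. Records store coordinates as strings.
def bbox_keys(records, minx, maxx, miny, maxy)
  hits = records.select do |_key, cols|
    x = cols['x'].to_f
    y = cols['y'].to_f
    x >= minx && x <= maxx && y >= miny && y <= maxy
  end
  hits.sort_by { |_key, cols| cols['x'].to_f }.map { |key, _cols| key }
end

# Made-up sample points.
records = {
  "a" => { 'x' => "4.9", 'y' => "50.7" },  # inside
  "b" => { 'x' => "5.2", 'y' => "50.7" },  # x too large
  "c" => { 'x' => "4.5", 'y' => "51.0" },  # on the boundary, still inside
  "d" => { 'x' => "4.6", 'y' => "49.9" }   # y too small
}
keys = bbox_keys(records, 4.5, 5.0, 50.5, 51.0)
puts keys.inspect
```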

The query returned 98000 POIs. I ran the benchmark 12 times, and these were the results:

  1.620000   0.190000   1.810000 (  1.866339)
  1.570000   0.030000   1.600000 (  1.625303)
  1.640000   0.030000   1.670000 (  1.668573)
  1.650000   0.000000   1.650000 (  1.664806)
  1.650000   0.020000   1.670000 (  1.708228)
  1.730000   0.010000   1.740000 (  1.744645)
  1.410000   0.310000   1.720000 (  1.749268)
  1.620000   0.050000   1.670000 (  1.724199)
  1.610000   0.010000   1.620000 (  1.657794)
  1.660000   0.020000   1.680000 (  1.680383)
  1.710000   0.020000   1.730000 (  1.767141)
  1.720000   0.010000   1.730000 (  1.809114)

According to the Ruby documentation the benchmark outputs the user CPU time, the system CPU time, the sum of the user and system CPU times, and the elapsed real time (in parentheses). So the query took between 1.63 and 1.87 seconds of elapsed time to return the keys of the 98000 POIs within the given bounding box. This is a nice indication of the speed of Tokyo Cabinet.
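
To summarize the twelve runs rather than reading the range off by eye, the elapsed times can be fed through a few lines of Ruby. This is a small sketch that only uses the numbers listed above; the variable names are my own.

```ruby
# Elapsed real times (seconds) copied from the twelve benchmark runs above.
real_times = [
  1.866339, 1.625303, 1.668573, 1.664806, 1.708228, 1.744645,
  1.749268, 1.724199, 1.657794, 1.680383, 1.767141, 1.809114
]

fastest = real_times.min
slowest = real_times.max
mean = real_times.inject(0.0) { |sum, t| sum + t } / real_times.length

printf("min %.3f s, max %.3f s, mean %.3f s\n", fastest, slowest, mean)
```

The mean comes out at roughly 1.72 seconds per query.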

To demonstrate how you can access the attributes, I wrote the following code. It loops over the first 100 POIs found and prints the ID and the x and y coordinates.

res = query(tdb, 4.5, 5.0, 50.5, 51.0)
# print the first hundred POIs found
res.first(100).each do |rkey|
  rcols = tdb.get(rkey)
  # the ID attribute was used as the primary key when loading, so the key is the ID
  puts rkey.to_s + " " + rcols['x'].to_s + " " + rcols['y'].to_s
end

Now we are ready to close the database. I hope you enjoyed this post and as always I welcome any comments.

# close the database
if !tdb.close
 ecode = tdb.ecode
 STDERR.printf("close error: %s\n", tdb.errmsg(ecode))
end

