Searchkick: Reindex on model in multitenancy through default scope app

Mark Radford September 12, 2016 7:09am

Has anyone had any success indexing their multitenant data with searchkick? I followed the suggested article in the readme (https://www.tiagoamaro.com.br/2014/12/11/multi-tenancy-with-searchkick/) but this results in an index for each tenant/model combination which will become expensive and is not scalable.

Therefore, I am trying to create one index per model, for all tenants (i.e. one index for the model rather than 100 indices if I have 100 users). When I try to run reindex I run into an issue because the default scope is applied with no tenant set, so the query filters on tenant_id being null and no data is returned.

I can get around the default scope issue by using something like Product.unscoped.reindex(accept_danger:true), however, the default scope is still called when loading associated data. So rather than:

class Product < ActiveRecord::Base
  belongs_to :department

  def search_data
    {
      name: name,
      department_name: department.name,
      on_sale: sale_price.present?
    }
  end
end

I need to use:

class Product < ActiveRecord::Base
  belongs_to :department

  def search_data
    {
      name: name,
      department_name: Department.unscoped.find(self.department_id).name,
      on_sale: sale_price.present?
    }
  end
end

Can anyone suggest a better way of using reindex with this multitenancy setup?

Chris Oliver September 12, 2016 4:21pm

Hey Mark,

Yeah, so I think the one thing with multitenancy is that the goal is to truly separate out all your data between users so they never intermingle. Most people don't actually want or need that, but some do for security reasons. Sounds like in your case you don't really need it.

I'm not sure of a better way of structuring this for you because regardless you're going to be stuck within the tenant. What if you don't use tenants and instead make sure that you scope all your queries to the current user or organization?

Mark Radford September 16, 2016 5:18am

Thanks for replying Chris. I actually scope my queries to business which is similar to organization, I just used tenant as the example because I thought this was the common terminology. I'm going to keep trying my current implementation as per above but I'm going to try using Cloud Front in production as they don't have artificial limits on indices and shards as some other providers do.

Chris Oliver September 16, 2016 3:34pm

Probably a good plan. Yeah the thing is that most times "multi-tenancy" is more for when you truly want separate databases and everyone's stuff separated out. I think it's a common misconception and one that's kinda hard to make clear at times. Sounds like a decent plan and you can always go back and change things up later, it just may take a little longer with production data later on which isn't that bad.

Mark Radford October 5, 2016 11:17am

Rails has a bug with scoping where unscoped is not applied to the block. This is discussed here, with a solution here. This was also backported to Rails 4.2 in the stable branch https://github.com/rails/rails/pull/25232

I've tried to use stable with gem 'rails', :git => 'https://github.com/rails/rails.git', :branch => '4-2-stable', however, the bug still seems to exist for me. I've tried to search the rails code to see if the code for the patch is present but I can't find it. I was influenced by this comment

Any suggestions on how I can make sure I'm running 4-2-stable with the commit I need?

Chris Oliver October 5, 2016 3:35pm

Hmm, I don't see what query block you're referring to? The unscoped method you use doesn't have a block and the issue more just stems from indexing the database where the tenant isn't set.

It looks like you've got the correct url for using the gem from github, although I don't think that's your problem. I don't see anywhere this is calling a block on the query, and your real issue is still probably the same as before. You're basically indexing but no tenant was ever set.

I think the solution for you is to build your own index rake task. You'd loop through each tenant, set the Apartment Tenant, and then index each of the records inside of it (rather than in bulk). Not sure why I didn't think of that before.

Based off https://www.tiagoamaro.com.br/2014/12/11/multi-tenancy-with-searchkick/ you could do something like this:

namespace :searchkick do
  desc 'Reindex all models on all tenants'
  task reindex_tenants: :environment do
    Rails.application.eager_load!

    Apartment::Tenant.each do |schema|
      Apartment::Tenant.switch!(schema)
      Searchkick.models.each do |model|
        puts "Reindexing #{model.name} on #{schema}"
        model.reindex
      end
    end
  end
end

This isn't modified from his code, except that you're not specifying separate indexes. This will set the tenant for each, it will find all the indexed models, and then it will also go and query for those records that are available. It'll do this once for each Tenant, which means it will find different sets of Product and Department records each time.

You will probably want to go back to using department_name: department.name, because you'll be inside the tenant this way. I believe this should do what you need because it's properly setting the tenant. Curious to see if that works for you.

Mark Radford October 5, 2016 10:33pm

Thanks for taking the time to reply.

I previously implemented multitenancy with scopes following this railscast

Where you said:

You'd loop through each tenant, set the Apartment Tenant, and then index each of the records inside of it (rather than in bulk)

...

except that you're not specifying separate indexes.

Seeing as I'm not specifying separate indexes, then I believe when I change the tenant (business for me) and reindex then the index will not contain results for both businesses, only the one that I most recently reindexed for.

For example, product.rb:

default_scope { where(business_id: Business.current_id) }
searchkick index_name: -> { [ model_name.plural, Rails.env].join('_') }, settings: {number_of_shards: 1, number_of_replicas: 1}

custom rake task:

Business.current_id = 1
Product.reindex
Business.current_id = 2
Product.reindex

If we perform a search after the custom rake task then it will only have products for business.id = 2

The reason I was looking into the rails bug with scoping was because I want to use a join within my search_data:

def search_data
  {
    column_name: Parent.unscoped.joins(:grandparent).where(id: parent_id).pluck("grandparent.column_name")[0].presence || "",
  }
end

With my current version of Rails (4.2.7) the unscoped is not applied properly. I believe the team decided that is correct behaviour but not when used in a block, so:

Grandparent.unscoped do
  Parent.unscoped.joins(:grandparent).where(id: parent_id).pluck("grandparent.column_name")
end

Should ignore the scope with the patch applied, but for me I can't get it to work with 4-2-stable (though I can get it to work on my Rails 5 test branch)

Chris Oliver October 5, 2016 10:41pm

Yeah, I guess if the reindex on a model clears the index before adding in the records, then that won't work.

However, then you should be able to go through each record and call reindex on it individually. That I know won't clear the index and so you could compile a full index of all the product records after looping through the tenants. It might be a tad slower on the initial index, but that's only going to happen the first time you index the full database. From then on, you'll be indexing things inside the app when changes are made so it should stay in sync just fine. And you won't need to do any of that department unscoped querying either. You can just access through the record directly.
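To make the difference concrete, here is a toy in-memory model of the two strategies. It is only a sketch: `ToyIndex`, `PRODUCTS`, and `scoped_products` are made up for illustration and stand in for the Elasticsearch index, the products table, and the business default scope. It assumes, as guessed above, that a full model reindex rebuilds the index from whatever records are currently visible.

```ruby
# Toy in-memory stand-in for a search index (hypothetical, not Searchkick).
class ToyIndex
  attr_reader :docs

  def initialize
    @docs = {}
  end

  # Mimics a full model reindex: rebuild the index from the records that are
  # visible right now and discard whatever was there before.
  def full_reindex(records)
    @docs = records.to_h { |record| [record[:id], record] }
  end

  # Mimics a per-record reindex: upsert one document, keep everything else.
  def reindex_record(record)
    @docs[record[:id]] = record
  end
end

PRODUCTS = [
  { id: 1, business_id: 1, name: "Widget" },
  { id: 2, business_id: 2, name: "Gadget" }
].freeze

# Stand-in for default_scope { where(business_id: Business.current_id) }.
def scoped_products(current_business_id)
  PRODUCTS.select { |product| product[:business_id] == current_business_id }
end

# Full reindex once per business: each run replaces the previous contents.
def ids_after_full_reindex(business_ids)
  index = ToyIndex.new
  business_ids.each { |id| index.full_reindex(scoped_products(id)) }
  index.docs.keys
end

# Record-by-record reindex per business: documents accumulate.
def ids_after_record_reindex(business_ids)
  index = ToyIndex.new
  business_ids.each do |id|
    scoped_products(id).each { |product| index.reindex_record(product) }
  end
  index.docs.keys
end

puts "full reindex per business: #{ids_after_full_reindex([1, 2]).inspect}"
puts "record-by-record:          #{ids_after_record_reindex([1, 2]).inspect}"
```

In this toy model the full reindex ends up holding only business 2's product, while the record-by-record loop holds both.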

On the unscoped part, you aren't calling a block there so I don't think you'll be running into that Rails bug.

Mark Radford October 5, 2016 10:47pm

So, for example, I could do something like:

Business.current_id = 1
Product.all.each do |product|
  product.reindex
end

Business.current_id = 2
repeat above

????

I didn't know you could call reindex on each record. I'll give that a try.

Regarding unscoped, when I change my code to use a block it should then therefore ignore the scope, but it doesn't. So I think I am running into the bug when I'm trying to use a workaround with a block (I could be wrong).

Chris Oliver October 5, 2016 10:50pm

Yeah in theory. I'm guessing that Business.current_id sets the tenant?

The callbacks for when you update a record are basically what you'd be tying into here. Anytime you update or delete a record, it needs to update or delete that item in the index. Rather than doing a bulk insert, you'll do them one by one so you can control the tenant stuff, which was the problem with the bulk imports because they couldn't handle the individual records.

You should keep your code then as if it were always in the proper tenant, so your model should look like it normally would:

class Product < ActiveRecord::Base
  belongs_to :department

  def search_data
    {
      name: name,
      department_name: department.name,
      on_sale: sale_price.present?
    }
  end
end

The reason for that is because this way you'll always be in the correct tenant, so you'll always be able to look up the department just fine.

Mark Radford October 6, 2016 2:05am

Yes, Business.current_id sets the tenant.

It's working now using reindex on the individual records like you suggested. The callbacks for updating the item in the index are also working. Elasticsearch (with Searchkick) is quite impressive once it's up and running. Thank you so much for all of the time and effort you have given.

Are there any GoRails episodes that you would recommend for learning how to use gems in general? By this I mean, unless functionality is specified in the readme, I struggle to understand how it works. I often look in http://www.rubydoc.info/gems/ without much success. I occasionally download the source code for the gem.

An example of a problem is the code you previously referenced: Searchkick.models.each do |model|

I tried Searchkick.models but nothing was returned. So I looked in the usual places (readme and rubydoc.info) and couldn't find any helpful information on Searchkick.models. The link you gave that supplied the sample code also stated Searchkick.models method is available on versions 0.8.6+, I'm using a version later than that so that shouldn't be the problem. Are there any episodes that could help me improve this type of learning of gems and their functionality?

Chris Oliver October 6, 2016 2:14am

So there isn't really anything specific other than I read the source code of gems. The undocumented details and important stuff is almost always hidden away inside the source for the gem unless it's a very popular gem.

For example to learn about Searchkick.models, I would just search the repository for models which I would assume would be a method somewhere inside the gem or a class variable and I'd start poking around that. The key with that would be figuring out what it does, how it's used, etc.
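One way to shortcut the search is to ask Ruby itself where a method lives. The sketch below uses a made-up `ToyGem` module as a stand-in for a real gem; `Method#source_location` and `singleton_methods` are standard Ruby, and the same calls could be tried on a gem's own constants in a console, though what they return depends on how that gem defines the method.

```ruby
# Hypothetical stand-in for a gem's top-level module; with a real gem you
# would call the same reflection methods on the gem's own constants.
module ToyGem
  class << self
    attr_accessor :models
  end
  self.models = []

  def self.register(model)
    models << model
  end
end

# Method#source_location returns [file, line] for methods written in Ruby
# (and nil for methods implemented in C).
def where_defined(receiver, method_name)
  receiver.method(method_name).source_location
end

file, line = where_defined(ToyGem, :register)
puts "ToyGem.register is defined at #{file}:#{line}"

# Listing a module's own singleton methods is a quick way to see what it exposes.
puts ToyGem.singleton_methods(false).sort.inspect
```

Once you have a file and line number, opening that file in the installed gem is usually quicker than browsing the repository.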

Today I was poking around the source code for docusign_rest to learn how it worked because I couldn't get some options passed over the API correctly. 15 seconds of looking at the source later, and I knew exactly what was wrong.

The thing with gems is realizing they're not a black box, they're just regular ruby code you would have written, but they're packaged up nicely for people to reuse, so you should always feel comfortable reading the source for that. It feels daunting at first, but honestly all the code in the gems is generally pretty much what you would have had to write to make the feature work if they didn't do it for you. Almost every time it's pretty logical when you dive into it, especially when you're curious about very specific bits like Searchkick.models as you don't have to understand how things work completely, just the small piece.

Mark Radford October 6, 2016 2:41am

Thanks for the informative and reassuring response. Gem code does feel daunting to me at this stage but remembering that it is just regular ruby code does help. I will continue to look at the source and expand that comfort zone.

Thank you for all of the help and thank you for GoRails.

Chris Oliver October 6, 2016 2:43am

You're welcome and I'll definitely try to see about doing this as an episode moving forward. I think it'll be kinda tough to figure out a good example for this that really showcases the idea, but if you have any ideas I'm all ears! I tried doing this a couple times before but it's just one of those things I think you learn over time and kinda hard to appreciate until you've done it a few times.

Mark Radford October 13, 2016 12:02am

Maybe you could do a short episode on reading the gem source for searchkick to determine what the reindex method does? I don't know if that's too specific to be useful to your whole customer base.

With reindexing by the individual record I found out that I needed to use Product.reindex(import: false) to create the index first as using simply record.reindex wouldn't apply the Searchkick settings (ie search_data, index name, etc). Discussed here.

I guess reindexing by the record does have the disadvantage that every time I:

install or upgrade searchkick
change the search_data method
change the searchkick method

I'm going to need to recreate my indices again with Product.reindex(import: false), and then loop through the records (with ActiveRecord) to reindex each record individually. So it's obviously not an ideal way of doing things, but the only other alternative is using unscoped in a block with the patch applied (which works in Rails 5). I would assume that Model.reindex could be significantly faster than looping through with record.reindex

Chris Oliver October 13, 2016 12:11am

You're definitely far deeper into Searchkick than I've ever been at this point. :) I agree, some advanced searchkick usage like aggregates and geosearch would be really great to cover. Maybe some custom index stuff like what you're up to would be valuable as well.

So that import: false basically just tells it to create a blank index and ignore all the records in the database right?

Actually... you might check out what the Model.reindex code does. You might be able to pull some chunks from that to create your own method to do the bulk reindex and not clear the index each time. That might let you build a custom method for indexing that could take advantage of any bulk indexing they might have as well as support your multi-tenant application.

Mark Radford October 13, 2016 1:11am

So that import: false basically just tells it to create a blank index and ignore all the records in the database right?

Yes, that's what it appears to do in my testing. A blank index with your specific Searchkick settings applied (whereas record.reindex will not create the index correctly if it hasn't been previously created).

I agree that it's a good idea for me to see what Model.reindex does, and use that code for my own method, I'm just struggling a little to read the gem code and that's why I thought it would be a great idea for a short episode (well for me anyway). How to understand how to read a gem and figure out where a method is, how it's called and what it's doing. But again I understand that this may be too specific for a general episode to be useful to a large number of people. I'm sure I'll figure it out with some perseverance.

Mark Radford October 26, 2016 1:02am

As an update, I wouldn't advise reindexing by the individual record when you have a large amount of data. My custom rake task has been running for approximately 18 hours and it's still not finished. This approach does not allow for zero downtime reindexing either, which isn't a problem if you don't plan on changing the Searchkick mappings/structure, but if you do, you'll need to write some custom code to try and perform zero downtime with using import: false. So far for me, creating the custom task is taking a lot of time and doesn't seem worth it.

I'm upgrading my app to Rails 5 at the moment which includes the scope patch, so I will go back to using the default Searchkick methods and scope my search_data, ie:

class Product < ActiveRecord::Base
  belongs_to :department

  def search_data
    {
      name: name,
      department_name: Department.unscoped.find(self.department_id).name,
      grandparent_column: Grandparent.unscoped { Parent.unscoped.joins(:grandparent).where(id: parent_id).pluck("grandparent.column_name") }
    }
  end
end

For future reference, after pulling code from the Searchkick gem, my custom rake task (that I am currently moving on from) began to look like the below, though I haven't applied tenant/business scoping yet:

#scope = searchkick_klass

searchkick_index = Searchkick::Index.new(Department.searchkick_index.name, Department.searchkick_options)
searchkick_index.clean_indices
index = create_index(index_options: Department.searchkick_klass.searchkick_index_options)
# check if alias exists
if searchkick_index.alias_exists?
  # import before swap
  Department.searchkick_klass.find_in_batches batch_size: 1000 do |records|
    if records.any?
      event = {
        name: "#{records.first.searchkick_klass.name} Import",
        count: records.size
      }
      ActiveSupport::Notifications.instrument("request.searchkick", event) do
        super(records)
      end
    end
  end
end
# get existing indices to remove
searchkick_index.swap(index.name)
searchkick_index.clean_indices
index.refresh

Chris Oliver October 26, 2016 1:59am

The performance on that is probably similar to writing and committing records one by one on a csv import vs writing a transaction and committing everything at once. You'll have a lot more speedups writing everything in bulk.
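A rough way to picture the gap is to count round trips. This sketch contacts no search server and assumes each single-record reindex costs one request while a bulk import sends one request per batch; the table size is hypothetical and the batch size of 1000 is taken from the find_in_batches call in the task above.

```ruby
# Split records into batches, as a bulk importer would before sending each
# batch in a single request.
def batches(records, batch_size)
  records.each_slice(batch_size).to_a
end

# Round trips needed when every record is sent on its own.
def one_by_one_requests(records)
  records.size
end

# Round trips needed when records are sent one batch per request.
def bulk_requests(records, batch_size)
  batches(records, batch_size).size
end

record_ids = (1..250_000).to_a # hypothetical table size
puts "one by one: #{one_by_one_requests(record_ids)} requests"
puts "bulk:       #{bulk_requests(record_ids, 1000)} requests"
```

Request count is not the whole story (payload size and indexing work still matter), but it gives a feel for why the per-record loop is so much slower.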

Are you sure that the scope thing is actually going to solve your problem in Rails 5? I thought we determined that wasn't going to help as it wasn't related to your tenant issue?

Mark Radford October 26, 2016 6:01am

writing a transaction and committing everything at once. You'll have a lot more speedups writing everything in bulk.

I was trying to figure out how to do that in my custom task before I accepted that I'm probably wasting too much time and should just put a workaround in place to use the defaults offered by the gem

With regards to the scope, in the latest release of Rails 4, without my tenant/business set, if I run:

Grandparent.unscoped { Parent.unscoped.joins(:grandparent).where(id: self.parent_id).pluck("grandparent.column_name") }

then the business scope is still applied and no result is returned (the query contains AND "grandparents"."business_id" IS NULL).

whereas in Rails 5, if I run the same command, the business scope is not applied and I receive the result I expect (the query does not contain AND "grandparents"."business_id").

So I figure if I setup my search_data to use unscoped in blocks as per what works in Rails 5, then I will be able to use the standard Searchkick Model.reindex rather than record.reindex and avoid needing to create my own custom task that allows for zero downtime reindexing when using (import: false)

Chris Oliver October 26, 2016 2:32pm

That's interesting with regards to scopes. I think that doesn't have any bearing on your issue though, because the Apartment gem is not using scopes to separate out your tenants, it's using postgres schemas (you're using pg right?) which makes the records in the other schemas unqueryable entirely because they aren't even visible unless you're in the global tenant. Does that sound right?

Mark Radford October 26, 2016 11:03pm

My apologies Chris, I see I've created some confusion with my original post. When I said "I followed the suggested article in the readme", I was referring to the Searchkick readme where it says "Check out this great post on the Apartment gem. Follow a similar pattern if you use another gem". I never explicitly stated that I wasn't using the Apartment gem, which I am not.

I set up multitenancy in my app following this railscast, which uses scopes.

Chris Oliver October 26, 2016 11:09pm

Ohhhh! My bad! Hahaha I was assuming you were using apartment. This all makes a ton more sense now!!

Mark Radford October 26, 2016 11:39pm

Yeah, my apologies for taking so long to clear this up. Are you familiar with that implementation? Unrelated to elasticsearch but I've read:

I don't like the default_scope for the reason that it is not threadsafe. The user id is stored in a class variable, which means that two or more concurrent users in your app will break this unless you use Unicorn or some other web server that makes sure no more than one single client connection will access the same thread.
http://stackoverflow.com/a/22534147/1299792

I've responded to that comment with:

In the railscast Ryan said: "We can find another potential issue in the Tenant model where we call cattr_accessor for the current_id attribute. While this is convenient it’s not really thread-safe so we might want to do something like this instead: Thread.current[:tenant_id] = id, Now we have getter and setter methods that use Thread.current to set the value which is more thread-safe". Do you still feel using default_scope with this implementation is not thread-safe?
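For anyone following along, the Thread.current variant from that quote can be sketched in plain Ruby. The `Tenant` class and `simulate_requests` helper here are hypothetical names for illustration, not the railscast's actual code; the point is only that each thread sees its own id.

```ruby
# Hypothetical tenant holder in the style described in the quote: the current
# id lives in Thread.current rather than in a class-level variable.
class Tenant
  def self.current_id=(id)
    Thread.current[:tenant_id] = id
  end

  def self.current_id
    Thread.current[:tenant_id]
  end
end

# Run one simulated request per id on its own thread and collect what each
# thread reads back after the others have had time to write.
def simulate_requests(ids)
  threads = ids.map do |id|
    Thread.new do
      Tenant.current_id = id
      sleep 0.01 # give the other threads time to set their own value
      Tenant.current_id
    end
  end
  threads.map(&:value)
end

puts simulate_requests([1, 2]).inspect
```

Note that the value sticks to the thread, and app servers with thread pools reuse threads, so the id would need to be set (or cleared) at the start of every request.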

Mark Radford December 15, 2016 7:47am

Hi Chris,

Unrelated to all of the multitenancy and reindexing talk, how come you didn't need to use autocomplete: true in your search query and also set the autocomplete field in your search data? I've been struggling to get my search queries to match as I wanted and it only worked when I incorporated autocomplete. I previously tried things like word_start, word_end etc with no luck.

Thanks.

Chris Oliver December 15, 2016 8:12pm

Searchkick's docs for autocomplete show using this which should tell it to index those with the word_start option, so I believe that's all I did.

class Book < ActiveRecord::Base
  searchkick word_start: [:title, :author]
end

https://github.com/ankane/searchkick/#instant-search--autocomplete

You doing something different?

Mark Radford December 16, 2016 6:29am

Yeah, for me if I search for a Product for number "pm07" then I only want that product returned, I don't want "pm01" or "pm03" returned. I was only able to get this to work by using autocomplete:true but I can't figure out why.

If we look at what's created for word_start we find:

Mapping

"product_number" : {
  "type" : "keyword",
  "fields" : {
    "analyzed" : {
      "type" : "text"
    },
    "word_start" : {
      "type" : "text",
      "analyzer" : "searchkick_word_start_index"
    }
  },
  "ignore_above" : 256
},

Analyzer

searchkick_word_start_index: {
  type: "custom",
  tokenizer: "standard",
  filter: ["lowercase", "asciifolding", "searchkick_edge_ngram"]
},

searchkick_edge_ngram filter

searchkick_edge_ngram: {
  type: "edgeNGram",
  min_gram: 1,
  max_gram: 50
},

If we look at what's created for autocomplete we find:

Mapping

"product_number" : {
  "type" : "keyword",
  "fields" : {
    "analyzed" : {
      "type" : "text"
    },
    "autocomplete" : {
      "type" : "text",
      "analyzer" : "searchkick_autocomplete_index"
    }
  },
  "ignore_above" : 256
}

Analyzer

"searchkick_autocomplete_index" : {
  "filter" : ["lowercase","asciifolding"],
  "type" : "custom",
  "tokenizer" : "searchkick_autocomplete_ngram"
},

Tokenizer

tokenizer: {
  searchkick_autocomplete_ngram: {
    type: "edgeNGram",
    min_gram: 1,
    max_gram: 50
  }
}

So I think both word_start and autocomplete use lowercase, asciifolding and edgeNGram.
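As a sanity check on what those settings produce, here is a simplified plain-Ruby stand-in for an edge n-gram step. It is not Elasticsearch code: `edge_ngrams` is a made-up helper, and it assumes the product number has already been lowercased and arrives as a single token.

```ruby
# Simplified stand-in for an edge n-gram filter applied to one lowercase
# token; min_gram and max_gram mirror the settings quoted above.
def edge_ngrams(token, min_gram: 1, max_gram: 50)
  longest = [token.length, max_gram].min
  (min_gram..longest).map { |length| token[0, length] }
end

indexed = edge_ngrams("pm07")
shared  = indexed & edge_ngrams("pm01")

puts "terms indexed for pm07: #{indexed.inspect}"
puts "terms shared with pm01: #{shared.inspect}"
```

Under these assumptions "pm07" and "pm01" share the prefixes p, pm and pm0 at index time, so how the query side is analyzed and matched decides whether both come back.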

The difference I think comes in the search query and the use of autocomplete: true. So with word_start we can simply use:

Product.search "pm07"

whereas with autocomplete we have:

Product.search "pm07", autocomplete: true

which I think then uses the following code:

if options[:autocomplete]
  payload = {
    multi_match: {
      fields: fields,
      query: term,
      analyzer: "searchkick_autocomplete_search"
    }
  }

searchkick_autocomplete_search: {
  type: "custom",
  tokenizer: "keyword",
  filter: ["lowercase", "asciifolding"]
},

At this point in time I can't figure out what payload code is called/used for word_start and how it differs to that used by autocomplete

Mark Radford December 16, 2016 9:04am

I posted the comment above in an existing Searchkick issue and the author responded with:

My guess is you need to use misspellings: false. Also, to help with debugging queries and mappings, you can use the recently added:

Product.search("something", debug: true)
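The misspellings guess can be illustrated with a small edit-distance check. This is a plain-Ruby sketch, not Searchkick or Elasticsearch code: `edit_distance` and `fuzzy_match?` are made-up helpers, plain Levenshtein distance is used as a simplification of the fuzzy matching Elasticsearch performs, and the tolerance of one edit is an assumption about the default.

```ruby
# Levenshtein distance between two strings, computed row by row.
def edit_distance(a, b)
  previous = (0..b.length).to_a
  a.each_char.with_index(1) do |char_a, i|
    current = [i]
    b.each_char.with_index(1) do |char_b, j|
      cost = char_a == char_b ? 0 : 1
      current << [
        previous[j] + 1,       # delete a character from a
        current[j - 1] + 1,    # insert a character into a
        previous[j - 1] + cost # substitute (free when the characters match)
      ].min
    end
    previous = current
  end
  previous.last
end

# A term counts as a fuzzy match when it is within max_edits of the query.
def fuzzy_match?(query, term, max_edits: 1)
  edit_distance(query, term) <= max_edits
end

%w[pm01 pm03 pm07 pm12].each do |term|
  puts "#{term}: #{fuzzy_match?('pm07', term)}"
end
```

With a tolerance of one edit, pm01 and pm03 both count as matches for pm07, which fits the behaviour described earlier; with the tolerance set to zero only pm07 itself matches.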

Chris Oliver December 16, 2016 4:57pm

Awesome to have the debugging option, and it makes sense that you don't want misspellings in that situation. Search is complex. Haha
