Thursday 12 April 2012

Sinatra and Sequel - The dreaded ASCII-8BIT/UTF-8 conundrum

I recently had to create a quick-and-dirty Sinatra app that connected to MySQL via the Sequel gem, made some rather obscene SQL calls and returned JSON. The problem is that the underlying connection always seems to return blobs and ASCII-8BIT binary data. So here's a solution.

I'm not claiming there aren't better ways to do it. I just had less than 30 minutes to get the app working in production. Here's how I did it:

First we monkey-patch Struct with a to_map and a to_json method:

require 'json'

class Struct
  # Convert the struct's members into a plain Hash
  def to_map
    map = Hash.new
    self.members.each { |m| map[m] = self[m] }
    map
  end

  # Serialize via the Hash representation
  def to_json(*a)
    to_map.to_json(*a)
  end
end

We're going to respond with a bit of JSON, so we set up a handy struct for the response payload.

payload = Struct.new(:status, :messages, :value)
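With the Struct patch above in place, an instance of this struct serializes straight to JSON. Purely as an illustration (the values here are made up, and the exact output may vary slightly by json gem version):

p = payload.new(true, [], [{ 'field1' => 'value1' }])
p.to_json
# => "{\"status\":true,\"messages\":[],\"value\":[{\"field1\":\"value1\"}]}"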

We make our connection:

DB = Sequel.mysql(
    :database => 'your_db_name',
    :user => 'user', :password => 'password',
    :host => '192.168.1.1', :port => 3306,
    :encoding => 'utf8'
)
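If you want to check the connection actually works before wiring up any routes, a throwaway query will do (this check is mine, not part of the original app):

DB.test_connection          # raises if the server is unreachable
DB['select 1 as ok'].first  # => should return something like {:ok => 1}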

Now here is some nasty SQL:

nasty_sql = "
    select    t1.field1,
              t2.field2,
              t3.field3,
              t4.field4,
              t5.field5
    from      db1.table1 as t1
    left join db1.table2 as t2 on t2.someId = t1.someId
    left join db1.table3 as t3 on t3.someId = t2.someId
    left join db1.table4 as t4 on t4.someId = t3.someId
    left join db2.table5 as t5 on t5.someId = t4.someId
    left join db2.table6 as t6 on t6.someId = t5.someId
    where     t1.someField = ?
    and       t2.someField is not null
    and       t2.someField = 1
    and       t2.someField in ('Gonzo','Journalist','Greatest Articles')
    and       t3.someField is not null
    and       t4.someField = 1
    and       t5.someField = ?
    order by  t1.someField, t2.someField"

Ignore the obvious 'why not a stored proc?' and 'crap! change the tables!' responses. Imagine you don't have a choice. And you're running out of time and your manager is staring pointedly at you. You're patching production and you'll be fired if it doesn't work RIGHT NOW.

Hypothetical anyway, as it's an article about how to do it, not why the hell you should do it in the first place.

Sooooo.... Anyway... We have a route:

get '/some/:f1/:f2' do
  begin
    result = payload.new(true, [], [])
    DB[nasty_sql, params[:f1], params[:f2]].each do |row|
      result.value << row
    end
    JSON.pretty_generate(result)
  rescue Sequel::DatabaseError => e
    # log it
  end
end

When we do this, we get a nasty bunch of errors. Although the underlying connection is UTF-8, the rows come back as blobs or strings tagged as binary data, so their encoding is ASCII-8BIT and the JSON generator chokes on them.
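To see what's going on, here's a minimal sketch of the situation, assuming the bytes coming back are in fact valid UTF-8 (which they are here, since the connection itself is UTF-8). force_encoding only re-tags the string; the bytes are never touched:

# Simulate what the driver hands back: UTF-8 bytes tagged as binary
name = "caf\xC3\xA9".force_encoding('ASCII-8BIT')

name.encoding   # => #<Encoding:ASCII-8BIT>
name.bytesize   # => 5 (the bytes are already valid UTF-8 for "café")

name.force_encoding('UTF-8')
name.encoding   # => #<Encoding:UTF-8>
name            # => "café"

Feed the binary-tagged version to JSON.generate or JSON.pretty_generate and, depending on your json gem version, you'll get a JSON::GeneratorError or an Encoding::UndefinedConversionError.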

So here's how to solve it. First, fix the encoding on a per-row basis as the rows come back:

get '/some/:f1/:f2' do
  begin
    result = payload.new(true, [], [])
    DB[nasty_sql, params[:f1], params[:f2]].each do |row|
      result.value << fix_encoding_issue(row)
    end
    JSON.pretty_generate(result)
  rescue Sequel::DatabaseError => e
    # log it
  end
end

And add the method:

# This is required because when using raw SQL, the data comes back as a blob.
# The actual data is UTF-8, but because the string is marked as binary it is treated as ASCII-8BIT.
# Thus every string in a row must be coerced to its actual encoding.
# Also the keys... Just in case...
def fix_encoding_issue(row)
  item = Hash.new
  row.each_pair do |k, v|
    key = k.to_s.force_encoding('UTF-8')
    item[key] = v.kind_of?(String) ? v.force_encoding('UTF-8') : v
  end
  item
end
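A quick way to convince yourself the helper behaves (the sample row is made up, not real query output):

raw_row = { :field1 => "caf\xC3\xA9".force_encoding('ASCII-8BIT'), :field2 => 42 }

fixed = fix_encoding_issue(raw_row)
fixed['field1'].encoding  # => #<Encoding:UTF-8>
fixed['field1']           # => "café"
fixed['field2']           # => 42 (non-strings pass through untouched)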

And your encoding issues are ameliorated.
