Posts Tagged ‘sql’

Howto create “standardized” CSV exports from MySQL

April 26th, 2010

When you start querying/analyzing/and wanting to export data from MySQL (especially anything free-form like SEO/SEM Keyword Data Extracts) you’ll quickly find out you can’t simply concatenate your columns together and join them with a delimiter to be read later.
Sampling some of our keywords illustrates a number of problems:

Quotes: ! 1/4″ inlet muffler
Newlines: xtra vision
headlight h4703xv hours
Control Characters: ^HToyota+starlet+coolant+temp+sensor, 242°/250° duration @ .050″ 601 lift
Commas/Tabs/Semi Colans or other typical delimiters: “+bosch, +part no., +67647″, 2500hd
chevy cat back exhaust
Unicode: ايسوزو 20011
ב.מ.וו 545i, 宝马m3冷却系统

The Original MySQL attempt to export this was:

SELECT
	`medium`,
	`keyword`,
	`landingpage`,
	SUM(`entrances`) as entrances
INTO OUTFILE 'keyword_data.csv'
	FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
	LINES TERMINATED BY '\r\n'
FROM
	`daily_medium_keyword_landingpage_visits`
WHERE
	medium = 'organic'
GROUP BY
	`medium`,
	`keyword`,
	`landingpage`;

This produced a really great file format, any crazy characters were escaped properly and it looked great on paper:

"organic","! 1/4\" inlet muffler","/cache.aspx?q=\"1+4\"+\"inlet+muffler\"&d=76309488474199&mkt=en-US&setlang=en-US&w=6d792683,38b10c05",1
"organic","!((^ toyota avalon mechanical lifter repairs","/auto-parts/toyota/engine-mechanical-valve-lifter",1
"organic","!(gl-508 rm) motorcraft","/auto-part/149373-motorcraft-gl508rm-remanufactured-alternator",1
"organic","!947 dodge power wagon","/auto-part/983439-mallory-ignition-wire-set-m11947",1
"organic","!967 ford mustang parts","/auto-parts/ford-mustang",1
"organic","!971 142 volvo specifications","/auto-parts/volvo-142/engine-electrical-distributor",1
"organic","!972 buick skylark heater core","/cars/1972-buick-skylark",1

I loaded up FasterCSV to start parsing the file and quickly figured out how naive I was about the CSV file formats.  MySQL’s default use of \ as an escape character isn’t “standard.”  It turns out there isn’t an official CSV standard but I did find something close: http://tools.ietf.org/html/rfc4180#section-2

Hmm so effectively none of the control characters should be escaped, but fields should be quoted if they have a delimeter in them, and a quote in the field should be replaced with “” to be a CSV Standard.

The other wrinkle is the standard calls for \r\n to be the line delimeter, but what happens if there are other \r\n in the columns?

First Step – Remove Escape Characters:

	FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"' ESCAPED BY ''

By adding the ESCAPED BY ”, none of the characters will be escaped in the output

Step Two – Replace \r\n with \n:

	REPLACE(`keyword`, "\r\n", "\n")

Step Three – Replace ” with “”

	REPLACE(REPLACE(`keyword`, "\r\n", "\n"), '"', '""')

Step Four – Replace NULL-Byte with a blank

	REPLACE(REPLACE(REPLACE(`keyword`, "\r\n", "\n"), '"', '""'), ""), char(0), "")

Step Five – Replace NULL field with a blank

	IFNULL(REPLACE(REPLACE(REPLACE(`keyword`, "\r\n", "\n"), '"', '""'), ""), char(0), ""), "")

Putting it all together:

SELECT
	IFNULL(REPLACE(REPLACE(REPLACE(`medium`, "\r\n", "\n"), '"', '""'), char(0), ""), "") as medium,
	IFNULL(REPLACE(REPLACE(REPLACE(`keyword`, "\r\n", "\n"), '"', '""'), char(0), ""), "") as keyword,
	IFNULL(REPLACE(REPLACE(REPLACE(`landingpage`, "\r\n", "\n"), '"', '""'), char(0), ""), "") as landingpage,
	SUM(IFNULL(`entrances`, 0)) as entrances
INTO OUTFILE 'keyword_data_new.csv'
	FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"' ESCAPED BY ''
	LINES TERMINATED BY '\r\n'
FROM `daily_medium_keyword_landingpage_visits`
WHERE medium = 'organic'
GROUP BY
	IFNULL(REPLACE(REPLACE(REPLACE(`medium`, "\r\n", "\n"), '"', '""'), char(0), ""), ""),
	IFNULL(REPLACE(REPLACE(REPLACE(`keyword`, "\r\n", "\n"), '"', '""'), char(0), ""), ""),
	IFNULL(REPLACE(REPLACE(REPLACE(`landingpage`, "\r\n", "\n"), '"', '""'), char(0), ""), "");

Ruby Script to Extract Google Analytics Keywords, Landing Page, Medium Entrances to CSV/TSV format

April 19th, 2010

I’ve had versions of this script hacked together before.  In fact, if you search for “Google Analytics Data Extractor” the #1 result is a google groups posting I made in a former life for a C# program I developed: http://groups.google.com/group/analytics-help-misc/browse_thread/thread/d2ad6ddf3d73e511

There’s now a couple of ruby gems out there for handling the Google Analytics API

One of them I’ve been doing some light-contributing to, including the pagination support (automatically grab all results in 10,000 result chunks)

I pushed a couple of changes up today at: http://github.com/er1c/gattica that hopefully will get merged into the main trunk soon (http://github.com/activenetwork/gattica)

One of the more useful uses of the library is to extract your Organic Search Results Keywords from Google Analytics. We’re using this at Frugal Mechanic to see how people search for Auto Parts.

require 'rubygems'
require 'gattica'
require 'fastercsv'

ga_profile = "" #Enter your Profile Here
start_date = Date.new(2009, 1, 1)
end_date = Date.new(2009, 12, 31)
file_path = "data" # Directory Needs to exist

puts "Google Username: "
u = gets.chomp
raise "bad username" unless !u.empty?
puts "Google Password: "
system "stty -echo"
pw = gets.chomp
system "stty echo"
raise "bad password" unless !pw.empty?

class ExtractKeywords
  def initialize(email,password)
    @gs = Gattica.new({:email => email, :password => password})
  end

  def get_accounts
    results = []
    @gs.accounts.each{|account|
      profile = {}
      profile[:site_title] = account.title
      profile[:profile_id] = account.profile_id ## this is the id required for requests to the API
      profile[:account_name] = account.account_name
      profile[:account_id] = account.account_id
      profile[:web_property_id] = account.web_property_id
      results << profile     }     return results   end   def connect_profile(profile)     @gs.profile_id = profile[:profile_id]   end   def connect_profile_id(profile_id)     @gs.profile_id = profile_id   end   def get_keywords(start_date = nil, end_date = Date.today)     results = []     csv_data = @gs.get({         :start_date => (start_date || (end_date - 365)).to_s,
        :end_date => end_date.to_s,
        :dimensions => ["medium", "keyword", "landingPagePath"],
        :metrics => "entrances",
        :sort => "-entrances",
        :page => true}).to_csv(:long)

    return FasterCSV.parse(csv_data, :headers => true)
  end

end

gs = ExtractKeywords.new(u, pw)
gs.connect_profile_id(ga_profile)

(start_date .. end_date).each { |date|
  file = "#{file_path}/medium_keyword_landingpage_visits_#{date.strftime('%Y-%m-%d')}.csv"
  next if File::exists?( file )

  FasterCSV.open(file, "w") do |csv|
    csv << ["medium", "keyword", "landingPagePath", "entrances"]
    keywords = gs.get_keywords(date, date)
    keywords.each{|row|
      csv << [row["medium"], row["keyword"], row["landingPagePath"], row["entrances"]]
    }

  end
}