Using functions on columns instead of on literals

Developers often tell me that DBAs are against the use of ORMs (like Hibernate) for no reason. Before becoming a DBA I was a developer, and I admit I like using ORMs because they make things easier, but you should always keep in mind what lies behind the ORM's classes and what queries the ORM executes.

In other words: know how the ORM works. Of course there can be some differences when running the application on SQL Server, PostgreSQL or Oracle, but usually not many.

The following example is a slow query in PostgreSQL that I found through pg_stat_statements, caused by the use of the upper function because PostgreSQL string comparisons are case sensitive. In SQL Server this specific problem could be avoided by creating the database with a case-insensitive collation, but other examples of functions applied to columns are easy to find.
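A quick illustration of this case sensitivity (a trivial example, not from the original post):

SELECT 'MyFile.txt' = 'MYFILE.TXT';          -- false
SELECT upper('MyFile.txt') = 'MYFILE.TXT';   -- true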

The query is against a table called BinaryFile, which stores the file-system path of each file in a column named ruta. The table has 181,438 rows, and there's a plain unique index (not a functional one) on ruta:

CREATE UNIQUE INDEX ix_binaryfile_ruta
ON binaryfile
USING btree (ruta);


The query generated by Hibernate:

select count(*) as y0_ from public.BinaryFile this_ 
where upper(this_.ruta) = 'C:\MYPATH\MYFILE.TXT' and this_.id<>432

The index is not used because of the UPPER function, so we have two solutions:

  • Create an index over the UPPER function (see the sketch after this list).
  • Use UPPER() on the literal side and keep the ruta column's values normalized (all in upper case).
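For reference, option 1 would be an expression index like this (a sketch; the index name is made up):

CREATE INDEX ix_binaryfile_upper_ruta
ON binaryfile
USING btree (upper(ruta));

With such an index in place, the original query filtering on upper(this_.ruta) could use it directly.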

I prefer the second option because it's simpler, and the improvement is the same either way. In Windows, paths are case insensitive, so we can do:

UPDATE BinaryFile SET ruta = upper(ruta)

The query then becomes:

select count(*) as y0_ from public.BinaryFile this_ 
where this_.ruta = upper('C:\MyPath\MyFile.TxT') and this_.id<>432

Now, the query plan:

[Screenshots: the new query plan, now using the index, and its execution time]

Now the query uses the index and the performance improvement is impressive: 7000 times faster.

I know this is a simple example, but I think it's the best way to explain the point.

SQL Server vs PostgreSQL: fighting for resources on the same server

Introduction

Notice: this is not the typical performance comparison between SQL Server and PostgreSQL. I only test how their performance is affected when both run on the same server, something I would never do in practice.

We were asked: “Is it possible to run SQL Server and PostgreSQL on the same server?”

Obviously, the server would run under Windows. The first thought was: no.

We said that because we would expect problems with I/O and memory pressure, due to the heavy disk and memory use that is typical of database servers. We were asked to prove it.

Analysis

We decided to use Dell's DVD Store benchmark, which simulates a DVD store with a simple OLTP database. It can be run against several database systems, SQL Server and PostgreSQL among them.

We wanted to see how they behave when working together, compared to working alone. It's not a SQL Server vs. PostgreSQL benchmark, because we didn't want to go deep into how DVD Store is built, database driver performance and so on.

We created a 10 GB database in each RDBMS.

Hardware

A brief description of the server:

  • Intel Xeon E3-1220 3.1Ghz
  • Memory: 8 GB
  • RAID controller: LSI Adapter SAS2 2004 Spitfire with 2 SATA disks in RAID 1.

Software

  • Microsoft Windows Server 2008R2 Standard SP1 64-bits
  • SQL Server 2008 Workstation Edition
  • PostgreSQL 9.2.4

Preparation

After installing each RDBMS, it's time to install DVD Store:

For SQL Server, I used this article, which is very comprehensive.

For PostgreSQL on Windows I couldn't find anything similar, so I did it on my own and ran into these issues:

  • The script that creates the database and imports the data, pgsqlds2_create_all.sh, is a Linux bash script, so I ported it to Windows as pgsqlds2_create_all.bat.
  • Executing ds2pgsqldriver.exe, I got a runtime error.

To solve the runtime error in ds2pgsqldriver.exe I modified the source code: in the file c:\ds2\pgsqlds2\ds2pgsqlfns.cs, at line 209, I changed this block:

      int i_row = 0;
      if ((customerid_out > 0) && Rdr.NextResult())
        {
        while (Rdr.Read())
          {
          title_out[i_row] = Rdr.GetString(0);
          actor_out[i_row] = Rdr.GetString(1);
          related_title_out[i_row] = Rdr.GetString(2);
          ++i_row;
          }
        }
 

For this one:

      int i_row = 0;
      try
      {
           if ((customerid_out > 0) && Rdr.NextResult())
           {
                // Bounds check added (i_row < title_out.Length): the original code
                // could overflow the output arrays, which caused the runtime error.
                while (Rdr.Read() && i_row < title_out.Length)
                {
                     title_out[i_row] = Rdr.GetString(0);
                     actor_out[i_row] = Rdr.GetString(1);
                     related_title_out[i_row] = Rdr.GetString(2);
                     ++i_row;
                }
           }
           Rdr.Close();
           t.Commit();
           rows_returned = i_row;
      }
      catch (Exception e)
      {
           // On error, clean up anyway and log how far the read got before failing.
           Rdr.Close();
           t.Commit();
           rows_returned = i_row;

           Console.WriteLine("Length of arrays: {0,3}, {1,3}, {2,3}", title_out.Length, actor_out.Length, related_title_out.Length);
           Console.WriteLine("i_row value {0}", i_row);
           Console.WriteLine("Thread {0}: Error inside login: {1}", Thread.CurrentThread.Name, e.Message);
           return (false);
      }
  

And I compiled it:

csc.exe /out:ds2pgsqldriver.exe ds2pgsqlfns.cs c:\ds2\drivers\ds2xdriver.cs /d:WIN_32_TIMER /d:GEN_PERF_CTRS "/r:c:\ds2\drivers\Npgsql.dll"

Execution

We executed the benchmarks with the following parameters:

c:\ds2\pgsqlds2\ds2pgsqldriver.exe --target=127.0.0.1 --run_time=20 --db_size=10GB --n_threads=40 --ramp_rate=10 --pct_newcustomers=0 --warmup_time=0 --think_time=0 > c:\Temp\testresults.txt 2>&1
c:\ds2\sqlserverds2\ds2sqlserverdriver.exe --target=localhost --run_time=20 --db_size=10GB --n_threads=40 --ramp_rate=10 --pct_newcustomers=0 --warmup_time=0 --think_time=0 > c:\Temp\testresults_sqlserver.txt 2>&1

The benchmark result is at the end of the output file:

n_purchase_from_start= 206466 n_rollbacks_from_start= 4110

Results

Case                              SQL Server   PostgreSQL     Total   Performance decrease
Only SQL Server                      268,245                268,245
Only PostgreSQL                                    97,668    97,668   -63%
Both: SQL Server and PostgreSQL      206,466       17,378   223,844   -16%
Performance decrease                    -23%         -82%

The numbers are the benchmark results (n_purchase_from_start); higher is better. The percentage column compares each total against SQL Server running alone, and the last row compares each engine's shared-server result against its standalone result.

When running alone, SQL Server performs much better than PostgreSQL, but as I said before, comparing them is not our goal, and the gap can have many causes, for example a better DVD Store implementation or better drivers. Still, I'm impressed by the size of the difference. Testing PostgreSQL on Linux is beyond the scope of this analysis, but I'm sure it would improve a lot.

Conclusion

When working together, SQL Server is the winner: its result decreases only 23%, while PostgreSQL's decreases 82%.

Here we’ve to notice that SQL Server is a native Windows application, while PostgreSQL is multi platform and probably this compatibility across multiple OS is paid with the efficiency in Windows. So, installing SQL Server and PostgreSQL in the same server, when there’s some workload, the PostgreSQL have important performance degradation.

Learning in the Internet era

In the last post of the year I'll talk about knowledge and how we are continuously learning. The idea for the topic comes from a Packt Publishing offer: selling e-books for $5 for Christmas.

No matter what our job is, it's changing, so we have to evolve too. In fact, I don't know of any job where you don't have to improve your skills continuously, usually because of technological change. That's why I've always believed we have to keep learning throughout our lives. I also like to learn things that have nothing to do with my career.

But the way we learn has changed a lot. Before the Internet, people had to go to libraries and get books. With the Internet, a huge amount of information became available: first in personal blogs, later in online communities like SQLServerCentral or StackOverflow. I remember the first time I found SQLServerCentral; I was amazed by the high quality of its articles and comments. I learned a lot.

Last year a workmate showed me Coursera, and later I discovered other MOOCs (Massive Open Online Courses) like Khan Academy, Codecademy, edX and Udacity. That's a very big step. I took some really great courses on Coursera.

In general the courses are free, but at Udacity you can find some courses you have to pay for.
This is a very important point: if a course is free, what is really valuable is your time. If you're learning less than you expected, you can just stop. If you've paid, you'll probably keep going even if it's not as good as you'd like. Once I paid for a course and couldn't finish it: I was losing my time, but I had already lost my money. Probably the future is making the first two or three weeks of a course free, so you can decide whether it's worth paying for.

And now comes Packt Publishing with this offer: any e-book for $5. A few years ago I wouldn't have believed we'd see books at $5. Of course it's an e-book, not a physical book, but it's knowledge for very little money compared to a few years ago. Maybe other publishers will join in with this kind of offer: sell a lot at a low price, or sell much less at a higher price.

Anyway, knowledge and learning are very important nowadays. There are no barriers anymore: everybody can learn.

SETVAL for all sequences in a schema

In PostgreSQL, when you're working with sequences, if you manually insert a value the sequence hasn't generated yet, you will get an error when the sequence eventually reaches that value. I much prefer how SQL Server handles autoincrement columns with its IDENTITY property, which is roughly a sequence linked to a table, like SERIAL, but much more restrictive: by default you cannot INSERT a row specifying a value for that column, as you can in PostgreSQL.
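For comparison, inserting an explicit value into a SQL Server IDENTITY column only works if you enable it first (a sketch with a hypothetical table):

SET IDENTITY_INSERT dbo.mytable ON;
INSERT INTO dbo.mytable (id, info) VALUES (7, 'Third Value - Jumping');
SET IDENTITY_INSERT dbo.mytable OFF;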

The PostgreSQL setval() function, explained in Sequence Manipulation Functions, is PostgreSQL's way of changing the current value of a sequence, but it only accepts one sequence at a time.
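For a single sequence, a call looks like this (table and sequence names are hypothetical):

SELECT setval('public.mytable_id_seq', (SELECT MAX(id) FROM public.mytable));

So, if you need to set all the sequences in a schema to the max(id) of each table, you can use the following script, based on Updating sequence values from table select.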

CREATE OR REPLACE FUNCTION setval_schema(schema_name name, raise_notice boolean = false)
    RETURNS VOID AS
-- Sets all the sequences in the schema "schema_name" to the max(id) of every table
$BODY$

DECLARE
    row_data RECORD;
    sql_code TEXT;

BEGIN
    IF ((SELECT COUNT(*) FROM pg_namespace WHERE nspname = schema_name) = 0) THEN
        RAISE EXCEPTION 'The schema "%" does not exist', schema_name;
    END IF;

    FOR sql_code IN
        SELECT 'SELECT SETVAL(' ||quote_literal(N.nspname || '.' || S.relname)|| ', MAX(' ||quote_ident(C.attname)|| ') ) FROM ' || quote_ident(N.nspname) || '.' || quote_ident(T.relname)|| ';' AS sql_code
            FROM pg_class AS S
            INNER JOIN pg_depend AS D ON S.oid = D.objid
            INNER JOIN pg_class AS T ON D.refobjid = T.oid
            INNER JOIN pg_attribute AS C ON D.refobjid = C.attrelid AND D.refobjsubid = C.attnum
            INNER JOIN pg_namespace N ON N.oid = S.relnamespace
            WHERE S.relkind = 'S' AND N.nspname = schema_name
            ORDER BY S.relname
    LOOP
        IF (raise_notice) THEN
            RAISE NOTICE 'sql_code: %', sql_code;
        END IF;
        EXECUTE sql_code;
    END LOOP;
END;
$BODY$
LANGUAGE 'plpgsql' VOLATILE;

Examples

Example: run setval for all sequences in a schema:

SELECT setval_schema('public');

Example: run setval for all sequences in all user schemas in the database, showing the statements as they are executed:

SELECT setval_schema(nspname, true)
FROM pg_namespace
WHERE nspname !~ '^pg_.*' AND nspname <> 'information_schema';

Full example

In this example, a table is created and some rows are inserted. The third insert, 'Third Value - Jumping', is forced to id = 7 instead of using the sequence. But the sequence hasn't been modified, so when it eventually reaches 7 the insert will fail because the value already exists.

CREATE TABLE test_setval
(
    id serial NOT NULL,
    info text NOT NULL,
    CONSTRAINT test_setval_pkey PRIMARY KEY (id)
)
WITH (
    OIDS=FALSE
);

INSERT INTO test_setval (info) VALUES ('First Value');
INSERT INTO test_setval (info) VALUES ('Second Value');
INSERT INTO test_setval (id, info) VALUES (7,'Third Value - Jumping');
INSERT INTO test_setval (info) VALUES ('Fourth Value');

-- 'Fourth Value' is inserted with id = 3
SELECT * FROM test_setval;


SELECT setval_schema('public', true);

INSERT INTO test_setval (info) VALUES ('Fifth Value - after setval');

-- You can see how the data has been inserted in sequential order:
SELECT * FROM test_setval;


-- or sorting it:
SELECT * FROM test_setval ORDER BY id;


You can download all the code from here.

Review of Getting Started with OrientDB book

Disclaimer: I was asked to review the Getting Started with OrientDB book and I received a free ebook copy of it.

OrientDB is defined on its website as "an Open Source GraphDB with a mix of features taken from Document Databases and Object Orientation". It's a new and promising database, but it has little documentation, spread out over multiple places, and it's quite chaotic to find what you're looking for. Maybe I'm just used to the amazing PostgreSQL documentation!

Getting Started with OrientDB was written by Claudio Tesoriero and published in August 2013. It's the only OrientDB book out there; at least, it's the only one listed on the OrientDB website and the only one I have found.

My very first surprise is that it has 138 pages, below average for a technical book. So, as the title says, it's a getting-started guide, an introduction to OrientDB. That is probably the main reason why I would often have liked more information about a topic.

The book gives a broad, but not deep, introduction to OrientDB: installation, architecture, administration, querying, performance tuning and advanced features like clustering.

I would have changed the order of the book: it discusses administration first and programming later, while I would have done it the other way around, because you should already be familiar with classes, properties and querying data before starting on the administration part.

I also found some information in the wrong place, for example:

  • Classes and abstract classes are explained between OrientDB's data files and security roles, when the best place would have been the programming part.
  • The OrientDB console and OrientDB Studio (the web console) are explained in the same chapter, but with other material in between instead of being together.
  • How to install OrientDB as a daemon/service should be in the advanced features chapter and not in the first one.

OrientDB itself is confusing when it talks about its editions (Graph and Standard) and its database types (graph and document); you have to read for a while before understanding the difference, and the book could have helped more here, for example by explaining the differences between the graph and the document databases in depth.

On the other hand, I liked the Programming chapter, where the author explains the document database with good, complete examples; working through them is enough to feel confident with the document database. However, the graph database gets just 4 pages. In this chapter you also get an introduction to configuring JDBC, using the Java API and using the RESTful APIs.

The Performance Tuning and Advanced Features chapters, where clustering is explained, are also interesting, although at times I would have enjoyed more detail, for example when the author explains how OrientDB uses NIO to map data files in memory, or the network.socketBufferSize property.

Conclusion

As I said, there are two things that would have made it a much better book:

  • The order of the chapters should have been different and some information rearranged.
  • More information on some points, though 138 pages are not enough to explain OrientDB in depth.

There's always the question of whether the book is worth buying. Despite these two big complaints, I would still have bought it, because:

  • The book explains a lot of things that are not available in OrientDB's documentation.
  • It would have saved me a lot of time compared to digging through OrientDB's sparse and badly organized documentation.

Hidden cartesian product

Introduction

A project in my company has a dedicated PostgreSQL virtual server with 12 cores and 32 GB of RAM, used exclusively by one developer. He told me he was getting an error with a query, and checking the log file I found: “ERROR: could not write block XXXX of temporary file: No space left”. But there were 27 GB free.
I asked him to run the query again so I could see what was going on.

Analysis

The PostgreSQL directory /data/pg_stat_tmp started growing until it filled up the 27 GB, and PostgreSQL gave the out-of-space error again. Amazing.
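If you're on PostgreSQL 9.2 or later, temporary file usage can also be tracked per database, which helps catch this kind of query:

SELECT datname, temp_files, temp_bytes
FROM pg_stat_database;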

The query had a typical INNER JOIN between a Products table and an Orders table, with an aggregation on the Orders table. I was really surprised by this behaviour. Here is the size (MB and rows) of each table:

  • public.Orders: 314 MB and 4,242,561 rows.
  • public.Products: 189 MB and 1,108,514 rows.

The query plan was quite suspicious, because I wasn't expecting a Nested Loop.

I changed the query to do the aggregation in a Common Table Expression and joined against its result. It was just a quick test to see if I could find any clue about what was going on. It worked: 66 seconds. The query plan had the expected MERGE JOIN, but it was doing a TABLE SCAN instead of using the indexes!
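The rewrite looked roughly like this (a sketch with hypothetical column names, not the project's actual query):

WITH order_totals AS (
    SELECT product_id, SUM(amount) AS total_amount
    FROM public.Orders
    GROUP BY product_id
)
SELECT p.id, p.name, ot.total_amount
FROM public.Products AS p
INNER JOIN order_totals AS ot ON ot.product_id = p.id;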

But 66 seconds was still far too much. Diving into the problem, I found that both tables contained '1' (the number one) instead of NULL values, so we had a lot of '1's in the join columns! To be exact:

  • public.Orders: 318,884 rows, 7.5% of the total.
  • public.Products: 184,324 rows, 16.6% of the total.

So we have: 318,884 * 184,324 = 58,777,974,416 rows!

Yes, it's like a cartesian product!
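A quick way to spot this kind of join-key skew is to count the most repeated values in the join column (the column name here is hypothetical):

SELECT product_id, COUNT(*) AS repetitions
FROM public.Orders
GROUP BY product_id
ORDER BY COUNT(*) DESC
LIMIT 10;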

Solutions

Of course, the obvious one in this case was to change the ones to NULLs.

This is a development environment and far from critical. But in production, if users can run queries that are not controlled by an application, we can set statement_timeout. It can be set globally in postgresql.conf or on a per-connection basis.
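For example, to abort any statement running longer than one minute in the current session, or for a given role (the role name is made up):

SET statement_timeout = '60s';
ALTER ROLE reporting_user SET statement_timeout = '60s';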

Get-Hash, a PowerShell hash function

I needed a hash function to which I could pass a text string or a file to compute its hash (MD5, SHA1, etc.). Unfortunately PowerShell doesn't have one, so I started looking for alternatives:

  • There's a Get-Hash cmdlet (a function embedded in a DLL written in C#) in the PowerShell Community Extensions (PSCX) module.
  • I found two small command-line tools.
  • There's an interesting post called Using Powershell for MD5 Checksums on Brian Hartsock's blog that explains how to compute hashes directly in PowerShell using .NET Framework classes.

None of these options exactly matched my needs:

  • PSCX is an amazing module, but I didn't want to install it on production servers just to use one cmdlet.
  • The two small tools do their job flawlessly, although I prefer to have as few external applications as possible; I would also have to parse their output, and I prefer to keep everything in PowerShell.
  • Brian Hartsock's post is a good starting point. It only misses the ability to compute file hashes.

So, based on Brian Hartsock's post, I implemented a function that gets the hash value of a file or a text string. It requires PowerShell 2.0 (because I use parameter definitions and try…catch in it). It uses the HashAlgorithm class and the FileStream class, both supported since .NET Framework 1.0.

The Get-Hash function

The following code can also be downloaded from here.

<#
.SYNOPSIS
Gets the hash value of a file or string

.DESCRIPTION
Gets the hash value of a file or string
It uses System.Security.Cryptography.HashAlgorithm (http://msdn.microsoft.com/en-us/library/system.security.cryptography.hashalgorithm.aspx)
and FileStream Class (http://msdn.microsoft.com/en-us/library/system.io.filestream.aspx)
Based on: http://blog.brianhartsock.com/2008/12/13/using-powershell-for-md5-checksums/ and some ideas on Microsoft Online Help

Be aware, to avoid confusion, that if you use the pipeline, the behaviour is the same as using -Text, not -File

.PARAMETER File
File to get the hash from.

.PARAMETER Text
Text string to get the hash from

.PARAMETER Algorithm
Type of hash algorithm to use. Default is SHA1

.EXAMPLE
C:\PS> Get-Hash "hello_world.txt"
Gets the SHA1 hash of the hello_world.txt file. When no parameter name is given, -File is used

.EXAMPLE
Get-Hash -File "C:\temp\hello_world.txt"
Gets the SHA1 hash of the C:\temp\hello_world.txt file

.EXAMPLE
C:\PS> Get-Hash -Algorithm "MD5" -Text "Hello Wold!"
Gets the MD5 hash of a string

.EXAMPLE
C:\PS> "Hello Wold!" | Get-Hash
We can pass a string through the pipeline

.EXAMPLE
Get-Content "c:\temp\hello_world.txt" | Get-Hash
It gets the string from Get-Content

.EXAMPLE
Get-ChildItem "C:\temp\*.txt" | %{ Write-Output "File: $($_)   has this hash: $(Get-Hash $_)" }
A more complex example that gets the hash of all "*.txt" files

.NOTES
DBA daily stuff (http://dbadailystuff.com) by Josep Martínez Vilà
Licensed under a Creative Commons Attribution 3.0 Unported License

.LINK
Original post: http://dbadailystuff.com/2013/03/11/get-hash-a-powershell-hash-function/
#>
function Get-Hash
{
     Param
     (
          [parameter(Mandatory=$true, ValueFromPipeline=$true, ParameterSetName="set1")]
          [String]
          $text,
          [parameter(Position=0, Mandatory=$true, ValueFromPipeline=$false, ParameterSetName="set2")]
          [String]
          $file = "",
          [parameter(Mandatory=$false, ValueFromPipeline=$false)]
          [ValidateSet("MD5", "SHA", "SHA1", "SHA-256", "SHA-384", "SHA-512")]
          [String]
          $algorithm = "SHA1"
     )
     Begin
     {
          $hashAlgorithm = [System.Security.Cryptography.HashAlgorithm]::Create($algorithm)
     }
     Process
     {
          $md5StringBuilder = New-Object System.Text.StringBuilder 50
          $ue = New-Object System.Text.UTF8Encoding

          if ($file){
               try {
                    if (!(Test-Path -literalpath $file)){
                         throw "Test-Path returned false."
                    }
               }
               catch {
                    throw "Get-Hash - File not found or without permisions: [$file]. $_"
               }
               try {
                    [System.IO.FileStream]$fileStream = [System.IO.File]::Open($file, [System.IO.FileMode]::Open);
                    $hashAlgorithm.ComputeHash($fileStream) | % { [void] $md5StringBuilder.Append($_.ToString("x2")) }
               }
               catch {
                    throw "Get-Hash - Error reading or hashing the file: [$file]"
               }
               finally {
                    $fileStream.Close()
                    $fileStream.Dispose()
               }
          }
          else {
               $hashAlgorithm.ComputeHash($ue.GetBytes($text)) | % { [void] $md5StringBuilder.Append($_.ToString("x2")) }
          }

          return $md5StringBuilder.ToString()
     }
}

Some examples of how to call it

Here are some examples of how to use Get-Hash. Note that node.exe is a binary file.

Write-Output "`nSome examples how to call it:"
Get-Hash "c:\temp\myScriptFile.sql"
Get-Hash "c:\temp\br[a{ets.txt"
Get-Hash "c:\temp\node.exe"
Get-Hash -Algorithm "MD5" -Text "A MD5 checksum!"

Write-Output "`nFour hello world examples that return the same hash value:"
Get-Hash "c:\temp\hello_world.txt"
"Hello Wold!" | Get-Hash
Get-Hash -Text "Hello Wold!"
Get-Content "c:\temp\hello_world.txt" | Get-Hash
Get-ChildItem "C:\temp\*world*.txt" | %{ Write-Output "File: $($_)   has this hash: $(Get-Hash $_)" }

