Document (Release 1.2)

  1. Overview
  2. Comparison with pg_trgm
  3. Tested platforms
  4. Install
  5. Uninstall
  6. Full text search
  7. Functions
  8. Parameters
  9. Limitations
  10. Release Notes

Overview

The pg_bigm module provides full text search capability in PostgreSQL. This module allows a user to create 2-gram (bigram) index for faster full text search.

The pg_bigm project provides the following one module.

ModuleDescriptionSource Archive File Name
pg_bigm Module that provides full text search capability in PostgreSQL pg_bigm-x.y-YYYYMMDD.tar.gz

The x.y and YYYYMMDD parts of the source archive file name are replaced with its release version number and date, respectively. For example, x.y is 1.1 and YYYYMMDD is 20131122 if the file of the version 1.1 was released on November 22, 2013.

The license of pg_bigm is The PostgreSQL License (same as BSD license).

Comparison with pg_trgm

The pg_trgm contrib module which provides full text search capability using 3-gram (trigram) model is included in PostgreSQL. The pg_bigm was developed based on the pg_trgm. They have the following differences:

Functionalities and Featurespg_trgmpg_bigm
Phrase matching method for full text search 3-gram 2-gram
Available index GIN and GiST GIN only
Available text search operators LIKE (~~), ILIKE (~~*), ~, ~* LIKE only
Full text search for non-alphabetic language
(e.g., Japanese)
Not supported (*1) Supported
Full text search with 1-2 characters keyword Slow (*2) Fast
Similarity search Supported Supported (version 1.1 or later)
Maximum indexed column size 238,609,291 Bytes (~228MB) 107,374,180 Bytes (~102MB)

pg_bigm 1.1 or later can coexist with pg_trgm in the same database, but pg_bigm 1.0 cannot.

Tested platforms

pg_bigm has been built and tested on the following platforms:

CategoryModule Name
OS Linux, Mac OS X
DBMS PostgreSQL 9.1, 9.2, 9.3, 9.4, 9.5, 9.6, 10

pg_bigm requires PostgreSQL 9.1 or later.

Install

Install PostgreSQL

From the official site of PostgreSQL download the source archive file "postgresql-X.Y.Z.tar.gz (please replace X.Y.Z with actual version number)" of PostgreSQL, and then build and install it.

$ tar zxf postgresql-X.Y.Z.tar.gz
$ cd postgresql-X.Y.Z
$ ./configure --prefix=/opt/pgsql-X.Y.Z
$ make
$ su
# make install
# exit

If PostgreSQL is installed from RPM, the postgresql-devel package must be installed to build pg_bigm.

Install pg_bigm

Download the source archive file of pg_bigm from here, and then build and install it.

$ tar zxf pg_bigm-x.y-YYYYMMDD.tar.gz
$ cd pg_bigm-x.y-YYYYMMDD
$ make USE_PGXS=1 PG_CONFIG=/opt/pgsql-X.Y.Z/bin/pg_config
$ su
# make USE_PGXS=1 PG_CONFIG=/opt/pgsql-X.Y.Z/bin/pg_config install
# exit

Load pg_bigm

Create the database cluster, modify postgresql.conf, start PostgreSQL server and then load pg_bigm into the database.

$ initdb -D $PGDATA --locale=C --encoding=UTF8

$ vi $PGDATA/postgresql.conf
shared_preload_libraries = 'pg_bigm'

$ pg_ctl -D $PGDATA start
$ psql -d <database name>
=# CREATE EXTENSION pg_bigm;
=# \dx pg_bigm
                    List of installed extensions
  Name   | Version | Schema |              Description
---------+---------+--------+---------------------------------------
 pg_bigm | 1.1     | public | text index searching based on bigrams
(1 row)

Uninstall

Delete pg_bigm

Unload pg_bigm from the database and then uninstall it.

$ psql -d <database name>
=# DROP EXTENSION pg_bigm CASCADE;
=# \q

$ pg_ctl -D $PGDATA stop
$ su

# cd <pg_bigm source directory>
# make USE_PGXS=1 PG_CONFIG=/opt/pgsql-X.Y.Z/bin/pg_config uninstall
# exit

Reset postgresql.conf

Delete the following pg_bigm related settings from postgresql.conf.

Create Index

You can create an index for full text search by using GIN index.

The following example creates the table pg_tools which stores the name and description of PostgreSQL related tool, inserts four records into the table, and then creates the full text search index on the description column.

=# CREATE TABLE pg_tools (tool text, description text);

=# INSERT INTO pg_tools VALUES ('pg_hint_plan', 'Tool that allows a user to specify an optimizer HINT to PostgreSQL');
=# INSERT INTO pg_tools VALUES ('pg_dbms_stats', 'Tool that allows a user to stabilize planner statistics in PostgreSQL');
=# INSERT INTO pg_tools VALUES ('pg_bigm', 'Tool that provides 2-gram full text search capability in PostgreSQL');
=# INSERT INTO pg_tools VALUES ('pg_trgm', 'Tool that provides 3-gram full text search capability in PostgreSQL');

=# CREATE INDEX pg_tools_idx ON pg_tools USING gin (description gin_bigm_ops);

You can also create multicolumn pg_bigm index and specify GIN related parameters then, as follows.

=# CREATE INDEX pg_tools_multi_idx ON pg_tools USING gin (tool gin_bigm_ops, description gin_bigm_ops) WITH (FASTUPDATE = off);

You can execute full text search by using LIKE pattern matching.

=# SELECT * FROM pg_tools WHERE description LIKE '%search%';
  tool   |                             description                             
---------+---------------------------------------------------------------------
 pg_bigm | Tool that provides 2-gram full text search capability in PostgreSQL
 pg_trgm | Tool that provides 3-gram full text search capability in PostgreSQL
(2 rows)

You can execute similarity search by using =% operator.

The following query returns all values in the tool column that are sufficiently similar to the word 'bigm'. This similarity search is basically fast because it can use the full text search index. It measures whether two strings are sufficiently similar to by seeing whether their similarity is higher than or equal to the value of pg_bigm.similarity_limit. This means, in this query, that the values whose similarity with the word 'bigm' is higher than or equal to 0.2 are only 'pg_bigm' and 'pg_trgm' in the tool column.

=# SET pg_bigm.similarity_limit TO 0.2;

=# SELECT tool FROM pg_tools WHERE tool =% 'bigm';
  tool   
---------
 pg_bigm
 pg_trgm
(2 rows)

Please see bigm_similarity function for details of how to calculate the similarity.

Functions

likequery

likequery is a function that converts the search keyword (argument #1) into the pattern string that LIKE operator can handle properly.

If the argument #1 is NULL, the return value is also NULL.

This function does the conversion as follows:

In pg_bigm, full text search is performed by using LIKE pattern matching. Therefore, the search keyword needs to be converted into the pattern string that LIKE operator can handle properly. Usually a client application should be responsible for this conversion. But, you can save the effort of implementing such a conversion logic in the application by using likequery function.

=# SELECT likequery('pg_bigm has improved the full text search performance by 200%');
                             likequery                             
-------------------------------------------------------------------
 %pg\_bigm has improved the full text search performance by 200\%%
(1 row)

Using likequery, you can rewrite the full text search query which was used in the example in "Execute full text search" into:

=# SELECT * FROM pg_tools WHERE description LIKE likequery('search');
  tool   |                             description                             
---------+---------------------------------------------------------------------
 pg_bigm | Tool that provides 2-gram full text search capability in PostgreSQL
 pg_trgm | Tool that provides 3-gram full text search capability in PostgreSQL
(2 rows)

show_bigm

show_bigm returns an array of all the 2-grams in the given string (argument #1).

A 2-gram that show_bigm returns is a group of two consecutive characters taken from a string that blank character has been appended into the beginning and the end. For example, the 2-grams of the string "ABC" are "(blank)A" "AB" "BC" "C(blank)".

=# SELECT show_bigm('full text search');
                            show_bigm                             
------------------------------------------------------------------
 {" f"," s"," t",ar,ch,ea,ex,fu,"h ","l ",ll,rc,se,"t ",te,ul,xt}
(1 row)

bigm_similarity

bigm_similarity returns a number that indicates how similar the two strings (argument #1 and #2) are.

This function measures the similarity of two strings by counting the number of 2-grams they share. The range of the similarity is zero (indicating that the two strings are completely dissimilar) to one (indicating that the two strings are identical).

=# SELECT bigm_similarity('full text search', 'text similarity search');
 bigm_similarity 
-----------------
        0.571429
(1 row)

Note that each argument is considered to have one space prefixed and suffixed when determining the set of 2-grams contained in the string for calculation of similarity. For example, though the string "ABC" contains the string "B", their similarity is 0 because there are no 2-grams they share as follows. On the other hand, the string "ABC" and "A" share one 2-gram "(blank)A" as follows, so their similarity is higher than 0. This is basically the same behavior as pg_trgm's similarity function.

=# SELECT bigm_similarity('ABC', 'A');
 bigm_similarity 
-----------------
            0.25
(1 row)

=# SELECT bigm_similarity('ABC', 'B');
 bigm_similarity 
-----------------
               0
(1 row)

Note that bigm_similarity is NOT case-sensitive, but pg_trgm's similarity function is case-sensitive. For example, the similarity of the strings "ABC" and "abc" is 1 in pg_trgm's similarity function but 0 in bigm_similarity.

=# SELECT similarity('ABC', 'abc');
 similarity 
------------
          1
(1 row)

=# SELECT bigm_similarity('ABC', 'abc');
 bigm_similarity 
-----------------
               0
(1 row)

pg_gin_pending_stats

pg_gin_pending_stats is a function that returns the number of pages and tuples in the pending list of GIN index (argument #1).

Note that the return value #1 and #2 are 0 if the argument #1 is the GIN index built with FASTUPDATE option disabled because it doesn't have a pending list. Please see GIN Fast Update Technique for details of the pending list and FASTUPDATE option.

=# SELECT * FROM pg_gin_pending_stats('pg_tools_idx');
 pages | tuples
-------+--------
     1 |      4
(1 row)

Parameters

pg_bigm.last_update

pg_bigm.last_update is a parameter that reports the last updated date of the pg_bigm module. This parameter is read-only. You cannot change the value of this parameter at all.

=# SHOW pg_bigm.last_update;
 pg_bigm.last_update
---------------------
 2013.11.22
(1 row)

pg_bigm.enable_recheck

pg_bigm.enable_recheck is a parameter that specifies whether to perform Recheck which is an internal process of full text search. The default value is on, i.e., Recheck is performed. Not only superuser but also any user can change this parameter value in postgresql.conf or by using SET command. This parameter must be enabled if you want to obtain the correct search result.

PostgreSQL and pg_bigm internally perform the following processes to get the search results:

The latter process is called Recheck. The result candidates retrieved from full text search index may contain wrong results. Recheck process gets rid of such wrong results.

For example, imagine the case where two character strings "He is awaiting trial" and "It was a trivial mistake" are stored in a table. The correct search result with the keyword "trial" is "He is awaiting trial". However, "It was a trivial mistake" is also retrieved as the result candidate from the full text search index because it contains all the 2-grams ("al", "ia", "ri", "tr") of the search keyword "trial". Recheck process tests whether each candidate contains the search keyword itself, and then chooses only the correct results.

How Recheck narrows down the search result can be observed in the result of EXPLAIN ANALYZE.

=# CREATE TABLE tbl (doc text);
=# INSERT INTO tbl VALUES('He is awaiting trial');
=# INSERT INTO tbl VALUES('It was a trivial mistake');
=# CREATE INDEX tbl_idx ON tbl USING gin (doc gin_bigm_ops);
=# SET enable_seqscan TO off;
=# EXPLAIN ANALYZE SELECT * FROM tbl WHERE doc LIKE likequery('trial');
                                                   QUERY PLAN                                                    
-----------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on tbl  (cost=12.00..16.01 rows=1 width=32) (actual time=0.041..0.044 rows=1 loops=1)
   Recheck Cond: (doc ~~ '%trial%'::text)
   Rows Removed by Index Recheck: 1
   ->  Bitmap Index Scan on tbl_idx  (cost=0.00..12.00 rows=1 width=0) (actual time=0.028..0.028 rows=2 loops=1)
         Index Cond: (doc ~~ '%trial%'::text)
 Total runtime: 0.113 ms
(6 rows)

In this example, you can see that Bitmap Index Scan retrieved two rows from the full text search index but Bitmap Heap Scan returned only one row after Recheck process.

It is possible to skip Recheck process and get the result candidates retrieved from the full text search index as the final results, by disabling this parameter. In the following example, wrong result "It was a trivial mistake" is also returned because the parameter is disabled.

=# SELECT * FROM tbl WHERE doc LIKE likequery('trial');
         doc          
----------------------
 He is awaiting trial
(1 row)

=# SET pg_bigm.enable_recheck = off;
=# SELECT * FROM tbl WHERE doc LIKE likequery('trial');
           doc            
--------------------------
 He is awaiting trial
 It was a trivial mistake
(2 rows)

This parameter must be enabled if you want to obtain the correct search result. On the other hand, you may need to set it to off, for example, for evaluation of Recheck performance overhead or debugging, etc.

pg_bigm.gin_key_limit

pg_bigm.gin_key_limit is a parameter that specifies the maximum number of 2-grams of the search keyword to be used for full text search. If it's set to zero (default), all the 2-grams of the search keyword are used for full text search. Not only superuser but also any user can change this parameter value in postgresql.conf or by using SET command.

PostgreSQL and pg_bigm basically use all the 2-grams of search keyword to scan GIN index. However, in current implementation of GIN index, the more 2-grams are used, the more performance overhead of GIN index scan is increased. In the system that large search keyword is often used, full text search is likely to be slow. This performance issue can be solved by using this parameter and limiting the maximum number of 2-grams to be used.

On the other hand, the less 2-grams are used, the more wrong results are included in the result candidates retrieved from full text search index. Please note that this can increase the workload of Recheck and decrease the performance.

pg_bigm.similarity_limit

pg_bigm.similarity_limit is a parameter that specifies the threshold used by the similarity search. The similarity search returns all the rows whose similarity with the search keyword is higher than or equal to this threshold. Value must be between 0 and 1 (default is 0.3). Not only superuser but also any user can change this parameter value in postgresql.conf or by using SET command.

Limitations

Indexed Column Size

The size of the column indexed by bigm GIN index cannot exceed 107,374,180 Bytes (~102MB). Any attempt to enter larger values will result in an error.

=# CREATE TABLE t1 (description text);
=# CREATE INDEX t1_idx ON t1 USING gin (description gin_bigm_ops);
=# INSERT INTO t1 SELECT repeat('A', 107374181);
ERROR:  out of memory

pg_trgm also has this limitation. However, the maximum size in the case of trgm indexed column is 238,609,291 Bytes (~228MB).

Release Notes


Copyright (c) 2012-2016, NTT DATA Corporation