Gemini AI logo


Archiving Gemini data

The archive story

You're a DBA and you can see that your Gemini database has got bigger and bigger. It used to take 10 minutes to back up, now it takes 3 hours. Then again, it is several hundred times bigger than it used to be. What can you do?

Investigate the problem

Sql Server Management Studio is your friend. Simply right-click your database, select Reports...Standard Reports, and run the "Disk Usage By Top Tables" report.

Without even looking at your report, we can tell you what the biggest tables will be.

  • If you use Breeze, then breeze_enquiry will be right up there.
  • Unless you are an exceptional user of Gemini, gemini_issueattachments will rightly be in the top 3
  • Fighting for their place on the leaderboard will be gemini_issueaudit, gemini_issuecomments, and gemini_customfielddata

What can you do about these big tables?

The first thing to do is to get the business to agree that what is really old and closed should be archived. That, in most archiving exercises, is the hardest part.

The technical aspects of archiving are simple once you establish a cut-off date, because what you need to know are the tables and the basic joins in SQL to find old, closed records, and the basic SQL to archive them.

We are not going to attempt to be your DBA. For a DBA, the tasks will be straightforward (even if they're not simple). We're not going to write grown-up SQL with "Join x on y...", or inserts into temp tables and rejoins, which your DBA may have to do in order to cope with large volumes of data. Instead, we're going to us cheat sheet SQL, but the example below will show any competent DBA what tables are involved and what they need to do.

-- SAMPLE SELECT STATEMENTS - You can hard-code a date if the business has agreed to a cut-off date as opposed to an age calculation. The example below looks at items in a closed state (isfinal = 1) that have not been touched in 1095 days

select a.* from breeze_enquiry a where a.issueid in ( select b.issueid from gemini_issues b where b.revised < DATEADD(DD, -1095, CAST(CURRENT_TIMESTAMP AS DATE)) and b.issuestatusid in ( select c.statusid from gemini_issuestatus c where c.isfinal = 1 and c.statusid = b.issuestatusid ) ); select a.* from gemini_issueattachments a where a.issueid in ( select b.issueid from gemini_issues b where b.revised < DATEADD(DD, -1095, CAST(CURRENT_TIMESTAMP AS DATE)) and b.issuestatusid in ( select c.statusid from gemini_issuestatus c where c.isfinal = 1 and c.statusid = b.issuestatusid ) ); select a.* from gemini_customfielddata a where a.issueid in ( select b.issueid from gemini_issues b where b.revised < DATEADD(DD, -1095, CAST(CURRENT_TIMESTAMP AS DATE)) and b.issuestatusid in ( select c.statusid from gemini_issuestatus c where c.isfinal = 1 and c.statusid = b.issuestatusid ) ); select a.* from gemini_issueaudit a where a.issueid in ( select b.issueid from gemini_issues b where b.revised < DATEADD(DD, -1095, CAST(CURRENT_TIMESTAMP AS DATE)) and b.issuestatusid in ( select c.statusid from gemini_issuestatus c where c.isfinal = 1 and c.statusid = b.issuestatusid ) ); select a.* from gemini_issuecomments a where a.issueid in ( select b.issueid from gemini_issues b where b.revised < DATEADD(DD, -1095, CAST(CURRENT_TIMESTAMP AS DATE)) and b.issuestatusid in ( select c.statusid from gemini_issuestatus c where c.isfinal = 1 and c.statusid = b.issuestatusid ) );

You see the pattern? We join our big table on the field 'IssueId' to the Issue Status table (gemini_issuestatus) testing that the item is in a 'final' state (isfinal=1). Final=1 is the same as Closed. We use the revised date not the created date because we want to know when the record was last worked on, not just how old it is relative to its creation date.

Why are these select statements? Well it's because the example is just to show the principle (and the table names). In all likelihood you will need to write more sophisticated selection criteria, and also you really should read our tip below because it will all become clearer. Suffice to say that if you did change 'select' to 'delete', you'd delete the data that the select statements retrieve, but don't do that yet! Read on!

To store the archived data, you might want to create an empty database, with a name that matches the cut-off date of your deletion. Before deleting any data you could use a statement like:


-- SAMPLE ARCHIVE STATEMENTS - replace XYZ with the name of your archive database

select * into XYZ.dbo.archive_breeze_enquiry from breeze_enquiry a where a.issueid in ( select b.issueid from gemini_issues b where b.revised < DATEADD(DD, -365, CAST(CURRENT_TIMESTAMP AS DATE)) and b.issuestatusid in ( select c.statusid from gemini_issuestatus c where c.isfinal = 1 and c.statusid = b.issuestatusid ) );

A "select * into ... from ..." SQL statement into an empty database will create the table and populate it with the data. When we're done using the same technique on all the tables we wish to archive, we can backup the archive database and take it offline. NOW we can might change those original selects into deletes, knowing that there is a copy of the data in an archive database. But hold on, read the tip below!

A tip for you

Data deletion is a heavy-duty process. Apart from anything else you might have to do something about some foreign key constraints. If you updated the tables instead, you would avoid that kind of problem. For example, rather than deleting attachments, you could set the field filecontent = ''. You'll find that updating the fields that hold the large data volumes to be nothing/empty reduces table size almost as well as deleting the record. It avoids any risk to data integrity and a record in Gemini remains to show that there was an attachment, so that if it is needed later it can be retrieved. You can do the same thing with all the large tables by looking at their structure and setting varbinary(max) and nvarchar(max) fields to ''.

There's an added bonus if you use 'update' instead of 'delete'. Any restore script from the archive database will be simpler. You don't need to worry about setting IDENTITY_INSERT off and more of those foreign key constraints, just update the deleted data with the data from the archive database using the matching Issueid key value


Another (slightly different) thing you might wish to do is download attachments to file storage and then set the binary field to nothing in the database. There's more than one way to do this, but the simplest is to write a small C# program with your selection criteria and something like the following code for writing out the attachments:

var sqlConnection = new SqlConnection(connString);
var sqlCommand = new SqlCommand("select filecontent from gemini_issueattachments where ", sqlConnection);
sqlConnection.Open();
var reader = sqlCommand.ExecuteReader();
reader.Read();
var content = reader[0] as byte[];
File.WriteAllBytes("", content);

We'll leave you to write the logic to iterate through the data, on the basis that the ability to write such logic is the minimum qualification for being allowed to execute the code.

We hope this has been helpful. If you get stuck you can always commission Countersoft's professional services to do this for you, and we promise not to use cheat sheet SQL.