Is it better to create an index before filling a table with data, or after the data is in place?
The more efficient way is to create the index after inserting the data.
Example (PostgreSQL 9.1):
CREATE TABLE test1(id serial, x integer);

INSERT INTO test1(id, x)
  SELECT x.id, x.id*100 FROM generate_series(1,1000000) AS x(id);
-- Time: 7816.561 ms

CREATE INDEX test1_x ON test1 (x);
-- Time: 4183.614 ms
Insert and then create index – about 12 sec
CREATE TABLE test2(id serial, x integer);

CREATE INDEX test2_x ON test2 (x);
-- Time: 2.315 ms

INSERT INTO test2(id, x)
  SELECT x.id, x.id*100 FROM generate_series(1,1000000) AS x(id);
-- Time: 25399.460 ms
Create index and then insert – about 25.5 sec
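The same comparison can be reproduced outside PostgreSQL. Below is a minimal sketch using Python's built-in sqlite3 module; the table names and row count are illustrative, and absolute timings will differ from the PostgreSQL numbers above, but the ordering of the two variants should be the same:

```python
import sqlite3
import time

def bulk_insert(cur, table, n=200_000):
    # Insert n rows; the id column is auto-assigned by SQLite.
    cur.executemany(f"INSERT INTO {table}(x) VALUES (?)",
                    ((i * 100,) for i in range(n)))

con = sqlite3.connect(":memory:")
cur = con.cursor()

# Variant 1: insert first, then build the index.
cur.execute("CREATE TABLE t1(id INTEGER PRIMARY KEY, x INTEGER)")
t0 = time.perf_counter()
bulk_insert(cur, "t1")
cur.execute("CREATE INDEX t1_x ON t1(x)")
insert_then_index = time.perf_counter() - t0

# Variant 2: create the index first, then insert.
cur.execute("CREATE TABLE t2(id INTEGER PRIMARY KEY, x INTEGER)")
cur.execute("CREATE INDEX t2_x ON t2(x)")
t0 = time.perf_counter()
bulk_insert(cur, "t2")
index_then_insert = time.perf_counter() - t0

print(f"insert then index: {insert_then_index:.3f}s")
print(f"index then insert: {index_then_insert:.3f}s")
```

Variant 1 pays one bulk index build at the end; variant 2 maintains the index on every single insert, which is where the extra cost comes from.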
It is generally better to create the index after the rows are inserted. Not only will it be faster, but the tree balancing will most likely be better as well.
Edit: “balancing” probably isn’t the best choice of term here. In the case of a b-tree, it is balanced by definition. But that does not mean the b-tree has an optimal layout. Child node distribution within parents can be uneven (leading to more cost in future updates), and the tree depth can end up deeper than necessary if the balancing isn’t performed carefully during updates.
If the index is created after the rows are inserted, it is more likely to have a better distribution. In addition, the index pages on disk may have less fragmentation after the index is built.
If you add the data to the table first and then create the index, the index build will take O(n·log(N)) longer (where n is the number of rows added).
Since building the tree takes O(N·log(N)), splitting the data into old rows (X) and new rows (n) gives O((X+n)·log(N)), which can be rewritten as O(X·log(N) + n·log(N)); in that form you can see directly how much extra time the new rows add.
If you create the index first and then insert the data, each of the n new rows incurs extra insert time of O(log(P)) to restore the structure of the tree after adding the new element (the index entry for the new row; since the index already exists, every inserted row forces it back to a balanced structure, which costs O(log(P)), where P is the index size, i.e. the number of elements in it).
With n new rows you end up with n·O(log(N)), i.e. O(n·log(N)) of total extra time.
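The arithmetic above can be checked with a toy cost model (constants omitted, log base 2 assumed; the row counts X and n below are made up for illustration):

```python
from math import log2

X = 1_000_000   # rows already present
n = 100_000     # newly added rows
N = X + n       # final index size

# Building the whole index at once: O(N log N).
bulk_build = N * log2(N)

# The same quantity split into old and new contributions:
# O((X+n) log N) = O(X log N + n log N).
split = X * log2(N) + n * log2(N)
assert abs(bulk_build - split) < 1e-3  # identical, just factored

# Inserting n rows one by one into an existing index costs
# roughly sum(log2(size)) as the index grows from X to N,
# which is bounded above by n * log2(N).
incremental = sum(log2(s) for s in range(X, N))

print(f"extra cost of the new rows in a bulk build: {n * log2(N):,.0f}")
print(f"incremental per-row insert cost:            {incremental:,.0f}")
```

The model only compares asymptotic shapes; in practice the bulk build also benefits from sorting the keys once and writing index pages sequentially, which the per-row path cannot do.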
Create the table and the clustered index, populate IN CLUSTERED INDEX ORDER, then create any additional indexes.
(1) With this much data you may well be sorting it first, unless you are getting it straight out of another indexed table (e.g. an aggregation). If you have a sort step, you’ll need to test whether sorting is more expensive than creating the clustered index afterwards. It may well be.
(2) With this much data there may be a better way of doing this – e.g. only populating changes – you’d have to write logic for this
(3) With more complex requirements, as previous posters have said, you should test different techniques and see what works best – there are lots of potential differences between environments.
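The “populate in clustered index order” advice can be illustrated with SQLite, whose WITHOUT ROWID tables store rows physically clustered by primary key (similar in spirit to a clustered index; the table name and row count here are illustrative):

```python
import random
import sqlite3
import time

con = sqlite3.connect(":memory:")
cur = con.cursor()
# A WITHOUT ROWID table is physically ordered by its primary key.
cur.execute("CREATE TABLE c(k INTEGER PRIMARY KEY, v INTEGER) WITHOUT ROWID")

keys = list(range(200_000))
shuffled = keys[:]
random.shuffle(shuffled)

# Insert in key order: rows land at the end of the clustered structure.
t0 = time.perf_counter()
cur.executemany("INSERT INTO c VALUES (?, ?)", ((k, k) for k in keys))
ordered = time.perf_counter() - t0

cur.execute("DELETE FROM c")

# Insert in random order: each row must be placed mid-structure.
t0 = time.perf_counter()
cur.executemany("INSERT INTO c VALUES (?, ?)", ((k, k) for k in shuffled))
unordered = time.perf_counter() - t0

print(f"ordered insert:   {ordered:.3f}s")
print(f"unordered insert: {unordered:.3f}s")
```

On an in-memory database the gap is modest; on disk, out-of-order inserts additionally cause page splits and fragmentation, which is why loading in clustered index order matters more there.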