Splunk Data Retention Time Queries
Posted on May 11, 2019 By Maciej Dadok-Grabski
Splunk is a software platform used for analyzing, monitoring, and visualizing machine-generated data. Splunk lets users retain data for a specified period, known as the data retention time: the length of time data is stored within the platform before it is deleted or archived. The retention time can be set by the user and can differ for different types of data. It is important to consider it carefully when setting up Splunk, as it affects both the amount of storage space needed and the ability to retrieve historical data.
How Splunk data retention works is an interesting topic. At first glance it seems fairly straightforward; in practice there is more to it than meets the eye. Just take a look at the Splunk queries presented below, and run them in your environment to get better insight into your own data retention.
Out of the box, Splunk comes with a couple of features that let you gather statistics about your data. For example, dbinspect and eventcount combined will show you figures based on Splunk metadata, which is collected in parallel with the actual data you import into Splunk. The other way of retrieving data statistics is to use _indextime: here your real data, collected on the fly, is used to measure itself.
First, you will get a better understanding of how your data currently uses Splunk storage and, more importantly, how long it is being kept. With that picture, you will be ready to set an appropriate data retention policy. In addition, you may want to follow the directions from the Splunk docs: Set a retirement and archiving policy.
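As an illustration of what such a policy looks like on disk, retention is configured per index in indexes.conf, mainly through the frozenTimePeriodInSecs setting. The setting names below are real, but the index name and values are purely hypothetical examples, not recommendations:

[web_logs]
# Freeze (delete or archive) buckets whose newest event is older than 90 days (90 * 86400 seconds)
frozenTimePeriodInSecs = 7776000
# Also freeze the oldest buckets once the index exceeds roughly 500 GB
maxTotalDataSizeMB = 500000
# Optional: copy frozen buckets here instead of deleting them
coldToFrozenDir = /opt/splunk/frozen/web_logs

Note that whichever limit is reached first wins: a bucket is frozen either when its newest event ages past frozenTimePeriodInSecs or when the index outgrows maxTotalDataSizeMB.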
Splunk Index size
The first query uses the eventcount command to determine the current size of the Splunk indexes. It reports the index size on every indexer in your environment. The output can be very useful for storage capacity planning; with the help of the Splunk Storage Sizing web tool, gathering storage requirements should no longer be a problem. In the Splunk docs Indexes, indexers, and indexer clusters you can find more details about Splunk indexes and indexers.
| eventcount summarize=false index=* report_size=true | eval sizeMB = size_bytes/1024/1024 | rename server AS indexer, sizeMB AS "Actual size in MB" | table index, indexer, "Actual size in MB"
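If you also need one total per index across all indexers, for example when sizing shared storage, a slight variation of the query above can sum the per-indexer sizes. A sketch, using the same eventcount output:

| eventcount summarize=false index=* report_size=true | eval sizeMB = size_bytes/1024/1024 | stats sum(sizeMB) AS "Total size in MB" by index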
Splunk SourceType Retention time
You can run a metadata query for Splunk source types. As a result, you will be presented with the retention time in days for all your source types. In the Splunk docs Why source types matter you can find more details about Splunk source types.
| metadata type=sourcetypes index=* | eval TimeNow=now() | eval retentionDays=((TimeNow-firstTime)/60/60/24) | table sourcetype, retentionDays
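To quickly spot which source types hold the oldest data, the same metadata query can be sorted by retention in descending order; this sketch keeps the top ten:

| metadata type=sourcetypes index=* | eval retentionDays=((now()-firstTime)/60/60/24) | sort - retentionDays | head 10 | table sourcetype, retentionDays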
Splunk Index Retention time
By combining the dbinspect and eventcount commands, Splunk index statistics can be retrieved easily. The output includes the Splunk index retention time in days. Since this data comes from the event time ranges stored in bucket metadata rather than from scanning the raw events, retrieval is pretty fast. In the Splunk docs About retrieving events you can find more details about Splunk events.
| dbinspect [| eventcount summarize=false index=* | dedup index | fields index] | stats min(startEpoch) AS earliestEventTime, max(endEpoch) AS latestEventTime by index, splunk_server | eval retentionDays=((latestEventTime-earliestEventTime)/60/60/24) | convert ctime(*Time) | sort index, splunk_server | rename splunk_server AS indexer | table index, indexer, retentionDays
Splunk Bucket statistics
This Splunk query shows very interesting and useful data about Splunk buckets. The most important thing to note is the actual vs. Splunk retention time in days. What is the difference between the two? Basically, splunkRetentionDays shows the number of days one could expect the data to be retained based on the retirement policy, while actualRetentionDays tells you how long Splunk is keeping a particular set of data in reality. Splunk bucket statistics are retrieved from the event time ranges in bucket metadata, so the data collection is pretty fast. In the Splunk docs How the indexer stores indexes you can find more details about Splunk buckets and how Splunk stores indexes.
| dbinspect [| eventcount summarize=false index=* | dedup index | fields index] | stats min(startEpoch) AS earliestEventTime, max(endEpoch) AS latestEventTime by bucketId, index, splunk_server, state, sizeOnDiskMB | eval daysInBucket=((latestEventTime-earliestEventTime)/60/60/24) | eval TimeNow=now() | eval actualRetentionDays=((TimeNow-earliestEventTime)/60/60/24) | eval splunkRetentionDays=((TimeNow-latestEventTime)/60/60/24) | convert ctime(*Time*) | sort bucketId, index, splunk_server, state | rename splunk_server AS indexer
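To make the three numbers concrete, here is a small illustration using makeresults with hypothetical timestamps: a bucket whose oldest event is 90 days old and whose newest event is 5 days old gives daysInBucket=85, actualRetentionDays=90 and splunkRetentionDays=5:

| makeresults | eval earliestEventTime=now()-(90*86400), latestEventTime=now()-(5*86400), TimeNow=now() | eval daysInBucket=((latestEventTime-earliestEventTime)/60/60/24), actualRetentionDays=((TimeNow-earliestEventTime)/60/60/24), splunkRetentionDays=((TimeNow-latestEventTime)/60/60/24) | table daysInBucket, actualRetentionDays, splunkRetentionDays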
Splunk Index Retention time based on _indextime
Last but not least, the retention time of all your indexes based on _indextime statistics. This data is collected directly from Splunk events, so you should set a specific time range for the query. To get the full picture, set "All time" as your time filter. Generating the results may take some time, depending on how much event data is stored in Splunk.
| tstats min(_indextime) AS earliestIndexTime max(_indextime) AS latestIndexTime where index=* by index, splunk_server | eval retentionDays=((latestIndexTime-earliestIndexTime)/60/60/24) | convert ctime(*Time) | sort index, splunk_server | rename splunk_server AS indexer | table index, indexer, retentionDays