Historical PowerTrack¶
Methods¶
Method | Description |
---|---|
POST /jobs | Create a new Historical PowerTrack job |
GET /jobs/:uuid | Monitor the status of a single Historical PowerTrack job |
PUT /jobs/:uuid | Accept or reject a Historical PowerTrack job that has been estimated |
GET /results | Retrieve the list of URLs for a finished job. These URLs can then be used to download the data files. |
Download Twitter Data Files | Download the Twitter data files generated for a complete Historical PowerTrack job. |
GET /jobs | Check the status of all your Historical PowerTrack jobs |
Support for Tweet edits - Since Historical PowerTrack only delivers Tweets from at least 30 minutes ago, all Tweets returned will reflect their final edit state. All Tweet objects for Tweets posted starting August 17, 2022, will include metadata that describes its edit history. See the "Tweet edits" fundamentals page for more details.
Authentication¶
All requests to the Historical PowerTrack API must use HTTP Basic Authentication, constructed from a valid email address and password combination used to log into your account at console.gnip.com. Credentials must be passed as the Authorization header for each request. Make sure your client is adding the "Authentication: Basic" HTTP header (with encoded credentials over HTTPS) to all API requests.
Job Stages¶
Below are the stages that a job may have as it goes through the life cycle. Job status messages returned by the API may vary from the table below.
Status | Status Message(s) |
---|---|
opened | Waiting on quote from Gnip. |
estimating |
|
quoted | Job quoted and awaiting customer acceptance. |
accepted | Job accepted and ready to be queued. |
rejected | Job quote rejected by customer. |
running |
|
completed | Job completed and awaiting validation. |
delivered | Job delivered and available for download |
failed |
|
Example Job Timeline:
2017-03-07 19:42 UTC Job Created 2017-03-07 19:50 UTC Estimation Completed 2017-03-07 19:51 UTC Quoted 2017-03-07 19:52 UTC Accepted 2017-03-07 19:53 UTC Started Running 2017-03-07 21:21 UTC Completed Running 2017-03-22 21:21 UTC Expires
Important Note: "Failed" jobs should not be resubmitted. The Twitter Data team will fix this. If you resubmit a failed job, you'll end up with two jobs and pay for the data twice. Assume that a failed job has the team's attention and that Twitter will fix it.
POST /jobs ¶
Creates a new Historical PowerTrack job.
Request Specifications
HTTP Method | POST |
URL | https://gnip-api.gnip.com/historical/powertrack/accounts/{ACCOUNT_NAME}/publishers/twitter/jobs.json |
Content-Type | "application/json" Note: This should be specified in the header of the request. |
Request Body Format | JSON |
Request Body Size Limit | 1000 rules |
Rate Limits | A maximum of 60 Jobs can be created per (UTC) day. A maximum of 30 Jobs can be created per hour. A maximum of 2 Jobs can be estimating concurrently. A maximum of 2 Jobs can be running concurrently. |
Request Body Elements
Parameter | Example | Comment |
---|---|---|
publisher | "publisher":"twitter" |
The data publisher you want the historical job to use. Currently Twitter only. |
dataFormat | "dataFormat":"original" |
The data format to use for the job. Enter "original" to have your data delivered in the enriched native format. |
fromDate | "fromDate":"201201010000" |
A timestamp indicating the start time of the period of interest. Timestamp is specified in the UTC time zone and has a 'YYYYMMDDHHMM' format (minute granularity). This date is inclusive, meaning the minute specified will be included in the job. |
toDate | "toDate":"201201020000" |
A timestamp indicating the end time of the period of interest. Timestamp is specified in the UTC time zone and has a 'YYYYMMDDHHMM' format (minute granularity). This is NOT inclusive, so a time of 00:00 will return data through 23:59 of previous day. I.e. the specified minute will not be included in the job, but the minute immediately preceding it will In the UTC time zone. |
title | "title":"my_job123" |
A title for the historical job. This must be unique and jobs with duplicate titles will be rejected. |
rules | "rules":[ { "value":"cow" }, { "value":"dog", "tag":"pets" }] |
The rules which will determine what data is returned by your job. Historical PowerTrack jobs support up to 1000 PowerTrack rules. For information on PowerTrack rules, see our documentation here, and be sure to consider the caveats listed here. |
Specifying the Correct Time Window
To ensure that your Historical PowerTrack job retrieves all the data you need, you should ensure that your app correctly applies the fromDate and toDate parameters. A common customer error is to set the toDate one minute earlier than it should be, resulting in the job failing to retrieve the last minute of data desired. This is because toDate is exclusive, where fromDate is inclusive.
To further illustrate, consider the following example, in which the customer desires to retrieve data from minutes 55, 56, 57, 58, and 59 of a given hour. The customer must use 06:00 as the toDate to ensure that they capture Tweets from minute 59. Tweets from 06:00 are not included in the Twitter data that is delivered.
To extend this to a more practical example, imagine a Historical PowerTrack job that required data for every minute of January 1, 2014. To capture each of these minutes, the job should have a fromDate of 201401010000 (01-01-2014 00:00), and a toDate of 201401020000 (01-02-2014 00:00 -- minute 0 of January 2). This will ensure minutes 00:00-23:59 are delivered in the results.
Understanding Number of "Historical Powertrack Days"
A "Historical Powertrack Day" starts at UTC 0000 and ends at 2359. Inclusion of any minute of a "Historical Powertrack Day" in a job would count as a day.
A job with "fromDate":"201201010000" and "toDate":"201201020000"
would count as 1 day.
A job with "fromDate":"201201010001" and "toDate":"201201020001" would
count as 2 days.
A job with "fromDate":"201201010000" and "toDate":"201201020001" would
count as 2 days.
Request Body Example
{
"publisher":"Twitter",
"dataFormat":"original",
"fromDate":"201201010000",
"toDate":"201201010001",
"title":"my_job123",
"rules":
[
{
"value":"cat"
}
]
}
Example cURL Request
curl -v POST -u example@customer.com
'https://gnip-api.gnip.com/historical/powertrack/accounts/{ACCOUNT_NAME}/publishers/twitter/jobs.json'
-H 'Content-type: application/json'
-d '{"publisher":"twitter","dataFormat":"original","fromDate":"201609212359",
"toDate":"201609230000","title":"testjob12345","rules":[{"tag":"tag_name","value":"rule"}]}'
Response Example
{
"title":"my_job",
"account":"gnip-customer",
"publisher":"Twitter",
"streamType":"track_v2",
"format":"original",
"fromDate":"201201010000",
"toDate":"201201010001",
"requestedBy":"foo@gnip-customer.com",
"requestedAt":"2012-06-13T22:43:27Z",
"status": "open",
"statusMessage": "Waiting on quote from Gnip.",
"jobURL": "https://gnip-api.gnip.com/historical/powertrack/accounts/{ACCOUNT_NAME}/publishers/twitter/jobs/{JOB_UUID}.json"
}
Common Errors:
{
"status": "error",
"reason": "Accepting this job will cause you to exceed your monthly activities cap"
}
GET /jobs/uuid ¶
Monitors the status of a historical job.
After a job is created, you can use this request to monitor the current status of the specific job.
When the job is in the process of generating an estimate of expected order of magnitude of activity volume and time required, this request provides insight into the progress in the estimating process. Once this estimate is complete, the response will indicate the volume and time estimates referenced.
After the job has been accepted, the response can be used to track its progress as the data files are generated.
Request Specifications
HTTP Method | GET |
URL | https://gnip-api.gnip.com/historical/powertrack/accounts/{ACCOUNT_NAME}/publishers/twitter/jobs/{JOB_UUID}.json |
Rate Limit | 5 requests per 5 seconds aggregated across all GET requests. |
Estimate Fields
These fields are returned in the response for this request after a job has been estimated. They should be used to evaluate whether or not to accept the job and run it to completion.
Field | Description |
---|---|
costDollars | The cost of the job. Note that "costDollars" will not be present if you have a subscription to Historical PowerTrack, as the cost is already included in your subscription. |
expiresAt | Indicates how long this job will be available for you to either accept or reject it. |
estimatedActivityCount | An order-of-magnitude estimate of the number of results (tweets) this job is expected to return if accepted. The minimum value is 100 activities. If the sample used to generate the estimate returns no matches, this value will be 100 activities. |
estimatedDurationHours | An estimate of the amount of time that will be required to run the full job, if accepted. |
estimatedFileSizeMb | An estimate of the total compressed file size for the entire job after it is completed. |
Example cURL Request
curl -v GET -u example@customer.com
'https://gnip-api.gnip.com/historical/powertrack/accounts/{ACCOUNT_NAME}/publishers/twitter/jobs/{JOB_UUID}.json'
Response Examples
After the Estimate is Completed:
{
"title":"my_job",
"account":"gnip-customer",
"publisher":"Twitter",
"streamType":"track_v2",
"format":"original",
"fromDate":"201201010000",
"toDate":"201201010001",
"requestedBy":"foo@gnip-customer.com",
"requestedAt":"2012-06-13T22:43:27Z",
"status": "quoted",
"statusMessage": "Job quoted and awaiting customer approval.",
"jobURL":"https://gnip-api.gnip.com/historical/powertrack/accounts/{ACCOUNT_NAME}/publishers/twitter/jobs/{JOB_UUID}.json",
"quote": {
"costDollars": 5000,
"estimatedActivityCount":10000,
"estimatedDurationHours":72,
"estimatedFileSizeMb": 10.0,
"expiresAt": "2012-06-16T22:43:00Z"
},
"percentComplete": 0
}
While the Accepted Job is Running
{
"title":"my_job",
"account":"gnip-customer",
"publisher":"Twitter",
"streamType":"track_v2",
"format":"original",
"fromDate":"201201010000",
"toDate":"201201010001",
"requestedBy":"foo@gnip-customer.com",
"requestedAt":"2012-06-13T22:43:27Z",
"status": "delivered",
"statusMessage": "Job delivered and available for download.",
"jobURL":"https://gnip-api.gnip.com/historical/powertrack/accounts/{ACCOUNT_NAME}/publishers/twitter/jobs/{JOB_UUID}.json",
"quote":
{
"costDollars": 5000,
"estimatedActivityCount":10000,
"estimatedDurationHours":72,
"estimatedFileSizeMb": 10.0,
"expiresAt": "2012-06-16T22:43:00Z"
},
"acceptedBy": "foo@gnip-customer.com",
"acceptedAt": "2012-06-14T22:43:27Z",
"percentComplete": 100,
"results":
{
"completedAt":"2012-06 17T22:43:27Z",
"activityCount":1200,
"fileCount":1000,
"fileSizeMb":10.0,
"dataURL":"https://gnip-api.gnip.com/historical/powertrack/accounts/{ACCOUNT_NAME}/publishers/twitter/jobs/{JOB_UUID}.json"
"expiresAt": "2012-06-30T22:43:00Z"
}
}
PUT /jobs/uuid ¶
Accepts or rejects a historical job in the "quoted" stage.
IMPORTANT: Accepted jobs will be started by Twitter's system, and cannot be stopped after acceptance. Rejected jobs cannot be recovered.
Request Specifications
HTTP Method | PUT |
URL | https://gnip-api.gnip.com/historical/powertrack/accounts/{ACCOUNT_NAME}/publishers/twitter/jobs/{JOB_UUID}.json |
Content-Type | "application/json" Note: This should be specified in the header of the request. |
Accept Request Body Example
{
"status":"accept"
}
Reject Request Body Example
{
"status":"reject"
}
Response
After a job is either accepted or rejected, the job will include an “acceptedAt” field, which indicates the time of that action. This field records the date regardless of whether the job was accepted or rejected, and should not be interpreted as an indicator of the type of action you took on it. The proper field for this type of info is the “status” field.
Example cURL Request
curl -v PUT -u example@customer.com
'https://gnip-api.gnip.com/historical/powertrack/accounts/{ACCOUNT_NAME}/publishers/twitter/jobs/{JOB_UUID}.json'
-H 'Content-type: application/json'
-d '{"status":"accept"}'
Response Example
{
"title":"my_job",
"account":"gnip-customer",
"publisher":"Twitter",
"streamType":"track_v2",
"format":"original",
"fromDate":"201201010000",
"toDate":"201201010001",
"requestedBy":"foo@gnip-customer.com",
"requestedAt":"2012-06-13T22:43:27Z",
"status": "accepted",
"statusMessage": "Job accepted and ready to be queued.",
"jobURL":"https://gnip-api.gnip.com/historical/powertrack/accounts/{ACCOUNT_NAME}/publishers/twitter/jobs/{JOB_UUID}.json",
"quote": {
"costDollars": 5000,
"estimatedActivityCount":10000,
"estimatedDurationHours":72,
"estimatedFileSizeMb": 10.0,
"expiresAt": "2012-06-16T22:43:00Z"
},
"acceptedBy": "foo@gnip-customer.com",
"acceptedAt": "2012-06-14T22:43:27Z",
"percentComplete": 0
}
GET /results ¶
Retrieves information about a completed Historical PowerTrack job, including a list of URLs which correspond to the data files generated for a completed historical job. These URLs will be used to download the Twitter data files.
Note: Results must be downloaded within 15 days of job completion. Expired jobs are not recoverable.
Request Specifications
HTTP Method | GET |
URL | Found in the dataURL field returned in the response
from the GET /jobs/:uuid request for a completed job. The URL resembles
the following
structure:https://gnip-api.gnip.com/historical/powertrack/accounts/{ACCOUNT_NAME}/publishers/twitter/jobs/{JOB_UUID}/results.json |
Response Format | JSON |
Rate Limit | 5 requests per 5 seconds aggregated across all GET requests. |
Response Elements
For a completed job, the response for this request will include the following elements in the JSON payload.
Field | Description |
---|---|
urlCount | The number of data files from the job. |
urlList | A list of urls representing each data file from the job. Each file
represents a 10-minute period of data during the job’s requested
timeframe. If a there was no matching data for a 10-minute period, the
file representing that period will be omitted from the list. These URLs should be used when downloading the Twitter data for the job. |
totalFileSizeBytes | The total file size of the data files in bytes. |
suspectMinutesURL | Optional. There are minute periods in Gnip’s historical corpus where there might not be 100% data fidelity due to unexpected occurrences like Twitter downtime or Twitter connectivity/API issues. This suspectMinutesURL field will only be present in the payload if there are suspect minutes during the job’s specified timeframe. If this field is present, a GET request with your Gnip Console credentials will return a list of the suspect minutes within the timeframe. |
expiresAt | This is the time at which the results for this job will no longer be available for download. |
Example cURL Request
curl -v GET -u example@customer.com
'https://gnip-api.gnip.com/historical/powertrack/accounts/{ACCOUNT_NAME}/publishers/twitter/jobs/{JOB_UUID}/results.json'
Response Example
{
"urlCount":5,
"urlList":[
"https://archive.replay.historical.review.s3.amazonaws.com/historical/twitter/track/original/gnip-customer/2012/08/29/20120701-20120701_jbmfgctzw5/2012/07/01/00/10_activities.json.gz?...",
"https://archive.replay.historical.review.s3.amazonaws.com/historical/twitter/track/original/gnip-customer/2012/08/29/20120701-20120701_jbmfgctzw5/2012/07/01/00/10_activities.json.gz?...",
"https://archive.replay.historical.review.s3.amazonaws.com/historical/twitter/track/original/gnip-customer/2012/08/29/20120701-20120701_jbmfgctzw5/2012/07/01/00/10_activities.json.gz?...",
"https://archive.replay.historical.review.s3.amazonaws.com/historical/twitter/track/original/gnip-customer/2012/08/29/20120701-20120701_jbmfgctzw5/2012/07/01/00/10_activities.json.gz?...",
"https://archive.replay.historical.review.s3.amazonaws.com/historical/twitter/track/original/gnip-customer/2012/08/29/20120701-20120701_jbmfgctzw5/2012/07/01/00/10_activities.json.gz?..."
],
"totalFileSizeBytes":1146952,
"suspectMinutesUrl":
"https://archive.replay.historical.review.s3.amazonaws.com/customers/gnip-biz/20120101-20120101_5zzax1awxq/suspectMinutes.json?..."
"expiresAt":"2012-11-24T18:53:23Z"
}
In addition to the JSON file, we also generate a CSV version that can be used where more convenient. The CSV file contains a series of lines separated by 'n', where each line is a local file name for the job file, the tab ('t') character, and the S3 url for the file.
The CSV can be used for the same purpose as the JSON version when downloading the historical job. In the "downloading" section below, example curl commands use the csv version, rather than the JSON version, but both types will deliver the same results. This list can also be downloaded by navigating to the “dataURL” in your browser while logged in to the Gnip console.
Note: All data generated by Historical PowerTrack are in JSON format. The CSV format referenced here refers only to the list of download links, not the actual Twitter data itself.
Download Twitter Data Files ¶
Downloads a Gzip compressed JSON file of Twitter data for a completed Historical PowerTrack job.
Historical PowerTrack jobs generate 1 file for each 10-minute segment of time covered by the job, except where there is no matching Twitter data found within that 10-minute segment. For example, a Historical PowerTrack job covering 1 hour would contain 6 separate files, which would each need to be downloaded via separate requests.
The response to the GET /results request will include a 'urlCount' field that indicates the total number of files to download, and the URLs corresponding to each file. The request described below must be repeated for each file URL listed in the 'results' response, although the requests can be made in parallel. Upon completion of all downloads, your app should ensure that the number of files downloaded matches the 'urlCount' from the 'results' response..
Historical data files are available for 15 days from when the job is accepted. After 15 days the data will no longer be accessible.
For details on the format of the Tweets, click here.
Request Specifications
HTTP Method | GET |
URL | URLs for each file are provided in the response to the GET /results method for a completed job. The request must be repeated for each file URL listed there. |
File Compression | Gzip |
Downloading Files
cURL is a simple command line tool for sending HTTP requests, and provides an easy way to download files for Historical PowerTrack jobs. cURL is native to Linux and Mac OS, and may be installed on Windows in combination with the CygWin package. See our article here for details on this.
For examples demonstrating how to send the types of HTTP requests described below, see the below links to code examples:
Download script (bash)
- The bash script above can be used to download Historical PowerTrack data files
rbHistoricalPT (ruby)
Download files using cURL command (Option 1 of 2)
This command performs both the GET /results and GET /activities requests to download all the Twitter data files for a Historical PowerTrack job. Downloading of the files is performed in parallel into a local directory. As each file is downloaded, the command will print the command being used.
When using this command, be sure to use the CSV form of the results file (results.csv), rather than the JSON results file (results.json). The 'name' column of all_files.csv is used to name the individual files.
curl -sS -u example@customer.com:password https://gnip-api.gnip.com/historical/powertrack/accounts/{ACCOUNT_NAME}/publishers/twitter/jobs/{JOB_UUID}/results.csv | xargs -P 8 -t -n2 curl -o
If you experience issues with the above command, it is often useful to try downloading just the first data file to help debug. Again, when using this command, be sure to use the CSV form of the results file (results.csv), rather than the JSON results file (results.json).
curl -sS -u example@customer.com:password https://gnip-api.gnip.com/historical/powertrack/accounts/{ACCOUNT_NAME}/publishers/twitter/jobs/{JOB_UUID}/results.csv | head -1 | xargs -P 8 -t -n2 curl -o
Download files using bash script (option 2 of 2)
Another option for downloading Historical PowerTrack data files is by using a bash script. One advantage of using this script is that it has 'smarts' when restarting a download cycle. If your download cycle gets interrupted for any reason, the script will inspect what files it has already downloaded and download only the files that are not available locally.
To learn more about this bash script, please read our Downloading Historical PowerTrack Files documentation.
GET /jobs ¶
Retrieves details for all historical PowerTrack jobs which are not expired for the given account.
This request can be used to pull the list of all Historical PowerTrack jobs for the customer's account, as well as aggregate metrics related to Historical PowerTrack usage. Expired jobs are not included in the list of jobs returned. Jobs which are not accepted expire 15 days after being estimated. Jobs that are accepted will expire in 15 days after acceptance.
For aggregate usage data, see the delivered, jobCount, jobDaysRun, and activityCount fields.
Request Specifications
HTTP Method | GET |
URL | https://gnip-api.gnip.com/historical/powertrack/accounts/{ACCOUNT_NAME}/publishers/twitter/jobs.json |
Rate Limit | 5 requests per 5 seconds aggregated across all GET requests. |
Example cURL Request
curl -v GET -u example@customer.com
'https://gnip-api.gnip.com/historical/powertrack/accounts/{ACCOUNT_NAME}/publishers/twitter/jobs.json'
Example Response
{
"jobs":[{
"title": "my_job1",
"jobURL":"https://gnip-api.gnip.com/historical/powertrack/accounts/{ACCOUNT_NAME}/publishers/twitter/jobs/{JOB_UUID}.json",
"status": "running",
"publisher":"Twitter",
"streamType":"track_v2",
"fromDate":"201201010000",
"toDate":"201201010001"
"percentComplete": 40
"expiresAt":"2012-11-14T21:21:15Z"
},{
"title": "my_job2",
"jobURL":"https://gnip-api.gnip.com/historical/powertrack/accounts/{ACCOUNT_NAME}/publishers/twitter/jobs/{JOB_UUID}.json",
"status": "delivered",
"publisher":"Twitter",
"streamType":"track_v2",
"fromDate":"201201010000",
"toDate":"201201010001",
"percentComplete": 100
"expiresAt":"2012-11-16T10:15:03Z"
}],
"delivered":{
"jobCount":3,
"jobDaysRun":73,
"activityCount":15958
}
}