# ben-aaron188 | ad8b3f3 | 2023-03-05 20:22:57 +0100
#' Makes bunch chat completion requests to the ChatGPT API
#'
#' @description
#' `chatgpt()` is the package's main function for the ChatGPT functionality. It takes a vector of prompts as input and processes each prompt according to the defined parameters. It extends the `chatgpt_single()` function to allow for bunch processing of requests to the OpenAI GPT API.
#' @details
#' The easiest (and intended) use case for this function is to create a data.frame or data.table with variables that contain the prompts to be sent to ChatGPT and a prompt id (see examples below).
#' For a general guide on chat completion requests, see [https://platform.openai.com/docs/guides/chat/chat-completions-beta](https://platform.openai.com/docs/guides/chat/chat-completions-beta). This function provides you with an R wrapper to send requests with the request parameters detailed on [https://platform.openai.com/docs/api-reference/chat/create](https://platform.openai.com/docs/api-reference/chat/create) and reproduced below.
#'
#' If `id_var` is not provided, the function will use `prompt_1` ... `prompt_n` as id variable.
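#'
#' As a minimal sketch of that fallback (the role vector is made up for illustration and the variable names are assumptions, not part of the package interface):
#'
#' ```r
#' # Hypothetical role vector; only its length matters for the fallback ids
#' prompt_roles = c('user', 'user', 'user')
#' data_id = paste0('prompt_', seq_along(prompt_roles))
#' data_id
#' ```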
#'
#' Parameters not included/supported:
#' - `logit_bias`: [https://platform.openai.com/docs/api-reference/chat/create#chat/create-logit_bias](https://platform.openai.com/docs/api-reference/chat/create#chat/create-logit_bias)
#' - `stream`: [https://platform.openai.com/docs/api-reference/chat/create#chat/create-stream](https://platform.openai.com/docs/api-reference/chat/create#chat/create-stream)
#'
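#' The per-prompt results are combined into one table with `data.table::rbindlist()`, and the function retries with `fill = TRUE` if the first attempt fails. The sketch below illustrates that fallback under the assumption that two results differ in their number of columns; the tables `a` and `b` are made up and do not reflect the actual response columns:
#'
#' ```r
#' # Two hypothetical per-prompt tables with a different number of columns
#' a = data.table::data.table(id = 'A', x = 1)
#' b = data.table::data.table(id = 'B')
#'
#' # A plain bind fails because the column counts differ ...
#' combined = try(data.table::rbindlist(list(a, b)), silent = TRUE)
#'
#' # ... so the bind is retried with fill = TRUE, which pads missing columns with NA
#' if(inherits(combined, 'try-error')){
#'   combined = data.table::rbindlist(list(a, b), fill = TRUE)
#' }
#' combined
#' ```
#'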
#' @param prompt_role_var character vector that contains the role for each prompt of the ChatGPT request. Each element must be one of 'system', 'assistant' or 'user', see [https://platform.openai.com/docs/guides/chat](https://platform.openai.com/docs/guides/chat)
#' @param prompt_content_var character vector that contains the content for each prompt of the ChatGPT request. This is the key instruction that ChatGPT receives.
#' @param id_var (optional) character vector that contains the user-defined ids of the prompts. See details.
#' @param param_model a character vector that indicates the [ChatGPT model](https://platform.openai.com/docs/api-reference/chat/create#chat/create-model) to use; one of "gpt-3.5-turbo" (default), "gpt-3.5-turbo-0301"
#' @param param_output_type character determining the output provided: "complete" (default), "text" or "meta"
#' @param param_max_tokens numeric (default: 100) indicating the maximum number of tokens that the completion request should return (from the official API documentation: _The maximum number of tokens allowed for the generated answer. By default, the number of tokens the model can return will be (4096 - prompt tokens)._)
#' @param param_temperature numeric (default: 1.0) specifying the sampling strategy of the possible completions (from the official API documentation: _What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or `top_p` but not both._)
#' @param param_top_p numeric (default: 1) specifying the sampling strategy as an alternative to temperature sampling (from the official API documentation: _An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or `temperature` but not both._)
#' @param param_n numeric (default: 1) specifying the number of completions per request (from the official API documentation: _How many chat completion choices to generate for each input message. **Note: Because this parameter generates many completions, it can quickly consume your token quota.** Use carefully and ensure that you have reasonable settings for max_tokens and stop._)
#' @param param_stop character or character vector (default: NULL) that specifies the sequence(s) at which the completion should end (from the official API documentation: _Up to 4 sequences where the API will stop generating further tokens._)
#' @param param_presence_penalty numeric (default: 0) between -2.00 and +2.00 that determines how strongly tokens that already appear in the text are penalised (from the official API documentation: _Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics._). See also: [https://beta.openai.com/docs/api-reference/parameter-details](https://beta.openai.com/docs/api-reference/parameter-details)
#' @param param_frequency_penalty numeric (default: 0) between -2.00 and +2.00 that determines how strongly tokens are penalised based on how often they already appear in the text (from the official API documentation: _Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim._). See also: [https://beta.openai.com/docs/api-reference/parameter-details](https://beta.openai.com/docs/api-reference/parameter-details)
#'
#' @return A list with two data tables (if `param_output_type` is the default "complete"): `[[1]]` contains the data table with the columns `n` (= the number of `n` responses requested), `prompt_role` (= the role that was set for the prompt), `prompt_content` (= the content that was set for the prompt), `chatgpt_role` (= the role that ChatGPT assumed in the chat completion) and `chatgpt_content` (= the content that ChatGPT provided with its assumed role in the chat completion). `[[2]]` contains the meta information of the request, including the request id, the parameters of the request and the token usage of the prompt (`tok_usage_prompt`), the completion (`tok_usage_completion`), the total usage (`tok_usage_total`) and the `id` (= the provided `id_var` or its default alternative).
#'
#' If `param_output_type` is "text", only the data table in slot `[[1]]` is returned.
#'
#' If `param_output_type` is "meta", only the data table in slot `[[2]]` is returned.
#' @examples
#' # First authenticate with your API key via `gpt3_authenticate('pathtokey')`
#'
#' # Once authenticated:
#' # Assuming you have a data.table with 3 different prompts:
#' dt_prompts = data.table::data.table('prompts_content' = c('What is the meaning of life?', 'Write a tweet about London:', 'Write a research proposal for using AI to fight fake news:')
#'     , 'prompts_role' = rep('user', 3)
#'     , 'prompt_id' = c(LETTERS[1:3]))
#' chatgpt(prompt_role_var = dt_prompts$prompts_role
#'     , prompt_content_var = dt_prompts$prompts_content
#'     , id_var = dt_prompts$prompt_id)
#'
#' ## With more controls
#' chatgpt(prompt_role_var = dt_prompts$prompts_role
#'     , prompt_content_var = dt_prompts$prompts_content
#'     , id_var = dt_prompts$prompt_id
#'     , param_max_tokens = 50
#'     , param_temperature = 0.5
#'     , param_n = 5)
#'
#' ## Reproducible example (deterministic approach)
#' chatgpt(prompt_role_var = dt_prompts$prompts_role
#'     , prompt_content_var = dt_prompts$prompts_content
#'     , id_var = dt_prompts$prompt_id
#'     , param_max_tokens = 50
#'     , param_temperature = 0
#'     , param_n = 3)
#'
#' @export
chatgpt = function(prompt_role_var
                   , prompt_content_var
                   , id_var
                   , param_output_type = 'complete'
                   , param_model = 'gpt-3.5-turbo'
                   , param_max_tokens = 100
                   , param_temperature = 1.0
                   , param_top_p = 1
                   , param_n = 1
                   , param_stop = NULL
                   , param_presence_penalty = 0
                   , param_frequency_penalty = 0){

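  # Validate the output type up front so that no request is sent for an unsupported value
  param_output_type = match.arg(param_output_type, c('complete', 'text', 'meta'))

  # Fall back to prompt_1 ... prompt_n when no ids are supplied
  data_length = length(prompt_role_var)
  if(missing(id_var)){
    data_id = paste0('prompt_', seq_len(data_length))
  } else {
    data_id = id_var
  }

  # Collectors for the per-prompt completion tables and meta tables
  empty_list = list()
  meta_list = list()
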
  for(i in seq_len(data_length)){

    print(paste0('Request: ', i, '/', data_length))

    # Always request the complete output: both tables are indexed below and the
    # selection by param_output_type happens after the loop. This assumes that
    # chatgpt_single() returns a list of two tables for output_type = 'complete'.
    row_outcome = chatgpt_single(prompt_role = prompt_role_var[i]
                                 , prompt_content = prompt_content_var[i]
                                 , model = param_model
                                 , output_type = 'complete'
                                 , max_tokens = param_max_tokens
                                 , temperature = param_temperature
                                 , top_p = param_top_p
                                 , n = param_n
                                 , stop = param_stop
                                 , presence_penalty = param_presence_penalty
                                 , frequency_penalty = param_frequency_penalty)

    # Tag both tables with the id of the current prompt
    row_outcome[[1]]$id = data_id[i]
    row_outcome[[2]]$id = data_id[i]

    empty_list[[i]] = row_outcome[[1]]
    meta_list[[i]] = row_outcome[[2]]

  }

  # Bind the per-prompt tables; retry with fill = TRUE if the plain bind fails
  bunch_core_output = try(data.table::rbindlist(empty_list), silent = TRUE)
  if(inherits(bunch_core_output, 'try-error')){
    bunch_core_output = data.table::rbindlist(empty_list, fill = TRUE)
  }
  bunch_meta_output = try(data.table::rbindlist(meta_list), silent = TRUE)
  if(inherits(bunch_meta_output, 'try-error')){
    bunch_meta_output = data.table::rbindlist(meta_list, fill = TRUE)
  }

  if(param_output_type == 'complete'){
    output = list(bunch_core_output
                  , bunch_meta_output)
  } else if(param_output_type == 'meta'){
    output = bunch_meta_output
  } else if(param_output_type == 'text'){
    output = bunch_core_output
  }

  return(output)
}