Skip to content
Elvin Zeynalli
Go back

Learning Optimal Pricing with Reinforcement Learning

Project description


Problem Statement

Dynamic pricing is a fundamental problem in economics and operations research. Sellers repeatedly interact with customers whose willingness-to-pay is uncertain. Each round:

  1. The seller offers a price from a feasible set (1–100).
  2. The customer either accepts or rejects.
  3. The seller observes only binary feedback (buy/no-buy).

In our environment:

SegmentValuationMarket share
1180.4
2430.3
3560.2
4810.1

The theoretical optimal price is 43, yielding the maximum expected revenue given the mixture of valuations and probabilities.

The objective of this project is not only to find the optimal price but also to test whether reinforcement learning (RL) can discover optimal solutions in combinatorial problems. Pricing here serves as a tractable example:

This aligns with a broader scientific inquiry: can RL and similar adaptive methods consistently solve combinatorial optimization problems without explicit enumeration or full knowledge of the environment?


MDP Formulation

We formalize the pricing task as an MDP:

The MDP lens enables us to treat pricing as a sequential decision problem where the agent incrementally improves its policy.


Actor-Critic Algorithm

The Actor-Critic framework merges two worlds:

In our implementation:

But more generally, the Actor could be a neural policy trained with REINFORCE, PPO, or A2C, while the Critic could approximate values with deep regressors or Monte Carlo estimates. This generality is important: pricing is one instance, but the Actor-Critic template is applicable to much more complex combinatorial optimization tasks.


Implementation

Core Actor-Critic loop

# Initialize
theta = np.random.normal(0, 0.1, (num_states, num_actions))  # Actor parameters
V = np.zeros(num_states)                                     # Critic values

def softmax(logits, tau=1.0):
    exp_logits = np.exp(logits / tau)
    return exp_logits / np.sum(exp_logits)

for episode in range(N):
    s = 1  # initial state
    pi = softmax(theta[s])
    a = np.random.choice(range(num_actions), p=pi)
    
    reward, s_next = environment(a)
    
    # Critic update
    td_error = reward + gamma * V[s_next] - V[s]
    V[s] += alpha_c * td_error
    
    # Actor update
    grad_log_pi = np.eye(num_actions)[a] - pi
    theta[s] += alpha_a * td_error * grad_log_pi
    
    s = s_next

Explanation:

This dual mechanism makes learning more stable than pure policy gradient and more flexible than pure value-based methods.


Action selection

pi = softmax(theta[state], tau=1.0)
action = np.random.choice(range(num_actions), p=pi)

Here, τ (temperature) is crucial:


Critic and Actor updates

# Critic update
td_error = reward + gamma * V[next_state] - V[state]
V[state] += alpha_c * td_error

# Actor update
grad_log_pi = np.eye(num_actions)[action] - pi
theta[state] += alpha_a * td_error * grad_log_pi

Simulation Setup


Results

Summary statistics

MetricActor-Critic
Mean learned price41.13
Mode43.00
Std. Dev.4.04
% runs finding true optimal (43)31%
Final Avg. Reward24.41
Std. Dev. (rewards)1.96

These values show that Actor-Critic not only converges to near-optimal solutions but does so with low variance, confirming the method’s reliability.


Figures and Interpretations

Figure 1: Histogram of learned optimal prices Figure 1: Histogram of learned optimal prices Most simulations converge near the optimal price of 43. Some spread remains due to exploration noise, but the concentration proves that Actor-Critic is capable of discovering the true optimum from scratch.

Figure 2: Mean smoothed rewards across episodes Figure 3: Mean smoothed rewards Average reward increases sharply and stabilizes after ~5,000 episodes. This demonstrates early convergence—a key strength of Actor-Critic relative to value-only methods.

Figure 3: Convergence to optimal price Figure 4: Convergence of Actor-Critic The convergence curve shows how exploration covers a wide price range early but gradually narrows around the global optimum at 43. The algorithm maintains occasional exploration, but the focus is clearly locked on the optimal solution.


Key Takeaways


Conclusion

This project confirms that reinforcement learning, specifically the Actor-Critic method, can successfully solve dynamic pricing problems without prior knowledge of customer valuations. Beyond pricing, this serves as evidence that learning-based approaches can tackle combinatorial optimization problems, converging to theoretical optima with high reliability.

The results support a broader conclusion: reinforcement learning is not only an engineering tool but also a scientific framework capable of discovering optimal strategies in environments characterized by uncertainty and discrete combinatorial structure.


Share this post on:

Previous Post
Optimal Blackjack Strategy Through Dynamic Programming
Next Post
Analysis of State Schools in Scotland: Map of Scotland with Schools