ベータ混合分布モデル

Last-modified: 2016-10-20 (木) 19:44:49

Top / R Note / 統計 / 確率分布 / ベータ混合分布モデル

ベータ混合分布モデル †

ベータ混合分布モデル

↑

ベータ分布とは／単独のベータ分布推定 †

MATLAB Note/統計/確率分布/ベータ分布（Beta distribution）を参照。

↑

複数のベータ分布（ベータ混合分布）の推定 †

↑

目的 †

正規分布ではない混合分布のクラスタ数推定を行いたい場合。
- クラスタ数を指定するならGMMでもある程度ロバストだが、BICの結果はデータの正規性の影響を強く受ける。

↑

どのようなとき使えそうか（MATLABによる予備的検討） †

単独のベータ分布ではフィッティングできないとき

単独ベータ分布でも両端に山が来るような形状をとれるが、明確な二峰性データのフィッティングは無理。

例えば

beta_bi.txt *1をMATLABでフィットしてみると...

data = csvread('beta_bi.txt');
linescale = [0:0.01:1.2];
[phat, pci] = betafit(data)                     % パラメータ推定
pdfdata = betapdf(linescale, phat(1), phat(2)); % 確率密度関数生成
[histdata, position] = hist(data);
histdata = histdata / (max(histdata) / max(pdfdata));
figure, bar(position, histdata); xlim([0,1.2]);  % プロット
hold on
plot(linescale, pdfdata, 'r','LineWidth',2); xlim([0,1.2]); title('betafit');
hold off

このデータは２つのベータ分布の混合（混合ベータ分布）であるので、単独のベータ分布でうまくあてはめることはできなかった。

ガウス混合分布モデルの当てはまりが悪い（もしくはクラスタ数をうまく推定できない）とき

上のデータをMATLABのGMMにかけて、AICを求めて最適な混合数も求めてみます。MATLABのstatistics toolboxが必要です。

data = csvread('beta_bi.txt');
options = statset('Display','final');	% STATS構造体の表示設定
linescale = [0:0.01:1.2];
for k = 1:4
	obj = gmdistribution.fit(data,k,'Options',options); % カテゴリ数k
	MU = obj.mu			% 平均
	SIGMA = obj.Sigma		% 共分散
	P = obj.PComponents		% 混合比
	AIC= obj.AIC;			% AIC
	% 推定したGMMパラメータから混合正規分布を作る
	obj_plot = gmdistribution(MU, SIGMA, P);
	pdfdata = pdf(obj_plot, linescale'); % 確率密度関数生成
	% プロット
	subplot(2,2,k);
	[histdata, position] = hist(data);
	histdata = histdata / (max(histdata) / max(pdfdata)); % 縦軸を揃える
	bar(position, histdata); xlim([0,1.2]); % データのヒストグラム
	hold on
	plot(linescale, pdfdata, 'r','LineWidth',2);
	xlim([0,1.2]);
	title(strcat('cluster:', int2str(k), ' AIC:', int2str(AIC)));
	hold off
end

AICが最小になるのはクラスタ数が3のときで、正しくクラスタ数を推定できませんでした。また混合正規分布でフィッティングするので正確なデータの分布の形状を再現できていません。

↑

ベータ混合分布モデルのRスクリプト2 3 †

Yuan Ji 先生が betamix_2.R を公開なさっています。*4

そのままでは実行できないのでファイルパスやテストデータなどを一部書き直す必要があります。
- 動くように変更＆日本語注釈（リンク切れ） → 環境研Sさんによる改良版、ありがとうございました！（申し訳ございません、現在リンク切れです）

Ji Y, Wu C, Liu P, Wang J, Coombes KR. Applications of beta-mixture models in bioinformatics. Bioinformatics 21(9):2118-22, 5/2005.

評価はICL-BIC (Biernacki et al., 1998) AICやBICはベータ混合分布モデルのクラスタ数の推定には適切でなく、混合数を過大評価しがちである、というのが本論文の新規な指摘点。以下は文献より引用。

One key issue in the finite mixture model approach is to decide the number of components in the model. This has been extensively studied for the normal-mixture model. The commonly used Akaike information criterion (AIC) Akaike (1973) and Bayesian information criterion (BIC) (Schwarz, 1978) have been shown to be adequate for deciding the number of components in the normal-mixture model (Leroux, 1992; Roeder and Wasserman, 1997). However, little is known about the performance of these criteria in the case of the beta-mixture model. Our finding through both simulations and real examples is that neither criterion is suitable for the beta-mixture model. Both AIC and BIC seem to overestimate the number of components, leading to mixture models with excessive numbers of components. On the other hand, we find that the integrated classification likelihood BIC (ICL BIC) (Biernacki et al., 1998) is adequate, always selects the right model in our simulation studies and yields reasonable results when applied to real data.

実行してみる
- 単峰性入力をクラスタ数1,クラスタ数2で推定。

ニ峰性入力をクラスタ数1,クラスタ数2で推定。
- icl.bicの値が小さいほうがより適切なモデルということ。クラスタ数の推定が正しくできている。分布の歪みにも対応できている。

2012/08/14 ラボスピーチで紹介（リンク切れ）

2012/09/04 入力データそれぞれのクラスタ番号を求める・混合分布の各分布を別々にプロットする方法を、菊池研究室の浅井君がまとめてくれました。どうもありがとう。詳細は浅井君の報告を参照（限定公開です）

↑

Rの混合モデル関連リンク †

*1 これは Houseman, Christensen, Yeh, Marsit, Karagas, Wrensch, Nelson, Wiemels, Zheng, Wiencke, & Kelsey, BMC Bioinformatics, 9(365), 2008. の caseII で作ったテストデータ（Additional filesで公開されているもの）の1列目。
*2 CRANに登録されている RPMM: Recursively Partitioned Mixture Model も検討したのですが（Houseman, et al., 2008. の Additional file 2 にスクリプトの実行例）、RPMMは超多混合クラスタを分類するために階層型クラスタリングを導入する、というもので、DNAデータの解析に特化しているようでうまくいきませんでした（検討の跡（リンク切れ））
*3 CRANに登録されている mixtools: Tools for analyzing finite mixture models も検討したのですがベータ混合分布の推定ができるのかどうかはよくわかっていません（検討の過程（リンク切れ））
*4 RPMM: Recursively Partitioned Mixture Model の論文（Houseman, et al., 2008.）で先行研究として引用されています。この論文によれば、どちらも結果の信頼度は同じで、RPMMの方が計算の効率(?)が良いとのこと。

添付ファイル:

BMM.png 3088件 [詳細]

BetaMM2.png 3295件 [詳細]

BetaMM1.png 3241件 [詳細]

beta_bi_GMM.png 2852件 [詳細]

beta_bi.png 3072件 [詳細]

beta_bi.txt 2784件 [詳細]